XML Parsing

XML Parsing Reference

The parse XML command parses a block of XML data into an AppleScript record hierarchy representing the XML information. The XML data may be read from a file or passed directly to the parse XML command.

The result of the parse XML command is a nested structure of XML element classes. Each XML element class represents an element (tag) in your XML data. Nested XML elements are returned in a list within the enclosing XML element. The root XML element is returned in an XML document class. The XML document class contains additional properties describing root level aspects of your XML data (namespaces used, XML declaration, DOCTYPE declaration, etc.).

Parse XML Syntax

parse XML <xml-data> or <file spec> or <alias> --> XML Document class

Example

This example illustrates what happens when you parse a simple XML document:

set theXML to parse XML "<?xml version=\"1.0\"?>
<data>

   <!-- this is an XML comment which does not appear in the parsed result -->

   <tag1 attName=\"attribute value\">hello</tag1>
   <tag2>
      <a/>
      <b>World</b>
   </tag2>
</data>
"
--   Result:
--   {
--      class:XML document, 
--      XML tag:"data", 
--      XML contents:{
--         {
--            class:XML element, 
--            XML tag:"tag1", 
--            XML attributes:{|attName|:"attribute value"}, 
--            XML contents:{"hello"}
--         }, 
--         {
--            class:XML element, 
--            XML tag:"tag2", 
--            XML contents:{
--               {
--                  class:XML element, 
--                  XML tag:"a"
--               }, 
--               {
--                  class:XML element, 
--                  XML tag:"b", 
--                  XML contents:{"World"}
--               }
--            }
--         }
--      }
--   }

The resulting structure is a nested collection of XML element classes matching the logical structure of the parsed XML document. Note that whitespace around tags is stripped. Note also that the XML contents property is omitted for empty XML elements (e.g. <a/> and <a></a>), so that you can distinguish empty tags from tags whose whitespace-only content was stripped.

To get the text value of the XML element:

item 1 of XML contents of item 1 of XML contents of theXML
--   Result: "hello"

To get the attName attribute of the XML element:

|attName| of XML attributes of item 1 of XML contents of theXML
--   Result: "attribute value"

Note the use of AppleScript’s pipe syntax to specify the attribute property name. This is required because AppleScript converts all identifiers to lowercase internally. Pipe syntax preserves the identifier's case so that it matches the case used in the XML data.
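
As a further sketch of walking the same structure, the following reaches the text of the <b> element nested inside <tag2>. It assumes theXML still holds the result of the parse XML call above and that the elements appear in the order shown in that result; the variable names tag2Element and bElement are illustrative only.

-- <tag2> is the second item of the root element's XML contents
set tag2Element to item 2 of XML contents of theXML
-- <b> is the second item of <tag2>'s XML contents
set bElement to item 2 of XML contents of tag2Element
item 1 of XML contents of bElement
--   Result (going by the structure listed above): "World"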

This example parses the same XML data from a file:

tell application "Finder" to set myFolder to container of (path to me) as string
parse XML alias (myFolder & "data.xml")

The Utility AppleScript Code page gives more sample code showing how to access information from the data structure returned by the parse XML command.

NOTE: XML Tools accepts the following XML encodings: UTF-8, UTF-16, ISO-8859-1, US-ASCII and MacRoman. When parsing XML strings directly, make sure that the string type (plain text, UTF-8 or UTF-16) is compatible with the encoding specified in the header. More information on handling encodings with XML Tools is available here.

Parameters:

SAX handler (new in v2.6)

script object

When the SAX handler parameter is specified, parse XML switches to a SAX-like, event-based mode of parsing in which handlers in the specified script object are called in response to events as the XML data is parsed. This mode of operation is explained on the XML Event-Based Parsing page.

Some of the parse XML parameters are ignored when SAX handler is specified, and the meaning of others may change. Refer to that page for more details.

strict standalone

boolean

When true, the Expat parser reports an error when parsing XML data that is not standalone.

expanding external entities

boolean

By default, external entity references (e.g. DTDs) are ignored since XML Tools is a non-validating XML parser. When expanding external entities is true, XML Tools uses the Mac OS URL Access facilities to access the externally referenced entity.

If the external entity exists on another machine, you must have an active internet connection.

Supported URL formats: file:///…, http://…, and ftp://…

encoding

(new in v2.7)
string

By default, XML Tools determines the encoding to use when processing XML by looking (a) for an encoding in the XML declaration, and (b) for a Unicode byte order mark (BOM). This parameter allows you to override the encoding given in the XML declaration.

Valid encodings are: UTF-8, UTF-16, ISO-8859-1, US-ASCII and MacRoman.
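
A minimal sketch of overriding the declared encoding when parsing a file. The file name data.xml and the myFolder variable are carried over from the earlier file example and are placeholders; the encoding value is one of those listed above.

tell application "Finder" to set myFolder to container of (path to me) as string
-- treat the file as ISO-8859-1 regardless of what its XML declaration says
parse XML alias (myFolder & "data.xml") encoding "ISO-8859-1"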

including comments

boolean

By default, comments in your XML data are ignored. When including comments is true, comments are included in the resulting XML Contents list for each XML Element. Comments are expressed as instances of the XML Comment class.

parse XML "
<data>
   hello
   <!-- a comment -->
   world
</data>" with including comments
--   Result:
--   {
--      class:XML document, 
--      XML tag:"data", 
--      XML contents:{
--         "hello", 
--         {
--            class:XML comment, 
--            XML comment:" a comment "
--         }, 
--         "world"
--      }
--   }

including processing instructions

boolean
default: false

By default, XML processing instructions are ignored. When including processing instructions is true, XML processing instructions are included in the resulting XML Contents list for each XML Element. Processing instructions are expressed as instances of the XML Processing Instruction class.

parse XML "
<data>
   hello
   <?TARGET xxxx?>
   world
</data>" with including processing instructions
--   Result:
--   {
--      class:XML document, 
--      XML tag:"data", 
--      XML contents:{
--         "hello", 
--         {
--            class:XML processing instruction, 
--            XML target:"TARGET", 
--            XML target data:"xxxx"
--         }, 
--         "world"
--      }
--   }
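
The two options can presumably be combined using AppleScript's usual with ... and ... form for boolean parameters. This is a sketch only; the combined result is not shown in this reference, so no output is listed.

-- request both comments and processing instructions in the XML contents lists
parse XML "
<data>
   <!-- a comment -->
   <?TARGET xxxx?>
</data>" with including comments and including processing instructions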

serializing

boolean
default: false

When serializing is true, the parse XML command adds a uniquely valued id property to the XML Attributes record of each XML Element. This is useful if you later move elements around and need to locate a particular element.

parse XML "<?xml version=\"1.0\"?>
<data>
   <tag/>
   <tag/> 
</data>" with serializing
--   Result:
--   {
--      class:XML document, 
--      XML element id:3,  -- added by with serializing
--      XML tag:"data", 
--      XML attributes:{
--         id:3 -- added by with serializing
--      }, 
--      XML contents:{
--         {
--            class:XML element, 
--            XML element id:1, -- added by with serializing
--            XML tag:"tag", 
--            XML attributes:{
--               id:1 -- added by with serializing
--            }
--         }, 
--         {
--            class:XML element, 
--            XML element id:2, -- added by with serializing
--            XML tag:"tag", 
--            XML attributes:{
--               id:2 -- added by with serializing
--            }
--            }
--         }
--      }
--   }
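
One way to use these ids, sketched under the assumption that the XML element id property can be read like any other record property: loop over the children of the root and pick out the element whose id is 2. The variable names serializedXML, candidate and foundElement are illustrative.

set serializedXML to parse XML "<data>
   <tag/>
   <tag/>
</data>" with serializing
set foundElement to missing value
repeat with anElement in XML contents of serializedXML
	-- 'contents of' dereferences the loop variable to the element record itself
	set candidate to contents of anElement
	if XML element id of candidate is 2 then
		set foundElement to candidate
		exit repeat
	end if
end repeat
-- foundElement stays missing value if no child carries the requested id
foundElement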

base path

string

Provides a base URL for all external entity IDs. For example, the following code uses a DTD loaded from http://www.latenightsw.com/dtds/mydtd.dtd.

parse XML "<?xml version=\"1.0\"?>
<!DOCTYPE data SYSTEM \"mydtd.dtd\">
<data>
    <tag/>
</data>" base path "http://www.latenightsw.com/dtds/"

preserving whitespace

boolean

By default, the parse XML command strips all leading and trailing whitespace characters and normalizes multiple whitespace characters within a string to a single space.

NOTE: The xml:space="preserve" attribute is honored when preserving whitespace is false.
NOTE: The xml:space="ignore" attribute is not honored when preserving whitespace is true.
NOTE: Whitespace characters in CDATA sections are never stripped.

When preserving whitespace is true, parse XML returns all XML data, including whitespace.
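
A small sketch contrasting the two modes. No results are listed because they are not shown in this reference; going by the description above, the first call should yield stripped and normalized text and the second the text exactly as written. The variable names are illustrative.

set theSource to "<data>   hello     world   </data>"
-- default: leading and trailing whitespace stripped, inner runs collapsed
set strippedResult to parse XML theSource
-- keep every whitespace character as it appears in the source
set preservedResult to parse XML theSource with preserving whitespace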

allowing leading whitespace

boolean

The XML specification states that well-formed XML documents have no leading whitespace before the XML declaration. However, for historical reasons, XML Tools allows XML documents to contain leading whitespace. If allowing leading whitespace is false, XML Tools reports an error when whitespace appears at the beginning of an XML document.

NOTE: This only applies to documents that begin with a declaration. If your document does not have an XML declaration, this option is ignored.
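
A sketch of rejecting leading whitespace and catching the resulting error. The wording of the error is whatever XML Tools reports and is not specified here; theSource and parseProblem are illustrative names.

-- the declaration is deliberately preceded by spaces
set theSource to "   <?xml version=\"1.0\"?>
<data/>"
set parseProblem to missing value
try
	parse XML theSource without allowing leading whitespace
on error errorMessage
	-- expected to be reached, since the document starts with whitespace before its declaration
	set parseProblem to errorMessage
end try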

including empty elements

(new in v2.6)
boolean

When parsing empty XML elements (e.g. <a/> and <a></a>), the parse XML command returns an XML contents property containing "". When including empty elements is false, the parse XML command does not include an XML contents property for empty XML elements (this is how the parse XML command operated prior to v2.6).
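
A sketch of testing for an empty element when including empty elements is false. It relies on ordinary AppleScript behavior, where reading a property that a record lacks raises an error; the variable names are illustrative.

set parsedXML to parse XML "<data><a/></data>" without including empty elements
set emptyElement to item 1 of XML contents of parsedXML
try
	set elementContents to XML contents of emptyElement
on error
	-- no XML contents property, so the element was empty
	set elementContents to missing value
end try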

separate namespace URIs

boolean

When parsing documents that utilize XML namespaces, parse XML normally returns an element’s tag name and the associated namespace URI as separate properties.

Here is an example of how parse XML returns XML namespace information:

set theXML to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">
  <xsl:template match=\"doc\">
    <out><xsl:value-of select=\".\"/></out>
  </xsl:template>
</xsl:stylesheet>
"
parse XML theXML
--   Result:
--   {
--      class:XML document, 
--      XML namespace prefix:"xsl", 
--      XML namespace uri:"http://www.w3.org/1999/XSL/Transform", 
--      XML tag:"stylesheet", 
--      XML attributes:{|version|:"1.0"}, 
--      XML contents:{
--         {
--            class:XML element, 
--            XML namespace prefix:"xsl", 
--            XML namespace uri:"http://www.w3.org/1999/XSL/Transform", 
--            XML tag:"template", 
--            XML attributes:{match:"doc"}, 
--            XML contents:{
--               {
--                  class:XML element, 
--                  XML tag:"out", 
--                  XML contents:{
--                     {
--                        class:XML element, 
--                        XML namespace prefix:"xsl", 
--                        XML namespace uri:"http://www.w3.org/1999/XSL/Transform", 
--                        XML tag:"value-of", 
--                        XML attributes:{|select|:"."}
--                     }
--                  }
--               }
--            }
--         }
--      }, 
--      XML namespaces:{
--         {
--            class:XML namespace, 
--            XML namespace prefix:"xsl", 
--            XML namespace uri:"http://www.w3.org/1999/XSL/Transform"
--         }
--      }
--   }
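
The namespace details can be read like any other property of the result. A brief sketch, using the illustrative variable name parsedXML and assuming theXML holds the stylesheet source:

set parsedXML to parse XML theXML
-- namespace URI of the root element
XML namespace uri of parsedXML
--   Result (per the listing above): "http://www.w3.org/1999/XSL/Transform"
-- tag name without the prefix
XML tag of parsedXML
--   Result (per the listing above): "stylesheet"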

When separate namespace URIs is false, parse XML returns the namespace URI and element tag name as a single string in the format “URI:TagName”.

parse XML theXML without separate namespace URIs
--   Result:
--   {
--      class:XML document, 
--      XML tag:"http://www.w3.org/1999/XSL/Transform:stylesheet", 
--      XML attributes:{|version|:"1.0"}, 
--      XML contents:{
--         {
--            class:XML element, 
--            XML tag:"http://www.w3.org/1999/XSL/Transform:template", 
--            XML attributes:{match:"doc"}, 
--            XML contents:{
--               {
--                  class:XML element, 
--                  XML tag:"out", 
--                  XML contents:{
--                     {
--                        class:XML element, 
--                        XML tag:"http://www.w3.org/1999/XSL/Transform:value-of", 
--                        XML attributes:{|select|:"."}
--                     }
--                  }
--               }
--            }
--         }
--      }, 
--      XML namespaces:{
--         {
--            class:XML namespace, 
--            XML namespace prefix:"xsl", 
--            XML namespace uri:"http://www.w3.org/1999/XSL/Transform"
--         }
--      }
--   }
