SAX (Event Based) XML Parsing

SAX-like XML Event Processing

With the release of XML Tools 2.6, it is possible to implement event-based XML processing using XML Tools. This is done by passing an AppleScript script object to the new SAX handler parameter of the parse XML command. This script object is expected to provide a series of handlers that respond to XML parsing events.

This approach is useful when you want to populate a custom data structure directly from XML data instead of extracting the data from the nested collection of XML element classes normally generated by the parse XML command.

Here is a very simple example illustrating how this works:

~~~~
script EventProcessor
   property elementNames : {}

   on XMLStartElement(elementName, elementAttributes)
      -- called when an XML element begins   
      set end of elementNames to elementName
   end XMLStartElement
end script

set theXML to "<data>
    <test/>
    <test name=\"mark\">
   data in second test element
    </test>
    data in root element
</data>"

set xxx to parse XML theXML SAX handler EventProcessor
xxx's elementNames
--   Result:
--   {"data", "test", "test"}

~~~~

In this example, a copy of the EventProcessor script object is passed to the parse XML command. As parse XML works through the XML data, it calls the EventProcessor’s XMLStartElement handler whenever a new XML element begins. When parsing completes, the EventProcessor object is returned to AppleScript. In this particular case, the XMLStartElement handler records the name of each XML element tag.
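
The same mechanism can populate a richer structure than a flat list of names. The following is a minimal sketch, not taken from the XML Tools documentation: the PathCollector script, its currentPath and elementPaths properties, and the slash-separated path format are hypothetical names chosen for illustration. It pairs XMLStartElement with XMLEndElement to keep a stack of open elements and records the path to each element as it begins.

~~~~
script PathCollector
   property currentPath : {} -- stack of the elements currently open (hypothetical name)
   property elementPaths : {} -- one slash-separated path per element seen (hypothetical name)

   on XMLStartElement(elementName, elementAttributes)
      -- push the new element, then record the path leading to it
      set end of my currentPath to elementName
      set savedDelimiters to AppleScript's text item delimiters
      set AppleScript's text item delimiters to "/"
      set end of my elementPaths to (my currentPath) as text
      set AppleScript's text item delimiters to savedDelimiters
   end XMLStartElement

   on XMLEndElement(elementName)
      -- pop the element that just closed
      if (count of my currentPath) > 1 then
         set my currentPath to items 1 thru -2 of my currentPath
      else
         set my currentPath to {}
      end if
   end XMLEndElement
end script

set theXML to "<data><test/><test name=\"mark\">text</test></data>"
set collector to parse XML theXML SAX handler PathCollector
collector's elementPaths
~~~~

If the handlers are called in document order, as the first example suggests, elementPaths should end up holding one entry for the data element and one for each test element beneath it.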

Here is an example of an event handler object implementing all the handlers that the parse XML command can call. You need only include the handlers for the events you are interested in handling:

~~~~
script AllEventHandlers

   on XMLStartElement(elementName, elementAttributes)
      -- called when a new XML element begins
      display dialog "XMLStartElement: " & elementName & ", Attributes: " & (length of elementAttributes)
   end XMLStartElement

   on XMLEndElement(elementName)
      -- called when an XML element ends
      display dialog "XMLEndElement: " & elementName
   end XMLEndElement

   on XMLCharacterData(xmlData)
      -- called when there is XML data for an element
      display dialog "XMLCharacterData: " & xmlData
   end XMLCharacterData

   on XMLComment(comment)
      -- called when an XML comment is encountered
      -- (parse XML must be called with including comments)
      display dialog "XMLComment: " & comment
   end XMLComment

   on XMLDefaultContent(xmlData)
      -- called for content outside the root element (i.e. XML declaration)
      display dialog "XMLDefaultContent: " & xmlData
   end XMLDefaultContent

   on XMLStartCData()
      -- called at the beginning of an XML CData section
      display dialog "XMLStartCData"
   end XMLStartCData

   on XMLEndCData()
      -- called at the end of an XML CData section
      display dialog "XMLEndCData"
   end XMLEndCData

   on XMLStartNamespace(prefix, uri)
      -- called when a namespace reference begins
      display dialog "XMLStartNamespace: " & prefix & ", URI: " & uri
   end XMLStartNamespace

   on XMLEndNamespace(prefix)
      -- called when a namespace reference ends
      display dialog "XMLEndNamespace: " & prefix
   end XMLEndNamespace

   on XMLProcessingInstruction(target, piData)
      -- called when an XML processing instruction is encountered
      -- (parse XML must be called with including processing instructions)
      display dialog "XMLProcessingInstruction: " & target & ", Data: " & piData
   end XMLProcessingInstruction

   on XMLNotStandalone()
      -- called when XML is not standalone, and there is no DTD. Return true to allow processing to
      -- continue. If this handler is missing, parse XML's strict standalone parameter value is used.
      display dialog "XMLNotStandalone"
      return true -- allow processing to continue
   end XMLNotStandalone

   on XMLStartDocTypeDecl(docTypeName, systemID, publicID, hasInternalSubset)
      -- called at the beginning of a DOCTYPE declaration
      display dialog "XMLStartDocTypeDecl: " & docTypeName & ", systemID: " & systemID & ¬
         ", publicID: " & publicID & ", hasInternalSubset: " & hasInternalSubset
   end XMLStartDocTypeDecl

   on XMLEndDocTypeDecl()
      -- called at the end of a DOCTYPE declaration
      display dialog "XMLEndDocTypeDecl"
   end XMLEndDocTypeDecl

   on XMLExternalEntityRef(context, base, systemID, publicID)
      -- called after an external entity (DTD) has been loaded
      display dialog "XMLExternalEntityRef: " & context & ", base: " & base & ¬
         ", systemID: " & systemID & ", publicID: " & publicID
   end XMLExternalEntityRef

   on XMLUnparsedEntityDecl(entityName, base, systemID, publicID, notationName)
      display dialog "XMLUnparsedEntityDecl: " & entityName & ", base: " & base & ¬
         ", systemID: " & systemID & ", publicID: " & publicID & ", notationName: " & notationName
   end XMLUnparsedEntityDecl

   on XMLNotationDecl(notationName, base, systemID, publicID)
      display dialog "XMLNotationDecl: " & notationName & ", base: " & base & ¬
         ", systemID: " & systemID & ", publicID: " & publicID
   end XMLNotationDecl

   on XMLParseResult(errNumber, errMessage)
      -- if parsing is aborted due to an AppleScript error, errNumber and errMessage describe the error.
      -- Otherwise these parameters contain missing value.

      -- return the data you want parse XML to return. If this method is omitted, the entire script object is returned
      return "some data"
   end XMLParseResult
end script

~~~~

NOTE: Attributes are passed to the XMLStartElement as a record where keys are the attribute name and values are the corresponding attribute value.
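
As a sketch of how such a record might be read, the script below looks up a single attribute by name. It assumes that attribute names arrive as user-defined record labels, so that an attribute called name is addressed as |name|. That assumption, along with the AttributeReader script and its markedNames property, is not confirmed by the text above. The try block covers elements that lack the attribute.

~~~~
script AttributeReader
   property markedNames : {} -- values of the name attribute (hypothetical property)

   on XMLStartElement(elementName, elementAttributes)
      try
         -- assumption: the attribute key is a user-defined label, hence the vertical bars
         set end of my markedNames to |name| of elementAttributes
      on error
         -- this element has no name attribute; nothing to record
      end try
   end XMLStartElement
end script

set reader to parse XML "<data><test/><test name=\"mark\"/></data>" SAX handler AttributeReader
reader's markedNames
~~~~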

NOTE: If an error occurs in one of the XML event handlers, parse XML aborts the parse and reports an error. The error’s partial result contains the value returned by the XMLParseResult handler or, if XMLParseResult is not defined, the script object. You can extract this information using this syntax:

~~~~
script SAXHandler
   property elementNames : {}

   on XMLStartElement(elementName, elementAttributes)
      -- called when an XML element begins
      set end of elementNames to elementName
      error "Error Message from SAXHandler" -- signal an error to abort parsing the rest of the XML stream
   end XMLStartElement

   on XMLParseResult(errNumber, errMessage)
      -- return the data you want parse XML to return.  If this method is omitted, the entire script object is returned
      return elementNames
   end XMLParseResult
end script

try
   set xxx to parse XML "<data>
    <!-- a comment -->
    <test/>
    <test name=\"mark\">
   data in second test element
    </test>
    data in root element
</data>" SAX handler SAXHandler with including processing instructions and including comments
on error errMsg partial result pr
   {errMsg, pr} -- partial result is the data returned by XMLParseResult
end try
--   Result:
--   {
--      "xmlstartelement SAX handler error: Error Message from SAXHandler", 
--      {
--         "data"
--      }
--   }

~~~~

NOTE: Script Debugger’s AppleScript debugger is unable to debug XML event handlers while they are being executed by the parse XML command.

Parameters

SAX handler

(new in v2.6)
script object

When the SAX handler parameter is specified, parse XML switches to a SAX-like event-based mode of parsing where handlers in the script object specified are called in response to events as the XML data is parsed.

When this parameter is omitted, parse XML performs as it has done in the past and returns an XML document class containing a nested data structure representing the content of the parsed XML data.

strict standalone

boolean

Ignored if the event handler object implements the XMLNotStandalone handler.

expanding external entities

boolean

By default, external entity references (e.g. DTDs) are ignored since XML Tools is a non-validating XML parser. When expanding external entities is true, XML Tools uses the Mac OS URL Access facilities to access the externally referenced entity.

If the external entity exists on another machine, you must have an active internet connection.

Supported URL formats: file:///…, http://…, and ftp://…

NOTE: The XMLExternalEntityRef handler is called after the external entity has been loaded.

including comments

boolean

By default, comments in your XML data are ignored. The including comments parameter must be true in order for the event handler’s XMLComment handler to be called.
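
A minimal sketch of collecting comments follows. The CommentCollector script and its commentList property are hypothetical names; the XMLComment handler and the including comments parameter are the ones described above.

~~~~
script CommentCollector
   property commentList : {} -- text of each comment encountered (hypothetical property)

   on XMLComment(comment)
      set end of my commentList to comment
   end XMLComment
end script

set theXML to "<data><!-- first --><item/><!-- second --></data>"
set collector to parse XML theXML SAX handler CommentCollector with including comments
collector's commentList
~~~~

Without the including comments parameter, XMLComment is never called and commentList would be expected to stay empty.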

including processing instructions

boolean

By default, XML processing instructions are ignored. The including processing instructions parameter must be true in order for the event handler’s XMLProcessingInstruction handler to be called.
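
The sketch below gathers each processing instruction into a record. The InstructionCollector script, its instructions property, and the piTarget and piContent record labels are hypothetical; the xml-stylesheet instruction in the sample data is only an illustration.

~~~~
script InstructionCollector
   property instructions : {} -- one record per processing instruction (hypothetical property)

   on XMLProcessingInstruction(target, piData)
      -- keep the target and its data together so they can be inspected later
      set end of my instructions to {piTarget:target, piContent:piData}
   end XMLProcessingInstruction
end script

set theXML to "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?><data/>"
set collector to parse XML theXML SAX handler InstructionCollector with including processing instructions
collector's instructions
~~~~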

serializing

boolean

Ignored.

base path

string

Provides a base URL for all external entity IDs. For example, the following code uses a DTD loaded from http://www.latenightsw.com/dtds/mydtd.dtd:

~~~~
parse XML " " base path "http://www.latenightsw.com/dtds/"
~~~~

preserving whitespace

boolean

By default, the parse XML command strips all leading and trailing whitespace characters and normalizes multiple whitespace characters within a string to a single space.

NOTE: The xml:space="preserve" attribute is honored when preserving whitespace is false.
NOTE: The xml:space="ignore" attribute is not honored when preserving whitespace is true.
NOTE: Whitespace characters in CDATA sections are never stripped.

When preserving whitespace is true, parse XML returns all XML data, including whitespace.

The parse XML command will strip whitespace according to these rules before calling the event handler’s XMLCharacterData handler.
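
To compare the two modes, the sketch below parses the same data twice and returns both results. The makeTextCollector handler, the TextCollector script, and the collectedText property are hypothetical names. The handler appends to its property on the assumption that XMLCharacterData may be called more than once for a single element.

~~~~
on makeTextCollector()
   -- build a fresh collector for each parse so the results do not mix
   script TextCollector
      property collectedText : "" -- character data seen so far (hypothetical property)

      on XMLCharacterData(xmlData)
         -- append each chunk of character data as it arrives
         set my collectedText to (my collectedText) & xmlData
      end XMLCharacterData
   end script
   return TextCollector
end makeTextCollector

set theXML to "<note>   two   words   </note>"

set strippedCollector to makeTextCollector()
set strippedResult to parse XML theXML SAX handler strippedCollector

set preservedCollector to makeTextCollector()
set preservedResult to parse XML theXML SAX handler preservedCollector with preserving whitespace

{collectedText of strippedResult, collectedText of preservedResult}
~~~~

Under the rules described above, the first item should have its leading and trailing whitespace removed and its inner runs of spaces collapsed, while the second should keep the whitespace as written.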

allowing leading whitespace

boolean

The XML specification states that well-formed XML documents have no leading whitespace before the XML declaration. However, for historical reasons, XML Tools allows XML documents to contain leading whitespace. If allowing leading whitespace is false, XML Tools will report an error when whitespace appears at the beginning of an XML document.

NOTE: This only applies to documents that begin with a declaration. If your document does not have an XML declaration, this option is ignored.

seperate namespace URIs

boolean

Ignored.
