Encodings

XML Tools supports the following XML encodings: UTF-8, UTF-16, ISO-8859-1, US-ASCII and MacRoman.

By default, XML Tools determines the encoding to use when processing XML by looking for (a) an encoding in the XML declaration and (b) a Unicode byte order mark (BOM). The encoding parameter of the parse XML command lets you override the encoding given in the XML declaration.
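
As a minimal sketch of the override, the handler below forces ISO-8859-1 regardless of what the file's XML declaration says. The handler name parseLatin1File is hypothetical, and the exact spelling of the encoding name passed to the encoding parameter ("ISO-8859-1") is an assumption based on the list of supported encodings above.

~~~~
on parseLatin1File(theFile)
   local theXML

   -- Assumption: the encoding name is spelled as in the supported-encodings list
   set theXML to parse XML theFile encoding "ISO-8859-1"

   return theXML
end parseLatin1File
~~~~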

MacRoman XML Files

XML Tools 2.6 and later handle XML encodings differently from previous versions. Prior to version 2.6, XML Tools converted all text to UTF-16 before parsing, using the encoding conversions provided by the Mac OS Apple Event Manager. As a result, all plain text (i.e. anything that was not UTF-8 or UTF-16) was assumed to be MacRoman, and the encoding specified in the XML declaration was ignored.

With XML Tools 2.6 and later, XML data is parsed unconverted and the encoding specified in the XML declaration is honored. Consequently, an unsupported encoding now generates an error.
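
Because an unsupported encoding now raises an error, a script may want to trap it rather than fail outright. The following is a sketch only: tryParseXMLFile and the labels of the record it returns are hypothetical, and the error text and number that XML Tools reports are not documented here, so the handler simply passes them back to the caller.

~~~~
on tryParseXMLFile(theFile)
   try
      -- Succeeds when the declared (or detected) encoding is supported
      return {succeeded:true, parsedXML:(parse XML theFile), errorText:""}
   on error errMsg number errNum
      -- The message and number come from XML Tools; they are not interpreted here
      return {succeeded:false, parsedXML:missing value, errorText:errMsg & " (" & errNum & ")"}
   end try
end tryParseXMLFile
~~~~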

Simulating Pre-XML Tools 2.6 Behavior

To simulate the pre-XML Tools 2.6 behavior, use code like the following, which converts the text to UTF-16 and then overrides the declared XML encoding with UTF-16:

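~~~~
on parseXMLFile(theFile)
   local theData, theXML

   -- Read the file as plain text, then coerce it to Unicode text (UTF-16)
   set theData to (read theFile) as Unicode text
   -- Override any encoding named in the XML declaration
   set theXML to parse XML theData encoding "UTF-16"

   return theXML
end parseXMLFile
~~~~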

Parsing MacRoman XML

Prior to XML Tools 2.6, files containing MacRoman high-bit characters parsed correctly because they were implicitly converted to UTF-16. With XML Tools 2.6 and later, you must explicitly tell XML Tools to use MacRoman encoding. This can be done by using an XML declaration like this:

~~~~
<?xml version="1.0" encoding="MacRoman"?>
~~~~

or you can use the encoding parameter of the parse XML command to force the use of MacRoman encoding:

~~~~
on parseXMLFile(theFile)
   local theXML

   set theXML to parse XML theFile encoding "MacRoman"

   return theXML
end parseXMLFile
~~~~

XML Tools 2.6 and later default to UTF-8 (versions prior to 2.6 defaulted to MacRoman), so you must specify the MacRoman encoding explicitly in order to parse high-bit MacRoman characters successfully.

AppleScript And Unicode

AppleScript handles all Unicode data as UTF-16. If you read UTF-8 text into a variable, AppleScript converts it to UTF-16. This matters when parsing Unicode strings with XML Tools, because AppleScript always presents the data to XML Tools as UTF-16, even if it started out as UTF-8.
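
To illustrate, here is a hedged sketch that reads a UTF-8 file into an AppleScript variable and then parses it. Because AppleScript hands the text to XML Tools as UTF-16, the sketch passes encoding "UTF-16" so that a declaration naming UTF-8 does not mislead the parser. The handler name parseUTF8File is hypothetical, and whether the override is strictly required in this case is an assumption drawn from the behavior described above, not something the text states.

~~~~
on parseUTF8File(theFile)
   local theData, theXML

   -- Read the file's bytes as UTF-8; AppleScript holds the result as UTF-16
   set theData to read theFile as «class utf8»
   -- Assumption: override a declared UTF-8 encoding, since the data is now UTF-16
   set theXML to parse XML theData encoding "UTF-16"

   return theXML
end parseUTF8File
~~~~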
