Unicode Text

Like the Macintosh itself, the AppleScript string class has long been bedeviled by the existence of text encodings representing characters outside its own native encoding, which is MacRoman. With the coming of Mac OS X, this problem is essentially solved at system level: text is now Unicode. Unicode expresses tens of thousands of characters in a single massive encoding, and in its fullest form will express about a million characters, embracing every character of every written language in history. Unfortunately, AppleScript predates Mac OS X, and the string class is still its primary text class. Over the years, various secondary classes have been fudged into AppleScript in an attempt to increase a string's representational power and to improve AppleScript's compatibility with text in the world around it. At the moment, the most important of these is the Unicode text class, which uses the UTF-16 encoding.

Text supplied by the system is often Unicode text rather than a string. For example:

tell application "Finder" to set x to (get name of disk 1)
class of x -- Unicode text

Similarly, some Mac OS X-native applications, such as TextEdit, return text values as Unicode text.

The trouble is that Unicode text remains very much a second-class citizen within AppleScript. Perhaps someday all AppleScript text will be Unicode text, but that day has not yet come. A literal string (the stuff between quotes in your code) is still a string, not Unicode text. Thus, you can't even enter a Unicode string directly; you can try, but non-MacRoman characters are lost at compile time. AppleScript's supplied string manipulation commands, such as the scripting addition command ASCII character, don't work outside the MacRoman range. The character string element knows nothing of composed characters. Unicode text display (in a result, for example) isn't particularly good either; many non-MacRoman characters are not displayed properly. Unicode text communication between a script and a Unicode-savvy application works, but problems can arise.
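The point about composed characters can be illustrated outside AppleScript. The following is a minimal sketch in Python (chosen only because it is easy to run; it does not reproduce AppleScript's own behavior, and the variable names are hypothetical). It shows that a capital alpha followed by two combining marks is three code points, even though a reader sees a single accented letter, and that Unicode normalization can fold the sequence into one precomposed code point.

```python
import unicodedata

# Capital alpha followed by two combining marks
# (U+0313 smooth breathing, U+0301 acute accent).
decomposed = "\u0391\u0313\u0301"

# NFC normalization folds the sequence into a single precomposed
# code point where Unicode defines one.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))      # number of code points before composition
print(len(composed))        # number of code points after composition
print(hex(ord(composed)))   # the precomposed code point
```

Code that counts "characters" by counting code points, as this sketch does before normalizing, will report three where a reader sees one.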

Then there's the business of how a Unicode text value will interact with a string value, or with a command that expects a string. The good news is that in Tiger such interaction is much improved over previous versions of AppleScript. Whatever you can do to a string, you can do to Unicode text. If you get an element of a Unicode text value, the result is Unicode text. If you concatenate Unicode text and a string, the result is Unicode text (in earlier versions of AppleScript this was not true, which was a big source of trouble). You can explicitly coerce between a string and Unicode text; AppleScript also implicitly coerces for you as appropriate. And scripting addition commands have now mostly been revised to accept Unicode text parameters.

As I mentioned earlier, you can't type a non-MacRoman literal directly. This section provides some workarounds, all of them more or less horrible.

Behind the scenes, a Unicode text string is a 'utxt' resource consisting of a stream of UTF-16 bytes, two per character in the examples here. This suggests that you can form such a resource directly as raw data, writing the bytes in hex (see "Data," earlier in this chapter), and coerce it to Unicode text. For example:

set n to «data utxt03910313030103BB03BA03B703C303C403B903C2»
set n to n as Unicode text
tell application "Finder"
    set name of folder "Mannie" to n
end tell

(To enter guillemets on a U.S. keyboard layout, type Option-\ and Option-Shift-\.) The result is shown in Figure 13-1. We've successfully given a folder the name of an Ancient Greek tragedy, creating that name ex nihilo in AppleScript.
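If you want to check where the hex digits in such a «data utxt…» literal come from, here is a minimal sketch in Python rather than AppleScript (a choice of convenience; any language with a UTF-16 encoder would do). The names CODE_POINTS and utxt_hex are hypothetical, and the sketch assumes big-endian byte order, which is how the bytes are laid out in the example above.

```python
# Hypothetical helper, not part of AppleScript: build the hex payload
# for a «data utxt…» literal from a list of Unicode code points.
CODE_POINTS = [0x0391, 0x0313, 0x0301, 0x03BB, 0x03BA,
               0x03B7, 0x03C3, 0x03C4, 0x03B9, 0x03C2]

def utxt_hex(code_points):
    """Return the big-endian UTF-16 bytes of the code points as uppercase hex."""
    text = "".join(chr(cp) for cp in code_points)
    return text.encode("utf-16-be").hex().upper()

print(utxt_hex(CODE_POINTS))
```

The printed string can be pasted after «data utxt to form the literal; for these ten code points it is the same run of hex digits used in the example.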

Another approach is to write the data out to a file and read it back in, which works because AppleScript gives you more ways to treat file data than it gives you to treat text data. Here's an example (on reading and writing files, see Chapter 21). We start with a decimal representation of the same values as in the previous example, one integer per two-byte UTF-16 unit (913 is hex 0391, and so on); we write these to a file as two-byte integers:

set L to {913, 787, 769, 955, 954, 951, 963, 964, 953, 962}
set s to (path to desktop as string) & "tempFile"
set f to a reference to file s
open for access f with write permission
repeat with aChar in L
    write aChar to f as small integer -- two bytes per character
end repeat
close access f

If we were to open this file as UTF-16 in a word processor, we would see that we've successfully written out the desired string (Figure 13-2).
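To convince yourself that the decimal list and the earlier hex data describe the same bytes, here is a minimal sketch in Python (not AppleScript; the names VALUES and pack_two_byte are hypothetical). It packs each value as a two-byte integer and decodes the result as UTF-16. It assumes big-endian byte order, and it says nothing about the byte order that AppleScript's write command uses on any particular machine.

```python
import struct

# The same ten values as in the AppleScript list L, in decimal.
VALUES = [913, 787, 769, 955, 954, 951, 963, 964, 953, 962]

def pack_two_byte(values):
    """Pack each value as one unsigned two-byte big-endian integer."""
    return b"".join(struct.pack(">H", v) for v in values)

raw = pack_two_byte(VALUES)
decoded = raw.decode("utf-16-be")

print(len(raw))                         # two bytes per value
print(raw.hex().upper())                # the bytes as hex digits
print([hex(ord(ch)) for ch in decoded]) # code points recovered by decoding
```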

We can obtain this string by reading the file back into AppleScript as Unicode text:

set s to (path to desktop as string) & "tempFile"
set f to a reference to file s
open for access f
set s to read f as Unicode text
close access f

After that, s is the desired Unicode text. There is also support for exchanging data with a file as UTF-8, but there is no internal support for AppleScript text in UTF-8 encoding, so you have to express this as «class utf8», and if you read text as UTF-8, it is converted to UTF-16.
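The relationship between the two encodings can be sketched outside AppleScript. The following Python fragment (hypothetical names; an illustration of the encodings only, not of how AppleScript performs the conversion) encodes the same ten code points as UTF-8, as they might sit in a file, and then converts those bytes to big-endian UTF-16.

```python
# Hypothetical round trip outside AppleScript: UTF-8 bytes in, UTF-16 bytes out.
CODE_POINTS = [0x0391, 0x0313, 0x0301, 0x03BB, 0x03BA,
               0x03B7, 0x03C3, 0x03C4, 0x03B9, 0x03C2]

def utf8_to_utf16(utf8_bytes):
    """Decode UTF-8 bytes and re-encode the text as big-endian UTF-16."""
    return utf8_bytes.decode("utf-8").encode("utf-16-be")

text = "".join(chr(cp) for cp in CODE_POINTS)
as_utf8 = text.encode("utf-8")      # what a UTF-8 file would contain
as_utf16 = utf8_to_utf16(as_utf8)   # the same text as UTF-16 bytes

print(as_utf8.hex().upper())
print(as_utf16.hex().upper())
```

The two hex strings differ byte for byte, but both describe the same sequence of code points, which is why text read as UTF-8 can end up as ordinary UTF-16 Unicode text.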

Still another approach is to talk to the shell. The do shell script scripting addition command returns Unicode text by default, so if you can get a Unix scripting language, such as Perl, to construct the string for you, you can obtain it. So:

set p to "print pack(\"U10\", 0x0391, 0x0313, 0x0301, 0x03BB, 0x03BA, " & ¬
    "0x03B7, 0x03C3, 0x03C4, 0x03B9, 0x03C2);"
set s to do shell script "perl -e " & quoted form of p

After that, s is that same Unicode text. No doubt there's a better way to do this (there's always a better way to do things in Perl), but you get the idea.

Various older text classes, fudged into AppleScript (as I mentioned before) to grapple with the problem of encodings, are still around. These are generally to be avoided nowadays, though they can crop up occasionally.

For example, the international text class was a way of representing text in accordance with a particular language and script (where "script" means a writing system); each language-script combination had its own rules (an encoding) for how particular sequences of bytes were mapped to characters (glyphs). The mess created by this multiplicity of encodings is the reason why Unicode is a Good Thing.

The styled text class is another case in point. A style is an attribute of text, such as its font and size, whether it's underlined, that sort of thing. AppleScript defines a styled text class, but you can't manipulate it in any interesting way; in fact, you can barely even detect that it exists, because if you happen to encounter one and ask for its class, you're told it's a string. Nor is it used very much for representing style information; most applications that provide scriptable text styling use a more sophisticated class that lets you access and manipulate styles. Nevertheless, you might encounter styled text from time to time, such as when retrieving text data from the clipboard. You can detect that this has happened by coercing the text to a record:

tell application "Finder"
    activate
    set x to (the clipboard)
end tell
x as record
-- {«class ktxt»:"test", «class ksty»:«data styl000100000000000D000A00100000000C000000000000»}

As you can see, the string is actually made up of text information and style information. But the text information is all that AppleScript is normally willing to show you.

The style resource can be used (perhaps one should say "misused") as a way of carrying encoding information, by associating a font with an encoding. When you coerce an alias to a string, for example, AppleScript actually returns styled text, on the off chance (I suppose) that if the pathname contains any characters outside the MacRoman encoding, the extra encoding information in the style resource can help represent them. But AppleScript conceals from you the fact that it's doing this:

class of ((path to desktop) as string) -- string

Despite AppleScript's answer here, the result is really styled text. As before, we can detect this fact by coercing to a record:

(path to desktop) as string
result as record
-- {«class ktxt»:"feathers:Users:mattneub:Desktop:", «class ksty»:«data styl0001000000000010000E00030000000C000000000000»}

Similarly, any Unicode text coerced to a string is secretly coerced to styled text:

"howdy" as Unicode text as string as record
-- {«class ktxt»:"howdy", «class ksty»:«data styl0001000000000010000E00030000000C000000000000»}

To make things even more complicated, international text and styled text sometimes give the impression of being interchangeable. For example:

get name of application "Tex-Edit Plus"

According to Tex-Edit Plus's dictionary, the result should be international text, but in fact it is styled text. All this is fairly mystifying, and the undeniable impression is that AppleScript's text handling is messy and that AppleScript is trying to conceal the mess (mostly by sweeping it under the carpet).