Chapter 5. String and Text Processing

Someone will call Something will fall And smash on the floor Without reading the text Know what comes next Seen it before And it’s painful Things must change We must rearrange them Or we’ll have to estrange them All that I’m saying The game’s not worth playing Over and over again

Users who come to Mathematica for its superior mathematical capabilities are pleasantly surprised to find strong abilities in programming areas outside of mathematics proper. This is certainly true in the area of textual and string processing. Mathematica’s rich library of functions for string and structured text manipulation rivals Java, Perl, or any other modern language you can tie a string around.

The sections in this introduction provide information on some of the basic tools of strings and string manipulation.

Mathematica uses Unicode internally, but externally (e.g., when saving a notebook) it uses ASCII codes, encoding non-ASCII characters in a special form.

For example, lowercase Greek letters and other non-ASCII characters are encoded using backslash-bracketed character names (\[name]).

In[1]:=  alpha = "α"
Out[1]=  α

The function ToString will translate strings using different encoding schemes.

Characters and Character Encodings

The default character encoding used by Mathematica is stored in $CharacterEncoding, and the native character encoding of the underlying operating system Mathematica is running is stored in $SystemCharacterEncoding. All available encodings are stored in $CharacterEncodings.

    In[3]:=  $CharacterEncoding
    Out[3]=  UTF-8

    In[4]:=  $SystemCharacterEncoding
    Out[4]=  UTF-8

    In[5]:=  Partition[$CharacterEncodings, 4] // TableForm
Out[5]//TableForm=
             AdobeStandard      ASCII                       CP936                       CP949
             CP950              Custom                      EUC-JP                      EUC
             IBM-850            ISO10646-1                  ISO8859-10                  ISO8859-11
             ISO8859-13         ISO8859-14                  ISO8859-15                  ISO8859-16
             ISO8859-1          ISO8859-2                   ISO8859-3                   ISO8859-4
             ISO8859-5          ISO8859-6                   ISO8859-7                   ISO8859-8
             ISO8859-9          ISOLatin1                   ISOLatin2                   ISOLatin3
             ISOLatin4          ISOLatinCyrillic            Klingon                     koi8-r
             MacintoshArabic    MacintoshChineseSimplified  MacintoshChineseTraditional MacintoshCroatian
             MacintoshCyrillic  MacintoshGreek              MacintoshHebrew             MacintoshIcelandic
             MacintoshKorean    MacintoshNonCyrillicSlavic  MacintoshRomanian           MacintoshRoman
             MacintoshThai      MacintoshTurkish            MacintoshUkrainian          Math1
             Math2              Math3                       Math4                       Math5
             Mathematica1       Mathematica2                Mathematica3                Mathematica4
             Mathematica5       Mathematica6                Mathematica7                PrintableASCII
             ShiftJIS           Symbol                      Unicode                     UTF8
             WindowsANSI        WindowsBaltic               WindowsCyrillic             WindowsEastEurope
             WindowsGreek       WindowsThai                 WindowsTurkish              ZapfDingbats

Notice how UTF-8 needs two bytes to display alpha.

Characters and Character Encodings

ToCharacterCode gives the numerical representation.

Characters and Character Encodings

You can map from character codes back to characters using FromCharacterCode[].

In[8]:=  FromCharacterCode[{87, 88, 89, 90}]
Out[8]=  WXYZ

The mapping may not be one-to-one for certain encodings.

In[9]:=  FromCharacterCode[{206, 177}, "UTF8"]
Out[9]=  α

A great deal of Mathematica’s prowess in text processing comes from its rich support for pattern matching. There are two basic classes of string patterns: string expressions and regular expressions. Introduced in version 5.1, each has a similar expressive power. The advantage of StringExpression is that it is less cryptic because it uses more words than symbols to express patterns. The advantage of RegularExpression is that it is more standardized with other languages such as Perl, Ruby, Java, and so on. Non-Mathematica programmers, especially those with a background in Unix, are more likely to understand regular expressions, although these expressions are cryptic to the uninitiated. You should become familiar with both if you plan to do much string manipulation. If you program frequently in languages outside of Mathematica, master the regular expression syntax. If you work strictly in Mathematica, choose the one that most appeals to you. If you learn the string expression syntax, you will have a leg up on learning Mathematica’s more general pattern-matching syntax, which is used in many contexts outside text processing. You can also mix string expressions and regular expressions into compound patterns.

StringExpressions are mostly written using the infix operator ~~, which is a syntactic shortcut for the StringExpression[] function. StringExpression uses Mathematica’s blanks notation (e.g., _, __, and ___) to represent wild cards. See Chapter 4 for more on blanks.

Match "xy" followed by any character.

In[10]:=  "xy" ~~ _;

In[11]:=  StringMatchQ["xyz" , "xy" ~~ _]
Out[11]=  True

In[12]:=  StringMatchQ["xyzz" , "xy" ~~ _]
Out[12]=  False

Match "xy" followed by one or more characters.

In[13]:=  "xy" ~~ __;

In[14]:=  StringMatchQ["xyzz" , "xy" ~~ __]
Out[14]=  True

In[15]:=  StringMatchQ["xy" , "xy" ~~ __]
Out[15]=  False

Match "xy" followed by zero or more characters.

In[16]:=  "xy" ~~ ___;

In[17]:=  StringMatchQ["xyz" , "xy" ~~ ___]
Out[17]=  True

In[18]:=  StringMatchQ["xy" , "xy" ~~ ___]
Out[18]=  True

Patterns can be associated with variables so that the matching portion can be referred to in a subsequent expression. For example, the following pattern will match if the string begins and ends with the same sequence of characters.

String expressions

Table 5-1 shows some of the common raw ingredients for string expressions. If you have already read Chapter 4 on pattern matching, you can see that all the same constructs are available for strings. The full set of string expression primitives can be found in tutorial/WorkingWithStringPatterns.

Table 5-2 shows some of the common raw ingredients for regular expressions. The full set of regular expression primitives can be found in tutorial/WorkingWithStringPatterns. Here c or c n , where n is a number, is a placeholder for an arbitrary character, and p n is a placeholder for an arbitrary regular expression.

Sometimes you know exactly where the characters are that you want to remove. In that case, StringDrop[] is a lot more efficient. StringDrop[] takes the string and a second argument, which can be an offset from the front, an offset from the end, specific positions, or a range of positions.

Consider:

In[41]:= myString = "abcdefghijklmnop" ;

Here you drop the first three characters.

In[42]:= StringDrop[myString, 3]
Out[42]= defghijklmnop

Alternatively, you drop the last three characters, like so.

In[43]:= StringDrop[myString, -3]
Out[43]= abcdefghijklm

Drop only the third character, like this.

In[44]:= StringDrop[myString, {3}]
Out[44]= abdefghijklmnop

And drop the third through fifth ("cde"), using a range list.

In[45]:= StringDrop[myString, {3, 5}]
Out[45]= abfghijklmnop

The step size in the range can even be greater than one by specifying it as the third element. Here you specify a step size of two to remove every other character. The -1 upper limit is a convenient way to specify the end of the string without having to know its length.

In[46]:= StringDrop[myString, {l, -1, 2}]
Out[46]= bdfhjlnp

You can also act on several strings at once.

In[47]:= otherString = "1234567890";

In[48]:= StringDrop[{myString, otherString}, {3, 5}]
Out[48]= {abfghijklmnop, 1267890}

The positional form for replacement is called StringReplacePart[], and it works using similar conventions for specifying positions. The difference is that you must always provide a contiguous range or a list of such ranges.

In[49]:= StringReplacePart[myString, "ZZZ", {3, 5}]
Out[49]= abZZZfghijklmnop

In[50]:= StringReplacePart[myString, "ZZZ", {{3, 5}, {10, 15}}]
Out[50]= abZZZfghiZZZp

Each range can also have its own replacement string.

In[51]:= StringReplacePart[myString, {"ZZZ", "WWW"}, {{3, 5}, {10, 15}}]
Out[51]= abZZZfghiWWWp

See 2.4 Mapping Multiple Functions in a Single Pass for use of StringPosition[], which returns sequence specification that can be fed into StringReplacePart[] and StringDrop[].

See 2.8 Defining Indexed Functions and 2.9 Understanding the Use of Fold As an Alternative to Recursion for more sophisticated forms of XML processing.

Sometimes you know exactly where the characters are that you want to remove. In that case, StringTake[] is a lot more efficient. StringTake[] takes the string and a second argument, which can be an offset from the front, an offset from the end, specific positions, or a range of positions.

Consider:

In[66]:= myString = "abcdefghijklmnop";

Here you take the first three characters.

In[67]:= StringTake[myString, 3]
Out[67]= abc

Alternatively, you take the last three characters, like so.

In[68]:= StringTake[myString, -3]
Out[68]= nop

Take only the third character, like this.

In[69]:= StringTake[myString, {3}]
Out[69]= c

And take the third through fifth ("cde") using a range list.

In[70]:= StringTake[myString, {3, 5}]
Out[70]= cde

The step size in the range can even be greater than one by specifying it as the third element. Here you specify a step size of two to take every other character. The -1 upper limit is a convenient way to specify the end of the string without having to know its length.

In[71]:= StringTake[myString, {1, -1, 2}]
Out[71]= aeegikmo

Conveniently, you can also act on several strings at once.

In[72]:= otherString = "1234567890";

In[73]:= StringTake[{myString, otherstring}, {3, 5}]
Out[73]= {cde, 345}

If you have read 5.2 Removing and Replacing Characters from Strings, you see that StringTake has very similar parameter variations as StringDrop[]. However, StringTake has an additional feature: it can take a list of position specifications and produce a list of the resulting extracts.

In[74]:= StringTake[myString, {{1}, {3}, {8, 10}}]
Out[74]= {a, c, hij}

This is useful for picking multiple segments from a string in one step. However, if you want a string rather than a list, simply wrap the expression in a StringJoin[].

In[75]:= StringJoin[StringTake[myString, {{1}, {3}, {8, lO}}]]
Out[75]= aehij

This is a simple recipe, and I include it because it’s something you expect to be bundled as a native function, but it’s not. For most practical applications, the solution is fine, but for very large n, a doubling approach will have better performance. Rather than doing the math to get the exact string size, we simply truncate the closest sized string obtained from pure doubling of the seed.

In[84]:=  stringDup2[seed_, n_] :=
           StringTake[Nest[# <> # &, seed, Ceiling[Log[2, n]]], n*StringLength[seed]]

In[85]:=  Mean[Table[Timing[stringDup["-", 100 000]][[1]], {10}]]
Out[85]=  0.0486878

In[86]:=  Mean[Table[Timing[stringDup2["-", 100 000]][[1]], {10}]]
Out[86]=  0.0031014

This solution may not be obvious, so let’s break it down. It should be clear that mapping the function #<>#& to a list containing a string will double that string (recall that <> is string concatenation).

In[87]:= # <> #& /@ {"_"}
Out[87]= {--}

It follows that doing this twice will quadruple it.

In[88]:= # <> #&/@ (# <> #& /@ {"_";})
Out[88]= {----)

Repeating this process m times will create a string of length 2^m. However, the input is the desired length n, not the number of doublings, so we know we need at least Ceiling[Log[2, n]] doublings; by using Nest with this number, we get exactly that. However, this overshoots the desired length in most cases, because we rarely expect n to be an exact power of 2. So we use Take to extract the correct length. The reason this is fast for large n is that it reduces a 0(n) operation in terms of Table to a 0(log n) operation using StringJoin.

You can bundle these versions together into one function that gives good performance across all sizes.

In[89]:=  Clear[stringDup];
          stringDup[seed_String, n_Integer /; n >=2^12] :=
           StringTake[Nest[# <> # &, seed, Ceiling[Log[2, n]]], n]
          stringDup[seed_String, n_Integer: 2] :=  StringJoin@[rray[seed &, n]

StringMatchQ[] and StringFreeQ[] very often find application in restricting inputs to functions.

In[97]:=  classify [word_String /; StringMatchQ[word,{"I", "me", "we",
                "you", "they", "him", "her", "it"}]] := pronoun[word]
          classify[word_String /; StringMatchQ[word, {"and", "or", "nor",
                "after", "although", "as", "because", "before", "how", "if",
                 "once", "since", "than", "that", "though", "till", "until",
                 "when", "where", "whether", "while"}]] := conjunction[word]
          classify[word_String /; StringMatchQ[word, DatePattern[{"DayName"}]]]  :=
           dayofweek[word]
          classify[word_String /; StringMatchQ[word, DatePattern[{"MonthName"}]]]  :=
           month[word]
          (*...*)
          classify[word_String]  := other[word] ;

You can also use them as input to other functions, like Pick[] in the following grep implementation adapted from an example in Mathematica documentation. Recall that in the standard Unix grep, option -v instructs grep to return lines that don’t match the pattern. Here Transpose and Range are used to number each line so the result contains a list of pairs {line, match text}. This grep function was implemented in terms of StringFreeQ rather than StringMatchQ since the latter only succeeds if the entire string matches.

  In[102]:=  grep[file_, patt_, "-v"]  := grepImpl[file, patt, True ]
             grep[file_, patt_] := grepImpl[file, patt, False]
             grepImpl[file_, patt_, value_] := With[{data =Import[file, "Lines"]},
               Pick[Transpose[{Range[Length[data]], data=],
                StringFreeQ[data, RegularExpression[patt]], value ]]

  In[105]:=  grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"=], "bar"}] //
              TableForm
Out[105]//TableForm=
             1 bar
             4 foo bar
             5 foobar
             6 barfo

  In[106]:=  grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"=}], "bar$"]
  Out[106]=  {{1, bar}, {4, foo bar}, {5, foobar}}

  In[107]:=  grep[FileNameJoin[{NotebookDirectory[], "greptest.txt"}], "bar", "-v"]
  Out[107]=  {{2, foo}, {3, baz}, {7, fo o}}

Both StringMatchQ[] and StringFreeQ[] support the IgnoreCase → True option. StringMatchQ also supports option SpellingCorrection → True, which allows the match to succeed even if a small number of characters are wrong. However, in many cases a small number can mean only 1, as the following example demonstrates, so I would not rely too heavily on this "feature."

Discussion

The output of StringPosition[] can be used as the input to StringTake.

In[110]:=  With[{str ="1234abcd54321"},
            StringTake[str, StringPosition[str, NumberString]]]
Out[110]=  {1234, 234, 34, 4, 54321, 4321, 321, 21, 1}

If you want to use it with StringDrop[], you need to map StringDrop[] over the list returned by StringPosition[]. This will give you a list with each matching segment dropped. More than likely, you will want to set Overlaps → False in this case. Try Overlaps → True with the expression given below to see why it is undesirable.

Discussion

As of version 6, Mathematica comes bundled with many useful data sources. One of these sources is an integrated English language dictionary (dictionaries for other languages can be installed).

Look up words that begin with th and end with y.

In[116]:=  DictionaryLookup["th" ~~ ___ ~~"y"]
Out[116]=  {thankfully, thanklessly, theatricality, theatrically,
            thematically, theocracy, theologically, theology, theoretically,
            theory, theosophy, therapeutically, therapy, thereby, thermally,
            thermodynamically, thermostatically, they, thickly, thievery,
            thingummy, thingy, thinly, thirdly, thirstily, thirsty, thirty, thorny,
            thoroughly, thoughtfully, thoughtlessly, thready, threateningly,
            threepenny, threnody, thriftily, thrifty, thrillingly, throatily,
            throaty, throwaway, thruway, thuggery, thunderously, thundery, thy}

Look up words that end in ee.

In[117]:=  DictionaryLookup[___  ~~  "ee"]
Out[117]=  {absentee, addressee, agree, Aimee, Albee, amputee, apogee, appointee,
            Ashlee, attendee, Attlee, axletree, banshee, bee, bootee, bumblebee,
            bungee, carefree, Chattahoochee, Cherokee, chickadee, chimpanzee,
            coffee, committee, conferee, consignee, coulee, Cree, debauchee, decree,
            Dee, degree, deportee, Desiree, detainee, devotee, disagree, divorcee,
            draftee, Dundee, dungaree, Elysee, emcee, employee, enlistee, entree,
            epee, escapee, evacuee, fat-free, fee, fiancee, filigree, flee, foresee,
            franchisee, free, fricassee, Frisbee, fusee, Galilee, garnishee, gee, ghee,
            glee, goatee, grandee, grantee, guarantee, gumtree, honeybee, honoree,
            Humvee, inductee, internee, interviewee, invitee, jamboree, Jaycee,
            jubilee, kedgeree, Klee, knee, lee, Lee, legatee, Legree, lessee, levee,
            licensee, manatee, marquee, matinee, McGee, McKee, melee, Menominee,
            Milwaukee, mortgagee, Murrumbidgee, Muskogee, nee, negligee, nominee,
            Okeechobee, Okefenokee, oversee, parolee, Pawnee, payee, pedigree, pee,
            peewee, Pelee, perigee, pewee, pharisee, Pharisee, pongee, prithee,
            protegee, puree, puttee, quadtree, ranee, referee, refugee, Renee,
            repartee, retiree, returnee, Rhee, rupee, Sadducee, scree, see, settee,
            Shawnee, Sheree, shoetree, singletree, sirree, Slurpee, soiree, spree,
            squeegee, standee, subcommittee, subtree, suttee, Suwanee, Swanee,
            Tallahassee, tee, Tennessee, tepee, thee, three, toffee, toll-free, topee,
            toupee, towhee, townee, Toynbee, trainee, transferee, tree, trochee,
            Truckee, trustee, Tuskegee, twee, Tweedledee, Tyree, wannabee, wee, whee,
            whiffletree, whippletree, whoopee, Yahtzee, Yankee, yippee, Zebedee}

There are a lot of neat applications for an integrated dictionary.

By using all the words in the dictionary with Nearest[], you can create a rudimentary spell-checker. For our first attempt, we’ll use Nearest's default distance function. We’ll return a list for which the first element is True or False depending on the word’s inclusion in the dictionary and the second element is a list of potential correct spellings.

In[128]:=  
             (*Returns a function that, when applied to <word> and an integer <n>, 
           returns a list containing the n words in the integrated dictionary considered
           to be closest to <word>*)

           nf1 = Nearest[DictionaryLookup[]];

           SpellCheck1[word_]  := Module[{corrections = nf1[word, 15]} ,
             If[ MemberQ[ corrections, word], {True, word}, {False, corrections}]]
In[130]:=  SpellCheck1["pickel"]
Out[130]=  {False, {nickel, picked, picker, picket, bicker, dicker, dickey,
             hickey, kicked, kicker, licked, Michel, mickey, Mickey, nicked}}

We see that the default distance function used for strings (EditDistance) does not make the greatest spell-checker: the obvious suggestion of pickle is not among the first 15 nearest words. You can experiment with other distance functions. Here is one that penalizes more heavily for mistakes in consonants than for mistakes in vowels.

Spell-checker

Here we test on some commonly misspelled words (according to the Oxford Dictionaries website: http://bit.ly/KuIQ2 ).

In[135]:=  SpellCheck2["accomodate"]
Out[135]=  {False, {accommodate, accommodated, accommodates, accumulate, accelerate,
             accentuate, acclimate, accolade, accommodation, accordant}}

In[136]:=  SpellCheck2["alcahol"]
Out[136]=  {False, {alcohol, alcohols, alcoholic,
             achoo, ahchoo, algal, anchor, carol, lethal, local}}

In[137]:=  SpellCheck2["mispell"]
Out[137]=  {False, {misspell, Aspell, Ispell, miscall,
             respell, spell, dispel, dispels, misdeal, misplay}}

This returns useful results, but performance (speed) is poor.

In[138]:=  SpellCheck2["pickel"]  // Timing
Out[138]=  {2.22533, {False, {nickel, picked, picker,
              picket, pickle, packed, packer, packet, pecked, pick}}}

We can improve the speed using a divide-and-conquer approach: pick a large but manageable number (e.g., 100) of nearest words according to simple EditDistance, and then do a second pass on the smaller set with the EditDistance sans vowels. We define a distance function called ConsonantDistance[] for the second pass.

Spell-checker

Good results and about 43 times faster!

Mathematica also provides WordData[], which returns information about properties of words, such as parts of speech and definitions.

In[142]:=  WordData["run"]
Out[142]=  {{run, Noun, Score}, {run, Noun, Travel}, {run, Noun, RegularTrip},
            {run, Noun, ShortTrip}, {run, Noun, FootballPlay}, {run, Noun, Endeavor},
            {run, Noun, Successiveness}, {run, Noun, Flow}, {run, Noun, Damage},
            {run, Noun, Footrace}, {run, Noun, Campaign}, {run, Noun, Streak},
            {run, Noun, Stream}, {run, Noun, IndefiniteQuantity},
            {run, Noun, Liberty}, {run, Noun, TimePeriod}, {run, Verb, Disintegrate},
            {run, Verb, SplitUp}, {run, Verb, Dissolve}, {run, Verb, Treat},
            {run, Verb, Change}, {run, Verb, Get}, {run, Verb, Vie}, {run, Verb, Race},
            {run, Verb, Catch}, {run, Verb, Draw}, {run, Verb, Operate},
            {run, Verb, Function}, {run, Verb, CarryThrough}, {run, Verb, Play},
            {run, Verb, Circularize}, {run, Verb, Trip}, {run, Verb, GoThrough},
            {run, Verb, Hurry}, {run, Verb, TravelRapidly}, {run, Verb, Sport},
            {run, Verb, Accompany}, {run, Verb, Sail}, {run, Verb, SpreadOut},
            {run, Verb, Flow}, {run, Verb, GoAway}, {run, Verb, Displace},
            {run, Verb, MoveFreely}, {run, Verb, Trade}, {run, Verb, Loose},
            {run, Verb, Direct}, {run, Verb, Succeed}, {run, Verb, Implement},
            {run, Verb, Occur}, {run, Verb, Continue},{run, Verb, Endure},
            {run, Verb, Extend}, {run, Verb, MakePass}, {run, Verb, Lean},
            {run, Verb, Incur}, {run, Verb, Go}, {run, Verb, Range}}

Mathematica imports XML into expression form. You can manipulate the expression just like you would any other Mathematica expression, but first you need to understand the structure, which is a bit unconventional. Mathematica uses two types of heads to encode XML. XMLObject[ "type" ] is used to represent everything that is not an element, including the entire document (type = "Document"), comments (type = "Comment}), CDATA sections (type = "CDATASection"), processing instructions (type = "Processinglnstruction"), declarations (type = "Declaration), and document types (type = "Doctype"). In the XML above, you see examples for document, declaration, comment, and processing instruction. XMLElement[tag,{attr1→val1,...},{data1,...}] is used to represent element data for both simple (text values) and complex element types (those with child elements). Don’t get tripped up by the XMLObject notation. The entire syntax XMLObject["type"] is the head of the expression, while the remainder is a sequence of one or more arguments that depends on the type.

  In[144]:=  Head[data] // InputForm
Out[144]//InputForm=
             XMLObject["Document"]

The document version consists of three arguments: a list containing the declaration and possibly other objects, the document content, and a list of any objects (such as comments) that might appear past the last XML element. A very crude way to access structure is through Part[] or, equivalently, [[n]].

Discussion

Pattern matching is much more elegant and more resilient to changes in document structure. Here we extract male elements using Cases with a pattern and an infinite level specification. This is roughly equivalent to using XPath in native XML processing.

  In[150]:=  Cases[data, XMLElement[_, _, {_, XMLElement["sex", _, {"male"}], ___}],
               Infinity] // TableForm
Out[150]//TableForm=
             XMLElement[item, {}, {XMLElement[name, {}, {Leonardo}], XMLElement[sex,
              {}, {male}], XMLElement[age, {}, {8}], XMLElement[height, {}, {4.7}]}]
             XMLElement[item, {}, {XMLElement[name, {}, {Salvatore}], XMLElement[sex,
              {}, {male}], XMLElement[age, {}, {5}], XMLElement[height, {}, {4.1}]}]

Sometimes, the XMLObject and XMLElement notation can be a bit too heavy, and it is easier to work with simple nested lists. This can be done with Apply plus List, specifying all levels.

In[151]:=  list = Apply[List, data, {0, Infinity}]
Out[151]=  {{{{Version, 1.0}, {Encoding, UTF-8}},
             { Some data to use as a test for Mathematica's XML import },
             {test, Just for didactic purposes}}, {data, {},
             {{item, {}, {{name, {}, {Leonardo}}, {sex, {}, {male}}, {age, {}, {8}},
                {height, {}, {4.7}}}}, {item, {}, {{name, {}, {Salvatore}},
                {sex, {}, {male}}, {age, {}, {5}}, {height, {}, {4.1}}}},
              {item, {}, {{name, {}, {Alexis}}, {sex, {}, {female}},
                {age, {}, {6}}, {height, {}, {4.4}}}}}}, {{  Comment at end }}}

This can shorten the patterns needed to extract content.

In[152]:=  Cases[list, {___, {"sex", _, {"male"}}, ___}, Infinity]
Out[152]=  {{{name, {}, {Leonardo}}, {sex, {}, {male}},
             {age, {}, {8}], {height, {}, {4.7}}}, {{name, {}, {Salvatore}},
             {sex, {}, {male}}, {age, {}, {5}}, {height, {}, {4.1}}}}

Another useful transformation is to change all heads to the symbolic form of the element tag. Here we use //. (ReplaceRepeated) with rules that strip XMLObject and convert XMLElement expressions. I show the output in tree form to make it clear what this transformation does.

Discussion

The format of imported XML is a bit heavy. You use pattern matching and ReplaceAll to transform it into something more digestible. Here we take our row-oriented XML data into a simple matrix.

Solution

This technique has two basic steps. First, you use Cases to extract the relevant elements. Second, you use a series of one or more transformations to massage the data into the form you want. In the first transformation, elements are taken to primitive values. Here you rely on the column position to determine when strings need conversion into numbers via ToExpression[]. The second transformation strips out the remaining XMLElement content. Until you have some experience with these types of transformations it is unlikely that you’ll whip them up off the top of your head. The final form of this transformation reflects the fact that I developed it in stages. Here are the successive refinements.

Choose the relevant elements.

In[156]:=  Cases[data , XMLElement["item", _, _], Infinity]
Out[156]=  {XMLElement[item, {},
             {XMLElement[name, {}, {Leonardo}], XMLElement[sex, {}, {male}],
              XMLElement[age, {}, {{8}], XMLElement[height, {}, {4.7}]}],
            XMLElement[item, {}, {XMLElement[name, {}, {Salvatore}],
              XMLElement[sex, {}, {male}], XMLElement[age, {}, {5}],
              XMLElement[height, {}, {4.1}]}], XMLElement[item, {},
             {XMLElement[name, {}, {[Alexis}], XMLElement[sex, {}, {female}],
              XMLElement[age, {}, {6}], XMLElement[height, {}, {4.4 }]}]}

Strip out the data-level XML structure.

Solution

Strip out the row-level XML structure, leaving the data in matrix form but all the primitive values as strings.

Solution

Finally, do the type conversion.

Solution

There are always many ways to solve the same transformation problem. The tradeoffs involve brevity, clarity, generality, and performance. The solution has clarity, because it accomplishes the transformation in a step-by-step fashion. However, it is neither brief nor very general. The following transformation does the same thing but is more general. It will work on any two-level XML document because it does not match on specific element names (like "item"). It also does not hardcode which columns contain numeric data. However, it is a bit more cryptic because it does not mention XMLElement at all. Rather, it immediately converts the data to a list (using Apply with head List), and it uses [[n]] to pick out the relevant items.

Discussion

I demonstrate the generality by processing an XML file with a different number of rows, columns, and data types.

Discussion

The pure pattern-based approach of 5.9 Transforming XML Using Patterns and Rules is too awkward, cryptic, or complex for your particular transformation problem.

A natural objection to using this style of transformation rather than using replacement rules is that it is more verbose. This verbosity comes with some advantages. The first advantage is that when things go wrong, it is generally easier to debug a set of discrete functions than a replacement pattern. Most of the action of a replacement pattern is happening under the covers. The second advantage comes in cases where you need to make many changes at different levels in the XML hierarchy. Here the overhead of the recursive approach is less bothersome. We implement a transformation that changes elements to attributes, renames the "item" element to "row", changes "sex" to "gender", and converts the height from feet to meters—all with very little extra overhead.

Discussion
Discussion

One of the first things you learn about XSLT is that if you create an empty stylesheet (XSLT’s equivalent of a program), you get some default transformation rules that act to output just the text nodes of the XML data. We can emulate that behavior in Mathematica with the following functions.

In[187]:=  ClearAll[transform]
           transform[XMLObject[type_][content__]] :=
            StringJoin[transform[#] & /@ List[content]]
           transform[XML]lement[tag_, attrs_List, data_List]] :=
            StringJoin[transform[#] & /@ data ]
           transform[text_String] := text
           transform[_] := ""
In[192]:=  transform[data]
Out[192]=  Leonardomale84.7Salvatoremale54.1Alexisfemale64.4

So far, so good, but can we do something more interesting? Suppose we want to clone our XML document but replace all occurrences of the element "sex" with the element "gender".

Discussion

This recursive transformational approach is overkill in this scenario since we can more easily express this transformation using ReplaceAll.

Discussion

There are certain types of structure-adding transformations that were difficult to do in XSLT until a grouping construct was added (xsl:for-each-group) in XSLT 2.0. Here is a solution to a grouping problem using Mathematica’s Sort[] and Split[] functions.

Discussion

The goal of this transformation is to group all employees in the same department under a new element <Dept dept="num">. Notice how this is accomplished with little additional code. Helper functions define an ordering and an equivalence relation for Sort and Split, respectively, and a transform[] applies the additional level of grouping when it matches the "employees" element.

Discussion
Discussion

Of course, there are significant differences between these transformations and XSLT. For example, in XSLT, you operate on a tree and, hence, can navigate upward from child elements to parent elements. This is not the case for Mathematica’s representation of XML. The tutorial mentioned in the following See Also section provides some guidance for working around these issues.

The tutorial XML/tutorial/TransformingXML in the Mathematica documentation (also at http://bit.ly/4tS1Ce ) has a section comparing Mathematica to XSLT and can provide further help in exploiting these techniques.

You can learn more about XSLT at the XSL Working Group’s website: http://bit.ly/1fJsB.

The easiest type of parser to write in Mathematica is a recursive descent parser. Before writing the parser, we need to know the grammar of the language we will parse. The most common notation for grammars is Backus-Naur Form (BNF), but for reasons that will become apparent in the discussion, I use Mathematica itself to represent the grammar. For this example, I use a simplified English grammar. The presentation here is a variation of one developed and given by Daniel Lichtblau of Wolfram Research at the Wolfram Developer’s Conference in 1999. Refer to the See Also section for more information.

First, we need some helper functions to make creating the grammar easier. We use two functions, sequence and choose, with attribute HoldAll to prevent them from evaluating their arguments and causing an infinite recursion. As its name would suggest, sequence[] represents a sequence of terms of the grammar. Choose represents a choice of one out of two or more possible terms. I allow choose to take an extra argument, which is a list of probabilities for the choices. More on that later.

In[212]:=  SetAttributes[{sequence, choose}, HoldAll]
           NILL = "";

This grammar is for a small subset of English.

In[214]:=  sentence := choose[declarative, interrogative, imperative]
           declarative := sequence[subject, predicatepast]
           interrogative := sequence[qverb, subject, predicatepresent]
           imperative := sequence[actverb, subject]
           subject := choose[nounclause, sequence[nounclause, prepositionclause]]
           nounclause := sequence[adjectiveclause, noun]
           noun = {"skyscraper", "ball", "dog", "cow", "shark", "attorney", "hatter",
              "programmer", "city", "village", "buffalo", "moon", "librarian", "sheep"} ;
           adjectiveclause := sequence[article, adjectivelist]
           adjectivelist := choose[NILL, sequence[adjective, adjectivelist] , {0.7}]
           article = {"a", "the", "this", "that"};
           adjective =
             {"big", "wet", "mad", "hideous", "red", "repugnant", "slimy", "delectable",
              "mild-mannered", "lazy", "silly", "crazy", "ferocious", "cute"} ;
           prepositionclause := sequence[preposition, nounclause]
           preposition = {"in", "above", "under", "from", "near", "at", "with"} ;
           predicatepresent := sequence[verbpresent, subject]
           predicatepast := sequence[verbclause, subject]
           verbclause := sequence[adverblist, verbpast]
           adverblist := choose[NILL, sequence[adverb, adverblist ], {0.6}]
           adverb=
             {"swiftly", "unflinchingly", "smugly", "selflessly", "oddly", "mightily"} ;
           verbpast = {"ate", "threw", "gnashed", "boiled",
              "grated", "milked", "spanked", "jumped"};
           verbpresent= {"eat", "throw", "gnash", "boil", "grate",
              "milk", "spank", "salivate", "jump"};
           qverb = {"did", "will", "could", "should"} ;
           actverb= {"break","fix", "launch", "squeeze", "fetch"} ;

This grammar becomes the specification for our parser. Recursive descent parsers are probably the easiest parsers to craft by hand because their structure mimics the grammar. The goal of this parser is to create a labeled parse tree from a sentence. The parser is very simple: it contains no provision for error handling and relies on the grammar being completely conflict free. For example, the major sentence types are completely determined by the first word. Real languages or even artificial languages (like programming languages) are rarely that clean.

In[237]:=  (*Test for membership of a terminal
            symbol in a list of terminal symbols.*)
           isQ[type_, word_]:= MemberQ[type, word]

           (*Get next word for parser.*)
           getNextWord[{}] := ""
           getNextWord[words_List] := First[words]

           (*Parse a single word, classifying it as head, and return length of 1.*)
           atomParse[head_, words_List] := {head[getNextWord[words]], 1}

           (*Top level parse function for
            sentences. Dispatches based on first word.*)
           sentenceParse[sentence_sentenceType] :=
            Module[{sentencelist =Apply[List, sentence], firstWord },
             firstWord = First[sentencelist];
             If[isQ[qverb, firstWord], interrogativeParse[sentencelist],
                If[isQ[actverb, firstWord], imperativeParse[sentencelist],
                declarativeParse[sentencelist]]]]

           (*declarative := sequence[subject, predicatepast]*)
           declarativeParse[words_List]:=
            Module[{subject =subjectParse[words], predicate},
             predicate=predicatepastParse[Drop[words, subject[[2]]]];
             "DECLARATIVE SENTENCE"[subject[[1]], predicate[[1]]]]

           (*interrogative := sequence[qverb, subject, predicatepresent]*)
           interrogativeParse[words_List] :=
            Module[{qverb =atomParse["QUESTION VERB", words], subject, predicate},
             subject=subjectParse[Drop[words, qverb[[2]]]];
             predicate=predicatepresentParse[
               Drop[words, qverb[[2]] +subject[[2]]]];
             "INTERROGATIVE SENTENCE"[qverb[[1]], subject[[1]], predicate[[1]]]]
             (**)

           (*imperative := sequence[actverb, subject]*)
           imperativeParse[words_List] :=
            Module[{actverb =atomParse["ACTION VERB", words], subject},
             subject = subjectParse[Drop[words, actverb[[2]]]];
             "IMPERATIVE SENTENCE"[actverb[[1]], subject[[1]]]]
          (*subject :=
           choose[nounclause, sequence[nounclause, prepositionclause]]*)
          subjectParse[words_List] :=
           Module[{nounclause =nounclauseParse[words], prepositionclause},
            prepositionclause=Drop[words, nounclause[[2]]];
            If[! isQ[preposition, getNextWord[prepositionclause]],
             {"SUBJECT"[nounclause[[1]]], nounclause[[2]]},
             prepositionclause=prepositionclauseParse[prepositionclause];
             {"SUBJECT"[nounclause[[1]], prepositionclause[[1]]],
             nounclause[[2]] +prepositionclause[[2]]}]]

         (*predicatepast :=sequence[verbclause,subject]*)
         predicatepastParse[words_List] :=
          Module[{verbclause = verbclauseParse[words], subject},
           subject = subjectParse[Drop[words, verbclause[[2]]]];
           {"PREDICATE"[verbclause[[1]], subject[[1]]],
            verbclause[[2]] + subject[[2]]}]

        (*predicatepresent:=sequence[verbpresent,subject]*)
        predicatepresentParse[words_List] :=
         Module[{verb =atomParse["VERB (PRESENT TENSE)", words], subject},
          subject=subjectParse[Drop[words, verb[[2]]]];
          {"PREDICATE"[verb[[1]], subject[[1]]], verb[[2]] +subject[[2]]}]

        (*verbclause:=sequence[adverblist,verbpast]*)
        verbclauseParse[words_List] :=
         Module[{adverbs =adverblistParse[words], verb},
          verb=atomParse["VERB (PAST TENSE)", Drop[words, adverbs[[2]]]];
          If[adverbs[[2]] == 0, verb,
           {"VERB CLAUS]"[adverbs[[1]], verb[[1]]], adverbs[[2]] + verb[[2]]}]]

        (*nounclause:= sequence[adjectiveclause, noun]*)
        nounclauseParse[words_List] :=
         Module[{adjectiveclause = adjectiveclauseParse[words], noun},
          noun = atomParse["NOUN", Drop[words, adjectiveclause[[2]]]];
          {"NOUN CLAUS]"[adjectiveclause[[1]], noun[[1]]],
           adjectiveclause[[2]] +noun[[2]]}]

        (*adjectiveclause := sequence[article, adjectivelist]*)
        adjectiveclauseParse[words_List] :=
         Module[{art =atomParse["ARTIC)]", words], adjlist},
          adjlist=adjectivelistParse[Drop[words, art[[2]]]];
         If[adjlist[[2]] ==0, art,{"ADJECTIVE CLAUSE"[art[[1]], adjlist[[1]]],
           art[[2]] +adjlist[[2]]}]]

        (* Parse (possibly empty) list of adjectives.*)
        (*adjectivelist :=
         choose[NILL, sequence[adjective, adjectivelist] , {0.7}]* )
        adjectivelistParse[words_List] :=
         Module[{words2 =words, adj, result, len=0}, result ="ADJECTIVE LIST"[];
          While[isQ[adjective, getNextWord[words2]],
           adj=atomParse["ADJECTIVE", words2];
           len+=adj[[2]];
           result="ADJECTIVE LIST"[result, adj[[1]]];
           words2=Drop[words2, adj[[2]]]];
          {Flatten[result, Infinity, "ADJECTIVE LIST"], len}]

        (*prepositionclause := sequence[preposition, nounclause]*)
        prepositionclauseParse[words_List] :=
         Module[{preposition =atomParse["PREPOSITION", words], nounclause},
          nounclause=nounclauseParse[Drop[words, preposition[[2]]]];
          {"PREPOSITION CLAUSE"[preposition[[1]], nounclause[[1]]],
           preposition[[2]] +nounclause[[2]]}]

       (* Parse Ipossibly empty M list of adverbs.*)
       (*adverblist := choose[NILL, sequence[adverb,adverblist], {0.6}]*)
       adverblistParse[words_List] :=
        Module[{words2 =words, adv, result, len=0}, result="ADVERB LIST"[];
         While[isQ[adverb, getNextWord[words2]],
          adv=atomParse["ADVERB", words2];
          len+=adv[[2]];
          result="ADVERB LIST"[result, adv[[1]]];
          words2=Drop[words2, adv[[2]]]];
         {Flatten[result, Infinity, "ADVERB LIST"], len}]

We can test the parser on a sentence that conforms to the grammar.

In[254]:=  sentenceParse[
            sentenceType["will", "the", "wet", "programmer", "spank", "the", "moon"]]
Out[254]=  INTERROGATIVE SENTENCE[QUESTION VERB[will],
            SUBJECT[NOUN CLAUSE[ADJECTIVE CLAUSE[ARTICLE[the],
               ADJECTIVE LIST[ADJECTIVE[wet]]], NOUN[programmer]]],
            PREDICATE[VERB (PRESENT TENSE)[spank],
             SUBJECT[NOUN CLAUSE[ARTICLE[the], NOUN[moon]]]]]

You may wonder why I took the trouble to specify the grammar using Mathematica if I was going to write the parser by hand. First, I did not write this parser; I just prettied up a parser written by Daniel Lichtblau! The more serious answer is that the grammar can be used to easily create a language generator to go along with the parser. The generator is very useful for testing the parser. Here I based a generator on Lichtblau’s implementation but made some significant improvements. The first improvement is that my implementation is more declarative than procedural because it leverages Mathematica’s pattern matching. The second improvement is that the generator absorbs all the complexity so the grammar can remain very clean. In Lichtblau’s original grammar, the representation was soiled by the presence of programmatic constructs, like Hold[] and his implementation of random choice. Other than the presence of probabilities, the grammar in the preceding Solution section is completely clean. In fact, it reads as easy as BNF. Refer to the URL in the See Also section to compare this implementation with the original.

In[255]:=  <<Combinatorica`
           (*needed for BinarySearch[]*)
           (*randomChoose[parts_List,probs_List]  selects an item from
            parts_List based on a list of probabilities the length of
            which must be one less than the number of parts and the sum
            of which is less than one. The interpretation is that each
            probability corresponds to the probability of the item in the same
            position except for the last item, which gets the residual.*)
           randomChoose[parts_List, probs_List]:= Module[{weights, test, pos},
             weights = N[Append[FoldList[Plus, First[probs], Rest[probs]], 1]];
             test = RandomReal[]; pos = Ceiling[BinarySearch[weights, test]];
             parts[[pos]]]
           (*randomPart[]  is responsible for interpreting the grammar in
            a random manner. There is a variation for each possible term,
           and recursion is used to expand nonterminals.*)
           randomPart[sequence[parts__ ]]:= randomPart[#] & /@ List[parts]
           randomPart[choose[parts__, probs_List]]  :=
            Union[Flatten[List[randomPart[randomChoose[List[parts], probs  ]]]]]
           randomPart[choose[parts__ ]]:= Module[{partList, numParts},
             partList = List[parts]; numParts =Length[partList];
             randomPart[randomChoose[partList, Table[1/numParts, {numParts -1}]]]]
           randomPart[terminals_List] :=
            terminals[[ RandomInteger[ {1, Length[terminals ]}] ]]
           randomPart[NILL ] := {}
           (*randomSentence[]  is the entry point for
            generating a random sentence of the grammar.* )
           randomSentence [] := sentenceType @@ Flatten[randomPart[sentence]]
           (* We provide a nice textual formatting for
            sentences that also takes care of punctuation.* )
           Format[sentence_sentenceType]:=
            Module[{word =First[sentence], words, punc},
             words=Map[StringJoin[#, " "] &, sentence];
             punc=If[isQ[qverb, word], "?", If[isQ[actverb, word], "!", "."]];
             words[[Length[words]]] =StringReplacePart[Last[words], punc, -1];
             words[[1]] =StringReplacePart[First[words],
                ToUpperCase[StringTake[First[words], 1]], 1];
             Apply[StringJoin, words ]]

Here you can see the result of generating 10 random sentences. They are, for the most part, utter gibberish, but some are kind of funny. They all conform to the grammar, as we can see by running them through the parser.

Discussion

The parser we wrote by hand is an instance of a predictive recursive descent parser because it looks ahead wherever there is a choice so that it does not take a wrong path through the grammar. In contrast, a backtracking parser simply starts over from where it left off if a particular parse path fails. If you are ambitious, you can continue this recipe and write a backtracking parser generator in Mathematica. The references in the following See Also section provide some background.

See Daniel Lichtblau’s original implementation at http://bit.ly/zXhUm .

Packrat parsing is amenable to Mathematica implementation. See http://bit.ly/RsNCe .

A functional approach to parsing is discussed in "Monadic Parser Combinators" by Graham Hutton and Erik Meijer, published in Journal of Functional Programming, Volume 8, Issue 4, 1996. See http://bit.ly/PIVAh (PostScript file).