More Atoms

By this point, we have seen the simplest atoms that can be used in a pattern: “1,” “5,” and “\.” are atoms that exactly match a character. The other atoms that can be used in patterns are special characters, a wildcard that matches any character, or predefined and user-defined character classes.

Table 6-1 lists the special-character atoms. Like the characters we have already seen, each of them matches exactly one character, but they stand for characters that must be escaped or (for the first three on the list) are provided merely for convenience.

The character “.” has a special meaning: it’s a wildcard atom that matches any valid XML character except newline and carriage return. As with any atom, “.” may be followed by an optional quantifier, and “.*” is a common construct that matches zero or more occurrences of any character. To illustrate the usage of “.*” (and the fact that xs:pattern is a Swiss army knife), a pattern may be used to define the integers that are multiples of 10:

<xs:simpleType name="multipleOfTen">
  <xs:restriction base="xs:integer">
    <xs:pattern value=".*0"/>
  </xs:restriction>
</xs:simpleType>

W3C XML Schema has adopted the “classical” Perl and Unicode character classes (but not the POSIX-style character classes also available in Perl).

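W3C XML Schema supports the classical Perl character classes plus a couple of additions to match XML-specific productions. Each of these classes is designated by a single letter, and the classes designated by the upper- and lowercase versions of the same letter are complementary: \s (whitespace) and \S, \d (decimal digits) and \D, \w (word characters) and \W, \i (characters allowed at the start of an XML name) and \I, and \c (characters allowed anywhere in an XML name) and \C.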

These character classes may be used with an optional quantifier like any other atom. The last pattern that we saw:

<xs:pattern value=".*0"/>

constrains the lexical space to be a string of characters ending with a zero. Knowing that the base type is an xs:integer, this is good enough for our purposes, but if the base type had been an xs:decimal (or xs:string), we could be more restrictive and write:

<xs:pattern value="-?\d*0"/>

This checks that the characters before the trailing zero are digits with an optional leading - (we will see later on in Section 6.5.2.2 how to specify an optional leading - or +).
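As a minimal sketch of how these Perl-style classes might be put to work, the schema below declares two types. Both type names (multipleOfTenString and xmlNameLike) are hypothetical and chosen only for illustration. The first applies the stricter pattern to xs:string; the second combines \i (a character allowed at the start of an XML name) with a quantified \c (a character allowed anywhere in an XML name).

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: a string that looks like a multiple of ten -->
  <xs:simpleType name="multipleOfTenString">
    <xs:restriction base="xs:string">
      <!-- optional minus sign, any number of digits, then a trailing zero -->
      <xs:pattern value="-?\d*0"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical type: one name-start character followed by name characters -->
  <xs:simpleType name="xmlNameLike">
    <xs:restriction base="xs:token">
      <xs:pattern value="\i\c*"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

Because the base type of the first definition is xs:string, the pattern alone has to keep out non-numeric values, which is why the digits are spelled out rather than left to “.*”.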

Patterns support character classes matching both Unicode categories and blocks. Categories and blocks are two complementary classification systems: categories classify characters by their usage independently of their localization (letters, uppercase, digit, punctuation, etc.), while blocks classify characters by their localization independently of their usage (Latin, Arabic, Hebrew, Tibetan, and even Gothic or musical symbols).

The syntax \p{Name} is similar for blocks and categories; the prefix Is is added to the name of blocks to make the distinction (\p{L} is the category of letters, while \p{IsBasicLatin} is the BasicLatin block). The syntax \P{Name} is also available to select the characters that do not match a block or category. A list of Unicode blocks and categories is given in the specification. Table 6-2 shows the Unicode character classes and Table 6-3 shows the Unicode character blocks.
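The following sketch shows the three forms side by side: a category, a block (whose name carries the Is prefix, as in \p{IsCyrillic}), and a complement written with \P. The type names are hypothetical; the category names Lu, Ll, and Nd and the block name Cyrillic are taken from the Unicode classifications listed in the specification.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Categories: one uppercase letter followed by lowercase letters -->
  <xs:simpleType name="capitalizedWord">
    <xs:restriction base="xs:token">
      <xs:pattern value="\p{Lu}\p{Ll}*"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Block: one or more characters from the Cyrillic block -->
  <xs:simpleType name="cyrillicBlockText">
    <xs:restriction base="xs:token">
      <xs:pattern value="\p{IsCyrillic}+"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Complement: any characters that are not decimal digits -->
  <xs:simpleType name="noDigits">
    <xs:restriction base="xs:token">
      <xs:pattern value="\P{Nd}*"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```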

Table 6-3. Unicode character blocks

AlphabeticPresentationForms

Arabic

ArabicPresentationForms-A

ArabicPresentationForms-B

Armenian

Arrows

BasicLatin

Bengali

BlockElements

Bopomofo

BopomofoExtended

BoxDrawing

BraillePatterns

ByzantineMusicalSymbols

Cherokee

CJKCompatibility

CJKCompatibilityForms

CJKCompatibilityIdeographs

CJKCompatibilityIdeographsSupplement

CJKRadicalsSupplement

CJKSymbolsandPunctuation

CJKUnifiedIdeographs

CJKUnifiedIdeographsExtensionA

CJKUnifiedIdeographsExtensionB

CombiningDiacriticalMarks

CombiningHalfMarks

CombiningMarksforSymbols

ControlPictures

CurrencySymbols

Cyrillic

Deseret

Devanagari

Dingbats

EnclosedAlphanumerics

EnclosedCJKLettersandMonths

Ethiopic

GeneralPunctuation

GeometricShapes

Georgian

Gothic

Greek

GreekExtended

Gujarati

Gurmukhi

HalfwidthandFullwidthForms

HangulCompatibilityJamo

HangulJamo

HangulSyllables

Hebrew

HighPrivateUseSurrogates

HighSurrogates

Hiragana

IdeographicDescriptionCharacters

IPAExtensions

Kanbun

KangxiRadicals

Kannada

Katakana

Khmer

Lao

Latin-1Supplement

LatinExtended-A

LatinExtendedAdditional

LatinExtended-B

LetterlikeSymbols

LowSurrogates

Malayalam

MathematicalAlphanumericSymbols

MathematicalOperators

MiscellaneousSymbols

MiscellaneousTechnical

Mongolian

MusicalSymbols

Myanmar

NumberForms

Ogham

OldItalic

OpticalCharacterRecognition

Oriya

PrivateUse

Runic

Sinhala

SmallFormVariants

SpacingModifierLetters

Specials

SuperscriptsandSubscripts

Syriac

Tags

Tamil

Telugu

Thaana

Thai

Tibetan

UnifiedCanadianAboriginalSyllabics

YiRadicals

YiSyllables

We don’t yet know how to specify intersections between a block and a category in a single pattern, or how to specify that a datatype must be composed of only basic Latin letters. So, to “cross” these classifications and define the intersection of the category L (all the letters) and the block BasicLatin (the ASCII characters, #x0000 through #x007F), we can perform two successive restrictions:

<xs:simpleType name="BasicLatinLetters">
  <xs:restriction>
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:pattern value="\p{IsBasicLatin}*"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:pattern value="\p{L}*"/>
  </xs:restriction>
</xs:simpleType>

User-defined character classes are lists of characters between square brackets that accept - signs to define ranges and a leading ^ to negate the whole list. For instance:

[azertyuiop]

to define the list of letters on the first row of a French keyboard,

[a-z]

to specify all the characters between “a” and “z”,

[^a-z]

for all the characters that are not between “a” and “z,” but also

[-^\\]

to define the characters “-,” “^,” and “\,” or

[-+]

to specify the sign of a decimal number.

These examples are enough to show that what’s between the square brackets follows its own syntax and semantics. As in the main regular expression syntax, we have a list of atoms, but instead of matching each atom against a successive character of the instance string, we define a set: the character class is the set of characters matching any of the atoms found between the brackets.

We also see two special characters whose meaning depends on their location. The character -, which is a range delimiter when it is between a and z, is a normal character when it is just after the opening bracket or just before the closing bracket ([+-] and [-+] are, therefore, both legal). Conversely, ^, which is a negator when it appears at the beginning of a class, loses this special meaning and becomes a normal character later in the class definition.
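To make these positional rules concrete, here is a small sketch that uses two of the classes shown above inside complete patterns. The type names signedDigits and noLowercaseAscii are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- "-" comes first in the class, so it is a literal character, not a range -->
  <xs:simpleType name="signedDigits">
    <xs:restriction base="xs:token">
      <!-- optional sign followed by one or more ASCII digits -->
      <xs:pattern value="[-+]?[0-9]+"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- a leading "^" negates the class: anything except "a" through "z" -->
  <xs:simpleType name="noLowercaseAscii">
    <xs:restriction base="xs:token">
      <xs:pattern value="[^a-z]*"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```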

We also notice that characters may or must be escaped: “\\” is used to match the character “\”. In fact, in a class definition, all the escape sequences that we have seen as atoms can be used. Even though some of the special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:

[-^\\]

can also be written as:

[\-\^\\]

or as:

[\^\\-]

since the location of the characters doesn’t matter any longer when they are escaped.

Within square brackets, the character “\” also keeps its meaning of a reference to a Perl or Unicode class. The following:

[\d\p{Lu}]

is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category “Lu”).
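As an illustration, this class could be quantified to describe a code made only of digits and uppercase letters. The type name upperAlphanumericCode is a hypothetical one introduced for this sketch.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: one or more decimal digits or uppercase letters -->
  <xs:simpleType name="upperAlphanumericCode">
    <xs:restriction base="xs:token">
      <!-- union of the Perl class \d and the Unicode category Lu -->
      <xs:pattern value="[\d\p{Lu}]+"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```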

Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In our square brackets, we already saw two of these operations: union (the square bracket is an implicit union of its atoms) and complement (a leading ^ realizes the complement of the set defined in the square bracket). W3C XML Schema extended the syntax of the Perl regular expressions to introduce a third operation: the difference between sets. The syntax follows:

[set1-[set2]]

Its meaning is all the characters in set1 that do not belong to set2, where set1 and set2 can use all the syntactic tricks that we have seen up to now.
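A simple way to try out the subtraction syntax is to remove the vowels from the lowercase ASCII range. The sketch below assumes a hypothetical type name, lowercaseConsonants, and is only meant to show the shape of the expression.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: lowercase ASCII letters, minus the vowels -->
  <xs:simpleType name="lowercaseConsonants">
    <xs:restriction base="xs:token">
      <!-- set1 is a-z, set2 is aeiou; the result is set1 without set2 -->
      <xs:pattern value="[a-z-[aeiou]]+"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```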

This operator can be used to perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and we can now define a class for the BasicLatin Letters as:

[\p{IsBasicLatin}-[^\p{L}]]

Or, using the \P construct, which is also a complement, we can define the class as:

[\p{IsBasicLatin}-[\P{L}]]

The corresponding datatype definition would be:

<xs:simpleType name="BasicLatinLetters">
  <xs:restriction base="xs:token">
    <xs:pattern value="[\p{IsBasicLatin}-[\P{L}]]*"/>
  </xs:restriction>
</xs:simpleType>

In our first example pattern, we used three separate patterns to express three possible values. We can condense this definition using the “|” character, which is the “or” operator when used outside square brackets. The simple type definition is then:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="1|5|15"/>
  </xs:restriction>
</xs:simpleType>

This syntax is more concise, but whether or not it’s more readable is subject to discussion. Also, these “ors” would not be very interesting if it were not possible to use them in conjunction with groups. Groups are complete regular expressions, which are, themselves, considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed in parentheses (“(” and “)”). To define a comma-separated list of “1,” “5,” or “15,” ignoring whitespace between values and commas, the following pattern could be used:

<xs:simpleType name="myListOfBytes">
  <xs:restriction base="xs:token">
    <xs:pattern value="(1|5|15)( *, *(1|5|15))*"/>
  </xs:restriction>
</xs:simpleType>

Note how we have relied on the whitespace processing of the base datatype (xs:token collapses whitespace). We have not tested leading and trailing whitespace, which is trimmed, and we have only had to allow for single occurrences of spaces, using the following atom:

" *"

before and after the comma.
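To see the effect of the whitespace collapsing, the sketch below repeats the type and attaches it to an element. The element name bytes is hypothetical, and the sample values in the comments follow from applying the pattern to the collapsed value.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="myListOfBytes">
    <xs:restriction base="xs:token">
      <xs:pattern value="(1|5|15)( *, *(1|5|15))*"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical element used only to show sample values -->
  <xs:element name="bytes" type="myListOfBytes"/>

  <!-- Accepted: "1"   "1, 5,15"   "  15 ,1  " (collapsed to "15 ,1") -->
  <!-- Rejected: "1,,5" (empty item)   "51" (not 1, 5, or 15)   "1 5" (no comma) -->

</xs:schema>
```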