By this point, we have seen the simplest atoms that can be used in a pattern: “1,” “5,” and “\.” are atoms that exactly match a character. The other atoms that can be used in patterns are special characters, a wildcard that matches any character, or predefined and user-defined character classes.
Table 6-1 lists the atoms that, like the characters we have already seen, match exactly one character, but that stand for characters that must be escaped or (for the first three entries in the list) are provided only for convenience.
The character “.” has a special meaning: it’s a wildcard atom that matches any valid XML character except newlines and carriage returns. As with any atom, “.” may be followed by an optional quantifier, and “.*” is a common construct to match zero or more occurrences of any character. To illustrate the usage of “.*” (and the fact that xs:pattern is a Swiss army knife), a pattern may be used to define the integers that are multiples of 10:
<xs:simpleType name="multipleOfTen">
  <xs:restriction base="xs:integer">
    <xs:pattern value=".*0"/>
  </xs:restriction>
</xs:simpleType>
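The behavior of the wildcard can be approximated in Python. This is only a sketch: Python's re module is not the W3C XML Schema regular expression dialect. In particular, Python's “.” excludes only the newline by default, so the sketch spells the wildcard out as [^\n\r], and it uses fullmatch because a schema pattern has to match the whole value. The names XSD_DOT and ends_with_zero are invented for this illustration.

```python
import re

# Python's "." also matches "\r", so the schema wildcard is written out explicitly.
XSD_DOT = r"[^\n\r]"

# Rough equivalent of the schema pattern ".*0" (hypothetical name).
ENDS_WITH_ZERO = re.compile(f"{XSD_DOT}*0")


def ends_with_zero(value: str) -> bool:
    # A schema pattern is implicitly anchored at both ends, hence fullmatch.
    return ENDS_WITH_ZERO.fullmatch(value) is not None
```

With this helper, “10” and “-250” are accepted and “15” is rejected. The pattern alone also accepts “abc0”; in the schema, it is the xs:integer base type that rules such values out.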
W3C XML Schema has adopted the “classical” Perl and Unicode character classes (but not the POSIX-style character classes also available in Perl).
W3C XML Schema supports the classical Perl character classes plus a couple of additions to match XML-specific productions. Each of these classes is designated by a single letter; the classes designated by the upper- and lowercase versions of the same letter are complementary:
\s
Spaces. Matches the XML whitespaces (space #x20, tabulation #x09, line feed #x0A, and carriage return #x0D).
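\S
Characters that are not spaces.
\d
Digits (“0” to “9”, but also the decimal digits of other scripts).
\D
Characters that are not digits.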
\w
Extended “word” characters (any Unicode character not defined as “punctuation”, “separator,” or “other”). This conforms to the Perl definition, assuming UTF8 support has been switched on.
\W
Nonword characters.
\i
XML 1.0 initial name characters (i.e., all the “letters” plus “_” and “:”). This is a W3C XML Schema extension over Perl regular expressions.
\I
Characters that may not be used as an XML initial name character.
\c
XML 1.0 name characters (initial name characters, digits, “.”, “:”, “-”, and the characters defined by Unicode as “combining” or “extender”). This is a W3C XML Schema extension to Perl regular expressions.
\C
Characters that are not XML 1.0 name characters.
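These classes look like their Python namesakes, but the definitions are not identical. The following sketch rebuilds \s and \w from the descriptions above, using the Unicode general categories exposed by the standard unicodedata module. The function names are hypothetical, and the code is an approximation, not a schema processor.

```python
import unicodedata

# The four XML whitespace characters: space, tab, line feed, carriage return.
XML_WHITESPACE = {" ", "\t", "\n", "\r"}


def is_xsd_space(ch: str) -> bool:
    """Approximate \\s as described above."""
    return ch in XML_WHITESPACE


def is_xsd_word_char(ch: str) -> bool:
    """Approximate \\w: anything that is not punctuation, separator, or other."""
    # The first letter of the general category is P, Z, or C for those groups.
    return unicodedata.category(ch)[0] not in "PZC"
```

Under this definition, “_” (connector punctuation) is not a word character while “+” (a mathematical symbol) is, which is the opposite of what Python's own \w does.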
These character classes may be used with an optional quantifier like any other atom. The last pattern that we saw:
<xs:pattern value=".*0"/>
constrains the lexical space to be a string of characters ending with a zero. Knowing that the base type is an xs:integer, this is good enough for our purposes, but if the base type had been an xs:decimal (or xs:string), we could be more restrictive and write:
<xs:pattern value="-?\d*0"/>
This checks that the characters before the trailing zero are digits with an optional leading - (we will see later on in Section 6.5.2.2 how to specify an optional leading - or +).
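The difference between the two patterns shows up as soon as the base type no longer filters the input. Here is a minimal sketch in Python, assuming that fullmatch is an acceptable stand-in for the implicit anchoring of schema patterns. The names LOOSE, STRICT, and accepted are made up for the example.

```python
import re

LOOSE = re.compile(r"[^\n\r]*0")  # ".*0", with the schema wildcard spelled out
STRICT = re.compile(r"-?\d*0")    # "-?\d*0"


def accepted(pattern: re.Pattern, value: str) -> bool:
    # The whole value must match, as it must for a schema pattern.
    return pattern.fullmatch(value) is not None
```

With a decimal base type, “1.0” would pass the loose pattern but not the strict one, while “-120” passes both.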
Patterns support character classes matching both Unicode categories and blocks. Categories and blocks are two complementary classification systems: categories classify the characters by their usage independently of their localization (letters, uppercase, digit, punctuation, etc.), while blocks classify characters by their localization independently of their usage (Latin, Arabic, Hebrew, Tibetan, and even Gothic or musical symbols).
The syntax \p{Name} is similar for blocks and categories; the prefix Is is added to the name of blocks to make the distinction (for example, \p{IsBasicLatin} for the block and \p{L} for the category). The syntax \P{Name} is also available to select the characters that do not match a block or category. A list of Unicode blocks and categories is given in the specification. Table 6-2 shows the Unicode character classes and Table 6-3 shows the Unicode character blocks.
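Python's standard re module has no \p{...} construct, but the two classification systems can be imitated with the unicodedata module and explicit code point ranges. The sketch below is an illustration under those assumptions: matches_p, matches_P, and the two-entry BLOCKS table are hypothetical, and only two blocks are listed.

```python
import unicodedata

# Two blocks only, for illustration; the ranges come from the Unicode block charts.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Cyrillic": (0x0400, 0x04FF),
}


def matches_p(name: str, ch: str) -> bool:
    """Rough equivalent of \\p{name} for a single character."""
    if name.startswith("Is"):
        # Block names carry the "Is" prefix.
        low, high = BLOCKS[name[2:]]
        return low <= ord(ch) <= high
    # A one-letter category (L, N, ...) covers its subcategories (Lu, Ll, ...).
    return unicodedata.category(ch).startswith(name)


def matches_P(name: str, ch: str) -> bool:
    """Rough equivalent of \\P{name}: the complement of \\p{name}."""
    return not matches_p(name, ch)
```

For instance, matches_p("Lu", "A") and matches_p("IsBasicLatin", "a") are both true, while matches_p("IsBasicLatin", "é") is false because “é” lies outside the BasicLatin range.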
Table 6-2. Unicode character classes
| Unicode Character Class | Includes |
|---|---|
| C | Other characters (non-letters, non-symbols, non-numbers, non-separators) |
| Cc | Control characters |
| Cf | Format characters |
| Cn | Unassigned code points |
| Co | Private use characters |
| L | Letters |
| Ll | Lowercase letters |
| Lm | Modifier letters |
| Lo | Other letters |
| Lt | Titlecase letters |
| Lu | Uppercase letters |
| M | All Marks |
| Mc | Spacing combining marks |
| Me | Enclosing marks |
| Mn | Non-spacing marks |
| N | Numbers |
| Nd | Decimal digits |
| Nl | Number letters |
| No | Other numbers |
| P | Punctuation |
| Pc | Connector punctuation |
| Pd | Dashes |
| Pe | Closing punctuation |
| Pf | Final quotes (may behave like Ps or Pe) |
| Pi | Initial quotes (may behave like Ps or Pe) |
| Po | Other forms of punctuation |
| Ps | Opening punctuation |
| S | Symbols |
| Sc | Currency symbols |
| Sk | Modifier symbols |
| Sm | Mathematical symbols |
| So | Other symbols |
| Z | Separators |
| Zl | Line breaks |
| Zp | Paragraph breaks |
| Zs | Spaces |
Table 6-3. Unicode character blocks
AlphabeticPresentationForms |
Arabic |
ArabicPresentationForms-A |
ArabicPresentationForms-B |
Armenian |
Arrows |
BasicLatin |
Bengali |
BlockElements |
Bopomofo |
BopomofoExtended |
BoxDrawing |
BraillePatterns |
ByzantineMusicalSymbols |
Cherokee |
CJKCompatibility |
CJKCompatibilityForms |
CJKCompatibilityIdeographs |
CJKCompatibilityIdeographsSupplement |
CJKRadicalsSupplement |
CJKSymbolsandPunctuation |
CJKUnifiedIdeographs |
CJKUnifiedIdeographsExtensionA |
CJKUnifiedIdeographsExtensionB |
CombiningDiacriticalMarks |
CombiningHalfMarks |
CombiningMarksforSymbols |
ControlPictures |
CurrencySymbols |
Cyrillic |
Deseret |
Devanagari |
Dingbats |
EnclosedAlphanumerics |
EnclosedCJKLettersandMonths |
Ethiopic |
GeneralPunctuation |
GeometricShapes |
Georgian |
Gothic |
Greek |
GreekExtended |
Gujarati |
Gurmukhi |
HalfwidthandFullwidthForms |
HangulCompatibilityJamo |
HangulJamo |
HangulSyllables |
Hebrew |
HighPrivateUseSurrogates |
HighSurrogates |
Hiragana |
IdeographicDescriptionCharacters |
IPAExtensions |
Kanbun |
KangxiRadicals |
Kannada |
Katakana |
Khmer |
Lao |
Latin-1Supplement |
LatinExtended-A |
LatinExtendedAdditional |
LatinExtended-B |
LetterlikeSymbols |
LowSurrogates |
Malayalam |
MathematicalAlphanumericSymbols |
MathematicalOperators |
MiscellaneousSymbols |
MiscellaneousTechnical |
Mongolian |
MusicalSymbols |
Myanmar |
NumberForms |
Ogham |
OldItalic |
OpticalCharacterRecognition |
Oriya |
PrivateUse |
PrivateUse |
PrivateUse |
Runic |
Sinhala |
SmallFormVariants |
SpacingModifierLetters |
Specials |
Specials |
SuperscriptsandSubscripts |
Syriac |
Tags |
Tamil |
Telugu |
Thaana |
Thai |
Tibetan |
UnifiedCanadianAboriginalSyllabics |
YiRadicals |
YiSyllables |
We don’t yet know how to specify intersections between a block and a category in a single pattern, or how to specify that a datatype must be composed of only basic Latin letters. So, to “cross” these classifications and define the intersection of the category L (all the letters) and the block BasicLatin (ASCII characters, up to #x7F), we can perform two successive restrictions:
<xs:simpleType name="BasicLatinLetters">
  <xs:restriction>
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:pattern value="\p{IsBasicLatin}*"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:pattern value="\p{L}*"/>
  </xs:restriction>
</xs:simpleType>
User-defined character classes are lists of characters between square brackets that accept - signs to define ranges and a leading ^ to negate the whole list. For instance, [azertyuiop] defines the list of letters on the first row of a French keyboard, [a-z] specifies all the characters between “a” and “z”, and [^a-z] all the characters that are not between “a” and “z”. Likewise, [-^\\] defines the characters “-”, “^”, and “\”, and [-+] specifies the sign of a decimal number.
These examples are enough to see that what’s between these square brackets follows a specific syntax and semantics. As in the main syntax of regular expressions, we have a list of atoms, but instead of matching each atom against a character of the instance string, we define a set: the character class is the set of characters matching any of the atoms found between the brackets.
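The “class as a set” reading can be made concrete by listing the characters a bracket expression accepts. For the simple classes shown so far, Python's re module happens to accept the same bracket syntax, so the sketch below uses it directly. The helper class_members is hypothetical, and the universe of candidate characters is limited to Python's string.printable to keep the result small.

```python
import re
import string


def class_members(char_class: str, universe: str = string.printable) -> set:
    """Return the characters of `universe` matched by the bracket expression."""
    pattern = re.compile(char_class)
    # A class matches exactly one character, so each candidate is tested alone.
    return {ch for ch in universe if pattern.fullmatch(ch)}
```

For example, class_members("[azertyuiop]") is the set of those ten letters, and class_members(r"[-^\\]") contains exactly “-”, “^”, and “\”.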
We also see two special characters whose meaning depends on their location. The character -, which is a range delimiter when it is between a and z, is a normal character when it is just after the opening bracket or just before the closing bracket ([+-] and [-+] are, therefore, both legal). Conversely, ^, which is a negator when it appears at the beginning of a class, loses this special meaning and becomes a normal character later in the class definition.
We also notice that characters may or must be escaped: “\\” is used to match the character “\”. In fact, in a class definition, all the escape sequences that we have seen as atoms can be used. Even though some of the special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:
[-^\\]
can also be written as:
[\-\^\\]
or as:
[\^\\-]
since the location of the characters doesn’t matter any longer when they are escaped.
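That the three spellings describe the same set can be checked mechanically. The sketch below relies on Python's re module, which treats these particular escapes the same way; that is an assumption about Python rather than a statement about schema processors. The names SPELLINGS, matched_set, and MATCHED are illustrative.

```python
import re
import string

# The three spellings discussed above.
SPELLINGS = [r"[-^\\]", r"[\-\^\\]", r"[\^\\-]"]


def matched_set(char_class: str) -> frozenset:
    """Characters of string.printable accepted by the bracket expression."""
    pattern = re.compile(char_class)
    return frozenset(ch for ch in string.printable if pattern.fullmatch(ch))


# One entry per spelling; all three values should be the same set.
MATCHED = {spelling: matched_set(spelling) for spelling in SPELLINGS}
```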
Within square brackets, the character “\” also keeps its meaning of a reference to a Perl or Unicode class. The following:
[\d\p{Lu}]
is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category “Lu”).
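Since the bracket is a union of its atoms, this class amounts to “category Nd or category Lu”. Here is a minimal Python sketch of that reading, using unicodedata because Python's re has no \p{Lu}. The function name is invented.

```python
import unicodedata


def in_digit_or_upper(ch: str) -> bool:
    """Rough equivalent of the class [\\d\\p{Lu}] for one character."""
    # \d corresponds to decimal digits (Nd); \p{Lu} to uppercase letters.
    return unicodedata.category(ch) in {"Nd", "Lu"}
```

Both “7” and “É” belong to the class, while “é” and “-” do not.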
Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In our square brackets, we already saw two of these operations: union (the square bracket is an implicit union of its atoms) and complement (a leading ^ realizes the complement of the set defined in the square bracket). W3C XML Schema extended the syntax of the Perl regular expressions to introduce a third operation: the difference between sets. The syntax follows:
[set1-[set2]]
Its meaning is all the characters in set1 that do not belong to set2, where set1 and set2 can use all the syntactic tricks that we have seen up to now.
This operator can be used to perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and we can now define a class for the BasicLatin letters as:
[\p{IsBasicLatin}-[^\p{L}]]
Or, using the \P construct, which is also a complement, we can define the class as:
[\p{IsBasicLatin}-[\P{L}]]
The corresponding datatype definition would be:
<xs:simpleType name="BasicLatinLetters">
  <xs:restriction base="xs:token">
    <xs:pattern value="[\p{IsBasicLatin}-[\P{L}]]*"/>
  </xs:restriction>
</xs:simpleType>
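The subtraction can be mirrored with ordinary Python sets, which makes the “difference with the complement equals intersection” argument easy to check. This is a sketch under stated assumptions: the BasicLatin block is taken as code points #x00 to #x7F, letters are detected through unicodedata, and every name below is invented for the example.

```python
import unicodedata

# \p{IsBasicLatin}: code points #x00 to #x7F.
BASIC_LATIN = {chr(cp) for cp in range(0x0000, 0x0080)}


def is_letter(ch: str) -> bool:
    # \p{L}: any general category starting with "L".
    return unicodedata.category(ch).startswith("L")


# \P{L}, restricted to the block; characters outside it cannot affect the result.
NON_LETTERS = {ch for ch in BASIC_LATIN if not is_letter(ch)}

# [\p{IsBasicLatin}-[\P{L}]]: the block minus everything that is not a letter.
BASIC_LATIN_LETTERS = BASIC_LATIN - NON_LETTERS


def is_basic_latin_letters(value: str) -> bool:
    # Equivalent of applying the class with a "*" quantifier to the whole value.
    return all(ch in BASIC_LATIN_LETTERS for ch in value)
```

The resulting set is simply the 52 unaccented letters “a” to “z” and “A” to “Z”, so “Hello” is accepted and “Héllo” is not.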
In our first example pattern, we used three separate patterns to express three possible values. We can condense this definition using the “|” character, which is the “or” operator when used outside square brackets. The simple type definition is then:
<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="1|5|15"/>
  </xs:restriction>
</xs:simpleType>
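The alternation can be tried out in Python, whose “|” operator behaves the same way for this simple case. The sketch assumes that fullmatch stands in for the implicit anchoring of schema patterns; MY_BYTE and is_my_byte are hypothetical names.

```python
import re

MY_BYTE = re.compile(r"1|5|15")


def is_my_byte(value: str) -> bool:
    # The alternation is applied to the whole value, as a schema pattern would be.
    return MY_BYTE.fullmatch(value) is not None
```

Only “1”, “5”, and “15” are accepted. A value such as “51” is rejected even though it contains “5”, because the whole value has to match one of the branches.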
This syntax is more concise, but whether or not it’s more readable is subject to discussion. Also, these “ors” would not be very interesting if it were not possible to use them in conjunction with groups. Groups are complete regular expressions, which are, themselves, considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed in parentheses (“(” and “)”). To define a comma-separated list of “1,” “5,” or “15,” ignoring whitespace between values and commas, the following pattern could be used:
<xs:simpleType name="myListOfBytes">
  <xs:restriction base="xs:token">
    <xs:pattern value="(1|5|15)( *, *(1|5|15))*"/>
  </xs:restriction>
</xs:simpleType>
Note how we have relied on the whitespace processing of the base datatype (xs:token collapses the whitespaces). We have not tested leading and trailing whitespaces, which are trimmed, and we have only tested single occurrences of spaces with the following atom:
" *"
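The interplay between whitespace collapsing and the pattern can be sketched in Python. The collapse function below is only an approximation of the xs:token whitespace processing (runs of XML whitespace become a single space, and leading and trailing spaces are dropped). Both function names are made up, and Python's re is again standing in for the schema dialect.

```python
import re

LIST_OF_BYTES = re.compile(r"(1|5|15)( *, *(1|5|15))*")


def collapse(value: str) -> str:
    """Approximate the whitespace collapsing applied by xs:token."""
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")


def is_list_of_bytes(value: str) -> bool:
    # The pattern is checked against the collapsed value, not the raw text.
    return LIST_OF_BYTES.fullmatch(collapse(value)) is not None
```

A raw value with extra spaces and a line break around the comma, such as “ 1 ,” followed by a newline and “ 5 ”, collapses to “1 , 5” and is accepted, while “1,,5” and “1 5” are rejected.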