More Atoms

Special Characters

Table 6-1 lists atoms that each match a single character, exactly like the ordinary characters we have already seen. They stand either for characters that must be escaped or (for the first three entries in the list) for characters whose escapes are provided only for convenience.

Table 6-1. Special characters

\n	New line (can also be written as “&#x0A;”, since we are in an XML document).
\r	Carriage return (can also be written as “&#x0D;”).
\t	Tabulation (can also be written as “&#x09;”).
\\	Character “\”
\|	Character “|”
\.	Character “.”
\-	Character “-”
\^	Character “^”
\?	Character “?”
\*	Character “*”
\+	Character “+”
\{	Character “{”
\}	Character “}”
\(	Character “(”
\)	Character “)”
\[	Character “[”
\]	Character “]”

Wildcard

The character “.” has a special meaning: it is a wildcard atom that matches any valid XML character except newline and carriage return. Like any atom, “.” may be followed by an optional quantifier, and “.*” is a common construct that matches zero or more occurrences of any character. To illustrate the use of “.*” (and the fact that xs:pattern is a Swiss Army knife), a pattern can be used to define the integers that are multiples of 10:

<xs:simpleType name="multipleOfTen">
  <xs:restriction base="xs:integer">
    <xs:pattern value=".*0"/>
  </xs:restriction>
</xs:simpleType>
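As a rough illustration, the behavior of this type can be approximated outside a schema processor. The sketch below uses Python's re module, which is not the XML Schema regular expression engine. It assumes that re.fullmatch stands in for the implicit anchoring of xs:pattern, spells the wildcard as [^\n\r] to mirror the definition above, and uses a simplified check for the lexical space of xs:integer. The names INTEGER_LEXICAL, ENDS_WITH_ZERO, and is_multiple_of_ten are hypothetical.

```python
import re

# Assumption: simplified lexical space of xs:integer (optional sign, then digits).
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

# The pattern ".*0", with "." spelled out as "any character but LF or CR".
ENDS_WITH_ZERO = re.compile(r"[^\n\r]*0")


def is_multiple_of_ten(lexical: str) -> bool:
    """Return True if the string passes both the base type check and the pattern."""
    return (
        INTEGER_LEXICAL.fullmatch(lexical) is not None
        and ENDS_WITH_ZERO.fullmatch(lexical) is not None
    )


for candidate in ["120", "0", "-30", "125", "abc0"]:
    print(candidate, is_multiple_of_ten(candidate))
```

On its own, the pattern would also accept a string such as "abc0". In this sketch it is the check standing in for the xs:integer base type that rejects it, which is why such a loose pattern is sufficient here.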

Character Classes

Unicode character classes

Unicode Character Class	Includes
C	Other characters (non-letters, non-symbols, non-numbers, non-separators)
Cc	Control characters
Cf	Format characters
Cn	Unassigned code points
Co	Private use characters
L	Letters
Ll	Lowercase letters
Lm	Modifier letters
Lo	Other letters
Lt	Titlecase letters
Lu	Uppercase letters
M	All Marks
Mc	Spacing combining marks
Me	Enclosing marks
Mn	Non-spacing marks
N	Numbers
Nd	Decimal digits
Nl	Number letters
No	Other numbers
P	Punctuation
Pc	Connector punctuation
Pd	Dashes
Pe	Closing punctuation
Pf	Final quotes (may behave like Ps or Pe)
Pi	Initial quotes (may behave like Ps or Pe)
Po	Other forms of punctuation
Ps	Opening punctuation
S	Symbols
Sc	Currency symbols
Sk	Modifier symbols
Sm	Mathematical symbols
So	Other symbols
Z	Separators
Zl	Line breaks
Zp	Paragraph breaks
Zs	Spaces
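To get a feel for these categories, it can help to look up a few characters. The sketch below uses Python's unicodedata module, whose two-letter category codes follow the same Unicode general categories as the table. This is only an approximation of what a schema processor does, since the Unicode version bundled with Python may differ from the one the processor uses. The names SAMPLES and describe are hypothetical.

```python
import unicodedata

SAMPLES = ["A", "a", "5", " ", "€", "+", "(", ")", "_"]


def describe(ch: str) -> tuple:
    """Return (major class, full category) for one character, e.g. ("L", "Lu")."""
    category = unicodedata.category(ch)
    # The first letter of the category is the one-letter class from the table.
    return category[0], category


for ch in SAMPLES:
    major, category = describe(ch)
    print(repr(ch), major, category)
```

The one-letter classes (L, N, P, and so on) are simply the union of the two-letter classes that start with the same letter.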

User-defined character classes

These examples are enough to show that what appears between square brackets follows its own syntax and semantics. As in the regular expression’s main syntax, we have a list of atoms, but instead of matching each atom against a character of the instance string, we define a logical space: the character class is the set of characters that match any of the atoms found between the brackets.

We also see two special characters whose meaning depends on their position. The character “-”, which is a range delimiter when it sits between two characters (as in [a-z]), is a normal character when it comes just after the opening bracket or just before the closing bracket ([+-] and [-+] are therefore both legal). Conversely, “^”, which negates the class when it appears at the very beginning, loses this special meaning and becomes a normal character anywhere later in the class definition.

We also notice that characters may or must be escaped: “\\” is used to match the character “\”. In fact, in a class definition, all the escape sequences that we have seen as atoms can be used. Even though some of the special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:

[-^\\]

can also be written as:

[\-\^\\]

or as:

[\^\\-]

since the location of the characters doesn’t matter any longer when they are escaped.
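These position and escaping rules can be checked experimentally. The sketch below relies on the assumption that Python's re module treats "-", "^", and "\" inside square brackets the same way for these particular classes. That holds for the spellings shown here but is not a general guarantee of compatibility with XML Schema regular expressions. The names BACKSLASH_SPELLINGS, HYPHEN_SPELLINGS, and matched_chars are hypothetical.

```python
import re

# The three spellings of the class from the text, plus the two "+/-" spellings.
BACKSLASH_SPELLINGS = [r"[-^\\]", r"[\-\^\\]", r"[\^\\-]"]
HYPHEN_SPELLINGS = [r"[+-]", r"[-+]"]


def matched_chars(char_class: str) -> set:
    """Return the printable ASCII characters accepted by a character class."""
    pattern = re.compile(char_class)
    return {
        chr(cp)
        for cp in range(0x20, 0x7F)
        if pattern.fullmatch(chr(cp)) is not None
    }


for spelling in BACKSLASH_SPELLINGS + HYPHEN_SPELLINGS:
    print(spelling, sorted(matched_chars(spelling)))
```

Each of the first three spellings accepts exactly the hyphen, the caret, and the backslash, and both of the last two accept exactly the plus and minus signs.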

Within square brackets, the character “\” also keeps its meaning of a reference to a Perl or Unicode class. The following:

[\d\p{Lu}]

is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category “Lu”).
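Python's re module has no \p{...} escape, so this class cannot be pasted into it directly. As a minimal sketch, the same set can be emulated with Unicode general categories, assuming that \d corresponds to the category “Nd” as it does in W3C XML Schema. The names DIGIT_OR_UPPER, in_class, and all_in_class are hypothetical.

```python
import unicodedata

# [\d\p{Lu}] as a union of two Unicode general categories.
DIGIT_OR_UPPER = {"Nd", "Lu"}


def in_class(ch: str) -> bool:
    """Return True if one character belongs to the emulated class."""
    return unicodedata.category(ch) in DIGIT_OR_UPPER


def all_in_class(text: str) -> bool:
    """Emulate the pattern [\\d\\p{Lu}]* applied to a whole string."""
    return all(in_class(ch) for ch in text)


for sample in ["7", "Z", "É", "z", "AB12", "Ab12"]:
    print(sample, all_in_class(sample))
```

Note that the class is not limited to ASCII: accented uppercase letters and decimal digits from other scripts belong to it as well.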

Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In our square brackets, we have already seen two of them: union (a square bracket expression is an implicit union of its atoms) and complement (a leading ^ takes the complement of the set defined in the square brackets). W3C XML Schema extends the syntax of Perl regular expressions with a third operation: the difference between sets. The syntax is:

[set1-[set2]]

Its meaning is all the characters in set1 that do not belong to set2, where set1 and set2 can use all the syntactic tricks that we have seen up to now.

This operator can be used to perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and we can now define a class for the BasicLatin Letters as:

[\p{IsBasicLatin}-[^\p{L}]]

Or, using the \P construct, which is also a complement, we can define the class as:

[\p{IsBasicLatin}-[\P{L}]]

The corresponding datatype definition would be:

<xs:simpleType name="BasicLatinLetters">
  <xs:restriction base="xs:token">
    <xs:pattern value="[\p{IsBasicLatin}-[\P{L}]]*"/>
  </xs:restriction>
</xs:simpleType>
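The subtraction can be pictured with ordinary set operations. The sketch below is an emulation with Python sets and the unicodedata module, not a schema processor. It assumes that the BasicLatin block covers U+0000 to U+007F and that \p{L} corresponds to the general categories starting with “L”. All names in it are hypothetical.

```python
import unicodedata

# \p{IsBasicLatin}: every character of the Basic Latin block.
basic_latin = {chr(cp) for cp in range(0x00, 0x80)}


def is_letter(ch: str) -> bool:
    """Emulate \\p{L}: any general category starting with "L"."""
    return unicodedata.category(ch).startswith("L")


# [\P{L}], restricted to the characters that can matter for the difference.
non_letters = {ch for ch in basic_latin if not is_letter(ch)}

# [\p{IsBasicLatin}-[\P{L}]]: set1 minus set2.
basic_latin_letters = basic_latin - non_letters

# Removing the non-letters gives the same result as intersecting with the letters.
intersection = {ch for ch in basic_latin if is_letter(ch)}


def is_basic_latin_letters(text: str) -> bool:
    """Emulate the pattern [\\p{IsBasicLatin}-[\\P{L}]]* on a whole string."""
    return all(ch in basic_latin_letters for ch in text)


print("".join(sorted(basic_latin_letters)))
```

Under these assumptions the resulting class contains only the unaccented letters A to Z and a to z, so a value such as "Hello" would be accepted while "Héllo" would not.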

Oring and Grouping

In our first example pattern, we used three separate patterns to express three possible values. We can condense this definition using the “|” character, which is the “or” operator when used outside square brackets. The simple type definition is then:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="1|5|15"/>
  </xs:restriction>
</xs:simpleType>
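To see that the condensed form accepts the same values as three separate patterns, the two variants can be compared side by side. This is a sketch using Python's re module with re.fullmatch standing in for the implicit anchoring of xs:pattern. It assumes that the three separate patterns act as alternatives, as described above. The names used are hypothetical.

```python
import re

SEPARATE_PATTERNS = ["1", "5", "15"]  # three separate pattern values
COMBINED_PATTERN = "1|5|15"           # the single value used in myByte


def accepted_by_separate(value: str) -> bool:
    """A value is accepted if any one of the separate patterns matches it."""
    return any(re.fullmatch(p, value) is not None for p in SEPARATE_PATTERNS)


def accepted_by_combined(value: str) -> bool:
    """A value is accepted if the single alternation matches it."""
    return re.fullmatch(COMBINED_PATTERN, value) is not None


candidates = [str(n) for n in range(0, 20)] + ["51", "155"]
agree = all(accepted_by_separate(c) == accepted_by_combined(c) for c in candidates)
accepted = {c for c in candidates if accepted_by_combined(c)}
print(agree, sorted(accepted))
```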

This syntax is more concise, but whether it is more readable is open to discussion. Also, these “ors” would not be very interesting if they could not be used in conjunction with groups. Groups are complete regular expressions that are themselves considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed in parentheses (“(” and “)”). To define a comma-separated list of the values “1”, “5”, or “15”, ignoring whitespace between values and commas, the following pattern could be used:

<xs:simpleType name="myListOfBytes">
  <xs:restriction base="xs:token">
    <xs:pattern value="(1|5|15)( *, *(1|5|15))*"/>
  </xs:restriction>
</xs:simpleType>

Note how we have relied on the whitespace processing of the base datatype: xs:token collapses whitespace before the pattern is applied. Leading and trailing whitespace is trimmed, so the pattern does not need to allow for it, and runs of whitespace are reduced to a single space, so the following atom never has to match more than one space:

" *"

before and after the comma.
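The interplay between whitespace collapsing and the pattern can be sketched as a two-step check. The code below is an approximation in Python, not a schema processor. It assumes that collapsing means replacing tabs, line feeds, and carriage returns with spaces, reducing runs of spaces to one, and trimming both ends, and that re.fullmatch stands in for the implicit anchoring of xs:pattern. The names collapse_whitespace and is_list_of_bytes are hypothetical.

```python
import re

# Same pattern value as in the myListOfBytes definition.
LIST_PATTERN = re.compile(r"(1|5|15)( *, *(1|5|15))*")


def collapse_whitespace(raw: str) -> str:
    """Approximate the collapsing performed for xs:token."""
    spaced = re.sub(r"[\t\n\r]", " ", raw)          # tab, LF, CR become spaces
    return re.sub(r" +", " ", spaced).strip(" ")    # shrink runs, trim both ends


def is_list_of_bytes(raw: str) -> bool:
    """Collapse whitespace first, then apply the pattern to the whole value."""
    return LIST_PATTERN.fullmatch(collapse_whitespace(raw)) is not None


for sample in ["1,5,15", "  1 ,\t 5   ,15 \n", "1,,5", "1 5", "2"]:
    print(repr(sample), is_list_of_bytes(sample))
```

The second sample is accepted only because the collapsing step has already removed the leading and trailing whitespace, which the pattern itself does not allow for.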

AlphabeticPresentationForms	Arabic	ArabicPresentationForms-A
ArabicPresentationForms-B	Armenian	Arrows
BasicLatin	Bengali	BlockElements
Bopomofo	BopomofoExtended	BoxDrawing
BraillePatterns	ByzantineMusicalSymbols	Cherokee
CJKCompatibility	CJKCompatibilityForms	CJKCompatibilityIdeographs
CJKCompatibilityIdeographsSupplement	CJKRadicalsSupplement	CJKSymbolsandPunctuation
CJKUnifiedIdeographs	CJKUnifiedIdeographsExtensionA	CJKUnifiedIdeographsExtensionB
CombiningDiacriticalMarks	CombiningHalfMarks	CombiningMarksforSymbols
ControlPictures	CurrencySymbols	Cyrillic
Deseret	Devanagari	Dingbats
EnclosedAlphanumerics	EnclosedCJKLettersandMonths	Ethiopic
GeneralPunctuation	GeometricShapes	Georgian
Gothic	Greek	GreekExtended
Gujarati	Gurmukhi	HalfwidthandFullwidthForms
HangulCompatibilityJamo	HangulJamo	HangulSyllables
Hebrew	HighPrivateUseSurrogates	HighSurrogates
Hiragana	IdeographicDescriptionCharacters	IPAExtensions
Kanbun	KangxiRadicals	Kannada
Katakana	Khmer	Lao
Latin-1Supplement	LatinExtended-A	LatinExtendedAdditional
LatinExtended-B	LetterlikeSymbols	LowSurrogates
Malayalam	MathematicalAlphanumericSymbols	MathematicalOperators
MiscellaneousSymbols	MiscellaneousTechnical	Mongolian
MusicalSymbols	Myanmar	NumberForms
Ogham	OldItalic	OpticalCharacterRecognition
Oriya	PrivateUse	PrivateUse
PrivateUse	Runic	Sinhala
SmallFormVariants	SpacingModifierLetters	Specials
Specials	SuperscriptsandSubscripts	Syriac
Tags	Tamil	Telugu
Thaana	Thai	Tibetan
UnifiedCanadianAboriginalSyllabics	YiRadicals	YiSyllables
