Chapter 11. Built-in simple types

This chapter describes the 49 built-in simple types that are included in XML Schema. These simple types represent common types that can be used directly in schemas. They are also the foundation for deriving other simple types, as described in Chapter 8. A complete reference to the built-in simple types and the facets that apply to them can be found in Appendix B.

11.1. The XML Schema type system

There are 49 simple types built into XML Schema. They are specified in Part 2 of the XML Schema recommendation. This part of the recommendation makes a distinction between “datatypes” and “simple types.” Datatypes are abstract concepts of data, such as “integer.” Simple types are the concrete representations of these datatypes. Most of the built-in types are atomic types, although there are three list types as well.

11.1.1. The type hierarchy

Types in the XML Schema type system form a hierarchy. Figure 11–1 depicts the hierarchy of the built-in types, showing that some built-in types are derived from other built-in types. The downward arrows represent derivation by restriction, so the types become more restrictive toward the bottom of the diagram. For example, nonPositiveInteger is more restrictive than integer, which is more restrictive than anyAtomicType.

Figure 11–1. The built-in type hierarchy

At the top of the hierarchy are three special types:

1. anyType is a generic complex type that allows anything: any attributes, any child elements, any text content.

2. anySimpleType, derived from anyType, is the base of all simple types, including atomic, list, and union types.

3. anyAtomicType, derived from anySimpleType, is a generic type from which all atomic types are derived.

anyType can be declared as the type of an element, in which case that element can have any content. It is also the default type for elements if none is specified. anyType can also be extended or restricted by complex type definitions. anySimpleType and anyAtomicType are special types that cannot be used as the base for user-defined types in a schema. However, they can be declared as the type of an element or attribute.

The types directly under anyAtomicType are known as primitive types, while the rest are derived built-in types. The primitive types represent basic type concepts, and all other built-in atomic types are restrictions of those types. When you define new simple types in your schema, they can never be primitive; they must be derived from a built-in primitive type.

Figure 11–1 shows that the three built-in list types (NMTOKENS, ENTITIES, and IDREFS) are derived from anySimpleType. Any user-defined list and union types are also derived from anySimpleType, although they have item types or member types that may be specific atomic types.

Starting in version 1.1, it is possible for implementations to support other primitive types, in addition to the built-in types described in this chapter. Consult the documentation of your XML Schema processor to determine whether you have access to any additional primitive types.

11.1.2. Value spaces and lexical spaces

Every type in the XML Schema type system has a value space. This value space represents the set of possible values for a type. For example, for the int type, it is the set of integers from –2147483648 to 2147483647. Every type also has a lexical space, which includes all the possible representations of those values. For int, each value in the value space might have several lexical representations. For example, the value 12 could also be written as 012 or +12. All of these values are considered equal for this type (but not if their type were string).

Of the lexical representations, one is considered the canonical representation: It maps one-to-one to a value in the value space and can be used to determine whether two values are equal. For the int type, the rule is that the canonical representation has no plus sign and no leading zeros. If you turn each of the three values 12, +12, and 012 into their canonical representation using that rule, they would all be 12 and therefore equal to each other. Some primitive types, such as string, only have one lexical representation, which becomes, by default, the canonical representation. In this chapter, the canonical representation of a particular type is only mentioned if there can be more than one lexical representation per value in the value space.

11.1.3. Facets and built-in types

As we saw in Chapter 8, simple types inherit the facets of their ancestors. For example, the integer type has a fractionDigits facet that is set to 0. This means that all of the twelve types derived (directly and indirectly) from integer also have a fractionDigits of 0.

However, it is not just the facet value that is inherited, but also the applicability of a facet. Each primitive type has certain facets that are applicable to it. For example, the string type has length as an applicable facet, but not totalDigits, because it does not make sense to apply totalDigits to a string. Therefore, the totalDigits facet cannot be applied to string or any of its derived types, whether they are built-in or user-defined.

It is not necessary to remember which types are primitive and which are derived. This chapter lists the applicable facets for all of the built-in types, not just the primitive types. When you derive new types from the built-in types, you may simply check which facets are applicable to the built-in type, regardless of whether it is primitive or derived.

11.2. String-based types

11.2.1. string, normalizedString, and token

The types string, normalizedString, and token represent a character string that may contain any Unicode characters allowed by XML. Certain characters, namely the “less than” symbol (<) and the ampersand (&), must be escaped (using the entities < and &, respectively) when used in strings in XML instances. The only difference between the three types is in the way whitespace is handled by a schema-aware processor, as shown in Table 11–1.

Table 11–1. Whitespace handling of string types

The string type has a whiteSpace facet of preserve, which means that all whitespace characters (spaces, tabs, carriage returns, and line feeds) are preserved by the processor.

The normalizedString type has a whiteSpace facet of replace, which means that the processor replaces each carriage return, line feed, and tab by a single space. There is no collapsing of multiple consecutive spaces into a single space.

The token type represents a tokenized string. The name token may be slightly confusing because it implies that there may be only one token with no whitespace. In fact, there can be whitespace in a token value. The token type has a whiteSpace facet of collapse, which means that the processor replaces each carriage return, line feed, and tab by a single space. After this replacement, each group of consecutive spaces is collapsed into one space character, and all leading and trailing spaces are removed.

Table 11–2 shows some valid and invalid values of the string types.

Table 11–2. Values of the string types

The facets indicated in Table 11–3 can restrict string, normalizedString, and token, and their derived types.¹

Table 11–3. Facets applicable to string, normalizedString, and token types

11.2.1.1. Design hint: Should I use string, normalizedString, or token?

First, consider whether to use a string-based simple type at all. If it is a long string of general text, such as a letter or a long description of an item, this may not be wise. This is because simple types are not extensible. Later, you may want to allow XHTML markup in the letter, or break the item description down into more structured components. It will be impossible to do this without altering the schema in a way that is not backwards compatible.

Additionally, simple types cannot support internationalization requirements such as Ruby annotations and BIDI (bidirectionality) elements.

In these cases, you should instead declare a complex type with mixed content, and include a wildcard to allow for future extensions. The complex type definition shown in Example 11–1 accomplishes this purpose. The character data content of an element of this type will have its whitespace preserved.

Example 11–1. Extensible mixed content

Click here to view code image

<xs:complexType name="TextType" mixed="true">
  <xs:sequence>
    <xs:any namespace="##any" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute ref="xml:lang"/>
</xs:complexType>

For short, atomic items, such as a postal code or a gender, it does make sense to use string, normalizedString, or token. But which one? Here are some general guidelines:

• string should be used for text where formatting, such as tabs and line breaks, is significant. However, as mentioned above, it may be better to use mixed complex type in this case.

• normalizedString is used when formatting is not significant but consecutive whitespace characters are significant. This can be used when the information in the string is positional.

• token should be used for most short, atomic strings, especially ones that have an enumerated set of values. Basing your enumerated types on token means that <gender> M </gender> will be valid as well as <gender>M</gender>.

11.2.2. Name

The type Name represents an XML name, which can be used as an element name or attribute name, among other things. Values of this type must start with a letter, underscore (_), or colon (:), and may contain only letters, digits, underscores (_), colons (:), hyphens (-), and periods (.). Colons should only be used to separate namespace prefixes from local names.

Table 11–4 shows some valid and invalid values of the Name type.

Table 11–4. Values of the Name type

The facets indicated in Table 11–5 can restrict Name and its derived types.

Table 11–5. Facets applicable to Name type

11.2.3. NCName

The type NCName represents an XML non-colonized name, which is simply a name that does not contain colons. An NCName must start with either a letter or underscore (_) and may contain only letters, digits, underscores (_), hyphens (-), and periods (.). This is identical to the Name type, except that colons are not permitted.

Table 11–6 shows some valid and invalid values of the NCName type.

Table 11–6. Values of the NCName type

The facets indicated in Table 11–7 can restrict NCName and its derived types.

Table 11–7. Facets applicable to NCName type

11.2.4. language

The type language represents a natural language identifier, generally used to indicate the language of a document or a part of a document. Before creating a new attribute of type language, consider using the xml:lang attribute that is intended to indicate the natural language of the element and its content.

Values of the language type conform to RFC 3066, Tags for the Identification of Languages, in version 1.0 and to RFC 4646, Tags for Identifying Languages, and RFC 4647, Matching of Language Tags, in version 1.1. The three most common formats are:

• For ISO-recognized languages, the format is a two- or three-letter (usually lowercase) language code that conforms to ISO 639, optionally followed by a hyphen and a two-letter, usually uppercase, country code that conforms to ISO 3166. For example, en or en-US.

• For languages registered by the Internet Assigned Numbers Authority (IANA), the format is i-langname, where langname is the registered name. For example, i-navajo.

• For unofficial languages, the format is x-langname, where langname is a name of up to eight characters agreed upon by the two parties sharing the document. For example, x-Newspeak.

Any of these three formats may have additional parts, each preceded by a hyphen, which identify more countries or dialects. Schema processors will not verify that values of the language type conform to the above rules. They will simply validate them based on the pattern specified for this type, which says that it must consist of one or more parts of up to eight characters each, separated by hyphens.

Table 11–8 shows some valid and invalid values of the language type.

Table 11–8. Values of the language type

The facets indicated in Table 11–9 can restrict language and its derived types.

Table 11–9. Facets applicable to language type

11.3. Numeric types

11.3.1. float and double

The type float represents an IEEE single-precision 32-bit floating-point number, and double represents an IEEE double-precision 64-bit floating-point number. The lexical representation of both float and double values is a mantissa (a number which conforms to the type decimal described in the next section) followed, optionally, by the character “E” or “e” followed by an exponent. The exponent must be an integer. For example, 3E2 represents 3 times 10 to the 2nd power, or 300.

In addition, the following values are valid: INF (infinity), +INF (positive infinity, version 1.1 only), -INF (negative infinity), 0 (positive 0), -0 (negative 0), and NaN (Not a Number). 0 and -0 are considered equal. INF and +INF are equal and are considered to be greater than all other values, while -INF is less than all other values. The value NaN cannot be compared to any other values.

The canonical representation for float and double always contains an uppercase letter E and a decimal point in the mantissa. No leading or trailing zeros are present, except that there is always at least one digit before and after the decimal point in the mantissa, and at least one digit in the exponent. No positive signs are included. For example, the canonical representation of the float value +12 is 12.0E0.

Table 11–10 shows some valid and invalid values of the float and double types.

Table 11–10. Values of the float and double types

The facets indicated in Table 11–11 can restrict float, double, and their derived types.

Table 11–11. Facets applicable to float and double types

11.3.2. decimal

The type decimal represents a decimal number of arbitrary precision. Schema processors vary in the number of significant digits they support, but a minimally conforming processor must support at least 16 significant digits. The lexical representation of decimal is a sequence of digits optionally preceded by a sign (“+” or “-”) and optionally containing a period. If the fractional part is 0 then the period and trailing zeros may be omitted. Leading and trailing zeros are permitted but not considered significant. That is, the decimal values 3.0 and 3.0000 are considered equal.

The canonical representation of decimal always contains a decimal point. No leading or trailing zeros are present, except that there is always at least one digit before and after the decimal point. No positive signs are included.

Table 11–12 shows some valid and invalid values of the decimal type.

Table 11–12. Values of the decimal type

The facets indicated in Table 11–13 can restrict decimal and its derived types.

Table 11–13. Facets applicable to decimal type

11.3.3. Integer types

The type integer represents an arbitrarily large integer; from it, 12 other built-in integer types are derived (directly or indirectly). The lexical representation of the integer types is a sequence of digits. Some of the integer types allow or require a sign (“+” or “-”) to precede the numbers, others prohibit it. Leading zeros are permitted, but decimal points are not.

The canonical representations of integer types do not contain leading zeros or positive signs. Table 11–14 lists all of the integer types, with their bounds and the rules for preceding signs.

Table 11–14. Integer types

Table 11–15 shows some valid and invalid values of the integer types.

The facets indicated in Table 11–16 can restrict the integer types and their derived types.

Table 11–15. Values of the integer types

Table 11–16. Facets applicable to integer types

11.3.3.1. Design hint: Is it an integer or a string?

When defining types for values that are sequences of digits, it may be difficult to determine whether the type should be based on an integer or a string. For example, a quantity is an example of a value that is better treated as an integer than a string. 5-digit U.S. zip codes, on the other hand, are valid integers, but they are probably better interpreted as strings. Here are some general guidelines:

Use integer (or, more likely, nonNegativeInteger) if:

• You will ever compare two values of that type numerically. For example, if you compare the quantity 100 to the quantity 99, you obviously want 100 to be greater. But if you define them as strings, they will be compared as strings in languages such as XSLT 2.0 and XQuery, and 100 will be considered less than 99.

• You will ever perform mathematical operations on values of that type. You might want to double a quantity, but you are unlikely to want to double a zip code.

• You want to restrict their values’ bounds. For example, you may require that quantity must be between 0 and 100. While it can be possible to restrict a string in this way, by applying a pattern, it is more cumbersome.

Use string (or, more likely, token) if:

• You want to restrict your values’ lexical length. For example, zip codes must be five digits long; 8540 is not a valid zip code, but 08540 is valid. While it is technically possible to restrict an integer to five digits by applying a pattern, it is more cumbersome.

• You will ever take a substring. For example, you may want to extract the central processing facility as the first three digits of a zip code.

• You plan to derive nonnumeric types from this type, or use it in a substitution group with nonnumeric types. For example, if you plan to also define types for international postal codes, which may contain letters or other characters, it is safer to base your U.S. zip code elements on a string type, so that they can be used in a substitution group with other postal code elements.

11.4. Date and time types

XML Schema provides a number of built-in date and time types, whose formats are based on ISO 8601. This section explains each of the date and time types and provides general information that applies to all date and time types.

11.4.1. date

The type date represents a Gregorian calendar date. The lexical representation of date is YYYY-MM-DD where YY represents the year, MM the month and DD the day. No left truncation is allowed for any part of the date. To represent years later than 9999, additional digits can be added to the left of the year value, but extra leading zeros are not permitted. To represent years before 0000, a preceding minus sign (“-”) is allowed. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–17 shows some valid and invalid values of the date type.

Table 11–17. Values of the date type

11.4.2. time

The type time represents a time of day. The lexical representation of time is hh:mm:ss.sss where hh represents the hour, mm the minutes, and ss.sss the seconds. An unlimited number of additional digits can be used to increase the precision of fractional seconds if desired. The time is based on a 24-hour time period, so hours should be represented as 00 through 24. Either of the values 00:00:00 or 24:00:00 can be used to represent midnight. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–18 shows some valid and invalid values of the time type.

Table 11–18. Values of the time type

11.4.3. dateTime

The type dateTime represents a specific date and time. The lexical representation of dateTime is YYYY-MM-DDThh:mm:ss.sss, which is a concatenation of the date and time forms, separated by a literal letter T. All of the same rules that apply to the date and time types are applicable to dateTime as well. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–19 shows some valid and invalid values of the dateTime type.

Table 11–19. Values of the dateTime type

11.4.4. dateTimeStamp

The type dateTimeStamp represents a specific date and time, but with a time zone required. It is derived from dateTime and has the same lexical representation and rules. The only difference is that a value is required to end in a time zone, as described in Section 11.4.13 on p. 233.

Table 11–20 shows some valid and invalid values of the dateTimeStamp type.

Table 11–20. Values of the dateTimeStamp type

11.4.5. gYear

The type gYear represents a specific Gregorian calendar year. The letter g at the beginning of most date and time types signifies “Gregorian.” The lexical representation of gYear is YYYY. No left truncation is allowed. To represent years later than 9999, additional digits can be added to the left of the year value. To represent years before 0000, a preceding minus sign (“-”) is allowed. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–21 shows some valid and invalid values of the gYear type.

Table 11–21. Values of the gYear type

11.4.6. gYearMonth

The type gYearMonth represents a specific month of a specific year. The lexical representation of gYearMonth is YYYY-MM. No left truncation is allowed on either part. To represent years later than 9999, additional digits can be added to the left of the year value. To represent years before 0000, a preceding minus sign (“-”) is permitted. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–22 shows some valid and invalid values of the gYearMonth type.

Table 11–22. Values of the gYearMonth type

11.4.7. gMonth

The type gMonth represents a specific month that recurs every year. It can be used to indicate, for example, that fiscal year-end processing occurs in September of every year. To represent a duration in months, use the duration type instead. The lexical representation of gMonth is --MM. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233. No preceding sign is allowed.

Table 11–23 shows some valid and invalid values of the gMonth type.

Table 11–23. Values of the gMonth type

11.4.8. gMonthDay

The type gMonthDay represents a specific day that recurs every year. It can be used to say, for example, that your birthday is on the 12th of April every year. The lexical representation of gMonthDay is --MM-DD. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–24 shows some valid and invalid values of the gMonthDay type.

Table 11–24. Values of the gMonthDay type

11.4.9. gDay

The type gDay represents a day that recurs every month. It can be used to say, for example, that checks are paid on the 5th of each month. To represent a duration in days, use the duration type instead. The lexical representation of gDay is ---DD. An optional time zone expression may be added at the end, as described in Section 11.4.13 on p. 233.

Table 11–25 shows some valid and invalid values of the gDay type.

Table 11–25. Values of the gDay type

11.4.10. duration

The type duration represents a duration of time expressed as a number of years, months, days, hours, minutes, and seconds. The lexical representation of duration is PnYnMnDTnHnMnS, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, T is a literal value that separates the date and time, nH is the number of hours followed by a literal H, nM is the number of minutes followed by a literal M, and nS is the number of seconds followed by a literal S. The following rules apply to duration values:

• Any of these numbers and corresponding designators may be absent if they are equal to 0, but at least one number and designator must appear.

• The numbers may be any unsigned integer, with the exception of the number of seconds, which may be an unsigned decimal number.

• If a decimal point appears in the number of seconds, there must be at least one digit after the decimal point.

• A minus sign may appear before the P to specify a negative duration.

• If no time items (hours, minutes, seconds) are present, the letter T must not appear.

In the canonical representation of duration, the months value must be less than 12, the hours value less than 24, and the minutes and seconds values less than 60. This means that P15M and P1Y3M are both valid (and equal) lexical representations that map to the same canonical value P1Y3M.

Table 11–26 shows some valid and invalid values of the duration type.

Table 11–26. Values of the duration type

When deriving types from duration, applying the bounds facets (minExclusive, minInclusive, maxInclusive, and maxExclusive) can have unexpected results. For example, if the maxInclusive value for a duration-based type is P1M, and an instance value contains P30D, it is ambiguous. Months may have 28, 29, 30, or 31 days, so is 30 days less than a month or not?

It is best to avoid the ambiguity by always specifying bounds for durations in the same unit in which the instance values will appear, in this case setting maxExclusive to P32D instead of P1M. You can use the pattern facet to force a particular unit of duration. For example, the pattern P\d+D applied to the duration type would force the duration to be expressed in days only.

Alternatively, if you are using version 1.1, you can use one of the two totally ordered duration types, yearMonthDuration or dayTimeDuration, described in the next two sections.

11.4.11. yearMonthDuration

The type yearMonthDuration, new in version 1.1, represents a duration of time expressed as a number of years and months. The lexical representation of duration is PnYnM, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, and nM is the number of months followed by a literal M.

yearMonthDuration is derived from duration, and all of the same lexical rules apply.

Table 11–27 shows some valid and invalid values of the yearMonthDuration type.

Table 11–27. Values of the yearMonthDuration type

11.4.12. dayTimeDuration

The type dayTimeDuration, new in version 1.1, represents a duration of time expressed as a number of days, hours, minutes, and seconds. The lexical representation of duration is PnDTnHnMnS, where P is a literal value that starts the expression, nD is the number of days followed by a literal D, T is a literal value that separates the date and time, nH is the number of hours followed by a literal H, nM is the number of minutes followed by a literal M, and nS is the number of seconds followed by a literal S.

dayTimeDuration is derived from duration, and all of the same lexical rules apply.

Table 11–28 shows some valid and invalid values of the dayTimeDuration type.

Table 11–28. Values of the dayTimeDuration type

11.4.13. Representing time zones

All of the date and time types, with the exception of the duration types, allow a time zone indicator at the end. The letter Z is used to indicate Coordinated Universal Time (UTC). All other time zones are represented by their difference from Coordinated Universal Time in the format +hh:mm or -hh:mm. These values may range from -14:00 to 14:00.

For example, US Eastern Standard Time, which is 5 hours behind UTC, is represented as -05:00. If no time zone value is present, it is considered unknown; it is not assumed to be UTC.

For most built-in types, a time zone is optional. However, for the dateTimeStamp type, it is required. This is because that type has its explicitTimezone facet set to required. In user-defined types derived from the date and time types, you can choose to leave the time zone optional, or require or prohibit a time zone using the explicitTimezone facet, as described in Section 8.4.7 on p. 150.

Table 11–29 shows some valid and invalid values of time zones.

Table 11–29. Time zone values

11.4.14. Facets

The facets indicated in Table 11–30 can restrict the date and time types as well as their derived types.

Table 11–30. Facets applicable to date and time types

11.4.15. Date and time ordering

When deriving types from date and time types (other than the duration types), it is important to note that applying the bounds facets (minExclusive, minInclusive, maxInclusive, and maxExclusive) can have unexpected results. If the values of the bounds facets specify time zones and the instance values do not, or vice versa, it may be impossible to compare the two. For example, if maxInclusive for a time-based type is 14:30:00Z, this means that the maximum value is 2:30 P.M. in UTC. If the value 13:30:00 appears in an instance, which is 1:30 P.M. with no time zone specified, it is impossible to tell if this value is valid. It could be 1:30 P.M. in UTC, which would be valid, or 1:30 P.M. US Eastern Standard Time, which would be 6:30 P.M. UTC, and therefore invalid. Since this is indeterminate, the schema processor will consider it an invalid value.

To avoid this problem, either use time zones in both bounds facet values and instance values, or do not use time zones at all. If both the bounds and the instance values have a time zone, the two values can be compared. Likewise, if neither has a time zone, the two values are assumed to be in the same time zone and compared as such.

11.5. Legacy types

The XML DTD types described in this section are attribute types that are specified in the XML recommendation. It is recommended that these types are only used for attributes, in order to maintain compatibility with XML DTDs. However, it is not an error to use these types in element declarations.

11.5.1. ID

The type ID is used for an attribute that uniquely identifies an element in an XML document. An ID value must conform to the rules for an NCName, as described in Section 11.2.3 on p. 210. This means that it must start with a letter or underscore, and can only contain letters, digits, underscores, hyphens, and periods.

ID values must be unique within an XML instance, regardless of the attribute’s name or its element name. Example 11–2 is invalid if attributes custID and orderID are both declared to be of type ID.

Example 11–2. Invalid nonunique IDs

Click here to view code image

In version 1.0, ID carries two additional constraints, both of which have been eliminated in version 1.1:

1. A complex type cannot include more than one attribute of type ID or of any type derived from ID. The type definition in Example 11–3 is illegal.

2. ID attributes cannot have default or fixed values specified. The attribute declarations in Example 11–4 are illegal.

Example 11–3. Illegal duplication of ID attributes (version 1.0)

Click here to view code image

<xs:complexType name="CustType">
<xs:attribute name="id" type="xs:ID"/>
<xs:attribute name="custID" type="xs:ID"/>
</xs:complexType>

Example 11–4. Illegal attribute declarations (version 1.0)

Click here to view code image

<xs:attribute name="id" type="xs:ID" fixed="A123"/>
<xs:attribute name="custID" type="xs:ID" default="C00000"/>

The facets indicated in Table 11–31 can restrict ID and its derived types.

Table 11–31. Facets applicable to ID type

11.5.2. IDREF

The type IDREF is used for an attribute that references an ID. A common use case for IDREF is to create a cross-reference to a particular section of a document. Like ID, an IDREF value must be an NCName, as described in Section 11.2.3 on p. 210.

All attributes of type IDREF must reference an ID in the same XML document. In Example 11–5, the ref attribute of quote is of type IDREF, and the id attribute of footnote is of type ID. The instance contains a reference between them.

Example 11–5. Using IDREF

Schema:

Click here to view code image

<xs:element name="quote">
  <xs:complexType>
    
    <xs:attribute name="ref" type="xs:IDREF"/>
  </xs:complexType>
</xs:element>
<xs:element name="footnote">
  <xs:complexType>
    
    <xs:attribute name="id" type="xs:ID" use="required"/>
  </xs:complexType>
</xs:element>

Instance:

Click here to view code image

ID and IDREF are best used for referencing unique locations in document-oriented XML. To enforce complex uniqueness of data values, and primary and foreign key references, consider using identity constraints, which are described in Chapter 17.

The facets indicated in Table 11–32 can restrict IDREF and its derived types.

Table 11–32. Facets applicable to IDREF type

11.5.3. IDREFS

The type IDREFS represents a list of IDREF values separated by whitespace. There must be at least one IDREF in the list.

Each of the values in an attribute of type IDREFS must reference an ID in the same XML document. In Example 11–6, the refs attribute of quote is of type IDREFS, and the id attribute of footnote is of type ID. The instance contains a reference from the quote element to two footnote elements, with their IDs (fn1 and fn2) separated by whitespace.

Example 11–6. Using IDREFS

Schema:

Click here to view code image

<xs:element name="quote">
  <xs:complexType>
    
    <xs:attribute name="refs" type="xs:IDREFS"/>
  </xs:complexType>
</xs:element>
<xs:element name="footnote">
  <xs:complexType>
    
    <xs:attribute name="id" type="xs:ID" use="required"/>
  </xs:complexType>
</xs:element>

Instance:

Click here to view code image

The facets indicated in Table 11–33 can restrict IDREFS and its derived types.

Table 11–33. Facets applicable to IDREFS type

Since IDREFS is a list type, restricting an IDREFS value with these facets may not behave as you expect. The facets length, minLength, and maxLength apply to the number of items in the IDREFS list, not the length of each item. The enumeration facet applies to the whole list, not the individual items in the list. For more information, see Section 10.3.3 on p. 190.

11.5.4. ENTITY

The type ENTITY represents a reference to an unparsed entity. The ENTITY type is most often used to include information from another location that is not in XML format, such as graphics. An ENTITY value must be an NCName, as described in Section 11.2.3 on p. 210. An ENTITY value carries the additional constraint that it must match the name of an unparsed entity in a document type definition (DTD) for the instance.

Example 11–7 shows an XML document that links product numbers to pictures of the products. In the schema, the picture element declaration declares an attribute location that has the type ENTITY. In the instance, each value of the location attribute (in this case, prod557 and prod563) matches the name of an entity declared in the internal DTD subset of the instance.

The facets indicated in Table 11–34 can restrict ENTITY and its derived types.

Table 11–34. Facets applicable to ENTITY type

Example 11–7. Using an unparsed entity

Schema:

Click here to view code image

<xs:element name="picture">
  <xs:complexType>
    <xs:attribute name="location" type="xs:ENTITY"/>
  </xs:complexType>
</xs:element>

Instance:

Click here to view code image

<!DOCTYPE catalog SYSTEM "catalog.dtd" [
<!NOTATION jpeg SYSTEM "JPG">
<!ENTITY prod557 SYSTEM "prod557.jpg" NDATA jpeg>
<!ENTITY prod563 SYSTEM "prod563.jpg" NDATA jpeg>
]>

<catalog>
  <product>
    <number>557</number>
    <picture location="prod557"/>
  </product>
  <product>
    <number>563</number>
    <picture location="prod563"/>
  </product>
</catalog>

11.5.5. ENTITIES

The type ENTITIES represents a list of ENTITY values separated by whitespace. There must be at least one ENTITY in the list. Each of the ENTITY values must match the name of an unparsed entity that has been declared in a document type definition (DTD) for the instance.

Expanding on the example from the previous section, Example 11–8 shows the declaration of an attribute named location that is of type ENTITIES. In the instance, the location attribute can include a list of entity names. Each value (in this case there are two: prod557a and prod557b) matches the name of an entity that is declared in the internal DTD subset for the instance.

Example 11–8. Using ENTITIES

Schema:

Click here to view code image

<xs:element name="pictures">
  <xs:complexType>
    <xs:attribute name="location" type="xs:ENTITIES"/>
  </xs:complexType>
</xs:element>

Instance:

Click here to view code image

<!DOCTYPE catalog SYSTEM "catalog.dtd" [
<!NOTATION jpeg SYSTEM "JPG">
<!ENTITY prod557a SYSTEM "prod557a.jpg" NDATA jpeg>
<!ENTITY prod557b SYSTEM "prod557b.jpg" NDATA jpeg>
]>

<catalog>
  <product>
    <number>557</number>
    <pictures location="prod557a prod557b"/>
  </product>
</catalog>

The facets indicated in Table 11–35 can restrict ENTITIES and its derived types.

Table 11–35. Facets applicable to ENTITIES type

Since ENTITIES is a list type, restricting an ENTITIES value with these facets may not behave as you expect. The facets length, minLength, and maxLength apply to the number of items in the ENTITIES list, not the length of each item. The enumeration facet applies to the whole list, not the individual items in the list. For more information, see Section 10.3.3 on p. 190.

11.5.6. NMTOKEN

The type NMTOKEN represents a single string token. NMTOKEN values may consist of letters, digits, periods (.), hyphens (-), underscores (_), and colons (:). They may start with any of these characters. NMTOKEN has a whiteSpace facet value of collapse, so any leading or trailing whitespace will be removed. However, no whitespace may appear within the value itself. Table 11–36 shows some valid and invalid values of the NMTOKEN type.

Table 11–36. Values of the NMTOKEN type

The facets indicated in Table 11–37 can restrict NMTOKEN and its derived types.

Table 11–37. Facets applicable to NMTOKEN type

11.5.7. NMTOKENS

The type NMTOKENS represents a list of NMTOKEN values separated by whitespace. There must be at least one NMTOKEN in the list. Table 11–38 shows some valid and invalid values of the NMTOKENS type.

Table 11–38. Values of the NMTOKENS type

The facets indicated in Table 11–39 can restrict NMTOKENS and its derived types.

Table 11–39. Facets applicable to NMTOKENS type

Since NMTOKENS is a list type, restricting an NMTOKENS value with these facets may not behave as you expect. The facets length, minLength, and maxLength apply to the number of items in the NMTOKENS list, not the length of each item. The enumeration facet applies to the whole list, not the individual items in the list. For more information, see Section 10.3.3 on p. 190.

11.5.8. NOTATION

The type NOTATION represents a reference to a notation. A notation is a method of interpreting XML and non-XML content. For example, if an element in an XML document contains binary graphics data in JPEG format, a notation can be declared to indicate that this is JPEG data. An attribute of type NOTATION can then be used to indicate which notation applies to the element’s content. A NOTATION value must be a QName as described in Section 11.6.1 on p. 246.

NOTATION is the only built-in type that cannot be the type of attributes or elements. Instead, you must define a new type that restricts NOTATION, applying one or more enumeration facets. Each of these enumeration values must match the name of a declared notation. For more information on declaring notations and NOTATION-based types, see Section 19.7 on p. 493.

The facets indicated in Table 11–40 can restrict NOTATION and its derived types.

Table 11–40. Facets applicable to NOTATION type

11.6. Other types

11.6.1. QName

The type QName represents an XML namespace-qualified name that consists of a namespace name and a local part.

When appearing in XML documents, the lexical representation of a QName consists of a prefix and a local part, separated by a colon, both of which are NCNames. The prefix and colon are optional.

The lexical structure is mapped onto the QName value in the context of namespace declarations, as described in Chapter 3. If the QName value is prefixed, the namespace name is that which is in scope for that prefix. If it is not prefixed, the default namespace declaration in scope (if any) becomes the QName’s namespace.

QName is not based on string like the other name-related types, because it has this special two-part value with additional constraints that cannot be expressed with XML Schema facets. Table 11–41 shows some valid and invalid values of the QName type.

Table 11–41. Values of the QName type

The facets indicated in Table 11–42 can restrict QName and its derived types.

Table 11–42. Facets applicable to QName type

11.6.2. boolean

The type boolean represents logical yes/no values. The valid values for boolean are true, false, 0, and 1. Values that are capitalized (e.g., TRUE) or abbreviated (e.g., T) are not valid. Table 11–43 shows some valid and invalid values of the boolean type.

Table 11–43. Values of the boolean type

The facets indicated in Table 11–44 can restrict boolean and its derived types.

Table 11–44. Facets applicable to boolean type

11.6.3. The binary types

The types hexBinary and base64Binary represent binary data. Their lexical representation is a sequence of binary octets.

The type hexBinary uses hexadecimal encoding, where each binary octet is a two-character hexadecimal number. Lowercase and uppercase letters A through F are permitted. For example, 0FB8 and 0fb8 are two equal hexBinary representations consisting of two octets. The canonical representation of hexBinary uses only uppercase letters.

The type base64Binary, typically used for embedding images and other binary content, uses base64 encoding, as described in RFC 3548. The following rules apply to base64Binary values:

• The following characters are allowed: the letters A to Z (upper and lower case), digits 0 through 9, the plus sign (+), the slash (/), the equals sign (=), and XML whitespace characters.

• XML whitespace characters may appear anywhere in the value.

• The number of nonwhitespace characters must be divisible by 4.

• Equals signs may only appear at the end of the value, and there may be zero, one, or two of them. If there are two equals signs, they must be preceded by one of the following characters: AQgw. If there is only one equals sign, it must be preceded by one of the following characters: AEIMQUYcgkosw048. In either case, there may be whitespace between the necessary characters and the equals sign(s).

The canonical representation of base64Binary removes all whitespace characters. For more information on base64 encoding, see RFC 3548, The Base16, Base32, and Base64 Data Encodings.

Table 11–45 shows some valid and invalid values of the binary types.

The facets indicated in Table 11–46 can restrict hexBinary, base64Binary, and their derived types.

Table 11–45. Values of the binary types

The length facet for the binary types represents the number of binary octets (groups of 8 bits each). For example, the length of the hexBinary value 0FB8 is equal to 2. Since base64 characters represent 6 bits each, the length of the base64Binary value 0FB8 is equal to 3. Whitespace and equals signs are ignored when calculating the length of a base64Binary value.

Table 11–46. Facets applicable to binary types

11.6.4. anyURI

The type anyURI represents a Uniform Resource Identifier (URI) reference. URIs are used to identify resources, and they may be absolute or relative. Absolute URIs provide the entire context for locating a resource, such as http://datypic.com/prod.html. Relative URIs are specified as the difference from a base URI, for example ../prod.html. It is also possible to specify a fragment identifier using the # character, for example ../prod.html#shirt.

The three previous examples happen to be HTTP URLs (Uniform Resource Locators), but URIs also encompass URLs of other schemes (e.g., FTP, gopher, telnet), as well as URNs (Uniform Resource Names). URIs are not required to be dereferenceable; that is, it is not necessary for there to be a web page at http://datypic.com/prod.html in order for this to be a valid URI.

URIs require that some characters be escaped with their hexadecimal Unicode code point preceded by the % character. This includes non-ASCII characters and some ASCII characters including control characters, space, and certain punctuation characters. For example, ../édition.html must be represented instead as ../%C3%A9dition.html, with the é escaped as %C3%A9. However, the anyURI type will accept these characters either escaped or unescaped. With the exception of the characters % and #, it will assume that unescaped characters are intended to be escaped when used in an actual URI, although the schema processor will do nothing to alter them. It is valid for an anyURI value to contain a space, but this practice is strongly discouraged. Spaces should instead be escaped using %20. For more information on URIs, see RFC 2396, Uniform Resource Identifiers (URI): Generic Syntax.

Version 1.1 expands the definition of anyURI to include IRfIs, or Internationalized Resource Identifiers. Compared to URIs, IRIs allow a much broader range of characters without requiring them to be escaped. Since the anyURI type does not require escaping anyway, this has little practical impact on your schemas. For more information about IRIs, see RFC 3987, Internationalized Resource Identifiers (IRIs).

Note that when relative URI references such as ../prod are used as values of anyURI, no attempt is made by the schema processor to determine or keep track of the base URI to which they may be applied. For example, it will not attempt to resolve the value relative to the URL of the containing document, or any xml:base attributes that may appear in it.

Table 11–47 shows some examples of valid and invalid anyURI values. The schema processor is not required to parse the contents of an anyURI value to determine whether it is valid according to any particular URI scheme. Since the bare minimum rules for valid URI references are fairly generic, the schema processor will accept most character strings, including an empty value. The only values that are not accepted are ones that make inappropriate use of reserved characters, such as those containing multiple # characters or % characters not followed by two hexadecimal digits.

Table 11–47. Valid values of anyURI type

The facets indicated in Table 11–48 can restrict anyURI and its derived types.

Table 11–48. Facets applicable to anyURI type

11.7. Comparing typed values

When a schema processor is comparing two values, it does more than compare lexical values as if they were strings. It takes into account the types of the values. This comes into play during validation of an instance in several places:

• Validating fixed values

• Validating enumerated values

• Validating values against bounds facets

• Determining uniqueness of identity constraint fields

• Validating key references

• Comparisons in assertions

This is also important to consider when using schema-aware languages, such as XSLT 2.0 and XQuery, which allow a processor to use type information from the schema when comparing values.

One of the factors used in determining the equality of two values is the relationship of their types in the derivation hierarchy. Types that are related to each other by restriction, list, or union can have values that are equal. For example, the value 2 of type integer and the value 2 of type positiveInteger are considered equal, since positiveInteger is derived from integer. Types that are not related in the hierarchy can never have values that are equal. This means that an integer value will never equal a string value, even if they are both 2. This is true of both the built-in and user-derived types. Example 11–9 illustrates this point.¹

Example 11–9. Equality based on type definition hierarchy

Click here to view code image

<integer>2</integer> does not equal <string>2</string>
<integer>2</integer> equals <positiveInteger>2</positiveInteger>
<string>abc</string> equals <NCName>abc</NCName>
<string>abc</string> does not equal <QName>abc</QName>
<IDREFS>abc</IDREFS> equals <IDREF>abc</IDREF>

Some of the built-in types have multiple lexical representations that are equivalent. For example, an integer may be represented as 2, 02, +2, or +00002. These values are all considered equal if they have the type integer, because they all represent the same canonical value. However, these same lexical values are unequal if they have the type string. Example 11–10 illustrates this point.

Another factor to take into account is whitespace normalization. Whitespace is normalized before any validation takes place. Therefore, it plays a role in determining whether two values are equal. For example, the string type has a whiteSpace facet value of preserve, while the token type’s is collapse. The value “ a ” that has the type string will not equal “ a ” that has the type token, because the leading and trailing spaces will be stripped for the token value. Example 11–11 illustrates this point.

Example 11–10. Equality based on equivalent lexical representations

Click here to view code image

<integer>2</integer> equals <integer>02</integer>
<integer>2</integer> equals <positiveInteger>02</positiveInteger>
<string>2</string> does not equal <string>02</string>
<boolean>true</boolean> equals <boolean>1</boolean>
<hexBinary>0fb8</hexBinary> equals <hexBinary>0FB8</hexBinary>
<time>13:20:00-05:00</time> equals <time>12:20:00-06:00</time>

Example 11–11. Equality based on whitespace normalization

Click here to view code image

<string> a </string> does not equal <token> a </token>
<string>a</string> equals <token> a </token>
<token>a</token> equals <token> a </token>