ICU Internationalization Extension

SQLite provides full support for Unicode text values. Unicode defines a way to encode written characters, glyphs, and accents from a multitude of languages and writing systems as a string of bytes. What Unicode does not provide is any information or understanding of the sorting rules, capitalization rules, or equivalence rules and customs of a given language or location.

This is a problem for pattern matching, sorting, or anything that depends on comparing text values. For example, most text-sorting systems will ignore case differences between words. Some languages will also ignore certain accent marks, but often those rules depend on the specific accent mark and character. Occasionally, the rules and conventions used within a language change from location to location. By default, the only characters whose case SQLite understands are those of 7-bit ASCII. Any character with a code of 128 or above is treated as a binary value, with no awareness of capitalization or equivalence conventions. While this is often sufficient for English, it is usually insufficient for other languages.
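The default behavior is easy to observe from any SQLite binding. The following is a minimal sketch using Python's standard sqlite3 module; the function name default_text_behavior and the keys of the dictionary it returns are made up for this illustration. It uses only the built-in BINARY and NOCASE collations, which should behave the same way whether or not the ICU extension is present.

```python
import sqlite3

def default_text_behavior(words):
    """Sort and compare text using only SQLite's built-in collations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (w TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])

    # BINARY (the default collation) compares the encoded bytes of each string.
    binary = [r[0] for r in conn.execute("SELECT w FROM words ORDER BY w")]

    # NOCASE folds only the 26 ASCII letters before comparing.
    nocase = [r[0] for r in conn.execute(
        "SELECT w FROM words ORDER BY w COLLATE NOCASE")]

    def nocase_equal(a, b):
        row = conn.execute("SELECT ? = ? COLLATE NOCASE", (a, b)).fetchone()
        return bool(row[0])

    result = {
        "binary": binary,
        "nocase": nocase,
        "ascii_pair_equal": nocase_equal("A", "a"),
        "accented_pair_equal": nocase_equal("É", "é"),
    }
    conn.close()
    return result

if __name__ == "__main__":
    for key, value in default_text_behavior(
            ["apple", "Zebra", "Éclair", "éclair"]).items():
        print(key, value)
```

With these four words, the binary sort places "Zebra" ahead of "apple" and pushes both accented words to the end. NOCASE repairs the ASCII ordering, but it still considers "É" and "é" to be different values.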

For more complete internationalization support, you’ll need to build SQLite with the ICU extension enabled. The International Components for Unicode (ICU) project provides an open-source library that implements a vast number of language-related functions. These functions are customized for different locales. The SQLite ICU extension allows SQLite to utilize different aspects of the ICU library, allowing locale-aware sorts and comparisons, as well as locale-aware versions of upper() and lower().

To use the ICU extension, you must first download and build the ICU library. The library source code, along with build instructions, can be downloaded from the project website at http://www.icu-project.org/. You must then build SQLite with the ICU extension enabled, and link it against the ICU library. To enable the ICU extension in an amalgamation build, define the SQLITE_ENABLE_ICU compiler directive.
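Before relying on the extension, it can be useful to check whether the SQLite library your program is actually linked against was built with it. The sketch below, again in Python, asks SQLite through the sqlite_compileoption_used() SQL function; the helper name icu_enabled is hypothetical. It only detects ICU support that was compiled in with SQLITE_ENABLE_ICU. It will report False if the extension was instead loaded at runtime as a separate module, or if the build omits compile-option diagnostics.

```python
import sqlite3

def icu_enabled(conn):
    """Return True if the linked SQLite library reports SQLITE_ENABLE_ICU."""
    try:
        row = conn.execute(
            "SELECT sqlite_compileoption_used('ENABLE_ICU')").fetchone()
    except sqlite3.OperationalError:
        # Some builds omit the compile-option diagnostics entirely.
        # This sketch treats that case as "not enabled".
        return False
    return bool(row[0])

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print("ICU compiled in:", icu_enabled(conn))
    conn.close()
```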

You’ll want to take a look at the original README document. It explains how to utilize the extension to create locale-specific collations and operators. You can find a copy of the README file in the full source distribution (in the ext/icu directory) or online at http://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt.
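As a rough illustration of what the README describes, the sketch below registers a locale-specific collation with the icu_load_collation() SQL function and then sorts with it. The function name sort_with_icu, the table name, the locale tr_TR, and the collation name turkish are arbitrary choices made for this example. If the SQLite library in use was built without the ICU extension, the registration call fails and the function returns None rather than a sorted list.

```python
import sqlite3

def sort_with_icu(words, locale="tr_TR", name="turkish"):
    """Sort words with an ICU collation, or return None if ICU is unavailable."""
    conn = sqlite3.connect(":memory:")
    try:
        # icu_load_collation(locale, collation-name) registers a new collation
        # on this connection only. It exists only in ICU-enabled builds.
        conn.execute("SELECT icu_load_collation(?, ?)", (locale, name)).fetchone()
    except sqlite3.OperationalError:
        conn.close()
        return None

    conn.execute("CREATE TABLE t (w TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [(w,) for w in words])
    # The collation name is interpolated as an identifier, so it must come
    # from trusted code, never from user input.
    rows = [r[0] for r in conn.execute(
        'SELECT w FROM t ORDER BY w COLLATE "%s"' % name)]
    conn.close()
    return rows

if __name__ == "__main__":
    print(sort_with_icu(["b", "a", "c"]))
```

Because the collation is registered per connection, an application would need to repeat the icu_load_collation() call each time it opens the database.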

The main disadvantage of the ICU library is size. In addition to the library itself, the locale information for all of the languages and locations adds up to a considerable bulk. This extra data may not be significant for a desktop system, but it may prove impractical on a handheld or embedded device.

Although the ICU extension can provide locale-aware sorting and comparison capabilities, you still need to pick a specific locale to define those sorting and comparison rules. This is simple enough if you’re only working with one language in one location, but it can be quite complex when languages are mixed. If you must deal with cross-locale sorts or other complex internationalization issues, it may be easier to pull that logic up into your application’s code.
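One way to pull comparison logic into the application is to register a collation implemented in the host language. The following Python sketch does this with the sqlite3 module's create_collation() method. The collation name app_fold and the helper functions are invented for this example, and the folding rule (strip combining accent marks, then fold case) is a deliberately crude stand-in for real locale rules, not a substitute for them.

```python
import sqlite3
import unicodedata

def fold(text):
    """Crude, locale-independent key: strip combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def accent_insensitive(a, b):
    """Collation callback: negative, zero, or positive, like strcmp()."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)

def sorted_words(words):
    """Sort words inside SQLite using the application-defined collation."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("app_fold", accent_insensitive)
    conn.execute("CREATE TABLE t (w TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [(w,) for w in words])
    rows = [r[0] for r in conn.execute(
        "SELECT w FROM t ORDER BY w COLLATE app_fold")]
    conn.close()
    return rows

if __name__ == "__main__":
    print(sorted_words(["Zebra", "éclair", "apple"]))
```

The trade-off is that such a collation exists only on connections where the application has registered it. Any index built with it cannot be used correctly by another program that opens the same database without the same collation.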