Full-Text Search Module

SQLite includes a Full-Text Search (FTS) engine. The current version is known as FTS3. The FTS3 engine is designed to catalog and index large bodies of text. Once indexed, the FTS3 engine can quickly search for documents based off various types of keyword searches. Although the FTS3 source is now maintained by the SQLite team, parts of the engine were originally contributed by members of Google’s engineering team.

The FTS3 engine is a virtual table module. Virtual tables are similar to views, in that they wrap a data source to make it look and act like a normal table. Views get their data from a SELECT statement, while virtual tables depend on user-defined C functions. All of the functions required to implement a virtual table are wrapped up in an extension known as a module. For more information on how SQLite modules and virtual tables work, see Chapter 10.

Full-Text Search is an important and evolving technology, and is one of the areas that is targeted for improvements and enhancements as this book goes to press. Although this section gives a brief overview of the core FTS3 features, if you find yourself considering the FTS3 module, I would encourage you to review the full documentation on the SQLite website.

The FTS3 engine is included in all standard distributions of the SQLite source (including the amalgamation), but is turned off by default. To enable basic FTS functionality, define the SQLITE_ENABLE_FTS3 compiler directive when building the SQLite library. To enable the more advanced matching syntax, also define SQLITE_ENABLE_FTS3_PARENTHESIS.

Once SQLite has been compiled with the FTS3 engine enabled, you can create a document table with an SQL statement similar to this:

CREATE VIRTUAL TABLE table_name USING FTS3 ( col1,... );

In addition to providing a table name, you can define zero or more column names. The name of the column is the only information that will actually be used. Any type information or column constraints will be ignored. If no column names are given, FTS will automatically create one column with the name content.

FTS tables are often used to hold whole documents, in which case they only need one column. Other times, they are used to hold different categories of related information, and require multiple columns. For example, if you wanted to store email messages in an FTS table, it might make sense to create separate columns for the "SUBJECT:" line, "FROM:" line, "TO:" line, and message body. This would allow you to limit searches to a specific column (and the data it contains). The column specification for an FTS table is largely determined by how you want to search for the data. FTS also provides an optimized way to look for a search term across all of the indexed columns.

You can use the standard INSERT, UPDATE, and DELETE statements to manipulate data within an FTS table. Like traditional tables, FTS tables have a ROWID column that contains a unique integer for each entry in the table. This column can also be referred to using the alias DOCID. Unlike traditional tables, the ROWID of an FTS table is stable through a vacuum (VACUUM in Appendix C), so it can be reliably referenced through a foreign key. Additionally, FTS tables have an internal column with the same name as the table name. This column is used for special operations. You cannot insert or update data in this column.

Any virtual table, including FTS tables, can be deleted with the standard DROP TABLE command.

FTS tables are designed so that any SELECT statement will work correctly. You can even search for specific text values or patterns directly with the == or LIKE operators. These will work, although they’ll be somewhat slow, since standard operators will require a full table scan.

The real power of the FTS system comes from a custom MATCH operator. This operator is able to take advantage of the indexes built around the individual text values, allowing very fast searches over large bodies of text. Generally, searches are done using a query similar to:

SELECT * FROM fts_table WHERE fts_column MATCH search_term;

The search term used by the MATCH operator has very similar semantics to those used in a web search engine. Search terms are broken down and matched against words and terms found in the text values of the FTS table. Generally, the FTS MATCH operator is case-insensitive and will only match against whole words. For example, the search term 'data' will match 'research data', but not 'database'. The order of the search terms does not matter. The terms 'cat dog' and 'dog cat' will match the same set of rows.

By default, MATCH will only match records that contain every word or term found in the search term. If the extended syntax is enabled, more complex logic statements can also be used to define more complex search patterns.

Normally, the terms will only be matched against the specified column. However, every FTS table has a special hidden column that has the same name as the table itself. If this column is specified in the match expression, then all of the columns will be searched. This makes it easy to construct “find all” type searches.

The FTS module is fairly advanced, and offers a large number of search options and index optimizations. If you plan on using the FTS engine in your application, I strongly suggest you spend some time reading the online documentation (http://www.sqlite.org/fts3.html). The official documentation is quite extensive and covers the more advanced search features in some detail, complete with examples.

In addition to explaining the different search patterns, the online documentation also covers index optimization and maintenance. While this level of detail isn’t always required, it can be beneficial for applications that heavily depend on FTS. The documents also explain how to provide a custom tokenizer to adapt the FTS engine to specific applications.

For those that want to dig even deeper, later sections of the document explain some of the internal workings of the FTS index. Information is provided on the shadow tables used by the FTS engine, as well as the process used to generate the token index. This level of knowledge isn’t required to use the FTS system, but it is extremely useful if you are looking to modify the code, or if you’re just curious about what is going on under the hood.