SQLite includes a Full-Text Search (FTS) engine. The current version is known as FTS3. The FTS3 engine is designed to catalog and index large bodies of text. Once indexed, the FTS3 engine can quickly search for documents based off various types of keyword searches. Although the FTS3 source is now maintained by the SQLite team, parts of the engine were originally contributed by members of Google’s engineering team.
The FTS3 engine is a virtual table module. Virtual
tables are similar to views, in that they wrap a data source to make it look
and act like a normal table. Views get their data from a SELECT
statement, while virtual tables
depend on user-defined C functions. All of the functions required to
implement a virtual table are wrapped up in an extension known as a module.
For more information on how SQLite modules and virtual tables work, see
Chapter 10.
Full-Text Search is an important and evolving technology, and is one of the areas that is targeted for improvements and enhancements as this book goes to press. Although this section gives a brief overview of the core FTS3 features, if you find yourself considering the FTS3 module, I would encourage you to review the full documentation on the SQLite website.
The FTS3 engine is included in all standard
distributions of the SQLite source (including the amalgamation), but is
turned off by default. To enable basic FTS functionality, define
the
SQLITE_ENABLE_FTS3
compiler directive
when building the SQLite library. To enable the
more advanced matching syntax, also define SQLITE_ENABLE_
FTS3_
PARENTHESIS
.
Once SQLite has been compiled with the FTS3 engine enabled, you can create a document table with an SQL statement similar to this:
CREATE VIRTUAL TABLEtable_name
USING FTS3 (col1
,... );
In addition to providing a table name, you can
define zero or more column names. The name of the column is the only
information that will actually be used. Any type information or column
constraints will be ignored. If no column names are given, FTS will
automatically create one column with the name content
.
FTS tables are often used to hold whole documents,
in which case they only need one column. Other times, they are used to
hold different categories of related information, and require multiple
columns. For example, if you wanted to store email messages in an FTS
table, it might make sense to create separate columns for the "SUBJECT:"
line, "FROM:"
line, "TO:"
line, and message body. This would allow you to
limit searches to a specific column (and the data it contains). The
column specification for an FTS table is largely determined by how you
want to search for the data. FTS also provides an optimized way to
look for a search term across all of the indexed columns.
You can use the standard INSERT
, UPDATE
,
and DELETE
statements to manipulate
data within an FTS table. Like traditional tables, FTS tables have a
ROWID
column that contains a
unique integer for each entry in the table. This column can also be
referred to using the alias DOCID
.
Unlike traditional tables, the ROWID
of an FTS table is stable through a vacuum
(VACUUM in Appendix C), so it can be reliably referenced through a foreign key.
Additionally, FTS tables have an internal column with the same name as
the table name. This column is used for special operations. You cannot
insert or update data in this column.
Any virtual table, including FTS tables, can be
deleted with the standard
DROP TABLE
command.
FTS tables are designed so that any SELECT
statement will work correctly. You can even
search for specific text values or patterns directly with the ==
or LIKE
operators. These will work, although they’ll be
somewhat slow, since standard operators will require a full table
scan.
The real power of the FTS system comes from a
custom
MATCH
operator. This operator is
able to take advantage of the indexes built around the individual text
values, allowing very fast searches over large bodies of text.
Generally, searches are done using a query similar to:
SELECT * FROMfts_table
WHEREfts_column
MATCHsearch_term
;
The search term used by the MATCH
operator has very similar
semantics to those used in a web search engine. Search terms are
broken down and matched against words and terms found in the text
values of the FTS table. Generally, the FTS MATCH
operator is case-insensitive and will only
match against whole words. For example, the search term 'data'
will match 'research data'
, but not 'database'
. The order of the search
terms does not matter. The terms 'cat
dog'
and 'dog cat'
will match the same set of rows.
By default, MATCH
will only match records that contain every word
or term found in the search term. If the extended syntax is enabled,
more complex logic statements can also be used to define more complex
search patterns.
Normally, the terms will only be matched against the specified column. However, every FTS table has a special hidden column that has the same name as the table itself. If this column is specified in the match expression, then all of the columns will be searched. This makes it easy to construct “find all” type searches.
The FTS module is fairly advanced, and offers a large number of search options and index optimizations. If you plan on using the FTS engine in your application, I strongly suggest you spend some time reading the online documentation (http://www.sqlite.org/fts3.html). The official documentation is quite extensive and covers the more advanced search features in some detail, complete with examples.
In addition to explaining the different search patterns, the online documentation also covers index optimization and maintenance. While this level of detail isn’t always required, it can be beneficial for applications that heavily depend on FTS. The documents also explain how to provide a custom tokenizer to adapt the FTS engine to specific applications.
For those that want to dig even deeper, later sections of the document explain some of the internal workings of the FTS index. Information is provided on the shadow tables used by the FTS engine, as well as the process used to generate the token index. This level of knowledge isn’t required to use the FTS system, but it is extremely useful if you are looking to modify the code, or if you’re just curious about what is going on under the hood.