Indexes (or indices) are auxiliary data structures used by the database system to enforce unique constraints, speed up sort operations, and provide faster access to specific records. They are often created in hopes of making queries run faster by avoiding table scans and providing a more direct lookup method.
Understanding where and when to create indexes can be an important factor in achieving good performance, especially as datasets grow. Without proper indexes in place, the database system has no option but to do full table scans for every lookup. Table scans can be especially expensive when joining tables together.
Indexes are not without cost, however. Each index must
maintain a one-to-one correspondence between index entries and table rows.
If a new row is inserted, updated, or deleted from a table, that same
modification must be made to all associated indexes. Each new index will add
additional overhead to the INSERT
,
UPDATE
, and DELETE
commands. Indexes also consume space,
both on disk as well as in the SQLite page cache. A proper, well-placed
index is worth the cost, but it isn’t wise to just create random indexes, in
hopes that one of them will prove useful. A poorly placed index still has
all the costs, and can actually slow queries down. You can have too much of
a good thing.
Each index is associated with a specific table. Indexes can be either single column or multicolumn, but all the columns of a given index must belong to the same table. There is no limit to the number of indexes a table can have, nor is there a limit on the number of indexes a column can belong to. You cannot create an index on a view or on a virtual table.
Internally, the rows of a normal table are stored
in an indexed structure. SQLite uses a B-Tree for this purpose, which
is a specific type of multi-child, balanced tree. The details are
unimportant, other than understanding that as rows are inserted into
the tree, the rows are sorted, organized, and optimized, so that a row
with a specific, known ROWID
can be
retrieved relatively directly and quickly.
When you create an index, the database system
creates another tree structure to hold the index data. Rather than
using the ROWID
column as the sort
key, the tree is sorted and organized using the column or columns
you’ve specified in the index definition. The index entry consists of
a copy of the values from each of the indexed columns, as well as a
copy of the corresponding ROWID
value. This allows an indexed entry to be found very quickly using the
values from the indexed columns.
CREATE TABLE tbl ( a, b, c, d );
And then create an index that looks like this:
CREATE INDEX idx_tbl_a_b ON tbl ( a, b );
The database generates an internal data structure that is conceptually similar to:
SELECT a, b, ROWID FROM tbl ORDER BY a, b;
If SQLite needs to quickly find all the rows
where, for example, a = 45
, it can
use the sorted index to quickly jump to that range of values and
extract the relevant index entries. If it’s looking for the value of
b
, it can simply extract that
from the index and be done. If we need any other value in the row, it
needs to fetch the full row. This is done by looking up the ROWID
. The last value of any index
entry is the ROWID
of its
corresponding table row. Once SQLite has a list of ROWID
values for all rows where
a = 45
, it can efficiently
look up those rows in the original table and retrieve the full row. If
everything works correctly, the process of looking up a small set of
index entries, and then using those to look up a small set of table
rows, will be much faster and more efficient than doing a full table
scan.
To have a positive impact on query performance, indexes must be diverse. The cost of fetching a single row through an index is significantly higher than fetching a single row through a table scan. To reduce the overall cost, an index must overcome this overhead by eliminating the vast majority of row fetches. By significantly reducing the number of row lookups, the total query cost can be reduced, even if the individual row lookups are more expensive.
The break-even point for index performance is somewhere in the 10% to 20% range. Any query that fetches more rows from a table will do better with a table scan, while any query that fetches fewer rows will see an improvement by using an index. If an index cannot isolate rows to this level, there is no reason for it to be there. In fact, if the query optimizer mistakenly uses an index that returns a large percentage of rows, the index can reduce performance and slow things down.
This creates two concerns. First, if you’re trying to improve the performance of a query that fetches a moderate percentage of rows, adding an index is unlikely to help. An index can only improve performance if it is used to target a focused number of rows.
Second, even if the query is asking for a specific value, this can still lead to a higher percentage of fetched rows if the indexed columns are not reasonably unique. For example, if a column only has four unique values, a successful query will never fetch fewer than approximately 25% of the rows (assuming a reasonable distribution of values or queries). Adding an index to this type of column will not improve performance. Creating an index on a true/false column would be even worse.
When the query optimizer is trying to figure out
if it should use an index or not, it typically has very little
information to go on. For the most part, it assumes that if an index
exists, it must be a good index, and it will tend to use it. The
ANALYZE
command can help
mitigate this by providing statistical information to the query
optimizer (see ANALYZE for more details), but
keeping that updated and well maintained can be difficult in many
environments.
To avoid problems, you should avoid creating indexes on columns that are not reasonably unique. Indexing such columns rarely provides any benefit, and it can confuse the query optimizer.
When you declare one or more columns to be a PRIMARY
KEY
, the database system automatically creates a
unique index over those columns. The fundamental purpose of this index
is to enforce the UNIQUE
constraint
that is implied with every PRIMARY
KEY
. It also happens that many database operations
typically involve the primary key, such as natural joins, or
conditional lookups in UPDATE
or
DELETE
commands. Even if the
index wasn’t required to enforce the UNIQUE
constraint, chances are good you would want an
index over those columns anyway.
There are further advantages to using an INTEGER PRIMARY KEY
. As we’ve
discussed, when an INTEGER PRIMARY
KEY
is declared, that column replaces the automatic
ROWID
column, and becomes the
root column for the tree. In essence, the column becomes the index
used to store the table itself, eliminating the need for a second,
external index.
INTEGER PRIMARY
KEY
columns also provide better performance than
standard indexes. INTEGER PRIMARY
KEY
s have direct access to the full set of row data. A
normal index simply references the ROWID
value, which is then looked up in the table
root. INTEGER PRIMARY KEY
s can
provide indexed lookups, but skip this second lookup step, and
directly access the table data.
If you’re using generic row ID values, it is worth
the effort to define them as INTEGER PRIMARY
KEY
columns. Not only will this reduce the size of the
database, it can make your queries faster and more efficient.
When creating a multicolumn index for performance purposes, the column order is very important. If an index has only one column, that column is used as the sort key. If additional columns are specified in the index definition, the additional columns will be used as secondary, tertiary, etc., sort keys. All of the data is sorted by the first column, then any groups of duplicate values are further sorted by the second column, and so on. This is very similar to a typical phonebook. Most phone books sort by last name. Any group of common last names is then sorted by first name, and then by middle name or initial.
In order to utilize a multicolumn index, a query must contain conditions that are able to utilize the sort keys in the same order they appear in the index definition. If the query does not contain a condition that keys off the first column, the index cannot be used.
Consider looking up a name in the phonebook. You can use a phonebook to quickly find the phone number for “Jennifer T. Smith.” First, you look up the last name “Smith.” Then you refine the search, looking for the first name, “Jennifer,” and finally the middle initial “T” (if required). This sequence should allow you to focus in on a specific entry very quickly.
A phonebook isn’t much use to quickly find all the people with the first name “Jennifer.” Even though the first name is part of the phonebook index, it isn’t the first column of the index. If our query cannot provide a specific condition for the first column of a multicolumn index, the index cannot be used, and our only option is to do a full scan. At that point it is usually more efficient to do a full scan on the table itself, rather than the index.
It is not necessary to utilize every column, only that you utilize the columns in order. If you’re looking up a very unique name in the phone book, you may not need to search off the first name or middle initial. Or, perhaps your query is to find every “Smith.” In that case, the conditions of the query are satisfied with just the first column.
In the end, you need to be very careful when considering the order of a multicolumn index. The additional columns should be viewed as a refinement on the first column, not a replacement. A single multicolumn index is very different from multiple single-column indexes. If you have different queries that use different columns as the main lookup condition, you may be better off with multiple small indexes, rather than one large multicolumn index.
Multicolumn indexes are very useful in some situations, but there are limitations on when they can be effectively used. In some cases, it is more appropriate to create multiple single-column indexes on the same table.
This too has limitations. You cannot create a single-column index on every column of a table and expect the query system to mix and match whatever indexes it needs. Once a subset of rows is extracted from a table (via index or full table scan), that set of rows conceptually becomes an independent temporary table that is no longer part of the original table. At that point, any index associated with the source table becomes unavailable.
In general, the query optimizer can only utilize
one index per table-instance per query. Multiple indexes on a single
table only make sense if you have a different queries that extract
rows using different columns (or have multiple UNIQUE
constraints). Different queries
that key off different columns can pick the index that is most
appropriate, but each individual query will only use one index per
table-instance. If you want a single query to utilize multiple indexed
columns, you need a multicolumn index.
The major exception is a series of OR
conditions. If a WHERE
clause has a chain of OR
conditions on one or more columns of the same
table, then there are cases when SQLite may go ahead and use multiple
indexes for the same table in a single query.
For all their power, indexes are often a source of great frustration. The right index will result in a huge performance boost, while the wrong index can significantly slow things down. All indexes add cost to table modifications and add bulk to the database size. A well-placed index is often worth the overhead costs, but it can be difficult to understand what makes a well-placed index.
To complicate matters, you don’t get to tell the query optimizer when or how to use your indexes—the query planner makes its own decisions. Having everything be automatic means that queries will work no matter what, but it can also make it difficult to determine if an index is being used. It can be even more difficult to figure out why an index is being ignored.
This essentially leaves the database designer in the position of second-guessing the query optimizer. Changes and optimizations must be searched for on something of a trial-and-error basis. To make things worse, as your application changes and evolves, changes in the query structure or patterns can cause a change in index use. All in all, it can be a difficult situation to manage.
Traditionally, this is where the role of the DBA, or Database Administrator, comes in. This person is responsible for the care and feeding of a large database server. They do things like monitor query efficiency, adjust tuning parameters and indexes, and make sure all the statistical data is kept up to date. Unfortunately, that whole idea is somewhat contrary to what SQLite is all about.
So how do you maintain performance? For starters, have a solid design. A well-crafted and normalized database is going to go a long way toward defining your access patterns. Taking the time to correctly identify and define your primary keys will allow the database system to create appropriate indexes and give the query optimizer (and future developers) some insight into your design and intentions.
Also remember that (with the exception of UNIQUE
constraints) a database will
operate correctly without any indexes. It might be slow, but all the
answers will be correct. This lets you worry about design first, and
performance later. Once you have a working design, it is always
possible to add indexes after the fact, without needing to alter your
table structure or your query commands.
The queries of a fully normalized database will typically have quite a
few JOIN
operations. Nearly all of
these joins are across primary and foreign keys. Primary keys are
automatically indexed, but in most cases you need to manually create
indexes on your foreign keys. With those in place, your database
should already have the most critical indexes defined.
Start from here. Measure the performance of the
different types of queries your application uses and find the problem
areas. Looking at these queries, try to find columns where an index
might boost performance. Look for any constraints and conditions that
may benefit from an index, especially in join conditions that
reference nonkey columns. Use
EXPLAIN
and EXPLAIN QUERY PLAN
to understand how
SQLite is accessing the data. These commands can also be used to
verify if a query is using an index or not. For more information, see
EXPLAIN in Appendix C.
You can also use the
sqlite3_stmt_status()
function to
get a more measured understanding of statement efficiency. See sqlite3_stmt_status() in Appendix G for more details.
All in all, you shouldn’t get too hung up on
precise index placement. You may need a few well-placed indexes to
deal with larger tables, but the stock PRIMARY KEY
indexes do surprisingly well in most
cases. Further index placement should be considered performance tuning
and shouldn’t be worried about until an actual need is identified.
Never forget that indexes have cost, so you shouldn’t create them
unless you can identify a need.