Indexes

Indexes (or indices) are auxiliary data structures used by the database system to enforce unique constraints, speed up sort operations, and provide faster access to specific records. They are often created in hopes of making queries run faster by avoiding table scans and providing a more direct lookup method.

Understanding where and when to create indexes can be an important factor in achieving good performance, especially as datasets grow. Without proper indexes in place, the database system has no option but to do full table scans for every lookup. Table scans can be especially expensive when joining tables together.

Indexes are not without cost, however. Each index must maintain a one-to-one correspondence between index entries and table rows. If a new row is inserted, updated, or deleted from a table, that same modification must be made to all associated indexes. Each new index will add additional overhead to the INSERT, UPDATE, and DELETE commands. Indexes also consume space, both on disk as well as in the SQLite page cache. A proper, well-placed index is worth the cost, but it isn’t wise to just create random indexes, in hopes that one of them will prove useful. A poorly placed index still has all the costs, and can actually slow queries down. You can have too much of a good thing.

Each index is associated with a specific table. Indexes can be either single column or multicolumn, but all the columns of a given index must belong to the same table. There is no limit to the number of indexes a table can have, nor is there a limit on the number of indexes a column can belong to. You cannot create an index on a view or on a virtual table.

Internally, the rows of a normal table are stored in an indexed structure. SQLite uses a B-Tree for this purpose, which is a specific type of multi-child, balanced tree. The details are unimportant, other than understanding that as rows are inserted into the tree, the rows are sorted, organized, and optimized, so that a row with a specific, known ROWID can be retrieved relatively directly and quickly.

When you create an index, the database system creates another tree structure to hold the index data. Rather than using the ROWID column as the sort key, the tree is sorted and organized using the column or columns you’ve specified in the index definition. The index entry consists of a copy of the values from each of the indexed columns, as well as a copy of the corresponding ROWID value. This allows an indexed entry to be found very quickly using the values from the indexed columns.

If we have a table like this:

CREATE TABLE tbl ( a, b, c, d );

And then create an index that looks like this:

CREATE INDEX idx_tbl_a_b ON tbl ( a, b );

The database generates an internal data structure that is conceptually similar to:

SELECT a, b, ROWID FROM tbl ORDER BY a, b;

If SQLite needs to quickly find all the rows where, for example, a = 45, it can use the sorted index to quickly jump to that range of values and extract the relevant index entries. If it’s looking for the value of b, it can simply extract that from the index and be done. If we need any other value in the row, it needs to fetch the full row. This is done by looking up the ROWID. The last value of any index entry is the ROWID of its corresponding table row. Once SQLite has a list of ROWID values for all rows where a = 45, it can efficiently look up those rows in the original table and retrieve the full row. If everything works correctly, the process of looking up a small set of index entries, and then using those to look up a small set of table rows, will be much faster and more efficient than doing a full table scan.

To have a positive impact on query performance, indexes must be diverse. The cost of fetching a single row through an index is significantly higher than fetching a single row through a table scan. To reduce the overall cost, an index must overcome this overhead by eliminating the vast majority of row fetches. By significantly reducing the number of row lookups, the total query cost can be reduced, even if the individual row lookups are more expensive.

The break-even point for index performance is somewhere in the 10% to 20% range. Any query that fetches more rows from a table will do better with a table scan, while any query that fetches fewer rows will see an improvement by using an index. If an index cannot isolate rows to this level, there is no reason for it to be there. In fact, if the query optimizer mistakenly uses an index that returns a large percentage of rows, the index can reduce performance and slow things down.

This creates two concerns. First, if you’re trying to improve the performance of a query that fetches a moderate percentage of rows, adding an index is unlikely to help. An index can only improve performance if it is used to target a focused number of rows.

Second, even if the query is asking for a specific value, this can still lead to a higher percentage of fetched rows if the indexed columns are not reasonably unique. For example, if a column only has four unique values, a successful query will never fetch fewer than approximately 25% of the rows (assuming a reasonable distribution of values or queries). Adding an index to this type of column will not improve performance. Creating an index on a true/false column would be even worse.

When the query optimizer is trying to figure out if it should use an index or not, it typically has very little information to go on. For the most part, it assumes that if an index exists, it must be a good index, and it will tend to use it. The ANALYZE command can help mitigate this by providing statistical information to the query optimizer (see ANALYZE for more details), but keeping that updated and well maintained can be difficult in many environments.

To avoid problems, you should avoid creating indexes on columns that are not reasonably unique. Indexing such columns rarely provides any benefit, and it can confuse the query optimizer.

When you declare one or more columns to be a PRIMARY KEY, the database system automatically creates a unique index over those columns. The fundamental purpose of this index is to enforce the UNIQUE constraint that is implied with every PRIMARY KEY. It also happens that many database operations typically involve the primary key, such as natural joins, or conditional lookups in UPDATE or DELETE commands. Even if the index wasn’t required to enforce the UNIQUE constraint, chances are good you would want an index over those columns anyway.

There are further advantages to using an INTEGER PRIMARY KEY. As we’ve discussed, when an INTEGER PRIMARY KEY is declared, that column replaces the automatic ROWID column, and becomes the root column for the tree. In essence, the column becomes the index used to store the table itself, eliminating the need for a second, external index.

INTEGER PRIMARY KEY columns also provide better performance than standard indexes. INTEGER PRIMARY KEYs have direct access to the full set of row data. A normal index simply references the ROWID value, which is then looked up in the table root. INTEGER PRIMARY KEYs can provide indexed lookups, but skip this second lookup step, and directly access the table data.

If you’re using generic row ID values, it is worth the effort to define them as INTEGER PRIMARY KEY columns. Not only will this reduce the size of the database, it can make your queries faster and more efficient.

When creating a multicolumn index for performance purposes, the column order is very important. If an index has only one column, that column is used as the sort key. If additional columns are specified in the index definition, the additional columns will be used as secondary, tertiary, etc., sort keys. All of the data is sorted by the first column, then any groups of duplicate values are further sorted by the second column, and so on. This is very similar to a typical phonebook. Most phone books sort by last name. Any group of common last names is then sorted by first name, and then by middle name or initial.

In order to utilize a multicolumn index, a query must contain conditions that are able to utilize the sort keys in the same order they appear in the index definition. If the query does not contain a condition that keys off the first column, the index cannot be used.

Consider looking up a name in the phonebook. You can use a phonebook to quickly find the phone number for “Jennifer T. Smith.” First, you look up the last name “Smith.” Then you refine the search, looking for the first name, “Jennifer,” and finally the middle initial “T” (if required). This sequence should allow you to focus in on a specific entry very quickly.

A phonebook isn’t much use to quickly find all the people with the first name “Jennifer.” Even though the first name is part of the phonebook index, it isn’t the first column of the index. If our query cannot provide a specific condition for the first column of a multicolumn index, the index cannot be used, and our only option is to do a full scan. At that point it is usually more efficient to do a full scan on the table itself, rather than the index.

It is not necessary to utilize every column, only that you utilize the columns in order. If you’re looking up a very unique name in the phone book, you may not need to search off the first name or middle initial. Or, perhaps your query is to find every “Smith.” In that case, the conditions of the query are satisfied with just the first column.

In the end, you need to be very careful when considering the order of a multicolumn index. The additional columns should be viewed as a refinement on the first column, not a replacement. A single multicolumn index is very different from multiple single-column indexes. If you have different queries that use different columns as the main lookup condition, you may be better off with multiple small indexes, rather than one large multicolumn index.

Multicolumn indexes are very useful in some situations, but there are limitations on when they can be effectively used. In some cases, it is more appropriate to create multiple single-column indexes on the same table.

This too has limitations. You cannot create a single-column index on every column of a table and expect the query system to mix and match whatever indexes it needs. Once a subset of rows is extracted from a table (via index or full table scan), that set of rows conceptually becomes an independent temporary table that is no longer part of the original table. At that point, any index associated with the source table becomes unavailable.

In general, the query optimizer can only utilize one index per table-instance per query. Multiple indexes on a single table only make sense if you have a different queries that extract rows using different columns (or have multiple UNIQUE constraints). Different queries that key off different columns can pick the index that is most appropriate, but each individual query will only use one index per table-instance. If you want a single query to utilize multiple indexed columns, you need a multicolumn index.

The major exception is a series of OR conditions. If a WHERE clause has a chain of OR conditions on one or more columns of the same table, then there are cases when SQLite may go ahead and use multiple indexes for the same table in a single query.

For all their power, indexes are often a source of great frustration. The right index will result in a huge performance boost, while the wrong index can significantly slow things down. All indexes add cost to table modifications and add bulk to the database size. A well-placed index is often worth the overhead costs, but it can be difficult to understand what makes a well-placed index.

To complicate matters, you don’t get to tell the query optimizer when or how to use your indexes—the query planner makes its own decisions. Having everything be automatic means that queries will work no matter what, but it can also make it difficult to determine if an index is being used. It can be even more difficult to figure out why an index is being ignored.

This essentially leaves the database designer in the position of second-guessing the query optimizer. Changes and optimizations must be searched for on something of a trial-and-error basis. To make things worse, as your application changes and evolves, changes in the query structure or patterns can cause a change in index use. All in all, it can be a difficult situation to manage.

Traditionally, this is where the role of the DBA, or Database Administrator, comes in. This person is responsible for the care and feeding of a large database server. They do things like monitor query efficiency, adjust tuning parameters and indexes, and make sure all the statistical data is kept up to date. Unfortunately, that whole idea is somewhat contrary to what SQLite is all about.

So how do you maintain performance? For starters, have a solid design. A well-crafted and normalized database is going to go a long way toward defining your access patterns. Taking the time to correctly identify and define your primary keys will allow the database system to create appropriate indexes and give the query optimizer (and future developers) some insight into your design and intentions.

Also remember that (with the exception of UNIQUE constraints) a database will operate correctly without any indexes. It might be slow, but all the answers will be correct. This lets you worry about design first, and performance later. Once you have a working design, it is always possible to add indexes after the fact, without needing to alter your table structure or your query commands.

The queries of a fully normalized database will typically have quite a few JOIN operations. Nearly all of these joins are across primary and foreign keys. Primary keys are automatically indexed, but in most cases you need to manually create indexes on your foreign keys. With those in place, your database should already have the most critical indexes defined.

Start from here. Measure the performance of the different types of queries your application uses and find the problem areas. Looking at these queries, try to find columns where an index might boost performance. Look for any constraints and conditions that may benefit from an index, especially in join conditions that reference nonkey columns. Use EXPLAIN and EXPLAIN QUERY PLAN to understand how SQLite is accessing the data. These commands can also be used to verify if a query is using an index or not. For more information, see EXPLAIN in Appendix C. You can also use the sqlite3_stmt_status() function to get a more measured understanding of statement efficiency. See sqlite3_stmt_status() in Appendix G for more details.

All in all, you shouldn’t get too hung up on precise index placement. You may need a few well-placed indexes to deal with larger tables, but the stock PRIMARY KEY indexes do surprisingly well in most cases. Further index placement should be considered performance tuning and shouldn’t be worried about until an actual need is identified. Never forget that indexes have cost, so you shouldn’t create them unless you can identify a need.