Hour 22

Using Indexes to Improve Performance

What You’ll Learn in This Hour:

In this hour, you learn how to improve SQL statement performance by creating and using indexes. You begin with the CREATE INDEX command and learn how to use indexes that have been created on tables.

Defining an Index

An index is basically a pointer to data in a table. An index in a database is similar to an index in the back of a book. If you want to reference all the pages in a book that discuss a certain topic, you first refer to the index, which lists all topics alphabetically; the index then refers you to one or more specific page numbers. An index in a database works the same way: It points a query to the exact physical location of data in a table. You are actually directed to the data’s location in an underlying file of the database, but as far as you are concerned, you are referring to a table.

Which works faster, looking through a book page by page for some information or searching the book’s index and getting a page number? Of course, using the book’s index is the most efficient method, especially if the book is large. If you have a smaller book with just a few pages, however, flipping through the chapters for the information might go faster than flipping back and forth between the index and the chapters. When a database does not use an index, it is performing a full table scan, the same as flipping through a book page by page. Hour 23, “Improving Database Performance,” discusses full table scans.

An index is typically stored separately from the table for which the index was created. An index’s main purpose is to improve the performance of data retrieval. Indexes can be created or dropped with no effect on the data. However, after an index is dropped, performance of data retrieval might slow. Indexes take up physical space and can often grow larger than the table. Therefore, you need to consider them when estimating your database storage needs.

Understanding How Indexes Work

When an index is created, it records the location of values in a table that are associated with the column that is indexed. Entries are added to the index when new data is added to the table. When a query is executed against the database and a condition is specified on a column in the WHERE clause that is indexed, the index is first searched for the values specified in the WHERE clause. If the value is found in the index, the index returns the exact location of the searched data in the table. Figure 22.1 illustrates the function of an index.

Illustration of how table is accessed using an index.

FIGURE 22.1
Table access using an index

Suppose the following query was issued:

SELECT *
FROM TABLE_NAME
WHERE NAME = 'SMITH';

As Figure 22.1 shows, the NAME index is referenced to resolve the location of all names equal to SMITH. After the location is determined, the data can quickly be retrieved from the table. The data—names, in this case—is alphabetized in the index.

Note

Variations of Index Creation

Indexes can be created during table creation in certain implementations. Most implementations accommodate a command, aside from the CREATE TABLE command, used to create indexes. Check your particular implementation for the exact syntax for the command, if any, that is available to create an index.

A full table scan occurs if there is no index on the table and the same query is executed. This means, then, that every row of data in the table is read to retrieve information pertaining to all individuals with the name SMITH.

An index works faster because it typically stores information in an orderly, treelike format. Say that you have a list of books upon which you place an index. The index has a root node, which is the beginning point of each query. Then it is split into branches. Maybe in this case the index has two branches, one for letters AL and the other for letters MZ. If you ask for a book with a name that starts with the letter M, you enter the index at the root node and immediately travel to the branch containing letters MZ. This effectively shortens the time needed to find the book by eliminating nearly half of the possibilities.

Using the CREATE INDEX Command

As with many other statements in SQL, the CREATE INDEX statement, varies greatly among relational database vendors. This command is used to create the various types of indexes available for a table. Most relational database implementations use the CREATE INDEX statement:

CREATE INDEX INDEX_NAME ON TABLE_NAME

Some implementations allow the specification of a storage clause (as with the CREATE TABLE statement), ordering (DESC||ASC), and the use of clusters. Check your particular implementation for its correct syntax.

Identifying Types of Indexes

You can create different types of indexes on tables in a database, all of which serve the same goal: to improve database performance by expediting data retrieval. This hour discusses single-column indexes, composite indexes, and unique indexes.

Single-Column Indexes

Indexing on a single column of a table is the simplest and most common manifestation of an index. A single-column index is one that is created based on only one table column. The basic syntax follows:

CREATE INDEX INDEX_NAME_IDX
ON TABLE_NAME (COLUMN_NAME)

As you look at a few examples of creating indexes, you first create a table called BIRDS_TRAVEL that is based on birds and their migration habits. Following is the CREATE statement and a SELECT statement from the new table that returns 56 rows of data. Remember that even though the database contains only 23 birds in the database, many birds migrate to many different locations. This creates a situation of duplicate bird names and duplicate migration location names in a table. In this example, 56 rows is not a lot. However, in a real organization, tables can easily contain millions of rows of data. This is where indexing is necessary to maintain acceptable performance.

SQL> create table birds_travel as
  2  select b.bird_id, b.bird_name, b.wingspan, m.migration_location
  3  from birds b,
  4       migration m,
  5       birds_migration bm
  6  where b.bird_id = bm.bird_id
  7    and m.migration_id = bm.migration_id;

Table created.

SQL> select * from birds_travel;

   BIRD_ID BIRD_NAME              WINGSPAN MIGRATION_LOCATION
---------- -------------------- ---------- ---------------------
         1 Great Blue Heron             78 Central America
         3 Common Loon                  99 Central America
         7 Osprey                       72 Central America
         8 Belted Kingfisher            23 Central America
        10 Pied-billed Grebe           6.5 Central America
        11 American Coot                29 Central America
        12 Common Sea Gull              18 Central America
        14 Double-crested Cormorant     54 Central America
        15 Common Merganser             34 Central America
        16 Turkey Vulture               72 Central America
        18 Green Heron                26.8 Central America
        21 Great Egret                67.2 Central America
         1 Great Blue Heron             78 Mexico
         3 Common Loon                  99 Mexico
         7 Osprey                       72 Mexico
         8 Belted Kingfisher            23 Mexico
        10 Pied-billed Grebe           6.5 Mexico
        11 American Coot                29 Mexico
        12 Common Sea Gull              18 Mexico
        13 Ring-billed Gull             50 Mexico
        14 Double-crested Cormorant     54 Mexico
        15 Common Merganser             34 Mexico
        16 Turkey Vulture               72 Mexico
        18 Green Heron                26.8 Mexico
        21 Great Egret                67.2 Mexico
        22 Anhinga                      42 Mexico
         5 Golden Eagle                 99 No Significant Migration
         9 Canadian Goose               99 No Significant Migration
        17 American Crow              39.6 No Significant Migration
        20 Brown Pelican                99 No Significant Migration
        23 Black Skimmer                15 No Significant Migration
        19 Mute Swan                    99 Partial, Open Water
         1 Great Blue Heron             78 South America
         7 Osprey                       72 South America
         8 Belted Kingfisher            23 South America
        12 Common Sea Gull              18 South America
        16 Turkey Vulture               72 South America
        18 Green Heron                26.8 South America
         1 Great Blue Heron             78 Southern United States
         2 Mallard                      72 Southern United States
         3 Common Loon                  99 Southern United States
         4 Bald Eagle                   99 Southern United States
         6 Red Tailed Hawk              48 Southern United States
         7 Osprey                       72 Southern United States
         8 Belted Kingfisher            23 Southern United States
         9 Canadian Goose               99 Southern United States
        10 Pied-billed Grebe           6.5 Southern United States
        11 American Coot                29 Southern United States
        12 Common Sea Gull              18 Southern United States
        13 Ring-billed Gull             50 Southern United States
        14 Double-crested Cormorant     54 Southern United States
        15 Common Merganser             34 Southern United States
        16 Turkey Vulture               72 Southern United States
        18 Green Heron                26.8 Southern United States
        21 Great Egret                67.2 Southern United States
        22 Anhinga                      42 Southern United States

56 rows selected.

The following focuses on creating a simple index on the BIRDS_TRAVEL table on the BIRD_NAME column. This column is being indexed because it would be typically used in a query to search for specific data within this table.

SQL> create index birds_travel_idx1
  2  on birds_travel (bird_name);

Index created.

Tip

Best Places for Single-Column Indexes

Single-column indexes are most effective when they are used on columns that are frequently used alone in the WHERE clause as query conditions. Good candidates for a single-column index are individual identification numbers, serial numbers, and system-assigned keys.

Unique Indexes

You use unique indexes for performance and data integrity. A unique index does not allow duplicate values to be inserted into the table. Otherwise, the unique index performs the same way a regular index performs. The syntax follows:

CREATE UNIQUE INDEX INDEX_NAME
ON TABLE_NAME (COLUMN_NAME)

The first example that follows attempts to create a unique index on the BIRD_NAME within the BIRDS_TRAVEL table. An error message is returned. A unique index cannot be created for this column because duplicate rows of data are already found in the table. Refer back to the previous output of the query. You can see duplicate values of bird names in the BIRDS_TRAVEL table because birds can migrate to many different locations.

SQL> create unique index birds_travel_idx2
  2  on birds_travel (bird_name);
on birds_travel (bird_name)
   *
ERROR at line 2:
ORA-01452: cannot CREATE UNIQUE INDEX; duplicate keys found

Next is a legitimate example. Here a unique index is created on the combination of the BIRD_NAME column in the MIGRATION_LOCATION column. Even though both birds and migration locations have duplicate entries in this table, the combination of a bird’s name and the migration location should always be unique.

SQL> create unique index birds_travel_idx3
  2  on birds_travel (bird_name, migration_location);

Index created.

You might be wondering, what happens if a bird’s identification number is the primary key for a table? An index is usually implicitly created when you define a primary key for a table, so normally you do not also have to create a unique index on the table.

When working with objects such as unique indexes, it is often beneficial to create the indexes on empty tables during the creation of the database structure. This ensures that the data going into the structure already meets the demand of the constraints you want to place on it. If you work with existing data, you want to analyze the impact of whether the data needs to be adjusted to properly apply the index.

Tip

Unique Index Constraints

You can create a unique index only on a column in a table whose values are unique. In other words, you cannot create a unique index on an existing table with data that already contains records on the indexed key that are not unique. Similarly, you cannot create a unique index on a column that allows for NULL values. If you attempt to create a unique index on a column that violates one of these principles, the statement fails.

Composite Indexes

A composite index is an index on two or more columns of a table. You should consider performance when creating a composite index because the order of columns in the index has a measurable effect on the data retrieval speed. Generally, the most restrictive value should be placed first for optimum performance. However, the columns that are always specified in your queries should be placed first. The syntax follows:

CREATE INDEX INDEX_NAME
ON TABLE_NAME (COLUMN1, COLUMN2)

An example of a composite index follows:

SQL> create index birds_travel_idx4
  2  on birds_travel (bird_name, wingspan);

Index created.

In this example, you create a composite index based on two columns in the BIRDS_TRAVEL table: BIRD_NAME and WINGSPAN. You might create a composite index here because of the assumption that these two columns are frequently used together as conditions in the WHERE clause of a query.

In deciding whether to create a single-column index or a composite index, consider the column(s) that you might use frequently in a query’s WHERE clause as filter conditions. If only one column is used, choose a single-column index. If two or more columns are frequently used in the WHERE clause as filters, a composite index is the best choice.

Implicit Indexes

Implicit indexes are indexes that are automatically created by the database server when an object is created. Indexes are automatically created for primary key constraints and unique constraints.

Why are indexes automatically created for these constraints? Imagine a database server. Now say that a user adds a new product to the database. The product identification is the primary key on the table, which means that it must be a unique value. To efficiently make sure the new value is unique among hundreds or thousands of records, the product identifications in the table must be indexed. Therefore, when you create a primary key or a unique constraint, an index is automatically created for you.

Knowing When to Consider Using an Index

Unique indexes are implicitly used with a primary key for the primary key to work. Foreign keys are also excellent candidates for an index because you often use them to join the parent table. Most, if not all, columns used for table joins should be indexed.

Columns that you frequently reference in the ORDER BY and GROUP BY clauses should be considered for indexes. For example, if you are sorting on an individual’s name, having an index on the name column is beneficial: It renders an automatic alphabetical order on every name, thus simplifying the actual sort operation and expediting the output results.

Furthermore, you should create indexes on columns with a high number of unique values, or columns that, when used as filter conditions in the WHERE clause, return a low percentage of rows of data from a table. This is where trial and error might come into play. Just as you should always test production code and database structures before implementing them into production, you should test indexes. Your testing should center on trying different combinations of indexes, no indexes, single-column indexes, and composite indexes; no cut-and-dried rule governs using indexes. The effective use of indexes requires a thorough knowledge of table relationships, query and transaction requirements, and the data itself.

Note

Plan for Indexing Accordingly

Be sure to plan your tables and indexes. Don’t assume that because an index has been created, all performance issues are resolved. The index might not help at all—it might actually hinder performance and take up disk space.

Knowing When to Avoid Indexes

Although indexes are intended to enhance a database’s performance, sometimes you should avoid them. The following guidelines indicate when to reconsider using an index:

  •    Avoid using indexes on small tables. Indexes have an overhead associated with them in terms of query time to access them. In the case of small tables, the query engine to do a quick scan over the table more quickly than it can look at an index first.

  •    Do not use indexes on columns that return a high percentage of data rows when used as a filter condition in a query’s WHERE clause. For instance, you do not see entries for the words the or and in the index of a book.

  •    You can index tables that have a frequent, large batch update jobs run. However, the index considerably slows the batch job’s performance. You can correct the conflict of having an index on a table that is frequently loaded or manipulated by a large batch process in this way: Drop the index before the batch job and then re-create the index after the job has completed. The indexes are also updated as the data is inserted, causing additional overhead.

  •    Do not use indexes on columns that contain a high number of NULL values. Indexes operate best on columns that have a higher uniqueness of data between rows. If a lot of NULL values exist, the index will skew toward the NULL values, which might affect performance.

  •    Avoid indexing columns that are frequently manipulated. Maintenance on the index can become excessive.

You can see in Figure 22.2 that an index on a column such as gender might not prove beneficial. For example, suppose the following query was submitted to the database:

SELECT *
FROM TABLE_NAME
WHERE GENDER = 'FEMALE';
Illustration of an ineffective index.

FIGURE 22.2
An example of an ineffective index

Referring to Figure 22.2, which is based on the previous query, you can see constant activity between the table and its index. Because a high number of data rows is returned for WHERE GENDER = 'FEMALE' (or 'MALE'), the database server constantly has to read the index, then the table, then the index, then the table, and so on. In this case, it might be more efficient to use a full table scan because a high percentage of the table must be read anyway.

Caution

Indexes Can Sometimes Lead to Performance Problems

Use caution when creating indexes on a table’s extremely long keys. Performance is inevitably slowed by high I/O costs.

As a general rule, do not use an index on a column used in a query’s condition that returns a high percentage of data rows from the table. In other words, do not create an index on a column such as gender or any column that contains few distinct values. This is often referred to as a column’s cardinality, or the uniqueness of the data. High cardinality refers to a very unique quality and describes information such as identification numbers. Low-cardinality values are not very unique and refer to columns such as gender.

Altering an Index

You can alter an index after it has been created using syntax that is similar to the CREATE INDEX syntax. The types of alterations that you can manage with the statement differ between implementations, but they handle all the basic variations of an index in terms of columns, ordering, and such. The syntax follows:

ALTER INDEX INDEX_NAME

Take care when altering an existing index on production systems. In most cases, the index is immediately rebuilt, which creates an overhead in terms of resources. In addition, on most basic implementations, while the index is being rebuilt, it cannot be utilized for queries. This further hinders system performance.

Dropping an Index

An index can be dropped rather simply. Check your particular implementation for the exact syntax, but most major implementations use the DROP command. Take care when dropping an index because it can drastically slow (or, conversely, improve) performance. The syntax follows:

DROP INDEX INDEX_NAME

SQL> drop index birds_travel_idx4;

Index dropped.

MySQL uses a slightly different syntax; you also specify the table name of the table that you are dropping the index from:

DROP INDEX INDEX_NAME ON TABLE_NAME

The most common reason for dropping an index is to improve performance. Remember that if you drop an index, you can re-create it later. You might need to rebuild an index to reduce fragmentation. Often you must experiment with indexes in a database to determine the route to best performance; this might involve creating an index, dropping it, and eventually re-creating it, with or without modifications.

Summary

In this hour, you learned that you can use indexes to improve the overall performance of queries and transactions performed within the database. As with the index of a book, database indexes enable specific data to be quickly referenced from a table. The most common method for creating indexes is to use the CREATE INDEX command. Different types of indexes are available among SQL implementations, including unique indexes, single-column indexes, and composite indexes. You need to consider many factors when deciding on the index type that best meets the needs of your database. The effective use of indexes often requires some experimentation, a thorough knowledge of table relationships and data, and a little patience—being patient when you create an index can save minutes, hours, or even days of work later.

Q&A

Q. Does an index actually take up space the way a table does?

A. Yes, an index takes up physical space in a database. In fact, an index can become much larger than the table for which the index was created.

Q. If you drop an index so that a batch job can complete faster, how long does it take to re-create the index?

A. Many factors are involved, such as the size of the index being dropped, the CPU usage, and the machine’s power.

Q. Should all indexes be unique?

A. No, unique indexes allow no duplicate values. You might need to allow duplicate values in a table.

Workshop

The following workshop consists of a series of quiz questions and practical exercises. The quiz questions are designed to test your overall understanding of the current material. The practical exercises give you the opportunity to apply the concepts discussed during the current hour, as well as build on the knowledge you acquired in previous hours of study. Be sure to complete the quiz questions and exercises before continuing to the next hour. Refer to Appendix C, “Answers to Quizzes and Exercises,” for answers.

Quiz

  1. 1. What are some major disadvantages of using indexes?

  2. 2. Why is the order of columns in a composite index important?

  3. 3. Should a column with a large percentage of NULL values be indexed?

  4. 4. Is the main purpose of an index to stop duplicate values in a table?

  5. 5. True or false: The main reason for a composite index is for aggregate function usage in an index.

  6. 6. What does cardinality refer to? What is considered a column of high cardinality?

Exercises

  1. 1. For the following situations, decide whether an index should be used and, if so, what type of index should be used:

    1. Several columns, but a small table

    2. Medium-size table, no duplicates allowed

    3. Several columns, very large table, several columns used as filters in the WHERE clause

    4. Large table, many columns, a lot of data manipulation

  2. 2. Write a SQL statement to create an index called FOOD_IDX in FOOD on the FOOD_NAME column.

  3. 3. Write a SQL statement to create an index called WINGSPAN_IDX in BIRDS on the WINGSPAN column.

  4. 4. Drop the indexes you created.

  5. 5. Study the tables used in this book. List some good candidates for indexed columns, based on how a user might search for data.

  6. 6. Create some additional indexes on your tables as you like. Think about how your data might be searched and where indexes can create efficiencies.