Description

The SELECT command is used to query the database and return a result. The SELECT command is the only SQL command capable of returning a user-generated result, be it a table query or a simple expression. Most consider SELECT to be the most complex SQL command. Although the basic format is fairly easy to understand, it does take some experience to understand its full power.

All of Chapter 5 is devoted to the SELECT command.

Basic format

The core SELECT command follows a simple pattern that can be roughly described as SELECT output FROM input WHERE filter. The output section describes the data that makes up the result set, the input section describes what tables, views, subqueries, etc., will act as data sources, and the filter section sets up conditions on which rows are part of the result set and which rows are filtered out.

A SELECT can either be SELECT ALL (default) or SELECT DISTINCT. The ALL keyword returns all rows in the result set, regardless of their composition. The DISTINCT keyword will force the select statement to eliminate duplicate results in the result set. There is usually a considerable performance penalty for calculating larger DISTINCT results.

The result set columns are defined by a series of comma-separated expressions. Every SELECT statement must have at least one result expression. These expressions often consist of only a source column name, although they can be any general expression. The character * means “return all columns,” and will include all standard table columns from all of the source tables. All the standard columns of a specific source table can be returned with the format table_name.*. In both cases, the ROWID column will not be included, although ROWID alias columns will be included. Virtual tables can also mark some columns as hidden. Like the ROWID column, hidden columns will not be returned by default, but can be explicitly named as a result column.

Result columns can be given explicit names with an optional AS clause (the actual AS keyword is optional as well). Unless an AS clause is given, the name of the output column is at the discretion of the database engine. If an application depends on matching the names of specific output columns, the columns should be given explicit names with an AS clause.

The FROM clause defines where the data comes from and how it is shuffled together. If no FROM clause is given, the SELECT statement will return only one row. Each source is joined together with a comma or a JOIN operation. The comma acts as an unconditional CROSS JOIN. Different sources, including tables, subqueries, or other JOIN statements, can be grouped together into a large transitory table, which is fed through the rest of the SELECT statement, and ultimately used to produce the result set. For more information on the specific JOIN operators, see FROM Clause.

Each data source, be it a named table or a subquery, can be given an optional AS clause. Similar to result set columns, the actual AS keyword is optional. The AS clause allows an alias to be assigned to a given source. This is important to disambiguate table instances (for example, in a self-join).

The WHERE clause is used to filter rows. Conceptually, the FROM clause, complete with joins, is used to define a large table that consists of every possible row combination. The WHERE clause is evaluated against each of those rows, passing only those rows that evaluate to true. The WHERE clause can also be used to define join conditions, by effectively having the FROM clause produce the Cartesian product of the two tables, and use the WHERE clause to filter out only those rows that meet the join condition.

Additional clauses

Beyond SELECT, FROM, and WHERE, the SELECT statement can do additional processing with GROUP BY, HAVING, ORDER BY, and LIMIT.

The GROUP BY clause allows sets of rows in the result set to be collapsed into single rows. Groups of rows that share equivalent values in all of the expressions listed in the GROUP BY clause will be condensed to a single row. Normally, every source column reference in the result set expressions should be a column or expression included in the GROUP BY clause, or the column should appear as a parameter of an aggregate function. The value of any other source column is the value of the last row in the group to be processed, effectively making the result undefined. If a GROUP BY expression is a literal integer, it is assumed to be a column index to the result set. For example, the clause GROUP BY 2 would group the result set using the values in the second result column.

A HAVING clause can only be used in conjunction with a GROUP BY clause. Like the WHERE clause, a HAVING expression is used as a row filter. The key difference is that the HAVING expression is applied after any GROUP BY manipulation. This sequence allows the HAVING expression to filter aggregate outputs. Be aware that the WHERE clause is usually more efficient, since it can eliminate rows earlier in the SELECT pipeline. If possible, filtering should be done in the WHERE clause, saving the HAVING clause to filter aggregate results.

The ORDER BY clause sorts the result set into a specific order. Typically, the output ordering is not defined. Rows are returned as they become available, and no attempt is made to return the results in any specific order. The ORDER BY clause can be used to enforce a specific output ordering. Output is sorted by each expression in the clause, in turn, from most specific to least specific. The fact that the output of a SELECT can be ordered is one of the key differences between an SQL table and a result set. As with GROUP BY, if one of the ORDER BY expressions is a literal integer, it is assumed to be a column index to the result set.

Finally, the LIMIT clause can be used to control how many rows are returned, starting at a specific offset. If no offset is provided, the LIMIT will start from the beginning of the result set. Note that the two syntax variations (comma or OFFSET) provide the parameters in the opposite order.

Since the row order of a result is undefined, a LIMIT is most frequently used in conjunction with an ORDER BY clause. Although it is not strictly required, including an ORDER BY brings some meaning to the limit and offset values. There are very few cases when it makes sense to use a LIMIT without some type of imposed ordering.