Retrieving Rows with SELECT

The heart of all SQL queries is the SELECT command. SELECT is used to build queries (also known as SELECT statements). Queries are the only SQL instructions by which your data can be retrieved from tables and views. The data returned via a query is called a result set and consists of rows, with columns, similar to a table.

The columns of a result set are not stored on the disk in any fixed form. They are purely a temporary result of the query’s requested data. A query on a table may return a result set with the same column structure as the table, or it may differ drastically. Result sets may even have columns which are drawn from several other tables by a single query.

Since it is central to PostgreSQL, SELECT is easily the most complicated single command, having the most available clauses and parameters. The following is the syntax for SELECT. The terms used are summarized and described in greater detail within the following sections. The term expression is used to refer to either a column name, or a general expression (such as a column being operated upon by a constant, or another column).

SELECT [ ALL | DISTINCT [ ON ( expression  [, ...] ) ] ]
       target [ AS name ] [,...]
       [ FROM source [, ...] ]
              [ [ NATURAL ] join_type source
              [ ON condition | USING ( column_list ) ] ]
              [, ...]
       [ WHERE condition ]
       [ GROUP BY expression [, ...] ]
       [ HAVING condition [, ...] ]
       [ { UNION | INTERSECT | EXCEPT } [ ALL ] sub-query ]
       [ ORDER BY expression
               [ ASC | DESC | USING operator ]
               [, ...] ]
       [ FOR UPDATE [ OF table [, ...] ] ]
       [ LIMIT { count | ALL } [ { OFFSET | , } start ] ]

In this syntax diagram, source may be either a table name or a subselect. The syntax for these general forms is as follows:

FROM { [ ONLY ] table [ [ AS ] alias [ ( column_alias [, ...] ) ] ] |
       ( query ) [ AS ] alias [ ( column_alias [, ...] ) ] }
ALL

The ALL keyword may be specified as a noise term to make it clear that all rows should be returned.

DISTINCT [ ON ( expression [, ...] ) ]

The DISTINCT clause specifies a column (or expression) for which to retrieve only one row per unique value of expression.

target [ AS name ] [, ...]

The SELECT targets are usually column names, though they can be constants, identifiers, functions or general expressions. Each target requested must be separated by commas, and may be named dynamically to name via the AS clause. Supplying the asterisk symbol (*) as a target is shorthand for requesting all non-system columns, and may be listed along with other targets.

FROM source [, ... ]

The FROM clause dictates the source that PostgreSQL will look in for the specified targets. The source, in this case, may be a table name or a sub-query. You can specify numerous sources, separated by commas. (This is roughly equivalent to a cross join). The syntax for the FROM clause is described in more detail later in this section.

[ NATURAL ] join_type source [ ON condition | USING ( column_list ) ]

The FROM sources may be joined together via the JOIN clause, which requires a join_type (e.g., INNER, FULL OUTER, CROSS) and may require a condition or column_list to further define the nature of the join, depending on the join_type.

WHERE condition

The WHERE clause constrains the result set from the SELECT statement to specified criteria, which are defined by condition. Conditions must return a single Boolean value (true or false), but may consist of several checks combined with logical operators (e.g., with AND, and OR) to indicate that available rows must meet all supplied conditions to be included in the statement’s results.

GROUP BY expression [, ...]

The GROUP BY clause aggregates (groups) rows together by the criteria described in expression. This can be as simple as a column name (and often is) or an arbitrary expression applied to values of the result set.

HAVING condition [, ...]

The HAVING clause is similar to the WHERE clause, but checks its conditions on aggregated (grouped) sets instead of atomic rows.

{ UNION | INTERSECT | EXCEPT } [ ALL ] sub-query

Performs one of three set operations between the SELECT statement and a second query, returning their result sets in uniform column structure (which must be compatible). Duplicate rows are removed from the resultant set unless the ALL keyword is used.

UNION

Returns the set of collected rows.

INTERSECT

Returns the set of rows where the values of the two sets overlap.

EXCEPT

Returns the set of rows which are found in the SELECT statement, but not found in the secondary query.

ORDER BY expression

Sorts the results of the SELECT statement by expression.

[ ASC | DESC | USING operator ]

Determines whether or not the ORDER BY expression proceeds in ascending order (ASC), or descending order (DESC). An operator may alternatively be specified with the USING keyword (e.g., < or >).

FOR UPDATE [ OF table [, ...] ]

Allows for exclusive locking of the returned rows. When used within a transaction block, FOR UPDATE locks the rows of the specified table until the transaction is committed. While locked, the rows cannot be updated by other transactions.

LIMIT { count | ALL }

Limits the number of rows returned to a maximum of count, or explicitly allows ALL rows.

{ OFFSET | , } start

Instructs the LIMIT clause at what point to begin limiting the results. For example, a LIMIT with a count set to 100, and an OFFSET clause with a start value of 50 would return the rows from 50 to 150 (if there are that many results to return).

Terms used in the FROM clause’s syntax description are as follows:

[ ONLY ] table

The table name specifies what table to use as a source for the SELECT statement. Specifying the ONLY clause causes the rows of any child’s table to be omitted from the query.

[ AS ] alias

An alias may optionally be assigned to a FROM source, in order to simplify a query (e.g., books might be temporarily referenced with an alias of b). The AS term is considered noise, and is optional.

( query ) [ AS ] alias

Any valid SELECT statement may be placed in parentheses as the query. This causes the result set created by the query to be used as a FROM source, as if it had been a static table. This use of a sub-query requires a specified alias.

( column_alias [, ...] )

The FROM sources which have assigned aliases may also alias columns by specifying arbitrary column aliases. Each column_alias must be separated by commas, and grouped within parentheses following the FROM source’s alias. These aliases must match the order of the defined columns in the table to which it is applied.

A SELECT statement may be as simple as a request for all rows and all columns from a specified table. Use the following syntax to retrieve all rows and columns from a table:

SELECT * FROM table_name;

The asterisk (*) character, as mentioned in the explanation of SELECT’s syntax, is short-hand for all non-system columns. In essence, the SELECT * requests all non-system data in the table named table_name; this retrieves all columns and all rows, because no row limit is specified. To demonstrate, Example 4-23 requests all columns (*) from Book Town’s books table.

While SELECT * is a good example of a basic query, and is sometimes very useful, you will probably be interested in retrieving only a few columns worth of information at a time. To stay efficient, and to keep your queries clear, it is a good idea to explicitly specify the intended target columns rather than to use the asterisk. This is especially true when using the JOIN clause, as will be discussed in the section titled Joining Data Sets with JOIN.

To specify the target columns for a query, list the names of the columns following the SELECT keyword. The query will return data for only those columns that you list. The order of these columns need not match their literal order in the table, and columns may be listed more than once, or not at all, as shown in Example 4-24.

As you can see, the data sets returned in both Example 4-24 and Example 4-23 are nearly identical. The second set is returned in a different column arrangement, (omitting the subject_id column, and repeating the id column twice) as a result of the target list.

In addition to plain column names, targets in the SELECT statement may be arbitrary expressions (e.g., involving functions, or operators acting upon identifiers), or constants. The syntax is simple, and only requires that each identifier, expression, or constant be separated by commas. Conveniently, different types of targets may be arbitrarily mixed in the target list.

In fact, the SELECT command may be used to retrieve expressions and constants without the use of a FROM clause or specified columns, as in Example 4-25.

The target list allows the use of an optional AS clause for each specified target, which re-names a column in the returned result set to an arbitrary name specified in the clause. The rules and limitations for the specified name are the same as for normal identifiers (e.g., they may be quoted to contain spaces, may not be keywords unless quoted, and so on).

Using AS has no lasting effect on the column itself, but only on the result set which is returned by the query. AS can be particularly useful when selecting expressions or constants, rather than plain columns. Naming result set columns with AS can clarify the meaning of an otherwise ambiguous expression or constant. This technique is demonstrated in Example 4-26, which shows the same results as Example 4-25, but with different column headings.

The FROM clause allows you to choose either a table or a result set as a source for your specified target list. Multiple sources may be entered following the FROM clause, separated by commas. Specifying multiple sources in this fashion is functionally similar to a CROSS JOIN, discussed in the section titled Joining Data Sets with JOIN.

Take care when specifying multiple FROM sources to PostgreSQL. The result of performing a SELECT on several comma-delimited sources without a WHERE or JOIN clause to qualify the relationship between the sources is that the complete Cartesian product of the sources will be returned. This is a result set where each column from each source is combined in every possible combination of rows between each other source.

Typically a WHERE clause is used to define the relationship between comma-delimited FROM sources, as shown in Example 4-27 (see the section titled Qualifying with the WHERE Clause for more information about the WHERE clause).

You must be careful when identifying column names and using multiple sources in the FROM clause, as it can introduce ambiguity between identifiers. Consider a SELECT that draws from both the books table and the authors table. Each of these tables has a column called id. If specified, PostgreSQL will be unable to determine if the id column refers to the book, or the author:

booktown=# SELECT id FROM books, authors;
ERROR:  Column reference "id" is ambiguous

As a result of the potential for ambiguity, “complete” column names can be referenced through a special syntax called dot-notation. Dot-notation refers to the placement of a dot, or period, between the table name and a column name, in order to explicitly reference a particular column. For example, books.id refers to the id column within the books table.

Dot-notation is only required in instances of ambiguity between data sets. As shown in Example 4-27, you can use the column name as an identifier source, as long as it is unique among the available sets defined by the FROM clause. (In this case, the title column, which is unique to the books table, and the last_name column, which is unique to the authors tables).

If you wish to use a sub-query to generate a result set as a source for your FROM clause, the entire query must be surrounded by parentheses. This instructs PostgreSQL to correctly interpret the query as a sub -SELECT statement and to execute it before the SELECT statement within which it resides.

Example 4-28 demonstrates a peculiar query which retrieves all column values (*) from the books table via a sub-query. The query then retrieves a string constant of test and the id values from that result set (derived from the sub-query).

The query in Example 4-28 is rather peculiar because the net effect is no different than if you had selected from the books table. This occurs because the result set from the sub-query is identical to the set of values in the books table. The use of this query demonstrates the combination of a string constant from one SELECT statement with a value drawn from the result set of a second SELECT statement. See the section titled Using Sub-Queries for more realistic examples of sub-queries once you have a better understanding of the SELECT statement itself.

Like columns, FROM sources (e.g., tables, or sub-queries) may be aliased with the AS clause. This is usually applied as a convenient shorthand for the dot-notation described in the preceding section. Aliasing a data set allows you to refer to it via dot-notation, which provides a more succinct and readable SQL statement. Example 4-29 demonstrates the same query used in Example 4-27, however you can see that it simplifies the dot-notation with the AS clause.

In addition to placing aliases on the FROM clause’s data sources, you can place aliases on the columns within that source. This is done by following a valid data source’s alias with a list of column aliases, grouped in parentheses and separated by commas. A column alias list therefore consists of a sequence of identifier aliases for each column, which correspond to the literal columns in the order that the table is defined with (from left to right).

When describing a column alias list, you do not need to specify each column; any column that is left unspecified is accessible via its normal name within such a query. If the only column you wish to alias is to the right of any other columns that you do not necessarily wish to alias, you will need to explicitly list the preceding columns (it is valid to list the same name for an existing column as its “alias”). Otherwise, PostgreSQL will have no way of knowing which column you were attempting to alias and will assume you were addressing the first column from the left.

Example 4-30 illustrates the same query that is used in Example 4-29 but aliases the id columns in each table to unique identifiers in order to reference them directly (i.e., without dot-notation). The syntax is functionally identical, aliasing only the books table’s id column, thus making the authors table’s id column non-ambiguous:

The optional DISTINCT keyword excludes duplicate rows from the result set. If supplied without the ON clause, a query that specifies DISTINCT will exclude any row whose target columns have already been retrieved identically. Only columns in the SELECT’s target list will be evaluated.

For example, the books table has 15 rows, each with an author_id. Some authors may have several entries in the books table, causing there to be several rows with the same author_id. Supplying the DISTINCT clause, as shown in the first query in Example 4-31, ensures that the result set will not have two identical rows.

As you can see, the first query in Example 4-31 returns only 13 rows from the books table, even though there are 15 total rows within it. Two authors with two books each end up being displayed only once.

The second query in Example 4-31 uses a different form of DISTINCT, which specifies the columns (or expressions) to be checked for redundancies. In this case, 13 rows are still returned, as the ON clause specifies to use the author_id column as the basis for determining if a row is redundant or not. Without the ON clause, the second query would return all 15 rows, because the DISTINCT clause would cause PostgreSQL to look for rows that are completely unique.

The titles that are omitted from the resultant data set by ON are arbitrarily determined by PostgreSQL, unless an ORDER BY clause is specified. If the ORDER BY clause is used with DISTINCT, you can specify the order in which columns are selected; hence, you can select which rows will be considered distinct first. See the section titled Sorting Rows with ORDER BY for information about sorting rows.

If you are interested in grouping rows which have non-unique criteria, rather than omitting all rows but one, see the description of the GROUP BY clause in the section titled Grouping Rows with GROUP BY.

The WHERE clause allows you to provide Boolean (true or false) conditions that rows must satisfy to be included in the resulting row set. In practice, a SELECT statement will almost always contain at least one qualification via the WHERE clause.

Suppose that you want to see all of the books in Book Town’s Computers section. The subject_id for the Computers subject is 4. Therefore, the WHERE clause can be applied with an equivalence operation (the = operator) to check for rows in the books table with a subject_id equal to 4. This is shown in Example 4-32.

The query in Example 4-32 returns only rows whose subject_id column matches the integer constant value of 4. Thus, only the four rows for computer books are returned, rather than the 15 rows shown by the simple query in Example 4-23.

The WHERE clause accepts numerous conditions, provided that they are joined by valid logical keywords (e.g., the AND, and OR keywords) and returns a single Boolean condition. For example, you may be interested in seeing all Book Town titles that fall under the Computers subject which are also by the author Mark Lutz, thus joining two conditions to narrow the focus of your query. Alternatively, you might be interested in seeing each of Book Town’s titles that fall under either the Computers subject or the Arts subject, thereby joining two conditions to broaden the focus of your intended result set. Example 4-33 demonstrates each of these scenarios using the AND keyword and OR keyword, respectively.

The first SELECT statement in Example 4-33 combines one condition, which checks for titles in the Computers subject (with a subject_id of 4), with another condition, which checks if the author is Mark Lutz (with an author_id of 7805) via the AND keyword. The result is a smaller data set, constrained to two rows that fit both specified conditions.

The second SELECT statement in Example 4-33 combines the same first condition (books in the Computers subject) with a second condition: if the title falls under the Arts subject (with a subject_id of 0). The result is a slightly larger data set of five rows that matched at least one of these conditions.

WHERE conditions may be grouped together indefinitely, though after two conditions you may wish to group the conditions with parentheses. Doing so explicitly indicates how the conditions are interrelated. As a demonstration, the two statements in Example 4-34 have different effects based merely on the addition of parentheses.

The preceding example demonstrates two attempts to look up Book Town titles with an author_id of 1866. The titles also have a subject_id of either 15, or 3. As you can see from the first statement, when the three conditions are used without parentheses, the intent of the statement is ambiguous, and interpreted incorrectly. The addition of parentheses will cause the evaluations within parentheses to be considered before any surrounding condition.

As demonstrated by the use of the WHERE clause on two table sources in the section titled Selecting Sources with the FROM Clause, you have the ability to retrieve data from different data sources by combining their columns into joined rows. In SQL, this process is formally called a join.

The essential concept behind a join is that two or more data sets, when joined, have their columns combined into a new set of rows containing each of the columns requested from each of the data sets. The foundation of all joins is the Cartesian product, which is the set of all possible combinations between two data sets. That product may then be refined into a smaller subset by a set of criteria in the JOIN syntax. These criteria describe a relationship between data sets, though such a definition is not required.

There are three general types of joins:

Cross joins

  • Creates a Cartesian product (or cross product) between two sets of data. It is called a product because it does not define a relationship between the sets; instead, it returns every possible combination of rows between the joined sets, essentially multiplying the sources by one another.

Inner joins

  • Creates a subset of the Cartesian product between two sets of data, requiring a conditional clause to specify criteria upon which to join records. The condition must return a Boolean value to determine whether or not a row is included in the joined set.

Outer joins

  • Similar to an inner join, in that it accepts criteria which will match rows between two sets of data, but returns at least one instance of each row from a specified set. This is either the left set (the data source to the left of the JOIN keyword), the right set (the data source to the right of the JOIN keyword), or both sets, depending on the variety of outer join employed. The missing column values for the empty half of the row which does not meet the join condition are returned as NULL values.

More useful are the inner and outer joins, which require a qualification of the relationship between joined data sets in the JOIN syntax itself. The following is the syntax for an inner or outer join:

source1 [ NATURAL ] join_type source2
[ ON ( condition [, ...] ) | USING ( column [, ...] ) ]
source1

Identifies the first data set that is being joined (i.e., a table name or sub-query).

[ NATURAL ]

Implies that the two data sets should be joined on equivalent values between like-named columns (e.g., if two tables have a column called id, it will join rows where the id values are equivalent). The NATURAL clause will respect column aliases, if applied. The use of the NATURAL clause makes it both unnecessary and invalid to try to specify either of the ON or USING clauses.

join_type

Specifies the type of JOIN intended. Valid values in this context are [ INNER ] JOIN (specifying just JOIN implies an INNER JOIN), LEFT [ OUTER] JOIN, RIGHT [ OUTER ] JOIN, and FULL [ OUTER ] JOIN.

source2

Identifies the second data set that is being joined (i.e., a table name, or sub-query).

ON ( condition [, ...] )

Identifies the second data set that is being joined (i.e., a table name, or sub-query).

Specifies the relationship between source1 and source2. Any arbitrary criteria may be specified within the ON clause, just as you would specify conditions following a WHERE clause. Column and table aliases are allowed in this criteria.

USING ( column [, ...] )

Specifies like-named columns between source1 and source2 with which to join rows by equivalent values. Similar to a NATURAL JOIN, but allows you to indicate what specific columns to join on, whereas NATURAL will join on all like-named columns. Similar to NATURAL joins, column aliases are respected in the USING clause’s parameters.

The SQL92 INNER JOIN syntax is a tool that helps differentiate the conditions with which you are joining data sources (the JOIN conditions) from the conditions with which you are evaluating rows for inclusion in your data set (the WHERE conditions). For example, consider the two SELECT statements in Example 4-36.

The two forms of syntax in Example 4-36 are functionally identical, and return the same results. The INNER JOIN syntax allows you to segregate the relational criteria from your evaluation criteria by only defining the set relationships in the ON clause. This can make involved queries much easier to read and maintain, as you do not need to interpret what each condition described by the WHERE clause is conceptually achieving.

Notice that the second query demonstrates the use of aliases b and a in the ON clause for the books and authors tables, respectively. The use of these aliases in the ON clause is perfectly valid, and often preferable from a perspective of improved readability.

In cases of simple equivalence joins, it may be more convenient for you to use either the USING or NATURAL clauses instead of the ON clause. These are only applicable on data sets with identically named columns. If you have columns that define a relationship between two sets that are not identically named, you may still use the USING or NATURAL clauses by employing column aliases, as demonstrated in Example 4-37, to re-name one or both of the columns to a uniform name.

The first SELECT statement in Example 4-37 assigns the alias of author_id to the first column in the authors table (which is actually named id). By passing the author_id identifier to the USING clause, PostgreSQL then searches for a column identifier in each data set with that name to join rows on values found to be equivalent.

Inner joins are adequate for a wide variety of queries, but there are times when an outer join is required to get all of the data you need. The key to understanding the difference between inner and outer joins is in knowing how each type of join handles rows that do not meet their defined relationship.

In short, an inner join will discard any row for which it cannot find a corresponding value between the sets being joined (as specified by either the ON or USING clause).

In contrast to inner joins, an outer join can retain rows where corresponding values between sets are not found, populating the missing columns with NULL values. Whether or not the outer join does retain that row depends on which set is missing the value and the kind of outer join that is specified.

There are three forms of outer joins:

Left outer joins

Right outer joins

Full outer joins

Consider again Book Town’s books table, and another Book Town table called editions. While the books table stores general information on a given title, the editions table stores specific information pertaining to each edition, such as an the book’s ISBN, publisher, and publication date. The editions table has a column called book_id which corresponds to the books table’s primary key column, id.

Suppose that you want to retrieve each of Book Town’s titles, along with its isbn, if applicable. Performing a query with an inner join between the books and editions tables will correctly return a data set with title and isbn columns. However, as demonstrated in Example 4-38, if a book does not yet have a printed edition (or if that edition has not yet been entered into Book Town’s database), those titles will not be displayed.

In contrast, the statement immediately following the inner join in Example 4-38 employs an outer join, returning 20 rows. Three of the returned rows do not have ISBN numbers, but are not omitted due to the definition of the join.

The join specified by the second query in Example 4-38 uses the LEFT OUTER JOIN clause to define its join type. This is because the query focuses on titles from the books table that have ISBN numbers, and not those editions having ISBN numbers that do not correspond to titles. As the books table is to the left of the JOIN keyword, it is defined as a left outer join to achieve this. If the focus of the query was to see both ISBN numbers without titles as well as titles without ISBN numbers, the same query could instead be modified to be a full outer join with the FULL OUTER JOIN clause.

The difference between inner and outer joins illustrated in Example 4-38 is a vital concept to understand, as misuse of joins can lead to both omitted and unexpected rows.

It should be understood that while a single JOIN clause connects only two sets of data, in practice, joins are not restricted to only two data sources. You may arbitrarily specify numerous JOIN clauses following sets that are themselves constructed from joins, just as you may specify numerous data sources separated by commas.

When connecting several joins together, it is a good practice to group each join and sub-join within parentheses. Explicitly grouping joins in this fashion insures that there is no ambiguity, to either PostgreSQL or a developer, as to which data sets are joined, and in what order.

An interesting observation to be made about Example 4-39 is that, while the books table is itself deeply involved in the join, none of its columns are retrieved in the final result set. The books table is included in the JOIN clauses in order to provide criteria through which other tables are joined together. Each of the tables whose columns are retrieved in the query rely on the books table in order to draw relationships with any other table through the id column (with the exception of the publishers table, which relates to the publisher_id column in the editions table).

The GROUP BY clause introduces a powerful SQL concept: aggregation. To aggregate means to gather into a sum, or whole. The practical effect of aggregating in a SQL query is that any rows whose results from the GROUP BY expression match identically are grouped together into a single aggregate row. The GROUP BY expression may define a column, but it may also be any operation upon a column as well. If several columns or expressions are specified (delimited by commas), the entire set of specified criteria must be identical for rows to be grouped together.

To effectively use aggregation you must understand that any target columns requested by an aggregating query which are not specified in the GROUP BY clause will be inaccessible, unless selected through an aggregate function. An aggregate function accepts a column name (or expression involving at least one column name) which can represent several values (i.e., from several grouped rows), performs an operation on those values, and returns a single value.

Common aggregate functions include count(), which returns the number of rows in the set, max(), which returns the maximum value in the column, and min(), which returns the minimum value in the column. An aggregate function operates only on rows in the query’s result set, and is therefore executed after conditional joins and WHERE conditions have been processed.

Imagine that you wanted to know how many books Book Town stores in its database for each known publisher. You could perform a simple join between the editions and publishers tables in order to associate each publisher name with a title that they publish. It would be tedious to manually count how many titles each publisher maintained, and in cases of larger data sets, it can become difficult to manage larger result sets.

Example 4-40 demonstrates a join between these two Book Town tables, but also introduces two new elements: the count() function, and the GROUP BY clause.

The GROUP BY clause in Example 4-40 instructs PostgreSQL to group the rows in the joined data set by p.name, which in this query is a reference to the name column in the publishers table. Therefore, any rows that have the same publisher name will be grouped together, or aggregated. The count() function then counts the number of isbn values from the editions table that are in each aggregated row, and returns a single numeric value representing the number of rows that were aggregated for each unique publisher.

Note that in Example 4-40 the argument of the editions table’s isbn column is chosen simply to indicate the objective of the example (to count how many books there are per publisher). Any column name will return the same number, as count() will always return the number of rows grouped in the current aggregate row.

Something to watch out for when designing aggregate queries is that the WHERE clause cannot accept criteria involving aggregate functions. Instead, use the HAVING clause. It functions identically to the WHERE clause, but its conditions must be on aggregate functions rather than single-row conditions. Syntactically, the HAVING clause follows the GROUP BY clause, as demonstrated in Example 4-41.

Both Example 4-40 and Example 4-41 create a data set through an inner join between the editions and publishers table. However, Example 4-41 constrains the final result to publishers having more than a single book in the Book Town database, as set by the HAVING clause.

As described in Chapter 3, row data is not stored in a consistent order within tables. In fact, an identical query executed twice is in no way guaranteed to return the rows in the same order each time. As order is commonly an important part of retrieving data for database-dependent applications, use the ORDER BY clause to allow flexible sorting of your result set.

The ORDER BY clause accepts as its parameters a list of comma-delimited column names (or expressions upon columns), which are used as sorting criteria. For each sort criteria, you may optionally apply either the ASC, DESC, or USING keywords to control the type of sorting employed:

ASC

Causes the rows to sort by the related criteria in an ascending fashion (e.g., numbers will be sorted lowest to highest, text will be sorted alphabetically from a to z). ASC is equivalent to specifying USING <. Since it is the default behavior, specifying ASC is only useful for explicit readability.

DESC

Causes the rows to sort by the related criteria in a descending fashion (e.g., numbers will be sorted highest to lowest, text will be sorted alphabetically from z to a). DESC is equivalent to specifying USING >.

USING operator

Allows the specification of the operator operator to be used to compare each column for precedence. This can be particularly useful for custom operators.

Example 4-42 demonstrates the use of the ORDER BY clause on the editions table. It specifies the publication column as the source of values to sort by, and explicitly declares the ordering method as an ascending (ASC) sort.

As you can see in the result set from Example 4-42, the rows return in ascending order, from the oldest date to the newest. It should be noted that even columns and expressions that do not appear in the target list of the SELECT statement may be used to sort the retrieved rows. Furthermore, aggregate functions and expressions are allowed by the ORDER BY clause if the query involves aggregation. The ability to sort by such a wide scope of sources thus allows for a great deal of flexibility in ordering results from a variety of query approaches.

When specifying multiple expressions to sort by, the result set will be ordered by the first criteria (from left to right), and will only process subsequent sorting criteria if the first condition’s sort is inconclusive. For example, consider the sorting performed in Example 4-43.

The query in Example 4-43 selects the numeric edition and publication date of each book from the editions table. The ORDER BY clause then specifies two columns to sort by: edition, in ascending order, and publication, in descending order.

As you can see in the result set for Example 4-43, each row is first sorted by edition, proceeding from the lower editions to the higher editions. Subsequently, wherever the editions are identical, the publication date is used to then sort again, from the most recent publication date to the least recent.

Sorting is extremely relevant when using the DISTINCT keyword, as discussed in the section titled Removing Duplicate Rows with DISTINCT. If you are only interested in seeing the most recently published copy of each edition in the editions table, the ORDER BY and DISTINCT clauses can be combined to achieve an effect somewhat similar to the GROUP BY clause, as shown in Example 4-44.

Since the ORDER BY occurring before the DISTINCT clause eliminates duplicate rows, the net effect can be very similar to using the max() or min() with a GROUP BY clause. This technique can sometimes be more efficient, depending on the complexity of the aggregation and sorting involved.

PostgreSQL enforces no limit upon the number of rows retrievable from a SQL query. If you attempt to execute a query that returns several million rows, it may take a while, but the server will not stop until it has returned the entire result set (or until it is interrupted).

Applications could conceivably be written to programmatically “page” through large sets of data after retrieval, but SQL provides as a convenience the LIMIT and OFFSET clauses, which allow for the retrieval of a specified portion of the generated result set.

When the LIMIT clause is specified, no more than the requested number of rows will be returned (though there may be fewer if the result set is smaller than the passed parameter). When the OFFSET clause is specified, it skips the number of rows defined by its parameters before returning rows. If both are specified, the number of rows to be included as per the LIMIT clause will not be counted until the number of rows dictated by the OFFSET clause have been skipped.

Example 4-45 demonstrates, in the first query, a simple use of LIMIT, by retrieving only 5 rows from the joined set of the editions and books table. Ordinarily, such a join would result in 17 rows.

The second query in Example 4-45 shows the use of the OFFSET clause, to shift the scope of the result set down by two rows. You can see that the last three rows of the first query’s result set overlap with the first three rows of the second query’s result set. The ORDER BY clause in each of these queries insures the consistency of the sets returned.

While joins are used in SQL to combine column values into a single row, the UNION, INTERSECT and EXCEPT clauses exist to merge or omit row data by comparing column values, returning a new result set based on this comparison. Each of these keywords may be used at the end of a valid SQL query and followed by a second query, in order to compare the resultant data sets, and then either merge or omit rows based on that comparison.

When comparing data sets in this manner, it is required that they each have the same number of columns, as well as the same column type. Note that they do not need to have the same name, or be queried from the same table or data source.

Example 4-46, Example 4-47, and Example 4-48 each demonstrate these keywords by combining and omitting rows from comparative data sets. Example 4-46 creates a result set by combining several authors’ last names with book titles via the UNION keyword.

Example 4-47 demonstrates the selection of ISBN numbers from the books table, limited to rows which intersect with the query on the shipments table for books which have records of more than two shipments. Finally, Example 4-48 demonstrates the removal of any rows from the first query which are matched completely in the second.

In Example 4-48, only rows that do not match the second query are returned. Notice that the effective result of this is that only authors who do not have a book in the books table are returned. This is due to the INNER JOIN clause, which causes the second query to omit any authors whose author_id is not found in the books table.

While the use of these keywords in a single SQL query precludes the ability to use the LIMIT clause, this limitation can be circumvented by PostgreSQL’s support for sub-queries. By grouping in parentheses each of the queries involved between a UNION, EXCEPT, or EXCEPT clause, the returned result sets from the sub-queries are compared, as demonstrated in Example 4-49.

Notice that the query used in Example 4-49 creates a set from the books table that is constrained to the last seven rows and sorted alphabetically by title. The EXCEPT clause then removes from that data set the first eleven rows, sorted alphabetically in an ascending fashion. The result consists of the last four rows from the table, sorted from the bottom by the final ORDER BY clause on the new exception set.

In order to achieve simple programmatic transformations without having to call out to a procedural language, PostgreSQL supports standard SQL case expressions. These use the SQL keywords CASE, WHEN, THEN, and END to allow basic conditional transformations per each row.

The entirety of a case expression is syntactically placed within the SELECT statement’s target list. A case expression’s result column is named case by default, but it may be aliased in the same manner as any normal target list. The general syntax for a case expression in a SELECT statement’s target list is as follows:

CASE WHEN condition1 THEN result1
     WHEN condition2 THEN result2
     [ ... ]
     [ ELSE default_result ]
END [ AS alias ]

The CASE, WHEN, THEN, and ELSE keywords are somewhat similar to the if-then-else logic in programming languages. The condition of a WHEN clause must return a Boolean result.

When a WHEN condition is met, the result from its corresponding THEN clause will return in the result column for that row. If no conditions are met, the ELSE clause may be used to specify a default result value. If there are no results found for a case expression, NULL is returned.

Adding to the power of case expressions are PostgreSQL’s sub-queries, described in the section titled Using Sub-Queries. As demonstrated in Example 4-51, a sub-query may be provided as a result within a conditional expression.

In Example 4-51, any book found to have a cost of less than 20 has its title returned via a sub-select to the books table, along with its ISBN from the main query to the stock table.

The INTO TABLE clause may be used with any valid SELECT query in order to create a new table with the column structure and row data of the returned result set. The syntax for this is as follows:

SELECT select_targets
       INTO [ TABLE ] new_table
       FROM old_table;

This syntax performs an implicit CREATE TABLE command, creating a table with the same column names, value types, and row data as the result set from the original table. When the message SELECT is returned, you will know that the statement was successfully performed, and the new table created. This is demonstrated in Example 4-52, which creates a backup table called stock_backup out of the data in the stock table.

The table specified by the INTO clause must not exist, or else an error will be returned. Upon the error, the values of the query will not be inserted and the query will fail. Note that the TABLE keyword, in this query, is an optional noise term.