Retrieving Rows with SELECT

The heart of all SQL queries is the SELECT command. SELECT is used to build queries (also known as SELECT statements). Queries are the only SQL instructions by which your data can be retrieved from tables and views. The data returned via a query is called a result set and consists of rows, with columns, similar to a table.

The columns of a result set are not stored on the disk in any fixed form. They are purely a temporary result of the query’s requested data. A query on a table may return a result set with the same column structure as the table, or it may differ drastically. Result sets may even have columns which are drawn from several other tables by a single query.

Since it is central to PostgreSQL, SELECT is easily the most complicated single command, having the most available clauses and parameters. The following is the syntax for SELECT. The terms used are summarized and described in greater detail within the following sections. The term expression is used to refer to either a column name, or a general expression (such as a column being operated upon by a constant, or another column).

SELECT [ ALL | DISTINCT [ ON ( expression  [, ...] ) ] ]
       target [ AS name ] [,...]
       [ FROM source [, ...] ]
              [ [ NATURAL ] join_type source
              [ ON condition | USING ( column_list ) ] ]
              [, ...]
       [ WHERE condition ]
       [ GROUP BY expression [, ...] ]
       [ HAVING condition [, ...] ]
       [ { UNION | INTERSECT | EXCEPT } [ ALL ] sub-query ]
       [ ORDER BY expression
               [ ASC | DESC | USING operator ]
               [, ...] ]
       [ FOR UPDATE [ OF table [, ...] ] ]
       [ LIMIT { count | ALL } [ { OFFSET | , } start ] ]

In this syntax diagram, source may be either a table name or a subselect. The syntax for these general forms is as follows:

FROM { [ ONLY ] table [ [ AS ] alias [ ( column_alias [, ...] ) ] ] |
       ( query ) [ AS ] alias [ ( column_alias [, ...] ) ] }

ALL

The ALL keyword may be specified as a noise term to make it clear that all rows should be returned.

DISTINCT [ ON ( expression [, ...] ) ]

The DISTINCT clause specifies a column (or expression) for which to retrieve only one row per unique value of expression.

target [ AS name ] [, ...]

The SELECT targets are usually column names, though they can be constants, identifiers, functions or general expressions. Each target requested must be separated by commas, and may be named dynamically to name via the AS clause. Supplying the asterisk symbol (*) as a target is shorthand for requesting all non-system columns, and may be listed along with other targets.

FROM source [, ... ]

The FROM clause dictates the source that PostgreSQL will look in for the specified targets. The source, in this case, may be a table name or a sub-query. You can specify numerous sources, separated by commas. (This is roughly equivalent to a cross join). The syntax for the FROM clause is described in more detail later in this section.

[ NATURAL ] join_type source [ ON condition | USING ( column_list ) ]

The FROM sources may be joined together via the JOIN clause, which requires a join_type (e.g., INNER, FULL OUTER, CROSS) and may require a condition or column_list to further define the nature of the join, depending on the join_type.

WHERE condition

The WHERE clause constrains the result set from the SELECT statement to specified criteria, which are defined by condition. Conditions must return a single Boolean value (true or false), but may consist of several checks combined with logical operators (e.g., with AND, and OR) to indicate that available rows must meet all supplied conditions to be included in the statement’s results.

GROUP BY expression [, ...]

The GROUP BY clause aggregates (groups) rows together by the criteria described in expression. This can be as simple as a column name (and often is) or an arbitrary expression applied to values of the result set.

HAVING condition [, ...]

The HAVING clause is similar to the WHERE clause, but checks its conditions on aggregated (grouped) sets instead of atomic rows.

{ UNION | INTERSECT | EXCEPT } [ ALL ] sub-query

Performs one of three set operations between the SELECT statement and a second query, returning their result sets in uniform column structure (which must be compatible). Duplicate rows are removed from the resultant set unless the ALL keyword is used.

UNION: Returns the set of collected rows.
INTERSECT: Returns the set of rows where the values of the two sets overlap.
EXCEPT: Returns the set of rows which are found in the SELECT statement, but not found in the secondary query.

ORDER BY expression

Sorts the results of the SELECT statement by expression.

[ ASC | DESC | USING operator ]

Determines whether or not the ORDER BY expression proceeds in ascending order (ASC), or descending order (DESC). An operator may alternatively be specified with the USING keyword (e.g., < or >).

FOR UPDATE [ OF table [, ...] ]

Allows for exclusive locking of the returned rows. When used within a transaction block, FOR UPDATE locks the rows of the specified table until the transaction is committed. While locked, the rows cannot be updated by other transactions.

LIMIT { count | ALL }

Limits the number of rows returned to a maximum of count, or explicitly allows ALL rows.

{ OFFSET | , } start

Instructs the LIMIT clause at what point to begin limiting the results. For example, a LIMIT with a count set to 100, and an OFFSET clause with a start value of 50 would return the rows from 50 to 150 (if there are that many results to return).

Terms used in the FROM clause’s syntax description are as follows:

[ ONLY ] table: The table name specifies what table to use as a source for the SELECT statement. Specifying the ONLY clause causes the rows of any child’s table to be omitted from the query.
[ AS ] alias: An alias may optionally be assigned to a FROM source, in order to simplify a query (e.g., books might be temporarily referenced with an alias of b). The AS term is considered noise, and is optional.
( query ) [ AS ] alias: Any valid SELECT statement may be placed in parentheses as the query. This causes the result set created by the query to be used as a FROM source, as if it had been a static table. This use of a sub-query requires a specified alias.
( column_alias [, ...] ): The FROM sources which have assigned aliases may also alias columns by specifying arbitrary column aliases. Each column_alias must be separated by commas, and grouped within parentheses following the FROM source’s alias. These aliases must match the order of the defined columns in the table to which it is applied.

A Simple SELECT

A SELECT statement may be as simple as a request for all rows and all columns from a specified table. Use the following syntax to retrieve all rows and columns from a table:

SELECT * FROM table_name;

The asterisk (*) character, as mentioned in the explanation of SELECT’s syntax, is short-hand for all non-system columns. In essence, the SELECT * requests all non-system data in the table named table_name; this retrieves all columns and all rows, because no row limit is specified. To demonstrate, Example 4-23 requests all columns (*) from Book Town’s books table.

Example 4-23. Selecting all from the books table

booktown=# SELECT * FROM books;
  id   |            title            | author_id | subject_id
-------+-----------------------------+-----------+------------
  7808 | The Shining                 |      4156 |          9
  4513 | Dune                        |      1866 |         15
  4267 | 2001: A Space Odyssey       |      2001 |         15
  1608 | The Cat in the Hat          |      1809 |          2
  1590 | Bartholomew and the Oobleck |      1809 |          2
 25908 | Franklin in the Dark        |     15990 |          2
  1501 | Goodnight Moon              |      2031 |          2
   190 | Little Women                |        16 |          6
  1234 | The Velveteen Rabbit        |     25041 |          3
  2038 | Dynamic Anatomy             |      1644 |          0
   156 | The Tell-Tale Heart         |       115 |          9
 41472 | Practical PostgreSQL        |      1212 |          4
 41473 | Programming Python          |      7805 |          4
 41477 | Learning Python             |      7805 |          4
 41478 | Perl Cookbook               |      7806 |          4
(15 rows)

Specifying Target Columns

While SELECT * is a good example of a basic query, and is sometimes very useful, you will probably be interested in retrieving only a few columns worth of information at a time. To stay efficient, and to keep your queries clear, it is a good idea to explicitly specify the intended target columns rather than to use the asterisk. This is especially true when using the JOIN clause, as will be discussed in the section titled Joining Data Sets with JOIN.

To specify the target columns for a query, list the names of the columns following the SELECT keyword. The query will return data for only those columns that you list. The order of these columns need not match their literal order in the table, and columns may be listed more than once, or not at all, as shown in Example 4-24.

Example 4-24. Re-ordering columns

booktown=# SELECT id, author_id, title, id
booktown-#        FROM books;
  id   | author_id |            title            |  id
-------+-----------+-----------------------------+-------
  7808 |      4156 | The Shining                 |   7808
  4513 |      1866 | Dune                        |   4513
  4267 |      2001 | 2001: A Space Odyssey       |   4267
  1608 |      1809 | The Cat in the Hat          |   1608
  1590 |      1809 | Bartholomew and the Oobleck |   1590
 25908 |     15990 | Franklin in the Dark        |  25908
  1501 |      2031 | Goodnight Moon              |   1501
   190 |        16 | Little Women                |    190
  1234 |     25041 | The Velveteen Rabbit        |   1234
  2038 |      1644 | Dynamic Anatomy             |   2038
   156 |       115 | The Tell-Tale Heart         |    156
 41472 |      1212 | Practical PostgreSQL        |  41472
 41473 |      7805 | Programming Python          |  41473
 41477 |      7805 | Learning Python             |  41477
 41478 |      7806 | Perl Cookbook               |  41478
(15 rows)

As you can see, the data sets returned in both Example 4-24 and Example 4-23 are nearly identical. The second set is returned in a different column arrangement, (omitting the subject_id column, and repeating the id column twice) as a result of the target list.

Expressions, Constants, and Aliases

In addition to plain column names, targets in the SELECT statement may be arbitrary expressions (e.g., involving functions, or operators acting upon identifiers), or constants. The syntax is simple, and only requires that each identifier, expression, or constant be separated by commas. Conveniently, different types of targets may be arbitrarily mixed in the target list.

In fact, the SELECT command may be used to retrieve expressions and constants without the use of a FROM clause or specified columns, as in Example 4-25.

Example 4-25. Using expressions and constants

testdb=# SELECT 2 + 2,
testdb-#        pi(),
testdb-#        'PostgreSQL is more than a calculator!';
 ?column? |        pi        |               ?column?
----------+------------------+---------------------------------------
       4  | 3.14159265358979 | PostgreSQL is more than a calculator!
(1 row)

The target list allows the use of an optional AS clause for each specified target, which re-names a column in the returned result set to an arbitrary name specified in the clause. The rules and limitations for the specified name are the same as for normal identifiers (e.g., they may be quoted to contain spaces, may not be keywords unless quoted, and so on).

Using AS has no lasting effect on the column itself, but only on the result set which is returned by the query. AS can be particularly useful when selecting expressions or constants, rather than plain columns. Naming result set columns with AS can clarify the meaning of an otherwise ambiguous expression or constant. This technique is demonstrated in Example 4-26, which shows the same results as Example 4-25, but with different column headings.

Example 4-26. Using the AS clause with expressions and constants

booktown=# SELECT 2 + 2 AS "2 plus 2",
booktown-#        pi() AS "the pi function",
booktown-#        'PostgreSQL is more than a calculator!' AS comments;
 2 plus 2  | the pi function  |               comments
----------+------------------+---------------------------------------
       4  | 3.14159265358979 | PostgreSQL is more than a calculator!
(1 row)

Selecting Sources with the FROM Clause

The FROM clause allows you to choose either a table or a result set as a source for your specified target list. Multiple sources may be entered following the FROM clause, separated by commas. Specifying multiple sources in this fashion is functionally similar to a CROSS JOIN, discussed in the section titled Joining Data Sets with JOIN.

Take care when specifying multiple FROM sources to PostgreSQL. The result of performing a SELECT on several comma-delimited sources without a WHERE or JOIN clause to qualify the relationship between the sources is that the complete Cartesian product of the sources will be returned. This is a result set where each column from each source is combined in every possible combination of rows between each other source.

Typically a WHERE clause is used to define the relationship between comma-delimited FROM sources, as shown in Example 4-27 (see the section titled Qualifying with the WHERE Clause for more information about the WHERE clause).

You must be careful when identifying column names and using multiple sources in the FROM clause, as it can introduce ambiguity between identifiers. Consider a SELECT that draws from both the books table and the authors table. Each of these tables has a column called id. If specified, PostgreSQL will be unable to determine if the id column refers to the book, or the author:

booktown=# SELECT id FROM books, authors;
ERROR:  Column reference "id" is ambiguous

As a result of the potential for ambiguity, “complete” column names can be referenced through a special syntax called dot-notation. Dot-notation refers to the placement of a dot, or period, between the table name and a column name, in order to explicitly reference a particular column. For example, books.id refers to the id column within the books table.

Dot-notation is only required in instances of ambiguity between data sets. As shown in Example 4-27, you can use the column name as an identifier source, as long as it is unique among the available sets defined by the FROM clause. (In this case, the title column, which is unique to the books table, and the last_name column, which is unique to the authors tables).

Example 4-27. Selecting from multiple table sources

booktown=# SELECT books.id, title, authors.id, last_name
booktown-#        FROM books, authors
booktown-#        WHERE books.author_id = authors.id;
  id   |            title            |  id   |  last_name
-------+-----------------------------+-------+--------------
   190 | Little Women                |    16 | Alcott
   156 | The Tell-Tale Heart         |   115 | Poe
 41472 | Practical PostgreSQL        |  1212 | Worsley
  2038 | Dynamic Anatomy             |  1644 | Hogarth
  1608 | The Cat in the Hat          |  1809 | Geisel
  1590 | Bartholomew and the Oobleck |  1809 | Geisel
  4513 | Dune                        |  1866 | Herbert
  4267 | 2001: A Space Odyssey       |  2001 | Clarke
  1501 | Goodnight Moon              |  2031 | Brown
  7808 | The Shining                 |  4156 | King
 41473 | Programming Python          |  7805 | Lutz
 41477 | Learning Python             |  7805 | Lutz
 41478 | Perl Cookbook               |  7806 | Christiansen
 25908 | Franklin in the Dark        | 15990 | Bourgeois
  1234 | The Velveteen Rabbit        | 25041 | Bianco
(15 rows)

If you wish to use a sub-query to generate a result set as a source for your FROM clause, the entire query must be surrounded by parentheses. This instructs PostgreSQL to correctly interpret the query as a sub -SELECT statement and to execute it before the SELECT statement within which it resides.

Example 4-28 demonstrates a peculiar query which retrieves all column values (*) from the books table via a sub-query. The query then retrieves a string constant of test and the id values from that result set (derived from the sub-query).

Example 4-28. Selecting from a sub-query

booktown=# SELECT 'test' AS test, id
booktown-#        FROM (SELECT * FROM books)
booktown-#        AS example_sub_query;
 test |  id
------+-------
 test |  7808
 test |  4513
 test |  4267
 test |  1608
 test |  1590
 test | 25908
 test |  1501
 test |   190
 test |  1234
 test |  2038
 test |   156
 test | 41472
 test | 41473
 test | 41477
 test | 41478
(15 rows)

The query in Example 4-28 is rather peculiar because the net effect is no different than if you had selected from the books table. This occurs because the result set from the sub-query is identical to the set of values in the books table. The use of this query demonstrates the combination of a string constant from one SELECT statement with a value drawn from the result set of a second SELECT statement. See the section titled Using Sub-Queries for more realistic examples of sub-queries once you have a better understanding of the SELECT statement itself.

Note

When specifying a table that is inherited by other tables, you may provide the optional ONLY keyword before the table name to indicate that you do not want to draw from any sub-tables. (See Chapter 7 for more information on inheritance.)

Aliasing FROM Sources

Like columns, FROM sources (e.g., tables, or sub-queries) may be aliased with the AS clause. This is usually applied as a convenient shorthand for the dot-notation described in the preceding section. Aliasing a data set allows you to refer to it via dot-notation, which provides a more succinct and readable SQL statement. Example 4-29 demonstrates the same query used in Example 4-27, however you can see that it simplifies the dot-notation with the AS clause.

Example 4-29. Aliasing FROM sources

booktown=# SELECT b.id, title, a.id, last_name
booktown-#        FROM books AS b, authors AS a
booktown-#        WHERE b.author_id = a.id;
  id   |            title            |   id  |  last_name
-------+-----------------------------+-------+--------------
   190 | Little Women                |    16 | Alcott
   156 | The Tell-Tale Heart         |   115 | Poe
 41472 | Practical PostgreSQL        |  1212 | Worsley
  2038 | Dynamic Anatomy             |  1644 | Hogarth
  1608 | The Cat in the Hat          |  1809 | Geisel
  1590 | Bartholomew and the Oobleck |  1809 | Geisel
  4513 | Dune                        |  1866 | Herbert
  4267 | 2001: A Space Odyssey       |  2001 | Clarke
  1501 | Goodnight Moon              |  2031 | Brown
  7808 | The Shining                 |  4156 | King
 41473 | Programming Python          |  7805 | Lutz
 41477 | Learning Python             |  7805 | Lutz
 41478 | Perl Cookbook               |  7806 | Christiansen
 25908 | Franklin in the Dark        | 15990 | Bourgeois
  1234 | The Velveteen Rabbit        | 25041 | Bianco
(15 rows)

In addition to placing aliases on the FROM clause’s data sources, you can place aliases on the columns within that source. This is done by following a valid data source’s alias with a list of column aliases, grouped in parentheses and separated by commas. A column alias list therefore consists of a sequence of identifier aliases for each column, which correspond to the literal columns in the order that the table is defined with (from left to right).

When describing a column alias list, you do not need to specify each column; any column that is left unspecified is accessible via its normal name within such a query. If the only column you wish to alias is to the right of any other columns that you do not necessarily wish to alias, you will need to explicitly list the preceding columns (it is valid to list the same name for an existing column as its “alias”). Otherwise, PostgreSQL will have no way of knowing which column you were attempting to alias and will assume you were addressing the first column from the left.

Note

The AS keyword is technically considered noise, and may be omitted in practice; PostgreSQL determines that any stray identifiers following a FROM source may be used as aliases.

Example 4-30 illustrates the same query that is used in Example 4-29 but aliases the id columns in each table to unique identifiers in order to reference them directly (i.e., without dot-notation). The syntax is functionally identical, aliasing only the books table’s id column, thus making the authors table’s id column non-ambiguous:

Example 4-30. Aliasing columns

booktown=# SELECT the_book_id, title, id, last_name
booktown-#        FROM books AS b (the_book_id), authors
booktown-#        WHERE author_id = id;
 the_book_id |            title            |  id   |  last_name
-------------+-----------------------------+-------+--------------
         190 | Little Women                |    16 | Alcott
         156 | The Tell-Tale Heart         |   115 | Poe
       41472 | Practical PostgreSQL        |  1212 | Worsley
        2038 | Dynamic Anatomy             |  1644 | Hogarth
        1608 | The Cat in the Hat          |  1809 | Geisel
        1590 | Bartholomew and the Oobleck |  1809 | Geisel
        4513 | Dune                        |  1866 | Herbert
        4267 | 2001: A Space Odyssey       |  2001 | Clarke
        1501 | Goodnight Moon              |  2031 | Brown
        7808 | The Shining                 |  4156 | King
       41473 | Programming Python          |  7805 | Lutz
       41477 | Learning Python             |  7805 | Lutz
       41478 | Perl Cookbook               |  7806 | Christiansen
       25908 | Franklin in the Dark        | 15990 | Bourgeois
        1234 | The Velveteen Rabbit        | 25041 | Bianco
(15 rows)

Removing Duplicate Rows with DISTINCT

The optional DISTINCT keyword excludes duplicate rows from the result set. If supplied without the ON clause, a query that specifies DISTINCT will exclude any row whose target columns have already been retrieved identically. Only columns in the SELECT’s target list will be evaluated.

For example, the books table has 15 rows, each with an author_id. Some authors may have several entries in the books table, causing there to be several rows with the same author_id. Supplying the DISTINCT clause, as shown in the first query in Example 4-31, ensures that the result set will not have two identical rows.

Example 4-31. Using DISTINCT

booktown=# SELECT DISTINCT author_id
booktown-#        FROM books;
 author_id
-----------
        16
       115
      1212
      1644
      1809
      1866
      2001
      2031
      4156
      7805
      7806
     15990
     25041
(13 rows)

booktown=# SELECT DISTINCT ON (author_id)
booktown-#        author_id, title
booktown-#        FROM books;
 author_id |         title
-----------+-----------------------
        16 | Little Women
       115 | The Tell-Tale Heart
      1212 | Practical PostgreSQL
      1644 | Dynamic Anatomy
      1809 | The Cat in the Hat
      1866 | Dune
      2001 | 2001: A Space Odyssey
      2031 | Goodnight Moon
      4156 | The Shining
      7805 | Programming Python
      7806 | Perl Cookbook
     15990 | Franklin in the Dark
     25041 | The Velveteen Rabbit
(13 rows)

As you can see, the first query in Example 4-31 returns only 13 rows from the books table, even though there are 15 total rows within it. Two authors with two books each end up being displayed only once.

The second query in Example 4-31 uses a different form of DISTINCT, which specifies the columns (or expressions) to be checked for redundancies. In this case, 13 rows are still returned, as the ON clause specifies to use the author_id column as the basis for determining if a row is redundant or not. Without the ON clause, the second query would return all 15 rows, because the DISTINCT clause would cause PostgreSQL to look for rows that are completely unique.

The titles that are omitted from the resultant data set by ON are arbitrarily determined by PostgreSQL, unless an ORDER BY clause is specified. If the ORDER BY clause is used with DISTINCT, you can specify the order in which columns are selected; hence, you can select which rows will be considered distinct first. See the section titled Sorting Rows with ORDER BY for information about sorting rows.

If you are interested in grouping rows which have non-unique criteria, rather than omitting all rows but one, see the description of the GROUP BY clause in the section titled Grouping Rows with GROUP BY.

Qualifying with the WHERE Clause

The WHERE clause allows you to provide Boolean (true or false) conditions that rows must satisfy to be included in the resulting row set. In practice, a SELECT statement will almost always contain at least one qualification via the WHERE clause.

Suppose that you want to see all of the books in Book Town’s Computers section. The subject_id for the Computers subject is 4. Therefore, the WHERE clause can be applied with an equivalence operation (the = operator) to check for rows in the books table with a subject_id equal to 4. This is shown in Example 4-32.

Example 4-32. A simple WHERE clause

booktown=# SELECT * FROM books
booktown-#          WHERE subject_id = 4;
  id   |        title         | author_id | subject_id
-------+----------------------+-----------+------------
 41472 | Practical PostgreSQL |      1212 |          4
 41473 | Programming Python   |      7805 |          4
 41477 | Learning Python      |      7805 |          4
 41478 | Perl Cookbook        |      7806 |          4
(4 rows)

The query in Example 4-32 returns only rows whose subject_id column matches the integer constant value of 4. Thus, only the four rows for computer books are returned, rather than the 15 rows shown by the simple query in Example 4-23.

The WHERE clause accepts numerous conditions, provided that they are joined by valid logical keywords (e.g., the AND, and OR keywords) and returns a single Boolean condition. For example, you may be interested in seeing all Book Town titles that fall under the Computers subject which are also by the author Mark Lutz, thus joining two conditions to narrow the focus of your query. Alternatively, you might be interested in seeing each of Book Town’s titles that fall under either the Computers subject or the Arts subject, thereby joining two conditions to broaden the focus of your intended result set. Example 4-33 demonstrates each of these scenarios using the AND keyword and OR keyword, respectively.

Example 4-33. Combining conditions in the WHERE clause

booktown=# SELECT title FROM books
booktown-#              WHERE subject_id = 4
booktown-#              AND author_id = 7805;
       title
--------------------
 Programming Python
 Learning Python
(2 rows)

booktown=# SELECT title FROM books
booktown-#              WHERE subject_id = 4
booktown-#              OR subject_id = 0;
        title
----------------------
 Dynamic Anatomy
 Practical PostgreSQL
 Programming Python
 Learning Python
 Perl Cookbook
(5 rows)

The first SELECT statement in Example 4-33 combines one condition, which checks for titles in the Computers subject (with a subject_id of 4), with another condition, which checks if the author is Mark Lutz (with an author_id of 7805) via the AND keyword. The result is a smaller data set, constrained to two rows that fit both specified conditions.

The second SELECT statement in Example 4-33 combines the same first condition (books in the Computers subject) with a second condition: if the title falls under the Arts subject (with a subject_id of 0). The result is a slightly larger data set of five rows that matched at least one of these conditions.

WHERE conditions may be grouped together indefinitely, though after two conditions you may wish to group the conditions with parentheses. Doing so explicitly indicates how the conditions are interrelated. As a demonstration, the two statements in Example 4-34 have different effects based merely on the addition of parentheses.

Example 4-34. Grouping WHERE conditions with parentheses

booktown=# SELECT * FROM books
booktown-#          WHERE author_id = 1866
booktown-#          AND subject_id = 15
booktown-#          OR subject_id = 3;
  id  |        title         | author_id | subject_id
------+----------------------+-----------+------------
 4513 | Dune                 |      1866 |         15
1234  | The Velveteen Rabbit |     25041 |          3
(2 rows)

booktown=# SELECT * FROM books
booktown-#          WHERE author_id = 1866
booktown-#          AND (subject_id = 15
booktown(#               OR subject_id = 3);
  id  | title | author_id | subject_id
------+-------+-----------+------------
 4513  | Dune  |      1866 |         15
(1 row)

The preceding example demonstrates two attempts to look up Book Town titles with an author_id of 1866. The titles also have a subject_id of either 15, or 3. As you can see from the first statement, when the three conditions are used without parentheses, the intent of the statement is ambiguous, and interpreted incorrectly. The addition of parentheses will cause the evaluations within parentheses to be considered before any surrounding condition.

Joining Data Sets with JOIN

As demonstrated by the use of the WHERE clause on two table sources in the section titled Selecting Sources with the FROM Clause, you have the ability to retrieve data from different data sources by combining their columns into joined rows. In SQL, this process is formally called a join.

The essential concept behind a join is that two or more data sets, when joined, have their columns combined into a new set of rows containing each of the columns requested from each of the data sets. The foundation of all joins is the Cartesian product, which is the set of all possible combinations between two data sets. That product may then be refined into a smaller subset by a set of criteria in the JOIN syntax. These criteria describe a relationship between data sets, though such a definition is not required.

There are three general types of joins:

Cross joins

Creates a Cartesian product (or cross product) between two sets of data. It is called a product because it does not define a relationship between the sets; instead, it returns every possible combination of rows between the joined sets, essentially multiplying the sources by one another.

Inner joins

Creates a subset of the Cartesian product between two sets of data, requiring a conditional clause to specify criteria upon which to join records. The condition must return a Boolean value to determine whether or not a row is included in the joined set.

Outer joins

Similar to an inner join, in that it accepts criteria which will match rows between two sets of data, but returns at least one instance of each row from a specified set. This is either the left set (the data source to the left of the JOIN keyword), the right set (the data source to the right of the JOIN keyword), or both sets, depending on the variety of outer join employed. The missing column values for the empty half of the row which does not meet the join condition are returned as NULL values.

Cross joins

A cross join is functionally identical to listing comma-delimited sources. It therefore should almost always be accompanied by a WHERE clause to qualify the relationship between the joined data sets. Example 4-35 demonstrates the same functional query used in Example 4-27, substituting the formal JOIN syntax for the comma.

Example 4-35. A simple CROSS JOIN

booktown=# SELECT b.id, title, a.id, last_name
booktown-#        FROM books AS b CROSS JOIN authors AS a
booktown-#        WHERE b.author_id = a.id;
  id   |            title            |  id   |  last_name
-------+-----------------------------+-------+--------------
   190 | Little Women                |    16 | Alcott
   156 | The Tell-Tale Heart         |   115 | Poe
 41472 | Practical PostgreSQL        |  1212 | Worsley
  2038 | Dynamic Anatomy             |  1644 | Hogarth
  1608 | The Cat in the Hat          |  1809 | Geisel
  1590 | Bartholomew and the Oobleck |  1809 | Geisel
  4513 | Dune                        |  1866 | Herbert
  4267 | 2001: A Space Odyssey       |  2001 | Clarke
  1501 | Goodnight Moon              |  2031 | Brown
  7808 | The Shining                 |  4156 | King
 41473 | Programming Python          |  7805 | Lutz
 41477 | Learning Python             |  7805 | Lutz
 41478 | Perl Cookbook               |  7806 | Christiansen
 25908 | Franklin in the Dark        | 15990 | Bourgeois
  1234 | The Velveteen Rabbit        | 25041 | Bianco
(15 rows)

This syntax is merely a more formal way of stating the relationship between the two data sets. There is no functional difference between the CROSS JOIN syntax and using a simple comma delimited list of tables.

Inner and outer join syntax

More useful are the inner and outer joins, which require a qualification of the relationship between joined data sets in the JOIN syntax itself. The following is the syntax for an inner or outer join:

source1 [ NATURAL ] join_type source2
[ ON ( condition [, ...] ) | USING ( column [, ...] ) ]

source1

Identifies the first data set that is being joined (i.e., a table name or sub-query).

[ NATURAL ]

Implies that the two data sets should be joined on equivalent values between like-named columns (e.g., if two tables have a column called id, it will join rows where the id values are equivalent). The NATURAL clause will respect column aliases, if applied. The use of the NATURAL clause makes it both unnecessary and invalid to try to specify either of the ON or USING clauses.

join_type

Specifies the type of JOIN intended. Valid values in this context are [ INNER ] JOIN (specifying just JOIN implies an INNER JOIN), LEFT [ OUTER] JOIN, RIGHT [ OUTER ] JOIN, and FULL [ OUTER ] JOIN.

source2

Identifies the second data set that is being joined (i.e., a table name, or sub-query).

ON ( condition [, ...] )

Identifies the second data set that is being joined (i.e., a table name, or sub-query).

Specifies the relationship between source1 and source2. Any arbitrary criteria may be specified within the ON clause, just as you would specify conditions following a WHERE clause. Column and table aliases are allowed in this criteria.

USING ( column [, ...] )

Specifies like-named columns between source1 and source2 with which to join rows by equivalent values. Similar to a NATURAL JOIN, but allows you to indicate what specific columns to join on, whereas NATURAL will join on all like-named columns. Similar to NATURAL joins, column aliases are respected in the USING clause’s parameters.

Inner joins

The SQL92 INNER JOIN syntax is a tool that helps differentiate the conditions with which you are joining data sources (the JOIN conditions) from the conditions with which you are evaluating rows for inclusion in your data set (the WHERE conditions). For example, consider the two SELECT statements in Example 4-36.

Example 4-36. Comparing INNER JOIN to WHERE

booktown=# SELECT title, last_name, first_name
booktown-#        FROM books, authors
booktown-#        WHERE (books.author_id = authors.id)
booktown-#        AND last_name = 'Geisel';
           title             | last_name |  first_name
-----------------------------+-----------+---------------
The Cat in the Hat           | Geisel    | Theodor Seuss
Bartholomew and the Oobleck  | Geisel    | Theodor Seuss
(2 rows)

booktown=# SELECT title, last_name, first_name
booktown-#        FROM books AS b INNER JOIN authors AS a
booktown-#        ON (b.author_id = a.id)
booktown-#        WHERE last_name = 'Geisel';
           title             | last_name |  first_name
-----------------------------+-----------+---------------
The Cat in the Hat           | Geisel    | Theodor Seuss
Bartholomew and the Oobleck  | Geisel    | Theodor Seuss
(2 rows)

The two forms of syntax in Example 4-36 are functionally identical, and return the same results. The INNER JOIN syntax allows you to segregate the relational criteria from your evaluation criteria by only defining the set relationships in the ON clause. This can make involved queries much easier to read and maintain, as you do not need to interpret what each condition described by the WHERE clause is conceptually achieving.

Notice that the second query demonstrates the use of aliases b and a in the ON clause for the books and authors tables, respectively. The use of these aliases in the ON clause is perfectly valid, and often preferable from a perspective of improved readability.

In cases of simple equivalence joins, it may be more convenient for you to use either the USING or NATURAL clauses instead of the ON clause. These are only applicable on data sets with identically named columns. If you have columns that define a relationship between two sets that are not identically named, you may still use the USING or NATURAL clauses by employing column aliases, as demonstrated in Example 4-37, to re-name one or both of the columns to a uniform name.

Example 4-37. The NATURAL and USING clauses

booktown=# SELECT title, last_name, first_name
booktown-#        FROM books INNER JOIN authors AS a (author_id)
booktown-#        USING (author_id)
booktown-#        WHERE last_name = 'Geisel';
            title            | last_name |  first_name
-----------------------------+-----------+---------------
The Cat in the Hat           | Geisel    | Theodor Seuss
Bartholomew and the Oobleck  | Geisel    | Theodor Seuss
(2 rows)


booktown=# SELECT title, last_name, first_name
booktown-#        FROM books NATURAL INNER JOIN authors AS a (author_id)
booktown-#        WHERE last_name = 'Geisel';
            title            | last_name |  first_name
-----------------------------+-----------+---------------
The Cat in the Hat           | Geisel    | Theodor Seuss
Bartholomew and the Oobleck  | Geisel    | Theodor Seuss
(2 rows)

The first SELECT statement in Example 4-37 assigns the alias of author_id to the first column in the authors table (which is actually named id). By passing the author_id identifier to the USING clause, PostgreSQL then searches for a column identifier in each data set with that name to join rows on values found to be equivalent.

Inner joins are adequate for a wide variety of queries, but there are times when an outer join is required to get all of the data you need. The key to understanding the difference between inner and outer joins is in knowing how each type of join handles rows that do not meet their defined relationship.

In short, an inner join will discard any row for which it cannot find a corresponding value between the sets being joined (as specified by either the ON or USING clause).

Outer joins

In contrast to inner joins, an outer join can retain rows where corresponding values between sets are not found, populating the missing columns with NULL values. Whether or not the outer join does retain that row depends on which set is missing the value and the kind of outer join that is specified.

There are three forms of outer joins:

Left outer joins

Will always return at least one instance of each row in the set of rows to the left of the JOIN keyword. Missing columns in the right set are populated with NULL values.

Right outer joins

Will always return at least one instance of each row in the set of rows to the right of the JOIN keyword. Missing columns in the left set are populated with NULL values.

Full outer joins

Will always return at least one instance of each row in each joined set. Missing columns on either side of the new set will be populated with NULL values.

Consider again Book Town’s books table, and another Book Town table called editions. While the books table stores general information on a given title, the editions table stores specific information pertaining to each edition, such as an the book’s ISBN, publisher, and publication date. The editions table has a column called book_id which corresponds to the books table’s primary key column, id.

Suppose that you want to retrieve each of Book Town’s titles, along with its isbn, if applicable. Performing a query with an inner join between the books and editions tables will correctly return a data set with title and isbn columns. However, as demonstrated in Example 4-38, if a book does not yet have a printed edition (or if that edition has not yet been entered into Book Town’s database), those titles will not be displayed.

In contrast, the statement immediately following the inner join in Example 4-38 employs an outer join, returning 20 rows. Three of the returned rows do not have ISBN numbers, but are not omitted due to the definition of the join.

Example 4-38. Inner joins versus outer joins

booktown=# SELECT title, isbn
booktown-#        FROM books INNER JOIN editions
booktown-#        ON (books.id = editions.book_id);
            title            |    isbn
-----------------------------+------------
 The Tell-Tale Heart         | 1885418035
 The Tell-Tale Heart         | 0929605942
 Little Women                | 0760720002
 The Velveteen Rabbit        | 0679803335
 Goodnight Moon              | 0694003611
 Bartholomew and the Oobleck | 0394800753
 The Cat in the Hat          | 039480001X
 The Cat in the Hat          | 0394900014
 Dynamic Anatomy             | 0823015505
 2001: A Space Odyssey       | 0451457994
 2001: A Space Odyssey       | 0451198492
 Dune                        | 0441172717
 Dune                        | 044100590X
 The Shining                 | 0451160916
 The Shining                 | 0385121679
 Franklin in the Dark        | 0590445065
 Programming Python          | 0596000855
(17 rows)

booktown=# SELECT title, isbn
booktown-#        FROM books LEFT OUTER JOIN editions
booktown-#        ON (books.id = editions.book_id);
            title            |    isbn
-----------------------------+------------
 The Tell-Tale Heart         | 1885418035
 The Tell-Tale Heart         | 0929605942
 Little Women                | 0760720002
 The Velveteen Rabbit        | 0679803335
 Goodnight Moon              | 0694003611
 Bartholomew and the Oobleck | 0394800753
 The Cat in the Hat          | 039480001X
 The Cat in the Hat          | 0394900014
 Dynamic Anatomy             | 0823015505
 2001: A Space Odyssey       | 0451457994
 2001: A Space Odyssey       | 0451198492
 Dune                        | 0441172717
 Dune                        | 044100590X
 The Shining                 | 0451160916
 The Shining                 | 0385121679
 Franklin in the Dark        | 0590445065
 Practical PostgreSQL        |
 Programming Python          | 0596000855
 Learning Python             |
 Perl Cookbook               |
(20 rows)

The join specified by the second query in Example 4-38 uses the LEFT OUTER JOIN clause to define its join type. This is because the query focuses on titles from the books table that have ISBN numbers, and not those editions having ISBN numbers that do not correspond to titles. As the books table is to the left of the JOIN keyword, it is defined as a left outer join to achieve this. If the focus of the query was to see both ISBN numbers without titles as well as titles without ISBN numbers, the same query could instead be modified to be a full outer join with the FULL OUTER JOIN clause.

The difference between inner and outer joins illustrated in Example 4-38 is a vital concept to understand, as misuse of joins can lead to both omitted and unexpected rows.

Note

The actual OUTER keyword is an optional term in a PostgreSQL outer join. Specifying a join as either a LEFT JOIN, RIGHT JOIN or FULL JOIN implicitly defines it as an outer join.

Intricate joins

It should be understood that while a single JOIN clause connects only two sets of data, in practice, joins are not restricted to only two data sources. You may arbitrarily specify numerous JOIN clauses following sets that are themselves constructed from joins, just as you may specify numerous data sources separated by commas.

When connecting several joins together, it is a good practice to group each join and sub-join within parentheses. Explicitly grouping joins in this fashion insures that there is no ambiguity, to either PostgreSQL or a developer, as to which data sets are joined, and in what order.

Example 4-39. Joining many data sources

booktown=# SELECT a.last_name, p.name AS publisher, e.isbn, s.subject
booktown-#        FROM ((((authors AS a INNER JOIN books AS b
booktown(#                 ON (a.id = b.author_id))
booktown(#        INNER JOIN editions AS e ON (e.book_id = b.id))
booktown(#        INNER JOIN publishers AS p ON (p.id = e.publisher_id))
booktown(#        INNER JOIN subjects AS s ON (s.id = b.subject_id));
 last_name |          publisher          |    isbn    |     subject
-----------+-----------------------------+------------+------------------
 Hogarth   | Watson-Guptill Publications | 0823015505 | Arts
 Brown     | HarperCollins               | 0694003611 | Children's Books
 Geisel    | Random House                | 0394800753 | Children's Books
 Geisel    | Random House                | 039480001X | Children's Books
 Geisel    | Random House                | 0394900014 | Children's Books
 Bourgeois | Kids Can Press              | 0590445065 | Children's Books
 Bianco    | Penguin                     | 0679803335 | Classics
 Lutz      | O'Reilly & Associates       | 0596000855 | Computers
 Alcott    | Henry Holt & Company, Inc.  | 0760720002 | Drama
 Poe       | Mojo Press                  | 1885418035 | Horror
 Poe       | Books of Wonder             | 0929605942 | Horror
 King      | Doubleday                   | 0451160916 | Horror
 King      | Doubleday                   | 0385121679 | Horror
 Clarke    | Roc                         | 0451457994 | Science Fiction
 Clarke    | Roc                         | 0451198492 | Science Fiction
 Herbert   | Ace Books                   | 0441172717 | Science Fiction
 Herbert   | Ace Books                   | 044100590X | Science Fiction
(17 rows)

An interesting observation to be made about Example 4-39 is that, while the books table is itself deeply involved in the join, none of its columns are retrieved in the final result set. The books table is included in the JOIN clauses in order to provide criteria through which other tables are joined together. Each of the tables whose columns are retrieved in the query rely on the books table in order to draw relationships with any other table through the id column (with the exception of the publishers table, which relates to the publisher_id column in the editions table).

Grouping Rows with GROUP BY

The GROUP BY clause introduces a powerful SQL concept: aggregation. To aggregate means to gather into a sum, or whole. The practical effect of aggregating in a SQL query is that any rows whose results from the GROUP BY expression match identically are grouped together into a single aggregate row. The GROUP BY expression may define a column, but it may also be any operation upon a column as well. If several columns or expressions are specified (delimited by commas), the entire set of specified criteria must be identical for rows to be grouped together.

To effectively use aggregation you must understand that any target columns requested by an aggregating query which are not specified in the GROUP BY clause will be inaccessible, unless selected through an aggregate function. An aggregate function accepts a column name (or expression involving at least one column name) which can represent several values (i.e., from several grouped rows), performs an operation on those values, and returns a single value.

Common aggregate functions include count(), which returns the number of rows in the set, max(), which returns the maximum value in the column, and min(), which returns the minimum value in the column. An aggregate function operates only on rows in the query’s result set, and is therefore executed after conditional joins and WHERE conditions have been processed.

Imagine that you wanted to know how many books Book Town stores in its database for each known publisher. You could perform a simple join between the editions and publishers tables in order to associate each publisher name with a title that they publish. It would be tedious to manually count how many titles each publisher maintained, and in cases of larger data sets, it can become difficult to manage larger result sets.

Example 4-40 demonstrates a join between these two Book Town tables, but also introduces two new elements: the count() function, and the GROUP BY clause.

Example 4-40. Using GROUP BY

booktown=# SELECT count(e.isbn) AS "number of books",
booktown-#        p.name AS publisher
booktown-#        FROM editions AS e INNER JOIN publishers AS p
booktown-#             ON (e.publisher_id = p.id)
booktown-#        GROUP BY p.name;
 number of books |          publisher
-----------------+-----------------------------
               2 | Ace Books
               1 | Books of Wonder
               2 | Doubleday
               1 | HarperCollins
               1 | Henry Holt & Company, Inc.
               1 | Kids Can Press
               1 | Mojo Press
               1 | O'Reilly & Associates
               1 | Penguin
               3 | Random House
               2 | Roc
               1 | Watson-Guptill Publications
(12 rows)

The GROUP BY clause in Example 4-40 instructs PostgreSQL to group the rows in the joined data set by p.name, which in this query is a reference to the name column in the publishers table. Therefore, any rows that have the same publisher name will be grouped together, or aggregated. The count() function then counts the number of isbn values from the editions table that are in each aggregated row, and returns a single numeric value representing the number of rows that were aggregated for each unique publisher.

Note that in Example 4-40 the argument of the editions table’s isbn column is chosen simply to indicate the objective of the example (to count how many books there are per publisher). Any column name will return the same number, as count() will always return the number of rows grouped in the current aggregate row.

Something to watch out for when designing aggregate queries is that the WHERE clause cannot accept criteria involving aggregate functions. Instead, use the HAVING clause. It functions identically to the WHERE clause, but its conditions must be on aggregate functions rather than single-row conditions. Syntactically, the HAVING clause follows the GROUP BY clause, as demonstrated in Example 4-41.

Example 4-41. Using the HAVING clause

booktown=# SELECT count(e.isbn) AS "number of books",
booktown-#        p.name AS publisher
booktown-#        FROM editions AS e INNER JOIN publishers AS p
booktown-#             ON (e.publisher_id = p.id)
booktown-#        GROUP BY publisher
booktown-#        HAVING count(e.isbn) > 1;
 number of books |  publisher
-----------------+--------------
               2 | Ace Books
               2 | Doubleday
               3 | Random House
               2 | Roc
(4 rows)

Both Example 4-40 and Example 4-41 create a data set through an inner join between the editions and publishers table. However, Example 4-41 constrains the final result to publishers having more than a single book in the Book Town database, as set by the HAVING clause.

Note

If a result set’s column is aliased via an AS clause to a name that overlaps with a real column in one of the source data sets, and used in the GROUP BY clause, PostgreSQL will assume that you are referring to the input column, not the output alias.

Sorting Rows with ORDER BY

As described in Chapter 3, row data is not stored in a consistent order within tables. In fact, an identical query executed twice is in no way guaranteed to return the rows in the same order each time. As order is commonly an important part of retrieving data for database-dependent applications, use the ORDER BY clause to allow flexible sorting of your result set.

The ORDER BY clause accepts as its parameters a list of comma-delimited column names (or expressions upon columns), which are used as sorting criteria. For each sort criteria, you may optionally apply either the ASC, DESC, or USING keywords to control the type of sorting employed:

ASC: Causes the rows to sort by the related criteria in an ascending fashion (e.g., numbers will be sorted lowest to highest, text will be sorted alphabetically from a to z). ASC is equivalent to specifying USING <. Since it is the default behavior, specifying ASC is only useful for explicit readability.
DESC: Causes the rows to sort by the related criteria in a descending fashion (e.g., numbers will be sorted highest to lowest, text will be sorted alphabetically from z to a). DESC is equivalent to specifying USING >.
USING operator: Allows the specification of the operator operator to be used to compare each column for precedence. This can be particularly useful for custom operators.

Example 4-42 demonstrates the use of the ORDER BY clause on the editions table. It specifies the publication column as the source of values to sort by, and explicitly declares the ordering method as an ascending (ASC) sort.

Example 4-42. Using ORDER BY

booktown=# SELECT isbn, edition, publication
booktown-#        FROM editions
booktown-#        ORDER BY publication ASC;
    isbn    | edition | publication
------------+---------+-------------
 0760720002 |       1 | 1868-01-01
 0679803335 |       1 | 1922-01-01
 0694003611 |       1 | 1947-03-04
 0394800753 |       1 | 1949-03-01
 0394900014 |       1 | 1957-01-01
 039480001X |       1 | 1957-03-01
 0823015505 |       1 | 1958-01-01
 0451160916 |       1 | 1981-08-01
 0590445065 |       1 | 1987-03-01
 0385121679 |       2 | 1993-10-01
 1885418035 |       1 | 1995-03-28
 0441172717 |       2 | 1998-09-01
 0929605942 |       2 | 1998-12-01
 044100590X |       3 | 1999-10-01
 0451198492 |       3 | 1999-10-01
 0451457994 |       3 | 2000-09-12
 0596000855 |       2 | 2001-03-01
(17 rows)

As you can see in the result set from Example 4-42, the rows return in ascending order, from the oldest date to the newest. It should be noted that even columns and expressions that do not appear in the target list of the SELECT statement may be used to sort the retrieved rows. Furthermore, aggregate functions and expressions are allowed by the ORDER BY clause if the query involves aggregation. The ability to sort by such a wide scope of sources thus allows for a great deal of flexibility in ordering results from a variety of query approaches.

Note

If a column alias in the result set has the same name as a literal column in an input source from which it is drawing rows, and it is used in the ORDER BY clause, PostgreSQL will assume that it is a reference to the named column in the result set, not the column in the source set. This is an accepted inconsistency compared against the default behavior of the GROUP BY clause, as specified by the SQL92 standard.

When specifying multiple expressions to sort by, the result set will be ordered by the first criteria (from left to right), and will only process subsequent sorting criteria if the first condition’s sort is inconclusive. For example, consider the sorting performed in Example 4-43.

Example 4-43. Using ORDER BY with multiple expressions

booktown=# SELECT edition, publication
booktown-#        FROM editions
booktown-#        ORDER BY edition ASC,
booktown-#                 publication DESC;
 edition | publication
---------+-------------
       1 | 1995-03-28
       1 | 1987-03-01
       1 | 1981-08-01
       1 | 1958-01-01
       1 | 1957-03-01
       1 | 1957-01-01
       1 | 1949-03-01
       1 | 1947-03-04
       1 | 1922-01-01
       1 | 1868-01-01
       2 | 2001-03-01
       2 | 1998-12-01
       2 | 1998-09-01
       2 | 1993-10-01
       3 | 2000-09-12
       3 | 1999-10-01
       3 | 1999-10-01
(17 rows)

The query in Example 4-43 selects the numeric edition and publication date of each book from the editions table. The ORDER BY clause then specifies two columns to sort by: edition, in ascending order, and publication, in descending order.

As you can see in the result set for Example 4-43, each row is first sorted by edition, proceeding from the lower editions to the higher editions. Subsequently, wherever the editions are identical, the publication date is used to then sort again, from the most recent publication date to the least recent.

Sorting is extremely relevant when using the DISTINCT keyword, as discussed in the section titled Removing Duplicate Rows with DISTINCT. If you are only interested in seeing the most recently published copy of each edition in the editions table, the ORDER BY and DISTINCT clauses can be combined to achieve an effect somewhat similar to the GROUP BY clause, as shown in Example 4-44.

Example 4-44. Using DISTINCT with ORDER BY

booktown=# SELECT DISTINCT ON (edition)
booktown-#        edition, publication
booktown-#        FROM editions
booktown-#        ORDER BY edition ASC,
booktown-#                 publication DESC;
 edition | publication
---------+-------------
       1 | 1995-03-28
       2 | 2001-03-01
       3 | 2000-09-12
(3 rows)

booktown=# SELECT edition, max(publication)
booktown-#        FROM editions
booktown-#        GROUP BY edition;
 edition |    max
---------+------------
       1 | 1995-03-28
       2 | 2001-03-01
       3 | 2000-09-12
(3 rows)

Since the ORDER BY occurring before the DISTINCT clause eliminates duplicate rows, the net effect can be very similar to using the max() or min() with a GROUP BY clause. This technique can sometimes be more efficient, depending on the complexity of the aggregation and sorting involved.

Note

While never strictly necessary, PostgreSQL can accept integer constants as expressions in the ORDER BY clause, instead of column names or expressions. Such a constant will be interpreted as representing the column that is at the numbered position in the target list, from left to right, starting at 1 (e.g., ORDER BY 1 ASC references the first column in the result set).

Setting Row Range with LIMIT and OFFSET

PostgreSQL enforces no limit upon the number of rows retrievable from a SQL query. If you attempt to execute a query that returns several million rows, it may take a while, but the server will not stop until it has returned the entire result set (or until it is interrupted).

Applications could conceivably be written to programmatically “page” through large sets of data after retrieval, but SQL provides as a convenience the LIMIT and OFFSET clauses, which allow for the retrieval of a specified portion of the generated result set.

When the LIMIT clause is specified, no more than the requested number of rows will be returned (though there may be fewer if the result set is smaller than the passed parameter). When the OFFSET clause is specified, it skips the number of rows defined by its parameters before returning rows. If both are specified, the number of rows to be included as per the LIMIT clause will not be counted until the number of rows dictated by the OFFSET clause have been skipped.

Example 4-45. Using LIMIT and OFFSET

booktown=# SELECT isbn, title, publication
booktown-#        FROM editions NATURAL JOIN books AS b (book_id)
booktown-#        ORDER BY publication DESC
booktown-#        LIMIT 5;
    isbn    |         title         | publication
------------+-----------------------+-------------
 0596000855 | Programming Python    | 2001-03-01
 0451457994 | 2001: A Space Odyssey | 2000-09-12
 0451198492 | 2001: A Space Odyssey | 1999-10-01
 044100590X | Dune                  | 1999-10-01
 0929605942 | The Tell-Tale Heart   | 1998-12-01
(5 rows)

booktown=# SELECT isbn, title, publication
booktown-#        FROM editions NATURAL JOIN books AS b (book_id)
booktown-#        ORDER BY publication DESC
booktown-#        LIMIT 5
booktown-#        OFFSET 2;
    isbn    |         title         | publication
------------+-----------------------+-------------
 0451198492 | 2001: A Space Odyssey | 1999-10-01
 044100590X | Dune                  | 1999-10-01
 0929605942 | The Tell-Tale Heart   | 1998-12-01
 0441172717 | Dune                  | 1998-09-01
 1885418035 | The Tell-Tale Heart   | 1995-03-28
(5 rows)

Example 4-45 demonstrates, in the first query, a simple use of LIMIT, by retrieving only 5 rows from the joined set of the editions and books table. Ordinarily, such a join would result in 17 rows.

The second query in Example 4-45 shows the use of the OFFSET clause, to shift the scope of the result set down by two rows. You can see that the last three rows of the first query’s result set overlap with the first three rows of the second query’s result set. The ORDER BY clause in each of these queries insures the consistency of the sets returned.

Note

The ORDER BY clause can be a helpful tool for making sure that the results of a limited query are relevant. This is because sorting occurs before limiting, allowing you to determine which rows end up being limited.

Comparing Sets with UNION, INTERSECT and EXCEPT

While joins are used in SQL to combine column values into a single row, the UNION, INTERSECT and EXCEPT clauses exist to merge or omit row data by comparing column values, returning a new result set based on this comparison. Each of these keywords may be used at the end of a valid SQL query and followed by a second query, in order to compare the resultant data sets, and then either merge or omit rows based on that comparison.

When comparing data sets in this manner, it is required that they each have the same number of columns, as well as the same column type. Note that they do not need to have the same name, or be queried from the same table or data source.

UNION: A pair of queries merged with the UNION keyword will combine all non-distinct rows into a single data set. Like rows will not be duplicated.
INTERSECT: A pair of queries merged with the INTERSECT keyword will cause any rows not found in both data sets to be omitted. As such, the only rows returned are those that overlap between the two query result sets.
EXCEPT: A pair of queries merged with the EXCEPT keyword will cause any rows found in both data sets to be omitted from the returned data set. As such, only rows found in the query to the left of the EXCEPT clause that are not found in the query to the right of the clause will be returned.

Example 4-46, Example 4-47, and Example 4-48 each demonstrate these keywords by combining and omitting rows from comparative data sets. Example 4-46 creates a result set by combining several authors’ last names with book titles via the UNION keyword.

Example 4-47 demonstrates the selection of ISBN numbers from the books table, limited to rows which intersect with the query on the shipments table for books which have records of more than two shipments. Finally, Example 4-48 demonstrates the removal of any rows from the first query which are matched completely in the second.

Example 4-46. Using UNION

booktown=# SELECT title FROM books
booktown-#        UNION
booktown-#        SELECT last_name FROM authors
booktown-#        LIMIT 11;
            title
-----------------------------
 2001: A Space Odyssey
 Alcott
 Bartholomew and the Oobleck
 Bianco
 Bourgeois
 Brautigan
 Brite
 Brown
 Christiansen
 Clarke
 Denham
(11 rows)

Example 4-47. Using INTERSECT

booktown=# SELECT isbn FROM editions
booktown-#        INTERSECT
booktown-#        SELECT isbn FROM shipments
booktown-#               GROUP BY isbn
booktown-#               HAVING count(id) > 2;
    isbn
------------
 039480001X
 0394800753
 0451160916
 0590445065
 0694003611
(5 rows)

Example 4-48. Using EXCEPT

booktown=# SELECT last_name, first_name
booktown-#        FROM authors
booktown-#        EXCEPT
booktown-#        SELECT last_name, first_name
booktown-#               FROM authors AS a (author_id)
booktown-#               NATURAL INNER JOIN books
booktown-#        ORDER BY first_name ASC;
 last_name | first_name
-----------+------------
 Denham    | Ariel
 Gorey     | Edward
 Brite     | Poppy Z.
 Brautigan | Richard
(4 rows)

In Example 4-48, only rows that do not match the second query are returned. Notice that the effective result of this is that only authors who do not have a book in the books table are returned. This is due to the INNER JOIN clause, which causes the second query to omit any authors whose author_id is not found in the books table.

While the use of these keywords in a single SQL query precludes the ability to use the LIMIT clause, this limitation can be circumvented by PostgreSQL’s support for sub-queries. By grouping in parentheses each of the queries involved between a UNION, EXCEPT, or EXCEPT clause, the returned result sets from the sub-queries are compared, as demonstrated in Example 4-49.

Example 4-49. Comparing sub-query result sets

booktown=# (SELECT title FROM books ORDER BY title DESC LIMIT 7)
booktown-#     EXCEPT
booktown-#     (SELECT title FROM books ORDER BY title ASC LIMIT 11)
booktown-#     ORDER BY title DESC;
        title
----------------------
 The Velveteen Rabbit
 The Tell-Tale Heart
 The Shining
 The Cat in the Hat
(4 rows)

Notice that the query used in Example 4-49 creates a set from the books table that is constrained to the last seven rows and sorted alphabetically by title. The EXCEPT clause then removes from that data set the first eleven rows, sorted alphabetically in an ascending fashion. The result consists of the last four rows from the table, sorted from the bottom by the final ORDER BY clause on the new exception set.

Using Case Expressions

In order to achieve simple programmatic transformations without having to call out to a procedural language, PostgreSQL supports standard SQL case expressions. These use the SQL keywords CASE, WHEN, THEN, and END to allow basic conditional transformations per each row.

The entirety of a case expression is syntactically placed within the SELECT statement’s target list. A case expression’s result column is named case by default, but it may be aliased in the same manner as any normal target list. The general syntax for a case expression in a SELECT statement’s target list is as follows:

CASE WHEN condition1 THEN result1
     WHEN condition2 THEN result2
     [ ... ]
     [ ELSE default_result ]
END [ AS alias ]

The CASE, WHEN, THEN, and ELSE keywords are somewhat similar to the if-then-else logic in programming languages. The condition of a WHEN clause must return a Boolean result.

When a WHEN condition is met, the result from its corresponding THEN clause will return in the result column for that row. If no conditions are met, the ELSE clause may be used to specify a default result value. If there are no results found for a case expression, NULL is returned.

Example 4-50. Using case expressions in statements

booktown=# SELECT isbn,
booktown-#        CASE WHEN cost > 20 THEN 'over $20.00 cost'
booktown-#             WHEN cost = 20 THEN '$20.00 cost'
booktown-#             ELSE 'under $20.00 cost'
booktown-#        END AS cost_range
booktown-#        FROM stock
booktown-#        LIMIT 8;
    isbn    |    cost_range
------------+-------------------
 0385121679 | over $20.00 cost
 039480001X | over $20.00 cost
 044100590X | over $20.00 cost
 0451198492 | over $20.00 cost
 0394900014 | over $20.00 cost
 0441172717 | under $20.00 cost
 0451160916 | over $20.00 cost
 0679803335 | $20.00 cost
(8 rows)

Adding to the power of case expressions are PostgreSQL’s sub-queries, described in the section titled Using Sub-Queries. As demonstrated in Example 4-51, a sub-query may be provided as a result within a conditional expression.

Example 4-51. Using case expressions with sub-queries

booktown=# SELECT isbn,
booktown-#        CASE WHEN cost > 20 THEN 'N/A - (Out of price range)'
booktown-#             ELSE (SELECT title FROM books b JOIN editions e
booktown(#                                     ON (b.id = e.book_id)
booktown(#                                WHERE e.isbn = stock.isbn)
booktown-#        END AS cost_range
booktown-#        FROM stock
booktown-#        ORDER BY cost_range ASC
booktown-#        LIMIT 8;
    isbn    |         cost_range
------------+-----------------------------
 0451457994 | 2001: A Space Odyssey
 0394800753 | Bartholomew and the Oobleck
 0441172717 | Dune
 0760720002 | Little Women
 0385121679 | N/A - (Out of price range)
 039480001X | N/A - (Out of price range)
 044100590X | N/A - (Out of price range)
 0451198492 | N/A - (Out of price range)
(8 rows)

In Example 4-51, any book found to have a cost of less than 20 has its title returned via a sub-select to the books table, along with its ISBN from the main query to the stock table.

Creating Tables from Other Tables

The INTO TABLE clause may be used with any valid SELECT query in order to create a new table with the column structure and row data of the returned result set. The syntax for this is as follows:

SELECT select_targets
       INTO [ TABLE ] new_table
       FROM old_table;

This syntax performs an implicit CREATE TABLE command, creating a table with the same column names, value types, and row data as the result set from the original table. When the message SELECT is returned, you will know that the statement was successfully performed, and the new table created. This is demonstrated in Example 4-52, which creates a backup table called stock_backup out of the data in the stock table.

Example 4-52. Using SELECT INTO

booktown=# SELECT * INTO stock_backup
booktown-#          FROM stock;
SELECT

The table specified by the INTO clause must not exist, or else an error will be returned. Upon the error, the values of the query will not be inserted and the query will fail. Note that the TABLE keyword, in this query, is an optional noise term.