Lesson 10. Grouping Data

In this lesson, you’ll learn how to group data so that you can summarize subsets of table contents. This involves two new SELECT statement clauses: the GROUP BY clause and the HAVING clause.

Understanding Data Grouping

In the last lesson, you learned that the SQL aggregate functions can be used to summarize data. These functions enable you to count rows, calculate sums and averages, and obtain high and low values without having to retrieve all the data.

All the calculations thus far were performed on all the data in a table or on data that matched a specific WHERE clause. As a reminder, the following example returns the number of products offered by vendor DLL01:

Input ▾

SELECT COUNT(*) AS num_prods
FROM Products
WHERE vend_id ='DLL01';

Output ▾

num_prods
-----------
4

But what if you wanted to return the number of products offered by each vendor? Or products offered by vendors who offer a single product, or only those who offer more than 10 products?

This is where groups come into play. Grouping lets you divide data into logical sets so that you can perform aggregate calculations on each group.

Creating Groups

Groups are created using the GROUP BY clause in your SELECT statement. The best way to understand this is to look at an example:

Input ▾

SELECT vend_id, COUNT(*) AS num_prods
FROM Products
GROUP BY vend_id;

Output ▾

vend_id  num_prods
-------  ---------
BRS01    3
DLL01    4
FNG01    2

Analysis ▾

The above SELECT statement specifies two columns, vend_id, which contains the ID of a product’s vendor, and num_prods, which is a calculated field (created using the COUNT(*) function). The GROUP BY clause instructs the DBMS to sort the data and group it by vend_id. This causes num_prods to be calculated once per vend_id rather than once for the entire table. As you can see in the output, vendor BRS01 has 3 products listed, vendor DLL01 has 4 products listed, and vendor FNG01 has 2 products listed.

Because you used GROUP BY, you did not have to specify each group to be evaluated and calculated. That was done automatically. The GROUP BY clause instructs the DBMS to group the data and then perform the aggregate on each group rather than on the entire result set.

Before you use GROUP BY, here are some important rules about its use that you need to know:

Image GROUP BY clauses can contain as many columns as you want. This enables you to nest groups, providing you with more granular control over how data is grouped.

Image If you have nested groups in your GROUP BY clause, data is summarized at the last specified group. In other words, all the columns specified are evaluated together when grouping is established (so you won’t get data back for each individual column level).

Image Every column listed in GROUP BY must be a retrieved column or a valid expression (but not an aggregate function). If an expression is used in the SELECT, that same expression must be specified in GROUP BY. Aliases cannot be used.

Image Most SQL implementations do not allow GROUP BY columns with variable-length datatypes (such as text or memo fields).

Image Aside from the aggregate calculation statements, every column in your SELECT statement must be present in the GROUP BY clause.

Image If the grouping column contains a row with a NULL value, NULL will be returned as a group. If there are multiple rows with NULL values, they’ll all be grouped together.

Image The GROUP BY clause must come after any WHERE clause and before any ORDER BY clause.

Tip: The ALL Clause

Some SQL implementations (such as Microsoft SQL Server) support an optional ALL clause within GROUP BY. This clause can be used to return all groups, even those that have no matching rows (in which case the aggregate would return NULL). Refer to your DBMS documentation to see if it supports ALL.

Caution: Specifying Columns by Relative Position

Some SQL implementations allow you to specify GROUP BY columns by the position in the SELECT list. For example, GROUP BY 2,1 can mean group by the second column selected and then by the first. Although this shorthand syntax is convenient, it is not supported by all SQL implementations. Its use is also risky in that it is highly susceptible to the introduction of errors when editing SQL statements.

Filtering Groups

In addition to being able to group data using GROUP BY, SQL also allows you to filter which groups to include and which to exclude. For example, you might want a list of all customers who have made at least two orders. To obtain this data, you must filter based on the complete group, not on individual rows.

You’ve already seen the WHERE clause in action (that was introduced back in Lesson 4, “Filtering Data”). But WHERE does not work here because WHERE filters specific rows, not groups. As a matter of fact, WHERE has no idea what a group is.

So what do you use instead of WHERE? SQL provides yet another clause for this purpose: the HAVING clause. HAVING is very similar to WHERE. In fact, all types of WHERE clauses you’ve learned about thus far can also be used with HAVING. The only difference is that WHERE filters rows and HAVING filters groups.

Tip: HAVING Supports All WHERE’s Operators

In Lesson 4 and Lesson 5, “Advanced Data Filtering,” you learned about WHERE clause conditions (including wildcard conditions and clauses with multiple operators). All the techniques and options that you’ve learned about WHERE can be applied to HAVING. The syntax is identical; just the keyword is different.

So how do you filter groups? Look at the following example:

Input ▾

SELECT cust_id, COUNT(*) AS orders
FROM Orders
GROUP BY cust_id
HAVING COUNT(*) >= 2;

Output ▾

cust_id    orders
---------- -----------
1000000001 2

Analysis ▾

The first three lines of this SELECT statement are similar to the statements seen above. The final line adds a HAVING clause that filters on those groups with a COUNT(*) >= 2—two or more orders.

As you can see, a WHERE clause couldn’t work here because the filtering is based on the group aggregate value, not on the values of specific rows.

Note: The Difference Between HAVING and WHERE

Here’s another way to look it: WHERE filters before data is grouped, and HAVING filters after data is grouped. This is an important distinction; rows that are eliminated by a WHERE clause will not be included in the group. This could change the calculated values, which in turn could affect which groups are filtered based on the use of those values in the HAVING clause.

So is there ever a need to use both WHERE and HAVING clauses in one statement? Actually, yes, there is. Suppose you want to further filter the above statement so that it returns any customers who placed two or more orders in the past 12 months. To do that, you can add a WHERE clause that filters out just the orders placed in the past 12 months. You then add a HAVING clause to filter just the groups with two or more rows in them.

To better demonstrate this, look at the following example, which lists all vendors who have two or more products priced at 4 or more:

Input ▾

SELECT vend_id, COUNT(*) AS num_prods
FROM Products
WHERE prod_price >= 4
GROUP BY vend_id
HAVING COUNT(*) >= 2;

Output ▾

vend_id  num_prods
-------  -----------
BRS01    3
FNG01    2

Analysis ▾

This statement warrants an explanation. The first line is a basic SELECT using an aggregate function—much like the examples thus far. The WHERE clause filters all rows with a prod_price of at least 4. Data is then grouped by vend_id, and then a HAVING clause filters just those groups with a count of 2 or more. Without the WHERE clause, an extra row would have been retrieved (vendor DLL01 who sells four products all priced under 4) as seen here:

Input ▾

SELECT vend_id, COUNT(*) AS num_prods
FROM Products
GROUP BY vend_id
HAVING COUNT(*) >= 2;

Output ▾

vend_id  num_prods
-------  -----------
BRS01    3
DLL01    4
FNG01    2

Note: Using HAVING and WHERE

HAVING is so similar to WHERE that most DBMSs treat them as the same thing if no GROUP BY is specified. Nevertheless, you should make that distinction yourself. Use HAVING only in conjunction with GROUP BY clauses. Use WHERE for standard row-level filtering.

Grouping and Sorting

It is important to understand that GROUP BY and ORDER BY are very different, even though they often accomplish the same thing. Table 10.1 summarizes the differences between them.

Table 10.1 ORDER BY Versus GROUP BY

ORDER BY

GROUP BY

Sorts generated output.

Groups rows. The output might not be in group order, however.

Any columns (even columns not selected) may be used.

Only selected columns or expressions columns may be used, and every selected column expression must be used.

Never required.

Required if using columns (or expressions) with aggregate functions.

The first difference listed in Table 10.1 is extremely important. More often than not, you will find that data grouped using GROUP BY will indeed be output in group order. But that is not always the case, and it is not actually required by the SQL specifications. Furthermore, even if your particular DBMS does, in fact, always sort the data by the specified GROUP BY clause, you might actually want it sorted differently. Just because you group data one way (to obtain group-specific aggregate values) does not mean that you want the output sorted that same way. You should always provide an explicit ORDER BY clause as well, even if it is identical to the GROUP BY clause.

Tip: Don’t Forget ORDER BY

As a rule, anytime you use a GROUP BY clause, you should also specify an ORDER BY clause. That is the only way to ensure that data will be sorted properly. Never rely on GROUP BY to sort your data.

To demonstrate the use of both GROUP BY and ORDER BY, let’s look at an example. The following SELECT statement is similar to the ones seen previously. It retrieves the order number and number of items ordered for all orders containing three or more items:

Input ▾

SELECT order_num, COUNT(*) AS items
FROM OrderItems
GROUP BY order_num
HAVING COUNT(*) >= 3;

Output ▾

order_num  items
---------  -----
20006      3
20007      5
20008      5
20009      3

To sort the output by number of items ordered, all you need to do is add an ORDER BY clause, as follows:

Input ▾

SELECT order_num, COUNT(*) AS items
FROM OrderItems
GROUP BY order_num
HAVING COUNT(*) >= 3
ORDER BY items, order_num;

Output ▾

order_num  items
---------  -----
20006      3
20009      3
20007      5
20008      5

Analysis ▾

In this example, the GROUP BY clause is used to group the data by order number (the order_num column) so that the COUNT(*) function can return the number of items in each order. The HAVING clause filters the data so that only orders with three or more items are returned. Finally, the output is sorted using the ORDER BY clause.

SELECT Clause Ordering

This is probably a good time to review the order in which SELECT statement clauses are to be specified. Table 10.2 lists all the clauses we have learned thus far, in the order they must be used.

Table 10.2 SELECT Clauses and Their Sequence

Clause

Description

Required

SELECT

Columns or expressions to be returned

Yes

FROM

Table to retrieve data from

Only if selecting data from a table

WHERE

Row-level filtering

No

GROUP BY

Group specification

Only if calculating aggregates by group

HAVING

Group-level filtering

No

ORDER BY

Output sort order

No

Summary

In Lesson 9, “Summarizing Data,” you learned how to use the SQL aggregate functions to perform summary calculations on your data. In this lesson, you learned how to use the GROUP BY clause to perform these calculations on groups of data, returning results for each group. You saw how to use the HAVING clause to filter specific groups. You also learned the difference between ORDER BY and GROUP BY and between WHERE and HAVING.

Challenges

  1. The OrderItems table contains the individual items for each order. Write a SQL statement that returns the number of lines (as order_lines) for each order number (order_num) and sort the results by order_lines.

  2. Write a SQL statement that returns a field named cheapest_item, which contains the lowest-cost item for each vendor (using prod_price in the Products table), and sort the results from lowest to highest cost.

  3. It’s important to identify the best customers, so write a SQL statement to return the order number (order_num in the OrderItems table) for all orders of at least 100 items.

  4. Another way to determine the best customers is by how much they have spent. Write a SQL statement to return the order number (order_num in the OrderItems table) for all orders with a total price of at least 1000. Hint: for this one you’ll need to calculate and sum the total (item_price multiplied by quantity). Sort the results by order number.

  5. What is wrong with the following SQL statement? (Try to figure it out without running it.)

    SELECT order_num, COUNT(*) AS items
    FROM OrderItems
    GROUP BY items
    HAVING COUNT(*) >= 3
    ORDER BY items, order_num;