13 Grouping Data

13
Grouping Data

“Don’t drown yourself with details. Look at the whole.”

—MARSHAL FERDINAND FOCH COMMANDER-IN-CHIEF, ALLIED ARMIES IN FRANCE

Topics Covered in This Chapter

Why Group Data?

The GROUP BY Clause

“Some Restrictions Apply”

Uses for GROUP BY

Sample Statements

Summary

Problems for You to Solve

Chapter 12, “Simple Totals,” explained how to use the aggregate functions (COUNT, MIN, MAX, AVG, and SUM) to ask SQL to calculate a value across all the rows in the table defined in your FROM and WHERE clauses. I pointed out, however, that after you include any value expression that contains an aggregate function in your SELECT clause, all your value expressions must either be a literal constant or contain an aggregate function. This characteristic is useful if you want to see only one row of totals across a result set, but what if you want to see some subtotals? In this chapter, I’ll show you how to ask for subtotals by grouping your data. I’ll show you how to do more complex subtotals in Chapter 21, “Performing Complex Calculations on Groups.”

Why Group Data?

When you’re working in the Sales Orders database, finding out the number of orders (COUNT), the total sales (SUM), the average of sales (AVG), the smallest order (MIN), or the largest order (MAX) is useful, indeed. And if you want to calculate any of these values by customer, order date, or product, you can add a filter (WHERE) to fetch the rows for one particular customer or product. But what if you want to see subtotals for all customers, displaying the customer name along with the subtotals? To do that, you need to ask your database system to group the rows.

Likewise, in the Entertainment Agency database, it’s easy to find out the number of contracts, the total contract price, the smallest contract price, or the largest contract price for all contracts. You can even filter the rows so that you see these calculations for one particular entertainer, one particular customer, or across a specific range of dates. Again, if you want to see one total row for each customer or entertainer, you must group the rows.

Are you starting to get the idea? When you ask your database system to group rows on column values or expressions, it forms subsets of rows based on matching values. You can then ask your database to calculate aggregate values on each group. Let’s look at a simple example from the Entertainment Agency database. First, I need to build a query that fetches the columns of interest—entertainer name and contract price. Here’s the SQL:

Click here to view code image

SQL

SELECT Entertainers.EntStageName,

Engagements.ContractPrice

FROM Entertainers

INNER JOIN Engagements

ON Entertainers.EntertainerID =

Engagements.EntertainerID

ORDER BY EntStageName

The result looks like the following table. (In the sample database, I saved this request as CH13_Entertainers_And_ContractPrices.)

EntStageName	ContractPrice
Carol Peacock Trio	$140.00
Carol Peacock Trio	$1,670.00
Carol Peacock Trio	$770.00
Carol Peacock Trio	$1,670.00
Carol Peacock Trio	$1,670.00
Carol Peacock Trio	$320.00
Carol Peacock Trio	$1,400.00
Carol Peacock Trio	$680.00
Carol Peacock Trio	$410.00
Carol Peacock Trio	$1,940.00
Carol Peacock Trio	$410.00
Caroline Coie Cuartet	$1,250.00
Caroline Coie Cuartet	$2,450.00
Caroline Coie Cuartet	$1,490.00
Caroline Coie Cuartet	$1,370.00
<< more rows here >>

You already know that you can count all the rows, or find the smallest, largest, sum, or average of the ContractPrice column—as long as you eliminate the EntStageName column. However, you can keep this column if you ask your database to group on it. If you ask to group on entertainer stage name, your database will form one group containing the first eleven rows (“Carol Peacock Trio”), a second group containing the next eleven rows (“Caroline Coie Cuartet”), and so on through the entire table. You can now ask for the COUNT of the rows or the SUM, MIN, MAX, or AVG of the ContractPrice column, and you will get one aggregate row per entertainment group. The result looks like the following table.

EntStageName	NumContracts	TotPrice	MinPrice	MaxPrice	AvgPrice
Carol Peacock Trio	11	$11,080.00	$140.00	$1,940.00	$1,007.27
Caroline Coie Cuartet	11	$15,070.00	$290.00	$2,450.00	$1,370.00
Coldwater Cattle Company	8	$14,875.00	$350.00	$3,800.00	$1,859.38
Country Feeling	15	$34,080.00	$275.00	$14,105.00	$2,272.00
Jazz Persuasion	7	$5,480.00	$500.00	$1,670.00	$782.86
Jim Glynn	9	$3,030.00	$110.00	$770.00	$336.67
Julia Schnebly	8	$4,345.00	$275.00	$875.00	$543.13
JV & the Deep Six	10	$17,150.00	$950.00	$3,650.00	$1,715.00
Modern Dance	10	$14,600.00	$650.00	$2,930.00	$1,460.00
Saturday Revue	9	$11,550.00	$290.00	$2,930.00	$1,283.33
Susan McLain	6	$2,670.00	$230.00	$800.00	$445.00
Topazz	7	$6,620.00	$590.00	$1,550.00	$945.71
<< more rows here >>

Looks interesting, doesn’t it? I bet you’d like to know how I did that! I’ll show you all the details in the following sections.

Note: Remember in the Introduction that I warned you that results from each database system won’t necessarily match the sort order you see in examples in this book unless you include an ORDER BY clause. Even when you include that specification, the system might return results in columns not included in the ORDER BY clause in a different sequence because of different optimization techniques.

If you’re running the examples in Microsoft SQL Server, simply selecting the rows from the view does not honor any ORDER BY clause specified in the view. You must open the design of the view and execute it from there for the ORDER BY clause to be honored.

The GROUP BY Clause

As you discovered in Chapter 12, you can find out all sorts of interesting information by using aggregate functions. However, you might have noticed that all the examples I gave you applied the aggregate functions across all the rows returned by the FROM and WHERE clauses. You could filter the result set down to one group using the WHERE clause, but there was really no way to look at the results from multiple groups in one request. To accomplish this summarizing by group in a single request, I need to add one more major clause to your SQL vocabulary—GROUP BY.

Syntax

Let’s take a close look at the GROUP BY clause. Figure 13-1 shows the basic diagram for a SELECT statement with GROUP BY added.

Figure 13-1 The syntax diagram of a SELECT statement with a GROUP BY clause

As you recall from earlier chapters, you define the tables that are the source of your data in the FROM clause. Your FROM clause can be as simple as a single table name or as complex as a JOIN of multiple tables. As discussed in Chapter 8, “INNER JOINs,” you can even embed an entire table subquery (a SELECT statement) as a table reference. Next, you can optionally provide a WHERE clause to include or exclude certain rows supplied by the FROM clause. I covered the WHERE clause in detail in Chapter 6, “Filtering Your Data.”

When you add a GROUP BY clause, you specify the columns in the logical table formed by the FROM and WHERE clauses that you want your database system to use as the definition for groups of rows. Rows that have the same values in the list of columns you specify will be gathered into a group. You can use the columns that you list in the GROUP BY clause in value expressions in your SELECT clause, and you can use any of the aggregate functions I discussed in the previous chapter to perform calculations across each group.

Note: When you use GROUP BY, you’ll often see the results returned by your database system as though they are sorted by the columns you specified. This happens because some optimizers first sort the data internally to make it faster to process your GROUP BY. Keep in mind that if you want a specific sort order, you must also include an ORDER BY clause.

Let’s apply the GROUP BY clause to see how you can calculate information about contract prices by entertainment group—the sample I tantalized you with earlier. Figure 13-2 shows the tables needed to solve this problem.

Figure 13-2 The relationship between the Entertainers and Engagements tables

Note: Throughout this chapter, I use the “Request/Translation/Clean Up/SQL” technique introduced in Chapter 4, “Creating a Simple Query.”

“Show me for each entertainment group the group name, the count of contracts for the group, the total price of all the contracts, the lowest contract price, the highest contract price, and the average price of all the contracts.”

(Hint: When you see a request that wants the count, total, smallest, largest, or average of values at a detail level [contracts] for each value at a higher level [entertainers], you are going to need to use aggregate functions and grouping in your request. Remember that for each entertainer there are most likely many contracts.)

Translation

Select entertainer name, the count of contracts, the sum of the contract price, the lowest contract price, the highest contract price, and the average contract price from the entertainers table joined with the engagements table on entertainer ID, grouped by entertainer name

Clean Up

Select entertainer name, the count of (*) contracts, the sum of the (contract price), the lowest min(contract price), the highest max(contract price), and the average avg(contract price) from the entertainers table inner joined with the engagements table on entertainers.entertainer ID in the entertainers table matches = engagements.entertainer ID in the engagements table, grouped by entertainer name

SQL

SELECT Entertainers.EntStageName,

COUNT(*) AS NumContracts,

SUM(Engagements.ContractPrice) AS TotPrice,

MIN(Engagements.ContractPrice) AS MinPrice,

MAX(Engagements.ContractPrice) AS MaxPrice,

AVG(Engagements.ContractPrice) AS AvgPrice

FROM Entertainers

INNER JOIN Engagements

ON Entertainers.EntertainerID =

Engagements.EntertainerID

GROUP BY Entertainers.EntStageName

Note that I substituted MIN for “lowest,” MAX for “highest,” and AVG for “average,” as I showed you in the previous chapter. I also asked for COUNT(*) because I want to count all the engagement (contract) rows regardless of any Null values. Adding the GROUP BY clause is what gets me the aggregate calculations per entertainment group. It also allows me to include the entertainer name in the SELECT clause. (I saved this request as CH13_Aggregate_Contract_Info_By_Entertainer in the sample database.)

Do you suppose the above query returns a row for each entertainer? What about entertainers who have never been booked? If you remember what you learned in Chapter 9 about OUTER JOIN, you might be tempted to solve the problem like this:

Click here to view code image

SQL

SELECT Entertainers.EntStageName,

COUNT(*) AS NumContracts,

SUM(Engagements.ContractPrice) AS TotPrice,

MIN(Engagements.ContractPrice) AS MinPrice,

MAX(Engagements.ContractPrice) AS MaxPrice,

AVG(Engagements.ContractPrice) AS AvgPrice

FROM Entertainers

LEFT OUTER JOIN Engagements

ON Entertainers.EntertainerID =

Engagements.EntertainerID

GROUP BY Entertainers.EntStageName

One interesting point about all the aggregate functions is that they ignore rows that have a Null value. The above query will return a blank or Null value for TotPrice, MinPrice, MaxPrice, and AvgPrice for the one entertainer who has no engagements, but you’ll find that NumContracts is 1! How can that be? Well, this SQL asks for COUNT(*)—count any row returned. The OUTER JOIN returns exactly one row for the entertainer with no booking, so the count of 1 is correct. (I saved this request as CH13_Aggregate_Contract_Info_All_Entertainers_WRONG in the sample database.) However, if you remember from the previous chapter, you can also COUNT(value expression), and that tells your database system to add to the count only if it finds a non-Null value in the value expression or column name you specify. Let’s tweak the query one more time.

Click here to view code image

SQL

SELECT Entertainers.EntStageName,

COUNT(Engagements.EntertainerID) AS NumContracts,

SUM(Engagements.ContractPrice) AS TotPrice,

MIN(Engagements.ContractPrice) AS MinPrice,

MAX(Engagements.ContractPrice) AS MaxPrice,

AVG(Engagements.ContractPrice) AS AvgPrice

FROM Entertainers

LEFT OUTER JOIN Engagements

ON Entertainers.EntertainerID =

Engagements.EntertainerID

GROUP BY Entertainers.EntStageName

Because the EntertainerID column from the Engagements table for the one entertainer who has no bookings is Null, nothing gets counted. If you run this query, you should see the correct value 0 in NumContracts for the one entertainer who has no engagements. (I saved this request as CH13_Aggregate_Contract_Info_All_Entertainers in the sample database.)

What if you want (or need) to group on more than one value? Let’s look at this same problem, but from the perspective of customers rather than entertainers, and let’s assume you want to display in your result set both the customer’s last name and first name. Figure 13-3 shows the necessary tables.

Figure 13-3 The relationship between the Customers and Engagements tables

“Show me for each customer the customer first and last names, the count of contracts for the customer, the total price of all the contracts, the lowest contract price, the highest contract price, and the average price of all the contracts.”

Translation

Select customer last name, customer first name, the count of contracts, the sum of the contract price, the lowest contract price, the highest contract price, and the average contract price from the customers table joined with the engagements table on customer ID, grouped by customer last name and customer first name

Clean Up

Select customer last name, customer first name, the count of (*) contracts, the sum of the (contract price), the lowest min(contract price), the highest max(contract price), and the average avg(contract price) from the customers table inner joined with the engagements table on customers.customer ID in the customers table matches = engagements.customer ID in the engagements table, grouped by customer last name, and customer first name

SQL

SELECT Customers.CustLastName,

Customers.CustFirstName,

COUNT(*) AS NumContracts,

SUM(Engagements.ContractPrice) AS TotPrice,

MIN(Engagements.ContractPrice) AS MinPrice,

MAX(Engagements.ContractPrice) AS MaxPrice,

AVG(Engagements.ContractPrice) AS AvgPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

GROUP BY Customers.CustLastName,

Customers.CustFirstName

The result looks like the following table. (In the Entertainment Agency sample database, I saved this request as CH13_Aggregate_ Contract_Info_By_Customer.)

CustLast Name	CustFirst Name	Num Contracts	TotPrice	MinPrice	MaxPrice	AvgPrice
Berg	Matt	9	$13,170.00	$200.00	$2,675.00	$1,463.33
Brehm	Peter	7	$7,250.00	$290.00	$3,800.00	$1,035.71
Ehrlich	Zachary	13	$12,455.00	$230.00	$1,550.00	$958.08
Hallmark	Elizabeth	8	$25,585.00	$410.00	$14,105.00	$3,198.13
Hartwig	Doris	8	$10,795.00	$140.00	$2,750.00	$1,349.38
Keyser	Liz	7	$4,685.00	$200.00	$1,490.00	$669.29
McCrae	Dean	11	$11,800.00	$290.00	$2,570.00	$1,072.73
Patterson	Kerry	7	$6,815.00	$110.00	$2,930.00	$973.57
<< more rows here >>

Because it takes two columns to display the customer name, I had to include them both in the GROUP BY clause. Remember that if you want to include a column in the output that is not the result of an aggregate calculation, you must also include it in the GROUP BY clause. I did not include ContractPrice in the GROUP BY clause because that’s the column I’m using in many of the aggregate function expressions. If I had included ContractPrice, I would have gotten unique groups of customers and prices. MIN, MAX, and AVG will all return that grouped price. COUNT will be greater than one only if more than one contract with the same price exists for a given customer. If you think about it, though, grouping by customer and price and asking for a COUNT would be a good way to find customers who have multiple contracts at the same price.

Do you suppose this query includes customers who have no bookings? If you answered “No,” you’re correct! To fetch data for all customers regardless of whether they’ve booked an engagement, you must use an OUTER JOIN and be careful to COUNT one of the columns from the Engagements table. The solution is similar to the problem discussed earlier for entertainers and engagements.

Mixing Columns and Expressions

Suppose you want to list the customer name as one output column, the full customer address as another output column, the last engagement date, and the sum of engagement contract prices. The customer name is in two columns: CustFirstName and CustLastName. The columns you need for a full address are CustStreetAddress, CustCity, CustState, and CustZipCode. Let’s see how you should construct the SQL for this request. (I saved this request as CH13_Customers_Last_Booking in the Entertainment Agency sample database.)

“Show me for each customer the customer full name, the customer full address, the latest contract date for the customer, and the total price of all the contracts.”

Translation

Select customer last name and customer first name as CustomerFullName; street address, city, state, and ZIP Code as CustomerFullAddress; the latest contract start date; and the sum of the contract price from the customers table joined with the engagements table on customer ID, grouped by customer last name, customer first name, customer street address, customer city, customer state, and customer ZIP Code

Clean Up

Select customer last name and || ‘, ’ || customer first name as CustomerFullName, street address, || ‘, ‘ || city, || ‘, ‘ || state, and || ‘ ‘ || ZIP Code as CustomerFullAddress, the latest max(contract start date) as latest date, and the sum of the (contract price) as total contract price from the customers table inner joined with the engagements table on customers.customer ID in the customers table matches = engagements.customer ID in the engagements table grouped by customer last name, customer first name, customer street address, customer city, customer state, and customer ZIP Code

SQL

SELECT Customers.CustLastName || ', ' ||

Customers.CustFirstName AS CustomerFullName,

Customers.CustStreetAddress || ', ' ||

Customers.CustCity || ', ' ||

Customers.CustState || ' ' ||

Customers.CustZipCode AS CustomerFullAddress,

MAX(Engagements.StartDate) AS LatestDate,

SUM(Engagements.ContractPrice),

AS TotalContractPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

GROUP BY Customers.CustLastName,

Customers.CustFirstName,

Customers.CustStreetAddress,

Customers.CustCity, Customers.CustState,

Customers.CustZipCode

Notice that I had to list every one of the columns that I used in an output expression that did not include an aggregate function. I used StartDate and ContractPrice in aggregate expressions, so I don’t need to list them in the GROUP BY clause. In fact, it doesn’t make sense to group on either StartDate or ContractPrice because I want to use these in an aggregate calculation across multiple customers. If, for example, I grouped on StartDate, MAX(StartDate) would return the grouping value, and the expression SUM(ContractPrice) would return only the sum of contract prices for a customer on any given date. You wouldn’t get the sum of more than one contract unless a customer had more than one contract for a given date—not likely.

Using GROUP BY in a Subquery in a WHERE Clause

In Chapter 11, “Subqueries,” I introduced the COUNT and MAX aggregate functions to show how to filter rows using an aggregate value fetched with a subquery. In Chapter 12 I showed how to use MIN, AVG, and SUM in a subquery filter as well. Let’s look at a request that requires both a subquery with an aggregate function and a GROUP BY clause in the subquery:

“Display the engagement contract whose price is greater than the sum of all contracts for any other customer.”

Translation

Select customer first name, customer last name, engagement start date, and engagement contract price from the customers table joined with the engagements table on customer ID where the contract price is greater than the sum of all contract prices from the engagements table for customers other than the current customer, grouped by customer ID

Clean Up

Select customer first name, customer last name, engagement start date, and engagement contract price from the customers table inner joined with the engagements table on customers.customer ID in the customers table matches = engagements.customer ID in the engagements table where the contract price is greater than > ALL (select the sum of all (contract prices) from the engagements table as E2 for where E2.customers ID <> other than the current customers.customer ID, grouped by E2.customer ID)

SQL

SELECT Customers.CustFirstName,

Customers.CustLastName,

Engagements.StartDate,

Engagements.ContractPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

WHERE Engagements.ContractPrice > ALL

(Select SUM(ContractPrice)

FROM Engagements AS E2

WHERE E2.CustomerID <> Customers.CustomerID

GROUP BY E2.CustomerID)

Let’s analyze what the subquery is doing. For each engagement that the query looks at in the JOIN of the Customers and Engagements tables, the subquery calculates the SUM of all contract prices for all other customers and groups them by customer ID. Because there are multiple customers in the database, the subquery will return multiple SUM values—one for each of the other customers. For this reason, I cannot ask for a simple greater than (>) comparison. I can, however, use the quantified greater than all (> ALL) comparison to check a set of values as you learned in Chapter 11. If you run this query in the sample Entertainment Agency database for this chapter (I saved it as CH13_Biggest_Big_Contract), you’ll find that one contract fits the bill, as shown here:

CustFirstName	CustLastName	StartDate	ContractPrice
Elizabeth	Hallmark	2018-01-22	$14,105.00

Simulating a SELECT DISTINCT Statement

Did it occur to you that you can use a GROUP BY clause and not include any aggregate functions in your SELECT clause? Sure you can! When you do this, you get the same effect as using the DISTINCT keyword covered in Chapter 4. (See the “Eliminating Duplicate Rows” section in that chapter.)

Let’s look at a simple request that requires unique values and solve it using both techniques:

“Show me the unique city names from the customers table.”

Translation 1	Select the unique city names from the customers table
Clean Up	Select the unique distinct city names from the customers table
`SQL`	`SELECT DISTINCT Customers.CustCity` `FROM Customers`
Translation 2	Select city name from the customers table, grouped by city name
Clean Up	Select city name from the customers table, grouped by city name
`SQL`	`SELECT Customers.CustCity` `FROM Customers` `GROUP BY Customers.CustCityName`

Remember that GROUP BY groups all the rows on the grouping column(s) you specify and returns one row per group. This is a slightly different way to get to the same result that you obtain with the DISTINCT keyword. Which one is better? I think that DISTINCT might be a clearer statement of what you want if all you want is unique rows, but you might find that your database system solves the problem faster when you use GROUP BY. In addition, GROUP BY lets you obtain more information about your data. Consider the following query:

Click here to view code image

SQL

SELECT Customers.CustCity, Count(*) as

CustPerCity

FROM Customers

GROUP BY Customers.CustCityName

With this query, you not only fetch the unique city names but also find out how many customers are in each city. Is that cool or what?

“Some Restrictions Apply”

I already mentioned that adding a GROUP BY clause places certain restrictions on constructing your request. Let’s review those restrictions to make sure you don’t fall into common traps.

Column Restrictions

When you add a GROUP BY clause, you’re asking your database system to form unique groups of rows from those returned by the tables defined in your FROM clause and filtered by your WHERE clause. You can use as many aggregate expressions as you like in your SELECT clause, and these expressions can use any of the columns in the table defined by the FROM and WHERE clauses. As I pointed out in an earlier example, it probably does not make sense to reference a column in an aggregate expression and also include that column in your grouping specification.

If you choose to also include expressions that reference columns but do not include an aggregate function, you must list all columns you use this way in the GROUP BY clause. One of the most common mistakes is to assume that you can reference columns in nonaggregate expressions as long as the columns come from unique rows. For example, let’s look at an incorrect request that includes a primary key value—something that I know by definition is unique:

“Display the customer ID, customer full name, and the total of all engagement contract prices.”

Translation

Select customer ID, customer first name, and customer last name as CustFullName, and the sum of contract prices as TotalPrice from the customers table joined with the engagements table on customer ID, grouped by customer ID

Clean Up

Select customer ID, customer first name and || ‘ ‘ || customer last name as CustFullName, and the sum of (contract price)s as TotalPrice from the customers table inner joined with the engagements table on customers.customer ID in the customers table matches = engagements.customer ID in the engagements table, grouped by customer ID

SQL

SELECT Customers.CustomerID,

Customers.CustFirstName || ' ' ||

Customers.CustLastName AS CustFullName,

SUM(Engagements.ContractPrice) AS TotalPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

GROUP BY Customers.CustomerID

I know that CustomerID is unique per customer. Grouping on CustomerID alone should be sufficient to fetch unique customer first and last name information within the groups formed by CustomerID. However, SQL is a language based on syntax, not semantics. In other words, SQL does not take into account any knowledge that could be implied by the design of your database tables—including whether columns are primary keys. SQL demands that your request be syntactically “pure” and translatable without any knowledge of the underlying table design. So, the above SQL statement will fail on a database system that is fully compliant with the SQL Standard because I’ve included columns in the SELECT clause that are not in an aggregate function and are also not in the GROUP BY clause (CustFirstName and CustLastName). (Surprisingly, a query constructed like the above does work in MySQL and PostgreSQL.) The correct SQL request is as follows:

Click here to view code image

SQL

SELECT Customers.CustomerID,

Customers.CustFirstName || ' ' ||

Customers.CustLastName AS CustFullName,

SUM(Engagements.ContractPrice) AS TotalPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

GROUP BY Customers.CustomerID,

Customers.CustFirstName,

Customers.CustLastName

This might seem like overkill, but it’s the correct way to do it!

Note: In some database systems, you must exactly duplicate the expressions you use in the SELECT clause in the GROUP BY clause. Oracle and Microsoft Office Access are examples of systems that either support or require this. (Microsoft Office Access lets you do it either way.) In my example, instead of listing the separate columns, you would have to end the SQL with this:

Click here to view code image

GROUP BY Customers.CustomerID,
Customers.CustFirstName || ' ' ||
Customers.CustLastName

This isn’t compliant with the SQL Standard, but you might find that this is the only way you can get your request to work on your system.

Grouping on Expressions

I showed you earlier some correct examples of creating expressions that do not include aggregate functions. One of the most common mistakes is to attempt to group on the expression you create in the SELECT clause rather than on the individual columns. Remember that the GROUP BY clause must refer to columns created by the FROM and WHERE clauses. It cannot use an expression you create in your SELECT clause.

Let’s take another look at an example I solved earlier to show you what I mean, but this time, let’s make the mistake. (I’m skipping the Translation and Clean Up steps here because I covered them earlier.)

“Show me for each customer in the state of Washington the customer full name, the customer full address, the latest contract date for the customer, and the total price of all the contracts.”

Click here to view code image

SQL

SELECT Customers.CustLastName || ', ' ||

Customers.CustFirstName AS CustomerFullName,

Customers.CustStreetAddress || ', ' ||

Customers.CustCity || ', ' ||

Customers.CustState || ' ' ||

Customers.CustZipCode AS CustomerFullAddress,

MAX(Engagements.StartDate) AS LatestDate,

SUM(Engagements.ContractPrice)

AS TotalContractPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

WHERE Customers.CustState = 'WA'

GROUP BY CustomerFullName,

CustomerFullAddress

Some database systems will let you get away with this, but it’s not correct. The CustomerFullName and CustomerFullAddress columns don’t exist until after your database system has evaluated the FROM, WHERE, and GROUP BY clauses. The GROUP BY clause won’t find these columns in the result created in the FROM and WHERE clauses, so on a database system that strictly adheres to the SQL Standard you’ll get a syntax error.

I showed you earlier one correct way to solve this: You must list all the columns you use in both the CustomerFullName and CustomerFullAddress expressions. Another way is to make the FROM clause generate the calculated columns by embedding a table subquery. Here’s what it looks like:

Click here to view code image

SQL

SELECT CE.CustomerFullName,

CE.CustomerFullAddress,

MAX(CE.StartDate) AS LatestDate,

SUM(CE.ContractPrice) AS TotalContractPrice

FROM

(SELECT Customers.CustLastName || ', ' ||

Customers.CustFirstName AS CustomerFullName,

Customers.CustStreetAddress || ', ' ||

Customers.CustCity || ', ' ||

Customers.CustState || ' ' ||

Customers.CustZipCode AS CustomerFullAddress,

Engagements.StartDate,

Engagements.ContractPrice

FROM Customers

INNER JOIN Engagements

ON Customers.CustomerID =

Engagements.CustomerID

WHERE Customers.CustState = 'WA')

AS CE

GROUP BY CE.CustomerFullName,

CE.CustomerFullAddress

This works now because I’ve generated the CustomerFullName and CustomerFullAddress columns as output in the FROM clause. You have to admit, though, that this makes the query very complex. In truth, it’s better to just list all the individual columns you plan to use in nonaggregate expressions rather than try to generate the expressions as columns inside the FROM clause. I saved this last request as CH13_Customers_Total_Contract in the Entertainment Agency sample database.

Uses for GROUP BY

At this point, you should have a fairly good understanding of how to ask for subtotals across groups using aggregate functions and the GROUP BY clause. The best way to give you an idea of the wide range of uses for GROUP BY is to list some problems you can solve with this new clause and then present a robust set of examples in the “Sample Statements” section:

“Show me each vendor and the average by vendor of the number of days to deliver products.”

“Display for each product the product name and the total sales.”

“List for each customer and order date the customer full name and the total cost of items ordered on each date.”

“Display each entertainment group ID, entertainment group member, and the amount of pay for each member based on the total contract price divided by the number of members in the group.”

“Show each agent name, the sum of the contract price for the engagements booked, and the agent’s total commission.”

“For completed classes, list by category and student the category name, the student name, and the student’s average grade in all classes taken in that category.”

“Display by category the category name and the count of classes offered.”

“List each staff member and the count of classes each is scheduled to teach.”

“Show me for each tournament and match the tournament ID, the tournament location, the match number, the name of each team, and the total of the handicap score for each team.”

“Display for each bowler the bowler name and the average of the bowler’s raw game scores.”

“Show me how many recipes exist for each class of ingredient.”

“If I want to cook all the recipes in my cookbook, how much of each ingredient must I have on hand?”

Sample Statements

You now know the mechanics of constructing queries using a GROUP BY clause and have seen some of the types of requests you can answer. Let’s take a look at a set of samples, all of which request that the information be grouped. These examples come from each of the sample databases.

I’ve also included sample result sets that would be returned by these operations and placed them immediately after the SQL syntax line. The name that appears immediately above a result set is the name I gave each query in the sample data on the companion website for this book, www.informit.com/title/9780134858333. I stored each query in the appropriate sample database (as indicated within the example), and I prefixed the names of the queries relevant to this chapter with “CH13.” You can follow the instructions in the Introduction of this book to load the samples onto your computer and try them.

Note: Remember that all the column names and table names used in these examples are drawn from the sample database structures shown in Appendix B, “Schema for the Sample Databases.” To simplify the process, I have combined the Translation and Clean Up steps for all the examples.

These samples assume you have thoroughly studied and understood the concepts covered in previous chapters, especially the chapters on JOINs and subqueries.

Sales Orders Database

“List for each customer and order date the customer full name and the total cost of items ordered on each date.”

Translation/Clean Up

Select customer first name and || ‘ ‘ || customer last name as CustFullName, order date, and the sum of (quoted price times * quantity ordered) as TotalCost from the customers table inner joined with the orders table on customers.customer ID in the customers table matches = orders.customer ID in the orders table, and then inner joined with the order details table on orders.order number in the orders table matches = order_details.order number in the order details table, grouped by customer first name, customer last name, and order date

SQL

SELECT Customers.CustFirstName || ' ' ||

Customers.CustLastName AS CustFullName,

Orders.OrderDate,

SUM(Order_Details.QuotedPrice *

Order_Details.QuantityOrdered) AS TotalCost

FROM (Customers

INNER JOIN Orders

ON Customers.CustomerID = Orders.CustomerID)

INNER JOIN Order_Details

ON Orders.OrderNumber =

Order_Details.OrderNumber

GROUP BY Customers.CustFirstName,

Customers.CustLastName, Orders.OrderDate

CH13_Order_Totals_By_Customer_And_Date (847 rows)

CustFullName	OrderDate	TotalCost
Alaina Hallmark	2017-09-03	$4,699.98
Alaina Hallmark	2017-09-15	$4,433.95
Alaina Hallmark	2017-09-22	$353.25
Alaina Hallmark	2017-09-23	$3,951.90
Alaina Hallmark	2017-10-01	$10,388.68
Alaina Hallmark	2017-10-13	$3,088.00
Alaina Hallmark	2017-10-23	$6,775.06
Alaina Hallmark	2017-10-31	$15,781.10
<< more rows here >>

Entertainment Agency Database

“Display each entertainment group ID, entertainment group member, and the amount of pay for each member based on the total contract price divided by the number of members in the group.”

Note: This one is really tricky because each member might belong to more than one entertainer group. You must sum the contract prices for each entertainer and then divide by the count of members in that group (assuming each member gets equal pay). Fetching the count requires a subquery filtered on the current entertainer ID (the ID of the group, not the ID of the member), which means you also must group by entertainer ID. Oh yes, and don’t forget to exclude members who are not active (Status = 3).

Translation/Clean Up

Select entertainer ID, member first name, member last name, and the sum of (contract price)s divided by / the (select count(*) of active members from entertainer members as EM2 in the current entertainer group where status is not equal to <> not active 3 and the EM2 table entertainer ID equals = the entertainer members table entertainer ID) from the members table inner joined with the entertainer members table on members.member ID in the members table matches = entertainer_members.member ID in the entertainer members table, then inner joined with the entertainers table on entertainers.entertainer ID in the entertainers table matches = entertainer_members.entertainer ID in the entertainer members table, and finally inner joined with the engagements table on entertainers.entertainer ID in the entertainers table matches = engagements.entertainer ID in the engagements table, where member status is not equal to <> not active 3, grouped by entertainer ID, member first name, and member last name, sorted order by member last name

SQL

SELECT Entertainers.EntertainerID,

Members.MbrFirstName, Members.MbrLastName,

SUM(Engagements.ContractPrice)/

(SELECT COUNT(*)

FROM Entertainer_Members AS EM2

WHERE EM2.Status <> 3

AND EM2.EntertainerID =

Entertainers.EntertainerID)

AS MemberPay

FROM ((Members

INNER JOIN Entertainer_Members

ON Members.MemberID =

Entertainer_Members.MemberID)

INNER JOIN Entertainers

ON Entertainers.EntertainerID =

Entertainer_Members.EntertainerID)

INNER JOIN Engagements

ON Entertainers.EntertainerID =

Engagements.EntertainerID

WHERE Entertainer_Members.Status <> 3

GROUP BY Entertainers.EntertainerID,

Members.MbrFirstName, Members.MbrLastName

ORDER BY Members.MbrLastName

CH13_Member_Pay (39 rows)

EntertainerID	MbrFirstName	MbrLastName	MemberPay
1010	Kendra	Bonnicksen	$2,887.50
1013	Kendra	Bonnicksen	$3,767.50
1007	Robert	Brown	$2,975.00
1008	Robert	Brown	$6,816.00
1008	George	Chavez	$6,816.00
1013	George	Chavez	$3,767.50
1010	Caroline	Coie	$2,887.50
1013	Caroline	Coie	$3,767.50
<< more rows here >>

School Scheduling Database

“For completed classes, list by category and student the category name, the student name, and the student’s average grade of all classes taken in that category.”

Translation/Clean Up

Select category description, student first name, student last name, and the average AVG(of grade) as AvgOfGrade from the categories table inner joined with the subjects table on categories.category ID in the categories table matches = subjects.category ID in the subjects table, then inner joined with the classes table on subjects.subject ID in the subjects table matches = classes.subject ID in the classes table, then inner joined with the student schedules table on classes.class ID in the classes table matches = student_schedules.class ID in the student schedules table, then inner joined with the student class status table on student_class_status.class status in the student class status table matches = student_schedules.class status in the student schedules table, and finally inner joined with the students table on students.student ID in the students table matches = student_schedules.student ID in the student schedules table where class status description is = ‘Completed,’ grouped by category description, student first name, and student last name

SQL

SELECT Categories.CategoryDescription,

Students.StudFirstName,

Students.StudLastName,

AVG(Student_Schedules.Grade) AS AvgOfGrade

FROM ((((Categories

INNER JOIN Subjects

ON Categories.CategoryID = Subjects.CategoryID)

INNER JOIN Classes

ON Subjects.SubjectID = Classes.SubjectID)

INNER JOIN Student_Schedules

ON Classes.ClassID = Student_Schedules.ClassID)

INNER JOIN Student_Class_Status

ON Student_Class_Status.ClassStatus =

Student_Schedules.ClassStatus)

INNER JOIN Students

ON Students.StudentID =

Student_Schedules.StudentID

WHERE Student_Class_Status.ClassStatusDescription =

'Completed'

GROUP BY Categories.CategoryDescription,

Students.StudFirstName,

Students.StudLastName

CH13_Student_GradeAverage_By_Category (63 rows)

Category Description	StudFirst Name	StudLast Name	AvgOfGrade
Accounting	Doris	Hartwig	80.51
Accounting	Elizabeth	Hallmark	91.12
Accounting	Kendra	Bonnicksen	88.50
Accounting	Richard	Lum	79.61
Accounting	Sarah	Thompson	77.34
Art	Doris	Hartwig	82.19
Art	George	Chavez	83.63
Art	John	Kennedy	87.65
Art	Kerry	Patterson	99.83
Art	Michael	Viescas	73.37
<< more rows here >>

Bowling League Database

“Show me for each tournament and match the tournament ID, the tournament location, the match number, the name of each team, and the total of the handicap score for each team.”

Translation/Clean Up

Select tourney ID, tourney location, match ID, team name, and the sum of (handicap score) as TotHandiCapScore from the tournaments table inner joined with the tourney matches table on tournaments.tourney ID in the tournaments table matches = tourney_matches.tourney ID in the tourney matches table, then inner joined with the match games table on tourney_matches.match ID in the tourney matches table matches = match_games.match ID in the match games table, then inner joined with the bowler scores table on match_games.match ID in the match games table matches = bowler_scores.match ID in the bowler scores table and match_games.game number in the match games table matches = bowler_scores.game number in the bowler scores table, then inner joined with the bowlers table on bowlers.bowler ID in the bowlers table matches = bowler_scores.bowler ID in the bowler scores table, and finally inner joined with the teams table on teams.team ID in the teams table matches = bowlers.team ID in the bowlers table, grouped by tourney ID, tourney location, match ID, and team name

SQL

SELECT Tournaments.TourneyID,

Tournaments.TourneyLocation,

Tourney_Matches.MatchID, Teams.TeamName,

SUM(Bowler_Scores.HandicapScore)

AS TotHandiCapScore

FROM ((((Tournaments

INNER JOIN Tourney_Matches

ON Tournaments.TourneyID =

Tourney_Matches.TourneyID)

INNER JOIN Match_Games

ON Tourney_Matches.MatchID =

Match_Games.MatchID)

INNER JOIN Bowler_Scores

ON (Match_Games.MatchID =

Bowler_Scores.MatchID) AND

(Match_Games.GameNumber =

Bowler_Scores.GameNumber))

INNER JOIN Bowlers

ON Bowlers.BowlerID = Bowler_Scores.BowlerID)

INNER JOIN Teams

ON Teams.TeamID = Bowlers.TeamID

GROUP BY Tournaments.TourneyID,

Tournaments.TourneyLocation,

Tourney_Matches.MatchID, Teams.TeamName

As you can see, the difficult part of this request is assembling the complex JOIN clauses to link all the tables in the correct manner.

CH13_Tournament_Match_Team_Results (112 rows)

Tourney ID	Tourney Location	MatchID	TeamName	TotHandi CapScore
1	Red Rooster Lanes	1	Marlins	2351
1	Red Rooster Lanes	1	Sharks	2348
1	Red Rooster Lanes	2	Barracudas	2289
1	Red Rooster Lanes	2	Terrapins	2391
1	Red Rooster Lanes	3	Dolphins	2389
1	Red Rooster Lanes	3	Orcas	2395
1	Red Rooster Lanes	4	Manatees	2292
1	Red Rooster Lanes	4	Swordfish	2353
2	Thunderbird Lanes	5	Marlins	2297
2	Thunderbird Lanes	5	Terrapins	2279
<< more rows here >>

“Display the highest raw score for each bowler.”

Translation/Clean Up

Select bowler first name, bowler last name, and the maximum (raw score) as HighScore from the bowlers table inner joined with the bowler scores table on bowlers.bowler ID in the bowlers table matches = bowler_scores.bowler ID in the bowler scores table, grouped by bowler first name, and bowler last name

SQL

SELECT Bowlers.BowlerFirstName,

Bowlers.BowlerLastName,

MAX(Bowler_Scores.RawScore) AS HighScore

FROM Bowlers

INNER JOIN Bowler_Scores

ON Bowlers.BowlerID = Bowler_Scores.BowlerID

GROUP BY Bowlers.BowlerFirstName,

Bowlers.BowlerLastName

CH13_Bowler_High_Score_Group (32 rows)

BowlerFirstName	BowlerLastName	HighScore
Alaina	Hallmark	180
Alastair	Black	164
Angel	Kennedy	194
Ann	Patterson	165
Bailey	Hallmark	164
Barbara	Fournier	164
Caleb	Viescas	193
Carol	Viescas	150
David	Cunningham	180
David	Fournier	178
<< more rows here >>

Recipes Database

“Show me how many recipes exist for each class of ingredient.”

Note: The challenge here is that you don’t want to count a particular recipe class more than once per recipe. For example, if a recipe contains multiple herbs or dairy ingredients, that recipe should be counted only once per class. Sounds like it’s time to use COUNT(DISTINCT value expression), doesn’t it?

Translation/Clean Up

Select ingredient class description, and the unique count of (distinct recipe ID) as CountOfRecipeID from the ingredient classes table inner joined with the ingredients table on ingredient_classes.ingredient class ID in the ingredient classes table matches = ingredients.ingredient class ID in the ingredients table, and then inner joined with the recipe ingredients table on ingredients.ingredient ID in the ingredients table matches = recipe_ingredients.ingredient ID in the recipe ingredients table, grouped by ingredient class description

SQL

SELECT

Ingredient_Classes.IngredientClassDescription,

Count(DISTINCT RecipeID) AS CountOfRecipeID

FROM (Ingredient_Classes

INNER JOIN Ingredients

ON Ingredient_Classes.IngredientClassID =

Ingredients.IngredientClassID)

INNER JOIN Recipe_Ingredients

ON Ingredients.IngredientID =

Recipe_Ingredients.IngredientID

GROUP BY

Ingredient_Classes.IngredientClassDescription

CH13_IngredientClass_Distinct_Recipe_Count (19 rows)

IngredientClassDescription	CountOfRecipeID
Bottle	1
Butter	3
Cheese	2
Chips	1
Condiment	2
Dairy	2
Fruit	1
Grain	2
Herb	1
<< more rows here >>

Note: Because Microsoft Access does not support COUNT DISTINCT, you’ll find that the query in the Access sample database first selects the DISTINCT values of RecipeID using a table subquery in the FROM clause and then counts the resulting rows.

Summary

I began the chapter by explaining to you why you might want to group data to get multiple subtotals from a result set. After tantalizing you with an example, I proceeded to show how to use the GROUP BY clause to solve the example problem and several others. I also showed how to mix column expressions with aggregate functions.

I next explored an interesting example of using GROUP BY in a subquery that acts as a filter in a WHERE clause. I subsequently pointed out that constructing a query using GROUP BY and no aggregate functions is the same as using DISTINCT in your SELECT clause. Then I warned you to carefully construct your GROUP BY clause to include the columns and not the expressions.

I wrapped up my discussion of the GROUP BY clause by explaining some common pitfalls. I showed that SQL does not consider any knowledge of primary keys. I also explained common mistakes you might make when using column expressions in your SELECT clause.

I summarized why the GROUP BY clause is useful and gave you a sample list of problems you can solve using GROUP BY. The rest of the chapter provided examples of how to build requests that require the GROUP BY clause.

The following section presents several requests that you can work out on your own.

Problems for You to Solve

Below, I show you the request statement and the name of the solution query in the sample databases. If you want some practice, you can work out the SQL you need for each request and then check your answer with the query I saved in the samples. Don’t worry if your syntax doesn’t exactly match the syntax of the queries I saved—as long as your result set is the same.

Sales Orders Database

1. “Show me each vendor and the average by vendor of the number of days to deliver products.”

(Hint: Use the AVG aggregate function and group on vendor.)

You can find the solution in CH13_Vendor_Avg_Delivery (10 rows).

2. “Display for each product the product name and the total sales.”

(Hint: Use SUM with a calculation of quantity times price and group on product name.)

You can find the solution in CH13_Sales_By_Product (38 rows).

3. “List all vendors and the count of products sold by each.”

You can find the solution in CH13_Vendor_Product_Count_Group (10 rows).

4. Challenge: Now solve problem 3 by using a subquery.

You can find the solution in CH13_Vendor_Product_Count_Subquery (10 rows).

Entertainment Agency Database

1. “Show each agent’s name, the sum of the contract price for the engagements booked, and the agent’s total commission.”

(Hint: You must multiply the sum of the contract prices by the agent’s commission. Be sure to group on the commission rate as well!)

You can find the solution in CH13_Agent_Sales_And_Commissions (8 rows).

School Scheduling Database

1. “Display by category the category name and the count of classes offered.”

(Hint: Use COUNT and group on category name.)

You can find the solution in CH13_Category_Class_Count (15 rows).

2. “List each staff member and the count of classes each is scheduled to teach.”

(Hint: Use COUNT and group on staff name.)

You can find the solution in CH13_Staff_Class_Count (22 rows).

3. Challenge: Now solve problem 2 by using a subquery.

You can find the solution in CH13_Staff_Class_Count_Subquery (27 rows).

4. Can you explain why the subquery solution returns five more rows? Is it possible to modify the query in question 2 to return 27 rows? If so, how would you do it?

(Hint: Think about using an OUTER JOIN.)

Bowling League Database

1. “Display for each bowler the bowler name and the average of the bowler’s raw game scores.”

(Hint: Use the AVG aggregate function and group on bowler name.)

You can find the solution in CH13_Bowler_Averages (32 rows).

2. “Calculate the current average and handicap for each bowler.”

(Hint: This is a “friendly” league, so the handicap is 90 percent of 200 minus the raw average. Be sure to round the raw average and convert it to an integer before subtracting it from 200, and then round and truncate the final result. Although the SQL Standard doesn’t define a ROUND function, most commercial database systems provide one. Check your product documentation for details.)

You can find the solution in CH13_Bowler_Average_Handicap (32 rows).

3. Challenge: “Display the highest raw score for each bowler,” but solve it by using a subquery.

You can find the solution in CH13_Bowler_High_Score_Subquery (32 rows).

Recipes Database

1. “If I want to cook all the recipes in my cookbook, how much of each ingredient must I have on hand?”

(Hint: Use SUM and group on ingredient name and measurement description.)

You can find the solution in CH13_Total_Ingredients_Needed (65 rows).

2. “List all meat ingredients and the count of recipes that include each one.”

You can find the solution in CH13_Meat_Ingredient_Recipe_Count_Group (4 rows).

3. Challenge: Now solve problem 2 by using a subquery.

You can find the solution in CH13_Meat_Ingredient_Recipe_Count_Subquery (11 rows).

4. Can you explain why the subquery solution returns seven more rows? Is it possible to modify the query in question 2 to return 11 rows? If so, how would you do it?

(Hint: Think about using an OUTER JOIN.)

13Grouping Data

Topics Covered in This Chapter

Why Group Data?

The GROUP BY Clause

Syntax

Mixing Columns and Expressions

Using GROUP BY in a Subquery in a WHERE Clause

Simulating a SELECT DISTINCT Statement

“Some Restrictions Apply”

Column Restrictions

Grouping on Expressions

Uses for GROUP BY

Sample Statements

Sales Orders Database

Entertainment Agency Database

School Scheduling Database

Bowling League Database

Recipes Database

Summary

Problems for You to Solve

Sales Orders Database

Entertainment Agency Database

School Scheduling Database

Bowling League Database

Recipes Database

13
Grouping Data