Lesson 13
Grouping and Aggregates

SQL can compute aggregate values. Aggregates are single values calculated from many values. If you have a table of students with GPAs, you could calculate the average GPA. The average value is an aggregate. The minimum or maximum GPA could also be calculated. Both are aggregates. Regardless of the number of students, an aggregate “rolls up” or “cooks down” to one value.

The GROUP BY clause allows for the further partitioning of a result and for the calculation of aggregates per partition. Using GROUP BY, there is the ability to calculate the average GPA per home state, the minimum GPA per major, or the number of classes each student completed.

The lesson will once again use the TrackIt database and schema. You will need to have it installed to run the queries provided in this lesson to see the results.

AGGREGATE FUNCTIONS

There are a dozen or more SQL aggregate functions, depending on what SQL database system you use. The following are the most common and universally supported:

  • COUNT: Counts the number of non‐NULL values in a set; works on any non‐NULL value
  • SUM: Sums values in a set; values must be numeric
  • AVG: Calculates the average of values in a set; values must be numeric
  • MIN: Determines the minimum value in a set; values must be comparable
  • MAX: Determines the maximum value in a set; values must be comparable

An aggregate function commonly appears in the SELECT value list. Listing 13.1 presents a simple query against the TrackIt database that counts the number of TaskIds.

The previous listing queries the TrackIt database and counts the number of TaskIds and then counts the number of Tasks. The output will be a count of 543 values for both queries.

Each of the five aggregate functions requires one argument: the source of values to be aggregated. It can be a field or any value expression. The * argument in COUNT(*) is special. It tells the SQL engine to count records, not values. In the queries in Listing 13.1, the result is identical, but that's not always the case. Consider TaskStatusIds, as shown in Listing 13.2.

This query returns 532 values. The number of tasks was 543, so the number of TaskStatusIds doesn't match the number of Tasks. This is because NULLs are omitted. Task.TaskStatusId can be NULL and is NULL 11 times out of 543.

You can aggregate any value. The value can come from a joined table or from a result filtered with WHERE. Listing 13.3 counts resolved Tasks.

In this listing, the Task table has been joined with the TaskStatus table. The TaskIds are then counted, but only where the IsResolved status is equal to 1. The results show that there are 276 resolved Tasks. (Your result may be different if you have previously added, updated, or deleted Tasks.)

USING GROUP BY

As you've seen, GROUP BY is an optional clause in a SELECT statement. It partitions a result into groups. GROUP BY can be used with aggregate functions to compute a value per group instead of computing across the entire result.

While you can write SELECT statements that do not include GROUP BY, you must include a GROUP BY statement if the SELECT clause includes both aggregate and nonaggregate fields. If you do not provide a GROUP BY statement in these cases, your query will not group the results appropriately. The GROUP BY clause is placed after WHERE, if it's present, and before ORDER BY.

In Listing 13.4, Tasks are counted per status. In this case, the results will be grouped by status, and the number of tasks that are associated with each status will be counted.

You can see in this listing that the name of the status is being selected along with the count of TaskIds. If the status name is NULL, then it will be categorized under the name [None]. A LEFT OUTER JOIN is used to connect the Task table to the TaskStatus table before grouping by the Task Name and also ordering by the task name. The final output when this query is executed is as follows:

StatusName TaskCount
[None] 11
Closed 80
In Progress 64
Parked 64
Pending Release 65
Resolved 53
Resolved, Duplicate 62
Resolved, Will Not Fix 81
Testing/Validation 63

There are a few nuances in the query, as listed here:

  • Note that the GROUP BY statement references the nonaggregated field in the SELECT statement (s.Name). As shown in Listing 13.5, if GROUP BY is not included, the database engine will calculate the total TaskCount across all records and display it with the first Task.Name value.

    The following is the output for this modified query:

    StatusName TaskCount
    [None] 543
  • A LEFT OUTER JOIN is used to get all Tasks. An INNER JOIN would eliminate NULL TaskStatusIds.
  • The sort is by s.Name because it's the value that drives IFNULL(s.Name, '[None]'). You could instead sort by COUNT(s.TaskId).
  • Because s.Name can be NULL, a replacement value is provided so it's easy to display the NULL status.
  • Aliases provide meaningful names for aggregate values.

Grouping and Multiple Columns

What happens if you want to know if a status is resolved as well as know its name? What happens when the TaskStatus.IsResolved column is added, as shown in Listing 13.6?

This listing should not work; however, this should create the following results when using MySQL:

StatusName IsResolved TaskCount
[None] Null 11
Closed 1 80
In Progress 0 64
Parked 0 64
Pending Release 0 65
Resolved 1 53
Resolved, Duplicate 1 62
Resolved, Will Not Fix 1 81
Testing/Validation 0 63

You should note that not all relational database management systems (RDBMSs) are this forgiving. For example, in Oracle or SQL Server, you may receive a message like the following:

 Error Code: 1055. Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'trackit.s.IsResolved' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
 

This error can happen because in core SQL, SELECTGROUP BY can't select a value that's not an aggregate or part of the group. That makes sense. If the value would create new groups, what happens to the original groups, and how should the aggregate behave?

As an example, what should happen if Task.TaskId is added to the SELECT list? Are results now grouped by TaskId? If so, the Task count would be pretty boring. Each TaskId has one and only one Task. If results are not grouped by TaskId, what then?

No worries. Even though this query works in MySQL, it's good practice to write queries deliberately and correctly, especially if the same queries might be used in a different system.

To add TaskStatus.IsResolved, make it part of the GROUP BY clause when it is added to the SELECT list. This is shown in Listing 13.7. Conceptually, this groups the results by TaskStatus.Name and TaskStatus.IsResolved. Practically, it doesn't change the original groups. Can you see why this is true?

You can see in this listing that IsResolved is being added to the SELECT but is including a check to see if it is null. If IsResolved is null, then it is included in a group identified as 0. You can also see that IsResolved is also part of what is being grouped within the GROUP BY clause. The output from this is the same as the previous listing, except for the first row showing a value of 0 for IsResovled instead of null.

StatusName IsResolved TaskCount
[None] 0 11
Closed 1 80
In Progress 0 64
Parked 0 64
Pending Release 0 65
Resolved 1 53
Resolved, Duplicate 1 62
Resolved, Will Not Fix 1 81
Testing/Validation 0 63

Adding DISTINCT

Most uses of DISTINCT can be accomplished by grouping data with GROUP BY. In fact, MySQL uses GROUP BY optimizations to optimize DISTINCT queries. Consider using a GROUP BY if it's more appropriate, and it's usually more appropriate.

For example, the code in Listing 13.8 can be used to get a list of unique project names.

This query produces a list of the 26 unique project names. The following are the first few entries of the output:

ProjectName ProjectId
Churlish Curling game‐churlish
CookerMaker game‐cooker
Diva Diva Diva game‐diva
Don't Eat The Cheese! game‐cheese
GameIt Accounts Payable accounts‐payable

While this list presents a list of distinct project names, you could do the same by using GROUP BY Project.Name instead. This is shown in Listing 13.9.

The results from Listing 13.8 and Listing 13.9 are the same. The GROUP BY in Listing 13.9 provides the same results as if you had used DISTINCT.

USING HAVING

There is another clause that can be used with SELECT called HAVING. The HAVING clause works with GROUP BY and allows data to be filtered based on defined criteria. It will specify the condition or conditions that make up a group or aggregation.

For the TrackIt database, the next goal is to fetch the estimated hours in Tasks assigned to Workers, calculate a total per Worker, and find all Workers with more than 100 total hours. Three tables are needed: Worker, ProjectWorker, and Task. INNER JOINs can be used in all relationships. There are no optional rows.

Determine groups and selected values. We're interested in total hours per Worker, so you should select the Workers’ names. Just to be careful, the WorkerId can be added to the group. That ensures two or more Workers are not treated as one if they share the same names, because each Worker has a unique WorkerId. Listing 13.10 shows what a first draft of a query might be.

This query is selecting the name of the worker, which is displayed as a concatenation of the first and last names with a space between them. It also displays the aggregate sum of sum of the estimated hours. This is done for each worker based on project hours. These are then grouped by the WorkerID followed with the first and last names also being part of the GROUP BY.

Executing this query will show 88 records. The first five could look like this:

WorkerName TotalHours
Inez Fanthome 65.50
Lindy Chattoe 67.75
Thorin Achromov 45.50
Rickie Osgodby 24.25
Andriette Dimsdale 168.50

Aggregate values can't be filtered using the WHERE clause. The WHERE clause is evaluated before aggregate functions, so it happens too late. Instead, the HAVING clause can be used. As mentioned, HAVING is an optional clause in a SELECT statement, and it can be included only when a GROUP BY clause exists. HAVING is followed by an expression that evaluates to a Boolean value, just like a WHERE clause, but the expression includes comparisons against aggregate values.

To exclude totals less than 100 hours, add a HAVING clause to Listing 13.10. The HAVING clause is placed after GROUP BY and before ORDER BY, as shown in Listing 13.11.

You can see in this listing that the HAVING statement has been added. It uses the SUM aggregate function to determine which totals of EstimateHours are greater than or equal to 100 and includes only those results. With the addition of the HAVING statement, the new results now show only nine records.

WorkerName TotalHours
Andriette Dimsdale 168.50
Vlad Anfusso 153.50
Karlen Egalton 116.75
Kenon Kirkham 218.00
Luci Reeves 132.25
Ealasaid Blinco 136.50
Juliet Strivens 114.25
Winston Marien 105.50
Danyelle O'Hanley 125.50

SELECT EVALUATION ORDER

There is an order to how a SELECT statement is evaluated. Specifically, the key words in a SELECT statement are evaluated in the following order:

  1. FROM: Determine where to start
  2. JOIN ON: Link other tables and formalize row relationships
  3. WHERE: Filter table rows
  4. GROUP BY: Partition the data and calculate aggregates
  5. HAVING: Filter by aggregate
  6. SELECT: Decide what is part of the final result
  7. DISTINCT: Remove duplicates from the result
  8. ORDER BY: Sort the final result
  9. LIMIT: Return a subset of the final result

This order shows why you can't filter aggregates in a WHERE clause but you can ORDER BY aggregate values.

OTHER EXAMPLES

Consider Tasks from the Project perspective. You could get a sense of when each Project truly started by grabbing its Tasks and finding the minimum due date. That would tell when the first concrete Task is due, and concrete assigned Tasks might be more reliable than an idealized schedule.

Strictly speaking, only two tables are needed: Project and Task. In the worker queries, ProjectWorker is included, but it's not really necessary. Can you see why? Task includes ProjectWorker's primary key ProjectId and WorkerId as a foreign key. That ensures an invalid combination can't be created of ProjectId and WorkerId in Task; for example, a Project's Task can't be assigned to a Worker who isn't assigned to the project. That's pretty nice. With that insurance in place, the ProjectWorker table can be safely ignored. The JOIN will work just fine without it, as shown in Listing 13.12.

This query returns 20 rows. The following are the first five records:

ProjectName MinTaskDueDate
Churlish Curling 2000‐09‐13
CookerMaker 2000‐05‐05
Diva Diva Diva 2002‐08‐16
Don't Eat The Cheese! 2000‐03‐24
Grumps 2000‐01‐15

This listing uses the MIN function with DueDate. It is worth noting that MIN and MAX work on any data type that can be compared and ranked. They definitely work for numbers, but they also work with dates, times, and strings.

What if an overview of each Project is wanted: first and last Task due date, total estimated hours, total number of Tasks, and average Task hours estimate? This can be done in one query, as shown in Listing 13.13.

Once a group has been defined, you can run any number of aggregate calculations on it. To make things a bit more interesting, Listing 13.13 eliminates Projects with fewer than 10 Tasks.

You should be able to follow each line of this query, which uses the aggregate functions mentioned at the beginning of this lesson. As you go through the query, pay close attention to where each clause is declared in the SELECT. As you can see, the query is using everything available to it. The following are 4 of the 21 records returned:

ProjectName MinTaskDueDate MaxTaskDueDate TotalHours AverageTaskHours TaskCount
Gametic Enterprise 2000‐01‐06 2019‐11‐22 205.75 5.143750 40
GameIt Accounts Receivable 2001‐03‐23 2018‐11‐02 129.00 5.375000 24
GameIt HR Intranet V2 2000‐08‐10 2018‐12‐26 122.25 5.093750 24
GameIt Payroll 2000‐03‐23 2019‐07‐03 137.25 5.718750 24

In looking at the results, when nongame Projects are allowed, it's clear they dominate total hours and number of Tasks (in our fake data).

SUMMARY

Aggregate functions compute a single value from many values. The source values come from fields and calculations. On their own, aggregates compute one value per query result. Common aggregates include COUNT, SUM, MIN, MAX, and AVG.

The GROUP BY clause is an optional extension to the SELECT statement. It partitions results into distinct groups. Aggregate functions can operate on groups and provide a single computed value per group.

The HAVING clause is an optional extension to the SELECT statement. It is valid only if a SELECT contains a GROUP BY. HAVING evaluates a Boolean expression using aggregate values. As with WHERE, a record is included if the expression evaluates to true; otherwise, the record is excluded. You cannot filter aggregate values in the WHERE clause. Aggregates are not computed until after the WHERE is evaluated.

EXERCISES

The following exercises are provided to allow you to experiment with concepts presented in this lesson:

The Personal Trainer Database

For the exercises in this lesson, you will use the PersonalTrainer database and schema that you've seen in previous lessons. If you already have the PersonalTrainer database, you are welcome to use it. If not, you can use the personaltrainer‐schema‐and‐data.sql script from the downloadable files for this book. The downloadable files are at www.wiley.com/go/jobreadysql.

After running the script, use MySQL Workbench or a MySQL command line to review the tables in the database and the fields in each table. You may also want to reference the ERD in Figure 13.1 to identify the relationships between tables.

Snapshot of the PersonalTrainer database schema.

Figure 13.1 The PersonalTrainer database schema

Exercise 13.1: Number of Clients (1 row)

Use an aggregate to count the number of Clients.

Exercise 13.2: Counting Client Birth Dates (1 row)

Use an aggregate to count Client.BirthDate. The number is different from total Clients. Why?

Exercise 13.3: Clients by City (20 rows)

Group Clients by City and count them.

  • Sort by the number of Clients descending.
  • Include both City and the client count in the results.

Here are the sample results:

city client_count
New Orleans 105
Jefferson 30

Exercise 13.4: Invoice Totals (1,000 rows)

Calculate a total per invoice using only the InvoiceLineItem table.

  • Group by InvoiceId.
  • You'll need an expression for the line item total: Price * Quantity.
  • Aggregate per group using SUM.

Here are the sample results:

invoiceid invoice:total
1 283.12500000
2 105.00000000

Exercise 13.5: Invoices More Than $500 (234 rows)

Modify the query in Exercise 13.4 for the following:

  • Only include totals greater than $500.00.
  • Sort from lowest total to highest.

Here are the sample results:

invoiceid invoice:total
368 502.50000000
557 502.50000000

Exercise 13.6: Average Line Item Totals (3 rows)

Calculate the average line item total, grouped by InvoiceLineItem.Description. The following are sample results:

description invoice:average
Individual Instruction 160.502717391304
Group Instruction 25.482495511670

Exercise 13.7: More Than $1,000 Paid (146 rows)

Select ClientId, FirstName, and LastName from Client for clients who have paid more than $1,000 total.

  • Paid is Invoice.InvoiceStatus = 2 .
  • Sort by LastName, then FirstName.

Here are the sample results:

ClientId FirstName LastName Total
bcf40948‐b93b‐4c1f‐b1c7‐ee10c05b9faf Randal Aberkirdo 1540.99500000
d0a2212e‐6332‐4541‐9e00‐116ddf88fe45 Phyllys Acome 1115.62500000

Exercise 13.8: Counts by Category (13 rows)

Count exercises by category.

  • Group by ExerciseCategory.Name.
  • Sort by exercise count descending.

Here are the sample results:

CategoryName ExerciseCount
Bodyweight 11
Flexibility 9

Exercise 13.9: Exercises (64 rows)

Select Exercise.Name along with the minimum, maximum, and average ExerciseInstance.Sets. Sort the results by Exercise.Name.

Here are the sample results:

ExerciseName MinSets MaxSets AvgSets
Air squats 1 2 1.2500
Ananda Balasana 1 10 3.5000

Exercise 13.10: Client Birth Dates (26 rows)

Find the minimum and maximum Client.BirthDate per Workout. Sort the results by the workout name.

Here are the sample results:

WorkoutName EarliestBirthDate LatestBirthDate
3, 2, 1… Yoga! 1928‐04‐28 1993‐02‐07
Agility Training 1935‐05‐11 2004‐02‐28

Exercise 13.11: Client Goal Count (500 rows, 50 rows with no goal)

Count the client goals. Be careful not to exclude rows for clients without goals. Your sample results should include the following:

ClientId GoalCount
00268ec4‐cdb6‐4643‐8e94‐3aa467419af6 0
04971685‐17d8‐4973‐bf35‐42e8a2d4810c 0

Exercise 13.12: Exercise Unit Value (82 rows)

Select Exercise.Name, Unit.Name, and minimum and maximum ExerciseInstanceUnitValue.Value for all exercises with a configured ExerciseInstanceUnitValue. Sort the results by Exercise.Name and then Unit.Name.

Here are the sample results:

ExerciseName UnitName MinValue MaxValue
Air squats Repetitions 25 150
Ananda Balasana Minutes 5 25

Exercise 13.13: Categorized Exercise Unit Value (82 rows)

Modify the query in Exercise 13.12 to include ExerciseCategory.Name. Order the output by ExerciseCategory.Name, then Exercise.Name, and then Unit.Name.

Here are the sample results:

CategoryName ExerciseName UnitName MinValue MaxValue
Biking Street ride Miles 5 40
Biking Trail ride Miles 5 40

Exercise 13.14: Level Ages (4 rows)

Select the minimum and maximum age in years for each Level. To calculate age in years, use the MySQL function DATEDIFF. (Do online research to see how this function works. One location online you can review is dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html.)

Here are the sample results:

LevelName MinAge MaxAge
Beginner 15.0466 94.2110
Intermediate 14.0329 95.2575