SQL can compute aggregate values. Aggregates are single values calculated from many values. If you have a table of students with GPAs, you could calculate the average GPA. The average value is an aggregate. The minimum or maximum GPA could also be calculated. Both are aggregates. Regardless of the number of students, an aggregate “rolls up” or “cooks down” to one value.
The GROUP BY
clause allows for the further partitioning of a result and for the calculation of aggregates per partition. Using GROUP BY
, there is the ability to calculate the average GPA per home state, the minimum GPA per major, or the number of classes each student completed.
The lesson will once again use the TrackIt database and schema. You will need to have it installed to run the queries provided in this lesson to see the results.
There are a dozen or more SQL aggregate functions, depending on what SQL database system you use. The following are the most common and universally supported:
An aggregate function commonly appears in the SELECT
value list. Listing 13.1 presents a simple query against the TrackIt database that counts the number of TaskIds.
The previous listing queries the TrackIt database and counts the number of TaskIds and then counts the number of Tasks. The output will be a count of 543 values for both queries.
Each of the five aggregate functions requires one argument: the source of values to be aggregated. It can be a field or any value expression. The * argument in COUNT(*)
is special. It tells the SQL engine to count records, not values. In the queries in Listing 13.1, the result is identical, but that's not always the case. Consider TaskStatusIds, as shown in Listing 13.2.
This query returns 532 values. The number of tasks was 543, so the number of TaskStatusIds doesn't match the number of Tasks. This is because NULL
s are omitted. Task.TaskStatusId
can be NULL
and is NULL
11 times out of 543.
You can aggregate any value. The value can come from a joined table or from a result filtered with WHERE
. Listing 13.3 counts resolved Tasks.
In this listing, the Task table has been joined with the TaskStatus table. The TaskIds
are then counted, but only where the IsResolved
status is equal to 1
. The results show that there are 276 resolved Tasks. (Your result may be different if you have previously added, updated, or deleted Tasks.)
As you've seen, GROUP BY
is an optional clause in a SELECT
statement. It partitions a result into groups. GROUP BY
can be used with aggregate functions to compute a value per group instead of computing across the entire result.
While you can write SELECT
statements that do not include GROUP BY
, you must include a GROUP BY
statement if the SELECT
clause includes both aggregate and nonaggregate fields. If you do not provide a GROUP BY
statement in these cases, your query will not group the results appropriately. The GROUP BY
clause is placed after WHERE
, if it's present, and before ORDER BY
.
In Listing 13.4, Tasks are counted per status. In this case, the results will be grouped by status, and the number of tasks that are associated with each status will be counted.
You can see in this listing that the name of the status is being selected along with the count of TaskIds. If the status name is NULL
, then it will be categorized under the name [None]
. A LEFT OUTER JOIN
is used to connect the Task table to the TaskStatus table before grouping by the Task Name and also ordering by the task name. The final output when this query is executed is as follows:
StatusName | TaskCount |
---|---|
[None] | 11 |
Closed | 80 |
In Progress | 64 |
Parked | 64 |
Pending Release | 65 |
Resolved | 53 |
Resolved, Duplicate | 62 |
Resolved, Will Not Fix | 81 |
Testing/Validation | 63 |
There are a few nuances in the query, as listed here:
GROUP BY
statement references the nonaggregated field in the SELECT
statement (s.Name
). As shown in Listing 13.5, if GROUP BY
is not included, the database engine will calculate the total TaskCount
across all records and display it with the first Task.Name
value.
The following is the output for this modified query:
StatusName | TaskCount |
---|---|
[None] | 543 |
LEFT OUTER JOIN
is used to get all Tasks. An INNER JOIN
would eliminate NULL TaskStatusIds
.s.Name
because it's the value that drives IFNULL(s.Name, '[None]')
. You could instead sort by COUNT(s.TaskId)
.s.Name
can be NULL
, a replacement value is provided so it's easy to display the NULL
status.What happens if you want to know if a status is resolved as well as know its name? What happens when the TaskStatus.IsResolved
column is added, as shown in Listing 13.6?
This listing should not work; however, this should create the following results when using MySQL:
StatusName | IsResolved | TaskCount |
---|---|---|
[None] | Null | 11 |
Closed | 1 | 80 |
In Progress | 0 | 64 |
Parked | 0 | 64 |
Pending Release | 0 | 65 |
Resolved | 1 | 53 |
Resolved, Duplicate | 1 | 62 |
Resolved, Will Not Fix | 1 | 81 |
Testing/Validation | 0 | 63 |
You should note that not all relational database management systems (RDBMSs) are this forgiving. For example, in Oracle or SQL Server, you may receive a message like the following:
Error Code: 1055. Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'trackit.s.IsResolved' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
This error can happen because in core SQL, SELECT
…GROUP BY
can't select a value that's not an aggregate or part of the group. That makes sense. If the value would create new groups, what happens to the original groups, and how should the aggregate behave?
As an example, what should happen if Task.TaskId
is added to the SELECT
list? Are results now grouped by TaskId
? If so, the Task count would be pretty boring. Each TaskId
has one and only one Task. If results are not grouped by TaskId
, what then?
No worries. Even though this query works in MySQL, it's good practice to write queries deliberately and correctly, especially if the same queries might be used in a different system.
To add TaskStatus.IsResolved
, make it part of the GROUP BY
clause when it is added to the SELECT
list. This is shown in Listing 13.7. Conceptually, this groups the results by TaskStatus.Name
and TaskStatus.IsResolved
. Practically, it doesn't change the original groups. Can you see why this is true?
You can see in this listing that IsResolved
is being added to the SELECT
but is including a check to see if it is null
. If IsResolved
is null, then it is included in a group identified as 0. You can also see that IsResolved
is also part of what is being grouped within the GROUP BY
clause. The output from this is the same as the previous listing, except for the first row showing a value of 0 for IsResovled
instead of null.
StatusName | IsResolved | TaskCount |
---|---|---|
[None] | 0 | 11 |
Closed | 1 | 80 |
In Progress | 0 | 64 |
Parked | 0 | 64 |
Pending Release | 0 | 65 |
Resolved | 1 | 53 |
Resolved, Duplicate | 1 | 62 |
Resolved, Will Not Fix | 1 | 81 |
Testing/Validation | 0 | 63 |
Most uses of DISTINCT
can be accomplished by grouping data with GROUP BY
. In fact, MySQL uses GROUP BY
optimizations to optimize DISTINCT
queries. Consider using a GROUP BY
if it's more appropriate, and it's usually more appropriate.
For example, the code in Listing 13.8 can be used to get a list of unique project names.
This query produces a list of the 26 unique project names. The following are the first few entries of the output:
ProjectName | ProjectId |
---|---|
Churlish Curling | game‐churlish |
CookerMaker | game‐cooker |
Diva Diva Diva | game‐diva |
Don't Eat The Cheese! | game‐cheese |
GameIt Accounts Payable | accounts‐payable |
While this list presents a list of distinct project names, you could do the same by using GROUP BY Project.Name
instead. This is shown in Listing 13.9.
The results from Listing 13.8 and Listing 13.9 are the same. The GROUP BY
in Listing 13.9 provides the same results as if you had used DISTINCT
.
There is another clause that can be used with SELECT
called HAVING
. The HAVING
clause works with GROUP BY
and allows data to be filtered based on defined criteria. It will specify the condition or conditions that make up a group or aggregation.
For the TrackIt database, the next goal is to fetch the estimated hours in Tasks assigned to Workers, calculate a total per Worker, and find all Workers with more than 100 total hours. Three tables are needed: Worker, ProjectWorker, and Task. INNER JOIN
s can be used in all relationships. There are no optional rows.
Determine groups and selected values. We're interested in total hours per Worker, so you should select the Workers’ names. Just to be careful, the WorkerId can be added to the group. That ensures two or more Workers are not treated as one if they share the same names, because each Worker has a unique WorkerId. Listing 13.10 shows what a first draft of a query might be.
This query is selecting the name of the worker, which is displayed as a concatenation of the first and last names with a space between them. It also displays the aggregate sum of sum of the estimated hours. This is done for each worker based on project hours. These are then grouped by the WorkerID followed with the first and last names also being part of the GROUP BY
.
Executing this query will show 88 records. The first five could look like this:
WorkerName | TotalHours |
---|---|
Inez Fanthome | 65.50 |
Lindy Chattoe | 67.75 |
Thorin Achromov | 45.50 |
Rickie Osgodby | 24.25 |
Andriette Dimsdale | 168.50 |
Aggregate values can't be filtered using the WHERE
clause. The WHERE
clause is evaluated before aggregate functions, so it happens too late. Instead, the HAVING
clause can be used. As mentioned, HAVING
is an optional clause in a SELECT
statement, and it can be included only when a GROUP BY
clause exists. HAVING
is followed by an expression that evaluates to a Boolean value, just like a WHERE
clause, but the expression includes comparisons against aggregate values.
To exclude totals less than 100 hours, add a HAVING
clause to Listing 13.10. The HAVING
clause is placed after GROUP BY
and before ORDER BY
, as shown in Listing 13.11.
You can see in this listing that the HAVING
statement has been added. It uses the SUM
aggregate function to determine which totals of EstimateHours are greater than or equal to 100 and includes only those results. With the addition of the HAVING
statement, the new results now show only nine records.
WorkerName | TotalHours |
---|---|
Andriette Dimsdale | 168.50 |
Vlad Anfusso | 153.50 |
Karlen Egalton | 116.75 |
Kenon Kirkham | 218.00 |
Luci Reeves | 132.25 |
Ealasaid Blinco | 136.50 |
Juliet Strivens | 114.25 |
Winston Marien | 105.50 |
Danyelle O'Hanley | 125.50 |
There is an order to how a SELECT
statement is evaluated. Specifically, the key words in a SELECT
statement are evaluated in the following order:
This order shows why you can't filter aggregates in a WHERE
clause but you can ORDER BY
aggregate values.
Consider Tasks from the Project perspective. You could get a sense of when each Project truly started by grabbing its Tasks and finding the minimum due date. That would tell when the first concrete Task is due, and concrete assigned Tasks might be more reliable than an idealized schedule.
Strictly speaking, only two tables are needed: Project and Task. In the worker queries, ProjectWorker is included, but it's not really necessary. Can you see why? Task includes ProjectWorker's primary key ProjectId
and WorkerId
as a foreign key. That ensures an invalid combination can't be created of ProjectId
and WorkerId
in Task; for example, a Project's Task can't be assigned to a Worker who isn't assigned to the project. That's pretty nice. With that insurance in place, the ProjectWorker table can be safely ignored. The JOIN
will work just fine without it, as shown in Listing 13.12.
This query returns 20 rows. The following are the first five records:
ProjectName | MinTaskDueDate |
---|---|
Churlish Curling | 2000‐09‐13 |
CookerMaker | 2000‐05‐05 |
Diva Diva Diva | 2002‐08‐16 |
Don't Eat The Cheese! | 2000‐03‐24 |
Grumps | 2000‐01‐15 |
This listing uses the MIN
function with DueDate
. It is worth noting that MIN
and MAX
work on any data type that can be compared and ranked. They definitely work for numbers, but they also work with dates, times, and strings.
What if an overview of each Project is wanted: first and last Task due date, total estimated hours, total number of Tasks, and average Task hours estimate? This can be done in one query, as shown in Listing 13.13.
Once a group has been defined, you can run any number of aggregate calculations on it. To make things a bit more interesting, Listing 13.13 eliminates Projects with fewer than 10 Tasks.
You should be able to follow each line of this query, which uses the aggregate functions mentioned at the beginning of this lesson. As you go through the query, pay close attention to where each clause is declared in the SELECT
. As you can see, the query is using everything available to it. The following are 4 of the 21 records returned:
ProjectName | MinTaskDueDate | MaxTaskDueDate | TotalHours | AverageTaskHours | TaskCount |
---|---|---|---|---|---|
Gametic Enterprise | 2000‐01‐06 | 2019‐11‐22 | 205.75 | 5.143750 | 40 |
GameIt Accounts Receivable | 2001‐03‐23 | 2018‐11‐02 | 129.00 | 5.375000 | 24 |
GameIt HR Intranet V2 | 2000‐08‐10 | 2018‐12‐26 | 122.25 | 5.093750 | 24 |
GameIt Payroll | 2000‐03‐23 | 2019‐07‐03 | 137.25 | 5.718750 | 24 |
In looking at the results, when nongame Projects are allowed, it's clear they dominate total hours and number of Tasks (in our fake data).
Aggregate functions compute a single value from many values. The source values come from fields and calculations. On their own, aggregates compute one value per query result. Common aggregates include COUNT
, SUM
, MIN
, MAX
, and AVG
.
The GROUP BY
clause is an optional extension to the SELECT
statement. It partitions results into distinct groups. Aggregate functions can operate on groups and provide a single computed value per group.
The HAVING
clause is an optional extension to the SELECT
statement. It is valid only if a SELECT
contains a GROUP BY
. HAVING
evaluates a Boolean expression using aggregate values. As with WHERE
, a record is included if the expression evaluates to true; otherwise, the record is excluded. You cannot filter aggregate values in the WHERE
clause. Aggregates are not computed until after the WHERE
is evaluated.
The following exercises are provided to allow you to experiment with concepts presented in this lesson:
For the exercises in this lesson, you will use the PersonalTrainer database and schema that you've seen in previous lessons. If you already have the PersonalTrainer database, you are welcome to use it. If not, you can use the personaltrainer‐schema‐and‐data.sql
script from the downloadable files for this book. The downloadable files are at www.wiley.com/go/jobreadysql
.
After running the script, use MySQL Workbench or a MySQL command line to review the tables in the database and the fields in each table. You may also want to reference the ERD in Figure 13.1 to identify the relationships between tables.
Use an aggregate to count the number of Clients.
Use an aggregate to count Client.BirthDate
. The number is different from total Clients. Why?
Group Clients by City and count them.
Here are the sample results:
city | client_count |
---|---|
New Orleans | 105 |
Jefferson | 30 |
Calculate a total per invoice using only the InvoiceLineItem table.
InvoiceId
.Price * Quantity
.Here are the sample results:
invoiceid | invoice:total |
---|---|
1 | 283.12500000 |
2 | 105.00000000 |
Modify the query in Exercise 13.4 for the following:
Here are the sample results:
invoiceid | invoice:total |
---|---|
368 | 502.50000000 |
557 | 502.50000000 |
Calculate the average line item total, grouped by InvoiceLineItem.Description
. The following are sample results:
description | invoice:average |
---|---|
Individual Instruction | 160.502717391304 |
Group Instruction | 25.482495511670 |
Select ClientId, FirstName, and LastName from Client for clients who have paid more than $1,000 total.
Invoice.InvoiceStatus = 2
.Here are the sample results:
ClientId | FirstName | LastName | Total |
---|---|---|---|
bcf40948‐b93b‐4c1f‐b1c7‐ee10c05b9faf | Randal | Aberkirdo | 1540.99500000 |
d0a2212e‐6332‐4541‐9e00‐116ddf88fe45 | Phyllys | Acome | 1115.62500000 |
Count exercises by category.
ExerciseCategory.Name
.Here are the sample results:
CategoryName | ExerciseCount |
---|---|
Bodyweight | 11 |
Flexibility | 9 |
Select Exercise.Name
along with the minimum, maximum, and average ExerciseInstance.Sets
. Sort the results by Exercise.Name
.
Here are the sample results:
ExerciseName | MinSets | MaxSets | AvgSets |
---|---|---|---|
Air squats | 1 | 2 | 1.2500 |
Ananda Balasana | 1 | 10 | 3.5000 |
Find the minimum and maximum Client.BirthDate
per Workout. Sort the results by the workout name.
Here are the sample results:
WorkoutName | EarliestBirthDate | LatestBirthDate |
---|---|---|
3, 2, 1… Yoga! | 1928‐04‐28 | 1993‐02‐07 |
Agility Training | 1935‐05‐11 | 2004‐02‐28 |
Count the client goals. Be careful not to exclude rows for clients without goals. Your sample results should include the following:
ClientId | GoalCount |
---|---|
00268ec4‐cdb6‐4643‐8e94‐3aa467419af6 | 0 |
04971685‐17d8‐4973‐bf35‐42e8a2d4810c | 0 |
Select Exercise.Name
, Unit.Name
, and minimum and maximum ExerciseInstanceUnitValue.Value
for all exercises with a configured ExerciseInstanceUnitValue. Sort the results by Exercise.Name
and then Unit.Name
.
Here are the sample results:
ExerciseName | UnitName | MinValue | MaxValue |
---|---|---|---|
Air squats | Repetitions | 25 | 150 |
Ananda Balasana | Minutes | 5 | 25 |
Modify the query in Exercise 13.12 to include ExerciseCategory.Name
. Order the output by ExerciseCategory.Name
, then Exercise.Name
, and then Unit.Name
.
Here are the sample results:
CategoryName | ExerciseName | UnitName | MinValue | MaxValue |
---|---|---|---|---|
Biking | Street ride | Miles | 5 | 40 |
Biking | Trail ride | Miles | 5 | 40 |
Select the minimum and maximum age in years for each Level. To calculate age in years, use the MySQL function DATEDIFF
. (Do online research to see how this function works. One location online you can review is dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html
.)
Here are the sample results:
LevelName | MinAge | MaxAge |
---|---|---|
Beginner | 15.0466 | 94.2110 |
Intermediate | 14.0329 | 95.2575 |