Databases will always include numbers: there’s always something to value, count, or calculate. And you may want to round the results from those numbers to conform to personal preferences. There are numeric and arithmetic functions to do these things in MySQL and MariaDB. Some are known as aggregate functions. We will cover almost all of the aggregate and many numeric functions in this chapter. We won’t cover the more advanced functions related to statistics, or the mathematical functions related to calculus and geometry. Instead, we will cover the most useful and most used functions and leave the others for you to learn later on your own as you need them.
Statistics can provide us with useful information about a database. If a database includes information about the activities of an organization, we can determine some statistical information about those activities. If a database includes numeric values associated with items an organization sells or tracks, statistics can provide us with information for making decisions about those items.
In our birdwatchers
database, we can use aggregate
functions to understand the behavior of our members in relation to our
bird-watchers website, the events they attend, and other activities. For
our rookery
database, we can ascertain some information about
birds using aggregate functions. That can be useful to our members related
to searching for birds in the wild, as well as their concerns for the
well-being of birds. We can ascertain information about where birds are
seen in the wild by our members.
In this section, we will look at aggregate functions that will help
us to determine this kind of information. In order to aggregate data
together to calculate statistical values of sorts, we sometimes must use
the GROUP BY
clause. Some of the aggregate
functions, such as the COUNT()
function we’ve used in
earlier chapters for counting rows in a table, do not require this clause,
at least under certain conditions. We’ll start with COUNT()
and then look at functions for simple statistics, such as determining an
average for a set of numbers.
One of the simplest calculations we can do is to count. We learn it as
children as an introduction to mathematics. So let’s start with
counting, the COUNT()
function.
Suppose we want to know how many birds are in the
birds
table. To do that, enter the folowing in
mysql
:
SELECT COUNT(*)
FROM birds;
+----------+
| COUNT(*) |
+----------+
| 28891 |
+----------+
Notice that we didn’t have to include the GROUP BY
clause for this simple SQL statement. That’s because we wanted MySQL to
count all of the rows in the table. We didn’t need GROUP BY
because we didn’t want it to separate the rows into separate
groups—there’s just one group here. Notice also that we’re giving
COUNT()
an asterisk as the argument. That’s a wildcard to
tell MySQL that we want to count all of the rows found. Because we don’t
have a WHERE
clause, all of the rows will be
selected.
Many of the bird species lack common names. So the
common_name
column in birds
is blank for these
species. COUNT()
has a special convention: if you pass a
column name instead of an asterisk as its argument, it counts only the
columns that are not NULL. Let’s change that data and then see how it might
look. Enter these two SQL statements:
UPDATE birds
SET common_name = NULL
WHERE common_name = '';
SELECT COUNT(common_name)
FROM birds;
+--------------------+
| COUNT(common_name) |
+--------------------+
| 9553 |
+--------------------+
That’s the number of birds with a common name in the table. We
could have gotten the same results with a WHERE
clause, and
without having modified the data as we did. This lets us select only
rows where the common_name
does not equal ''
.
We’ve changed those values to NULL, though, so let’s use the
WHERE
clause to see how that would look based on NULL
values. Enter the following:
SELECT COUNT(*) FROM birds
WHERE common_name IS NULL;
+----------+
| COUNT(*) |
+----------+
| 19338 |
+----------+
This gave us a different number. That’s because we’re counting the
rows where the common_name
is NULL—we used the operator IS NULL
. Before, we counted the rows
where the common_name
was not NULL. We can count those with
the WHERE
clause like so:
SELECT COUNT(*) FROM birds
WHERE common_name IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
| 9553 |
+----------+
That’s the answer we got before. It just required us to use
the IS NOT NULL
operator.
As useful as all of this may be, let’s get some more interesting
results. Let’s count the number of birds within each family of birds. To
do that, we have to use the GROUP BY
clause. We’ll enter
the following to get a count of the number of birds in each
family:
SELECT COUNT(*)
FROM birds
GROUP BY family_id;
+----------+
| COUNT(*) |
+----------+
| 5 |
| 6 |
| 248 |
| 119 |
| 168 |
| 39 |
| 223 |
| ... |
+----------+
227 rows in set (0.15 sec)
In this example, we told MySQL to GROUP BY
the
family_id
. So it sorted the rows by the
family_id
and counted the number of rows for each group.
Because the results here would take up 227 rows, I’ve removed some of
the results to save space. This SQL statement did what we asked, but
it’s not very useful or interesting. It would be better to get the name
of the bird families to go with these counts. To do this, we’ll have to
use a JOIN
to include the bird_families
table.
Here’s how we would do that:
SELECT bird_families.scientific_name AS 'Bird Family',
COUNT(*) AS 'Number of Species'
FROM birds JOIN bird_families USING(family_id)
GROUP BY birds.family_id;
+--------------------+-------------------+
| Bird Family | Number of Species |
+--------------------+-------------------+
| Gaviidae | 6 |
| Anatidae | 248 |
| Charadriidae | 119 |
| Laridae | 168 |
| Sternidae | 39 |
| Caprimulgidae | 223 |
| Sittidae | 92 |
| ... | |
+--------------------+-------------------+
225 rows in set (0.17 sec)
That’s nicer looking, and the results are more interesting. I’ve
shortened the results again, but notice that we now have only 225 rows.
That’s because we have some rows in the birds
table in
which the family_id
is NULL. When using a database, watch
for discrepancies like this; don’t ignore them just because you weren’t
looking for problems. They can help you catch problems you
overlooked.
Let’s modify the SELECT
statement to show the number
of rows in birds
that do not have matching values in
bird_families
. We’ll do this with a LEFT JOIN
(covered in Joining Tables, which included examples, but
let’s apply that concept again here):
SELECT bird_families.scientific_name AS 'Bird Family',
COUNT(*) AS 'Number of Species'
FROM birds LEFT JOIN bird_families USING(family_id)
GROUP BY birds.family_id;
+--------------------+-------------------+
| Bird Family | Number of Species |
+--------------------+-------------------+
| NULL | 4 |
| NULL | 1 |
| Gaviidae | 6 |
| Anatidae | 248 |
| Charadriidae | 119 |
| Laridae | 168 |
| Sternidae | 39 |
| Caprimulgidae | 223 |
| Sittidae | 92 |
| ... | |
+--------------------+-------------------+
225 rows in set (0.17 sec)
Some of these rows may have a family_id
of NULL, and
one may have a family_id
not contained in
bird_families
. To resolve this problem, we would run a
SELECT
to list rows where the bird_id
is not
included in bird_families
. But this is getting away from
learning about aggregate functions. Let’s assume that we’ve found the
rows with missing data and fixed them so that we can move on.
In the results for the last two examples, you may have noticed
that the names of the bird families are not listed alphabetically.
That’s because GROUP BY
orders rows based on the columns by
which it is grouping (i.e., family_id
). If we want to order
the results based on the family name, the scientific_name
in the bird_families
table, we’d have to change the
GROUP BY
clause to group by that column. Try entering
this:
SELECT bird_families.scientific_name AS 'Bird Family',
COUNT(*) AS 'Number of Species'
FROM birds LEFT JOIN bird_families USING(family_id)
GROUP BY bird_families.scientific_name;
+--------------------+-------------------+
| Bird Family | Number of Species |
+--------------------+-------------------+
| Acanthisittidae | 9 |
| Acanthizidae | 238 |
| Accipitridae | 481 |
| Acrocephalidae | 122 |
| Aegithalidae | 49 |
| Aegithinidae | 20 |
| Aegothelidae | 21 |
| Alaudidae | 447 |
| ... | |
+--------------------+-------------------+
That’s better. What would be nicer is if those results also showed
the total number of birds at the bottom. We can get that from a separate
SQL statement, but to get the total in the same results set, we would
add WITH ROLLUP
to the GROUP BY
clause like
so:
SELECT bird_families.scientific_name AS 'Bird Family',
COUNT(*) AS 'Number of Species'
FROM birds JOIN bird_families USING(family_id)
GROUP BY bird_families.scientific_name WITH ROLLUP;
+--------------------+-------------------+ | Bird Family | Number of Species | +--------------------+-------------------+ | Acanthisittidae | 9 | | Acanthizidae | 238 | | Accipitridae | 481 | | Acrocephalidae | 122 | | Aegithalidae | 49 | | Aegithinidae | 20 | | Aegothelidae | 21 | | Alaudidae | 447 | | ... | | | NULL | 28891 | +--------------------+-------------------+
The total is on the last line and is equal to the count we did in
the first example of this section. In the results here, the NULL value
for the first field doesn’t refer to rows that don’t have a value for
family_id
. Instead, this is the total line. MySQL just
doesn’t have a value to put in that field as a label, so it uses NULL.
We can tweak that, though, to give it a label. While we’re doing that,
let’s include counts by orders of birds. Enter the following:
SELECT IFNULL( bird_orders.scientific_name, '') AS 'Bird Order',
IFNULL( bird_families.scientific_name, 'Total:') AS 'Bird Family',
COUNT(*) AS 'Number of Species'
FROM birds
JOIN bird_families USING(family_id)
JOIN bird_orders USING(order_id)
GROUP BY bird_orders.scientific_name, bird_families.scientific_name
WITH ROLLUP;
+---------------------+--------------------+-------------------+
| Bird Order | Bird Family | Number of Species |
+---------------------+--------------------+-------------------+
| Anseriformes | Anhimidae | 3 |
| Anseriformes | Total: | 3 |
| Apodiformes | Apodidae | 316 |
| Apodiformes | Hemiprocnidae | 16 |
| Apodiformes | Trochilidae | 809 |
| Apodiformes | Total: | 1141 |
| Caprimulgiformes | Aegothelidae | 21 |
| Caprimulgiformes | Caprimulgidae | 224 |
| Caprimulgiformes | Nyctibiidae | 17 |
| Caprimulgiformes | Podargidae | 26 |
| ... | | |
| | Total: | 28890 |
+---------------------+--------------------+-------------------+
Besides adding another field to get the number of birds within an
order of birds, we used the IFNULL()
function to wrap the fields for the bird order counts and family
counts. This function tells MySQL that if the value for the field will
be NULL, it should be replaced with the value or string given—else it
should return the count. Because the statement calculates first the
primary totals (i.e., the totals for each family of birds), and then
calculates the secondary totals (i.e., the totals for each order of
birds), this works.
The results in the previous example aren’t marvelous, but you can easily use this method in conjunction with a script that will display these results on a web page. You can use an API to check for a value of Total: in the second field and then adjust for that. You could instead do these simple calculations in an API script, rather than have MySQL do them. However, sometimes it’s better to do calculations at the database system level. I have found often that better SQL statements make for tighter and easier to maintain API scripts. All right; enough of that. Let’s move on to more aggregate functions, besides just counting the number of rows.
In Chapter 11, we created a new table,
bird_identification_tests
, for recording fun tests members
could do online to try their skills at identifying birds. Suppose we
want to tell the member how long it takes them on average to identify
birds. A simple calculation would be to get the total time elapsed
(i.e., subtracting id_end
from id_start
) for
each row and then adding those differences together to get the sum of
all rows. We would then divide that sum by the number of rows. To get
the sum, we can use the SUM()
function.
Before we jump too far ahead, though, let’s look at some entries
for one of the members to remember and know what to do. We’ll use
the TIMEDIFF()
function to determine the
difference between the time the test started and when it ended (covered
in the section Comparing Dates and Times). Enter the
following:
SELECT common_name AS 'Bird',
TIME_TO_SEC( TIMEDIFF(id_end, id_start) )
AS 'Seconds to Identify'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
WHERE name_first = 'Ricky' AND name_last = 'Adams';
+--------------------+---------------------+
| Bird | Seconds to Identify |
+--------------------+---------------------+
| Crested Shelduck | 81 |
| Moluccan Scrubfowl | 174 |
| Indian Pond-Heron | 181 |
+--------------------+---------------------+
Because we need the total number of seconds for each test in order
to add the values together to get to an average, we used TIME_TO_SEC()
to convert the results
from TIMEDIFF()
(e.g., to convert from 121, for 1 minute
and 21 seconds, to 81 seconds). We did this extra step just to see how
these values come more easily together with SUM()
and to
better understand the time functions in the following SQL
statement:
SELECT CONCAT(name_first, SPACE(1), name_last)
AS 'Birdwatcher',
SUM(TIME_TO_SEC( TIMEDIFF(id_end, id_start) ) )
AS 'Total Seconds for Identifications'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
WHERE name_first = 'Ricky' AND name_last = 'Adams';
+-------------+-----------------------------------+
| Birdwatcher | Total Seconds for Identifications |
+-------------+-----------------------------------+
| Ricky Adams | 436 |
+-------------+-----------------------------------+
That gives us the correct number of seconds that Ricky Adams spent
identifying three birds. Notice that this is another aggregate function
that doesn’t require the GROUP BY
clause. Now let’s change
the SQL statement to calculate the average time (e.g.,
426 seconds divided by 3
entries). To do this, we’ll use an absurdly complex and inefficient
method. We’ll create a subquery to get each value to calculate the
average. You don’t have to enter this one. Just look it over:
SELECT Identifications, Seconds,
(Seconds / Identifications) AS 'Avg. Seconds/Identification'
FROM
( SELECT human_id, COUNT(*) AS 'Identifications'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
WHERE name_first = 'Ricky' AND name_last = 'Adams')
AS row_count
JOIN
( SELECT human_id, CONCAT(name_first, SPACE(1), name_last)
AS 'Birdwatcher',
SUM(TIME_TO_SEC(TIMEDIFF(id_end, id_start)))
AS 'Seconds'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id) )
AS second_count
USING(human_id);
+-----------------+---------+-----------------------------+
| Identifications | Seconds | Avg. Seconds/Identification |
+-----------------+---------+-----------------------------+
| 3 | 436 | 145.3333 |
+-----------------+---------+-----------------------------+
That was a lot of work for something that should be simple—and it
can be. Let’s change that to use AVG()
:
SELECT CONCAT(name_first, SPACE(1), name_last)
AS 'Birdwatcher',
AVG( TIME_TO_SEC( TIMEDIFF(id_end, id_start)) )
AS 'Avg. Seconds per Identification'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
WHERE name_first = 'Ricky' AND name_last = 'Adams';
+-------------+---------------------------------+
| Birdwatcher | Avg. Seconds per Identification |
+-------------+---------------------------------+
| Ricky Adams | 145.3333 |
+-------------+---------------------------------+
That was much easier, and without any subqueries. If we remove the
WHERE
clause, we would get the average time for all of the
members. Let’s do that and change the formatting of the time to minutes
and seconds, not just the average of total seconds. We’ll use SEC_TO_TIME()
to do that, reversing the
results of TIME_TO_SEC()
now that we’ve calculated the
average. Enter this on your server:
SELECT CONCAT(name_first, SPACE(1), name_last)
AS 'Birdwatcher',
COUNT(*) AS 'Birds',
TIME_FORMAT(
SEC_TO_TIME(AVG( TIME_TO_SEC( TIMEDIFF(id_end, id_start)))),
'%i:%s' )
AS 'Avg. Time'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
GROUP BY human_id LIMIT 3;
+---------------+-------+-----------+
| Birdwatcher | Birds | Avg. Time |
+---------------+-------+-----------+
| Rusty Osborne | 2 | 01:59 |
| Lexi Hollar | 3 | 00:23 |
| Ricky Adams | 3 | 02:25 |
+---------------+-------+-----------+
This time we included more members—but limited the results to three—and include the number of birds that each member identified. We also formatted the average time better. We can see that Ricky Adams took much longer on average than Lexi Hollar. It may be that Lexi is quicker or that Ricky was distracted when he was identifying birds.
Because we used the LIMIT
clause, we can’t determine the
longest and quickest average times from these results. To know that, we
need to remove the LIMIT
and then make the SQL statement a
subquery of another in which we will add an ORDER BY
clause. Essentially, the inner
SELECT
returns a list with each bird-watcher and their
average time, which the outer SELECT
puts in the order we
want:
SELECT Birdwatcher, avg_time AS 'Avg. Time'
FROM
(SELECT CONCAT(name_first, SPACE(1), name_last) AS 'Birdwatcher',
COUNT(*) AS 'Birds',
TIME_FORMAT( SEC_TO_TIME( AVG(
TIME_TO_SEC( TIMEDIFF(id_end, id_start)))
),'%i:%s' ) AS 'avg_time'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
GROUP BY human_id) AS average_times
ORDER BY avg_time;
+-------------------+-----------+
| Birdwatcher | Avg. Time |
+-------------------+-----------+
| Lexi Hollar | 00:23 |
| Geoffrey Dyer | 00:25 |
| Katerina Smirnova | 00:48 |
| Rusty Osborne | 01:59 |
| Ricky Adams | 02:25 |
| Anahit Vanetsyan | 03:20 |
+-------------------+-----------+
Now we know that Lexi is the quickest and Anahit was the slowest.
We had to use a subquery because you can’t generally put a GROUP
BY
and an ORDER BY
clause in the same SQL statement.
You have to do what we did here instead.
If we don’t want to know the names of who had the minimum average
and who had the maximum, we could use the MAX()
and
MIN()
functions. Let’s redo the previous SQL statement to include those
aggregate functions. Try this on your server:
SELECT MIN(avg_time) AS 'Minimum Avg. Time',
MAX(avg_time) AS 'Maximum Avg. Time'
FROM humans
JOIN
(SELECT human_id, COUNT(*) AS 'Birds',
TIME_FORMAT(
SEC_TO_TIME( AVG(
TIME_TO_SEC( TIMEDIFF(id_end, id_start)))
), '%i:%s' ) AS 'avg_time'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
GROUP BY human_id ) AS average_times;
+-------------------+-------------------+ | Minimum Avg. Time | Maximum Avg. Time | +-------------------+-------------------+ | 00:23 | 03:20 | +-------------------+-------------------+
Comparing these results to the previous ones, we can see that they are correct. If we want to see the minimum and maximum time for each person, instead of the averages, we could do this:
SELECT CONCAT(name_first, SPACE(1), name_last) AS 'Birdwatcher',
TIME_FORMAT(SEC_TO_TIME(
MIN(TIME_TO_SEC( TIMEDIFF(id_end, id_start)))
),%i:%s' ) AS 'Minimum Time',
TIME_FORMAT(SEC_TO_TIME(
MAX(TIME_TO_SEC( TIMEDIFF(id_end, id_start)))
), '%i:%s' ) AS 'Maximum Time'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
GROUP BY Birdwatcher;
+-------------------+--------------+--------------+
| Birdwatcher | Minimum Time | Maximum Time |
+-------------------+--------------+--------------+
| Anahit Vanetsyan | 00:20 | 08:48 |
| Geoffrey Dyer | 00:09 | 00:42 |
| Katerina Smirnova | 00:22 | 01:02 |
| Lexi Hollar | 00:11 | 00:39 |
| Ricky Adams | 01:21 | 03:01 |
| Rusty Osborne | 01:50 | 02:08 |
+-------------------+--------------+--------------+
This shows an alphabetic list of members and each one’s minimum
and maximum time to identify a bird. Essentially, once you group items
by the bird-watcher, you can run aggregate functions such as
AVG()
and MAX()
on them. We removed the field counting the number of
identifications they made.
We could play with this more to see which birds take the longest
to identify and which take the least amount of time. We could mark ones
that are most difficult to identify for more advanced members. Some
members may have a low average time if it were not for one bird that was
particularly difficult to identify. For those entries, we could use the
aggregate functions for more advanced statistical calculations to remove
them, functions like STDDEV()
and
VARIANCE()
. As a beginner, you probably won’t need to know
them. Just know that they exist in case one day you do.
Before moving on, let’s look at one more example using
MIN()
and MAX()
, an example that uses values
other than time values. The bird_sightings
table contains
information on birds that our members saw in the field. It includes the
GPS coordinates where each bird was seen: the location_gps
column. This column contains two 11-digit numbers: the latitude and the
longitude on the globe. Because birds tend to migrate between north and
south, suppose we want to know the farthest north and south that birds
were seen. We could use SUBSTRING()
to extract the latitude, the
MAX()
function to determine which value is farthest north,
and MIN()
to determine which is the farthest south. We
would do this like so:
SELECT common_name AS 'Bird',
MAX(SUBSTRING(location_gps, 1, 11)) AS 'Furthest North',
MIN(SUBSTRING(location_gps, 1, 11)) AS 'Furthest South'
FROM birdwatchers.bird_sightings
JOIN rookery.birds USING(bird_id)
WHERE location_gps IS NOT NULL
GROUP BY bird_id LIMIT 3;
+-----------------+----------------+----------------+
| Bird | Furthest North | Furthest South |
+-----------------+----------------+----------------+
| Eskimo Curlew | 66.16051056 | 66.16051056 |
| Whimbrel | 30.29138551 | 30.29138551 |
| Eurasian Curlew | 51.70469364 | 42.69096856 |
+-----------------+----------------+----------------+
In these results, because there was only one sighting of the first two birds, the values for both fields are the same. But for the Eurasian Curlew, you can see that it shows the farthest north and south that the bird was seen by our members.
There is one more aggregate function that I want to cover before finishing
with them. The GROUP_CONCAT()
function is not used
much, but it can be handy for particular situations. It’s used to
concatenate together the values for a group into a comma-separated list.
Without it, you would need to do a subquery and use
CONCAT_WS()
to concatenate the results of a field.
To list the bird families for a particular order of birds, we
could issue a simple SELECT
statement. Now suppose we want
a list of bird orders and bird families together, but we want one of the
fields in the results to contain all of the bird families for each bird
order. That would be cumbersome to do without
GROUP_CONCAT()
. Let’s see what it can do for us, using
this supposition. Enter the following on your server:
SELECT bird_orders.scientific_name AS 'Bird Order',
GROUP_CONCAT(bird_families.scientific_name)
AS 'Bird Families in Order'
FROM rookery.bird_families
JOIN rookery.bird_orders USING(order_id)
WHERE bird_orders.scientific_name = 'Charadriiformes'
GROUP BY order_id \G
*************************** 1. row *************************** Bird Order: Charadriiformes Bird Families in Order: Charadriidae,Laridae,Sternidae,Burhinidae,Chionidae,Pluvianellidae, Dromadidae,Haematopodidae,Ibidorhynchidae,Recurvirostridae, Jacanidae,Scolopacidae,Turnicidae,Glareolidae,Pedionomidae, Thinocoridae,Rostratulidae,Stercorariidae,Alcidae
I limited the results to one particular family to save space here.
To get lists of orders for all families, just remove the
WHERE
clause:
SELECT
bird_orders
.
scientific_name
AS
'Bird Order'
,
GROUP_CONCAT
(
bird_families
.
scientific_name
SEPARATOR
', '
)
AS
'Bird Families in Order'
FROM
rookery
.
bird_families
JOIN
rookery
.
bird_orders
USING
(
order_id
)
GROUP
BY
order_id
\
G
If you tried that, you saw that the SEPARATOR
clause
of the GROUP_CONCAT()
added a comma and a space after each
family name.
Numeric functions are functions that change numbers in some way. They don’t do a calculation, per se. That would be arithmetic functions. Instead, they help you simplify the numeric result of a query. You might want to round a number up or down, or get the absolute value. These actions can be done easily with numeric functions. We’ll look at them in this section.
Computers are very precise, so when we ask them to do a calculation, they will sometimes return a number with many decimal places. That may not matter to you, especially if the number is not displayed and used just by other functions for processing, either now or later. However, as humans, we tend to be more comfortable with rounded numbers. We’re usually not as precise as computers. To that end, there are a few numeric functions that may be used for rounding.
In Dynamic Columns, we created some tables with dynamic columns in MariaDB. These included surveys of members about their bird-watching preferences. Let’s use those tables and the data they contain to test some numeric functions. If you didn’t create those survey tables or if you aren’t using MariaDB, you won’t be able to participate in these examples.
To start, let’s look at one of the SQL statements we used in that section. We’ll run it again, but with more data from my site:
SELECT IFNULL(COLUMN_GET(choices, answer AS CHAR), 'total')
AS 'Birding Site', COUNT(*) AS 'Votes'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer WITH ROLLUP;
+--------------+-------+
| Birding Site | Votes |
+--------------+-------+
| forest | 30 |
| shore | 42 |
| backyard | 14 |
| total | 86 |
+--------------+-------+
This shows us the number of votes from members for the types of
locations they prefer for watching birds. Let’s calculate the
percentages to go with these results. To do this, we need first to count
the number of votes for all of the choices. We could put that in a
subquery, but let’s keep it simpler by executing a SELECT
statement first to get that value. We’ll create a user-defined variable
in which to temporarily store that number. A user variable is temporary
and will last only for the current client session. It can be accessed
only by the user that creates it. You would use the SET
statement to create a user variable.
The variable name must start with @
, followed by the equals
sign, and then a value, an expression, or an SQL statement that will
determine the value of the user-defined variable. Let’s create one now
for our example. Enter the following on your MariaDB server:
SET @fav_site_total =
(SELECT COUNT(*)
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1);
SELECT @fav_site_total;
+-----------------+
| @fav_site_total |
+-----------------+
| 86 |
+-----------------+
Because I added plenty more rows to the
survey_answers
table, this result is now higher than
previously. You’ll see that the total is correct in the results of the
next example. Let’s use the variable we created as the denominator for
calculating the percentage of votes for each choice:
SELECT COLUMN_GET(choices, answer AS CHAR)
AS 'Birding Site',
COUNT(*) AS 'Votes',
(COUNT(*) / @fav_site_total) AS 'Percent'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer;
+--------------+-------+---------+
| Birding Site | Votes | Percent |
+--------------+-------+---------+
| forest | 30 | 0.3488 |
| shore | 42 | 0.4884 |
| backyard | 14 | 0.1628 |
+--------------+-------+---------+
In this example, we’re dividing the number of votes for each
choice by the variable containing the total number of votes. That gives
us numbers with four decimal places. Let’s change those numbers to read
as percentages by multiplying them by 100 and using the ROUND()
function to get rid of the
decimals. We’ll use CONCAT()
to paste a percent sign to the
end of the number:
SELECT COLUMN_GET(choices, answer AS CHAR)
AS 'Birding Site',
COUNT(*) AS 'Votes',
CONCAT( ROUND( (COUNT(*) / @fav_site_total) * 100), '%')
AS 'Percent'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer;
+--------------+-------+---------+
| Birding Site | Votes | Percent |
+--------------+-------+---------+
| forest | 30 | 35% |
| shore | 42 | 49% |
| backyard | 14 | 16% |
+--------------+-------+---------+
Notice that the ROUND()
function rounded the first
two numbers up and the last one down. That’s how rounding goes. Let’s
change the results to show one decimal place:
SELECT COLUMN_GET(choices, answer AS CHAR)
AS 'Birding Site',
COUNT(*) AS 'Votes',
CONCAT( ROUND( (COUNT(*) / @fav_site_total) * 100, 1), '%') AS 'Percent'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer;
+--------------+-------+---------+
| Birding Site | Votes | Percent |
+--------------+-------+---------+
| forest | 30 | 34.9% |
| shore | 42 | 48.8% |
| backyard | 14 | 16.3% |
+--------------+-------+---------+
The ROUND()
function rounded up and down to the
first decimal place based on the true value, which includes multiple
decimal places. Suppose we want to be conservative and round all values
down, or all values up. For that, we need other functions.
To round only down, use the FLOOR()
function. To round only up,
use the CEILING()
function. Let’s use the
previous example to see how we would round down the results:
SELECT COLUMN_GET(choices, answer AS CHAR)
AS 'Birding Site',
COUNT(*) AS 'Votes',
CONCAT( FLOOR( (COUNT(*) / @fav_site_total) * 100), '%')
AS 'Percent'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer;
+--------------+-------+---------+
| Birding Site | Votes | Percent |
+--------------+-------+---------+
| forest | 30 | 34% |
| shore | 42 | 48% |
| backyard | 14 | 16% |
+--------------+-------+---------+
In this example, we replaced ROUND()
with
FLOOR()
so that the results would be rounded down. The
FLOOR()
function doesn’t allow for specifying the number
of decimal places. Instead, it rounds down to the integer value.
If we want to round only up, we would use the
CEILING()
function like so:
SELECT COLUMN_GET(choices, answer AS CHAR)
AS 'Birding Site',
COUNT(*) AS 'Votes',
CONCAT( CEILING( (COUNT(*) / @fav_site_total) * 100), '%') AS 'Percent'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer;
+--------------+-------+---------+
| Birding Site | Votes | Percent |
+--------------+-------+---------+
| forest | 30 | 35% |
| shore | 42 | 49% |
| backyard | 14 | 17% |
+--------------+-------+---------+
That rounded up all of the values. If a value has no decimal places, it wouldn’t change the value.
If we don’t want to round a number up or down, but we just want to eliminate the extra
decimal places, we can use TRUNCATE()
. Let’s see how that looks
with the same SQL statement we’ve been modifying:
SELECT COLUMN_GET(choices, answer AS CHAR)
AS 'Birding Site',
COUNT(*) AS 'Votes',
CONCAT( TRUNCATE( (COUNT(*) / @fav_site_total) * 100, 1), '%')
AS 'Percent'
FROM survey_answers
JOIN survey_questions USING(question_id)
WHERE survey_id = 1
AND question_id = 1
GROUP BY answer;
+--------------+-------+---------+
| Birding Site | Votes | Percent |
+--------------+-------+---------+
| forest | 30 | 34.8% |
| shore | 42 | 48.8% |
| backyard | 14 | 16.2% |
+--------------+-------+---------+
As the name of the function implies, it truncated the value after the number of decimal places specified (i.e., 1 in this example).
Sometimes when we’re working with numbers in functions, we get them in the
wrong order and the result is a number with a negative sign. If we’re
trying to find only the difference between two numbers, we can
use ABS()
to return the absolute value, the
value without a negative sign. Absolute values are also important for
certain mathematical calculations.
We’ll try this function by using part of some examples from the
previous section, where we determined the total number of seconds each
member took to identify birds. This time we’ll just calculate a total
for all rows, not grouping by human_id
:
SELECT
SUM( TIME_TO_SEC( TIMEDIFF(id_start, id_end) ) )
AS 'Total Seconds for All',
ABS( SUM( TIME_TO_SEC( TIMEDIFF(id_start, id_end) ) ) )
AS 'Absolute Total'
FROM bird_identification_tests;
+-----------------------+----------------+
| Total Seconds for All | Absolute Total |
+-----------------------+----------------+
| -1689 | 1689 |
+-----------------------+----------------+
There’s not much to this function and example. The first field has
a negative sign because we put the id_start
before the
id_end
within TIMEDIFF()
. We could just reverse the order, but there will be
situations in which you won’t know which value will be greater than the
other. For this, you may need ABS()
.
In other situations, you want to know whether a value is positive
or negative. For this, you can use the SIGN()
function.
It returns a value of 1 if the argument given results in a
positive number, -1 if it results in a negative number, and 0 if it’s
given a value of zero.
As an example, let’s go back to our bird identification tests.
Suppose we want a list of birds that members identified in less time
than the average. We calculated the minimum average earlier in Calculating a Group of Values. We’ll reuse part of that SQL
statement, but save the results to a user-defined variable and use that
variable to compare each row in bird_identification_tests
so we can list only rows in which the time it took to identify the bird
was less than average. Set up that variable and test it by entering this
on your server:
SET @min_avg_time =
(SELECT MIN(avg_time) FROM
(SELECT AVG( TIME_TO_SEC( TIMEDIFF(id_end, id_start)))
AS 'avg_time'
FROM bird_identification_tests
GROUP BY human_id) AS average_times);
SELECT @min_avg_time;
+---------------+
| @min_avg_time |
+---------------+
| 23.6667 |
+---------------+
That’s about right. We had a value of 23 seconds before, but
that’s because we rounded it with TIME_FORMAT()
. This is
more accurate. Let’s use that variable now to do a
comparison using SIGN()
in the WHERE
clause. Enter this on your server:
SELECT CONCAT(name_first, SPACE(1), name_last)
AS 'Birdwatcher',
common_name AS 'Bird',
ROUND(@min_avg_time - TIME_TO_SEC( TIMEDIFF(id_end, id_start) ) )
AS 'Seconds Less than Average'
FROM bird_identification_tests
JOIN humans USING(human_id)
JOIN rookery.birds USING(bird_id)
WHERE SIGN( TIME_TO_SEC( TIMEDIFF(id_end, id_start) - @min_avg_time)) = -1;
+-------------------+----------------------+---------------------------+
| Birdwatcher | Bird Identified | Seconds Less than Average |
+-------------------+----------------------+---------------------------+
| Lexi Hollar | Blue Duck | 3 |
| Lexi Hollar | Trinidad Piping-Guan | 13 |
| Geoffrey Dyer | Javan Plover | 15 |
| Katerina Smirnova | Blue Duck | 2 |
| Anahit Vanetsyan | Great Crested Grebe | 4 |
+-------------------+----------------------+---------------------------+
The use of SIGN()
in the WHERE
clause
selects rows in which the member took less than the average time. That’s
a function that would be difficult to duplicate in MySQL by any other
method.
Although we didn’t cover all of the aggregate and numeric functions,
we covered most of them—including the ones that are used most often. We
primarily skipped the statistics functions. We didn’t cover many
arithmetic functions, but those are pretty straightforward (e.g., POWER(2, 8)
returns 2 to
the eighth power, or 256), or they’re specialized (e.g., PI()
returns π, or 3.141593). What’s
important is that you feel comfortable with aggregate functions and using
the GROUP BY
clause—you’ll use them often—and that you have a
firm grasp on the numeric functions covered in this chapter. Several other
numeric functions exist, in case you ever need them. If you want to learn
about these other functions, you can check the MySQL
documentation or the MariaDB
documentation.
Numeric functions are pretty easy, once you know what each does. You probably didn’t have any trouble following the sections on them in this chapter. Aggregate functions, though, can be a little bothersome. Therefore, while some exercises in this section require you to use numeric functions, most include aggregate functions. Some call for you to combine numeric and aggregate functions. These should help you to retain what you learned in this chapter. There aren’t many exercises for this chapter, though, so it shouldn’t take you much time to complete all of them.
Construct a simple SELECT
statement that counts the
number of rows in the birds
table where the
common_name
contains the word Least
. Execute
that to make sure you did it correctly. Next, modify that SQL
statement to count the rows in which the common_name
contains the word Great
. You’ll do this by using the
LIKE
operator in the WHERE
clause.
In Calculating a Group of Values, we covered
how to group columns for counting. Combine the two SQL statements you
constructed in the previous exercise and make one SQL statement using
GROUP BY
to produce one field in the results that shows
the number of birds with a common_name
containing
Least
and another field that shows the number of birds
that are Great
.
In some of the examples in this chapter (see Counting Values), we had MySQL count the number of birds in each species and in each bird family. For this exercise, you may want to refer to those examples.
Construct a SELECT
statement to query the
birds
table, with three fields in the results set: the
name of the bird species, the number of birds in that species, and the
percentage that species represents of the total number of species. Let
MySQL calculate the total number of species; don’t enter that value
manually in the SQL statement.
After you’ve successfully executed this SQL statement, modify the SQL statement using one of the numeric functions to round to one decimal place the field that contains the percentage value.
Do the previous exercise again, but this time create another
SELECT
statement that retrieves only the total number of
bird species. With the SET
statement, create a user
variable to store that value taken by MySQL from the
SELECT
statement. You may give any name you want for that
variable.
Now change the SELECT
statement you created in the
previous exercise, but use the variable you created for determining
the percentage of total birds in the table. Once you have it executed
correctly, exit the mysql
client and log back in.
Run the same SQL statement to create the user variable and the
second SQL statement for this exercise again. Notice the time it takes
to execute in the results statement. Then execute again the
SELECT
statement from the previous exercise, the one that
doesn’t use a user variable. Notice how long it took to execute
compared to the SELECT
statement that uses a user
variable.
In the humans
table, the
membership_expiration
column contains date values. Put
together a SELECT
statement in which you determine the
number of months after the date 2014-01-01 until each member’s
membership will expire. If you’re not sure how to do this, refer to
Comparing Dates and Times. Use the
SIGN()
function in the WHERE
clause to
determine whether a membership has expired. List only unexpired
memberships. This was covered in Eliminating Negative Numbers. Remember to use the
IF NOT NULL
operator in the WHERE
clause to
exclude those members who don’t have a paid membership (i.e., no
expiration date). Label the field as Months to
Expiration
.
Modify the SQL statement you created for the previous exercise.
Don’t exclude expired memberships this time—but still exclude those
without a paid membership. Use the CONCAT()
function to
append " - expired
" to the number of months remaining or
past due. Don’t append the text if the membership hasn’t expired.
You’ll have to use the IF()
function to test that for
the field containing the number of months. You’ll also have to use the
ABS()
function to remove the negative sign from the
value.
Based on the SQL statement you constructed in the previous
exercises, create a new one to determine the average number of months
until expiration for all paid members in one field, and the average
number of months past expiration, based on the date of 2014-01-01. You
will need to use the AVG()
function to calculate these
averages. Once you have that working, add fields to determine the
minimum and maximum number of months, using MIN()
and
MAX()
, and the GROUP BY
clause.