Here we continue our discussion of using statistics to analyze data with several additional descriptive statistics, including:
mean—the average value in a set of values.
median—the middle value when all the values are arranged in sorted order.
mode—the most frequently occurring value.
These are measures of central tendency—each is a way of producing a single value that represents a “central” value in a set of values, i.e., a value which is in some sense typical of the others.
Let’s calculate the mean, median and mode on a list of integers. The following session creates a list called grades, then uses the built-in sum and len functions to calculate the mean “by hand”: sum calculates the total of the grades (397) and len returns the number of grades (5):
In [1]: grades = [85, 93, 45, 89, 85]
In [2]: sum(grades) / len(grades)
Out[2]: 79.4
The previous chapter mentioned the descriptive statistics count and sum, implemented in Python as the built-in functions len and sum. Like functions min and max (introduced in the preceding chapter), sum and len are both examples of functional-style programming reductions: each reduces a collection of values to a single value (the sum of those values and the number of values, respectively). In Fig. 3.1’s class-average example, we could have deleted lines 10–15 and replaced average in line 16 with snippet [2]’s calculation.
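To make the idea of a reduction concrete, here is a minimal sketch that rebuilds the total, the count and the maximum with functools.reduce from the Python Standard Library. The lambdas and the variable names total, count and largest are illustrative choices, not part of the original session; in practice the built-in sum, len and max functions are the idiomatic way to do this.

```python
from functools import reduce

grades = [85, 93, 45, 89, 85]

# Sum: fold the list into a running total, starting from 0.
total = reduce(lambda running, grade: running + grade, grades, 0)

# Count: add 1 for every element, ignoring the element's value.
count = reduce(lambda running, _: running + 1, grades, 0)

# Maximum: keep the larger of the best value so far and the next grade.
largest = reduce(lambda best, grade: grade if grade > best else best, grades)

# These match sum(grades), len(grades) and max(grades).
print(total, count, largest)
print(total / count)  # the mean, as in snippet [2]
```

Each call collapses the whole list into one value, which is what makes it a reduction.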
The Python Standard Library’s statistics module provides functions for calculating the mean, median and mode; these, too, are reductions. To use these capabilities, first import the statistics module:
In [3]: import statistics
Then, you can access the module’s functions with “statistics.” followed by the name of the function to call. The following calculates the grades list’s mean, median and mode, using the statistics module’s mean, median and mode functions:
In [4]: statistics.mean(grades)
Out[4]: 79.4
In [5]: statistics.median(grades)
Out[5]: 85
In [6]: statistics.mode(grades)
Out[6]: 85
Each function’s argument must be an iterable, in this case the list grades. To confirm that the median and mode are correct, you can use the built-in sorted function to get a copy of grades with its values arranged in increasing order:
In [7]: sorted(grades)
Out[7]: [45, 85, 85, 89, 93]
The grades list has an odd number of values (5), so median returns the middle value (85). If the list’s number of values is even, median returns the average of the two middle values. Studying the sorted values, you can see that 85 is the mode because it occurs most frequently (twice). In Python versions before 3.8, the mode function raises a StatisticsError for lists like
[85, 93, 45, 89, 85, 93]
in which there are two or more “most frequent” values. As of Python 3.8, mode instead returns the first such value it encounters (85 here), and the statistics module’s multimode function returns a list of all of them. A set of values with two most frequent values is said to be bimodal. Here, both 85 and 93 occur twice. We’ll say more about mean, median and mode in the Intro to Data Science exercises at the end of the chapter.
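The median and mode rules described above can also be sketched by hand. The following is a minimal illustration, assuming a non-empty list of numbers; the function names median_by_hand and all_modes are hypothetical helpers introduced here and are not part of the statistics module.

```python
from collections import Counter


def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:  # odd count: a single middle value
        return ordered[middle]
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


def all_modes(values):
    """Return every value that ties for the highest frequency."""
    counts = Counter(values)  # maps each value to its number of occurrences
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


grades = [85, 93, 45, 89, 85]
two_modes = [85, 93, 45, 89, 85, 93]

print(median_by_hand(grades))     # odd number of values
print(median_by_hand(two_modes))  # even number of values
print(all_modes(grades))
print(all_modes(two_modes))       # both 85 and 93 occur twice
```

Returning a list from all_modes sidesteps the question of what to do when several values tie, at the cost of a slightly less convenient result when there is only one mode.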
(Fill-In) The ___________ statistic indicates the average value in a set of values.
Answer: mean.
(Fill-In) The ___________ statistic indicates the most frequently occurring value in a set of values.
Answer: mode.
(Fill-In) The ___________ statistic indicates the middle value in a set of values.
Answer: median.
(IPython Session) For the values 47, 95, 88, 73, 88 and 84, use the statistics module to calculate the mean, median and mode.
Answer:
In [1]: import statistics
In [2]: values = [47, 95, 88, 73, 88, 84]
In [3]: statistics.mean(values)
Out[3]: 79.16666666666667
In [4]: statistics.median(values)
Out[4]: 86.0
In [5]: statistics.mode(values)
Out[5]: 88