3.17 Intro to Data Science: Measures of Central Tendency—Mean, Median and Mode

Here we continue our discussion of using statistics to analyze data with several additional descriptive statistics, including:

mean—the average value in a set of values.
median—the middle value when all the values are arranged in sorted order.
mode—the most frequently occurring value.

These are measures of central tendency—each is a way of producing a single value that represents a “central” value in a set of values, i.e., a value which is in some sense typical of the others.

Let’s calculate the mean, median and mode on a list of integers. The following session creates a list called grades, then uses the built-in sum and len functions to calculate the mean “by hand”—sum calculates the total of the grades (397) and len returns the number of grades (5):

In [1]: grades = [85, 93, 45, 89, 85]

In [2]: sum(grades) / len(grades)
Out[2]: 79.4

The previous chapter mentioned the descriptive statistics count and sum, which Python implements as the built-in functions len and sum. Like functions min and max (introduced in the preceding chapter), sum and len are examples of functional-style programming reductions: each reduces a collection of values to a single value. sum produces the total of the values, and len produces the number of values. In Fig. 3.1’s class-average example, we could have deleted lines 10–15 and replaced average in line 16 with snippet [2]’s calculation.
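To make the idea of a reduction concrete, the following sketch rebuilds the total and the count with the standard library's functools.reduce function. The names total and count and the lambda parameter names are chosen here for illustration only. In practice, the built-in sum and len are the idiomatic choice.

```python
from functools import reduce

grades = [85, 93, 45, 89, 85]

# A reduction folds a collection into one value by repeatedly
# combining a running result with the next element.
total = reduce(lambda running, grade: running + grade, grades, 0)  # like sum(grades)
count = reduce(lambda running, _: running + 1, grades, 0)          # like len(grades)

print(total, count, total / count)  # 397 5 79.4
```

Each call starts from the initial value 0 and visits every grade exactly once, which is the pattern that sum, len, min and max all share.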

The Python Standard Library’s statistics module provides functions for calculating the mean, median and mode—these, too, are reductions. To use these capabilities, first import the statistics module:

In [3]: import statistics

Then, you can access the module’s functions with “statistics.” followed by the name of the function to call. The following calculates the grades list’s mean, median and mode, using the statistics module’s mean, median and mode functions:

In [4]: statistics.mean(grades)
Out[4]: 79.4

In [5]: statistics.median(grades)
Out[5]: 85

In [6]: statistics.mode(grades)
Out[6]: 85

Each function’s argument must be an iterable—in this case, the list grades. To confirm that the median and mode are correct, you can use the built-in sorted function to get a copy of grades with its values arranged in increasing order:

In [7]: sorted(grades)
Out[7]: [45, 85, 85, 89, 93]

The grades list has an odd number of values (5), so median returns the middle value (85). If the list’s number of values is even, median returns the average of the two middle values. Studying the sorted values, you can see that 85 is the mode because it occurs most frequently (twice). In Python versions before 3.8, the mode function raises a StatisticsError for lists like

[85, 93, 45, 89, 85, 93]

in which there are two or more “most frequent” values. Starting with Python 3.8, mode instead returns the first such value it encounters (85 here), and the statistics module’s multimode function returns a list of all of them. A set of values with two modes is said to be bimodal. Here, both 85 and 93 occur twice. We’ll say more about mean, median and mode in the Intro to Data Science exercises at the end of the chapter.
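As a rough check on the rules just described, the median and the set of most frequent values can also be computed "by hand." The helper names median_by_hand and modes_by_hand below are hypothetical, not part of the statistics module, and the sketch assumes a non-empty list of numbers.

```python
from collections import Counter

def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:      # odd count: a single middle value
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count

def modes_by_hand(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)       # maps each value to its frequency
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

grades = [85, 93, 45, 89, 85]
more_grades = [85, 93, 45, 89, 85, 93]

print(median_by_hand(grades))       # 85
print(median_by_hand(more_grades))  # 87.0, the average of 85 and 89
print(modes_by_hand(grades))        # [85]
print(modes_by_hand(more_grades))   # [85, 93]
```

For the six-value list, the sorted order is 45, 85, 85, 89, 93, 93, so the two middle values are 85 and 89 and the median is 87.0.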

Self Check

(Fill-In) The ___________ statistic indicates the average value in a set of values.
Answer: mean.
(Fill-In) The ___________ statistic indicates the most frequently occurring value in a set of values.
Answer: mode.
(Fill-In) The ___________ statistic indicates the middle value when a set of values is arranged in sorted order.
Answer: median.

(IPython Session) For the values 47, 95, 88, 73, 88 and 84, use the statistics module to calculate the mean, median and mode.
Answer:

In [1]: import statistics

In [2]: values = [47, 95, 88, 73, 88, 84]

In [3]: statistics.mean(values)
Out[3]: 79.16666666666667

In [4]: statistics.median(values)
Out[4]: 86.0

In [5]: statistics.mode(values)
Out[5]: 88