12.3 Measures of Distribution

Learning Objectives

After Chapter 12.3, you will be able to:

Identify outliers using interquartile range or standard deviation
Describe the relationship between range and standard deviation
Justify whether certain measures of distribution are or are not appropriate for a given situation

Distributions can be characterized not only by their “center” points, but also by the spread of their data. This information can be described in a number of ways. Range is an absolute measure of the spread of a data set, while interquartile range and standard deviation provide more information about the distance that data falls from one of our measures of central tendency. We can use these quantities to determine if a data point is truly an outlier in our data set.

Range

The range of a data set is the difference between its largest and smallest values:

range = x_max − x_min

Range does not consider the number of items in the data set, nor does it consider the placement of any measures of central tendency. Range is therefore heavily affected by the presence of outliers. When the standard deviation of a normal distribution cannot be calculated because the entire data set is not provided, it can be approximated as one-fourth of the range.
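
As a rough illustration of how strongly a single extreme value affects the range, the sketch below uses the party ages from the interquartile range example later in this section. The helper name data_range is our own, and the one-fourth rule is only an approximation that assumes roughly normal data.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

ages = [20, 22, 22, 22, 23, 24, 25, 36]   # ages from Ray's party
ages_without_36 = ages[:-1]               # same data with the 36-year-old removed

print(data_range(ages))             # 16
print(data_range(ages_without_36))  # 5

# Rule-of-thumb estimate of the standard deviation (assumes roughly normal data)
approx_sd = data_range(ages_without_36) / 4
print(approx_sd)                    # 1.25
```

Removing one value cuts the range from 16 to 5, which is why range alone is a fragile description of spread.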

Interquartile Range

Interquartile range is related to the median, first, and third quartiles. Quartiles, including the median (Q₂), divide data (when placed in ascending order) into groups that comprise one-fourth of the entire set. There is some debate over the most appropriate way to calculate quartiles; for the purposes of the MCAT, we will use the most common (and simplest) method:

To calculate the position of the first quartile (Q₁) in a set of data sorted in ascending order, multiply n by 1/4, where n is the number of data points.
If this is a whole number, the quartile is the mean of the value at this position and the value at the next highest position.
If this is a decimal, round up to the next whole number, and take the value at that position as the quartile.
To calculate the position of the third quartile (Q₃), multiply n by 3/4. Again, if this is a whole number, take the mean of the values at this position and the next. If it is a decimal, round up to the next whole number, and take the value at that position as the quartile.
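
The quartile rule above can be sketched in a few lines of Python. The function name quartile and the four-value data set are our own illustrations. This is only one of several quartile conventions, so statistical software may return slightly different values.

```python
import math

def quartile(sorted_values, fraction):
    """Quartile by the rule above; fraction is 0.25 for Q1 or 0.75 for Q3.

    Assumes sorted_values is non-empty and already in ascending order.
    """
    position = len(sorted_values) * fraction   # 1-based position in the data
    k = math.ceil(position)
    if position == k:
        # Whole number: mean of the value at this position and the next one
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    # Decimal: round up and take the value at that position
    return sorted_values[k - 1]

# Five values: 5 x 1/4 = 1.25 and 5 x 3/4 = 3.75 are decimals, so round up
odd_set = [1, 2, 3, 9, 10]
print(quartile(odd_set, 0.25))   # 2  (second value)
print(quartile(odd_set, 0.75))   # 9  (fourth value)

# Hypothetical four-value set: 4 x 1/4 = 1 and 4 x 3/4 = 3 are whole numbers
even_set = [2, 4, 6, 8]
print(quartile(even_set, 0.25))  # 3.0  (mean of first and second values)
print(quartile(even_set, 0.75))  # 7.0  (mean of third and fourth values)
```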

The interquartile range is then calculated by subtracting the value of the first quartile from the value of the third quartile:

IQR = Q₃ − Q₁

The interquartile range can be used to determine outliers. Any value that falls more than 1.5 interquartile ranges below the first quartile or above the third quartile is considered an outlier.

Key Concept

One definition of an outlier is any value more than 1.5 × IQR below Q₁ or more than 1.5 × IQR above Q₃.

Example:

Using the interquartile range, determine whether the 36-year-old from Ray’s party is an outlier. The ages are provided in numerical order below for convenience:

20, 22, 22, 22, 23, 24, 25, 36

Solution:

To determine whether this point is an outlier, we must first find the interquartile range, which requires the first and third quartiles. This data set contains eight values. Multiplying 8 by 1/4 gives us 2, so the first quartile is the mean of the second and third values in the ordered data set:

Q₁ = (22 + 22) / 2 = 22

Multiplying 8 by 3/4 gives us 6, so the third quartile is the mean of the sixth and seventh values in the ordered data set:

Q₃ = (24 + 25) / 2 = 24.5

The interquartile range is the difference between these:

IQR = Q₃ − Q₁ = 24.5 − 22 = 2.5

Outliers are data values more than 1.5 interquartile ranges below Q₁ or above Q₃. Thus, any value above 24.5 + 1.5 × 2.5 = 24.5 + 3.75 = 28.25 or below 22 − 1.5 × 2.5 = 22 − 3.75 = 18.25 will be an outlier. 36 is well above 28.25, so it is an outlier in this data set.
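
The same check can be written as a short Python sketch. The names quartile and iqr_outliers are our own, and the quartile helper follows the simple rule used in this section rather than any particular software package's method.

```python
import math

def quartile(sorted_values, fraction):
    """Quartile by this section's rule (0.25 for Q1, 0.75 for Q3)."""
    position = len(sorted_values) * fraction
    k = math.ceil(position)
    if position == k:  # whole number: average this value and the next
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    return sorted_values[k - 1]  # decimal: round up and take that value

def iqr_outliers(values):
    """Return the lower cutoff, upper cutoff, and any values outside them."""
    data = sorted(values)
    q1 = quartile(data, 0.25)
    q3 = quartile(data, 0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper, [x for x in data if x < lower or x > upper]

ages = [20, 22, 22, 22, 23, 24, 25, 36]
lower, upper, outliers = iqr_outliers(ages)
print(lower, upper, outliers)  # 18.25 28.25 [36]
```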

Standard Deviation

Standard deviation is the most informative measure of distribution, but it is also the most mathematically laborious. It is calculated relative to the mean of the data: take the difference between each data point and the mean, square each difference, divide the sum of these squared differences by the number of points in the data set minus one, and then take the square root of the result. Expressed mathematically:

σ = √[ Σ (x_i − x̄)² / (n − 1) ]

where σ is the standard deviation, x_i (for i = 1 to n) are the values of the data points in the set, x̄ is the mean, and n is the number of data points in the set. The use of n − 1 instead of n is mathematically (but not practically) important, and the reason for doing so is beyond the scope of the MCAT.

Example:

Calculate the standard deviation for the following data set:

1, 2, 3, 9, 10

Solution:

First, determine the value of the mean:

x̄ = (1 + 2 + 3 + 9 + 10) / 5 = 25 / 5 = 5

Then, find the difference between each data point and the mean, and square this value. This is a rather tedious project, but is best solved with the use of a table as seen below:

x_i	x_i − x̄	(x_i − x̄)²
1	−4	16
2	−3	9
3	−2	4
9	4	16
10	5	25

Now we can determine the standard deviation:

σ = √[(16 + 9 + 4 + 16 + 25) / (5 − 1)] = √(70 / 4) = √17.5 ≈ 4.18

Keep in mind that when calculating the mean, we use n as the denominator, but when calculating standard deviation, we use n − 1.
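
For readers who want to check the arithmetic, the steps of this example can be sketched in Python. The function name sample_std_dev is our own. The comparison with statistics.stdev from the standard library, which also divides by n − 1, is included only as a sanity check.

```python
import math
import statistics

def sample_std_dev(values):
    """Standard deviation with n - 1 in the denominator, as in this section."""
    n = len(values)
    mean = sum(values) / n                       # the mean uses n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_diffs) / (n - 1))

data = [1, 2, 3, 9, 10]
print(round(sample_std_dev(data), 2))                              # 4.18
print(math.isclose(sample_std_dev(data), statistics.stdev(data)))  # True
```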

The standard deviation can also be used to determine whether a data point is an outlier. If the data point falls more than three standard deviations from the mean, it is considered an outlier. The standard deviation relates to the normal distribution as well. On a normal distribution, approximately 68% of data points fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% fall within three standard deviations, as shown in Figure 12.1 earlier. Integration or specialized software can be used to determine percentages falling within other intervals.
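
As one example of using software for other intervals, the sketch below relies on the error function in Python's standard math module. The function name fraction_within is our own, and the calculation assumes the data follow an exactly normal distribution.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7

# Any other interval works the same way, e.g. 1.5 standard deviations
print(round(100 * fraction_within(1.5), 1))  # 86.6
```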

Key Concept

Another definition of outlier is any value that lies more than three standard deviations from the mean.
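
To see how this definition behaves on a small data set, the sketch below applies it to the party ages from the interquartile range example. The function name three_sd_outliers is our own. Note that the two definitions of an outlier need not agree: here the 36-year-old is not flagged, because the extreme value itself inflates the standard deviation.

```python
import statistics

def three_sd_outliers(values):
    """Values lying more than three standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [x for x in values if abs(x - mean) > 3 * sd]

ages = [20, 22, 22, 22, 23, 24, 25, 36]
print(statistics.mean(ages))             # 24.25
print(round(statistics.stdev(ages), 2))  # 4.98
print(three_sd_outliers(ages))           # []
```

The 36-year-old is 11.75 years above the mean, which is less than 3 × 4.98 ≈ 14.9, so this rule does not classify the value as an outlier even though the interquartile range rule does.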

Outliers

While we have already discussed methods for determining if a data point is an outlier, it is useful to know how to approach data with outliers. Outliers typically result from one of three causes:

A true statistical anomaly (e.g., a person who is over seven feet tall).
A measurement error (e.g., reading the centimeter side of a tape measure instead of the inch side).
A distribution that is not approximated by the normal distribution (e.g., a skewed distribution with a long tail).

When an outlier is found, it should trigger an investigation to determine which of these three causes applies. If there is a measurement error, the data point should be excluded from analysis. However, the other two situations are less clear.

MCAT Expertise

The existence of outliers is key for determining whether or not the mean is an appropriate measure of central tendency. It may also indicate a measurement error on the part of the investigators.

If an outlier is the result of a true measurement, but is not representative of the population, it may be weighted to reflect its rarity, included normally, or excluded from the analysis depending on the purpose of the study and preselected protocols. The decision should be made before a study begins—not once an outlier has been found. When outliers are an indication that a data set may not approximate the normal distribution, repeated samples or larger samples will generally demonstrate if this is true.