12.1 Measures of Central Tendency

Learning Objectives

After Chapter 12.1, you will be able to:

Calculate mean, median, and mode for a data set
Predict the best measure of central tendency for a given data set

Measures of central tendency are those that describe the middle of a sample. How we define middle can vary. Is it the mathematical average of the numbers in the data set? Is it the result in a data set that divides the set in two, with half the sample values above this result and half the sample values below? Both of these values can be important, and the difference between them can also provide useful information on the shape of a distribution.

Mean

The mean or average of a set of data (more accurately, the arithmetic mean) is calculated by adding up all of the individual values within the data set and dividing the result by the number of values:

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

where x_1 to x_n are the values of all of the data points in the set and n is the number of data points in the set. As we discussed in the last chapter, the mean may be a parameter or a statistic (as is true of all of the measures of central tendency) depending on whether we are discussing a population or a sample. Mean values are a good indicator of central tendency when all of the values tend to be fairly close to one another. Having an outlier (an extremely large or extremely small value compared to the other data values) can shift the mean toward one end of the range. For example, the average income in the United States is about $70,000, but half of the population makes less than $50,000. In this case, the small number of extremely high-income individuals in the distribution shifts the mean to the high end of the range.
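The effect of an outlier on the mean can be checked with a few lines of Python. The sketch below is only an illustration: it borrows the attendee ages from the worked example in this section, and the variable names are our own.

```python
# Ages taken from the worked example in this section; 36 is the outlier.
ages = [23, 22, 25, 22, 22, 24, 36, 20]

# Arithmetic mean: sum of the values divided by the number of values.
mean_all = sum(ages) / len(ages)

# Recompute the mean with the outlier removed to see how far it moves.
ages_without_outlier = [age for age in ages if age != 36]
mean_without_outlier = sum(ages_without_outlier) / len(ages_without_outlier)

print(mean_all)                        # 24.25
print(round(mean_without_outlier, 1))  # 22.6
```

A single value of 36 pulls the mean up by more than a year and a half, even though seven of the eight attendees are 25 or younger.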

Median

The median value for a set of data is its midpoint, where half of the data points are greater than the value and half are smaller. In data sets with an odd number of values, the median will actually be one of the data points. In data sets with an even number of values, the median will be the mean of the two central data points. To calculate the median, a data set must first be listed in increasing order. The position of the median can be calculated as follows:

\[ \text{median position} = \frac{n + 1}{2} \]

where n is the number of data values. In a data set with an even number of data points, this equation will solve for a noninteger number; for example, in a data set with 18 points, it will be \( \frac{18 + 1}{2} = 9.5 \). The median in this case will be the arithmetic mean of the ninth and tenth items in the data set when sorted in ascending order.
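The position rule translates directly into code. The following is a minimal sketch, assuming a nonempty list of numbers; the function name median_by_position is hypothetical and chosen only for this illustration.

```python
def median_by_position(values):
    """Find the median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)      # the data must be in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")

    position = (n + 1) / 2        # 1-based position of the median
    lower = ordered[int(position) - 1]
    if position.is_integer():
        # Odd n: the position lands exactly on one data point.
        return lower

    # Even n: the position falls between two points, so average them.
    upper = ordered[int(position)]
    return (lower + upper) / 2


# 18 points: position is 9.5, so average the ninth and tenth items.
print(median_by_position(list(range(1, 19))))  # 9.5
```

Python lists are indexed from zero, which is why the item at 1-based position p is read from index p - 1.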

The median tends to be the least susceptible to outliers, but may not be useful for data sets with very large ranges (the distance between the largest and smallest data point, as discussed later in this chapter) or multiple modes.

Example:

Using the same data from the last question, find the median age of the attendees. Comparing this value to the mean, is the median a better or worse indicator of central tendency in this sample?

Solution:

The first step in finding the median is to order the data from smallest to largest. Our original data was:

23, 22, 25, 22, 22, 24, 36, 20

Reordered, this becomes:

20, 22, 22, 22, 23, 24, 25, 36

n, the number of data points, is 8, so the median will be the average of the fourth and fifth data points. The median is therefore \( \frac{22 + 23}{2} = 22.5 \). The median is a better indicator of central tendency for this data than the mean of 24.25. The median is unaffected by the outlier and lies close to most of the values in the data set. One could improve the representativeness of the mean by excluding 36 from the data set, in which case the mean would be approximately 22.6 while the median would be 22.

If the mean and the median are far from each other, this implies the presence of outliers or a skewed distribution, as discussed later in this chapter. If the mean and median are very close, this suggests a roughly symmetrical distribution.
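As a rough illustration of this comparison, the sketch below computes the gap between mean and median for two small data sets. Both data sets are invented for this example, and the helper name mean_median_gap is hypothetical; the text does not define how large a gap counts as "far," so no cutoff is assumed here.

```python
from statistics import mean, median


def mean_median_gap(values):
    """Return mean minus median; a positive gap means the mean sits higher."""
    return mean(values) - median(values)


# Made-up data: one evenly spread set, one with a single large value.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]

gap_symmetric = mean_median_gap(symmetric)   # 3 - 3
gap_skewed = mean_median_gap(right_skewed)   # 12 - 3

print(gap_symmetric)  # 0
print(gap_skewed)     # 9
```

Both sets share a median of 3, but the single large value in the second set drags its mean to 12, which is the pattern that signals an outlier or skew.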

Key Concept

The median divides the data set into two groups with 50% of values higher than the median and 50% of values lower than it.

Mode

The mode, quite simply, is the number that appears the most often in a set of data. There may be multiple modes in a data set, or, if all numbers appear equally often, there can even be no mode for a data set. When we examine distributions, the peaks represent modes. The mode is not typically used as a measure of central tendency for a set of data, but the number of modes, and their distance from one another, are often informative. If a data set has two modes with a small number of values between them, it may be useful to analyze these portions separately or to look for other variables that may be responsible for dividing the distribution into two parts.
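Counting occurrences makes the three cases (one mode, several modes, no mode) easy to see. The sketch below is one possible way to do this, with a hypothetical function name modes. It follows this section's convention of reporting no mode when several distinct values all appear equally often, which is a choice made for this illustration rather than a universal rule.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if every value ties."""
    counts = Counter(values)
    if not counts:
        return []

    highest = max(counts.values())
    all_tied = all(count == highest for count in counts.values())
    if len(counts) > 1 and all_tied:
        # Several distinct values, each equally common: no mode.
        return []

    return sorted(value for value, count in counts.items() if count == highest)


print(modes([23, 22, 25, 22, 22, 24, 36, 20]))  # [22]
print(modes([1, 1, 2, 3, 3]))                   # [1, 3]
print(modes([1, 2, 3]))                         # []
```

The first call uses the attendee ages from the worked example; the other two lists are invented to show a two-mode set and a set with no mode.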