MICROSOFT Excel 2010 provides a wide range of features that can help you perform statistical analysis of your data. A number of functions that assist in simple analysis tasks, such as AVERAGE, MEDIAN, and MODE, are built into the program. If the built-in statistical functions aren’t enough, you can turn to the Analysis Toolpak, an add-in that provides a collection of tools that augment the built-in analytical capabilities of Excel 2010. You can use these tools to create histograms and rank-and-percentile tables, extract samples from a data set, perform regression analysis, generate special random-number sets, apply Fourier and other transformations to your data, and more. In this chapter, we explore the most important statistical analysis functions that are built into Excel, as well as those included with the Analysis Toolpak.

Analyzing Distributions of Data

In statistics, a collection of measurements is called a distribution. Excel has several methods you can use to analyze distributions: built-in statistical functions, the sample and population statistical functions, and the rank-and-percentile functions together with the Rank And Percentile tool.

Note

You can also analyze distributions using the Descriptive Statistics and Histogram tools, both of which are included in the Analysis Toolpak add-in. For more information, see Using the Analysis Toolpak Data Analysis Tools on page 616.

Using Built-In Statistical Functions

You use the built-in statistical functions to analyze a group (or population) of measurements. In the following sections, the discussion is limited to the most commonly used statistical functions. To quickly access these functions, click the More Functions button on the Formulas tab on the ribbon, and then click Statistical to display a menu of statistical functions.

The AVERAGEIF function takes the arguments (range, criteria, average_range) where range represents the cells to average; criteria is a number, expression, cell reference, or text used to select the cells within range to include; and average_range is an optional argument specifying an alternate range of cells to evaluate. For example, the formula =AVERAGEIF(sales,“>20”,A2:C30) averages all cells in the range A2:C30 with values greater than 20. If the third argument is omitted, the formula performs the same operation on the named range sales instead. The average_range criterion becomes useful when you want to select cells on the basis of the contents of one row or column and actually perform the calculation on numbers contained in adjacent rows or columns.

The AVERAGEIFS function is similar to the AVERAGEIF function, but it allows you to specify multiple criteria. This function takes the arguments (average_range, criteria_range1, criteria1, criteria_range2, criteria2, …) where average_range specifies the cell range you want to average, criteria_range specifies the cells containing the values you want to compare, and criteria is a value, expression, cell reference, or text defining the cells within average_range that you want to include. You can add up to 127 sets of criteria_range and criteria arguments. This function would be useful for finding the average sale prices of particular automobile models on an online auction site, where you limit the results to include specific features such as number of doors, transmission type, interior options, etc.

The MEDIAN, MODE.SNGL, MODE.MULT, MAX, MIN, and COUNT Functions

These functions all take the same arguments, essentially just a cell range or a list of numbers separated by commas, such as (number1, number2, …). They can accept up to 255 arguments and ignore text, error values, and logical values. Here’s a brief description of each:

MEDIAN Computes the median of a set of numbers. The median is the number in the middle of the set; that is, an equal number of values are higher and lower than the median. If the numbers specified include an even number of values, the value returned is the average of the two that lie in the middle of the set.
MODE.SNGL Determines which value occurs most frequently in a set of numbers. If no number occurs more than once, MODE.SNGL returns the #N/A error value.
MODE.MULT Determines which values occur most frequently in a set of numbers and returns a vertical array of them. You must select enough cells to contain the results before entering the formula as an array. For more information, see Using Arrays on page 512.
MAX Returns the largest value in a range.
MIN Returns the smallest value in a range.
COUNT Tells you how many cells in a given range contain numbers, including dates and formulas that evaluate to numbers.

The PERCENTRANK.INC and PERCENTRANK.EXC functions return a ranking for any item (aka member) of a data set as a percentage. The .INC (inclusive) form of this function includes the entire data set, and the .EXC (exclusive) form eliminates rankings of 0% and 100%. We used PERCENTRANK.INC to create the percentile ranking in column E in Figure 17-1. These functions are meant to replace the old PERCENTRANK function, which you can still use and is equivalent to the .INC form.

Note

You can find the SAT Scores.xlsx file with the other examples on the companion Web site.

Both forms of the PERCENTRANK function take the arguments (array, x, significance). The array argument specifies the input range (which is $D$2:$D$1001, in our example), and x specifies the value whose rank you want to obtain. The significance argument, which is optional, indicates the number of digits of precision you want; if this argument is omitted, results are rounded to three digits (0.xxx or xx.x%).

Figure 17-1. The PERCENTRANK.INC function determines where a value stands in a population.

The PERCENTILE and QUARTILE Functions

You use the PERCENTILE.INC and PERCENTILE.EXC functions to find the member of a data set that stands at a specified percentile rank; they both take the arguments (array, k). The .INC (inclusive) form of this function includes the entire data set, and the .EXC (exclusive) form eliminates rankings of 0% and 100%. You must express the percentile k as a decimal fraction from 0 to 1. For example, to find out which score in the worksheet partially shown in Figure 17-1 represents the 86th percentile, you can use the formula =PERCENTILE.INC($D$2:$D$1001, 0.86).

The QUARTILE functions, which take the arguments (array, quart), work much like the PERCENTILE functions, except they return the values at the lowest, 25th, median, 75th, or highest percentile in the input set. The array argument specifies the input range. The quart argument specifies the value to be returned, as shown in Table 17-1.

Table 17-1. The Quart Argument

Quart	Returns
0	Lowest value
1	25th-percentile value
2	Median (50th-percentile) value
3	75th-percentile value
4	Highest value

Tip

INSIDE OUT Use MIN, MEDIAN, and MAX

QUARTILE is a powerful function, but if you don’t need the 25th or 75th percentile values, you will get faster results using other functions, particularly when working with large data sets. Use the MIN function instead of QUARTILE(array, 0), the MEDIAN function instead of QUARTILE(array, 2), and the MAX function instead of QUARTILE(array, 4).

The SMALL and LARGE Functions

The SMALL and LARGE functions return the kth smallest and kth largest values in an input range; both take the arguments (array, k), where k is the position from the largest or smallest value to the value in the array you want to find. For example, to find the 15th highest score in the worksheet partially shown in Figure 17-1, you can use the formula =LARGE($D$2:$D$1001, 15).

The RANK Functions

The RANK.AVG and RANK.EQ functions return the ranked position of a particular value within a set of values; both take the arguments (number, ref, order). If more than one value in the set has the same rank, the .AVG form of the function returns their average, while the .EQ form returns the higher value. The number argument is the number for which you want to find the rank, ref is the range containing the data set, and order (an optional argument) ranks the number as if it were in a ranking list in an ascending or descending (the default) order. For example, to find out where the score 1200 falls in the data set in Figure 17-1, you can use the formula =RANK.AVG(1200, $D$2:$D$1001).

Using Sample and Population Statistical Functions

Variance and standard deviation are statistical measurements of the dispersion of a group, or population, of values. The standard deviation is the square root of the variance. As a rule, about 68 percent of a normally distributed population falls within one standard deviation of the mean, and about 95 percent falls within two standard deviations. A large standard deviation indicates that the population is widely dispersed from the mean; a small standard deviation indicates that the population is tightly packed around the mean.

The VAR and STDEV function families compute the variance and standard deviation of the numbers in a range of cells. Before you make these calculations, you must determine whether those values represent the total population or only a representative sample of that population. The VAR.S and STDEV.S functions assume that the values represent only a sample of the total population, while the VAR.P and STDEV.P functions assume that the values represent the total population. The A versions—VARA, VARPA, STDEVA, and STDEVPA—include numeric text entries and logical values in their calculations, while the others do not.

Calculating Sample Statistics: VAR.S and STDEV.S

The VAR.S and STDEV.S functions compute variance and standard deviation, assuming that their arguments represent only a sample of the total population. These functions take the arguments (number1, number2, …) and accept up to 255 arguments. The worksheet in Figure 17-2 shows exam scores for five students and assumes that the scores in cells B4:E8 represent only a part of the total population.

Figure 17-2. Here, the VAR.S and STDEV.S functions measure the dispersion of sample exam scores.

Note

You can find the VAR.xlsx file with the other examples on the companion Web site.

The formula in cell I4 =VAR.S(B4:E8) calculates the variance for this sample group of test scores. The formula in cell I5 =STDEV.S(B4:E8) calculates the standard deviation.

Assuming that the test scores in the example are normally distributed, we can deduce that about 68 percent of the students (the general-rule percentage) achieved scores from 83.65 (the average 89.20 minus the standard deviation 5.55) to 94.75 (89.20 plus 5.55).

Calculating Total Population Statistics: VAR.P and STDEV.P

If the numbers you’re analyzing represent an entire population rather than a sample, use the VAR.P and STDEV.P functions to calculate variance and standard deviation. These functions take the arguments (number1, number2, …) and accept up to 255 arguments.

Assuming that cells B4:E8 in the worksheet shown in Figure 17-2 represent the total population, you can calculate the variance and standard deviation with the formulas =VAR.P(B4:E8) and =STDEV.P(B4:E8). The VAR.P function returns 29.26, and the STDEV.P function returns 5.41.

Note

The STDEV.S, STDEV.P, VAR.S, and VAR.P functions do not include text values or logical values in their calculations. If you want to include these values, use the A versions: STDEVA, STDEVPA, VARA, and VARPA. For more information, see The A Functions on page 604.

Chapter 17. Functions for Analyzing Statistics

Analyzing Distributions of Data

Note

Using Built-In Statistical Functions

Note

The AVERAGE Functions

Note

The MEDIAN, MODE.SNGL, MODE.MULT, MAX, MIN, and COUNT Functions

Note

Using Functions That Analyze Rank and Percentile

The PERCENTRANK Functions

Note

The PERCENTILE and QUARTILE Functions

Tip

The SMALL and LARGE Functions

The RANK Functions

Using Sample and Population Statistical Functions

Calculating Sample Statistics: VAR.S and STDEV.S

Note

Calculating Total Population Statistics: VAR.P and STDEV.P

Note