Chapter 17. Functions for Analyzing Statistics

Analyzing Distributions of Data

Understanding Linear and Exponential Regression

Using the Analysis Toolpak Data Analysis Tools

MICROSOFT Excel 2010 provides a wide range of features that can help you perform statistical analysis of your data. A number of functions that assist in simple analysis tasks, such as AVERAGE, MEDIAN, and MODE, are built into the program. If the built-in statistical functions aren’t enough, you can turn to the Analysis Toolpak, an add-in that provides a collection of tools that augment the built-in analytical capabilities of Excel 2010. You can use these tools to create histograms and rank-and-percentile tables, extract samples from a data set, perform regression analysis, generate special random-number sets, apply Fourier and other transformations to your data, and more. In this chapter, we explore the most important statistical analysis functions that are built into Excel, as well as those included with the Analysis Toolpak.

In statistics, a collection of measurements is called a distribution. Excel has several methods you can use to analyze distributions: built-in statistical functions, the sample and population statistical functions, and the rank-and-percentile functions together with the Rank And Percentile tool.

image with no caption

You use the built-in statistical functions to analyze a group (or population) of measurements. In the following sections, the discussion is limited to the most commonly used statistical functions. To quickly access these functions, click the More Functions button on the Formulas tab on the ribbon, and then click Statistical to display a menu of statistical functions.

The AVERAGE and AVERAGEA functions compute the arithmetic mean, or average, of the numbers in a range by summing a series of numeric values and then dividing the result by the number of values. These functions take the arguments (number1, number2, …) and can include up to 255 arguments. The AVERAGE function ignores blank cells and cells containing logical and text values, but the AVERAGEA function includes them. For example, to calculate the average of the values in cells B4 through B15, you could use the formula =(B4+B5+B6+B7+B8+B9+B10+B11+B12+B13+B14+B15)/12, but it’s obviously more efficient to use =AVERAGE(B4:B15).

The AVERAGEIF function takes the arguments (range, criteria, average_range) where range represents the cells to average; criteria is a number, expression, cell reference, or text used to select the cells within range to include; and average_range is an optional argument specifying an alternate range of cells to evaluate. For example, the formula =AVERAGEIF(sales,“>20”,A2:C30) averages all cells in the range A2:C30 with values greater than 20. If the third argument is omitted, the formula performs the same operation on the named range sales instead. The average_range criterion becomes useful when you want to select cells on the basis of the contents of one row or column and actually perform the calculation on numbers contained in adjacent rows or columns.

The AVERAGEIFS function is similar to the AVERAGEIF function, but it allows you to specify multiple criteria. This function takes the arguments (average_range, criteria_range1, criteria1, criteria_range2, criteria2, …) where average_range specifies the cell range you want to average, criteria_range specifies the cells containing the values you want to compare, and criteria is a value, expression, cell reference, or text defining the cells within average_range that you want to include. You can add up to 127 sets of criteria_range and criteria arguments. This function would be useful for finding the average sale prices of particular automobile models on an online auction site, where you limit the results to include specific features such as number of doors, transmission type, interior options, etc.

Excel includes several sets of functions that extract rank and percentile information from a set of input values: PERCENTRANK, PERCENTILE, QUARTILE, SMALL, LARGE, and RANK.

The PERCENTRANK.INC and PERCENTRANK.EXC functions return a ranking for any item (aka member) of a data set as a percentage. The .INC (inclusive) form of this function includes the entire data set, and the .EXC (exclusive) form eliminates rankings of 0% and 100%. We used PERCENTRANK.INC to create the percentile ranking in column E in Figure 17-1. These functions are meant to replace the old PERCENTRANK function, which you can still use and is equivalent to the .INC form.

Both forms of the PERCENTRANK function take the arguments (array, x, significance). The array argument specifies the input range (which is $D$2:$D$1001, in our example), and x specifies the value whose rank you want to obtain. The significance argument, which is optional, indicates the number of digits of precision you want; if this argument is omitted, results are rounded to three digits (0.xxx or xx.x%).

Variance and standard deviation are statistical measurements of the dispersion of a group, or population, of values. The standard deviation is the square root of the variance. As a rule, about 68 percent of a normally distributed population falls within one standard deviation of the mean, and about 95 percent falls within two standard deviations. A large standard deviation indicates that the population is widely dispersed from the mean; a small standard deviation indicates that the population is tightly packed around the mean.

The VAR and STDEV function families compute the variance and standard deviation of the numbers in a range of cells. Before you make these calculations, you must determine whether those values represent the total population or only a representative sample of that population. The VAR.S and STDEV.S functions assume that the values represent only a sample of the total population, while the VAR.P and STDEV.P functions assume that the values represent the total population. The A versions—VARA, VARPA, STDEVA, and STDEVPA—include numeric text entries and logical values in their calculations, while the others do not.