Schaum’s Outline of Probability, Second Edition

APPENDIX A
Descriptive Statistics

A.1 INTRODUCTION

Statistics means, on the one hand, lists of numerical data. For example, the weights of the students at a university, or the number of children per family in a city. Statistics as a science, on the other hand, is that branch of mathematics which organizes, analyzes, and interprets such raw data.

This appendix mainly covers topics related to the gathering and description of data, called descriptive statistics. It is closely related to probability theory in that the probability model that one develops for the events of a space usually depends on the relative frequencies of such events. The topics of inferential statistics, such as estimation and hypothesis testing, lie beyond the scope of this appendix and text.

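The numerical data x₁, x₂, … we consider will come either from a random sample of a larger population or from the larger population itself. We distinguish these two cases using different notation as follows:

Sample: n = number of items, x̄ = sample mean, s² = sample variance, s = sample standard deviation

Population: N = number of items, μ = population mean, σ² = population variance, σ = population standard deviation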

Note that Greek letters are used with the population and are called parameters, whereas Latin letters are used with the samples and are called statistics. First we will give formulas for the data coming from a sample. This will be followed by formulas for the population.

A.2 FREQUENCY TABLES, HISTOGRAMS

One of the first things that one usually does with a large list of numerical data is to collect them into groups (grouped data). A group, sometimes called a category, refers to the set of numbers all of which have the same value x_i, or to the set (class) of numbers in a given interval where the midpoint x_i of the interval, called the class value, serves as an approximation to the values in the interval. We assume there are k such groups with f_i denoting the number of elements (frequency) in the group with value x_i or class value x_i. Such grouped data yields a table, called a frequency distribution, as follows:

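Value (or class value):  x₁   x₂   …   x_k
Frequency:               f₁   f₂   …   f_k

Thus, the total number of data items is

$$n = f_1 + f_2 + \cdots + f_k = \sum f_i$$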

As usual, Σ will denote a summation over all the values of the index, unless otherwise specified.

Our frequency distribution table usually lists, when applicable, the ends of the class intervals, called class boundaries or class limits. We assume all intervals have the same length called the class width. If a data item falls on a class boundary, it is usually assigned to the higher class.

Sometimes the table also lists the cumulative frequency function F_s, where F_s is defined by

$$F_s = f_1 + f_2 + \cdots + f_s$$

That is, F_s is the sum of the frequencies up to f_s. Thus, F_k = n, the number of data items.
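The definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the sample data are assumptions made here, not something taken from the text.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(data):
    """Return (values, frequencies, cumulative frequencies) for ungrouped data."""
    counts = Counter(data)
    values = sorted(counts)                 # the distinct values x_1 < x_2 < ... < x_k
    freqs = [counts[v] for v in values]     # f_i = number of times x_i occurs
    cum = list(accumulate(freqs))           # F_s = f_1 + ... + f_s
    return values, freqs, cum


# Hypothetical raw data, chosen only for illustration.
sample = [2, 1, 3, 2, 2, 4, 1, 3, 2, 5]
values, freqs, cum = frequency_table(sample)

for x, f, F in zip(values, freqs, cum):
    print(f"x = {x}  f = {f}  F = {F}")
```

The last cumulative frequency equals the number of data items, which gives a quick consistency check on any hand-built table.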

The number k of groups that we decide to use to collect our data should not be too small or too large. If it is too small, then we will lose much of the information of the given data; if it is too large, then we will lose the purpose of grouping the data. The rule of thumb is that k should lie between 5 and 12. We illustrate the above with two examples. Note that any such frequency distribution can then be pictured as a histogram or frequency polygon.

EXAMPLE A.1 Suppose an apartment house has n = 45 apartments, with the following numbers of tenants:

Observe that the only numbers which appear in the list are 1, 2, 3, 4, 5, and 6. The frequency distribution, including the cumulative frequency distribution, follows:

The sum of the frequencies is n = 45, which is also the last entry in the cumulative frequency row.

Figure A-1 shows the histogram corresponding to the above frequency distribution. The histogram is simply a bar graph where the height of the bar is the frequency of the given number in the list. Similarly, the cumulative frequency distribution could be presented as a histogram; the heights of the bars would be 8, 22, 29, …, 45.

Fig. A-1

Fig. A-2

EXAMPLE A.2 Suppose the 6:00 p.m. temperatures (in degrees Fahrenheit) for a 35-day period are as follows:

Rather than find the frequency of each individual data item, it is more useful to collect the data in classes as follows (where the temperature 95.0°F is assigned to the higher class 95 to 100 rather than the lower class 90 to 95):

The class width for this distribution is w = 5. The sum of the frequencies is n = 35; it is also the last entry in the cumulative frequency row.

Figure A-2 shows the histogram corresponding to the above frequency distribution. It also shows the frequency polygon of the data, which is the line graph obtained by connecting the midpoints of the tops of the rectangles in the histogram. Observe that the line graph is extended to the class value 67.5 on the left and to 112.5 on the right. In such a case, the sum of the areas of the rectangles equals the area bounded by the frequency polygon and the x axis.
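The grouping rule used in this section (equal class widths, with a boundary value assigned to the higher class) can be sketched as follows. The function name `class_frequencies`, its parameters, and the temperatures below are hypothetical and serve only to illustrate the rule.

```python
def class_frequencies(data, lower, width, k):
    """Count data items in k classes [lower + i*width, lower + (i+1)*width).

    A value that falls on a class boundary goes to the higher class,
    because floor division maps the boundary to the next class index.
    """
    freqs = [0] * k
    for x in data:
        i = int((x - lower) // width)
        if not 0 <= i < k:
            raise ValueError(f"{x} lies outside the class intervals")
        freqs[i] += 1
    return freqs


# Hypothetical temperatures (not the data of Example A.2).
temps = [72.0, 84.5, 85.0, 89.9, 90.0, 94.9, 95.0, 101.3]
freqs = class_frequencies(temps, lower=70, width=5, k=8)

# Class values are the midpoints of the class intervals.
class_values = [70 + 5 * i + 2.5 for i in range(8)]
```

Here 95.0 is counted in the class 95 to 100, not in 90 to 95, which matches the boundary convention stated above.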

A.3 MEASURES OF CENTRAL TENDENCY; MEAN AND MEDIAN

There are various ways of giving an overview of data. One way is by graphical descriptions such as the frequency histogram or the frequency polygon discussed above. Another way is to use certain numerical descriptions of the data. Numbers, such as the mean and median, give, in some sense, the central or middle values of the data. The central tendency of our data is discussed in this section.

The next section discusses other numbers, the variance and standard deviation, which measure the dispersion or spread of the data about the mean, and the quartiles, which measure the dispersion or spread of the data about the median.

Many formulas will be designated as (a) or (b) where (a) indicates ungrouped data and (b) indicates grouped data. Unless otherwise stated, we assume that our data come from a random sample of a (larger) population. Separate formulas are given for data which come from the total population itself.

Mean (Arithmetic Mean)

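The arithmetic mean, or simply mean, of a sample of n numerical values, denoted by x̄ (read: x-bar), is the sum of the values divided by the number of values. That is,

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n} \tag{A-1a}$$

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum f_i x_i}{\sum f_i} \tag{A-1b}$$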

The mean is frequently called the average value.
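A minimal sketch of the two forms of the mean, for ungrouped data and for a frequency distribution, follows. The function names and the numbers are assumptions chosen for illustration.

```python
def mean(values):
    """Mean of ungrouped data: (sum of x_i) / n."""
    return sum(values) / len(values)


def grouped_mean(xs, fs):
    """Mean of grouped data: (sum of f_i * x_i) / (sum of f_i)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


# Hypothetical frequency distribution.
xs = [1, 2, 3, 4, 5]
fs = [2, 4, 2, 1, 1]

# Expanding the table back into raw data gives the same mean.
raw = [x for x, f in zip(xs, fs) for _ in range(f)]
print(mean(raw), grouped_mean(xs, fs))
```

When the x_i are exact values the two results agree; when the x_i are class values, the grouped result is only an approximation to the mean of the original data.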

EXAMPLE A.3

(a) Consider the data in Example A.1. Using the frequency distribution, rather than adding up the 45 numbers, we obtain the mean as follows:

In other words, there is an average of 2.8 people living in an apartment.

(b) Consider the data in Example A.2. Using the frequency distribution with class values, rather than the exact 35 numbers, we obtain the mean as follows:

That is, the average 6:00 p.m. temperature is approximately 87.2°F.

Median

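Suppose a list of n data values is sorted in increasing order. The median of the data, denoted by M, is defined to be the middle value (if n is odd) or the average of the two middle values (if n is even). That is,

$$M = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{n/2} + x_{(n/2)+1}}{2} & \text{if } n \text{ is even} \end{cases}$$

Note that M is the average of the (n/2)th and [(n/2) + 1]th terms when n is even.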

Suppose, for example, the following two lists of sorted numbers are given:

List A has 5 terms; hence the middle term is the third term, and that term is its median M. List B has 8 terms; hence there are two middle terms, the fourth term 5 and the fifth term 7. Thus, its median is M = (5 + 7)/2 = 6, the average of the two middle terms.

The cumulative frequency distribution can be used to find the median of an arbitrary set of data.

One property of the median M is that there are just as many numbers less than M as there are greater than M.

Suppose the data are grouped. The cumulative frequency distribution can be used to find the class with the median. Then the class value is sometimes used as an approximation to the median or, for a better approximation, one can linearly interpolate in the class to find an approximation to the median.

EXAMPLE A.4

(a) Consider the data in Example A.1 which gives the number of tenants in 45 apartments. Here n = 45; hence the median is the twenty-third value. The cumulative frequency row tells us that M = 3.

(b) Consider the data in Example A.2 which gives the 6:00 p.m. temperatures for a 35-day period. The median is the eighteenth value, and its exact value can be found by using the original data before they are grouped into classes. Using the grouped data, we can find an approximation to the median in two ways. Note, first, using the cumulative frequency row, that the median is the second value in the group 85–90 which has five values. Thus:

(i) Simply let M = 87.5, the class value of the group.

(ii) Linearly interpolate in the class to obtain M = 85 + (2/5)(5) = 87.0.

Clearly (ii) will usually give a better approximation to the median.
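The two ways of finding a median can be sketched as follows. The interpolation helper assumes one common convention: the j-th of f values in a class is placed a fraction j/f of the way through the class. The function names and the 8-term list are hypothetical.

```python
def median(values):
    """Middle value (n odd) or average of the two middle values (n even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def interpolated_median(lower, width, position, class_freq):
    """Approximate a grouped median by linear interpolation in its class.

    lower      -- lower boundary of the class containing the median
    width      -- class width
    position   -- which value of the class the median is (1-based)
    class_freq -- number of values in that class
    """
    return lower + width * position / class_freq


# Hypothetical list whose fourth and fifth terms are 5 and 7.
print(median([1, 2, 4, 5, 7, 8, 9, 12]))

# Second of five values in the class 85-90, as in Example A.4(b).
print(interpolated_median(lower=85, width=5, position=2, class_freq=5))
```

Other interpolation conventions exist (for example, using position minus one half), so the result of the second function should be read as one reasonable approximation, not the only one.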

Midrange

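The midrange of a sorted sample is the average of the smallest value x₁ and the largest value x_n. That is,

$$\text{mid} = \frac{x_1 + x_n}{2}$$

For the data in Example A.1, x₁ = 1 and x_n = 6. Thus mid = (1 + 6)/2 = 3.5.

For the data in Example A.2, x₁ = 72.5 and x_n = 107.5. Thus mid = (72.5 + 107.5)/2 = 90.0.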

(Again we use class values rather than the original data for our formula.)

Additional Measurements

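(1) Weighted Mean (Weighted Arithmetic Mean): Suppose each value x_i is associated with a nonnegative weighting factor w_i. Then the weighted mean is defined as follows:

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_k x_k}{w_1 + w_2 + \cdots + w_k} = \frac{\sum w_i x_i}{\sum w_i} \tag{A-3}$$

Here Σ w_i is the total weight. Note that Formula (A-1b) is a special case of Formula (A-3) where the weight of x_i is its frequency.

(2) Grand Mean: Suppose there are k samples and the ith sample has mean x̄_i and n_i elements. Then the grand mean, denoted by x̿ (read: x-double bar), is defined as follows:

$$\bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum n_i \bar{x}_i}{\sum n_i}$$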
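The following sketch illustrates both measurements and checks one useful fact: the grand mean of several samples equals the ordinary mean of the pooled data. The function names and the two small samples are assumptions made for this illustration.

```python
from math import isclose


def weighted_mean(values, weights):
    """Weighted mean: (sum of w_i * x_i) / (sum of w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


def grand_mean(samples):
    """Grand mean: weighted mean of the sample means, weighted by sample size."""
    means = [sum(s) / len(s) for s in samples]
    sizes = [len(s) for s in samples]
    return weighted_mean(means, sizes)


# Two hypothetical samples of different sizes.
samples = [[1, 2, 3], [10, 20]]
pooled = [x for s in samples for x in s]

g = grand_mean(samples)
print(g, sum(pooled) / len(pooled))
```

Note that simply averaging the two sample means (2 and 15) would give 8.5, which ignores the differing sample sizes.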

Population Mean

Suppose x₁, x₂, …, x_N are the N numerical values of some population. The formula for the population mean, denoted by the Greek letter μ (read: mu), follows:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

(We emphasize that N denotes the number of elements in the population whereas n denotes the number of elements in a sample of the population.)

Remark: Observe that the formula for the population mean μ is the same as the formula for the sample mean x̄. On the other hand, there are formulas for the population which are not the same as the corresponding formulas for the sample. For example, the formula for the (population) standard deviation σ (Section A.4) is not the same as the formula for the sample standard deviation s.

A.4 MEASURES OF DISPERSION: VARIANCE AND STANDARD DEVIATION

Consider the following two samples of numerical values:

Observe that the median (middle value) of each list is M = 10. Furthermore, the following shows that both lists have the same mean x̄ = 10:

Although both lists have the same first and last elements, the values in list A are clustered more closely about the mean than the values in list B. This section will discuss important ways of measuring such dispersions of data.

Variance and Standard Deviation

Consider a sample of values x₁, x₂, …, x_n, and suppose x̄ is the mean of the sample. The difference x_i − x̄ is called the deviation of the data value x_i from the mean x̄; it is positive or negative according as x_i is greater or less than x̄. The sample variance, denoted by s², is defined as the sum of the squares of the deviations divided by n − 1. Namely,

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \tag{A-6a}$$

$$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\left(\sum f_i\right) - 1} \tag{A-6b}$$

The nonnegative square root of the sample variance s², denoted by s, is called the sample standard deviation. That is,

$$s = \sqrt{s^2}$$

If the data are organized into classes, then we use the ith class value for x_i in the above Formula (A-6b).

The data in most applications and examples will come from some sample; hence we may simply say variance and standard deviation, omitting the adjective “sample”.

Since each squared deviation is nonnegative, so is the variance s². Moreover, s² is zero precisely when all the data values are equal (and, therefore, are all equal to the mean x̄). Accordingly, if the data are more spread out, then the variance s² and the standard deviation s will be larger.

One advantage of the use of the standard deviation s over the variance s² is that the standard deviation s will have the same units as the original data.

EXAMPLE A.5 Consider the lists A and B above.

(a) List A has a mean x̄ = 10. The following are the deviations of the 7 data values:

The squares of the deviations are as follows:

Also, n − 1 = 6. Therefore, the sample variance s² and standard deviation s are derived as follows:

and

(b) List B also has a mean x̄ = 10. The deviations of the data and their squares follow:

Again, n − 1 = 6. Accordingly, the sample variance s² and standard deviation s are derived as follows:

and

Note that list B, which exhibits more dispersion than A, has a larger variance and standard deviation than list A.

Alternate Formulas for Sample Variance

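Alternate formulas for the sample variance, which are equivalent to Formulas (A-6a) and (A-6b), are as follows:

$$s^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1} \tag{A-8a}$$

$$s^2 = \frac{\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2 / \sum f_i}{\left(\sum f_i\right) - 1} \tag{A-8b}$$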

Again, if the data are organized into classes, then we use the class values as approximations to the original values in the above Formula (A-8b).

Although Formulas (A-8a) and (A-8b) may look more complicated than Formulas (A-6a) and (A-6b), they are usually more convenient to use. In particular, these formulas only use one subtraction in the numerator, and they can be used without first calculating the sample mean x̄.
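The equivalence of the two forms can be checked numerically. The sketch below computes the sample variance both from the deviations, as in Formula (A-6a), and from the sums Σx_i and Σx_i², as in Formula (A-8a). The function names and the seven-value sample are assumptions chosen for illustration.

```python
from math import isclose, sqrt


def variance_by_deviations(values):
    """Sample variance as the sum of squared deviations over n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def variance_by_sums(values):
    """Sample variance from the sums of x_i and x_i**2, without the mean."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)


# Hypothetical sample with mean 10.
data = [7, 9, 9, 10, 10, 11, 14]
v1 = variance_by_deviations(data)
v2 = variance_by_sums(data)
s = sqrt(v1)
print(v1, v2, s)
```

For hand calculation the second form is usually quicker. In floating-point arithmetic it can lose precision when the values are large compared with their spread, so the deviation form is often the safer choice in software.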

EXAMPLE A.6 Consider the following data values:

Find: (a) mean x̄, (b) variance s² and standard deviation s.

First construct the following table where the two numbers on the right, 95 and 1217, denote the sums Σx_i and Σx_i², respectively:

(It is currently common practice and notationally convenient to write numbers and their sum horizontally rather than vertically.)

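(a) By Formula (A-1a), where n = 9 and Σx_i = 95,

$$\bar{x} = \frac{95}{9} \approx 10.56$$

(b) Here we use Formula (A-8a) with n = 9, Σx_i = 95, and Σx_i² = 1217:

$$s^2 = \frac{1217 - (95)^2/9}{8} \approx \frac{1217 - 1002.78}{8} \approx 26.78$$

Then

$$s = \sqrt{26.78} \approx 5.17$$

Note that if we used Formula (A-6a), we would need to subtract x̄ ≈ 10.56 from each x_i before squaring.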

EXAMPLE A.7 Consider the data in Example A.1 which gives the number of tenants in 45 apartments. The sample mean x̄ = 2.8 was obtained in Example A.3. Find the sample variance s² and the sample standard deviation s.

First extend the frequency distribution table of the data as follows (where SUM refers to Σ f_i, Σ f_i x_i, and Σ f_i x_i²):

Then

Note that n = Σ f_i = 45, so n − 1 = 44.

Measures of Position: Quartiles and Five-Number Summary

Consider a set of n data values which are arranged in increasing order. Recall that the median M of the data values has been defined as a number for which at most half of the values are less than M and at most half of the values are greater than M. Here “half” means n/2 when n is even and (n − 1)/2 when n is odd. Specifically,

$$M = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{n/2} + x_{(n/2)+1}}{2} & \text{if } n \text{ is even} \end{cases}$$

The first, second, and third quartiles, Q₁, Q₂, Q₃, are defined as follows:

Q₁ = median of the first half of the values

Q₂ = M = median of all the values

Q₃ = median of the second half of the values

The 5-number summary of the data is the following quintuple:

[L, Q₁, M, Q₃, H]

where L is the lowest value, Q₁, M = Q₂, Q₃ are the quartiles, and H is the highest value.

The range of the above data is the distance between the lowest and highest values, and the interquartile range (IQR) is the distance between the first and third quartiles; namely,

range = H − L and IQR = Q₃ − Q₁

Observe that

Range Interval: [L, H] contains 100 percent of the data values.

IQR Interval: [Q₁, Q₃] contains about 50 percent of the data values.

Also, observe that the 5-number summary [L, Q₁, M, Q₃, H] or, equivalently, the 4 intervals,

[L, Q₁], [Q₁, M], [M, Q₃], [Q₃, H]

divide the data into 4 sets where each set contains about 25 percent of the data values.

EXAMPLE A.8 Consider the following two lists of numerical values:

List A: 7, 9, 9, 10, 10, 11, 14

List B: 7, 7, 8, 10, 11, 13, 14

The median of both lists is the fourth value, M = 10. Find the quartiles Q₁ and Q₃, the 5-number summary [L, Q₁, M, Q₃, H], and the range and interquartile range (IQR) of each list. Compare the range and IQR of both lists.

(a) The median M = 10 of list A divides the set into the first half {7, 9, 9} and the second half {10, 11, 14}. Hence Q₁ = 9 and Q₃ = 11. Also, L = 7 is the lowest value and H = 14 is the highest value. Thus, the 5-number summary of list A follows:

[L, Q₁, M, Q₃, H] = [7, 9, 10, 11, 14]

Furthermore

range = H − L = 14 − 7 = 7 and IQR = Q₃ − Q₁ = 11 − 9 = 2

(b) The median M = 10 of list B divides the set into the first half {7, 7, 8} and the second half {11, 13, 14}. Hence Q₁ = 7 and Q₃ = 13. Also, L = 7 is the lowest value and H = 14 is the highest value. Thus, the 5-number summary of list B is as follows:

[L, Q₁, M, Q₃, H] = [7, 7, 10, 13, 14]

Furthermore

range = H − L = 14 − 7 = 7 and IQR = Q₃ − Q₁ = 13 − 7 = 6

Although list B exhibits more dispersion than list A, the ranges of both lists are the same. However, the IQR of list B is much larger than the IQR of list A. Generally speaking, the IQR gives a more accurate description of the dispersion of a list than the range, since the range may be strongly influenced by a single small or large value.
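The quartile rule used in this section (Q₁ and Q₃ are the medians of the lower and upper halves, with the middle value left out when n is odd) can be sketched as below. The function names are assumptions; the two lists are the ones that can be read off from the halves and the median quoted in Example A.8.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (L, Q1, M, Q3, H) using the median-of-halves rule."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For odd n the middle value belongs to neither half.
    upper = s[half + 1:] if n % 2 else s[half:]
    return s[0], median(lower), median(s), median(upper), s[-1]


list_a = [7, 9, 9, 10, 10, 11, 14]
list_b = [7, 7, 8, 10, 11, 13, 14]

for name, data in (("A", list_a), ("B", list_b)):
    low, q1, m, q3, high = five_number_summary(data)
    print(name, (low, q1, m, q3, high), "range =", high - low, "IQR =", q3 - q1)
```

Statistical software often uses other quartile conventions based on interpolation, so library results may differ slightly from the values produced by this rule.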

EXAMPLE A.9 Consider the following list of numerical values:

Find the median M, the quartiles Q₁ and Q₃, the 5-number summary [L, Q₁, M, Q₃, H], and the range and interquartile range (IQR) of the data.

Here n = 30 is even, so the median M is the average of the fifteenth and sixteenth values. Thus

The first quartile Q₁ is the median of the first half (first 15) of the numbers, so Q₁ is the eighth number of the first-half sublist. The third quartile Q₃ is the median of the second half (second 15) of the numbers, so Q₃ is the eighth number of the second-half sublist. With L the lowest value and H the highest value, the 5-number summary follows:

Furthermore:

A.5 BIVARIATE DATA, SCATTERPLOTS, CORRELATION COEFFICIENTS

Quite often in statistics it is desired to determine the relationship, if any, between two variables, such as between age and weight, weight and height, years of education and salary, amount of daily exercise and cholesterol level, and so on. Letting x and y denote the two variables, the data will consist of a list of pairs of numerical values

where the first values correspond to the variable x and the second values correspond to y.

As with a single variable, we can describe such bivariate data both graphically and numerically. Our primary concern is to determine whether there is a mathematical relationship, such as a linear relationship, between the data.

It should be kept in mind that a statistical relationship between two variables does not necessarily imply there is a causal relationship between them. For example, a strong relationship between weight and height does not imply that one variable causes the other. On the other hand, eating more does usually increase the weight of a person but it does not usually mean there will be an increase in the height of the person.

Scatterplots

Consider a list of pairs of numerical values representing variables x and y. The scatterplot of the data is simply a picture of the pairs of values as points in a coordinate plane R². The picture sometimes indicates a relationship between the points as illustrated in the following examples.

EXAMPLE A.10

(a) Consider the following data where x denotes the ages of 6 children and y denotes the corresponding number of correct answers in a 10-question test:

The scatterplot of the data appears in Fig. A-3(a). The picture of the points indicates, roughly speaking, that the number of correct answers increases as the age increases. We then say that x and y have a positive correlation.

Fig. A-3

(b) Consider the following data where x denotes the average daily temperature, in degrees Fahrenheit, and y denotes the corresponding daily natural gas consumption, in cubic feet:

The scatterplot of the data appears in Fig. A-3(b). The picture of the points indicates, roughly speaking, that the gas consumption decreases as the temperature increases. We then say that x and y have a negative correlation.

(c) Consider the following data where x denotes the average daily temperature, in degrees Fahrenheit, over a 6-day period and y denotes the corresponding number of defective traffic lights:

The scatterplot of the data appears in Fig. A-3(c). The picture of the points indicates that there is no apparent relationship between x and y.

Correlation Coefficient

Scatterplots indicate graphically whether there is a linear relationship between two variables x and y. A numeric indicator of such a linear relationship is the sample correlation coefficient r of x and y, which is defined as follows:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}} \tag{A-9}$$

We assume the denominator in Formula (A-9) is not zero. It can be shown that the correlation coefficient r has the following properties:

(1) −1 ≤ r ≤ 1.

(2) r > 0 if y tends to increase as x increases, and r < 0 if y tends to decrease as x increases.

(3) The stronger the linear relationship between x and y, the closer r is to −1 or 1; the weaker the linear relationship between x and y, the closer r is to 0.

An alternate formula for computing r is given below; we then illustrate the above properties of r with examples.

Another numerical measurement of bivariate data with variables x and y is the sample covariance, which is denoted by s_xy and defined as follows:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Formula (A-9) can now be written in the more compact form as

Sample correlation coefficient:

$$r = \frac{s_{xy}}{s_x s_y}$$

where s_x and s_y are the sample standard deviations of x and y, respectively, and s_xy is the sample covariance of x and y defined above.

An alternate formula for computing the correlation coefficient r follows:

$$r = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sqrt{\sum x_i^2 - \left(\sum x_i\right)^2/n}\;\sqrt{\sum y_i^2 - \left(\sum y_i\right)^2/n}} \tag{A-11}$$

This formula is very convenient to use after forming a table with the values of x_i, y_i, x_i², y_i², x_i y_i, and their sums, as illustrated below.

EXAMPLE A.11 Find the correlation coefficient r for each data set in Example A.10.

(a) Construct the following table which gives the x, y, x², y², and xy values, and the last column gives the corresponding sums:

Now use Formula (A-11), where the number of points is n = 6, to obtain

Here r is close to 1, which is expected since the scatterplot in Fig. A-3(a) indicates a strong positive linear relationship between x and y.

(b) Construct the following table which gives the x, y, x², y² and xy values, and the last column gives the corresponding sums:

Formula (A-11), with n = 7, yields

Here r is close to −1, and the scatterplot in Fig. A-3(b) indicates a strong negative linear relationship between x and y.

(c) Construct the following table which gives the x, y, x², y², and xy values, and the last column gives the corresponding sums:

Formula (A-11), with n = 6, yields

Here r is close to 0, which is expected since the scatterplot in Fig. A-3(c) indicates no linear relationship between x and y.
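As a cross-check on the two ways of computing r discussed in this section, the sketch below evaluates the correlation coefficient both from deviations about the means and from the column sums of an x, y, x², y², xy table. The function names and the five data points are hypothetical and are not taken from Example A.10.

```python
from math import isclose, sqrt


def correlation_by_definition(xs, ys):
    """r from deviations about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    dxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    dxx = sum((x - xbar) ** 2 for x in xs)
    dyy = sum((y - ybar) ** 2 for y in ys)
    return dxy / sqrt(dxx * dyy)


def correlation_by_sums(xs, ys):
    """r from the sums of x, y, x**2, y**2, and x*y."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = sxy - sx * sy / n
    return numerator / sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))


# Hypothetical bivariate data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_def = correlation_by_definition(xs, ys)
r_sums = correlation_by_sums(xs, ys)
print(round(r_def, 4), round(r_sums, 4))
```

Both functions divide by zero when all x values or all y values are equal, which corresponds to the excluded case of a zero denominator.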

A.6 METHODS OF LEAST SQUARES, REGRESSION LINE, CURVE FITTING

Suppose a scatterplot of the data points (x_i, y_i) indicates a linear relationship between variables x and y or, alternately, suppose the correlation coefficient r of x and y is close to 1 or −1. Then the next step is to find a line L that, in some sense, fits the data. The line L we choose is called the least-squares line. This section discusses this line, and then we discuss more general types of curve fitting.

Least-Squares Line

Consider a given set of data points P_i(x_i, y_i) and any (nonvertical) line L. Let y_i* denote the y value of the point on L corresponding to x_i. Furthermore, let d_i = y_i − y_i* be the difference between the actual value of y and the value of y on the line or, in other words, the vertical (directed) distance between the point P_i and the line L, as shown in Fig. A-4. The sum

$$\sum d_i^2 = d_1^2 + d_2^2 + \cdots + d_n^2$$

is called the squares error between the line L and the data points.

Fig. A-4

Fig. A-5

The least-squares line or the line of best fit or the regression line of y on x is, by definition, the line L whose squares error is as small as possible. It can be shown that such a line L exists and is unique. Let a denote the y intercept of the line L and let b denote its slope, that is, suppose the following is the equation of the line L:

$$y = a + bx$$

Then a and b can be obtained from the following two equations, called the normal equations, in the two unknowns a and b where n is the number of points:

$$\begin{aligned} na + \left(\sum x_i\right) b &= \sum y_i \\ \left(\sum x_i\right) a + \left(\sum x_i^2\right) b &= \sum x_i y_i \end{aligned} \tag{A-12}$$

In particular, the slope b and y intercept a can also be obtained from the following formula (where r is the correlation coefficient):

$$b = \frac{r s_y}{s_x} \qquad \text{and} \qquad a = \bar{y} - b\bar{x} \tag{A-13}$$

Formula (A-13) is usually used instead of Formula (A-12) when one needs, or has already found, the means x̄ and ȳ, the standard deviations s_x and s_y, and the correlation coefficient r of the given data points.

Graphing the line L of best fit requires at least two points on L. The second equation in Formula (A-13) tells us that (x̄, ȳ) lies on the regression line L since

$$\bar{y} = (\bar{y} - b\bar{x}) + b\bar{x} = a + b\bar{x}$$

Also, the first equation in Formula (A-13) then tells us that the point (x̄ + s_x, ȳ + r s_y) is also on L. These points are also pictured in Fig. A-5.

Remark: Recall that the above line L which minimizes the squares of the vertical distances from the given points P_i to L is called the regression line of y on x; it is usually used when one views y as a function of x. A line L′ also exists which minimizes the squares of the horizontal distances of the points P_i from L′; it is called the regression line of x on y. Given any two variables, the data usually indicate that one of them depends upon the other; we then let x denote the independent variable and let y denote the dependent variable. For example, suppose the variables are age and height. We normally assume height is a function of age, so we would let x denote age and y denote height. Accordingly, unless otherwise stated, our least-squares lines will be regression lines of y on x.
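The two routes to the regression line of y on x, solving the normal equations and using the summary statistics x̄, ȳ, s_x, s_y, and r, should give the same intercept a and slope b. The sketch below checks this on a small hypothetical data set; the function names are assumptions made for this illustration.

```python
from math import isclose, sqrt


def line_from_normal_equations(xs, ys):
    """Solve n*a + (sum x)*b = sum y and (sum x)*a + (sum x^2)*b = sum xy."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # zero only if all x values are equal
    if det == 0:
        raise ValueError("all x values are equal; the line is vertical")
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b


def line_from_summary_statistics(xs, ys):
    """Use b = r * s_y / s_x and a = ybar - b * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    dxx = sum((x - xbar) ** 2 for x in xs)
    dyy = sum((y - ybar) ** 2 for y in ys)
    dxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    s_x, s_y = sqrt(dxx / (n - 1)), sqrt(dyy / (n - 1))
    r = dxy / sqrt(dxx * dyy)
    b = r * s_y / s_x
    return ybar - b * xbar, b


# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a1, b1 = line_from_normal_equations(xs, ys)
a2, b2 = line_from_summary_statistics(xs, ys)
print(f"y = {a1:.2f} + {b1:.2f}x")
```

For these points both routes give the line y = 2.2 + 0.6x, and the point (x̄, ȳ) = (3, 4) lies on it.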

EXAMPLE A.12 Find the line L of best fit for the first two scatterplots in Fig. A-3.

(a) By the table in Example A.11(a),

Also, there are n = 6 points. Substitution in the normal equations in Formula (A-12) yields the following system:

The solution of the system follows:

Thus, the following is the line L of best fit.

To graph L, we need only plot two points on L and then draw the line through these points. Setting and , we obtain the two points:

and then we draw L, as shown in Fig. A-6(a).

Fig. A-6

(b) Here we use Formula (A-13) rather than Formula (A-12). By Example A.11(b), with n = 7, we obtain

Using Formulas (A-8) and (A-9), we obtain

Substituting these values in Formula (A-13), we get

Thus, the line L of best fit follows:

The graph of L, obtained by plotting (30, 8.933) and (42.8571, 5.1286) (approximately) and drawing the line through these points, is shown in Fig. A-6(b).

Curve Fitting

Sometimes the scatterplot does not indicate a linear relationship between the variables x and y, but one may visualize some other standard (well-known) curve, y = f(x), which may approximate the data, called an approximate curve. Several such standard curves, where letters other than x and y denote constants, follow:

(1) Parabolic curve: y = a₀ + a₁x + a₂x²

(2) Cubic curve: y = a₀ + a₁x + a₂x² + a₃x³

(3) Hyperbolic curve: y = 1/(a + bx) or 1/y = a + bx

(4) Exponential curve: y = ab^x or log y = a₀ + a₁x

(5) Geometric curve: y = ax^b or log y = log a + b log x

(6) Modified exponential curve: y = ab^x + c

(7) Modified geometric curve: y = ax^b + c

Pictures of some of these standard curves appear in Fig. A-7.

Fig. A-7

Generally speaking, it is not easy to decide which curve to use for a given set of data points. On the other hand, it is usually easier to determine a linear relationship by looking at the scatterplot or by using the correlation coefficient. Thus, it is standard procedure to find the scatterplot of transformed data. Specifically:

(a) If log y versus x indicates a linear relationship, use the exponential curve (type 4).

(b) If 1/y versus x indicates a linear relationship, use the hyperbolic curve (type 3).

Once one decides upon the type of curve to be used, then that particular curve is the one that minimizes the squares error. We state this formally:

Definition: Consider a collection of curves and a given set of data points. The best-fitting or least-squares curve C in the collection is the curve which minimizes the sum

$$\sum d_i^2 = d_1^2 + d_2^2 + \cdots + d_n^2$$

(where d_i denotes the vertical distance from a data point P_i (x_i, y_i) to the curve C).

Just as there are formulas to compute the constants a and b in the regression line L for a set of data points, so there are formulas to compute the constants in the best-fitting curve C in any of the above types (collections) of curves. The derivation of such formulas usually involves calculus.

EXAMPLE A.13 Consider the following data which indicates exponential growth:

Find the least-squares exponential curve C for the data, and plot the data points and C on the plane R².

The curve C has the form y = ab^x where a and b are unknowns. The logarithm (to base 10) of y = ab^x yields

$$\log y = \log a + x \log b = a' + b'x$$

where a′ = log a and b′ = log b. Thus, we seek the least-squares line L for the following data:

Using the normal equations in Formula (A-12) for L, we get

The antilogarithms of a′ and b′ yield, approximately,

Thus, y = ab^x with these values of a and b is the required exponential curve C. The data points and C are plotted in Fig. A-8.

Fig. A-8
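The procedure of Example A.13 (take base-10 logarithms of y, fit a least-squares line to the transformed data, then take antilogarithms) can be sketched as follows. The function names and the data are assumptions: the points below are hypothetical and lie exactly on y = 3·2^x, so the fit should recover a = 3 and b = 2.

```python
from math import isclose, log10


def fit_line(xs, ys):
    """Least-squares line y = a + b*x from the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


def fit_exponential(xs, ys):
    """Fit y = a * b**x by fitting a line to (x, log10 y); requires y > 0."""
    logs = [log10(y) for y in ys]
    a_prime, b_prime = fit_line(xs, logs)    # log y = a' + b'x
    return 10 ** a_prime, 10 ** b_prime      # antilogarithms give a and b


# Hypothetical data on the curve y = 3 * 2**x.
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]
a, b = fit_exponential(xs, ys)
print(f"y = {a:.3f} * {b:.3f}**x")
```

Note that this approach minimizes the squares error of log y, not of y itself, so for noisy data the result can differ somewhat from a direct least-squares fit of the exponential curve.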

FREQUENCY DISTRIBUTION, MEAN AND MEDIAN

A.1. Consider the following frequency distribution which gives the number f of students who got x correct answers on a 20-question exam:

(a) Display the data in a histogram and a frequency polygon.

(b) Find the mean x̄, median M, and midrange of the data.

(a) The histogram appears in Fig. A-9. The frequency polygon also appears in Fig. A-9; it is obtained from the histogram by connecting the midpoints of the tops of the rectangles in the histogram.

Fig. A-9

(b) First we extend our frequency table to include the cumulative distribution function cf, the products f_i x_i, and the sums Σf_i and Σx_i f_i as follows:

Here we use Formula (A-1b) which gives the mean for grouped data:

There are n = 35 scores, so the median M is the eighteenth score. The row cf in the table tells us that 16 is the sixteenth score, and 17 is the seventeenth to twenty-third scores. Hence the median is M = 17, the eighteenth score.

The midrange is the average of the first score 9 and the last score 20; hence mid = (9 + 20)/2 = 14.5.

A.2. Consider the following data items:

(a) Construct the frequency distribution f and cumulative distribution cf of the data, and display the data in a histogram.

(b) Find the mean x̄, median M, and midrange of the data.

(a) Construct the following frequency distribution table which also includes the products f_ix_i and the sums Σf_i and Σx_i f_i:

Note that the first line of the table consists of the range of numbers, from 2 to 7. The second line (frequency) can be obtained by either counting the number of times each number occurs or by going through the list one number after another and keeping a tally count, a running account as each number occurs. The histogram is shown in Fig. A-10(a).

(b) Here we use Formula (A-1b) which gives the mean for grouped data:

Fig. A-10

There are n = 20 numbers, so the median M is the average of the tenth and eleventh numbers. The row cf in the table tells us that 4 is the tenth number and 5 is the eleventh number. Hence M = (4 + 5)/2 = 4.5.

The midrange is the average of the first number 2 and the last number 7; hence mid = (2 + 7)/2 = 4.5.

A.3. Consider the following scores on a statistics exam:

(a) Construct the frequency distribution f table where the data are grouped into four classes:

60–70, 70–80, 80–90, 90–100

The table should include the class values x_i and the cumulative distribution cf of the data. (Recall that if a number falls on a class boundary, it is assigned to the higher class.) Also, display the data in a histogram.

(b) Find the mean x̄, median M, and midrange of the data.

(a) Construct the following frequency distribution table which also includes the products f_i x_i and the sums Σf_i and Σx_i f_i:

The histogram is shown in Fig. A-10(b).

(b) Using the class values x_i, Formula (A-1b) yields

There are n = 20 numbers, so the median M is the average of the tenth and eleventh scores, which we approximate using their class values. The row cf in the table tells us that 75 is the approximation of the tenth and eleventh scores. Thus M = 75.

The midrange is the average of the first class value 65 and the last class value 95; hence mid = (65 + 95)/2 = 80.

A.4. The yearly rainfall, measured to the nearest tenth of a centimeter, for a 30-year period follows:

(a) Construct the frequency distribution f table where the data are grouped into 10 classes:

28–30, 30–32, 32–34, …, 46–48

The table should include the class values (cv) x_i and the cumulative distribution (cf) of the data.

(b) Find the mean x̄, median M, and midrange of the data.

(a) Construct the following frequency distribution table which also includes the products f_i x_i and the sums Σf_i and Σx_i f_i:

(b) Using the class values x_i, Formula (A-1b) yields

There are n = 30 numbers, so the median M is the average of the fifteenth and sixteenth values, which we approximate using their class values. The row cf in the table tells us that 37 is the class value of both the fifteenth and sixteenth values. Thus M = 37.

The midrange is the average of the first class value 29 and the last class value 47; hence mid = (29 + 47)/2 = 38.

MEASURES OF DISPERSION: VARIANCE, STANDARD DEVIATION, IQR

A.5. Consider the following data values:

(a) Find the sample mean x̄.

(b) Find the variance s² and standard deviation s.

(a) The mean is the “average” of the numbers, the sum of the values divided by the number n = 10 of values:

$$\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5$$

(b) Method 1: Here we use Formula (A-6a). We have

Method 2: Here we use Formula (A-8a). First construct the following table where the two numbers on the right, 50 and 334, denote the sums Σx_i and Σx_i², respectively:

We have

$$s^2 = \frac{334 - (50)^2/10}{9} = \frac{84}{9} \approx 9.33 \qquad \text{and} \qquad s = \sqrt{9.33} \approx 3.06$$

The median divides the 10 items into two halves, A and B, each with 5 numbers. The first quartile Q₁ is the median (middle number) of the first half A; the third quartile Q₃ is the median (middle number) of the second half B. With L the lowest number and H the highest number, the 5-number summary of the data follows:

Furthermore:
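The two methods for the variance in part (b) can be compared numerically. The sketch below assumes that Formula (A-6a) is the definition s² = Σ(x_i − x̄)²/(n − 1) and that Formula (A-8a) is the shortcut s² = [Σx_i² − (Σx_i)²/n]/(n − 1); the list `data` is hypothetical, while the sums 50 and 334 with n = 10 are the ones quoted in Method 2.

```python
from math import sqrt

def variance_by_definition(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def variance_by_shortcut(sum_x, sum_x2, n):
    """Sample variance computed from the sums of x and x^2 only."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Hypothetical data (not the values of Problem A.5).
data = [2, 4, 4, 5, 7, 8]
s2_def = variance_by_definition(data)
s2_short = variance_by_shortcut(sum(data), sum(x * x for x in data), len(data))

# Sums quoted in Problem A.5: n = 10, sum of x = 50, sum of x^2 = 334.
s2_a5 = variance_by_shortcut(50, 334, 10)
s_a5 = sqrt(s2_a5)
print(s2_def, s2_short, s2_a5, s_a5)
```

Both functions agree on the hypothetical list; the shortcut needs only the two column sums, which is why the table in Method 2 is sufficient.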

A.6. The ages of n = 30 children living in an apartment complex are as follows:

(a) Find the frequency distribution of the data.

(b) Find the sample mean x̄, variance s², and standard deviation s for the data.

(c) Find the median M, the 5-number summary [L, Q₁, M, Q₃, H], the range, and the IQR (interquartile range) of the data.

(a) Construct the following frequency table which also includes the cumulative distribution (cf); the products f_i x_i and f_i x_i²; and the sums Σf_i, Σf_i x_i, and Σf_i x_i²:

(b) We have

Also

(c) Here n = 30 is even; hence the median M is the average of the fifteenth and sixteenth ages. The row cf in the table tells us that 2 is the fifteenth age and 3 is the sixteenth age. Thus M = (2 + 3)/2 = 2.5.

The median divides the 30 items into two halves, each with 15 ages. The first quartile Q₁ is the median of the first 15 ages, so Q₁ is the eighth age; the third quartile Q₃ is the median of the last 15 ages, so Q₃ is the twenty-third age. Using the cf row in the table, we obtain

Furthermore, L is the lowest number and H is the highest number. Thus, the 5-number summary of the data follows:

Furthermore:

A.7. Consider the following list of data values:

(a) Find the median M.

(b) Find the quartiles Q₁ and Q₃, the 5-number summary [L, Q₁, M, Q₃, H], the range, and the IQR (interquartile range) of the data.

(a) First arrange the data in numerical order:

There are n = 18 values, so the median M is the average of the ninth value 5 and the tenth value 6. Thus M = (5 + 6)/2 = 5.5.

(b) Q₁ is the median of the nine values, from 1 to 5, less than M; thus Q₁ is the fifth value. Q₃ is the median of the nine values, from 6 to 16, greater than M; thus Q₃ is the fifth of these values. Also, L = 1 is the lowest number and H = 16 is the highest number. Thus, the 5-number summary of the data follows:

Furthermore:

A.8. Consider the following frequency distribution:

(a) Find the sample mean x̄, variance s², and standard deviation s for the data.

(b) Find the median M, the quartiles Q₁ and Q₃, the 5-number summary [L, Q₁, M, Q₃, H], the range, and the IQR (interquartile range) of the data.

(a) Extend the frequency table to include the cumulative distribution (cf); the products f_i x_i and f_i x_i²; and the sums Σf_i, Σf_i x_i, and Σf_i x_i² as follows:

Therefore

(b) Here n = 25 is odd; hence the median M is the thirteenth number. The row cf in the table tells us that M = 4. The median M = 4 divides the 25 numbers into two halves, each with 12 numbers. The first quartile Q₁ is the median of the first 12 numbers, so Q₁ is the average of the sixth number 2 and the seventh number 3. Thus, Q₁ = 2.5. The third quartile Q₃ is the median of the last 12 numbers, the fourteenth to twenty-fifth numbers, so Q₃ is the average of the nineteenth number 4 and twentieth number 4. Thus, Q₃ = 4. Furthermore, L is the lowest number and H is the highest number. Thus, the 5-number summary of the data is as follows:

Furthermore:
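The quartile convention used in Problems A.5 through A.8 (Q₁ and Q₃ are the medians of the lower and upper halves, with the median itself left out of both halves when n is odd) can be sketched as follows. The frequency table is hypothetical: its frequencies were chosen only so that n = 25 and the sixth, seventh, thirteenth, nineteenth, and twentieth values agree with those quoted in the solution of Problem A.8. The helper names are likewise introduced here.

```python
def median_of(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Return (L, Q1, M, Q3, H) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # When n is odd the middle item belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return data[0], median_of(lower), median_of(data), median_of(upper), data[-1]

def expand(freq_table):
    """Turn (value, frequency) pairs into the full list of values."""
    return [x for x, f in freq_table for _ in range(f)]

# Hypothetical frequency table with n = 25 (assumed for illustration only).
table = [(1, 3), (2, 3), (3, 4), (4, 10), (5, 3), (6, 2)]
low, q1, med, q3, high = five_number_summary(expand(table))
data_range = high - low
iqr = q3 - q1
print(low, q1, med, q3, high, data_range, iqr)
```

Other quartile conventions exist, so statistical software may report slightly different values of Q₁ and Q₃ for the same data.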

MISCELLANEOUS PROBLEMS INVOLVING ONE VARIABLE

A.9. An English class for foreign students consists of 20 French students, 25 Italian students, and 15 Spanish students. On an exam, the French students average 78, the Italian students 75, and the Spanish students 76. Find the mean grade for the class.

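Here we use Formula (A-5) for the grand mean (the weighted mean of the means) with n₁ = 20, n₂ = 25, n₃ = 15 and x̄₁ = 78, x̄₂ = 75, x̄₃ = 76.

This yields x̄ = (20·78 + 25·75 + 15·76)/(20 + 25 + 15) = 4575/60 = 76.25.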

That is, 76.25 is the mean grade for the class.

A.10. A history class contains 10 freshmen, 15 sophomores, 10 juniors, and 5 seniors. On an exam, the freshmen average 72, the sophomores 76, the juniors 78, and the seniors 80. Find the mean grade for the class.

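Here we use Formula (A-5) for the grand mean with n₁ = 10, n₂ = 15, n₃ = 10, n₄ = 5 and x̄₁ = 72, x̄₂ = 76, x̄₃ = 78, x̄₄ = 80.

Therefore x̄ = (10·72 + 15·76 + 10·78 + 5·80)/(10 + 15 + 10 + 5) = 3040/40 = 76.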

That is, 76 is the mean grade for the class.
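As a quick check of Problems A.9 and A.10, the grand mean can be computed as the size-weighted average of the group means. The function name `grand_mean` is introduced here for illustration; the group sizes and averages are the ones stated in the two problems.

```python
def grand_mean(group_sizes, group_means):
    """Weighted mean of the group means, weighted by group size."""
    total = sum(group_sizes)
    return sum(n * m for n, m in zip(group_sizes, group_means)) / total

# Problem A.9: French, Italian, and Spanish students.
english_class = grand_mean([20, 25, 15], [78, 75, 76])

# Problem A.10: freshmen, sophomores, juniors, and seniors.
history_class = grand_mean([10, 15, 10, 5], [72, 76, 78, 80])

print(english_class, history_class)
```

Note that simply averaging the group means (without weights) would give a different answer whenever the groups differ in size.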

BIVARIATE DATA

A.11. Consider data sets whose scatterplots appear in Fig. A-11. Estimate the correlation coefficient r for each data set if the choice is one of −1.5, −0.9, 0.0, 0.9, 1.5.

Fig. A-11

The correlation coefficient r must lie in the interval [−1, 1]. Moreover, r is close to 1 if the data are approximately linear with positive slope, r is close to −1 if the data are approximately linear with negative slope, and r is close to 0 if there is no relationship between the points. Accordingly:

(a) r is close to 1 since there appears to be a strong linear relationship between the points with positive slope; hence r = 0.9.

(b) r is close to 0 since there appears to be no relationship between the points; hence r = 0.0.

(c) r is close to −1 since there appears to be a strong linear relationship between the points but with negative slope; hence r = −0.9.

A.12. Consider the following list of data values:

(a) Plot the data in a scatterplot.

(b) Compute the correlation coefficient r.

(d) Find L, the least-squares line y = a + bx.

(e) Graph L on the scatterplot in part (a).

(a) The scatterplot (with L) is shown in Fig. A-12(a).

(b) Construct the following table which contains the x, y, x², y², and xy values and where the last column gives the corresponding sums:

Now use Formula (A-11) and the number of points n = 5 to obtain

(The fact that r is close to −1 is expected since the scatterplot indicates a strong linear relationship with negative slope.)

Also, by Formulas (A-8a) and (A-7),

(d) Substitute r, s_x, s_y into Formula (A-13) to obtain the slope b of the least-squares line L:

Now substitute x̄, ȳ, and b into Formula (A-13) to determine the y intercept a of L:

Hence L is

Alternatively, we can find a and b using the normal equations in Formula (A-12) with n = 5:

(These equations would be used if we did not also want r, x̄ and ȳ, and s_x and s_y.)

(e) To graph L, we find two points on L and draw the line through them. One of the two points is (x̄, ȳ), which always lies on any least-squares line. Another point is (10, 2.4), which is obtained by substituting x = 10 in the regression equation L and solving for y. The line L appears in the scatterplot in Fig. A-12(a).
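The route taken in Problem A.12 (first r, then s_x and s_y, then the slope and intercept) can be sketched in Python. The sketch assumes that Formula (A-11) is the usual sums-based expression for r and that Formula (A-13) gives b = r·s_y/s_x and a = ȳ − b·x̄; the five data points are hypothetical and are not the data of the problem.

```python
from math import sqrt

def line_via_correlation(xs, ys):
    """Return (r, a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # Sums of squares and products corrected for the means.
    cxx = sxx - sx * sx / n
    cyy = syy - sy * sy / n
    cxy = sxy - sx * sy / n

    r = cxy / sqrt(cxx * cyy)
    s_x = sqrt(cxx / (n - 1))
    s_y = sqrt(cyy / (n - 1))

    b = r * s_y / s_x          # slope
    a = sy / n - b * sx / n    # intercept, so that (x-bar, y-bar) lies on L
    return r, a, b

# Hypothetical data with a clear negative trend (assumed for illustration).
xs = [1, 2, 3, 4, 5]
ys = [9, 7, 6, 3, 0]
r, a, b = line_via_correlation(xs, ys)
print(r, a, b)
```

Because the intercept is defined through the means, the fitted line passes through (x̄, ȳ) by construction, which is the fact used in part (e).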

Fig. A-12

A.13. Repeat Problem A.12 for the following data:

(a) The scatterplot (with L) is shown in Fig. A-12(b).

(b) Construct the following table which contains the x, y, x², y², and xy values and where the last column gives the corresponding sums:

Now use Formula (A-11) and the number of points n = 4 to obtain

(The fact that r is close to +1 is expected since the scatterplot indicates a strong linear relationship with positive slope.)

Also, by Formulas (A-8a) and (A-7):

(d) Substitute r, s_x, s_y into Formula (A-13) to obtain the slope b of the least-squares line L:

Now substitute x̄, ȳ, and b into Formula (A-13) to determine the y intercept a of L:

Hence L is

Alternatively, we can find a and b using the normal equations in Formula (A-12) with n = 4:

(These equations would be used if we did not also want r, x̄ and ȳ, and s_x and s_y.)

(e) To graph L, we find two points on L and draw the line through them. One point is (x̄, ȳ). Another point is (0, 1.60), the y intercept. The line L appears on the scatterplot in Fig. A-12(b).

A.14. The definition of the sample covariance s_xy of variables x and y follows:

Find s_xy for the data in: (a) Problem A.12, (b) Problem A.13.

(a) The above formula for s_xy yields

We note that the variances s_x² and s_y² are always nonnegative but the covariance s_xy can be negative, which indicates that y tends to decrease as x increases.

(b) The above formula for s_xy yields

The covariance here is positive which indicates that y tends to increase as x increases.
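The sign behavior described in Problem A.14 can be illustrated with a short computation. The sketch assumes the usual definition s_xy = Σ(x_i − x̄)(y_i − ȳ)/(n − 1); both data sets below are hypothetical and are not the data of Problems A.12 and A.13.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical decreasing data: y tends to fall as x rises.
cov_decreasing = sample_covariance([1, 2, 3, 4, 5], [9, 7, 6, 3, 0])

# Hypothetical increasing data: y tends to rise as x rises.
cov_increasing = sample_covariance([1, 2, 3, 4], [2, 3, 5, 6])

print(cov_decreasing, cov_increasing)
```

The covariance of a variable with itself reduces to its variance, which is why s_x² can never be negative while s_xy can.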

A.15. Let W denote the number of American women graduating with a doctoral degree in mathematics in a given year. Suppose that, for certain years, W has the following values:

We assume that the increase, year by year, is approximately linear and that it will increase linearly in the near future. Estimate W for the years 2005, 2008, and 2010.

Our estimation will use a least-squares line L. For notational and computational convenience we let the year 1980 be a base for our x values. Hence we set x = year − 1980 and y = W.

Thus, we seek the line y = a + bx of best fit for the data, where the unknowns a and b will be determined by the following normal equations (A-12):

[We do not use Formula (A-13) for a and b since we do not need the correlation coefficient r nor do we need the values s_x, s_y, x̄, and ȳ.]

The sums in the above system are obtained by constructing the following table which contains the x, y, x², and xy values and where the last column gives the corresponding sums:

Substitution in the above normal equations, with n = 4, yields

The solution of the system is a = 23.5 and b = 1.1. Thus, the following is our least-squares line L:

The (x, y) points and the line L are plotted in Fig. A-13(a).

Substitute 25 (2005), 28 (2008), and 30 (2010) for x in Formula (A-14) to obtain 51, 54.3, and 56.5, respectively. Thus, one would expect that, approximately, , , and women will receive doctoral degrees in the years 2005, 2008, and 2010, respectively.
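The approach of Problem A.15 (shift the years to a base, solve the two normal equations, then substitute future x values) can be sketched as follows. The normal equations are assumed to be n·a + (Σx)·b = Σy and (Σx)·a + (Σx²)·b = Σxy, solved here by Cramer's rule. The years and counts are hypothetical placeholders, not the table of the problem.

```python
def fit_line(xs, ys):
    """Solve the normal equations for the line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # n*a  + sx*b  = sy
    # sx*a + sxx*b = sxy
    det = n * sxx - sx * sx
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

# Hypothetical data (assumed for illustration only).
years = [1985, 1990, 1995, 2000]
counts = [29, 35, 40, 45]

base = 1980
xs = [year - base for year in years]
a, b = fit_line(xs, counts)

def predict(year):
    """Estimate the count for a given calendar year."""
    return a + b * (year - base)

print(a, b, predict(2005))
```

Shifting the years by a base only changes the intercept; the slope and the predictions are the same as they would be with the raw years, but the sums stay small enough to handle by hand.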

Fig. A-13

A.16. Find the least-squares parabola C for the following data:

Plot C and the data points in the plane R².

The parabola C has the form y = a + bx + cx², where the unknowns a, b, c are obtained from the following normal equations [which are analogous to the normal equations for the least-squares line L in Formula (A-12)]:

The sums in the above system are obtained by constructing the following table which contains the x, y, x², x³, x⁴, xy, and x²y values and where the last column gives the corresponding sums:

Substitution in the above normal equations, with , yields

The solution of the system yields

Thus, the required parabola C follows:

The given data points and C are plotted in Fig. A-13(b).
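The three normal equations for a parabola can be set up and solved mechanically. The sketch below assumes the form y = a + bx + cx² and the normal equations n·a + bΣx + cΣx² = Σy, aΣx + bΣx² + cΣx³ = Σxy, aΣx² + bΣx³ + cΣx⁴ = Σx²y, solved by Cramer's rule with exact fractions. The data are hypothetical points lying exactly on y = 1 + 2x + 3x², used only as a sanity check; they are not the data of Problem A.16.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_parabola(xs, ys):
    """Return (a, b, c) for the least-squares parabola y = a + b*x + c*x^2."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    # s[k] = sum of x^k (so s[0] = n); t[k] = sum of x^k * y.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(x ** k * y for x, y in zip(xs, ys)) for k in range(3)]
    matrix = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    d = det3(matrix)  # nonzero when there are at least 3 distinct x values
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for i in range(3):
            replaced[i][col] = t[i]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)

# Hypothetical points lying exactly on y = 1 + 2x + 3x^2.
xs = [0, 1, 2, 3, 4]
ys = [1, 6, 17, 34, 57]
a, b, c = fit_parabola(xs, ys)
print(a, b, c)
```

When the points lie exactly on a parabola, the least-squares fit recovers its coefficients, which makes this a convenient test of the setup before applying it to real data.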

A.17. Derive the normal equations Formula (A-12) for the least-squares line L for n data points P_i (x_i, y_i).

We want to minimize the following least-squares error:

where D may be viewed as a function of a and b. The minimum may be obtained by setting the partial derivatives D_a and D_b equal to zero. The partial derivatives follow:

Setting D_a = 0 and D_b = 0, we obtain the following required equations:
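For reference, the derivation can be written out in full. The following is a sketch under the assumption that the line is y = a + bx and that D is the sum of squared vertical deviations of the points P_i from that line; all sums run over i = 1, …, n.

```latex
\begin{align*}
D(a,b) &= \sum_{i=1}^{n} \bigl[y_i - (a + b x_i)\bigr]^2 \\
D_a &= -2 \sum_{i=1}^{n} \bigl[y_i - (a + b x_i)\bigr] \\
D_b &= -2 \sum_{i=1}^{n} x_i \bigl[y_i - (a + b x_i)\bigr] \\
D_a = 0 &\;\Longrightarrow\; na + b\sum x_i = \sum y_i \\
D_b = 0 &\;\Longrightarrow\; a\sum x_i + b\sum x_i^2 = \sum x_i y_i
\end{align*}
```

The last two lines follow by dividing each derivative by −2 and separating the sum term by term, using Σa = na.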

Supplementary Problems

FREQUENCY DISTRIBUTIONS, MEAN AND MEDIAN

A.18. The frequency distribution of the weekly wages, in dollars, of a group of unskilled workers follows:

(a) Display the data in a histogram and a frequency polygon.

(b) Find the mean x̄, median M, and midrange of the data.

A.19. The amounts of 45 personal loans from a loan company follow:

(a) Group the data into classes with class width and beginning with $400, and construct the frequency and cumulative frequency distribution for the grouped data.

(b) Display the frequency distribution in a histogram.

A.20. The daily number of station wagons rented by an automobile rental agency during a 30-day period follows:

(a) Construct the frequency and cumulative frequency distribution for the data.

(b) Find the mean x̄, median M, and midrange of the data.

A.21. The following denotes the number of people living in each of 35 apartments:

(a) Construct the frequency and cumulative frequency distribution for the data.

(b) Find the mean x̄, median M, and midrange of the data.

A.22. The students in a mathematics class are divided into four groups:

(a) much greater than the median,

(b) little above the median,

(c) little below the median,

(d) much below the median.

On which group should the teacher concentrate in order to increase the median of the class? Mean of the class?

MEASURES OF DISPERSION: VARIANCE, STANDARD DEVIATION, IQR

A.23. The prices of 1 lb of coffee in 7 stores follow:

(a) Find the mean x̄, variance s², and standard deviation s.

(b) Find the median M, 5-number summary, and IQR of the data.

A.24. For a given week, the following were the average daily temperatures:

(a) Find the mean x̄, variance s², and standard deviation s.

(b) Find the median M, 5-number summary, and IQR of the data.

A.25. During a given month, the 10 salespeople in an automobile dealership sold the following number of automobiles:

(a) Find the mean x̄, variance s², and standard deviation s.

(b) Find the median M, 5-number summary, and IQR of the data.

A.26. The ages of students at a college dormitory are recorded, producing the following frequency distribution:

(a) Find the sample mean x̄ and standard deviation s.

(b) Find the median M, 5-number summary, and IQR of the data.

A.27. The following distribution gives the number of hours of overtime during 1 month for employees of a company:

(a) Find the sample mean x̄ and standard deviation s.

(b) Find the median M, 5-number summary, and IQR.

A.28. The following are 40 test scores:

(a) Group the data into 5 classes with class width 10 beginning with 50, and construct the frequency and cumulative frequency distribution for the grouped data.

(b) Find the sample mean x̄ and standard deviation s of the grouped data.

(d) Find the median M, 5-number summary, and IQR of the grouped data.

A.29. The following distribution gives the number of visits for medical care by 80 patients during a 1-year period:

(a) Find the sample mean x̄ and standard deviation s.

(b) Find the median M, 5-number summary, and IQR.

MISCELLANEOUS PROBLEMS INVOLVING ONE VARIABLE

A.30. The students at a small school are divided into 4 groups: A, B, C, D. The number n of students in each group and the mean score of each group follow:

A.31. The mode of a list of numerical data is the value which occurs most often and more than once. Find the mode of the data in Problems: (a) A.20, (b) A.21, (c) A.26, (d) A.27.

BIVARIATE DATA

A.32. Consider the following list of data values:

(a) Draw a scatterplot of the data.

(b) Compute the correlation coefficient r. [Hint: First find Σx_i, Σy_i, Σx_i², Σy_i², Σx_i y_i and then use Formula (A-11).]

(d) Find L, the least-squares line y = a + bx.

(e) Graph L on the scatterplot in part (a).

A.33. Repeat Problem A.32 for the following list of data values:

A.34. Find the covariance s_xy of the variables x and y in: (a) Problem A.32, (b) Problem A.33. (See Problem A.14 for the definition of s_xy.)

A.35. Suppose 7 people in a company are interviewed, yielding the following data where x is the number of years of service and y is the number of people who reviewed the work of the person:

(a) Draw a scatterplot of the data.

(b) Find L, the least-squares line y = a + bx.

(d) Predict the number y of people who reviewed the work of another person if the number of years worked by the person is: , , .

A.36. Consider the following bivariate data:

(a) Find the correlation coefficient r. [Hint: First find Σx_i, Σy_i, Σx_i², Σy_i², Σx_i y_i and then use Formula (A-11).]

(b) Plot x against y in a scatterplot.

(d) Find the least-squares hyperbolic curve C which has the form y = 1/(a + bx), and plot C on the scatterplot in (b). [Hint: Find the least-squares line for the data points (x_i, 1/y_i).]

(e) Which curve, L or C, best fits the data?

A.37. The following table lists average male weight, in pounds, and height, in inches, for certain ages which range from 1 to 21:

Find the correlation coefficient r for: (a) age and weight, (b) age and height, (c) weight and height.

A.38. Let x = age, y = height in Problem A.37. (a) Plot x against y in a scatterplot. (b) Find the line L of best fit. (c) Graph L on the scatterplot in part (a).

A.39. Let x = weight, y = height in Problem A.37. (a) Plot x against y in a scatterplot. (b) Find the line L of best fit. (c) Graph L on the scatterplot in part (a).

A.40. Find the least-squares exponential curve for the following data:

Answers to Supplementary Problems

A.18. (a) See Fig. A-14(a); , , mid = $210.

A.19. (a) The frequency distribution (where the loan amount is divided by $100 for notational convenience) follows:

(b) The histogram is shown in Fig. A-14(b).

, , mid = $1100.

Fig. A-14

A.20. (a) The distributions follow:

, , mid = 8.

A.21. (a) The frequency and cumulative frequency distributions follow:

, , mid = 4.

A.22. Group (c) to increase the median; likely (b) and (c) to increase the mean.

A.23. , , , , ,

A.24. , , , , ,

A.25. , , , , ,

A.26. , , , , ,

A.27. , , , , ,

A.28. (a) The distributions with class values follow:

Remark: The scores of 100 are put in the 90–100 group since there are no scores higher than 100. If there were scores higher than 100, then the scores of 100 would be put in the next higher 100–110 group.

, , .