AP* Statistics: 8th Edition

Our studies so far have been concerned with measurements of a single variable. However, many important applications of statistics involve examining whether two or more variables are related to one another. For example, is there a relationship between the smoking histories of pregnant women and the birth weights of their children? Between SAT scores and success in college? Between amount of fertilizer used and amount of crop harvested?

Two questions immediately arise. First, how can the strength of an apparent relationship be measured? Second, how can an observed relationship be put into functional terms? For example, a real estate broker might not only wish to determine whether a relationship exists between the prime rate and the number of new homes sold in a month but might also find useful an expression with which to predict the number of home sales given a particular value of the prime rate.

A graphical display, called a scatterplot, gives an immediate visual impression of a possible relationship between two variables, while a numerical measurement, called a correlation coefficient, is often used as a quantitative value of the strength of a linear relationship. In either case, evidence of a relationship is not evidence of causation.

Suppose a relationship is perceived between two quantitative variables X and Y, and we graph the pairs (x, y). We are interested in the strength of this relationship, the scatterplot arising from the relationship, and any deviation from the basic pattern of this relationship. In this topic we examine whether the relationship can be reasonably explained in terms of a linear function, that is, one whose graph is a straight line.

We need to know what the term best-fitting straight line means and how we can find this line. Furthermore, we want to be able to gauge whether the relationship between the variables is strong enough so that finding and making use of this straight line is meaningful.

When larger values of one variable are associated with larger values of a second variable, the variables are called positively associated. When larger values of one are associated with smaller values of the other, the variables are called negatively associated.

The strength of the association is gauged by how close the plotted points are to a straight line.

Sometimes different dots in a scatterplot are labeled with different symbols or different colors to show a categorical variable. The resulting labeled scatterplot might distinguish between men and women, between stocks and bonds, and so on.

The above diagram is a labeled scatterplot distinguishing men with plus signs and women with square dots to show a categorical variable.

When analyzing the overall pattern in a scatterplot, it is also important to note clusters and outliers.

An experiment was conducted to note the effect of temperature and light on the potency of a particular antibiotic. One set of vials of the antibiotic was stored under different temperatures, but under the same lighting, while a second set of vials was stored under different lightings, but under the same temperature.

In the first scatterplot note the linear pattern with one outlier far outside this pattern. A possible explanation is that the antibiotic is more potent at lower temperatures, but only down to a certain temperature at which it drastically loses potency.

In the second histogram note the two clusters. It appears that below a certain light intensity the potency is one value, while above that intensity it is another value. In each cluster there seems to be no association between intensity and potency.

Although a scatter diagram usually gives an intuitive visual indication when a linear relationship is strong, in most cases it is quite difficult to visually judge the specific strength of a relationship. For this reason there is a mathematical measure called correlation (or the correlation coefficient). Important as correlation is, we always need to keep in mind that significant correlation does not necessarily indicate causation and that correlation measures the strength only of a linear relationship.

in terms of the means and standard deviations of the two sets. We note that the formula is actually the sum of the products of the corresponding z-scores divided by 1 less than the sample size. However, you should be able to quickly calculate correlation using the statistical package on your calculator. (Examining the formula helps you understand where correlation is coming from, but you will NOT have to use the formula to calculate r.)

Note from the formula that correlation does not distinguish between which variable is called x and which is called y. The formula is also based on standardized scores (z-scores), and so changing units does not change the correlation. Finally, since means and standard deviations can be strongly influenced by outliers, correlation is also strongly affected by extreme values.

The value of r always falls between –1 and +1, with –1 indicating perfect negative correlation and +1 indicating perfect positive correlation. It should be stressed that a correlation at or near zero doesn’t mean there isn’t a relationship between the variables; there may still be a strong nonlinear relationship.

We can also say that r² gives the percentage of variation in y that can be explained by the regression line, with x as the explanatory variable.

It can be shown that r², called the coefficient of determination, is the ratio of the variance of the predicted values

to the variance of the observed values y. That is, there is a partition of the y-variance, and r² is the proportion of this variance that is predictable from a knowledge of x. Alternatively, we can say that r² gives the percentage of variation in y that is explained by the variation in x. In either case, always interpret r² in context of the problem. Remember when calculating r from r² that r may be positive or negative.

While the correlation r is given as a decimal between –1.0 and 1.0, the coefficient of determination r² is usually given as a percentage. An r² of 100% is a perfect fit, with all the variation in y explained by variation in x. How large a value of r² is desirable depends on the application under consideration. While scientific experiments often aim for an r² in the 90% or above range, observational studies with r² of 10% to 20% might be considered informative. Note that while a correlation of .6 is twice a correlation of .3, the corresponding r² of 36% is four times the corresponding r² of 9%.

What is the best-fitting straight line that can be drawn through a set of points?

If a scatterplot indicates a non-linear relationship, don’t try to force a straight-line fit.

On the basis of our experience with measuring variances, by best-fitting straight line we mean the straight line that minimizes the sum of the squares of the vertical differences between the observed values and the values predicted by the line.

It is reasonable, intuitive, and correct that the best-fitting line will pass through (x, y) where x and y are the means of the variables X and Y. Then, from the basic expression for a line with a given slope through a given point, we have

where r is the correlation and s_x and s_y are the standard deviations of the two sets. That is, each standard deviation change in x results in a change of r standard deviations in y. If you graph z-scores for the y-variable against z-scores for the x-variable, the slope of the regression line is precisely r, and, in fact, the linear equation becomes z_y = rz_x.

This best-fitting straight line, that is, the line that minimizes the sum of the squares of the differences between the observed values and the values predicted by the line, is called the least squares regression line or simply the regression line. It can be calculated directly by entering the two data sets and using the statistics package on your calculator.

An insurance company conducts a survey of 15 of its life insurance agents. The average number of minutes spent with each potential customer and the number of policies sold in a week are noted for each agent. Letting X and Y represent the average number of minutes and the number of sales, respectively, we have

Answer: Plotting the 15 points (25, 10), (23, 11), …, (20, 11) gives an intuitive visual impression of the relationship:

This scatterplot indicates the existence of a relationship that appears to be linear; that is, the points lie roughly on a straight line. Furthermore, the linear relationship is positive; that is, as one variable increases, so does the other (the straight line slopes upward).

Using a calculator, we find the correlation to be r = .8836, the coefficient of determination to be r² = .78 (indicating that 78% of the variation in the number of policies is explained by the variation in the number of minutes spent), and the regression line to be

Thus, for example, we might predict that agents who average 24 minutes per customer will average 0.5492(24) – 1.73 = 11.45 sales per week. We also note that each additional minute spent seems to produce an average 0.5492 extra sale.

Following are advertising expenditures and total sales for six detergent products:

Predict the total sales if $5000 is spent on advertising and interpret the slope of the regression line. What if $100,000 is spent on advertising?

Answer: With your calculator, the equation of the regression line is found to be

(A calculator like the TI-84, with less round-off error, directly gives

= 59.683 + 7.295x.)

It is also worthwhile to replace the x and y with more appropriately named variables, resulting, for example, in

The regression line predicts that if $5000 is spent on advertising, the resulting total sales will be 7.293(5) + 59.691 = 96.156 thousands of dollars ($96,156).

The slope of the regression line indicates that every extra $1000 spent on advertising will result in an average of $7293 in added sales.

If $100,000 is spent on advertising, we calculate 7.293(100) + 59.691 = 788.991 thousands of dollars (≈$789,000). How much confidence should we have in this answer? Not much! We are trying to use the regression line to predict a value far outside the range of the data values. This procedure is called extrapolation and must be used with great care.

It should be noted that when we use the regression line to predict a y-value for a given x-value, we are actually predicting the mean y-value for that given x-value. For any given x-value, there are many possible y-values, and we are predicting their mean. So if $5000 is spent many times on advertising, various resulting total sales figures may result, but their predicted average is $96,156.

A random sample of 30 U.S. farm regions surveyed during the summer of 2003 produced the following statistics:

Based on this study, what is the mean predicted corn yield for a region where the average growing season temperature is 76.5°F?

Answer:

standard deviations below the average temperature reported in the study, so with a correlation of r = .32, the predicted corn yield is .32(–1.5) = –0.48 standard deviations below the average corn yield, or 131 – 0.48(5) = 128.6 bushels per acre.

Alternatively, we could have found the linear regression equation relating these variables:

intercept ≈ 131 – 81(0.533) ≈ 87.8, and thus

= 0.533 Temp + 87.8. Then 0.533(76.5) + 87.8 ≈ 128.6.

The difference between an observed and predicted value is called the residual. When the regression line is graphed on the scatterplot, the residual of a point is the vertical distance the point is from the regression line.

When the data point is above the regression line, the residual is positive; a data point below the line gives a negative residual.

The regression line is the line that minimizes the sum of the squares of the residuals.

We calculate the predicted values from the regression line in Example 4.7 and subtract from the observed values to obtain the residuals:

The TI-Nspire easily shows the “squares of the residuals,” the sum of which is minimized by the regression line.

The above equation is true in general; that is, the sum and thus the mean of the residuals is always zero.

The notation for residuals is

The standard deviation of the residuals is calculated as follows:

Plotting the residuals gives further information. In particular, a residual plot with a definite pattern is an indication that a nonlinear model will show a better fit to the data than the straight regression line. In addition to whether or not the residuals are randomly distributed, one should look at the balance between positive and negative residuals and also the size of the residuals in comparison to the associated y-values.

The residuals can be plotted against either the x-values or the

-values (since

is a linear transformation of x, the plots are identical except for scale and a left-right reversal when the slope is negative).

It is also important to understand that a linear model may be appropriate, but weak, with a low correlation. And, alternatively, a linear model may not be the best model (as evidenced by the residual plot), but it still might be a very good model with high r².

Suppose the drying time of a paint product varies depending on the amount of a certain additive it contains.

Additive (oz), x:	1	2	3	4	5	6	7	8	9	10
Drying time (hr), y:	4	2.1	1.5	1	1.2	1.7	2.5	3.6	4.9	6.1

Using a calculator, we find the regression line

= 0.327x + 1.062, and we find the residuals:

x:	1	2	3	4	5	6	7	8	9	10
y – :	2.61	0.38	–0.54	–1.37	–1.5	–1.32	–0.85	–0.08	0.89	1.77

This pattern indicates that a nonlinear model will be a better fit than a straight-line model. A scatterplot of the original data shows clearly what is happening:

The ability to interpret computer output is important not only to do well on the AP Statistics exam, but also to understand statistical reports in the business and scientific world.

Miles per gallon versus speed for a new model automobile is fitted with a least squares regression line. The graph of the residuals and some computer output for the regression are as follows:

The regression equation is
MPG = 38.9 – 0.218 Speed

Predictor	Coef	SE Coef	T	P
Constant	38.929	5.651	6.89	0.000
Speed	–0.2179	0.1119	–1.95	0.099

Analysis of Variance

Source	DF	SS	MS	F	P
Regression	1	199.34	199.34	3.79	0.099
Residual Error	6	315.54	52.59
Total	7	514.88

a. Interpret the slope of the regression line in context.
Answer: The slope of the regression line is –0.2179, indicating that, on average, the MPG drops by 0.2179 for every increase of one mile per hour in speed.

b. What is the mean predicted MPG at a speed of 30 mph?
Answer: At 30 mph, the mean predicted MPG is –0.2179(30) + 38.929, or about 32.4 MPG.

c. What was the actual MPG at a speed of 30 mph?
Answer: The residual for 30 mph is about +3.5, and since residual = actual – predicted, we estimate the actual MPG to be 32.4 + 3.5, or about 36 MPG.

d. Is a line the most appropriate model? Explain.
Answer: The fact that the residuals show such a strong curved pattern indicates that a nonlinear model would be more appropriate.

e. What does “S = 7.252” refer to?
Answer: The standard deviation of the residuals, s_e = 7.252, is a “typical value” of the residuals and gives a measure of how the points are spread around the regression line.

The number of youngsters playing Little League baseball in Ithaca, New York, during the years 1995–2003 is fitted with a least squares regression line. The graph of the residuals and some computer output for their regression are as follows:

Predictor	Coef	SE Coef	T	P
Constant	123.800	1.798	68.84	0.000
Years	12.6333	0.3778	33.44	0.000

S = 2.926 R–Sq = 99.4% R–Sq(adj) = 99.3%

Analysis of Variance

Source	DF	SS	MS	F	P
Regression	1	9576.1	9576.1	1118.45	0.000
Residual Error	7	59.9	8.6
Total	8	9636.0

Simply using a calculator to find a regression line is not enough; you must understand it (for example, be able to interpret the slope and intercepts in context).

a. Does it appear that a line is an appropriate model for the data? Explain.
Answer: Yes. R–Sq = 99.4% is large, and the residual plot shows no pattern. Thus a linear model is appropriate.

b. What is the equation of the regression line (in context)?
Answer: Predicted # of players = 123.8 + 12.6 (years since 1995)

c. Interpret the slope of the regression line in the context of the problem.
Answer: The slope of the regression line is 12.6, indicating that, on average, the predicted number of children playing Little League baseball in Ithaca increased by 12 or 13 players per year during the 1995–2003 time period.

d. Interpret the y-intercept of the regression line in the context of the problem.
Answer: The y-intercept, 123.8, refers to the year 1995. Thus the number of players in Little League in Ithaca in 1995 was predicted to be around 124.

e. What is the predicted number of players in 1997?
Answer: For 1997, x = 2, so the predicted number of players is 12.6(2) + 123.8 = 149.

f. What was the actual number of players in 1997?
Answer: The residual for 1997 (x = 2) from the residual plot is +5, so actual – predicted = 5, and thus the actual number of players in 1997 must have been 5 + 149 = 154.

g. What years, if any, did the number of players decrease from the previous year? Explain.
Answer: The number would decrease if one residual were more than 12.6 greater than the next residual. This never happens, so the number of players never decreased.

In a scatterplot, regression outliers are indicated by points falling far away from the overall pattern. That is, a point is an outlier if its residual is an outlier in the set of residuals.

A scatterplot of grade point average (GPA) versus weekly television time for a group of high school seniors is as follows:

By direct observation of the scatterplot, we note that there are two outliers: one person who watches 5 hours of television weekly yet has only a 1.5 GPA, and another person who watches 25 hours weekly yet has a 3.0 GPA. Note also that while the value of 30 weekly hours of television may be considered an outlier for the television hours variable and the 0.5 GPA may be considered an outlier for the GPA variable, the point (30, 0.5) is not an outlier in the regression context because it does not fall off the straight-line pattern.

Scores whose removal would sharply change the regression line are called influential scores. Sometimes this description is restricted to points with extreme x-values. An influential score may have a small residual but still have a greater effect on the regression line than scores with possibly larger residuals but average x-values.

The heavy line in the scatterplot on the left below shows what happens when point A is removed, and the heavy line in the scatterplot on the right below shows what happens when point B is removed.

Note that the regression line is greatly affected by the removal of point A but not by the removal of point B. Thus point A is an influential score, while point B is not. This is true in spite of the fact that point A is closer to the original regression line than point B.

Often a straight-line pattern is not the best model for depicting a relationship between two variables. A clear indication of this problem is when the scatterplot shows a distinctive curved pattern. Another indication is when the residuals show a distinctive pattern rather than a random scattering. In such a case, the nonlinear model can sometimes be revealed by transforming one or both of the variables and then noting a linear relationship. Useful transformations often result from using the log or ln buttons on your calculator to create new variables.

Year, x:	1950	1960	1970	1980	1990
Population (1000s), y:	50	67	91	122	165

The scatterplot and residual plot indicate that a nonlinear relationship would be an even stronger model.

x:	1950	1960	1970	1980	1990
log y:	1.70	1.83	1.96	2.09	2.22

A regression analysis yields

In context we have:

–23.65 + 0.013(Year). So, for example, the population predicted for the year 2000 would be calculated

0.013(2000) – 23.65 = 2.35, and so Pop = 10^2.35 ≈ 224 thousand, or 224,000. The “linear” equation

0.013x – 23.65 can be re-expressed as

= 10^{0.013x–23.65} [= 10^–23.65 × (10^0.013)^x = 2.2387E–24 × 1.0304^x].

Log y as a linear function of x, log y = ax + b, re-expresses as an exponential:

Log y as a linear function of log x, log y = a log x + b, re-expresses as a power:

Note: Although you need to be able to recognize the need for a transformation, justify its appropriateness (residuals plot), use whichever is appropriate from above to create a linear model, and use the model to make predictions, you do not have to be able to re-express the linear equation in the manner shown above.

A linear fit to x and log y gives log y = 0.279x + 1.139 with r = 0.98. This results in an exponential relationship:

A linear fit to log x and log y gives log y = 1.639 log x + 1.295, also with r = 0.99. This results in a power relationship:

All three models give high correlation and are reasonable fits. Further analysis can be done by examining the residual plots:

The first two residual plots have distinct curved patterns. The third residual plot illustrates both a more random pattern and smaller residuals. (One can also create residual plots using the derived curved models, but we choose to restrict our attention to residual plots of the linear models.) Among the above three models, the power model

= 19.72(x^1.639), appears to be best.

A scatterplot gives an immediate indication of the shape (linear or not), strength, and direction (positive or negative) of a possible relationship between two variables.

If the relationship appears roughly linear, then the correlation coefficient, r, is a useful measurement.

The value of r is always between –1 and +1, with positive values indicating positive association and negative values indicating negative association; and values close to –1 or +1 indicating a stronger linear association than values close to 0, which indicate a weaker linear association.

Correlation is not affected by which variable is called x and which y or by changing units.

The best-fitting straight line, called the regression line, minimizes the sum of the squares of the residuals.

A definite pattern in the residual plot indicates that a nonlinear model may fit the data better than the straight regression line.

The coefficient of determination, r², gives the percentage of variation in y that is accounted for by the variation in x.

Influential scores are scores whose removal would sharply change the regression line.

Nonlinear models can sometimes be studied by transforming one or both variables and then noting a linear relationship.

1. A study collects data on average combined SAT scores (math, critical reading, and writing) and percentage of students who took the exam at 100 randomly selected high schools. Following is part of the computer printout for regression:

(A) SAT in the variable column indicates that SAT is the dependent (response) variable.

(C) The y-intercept indicates the mean combined SAT score if percent of students taking the exam has no effect on combined SAT scores.

(D) The R² value indicates that the residual plot does not show a strong pattern.

(E) Schools with lower percentages of students taking the exam tend to have higher average combined SAT scores.

2. A simple random sample of 35 world-ranked chess players provides the following statistics:

Number of hours of study per day: x = 6.2, s_x = 1.3
Yearly winnings: y = $208,000, s_y = $42,000
Correlation r = 0.15

3. A rural college is considering constructing a windmill to generate electricity but is concerned over noise levels. A study is performed measuring noise levels (in decibels) at various distances (in feet) from the campus library, and a least squares regression line is calculated with a correlation of 0.74. Which of the following is a proper and most informative conclusion for an observation with a negative residual?

5. Suppose the correlation is negative. Given two points from the scatterplot, which of the following is possible?

I. The first point has a larger x-value and a smaller y-value than the second point.

II. The first point has a larger x-value and a larger y-value than the second point.

III. The first point has a smaller x-value and a larger y-value than the second point.

Which of the following scatterplots could have resulted in the above residual plot? (The y-axis scales are not the same in the scatterplots as in the residual plot.)

7. Suppose the regression line for a set of data,

= 3x + b, passes through the point (2, 5). If x and y are the sample means of the x- and y-values, respectively, then y =

8. Suppose a study finds that the correlation coefficient relating family income to SAT scores is r = +1. Which of the following are proper conclusions?

9. A study of department chairperson ratings and student ratings of the performance of high school statistics teachers reports a correlation of r = 1.15 between the two ratings. From this information we can conclude that

(D) there is strong association between chairperson and student ratings of teachers, but it would be incorrect to infer causation.

(B) Perfect correlation, that is, when the points lie exactly on a straight line, results in r = 0.

(E) A correlation of 0.75 indicates a relationship that is 3 times as linear as one for which the correlation is only 0.25.

The relationship between winning game proportions when facing the sun and when the sun is on one’s back is analyzed for a random sample of 10 professional players. The computer printout for regression is below:

11. What is the equation of the regression line, where face and back are the winning game proportions when facing the sun and with back to the sun, respectively?

13. For one player, the winning game proportions were 0.55 and 0.59 for facing and back, respectively. What was the associated residual?

III. A definite pattern in the residual plot is an indication that a nonlinear model will show a better fit to the data than the straight regression line.

15. Data are obtained for a group of college freshmen examining their SAT scores (math plus writing plus critical reading) from their senior year of high school and their GPAs during their first year of college. The resulting regression equation is

What percentage of the variation in GPAs can be explained by looking at SAT scores?

16. In a study of whether the structure of the adult human brain changes when a new skill is learned, the gray matter volume of four individuals was measured before and after learning a new cognitive skill. The resulting scatterplot was:

The correlation above is 0. Three researchers each run the experiment on a new subject and each obtain an additional data point:

17. In a study of winning percentage in home games versus average home attendance for professional baseball teams, the resulting regression line is:

What is the residual if a team has a winning percentage of 55% with an average attendance of 34,000?

18. Consider the following scatterplot of midterm and final exam scores for a class of 15 students.

(A) The same number of students scored 100 on the midterm exam as scored 100 on the final exam.

(B) Students who scored higher on the midterm exam tended to score higher on the final exam.

19. If every woman married a man who was exactly 2 inches taller than she, what would the correlation between the heights of married men and women be?

20. Suppose the correlation between two variables is r = 0.23. What will the new correlation be if 0.14 is added to all values of the x-variable, every value of the y-variable is doubled, and the two variables are interchanged?

21. Suppose the correlation between two variables is –0.57. If each of the y-scores is multiplied by –1, which of the following is true about the new scatterplot?

22. Consider the set of points {(2, 5), (3, 7), (4, 9), (5, 12), (10, n)}. What should n be so that the correlation between the x- and y-values is 1?

Which of the following is a true statement about the correlations for the three scatterplots?

24. Consider the three points (2, 11), (3, 17), and (4, 29). Given any straight line, we can calculate the sum of the squares of the three vertical distances from these points to the line. What is the smallest possible value this sum can be?

25. Suppose that the scatterplot of log X and log Y shows a strong positive correlation close to 1. Which of the following is true?

26. Consider n pairs of numbers. Suppose x = 2, s_x = 3, y = 4, and s_y = 5. Of the following, which could be the least squares line?

1. Average home attendance and number of home wins for the 2009–2010 NBA Pacific Division teams were as follows:

	Lakers	Suns	Clippers	Warriors	Kings
Average attendance	18,997	17,648	16,343	18,027	13,254
Home wins	34	32	21	18	18

(a) Does a winning team bring out the fans? Can average attendance be predicted from number of wins? Find the equation of the best-fitting straight line.

2. The shoe sizes and the number of ties owned by ten corporate vice presidents are as follows:

Shoe size, x:	8	9.5	9	11	9	9.5	8.5	9	9	9.5
Number of ties, y:	10	10	8	15	12	13	16	7	12	4

(c) Can we find the best-fitting straight-line approximation to the above data? Does it make sense to use this equation to predict the number of ties owned by a corporate executive who wears size 10 shoes? Explain.

3. Following is a scatterplot of the average life expectancies and per capita incomes (in thousands of dollars) for people in a sample of 50 countries.

(a) Estimate the mean for the set of 50 life expectancies and for the set of 50 per capita incomes.

(b) Estimate the standard deviation for the set of life expectancies and for the set of per capita incomes. Explain your reasoning.

(c) Does the scatterplot show a correlation between per capita income and life expectancy? Is it positive or negative? Is it weak or strong?

(b) Find the correlation for the set consisting of the 12 scores from A, B, and C.

5. An outlier can have a striking effect on the correlation r. For example, comment on the following three scatterplots:

6. Fuel economy y (in miles per gallon) is tabulated for various speeds x (in miles per hour) for a certain car model. A linear regression model gives Predicted fuel economy = 34.8 – 0.16 (Speed) with the following residual plot:

A quadratic regression model gives

= –0.0032x² + 0.26x + 23.8 with the following residual plot:

7. The following scatterplot shows the grades for research papers for a sociology professor’s class plotted against the lengths of the papers (in pages).

Mary turned in her paper late and was told by the professor that her grade would have been higher if she had turned it in on time. A computer printout fitting a straight line to the data (not including Mary’s score) by the method of least squares gives

(a) Find the correlation coefficient for the relationship between grade and length of paper based on these data (excluding Mary’s paper).

(d) How will the slope of the regression line change if Mary’s paper is included? Explain your answer.

(e) What grade did Mary receive? Predict what she would have received if her paper had been on time.

8. Data show a trend in winning long jump distances for an international competition over the years 1972–92. With jumps recorded in inches and dates in years since 1900, a least squares regression line is fit to the data. The computer output and a graph of the residuals are as follows:

(b) What is the slope of the least squares line? Give an interpretation of the slope.

9. A scatterplot of the number of accidents per day on a particular interstate highway during a 30-day month is as follows:

(d) Name a feature clearly shown by the histogram and boxplot but not as obvious in the scatterplot.

10. The following scatterplot shows the pulse rate drop (in beats per minute) plotted against the amount of medication (in grams) of an experimental drug being field-tested in several hospitals.

A computer printout showing the results of fitting a straight line to the data by the method of least squares gives

(a) Find the correlation coefficient for the relationship between pulse rate drop and grams of medication.

(d) A patient given 5 grams of medication has his pulse rate drop to zero. Does this invalidate the regression equation? Explain.

(e) How will the size of the correlation coefficient change if the 3-gram result is removed from the data set? Explain.

(f) How will the size of the slope of the least squares regression line change if the 3-gram result is removed from the data set? Explain.

(A)	I: –0.33	II: 0	III: 0.33
(B)	I: 0	II: 0.33	III: 0.64
(C)	I: 0	II: 0.33	III: 1.0
(D)	I: –0.33	II: 0	III: 1.0
(E)	I: 0	II: 0.50	III: 1.0