Descriptive statistics gives information about the center, spread, and shape of a set of data. The relationship between center and spread is very important. Consider two students who each score an 85 on a test. If the range of grades for the test taken by the first student is 82 to 89 while the range of grades for the test taken by the second student is 48 to 87, who did better?
Inferential statistics is the science of making statements about a larger group (the population) from information gathered from a smaller group (the sample). In this chapter, you will look at the normal distribution, a key piece in the study of inferential statistics, and regression, to find equations that relate bivariate data.
The two most popular measures of central tendency are the mean and the median. The mean of a set of data is the quotient of the sum of all the data values and the number of data points. If the data represent the entire population, the result is called a parameter. The population mean is represented by the Greek letter mu, μ. If the data represent a sample of the population, the result is called a statistic. The sample mean is represented by the variable of the data with a bar above it. If the variable is x, the sample mean is represented by $\bar{x}$.
If the data are arranged in increasing or decreasing order, the number in the middle is called the median. If there is an odd number of data points, such as 3, 6, 7, 9, 11, then there will be the same number of data points below the center point (3, 6) as there are above it (9, 11). In this case, the middle point of the ordered data, or the median, is 7. If there is an even number of data points, such as 3, 6, 7, 9, 11, 14, there are the same number of data points in the “lower” grouping (3, 6, 7) as in the “upper” grouping (9, 11, 14). The median of this group is equal to the mean of the two middle numbers, (7 + 9)/2 = 8. Please observe that this median is not one of the original data points.
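As a quick check of the two examples above, the following minimal sketch uses Python's standard `statistics` module. The variable names are chosen here for illustration only.

```python
from statistics import mean, median

odd_data = [3, 6, 7, 9, 11]        # odd number of data points
even_data = [3, 6, 7, 9, 11, 14]   # even number of data points

odd_mean = mean(odd_data)          # sum 36 divided by 5 points
odd_median = median(odd_data)      # the single middle value
even_mean = mean(even_data)        # sum 50 divided by 6 points
even_median = median(even_data)    # mean of the two middle values, 7 and 9

print(odd_mean, odd_median)
print(even_mean, even_median)
```

For the even-sized list the median comes out as 8, a value that does not appear in the data.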
While the common usage of the word average implies the mean, average, as a measure of center, could also be the median.
It should be mentioned here that the median is less impacted by an unusually large or small number than is the mean. Suppose that the total team salary for the Washington Redskins was $1023.4 million rather than $123.4 million. The median for the salaries would remain at $100.55 million while the mean for the data would increase to $128.99 million.
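The same effect can be seen on a small made-up data set. The five values below are hypothetical (they are not the team payroll figures); the sketch only illustrates how one extreme value moves the mean while leaving the median alone.

```python
from statistics import mean, median

# Hypothetical payrolls in millions of dollars (illustrative values only).
payrolls = [90, 95, 100, 105, 110]

# Replace the largest value with an unusually large one.
payrolls_with_extreme = [90, 95, 100, 105, 1010]

mean_before = mean(payrolls)
median_before = median(payrolls)
mean_after = mean(payrolls_with_extreme)
median_after = median(payrolls_with_extreme)

print(mean_before, median_before)   # mean and median agree before the change
print(mean_after, median_after)     # only the mean moves after the change
```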
Determine the mean and median for each of the following questions, using the data provided.
1. A random sample of measurements (in mm) of the diameter of 20 ball bearings is recorded as being:
14.9, 11.4, 11.4, 10.6, 10.2, 13.6, 10.0, 12.1, 11.5, 14.9
10.1, 14.2, 13.1, 11.0, 15.0, 13.6, 11.5, 11.2, 14.8, 14.7
2. Ten measurements are made for the time (to the nearest hundredth of a second) of the period of a pendulum. The measurements are:
10.11, 10.24, 10.86, 10.37, 10.6, 10.61, 10.07, 10.01, 10.01, 10.7
3. The grades of 175 students on a state-wide mathematics test are displayed in the table.
4. A light bulb manufacturer tests a random selection of 457 light bulbs to determine the average life span of the bulb. The amount of time each bulb stayed lit (rounded to the nearest 25 hours) is given in the table.
As was noted in the section on the measures of central tendency, an unusually large or small data value will impact the mean far more than the median of a set of data. To help clarify the nature of the data being summarized, the mean is usually reported with a second number representing how the data are dispersed (spread out). The simplest measure of dispersion is the range of the data. The range is the difference between the maximum and minimum values in the data set. Two more useful measures of dispersion (useful, in that they are used in the branch of statistics called inferential statistics) are the interquartile range (IQR) and the standard deviation.
The IQR represents the spread of the middle 50% of a set of data. The first quartile (Q1) is the 25th percentile and the third quartile (Q3) is the 75th percentile. The IQR is the difference of Q3 and Q1. (Q1 is the median of the data values below the median, while Q3 is the median of the data values above the median.) The IQR is used to determine whether a data value is “unusually” large or small. The rule of thumb is that a data value that is more than 1.5 times the IQR above Q3 is considered unusually large, while a data value that is more than 1.5 times the IQR below Q1 is considered unusually small. Numbers that fit this definition of unusually large or small are called outliers.
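The 1.5 × IQR rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the quartiles are taken as the medians of the lower and upper halves of the sorted data (with the overall median left out of both halves when the number of data points is odd), which is one common convention but not the only one. The function names and the sample data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle value when n is odd so both halves have equal size.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return min(data), median(lower), median(data), median(upper), max(data)

def find_outliers(values):
    """Return the values lying more than 1.5 * IQR beyond Q1 or Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical sample, chosen so that one value is clearly unusual.
sample = [2, 4, 5, 7, 8, 9, 11, 30]
summary = five_number_summary(sample)
flagged = find_outliers(sample)
print(summary)
print(flagged)
```

Here Q1 = 4.5 and Q3 = 10, so the IQR is 5.5 and the upper fence is 10 + 8.25 = 18.25; only the value 30 lies beyond it.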
The box-and-whisker plot is a graphical representation of data which displays the minimum, Q1, median, Q3, and maximum values of the data set. The five numbers are referred to as the five-number summary for the data.
If we include the second-highest-paid salary in the list, Carlos Delgado’s $19,400,000, the five-number summary and box-and-whisker plot would be:
With this change in the maximum value considered, there is a slight difference in the values of the median and third quartile. Also note that Delgado’s salary is so much larger than the previous maximum that it is not connected to the box-and-whisker plot. Delgado’s salary is an outlier for the data. For these data, the IQR is $3,300,000. Multiplied by 1.5, this becomes $4,950,000. Delgado’s salary is $5,100,000 more than the Q3 value of $14,300,000, so it is considered an outlier.
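The arithmetic in this paragraph can be verified directly. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
iqr = 3_300_000        # interquartile range quoted in the text
q3 = 14_300_000        # third quartile quoted in the text
delgado = 19_400_000   # Delgado's salary

step = 1.5 * iqr               # 4,950,000
upper_fence = q3 + step        # 19,250,000
distance_above_q3 = delgado - q3   # 5,100,000

is_outlier = delgado > upper_fence
print(step, upper_fence, distance_above_q3, is_outlier)
```

Because 5,100,000 exceeds 4,950,000, the salary falls beyond the upper fence.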
The number one salary for that year, $22,000,000, was earned by Alex Rodriguez. The impact this number has on the five-number summary and on the box-and-whisker plot is interesting. Delgado’s salary is no longer an outlier.
The other important measure of dispersion is the standard deviation. This measures, roughly, the typical distance between a data point and the mean of the data. We need to be careful because this measure can be used with a sample to make inferences about a population. There is a difference between the method for computing the standard deviation for a sample and the standard deviation for a population. The sample standard deviation, designated by s, is computed by adding the squares of the differences between each data value and the mean, dividing by one less than the number of pieces of data in the set, and then taking the square root of the quotient. Written as a formula, we read
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
The population standard deviation, designated by σ, is computed in almost the same way. The sum of the squares is divided by the number of pieces of data in the data set, rather than by one less than this number. That is,
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$
Why is there a difference between the formula for the sample standard deviation and the formula for the population standard deviation? A more in-depth study of statistics shows that one gets a better prediction of the population from a sample when the sample standard deviation formula with the divisor being n − 1 is used than when the divisor is n. If this has piqued your curiosity, perform an Internet search on degrees of freedom and you may gain a little more insight into an important theory of statistics.
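The two divisors can be compared side by side in a short sketch. The helper names `sample_sd` and `population_sd` are made up for this illustration; Python's standard library offers the same calculations as `statistics.stdev` (divisor n − 1) and `statistics.pstdev` (divisor n), which are used here as a cross-check.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

def sample_sd(values):
    """Square root of the summed squared deviations divided by n - 1."""
    m = mean(values)
    return sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

def population_sd(values):
    """Square root of the summed squared deviations divided by n."""
    m = mean(values)
    return sqrt(sum((x - m) ** 2 for x in values) / len(values))

data = [3, 6, 7, 9, 11]   # the five values from the median example

s = sample_sd(data)
sigma = population_sd(data)
print(s, stdev(data))       # the two sample results should agree
print(sigma, pstdev(data))  # the two population results should agree
```

For the same data, the sample value is always the larger of the two because its divisor is smaller.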
So far, we have two statistical measures for samples and populations that go by the same name and have different symbols.
Given the data provided for each question, compute the values specified.
1. The Pacific salmon (Oncorhynchus spp.) spends part of its life in freshwater and the other in saltwater. A fisheries scientist at the University of Washington was interested in the distance that salmon travel per day while at sea. She radio-tagged 16 randomly selected salmon and determined the miles traveled per day. Her findings are given in the following table. Determine the interquartile range and standard deviation for these data.
2. The top 20 average home game attendance figures for the National Football League in 2010 are listed in the following table.
a. Compute the mean and median for these data.
b. Compute the interquartile range and the standard deviation for these data.
3. The following table displays the number of calories in all McDonald’s sandwiches.
a. Compute the mean and median for these data.
b. Compute the interquartile range and the standard deviation for these data.
4. The following information was gathered about the number of calories from the 15 sandwiches available from Burger King (data from http://www.fastfoodnutrition.org/r-nutrition-facts/Burger%20King-item.html).
Mean: 616 calories; Median: 640 calories; Min.: 340 calories; Q1: 430 calories; Q3: 710 calories; Max.: 1020 calories; Population Standard Deviation: 187.9 calories
Compare these data against the data from the McDonald’s sandwiches with regard to the measures of center and spread.
The probability for a continuous random variable (a variable which is measured rather than counted) is computed by measuring the area under the bell curve. Most students have not yet studied calculus when they first encounter this topic, so the notion of finding the area with a curved boundary (other than a circle) seems a daunting task. You need not worry, as the technology available to you will eliminate this concern. Before we continue with the use of technology, there is a simple problem that you need to understand. The area under the bell curve gives the probability of an event. This means that two points—a left endpoint and a right endpoint—need to be located so that the area between these points can be determined. What if the left endpoint and the right endpoint are the same number? The result will be a line segment from the x-axis to the point on the graph of the bell curve. Since a line segment is a one-dimensional figure, it has no area. This illustrates an important aspect of the continuous random variable—probabilities are computed for an interval (however small the interval) but not for a point. That is, we can calculate the probability that x is between 1.5 and 2.5, between 1.9 and 2.1, between 1.999 and 2.001, but never at x = 2.
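The shrinking intervals can be explored numerically. The passage does not say which bell curve is meant, so this sketch assumes a standard normal distribution (mean 0, standard deviation 1) purely for illustration; `interval_probability` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumption: a standard normal curve, since the text names no specific one.
curve = NormalDist(mu=0, sigma=1)

def interval_probability(dist, left, right):
    """Area under the curve between left and right."""
    return dist.cdf(right) - dist.cdf(left)

intervals = [(1.5, 2.5), (1.9, 2.1), (1.999, 2.001), (2, 2)]
probabilities = [interval_probability(curve, a, b) for a, b in intervals]

for (a, b), p in zip(intervals, probabilities):
    print(f"P({a} < x < {b}) = {p}")
```

The areas shrink as the interval narrows, and the last one, where both endpoints are 2, is exactly zero.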
There are three benchmark probabilities for normal distributions that are considered to be common knowledge and are necessary for students of statistics.
The first benchmark probability is that approximately 68% of the data lies within one standard deviation of the mean.
The second benchmark probability is that approximately 95% of the data lies within two standard deviations of the mean.
The third benchmark probability is that approximately 99.7% of the data lies within three standard deviations of the mean.
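These three benchmarks can be reproduced with a few lines of Python. The function name `within_k_sd` is an illustrative choice; the result does not depend on which mean and standard deviation are used.

```python
from statistics import NormalDist

def within_k_sd(k, mean=0.0, sd=1.0):
    """Probability that a normal value lies within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

benchmarks = {k: round(within_k_sd(k), 4) for k in (1, 2, 3)}
print(benchmarks)   # roughly 0.6827, 0.9545, and 0.9973
```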
Most graphing calculators, spreadsheet programs, or computer programs with CAS capability have built-in functions that can compute probabilities. The TI 83/84 series and the Nspire calculators have a normcdf function that can compute probabilities for all values of the input variable. The parameters for this function are normcdf(lower bound, upper bound, mean, standard deviation).
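For readers working without a calculator, a function with the same argument order can be sketched in Python. This is an imitation written for illustration, not the calculator's own implementation, and the example values (mean 100, standard deviation 15) are hypothetical. A very large number such as 1e99 stands in for an unbounded endpoint.

```python
from statistics import NormalDist

def normcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Hypothetical distribution: mean 100, standard deviation 15.
p_between = normcdf(85, 115, 100, 15)    # within one standard deviation
p_above = normcdf(130, 1e99, 100, 15)    # more than two standard deviations above
p_below = normcdf(-1e99, 100, 100, 15)   # everything below the mean

print(p_between, p_above, p_below)
```

The first result agrees with the one-standard-deviation benchmark of about 68%, and the last is one half, since the curve is symmetric about its mean.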
The heights of the females in the senior class at Central High School are normally distributed with a mean of 61 inches and a standard deviation of 2.5 inches. Use these data to answer questions 1–6.
1. Compute the probability that the height of a randomly selected senior girl is between 61 and 63.5 inches.
2. Compute the probability that the height of a randomly selected senior girl is between 56 and 66 inches.
3. Compute the probability that the height of a randomly selected senior girl is greater than 63.5 inches.
4. Compute the probability that the height of a randomly selected senior girl is less than 66 inches.
5. Compute the probability that the height of a randomly selected senior girl is less than 56 inches or greater than 66 inches.
6. Compute the probability that the height of a randomly selected senior girl is less than 58.5 inches or greater than 66 inches.
A soda dispenser dispenses soda amounts with a normal distribution, a mean of 11.9 fluid ounces, and a standard deviation of 0.35 ounces. A cup of soda is randomly selected from all the cups of soda that have been dispensed. Use this information to answer questions 7–10.
7. What is the probability that the number of fluid ounces dispensed is between 11.7 and 12.2?
8. What is the probability that the number of fluid ounces dispensed is less than 12.1?
9. What is the probability that the number of fluid ounces dispensed is more than 12.4?
10. What is the probability that the number of fluid ounces dispensed is less than 11.65 or more than 12.35?
Waiting time for a teller at the North Side Community Bank on a Friday night is normally distributed with a mean of 3.8 min and a standard deviation of 0.4 min. The bank manager is concerned with customer satisfaction and has initiated a policy of giving a $5 gift card to any customer who has to wait longer than 4.5 min to be served. Use this information to answer question 11.
11. What is the probability that a Friday night customer will receive a $5 gift card from the manager?
Many students have the perception that mathematics is not real because the answers to the equations done in class are always “nice” numbers (integers, terminating decimals, or fractions with reasonable denominators). The reality is that the entire world uses mathematics and the numbers we deal with, as well as the answers we get, are not always “nice.” With the inclusion of technology in the classroom, you have had the opportunity to see more and more real applications of mathematics. It is often the case that data are gathered from some research, and in analyzing the data, a mathematical equation is sought to relate two or more variables so that predictions can be made about future behavior. The process through which the equation is found is called regression.
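To make the idea concrete, here is a minimal sketch of a linear regression computed by the least-squares method, the approach calculators and spreadsheets typically use for a line of best fit. The function name and the five data points are hypothetical and chosen only to illustrate the calculation.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical measurements that are "not nice" but nearly linear.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = linear_regression(xs, ys)
predicted_at_6 = slope * 6 + intercept   # use the equation to predict a new value
print(slope, intercept, predicted_at_6)
```

For these points the fitted line is approximately y = 1.99x + 0.05, which can then be used to predict y for an x value that was not measured.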
In addition to linear regressions, you should also be familiar with exponential regressions (equations of the form $y = a \times b^x$), power regressions (equations of the form $y = a \times x^b$), and logarithmic regressions (equations of the form $y = a + b \ln(x)$).
The power regression returns an r value of 1. The r value, called the Pearson correlation coefficient, gives a measure of the strength of the equation. It will always be the case that −1 ≤ r ≤ 1. The closer |r| is to 1, the better the equation fits the data. This rule only applies to linear, power, exponential, and logarithmic regressions. The reason for this is that each of these curve forms can be linearized through the use of logarithms.
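The linearization idea can be sketched for a power regression: taking logarithms of both sides of y = a × x^b gives ln y = ln a + b ln x, which is a line in the variables ln x and ln y. The sketch below fits that line by least squares and reports r for the transformed data. The function names are illustrative, the data are hypothetical values generated from y = 2x³, and a calculator's built-in routine may differ in its details.

```python
from math import exp, log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def power_regression(xs, ys):
    """Fit y = a * x**b by fitting a line to (ln x, ln y); requires positive data."""
    log_x = [log(x) for x in xs]
    log_y = [log(y) for y in ys]
    n = len(xs)
    mean_lx, mean_ly = sum(log_x) / n, sum(log_y) / n
    sxx = sum((u - mean_lx) ** 2 for u in log_x)
    sxy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y))
    b = sxy / sxx                     # slope of the line is the exponent
    a = exp(mean_ly - b * mean_lx)    # intercept of the line is ln a
    return a, b, pearson_r(log_x, log_y)

# Hypothetical data lying exactly on y = 2 * x**3.
xs = [1, 2, 3, 4, 5]
ys = [2 * x ** 3 for x in xs]

a, b, r = power_regression(xs, ys)
print(a, b, r)   # close to 2, 3, and 1, up to floating-point rounding
```

An exponential regression can be handled the same way by fitting a line to (x, ln y), and a logarithmic regression by fitting a line to (ln x, y).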
Solve the following.
1. Real estate ads for a metropolitan area reveal the following data for the relationship between the number of square feet of living space and the asking price for condominiums.
a. Make a sketch of the price for a condominium in terms of the number of square feet of living space.
b. Determine if there is an exponential relationship between the amount of living area and the price of a condominium.
c. Write an equation for the price of a condominium in terms of the number of square feet of living area.
d. Explain the meaning of the base of the exponential expression in this equation.
e. How much should one expect to pay for a condominium with 1900 sq ft of living area?
2. The data in the following table show the number of grams of fat and the number of calories from fat in McDonald’s sandwiches.
a. Sketch a graph of the number of calories from fat in a McDonald’s sandwich in terms of the number of grams of fat in the sandwich.
b. Determine if there is a linear relationship between the number of calories from fat in a McDonald’s sandwich and the number of grams of fat.
c. Determine the equation for the line of best fit for the number of calories from fat in a McDonald’s sandwich in terms of the number of grams of fat.
d. Explain the meaning of slope for this equation.
e. Predict the number of calories from fat in a McDonald’s sandwich if the sandwich contains 20 grams of fat.
3. The price of Apple stock (in dollars) on the first trading day of August for the years 2001–2005 and 2007–2011 is listed in the following table.
a. Make a scatterplot for the data, measuring time as the number of years since 2000.
b. Determine the type of relationship that appears to be present between the variables.
c. Determine the equation of best fit for the data.
d. Predict the price of the Apple stock in 2006.
4. A model used for explaining radioactive decay produces the following data.
a. Make a scatterplot for the data.
b. Determine the type of relationship that appears to be present between the variables.
c. Determine the equation of best fit for the data.
d. Predict the number of radioactive nuclei after 0.5 sec.
5. The data in the table below represent the stopping distance (in feet) when a vehicle on a wet road and with poor tread on its tires is driving at the given speed (in miles per hour).
a. Make a scatterplot for the data.
b. Determine the type of relationship that appears to be present between the variables.
c. Determine the equation of best fit for the data.
d. Predict the stopping distance on a wet road with tires having poor tread from a speed of 58 miles per hour.
6. An astronomical unit (au) is defined to be the average distance from the center of the sun to the center of the Earth. The following table contains the number of astronomical units the planets in our solar system are from the sun and the number of Earth years it takes for each planet to make a revolution around the sun.
a. Make a scatterplot for the data.
b. Determine the type of relationship that appears to be present between the variables.
c. Determine the equation of best fit for the data.
d. Predict the time needed for the planetoid Ceres, which is 2.7 au from the sun, to make a complete revolution around the sun.
7. The following table gives the wind chill factors when the air temperature is 10°F.
a. Make a scatterplot for the data.
b. Determine the type of relationship that appears to be present between the variables.
c. Determine the equation of best fit for the data.
d. Predict the wind chill at 10°F when the wind is blowing at 35 miles per hour.