Chapter 2: Visualizations of Data

Histograms and Frequency Distributions

Learning Objectives

Read and make frequency tables for a data set.
Identify and translate data sets to and from a histogram, a relative frequency histogram, and a frequency polygon.
Identify histogram distribution shapes as skewed or symmetric and understand the basic implications of these shapes.
Identify and translate data sets to and from an ogive plot (cumulative distribution function).

Introduction

Charts and graphs of various types, when created carefully, can convey important information about a data set at a glance, without calculating, or even having knowledge of, various statistical measures. This chapter will concentrate on some of the more common visual presentations of data.

Frequency Tables

The earth has seemed so large in scope for thousands of years that it is only recently that many people have begun to take seriously the idea that we live on a planet of limited and dwindling resources. This is something that residents of the Galapagos Islands are also beginning to understand. Because of its isolation and lack of resources to support large, modernized populations of humans, the problems that we face on a global level are magnified in the Galapagos. Basic human resources such as water, food, fuel, and building materials must all be brought in to the islands. More problematically, the waste products must either be disposed of in the islands, or shipped somewhere else at a prohibitive cost. As the human population grows exponentially, the Islands are confronted with the problem of what to do with all the waste. In most communities in the United States, it is easy for many to put out the trash on the street corner each week and perhaps never worry about where that trash is going. In the Galapagos, the desire to protect the fragile ecosystem from the impacts of human waste is more urgent and is resulting in a new focus on renewing, reducing, and reusing materials as much as possible. There have been recent positive efforts to encourage recycling programs.

Figure 2.1

The Recycling Center on Santa Cruz in the Galapagos turns all the recycled glass into pavers that are used for the streets in Puerto Ayora.

It is not easy to bury tons of trash in solid volcanic rock. The sooner we realize that we are in the same position of limited space and that we have a need to preserve our global ecosystem, the more chance we have to save not only the uniqueness of the Galapagos Islands, but that of our own communities. All of the information in this chapter is focused around the issues and consequences of our recycling habits, or lack thereof!

Example: Water, Water, Everywhere!

Bottled water consumption worldwide has grown, and continues to grow, at a phenomenal rate. According to the Earth Policy Institute, 154 billion liters were consumed in 2004. While there are places in the world where safe water supplies are unavailable, most of the growth in consumption has been due to other reasons. The largest consumer of bottled water is the United States, which arguably could be the country with the best access to safe, convenient, and reliable sources of tap water. The large volume of toxic waste that is generated by the plastic bottles and the small fraction of the plastic that is recycled create a considerable environmental hazard. In addition, huge volumes of carbon emissions are created when these bottles are manufactured using oil and transported great distances by oil-burning vehicles.

Example: Take an informal poll of your class. Ask each member of the class, on average, how many beverage bottles they use in a week. Once you collect this data, the first step is to organize it so it is easier to understand. A frequency table is a common starting point. Frequency tables simply display each value of the variable, and the number of occurrences (the frequency) of each of those values. In this example, the variable is the number of plastic beverage bottles of water consumed each week.

Consider the following raw data:

6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6, 1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3

Here are the correct frequencies using the imaginary data presented above:

Figure: Imaginary Class Data on Water Bottle Usage

Table 2.1

Completed Frequency Table for Water Bottle Data
Number of Plastic Beverage Bottles per Week	Frequency
1	1
2	1
3	3
4	4
5	6
6	8
7	7
8	2

When creating a frequency table, it is often helpful to use tally marks as a running total to avoid missing a value or over-representing another.

Table 2.2

Frequency table using tally marks
Number of Plastic Beverage Bottles per Week	Tally	Frequency
1	${\color{red} \| }$	1
2	${\color{red} \| }$	1
3	${\color{red} \| \| \| }$	3
4	${\color{red} \| \| \| \| }$	4
5	${\color{red} \bcancel{ \| \| \| \| } \ \| }$	6
6	${\color{red} \bcancel{ \| \| \| \| } \ \| \| \| }$	8
7	${\color{red} \bcancel{ \| \| \| \| } \ \| \| }$	7
8	${\color{red} \| \| }$	2
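The counting can also be automated. The following is a minimal Python sketch, assuming the raw data above have been typed into a list; the names bottles and freq are illustrative only and are not part of the lesson.

```python
from collections import Counter

# Raw class data from the example above (bottles used per week).
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

# Counter maps each distinct value to its number of occurrences.
freq = Counter(bottles)

print("Bottles\tTally\tFrequency")
for value in sorted(freq):
    # One "|" per occurrence plays the role of a tally mark.
    print(f"{value}\t{'|' * freq[value]}\t{freq[value]}")
```

A quick check that the frequencies add up to the number of students polled (32 here) helps catch a missed or double-counted value, which is the same purpose the tally marks serve by hand.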

The following data set shows the countries in the world that consume the most bottled water per person per year.

Table 2.3

Country	Liters of Bottled Water Consumed per Person per Year
Italy	183.6
Mexico	168.5
United Arab Emirates	163.5
Belgium and Luxembourg	148.0
France	141.6
Spain	136.7
Germany	124.9
Lebanon	101.4
Switzerland	99.6
Cyprus	92.0
United States	90.5
Saudi Arabia	87.8
Czech Republic	87.1
Austria	82.1
Portugal	80.3

Figure: Bottled Water Consumption per Person in Leading Countries in 2004. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm

These data values have been measured at the ratio level. There is some flexibility required in order to create meaningful and useful categories for a frequency table. The values range from 80.3 liters to 183.6 liters. By examining the data, it seems appropriate for us to create our frequency table in classes of width 10 liters. We will skip the tally marks in this case, because the data values are already in numerical order, and it is easy to see how many are in each classification.

A bracket, '[' or ']', indicates that the endpoint of the interval is included in the class. A parenthesis, '(' or ')', indicates that the endpoint is not included. It is common practice in statistics to place a value that falls on the border between two classes in the higher of the two classes. For example, $[80-90)$ means this class includes every value from 80 up to, but not including, 90. The value 90 is included in the next class, $[90-100)$.

Table 2.4

Liters per Person	Frequency
$[80-90)$	4
$[90-100)$	3
$[100-110)$	1
$[110-120)$	0
$[120-130)$	1
$[130-140)$	1
$[140-150)$	2
$[150-160)$	0
$[160-170)$	2
$[170-180)$	0
$[180-190)$	1

Figure: Completed Frequency Table for World Bottled Water Consumption Data (2004)
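The grouping into half-open classes can be sketched in Python as follows. This is only one possible approach, assuming a class width of 10 as chosen in the text; the names liters, width, and counts are illustrative.

```python
from collections import Counter

# Liters per person per year (2004), from the table above.
liters = [183.6, 168.5, 163.5, 148.0, 141.6, 136.7, 124.9, 101.4,
          99.6, 92.0, 90.5, 87.8, 87.1, 82.1, 80.3]

width = 10  # class width chosen in the text

# Floor division puts a value x in the class [lower, lower + width),
# so a border value such as 90.0 would land in [90, 100), not [80, 90).
counts = Counter(int(x // width) * width for x in liters)

# A Counter returns 0 for classes that contain no data.
for lower in range(80, 190, width):
    print(f"[{lower}-{lower + width})\t{counts[lower]}")
```

Listing every class in the range, including the empty ones, keeps the gaps in the data visible.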

Histograms

Once you can create a frequency table, you are ready to create your first graphical representation, called a histogram. Let's revisit our data about student bottled beverage habits.

Table 2.5

Completed Frequency Table for Water Bottle Data
Number of Plastic Beverage Bottles per Week	Frequency
1	1
2	1
3	3
4	4
5	6
6	8
7	7
8	2

Here is the same data in a histogram:

In this case, the horizontal axis represents the variable (number of plastic bottles of water consumed), and the vertical axis is the frequency, or count. Each vertical bar represents the number of people in each class of ranges of bottles. For example, in the range of consuming $[1 -2)$ bottles, there is only one person, so the height of the bar is at 1. We can see from the graph that the most common class of bottles used by people each week is the $[6-7)$ range, or six bottles per week.

A histogram is for numerical data. With histograms, the different sections are referred to as bins. Think of a column, or bin, as a vertical container that collects all the data for that range of values. If a value occurs on the border between two bins, it is commonly agreed that this value will go in the larger class, or the bin to the right. It is important when drawing a histogram to be certain that there are enough bins so that the last data value is included. Often this means you have to extend the horizontal axis beyond the value of the last data point. In this example, if we had stopped the graph at 8, we would have missed that data, because the 8's actually appear in the bin between 8 and 9. Very often, when you see histograms in newspapers, magazines, or online, they may instead label the midpoint of each bin. Some graphing software will also label the midpoint of each bin, unless you specify otherwise.
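The bin convention described above can be illustrated with a small Python sketch. It assumes bins of width 1 with edges at 1, 2, ..., 9 and draws a rough text version of the histogram; the names edges, counts, and covered are illustrative.

```python
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

edges = list(range(1, 10))  # 1, 2, ..., 9: eight bins of width 1

# counts[i] is the number of values in the half-open bin [edges[i], edges[i+1]).
counts = [sum(lo <= x < hi for x in bottles)
          for lo, hi in zip(edges, edges[1:])]

for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"[{lo}-{hi})  {'#' * n}")

# If the axis stopped at 8, the last bin would be [7, 8) and the 8's
# would have nowhere to go.
short_edges = list(range(1, 9))
covered = sum(short_edges[0] <= x < short_edges[-1] for x in bottles)
print(covered, "of", len(bottles), "values fit when the axis stops at 8")
```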

On the Web

http://illuminations.nctm.org/ActivityDetail.aspx?ID=78 Here you can change the bin width and explore how it affects the shape of the histogram.

Relative Frequency Histogram

A relative frequency histogram is just like a regular histogram, but instead of labeling the frequencies on the vertical axis, we use the percentage of the total data that is present in that bin. For example, there is only one data value in the first bin. This represents $\frac{1}{32}$ , or approximately 3%, of the total data. Thus, the vertical bar for the bin extends upward to 3%.
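As a rough illustration, the relative frequencies for the bottle data can be computed from the frequency table. The dictionary below simply restates that table; the names freq and rel_freq are illustrative.

```python
# Frequency table for the bottle data: value -> frequency.
freq = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}
total = sum(freq.values())  # 32 students

# Relative frequency = frequency / total, expressed as a percentage.
rel_freq = {value: 100 * count / total for value, count in freq.items()}

for value, pct in rel_freq.items():
    print(f"{value}\t{pct:.1f}%")
```

The bar heights change from counts to percentages, but the shape of the histogram stays the same, because every frequency is divided by the same total.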

Frequency Polygons

A frequency polygon is similar to a histogram, but instead of using bins, a polygon is created by plotting the frequencies and connecting those points with a series of line segments.

To create a frequency polygon for the bottle data, we first find the midpoints of each classification, plot a point at the frequency for each bin at the midpoint, and then connect the points with line segments. To make a polygon with the horizontal axis, plot the midpoint for the class one greater than the maximum for the data, and one less than the minimum.
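The points of the frequency polygon can be listed with a short Python sketch, assuming bins of width 1 as in the histogram; the names points and polygon are illustrative.

```python
freq = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}
width = 1  # each bin is [value, value + 1)

# One point per bin, placed at the bin's midpoint.
points = [(value + width / 2, count) for value, count in sorted(freq.items())]

# Close the polygon: zero-frequency points at the midpoints of the
# empty bins just below the minimum and just above the maximum.
first_mid, last_mid = points[0][0], points[-1][0]
polygon = [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

for x, y in polygon:
    print(x, y)
```

Connecting these points in order with line segments gives the polygon; the two added endpoints are what bring the outline back down to the horizontal axis.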

Here is a frequency polygon constructed directly from the previously-shown histogram:

Here is the frequency polygon in finished form:

Frequency polygons are helpful in showing the general overall shape of a distribution of data. They can also be useful for comparing two sets of data. Imagine how confusing two histograms would look graphed on top of each other!

Example: It would be interesting to compare bottled water consumption in two different years. Two frequency polygons would help give an overall picture of how the years are similar, and how they are different. In the following graph, two frequency polygons, one representing 1999, and the other representing 2004, are overlaid. 1999 is in red, and 2004 is in green.

It appears there was a shift to the right in all the data, which is explained by realizing that all of the countries have significantly increased their consumption. The first peak in the lower-consuming countries is almost identical in the two frequency polygons, but it increased by 20 liters per person in 2004. In 1999, there was a middle peak, but that group shifted significantly to the right in 2004 (by between 40 and 60 liters per person). The frequency polygon is the first type of graph we have learned about that makes this type of comparison easier.

Cumulative Frequency Histograms and Ogive Plots

Very often, it is helpful to know how the data accumulate over the range of the distribution. To do this, we will add to our frequency table by including the cumulative frequency, which is how many of the data points are in all the classes up to and including a particular class.

Table 2.6

Number of Plastic Beverage Bottles per Week	Frequency	Cumulative Frequency
1	1	1
2	1	2
3	3	5
4	4	9
5	6	15
6	8	23
7	7	30
8	2	32

Figure: Cumulative Frequency Table for Bottle Data

For example, the cumulative frequency for 5 bottles per week is 15, because 15 students consumed 5 or fewer bottles per week. Notice that the cumulative frequency for the last class is the same as the total number of students in the data. This should always be the case.

If we drew a histogram of the cumulative frequencies, or a cumulative frequency histogram, it would look as follows:

A relative cumulative frequency histogram would be the same, except that the vertical bars would represent the relative cumulative frequencies of the data:

Table 2.7

Number of Plastic Beverage Bottles per Week	Frequency	Cumulative Frequency	Relative Cumulative Frequency (%)
1	1	1	3.1
2	1	2	6.3
3	3	5	15.6
4	4	9	28.1
5	6	15	46.9
6	8	23	71.9
7	7	30	93.8
8	2	32	100

Figure: Relative Cumulative Frequency Table for Bottle Data
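The two cumulative columns can be reproduced with a running total. The sketch below is one way to do it in Python; the helper round_half_up is a hypothetical name introduced here so that exact halves round up, as they do in the table.

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6, 7, 8]
freq = [1, 1, 3, 4, 6, 8, 7, 2]

# Running total of the frequencies.
cum_freq = list(accumulate(freq))
total = cum_freq[-1]  # the last cumulative frequency equals the data count


def round_half_up(x, places=1):
    """Round a non-negative number with halves rounded up, as in the table.

    Python's built-in round() sends exact halves to the nearest even digit,
    so 6.25 would become 6.2 rather than the 6.3 shown in the table.
    """
    scale = 10 ** places
    return int(x * scale + 0.5) / scale


rel_cum = [round_half_up(100 * c / total) for c in cum_freq]

for row in zip(values, freq, cum_freq, rel_cum):
    print(*row, sep="\t")
```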

Remembering what we did with the frequency polygon, we can remove the bins to create a new type of plot. In the frequency polygon, we connected the midpoints of the bins. In a relative cumulative frequency plot, we use the point on the right side of each bin.

The reason for this should make a lot of sense: when we read this plot, each point should represent the percentage of the total data that is less than or equal to a particular value, just like in the frequency table. For example, the point that is plotted at 4 corresponds to 15.6%, because that is the percentage of the data that is less than or equal to 3. It does not include the 4's, because they are in the bin to the right of that point. This is why we plot a point at 1 on the horizontal axis and at 0% on the vertical axis. None of the data is lower than 1, and similarly, all of the data is below 9. Here is the final version of the plot:

This plot is commonly referred to as an ogive plot. The name ogive comes from a particular pointed arch originally present in Arabic architecture and later incorporated in Gothic cathedrals. Here is a picture of a cathedral in Ecuador with a close-up of an ogive-type arch:

If a distribution is symmetric and mound shaped, then its ogive plot will look just like the shape of one half of such an arch.
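The coordinates of the ogive points described in this section can be generated as follows. This is a sketch under the same assumption of width-1 bins, with each point placed at the right edge of its bin; the name ogive is illustrative.

```python
from itertools import accumulate

freq = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}
total = sum(freq.values())

# Start at the left edge of the first bin with 0% accumulated.
ogive = [(min(freq), 0.0)]

# Each later point sits at the RIGHT edge of a bin [value, value + 1).
cumulative = accumulate(count for _, count in sorted(freq.items()))
for value, running in zip(sorted(freq), cumulative):
    ogive.append((value + 1, 100 * running / total))

for x, pct in ogive:
    print(f"x = {x}: {pct}% of the data lie below {x}")
```

The point at 4 carries 15.625%, which matches the 15.6% discussed in the text, and the final point at 9 reaches 100%.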

Shape, Center, Spread

In the first chapter, we introduced measures of center and spread as important descriptors of a data set. The shape of a distribution of data is very important as well. Shape, center, and spread should always be your starting point when describing a data set.

Referring to our imaginary student poll on using plastic beverage containers, we notice that the data are spread out from 1 to 8 bottles per week. The graph for the data illustrates this concept, and the range quantifies it. Look back at the graph and notice that there is a large concentration of students in the 5, 6, and 7 region. This would lead us to believe that the center of this data set is somewhere in this area. We use the mean and/or median to measure central tendency, but it is also important that you see that the center of the distribution is near the large concentration of data. This is done with shape.

Shape is harder to describe with a single statistical measure, so we will describe it in less quantitative terms. A very important feature of this data set, as well as many that you will encounter, is that it has a single large concentration of data that appears like a mountain. A data set that is shaped in this way is typically referred to as mound-shaped. Mound-shaped data will usually look like one of the following three pictures:

Think of these graphs as frequency polygons that have been smoothed into curves. In statistics, we refer to these graphs as density curves. The most important feature of a density curve is symmetry. The first density curve above is symmetric and mound-shaped. Notice the second curve is mound-shaped, but the center of the data is concentrated on the left side of the distribution. The right side of the data is spread out across a wider area. This type of distribution is referred to as skewed right. It is the direction of the long, spread-out section of data, called the tail, that determines the direction of the skewing. For example, in the third curve, the left tail of the distribution is stretched out, so this distribution is skewed left. Our student bottle data set has this skewed-left shape.
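One informal numerical check on the direction of skew is to compare the mean and the median from Chapter 1. The sketch below applies this rule of thumb to the bottle data. It is a heuristic only, not a formal test, and the graph should remain the main guide; the variable names are illustrative.

```python
from statistics import mean, median

bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

m, med = mean(bottles), median(bottles)

# Rough rule of thumb (not a formal test): a long left tail tends to pull
# the mean below the median; a long right tail tends to pull it above.
if m < med:
    shape = "skewed left"
elif m > med:
    shape = "skewed right"
else:
    shape = "roughly symmetric"

print(f"mean = {m:.2f}, median = {med}, which suggests {shape}")
```

For the bottle data the mean falls below the median, which agrees with the skewed-left shape seen in the histogram.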

Lesson Summary

A frequency table is useful to organize data into classes according to the number of occurrences, or frequency, of each class. Relative frequency shows the percentage of data in each class. A histogram is a graphical representation of a frequency table (either actual or relative frequency). A frequency polygon is created by plotting the midpoint of each bin at its frequency and connecting the points with line segments. Frequency polygons are useful for viewing the overall shape of a distribution of data, as well as comparing multiple data sets. For any distribution of data, you should always be able to describe the shape, center, and spread. A data set that is mound shaped can be classified as either symmetric or skewed. Distributions that are skewed left have the bulk of the data concentrated on the higher end of the distribution, and the lower end, or tail, of the distribution is spread out to the left. A skewed-right distribution has a large portion of the data concentrated in the lower values of the variable, with the tail spread out to the right. A relative cumulative frequency plot, or ogive plot, shows how the data accumulate across the different values of the variable.

Points to Consider

What characteristics of a data set make it easier or harder to represent it using frequency tables, histograms, or frequency polygons?
What characteristics of a data set make representing it using frequency tables, histograms, frequency polygons, or ogive plots more or less useful?
What effects does the shape of a data set have on the statistical measures of center and spread?
How do you determine the most appropriate classification to use for a frequency table or the bin width to use for a histogram?

Review Questions

1. Lois was gathering data on the plastic beverage bottle consumption habits of her classmates, but she ran out of time as class was ending. When she arrived home, something had spilled in her backpack and smudged the data for the 2's. Fortunately, none of the other values was affected, and she knew there were 30 total students in the class. Complete her frequency table.

Table 2.8

Number of Plastic Beverage Bottles per Week	Tally	Frequency
1	${\color{red} \| \|}$
2
3	${\color{red} \| \| \|}$
4	${\color{red} \| \| }$
5	${\color{red} \| \| \| }$
6	${\color{red}\bcancel{ \| \| \| \| } \ \| \| }$
7	${\color{red}\bcancel{\| \| \| \| }\ \| }$
8	${\color{red} \| }$

2. The following frequency table contains exactly one data value that is a positive multiple of ten. What must that value be?

Table 2.9

Class	Frequency
$[0 - 5)$	4
$[5 - 10)$	0
$[10 - 15)$	2
$[15 - 20)$	1
$[20 - 25)$	0
$[25 - 30)$	3
$[30 - 35)$	0
$[35 - 40)$	1

(a) 10

(b) 20

(c) 30

(d) 40

(e) There is not enough information to determine the answer.

3. The following table includes the data from the same group of countries from the earlier bottled water consumption example, but is for the year 1999, instead.

Table 2.10

Country	Liters of Bottled Water Consumed per Person per Year
Italy	154.8
Mexico	117.0
United Arab Emirates	109.8
Belgium and Luxembourg	121.9
France	117.3
Spain	101.8
Germany	100.7
Lebanon	67.8
Switzerland	90.1
Cyprus	67.4
United States	63.6
Saudi Arabia	75.3
Czech Republic	62.1
Austria	74.6
Portugal	70.4

Figure: Bottled Water Consumption per Person in Leading Countries in 1999. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm

(a) Create a frequency table for this data set.

(b) Create the histogram for this data set.

4. The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials.

Table 2.11

Manufactured Material	Energy Saved (millions of BTU's per ton)
Aluminum Cans	206
Copper Wire	83
Steel Cans	20
LDPE Plastics (e.g., trash bags)	56
PET Plastics (e.g., beverage bottles)	53
HDPE Plastics (e.g., household cleaner bottles)	51
Personal Computers	43
Carpet	106
Glass	2
Corrugated Cardboard	15
Newspaper	16
Phone Books	11
Magazines	11
Office Paper	10

Amount of energy saved by manufacturing different materials using the maximum percentage of recycled material as opposed to using all new material. Source: National Geographic, January 2008. Volume 213 No.1, pg 82-83.

(a) Complete the frequency table below, including the actual frequency, the relative frequency (round to the nearest tenth of a percent), and the relative cumulative frequency.

(b) Create a relative frequency histogram from your table in part (a).

(d) Create the ogive plot.

(e) Comment on the shape, center, and spread of this distribution as it relates to the original data. (Do not actually calculate any specific statistics).

(f) Add up the relative frequency column. What is the total? What should it be? Why might the total not be what you would expect?

(g) There is a portion of your ogive plot that should be horizontal. Explain what is happening with the data in this area that creates this horizontal section.

(h) What does the steepest part of an ogive plot tell you about the distribution?

On the Web

http://www.earth-policy.org/Updates/2006/Update51_data.htm

http://en.wikipedia.org/wiki/Ogive

Technology Notes: Histograms on the TI-83/84 Graphing Calculator

To draw a histogram on your TI-83/84 graphing calculator, you must first enter the data in a list. In the home screen, press [2ND][{], and then enter the data separated by commas (see the screen below). When all the data have been entered, press [2ND][}][STO], and then press [2ND][L1][ENTER].

Now you are ready to plot the histogram. Press [2ND][STAT PLOT] to enter the STAT-PLOTS menu. You can plot up to three statistical plots at one time. Choose Plot1. Turn the plot on, change the type of plot to a histogram (see sample screen below), and choose L1. Enter '1' for the Freq by pressing [2ND][A-LOCK] to turn off alpha lock, which is normally on in this menu, because most of the time you would want to enter a variable here. An alternative would be to enter the values of the variables in L1 and the frequencies in L2 as we did in Chapter 1.

Finally, we need to set a window. Press [WINDOW] and enter an appropriate window to display the plot. In this case, 'XSCL' is what determines the bin width. Also notice that the maximum $x$ value needs to go up to 9 to show the last bin, even though the data values stop at 8. Enter all of the values shown below.

Press [GRAPH] to display the histogram. If you press [TRACE] and then use the left or right arrows to trace along the graph, notice how the calculator uses the notation to properly represent the values in each bin.

Common Graphs and Data Plots

Learning Objectives

Identify and translate data sets to and from a bar graph and a pie graph.
Identify and translate data sets to and from a dot plot.
Identify and translate data sets to and from a stem-and-leaf plot.
Identify and translate data sets to and from a scatterplot and a line graph.
Identify graph distribution shapes as skewed or symmetric, and understand the basic implication of these shapes.
Compare distributions of univariate data (shape, center, spread, and outliers).

Introduction

In this section, we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single numerical variables, we will also study methods for displaying categorical variables. You will also be introduced to using a scatterplot and a line graph to show the relationship between two variables.

Categorical Variables: Bar Graphs and Pie Graphs

Example: E-Waste and Bar Graphs

We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of that ends up in a landfill, where the chemicals from batteries and other electronics add toxins to the environment. Approximately 80% of the electronics discarded in the United States is also exported to third world countries, where it is disposed of under generally hazardous conditions by unprotected $\text{workers}^1$ . The following table shows the amount of tonnage of the most common types of electronic equipment discarded in the United States in 2005.

Table 2.12

Electronic Equipment	Thousands of Tons Discarded
Cathode Ray Tube (CRT) TV's	7591.1
CRT Monitors	389.8
Printers, Keyboards, Mice	324.9
Desktop Computers	259.5
Laptop Computers	30.8
Projection TV's	132.8
Cell Phones	11.7
LCD Monitors	4.9

Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.1, pg 73.

The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below:

While this looks very similar to a histogram, the bars in a bar graph usually are separated slightly. The graph is just a series of disjoint categories.

Please note that discussions of shape, center, and spread have no meaning for a bar graph, and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!

Pie Graphs

Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, write the percentage as a decimal and multiply it by 360, which is the total number of degrees in a circle. For example, 86.8% corresponds to $0.868 \times 360 \approx 312.5$ degrees.

Here is a table with the percentages and the approximate angle measure of each sector:

Table 2.13

Electronic Equipment	Thousands of Tons Discarded	Percentage of Total Discarded	Angle Measure of Circle Sector
Cathode Ray Tube (CRT) TV's	7591.1	86.8	312.5
CRT Monitors	389.8	4.5	16.2
Printers, Keyboards, Mice	324.9	3.7	13.4
Desktop Computers	259.5	3.0	10.7
Laptop Computers	30.8	0.4	1.3
Projection TV's	132.8	1.5	5.5
Cell Phones	11.7	0.1	0.5
LCD Monitors	4.9	$\sim 0$	0.2
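The percentage and angle columns can be computed with a few lines of Python. This sketch works from the unrounded shares, so a few angles may differ by a tenth or two of a degree from a table built from rounded percentages; the names tons and sectors are illustrative.

```python
# Thousands of tons discarded, from the table above.
tons = {
    "CRT TV's": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TV's": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}
total = sum(tons.values())

# For each category: (percentage of total, sector angle in degrees).
# The angle is the category's share of the total multiplied by 360.
sectors = {name: (100 * amount / total, 360 * amount / total)
           for name, amount in tons.items()}

for name, (percent, angle) in sectors.items():
    print(f"{name:28s}{percent:6.1f}%{angle:8.1f} degrees")
```

Checking that the angles add up to 360 degrees (up to rounding) is a useful sanity check before drawing the sectors.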

And here is the completed pie graph:

Displaying Univariate Data

Dot Plots

A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.

Example: The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.

Table 2.14

Percentage of the paper packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.1, pg 86-87.
Country	% of Paper Packaging Recycled
Estonia	34
New Zealand	40
Poland	40
Cyprus	42
Portugal	56
United States	59
Italy	62
Spain	63
Australia	66
Greece	70
Finland	70
Ireland	70
Netherlands	70
Sweden	76
France	76
Germany	83
Austria	83
Belgium	83
Japan	98

The dot plot for this data would look like this:

Notice that this data set is centered at a manufacturing rate for using recycled materials of between 65 and 70 percent. It is spread from 34% to 98%, and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious if you are creating them by hand with large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.
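For larger data sets, software can do the stacking. The following is a rough text-only sketch in Python using the paper packaging percentages from the table above; dot_plot is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter

# Paper packaging recycling percentages from the table above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70,
         76, 76, 83, 83, 83, 98]


def dot_plot(data):
    """Return the lines of a text dot plot: stacked dots above an axis."""
    counts = Counter(data)
    scale = range(min(data), max(data) + 1)  # one column per whole percent
    # Build rows from the tallest stack down so repeated values pile up.
    rows = ["".join("o" if counts[v] >= level else " " for v in scale).rstrip()
            for level in range(max(counts.values()), 0, -1)]
    # Axis with a "|" at each multiple of 10 to keep the scale evenly spaced.
    axis = "".join("|" if v % 10 == 0 else "-" for v in scale)
    return rows + [axis]


for line in dot_plot(paper):
    print(line)
```

Giving every whole percent its own column, even where no data fall, is what preserves the shape and spread of the distribution.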

Stem-and-Leaf Plots

One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is represented by two digits: the stem and the leaf. In this example, it makes sense to use the ten's digits for the stems and the one's digits for the leaves. The stems are on the left of a dividing line as follows:

Once the stems are decided, the leaves representing the one's digits are listed in numerical order from left to right:

It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:

Note: $5|69$ means 56% and 59% are the two values in the 50's.
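The construction can be sketched in Python by splitting each value into its tens digit (stem) and ones digit (leaf). The data are the paper packaging percentages from the table above; the name stems is illustrative.

```python
from collections import defaultdict

# Paper packaging recycling percentages from the table above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70,
         76, 76, 83, 83, 83, 98]

stems = defaultdict(list)
for value in sorted(paper):
    stem, leaf = divmod(value, 10)  # tens digit, ones digit
    stems[stem].append(leaf)

# Print every stem between the smallest and largest, even an empty one,
# so that any gaps in the data remain visible.
for stem in range(min(stems), max(stems) + 1):
    print(f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}")
```

Because the values are sorted first, the leaves on each stem come out in numerical order from left to right.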

If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot is easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.

Example: Consider the following populations of counties in California.

Butte - 220,748

Calaveras - 45,987

Del Norte - 29,547

Fresno - 942,298

Humboldt - 132,755

Imperial - 179,254

San Francisco - 845,999

Santa Barbara - 431,312

To construct a stem-and-leaf plot, we need to either round or truncate to two digits.

Table 2.15

Value	Value Rounded	Value Truncated
149	15	14
657	66	65
188	19	18

$2|2$ represents $220,000 - 229,999$ when data has been truncated

$2|2$ represents $215,000 - 224,999$ when data has been rounded.
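The difference between the two approaches can be made concrete with a short Python sketch applied to the three sample values in the table above. Both helper functions are hypothetical names written for this illustration and assume positive whole numbers.

```python
def truncate_to_two_digits(n):
    """Keep the two leading digits of a positive integer, dropping the rest."""
    while n >= 100:
        n //= 10
    return n


def round_to_two_digits(n):
    """Round a positive integer to its two leading digits (halves round up)."""
    place = 10 ** (len(str(n)) - 2) if n >= 100 else 1
    return (n + place // 2) // place


for value in (149, 657, 188):
    print(value, round_to_two_digits(value), truncate_to_two_digits(value))
```

Truncating never changes the leading digits, while rounding can push a value into the next leaf, which is why the range represented by a given stem and leaf depends on which method was used.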

If we decide to round the above data, we have:

Butte - 220,000

Calaveras - 46,000

Del Norte - 30,000

Fresno - 940,000

Humboldt - 130,000

Imperial - 180,000

San Francisco - 850,000

Santa Barbara - 430,000

And the stem-and-leaf plot will be as follows:

where:

$2|2$ represents $215,000 - 224,999$ .

Source: California State Association of Counties http://www.counties.org/default.asp?id=399

Back-to-Back Stem Plots

Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.

In the previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:

Table 2.16

Percentage of the glass packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.1, pg 86-87.
Country	% of Glass Packaging Recycled
Cyprus	4
United States	21
Poland	27
Greece	34
Portugal	39
Spain	41
Australia	44
Ireland	56
Italy	56
Finland	56
France	59
Estonia	64
New Zealand	72
Netherlands	76
Germany	81
Austria	86
Japan	96
Belgium	98
Sweden	100

In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.

We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so its effective percentage is actually more than 100.
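A back-to-back stem plot can be sketched in Python by building the leaves for each data set separately and printing them on either side of a shared column of stems. The sketch uses the paper and glass percentages from the two single-variable tables above; placing glass on the left is an arbitrary choice made here, and leaves_by_stem is a hypothetical helper.

```python
from collections import defaultdict

paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70,
         76, 76, 83, 83, 83, 98]
glass = [4, 21, 27, 34, 39, 41, 44, 56, 56, 56, 59, 64, 72,
         76, 81, 86, 96, 98, 100]


def leaves_by_stem(data):
    """Map each stem (value // 10) to its sorted list of ones digits."""
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    return stems


left, right = leaves_by_stem(glass), leaves_by_stem(paper)

# Shared stems cover both data sets; empty stems are kept for spacing.
all_stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)

for stem in all_stems:
    # Left-hand leaves are reversed so they grow outward from the stem.
    left_leaves = "".join(str(d) for d in reversed(left.get(stem, [])))
    right_leaves = "".join(str(d) for d in right.get(stem, []))
    print(f"{left_leaves:>6} | {stem:>2} | {right_leaves}")
```

The glass data need stems from 0 through 10, while the paper data use only 3 through 9, which shows the wider spread of the glass distribution directly in the plot.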

Displaying Bivariate Data

Scatterplots and Line Plots

Bivariate simply means two variables. All our previous work was with univariate, or single-variable data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.

Example: We have looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.

Table 2.17

Country	% of Paper Packaging Recycled	% of Glass Packaging Recycled
Estonia	34	64
New Zealand	40	72
Poland	40	27
Cyprus	42	4
Portugal	56	39
United States	59	21
Italy	62	56
Spain	63	41
Australia	66	44
Greece	70	34
Finland	70	56
Ireland	70	55
Netherlands	70	76
Sweden	70	100
France	76	59
Germany	83	81
Austria	83	44
Belgium	83	98
Japan	98	96

Figure: Paper and Glass Packaging Recycling Rates for 19 countries

Scatterplots

We will place the paper recycling rates on the horizontal axis and those for glass on the vertical axis. Next, we will plot a point that shows each country's rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.

Recall that one of the things you saw from the stem-and-leaf plot is that, in general, a country's recycling rate for glass is lower than its paper recycling rate. On the next graph, we have plotted a line that represents the paper and glass recycling rates being equal. If all the countries had the same paper and glass recycling rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.

With univariate data, we initially characterize a data set by describing its shape, center, and spread. For bivariate data, we will also discuss three important characteristics: shape, direction, and strength. These characteristics will inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a cloud. If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.

Data that are oriented in this manner are said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate with the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.

The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between the glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association, while shorter and wider ones show a weaker linear relationship. In this example, there are some countries for which the glass and paper recycling rates do not seem to be related.

New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass recycling rates, and Austria (circled in green) is an example of a country with a much lower glass recycling rate than its paper recycling rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, weakening the association between the variables.
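The observations about direction and strength can be checked numerically. The sketch below uses the pairs from Table 2.17 to count how many countries fall below the line where the two rates are equal, and it also computes the Pearson correlation coefficient. That coefficient is not defined in this lesson, so treat it here as a preview and as an assumption about how one might quantify strength. The helper name `pearson_r` is hypothetical.

```python
from math import sqrt

# (paper %, glass %) pairs from Table 2.17
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

# Points below the line glass = paper have a glass rate lower than the paper rate.
below = [c for c, (paper, glass) in rates.items() if glass < paper]
above = [c for c, (paper, glass) in rates.items() if glass > paper]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


paper_rates = [p for p, _ in rates.values()]
glass_rates = [g for _, g in rates.values()]
r = pearson_r(paper_rates, glass_rates)

print(f"{len(below)} of {len(rates)} countries lie below the line")
print("Above the line:", ", ".join(above))
print(f"Correlation coefficient: {r:.2f}")
```

A positive coefficient well short of 1 would agree with the description above: the association is positive, but the wide ellipse indicates that it is not strong.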

On the Web

http://tinyurl.com/y8vcm5y Guess the correlation.

Line Plots

Example: The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990's:

Table 2.18

Year	Municipal Waste Generated (Millions of Tons)
1990	269
1991	294
1992	281
1993	292
1994	307
1995	323
1996	327
1997	327
1998	340

Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm

In this example, the time in years is considered the explanatory variable, or independent variable, and the amount of municipal waste is the response variable, or dependent variable. It is not only the passage of time that causes our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. However, it would not make sense to view the relationship between time and municipal waste in the opposite direction.

When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable, and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case, we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data values are changing over a period of time. Here is the line plot for the US Municipal Waste data:

It is easy to see general trends from this type of plot. For example, we can spot the interval in which the most dramatic increase occurred (1990 to 1991) by looking for the steepest line segment. We can also spot the intervals in which the waste output decreased or remained the same (1991 to 1992 and 1996 to 1997). It would be interesting to investigate some possible reasons for the behaviors of these individual years.
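The steepness of each segment in a line plot is just the change from one year to the next. As a minimal sketch, the following Python computes those changes from Table 2.18. The variable names are illustrative, and each segment is labeled by its starting year.

```python
# Municipal waste generated in the US (millions of tons), from Table 2.18.
waste = {
    1990: 269, 1991: 294, 1992: 281, 1993: 292, 1994: 307,
    1995: 323, 1996: 327, 1997: 327, 1998: 340,
}

years = sorted(waste)
# Change from each year to the next; the key is the starting year of the segment.
changes = {y0: waste[y1] - waste[y0] for y0, y1 in zip(years, years[1:])}

steepest_start = max(changes, key=changes.get)
flat_or_falling = [y for y, delta in changes.items() if delta <= 0]

for y, delta in changes.items():
    print(f"{y}-{y + 1}: {delta:+d}")
print("Steepest increase starts in", steepest_start)
print("No increase in the segments starting in", flat_or_falling)
```

The largest change is the 25-million-ton rise that begins in 1990, and the only segments without an increase begin in 1991 and 1996.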

Lesson Summary

Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories. A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. They are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values. Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points. It can be identified as being weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data have a positive association. If the response variable generally decreases as the explanatory variable increases, the data have a negative association. In a line graph, there is significance to the change between consecutive points, so these points are connected. Line graphs are often used when the explanatory variable is time.

Points to Consider

What characteristics of a data set make it easier or harder to represent using dot plots, stem-and-leaf plots, or histograms?
Which plots are most useful to interpret the ideas of shape, center, and spread?
What effects does the shape of a data set have on the statistical measures of center and spread?

Multimedia Links

For a description of how to draw a stem-and-leaf plot, as well as how to derive information from one (14.0), see APUS07, Stem-and-Leaf Plot (8:08).

Review Questions

1. Computer equipment contains many elements and chemicals that are either hazardous or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately 27 kg. Some of the more hazardous substances, like mercury, have been included in the 'other' category, because they occur in relatively small amounts that are still dangerous and toxic.

Table 2.19

Material	Kilograms
Plastics	6.21
Lead	1.71
Aluminum	3.83
Iron	5.54
Copper	2.12
Tin	0.27
Zinc	0.60
Nickel	0.23
Barium	0.05
Other elements and chemicals	6.44

Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm

(a) Create a bar graph for this data.

(b) Complete the chart below to show the approximate percentage of the total weight for each material.

Table 2.20

Material	Kilograms	Approximate Percentage of Total Weight
Plastics	6.21
Lead	1.71
Aluminum	3.83
Iron	5.54
Copper	2.12
Tin	0.27
Zinc	0.60
Nickel	0.23
Barium	0.05
Other elements and chemicals	6.44

2. The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.

Table 2.21

State	Percentage
Alabama	23
Alaska	7
Arizona	18
Arkansas	36
California	30
Colorado	18
Connecticut	23
Delaware	31
District of Columbia	8
Florida	40
Georgia	33
Hawaii	25
Illinois	28
Indiana	23
Iowa	32
Kansas	11
Kentucky	28
Louisiana	14
Maine	41
Maryland	29
Massachusetts	33
Michigan	25
Minnesota	42
Mississippi	13
Missouri	33
Montana	5
Nebraska	27
Nevada	15
New Hampshire	25
New Jersey	45
New Mexico	12
New York	39
North Carolina	26
North Dakota	21
Ohio	19
Oklahoma	12
Oregon	28
Pennsylvania	26
Rhode Island	23
South Carolina	34
South Dakota	42
Tennessee	40
Utah	19
Vermont	30
Virginia	35
Washington	48
West Virginia	20
Wisconsin	36
Wyoming	5

Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm

(a) Create a dot plot for this data.

(b) Discuss the shape, center, and spread of this distribution.

(c) Create a stem-and-leaf plot for this data.

(d) Use your stem-and-leaf plot to find the median percentage for this data.

3. Identify the important features of the shape of each of the following distributions.

Questions 4-7 refer to the following dot plots:

4. Identify the overall shape of each distribution.

5. How would you characterize the center(s) of these distributions?

6. Which of these distributions has the smallest standard deviation?

7. Which of these distributions has the largest standard deviation?

8. In question 2, you looked at the percentage of waste recycled in each state. Do you think there is a relationship between the percentage recycled and the total amount of waste that a state generates? Here are the data, including both variables.

Table 2.22

State	Percentage	Total Amount of Municipal Waste in Thousands of Tons
Alabama	23	5549
Alaska	7	560
Arizona	18	5700
Arkansas	36	4287
California	30	45000
Colorado	18	3084
Connecticut	23	2950
Delaware	31	1189
District of Columbia	8	246
Florida	40	23617
Georgia	33	14645
Hawaii	25	2125
Illinois	28	13386
Indiana	23	7171
Iowa	32	3462
Kansas	11	4250
Kentucky	28	4418
Louisiana	14	3894
Maine	41	1339
Maryland	29	5329
Massachusetts	33	7160
Michigan	25	13500
Minnesota	42	4780
Mississippi	13	2360
Missouri	33	7896
Montana	5	1039
Nebraska	27	2000
Nevada	15	3955
New Hampshire	25	1200
New Jersey	45	8200
New Mexico	12	1400
New York	39	28800
North Carolina	26	9843
North Dakota	21	510
Ohio	19	12339
Oklahoma	12	2500
Oregon	28	3836
Pennsylvania	26	9440
Rhode Island	23	477
South Carolina	34	8361
South Dakota	42	510
Tennessee	40	9496
Utah	19	3760
Vermont	30	600
Virginia	35	9000
Washington	48	6527
West Virginia	20	2000
Wisconsin	36	3622
Wyoming	5	530

(a) Identify the variables in this example, and specify which one is the explanatory variable and which one is the response variable.

(b) How much municipal waste was created in Illinois?

(c) Create a scatterplot for this data.

(d) Describe the direction and strength of the association between the two variables.

9. The following line graph shows the recycling rates of two different types of plastic bottles in the US from 1995 to 2001.

a. Explain the general trends for both types of plastics over these years.

b. What was the total change in PET bottle recycling from 1995 to 2001?

c. Can you think of a reason to explain this change?

d. During what years was this change the most rapid?

References

National Geographic, January 2008. Volume 213 No.1

http://www.etoxics.org/site/PageServer?pagename=svtc_global_ewaste_crisis

http://www.earth-policy.org/Updates/2006/Update51_data.htm

Technology Notes: Scatterplots on the TI-83/84 Graphing Calculator

Press [STAT][ENTER], and enter the following data, with the explanatory variable in L1 and the response variable in L2. Next, press [2ND][STAT-PLOT] to enter the STAT-PLOTS menu, and choose the first plot.

Change the settings to match the following screenshot:

This selects a scatterplot with the explanatory variable in L1 and the response variable in L2. In order to see the points better, you should choose either the square or the plus sign for the mark. The square has been chosen in the screenshot. Finally, set the window as shown below to match the data. In this case, we looked at our lowest and highest data values in each variable and added a bit of room to create a pleasant window. Press [GRAPH] to see the result, which is also shown below.

Line Plots on the TI-83/84 Graphing Calculator

Your graphing calculator will also draw a line plot, and the process is almost identical to that for creating a scatterplot. Enter the data into your lists, and choose a line plot in the Plot1 menu, as in the following screenshot.

Next, set an appropriate window (not necessarily the one shown below), and graph the resulting plot.

Box-and-Whisker Plots

Learning Objectives

Calculate the values of the five-number summary.
Draw and translate data sets to and from a box-and-whisker plot.
Interpret the shape of a box-and-whisker plot.
Compare distributions of univariate data (shape, center, spread, and outliers).
Describe the effects of changing units on summary measures.

Introduction

In this section, the box-and-whisker plot will be introduced, and the basic ideas of shape, center, spread, and outliers will be studied in this context.

The Five-Number Summary

The five-number summary is a numerical description of a data set comprised of the following measures (in order): minimum value, lower quartile, median, upper quartile, maximum value.

Example: The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the water resources available now and the need to protect them in the years to come. Here is a listing of the reservoir capacities of the major water sources for Arizona:

Table 2.23

Lake/Reservoir	% of Capacity
Salt River System	59
Lake Pleasant	49
Verde River System	33
San Carlos	9
Lyman Reservoir	3
Show Low Lake	51
Lake Havasu	98
Lake Mohave	85
Lake Mead	95
Lake Powell	89

Figure: Arizona Reservoir Capacity, 12 / 31 / 98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html

This data set was collected in 1998, and the water levels in many states have taken a dramatic turn for the worse. For example, Lake Powell is currently at less than 50% of capacity$^1$.

Placing the data in order from smallest to largest gives the following:

3, 9, 33, 49, 51, 59, 85, 89, 95, 98

Since there are 10 numbers, the median is the average of 51 and 59, which is 55. Recall that the lower quartile is the $25^{\text{th}}$ percentile, or where 25% of the data is below that value. In this data set, that number is 33. Also, the upper quartile is 89. Therefore, the five-number summary is as shown:

$\left \{3, 33, 55, 89, 98 \right \}$
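The calculation above can be reproduced with a short Python sketch. Quartile conventions differ between textbooks and software, so this is one possible approach rather than a definitive one: it assumes the quartiles are the medians of the lower and upper halves of the ordered data, with the middle value left out of both halves when the number of values is odd. The function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, lower quartile, median, upper quartile, max).

    Quartiles are taken as the medians of the lower and upper halves
    of the sorted data (an assumed convention; others exist).
    """
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]
    upper_half = values[n // 2 + n % 2 :]  # skip the middle value when n is odd
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


# Percent of capacity for the Arizona reservoirs in Table 2.23.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]
print(five_number_summary(arizona))  # (3, 33, 55.0, 89, 98)
```

Other software may report slightly different quartiles for the same data, because several interpolation rules are in common use.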

Box-and-Whisker Plots

A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the 'box' part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values using line segments to form the 'whiskers'. Here is the box plot for this data:

The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as the two sections in the box. In this example, because there are 10 data points, the number of values in each section will only be approximately the same, but about 25% of the data appears in each section. You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data will be symmetric. In this example, the different sections are not exactly the same length. The left whisker is slightly longer than the right, and the right half of the box is slightly longer than the left. We would most likely say that this distribution is moderately symmetric. In other words, there is roughly the same amount of data in each section. The different lengths of the sections tell us how the data are spread in each section. The numbers in the left whisker (lowest 25% of the data) are spread more widely than those in the right whisker.

Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:

In this case, the third quarter of data (between the median and upper quartile), appears to be a bit more densely concentrated in a smaller area. The data values in the lower whisker also appear to be much more widely spread than in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).

Comparing Multiple Box Plots

Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets. In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other, or drawn side-by-side for easy comparison. Here are the two box plots:

The plots seem to be spread the same if we just look at the range, but with the box plots, we have an additional indicator of spread if we examine the length of the box (or interquartile range). This tells us how the middle 50% of the data is spread, and Arizona's data values appear to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona's capacities are lower. Recall that the median is a resistant measure of center, because it is not affected by outliers. The mean is not resistant, because it will be pulled toward outlying points. When a data set is skewed strongly in a particular direction, the mean will be pulled in the direction of the skewing, but the median will not be affected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.

Even though we wouldn't characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.

Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median, due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.

Outliers in Box-and-Whisker Plots

Here are the reservoir data for California (the names of the lakes and reservoirs have been omitted):

80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75

At first glance, the 34 should stand out. It appears as if this point is significantly different from the rest of the data. Let's use a graphing calculator to investigate this plot. Enter your data into a list as we have done before, and then choose a plot. Under 'Type', you will notice what looks like two different box and whisker plots. For now choose the second one (even though it appears on the second line, you must press the right arrow to select these plots).

Setting a window is not as important for a box plot, so we will use the calculator's ability to automatically scale a window to our data by pressing [ZOOM] and selecting '9:Zoom Stat'.

While box plots give us a nice summary of the important features of a distribution, we lose the ability to identify individual points. The left whisker is elongated, but if we did not have the data, we would not know if all the points in that section of the data were spread out, or if it were just the result of the one outlier. It is more typical to use a modified box plot. This box plot will show an outlier as a single, disconnected point and will stop the whisker at the previous point. Go back and change your plot to the first box plot option, which is the modified box plot, and then graph it.

Notice that without the outlier, the distribution is really roughly symmetric.
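The earlier point about resistant measures can be seen in this data as well. The sketch below compares the mean and median of the California values with and without the 34. Dropping the point is done purely for illustration and is not a recommendation to discard data.

```python
from statistics import mean, median

california = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]
without_outlier = [v for v in california if v != 34]

for label, data in [("all 11 values", california),
                    ("without the 34", without_outlier)]:
    print(f"{label:15s} mean = {mean(data):.1f}  median = {median(data)}")
```

With the 34 included, the mean sits several points below the median. Without it, the two measures nearly coincide, and the median itself barely moves.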

This data set had one obvious outlier, but when is a point far enough away to be called an outlier? We need a standard accepted practice for defining an outlier in a box plot. This rather arbitrary definition is that any point that lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile will be considered an outlier. Because the $IQR$ is the same as the length of the box, any point that is more than one-and-a-half box lengths from either quartile is plotted as an outlier.

A common misconception of students is that you stop the whisker at this boundary line. In fact, the last point on the whisker that is not an outlier is where the whisker stops.

The calculations for determining the outlier in this case are as follows:

Lower Quartile: 74

Upper Quartile: 85

Interquartile range ($IQR$): $85 - 74 = 11$

$1.5 * IQR = 16.5$

Cut-off for outliers in left whisker: $74 - 16.5 = 57.5$. Thus, any value less than 57.5 is considered an outlier.

Notice that we did not even bother to test the calculation on the right whisker, because it should be obvious from a quick visual inspection that there are no points that are farther than even one box length away from the upper quartile.
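The same check can be written as a short Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the ordered data, which reproduces the values 74 and 85 used above. The names `lower_fence` and `upper_fence` are illustrative labels for the two cut-offs and are not terms used in the text.

```python
from statistics import median

california = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]

values = sorted(california)
n = len(values)
q1 = median(values[: n // 2])           # lower quartile: 74
q3 = median(values[n // 2 + n % 2 :])   # upper quartile: 85
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # cut-off below the box
upper_fence = q3 + 1.5 * iqr   # cut-off above the box

outliers = [v for v in values if v < lower_fence or v > upper_fence]
inside = [v for v in values if lower_fence <= v <= upper_fence]
# Each whisker stops at the most extreme value that is NOT an outlier.
whisker_ends = (min(inside), max(inside))

print("Cut-offs:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Whiskers end at:", whisker_ends)
```

The left whisker ends at 68, the last data value inside the cut-off, rather than at 57.5. This is the misconception addressed earlier.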

If you press [TRACE] and use the left or right arrows, the calculator will trace the values of the five-number summary, as well as the outlier.

The Effects of Changing Units on Shape, Center, and Spread

In the previous lesson, we looked at data for the materials in a typical desktop computer.

Table 2.24

Material	Kilograms
Plastics	6.21
Lead	1.71
Aluminum	3.83
Iron	5.54
Copper	2.12
Tin	0.27
Zinc	0.60
Nickel	0.23
Barium	0.05
Other elements and chemicals	6.44

Here is the data set given in pounds. The weight of each in kilograms was multiplied by 2.2.

Table 2.25

Material	Pounds
Plastics	13.7
Lead	3.8
Aluminum	8.4
Iron	12.2
Copper	4.7
Tin	0.6
Zinc	1.3
Nickel	0.5
Barium	0.1
Other elements and chemicals	14.2

When all values are multiplied by a factor of 2.2, the calculation of the mean is also multiplied by 2.2, so the center of the distribution would be increased by the same factor. Similarly, calculations of the range, interquartile range, and standard deviation will also be increased by the same factor. In other words, the center and the measures of spread will increase proportionally.

Example: This is easier to think of with numbers. Suppose that your mean is 20, and that two of the data values in your distribution are 21 and 23. If you multiply 21 and 23 by 2, you get 42 and 46, and your mean also changes by a factor of 2 and is now 40. Before your deviations were $21 - 20 = 1$ and $23 - 20 = 3$ , but now, your deviations are $42 - 40 = 2$ and $46 - 40 = 6$ , so your deviations are getting twice as big as well.

This should result in the graph maintaining the same shape, but being stretched out, or elongated. Here are the side-by-side box plots for both distributions showing the effects of changing units.
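To see the proportional change numerically, here is a minimal sketch that multiplies the kilogram values from Table 2.24 by 2.2 and compares the mean, range, and standard deviation before and after. The population standard deviation (`pstdev`) is used as an assumption; the sample version scales in the same way. The pound values are left unrounded, so they differ slightly from the rounded entries in Table 2.25.

```python
from statistics import mean, pstdev

# Kilograms of each material, from Table 2.24.
kilograms = [6.21, 1.71, 3.83, 5.54, 2.12, 0.27, 0.60, 0.23, 0.05, 6.44]
FACTOR = 2.2  # approximate pounds per kilogram, as used in the text
pounds = [FACTOR * kg for kg in kilograms]


def value_range(data):
    """Range of a data set: maximum minus minimum."""
    return max(data) - min(data)


for name, stat in [("mean", mean), ("range", value_range), ("std dev", pstdev)]:
    before, after = stat(kilograms), stat(pounds)
    print(f"{name:8s} kg = {before:.3f}  lb = {after:.3f}  ratio = {after / before:.3f}")
```

Each ratio comes out to 2.2, so the center and the measures of spread are all stretched by the same factor.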

On the Web

http://tinyurl.com/34s6sm Investigate the mean, median and box plots.

http://tinyurl.com/3ao9px More investigation of boxplots.

Lesson Summary

The five-number summary is a useful collection of statistical measures consisting of the following in ascending order: minimum, lower quartile, median, upper quartile, maximum. A box-and-whisker plot is a graphical representation of the five-number summary showing a box bounded by the lower and upper quartiles and the median as a line in the box. The whiskers are line segments extended from the quartiles to the minimum and maximum values. Each whisker and section of the box contains approximately 25% of the data. The width of the box is the interquartile range, or $IQR$ , and shows the spread of the middle 50% of the data. Box-and-whisker plots are effective at giving an overall impression of the shape, center, and spread of a data set. While an outlier is simply a point that is not typical of the rest of the data, there is an accepted definition of an outlier in the context of a box-and-whisker plot. Any point that is more than 1.5 times the length of the box $(IQR)$ from either end of the box is considered to be an outlier. When changing the units of a distribution, the center and spread will be affected, but the shape will stay the same.

Points to Consider

What characteristics of a data set make it easier or harder to represent it using dot plots, stem-and-leaf plots, histograms, and box-and-whisker plots?
Which plots are most useful to interpret the ideas of shape, center, and spread?
What effects do other transformations of the data have on the shape, center, and spread?

Multimedia Links

For a description of how to draw a box-and-whisker plot from given data (14.0), see patrickJMT, Box and Whisker Plot (5:53).

Review Questions

1. Here are the 1998 data on the percentage of capacity of reservoirs in Idaho.

$70, 84, 62, 80, 75, 95, 69, 48, 76, 70, 45, 83, 58, 75, 85, 70, 62, 64, 39, 68, 67, 35, 55, 93, 51, 67, 86, 58, 49, 47, 42, 75$

a. Find the five-number summary for this data set.

b. Show all work to determine if there are true outliers according to the $1.5*IQR$ rule.

c. Create a box-and-whisker plot showing any outliers.

d. Describe the shape, center, and spread of the distribution of reservoir capacities in Idaho in 1998.

e. Based on your answer in part (d), how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.

2. Here are the 1998 data on the percentage of capacity of reservoirs in Utah.

$80, 46, 83, 75, 83, 90, 90, 72, 77, 4, 83, 105, 63, 87, 73, 84, 0, 70, 65, 96, 89, 78, 99, 104, 83, 81$

a. Find the five-number summary for this data set.

b. Show all work to determine if there are true outliers according to the $1.5*IQR$ rule.

c. Create a box-and-whisker plot showing any outliers.

d. Describe the shape, center, and spread of the distribution of reservoir capacities in Utah in 1998.

e. Based on your answer in part (d) how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.

3. Graph the box plots for Idaho and Utah on the same axes. Write a few statements comparing the water levels in Idaho and Utah by discussing the shape, center, and spread of the distributions.

4. If the median of a distribution is less than the mean, which of the following statements is the most correct?

a. The distribution is skewed left.

b. The distribution is skewed right.

c. There are outliers on the left side.

d. There are outliers on the right side.

e. (b) or (d) could be true.

5. The following table contains recent data on the average price of a gallon of gasoline for states that share a border crossing into Canada.

a. Find the five-number summary for this data.

b. Show all work to test for outliers.

c. Graph the box-and-whisker plot for this data.

d. Canadian gasoline is sold in liters. Suppose a Canadian crossed the border into one of these states and wanted to compare the cost of gasoline. There are approximately 4 liters in a gallon. If we were to convert the distribution to liters, describe the resulting shape, center, and spread of the new distribution.

e. Complete the following table. Convert to cost per liter by dividing by 3.7854, and then graph the resulting box plot.

As an interesting extension to this problem, you could look up the current data and compare that distribution with the data presented here. You could also find the exchange rate for Canadian dollars and convert the prices into the other currency.

Table 2.26

State	Average Price of a Gallon of Gasoline (US$)	Average Price of a Liter of Gasoline (US$)
Alaska	3.458
Washington	3.528
Idaho	3.26
Montana	3.22
North Dakota	3.282
Minnesota	3.12
Michigan	3.352
New York	3.393
Vermont	3.252
New Hampshire	3.152
Maine	3.309

Figure: Average prices of a gallon of gasoline on March 16, 2008. Source: AAA, http://www.fuelgaugereport.com/sbsavg.asp

References

$^1$ Kunzig, Robert. Drying of the West. National Geographic, February 2008, Vol. 213, No. 2, Page 94.

http://en.wikipedia.org/wiki/Box_plot

Chapter Review

Part One: Questions

1. Which of the following can be inferred from this histogram?

a. The mode is 1.

b. mean < median

c. median < mean

d. The distribution is skewed left.

e. None of the above can be inferred from this histogram.

2. Sean was given the following relative frequency histogram to read.

Unfortunately, the copier cut off the bin with the highest frequency. Which of the following could possibly be the relative frequency of the cut-off bin?

a. 16

b. 24

c. 32

d. 68

3. Tianna was given a graph for a homework question in her statistics class, but she forgot to label the graph or the axes and couldn’t remember if it was a frequency polygon or an ogive plot. Here is her graph:

Identify which of the two graphs she has and briefly explain why.

In questions 4-7, match the distribution with the choice of the correct real-world situation that best fits the graph.

a. Endy collected and graphed the heights of all the $12^{\text{th}}$ grade students in his high school.

b. Brittany asked each of the students in her statistics class to bring in 20 pennies selected at random from their pocket or piggy bank. She created a plot of the dates of the pennies.

c. Thamar asked her friends what their favorite movie was this year and graphed the results.

d. Jeno bought a large box of doughnut holes at the local pastry shop, weighed each of them, and then plotted their weights to the nearest tenth of a gram.

8. Which of the following box plots matches the histogram?

9. If a data set is roughly symmetric with no skewing or outliers, which of the following would be an appropriate sketch of the shape of the corresponding ogive plot?

10. Which of the following scatterplots shows a strong, negative association?

Part Two: Open-Ended Questions

1. The Burj Dubai will become the world’s tallest building when it is completed. It will be twice the height of the Empire State Building in New York.

Table 2.27

Building	City	Height (ft)
Taipei 101	Taipei	1671
Shanghai World Financial Center	Shanghai	1614
Petronas Tower	Kuala Lumpur	1483
Sears Tower	Chicago	1451
Jin Mao Tower	Shanghai	1380
Two International Finance Center	Hong Kong	1362
CITIC Plaza	Guangzhou	1283
Shun Hing Square	Shenzhen	1260
Empire State Building	New York	1250
Central Plaza	Hong Kong	1227
Bank of China Tower	Hong Kong	1205
Bank of America Tower	New York	1200
Emirates Office Tower	Dubai	1163
Tuntex Sky Tower	Kaohsiung	1140

The chart lists the 15 tallest buildings in the world (as of 12/2007).

(a) Complete the table below, and draw an ogive plot of the resulting data.

Table 2.28

Class	Frequency	Relative Frequency	Cumulative Frequency	Relative Cumulative Frequency

(b) Use your ogive plot to approximate the median height for this data.

(c) Find the $90^{\text{th}}$ percentile for this data (i.e., the height below which 90% of the data fall).
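The columns of a table like Table 2.28 can be filled in mechanically once the classes are chosen. The following is a minimal sketch in Python, not a prescribed solution: it uses the heights exactly as listed in Table 2.27 and assumes classes 100 ft wide starting at 1100 ft, which is only one reasonable choice. The function names `frequency_table` and `ogive_value` are illustrative. The second function reads a value off the ogive by linear interpolation between the plotted points, which mirrors what you do by eye on the graph.

```python
# Heights (ft) exactly as listed in Table 2.27.
heights = [1671, 1614, 1483, 1451, 1380, 1362, 1283,
           1260, 1250, 1227, 1205, 1200, 1163, 1140]


def frequency_table(data, start, width, n_classes):
    """Return rows of (lower, upper, frequency, relative frequency,
    cumulative frequency, relative cumulative frequency).
    Each class includes its lower bound and excludes its upper bound."""
    n = len(data)
    rows = []
    cumulative = 0
    for i in range(n_classes):
        lo = start + i * width
        hi = lo + width
        freq = sum(1 for x in data if lo <= x < hi)
        cumulative += freq
        rows.append((lo, hi, freq, freq / n, cumulative, cumulative / n))
    return rows


def ogive_value(table, p):
    """Approximate the data value at relative cumulative frequency p
    by linear interpolation along the ogive."""
    # The ogive starts at zero at the lower bound of the first class and
    # is plotted at the upper bound of every class after that.
    points = [(table[0][0], 0.0)] + [(hi, rcf) for _, hi, _, _, _, rcf in table]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= p <= y1 and y1 > y0:
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)
    raise ValueError("p must be between 0 and 1")


# Assumed classes: [1100, 1200), [1200, 1300), ..., [1600, 1700).
table = frequency_table(heights, start=1100, width=100, n_classes=6)
for lo, hi, f, rf, cf, rcf in table:
    print(f"[{lo}, {hi})\t{f}\t{rf:.3f}\t{cf}\t{rcf:.3f}")

print("approximate median:", round(ogive_value(table, 0.5), 1))
print("approximate 90th percentile:", round(ogive_value(table, 0.9), 1))
```

A different choice of class width or starting point will change the table and, slightly, the values read from the ogive, so treat these numbers as approximations rather than exact answers.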

2. Recent reports have called attention to an inexplicable collapse of the Chinook Salmon population in western rivers (see http://www.nytimes.com/2008/03/17/science/earth/17salmon.html). The following data tracks the fall salmon population in the Sacramento River from 1971 to 2007.

Table 2.29

Year	Adults	Jacks
1971-1975	164,947	37,409
1976-1980	154,059	29,117
1981-1985	169,034	45,464
1986-1990	182,815	35,021
1991-1995	158,485	28,639
1996	299,590	40,078
1997	342,876	38,352
1998	238,059	31,701
1998	395,942	37,567
1999	416,789	21,994
2000	546,056	33,439
2001	775,499	46,526
2002	521,636	29,806
2003	283,554	67,660
2004	394,007	18,115
2005	267,908	8,048
2006	87,966	1,897

Figure: Total Fall Salmon Escapement in the Sacramento River. Source: http://www.pcouncil.org/newsreleases/Sacto_adult_and_jack_escapement_thru%202007.pdf

During the years from 1971 to 1995, only 5-year averages are available.

In case you are not up on your salmon facts, there are two terms in this chart that may be unfamiliar. Fish escapement refers to the number of fish that escape the hazards of the open ocean and return to their freshwater streams and rivers to spawn. A jack salmon is a fish that returns to spawn before reaching full adulthood.

(a) Create one line graph that shows both the adult and jack populations for these years. The data from 1971 to 1995 represent the five-year averages. Devise an appropriate method for displaying this on your line plot while maintaining consistency.

(b) Write at least two complete sentences that explain what this graph tells you about the change in the salmon population over time.

3. The following data set about Galapagos land area was used in the first chapter.

Table 2.30

Island	Approximate Area (sq. km)
Baltra	8
Darwin	1.1
Española	60
Fernandina	642
Floreana	173
Genovesa	14
Isabela	4640
Marchena	130
North Seymour	1.9
Pinta	60
Pinzón	18
Rabida	4.9
San Cristóbal	558
Santa Cruz	986
Santa Fe	24
Santiago	585
South Plaza	0.13
Wolf	1.3

Figure: Land Area of Major Islands in the Galapagos Archipelago. Source: http://en.wikipedia.org/wiki/Gal%C3%A1pagos_Islands

(a) Choose two methods for representing this data, one categorical and one numerical, and draw a plot using each method.

(b) Write a few sentences commenting on the shape, spread, and center of the distribution in the context of the original data. You may use summary statistics to back up your statements.
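If you want summary statistics to support your description, they can be computed directly from Table 2.30. The sketch below is one possible approach in Python using only the standard library. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data (other conventions give slightly different quartiles), and the function name `five_number_summary` is illustrative.

```python
from statistics import mean, median

# Approximate areas (sq. km) from Table 2.30, in the order listed.
areas = [8, 1.1, 60, 642, 173, 14, 4640, 130, 1.9, 60,
         18, 4.9, 558, 986, 24, 585, 0.13, 1.3]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).
    Q1 and Q3 are the medians of the lower and upper halves; when the
    number of values is odd, the middle value is left out of both halves."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return (s[0], median(lower), median(s), median(upper), s[-1])


summary = five_number_summary(areas)
print("five-number summary:", summary)
print("mean:", round(mean(areas), 1))
```

Comparing the mean with the median is a quick check on the shape: a mean far above the median suggests that a few very large values are pulling the distribution to the right.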

4. Investigation: The National Weather Service maintains a vast array of data on a variety of topics. Go to: http://lwf.ncdc.noaa.gov/oa/climate/online/ccd/snowfall.html. You will find records for the mean snowfall for various cities across the US.

(a) Create a back-to-back stem-and-leaf plot for all the cities located in each of two geographic regions. (Use the simplistic breakdown found at http://library.thinkquest.org/4552/ to classify the states by region.)

(b) Write a few sentences that compare the two distributions, commenting on the shape, spread, and center in the context of the original data. You may use summary statistics to back up your statements.
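A back-to-back stem-and-leaf plot can also be produced with a few lines of code once the data are collected. The sketch below is only an illustration: the two lists are made-up values, not actual snowfall records, and the function name `back_to_back_stem_plot` is hypothetical. It assumes the values have been rounded to non-negative whole numbers, with the tens as stems and the ones as leaves.

```python
from collections import defaultdict


def back_to_back_stem_plot(left, right):
    """Return the lines of a back-to-back stem plot.
    Leaves for `left` grow outward to the left of the stem,
    leaves for `right` grow outward to the right."""
    left_leaves = defaultdict(list)
    right_leaves = defaultdict(list)
    for x in left:
        left_leaves[x // 10].append(x % 10)
    for x in right:
        right_leaves[x // 10].append(x % 10)

    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    lines = []
    for stem in range(lowest, highest + 1):
        # Left leaves are written in descending order so the smallest
        # leaf sits next to the stem.
        l = "".join(str(d) for d in sorted(left_leaves[stem], reverse=True))
        r = "".join(str(d) for d in sorted(right_leaves[stem]))
        lines.append(f"{l:>8} | {stem} | {r}")
    return lines


# Hypothetical data for two regions (placeholder values only).
region_a = [12, 15, 23, 27, 27, 31, 40]
region_b = [5, 8, 14, 19, 22, 36]

plot_lines = back_to_back_stem_plot(region_a, region_b)
for line in plot_lines:
    print(line)
```

Every stem between the smallest and largest is printed, even when it has no leaves, so that gaps in either distribution remain visible.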

Keywords

Back-to-back stem plots

Bar graph

Bias

Bivariate data

Box-and-whisker plot

Cumulative frequency histogram

Density curves

Dot plot

Explanatory variable

Five-number summary

Frequency polygon

Frequency tables

Histogram

Modified box plot

Mound-shaped

Negative linear association

Ogive plot

Pie graph

Positive linear association

Relative cumulative frequency histogram

Relative cumulative frequency plot

Relative frequency histogram

Response variable

Scatterplot

Skewed left

Skewed right

Stem-and-leaf plot

Symmetric

Tail

Number of Plastic Beverage Bottles per Week	Tally	Frequency
1	${\color{red} \| }$	1
2	${\color{red} \| }$	1
3	${\color{red} \| \| \| }$	3
4	${\color{red} \| \| \| \| }$	4
5	${\color{red} \bcancel{ \| \| \| \| } \ \| }$	6
6	${\color{red} \bcancel{ \| \| \| \| } \ \| \| \| }$	8
7	${\color{red} \bcancel{ \| \| \| \| } \ \| \| }$	7
8	${\color{red} \| \| }$	2

Number of Plastic Beverage Bottles per Week	Tally	Frequency
1	${\color{red} \| \|}$
2
3	${\color{red} \| \| \|}$
4	${\color{red} \| \| }$
5	${\color{red} \| \| \| }$
6	${\color{red}\bcancel{ \| \| \| \| } \ \| \| }$
7	${\color{red}\bcancel{\| \| \| \| }\ \| }$
8	${\color{red} \| }$