Chapter 2: Visualizations of Data

Histograms and Frequency Distributions

Learning Objectives

Introduction

Charts and graphs of various types, when created carefully, can convey important information about a data set at a glance, without calculating, or even having knowledge of, various statistical measures. This chapter will concentrate on some of the more common visual presentations of data.

Frequency Tables

The earth has seemed so large in scope for thousands of years that it is only recently that many people have begun to take seriously the idea that we live on a planet of limited and dwindling resources. This is something that residents of the Galapagos Islands are also beginning to understand. Because of its isolation and lack of resources to support large, modernized populations of humans, the problems that we face on a global level are magnified in the Galapagos. Basic human resources such as water, food, fuel, and building materials must all be brought in to the islands. More problematically, the waste products must either be disposed of in the islands, or shipped somewhere else at a prohibitive cost. As the human population grows exponentially, the Islands are confronted with the problem of what to do with all the waste. In most communities in the United States, it is easy for many to put out the trash on the street corner each week and perhaps never worry about where that trash is going. In the Galapagos, the desire to protect the fragile ecosystem from the impacts of human waste is more urgent and is resulting in a new focus on renewing, reducing, and reusing materials as much as possible. There have been recent positive efforts to encourage recycling programs.

Figure 2.1 

The Recycling Center on Santa Cruz in the Galapagos turns all the recycled glass into pavers that are used for the streets in Puerto Ayora.

It is not easy to bury tons of trash in solid volcanic rock. The sooner we realize that we are in the same position of limited space and that we have a need to preserve our global ecosystem, the more chance we have to save not only the uniqueness of the Galapagos Islands, but that of our own communities. All of the information in this chapter is focused around the issues and consequences of our recycling habits, or lack thereof!

Example: Water, Water, Everywhere!

Bottled water consumption worldwide has grown, and continues to grow, at a phenomenal rate. According to the Earth Policy Institute, 154 billion liters (about 41 billion gallons) were consumed in 2004. While there are places in the world where safe water supplies are unavailable, most of the growth in consumption has been due to other reasons. The largest consumer of bottled water is the United States, which arguably could be the country with the best access to safe, convenient, and reliable sources of tap water. The large volume of toxic waste that is generated by the plastic bottles and the small fraction of the plastic that is recycled create a considerable environmental hazard. In addition, huge volumes of carbon emissions are created when these bottles are manufactured using oil and transported great distances by oil-burning vehicles.

Example: Take an informal poll of your class. Ask each member of the class, on average, how many beverage bottles they use in a week. Once you collect this data, the first step is to organize it so it is easier to understand. A frequency table is a common starting point. Frequency tables simply display each value of the variable, and the number of occurrences (the frequency) of each of those values. In this example, the variable is the number of plastic beverage bottles of water consumed each week.

Consider the following raw data:

6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6, 1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3

Here are the correct frequencies using the imaginary data presented above:

Figure: Imaginary Class Data on Water Bottle Usage

Table 2.1

Completed Frequency Table for Water Bottle Data
Number of Plastic Beverage Bottles per Week Frequency
1 1
2 1
3 3
4 4
5 6
6 8
7 7
8 2

When creating a frequency table, it is often helpful to use tally marks as a running total to avoid missing a value or over-representing another.

Table 2.2

Frequency table using tally marks
Number of Plastic Beverage Bottles per Week Tally Frequency
1 {\color{red} | } 1
2 {\color{red} | } 1
3 {\color{red} | | | } 3
4 {\color{red} | | | | } 4
5 {\color{red} \bcancel{ | | | | } \ | } 6
6 {\color{red} \bcancel{ | | | | } \ | | | } 8
7 {\color{red} \bcancel{ | | | | } \ | | } 7
8 {\color{red} | | } 2
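The counting behind Tables 2.1 and 2.2 can also be sketched in a few lines of Python. This is only an illustration: the names `bottles`, `frequency`, and `tally` are our own, and the string `||||/` is an assumed plain-text stand-in for a crossed-out group of five tally marks.

```python
from collections import Counter

# Raw poll responses copied from the text (bottles used per week).
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

# Counter maps each distinct value to its number of occurrences.
frequency = Counter(bottles)


def tally(count):
    """Render a count as tally marks; each full group of five is '||||/'."""
    groups, rest = divmod(count, 5)
    return " ".join(["||||/"] * groups + (["|" * rest] if rest else []))


# One row per value: the value, its tally marks, and its frequency.
for value in sorted(frequency):
    print(f"{value:>2}  {tally(frequency[value]):<12} {frequency[value]}")
```

Checking that the frequencies add up to the number of students polled (32 here) is a quick way to catch a missed or double-counted value.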

The following data set shows the countries in the world that consume the most bottled water per person per year.

Table 2.3

Country Liters of Bottled Water Consumed per Person per Year
Italy 183.6
Mexico 168.5
United Arab Emirates 163.5
Belgium and Luxembourg 148.0
France 141.6
Spain 136.7
Germany 124.9
Lebanon 101.4
Switzerland 99.6
Cyprus 92.0
United States 90.5
Saudi Arabia 87.8
Czech Republic 87.1
Austria 82.1
Portugal 80.3

Figure: Bottled Water Consumption per Person in Leading Countries in 2004. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm

These data values have been measured at the ratio level. There is some flexibility required in order to create meaningful and useful categories for a frequency table. The values range from 80.3 liters to 183.6 liters. By examining the data, it seems appropriate for us to create our frequency table in groups of 10. We will skip the tally marks in this case, because the data values are already in numerical order, and it is easy to see how many are in each classification.

A bracket, '[' or ']', indicates that the endpoint of the interval is included in the class. A parenthesis, '(' or ')', indicates that the endpoint is not included. It is common practice in statistics to place a value that falls on the border between two classes in the larger of the two classes. For example, [80-90) means this classification includes everything from 80 up to, but not including, 90. The value 90 is included in the next class, [90-100).

Table 2.4

Liters per Person Frequency
[80-90) 4
[90-100) 3
[100-110) 1
[110-120) 0
[120-130) 1
[130-140) 1
[140-150) 2
[150-160) 0
[160-170) 2
[170-180) 0
[180-190) 1

Figure: Completed Frequency Table for World Bottled Water Consumption Data (2004)
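As a rough sketch of how the classes in Table 2.4 can be assigned automatically, the following Python snippet finds the lower endpoint of the class containing each value. The names `liters`, `WIDTH`, and `class_counts` are our own choices for this illustration.

```python
from collections import Counter

# 2004 consumption in liters per person, from Table 2.3.
liters = [183.6, 168.5, 163.5, 148.0, 141.6, 136.7, 124.9, 101.4,
          99.6, 92.0, 90.5, 87.8, 87.1, 82.1, 80.3]

WIDTH = 10  # class width chosen in the text

# int(x // WIDTH) * WIDTH is the lower endpoint of the class
# [lower, lower + WIDTH) that contains x, so a border value such as
# 90.0 would land in [90-100), not in [80-90).
class_counts = Counter(int(x // WIDTH) * WIDTH for x in liters)

# Print every class from [80-90) to [180-190), including empty ones.
for lower in range(80, 190, WIDTH):
    print(f"[{lower}-{lower + WIDTH})  {class_counts[lower]}")
```

Listing the empty classes as well keeps the gaps in the data visible, which matters once the table is turned into a graph.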

Histograms

Once you can create a frequency table, you are ready to create your first graphical representation, called a histogram. Let's revisit our data about student bottled beverage habits.

Table 2.5

Completed Frequency Table for Water Bottle Data
Number of Plastic Beverage Bottles per Week Frequency
1 1
2 1
3 3
4 4
5 6
6 8
7 7
8 2

Here is the same data in a histogram:

In this case, the horizontal axis represents the variable (number of plastic bottles of water consumed), and the vertical axis is the frequency, or count. The height of each vertical bar represents the number of people in that class. For example, in the range of consuming [1-2) bottles, there is only one person, so the height of the bar is at 1. We can see from the graph that the most common class of bottles used by people each week is the [6-7) range, or six bottles per week.

A histogram is for numerical data. With histograms, the different sections are referred to as bins. Think of a column, or bin, as a vertical container that collects all the data for that range of values. If a value occurs on the border between two bins, it is commonly agreed that this value will go in the larger class, or the bin to the right. It is important when drawing a histogram to be certain that there are enough bins so that the last data value is included. Often this means you have to extend the horizontal axis beyond the value of the last data point. In this example, if we had stopped the graph at 8, we would have missed that data, because the 8's actually appear in the bin between 8 and 9. Very often, when you see histograms in newspapers, magazines, or online, they may instead label the midpoint of each bin. Some graphing software will also label the midpoint of each bin, unless you specify otherwise.
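To see the bins without any graphing software, here is a minimal text-only sketch in Python. It draws the bars sideways, using one `#` per student, which is our own simplification; the names `frequency`, `BIN_WIDTH`, and `rows` are likewise assumptions made for this example.

```python
# Frequencies from Table 2.5: bottles per week -> number of students.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

BIN_WIDTH = 1

# Each bin is [left, left + BIN_WIDTH). The last bin is [8-9), so the
# axis has to extend to 9 even though the largest data value is 8.
rows = []
for left in range(min(frequency), max(frequency) + 1, BIN_WIDTH):
    label = f"[{left}-{left + BIN_WIDTH})"
    rows.append(f"{label} {'#' * frequency.get(left, 0)}")

print("\n".join(rows))
```

The final row is labeled [8-9), which is the bin that would be lost if the axis stopped at 8.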

On the Web

http://illuminations.nctm.org/ActivityDetail.aspx?ID=78 Here you can change the bin width and explore how it affects the shape of the histogram.

Relative Frequency Histogram

A relative frequency histogram is just like a regular histogram, but instead of labeling the frequencies on the vertical axis, we use the percentage of the total data that is present in that bin. For example, there is only one data value in the first bin. This represents \frac{1}{32}, or approximately 3%, of the total data. Thus, the vertical bar for the bin extends upward to 3%.
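The conversion from counts to percentages can be sketched as follows. The frequencies are those of the bottle data; the names `frequency`, `total`, and `relative` are our own.

```python
# Frequencies for the bottle data: bottles per week -> number of students.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

total = sum(frequency.values())  # 32 students in all

# Relative frequency of a bin = (count in the bin) / (total count),
# expressed here as a percentage.
relative = {value: 100 * count / total for value, count in frequency.items()}

for value, pct in relative.items():
    print(f"{value}: {pct:.1f}%")
```

The first bin works out to 3.125%, which is the "approximately 3%" quoted above, and the percentages over all bins add up to 100.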

Frequency Polygons

A frequency polygon is similar to a histogram, but instead of using bins, a polygon is created by plotting the frequencies and connecting those points with a series of line segments.

To create a frequency polygon for the bottle data, we first find the midpoints of each classification, plot a point at the frequency for each bin at the midpoint, and then connect the points with line segments. To make a polygon with the horizontal axis, plot the midpoint for the class one greater than the maximum for the data, and one less than the minimum.
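The points of the frequency polygon described above can be listed with a short Python sketch. The names `frequency`, `WIDTH`, and `points` are assumptions made for this illustration.

```python
# Frequencies for the bottle data: bottles per week -> number of students.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

WIDTH = 1  # each bin is [k, k + 1)

# Plot each frequency at the midpoint of its bin, k + WIDTH / 2.
points = [(k + WIDTH / 2, count) for k, count in sorted(frequency.items())]

# Close the polygon on the horizontal axis by adding an empty class
# one bin below the minimum and one bin above the maximum.
first_mid, last_mid = points[0][0], points[-1][0]
points = [(first_mid - WIDTH, 0)] + points + [(last_mid + WIDTH, 0)]

for midpoint, count in points:
    print(midpoint, count)
```

Connecting these ten points in order with line segments gives the polygon, which starts at (0.5, 0) and returns to the axis at (9.5, 0).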

Here is a frequency polygon constructed directly from the previously-shown histogram:

Here is the frequency polygon in finished form:

Frequency polygons are helpful in showing the general overall shape of a distribution of data. They can also be useful for comparing two sets of data. Imagine how confusing two histograms would look graphed on top of each other!

Example: It would be interesting to compare bottled water consumption in two different years. Two frequency polygons would help give an overall picture of how the years are similar, and how they are different. In the following graph, two frequency polygons, one representing 1999, and the other representing 2004, are overlaid. 1999 is in red, and 2004 is in green.

It appears there was a shift to the right in all the data, which is explained by realizing that all of the countries have significantly increased their consumption. The first peak in the lower-consuming countries is almost identical in the two frequency polygons, but it increased by 20 liters per person in 2004. In 1999, there was a middle peak, but that group shifted significantly to the right in 2004 (by between 40 and 60 liters per person). The frequency polygon is the first type of graph we have learned about that makes this type of comparison easier.

Cumulative Frequency Histograms and Ogive Plots

Very often, it is helpful to know how the data accumulate over the range of the distribution. To do this, we will add to our frequency table by including the cumulative frequency, which is how many of the data points are in all the classes up to and including a particular class.

Table 2.6

Number of Plastic Beverage Bottles per Week Frequency Cumulative Frequency
1 1 1
2 1 2
3 3 5
4 4 9
5 6 15
6 8 23
7 7 30
8 2 32

Figure: Cumulative Frequency Table for Bottle Data

For example, the cumulative frequency for 5 bottles per week is 15, because 15 students consumed 5 or fewer bottles per week. Notice that the cumulative frequency for the last class is the same as the total number of students in the data. This should always be the case.

If we drew a histogram of the cumulative frequencies, or a cumulative frequency histogram, it would look as follows:

A relative cumulative frequency histogram would be the same, except that the vertical bars would represent the relative cumulative frequencies of the data:

Table 2.7

Number of Plastic Beverage Bottles per Week Frequency Cumulative Frequency Relative Cumulative Frequency (%)
1 1 1 3.1
2 1 2 6.3
3 3 5 15.6
4 4 9 28.1
5 6 15 46.9
6 8 23 71.9
7 7 30 93.8
8 2 32 100

Figure: Relative Cumulative Frequency Table for Bottle Data
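The two new columns of Table 2.7 can be reproduced with a running total. The sketch below is one possible way to do it in Python; the helper name `percent` is our own, and it rounds ties upward, since the table's figures of 6.3 and 93.8 are consistent with that convention (Python's built-in `round` sends an exact tie such as 6.25 to the even digit, giving 6.2).

```python
from decimal import Decimal, ROUND_HALF_UP
from itertools import accumulate

# Frequencies for the bottle data: bottles per week -> number of students.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

values = sorted(frequency)

# Running total of the frequencies, class by class.
cumulative = list(accumulate(frequency[v] for v in values))
total = cumulative[-1]  # the last cumulative frequency is the total count


def percent(count, total):
    """Percentage of the total, rounded half up to one decimal place."""
    exact = Decimal(count) * 100 / Decimal(total)
    return exact.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


relative_cumulative = [percent(c, total) for c in cumulative]

for v, c, r in zip(values, cumulative, relative_cumulative):
    print(v, frequency[v], c, r)
```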

Remembering what we did with the frequency polygon, we can remove the bins to create a new type of plot. In the frequency polygon, we connected the midpoints of the bins. In a relative cumulative frequency plot, we use the point on the right side of each bin.

The reason for this should make a lot of sense: when we read this plot, each point should represent the percentage of the total data that is less than or equal to a particular value, just like in the frequency table. For example, the point that is plotted at 4 corresponds to 15.6%, because that is the percentage of the data that is less than or equal to 3. It does not include the 4's, because they are in the bin to the right of that point. This is why we plot a point at 1 on the horizontal axis and at 0% on the vertical axis. None of the data is lower than 1, and similarly, all of the data is below 9. Here is the final version of the plot:

This plot is commonly referred to as an ogive plot. The name ogive comes from a particular pointed arch originally present in Arabic architecture and later incorporated in Gothic cathedrals. Here is a picture of a cathedral in Ecuador with a close-up of an ogive-type arch:

If a distribution is symmetric and mound shaped, then its ogive plot will look just like the shape of one half of such an arch.
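For the bottle data, the points of the ogive plot can be listed with the short sketch below. It assumes, as in the discussion above, that each percentage is plotted at the right edge of its bin and that the plot starts at 0% at the left edge of the first bin. The names `frequency` and `ogive` are our own.

```python
from itertools import accumulate

# Frequencies for the bottle data: bottles per week -> number of students.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

values = sorted(frequency)
total = sum(frequency.values())
running = accumulate(frequency[v] for v in values)

# Start at 0% on the left edge of the first bin, then plot each
# cumulative percentage at the right edge (v + 1) of the bin [v, v + 1).
ogive = [(values[0], 0.0)]
ogive += [(v + 1, 100 * c / total) for v, c in zip(values, running)]

for x, pct in ogive:
    print(x, f"{pct:.1f}%")
```

The point at 4 comes out as 15.6%, the share of students who used 3 or fewer bottles, and the last point is (9, 100%).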

Shape, Center, Spread

In the first chapter, we introduced measures of center and spread as important descriptors of a data set. The shape of a distribution of data is very important as well. Shape, center, and spread should always be your starting point when describing a data set.

Referring to our imaginary student poll on using plastic beverage containers, we notice that the data are spread out from 1 to 8. The graph for the data illustrates this concept, and the range quantifies it. Look back at the graph and notice that there is a large concentration of students in the 5, 6, and 7 region. This would lead us to believe that the center of this data set is somewhere in this area. We use the mean and/or median to measure central tendency, but it is also important that you see that the center of the distribution is near the large concentration of data. This is done with shape.

Shape is harder to describe with a single statistical measure, so we will describe it in less quantitative terms. A very important feature of this data set, as well as many that you will encounter, is that it has a single large concentration of data that appears like a mountain. A data set that is shaped in this way is typically referred to as mound-shaped. Mound-shaped data will usually look like one of the following three pictures:

Think of these graphs as frequency polygons that have been smoothed into curves. In statistics, we refer to these graphs as density curves. The most important feature of a density curve is symmetry. The first density curve above is symmetric and mound-shaped. Notice the second curve is mound-shaped, but the center of the data is concentrated on the left side of the distribution. The right side of the data is spread out across a wider area. This type of distribution is referred to as skewed right. It is the direction of the long, spread out section of data, called the tail, that determines the direction of the skewing. For example, in the third curve, the left tail of the distribution is stretched out, so this distribution is skewed left. Our student bottle data set has this skewed-left shape.

Lesson Summary

A frequency table is useful to organize data into classes according to the number of occurrences, or frequency, of each class. Relative frequency shows the percentage of data in each class. A histogram is a graphical representation of a frequency table (either actual or relative frequency). A frequency polygon is created by plotting the midpoint of each bin at its frequency and connecting the points with line segments. Frequency polygons are useful for viewing the overall shape of a distribution of data, as well as comparing multiple data sets. For any distribution of data, you should always be able to describe the shape, center, and spread. A data set that is mound shaped can be classified as either symmetric or skewed. Distributions that are skewed left have the bulk of the data concentrated on the higher end of the distribution, and the lower end, or tail, of the distribution is spread out to the left. A skewed-right distribution has a large portion of the data concentrated in the lower values of the variable, with the tail spread out to the right. A relative cumulative frequency plot, or ogive plot, shows how the data accumulate across the different values of the variable.

Points to Consider

Review Questions

1. Lois was gathering data on the plastic beverage bottle consumption habits of her classmates, but she ran out of time as class was ending. When she arrived home, something had spilled in her backpack and smudged the data for the 2's. Fortunately, none of the other values was affected, and she knew there were 30 total students in the class. Complete her frequency table.

Table 2.8

Number of Plastic Beverage Bottles per Week Tally Frequency
1 {\color{red} | |}
2
3 {\color{red} | | |}
4 {\color{red} | | }
5 {\color{red} | | | }
6 {\color{red}\bcancel{ | | | | } \ | | }
7 {\color{red}\bcancel{| | | | }\ | }
8 {\color{red} | }

2. The following frequency table contains exactly one data value that is a positive multiple of ten. What must that value be?

Table 2.9

Class Frequency
[0 - 5) 4
[5 - 10) 0
[10 - 15) 2
[15 - 20) 1
[20 - 25) 0
[25 - 30) 3
[30 - 35) 0
[35 - 40) 1

(a) 10

(b) 20

(c) 30

(d) 40

(e) There is not enough information to determine the answer.

3. The following table includes the data from the same group of countries from the earlier bottled water consumption example, but is for the year 1999, instead.

Table 2.10

Country Liters of Bottled Water Consumed per Person per Year
Italy 154.8
Mexico 117.0
United Arab Emirates 109.8
Belgium and Luxembourg 121.9
France 117.3
Spain 101.8
Germany 100.7
Lebanon 67.8
Switzerland 90.1
Cyprus 67.4
United States 63.6
Saudi Arabia 75.3
Czech Republic 62.1
Austria 74.6
Portugal 70.4

Figure: Bottled Water Consumption per Person in Leading Countries in 1999. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm

(a) Create a frequency table for this data set.

(b) Create the histogram for this data set.

(c) How would you describe the shape of this data set?

4. The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials.

Table 2.11

Manufactured Material Energy Saved (millions of BTU's per ton)
Aluminum Cans 206
Copper Wire 83
Steel Cans 20
LDPE Plastics (e.g., trash bags) 56
PET Plastics (e.g., beverage bottles) 53
HDPE Plastics (e.g., household cleaner bottles) 51
Personal Computers 43
Carpet 106
Glass 2
Corrugated Cardboard 15
Newspaper 16
Phone Books 11
Magazines 11
Office Paper 10

Amount of energy saved by manufacturing different materials using the maximum percentage of recycled material as opposed to using all new material. Source: National Geographic, January 2008. Volume 213 No., pg 82-83.

(a) Complete the frequency table below, including the actual frequency, the relative frequency (round to the nearest tenth of a percent), and the relative cumulative frequency.

(b) Create a relative frequency histogram from your table in part (a).

(c) Draw the corresponding frequency polygon.

(d) Create the ogive plot.

(e) Comment on the shape, center, and spread of this distribution as it relates to the original data. (Do not actually calculate any specific statistics).

(f) Add up the relative frequency column. What is the total? What should it be? Why might the total not be what you would expect?

(g) There is a portion of your ogive plot that should be horizontal. Explain what is happening with the data in this area that creates this horizontal section.

(h) What does the steepest part of an ogive plot tell you about the distribution?

On the Web

http://www.earth-policy.org/Updates/2006/Update51_data.htm

http://en.wikipedia.org/wiki/Ogive

Technology Notes: Histograms on the TI-83/84 Graphing Calculator

To draw a histogram on your TI-83/84 graphing calculator, you must first enter the data in a list. In the home screen, press [2ND][{], and then enter the data separated by commas (see the screen below). When all the data have been entered, press [2ND][}][STO], and then press [2ND][L1][ENTER].

Now you are ready to plot the histogram. Press [2ND][STAT PLOT] to enter the STAT-PLOTS menu. You can plot up to three statistical plots at one time. Choose Plot1. Turn the plot on, change the type of plot to a histogram (see sample screen below), and choose L1. Enter '1' for the Freq by pressing [2ND][A-LOCK] to turn off alpha lock, which is normally on in this menu, because most of the time you would want to enter a variable here. An alternative would be to enter the values of the variables in L1 and the frequencies in L2 as we did in Chapter 1.

Finally, we need to set a window. Press [WINDOW] and enter an appropriate window to display the plot. In this case, 'XSCL' is what determines the bin width. Also notice that the maximum x value needs to go up to 9 to show the last bin, even though the data values stop at 8. Enter all of the values shown below.

Press [GRAPH] to display the histogram. If you press [TRACE] and then use the left or right arrows to trace along the graph, notice how the calculator uses the notation to properly represent the values in each bin.

Common Graphs and Data Plots

Learning Objectives

Introduction

In this section, we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single numerical variables, we will also study methods for displaying categorical variables. You will also be introduced to using a scatterplot and a line graph to show the relationship between two variables.

Categorical Variables: Bar Graphs and Pie Graphs

Example: E-Waste and Bar Graphs

We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of that ends up in a landfill, where the chemicals from batteries and other electronics add toxins to the environment. Approximately 80% of the electronics discarded in the United States is also exported to third world countries, where it is disposed of under generally hazardous conditions by unprotected workers¹. The following table shows the tonnage of the most common types of electronic equipment discarded in the United States in 2005.

Table 2.12

Electronic Equipment Thousands of Tons Discarded
Cathode Ray Tube (CRT) TV's 7591.1
CRT Monitors 389.8
Printers, Keyboards, Mice 324.9
Desktop Computers 259.5
Laptop Computers 30.8
Projection TV's 132.8
Cell Phones 11.7
LCD Monitors 4.9

Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.1, pg 73.

The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below:

While this looks very similar to a histogram, the bars in a bar graph usually are separated slightly. The graph is just a series of disjoint categories.

Please note that discussions of shape, center, and spread have no meaning for a bar graph, and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!

Pie Graphs

Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, simply multiply the percentage, written as a decimal, by 360, which is the total number of degrees in a circle. For example, a category that makes up 86.8% of the total gets a sector of 0.868 × 360, or about 312.5 degrees.

Here is a table with the percentages and the approximate angle measure of each sector:

Table 2.13

Electronic Equipment Thousands of Tons Discarded Percentage of Total Discarded Angle Measure of Circle Sector
Cathode Ray Tube (CRT) TV's 7591.1 86.8 312.5
CRT Monitors 389.8 4.5 16.2
Printers, Keyboards, Mice 324.9 3.7 13.4
Desktop Computers 259.5 3.0 10.7
Laptop Computers 30.8 0.4 1.3
Projection TV's 132.8 1.5 5.5
Cell Phones 11.7 0.1 0.5
LCD Monitors 4.9 ~0 0.2
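The percentage and angle columns can be recomputed from the tonnage figures with a sketch like the one below. The names `discarded` and `sectors` are our own. Note that this version works from the unrounded shares, so an angle can differ slightly from a table that rounds the percentage first (for CRT monitors it gives about 16.0 degrees, where 4.5% of 360 is 16.2).

```python
# Thousands of tons discarded in 2005, from Table 2.12.
discarded = {
    "CRT TV's": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TV's": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}

total = sum(discarded.values())

sectors = {}
for name, tons in discarded.items():
    share = tons / total  # proportion of the whole, between 0 and 1
    sectors[name] = (100 * share, 360 * share)  # (percent, degrees)

for name, (pct, degrees) in sectors.items():
    print(f"{name:<28}{pct:6.1f}%{degrees:8.1f} deg")
```

Because the shares add up to 1, the angles add up to 360 degrees, which is a useful check before drawing the graph.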

And here is the completed pie graph:

Displaying Univariate Data

Dot Plots

A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.

Example: The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.

Table 2.14

Percentage of the paper packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.1, pg 86-87.
Country % of Paper Packaging Recycled
Estonia 34
New Zealand 40
Poland 40
Cyprus 42
Portugal 56
United States 59
Italy 62
Spain 63
Australia 66
Greece 70
Finland 70
Ireland 70
Netherlands 70
Sweden 70
France 76
Germany 83
Austria 83
Belgium 83
Japan 98

The dot plot for this data would look like this:

Notice that this data set is centered at a manufacturing rate for using recycled materials of between 65 and 70 percent. It is spread from 34% to 98%, and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious if you are creating them by hand with large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.

Stem-and-Leaf Plots

One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is represented by two digits: the stem and the leaf. In this example, it makes sense to use the ten's digits for the stems and the one's digits for the leaves. The stems are on the left of a dividing line as follows:

Once the stems are decided, the leaves representing the one's digits are listed in numerical order from left to right:

It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:

Note: 5|69 means 56% and 59% are the two values in the 50's.

If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot is easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.

Example: Consider the following populations of counties in California.

Butte - 220,748

Calaveras - 45,987

Del Norte - 29,547

Fresno - 942,298

Humboldt - 132,755

Imperial - 179,254

San Francisco - 845,999

Santa Barbara - 431,312

To construct a stem-and-leaf plot, we need to either round or truncate to two digits.

Table 2.15

Value Value Rounded Value Truncated
149 15 14
657 66 65
188 19 18

2|2 represents 220,000 - 229,999 when data has been truncated

2|2 represents 215,000 - 224,999 when data has been rounded.
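The difference between the two approaches in Table 2.15 can be sketched with integer arithmetic. The function names `truncate_to_tens` and `round_to_tens` are hypothetical, chosen only for this example, and the rounding shown sends a final digit of 5 upward.

```python
def truncate_to_tens(value):
    """Drop the last digit without looking at it: 149 -> 14."""
    return value // 10


def round_to_tens(value):
    """Round to the nearest ten (a final 5 goes up), then drop the last digit: 149 -> 15."""
    return (value + 5) // 10


# The three values from Table 2.15: original, rounded, truncated.
for value in (149, 657, 188):
    print(value, round_to_tens(value), truncate_to_tens(value))
```

Truncating never changes the leading digits, which is why a truncated 2|2 covers 220,000 to 229,999, while a rounded 2|2 covers values within 5,000 of 220,000.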

If we decide to round the above data, we have:

Butte - 220,000

Calaveras - 46,000

Del Norte - 30,000

Fresno - 940,000

Humboldt - 130,000

Imperial - 180,000

San Francisco - 850,000

Santa Barbara - 430,000

And the stem-and-leaf plot will be as follows:

where:

2|2 represents 215,000 - 224,999.

Source: California State Association of Counties http://www.counties.org/default,asp?id=399

Back-to-Back Stem Plots

Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.

In the previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:

Table 2.16

Percentage of the glass packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.1, pg 86-87.
Country % of Glass Packaging Recycled
Cyprus 4
United States 21
Poland 27
Greece 34
Portugal 39
Spain 41
Australia 44
Ireland 56
Italy 56
Finland 56
France 59
Estonia 64
New Zealand 72
Netherlands 76
Germany 81
Austria 86
Japan 96
Belgium 98
Sweden 100

In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.

We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so its effective percentage is actually more than 100.
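As a small illustration of why empty stems must be kept, the sketch below builds a one-sided stem plot for the glass percentages in Table 2.16. It is not a full back-to-back plot, and the names `glass` and `stem_plot` are our own.

```python
from collections import defaultdict

# Percentage of glass packaging recycled, from Table 2.16.
glass = [4, 21, 27, 34, 39, 41, 44, 56, 56, 56, 59,
         64, 72, 76, 81, 86, 96, 98, 100]


def stem_plot(data):
    """Return the rows of a stem plot: tens as stems, ones digits as leaves."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    # Walk through every stem from the smallest to the largest, so that
    # a stem with no data still gets its own (empty) row.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem:>2} | {''.join(leaves[stem])}".rstrip())
    return rows


lines = stem_plot(glass)
print("\n".join(lines))
```

The row for stem 1 is empty, and leaving it in shows the gap between Cyprus at 4% and the next country at 21%. Here 10 | 0 stands for 100%.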

Displaying Bivariate Data

Scatterplots and Line Plots

Bivariate simply means two variables. All our previous work was with univariate, or single-variable data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.

Example: We have looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.

Table 2.17

Country % of Paper Packaging Recycled % of Glass Packaging Recycled
Estonia 34 64
New Zealand 40 72
Poland 40 27
Cyprus 42 4
Portugal 56 39
United States 59 21
Italy 62 56
Spain 63 41
Australia 66 44
Greece 70 34
Finland 70 56
Ireland 70 56
Netherlands 70 76
Sweden 70 100
France 76 59
Germany 83 81
Austria 83 86
Belgium 83 98
Japan 98 96

Figure: Paper and Glass Packaging Recycling Rates for 19 countries

Scatterplots

We will place the paper recycling rates on the horizontal axis and those for glass on the vertical axis. Next, we will plot a point that shows each country's rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.

Recall that one of the things you saw from the stem-and-leaf plot is that, in general, a country's recycling rate for glass is lower than its paper recycling rate. On the next graph, we have plotted a line that represents the paper and glass recycling rates being equal. If all the countries had the same paper and glass recycling rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.
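The claim that most points fall below the line of equal rates can be checked directly from Table 2.17. The following Python sketch is one possible way to do the count; the variable names are our own and are not part of the original lesson.

```python
# Paper and glass packaging recycling rates (%) copied from Table 2.17.
# Each entry is (paper rate, glass rate).
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

# A point lies below the line "glass = paper" when glass < paper.
below = [c for c, (paper, glass) in rates.items() if glass < paper]
above = [c for c, (paper, glass) in rates.items() if glass > paper]
on_line = [c for c, (paper, glass) in rates.items() if glass == paper]

print("Below the line:", len(below))
print("Above the line:", len(above), above)
print("On the line:", len(on_line))
```

Countries in the `above` list are the ones whose glass recycling rate exceeds their paper recycling rate.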

With univariate data, we initially characterize a data set by describing its shape, center, and spread. For bivariate data, we will also discuss three important characteristics: shape, direction, and strength. These characteristics will inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a cloud. If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.

Data that are oriented in this manner are said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate with the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.

The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between the glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association, while shorter and wider ones show a weaker linear relationship. In this example, there are some countries for which the glass and paper recycling rates do not seem to be related.

New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass recycling rates, and Austria (circled in green) is an example of a country with a much lower glass recycling rate than its paper recycling rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, weakening the association between the variables.
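One common numerical summary of the direction and strength of a linear association is the correlation coefficient, usually written r. It is not defined in this lesson, so treat the following Python sketch as an outside illustration rather than part of the method described here. It computes r for all 19 countries and again with the four countries named above set aside, to show how a few widely scattered points can weaken a linear association. The function name `pearson_r` is our own.

```python
import math

# (paper %, glass %) pairs from Table 2.17.
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

def pearson_r(pairs):
    """Correlation coefficient for a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The four countries singled out in the text as lying far from the rest.
set_aside = {"New Zealand", "Estonia", "Sweden", "Austria"}

r_all = pearson_r(list(rates.values()))
r_rest = pearson_r([p for c, p in rates.items() if c not in set_aside])

print(f"r for all 19 countries:      {r_all:.2f}")
print(f"r without the four countries: {r_rest:.2f}")
```

A value of r closer to 1 corresponds to a longer, narrower ellipse rising from left to right.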

On the Web

http://tinyurl.com/y8vcm5y Guess the correlation.

Line Plots

Example: The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990s:

Table 2.18

Year Municipal Waste Generated (Millions of Tons)
1990 269
1991 294
1992 281
1993 292
1994 307
1995 323
1996 327
1997 327
1998 340

Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm

In this example, the time in years is considered the explanatory variable, or independent variable, and the amount of municipal waste is the response variable, or dependent variable. It is not only the passage of time that causes our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. However, it would not make sense to view the relationship between time and municipal waste in the opposite direction.

When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable, and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case, we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data values are changing over a period of time. Here is the line plot for the US Municipal Waste data:

It is easy to see general trends from this type of plot. For example, we can spot the period in which the most dramatic increase occurred (from 1990 to 1991) by looking for the steepest line segment. We can also spot the years after which the waste output decreased (1991) or remained about the same (1996). It would be interesting to investigate some possible reasons for the behaviors of these individual years.
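The slopes that the eye picks out on a line plot are simply the changes between consecutive years. As a rough check on the reading of the graph, the following Python sketch computes those year-to-year changes from Table 2.18; the variable names are illustrative choices of our own.

```python
# Municipal waste generated in the US (millions of tons), from Table 2.18.
waste = {
    1990: 269, 1991: 294, 1992: 281, 1993: 292, 1994: 307,
    1995: 323, 1996: 327, 1997: 327, 1998: 340,
}

years = sorted(waste)

# Change from each year to the next, keyed by the starting year.
changes = {start: waste[end] - waste[start]
           for start, end in zip(years, years[1:])}

# The steepest upward segment starts in the year with the largest change.
steepest_start = max(changes, key=changes.get)

# Segments that go down or stay flat.
not_increasing = [year for year, change in changes.items() if change <= 0]

for year, change in changes.items():
    print(f"{year} to {year + 1}: {change:+d}")
print("Steepest increase starts in", steepest_start)
print("Decrease or no change starting in", not_increasing)
```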

Lesson Summary

Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories. A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. They are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values. Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points. It can be identified as being weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data have a positive association. If the response variable generally decreases as the explanatory variable increases, the data have a negative association. In a line graph, there is significance to the change between consecutive points, so these points are connected. Line graphs are often used when the explanatory variable is time.

Points to Consider

Multimedia Links

  For a description of how to draw a stem-and-leaf plot, as well as how to derive information from one (14.0), see APUS07, Stem-and-Leaf Plot (8:08).




Review Questions

1. Computer equipment contains many elements and chemicals that are either hazardous, or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately 27 kg. Some of the more hazardous substances, like Mercury, have been included in the 'other' category, because they occur in relatively small amounts that are still dangerous and toxic.

Table 2.19

Material Kilograms
Plastics 6.21
Lead 1.71
Aluminum 3.83
Iron 5.54
Copper 2.12
Tin 0.27
Zinc 0.60
Nickel 0.23
Barium 0.05
Other elements and chemicals 6.44

Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm

(a) Create a bar graph for this data.

(b) Complete the chart below to show the approximate percentage of the total weight for each material.

Table 2.20

Material Kilograms Approximate Percentage of Total Weight
Plastics 6.21
Lead 1.71
Aluminum 3.83
Iron 5.54
Copper 2.12
Tin 0.27
Zinc 0.60
Nickel 0.23
Barium 0.05
Other elements and chemicals 6.44

(c) Create a circle graph for this data.

2. The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.

Table 2.21

State Percentage
Alabama 23
Alaska 7
Arizona 18
Arkansas 36
California 30
Colorado 18
Connecticut 23
Delaware 31
District of Columbia 8
Florida 40
Georgia 33
Hawaii 25
Illinois 28
Indiana 23
Iowa 32
Kansas 11
Kentucky 28
Louisiana 14
Maine 41
Maryland 29
Massachusetts 33
Michigan 25
Minnesota 42
Mississippi 13
Missouri 33
Montana 5
Nebraska 27
Nevada 15
New Hampshire 25
New Jersey 45
New Mexico 12
New York 39
North Carolina 26
North Dakota 21
Ohio 19
Oklahoma 12
Oregon 28
Pennsylvania 26
Rhode Island 23
South Carolina 34
South Dakota 42
Tennessee 40
Utah 19
Vermont 30
Virginia 35
Washington 48
West Virginia 20
Wisconsin 36
Wyoming 5

Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm

(a) Create a dot plot for this data.

(b) Discuss the shape, center, and spread of this distribution.

(c) Create a stem-and-leaf plot for the data.

(d) Use your stem-and-leaf plot to find the median percentage for this data.

3. Identify the important features of the shape of each of the following distributions.

Questions 4-7 refer to the following dot plots:

4. Identify the overall shape of each distribution.
5. How would you characterize the center(s) of these distributions?
6. Which of these distributions has the smallest standard deviation?
7. Which of these distributions has the largest standard deviation?
8. In question 2, you looked at the percentage of waste recycled in each state. Do you think there is a relationship between the percentage recycled and the total amount of waste that a state generates? Here are the data, including both variables.

Table 2.22

State Percentage Total Amount of Municipal Waste in Thousands of Tons
Alabama 23 5549
Alaska 7 560
Arizona 18 5700
Arkansas 36 4287
California 30 45000
Colorado 18 3084
Connecticut 23 2950
Delaware 31 1189
District of Columbia 8 246
Florida 40 23617
Georgia 33 14645
Hawaii 25 2125
Illinois 28 13386
Indiana 23 7171
Iowa 32 3462
Kansas 11 4250
Kentucky 28 4418
Louisiana 14 3894
Maine 41 1339
Maryland 29 5329
Massachusetts 33 7160
Michigan 25 13500
Minnesota 42 4780
Mississippi 13 2360
Missouri 33 7896
Montana 5 1039
Nebraska 27 2000
Nevada 15 3955
New Hampshire 25 1200
New Jersey 45 8200
New Mexico 12 1400
New York 39 28800
North Carolina 26 9843
North Dakota 21 510
Ohio 19 12339
Oklahoma 12 2500
Oregon 28 3836
Pennsylvania 26 9440
Rhode Island 23 477
South Carolina 34 8361
South Dakota 42 510
Tennessee 40 9496
Utah 19 3760
Vermont 30 600
Virginia 35 9000
Washington 48 6527
West Virginia 20 2000
Wisconsin 36 3622
Wyoming 5 530

(a) Identify the variables in this example, and specify which one is the explanatory variable and which one is the response variable.

(b) How much municipal waste was created in Illinois?

(c) Draw a scatterplot for this data.

(d) Describe the direction and strength of the association between the two variables.

9. The following line graph shows the recycling rates of two different types of plastic bottles in the US from 1995 to 2001.

    a. Explain the general trends for both types of plastics over these years.
    b. What was the total change in PET bottle recycling from 1995 to 2001?
    c. Can you think of a reason to explain this change?
    d. During what years was this change the most rapid?

References

National Geographic, January 2008. Volume 213 No.1

http://www.etoxics.org/site/PageServer?pagename=svtc_global_ewaste_crisis

http://www.earth-policy.org/Updates/2006/Update51_data.htm

Technology Notes: Scatterplots on the TI-83/84 Graphing Calculator

Press [STAT][ENTER], and enter the following data, with the explanatory variable in L1 and the response variable in L2. Next, press [2ND][STAT-PLOT] to enter the STAT-PLOTS menu, and choose the first plot.

Change the settings to match the following screenshot:

This selects a scatterplot with the explanatory variable in L1 and the response variable in L2. In order to see the points better, you should choose either the square or the plus sign for the mark. The square has been chosen in the screenshot. Finally, set the window as shown below to match the data. In this case, we looked at our lowest and highest data values in each variable and added a bit of room to create a pleasant window. Press [GRAPH] to see the result, which is also shown below.

Line Plots on the TI-83/84 Graphing Calculator

Your graphing calculator will also draw a line plot, and the process is almost identical to that for creating a scatterplot. Enter the data into your lists, and choose a line plot in the Plot1 menu, as in the following screenshot.

Next, set an appropriate window (not necessarily the one shown below), and graph the resulting plot.

Box-and-Whisker Plots

Learning Objectives

Introduction

In this section, the box-and-whisker plot will be introduced, and the basic ideas of shape, center, spread, and outliers will be studied in this context.

The Five-Number Summary

The five-number summary is a numerical description of a data set comprised of the following measures (in order): minimum value, lower quartile, median, upper quartile, maximum value.

Example: The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the water resources available now and the need to protect them in the years to come. Here is a listing of the reservoir capacities of the major water sources for Arizona:

Table 2.23

Lake/Reservoir % of Capacity
Salt River System 59
Lake Pleasant 49
Verde River System 33
San Carlos 9
Lyman Reservoir 3
Show Low Lake 51
Lake Havasu 98
Lake Mohave 85
Lake Mead 95
Lake Powell 89

Figure: Arizona Reservoir Capacity, 12 / 31 / 98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html

This data set was collected in 1998, and the water levels in many states have taken a dramatic turn for the worse. For example, Lake Powell is currently at less than 50% of capacity^1.

Placing the data in order from smallest to largest gives the following:

3, 9, 33, 49, 51, 59, 85, 89, 95, 98

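Since there are 10 numbers, the median is the average of 51 and 59, which is 55. Recall that the lower quartile is the 25^{\text{th}} percentile, or where 25% of the data is below that value. In this data set, that number is 33, the median of the five values below the overall median (3, 9, 33, 49, 51). Likewise, the upper quartile is 89, the median of the five values above the overall median (59, 85, 89, 95, 98). Therefore, the five-number summary is as shown: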

\left \{3, 33, 55, 89, 98 \right \}
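The steps used above can be sketched in Python. This is a minimal illustration, not a definitive implementation: it assumes the quartile convention used in this example (each quartile is the median of the lower or upper half of the ordered data, leaving out the middle value when the number of data points is odd), and other software may use slightly different conventions. The function name `five_number_summary` is our own.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max).

    Assumes at least two data values. Quartiles are the medians of the
    lower and upper halves; the middle value is excluded when n is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]    # smallest half of the values
    upper_half = data[-half:]   # largest half of the values
    return (data[0], median(lower_half), median(data),
            median(upper_half), data[-1])

# Arizona reservoir levels (% of capacity) from Table 2.23.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]

print(five_number_summary(arizona))
```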

Box-and-Whisker Plots

A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the 'box' part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values using line segments to form the 'whiskers'. Here is the box plot for this data:

The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as the two sections in the box. In this example, because there are 10 data points, the number of values in each section will only be approximately the same, but about 25% of the data appears in each section. You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data will be symmetric. In this example, the different sections are not exactly the same length. The left whisker is slightly longer than the right, and the right half of the box is slightly longer than the left. We would most likely say that this distribution is moderately symmetric. In other words, there is roughly the same amount of data in each section. The different lengths of the sections tell us how the data are spread in each section. The numbers in the left whisker (lowest 25% of the data) are spread more widely than those in the right whisker.

Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:

In this case, the third quarter of data (between the median and upper quartile), appears to be a bit more densely concentrated in a smaller area. The data values in the lower whisker also appear to be much more widely spread than in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).

Comparing Multiple Box Plots

Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets. In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other, or drawn side-by-side for easy comparison. Here are the two box plots:

The plots seem to be spread the same if we just look at the range, but with the box plots, we have an additional indicator of spread if we examine the length of the box (or interquartile range). This tells us how the middle 50% of the data is spread, and Arizona's data values appear to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona's capacities are lower. Recall that the median is a resistant measure of center, because it is not affected by outliers. The mean is not resistant, because it will be pulled toward outlying points. When a data set is skewed strongly in a particular direction, the mean will be pulled in the direction of the skewing, but the median will not be affected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.

Even though we wouldn't characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.

Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median, due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.
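The comparison of mean and median for Arizona can be verified from the data in Table 2.23. The short Python sketch below uses the standard `statistics` module; the Colorado values are not listed in this lesson, so only Arizona is shown, and the variable names are our own.

```python
from statistics import mean, median

# Arizona reservoir levels (% of capacity) from Table 2.23.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]

az_mean = mean(arizona)
az_median = median(arizona)

print(f"mean   = {az_mean:.1f}")
print(f"median = {az_median:.1f}")
print("mean is higher than median:", az_mean > az_median)
```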

Outliers in Box-and-Whisker Plots

Here are the reservoir data for California (the names of the lakes and reservoirs have been omitted):

80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75

At first glance, the 34 should stand out. It appears as if this point is significantly different from the rest of the data. Let's use a graphing calculator to investigate this plot. Enter your data into a list as we have done before, and then choose a plot. Under 'Type', you will notice what looks like two different box and whisker plots. For now choose the second one (even though it appears on the second line, you must press the right arrow to select these plots).

Setting a window is not as important for a box plot, so we will use the calculator's ability to automatically scale a window to our data by pressing [ZOOM] and selecting '9:Zoom Stat'.

While box plots give us a nice summary of the important features of a distribution, we lose the ability to identify individual points. The left whisker is elongated, but if we did not have the data, we would not know if all the points in that section of the data were spread out, or if it were just the result of the one outlier. It is more typical to use a modified box plot. This box plot will show an outlier as a single, disconnected point and will stop the whisker at the previous point. Go back and change your plot to the first box plot option, which is the modified box plot, and then graph it.

Notice that without the outlier, the distribution is really roughly symmetric.

This data set had one obvious outlier, but when is a point far enough away to be called an outlier? We need a standard accepted practice for defining an outlier in a box plot. This rather arbitrary definition is that any point lying more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile will be considered an outlier. Because the IQR is the same as the length of the box, any point that is more than one-and-a-half box lengths from either quartile is plotted as an outlier.

A common misconception of students is that you stop the whisker at this boundary line. In fact, the last point on the whisker that is not an outlier is where the whisker stops.

The calculations for determining the outlier in this case are as follows:

Lower Quartile: 74

Upper Quartile: 85

Interquartile range (IQR): 85 - 74 = 11

1.5 * IQR = 16.5

Cut-off for outliers in left whisker: 74 - 16.5 = 57.5. Thus, any value less than 57.5 is considered an outlier.

Notice that we did not even bother to test the calculation on the right whisker, because it should be obvious from a quick visual inspection that there are no points that are farther than even one box length away from the upper quartile.
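The same calculation can be sketched in Python, this time testing both whiskers. As before, this assumes that each quartile is the median of the lower or upper half of the ordered data (excluding the middle value when the count is odd); the function and variable names are illustrative choices of our own.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartiles as medians of the two halves of the data."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

# California reservoir levels (% of capacity) from the list above.
california = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]

q1, q3 = quartiles(california)
iqr = q3 - q1
low_cutoff = q1 - 1.5 * iqr    # values below this are outliers
high_cutoff = q3 + 1.5 * iqr   # values above this are outliers

outliers = [x for x in california if x < low_cutoff or x > high_cutoff]
inside = [x for x in california if low_cutoff <= x <= high_cutoff]

# In a modified box plot, each whisker stops at the most extreme data
# value that is NOT an outlier, not at the cut-off itself.
whisker_ends = (min(inside), max(inside))

print("Q1, Q3, IQR:", q1, q3, iqr)
print("Cut-offs:", low_cutoff, high_cutoff)
print("Outliers:", outliers)
print("Whiskers end at:", whisker_ends)
```

Note that the left whisker ends at an actual data value, which illustrates the point made earlier about where the whisker stops.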

If you press [TRACE] and use the left or right arrows, the calculator will trace the values of the five-number summary, as well as the outlier.

The Effects of Changing Units on Shape, Center, and Spread

In the previous lesson, we looked at data for the materials in a typical desktop computer.

Table 2.24

Material Kilograms
Plastics 6.21
Lead 1.71
Aluminum 3.83
Iron 5.54
Copper 2.12
Tin 0.27
Zinc 0.60
Nickel 0.23
Barium 0.05
Other elements and chemicals 6.44

Here is the data set given in pounds. The weight of each in kilograms was multiplied by 2.2.

Table 2.25

Material Pounds
Plastics 13.7
Lead 3.8
Aluminum 8.4
Iron 12.2
Copper 4.7
Tin 0.6
Zinc 1.3
Nickel 0.5
Barium 0.1
Other elements and chemicals 14.2

When all values are multiplied by a factor of 2.2, the calculation of the mean is also multiplied by 2.2, so the center of the distribution would be increased by the same factor. Similarly, calculations of the range, interquartile range, and standard deviation will also be increased by the same factor. In other words, the center and the measures of spread will increase proportionally.

Example: This is easier to think of with numbers. Suppose that your mean is 20, and that two of the data values in your distribution are 21 and 23. If you multiply 21 and 23 by 2, you get 42 and 46, and your mean also changes by a factor of 2 and is now 40. Before, your deviations were 21 - 20 = 1 and 23 - 20 = 3, but now your deviations are 42 - 40 = 2 and 46 - 40 = 6, so your deviations are twice as big as well.

This should result in the graph maintaining the same shape, but being stretched out, or elongated. Here are the side-by-side box plots for both distributions showing the effects of changing units.
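The effect of the unit change on center and spread can be checked numerically. The following Python sketch multiplies the kilogram values from Table 2.24 by 2.2 and compares several measures before and after. It uses the unrounded products rather than the rounded pound values in Table 2.25, and the helper name `value_range` is our own.

```python
from statistics import mean, median, stdev

# Weights in kilograms from Table 2.24.
kilograms = [6.21, 1.71, 3.83, 5.54, 2.12, 0.27, 0.60, 0.23, 0.05, 6.44]

FACTOR = 2.2  # conversion factor used in the text (kg to lb)
pounds = [FACTOR * x for x in kilograms]

def value_range(values):
    """Range of a data set: maximum minus minimum."""
    return max(values) - min(values)

measures = {
    "mean": mean,
    "median": median,
    "range": value_range,
    "standard deviation": stdev,
}

# Each measure of center or spread should grow by the same factor.
ratios = {}
for name, measure in measures.items():
    before = measure(kilograms)
    after = measure(pounds)
    ratios[name] = after / before
    print(f"{name}: {before:.3f} kg -> {after:.3f} lb "
          f"(ratio {ratios[name]:.3f})")
```

Because every measure changes by the same factor, the box plot in pounds has the same shape as the one in kilograms, only stretched.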

On the Web

http://tinyurl.com/34s6sm Investigate the mean, median and box plots.

http://tinyurl.com/3ao9px More investigation of boxplots.

Lesson Summary

The five-number summary is a useful collection of statistical measures consisting of the following in ascending order: minimum, lower quartile, median, upper quartile, maximum. A box-and-whisker plot is a graphical representation of the five-number summary showing a box bounded by the lower and upper quartiles and the median as a line in the box. The whiskers are line segments extended from the quartiles to the minimum and maximum values. Each whisker and section of the box contains approximately 25% of the data. The width of the box is the interquartile range, or IQR, and shows the spread of the middle 50% of the data. Box-and-whisker plots are effective at giving an overall impression of the shape, center, and spread of a data set. While an outlier is simply a point that is not typical of the rest of the data, there is an accepted definition of an outlier in the context of a box-and-whisker plot. Any point that is more than 1.5 times the length of the box (IQR) from either end of the box is considered to be an outlier. When changing the units of a distribution, the center and spread will be affected, but the shape will stay the same.

Points to Consider

Multimedia Links

  For a description of how to draw a box-and-whisker plot from given data (14.0), see patrickJMT, Box and Whisker Plot (5:53).




Review Questions

1. Here are the 1998 data on the percentage of capacity of reservoirs in Idaho.

70, 84, 62, 80, 75, 95, 69, 48, 76, 70, 45, 83, 58, 75, 85, 70, 62, 64, 39, 68, 67, 35, 55, 93, 51, 67, 86, 58, 49, 47, 42, 75

    a. Find the five-number summary for this data set.
    b. Show all work to determine if there are true outliers according to the 1.5*IQR rule.
    c. Create a box-and-whisker plot showing any outliers.
    d. Describe the shape, center, and spread of the distribution of reservoir capacities in Idaho in 1998.
    e. Based on your answer in part (d), how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
2. Here are the 1998 data on the percentage of capacity of reservoirs in Utah.

80, 46, 83, 75, 83, 90, 90, 72, 77, 4, 83, 105, 63, 87, 73, 84, 0, 70, 65, 96, 89, 78, 99, 104, 83, 81

    a. Find the five-number summary for this data set.
    b. Show all work to determine if there are true outliers according to the 1.5*IQR rule.
    c. Create a box-and-whisker plot showing any outliers.
    d. Describe the shape, center, and spread of the distribution of reservoir capacities in Utah in 1998.
    e. Based on your answer in part (d) how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
3. Graph the box plots for Idaho and Utah on the same axes. Write a few statements comparing the water levels in Idaho and Utah by discussing the shape, center, and spread of the distributions.
4. If the median of a distribution is less than the mean, which of the following statements is the most correct?
    a. The distribution is skewed left.
    b. The distribution is skewed right.
    c. There are outliers on the left side.
    d. There are outliers on the right side.
    e. (b) or (d) could be true.
5. The following table contains recent data on the average price of a gallon of gasoline for states that share a border crossing into Canada.
    a. Find the five-number summary for this data.
    b. Show all work to test for outliers.
    c. Graph the box-and-whisker plot for this data.
    d. Canadian gasoline is sold in liters. Suppose a Canadian crossed the border into one of these states and wanted to compare the cost of gasoline. There are approximately 4 liters in a gallon. If we were to convert the distribution to liters, describe the resulting shape, center, and spread of the new distribution.
    e. Complete the following table. Convert to cost per liter by dividing by 3.7854, and then graph the resulting box plot.

As an interesting extension to this problem, you could look up the current data and compare that distribution with the data presented here. You could also find the exchange rate for Canadian dollars and convert the prices into the other currency.

Table 2.26

State Average Price of a Gallon of Gasoline (US$) Average Price of a Liter of Gasoline (US$)
Alaska 3.458
Washington 3.528
Idaho 3.26
Montana 3.22
North Dakota 3.282
Minnesota 3.12
Michigan 3.352
New York 3.393
Vermont 3.252
New Hampshire 3.152
Maine 3.309


Figure: Average prices of a gallon of gasoline on March 16, 2008. Source: AAA, http://www.fuelgaugereport.com/sbsavg.asp

References

^1 Kunzig, Robert. Drying of the West. National Geographic, February 2008, Vol. 213, No. 2, Page 94.

http://en.wikipedia.org/wiki/Box_plot

Chapter Review

Part One: Questions

1. Which of the following can be inferred from this histogram?

    a. The mode is 1.
    b. mean < median
    c. median < mean
    d. The distribution is skewed left.
    e. None of the above can be inferred from this histogram.
2. Sean was given the following relative frequency histogram to read.

Unfortunately, the copier cut off the bin with the highest frequency. Which of the following could possibly be the relative frequency of the cut-off bin?

    a. 16
    b. 24
    c. 32
    d. 68
3. Tianna was given a graph for a homework question in her statistics class, but she forgot to label the graph or the axes and couldn’t remember if it was a frequency polygon or an ogive plot. Here is her graph:

Identify which of the two graphs she has and briefly explain why.

In questions 4-7, match the distribution with the choice of the correct real-world situation that best fits the graph.

4.
5.
6.
7.
    a. Endy collected and graphed the heights of all the 12^{\text{th}} grade students in his high school.
    b. Brittany asked each of the students in her statistics class to bring in 20 pennies selected at random from their pocket or piggy bank. She created a plot of the dates of the pennies.
    c. Thamar asked her friends what their favorite movie was this year and graphed the results.
    d. Jeno bought a large box of doughnut holes at the local pastry shop, weighed each of them, and then plotted their weights to the nearest tenth of a gram.
8. Which of the following box plots matches the histogram?

9. If a data set is roughly symmetric with no skewing or outliers, which of the following would be an appropriate sketch of the shape of the corresponding ogive plot?
    a.
    b.
    c.
    d.
10. Which of the following scatterplots shows a strong, negative association?
    a.
    b.
    c.
    d.

Part Two: Open-Ended Questions

1. The Burj Dubai will become the world’s tallest building when it is completed. It will be twice the height of the Empire State Building in New York.

Table 2.27

Building City Height (ft)
Taipei 101 Taipei 1671
Shanghai World Financial Center Shanghai 1614
Petronas Tower Kuala Lumpur 1483
Sears Tower Chicago 1451
Jin Mao Tower Shanghai 1380
Two International Finance Center Hong Kong 1362
CITIC Plaza Guangzhou 1283
Shun Hing Square Shenzhen 1260
Empire State Building New York 1250
Central Plaza Hong Kong 1227
Bank of China Tower Hong Kong 1205
Bank of America Tower New York 1200
Emirates Office Tower Dubai 1163
Tuntex Sky Tower Kaohsiung 1140

The chart lists the 15 tallest buildings in the world (as of 12/2007).

(a) Complete the table below, and draw an ogive plot of the resulting data.

Table 2.28

Class Frequency Relative Frequency Cumulative Frequency Relative Cumulative Frequency

(b) Use your ogive plot to approximate the median height for this data.

(c) Use your ogive plot to approximate the upper and lower quartiles.

(d) Find the 90^{\text{th}} percentile for this data (i.e., the height that 90% of the data is less than).

2. Recent reports have called attention to an inexplicable collapse of the Chinook Salmon population in western rivers (see http://www.nytimes.com/2008/03/17/science/earth/17salmon.html). The following data tracks the fall salmon population in the Sacramento River from 1971 to 2007.

Table 2.29

Year* Adults Jacks
1971-1975 164,947 37,409
1976-1980 154,059 29,117
1981-1985 169,034 45,464
1986-1990 182,815 35,021
1991-1995 158,485 28,639
1996 299,590 40,078
1997 342,876 38,352
1998 238,059 31,701
1999 395,942 37,567
2000 416,789 21,994
2001 546,056 33,439
2002 775,499 46,526
2003 521,636 29,806
2004 283,554 67,660
2005 394,007 18,115
2006 267,908 8,048
2007 87,966 1,897

Figure: Total Fall Salmon Escapement in the Sacramento River. Source: http://www.pcouncil.org/newsreleases/Sacto_adult_and_jack_escapement_thru%202007.pdf

* During the years from 1971 to 1995, only 5-year averages are available.

In case you are not up on your salmon facts, there are two terms in this chart that may be unfamiliar. Fish escapement refers to the number of fish who escape the hazards of the open ocean and return to their freshwater streams and rivers to spawn. A Jack salmon is a fish that returns to spawn before reaching full adulthood.

(a) Create one line graph that shows both the adult and jack populations for these years. The data from 1971 to 1995 represent the five-year averages. Devise an appropriate method for displaying this on your line plot while maintaining consistency.

(b) Write at least two complete sentences that explain what this graph tells you about the change in the salmon population over time.
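One possible way to put the five-year averages and the single-year counts on the same time axis is to repeat each average for every year in its range, so that averaged periods appear as flat segments. The sketch below shows only that data preparation step, using three rows copied from Table 2.29. The helper name `expand_row` is hypothetical, and this is one option among several (plotting each average at the midpoint of its range is another).

```python
# Three rows copied from Table 2.29 as (label, adults, jacks) strings.
rows = [
    ("1991-1995", "158,485", "28,639"),
    ("1996", "299,590", "40,078"),
    ("1997", "342,876", "38,352"),
]


def expand_row(label, adults, jacks):
    """Turn one table row into a list of (year, adults, jacks) tuples.

    A range such as "1991-1995" is expanded so that each year in the range
    carries the five-year average; a single year yields one tuple.
    """
    adults = int(adults.replace(",", ""))
    jacks = int(jacks.replace(",", ""))
    if "-" in label:
        first, last = (int(part) for part in label.split("-"))
    else:
        first = last = int(label)
    return [(year, adults, jacks) for year in range(first, last + 1)]


# One entry per calendar year, ready to hand to any plotting tool.
series = [point for row in rows for point in expand_row(*row)]
```

Whatever method is chosen, the graph should make clear which points are averages and which are single-year counts, for example with a different line style or a note in the legend.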

3. The following data set about Galapagos land area was used in the first chapter.

Table 2.30

Island Approximate Area (sq. km)
Baltra 8
Darwin 1.1
Española 60
Fernandina 642
Floreana 173
Genovesa 14
Isabela 4640
Marchena 130
North Seymour 1.9
Pinta 60
Pinzón 18
Rabida 4.9
San Cristóbal 558
Santa Cruz 986
Santa Fe 24
Santiago 585
South Plaza 0.13
Wolf 1.3

Figure: Land Area of Major Islands in the Galapagos Archipelago. Source: http://en.wikipedia.org/wiki/Gal%C3%A1pagos_Islands

(a) Choose two methods for representing this data, one categorical and one numerical, and draw a plot using each method.

(b) Write a few sentences commenting on the shape, spread, and center of the distribution in the context of the original data. You may use summary statistics to back up your statements.
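Summary statistics to support a description of shape, spread, and center can be computed by hand or with a short script. The sketch below uses the areas from Table 2.30. It assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data; other quartile conventions exist and can give slightly different values. The function name `five_number_summary` is a choice made here.

```python
from statistics import mean, median

# Areas (sq. km) copied from Table 2.30.
areas = [8, 1.1, 60, 642, 173, 14, 4640, 130, 1.9, 60,
         18, 4.9, 558, 986, 24, 585, 0.13, 1.3]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: quartiles are the medians of the lower and upper halves,
    leaving out the middle value when the number of observations is odd.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


summary = five_number_summary(areas)
average = mean(areas)
```

Comparing `average` with the median in `summary` is one quick numerical check on the skewness seen in a plot.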

4. Investigation: The National Weather Service maintains a vast array of data on a variety of topics. Go to: http://lwf.ncdc.noaa.gov/oa/climate/online/ccd/snowfall.html. You will find records for the mean snowfall for various cities across the US.

(a) Create a back-to-back stem-and-leaf plot for all the cities located in each of two geographic regions. (Use the simplistic breakdown found at http://library.thinkquest.org/4552/ to classify the states by region.)

(b) Write a few sentences that compare the two distributions, commenting on the shape, spread, and center in the context of the original data. You may use summary statistics to back up your statements.
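For larger data sets, the layout of a back-to-back stem-and-leaf plot can be generated with a short script. The sketch below is a minimal illustration, assuming values rounded to whole numbers with the tens digit as the stem and the units digit as the leaf. The two lists are made-up numbers, not National Weather Service data, and the function name `back_to_back` is hypothetical.

```python
from collections import defaultdict

# Hypothetical mean snowfall values (inches), rounded to whole numbers.
# These are made-up numbers for illustration only.
region_a = [12, 15, 21, 24, 24, 38, 40]
region_b = [3, 9, 17, 22, 35, 36, 47]


def back_to_back(left, right):
    """Return the lines of a back-to-back stem plot (stem = tens, leaf = units)."""
    left_leaves = defaultdict(list)
    right_leaves = defaultdict(list)
    for value in sorted(left):
        left_leaves[value // 10].append(value % 10)
    for value in sorted(right):
        right_leaves[value // 10].append(value % 10)

    # Include every stem between the smallest and largest, even if it is empty.
    stems = range(min(left + right) // 10, max(left + right) // 10 + 1)
    # Left-hand leaves are written in reverse so they grow outward from the stem.
    left_text = {s: " ".join(str(d) for d in reversed(left_leaves[s])) for s in stems}
    width = max(len(text) for text in left_text.values())
    lines = []
    for s in stems:
        right_text = " ".join(str(d) for d in right_leaves[s])
        lines.append(f"{left_text[s]:>{width}} | {s} | {right_text}".rstrip())
    return lines


plot = back_to_back(region_a, region_b)
print("\n".join(plot))
```

Keeping empty stems in the output preserves gaps in the distribution, which matters when describing shape and spread.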

Keywords

Back-to-back stem plots

Bar graph

Bias

Bivariate data

Box-and-whisker plot

Cumulative frequency histogram

Density curves

Dot plot

Explanatory variable

Five-number summary

Frequency polygon

Frequency tables

Histogram

Modified box plot

Mound-shaped

Negative linear association

Ogive plot

Pie graph

Positive linear association

Relative cumulative frequency histogram

Relative cumulative frequency plot

Relative frequency histogram

Response variable

Scatterplot

Skewed left

Skewed right

Stem-and-leaf plot

Symmetric

Tail