Charts and graphs of various types, when created carefully, can convey important information about a data set at a glance, without calculating, or even having knowledge of, various statistical measures. This chapter will concentrate on some of the more common visual presentations of data.
The earth has seemed so large in scope for thousands of years that it is only recently that many people have begun to take seriously the idea that we live on a planet of limited and dwindling resources. This is something that residents of the Galapagos Islands are also beginning to understand. Because of its isolation and lack of resources to support large, modernized populations of humans, the problems that we face on a global level are magnified in the Galapagos. Basic human resources such as water, food, fuel, and building materials must all be brought in to the islands. More problematically, the waste products must either be disposed of in the islands, or shipped somewhere else at a prohibitive cost. As the human population grows exponentially, the Islands are confronted with the problem of what to do with all the waste. In most communities in the United States, it is easy for many to put out the trash on the street corner each week and perhaps never worry about where that trash is going. In the Galapagos, the desire to protect the fragile ecosystem from the impacts of human waste is more urgent and is resulting in a new focus on renewing, reducing, and reusing materials as much as possible. There have been recent positive efforts to encourage recycling programs.
Figure 2.1
The Recycling Center on Santa Cruz in the Galapagos turns all the recycled glass into pavers that are used for the streets in Puerto Ayora.
It is not easy to bury tons of trash in solid volcanic rock. The sooner we realize that we are in the same position of limited space and that we have a need to preserve our global ecosystem, the more chance we have to save not only the uniqueness of the Galapagos Islands, but that of our own communities. All of the information in this chapter is focused on the issues and consequences of our recycling habits, or lack thereof!
Example: Water, Water, Everywhere!
Bottled water consumption worldwide has grown, and continues to grow at a phenomenal rate. According to the Earth Policy Institute, 154 billion gallons were produced in 2004. While there are places in the world where safe water supplies are unavailable, most of the growth in consumption has been due to other reasons. The largest consumer of bottled water is the United States, which arguably could be the country with the best access to safe, convenient, and reliable sources of tap water. The large volume of toxic waste that is generated by the plastic bottles and the small fraction of the plastic that is recycled create a considerable environmental hazard. In addition, huge volumes of carbon emissions are created when these bottles are manufactured using oil and transported great distances by oil-burning vehicles.
Example: Take an informal poll of your class. Ask each member of the class, on average, how many beverage bottles they use in a week. Once you collect this data, the first step is to organize it so it is easier to understand. A frequency table is a common starting point. Frequency tables simply display each value of the variable, and the number of occurrences (the frequency) of each of those values. In this example, the variable is the number of plastic beverage bottles of water consumed each week.
Consider the following raw data:
6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6, 1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3
Here are the correct frequencies using the imaginary data presented above:
Figure: Imaginary Class Data on Water Bottle Usage
Table 2.1
Number of Plastic Beverage Bottles per Week | Frequency |
1 | 1 |
2 | 1 |
3 | 3 |
4 | 4 |
5 | 6 |
6 | 8 |
7 | 7 |
8 | 2 |
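The counting step can also be sketched in a few lines of Python. This is only an illustration: the variable names `bottles` and `frequency` are chosen here and are not part of the original example.

```python
from collections import Counter

# Raw class data from the text (bottles used per week by each student).
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

# Counter records how many times each value occurs.
frequency = Counter(bottles)

print("Bottles per Week | Frequency")
for value in sorted(frequency):
    print(f"{value} | {frequency[value]}")
```

Checking that the frequencies add up to the number of students polled (32 here) is a quick way to catch a miscount.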
When creating a frequency table, it is often helpful to use tally marks as a running total to avoid missing a value or over-representing another.
Table 2.2
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
1 | I | 1 |
2 | I | 1 |
3 | III | 3 |
4 | IIII | 4 |
5 | IIIII I | 6 |
6 | IIIII III | 8 |
7 | IIIII II | 7 |
8 | II | 2 |
The following data set shows the countries in the world that consume the most bottled water per person per year.
Table 2.3
Country | Liters of Bottled Water Consumed per Person per Year |
Italy | 183.6 |
Mexico | 168.5 |
United Arab Emirates | 163.5 |
Belgium and Luxembourg | 148.0 |
France | 141.6 |
Spain | 136.7 |
Germany | 124.9 |
Lebanon | 101.4 |
Switzerland | 99.6 |
Cyprus | 92.0 |
United States | 90.5 |
Saudi Arabia | 87.8 |
Czech Republic | 87.1 |
Austria | 82.1 |
Portugal | 80.3 |
Figure: Bottled Water Consumption per Person in Leading Countries in 2004. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm
These data values have been measured at the ratio level. There is some flexibility required in order to create meaningful and useful categories for a frequency table. The values range from 80.3 liters to 183.6 liters. By examining the data, it seems appropriate for us to create our frequency table using classes that are each 10 liters wide. We will skip the tally marks in this case, because the data values are already in numerical order, and it is easy to see how many are in each classification.
A bracket, '[' or ']', indicates that the endpoint of the interval is included in the class. A parenthesis, '(' or ')', indicates that the endpoint is not included. It is common practice in statistics to place a value that falls on the border between two classes in the higher of the two classes. For example, [80, 90) means this classification includes everything from 80 and gets infinitely close to, but not equal to, 90. 90 is included in the next class, [90, 100).
Table 2.4
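Liters per Person | Frequency |
[80, 90) | 4 |
[90, 100) | 3 |
[100, 110) | 1 |
[110, 120) | 0 |
[120, 130) | 1 |
[130, 140) | 1 |
[140, 150) | 2 |
[150, 160) | 0 |
[160, 170) | 2 |
[170, 180) | 0 |
[180, 190) | 1 |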
Figure: Completed Frequency Table for World Bottled Water Consumption Data (2004)
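As a rough sketch of how values are sorted into classes of width 10 that include their left endpoint, consider the following Python fragment. The helper name `class_start` and the constant `WIDTH` are hypothetical names introduced for this illustration.

```python
import math
from collections import Counter

# Liters per person per year for the 15 leading countries (2004).
liters = [183.6, 168.5, 163.5, 148.0, 141.6, 136.7, 124.9, 101.4,
          99.6, 92.0, 90.5, 87.8, 87.1, 82.1, 80.3]

WIDTH = 10  # class width chosen in the text

def class_start(value, width=WIDTH):
    """Lower endpoint of the class [start, start + width) containing value."""
    return math.floor(value / width) * width

counts = Counter(class_start(v) for v in liters)

# List every class between the lowest and highest, including empty ones.
low = class_start(min(liters))
high = class_start(max(liters))
table = [(start, start + WIDTH, counts.get(start, 0))
         for start in range(low, high + WIDTH, WIDTH)]

for start, end, count in table:
    print(f"[{start}, {end}) | {count}")
```

Because the helper rounds down, a border value such as 90 lands in the class that starts at 90, which matches the bracket and parenthesis convention described above.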
Once you can create a frequency table, you are ready to create your first graphical representation, called a histogram. Let's revisit our data about student bottled beverage habits.
Table 2.5
Number of Plastic Beverage Bottles per Week | Frequency |
1 | 1 |
2 | 1 |
3 | 3 |
4 | 4 |
5 | 6 |
6 | 8 |
7 | 7 |
8 | 2 |
Here is the same data in a histogram:
In this case, the horizontal axis represents the variable (number of plastic bottles of water consumed), and the vertical axis is the frequency, or count. Each vertical bar represents the number of people in each class of ranges of bottles. For example, in the range of consuming [1, 2) bottles, there is only one person, so the height of the bar is at 1. We can see from the graph that the most common class of bottles used by people each week is the [6, 7) range, or six bottles per week.
A histogram is for numerical data. With histograms, the different sections are referred to as bins. Think of a column, or bin, as a vertical container that collects all the data for that range of values. If a value occurs on the border between two bins, it is commonly agreed that this value will go in the larger class, or the bin to the right. It is important when drawing a histogram to be certain that there are enough bins so that the last data value is included. Often this means you have to extend the horizontal axis beyond the value of the last data point. In this example, if we had stopped the graph at 8, we would have missed that data, because the 8's actually appear in the bin between 8 and 9. Very often, when you see histograms in newspapers, magazines, or online, they may instead label the midpoint of each bin. Some graphing software will also label the midpoint of each bin, unless you specify otherwise.
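To see why the horizontal axis has to run past the largest data value, here is a minimal Python sketch that builds the bins for the bottle data and prints a sideways text histogram. The names `edges`, `bins`, and `bars` are illustrative only.

```python
# Frequencies from the class bottle data.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

BIN_WIDTH = 1
lowest, highest = min(frequency), max(frequency)

# The edges must extend one bin width past the largest value,
# otherwise there is no bin [8, 9) to hold the 8's.
edges = list(range(lowest, highest + BIN_WIDTH + 1, BIN_WIDTH))
bins = list(zip(edges[:-1], edges[1:]))

# Each bin includes its left edge and excludes its right edge.
bars = [sum(n for value, n in frequency.items() if left <= value < right)
        for left, right in bins]

for (left, right), count in zip(bins, bars):
    print(f"[{left}, {right}) {'#' * count}")
```

Stopping the edges at 8 would leave the last bin as [7, 8), and the two students who reported 8 bottles would not appear anywhere in the graph.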
On the Web
http://illuminations.nctm.org/ActivityDetail.aspx?ID=78 Here you can change the bin width and explore how it affects the shape of the histogram.
A relative frequency histogram is just like a regular histogram, but instead of labeling the frequencies on the vertical axis, we use the percentage of the total data that is present in that bin. For example, there is only one data value in the first bin. This represents 1/32, or approximately 3%, of the total data. Thus, the vertical bar for that bin extends upward to 3%.
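The conversion from frequency to relative frequency is a single division per bin. A short Python sketch, with variable names chosen only for illustration:

```python
# Frequencies from the class bottle data.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

total = sum(frequency.values())

# Relative frequency: each bin's share of the total, as a percentage.
relative = {value: 100 * count / total for value, count in frequency.items()}

for value in sorted(relative):
    print(f"{value} | {frequency[value]} | {relative[value]:.1f}%")
```

The relative frequencies always add up to 100%, apart from small differences introduced when the displayed values are rounded.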
A frequency polygon is similar to a histogram, but instead of using bins, a polygon is created by plotting the frequencies and connecting those points with a series of line segments.
To create a frequency polygon for the bottle data, we first find the midpoint of each class, plot a point at the height of the frequency for each bin at its midpoint, and then connect the points with line segments. To close the polygon against the horizontal axis, also plot a point with frequency zero at the midpoint of the class just above the maximum of the data, and another at the midpoint of the class just below the minimum.
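The points of the frequency polygon can be listed with a small Python sketch. It assumes bins of width 1 that start at each whole number, as in the histogram for the bottle data; the names `points` and `polygon` are illustrative.

```python
# Frequencies from the class bottle data; each key is the left edge of a bin.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

BIN_WIDTH = 1

# One point per bin: (midpoint of the bin, frequency of the bin).
points = [(left + BIN_WIDTH / 2, count)
          for left, count in sorted(frequency.items())]

# Extra points at height zero, one bin below the minimum and one bin
# above the maximum, so the polygon meets the horizontal axis.
first_mid = points[0][0]
last_mid = points[-1][0]
polygon = [(first_mid - BIN_WIDTH, 0)] + points + [(last_mid + BIN_WIDTH, 0)]

for x, y in polygon:
    print(f"({x}, {y})")
```

Connecting these points in order with line segments gives the polygon.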
Here is a frequency polygon constructed directly from the previously-shown histogram:
Here is the frequency polygon in finished form:
Frequency polygons are helpful in showing the general overall shape of a distribution of data. They can also be useful for comparing two sets of data. Imagine how confusing two histograms would look graphed on top of each other!
Example: It would be interesting to compare bottled water consumption in two different years. Two frequency polygons would help give an overall picture of how the years are similar, and how they are different. In the following graph, two frequency polygons, one representing 1999, and the other representing 2004, are overlaid. 1999 is in red, and 2004 is in green.
It appears there was a shift to the right in all the data, which is explained by realizing that all of the countries have significantly increased their consumption. The first peak in the lower-consuming countries is almost identical in the two frequency polygons, but it increased by 20 liters per person in 2004. In 1999, there was a middle peak, but that group shifted significantly to the right in 2004 (by between 40 and 60 liters per person). The frequency polygon is the first type of graph we have learned about that makes this type of comparison easier.
Very often, it is helpful to know how the data accumulate over the range of the distribution. To do this, we will add to our frequency table by including the cumulative frequency, which is how many of the data points are in all the classes up to and including a particular class.
Table 2.6
Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency |
1 | 1 | 1 |
2 | 1 | 2 |
3 | 3 | 5 |
4 | 4 | 9 |
5 | 6 | 15 |
6 | 8 | 23 |
7 | 7 | 30 |
8 | 2 | 32 |
Figure: Cumulative Frequency Table for Bottle Data
For example, the cumulative frequency for 5 bottles per week is 15, because 15 students consumed 5 or fewer bottles per week. Notice that the cumulative frequency for the last class is the same as the total number of students in the data. This should always be the case.
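A cumulative frequency column is a running total of the frequency column. One way to sketch this in Python uses `itertools.accumulate` from the standard library; the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies from the class bottle data.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

values = sorted(frequency)
counts = [frequency[v] for v in values]

# Running total: each entry is the sum of all frequencies up to that class.
cumulative = list(accumulate(counts))

for value, count, running in zip(values, counts, cumulative):
    print(f"{value} | {count} | {running}")
```

The final running total equals the number of students, which is the check described above.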
If we drew a histogram of the cumulative frequencies, or a cumulative frequency histogram, it would look as follows:
A relative cumulative frequency histogram would be the same, except that the vertical bars would represent the relative cumulative frequencies of the data:
Table 2.7
Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency | Relative Cumulative Frequency (%) |
1 | 1 | 1 | 3.1 |
2 | 1 | 2 | 6.3 |
3 | 3 | 5 | 15.6 |
4 | 4 | 9 | 28.1 |
5 | 6 | 15 | 46.9 |
6 | 8 | 23 | 71.9 |
7 | 7 | 30 | 93.8 |
8 | 2 | 32 | 100 |
Figure: Relative Cumulative Frequency Table for Bottle Data
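The relative cumulative frequencies are the cumulative frequencies divided by the total. The Python sketch below assumes the percentages are rounded to one decimal place with halves rounded up; the helper name `percent` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from itertools import accumulate

# Frequencies from the class bottle data.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

counts = [frequency[v] for v in sorted(frequency)]
cumulative = list(accumulate(counts))
total = cumulative[-1]

def percent(part, whole):
    """Percentage of whole, rounded half up to one decimal place (an assumption)."""
    exact = Decimal(100 * part) / Decimal(whole)
    return exact.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

relative_cumulative = [percent(c, total) for c in cumulative]

for value, pct in zip(sorted(frequency), relative_cumulative):
    print(f"{value} | {pct}")
```

Exact decimal arithmetic is used here because ordinary floating-point formatting in Python rounds ties such as 6.25 to the nearest even digit, which would give 6.2 rather than 6.3.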
Remembering what we did with the frequency polygon, we can remove the bins to create a new type of plot. In the frequency polygon, we connected the midpoints of the bins. In a relative cumulative frequency plot, we use the point on the right side of each bin.
The reason for this should make a lot of sense: when we read this plot, each point should represent the percentage of the total data that is less than or equal to a particular value, just like in the frequency table. For example, the point that is plotted at 4 corresponds to 15.6%, because that is the percentage of the data that is less than or equal to 3. It does not include the 4's, because they are in the bin to the right of that point. This is why we plot a point at 1 on the horizontal axis and at 0% on the vertical axis. None of the data is lower than 1, and similarly, all of the data is below 9. Here is the final version of the plot:
This plot is commonly referred to as an ogive plot. The name ogive comes from a particular pointed arch originally present in Arabic architecture and later incorporated in Gothic cathedrals. Here is a picture of a cathedral in Ecuador with a close-up of an ogive-type arch:
If a distribution is symmetric and mound shaped, then its ogive plot will look just like the shape of one half of such an arch.
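The points of the ogive plot for the bottle data can be generated as follows. This is a sketch that assumes bins of width 1, with each point placed at the right edge of its bin; the name `ogive` is illustrative.

```python
# Frequencies from the class bottle data; each key is the left edge of a bin.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

BIN_WIDTH = 1
total = sum(frequency.values())

# Start at the left edge of the first bin with 0% of the data.
ogive = [(min(frequency), 0.0)]

running = 0
for left in sorted(frequency):
    running += frequency[left]
    # Plot at the RIGHT edge of the bin: everything in the bin lies below it.
    ogive.append((left + BIN_WIDTH, 100 * running / total))

for x, pct in ogive:
    print(f"({x}, {pct:.1f}%)")
```

The point at 4 sits at 15.625%, which the table rounds to 15.6%, and the last point sits at 9 with 100% of the data.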
In the first chapter, we introduced measures of center and spread as important descriptors of a data set. The shape of a distribution of data is very important as well. Shape, center, and spread should always be your starting point when describing a data set.
Referring to our imaginary student poll on using plastic beverage containers, we notice that the data values are spread out from 1 to 8. The graph for the data illustrates this concept, and the range quantifies it. Look back at the graph and notice that there is a large concentration of students in the 5, 6, and 7 region. This would lead us to believe that the center of this data set is somewhere in this area. We use the mean and/or median to measure central tendency, but it is also important that you see that the center of the distribution is near the large concentration of data. Examining the shape of the distribution is what makes this visible.
Shape is harder to describe with a single statistical measure, so we will describe it in less quantitative terms. A very important feature of this data set, as well as many that you will encounter, is that it has a single large concentration of data that appears like a mountain. A data set that is shaped in this way is typically referred to as mound-shaped. Mound-shaped data will usually look like one of the following three pictures:
Think of these graphs as frequency polygons that have been smoothed into curves. In statistics, we refer to these graphs as density curves. The most important feature of a density curve is symmetry. The first density curve above is symmetric and mound-shaped. Notice the second curve is mound-shaped, but the center of the data is concentrated on the left side of the distribution. The right side of the data is spread out across a wider area. This type of distribution is referred to as skewed right. It is the direction of the long, spread out section of data, called the tail, that determines the direction of the skewing. For example, in the third curve, the left tail of the distribution is stretched out, so this distribution is skewed left. Our student bottle data set has this skewed-left shape.
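A common rule of thumb, which is not part of the discussion above and is offered here only as a rough check, is that a long left tail tends to pull the mean below the median. A minimal Python sketch for the bottle data:

```python
from statistics import mean, median

# Frequencies from the class bottle data.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

# Expand the frequency table back into the individual data values.
data = [value for value, count in sorted(frequency.items())
        for _ in range(count)]

center_mean = mean(data)
center_median = median(data)

print(f"mean = {center_mean}, median = {center_median}")

# Rule of thumb only: mean below median hints at a left skew.
if center_mean < center_median:
    print("The mean is below the median, consistent with a left skew.")
```

This comparison is a hint rather than a definition of skewness, so the graph itself should still be the starting point.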
A frequency table is useful to organize data into classes according to the number of occurrences, or frequency, of each class. Relative frequency shows the percentage of data in each class. A histogram is a graphical representation of a frequency table (either actual or relative frequency). A frequency polygon is created by plotting the midpoint of each bin at its frequency and connecting the points with line segments. Frequency polygons are useful for viewing the overall shape of a distribution of data, as well as comparing multiple data sets. For any distribution of data, you should always be able to describe the shape, center, and spread. A data set that is mound shaped can be classified as either symmetric or skewed. Distributions that are skewed left have the bulk of the data concentrated on the higher end of the distribution, and the lower end, or tail, of the distribution is spread out to the left. A skewed-right distribution has a large portion of the data concentrated in the lower values of the variable, with the tail spread out to the right. A relative cumulative frequency plot, or ogive plot, shows how the data accumulate across the different values of the variable.
Table 2.8
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
6 | | |
7 | | |
8 | | |
Table 2.9
Class | Frequency |
4 | |
0 | |
2 | |
1 | |
0 | |
3 | |
0 | |
1 |
(a) 10
(b) 20
(c) 30
(d) 40
(e) There is not enough information to determine the answer.
Table 2.10
Country | Liters of Bottled Water Consumed per Person per Year |
Italy | 154.8 |
Mexico | 117.0 |
United Arab Emirates | 109.8 |
Belgium and Luxembourg | 121.9 |
France | 117.3 |
Spain | 101.8 |
Germany | 100.7 |
Lebanon | 67.8 |
Switzerland | 90.1 |
Cyprus | 67.4 |
United States | 63.6 |
Saudi Arabia | 75.3 |
Czech Republic | 62.1 |
Austria | 74.6 |
Portugal | 70.4 |
Figure: Bottled Water Consumption per Person in Leading Countries in 1999. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm
(a) Create a frequency table for this data set.
(b) Create the histogram for this data set.
(c) How would you describe the shape of this data set?
Table 2.11
Manufactured Material | Energy Saved (millions of BTU's per ton) |
Aluminum Cans | 206 |
Copper Wire | 83 |
Steel Cans | 20 |
LDPE Plastics (e.g., trash bags) | 56 |
PET Plastics (e.g., beverage bottles) | 53 |
HDPE Plastics (e.g., household cleaner bottles) | 51 |
Personal Computers | 43 |
Carpet | 106 |
Glass | 2 |
Corrugated Cardboard | 15 |
Newspaper | 16 |
Phone Books | 11 |
Magazines | 11 |
Office Paper | 10 |
Amount of energy saved by manufacturing different materials using the maximum percentage of recycled material as opposed to using all new material. Source: National Geographic, January 2008. Volume 213 No., pg 82-83.
(a) Complete the frequency table below, including the actual frequency, the relative frequency (round to the nearest tenth of a percent), and the relative cumulative frequency.
(b) Create a relative frequency histogram from your table in part (a).
(c) Draw the corresponding frequency polygon.
(d) Create the ogive plot.
(e) Comment on the shape, center, and spread of this distribution as it relates to the original data. (Do not actually calculate any specific statistics).
(f) Add up the relative frequency column. What is the total? What should it be? Why might the total not be what you would expect?
(g) There is a portion of your ogive plot that should be horizontal. Explain what is happening with the data in this area that creates this horizontal section.
(h) What does the steepest part of an ogive plot tell you about the distribution?
On the Web
http://www.earth-policy.org/Updates/2006/Update51_data.htm
http://en.wikipedia.org/wiki/Ogive
Technology Notes: Histograms on the TI-83/84 Graphing Calculator
To draw a histogram on your TI-83/84 graphing calculator, you must first enter the data in a list. In the home screen, press [2ND][{], and then enter the data separated by commas (see the screen below). When all the data have been entered, press [2ND][}][STO], and then press [2ND][L1][ENTER].
Now you are ready to plot the histogram. Press [2ND][STAT PLOT] to enter the STAT-PLOTS menu. You can plot up to three statistical plots at one time. Choose Plot1. Turn the plot on, change the type of plot to a histogram (see sample screen below), and choose L1. Enter '1' for the Freq by pressing [2ND][A-LOCK] to turn off alpha lock, which is normally on in this menu, because most of the time you would want to enter a variable here. An alternative would be to enter the values of the variables in L1 and the frequencies in L2 as we did in Chapter 1.
Finally, we need to set a window. Press [WINDOW] and enter an appropriate window to display the plot. In this case, 'XSCL' is what determines the bin width. Also notice that the maximum value needs to go up to 9 to show the last bin, even though the data values stop at 8. Enter all of the values shown below.
Press [GRAPH] to display the histogram. If you press [TRACE] and then use the left or right arrows to trace along the graph, notice how the calculator uses the notation to properly represent the values in each bin.
In this section, we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single numerical variables, we will also study methods for displaying categorical variables. You will also be introduced to using a scatterplot and a line graph to show the relationship between two variables.
Example: E-Waste and Bar Graphs
We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of that ends up in a landfill, where the chemicals from batteries and other electronics add toxins to the environment. Approximately 80% of the electronics discarded in the United States is also exported to third world countries, where it is disposed of under generally hazardous conditions by unprotected workers. The following table shows the tonnage of the most common types of electronic equipment discarded in the United States in 2005.
Table 2.12
Electronic Equipment | Thousands of Tons Discarded |
Cathode Ray Tube (CRT) TV's | 7591.1 |
CRT Monitors | 389.8 |
Printers, Keyboards, Mice | 324.9 |
Desktop Computers | 259.5 |
Laptop Computers | 30.8 |
Projection TV's | 132.8 |
Cell Phones | 11.7 |
LCD Monitors | 4.9 |
Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.1, pg 73.
The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below:
While this looks very similar to a histogram, the bars in a bar graph usually are separated slightly. The graph is just a series of disjoint categories.
Please note that discussions of shape, center, and spread have no meaning for a bar graph, and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!
Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, simply multiply the percentage, written as a decimal, by 360, which is the total number of degrees in a circle.
Here is a table with the percentages and the approximate angle measure of each sector:
Table 2.13
Electronic Equipment | Thousands of Tons Discarded | Percentage of Total Discarded | Angle Measure of Circle Sector |
Cathode Ray Tube (CRT) TV's | 7591.1 | 86.8 | 312.5 |
CRT Monitors | 389.8 | 4.5 | 16.2 |
Printers, Keyboards, Mice | 324.9 | 3.7 | 13.4 |
Desktop Computers | 259.5 | 3.0 | 10.7 |
Laptop Computers | 30.8 | 0.4 | 1.3 |
Projection TV's | 132.8 | 1.5 | 5.5 |
Cell Phones | 11.7 | 0.1 | 0.5 |
LCD Monitors | 4.9 | 0.1 | 0.2 |
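The percentage and angle columns can be computed with a short Python sketch. The dictionary name `tons` and the shortened category labels are illustrative. The angles here are computed from the unrounded shares, so an entry may differ by a tenth or two from an angle computed from an already rounded percentage.

```python
# Thousands of tons discarded in the US in 2005 (from the table above).
tons = {
    "CRT TVs": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TVs": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}

total = sum(tons.values())

# Share of the total as a decimal, then scaled to percent and to degrees.
shares = {name: amount / total for name, amount in tons.items()}
angles = {name: 360 * share for name, share in shares.items()}

for name in tons:
    print(f"{name} | {100 * shares[name]:.1f}% | {angles[name]:.1f} degrees")
```

The unrounded angles add up to exactly 360 degrees, while the rounded entries in a printed table may add up to slightly more or less.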
And here is the completed pie graph:
A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.
Example: The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.
Table 2.14
Country | % of Paper Packaging Recycled |
Estonia | 34 |
New Zealand | 40 |
Poland | 40 |
Cyprus | 42 |
Portugal | 56 |
United States | 59 |
Italy | 62 |
Spain | 63 |
Australia | 66 |
Greece | 70 |
Finland | 70 |
Ireland | 70 |
Netherlands | 70 |
Sweden | 76 |
France | 76 |
Germany | 83 |
Austria | 83 |
Belgium | 83 |
Japan | 98 |
The dot plot for this data would look like this:
Notice that this data set is centered at a manufacturing rate for using recycled materials of between 65 and 70 percent. It is spread from 34% to 98%, and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious if you are creating them by hand with large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.
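As an example of letting software do the stacking, here is a minimal Python sketch that prints a text version of a dot plot for the paper packaging data as listed in the table above. The layout (one character column per percentage point) is a choice made for this illustration.

```python
from collections import Counter

# Percent of paper packaging recycled, as listed in the table above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70,
         70, 70, 70, 76, 76, 83, 83, 83, 98]

counts = Counter(paper)
tallest = max(counts.values())
lo, hi = min(paper), max(paper)

# Print from the top row down; a dot appears wherever the stack
# at that value is at least as tall as the current row.
for level in range(tallest, 0, -1):
    row = "".join("o" if counts[x] >= level else " "
                  for x in range(lo, hi + 1))
    print(row.rstrip())

# A simple axis with the minimum and maximum labeled.
width = hi - lo + 1
print("-" * width)
print(str(lo).ljust(width - len(str(hi))) + str(hi))
```

A plotting library would give a nicer picture, but the stacking logic is the same.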
One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is split into two parts: the stem and the leaf. In this example, it makes sense to use the tens digits for the stems and the ones digits for the leaves. The stems are on the left of a dividing line as follows:
Once the stems are decided, the leaves representing the one's digits are listed in numerical order from left to right:
It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:
Note: 5 | 6 9 means 56% and 59% are the two values in the 50's.
If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot is easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.
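The construction of a stem-and-leaf plot can be sketched in Python as follows, using the paper packaging data as listed in the table above. The variable names are illustrative.

```python
from collections import defaultdict

# Percent of paper packaging recycled, as listed in the table above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70,
         70, 70, 70, 76, 76, 83, 83, 83, 98]

# Tens digit is the stem, ones digit is the leaf.
stems = defaultdict(list)
for value in sorted(paper):
    stems[value // 10].append(value % 10)

# Include every stem from lowest to highest, even if it has no leaves,
# so the spacing of the plot reflects the spread of the data.
lines = []
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem} | {leaves}".rstrip())

print("\n".join(lines))
```

Sorting the data first guarantees that the leaves on each stem appear in numerical order from left to right.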
Example: Consider the following populations of counties in California.
Butte - 220,748
Calaveras - 45,987
Del Norte - 29,547
Fresno - 942,298
Humboldt - 132,755
Imperial - 179,254
San Francisco - 845,999
Santa Barbara - 431,312
To construct a stem-and-leaf plot, we need to either round or truncate to two digits.
Table 2.15
Value | Value Rounded | Value Truncated |
149 | 15 | 14 |
657 | 66 | 65 |
188 | 19 | 18 |
For the county populations, 2 | 2 represents 220,000 to 229,999 when data has been truncated.
2 | 2 represents 215,000 to 224,999 when data has been rounded.
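The difference between rounding and truncating to two digits can be sketched in Python. The helper name `leading_two` is hypothetical, and the sketch assumes positive whole numbers.

```python
def leading_two(value, mode="round"):
    """Reduce a positive whole number to its two leading digits.

    mode="truncate" simply drops the extra digits;
    mode="round" rounds to the nearest value, with halves rounded up.
    """
    drop = len(str(value)) - 2      # how many trailing digits to remove
    if drop <= 0:
        return value                # already two digits or fewer
    scale = 10 ** drop
    if mode == "truncate":
        return value // scale
    return (value + scale // 2) // scale

for value in (149, 657, 188):
    print(value, leading_two(value, "round"), leading_two(value, "truncate"))

# County populations from the example.
for population in (220748, 45987, 29547, 942298):
    print(population, leading_two(population, "round"))
```

Note that rounding a value such as 995 would produce 100, which has three digits, so a real data set may need an extra stem when values round up across a power of ten.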
If we decide to round the above data, we have:
Butte - 220,000
Calaveras - 46,000
Del Norte - 30,000
Fresno - 940,000
Humboldt - 130,000
Imperial - 180,000
San Francisco - 850,000
Santa Barbara - 430,000
And the stem-and-leaf plot will be as follows:
where 2 | 2 represents a population between 215,000 and 224,999.
Source: California State Association of Counties http://www.counties.org/default,asp?id=399
Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.
In the previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:
Table 2.16
Country | % of Glass Packaging Recycled |
Cyprus | 4 |
United States | 21 |
Poland | 27 |
Greece | 34 |
Portugal | 39 |
Spain | 41 |
Australia | 44 |
Ireland | 56 |
Italy | 56 |
Finland | 56 |
France | 59 |
Estonia | 64 |
New Zealand | 72 |
Netherlands | 76 |
Germany | 81 |
Austria | 86 |
Japan | 96 |
Belgium | 98 |
Sweden | 100 |
In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.
We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so its effective percentage is actually more than 100.
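A back-to-back stem plot can be sketched in Python by building the leaves for each data set separately and printing them on either side of a shared column of stems. Placing glass on the left and paper on the right is an assumption made for this illustration, as is the helper name `leaves_by_stem`. The data are taken from the two single-variable tables above.

```python
# Percent recycled, from the paper and glass packaging tables above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70,
         70, 70, 70, 76, 76, 83, 83, 83, 98]
glass = [4, 21, 27, 34, 39, 41, 44, 56, 56, 56,
         59, 64, 72, 76, 81, 86, 96, 98, 100]

def leaves_by_stem(values):
    """Map each stem (tens and above) to its sorted list of ones digits."""
    stems = {}
    for value in sorted(values):
        stems.setdefault(value // 10, []).append(value % 10)
    return stems

glass_stems = leaves_by_stem(glass)
paper_stems = leaves_by_stem(paper)

# Shared stems must cover both data sets, including empty stems.
first = min(min(glass_stems), min(paper_stems))
last = max(max(glass_stems), max(paper_stems))

rows = []
for stem in range(first, last + 1):
    # Left-hand leaves are reversed so the smallest leaf is next to the stem.
    left = " ".join(str(leaf) for leaf in reversed(glass_stems.get(stem, [])))
    right = " ".join(str(leaf) for leaf in paper_stems.get(stem, []))
    rows.append((left, stem, right))

width = max(len(left) for left, _, _ in rows)
print(f"{'Glass':>{width}} |    | Paper")
for left, stem, right in rows:
    print(f"{left:>{width}} | {stem:>2} | {right}")
```

The extra stems 0, 1, 2, and 10 appear because the glass data are more spread out, which is the point made in the comparison above.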
Bivariate simply means two variables. All our previous work was with univariate, or single-variable data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.
Example: We have looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.
Table 2.17
Country | % of Paper Packaging Recycled | % of Glass Packaging Recycled |
Estonia | 34 | 64 |
New Zealand | 40 | 72 |
Poland | 40 | 27 |
Cyprus | 42 | 4 |
Portugal | 56 | 39 |
United States | 59 | 21 |
Italy | 62 | 56 |
Spain | 63 | 41 |
Australia | 66 | 44 |
Greece | 70 | 34 |
Finland | 70 | 56 |
Ireland | 70 | 55 |
Netherlands | 70 | 76 |
Sweden | 70 | 100 |
France | 76 | 59 |
Germany | 83 | 81 |
Austria | 83 | 44 |
Belgium | 83 | 98 |
Japan | 98 | 96 |
Figure: Paper and Glass Packaging Recycling Rates for 19 countries
We will place the paper recycling rates on the horizontal axis and those for glass on the vertical axis. Next, we will plot a point that shows each country's rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.
Recall that one of the things you saw from the stem-and-leaf plot is that, in general, a country's recycling rate for glass is lower than its paper recycling rate. On the next graph, we have plotted a line that represents the paper and glass recycling rates being equal. If all the countries had the same paper and glass recycling rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.
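The comparison against the line of equal rates can be checked directly from the bivariate table. The Python sketch below counts how many countries fall below, above, or on the line where the glass rate equals the paper rate; the variable names are illustrative.

```python
# (paper %, glass %) for each country, from the bivariate table above.
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

# With paper on the horizontal axis and glass on the vertical axis,
# a point is below the line glass = paper when glass < paper.
below = [c for c, (paper, glass) in rates.items() if glass < paper]
above = [c for c, (paper, glass) in rates.items() if glass > paper]
on_line = [c for c, (paper, glass) in rates.items() if glass == paper]

print(f"below the line: {len(below)}")
print(f"above the line: {len(above)} ({', '.join(sorted(above))})")
print(f"on the line:    {len(on_line)}")
```

Most of the 19 points fall below the line, which is the numerical counterpart of what the scatterplot shows.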
With univariate data, we initially characterize a data set by describing its shape, center, and spread. For bivariate data, we will also discuss three important characteristics: shape, direction, and strength. These characteristics will inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a cloud. If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.
Data that are oriented in this manner are said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate with the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.
The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between the glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association, while shorter and wider ones show a weaker linear relationship. In this example, there are some countries for which the glass and paper recycling rates do not seem to be related.
New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass recycling rates, and Austria (circled in green) is an example of a country with a much lower glass recycling rate than its paper recycling rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, weakening the association between the variables.
On the Web
http://tinyurl.com/y8vcm5y Guess the correlation.
Example: The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990s:
Table 2.18
Year | Municipal Waste Generated (Millions of Tons) |
1990 | 269 |
1991 | 294 |
1992 | 281 |
1993 | 292 |
1994 | 307 |
1995 | 323 |
1996 | 327 |
1997 | 327 |
1998 | 340 |
Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
In this example, the time in years is considered the explanatory variable, or independent variable, and the amount of municipal waste is the response variable, or dependent variable. It is not only the passage of time that causes our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. However, it would not make sense to view the relationship between time and municipal waste in the opposite direction.
When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable, and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case, we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data values are changing over a period of time. Here is the line plot for the US Municipal Waste data:
It is easy to see general trends from this type of plot. For example, we can spot the period in which the most dramatic increase occurred (from 1990 to 1991) by looking for the steepest line segment. We can also spot the periods in which the waste output decreased (from 1991 to 1992) or remained the same (from 1996 to 1997). It would be interesting to investigate some possible reasons for the behaviors of these individual years.
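The slopes that the eye picks out on the line plot can also be checked numerically. The following is a minimal sketch using the values from Table 2.18; the variable names are only illustrative, and each change is labeled by the year in which its line segment starts.

```python
# Municipal waste generated in the US (millions of tons), from Table 2.18.
waste = {
    1990: 269, 1991: 294, 1992: 281, 1993: 292, 1994: 307,
    1995: 323, 1996: 327, 1997: 327, 1998: 340,
}

years = sorted(waste)

# Change along each line segment, keyed by the year in which the segment starts.
changes = {start: waste[end] - waste[start] for start, end in zip(years, years[1:])}

# The steepest upward segment has the largest positive change.
steepest_start = max(changes, key=changes.get)

# Segments where the waste total decreased or stayed the same.
not_increasing = [start for start, delta in changes.items() if delta <= 0]

print("Year-to-year changes:", changes)
print("Largest increase begins in:", steepest_start)
print("Segments with no increase begin in:", not_increasing)
```

A flat or falling segment shows up as a change of zero or less, which matches what a horizontal or downward-sloping segment looks like on the plot.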
Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories. A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. They are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values. Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points. It can be identified as being weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data have a positive association. If the response variable generally decreases as the explanatory variable increases, the data have a negative association. In a line graph, there is significance to the change between consecutive points, so these points are connected. Line graphs are often used when the explanatory variable is time.
For a description of how to draw a stem-and-leaf plot, as well as how to derive information from one (14.0), see APUS07, Stem-and-Leaf Plot (8:08).
Table 2.19
Material | Kilograms |
Plastics | 6.21 |
Lead | 1.71 |
Aluminum | 3.83 |
Iron | 5.54 |
Copper | 2.12 |
Tin | 0.27 |
Zinc | 0.60 |
Nickel | 0.23 |
Barium | 0.05 |
Other elements and chemicals | 6.44 |
Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm
(a) Create a bar graph for this data.
(b) Complete the chart below to show the approximate percentage of the total weight for each material.
Table 2.20
Material | Kilograms | Approximate Percentage of Total Weight |
Plastics | 6.21 | |
Lead | 1.71 | |
Aluminum | 3.83 | |
Iron | 5.54 | |
Copper | 2.12 | |
Tin | 0.27 | |
Zinc | 0.60 | |
Nickel | 0.23 | |
Barium | 0.05 | |
Other elements and chemicals | 6.44 | |
(c) Create a circle graph for this data.
Table 2.21
State | Percentage |
Alabama | 23 |
Alaska | 7 |
Arizona | 18 |
Arkansas | 36 |
California | 30 |
Colorado | 18 |
Connecticut | 23 |
Delaware | 31 |
District of Columbia | 8 |
Florida | 40 |
Georgia | 33 |
Hawaii | 25 |
Illinois | 28 |
Indiana | 23 |
Iowa | 32 |
Kansas | 11 |
Kentucky | 28 |
Louisiana | 14 |
Maine | 41 |
Maryland | 29 |
Massachusetts | 33 |
Michigan | 25 |
Minnesota | 42 |
Mississippi | 13 |
Missouri | 33 |
Montana | 5 |
Nebraska | 27 |
Nevada | 15 |
New Hampshire | 25 |
New Jersey | 45 |
New Mexico | 12 |
New York | 39 |
North Carolina | 26 |
North Dakota | 21 |
Ohio | 19 |
Oklahoma | 12 |
Oregon | 28 |
Pennsylvania | 26 |
Rhode Island | 23 |
South Carolina | 34 |
South Dakota | 42 |
Tennessee | 40 |
Utah | 19 |
Vermont | 30 |
Virginia | 35 |
Washington | 48 |
West Virginia | 20 |
Wisconsin | 36 |
Wyoming | 5 |
Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
(a) Create a dot plot for this data.
(b) Discuss the shape, center, and spread of this distribution.
(c) Create a stem-and-leaf plot for the data.
(d) Use your stem-and-leaf plot to find the median percentage for this data.
distributions.
Questions 4-7 refer to the following dot plots:
Table 2.22
State | Percentage | Total Amount of Municipal Waste in Thousands of Tons |
Alabama | 23 | 5549 |
Alaska | 7 | 560 |
Arizona | 18 | 5700 |
Arkansas | 36 | 4287 |
California | 30 | 45000 |
Colorado | 18 | 3084 |
Connecticut | 23 | 2950 |
Delaware | 31 | 1189 |
District of Columbia | 8 | 246 |
Florida | 40 | 23617 |
Georgia | 33 | 14645 |
Hawaii | 25 | 2125 |
Illinois | 28 | 13386 |
Indiana | 23 | 7171 |
Iowa | 32 | 3462 |
Kansas | 11 | 4250 |
Kentucky | 28 | 4418 |
Louisiana | 14 | 3894 |
Maine | 41 | 1339 |
Maryland | 29 | 5329 |
Massachusetts | 33 | 7160 |
Michigan | 25 | 13500 |
Minnesota | 42 | 4780 |
Mississippi | 13 | 2360 |
Missouri | 33 | 7896 |
Montana | 5 | 1039 |
Nebraska | 27 | 2000 |
Nevada | 15 | 3955 |
New Hampshire | 25 | 1200 |
New Jersey | 45 | 8200 |
New Mexico | 12 | 1400 |
New York | 39 | 28800 |
North Carolina | 26 | 9843 |
North Dakota | 21 | 510 |
Ohio | 19 | 12339 |
Oklahoma | 12 | 2500 |
Oregon | 28 | 3836 |
Pennsylvania | 26 | 9440 |
Rhode Island | 23 | 477 |
South Carolina | 34 | 8361 |
South Dakota | 42 | 510 |
Tennessee | 40 | 9496 |
Utah | 19 | 3760 |
Vermont | 30 | 600 |
Virginia | 35 | 9000 |
Washington | 48 | 6527 |
West Virginia | 20 | 2000 |
Wisconsin | 36 | 3622 |
Wyoming | 5 | 530 |
(a) Identify the variables in this example, and specify which one is the explanatory variable and which one is the response variable.
(b) How much municipal waste was created in Illinois?
(c) Draw a scatterplot for this data.
(d) Describe the direction and strength of the association between the two variables.
2001.
Technology Notes: Scatterplots on the TI-83/84 Graphing Calculator
Press [STAT][ENTER], and enter the following data, with the explanatory variable in L1 and the response variable in L2. Next, press [2ND][STAT-PLOT] to enter the STAT-PLOTS menu, and choose the first plot.
Change the settings to match the following screenshot:
This selects a scatterplot with the explanatory variable in L1 and the response variable in L2. In order to see the points better, you should choose either the square or the plus sign for the mark. The square has been chosen in the screenshot. Finally, set the window as shown below to match the data. In this case, we looked at our lowest and highest data values in each variable and added a bit of room to create a pleasant window. Press [GRAPH] to see the result, which is also shown below.
Line Plots on the TI-83/84 Graphing Calculator
Your graphing calculator will also draw a line plot, and the process is almost identical to that for creating a scatterplot. Enter the data into your lists, and choose a line plot in the Plot1 menu, as in the following screenshot.
Next, set an appropriate window (not necessarily the one shown below), and graph the resulting plot.
In this section, the box-and-whisker plot will be introduced, and the basic ideas of shape, center, spread, and outliers will be studied in this context.
The five-number summary is a numerical description of a data set consisting of the following measures (in order): minimum value, lower quartile, median, upper quartile, maximum value.
Example: The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the available water resources and increased the need to protect them in the years to come. Here is a listing of the reservoir capacities of the major water sources for Arizona:
Table 2.23
Lake/Reservoir | % of Capacity |
Salt River System | 59 |
Lake Pleasant | 49 |
Verde River System | 33 |
San Carlos | 9 |
Lyman Reservoir | 3 |
Show Low Lake | 51 |
Lake Havasu | 98 |
Lake Mohave | 85 |
Lake Mead | 95 |
Lake Powell | 89 |
Figure: Arizona Reservoir Capacity, 12 / 31 / 98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html
This data set was collected in 1998, and the water levels in many states have since taken a dramatic turn for the worse. For example, Lake Powell is currently at less than 50% of capacity.
Placing the data in order from smallest to largest gives the following:
3, 9, 33, 49, 51, 59, 85, 89, 95, 98
Since there are 10 numbers, the median is the average of 51 and 59, which is 55. Recall that the lower quartile is the 25th percentile, or the value below which 25% of the data fall. In this data set, that number is 33. Also, the upper quartile is 89. Therefore, the five-number summary is as shown:
$\{3, 33, 55, 89, 98\}$
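The five-number summary for the Arizona data can be reproduced with a short script. This is a sketch under one stated assumption: the quartiles are taken as the medians of the lower and upper halves of the ordered data (leaving out the middle value when the count is odd), which is the convention that matches the values in this example. Other software may use slightly different quartile rules. The function name `five_number_summary` is hypothetical, and the function expects at least two data values.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumption: quartiles are the medians of the lower and upper halves
    of the ordered data; the middle value is excluded when the count is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]    # smallest half of the values
    upper_half = data[-half:]   # largest half of the values
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


# Percent of capacity for the Arizona reservoirs in Table 2.23.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]

print(five_number_summary(arizona))
```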
A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the 'box' part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values using line segments to form the 'whiskers'. Here is the box plot for this data:
The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as the two sections in the box. In this example, because there are 10 data points, the number of values in each section will only be approximately the same, but about 25% of the data appears in each section. You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data will be symmetric. In this example, the different sections are not exactly the same length. The left whisker is slightly longer than the right, and the right half of the box is slightly longer than the left. We would most likely say that this distribution is moderately symmetric. In other words, there is roughly the same amount of data in each section. The different lengths of the sections tell us how the data are spread in each section. The numbers in the left whisker (lowest 25% of the data) are spread more widely than those in the right whisker.
Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:
In this case, the third quarter of data (between the median and upper quartile) appears to be a bit more densely concentrated in a smaller area. The data values in the lower whisker also appear to be much more widely spread than in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).
Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets. In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other, or drawn side-by-side for easy comparison. Here are the two box plots:
The plots seem to be spread the same if we just look at the range, but with the box plots, we have an additional indicator of spread if we examine the length of the box (or interquartile range). This tells us how the middle 50% of the data is spread, and Arizona's data values appear to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona's capacities are lower. Recall that the median is a resistant measure of center, because it is not affected by outliers. The mean is not resistant, because it will be pulled toward outlying points. When a data set is skewed strongly in a particular direction, the mean will be pulled in the direction of the skewing, but the median will not be affected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.
Even though we wouldn't characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.
Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median, due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.
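For the Arizona data, the comparison between the mean and the median can be checked directly. This is a small sketch using only the values in Table 2.23; the Colorado values are not listed in this section, so they are not included.

```python
from statistics import mean, median

# Percent of capacity for the Arizona reservoirs in Table 2.23.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]

arizona_mean = mean(arizona)
arizona_median = median(arizona)

print("Mean:  ", arizona_mean)
print("Median:", arizona_median)
print("Mean is higher than median:", arizona_mean > arizona_median)
```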
Here are the reservoir data for California (the names of the lakes and reservoirs have been omitted):
80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75
At first glance, the 34 should stand out. It appears as if this point is significantly different from the rest of the data. Let's use a graphing calculator to investigate this plot. Enter your data into a list as we have done before, and then choose a plot. Under 'Type', you will notice what looks like two different box and whisker plots. For now choose the second one (even though it appears on the second line, you must press the right arrow to select these plots).
Setting a window is not as important for a box plot, so we will use the calculator's ability to automatically scale a window to our data by pressing [ZOOM] and selecting '9:Zoom Stat'.
While box plots give us a nice summary of the important features of a distribution, we lose the ability to identify individual points. The left whisker is elongated, but if we did not have the data, we would not know if all the points in that section of the data were spread out, or if it were just the result of the one outlier. It is more typical to use a modified box plot. This box plot will show an outlier as a single, disconnected point and will stop the whisker at the previous point. Go back and change your plot to the first box plot option, which is the modified box plot, and then graph it.
Notice that without the outlier, the distribution is really roughly symmetric.
This data set had one obvious outlier, but when is a point far enough away to be called an outlier? We need a standard accepted practice for defining an outlier in a box plot. This rather arbitrary definition is that any point that is more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile will be considered an outlier. Because the IQR is the same as the length of the box, any point that is more than one-and-a-half box lengths from either quartile is plotted as an outlier.
A common misconception of students is that you stop the whisker at this boundary line. In fact, the last point on the whisker that is not an outlier is where the whisker stops.
The calculations for determining the outlier in this case are as follows:
Lower Quartile: 74
Upper Quartile: 85
Interquartile range (IQR): $85 - 74 = 11$
Cut-off for outliers in left whisker: $74 - (1.5 \times 11) = 57.5$. Thus, any value less than 57.5 is considered an outlier.
Notice that we did not even bother to test the calculation on the right whisker, because it should be obvious from a quick visual inspection that there are no points that are farther than even one box length away from the upper quartile.
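The outlier rule and the whisker endpoints for the California data can be sketched in a few lines. As before, this assumes the quartiles are the medians of the lower and upper halves of the ordered data; the names `lower_fence` and `upper_fence` are illustrative labels for the cut-off values, not terms used in this chapter.

```python
from statistics import median

# Percent of capacity for the California reservoirs.
california = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]

data = sorted(california)
half = len(data) // 2
q1 = median(data[:half])    # lower quartile
q3 = median(data[-half:])   # upper quartile
iqr = q3 - q1               # length of the box

# Cut-offs: one-and-a-half box lengths beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Each whisker stops at the most extreme value that is not an outlier,
# not at the cut-off itself.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_ends = (min(inside), max(inside))

print("Quartiles:", q1, q3, "IQR:", iqr)
print("Cut-offs:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Whiskers end at:", whisker_ends)
```

Running the check on both sides confirms the visual inspection: no value lies beyond the upper cut-off, so the right whisker extends all the way to the maximum.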
If you press [TRACE] and use the left or right arrows, the calculator will trace the values of the five-number summary, as well as the outlier.
In the previous lesson, we looked at data for the materials in a typical desktop computer.
Table 2.24
Material | Kilograms |
Plastics | 6.21 |
Lead | 1.71 |
Aluminum | 3.83 |
Iron | 5.54 |
Copper | 2.12 |
Tin | 0.27 |
Zinc | 0.60 |
Nickel | 0.23 |
Barium | 0.05 |
Other elements and chemicals | 6.44 |
Here is the data set given in pounds. The weight of each in kilograms was multiplied by 2.2.
Table 2.25
Material | Pounds |
Plastics | 13.7 |
Lead | 3.8 |
Aluminum | 8.4 |
Iron | 12.2 |
Copper | 4.7 |
Tin | 0.6 |
Zinc | 1.3 |
Nickel | 0.5 |
Barium | 0.1 |
Other elements and chemicals | 14.2 |
When all values are multiplied by a factor of 2.2, the calculation of the mean is also multiplied by 2.2, so the center of the distribution would be increased by the same factor. Similarly, calculations of the range, interquartile range, and standard deviation will also be increased by the same factor. In other words, the center and the measures of spread will increase proportionally.
Example: This is easier to think of with numbers. Suppose that your mean is 20, and that two of the data values in your distribution are 21 and 23. If you multiply 21 and 23 by 2, you get 42 and 46, and your mean also changes by a factor of 2 and is now 40. Before, your deviations were $21 - 20 = 1$ and $23 - 20 = 3$, but now, your deviations are $42 - 40 = 2$ and $46 - 40 = 6$, so your deviations are getting twice as big as well.
This should result in the graph maintaining the same shape, but being stretched out, or elongated. Here are the side-by-side box plots for both distributions showing the effects of changing units.
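The effect of the unit change can also be verified numerically. The sketch below converts the kilogram values from Table 2.24 using the factor 2.2 without rounding, so the ratios come out as 2.2 up to floating-point error; the rounded values in Table 2.25 would give ratios that are only approximately 2.2. The population standard deviation (`pstdev`) is used here as an assumption, and the sample standard deviation scales in the same way.

```python
from statistics import mean, median, pstdev

# Weights in kilograms from Table 2.24.
kilograms = [6.21, 1.71, 3.83, 5.54, 2.12, 0.27, 0.60, 0.23, 0.05, 6.44]

factor = 2.2
pounds = [factor * kg for kg in kilograms]


def value_range(values):
    """Range of a data set: maximum minus minimum."""
    return max(values) - min(values)


measures = {
    "mean": mean,
    "median": median,
    "range": value_range,
    "standard deviation": pstdev,
}

# Ratio of each measure in pounds to the same measure in kilograms.
ratios = {name: stat(pounds) / stat(kilograms) for name, stat in measures.items()}

for name, ratio in ratios.items():
    print(f"{name}: scaled by {ratio:.4f}")
```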
On the Web
http://tinyurl.com/34s6sm Investigate the mean, median and box plots.
http://tinyurl.com/3ao9px More investigation of boxplots.
The five-number summary is a useful collection of statistical measures consisting of the following in ascending order: minimum, lower quartile, median, upper quartile, maximum. A box-and-whisker plot is a graphical representation of the five-number summary showing a box bounded by the lower and upper quartiles and the median as a line in the box. The whiskers are line segments extended from the quartiles to the minimum and maximum values. Each whisker and section of the box contains approximately 25% of the data. The width of the box is the interquartile range, or IQR, and shows the spread of the middle 50% of the data. Box-and-whisker plots are effective at giving an overall impression of the shape, center, and spread of a data set. While an outlier is simply a point that is not typical of the rest of the data, there is an accepted definition of an outlier in the context of a box-and-whisker plot. Any point that is more than 1.5 times the length of the box from either end of the box is considered to be an outlier. When changing the units of a distribution, the center and spread will be affected, but the shape will stay the same.
For a description of how to draw a box-and-whisker plot from given data (14.0), see patrickJMT, Box and Whisker Plot (5:53).
Idaho.
Utah.
As an interesting extension to this problem, you could look up the current data and compare that distribution with the data presented here. You could also find the exchange rate for Canadian dollars and convert the prices into the other currency.
Table 2.26
State | Average Price of a Gallon of Gasoline (US$) | Average Price of a Liter of Gasoline (US$) |
Alaska | 3.458 | |
Washington | 3.528 | |
Idaho | 3.26 | |
Montana | 3.22 | |
North Dakota | 3.282 | |
Minnesota | 3.12 | |
Michigan | 3.352 | |
New York | 3.393 | |
Vermont | 3.252 | |
New Hampshire | 3.152 | |
Maine | 3.309 | |
Figure: Average prices of a gallon of gasoline on March 16, 2008. Source: AAA, http://www.fuelgaugereport.com/sbsavg.asp
histogram?
read.
Unfortunately, the copier cut off the bin with the highest frequency. Which of the following could possibly be the relative frequency of the cut-off bin?
graph:
Identify which of the two graphs she has and briefly explain why.
In questions 4-7, match the distribution with the choice of the correct real-world situation that best fits the graph.
histogram?
Table 2.27
Building | City | Height (ft) |
Taipei 101 | Taipei | 1671 |
Shanghai World Financial Center | Shanghai | 1614 |
Petronas Tower | Kuala Lumpur | 1483 |
Sears Tower | Chicago | 1451 |
Jin Mao Tower | Shanghai | 1380 |
Two International Finance Center | Hong Kong | 1362 |
CITIC Plaza | Guangzhou | 1283 |
Shun Hing Square | Shenzhen | 1260 |
Empire State Building | New York | 1250 |
Central Plaza | Hong Kong | 1227 |
Bank of China Tower | Hong Kong | 1205 |
Bank of America Tower | New York | 1200 |
Emirates Office Tower | Dubai | 1163 |
Tuntex Sky Tower | Kaohsiung | 1140 |
The chart lists the 15 tallest buildings in the world (as of 12/2007).
(a) Complete the table below, and draw an ogive plot of the resulting data.
Table 2.28
Class | Frequency | Relative Frequency | Cumulative Frequency | Relative Cumulative Frequency |
(b) Use your ogive plot to approximate the median height for this data.
(c) Use your ogive plot to approximate the upper and lower quartiles.
(d) Find the 90th percentile for this data (i.e., the height that 90% of the data is less than).
Table 2.29
Year | Adults | Jacks |
1971-1975 | 164,947 | 37,409 |
1976-1980 | 154,059 | 29,117 |
1981-1985 | 169,034 | 45,464 |
1986-1990 | 182,815 | 35,021 |
1991-1995 | 158,485 | 28,639 |
1996 | 299,590 | 40,078 |
1997 | 342,876 | 38,352 |
1998 | 238,059 | 31,701 |
1998 | 395,942 | 37,567 |
1999 | 416,789 | 21,994 |
2000 | 546,056 | 33,439 |
2001 | 775,499 | 46,526 |
2002 | 521,636 | 29,806 |
2003 | 283,554 | 67,660 |
2004 | 394,007 | 18,115 |
2005 | 267,908 | 8,048 |
2006 | 87,966 | 1,897 |
Figure: Total Fall Salmon Escapement in the Sacramento River. Source: http://www.pcouncil.org/newsreleases/Sacto_adult_and_jack_escapement_thru%202007.pdf
During the years from 1971 to 1995, only 5-year averages are available.
In case you are not up on your salmon facts, there are two terms in this chart that may be unfamiliar. Fish escapement refers to the number of fish who escape the hazards of the open ocean and return to their freshwater streams and rivers to spawn. A Jack salmon is a fish that returns to spawn before reaching full adulthood.
(a) Create one line graph that shows both the adult and jack populations for these years. The data from 1971 to 1995 represent the five-year averages. Devise an appropriate method for displaying this on your line plot while maintaining consistency.
(b) Write at least two complete sentences that explain what this graph tells you about the change in the salmon population over time.
Table 2.30
Island | Approximate Area (sq. km) |
Baltra | 8 |
Darwin | 1.1 |
Española | 60 |
Fernandina | 642 |
Floreana | 173 |
Genovesa | 14 |
Isabela | 4640 |
Marchena | 130 |
North Seymour | 1.9 |
Pinta | 60 |
Pinzón | 18 |
Rabida | 4.9 |
San Cristóbal | 558 |
Santa Cruz | 986 |
Santa Fe | 24 |
Santiago | 585 |
South Plaza | 0.13 |
Wolf | 1.3 |
Figure: Land Area of Major Islands in the Galapagos Archipelago. Source: http://en.wikipedia.org/wiki/Gal%C3%A1pagos_Islands
(a) Choose two methods for representing this data, one categorical, and one numerical, and draw the plot using your chosen method.
(b) Write a few sentences commenting on the shape, spread, and center of the distribution in the context of the original data. You may use summary statistics to back up your statements.
Keywords
Back-to-back stem plots
Bar graph
Bias
Bivariate data
Box-and-whisker plot
Cumulative frequency histogram
Density curves
Dot plot
Explanatory variable
Five-number summary
Frequency polygon
Frequency tables
Histogram
Modified box plot
Mound-shaped
Negative linear association
Ogive plot
Pie graph
Positive linear association
Relative cumulative frequency histogram
Relative cumulative frequency plot
Relative frequency histogram
Response variable
Scatterplot
Skewed left
Skewed right
Stem-and-leaf plot
Symmetric
Tail