AXIS SHENANIGANS

The human brain did not evolve to process large amounts of numerical data presented as text; instead, our eyes look for patterns in data that are visually displayed. The most accurate but least interpretable form of data presentation is to make a table, showing every single value. But it is difficult or impossible for most people to detect patterns and trends in such data, and so we rely on graphs and charts. Graphs come in two broad types: Either they represent every data point visually (as in a scatter plot) or they implement a form of data reduction in which we summarize the data, looking, for example, only at means or medians.

There are many ways that graphs can be used to manipulate, distort, and misrepresent data. The careful consumer of information will avoid being drawn in by them.

Unlabeled Axes

The most fundamental way to lie with a statistical graph is to not label the axes. If your axes aren’t labeled, you can draw or plot anything you want! Here is an example from a poster presented at a conference by a student researcher, which looked like this (I’ve redrawn it here):

What does all that mean? From the text on the poster itself (though not on this graph), we know that the researchers are studying brain activations in patients with schizophrenia (SZ). What are HCs? We aren’t told, but from the context—they’re being compared with SZ—we might assume that it means “healthy controls.” Now, there do appear to be differences between the HCs and the SZs, but, hmmm . . . the y-axis has numbers, but . . . the units could be anything! What are we looking at? Scores on a test, levels of brain activations, number of brain regions activated? Number of Jell-O brand pudding cups they’ve eaten, or number of Johnny Depp movies they’ve seen in the last six weeks? (To be fair, the researchers subsequently published their findings in a peer-reviewed journal, and corrected this error after a website pointed out the oversight.)

In the next example, gross sales of a publishing company are plotted, excluding data from Kickstarter campaigns.

As in the previous example, we have numbers without labels, this time on the x-axis. In this case, it’s probably self-evident: We assume that the 2010, 2011, etc., refer to calendar or fiscal years of operation, and the fact that the lines are jagged between the years suggests that the data are being tracked monthly (but without proper labeling we can only assume). The y-axis is completely missing, so we don’t know what is being measured (is it units sold or dollars?), and we don’t know what each horizontal line represents. The graph could be depicting an increase of sales from 50 cents a year to $5 a year, or from 50 million to 500 million units. Not to worry—a helpful narrative accompanied this graph: “It’s been another great year.” I guess we’ll have to take their word for it.

Truncated Vertical Axis

A well-designed graph clearly shows you the relevant end points of a continuum. This is especially important if you’re documenting some actual or projected change in a quantity, and you want your readers to draw the right conclusions. If you’re representing crime rate, deaths, births, income, or any quantity that could take on a value of zero, then zero should be the minimum point on your graph. But if your aim is to create panic or outrage, start your y-axis somewhere near the lowest value you’re plotting—this will emphasize the difference you’re trying to highlight, because the eye is drawn to the size of the difference as shown on the graph, and the actual size of the difference is obscured.

In 2012, Fox News broadcast the following graph to show what would happen if the Bush tax cuts were allowed to expire:

The graph gives the visual impression that taxes would increase by a large amount: The right-hand bar is six times the height of the left-hand bar. Who wants their taxes to go up by a factor of six? Viewers who are number-phobic, or in a hurry, may not take the time to examine the axis to see that the actual difference is between a tax rate of 35 percent and one of 39.6 percent. That is, if the cuts expire, taxes will increase by only 13 percent, not the sixfold jump that is pictured (the 4.6 percentage point increase is 13 percent of 35 percent).
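The arithmetic behind that 13 percent figure is easy to check. A quick sketch, using the two rates quoted above:

```python
# The truncated bar chart shows a sixfold visual jump; the real change
# is from a 35% top rate to a 39.6% top rate (figures quoted in the text).
before, after = 35.0, 39.6               # top tax rate, in percent

point_change = after - before            # absolute change, in percentage points
relative_change = point_change / before  # change relative to the old rate

print(f"{point_change:.1f} points = {relative_change:.0%} relative increase")
```

The distinction the graph exploits is exactly the one between percentage points (4.6) and percent change (13).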

If the y-axis started at zero, the 13 percent would be apparent visually:

Discontinuity in Vertical or Horizontal Axis

Imagine a city where crime has been growing at a rate of 5 percent per year for the last ten years. You might graph it this way:

Nothing wrong with that. But suppose that you’re selling home security systems and so you want to scare people into buying your product. Using all the same data, just create a discontinuity in your x-axis. This will distort the truth and deceive the eye marvelously:

Here, the visual gives the impression that crime has increased dramatically. But you know better. The discontinuity in the x-axis crams five years’ worth of numbers into the same amount of graphic real estate as was used for two years. No wonder there’s an apparent increase. This is a fundamental flaw in graph making, but because most readers don’t bother to look at the axes too closely, this one’s easy to get away with.

And you don’t have to limit your creativity to breaking the x-axis; you can get the effect by creating a discontinuity in the y-axis, and then hiding it by not breaking the line. While we’re at it, we’ll truncate the y-axis:

This is a bit mean. Most readers just look at that curve within the plot frame and won’t notice that the tick marks on the vertical axis start out being forty reports between each, and then suddenly, at two hundred, indicate only eight reports between each. Are we having fun yet?

The honorable move is to use the first crime graph presented with the proper continuous axis. Now, to critically evaluate the statistics, you might ask if there are factors in the way the data were collected or presented that could be hiding an underlying truth.

One possibility is that the increases occur in only one particularly bad neighborhood and that, in fact, crime is decreasing everywhere else in the city. Maybe the police and the community have simply decided that a particular neighborhood had become unmanageable and so they stopped enforcing laws there. The city as a whole is safe—perhaps even safer than before—and one bad neighborhood is responsible for the increase.

Another possibility is that by amalgamating all the different sorts of complaints into the catchall bin of crime, we are overlooking a serious consideration. Perhaps violent crime has dropped to almost zero, and in its place, with so much time on their hands, the police are issuing hundreds more jaywalking tickets.

Perhaps the most obvious question to ask next, in your effort to understand what this statistic really means, is “What happened to the total population in this city during that time period?” If the population increased at any rate greater than 5 percent per year, the crime rate has actually gone down on a per-person basis. We could show this by plotting crimes committed per ten thousand people in the city:
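The effect of normalizing by population can be sketched numerically. The counts below are invented for illustration: crime growing 5 percent a year, population growing roughly 7 percent a year:

```python
# Hypothetical numbers: raw crime counts rise 5% per year over a decade,
# but the population grows faster (about 7% per year), so the rate falls.
crimes_start, crimes_end = 1_000, 1_629   # ten years of 5% annual growth
pop_start, pop_end = 100_000, 196_715     # ten years of ~7% annual growth

rate_start = crimes_start / pop_start * 10_000  # crimes per 10,000 residents
rate_end = crimes_end / pop_end * 10_000

# Raw crime is up about 63%, yet the per-capita rate has gone down.
print(round(rate_start, 1), round(rate_end, 1))
```

The raw count and the rate move in opposite directions, which is exactly why the per-person graph tells a different story.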

Choosing the Proper Scale and Axis

You’ve been hired by your local Realtor to graph the change in home prices in your community over the last decade. The prices have been steadily growing at a rate of 15 percent per year.

If you want to really alarm people, why not change the x-axis to include dates that you don’t have data for? Padding the x-axis with extra dates compresses the portion of the curve that holds actual data, artificially steepening its slope:

Notice how this graph tricks your eye (well, your brain) into drawing two false conclusions—first, that sometime around 1990 home prices must have been very low, and second, that by 2030 home prices will be so high that few people will be able to afford a home. Better buy one now!

Both of these graphs distort what’s really going on, because they make a steady rate of growth appear, visually, to be an increasing rate of growth. On the first graph, the 15 percent growth seems twice as high on the y-axis in 2014 as it does in 2006. Many things change at a constant rate: salaries, prices, inflation, population of a species, and victims of diseases. When you have a situation of steady growth (or decline), the most accurate way to represent the data is on a logarithmic scale. The logarithmic scale allows equal percentage changes to be represented by equal distances on the y-axis. A constant annual rate of change then shows up as a straight line, like this:
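The straight-line property is easy to verify numerically: constant 15 percent growth means equal ratios between years, so the logarithms of the prices have equal differences, which plots as a straight line on a log axis. The starting price and time span below are assumed values:

```python
import math

# Constant 15% annual growth from an assumed $200,000 starting price.
prices = [200_000 * 1.15 ** year for year in range(11)]

# On a log scale, each year's step is the same: log10 of the growth factor.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(prices, prices[1:])]
print(all(abs(step - math.log10(1.15)) < 1e-9 for step in log_steps))  # True
```

Equal steps on the log scale are what make the plotted line straight, no matter how large the raw prices become.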

The Dreaded Double Y-Axis

The graph maker can get away with all kinds of lies simply armed with the knowledge that most readers will not look at the graph very closely. This can move a great many people to believe all kinds of things that aren’t so. Consider the following graph, showing the life expectancy of smokers versus nonsmokers at age twenty-five.

This makes clear two things: The dangers of smoking accumulate over time, and smokers are likely to die earlier than nonsmokers. The difference isn’t big at age forty, but by age eighty the risk more than doubles, from under 30 percent to over 60 percent. This is a clean and accurate way to present the data. But suppose you’re a fourteen-year-old smoker who wants to convince your parents that you should be allowed to smoke. This graph is clearly not going to help you. So you dig deep into your bag of tricks and use the double y-axis, adding a y-axis to the right-hand side of the graph frame, with a different scaling factor that applies only to the nonsmokers. Once you do that, your graph looks like this:

From this, it looks like you’re just as likely to die from smoking as from not smoking. Smoking won’t harm you—old age will! The trouble with double y-axis graphs is that you can always scale the second axis any way that you choose.

Forbes magazine, a venerable and typically reliable news source, ran a graph very much like this one to show the relation between expenditures per public school student and those students’ scores on the SAT, a widely used standardized test for college admission in the United States.

From the graph, it looks as though increasing the money spent per student (black line) doesn’t do anything to increase their SAT scores (gray line). The story that some anti–government spending politicos could tell about this is one of wasted taxpayer funds. But you now understand that the choice of scale for the second (right-hand) y-axis is arbitrary. If you were a school administrator, you might simply take the exact same data, change the scale of the right-hand axis, and voilà—increasing spending delivers a better education, as evidenced by the increase in SAT scores!

This graph obviously tells a very different story. Which one is true? You’d need a measure of how one variable changes as a function of the other, a statistic known as a correlation. Correlations range from −1 to 1. A correlation of 0 means that one variable is not related to the other at all. A correlation of −1 means that as one variable goes up, the other goes down, in precise synchrony. A correlation of 1 means that as one variable goes up, the other does too, also in precise synchrony. The first graph appears to be illustrating a correlation of 0; the second appears to be representing one that is close to 1. The actual correlation for this dataset is .91, a very strong correlation. Spending more on students is, at least in this dataset, associated with better SAT scores.

The correlation also lets you estimate how much of the result can be explained by the variables you’re looking at. It is the square of the correlation that gives the proportion of variance explained: .91 squared is about .83, which tells us that roughly 83 percent of the variation in students’ SAT scores can be accounted for by the amount of school expenditures per student. That is, it tells us to what extent expenditures explain the diversity in SAT scores.
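A correlation can be computed directly from paired data. The numbers below are invented for illustration (they are not the Forbes data), and note that it is the square of the correlation, not the correlation itself, that estimates the share of variance explained:

```python
from math import sqrt

# Made-up illustrative data: spending per student vs. mean SAT score.
spending = [4000, 5000, 6000, 7000, 8000, 9000]
sat = [880, 910, 905, 950, 970, 990]

def pearson_r(xs, ys):
    """Pearson correlation coefficient; always falls in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(spending, sat)
r_squared = r ** 2  # coefficient of determination: share of variance explained
print(round(r, 2), round(r_squared, 2))
```

Crucially, r is the same no matter how either y-axis is scaled: rescaling an axis changes the picture, but not this statistic.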

A controversy about the double y-axis graph erupted in the fall of 2015 during a U.S. congressional committee meeting. Rep. Jason Chaffetz presented a graph that plotted two services provided by the organization Planned Parenthood: abortions, and cancer screening and prevention:

The congressman was attempting to make a political point, that over a seven-year period, Planned Parenthood has increased the number of abortions it performed (something he opposes) and decreased the number of cancer screening and prevention procedures. Planned Parenthood doesn’t deny this, but this distorted graph makes it seem that the number of abortion procedures exceeded those for cancer. Maybe the graph maker was feeling a bit guilty and so included the actual numbers next to the data points. Let’s accept her bread crumbs and look closely. The number of abortions in 2013, the most recent year given, is 327,000. The number of cancer services was nearly three times that, at 935,573. (By the way, it’s a bit suspicious that the abortion numbers are such tidy, round numbers while the cancer numbers are so precise.) This is a particularly sinister example: an implied double y-axis graph with no axes on either side!

Drawn properly, the graph would look like this:

Here, we see that abortions increased modestly, compared to the reduction in cancer services.

There is another thing suspicious about the original graph: Such smooth lines are rarely found in data. It seems more likely that the graph maker simply took numbers for two particular years, 2006 and 2013, and compared them, drawing a smooth connecting line between them. Perhaps these particular years were chosen intentionally to emphasize differences. Perhaps there were great fluctuations in the intervening years of 2007–2012; we don’t know. The smooth lines give the impression of a perfectly linear (straight line) function, which is very unlikely.

Graphs such as this do not always tell the story that people think they do. Is there something that could account for these data, apart from a narrative that Planned Parenthood is on a mission to perform as many abortions as it can (and to let people die of cancer at the same time)? Look at the second graph. In 2006, Planned Parenthood performed 2,007,371 cancer services, and 289,750 abortions, nearly seven times as many cancer services as abortions. By 2013, this gap had narrowed, but the number of cancer services was still nearly three times the number of abortions.
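The ratios quoted here follow directly from the figures in the text:

```python
# Service counts quoted in the text (Planned Parenthood, 2006 and 2013).
cancer_2006, abortions_2006 = 2_007_371, 289_750
cancer_2013, abortions_2013 = 935_573, 327_000

ratio_2006 = cancer_2006 / abortions_2006  # nearly seven times as many
ratio_2013 = cancer_2013 / abortions_2013  # nearly three times as many

print(round(ratio_2006, 1), round(ratio_2013, 1))  # prints: 6.9 2.9
```

Both ratios stay well above one, which is precisely what the axis-free original graph hides.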

Cecile Richards, the president of Planned Parenthood, had an explanation for this narrowing gap. Changing medical guidelines for some anti-cancer services, like Pap smears, reduced the number of people for whom screening was recommended. Other changes, such as social attitudes about abortion, changing ages of the population, and increased access to health care alternatives, all influence these numbers, and so the data presented do not prove that Planned Parenthood has a pro-abortion agenda. It might—these data are just not the proof.