Chapter 1. Visualizing Information: First Impressions


Can’t tell your facts from your figures?

Statistics help you make sense of confusing sets of data. They make the complex simple. And when you’ve found out what’s really going on, you need a way of visualizing it and telling everyone else. So if you want to pick the best chart for the job, grab your coat, pack your best slide rule, and join us on a ride to Statsville.

Everywhere you look you can find statistics, whether you’re browsing the Internet, playing sports, or looking through the top scores of your favorite video game. But what actually is a statistic?

Statistics are numbers that summarize raw facts and figures in some meaningful way. They present key ideas that may not be immediately apparent by just looking at the raw data, and by data, we mean facts or figures from which we can draw conclusions. As an example, you don’t have to wade through lots of football scores when all you want to know is the league position of your favorite team. You need a statistic to quickly give you the information you need.

The study of statistics covers where statistics come from, how to calculate them, and how you can use them effectively.


Understanding what’s really going on with statistics empowers you. If you really get statistics, you’ll be able to make objective decisions, make accurate predictions that seem inspired, and convey the message you want in the most effective way possible.

Statistics can be a convenient way of summarizing key truths about data, but there’s a dark side too.


Statistics are based on facts, but even so, they can sometimes be misleading. They can be used to tell the truth—or to lie. The problem is how do you know when you’re being told the truth, and when you’re being told lies?

Having a good understanding of statistics puts you in a strong position. You’re much better equipped to tell when statistics are inaccurate or misleading. In other words, studying statistics is a good way of making sure you don’t get fooled.

As an example, take a look at the profits made by a company in the latter half of last year.

Month               Jul   Aug   Sep   Oct   Nov   Dec
Profit (millions)   2.0   2.1   2.2   2.1   2.3   2.4

image with no caption

How can there be two interpretations of the same set of data? Let’s take a closer look.

So how can we explore these two different interpretations of the same data? What we need is some way of visualizing them. If you need to visualize information, there’s no better way than using a chart or graph. They can be a quick way of summarizing raw information and can help you get an impression of what’s going on at a glance. But you need to be careful because even the simplest chart can be used to subtly mislead and misdirect you.

Here are two time graphs showing a company’s profits for six months. They’re both based on the same information, yet they give drastically different impressions of it. So why do they look so different?

image with no caption
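One common way for two graphs of the same data to look so different is the point where the vertical axis starts. Here’s a rough sketch in Python that uses the profit figures from the table above and compares how tall the December point looks next to July under two different axis choices. The starting value of 1.9 and the apparent_ratio helper are our own illustrative assumptions, not values read from the original graphs.

```python
# Monthly profits in millions, copied from the table above.
profits = {"Jul": 2.0, "Aug": 2.1, "Sep": 2.2, "Oct": 2.1, "Nov": 2.3, "Dec": 2.4}


def apparent_ratio(first, last, axis_start):
    """Return how many times taller `last` is drawn than `first`
    when the vertical axis begins at `axis_start`."""
    return (last - axis_start) / (first - axis_start)


# Axis starting at 0: drawn heights match the real values.
ratio_from_zero = apparent_ratio(profits["Jul"], profits["Dec"], axis_start=0.0)

# Hypothetical axis starting at 1.9: the same growth looks far steeper.
ratio_truncated = apparent_ratio(profits["Jul"], profits["Dec"], axis_start=1.9)

print(f"Axis from 0:   Dec looks {ratio_from_zero:.1f} times as tall as Jul")
print(f"Axis from 1.9: Dec looks {ratio_truncated:.1f} times as tall as Jul")
```

With the axis starting at 0, December looks about 1.2 times as tall as July. With the assumed start of 1.9, it looks about 5 times as tall, even though the data hasn’t changed at all.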

Software can’t think for you.

Chart software can save you a lot of time and produce effective charts, but you still need to understand what’s going on.

At the end of the day, it’s your data, and it’s up to you to choose the right chart for the job and make sure your data is presented in the most effective way possible and conveys the message you want.

Software can translate data into charts, but it’s up to you to make sure the chart is right.


One company that needs some charting expertise is Manic Mango, an innovative games company that is taking the world by storm. The CEO has been invited to deliver a keynote presentation at the next worldwide games expo. He needs some quick, slick ways of presenting data, and he’s asked you to come up with the goods. There’s a lot riding on this. If the keynote goes well, Manic Mango will get extra sponsorship revenue, and you’re bound to get a hefty bonus for your efforts.


The first thing the CEO wants to be able to do is compare the number of units sold for each game genre. He’s started off by plugging the data he has through some charting software, and here are the results:

image with no caption

Pie charts work by splitting your data into distinct groups or categories. The chart consists of a circle split into wedge-shaped slices, and each slice represents a group. The size of each slice is proportional to how many are in each group compared with the others. The larger the slice, the greater the relative popularity of that group. The number in a particular group is called the frequency.

Pie charts divide your entire data set into distinct groups. This means that if you add together the frequencies of all the slices, you get the total for the whole data set; expressed as percentages, the slices should add up to 100%.

Let’s take a closer look at our pie chart showing the number of units sold per genre:

image with no caption

Genre       Units sold
Sports      27,500
Strategy    11,500
Action      6,000
Shooter     3,500
Other       1,500
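To make the idea of proportional slices concrete, here’s a minimal Python sketch that turns the frequencies in the table above into percentages and slice angles. The variable names are ours, and the conversion to degrees is simply each genre’s share of a full 360-degree circle.

```python
# Units sold per genre, copied from the table above.
units_sold = {
    "Sports": 27500,
    "Strategy": 11500,
    "Action": 6000,
    "Shooter": 3500,
    "Other": 1500,
}

total_units = sum(units_sold.values())

# Each slice's share of the whole, as a percentage and as an angle.
slices = {}
for genre, units in units_sold.items():
    slices[genre] = {
        "percent": 100 * units / total_units,
        "angle": 360 * units / total_units,
    }

for genre, s in slices.items():
    print(f"{genre:<9}{s['percent']:5.1f}%  {s['angle']:6.1f} degrees")
```

The percentages add up to 100 and the angles add up to 360, which is just another way of saying that the slices cover the whole data set.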

Creating a pie chart worked out so well for displaying the units sold per genre that the CEO’s decided to create another to chart player satisfaction with Manic Mango’s games. He needs a chart that will allow him to compare the percentage of satisfied players for each game genre. He’s run the data through the charting software again, but this time he’s not as impressed.

image with no caption
image with no caption

Pie charts are used to compare the proportions of different groups or categories, but in this case there’s little variation between each group.

It’s difficult to take in at a glance which category has the highest level of player satisfaction.

It’s also generally confusing to label pie charts with percentages that don’t relate to the overall proportion of the slice. As an example, the Sports slice is labelled 99%, but it only fills about 20% of the chart. Another problem is that we don’t know whether there’s an equal number of responses for each genre, so we don’t know whether it’s fair to compare genre satisfaction in this way.

A better way of showing this kind of data is with a bar chart. Just like pie charts, bar charts allow you to compare relative sizes, but they allow for a greater degree of precision. They’re ideal in situations where categories are roughly the same size, as small differences are much easier to see and you can tell at a glance which category has the highest frequency.

On a bar chart, each bar represents a particular category, and the length of the bar indicates the value. The longer the bar, the greater the value. All the bars have the same width, which makes it easier to compare them.

Bar charts can be drawn either vertically or horizontally.

Vertical bar charts show categories on the horizontal axis, and either frequency or percentage on the vertical axis. The height of each bar indicates the value of its category. Here’s an example showing the sales figures in units for five regions, A, B, C, D, and E:

image with no caption

Horizontal bar charts are just like vertical bar charts except that the axes are flipped round. With horizontal bar charts, you show the categories on the vertical axis and the frequency or percentage on the horizontal axis.

Here’s a horizontal bar chart for the CEO’s genre satisfaction data, the same data that made such an unconvincing pie chart. As you can see, it’s much easier to quickly gauge which category has the highest value, and which the lowest.

image with no caption

Vertical bar charts tend to be more common, but horizontal bar charts are useful if the names of your categories are long. They give you lots of space for showing the name of each category without having to turn the bar labels sideways.
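If you want to see the mechanics of a horizontal bar chart without any charting software, here’s a small text-only sketch in Python. It reuses the units-sold figures from earlier in the chapter purely as sample data; the text_bar_chart function and its units_per_char setting are hypothetical names of our own.

```python
# Units sold per genre, reused from the earlier pie chart table as sample data.
units_sold = {
    "Sports": 27500,
    "Strategy": 11500,
    "Action": 6000,
    "Shooter": 3500,
    "Other": 1500,
}


def text_bar_chart(values, units_per_char=500):
    """Draw one horizontal bar per category.

    Every bar is one text row high (equal width), and its length is
    proportional to the value: one '#' per `units_per_char` units.
    """
    label_width = max(len(label) for label in values)
    rows = []
    for label, value in values.items():
        bar = "#" * round(value / units_per_char)
        rows.append(f"{label:<{label_width}} | {bar} {value:,}")
    return "\n".join(rows)


print(text_bar_chart(units_sold))
```

Because the category names sit to the left of the bars, they can be as long as you like without being turned sideways, which is exactly the advantage of the horizontal layout.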

image with no caption

It depends on what message you want to convey.

Let’s take a closer look.

Understanding scale allows you to create powerful bar charts that pick out the key facts you want to draw attention to. But be careful—scale can also conceal vital facts about your data. Let’s see how.

You can show frequencies on your scale instead of percentages. This makes it easy for people to see exactly what the frequencies are and compare values.

image with no caption

Normally your scale should start at 0, but watch out! Not every chart does this, and as you saw earlier with the two company profit graphs, using a scale that doesn’t start at 0 can give a different first impression of your data. This is something to watch out for on other people’s charts, as it’s very easy to miss.

image with no caption

There are ways of drawing bar charts that give you more flexibility.

The problem with these bar charts is that they show either the number of satisfied players or the percentage, and they only show satisfied players.

Let’s take a look at how we can get around this problem.

With bar charts, it’s actually really easy to show more than one set of data on the same chart. As an example, we can show both the percentage of satisfied players and the percentage of dissatisfied players on the same chart.

The CEO is thrilled with the bar charts you’ve produced, but there’s more data he needs to present at the keynote.

image with no caption
image with no caption
image with no caption

When you’re working with charts, one of the key things you need to figure out is what sort of data you’re dealing with. Once you’ve figured that out, you’ll find it easier to make key decisions about what chart you need to best represent your data.

The latest set of data from the Manic Mango CEO is numeric and, what’s more, the scores are grouped into intervals. So what’s the best way of charting data like this?

Score      Frequency
0–199      5
200–399    29
400–599    56
600–799    17
800–999    3

image with no caption

We could, but there’s a better way.

Rather than treat each range of scores as a separate category, we can take advantage of the data being numeric, and present the data using a continuous numeric scale instead. This means that instead of using bars to represent a single item, we can use each bar to represent a range of scores.

To do this, we can create a histogram.

Histograms are like bar charts but with two key differences. The first is that the area of each bar is proportional to the frequency, and the second is that there are no gaps between the bars on the chart. Here’s an example of a histogram showing the average number of games bought per month by households in Statsville:

image with no caption

The first step to creating a histogram is to look at each of the intervals and work out how wide each of them needs to be, and what range of values each one needs to cover. While doing this, we need to make sure that there will be no gaps between the bars on the histogram.

Let’s start with the first two intervals, 0–199 and 200–399. At face value, the first interval finishes at score 199, and the second starts at score 200. The problem with plotting it like this, however, is that it would leave a gap between score 199 and 200, like this:

Score      Frequency
0–199      5
200–399    29
400–599    56
600–799    17
800–999    3

image with no caption

Histograms shouldn’t have gaps between the bars, so to get around this, we extend their ranges slightly. Instead of one interval ending at score 199 and the next starting at score 200, we make the two intervals meet at 199.5, like this:

image with no caption

Doing this forms a single boundary and makes sure that there are no gaps between the bars on the histogram. If we complete this for the rest of the intervals, we get the following boundaries:

image with no caption

With these boundaries, each interval spans 200 scores (for example, from 199.5 to 399.5), so every interval has the same width of 200.

As all the intervals have the same width, we create the histogram by drawing vertical bars for each range of scores, using the boundaries to form the start and end point of each bar. The height of each bar is equal to the frequency.
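The boundary adjustment described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the to_boundaries helper and its step parameter are names we made up, and step=1 reflects scores being whole numbers, so each interval is stretched by half a score at either end.

```python
# Score intervals and frequencies, copied from the table above.
score_groups = [
    (0, 199, 5),
    (200, 399, 29),
    (400, 599, 56),
    (600, 799, 17),
    (800, 999, 3),
]


def to_boundaries(low, high, step=1):
    """Stretch an interval by half a step at each end so that
    neighbouring intervals meet at a single shared boundary."""
    return low - step / 2, high + step / 2


bars = []
for low, high, frequency in score_groups:
    left, right = to_boundaries(low, high)
    bars.append(
        {"left": left, "right": right, "width": right - left, "height": frequency}
    )

for bar in bars:
    print(bar)
```

Every bar comes out 200 wide, and each bar’s right edge is the next bar’s left edge, so there are no gaps. With equal widths, using the frequency as the height keeps each bar’s area proportional to its frequency.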

The CEO is so pleased with the histogram you’ve created for him that he wants you to create another. This time, he wants a chart showing how long Manic Mango players tend to play online games over a 24-hour period. Here’s the data:

image with no caption
image with no caption

He’s right, the interval widths aren’t all equal.

If you take a look at the intervals, you can see that they’re different widths. As an example, the 10–24 range covers far more hours than the 0–1 range.

If we had access to the raw data, we could look at how we could construct equal width intervals, but unfortunately this is all the data we have. We need a way of drawing a histogram that makes allowances for the data having different widths.

image with no caption

Do you think she’s right?

Here’s a sketch of the chart, using frequency on the vertical scale and drawing bar widths proportional to their interval size. Do you see any problems?

image with no caption

Up until now, we’ve been able to use the height of each bar to represent the frequency of a particular number or category.

This time around, we’re dealing with grouped numeric data where the interval widths are unequal. We can make the width of each bar reflect the width of each interval, but the trouble is that having bars of different widths affects the overall area of each bar.

We need to make sure the area of each bar is proportional to its frequency. This means that if we adjust bar width, we also need to adjust bar height. That way, we can change the widths of the bars so that they reflect the width of the group, but we keep the size of each bar in line with its frequency.
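To see how badly the sketch goes wrong, here’s a rough Python check. It uses the CEO’s hours and frequencies (the same figures that appear in the width table later in this section) and compares each group’s share of the players with the share of chart area its bar would take up if we used frequency as the height. The variable names are our own.

```python
# (start hour, end hour, frequency) for each group in the CEO's data.
play_time = [
    (0, 1, 4300),
    (1, 3, 6900),
    (3, 5, 4900),
    (5, 10, 2000),
    (10, 24, 2100),
]

total_frequency = sum(freq for _, _, freq in play_time)

# Area of each bar if height = frequency and width = interval width.
naive_areas = [(end - start) * freq for start, end, freq in play_time]
total_area = sum(naive_areas)

for (start, end, freq), area in zip(play_time, naive_areas):
    player_share = 100 * freq / total_frequency
    area_share = 100 * area / total_area
    print(f"{start}-{end} hours: {player_share:4.1f}% of players, "
          f"{area_share:4.1f}% of the chart area")
```

The 10–24 group holds roughly a tenth of the players but would take up well over 40% of the area, which is why the bar heights need adjusting.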

Let’s go through how to create this new histogram.

We find how wide our bars need to be by looking at the range of values they cover. In other words, we need to figure out how many full hours are covered by each group.

Let’s take the 1–3 group. This group covers 2 full hours, 1–2 and 2–3. This means that the width of the bar needs to be 2, with boundaries of 1 and 3.

image with no caption

If we calculate the rest of the widths, we get:

Hours    Frequency    Width
0–1      4,300        1
1–3      6,900        2
3–5      4,900        2
5–10     2,000        5
10–24    2,100        14

Now that we’ve figured out the widths of all the groups, we can move on to working out how high the bars need to be. Remember, we need to adjust the bar heights so that the overall area of each bar is proportional to the group’s frequency.

First of all, let’s take the area of each bar. We’ve said that the area of each bar needs to be proportional to its frequency, and the simplest way of doing that is to make the two equal. As we already know what the frequency of each group is, we know what the areas should be too:

Area of bar = Frequency of group

Now each bar is basically just a rectangle, which means that the area of each bar is equal to the width multiplied by the height. As the area gives us the frequency, this means:

Frequency = Width of bar × Height of bar

We found the widths of the bars in the last step, which means that we can use these to find what height each bar should be. In other words,

Height of bar = Frequency ÷ Width of bar
image with no caption

The height of the bar is used to measure how concentrated the frequency is for a particular group. It’s a way of measuring how densely packed the frequency is, a way of saying how thick or thin on the ground the numbers are. The height of the bar is called the frequency density.
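Here’s a minimal Python sketch of that calculation, applied to the hours data from the width table above. It divides each frequency by its interval width to get the frequency density, then checks that width times height gives back the original frequency. The names used are our own.

```python
# (start hour, end hour, frequency) for each group, from the table above.
play_time = [
    (0, 1, 4300),
    (1, 3, 6900),
    (3, 5, 4900),
    (5, 10, 2000),
    (10, 24, 2100),
]

histogram_bars = []
for start, end, frequency in play_time:
    width = end - start
    # Frequency density = frequency / width, so that area = frequency.
    density = frequency / width
    histogram_bars.append(
        {"hours": f"{start}-{end}", "width": width, "frequency_density": density}
    )

for bar in histogram_bars:
    area = bar["width"] * bar["frequency_density"]
    print(f"{bar['hours']:>5} hours: width {bar['width']:>2}, "
          f"frequency density {bar['frequency_density']:7.1f}, area {area:7.1f}")
```

The 1–3 group ends up with a height of 3,450 rather than 6,900, because its 6,900 players are spread across two hours instead of one.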

Now that we’ve worked out the widths and heights of each bar, we can draw the histogram. We draw it just like before, except that this time, we use frequency density for the vertical axis and not frequency.

Here’s our revised histogram.

image with no caption

While histograms are an excellent way to display grouped numeric data, there are still some kinds of this data they’re not ideally suited for presenting—like running totals...

image with no caption

Let’s see if we can help the CEO out. Here’s the histogram we had before.

image with no caption

It’s tricky to see at a glance what the running totals are in this chart. In order to find the frequency of players playing for up to 5 hours, we need to add different frequencies together. We need another sort of chart...but what?

The CEO needs some sort of chart that will show him the total frequency below a particular value: the cumulative frequency. By cumulative frequency, we basically mean a running total.

What we need to come up with is some sort of graph that shows hours on the horizontal axis and cumulative frequency on the vertical axis. That way, the CEO will be able to take a value and read off the corresponding frequency up to that point. He’ll be able to find out how many people play for up to 5 hours, 6 hours, or whatever other number of hours he’s most interested in at the time.

Before we can draw the chart, we need to know what exactly we need to plot on the chart. We need to calculate cumulative frequencies for each of the intervals that we have, and also work out the upper limit of each interval.

Let’s start by looking at the data.

Hours    Frequency
0–1      4,300
1–3      6,900
3–5      4,900
5–10     2,000
10–24    2,100
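The upper limit of each interval is just the value where it ends, and the cumulative frequency is the running total of the frequencies up to and including that interval. Here’s a short Python sketch of the calculation, using the standard library’s itertools.accumulate to keep the running total. The variable names are ours.

```python
from itertools import accumulate

# (start hour, end hour, frequency) for each group, from the table above.
play_time = [
    (0, 1, 4300),
    (1, 3, 6900),
    (3, 5, 4900),
    (5, 10, 2000),
    (10, 24, 2100),
]

upper_limits = [end for _, end, _ in play_time]
frequencies = [freq for _, _, freq in play_time]

# Running total of the frequencies, one value per interval.
cumulative = list(accumulate(frequencies))

print("Upper limit  Cumulative frequency")
for limit, running_total in zip(upper_limits, cumulative):
    print(f"{limit:>11}  {running_total:>20,}")
```

The last cumulative frequency is the total number of players in the data set, which is a handy check that nothing has been missed.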

Now that we have the upper limits and cumulative frequencies, we can plot them on a chart. Draw two axes, with the vertical one for the cumulative frequency and the horizontal one for the hours. Once you’ve done that, plot each of the upper limits against its cumulative frequency, and then join the points together with a line like this:

image with no caption
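Reading a value off a graph like this amounts to following the straight line between two plotted points. Here’s a rough Python sketch of that idea. It rests on two assumptions of our own: that the graph starts at 0 players for 0 hours, and that players are spread evenly within each interval, which is what a straight joining line implies. The read_off function is a hypothetical helper, and the plotted points are the running totals of the frequencies in the table above.

```python
# (upper limit in hours, cumulative frequency). The first point, (0, 0),
# is our assumption that nobody has played for less than 0 hours.
points = [(0, 0), (1, 4300), (3, 11200), (5, 16100), (10, 18100), (24, 20200)]


def read_off(hours, plotted):
    """Estimate the cumulative frequency at `hours` by following the
    straight line between the two nearest plotted points."""
    for (x0, y0), (x1, y1) in zip(plotted, plotted[1:]):
        if x0 <= hours <= x1:
            return y0 + (y1 - y0) * (hours - x0) / (x1 - x0)
    raise ValueError("hours is outside the range covered by the graph")


print(read_off(5, points))   # a plotted point, so no estimating is needed
print(read_off(6, points))   # an estimate part way along the 5-10 segment
```

A value that falls on a plotted point, such as 5 hours, is exact. Anything in between, such as 6 hours, is only an estimate, because the grouped data doesn’t tell us how players are spread inside the 5–10 interval.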

The CEO is really happy with your work on cumulative frequency graphs, and your bonus is nearly in the bag. He’s nearly finished preparing for the keynote, but there’s just one more thing he needs: a chart showing Manic Mango profits compared with the profits of their main rivals. Which chart should he use?

You’ve helped produce some killer charts for Manic Mango, and thanks to you, the keynote was a huge success. Manic Mango has gained tons of extra publicity for their games, and money from sponsorship and advertising is rolling in. The only thing left for you to do is think about all the things you could do and the places you could go with your well-earned bonus.

You’ve had your first taste of how statistics can help you and what you can achieve by understanding what’s really going on. Keep reading and we’ll show you more things you can do with statistics, and really start to flex those statistics muscles.
