Chapter 3
Know and Understand the Data

Plotting numbers on a chart does not make you a data analyst. Knowing and understanding your data before you communicate it to your audience does. Understanding data is challenging, and this chapter is designed to help you with these important ideas:

  • The concept of data aggregation
  • How to effectively communicate metrics to your audience
  • How to identify and correct mistakes in your analysis
  • How to understand and present time series data

Using Appropriate Aggregations

Aggregated data is data combined from several measurements for the purpose of summarization.1

This section will help you understand whether data can be aggregated, some basic aggregation methods, and things to watch out for.

Can the Data Be Aggregated?

When working with aggregation, it is important to avoid metrics that aggregate data that is already aggregated. What does this mean?

Aggregating summarized data to a higher level will often work. However, there are aggregations you cannot aggregate further without additional data. It takes time and practice to understand data aggregation. To aid you on your journey, here are four frequent aggregation mistakes to avoid as a data analyst.

Average of Percentages

Combining several data points into a single value is the most obvious way to aggregate data. However, not all data can be summarized. Figure 3.1 shows response rates to a survey question:

[Table: survey response rates by reason and social class (middle class, poor, rich); for example, “abilities, talents” was cited by 8.0% of the middle class, 7.0% of the poor, and 13.0% of the rich.]

Figure 3.1 Pre-aggregated data.

The data in this case is pre-aggregated; that is, the percentages represent the number of people who specified a reason in a social class divided by the total number of people surveyed. Since this is the lowest level of data provided, the numbers are represented accurately.

A common mistake people make is to combine (or aggregate) these measurements even further. For example, you may want to know the overall percentage of people who gave abilities and talents as their reason. The natural thing to do is to remove social strata from the table and take an average across the entire surveyed population. This results in the table shown in Figure 3.2.

[Table: overall response rate per reason; for example, abilities, talents: 9%; connections to the right people: 27%; cunning, cheating: 21%; entrepreneurial spirit, courage: 20%; fortune, good luck: 13%; good education, high qualification: 26%.]

Figure 3.2 Unweighted average of response rates.

This table tells you that, on average, 9% of people responded that abilities and talents was a reason for success. But this is not accurate because you have not considered the sample size. Taking an average of an average (the original percentage) does not result in a weighted average, which takes into account the sample size (i.e., the number of people from each social stratum who responded to the survey).

If the survey questioned 100 poor people, 200 middle-class people, and 300 rich people, then you could calculate the weighted average by multiplying each stratum’s percentage by the number of people surveyed in that stratum, summing those products, and dividing the result by the overall number of people surveyed.

Weighted average = Σ(response rate × people surveyed) / Σ(people surveyed)

Using this weighted calculation, you arrive at the correct overall average, as seen in Figure 3.3.


Figure 3.3 Weighted average of response rates.

Knowing when your data can and cannot be aggregated is critical to performing accurate analysis. As you can see from these aggregated tables, the numbers differ between the unweighted and weighted average. If your data is pre-aggregated and you are not provided with the sample size, aggregating the data further could yield inaccurate numbers.
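These two calculations can be sketched in a few lines of Python. The response rates follow the “abilities, talents” row from Figure 3.1, and the sample sizes are the hypothetical 100/200/300 split described above:

```python
# Response rates (percent) for "abilities, talents" by social class,
# with the number of people surveyed in each stratum (hypothetical).
rates = {"poor": 7.0, "middle class": 8.0, "rich": 13.0}
surveyed = {"poor": 100, "middle class": 200, "rich": 300}

# Unweighted average of averages -- ignores the sample sizes.
unweighted = sum(rates.values()) / len(rates)

# Weighted average -- each rate contributes in proportion to its sample.
weighted = sum(rates[s] * surveyed[s] for s in rates) / sum(surveyed.values())

print(round(unweighted, 1))  # 9.3
print(round(weighted, 1))    # 10.3
```

The unweighted figure matches the 9% shown in Figure 3.2; weighting by sample size shifts the answer by a full percentage point.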

Average of Averages

As mentioned previously, aggregating an average to a higher level without appropriately weighting the data could lead to inaccurate representation of the data. Consider Figure 3.4, which Andy created about internet usage trends around the world.

[Line chart, 2010–2015: the United States rises from 71.7 to 74.6 internet users per 100 people, while a second country’s line rises from 9.1 to 12.3.]

Figure 3.4 Internet access per capita by country.

This data set contains pre-aggregated data of internet usage per 100 people per year per country. Essentially this data can be presented as percentages since its base is 100 people. Looking at an individual country per year provides an accurate picture, because the data was given at that granularity.

However, there is a mistake in calculating worldwide internet usage. The worldwide average was calculated as a straight average of all of the countries for each year. Why is this a problem? It does not account for the population of each country and therefore it is not weighted.

For example, Iceland has the highest internet usage per 100 people, but it also has a tiny population compared to China, with the largest population, which finds itself in the middle of the countries listed when it comes to internet usage penetration. Giving Iceland the same weight as China when calculating the overall internet usage is wrong and inaccurate. The population of each country should be included in the overall calculation in order to provide an accurate representation of the worldwide internet access per 100 people. In this case, the worldwide average should be removed from the visualization.
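The same correction applies here. A minimal sketch, using illustrative usage and population figures (not the actual data behind Figure 3.4):

```python
# Internet users per 100 people and population in millions -- illustrative
# figures for two countries, not the actual data set.
usage = {"Iceland": 98.0, "China": 50.0}
population = {"Iceland": 0.34, "China": 1386.0}

# Straight average: Iceland counts as much as China.
straight = sum(usage.values()) / len(usage)

# Population-weighted average: each country contributes by its population.
weighted = sum(usage[c] * population[c] for c in usage) / sum(population.values())

print(round(straight, 1))  # 74.0
print(round(weighted, 1))  # 50.0
```

Because almost the entire combined population lives in China, the weighted result sits near China’s figure, not halfway between the two countries.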

Percentages of Percentages

For the data set about the drinking habits of British people, many Makeover Monday participants focused on the year-over-year change. The data was already provided as percentages, so to show accurate year-over-year results you should use the difference, not the percent difference.

The chart in Figure 3.5 is filtered to one age band and one question. Creating a calculation that is the difference of each year from the very first year shows that in 2016, 14 percentage points fewer people aged 16–24 “drank alcohol in the last week” (based on the question in the survey) than in 2005. Every year from 2006 onward relates its result back to the value of 2005, which forms the baseline, showing that overall alcohol consumption appears to be trending down.


Figure 3.5 Alcohol consumption rate in Britain compared to 2005.

Compare that to Figure 3.6, where, instead of using absolute change (difference from 2005) we have applied a relative change calculation: the % difference of each year compared to 2005. This implies that in 2016, 23% fewer people aged 16–24 “drank alcohol in the last week” compared to 2005.


Figure 3.6 Percent change in alcohol consumption rate in Britain compared to 2005.

Calculating the percent change between two percentages is not completely inaccurate, but it can be very misleading. Instead, you should use the absolute change when you are working with percentages and want to show the difference between two points in time.
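The distinction can be demonstrated with two hypothetical survey rates, chosen to mirror the roughly −14-point and −23% figures above:

```python
# Percent of 16-24s who "drank alcohol in the last week" -- hypothetical
# values consistent with the charts above.
rate_2005 = 61.0
rate_2016 = 47.0

# Absolute change: a difference in percentage points.
absolute_change = rate_2016 - rate_2005

# Relative change: the percent difference of one percentage from another.
relative_change = (rate_2016 - rate_2005) / rate_2005 * 100

print(absolute_change)            # -14.0
print(round(relative_change, 1))  # -23.0
```

Both numbers are “correct,” but they answer different questions; for percentages over time, the absolute change is usually what the audience expects.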

Making Comparisons with Ranks

Ranks do not explain how much one item varies from another. Ranked data is ordinal; that is, the data is categorical and has a sequence (e.g., who finished the race first, second, and third). That’s it! Ranked data can be used for showing the order of the data points.

Consider Figure 3.7, which shows the rank of sales for each region across product categories.


Figure 3.7 Rank of regions by sales by product category.

Figure 3.7 tells you which region ranked first, second, third, and fourth for a given product category, no more, no less.

If a box plot is added to show the variance between the data, as in Figure 3.8, you see that the statistical median (i.e., the place where the two gray areas meet in the box plot) within each category is identical. This means that the variance between the ranks is identical, as would be expected with ordinal data. In other words, you cannot make assumptions about the true variance between the regions (i.e., by how much sales vary from one region to the next).


Figure 3.8 Box plot of the rank of regions by sales by product category.

If you look at the actual sales values instead of the rank of sales, you start to gain an understanding of the variance across the regions. You can now see in Figure 3.9 how much better one region performs in monetary terms compared to another region.

[Dot plot: for furniture, the South sits near $120K in sales, Central between $160K and $180K, East between $200K and $220K, and West between $240K and $260K.]

Figure 3.9 Dot plot of sales by region by product category.

By adding a box plot in Figure 3.10, it is now easier to see the variance across the regions within each category and also which category has the highest median (technology).


Figure 3.10 Box plot of sales by region by product category.

When working with ranked data, you cannot make inferences about the variance in the data; all you can say with certainty is which item is ranked higher than the others, not how much higher. When you have ranked data of a sprint race at the Olympics, you can say who came first, second, and third, but without the underlying detail, you cannot say how much faster one person was compared to the next.
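A short sketch with hypothetical regional sales shows how ranking discards magnitude:

```python
# Sales by region for one product category (hypothetical values).
sales = {"West": 250_000, "East": 210_000, "Central": 170_000, "South": 120_000}

# Ranking is ordinal: it preserves order but discards magnitude.
ordered = sorted(sales, key=sales.get, reverse=True)
ranks = {region: i + 1 for i, region in enumerate(ordered)}

print(ranks)  # {'West': 1, 'East': 2, 'Central': 3, 'South': 4}

# Adjacent ranks are always one apart, but the sales gaps differ:
print(sales["West"] - sales["East"])      # 40000
print(sales["Central"] - sales["South"])  # 50000
```

Once the values are reduced to ranks, those unequal gaps become indistinguishable.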

Basic Aggregation Types

This section covers the four most common aggregation types that we have observed during the Makeover Monday project. The definition of the aggregation and the formulas for creating each calculation are included, as well as an example of a chart for each type of aggregation.

Summarization

You have gathered all of your sales data and now you need to understand total sales at various levels of aggregation. Perhaps you need to understand sales by month or sales by region or which products are selling the most. Each of these requires a basic summarization; that is, you add the sales together for each member of a field, as in Figure 3.11.

[Line chart by quarter, 2014–2017: four regional lines rise and fall irregularly, ending 2017 Q4 at $98K (East), $80K, $56K, and $46K.]

Figure 3.11 Sales summarized to the region and quarter level.

Formula

Sum = Σ(values in the set)

Example Chart

Figure 3.11 shows the sales summarized by region, year, and quarter.
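In code, a basic summarization is a grouped sum; the order records below are hypothetical:

```python
from collections import defaultdict

# Individual sales records: (region, quarter, amount) -- hypothetical data.
orders = [
    ("East", "2017 Q4", 40_000),
    ("East", "2017 Q4", 58_000),
    ("West", "2017 Q4", 80_000),
    ("West", "2017 Q3", 65_000),
]

# Add the sales together for each member of the (region, quarter) field.
totals = defaultdict(int)
for region, quarter, amount in orders:
    totals[(region, quarter)] += amount

print(totals[("East", "2017 Q4")])  # 98000
```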

Average/Mean

An average represents a typical number for a group of things. For example, on average it rains three days per week. In everyday talk, people use the word “average” to mean a typical value, with some of the values above the average and some below.

When an analyst calculates an average, what they are really calculating is the mean. The mean is calculated by summing the values in the set and then dividing by the number of values in the set.

Formula

Mean = Σ(values in the set) / n, where n is the number of values in the set

Example Chart

In Figure 3.12 we want to know the average number of RBIs (runs batted in) per position across all players.

[Table: RBI, players, and avg. RBI per position; for example, RF: 100 RBIs from 1 player (avg. 100.0); 2B: 72 RBIs from 1 player (avg. 72.0).]

Figure 3.12 Table of average RBIs per player.

Median

Median and mean are often confused. While the mean provides the mathematical average, the median returns the middle value in a data set when the values are arranged from smallest to largest.

If there is an odd number of values, then the median represents the value exactly in the middle of the data set. For example, if there are seven values ordered from smallest to largest, then the median would be the fourth value in the data set.

If there is an even number of values, then the median is the average of the two values in the middle. If there are eight values, then the median would be the average of the fourth and the fifth values.

Formula

Median = value of ((n + 1)/2)th data point in an ordered set2

Example Chart

Figure 3.13 shows the median number of RBIs across all players.

  1. Summarize the RBIs per player (i.e., how many RBIs does each player have?).
  2. Sort the players from smallest to largest RBI (or largest to smallest).
  3. Count the number of players, add one, and then divide by two: (22 players + 1) / 2 = 11.5.
  4. Since there is an even number of players, to get the median you need to average the RBI values for the players in the 11th and 12th positions in the order: (18 + 22)/2 = 20.

[Table: players ordered by RBI, with the overall median of 20; for example, Alberto Gonzalez and Juan Miranda each have 1 RBI.]

Figure 3.13 Table of median RBIs across all players.
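The four steps above can be sketched directly in Python; the RBI values here are hypothetical, not the 22-player roster from Figure 3.13:

```python
# Step 1: RBIs already summarized per player (hypothetical ten-player roster).
rbis = [12, 1, 48, 7, 9, 1, 18, 22, 35, 4]

# Step 2: sort from smallest to largest.
ordered = sorted(rbis)

# Step 3: count the players, add one, and divide by two.
position = (len(ordered) + 1) / 2  # 5.5 for ten players

# Step 4: with an even count, average the two middle values.
if position.is_integer():
    median = ordered[int(position) - 1]
else:
    median = (ordered[int(position) - 1] + ordered[int(position)]) / 2

print(median)  # 10.5
```

Python’s built-in `statistics.median` applies the same rule, so the sketch is easy to verify.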

Count (Distinct)

Counting is the action of finding the number of elements of a finite data set.3 For example:

  • How many players are on a team for each season?
  • How many female engineers are employed by an organization each month?
  • How many SKUs (stock-keeping units) are in the product portfolio?

To count the unique items in a data set, you typically use a distinct function: count the number of elements in a data set, but count each element only once. For example:

  • How many unique first names are there for all players in the league?
  • How many days have at least one sale?
  • How many unique customers did we reach during last year’s marketing campaign?

Counting the distinct elements avoids incorrectly inflating the numbers. Using the marketing campaign example, to get the number of customers reached, you do not want to count a single customer multiple times, even if you contacted them every month and they therefore appear in 12 records in the data set.

Formula

Count = number of observations in the set
Count Distinct = number of unique observations in the set

Example Chart

If you want to know how frequently products are sold in each product category, you should count all of the products each time they had a sale, as shown in Figure 3.14.

For example, given the set [‘blue’, ‘red’, ‘blue’, ‘yellow’, ‘blue’, ‘red’], Count = 6 and Count Distinct = 3.

Figure 3.14 Bar chart of a count of products sold.

However, if you want to know how many unique products were sold in each category, you need to change the measurement to a distinct count. Notice in Figure 3.15 how the order of the product categories changed based on the number of distinct products sold.

[Bar chart: unique products sold per category, from paper at 277 down to copiers at 13.]

Figure 3.15 Bar chart of unique products sold.

Count and count distinct are used to answer different questions. Use count when you need to know the total number of items in a set. Use count distinct when you are interested in the unique number of items in a set.
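The difference is easy to see with a hypothetical contact list where the same customer appears in several records:

```python
# One record per customer contact -- the same customer can appear many times.
contacts = ["anna", "ben", "anna", "carol", "anna", "ben"]

count = len(contacts)                # every record, duplicates included
count_distinct = len(set(contacts))  # each unique customer counted once

print(count)           # 6
print(count_distinct)  # 3
```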

Explaining Metrics

Include information for viewers of your visualizations who do not have any background on the topic and need some guidance to interpret the data. If it is a metric the audience is probably not familiar with, take the time to explain how it is calculated and what it means. If you have taken a standard calculation and changed it slightly to fit the purpose, be sure to point that out to the audience. In the end, you want your audience to quickly understand what you created. Using complex metrics will only hinder interpretation.

Know Your Audience

When you create a data visualization, you rarely create something for yourself and instead focus on other users. Unless you know your audience, you are likely to fall short with your communication. If you know exactly who your audience is, great! However, it is rare that you will know everyone who will view your work. It is likely you will have to account for an unknown audience; whether it is online or in your organization, someone you gave a report to may show it to others you never considered as your target audience.

Tips for Identifying Your Audience

A little bit of research can go a long way toward understanding your audience and their needs. Consider making a short list of the people who may be in your audience. Try to be as specific as possible and then do some research about them. If you are presenting to upper-level management, ask others in the organization what is important to them. If you are presenting to a team of analysts, it is likely you will go into a lot more detail than you would for an overview report to the board of directors. Either way, taking some time to get to know your audience will help you communicate more effectively.

If you are addressing a mixed audience, your communication will most likely need to be targeted at the most senior person or the decision maker. Design your analysis for them; you can always follow up with the rest of the team later.

Once you have enough information about your audience, make note of their different needs and use these insights as a guideline for your analysis. If you create something that is not on that list of needs, keep it off your final output; you can always provide it on request. Do not add something that could confuse your audience. Finally, review your work with someone who can give you unbiased feedback. Particularly useful is someone who does not know about the content and is honest with you; if you make your analysis understandable to them, then you have probably communicated the information very well.

Using Appropriate Metrics

When aggregating data, it is important to consider the base metric. Take a look at Figure 3.16, a line chart of BMW and Mercedes car prices by month.

[Line chart, 2012–2017: summed catalog prices rise irregularly, ending at 363.0M (BMW) and 317.5M (Mercedes-Benz).]

Figure 3.16 Total BMW and Mercedes-Benz prices over time.

It should be obvious that a car does not cost €300+ million. The purpose of this chart is to compare the prices of BMW to Mercedes over time, but the data, as presented, is not normalized by the number of cars sold. Data is often presented in this manner because the analyst did not take the time to understand the data and to identify an appropriate metric.

Since this data is about car prices, it does not make sense to summarize the prices. When you want to compare the prices of car models, you should probably look at the median or average price. When you do that, as in Figure 3.17, the prices are easy to understand, and effective comparisons can be made.

[Line chart, 2012–2017: median catalog prices, ending at 52,509 (BMW) and 51,966 (Mercedes-Benz).]

Figure 3.17 Median BMW and Mercedes-Benz prices over time.

Creating New Metrics to Tell a Different Story

When approaching an analytical project, you are often presented with a limited set of metrics. Any great data analyst will see this as an opportunity to redefine or restructure the data in order to interpret it in a different way. Reshaping the data provides you with a new structure of fields and metrics that can potentially be used in ways that give you more options for analyzing the data.

Consider Figure 3.18, which examines visitor history to US National Memorials. A common way to look at this data is to plot the cumulative number of visitors over time.


Figure 3.18 Cumulative visitors to National Memorials.

There is nothing wrong with this metric; however, Shawn Levin took a different approach in Figure 3.19 by creating a metric for how “old” each park was. This metric was not originally in the data set, so he needed to create a calculation to include it.


Figure 3.19 Cumulative visitors to National Memorials since they opened.

This is a subtle change, and when presented in the way that Shawn did, a different story can be told. When you are exploring your data, look for alternate views of the data; you just may find a more interesting insight.
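One way to sketch such a derived metric, using hypothetical visitor counts, is to re-key each memorial’s time series by its age rather than by calendar year:

```python
# Annual visitors in millions per memorial -- hypothetical values.
visitors = {
    "Lincoln Memorial": {1940: 1.2, 1941: 1.5, 1942: 1.1},
    "Vietnam Veterans": {1982: 2.9, 1983: 3.4, 1984: 3.1},
}

# "Age" = years since the first year of data, so every memorial starts at 0.
by_age = {}
for memorial, series in visitors.items():
    opened = min(series)
    by_age[memorial] = {year - opened: v for year, v in series.items()}

print(by_age["Vietnam Veterans"])  # {0: 2.9, 1: 3.4, 2: 3.1}
```

Aligning every series at age zero is what lets memorials of different eras be compared on the same axis.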

Identifying and Correcting Mistakes

When we gave the community global warming data to analyze, one of the first visualizations posted came from Charlie Hutcheson (Figure 3.20).


Figure 3.20 Relative temperature change by country.

While the design looks great, this visualization suffered from a flaw: it showed relative change instead of absolute change. Since this visualization focused on temperatures, what really is relative change? What does it mean for something to be 20% hotter? And 20% hotter in °C is very different to 20% hotter in °F. Relative change is the percentage change of the absolute variance compared to a prior period.

On the other hand, absolute change is the difference to a prior period. When analyzing temperature change, absolute change is more understandable to the audience. We know what +10°F or −5°C feels like. Charlie took this feedback into account and revised his visualization in Figure 3.21; +10°F is much easier to understand than +270%.


Figure 3.21 Absolute temperature change by country.

While Charlie’s visualization related to using a confusing metric, the dashboard Rahul Singh created in Figure 3.22 used an incorrect aggregation. The data about medal winners in the Southeast Asian Games (SEA Games) included every medal won by every individual and team.

[Bar charts: share of gold, silver, and bronze medals by gender and country; Thailand’s gold share is highest, at 46% for men and 45% for women.]

Figure 3.22 Duplicated count of medals won in the SEA Games.

Look at the bars on the right side. According to this chart, Thailand won 1609 medals as a country across five years. That is nearly 322 medals per year. A quick analysis of the data set indicates there are only 75 unique sports. So how can a country win more medals than events that exist? Obviously, they cannot. The issue is that every medal for every individual was counted as part of the total medal count for a country. However, the official medal counts are by event.

By counting the number of unique sports for each medal type, you get an accurate count of the number of medals won by each country. Figure 3.23 shows the accurate medal counts.


Figure 3.23 Correct count of medals won in the SEA Games.

As a data analyst, you need to have a sense for when something does not look quite right. You need to be able to spot outliers quickly and investigate their validity. Take time to explore the data and do some research about the topic. Simple steps like this will help you avoid basic mistakes.

Time Series Analysis

A time series is a sequence of values, usually taken in equally spaced intervals. Typical examples include:

  • Sales by month
  • Users added to an app by year
  • Heart rate per minute
  • GPS location by second
  • Stock price per hour
  • Electricity used per day

Essentially, anything with a time dimension, measured in regular intervals, can be used for time series analysis. This section will introduce you to common methods for analyzing and visualizing time series data.

Univariate Time Series

A univariate time series is a sequence of measurements of a single variable collected over time.4 As shown in Figure 3.24, observations of CO2 concentration are a single variable collected over time.


Figure 3.24 Time series plot of CO2 concentration.

When analyzing univariate time series, you may be able to quickly see trends, as in the linear trend in Figure 3.25. Adding a linear trend line and checking the R-squared value helps you confirm whether the line approximates the data points well.


Figure 3.25 Time series plot of CO2 concentration with a linear trend line.

Visualizing Seasonality

If you are interested in seasonality, you can plot a date part (e.g., month) on the x-axis and add a line for each year, as in Figure 3.26.


Figure 3.26 Seasonality plot of CO2 concentration.

The data has consistent patterns across all of the years, with the CO2 concentration peaking in May, decreasing until September, then rising again until the following May. The design of the chart also shows a consistent increase of CO2 concentrations over time as more recent years have higher CO2 concentration levels than earlier years.

Displaying the data in a box plot, as in Figure 3.27, reveals the consistent distribution of the years within the months (lowest to highest data point as depicted by gray and black circles), the pattern of the median (where the two gray shaded boxes meet), and the consistent width of the interquartile range (the range from the bottom of the gray box to the top of the lighter gray box).


Figure 3.27 Box plot of CO2 concentration.

However, the drawback of the box plot is that it tends to hide the values due to its design. The gray shaded boxes cover some of the data points, making them hard to identify. Figure 3.28 uses lines for more effective display of the median, 25th percentile, and 75th percentile, ensuring the focus is on the spread.


Figure 3.28 Dot plot of CO2 concentration with lines for the median, 25th percentile, and 75th percentile.

Finally, comparing each individual month to the average for the year normalizes the data and creates an index for each year (i.e., a variance from the mean). In Figure 3.29, the zero line represents the yearly average, while the black lines are plotted along the months for each year. This display shows the variation from that average (i.e., how much one month differs when compared to the average).


Figure 3.29 Monthly CO2 concentration.

This method for visualizing time series data confirms the systematic nature of CO2 concentration.

Using Moving Averages for Smoothing

Smoothing is a technique that can be used to remove some of the variation in short-term data in favor of emphasizing long-term trends. One of the most common methods is moving average smoothing. Typically, a moving average is the unweighted mean of a series of preceding data points. In many cases, the moving average will also include the current data point. For example, a six-month moving average would be the average of values from the current month and the previous five months.

In the sample data set of CO2 concentration, you may want to smooth the data over a period of time by adding a moving average. Applying a six-month moving average to the data (i.e., the current month value plus the values for the previous five months divided by six) gives you a much smoother visualization, as demonstrated in Figure 3.30.

Graph of the six-month moving average of CO2 concentration (370 to 410) from 2001 to 2019, where the curve climbs from below 370 to above 405 while oscillating in a regular seasonal pattern.

Figure 3.30 Six-month moving average of CO2 concentration.
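A trailing moving average of the kind described above is straightforward to compute by hand. This is a minimal Python sketch with hypothetical readings; in practice you would typically use a library's rolling-mean routine:

```python
def moving_average(values, window=6):
    """Trailing moving average: each point is the unweighted mean of the
    current value and the previous window - 1 values. The first
    window - 1 points are skipped because their window is incomplete."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Eight hypothetical monthly CO2 readings (ppm).
readings = [370.2, 371.0, 372.4, 373.1, 372.6, 371.5, 370.8, 371.9]
print(moving_average(readings))
```

Note that the smoothed series is shorter than the original: a six-month window cannot be filled until the sixth data point arrives.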

Variance from a Point in Time

Comparing data to a fixed point in time allows the audience to understand the magnitude of change. For example, Figure 3.31 shows how home prices have changed since the first month for which data is available.

Graph titled "How has your home price changed since you purchased?" showing years 1977 to 2017 against a range of 0 percent to positive 1,800 percent, with two highlighted curves that rise, fall, and rise again.

Figure 3.31 Change in home prices compared to the first time period.

This type of variance analysis allows you to compare change to any point in time. On 30 December 2008, the Case–Shiller home price index reported the largest price drop in its history.5 Visually plotting time series data against a point in time reveals patterns relative to that period, allowing the reader to understand growth and decline before and after the chosen point.
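Variance from a point in time is simply a percent change against a chosen reference value. Here is a minimal sketch in Python, using hypothetical index values rather than actual Case–Shiller data:

```python
def pct_change_from(values, reference_index=0):
    """Percent change of each value relative to a chosen reference point."""
    base = values[reference_index]
    return [round(100.0 * (v - base) / base, 1) for v in values]

# Hypothetical home-price index values over five periods.
prices = [100.0, 120.0, 150.0, 135.0, 180.0]
print(pct_change_from(prices))     # change since the first period
print(pct_change_from(prices, 2))  # change since the third period
```

Changing `reference_index` is the programmatic equivalent of letting the audience pick their own reference point, as Figure 3.32 allows the reader to do.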

Giving your audience the flexibility to pick and choose their reference point, as has been done in Figure 3.32, allows them to compare values across any time periods that are of particular interest to them.

Graph titled "How has your home price changed since you purchased?" showing years 1977 to 2017 against a range of negative 100 percent to positive 80 percent, where the curves rise, fall, intersect at a single point, and rise again.

Figure 3.32 Change in home prices compared to a selected time period.

In addition, allowing the user to choose a state to highlight makes it more contextual. Including a black line for the US average shows how all of the states compare to the nation.

Cycle Plots

A cycle plot (Cleveland, Dunn, and Terpenning, 1978) shows both the cycle or trend and the month-of-the-year effect.6 A standard line chart will display year and then month on the x-axis and a metric on the y-axis. Figure 3.33 represents the median ozone level across the United States by month for a number of years.

Graph of median measurement from 0.000 to 0.040, where the curve begins between 0.015 and 0.020 and oscillates sharply up and down from left to right across the years.

Figure 3.33 Median monthly ozone level.

A cycle plot will flip the two parts of the date field. While Figure 3.33 displays months within each year, a cycle plot, as in Figure 3.34, will display years within each month. Once that display is created, add a line for the average across all years for each month and you have a cycle plot.

Graph of median measurement from 0.000 to 0.040 by month, Jan to Dec, with one irregular curve per month; the April curve is highest and the January and December curves are lowest.

Figure 3.34 Cycle plot of ozone levels.

The cycle plot shows you that April has the highest median ozone level, which was much more difficult to see in the original view. You can also see that, generally speaking, ozone levels have increased in most months over the years as many of the lines within each month are trending upward (i.e., from bottom left to top right).
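The restructuring behind a cycle plot (grouping by month first, keeping years in order within each month, then averaging across years) can be sketched in Python. The records are hypothetical (year, month, value) triples, not the actual ozone data:

```python
from collections import defaultdict

def cycle_plot_series(records):
    """Group (year, month, value) records by month, keeping years in
    order within each month, and compute each month's average across years."""
    by_month = defaultdict(list)
    for year, month, value in sorted(records):  # tuples sort by year first
        by_month[month].append(value)
    averages = {m: sum(v) / len(v) for m, v in by_month.items()}
    return dict(by_month), averages

# Hypothetical median ozone measurements.
records = [(2017, 'Apr', 0.040), (2018, 'Apr', 0.042),
           (2017, 'Jan', 0.020), (2018, 'Jan', 0.022)]
series, month_avg = cycle_plot_series(records)
print(series, month_avg)
```

Plotting each month's list as a small line, with its average as a horizontal reference, reproduces the structure of Figure 3.34.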

Calendar Heat Map

Heat maps are effective visualizations for seeing concentrations as well as patterns. Adding time series to a heat map can also reveal seasonality that may not be obvious otherwise. As we saw in the previous section, ozone levels are highest in the US from April until June. See Figure 3.35.

Heat map of months (Jan to Dec) against years (1990 to 2016), divided into squares, where the squares for the spring and early-summer months are darker than those for the rest of the year.

Figure 3.35 Heat map of median ozone levels.

Presenting this data as a heat map instead of a cycle plot reveals a pattern you may not have noticed before: ozone levels have steadily increased since 1990. You can see this in the gradual darkening of the heat map as you read it from left to right.

Figure 3.36 reduces the data to the state of Colorado, revealing an even more striking increase.

Heat map of months against years for 1990 to 2016, divided into squares, where the squares grow darker toward the right of the grid while the top-left and bottom-left squares are lighter.

Figure 3.36 Heat map of median ozone levels in Colorado.

Figure 3.37 shows decreases in Florida in the summer, likely because people who live there seasonally retreat from the heat (and take their cars with them).

Heat map of months against years for 1990 to 2016, divided into squares, where the squares for the summer months are lighter and those for the spring months are darker.

Figure 3.37 Heat map of median ozone levels in Florida.

Consider calendar heat maps as an alternative time series visualization. They can be very effective at revealing hidden patterns in your data.
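Behind a calendar heat map is just a pivot of (year, month, value) records into a month-by-year grid, to which the color encoding is then applied. A minimal sketch with hypothetical values:

```python
def heat_map_grid(records, years, months):
    """Pivot (year, month, value) records into a months-by-years grid;
    missing combinations become None."""
    lookup = {(y, m): v for y, m, v in records}
    return [[lookup.get((y, m)) for y in years] for m in months]

# Hypothetical median ozone measurements.
records = [(1990, 'Jan', 0.018), (1990, 'Apr', 0.035),
           (1991, 'Jan', 0.019), (1991, 'Apr', 0.036)]
grid = heat_map_grid(records, years=[1990, 1991], months=['Jan', 'Apr'])
print(grid)  # one row per month, one column per year
```

Each cell of the grid maps to one square of the heat map, with darker shades for higher values.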

Summary

Throughout this chapter we have provided some basic advice for how to become a more effective data analyst by really knowing and understanding your data. You should now understand:

  1. What data aggregation means, how you can aggregate data, and when you should avoid aggregating data that is already aggregated
  2. How to identify your audience and how to communicate metrics clearly to them
  3. How to identify and correct a mistake in your analysis
  4. The basics of time series analysis and effective visual representations of time series data

These lessons are easy to incorporate into your daily analytical work. All it takes is attention to detail and an honest assessment of your own work.

Notes