Plotting numbers on a chart does not make you a data analyst. Knowing and understanding your data before you communicate it to your audience does. Understanding data is challenging, and this chapter is designed to help you master these important ideas:
Aggregated data is data combined from several measurements for the purpose of summarization.1 Some of the uses for data aggregation include:
This section will help you understand whether data can be aggregated, some basic aggregation methods, and things to watch out for.
When working with aggregation, it is important to avoid metrics that aggregate data that is already aggregated. What does this mean?
Aggregating summarized data to a higher level will often work. However, there are aggregations you cannot aggregate further without additional data. It takes time and practice to understand data aggregation. To aid you on your journey, here are four frequent aggregation mistakes to avoid as a data analyst.
Combining several data points into a single value is the most obvious way to aggregate data. However, not all data can be summarized. Figure 3.1 shows response rates to a survey question:
The data in this case is pre-aggregated; that is, each percentage represents the number of people in a social class who specified a reason, divided by the total number of people surveyed in that class. Since this is the lowest level of data provided, the numbers are represented accurately.
A common mistake people make is to combine (or aggregate) these measurements even further. For example, you may want to know the overall percentage of people who gave abilities and talents as their reason. The natural thing to do is to remove social strata from the table and take an average across the entire surveyed population. This results in the table shown in Figure 3.2.
This table tells you that, on average, 9% of people responded that abilities and talents was a reason for success. But this is not accurate because you have not considered the sample size. Taking an average of an average (the original percentage) does not result in a weighted average, which takes into account the sample size (i.e., the number of people from each social stratum who responded to the survey).
If the survey questioned 100 poor people, 200 middle-class people, and 300 rich people, then you could calculate the weighted average by taking each individual percentage, multiplying it by the number of responses, then dividing that result by the overall number of responses.
Using this weighted calculation, you can now calculate the correct overall average, as seen in Figure 3.3.
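The weighted calculation can be sketched in Python. The sample sizes (100, 200, and 300) come from the text; the per-group percentages below are illustrative, chosen so that the naive average of averages comes out to the 9% mentioned earlier:

```python
def weighted_average(percentages, sample_sizes):
    """Weight each group's percentage by its number of respondents."""
    total_responses = sum(sample_sizes)
    weighted_sum = sum(p * n for p, n in zip(percentages, sample_sizes))
    return weighted_sum / total_responses

# Illustrative percentages for poor, middle-class, and rich respondents
# who cited "abilities and talents" as a reason for success:
percentages = [12.0, 9.0, 6.0]
sample_sizes = [100, 200, 300]  # sample sizes given in the text

print(weighted_average(percentages, sample_sizes))  # 8.0 (weighted)
print(sum(percentages) / len(percentages))          # 9.0 (average of averages)
```

The two results differ because the unweighted version gives the 100 poor respondents the same influence as the 300 rich respondents.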
Knowing when your data can and cannot be aggregated is critical to performing accurate analysis. As you can see from these aggregated tables, the numbers differ between the unweighted and weighted average. If your data is pre-aggregated and you are not provided with the sample size, aggregating the data further could yield inaccurate numbers.
As mentioned previously, aggregating an average to a higher level without appropriately weighting the data could lead to inaccurate representation of the data. Consider Figure 3.4, which Andy created about internet usage trends around the world.
This data set contains pre-aggregated data of internet usage per 100 people per year per country. Essentially this data can be presented as percentages since its base is 100 people. Looking at an individual country per year provides an accurate picture, because the data was given at that granularity.
However, there is a mistake in calculating worldwide internet usage. The worldwide average was calculated as a straight average of all of the countries for each year. Why is this a problem? It does not account for the population of each country and therefore it is not weighted.
For example, Iceland has the highest internet usage per 100 people, but it also has a tiny population compared to China, which has the largest population yet sits in the middle of the listed countries for internet usage penetration. Giving Iceland the same weight as China when calculating overall internet usage is simply inaccurate. The population of each country should be included in the overall calculation in order to provide an accurate representation of worldwide internet access per 100 people. In this case, the worldwide average should be removed from the visualization.
For the data set about the drinking habits of British people, many Makeover Monday participants focused on the year-over-year change. The data was already provided as percentages, so when calculating year-over-year change, you should use the difference (in percentage points), not the percent difference, to show accurate results.
The chart in Figure 3.5 is filtered to one age band and one question. Creating a calculation that is the difference from each year compared to the very first year shows that in 2016, 14% fewer people aged 16–24 “drank alcohol in the last week” (based on the question in the survey) compared to 2005. Every single year from 2006 onward relates its result back to the value of 2005, which forms the baseline, showing that overall alcohol consumption appears to be trending down.
Compare that to Figure 3.6, where, instead of using absolute change (difference from 2005) we have applied a relative change calculation: the % difference of each year compared to 2005. This implies that in 2016, 23% fewer people aged 16–24 “drank alcohol in the last week” compared to 2005.
Calculating the percent change between two percentages is not strictly wrong, but it can be very misleading. Instead, use the absolute change (in percentage points) when you are working with percentages and want to show the difference between two points in time.
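The distinction can be sketched with hypothetical survey values; the 2005 and 2016 figures below are illustrative, chosen to roughly reproduce the 14-point and 23% results above:

```python
drank_2005 = 60.0  # % of 16-24-year-olds who "drank alcohol in the last week"
drank_2016 = 46.0  # illustrative values, not the survey's exact figures

# Absolute change: a difference of percentages, in percentage points.
absolute_change = drank_2016 - drank_2005                       # -14.0

# Relative change: the difference expressed as a % of the baseline.
relative_change = (drank_2016 - drank_2005) / drank_2005 * 100  # about -23.3

print(absolute_change, round(relative_change, 1))
```

Both numbers describe the same two data points, which is exactly why mixing them up misleads the audience.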
Ranks do not explain how much one item varies from another. Ranked data is ordinal; that is, the data is categorical and has a sequence (e.g., who finished the race first, second, and third). That’s it! Ranked data can be used for showing the order of the data points.
Consider Figure 3.7, which shows the rank of sales for each region across product categories.
Figure 3.7 tells you which region ranked first, second, third, and fourth for a given product category, no more, no less.
If a box plot is added to show the variance between the data, as in Figure 3.8, you see that the statistical median (i.e., the place where the two gray areas meet in the box plot) within each category is identical. This means that the variance between the ranks is identical, as would be expected with ordinal data. In other words, you cannot make assumptions about the true variance between the regions (i.e., by how much sales vary from one region to the next).
If you look at the actual sales values instead of the rank of sales, you start to gain an understanding of the variance across the regions. You can now see in Figure 3.9 how much better one region performs in monetary terms compared to another region.
By adding a box plot in Figure 3.10, it is now easier to see the variance across the regions within each category and also which category has the highest median (technology).
When working with ranked data, you cannot make inferences about the variance in the data; all you can say with certainty is which item is ranked higher than the others, not how much higher. When you have ranked data of a sprint race at the Olympics, you can say who came first, second, and third, but without the underlying detail, you cannot say how much faster one person was compared to the next.
This section covers the four most common aggregation types that we have observed during the Makeover Monday project. The definition of the aggregation and the formulas for creating each calculation are included, as well as an example of a chart for each type of aggregation.
You have gathered all of your sales data and now you need to understand total sales at various levels of aggregation. Perhaps you need to understand sales by month or sales by region or which products are selling the most. Each of these requires a basic summarization; that is, you add the sales together for each member of a field, as in Figure 3.11.
Formula

Total = sum of all values in the set
Example Chart
Figure 3.11 shows the sales summarized by region, year, and quarter.
An average represents a typical value for a group of things. For example, on average it rains three days per week. In everyday talk, people use the word “average” as a rough indicator of what is typical, with some of the values above the average and some below.

When an analyst calculates an average, what they are really calculating is the mean. The mean is calculated by summing the values in the set and then dividing by the number of values that comprise the set.
Formula

Mean = sum of all values in the set / number of values in the set
Example Chart
In Figure 3.12 we want to know the average number of RBIs (runs batted in) across all players.
Median and mean are often confused. While the mean provides the mathematical average, the median returns the middle value in a data set when the values are arranged from smallest to largest.
If there is an odd number of values, then the median represents the value exactly in the middle of the data set. For example, if there are seven values ordered from smallest to largest, then the median would be the fourth value in the data set.
If there is an even number of values, then the median is the average of the two values in the middle. If there are eight values, then the median would be the average of the fourth and the fifth values.
Formula
Median = value of ((n + 1)/2)th data point in an ordered set2
Example Chart
Figure 3.13 shows the median number of RBIs across all players.
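Mean and median can be computed side by side with Python's standard library; the RBI totals below are made up for illustration, not taken from the chapter's data set:

```python
import statistics

rbis = [4, 12, 25, 31, 40, 58, 120]  # seven players, ordered smallest to largest

mean = sum(rbis) / len(rbis)      # about 41.4
median = statistics.median(rbis)  # 31, the 4th of 7 ordered values

# With an even number of values, the median averages the two middle ones:
median_even = statistics.median([4, 12, 25, 31, 40, 58])  # (25 + 31) / 2 = 28
```

The single large value (120) pulls the mean well above the median, which is why the median is often the better "typical" value for skewed data.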
Counting is the action of finding the number of elements of a finite data set.3 For example:
To count the unique items in a data set, you typically use a distinct function: count the number of elements in a data set, but count each element only once. For example:
Counting the distinct elements avoids incorrectly inflating the numbers. Using the marketing campaign example, to get the number of customers reached, you do not want to count a single customer multiple times, even if you contacted them every month and the data set therefore contains 12 records for that customer.
Formula

Count = number of records in the set

Count Distinct = number of unique records in the set
Example Chart
If you want to know how frequently products are sold in each product category, you should count all of the products each time they had a sale, as shown in Figure 3.14.
However, if you want to know how many unique products sold in each category, you need to change the measurement to a distinct count. Notice in Figure 3.15 how the order of the product categories changed based on the number of distinct products sold.
Count and count distinct are used to answer different questions. Use count when you need to know the total number of items in a set. Use count distinct when you are interested in the unique number of items in a set.
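The difference is easy to demonstrate with Python's `len` and `set`; the customer names are hypothetical:

```python
# One record per contact: "Ann" was contacted three times, "Bob" twice.
contacts = ["Ann", "Ann", "Bob", "Ann", "Cara", "Bob"]

count = len(contacts)                # 6 records in total
count_distinct = len(set(contacts))  # 3 unique customers

print(count, count_distinct)
```

Using `count` here answers "how many contacts were made?", while `count_distinct` answers "how many customers were reached?".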
Include information for viewers of your visualizations who do not have any background on the topic and need some guidance to interpret the data. If it is a metric the audience is probably not familiar with, take the time to explain how it is calculated and what it means. If you have taken a standard calculation and changed it slightly to fit the purpose, be sure to point that out to the audience. In the end, you want your audience to quickly understand what you created. Using complex metrics will only hinder interpretation.
When you create a data visualization, you rarely create something for yourself and instead focus on other users. Unless you know your audience, you are likely to fall short with your communication. If you know exactly who your audience is, great! However, it is rare that you will know everyone who will view your work. It is likely you will have to account for an unknown audience; whether it is online or in your organization, someone you gave a report to may show it to others you never considered as your target audience.
A little bit of research can go a long way toward understanding your audience and their needs. Consider making a short list of the people who may be in your audience. Try to be as specific as possible and then do some research about them. If you are presenting to upper-level management, ask others in the organization what is important to them. If you are presenting to a team of analysts, it is likely you will go into a lot more detail than you would for an overview report to the board of directors. Either way, taking some time to get to know your audience will help you communicate more effectively.
If you are addressing a mixed audience, your communication will most likely need to be targeted at the most senior person or the decision maker. Design your analysis for them; you can always follow up with the rest of the team later.
Once you have enough information about your audience, make note of their different needs and use these insights as a guideline for your analysis. If you create something that is not on that list of needs, keep it off your final output; you can always provide it on request. Do not add something that could confuse your audience. Finally, review your work with someone who can give you unbiased feedback. Particularly useful is someone who does not know about the content and is honest with you; if you make your analysis understandable to them, then you have probably communicated the information very well.
When aggregating data, it is important to consider the base metric. Take a look at Figure 3.16, a line chart of BMW and Mercedes car prices by month.
It should be obvious that a car does not cost €300+ million. The purpose of this chart is to compare the prices of BMW to Mercedes over time, but the data, as presented, is not normalized by the number of cars sold. Data is often presented in this manner because the analyst did not take the time to understand the data and to identify an appropriate metric.
Since this data is about car prices, it does not make sense to sum the prices. When you want to compare the prices of car models, you should probably look at the median or average price. When you do that, as in Figure 3.17, the prices are easy to understand, and effective comparisons can be made.
When approaching an analytical project, you are often presented with a limited set of metrics. Any great data analyst will see this as an opportunity to redefine or restructure the data in order to interpret it in a different way. Reshaping the data provides you with a new structure of fields and metrics that can potentially be used in ways that give you more options for analyzing the data.
Consider Figure 3.18, which examines visitor history to US National Memorials. A common way to look at this data is to plot the cumulative number of visitors over time.
There is nothing wrong with this metric; however, Shawn Levin took a different approach in Figure 3.19 by creating a metric for how “old” each park was. This metric was not originally in the data set, so he needed to create a calculation to include it.
This is a subtle change, and when presented in the way that Shawn did, a different story can be told. When you are exploring your data, look for alternate views of the data; you just may find a more interesting insight.
When we gave the community global warming data to analyze, one of the first visualizations posted came from Charlie Hutcheson (Figure 3.20).
While the design looks great, this visualization suffered from a flaw: it showed relative change instead of absolute change. Since this visualization focused on temperatures, what really is relative change? What does it mean for something to be 20% hotter? And 20% hotter in °C is very different from 20% hotter in °F. Relative change expresses the difference from a prior period as a percentage of that period's value.
On the other hand, absolute change is the difference from a prior period. When analyzing temperature change, absolute change is more understandable to the audience. We know what +10°F or −5°C feels like. Charlie took this feedback into account and revised his visualization in Figure 3.21; +10°F is much easier to understand than +270%.
While Charlie’s visualization related to using a confusing metric, the dashboard Rahul Singh created in Figure 3.22 used an incorrect aggregation. The data about medal winners in the Southeast Asian Games (SEA Games) included every medal won by every individual and team.
Look at the bars on the right side. According to this chart, Thailand won 1609 medals as a country across five years. That is nearly 322 medals per year. A quick analysis of the data set indicates there are only 75 unique sports. So how can a country win more medals than events that exist? Obviously, they cannot. The issue is that every medal for every individual was counted as part of the total medal count for a country. However, the official medal counts are by event.
By counting the unique number of sports for each medal type, you get an accurate count of the number of medals won by each country. Figure 3.23 shows the accurate medal counts.
As a data analyst, you need to have a sense for when something does not look quite right. You need to be able to spot outliers quickly and investigate their validity. Take time to explore the data and do some research about the topic. Simple steps like this will help you avoid basic mistakes.
A time series is a sequence of values, usually taken in equally spaced intervals. Typical examples include:
Essentially, anything with a time dimension, measured in regular intervals, can be used for time series analysis. This section will introduce you to common methods for analyzing and visualizing time series data.
A univariate time series is a sequence of measurements of a single variable collected over time.4 As shown in Figure 3.24, observations of CO2 concentration are a single variable collected over time.
When analyzing univariate time series, you may be able to quickly see trends, as in the linear trend in Figure 3.25. Adding a linear trend line and checking the R-squared value helps you confirm whether the trend approximates the data points well.
If you are interested in seasonality, you can plot a date part (e.g., month) on the x-axis and add a line for each year, as in Figure 3.26.
The data has consistent patterns across all of the years, with the CO2 concentration peaking in May, decreasing until September, then rising again until the following May. The design of the chart also shows a consistent increase of CO2 concentrations over time as more recent years have higher CO2 concentration levels than earlier years.
Displaying the data in a box plot, as in Figure 3.27, reveals the consistent distribution of the years within the months (lowest to highest data point as depicted by gray and black circles), the pattern of the median (where the two gray shaded boxes meet), and the consistent width of the interquartile range (the range from the bottom of the gray box to the top of the lighter gray box).
However, the drawback of the box plot is that it tends to hide the values due to its design. The gray shaded boxes cover some of the data points, making them hard to identify. Figure 3.28 uses lines for more effective display of the median, 25th percentile, and 75th percentile, ensuring the focus is on the spread.
Finally, comparing each individual month to the average for the year normalizes the data and creates an index for each year (i.e., a variance from the mean). In Figure 3.29, the zero line represents the yearly average, while the black lines are plotted along the months for each year. This display shows the variation from that average (i.e., how much one month differs when compared to the average).
This method for visualizing time series data confirms the systematic nature of CO2 concentration.
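A minimal sketch of that normalization, assuming a dictionary of monthly values for a single year (the numbers are illustrative, not the chapter's CO2 readings):

```python
from statistics import mean

monthly = {"Jan": 370.1, "May": 373.5, "Sep": 368.2, "Dec": 370.8}

yearly_avg = mean(monthly.values())  # the zero line in the chart
variance_from_avg = {m: round(v - yearly_avg, 2) for m, v in monthly.items()}

print(variance_from_avg)  # May sits above the yearly average, Sep below
```

Repeating this per year puts every year on the same zero baseline, so the seasonal shape stands out even as overall levels rise.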
Smoothing is a technique that can be used to remove some of the variation in short-term data in favor of emphasizing long-term trends. One of the most common methods is moving average smoothing. Typically, a moving average is the unweighted mean of a series of preceding data points. In many cases, the moving average will also include the current data point. For example, a six-month moving average would be the average of values from the current month and the previous five months.
In the sample data set of CO2 concentration, you may want to smooth the data over a period of time by adding a moving average. Applying a six-month moving average to the data (i.e., the current month value plus the values for the previous five months divided by six) gives you a much smoother visualization, as demonstrated in Figure 3.30.
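The six-month moving average can be sketched as a plain-Python function. Note that tools differ on how to treat the first five points, where a full window does not yet exist; this version averages over whatever values are available, while some tools return nulls instead:

```python
def moving_average(values, window=6):
    """Unweighted moving average including the current point."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # clip the window at the series start
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

print(moving_average([1, 2, 3, 4, 5, 6, 7]))
# [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5]
```

A larger window produces a smoother line at the cost of hiding more short-term variation.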
Comparing data to a fixed point in time allows the audience to understand the magnitude of change. For example, Figure 3.31 shows how home prices have changed since the first month that data was available.
This type of variance analysis allows you to compare change to any point in time. On 30 December 2008, the Case–Shiller home price index reported its largest price drop in its history.5 Visually plotting time series data against a point in time reveals patterns relative to that period, thus allowing the reader to understand growth and decline before and after the given point in time.
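Computing the variance from a fixed reference point is a simple subtraction against the baseline value; the index levels below are hypothetical:

```python
prices = [100.0, 104.0, 110.0, 95.0, 90.0]  # hypothetical monthly index levels

baseline = prices[0]  # the first month on record
change_from_baseline = [p - baseline for p in prices]

print(change_from_baseline)  # [0.0, 4.0, 10.0, -5.0, -10.0]
```

Changing which element serves as `baseline` re-anchors the whole series, which is exactly the flexibility an interactive reference-point selector gives the audience.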
Giving your audience the flexibility to pick and choose their reference point, as has been done in Figure 3.32, allows them to compare values across any time periods that are of particular interest to them.
In addition, allowing the user to choose a state to highlight makes it more contextual. Including a black line for the US average shows how all of the states compare to the nation.
A cycle plot (Cleveland, Dunn, and Terpenning, 1978) shows both the cycle or trend and the month-of-the-year effect.6 A standard line chart will display year and then month on the x-axis and a metric on the y-axis. Figure 3.33 represents the median ozone level across the United States by month for a number of years.
A cycle plot will flip the two parts of the date field. While Figure 3.33 displays months within each year, a cycle plot, as in Figure 3.34, will display years within each month. Once that display is created, add a line for the average across all years for each month and you have a cycle plot.
The cycle plot shows you that April has the highest median ozone level, which was much more difficult to see in the original view. You can also see that, generally speaking, ozone levels have increased in most months over the years as many of the lines within each month are trending upward (i.e., from bottom left to top right).
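The reshaping behind a cycle plot (years grouped within months, plus a per-month average across all years) can be sketched with hypothetical (year, month, value) records:

```python
from collections import defaultdict
from statistics import mean

records = [
    (2014, "Jan", 30), (2014, "Apr", 45),
    (2015, "Jan", 32), (2015, "Apr", 48),
    (2016, "Jan", 35), (2016, "Apr", 52),
]

# Group values by month, keeping years in order within each month.
by_month = defaultdict(list)
for year, month, value in sorted(records):
    by_month[month].append(value)

# The average line drawn across all years within each month.
month_avg = {m: mean(vs) for m, vs in by_month.items()}

print(by_month["Apr"])   # [45, 48, 52] - an upward trend within April
print(month_avg["Apr"])  # about 48.3
```

The within-month year sequences are the small trend lines of the cycle plot, and `month_avg` supplies the per-month reference line.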
Heat maps are effective visualizations for seeing concentrations as well as patterns. Adding a time series to a heat map can also reveal seasonality that may not be obvious otherwise. As we saw in the previous section, ozone levels are highest in the US from April until June. See Figure 3.35.

Presenting this data as a heat map instead of a cycle plot reveals a pattern you may not have noticed before: ozone levels have steadily increased since 1990. You can see this in the gradual darkening of the heat map as you read it from left to right.
Figure 3.36 reduces the data to the state of Colorado, revealing an even more striking increase.
Figure 3.37 shows decreases in Florida in the summer, likely because people who live there seasonally retreat from the heat (and take their cars with them).
Consider calendar heat maps as an alternative time series visualization. They can be very effective at revealing hidden patterns in your data.
Throughout this chapter we have provided some basic advice for how to become a more effective data analyst by really knowing and understanding your data. You should now understand:
These lessons are easy to incorporate into your daily analytical work. All it takes is attention to detail and an honest assessment of your own work.