This chapter reviews several concepts that are useful for applied analysis of ecommerce data. The review is at a high level, aimed at an audience that has had introductory exposure to analytical concepts, such as data visualization and mathematical and statistical methods. It’s written for analysts and those who are analytically inclined and want to learn more about how to analyze data. The analyst should be producing and communicating helpful analysis that creates value by helping people make better decisions. This chapter discusses some of the methods for doing so by:
• Overviewing past and current academic theory on analysis, which is useful and applicable to ecommerce analysis
• Discussing the techniques for examining and interrogating data after it has been extracted from sources using tools
• Reviewing important and helpful data visualizations for applying to ecommerce data
• Providing a high-level review of useful statistical data mining and machine learning techniques for data analysis
By the end of this chapter, you should have confirmed and expanded your knowledge on the fundamentals of analysis and the techniques and methods applied in many types of ecommerce analysis.
Ecommerce analysis frequently involves comparing data across different time periods. Time series data is common. Stakeholders want to know how sales and margin compare to last week, last month, last year, and so on. Day over day (DoD), week over week (WoW), month over month (MoM), quarter over quarter (QoQ), and year over year (YoY) are standard comparisons. Seasonal periods and the impact of seasonality are also necessary to track and measure when analyzing ecommerce. Unfortunately, comparing time periods in ecommerce isn’t as simple as it seems. The same numerical day from last year does not occur on the same day of the week. For example, July 4, 2016 occurred on a Monday, but July 4, 2015 occurred on a Saturday. Thanksgiving may always occur on the fourth Thursday in November, but the holiday falls on a different numerical day each year. Cyber Monday is always the same day of the week, but it is rarely on the same numerical day. The National Retail Federation (NRF) recognized that such date comparisons could be problematic and created a retail calendar. The NRF calendar eliminates the problem of confused date range comparisons because it standardizes the way weeks are counted and thus compared. It’s called a 4-5-4 calendar. For ecommerce analytics, the adoption of the National Retail Federation’s 4-5-4 calendar is a common, though entirely voluntary, industry practice. It is especially useful for companies that operate traditional retail businesses offline as well as online.
The 4-5-4 calendar divides each quarter of the year into a 4-week month, a 5-week month, and another 4-week month. It solves the problem of unevenly distributed weekend days (Saturdays and Sundays) by dividing the year into standard week increments. The NRF calendar aligns holidays so they can be compared across time. It also ensures that the same number of weekend days occurs during each comparable month. In this way, time periods can be compared accurately, and fluctuations and changes in data aren’t misinterpreted or misunderstood simply because of date confusion. This calendar also accounts for leap-year changes and what is known as the “53rd week,” such as the one occurring in fiscal 2017.
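Before adopting a full 4-5-4 calendar, many analysts approximate weekday-aligned comparisons by shifting exactly 52 weeks (364 days) instead of one calendar year. The following is a minimal sketch of that idea, assuming a pandas Series of daily sales indexed by date; the data and names are invented for illustration, not taken from this book.

```python
import pandas as pd
import numpy as np

# Illustrative daily sales series indexed by calendar date (assumed data).
dates = pd.date_range("2015-01-01", "2016-12-31", freq="D")
sales = pd.Series(np.random.default_rng(42).gamma(2.0, 500.0, len(dates)), index=dates)

# Naive YoY: compare to the same numerical date last year (weekday may differ).
naive_yoy = sales / sales.shift(365) - 1

# Weekday-aligned YoY: shift 52 weeks (364 days) so Mondays compare to Mondays,
# which is the alignment a 4-5-4 retail calendar standardizes.
aligned_yoy = sales / sales.shift(364) - 1

print(aligned_yoy.loc["2016-07-04"])  # July 4, 2016 vs. the same weekday in 2015
```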
At the end of the day, being successful with analysis involves more than just technology, data, reports, dashboards, and applied methods. The analytical deliverable, whatever it is, needs to be understood sufficiently by stakeholders. Communication creates understanding, and storytelling makes that communication easier to understand. In fact, narrating a compelling story to stakeholders about what the data tells you or is being used to do is one of the last steps in the delivery of analysis. Use stories to communicate the answer to a business question. Keep in mind that the applied methods and techniques discussed in this chapter may be interesting to analysts, but other businesspeople may have a low tolerance for new, unfamiliar, or difficult concepts. I once had a colleague tell me she writes her analysis so that a 12-year-old or her grandmother could understand it. The takeaway is that the story and the way you deliver the message behind the analysis may in fact be more important to successful outcomes than the techniques or data.
Keep in mind that if you present analysis that differs from or changes what is commonly understood, or if your analysis shows business performance in any way that is not positive, it is likely that your data and analysis will be challenged. The analyst must ensure that analysis is presented in the most human way possible, mindful of organizational behavior, motivation, and emotion. Storytelling is an inherently human activity, so communicating narratives is a natural way for people to understand things. Thus, instead of presenting only slides full of numbers, charts and graphs, fancy PowerPoints, and glitzy data visualizations—which are all important—also make sure to weave a narrative through the data to tell a story about what the data says. Do not make the mistake of presenting only data and visualizations. Tell stories with the data. It’s easy to say “tell stories.” It’s harder to do. Some people invent pretend characters to represent customers and internal and external actors. Others use personas or customer segments as the basis for storytelling.
Following are several guidelines for use when forming a story to tell from data. Consider applying these techniques when socializing analysis in the form of story-based narratives:
• Identify why the analysis has occurred and why the story you are about to tell is important. Businesspeople are incredibly busy and require context for reporting and analysis. Explain why they should care.
• Indicate the business challenge you want to discuss and the cost of not fixing it. By clearly stating what business issue catalyzed the analysis and framed the recommendations, you can eliminate confusion.
• Identify any forewarnings. If there are any errors, omissions, caveats, or things to discuss, clearly indicate them in advance.
• Depersonalize the analysis by using fictional characters to humanize the data you are reporting. Doing so lowers political risk: aliasing scenarios and abstracting details eliminates the risk of offending a specific stakeholder or group.
• Cite important events that help to illustrate a narrative. Annotating externalities and things that happen in the business as the data is collected or when the behavior occurs can help to clarify analysis.
• Use pictures. They are worth a thousand words. And they can save you time explaining concepts in writing. Charts, graphs, trend lines, and other data visualization techniques are helpful.
• Don’t use overly complex, wonky vocabulary. Esoteric and scientific vocabulary is best left within the analytics team. No one will really be impressed if you use words like stochastic. It really means “random.” Try to make the communication and presentation of analysis as simple as possible.
• Identify what is required. Clearly indicate what you think needs to be done in written language using action-oriented verbs and descriptive nouns. Say what you think and what you want to do.
• Identify the cost of inaction. Clearly indicate the financial impact of doing nothing and compare it against the cost of doing something. It may help to present comparable costs from other alternatives.
• Conclude with a series of recommendations that tie to value generation (either reduced cost or increased revenue). Although you may not be an expert at the same level as the person requesting the analysis, analysts should express their ideas and perspective on the data and business situation. Recommendations should be made that are clearly and directly based on data analysis—and these recommendations must be able to withstand the scrutiny and questions.
John Tukey authored the (in)famous book Exploratory Data Analysis in 1977, coined the word “bit” (short for “binary digit”), and was among the first to use the term “software” in its modern sense.
Exploratory data analysis (EDA) is more of a mind-set for analysis than an explicit set of techniques and methods; however, EDA does make use of several of the techniques contained in this chapter. Tukey’s philosophy on data was one that favored observation and visualization and the careful application of technique to make sense of data. EDA is not about fitting data into your analytical model; rather, it is about fitting a model to your data. As a result, Tukey and EDA created interest in non-Gaussian and nonparametric techniques for data whose shape indicates that it is not normally distributed and may have a fat head with a long tail. The idea of the long tail is a Pareto concept that Tukey probably would have favored for understanding big data. After all, some ecommerce behavioral data is not normally distributed, so using basic statistics that expect a normal distribution would not be optimal.
The reason Tukey is referenced in this text is not only because he has been hugely influential in using mathematics and statistics to understand what the data is saying, but because his paradigm for data analysis is based on a set of philosophies or tenets that involve data visualization, pattern analysis, contextual understanding, hypothesizing, and simply looking at the data. Tukey recommends the following:
• Visually examining data to understand patterns and trends: Raw data should be examined to learn the trends and patterns, which can help frame what is possible using analytical methods.
• Using the best possible methods to gain insights into not just the data, but what the data is saying: Tukey espouses getting beyond the data to what the data is saying in the context of answering your questions. This approach is integral to ecommerce analytics.
• Identifying the best performing variables and model for the data: Ecommerce analytics is rife with data, but how do you know which data is right for the problem you are solving? EDA helps ascertain what is important.
• Detecting anomalous and suspicious outlier data: Digital data has outliers and anomalies that in and of themselves may be important or just random noise.
• Testing hypotheses and assumptions: The idea of using insights derived from data to make changes within digital experience is crucial to EDA.
• Applying and tuning the best possible model to fit the data: Predictive modeling and analysis can benefit from an EDA approach.
Tukey’s principle helps simplify the creation of analysis because it emphasizes visually exploring the data before applying statistical methods to it.
The philosophy of EDA is aligned with ecommerce analysis. One of the first things an analyst may do when examining a data set is to identify key dimensions and metrics and use analytical software to visualize the data before applying any statistical method. This way, the analyst can use pattern-recognition abilities to observe the data relationships and unusual data movements to focus his or her work. Then after taking a look at the data, determine how to analyze it and apply the appropriate analytical model and method. Tukey’s EDA approach (or modality) to data analysis can be used separately or with other analytical techniques. EDA can be used in combination with other, perhaps more well-known, modalities for understanding data, such as classical statistics and Bayesian methods. Fortunately, all three modalities for data analysis provide frameworks for finding insights in data that can be applied to ecommerce analytics. EDA, however, imposes less formality than either classical or Bayesian approaches. This flexibility is helpful when analyzing all the different types of ecommerce data.
EDA advocates that you look at the data first by plotting a data visualization and then analyzing the data using the best possible techniques for that data, which may be classical or Bayesian. Classical statistics, unlike EDA, would first instruct the analyst to fit the data to the preferred model, perhaps trimming data to make them fit. A Bayesian approach is an extension of the classical approach in which you would first look at prior data. EDA recommends creating data visualizations first before selecting a model or reaching any conclusive insights. Classical and Bayesian analysts would likely view data visualization as a supporting artifact created during or after analysis (not as the first step to start an analysis). EDA would consider Gaussian and non-Gaussian techniques equally valid based on the data and would encourage the analyst to explore the data and make simple conclusions based on useful visualizations. The advanced applied analysis in EDA comes after the simple conclusions.
When conducting digital analysis, keep classical, Bayesian, and EDA approaches in mind and remember what Tukey said. The following is a short description of exploratory data analysis from Tukey:
It is an attitude AND a flexibility AND some graph paper (and transparencies, or both).
No catalogue of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper and transparencies are there, not as a technique, but rather as recognition that the picture-examining eye is the best finder you have of the wholly unanticipated.
Data typing is a useful concept. Data type refers simply to the types of data that an analyst runs into in the ecommerce world. My goal here is to not overwhelm you with complexity nor confuse you with unusual and uncommon words. Data type doesn’t refer to the common computer science and engineering terms (integers, Boolean, floating point, and so on). I review that, lightly, in Chapter 5, “Ecommerce Analytics Data Model and Technology.” Instead, what I describe here is a simple way to understand data types and data subtypes in a practical business-focused way for ecommerce analysis:
• Quantitative data: Data that is numeric. The data is a number, such as 2 or 2.2 (whole numbers or floating-point values, in engineering parlance). Quantitative data can be further subdivided into these types:
– Univariate data: Like the prefix “uni,” this data type deals with one single variable. The analyst uses this variable to describe the data to stakeholders using methods to examine distribution, central tendency, dispersion, and simple data visualization techniques like box plots. A question of univariate data might be “How many unique visitors have we had on the site by month for the past 24 months?”
– Bivariate data: The prefix “bi” means “two”; thus this data type deals with two variables. The analyst uses these variables to explain the relationship of one to another. Methods used include correlation, regression, and other advanced analytical techniques. A question of bivariate data might be “What is the relationship between marketing spend and product purchases?”
– Multivariate data: This data type covers data that has more than two variables. Many advanced analytics techniques, from multiple linear regression to automated testing, targeting, and optimization algorithms and technologies, use multivariate data. Most if not all analytics systems create multivariate data. Big data is multivariate data.
• Qualitative data: Data that is not numerical but text-based. Traditionally, qualitative data could be pass/fail (P/F), multiple choice (A, B, C, D), or text-based, verbatim answers derived from market research.
Quantitative and qualitative data can further be divided into subtypes such as these:
• Discrete data: Data that can be counted separately from each other; for example, the number of unique customers.
• Nominal data: Data in which a code or variable is assigned as a representation. Nominal data can be quantitative or qualitative; for example, using Y or N to represent whether a particular marketing campaign was profitable.
• Ordinal data: Data that can be ranked and has a ranking scale attached to it. For example, Net Promoter Scores and the star rating on a mobile application are examples of ordinal data.
• Interval data: Data measured on a scale where the zero point is arbitrary, so where the count starts does not matter. Interval data can be added and subtracted but not meaningfully multiplied or divided; for example, the recency of two distinct segments of data (as expressed in days).
• Continuous data: Data that can take any value within a specific interval; for example, the amount of time customers spend on a mobile device on weekdays compared to weekends. Or the size over time of your web site’s home page.
• Categorical data: Data that is, as the name implies, represented by categories. Think of search categories, inventories, taxonomies, and other classification systems that need to be represented in analysis; for example, the brands and products in your product catalog.
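These subtypes map naturally onto column types in analysis tools. The following is a minimal, hypothetical sketch in pandas showing how an analyst might declare discrete, nominal, ordinal, continuous, and categorical columns; the data and column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "orders": [3, 1, 7, 2],                       # discrete (countable)
    "campaign_profitable": ["Y", "N", "Y", "Y"],  # nominal (coded representation)
    "app_star_rating": [5, 3, 4, 2],              # ordinal (ranked scale)
    "minutes_on_site": [12.4, 3.1, 27.9, 8.5],    # continuous (any value in an interval)
    "product_brand": ["Acme", "Globex", "Acme", "Initech"],  # categorical
})

# Declare an ordered categorical so the ranking in ordinal data is preserved.
df["app_star_rating"] = pd.Categorical(df["app_star_rating"],
                                       categories=[1, 2, 3, 4, 5], ordered=True)
df["product_brand"] = df["product_brand"].astype("category")

print(df.dtypes)
```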
The reality in ecommerce analytics is that an analyst runs into each type of data often when solving for the same business problem. Take, for example, the analytical project in which a search-referred visitor’s online opinion data about a mobile application is joined to his digital behavior. In this case, the search keyword (and related ad group) is known, as is the person’s either positive or negative opinion about the relevancy of the digital content to his intent—and the person’s behavior that led him to his conclusion. Taking on an analytical challenge like this example requires the analyst to work with multiple data types (and sources).
When first beginning an analysis project, it helps to look at the shape of the data to understand what method may be appropriate. You can use data visualization tools to look at the shape of data. Data shape is likely a familiar concept to you due to the popularity of concepts like the bell curve in the normal distribution. A perfect distribution is normally shaped like a bell. In ecommerce reality most data is not perfectly shaped; instead, it is usually skewed negatively to the left or positively to the right.
At the ends of the distributions, you can find outlier data. Outliers are data values that fall outside of where most of the data is located. The traditional statistics rule is that an outlier is indicated by a data measurement at or more than two standard deviations away from the average. When data has many outliers, the distribution is said to exhibit excess kurtosis, such that the tails of the distribution are fatter and turn up at the ends. Distributions can be mesokurtic (like a normal distribution), leptokurtic (sharper peak and heavier tails), or platykurtic (flatter peak and thinner tails).
Shape is important when data is analyzed because it is an easy way to immediately infer the type of data and the possible methods or approaches for dealing with the data. For example, if you notice that the shape of your data is Pareto with a long tail, it may not make sense to use a model for normally distributed data. Perfectly symmetrical data would be ideal to work with, but it never exists; thus, analysts attempt to use various techniques to turn skewed data into symmetrical data.
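As a minimal sketch of this look-at-the-shape-first step, the code below plots a histogram and reports skewness and excess kurtosis so you can judge whether methods that assume a normal distribution are appropriate. The order-value data is invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative, right-skewed order values (assumed data).
order_values = pd.Series(np.random.default_rng(1).lognormal(mean=3.5, sigma=0.8, size=5000))

print("skewness:", stats.skew(order_values))             # > 0 suggests a long right tail
print("excess kurtosis:", stats.kurtosis(order_values))  # > 0 suggests heavier tails than normal

order_values.plot(kind="hist", bins=50, title="Shape of order values")
plt.show()
```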
Ecommerce analytics uses classical statistics, like those taught in business schools throughout the world, to make sense of ecommerce data:
• Mean or average of the data is understood by almost everyone. By summing all the observed values in a data set and dividing by the number of observations, you can calculate the average. Averages are perhaps the most commonly used technique for making sense of data. They can also be one of the most misleading because the mean can be skewed by outlier data.
• Median is the term used to describe the middle point in the data. Stated another way, half of the observations fall above the median and half fall below; it is the statistical midpoint of the data set.
• Mode is the often neglected or forgotten concept. Generally speaking, the mode is the value occurring most frequently in a distribution. For example, if 29 out of 50 people got a score of 82 and the other 21 people got different scores, then the mode would be 82 (because it is the most frequently occurring value).
• Standard deviation is a measure of the spread in a data set. The standard deviation measures the dispersion of the values in the data. For example, if analytics showed that people spent between 3 and 27 minutes on web site A and between 13 and 15 minutes on web site B, then web site A would be considered to have a larger standard deviation because the data is more dispersed.
• Range is another useful concept used in digital analytics. It is the measure between the highest and lowest values in a data set. As such, it is highly influenced by outliers. For example, if one month a mobile app has 200,000 downloads and the next month the app has 500,000 downloads, the range would be 300,000 (500,000–200,000).
• Outliers are observations in a data set that fall two or more standard deviations from the mean. In real-world practice of ecommerce analytics, some analysts choose to trim outliers from the data set to shape the data for application in the model. Other analysts think this is not correct. In a true EDA-esque approach, outliers may be investigated to determine whether an unusual insight may exist in the outlier. Take the 2016 water crisis in Flint, Michigan. It was reported that the highest lead levels recorded in the town’s data set were trimmed, which caused the reported lead level to fall below the threshold at which the town was required to report to the federal government. Now imagine if EDA had been used: the data would not have been trimmed, and the federal government might have investigated sooner. Trimming data has real-world implications in practice. Be careful.
Another example of outlier detection is common in the financial services. If a person deposits $1 million into her bank account instead of her usual $10,000 paycheck, the $1 million deposit would be considered an outlier. Banks use outlier data detected by analytics systems as input for targeting and promotional offers. For example, the person’s bank may offer a financial instrument for investing those million dollars upon next login. If a customer starts spending more money on your ecommerce site than historical norms indicate, then outlier detection can help you realize it, so you can respond.
These basic statistical concepts are the foundation for understanding how to analyze quantitative data. Make sure to comprehend these concepts and their definitions, and apply them in your ecommerce analysis.
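A minimal sketch of applying these measures to a short series of daily revenue (invented values and names) might look like this:

```python
import pandas as pd

daily_revenue = pd.Series([1200, 1350, 980, 1100, 1425, 9800, 1280, 1190])  # assumed values

mean = daily_revenue.mean()
median = daily_revenue.median()
mode = daily_revenue.mode()          # may return several values if there are ties
std = daily_revenue.std()
data_range = daily_revenue.max() - daily_revenue.min()

# Flag observations two or more standard deviations from the mean as candidate outliers.
outliers = daily_revenue[(daily_revenue - mean).abs() >= 2 * std]

print(f"mean={mean:.0f} median={median:.0f} std={std:.0f} range={data_range:.0f}")
print("mode(s):", mode.tolist())
print("candidate outliers:\n", outliers)
```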
One of the simplest, lowest risk, quickest, and highest value analytical activities is plotting data. Plotting data is an approach to data visualization. In fact, some of the plots described in this chapter are referred to in the next chapter on data visualization. Plotting data is data visualization, but the goal and purpose can be different. EDA requires plotting data as input to analysis, whereas beautiful data visualizations put into PowerPoints are the output of analysis. By taking the data in raw or detailed form and applying it to a set of coordinates and related visualizations, prior to doing any analysis, an analyst can see what the numbers say. That’s core to EDA. In EDA, the graphical interpretation of data is central and primary. EDA requires data visualization at the start of the process, not at the end of an analytical process. In ecommerce analytics, plotting data using the techniques described next can help the analyst identify the best model for analyzing the data. These techniques may reveal outliers and other anomalous data that should be closely investigated as a part of an ecommerce analytical plan. The block, lag, spider, scatter, probability, and run sequence plots are discussed here.
The block plot is an EDA tool that attempts to replace the Analysis of Variance (ANOVA) test used in classical statistics. The block plot is a graphical technique that enables the comparison of multiple factors on a particular response across more than one group. Block plots are useful in ecommerce analytics for comparing data generated from testing and experimentation where multiple combinations of elements on a goal are being analyzed.
The block plot can help you determine whether a particular variable impacts your goal and whether the impact is significant. By using a block plot to visually examine the results of testing and experimentation, you can identify the best possible combination of variables meeting a goal and how much current performance may be impacted by the various experiments.
For example, you can use a block plot to visualize the impact of a business plan on an average order value (a common ecommerce metric) where the plot experiments with marketing channel, site speed, time of day, and the user’s persona. You can then use the block plot to determine whether the average order value is significantly impacted by the people being exposed to different advertising at different times of the day and the impact of speed.
The block plot helps you quickly identify the impact of your experimentation without using ANOVA or another method. The challenge when trying to employ this basic EDA technique is that most commercial software can’t create block plots, so you may need to build them yourself with data science tools.
A lag plot is a more complicated type of a scatter plot. It is used for visualizing whether a data set is random (stochastic) or not over a particular lag (time). After all, random data should look random and not actually take any noticeable and definable shape. For example, if you plot data and notice that the lag plot shows data points in a pattern (like a line), you could quickly surmise that the data was linear or quadratic and apply the appropriate analytical method. The lag plot is one easy way to check for randomness—and also notice if any outliers exist in the data. Use a lag plot to check the shape of data and visually inspect it to determine a suitable model to apply for analysis.
You may ask: what is the difference between a scatter plot and a lag plot? The difference is that the values plotted in a lag plot come from the same variable, offset against itself by a time displacement (the lag). If you don’t understand what time displacement means, use a scatter plot.
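pandas ships a lag plot helper, so a minimal sketch, assuming an invented daily conversion-rate series, is:

```python
import numpy as np
import pandas as pd
from pandas.plotting import lag_plot
import matplotlib.pyplot as plt

# Illustrative daily conversion rates with a weekly pattern plus noise (assumed data).
rng = np.random.default_rng(7)
days = np.arange(365)
conversion_rate = pd.Series(0.03 + 0.005 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 0.002, 365))

# Random data would fill the plot as a shapeless cloud; structure (a line or ellipse)
# suggests autocorrelation and points toward time series models instead.
lag_plot(conversion_rate, lag=7)
plt.title("Lag plot, 7-day lag")
plt.show()
```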
A spider plot is a type of plot for multivariate data in which the analyst wants to understand the impact of one variable against others. This visualization can also be called a star or radar plot. In this plot, each variable is connected by lines between a set of spokes. Each spoke is a variable, and the length of the spoke is proportional to the value of that variable (relative to all other variables). As such, the data looks like a star or spider. This type of data plot is especially useful when you compare a number of observations across the same scale. The shapes visually demonstrate whether any variables have more of an impact than others and can also help in comparing whether similarities or differences exist when different subjects are compared across the same attributes. Just remember not to use too many variables, or the plot can get messy and unreadable. A spider plot could be used to visualize the performance of a web site. Each geography could be compared by visits, visitors, time spent, and conversion rate. Lines would be drawn between these spokes to create a different shape for each geography. This shape would help to quickly show differences in the data.
A scatter plot is a fundamental data visualization that quickly helps to show relationships between variables. You would plot one variable and all observations on the x-axis and the other variable and values on the y-axis. Metrics such as conversion rate and dimensions such as marketing campaign and time can be scatter-plotted to reveal relationships, such as linearity or nonlinearity. As with most EDA visualizations, the noticeable relationship between the data in the scatter plot can help the analyst understand correlation (visually) and help in selecting the best analytical model to use for analysis. As with other analytical techniques, be careful not to overinterpret a correlation noticed in scatter plots.
The probability plot is a powerful EDA technique for determining the type of distribution of your data. For example, it is helpful to know whether you are working with a normal distribution or another non-Gaussian type of distribution. The mechanics and mathematics of creating a probability plot are well beyond the goal of this book; however, probability plots are easy to understand and interpret. The analyst plots each data point in a straight line (or at least attempts to do so)—and any data point that falls outside of the line is considered to not fit the hypothesized distribution based on a correlation coefficient. Because of the flexibility of seeing whether the data fits into the plot (and thus the hypothesized distribution), this technique enables the analyst to run tests on the same data against different distributions. The probability plot with the highest correlation coefficient indicates the best-fitting distribution for the data.
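scipy can produce probability plots along with the associated correlation coefficient, so a minimal sketch for checking which distribution best fits a series of session durations (invented data) might be:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative session durations in seconds, heavily right-skewed (assumed data).
durations = np.random.default_rng(3).exponential(scale=120.0, size=2000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
(_, (slope_n, intercept_n, r_norm)) = stats.probplot(durations, dist="norm", plot=axes[0])
(_, (slope_e, intercept_e, r_expon)) = stats.probplot(durations, dist="expon", plot=axes[1])

# The hypothesized distribution with the higher correlation coefficient fits the data better.
print("normal fit r =", round(r_norm, 3), "| exponential fit r =", round(r_expon, 3))
plt.show()
```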
The run sequence plot is among the most common data plots because it is applied to univariate data. That is, an analyst needs only one variable plotted across time to create this simple, but powerful, data visualization. It is the data summarization technique that helps to detect changes in the data. This plot enables a data set to be examined on a common scale and across the distribution to determine outliers, the scale of the data, the location of the data, and the randomness. The response variable, such as conversion rate, is always plotted on the y-axis.
Four plots and six plots are, respectively, sets of four and six EDA techniques for graphically and visually exploring your data. The main difference in the presentation is that a four plot uses the run sequence plot, and the six plot uses scatter plots. The four-plot technique is more frequently associated with univariate data, whereas the six plot is more associated with multivariate data. Both visualizations are in fact useful for ecommerce data. Table 3.1 shows the four- and six-plot techniques.
Histograms are graphical representations that show scale of one or more observations to summarize the data distribution. They help an analyst visually comprehend the spread of a distribution along with its center, skew, and any outliers. Typically, the y-axis shows the measurement and the x-axis shows the variable measured. Histograms are flexible visualizations in that you can custom define both measurements you want to show. Showing more than one measured variable is simple with a histogram because the analyst can create the groupings (called classes) based on their own rules—or using classical statistical methods (such as dividing into ten equal classes).
Histograms show scale on the y-axis and different data on the x-axis based on type. Histograms can be any of the following:
• Regular histograms show one or more similar measurements, for example, displaying the count of customers and orders by month for 2016.
• Clustered histograms show scale on the y-axis and a grouping of variables along an interval. For example, you may use a clustered histogram to show the count of customers per month acquired by different marketing channels.
• Stacked histograms show the components (the detail) in a distribution. For example, you may stack marketing spend by month by campaign.
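A minimal sketch of regular, clustered, and stacked versions, assuming a small invented monthly table of customers by marketing channel, could look like this in pandas:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative monthly customer counts by marketing channel (assumed data).
monthly = pd.DataFrame(
    {"email": [1200, 1400, 1100], "search": [2300, 2100, 2600], "social": [800, 950, 700]},
    index=["Jan", "Feb", "Mar"],
)

monthly.sum(axis=1).plot(kind="bar", title="Regular: total customers per month")
monthly.plot(kind="bar", title="Clustered: customers per month by channel")
monthly.plot(kind="bar", stacked=True, title="Stacked: channel components per month")
plt.show()
```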
Pie charts are an extremely common visualization in data analysis. They are circular and divided into sections such that each section represents a portion of the total measurement. Pie charts do not, however, make all analysts happy. The pie chart is quite disdained as an insufficient or unnecessary technique. Pundits claim that a data table can show, more easily, the slices of pie. A histogram shows the exact same data as a pie chart—these two visualizations are interchangeable. The pie chart starts to get messy and hard to read when divided into more than six sections.
Pie charts are easy to understand and are a common dessert metaphor, which explains their popularity. Everyone understands how to slice up a pie, and it’s an easy leap for students and new analysts alike to put their data into this familiar image. Pie charts are of four types:
• Standard pie charts are circular and show the proportion to scale of each piece in the total measurement.
• Expanded pie charts are where the sections of the pie are dislocated from the entire pie and then shown adjacent in space. By using whitespace to separate sections of the pie, the analyst is visually highlighting the data.
• New types of pie charts, such as the 3D pie, the ring chart, and the doughnut chart, are evolutions of this visualization and have their own parameters and applications that further break down the pie chart to communicate and highlight more data.
• Harvey Balls are not actually pie charts according to the traditional definition, but are highlighted because of their similarity of shape. A Harvey Ball uses a hollow, solid, or sectioned circle to communicate information about the applicability of an object to criteria. For example, you could use Harvey Balls to illustrate whether the speed of a web page meets a given threshold.
A line chart is a visualization for communicating trends in data that occur, most typically, over time. Because the chronology of experience can be charted using a line, this chart is frequently employed by analysts to show trends and time series. By plotting data points in a distribution and then connecting them with a line, you can communicate the scale and pattern in a trend and the temporality. Outliers, trends, and anomalies can be seen using a line chart. By comparing lines representing the same measure across different intervals, changes in data can be observed. Line charts are created by plotting the measure for which you want to trend on the y-axis, and time on the x-axis.
The “line” in the chart in most cases represents the trend exposed by connecting the data points. In other cases, an analyst may present a line on a chart that “fits” that data. The “best fitting” line when plotted, typically within another type of chart, is meant to show the general trend in a large number of data points where it is not possible to draw a meaningful line by connecting the data points. In these cases, the best-fitting line can be created using many statistical methods, such as linear regression or other methods in which the best-fitting line may not necessarily be a straight line, such as quadratic or exponential techniques.
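As a minimal sketch (with invented data) of drawing a best-fitting line, numpy's polyfit handles both the straight-line and quadratic cases mentioned above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative marketing spend vs. weekly orders (assumed data).
rng = np.random.default_rng(11)
spend = np.linspace(1_000, 20_000, 40)
orders = 50 + 0.012 * spend + rng.normal(0, 25, spend.size)

linear_coeffs = np.polyfit(spend, orders, deg=1)     # straight best-fitting line
quadratic_coeffs = np.polyfit(spend, orders, deg=2)  # curved alternative

plt.scatter(spend, orders, alpha=0.6, label="observations")
plt.plot(spend, np.polyval(linear_coeffs, spend), label="linear fit")
plt.plot(spend, np.polyval(quadratic_coeffs, spend), linestyle="--", label="quadratic fit")
plt.xlabel("Marketing spend")
plt.ylabel("Weekly orders")
plt.legend()
plt.show()
```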
The most common line charts an analyst will produce are the following:
• Area charts are used to show portions of a total or to compare more than one variable across the same measures (generally scale and time). Like the stacked bar, the area chart can be used to show the distribution and movements of sections of data against other sections and the whole.
• Sparkline charts are popular because they are so simple to understand, visually powerful, and easy to create. Unlike the standard line chart, the sparkline is never bivariate or multivariate. It’s always univariate. In application, the sparkline is freed from “chartjunk” and “infoglut” such as axes, gridlines, words, and numbers so that it concisely and quickly communicates a small amount of information.
• Streamgraph charts are an evolution of the area chart in which more than one variable is trended across time (or another measure) against some scale. Each “stream” in the graph represents a portion of the total in the same way that a bar in a stacked bar chart represents a portion of the total. The difference in the stream is that the axis is displaced such that the lower and upper bounds of the chart are not limited or trimmed. Each stream touches the bottom of the higher stream and the top of the lower stream.
Flow visualizations have their roots in operational management and other phased processes that result in an outcome. The metaphor of a “flow” is suitable for ecommerce analytics in which customers are coming and going from many channels, on many devices, to many different digital experiences. As these prospects and customers flow through an ecommerce experience, it is important to understand whether the customer is creating value by measuring whether the customer completes the goals you have defined either in one visit or across time. You can read more about flow analysis in Chapter 7, “Analyzing Behavioral Data.”
The idea of “flow” should sound familiar to those who already work in the ecommerce industry. One of the most common constructs for representing customer flow is a data visualization that shows the discrete steps in a user’s behavior that make money. Take the well-known notion of the “conversion rate” in which three to five pre-identified steps, such as entry page > search > product page > checkout > thank you page, define the conversion flow on your web site. Customers may not convert when they begin the conversion process, or they may jump between steps, abandon the process, or complete it at a later date via different marketing channels. To help visually communicate these complex digital experiences and the customer flow over time, the following flow visualizations are useful:
• Bullet chart: A bullet chart is a flow data visualization technique whose closest offline analog is a thermometer you might have found in the Austrian Alps in the 1960s. Bullet charts not only display the scale of a univariate observation, but also use color to highlight a qualitative judgment of success and enable plotting a goal. Bullet charts are a type of histogram—and could be categorized as such; however, because they can be associated with a goal and a target, they can also be used to visualize conversion—and thus can be considered flow charts. By showing multiple bullet charts in an adjacent space, you can illustrate a sequence of steps.
• Funnel chart: A funnel chart is a graphical technique for illustrating the sequence of steps that lead to a macro or micro conversion within a digital experience. Funnels can be custom defined to begin at any point in the customer lifecycle. For example, a multichannel funnel may start with exposure > acquisition source > landing page > product page > checkout. A site funnel might simply represent the steps taken to purchase a product or to sign up for a newsletter. Other funnels may be in-page funnels representing the fields a user must fill out to complete an action.
Funnel charts are often represented linearly, such that each step in the funnel occurs immediately and sequentially before the next. It is also valid to show a nonlinear, nonsequential funnel in which steps are jumped, skipped, or entered from other parts of the site. The funnel has no formal structure or creation rules except that the last step in the funnel is the conversion point at which value is created. Advanced funnel visualizations, such as those found in some tools, attempt to visually demonstrate funnel linearity and nonlinearity including step-jumping, interpolation, and abandonment in a single chart. A minimal sketch of the drop-off arithmetic behind a funnel chart appears after this list.
• Tumbler chart: A newer concept, which you may be reading about for the first time here, is the Tumbler. The Tumbler expresses flow as a series of step-jumping in and out of various states. In the context of ecommerce, a person goes through the following states: seeking (when they look for a product); shopping (when they buy the product); and sharing (when they talk about the product with other people). The Tumbler is a visualization that shows the flow as people move in and out of these purchasing states.
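As promised above, here is a minimal sketch (with invented step counts) of the drop-off arithmetic behind a funnel chart:

```python
import pandas as pd

# Illustrative visitor counts at each funnel step (assumed data).
funnel = pd.Series(
    {"entry page": 100_000, "search": 42_000, "product page": 18_500,
     "checkout": 6_200, "thank you page": 4_100}
)

step_conversion = funnel / funnel.shift(1)      # share of the previous step that continued
overall_conversion = funnel / funnel.iloc[0]    # share of all entrants reaching each step

report = pd.DataFrame({"visitors": funnel,
                       "step conversion": step_conversion,
                       "overall conversion": overall_conversion})
print(report)
```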
The next section moves away from the philosophy behind, and the techniques helpful to, ecommerce analytics—which are exploratory, observational, visual, and mathematical—to a business review of data analysis methods used in today’s analytical companies. These quantitative techniques can be applied judiciously to data to answer business questions. Statistics and machine learning are complex topics beyond the technical scope of this book. The quantitative techniques that form the algorithms for advanced analytical tools are numerous and, without a background in statistics, can be esoteric.
When you’re executing on an analytical plan, certain techniques exist for understanding the order of the data to determine what is important and represented in a distribution. You can determine whether there is a correlation between two or more data points. An analyst can use tools to automate different types of regression analysis to determine whether certain data can predict other data. The details of distributions and assessments of probability can be calculated. Experimentation can be evaluated, and the hypotheses on the data can be tested to create the best-fitting model for predictive power.
The statistics adage is that “correlation is not causation,” which is certainly true. Correlation, however, can imply association and dependence. The analyst’s job is thus to prove that observed associations in data are truly dependent and relevant to the business questions, and ultimately to determine whether the variables caused the relationship observed. Correlation measures whether two variables move together. For example, if every time a visitor comes to your site he buys something, you could consider there to be a strong positive correlation between a site visit and a purchase. This insight may lead you to conclude that all a person needs to do is visit a site and he will always buy. Although you might want this relationship to be true, it’s more likely that the person has already decided to purchase the item before coming to the site and is just fulfilling his desire. Thus, although the mathematics may show a positive correlation between data, common sense indicates that correlation does not imply causality—and that there is only an association between a site visit and revenue. There is no true causality, and the conclusion that a site visit always creates revenue would be a specious and arguable conclusion at best.
The most common measure of correlation you find in an analytics practice is a type of correlation named Pearson’s correlation. Pearson’s correlation produces a measurement between 1.0 and −1.0. The closer the measure is to 1.0, the stronger the positive correlation; the closer the measure is to −1.0, the stronger the negative correlation, and a negative correlation coefficient indicates that the data move in opposite directions from one another. Values near 0 indicate little or no linear relationship.
In a world of linearity, Pearson’s correlation is useful; however, if the data relationship for which you are calculating correlation is not linear, Pearson’s correlation should not be used because conclusions based on the measure will be wrong. Test for linearity (using a number of methods) on your data set before using Pearson’s correlation. If you determine that the relationships in your data are not linear, the world of statistics has other quantitative methods for determining correlation.
Rank correlation coefficients, instead of Pearson’s correlation, can be applied to data sets in which the relationship is not linear. If you are correlating a set of predicted variables, you can use a partial rank correlation to understand the data. Rank correlation also indicates the relationship in which one variable increases or decreases in proportion to another.
Rank-based coefficients, such as Kendall’s tau and Spearman’s rho, express the same type of positive or negative data relationship but do not assume linearity or normal distributions. An analyst, however, should be careful when testing data to determine the correct correlation coefficient. Although it may be possible to substitute a linear correlation measure for a nonlinear correlation measure, these calculations measure different things. Such differences need to be understood in the context of the data and explained in your analysis.
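A minimal sketch comparing Pearson, Spearman, and Kendall coefficients on the same two invented variables looks like this with scipy:

```python
import numpy as np
from scipy import stats

# Illustrative data: spend and orders related by a curved (nonlinear) relationship.
rng = np.random.default_rng(5)
spend = rng.uniform(1_000, 20_000, 200)
orders = np.sqrt(spend) + rng.normal(0, 5, 200)

pearson_r, pearson_p = stats.pearsonr(spend, orders)
spearman_r, spearman_p = stats.spearmanr(spend, orders)
kendall_t, kendall_p = stats.kendalltau(spend, orders)

# For a monotonic but nonlinear relationship, the rank-based measures
# (Spearman, Kendall) describe the association more faithfully than Pearson.
print(f"Pearson r={pearson_r:.3f}  Spearman rho={spearman_r:.3f}  Kendall tau={kendall_t:.3f}")
```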
The phrase regression analysis means the application of a mathematical method to understand the relationship between two or more variables. In more formal vocabulary, a regression analysis attempts to identify the impact of one or more independent variables on a dependent variable. There are many different approaches to completing a regression analysis based on all sorts of well-known and not-so-well-known methods. The more common methods are based on classical statistics and probability distributions, such as simple linear regression and multiple linear regression.
Analytics professionals and the people who ask for analytical deliverables often talk about regression, regression analysis, the best-fitting line, and ways to describe determining or predicting the impact of one or more factors on a single factor or multiple other factors. For example, the impact of various marketing programs on sales may be determined through regression analysis. The most common regression that you see in business is the linear regression. It’s taught in business schools worldwide, and many of the widespread spreadsheet and data processing software programs support regression analysis. In the ubiquitous Excel by Microsoft, the complexity of calculating a regression is reduced to a simple expression on a data set.
In ecommerce analytics, the regression analysis is used to determine the impact of one or more factors on another factor. As in formal statistics, regressions in ecommerce analytics have one or more independent variables and at least one dependent variable. In some cases the application of a multiple linear regression analysis is possible with digital data. It is far more likely that one of the other types of regression analysis, such as exponential, quadratic, and logistic regression, is a much better fit for your data. With regression analysis in digital analysis, your mileage can vary due to the interplay of relationships in big data.
Although this book, and particularly this chapter, is not meant to give exhaustive coverage, by any means, of the mathematical principles behind the application of various models to digital data, a true understanding of the application of advanced applied analytical techniques such as regression, ANOVA, MANOVA, and various moving average models requires comprehending the underlying small data.
In the purest form, as explained in the discussion of correlation, the type of distribution impacts the model you select. In true EDA fashion an analyst must first look at each factor proposed to be used in a potential regression analysis. Multicollinearity, kurtosis, and the other shapes and measures of dispersion help the analyst determine whether classic, Bayesian, or nonparametric techniques are the best fit for the data.
For ecommerce data derived from digital experiences, from the keywords and phrases from search engines to the frequency of purchases of various customer segments, the data is most often not normally distributed. Thus, much of the classical and Bayesian statistical methods taught in schools are not immediately applicable to digital ecommerce data. That does not mean that the classic methods you learned in college or business school never apply to digital data; it means that the best analysts understand when they do and do not apply. Fortunately, so do the engineers and product managers who create analytical software, whose applications assist analysts with everything from preprocessing non-normally distributed data to fit classic methods to applying the best nonparametric model to the data.
The remainder of this chapter discusses frequently mentioned types of regression analysis. It also exposes newer thinking by current academics and gives overviews of techniques that you can explore to understand how to fit your data to a model if you choose to go that route—or if you choose to fit your model to the data. Remember that standard regression analysis is not appropriate for all types of variables; discrete or categorical variables, for example, require alternative forms of regression.
The underlying math behind simple and multiple linear regression can be studied in detail in books such as Applied Regression: An Introduction (Quantitative Applications in the Social Sciences) by Michael S. Lewis-Beck (August 1, 1980). For the purposes of ecommerce analytics, a simple linear regression is used when an analyst hypothesizes that there is a relationship between the movements of two variables in which the movements of one variable impact, either positively or negatively, the movements of another variable. See the “Correlating Data” section earlier in this chapter for more information.
Multiple linear regression and other forms of regression in which the dependent variable—that is, the variable for which you are predicting—is predicted based on more than one variable are used in ecommerce analytics. Understanding the marketing mix and how different marketing channels impact response is often modeled using multiple logistic regression.
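A minimal sketch of a multiple regression on marketing-mix data (invented channel spend and revenue; statsmodels is one common implementation) might be:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative weekly spend by channel and resulting revenue (assumed data).
rng = np.random.default_rng(9)
n_weeks = 104
data = pd.DataFrame({
    "email_spend": rng.uniform(1_000, 5_000, n_weeks),
    "search_spend": rng.uniform(5_000, 20_000, n_weeks),
    "social_spend": rng.uniform(500, 3_000, n_weeks),
})
data["revenue"] = (20_000 + 2.1 * data["email_spend"] + 1.4 * data["search_spend"]
                   + 0.6 * data["social_spend"] + rng.normal(0, 5_000, n_weeks))

X = sm.add_constant(data[["email_spend", "search_spend", "social_spend"]])
model = sm.OLS(data["revenue"], X).fit()
print(model.summary())  # coefficients estimate each channel's impact on revenue
```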
Logistic regression predicts a dependent categorical variable based on several independent (predictor) variables. The output of a logistic regression is binomial if only two answers are possible or multinomial if more than two answers are possible. A 0 or 1 may be the result of a binomial logistic regression, whereas an output of “yes,” “no,” or “maybe” may be the output of a multinomial logistic regression. The predictor variables are used to create a probability score that can be used to help understand the analysis.
Logistic regressions are used frequently in predictive modeling for ecommerce analytics and, particularly, marketing analytics data. The best predictors should be tested for their impact on the model; however, the output is easy to understand. Take, for example, how a logistic regression could be used to segment data into a 1 or 0, in which 1 meant to sell that product online and 0 meant to sell it only in stores. Logistic regression is one type of predictive data analysis.
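A minimal sketch of that sell-online-or-not example, with invented product features and scikit-learn as one common implementation, could be:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative product features: margin %, weight (kg), weekly search demand (assumed data).
rng = np.random.default_rng(21)
X = np.column_stack([rng.uniform(0.05, 0.60, 500),
                     rng.uniform(0.1, 40.0, 500),
                     rng.integers(0, 5_000, 500)])
# 1 = sell online, 0 = sell only in stores (assumed labeling rule for illustration).
y = ((X[:, 0] > 0.2) & (X[:, 1] < 20)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("holdout accuracy:", clf.score(X_test, y_test))
print("probability of selling online:", clf.predict_proba(X_test[:3])[:, 1])
```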
Observing the shape of the data can help an analyst understand it and select the right analytical method to use. After all, the way an analyst applies a method to a normal distribution is different from the way an analyst applies a method to a non-normal distribution.
Probability, simply stated, is the study of random events. In analytics you use statistics and math to model and understand the probability of all sorts of things. In ecommerce analytics, you are concerned about probabilities related to whether a person will buy, visit again, or have a deeper and more engaging experience. And using analytics tools, you can count and measure events related to marketing and customer purchasing behavior and patterns. Measures of probability are used to determine whether events will happen and then to help identify or predict the frequency of those events.
Probability analysis in ecommerce analytics can be done mathematically (using existing data) or experimentally (based on experimental design). Simple and compound events occurring discretely or continuously, either independent of or dependent on other events, are modeled in probability.
An ecommerce analyst should be familiar with the following concepts:
• Modeling probability and conditionality: Building a model requires selecting (and often, in analytics, creating) accurate data, the dimensions, and measures that can create your predictor variables. Central to the tendency to create models is statistical aptitude and an understanding of measures, probability, and conditionality. Conditional probability may sound complicated (and it can be), but the term simply means understanding the chance of a random event after something else has occurred previously (that is, a condition).
• Measuring random variables: A random variable is a type of data in which the value isn’t fixed; it keeps changing based on conditions. In ecommerce analytics, most variables, whether continuous or discrete, are random. Because a random value cannot be known in advance, random variables are understood through probability functions and modeled as such using the many techniques discussed in this chapter.
• Understanding binomial distributions and hypothesis testing: A common way to test for statistical significance is to use the binomial distribution when an outcome has two possible values (such as yes or no, heads or tails). This type of testing evaluates the null hypothesis using Z and T tables and p-values, and tests can be one-tailed or two-tailed. If you want to understand outcomes with more than two possible values, you would use a multinomial test and go beyond simple hypothesis testing, perhaps to chi-square tests. A minimal sketch of a binomial test appears after this list.
• Learning from the sample mean: Measures of dispersion and central tendency (such as those discussed in this chapter: mean, median, mode, and standard deviation) are critical to understanding probability. The sample mean helps you understand the distribution and is subject, of course, to the central limit theorem, which states that the larger the sample, the more closely the distribution of the sample mean will approximate a normal distribution. Thus, when modeling data (especially smaller data sets) the sample mean and the related measures of standard deviation and of variance can help you understand the relationship between variables.
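As referenced above, here is a minimal sketch of a binomial significance test using invented counts; scipy's binomtest (available in recent versions) is one option:

```python
from scipy import stats

# Illustrative: 620 conversions out of 10,000 visitors against a
# historical baseline conversion rate of 5.8% (assumed numbers).
result = stats.binomtest(k=620, n=10_000, p=0.058, alternative="two-sided")

print("observed rate:", 620 / 10_000)
print("p-value:", result.pvalue)  # a small p-value suggests the difference is unlikely to be chance alone
```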
Experimenting with ecommerce data means changing one element of the experience to a sample of prospects or customers and comparing the behavior and outcomes of that group to a control group who received the expected ecommerce experience. The goal of experimentation is to test hypotheses, validate ideas, and better understand the audience/customer. In reality, though, ecommerce data is not biology, and it is often impossible to hold all elements of digital behavior equal and change just one thing. Thus, experimenting in ecommerce means controlled experimentation.
A controlled experiment is an experiment that uses statistics to validate the probability that a sample is as close as possible to identical to the control group. Although the boundaries of a controlled experiment may be perceived as less rigorous than a true experiment in which only one variable changes, that’s not actually true because controlled experiments, when performed correctly, use the scientific method and are statistically valid.
The data collected from controlled experimentation is analyzed using many of the techniques explored in this chapter, such as applying measures to understand and work with distributions. The type of data analysis you do on the data can be as multivariate as the experimental data itself; however, controlled experiments typically have the following elements:
• Population: The aggregate group of people on which the controlled experiment is performed or for which data already collected is analyzed. The population is divided into at least two groups: the control group and the test group. The control group does not receive the test, whereas the test group, of course, does.
• Sampling method: The way you select the people, customers, visitors, and so on for your experiment. It depends on whether you want to understand a static population or a process because different sampling methods are required. Sampling is important because a poorly or sloppily sampled group can give you poor results from experimentation.
Ultimately, you want to randomly sample your population to create your test group. Every person in your group should have the same probability of being selected as any other. When there is an equal potential for selection in the data, you have created a truly random sample.
You can also break down a population into segments that each have their own attributes you define, for example, all customers who are male, below the age of 30, and make more than $100,000 a year. Breaking down a population by its attributes is called stratified sampling.
When you are measuring processes in ecommerce analytics, such as a conversion process, it is likely that the process will change over time. Thus, you can’t hold the population static. In cases like these, in which you analyze a process, you must consider process-based sampling methods, such as systematic sampling.
In systematic sampling, the first datum is chosen randomly; then the rest are chosen based on some rule, such as every 25th or 50th or 100th visitor being selected for the test (see the sketch after this list). This type of sample selection method approximates randomness and incorporates the dimension of time, which, of course, is important in the analysis of ecommerce behavior. An analyst can also look at sampling subgroups. Basically, the analyst finds a common dimension in a set of customers, and then picks the population from various subgroups according to best practices for sample size and at a sampling frequency that creates the necessary sample size.
• Expected error: When analyzing the results of experiments by applying the methods discussed in this chapter, you need to go into your experiment with an idea of how much error you are willing to tolerate. There are various types of errors (such as Type I and Type II). Confidence intervals and confidence levels are applied to understand and limit expected error (or variability due to chance) to a level acceptable for your business needs.
• Independent variables: These are the inputs to the experiment: the element you deliberately change for the test group, along with the attributes held constant or shared across the population or subgroups. Not all of them will matter, but some (hopefully) will.
• Dependent variables: These are the outcome variables the analysis predicts or measures. For example, the conversion rate is a common dependent variable that experiments in ecommerce analytics are designed to inform.
• Confidence intervals: Confidence levels are commonly set at 95% or 99%, though they can be as low as 50%. A 99% confidence interval is often read as meaning “99% of the population will do X or has Y,” but that interpretation is incorrect. A better way to think of confidence intervals in ecommerce analysis is this: if you performed the same analysis again and again on different samples, the intervals you construct would contain the true population value 99% of the time.
• Significance testing: This testing involves calculating how likely it is that an observed outcome is explained by chance rather than by the model and its variables. Significance levels are often expressed between 10% and 0.01%, and the test enables you to determine whether the results can be attributed to error or chance. When the testing is done right, analysts can say their result is significant at the 1% level, meaning that a result at least this large would be expected by chance only about 1 time in 100 if there were no real effect (the sketch after this list shows a simple significance test on conversion rates).
• Comparisons of data over time: Comparisons such as year over year, week over week, and day over day are helpful for understanding how data moves up and down over time. Outlying comparisons need to be investigated.
• Inferences: What ideas are conceived or what thoughts are generated as a result of the analysis? Inferences are the logical conclusions—the insights—derived by using statistical techniques and analytical methods. The result of an inference could be a recommendation and an insight about the sampled population.
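To make the sampling, confidence interval, and significance concepts above concrete, here is a minimal Python sketch of a conversion experiment. The every-50th-visitor rule and the conversion counts are hypothetical, and the two-proportion z-test shown is one common way, not the only way, to evaluate such a test:

import math
import random

def systematic_sample(visitor_ids, k):
    # Systematic sampling: pick a random start, then take every kth visitor thereafter.
    start = random.randrange(k)
    return visitor_ids[start::k]

visitor_ids = list(range(1, 100_001))
test_group = systematic_sample(visitor_ids, k=50)  # roughly 2,000 of 100,000 visitors

# Hypothetical results: the control saw the existing experience, the test saw the change.
control_conversions, control_visits = 410, 10_000
test_conversions, test_visits = 468, 10_000

p1 = control_conversions / control_visits
p2 = test_conversions / test_visits

# Two-proportion z-test using a pooled conversion rate
p_pool = (control_conversions + test_conversions) / (control_visits + test_visits)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / control_visits + 1 / test_visits))
z = (p2 - p1) / se_pooled
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided p-value

# 95% confidence interval for the difference in conversion rates
se_diff = math.sqrt(p1 * (1 - p1) / control_visits + p2 * (1 - p2) / test_visits)
low, high = (p2 - p1) - 1.96 * se_diff, (p2 - p1) + 1.96 * se_diff

print(f"lift={p2 - p1:.4f}  z={z:.2f}  p-value={p_value:.4f}")
print(f"95% CI for the lift: ({low:.4f}, {high:.4f})")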
Experimentation in ecommerce analytics is often executed through advanced testing and optimization, which are discussed in more detail in Chapter 8, “Optimizing for Ecommerce Conversion and User Experience.”
Here are four useful techniques to use when analyzing complex and large data sets:
• Discard outliers that you can prove are erroneous: A common rule of thumb in statistics is that an outlier is any observation more than two standard deviations from the mean. Techniques like the box plot can help you visualize outliers you have identified by applying descriptive statistical measures (see the sketch after this list). Because outliers can skew data to the left or to the right and generally pull the distribution in one direction or the other, you may want to remove them to focus on the center of the distribution. But be careful! Although this practice is useful in many cases where erroneous data is found, you don’t want to apply it in all cases. In ecommerce analytics, if you remove outliers without thinking about the implications or knowing they are errors, you may be throwing away the most important data. Remember that outliers and anomalies in data can have meaning! Outliers may be worthy of deeper analysis, but first you need to prove that they were not created by error. Don’t throw out outliers without carefully considering the implications for the analysis, and without being certain the data is in fact erroneous.
• Pick the best variables: In ecommerce analytics, there are so many data types, dimensions, measures, and values that choosing the best variables can be overwhelming. Every variable could be an independent one, so how do you select the right ones? One common approach is to use stepwise regression to determine which variables belong in the model. That said, stepwise regression is only as good as the candidate variables you feed it: garbage in, garbage out. Dimension reduction is also becoming a standard feature in data science tools.
• Don’t overfit models: Overfitting occurs when a model becomes too complex by including too many variables. As a result, the model yields questionable and, in many cases, inaccurate results. When you are creating a model, it is better to keep it as simple as possible, not as complex as possible. The principle of Occam’s Razor should be applied to ecommerce analysis: simplicity in analysis tends to create better outcomes and insights than complexity.
• Don’t let the model dictate the data; let the data dictate the model: As Tukey’s concept of EDA commands, the model should fit the data. You don’t want to just apply a model because it’s the one you know, it’s the easiest, it’s good enough, or because it is new and/or interesting. Make sure the model fits the data, not vice versa. Sure, the other way can work, but it’s not preferred.
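As referenced in the outlier item above, here is a minimal sketch of flagging candidate outliers, using hypothetical daily revenue figures and both the two-standard-deviation rule of thumb and the box-plot (1.5 × IQR) rule; flagged points should be investigated, not automatically discarded:

import pandas as pd

# Hypothetical daily revenue figures; the 9,800 day is a suspicious spike
revenue = pd.Series([1210, 1185, 1302, 1250, 1199, 9800, 1275, 1240])

mean, std = revenue.mean(), revenue.std()
# Flag points more than two standard deviations from the mean
two_sd_outliers = revenue[(revenue - mean).abs() > 2 * std]

# Box-plot (Tukey) rule: flag points beyond 1.5 times the interquartile range
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
iqr_outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]

print(two_sd_outliers)
print(iqr_outliers)
# Investigate flagged points before discarding them; they may be real, and important, events.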
The best analysts take the time to study the data and understand the relationships among the dimensions and measures, not only within the data itself but also against the business questions from stakeholders and the overall strategic business context. Data visualization before applied analysis is the right order of work for ecommerce analysis. The best practices and ideas presented in this chapter are suggestions that can be helpful, though your mileage may vary. Regardless, by focusing on the business questions, visualizing and exploring the data, and determining the best model and most appropriate set of analytical techniques, the analytical outcomes and insights resulting from your work are far more likely to be effective, useful, and profitable (Adams 2015).
Key performance indicators (KPIs) are metrics and ratios that represent important data to measure and track over time in order to understand performance against business goals. They are created from the basic and complex techniques, methods, models, and analysis discussed earlier in this chapter. KPIs are typically descriptive in nature: they quantify data already collected and put it into context by comparing it to past periods, such as month over month. KPIs can be segmented by cohort, customer segment, marketing campaign, and other rational business segments. KPIs often form the dependent variable in predictive models. KPIs can also be thought of as leading or lagging indicators of performance: certain KPIs may start to change before a key business event occurs, whereas other KPIs change only after an event occurs. Keeping track of what happens in a business, from promotions and marketing campaigns, to seasonality, to other impactful events, helps explain why KPIs change.
When creating KPIs, you want to ensure that their data definitions address three audiences: technical, operational, and business. Business goals should be identified for each audience, and dashboards created so that each KPI is tied to a business goal. Each dashboard should have a plan for communicating KPI analysis to its audience. The actions people are expected to take when a KPI changes, and the outcomes expected from movements up or down, should also be identified. Finally, creating KPIs involves mapping out the systems and integrations necessary to calculate, analyze, and report them.
In ecommerce there are many KPIs that represent behavior, events, interactions, conversions, orders, products, promotions, campaigns, costs, and revenue. The list of possible KPIs is large. Recommending a single “best practice” set of KPIs for every ecommerce business is difficult, because KPIs should be guided by business goals, which differ from business to business. That said, a few metric families are shared across nearly all ecommerce businesses: rates, averages, derivatives, percentages, cost- and revenue-per metrics, and other useful ecommerce quantifications, which are described next.
Page views are a simple metric that measures the number of times a page or screen was seen by a user. This metric is used to understand the popularity of content on an ecommerce site. As a standalone metric, it is almost entirely useless; I almost didn’t include it in this book, but it is a core metric for understanding what content on a site is actually seen by people. It helps to put page views into context by looking at the number of users and the number of sessions with more than one page view. The derivative KPI, page views per session or per user, begins to indicate how engaging your content is by tracking how many pages people view in one visit to your site. This KPI tends to be used in both executive and line-of-business KPI dashboards (although it can have limited utility).
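As a minimal sketch of the derivative KPI, assuming a hypothetical table of page-view events keyed by session ID:

import pandas as pd

# Hypothetical page-view events: one row per page view, tagged with a session ID
page_views = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s3", "s3"],
    "page": ["home", "search", "product", "home", "home", "product"],
})

views_per_session = page_views.groupby("session_id").size()
print("Page views per session (average):", views_per_session.mean())
print("Sessions with more than one page view:", (views_per_session > 1).sum())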
Another common metric, core to many digital analytics tools, is the visit or session. A visit or session is a set of interactions and events that occur during a given period within an ecommerce experience. When a person visits a web site or opens a mobile app and begins to look at pages and use features, a session is counted. The session lasts for a duration defined by the company’s data definition for visits or sessions; most commonly, a session ends after 30 minutes of inactivity. There are many challenges to accurate session measurement, which most businesspeople don’t care about but analysts find interesting. These nuances include tabbed browsing, restarting an expired session with a movement of the mouse, time and date crossovers (i.e., what happens when a visit lasts past midnight into the next day), and the inherent challenges of measuring time using cookies (i.e., the duration of a single-page visit and the time spent on the last page of a multipage visit are not counted). This KPI tends to be used in both executive and line-of-business KPI dashboards.
The number of returned items and orders is important to measure in ecommerce. Because returns have a cost associated with them, which erodes margin, it is important to track the total number of returns at both an overall order and an item level. The return KPI can also be expanded to include other metrics, like returned revenue, and the derivative cost per return. This KPI tends to be used in both executive and merchandising KPI dashboards.
Revenue is the sum of all the money collected from customers purchasing goods or services during a given time frame. The results of all your efforts, from site operations, user experience, marketing, buying, merchandising, analytics, conversion optimization, and management, are reflected in this important metric. Simply having an accurate count of revenue earned is an important goal. Keep in mind that third-party payment processing puts transactions through states of approval (authorized, charged, approved, settled, and so on), so your definition of revenue needs to be accurate and consistent across all channels where you sell. You want to avoid situations in which the revenue data is not defined identically. For example, if you want to track as a KPI the daily total of all purchases authorized on that day, then you need to make sure what you are reporting is authorized purchases. If you choose to report daily settled revenue as a KPI, that is an entirely different revenue metric. Although the idea of revenue is simple to understand, it has nuance when it is tracked in ecommerce environments. Be aware of the definition of revenue when you report, analyze, and reconcile it.
Gross margin is the amount, and percentage of the total, that remains when you subtract the cost of goods sold (COGS) from net revenue. This financial metric may be important enough to your business to track as a KPI, or another margin metric, such as contribution margin or net profit margin, may be tracked instead. The point of margin-based metrics is that they put revenue into context by adjusting it for costs. In this sense you can see the financial performance of the business before other costs are accounted for. Sales, promotions, and discounts all affect margin, so it’s important to analyze, model, forecast, and predict gross margin impact.
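The calculation itself is simple; a minimal sketch with hypothetical figures:

# Hypothetical figures for one period
net_revenue = 250_000.00
cost_of_goods_sold = 155_000.00

gross_margin_dollars = net_revenue - cost_of_goods_sold
gross_margin_percent = gross_margin_dollars / net_revenue

print(f"Gross margin: ${gross_margin_dollars:,.2f} ({gross_margin_percent:.1%})")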
Lifetime value is a measure of the predicted revenue and profit of a customer across the expected duration of their relationship with the ecommerce site. See Chapter 9, “Analyzing Ecommerce Customers,” for more information. Lifetime value (LTV) puts cost into context because, when modeled and analyzed correctly, it predicts the value of a customer over time. This prediction allows costs to be allocated such that profitability is maximized. By looking at lifetime value overall, within cohorts and segments, and for particular marketing campaigns, a business can fine-tune its spending to deliver maximal return. When this metric is tracked in a KPI dashboard, changes in LTV can be understood as they occur. This KPI is frequently used in executive and marketing KPI dashboards.
Repeat customers is a metric that counts how many of your total customers have visited your ecommerce site before. This measurement is related to recency and frequency (discussed later in this chapter). Repeat customers are brand-aware customers who likely have an intent to shop. By understanding and tracking the volume of repeat customers over time, you can use other data related to their past purchases and preferences to target them. If you consistently notice a downward trend in repeat customers in a business where customers tend to purchase frequently, there could be an issue. If repeat customers are trending upward but repeat purchases are not, you need to figure out why those customers continue to visit without purchasing again. For ecommerce companies that track the time between customer purchases, it is possible to know when a returning customer is near their next purchase and then target them appropriately. Managing one-to-one repeat-customer experiences at this level, with data for understanding, detecting, and targeting, starts with measuring the number of repeat customers you have coming back.
The concept of conversion, and the associated derivative, conversion rate, is one of the most enlightening KPIs in ecommerce analytics. Conversion is when a purchase occurs on your site. For example, in an ecommerce experience, conversion can be defined as “ordering a product.” Overall conversion occurs when an unknown visitor becomes a customer, who may or may not be known; in the case of step completion, discussed later in the text, conversion may occur when an event occurs after a user completes a series of predefined steps.
The key takeaway regarding the digital concept of conversion is that it occurs when a person does something that the owner of the ecommerce experience thinks is valuable—and thus creates material, financial business value in some way. In that context, there are several mathematical definitions for conversion.
For example, conversion can be measured on a visitor, visit, customer, or audience basis. Various camps exist about the usefulness of each denominator. Visitor measurement has inherent accuracy challenges due to cookie deletion and externalities of the Internet (such as cookie blockers), social media, and mobile. Some analysts argue that “visits” make a better denominator for conversion than “visitors” because a visit represents a unique opportunity to convert, whereas the visitor metric can include more than one visit. Other people argue that conversion should be based on unique people, not visitors or visits. While the discussion can be academic, it has implications. Due to the technical and conceptual challenges of measuring and understanding all the derivatives of conversion, visit conversion is most commonly measured, followed by visitor, then customer, then person.
Regardless of your preference for the denominator in a conversion calculation, the larger point is that conversion occurs when a person does something that is considered valuable. As such, conversion and movements in conversion rates can be tied to financial measures, such as revenue and profitability.
Step completion rate, sometimes called micro conversion or even waypathing, is similar to conversion in that a step is a transitional point in an ecommerce experience that is part of a conversion flow. Moving through a shopping or checkout experience from one page to another is a step. When a person completes the final step, the conversion is tracked.
Step completion is best understood through an example. Say an ecommerce site’s goal is to sell products. Products are sold via orders. To get to the order page, a person must access a landing page (like the home page), search for a product, view the product page, and complete an order. The conversion steps would be as follows:
1. View a landing page.
2. Search for a product.
3. View a product page.
4. Order the product.
CONVERSION = View the order thank-you page.
In the example, steps 1 through 4 begin with the customer arriving on the landing page and then taking the next steps to complete the purchase. Step completion measures how many visits or visitors complete each step in the path. Drop-off at each step is quantified and known. Each step has an associated completion rate. The worst-performing steps can be tested and optimized, which helps the overall conversion rate.
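A minimal sketch of the calculation, using hypothetical visit counts for the four steps above:

# Hypothetical visit counts at each step of the conversion flow described above
step_visits = {
    "1. View a landing page": 50_000,
    "2. Search for a product": 22_000,
    "3. View a product page": 14_500,
    "4. Order the product": 2_900,
}

steps = list(step_visits.items())
for (name, visits), (_, next_visits) in zip(steps, steps[1:]):
    completion = next_visits / visits
    print(f"{name}: {completion:.1%} continue, {1 - completion:.1%} drop off")

overall_conversion = steps[-1][1] / steps[0][1]
print(f"Overall conversion from the landing page: {overall_conversion:.1%}")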
An abandoned cart occurs when a person adds one or more products to a shopping cart and then doesn’t buy them. Cart abandonment can be measured on a visit basis: the number of visits with an abandoned cart is divided by the number of visits in which a cart was created. This rate represents the percentage of carts left with items unbought. The metric gets complicated to measure when you consider that users can create an account, add products to the cart, and buy them later. The cart items are not necessarily tied to a cookie; they can be stored on the back end in a persistent cart, ready to be bought when the user comes back. Thus, it’s difficult to say whether carts abandoned by known users are really abandoned or just delayed. Use cases like these make abandonment measurement nuanced. That said, many companies define abandonment as occurring whenever a session ends with unbought products in the cart. Regardless of your definition, by tracking abandoned carts you can recognize how much revenue is being abandoned and then implement programs to recapture it.
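A minimal sketch of a visit-based abandonment rate, assuming hypothetical counts, one cart per visit, and the session-ends-with-unbought-products definition described above:

# Hypothetical counts for one day
visits_with_cart_created = 8_400
visits_ending_in_purchase = 2_350

abandoned_carts = visits_with_cart_created - visits_ending_in_purchase
abandonment_rate = abandoned_carts / visits_with_cart_created
print(f"Abandoned carts: {abandoned_carts}  Abandonment rate: {abandonment_rate:.1%}")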
One of the standard metrics in ecommerce experiences in which a purchase occurs is average order value (AOV). Simply constructed, AOV is the total revenue from orders divided by the number of orders: the average amount spent per order. It is easy to calculate and informative. Even small sites that sell only one product benefit from tracking AOV. After all, AOV helps to identify inventory and purchasing trends as well as influence marketing, advertising, and promotions. Related to AOV is median order value (MOV), which is less common. MOV is useful where ecommerce transactions produce order values with a large range, because extreme orders skew the average.
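A minimal sketch with hypothetical order totals shows why AOV and MOV can tell different stories:

import statistics

# Hypothetical order totals for one day
order_values = [24.99, 51.00, 38.50, 410.00, 29.99, 47.25]

aov = sum(order_values) / len(order_values)
mov = statistics.median(order_values)
print(f"AOV: ${aov:.2f}  MOV: ${mov:.2f}")
# The single $410.00 order pulls AOV well above MOV.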
Bounce rate is a visit-based metric that identifies the percentage of single-page visits on a landing page. When a visit begins on a page and the visitor does not view another page, a bounce is said to have occurred. A lot has been written about “bounce rate,” so much that I thought of excluding it. My friend and colleague Avinash Kaushik, whose books I encourage you to read, has a memorable definition of bounce rate: “I came, I saw, I puked.” In other words, bounce rate measures the percentage of people or visits that came to your site and immediately left without doing anything. They looked at one page, decided the site wasn’t helpful to them, and left. In mobile applications, when a user sends the app to the background right after opening it, a bounce has occurred. Bounce rate is most typically associated with landing pages, such that the rate measures how many people didn’t view another page after starting their visit on that landing page; any entry page on an ecommerce site has a bounce rate. Related to bounce rate is exit rate, which measures how frequently a particular page is the last page viewed before the visitor leaves the site. Exit rate is page-based, whereas bounce rate is visit- or session-based. That nuance is important to understand.
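A minimal sketch of both calculations, assuming a hypothetical table of page views ordered within each session:

import pandas as pd

# Hypothetical page views, in viewing order within each session
pv = pd.DataFrame({
    "session_id": ["s1", "s2", "s2", "s3", "s3", "s3"],
    "page": ["home", "home", "product", "home", "search", "product"],
})

pages_per_session = pv.groupby("session_id").size()
entry_pages = pv.groupby("session_id")["page"].first()
exit_pages = pv.groupby("session_id")["page"].last()

# Bounce rate for the "home" landing page: single-page sessions that entered on it
entries_on_home = entry_pages == "home"
bounces = entries_on_home & (pages_per_session == 1)
bounce_rate = bounces.sum() / entries_on_home.sum()

# Exit rate for the "product" page: share of its views that were the last page of a session
product_views = (pv["page"] == "product").sum()
product_exits = (exit_pages == "product").sum()
exit_rate = product_exits / product_views

print(f"Home bounce rate: {bounce_rate:.0%}  Product exit rate: {exit_rate:.0%}")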
This derivative KPI identifies what percentage of orders have a promotion or discount applied to them. Track discounted orders and their volume and also measure the gross margin of orders to add more color to understanding promotional effectiveness. Of course, if you can track promotional discounts at the order level, you can consider tracking them at the item level in each order to understand the impact of discounts.
The measurement of inventory turnover is very important for ecommerce companies that maintain inventory. Quite simply, this measure lets you know how many times during a given period the site’s inventory can be expected to sell out and be replaced. For more information, see the discussion in Chapter 10, “Analyzing Products and Orders in Ecommerce.”
Return on investment (ROI) is a standard business calculation in which cost of goods sold is subtracted from revenue and the result is divided by cost of goods sold:
ROI = (Revenue − Cost of Goods Sold) / Cost of Goods Sold
This metric indicates the financial impact of a particular investment you have made, and it is general enough to be applied to almost any business activity. For example, a set of conversion tests has a cost and a demonstrable revenue impact; thus it has an ROI. Marketing activities have a cost and a return and can be held to an ROI, as can the management of promotional and merchandising activities. Any business activity with a cost that can be tied to revenue can have an associated ROI calculated.
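A quick worked example with hypothetical figures:

# Hypothetical figures: $18,000 in revenue against $12,000 in cost of goods sold
revenue = 18_000.00
cost_of_goods_sold = 12_000.00

roi = (revenue - cost_of_goods_sold) / cost_of_goods_sold
print(f"ROI: {roi:.0%}")  # 50%, i.e., $0.50 returned for every $1.00 spent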
Customer loyalty is what businesses with short, repeat purchasing cycles (toothpaste, for example) strive to create. Even businesses with longer cycles between new purchases (durable goods such as washing machines and home windows) benefit from loyal customers. One way to measure loyalty is with a concept derived from traditional marketing named recency.
The concept of recency is simple to understand. It is the time since the last visit or purchase by a customer. As a time-based metric and one tied to individual customers, it is a metric most easily measured in experiences and transactions in which the visitor is known via login, registration, unique ID, full name, or some other identifier. In environments with more anonymity, recency is identified on a segment level for identifiable customer segments—and at an object or event level; for example, time since the last download in a mobile application. Recency can be a helpful metric for tracking how loyal your customers are to your brand, products, or services.
Retention is another common concept from traditional marketing that is reused in ecommerce analytics, and it is measured in part through frequency. Frequency refers to how often a known person, or an anonymous or mostly anonymous person, comes back to an ecommerce experience she has visited previously.
Frequency, like recency (previously discussed), is a time-based measure. As such, in ecommerce experiences in which people are identified by some mechanism, time-stamping when a person last came to the site, and when she most recently returned, makes frequency straightforward to measure. Frequency in anonymous or mostly anonymous environments, such as those dependent on browser cookies, is harder to pinpoint. Cookie deletion and the inability to persist an association between one cookie and another over time can impact the accurate calculation of frequency.
Frequency is important to track in businesses for which repeat visits are important. For example, for news sites or social media sites, a decrease in frequency could indicate an issue with content relevancy. On ecommerce sites, the cause of an increase in frequency around a particular product or by a particular customer is something to investigate.
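A minimal sketch of computing recency and frequency per customer, assuming a hypothetical purchase history table and defining recency as days since the last purchase and frequency as the number of purchases in the period:

import pandas as pd

# Hypothetical purchase history: one row per purchase, keyed by customer
purchases = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c2", "c3"],
    "purchase_date": pd.to_datetime([
        "2016-05-01", "2016-06-20", "2016-04-15", "2016-05-30", "2016-06-25", "2016-03-02",
    ]),
})
as_of = pd.Timestamp("2016-07-01")

recency_days = (as_of - purchases.groupby("customer_id")["purchase_date"].max()).dt.days
frequency = purchases.groupby("customer_id").size()

print(pd.DataFrame({"recency_days": recency_days, "frequency": frequency}))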
In the section heading above, X is meant to be some measure from a source named N—for example, the percentage of customers from paid search or the percentage of revenue generated from marketing campaigns. As such, the abstraction of the percentage of something X from some source N is helpful to apply to the concepts in ecommerce analytics: distributions of the sources where people come from when they begin a site visit or the percentage of revenue from different brands, product categories, and so on.
The many ways in which a person enters a digital experience can be tracked in percentage terms and against the key metrics you want to segment by source. As a result, you can derive KPIs such as percentage of visitors from online advertising, or percentage of profit by marketing campaign. Percentages as a KPI are widely used in analysis.
Percentage of new customers is a helpful KPI for sites that want to measure customer growth rate or market share: for example, the percentage of new customers from search, the percentage of repeat customers from display advertising, the percentage of customers visiting on both mobile app and desktop, and so on. Site owners want to know the percentage of new customers overall, over the past 12 months, or year over year.
The highest value of a KPI comes when it can be tied directly to a financial metric. In advertising, the “cost per” metrics are numerous and their usage is widespread and well understood. The most common advertising-based “cost per” is CPM, the cost per thousand impressions (in the context of display advertisements). Ecommerce analytics uses “cost per” metrics in similar ways. The cost can be tied to any object in the ecommerce analytics data model, such as cost per visitor, cost per conversion, cost per action, cost per lead, cost per engagement, cost per Facebook like, or cost per user-generated tweet. These metrics may be too vague on their own, so segmentation or further derivation can be helpful, as are time-series comparisons of the cost metrics. The analog, revenue per visitor, is calculated by dividing total revenue by total visitors. Subtracting a “cost per” metric from the comparable “revenue per” metric can be financially insightful. It is also common to see “average cost per” metrics in use. Revenue per customer is discussed next.
Revenue-based metrics are the counterpoint to the cost metrics reviewed in the preceding section. The ultimate link to the business is made when KPIs are joined with financial data. Helpful insights emerge when ecommerce analytics teams bring the “revenue per” data together with the “cost per” data. Thus, a useful KPI for measuring business performance is the “revenue per” metric, which indicates how much money was generated. The most common usages are revenue per customer, revenue per product, and revenue per product category. Also related are “revenue per customer segment X or Y” and derivatives such as “revenue per new customer” or “revenue per repeat customer.”
If your KPI strategy can execute to the level where “cost per X” and “revenue per X” KPIs are known, such as the “cost per paid search campaign” and the “revenue per paid search campaign,” then you can calculate the “gross margin per paid search campaign” and “net profit per paid search campaign.” Taking this a level deeper, the analytics team could tell you the “profit per keyword in search campaign X” and provide related comparative and time-series views of the KPI. As you may conclude, the power of such insights in transforming the profitability of a business using ecommerce data and derivative KPI analysis cannot be overestimated.
A very important derivative KPI is cost per customer acquisition. This KPI shows how much money it takes to generate one new customer. It’s important because it shows the effectiveness of your marketing spend and other customer acquisition strategies and tactics. When compared to lifetime value, you want the lifetime value to be higher than the cost to acquire the customer. For more information, see the discussion in Chapter 6, “Marketing and Advertising Analytics in Ecommerce.”
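A minimal sketch with hypothetical quarterly figures, comparing cost per customer acquisition to an LTV estimate taken from your lifetime value model:

# Hypothetical quarterly figures
acquisition_spend = 90_000.00    # marketing and other acquisition costs
new_customers = 1_200
average_lifetime_value = 310.00  # from your LTV model (see Chapter 9)

cost_per_acquisition = acquisition_spend / new_customers
ltv_to_cac_ratio = average_lifetime_value / cost_per_acquisition
print(f"CAC: ${cost_per_acquisition:.2f}  LTV-to-CAC ratio: {ltv_to_cac_ratio:.1f}")
# You want lifetime value comfortably above the cost to acquire the customer.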