So far, we’ve spoken only in general terms about “data.” Before we dive deeply into exploring the topic of experimentation, we want to take a step back and talk in a more nuanced way about data. We believe that most people you ask would agree that the statement “there are 7.125 billion people living in the world” (at least at the time of publication) contains a data point. But what about “green vegetables served at camp dinner never run out”? Is that also a data point? Is that data?
This section is structured as a series of questions: the why, when, how, and how much of data collection. These questions frame the dimensions we introduce as questions you might ask yourself as you think about what type of data to collect.
Compared to behavioral data, attitudes and emotions can be harder to measure without introducing bias. One common issue when collecting attitudinal data is that users often want to give the “right answer,” so they’ll tell you what they think you want to hear rather than what they actually believe. This is known as social desirability bias (a close cousin is acquiescence bias, the tendency to simply agree with whatever is asked), and there are many techniques for overcoming these effects. Despite the difficulties, though, attitudinal and emotional data is essential to giving users a good experience. Even if every user clicks a new button, if it doesn’t do what they’re hoping, they’ll be disappointed and lose trust in your product and brand.
Data can be broken down into qualitative and quantitative data to answer different types of questions. Qualitative data uses narrative to answer questions such as “Why?” or “How come?” It can be observed but not measured numerically. In the design process, qualitative data can help build empathy for users, and can inform your understanding of their attitudes, beliefs, values, and needs. Comparatively, quantitative data expresses observations through numbers and measurement. Quantitative data is valuable when trying to answer “How many?” or “How few?” In the design process, you might use quantitative data to measure the impact on certain metrics, such as daily active users (DAU) or user retention rate (the percentage of your users who continue to use your service across two defined periods of time—for example, across two business quarters).
Finally, depending on whether you want to be able to probe deeper into the data or need to determine in advance what you’re going to learn, you could choose to collect moderated or unmoderated data. When collecting moderated data, such as in an interview, a researcher asks questions or observes; if something is interesting or confusing, they’re able to ask follow-up questions about why a user took a particular action. Comparatively, in an unmoderated method such as a survey, if a piece of data is interesting or unclear you’re not able to go deeper into what was meant or intended. Moderated data requires the resources for a person to observe or interview each participant; in exchange, you’re able to learn more and can clarify confusion in how you ask your questions. Unmoderated research, by contrast, requires a greater upfront investment (survey questions must be written with great care and expertise to avoid introducing systematic bias or confusing your user, as nothing can be clarified later!), but because nobody needs to sit through each research session or activity, it’s easier to collect data at scale with unmoderated methods. Additionally, unmoderated methods ensure that every research participant gets the exact same research experience; they are not subject to variations in researcher behavior (such as asking questions using different wording, or in a slightly different order).
Imagine there’s a rug at work near your desk. One day, you see someone trip on a bump in the rug. Would you really wait until 10 or 100 more people tripped on that same bump before fixing it? Probably not. This is similar to how many people treat software bugs: if you observe a bug on a few software/hardware configurations, you don’t know how many people will face it, but you have an indication that it is a problem, and one you probably should fix. This same principle applies to usability issues in design. Research with few participants (often called “small sample research”) is perfect for identifying these types of problems, because you don’t need to quantify exactly how many people in the population will share that confusion to know it’s a problem with your design; for instance, Figure 2-1 shows that you can identify more than 85% of usability issues with only five participants. If you spend time with fewer people, you can also collect richer and deeper data in the same period of time. However, for certain types of small sample research, you can’t guarantee that the findings will generalize to everyone in your population. You also can’t quantify with a high degree of accuracy how many users in the population are likely to experience a problem, or feel the same way as your participants. This means research with just a few people isn’t good for making decisions where you have to be confident about the frequency of a problem, for instance.
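The roughly 85% figure in Figure 2-1 comes from a simple discovery model often attributed to Nielsen and Landauer: if each participant independently uncovers a given issue with probability p, then n participants together uncover about 1 − (1 − p)^n of the issues. As a rough sketch (the p = 0.31 value is the commonly cited assumption from that research, and the function name is ours), the arithmetic looks like this:

```python
# Problem-discovery model: share of usability issues found by n participants,
# assuming each participant independently surfaces a given issue with
# probability p_single. The 0.31 default is the commonly cited assumption.
def share_of_issues_found(n_participants, p_single=0.31):
    return 1 - (1 - p_single) ** n_participants

for n in (1, 3, 5, 10):
    print(n, round(share_of_issues_found(n), 2))
# With p_single = 0.31, five participants uncover roughly 84-85% of issues;
# each participant beyond the fifth adds progressively less new information.
```

As footnote [4] cautions, this model assumes a fairly homogeneous sample and a particular base rate of issues, so treat the 85% number as a rule of thumb rather than a guarantee.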
Comparatively, data collected from many participants (often called “large sample research”) can give you more precise quantity and frequency information: how many people feel a certain way, what percentage of users will take this action, and so on. In an ideal world where you had unlimited resources, you might find it better to always collect more data rather than less. This would ensure that you learned everything possible. However, you may not have the time to do this kind of research. Generally, the larger your sample, the more sure you can be that your findings will generalize to the population (so long as the sample is representative, which we will discuss later). There are statistical methods you can use to determine how many users you need to collect data from to reach a certain degree of confidence in your findings. We won’t be getting into the details of this here, but we recommend reaching out to your analyst or data scientist friends to talk about the relationship between sample size and statistical power if you want to learn more.
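To make the sample size and power relationship a bit more concrete, here is a rough sketch for the common case of comparing two conversion rates, using the standard normal approximation. The baseline rate, lift, significance level, and power below are illustrative assumptions, not recommendations; your analyst or data scientist may well use a different (and more precise) calculation.

```python
# Approximate per-group sample size for detecting a difference between two
# proportions (e.g., sign-up rates) with a two-sided test.
from scipy.stats import norm

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_power = norm.ppf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2

# Detecting a small lift (10% -> 11%) needs roughly 14,000-15,000 users per
# group, while a larger lift (10% -> 15%) needs only around 700 per group.
print(round(sample_size_per_group(0.10, 0.11)))
print(round(sample_size_per_group(0.10, 0.15)))
```

The point to notice is how quickly the required sample grows as the effect you care about shrinks.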
In our camping example, let’s say we were trying to learn more about which advertising causes more campers to sign up. We might conclude that the successful sales of a magazine containing a camp advertisement caused increased enrollment at camp (Figure 2-2).
In fact, it could be the case that an improvement to the overall health of the economy led to both the successful magazine sales and increased enrollment at your camp, as economic health could lead to more disposable income in families for spending on both magazines and summer camps (Figure 2-3).
The power of A/B tests and experiments is that they provide controlled environments for us to understand why something happened; in other words, they let us establish causality. This is important to designers because by understanding the underlying causes of behavioral effects, we can make informed decisions about what will happen if we make a product or design change. This also lets us understand with accuracy and confidence how our decisions cause changes in our users’ behavior. Furthermore, we can protect ourselves against the very human tendency to see patterns in data and behaviors that confirm what we already think (what psychologists call “confirmation bias”), and mitigate the risks of investing time and company resources in assumptions that aren’t proven.
That said, there are different ways to define “meaningful.” Rigorous qualitative methodologies are undoubtedly meaningful sources of evidence, and are essential in making good product decisions. One way to ensure that your data is meaningful is by designing good research—for instance, by asking well-thought-out questions that are not biased, bias-inducing, or leading. User researchers, for instance, are trained experts in doing this type of work.
When thinking about statistical significance and its relationship to whether something is meaningful, Arianna says:
As we have noted, A/B tests are great for identifying statistically significant results, giving you confidence that you observed a true effect rather than something that happened by chance. Aside from it being very exciting and satisfying to see your work pay off with a statistically significant result, why should you care about significance in experimentation?
This ability to make data-aware choices about what will happen is extremely valuable. Your company can save time and resources by investing further in projects that perform well, while redesigning, rethinking, or pivoting away from ideas that perform poorly or don’t elicit the intended user behavior. In addition to these business advantages, A/B testing allows designers to quantify the value of their work on the user experience or their company’s bottom line. This is important because it helps designers articulate why investing in and prioritizing good design is important to their stakeholders and their business. We believe that understanding and speaking the language of data-aware methodologies like A/B testing empowers designers to argue that investing in good design is measurably, not just philosophically, critical to a business’s success.
In the words of Colin McFarland,[5] “an experiment is a means of gathering information to compare an idea against reality.” In some sense, you’ve probably been running experiments your whole life: maybe the last time you baked brownies, you substituted egg whites for whole eggs. You tracked how quickly your family and friends ate the brownies, and compared it to the rate of consumption the last time you made the brownies using whole eggs. This scenario seems basic (and probably resembles things you’ve actually done), but it contains the basic building blocks of every experiment: a change, an observation, and a control (Figure 2-4).
In an experiment, you make some change (using egg whites), and measure that change against a control (using whole eggs, the default recipe), to observe whether the change has had an impact on the rate of consumption. The intuition here is that you are comparing two or more things that are almost the same: an experimental group and a control group. The only difference between these groups is the change that you deliberately made; in other words, your experimental group is basically your control group + some change. Therefore, if you observe a significant difference between these groups in a well-designed and well-controlled experiment, you can conclude that the difference was most likely caused by the change that you made (rather than just being the result of random variation), because that was the only difference between the two groups. This is how experiments establish causality.
In designing an experiment, you make a hypothesis, or a testable prediction, about what the effect of your change will be. Based on what you observe, you determine whether or not to reject your hypothesis. Rejecting a hypothesis generally means that you assume that your change did not have the intended or expected effect. This is the most important part of an experiment, because determining whether or not you reject your hypothesis is tangible learning. The goal of every experiment (just like the goal of using data in general) should be to learn something. Given the importance of hypotheses, we’ll give a more in-depth treatment to this concept later in the chapter, and in subsequent chapters as well.
Now that you understand the setup of an experiment, we’ll use a basic example from our summer camp metaphor to help introduce more granular language about experiments. Imagine that as part of your summer camp, you have a hike to another campsite deep in the woods. This is a bonding activity for your entire camp and a great source of exercise. You’re realizing that as you expand the fun activities and offerings at your camp, you need to shorten the time it takes for kids to do this hike, so that you have time for other activities in your summer camp programming. You decide to treat this hike as a bit of an experiment, by varying some factors while controlling for others, to learn about what equipment makes kids faster at the hike. This way, you can invest in equipment for the whole camp that is effective, instead of spending money on equipment that isn’t useful for your campers.
Let’s assume that campers are assigned to a cabin based on their age; that is, all campers in a cabin are similar in age. We can also assume that there are eight campers per cabin, and we’re trying to build four groups for our hike. We could sample campers by assigning two campers from each cabin to each of the four groups, resulting in groups that have a range of camper ages. So long as there’s no systematic bias in the two campers who are assigned to each group from each cabin, all groups should be representative of the whole camp.
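A minimal sketch of that assignment scheme might look like the following; the number of cabins and all of the names are made up for illustration.

```python
# Stratified assignment: each cabin of eight campers contributes two campers
# to each of the four hike groups, so every group has the same age mix.
import random

cabins = {f"cabin_{i}": [f"cabin_{i}_camper_{j}" for j in range(1, 9)]
          for i in range(1, 7)}                  # assume six cabins of eight campers

groups = {1: [], 2: [], 3: [], 4: []}
for campers in cabins.values():
    shuffled = random.sample(campers, len(campers))      # random order within a cabin
    for group_id in groups:
        groups[group_id].extend(shuffled[(group_id - 1) * 2 : group_id * 2])

for group_id, members in groups.items():
    print(group_id, len(members))                # 12 campers per group (2 x 6 cabins)
```

Randomizing within each cabin is what guards against the systematic bias mentioned above; if counselors hand-picked which two campers went where, the groups could easily end up unbalanced.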
Now, let’s say that your experiment is to vary the type of equipment that each group gets:
| Group | Equipment Received   |
|-------|----------------------|
| 1     | Map (control group)  |
| 2     | Map, compass         |
| 3     | Map, GPS             |
| 4     | Map, protein bars    |
People have been running experiments formally and informally for a long time. Back in the 1700s, a British naval surgeon named James Lind ran an early controlled experiment after observing that sailors on Mediterranean ships who received citrus as part of their rations had lower rates of scurvy than other sailors. He gave half his crew citrus fruit as part of their rations (experimental group: diet + citrus fruit), while the other half just ate the regular diet (control group: diet only). He found that compared to the control group, the experimental group had a significantly lower rate of scurvy, leading him to conclude that citrus fruits prevent scurvy. He wrote:
The most sudden and visible good effects were perceived from the use of oranges and lemons; one of those who had taken them being at the end of six days fit for duty.... The other was the best recovered of any in his condition; and being now deemed pretty well, was appointed nurse to the rest of the sick.[6]
The digital age has radically changed the speed and scalability of experimentation practices. Not too long ago, the primary way that you shared photos with someone was much more complicated: first, you would have to load your camera with film and then take a series of snapshots. When your film roll was done, you’d take that film to the local store where you would drop it off for processing. A few days or a week later you would need to pick up your developed photos, and that would be the first time you’d be able to evaluate how well the photos that you took many days prior actually turned out. Then, maybe when someone was at your house, you’d pull out those photos and narrate what each photo was about. If you were really going to share those photos with someone else, you’d maybe order duplicates and then put them in an envelope to mail to them—and a few days later, your friend would get your photos as well. If you were working at a company like Kodak that had a vested interest in increasing people’s use of their film, processing paper, or cameras, and you were asked to collect insights about customer behavior to drive designs that would increase usage, there would be many aspects of customer behavior and parts of the experience that would be hard to accurately measure. You’d also have almost no way to collect insight into your customers’ behaviors and actions along the way, and it would be even harder to conduct experiments to compare different product offerings. How would you observe the different rates of photo taking or sharing with different kinds of cameras, or the performance of real users’ photos with different kinds of film? It would be extremely challenging to track down that kind of data, especially outside of a customer experience research lab and in real (and therefore imperfect) contexts.
Now let’s take the same example of sharing a photo in the digital world. Your user will take out their phone, open the camera app, and take a photo. They may open up Instagram, apply some filters to the photo, and edit it on the spot before adding a caption and then sharing it. They might also choose to share it on different channels, like Twitter, Facebook, or via email. The entire experience of sharing a photo has been collapsed and condensed into one uninterrupted flow on a single screen, one that you can hold in the palm of your hand. And because all of this is digital, data is continuously being collected along the way. You have access to all kinds of information that you wouldn’t have had before: location, time spent in each step, which filters were tried but not used, what was written about the photo, and to whom the photo was sent. In addition, you’re not limited to observing just one user—you can gather this information from each and every user. You can make changes to your interface, filters, flow, or sharing options and observe—at scale—how real users respond as they take real photos of real moments in their life. And, because the data logging can be made automatic and seamlessly integrated into your experience, you can amass huge amounts of data quickly.
This example illustrates the power of digital interfaces for data collection. Although experimentation has been around a long time, the internet enables us to collect large amounts of data about our users cheaply and quickly. Many internet companies invest in internal tooling to enable anyone in the company to deploy an A/B test and collect results in a matter of days or weeks, helping them make business-critical design decisions in real time, as a fundamental part of their existing product development processes. For instance, Skyscanner has “Dr Jekyll”,[7] LinkedIn has “XLNT”,[8] and Etsy has “Catapult”[9] for this purpose. Imagine how much more effort it would have taken Kodak to learn enough about their users to make reliable decisions in the pre-internet age. For these reasons, now more than ever is an exciting time to embrace designing with data. A/B tests—or online experiments—are one powerful, versatile way to do that.
As we have been discussing, A/B tests are essentially online experiments. The concepts—making a change and measuring its effect compared to a control group—are nearly identical. However, over time, A/B testing has adopted a language of its own that is more closely aligned to existing business terms. Here, we’ll map the general concepts we introduced onto the A/B testing-specific terminology that you’re more likely to hear in your business context. We’ll also get a bit more specific with a few additional concepts that are practical and important to understand, such as statistical significance.
When you’re looking to learn more about your users through data, one of the first questions you need to ask is which users to gather data about. Doing research with the right group of users is really important and will be a factor in how you interpret your results. Your user base is probably quite varied. By subdividing your user base into either cohorts or segments, two ways of thinking about how to split your users, you can gain different insights into their behaviors or motivations that you wouldn’t have seen if you had just considered them as one large group.
For example, perhaps in January you get a lot of visitors to your service who are coming because they got mobile phones for Christmas. These people might be different or have different motivations than the people who might sign up for your product at other times of the year. Applied to our summer camp metaphor, one cohort might be first-time campers in summer 2016. The types of activities you ran that summer, as well as the advertising you focused on prior to that year, will define their baseline and expectations for summer camp.
For instance, according to former Chief Product Officer John Ciancutti, online learning and course site Coursera has several different segments that they consider when building products: lifelong learners, seasoned professionals, and unseasoned professionals. These different segments have unique needs, approach the product in different ways, and may be more or less likely to pay for Coursera’s offerings. He said:
For instance, perhaps you decide to focus on a cohort of campers who attended your summer camp for the first time in 2015. This might help you learn meaningful insights about campers who are very similar to that cohort—for instance, those insights might extend to other middle school–aged campers from similar home backgrounds, as your camp only accepted middle schoolers in 2015, and you advertised primarily to suburban neighborhoods near New York City. However, the data you gather if you do research with this cohort would not necessarily apply to other potential future campers, such as whole families (if you became a family camp), high school–aged campers, or campers from the West Coast or other countries, because those perspectives weren’t reflected in your original cohort. As you can see, then, the sample(s) you focus on in any given A/B test will determine the population(s) to which your insights will apply; you should confine insights and generalizations to the groups of which your sample is representative.
For example, online accommodations rental service Airbnb answered some of these questions by sending some of its top experience researchers directly to the population in question: Japanese superhosts. The company was curious about why the number of home listings in Tokyo was so low relative to the city’s large population. The team did ethnographic research with some of Tokyo’s Airbnb hosts to understand which demographics these hosts represented, what their values were, and what made them different from people in Tokyo who didn’t host. Airbnb’s researchers found that although the hosts in Tokyo seemed very different at face value, all of them were “outliers” in the sense that they had a positive defining experience with outsiders. Unlike many Tokyo residents, these superhosts were willing to share with outsiders and bring foreigners to Japan, making them ideal early adopters of Airbnb in Tokyo.[10]
As you’re trying to learn more about your users, not all of these questions will be relevant, but hopefully you can see how getting some of this information and data will shape the way you design for your customers. It’s also rare that your experience won’t need to adapt and change as your user base evolves and grows over time. For this reason, it’s also important to remember that gathering data and understanding your users is an ongoing effort.
For most products and design decisions, you’re probably already thinking beyond your existing users toward the acquisition of new users. Data can help you learn more about both your existing users and prospective future users, and determining whether you want to sample from new or existing users is an important consideration in A/B testing.
Existing users are people who have prior experience with your product or service. Because of this, they come into the experience with a preconceived notion about how your service or product works. This learned behavior can influence how they think, what they expect, and how they experience new features that you introduce to your product or service, which is an important consideration when testing new designs in front of existing users. Compared to existing users, new users do not have experience with your product. If you’re trying to grow your business, you might be more interested in learning about new users because they aren’t predisposed to your current experience.
To illustrate the difference between new and existing users, imagine that you’re going to make some changes to the layout of your camp during the off-season, by moving the outhouse closer to the dining hall. The previous layout of your summer camp is shown in Figure 2-5.
After you move the outhouse, you observe that returning campers from cabin 3 take a much longer path to get to the outhouse, while new campers in cabin 3 take a more direct path. This makes sense; they are basing their navigation on their past experiences. Returning campers need to overcome the learned behavior about how to get to the old outhouse location by using the road; they have engrained habits that lead them to walk that direction whenever they need to use the toilet. By contrast, new campers lack these old habits about where the outhouse used to be, and therefore can walk to the new location more directly through the other cabins. Figure 2-6 demonstrates the differences between returning and new campers.
Thus, it’s important to be careful about whether your test is with new or existing users, as these learned habits and behaviors about how your product used to be in the past could cause bias in your A/B test.
In addition to learned habit effects, you also need to be thoughtful about demographic differences between your existing users and folks who might become your users in the future. For instance, your existing user base might have a different demographic bias than potential new users. If your initial offering had high traction with tech-savvy or younger audiences, then it’s likely that any sample of existing users will actually have disproportionately more representation from younger and more tech-savvy folks than the average population. It’s good to ask yourself if your original customers are representative of the kinds of people that you want to be your customers one year from now. Will you continue to target tech-savvy people or are you hoping to gain audience share by moving toward a more mainstream, less tech-savvy population?
Broadly, a measure is anything you observe, capture, and count. Examples of measures might be the number of users that visit a page on your website or the number of people that successfully complete a process. A metric is a predetermined, evaluative benchmark that has been judged to have some business value. Metrics are sometimes the result of comparing several measures, often as a ratio. Metrics are used because they can tell a compelling story about the health of your business or your designs. Acquisition, retention, and activation rates are all examples of metrics.
At Etsy [defining key metrics] is less controversial than it would be at many companies. Etsy is a business selling things, so the metric that we can optimize for is how much money we can make. The other way that we were lucky is that those metrics directly correlate to our users’ interests. The point of Etsy is to have other people sell stuff, and we make money when all those people sell stuff. When we sell more stuff, our sellers are happy and we are also happy. Those are the things we at Etsy cared about.
We won’t go into detail about how these different factors affect the metrics you should be thinking about here, but a great starting place is to ask around and find out what metrics your company is already measuring. As John shared, metrics that track time or money are often intimately connected to your business. Many companies will track key metrics even outside of A/B tests. For instance, you might be interested in how many “engaged” users you have. A basic engagement measure is the Active User (AU). The idea is to capture how many people use your product or service on a daily or monthly basis. Business reports often include summaries of Daily Active Users (DAU) and Monthly Active Users (MAU), potentially across many categories if the nature of the business is complex. To Wikipedia, an AU may be someone who contributed to more than one article. According to the Wall Street Journal, Twitter considers a user active if they log in once a month. For a social platform, an active visitor might be someone who has come back to the platform at least once within 30 days. For an ecommerce platform, actively browsing on 2 days out of 7 may be what the business considers success. For a news media outlet, engaging with stories once a day may be sufficient.
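As a simplified sketch of how such a metric gets computed, imagine a small event log of user activity. The “any logged event on a given day counts as active” rule and the sample data below are assumptions for illustration; your product’s definition of “active” is exactly the kind of judgment call discussed above.

```python
# Counting daily and monthly active users from a toy event log.
from datetime import date

events = [                                   # (user_id, date of activity)
    ("ana", date(2016, 7, 1)), ("ana", date(2016, 7, 2)),
    ("ben", date(2016, 7, 1)), ("cam", date(2016, 7, 15)),
]

def daily_active_users(events, day):
    return len({user for user, d in events if d == day})

def monthly_active_users(events, year, month):
    return len({user for user, d in events if d.year == year and d.month == month})

print(daily_active_users(events, date(2016, 7, 1)))    # 2 (ana, ben)
print(monthly_active_users(events, 2016, 7))           # 3 (ana, ben, cam)
```

Swapping in a stricter rule (say, contributing an article, or browsing on 2 days out of 7) changes only the filter, but it can change the story the metric tells quite a lot.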
The point we are trying to make is that deciding what to measure—and knowing that you’re measuring it well—can be challenging. Think about how this might apply to your race to the campsite. Originally, we said your metric of interest was time to the campsite. But maybe time isn’t the most important thing to emphasize over all else, because campers may be happier doing fewer activities at camp at a more fun and enjoyable pace. A different key metric you could measure is camper happiness. As you can see, there’s no objective way to determine whether camper happiness or time to campsite is more important—this is a judgment call that you’ll have to make, and the kind of judgment that designers and other folks on a product team have to make every time they run an A/B test.
We wanted to close out this section with a great comment from Arianna McClain about the subjectivity of measurement. Arianna reminds us that, “Measurement design is subjective. Someone decides what to measure, how to measure it, and how to build the model. So all data is subject to human bias.” As we just alluded to, the decision about what to measure and how to measure it is subjective; the way we usually phrase it is that behind every quantitative measure is a set of qualitative judgments. Designers have a big role to play in asking thoughtful questions and applying their expertise about users to guiding how a design or experience should be assessed, what matters from a user experience point of view, and how to get meaningful data that informs those questions.
Even though statistical significance is calculated at the end of a test, you’ll need to think about whether you can measure a statistically significant result during the design of your A/B test. Power is the probability that you will correctly detect a statistically significant result when there is a real difference between your experimental and control groups. When you design an A/B test, you want to make sure that your test is powerful enough to detect a difference in your groups if one does in fact exist. Unlike statistical significance calculations, this is an upfront calculation, done before you launch your test. Here’s one way to think about the difference: power tells you how likely you are to detect a difference if one exists, while statistical significance tells you whether you did see one in the samples that you observed. You can think of an underpowered test as having glasses that are too weak to correct your eyesight: if you don’t have a strong enough prescription, you probably won’t be able to tell the difference between a cat and a dog, and you’ll end up with a blurry and untrustworthy view of the world.
In product design, we define the minimum detectable effect (MDE) to be the minimum difference we want to observe between our test condition and control condition in order to call our A/B test a success. The MDE often depends on business factors, like how much revenue increase would result from a difference at least that big in your metric. The intuition here is basically that the cost to test and implement that change should be “paid off” in some way, through a meaningfully large difference in some metric that is key to your business’s health and success or a sizable improvement to your user experience. You might also choose an MDE based on previous A/B tests you’ve run—this knowledge of how big an effect you’ve seen in the past can help benchmark how big of an effect you’d want to see in the future.
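One way to reason about an MDE is to turn the earlier sample-size sketch around: given the number of users you can realistically expose to each variant, what is the smallest lift you could reliably detect? The helper below reuses the same normal-approximation formula and finds the MDE by a simple search; the baseline rate, traffic, and thresholds are illustrative assumptions.

```python
# Smallest detectable lift (MDE) for a given per-group sample size,
# using the same two-proportion approximation as before.
from scipy.stats import norm

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    z_alpha, z_power = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2

def minimum_detectable_lift(p_control, users_per_group, step=0.0005):
    lift = step
    while sample_size_per_group(p_control, p_control + lift) > users_per_group:
        lift += step
    return lift

# With a 10% baseline and 5,000 users per group, lifts much smaller than
# about 1.75 percentage points would likely go undetected.
print(round(minimum_detectable_lift(0.10, 5_000), 4))   # ~0.0175
```

If the lift your business actually needs to justify the change is smaller than this number, you need more traffic, a longer test, or a different experiment design.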
Let’s say that one camper tells you they saw a skunk behind the outhouse. You might be inclined to think that they just saw a squirrel or a raccoon but thought it was a skunk. Now, what if you heard from five campers that there was a skunk behind the outhouse? This would probably make you a little bit more inclined to believe it, and you might even have an inkling of worry about a camper encountering the skunk. What if you heard from 50 independent campers that there was a skunk behind the outhouse? By now, your confidence that the skunk is there would likely be so strong that you’d probably temporarily allow campers to use counselor bathrooms, lest they get sprayed on their way to or from the outhouse.
Here is another example. Let’s say that each of your four groups for the race to the campsite had only one camper, because everyone else got sick and couldn’t go. You might observe a difference between the groups, but you’d probably be skeptical about making purchasing decisions on such a small sample; after all, only one child was faster, and how do you know that it wasn’t just because she was extra tall or extra athletic? But what if, now, each of the four groups had 40 campers? Assuming that the groups stayed together, if you observed that Group 4 was fastest to the top, you’d probably feel fairly confident basing decisions off of that data because you have more information. The individual differences between campers would probably level out, and 40 kids beating out 120 other kids would be more compelling than 1 kid beating out 3 kids.
Recall that a p-value represents the probability of observing a difference at least as large as the one you saw purely by random chance, assuming there is no real difference between the groups. When we see a p-value of .01, for instance, this means that 1% of the time we would observe the difference we saw, or an even bigger difference, just by random chance, not because of any meaningful difference between the groups. But how small of a p-value is small enough? This depends on how confident you want to be. In many social science fields like psychology, any p-value less than .05 (5%) is taken to be statistically significant—that is, the observed difference is assumed not to be due to chance. Another way to say this is that 5% of the time, you’ll think you’re observing a real effect in your data when actually it’s just random noise that occurred by chance. In other fields such as physics, only p-values less than 0.0000003 are taken to be statistically significant.[11] This is of course impractical for the types of changes we make in product design, even for the largest Internet sites.
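As a back-of-the-envelope illustration, here is how a p-value for the difference between two sign-up rates might be computed with a two-sided, two-proportion z-test. The counts are made up, and real experimentation platforms typically handle this calculation for you.

```python
# Two-sided z-test for the difference between two proportions.
import math
from scipy.stats import norm

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)   # combined rate under "no difference"
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return 2 * (1 - norm.cdf(abs(z)))

# 520/5,000 sign-ups in control vs. 570/5,000 in treatment: a 1-percentage-point
# lift, but the p-value is about 0.11, so at the .05 threshold we would not call
# this difference statistically significant.
print(round(two_proportion_p_value(520, 5000, 570, 5000), 3))
```

The same observed lift with ten times the traffic would produce a far smaller p-value, which is another way of seeing why sample size matters so much.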
Part of designing an A/B test is determining the degree of confidence you will accept ahead of running your test. Are you OK with the result of your tests being wrong 5% of the time? That’s the typical threshold most internet teams set. How about 10% of the time? 20%? Only you and your teammates can decide the type of risk you’re willing to take. The main reason to be more generous about your risk taking is that accepting more risk (a higher chance of a false positive, or lower statistical power) means you can get by with smaller sample sizes, which in practice probably means shorter and less costly tests because you need less time to get data from fewer users.
As you can see, much of designing an A/B test involves making trade-offs between these different factors that depend on your context. However, the statistics of your test are only one important piece of the puzzle for gathering important learnings about your users. Having a solid hypothesis that expresses what you aim to learn is equally important. This is the topic of our next section.
No matter how numerous; for any conclusion drawn in this way may always turn out to be false: no matter how many instances of white swans we may have observed, this does not justify the conclusion that all swans are white...but it can be shown false by a single authentic sighting of a black swan.[12]
We find that keeping the formal definition in mind is helpful because it can be a good reminder to us that the beliefs we form about the relationship between user behavior and metrics should not be held so strongly that we become blind to the possibility that they can be disproven. Having this mindset will allow you to maintain a healthy attitude toward experimentation. Sighting a black swan—that is, being disproven—can often be the most valuable kind of learning. We challenge you to embrace these opportunities as a way to correct misconceptions and build sharper design intuitions.
With respect to data and design, your hypothesis should be a clear articulation of how you think your design will affect customer behavior and metrics and why it will have that effect. Said slightly differently, it states what you presume will happen to your metric(s) because of a change that you are going to make to your experience—essentially, it’s a prediction about the outcome of an experiment. If you have formulated your hypothesis well, then you will also have a good understanding of what you will learn about your users whether your hypothesis holds or is disproven. Having a strong hypothesis is key to the experimentation process and having a hypothesis that can’t be tested is ultimately of no value to you.
Your hypothesis should be a statement, a proposition that you make to describe what you believe will happen given specific circumstances. It often, but not always, follows the form: “If we do X, users will do Y because of Z, which will impact metric A.” Returning to our earlier example of the race to the campsite, recall that we assigned the following equipment to our four groups:
| Group | Equipment Received   |
|-------|----------------------|
| 1     | Map (control group)  |
| 2     | Map, compass         |
| 3     | Map, GPS             |
| 4     | Map, protein bars    |
It’s easy to see here how giving campers navigational equipment like a GPS or compass would help test Hypothesis 1, while giving them protein bars would test Hypothesis 2. Simply put, if we observed that there was no difference in the time to the campsite for Group 1 (control) and Group 4 (experimental group, Hypothesis 2), then we could disprove Hypothesis 2—giving campers additional food did not decrease the time it took to get to the campsite. Similarly, if there was no difference between Group 1 and Group 2 or 3, we could disprove Hypothesis 1—navigational equipment did not help campers get to the campsite faster. One important note here is that both Group 2 and Group 3 address navigation; these are two possible ways we could test the same Hypothesis 1. In Chapter 5, we’ll talk about different ways to “get at” the same hypothesis.
We might also observe that Hypothesis 3 is not disproven—the campers who got food are in fact measurably happier relative to the control group. As you can see, then, what you stand to learn depends not only on the exact experimental setup but also the metric you specify in your hypothesis. This is important, because it illustrates one major reason why defining a clear hypothesis is important: it helps build alignment among you and your team around what you think is important, and the criterion by which you will evaluate your test outcome. A clear hypothesis that involves a specific metric makes it clearer whether a test is a success or failure, paving a clearer path to next steps based on data.
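To make that comparison concrete, here is a hypothetical analysis of hike times for the control group and Group 4. The times, and the choice of a two-sample t-test with a .05 threshold, are illustrative assumptions rather than a prescription.

```python
# Comparing hike times (minutes) for Group 1 (map only) and Group 4
# (map + protein bars) with a two-sample t-test.
from scipy.stats import ttest_ind

group_1_times = [92, 88, 101, 95, 90, 99, 93, 97]   # control: map only
group_4_times = [90, 94, 89, 98, 92, 100, 96, 91]   # map + protein bars

t_stat, p_value = ttest_ind(group_1_times, group_4_times)
print(round(p_value, 2))
if p_value >= 0.05:
    # No reliable difference in hike time: evidence against Hypothesis 2,
    # even though Group 4 might still score higher on a separate happiness metric.
    print("No significant difference in time to the campsite")
```

Running the same comparison on a camper-happiness score instead of hike time is what would let you evaluate Hypothesis 3.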
There is always a temptation when embarking on a new project to jump right into exploring different design solutions or to begin to plan how you might execute on your ideas. It is a commonly accepted best practice in the design world that it takes many iterations to find the best design. The same is true for experiments. We don’t want to mislead you here into thinking that you should make one hypothesis and then go straight into testing it with an A/B test. In the remaining chapters, we’ll deep dive into how going broad is essential to learning the most from your A/B tests. But for now, just remember that choosing what you test is critical, and taking upfront time to focus your efforts on the most meaningful tests will help you learn the most.
That’s what A/B testing is at its core: uncovering the most meaningful insights about your users and how they respond to your product and design changes. We believe that if you start by considering why you think a design will have an important or meaningful impact, it will ensure that you and your team can learn from each design in an A/B test. In order to emphasize learning as central to your A/B testing process, we encourage you to clearly articulate what it is you’ll learn from each hypothesis you test.
You should aim to have two things when crafting a hypothesis:
A hypothesis statement that captures the essence of the change you propose to make and what you think the effect will be.
A learning statement: a clear understanding and a plan that addresses what you would learn by testing that hypothesis.
You should reach a clear agreement on both the hypothesis statement and learning statement with the other folks you’re working with (your team and any other stakeholders) before you embark on your design process. This is especially important when your hypothesis is disproven. Working with your team early on to articulate your possible learnings for all outcomes will help ensure that no matter what the data shows, you’ll have gained interesting and actionable insights about your users.
For example, if the changes that you made to the experience you were testing resulted in metrics that were negatively impacted instead of positively impacted (e.g., you lowered the sign-up rate rather than raised it), what useful information did you learn from your test? Did your customers behave differently than you expected, or did they behave as you expected but the result of that behavior was different from what you predicted would happen? We’ll cover analyzing your results in Chapter 6, but the point here is that tying your hypothesis back to the lesson you are trying to learn can be really helpful in the long term. Don’t run a test, conclude that it failed, and forget about it. Unfortunately, this is a pitfall we see teams fall into all the time. Emphasizing that all experimentation is about learning ensures that negative test results aren’t really “failures” at all, because you still gain valuable insights into your users and the impact of your designs, data that will inform and improve your future experiments. Only when you can leverage failed results as well as successes can your company truly be “data aware.”
Anyone involved in product or design work would agree that creativity is essential to doing good design work. However, typically there is more pushback on the claim that experimentation is a creative process. Unfortunately, the narrative around A/B testing that we often hear is that it’s “just a way to validate your design before shipping” and “it’s for crazy and meaningless optimizations, not interesting design problems.” If you take nothing else away from this book, we hope that you come away feeling that A/B testing and other data methodologies can be as creative as the design process you already know and love. To help make that point, we’ll emphasize two main ways that you can bring creativity into your A/B tests.
Data triangulation is using multiple methods to form a holistic picture of your users and your data. It can help further improve your understanding of user behavior by explaining why you found a result to be true among many users in an A/B test, or understanding the magnitude of a finding you saw in a small-sample, moderated research activity like a usability test. Data triangulation also helps you avoid falling into the pitfall of relying too heavily on a single source of data and can spur endless new ideas for future A/B tests or design iterations.
Let’s imagine that after the race to the campsite you were surprised to find that Groups 2 and 3 (the compass and GPS groups) were even slower than the control group. This would undoubtedly be a puzzling result—how could equipment that intuitively seems to make groups faster, by helping them take a more direct route, actually make them slower? To understand what happened, you decide to interview some of the campers in those two groups to learn about their experience. In those interviews, you might discover that many of the campers complained a lot about mosquitoes. They stopped often during the hike due to bug bites, and were slowed down by swarms of mosquitoes in their faces. Just by looking at the results of your experiment you never would have understood such a puzzling result; data triangulation gave you a more complete view of your campers’ experience.
In Chapter 1, we showed you an illustration (shown again here in Figure 2-7) to get you in the mindset of thinking about the types of problems you’re trying to solve. Design can solve many problems, and from our experience it can be easy to lose track of the landscape of possible design activities you could be engaging in, depending on your goals and the type of problem you’re trying to solve.
Previously, we left the space of possible design activities very vague. Now, we’re going to take a moment to provide a framework for thinking about how you could understand this space. One of the themes we hope you take away from this book is that using experimental methodologies as a means of gathering data fits seamlessly within the existing design process. Flexible methods like A/B testing can be used for a wide variety of design problems, but in order to do so successfully, you need to be aware of the design activity you’re trying to work on. This is another way that creativity makes its way into the process of designing with data: the data can answer many questions, but you need to be the driving force behind asking and solving the right design questions.
You can imagine taking the space of design activities and overlaying it with a grid. One axis represents the scope of the problem you’re solving: Is it a global or a local problem? The other indicates how far along you are in solving that problem: Are you close to finished, and just trying to evaluate your work, or are you just beginning to explore the space of possible solutions (Figure 2-8)?
As a designer, you may have already encountered conflicts with your team about how “finished” your design output is. We’ve often heard designers complain that they agreed to launch a piece of work they deemed “unfinished,” thinking that it was a temporary placeholder, only to find that their team thought it was the finished and final product. Thinking about whether your problem is exploratory or evaluative can help you address these team-wide decisions on how close you are to finishing the work. As a general rule, you can think of this dimension as conveying how close or far you are from coming up with a solution to launch to your entire user base (this is not to say the process is purely linear; you might be evaluating a solution only to uncover something that forces you to start exploring other alternatives!). In other words, exploration helps you figure out what to build next, while evaluation helps you measure the causal impact of your work.
When you are at an exploration stage, you don’t necessarily have a specific solution in mind. Instead, you are seeking directional input on the types of solutions that will or will not work for your users. In the exploratory phase, your attitude will need to be more open minded; you might find things out about your users that you weren’t expecting, which could alter the way you approach your design moving forward. When you’re in an exploratory stage, the types of questions you might be asking are “How will my users respond to this change?” or “What happens to the metrics when I do this?” If instead you are at an evaluation stage, then you will likely be looking for very specific answers. You’ll want to know whether or not your design works in the way you expected and why it performs that way. When you’re in the evaluation stage, you might ask yourself “Can my users complete their goals with this design?” or “Did this design improve or maintain my company’s key metrics? Did I observe any drops in my metrics?”
The result of your A/B test should never be a decision to launch the winning design and move on when you’re in the exploration phase; rather, it should be to take the learnings from the test and apply them to future design work, iterations, and more A/B testing. Comparatively, as you move toward evaluation of your designs, you should expect that you’re getting closer and closer to a “finished” piece of work that you would be comfortable launching to your users. At this point, you and your team should already be directionally aligned and you might be making smaller tweaks or polishing your work. The output of a successful A/B test might be to start launching the design to more of your user base. We’ll talk more about this in Chapter 6.
A second dimension you want to think about defines the scope of how broad or narrow your design thinking is, and how optimal you want your solution to be. We can classify this decision of scope as looking for global versus local solutions. You can think of this dimension as determining how big of a design change you’re willing to make; that is, how much you are willing to optimize.
When designing for a local problem, you are focused in on a specific solution, and you’re making small changes to just one or two pieces of the experience in order to understand their impact or importance in isolation. In design folklore, these are the types of problems traditionally assumed to use A/B testing—for instance, when varying factors like the placement, color, size, and copy of a button to yield the best click-through rate. Local problems generally involve shorter time scales and are most appropriate when your existing solution is already performing adequately. Your design iterations for a local problem are generally less pronounced, changing only one or a few factors at a time (Figure 2-9).
By contrast, you are solving a global problem when you’re completely redesigning an existing experience. This might be because your existing solution is performing poorly, because it’s outdated, or because you’re looking to make many large changes to your experience or product at the same time (for instance, you want to change the entire sign-up flow for your product because user attrition during the existing flow is very high). Your design iterations will likely be quite dissimilar from one another, and it will take more time to land on the best solution when you’re changing many factors at once, because you won’t be able to separate which part of your change caused a difference. You might be trying to understand whether a given feature matters at all by changing many variables at once (Figure 2-10).
Closely related to the concept of global and local problems is the idea of global and local maxima. These concepts are borrowed from mathematics, although we find that they apply well to thinking about design and are used for this purpose by many designers. Many people explain global and local maxima using the metaphor of a mountain range. Within that mountain range, each mountaintop is the tallest point in its region; however, only one of the mountains will be the tallest in the entire range. Each mountaintop symbolizes a local maximum in design, and there can be many possible local maxima. These are the best solutions within the space you are considering. For instance, perhaps you’ve done a lot of work to optimize the color, size, and copy of a “Sign Up” button. It yields the best subscription rate of any of the designs you’ve tried. For many types of problems, optimizing for that local maximum will be good enough for your purposes. However, you can’t be sure that the local maximum you’ve found is the best overall way to get folks to subscribe to your product. That optimal point—the tallest peak in the entire mountain range—is called the global maximum. Identifying a global maximum can be challenging, because unless you explore many divergent solutions it’s hard to know whether there’s a better point than the one you’re currently at. We’ll discuss this more in Chapter 5 when we cover how to design for the hypothesis you have in mind. In the case of driving more registrations, you might do global explorations and find that the global maximum involves offering a free trial of your product before asking your users to pay money to subscribe.
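For readers who like to see the metaphor numerically, here is a toy sketch (with an invented “design quality” function) showing how small, greedy iterations settle on a local maximum, while a broader exploration of the space finds the global one.

```python
# Local vs. global maxima: greedy hill climbing vs. broad exploration.
import numpy as np

def quality(x):
    # Invented quality landscape: a small peak near x = 2, a taller peak near x = 8.
    return np.exp(-(x - 2) ** 2) + 2 * np.exp(-(x - 8) ** 2 / 4)

def hill_climb(start, step=0.1, n_steps=200):
    # Only accept small moves that improve quality (local optimization).
    x = start
    for _ in range(n_steps):
        x = max((x - step, x, x + step), key=quality)
    return x

xs = np.linspace(0, 12, 1201)
print(round(xs[np.argmax(quality(xs))], 1))   # broad exploration finds ~8.0 (global maximum)
print(round(hill_climb(start=1.0), 1))        # greedy iteration gets stuck near 2.0 (local maximum)
```

Starting from a different point, or deliberately exploring divergent designs, is what gives you a chance of finding the taller peak; this is the intuition behind the going-broad advice we return to in Chapter 5.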
Getting back to the question of whether you’re tackling a global or a local problem, you might ask yourself which of these two approaches is right for your design problem. Are you thinking too broadly (your problem is actually local), or too narrowly (your problem is actually global)? The answer depends on many factors. We want to be clear that a global change is not necessarily going to be more impactful to your metrics than a local change—global versus local describes how big the difference is in your experience relative to your control, not how big a difference you’ll observe in your metrics. Often, making a global change to your app or experience will require more resources and a much longer timeframe. In those cases, you should be thoughtful about how much “better” the global maximum (that is, the best possible result across the whole of the app experience) is compared to the local maximum (that is, the best possible result on the specific thing you are testing). Does it justify the extra resources, effort, and time? Is solving for this global problem the most important thing for you to be working on right now, or is a local optimization good enough? Does local optimization have the potential to create a large effect on your metrics? These aren’t questions you can decide on your own. Bringing your team into alignment on these issues at the beginning of the project will help you design solutions that are best suited for the constraints and goals you’re currently working within. Bringing other teams in and engaging with business strategists can also be useful, because “global” can be confusing—some simple or local experiments can affect the whole product, and so can be global in impact. A good example is changing a call-to-action design on every page.
Notably, these two dimensions can be combined freely: you can do global evaluations just as well as local explorations. And, as Figure 2-11 shows, these two dimensions are not binary but rather exist on a spectrum. Our goal in introducing this framework is not to encourage you to pin down exactly where you are or to introduce more bureaucracy into your design process. Instead, we believe that giving you a framework to think about the type of problem you’re trying to solve is essential for two reasons:
It forces you to take the time to consider the space of other possible design activities you could be working on, and confirm for yourself and with your team that you’re focusing your efforts on the right scope and problem.
Different spaces in the design activity landscape leverage data in different ways, and by being thoughtful about your specific problem space you can be more effective in how you design your A/B test (or other research methodology).
[4] Nielsen Norman Group (NNG). Note that this statistic depends on some assumptions, including the homogeneity of your sample and the base rate of the usability issues in the population. The number of users needed to detect 85% of problems may vary. It is worth reading the source research to understand the assumptions made in this figure.
[5] Colin McFarland, Experiment!: Website Conversion Rate Optimization with A/B and Multivariate Testing (New Riders).
[8] https://engineering.linkedin.com/ab-testing/introduction-technical-paper-linkedins-ab-testing-platform/
[11] http://www.graphpad.com/www/data-analysis-resource-center/blog/statistical-significance-defined-using-the-five-sigma-standard/
[12] Karl Popper, The Logic of Scientific Discovery (http://strangebeautiful.com/other-texts/popper-logic-scientific-discovery.pdf).