In Chapter 1, we discussed strategy for data science teams. Yet you can’t benefit from a strategy until you have achieved something for your customers. For a data scientist, this usually means completing a project. The team’s strategy guides the choice of projects and their prioritization, and is the base of each project’s strategy. However, projects still need their own strategy to properly define their objective and the means that are available and allowable to achieve it. By nailing this at the beginning of each project, you ensure that each one achieves its objectives, and, crucially, that your work receives full recognition from your customers.
Project Management
Good project management is widely considered vital to a successful project result. You can’t simply start a project without a method for organizing the work and estimating the time and resources likely to be required to complete the project and achieve your hoped-for result.
Over the years, many different processes have been used to manage projects. These processes all have in common that they systemize an approach to deciding what to do, estimating how long it will take to do it, and then ensuring that what is done is what the customer wants. However, they differ in the means of achieving those goals.
In the recent past, many software companies and software-oriented companies (i.e., companies where the main product is not software, but software is critical to the product’s delivery) have adopted Agile as a key project management methodology. Agile is usually contrasted with a formal “waterfall” approach, where requirements are defined early in the project and subsequent changes to requirements are difficult.
The waterfall approach was pioneered for complex civil engineering and construction projects where the final desired outcomes are heavily dependent on earlier stages and change is very expensive. For example, consider how the floor plan of a building determines the foundations needed; once the foundations are poured, going back and making a change would be extremely difficult and expensive, and so a change to the floor plan can only be made if it doesn’t require different foundations.
As Agile has taken off, it sometimes appears that Agile principles have made the frameworks used in waterfall projects redundant. However, there is a risk of throwing the baby out with the bathwater by carelessly or naively implementing any specific project management framework without considering the messages and strengths of the others.
Two myths busted by lean software development1 pioneers, Mary and Thomas Poppendieck, combine to illustrate the point. The first is “Early specification reduces waste” and the second is “Planning is commitment.”
These myths are ideas from traditional project management which have often been used to promote specifying early and creating a plan that is committed to. These are important things to do if you need to begin digging a foundation 6 months ahead of delivery of a piece of equipment.
However, the tools to make the plans and decide the specifications can still be used while maintaining the attitude that the plan can change. An early specification doesn’t have to be handcuffs; it can simply be the results of your research.
Therefore, by accepting that planning is not commitment, we can use the best tools from traditional project management methods, with as much of an Agile mindset as suits our situation.
In the material that follows, where we borrow material from waterfall project management, such as recommending creating documents and recommending talking to customers early, we are not demanding a commitment to the early discovery, simply observing that chances to engage with customers may not come smoothly throughout a project life cycle, so use them where they arise.
One of the important features of many of the traditional project management exercises is an emphasis on formal opening and closing procedures in order to ensure that the right path is followed—an approach that emphasizes risk and quality management.
In traditional approaches, such as Project Management Body of Knowledge (PMBOK) 2 and Projects in Controlled Environments (Prince2),3 this work takes place up front and produces a set of documents that record the results of the discovery. The disadvantage of this approach is that a rigid Project Manager can use these documents as a straitjacket, refusing to allow them to be altered in the face of changing circumstances or new information.
The advantage is that objectives are clear from the beginning. At the same time, there is no need to stop being Agile—simply take the best of the other approaches, learn from it, and apply it to your own situation. Indeed, the idea of hybrid approaches is one gaining greater currency today, and both PMBOK and Prince2 have developed guides on how to use those approaches while being Agile.4
As we will see next, choosing the right objective and defining it correctly is a difficult task. It can also be expensive when it goes badly. In that context, it makes sense to take learnings from as many areas as possible with a view to avoiding that expense.
Ultimately, there is nothing about trying to understand the customer as deeply as possible as soon as possible that conflicts with Agile principles, as long as it is done right. Rigidity in the face of changing requirements stems more from human defects than it does from defective processes, and isn’t a reason to abandon a process.
Defining the Objective
In many ways, the most difficult aspect of data science in general, and of any project in particular, is choosing the correct objective. Data science is a mathematical discipline, or at least a discipline inhabited by people whose backgrounds in areas such as computer science and statistics make them prefer to see problems in a quantitative light, and its practitioners therefore need objectives with a numerical definition.
In contrast, it can be argued that in the context of framing a relevant business problem, a qualitative viewpoint can often be more helpful in the initial phases. For a start, it is important to understand what kind of answer the customer is looking for. Are they going to use your work as decision support, for example, to choose between some alternative courses of action?
Are they instead going to use the result in a quantitative way themselves, for example, to determine the number of resources to allocate or stock level to hold?
In many cases the output of the model is not the final action—it only contributes to the final action in some way. The actual final action is an implementation of the model or a decision based on the model’s output. For example, although the output of a regression model is a set of coefficients to define an equation, and the output of applying that equation to new inputs is a column of numeric values, the true output is the decisions that are made as a result of understanding those values.
Taking a more concrete example, correctly estimating the person-hours taken to do something may be an obvious choice for a model, but it may not be the ultimate goal. The goal may really be to estimate the elapsed time (when can I have it?) or labor resource (how many people should I allocate?). Hence, choosing the right dependent variable to model may not be obvious, and there may be another variable, easier to model, that is also more apposite to the client’s needs.
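To make the distinction concrete, the following minimal sketch (in Python, with invented data and an invented crew size) shows the gap between the variable a regression model predicts, person-hours, and the answer the client actually needs, elapsed time:

```python
# A minimal sketch, using invented data: the model predicts person-hours,
# but the client's question "when can I have it?" is about elapsed time.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(200, 2))  # e.g., job size and job complexity
person_hours = 4 + 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, person_hours)

# The model's raw output: an effort estimate for a new job.
predicted_effort = model.predict([[6.0, 4.0]])[0]

# The client's real question, answered by combining the prediction
# with something the model never saw: the crew size they can allocate.
crew_size = 3
print(f"Predicted effort: {predicted_effort:.0f} person-hours")
print(f"Elapsed time with a crew of {crew_size}: about {predicted_effort / crew_size:.0f} hours")
```

The point is that the coefficients and predictions are only an intermediate product; the deliverable is the answer to the client’s scheduling or staffing question.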
Another side to this coin is understanding why the customer wants a data science solution. Do they want higher accuracy forecasts than they currently achieve? Or would they be happy with the same accuracy but want to leverage computational speed and power to achieve a faster turnaround time?
This last question should be key to an effective data scientist’s approach, as clearly an hour or two spent ensuring you have the correct understanding of the client’s objective will be more useful than 10 hours spent optimizing a model serving a different objective.
Costs of Getting It Wrong
Partly because data science is a relatively new activity compared to other disciplines, few examples of the cost of getting data science wrong have been compiled. However, the costs of failure have been analyzed in the overlapping context of software engineering. Some of these analyses are glossed by Code Complete 2,5 itself a brilliant resource for data workers resolved to avoid wasted time, despite the book’s focus on software construction.
McConnell, the author of Code Complete 2, combines the results from a selection of papers to give potential ranges for the time taken to fix defects at different stages of software projects—not altogether surprisingly, fixing an error in requirements after the requirements stage quickly blows out as projects progress, until fixing a requirements error post-release is estimated at 10–100 times as time-consuming as fixing it during the requirements process.
The Manager of a large high-rise office building was receiving mounting complaints about the poor elevator service. She decided to call in a consultant to advise her on what to do to solve the problem … [the consultants suggest expensive engineering solutions] … fortunately, one of the hotel tenants was a psychologist.7
The psychologist succeeds where the elevator engineers fail by realizing that the key reason that people complain about the elevator is that they are bored. She installs mirrors near the lifts to keep the lift users amused, and the problem vanishes (and the authors note that had they invented the problem a few years later, it might have been TV screens, not mirrors).
The moral of the story is that solving the right problem is crucial. The problem was never that the lift was going too slowly, but that the lift’s users didn’t have anything to do while they waited for the lift.
The error that the engineers made of trying to speed up the lift rather than relieve the lift users’ boredom is an example of a Type III error—solving the wrong problem with a precise solution. In data science this could easily happen if you create a highly accurate model with the wrong target.
More insidiously, there may be multiple “right” targets and multiple ways of building accurate models for them. The multiple right targets may differ in terms of the data they require and the tools they require to model that data. Therefore, there is a great benefit to the data scientist who can identify the right target that offers the most tractable solution. Sometimes, you won’t need a better data preparation tool but a different way of looking at the problem which reframes it with a less onerous data challenge.
The multiple ways of building accurate models can also determine a lot about what sort of solution is offered to the client. For example, many people have observed that the super accurate models made to win Kaggle competitions are very different to the leaner models commonly found in industry, especially as the processes used to achieve the last 0.1% of accuracy that wins the competition are usually too computationally intensive to get an answer to the customer in a reasonable time frame.
Most data science professionals realize that the massive stack of advanced neural nets and Gradient Boosting Machines that win Kaggle competitions is unsuitable for most real-world clients’ needs.
The more subtle area is where there are multiple models that could be feasibly implemented, and poor understanding of the customer’s preference means that an optimum model could be selected that actually isn’t optimum in the eyes of the customer. This is an example of Project Risk, in that it is a risk that threatens the perceived success of the project, but it is not one of the most commonly discussed project risks.
Project Risk and Unintended Consequences
Project risk is often defined and examined from the point of view of risks to the completion of the project. That is, in my experience, a typical project risk discussion will focus on hazards to completing the project either on time or correctly. For example, we will discuss the CRISP-DM data mining approach later in this chapter—it describes risk in terms of things “that might delay the project or cause it to fail.”8
This definition is natural when considering projects that begin with a very strong definition, as, for example, civil engineering projects where the project team is constituted to build a structure to a prescribed blueprint. However, in data science, and in fact in many software engineering contexts, the objective is far murkier.
A more insidious form of project risk is similar to a Type III error—the project is completed on time and correctly but doesn’t give the customer the expected benefits. Worse than that are cases where the project is completely successful, and then inadvertently causes problems. This could be called a Type IV error, although Mitroff and Silvers have already proposed a definition for that term.
The possibility of adverse outcomes arising because of unintended consequences is an aspect of the safety of machine learning systems.9 This is a new consideration in machine learning, but likely to be of increased interest in the near future.
A recent example illustrates the hazard of unintended consequences. An Australian woman wrote an open letter to tech companies involved in targeted advertising when her social media became inundated with baby-related ads after she bore a stillborn child.10 In her open letter she posed the question, “If [targeted marketers] could identify she was pregnant, couldn’t they identify that she had had a miscarriage?”
Intuitively, the marketers are able to identify women who have recently had a miscarriage (and they could probably even market, for example, counseling services to them) but had not identified the need to do so. While it’s only possible to speculate on what the marketers did or didn’t do, failing to deliberately consider when a wrongly placed ad can have negative consequences is a plausible reason why they didn’t identify this potential problem.
Big Data—Big Risks?
The trend toward Big Data offers another area where attention to the problem being solved is crucial. Although it is true that Big Data can offer solutions that are not available with a smaller data set, there is also a widening realization that a larger data set also brings greater risks. In particular, data sets with many possible input variables bring an especially high risk of falsely identifying an input variable as important.
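The scale of this risk is easy to demonstrate with a short simulation. In the sketch below, both the candidate variables and the target are pure noise (with assumed dimensions of 500 rows and 1,000 variables), yet a naive per-variable significance test still flags dozens of variables as “important”:

```python
# A minimal sketch of the false-discovery risk in wide data sets:
# every variable here is pure noise, unrelated to the target.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_rows, n_vars = 500, 1000
X = rng.normal(size=(n_rows, n_vars))  # 1,000 candidate input variables
y = rng.normal(size=n_rows)            # a target unrelated to any of them

p_values = [stats.pearsonr(X[:, j], y)[1] for j in range(n_vars)]
n_false = sum(p < 0.05 for p in p_values)
print(f"Variables 'significant' at p < 0.05: {n_false}")
# Roughly 5% of 1,000 (about 50) will appear important by chance alone.
```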
Therefore, if there is a way of answering the customer’s needs (find more customers, better identification of risk, etc.) without using ever-larger data sets to achieve the result, this will often be a better outcome—and this is before considering the extra effort in computation and coding time that is usually associated with Big Data over more moderate data.
Finally, the biggest problem with Big Data is that it encourages people to focus on the fact of the data being big at the expense of a clear understanding of the client’s problem. As we discussed and will reiterate further, a clear understanding of the client’s problem should always be the central concern of any data scientist making a genuine attempt to provide value.
Tools to Define the Objective
There are a number of approaches to eliciting the best understanding of the customer’s real requirements. We will review three of them and discuss how each is usually applied in its relevant context.
These approaches all exist to widen the discussion beyond what people would do if allowed to default to problem solving by reflex—human nature is to suggest a solution as soon as possible, without slowing down to discover the real problem, or at least without checking to make sure that the real problem has been identified.
Hence, each of these approaches exists, as much as anything else, to force people to slow down and make haste slowly in their approach to problem solving.
The Six Sigma11 process was originally developed for use in a manufacturing environment. However, after success, especially at GE, which began as a manufacturing company but branched into other areas including finance, it began to be used for a wider range of applications.
The Six Sigma approach is stage-gated, with the stages in the original version being remembered through the mnemonic “DMAIC”—Define, Measure, Analyze, Improve, and Control.
Although Six Sigma was developed for a different environment than is seen in some data science projects, there is still a lot we can learn from this approach, especially because arguably Six Sigma’s greatest achievement was taking pre-existing quality assurance tools and pairing them with a rigorous approach to understanding the real objectives of quality projects.
This link enabled Six Sigma users to ensure that they could explain their successes to the rest of their organizations—a vital consideration in most companies in the modern age, where if the management can’t see you adding value, you can be very quickly removed from the business.
The first notable feature of the Six Sigma approach is that the Define stage—where the goals are set and success is defined—is given its due as the true foundation of any project. At the same time, there is no assumption made that the customer or client will be able to express their needs in terms that easily lend themselves to developing the clear targets required for this kind of project. Instead, various tools are used to convert what the customer knows they want or need into something more tangible lending itself to specific and achievable goal setting.
We also note that, in general, the tools and techniques claimed by Six Sigma were not invented for use with Six Sigma projects—they usually already existed and were identified as fitting the philosophy sometime after their use became widespread, even if being recommended as a Six Sigma tool made their use more widespread still.
The Voice of the Customer
Fundamental to the success of the Six Sigma approach to solving problems is that the objective is defined by the voice of the customer. Only by understanding the voice of the customer can you find an objective that can be translated into a measurable target to become the focus of the Six Sigma project. As a result, a substantial part of the Define stage is devoted to this task, and understanding the voice of the customer is recognized as the first step toward defining a project. There are two important outcomes with this approach.
Firstly, it ensures that the subject of the project is genuinely relevant to the end user. Secondly, it ensures that people maintain their understanding of whether the measurable target is the actual thing the customer is interested in or a proxy, and if the latter, it ensures that the way the proxy and the customer’s concern relate is transparent.
So how can we properly establish what our customer actually wants? The cold reality is that all too frequently they won’t be able to tell us, although this may not mean that they don’t actually know.
In an ideal world, we might want to use a tool that Six Sigma practitioners use, or something very similar. However, we are at a disadvantage because Six Sigma practitioners are able to train their (usually internal) customers to expect certain tools, and unfortunately the expectation has grown that data scientists will march in, make and implement some models, and march out leaving easy profits in their wake. That doesn’t mean we can’t learn from the thinking behind some of the Six Sigma models, however.
The tools that are typically used in a Six Sigma context are intended to help practitioners zero in on the part of the problem that has the most influence over the end result, or to put it another way, the area of the problem where the ratio of the ease of fixing compared to benefit of fixing provides the most favorable results.
In this book, it’s not our aim to give a comprehensive or even any guide to Six Sigma design tools. We’ll look at just one to demonstrate the philosophy, and how it can work in practice. As always, and as will be the case for other tools presented in this chapter, the correct choice of tool depends on the situation.
From the point of view of understanding the customer’s needs, especially within the context of a larger situation, one of the most powerful tools associated with Six Sigma is Quality Function Deployment (QFD).

Figure 2-1. A simple House of Quality diagram, as typically used in Quality Function Deployment. Note that the diagram shows correlations within the design requirements themselves and between the design requirements and the customer requirements.
Six Sigma also employs the Seven Management and Planning tools, which were popularized by the post-WWII Japanese approach of Total Quality Control.12
QFD is a method that allows its users to apply systems thinking and psychology to their problem, meaning they can develop a proper understanding of where the customer sees value. It covers both the “spoken” and the “unspoken” requirements, to avoid the problem of developing a product that is precisely what the customer asked for without being anything like what the customer wanted.
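It is beyond our scope to walk through a full House of Quality, but the arithmetic at its heart is simple, as the sketch below shows. The customer requirements, importance weights, design requirements, and relationship scores (on the conventional 9/3/1 scale) are all invented for illustration:

```python
# A minimal sketch of the core House of Quality calculation: weight each
# customer requirement, score its relationship to each design requirement,
# and rank the design requirements by weighted importance.
import pandas as pd

importance = pd.Series({"fast results": 5, "easy to interpret": 3, "low cost": 2})

relationships = pd.DataFrame(
    {  # columns: design requirements; rows: customer requirements
        "simpler model": [3, 9, 3],
        "better hardware": [9, 0, 1],
        "cached scoring": [9, 1, 3],
    },
    index=importance.index,
)

# Priority of each design requirement = sum of (importance x relationship).
priority = relationships.mul(importance, axis=0).sum().sort_values(ascending=False)
print(priority)
```

The value of the exercise is less in the numbers than in forcing the customer’s weighted priorities, rather than the team’s instincts, to determine which design requirement is tackled first.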
The key message is simply that what the customer values isn’t necessarily what we think they value. It also isn’t always the first thing the customer complains of when they initially engage. Discovering the real motivator for the customer can be difficult.
However, as doing so is crucial to selecting the right objectives, using the available tools to uncover the user’s underlying motivations should be an essential part of the data scientists’ process. Quality Function Deployment is a key example of the kind of tools that have been successful in understanding a customer’s concerns in their wider business context.
CRISP-DM
Six Sigma and the DMAIC process were not developed with data science or data mining in mind. Although we have recommended at least considering the use of some of the tools of Six Sigma to establish the customer’s needs, an important way that the Six Sigma process is not suitable for a data science project is that it is linear.
The CRISP-DM13 process allows for insights gained from the data to be re-incorporated into the understanding of the user’s problem. The degree to which this will be suitable obviously depends on how available the user wants to make themselves to you; when deciding on an overall strategy for a particular project, it will be important to consider how available the user is prepared to make themselves for questioning.
The iterative nature of CRISP-DM—something it has in common with the Agile philosophy—makes it a good way to think about data science projects. On the other hand, as it lacks some of the customer focus, project management, and close out elements of other approaches, these may need to be borrowed from another approach.
The CRISP-DM cycle can be seen in Figure 2-2. By using the information gathered at subsequent cycles, a data scientist using the CRISP-DM cycle can improve on their first guesses. In particular, there is a specific provision made to go back to the customer for further discussion after the initial phases of data discovery.
Each of the phases is subdivided into smaller areas,14 which contain checklists of important areas to consider. The Business Understanding phase, for example, being the phase that most corresponds to this chapter, has the distinct goals of understanding the business goals and defining the success criteria for the data mining project.
These are then referred to in the successive phases of the project. For example, in the evaluation phase, the evaluation is performed with respect to the original success criteria, as might be expected.

Figure 2-2. CRISP-DM cycle
Empathizing with the User
The approach taken by Six Sigma practitioners is very much an engineering approach. The key tactic is to define a concrete problem with a financial payoff. This is appropriate in the usual Six Sigma context where problems are most often associated with a specific cost of poor quality outcome—a need for rework or scrap, for example.
However, this may not be the best approach for every opportunity that can be tackled through data science. Where the opportunity is not tightly coupled with a specific poor quality outcome, a Design Thinking approach might be a viable alternative.
In a typical Design Thinking life cycle, developers build empathy with users before generating ideas that represent possible solutions. This approach allows the practitioners to get a better understanding of the customer’s underlying or true problem, rather than the surface problem they may have presented with.
The process of empathizing with the user allows a wide range of possible solutions to be generated, with the potential for some very blue-sky thinking. The elevator example from Mitroff and Silvers we saw earlier in the chapter illustrates this perfectly—the psychologist thought about what the elevator users actually wanted rather than the kind of problems that she usually solved. As a result, she was able to suggest a solution that succeeded where the technical solutions had not. A typical Design Thinking process runs through five stages:
1. Empathize: Gather information from your users.
2. Define: Turn user information into insights.
3. Ideate: Generate ideas based on insights.
4. Prototype: Build a version of your ideas.
5. Test: Validate your ideas.
Although these stages are written as a sequence, that doesn’t mean they need to be performed in a strict sequence—iteration is possible and desirable.
The initial stages are particularly important in our context. Empathizing with the user means ceasing to impose your own ideas, and putting aside your own assumptions about what this problem looks like to the client or user. This is exactly what the psychologist did in the elevator problem. She asked herself, “As an elevator user, how am I affected by the time it takes for the elevator to arrive?”
By empathizing with the user, you will be able to understand the actual problem they are facing, rather than the problem that your own assumptions about the world make you impose on the client.
Don’t Waste a Second
Time with clients is precious. Imitate experts in other professions—doctors, lawyers, management consultants—to apply the best interviewing techniques that ensure you get the most from the short time you have with your clients. If your clients are internal, you have more time, but it’s still easy to waste it, and therefore to fail to get the best results—and it still isn’t infinite.
Part of the problem is speaking the same language. There is no reason to assume that experts in other fields are conversant in machine learning and statistical techniques and jargon. The second difficult aspect is how to project confidence in your model or other data analysis without projecting arrogance.
Appearing, if not humble, then at least in a way that means you don’t place yourself above the person you are trying to help will prevent turning them off. Remembering that the information they have about the nature of their problem is just as useful as your expertise, while more difficult to obtain (there are lots of data scientists around the world, but only a handful of people understand your clients’ problem like they do, and they are likely all at the same work site), should make it easy to stay grounded.
Gauging your client’s level of statistics or data analytics knowledge is also important. Explaining every basic statistics term to clients who already know perfectly well what the median is will get them offside as surely as baffling them with esoteric discussions of the spherical family of distributions.
In general, there is a strong chance that members of your audience have done a business degree. It’s likely that they understand the basics of descriptive statistics. Concepts such as standard deviation may be fuzzy or misremembered, but they probably aren’t completely novel.
Machine learning algorithms and their jargon are relatively rarely encountered by people who aren’t trying to become experts in the field—the most obvious example of such experts being data scientists.
Consider carefully whether you need to tell an audience that your model may be based on a Random Forest. Before any models have been built, there shouldn’t be any need to go into that sort of detail—for any model, the algorithm used to build it is one of its least important attributes. If you get sidetracked into explaining how a particular algorithm works to a client who isn’t going to build their own models, you will exhaust the time they have available to explain their business to you. Let’s face it—there is almost no chance they have enough time to explain it to you properly.
Note
One of the catchphrases often heard in a design thinking environment is to avoid solutioning. Solutioning is where someone who is running a session with users or other local experts intended to define problems begins to suggest solutions. Doing this can be fatal to the session, as the local experts will stop explaining their problems and either shut down or offer their own solutions.
Better yet, hold back on the details of possible models until after you report on what you have built. Ultimately, you are not selling a Random Forest or a neural network, you are selling a reduction in their exposure to risk, or the means to reduce some of their costs. The customer isn’t interested in how that is done beyond needing to feel confident that you really can do it.
In short, a discovery session with your client or user is a great time to remember the adage that “you have two ears and one mouth: use them in that proportion,” and then talk even less. People often want to fill in silence with their own words—use it to your advantage. Think carefully about when you need to guide the conversation, and in what direction to guide it.
Conversations with the customer almost always turn up new views on what is really important. Sometimes all you need to do is sit in the room and listen. Other times you will need to coax the customer to the right frame of mind to tell you what they really need.
Data Science and Wicked Problems
There are multiple definitions of “wicked problem,” but one characteristic that may be seen in problems presented to data scientists is the lack of a definitive formulation.
Wicked problems are likely to be intractable if you try to apply traditional data science tools (some problems thought of as wicked problems have been at least partially solved using advanced forms of decision theory, but that’s way beyond our scope). That doesn’t mean that data science tools can’t be used to indirectly tackle at least certain aspects of these problems.
It does, however, mean that it is very important to identify as early as possible when you are being asked to solve a wicked problem. Once you know that this is what you are dealing with, you can decide on an appropriate course of action, the two chief possibilities being to pass on the problem or to restructure it into a tame problem.
Important markers of a wicked problem include the clients you talk to failing to agree on a common understanding of the problem. Attempts to define where the problem begins and ends may be fruitless. The problem may appear to be a symptom or a cause of another problem. In this way, the problem shifts its shape so as to defy attempts to place a boundary on it. This shape-shifting makes the use of any sort of model or algorithm extremely problematic.
As has been noted before, a significant part of solving a problem is defining it correctly. A common element of wicked problems is that they defy an easy definition. However, deciding on a stance beforehand sometimes forces the problem into a solvable state.
By reframing the problem in an acceptable way, you can make it tractable. As data scientists, we naturally lean toward numerical solutions or solutions involving automation.
There are in fact a range of problem restructuring methods designed to assist in transforming wicked problems into problems that can be readily solved, assuming the right inputs are available.
Strategic Assumption Surfacing and Testing was introduced in a paper by Mitroff (of the elevator problem mentioned earlier). The goal of this method is to understand the underlying assumptions that govern a problem.16 This is done in groups over a five-step process.
Strategic Assumption Surfacing and Testing is just one of a range of approaches to redefining problems so that intractable problems become tractable. There are now at least several methods with reasonably long track records for restructuring problems.17 While they share common elements, such as usually being originally envisaged as group activities (although they are often reworked for use by individuals), the differences between approaches mean that you can select the right one for your specific environment, and sometimes modify an approach with ideas from another.
Understanding the correct objective at the beginning of a project can be clouded if the problem has been set up as a wicked problem, making it seem intractable. However, as the form of any problem is usually determined by many assumptions—some easy to identify, some less so—there is very often scope to restate the problem in a way that yields to your available tools.
Careful application of problem structuring methods will often let you turn an ill-posed problem into a well-posed one, and succeed where others have failed to gain traction.
Documenting the Project’s Goals
Just as is the case for team goals, goals for projects can easily be forgotten or misunderstood. Strong documentation of the right information ensures that you don’t fall at the last hurdle by delivering to the wrong target through a simple misunderstanding of what you have already discovered.
Aside from defining the target you are modeling, there are multiple dimensions that may be of importance to the user. Obvious examples include ease of understanding the model’s results, the speed at which results can be returned, and the speed at which the solution can be implemented. It can be helpful to document these aspects as well.
It is also helpful to document the data sources that are available to the project, and the platform that the results will be delivered to, to define the final format.
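None of this needs to be elaborate. As a minimal sketch (every field name and value below is invented), a simple structured record can capture the target, the secondary dimensions, the data sources, and the delivery platform in one easily updated place:

```python
# A minimal sketch of a lightweight project-definition record.
# All names and values are invented examples, not a prescribed schema.
import json

project_definition = {
    "objective": "Reduce stockouts by forecasting weekly demand per store",
    "target_variable": "units_sold_per_week",
    "success_criteria": "Lower forecast error than the current manual process",
    "other_dimensions": {
        "interpretability": "Store managers must understand the drivers",
        "turnaround_time": "Results available by 6 a.m. each Monday",
        "implementation": "Must run on the existing scheduling server",
    },
    "data_sources": ["sales_history", "promotions_calendar", "store_attributes"],
    "delivery_platform": "Existing BI dashboard (CSV upload)",
}

print(json.dumps(project_definition, indent=2))
```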
Individually these points of information may seem very trivial, but missing a few of them can at least waste time, if not lead to a project that fails to meet the client’s expectations.
The notion of documentation has unfortunately developed an association with traditional waterfall project management paradigms, and the risk of giving a power-hungry project manager a blunt instrument he or she can use to beat other unfortunates over the head with. As noted elsewhere, this risk is caused by human personalities, not by the tools or processes themselves. The project manager who beats you over the head with a formal requirements document is more than capable of developing Agile rituals that waste everyone’s time and lock the company up in pointless mummery.
It also pays to remember that documentation doesn’t have to be formal. It’s lovely to have comprehensive plans produced, which distill every brainstorm into their essential wisdom, but it’s also very time-consuming.
Fortunately, we live in a time where documentation doesn’t have to mean formally typed minutes in ring binders that no one will ever read or access. From wiki software, such as Confluence, through to virtual whiteboarding software, there is a range of options for capturing user discovery as it happens or for distilling the results of that discovery into documents with actionable insights.
The existence of electronic means to keep documentation is especially useful considering that things often change. When documentation was synonymous with hard copies, there was a psychological barrier to making those changes. Now there is no reason not to update documents as customer needs become clearer. In some cases this will require a change management process, if the documentation is formal and there is a risk of contradiction.
Other times it will be enough to collect the results of workshops as they occur as faithfully as possible and update them when necessary.
All in all, learning about the customer’s real needs is a difficult process. It is time-consuming, and it depends heavily on the customer’s own goodwill. With that in mind, we want to ensure that nothing is lost.
Selecting a correct objective and framing it to ensure the best results for a data science solution are essential to defining projects that will succeed and that will help your team maintain its credibility as expert problem solvers. However, although the correct objective may be the most crucial element, having the correct skills in each project team and the best possible data set are also critical, and we will consider these elements in the following sections.
Ways and Means—Project Resources
In Chapter 1, we saw strategy presented as the equation “Strategy = goals + ways + means.” In general, in a data science context, our ways and means are the people in our team and their skills, which represent the “ways,” and the available data, which represents the “means.”
Although in this chapter we have focused on the risk of a badly chosen goal, mostly because this is the least talked-about risk, the risk of not having the right resources is also an ever-present danger, although one more likely to lead to a project not being finished than to an improper project being finished.
Either way, though, the risk is real to your reputation as a data scientist, and to the reputation of data science as a valid way to solve problems within your organization. To continue to maintain your license within your company to take on difficult challenges, you need to make sure that people see you succeed as often as possible—carefully considering to what degree particular projects are in your capability is vital.
We will consider the two prongs in turn: first, the capabilities that exist within your team—the ways; second, the data at your disposal—the means.
The Ways—Data Science Skills
A challenge in this area that is bigger for data scientists than for other professionals is that the broad definition of data scientist sets up an expectation that any single data scientist has the skills of any other. So far, the idea that one data scientist might specialize in text mining while another specializes in true big data hasn’t fully penetrated, which makes it entirely plausible for the team to be asked to do something outside its expertise.
At the same time, the data science community to some degree buys into this idea by promoting relatively disparate areas of expertise, such as Deep Learning, Natural Language Processing, and Geostatistics as all equally a part of one skill set.
In this environment, there is a risk with every project that it will be outside your capabilities. This, in itself, isn’t necessarily a hard no. You should be taking on at least some projects that have stretching your capabilities as one of their core purposes. The problems start when expectations haven’t been reset to take into account the fact that the team’s capability isn’t quite there.
This is where the concept of the project contract, seen in systems like Prince2, is important—it allows you to agree to projects you don’t have complete capability for, while explicitly setting the expectation that there is extra risk attached, both to finishing the project at all and to the expected timeline.
When you want to stretch your capabilities, you need to identify projects that will fly below the radar, at least to some degree. Otherwise, your organization will continue to maintain expectations at the same level as for projects that are more clearly in your wheelhouse. If you can succeed at maintaining a steady trickle of capability stretching projects, however, you will be able to develop skills.
As long as you are clear that the capability to achieve the full set of expected benefits may not yet exist, completing projects that are a little outside—or sometimes a long way outside—of your capabilities is a good way to improve capabilities. The important thing is to ensure that everyone is clear that stretching capabilities is actually a more important goal for the project than the nominal project goal.
The Means—Available Data
The final part of the ways and means equation as it applies to data science is the means—data. You obviously won’t always have all the data you need to model anything your customer wants.
Both Agile and CRISP-DM offer a partial solution to this problem. In the CRISP-DM cycle, the next stage after Business Understanding is Data Understanding. By arranging with the Project Sponsors or Clients to make the Data Understanding stage a toll gate, you can manage expectations around the possibility that the data isn’t plentiful enough or of sufficient quality to support the goals of the project. Note too, that, in the CRISP-DM process, there is room at this point to iterate on the goal of the project—to nudge the goal into something that is more achievable (the essence of strategy, in a sense).
An Agile approach allows something similar by making an initial assessment of the data an early deliverable that sets the scene for future deliverables. Hence, just as we took each stage within the CRISP-DM framework as a checkpoint, we can define an Exploratory Data Analysis report as one deliverable, and a prototype of the model as a second.
At either point, we have the option to end the project, obtain different data, or pursue a different goal. If we’re happy with the prototype, the implementation can be its own set of deliverables, each becoming more complete. At any time, based on the results of model evaluation, you may decide to build in a stage of capturing or cleaning additional data.
Always remember that data is elastic. That is, when you need more data, sometimes more data can either be obtained (i.e., bought or otherwise acquired from someone who has the data now) or gathered (i.e., a sensor or procedure can be put in place to trap the data). Either way, time or money (which can be seen as having a similar equivalence to energy and matter) is frequently the only thing standing between you and more data, so if your case is persuasive enough, your organization will often get it for you.
When you are considering which data to incorporate into your model, consider whether the data you will score the model on will be of the same quality as the data you are building the model on. It’s common to train the model on a historical data set, and that data has had time to be considered and reconsidered: late-arriving data has had time to arrive, and problems with the data have had time to be corrected.
In contrast, the scoring data will be much closer to live or real time. As a result, more data may be missing, and more of the data may be incorrect or of poor quality in some other way. Keep an eye out for these sorts of problems when you intend to recommend a model for implementation.
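One way to keep that eye out is to compare data quality between the curated training extract and a sample of live scoring data before recommending implementation. The sketch below assumes pandas DataFrames; the function name and the five-percentage-point alert threshold are invented for illustration:

```python
# A minimal sketch: compare per-column missingness between the curated
# training data and a sample of live scoring data.
import pandas as pd

def missingness_gap(train: pd.DataFrame, scoring: pd.DataFrame) -> pd.DataFrame:
    """Percentage of missing values per shared column, training vs. scoring."""
    cols = train.columns.intersection(scoring.columns)
    report = pd.DataFrame({
        "train_pct_missing": train[cols].isna().mean() * 100,
        "scoring_pct_missing": scoring[cols].isna().mean() * 100,
    })
    report["gap"] = report["scoring_pct_missing"] - report["train_pct_missing"]
    return report.sort_values("gap", ascending=False)

# Usage: flag columns noticeably worse at scoring time.
# report = missingness_gap(train_df, live_sample_df)
# print(report[report["gap"] > 5])
```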
Data is the vital clay for any data science project, so assessing its availability early in the project life cycle is crucial to ensuring that your project will achieve the desired results.
However, the quantity and quality of data are not carved in stone, and additional data can often be found or gathered without incurring too large an expense. Build the right case, and the data will come.
The Project Hopper
There is always more work to be done than anyone can do. Likewise, there are more projects than anyone can complete, and something that isn’t worth doing now might be the most important thing to do in the future.
A data scientist who works within an organization can benefit from the use of a project hopper—sometimes seen in Six Sigma organizations. The idea is simply that sometimes when you have completed the Define phase or the Business Understanding phase, the decision will be made that there are other activities to work on that are higher priority. However, it is obviously wasteful to simply dump the work that has already been completed.
Instead, summarize the work that has been done so far and place it in a project hopper, which keeps at least the outline of the business understanding for later.
The hopper is an especially good place to store projects that have increasing skills as one of their primary goals. You can record the skills you hope to gain by doing the project, so that you can develop new skills for other projects likely to attract more scrutiny on a just-in-time basis.
For example, if you can see a high-profile project on the horizon that is likely to require deep learning, you can select a lower profile deep learning-oriented project from the hopper to ensure the required skills are up to scratch.
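To make the idea concrete, here is a minimal sketch of what a hopper entry might record. The fields, including the skills the project would build, and the example project itself are assumptions for illustration:

```python
# A minimal sketch of a project hopper entry, preserving the business
# understanding and noting the skills a project would build.
from dataclasses import dataclass, field

@dataclass
class HopperEntry:
    name: str
    business_understanding: str  # summary of the completed discovery work
    objective: str
    data_available: bool
    skills_required: list[str] = field(default_factory=list)
    skills_to_gain: list[str] = field(default_factory=list)  # for stretch projects

hopper = [
    HopperEntry(
        name="Warranty claims triage",
        business_understanding="Claims team wants faster routing of free-text claims",
        objective="Classify claims by likely root cause from their descriptions",
        data_available=True,
        skills_required=["natural language processing"],
        skills_to_gain=["deep learning"],
    ),
]
```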
The unicorn data scientist, who understood every facet of statistics, machine learning, and programming, is increasingly recognized as the mythical creature she always was. Validating whether the skills to perform vital projects exist within your team members needs to be a priority.
Ensuring that you maintain a steady flow of projects that have a key objective of stretching your team’s skills will ensure the right skills exist when you need them. At the same time, ensuring the rest of the organization understands which activities are within your team’s wheelhouse means that expectations remain realistic about what you are able to achieve.
The project hopper is a great strategic tool that allows the manager of a data science team to take control of the flow of their team’s work on the one hand, and on the skill building and overall direction of the team on the other. By using it properly, you will be able to successfully build new skills for your team on projects that aren’t at the center of your organization’s attention, while also completing a steady stream of projects that meet your organization’s most crucial goals.
The project hopper is also a place we can keep excellent projects that are missing one of the three elements—projects with clear objectives but no data, projects with great objectives and a viable data set but requiring skills not yet found in the team, to name two examples—but will make great projects in the future.
Summary
Project risk is often considered only from the point of view of risks to the timely completion of the project. Less often do project managers and their project sponsors properly consider the risk of completing a project that doesn’t solve the client’s problem, or one that creates a new problem.
Sometimes creating a new problem moves into the territory of “unknown unknowns,” as they were termed by Donald Rumsfeld. We can’t always prevent this happening, but we can make it less likely by carefully considering the voice of the customer.
There are a number of ways of doing just this. The Six Sigma DMAIC process emphasizes the voice of the customer in the initial Define phase and proposes some tools for better learning the voice of the customer that are applicable in some data science contexts. One of the most powerful tools available within Six Sigma is Quality Function Deployment, which employs the Seven Management and Planning Tools within the House of Quality framework to expose what the customer really values.
CRISP-DM, which has been designed as a standard framework created specifically for data mining, describes an iterative process between Business Understanding and Data Understanding. This has the advantage of allowing the data scientist to refine their Business Understanding by reference to the available data—presenting data findings aids the conversation with the client.
Finally, the Design Thinking approach that is associated with Agile promotes empathizing with the user and suggests another set of ways to get closer to the customer’s expectations.
All of these require the best use of the client’s time when you are able to spend time with them. As much as anything else, they are unlikely to have an unlimited amount of time to give you. Hence, we cover some techniques for ensuring this time is put to best use.
Data science has a reputation as being able to solve the most difficult problems available. This reputation is a double-edged sword, as while on the one hand it means that data scientists are given the chance to work with the most challenging and interesting problems, on the other hand they are given a lot of license to develop their preferred solutions.
However, the flipside of this opportunity is that sometimes the problems given are genuinely unsolvable. A notorious category of unsolvable problems is wicked problems. Identifying them gives you an opportunity to turn them down, and therefore to avoid having your reputation damaged by a problem that was unsolvable from the outset. Alternatively, you can attempt to convince your client to allow you to reframe the problem into a format that allows it to be solved. A number of techniques exist to do just this.
Although well-chosen objectives are vital to the success of a data science effort, and moreover to the perception of success, they aren’t the only important aspect of a successful data science project. In the previous chapter, we also saw how ways and means are vitally important to the overall success of a strategy.
In the context of data science, the ways are effectively the team’s skill set, and the means is the available data. At a project level, careful attention needs to be paid to whether the team has sufficient skills to complete a project. This may sometimes lead you to turn down a project.
At the same time, trying projects outside your current skill set is the best way to develop new skills. With this in mind, sometimes you will want to attempt projects clearly outside your current skill set. When taking on such a project, it is important to ensure that expectations within your organization are properly managed with respect to the likely timeline, and the probable efficacy of the final product.
In this chapter and the previous chapter, we’ve looked at how to apply strategic thinking at a data science team level, and at a data science project level. In both cases, the goal was to ensure that your efforts as a data scientist achieved their objectives and were fully appreciated.
In Chapter 3, we will look at how to sell data science teams and data science projects, so that you are able to maximize the usage, and therefore usefulness of the projects you work on. More fundamental than that, being able to sell the projects you are working on means they will get a green light to begin with.
Project Checklist
This checklist of items worth considering during projects is divided into three parts—objectives, skills, and data.
Objectives
Are there regulatory requirements, as is usually the case with financial or insurance models? What is their effect? For example, do they restrict algorithm choice, or require additional documentation or additional reporting during the model’s life?
How often will the model be updated? Possible answers could range from “never” down to every microsecond.
What are the consequences when (not if!) the model is incorrect? Nothing? Someone loses some money? Someone loses their life (could be the case, for example, for a medical diagnosis model)?
What volume of data will be fed to the model? What turnaround time is acceptable for the model’s results?
How will users access the results?
How much access will the data science team have to end users? Will the data science team be able to access the end user more than once during the project life cycle?
Skills
Do the skills exist currently within the team?
Are the people with the right skills also the people who are available?
What will the consequences be if the project is not completed?
Is the project urgent?
How difficult would it be to hire a temp? How much of a delay would hiring a temp cause to the project?
Data
Has the team worked with this data set before?
What is the provenance of the data? How likely do you believe it is that the data set is of good quality before you begin to explore it?
Would you get a better result with more data? How much would it cost to gather more data? How long would it take?
If you incorporate this data into your model, will you have permission to use the data when the model is implemented?
When you implement the model, will the data be refreshed as often as you need the model to be refreshed?