Chapter 16

Narrowing In on the Optimal Data Science Use Case

IN THIS CHAPTER

check Assessing the current state

check Selecting a data science use case

check Analyzing the data skill gap

check Evaluating AI ethics

Selecting the best data science use case to implement first (or next) is one of the most important aspects of setting up your projects for success. To be in a position to do that successfully, you need to assess your company based on the evidence you’ve collected about its current state. Assuming that you’re following the steps prescribed by the STAR framework, which I discuss in Chapter 15, at this point you’ve been authorized to propose a new data science project, you’ve surveyed the industry, and you’ve taken stock of your company’s current state, as shown in Figure 16-1.

At this point in the process, you’re ready to assess your company and identify the lowest-hanging-fruit data use case — the focus of this chapter. In Chapter 17, you can see how to make recommendations based on your assessment.

FIGURE 16-1: Assessing your company’s current state.

Reviewing the Documentation

If you follow the instructions I lay out in Chapter 15 for researching your company, you end up with a lot of documentation describing the inner and outer workings of your company. That’s great! That’s just the strong foundation you need for meaningfully and accurately assessing your company’s current state. If you’re not familiar with that term, it’s just consulting vernacular for the state of your company as it is now. It’s juxtaposed with a company’s future state — the status of the company at some defined point in the future. When a company hires a consultant — data or otherwise — it’s hiring someone to provide advice, strategic plans, or training (or all three) to help the company bridge the gap between its current state and future state.

After you have all that documentation on hand, it’s time to review it. That means thoroughly reading every document you collected about your company. You can produce a SWOT analysis as you work through it, in order to make note of the key findings that jump out at you. (In case the term SWOT is new to you, it’s just a simple set of notes about your company’s strengths, weaknesses, opportunities, and threats that you notice while you’re working your way through the documentation.) These notes help you summarize your findings so that you can more easily identify the lowest-hanging-fruit data science use case.

Selecting Your Quick-Win Data Science Use Cases

After you have thoroughly reviewed all the documentation you’ve collected and have produced a high-level SWOT analysis, you should be in a good position to select some contenders for the lowest-hanging-fruit data science use case.

You can do that by bringing to mind the data science use cases you surveyed during Part 1 of the STAR framework. (For more on the STAR framework, see Chapter 15.) Consider all the data resources, technologies, and skillsets that are required in order to implement each case. If you conclude, based on your documentation review, that your company lacks (or cannot easily acquire) the data resources, technologies, and skillsets required in order to implement a particular data science use case, throw it out. It isn’t the lowest-hanging-fruit use case you’re looking for.

Tip Review data science use cases thoroughly when you work through the Survey step of the STAR framework, for two reasons:

  • You want to have an idea of the best and worst of what's possible for your company with respect to the data science use cases it chooses to implement.
  • You need a sufficient number of data science use cases to choose from so that you can take “the best of the best” and quickly discard the ones that don’t represent a quick win for your company.

After you’ve sifted through relevant data science use cases and removed the ones that aren’t promising, you then have to apply your best judgment to select the top three data science use cases that seem the most promising for your company — based on what you know about its current state. These three use cases are the final contenders for the lowest-hanging-fruit data science use case for your company’s next successful data science project.

Zeroing in on the quick win

After you’ve narrowed the list to a set of three promising data science use cases, you then select the absolute best, most promising data science use case. If this is your first time being strategic about your data science project planning, you most definitely want this use case to be a quick win for your company. (When I say “quick win,” I mean that you want the project to produce a positive, quantifiable return on investment, or ROI, in its first three months — that kind of early success earns you the confidence from business leaders that you need in order to lead even bigger data science projects.)

To assess these top three data science use cases against your company’s current state, it’s prudent to produce an alternatives analysis, or at least some draft versions of one. For each of the potential data science use cases, first consider your working knowledge of your company as well as your findings from the documentation review, and then answer the following questions (one simple way to tabulate your answers appears in the sketch after this list):

  • Is your use case based purely on big data technologies, or does it entail data science too?
  • How much data would this project require?
  • How much of that data does your company own?
  • Would you need to acquire outside data resources in order to implement this data science use case? If so, how feasible is it, in terms of cost and time, for these resources to be acquired?
  • Which data skillsets do you have on staff to help implement this data science use case?
  • Which skillsets would you need to acquire? How feasible is it, in terms of cost and time, for these skillsets to be acquired?
  • Which data technologies will you need in order to implement this data science use case?
  • Which technologies do you already own, and which ones will you need to acquire? How feasible is it, in terms of cost and time, for these technologies to be acquired?
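
If it helps to keep your answers organized, here’s a minimal sketch of one way to tabulate an alternatives analysis as a simple weighted scorecard. It’s written in Python, and every use case name, criterion, weight, and score in it is a made-up placeholder for illustration only; swap in your own.

```python
# A minimal sketch of an alternatives-analysis scorecard.
# All use cases, criteria, weights, and scores below are hypothetical examples.

CRITERIA_WEIGHTS = {
    "data_owned": 0.3,       # how much of the required data you already own
    "skills_on_staff": 0.3,  # how many required skillsets already exist in-house
    "tech_in_place": 0.2,    # how much required technology you already run
    "time_to_roi": 0.2,      # how quickly a positive ROI seems achievable
}

# Score each candidate use case from 0 (poor) to 5 (excellent) on each criterion.
candidates = {
    "churn_prediction":   {"data_owned": 5, "skills_on_staff": 4, "tech_in_place": 4, "time_to_roi": 5},
    "demand_forecasting": {"data_owned": 3, "skills_on_staff": 3, "tech_in_place": 2, "time_to_roi": 3},
    "support_chatbot":    {"data_owned": 2, "skills_on_staff": 1, "tech_in_place": 1, "time_to_roi": 2},
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores into a single weighted number."""
    return sum(CRITERIA_WEIGHTS[criterion] * score for criterion, score in scores.items())

# Rank candidates from most to least promising quick win.
for name in sorted(candidates, key=lambda n: weighted_score(candidates[n]), reverse=True):
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```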

Answering these questions should help narrow your data science use case options, but I suggest whittling them down further by using a POTI model.

Producing a POTI model

Before doing a final data science use case selection, you need to assess how each of the potential use cases will affect your company’s POTI — that is, its processes, organization, technology, and information. Let’s look at each of these factors in greater detail (a structured way to capture your notes follows the list):

  • Processes: Before deciding on a data science use case, you need to look at your company’s operational models and processes. Describe how this data initiative will impact them. Which current business processes will be impacted by the implementation of this new data project? How will these business processes be modified or extended after the new data system is up and running?
  • Organization: Describe how this data initiative will impact the people and culture at your organization. Which people and roles will be impacted by the implementation of this new data project? Document anticipated changes to these elements:
    • Data skillset requirements
    • Business culture
    • Hiring requirements
    • Team member repositioning
    • Training requirements
  • Technology: Describe how this data initiative will impact technology across your organization. Which technologies will be required, eliminated, or augmented during the implementation of this new data project? How will these changes impact other business units or projects? What needs to be done to mitigate disruption of services across the business?
  • Information: Which information or data will your company require, in its future state, for this data science project to be considered a success? Describe how this data initiative will impact the type of information delivered to people and to business systems across your organization. Which information sources will be discarded? Which will need to be created? What changes are there to who gets which information? With respect to machine-to-machine systems, what changes take place in where information is sent? How frequently does information need to be delivered? How do these changes in information requirements impact technology, people, and processes across the organization?
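
If you’d like to capture your POTI notes in a consistent, reviewable format, here’s a minimal sketch of one way to do so. The field names and example entries are assumptions for illustration, not a prescribed template.

```python
from dataclasses import dataclass, field

# A minimal, hypothetical structure for recording a POTI assessment of one
# candidate use case. Field names and sample entries are examples only.

@dataclass
class POTIAssessment:
    use_case: str
    processes: list[str] = field(default_factory=list)     # impacted business processes
    organization: list[str] = field(default_factory=list)  # skills, hiring, training, culture
    technology: list[str] = field(default_factory=list)    # tech required, eliminated, or augmented
    information: list[str] = field(default_factory=list)   # information changes and who receives them

example = POTIAssessment(
    use_case="churn_prediction",
    processes=["Retention team adds a weekly review of model-flagged accounts"],
    organization=["Two analysts need training on the new scoring workflow"],
    technology=["Existing data warehouse reused; new scheduled scoring job required"],
    information=["Churn scores delivered daily to the CRM and to account managers"],
)
print(example)
```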

Tip The data dictionary I recommend you use in Chapter 15 should come in handy here. Reference it when you’re assessing the Information portion of the POTI model.

Based on your answers to these questions, it should be rather straightforward to pinpoint a data science use case that offers the greatest potential ROI for your company, in the shortest amount of time, with the lowest level of risk.

From there, I suggest one more gut-check assessment question: Does the data science use case support your company in better reaching its business mission and vision? If the answer to this question is yes, then let this use case be your lowest-hanging-fruit data science use case. If the answer is no, step back to the second-most-promising data science use case.

Warning A data science project ultimately has two goals: increase revenue for the company and support the company in reaching its vision. Always be mindful of, and prepared to discuss, exactly how your data science projects are accomplishing these two goals. If you can’t do that, you may find yourself in hot water sometime in the near future.

If you’ve made it to this point in your data science journey, congratulations! You’ve identified the lowest-hanging-fruit data science use case for your company. That’s a huge deal — and a lot of work you’ve accomplished to reach this point. If you’ve followed the documentation collection-and-review processes as I’ve described them, your decision-making will be well-supported with ironclad evidence. Furthermore, you created draft assessments before selecting a final use case. Be sure to keep all your documentation and draft assessments in a safe place. I recommend that you make them addendums to your data science project plan so that you have a full, comprehensive body of evidence to support any recommendations you might make. Before you reach that point, however, you still need to complete the assessment phase. The act of selecting your use case only marks the halfway point when it comes to assessments within the STAR framework.

Tip If you made proper requests for information and could not gain access to all the documentation you need, document that aspect as well. Retain email records and meeting minutes showing that you did your due diligence and that you simply were not granted the access you needed. If the documentation omission results in a downstream problem in your data science project plans, that paper trail will place responsibility with the responsible party (not with you!).

Picking between Plug-and-Play Assessments

The type of assessments you need depends heavily on the data science use case you’ve selected. Though you won’t find a one-size-fits-all approach, I offer you some suggestions for assessments that I feel would be helpful in a wide variety of data science projects.

Warning Most data professionals aren’t working in a vacuum, where they have all the power and say-so about their company’s data operations. To the contrary, I expect that most readers of this book are working in larger organizations where corporate politics menace their every move. The assessments I suggest in the remainder of this chapter are just that — suggestions. If you have the power and initiative to conduct these assessments, I am sure you will find them immensely helpful. But if you don’t, just do what you can and move on.

The following assessment protocols are meant as helpful plug-and-play suggestions that you can apply according to your company’s needs.

Carrying out a data skill gap analysis for your company

The STAR framework, which I talk about in Chapter 15, offers you a clear, repeatable process that you can use whenever you need to plan out a data science project. As part of this framework, you take stock of the data skillsets of the people who work at your company. Because you’ve already surveyed these individuals, you have a pretty good idea of the existing competencies to be found within the relevant human capital at your company. You also have already chosen a data science use case. With that use case selection, you’ve also narrowed in on a range of skills you’ll need in order to support the data science project. In the best-case scenario, your company already has people with those skills, and those people have the capacity to help support your data science project. Another favorable outcome is that your company has people with the basic prerequisites needed to learn the data skills your project requires. And, in the worst-case scenario, your company needs to make one (or a few) new hires.

Warning It’s vitally important to select the lowest-hanging-fruit use case so that you can achieve quick wins for your company without needing to make expensive and risky new hires or acquisitions. Especially if this is one of your first times leading a new data science project, I strongly recommend that you not select a data science use case that requires your company to acquire new employees, technologies, or data resources. That’s a good way to get your project put on hold indefinitely.

Assessing the data skillset requirements for the project is straightforward. You need to look at the technologies the project will require and the data science modeling approaches that are implicit in the use case you selected. With that step out of the way, all you then need to do is cross-reference those skillset requirements with your survey results to see who at your company can help deliver this project.
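
As a rough illustration of that cross-referencing step, here’s a minimal sketch. The skill names and survey responses are hypothetical, and in practice your survey results will probably live in a spreadsheet or HR system rather than in a dictionary.

```python
# A minimal sketch of a data skill gap analysis.
# Skill names and survey responses below are hypothetical examples.

required_skills = {"python", "sql", "time series forecasting", "tableau"}

# Survey results: who reported which skills.
staff_skills = {
    "Priya": {"python", "sql", "tableau"},
    "Marcus": {"sql", "excel"},
    "Dana": {"python", "time series forecasting"},
}

# For each required skill, list the people who could help deliver it.
coverage = {
    skill: [person for person, skills in staff_skills.items() if skill in skills]
    for skill in required_skills
}

# The gap: required skills that nobody on staff reported.
gap = {skill for skill, people in coverage.items() if not people}

print("Coverage:", coverage)
print("Skill gap:", gap or "none -- a good sign for a quick win")
```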

Remember If you need your data science project to score a quick win for your company, you simply have to already have the requisite data skillsets, technologies, and resources on hand. There’s no time for training, creating service agreements, or other related tasks. So, for your first few data science projects, make sure you select use cases that your company can implement right away and see a positive ROI within three months.

As you gain more experience and trust with respect to planning and leading data science projects, you earn a little more of an allowance in terms of resource allocation (or reallocation, as it were). In this situation, if your company has people who have the exact skills you need, you also need to address the other sticking point — their availability. If those individuals are completely locked down in supporting other teams and projects, adding one more project to their plate will be a tough sell to their superiors. You have to decide whether it's worth the risk to hire someone new to help. In other cases, you may find people with the available capacity who just don't have the exact skills that are needed. If it’s possible that, with training, these individuals could support your data science project, that would eliminate the need for making a costly new hire. It would also create more value from that person’s time (and your company’s investment in retaining that time). In this case, you’d want to assess your survey results and produce a training plan for each of these workers in order to take them from their current competency to the higher competency levels your project requires.

Assessing the ethics of your company’s AI projects and products

If you’re working at a company that truly supports its mission, vision, and values, and its leaders are data literate, those leaders will back initiatives that inquire about and reinforce higher ethical standards with respect to the company’s AI solutions. Unfortunately, most leaders aren’t all that data literate (which represents another opportunity for improvement in most companies’ data strategies). In such cases, speak to them in a language they understand. Speaking strictly from a business perspective, gaps in AI ethics represent significant reputational risk to the company. If your company uses AI technology in a way that produces inequitable or biased outcomes and that fact comes to light, you can expect the company to appear in media headlines — somehow, somewhere.

Remember AI is inherently risky — it’s simply prudent to take proactive measures to mitigate any ethical issues related to your company’s AI solutions. In this section, you get a head-start in assessing the ethics of your company's AI solutions.

In Chapter 15, I talk about how important it is for you to itemize all your company’s active AI solutions as well as to collect reports and documentation describing the machine learning processes and model metadata that each of these AI solutions deploys. I also mention that, for each AI solution, you should collect a user manual (so to speak) that explains how the solution works. Lastly, I make it clear in Chapter 15 that you have to gather any information that references potential biases produced by these solutions. After you’ve gathered all this requisite information, you can use it to do a preliminary assessment of your company’s AI ethics. The next few sections point you in the right direction.

Illustrating the need for ethical AI

Accountable, explainable, unbiased: These words represent what your company’s AI solutions need to be in order for them to be truly ethical. But what does “accountable, explainable, unbiased” AI actually mean, and why does it matter to real-life human beings? Let me explain with a true story.

Remember Though this scenario actually happened to someone in real life, the names and demographic details of the people involved are fictional.

Imagine yourself as a healthcare provider who’s been tasked with treating Mr. Smith, a 65-year-old who has already been diagnosed with lung cancer and who experiences severe bleeding. You’re working with one of the leading oncology clinics in the US, so you’re privy to all the latest-and-greatest technologies. The latest gizmoid acquired by your clinic is IBM Watson for Oncology. You’ve been instructed to consult with this cutting-edge, costly software when making your treatment recommendation.

You follow orders, so you go in and feed Mr. Smith’s patient data into the machine and await its recommendations. A few minutes later, it spits out a recommendation that the chemotherapy drug Bevacizumab should be administered to Mr. Smith as a form of treatment for his lung cancer. You stand back in complete shock because, experienced oncology specialist that you are, you know that Bevacizumab has a black box warning that it should never, ever, be used to treat cancer patients who experience severe bleeding.

This expensive technology is supposed to improve the quality and safety of the medical recommendations you make. Instead, its recommendation is downright dangerous. What if you hadn’t been educated and aware of the medical contraindication yourself? Do you see what a huge liability this machine could be setting you up for? Not to mention the negative impact on the lives of the people who depend on you to survive. I am sorry to break it to you, friends, but this scenario actually happened in real life. It’s just one example of the real and present dangers you expose yourself to when you depend on AI solutions in the healthcare industry. Implicit risks like these, however, are baked into AI solutions used across every industry in existence. As users and beneficiaries of AI solutions, everyone must be extremely vigilant about their implicit risk.

What does that look like? I mean, what do you look for to determine whether an AI solution is trustworthy? Well, for starters, you have to take proactive measures to make sure that your AI system is, as I say a little earlier, accountable, explainable, and unbiased. (For more on this topic, see Chapter 15.)

Proving accountability for AI solutions

How can you identify whether your AI system is indeed accountable? You start by thoroughly reviewing the documentation you collected about the accountability of your company’s AI solutions. Those are the reports you collected that describe the machine learning processes and model metadata that each of these AI solutions deploys. First read the documentation yourself, and jot down notes inside a draft SWOT analysis, just to get your own thoughts recorded, in a first pass, on paper. Then make another pass through the documentation — this time, answering the following questions:

  • Does each active AI solution come complete with its model metadata? And does that metadata detail where the datasets are stored, their filenames and version numbers, variable names, and some basics about the underlying dataset distributions?
  • Does the documentation detail what type of machine learning models are being used by the system? How about the steps that are required to preprocess the data before using it in the machine learning model? Does it include the parameter settings within the models? All this information is helpful if you need to either compare the system to others or rebuild it sometime in the distant future. (Heaven forbid!)
  • In terms of context, does the model metadata detail factors such as the programming language and version used to build the model, the source code, any dependencies, or the CPU/GPU and operating system in which the models were developed?
  • With respect to metrics, does the metadata detail the metrics that were used to evaluate the models that are included in your AI system?
  • Does the system come accompanied by a brief video that describes all these details?

You probably have guessed it, but your answer to all these questions should be a resounding yes! If any of the answers comes out as no, you have a gap in the accountability of your company’s AI solutions. This gap represents risk to the business and should be remedied.
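
If it helps to make those questions concrete, here’s a minimal sketch of what a model metadata record might look like. The fields loosely mirror the questions above, and every value shown is a made-up example rather than a required standard.

```python
# A minimal, hypothetical model metadata record for one AI solution.
# Every value is illustrative; adapt the fields to your own documentation standards.

model_metadata = {
    "model_name": "claims_triage_classifier",
    "model_type": "gradient-boosted trees",
    "datasets": [
        {
            "path": "s3://example-bucket/claims/2023.parquet",  # where the data is stored
            "version": "v3",
            "variables": ["claim_amount", "provider_id", "diagnosis_code"],
        },
    ],
    "preprocessing_steps": [
        "impute missing claim_amount with the median",
        "one-hot encode diagnosis_code",
    ],
    "parameters": {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.05},
    "context": {
        "language": "Python 3.11",
        "dependencies": ["scikit-learn==1.4"],
        "hardware": "CPU only",
        "os": "Ubuntu 22.04",
        "source_code": "https://example.internal/repo/claims-triage",  # placeholder URL
    },
    "evaluation_metrics": {"auc": 0.87, "recall": 0.79},
}

# A quick accountability check: flag any section that is missing or empty.
missing = [section for section, value in model_metadata.items() if not value]
print("Missing metadata sections:", missing or "none")
```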

Vouching for your company’s AI

Let me bring up the elephant in the room here: General Data Protection Regulation, otherwise known as GDPR. I talk about data privacy in Chapters 14 and 15, but never go so far as to name exact regulations. (There are a lot of them!) GDPR is a mammoth in terms of its elephant-ness, though, and it’s one of the main drivers behind the need for explainable AI, so it pays to look at it in greater detail.

GDPR asserts data privacy rights for people in the EU, no matter where in the world their data is processed. According to Recital 71 of GDPR, “[the data subject should have] the right … to obtain an explanation of the decision reached” if their personal data was used in any part of reaching that decision — period. End of story.

GDPR extends to these individuals the right to an explanation anytime a predictive model is used in making a judgment about them. This right stipulates that they can demand an explanation for any judgment that impacts them, including judgments made for reasons such as credit risk scoring, autonomous vehicle behaviors, healthcare decisions, and what-have-you.

As for the financial risk to companies found in violation of GDPR, Article 83(4) of GDPR states that infringements shall be “subject to administrative fines up to 10 000 000 EUR, or … up to 2% of the total worldwide annual turnover of the preceding financial year, whichever is higher.” Enough said?
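
To make the “whichever is higher” arithmetic concrete, here’s a tiny sketch with made-up turnover figures:

```python
# Illustrative only: the Article 83(4) ceiling is the higher of a flat amount
# and 2% of worldwide annual turnover. The turnover figures below are made up.

def article_83_4_ceiling(annual_turnover_eur: float) -> float:
    return max(10_000_000, 0.02 * annual_turnover_eur)

print(article_83_4_ceiling(200_000_000))    # 10,000,000 EUR (the flat amount is higher)
print(article_83_4_ceiling(1_500_000_000))  # 30,000,000 EUR (2% of turnover is higher)
```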

Now that you understand the gravity of this explainable AI matter, I want to show you a couple of the ways that you can identify explainability in your company’s AI systems. Read the documentation you’ve collected, and then answer the following questions:

  • Do all of your company’s active AI solutions come with a manual that explains how the predictions are generated by the machine?
  • Is this manual written in plain English so that people who aren’t data professionals can comprehend it without having to take a course in data literacy?

Having this manual comes in handy when inevitable questions arise about how and why recommendations are being made. In that case, the appropriate decision-maker can just explain their judgment and how the AI solution impacted it, providing a copy of the plain-language manual to supplement that response. In most cases, this type of explanation should be more than sufficient.

Warning Across the AI community, some deliberation takes place about whether AI systems really need to be explainable and interpretable. Because many data scientists are using a black box approach (they don’t know all the details of the math and code implemented within the model), of course they can’t explain those models. This isn’t a safe approach to AI.

Designated representatives of your company must be able to explain how its AI solutions work. If your company uses a vendor AI solution, the vendor who supplies the AI system must provide a plain-language manual explaining how it works, and in a way that makes sense to non-data people. If they can’t, it’s time for you to seek alternatives.

Unbiasing AI

One more characteristic of ethical AI: Your AI system needs to produce unbiased results. This one is tough to explain, so let’s start with a fictitious example.

An elderly man named Tom has been admitted to the hospital for medical testing related to a bronchitis diagnosis. The testing takes longer than expected, and Tom’s loving family desperately wants him to be released to them at home. In a few days, when it comes time for Tom to be released, he is discharged into a state-run nursing facility and isn’t permitted to go home. When asked about this decision, the healthcare provider thoughtfully explains that, because Tom’s household income is low, the AI system has predicted that he won’t receive adequate support and care at home. The AI has determined that Tom should be discharged to a state care facility — where he will subsequently incur higher costs of services and face a greater chance of readmission to the hospital. This is an example of an incredibly biased outcome.

Bias can be integrated into a predictive model in three main ways, when any of the following is true:

  • The training data isn’t representative of the general population.
  • The AI relies heavily on demographics when making its predictions.
  • The cognitive biases of the data professionals whose work shapes the project seep into the model.

    Examples of variables that are at high risk of producing bias are race, gender, skin color, religion, national origin, marital status, sexual orientation, educational background, income, and age. Generally speaking, if your model requires these variables, be sure to evaluate its results for unfair bias. If you find bias, you have to go back to the drawing board and find another way to generate results — one that doesn’t unfairly discriminate against people based on their demographics.

As to what to look for when assessing whether your AI systems are biased, your best bet is to rigorously test the solution in trials that approximate real life as much as possible and then make some personal judgments about its outputs. It would be worth your time to assemble a user group to evaluate the level of potential bias. Here are some questions to consider when evaluating bias in AI outputs (a simple quantitative check follows the list):

  • Is the model’s training data representative of the general population?
  • Are the current AI systems relying heavily on demographics to make predictions about people?
  • What are the downstream implications of the outputs generated by the AI system?
  • What would a counterbiased system recommend?
  • What are the potential ramifications with respect to systemic bias, if this AI is implemented across a large swath of the population? Do you think it’s fair? Would your peers think so?
  • How is your organization working to ensure that its current AI projects produce unbiased results?
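
One simple quantitative check you can run alongside these questions is to compare favorable-outcome rates across demographic groups. Here’s a minimal sketch with entirely fabricated numbers; in practice you’d compute the rates from your model’s actual outputs, and the 80% threshold is a common rule of thumb rather than a legal standard.

```python
# A minimal sketch of a demographic-parity check across two groups.
# The group names and rates below are fabricated for illustration.

favorable_rate = {
    "group_a": 0.42,  # share of group A receiving the favorable outcome
    "group_b": 0.29,  # share of group B receiving the favorable outcome
}

disparate_impact_ratio = min(favorable_rate.values()) / max(favorable_rate.values())

print(f"Disparate impact ratio: {disparate_impact_ratio:.2f}")
if disparate_impact_ratio < 0.8:  # the informal "80% rule" threshold
    print("Potential bias flagged -- investigate before trusting these outputs.")
```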

If, by exploring these questions, you find troubling answers, I suggest that you take it on yourself to start brainstorming which measures your company can put into place to ensure that future AI projects are accountable, explainable, and unbiased. These are exactly the types of suggestions you’d want to include when you produce recommendations, a topic I cover thoroughly in Chapter 17.

Assessing data governance and data privacy policies

Here’s the simple truth: You can’t have ethical AI without good data governance supporting it. Why? Because in order for your AI to be explainable, unbiased, and accountable, it must be built from high-quality data that has been properly documented and maintained. If it helps you grasp this concept, think of data governance as a type of chain of custody that secures the integrity and reliability of your organization’s data. To build explainable AI, you need to understand and, more importantly, trust the data that goes into it.

Being in a position to explain your AI to any doubters is one of the main benefits of insisting on good data governance, but there are other benefits as well, such as these:

  • Team members across the organization will trust the data as a credible and reliable source of information.
  • Your organization will have a unified set of data definitions. In other words, everyone will speak the same language when making reference to your organization’s shared data resources.
  • Your organization’s data will be maintained and will function as a consistently reliable source to end users.
  • You will have processes in place to centralize and simplify regular reporting processes.

You can assess your company’s data governance by simply looking to see whether it exhibits the characteristics of good data governance I just outlined. Some signs of overly lax governance policies are duplicate datasets running amok, poor data quality, and overall unmanageability in a company’s data operations. Excessively strict data governance often makes itself known in the form of data bottleneck problems, where users can’t access data without submitting a formal request to the IT department — a bureaucratic hoop that takes a long time to jump through. The result is that you and all the other business users wait in line to gain access to the basic data needed to do your jobs. If you see gaping data governance problems inside your company, you definitely need to address them in the planning and recommendations of your data science project. (One quick way to spot the duplicate-dataset symptom appears in the sketch that follows.)
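
If you want a quick, concrete way to spot that duplicate-dataset symptom, here’s a minimal sketch that hashes the files in a folder and reports byte-identical copies. The folder path and the .csv filter are placeholders; point it at whatever shared drive or export folder your teams actually use.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# A minimal sketch for spotting byte-identical duplicate data files.
# "/shared/data" and the .csv filter are placeholders for illustration.

def find_duplicate_files(root: str) -> dict:
    hashes = defaultdict(list)
    for path in Path(root).rglob("*.csv"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        hashes[digest].append(str(path))
    return {digest: paths for digest, paths in hashes.items() if len(paths) > 1}

for digest, paths in find_duplicate_files("/shared/data").items():
    print("Identical files:", paths)
```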

As you might expect, data governance policies don’t appear out of thin air — somebody has to come up with them. That’s where the data governance council — a team of elected individuals who together decide on a company’s data governance policies — comes into play. Any mature company has a data governance council in place, but if your company is smaller or newer, you probably need to form one. When doing so, assess and identify the individuals best suited to serve on the council.

For starters, you should select people who have some mixture of the following characteristics:

  • Educated and experienced in developing data policies and procedures
  • Capable of good communication with a knack for getting along with others

    Look for some training background (so that the person can help educate the rest of the organization about the importance of data governance).

  • Experienced with working with data coming from a wide variety of business units at your organization
  • Cognizant of operational ROIs related to data governance

Data governance policies — or data policies, for short — are simply policies that document the rules and processes that should be followed in order for your data resources to remain consistent and of high quality. Document these rules and processes for every major data-intensive activity that happens within your business. Such rules and processes are often referred to as data governance standards. A mature company should be in a position to bring together the documentation for the entire set of major business activities into one data governance document — that’s your organization’s data policy. After being compiled, the data policy must be maintained and adjusted as your organization changes. And again, data policy is but one of the fundamental constituents of solid data governance.

Remember In this book, I don’t tell you how to build solid data governance from an implementation perspective. Data governance is, however, a prerequisite to ethical AI. By selecting the right people to spearhead your data governance council, you set up your organization for success. And those right people? They should understand and make decisions to support the data science implementation requirements for your organization.

Tip When it comes to your company’s data privacy policies, that’s a topic you definitely want to leave to the lawyers, because it’s one that only they can develop, in accordance with legal requirements. Your task is to assess how well those policies, once developed, are adhered to within your company. When looking to assess your company’s data privacy policies, you may want to start by answering the following questions:

  • What, exactly, are your company’s data privacy policies?
  • How are these policies enforced?
  • Are there any holdups or gaps in how the policies are enforced?
  • Does the data privacy policy have any gaps that may pose a later financial or reputational risk to your company?

Document any potential gaps, omissions, or risks you can think of so that you can address them in the recommendations you’ll surely make in line with your upcoming data science project. (For more on how to fashion these recommendations, check out Chapter 17.)