I am certain you already know all the hype about Big Data and may have even achieved some success with Big Data technologies. But I’m here to tell you that the Big Data opportunity is not just about the new technological capabilities to process and analyze large volumes, variety, or velocity of data. In some organizations, these technology capabilities may be catalysts, but deploying them and even having the talent to leverage them are not sufficient to drive business value, enable new product offerings, demonstrate competitive advantages, or enable smarter decision making.
The Big Data opportunity is better described by how organizations get value from the data they already have, data that they can develop or create, or third-party data that they can integrate. Here’s a definition of Big Data that I offered on my blog:
Big Data Defined—Big Data is not defined by its data management challenges but by the organization’s capabilities in analyzing the data, deriving intelligence from it, and leveraging it to make forward-looking decisions. It should also be defined by the organization’s capability in creating new data streams and aggregating them into its data warehouses.
This definition covers three elements of change. First, there are the skills and organizational changes required to analyze data, derive insights, and encourage data-driven decision making. This is what is typically called the data-driven organization, and I’ll be covering why this type of culture is critical for organizations driving digital transformation. More importantly, I’ll cover how to cultivate a data science program and how to enable it through technology, practices, and governance.
The second element covers a capability that data-driven organizations must mature: the ability to source and aggregate new data sources, integrate new analytics capabilities, and partner with third-party analytical services.
The final element is using data for strategic and competitive advantage by enhancing existing product offerings or developing new ones. It means embedding data collection into products so that you can personalize user experiences and delight customers. It implies that you can collect operational metrics and process them in real time to improve the performance of a system or component. It enables you to deliver new intelligence to your customers by aggregating and anonymizing the data that you collect.
Underpinning all of this are new data technology capabilities, beyond the volume, velocity, and variety of Big Data that I covered in Chapter Three. It’s aggregating data from massive arrays of sensors (IoT), processing it for macro and micro responses, and sending real-time controls back to the appropriate environments. It’s new predictive analytics capabilities, including machine learning, deep learning, and artificial intelligence, that can provide capabilities beyond human intelligence and enable personalization across large amounts of disparate data. It’s the ability to participate in the API economy and digital ecosystems where organizations can sell, share, and buy algorithms and capabilities.
Before we can talk about becoming a data-driven organization, establishing Big Data capabilities, enabling data scientists, or developing revenue-generating products and services from an organization’s data, it’s important to look at the organization’s underlying data legacy and culture.
It’s not as if most organizations have zero data analytics capabilities. Most have them in some form and probably believe that they are data driven based on their existing talent, capabilities, and tools. But the issue with data technologies and practices is that it’s not just about the ability to produce insights; it’s whether the underlying process is accurate, repeatable, efficient, and scalable.
Are the analytics accessible only to the person who performed the work, or can they be reviewed and extended by others? How many others can use the final analytics for decision making? How often is the data refreshed, and how many human and machine hours does it take to process? Are data definitions, calculations, and assumptions clearly defined and understood? Is there an underlying process to measure and improve data quality? Who has access to this data? How can the data legally be used, and what security, privacy, or other compliance is required for it?
Our legacy of data analytics and processing doesn’t easily address these issues. The tools used in the past led to the proliferation of duplicate data, data silos that are hard to extend, and analytics that are hard to repeat. Data integration often requires manual steps, and the automated ones were rarely implemented with data validation or with scalability in mind. Most organizations don’t know what data exists in their enterprise or how they can leverage it for competitive advantage.
Over the last few years, I’ve written on my blog about these different issues with more specific examples. First, let’s look at the impacts of two of the most widely used data analytics and database tools across organizations of all sizes. Then I’ll share some specific concerns of how most organizations implement data integration. Lastly, I’ll provide my definition of an organization’s “dark data” and its implications.
How many spreadsheets do you have on your network? Do you champion analysts who demonstrate wizardly skills performing calculations, building pivot tables and macros, and formatting spreadsheets? One of the most trafficked posts on my blog is a letter to spreadsheet jockeys1 telling them that they need to expand their skill set in order to leverage Big Data technologies, reduce human error, and collaborate in a data-driven organization.
The European Spreadsheet Risk Interest Group2 has published a number of horror stories stemming from poorly configured spreadsheets.3 The Wall Street Journal reported that a spreadsheet error cost Tibco shareholders $100 million. A $6 billion trading loss was at least partially due to modeling flaws in Excel.4 A flaw in Harvard economists’ Excel spreadsheet discredited their highly cited conclusion that higher government debt is associated with slower economic growth.5 Barclays unknowingly picked up 179 additional Lehman contracts because the data was marked as hidden in a spreadsheet.6 Genetic researchers incurred errors in their gene lists when Excel automatically converted gene names to calendar dates or numbers.7
How bad can it get? Think Enron bad. Felienne Hermans of Delft University of Technology analyzed the 265,586 attachment files that were made public, of which 51,572 were Excel files and 16,189 were unique spreadsheets.8 The full analysis is available in the published report, showing 2,205 spreadsheets that contained at least one Excel error and 755 files that had more than 100 errors.9 If that wasn’t bad enough, The Telegraph reports, “Some people were sending more than 100 spreadsheets back and forth on a daily basis which proves there was no agreed system or standardized way of working.”10
This is not just a problem for big companies or researchers. As reported in Inc., “Spreadsheets Are Destroying Your Business”, “Spending valuable time working on inaccurate, static spreadsheets is a waste of your hard-earned dollars.”11 An estimated 88% of spreadsheets have errors in them,12 so do you think your team of spreadsheet jockeys can beat the odds?
Think of spreadsheets as algorithms. Instead of “code,” you have formulas and possibly even scripts, but with few tools to debug or diagnose errors. Unlike a programmed application, there is also the potential for manual errors introduced through copy/paste or when a formula is applied to a range with missing cells. Most programmers today separate data, business logic, and presentation, but Excel merges all three, providing speed and convenience at the expense of manual errors and complexity. Finally, spreadsheet sharing is a pandemic that is very difficult to control. Even Cloud-based spreadsheet tools like Google Sheets or Excel in Office 365 cannot fully prevent people from copying worksheets and sharing derivative works.
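To make the contrast concrete, here is a minimal Python sketch of that separation (the file sales.csv and its columns are hypothetical), where data, business logic, and presentation each live in their own layer. In a spreadsheet, all three collapse into the same grid of cells:

```python
import csv

# Data layer: a plain file that can be versioned and audited
# (sales.csv and its columns are hypothetical).
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Business logic layer: one testable function instead of
# formulas scattered across cells.
def quarterly_revenue(rows, quarter):
    return sum(float(r["amount"]) for r in rows if r["quarter"] == quarter)

# Presentation layer: generated from data and logic, never hand-edited.
for q in ("Q1", "Q2", "Q3", "Q4"):
    print(f"{q}: ${quarterly_revenue(rows, q):,.2f}")
```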
But executives can’t get off the juice of quick and cheap analysis. They can’t see the impact of errors, the time applied to creating and recreating the analytics, and the lost opportunities when access to data is limited to email distributions. The need for business-created analytics is real, but ungoverned spreadsheets are no longer the solution.
And spreadsheets aren’t the only issue.
I also wrote about the SadBA,13 the self-appointed database business analyst who creates single-purpose databases to address an analysis or a workflow. Many companies allowed business analysts to use Microsoft Access and other desktop database tools when they needed to do something more complex than what could be done in a spreadsheet. I inherited more than 40 of these siloed databases at one of my organizations, and I know other CIOs and Chief Data Officers are also struggling with them.
As big of a database mess as this is, the underlying data mess can be a daunting maze to get through. Considering even a single database, a trained database administrator (DBA) would need to understand the underlying data model, document any scripts or procedures loading data, and itemize reporting needs. If any forms were developed, especially if multiple people are using the database as part of a workflow, then you’ll need a business analyst and possibly an application developer to consider how these business processes are accomplished.
Perhaps you’ve never had to read someone else’s code? Rebuilding a database when it likely has poor naming conventions, missing data relationships, and a complete lack of referential integrity requires a DBA with the skills of a linguistic anthropologist. Now tell this DBA that there are multiple databases that contain duplicate and related data and that special software tools are needed to normalize the data model, load in data from multiple sources, and match, merge and de-duplicate records, before even considering how to replicate existing functionality.
Is this your company’s sales data, customer data, marketing data, or financial data? More likely than not, the answer is yes, because it’s this data that business users work with the most. If a business user needed to perform a quick analysis and IT wasn’t available or lacked the necessary agility to come up with a solution, then it is likely that a spreadsheet jockey or a SadBA established a solution.
There’s a paradox about the work data scientists take on. On one hand, data science is described as the “sexiest” job,14 given the importance of the role in digital organizations. On the other hand, much of the work is messy “janitorial” work that data scientists have to perform15 in order to be able to find “nuggets” in Big Data.
Why is the work of data scientists so complicated and messy, requiring hours if not days to bring data together before they can perform any analysis? The history of data science starts with complicated data warehouses, expensive BI tools, hundreds if not thousands of processes moving data all over the place, and bloated expectations. At the other extreme, many organizations have siloed databases, DBAs largely skilled at keeping the lights on, and spreadsheet jockeys performing analytics in hundreds of files. The janitorial work data scientists are performing partially exists because of the mess of databases and derivative data sources created in the past and never cleansed into formalized data repositories.
And I’m not sure this generation will do any better. If spreadsheets and database silos are the data legacy issues, are modernized tools any better? How about your BI solution that has thousands of reports that are rarely used and others that are minor derivatives of one another? Does your database look more like urban sprawl than a planned architecture, with hundreds of tables added over time to solve one-time problems? Have you replaced some spreadsheets with more capable data profiling, data integration, or visualization tools without having a defined practice of what gets created, where it is stored, and how it is documented?
With great power comes even greater responsibility. All the technologies and tools data scientists have at their fingertips also have the power to create a new set of data stashes—informal places where data is aggregated—or buried data mines—places where analytics are performed but not automated or transparent to future scientists. See Figure 5-1.
Figure 5-1. Illustration of a messy data architecture and unwieldy data flow
What can be even more challenging are the underlying steps to get data from point A to point B. Engineers call these ETLs: extract, transform, and load procedures that move data from a source such as a database, a file, a document, or a website; transform the data so that it is in a format that can be analyzed, used in an application, or shared; and then load it to a destination tool or data repository. More universally, this is called data integration or data processing.
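To ground the term, here is a rough sketch of a scripted ETL in Python; the file names and schema are invented for illustration, but notice that every step is explicit, repeatable, and reviewable:

```python
import csv
import sqlite3

# Extract: pull raw rows from a source (a hypothetical orders.csv export).
with open("orders.csv", newline="") as f:
    raw = list(csv.DictReader(f))

# Transform: normalize types and formats so the data is analysis-ready.
clean = [
    {"order_id": int(r["order_id"]),
     "customer": r["customer"].strip().title(),
     "amount": round(float(r["amount"]), 2)}
    for r in raw
]

# Load: write the transformed rows into a destination repository.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders "
             "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT OR REPLACE INTO orders VALUES (:order_id, :customer, :amount)",
    clean)
conn.commit()
conn.close()
```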
But those terms largely apply when the process is scripted and automated. What you are more likely to find in an organization with ungoverned data practices is that most of the data integrations implemented have manual steps. Someone writes a query to pull data, loads it into a spreadsheet, and performs some manual data cleansing. They write formulas, build pivot tables, copy and paste data, and manually edit and format the data—all steps that are very difficult to reverse-engineer. The data is then loaded into their personal data stash for their target business purpose.
If data scientists, DBAs, and CIOs are not careful, the data stashes and buried data mines can slowly transform into full-blown data landfills.
DBAs know what I’m talking about. It’s a combination of data warehouses, reports, dashboards, and ETLs that no one wants to touch. No one understands who is using what reports or dashboards in what business process for what purpose or benefit. ETLs look like a maze of buried unlabeled pipes developed using a myriad of programming approaches and with no standards to help future workers separate out plumbing from filters and valves.
Data scientists and their partners, data stewards, DBAs, business analysts, developers, and testers need to instill some discipline in how they source, process, improve quality, analyze, and report on data. This is the heart of data governance, a term that makes some business leaders cringe and paralyzes many technologists when they don’t know where to start. But before going into solutions, let’s go deeper into a few other issues.
When someone says that the data integration process is automated, I suggest asking questions to clarify what they mean by “automated.” You’ll probably conclude that the process is anything but automated, let alone reliable, scalable, secure, or configurable.
To some, automation implies efficiency and reliability but can still include manual steps so long as they are performed quickly and easily. Others assume that if the process can be completed without IT involvement, then it is automated. Still others don’t care whether it’s automated but are angered when the process breaks or doesn’t scale magically as more data is piped in. There is also a bad assumption that a daily process running on one volume of data will magically scale when existing data must be reprocessed for a change in business logic or storage schema.
As hard as it is to modify software, modifying data integration can be even more daunting, especially if the steps are not documented. For example, fixes for data quality issues or boundary conditions tend to be undocumented steps performed by subject matter experts.
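One remedy is to pull those expert-only fix-ups into the pipeline itself as explicit, documented checks that fail loudly. A minimal sketch, assuming the hypothetical order records from the earlier ETL example:

```python
def validate(rows):
    """Fail loudly instead of silently patching data downstream."""
    errors = []
    for i, r in enumerate(rows):
        if not r.get("customer"):
            errors.append(f"row {i}: missing customer")
        if float(r.get("amount", 0)) < 0:
            errors.append(f"row {i}: negative amount")
    if errors:
        # A documented, automated gate replaces the subject matter
        # expert's undocumented manual corrections.
        raise ValueError("validation failed:\n" + "\n".join(errors))
    return rows

# Inserted between the transform and load steps: clean = validate(clean)
```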
So now you want to fix data integration. Take out the manual steps. Make the process more nimble, agile, reliable, and so on. Why is it so hard to get business leaders on board with the investment needed in data integration?
Big Data technologies like ETL, Hadoop, Spark, Pig, Hive, or IFTTT16 are difficult enough for technologists to fully understand and apply to the appropriate data issues, but the jargon just frustrates business leaders. Many are clueless about the technologies (other than the overhyped term “Big Data”) and are surprised by the need to adopt them and invest time building expertise in them. Information technology and processing data have been around a long time, so there is an underlying assumption, even with Big Data, that integration is cheap and easy.
Now, unless you are doing very basic, point-A-to-B, straightforward plumbing, data integration can become quite complex as new data sources are added, logic is changed, and new applications are developed. Data integration may start off simple, but over time legacy data flows are difficult to understand, extend, or modify.
The complexity slows down IT, and if data and analytics are strategic to the business, it frustrates business leaders that they can’t just add a new data source, modify calculations, improve on data validation, or strategically change the downstream analytics.
So now, whatever technologies IT selects and however they are presented to the business, it all comes across as plumbing. All that time is invested just to have a stable data flow? Analytics, visualization, applications, and for the most part anything useful done with the data will cost extra?
The simple answer is that data integration is a key foundational technology capability if data is strategic to the business. It’s not just a technology; it is a competency. Unfortunately, ranked against other data technologies like data warehousing and application development, data integration capabilities are often a distant third in staffing and funding priority. So it’s no surprise that many processes are not fully automated and that business leaders don’t get the importance of this capability.
Knowing you have a problem is the first step to solving it. Getting data in and out of primary data sources securely, efficiently, and reliably is as important as the underlying structure of the data warehouses and the analytics performed over it.
The scope of data integration is a lot greater today. Data integration used to deal largely with internally hosted data sources, where DBAs and developers had many tools and ways to access the data. But what about data in SaaS platforms?
Enterprises are maturing their use of strategic software as a service (SaaS) or platform as a service (PaaS) and experimenting with innovative Cloud-hosted services that deliver new capabilities. On average, enterprises are using between five and nine SaaS platforms,17 and the number is expected to grow to 30 over the next three years.18
But while top SaaS platforms offer advanced functionality, scalable infrastructure, speed to market on new capabilities, and often lower costs, using many platforms creates some new challenges.
Every SaaS platform captures content and data, enables workflow and collaboration related to its core capabilities, and provides some out-of-the-box reporting capabilities. Ideally, these features should be “good enough” to unlock its primary business value, but is the platform “sufficiently open” to enable integration once usage hits critical mass and the platform becomes more business critical?
If you are evaluating a SaaS platform and ask the salesperson about openness and the ability to pull data out or push data in, you’ll get the knee-jerk response, “We have APIs.” Business leaders are now overly sensitized to this response, and when they hear it, they assume that having APIs means it’s easy to integrate and that the technologists should be able to figure out how to do simple operations like accessing or changing the underlying data. The salesperson may even go a step further and show additional capabilities like software development kits that make the APIs easier to use in specific programming languages. Ideally, they should show you their app store with example applications that were developed using their toolkits and present metrics on the number of applications, developers, and other indicators that there is a critical mass of activity.
The more marketing around these capabilities, the more third-party apps available, and the more developers in the program, the more likely it is that these integration tools and the skills required to leverage them can be used to achieve a desired enhancement or integration. Even if the SaaS vendor is a small and emerging business, you can evaluate their commitment to this program if these integration capabilities are prominently featured on the vendor’s website, if the documentation is easily accessible, and there is some evidence of external usage. When I evaluate technologies, I consider it a red flag if the vendor buries marketing, information, or documentation around their APIs or, worse, makes it available only to paying customers.
But APIs, SDKs, and the like are all tools for developers and require engineering effort to leverage. In addition, APIs are proprietary, so there is a learning curve for everyone involved in fulfilling integration needs. What if I am trying to implement a simpler out-of-the-box data integration? Is there an ODBC connection that enables me to easily connect, and is the underlying data model easy to understand and leverage? Can I easily connect an enterprise ETL or data quality tool so that I can move data in and out of the platform? Do they have built-in web and mobile widgets that enable me to plug their functionality into other platforms? Do they plug into automation platforms such as IFTTT, Zapier, Tray.io, or similar services?
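To give a sense of what the developer-level work looks like, here is a sketch of pulling records through a SaaS REST API. The endpoint, authentication scheme, and pagination style are invented for illustration; every vendor’s API differs, which is exactly the learning curve in question:

```python
import requests  # third-party HTTP library: pip install requests

# Hypothetical SaaS endpoint and token; real vendors will differ.
BASE = "https://api.example-saas.com/v2"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def fetch_all(resource):
    """Pull every page of a resource; pagination is vendor-specific."""
    records, page = [], 1
    while True:
        resp = requests.get(f"{BASE}/{resource}", headers=HEADERS,
                            params={"page": page, "per_page": 100},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()  # assumes the API returns a JSON list per page
        if not batch:
            return records
        records.extend(batch)
        page += 1

contacts = fetch_all("contacts")  # e.g., sync CRM contacts to a warehouse
```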
The better SaaS solutions will have mature APIs that have had significant customer usage. The better ones will enable direct integration with other platforms and offer simple tools for business users to connect workflows and data. The ones to watch are thinking three steps ahead and building on-ramps for your organization’s future digital transformation priorities, IoT strategies,19 and other integration needs.
Are you evaluating platforms on integration capabilities? Are you investing in integration where it is necessary or provides value? Critical corporate data is becoming more decentralized across multiple SaaS environments, so over time you will need a strategy for if, when, and how best to integrate this data.
Let’s now look at one final issue and revisit the definition of Big Data:
Big Data Defined—Big Data is not defined by its data management challenges but by the organization’s capabilities in analyzing the data, deriving intelligence from it, and leveraging it to make forward-looking decisions. It should also be defined by the organization’s capability in creating new data streams and aggregating them into its data warehouses.
You can provide analytics only on the data that you know about, but what about the data that exists in the enterprise that data scientists don’t know about, or other data that is difficult to access? The industry calls this dark data, and here is a definition that I offered:20
Dark Data Defined—Dark data is data or content that exists and is stored but is not leveraged and analyzed for intelligence or used in forward-looking decisions. It includes data that is in physical locations or formats that make analysis complex or too costly or data that has significant data quality issues. It also includes data that is currently stored and can be connected to other data sources for analysis but that the business has not dedicated sufficient resources to analyze and leverage. Finally (and this may be debatable), dark data also includes data that currently isn’t captured by the enterprise or data that exists outside of the boundary of the enterprise.
This basically demonstrates three conditions where data could be but is not leveraged in Big Data analytics: (1) It could exist in a sufficient format, but the business hasn’t leveraged it yet. (2) It exists, but it is too costly to clean or process. Or (3) it doesn’t exist, and it needs to be captured or acquired.
Why is this important? Well, you can translate dark data to examples of data that’s difficult to find or access. Let’s say you need employment records of the enterprise from the 1990s, and that data is stored on tape in an offsite vault. What if you found a presentation a colleague delivered a year ago, you want to revisit the data that was used in the underlying analysis, and your colleague is no longer with the firm? Are you going to be able to find this data? Or suppose you are seeking a data set—say, competitive data for a specific product—do you know how to find this data or the appropriate subject matter experts? How about scanning the network drives for instances of spreadsheets? Will you be able to identify key information about this data to know whether it is valuable in a new context?
These are all examples of detectable dark data, but my definition is expansive and includes data that the enterprise doesn’t have but is valuable. Maybe this is data that could be captured, but the organization hasn’t configured products and services to do so, or maybe it’s third-party data that could be purchased, but no one in the organization has the role to identify these data opportunities. To the enterprise, this data is dark and may put them at a competitive disadvantage if someone else has it and is using it effectively to compete against them.
Dark data may not have been a business issue several years ago, but it is now. If you drive the organization to become more data driven, but no one knows what data already exists in the enterprise and what sources to leverage, then you run a risk of developing analysis and driving decisions on the wrong or incomplete data sets. If you ignore the ecosystem of publicly and commercially available data, then your competitors that do tap into these sources may develop a competitive advantage.
Lastly, let’s consider the databases and data warehouses already developed and being used in the enterprise. I’ve seen databases developed in a few ways. A formalized data warehouse may have been designed using standard patterns or may have been assembled in a hurry for a specific business case. Application-specific databases may have been developed by a software developer with minimal assistance from a database engineer and designed to the specific requirements of the application being developed. Commercial data sources, the back-end databases to commercial software, may enable direct customer access, while other databases have terms of service or technical mechanisms that prevent you from tapping into them. Then there are databases developed by business users that can have a wide range of quality.
Regardless of how databases are created, I’ve seen several issues that most fall victim to:
They are poorly documented. Even when there is a formal database diagram, there are rarely data dictionaries or other documents to help users leverage these databases; a minimal sketch of generating one follows this list.
The original engineers leave little information on how to maintain or extend these databases. When new needs arise, few guidelines are passed to a second generation of engineers to aid in implementing these enhancements.
Many databases are developed for specific needs without aligning to a documented database and data strategy. The result is a lot of duplicate data, inconsistent naming conventions, and unclear understanding of where master data resides.
There is little formalized guidance on how to monitor these databases for performance or track the impact of growth. DBAs are left to monitor them with generic database tools, looking for growing tablespaces, slow queries, and other optimizations as they occur. This makes it difficult to know whether any database can handle a 10× increase in data, transactions, or other activity.
Many were constructed before NoSQL, Big Data, and Cloud technologies were widely available, so they may have constructs in them that underperform given today’s modern database technology options.
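None of these issues requires exotic tooling to begin addressing. Even a small script that introspects a database and prints a bare-bones data dictionary beats having nothing. A minimal sketch against the hypothetical SQLite warehouse from the earlier ETL example:

```python
import sqlite3

# Generate a bare-bones data dictionary by introspecting the database.
conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse from earlier
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    print(f"\nTable: {table}")
    # PRAGMA table_info returns (cid, name, type, notnull, default, pk).
    for cid, name, ctype, notnull, default, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        flags = ("NOT NULL " if notnull else "") + ("PK" if pk else "")
        print(f"  {name:<20} {ctype:<10} {flags}")
conn.close()
```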
In summary, most of our legacy is riddled with poor data practices. We enable ad hoc analysis by business users that easily pollutes our file systems with multitudes of unstructured data. We develop data integrations on an as-needed basis with little thought to strategy, maintainability, or a vision for capturing new data sources. We have both dark data and poorly maintained data warehouses. This is our legacy, and this is what we should learn from before investing in and layering on Big Data and data science capabilities.
Your organization may have already invested in Big Data technologies and data science programs, and you can hardly call them emerging technologies anymore. Regardless of where you are in the adoption curve, and even ignoring the impact of legacy practices, there remain several Big Data and data science challenges that require consideration. They fall into the key areas shown in Figure 5-2.
Do you have business leaders ready to leverage data and analytics?
Have you selected optimal and nimble data platforms?
Do you have the talent, collaboration, and practices?
Are you thinking about future technologies that are enabled by Big Data?
Figure 5-2. Factors that enable the data-driven organization
Challenge number one is leadership and the mindset of decision makers. Are decision makers in the organization leveraging data and reports that already exist to aid in decision making?
Every enterprise already collects lots of data but likely underutilizes it for intelligence, insight, and decision making. Data exists in disparate databases that can’t easily be connected and analyzed holistically. Unstructured data in the form of documents, spreadsheets, and presentations exists that is largely used for individual or departmental needs and rarely coalesced centrally and analyzed. Enterprises that have deployed collaboration platforms could better leverage the networks and intelligence of their employees by analyzing relationships, contributions, and consumption on these platforms. Data collected in workflow solutions such as ERP and CRM is rarely merged and analyzed for intelligence. Email is certainly one of the largest repositories of unstructured data.
Also, consider that every enterprise system—ERP, CRM—has built-in reporting and analytics capabilities. Think about all the capabilities in web analytics available to track sources of traffic, segment users, measure pipeline activity, and track goals. Consider all the reports developed in enterprise business intelligence systems. How much of this is used?
The bottom line is that most organizations underutilize the data, reports, and dashboards they already have in their environment. Is it a lack of skill? Is it a lack of time? Sorry folks, these are underlying concerns, but the bigger issue is the lack of leadership and culture. If top-down leadership and cultural principles aren’t instituted around becoming data driven, then the organization is less likely to shift to this form of operation even when there is a new generation of data enthusiasts driving change. Bottom-up culture change doesn’t work on its own, and those trying to push a rope uphill will likely get stuck when they encounter a manager or leader who’s never worked with analytical paradigms and capabilities.
Even when there are leaders pushing a data-driven culture, the next challenge is whether existing data practices, regardless of their technical sophistication, are scaled and get used by managers across the organization in decision making.
What I mean by “scale” is that data and insights need to deliver new intelligence deep into your organization and beyond the data scientist that produced it. Is your data used to adjust operational priorities? Is it influencing product managers on what products to develop, features to prioritize, or marketing strategies to focus on? Is the average salesperson adjusting their sales pitch or prioritizing prospects differently based on all the data available? Is finance using data beyond financial benchmarks in their forecasts?
You’ll see many refer to this challenge as developing “actionable insights”: in other words, taking data, finding insights, and taking action based on the learnings. Driving action from insights is an organizational capability, not a technological one.
The single largest issue most organizations face when looking at Big Data, predictive analytics, and other data programs is whether they have the talent and collaboration to drive transformation. Talent must come from traditional technology teams that require the skills to properly select, install, configure, integrate, develop, and scale the underlying technologies. Talent must also come from the data scientists, statisticians, quants, and other data analysts that are going to leverage these tools to develop new insights, experiences, and products.
According to an Accenture study on the data science shortage,21 “The U.S. is expected to create around 400,000 new data science jobs between 2010 and 2015, but is likely to produce only about 140,000 qualified graduates to fill them.” Similarly, a McKinsey study on Big Data22 predicts, “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” Finding the technology talent is also not easy, as reported in a recent Computing Big Data Review:23 “It would be great to have more people with Spark/Cassandra and big data skills. They are out there, but they tend to be more seasoned developers who have been around the block a few times and have become involved as they became aware of it. But at a junior level, those skill sets don’t seem to be there.”
Even if you cultivate the talent, getting data scientists and the Chief Data Officer to collaborate with the technologists and the Chief Information Officer on technology, process, and methodologies is not easy. They both know enough of each other’s responsibilities and skills to “be dangerous,” and conflicts between these teams can lead to long debates or flat-out departmental wars. Equally challenging is having these departments adequately service business needs with results and insights while team members are debating the underlying technologies and methodologies. In the Computing Big Data Review,24 61% of their respondents claimed, “Company structure that allows data scientists to bridge IT and business departments” is the most important factor for data scientists to thrive.
This is significant because in the war for top talent, most organizations will be challenged when larger organizations offer aggressive compensation, benefits, and notoriety for talent in hypercompetitive areas. In other words, most organizations are going to have a hard time competing with Google, Facebook, IBM, JP Morgan, GE, and other companies making significant bets on Big Data and other emerging technologies. The rest of us are going to have to cultivate talent and teams to close the gap.
The second issue we should consider is what data capabilities and technologies to invest in. At some point, many organizations that truly want to become data driven must consider investments.
And therein lies a huge challenge for many organizations operating with small technology budgets or that are middle-to-late adopters of new technologies. In Chapter Three, I reviewed some of the Big Data technologies in Version 3.0 of the landscape published by Matt Turck in March of 2016,25 which breaks the technologies into six major capabilities, 60-plus types of technologies, and too many products to count. This is enough to paralyze many Chief Data Officers, CIOs, data scientists, or data architects left with justifying and then selecting Big Data investments. It can lead to long analysis and technology selection processes to identify the best-fit technologies and products for the organization. It’s not uncommon for organizations to take six months or more just to select the technology, and by then, the organization may be left behind by the competition.
The other approach is to go bottom-up and enable the technologists, data scientists, or ideally a collaboration across multiple disciplines to prototype Big Data technologies, aligned with an innovation program that prioritizes experiments. This approach also has challenges. First, you need really good talent that is capable of prototyping with these technologies. Even then, they may not have all the skills or infrastructure needed, or sufficient knowledge of the technology, to make quick prototyping successful.
As if all the legacy data issues, the Big Data technology investment needs, the cultural challenges becoming data driven, and the steps organizations need to take to develop talent that works collaboratively aren’t enough of a challenge, along comes a new wave of emerging technologies.
I’m speaking about some critical technologies that represent the next wave of cognitive computing and transaction processing, including machine learning, artificial intelligence, predictive analytics, the internet of things (IoT), and blockchain technologies. It will be far more challenging to successfully implement these new technologies without a sound data management practice.
Big businesses are making significant bets on these technologies, and the successful ones will realize a significant competitive edge. Think what happens when Tesla, Apple, or Google markets the self-driving car if Ford, GM, or BMW lack a competing strategy. What happens to Honeywell, Philips, or Maytag if they are left behind in leveraging IoT in home automation? Can Citi, JP Morgan Chase, or TD continue to be the dominant banks if they don’t have a way to replace existing banking systems with ones enabled by blockchain technology? Can the top medical research facilities continue to push medical science and drive down costs without considering the use of artificial intelligence to enable patient diagnostics?
The question is not if, it’s when, by whom, and with what level of disruption. Recall from Chapter One that these and other emerging capabilities are driving many businesses to consider their future digital business and digital transformation. You should now be able to see that raising the bar on how the organization consumes and processes data is foundational to enable digital transformation.
So what are the solutions to today’s challenges and tomorrow’s threats and opportunities? There are some foundational data practices to consider in order to achieve competitive table stakes in the digital era.
If you worked with me during the last decade, you can attest that I tend to move fast. Time is not on your side if you are working for a startup burning cash to achieve critical mass and profitability or if you are an enterprise that needs to become digitally competitive. These businesses need to become data driven, to enable new data science teams, to leverage Big Data technologies, and to be ready for the next wave of competitive computing. Going slow, sequential, or waiting for a clear understanding of financial returns is not an option. You are bound to be sitting at a poker table with enough players that some will win with loose bets, more chips, or better talent.
When sitting at a poker table, you’re likely to face competitors with different skills, playing styles, and numbers of chips. A “loose” competitor plays more hands in the hope of getting “tighter,” more conservative players to fold their hands prematurely. Traditional business leaders often resemble tighter players, who can struggle to win at a table with many loose players willing to take more risks with bigger bets. Many of these loose players will lose, but statistically, some that place the right bets at the right time will amass large amounts of chips and make it difficult for tighter players to compete with conservative playing strategies.
That essentially means that you must attack the challenges that I’ve laid out in parallel. You can’t take a linear approach and expect everything to fall in line as you develop capabilities and practices.
Digital leaders should follow these four lines of attack to develop their Big Data organization:
1. Begin getting the greater organization to leverage the data, reporting, and analytics that already exist in the enterprise. This is because anything that you do with Big Data technologies or advanced analytics either adds to or replaces existing capabilities. If the organization isn’t prepared to leverage data in making decisions, driving priorities, improving customer experiences, or developing new products, then new capabilities will yield marginal results. It’s like giving a high-performance automobile to someone who barely knows how to drive. Getting business leaders, line managers, and customer-facing teams to leverage analytics, dashboards, and data to drive better decision making is the challenge of becoming a data-driven organization.
2. You must clean up the mistakes of the past; otherwise, a new generation of data scientists and database engineers will repeat them with a new set of technologies. Clean up the spreadsheets, siloed databases, data quality problems, and master data issues that exist in your current databases.
3. You must enable new talent, new capabilities, and new practices. How do you enable a new generation of proficient data analysts with competitive capabilities? How should they operate to solve new challenges in ways that are repeatable and maintainable? How do we prioritize this work to know that we are targeting resources to the areas of highest value?
4. More importantly, you should invest in data governance practices that define how data can and should be used across the enterprise. Governance should define the practices that help prevent the data quality issues of the past but also ensure that use of data can scale across the organization. Governance should also define strategy and priorities when it comes to Big Data investments.
Figure 5-3. Collaboration among business leaders, data scientists, and technologists enabling the data-driven organization
Figure 5-3 shows that these practices represent two-way drivers. For example, increased consumption by business leaders in their journey to become more data driven will likely drive data scientists to increase the analytics they provide, improve data quality, and drive the business rationale to invest in more competitive capabilities.26 Technologists should deliver new capabilities and operational performance but are also capable of defining data practices as they often mimic similar practices used in a software development lifecycle. Data scientists should demonstrate that their insights and tools drive both efficiency and growth in their business deliverables and must pass on technical needs and requirements to drive new capabilities.
The center box represents the function of a cross-disciplinary group of leaders that drives the overall practice by defining strategy, governance, priorities, and investments. For this to be successful, it will likely require representation from executive management, technology (the CIO), and data or analytics (the Chief Data Officer). If talent is a significant gap, then adding human resources leadership is useful. If the primary mission is efficiency or cost savings, then the team should include the CFO or COO. If growth, product, or customer experience is a bigger driver, then the leadership team should include the head of sales, a CMO, the Chief Digital Officer, a Chief Revenue Officer, or a head of product management.
Here are some very simple questions: Can you get business leaders to perform a review of their operations? Can you get them to do it without using a PowerPoint or other set of slides? How about getting them to do the review without using spreadsheets? One final question: Can you get them to do this review with less than two business days’ notice?
You’re probably going to get some hard stares at any of these requests because most departments are ill equipped to respond without their bread-and-butter tools and with such little notice. It might be interesting to test leaders to see, if they could get one exception, which one they would select. Would they want more time, the ability to repurpose existing spreadsheets, or a preferred presentation format?
Inability to respond to any of these requests indicates that a business is not ready to become data driven, as manual effort would be required to perform an analysis or report results. The impact is that they can’t perform this analysis with any frequency, and the department’s ability to share this data with individuals in the department, colleagues, or senior management is likely inhibited.
But there is a reasonable answer to this challenge that most organizations, at least in theory, should be able to deliver. Does your sales team have a CRM? Does your finance team operate an ERP? Can your digital marketing team report out of the customer engagement tools that are yielding the greatest success? Can your operating teams pull reports directly from the systems managing workflow? Do you already have a business intelligence system or reporting tools that can be leveraged? Unless you are an organization using Microsoft Office, Google Docs, or another competing office suite as your back-office system of record, ask yourself why most organizations elect to process data manually instead of pulling reports directly from the appropriate tools.
There are a few practical answers. First, some of these tools do a lousy job of reporting and analysis. They are often the last investment made by software vendors, giving users clunky user experiences, ugly visualizations, inadequate functionality, or slow performance. It’s why so many businesses are forced to abandon out-of-the-box reporting capabilities and often invest in having custom reports developed.
Whether you use custom reports or out-of-the-box ones, the next issue is training. Some organizations are good at training users when a system is rolled out, but the practice typically fades over time, making it difficult for new employees to leverage the system or for existing users to learn new capabilities when upgrades are performed. If a manager has to access data in multiple tools, if she’s been barely trained on them, and if these tools are upgraded with reasonable frequency, you might see how using these tools can become a challenge over time. It’s easier just to export the data and use the few spreadsheet tricks she knows to clean the data and make it presentable.
The third issue is the reality of garbage-in/garbage-out. If the CRM or the ERP has data quality issues, then reporting directly out of these systems becomes a lot more challenging. Sales has 3,000 contacts? No, it really has 1,900, and the others are duplicates or records with missing data that are hard to filter. Your department is under budget? Well, no, you haven’t submitted invoices for the last 90 days, so it’s unlikely the report you generated is accurate. These examples are ones that the business leaders know about, but the issues that they don’t know about are what ultimately frighten them. These issues keep them from reporting to senior management directly out of systems without manual intervention.
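A short sketch makes the gap visible. Assuming a hypothetical CRM export with email, company, and last_updated columns, a few lines of pandas show how far the raw count is from the usable one:

```python
import pandas as pd  # third-party: pip install pandas

# Hypothetical CRM export; the raw row count overstates reality.
contacts = pd.read_csv("crm_contacts.csv")
print(len(contacts))  # e.g., 3,000 raw rows

# Drop records missing the fields that make a contact usable...
usable = contacts.dropna(subset=["email", "company"])

# ...then collapse duplicates on a business key, keeping the newest record.
deduped = (usable.sort_values("last_updated")
                 .drop_duplicates(subset="email", keep="last"))
print(len(deduped))  # e.g., the 1,900 contacts you can actually count
```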
Lastly, there is the reality that meaningful business metrics and KPIs are valuable only when connecting data from multiple sources. Want to know whether the revenue forecast is at risk in Q3? You probably should analyze CRM and ERP data. Want to know which marketing campaigns are generating the most profitable customers? You likely need to pull data from at least three systems.
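Continuing the sketch with hypothetical CRM and ERP exports, even a basic forecast-versus-billed comparison requires joining the two systems on a shared key:

```python
import pandas as pd

# Hypothetical exports from two systems of record.
pipeline = pd.read_csv("crm_pipeline.csv")  # deal, account_id, forecast_amount
invoices = pd.read_csv("erp_invoices.csv")  # account_id, invoiced_amount

# Join on a shared key to compare what was forecast against what was billed.
billed = invoices.groupby("account_id", as_index=False)["invoiced_amount"].sum()
merged = pipeline.merge(billed, on="account_id", how="left")
merged["gap"] = merged["forecast_amount"] - merged["invoiced_amount"].fillna(0)

# The deals with the largest gaps are where the forecast is most at risk.
print(merged.sort_values("gap", ascending=False).head(10))
```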
These barriers encourage many business leaders to go back to basic tools. They ask someone to get exports, cleanse the data, and run numbers in Excel; they email results to their teams and paste finalized insights into management presentations.
Addressing this issue isn’t trivial. To make the switch, you’re going to need to line up some supporters and pick a spot or two where you are more likely to have some marginal success. Maybe it’s sending department P&Ls out of the ERP or sharing year-to-date sales figures from the CRM? Whatever the example, you’re going to need top-down support to help drive behavioral changes.
You need to pick a spot where you can challenge the status quo by pushing the use of existing enterprise tools and leaving manual practices behind. In rolling this process out across multiple businesses, I’ve found that you must find an early adopter and collaborate on new practices. In one organization it was marketing, in a second it was operations, and in a third it was sales. It doesn’t matter, so long as you have a leader who’s willing to partner with you on the journey.
Figure 5-4. Data discovery process to ask questions and drive insights
Figure 5-4 shows the basic planning process to develop a new analytics capability. The planning process starts with learning current workflows, opportunities, and issues:
1. What’s driving the business leader to make changes? Is it efficiency because performing the analysis and publishing it is too time-consuming? Is using data and reports in the current format too cumbersome? Does someone need the information more frequently? Does the current approach expose data quality issues?
2. How is the report generated today? What are the data sources, and what system and manual steps are taken to generate final outputs? Who does this work, and how much effort is required? Can they tell you why it was set up as it was, and what are some of the challenges? What would people do if they had the opportunity to redo the approach? What existing tools are being used?
3. Who uses the reports and data? Are they being used individually or in a group setting? What are some of the actions they take based on the data? How frequently do they review the data? What’s the impact if this report were no longer generated or if it were delivered late? What would recipients do differently if the report were generated more frequently, and what would be the impact? What are some current pain points, and what are some suggested enhancements? Ask users what other reports they use on a frequent basis and where the report in question ranks in importance versus the others identified.
4. Are there any business constraints that need to be considered up front? Is it important for the business leader to have a solution in place ahead of a key deadline? Are there timing constraints or opportunities for rolling out new practices? What are the security, privacy, or compliance considerations?
5. Are there any funding opportunities for this program? Are there cost benefits from any legacy practices or systems that can be shut down? Are there any existing KPIs that may improve with better reporting?
Coalesce the information you gathered into a charter. Who are the process owners that use the report? Can you create a diagram showing how the report is constructed? How would users change the report to be more insightful or efficient, and how would they benefit from these improvements? Are there any technical or process dependencies that need consideration if you were to rebuild or upgrade this report? What other assumptions need validation?
Agree on available resources that can be used to formulate a solution. Do you have access to funding that will enable bringing in additional help or expertise? How will the team collaborate and share results with stakeholders and end users to ensure a successful solution and transition? What technologies currently available to this group will likely be used in the solution? Which leaders will help drive priorities and decisions during the implementation phase?
Brainstorm to get the team and stakeholders to form a solution. Develop a prototype, then look to see how different user types (personas) or roles affect functionality and workflow. Identify existing data quality issues and how they impact operations, as well as other technical considerations.
Present top options to stakeholders for feedback. Get consensus on a minimal approach and what is out of scope. Rank the most promising options and see whether there are any timing constraints or funding opportunities. Lastly, think through the people and workflow impacts of implementing the upgrade.
When you kick off this practice, you’re going to have a limited number of technologies available to deliver analytics with, but as you add new Big Data platforms and develop expertise, you’ll have more options. Similarly, when you start this program, you might have a small team with limited expertise and little practice collaborating. As you invest in data science and agile data practices, you will see them mature to deliver insights more frequently and with more capability.
As you expose more data and deliver new insights, business leaders will ask questions that spur additional analysis. They may also want to see dashboards shared with appropriate decision makers and expect department leaders to amend processes in order to leverage them. This, in the best of business situations, will help provide funding to add technologies, expertise, and develop practices.
Don’t be afraid of charging forward with humble beginnings. Your goal in this first stage is not to get a data science program going or to get value from Big Data technology investments. Your goal is to get the greater organization, the frontline managers and business leaders, to experience leveraging data in their decision making. Plus, you need them to get away from manually processing data and move toward leveraging data to make proactive decisions.
Here is an example from my past with a group that was charged with sourcing data that we would process and sell to customers. They would call on businesses, ask them a series of questions, request access to some of the underlying documents, and input all of this into a data capture tool. This data would then flow through a series of data processing steps and eventually end up in platforms that delivered analytics and data back to customers subscribing to it.
Their challenge was to collect more data faster with higher data quality and with fewer resources. This seemed like a daunting challenge, especially since the executive team could not afford to give them additional investment in technology or practices to achieve their goals. We were too busy improving the end user experience for subscribing customers, and working on our data sourcing and processing practices was a distant secondary priority.
It turns out that the group already had a ton of data capturing productivity and quality metrics. The problem was that the data was locked in complex spreadsheets and siloed databases that took too long to distribute, and it was too complex for the average person to interpret. So, while they had plenty of data, they weren’t 100% data driven. They needed to simplify the analytics so that individual managers could drill into their metrics and make recommendations and decisions based on their team, region, and mission.
The good news for them was that we had already made an investment in a “self-service” BI tool that we were using to improve the analytics to subscribing customers. The question was whether this team could leverage the same tool to develop visualizations against their existing data and whether using these analytics would help them achieve their goals.
Flash forward several months, and I was pleasantly surprised by the results. I attended one of their semiannual meetings where they made operational decisions regarding targets, process, and people. Instead of looking at prepared presentations and complex spreadsheets, presenters shared their insights by navigating the analytics directly in the BI tool. When someone asked a question, a couple of clicks found the answer. They could see whether a specific issue or opportunity was a local condition specific to a single manager or a macro issue. Managers made collective decisions on what to do in the upcoming months to achieve their goals.
We’ll visit other approaches to getting the organization more data driven. But recall that this is one of four streams of activity in becoming a Big Data organization. Let’s turn our attention now to some of the other disciplines that need cultivating.
If you’re going to scale your data practices, then you need to consider people and process. Let’s start with people and review roles and responsibilities that are part of a data practice.
Figure 5-5 gives you a sense of the main functions in a data science program. Data scientists’ primary function is to develop analytical models, construct dashboards, and perform data analysis aimed at improving decision making by leaders, managers, and individuals. More importantly, organizations in a digital transformation program should rely on data scientists to enable products and services with real-time analytics capabilities that demonstrate value to end customers.
If data scientists are primarily involved in modeling and delivering analytics, data stewards are responsible for sourcing data and managing overall data quality. Sourcing data can come from a new business relationship, for example, when marketing departments acquire marketing lists and take steps to nurture them into qualified leads. Data sources can also include data captured directly from products and services such as operational data, clickstream data on websites, or purchasing data from transaction systems. These internal data sources need owners responsible for documenting data definitions and owning the overall data quality. Finally, more organizations are taking steps to create proprietary data sources. This could include market research, data that is crowdsourced, or web scraping efforts.
Figure 5-5. Relationship between data science, stewardship, and management
The last major function is what I broadly term data management. It would be a mistake to equate this only with running the underlying technologies because a big part of what makes data practices successful is how data is stored and processed. Your data scientists will be a lot less successful if the data is stored in disparate database silos rather than a holistic data architecture. Data pipelines can be robust and provide data stewards the ability to monitor data quality, or they can be dumb pipes without monitoring or management tools. Data may be marked with known master data records, or you might have a lot of duplicate data. Systems may or may not be secure, with high availability and disaster recovery capabilities.
The figure shows that the success of these roles is interdependent. Data scientists can’t produce meaningful insights if data quality isn’t documented and managed. Data stewards are limited in what they can do to improve quality if the underlying technology and data architecture don’t enable them to manage it. Technologists spend more time maintaining disparate systems if data scientists do not partner with them to help consolidate to a defined data architecture and set of platforms.
The other reality is that people rarely fall squarely into these defined responsibilities. I’ve seen data scientists design basic databases and perform data steward tasks to be successful. There are technologists with sufficient math and statistical background to write meaningful R models and display results in powerful data visualizations. There are data stewards that, without modern tools available, have learned to perform wizardry in querying databases or using spreadsheets to profile and cleanse data.
The other issue is that most organizations have a lot of people who have some skills working with data but who have never been educated on the practices and responsibilities of data scientists or stewards. In addition, with the research presented earlier on the shortage of data scientists, hiring from the outside is not a sufficient strategy. You should look to convert the spreadsheet jockeys into bona fide citizens in your data organization.
So now that you’ve fostered some demand for data capabilities, let’s look at some options on filling the talent gap.
If you want to find potential data talent in your organization, create an environment for employees to review operational data, ask questions, and seek out answers. Good questions lead to discovery efforts that produce insights and intelligence regarding the gaps between what data exists and what is needed.
A colleague and I accomplished this at a town hall we hosted to introduce becoming a data-driven organization. It’s not as if the organization didn’t understand data; we sold subscriptions and data analytics as a business. But like most businesses, we treated analysis of internal sales and operational data as a secondary priority because of the underlying complexities. After showcasing some of the “quick wins” on new analytics developed and being used in different departments, we left participants of the town hall with one very simple request: “Go ask questions and try to find data that helps provide answers.” We didn’t say what to do after that and relied on human curiosity to drive some individuals to seek answers.
I like to find individuals who ask good questions. They tend to be analytical and data driven but don’t necessarily have the technical skills to perform data discovery work and find answers on their own. Ideally, they are business decision makers or are influential enough to drive others to review data.
These people aren’t necessarily the ones asking for reports from IT systems. IT systems tend to serve a single or a small number of departments, so requested reports tend to be narrow in focus. Reports that show the health of a sales pipeline (CRM), trends in purchasing behavior (e-commerce), velocity of the development team (agile tools), or P/L performance (financials) are all examples of basic reporting. They all help answer questions on performance, alert people if there is risk, or provide evidence if specific activities are changing performance.
The best questions are usually more strategic and often require correlating information from multiple data sources. The answers may demonstrate collaboration opportunities by showing how an activity performed by one organization affects others. They help connect customer behavior to operational activities or decisions regarding the supply chain.
It has been my experience that identifying individuals who ask the tough questions is a key step to becoming a data-driven organization. Their curiosity often leads to new discoveries.
When you find people asking questions, your job as a digital leader is to motivate, enable, and mentor them in data practices. These are emerging leaders who are probably subject matter experts in their departments, are willing to challenge the status quo by asking questions, and are looking to leverage data to back up their hypotheses. You’ve motivated them enough to take a step forward; now you need to enable them.
Your new recruits are what are called Citizen Data Scientists. They are not data scientists, quants, or statisticians trained in analytics, machine learning, or programming, but they have stood up to be citizens in the program. For organizations that cannot hire enough data scientists or find adequate partners to outsource analytics needs, cultivating these citizens is an approach to increase the volume of analytics produced and drive the data-driven organization. If they ask data-driven questions, are skilled in Excel, are willing to learn new skills, and are disciplined enough to follow the basic data practices that you will lay out for them, then they are good candidates to be citizen data scientists.
When a citizen data scientist in marketing aggregates data from multiple marketing tools to determine what approaches are yielding the best leads, the CMO must listen. When a sales operative shows that account reps who proactively engage their clients at least once a quarter have the highest retention rates, then that should grab the attention of sales leadership. They are insiders in the mechanics of how the department operates and are likely to be trusted. So, in addition to growing the skill set, you are also indirectly growing a department’s reliance on using data in decision making.
There’s just one potential problem once you begin recruiting citizen data scientists. Do your new recruits have the skills to be data driven? Do they know how to use analytics, visualization, and other tools to drive at answers? The answer largely depends on what tools you have available. To enable citizen data scientists, you need tools that are easy for them to learn and succeed with, especially on simple analytics. Their drive will enable them to learn more complex analysis over time, but if they can’t get started with the basics, then they will lose confidence, lose interest, or lose any support from their managers to take on data challenges.
I label these tools as “self-service BI” and they are a form of citizen development technologies. The makers of these software tools not only target the expert data scientists; they also target the citizens.
They certainly won’t be targeting IT, as that’s the sin of BI tools of the past. Legacy BI tools enabled developers to program the analytics, develop dashboards, or create custom reports. The problem with these enterprise tools is that they are too slow and expensive for most organizations to operate. They follow a similar design pattern to custom software, where business analysts document requirements and developers implement. To be competitive today, most organizations must do more analytics faster, and these tools fail to enable the greater organization. See Figure 5-6.
Figure 5-6. Getting from data to insights
“Self-service” implies that analysts can do all or much of their work without IT resources or services from other organizations or experts. These tools have intuitive user interfaces and help analysts develop data visualizations without the need for a lot of (or any) programming or SQL. They help data stewards profile data, remove duplicates, merge records, and perform other data-cleansing steps without having to call in IT to develop scripts. The tools are easy to navigate and enable the novice to “click and learn” rather than having to go through extensive documentation or training classes to learn the basics. There should be lots of examples, either with the tool or online. Ideally, there is some form of app store showcasing other work. When some training is needed, there is a library of publicly accessible short videos that enable users to learn concepts and step through the implementation.
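To make this concrete, here is a minimal sketch in Python with pandas of the profiling, cleansing, and deduplication steps that these tools wrap in a point-and-click interface. The contact list and column names are hypothetical; the point is only to show the kind of work a data steward would otherwise script by hand.

```python
import pandas as pd

# Hypothetical contact list with common quality issues: inconsistent
# casing, a missing email, and a near-duplicate record.
contacts = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM", None, "b@y.com"],
    "company": ["Acme", "Acme", "Beta Corp", "Beta Corp"],
    "region": ["East", "East", None, "West"],
})

# Profile: count missing values per column before cleansing.
print(contacts.isna().sum())

# Cleanse: normalize emails, drop rows with no email, remove duplicates.
cleaned = (
    contacts.assign(email=contacts["email"].str.lower())
    .dropna(subset=["email"])
    .drop_duplicates(subset=["email"])
)
print(cleaned)
```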
They require built-in constructs that make it safe for novices to make mistakes. Some capture the full clickstream of the user to enable several “undo” steps. Some document all the steps taken so that a “pro” can understand what was developed without having to examine the underlying implementation. Most will have some capabilities for reusing chart configurations, formulas, and algorithms. They must have collaboration capabilities so that more than one data scientist or data steward can work together in the tool.
These are the basics; I then look at what happens when we need to do something a little more complex. Ideally, these tools have low-code paradigms that enable citizens to perform basic automation, data validation, analytic calculations, or data manipulations. What coding language is optimal for these users? Here’s a hint: If it’s based on a programming language used by developers, like SQL or JavaScript, then the software company probably didn’t have citizen developers in mind when it was created. Look for dialects that are intuitive and not overly expressive. If you have to write a lot of code to do simple things, it’s going to be too cumbersome for citizen development.
Finally, I look at the last mile. Does it have a plugin architecture, APIs, and other constructs to extend the software vendor’s implementation? You must evaluate these APIs and extensions carefully. If these constructs exist because the software is underdeveloped and the vendor expects developers from your organization to program as needed, then that’s a huge problem. Again, “self-service” implies that you shouldn’t need to call in IT for more than 80% of common work, so if configuring basic visualizations or performing common data-cleansing steps ultimately requires a developer, then the tool isn’t ready to be called self-service.
One exception is if the tool has hit critical mass and there is already a healthy ecosystem of developers who have extended the application and made their extensions available in an app store. If the citizen developer can shop for plugins, download, test, and request an internal approval for using it in production, then that is certainly a positive endorsement for the technology.
These tools are only one aspect of establishing a self-service analytics capability. Here is my definition of what users should be able to do “easily” to deliver on this promise. A user wants to:
1. Know what data repositories exist in the organization and what type of data exists in them.
2. Make requests to get access to data, get tools installed, or find out where documentation is stored without significant delay.
3. Understand individual data repositories by leveraging easy-to-understand documentation that defines data fields, data flows, connected applications, and data sources.
4. Comply with governance and rules on the proper use of data.
5. Connect with “owners” or subject matter experts on data repositories to ask questions.
6. Develop their expertise with analytical tools. Know how to request support from internal experts or from technology providers.
7. See working examples of dashboards, reports, or analysis performed on the data.
8. Have some understanding of data quality issues and any efforts under way to make improvements.
9. Escalate and resolve technical needs such as performance, linking data, or loading in new data sources.
10. Leverage organizational best practices on implementing visualization standards, collaborating with other data scientists, publishing and referencing insights, and sharing information with colleagues.
These requirements will shape the data services that technologists and data stewards must fulfill in the Big Data organization. I’ll cover this in the next section, because a successful program requires a partnership with IT and the maturing of its data services practices. We’ll then turn back to getting the cross-disciplinary team structure right and enabling collaborative, agile data organizations.
Traditional IT teams have many more responsibilities in a Big Data organization beyond just selecting, installing, configuring, and running databases. If you look at so-called citizen data science programs through the lens of an IT skeptic, then all I’ve done is arm business users with more powerful analytical tools that will create a new generation of data silos, dashboards, and reports on top of the legacy data warehouses and spreadsheets that already exist in the enterprise.
The skeptics would be right if we stopped there. The solution is to instill governance and management practices while providing business users and citizen data scientists with process capabilities that work with their analytical tools. As the previous section noted, they need these capabilities to be successful, but they will ask for practices only when it is too late and they are either struggling to get something done or overwhelmed with a problem. IT must be a proactive partner in this program, suggesting and influencing new governance, practice, and data capabilities well before business leaders require them.
Prioritizing data service practices ahead of business need and demand is a primary role for IT in digital transformation programs. IT must sell business leaders on implementing new practices, like change management, that are traditionally IT-centric. It should shift data quality from being an “IT issue” to one that business leaders manage by accepting data stewardship as a new responsibility. Success will likely require additional collaboration, because a single spreadsheet jockey working in isolation will not be as successful as an aligned, collaborative, and organizationally distributed analytics team.
Let’s look at some of these practices in more detail. My objective in prioritizing them is scaling the organization’s ability to train or hire new data scientists, introduce more analytical capabilities, improve data quality, and aggregate new data.
Almost nothing frustrates me more than seeing a complex spreadsheet emailed between colleagues with the intent to collaboratively edit it. The sender is emailed back multiple versions of the original spreadsheet and has the arduous task of merging them. When the spreadsheet is completed, it is then emailed to a larger number of decision makers to review the results.
There are far better ways to work on documents together or to share access to them, including capabilities in Microsoft Office 365, Google Drive, Jive, Box, Dropbox, and many others. This isn’t a technology issue today; it’s a training and change management issue of getting business users to phase out behaviors that create duplicate data and cumbersome, email-driven workflows.
The problem with email sharing can get worse as you enable new analytical tools beyond Excel. Imagine creating dashboards in BI tools only to have business leaders request the reports be emailed to them as PDFs, or screenshots pasted into PowerPoints, because that is the status quo of how they want to review their results.
Therein lies the core of the problem. Getting executives off emails, PDFs, and PowerPoints and clicking directly into analytical tools is a huge challenge in changing behaviors. They are the ones largely stuck on email powering everything, and it is very difficult to change this. It may be impossible with some leaders, and you might have to work around them rather than let their preferences force new capabilities back into outdated practices.
The newest tools have sharing built in; for example, tools like Tableau and Qlik have Cloud and server versions to enable publishing analytics and make them accessible. Both enable bookmarking results and creating “stories” from different views of the data. Your executives are unlikely to see these in action until they’ve been to a conference and seen a presenter tell a data story through a visualization tool instead of a presentation tool.
Don’t let these executives become the bottlenecks. Remember that becoming a data organization requires both top-down and bottom-up changes, and instilling collaborative practices is definitely a bottom-up transformation.
Think through two primary forms of collaboration:
1. Collaboration during the “development” process, where citizen data scientists, data stewards, and technologists need some vehicle to work together and, at other times, to make service requests. This can easily be done in the agile tools that you’re using in the IT organization, now extended into citizen data science programs.
2. Collaboration with consumers who will be using analytics to perform their work or make decisions. How will dashboards be shared with this group? How will documentation be provided? What example “stories” can be used to illustrate specific use cases?
If you’ve ever managed application development, then you’ll recognize that both are collaboration practices typically implemented as part of a software development lifecycle. Developers need to collaborate, know what’s being worked on, and know how to task one another. When software is ready to be released, end users need to be trained and taught best practices on using the new capabilities being introduced.
The main difference here is that you’re extending this collaboration beyond the formal walls of IT to less technically trained business users. Tools for citizen data scientists must be simpler and easier to use than what software developers use, and tools targeted at the software development community may not be good fits for business teams. There are exceptions, so you should look at the underlying user experiences.
When it comes to collaboration with consumers, my recommendation is to lead by example and expect a lot of pushback from laggards. Send emails with links to tools and without attachments. Provide some context so that you get successful “open rates,” but lead users to click through to the additional data. Hold internal sessions that demonstrate the power of using the tools and telling stories through data, rather than just how-to sessions.
This practice needs to evolve over time, so be realistic about the pace of change that you drive. If an executive refuses to look at data that’s not in a PDF or PowerPoint, then appease them while you get others to change behaviors. Help the ones using the tools frequently to be successful, and then have them encourage middle and late adopters to get on board.
There are other IT practices that need to be considered in data science programs as citizen developers begin developing more sophisticated models, algorithms, and visualizations. Does the new R script, analytics dashboard, or upgrade to data quality go live without some form of change management practice? It might, unless this discipline is enabled, promoted, and required by everyone working in the data organization.
Change management practices must be enforced, because it is unlikely that data scientists, citizen data scientists, sponsors, or leaders will automatically sign up for them. Remember, they come from the world where, when a spreadsheet is done, it’s just emailed out to everyone. So what is change management?
A good analogy for explaining change management is the editing and publishing of formal PowerPoint and other presentations. A presentation sent to the Board probably goes through dozens of edits and reviews to ensure the accuracy of the information, the simplicity of key messages, and the sequencing. Deploying changes in data artifacts should likewise be validated for accuracy and quality of results, boundary condition issues, dashboard performance, data privacy and security considerations, and other potential risks. I once quoted a famous movie line in my blog, “With more power comes greater responsibility,” and analytics need to be embedded in a basic change management practice.
What goes into this process? Just the basics to start:
Who tests the artifact before it is ready to be released? Is the data accurate? Are the analytical calculations selected appropriate and implemented correctly? Do the results make business sense? Are the business rules for data quality engineered appropriately? Does the new artifact reduce the performance of a key system resource?
What kind of compliance and communication checks are required before sanctioning the release? Are you releasing new data to end users that comply with privacy and security policies? What documentation needs to be updated? Do users understand any new or changed dimensions or measures and how to use them in decision making? How are you communicating the change and making sure end users and stakeholders understand the timing and impact?
What system environments are being used to support development, testing, and production? How are new artifacts transitioned, and what automation is developed to ensure that it can operate smoothly? What are the business rules on synchronizing data between these environments?
Who ultimately signs off on the release? Where is the change log captured? How is the change monitored for any unexpected conditions? What are the steps to roll back and under what circumstances is a rollback triggered?
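To illustrate how these questions might be operationalized, here is a minimal sketch of a release checklist captured as a simple Python structure. The artifact name, fields, and gates are illustrative assumptions rather than a prescribed standard; the point is that a release is blocked until the testing and compliance questions above have answers.

```python
# Illustrative release checklist for a data artifact; the fields and
# values are hypothetical and should be adapted to your own practice.
release_checklist = {
    "artifact": "regional-sales-dashboard",  # hypothetical artifact name
    "testing": {
        "data_accuracy_verified": True,
        "calculations_reviewed": True,
        "results_make_business_sense": True,
        "performance_impact_checked": False,
    },
    "compliance": {
        "privacy_and_security_policies_met": True,
        "documentation_updated": False,
        "stakeholders_notified": True,
    },
    "signoff": {"approver": None, "rollback_plan": "restore prior version"},
}

def ready_to_release(checklist: dict) -> bool:
    """Release only when every testing and compliance gate passes
    and someone has signed off."""
    gates = list(checklist["testing"].values()) + list(checklist["compliance"].values())
    return all(gates) and checklist["signoff"]["approver"] is not None

print(ready_to_release(release_checklist))  # False: gates remain open
```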
For those of you well versed in change management practices, you’ll note that I am only scratching the surface with these principles. Like the previous section on collaboration, change management practices need to mature over time as the data organization grows in size, capability, and impact.
Now let’s consider documentation. Relational diagrams, data flows, defined calculations, data dictionaries, database connection parameters: how many of these do you have documented across critical databases in simple formats that business users, not DBAs, can consume? How many of these are in formats that make it easy to update and maintain?
Databases and data flows are the foundations for Big Data organizations, yet documentation is often lagging, especially on legacy data warehouses. For the business to take better advantage of this data, it often requires some form of documentation to help them recognize what data is available, what the data fields mean, and how the data can be used.
If you’re successful getting the business interested in using data from a legacy data warehouse and demonstrating value from it, then you are more likely to get funding for cleaning it up, improving it, or rebuilding it. If the data warehouse is underutilized, then it effectively is another form of dark data.
Unfortunately, without some basic documentation for the new citizen data scientists you are grooming, knowledge and usage of data are limited to the few subject matter experts who have used it before and maybe were part of the team that originally designed the databases. You can’t easily mature a data organization if knowledge of key data warehouses is limited to a few experts.
What types of documentation? I look for very basic starting points:
Asset details on data systems identifying the databases, data flows, underlying technology and version, and what systems they are running on in production. I also want the inventory of systems identified with some basic attributes like cores, memory, and accessible storage. Finally, I like to see what external applications are interfacing with the database and some basic information on the connection type, data consumed, and data transacted.
Database diagrams including a high-level entity diagram along with a more detailed one that is developed from an entity relationship tool.
Data dictionaries that provide more context on the fields in the databases (a minimal example follows this list).
Data security considerations should also be documented, answering basic questions like what information is deemed private (PII or other), what the different group- or role-level access types to this data are, and what the underlying business rules and approvals are on who gets access to each role.
Data flows showing movement of data from source to destination and identifying primary processing steps.
List of analytics and calculations that are computed either in data flows, database stored procedures, or applications.
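As promised above, here is a minimal sketch of a single data dictionary entry, expressed as a Python structure only for concreteness; the field, table, and values are hypothetical. The same content could just as easily live in a wiki page, a catalog tool, or a version-controlled file.

```python
# One illustrative data dictionary entry; all names and values are
# hypothetical and exist only to show the level of detail to capture.
data_dictionary_entry = {
    "field": "customer_region",
    "table": "dim_customer",
    "type": "varchar(32)",
    "definition": "Sales region assigned at account creation",
    "allowed_values": ["East", "West", "Central", "International"],
    "source": "CRM account record, synced nightly",
    "owner": "sales operations data steward",
    "contains_pii": False,
}
print(data_dictionary_entry["definition"])
```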
This documentation is either a diagram (like the database diagrams) or a database itself (asset lists, data dictionaries, lists of calculations). A governing team should decide up front what tools will be used to manage this documentation. The team should also decide where these documents should live so that they are accessible, and it may even come up with a tagging taxonomy so that documents are easily searched.
Equally important to having documentation is deciding when it will be maintained and by whom. Usually this can be done as part of a change management practice, so that when changes are deployed by either data scientists or technologists, the underlying documentation is updated appropriately.
One last note is that some of these documents are better developed and maintained by the Chief Data Officer or by the data scientists. Data dictionaries and catalogs of reusable analytical calculations are created and consumed by this group, so getting data scientists to “own” them has the added benefit of ensuring materials will be written and maintained by the target audience.
Now that data scientists have tools and a change management practice and existing data assets are documented, the next step is for IT to provide specific services related to the existing data warehouses and integrations, as well as the ability to create new ones.
The self-service tools deployed to data scientists only go so far in functionality. They are often point solutions to enable specific analysis and data integrations. Once created, a mature team will review what was created and decide which elements should be refactored over time and implemented directly in the data architecture.
The main benefit of centralizing is to enable reuse, to secure the data, and to scale for larger volumes. But the downside is that it often takes longer to update data models, analytics, and data integrations when they are centralized versus what data scientists can do independently. The team needs to collaborate and decide when it is appropriate to centralize data in the enterprise data warehouses and when it is reasonable to have it managed privately for one-off analysis by a handful of data scientists.
Sometimes, there is no choice or the choice is obvious, and the team needs to centrally implement from the get-go. You might be developing a new data warehouse for petabytes of data or integrating a new data source that needs to be joined with multiple data sets. Even in these situations, it often makes sense to prototype against a subset of the data before implementing it directly into the central systems.
What are some of these services? You might take them out of technical terms and make them more business friendly. “I need help with . . .”
Getting access to a data asset
Reviewing a new data source and how to best integrate it
Improving performance of a query, a database, a dashboard, or a data integration
Writing a query to pull the required data
Identifying and resolving a data quality issue
Regenerating any documentation that is system generated
You’ll notice this list looks more like a service desk’s practices rather than an engineering practice. That’s by design because if you’re working with citizen data scientists, they will more likely get stuck and request help than directly ask an engineering team for implementation services. Once the request is made, then the data services team can assess the opportunity and determine the appropriate course of action.
Larger organizations may call this a center of excellence rather than just a service desk and engineering practice. This is a smart way to position and market these services; however, it has a couple of implications that need to be considered. First, the center of excellence needs to be staffed with experts on the various tools being used by both technologists and data scientists. Second, you need a service commitment from the individuals who will be fielding requests and a time commitment from management. This may prove more challenging than expected since many organizations would prefer seeing their best data scientists and technologists working on business problems rather than participating in the center. Organizations that are truly committed to becoming data driven should see the benefit in having experts spread their knowledge, but they must recognize that it requires an investment and commitment by their best people.
Now let’s review operational considerations. No one is happy when a query is slow, a dashboard takes too long to load, or there is a delay in processing a data feed. Not happy is putting it mildly—more like furious and frustrated.
Your database operations are primitive if you are only monitoring at the systems level, showing whether resources like CPU, memory, disk, or network are performing as expected. You need to be tracking actual database performance, application performance, dashboard performance, and data integration durations to better sense when there is a performance issue. You also must be able to sense an episodic issue, like a slow query bogging down the system, or a slowly emerging one, such as a data integration that’s getting slower week to week.
Database performance is the dependent variable, so you should be able to track some of the inputs that drive poor performance. Are you tracking the size of the database, the amount of data flowing through ETLs, or the amount of simultaneous user activity on primary dashboards? These are common inputs that may affect both episodic and longitudinal performance degradations.
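As a sketch of what tracking the dependent variable against its inputs might look like, here is a small Python example. The weekly numbers are invented, and a real implementation would pull them from monitoring systems, but even a simple correlation can point to which input is moving with the degradation.

```python
import statistics

# Hypothetical weekly observations: average dashboard load time in
# seconds (the dependent variable) and two candidate inputs.
load_time = [2.1, 2.3, 2.2, 2.9, 3.4, 3.8]
db_size_gb = [120, 135, 150, 210, 260, 310]
concurrent_users = [14, 15, 13, 16, 15, 14]

def correlation(xs, ys):
    """Pearson correlation; values near 1 flag inputs worth investigating."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"load time vs. database size:  {correlation(db_size_gb, load_time):.2f}")
print(f"load time vs. concurrent use: {correlation(concurrent_users, load_time):.2f}")
```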
Performance should be reviewed with the collective team, because issues likely require collaborative decisions. If a data integration has episodic slowness when it is processing large quantities of data, then perhaps it needs to be moved to a Cloud computing environment where computing resources can be ramped up during that period. If database performance is slowing, then the easiest fix may be to archive data rather than add resources or performance-tune the database.
The main transformation is that if you’re driving the organization to be data driven, then consumption will rise, complexity will increase, and tolerance for subpar performance will decrease significantly. Your operations teams need to be more knowledgeable about how resources are being utilized and measure performance to resolve issues, find the root causes of recurring problems, and scale computing resources.
More data quality issues will be identified as consumption of data and analytics increases. The analysts and data scientists working with data often work around the underlying data quality issues. Sometimes that means ignoring issues by filtering out “bad” data; other times they will create complex formulas and other operations to cleanse data. Vocal analytical teams will highlight data issues so that there is a better chance that they can be addressed earlier where the data is collected or processed. They will say, “Let’s fix the data upstream” because they know correcting issues at the sources or early in the data flows benefits all consumers of the data.
Many business leaders are oblivious to data quality issues until they are knee-deep in them. There’s an underlying assumption that “the system” will prevent duplicate records, bad street addresses, formatting issues with emails, and other data validations, not realizing that it takes investments either to prevent bad data at the source or to implement data-cleansing practices.
Like any other below-the-surface issue, the best place to get started is to build awareness by publishing some basic metrics. How many bad emails have come in from different marketing lists? Which salespeople are entering the least prospect metadata? What are the primary sources of duplicate records? How many records are missing values for critical dimensions?
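Here is a minimal sketch in Python with pandas of where publishing these awareness metrics can start. The lead records and column names are hypothetical; the output is the kind of simple count that builds awareness without requiring a data quality platform.

```python
import pandas as pd

# Hypothetical marketing-list extract; column names are illustrative.
leads = pd.DataFrame({
    "email": ["a@x.com", "not-an-email", "b@y.com", "b@y.com", None],
    "source": ["list_A", "list_A", "list_B", "list_B", "list_A"],
    "industry": ["retail", None, "finance", "finance", None],
})

# Bad emails by source (a crude "contains @" check, for illustration).
bad_emails = leads[~leads["email"].str.contains("@", na=True)]
print("Bad emails by source:\n", bad_emails.groupby("source").size())

# Duplicate records and missing critical dimensions.
print("Duplicate emails:", int(leads.duplicated(subset=["email"]).sum()))
print("Missing values:\n", leads[["email", "industry"]].isna().sum())
```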
Once there is awareness, solutions can be implemented by providing data stewards with data quality tools and practices, implementing data cleansing within data integrations, creating the right incentives to drive end user behaviors when entering data, or formally creating master data repositories.
But onetime awareness isn’t sufficient, and IT should consider how to establish and publish these metrics on an ongoing basis. Many data integration tools have this capability built in and are worth considering. This matters because organizations need their data scientists performing more analysis and less cleansing, and because new data quality issues emerge as data sources are added. By building awareness, you’re helping the organization recognize the need to create and assign data stewards, a fundamental role in a growing data organization.
You now have many of the building blocks to develop a data-driven organization. You have some business leaders consuming existing data, an emerging citizen data science program, technologists that are taking on greater data services, and an emerging role for data stewards. How do you put this all together into one cohesive process and collaborative team that can deliver frequent business results?
Now that we’ve defined some roles, a primary governance challenge is in balancing responsibilities. Who does what steps in the data management practices, and how are key decisions made?
Figure 5-7 gives you a basic model of the different responsibilities, with technologists handling the data management practices and governance while citizen data scientists handle the analytics. Now how can we get these people working together on prioritized questions and opportunities?
Figure 5-7. Balancing data management and data science responsibilities to enable collaboration and drive results
The answer is agile. In software development, the agile product and software delivery teams are almost always cross-functional between business and IT. The heart of agile is the product owner managing a backlog of features and enhancements, defining minimally viable solutions, working with IT on implementation scenarios, and prioritizing planning and development stories. Strong agile teams also have mechanisms to express and prioritize technical debt, larger business investments, and more significant infrastructure changes.
The same practice can be applied to agile data teams, except that, instead of prioritizing features, teams prioritize Big Data questions. What questions provide value to stakeholders and customers that are worth answering? How do we attribute value and estimate feasibility on answering the question? How do we factor in other work such as loading in new data sets, data-cleansing efforts, upgrading data security, or improvements in data processing?
The next step is to get a team working together on discovery efforts. Once a multidisciplinary group understands priorities, there is a stronger likelihood that they will work together and disregard organizational boundaries.
Figure 5-8. Agile data discovery process that enables data teams to prioritize analytics and demo results
Figure 5-8 shows an approach to align data science with agile practices. It starts with a single data set that is either part of the organization’s dark data or a new data set. The first step is to catalog the data so that business users can learn of its existence. Break the data into basic entities, dimensions, metrics, and metadata to provide more details to business users looking for data sources.
Identify three to five potential questions, target insights, decisions, or activities that can be researched using this data should someone commission a data scientist to investigate.
List known issues (defects) with the data source. This can be measures of data quality, information on how the data is sourced, and other feedback that might undermine any analysis of the data.
Score this data set based on its potential versus known issues. Absent an easy way to quantify value, scoring by a voting committee can at least rank which data sets look attractive for further analysis (see the sketch after these steps).
Commission agile sprints on data sets that have the highest scores. Review results and rerank based on findings.
Demo the insight and adjust the score based on its business value.
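As referenced above, here is a minimal sketch of how a voting committee’s scores might be tallied to rank data sets. The data sets, votes, and the weight applied to known issues are all hypothetical; the weighting itself is a judgment call the governance team should own.

```python
# Hypothetical committee votes (1-5 potential) and counts of known issues.
datasets = {
    "web_clickstream":  {"potential_votes": [4, 5, 3], "known_issues": 2},
    "returns_history":  {"potential_votes": [3, 3, 4], "known_issues": 5},
    "supplier_surveys": {"potential_votes": [2, 3, 2], "known_issues": 1},
}

def score(entry):
    """Average the committee's votes, then discount for known issues."""
    potential = sum(entry["potential_votes"]) / len(entry["potential_votes"])
    return potential - 0.5 * entry["known_issues"]  # weight is illustrative

# Rank the data sets; the top of the list gets the next agile sprint.
for name, entry in sorted(datasets.items(), key=lambda kv: -score(kv[1])):
    print(f"{name}: {score(entry):.1f}")
```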
This reinforces that the data scientist’s work should be grounded in analysis that answers questions and delivers business value. Data scientists should look at the value of data before investing too much time in discovery. Imagine that all the friction in the analysis caused by the data set’s size, speed, complexity, variety, or quality can be “solved” given sufficient skills, tools, and time: What do you hope to get out of this data analytics or mining exercise?
The approach can be applied to other cases beyond just analyzing new data sources. There might be business units looking for specific analytics, models, and dashboards. This should be expressed in the form of questions, scored based on value, and have solutions provided by the team. Another case is addressing technical debt by taking steps to eliminate legacy artifacts or simplify existing ones. Data quality debt efforts can be prioritized by data stewards to address data, integration, and other processing issues.
The key to the approach is to enable the prioritization of questions, a cadence to deliver results, and a theater to observe the results.
Now let’s put this in the context of the analytics planning process introduced earlier in this chapter. If you recall, that process enabled you to learn and develop a charter and then a planning process to identify a team, solutions, stakeholders, and scope. The agile analytics delivery process in Figure 5-8 then defines how this team can deliver iteratively. Figure 5-9 brings this together to show you the full lifecycle of strategy, planning, and delivery.
Figure 5-9. Data discovery practice: strategy, planning, and delivery
You’ll notice that in delivery, the team needs to continue to ask questions and reset priorities. This is because, as the team works with the data and technology, they will likely reshape their scope. Some questions determined during agile planning will be less useful, and new ones will emerge that will be of greater importance.
In fact, one way to judge the health of an agile data team is how quickly they can go from strategy to delivery on a data opportunity. In other words, agile planning in a mature organization should be fast because teams are easy to assemble and stakeholders are acclimated to the process. The main planning tasks are related to scope and solutions or, in agile terms, to developing the initial backlog of questions for agile delivery.
So getting the team structure right is key, but how you go about it will change as the agile data team matures.
You might be wondering how to structure “multidisciplinary” agile data teams considering the skills include trained data scientists, citizen data scientists, data stewards, and technologists. There isn’t a simple answer to this question, as a lot depends on organizational structure, objectives, competing priorities, and the maturity of the overall practice.
Figure 5-10 shows the first two stages to mature the team structure.
Figure 5-10. Stage 1 and 2 data organizations
In Stage 1, when you’re just getting started with data science, it helps to centralize a cross-disciplinary team while Big Data technologies are still being evaluated and the overall agile practice is still being developed. Centralization enables the highest degree of team alignment, which is optimal for both technology selection and process adoption. Roles can be sorted out at a more detailed level, and missing skills can be identified more easily. Finally, if data stewards are on the team, they can be tasked to address data quality issues in parallel with the analytic solutions.
Stage 2 should begin once an agile planning and delivery practice is established and some technology is in place. The focus should then shift to helping the organizations that want to be data driven become the most successful. When this happens, I’d rather see citizen data scientists work independently in their organizations, delivering the analytical results and aiming to be “self-serving.” If you’ve established a center of excellence, then they can make requests of the technology organization for either technical assistance or upgrades to the underlying technologies.
Similarly, in many cases where data sources are shared across the business and the skills of data stewards are harder to fill, it makes sense to centralize data stewards in a service organization that can independently monitor and upgrade data quality. Lastly, if the organization requires trained data scientists to work on analytical models, then it makes sense to separate this out as its own service.
Stage 3, as shown in Figure 5-11, is when a business has the opportunity to leverage data and analytics as part of a customer-facing product. In this case, the team working on the product should be part of the group working on the user experience. It is an add-on to the structure of Stage 2 but adds complexity in that centralized services can now receive “requests” from multiple internal- and external-facing organizations.
You can probably guess that the structure in Stages 2 and 3 can lead to complexity if multiple organizations are participating and multiple products are being developed or supported. How does work get prioritized and coordinated between these teams and the services that they may have dependencies on?
Figure 5-11. Stage 3 data organization, illustrating shared services and decentralized data science teams
This is where the concepts of “change management,” “releases,” and even “estimation” become important. Sprints give the teams a process for working on short-term priorities. Releases promote change management formalities but also help dependent teams have some visibility over a team’s midterm deliverables. When teams can estimate well, then they can be used to provide roadmaps beyond the upcoming release.
Prioritization can be a bit more challenging as it is difficult to define “minimal viable” for analytical services and difficult to compare “business value” between the deliverables across different organizations.
To solve this, you can leverage ideas from portfolio management. The governance group can be given authority to set priorities for each team either at a release level or at a roadmap level (multiple releases). Each team can then use voting mechanisms to judge the business values of the deliverables requested by different business and product teams.
You can’t lead with governance. No one understands what governance is, and if you try to explain it, it comes across as an administrative bottleneck. Business leaders understand corporate governance and are beginning to appreciate IT governance, but data governance will be puzzling to executives in organizations that are just embarking on leveraging their data assets.
But once you have collaboration and a basic set of practices among data scientists, technologists, and data stewards, instituting data governance is demystified. Instead of a hard-to-define “governance,” it becomes a set of business rules that define the underlying practice. The rules to focus on will emerge as the teams collaborate, instead of having to be defined up front or through a commercially designed data governance maturity program.
Here are some responsibilities for the data governance team:
Determine and approve strategies on which organizations get prioritized data services, and approve charters.
Oversee financial aspects of the agile data organization; justify investments and demonstrate ROI where applicable.
Review and approve standards for technologies, agile practices, data definitions, documentation formats, metrics, security, privacy, quality, and performance.
Review roadmaps, releases, and priorities set by teams and ensure alignment with strategy.
Review the current state of talent; identify gaps where new talent or training is required; align incentives and rewards.
Oversee data or technology R&D priorities; prioritize where new data sources, technologies, or quality improvements are required.
Handle disputes and bottlenecks across teams.
Two key responsibilities that should be the focus of the leader overseeing this transformation are to:
1. Own data security policies and prioritize their implementation.
2. Manage stakeholder expectations and nonstakeholder derailments.
Several key policies fit under data security. What are the organization’s data classifications? Who “owns the data” to define its policies, such as who gets access and defines the permissible use cases for the data set? Are there specific entitlements related to row or columnar views of the data? What kind of auditing is required to track viewing, changing, and distributing the data? What are the business rules for replicating, retaining, and archiving data? If the data contains personal or confidential data, what level of encryption, data masking, or anonymizing is required? What rules should be implemented to prevent data loss? What are the legal and compliance requirements on where the data can be physically stored?
This is a subset of security considerations that requires significant attention when organizations transform their data practices.
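To make one of these policies concrete, here is a minimal sketch in Python of masking and pseudonymizing fields before data reaches a shared analytics environment. The record, the salt, and the masking rules are all hypothetical, and a production implementation would use managed secrets and vetted tooling rather than this illustration.

```python
import hashlib

# Hypothetical customer record; the classification policy decides which
# fields are PII and how each must be treated.
record = {"customer_id": "C-1001", "email": "jane@x.com", "region": "East"}

SALT = "rotate-me"  # illustrative; use a managed secret in practice

def pseudonymize(value: str) -> str:
    """One-way hash so analysts can join on a field without seeing it."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

masked = {
    "customer_id": pseudonymize(record["customer_id"]),  # pseudonymized
    "email": record["email"][0] + "***@" + record["email"].split("@")[1],  # masked
    "region": record["region"],  # non-PII passes through unchanged
}
print(masked)
```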
A key reason for having the transformational leader involved in data governance is that you need senior leaders to be involved as stakeholders in the transition to a data-driven organization. We’ve reviewed how important it is to get some early adopters to sponsor analytics charters and partner with the agile data organization. Once partnered, you’re likely going to need the transformational leader’s partnership to help set realistic expectations.
Stakeholders don’t always understand or embrace agile principles when it conflicts with their priorities. They’ll remember all the things you signed up for in agile planning but forget that it’s an evolving process that prioritizes every sprint based on learnings and implementation realities. Some will look at charters and scope like contracts and bury teams if they underdeliver, yet provide little acknowledgment for additional scope taken on or for other complexities encountered along the journey.
If teams feel this pressure, they will be less likely to take on additional work for this stakeholder. If forced to, they will become more conservative in their commitments, and the stakeholder will likely be dissatisfied with the results given the investment. Things can get really bad when stakeholders challenge the underlying talent, technology selections, delivery practices, or data governance when they aren’t getting everything they want.
It kills the culture. Instead of getting a data-driven culture, you can end up with one that pits business stakeholders against the people and teams that are providing services.
The data governance team needs the transformational leader to step into these situations. Maybe the team miscalculated the complexities, oversold their stakeholder, or didn’t execute efficiently. Maybe the stakeholder wasn’t listening to the team’s recommendations or didn’t fully grasp tradeoffs in priorities. Maybe the data governance team didn’t handle conflicts efficiently, introduced too much governance too quickly, or empowered teams with inadequate funding or skills.
The transformational leader needs to handle these situations from the perspective of getting the culture and alignment right first and execution considerations second. This means being the final arbiter in disputes but also helping everyone in the program see the big picture. If something is systemically wrong, then the transformational leader should draw attention to it and make sure leaders are assigned to address it.
But what about the other issue of working with nonstakeholders? The leaders who are satisfied with the status quo and don’t want to become data driven. The passive-aggressive ones who are sitting on the sidelines waiting for the first sign of execution issues to bury the program. The ones who may be losing power, clout, or budget because data-driven decisions are driving priorities out of their domain. The ones who are hostile to the program because they would rather have more control over it.
In “The Reason So Many Analytics Efforts Fall Short,” researchers revealed “that leadership issues were often at the heart of the problems,” that “the commitment to advanced analytics disrupted [the C-suite] equilibrium,” and that “in all too many cases, the CEO devoted little time to trying to manage this dynamic.” They have a number of recommendations for CEOs to keep their analytics programs from being a waste of time, including managing the C-suite and instituting the right leadership and governance for analytics programs.
But the researchers highlight two other key responsibilities to “challenge existing mental models” and to “create an environment of rapid innovation” that are key to successful programs. The first is a change management exercise that takes significant time to nurture, while the other is more tactical, encouraging team members to take calculated risks. Both are key elements of becoming a data-driven organization and reforming both leadership and culture.
A data-driven organization is a cultural statement. So, what does this “look like”?
A defined business strategy helps people focus their data discovery efforts.
People ask good questions, challenge assumptions, and have healthy debates on where to make improvements, investments, or other changes.
Many people in the organization have the skills, access to data, and sufficient expertise with BI tools to perform data discovery tasks and are successful extracting insights from data.
Meetings kick off by displaying, discussing, and diving into data first to help shape opinions and ground the dialogue in facts.
Presentations reference their data sources—and these data sources are available to appropriate people in the organization to review, challenge, and conduct follow-up analytics.
The IT organization provides a defined set of data management and BI services to assist the organization in their discovery efforts.
There are already efforts to drive organizational changes, improve communications, and get bottom-up contributions. Becoming data driven is a key cultural priority, but not the only one.
Efforts to leverage data in decision making have measurable results—improvements in operations, sales, new revenue-generating products, happier customers, and so on.
Data is shared and leveraged across different businesses, organizations, and products.
There is priority, demand, and hunger to address data quality issues, to integrate multiple data sources, to capture new types of data, and to leverage third-party data to become smarter, faster, and more competitive.
Why is being data driven important in a digital business? Why is it critical to transformation? The answer is that you need the organization to help shape the future based on rapidly changing market conditions. It should be smarter and faster to compete with new competitors. And to do this, the products and services delivered to customers must be smarter, more efficient, more convenient, and more reliable.
Leveraging data and analytics is the core competency that enables this transformation. It’s not just the job of the technologists or data scientists; it requires the entire organization to be data driven.