Chapter 5
Best Practice #4
Source Data for Analytics Strategically

“Strategy is about making choices and trade-off.”

Michael Porter

One of the key elements for deriving good insights through analytics is quality data. However, many business enterprises are impacted by poor data quality that reduces productivity, creates inefficiencies, and increases operational costs. Recent Gartner research has found that poor data quality to be responsible for an average of US$ 15 million per year in losses in a company [Gartner, 2018]. The state of data quality is likely to worsen in the future for three main reasons:

IT systems are increasingly becoming large, complex, and disruptive and are altering the way consumers, industries, and businesses operate. This means diverse data models, data sources, and data definitions to manage.
Secondly, more companies are now diversifying with new products and services. Also, because of globalization, many companies are entering into new markets. This means more data on customers, suppliers, assets, and products, plus the transactional data on local regulations and business activities.
Thirdly, the mergers and acquisitions (M&A) market is consistently strong year-over-year across the globe. In 2017 there were 53,302 global M&A deals, in 2018 it was 52,912 transactions, and in 2019 there were 49,849 M&A deals [Szmigiera, 2020]. This means more system consolidation and data integration in business enterprises, and data integration often results in poor data quality.

So, what exactly is data quality? There is no one universal definition of data quality. But from the business perspective, data quality is ensuring that the data is useful in business operations, compliance, and decision making [Southekal, 2017]. It is important to understand the key dimensions of data quality to define data quality comprehensively, as the definition of data quality is largely contextual. The word “dimension” is used to identify the key aspects of data that can be defined, quantified, implemented, and tracked. In this backdrop, there are 12 key dimensions of data quality, as shown in the figure on the facing page, and appendix 1 covers the definitions of these 12 data quality dimensions [Southekal, 2017].

Data quality doesn’t mean that all the 12 data quality dimensions are applicable to a business all the time. Data quality is contextual, and often it is only the subset of these dimensions that matter to a business enterprise. Depending on the business problem, only a subset of these 12 attributes might be needed to assess the level of data quality.

So, as a fundamental question - why data quality is poor in business enterprises? Why are data analytics projects constrained with poor data quality? Why is considerable time and effort invested in data cleansing and remediation, technically known as data engineering? According to Armand Ruiz Gabernet of IBM, 80% of the time in analytics is spent on data engineering, leaving only 20% to derive insights [Gabernet, 2017].

Given that data quality is an essential requirement for analytics, there are five key reasons why data analytics is heavy on data engineering.

Different systems and technology mechanisms to integrate data. Business systems are designed and implemented for a purpose, mainly for recording business transactions. The mechanisms for data capture in business systems such as ERP are batch or discrete data, while in the SCADA or IoT Field systems, it is for continuous or time-series data. This means that these business systems store diverse data types. Hence the technology (including the database itself) to capture data is varied and complex. When one is trying to integrate data from these diverse systems from different vendors, the metadata model varies, resulting in data integration challenges.
Different time frames of data capture. The timeframes for data ingestion into the IT systems during data capture varies. For example, in transactional applications like the ERP systems, data ingestion is typically batch, discrete, and manual, while in the field systems, data ingestion is usually automatic and near real-time. For example, when the product delivery to the customer is done, the shipment details are normally captured in real-time by the hand-held devices. But the invoicing cannot be immediately processed as invoices are issued from the ERP systems to the customer. This creates a delay in delivery-invoicing compliance reporting.
Different user value-propositions. In business, the same data is created and consumed by different stakeholders. This type of data is usually the internal business process data, and the value differs as the perspectives of the stakeholders vary. For example, while the technical attributes of an asset concern the engineering manager, the procurement manager looks at the purchasing aspects of the asset. This means for holistic asset analytics, the engineering data, and the purchasing data should be available.
Different business processes. The same data element can be different due to differences in business processes based on geographies, laws, regulations, market conditions, etc. For example, the date-of-birth data element of an employee in Canada is subject to data privacy regulations, while in many countries, the data privacy regulations are not very strong. So, getting the customer buying habit report based on age for a developing market is much easier than getting the same report in Canada.
Different aggregations driven by organizational structures. A data element can be viewed differently based on differences in granularities or aggregations driven by organizational structures. For example, the VP of procurement might need a spend report based on item categories - an aggregation of items procured based on the item type, supplier type, delivery location, etc. But the procurement manager might need the spend report based on individual items procured, which are granular.

The different kinds of data engineering efforts and activities that make data engineering complex and an expensive process are illustrated below. In this backdrop, the phrase “real-time analytics,” which is often used in the marketing materials of analytics consulting and product companies, is an oxymoron. This is because there is always a time lag between data origination and data capture. This time lag can be a few microseconds in SCADA/PoS systems, or this time lag might be even months before the data is formatted, cleansed, validated, curated, aggregated, and committed in a data warehouse before insights are derived. Inherently, data is historical, and hence the term “real-time analytics” is an oxymoron and doesn’t exist. Instead, a better term for faster analytics can be near real-time analytics.

Why is this a best practice?

Why is strategically sourcing the data the best practice? Many analytics initiatives and managers assume that the data for analytics is readily and easily available in the enterprise. Unfortunately, businesses are plagued with poor quality data. As discussed earlier, most of the effort in analytics projects is on getting good quality data for deriving the insights. But from the data availability perceptive, analytics initiatives in business enterprises have two main challenges. First, there is no data at all. Second, there is no quality data. Below is some of the research done on the state of data quality in business enterprises.

An average user spends two hours a day looking for the right data [Mckinsey, 2012]
In 2016, a study by IBM concluded that bad data cost the US economy US$ 3.1 trillion [Ross, 2019].
Up to 73% of data in an organization is never used for analytics [Forrester, 2016]

Despite poor-quality data in the enterprise, appropriate data acquisition strategies should be derived so that business can go forward in deriving good insights. Fundamentally as the business environment evolves, organizations also adapt to the evolving environment. Given the flux in the operating environment of the business, the business model is never stable. Hence, the performance and the utilization of the business assets, including data, are rarely in the ideal or stable or perfect state.

So, what is the approach? They say perfection is the opposite of getting things done. Voltaire, the French writer, said, “The best is the enemy of the good.” George Patton, the American WWII general, had said – “A good plan today is better than a perfect plan tomorrow.” Focusing on perfection or getting high-quality data becomes a productivity problem. It can lead to a feeling of paralysis, of being stuck in the task in which one can never get the desired result. The bottom line is that in the current business paradigm or model, there is no state like perfect high-quality data that can be used in business analytics. Businesses must find some alternatives to get the best available data.

Realizing the best practice

Given that getting quality data is a huge challenge in analytics initiatives, how can businesses get good data for deriving meaningful insights? What are the alternatives to get the best available data? The trick is to stay focused on the final product and not put too much emphasis on the process. So, what is the final product in analytics? It is insights and not data per se; data is the vehicle to get good insights. Businesses essentially need good reliable insights.

In chapter 1 we saw that insights could come from intuition as well, apart from data. Also, as discussed in chapter 1 in the last mile analytics (LMA) section, even though data is the foundation for analytics, data collection is practically a non-value-added task for the business. It is an inevitable process; we can only aspire to shorten this process, but we cannot eliminate it. How can we shorten this data collection activity and focus on deriving and implementing the insights? Overall, data is expensive to manage; the value of getting insights should be significantly higher than the cost of generating insights.

In addition, many business processes do not need high-quality data. For example, accurate data is good enough compared to correct data, and there is a big effort required in moving to the correct data state from accurate data state; accuracy and correctness are two of the 12 data quality dimensions. For example, if a Telecom company’s primary channel to reach its customer is the customer’s phone number, then the phone number should be correct, and the home address can be accurate. Fixing home addresses when people move places would be expensive and even unnecessary for the Telecom company if the primary contact mode is via the telephone. The Telecom company can save time and costs if it keeps the home address as accurate and phone number as correct, instead of having both the home address and phone number as correct.

So, what are the options to get good data for analytics? There are three main workaround strategies or options the business enterprises can leverage to overcome this challenge of poor data quality. These three strategies will help the business in deriving insights quickly and implementing them for business efficiencies.

Data sampling
Feature engineering (FE)
Acquiring and blending data

A key pre-requisite for implementing these three workaround strategies is a strong knowledge of the business and the associated processes. In addition, the data derived from applying these three workaround strategies should be validated against the 12 data quality dimensions discussed earlier.

Data sampling

Data sampling is selecting a representative subset of data from a larger population of data. The goal is to work with a small amount of data that is reflecting the characteristics of the actual population. Data sampling will help one derive insights quickly and cost-effectively as only a small data set is prepared for deriving insights. But there are two important considerations in acquiring a good data sample are:

the size of the required data sample and
the possibility of the sampling error in the selected data sample

To address the first issue on the sample data size or count, sample data size or data count should be based on three main factors:

Population in question. This factor is on the total size of the population in question. For example, if the data analytics is on procurement spend for the last three years, then the total count of purchase orders issued by the company is needed. The count of purchase orders can be easily retrieved from the ERP or Procurement systems in the company.
The margin of error (MoE). MoE is the percentage that expresses the probability that the data collected is accurate, and MoE is typically fixed at 5%.
Confidence level (CL). It is the probability that the MoE is accurate, and CL is generally set at 95%.

These three variables, population, MoE, and CL, can be used in a statistical formula to get the sample size or count, or there are many online tools that can help one to get the sample size easily. The image below is the sample count of 384 purchase orders coming out from an online tool called SurveyMonkey after analyzing the population of 839,997 purchase orders at 5% MoE and at a 95% confidence level.

Once the sample count is determined, we need to ensure that the sample size or count is not subject to sampling error. Sampling error, which is the second issue in data sampling, is the situation where the sample selected is not the true representation of the population. Sampling errors can be eliminated by ensuring that the sample size adequately represents the entire population, and the samples are randomly selected. The randomness in the selected data, can be validated with tests such as the runs test. Technically, runs test is used to test the hypothesis that the elements of the data set are mutually independent.

In other words, representation of the population and randomization in the data selection are two key factors to minimize sampling error. In the above example, if the 384 sample purchase orders selected are from one geography for one item category from one vendor, it does not reflect the true nature of the entire population of 839,997 purchase orders especially if the company operates in multiple countries and procures items from multiple vendors from different item categories. To summarize, sample data should be of good count, should be randomly selected, and should represent the characteristics of the actual population well. The relationship is illustrated on the facing page.

When the sample data selected has addressed risks on sample count and sampling error, you must make sure that the insights derived from the sample data are not likely by chance. Statistical significance expressed as p-value helps to quantify whether the insight gleaned from sample data is due to chance. The lower the p-value (lower than 0.05), the less likely that the results are due to chance, technically known as rejecting the null hypothesis. For example, a t-test, one of the important statistical tests, tells if the difference in means between two sample data sets is significant or if the difference in means could have happened by chance. If the p-value in the t-test output is less than 0.05, we reject the null hypothesis and conclude that there is a difference between the two sample data sets if it were extended to the entire population. Similarly, in the regression output that uses sample data, a low p-value means that the data variable can be included in the regression model and would work for the population data set during predictive analytics.

Feature engineering

The second strategy to get good data for analytics quickly is feature engineering. Fundamentally analytics algorithms take data as the input to give the insights as to the output. But getting good quality data in the right data format to be used as an input to the algorithms is a challenge in most business enterprises. This is where feature engineering addresses the problem. Feature engineering is creating a “smarter” dataset or attributes or features applying the domain experience and intuition on the existing data sets. Feature engineering serves two main purposes:

Transform data types
Create new fields or attributes.

With feature engineering, the integrity of the data is never compromised.

Let us first discuss how feature engineering can be used to transform data types using an example. In multiple linear regression (MLR) algorithms, the output or dependent variable is in numeric format, and the input or independent variable is in continuous format. But if the input or the independent variables are in the nominal format, one option is to use dummy variables in the MLR, and the second option is to transform the nominal variable into a continuous variable using feature engineering. For example, in a retail chain, if the store area field has values large, medium, and small, the nominal field can be converted to numeric by assigning a value of 5 to large, 3 to medium, and 1 to small.

The second function of feature engineering is creating new fields. Often, data in many business operations are repeated across datasets, and this data can be used to build or derive new data attributes or features. Below is a sample where the initial or original production data from the factory floor can be used to derive new attributes or features without compromising the underlying data integrity. In Figure 5.5, based on the timestamp, a new field or attribute called “shift” is derived. If the timestamp field has a value between 13 to 15 hours, then the shift is “Day,” if the timestamp has a value of 16 hours, then the shift is “Break,” and if the timestamp has a value between 17 to 19 hours, then the shift is “Night.”

Acquiring and blending data

The third option or the best practice for companies to get good quality data for analytics quickly is to acquire new datasets and blend it with an existing dataset to make it into a functioning dataset. This happens when the company lacks key data attributes in its dataset to do analytics – especially predictive and prescriptive analytics. While descriptive analytics primarily deals with historical data, predictive and prescriptive analytics will typically rely on a combination of historical datasets and new datasets. There are three key steps in this approach of acquiring new data and blending it with existing data:

Data acquisition

New business data could be acquired internally or externally. Internally new dataset could come in three main ways - business experiments, PoC (Proof-of-concept), and surveys.

A business experiment is applying tools and techniques on different hypotheses and acquiring new data.
A PoC (Proof of Concept) is a small and quick exercise to verify and validate the functionality. A PoC is usually a small exercise to test the design idea or assumption and, in the process, acquire new data.
A survey is a research method used for collecting data from a pre-defined group of respondents to gain insights on a specific topic of interest.

A business experiment should always begin with the definition of what constitutes a valid testable hypothesis.

Externally the new datasets could be acquired from sources such as Kaggle (https://www.kaggle.com/datasets), Google (https://cloud.google.com/public-datasets/). Microsoft (https://msropendata.com/), Government-published datasets from EU, US, India, and Canada, for example. These datasets are normally available for free as Open data. Also, datasets can be purchased from external companies like Bloomberg, IHS-Markit, Statista, Snowflake cloud data warehouse, and others for a nominal fee. Below is the Snowflake’s Data Exchange marketplace that utilizes secure data sharing to connect data providers with data consumers.

Combining data with joins and unions

Once the datasets are acquired – internally or externally, the data can be combined with the available or existing dataset to form an integrated dataset. Two key SQL operations to combine datasets are JOINS AND UNIONS. A (SQL) JOIN combines columns from one or more database tables using the primary keys (PK) and foreign keys (FK). The PK uniquely identifies a record in the table, while the FK is a field in the table that is PK in another table. A foreign key, by contrast, is one or more fields or columns that correspond to the primary key of another table. Foreign keys are what make it possible to join tables with each other. Closely related to JOIN is the UNION operation. If two tables with similar data structures are combined using the UNION operation, then the data from the first table is in one set of rows, and the data from the second table in another set. The rows are in the same result. In simple terms, JOIN combines data into new columns, and UNION combines data into new rows.

Data cleansing

Once the data is combined to form an integrated data set, the integrated dataset should be cleansed to form a usable or functional dataset so that data scientists can use it in analytics for deriving insights. Data cleansing on the integrated dataset at this stage involves four main activities:

Populating any missing values
Standardizing and enriching the integrated data.
Removing duplicate records if any
Validating and verifying the data with key business stakeholders

Conclusion

Data is the fuel for running analytics algorithms or models. But getting good quality data for analytics is challenging for most business enterprises. A study from the Harvard Business Review discovered that data quality is far worse than most companies realize, saying that a mere 3% of the data quality scores in the study were rated as “acceptable” [Nagle et al., 2017]. So, should the analytic initiatives wait for the data quality to improve, or is there a workaround?

There are many options or workarounds for getting good quality like data sampling, feature engineering, and acquiring and blending new data from internal or external sources. The analytics team should strategically source data given that getting perfect high-quality data is nearly impossible in most scenarios. However, if quality population data is available, it is always recommended to use the population dataset. These three strategies, however, do not fix the data quality problem in the company; they are essentially best practice workarounds that allow one to move forward with your analytics initiatives. As civil rights movement leader, Martin Luther King Jr said, “If you can’t fly then run, if you can’t run then walk, if you can’t walk then crawl, but whatever you do you have to keep moving forward.” And according to tennis player, Arthur Ashe, “Start where you are. Use what you have. Do what you can.”

References

Forrester, https://bit.ly/2USU9ij, Jan 2016.
Gabernet, Armand Ruiz, “Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity,” https://ibm.co/2yiqbwE, 2017.
Gartner, “How to Create a Business Case for Data Quality Improvement,” https://gtnr.it/345AJLt, Jun 2018.
Mckinsey, “The social economy: Unlocking value and productivity through social technologies,” July 2012.
Nagle, Tadhg; Redman, Thomas, and David Sammon, “Only 3% of Companies’ Data Meets Basic Quality Standards,” https://bit.ly/2UxaHO4, Sep 2017.
Ross, Andrew, “Experian study: why organisations think they have bad data,” https://bit.ly/2w3th79, Feb 2019.
Szmigiera, M, “Number of merger and acquisition transactions worldwide from 1985 to 2019,” https://bit.ly/2xGXpFy, Feb 2020.