Chapter 12
Conclusion
“Great things are not done by impulse, but by a series of small things brought together.”
Vincent van Gogh
Most business enterprises today are data-rich but insight-poor. For example, the oil and gas industry has historically captured data for operations and compliance, and today it is aggressively building capabilities to convert that captured data into insights [Wethe, 2018]. In transforming data into insights, this book offers practical guidance through ten key analytics best practices for successfully delivering analytics initiatives in an organization:
These ten best analytics practices were applied in an analytics program at an OFS (Oil Field Services) company. The first section of this chapter shares how these best practices were applied at this OFS company. Throughout the book, the discussion has focused on delivering good analytics and insights. If successful analytics depends on senior management support, stakeholder alignment, data architecture, quality data, the right algorithms, and change management, what does bad analytics look like? The second section of this chapter is on bad analytics. Lastly, deriving insights relies on statistical models. So, what are the key statistical tools required to derive business insights? In all, this chapter looks at three main topics – a case study, what bad analytics looks like, and the key statistical tools for analytics.
Before we look at these three topics, Figure 12.1 summarizes the different types of business analytics discussed so far.
Case study: Data insight product for Payload Technologies
The ten best analytics practices discussed in the previous chapters were implemented in Payload (PL) Technologies (https://www.payload.com/), a Canadian Oil Field Services (OFS) technology company based in Calgary, Canada. Payload has two flagship cloud SaaS (Software-as-a-Service) products – eTicket and eManifest. These two products, which are used for digitizing oil movement regulatory documents, can be accessed as a web application or as a mobile application.
Payload’s current operating model is a digital network platform. A digital network platform is a technology-enabled business model that creates value by facilitating interactions between two or more interdependent groups. Technically, digital platforms have three distinct features.
Payload’s digital network platform is shown below. It comprises three interdependent groups – the E&P companies, the trucking companies, and the drivers. The E&P companies search for oil reserves and then drill wells to extract oil. Once the oil is extracted, the oil products are transported by the trucking companies from the wells to collection points for further shipment to the refineries. In Payload’s digital network platform, seven E&P companies have contracted over 120 trucking companies, which in turn work with over 1,900 drivers for the oil product movement.
Against this backdrop, Payload (PL) has captured operational and compliance data on oil movements from the E&P companies, trucking companies, and drivers for over five years. This data comprises approximately 425,000 field tickets and manifests, covering the transportation of about 65 million barrels of crude oil (and its derivative products such as emulsion and condensate) over 86 million kilometers, representing over C$3.5 billion in trade for the Canadian oil industry.
Payload wants to monetize this data and offer insights to the E&P and trucking companies as a new data analytics product. The data product will be valuable to these companies because it will give them insights to:
Payload’s business operating model, captured through value stream mapping (VSM), was mapped to create the enterprise data model (EDM). The EDM shown below is based on the conceptual and integrated value chain of four key elements – Projects, Orders, Field Tickets, and Master Tickets – and uses different types of reference data, master data, and transactional data elements.
As the first step in building the new data product, the Payload analytics team identified the stakeholder personas and their key value propositions, essentially tying the stakeholders’ goals to questions and KPIs.
Based on the value proposition of the stakeholders, the analytics team worked on addressing the following key business questions:
The answers, or the data related to these key questions, were captured in the eTicket and eManifest applications, and the data was stored in PostgreSQL, an open-source relational database management system. The data from PostgreSQL was transferred every hour to the Snowflake Cloud Data Platform (CDP), which served as the data warehouse for reporting. Because the data was captured in the eTicket and eManifest applications in a structured format with data integrity rules, the data quality level was high. Hence, the entire population data set was considered for analytics.
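The book does not spell out the mechanics of this hourly transfer, so the following is only a minimal sketch of such a load, assuming a hypothetical field_ticket table and the standard psycopg2 and Snowflake Python connectors:

# Minimal sketch of the hourly PostgreSQL -> Snowflake load (illustrative only).
# The table and column names (field_ticket, loaded_at) are assumptions, not
# Payload's actual schema, and the credentials are placeholders.
import pandas as pd
import psycopg2
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

def load_last_hour():
    # Pull the last hour of field tickets from the operational PostgreSQL database
    with psycopg2.connect("dbname=payload user=etl password=***") as pg_conn:
        df = pd.read_sql(
            "SELECT * FROM field_ticket WHERE loaded_at >= NOW() - INTERVAL '1 hour'",
            pg_conn,
        )

    # Append the increment to the Snowflake data warehouse used for reporting
    sf_conn = snowflake.connector.connect(
        account="payload_account", user="etl_user", password="***",
        warehouse="REPORTING_WH", database="PL_DWH", schema="STAGING",
    )
    try:
        write_pandas(sf_conn, df, "FIELD_TICKET", auto_create_table=True)
    finally:
        sf_conn.close()

if __name__ == "__main__":
    load_last_hour()

In practice, a job like this would be triggered by a scheduler or orchestrator; the point is simply that the operational database and the reporting warehouse stay in sync on an hourly cadence.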
After extensive discussion with Payload’s senior management and key subject matter experts (SMEs), the data product strategy was to build the following five analytics offerings. The five data products branded as PL Insights are:
The roadmap for deploying these data products, PL Insights, was mapped to the three types of data products – data enhancing, data exchanging, and data experiencing products – as shown in Figure 12.4. The first wave of data product development focused on the two data experiencing products – Basic Data Products for the Trucking Companies and Basic Data Products for the E&P Companies. The priority was to first improve data literacy and adoption in the user ecosystem, or digital network, with descriptive analytics (reports and dashboards) before embarking on advanced analytics solutions, that is, predictive and prescriptive analytics. Hence, the development of the data exchanging and data enhancing products was deferred until market success was realized with the two basic data analytics products.
Basic data analytics products include dashboards and reports. While the reports can be transactional or BI reports, the analytics team blended the key features of transactional and BI reports by leveraging the technical capabilities of the Snowflake data warehouse (DWH) platform. Snowflake stores data in a columnar format and is designed for fast analytic queries. These queries are saved as views and embedded into the eTicket and eManifest products using Sigma Computing, a cloud data analysis and visualization software.
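As an illustration of such a saved view, here is a minimal sketch in Python; the table and column names (FIELD_TICKET, CARRIER_ID, VOLUME_M3, and so on) are hypothetical, and Sigma Computing would then be pointed at a view like this to render the report:

# Minimal sketch of a descriptive-analytics view in the Snowflake DWH (illustrative).
# All object names are assumptions; the view aggregates ticket volumes per carrier per month.
import snowflake.connector

VIEW_SQL = """
CREATE OR REPLACE VIEW ANALYTICS.MONTHLY_CARRIER_VOLUME AS
SELECT
    CARRIER_ID,
    DATE_TRUNC('month', TICKET_DATE) AS TICKET_MONTH,
    COUNT(*)                         AS TICKET_COUNT,
    SUM(VOLUME_M3)                   AS TOTAL_VOLUME_M3,
    SUM(DISTANCE_KM)                 AS TOTAL_DISTANCE_KM
FROM PL_DWH.CORE.FIELD_TICKET
GROUP BY CARRIER_ID, DATE_TRUNC('month', TICKET_DATE)
"""

conn = snowflake.connector.connect(account="payload_account", user="analytics", password="***")
conn.cursor().execute(VIEW_SQL)
conn.close()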
The Basic Data Product for the Trucking companies appears in Figure 12.6, and for E&P companies in Figure 12.7.
To help Trucking and E&P companies succeed in consuming these data products in their business operations, PL Insights was delivered as a holistic solution – a combination of data products and services. While the basic data products for the Trucking and E&P companies included the dashboard and the descriptive analytics reports, the services included:
Data governance was mainly handled by the Operations team at Payload. The data governance activities included training users from the E&P and trucking companies on using the right descriptive analytics solution, that is, the dashboard and reports; assigning users to the right reports based on RBAC (Role-Based Access Control); data cleansing and validation; and so on.
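As a minimal sketch of what the RBAC assignment could look like in the Snowflake warehouse, the statements below map a trucking-company role and an E&P role to their respective report views; all role, view, and user names are hypothetical:

# Minimal sketch of role-based access control (RBAC) on the reporting views (illustrative).
# Roles separate what trucking users and E&P users can see; all names are assumptions.
import snowflake.connector

RBAC_STATEMENTS = [
    "CREATE ROLE IF NOT EXISTS TRUCKING_ANALYST",
    "CREATE ROLE IF NOT EXISTS EP_ANALYST",
    # Trucking companies see the carrier-facing views
    "GRANT SELECT ON VIEW ANALYTICS.MONTHLY_CARRIER_VOLUME TO ROLE TRUCKING_ANALYST",
    # E&P companies see the producer-facing views
    "GRANT SELECT ON VIEW ANALYTICS.MONTHLY_WELL_SHIPMENTS TO ROLE EP_ANALYST",
    # A user from a trucking company is mapped to the trucking role
    "GRANT ROLE TRUCKING_ANALYST TO USER CARRIER_DISPATCHER_01",
]

conn = snowflake.connector.connect(account="payload_account", user="governance", password="***")
cur = conn.cursor()
for stmt in RBAC_STATEMENTS:
    cur.execute(stmt)
conn.close()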
About bad analytics
While there is a lot of discussion on good analytics, what does bad analytics look like? What exactly is bad analytics, and what are its key characteristics? Bad analytics is more than a lack of good insights. Here are the eleven key characteristics of bad analytics, which will help you identify and prevent bad analytics in your company.
a) Confirmation bias. Confirmation bias is favoring insights that confirm previously existing beliefs or findings – that is, insights that are already proven or known.
b) Availability bias. Getting good quality data for analytics is challenging. Availability bias is the tendency to share insights that come readily to mind instead of thoroughly analyzing the issue.
c) Selection bias. Selection bias occurs when the data sample does not represent the population: the sample is not representative, and there is no proper randomization in how the data was selected.
d) Anchoring bias. Anchoring bias is fixating on initial information and failing to adjust for subsequent information. This becomes an important issue if the analytics is not based on the most recent data.
e) Framing bias. Framing bias is the tendency to be influenced by the way a problem is formulated or defined to suit one’s interests.
f) Sunk cost bias. Sunk cost bias is the tendency to “honor” resources already spent, especially time and money. It happens when investments have been made on bad insights and businesses, not wanting to lose the time or money already invested, fail to make the decision that would give them the best outcome going forward.
g) Authority bias. Authority bias is deferring to the highest-paid person’s opinion (HiPPO). The highest-paid person usually has the most power and the highest designation in the room. Once his or her opinion is out, dissent is shut down, thereby preventing a thorough analysis of the problem and the solution.
Overall, bad analytics does not support evidence-based and data-driven decision-making (3DM) for business results. Analytics should be designed for a purpose, and the best way to avoid bad analytics is to work on real problems for actual customers or stakeholders – in other words, to tie stakeholder insight needs to goals, questions, and quality data.
Selecting statistical tools for analytics
The decision of which statistical test to use depends on the business question, the distribution of the data, and the data type of the variable. As discussed in Chapter 11, there are four main types of business questions from a statistical perspective – composition, comparison, relationship, and distribution. Business data is often approximately normally distributed, in which case parametric tests are used; if the data is not normally distributed, non-parametric alternatives should be used instead. The data type of the analytics variable can be nominal, ordinal, or continuous. The image below lists the key statistical tests and their typical use cases.
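As a minimal sketch of this decision process, the following Python example uses simulated haul cycle-time data for two carriers: a normality check decides between a parametric t-test and its non-parametric alternative, the Mann-Whitney U test:

# Minimal sketch: let the data's distribution drive the choice of comparison test (illustrative).
# The cycle-time samples below are simulated, not Payload data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=4.2, scale=0.6, size=40)   # haul cycle times (hours), carrier A
group_b = rng.normal(loc=4.6, scale=0.7, size=40)   # haul cycle times (hours), carrier B

# Step 1: is each sample plausibly normal?
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

# Step 2: pick the comparison test based on the distribution
if normal_a and normal_b:
    test_name, result = "independent t-test", stats.ttest_ind(group_a, group_b)
else:
    test_name, result = "Mann-Whitney U test", stats.mannwhitneyu(group_a, group_b)

print(f"{test_name}: statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")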
Closing thoughts
The amount of data generated by businesses today is unprecedented. As this growth continues, so do the opportunities for organizations to derive insights from their data analytics initiatives and gain a sustainable competitive advantage. Given the complexity of business operations today, decision making must inevitably rely on the insights derived from data analytics. Analytics is seen as the next frontier for innovation and productivity in business. But achieving a sustainable competitive advantage from analytics is a complex endeavor and demands a great deal of commitment from the organization.
As discussed in Chapter 1, the ten best analytics practices are implemented in a playbook fashion – a combination of strategic and tactical elements to deliver the greatest value to the business. From the strategic perspective, this means enabling organizations to develop analytics talent, culture, data literacy, discipline, and organization structure. From the tactical perspective, it means implementing the ten best analytics practices in ways that reflect the process workflows, standard operating procedures (SOPs), and cultural values.
In implementing the ten best practices, the analytics team will run into many challenges. If there is no data, or no quality data, for validating the hypothesis, one option is to rework the hypothesis. If the team is challenged with acquiring data internally, one approach is to get data from external sources. If there is no good data for analytics, one strategy is to leverage sampling or feature engineering techniques. If there is no precise or accurate business data, one solution is to use ranges and confidence intervals. The bottom line is that analytics is a probabilistic process, not a deterministic one. One cannot expect a perfect situation in analytics initiatives; it simply doesn’t exist. Overall, the analytics implementation is an evolutionary process, just like the business entity itself. The insight needs of the business constantly change, the organizational capabilities continuously mature, the data sets grow, improve, and sometimes even degrade, and the technological capabilities to process the data improve over time.
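As a concrete illustration of the ranges-and-confidence-intervals tactic mentioned above, here is a minimal sketch that reports an average with a 95% confidence interval rather than a single point estimate; the haul-distance values are made up:

# Minimal sketch: report a range (95% confidence interval) rather than a point estimate (illustrative).
import numpy as np
from scipy import stats

haul_distances_km = np.array([182, 175, 198, 210, 167, 189, 205, 173, 194, 181])

mean = haul_distances_km.mean()
sem = stats.sem(haul_distances_km)   # standard error of the mean
low, high = stats.t.interval(0.95, len(haul_distances_km) - 1, loc=mean, scale=sem)

print(f"Average haul distance: {mean:.1f} km (95% CI: {low:.1f} to {high:.1f} km)")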
Reference