Appendix 3
Analytics Glossary
- ACID Test: A test applied to database transactions for atomicity, consistency, isolation, and durability.
- Aggregation: A process of searching, gathering, and presenting data.
- Algorithm: A mathematical formula or statistical process used to perform analysis of data.
- API (Application Program Interface): A set of programming standards and instructions for accessing or building web-based software applications.
- Artificial Intelligence: The ability of a machine to apply information gained from previous experience accurately to new situations in a way that a human would.
- Best Practice: A best practice is a guideline or idea that has been generally accepted as superior to the alternatives because it is prescriptive, reusable, and produces superior results.
- Big Data: Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. Big data sets are characterized by 3Vs: volume, velocity, and variety.
- Business Intelligence: The general term used for the identification, extraction, and analysis of multi-dimensional data.
- CDO 4.0. According to Gartner, with the increased usage of data and analytics (D&A) across the enterprise, the chief data officer’s (CDO) focus needs to shift from D&A projects and programs to products.
  - CDO 1.0 was focused exclusively on data management.
  - CDO 2.0 started to embrace analytics.
  - CDO 3.0 led and participated quite heavily in digital transformation.
  - CDO 4.0 is focused on data products.
- Change Management: Change management is the discipline that guides how we prepare, equip, and support individuals to successfully adopt change in order to drive organizational success and outcomes.
- Cloud Computing: A distributed computing system hosted and running on remote servers and accessible from anywhere on the internet.
- Correlation. Correlation is a statistical technique that shows how strongly two variables are related. For example, height and weight are correlated; taller people tend to be heavier than shorter people. A short code sketch appears at the end of this glossary.
- Cube. A data structure in OLAP systems. It is a method of storing data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data (measures) are categorized by dimensions. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.
- Dashboard: A graphical representation of KPIs and visuals.
- Data: Data is a set of fields with quantitative or qualitative values in a specific format.
- Data Analyst: A person responsible for the tasks of modeling, preparing and cleaning data for the purpose of deriving actionable information from it.
- Data Analytics: The process of answering business questions using data. Businesses typically use three types of analytics: descriptive, predictive, and prescriptive.
- Data Architecture: The mechanism by which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.
- Data Broker: A data broker is a data product that aggregates data from a variety of sources; processes, enriches, cleanses, or analyzes it; and licenses it to other organizations as data products.
- Data Center: A data center is a dedicated space used to house computer systems and associated components, such as telecommunications and storage systems.
- Data Cleansing: The process of reviewing and revising data to delete duplicate entries, correct misspellings and other errors, add missing data, and provide consistency.
- Data Governance: A set of processes or rules that ensure data integrity and data management best practices are met.
- Data Integration: The process of combining data from different sources and presenting it in a single view.
- Data Hub: A data hub is a collection of data from multiple sources organized for distribution and sharing. Generally, this data distribution and sharing is in the form of a hub and spoke architecture.
- Data Integrity: The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.
- Data Lake: A large repository of enterprise-wide data in raw format – structured and unstructured data.
- Data Mart: The access layer of a data warehouse used to provide data to users.
- Data Mining: Finding meaningful patterns and deriving insights from large data sets using sophisticated pattern recognition techniques. To derive meaningful patterns, data miners use statistics, machine learning algorithms, and artificial intelligence techniques.
- Data Product: A data product is the application of data for improving business performance; it is usually an output of the data science activity.
- Data Science: A discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.
- Data Storytelling: Data storytelling is communicating the insights from data using a combination of four key elements: data, visuals, narrative, and benefits.
- Data Warehouse: A repository for enterprise-wide data but in a structured format after cleaning and integrating with other sources. Data warehouses are typically used for conventional data (but not exclusively).
- Database: A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system.
- Descriptive Analytics: Condensing large amounts of data into smaller, more useful pieces of information. This is like summarizing the data story: rather than listing every single number and detail, it conveys the general thrust and narrative.
- Digital Asset Management (DAM): DAM is a system that stores, shares, and organizes digital assets in a central location.
- Digital Vortex. The Digital Vortex is the inevitable movement of industries toward a “digital center” in which business models, offerings, and value chains are digitized to the maximum extent possible.
- Discrete Data: Data that is not measured on a continuous scale. Also known as intermittent data. Discrete data is based on counts.
- ETL (Extract, Transform and Load): The process of extracting raw data, transforming it by cleaning and enriching it to make it fit operational needs, and loading it into the appropriate repository for the system’s use. A short code sketch appears at the end of this glossary.
- Event: A set of outcomes of an experiment (a subset of the sample space) to which a probability is assigned.
- Exploratory Analysis: An approach to data analysis focused on identifying general patterns in data, including outliers and features of the data.
- Feature Engineering: Creating a “smarter” dataset by deriving new attributes or features from the existing data sets, applying domain experience and intuition. Feature engineering serves two main purposes: transforming data types and creating new fields or attributes. A short code sketch appears at the end of this glossary.
- Hypothesis. A hypothesis is an assumption, an idea, or a gut feeling that is proposed for validation so that it can be tested to see if it might be true.
- IoT (Internet of Things): The network of physical objects or “things” embedded with electronics, software, sensors, and connectivity that enable them to achieve greater value and service by exchanging data with the manufacturer, operator, and/or other connected devices.
- Industry 4.0. The fourth industrial revolution, referred to as Industry 4.0, is geared towards automation and data exchange using cyber-physical systems (CPS), the internet of things (IoT), industrial internet of things (IIOT), cloud computing, cognitive computing, and artificial intelligence (AI).
- Insight. It is the understanding of a specific cause and effect within a specific context. In this book, the terms insight and information are used interchangeably.
- Join: A data join is when two or more data sets are combined using at least one common column in each data set. A join differs from a union, which puts data sets on top of each other and requires all of the columns to be the same. A short code sketch appears at the end of this glossary.
- KPI. A Key Performance Indicator (KPI) is a measurable value that demonstrates how effectively the entity is achieving key objectives or targets.
- Last Mile Analytics. LMA is delivering analytics solutions by focusing on the last mile of analytics (insight derivation and decision making) throughout the analytics process, thereby providing value to the enterprise.
- Linear Regression: A model of the relationship between one scalar response (or dependent variable) and one or more explanatory variables (or independent variables). If there is one explanatory variable, it is called simple linear regression; if there are multiple explanatory variables, it is called multiple linear regression (MLR). A short code sketch appears at the end of this glossary.
- Logistic Regression: Investigates the relationship between response (Y’s) and one or more predictors (X’s) where Y’s are categorical, not continuous, and X’s can be either continuous or categorical.
- MAD Framework: MAD, which stands for Monitor-Analyze-Detail, is a top-down analysis framework that delivers insights to users based on their needs, thus optimizing adoption and usability.
- Machine-generated Data: Data automatically created by machines via sensors or algorithms or any other non-human source. Commonly known as IoT data.
- Machine Learning: A method of designing systems that can learn, adjust and improve based on the data fed to them. Using statistical algorithms that are fed to these machines, they learn and continually zero in on “correct” behavior and insights, and they keep improving as more data flows through the system.
- Master Data. Master data describes the core entities of the enterprise, like customers, products, suppliers, assets, and so on.
- Master data management (MDM). Master data is any non-transactional data that is critical to the operation of a business — for example, customer or supplier data, product information, or employee data. MDM is the process of managing that data to ensure consistency, quality, and availability.
- Metadata. Any data used to describe other data — for example, a data file’s size or date of creation.
- Multicollinearity. It is a state of very high intercorrelations among the independent variables, indicating that there are duplicate or redundant variables in the analysis. It is, therefore, a type of disturbance in the data, and if it is present in the dataset, the insights derived may not be reliable.
- Normalization. It is a database design technique that organizes database tables in a manner that reduces redundancy and dependency of data. It divides larger database tables into smaller tables and links them using relationships. A short code sketch appears at the end of this glossary.
- Normal Distribution: Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution is the familiar bell curve.
- Online analytical processing (OLAP). The process of analyzing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability for users to select subsets and view them from different perspectives). OLAP systems are used in BI reports.
- Online transactional processing (OLTP). The process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it. OLTP systems are used in transactional reports.
- Predictive Analytics: Using statistical functions on one or more data sets to predict trends or future events.
- Prescriptive Analytics: Prescriptive analytics builds on predictive analytics by including actions and making data-driven decisions based on the impacts of various actions.
- Population: A dataset that consists of all the members of some group.
- Reference Data. Data that defines the business categories used to classify other data, such as country codes or status codes.
- Regression Analysis: A modeling technique used to define the association between variables. It assumes a one-way causal effect from predictor variables (independent variables) to a response of another variable (dependent variable). Regression can be used to explain the past and predict future events.
- SQL (Structured Query Language): A programming language for retrieving data from a relational database.
- SVM. Support-vector machine (SVM) is a machine learning (ML) algorithm used for data classification.
- Sample. A sample data set consists of only a portion of the data from the population.
- Stored procedure. A stored procedure is a group of SQL statements that has been created and stored in the database so that it can be reused and shared by multiple programs.
- Systems of Insight (SoI). The system used to perform data analysis on data that is combined from SoR or transactional systems.
- System of Record (SoR). The authoritative data source for a data element. To ensure data integrity in the enterprise, there must be one — and only one — system of record for a data element.
- Stakeholder: Individuals and organizations who are actively involved in the initiative, or whose interests may be positively or negatively affected as a result of execution or successful completion of the initiative.
- Structured Data: Data that is organized according to a predetermined structure.
- Text Analytics: Text analytics or text mining is the application of statistical, linguistic, and machine learning techniques on text-based data sources to derive meaning or insight. It is the process of deriving insights from text-based content.
- Transactional Data: Data that relates to business events such as purchase orders and invoices.
- Union: The SQL UNION operator is used to combine two or more data sets with the same fields and data types. The tables that are part of the UNION should have the same structure. A short code sketch appears at the end of this glossary.
- Unstructured Data: Data that has no identifiable structure, such as email, social media posts, documents, audio files, images, and videos.
- Value Stream Mapping (VSM): VSM is a visual tool that shows the flow of processes and information the company uses to produce a product or service for its customers. VSM has its origins in lean manufacturing, and the business value of VSM is to identify and remove or reduce waste in the processes, thereby increasing the efficiency of the system.
- Visualization: A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively. Visuals created are often complex, but understandable, in order to convey the data.
- Visual Analytics: It is the science of analytical reasoning supported by visuals. In this book, concepts such as data storytelling and dashboards are associated with visual analytics.
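Code Sketches for Selected Terms

Correlation. A minimal sketch of computing Pearson’s correlation coefficient for the height/weight example in the glossary entry. It assumes Python 3.10+ for statistics.correlation; the numbers are made up for illustration.

```python
import statistics

# Illustrative (made-up) height and weight observations.
heights_cm = [150, 160, 165, 170, 180, 185]
weights_kg = [52, 58, 63, 68, 77, 82]

# Pearson's r ranges from -1 (perfect negative) to +1 (perfect positive).
r = statistics.correlation(heights_cm, weights_kg)
print(f"Pearson correlation between height and weight: {r:.2f}")
```

A value close to +1 here reflects the glossary example: taller people tend to be heavier.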
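ETL. A minimal extract-transform-load sketch using only the Python standard library. The file name orders.csv, the column names, and the cleanup rules are illustrative assumptions, not a prescribed design.

```python
import csv
import sqlite3

# Extract: read raw rows from a CSV export (assumed to have
# order_id, customer, and amount columns).
with open("orders.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: clean and enrich the data so it fits operational needs.
clean_rows = []
for row in raw_rows:
    if not row["order_id"]:                  # drop incomplete records
        continue
    clean_rows.append((
        int(row["order_id"]),
        row["customer"].strip().title(),     # standardize customer names
        round(float(row["amount"]), 2),      # normalize currency precision
    ))

# Load: write the transformed rows into the target repository.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)
conn.commit()
conn.close()
```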
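Feature Engineering. A minimal sketch of the two purposes named in the entry: transforming a data type and creating a new field from existing ones. The customer records and derived attributes are illustrative assumptions.

```python
from datetime import date

customers = [
    {"id": 1, "birth_date": date(1985, 4, 2), "total_spend": 1200.0, "orders": 8},
    {"id": 2, "birth_date": date(1999, 11, 17), "total_spend": 300.0, "orders": 2},
]

for c in customers:
    # Transform a data type: a birth date becomes an approximate numeric age.
    c["age"] = (date.today() - c["birth_date"]).days // 365
    # Create a new attribute from existing fields: average order value.
    c["avg_order_value"] = c["total_spend"] / c["orders"]

print(customers)
```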
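Join. A minimal sketch of a join on a common column, using Python’s built-in sqlite3 module; the customers and orders tables are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Bo');
    INSERT INTO orders    VALUES (101, 1, 250.0), (102, 2, 99.0);
""")

# The two data sets are combined using the common column customer_id.
rows = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
""").fetchall()
print(rows)   # [('Asha', 101, 250.0), ('Bo', 102, 99.0)]
```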
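Linear Regression. A minimal simple-linear-regression sketch, assuming Python 3.10+ for statistics.linear_regression; the advertising-spend and sales figures are made up for illustration.

```python
import statistics

ad_spend = [10, 20, 30, 40, 50]    # explanatory (independent) variable
sales    = [25, 41, 62, 79, 101]   # response (dependent) variable

# Fit sales = slope * ad_spend + intercept by ordinary least squares.
fit = statistics.linear_regression(ad_spend, sales)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}")

# Use the fitted line to predict the response for a new observation.
print("predicted sales at spend 60:", fit.slope * 60 + fit.intercept)
```

With more than one explanatory variable (multiple linear regression), a library such as scikit-learn or statsmodels would typically be used instead.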
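Normalization. A minimal sketch of splitting one wide, redundant table into two smaller, related tables, using Python’s built-in sqlite3 module. The table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Before normalization, a single wide orders table would repeat the
    -- customer's name and city on every order (redundancy).

    -- After normalization, each customer is stored once and orders link to it by key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
```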
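Union. A minimal UNION sketch using Python’s built-in sqlite3 module; the two regional sales tables are illustrative. Note how the stacked tables share the same columns and compatible data types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_east (order_id INTEGER, amount REAL);
    CREATE TABLE sales_west (order_id INTEGER, amount REAL);
    INSERT INTO sales_east VALUES (1, 100.0), (2, 150.0);
    INSERT INTO sales_west VALUES (3, 200.0);
""")

# UNION stacks the two data sets on top of each other (removing duplicates;
# UNION ALL would keep them).
rows = conn.execute("""
    SELECT order_id, amount FROM sales_east
    UNION
    SELECT order_id, amount FROM sales_west
""").fetchall()
print(rows)
```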