Appendix B
aggregating: Summarizing a collection of data.
agile: A framework for developing software or a product that focuses on small iteration cycles (usually two to four weeks), heavy stakeholder interaction, and delivery of a functional product at each cycle.
AI: See artificial intelligence.
algorithm: A step-by-step procedure or set of rules used to process or analyze data.
analytics: A field of study around data; the results of data analysis.
anonymization: The process of removing personally identifiable data about a person.
application: Computer software written to perform some function.
artificial intelligence (AI): A field of study with the goal of enabling machines to process information in ways similar to humans, allowing computer systems to “learn” and improve as they receive more information.
behavioral analytics: Analytics that examine how humans act, with the goal of predicting future actions; the analysis of human behavior patterns.
big data scientist: Someone who develops the algorithms to make sense out of big data.
big data startup: A young company that has developed new big data technology.
bioinformatics: The field of study that applies computational and analytical methods to data from life science and medicine.
biometrics: The identification of humans by their characteristics like facial features, fingerprints, and other distinguishing marks.
business intelligence: The practice of analyzing historical data to better understand business trends.
cloud computing: A distributed computing system used for data storage and computing power that is not located on the user’s premises. Several key attributes distinguish cloud computing from a traditional system.
comparative analysis: Process of data comparisons and calculations to detect patterns within very large datasets.
concurrency: Performing multiple similar tasks at the same time.
correlation analysis: A method of determining relationships among data or events.
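For illustration, a minimal sketch of correlation analysis in Python using NumPy (the values here are hypothetical):

import numpy as np

ad_spend = np.array([10, 20, 30, 40, 50])   # hypothetical weekly ad spend
sales    = np.array([12, 24, 33, 39, 55])   # hypothetical weekly sales

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two series.
r = np.corrcoef(ad_spend, sales)[0, 1]
print(f"correlation: {r:.2f}")   # a value near 1.0 suggests a strong linear relationship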
CRM: See customer relationship management.
customer relationship management (CRM): A system for managing sales, sales forecasts, and customers.
cybersecurity: The act of protecting technology, information, and networks from attacks.
dashboard: A visual display that summarizes key data and metrics in one view.
data aggregation tools: Tools for bringing disparate data together for analysis.
data analyst: Someone who analyzes, models, cleans, or processes data.
data center: A physical location that houses the servers for storing data.
data cleansing: The process of reviewing and revising data in order to delete duplicates, correct errors, and provide consistency.
data custodian: Someone who is responsible for the technical environment necessary for data storage.
data ethical guidelines: Guidelines that help organizations be transparent with their data, ensuring simplicity, security, and privacy.
data feed: A stream of data, such as a Twitter feed or RSS. See also RSS.
data marketplace: An online environment for buying and selling datasets.
data mart: Related to a data warehouse, but usually smaller and more focused on one subject area. See also data warehouse.
data mining: The process of finding data patterns from datasets.
data modeling: The process of architecting data objects and structures as they relate to a business or other context.
data virtualization: A data integration approach that provides a unified view of data without physically consolidating it, used to gain more insights. It usually involves databases, applications, file systems, websites, big data techniques, and so on.
data warehouse: A data store used for decision support. Data warehouses are not often real time and are organized in a way to make data analysis easy. Unlike data marts, data warehouses contain multiple subject areas.
database: A collection of data stored digitally.
database as a service (DBaaS): A database hosted in the cloud on a pay-per-use basis (for example, Amazon Relational Database Service or Oracle Database as a Service).
database management system: A system for collecting, storing, and providing access to data.
dataset: A collection of data.
DBaaS: See database as a service.
deidentification: See anonymization.
discriminant analysis: A statistical method for predicting a grouping of data.
distributed file system: A file system that stores data across multiple machines while offering simplified, highly available access for storing, analyzing, and processing that data.
document store database: A document-oriented database used to store, manage, and retrieve document objects; the documents themselves are a form of semi-structured data.
ETL: See extract, transform, and load.
exabyte: Approximately 1,000 petabytes, or 1 billion gigabytes.
extract, transform, and load (ETL): A process in databases and data warehousing that involves extracting the data from different sources, transforming it to fit specific needs, and loading it into a logical database.
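A minimal sketch of the three ETL steps in Python (the file names and field names are hypothetical):

import csv

# Extract: read raw records from a source file.
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean the records and reshape them to fit the target schema.
cleaned = [
    {"order_id": r["id"], "amount_usd": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop records with missing amounts
]

# Load: write the transformed records into the target store
# (a CSV file stands in for a database table in this sketch).
with open("sales_clean.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "amount_usd"])
    writer.writeheader()
    writer.writerows(cleaned)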
failover: Automatically switching to a backup system when the primary system fails.
fault-tolerant design: A system designed to keep working even when some of its components fail or a disruption occurs.
geospatial data: Data that defines geographic information.
graph database: A database that uses graph structures with edges, properties, and nodes.
grid computing: Connecting computer systems in many different locations, often via the cloud, so that they work together for the same purpose.
Hadoop: An open-source framework that is built to process and store huge amounts of data across a distributed file system.
Hadoop Distributed File System (HDFS): A distributed file system designed to run on commodity hardware. See also distributed file system.
HBase: An open-source, nonrelational, distributed database running in conjunction with Hadoop. See also Hadoop.
HDFS: See Hadoop Distributed File System.
high-performance computing (HPC): Using supercomputers or huge clusters to solve highly complex computing problems.
HPC: See high-performance computing.
in memory: When a database management system stores data in main memory instead of on disk, resulting in very fast processing, storing, and loading of the data.
Internet of things: Ordinary devices, equipped with sensors, that are connected to the Internet at any time and from anywhere.
key-value database: A simple database in which each piece of data (the value) is stored and retrieved by a unique key.
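As a rough sketch, an ordinary Python dictionary mimics the get/set-by-key behavior that a real key-value database exposes over the network:

# A plain dict stands in for a key-value store in this sketch:
# each value is stored and retrieved by a single unique key.
store = {}

store["user:1001"] = '{"name": "Ada", "plan": "premium"}'   # set a value by key
profile = store["user:1001"]                                # get the value back
store.pop("user:1001", None)                                # delete the key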
latency: A measure of time delayed in a system.
legacy system: An old system or technology.
load balancing: The process of distributing workloads or network traffic across multiple servers.
log file: A file automatically created by a computer to record events that occur while the system is running.
M2M data: See machine-to-machine data.
machine data: Data automatically created by systems via sensors or algorithms.
machine learning: A subset of AI, in which machines learn from what they’re doing and become better over time. See also artificial intelligence.
machine-to-machine (M2M) data: Data used by machines or clusters of machines that are communicating with each other.
MapReduce: A software framework for processing vast amounts of data.
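The classic word-count example illustrates the programming model; this single-machine Python sketch mimics the map and reduce phases rather than Hadoop's distributed implementation:

from collections import defaultdict

lines = ["big data is big", "data about data"]   # hypothetical input

# Map: emit a (word, 1) pair for every word on every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle and reduce: group the pairs by word and sum the counts.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}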
mash up: Bringing together different information to communicate a single idea.
massively parallel processing (MPP): Using many different processors to perform tasks at the same time.
megabyte: Approximately 1 million bytes, or 1,000 kilobytes.
metadata: Data about data. Metadata gives information about what the data is about.
MongoDB: An open-source NoSQL database.
MPP: See massively parallel processing.
multidimensional database: A database used to process online analytical processing (OLAP) applications and often the framework for data warehousing.
MultiValue database: A type of NoSQL and multidimensional database that understands three-dimensional data directly.
natural language processing (NLP): An area of computer science involved with the computational study of human languages.
network analysis: Analyzing connections between computers in a network. Looks at connectivity, speed, and latency.
NLP: See natural language processing.
NoSQL: A database that doesn’t adhere to relational database structures. Used to organize and query unstructured data.
Not Only SQL: See NoSQL.
object database: A database that stores data in the form of objects. Object databases are different from relational structures and can be used to store unstructured or semi-structured data.
OLAP: See online analytical processing.
OLTP: See online transaction processing.
online analytical processing (OLAP): Systems that are multidimensional data stores used to perform analysis on large datasets. Usually used for business intelligence systems.
online transaction processing (OLTP): Systems that manage transactional data, like banking or ATM systems.
ontology: The study of how things relate. Used in big data to analyze seemingly unrelated data to discover insights.
operational database: A database system that runs core functions to the business in production environments. These are not test or reporting database systems, but actual systems that run the operations of the company.
optimization analysis: The process of using algorithms to improve performance of a system or analysis.
orthogonal: In statistics, data that are independent of each other. In big data, we seek to understand if orthogonal information is, in fact, related.
outlier: An object or data point that deviates significantly from the general average.
outlier detection: A set of algorithms to automatically discover outliers within a dataset.
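A minimal sketch of one simple approach, flagging points that fall far from the mean (production systems typically use more robust methods):

import statistics

values = [10, 12, 11, 13, 95, 12, 11]   # hypothetical sensor readings

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag any point more than two standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)   # [95]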
PaaS: See platform as a service.
pattern recognition: The process of identifying patterns in data via algorithms to make predictions within a subject area.
petabyte: Approximately 1,000 terabytes or 1 million gigabytes.
Pig: A high-level programming interface for creating MapReduce jobs within Hadoop.
platform as a service (PaaS): A service providing all the necessary infrastructure for cloud-computing solutions.
predictive analysis: Analysis that is used to predict behavior or events. This can be from historical data, social data, or any other orthogonal datasets. See also orthogonal.
primary key: A field (or combination of fields) that uniquely identifies a record, used for fast data lookup.
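For example, a minimal sketch using Python's built-in sqlite3 module (the table and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Ada"))

# The primary key uniquely identifies each row and is indexed for fast lookup.
row = conn.execute("SELECT name FROM customers WHERE id = ?", (1,)).fetchone()
print(row[0])   # Ada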
privacy: The ability of a person to keep personal information to himself or herself.
public data: Public information or datasets that were created for general use, with public and private funding.
query: A question; in the context of data, it’s often a set of code used to ask a question of a dataset.
real-time data: Data that is created while a process or event is happening.
real-time streaming: The process of capturing, storing, and analyzing real-time data. See also real-time data.
recommendation engine: An algorithm that suggests actions based on past behavior of the user or similar users.
regression analysis: Analysis used to discover the dependency between variables.
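A minimal sketch of simple linear regression in Python with NumPy, fitting a line to hypothetical data:

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5])       # independent variable
exam_score    = np.array([52, 58, 65, 70, 78])  # dependent variable

# Fit a straight line: exam_score is modeled as slope * hours_studied + intercept.
slope, intercept = np.polyfit(hours_studied, exam_score, 1)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")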
relational online analytical processing (ROLAP): A form of online analytical processing that performs multidimensional analysis directly on data stored in relational databases.
ROLAP: See relational online analytical processing.
routing analysis: Analysis using many different variables to find the optimized routing for a certain means of transport in order to decrease fuel costs and increase efficiency.
RSS: A standard by which Internet feeds can publish news or other information to subscribers. Stands for Really Simple Syndication.
SaaS: See software as a service.
semi-structured data: A structured data type that lacks a formal, rigid definition, such as a document. It has tags or other markers that enforce a hierarchy of records within a particular object, but that structure may differ from one object to another. See also structured data.
sentiment analysis: Analysis to find out how people feel based on digital communication like Twitter, Instagram, or emails.
signal analysis: Analysis that looks at information over a given amount of time. It can involve sound, data streams, or even image feeds. It is often used with sensor data.
simulation: The imitation of a real-world event in order to predict behavior or analyze patterns.
smart grid: Sensors within an energy grid used to monitor what is going on in real time, helping to increase efficiency.
social media: User-curated systems where groups of people are networked together to collaborate or share information.
software as a service (SaaS): A software tool available via web clients and usually paid for through a subscription-type model.
SQL: Structured Query Language; a programming language for defining, querying, and manipulating data in a relational database.
structured data: Data that adheres to a strict definition.
terabyte: Approximately 1,000 gigabytes.
time series analysis: The analysis of data obtained through a recurring measurement of time.
transactional data: Information stored from a time-based instance, like a bank deposit or phone call.
unstructured data: Data that doesn’t fit into a fixed and strict definition. Things like sound files, images, text, and web pages can be considered unstructured data.
value: The benefit for stakeholders derived from bringing together massive amounts of information.
variability: The ever-changing nature of data. For example, you may want to store tweets, bank records, and weather readings, all of which change constantly.
variety: The difference in type of data. Data today comes in many different formats: unstructured data, semi-structured data, structured data, and even complex structured data.
velocity: The speed at which data is created, stored, analyzed, and reported.
visualization: The process of bringing very complex analysis into a visual representation that users can understand.
volume: The massive amounts of data to be stored and analyzed.
weather data: Public datasets that can be used by organizations wanting to combine weather information with other data sources.
yottabyte: Approximately 1,000 zettabytes, equivalent to about 250 trillion DVDs.
zettabyte: Approximately 1,000 exabytes or 1 billion terabytes.
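Taken together, the decimal storage units in this glossary scale by factors of 1,000; a short Python sketch of the ladder:

# Each decimal (SI) storage unit is 1,000 times the previous one.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

for power, name in enumerate(units, start=1):
    print(f"1 {name} = 10^{3 * power} bytes")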