Chapter 2
IN THIS CHAPTER
Unraveling the big data story
Looking at important data sources
Differentiating data science from data engineering
Storing data on-premise or in a cloud
Exploring other data engineering solutions
Though data and artificial intelligence (AI) are extremely interesting topics in the eyes of the public, most laypeople aren’t aware of what data really is or how it’s used to improve people’s lives. This chapter tells the full story about big data, explains where big data comes from and how it’s used, and then outlines the roles that machine learning engineers, data engineers, and data scientists play in the modern data ecosystem. In this chapter, I introduce the fundamental concepts related to storing and processing data for data science so that this information can serve as the basis for laying out your plans for leveraging data science to improve business performance.
I am reluctant to even mention big data in this, the third, edition of Data Science For Dummies. Back about a decade ago, the industry hype was huge over what people called big data — a term that characterizes data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it lacks the structural requirements of traditional database architectures.
My reluctance stems from a tragedy I watched unfold across the second decade of the 21st century. Back then, the term big data was so overhyped across industry that countless business leaders made misguided impulse purchases. The narrative in those days went something like this: “If you’re not using big data to develop a competitive advantage for your business, the future of your company is in great peril. And, in order to use big data, you need to have big data storage and processing capabilities that are available only if you invest in a Hadoop cluster.”
Despite its significant drawbacks, Hadoop is, and was, powerful at satisfying one requirement: batch-processing and storing large volumes of data. That's great if your situation requires precisely this type of capability, but the fact is that technology is never a one-size-fits-all sort of thing. If I learned anything from the years I spent building technical and strategic engineering plans for government institutions, it’s this: Before investing in any sort of technology solution, you must always assess the current state of your organization, select an optimal use case, and thoroughly evaluate competing alternatives, all before even considering whether a purchase should be made. This process is so vital to the success of data science initiatives that I cover it extensively in Part 4.
Unfortunately, in almost all cases back then, business leaders bought into Hadoop before having evaluated whether it was an appropriate choice. Vendors sold Hadoop and made lots of money. Most of those projects failed. Most Hadoop vendors went out of business. Corporations got burned on investing in data projects, and the data industry got a bad rap. For any data professional who was working in the field between 2012 and 2015, the term big data represents a blight on the industry.
Despite the setbacks the data industry has faced due to overhype, this fact remains: If companies want to stay competitive, they must be adept at infusing data insights into their processes, products, and growth and management strategies. This is especially true in light of the digital adoption explosion that occurred as a direct result of the COVID-19 pandemic. Whether your data volumes rank on the terabyte or petabyte scale, data-engineered solutions must be designed to meet requirements for the data’s intended destination and use.
Three characteristics — also called “the three Vs” — define big data: volume, velocity, and variety. Because the three Vs of big data are continually expanding, newer, more innovative data technologies must continuously be developed to manage big data problems.
Big data volume starts at a lower limit of 1 terabyte and has no upper limit. If your organization owns at least 1 terabyte of data, that data technically qualifies as big data.
A lot of big data is created by automated processes and instrumentation nowadays, and because data storage costs are relatively inexpensive, system velocity is often the limiting factor. Keep in mind that big data is low in value density — any individual record is worth little on its own. Consequently, you need systems that can ingest a lot of it in short order to generate timely and valuable insights.
In engineering terms, data velocity is data volume per unit time. Big data enters an average system at velocities ranging from 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second. Latency is a characteristic of all data systems, and it quantifies the system’s delay in moving data after it has been instructed to do so. Many data-engineered systems are required to have latency of less than 100 milliseconds, measured from the time the data is created to the time the system responds.
Throughput is a characteristic that describes a system’s capacity for work per unit time. Throughput requirements can easily be as high as 1,000 messages per second in big data systems! High-velocity, real-time moving data presents an obstacle to timely decision-making. The capabilities of data-handling and data-processing technologies often limit data velocities.
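To make these figures concrete, here’s a minimal back-of-the-envelope sketch in Python. The message size, message rate, and observed latency below are hypothetical numbers chosen purely for illustration; only the 1,000-messages-per-second throughput figure and the 100-millisecond latency budget come from the discussion above.

```python
# Back-of-the-envelope checks for velocity, throughput, and latency.
# Message size and observed latency are hypothetical, for illustration only.

message_size_bytes = 2_000          # average size of one incoming message (hypothetical)
messages_per_second = 1_000         # throughput requirement cited above

# Velocity = data volume per unit time
velocity_bytes_per_second = message_size_bytes * messages_per_second
print(f"Velocity: {velocity_bytes_per_second / 1_000:.0f} KB/s")   # -> 2000 KB/s

# Latency budget: delay from data creation to system response
latency_budget_ms = 100             # common requirement cited above
observed_latency_ms = 42            # hypothetical measurement
print("Within latency budget:", observed_latency_ms <= latency_budget_ms)
```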
Tools that take data into a system — otherwise known as data ingestion tools — come in a variety of flavors. Some of the more popular ones are described in the following list:
Big data gets even more complicated when you add unstructured and semistructured data to structured data sources. This high-variety data comes from a multitude of sources. The most salient point about it is that it’s composed of a combination of datasets with differing underlying structures (structured, unstructured, or semistructured). Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and data that’s generated from user clicks on a web page — otherwise known as click-streams.
Structured data: Data that can be stored, processed, and manipulated in a traditional relational database management system (RDBMS). An example would be a PostgreSQL database that uses a tabular schema of rows and columns, making it easier to identify specific values within the data stored in the database. Structured data can be generated by humans or machines and is derived from all sorts of sources — from click-streams and web-based forms to point-of-sale transactions and sensors.
Unstructured data: Data that comes completely unstructured. It’s commonly generated by human activities and doesn’t fit into a structured database format. Such data can be derived from blog posts, emails, and Word documents.
Semistructured data: Data that doesn’t fit into a structured database system but is nonetheless structured by tags that create a form of order and hierarchy in the data. Semistructured data is commonly found in databases and file systems. It can be stored as log files, XML files, or JSON data files.
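To see the distinction in practice, here’s a minimal Python sketch that handles one record of each variety. The field names and values are made up for illustration only.

```python
import csv
import io
import json

# Structured data: clean rows and columns, easily loaded into an RDBMS table.
structured = io.StringIO("order_id,amount\n1001,19.99\n1002,5.49\n")
rows = list(csv.DictReader(structured))
print(rows[0]["amount"])                       # -> '19.99'

# Semistructured data: no fixed tabular schema, but keys/tags impose hierarchy.
semistructured = '{"user": "al", "clicks": [{"page": "/home", "ms": 1200}]}'
record = json.loads(semistructured)
print(record["clicks"][0]["page"])             # -> '/home'

# Unstructured data: free text with no schema; any structure is imposed later.
unstructured = "Great product, but shipping took two weeks."
print(len(unstructured.split()))               # -> 8 words
```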
Vast volumes of data are continually generated by humans, machines, and sensors everywhere. Typical sources include data from social media, financial transactions, health records, click-streams, log files, and the Internet of things — a web of digital connections that joins together the ever-expanding array of electronic devices that consumers use in their everyday lives. Figure 2-1 shows a variety of popular big data sources.
Data science, machine learning engineering, and data engineering cover different functions within the big data paradigm — an approach wherein huge velocities, varieties, and volumes of structured, unstructured, and semistructured data are being captured, processed, stored, and analyzed using a set of techniques and technologies that are completely novel compared to those that were used in decades past.
All these functions are useful for deriving knowledge and actionable insights from raw data. All are essential elements for any comprehensive decision-support system, and all are extremely helpful when formulating robust strategies for future business growth. Although the terms data science and data engineering are often used interchangeably, they’re distinct domains of expertise. Over the past five years, the role of machine learning engineer has risen up to bridge a gap that exists between data science and data engineering. In the following sections, I introduce concepts that are fundamental to data science and data engineering, as well as the hybrid machine learning engineering role, and then I show you the differences in how these roles function in an organization’s data team.
If science is a systematic method by which people study and explain domain-specific phenomena that occur in the natural world, you can think of data science as the scientific domain that’s dedicated to knowledge discovery via data analysis.
Data scientists use mathematical techniques and algorithmic approaches to derive solutions to complex business and scientific problems. Practitioners use data science’s predictive methods to derive insights that are otherwise unattainable. In business and in science, data science methods can provide more robust decision-making capabilities:
Data science is a vast and multidisciplinary field. To call yourself a true data scientist, you need to have expertise in math and statistics, computer programming, and your own domain-specific subject matter.
Using data science skills, you can do cool things like the following:
Data scientists must have extensive and diverse quantitative expertise to be able to solve these types of problems.
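To make the idea of “predictive methods” concrete in code, here’s a minimal, hedged sketch that trains a basic classifier with scikit-learn. The synthetic dataset and the choice of logistic regression are arbitrary stand-ins for illustration, not a recommendation for any particular business problem.

```python
# A minimal predictive-modeling sketch using scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real business dataset (for example, churn records).
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a simple model and check how well it predicts on held-out data.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```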
A machine learning engineer is essentially a software engineer who is skilled enough in data science to deploy advanced data science models within the applications they build, thus bringing machine learning models into production in a live environment like a Software as a Service (SaaS) product or even just a web page. Contrary to what you may have guessed, the role of machine learning engineer is a hybrid between a data scientist and a software engineer, not a data engineer. A machine learning engineer is, at their core, a well-rounded software engineer who also has a solid foundation in machine learning and artificial intelligence. This person doesn’t need to know as much data science as a data scientist but should know much more about computer science and software development than a typical data scientist.
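To show what “bringing a model into production” can look like at its simplest, here’s a hedged sketch of a web endpoint that serves predictions from a trained model. It assumes a scikit-learn-compatible model was previously saved to model.pkl (a hypothetical file name) and uses Flask purely as an example framework, not as the only way to do this.

```python
# Minimal model-serving sketch: wrap a trained model in a web endpoint.
# Assumes a scikit-learn-compatible model saved to "model.pkl" (hypothetical).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[0.1, 0.2, ...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```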
If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain that’s dedicated to building and maintaining data systems that overcome the data processing bottlenecks and data handling problems arising from the high volume, velocity, and variety of big data.
Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with (and designing) real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as with RDBMSs. They generally code in Java, C++, Scala, or Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into datasets with more manageable sizes. Simply put, with respect to data science, the purpose of data engineering is to engineer large-scale data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
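As one hedged example of refining big data into a more manageable dataset, the following PySpark sketch aggregates a large collection of raw event files down to a small summary table that a data scientist could analyze directly. The input path and column names are hypothetical placeholders.

```python
# A minimal PySpark sketch: boil raw event data down to a small summary table.
# The S3 paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("refine-events").getOrCreate()

# Read a large collection of raw JSON event files.
events = spark.read.json("s3a://example-bucket/raw-events/")

# Reduce the raw events to daily counts per event type.
daily_counts = (
    events
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Write the much smaller, refined dataset back out for analysis.
daily_counts.write.mode("overwrite").parquet("s3a://example-bucket/refined/daily_counts/")
```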
Using data engineering skills, you can, for example:
Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.
The roles of data scientist, machine learning engineer, and data engineer are frequently conflated by hiring managers. If you look around at most position descriptions for companies that are hiring, they often mismatch the titles and roles or simply expect applicants to be a Swiss Army knife of data skills who can do it all.
Because many organizations combine and confuse roles in their data projects, data scientists are sometimes stuck having to learn to do the job of a data engineer — and vice versa. To come up with the highest-quality work product in the least amount of time, hire a data engineer to store, migrate, and process your data; a data scientist to make sense of it for you; and a machine learning engineer to bring your machine learning models into production.
Lastly, keep in mind that data engineer, machine learning engineer, and data scientist are just three small roles within a larger organizational structure. Managers, middle-level employees, and business leaders also play a huge part in the success of any data-driven initiative.
A lot has changed in the world of big data storage options since the Hadoop debacle I mention earlier in this chapter. Back then, almost all business leaders clamored for on-premise data storage. Delayed by years due to the admonitions of traditional IT leaders, corporate management is finally beginning to embrace the notion that storing and processing big data with a reputable cloud service provider is the most cost-effective and secure way to generate value from enterprise data. In the following sections, you see the basics of what’s involved in both cloud and on-premise big data storage and processing.
After you have realized the upside potential of storing data in the cloud, it’s hard to look back. Storing data in a cloud environment offers serious business advantages, such as these:
A lot of different technologies have emerged in the wake of the cloud computing revolution, many of which are of interest to those trying to leverage big data. The next sections examine a few of these new technologies.
When we talk about serverless computing, the term serverless is quite misleading because the computing indeed takes place on a server. Serverless computing really refers to computing that’s executed in a cloud environment rather than on your desktop or on-premise at your company. The physical host server exists, but it’s 100 percent managed by the cloud computing provider retained by you or your company.
One great tragedy of modern-day data science is the amount of time data scientists spend on non-mission-critical tasks like data collection, data cleaning and reformatting, data operations, and data integration. By most estimates, only 10 percent of a data scientist's time is spent on predictive model building — the rest is spent preparing the data and the data infrastructure for that mission-critical task they’ve been retained to complete. Serverless computing has been a game-changer for the data science industry because it decreases the time that data scientists must spend preparing data and infrastructure before they can get to their predictive models.
Earlier in this chapter, I talk a bit about SaaS. Serverless computing offers something similar, but in the form of Function as a Service (FaaS) — a containerized cloud computing service that makes it much faster and simpler to execute code and predictive functions directly in a cloud environment, without the need to set up complicated infrastructure around that code. With serverless computing, your data science model runs directly within its container, as a sort of stand-alone function. Your cloud service provider handles all the provisioning and adjustments that need to be made to the infrastructure to support your functions.
Examples of popular serverless computing solutions are AWS Lambda, Google Cloud Functions, and Azure Functions.
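Here’s a hedged sketch of what a scoring function might look like when deployed as FaaS on AWS Lambda. The handler signature follows Lambda’s standard Python convention, while the event fields and the scoring rule are made-up placeholders standing in for a real trained model.

```python
# A minimal AWS Lambda (FaaS) sketch in Python.
# The event fields and the scoring rule are hypothetical placeholders.
import json

def lambda_handler(event, context):
    # Lambda passes the request payload in `event`; no server setup is required.
    body = json.loads(event.get("body", "{}"))
    spend = body.get("monthly_spend", 0)

    # Placeholder "model": in practice you'd load and call a real trained model.
    churn_risk = "high" if spend < 20 else "low"

    return {
        "statusCode": 200,
        "body": json.dumps({"churn_risk": churn_risk}),
    }
```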
Kubernetes is an open-source software suite that manages, orchestrates, and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes. One particularly attractive feature about Kubernetes is that you can run it on data that sits in on-premise clusters, in the cloud, or in a hybrid cloud environment.
Kubernetes’ chief focus is helping software developers build and scale apps quickly. Though it does provide a fault-tolerant, extensible environment for deploying and scaling predictive applications in the cloud, it also requires quite a bit of data engineering expertise to set those applications up correctly.
To overcome this obstacle, the Kubernetes community released Kubeflow, a machine learning toolkit built for Kubernetes that makes it simple for data scientists to deploy predictive models directly within Kubernetes containers, without the need for outside data engineering support.
You have a number of products to choose from when it comes to cloud-warehouse solutions. The following list looks at the most popular options:
Amazon Redshift: A popular big data warehousing service that runs atop data sitting within the Amazon Cloud, it is most notable for the incredible speed at which it can handle data analytics and business intelligence workloads. Because it runs on the AWS platform, Redshift’s fully managed data warehousing service has the incredible capacity to support petabyte-scale cloud storage requirements. If your company is already using other AWS services — like Amazon EMR, Amazon Athena, or Amazon Kinesis — Redshift is the natural choice to integrate nicely with your existing technology. Redshift offers several pricing structures that you’ll want to explore further on its website: https://aws.amazon.com/redshift (a minimal query sketch appears below).
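Because Redshift speaks the PostgreSQL wire protocol, you can query it from Python with a standard PostgreSQL driver. Here’s a hedged sketch using psycopg2; the cluster endpoint, database name, credentials, and table are placeholders you’d replace with your own.

```python
# Querying Amazon Redshift from Python via the PostgreSQL protocol (psycopg2).
# Host, database, user, password, and table below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,                      # Redshift's default port
    dbname="analytics",
    user="analyst",
    password="REPLACE_ME",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type;")
    for event_type, total in cur.fetchall():
        print(event_type, total)

conn.close()
```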
Parallel processing refers to a powerful framework where data is processed very quickly because the work required to process the data is distributed across multiple nodes in a system. This configuration allows for the simultaneous processing of multiple tasks across different nodes in the system.
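To illustrate the idea on a single machine, here’s a hedged Python sketch that splits a workload across worker processes using the standard-library multiprocessing module. The workload itself is a toy placeholder; in a real system, the "nodes" would be separate servers rather than local processes.

```python
# A toy illustration of parallel processing: the work is split into chunks
# and executed simultaneously by multiple worker processes.
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder "work": a real job might parse, clean, or aggregate data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # split the work four ways

    with Pool(processes=4) as pool:              # four parallel workers
        partial_results = pool.map(process_chunk, chunks)

    print("Total:", sum(partial_results))        # combine the partial results
```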
A traditional RDBMS isn’t equipped to handle big data demands. That’s because it’s designed to handle only relational datasets constructed of data that’s stored in clean rows and columns and thus is capable of being queried via SQL. RDBMSs are incapable of handling unstructured and semistructured data. Moreover, RDBMSs simply lack the processing and handling capabilities that are needed for meeting big data volume-and-velocity requirements.
This is where NoSQL comes in — its databases are nonrelational, distributed database systems that were designed to rise to the challenges involved in storing and processing big data. They can be run on-premise or in a cloud environment. NoSQL databases step out past the traditional relational database architecture and offer a much more scalable, efficient solution. NoSQL systems facilitate non-SQL data querying of nonrelational or schema-free, semistructured and unstructured data. In this way, NoSQL databases are able to handle the structured, semistructured, and unstructured data sources that are common in big data systems.
NoSQL offers four categories of nonrelational databases: graph databases, document databases, key-value stores, and column family stores. Because NoSQL offers native functionality for each of these separate types of data structures, it offers efficient storage and retrieval functionality for most types of nonrelational data. This adaptability and efficiency make NoSQL an increasingly popular choice for handling big data and for overcoming the processing challenges that come along with it.
NoSQL applications like Apache Cassandra and MongoDB are used for data storage and real-time processing. Apache Cassandra is a popular column family (wide-column) store, and MongoDB is the most popular document-oriented NoSQL database. It uses dynamic schemas and stores JSON-like documents.
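Here’s a hedged sketch of working with MongoDB from Python via the pymongo driver. The connection string, database name, collection name, and document fields are placeholders chosen for illustration.

```python
# Storing and querying JSON-like documents in MongoDB via pymongo.
# The connection string, database, collection, and fields are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["webshop"]["clickstream"]

# Documents need no fixed schema; each one can carry different fields.
collection.insert_one({
    "user": "al",
    "page": "/checkout",
    "events": [{"type": "click", "ms": 1200}],
})

# Query documents by field, just as you'd filter rows in SQL.
for doc in collection.find({"page": "/checkout"}):
    print(doc["user"], doc["events"])

client.close()
```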
Although cloud storage and cloud processing of big data is widely accepted as safe, reliable, and cost-effective, companies have a multitude of reasons for using on-premise solutions instead. In many instances of the training and consulting work I’ve done for foreign governments and multinational corporations, cloud data storage was the ultimate “no-fly zone” that should never be breached. This is particularly true of businesses I’ve worked with in the Middle East, where local security concerns were voiced as a main deterrent for moving corporate or government data to the cloud.
Though the popularity of storing big data on-premise has waned in recent years, many companies have their reasons for not wanting to move to a cloud environment. If you find yourself in circumstances where cloud services aren’t an option, you’ll probably appreciate the following discussion about on-premise alternatives.
Because big data’s three Vs (volume, velocity, and variety) don’t allow for the handling of big data using traditional RDBMSs, data engineers had to become innovative. To work around the limitations of relational systems, data engineers originally turned to the Hadoop data processing platform to boil down big data into smaller datasets that are more manageable for data scientists to analyze. This was all the rage until about 2015, when market demands changed to the point that the platform was no longer able to meet them.
MapReduce is a parallel distributed processing framework that can process tremendous volumes of data in-batch — where data is collected and then processed as one unit with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples. (With respect to MapReduce, tuples refers to key-value pairs by which data is grouped, sorted, and processed.) In layperson terms, MapReduce uses parallel distributed computing to transform big data into data of a manageable size.
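The following pure-Python sketch mimics the map-shuffle-reduce pattern on a toy word-count problem, just to show how raw records become key-value tuples that get grouped and reduced. A real MapReduce job would, of course, run distributed across a Hadoop cluster rather than on one machine.

```python
# A toy, single-machine imitation of the MapReduce pattern (word count).
from itertools import groupby
from operator import itemgetter

lines = ["big data moves fast", "big data is high volume"]

# Map: convert raw records into key-value tuples.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group together the tuples that share the same key.
mapped.sort(key=itemgetter(0))

# Reduce: combine each group of tuples into a smaller result set.
reduced = {key: sum(count for _, count in group)
           for key, group in groupby(mapped, key=itemgetter(0))}

print(reduced)   # e.g., {'big': 2, 'data': 2, 'fast': 1, ...}
```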
The HDFS uses clusters of commodity hardware for storing data. Hardware in each cluster is connected, and this hardware is composed of commodity servers — low-cost, low-performing generic servers that offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in storing big data.
The HDFS is characterized by these three key features:
The Hadoop platform was designed for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN (a resource manager) all working together.
Within a Hadoop platform, the workloads of applications that run on the HDFS (like MapReduce and Spark) are divided among the nodes of the cluster, and the output is stored on the HDFS. A Hadoop cluster can be composed of thousands of nodes. To keep the costs of input/output (I/O) processes low, MapReduce jobs are performed as close as possible to the data — the task processors are positioned as close as possible to the data that needs to be processed. This design distributes the computational workload of big data processing across the cluster.
Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional on-premise data warehouse, an MPP may be the perfect solution.
To understand how MPP compares to a standard MapReduce parallel-processing framework, consider that MPP runs parallel computing tasks on costly custom hardware, whereas MapReduce runs them on inexpensive commodity servers. Consequently, MPP processing capabilities are cost restrictive. On the other hand, MPP is quicker and easier to use than standard MapReduce jobs, because MPP platforms can be queried using Structured Query Language (SQL), whereas native MapReduce jobs are typically written in the more complicated Java programming language.
A real-time processing framework is — as its name implies — a framework that processes data in real-time (or near-real-time) as the data streams and flows into the system. Real-time frameworks process data in microbatches, returning results in a matter of seconds rather than the hours or days that batch processing frameworks like MapReduce typically take. Real-time processing frameworks do one of the following:
Apache Spark is an in-memory computing application that you can use to query, explore, analyze, and even run machine learning algorithms on incoming streaming data in near-real-time. Its power lies in its processing speed: The ability to process and make predictions from streaming big data sources in three seconds flat is no laughing matter.
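Here’s a hedged sketch of near-real-time processing with Spark Structured Streaming. It uses Spark’s built-in rate source as a stand-in for a real stream, and a trivial windowed count as a stand-in for real analytics or model scoring; both are placeholders for illustration.

```python
# A minimal Spark Structured Streaming sketch (near-real-time microbatches).
# The built-in "rate" source stands in for a real stream such as Kafka.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Generate a synthetic stream of 100 rows per second.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Count incoming events per 10-second window as they arrive.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Print each microbatch's results to the console until stopped.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```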