Chapter 2

Tapping into Critical Aspects of Data Engineering

IN THIS CHAPTER

check Unraveling the big data story

check Looking at important data sources

check Differentiating data science from data engineering

check Storing data on-premise or in a cloud

check Exploring other data engineering solutions

Though data and artificial intelligence (AI) are extremely interesting topics in the eyes of the public, most laypeople aren’t aware of what data really is or how it’s used to improve people’s lives. This chapter tells the full story about big data, explains where big data comes from and how it’s used, and then outlines the roles that machine learning engineers, data engineers, and data scientists play in the modern data ecosystem. In this chapter, I introduce the fundamental concepts related to storing and processing data for data science so that this information can serve as the basis for laying out your plans for leveraging data science to improve business performance.

Defining Big Data and the Three Vs

I am reluctant to even mention big data in this, the third, edition of Data Science For Dummies. Back about a decade ago, the industry hype was huge over what people called big data — a term that characterizes data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it lacks the structural requirements of traditional database architectures.

My reluctance stems from a tragedy I watched unfold across the second decade of the 21st century. Back then, the term big data was so overhyped across industry that countless business leaders made misguided impulse purchases. The narrative in those days went something like this: “If you’re not using big data to develop a competitive advantage for your business, the future of your company is in great peril. And, in order to use big data, you need to have big data storage and processing capabilities that are available only if you invest in a Hadoop cluster.”

Remember Hadoop is a data processing platform that is designed to boil down big data into smaller datasets that are more manageable for data scientists to analyze. For reasons you’re about to see, Hadoop’s popularity has been in steady decline since 2015.

Despite its significant drawbacks, Hadoop is, and was, powerful at satisfying one requirement: batch-processing and storing large volumes of data. That's great if your situation requires precisely this type of capability, but the fact is that technology is never a one-size-fits-all sort of thing. If I learned anything from the years I spent building technical and strategic engineering plans for government institutions, it’s this: Before investing in any sort of technology solution, you must always assess the current state of your organization, select an optimal use case, and thoroughly evaluate competing alternatives, all before even considering whether a purchase should be made. This process is so vital to the success of data science initiatives that I cover it extensively in Part 4.

Unfortunately, in almost all cases back then, business leaders bought into Hadoop before having evaluated whether it was an appropriate choice. Vendors sold Hadoop and made lots of money. Most of those projects failed. Most Hadoop vendors went out of business. Corporations got burned on investing in data projects, and the data industry got a bad rap. For any data professional who was working in the field between 2012 and 2015, the term big data represents a blight on the industry.

Despite the setbacks the data industry has faced due to overhype, this fact remains: If companies want to stay competitive, they must be adept at infusing data insights into their processes, their products, and their growth and management strategies. This is especially true in light of the digital adoption explosion that occurred as a direct result of the COVID-19 pandemic. Whether your data volumes fall on the terabyte or the petabyte scale, data-engineered solutions must be designed to meet requirements for the data's intended destination and use.

Technicalstuff When you're talking about regular data, you're likely to hear the words kilobyte and gigabyte used as measurements. Kilobyte refers to 1,024 bytes, or 2¹⁰ bytes. A byte is an 8-bit unit of data.

Three characteristics — also called “the three Vs” — define big data: volume, velocity, and variety. Because the three Vs of big data are continually expanding, newer, more innovative data technologies must continuously be developed to manage big data problems.

Remember In a situation where you’re required to adopt a big data solution to overcome a problem that’s caused by your data’s velocity, volume, or variety, you have moved past the realm of regular data — you have a big data problem on your hands.

Grappling with data volume

The lower limit of big data volume starts as low as 1 terabyte, and it has no upper limit. If your organization owns at least 1 terabyte of data, that data technically qualifies as big data.

Warning In its raw form, most big data is low value — in other words, the value-to-data-quantity ratio is low in raw big data. Big data is composed of huge numbers of very small transactions that come in a variety of formats. These incremental components of big data produce true value only after they’re aggregated and analyzed. Roughly speaking, data engineers have the job of aggregating it, and data scientists have the job of analyzing it.

Handling data velocity

A lot of big data is created nowadays by automated processes and instrumentation, and because data storage is relatively inexpensive, system velocity is often the limiting factor. Keep in mind that big data is low-value. Consequently, you need systems that can ingest a lot of it in short order to generate timely and valuable insights.

In engineering terms, data velocity is data volume per unit time. Big data enters an average system at velocities ranging from about 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second. Latency is a characteristic of all data systems, and it quantifies the system's delay in moving data after it has been instructed to do so. Many data-engineered systems are required to have latency of less than 100 milliseconds, measured from the time the data is created to the time the system responds.

Throughput is a characteristic that describes a system's capacity for work per unit time. Throughput requirements can easily be as high as 1,000 messages per second in big data systems! High-velocity, real-time data presents an obstacle to timely decision-making because the capabilities of data-handling and data-processing technologies often limit data velocities.
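
To make these numbers concrete, here's a minimal Python sketch (standard library only) that measures the throughput and average per-message latency of a toy ingestion loop. The message payloads and counts are made up, and a real system would measure latency from the moment each record is created.

import time

def ingest(message):
    """Stand-in for real ingestion work (parsing, validation, writing)."""
    return len(message)

messages = [f"sensor-reading-{i}".encode() for i in range(10_000)]

start = time.perf_counter()
for msg in messages:
    ingest(msg)
elapsed = time.perf_counter() - start

throughput = len(messages) / elapsed                # messages per second
avg_latency_ms = (elapsed / len(messages)) * 1000   # milliseconds per message

print(f"Throughput: {throughput:,.0f} messages/second")
print(f"Average latency: {avg_latency_ms:.3f} ms per message")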

Tools that intake data into a system — otherwise known as data ingestion tools — come in a variety of flavors. Some of the more popular ones are described in the following list:

  • Apache Sqoop: You can use this data transfer tool to quickly move data back and forth between a relational database system and the Hadoop distributed file system (HDFS). HDFS makes big data handling and storage financially feasible by distributing storage tasks across clusters of inexpensive commodity servers.
  • Apache Kafka: This distributed messaging system acts as a message broker whereby messages can quickly be pushed onto, and pulled from, HDFS. You can use Kafka to consolidate and facilitate the data calls and pushes that consumers make to and from the HDFS. (See the short Kafka sketch that follows this list.)
  • Apache Flume: This distributed system primarily handles log and event data. You can use it to transfer massive quantities of unstructured data to and from the HDFS.
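
To give you a feel for how little code a basic ingestion job requires, the following Python sketch pushes a few click-stream events onto a Kafka topic using the kafka-python client library. It's a minimal sketch, assuming that kafka-python is installed (pip install kafka-python), that a broker is reachable at localhost:9092, and that the topic name clickstream-events is purely hypothetical.

import json
from kafka import KafkaProducer

# Connect to a Kafka broker (assumed to be running locally).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Push a handful of click-stream events onto a hypothetical topic.
events = [
    {"user_id": 42, "page": "/products", "action": "click"},
    {"user_id": 7, "page": "/checkout", "action": "view"},
]
for event in events:
    producer.send("clickstream-events", value=event)

producer.flush()  # Block until all buffered messages are delivered.
producer.close()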

Dealing with data variety

Big data gets even more complicated when you add unstructured and semistructured data to structured data sources. This high-variety data comes from a multitude of sources. The most salient point about it is that it’s composed of a combination of datasets with differing underlying structures (structured, unstructured, or semistructured). Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and data that’s generated from user clicks on a web page — otherwise known as click-streams.

Structured data can be stored, processed, and manipulated in a traditional relational database management system (RDBMS) — an example of this would be a PostgreSQL database that uses a tabular schema of rows and columns, making it easier to identify specific values within data that's stored within the database. This data, which can be generated by humans or machines, is derived from all sorts of sources — from click-streams and web-based forms to point-of-sale transactions and sensors. Unstructured data comes completely unstructured — it's commonly generated from human activities and doesn't fit into a structured database format. Such data can be derived from blog posts, emails, and Word documents. Semistructured data doesn't fit into a structured database system, but it is nonetheless structured by tags that create a useful form of order and hierarchy in the data. Semistructured data is commonly found in databases and file systems. It can be stored as log files, XML files, or JSON data files.
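
To see the difference in practice, here's a short Python sketch that takes one semistructured JSON record, the kind you might pull from a weblog or an API, and flattens the fields you care about into a structured, tabular-style row. The field names are invented for illustration.

import json

# A semistructured record: nested fields, a variable-length list, no fixed schema.
raw = '''
{
  "user": {"id": 1001, "name": "Ari"},
  "event": "purchase",
  "items": [{"sku": "A-12", "price": 19.50}, {"sku": "B-34", "price": 5.25}],
  "tags": ["mobile", "promo"]
}
'''

record = json.loads(raw)

# Flatten into a structured row suitable for a relational table.
row = {
    "user_id": record["user"]["id"],
    "event": record["event"],
    "item_count": len(record["items"]),
    "order_total": sum(item["price"] for item in record["items"]),
}
print(row)  # {'user_id': 1001, 'event': 'purchase', 'item_count': 2, 'order_total': 24.75}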

Tip Become familiar with the term data lake — this term is used by practitioners in the big data industry to refer to a nonhierarchical data storage system that’s used to hold huge volumes of multistructured, raw data within a flat storage architecture — in other words, a collection of records that come in uniform format and that are not cross-referenced in any way. HDFS can be used as a data lake storage repository, but you can also use the Amazon Web Services (AWS) S3 platform — or a similar cloud storage solution — to meet the same requirements on the cloud. (The Amazon Web Services S3 platform is one of the more popular cloud architectures available for storing big data.)
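
As a small illustration of how raw files land in an S3-based data lake, here's a minimal Python sketch that uses the boto3 library (pip install boto3). It assumes your AWS credentials are already configured, and the bucket name and object keys are hypothetical.

import boto3

# Create an S3 client (credentials are read from your AWS configuration).
s3 = boto3.client("s3")

# Drop a raw, unprocessed log file into the data lake as-is.
# Data lakes typically preserve raw data in its original format.
s3.upload_file(
    Filename="weblogs-2023-07-01.json",          # local file to upload
    Bucket="my-company-data-lake",               # hypothetical bucket name
    Key="raw/weblogs/2023/07/01/weblogs.json",   # key within the flat object store
)

# List what's sitting in the raw zone of the lake.
response = s3.list_objects_v2(Bucket="my-company-data-lake", Prefix="raw/weblogs/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])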

Warning Although both data lakes and data warehouses are used for storing data, the terms refer to different types of systems. A data lake (defined in the preceding tip) holds raw, multistructured data, whereas a data warehouse is a centralized data repository that you can use to store and access only structured data. A more traditional structured-data system that's commonly employed in business intelligence solutions is the data mart — a storage system (for structured data) that you can use to store one particular focus area of data, belonging to only one line of business in the company.

Identifying Important Data Sources

Vast volumes of data are continually generated by humans, machines, and sensors everywhere. Typical sources include data from social media, financial transactions, health records, click-streams, log files, and the Internet of things — a web of digital connections that joins together the ever-expanding array of electronic devices that consumers use in their everyday lives. Figure 2-1 shows a variety of popular big data sources.


FIGURE 2-1: Popular sources of big data.

Grasping the Differences among Data Approaches

Data science, machine learning engineering, and data engineering cover different functions within the big data paradigm — an approach wherein huge velocities, varieties, and volumes of structured, unstructured, and semistructured data are being captured, processed, stored, and analyzed using a set of techniques and technologies that are completely novel compared to those that were used in decades past.

All these functions are useful for deriving knowledge and actionable insights from raw data. All are essential elements for any comprehensive decision-support system, and all are extremely helpful when formulating robust strategies for future business growth. Although the terms data science and data engineering are often used interchangeably, they’re distinct domains of expertise. Over the past five years, the role of machine learning engineer has risen up to bridge a gap that exists between data science and data engineering. In the following sections, I introduce concepts that are fundamental to data science and data engineering, as well as the hybrid machine learning engineering role, and then I show you the differences in how these roles function in an organization’s data team.

Defining data science

If science is a systematic method by which people study and explain domain-specific phenomena that occur in the natural world, you can think of data science as the scientific domain that’s dedicated to knowledge discovery via data analysis.

Technicalstuff With respect to data science, the term domain-specific refers to the industry sector or subject matter domain that data science methods are being used to explore.

Data scientists use mathematical techniques and algorithmic approaches to derive solutions to complex business and scientific problems. Data science practitioners use its predictive methods to derive insights that are otherwise unattainable. In business and in science, data science methods can provide more robust decision-making capabilities:

  • In business, the purpose of data science is to empower businesses and organizations with the data insights they need in order to optimize organizational processes for maximum efficiency and revenue generation.
  • In science, data science methods are used to derive results and develop protocols for achieving the specific scientific goal at hand.

Data science is a vast and multidisciplinary field. To call yourself a true data scientist, you need to have expertise in math and statistics, computer programming, and your own domain-specific subject matter.

Using data science skills, you can do cool things like the following:

  • Use machine learning to optimize energy usage and lower corporate carbon footprints.
  • Optimize tactical strategies to achieve goals in business and science.
  • Predict unknown contaminant levels from sparse environmental datasets.
  • Design automated theft- and fraud-prevention systems to detect anomalies and trigger alarms based on algorithmic results.
  • Craft site-recommendation engines for use in land acquisitions and real estate development.
  • Implement and interpret predictive analytics and forecasting techniques for net increases in business value.

Data scientists must have extensive and diverse quantitative expertise to be able to solve these types of problems.

Technicalstuff Machine learning is the practice of applying algorithms to learn from — and make automated predictions from — data.
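
For a taste of what that looks like in code, here's a minimal Python sketch that uses the scikit-learn library (pip install scikit-learn) to learn from a handful of data points and then make an automated prediction. The tiny dataset is invented purely for illustration.

from sklearn.linear_model import LinearRegression

# Toy training data: monthly ad spend (in $1,000s) and resulting sales (in $1,000s).
ad_spend = [[10], [20], [30], [40], [50]]
sales = [25, 45, 65, 85, 105]

# The algorithm "learns" the relationship from the data...
model = LinearRegression()
model.fit(ad_spend, sales)

# ...and then makes an automated prediction for a value it hasn't seen.
predicted = model.predict([[60]])
print(f"Predicted sales for $60K ad spend: ${predicted[0]:.0f}K")  # roughly $125K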

Defining machine learning engineering

A machine learning engineer is essentially a software engineer who is skilled enough in data science to deploy advanced data science models within the applications they build, thus bringing machine learning models into production in a live environment like a Software as a Service (SaaS) product or even just a web page. Contrary to what you may have guessed, the role of machine learning engineer is a hybrid between a data scientist and a software engineer, not a data engineer. A machine learning engineer is, at their core, a well-rounded software engineer who also has a solid foundation in machine learning and artificial intelligence. This person doesn’t need to know as much data science as a data scientist but should know much more about computer science and software development than a typical data scientist.

Technicalstuff Software as a Service (SaaS) is a term that describes cloud-hosted software services that are made available to users via the Internet. Examples of popular SaaS companies include Salesforce, Slack, HubSpot, and so many more.

Defining data engineering

If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain that’s dedicated to building and maintaining data systems for overcoming data processing bottlenecks and data handling problems that arise from handling the high volume, velocity, and variety of big data.

Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with (and designing) real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as with RDBMSs. They generally code in Java, C++, Scala, or Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into datasets with more manageable sizes. Simply put, with respect to data science, the purpose of data engineering is to engineer large-scale data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
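
Here's a minimal sketch of the kind of refinement step a data engineer might write in PySpark: boiling a large raw transaction file down into a small, aggregated dataset that a data scientist can analyze. It assumes PySpark is installed, and the file path and column names (transactions.csv, store_id, amount) are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session (locally here; on a cluster in production).
spark = SparkSession.builder.appName("refine-transactions").getOrCreate()

# Read a large raw dataset of individual transactions.
transactions = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Boil it down: one summary row per store instead of millions of raw records.
summary = (
    transactions
    .groupBy("store_id")
    .agg(
        F.count("*").alias("transaction_count"),
        F.sum("amount").alias("total_sales"),
    )
)

# Hand the manageable result off for analysis.
summary.write.mode("overwrite").parquet("store_sales_summary.parquet")
spark.stop()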

Remember Most engineered systems are built systems — they are constructed or manufactured in the physical world. Data engineering is different, though. It involves designing, building, and implementing software solutions to problems in the data world — a world that can seem abstract when compared to the physical reality of the Golden Gate Bridge or the Aswan Dam.

Using data engineering skills, you can, for example:

  • Integrate data pipelines with the natural language processing (NLP) services that were built by data scientists at your company.
  • Build mission-critical data platforms capable of processing more than 10 billion transactions per day.
  • Tear down data silos by finally migrating your company’s data from a more traditional on-premise data storage environment to a cutting-edge cloud warehouse.
  • Enhance and maintain existing data infrastructure and data pipelines.

Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.

Comparing machine learning engineers, data scientists, and data engineers

The roles of data scientist, machine learning engineer, and data engineer are frequently conflated by hiring managers. If you look at the position descriptions that companies post when they're hiring, they often mismatch the titles and roles, or they simply expect applicants to be a Swiss Army knife of data skills who can do it all.

Tip If you’re hiring someone to help make sense of your data, be sure to define the requirements clearly before writing the position description. Because data scientists must also have subject matter expertise in the particular areas in which they work, this requirement generally precludes data scientists from also having much expertise in data engineering. And, if you hire a data engineer who has data science skills, that person generally won’t have much subject matter expertise outside of the data domain. Be prepared to call in a subject matter expert (SME) to help out.

Because many organizations combine and confuse roles in their data projects, data scientists are sometimes stuck having to learn to do the job of a data engineer — and vice versa. To come up with the highest-quality work product in the least amount of time, hire a data engineer to store, migrate, and process your data; a data scientist to make sense of it for you; and a machine learning engineer to bring your machine learning models into production.

Lastly, keep in mind that data engineer, machine learning engineer, and data scientist are just three small roles within a larger organizational structure. Managers, middle-level employees, and business leaders also play a huge part in the success of any data-driven initiative.

Storing and Processing Data for Data Science

A lot has changed in the world of big data storage options since the Hadoop debacle I mention earlier in this chapter. Back then, almost all business leaders clamored for on-premise data storage. Delayed by years due to the admonitions of traditional IT leaders, corporate management is finally beginning to embrace the notion that storing and processing big data with a reputable cloud service provider is the most cost-effective and secure way to generate value from enterprise data. In the following sections, you see the basics of what’s involved in both cloud and on-premise big data storage and processing.

Storing data and doing data science directly in the cloud

After you have realized the upside potential of storing data in the cloud, it’s hard to look back. Storing data in a cloud environment offers serious business advantages, such as these:

  • Faster time-to-market: Many big data cloud service providers take care of the bulk of the work that's required to configure, maintain, and provision the computing resources needed to run jobs within a defined system (also known as a compute environment). This dramatically increases ease of use and ultimately allows for faster time-to-market for data products.
  • Enhanced flexibility: Cloud services are extremely flexible with respect to usage requirements. If you set up in a cloud environment and then your project plan changes, you can simply turn off the cloud service with no further charges incurred. This isn’t the case with on-premise storage, because once you purchase the server, you own it. Your only option from then on is to extract the best possible value from a noncancelable resource.
  • Security: If you go with reputable cloud service providers — like Amazon Web Services, Google Cloud, or Microsoft Azure — your data is likely to be a whole lot more secure in the cloud than it would be on-premise. That’s because of the sheer number of resources that these megalith players dedicate to protecting and preserving the security of the data they store. I can’t think of a multinational company that would have more invested in the security of its data infrastructure than Google, Amazon, or Microsoft.

A lot of different technologies have emerged in the wake of the cloud computing revolution, many of which are of interest to those trying to leverage big data. The next sections examine a few of these new technologies.

Using serverless computing to execute data science

When we talk about serverless computing, the term serverless is quite misleading because the computing indeed takes place on a server. Serverless computing really refers to computing that’s executed in a cloud environment rather than on your desktop or on-premise at your company. The physical host server exists, but it's 100 percent supported by the cloud computing provider retained by you or your company.

One great tragedy of modern-day data science is the amount of time data scientists spend on non-mission-critical tasks like data collection, data cleaning and reformatting, data operations, and data integration. By most estimates, only 10 percent of a data scientist's time is spent on predictive model building — the rest is spent preparing the data and the data infrastructure for that mission-critical task they've been retained to complete. Serverless computing has been a game-changer for the data science industry because it decreases the time that data scientists must spend preparing data and infrastructure before they can build their predictive models.

Earlier in this chapter, I talk a bit about SaaS. Serverless computing offers something similar, called Function as a Service (FaaS) — a containerized cloud computing service that makes it much faster and simpler to execute code and predictive functions directly in a cloud environment, without the need to set up complicated infrastructure around that code. With serverless computing, your data science model runs directly within its container, as a sort of stand-alone function. Your cloud service provider handles all the provisioning and adjustments that need to be made to the infrastructure to support your functions.

Examples of popular serverless computing solutions are AWS Lambda, Google Cloud Functions, and Azure Functions.
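
To show how little infrastructure code a FaaS function needs, here's a hedged sketch of an AWS Lambda handler written in Python. The scoring logic is a made-up placeholder for a real predictive model (which you'd normally load from storage or bundle with the function); AWS handles all the provisioning around it.

import json

def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes; 'event' carries the request payload."""
    # Pull the input features out of the incoming request (names are hypothetical).
    body = json.loads(event.get("body", "{}"))
    ad_spend = float(body.get("ad_spend", 0))

    # Placeholder "model": a hard-coded linear formula standing in for a real
    # predictive model that you'd normally load from S3 or package with the function.
    predicted_sales = 2.0 * ad_spend + 5.0

    # Return an HTTP-style response; Lambda and its triggers handle the rest.
    return {
        "statusCode": 200,
        "body": json.dumps({"predicted_sales": predicted_sales}),
    }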

Containerizing predictive applications within Kubernetes

Kubernetes is an open-source software suite that manages, orchestrates, and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes. One particularly attractive feature about Kubernetes is that you can run it on data that sits in on-premise clusters, in the cloud, or in a hybrid cloud environment.

Kubernetes' chief focus is helping software developers build and scale apps quickly. Though it provides a fault-tolerant, extensible environment for deploying and scaling predictive applications in the cloud, it also requires quite a bit of data engineering expertise to set those applications up correctly.

Remember A system is fault tolerant if it is built to continue successful operations despite the failure of one or more of its subcomponents. This requires redundancy in computing nodes. A system is described as extensible if it is flexible enough to be extended or shrunk in size without disrupting its operations.

To overcome this obstacle, the open-source Kubeflow project was developed: a machine learning toolkit for Kubernetes that makes it simple for data scientists to deploy predictive models directly within Kubernetes containers, without the need for outside data engineering support.

Sizing up popular cloud-warehouse solutions

You have a number of products to choose from when it comes to cloud-warehouse solutions. The following list looks at the most popular options:

  • Amazon Redshift: A popular big data warehousing service that runs atop data sitting within the Amazon Cloud, it is most notable for the incredible speed at which it can handle data analytics and business intelligence workloads. Because it runs on the AWS platform, Redshift’s fully managed data warehousing service has the incredible capacity to support petabyte-scale cloud storage requirements. If your company is already using other AWS services — like Amazon EMR, Amazon Athena, or Amazon Kinesis — Redshift is the natural choice to integrate nicely with your existing technology. Redshift offers both on-demand and reserved-instance pricing structures that you’ll want to explore further on its website: https://aws.amazon.com/redshift

    Technicalstuff Parallel processing refers to a powerful framework where data is processed very quickly because the work required to process the data is distributed across multiple nodes in a system. This configuration allows for the simultaneous processing of multiple tasks across different nodes in the system.

  • Snowflake: This SaaS solution provides powerful, parallel-processing analytics capabilities for both structured and semistructured data stored in the cloud on Snowflake’s servers. Snowflake provides the ultimate 3-in-1 with its cost-effective big data storage, analytical processing capabilities, and all the built-in cloud services you might need. Snowflake integrates well with analytics tools like Tableau and Qlik, as well as with traditional big data technologies like Apache Spark, Pentaho, and Apache Kafka, but it wouldn’t make sense if you’re already relying mostly on Amazon services. Pricing for the Snowflake service is based on the amount of data you store as well as on the execution time for compute resources you consume on the platform.
  • Google BigQuery: Touted as a serverless data warehouse solution, BigQuery is a relatively cost-effective solution for generating analytics from big data sources stored in the Google Cloud. Similar to Snowflake and Redshift, BigQuery provides fully managed cloud services that make it fast and simple for data scientists and analytics professionals to use the tool without the need for assistance from in-house data engineers. Analytics can be generated on petabyte-scale data. BigQuery integrates with Google Data Studio, Power BI, Looker, and Tableau for ease of use when it comes to post-analysis data storytelling. Pricing for Google BigQuery is based on the amount of data you store as well as on the compute resources you consume on the platform, as represented by the amount of data your queries return from the platform.
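
To illustrate the "no in-house data engineer required" experience that these warehouses aim for, here's a minimal Python sketch that runs an analytics query on BigQuery through the google-cloud-bigquery client library. It assumes the library is installed (pip install google-cloud-bigquery), that Google Cloud credentials are already configured, and that the project, dataset, table, and column names are made up for illustration.

from google.cloud import bigquery

# The client picks up your Google Cloud credentials and project automatically.
client = bigquery.Client()

# A hypothetical analytics query against a table sitting in the warehouse.
sql = """
    SELECT store_id, SUM(amount) AS total_sales
    FROM `my-project.sales_data.transactions`
    GROUP BY store_id
    ORDER BY total_sales DESC
    LIMIT 10
"""

query_job = client.query(sql)     # BigQuery runs the query serverlessly.
for row in query_job.result():    # Iterate over the returned rows.
    print(f"{row.store_id}: {row.total_sales}")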

Introducing NoSQL databases

A traditional RDBMS isn’t equipped to handle big data demands. That’s because it’s designed to handle only relational datasets constructed of data that’s stored in clean rows and columns and thus is capable of being queried via SQL. RDBMSs are incapable of handling unstructured and semistructured data. Moreover, RDBMSs simply lack the processing and handling capabilities that are needed for meeting big data volume-and-velocity requirements.

This is where NoSQL comes in — its databases are nonrelational, distributed database systems that were designed to rise to the challenges involved in storing and processing big data. They can be run on-premise or in a cloud environment. NoSQL databases step out past the traditional relational database architecture and offer a much more scalable, efficient solution. NoSQL systems facilitate non-SQL data querying of nonrelational or schema-free, semistructured and unstructured data. In this way, NoSQL databases are able to handle the structured, semistructured, and unstructured data sources that are common in big data systems.

Technicalstuff A key-value pair is a pair of data items, represented by a key and a value. The key is a data item that acts as the record identifier and the value is the data that’s identified (and retrieved) by its respective key.

NoSQL offers four categories of nonrelational databases: graph databases, document databases, key-value stores, and column family stores. Because NoSQL offers native functionality for each of these separate types of data structures, it offers efficient storage and retrieval functionality for most types of nonrelational data. This adaptability and efficiency make NoSQL an increasingly popular choice for handling big data and for overcoming processing challenges that come along with it.

NoSQL applications like Apache Cassandra and MongoDB are used for data storage and real-time processing. Apache Cassandra is a popular column family store (often also described as a key-value store), and MongoDB is the most popular document-oriented NoSQL database; it uses dynamic schemas and stores JSON-esque documents.

Technicalstuff A document-oriented database is a NoSQL database that houses, retrieves, and manages the kinds of JSON files and XML files described earlier in this chapter, in the definition of semistructured data. A document-oriented database is otherwise known as a document store.
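
Here's a minimal sketch of what working with a document store looks like in practice, using MongoDB through the pymongo driver (pip install pymongo). It assumes a MongoDB server is running locally, and the database, collection, and field names are invented for illustration.

from pymongo import MongoClient

# Connect to a locally running MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
db = client["clickstream_db"]
events = db["events"]

# Documents don't need a fixed schema; each one can have different fields.
events.insert_one({"user_id": 42, "page": "/products", "device": "mobile"})
events.insert_one({"user_id": 7, "page": "/checkout", "cart_total": 24.75})

# Query by any field, much as you'd filter rows in a relational table.
for doc in events.find({"page": "/products"}):
    print(doc)

client.close()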

Technicalstuff Some people argue that the term NoSQL stands for Not Only SQL, and others argue that it represents non-SQL databases. The argument is rather complex and has no cut-and-dried answer. To keep things simple, just think of NoSQL as a class of nonrelational systems that don’t fall within the spectrum of RDBMSs that are queried using SQL.

Storing big data on-premise

Although cloud storage and cloud processing of big data is widely accepted as safe, reliable, and cost-effective, companies have a multitude of reasons for using on-premise solutions instead. In many instances of the training and consulting work I’ve done for foreign governments and multinational corporations, cloud data storage was the ultimate “no-fly zone” that should never be breached. This is particularly true of businesses I’ve worked with in the Middle East, where local security concerns were voiced as a main deterrent for moving corporate or government data to the cloud.

Though the popularity of storing big data on-premise has waned in recent years, many companies have their reasons for not wanting to move to a cloud environment. If you find yourself in circumstances where cloud services aren’t an option, you’ll probably appreciate the following discussion about on-premise alternatives.

Remember The Kubernetes and NoSQL databases described earlier in this chapter can be deployed on-premise as well as in a cloud environment.

Reminiscing about Hadoop

Because big data’s three Vs (volume, velocity, and variety) don’t allow for the handling of big data using traditional RDBMSs, data engineers had to become innovative. To work around the limitations of relational systems, data engineers originally turned to the Hadoop data processing platform to boil down big data into smaller datasets that are more manageable for data scientists to analyze. This was all the rage until about 2015, when market demands had changed to the point that the platform was no longer able to meet them.

Remember When people refer to Hadoop, they’re generally referring to an on-premise Hadoop storage environment that includes the HDFS (for data storage), MapReduce (for bulk data processing), Spark (for real-time data processing), and YARN (for resource management).

Incorporating MapReduce, the HDFS, and YARN

MapReduce is a parallel distributed processing framework that can process tremendous volumes of data in-batch — where data is collected and then processed as one unit with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples. (With respect to MapReduce, tuples refers to key-value pairs by which data is grouped, sorted, and processed.) In layperson terms, MapReduce uses parallel distributed computing to transform big data into data of a manageable size.

Technicalstuff In Hadoop, parallel distributed processing refers to a powerful framework in which data is processed quickly via the distribution and parallel processing of tasks across clusters of commodity servers.
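
To make the map-and-reduce idea concrete without standing up a Hadoop cluster, here's a tiny pure-Python sketch that mimics the two phases on a word count, the classic MapReduce teaching example. In real MapReduce, the map and reduce steps run in parallel across many nodes rather than inside one script.

from collections import defaultdict

documents = [
    "big data moves fast",
    "big data is big",
]

# Map phase: convert raw records into (key, value) tuples, here (word, 1).
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group the tuples by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each group into a smaller result, here a total per word.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 3, 'data': 2, 'moves': 1, 'fast': 1, 'is': 1}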

Storing data on the Hadoop distributed file system (HDFS)

The HDFS uses clusters of commodity hardware for storing data. Hardware in each cluster is connected, and this hardware is composed of commodity servers — low-cost, low-performing generic servers that offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in storing big data.

The HDFS is characterized by these three key features:

  • HDFS blocks: In data storage, a block is a storage unit that contains some maximum number of records. HDFS blocks store 64 megabytes of data by default in older Hadoop releases (the default grew to 128 megabytes in Hadoop 2 and later).
  • Redundancy: Datasets that are stored in HDFS are broken up and stored in blocks. These blocks are then replicated (three times, by default) and stored on several different servers in the cluster as backup, or redundancy (see the short sizing sketch after this list).
  • Fault-tolerance: As mentioned earlier, a system is described as fault-tolerant if it’s built to continue successful operations despite the failure of one or more of its subcomponents. Because the HDFS has built-in redundancy across multiple servers in a cluster, if one server fails, the system simply retrieves the data from another server.
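
Here's a quick back-of-the-envelope Python sketch showing how block size and the default replication factor of three translate into raw cluster storage. The 1-terabyte dataset is hypothetical, and the 64-megabyte block size is an assumption based on the older Hadoop default.

MB = 1024 * 1024

dataset_size = 1024 * 1024 * MB   # a hypothetical 1 TB dataset, in bytes
block_size = 64 * MB              # assumed block size (64 MB, the older HDFS default)
replication_factor = 3            # HDFS default: each block is stored on 3 nodes

# How many blocks does the dataset break into?
num_blocks = -(-dataset_size // block_size)   # ceiling division

# How much raw cluster storage do the replicated blocks consume?
raw_storage = num_blocks * block_size * replication_factor

print(f"Blocks needed: {num_blocks:,}")                       # 16,384 blocks
print(f"Raw storage used: {raw_storage / (1024**4):.1f} TB")  # about 3 TB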

Putting it all together on the Hadoop platform

The Hadoop platform was designed for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN (a resource manager) all working together.

Within a Hadoop platform, the workloads of applications that run on the HDFS (like MapReduce and Spark) are divided among the nodes of the cluster, and the output is stored on the HDFS. A Hadoop cluster can be composed of thousands of nodes. To keep the costs of input/output (I/O) processes low, MapReduce jobs are performed as close as possible to the data — the task processors are positioned as close as possible to the data that needs to be processed. This design spreads the computational work of big data processing across the cluster.

Introducing massively parallel processing (MPP) platforms

Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional on-premise data warehouse, an MPP may be the perfect solution.

To understand how MPP compares to a standard MapReduce parallel-processing framework, consider that MPP runs parallel computing tasks on costly custom hardware, whereas MapReduce runs them on inexpensive commodity servers. Consequently, MPP processing capabilities are cost restrictive. On the upside, MPP is quicker and easier to use than standard MapReduce jobs because MPP platforms can be queried using Structured Query Language (SQL), whereas native MapReduce jobs are written in the more complicated Java programming language.

Processing big data in real-time

A real-time processing framework is — as its name implies — a framework that processes data in real-time (or near-real-time) as the data streams and flows into the system. Real-time frameworks process data in microbatches and return results in a matter of seconds, rather than in the hours or days that batch processing frameworks like MapReduce typically take. Real-time processing frameworks do one of the following:

  • Increase the overall time efficiency of the system: Solutions in this category include Apache Storm and Apache Spark for near-real-time stream processing.
  • Deploy innovative querying methods to facilitate the real-time querying of big data: Some solutions in this category are Google’s Dremel, Apache Drill, Shark for Apache Hive, and Cloudera’s Impala.

Technicalstuff In-memory refers to processing data within the computer’s memory, without actually reading and writing its computational results onto the disk. In-memory computing provides results a lot faster but cannot process much data per processing interval.

Apache Spark is an in-memory computing application that you can use to query, explore, analyze, and even run machine learning algorithms on incoming streaming data in near-real-time. Its power lies in its processing speed: The ability to process and make predictions from streaming big data sources in three seconds flat is no laughing matter.
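
Here's a minimal PySpark Structured Streaming sketch that processes a synthetic stream in near-real-time micro-batches. It uses Spark's built-in rate source, which simply generates timestamped rows, so it runs without any external data feed; it assumes PySpark is installed, and the window and trigger intervals are arbitrary.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in "rate" source generates rows with a timestamp and a value column.
stream = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Count events in 10-second windows as the data streams in.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Write the running counts to the console on every micro-batch.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="5 seconds")
    .start()
)

query.awaitTermination(60)  # run for about a minute, then fall through
spark.stop()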

Tip Real-time, stream-processing frameworks are quite useful in a multitude of industries — from stock and financial market analyses to e-commerce optimizations and from real-time fraud detection to optimized order logistics. Regardless of the industry in which you work, if your business is impacted by real-time data streams that are generated by humans, machines, or sensors, a real-time processing framework would be helpful to you in optimizing and generating value for your organization.