In Section 1.13, we introduced big data. In this capstone chapter, we discuss popular hardware and software infrastructure for working with big data, and we develop complete applications on several desktop and cloud-based big-data platforms.
Databases are critical big-data infrastructure for storing and manipulating the massive amounts of data we’re creating. They’re also critical for securely and confidentially maintaining that data, especially in the context of ever-stricter privacy laws such as HIPAA (Health Insurance Portability and Accountability Act) in the United States and GDPR (General Data Protection Regulation) in the European Union.
First, we’ll present relational databases, which store structured data in tables with a fixed number of columns per row. You’ll manipulate relational databases via Structured Query Language (SQL).
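To give you a feel for this model before we get there, the following minimal sketch uses Python’s built-in sqlite3 module; the employees table and its contents are hypothetical examples we made up for illustration, not data from the chapter’s case study:

    import sqlite3

    # create a temporary in-memory relational database to experiment with
    connection = sqlite3.connect(':memory:')

    # every row of the employees table has the same fixed columns
    connection.execute(
        'CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)')
    connection.execute("INSERT INTO employees VALUES (1, 'Ada', 101000.0)")
    connection.execute("INSERT INTO employees VALUES (2, 'Alan', 99000.0)")

    # use a SQL SELECT query to retrieve the structured data
    for row in connection.execute(
            'SELECT name, salary FROM employees WHERE salary > 100000'):
        print(row)  # displays ('Ada', 101000.0)

    connection.close()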
Most data produced today is unstructured data, like the content of Facebook posts and Twitter tweets, or semi-structured data like JSON and XML documents. Twitter processes each tweet’s contents into a semi-structured JSON document with lots of metadata, as you saw in the “Data Mining Twitter” chapter. Relational databases are not geared to the unstructured and semi-structured data in big-data applications. So, as big data evolved, new kinds of databases were created to handle such data efficiently. We’ll discuss the four major types of these NoSQL databases—key–value, document, columnar and graph databases. Also, we’ll overview NewSQL databases, which blend the benefits of relational and NoSQL databases. Many NoSQL and NewSQL vendors make it easy to get started with their products through free tiers and free trials, and typically in cloud-based environments that require minimal installation and setup. This makes it practical for you to gain big-data experience before “diving in.”
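To see why such data is called semi-structured, consider the following sketch using Python’s standard-library json module; the tweet-like document is a simplified, made-up example (real tweet JSON contains far more metadata):

    import json

    # a semi-structured document: nested fields, no fixed set of columns
    document = json.loads('''{
        "user": {"screen_name": "pythondev", "followers_count": 1234},
        "text": "Learning big data with Python!",
        "hashtags": ["bigdata", "Python"]
    }''')

    # navigate the nested structure as dictionaries and lists
    print(document['user']['screen_name'])  # displays pythondev
    print(document['hashtags'])  # displays ['bigdata', 'Python']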
Much of today’s data is so large that it cannot fit on one system. As big data grew, we needed distributed data storage and parallel processing capabilities to process the data more efficiently. This led to complex technologies like Apache Hadoop for distributed data processing, which achieves massive parallelism among clusters of computers while handling the intricate details for you automatically and correctly. We’ll discuss Hadoop, its architecture and how it’s used in big-data applications. We’ll guide you through configuring a multi-node Hadoop cluster using the Microsoft Azure HDInsight cloud service, then use it to execute a Hadoop MapReduce job that you’ll implement in Python. Though HDInsight is not free, Microsoft gives you a generous new-account credit that should enable you to run the chapter’s code examples without incurring additional charges.
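To preview the style of MapReduce code you’ll write, here’s a minimal word-count sketch in the Hadoop streaming style, in which the mapper and reducer are separate Python scripts that read standard input and write tab-delimited key–value pairs to standard output (the cluster setup and file names are deferred to the case study):

    # --- mapper.py: for each word in the input, emit the pair word<TAB>1
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f'{word.lower()}\t1')

    # --- reducer.py: total the counts for each word; Hadoop streaming
    # --- sorts the mapper's output by key before the reducer receives it
    import sys
    from itertools import groupby
    from operator import itemgetter

    def key_value_pairs():
        for line in sys.stdin:
            word, count = line.strip().split('\t')
            yield word, int(count)

    for word, group in groupby(key_value_pairs(), key=itemgetter(0)):
        print(f'{word}\t{sum(count for _, count in group)}')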
As big-data processing needs grow, the information-technology community is continually looking for ways to increase performance. Hadoop executes tasks by breaking them into pieces that do lots of disk I/O across many computers. Spark was developed as a way to perform certain big-data tasks in memory for better performance.
We’ll discuss Apache Spark, its architecture and how it’s used in high-performance, real-time big-data applications. You’ll implement a Spark application using functional-style filter/map/reduce programming capabilities. First, you’ll build this example using a Jupyter Docker stack that runs locally on your desktop computer, then you’ll implement it using a cloud-based Microsoft Azure HDInsight multi-node Spark cluster. In the exercises, you’ll also do this example with the free Databricks Community Edition.
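As a preview of that functional style, here’s a minimal sketch that sums the squares of the even integers from 1 through 10, assuming the pyspark package is installed locally (the chapter’s examples run in a Jupyter Docker stack and on HDInsight instead):

    from pyspark.sql import SparkSession

    # create a local single-node Spark "cluster" for experimentation
    spark = SparkSession.builder.appName('FilterMapReduceSketch') \
        .master('local[*]').getOrCreate()
    sc = spark.sparkContext

    total = (sc.parallelize(range(1, 11))       # distribute 1-10 as an RDD
               .filter(lambda x: x % 2 == 0)    # keep only the even values
               .map(lambda x: x ** 2)           # square each remaining value
               .reduce(lambda x, y: x + y))     # total the squares

    print(total)  # displays 220
    spark.stop()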
We’ll introduce Spark streaming for processing streaming data in mini-batches. Spark streaming gathers data for a short time interval you specify, then gives you that batch of data to process. You’ll implement a Spark streaming application that processes tweets. In that example, you’ll use Spark SQL to query data stored in a Spark DataFrame which, unlike pandas DataFrames, may contain data distributed over many computers in a cluster.
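Here’s a minimal sketch of that Spark SQL pattern; the hashtag counts below are made-up stand-ins for the streaming tweet data you’ll actually process:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName('SparkSQLSketch') \
        .master('local[*]').getOrCreate()

    # build a small Spark DataFrame; on a real cluster, its rows
    # may be distributed over many computers
    hashtags = spark.createDataFrame([
        Row(hashtag='bigdata', total=25),
        Row(hashtag='Python', total=30),
        Row(hashtag='IoT', total=10)])

    # register the DataFrame as a table, then query it with Spark SQL
    hashtags.createOrReplaceTempView('hashtag_totals')
    spark.sql('''SELECT hashtag, total FROM hashtag_totals
                 WHERE total >= 20 ORDER BY total DESC''').show()
    spark.stop()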
We’ll conclude with an introduction to the Internet of Things (IoT)—billions of devices that are continuously producing data worldwide. We’ll introduce the publish/subscribe model that IoT and other types of applications use to connect data users with data providers. First, without writing any code, you’ll build a web-based dashboard using Freeboard.io and a sample live stream from the PubNub messaging service. Next, you’ll simulate an Internet-connected thermostat which publishes messages to the free Dweet.io messaging service using the Python module Dweepy, then create a dashboard visualization of the data with Freeboard.io. Finally, you’ll build a Python client that subscribes to a sample live stream from the PubNub service and dynamically visualizes the stream with Seaborn and a Matplotlib FuncAnimation.
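Here’s a minimal sketch of the publishing side of that model, using Dweepy to publish simulated thermostat readings to Dweet.io; the thing name below is a hypothetical placeholder (Dweet.io thing names are global, so you’d substitute your own unique name):

    import random
    import time

    import dweepy

    THING_NAME = 'my-temperature-simulator-12345'  # hypothetical placeholder

    # publish ten simulated thermostat readings, one per second
    for _ in range(10):
        reading = {'temperature': round(random.uniform(17.0, 23.0), 1)}
        dweepy.dweet_for(THING_NAME, reading)  # publish the reading to Dweet.io
        time.sleep(1)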
The rich exercise set encourages you to work with more big-data cloud and desktop platforms, additional SQL and NoSQL databases, NewSQL databases and IoT platforms. One exercise asks you to work with Wikipedia as another popular big-data source. Another asks you to implement an IoT application with the popular Raspberry Pi device simulator.
Cloud vendors focus on service-oriented architecture (SOA) technology in which they provide “as-a-Service” capabilities that applications connect to and use in the cloud. Common services provided by cloud vendors include:

Big data as a Service (BDaaS)
Hadoop as a Service (HaaS)
Hardware as a Service (HaaS)
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)
Storage as a Service (SaaS)
Spark as a Service (SaaS)
You’ll get hands-on experience with several cloud-based tools. In this chapter’s examples, you’ll use the following platforms:
A free MongoDB Atlas cloud-based cluster.
A multi-node Hadoop cluster running on Microsoft’s Azure HDInsight cloud-based service—for this you’ll use the credit that comes with a new Azure account.
A free single-node Spark “cluster” running on your desktop computer, using a Jupyter Docker-stack container.
A multi-node Spark cluster, also running on Microsoft’s Azure HDInsight—for this you’ll continue using your Azure new-account credit.
In the project exercises, you can explore various other options, including cloud-based services from Amazon Web Services, Google Cloud and IBM Watson, and the free desktop versions of the Hortonworks and Cloudera platforms (there are also cloud-based paid versions of these). You’ll also explore and use a single-node Spark cluster running on the free cloud-based Databricks Community Edition. Spark’s creators founded Databricks.
Always check the latest terms and conditions of each service you use. Some require you to enable credit-card billing to use their clusters. Caution: Once you allocate Microsoft Azure HDInsight clusters (or other vendors’ clusters), they incur costs. When you complete the case studies using services such as Microsoft Azure, be sure to delete your cluster(s) and their other resources (like storage). This will help extend the life of your Azure new-account credit.
Installation and setups vary across platforms and over time. Always follow each vendor’s latest steps. If you have questions, the best sources for help are the vendor’s support capabilities and forums. Also, check sites such as stackoverflow.com—other people may have asked questions about similar problems and received answers from the developer community.
Algorithms and data are the core of Python programming. The first few chapters of this book were mostly about algorithms. We introduced control statements and discussed algorithm development. Data was small—primarily individual integers, floats and strings. Chapters 5–9 emphasized structuring data into lists, tuples, dictionaries, sets, arrays and files. In Chapter 11, we refocused on algorithms, using Big-O notation to help us quantify how hard algorithms work to do their jobs.
But, what about the meaning of the data? Can we use the data to gain insights to better diagnose cancers? Save lives? Improve patients’ quality of life? Reduce pollution? Conserve water? Increase crop yields? Reduce damage from devastating storms and fires? Develop better treatment regimens? Create jobs? Improve company profitability?
The data-science case studies of Chapters 12–16 all focused on AI. In this chapter, we focus on the big-data infrastructure that supports AI solutions. As the data used with these technologies continues growing exponentially, we want to learn from that data and do so at blazing speed. We’ll accomplish these goals with a combination of sophisticated algorithms, hardware, software and networking designs. We’ve presented various machine-learning technologies, seeing that there are indeed great insights to be mined from data. With more data, and especially with big data, machine learning can be even more effective.
The following articles and sites provide links to hundreds of free big data sources:
“Awesome-Public-Datasets,” GitHub.com, https:/
“AWS Public Datasets,” https:/
“Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018,” B. Marr, https://www.forbes.com/sites/bernardmarr/2018/02/26/big-data-and-ai-30-amazing-and-free-public-data-sources-for-2018/
“Datasets for Data Mining and Data Science,” http:/
“Exploring Open Data Sets,” https:/
“Free Big Data Sources,” Datamics, http:/
Hadoop Illuminated, Chapter 16, “Publicly Available Big Data Sets,” http:/
“List of Public Data Sources Fit for Machine Learning,” https:/
“Open Data,” Wikipedia, https:/
“Open Data 500 Companies,” http:/
“Other Interesting Resources/Big Data and Analytics Educational Resources and Research,” B. Marr, http:/
“6 Amazing Sources of Practice Data Sets,” https:/
“20 Big Data Repositories You Should Check Out,” M. Krivanek, http:/
“70+ Websites to Get Large Data Repositories for Free,” http:/
“Ten Sources of Free Big Data on Internet,” A. Brown, https://www.linkedin.com/pulse/ten-sources-free-big-data-internet-alan-brown
“Top 20 Open Data Sources,” https:/
“We’re Setting Data, Code and APIs Free,” NASA, https:/
“Where Can I Find Large Datasets Open to the Public?” Quora, https:/
(Fill-In) _____ databases store structured data in tables with a fixed number of columns per row and are manipulated via Structured Query Language (SQL).
Answer: Relational.
(Fill-In) Most data produced today is _____ data, like the content of Facebook posts and Twitter tweets, or _____ data like JSON and XML documents.
Answer: unstructured, semi-structured.
(Fill-In) Cloud vendors focus on _____ technology in which they provide “as-a-Service” capabilities that applications connect to and use in the cloud.
Answer: service-oriented architecture (SOA).