
1. Evolution of Data


Before understanding Spark, it is imperative to understand the reason behind the deluge of data we are witnessing around us today. In the early days, data was generated or accumulated only by the employees of companies, who entered it into systems, and the data points were very limited, capturing only a few fields. Then came the internet, and information became easily accessible to everyone using it. Users now had the power to enter and generate their own data. This was a massive shift: the number of internet users grew exponentially, and the data created by these users grew at an even higher rate. For example, users fill in their own details on login and sign-up forms and upload photos and videos on various social platforms. This resulted in huge amounts of data and created the need for a fast and scalable framework to process it.

Data Generation

This data generation has now gone to the next level as machines are generating and accumulating data, as shown in Figure 1-1. Every device around us, be it a car, building, mobile phone, watch, or flight engine, is embedded with multiple monitoring sensors and is recording data every second. This data is even higher in magnitude than the user-generated data.
Figure 1-1: Data Evolution

Earlier, when data existed only at the enterprise level, a relational database was good enough to handle the needs of the system, but as the size of data increased exponentially over the past couple of decades, a tectonic shift happened in how big data is handled, and out of it Spark was born. Traditionally, we used to take the data and bring it to the processor, but now there is so much data that it overwhelms a single processor. So we now bring multiple processors to the data instead. This is known as parallel processing, as the data is being processed at a number of places at the same time.

Let’s look at an example to understand parallel processing. Assume that on a particular freeway there is only a single toll booth, and every vehicle has to get into a single lane in order to pass through it, as shown in Figure 1-2. If, on average, it takes 1 minute for each vehicle to pass through the toll gate, eight vehicles would take a total of 8 minutes, and 100 vehicles would take 100 minutes.
Figure 1-2: Single Thread Processing

But imagine that instead of a single toll booth, there are eight toll booths on the same freeway and vehicles can use any one of them to pass through. Because there is no dependency among the vehicles anymore, it would take only 1 minute in total for all eight vehicles to pass through, as shown in Figure 1-3. We have parallelized the operations.
Figure 1-3: Parallel Processing

Parallel or distributed computing works on a similar principle: it parallelizes the tasks and accumulates the final results at the end. Spark is a robust framework for handling massive datasets at high speed using parallel processing.

Spark

Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open sourced in early 2010 as shown in Figure 1-4. Since then, there has been no looking back. In 2016, Spark released TensorFrames for Deep Learning.
Figure 1-4: Spark Evolution

Under the hood, Spark uses a data structure known as the RDD (Resilient Distributed Dataset). RDDs are resilient in the sense that they can be re-created at any point during the execution process: each transformation creates a new RDD from the previous one, and Spark tracks this lineage so that any RDD can be reconstructed in case of an error. RDDs are also immutable, as the original RDD always remains unaltered. Because Spark is a distributed framework, it works in a master and worker node setting, as shown in Figure 1-5. The code for any activity is first written on the Spark Driver and then shared across the worker nodes where the data actually resides. Each worker node contains Executors that actually execute the code, and the Cluster Manager keeps a check on the availability of the various worker nodes for the next task allocation.
Figure 1-5: Spark Functioning
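
As a minimal sketch of these ideas, the snippet below (assuming a local SparkSession and some made-up numbers) creates an RDD, applies a transformation that yields a new RDD while leaving the original untouched, and prints the lineage Spark would use to rebuild it after a failure.

from pyspark.sql import SparkSession

# Start a local SparkSession; the SparkContext gives access to the RDD API
spark = SparkSession.builder.master("local[*]").appName("rdd_demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory list (hypothetical sample numbers)
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations return a new RDD; the original RDD is never modified
squares = numbers.map(lambda x: x * x)

print(numbers.collect())   # [1, 2, 3, 4, 5] - unchanged
print(squares.collect())   # [1, 4, 9, 16, 25]

# The lineage (how this RDD was derived) is what makes it resilient
print(squares.toDebugString())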

The prime reason Spark is hugely popular is that it is very easy to use for data processing, Machine Learning, and streaming data, and it is comparatively very fast since it performs all of its computations in memory. Since Spark is a generic data processing engine, it can easily be used with various data sources such as HBase, Cassandra, Amazon S3, HDFS, etc. Spark gives users four language options: Java, Python, Scala, and R.
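
As a small, hedged illustration of this generality, the snippet below reads data from a local CSV file; the file path is a made-up assumption, and an HDFS or Amazon S3 location (an hdfs:// or s3a:// URI, with the appropriate connector configured) could be substituted without changing the rest of the code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_sources").getOrCreate()

# Hypothetical local file; an hdfs:// or s3a:// path would work the same way
df = spark.read.csv("data/customers.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)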

Spark Core

Spark Core is the most fundamental building block of Spark, as shown in Figure 1-6, and the backbone of all of Spark's functionality. Spark Core enables the in-memory computations that drive the parallel and distributed processing of data, and all of Spark's other features are built on top of it. Spark Core is responsible for task management, I/O operations, fault tolerance, memory management, and so on.
Figure 1-6: Spark Architecture
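
To make the in-memory computation concrete, here is a minimal sketch (with made-up numbers) that caches a dataset in memory so that repeated computations on it do not re-derive the data each time.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("core_demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across the local cores and keep it in memory
data = sc.parallelize(range(1, 1000001)).cache()

# Both actions reuse the cached, in-memory partitions
print(data.count())   # 1000000
print(data.sum())     # 500000500000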

Spark Components

Let’s look at the components.

Spark SQL

This component mainly deals with structured data processing. The key idea is to use the extra information about the structure of the data to perform additional optimizations. It can be considered a distributed SQL query engine.
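
A brief sketch of this component, using a small made-up DataFrame, registers the data as a temporary view and queries it with plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_sql_demo").getOrCreate()

# Hypothetical sample data
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

# Expose the DataFrame as a SQL view and query it like a table
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()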

Spark Streaming

This component deals with processing the real-time streaming data in a scalable and fault tolerant manner. It uses micro batching to read and process incoming streams of data. It creates micro batches of streaming data, executes batch processing, and passes it to some file storage or live dashboard. Spark Streaming can ingest the data from multiple sources like Kafka and Flume.
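
As a hedged sketch (assuming a text stream on a local socket at port 9999, for example one started with a netcat utility), the snippet below counts words in each micro batch and prints the running counts to the console:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# local[2]: one thread for the socket receiver, one for processing
spark = SparkSession.builder.master("local[2]").appName("streaming_demo").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=5)  # 5-second micro batches

# Read a stream of text lines from a local socket (assumed source)
lines = ssc.socketTextStream("localhost", 9999)

# Count words within each 5-second micro batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()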

Spark MLlib

This component is used for building Machine Learning models on Big Data in a distributed manner. The traditional approach of building ML models using Python's scikit-learn library faces a lot of challenges when the data size is huge, whereas MLlib is designed to offer feature engineering and machine learning at scale. MLlib implements most of the common algorithms for classification, regression, clustering, recommendation systems, and natural language processing.
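
A compact sketch of the MLlib workflow, using a tiny made-up DataFrame, assembles the raw columns into a feature vector and fits a logistic regression model on it:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib_demo").getOrCreate()

# Hypothetical training data: two numeric features and a binary label
df = spark.createDataFrame(
    [(1.0, 5.0, 0.0), (2.0, 6.0, 0.0), (8.0, 1.0, 1.0), (9.0, 2.0, 1.0)],
    ["f1", "f2", "label"],
)

# MLlib expects all features combined into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

# Fit a distributed logistic regression model and inspect its predictions
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)
model.transform(train).select("features", "label", "prediction").show()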

Spark GraphX/Graphframe

This component excels at graph analytics and graph-parallel execution. GraphFrames can be used to understand the underlying relationships in data and visualize the resulting insights.
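
As an illustrative sketch only (GraphFrames is a separate Spark package that must be added to the session, and the people and relationships below are made up), a graph can be built from two DataFrames of vertices and edges and then queried:

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires the external graphframes package

spark = SparkSession.builder.appName("graph_demo").getOrCreate()

# Hypothetical vertices (people) and edges (relationships)
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Cathy")], ["id", "name"]
)
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(vertices, edges)

# Basic graph analytics: how many incoming connections each person has
g.inDegrees.show()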

Setting Up Environment

This section of the chapter covers setting up a Spark environment on your system. Depending on your operating system, choose the corresponding option below to install Spark.

Windows

Files to Download:
  1. Anaconda (Python 3.x)
  2. Java (in case not installed)
  3. Apache Spark (latest version)
  4. winutils.exe

Anaconda Installation

Download the Anaconda distribution from https://www.anaconda.com/download/#windows and install it on your system. One thing to be careful about while installing is to enable the option of adding Anaconda to the PATH environment variable so that Windows can find the relevant files when starting Python.

Once Anaconda is installed, we can use the command prompt to check whether Python is working fine on the system. You may also want to check whether Jupyter notebook opens up by trying the command below:
[In]: jupyter notebook

Java Installation

Visit https://www.java.com/en/download/, download the latest version of Java, and install it.

Spark Installation

Create a folder named spark at the location of your choice. Let's say we decide to create the spark folder on the D:/ drive. Go to https://spark.apache.org/downloads.html and select the Spark release version that you want to install on your machine. Choose the package type option "Pre-built for Apache Hadoop 2.7 and later." Download the .tgz file into the spark folder that we created earlier and extract all the files. You will observe that there is a folder named bin among the unzipped files.

The next step is to download winutils.exe: go to https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe, download the .exe file, and save it to the bin folder of the unzipped spark folder (D:/spark/spark_unzipped/bin).

Now that we have downloaded all the required files, the next step is adding environment variables in order to use pyspark.

Go to the start button of Windows and search for “Edit environment variables for your account.” Let’s go ahead and create a new environment variable for winutils and assign the path for the same. Click on new and create a new variable with the name HADOOP_HOME and pass the path of the folder (D:/spark/spark_unzipped) in the variable value placeholder.

We repeat the same process for the spark variable and create a new variable with name SPARK_HOME and pass the path of spark folder (D:/spark/spark_unzipped) in the variable value placeholder.

Let’s add a couple more variables in order to use Jupyter notebook. Create a new variable with the name PYSPARK_DRIVER_PYTHON and pass jupyter in the variable value placeholder. Create another variable named PYSPARK_DRIVER_PYTHON_OPTS and pass notebook in the value field.

In the same window, look for the Path or PATH variable, click edit, and add D:/spark/spark_unzipped/bin to it. In Windows 7 you need to separate the values in Path with a semicolon between the values.

We need to add Java as well to the environment variable. So, create another variable JAVA_HOME and pass the path of the folder where Java is installed.
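
If you prefer to set these values from inside a notebook session instead of (or in addition to) the Windows settings dialog, a hedged sketch follows; the folder paths are the assumed locations from the steps above (the Java path is hypothetical) and should be adjusted to your own setup.

import os

# Assumed install locations from the steps above; adjust to your machine
os.environ["HADOOP_HOME"] = "D:/spark/spark_unzipped"
os.environ["SPARK_HOME"] = "D:/spark/spark_unzipped"
os.environ["JAVA_HOME"] = "C:/Program Files/Java/jre1.8.0_201"  # hypothetical Java folder

import findspark
findspark.init(os.environ["SPARK_HOME"])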

We can now open a cmd window and run Jupyter notebook.
[In]: import findspark
[In]: findspark.init()
[In]: import pyspark
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.getOrCreate()
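
If the session is created without errors, a quick sanity check (the range values below are arbitrary) is to print the Spark version and run a small DataFrame operation:
[In]: print(spark.version)
[In]: spark.range(1, 6).show()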

macOS

Assuming we have Anaconda and Java installed on our Mac already, we can download the latest version of Spark and save it to the home directory. We can open the terminal and go to the home directory using
[In]:  cd ~
Move the downloaded Spark archive (the .tgz file) to the home directory and unzip its contents.
[In]: mv /Users/username/Downloads/spark-2.3.0-bin-hadoop2.7.tgz /Users/username
[In]: tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz
Validate if you have a .bash_profile.
[In]: ls -a
Next, we will edit the .bash_profile so that we can open a Spark notebook in any directory.
[In]: nano .bash_profile
Paste the items below in the bash profile.
export SPARK_PATH=~/spark-2.3.0-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias notebook='$SPARK_PATH/bin/pyspark --master local[2]'
[In]: source .bash_profile

Now open a new terminal, launch Jupyter notebook using the notebook alias defined above, and import pyspark to use it.

Docker

We can use PySpark directly with Docker by using one of the PySpark images from the Jupyter Docker Stacks repository (for example, jupyter/pyspark-notebook), but this requires Docker to be installed on your system.

Databricks

Databricks also offers a free Community Edition account that provides a 6 GB cluster with PySpark.

Conclusion

In this chapter, we looked at Spark's architecture, its various components, and the different ways to set up a local environment in order to use Spark. In upcoming chapters, we will go deeper into various aspects of Spark and build Machine Learning models with it.