
4. Linear Regression

Pramod Singh, Bangalore, Karnataka, India

As discussed in the previous chapter, Machine Learning is a very vast field with multiple algorithms falling under various categories, and Linear Regression is one of the most fundamental of these algorithms. This chapter focuses on building a Linear Regression model with PySpark and dives deep into the workings of an LR model. It covers the assumptions to be considered before using LR along with different evaluation metrics. But before jumping into Linear Regression, we must understand the types of variables.

Variables

Variables capture information in different forms. There are two main categories of variables that are widely used, as depicted in Figure 4-1.
Figure 4-1: Types of Variables

We can even further break down these variables into sub-categories, but we will stick to these two types throughout this book.

Numerical variables are values that are quantitative in nature, such as integers or floats. For example, salary records, exam scores, the age or height of a person, and stock prices all fall under the category of numerical variables.

Categorical variables, on the other hand, are qualitative in nature and mainly represent categories of the data being measured, for example, colors, outcomes (Yes/No), and ratings (Good/Poor/Average).

For building any sort of machine learning model we need to have input and output variables. Input variables are those values that are used to build and train the machine learning model to predict the output or target variable. Let’s take a simple example. Suppose we want to predict the salary of a person given the age of the person using machine learning. In this case, the salary is our output/target/dependent variable as it depends on age, which is known as the input or independent variable. Now the output variable can be categorical or numerical in nature and depending on its type, machine learning models are chosen.

Coming back to Linear Regression, it is primarily used when we are trying to predict a numerical output variable. Linear Regression fits a line through the input data points in the best possible way so that it can help make predictions for unseen data. But how can a model learn just from “age” and predict the salary amount for a given person? Clearly, there needs to be some sort of relationship between these two variables (salary and age). There are two major types of variable relationships:
  • Linear

  • Nonlinear

The notion of a linear relationship between any two variables suggests that both are proportional to each other in some way. The correlation between any two variables gives us an indication of how strong or weak the linear relationship between them is. The correlation coefficient can range from -1 to +1. Negative correlation means that as one of the variables increases, the other variable decreases. For example, the power and mileage of a vehicle can be negatively correlated because as power increases, mileage comes down. On the other hand, salary and years of work experience are an example of positively correlated variables. Non-linear relationships are comparatively complex in nature and hence require additional information to predict the target variable. For example, in a self-driving car, the relationship between input variables such as terrain, signal system, and pedestrians and the speed of the car is nonlinear.

Note

The next section includes theory behind Linear Regression and might be redundant for many readers. Please feel free to skip the section if this is the case.

Theory

Now that we understand the basics of variables and the relationships between them, let’s build on the example of age and salary to understand Linear Regression in depth.

The overall objective of Linear Regression is to fit a straight line through the data such that the vertical distance of each point from that line is minimal. So, in this case, we will predict the salary of a person given their age. Let’s assume we have records of four people that capture their ages and respective salaries, as shown in Table 4-1.
Table 4-1: Example Dataset

Sr. No    Age    Salary ('0000 $)
1         20     5
2         30     10
3         40     15
4         50     22

We have an input variable (age) at our disposal to make use of in order to predict the salary (which we will do at a later stage of this book), but let’s take a step back. Let’s assume that all we have with us at the start is just the salary values of these four people. The salary is plotted for each person in the Figure 4-2.
Figure 4-2: Scatter plot of Salary

Now, if we were to predict the salary of a fifth (new) person based on the salaries of these four people, the best possible prediction would be the average/mean of the existing salary values. That would be the best prediction given this information. It is like building a Machine Learning model without any input data (since we are using only the output values).

Let’s go ahead and calculate the average salary for these given salary values.

Avg. Salary = $$ \frac{\left(5+10+15+22\right)}{4} $$ = 13

So, the best prediction of the salary value for the next person is 13. Figure 4-3 showcases the salary values for each person along with the mean value (the best fit line in the case of using only one variable).
Figure 4-3: Best Fit Line plot

The mean-value line shown in Figure 4-3 is possibly the best fit line in this scenario because we are not using any variable apart from salary itself. If we look closely, none of the actual salary values lies on this best fit line; each shows some amount of separation from the mean salary value, as shown in Figure 4-4. These distances are also known as errors or residuals. If we simply add up all these distances, the total becomes 0, which makes sense since the line is the mean of all the data points. So, instead of simply adding them, we square each error and then add them up.
Figure 4-4: Residuals Plot

Sum of Squared errors = 64 + 9 + 4 + 81 = 158.

Adding up the squared residuals gives a total value of 158, which is known as the sum of squared errors (SSE).
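For readers who like to verify the arithmetic, here is a minimal sketch in plain Python (not part of the PySpark code used later in this chapter) that reproduces the mean and this baseline SSE:

salaries = [5, 10, 15, 22]   # salary in '0000 $

mean_salary = sum(salaries) / len(salaries)                    # 13.0
sse_baseline = sum((s - mean_salary) ** 2 for s in salaries)   # 158.0

print(mean_salary, sse_baseline)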

Note

We have not used any input variable so far to calculate the SSE.

Let us park this score for now and include the input variable (age of the person) as well to predict the salary of the person. Let’s start with visualizing the relationship between Age and Salary of the person as shown in Figure 4-5.
Figure 4-5: Correlation plot between Salary and Age

As we can observe, there is a clear positive correlation between age and salary, which is a good sign because it indicates that the model should be able to predict the target value (salary) with good accuracy due to the strong linear relationship between the input (age) and output (salary). As mentioned earlier, the overall aim of Linear Regression is to come up with a straight line that fits the data points in such a way that the squared difference between the actual target values and the predicted values is minimized. Since it is a straight line, we know from linear algebra that the equation of a straight line is

y= mx + c and the same is shown in Figure 4-6.
Figure 4-6: Straight Line plot

where,

m = slope of the line ($$ \frac{y_2-{y}_1}{x_2-{x}_1} $$)

x = value at x-axis

y= value at y-axis

c = intercept (value of y at x = 0)

Since Linear Regression is also finding out the straight line, the Linear Regression equation becomes
$$ y={B}_{\mathbf{0}}+{B}_{\mathbf{1}}\ast x $$

(since we are using only 1 input variable, i.e., Age)

where:

y= salary (prediction)

B0=intercept (value of Salary when Age is 0)

B1= slope or coefficient of Age

x= Age

Now, you may ask: multiple lines can be drawn through the data points (as shown in Figure 4-7), so how do we figure out which is the best fit line?
Figure 4-7: Possible Straight lines through data

The first criterion for the best fit line is that it should pass through the centroid of the data points, as shown in Figure 4-8. In our case, the centroid values are

mean (Age) = $$ \frac{\left(20+30+40+50\right)}{4} $$ = 35

mean (Salary) = $$ \frac{\left(5+10+15+22\right)}{4} $$ = 13

Figure 4-8: Centroids of Data

The second criterion is that it should minimize the sum of squared errors. We know our regression line equation is
$$ y={B}_{\mathbf{0}}+{B}_{\mathbf{1}}\ast x $$

Now the objective of Linear Regression is to come up with the optimal values of the intercept (B0) and coefficient (B1) so that the residuals/errors are minimized.

We can easily find the values of B0 and B1 for our dataset by using the formulas below.

B1= $$ \frac{\sum \left({x}_i-{x}_{mean}\right)\ast \left({y}_i-{y}_{mean}\right)}{\sum {\left({x}_i-{x}_{mean}\right)}^2} $$

B0 = $$ {y}_{mean}-{B}_{1}\ast {x}_{mean} $$

Table 4-2 showcases the calculation of slope and intercept for Linear Regression using input data.
Table 4-2: Calculation of Slope and Intercept

Age    Salary    Age Variance (diff. from mean)    Salary Variance (diff. from mean)    Covariance (Product)    Age Variance (Squared)
20     5         -15                               -8                                   120                     225
30     10        -5                                -3                                   15                      25
40     15        5                                 2                                    10                      25
50     22        15                                9                                    135                     225

Sum (Covariance) = 280, Sum (Age Variance Squared) = 500

Mean (Age) = 35

Mean (Salary) = 13

The covariance term for the two variables (age and salary) is calculated here as the product of the deviation of each variable (age and salary) from its mean. In short, the product of the age and salary deviations gives the covariance column in Table 4-2. Now that we have the covariance products and the squared age deviations, we can go ahead and calculate the values of the slope and intercept of the Linear Regression line:

B1 =$$ \frac{\sum (Covariance)}{\sum \left( Age\ Variance\ Squared\right)} $$

=$$ \frac{280}{500} $$

=0.56

B0 = 13 – (0.56 * 35)

= -6.6

Our final Linear Regression equation becomes
$$ y={B}_{\mathbf{0}}+{B}_{\mathbf{1}}\ast x $$

Salary = -6.6 + (0.56 * Age)

We can now predict any of the salary values using this equation at any given age. For example, the model would have predicted the salary of the first person to be something like this:

Salary (1st person) = -6.6 + (0.56*20)

= 4.6 ($ ‘0000)
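The slope, intercept, and predictions above can be verified with a small plain-Python sketch based on the closed-form formulas; the variable names here are purely illustrative:

ages     = [20, 30, 40, 50]
salaries = [5, 10, 15, 22]

x_mean = sum(ages) / len(ages)          # 35.0
y_mean = sum(salaries) / len(salaries)  # 13.0

# numerator = sum of covariance products, denominator = sum of squared age deviations
num = sum((x - x_mean) * (y - y_mean) for x, y in zip(ages, salaries))  # 280.0
den = sum((x - x_mean) ** 2 for x in ages)                              # 500.0

b1 = num / den              # 0.56
b0 = y_mean - b1 * x_mean   # -6.6

for age in ages:
    print(age, round(b0 + b1 * age, 1))   # 4.6, 10.2, 15.8, 21.4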

Interpretation

The slope (B1 = 0.56) means that for an increase of 1 year in the person’s age, the salary increases by $5,600 (0.56 * $10,000).

The intercept does not always have a meaningful interpretation. In this example, the value of -6.6 suggests that a person of age 0 would have a salary of negative $66,000, which obviously does not make sense.

Figure 4-9 shows the final regression line for our dataset.
Figure 4-9: Regression Line

Let’s predict the salary for all four records in our data using the regression equation, and compare the difference from actual salaries as shown in Table 4-3.
Table 4-3: Difference Between Predictions and Actual Values

Age    Salary    Predicted Salary    Difference/Error
20     5         4.6                 -0.4
30     10        10.2                0.2
40     15        15.8                0.8
50     22        21.4                -0.6

In a nutshell, Linear Regression comes up with the optimal values for the intercept (B0) and coefficients (B1, B2, …) so that the difference (error) between the predicted values and the actual target values is minimized.

But the question remains: Is it a good fit?

Evaluation

There are multiple ways to evaluate the goodness of fit of the regression line, but one common way is the coefficient of determination (r-square). Remember that we calculated the sum of squared errors when we used only the output variable itself, and its value was 158. Now let us recalculate the SSE for the model we have built using an input variable. Table 4-4 shows the calculation of the new SSE after using Linear Regression.
Table 4-4: Reduction in SSE After Using Linear Regression

Age    Salary    Predicted Salary    Difference/Error    Squared Error    Old SSE
20     5         4.6                 -0.4                0.16             64
30     10        10.2                0.2                 0.04             9
40     15        15.8                0.8                 0.64             4
50     22        21.4                -0.6                0.36             81

As we can observe, the total sum of squared differences has reduced significantly from 158 to only 1.2 because of Linear Regression. The variance in the target variable (Salary) can now be explained with the help of regression (due to the use of the input variable, Age). Ordinary least squares (OLS) works toward reducing this overall sum of squared errors. The total sum of squares is a combination of two components:

TSS (Total Sum of Squares) = SSE (Sum of Squared Errors) + SSR (Sum of Squares due to Regression)

The total sum of squares (TSS) is the sum of the squared differences between the actual values and the mean value and is fixed for a given dataset. It was equal to 158 in our example.

The SSE is the sum of the squared differences between the actual and predicted values of the target variable, which reduced to 1.2 after using Linear Regression.

SSR is the sum of squares explained by regression and can be calculated by (TSS – SSE).

SSR = 158 – 1.2 =156.8

rsquare (Coefficient of determination) = $$ \frac{SSR}{TSS} $$ = $$ \frac{156.8}{158} $$ = 0.99
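The same decomposition can be checked with a short plain-Python sketch using the actual and predicted salaries from Table 4-4:

actual    = [5, 10, 15, 22]
predicted = [4.6, 10.2, 15.8, 21.4]
y_mean    = sum(actual) / len(actual)

tss = sum((y - y_mean) ** 2 for y in actual)                # 158.0
sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # 1.2
ssr = tss - sse                                             # 156.8

print(round(ssr / tss, 2))   # 0.99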

This value indicates that our Linear Regression model can explain 99% of the total variation in salary using age; the remaining 1% is variation that the model cannot explain. The regression line fits these data points extremely well, but such a fit can also be a case of overfitting. Overfitting occurs when a model performs with high accuracy on the training data but its performance drops on unseen/test data. The technique to address overfitting is known as regularization, and there are different types of regularization techniques. In terms of Linear Regression, one can use Ridge, Lasso, or Elasticnet regularization to handle overfitting.

Ridge Regression, also known as L2 regularization, focuses on shrinking the coefficient values of the input features close to zero, whereas Lasso regression (L1) can set some of the coefficients exactly to zero in order to improve the generalization of the model. Elasticnet is a combination of both techniques.
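In Spark’s MLlib, these options are exposed through the regParam and elasticNetParam arguments of the LinearRegression estimator. The snippet below is only a sketch of how one might configure them; it assumes the labelCol ('output') and the train_df dataframe that are created later in the Code section of this chapter:

from pyspark.ml.regression import LinearRegression

# regParam sets the regularization strength; elasticNetParam mixes L1 and L2:
# 0.0 -> Ridge (pure L2), 1.0 -> Lasso (pure L1), values in between -> Elastic Net.
ridge_lr   = LinearRegression(labelCol='output', regParam=0.1, elasticNetParam=0.0)
lasso_lr   = LinearRegression(labelCol='output', regParam=0.1, elasticNetParam=1.0)
elastic_lr = LinearRegression(labelCol='output', regParam=0.1, elasticNetParam=0.5)

# Each can then be fit like the plain model built later in the chapter, e.g.:
# ridge_model = ridge_lr.fit(train_df)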

At the end of the day, regression is still a parametric approach and makes a few underlying assumptions about the input data. If the input data does not conform to those assumptions, the Linear Regression model does not perform well. Hence it is important to quickly go over these assumptions before using the Linear Regression model.

Assumptions:
  • There must be a linear relationship between the input variable and output variable.

  • The independent variables (input features) should not be correlated to each other (also known as multicollinearity).

  • There must be no correlation between the residuals/error values.

  • The residuals/error values must have constant variance across the range of predicted values (homoscedasticity).

  • The residuals/error values must be normally distributed.

Code

This section of the chapter focuses on building a Linear Regression Model from scratch using PySpark and Jupyter Notebook.

Although we used a simple example with only one input variable to understand Linear Regression, this is seldom the case. Most of the time, the dataset contains multiple variables, and hence building a multivariable regression model makes more sense. The Linear Regression equation then looks like this:
$$ y={B}_{\mathbf{0}}+{B}_{\mathbf{1}}\ast {X}_{\mathbf{1}}+{B}_{\mathbf{2}}\ast {X}_{\mathbf{2}}+{B}_{\mathbf{3}}\ast {X}_{\mathbf{3}}+\dots $$

Note

The complete dataset along with the code is available for reference on the GitHub repo of this book and executes best on Spark 2.3 and higher versions.

Let’s build a Linear Regression model using Spark’s MLlib library and predict the target variable using the input features.

Data Info

The dataset that we are going to use for this example is a dummy dataset and contains a total of 1,232 rows and 6 columns. We will use 5 input variables to predict the target variable with the Linear Regression model.

Step 1: Create the SparkSession Object

We start the Jupyter Notebook and import SparkSession and create a new SparkSession object to use Spark:
[In]: from pyspark.sql import SparkSession
[In]: spark=SparkSession.builder.appName('lin_reg').getOrCreate()

Step 2: Read the Dataset

We then load and read the dataset into a Spark dataframe. We have to make sure we have started PySpark from the same directory where the dataset is available; otherwise, we have to provide the path to the data folder:
[In]: df=spark.read.csv('Linear_regression_dataset.csv',inferSchema=True,header=True)

Step 3: Exploratory Data Analysis

In this section, we drill deeper into the dataset by viewing it, validating its shape, checking various statistical measures, and examining correlations among the input and output variables. We start by checking the shape of the dataset.
[In]:print((df.count(), len(df.columns)))
[Out]: (1232, 6)
The above output confirms the size of our dataset, and we can then validate the datatypes of the columns to check whether we need to change/cast any of them. In this example, all columns contain integer or double values.
[In]: df.printSchema()
[Out]:
There are a total of six columns, of which five are input columns (var_1 to var_5) and one is the target column (output). We can now use the describe function to go over the statistical measures of the dataset.
[In]: df.describe().show(3,False)
[Out]:
This allows us to get a sense of distribution, measure of center, and spread for our dataset columns. We then take a sneak peek into the dataset using the head function and pass the number of rows that we want to view.
[In]: df.head(3)
[Out]:
We can check the correlation between input variables and output variables using the corr function:
[In]: from pyspark.sql.functions import corr
[In]: df.select(corr('var_1','output')).show()
[Out] :

var_1 seems to be most strongly correlated with the output column.
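To repeat the same check for every input column, a simple loop over the column names can be used; this is just a convenience sketch using the df dataframe and the corr function imported above:

for column in ['var_1', 'var_2', 'var_3', 'var_4', 'var_5']:
    df.select(corr(column, 'output')).show()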

Step 4: Feature Engineering

This is the part where we combine all input features into a single vector using Spark’s VectorAssembler. It creates a single feature column that captures all the input values for each row. So, instead of five input columns, it essentially merges them into a single feature vector column.
[In]: from pyspark.ml.linalg import Vector
[In]: from pyspark.ml.feature import VectorAssembler
One can select which columns to use as input features and pass only those columns through the VectorAssembler. In our case, we pass all five input columns to create a single feature vector column.
[In]: df.columns
[Out]: ['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'output']
[In]: vec_assembler=VectorAssembler(inputCols=['var_1', 'var_2', 'var_3', 'var_4', 'var_5'],outputCol='features')
[In]: features_df=vec_assembler.transform(df)
[In]: features_df.printSchema()
[Out]:
As we can see, we have an additional column ('features') that contains a single dense vector of all the inputs.
[In]: features_df.select('features').show(5,False)
[Out]:
We take the subset of the dataframe and select only the features column and the output column to build the Linear Regression model.
[In]: model_df=features_df.select('features','output')
[In]: model_df.show(5,False)
[Out]:
[In]: print((model_df.count(), len(model_df.columns)))
[Out]: (1232, 2)
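As a side note, the assembling and model-fitting steps can also be chained together using Spark’s Pipeline API. The following is only an alternative sketch that reuses the vec_assembler defined above; we continue with the explicit step-by-step approach below:

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features', labelCol='output')
pipeline = Pipeline(stages=[vec_assembler, lr])
pipeline_model = pipeline.fit(df)   # assembles the features and fits the model in one step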

Step 5: Splitting the Dataset

We have to split the dataset into training and test datasets in order to train and evaluate the performance of the Linear Regression model. We split it in a 70/30 ratio and train our model on 70% of the dataset. We can print the shape of the train and test data to validate the sizes.
[In]: train_df,test_df=model_df.randomSplit([0.7,0.3])
[In]: print((train_df.count(), len(train_df.columns)))
[Out]: (882, 2)
[In]: print((test_df.count(), len(test_df.columns)))
[Out]: (350, 2)
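Note that randomSplit is random by nature, so the exact row counts can vary slightly between runs. If a reproducible split is needed, an optional seed can be passed (the seed value below is arbitrary):

train_df, test_df = model_df.randomSplit([0.7, 0.3], seed=42)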

Step 6: Build and Train Linear Regression Model

In this part, we build and train the Linear Regression model using the features column and the output column. We can fetch the coefficients (B1, B2, B3, B4, B5) and intercept (B0) values of the model as well. We can also evaluate the performance of the model on the training data using r2. This model achieves an r2 of roughly 0.87 on the training data, which indicates a strong fit.
[In]: from pyspark.ml.regression import LinearRegression
[In]: lin_Reg=LinearRegression(labelCol='output')
[In]: lr_model=lin_Reg.fit(train_df)
[In]: print(lr_model.coefficients)
[Out]: [0.000345569740987,6.07805293067e-05,0.000269273376209,-0.713663600176,0.432967466411]
[In]: print(lr_model.intercept)
[Out]: 0.20596014754214345
[In]: training_predictions=lr_model.evaluate(train_df)
[In]: print(training_predictions.r2)
[Out]: 0.8656062610679494

Step 7: Evaluate Linear Regression Model on Test Data

The final part of this exercise is to check the performance of the model on unseen (test) data. We use the evaluate function to make predictions on the test data and use r2 to check the model’s performance on it. The performance is very similar to that on the training data.
[In]: test_results=lr_model.evaluate(test_df)
[In]: print(test_results.r2)
[Out]: 0.8716898064262081
[In]: print(test_results.meanSquaredError)
[Out]: 0.00014705472365990883
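As an alternative to the evaluate method, the same kind of metrics can be computed with RegressionEvaluator on the model’s predictions; this is only a sketch assuming the lr_model and test_df objects created above:

from pyspark.ml.evaluation import RegressionEvaluator

predictions = lr_model.transform(test_df)   # adds a 'prediction' column

r2_eval   = RegressionEvaluator(labelCol='output', predictionCol='prediction', metricName='r2')
rmse_eval = RegressionEvaluator(labelCol='output', predictionCol='prediction', metricName='rmse')

print(r2_eval.evaluate(predictions))
print(rmse_eval.evaluate(predictions))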

Conclusion

In this chapter, we went over the process of building a Linear Regression model using PySpark and explained how the optimal coefficient and intercept values are found.