As discussed in the previous chapter, Machine Learning is a vast field with many algorithms falling under various categories, and Linear Regression is one of the most fundamental among them. This chapter focuses on building a Linear Regression model with PySpark and dives deep into the workings of an LR model. It covers the assumptions to be considered before using LR along with different evaluation metrics. But before jumping into Linear Regression, we must understand the types of variables.
Variables

Types of Variables
Variables are mainly of two types, Numerical and Categorical. We can break these down further into sub-categories, but we will stick to these two types throughout this book.
Numerical variables are quantitative in nature and are represented by numbers (integers or floats). For example, salary records, exam scores, the age or height of a person, and stock prices all fall under the category of numerical variables.
Categorical variables, on the other hand, are qualitative in nature and mainly represent categories of the data being measured. For example, colors, outcomes (Yes/No), and ratings (Good/Average/Poor) are categorical variables.
For building any sort of machine learning model, we need input and output variables. Input variables are the values used to build and train the model to predict the output or target variable. Let's take a simple example. Suppose we want to predict a person's salary given the person's age. In this case, salary is our output/target/dependent variable, as it depends on age, which is known as the input or independent variable. The output variable can be categorical or numerical in nature, and the machine learning model is chosen depending on its type.
The relationship between any two variables can be of two types: linear or nonlinear.
The notion of a linear relationship between two variables suggests that the two are proportional to each other in some way. The correlation between any two variables gives us an indication of how strong or weak the linear relationship between them is. The correlation coefficient can range from -1 to +1. A negative correlation means that as one variable increases, the other decreases. For example, the power and mileage of a vehicle can be negatively correlated: as power increases, mileage comes down. On the other hand, salary and years of work experience are an example of positively correlated variables. Nonlinear relationships are comparatively complex in nature and hence require additional information to predict the target variable. For example, in a self-driving car, the relationship between input variables such as terrain, the signal system, and pedestrians and the speed of the car is nonlinear.
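As a quick, hedged illustration of the correlation coefficient, the short Python sketch below computes it for two made-up pairs of variables that mimic the power/mileage and experience/salary examples (the numbers are purely illustrative):

```python
def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Made-up numbers for illustration only
power = [100, 150, 200, 250]        # horsepower
mileage = [20, 17, 14, 10]          # miles per gallon
experience = [1, 3, 5, 7]           # years of work experience
salary = [4, 7, 11, 14]             # '0000 $

print(pearson_corr(power, mileage))      # close to -1 (negative correlation)
print(pearson_corr(experience, salary))  # close to +1 (positive correlation)
```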
Note
The next section covers the theory behind Linear Regression and might be redundant for many readers. Please feel free to skip it if that is the case.
Theory
Now that we understand the basics of variables and the relationships between them, let’s build on the example of age and salary to understand Linear Regression in depth.
Example Dataset
Sr. No | Age | Salary (‘0000 $) |
---|---|---|
1 | 20 | 5 |
2 | 30 | 10 |
3 | 40 | 15 |
4 | 50 | 22 |

Scatter plot of Salary
Now, if we were to predict the salary of a fifth (new) person based only on the salaries of these four people, the best possible prediction would be the average/mean of the existing salary values. That would be the best prediction given this information. It is like building a Machine Learning model without any input data (since we are using the output data as the input data).
Let’s go ahead and calculate the average salary for these given salary values.
Avg. Salary = (5 + 10 + 15 + 22) / 4 = 52 / 4 = 13

Best Fit Line plot

Residuals Plot
Sum of squared errors = (-8)² + (-3)² + (2)² + (9)² = 64 + 9 + 4 + 81 = 158
So, adding up the squared residuals gives us a total value of 158, which is known as the sum of squared errors (SSE).
Note
We have not used any input variable so far to calculate the SSE.
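For reference, the mean-based prediction and its SSE can be reproduced with a few lines of plain Python using the salary values from the table above:

```python
# Salaries from the example dataset (in '0000 $)
salaries = [5, 10, 15, 22]

# Best guess without any input variable: the mean salary
mean_salary = sum(salaries) / len(salaries)          # 13.0

# Residual of each actual value from the mean, squared and summed
sse = sum((s - mean_salary) ** 2 for s in salaries)  # 64 + 9 + 4 + 81 = 158.0

print(mean_salary, sse)
```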

Correlation plot between Salary and Age
As we can observe, there seems to be a clear positive correlation between Age and Salary, and that is a good thing for us: it indicates that the model should be able to predict the target value (Salary) with a good amount of accuracy, due to the strong linear relationship between the input (Age) and the output (Salary). As mentioned earlier, the overall aim of Linear Regression is to come up with a straight line that fits the data points in such a way that the squared difference between the actual target values and the predicted values is minimized. Since it is a straight line, we know from linear algebra that its equation is
y = m * x + c
Straight Line plot
where,
m = slope of the line (Δy/Δx, the change in y per unit change in x)
x = value on the x-axis
y = value on the y-axis
c = intercept (the value of y at x = 0)

Applying the same form to our example, the equation becomes
y = B0 + B1 * x
(since we are using only one input variable, i.e., Age)
where:
y = Salary (prediction)
B0 = intercept (value of Salary when Age is 0)
B1 = slope or coefficient of Age
x = Age

Possible Straight lines through data
The first criterion for finding the best fit line is that it should pass through the centroid of the data points, as shown in Figure 4-8. In our case, the centroid values are
mean (Age) = (20 + 30 + 40 + 50) / 4 = 35
mean (Salary) = (5 + 10 + 15 + 22) / 4 = 13

Centroids of Data

Now the objective of Linear Regression is to come up with the most optimal values of the intercept (B0) and coefficient (B1) so that the residuals/errors are minimized.
We can easily find the values of B0 and B1 for our dataset using the formulas below:
B1 = Σ(xi − xmean)(yi − ymean) / Σ(xi − xmean)²
B0 = ymean − B1 * xmean
Calculation of Slope and Intercept
Age | Salary | Age deviation (diff. from mean) | Salary deviation (diff. from mean) | Covariance (product of deviations) | Age deviation (squared) |
---|---|---|---|---|---|
20 | 5 | -15 | -8 | 120 | 225 |
30 | 10 | -5 | -3 | 15 | 25 |
40 | 15 | 5 | 2 | 10 | 25 |
50 | 22 | 15 | 9 | 135 | 225 |
Mean (Age) = 35
Mean (Salary) = 13
The covariance between two variables (Age and Salary) is based on the product of each variable's deviation from its mean, which is exactly what the covariance (product of deviations) column above captures for each row. Now that we have the product of deviations and the squared Age deviations, we can go ahead and calculate the slope and intercept of the Linear Regression line:
B1 = (120 + 15 + 10 + 135) / (225 + 25 + 25 + 225)
= 280 / 500
= 0.56
B0 = 13 – (0.56 * 35)
= -6.6

Salary = -6.6 + (0.56 * Age)
We can now predict any of the salary values using this equation at any given age. For example, the model would have predicted the salary of the first person to be something like this:
Salary (1st person) = -6.6 + (0.56*20)
= 4.6 ($ ‘0000)
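The same calculation can be reproduced with a short plain-Python sketch using the toy Age and Salary values from the table above:

```python
ages = [20, 30, 40, 50]
salaries = [5, 10, 15, 22]                   # in '0000 $

mean_age = sum(ages) / len(ages)             # 35.0
mean_salary = sum(salaries) / len(salaries)  # 13.0

# B1 = sum of (x - x_mean)(y - y_mean) / sum of (x - x_mean)^2
num = sum((a - mean_age) * (s - mean_salary) for a, s in zip(ages, salaries))  # 280
den = sum((a - mean_age) ** 2 for a in ages)                                   # 500
b1 = num / den                       # 0.56
b0 = mean_salary - b1 * mean_age     # -6.6

print(b1, b0)                        # 0.56 -6.6
print(b0 + b1 * 20)                  # predicted salary of the first person: 4.6
```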
Interpretation
The slope (B1 = 0.56) means that for every increase of 1 year in Age, Salary increases by 0.56 (in '0000 $), i.e., by $5,600.
The intercept does not always have a meaningful interpretation. In this example, the value of -6.6 suggests that a person of Age 0 (not yet born) would earn a salary of negative $66,000, which obviously does not make sense.

Regression Line
Difference Between Predictions and Actual Values
Age | Salary | Predicted Salary | Difference/Error (Predicted − Actual) |
---|---|---|---|
20 | 5 | 4.6 | -0.4 |
30 | 10 | 10.2 | 0.2 |
40 | 15 | 15.8 | 0.8 |
50 | 22 | 21.4 | -0.6 |
In a nutshell, Linear Regression comes up with the most optimal values for the intercept (B0) and coefficient(s) (B1, B2, ...) so that the difference (error) between the predicted values and the actual target values is minimized.
But the question remains: Is it a good fit?
Evaluation
Reduction in SSE After Using Linear Regression
Age | Salary | Predicted Salary | Difference/Error | Squared Error (regression) | Squared Error from mean (old SSE) |
---|---|---|---|---|---|
20 | 5 | 4.6 | -0.4 | 0.16 | 64 |
30 | 10 | 10.2 | 0.2 | 0.04 | 9 |
40 | 15 | 15.8 | 0.8 | 0.64 | 4 |
50 | 22 | 21.4 | -0.6 | 0.36 | 81 |
As we can observe, the total sum of squared differences has reduced significantly from 158 to only 1.2, thanks to Linear Regression: much of the variance in the target variable (Salary) can now be explained by the regression (due to the use of the input variable, Age). So, OLS works toward reducing the overall sum of squared errors. The total sum of squares is a combination of two parts:
TSS (Total Sum of Squares) = SSE (Sum of Squared Errors, the residual part) + SSR (Sum of Squares due to Regression, the part explained by the regression)
The total sum of squares is the sum of the squared difference between the actual and the mean values and is always fixed. This was equal to 158 in our example.
The SSE is the sum of the squared differences between the actual and predicted values of the target variable, which reduced to 1.2 after using Linear Regression.
SSR is the sum of squares explained by the regression and can be calculated as (TSS − SSE).
SSR = 158 – 1.2 =156.8
R² (coefficient of determination) = SSR / TSS = 156.8 / 158
= 0.99
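The same numbers can be verified with a short Python sketch that continues the toy example (the fitted values B0 = -6.6 and B1 = 0.56 are taken from the calculation above):

```python
ages = [20, 30, 40, 50]
salaries = [5, 10, 15, 22]
b0, b1 = -6.6, 0.56

mean_salary = sum(salaries) / len(salaries)
predictions = [b0 + b1 * a for a in ages]

tss = sum((s - mean_salary) ** 2 for s in salaries)             # 158
sse = sum((s - p) ** 2 for s, p in zip(salaries, predictions))  # ~1.2
r_square = (tss - sse) / tss                                    # SSR / TSS ~ 0.99

print(round(r_square, 2))
```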
This value indicates that our Linear Regression model is able to explain 99% of the variation in Salary given Age; the remaining 1% is variation the model cannot explain. Our Linear Regression line fits the data really well, but this can also be a case of overfitting. Overfitting occurs when a model predicts with high accuracy on the training data but its performance drops on unseen/test data. The technique to address overfitting is known as regularization, and there are different types of regularization techniques. For Linear Regression, one can use Ridge, Lasso, or Elasticnet regularization to handle overfitting.
Ridge Regression, also known as L2 regularization, shrinks the coefficient values of the input features toward zero (without making them exactly zero), whereas Lasso Regression (L1) can drive some of the coefficients all the way to zero in order to improve the generalization of the model. Elasticnet is a combination of both techniques.
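In PySpark's MLlib, these options are exposed through the regParam and elasticNetParam arguments of the LinearRegression estimator used later in this chapter. The snippet below is only a sketch; it assumes a training DataFrame train_df with 'features' and 'label' columns.

```python
from pyspark.ml.regression import LinearRegression

# Ridge (L2): elasticNetParam = 0.0
ridge = LinearRegression(featuresCol='features', labelCol='label',
                         regParam=0.1, elasticNetParam=0.0)

# Lasso (L1): elasticNetParam = 1.0
lasso = LinearRegression(featuresCol='features', labelCol='label',
                         regParam=0.1, elasticNetParam=1.0)

# Elastic Net: a mix of L1 and L2
elastic = LinearRegression(featuresCol='features', labelCol='label',
                           regParam=0.1, elasticNetParam=0.5)

# Each estimator is then fit the same way, for example:
# ridge_model = ridge.fit(train_df)
```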
At the end of the day, Linear Regression is still a parametric approach and makes a few assumptions about the underlying distribution of the input data points. If the input data does not conform to those assumptions, the Linear Regression model does not perform well. Hence it is important to quickly go over these assumptions before using the Linear Regression model:
There must be a linear relationship between the input variable and output variable.
The independent variables (input features) should not be correlated with each other (an issue known as multicollinearity); a quick way to check this is sketched after this list.
There must be no correlation between the residuals/error values.
The residuals/error values must have constant variance across the range of predicted values (homoscedasticity).
The residuals/error values must be normally distributed.
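As a rough illustration of the multicollinearity check referenced above, one can inspect the pairwise correlations between the input features. The snippet below is only a sketch; it assumes a DataFrame df whose input columns are named var_1 through var_5 (the names used later in this chapter).

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Assumed input column names; replace with the actual feature columns
input_cols = ['var_1', 'var_2', 'var_3', 'var_4', 'var_5']

# Combine the inputs into a single vector column for the correlation test
assembler = VectorAssembler(inputCols=input_cols, outputCol='corr_features')
vec_df = assembler.transform(df).select('corr_features')

# Pearson correlation matrix of the input features
corr_matrix = Correlation.corr(vec_df, 'corr_features').head()[0]
print(corr_matrix.toArray())  # off-diagonal values close to +/-1 hint at multicollinearity
```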
Code
This section of the chapter focuses on building a Linear Regression Model from scratch using PySpark and Jupyter Notebook.

Note
The complete dataset along with the code is available for reference on the GitHub repo of this book and executes best on Spark 2.3 and higher versions.
Let’s build a Linear Regression model using Spark’s MLlib library and predict the target variable using the input features.
Data Info
The dataset that we are going to use for this example is a dummy dataset containing a total of 1,232 rows and 6 columns. We will use 5 input variables to predict the target variable with the Linear Regression model.
Step 1: Create the SparkSession Object
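A minimal version of this step might look like the following (the application name is arbitrary):

```python
from pyspark.sql import SparkSession

# Create a SparkSession, the entry point for DataFrame and MLlib APIs
spark = SparkSession.builder.appName('lin_reg').getOrCreate()
```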
Step 2: Read the Dataset
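Assuming the dataset is saved locally as a CSV file (the file name below is a placeholder), it can be read into a Spark DataFrame as follows:

```python
# Read the CSV file; inferSchema lets Spark detect numeric column types
df = spark.read.csv('Linear_regression_dataset.csv', inferSchema=True, header=True)
```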
Step 3: Exploratory Data Analysis
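A few typical EDA commands for this step are sketched below, assuming the input columns are named var_1 through var_5 and the target column is named output (as suggested by the observation that follows):

```python
from pyspark.sql.functions import corr

# Shape of the DataFrame: (number of rows, number of columns)
print((df.count(), len(df.columns)))

# Column names and data types
df.printSchema()

# Summary statistics of all columns
df.describe().show()

# Correlation of each input variable with the target column
for col_name in ['var_1', 'var_2', 'var_3', 'var_4', 'var_5']:
    df.select(corr(col_name, 'output')).show()
```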




var_1 seems to be most strongly correlated with the output column.
Step 4: Feature Engineering
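Spark's MLlib expects all input features to be combined into a single vector column, which is typically done with VectorAssembler. The column names below are the same assumed names as in the previous step:

```python
from pyspark.ml.feature import VectorAssembler

# Combine the five input columns into a single 'features' vector column
vec_assembler = VectorAssembler(
    inputCols=['var_1', 'var_2', 'var_3', 'var_4', 'var_5'],
    outputCol='features')

features_df = vec_assembler.transform(df)

# Keep only the columns needed for modeling
model_df = features_df.select('features', 'output')
```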



Step 5: Splitting the Dataset
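A common 70/30 split can be done with randomSplit; the ratio and seed below are arbitrary choices:

```python
# Randomly split the data into training and test sets
train_df, test_df = model_df.randomSplit([0.7, 0.3], seed=42)
print(train_df.count(), test_df.count())
```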
Step 6: Build and Train Linear Regression Model
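A sketch of building and training the model with the LinearRegression estimator, assuming the features and output columns created above:

```python
from pyspark.ml.regression import LinearRegression

# Define the estimator and fit it on the training data
lin_reg = LinearRegression(featuresCol='features', labelCol='output')
lr_model = lin_reg.fit(train_df)

# Learned intercept (B0) and coefficients (B1 ... B5)
print(lr_model.intercept)
print(lr_model.coefficients)

# Performance on the training data
training_summary = lr_model.summary
print(training_summary.r2)
print(training_summary.meanSquaredError)
```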
Step 7: Evaluate Linear Regression Model on Test Data
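Finally, the trained model can be evaluated on the held-out test data:

```python
# Apply the trained model to the unseen test data
test_results = lr_model.evaluate(test_df)

# R-square and error metrics on the test set
print(test_results.r2)
print(test_results.rootMeanSquaredError)
print(test_results.meanSquaredError)
```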
Conclusion
In this chapter, we went over the process of building a Linear Regression model using PySpark and explained how the most optimal coefficient and intercept values are found.