Including a regression line and residuals

In the Creating scatterplots and line plots section in Chapter 1, Base Graphics in R – One Step at a Time, we saw how to use the abline() command and the lm() command to include a regression line in your graph. Now, we will take this idea a little further. The following regression uses a datafile in which a sample of 10 people rated a film by awarding scores out of 100. These people then viewed the film a second time 1 month later and again awarded scores. We wish to use a regression model to see how well the first rating scores predicted the second rating.

In an OLS regression with one predictor, we fit a model of the following form:

Yi  =  β0  +  β1 Xi  + ei

In this form, β0 is the intercept, β1 is the slope, and e i are the errors (or residuals).

Let's perform the regression on the data and plot the results. Along the way, we will learn some useful R syntax. Go to the code file of this chapter, and copy and paste the following syntax into R. It contains the filmrating dataset:

filmrating <- structure(list(View1 = c(68L, 47L, 63L, 38L, 60L, 89L, 42L, 77L, 32L, 67L), View2 = c(85L, 44L, 69L, 38L, 83L, 93L, 35L, 79L, 91L, 32L)), .Names = c("View1", "View2"), class = "data.frame", row.names = c(NA, -10L))

Alternatively, you can copy the Filmratings CSV file to a folder, match R's working directory to that folder, and use the read.csv() function. The argument h=T in the following line of code ensures that the column headings are read correctly to the filmrating object:

          filmrating <- read.csv(Filmratings.csv, h=T)

Now, we attach an object using the attach() command. Attaching an object is a good idea because R can now identify each variable by name.

attach(filmrating)

filmrating

The output is as follows:

filmrating
   View1 View2
1     68    85
2     47    44
3     63    69
4     38    38
5     60    83
6     89    93
7     42    35
8     77    79
9     32    91
10    67    32

Before we perform the regression, we will plot the two sets of scores. We use the plot() command in conjunction with the main argument to create headings. We also use the cex argument to control the size of data points and axis labels. Enter the following syntax, which consists of the plot() command and various arguments with which you are now familiar:

plot(View1, View2,pch=16,xlab="First Viewing",ylab="Second Viewing", main="FILM RATINGS", cex = 1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100))

Let's look at the resulting graph.

Including a regression line and residuals

We see that the relationship between the two sets of ratings is not very linear, and a nonlinear model may fit better. However, let's use the lm() command to perform the regression on our data. Since R is object-oriented, we can store the results of the regression as an object, as follows. Let's call the object model.

model <- lm(View2 ~ View1)

model

The output you get is as follows:

Call:
lm(formula = View2 ~ View1)

Coefficients:
(Intercept)        View1  
    32.3066       0.5591      

We see that the intercept is approximately 32.31 and the slope (the coefficient of the View1 variable) is approximately 0.56.

Now, we plot the regression line using the abline() command as follows:

plot(View1, View2,pch=16,xlab="First Viewing",ylab="Second Viewing", main="FILM RATINGS", cex = 1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100))

abline(lm(View2 ~ View1))

Here is your graph:

Including a regression line and residuals

In fact, you could also have used the following syntax:

abline(32.31, 0.56)

This syntax is used to draw the regression line, but the approach involving the lm() command that we used is more concise.

As an exercise, let's draw the residuals (the lines connecting the fitted data with the observed data). We use the predict() command to set up the predicted values of the regression model as a new object called regmodel. Enter the following syntax, which consists of the predict() command and the lm() command together:

regmodel <- predict(lm(View2 ~ View1))
regmodel

The output is as follows:

Including a regression line and residuals

Remember that the predict() command gives us the fitted points from the regression.

Now, we can use the output from the predict() command to draw the residuals in order to highlight the differences between the observed data and fitted points. We can use a for loop to do this job (many online sources describe for loops in R very clearly; for example, http://paleocave.sciencesortof.com/2013/03/writing-a-for-loop-in-r/). Let's see how it is done.

for(k in 1:10){ lines(c(View1[k], View1[k]), c(View2[k], regmodel[k])) }

Here is the resulting graph with regression line and residuals:

Including a regression line and residuals

Note the syntax involved in creating the for loop. This syntax involves a running index from 1 to the total number of points and the lines() command, which connects the observed and fitted data. The observed data consists of points defined by (View1, View2), while the fitted model values consist of points defined by (View1, regmodel).

Of course, entering the previous code to draw the residuals is cumbersome, so here is a function that I wrote to do this job efficiently. It is called drawresid(). First, you must enter the entire function on the R command line:

drawresid <- function(X, Y, col){
  abline(lm(Y ~ X), col = col)
  regressionmodel <- predict(lm(Y ~ X))
  for(k in 1:length(X)){ lines(c(X[k], X[k]), c(Y[k],
    regressionmodel[k]), col = col) } }

This function is given in the code file for this chapter. Let's use it on our film scores dataset. Enter the word drawresid on the command line, and then, for the three arguments, give the independent variable, the dependent variable, and finally the color of the residual lines and the regression line. To illustrate the use of this function, we will draw the regression line and the residuals in red. Enter all of the following syntax:

plot(View1, View2,pch=16,xlab="First Viewing",ylab="Second Viewing", main="FILM RATINGS", cex = 1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100))

drawresid(View1, View2, "red")

The graph with regression line and residuals in red is as follows:

Including a regression line and residuals

I hope that you find this function useful.