In the Creating scatterplots and line plots section in Chapter 1, Base Graphics in R – One Step at a Time, we saw how to use the abline()
command and the lm()
command to include a regression line in your graph. Now, we will take this idea a little further. The following regression uses a datafile in which a sample of 10 people rated a film by awarding scores out of 100. These people then viewed the film a second time 1 month later and again awarded scores. We wish to use a regression model to see how well the first rating scores predicted the second rating.
In an OLS regression with one predictor, we fit a model of the following form:
Yi = β0 + β1 Xi + ei
In this form, β0
is the intercept, β1
is the slope, and e
i
are the errors (or residuals).
Let's perform the regression on the data and plot the results. Along the way, we will learn some useful R syntax. Go to the code file of this chapter, and copy and paste the following syntax into R. It contains the filmrating dataset:
filmrating <- structure(list(View1 = c(68L, 47L, 63L, 38L, 60L, 89L, 42L, 77L, 32L, 67L), View2 = c(85L, 44L, 69L, 38L, 83L, 93L, 35L, 79L, 91L, 32L)), .Names = c("View1", "View2"), class = "data.frame", row.names = c(NA, -10L))
Alternatively, you can copy the Filmratings
CSV file to a folder, match R's working directory to that folder, and use the read.csv()
function. The argument h=T
in the following line of code ensures that the column headings are read correctly to the filmrating object:
filmrating <- read.csv(Filmratings.csv, h=T)
Now, we attach an object using the
attach()
command. Attaching an object is a good idea because R can now identify each variable by name.
attach(filmrating) filmrating
The output is as follows:
filmrating View1 View2 1 68 85 2 47 44 3 63 69 4 38 38 5 60 83 6 89 93 7 42 35 8 77 79 9 32 91 10 67 32
Before we perform the regression, we will plot the two sets of scores. We use the
plot()
command in conjunction with the main
argument to create headings. We also use the cex
argument to control the size of data points and axis labels. Enter the following syntax, which consists of the plot()
command and various arguments with which you are now familiar:
plot(View1, View2,pch=16,xlab="First Viewing",ylab="Second Viewing", main="FILM RATINGS", cex = 1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100))
Let's look at the resulting graph.
We see that the relationship between the two sets of ratings is not very linear, and a nonlinear model may fit better. However, let's use the lm()
command to perform the regression on our data. Since R is object-oriented, we can store the results of the regression as an object, as follows. Let's call the object model
.
model <- lm(View2 ~ View1) model
The output you get is as follows:
Call: lm(formula = View2 ~ View1) Coefficients: (Intercept) View1 32.3066 0.5591
We see that the intercept is approximately 32.31 and the slope (the coefficient of the View1
variable) is approximately 0.56.
Now, we plot the regression line using the abline()
command as follows:
plot(View1, View2,pch=16,xlab="First Viewing",ylab="Second Viewing", main="FILM RATINGS", cex = 1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100)) abline(lm(View2 ~ View1))
Here is your graph:
In fact, you could also have used the following syntax:
abline(32.31, 0.56)
This syntax is used to draw the regression line, but the approach involving the lm()
command that we used is more concise.
As an exercise, let's draw the residuals (the lines connecting the fitted data with the observed data). We use the predict()
command to set up the predicted values of the regression model as a new object called regmodel
. Enter the following syntax, which consists of the predict()
command and the lm()
command together:
regmodel <- predict(lm(View2 ~ View1)) regmodel
The output is as follows:
Remember that the predict()
command gives us the fitted points from the regression.
Now, we can use the output from the predict()
command to draw the residuals in order to highlight the differences between the observed data and fitted points. We can use a for
loop to do this job (many online sources describe for
loops in R very clearly; for example, http://paleocave.sciencesortof.com/2013/03/writing-a-for-loop-in-r/). Let's see how it is done.
for(k in 1:10){ lines(c(View1[k], View1[k]), c(View2[k], regmodel[k])) }
Here is the resulting graph with regression line and residuals:
Note the syntax involved in creating the for
loop. This syntax involves a running index from 1 to the total number of points and the lines()
command, which connects the observed and fitted data. The observed data consists of points defined by (View1
, View2
), while the fitted model values consist of points defined by (View1
, regmodel
).
Of course, entering the previous code to draw the residuals is cumbersome, so here is a function that I wrote to do this job efficiently. It is called
drawresid()
. First, you must enter the entire function on the R command line:
drawresid <- function(X, Y, col){ abline(lm(Y ~ X), col = col) regressionmodel <- predict(lm(Y ~ X)) for(k in 1:length(X)){ lines(c(X[k], X[k]), c(Y[k], regressionmodel[k]), col = col) } }
This function is given in the code file for this chapter. Let's use it on our film scores dataset. Enter the word drawresid
on the command line, and then, for the three arguments, give the independent variable, the dependent variable, and finally the color of the residual lines and the regression line. To illustrate the use of this function, we will draw the regression line and the residuals in red. Enter all of the following syntax:
plot(View1, View2,pch=16,xlab="First Viewing",ylab="Second Viewing", main="FILM RATINGS", cex = 1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100)) drawresid(View1, View2, "red")
The graph with regression line and residuals in red is as follows:
I hope that you find this function useful.