Extended Example: Regression Analysis of Exam Grades

For our next example, we’ll walk through a brief statistical regression analysis. There isn’t much actual programming in this example, but it illustrates how some of the data types we just discussed are used, including R’s S3 objects. Also, it will serve as the basis for several of our programming examples in subsequent chapters.

I have a file, ExamsQuiz.txt, containing grades from a class I taught. Here are its first few lines:

2    3.3   4
3.3  2     3.7
4    4.3   4
2.3  0     3.3
...

The numbers correspond to letter grades on a four-point scale; 3.3 is a B+, for instance. Each line contains the data for one student, consisting of the midterm examination grade, final examination grade, and average quiz grade. It might be interesting to see how well the midterm and quiz grades predict the student’s grade on the final examination.

Let’s first read in the data file:

> examsquiz <- read.table("ExamsQuiz.txt",header=FALSE)

Our file does not include a header line naming the variables in each student record, so we specified header=FALSE in the function call. This is an example of a default argument, which we talked about earlier. Actually, the default value of the header argument is FALSE already (which you can check by consulting R’s online help for read.table()), so we didn’t need to specify this setting, but it’s clearer if we do.

Our data is now in examsquiz, which is an R object of class data.frame.

> class(examsquiz)
[1] "data.frame"

Just to check that the file was read in correctly, let’s take a look at the first few rows:

> head(examsquiz)
   V1  V2  V3
1 2.0 3.3 4.0
2 3.3 2.0 3.7
3 4.0 4.3 4.0
4 2.3 0.0 3.3
5 2.3 1.0 3.3
6 3.3 3.7 4.0

Lacking a header for the data, R named the columns V1, V2, and V3. Row numbers appear on the left. As you might be thinking, it would be better to have a header in our data file, with meaningful names like Exam1. In later examples, we will usually specify names.

Let’s try to predict the exam 2 score (given in the second column of examsquiz) from exam 1 (first column):

lma <- lm(examsquiz[,2] ˜ examsquiz[,1])

The lm() (for linear model) function call here instructs R to fit this prediction equation:

predicted Exam 2 = β₀ + β₁ Exam 1

Here, β₀ and β₁ are constants to be estimated from our data. In other words, we are fitting a straight line to the (exam 1, exam 2) pairs in our data. This is done through a classic least-squares method. (Don’t worry if you don’t have background in this.)

Note that the exam 1 scores, which are stored in the first column of our data frame, are collectively referred to as examsquiz[,1]. Omission of the first subscript (the row number) means that we are referring to an entire column of the frame. The exam 2 scores are similarly referenced. So, our call to lm() above predicts the second column of examsquiz from the first.

We also could have written

lma <- lm(examsquiz$V2 ˜ examsquiz$V1)

recalling that a data frame is just a list whose elements are vectors. Here, the columns are the V1, V2, and V3 components of the list.

The results returned by lm() are now in an object that we’ve stored in the variable lma. It is an instance of the class lm. We can list its components by calling attributes():

> attributes(lma)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"

As usual, a more detailed accounting can be obtained via the call str(lma). The estimated values of β_i are stored in lma$coefficients. You can display them by typing the name at the prompt.

You can also save some typing by abbreviating component names, as long as you don’t shorten a component’s name to the point of being ambiguous. For example, if a list consists of the components xyz, xywa, and xbcde, then the second and third components can be abbreviated to xyw and xb, respectively. So here we could type the following:

> lma$coef
   (Intercept) examsquiz[, 1]
     1.1205209      0.5899803

Since lma$coefficients is a vector, printing it is simple. But consider what happens when you print the object lma itself:

> lma

Call:
lm(formula = examsquiz[, 2] ˜ examsquiz[, 1])

Coefficients:
   (Intercept)  examsquiz[, 1]
         1.121           0.590

Why did R print only these items and not the other components of lma? The answer is that here R is using the print() function, which is another example of generic functions. As a generic function, print() actually hands off the work to another function whose job is to print objects of class lm—the print.lm() function—and this is what that function displays.

We can get a more detailed printout of the contents of lma by calling summary(), the generic function discussed earlier. It triggers a call to summary.lm() behind the scenes, and we get a regression-specific summary:

> summary(lma)

Call:
lm(formula = examsquiz[, 2] ˜ examsquiz[, 1])

Residuals:
    Min      1Q  Median      3Q     Max
−3.4804 −0.1239  0.3426  0.7261  1.2225

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.1205     0.6375   1.758  0.08709 .
examsquiz[, 1]   0.5900     0.2030   2.907  0.00614 **
...

A number of other generic functions are defined for this class. See the online help for lm() for details. (Using R’s online documentation is discussed in Section 1.7.)

To estimate a prediction equation for exam 2 from both the exam 1 and the quiz scores, we would use the + notation:

> lmb <- lm(examsquiz[,2] ˜ examsquiz[,1] + examsquiz[,3])

Note that the + doesn’t mean that we compute the sum of the two quantities. It is merely a delimiter in our list of predictor variables.