Before moving on, there are two additional linear model topics that we need to discuss. The first is the inclusion of a qualitative feature, and the second is an interaction term; both are explained in the following sections.
A qualitative feature, also referred to as a factor, can take on two or more levels, such as Male/Female or Bad/Neutral/Good. If we have a feature with two levels, say gender, then we can create what is known as an indicator or dummy feature, arbitrarily coding one level as 0 and the other as 1. If we create a model with just the indicator, our linear model still follows the same formulation as before, that is, Y = B0 + B1x + e. If we code the feature so that male equals zero and female equals one, then the expectation for male is just the intercept, B0, while for female it is B0 + B1. In the situation where you have more than two levels of the feature, you create n-1 indicators; so, for three levels you would have two indicators. If you created as many indicators as there are levels, you would fall into the dummy variable trap, which results in perfect multicollinearity.
We can examine a simple example to learn how to interpret the output. Let's load the ISLR package and build a model with the Carseats dataset by using the following code snippet:
> library(ISLR)
> data(Carseats)
> str(Carseats)
'data.frame': 400 obs. of  11 variables:
 $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
 $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
 $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
 $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
 $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
 $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
 $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
 $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
 $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
 $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
 $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
For this example, we will predict the sales of car seats using just Advertising, a quantitative feature, and the qualitative feature ShelveLoc, which is a factor with three levels: Bad, Good, and Medium. With factors, R automatically codes the indicators for the analysis. We build and analyze the model as follows:
> sales.fit = lm(Sales~Advertising+ShelveLoc, data=Carseats)
> summary(sales.fit)

Call:
lm(formula = Sales ~ Advertising + ShelveLoc, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6480 -1.6198 -0.0476  1.5308  6.4098 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.89662    0.25207  19.426  < 2e-16 ***
Advertising      0.10071    0.01692   5.951 5.88e-09 ***
ShelveLocGood    4.57686    0.33479  13.671  < 2e-16 ***
ShelveLocMedium  1.75142    0.27475   6.375 5.11e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.244 on 396 degrees of freedom
Multiple R-squared: 0.3733, Adjusted R-squared: 0.3685 
F-statistic: 78.62 on 3 and 396 DF, p-value: < 2.2e-16
If the shelving location is good, the estimate of sales is almost double that of a bad location, given the intercept of 4.89662 and the ShelveLocGood coefficient of 4.57686.
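To make that comparison concrete, we can ask the fitted model for a prediction at each shelving location while holding Advertising at zero; new.dat below is just an illustrative name for the prediction grid. With no advertising, each prediction is simply the intercept plus the corresponding indicator coefficient:
> # predictions at Advertising = 0; from the coefficients above these
> # work out to roughly 4.90 (Bad), 6.65 (Medium), and 9.47 (Good)
> new.dat = data.frame(Advertising = 0,
+     ShelveLoc = factor(c("Bad", "Medium", "Good"),
+         levels = c("Bad", "Good", "Medium")))
> predict(sales.fit, newdata = new.dat)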
To see how R codes the indicator features, you can use the contrasts() function as follows:
> contrasts(Carseats$ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1
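Note that Bad serves as the baseline only because it comes first alphabetically. If another reference level is more natural for interpretation, the relevel() function changes the baseline before refitting; here is a brief sketch, where ShelveLoc2 is a hypothetical name for the releveled copy:
> # make Good the reference level so the remaining coefficients are
> # estimated relative to a good shelving location
> Carseats$ShelveLoc2 = relevel(Carseats$ShelveLoc, ref = "Good")
> contrasts(Carseats$ShelveLoc2)
       Bad Medium
Good     0      0
Bad      1      0
Medium   0      1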
Interaction terms are similarly easy to code in R. Two features interact if the effect of one feature on the prediction depends on the value of the other feature. This follows the formulation Y = B0 + B1x1 + B2x2 + B3x1x2 + e. An example is available in the MASS package with the Boston dataset. The response is the median home value, which is medv in the output; we will use two features: the percentage of homes with a low socioeconomic status, termed lstat, and the age of the home in years, termed age, in the following output:
> library(MASS)
> data(Boston)
> str(Boston)
'data.frame': 506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Using feature1*feature2 with the lm() function puts both features, as well as their interaction term, in the model; that is, lstat*age expands to lstat + age + lstat:age, as follows:
> value.fit = lm(medv~lstat*age, data=Boston)
> summary(value.fit)

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531 
F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
Examining the output, we can see that, while the socioeconomic status is a highly predictive feature, the age of the home on its own is not. However, the interaction between the two features is statistically significant and positive in explaining the home value.
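One way to read this interaction is that the effective slope of lstat changes with age: from the coefficients above, it is roughly -1.39 + 0.0042 * age. A quick sketch using the fitted value.fit object makes the point; the ages 25 and 95 are arbitrary illustrative values:
> # effective lstat slope at a given age is the lstat coefficient plus
> # the interaction coefficient times age; this works out to roughly
> # -1.29 at age 25 and -1.00 at age 95
> b = coef(value.fit)
> b["lstat"] + b["lstat:age"] * c(25, 95)
In other words, a higher percentage of low socioeconomic status households is associated with lower home values throughout, but the penalty is somewhat smaller in older neighborhoods.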