Before moving on, there are two additional linear model topics that we need to discuss. The first is the inclusion of a qualitative feature, and the second is an interaction term; both are explained in the following sections.
A qualitative feature, also referred to as a factor, can take on two or more levels, such as Male/Female or Bad/Neutral/Good. If we have a feature with two levels, say gender, then we can create what is known as an indicator or dummy feature, arbitrarily coding one level as 0 and the other as 1. If we create a model with just the indicator, our linear model still follows the same formulation as before, that is, Y = B0 + B1x + e. If we code the feature so that male equals zero and female equals one, then the expectation for male is just the intercept, B0, while for female it is B0 + B1. In the situation where you have more than two levels of the feature, you create n-1 indicators; so, for three levels you would have two indicators. If you created as many indicators as there are levels, you would fall into the dummy variable trap, which results in perfect multicollinearity.
We can examine a simple example to learn how to interpret the output. Let's load the ISLR package and build a model with the Carseats dataset by using the following code snippet:
> library(ISLR)
> data(Carseats)
> str(Carseats)
'data.frame': 400 obs. of  11 variables:
 $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
 $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
 $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
 $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
 $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
 $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
 $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
 $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
 $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
 $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
 $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
For this example, we will predict the sales of car seats using just Advertising, a quantitative feature, and the qualitative feature ShelveLoc, which is a factor with three levels: Bad, Good, and Medium. With factors, R automatically codes the indicators for the analysis. We build and analyze the model as follows:
> sales.fit = lm(Sales~Advertising+ShelveLoc, data=Carseats)
> summary(sales.fit)

Call:
lm(formula = Sales ~ Advertising + ShelveLoc, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6480 -1.6198 -0.0476  1.5308  6.4098 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.89662    0.25207  19.426  < 2e-16 ***
Advertising      0.10071    0.01692   5.951 5.88e-09 ***
ShelveLocGood    4.57686    0.33479  13.671  < 2e-16 ***
ShelveLocMedium  1.75142    0.27475   6.375 5.11e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.244 on 396 degrees of freedom
Multiple R-squared: 0.3733, Adjusted R-squared: 0.3685 
F-statistic: 78.62 on 3 and 396 DF, p-value: < 2.2e-16
If the shelving location is good, the estimate of sales is almost double that of a bad location, given the intercept of 4.89662 and the ShelveLocGood coefficient of 4.57686.
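To make that comparison concrete, we can ask the fitted model for a prediction at each shelving location while holding Advertising at zero; new.dat below is just an illustrative name for the prediction grid. With no advertising, each prediction is simply the intercept plus the corresponding indicator coefficient:
> # predictions at Advertising = 0; from the coefficients above these
> # work out to roughly 4.90 (Bad), 6.65 (Medium), and 9.47 (Good)
> new.dat = data.frame(Advertising = 0,
+     ShelveLoc = factor(c("Bad", "Medium", "Good"),
+         levels = c("Bad", "Good", "Medium")))
> predict(sales.fit, newdata = new.dat)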
To see how R codes the indicator features, you can use the contrasts() function as follows:
> contrasts(Carseats$ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1
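Note that Bad serves as the baseline only because it comes first alphabetically. If another reference level is more natural for interpretation, the relevel() function changes the baseline before refitting; here is a brief sketch, where ShelveLoc2 is a hypothetical name for the releveled copy:
> # make Good the reference level so the remaining coefficients are
> # estimated relative to a good shelving location
> Carseats$ShelveLoc2 = relevel(Carseats$ShelveLoc, ref = "Good")
> contrasts(Carseats$ShelveLoc2)
       Bad Medium
Good     0      0
Bad      1      0
Medium   0      1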
Interaction terms are similarly easy to code in R. Two features interact if the effect of one feature on the prediction depends on the value of the other feature. This follows the formulation Y = B0 + B1x1 + B2x2 + B3x1x2 + e. An example is available in the MASS package with the Boston dataset. The response is the median home value, which is medv in the output; we will use two features: the percentage of homes with a low socioeconomic status, termed lstat, and the age of the home in years, termed age, in the following output:
> library(MASS)
> data(Boston)
> str(Boston)
'data.frame': 506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Using feature1*feature2 with the lm() function puts both features, as well as their interaction term, in the model; that is, lstat*age expands to lstat + age + lstat:age, as follows:
> value.fit = lm(medv~lstat*age, data=Boston)
> summary(value.fit)

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531 
F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
Examining the output, we can see that, while the socioeconomic status is a highly predictive feature, the age of the home on its own is not. However, the interaction between the two features is statistically significant and positive in explaining the home value.
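One way to read this interaction is that the effective slope of lstat changes with age: from the coefficients above, it is roughly -1.39 + 0.0042 * age. A quick sketch using the fitted value.fit object makes the point; the ages 25 and 95 are arbitrary illustrative values:
> # effective lstat slope at a given age is the lstat coefficient plus
> # the interaction coefficient times age; this works out to roughly
> # -1.29 at age 25 and -1.00 at age 95
> b = coef(value.fit)
> b["lstat"] + b["lstat:age"] * c(25, 95)
In other words, a higher percentage of low socioeconomic status households is associated with lower home values throughout, but the penalty is somewhat smaller in older neighborhoods.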