7

Multivariate Adaptive Regression Splines

The previous chapters discussed algorithms that are intrinsically linear. Many of these models can be adapted to nonlinear patterns in the data by manually adding nonlinear model terms (e.g., squared terms, interaction effects, and other transformations of the original features); however, to do so you the analyst must know the specific nature of the nonlinearities and interactions a priori. Alternatively, there are numerous algorithms that are inherently nonlinear. When using these models, the exact form of the nonlinearity does not need to be known explicitly or specified prior to model training. Rather, these algorithms will search for, and discover, nonlinearities and interactions in the data that help maximize predictive accuracy.

This chapter discusses multivariate adaptive regression splines (MARS) (Friedman, 1991), an algorithm that automatically creates a piecewise linear model which provides an intuitive stepping block into nonlinearity after grasping the concept of multiple linear regression. Future chapters will focus on other nonlinear algorithms.

7.1    Prerequisites

For this chapter we will use the following packages:

# Helper packages

library(dplyr) # for data wrangling

library(ggplot2) # for awesome plotting

library(rsample) # for data splitting

# Modeling packages

library(earth) # for fitting MARS models

library(caret) # for automating the tuning process

# Model interpretability packages

library(vip) # for variable importance

library(pdp) # for variable relationships

To illustrate various concepts we’ll continue with the ames_train and ames_test data sets created in Section 2.7.

7.2    The basic idea

In the previous chapters, we focused on linear models (where the analyst has to explicitly specify any nonlinear relationships and interaction effects). We illustrated some of the advantages of linear models such as their ease and speed of computation and also the intuitive nature of interpreting their coefficients. However, linear models make a strong assumption about linearity, and this assumption is often a poor one, which can affect predictive accuracy.

We can extend linear models to capture any non-linear relationship. Typically, this is done by explicitly including polynomial terms (e.g., X12) or step functions. Polynomial regression is a form of regression in which the relationship between X and Y is modeled as a dth degree polynomial in X. For example, Equation (7.1) represents a polynomial regression function where Y is modeled as a d-th degree polynomial in X. Generally speaking, it is unusual to use d greater than 3 or 4 as the larger d becomes, the easier the function fit becomes overly flexible and oddly shaped…especially near the boundaries of the range of X values. Increasing d also tends to increase the presence of multicollinearity.

yi=β0+β1xi+β2xi2+β3xi3+βdxid+i,

(7.1)

An alternative to polynomials is to use step functions. Whereas polynomial functions impose a global non-linear relationship, step functions break the range of X into bins, and fit a simple constant (e.g., the mean response) in each. This amounts to converting a continuous feature into an ordered categorical variable such that our linear regression function is converted to Equation (7.2)

yi=β0+β1C1(xi)+β2C2(xi)+β3C3(xi)+βdCd(xi)+i,

(7.2)

where C1(x) represents X values ranging from c1X < c2, C2 (X) represents X values ranging from cd−1X < cd. Figure 7.1 contrasts linear, polynomial, and step function fits for non-linear, non-monotonic simulated data.

Image
FIGURE 7.1: Blue line represents predicted (‘y‘) values as a function of ‘x‘ for alternative approaches to modeling explicit nonlinear regression patterns. (A) Traditional linear regression approach does not capture any nonlinearity unless the predictor or response is transformed (i.e. log transformation). (B) Degree-2 polynomial, (C) Degree-3 polynomial, (D) Step function fitting cutting ‘x‘ into six categorical levels.

Although useful, the typical implementation of polynomial regression and step functions require the user to explicitly identify and incorporate which variables should have what specific degree of interaction or at what points of a variable X should cut points be made for the step functions. Considering many data sets today can easily contain 50, 100, or more features, this would require an enormous and unnecessary time commitment from an analyst to determine these explicit non-linear settings.

7.2.1    Multivariate adaptive regression splines

Multivariate adaptive regression splines (MARS) provide a convenient approach to capture the nonlinearity relationships in the data by assessing cutpoints (knots) similar to step functions. The procedure assesses each data point for each predictor as a knot and creates a linear regression model with the candidate feature(s). For example, consider our non-linear, non-monotonic data above where Y = f (X). The MARS procedure will first look for the single point across the range of X values where two different linear relationships between Y and X achieve the smallest error (e.g., smallest SSE). What results is known as a hinge function h (xa), where a is the cutpoint value. For a single knot (Figure 7.2 (A)), our hinge function is h (x − 1.183606) such that our two linear models for Y are

y={β0+β1(1.183606x)x<1.183606,β0+β1(x1.183606)x>1.183606

(7.3)

Once the first knot has been found, the search continues for a second knot which is found at x = 4.898114 (Figure 7.2 (B)). This results in three linear models for y:

y={β0+β1(1.183606x)x<1.183606,β0+β1(x1.183606)x>1.183606&x<4.898114,β0+β1(4.898114x)x>4.898114

(7.4)

Image
FIGURE 7.2: Examples of fitted regression splines of one (A), two (B), three (C), and four (D) knots.

This procedure continues until many knots are found, producing a (potentially) highly non-linear prediction equation. Although including many knots may allow us to fit a really good relationship with our training data, it may not generalize very well to new, unseen data. Consequently, once the full set of knots has been identified, we can sequentially remove knots that do not contribute significantly to predictive accuracy. This process is known as “pruning” and we can use cross-validation, as we have with the previous models, to find the optimal number of knots.

7.3    Fitting a basic MARS model

We can fit a direct engine MARS model with the earth package (from mda:mars by Trevor Hastie and utilities with Thomas Lumley’s leaps wrapper., 2019). By default, earth::earth() will assess all potential knots across all supplied features and then will prune to the optimal number of knots based on an expected change in R2 (for the training data) of less than 0.001. This calculation is performed by the Generalized cross-validation (GCV) procedure, which is a computational shortcut for linear models that produces an approximate leave-one-out cross-validation error metric (Golub et al., 1979).

Image

The term “MARS” is trademarked and licensed exclusively to Salford Systems: https://www.salford-systems.com. We can use MARS as an abbreviation; however, it cannot be used for competing software solutions. This is why the R package uses the name earth. Although, according to the package documentation, a backronym for “earth” is “Enhanced Adaptive Regression Through Hinges”.

The following applies a basic MARS model to our ames example. The results show us the final models GCV statistic, generalized R2 (GRSq), and more.

# Fit a basic MARS model

mars1 <- earth(

 Sale_Price ~ .,

 data = ames_train

)

# Print model summary

print(mars1)

## Selected 36 of 41 terms, and 24 of 307 predictors

## Termination condition: RSq changed by less than 0.001 at 41 terms

## Importance: First_Flr_SF, Second_Flr_SF, …

## Number of terms at each degree of interaction: 1 35 (additive model)

## GCV 5.11e+08 RSS 9.79e+11 GRSq 0.921 RSq 0.926

It also shows us that 36 of 41 terms were used from 24 of the 307 original predictors. But what does this mean? If we were to look at all the coefficients, we would see that there are 36 terms in our model (including the intercept). These terms include hinge functions produced from the original 307 predictors (307 predictors because the model automatically dummy encodes categorical features). Looking at the first 10 terms in our model, we see that Gr_Liv_Area is included with a knot at 2790 (the coefficient for h (2790 − Gr_Liv_Area) is −5.26), Year_Built is included with a knot at 2002, etc.

Image

You can check out all the coefficients with summary(mars1) or coef(mars1).

summary(mars1) %>% .$coefficients %>% head(10)

##                               Sale_Price

## (Intercept)                    289316.2

## h(Gr_Liv_Area-2790)               -55.3

## h(Year_Built-2002)               3040.6

## h(2002-Year_Built)               -410.9

## h(2220-Total_Bsmt_SF)             -30.7

## h(Bsmt_Unf_SF-543)                -25.4

## h(543-Bsmt_Unf_SF)                 13.8

## h(Total_Bsmt_SF-1550)              39.2

## h(Garage_Cars-2)                12480.7

## h(2-Garage_Cars)                -4834.4

The plot method for MARS model objects provides useful performance and residual plots. Figure 7.3 illustrates the model selection plot that graphs the GCV R2 (left-hand y-axis and solid black line) based on the number of terms retained in the model (x-axis) which are constructed from a certain number of original predictors (right-hand y-axis). The vertical dashed lined at 36 tells us the optimal number of non-intercept terms retained where marginal increases in GCV R2 are less than 0.001.

plot(mars1, which = 1)

Image
FIGURE 7.3: Model summary capturing GCV R2 (left-hand y-axis and solid black line) based on the number of terms retained (x-axis) which is based on the number of predictors used to make those terms (right-hand side y-axis). For this model, 35 non-intercept terms were retained which are based on 26 predictors. Any additional terms retained in the model, over and above these 35, result in less than 0.001 improvement in the GCV R2.

In addition to pruning the number of knots, earth::earth() allows us to also assess potential interactions between different hinge functions. The following illustrates this by including a degree = 2 argument. You can see that now our model includes interaction terms between a maximum of two hinge functions (e.g., h(Year_Built-2002)*h(2362-Gr_Liv_Area) represents an interaction effect for those houses built prior to 2002 and had less than 2,362 square feet of living space above ground).

# Fit a basic MARS model

mars2 <- earth(

Sale_Price ~ .,

data = ames_train,

degree = 2

)

# check out the first 10 coefficient terms

summary(mars2) %>% .$coefficients %>% head(10)

##                                            Sale_Price

## (Intercept)                                 230780.52

## h(Gr_Liv_Area-2790)                             94.79

## h(2790-Gr_Liv_Area)                            -50.98

## h(Year_Built-2002)                            9111.80

## h(2002-Year_Built)                            -689.25

## h(Year_Built-2002)*h(2362-Gr_Liv_Area)     -7.72

## h(Total_Bsmt_SF-1136)                           63.63

## h(1136-Total_Bsmt_SF)                          -32.70

## h(Bsmt_Unf_SF-504)                             -25.09

## h(504-Bsmt_Unf_SF)                              12.96

7.4    Tuning

There are two important tuning parameters associated with our MARS model: the maximum degree of interactions and the number of terms retained in the final model. We need to perform a grid search to identify the optimal combination of these hyperparameters that minimize prediction error (the above pruning process was based only on an approximation of CV model performance on the training data rather than an exact k-fold CV process). As in previous chapters, we’ll perform a CV grid search to identify the optimal hyperparameter mix. Below, we set up a grid that assesses 30 different combinations of interaction complexity (degree) and the number of terms to retain in the final model (nprune).

Image

Rarely is there any benefit in assessing greater than 3-rd degree interactions and we suggest starting out with 10 evenly spaced values for nprune and then you can always zoom in to a region once you find an approximate optimal solution.

# create a tuning grid

hyper_grid <- expand.grid(

degree = 1:3,

nprune = seq(2, 100, length.out = 10) %>% floor()

)

head(hyper_grid)

##  degree nprune

## 1     1    2

## 2     2    2

## 3     3    2

## 4     1   12

## 5     2   12

## 6     3   12

As in the previous chapters, we can use caret to perform a grid search using 10-fold CV. The model that provides the optimal combination includes third degree interaction effects and retains 45 terms. The cross-validated RMSE for these models is displayed in Figure 7.4; the optimal model’s cross-validated RMSE was $22,888.

Image

This grid search took roughly five minutes to complete.

# Cross-validated model

set.seed(123) # for reproducibility

cv_mars <- train(

 x = subset(ames_train,select = -Sale_Price),

 y = ames_train$Sale_Price,

 method = ”earth”,

 metric = ”RMSE”,

 trControl = trainControl(method = ”cv”, number = 10),

 tuneGrid = hyper_grid

)

# View results

cv_mars$bestTune

##  nprune degree

## 25       45    3

ggplot(cv_mars)

Image
FIGURE 7.4: Cross-validated RMSE for the 30 different hyperparameter combinations in our grid search. The optimal model retains 45 terms and includes up to 3rd degree interactions.

The above grid search helps to focus where we can further refine our model tuning. As a next step, we could perform a grid search that focuses in on a refined grid space for nprune (e.g., comparing 35–45 terms retained). However, for brevity we’ll leave this as an exercise for the reader.

So how does this compare to our previously built models for the Ames housing data? The following table compares the cross-validated RMSE for our tuned MARS model to an ordinary multiple regression model along with tuned principal component regression (PCR), partial least squares (PLS), and regularized regression (elastic net) models. By incorporating non-linear relationships and interaction effects, the MARS model provides a substantial improvement over the previous linear models that we have explored.

Image

Notice that our elastic net model is higher than in the last chapter. This table compares these 5 modeling approaches without performing any logarithmic transformation on the target variable. However, even considering the best tuned regularized regression results from last chapter (RMSE = 23503), our optimal MARS model performs better.

TABLE 7.1: Cross-validated RMSE results for tuned MARS and regression models.

Image

7.5    Feature interpretation

MARS models via earth::earth() include a backwards elimination feature selection routine that looks at reductions in the GCV estimate of error as each predictor is added to the model. This total reduction is used as the variable importance measure (value = ”gcv”). Since MARS will automatically include and exclude terms during the pruning process, it essentially performs automated feature selection. If a predictor was never used in any of the MARS basis functions in the final model (after pruning), it has an importance value of zero. This is illustrated in Figure 16.1 where 27 features have > 0 importance values while the rest of the features have an importance value of zero since they were not included in the final model. Alternatively, you can also monitor the change in the residual sums of squares (RSS) as terms are added (value = ”rss”); however, you will see very little difference between these methods.

# variable importance plots

p1 <- vip(cv_mars, num_features = 40, bar = FALSE, value = ”gcv”) + ggtitle (”GCV”)

p2 <- vip(cv_mars, num_features = 40, bar =FALSE, value = ”rss”) + ggtitle(”RSS”)

gridExtra::grid.arrange(p1, p2, ncol = 2)

Image
FIGURE 7.5: Variable importance based on impact to GCV (left) and RSS (right) values as predictors are added to the model. Both variable importance measures will usually give you very similar results.

Its important to realize that variable importance will only measure the impact of the prediction error as features are included; however, it does not measure the impact for particular hinge functions created for a given feature. For example, in Figure 16.1 we see that Gr_Liv_Area and Year_Built are the two most influential variables; however, variable importance does not tell us how our model is treating the non-linear patterns for each feature. Also, if we look at the interaction terms our model retained, we see interactions between different hinge functions for Gr_Liv_Area and Year_Built.

# extract coefficients, convert to tidy data frame, and

# filter for interaction terms

cv_mars$finalModel %>%

coef() %>%

broom::tidy() %>%

filter(stringr::str_detect(names, ”\\*”))

## # A tibble: 20 x 2

## names          x

## <chr>         <dbl>

## 1 h(Year_Built-2002) * h(2362-Gr_Liv_Area)          -6.98e+0

## 2 h(Year_Remod_Add-2007) * h(Total_Bsmt_SF~          8.93e+0

## 3 h(2007-Year_Remod_Add) * h(Total_Bsmt_SF~         -1.22e+0

## 4 NeighborhoodEdwards * h(Year_Built-2002)~         -6.71e+1

## 5 h(Lot_Area-3874) * h(3-Garage_Cars)               -1.17e+0

## 6 Bsmt_ExposureGd * h(Total_Bsmt_SF-1136)            3.12e+1

## 7 NeighborhoodCrawford * h(2002-Year_Built)          4.25e+2

## 8 h(2002-Year_Built) * h(Year_Remod_Add-19~          7.90e+0

## 9 h(2002-Year_Built) * h(1974-Year_Remod_A~          5.42e+0

## 10 h(Kitchen_AbvGr-1) * FunctionalTyp               -1.55e+4

## 11 Overall_QualVery_Excellent * FunctionalT~         9.68e+4

## 12 Overall_QualGood * FunctionalTyp                  1.32e+4

## 13 h(Lot_Area-3874) * h(Latitude-42.0014)            7.65e+0

## 14 h(Lot_Area-3874) * h(42.0014-Latitude)           -1.23e+2

## 15 h(Total_Bsmt_SF-1136) * h(115-Screen_Por~        -3.04e-1

## 16 h(Lot_Area-3874) * h(Gr_Liv_Area-2411) *~    -2.99e-3

## 17 h(Lot_Area-3874) * h(2411-Gr_Liv_Area) *~      5.29e-4

## 18 Overall_CondGood * h(2002-Year_Built)             3.33e+2

## 19 Overall_CondVery_Good * h(2002-Year_Buil~         3.68e+2

## 20 Overall_CondAbove_Average * h(2790-Gr_Li~         6.20e+0

To better understand the relationship between these features and Sale_Price, we can create partial dependence plots (PDPs) for each feature individually and also together. The individual PDPs illustrate the knots for each feature that our model found provides the best fit. For Gr_Liv_Area, as homes exceed 2,790 square feet, each additional square foot demands a higher marginal increase in sale price than homes with less than 2,790 square feet. Similarly, for homes built after 2002, there is a greater marginal effect on sales price based on the age of the home than for homes built prior to 2002. The interaction plot (far right figure) illustrates the stronger effect these two features have when combined.

# Construct partial dependence plots

p1 <- partial(cv_mars,pred.var = ”Gr_Liv_Area”) %>% autoplot()

p2 <- partial(cv_mars,pred.var = ”Year_Built”) %>% autoplot()

p3 <- partial(cv_mars,pred.var = c(”Gr_Liv_Area”, ”Year_Built”), =

    chull = TRUE) %>%

plotPartial(palette = ”inferno”, contour = TRUE) %>%

 ggplotify::as.grob()   # convert to grob to plot with cowplot

# Display plots in a grid

top_row <- cowplot::plot_grid(p1, p2)

cowplot::plot_grid (top_row, p3, nrow = 2, rel_heights = c(1,2))

Image
FIGURE 7.6: Partial dependence plots to understand the relationship between sale price and the living space and year built features. The PDPs tell us that as living space increases and for newer homes, predicted sale price increases dramatically.

7.6    Attrition data

The MARS method and algorithm can be extended to handle classification problems and GLMs in general.1 We saw significant improvement to our predictive accuracy on the Ames data with a MARS model, but how about the employee attrition example? In Chapter 5 we saw a slight improvement in our cross-validated accuracy rate using regularized regression. Here, we tune a MARS model using the same search grid as we did above. We see our best models include no interaction effects and the optimal model retained 12 terms.

# get attrition data

df <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)

# Create training (70%) and test (30%) sets for the

# rsample::attrition data.

set.seed(123)

churn_split <- initial_split(df, prop = 0.7, strata = ”Attrition”)

churn_train <- training(churn_split)

churn_test <- testing(churn_split)

# for reproducibiity

set.seed(123)

# cross validated model

tuned_mars <- train(

 x = subset(churn_train, select = -Attrition),

 y = churn_train$Attrition,

 method = ”earth”,

 trControl = trainControl(method = ”cv”, number = 10),

 tuneGrid = hyper_grid

)

# best model

tuned_mars$bestTune

## nprune degree

## 2 12 1

# plot results

ggplot(tuned_mars)

TABLE 7.2: Cross-validated accuracy results for tuned MARS and regression models.

Image

However, comparing our MARS model to the previous linear models (logistic regression and regularized regression), we do not see any improvement in our overall accuracy rate.

Image
FIGURE 7.7: Cross-validated accuracy rate for the 30 different hyperparameter combinations in our grid search. The optimal model retains 12 terms and includes no interaction effects.

7.7    Final thoughts

There are several advantages to MARS. First, MARS naturally handles mixed types of predictors (quantitative and qualitative). MARS considers all possible binary partitions of the categories for a qualitative predictor into two groups.2 Each group then generates a pair of piecewise indicator functions for the two categories. MARS also requires minimal feature engineering (e.g., feature scaling) and performs automated feature selection. For example, since MARS scans each predictor to identify a split that improves predictive accuracy, non-informative features will not be chosen. Furthermore, highly correlated predictors do not impede predictive accuracy as much as they do with OLS models.

However, one disadvantage to MARS models is that they’re typically slower to train. Since the algorithm scans each value of each predictor for potential cutpoints, computational performance can suffer as both n and p increase. Also, although correlated predictors do not necessarily impede model performance, they can make model interpretation difficult. When two features are nearly perfectly correlated, the algorithm will essentially select the first one it happens to come across when scanning the features. Then, since it randomly selected one, the correlated feature will likely not be included as it adds no additional explanatory power.

1See Friedman et al. (2001) and Stone et al. (1997) for technical details regarding various alternative encodings for binary and mulinomial classification approaches.

2This is very similar to CART-like decision trees which you’ll be exposed to in Chapter 9.