3

Feature & Target Engineering

Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data. Identifying data engineering needs can take significant effort and requires you to spend substantial time understanding your data…or, as Leo Breiman said, “live with your data before you plunge into modeling” (Breiman et al., 2001, p. 201). Although this book primarily focuses on applying machine learning algorithms, feature engineering can make or break an algorithm’s predictive ability and deserves your continued focus and education.

We will not cover all the potential ways of implementing feature engineering; however, we’ll cover several fundamental preprocessing tasks that can significantly improve modeling performance. Moreover, different models have different sensitivities to the type of target and feature values in the model, and we will try to highlight some of these concerns. For more in-depth coverage of feature engineering, please refer to Kuhn and Johnson (2019) and Zheng and Casari (2018).

3.1    Prerequisites

This chapter leverages the following packages:

# Helper packages

library(dplyr) # for data manipulation

library(ggplot2) # for awesome graphics

library(visdat) # for additional visualizations

# Feature engineering packages

library(caret) # for various ML tasks

library(recipes) # for feature engineering tasks

We’ll also continue working with the ames_train data set created in Section 2.7:
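The split itself is not repeated here; as a reminder, a minimal sketch of the stratified 70/30 split used in Chapter 2 (the seed and proportion shown are placeholders and may not be the exact values from Section 2.7):

# Re-create the training/test split (sketch; see Section 2.7 for the exact code)
library(rsample)  # for data splitting

ames <- AmesHousing::make_ames()
set.seed(123)  # placeholder seed for reproducibility
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test  <- testing(split)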

3.2    Target engineering

Although not always a requirement, transforming the response variable can lead to predictive improvement, especially with parametric models (which require that certain assumptions about the model be met). For instance, ordinary linear regression models assume that the prediction errors (and hence the response) are normally distributed. This is usually fine, except when the prediction target has heavy tails (i.e., outliers) or is skewed in one direction or the other. In these cases, the normality assumption likely does not hold. For example, as we saw in the data splitting section (2.2), the response variable for the Ames housing data (Sale_Price) is right (or positively) skewed as illustrated in Figure 3.1 (ranging from $12,789 to $755,000). A simple linear model, say Sale_Price = β0 + β1 Year_Built + ϵ, often assumes the error term ϵ (and hence Sale_Price) is normally distributed; fortunately, a simple log (or similar) transformation of the response can often help alleviate this concern as Figure 3.1 illustrates.

FIGURE 3.1: Transforming the response variable to minimize skewness can resolve concerns with non-normally distributed errors.

Furthermore, a log (or similar) transformation that minimizes response skewness can also be used to shape the business problem. For example, in the House Prices: Advanced Regression Techniques Kaggle competition1, which used the Ames housing data, the competition focused on a log-transformed Sale Price response because “…taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.”

This would be an alternative to using the root mean squared logarithmic error (RMSLE) loss function as discussed in Section 2.6.

There are two main approaches to help correct for positively skewed target variables:

Option 1: normalize with a log transformation. This will transform most right skewed distributions to be approximately normal. One way to do this is to simply log transform the training and test set in a manual, single step manner similar to:

transformed_response <- log(ames_train$Sale_Price)

However, we should think of the preprocessing as creating a blueprint to be re-applied strategically. For this, you can use the recipes package or something similar (e.g., caret::preProcess()). This will not return the actual log transformed values but, rather, a blueprint to be applied later.

# log transformation

ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%

step_log(all_outcomes())

ames_recipe

## Data Recipe

##

## Inputs:

##

## role #variables

## outcome  1

## predictor 80

##

## Operations:

##

## Log transformation on all_outcomes()

If your response has negative values or zeros then a log transformation will produce NaNs and -Infs, respectively (you cannot take the logarithm of a negative number). If the nonpositive response values are small (say between -0.99 and 0) then you can apply a small offset such as in log1p() which adds 1 to the value prior to applying a log transformation (you can do the same within step_log() by using the offset argument). If your data consists of values ≤ −1, use the Yeo-Johnson transformation mentioned next.

log(-0.5)

## [1] NaN

log1p(-0.5)

## [1] -0.693
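Within a recipe, the same offset idea can be expressed directly in step_log(); a minimal sketch (purely illustrative here, since Sale_Price is strictly positive):

# Add 1 to the outcome before taking the log (mirrors log1p())
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes(), offset = 1)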

Option 2: use a Box Cox transformation. A Box Cox transformation is more flexible than (but also includes as a special case) the log transformation and will find an appropriate transformation from a family of power transforms that will transform the variable as close as possible to a normal distribution (Box and Cox, 1964; Carroll and Ruppert, 1981). At the core of the Box Cox transformation is an exponent, lambda (λ), which varies from −5 to 5. All values of λ are considered and the optimal value for the given data is estimated from the training data; the “optimal value” is the one which results in the best transformation to an approximate normal distribution. The transformation of the response Y has the form:

$$
y(\lambda) =
\begin{cases}
\dfrac{Y^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\
\log(Y), & \text{if } \lambda = 0
\end{cases}
\tag{3.1}
$$


Be sure to compute the lambda on the training set and apply that same lambda to both the training and test set to minimize data leakage. The recipes package automates this process for you.

If your response has negative values, use the Yeo-Johnson transformation, which is very similar to the Box-Cox transformation but does not require the input values to be strictly positive. To apply it, use step_YeoJohnson().
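For illustration, minimal recipe sketches for both transformations (the transformation parameters are estimated from the training data when the recipe is prepped):

# Box-Cox on the outcome (requires strictly positive values)
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(all_outcomes())

# Yeo-Johnson on the outcome (allows zero and negative values)
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_outcomes())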

Figure 3.2 illustrates that the log transformation and Box Cox transformation both do about equally well in transforming Sale_Price to look more normally distributed.

Note that when you model with a transformed response variable, your predictions will also be on the transformed scale. You will likely want to undo (or re-transform) your predicted values back to their normal scale so that decision-makers can more easily interpret the results. This is illustrated in the following code chunk:

# Log transform a value

y <- log(10)

# Undo log-transformation

exp(y)

## [1] 10

# Box Cox transform a value (lambda is the transformation parameter estimated
# from the data; here it is approximately 0.0526, as shown in the output below)
y <- forecast::BoxCox(10, lambda)

# Inverse Box Cox function

inv_box_cox <- function(x, lambda) {

# for Box-Cox, lambda = 0 --> log transform

if (lambda == 0) exp(x) else (lambda*x + 1)^(1/lambda)

}

# Undo Box Cox-transformation

inv_box_cox(y, lambda)

## [1] 10

## attr(,"lambda")

## [1] 0.0526

FIGURE 3.2: Response variable transformations.

3.3    Dealing with missingness

Data quality is an important issue for any project involving analyzing data. Data quality issues deserve an entire book in their own right, and a good reference is The Quartz guide to bad data.2 One of the most common data quality concerns you will run into is missing values.

Data can be missing for many different reasons; however, these reasons are usually lumped into two categories: informative missingness (Kuhn and Johnson, 2013) and missingness at random (Little and Rubin, 2014). Informative missingness implies a structural cause for the missing value that can provide insight in its own right, whether this be deficiencies in how the data were collected or abnormalities in the observational environment. Missingness at random implies that missing values occur independently of the data collection process3.

The category that drives missing values will determine how you handle them. For example, we may give values that are driven by informative missingness their own category (e.g., "None") as their unique value may affect predictive performance, whereas values that are missing at random may deserve deletion4 or imputation.

Furthermore, different machine learning models handle missingness differently. Most algorithms cannot handle missingness (e.g., generalized linear models and their cousins, neural networks, and support vector machines) and, therefore, require them to be dealt with beforehand. A few models (mainly tree-based), have built-in procedures to deal with missing values. However, since the modeling process involves comparing and contrasting multiple models to identify the optimal one, you will want to handle missing values prior to applying any models so that your algorithms are based on the same data quality assumptions.

3.3.1    Visualizing missing values

It is important to understand the distribution of missing values (i.e., NA) in any data set. So far, we have been using a pre-processed version of the Ames housing data set (via the AmesHousing::make_ames() function). However, if we use the raw Ames housing data (via AmesHousing::ames_raw), there are actually 13,997 missing values: there is at least one missing value in each row of the original data!

sum(is.na(AmesHousing::ames_raw))

## [1] 13997
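A quick way to see which columns drive this missingness is to count the NAs per column; a small base R sketch:

# Count missing values per column and look at the worst offenders
head(sort(colSums(is.na(AmesHousing::ames_raw)), decreasing = TRUE))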

It is important to understand the distribution of missing values in a data set in order to determine the best approach for preprocessing. Heat maps are an efficient way to visualize the distribution of missing values for small- to medium-sized data sets. The code is.na(<data-frame-name>) will return a matrix of the same dimension as the given data frame, but each cell will contain either TRUE (if the corresponding value is missing) or FALSE (if the corresponding value is not missing). To construct such a plot, we can use R’s built-in heatmap() or image() functions, or ggplot2’s geom_raster() function, among others; Figure 3.3 illustrates geom_raster(). This allows us to easily see where the majority of missing values occur (i.e., in the variables Alley, Fireplace Qual, Pool QC, Fence, and Misc Feature). Due to their high frequency of missingness, these variables would likely need to be removed prior to statistical analysis, or imputed. We can also spot obvious patterns of missingness. For example, missing values appear to occur within the same observations across all garage variables.

AmesHousing::ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
    geom_raster() +
    coord_flip() +
    scale_y_continuous(NULL, expand = c(0, 0)) +
    scale_fill_grey(name = "",
                    labels = c("Present",
                               "Missing")) +
    xlab("Observation") +
    theme(axis.text.y = element_text(size = 4))

Digging a little deeper into these variables, we might notice that Garage_Cars and Garage_Area contain the value 0 whenever the other Garage_xx variables have missing values (i.e., a value of NA). This might be because there was no way to identify houses with no garage when the data were originally collected, so all houses without a garage were simply recorded with nothing (i.e., NA). Since this missingness is informative, it would be appropriate to impute NA with a new category level (e.g., "None") for these garage variables. Circumstances like this tend to only become apparent upon careful descriptive and visual examination of the data!

AmesHousing::ames_raw %>%
  filter(is.na(`Garage Type`)) %>%
  select(`Garage Type`, `Garage Cars`, `Garage Area`)

## # A tibble: 157 x 3

##  ‘Garage Type‘ ‘Garage Cars‘ ‘Garage Area‘

##  <chr> <int> <int>

## 1 <NA>   0 0

## 2 <NA>   0 0

## 3 <NA>   0 0

## 4 <NA>   0 0

## 5 <NA>   0 0

## 6 <NA>   0 0

## 7 <NA>   0 0

## 8 <NA>   0 0

## 9 <NA>   0 0

## 10 <NA>  0 0

## # … with 147 more rows

FIGURE 3.3: Heat map of missing values in the raw Ames housing data.
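Given the informative missingness in the garage variables discussed above, one option is to recode those NAs as an explicit level. A hypothetical dplyr sketch on the raw data (the `Garage Type` column name contains a space, hence the backticks):

# Recode informative missingness in the raw data as its own level
AmesHousing::ames_raw %>%
  mutate(`Garage Type` = ifelse(is.na(`Garage Type`), "None", `Garage Type`)) %>%
  count(`Garage Type`)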

The vis_miss() function in R package visdat (Tierney, 2019) also allows for easy visualization of missing data patterns (with sorting and clustering options). We illustrate this functionality below using the raw Ames housing data (Figure 3.4). The columns of the heat map represent the 82 variables of the raw data and the rows represent the observations. Missing values (i.e., NA) are indicated via a black cell. The variables and NA patterns have been clustered by rows (i.e., cluster = TRUE).

vis_miss(AmesHousing::ames_raw, cluster = TRUE)

FIGURE 3.4: Visualizing missing data patterns in the raw Ames housing data.

Data can be missing for different reasons. Perhaps the values were never recorded (or were lost in translation), or they were recorded in error (a common feature of data entered by hand). Regardless, it is important to identify and attempt to understand how missing values are distributed across a data set as it can provide insight into how to deal with these observations.

3.3.2    Imputation

Imputation is the process of replacing a missing value with a substituted, “best guess” value. Imputation should be one of the first feature engineering steps you take as it will affect any downstream preprocessing5.

3.3.2.1 Estimated statistic

An elementary approach to imputing missing values for a feature is to compute descriptive statistics such as the mean, median, or mode (for categorical features) and use that value to replace NAs. Although computationally efficient, this approach does not consider any other attributes of a given observation when imputing (e.g., a female patient who is 63 inches tall may have her weight imputed as 175 lbs because that is the average weight across all observations, in a data set that is 65% male with an average height of 70 inches).

An alternative is to use grouped statistics to capture expected values for observations that fall into similar groups. However, this becomes infeasible for larger data sets. Modeling imputation can automate this process for you and the two most common methods include K-nearest neighbor and tree-based imputation, which are discussed next.

However, it is important to remember that imputation should be performed within the resampling process and as your data set gets larger, repeated model-based imputation can compound the computational demands. Thus, you must weigh the pros and cons of the two approaches. The following would build onto our ames_recipe and impute all missing values for the Gr_Liv_Area variable with the median value:

ames_recipe %>%

step_medianimpute(Gr_Liv_Area)

## Data Recipe

##

## Inputs:

##

##   role #variables

##  outcome  1

## predictor  80

##

## Operations:

##

## Box-Cox transformation on all_outcomes()

## Median Imputation for Gr_Liv_Area


Use step_modeimpute() to impute categorical features with the most common value.
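For example, a minimal sketch building on ames_recipe (the selector is illustrative):

# Impute the most frequent level for all nominal predictors
ames_recipe %>%
  step_modeimpute(all_nominal())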

3.3.2.2 K-nearest neighbor

K-nearest neighbor (KNN) imputation identifies the observations with missing values, finds the observations that are most similar to them based on the other available features, and uses the values from these nearest neighbors to impute the missing values.

We discuss KNN for predictive modeling in Chapter 8; the imputation application works in a similar manner. In KNN imputation, the missing value for a given observation is treated as the targeted response and is predicted based on the average (for quantitative values) or the mode (for qualitative values) of the k nearest neighbors.

As discussed in Chapter 8, if all features are quantitative then standard Euclidean distance is commonly used as the distance metric to identify the k neighbors and when there is a mixture of quantitative and qualitative features then Gower’s distance (Gower, 1971) can be used. KNN imputation is best used on small to moderate sized data sets as it becomes computationally burdensome with larger data sets (Kuhn and Johnson, 2019).


As we saw in Section 2.7, k is a tunable hyperparameter. Suggested values for imputation are 5–10 (Kuhn and Johnson, 2019). By default, step_knnimpute() will use 5 but can be adjusted with the neighbors argument.

ames_recipe %>%

step_knnimpute(all_predictors(), neighbors = 6)

## Data Recipe

##

## Inputs:

##

##   role #variables

##  outcome  1

## predictor  80

##

## Operations:

##

## Box-Cox transformation on all_outcomes()

## 6-nearest neighbor imputation for all_predictors()

3.3.2.3 Tree-based

As previously discussed, several implementations of decision trees (Chapter 9) and their derivatives can be constructed in the presence of missing values. Thus, they provide a good alternative for imputation. As discussed in Chapters 9, 10, and 11, single trees have high variance, but aggregating across many trees creates a robust, low-variance predictor. Random forest imputation procedures have been studied (Shah et al., 2014; Stekhoven, 2015); however, they impose significant computational demands in a resampling environment (Kuhn and Johnson, 2019). Bagged trees (Chapter 10) offer a compromise between predictive accuracy and computational burden.

Similar to KNN imputation, observations with missing values are identified and the feature containing the missing value is treated as the target and predicted using bagged decision trees.

ames_recipe %>%

step_bagimpute(all_predictors())

## Data Recipe

##

## Inputs:

##

##   role #variables

##  outcome  1

## predictor  80

##

## Operations:

##

## Box-Cox transformation on all_outcomes()

## Bagged tree imputation for all_predictors()

Figure 3.5 illustrates the differences between mean, KNN, and tree-based imputation on the raw Ames housing data. It is apparent how descriptive statistic methods (e.g., using the mean and median) are inferior to the KNN and tree-based imputation methods.

3.4    Feature filtering

In many data analyses and modeling projects we end up with hundreds or even thousands of collected features. From a practical perspective, a model with more features often becomes harder to interpret and is costly to compute. Some models are more resistant to non-informative predictors (e.g., the Lasso and tree-based methods) than others as illustrated in Figure 3.6.6

Although the performance of some of our models are not significantly affected by non-informative predictors, the time to train these models can be negatively impacted as more features are added. Figure 3.7 shows the increase in time to perform 10-fold CV on the exemplar data, which consists of 10,000 observations. We see that many algorithms (e.g., elastic nets, random forests, and gradient boosting machines) become extremely time intensive the more predictors we add. Consequently, filtering or reducing features prior to modeling may significantly speed up training time.

FIGURE 3.5: Comparison of three different imputation methods. The red points represent actual values which were removed and made missing and the blue points represent the imputed values. Estimated statistic imputation methods (i.e. mean, median) merely predict the same value for each observation and can reduce the signal between a feature and the response; whereas KNN and tree-based procedures tend to maintain the feature distribution and relationship.

FIGURE 3.6: Test set RMSE profiles when non-informative predictors are added.

FIGURE 3.7: Impact in model training time as non-informative predictors are added.

Zero and near-zero variance variables are low-hanging fruit to eliminate. Zero variance variables, meaning the feature contains only a single unique value, provide no useful information to a model. Some algorithms are unaffected by zero variance features. However, features that have near-zero variance also offer very little, if any, information to a model. Furthermore, they can cause problems during resampling as there is a high probability that a given sample will only contain a single unique value (the dominant value) for that feature. A rule of thumb for detecting near-zero variance features is:

•  The fraction of unique values over the sample size is low (say ≤ 10%).

•  The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say ≥ 20).

If both of these criteria are true then it is often advantageous to remove the variable from the model. For the Ames data, we do not have any zero variance predictors but there are 20 features that meet the near-zero threshold.

caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  filter(nzv)

##               rowname freqRatio percentUnique zeroVar  nzv
## 1              Street     255.8        0.0974   FALSE TRUE
## 2               Alley      24.7        0.1461   FALSE TRUE
## 3        Land_Contour      21.9        0.1947   FALSE TRUE
## 4           Utilities    2052.0        0.1461   FALSE TRUE
## 5          Land_Slope      21.7        0.1461   FALSE TRUE
## 6         Condition_2     184.7        0.3408   FALSE TRUE
## 7           Roof_Matl     112.5        0.2921   FALSE TRUE
## 8           Bsmt_Cond      24.0        0.2921   FALSE TRUE
## 9      BsmtFin_Type_2      23.1        0.3408   FALSE TRUE
## 10            Heating      91.8        0.2921   FALSE TRUE
## 11    Low_Qual_Fin_SF     674.3        1.4119   FALSE TRUE
## 12      Kitchen_AbvGr      22.8        0.1947   FALSE TRUE
## 13         Functional      35.3        0.3895   FALSE TRUE
## 14     Enclosed_Porch     109.1        7.2055   FALSE TRUE
## 15 Three_season_porch     674.7        1.2658   FALSE TRUE
## 16       Screen_Porch     186.9        4.8199   FALSE TRUE
## 17          Pool_Area    2046.0        0.4382   FALSE TRUE
## 18            Pool_QC     682.0        0.2434   FALSE TRUE
## 19       Misc_Feature      29.6        0.2921   FALSE TRUE
## 20           Misc_Val     124.0        1.2658   FALSE TRUE


We can add step_zv() and step_nzv() to our ames_recipe to remove zero or near-zero variance features.
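A minimal sketch building on ames_recipe:

# Remove zero-variance and near-zero-variance predictors within the recipe
ames_recipe %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())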

Other feature filtering methods exist; see Saeys et al. (2007) for a thorough review. Furthermore, several wrapper methods exist that evaluate multiple models using procedures that add or remove predictors to find the optimal combination of features that maximizes model performance (see, for example, Kursa et al. (2010), Granitto et al. (2006), Maldonado and Weber (2009)). However, this topic is beyond the scope of this book.

3.5    Numeric feature engineering

Numeric features can create a host of problems for certain models when their distributions are skewed, contain outliers, or have a wide range in magnitudes. Tree-based models are quite immune to these types of problems in the feature space, but many other models (e.g., GLMs, regularized regression, KNN, support vector machines, neural networks) can be greatly hampered by these issues. Normalizing and standardizing heavily skewed features can help minimize these concerns.

3.5.1    Skewness

Similar to the process discussed to normalize target variables, parametric models that have distributional assumptions (e.g., GLMs, and regularized models) can benefit from minimizing the skewness of numeric features. When normalizing many variables, it’s best to use the Box-Cox (when feature values are strictly positive) or Yeo-Johnson (when feature values are not strictly positive) procedures as these methods will identify if a transformation is required and what the optimal transformation will be.


Non-parametric models are rarely affected by skewed features; however, normalizing features will not have a negative effect on these models’ performance. For example, normalizing features will only shift the optimal split points in tree-based algorithms. Consequently, when in doubt, normalize.

# Normalize all numeric columns

recipe(Sale_Price ~ ., data = ames_train) %>%

step_YeoJohnson(all_numeric())

## Data Recipe

##

## Inputs:

##

##   role #variables

##  outcome  1

## predictor  80

##

## Operations:

##

## Yeo-Johnson transformation on all_numeric()

3.5.2    Standardization

We must also consider the scale on which the individual features are measured. What are the largest and smallest values across all features and do they span several orders of magnitude? Models that incorporate smooth functions of input features are sensitive to the scale of the inputs. For example, 5X + 2 is a simple linear function of the input X, and the scale of its output depends directly on the scale of the input. Many algorithms use linear functions internally, some more obviously (e.g., GLMs and regularized regression) than others (e.g., neural networks, support vector machines, and principal components analysis). Other examples include algorithms that use distance measures such as the Euclidean distance (e.g., k-nearest neighbor, k-means clustering, and hierarchical clustering).

For these models and modeling components, it is often a good idea to standardize the features. Standardizing features includes centering and scaling so that numeric variables have zero mean and unit variance, which provides a common comparable unit of measure across all the variables.
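Concretely, each value $x_i$ of a numeric feature is centered by the training-set mean and scaled by the training-set standard deviation:

$$
z_i = \frac{x_i - \bar{x}}{s_x}
$$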

Some packages (e.g., glmnet and caret) have built-in options to standardize and some do not (e.g., keras for neural networks). However, you should standardize your variables within the recipe blueprint so that both training and test data standardization are based on the same mean and variance. This helps to minimize data leakage.

ames_recipe %>%

step_center(all_numeric(), -all_outcomes()) %>%

step_scale(all_numeric(), -all_outcomes())

## Data Recipe

##

## Inputs:

##

##   role #variables

##  outcome  1

## predictor  80

##

## Operations:

##

## Box-Cox transformation on all_outcomes()

## Centering for 2 items

## Scaling for 2 items

FIGURE 3.8: Standardizing features allows all features to be compared on a common value scale regardless of their real value differences.

3.6    Categorical feature engineering

Most models require that the predictors take numeric form. There are exceptions; for example, tree-based models naturally handle numeric or categorical features. However, even tree-based models can benefit from preprocessing categorical features. The following sections will discuss a few of the more common approaches to engineer categorical features.

3.6.1    Lumping

Sometimes features will contain levels that have very few observations. For example, there are 28 unique neighborhoods represented in the Ames housing data but several of them only have a few observations.

count(ames_train, Neighborhood) %>% arrange(n)

## # A tibble: 28 x 2

##  Neighborhood  n

##  <fct>    <int>

## 1 Green_Hills  1

## 2 Landmark  1

## 3 Blueste  5

## 4 Greens  7

## 5 Veenker  16

## 6 Northpark_Villa  17

## 7 Briardale  22

## 8 Bloomington_Heights  23

## 9 Meadow_Village  27

## 10 Clear_Creek  30

## # … with 18 more rows

Even numeric features can have similar distributions. For example, Screen_Porch has 92% values recorded as zero (zero square footage meaning no screen porch) and the remaining 8% have unique dispersed values.

count(ames_train, Screen_Porch) %>% arrange(n)

## # A tibble: 99 x 2

##   Screen_Porch   n

##   <int> <int>

## 1   40   1

## 2   53   1

## 3   60   1

## 4   63   1

## 5   80   1

## 6   84   1

## 7   88   1

## 8   92   1

## 9   94   1

## 10   95   1

## # … with 89 more rows

Sometimes we can benefit from collapsing, or “lumping” these into a lesser number of categories. In the above examples, we may want to collapse all levels that are observed in less than 10% of the training sample into an “other” category. We can use step_other() to do so. However, lumping should be used sparingly as there is often a loss in model performance (Kuhn and Johnson, 2013).


Tree-based models often perform exceptionally well with high cardinality features and are not as impacted by levels with small representation.

# Lump levels for two features
lumping <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(Neighborhood, threshold = 0.01,
             other = "other") %>%
  step_other(Screen_Porch, threshold = 0.1,
             other = ">0")

# Apply this blueprint --> you will learn about this at
# the end of the chapter
apply_2_training <- prep(lumping, training = ames_train) %>%
  bake(ames_train)

# New distribution of Neighborhood

count(apply_2_training, Neighborhood) %>% arrange(n)

## # A tibble: 23 x 2

##  Neighborhood        n

##  <fct>         <int>

## 1 Briardale          22

## 2 Bloomington_Heights     23

## 3 Meadow_Village       27

## 4 Clear_Creek         30

## 5 South_and_West_of_Iowa_State_University  33

## 6 Stone_Brook         36

## 7 Timberland         47

## 8 other       47

## 9 Northridge         55

## 10 Iowa_DOT_and_Rail_Road   59

## # … with 13 more rows

# New distribution of Screen_Porch

count(apply_2_training, Screen_Porch) %>% arrange(n)

## # A tibble: 2 x 2

##  Screen_Porch   n

##  <fct>   <int>

## 1 >0    185

## 2 0    1869

3.6.2    One-hot & dummy encoding

Many models require that all predictor variables be numeric. Consequently, we need to intelligently transform any categorical variables into numeric representations so that these algorithms can compute. Some packages automate this process (e.g., h2o and caret) while others do not (e.g., glmnet and keras). There are many ways to recode categorical variables as numeric (e.g., one-hot, ordinal, binary, sum, and Helmert).

The most common is referred to as one-hot encoding, where we transpose our categorical variables so that each level of the feature is represented as a Boolean value. For example, one-hot encoding the left data frame in Figure 3.9 results in X being converted into three columns, one for each level. This is called less than full rank encoding. However, this creates perfect collinearity, which causes problems with some predictive modeling algorithms (e.g., ordinary linear regression and neural networks). Alternatively, we can create a full-rank encoding by dropping one of the levels (in Figure 3.9, level c has been dropped). This is referred to as dummy encoding.

FIGURE 3.9: Eight observations containing a categorical feature X and the difference in how one-hot and dummy encoding transforms this feature.

We can one-hot or dummy encode with the same function (step_dummy()). By default, step_dummy() will create a full rank encoding but you can change this by setting one_hot = TRUE.

# One-hot encode all nominal (i.e., categorical) features

recipe(Sale_Price ~ ., data = ames_train) %>%

step_dummy(all_nominal(), one_hot = TRUE)

## Data Recipe

##

## Inputs:

##

##  role #variables

## outcome   1

## predictor   80

##

## Operations:

##

## Dummy variables from all_nominal()


Since one-hot encoding adds new features it can significantly increase the dimensionality of our data. If you have a data set with many categorical variables and those categorical variables in turn have many unique levels, the number of features can explode. In these cases you may want to explore label/ordinal encoding or some other alternative.

3.6.3    Label encoding

Label encoding is a pure numeric conversion of the levels of a categorical variable. If a categorical variable is a factor and it has pre-specified levels then the numeric conversion will be in level order. If no levels are specified, the encoding will be based on alphabetical order. For example, the MS_SubClass variable has 16 levels, which we can recode numerically with step_integer().

# Original categories

count(ames_train, MS_SubClass)

## # A tibble: 16 x 2

##  MS_SubClass  n

##  <fct>  <int>

## 1 One_Story_1946_and_Newer_All_Styles  749

## 2 One_Story_1945_and_Older  97

## 3 One_Story_with_Finished_Attic_All_Ages   4

## 4 One_and_Half_Story_Unfinished_All_Ages   14

## 5 One_and_Half_Story_Finished_All_Ages   192

## 6 Two_Story_1946_and_Newer  401

## 7 Two_Story_1945_and_Older  94

## 8 Two_and_Half_Story_All_Ages  16

## 9 Split_or_Multilevel    87

## 10 Split_Foyer   31

## 11 Duplex_All_Styles_and_Ages  73

## 12 One_Story_PUD_1946_and_Newer  147

## 13 One_and_Half_Story_PUD_All_Ages  1

## 14 Two_Story_PUD_1946_and_Newer  94

## 15 PUD_Multilevel_Split_Level_Foyer  12

## 16 Two_Family_conversion_All_Styles_and_Ages  42

# Label encoded

recipe(Sale_Price ~ ., data = ames_train) %>%

step_integer(MS_SubClass) %>%

prep(ames_train) %>%

bake(ames_train) %>%

count(MS_SubClass)

## # A tibble: 16 x 2

## MS_SubClass   n

##   <dbl> <int>

## 1   1  749

## 2   2  97

## 3   3   4

## 4   4 14

## 5   5  192

## 6   6  401

## 7   7  94

## 8   8  16

## 9   9  87

## 10  10  31

## 11  11  73

## 12  12  147

## 13  13  1

## 14  14  94

## 15  15  12

## 16  16  42

We should be careful with label encoding unordered categorical features because most models will treat them as ordered numeric features. If a categorical feature is naturally ordered then label encoding is a natural choice (most commonly referred to as ordinal encoding). For example, the various quality features in the Ames housing data are ordinal in nature (ranging from Very_Poor to Very_Excellent).

ames_train %>% select(contains("Qual"))

## # A tibble: 2,054 x 6

##  Overall_Qual Exter_Qual Bsmt_Qual Low_Qual_Fin_SF

##  <fct>   <fct>   <fct>  <int>

## 1 Above_Avera~ Typical  Typical  0

## 2 Average  Typical  Typical  0

## 3 Above_Avera~ Typical  Typical  0

## 4 Good  Good  Typical  0

## 5 Above_Avera~ Typical  Typical  0

## 6 Very_Good  Good  Good  0

## 7 Very_Good  Good  Good  0

## 8 Good  Typical  Typical  0

## 9 Above_Avera~ Typical  Good  0

## 10 Above_Avera~ Typical  Good  0

## # … with 2,044 more rows, and 2 more variables:

## #  Kitchen_Qual <fct>, Garage_Qual <fct>

Ordinal encoding these features provides a natural and intuitive interpretation and can logically be applied to all models.


The various xxx_Qual features in the Ames housing data are not ordered factors. For ordered factors you could also use step_ordinalscore().
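A hypothetical sketch (the coercion to an ordered factor is only for illustration, since these features are stored as unordered factors):

# Coerce a quality feature to an ordered factor, then convert it to an
# integer score with step_ordinalscore()
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_mutate(Overall_Qual = factor(Overall_Qual, ordered = TRUE)) %>%
  step_ordinalscore(Overall_Qual)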

# Original categories

count(ames_train, Overall_Qual)

## # A tibble: 10 x 2

##  Overall_Qual   n

##   <fct>   <int>

## 1 Very_Poor  4

## 2 Poor   8

## 3 Fair  23

## 4 Below_Average  169

## 5 Average  582

## 6 Above_Average  497

## 7 Good  425

## 8 Very_Good   249

## 9 Excellent  75

## 10 Very_Excellent  22

# Label encoded

recipe(Sale_Price ~ ., data = ames_train) %>%

 step_integer(Overall_Qual) %>%

 prep(ames_train) %>%

 bake(ames_train) %>%

 count(Overall_Qual)

## # A tibble: 10 x 2

##  Overall_Qual   n

##   <dbl> <int>

## 1  1   4

## 2  2   8

## 3  3  23

## 4  4  169

## 5  5  582

## 6  6  497

## 7  7  425

## 8  8  249

## 9  9  75

## 10  10  22

3.6.4    Alternatives

There are several alternative categorical encodings that are implemented in various R machine learning engines and are worth exploring. For example, target encoding is the process of replacing a categorical value with the mean (regression) or proportion (classification) of the target variable. For example, target encoding the Neighborhood feature would change North_Ames to 147040 (see Table 3.1).

Target encoding runs the risk of data leakage since you are using the response variable to encode a feature. An alternative to this is to change the feature value to represent the proportion that a particular level represents for a given feature. In this case, North_Ames would be changed to 0.154 (see Table 3.2).
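To make the two encodings concrete, here is a minimal dplyr sketch computed on the training data only (the column names Avg_Sale_Price and Proportion are illustrative):

# Average sale price per neighborhood (target encoding lookup table)
target_enc <- ames_train %>%
  group_by(Neighborhood) %>%
  summarize(Avg_Sale_Price = mean(Sale_Price))

# Share of training observations per neighborhood (proportion encoding)
prop_enc <- ames_train %>%
  count(Neighborhood) %>%
  mutate(Proportion = n / sum(n)) %>%
  select(-n)

# Join the lookup tables back onto the training data
ames_train %>%
  left_join(target_enc, by = "Neighborhood") %>%
  left_join(prop_enc, by = "Neighborhood") %>%
  select(Neighborhood, Avg_Sale_Price, Proportion)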


In Chapter 9, we discuss how tree-based models use this approach to order categorical features when choosing a split point.

Several alternative approaches include effect or likelihood encoding (Micci-Barreca, 2001; Zumel and Mount, 2016), empirical Bayes methods (West et al., 2014), word and entity embeddings (Guo and Berkhahn, 2016; Chollet and Allaire, 2018), and more. For more in depth coverage of categorical encodings we highly recommend Kuhn and Johnson (2019).

TABLE 3.1: Example of target encoding the Neighborhood feature of the Ames housing data set.

Neighborhood          Avg Sale_Price
North_Ames                    147040
College_Creek                 202438
Old_Town                      121815
Edwards                       124297
Somerset                      232394
Northridge_Heights            320174
Gilbert                       191095
Sawyer                        137405
Northwest_Ames                186082
Sawyer_West                   183062

TABLE 3.2: Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.

Neighborhood          Proportion
North_Ames                 0.154
College_Creek              0.093
Old_Town                   0.081
Edwards                    0.064
Somerset                   0.062
Northridge_Heights         0.058
Gilbert                    0.056
Sawyer                     0.052
Northwest_Ames             0.040
Sawyer_West                0.045

3.7    Dimension reduction

Dimension reduction is an alternative approach to filter out non-informative features without manually removing them. We discuss dimension reduction topics in depth later in the book (Chapters 17, 18, and 19) so please refer to those chapters for details.

However, we wanted to highlight that it is very common to include these types of dimension reduction approaches during the feature engineering process. For example, we may wish to reduce the dimension of our features with principal components analysis (Chapter 17) and retain the number of components required to explain, say, 95% of the variance and use these components as features in downstream modeling.

recipe(Sale_Price ~ ., data = ames_train) %>%

 step_center(all_numeric()) %>%

 step_scale(all_numeric()) %>%

 step_pca(all_numeric(), threshold = .95)

## Data Recipe

##

## Inputs:

##

##  role #variables

##  outcome   1

## predictor   80

##

## Operations:

##

## Centering for all_numeric()

## Scaling for all_numeric()

## PCA extraction with all_numeric()

3.8    Proper implementation

We stated at the beginning of this chapter that we should think of feature engineering as creating a blueprint rather than manually performing each task individually. This helps us in two ways: (1) thinking sequentially and (2) applying the steps appropriately within the resampling process.

3.8.1    Sequential steps

Thinking of feature engineering as a blueprint forces us to think of the ordering of our preprocessing steps. Although each particular problem requires you to think of the effects of sequential preprocessing, there are some general suggestions that you should consider:

•  If using a log or Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive. Alternatively, use the Yeo-Johnson transformation so you don’t have to worry about this.

•  One-hot or dummy encoding typically results in sparse data which many algorithms can operate efficiently on. If you standardize sparse data you will create dense data and you lose the computational efficiency. Consequently, it’s often preferred to standardize your numeric features and then one-hot/dummy encode.

•  If you are lumping infrequently occurring categories together, do so before one-hot/dummy encoding.

•  Although you can perform dimension reduction procedures on categorical features, it is common to primarily do so on numeric features when doing so for feature engineering purposes.

While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:

1.  Filter out zero or near-zero variance features.

2.  Perform imputation if required.

3.  Normalize to resolve numeric feature skewness.

4.  Standardize (center and scale) numeric features.

5.  Perform dimension reduction (e.g., PCA) on numeric features.

6.  One-hot or dummy encode categorical features.

3.8.2    Data leakage

Data leakage is when information from outside the training data set is used to create the model. Data leakage often occurs during the data preprocessing period. To minimize this, feature engineering should be done in isolation of each resampling iteration. Recall that resampling allows us to estimate the generalizable prediction error. Therefore, we should apply our feature engineering blueprint to each resample independently as illustrated in Figure 3.10. That way we are not leaking information from one data set to another (each resample is designed to act as isolated training and test data).

For example, when standardizing numeric features, each resampled training data should use its own mean and variance estimates and these specific values should be applied to the same resampled test set. This imitates how real-life prediction occurs where we only know our current data’s mean and variance estimates; therefore, on new data that comes in where we need to predict we assume the feature values follow the same distribution of what we’ve seen in the past.

FIGURE 3.10: Performing feature engineering preprocessing within each resample helps to minimize data leakage.
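As a small sketch of this idea with recipes, the centering and scaling statistics are estimated on the training portion via prep() and then applied unchanged to the held-out data via bake() (this assumes the ames_test object from Section 2.7):

# Estimate centering/scaling parameters on the training data only
std_blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

std_prep <- prep(std_blueprint, training = ames_train)
bake(std_prep, new_data = ames_test)  # applies the training means/SDs to the test set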

3.8.3    Putting the process together

To illustrate how this process works together via R code, let’s do a simple re-assessment on the ames data set that we did at the end of the last chapter (Section 2.7) and see if some simple feature engineering improves our prediction error. But first, we’ll formally introduce the recipes package, which we’ve been implicitly illustrating throughout.

The recipes package allows us to develop our feature engineering blueprint in a sequential nature. The idea behind recipes is similar to caret::preProcess() where we want to create the preprocessing blueprint but apply it later and within each resample.7

There are three main steps in creating and applying feature engineering with recipes:

1.  recipe: where you define your feature engineering steps to create your blueprint.

2.  prepare: estimate feature engineering parameters based on training data.

3.  bake: apply the blueprint to new data.

The first step is where you define your blueprint (aka recipe). With this process, you supply the formula of interest (the target variable, features, and the data these are based on) with recipe() and then you sequentially add feature engineering steps with step_xxx(). For example, the following defines Sale_Price as the target variable and then uses all the remaining columns as features based on ames_train. We then:

1.  Remove near-zero variance features that are categorical (aka nominal).

2.  Ordinal encode our quality-based features (which are inherently ordinal).

3.  Center and scale (i.e., standardize) all numeric features.

4.  Perform dimension reduction by applying PCA to all numeric features.

blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%

step_nzv(all_nominal()) %>%

step_integer(matches("Qual|Cond|QC|Qu")) %>%

step_center(all_numeric(), -all_outcomes()) %>%

step_scale(all_numeric(), -all_outcomes()) %>%

step_pca(all_numeric(), -all_outcomes())

blueprint

## Data Recipe

##

## Inputs:

##

##  role #variables

##  outcome    1

## predictor   80

##

## Operations:

##

## Sparse, unbalanced variable filter on all_nominal()

## Integer encoding for matches("Qual|Cond|QC|Qu")

## Centering for 2 items

## Scaling for 2 items

## PCA extraction with 2 items

Next, we need to train this blueprint on some training data. Remember, there are many feature engineering steps that we do not want to train on the test data (e.g., standardize and PCA) as this would create data leakage. So in this step we estimate these parameters based on the training data of interest.

prepare <- prep(blueprint, training = ames_train)

prepare

## Data Recipe

##

## Inputs:

##

##   role #variables

##  outcome  1

## predictor  80

##

## Training data contained 2054 data points and no missing data.

##

## Operations:

##

## Sparse, unbalanced variable filter removed Street, … [trained]

## Integer encoding for Condition_1, …  [trained]

## Centering for Lot_Frontage, … [trained]

## Scaling for Lot_Frontage, … [trained]

## PCA extraction with Lot_Frontage, …  [trained]

Lastly, we can apply our blueprint to new data (e.g., the training data or future test data) with bake().

baked_train <- bake(prepare, new_data = ames_train)

baked_test <- bake(prepare, new_data = ames_test)

baked_train

## # A tibble: 2,054 x 27

##  MS_SubClass MS_Zoning Lot_Shape Lot_Config

##  <fct> <fct> <fct> <fct>

## 1 One_Story_~ Resident~ Slightly~ Corner

## 2 One_Story_~ Resident~ Regular  Inside

## 3 One_Story_~ Resident~ Slightly~ Corner

## 4 One_Story_~ Resident~ Regular  Corner

## 5 Two_Story_~ Resident~ Slightly~ Inside

## 6 One_Story_~ Resident~ Slightly~ Inside

## 7 One_Story_~ Resident~ Slightly~ Inside

## 8 Two_Story_~ Resident~ Regular  Inside

## 9 Two_Story_~ Resident~ Slightly~ Corner

## 10 One_Story_~ Resident~ Slightly~ Inside

## # … with 2,044 more rows, and 23 more variables:

## #  Neighborhood <fct>, Bldg_Type <fct>,

## #  House_Style <fct>, Roof_Style <fct>,

## #  Exterior_1st <fct>, Exterior_2nd <fct>,

## #  Mas_Vnr_Type <fct>, Foundation <fct>,

## #  Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,

## #  Central_Air <fct>, Electrical <fct>,

## #  Garage_Type <fct>, Garage_Finish <fct>,

## #  Paved_Drive <fct>, Fence <fct>, Sale_Type <fct>,

## #  Sale_Price <int>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>,

## #  PC4 <dbl>, PC5 <dbl>

Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample. We illustrate with the ames housing example.

First, we create our feature engineering blueprint to perform the following tasks:

1.  Filter out near-zero variance features for categorical features.

2.  Ordinally encode all quality features, which are on a 1–10 Likert scale.

3.  Standardize (center and scale) all numeric features.

4.  One-hot encode our remaining categorical features.

blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(matches("Qual|Cond|QC|Qu")) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(),
             one_hot = TRUE)

Next, we apply the same resampling method and hyperparameter search grid as we did in Section 2.7. The only difference is when we train our resample models with train(), we supply our blueprint as the first argument and then caret takes care of the rest.

# Specify resampling plan

cv <- trainControl(

method = "repeatedcv",

 number = 10,

 repeats = 5

)

# Construct grid of hyperparameter values

hyper_grid <- expand.grid(k = seq(2, 25, by = 1))

# Tune a knn model using grid search

knn_fit2 <- train(

 blueprint,

 data = ames_train,

 method = "knn",

 trControl = cv,

 tuneGrid = hyper_grid,

 metric = "RMSE"

)

Looking at our results we see that the best model was associated with k = 11, which resulted in a cross-validated RMSE of 32,938. Figure 3.11 illustrates the cross-validated error rate across the spectrum of hyperparameter values that we specified.

# print model results

knn_fit2

## k-Nearest Neighbors

##

## 2054 samples

##  80 predictor

##

## Recipe steps: nzv, integer, center, scale, dummy

## Resampling: Cross-Validated (10 fold, repeated 5 times)

## Summary of sample sizes: 1849, 1847, 1849, 1847, 1850, 1849, ...

## Resampling results across tuning parameters:

##

##  k  RMSE  Rsquared MAE

## 2 36656 0.796   22496

## 3 35576 0.809   21758

## 4 34562 0.820   21281

## 5 34074 0.827   20966

## 6 33660 0.832   20793

## 7 33332 0.838   20659

## 8 33209 0.841   20575

## 9 33049 0.845 20543

## 10 32959 0.847 20530

## 11 32938 0.848 20556

## 12 32961 0.849 20557

## 13 32991 0.849 20631

## 14 33069 0.849 20683

## 15 33091 0.850 20719

## 16 33095 0.850 20734

## 17 33096 0.851 20740

## 18 33154 0.851 20784

## 19 33241 0.851 20850

## 20 33306 0.851 20899

## 21 33423 0.850 20989

## 22 33475 0.850 21053

## 23 33546 0.850 21106

## 24 33580 0.850 21164

## 25 33617 0.851 21226

##

## RMSE was used to select the optimal model using

## the smallest value.

## The final value used for the model was k = 11.

# plot cross validation results

ggplot(knn_fit2)

FIGURE 3.11: Results from the same grid search performed in Section 2.7 but with feature engineering performed within each resample.

By applying a handful of the preprocessing techniques discussed throughout this chapter, we were able to reduce our prediction error by over $10,000. The chapters that follow will look to see if we can continue reducing our error by applying different algorithms and feature engineering blueprints.

1https://www.kaggle.com/c/house-prices-advanced-regression-techniques

2https://github.com/Quartz/bad-data-guide

3Little and Rubin (2014) discuss two different kinds of missingness at random; however, we combine them for simplicity as their nuanced differences are rarely distinguished in practice.

4If your data set is large, deleting missing observations that have missing values at random rarely impacts predictive performance. However, as your data sets get smaller, preserving observations is critical and alternative solutions should be explored.

5For example, standardizing numeric features will include the imputed numeric values in the calculation and one-hot encoding will include the imputed categorical value.

6See Kuhn and Johnson (2013) Section 19.1 for data set generation.

7In fact, most of the feature engineering capabilities found in recipes can also be found in caret::preProcess().