Let's take the diamonds dataset as an example to understand the implementation of k-fold cross-validation.
To perform k-fold cross-validation in scikit-learn, we first have to import the libraries that we will use. The following code snippet shows the imports:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
The second step is to prepare the diamonds dataset that we will use in this example. The following shows the code used to load and prepare the data:
# importing data
data_path= '../data/diamonds.csv'
diamonds = pd.read_csv(data_path)
# one-hot encode the categorical columns, dropping the first level of each
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['color'], prefix='color', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['clarity'], prefix='clarity', drop_first=True)], axis=1)
diamonds.drop(['cut', 'color', 'clarity'], axis=1, inplace=True)
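If you want to confirm that the encoding worked as expected, a quick check such as the following (not part of the original example) lists the dummy columns that get_dummies created:
# the original categorical columns are gone; dummy columns such as cut_*, color_*, clarity_* remain
print(diamonds.shape)
print([col for col in diamonds.columns if col.startswith(('cut_', 'color_', 'clarity_'))])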
After preparing the data, we will create the objects used for modeling. The following shows the code used to create the objects for modeling:
from sklearn.preprocessing import RobustScaler
target_name = 'price'
robust_scaler = RobustScaler()
X = diamonds.drop('price', axis=1)
X = robust_scaler.fit_transform(X)
y = diamonds[target_name]
# Notice that we are not doing train-test split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
This is the same cell that we used in Chapter 1, Ensemble Methods for Regression and Classification. The difference here is that we are not using the train_test_split function. Here, we produce the X matrix, which contains all of the features, and the y vector, which contains our target, price. So we have our X matrix and our y vector for the whole dataset.
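As a quick sanity check (not part of the original example, and assuming the preparation cell above has been run), we can inspect the shapes of X and y; the exact numbers depend on the CSV you load:
# X is a NumPy array of scaled features; y is the price Series
print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)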
To train the model, we will instantiate a RandomForestRegressor, which we found was the best model for this dataset in Chapter 1, Ensemble Methods for Regression and Classification. The following shows the code used to instantiate the RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor(n_estimators=50, max_depth=16, random_state=123, n_jobs=-1)
To perform k-fold cross-validation, we import the cross_validate function from the model_selection module in scikit-learn. The following shows the code used to import the cross_validate function:
# this will work from sklearn version 0.19, if you get an error
# make sure you upgrade: $conda upgrade scikit-learn
from sklearn.model_selection import cross_validate
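If you are not sure which scikit-learn version you have installed, a quick check (not part of the original example) is the following:
# print the installed scikit-learn version
import sklearn
print(sklearn.__version__)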
This cross_validate function works as follows:
- We provide the estimator, which will be the RandomForestRegressor object we just created. The following shows the code used to call the cross_validate function with this estimator:
# tenfold cross-validation with two scoring metrics
scores = cross_validate(estimator=RF, X=X, y=y,
                        scoring=['neg_mean_squared_error', 'r2'],
                        cv=10, n_jobs=-1, return_train_score=True)
Here, we pass the X object and the y object.
- We provide a set of metrics that we want to evaluate for this model and this dataset. In this case, evaluation is done using the neg_mean_squared_error scorer and the r2 metric, as shown in the preceding code (scikit-learn's scorers follow a higher-is-better convention, so the MSE is reported with a negative sign). Here, we pass the value of k in cv, so in this case we will perform tenfold cross-validation.
The output that we get from this cross_validate function is a dictionary with the corresponding metrics. For better understanding, the output can be converted into a dataframe and inspected.
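A minimal sketch of that conversion (the scores_df name is illustrative, not from the original code):
# turn the scores dictionary into a DataFrame, one row per fold
scores_df = pd.DataFrame(scores)
scores_df.head()  # displays the first folds in a notebook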
The dataframe has one row per fold, with the test_neg_mean_squared_error and test_r2 columns corresponding to the two metrics that we wanted to evaluate, alongside their train_ counterparts and the fit and score times. We are interested in the testing metrics.
To get a better assessment of the performance of a model, we will take an average (mean) of all of the individual measurements.
The following shows the code used to print the mean test MSE and mean test R-squared values:
print("Mean test MSE:", round(scores['test_mean_squared_error'].mean()))
print("Mean test R-squared:", scores['test_r2'].mean())
So, by averaging over the ten folds, we get a single mean test MSE and a single mean test R-squared, which give a more reliable assessment of the model's performance than any individual fold.