Since the k-fold cross-validation method proved to be the better evaluation method, it is also more suitable for comparing models. The reason is that k-fold cross-validation gives multiple estimates of the evaluation metric, and by averaging these estimates, we get a better assessment of model performance.
The following shows the code used to import libraries for comparing models:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
After importing the libraries, we'll load the diamonds dataset. The following shows the code used to prepare this dataset:
# importing data
data_path= '../data/diamonds.csv'
diamonds = pd.read_csv(data_path)
# one-hot encode the categorical features, dropping the first level of each
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['color'], prefix='color', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['clarity'], prefix='clarity', drop_first=True)], axis=1)
# drop the original categorical columns
diamonds.drop(['cut','color','clarity'], axis=1, inplace=True)
Now that the dataset is prepared, we have to set up the objects for modeling. The following shows the code used to prepare these objects. Here, X is the feature matrix and y is the target vector for this dataset:
from sklearn.preprocessing import RobustScaler
target_name = 'price'
robust_scaler = RobustScaler()
# feature matrix: every column except the target, scaled with a scaler that is robust to outliers
X = diamonds.drop(target_name, axis=1)
X = robust_scaler.fit_transform(X)
# target vector
y = diamonds[target_name]
Here, we will compare the KNN model, the random forest model, and the boosting model. For these models, using tenfold cross-validation, we will use the KNeighborsRegressor, RandomForestRegressor, and AdaBoostRegressor classes.
We will also import the cross_validate function. The following shows the code used to import these three estimators and the cross_validate function:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_validate
The next step is to compare the models using the cross_validate function. The following shows the code block used for comparing these three models:
## KNN
knn = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean', n_jobs=-1)
knn_test_mse = cross_validate(estimator=knn, X=X, y=y,
                              scoring='neg_mean_squared_error',
                              cv=10, n_jobs=-1)['test_score']
## Random Forests
RF = RandomForestRegressor(n_estimators=50, max_depth=16, random_state=55, n_jobs=-1)
RF_test_mse = cross_validate(estimator=RF, X=X, y=y,
                             scoring='neg_mean_squared_error',
                             cv=10, n_jobs=-1)['test_score']
## Boosting
boosting = AdaBoostRegressor(n_estimators=50, learning_rate=0.05, random_state=55)
boosting_test_mse = cross_validate(estimator=boosting, X=X, y=y,
                                   scoring='neg_mean_squared_error',
                                   cv=10, n_jobs=-1)['test_score']
Here, the test evaluation metric we use for every model is the mean squared error. For every model we run tenfold cross-validation, and from the resulting dictionary we take the test_score entry. Since scikit-learn reports the negative mean squared error, we multiply by -1 to recover the MSE, as shown in the following code:
mse_models = -1 * pd.DataFrame({'KNN': knn_test_mse,
                                'RandomForest': RF_test_mse,
                                'Boosting': boosting_test_mse})
The following screenshot shows the result of running the tenfold cross-validation for every model, namely the table with the 10 estimates of the evaluation metric for the three models:
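A minimal way to reproduce that table is simply to display the mse_models DataFrame in the notebook:
# each row is one fold's test MSE; each column is one model
mse_models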
This table shows a lot of variation in the estimates for every model. To know the actual performance of each model, we take the average of these results: we compute the mean of the 10 values for each model and then plot it.
The following screenshot shows the code used to take the mean and the graph showing the mean MSE values for every model:
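Since the exact code appears only in the screenshot, the following is a minimal sketch of such a computation using pandas' built-in plotting (the title and axis label are illustrative):
# average the 10 fold estimates to get a single MSE per model
mean_mse = mse_models.mean()
# bar plot of the mean test MSE for each model
mean_mse.plot(kind='bar', title='Mean test MSE by model')
plt.ylabel('MSE')
plt.show()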
On average, the random forest model performs best of the three, the KNN model takes second place, and the boosting model comes last.
The following figure shows the code used to get the box plot for these evaluation metrics and the box plot for the three models:
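A minimal sketch of such a box plot, drawn directly from the mse_models DataFrame (the title and axis label are illustrative):
# box plot of the 10 fold-wise MSE estimates for each model
mse_models.plot(kind='box', title='Distribution of test MSE across folds')
plt.ylabel('MSE')
plt.show()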
So, looking at the box plot for this evaluation metric, we again see that the random forest performs the best.
To check the degree of variability of these models, we can analyze the standard deviation of the test MSE across the folds. The following screenshot shows the code used to check the degree of variability and the plot showing the standard deviation for these models:
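A minimal sketch of such a check, again using pandas (the title and axis label are illustrative):
# standard deviation of the 10 fold estimates: a rough measure of variability
mse_models.std().plot(kind='bar', title='Standard deviation of test MSE')
plt.ylabel('Standard deviation of MSE')
plt.show()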
The plot shows that the most variable model in this case was the KNN model, followed by the boosting model; the random forest model had the least variation. So, the random forest model is the best among these three models: even for this dataset, it performs the best.