Training Classification Models

Starting on the left side of the plot, we can see that both sets of data agree on the score, which is good. However, the score is also quite low compared to other gamma values, so we say the model is underfitting the data. As gamma increases, we reach a point where the error bars of the two lines no longer overlap. From this point on, the classifier is overfitting the data, as the models perform increasingly well on the training set compared to the validation set. The optimal value for the gamma parameter can be found by looking for a high validation score with overlapping error bars on the two lines.

Keep in mind that a validation curve for some parameter is only valid while the other parameters remain constant. For example, if training the SVM in this plot, we could decide to pick gamma on the order of 10^-4. However, we may want to optimize the C parameter as well. With a different value for C, the preceding plot would be different and our selection for gamma may no longer be optimal.
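The selection rule described above can be sketched in a few lines. The helper below, pick_param, is a hypothetical name (it is not part of scikit-learn or the workbook). It assumes score arrays shaped the way validation_curve returns them (one row per parameter value, one column per fold) and error bars of one standard deviation.

```python
import numpy as np

def pick_param(param_range, train_scores, test_scores):
    """Return the parameter value with the best mean validation score
    among those whose train/validation error bars (1 std) overlap."""
    train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
    test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)
    # Two error bars overlap when the gap between the means is no larger
    # than the sum of the two half-widths.
    overlap = np.abs(train_mean - test_mean) <= train_std + test_std
    if not overlap.any():
        # Fall back to the best validation score if nothing overlaps.
        overlap = np.ones_like(overlap, dtype=bool)
    candidates = np.where(overlap)[0]
    best = candidates[np.argmax(test_mean[candidates])]
    return param_range[best]
```

With real scores, the arrays returned by validation_curve for a range of gamma values could be passed in directly. The result still holds only for the fixed values of the other parameters, such as C.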

Using k-fold cross validation and validation curves in Python with scikit-learn

If you've not already done so, start the NotebookApp and open the lesson-2-workbook.ipynb file. Scroll down to Subtopic B: K-fold cross-validation and validation curves.
The training data should already be in the notebook's memory, but let's reload it as a reminder of what exactly we're working with.
Load the data and select the satisfaction_level and last_evaluation features for the training/validation set. We will not use the train-test split this time because we are going to use k-fold validation instead. Run the cell containing the following code:
```
df = pd.read_csv('../data/hr-analytics/hr_data_processed.csv')
features = ['satisfaction_level', 'last_evaluation']
X = df[features].values
y = df.left.values
```
Instantiate a Random Forest model by running the cell containing the following code:
```
clf = RandomForestClassifier(n_estimators=100, max_depth=5)
```
To train the model with stratified k-fold cross validation, we'll use the model_selection.cross_val_score function.
Train 10 variations of our model clf using stratified k-fold validation. Note that scikit-learn's cross_val_score does this type of validation by default. Run the cell containing the following code:
```
from sklearn.model_selection import cross_val_score
np.random.seed(1)
scores = cross_val_score(
    estimator=clf,
    X=X,
    y=y,
    cv=10)
print('accuracy = {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))
>> accuracy = 0.923 +/- 0.005
```
Note how we use np.random.seed to set the seed for the random number generator, ensuring reproducibility of the random sampling performed for each decision tree in the Random Forest. The folds themselves are not shuffled by default, so they are the same on every run.
With this method, we calculate the accuracy as the average over the folds. We can also see the individual accuracies for each fold by printing scores. To see these, run print(scores):
```
>> array([ 0.93404397,  0.91533333,  0.92266667,  0.91866667,  0.92133333,
           0.92866667,  0.91933333,  0.92      ,  0.92795197,  0.92128085])
```
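As a quick check of the claim that an integer cv means stratified k-fold for a classifier, the sketch below compares the two calls. It uses synthetic data from make_classification as a stand-in for the HR dataset and fixes random_state on the estimator (both are assumptions made for illustration), so that every fit is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the HR data (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

# Fixing random_state on the estimator makes each fit reproducible.
clf_demo = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=1)

# For a classifier, an integer cv is shorthand for an unshuffled StratifiedKFold.
scores_int = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
scores_skf = cross_val_score(clf_demo, X_demo, y_demo, cv=StratifiedKFold(n_splits=5))

print('accuracy = {:.3f} +/- {:.3f}'.format(scores_int.mean(), scores_int.std()))
print(np.allclose(scores_int, scores_skf))
```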
Using cross_val_score is very convenient, but it doesn't tell us about the accuracies within each class. We can do this manually with the model_selection.StratifiedKFold class. This class takes the number of folds as an initialization parameter, then the split method is used to build randomly sampled "masks" for the data. A mask is simply an array containing indexes of items in another array, where the items can then be returned by doing this: data[mask].
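To make the idea of a mask concrete, here is a minimal sketch on toy data (the arrays are made up for illustration). Each call to split yields a pair of integer index arrays, and indexing with one of them returns the matching rows. By default the folds are not shuffled; passing shuffle=True together with a random_state would randomize them.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 8 samples of class 0 and 4 samples of class 1.
data = np.arange(12).reshape(-1, 1)
labels = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
for train_mask, test_mask in skf.split(data, labels):
    # Each mask is an integer index array; data[mask] selects those rows.
    test_rows = data[test_mask]
    test_labels = labels[test_mask]
    print(test_mask, test_rows.ravel(), test_labels)
```

Because the split is stratified, every test fold keeps the 2:1 class ratio of the full toy dataset.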
Define a custom function for calculating k-fold cross validation class accuracies. Run the cell containing the following code:
```
from sklearn.model_selection import StratifiedKFold…
…
        print('fold: {:d} accuracy: {:s}'.format(k+1, str(class_acc)))
    return class_accuracy
```
Note
For the complete code, refer to the Lesson 2.txt file in the Lesson 2 folder.
We can then calculate the class accuracies with code that's very similar to step 4. Do this by running the cell containing the following code:
```
from sklearn.model_selection import cross_val_score
np.random.seed(1)…
…
>> fold: 10 accuracy: [ 0.98861646  0.70588235]
>> accuracy = [ 0.98722476  0.71715647] +/- [ 0.00330026  0.02326823]
```
Note
For the complete code, refer to the Lesson 2.txt file in the Lesson 2 folder.
Now we can see the class accuracies for each fold! Pretty neat, right?
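Since the workbook code is abbreviated here, the following is a rough, self-contained sketch of how per-class accuracies could be computed for each fold. The helper name per_class_accuracy_cv and the synthetic data are hypothetical, and this is not the workbook's cross_val_class_score. It assumes that "class accuracy" means the fraction of each true class that is predicted correctly, taken from the confusion matrix.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def per_class_accuracy_cv(clf, X, y, cv=10):
    """Hypothetical helper: per-class accuracy for each stratified fold."""
    kfold = StratifiedKFold(n_splits=cv)
    fold_scores = []
    for train, test in kfold.split(X, y):
        # Fit a fresh copy of the model on the training part of the fold.
        model = clone(clf).fit(X[train], y[train])
        cmat = confusion_matrix(y[test], model.predict(X[test]))
        # Diagonal = correct predictions; row sums = true samples per class.
        fold_scores.append(cmat.diagonal() / cmat.sum(axis=1))
    return np.array(fold_scores)

# Synthetic stand-in for the HR data (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=1)
scores_demo = per_class_accuracy_cv(clf_demo, X_demo, y_demo, cv=5)
print('accuracy = {} +/- {}'.format(scores_demo.mean(axis=0), scores_demo.std(axis=0)))
```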
Let's move on to show how a validation curve can be calculated using model_selection.validation_curve. This function uses stratified k-fold cross validation to train models for various values of a given parameter.
Do the calculations required to plot a validation curve by training Random Forests over a range of max_depth values. Run the cell containing the following code:
```
from sklearn.model_selection import validation_curve

clf = RandomForestClassifier(n_estimators=10)
max_depths = np.arange(3, 16, 3)
train_scores, test_scores = validation_curve(
    estimator=clf,
    X=X,
    y=y,
    param_name='max_depth',
    param_range=max_depths,
    cv=10)
```
This will return arrays with the cross validation scores for each model, where the models have different max depths. In order to visualize the results, we'll leverage a function provided in the scikit-learn documentation.
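Before plotting, it can help to look at the shape of what validation_curve returns. The sketch below runs the same kind of call on synthetic data (an assumption, used so the example runs on its own) and summarizes the validation scores per max_depth value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the HR data (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
depths = np.arange(3, 16, 3)  # 3, 6, 9, 12, 15

train_sc, test_sc = validation_curve(
    estimator=RandomForestClassifier(n_estimators=10, random_state=1),
    X=X_demo, y=y_demo,
    param_name='max_depth', param_range=depths, cv=5)

# One row per max_depth value, one column per fold.
print(train_sc.shape, test_sc.shape)
for depth, mean, std in zip(depths, test_sc.mean(axis=1), test_sc.std(axis=1)):
    print('max_depth={:2d}: validation accuracy = {:.3f} +/- {:.3f}'
          .format(int(depth), mean, std))
```

The row-wise mean and standard deviation are the quantities a validation curve plot draws as the line and its error band.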
Run the cell in which plot_validation_curve is defined. Then, run the cell containing the following code to draw the plot:
```
plot_validation_curve(train_scores, test_scores,
                      max_depths, xlabel='max_depth')
```
Recall how setting the max depth for decision trees limits the amount of overfitting? This is reflected in the validation curve, where we see overfitting taking place for large max depth values to the right. A good value for max_depth appears to be 6, where we see the training and validation accuracies in agreement. When max_depth is equal to 3, we see the model underfitting the data as training and validation accuracies are lower.

To summarize, we have learned and implemented two important techniques for building reliable predictive models. The first was k-fold cross-validation, which splits the data into several train/test batches and generates a set of accuracies. From this set, we calculated the average accuracy and the standard deviation as a measure of the error. This gives us a gauge of the variability of our model, so that we can report a trustworthy accuracy.

We also learned about another such technique to ensure we have trustworthy results: validation curves. These allow us to visualize when our model is overfitting based on comparing training and validation accuracies. By plotting the curve over a range of our selected hyperparameter, we are able to identify its optimal value.

In the final section of this lesson, we take everything we have learned so far and put it together in order to build our final predictive model for the employee retention problem. We seek to improve the accuracy, compared to the models trained thus far, by including all of the features from the dataset in our model. We'll see now-familiar topics such as k-fold cross-validation and validation curves, but we'll also introduce something new: dimensionality reduction techniques.

Subtopic C: Dimensionality Reduction Techniques

Dimensionality reduction can simply involve removing unimportant features from the training data, but more exotic methods exist, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). These techniques allow for data compression, where the most important information from a large group of features can be encoded in just a few features.

In this subtopic, we'll focus on PCA. This technique transforms the data by projecting it into a new subspace of orthogonal "principal components," where the components with the highest eigenvalues encode the most information for training the model. Then, we can simply select a few of these principal components in place of the original high-dimensional dataset. For example, PCA could be used to encode the information from every pixel in an image. In this case, the original feature space would have dimensions equal to the number of pixels in the image. This high-dimensional space could then be reduced with PCA, where the majority of useful information for training predictive models might be reduced to just a few dimensions. Not only does this save time when training and using models, it can also help them perform better by removing noise from the dataset.

As with the models we've seen, it's not necessary to have a detailed understanding of PCA in order to leverage the benefits. However, we'll dig into the technical details of PCA just a bit further so that we can conceptualize it better. The key insight of PCA is to identify patterns between features based on correlations, so the PCA algorithm calculates the covariance matrix and then decomposes it into eigenvectors and eigenvalues. The eigenvectors are then used to transform the data into a new subspace, from which a fixed number of principal components can be selected.
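The steps just described (covariance matrix, eigendecomposition, projection) can be written out with NumPy alone. This is an illustrative sketch on made-up toy data, not how scikit-learn implements PCA internally; the function name pca_numpy is hypothetical.

```python
import numpy as np

def pca_numpy(X, n_components):
    """Project X onto its top principal components (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components, eigvals[order]

rng = np.random.default_rng(0)
# Toy data: the second column is nearly a copy of the first (strong correlation).
x = rng.normal(size=200)
X_toy = np.column_stack([x, x + 0.1 * rng.normal(size=200), rng.normal(size=200)])
X_proj, top_eigvals = pca_numpy(X_toy, n_components=2)
print(X_proj.shape, top_eigvals)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the components with the highest eigenvalues are the ones worth keeping.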

In the following section, we'll see an example of how PCA can be used to improve our Random Forest model for the employee retention problem we have been working on. This will be done after training a classification model on the full feature space, to see how our accuracy is affected by dimensionality reduction.

Training a predictive model for the employee retention problem

We have already spent considerable effort planning a machine learning strategy, preprocessing the data, and building predictive models for the employee retention problem. Recall that our business objective was to help the client prevent employees from leaving. The strategy we decided upon was to build a classification model that would predict the probability of employees leaving. This way, the company can assess the likelihood of current employees leaving and take action to prevent it.

Given our strategy, we can summarize the type of predictive modeling we are doing as follows:

Supervised learning on labeled training data
Classification problems with two class labels (binary)

In particular, we are training models to determine whether an employee has left the company, given a set of continuous and categorical features. After preparing the data for machine learning in Activity A, Preparing to Train a Predictive Model for the Employee-Retention Problem, we went on to implement SVM, k-Nearest Neighbors, and Random Forest algorithms using just two features. These models were able to make predictions with over 90% overall accuracy. When looking at the specific class accuracies, however, we found that employees who had left (class-label 1) could only be predicted with 70-80% accuracy. Let's see how much this can be improved by utilizing the full feature space.

In the lesson-2-workbook.ipynb notebook, scroll down to the code for this section. We should already have the preprocessed data loaded from the previous sections, but this can be done again, if desired, by executing df = pd.read_csv('../data/hr-analytics/hr_data_processed.csv'). Then, print the DataFrame columns with print(df.columns).
Define a list of all the features by copying and pasting the output from df.columns into a new list (making sure to remove the target variable left). Then, define X and y as we have done before. This goes as follows:
```
features = ['satisfaction_level', 'last_evaluation', 'number_project','average_montly_hours', 'time_spend_company', 'work_accident',…
…
X = df[features].values
y = df.left.values
```
Note
For the complete code, refer to the Lesson 2.txt file in the Lesson 2 folder.
Looking at the feature names, recall what the values look like for each one. Scroll up to the set of histograms we made in the first activity to help jog your memory. The first two features are continuous; these are what we used for training models in the previous two exercises. After that, we have a few discrete features, such as number_project and time_spend_company, followed by some binary fields such as work_accident and promotion_last_5years. We also have a bunch of binary features, such as department_IT and department_accounting, which were created by one-hot encoding.
Given a mix of features like this, Random Forests are a very attractive type of model. For one thing, they're compatible with feature sets composed of both continuous and categorical data, but this is not particularly special; for instance, an SVM can be trained on mixed feature types as well (given proper preprocessing).
Note
If you're interested in training an SVM or k-Nearest Neighbors classifier on mixed-type input features, you can use the data-scaling prescription from this StackExchange answer: https://stats.stackexchange.com/questions/82923/mixing-continuous-and-binary-data-with-linear-svm/83086#83086.
A simple approach would be to preprocess data as follows:
- Standardize continuous variables
- One-hot-encode categorical features
- Shift binary values to -1 and 1 instead of 0 and 1
Then, the mixed-feature data could be used to train a variety of classification models.
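The preprocessing steps listed above might look as follows in pandas. The column names and values are hypothetical toy data, not the HR dataset, and the sketch covers only the three transformations in the list.

```python
import pandas as pd

# Toy frame with hypothetical column names (not the HR dataset).
toy = pd.DataFrame({
    'hours': [120.0, 160.0, 200.0, 240.0],   # continuous
    'dept': ['IT', 'sales', 'IT', 'hr'],     # categorical
    'accident': [0, 1, 0, 0],                # binary
})

# Standardize the continuous column to zero mean and unit variance.
toy['hours'] = (toy['hours'] - toy['hours'].mean()) / toy['hours'].std(ddof=0)

# Shift the binary column from {0, 1} to {-1, 1}.
toy['accident'] = 2 * toy['accident'] - 1

# One-hot encode the categorical column.
toy = pd.get_dummies(toy, columns=['dept'])
print(toy)
```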
We need to figure out the best parameters for our Random Forest model. Let's start by tuning the max_depth hyperparameter using a validation curve. Calculate the training and validation accuracies by running the following code:
```
%%time
np.random.seed(1)
clf = RandomForestClassifier(n_estimators=20)
max_depths = [3, 4, 5, 6, 7,
              9, 12, 15, 18, 21]
train_scores, test_scores = validation_curve(
    estimator=clf,
    X=X,
    y=y,
    param_name='max_depth',
    param_range=max_depths,
    cv=5)
```
We are testing 10 models with k-fold cross validation. By setting k = 5, we produce five estimates of the accuracy for each model, from which we extract the mean and standard deviation to plot in the validation curve. In total, we train 50 models, and since n_estimators is set to 20, we are training a total of 1,000 decision trees! All in roughly 10 seconds!
Plot the validation curve using our custom plot_validation_curve function from the last exercise. Run the following code:
```
plot_validation_curve(train_scores, test_scores,
                      max_depths, xlabel='max_depth');
```
For small max depths, we see the model underfitting the data. Total accuracies dramatically increase by allowing the decision trees to be deeper and encode more complicated patterns in the data. As the max depth is increased further and the accuracy approaches 100%, we find the model overfits the data, causing the training and validation accuracies to grow apart. Based on this figure, let's select a max_depth of 6 for our model.
We should really do the same for n_estimators, but in the spirit of saving time, we'll skip it. You are welcome to plot it on your own; you should find agreement between training and validation sets for a large range of values. Usually, it's better to use more decision tree estimators in the Random Forest, but this comes at the cost of increased training times. We'll use 200 estimators to train our model.
Use cross_val_class_score, the k-fold cross validation by class function we created earlier, to test the selected model, a Random Forest with max_depth = 6 and n_estimators = 200:
```
np.random.seed(1)
clf = RandomForestClassifier(n_estimators=200, max_depth=6)
scores = cross_val_class_score(clf, X, y)
print('accuracy = {} +/- {}'\
      .format(scores.mean(axis=0), scores.std(axis=0)))
>> accuracy = [ 0.99553722  0.85577359] +/- [ 0.00172575  0.02614334]
```
The accuracies are way higher now that we're using the full feature set, compared to before when we only had the two continuous features!
Visualize the accuracies with a boxplot by running the following code:
```
fig = plt.figure(figsize=(5, 7))
sns.boxplot(data=pd.DataFrame(scores, columns=[0, 1]),
            palette=sns.color_palette('Set1'))
plt.xlabel('Left')
plt.ylabel('Accuracy')
```
Random Forests can provide an estimate of the feature importances.
Note
The feature importance in scikit-learn is calculated based on how the node impurity changes with respect to each feature. For a more detailed explanation, take a look at the following StackOverflow thread about how feature importance is determined in Random Forest Classifier: https://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined.
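As a small illustration of the feature_importances_ attribute, the sketch below fits a Random Forest on synthetic data in which only the first two columns carry signal. The data and the feature names are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 2 informative features in the first two columns.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0)
names = ['informative_0', 'informative_1'] + ['noise_{}'.format(i) for i in range(4)]

forest = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1)
forest.fit(X_demo, y_demo)  # feature_importances_ exists only after fitting

# The importances are normalized, so they sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=names).sort_values()
print(importances)
```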
Plot the feature importance, as stored in the attribute feature_importances_, by running the following code:
```
pd.Series(clf.feature_importances_, name='Feature importance',
          index=df[features].columns)\
    .sort_values()\
    .plot.barh()
plt.xlabel('Feature importance')
```
It doesn't look like we're getting much in the way of useful contribution from the one-hot encoded variables: department and salary. Also, the promotion_last_5years and work_accident features don't appear to be very useful.
Let's use Principal Component Analysis (PCA) to condense all of these weak features into just a few principal components.
Import the PCA class from scikit-learn and transform the features. Run the following code:
```
from sklearn.decomposition import PCA
pca_features = \…
…
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_reduce)
```
Note
For the complete code, refer to the Lesson 2.txt file in the Lesson 2 folder.

Look at the string representation of X_pca by typing it alone and executing the cell:

>> array([[-0.67733089,  0.75837169, -0.10493685],
>>       [ 0.73616575,  0.77155888, -0.11046422],
>>       [ 0.73616575,  0.77155888, -0.11046422],
>>       ..., 
>>       [-0.67157059, -0.3337546 ,  0.70975452],
>>       [-0.67157059, -0.3337546 ,  0.70975452],
>>       [-0.67157059, -0.3337546 ,  0.70975452]])

Since we asked for the top three components, each row of X_pca contains three values, one per principal component.
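The shape of the transformed array can be confirmed on a stand-in dataset. The random 0/1 matrix below is an assumption that mimics a block of binary features, not the actual X_reduce from the workbook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the weak binary features: 100 rows, 10 random 0/1 columns.
X_binary = rng.integers(0, 2, size=(100, 10)).astype(float)

pca_demo = PCA(n_components=3)
X_demo_pca = pca_demo.fit_transform(X_binary)

print(X_demo_pca.shape)                     # rows are samples, columns are components
print(pca_demo.explained_variance_ratio_)   # fraction of variance kept per component
```

Transposing the result, as in X_demo_pca.T[0], gives one component's values for every sample, which is the form needed to store a component as a DataFrame column.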

Add the new features to our DataFrame with the following code:

df['first_principle_component'] = X_pca.T[0]
df['second_principle_component'] = X_pca.T[1]
df['third_principle_component'] = X_pca.T[2]

Select our reduced-dimension feature set to train a new Random Forest with. Run the following code:

features = ['satisfaction_level', 'number_project', 'time_spend_company',
            'average_montly_hours', 'last_evaluation',
            'first_principle_component',
            'second_principle_component',
            'third_principle_component']
X = df[features].values
y = df.left.values

Assess the new model's accuracy with k-fold cross validation. This can be done by running the same code as before, where X now points to different features. The code is as follows:

np.random.seed(1)
clf = RandomForestClassifier(n_estimators=200, max_depth=6)
scores = cross_val_class_score(clf, X, y)
print('accuracy = {} +/- {}'\
		.format(scores.mean(axis=0), scores.std(axis=0)))
>> accuracy = [ 0.99562463 0.90618594] +/- [ 0.00166047 0.01363927]

Visualize the result in the same way as before, using a box plot. The code is as follows:
```
fig = plt.figure(figsize=(5, 7))
sns.boxplot(data=pd.DataFrame(scores, columns=[0, 1]),
            palette=sns.color_palette('Set1'))
plt.xlabel('Left')
plt.ylabel('Accuracy')
```
Comparing this to the previous result, we find an improvement in the class 1 accuracy! Now, the majority of the validation sets return an accuracy greater than 90%. The average accuracy of 90.6% can be compared to the accuracy of 85.6% prior to dimensionality reduction!
Let's select this as our final model. We'll need to re-train it on the full sample space before using it in production.

Train the final predictive model by running the following code:

np.random.seed(1)
clf = RandomForestClassifier(n_estimators=200, max_depth=6)
clf.fit(X, y)

Save the trained model to a binary file using joblib.dump. Run the following code:
```
# sklearn.externals.joblib was removed in scikit-learn 0.23;
# import the standalone joblib package instead.
import joblib
joblib.dump(clf, 'random-forest-trained.pkl')
```
Check that it's saved into the working directory, for example, by running: !ls *.pkl. Then, test that we can load the model from the file by running the following code:
```
clf = joblib.load('random-forest-trained.pkl')
```
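A quick way to convince ourselves that saving and loading preserves the model is to compare predictions before and after. The sketch below does this on synthetic data and writes to a temporary directory; the file name and data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the HR data (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=20, max_depth=6, random_state=1)
model.fit(X_demo, y_demo)

# Write to a temporary directory so the working directory stays clean.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-forest.pkl')  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```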
Congratulations! We've trained the final predictive model! Now, let's see an example of how it can be used to provide business insights for the client.
Say we have a particular employee, who we'll call Sandra. Management has noticed she is working very hard and reported low job satisfaction in a recent survey. They would therefore like to know how likely it is that she will quit.
For the sake of simplicity, let's take her feature values as a sample from the training set (but pretend that this is unseen data instead).

List the feature values for Sandra by running the following code:

sandra = df.iloc[573]
X = sandra[features]
X
>> satisfaction_level              0.360000
>> number_project                  2.000000
>> time_spend_company              3.000000
>> average_montly_hours          148.000000
>> last_evaluation                 0.470000
>> first_principle_component       0.742801
>> second_principle_component     -0.514568
>> third_principle_component      -0.677421

The next step is to ask the model which group it thinks she should be in.

Predict the class label for Sandra by running the following code:
```
clf.predict([X])
>> array([1])
```
The model classifies her as having already left the company; not a good sign! We can take this a step further and calculate the probabilities of each class label.
Use clf.predict_proba to get the probability the model assigns to each class label for Sandra. Run the following code:
```
clf.predict_proba([X])
>> array([[ 0.06576239,  0.93423761]])
```
We see the model predicting that she has quit with 93% probability.
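The relationship between predict and predict_proba can be checked on a small synthetic example (the data here is an assumption, not the HR dataset). For a Random Forest, the predicted label is the class with the highest predicted probability, and the probabilities in each row sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the HR data (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=20, max_depth=6, random_state=1)
model.fit(X_demo, y_demo)

sample = X_demo[:1]                  # a single row, kept 2-D as scikit-learn expects
proba = model.predict_proba(sample)  # one column per class, ordered as model.classes_
label = model.predict(sample)

print(model.classes_, proba, label)
```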
Since this is clearly a red flag for management, they decide on a plan to reduce her number of monthly hours to 100 and the time spent at the company to 1.
Calculate the new probabilities with Sandra's newly planned metrics. Run the following code:
```
X.average_montly_hours = 100
X.time_spend_company = 1
clf.predict_proba([X])
>> array([[ 0.61070329,  0.38929671]])
```
Excellent! We can now see that the model returns a mere 39% likelihood that she has quit! Instead, it now predicts she will not have left the company.

Our model has allowed management to make a data-driven decision. By reducing her monthly hours and her time spent at the company by these particular amounts, management can expect, according to the model, that she will most likely remain an employee at the company!