We discussed the intuition behind logistic regression models, along with their basics, in Chapter 3, Machine Learning Foundations. To build a model on our training set, we use the following code:
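    import pandas as pd  # needed for the coefficient table below
    from sklearn.linear_model import LogisticRegression

    # A list of classifiers, so that more models can be added later
    clfs = [LogisticRegression()]

    for clf in clfs:
        # Fit the model on the training data
        clf.fit(X_train, y_train.ravel())
        print(type(clf))

        # score() returns the mean accuracy on the given data and labels
        print('Training accuracy: ' + str(clf.score(X_train, y_train)))
        print('Validation accuracy: ' + str(clf.score(X_test, y_test)))

        # Pair each feature name with its fitted coefficient
        coefs = {
            'column': [X_train_cols[i] for i in range(len(X_train_cols))],
            'coef': [clf.coef_[0,i] for i in range(len(X_train_cols))]
        }
        df_coefs = pd.DataFrame(coefs)

        # Print the features from the largest to the smallest coefficient
        print(df_coefs.sort_values('coef', axis=0, ascending=False))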
Prior to the for loop, we import the LogisticRegression class and create clfs, a list containing a single LogisticRegression instance. The training and testing occur in the for loop, where clf refers to that instance. First, we use the fit() method to fit the model (that is, to determine the optimal coefficients) using the training data. Next, we use the score() method to assess the model performance on both the training data and the testing data.
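To make the role of fit() and score() concrete, here is a minimal sketch on synthetic data. The dataset from make_classification and the names demo_clf, manual_acc, and score_acc are assumptions for illustration only; they are not part of the chapter's dataset or code. The sketch shows that, for a classifier, score() is the fraction of predictions that match the true labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the chapter's data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

demo_clf = LogisticRegression(max_iter=1000)
demo_clf.fit(X_tr, y_tr)  # estimates the coefficients and the intercept

# Accuracy computed by hand: share of predictions equal to the true labels
manual_acc = np.mean(demo_clf.predict(X_te) == y_te)

# Accuracy as reported by score()
score_acc = demo_clf.score(X_te, y_te)

print(manual_acc, score_acc)
```

Because score() reports plain accuracy, it can look favorable when one class is much more common than the other, so it is worth knowing the class balance when reading these numbers.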
In the second half of the for loop, we print the coefficient value for each feature. In general, the features whose coefficients are farthest from zero are the ones most strongly associated, positively or negatively, with the outcome. However, we did not scale the data prior to training, so coefficient magnitudes are not directly comparable across features: an important predictor measured on a large numeric scale can end up with a small coefficient.
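The effect of scaling on coefficient size can be sketched with a small synthetic example. Everything here is an assumption made for illustration: the two simulated predictors (strong and weak), the factor of 100, and the use of StandardScaler in a pipeline are not taken from the chapter's code. The strong predictor is stored on a scale 100 times larger than the weak one, which shrinks its unscaled coefficient even though it matters more.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
strong = rng.normal(size=n)  # hypothetical strong predictor
weak = rng.normal(size=n)    # hypothetical weak predictor

# Simulate a binary outcome that depends mostly on the strong predictor
logit = 2.0 * strong + 0.5 * weak
y_demo = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Store the strong predictor on a numeric scale 100 times larger
X_demo = np.column_stack([strong * 100.0, weak])

# Same model, fitted without and with standardization
raw_model = LogisticRegression(max_iter=5000).fit(X_demo, y_demo)
scaled_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=5000)
).fit(X_demo, y_demo)

raw_coefs = raw_model.coef_[0]
scaled_coefs = scaled_model.named_steps['logisticregression'].coef_[0]
print('unscaled:', raw_coefs)
print('scaled:  ', scaled_coefs)
```

In this sketch, standardizing the features before fitting puts the coefficients on a common footing, so their magnitudes can be ranked against each other more safely.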
The output of the code should look like the following:
    <class 'sklearn.linear_model.logistic.LogisticRegression'>
    Training accuracy: 0.888978581423
    Validation accuracy: 0.884261501211
             coef  column
    346  2.825056  rfv_Symptoms_of_onset_of_labor
    696  1.618454  rfv_Adverse_effect_of_drug_abuse
    95   1.467790  rfv_Delusions_or_hallucinations
    108  1.435026  rfv_Other_symptoms_or_problems_relating_to_psy...
    688  1.287535  rfv_Suicide_attempt
    895  1.265043  IMMEDR_01
    520  1.264023  rfv_General_psychiatric_or_psychological_exami...
    278  1.213235  rfv_Jaundice
    712  1.139245  rfv_For_other_and_unspecified_test_results
    469  1.084806  rfv_Other_heart_disease
...
First, let's discuss the performance on the training and testing sets. The two accuracies are close together, which indicates that the model did not overfit to the training set. The accuracy is approximately 88%, which is on par with the performance reported in research studies (Cameron et al., 2015) for predicting emergency department status.
Looking at the coefficients, we can confirm that they make intuitive sense. The feature with the highest coefficient is related to the onset of labor in pregnancy; as we would expect, labor results in a hospital admission. Many of the other top features pertain to severe psychiatric disease, which almost always results in an admission due to the risk the patient poses to themselves or to others. The IMMEDR_01 feature also has a high coefficient; remember that this feature corresponds to a value of 1 on the triage scale, which is the most critical value. The bottom of the sorted output looks like the following:
...
    898 -0.839861  IMMEDR_04
    823 -0.848631  BEDDATA_03
    625 -0.873828  rfv_Hand_and_fingers
    371 -0.960739  rfv_Skin_rash
    188 -0.963524  rfv_Earache_pain_
    217 -0.968058  rfv_Soreness
    899 -1.019763  IMMEDR_05
    604 -1.075670  rfv_Suture__insertion_removal
    235 -1.140021  rfv_Toothache
    30  -1.692650  LEFTATRI
In contrast, scrolling to the bottom reveals some of the features that are negatively correlated with an admission. Having a toothache and needing sutures to be removed show up here, and they are not likely to result in admissions since they are not urgent complaints.
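One common way to read logistic regression coefficients is to exponentiate them: exp(coef) is the factor by which the odds of the outcome are multiplied for a one-unit increase in that feature, holding the other features fixed. The sketch below applies this to a few coefficients copied from the output above. The example_coefs DataFrame and the odds_ratio column are names introduced here for illustration; the chapter's code does not compute them.

```python
import numpy as np
import pandas as pd

# A few coefficients copied from the printed output in this section
example_coefs = pd.DataFrame({
    'column': [
        'rfv_Symptoms_of_onset_of_labor',
        'IMMEDR_01',
        'rfv_Toothache',
        'LEFTATRI',
    ],
    'coef': [2.825056, 1.265043, -1.140021, -1.692650],
})

# exp(coef) is the multiplicative change in the odds per one-unit increase
example_coefs['odds_ratio'] = np.exp(example_coefs['coef'])

print(example_coefs)
```

Positive coefficients give odds ratios above 1 and negative coefficients give odds ratios below 1. Since the features here were not scaled, each ratio describes a one-unit change in that feature's original units, so the same caution about comparing magnitudes across features applies.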
We've trained our first model. Let's see whether some of the more complex models will improve on this performance.