In some cases, the response variable that we are trying to predict may already be a separate well-defined column. In those cases, simply converting the response from a string to a numeric type before splitting the data into train and test sets will suffice.
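As a minimal sketch of that simpler case, the following uses a hypothetical toy DataFrame whose response column, OUTCOME, is stored as strings. The column names, the values, and the pandas-only 80/20 split are assumptions made for illustration and are not taken from the ED dataset.

```python
import pandas as pd

# Hypothetical toy data: 'OUTCOME' is an already-defined response stored as strings.
df = pd.DataFrame({
    'AGE': [34, 71, 52, 18, 66],
    'OUTCOME': ['0', '1', '0', '0', '1'],
})

# Convert the response from string to a numeric dtype.
df['OUTCOME'] = pd.to_numeric(df['OUTCOME'])

# Simple 80/20 split using pandas only (a stand-in for any splitting utility).
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

print(df.dtypes)
print(len(train), len(test))
```

Any splitting utility could replace the `sample`/`drop` pair here; the point is only that the conversion happens before the split, so both sets share the same numeric response.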
In our specific modeling task, we are trying to predict which patients presenting to the ED will eventually be hospitalized. Here, hospitalization encompasses:
- Those admitted to an inpatient ward for further evaluation and treatment
- Those transferred to a different hospital (either psychiatric or non-psychiatric) for further treatment
- Those admitted to the observation unit for further evaluation (whether they are eventually admitted or discharged after their observation unit stay)
Accordingly, we must do some data wrangling to assemble all of these various outcomes into a single response variable:
    response_cols = ['ADMITHOS','TRANOTH','TRANPSYC','OBSHOS','OBSDIS']
    df_ed.loc[:, response_cols] = df_ed.loc[:, response_cols].apply(pd.to_numeric)
    df_ed['ADMITTEMP'] = df_ed[response_cols].sum(axis=1)
    df_ed['ADMITFINAL'] = 0
    df_ed.loc[df_ed['ADMITTEMP'] >= 1, 'ADMITFINAL'] = 1
    df_ed.drop(response_cols, axis=1, inplace=True)
    df_ed.drop('ADMITTEMP', axis=1, inplace=True)
Let's discuss the previous code example in detail:
- The first line lists, by name, the columns we would like to combine into our final target variable. The target should equal 1 if the value of any of those columns is 1.
- In Line 2, we convert those columns from string to numeric type.
- In Lines 3-5, we create a column called ADMITTEMP that contains the row-wise sum of the five target columns. We then create our final target column, ADMITFINAL, initialize it to 0, and set it to 1 wherever ADMITTEMP is >= 1.
- In Lines 6-7, we drop the five original response columns as well as the ADMITTEMP column since we now have our final response column.
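To see the effect of these steps on concrete rows, here is a small self-contained sketch. The DataFrame df_toy and its values are hypothetical, and the sketch assumes each outcome column is coded as the strings '0' and '1'. Rather than summing into a temporary column, it flags rows where any outcome equals 1, which gives the same result as the sum-based approach under that 0/1 coding assumption.

```python
import pandas as pd

response_cols = ['ADMITHOS', 'TRANOTH', 'TRANPSYC', 'OBSHOS', 'OBSDIS']

# Hypothetical toy rows; real values come from the ED dataset.
df_toy = pd.DataFrame({
    'AGE':      [34, 71, 52, 18],
    'ADMITHOS': ['0', '1', '0', '0'],
    'TRANOTH':  ['0', '0', '0', '0'],
    'TRANPSYC': ['0', '0', '1', '0'],
    'OBSHOS':   ['0', '1', '0', '0'],
    'OBSDIS':   ['0', '0', '0', '0'],
})

# Convert strings to numbers, then flag rows where any outcome equals 1.
numeric = df_toy[response_cols].apply(pd.to_numeric)
df_toy['ADMITFINAL'] = numeric.eq(1).any(axis=1).astype(int)

# Drop the original outcome columns, leaving a single response column.
df_toy = df_toy.drop(columns=response_cols)

print(df_toy)
```

The second row has two outcome flags set and still maps to a single 1, which is the behavior the ">= 1" comparison on ADMITTEMP is meant to guarantee.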