In our example database, the VISIT table contains the diagnostic codes for each visit. Although they didn't get their own table in our example, diagnostic codes are among the most important pieces of information for many analytics problems. First, they allow us to select the observations that are relevant to our model. For example, if we were building a model to predict malignant cancers, we would need the diagnostic codes to tell us which patients have cancer and to filter out the other patients. Second, they often serve as good predictor variables (Futoma et al., 2015). For example, as we will see in Chapter 7, Making Predictive Models in Healthcare, many chronic diseases substantially increase the likelihood of poor healthcare outcomes. Clearly, we must leverage the information given to us in the diagnostic codes to optimize our predictive models.
We will introduce two transformations for coded variables here. The first transformation, binning, converts the categorical variable into a series of binary variables, one for each specific diagnosis of interest. The second transformation, aggregating, groups many of the binned binary variables into a single binary or numerical variable. These transformations apply not only to diagnostic codes, but to procedure, medication, and laboratory codes as well. The following are examples of both of these transformations.
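As a quick orientation, here is a minimal sketch of both transformations in plain Python. It rests on several assumptions that do not come from the example database: the codes are taken to be ICD-10 diagnosis codes, each visit is taken to carry a list of them, and the names `visits`, `dx_codes`, `BINS`, `bin_codes`, and `aggregate` are hypothetical. The sample rows are made up purely for illustration.

```python
# Hypothetical visit rows; field names and values are invented for illustration.
visits = [
    {"visit_id": 1, "dx_codes": ["I50.9", "E11.9"]},
    {"visit_id": 2, "dx_codes": ["J45.909"]},
    {"visit_id": 3, "dx_codes": ["E11.65", "I10"]},
    {"visit_id": 4, "dx_codes": []},
]

# Binning: one binary variable per diagnosis of interest.
# Assumption: codes are ICD-10 and a diagnosis is matched by code prefix.
BINS = {
    "has_chf": ("I50",),             # heart failure
    "has_diabetes": ("E10", "E11"),  # type 1 and type 2 diabetes mellitus
    "has_hypertension": ("I10",),    # essential hypertension
}

# Aggregating: the binned variables that will be rolled up into one group.
CARDIOMETABOLIC = ("has_chf", "has_diabetes", "has_hypertension")


def bin_codes(codes, bins=BINS):
    """Return a 0/1 flag per bin: 1 if any code starts with one of its prefixes."""
    return {
        name: int(any(code.startswith(prefixes) for code in codes))
        for name, prefixes in bins.items()
    }


def aggregate(binned, members=CARDIOMETABOLIC):
    """Collapse several binary flags into one count and one binary variable."""
    count = sum(binned[name] for name in members)
    return {"n_cardiometabolic": count, "any_cardiometabolic": int(count > 0)}


features = []
for visit in visits:
    row = {"visit_id": visit["visit_id"], **bin_codes(visit["dx_codes"])}
    row.update(aggregate(row))
    features.append(row)

for row in features:
    print(row)
```

Matching on a code prefix is only one possible design choice; an exact lookup against a curated list of codes would work the same way. The same pattern carries over to procedure, medication, and laboratory codes by swapping in a different prefix table.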