Part of what makes machine learning in healthcare so difficult is how prone healthcare data is to being incomplete. Inpatient data collection often depends on nurses and other clinical staff to document thoroughly, and given how busy they are, it is no wonder that many inpatient datasets have features, such as fluid intake and urine output or timestamps of medication administrations, that are reported inconsistently. Diagnosis codes are another example: a patient may qualify for a dozen diagnoses, but in the interest of time, the outpatient physician enters only five into the chart. When details such as these are left out of our data, our models will be that much less accurate when applied to real patients.
Even more problematic than the lack of detail is the effect of missing data on our algorithms. Even one missing value in a dataframe that consists of thousands of patients and hundreds of features can prevent many models from running successfully. A quick fix might be simply to impute a zero wherever a value is missing. But if the variable is a hemoglobin lab value, a hemoglobin of 0.0 is physiologically impossible. Should we impute the missing data with the mean hemoglobin value instead? Do we use the overall mean or the gender-specific mean? Questions such as these are the reason why dealing with missing data is practically a data science field in itself. The importance of having a basic awareness of the missing data in your dataset cannot be overemphasized. In particular, it is important to know the difference between zero-valued data and missing data. It also helps to gain some familiarity with concepts such as zero, NaN ("not a number"), NULL ("missing"), and NA ("not available"), and with how they are expressed in your languages of choice, whether SQL, Python, R, or some other language.
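To make the imputation question concrete, here is a minimal sketch in Python using pandas and NumPy. The table, its column names (`patient_id`, `sex`, `hemoglobin`), and the values are invented for illustration and are not from any real dataset. The sketch compares three of the options raised above: filling with zero, filling with the overall mean, and filling with a group-specific mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: hemoglobin in g/dL. NaN marks a value that was never recorded.
labs = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "hemoglobin": [12.0, 13.0, np.nan, 15.0, 16.0, np.nan],
})

# Missing is not the same as zero: NaN is not even equal to itself,
# so missing values are counted with isna() rather than with ==.
n_missing = int(labs["hemoglobin"].isna().sum())

# Option 1: impute zero. The model will run, but the value is impossible
# and it drags the column mean down.
zero_filled = labs["hemoglobin"].fillna(0.0)

# Option 2: impute the overall mean. pandas skips NaN when computing a mean.
overall_mean = labs["hemoglobin"].mean()
overall_filled = labs["hemoglobin"].fillna(overall_mean)

# Option 3: impute a group-specific mean. transform("mean") returns one
# group mean per row, aligned to the original index.
group_means = labs.groupby("sex")["hemoglobin"].transform("mean")
group_filled = labs["hemoglobin"].fillna(group_means)

print("missing values:", n_missing)
print("mean after zero fill:", round(zero_filled.mean(), 2))
print("overall mean used for fill:", overall_mean)
print("group-mean fill:", group_filled.tolist())
```

None of these options is correct in general. The point of the sketch is only that each choice produces a different dataset, so the choice should be made deliberately and documented.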
The final goal of the data-cleansing stage is usually a single data frame: a data structure that organizes the data into a matrix-like object of rows and columns, in which rows represent individual events or observations and columns hold the different features of those observations, each with its own data type. In an ideal world, all of the variables will have been inspected and converted to the appropriate type, and there will be no missing data. Note that there may be some back-and-forth iteration between data cleansing, exploration, visualization, and feature selection before this milestone is reached. Data exploration/visualization and feature selection are the two pipeline steps that we'll discuss next.
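As a rough illustration of what this milestone can look like in practice, the sketch below uses pandas to convert a small raw extract to appropriate types and then checks for missing values. The column names, values, and the assumption that every field arrives as text are hypothetical.

```python
import pandas as pd

# Hypothetical raw extract in which every column arrives as text.
raw = pd.DataFrame({
    "patient_id": ["101", "102", "103"],
    "admit_time": ["2020-01-05 08:30", "2020-01-06 14:10", "2020-01-07 23:45"],
    "hemoglobin": ["13.2", "11.8", "14.1"],
    "diabetic": ["yes", "no", "yes"],
})

# Convert each variable to a type that matches its meaning.
clean = pd.DataFrame({
    "patient_id": raw["patient_id"],  # identifier, left as text
    "admit_time": pd.to_datetime(raw["admit_time"], format="%Y-%m-%d %H:%M"),
    "hemoglobin": pd.to_numeric(raw["hemoglobin"]),
    "diabetic": raw["diabetic"].map({"yes": True, "no": False}),
})

# Inspect the result: one type per column, and a count of missing values.
missing_per_column = clean.isna().sum()
total_missing = int(missing_per_column.sum())

print(clean.dtypes)
print(missing_per_column)
```

A check like this is worth rerunning after each cleansing pass, since a value that fails to convert or map (for example, an unexpected "unknown" in the `diabetic` column) would surface here as a new missing value.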