Let's begin with data at the nominal level. The main method we have is to transform our categorical data into dummy variables. We have two options to do this:
- Utilize pandas to automatically find the categorical variables and dummy code them
- Create our own custom transformer using dummy variables to work in a pipeline
Before we delve into these options, let's go over exactly what dummy variables are.
Dummy variables take the value zero or one to indicate the absence or presence of a category. They are proxy variables, or numerical stand-ins, for qualitative data.
Consider a simple regression analysis for wage determination. Say we are given gender, which is qualitative, and years of education, which is quantitative. To see whether gender has an effect on wages, we would dummy code gender as female = 1 when the person is female and female = 0 when the person is male.
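To make this concrete, here is a minimal sketch of that encoding in pandas; the wages DataFrame and its values are made up purely for illustration:

import pandas as pd

# hypothetical wage data: gender is qualitative, years of education is quantitative
wages = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'years_education': [12, 16, 18, 12],
    'wage': [30.0, 45.0, 60.0, 35.0]
})

# dummy code gender: female = 1 when the person is female, 0 otherwise
wages['female'] = (wages['gender'] == 'female').astype(int)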
When working with dummy variables, it is important to be aware of and avoid the dummy variable trap. The dummy variable trap occurs when the independent variables are multicollinear, or highly correlated; simply put, one variable can be predicted from the others. In our gender example, we would fall into the trap by including both female (0|1) and male (0|1) as dummies, essentially creating a duplicate category: a female value of 0 already tells us the person is male.
To avoid the dummy variable trap, simply leave out the constant term or one of the dummy categories. The left-out dummy then becomes the base category against which the rest are compared.
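pandas can handle this for us: get_dummies accepts a drop_first flag that leaves out the first level of each variable so that it becomes the base category. A minimal sketch, reusing the hypothetical wages DataFrame from above:

# drop_first omits the first level ('female' here), which becomes the base
# category; the resulting 'male' column is interpreted relative to it
pd.get_dummies(wages['gender'], drop_first=True)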
Let's come back to our dataset and employ some methods to encode our categorical data into dummy variables. pandas has a handy get_dummies function that finds all of the categorical variables and dummy codes them for us:
pd.get_dummies(X,
               columns=['city', 'boolean'],  # which columns to dummify
               prefix_sep='__')  # the separator between the prefix (column name) and cell value
We have to be sure to specify which columns we want to apply this to, because it would otherwise dummy code the ordinal columns as well, and that wouldn't make much sense. We will take a more in-depth look at why dummy coding ordinal data doesn't make sense shortly.
Our data, with our dummy coded columns, now looks like this:
|   | ordinal_column | quantitative_column | city__london | city__san francisco | city__seattle | city__tokyo | boolean__no | boolean__yes |
|---|----------------|---------------------|--------------|---------------------|---------------|-------------|-------------|--------------|
| 0 | somewhat like  | 1.0                 | 0            | 0                   | 0             | 1           | 0           | 1            |
| 1 | like           | 11.0                | 0            | 0                   | 0             | 0           | 1           | 0            |
| 2 | somewhat like  | -0.5                | 1            | 0                   | 0             | 0           | 0           | 0            |
| 3 | like           | 10.0                | 0            | 0                   | 1             | 0           | 1           | 0            |
| 4 | somewhat like  | NaN                 | 0            | 1                   | 0             | 0           | 1           | 0            |
| 5 | dislike        | 20.0                | 0            | 0                   | 0             | 1           | 0           | 1            |
Our other option for dummy coding our data is to create our own custom dummifier. Creating this allows us to set up a pipeline to transform our whole dataset in one go.
Once again, we will use the same structure as our previous two custom imputers. Here, our transform method will use the handy pandas get_dummies function to create dummy variables for the specified columns. The only parameter in our custom dummifier is cols:
# create our custom dummifier
from sklearn.base import TransformerMixin

class CustomDummifier(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

    # use pandas get_dummies to dummy code the specified columns
    def transform(self, X):
        return pd.get_dummies(X, columns=self.cols)

    # nothing to learn during fit, so just return self
    def fit(self, *_):
        return self
Our custom dummifier mimics scikit-learn's OneHotEncoder, but with the added advantage of working on our entire DataFrame.
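Because it implements both fit and transform, our dummifier can be used on its own or dropped straight into a scikit-learn Pipeline. A minimal sketch, assuming X is the DataFrame from the table above (the step name 'dummify' is just an illustrative label):

from sklearn.pipeline import Pipeline

cd = CustomDummifier(cols=['city', 'boolean'])
cd.fit_transform(X)  # produces the dummy coded DataFrame shown above

# because it implements fit and transform, it slots straight into a pipeline
pipe = Pipeline([('dummify', CustomDummifier(cols=['city', 'boolean']))])
pipe.fit_transform(X)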