In this section, we will be focusing on data preprocessing which includes data cleaning, transformation, and normalizations if required. Basically, we perform operations to get the data ready before we start performing any analysis on it.

We had mentioned earlier that by default all the attributes of the dataset had been declared as int, which is a numeric type by R, but it is not so in this case and we have to change that based on the variable semantics and values. If you have taken a basic course on statistics, you might know that usually we deal with two types of variables:

Since all the variables in our dataset have been converted to numeric by default, we will only need to convert the categorical variables from numeric data types to factors, which is a nice way to represent categorical variables in R.

The numeric variables in our dataset include credit.duration.months, credit.amount, and age and we will not need to perform any conversions. However, the remaining 18 variables are all categorical and we will be using the following utility function to convert their data types:

This function will be used on our existing data frame credit.df as follows for transforming the variable data types:

Now we can see the attribute details in the data frame with the transformed data types for the selected categorical variables in the following output. You will notice that out of the 21 variables/attributes of the dataset, 18 of them have been successfully transformed into categorical variables.

Datatype conversions

This brings an end to the data preprocessing step and we will now dive into analyzing our dataset.