Data preprocessing

This section covers data preprocessing, which includes data cleaning, transformation, and normalization where required. In short, we get the data ready before performing any analysis on it.

Dealing with missing values

The data you work with will sometimes have missing values, which R represents as NA. There are several ways to detect them, and we show two of them next.

> # check if data frame contains NA values
> sum(is.na(credit.df))
[1] 0
> 
> # check if total records reduced after removing rows with NA 
> # values
> sum(complete.cases(credit.df))
[1] 1000

The is.na function returns a logical value for every element, so summing its result counts the NA values in the dataset. The complete.cases function takes a different route: it returns a logical vector indicating, for each row, whether the row is complete, that is, free of NA values. If the number of complete rows is lower than the number of rows in the original dataset, you know that some values are missing. Fortunately, in our case, there are no NA values and all 1000 rows are complete.

If you do encounter missing values in the future, there are various ways to deal with them. You can remove the affected rows, for example by subsetting with complete.cases, or fill the gaps with a substitute value such as the mean or the most frequent value. The latter approach is known as missing value imputation, and the right choice depends on the type of variable and the domain you are dealing with. Hence, we won't focus too much on this area here.
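As a minimal sketch of the two options just described, the following uses a small toy data frame. The name toy.df and its values are hypothetical and are not part of the credit dataset; which imputation rule is appropriate depends on your data.

```r
# toy data frame with hypothetical values: one NA in each column
toy.df <- data.frame(
  age     = c(25, NA, 40, 35),
  savings = c("low", "high", NA, "low"),
  stringsAsFactors = FALSE
)

# count the missing values in each column
na.counts <- colSums(is.na(toy.df))

# option 1: keep only the rows that have no NA values
complete.df <- toy.df[complete.cases(toy.df), ]

# option 2: impute the missing values
imputed.df <- toy.df
# numeric column: replace NA with the mean of the observed values
imputed.df$age[is.na(imputed.df$age)] <- mean(toy.df$age, na.rm = TRUE)
# categorical column: replace NA with the most frequent observed value
most.frequent <- names(which.max(table(toy.df$savings)))
imputed.df$savings[is.na(imputed.df$savings)] <- most.frequent
```

Removing rows is simple but discards data, whereas imputation keeps every row at the cost of introducing estimated values.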

Datatype conversions

We mentioned earlier that, by default, R declared all the attributes of the dataset as int, which is a numeric type. That does not match the meaning of many of the attributes, so we have to change their types based on the variable semantics and values. If you have taken a basic course on statistics, you may know that we usually deal with two types of variables:

Numeric variables: The values of these variables carry mathematical meaning, so you can carry out mathematical operations on them, such as addition and subtraction. Examples include a person's age and weight.
Categorical variables: The values of these variables have no mathematical significance, and you cannot perform mathematical operations on them. Each value belongs to a specific class or category. Examples include a person's gender and job.
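The distinction can be illustrated with a short sketch. The vectors age and job below hold hypothetical values chosen only for this example.

```r
# hypothetical values for illustration only
age <- c(23, 35, 47)                            # numeric variable
job <- factor(c("clerk", "manager", "clerk"))   # categorical variable

# arithmetic is meaningful for a numeric variable
mean.age <- mean(age)

# a categorical variable is described by its classes and their counts
job.levels <- levels(job)
job.counts <- table(job)
```

Calling mean on job instead would return NA with a warning, because an average of categories has no meaning.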

Since all the variables in our dataset were loaded as numeric by default, we only need to convert the categorical variables from numeric data types to factors, which are R's way of representing categorical variables.

The numeric variables in our dataset are credit.duration.months, credit.amount, and age, and they need no conversion. The remaining 18 variables are all categorical, and we will use the following utility function to convert their data types:

# data transformation: convert the named columns of a data frame to factors
to.factors <- function(df, variables){
  for (variable in variables){
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}
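To see what the function does in isolation, here is a small sketch on a toy data frame. The data frame toy.df, its column names, and its values are hypothetical. The sketch also shows an lapply-based form, which is an alternative way to write the same conversion and not the approach used in the text.

```r
# the utility function from the text, repeated so the sketch runs on its own
to.factors <- function(df, variables){
  for (variable in variables){
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}

# toy data frame with hypothetical values (not the credit dataset)
toy.df <- data.frame(rating  = c(1L, 0L, 1L),
                     amount  = c(1049L, 2799L, 841L),
                     purpose = c(2L, 0L, 9L))

# convert only the categorical columns; amount stays an integer
converted.df <- to.factors(df = toy.df, variables = c("rating", "purpose"))
column.classes <- sapply(converted.df, class)

# alternative: the same conversion written with lapply instead of a loop
lapply.df <- toy.df
lapply.df[c("rating", "purpose")] <- lapply(toy.df[c("rating", "purpose")],
                                            as.factor)
```

Because the function returns a modified copy, the original data frame is unchanged until you assign the result back to it.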

We apply this function to our existing data frame credit.df as follows to transform the variable data types:

> # select variables for data transformation
> categorical.vars <- c('credit.rating', 'account.balance', 
+                       'previous.credit.payment.status',
+                       'credit.purpose', 'savings', 
+                       'employment.duration', 'installment.rate',
+                       'marital.status', 'guarantor', 
+                       'residence.duration', 'current.assets',
+                       'other.credits', 'apartment.type', 
+                       'bank.credits', 'occupation', 
+                       'dependents', 'telephone', 
+                       'foreign.worker')
> 
> # transform data types
> credit.df <- to.factors(df = credit.df, 
+                         variables=categorical.vars)
> 
> # verify transformation in data frame details
> str(credit.df)

The output of str lists the attribute details of the data frame, with the transformed data types for the selected categorical variables. You will notice that, of the 21 variables in the dataset, 18 have been successfully transformed into factors.
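Besides reading the str listing, you can check the conversion programmatically. The following sketch counts columns by type on a toy data frame; toy.df and its values are hypothetical, and the same calls should apply to credit.df.

```r
# toy data frame with hypothetical values: two factors and one numeric column
toy.df <- data.frame(rating  = factor(c(1, 0, 1)),
                     amount  = c(1049, 2799, 841),
                     purpose = factor(c(2, 0, 9)))

# count the columns of each kind
n.factors <- sum(sapply(toy.df, is.factor))
n.numeric <- sum(sapply(toy.df, is.numeric))

# names of the columns that are factors
factor.names <- names(toy.df)[sapply(toy.df, is.factor)]
```

On the credit dataset, you would expect the factor count to be 18 and the numeric count to be 3 after the transformation.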

This concludes the data preprocessing step, and we will now dive into analyzing our dataset.

Note

Do note that several of our dataset attributes/features have many classes or categories. We will need some more data transformations and feature engineering in the analysis phase to prevent our predictive models from overfitting, which we shall discuss later.
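As a rough illustration of what such a transformation can look like, the sketch below merges sparse classes of a factor into a single class. The variable purpose, its values, the threshold of two observations, and the class name "other" are all assumptions made for this example; the transformations actually applied to the dataset are discussed later.

```r
# hypothetical categorical variable with some sparse classes
purpose <- factor(c(0, 0, 0, 1, 1, 2, 3, 3, 3, 9))

# number of observations in each class
class.counts <- table(purpose)

# classes with fewer than two observations (assumed threshold)
rare <- names(class.counts)[class.counts < 2]

# relabel the rare classes; identical labels are merged into one level
levels(purpose)[levels(purpose) %in% rare] <- "other"
merged.counts <- table(purpose)
```

Fewer, better-populated classes give a model less opportunity to fit noise in categories that have very few observations.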