In this section, we focus on data preprocessing, which includes data cleaning, transformation, and, if required, normalization. In short, we get the data ready before we perform any analysis on it.
You will often deal with data that has missing values, which R represents as NA. There are several ways to detect them, and we show two of them next.
> # check if data frame contains NA values
> sum(is.na(credit.df))
[1] 0
>
> # check if total records reduced after removing rows with NA
> # values
> sum(complete.cases(credit.df))
[1] 1000
The is.na function is useful because it tells you, element by element, whether a value in the dataset is NA; summing its result gives the total number of missing values. Another approach uses the complete.cases function, which returns a logical vector indicating, for each row, whether that row is complete, that is, whether it has no NA values. If the number of complete rows is lower than the number of records in the original dataset, you know that some values are missing. Fortunately, in our case, there are no missing values: all 1000 records are complete. If you do encounter missing values in the future, there are various ways to deal with them. You can remove the rows with missing values, for example by using complete.cases, or fill them in with a value such as the most frequent value or the mean. The latter is known as missing value imputation, and the right choice depends on the variable and the domain you are dealing with, so we will not go into further detail here.
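As a minimal sketch of the two strategies just mentioned, the following uses a small made-up data frame to drop incomplete rows and, alternatively, to fill a numeric column with its mean and a categorical column with its most frequent value. The data frame toy.df and its columns are hypothetical and are not part of credit.df; which strategy is appropriate still depends on the variable and the domain.

```r
# hypothetical toy data with missing values (not the credit dataset)
toy.df <- data.frame(
  amount  = c(100, NA, 300, 200),
  purpose = c("car", "car", NA, "tv"),
  stringsAsFactors = FALSE
)

# count missing values per column
na.counts <- colSums(is.na(toy.df))

# option 1: keep only the rows without any missing value
complete.df <- toy.df[complete.cases(toy.df), ]

# option 2: impute instead of dropping rows
imputed.df <- toy.df

# numeric column: replace NA with the mean of the observed values
imputed.df$amount[is.na(imputed.df$amount)] <-
  mean(toy.df$amount, na.rm = TRUE)

# categorical column: replace NA with the most frequent observed value
purpose.counts <- table(toy.df$purpose)
most.frequent <- names(purpose.counts)[which.max(purpose.counts)]
imputed.df$purpose[is.na(imputed.df$purpose)] <- most.frequent
```

Dropping rows is the simpler option but discards the other values in those rows, whereas imputation keeps every record at the cost of introducing values that were never observed.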
We mentioned earlier that, by default, all the attributes of the dataset were loaded as int, one of R's numeric types. That does not reflect what every variable actually represents, so we have to change the types based on the semantics and values of each variable. If you have taken a basic course on statistics, you might know that we usually deal with two types of variables: numeric and categorical.
Since all the variables in our dataset were read in as numeric by default, we only need to convert the categorical variables from numeric data types to factors, which is how R represents categorical variables.
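To illustrate why this conversion matters, here is a small sketch with a made-up vector of category codes (the vector purpose.codes is hypothetical). Stored as numbers, R treats the codes as quantities; stored as a factor, it treats them as labels drawn from a fixed set of levels.

```r
# hypothetical category codes stored as numbers
purpose.codes <- c(1, 3, 1, 2, 3, 3)

# as numbers, R will compute a mean even though it has no meaning here
mean(purpose.codes)

# as a factor, the values become labels with a fixed set of levels
purpose.factor <- as.factor(purpose.codes)
levels(purpose.factor)

# summaries now count the observations per category
code.counts <- table(purpose.factor)
code.counts
```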
The numeric variables in our dataset include credit.duration.months, credit.amount, and age, and we will not need to perform any conversions on them. However, the remaining 18 variables are all categorical and we will be using the following utility function to convert their data types:
# data transformation
to.factors <- function(df, variables){
  for (variable in variables){
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}
This function will be used on our existing data frame credit.df as follows for transforming the variable data types:
> # select variables for data transformation
> categorical.vars <- c('credit.rating', 'account.balance',
+                       'previous.credit.payment.status',
+                       'credit.purpose', 'savings',
+                       'employment.duration', 'installment.rate',
+                       'marital.status', 'guarantor',
+                       'residence.duration', 'current.assets',
+                       'other.credits', 'apartment.type',
+                       'bank.credits', 'occupation',
+                       'dependents', 'telephone',
+                       'foreign.worker')
>
> # transform data types
> credit.df <- to.factors(df = credit.df,
+                         variables=categorical.vars)
>
> # verify transformation in data frame details
> str(credit.df)
Now we can see the attribute details in the data frame with the transformed data types for the selected categorical variables in the following output. You will notice that out of the 21 variables/attributes of the dataset, 18 of them have been successfully transformed into categorical variables.
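The same check can also be done programmatically rather than by reading the str output. The sketch below repeats the to.factors function so that it runs on its own, and applies it to a small made-up data frame instead of credit.df. The data frame toy.df and its values are hypothetical.

```r
# utility function from this section, repeated so the sketch is self-contained
to.factors <- function(df, variables){
  for (variable in variables){
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}

# hypothetical stand-in for credit.df: two categorical columns, one numeric
toy.df <- data.frame(
  credit.rating = c(1L, 0L, 1L, 1L),
  savings       = c(1L, 3L, 2L, 1L),
  credit.amount = c(1049L, 2799L, 841L, 2122L)
)

toy.df <- to.factors(df = toy.df,
                     variables = c('credit.rating', 'savings'))

# one TRUE/FALSE per column: is it a factor?
is.categorical <- sapply(toy.df, is.factor)
is.categorical

# number of columns that were converted
sum(is.categorical)
```

Applied to credit.df, the same sapply call would be expected to report 18 factor columns if the conversion succeeded, leaving the numeric variables untouched.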
This brings us to the end of the data preprocessing step, and we will now dive into analyzing our dataset.