Data understanding and preparation

To start, we will load these three packages. The data is in the MASS package, but this is a required dependent package loaded with neuralnet:

> library(caret)

> library(neuralnet)
Loading required package: grid
Loading required package: MASS

> library(vcd)

The neuralnet package will be used for the building of the model and caret for the data preparation. The vcd package will assist us in data visualization. Let's load the data and examine its structure:

> data(shuttle)

> str(shuttle)
'data.frame':256 obs. of  7 variables:
 $ stability: Factor w/ 2 levepicels "stab","xstab": 2 2 2 2 2 2 2 2 2 2 ...
 $ error    : Factor w/ 4 levels "LX","MM","SS",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sign     : Factor w/ 2 levels "nn","pp": 2 2 2 2 2 2 1 1 1 1 ...
 $ wind     : Factor w/ 2 levels "head","tail": 1 1 1 2 2 2 1 1 1 2 ...
 $ magn     : Factor w/ 4 levels "Light","Medium",..: 1 2 4 1 2 4 1 2 4 1 ...
 $ vis      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ use      : Factor w/ 2 levels "auto","noauto": 1 1 1 1 1 1 1 1 1 1 ...

The data consists of 256 observations and 7 variables. Notice that all of the variables are categorical and the response is use with two levels, auto and noauto. The covariates are as follows:

stability: This is stable positioning or not (stab/xstab)
error: This is the size of the error (MM / SS / LX)
sign: This is the sign of the error, positive or negative (pp/nn)
wind: This is the wind sign (head / tail)
magn: This is the wind strength (Light / Medium / Strong / Out of Range)
vis: This is the visibility (yes / no)

We will build a number of tables to explore the data, starting with the response/outcome:

> table(shuttle$use)

  auto noauto 
   145    111

Almost 57 percent of the time, the decision is to use the autolander. There are a number of possibilities to build tables for categorical data. The table() function is perfectly adequate to compare one versus another, but if you add a third, it can turn into a mess to look at. The vcd package offers a number of table and plotting functions. One is structable(). This function will take a formula (column1 + column2 ~ column3), where column3 becomes the rows in the table:

> table1=structable(wind+magn~use, shuttle)

> table1
       wind  head                    tail                  
       magn Light Medium Out Strong Light Medium Out Strong
use                                                        
auto           19     19  16     18    19     19  16     19
noauto         13     13  16     14    13     13  16     13

Here, we can see that in the cases of a headwind that was Light in magnitude, auto occurred 19 times and noauto, 13 times. The vcd package offers the mosaic() function to plot the table created by structable() and provide the p-value for a chi-squared test:

> mosaic(table1, gp=shading_hcl)

The output of the preceding command is as follows:

The plot tiles correspond to the proportional size of their respective cells in the table, created by recursive splits. You can also see that the p-value is not significant, so the variables are independent, which means that knowing the levels of wind and/or magn does not help us predict the use of the autolander. You do not need to include a structable() object in order to create the plot as it will accept a formula just as well. We will also drop the shading syntax, which drops the colored shading and residuals bar on the right. This can make for a cleaner plot:

> mosaic(use~error+vis, shuttle)

The output of the preceding command is as follows:

Note that the shading of the table has changed, reflecting the rejection of the null hypothesis and dependence in the variables. The plot first takes and splits the visibility. The result is that if the visibility is no, then the autolander is used. The next split is horizontal by error. If error is SS or MM when vis is no, then the autolander is recommended, otherwise it is not. A p-value is not necessary as the gray shading indicates significance.

One can also examine proportional tables with the prop.table() function as a wrapper around table():

> table(shuttle$use, shuttle$stability)
        
         stab xstab
  auto     81    64
  noauto   47    64

> prop.table(table(shuttle$use, shuttle$stability))
        
              stab     xstab
  auto   0.3164062 0.2500000
  noauto 0.1835938 0.2500000

In case we forget, the chi-squared tests are quite simple:

> chisq.test(shuttle$use, shuttle$stability)

Pearson's Chi-squared test with Yates' continuity
correction

data:  shuttle$use and shuttle$stability
X-squared = 4.0718, df = 1, p-value = 0.0436

Preparing the data for a neural network is very important as all the covariates and responses need to be numeric. In our case, we have all of them categorical. The caret package allows us to quickly create dummy variables as our input features:

> dummies = dummyVars(use~.,shuttle, fullRank=TRUE)

> dummies
Dummy Variable Object

Formula: use ~ .
7 variables, 7 factors
Variables and levels will be separated by '.'
A full rank encoding is used

To put this into a data frame, we need to predict the dummies object to an existing data, either the same or different, in as.data.frame(). Of course, the same data is needed here:

> shuttle.2 = as.data.frame(predict(dummies, newdata=shuttle))

> names(shuttle.2)
 [1] "stability.xstab" "error.MM"        "error.SS"       
 [4] "error.XL"        "sign.pp"         "wind.tail"      
 [7] "magn.Medium"     "magn.Out"        "magn.Strong"    
[10] "vis.yes"        

> head(shuttle.2)
  stability.xstab error.MM error.SS error.XL sign.pp wind.tail
1               1        0        0        0       1         0
2               1        0        0        0       1         0
3               1        0        0        0       1         0
4               1        0        0        0       1         1
5               1        0        0        0       1         1
6               1        0        0        0       1         1
  magn.Medium magn.Out magn.Strong vis.yes
1           0        0           0       0
2           1        0           0       0
3           0        0           1       0
4           0        0           0       0
5           1        0           0       0
6           0        0           1       0

We now have an input feature space of ten variables. Stability is now either 0 for stab or 1 for xstab. Base error is LX and three variables represent the other categories.

The response can be created using the ifelse() function:

> shuttle.2$use = ifelse(shuttle$use=="auto",1,0)

> table(shuttle.2$use)

  0   1 
111 145

The caret package also provides us with a functionality to create the train and test sets. The idea is to index each observation as train or test and then split the data accordingly. Let's do this with a 70/30 train to test split, as follows:

> set.seed(123)

> trainIndex = createDataPartition(shuttle.2$use, p = .7, list = FALSE, times = 1)

> head(trainIndex)
     Resample1
[1,]         1
[2,]         2
[3,]         6
[4,]         9
[5,]        10
[6,]        11

The values in trainIndex provide us with a row number; in our case, 70 percent of the total row numbers in shuttle.2. It is now a simple case of creating the train/test datasets and to make sure that the response variable is balanced between them.

> shuttleTrain = shuttle.2[ trainIndex,]

> shuttleTest  = shuttle.2[-trainIndex,]

> table(shuttleTrain$use)

 0  1 
81 99 
> table(shuttleTest$use)

 0  1 
30 46

Nicely done! We are now ready to begin building the neural networks.