Understanding the relationship between a continuous predictor and a categorical outcome variable

In this section, we are going to look at the relationship between a continuous predictor and a categorical outcome variable. To do that, we can run either a t-test or an ANOVA, which in Modeler is done with the Means node:

Connect the Select node to the Means node from the Output palette:

Edit the Means node. In the Grouping field box, input the prediction, which is the $N-Status variable:

In the Test field(s) box, select all of the continuous variables. Also, deselect the actual confidence:

Click on OK and then click on Run. You will get the following output:

We can see that most of the variables differentiate between the customers predicted to churn and the customers predicted to remain current customers. Some predictors weren't important. For example, neither estimated revenue nor the number of employees was statistically significant, but the rest of the predictors were important.
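The significance reported in this kind of output comes from comparing group means. As a rough illustration outside Modeler, the following Python sketch computes a one-way ANOVA F statistic for a single continuous field split into two groups. The values and the names `one_way_f`, `predicted_churn`, and `predicted_current` are made up for this example and are not taken from the dataset. It is a minimal sketch of the general technique, not a description of how the Means node is implemented.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values of one continuous field, grouped by the predicted status.
predicted_churn = [3, 2, 4, 1, 2]
predicted_current = [1, 0, 2, 1, 1]

f_stat, df_b, df_w = one_way_f([predicted_churn, predicted_current])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With only two groups, this F statistic equals the square of the equal-variance t statistic, which is why a t-test and an ANOVA lead to the same conclusion here. Turning F into a p-value requires the F distribution, which a statistics library such as SciPy provides (for example, `scipy.stats.f_oneway`).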

So far, the output shows only the mean for each group. To get more information, we'll change the View from Simple to Advanced, which gives the following results:

As you can see, we have the same number of cases for each of these variables. The most important predictor turned out to be the TVs variable, as we can see in the following screenshot. It is the one that differentiates these groups the most.

For example, look at the TVs field. The people we predicted to churn, that is, the customers we expected to lose, bought 2.4 TVs on average. That is considerably more than the people we predicted would stay as current customers, who bought about one TV on average, and the difference is statistically significant. The opposite is true when it comes to Stereos. The result is again statistically significant, but the people we predicted to lose as customers bought 13.2 stereos on average, whereas the people we predicted would stay as customers bought 16.2 stereos on average.
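To make the per-group comparison concrete, here is a small Python sketch that computes the count, mean, standard deviation, and standard error of the mean for each predicted group. The numbers are invented so that the group means come out to 13.2 and 16.2; they are not the real records, and the helper name `describe` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def describe(values):
    """Return count, mean, sample standard deviation, and standard error of the mean."""
    n = len(values)
    sd = stdev(values)
    return {"count": n, "mean": mean(values), "std_dev": sd, "std_err": sd / sqrt(n)}


# Hypothetical stereo purchase counts, grouped by the predicted status.
groups = {
    "predicted churn": [12, 14, 13, 15, 12],
    "predicted current": [16, 17, 15, 18, 15],
}

summary = {name: describe(values) for name, values in groups.items()}
for name, stats in summary.items():
    print(
        f"{name}: n={stats['count']}, mean={stats['mean']:.1f}, "
        f"sd={stats['std_dev']:.2f}, se={stats['std_err']:.2f}"
    )
```

A difference between two means is easier to judge next to the standard errors: the smaller the standard errors relative to the gap between the means, the more convincing the difference.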

The last thing we're going to do is look at the relationship between two continuous variables. Earlier, we used the scatter plot to show the relationship between two continuous variables. Here, we're going to quantify that by calculating a correlation:

Using the bank dataset, go to the Output palette in Modeler and connect the Statistics node to the source node:

Edit the Statistics node, which gives us summary statistics for the fields listed in the Examine box. Click on the fields option and select current salary:

Click on OK. To correlate current salary with the other continuous variables, click on the arrow next to the Correlate box and select the ruler icon, which selects all of the continuous variables:

Hold down the Ctrl key and deselect id:

Click on OK and then click on Run. You will see the following result:

As we can see, current salary correlates very highly with starting salary:

Remember that correlation values can range anywhere from -1 to +1. The farther away the value is from zero, the stronger the relationship. If the correlation coefficients are positive, this means that as the value of one variable increases, the value of the other variable increases. If the correlation coefficients are negative, this means that as the value of one variable increases, the value of the other variable decreases.
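The sign and size of a correlation can be checked by hand on a few numbers. The following Python sketch computes the Pearson correlation coefficient from its definition. The function name `pearson_r` and the sample lists are made up for illustration, and the sketch assumes that this is the coefficient being reported.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of paired deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a constant sequence has zero spread, so r is undefined (division by zero).
    return co_deviation / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases exactly in step with x
falling = [10, 8, 6, 4, 2]  # decreases exactly in step with x
noisy = [2, 1, 4, 3, 5]     # increases with x, but not perfectly

for label, y in [("rising", rising), ("falling", falling), ("noisy", noisy)]:
    print(f"{label}: r = {pearson_r(x, y):+.2f}")
```

The first two lists give values at the extremes of the range, one positive and one negative, while the noisy list gives a positive value that is smaller in magnitude.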

Notice that the correlation between starting salary and current salary is very high, almost 0.9. You can also see that the correlation between education level and current salary is fairly high, at 0.66.

On the other hand, neither the amount of time that someone has worked at this particular job nor the number of months that someone was employed before taking this job really correlates with current salary. Notice that those correlation values are pretty close to zero.

Correlations give us an indication of the types of relationships that we have, and they allow us to quantify those relationships. In the next section, we are going to see how to use decision trees to interpret the results of a machine learning model.