In this section, we will use statistics to interpret the results of a machine learning model. We have talked about graphs and how they allow us to see and interpret the predictions of a machine learning model, but the relationships shown in those graphs can also be checked with statistical tests. Have a look at the following table, which shows the test that applies to each combination of variable types:
Outcome vs. predictor | Categorical | Continuous |
Categorical | Chi-square | ANOVA |
Continuous | ANOVA | Correlation |
Let's say we have a categorical outcome variable and a categorical predictor. In that case, we can use the chi-square test. The chi-square test of independence allows us to look at the relationship between two categorical variables.
If we have a categorical outcome variable and a continuous predictor, we can use ANOVA to compare the mean of the predictor across the outcome categories. If we are comparing only two categories, we can use a t-test instead. We could also use logistic regression or discriminant analysis in this situation if we wanted. On the other hand, if our outcome variable is continuous and we have a categorical predictor, we can again use ANOVA. And finally, when both of our variables are continuous, we can use correlation or linear regression.
Let's take a look at a few examples. We will outline some of these techniques so that we can understand the inner workings of a machine learning model.
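The lookup described by the table can be sketched in a few lines of Python. This is only an illustration of the decision logic: the function name `suggest_test` and its arguments are hypothetical and are not part of any tool used in this section.

```python
VALID_TYPES = {"categorical", "continuous"}


def suggest_test(outcome, predictor, n_groups=None):
    """Return the test suggested by the table for a pair of variable types.

    outcome, predictor: "categorical" or "continuous".
    n_groups: number of categories in the categorical variable, if known
              (only used when exactly one variable is categorical).
    """
    if outcome not in VALID_TYPES or predictor not in VALID_TYPES:
        raise ValueError("variable types must be 'categorical' or 'continuous'")
    if outcome == "categorical" and predictor == "categorical":
        return "chi-square"
    if outcome == "continuous" and predictor == "continuous":
        return "correlation"
    # Exactly one variable is categorical: compare group means.
    # With only two groups, a t-test can be used in place of ANOVA.
    return "t-test" if n_groups == 2 else "ANOVA"


print(suggest_test("categorical", "categorical"))
print(suggest_test("categorical", "continuous", n_groups=2))
print(suggest_test("continuous", "categorical", n_groups=4))
print(suggest_test("continuous", "continuous"))
```

The function only names a test. It does not check the assumptions behind that test, such as minimum expected cell counts for chi-square.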
Let's take a look at how some tables can help clarify the relationships between our variables and show how these different predictors are used by a model.
Go to the Output palette and connect the Select node to a Matrix node:
The Matrix node will allow us to run the chi-square test, which will help us to look at the relationship between two categorical variables. In this case, we're going to look at the relationship between a categorical predictor and a categorical outcome variable. Let's edit the Matrix node. In the Rows field, we will input our prediction, which is the $N-Status variable, and in the Columns field, we will select the important predictor in the model, which is the Premier variable:
Go to the Appearance tab and select Percentage of columns, so that we get the column percentages, and then click on Run:
We will get the following result:
The chi-square test is statistically significant, which means that the predicted status differs between Premier and non-Premier customers. This is consistent with Premier being an important predictor in the model.
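To make the calculation behind such a test concrete, here is a minimal sketch of a Pearson chi-square test of independence for a 2x2 table, using only the Python standard library. The counts are made up for illustration and are not the ones produced by the Matrix node. The function name `chi_square_2x2` is hypothetical, and no continuity correction is applied, so the result may differ slightly from tools that apply one.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("expected a 2x2 table of counts")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom, and for 1 degree of freedom the
    # chi-square tail probability equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts (assumption): rows are the predicted status
# (Churned, Current); columns are non-Premier and Premier customers.
counts = [[400, 65],
          [100, 435]]
statistic, p_value = chi_square_2x2(counts)
print(f"chi-square = {statistic:.2f}, df = 1, p = {p_value:.3g}")
```

A very small p-value means that the pattern in the table would be unlikely if the two variables were independent. For larger tables, a library routine such as `scipy.stats.chi2_contingency` handles the general case.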
We also get some numerical information that backs up what we saw previously in our graphs. About 80% of the non-Premier customers are predicted to be Churned, and about 87% of the Premier customers are predicted to be Current customers.
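Column percentages of this kind are simple to compute by hand: each cell is divided by the total of its column. The sketch below uses made-up counts, chosen only so that the percentages resemble the ones quoted above. The labels and the function name `column_percentages` are likewise assumptions for illustration.

```python
def column_percentages(table):
    """Convert a {row: {column: count}} table into column percentages."""
    columns = list(next(iter(table.values())))
    # Total number of cases in each column.
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: 100.0 * row[c] / totals[c] for c in columns}
        for r, row in table.items()
    }


# Hypothetical counts (assumption): predicted status by Premier membership.
counts = {
    "Churned": {"Non-Premier": 400, "Premier": 65},
    "Current": {"Non-Premier": 100, "Premier": 435},
}

percentages = column_percentages(counts)
for status, row in percentages.items():
    cells = ", ".join(f"{col}: {pct:.1f}%" for col, pct in row.items())
    print(f"{status:8s} {cells}")
```

Because the percentages are computed within each column, every column sums to 100%, which makes it easy to compare the predicted status of the two customer groups even when the groups differ in size.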
This is what the model has been doing. Now, we understand how at least one variable is used in the model.