So far, the algorithms we have seen were not very complex, and most of our focus was on implementing the API for a particular data structure. In this example, there is no data structure to implement; the algorithm itself generates the decision tree that we will use in our application.
First, let's take a look at how a table of historical data can be converted into a decision tree. The resulting tree is primarily made up of decision nodes (which test an attribute and branch on its value) and leaf nodes (which hold the final response, such as yes or no). A decision node can have two or more subnodes, depending on the dataset. However, the tree has to start from somewhere, right? What would be the root node, and how do we get it?
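Before going further, it may help to see what such a tree could look like in code. The following is a minimal Python sketch, not a definitive implementation; the class and field names (DecisionNode, LeafNode, attribute, branches, label) are our own illustrative choices:

```python
from dataclasses import dataclass, field


@dataclass
class DecisionNode:
    """An internal node that tests one attribute, for example Outlook."""
    attribute: str
    # One subtree per observed value of the attribute (Sunny, Rain, ...).
    branches: dict = field(default_factory=dict)


@dataclass
class LeafNode:
    """A terminal node holding the final response, for example Yes or No."""
    label: str
```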
To determine the root node, we first need to cover some basics of information theory:
- Entropy: Entropy is a measure of the uncertainty of a series of inputs: the more uncertain we are about which message will arrive, the more information is needed to determine what the message is. For example, if our input series always sends the same message, there is no uncertainty, and thus the entropy is zero; such an input series is also known as pure. However, if the same series sends n different types of inputs with equal probability, the entropy becomes high, and the receiver needs to ask log2(n) Boolean questions to determine the message. The average number of bits needed to identify a message is a measure of the sender's entropy (see the sketch after this list).
- Information gain: To determine the root node, we first split the dataset on each of the available attributes and compute the entropy of each resulting split. The difference between the entropy of the target and the weighted entropy of the subsets produced by an attribute's split is the information gain of that attribute, as shown in the sketch that follows.
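Here is a minimal Python sketch of both measures, assuming each row of the dataset is a dictionary keyed by attribute name; the helper names entropy and information_gain are our own:

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of target labels, e.g. ['Yes', 'No', ...]."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target column minus the weighted entropy of the
    subsets produced by splitting the rows on the given attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder
```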
The attribute with the highest information gain becomes the root attribute. We then repeat the process for each subtree until every subset is pure, that is, until its entropy reaches zero. Let's take a look at this with an example and then start off with the code.
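Building on the node classes and helpers sketched above (again, an illustrative sketch under those assumptions rather than a definitive implementation), root selection and the recursive repetition can be expressed as follows:

```python
def choose_root(rows, attributes, target):
    """Pick the attribute whose split yields the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


def build_tree(rows, attributes, target):
    """Recursively split the rows until every subset is pure."""
    labels = [row[target] for row in rows]
    if entropy(labels) == 0 or not attributes:
        # Pure subset (or nothing left to split on): emit the answer.
        return LeafNode(Counter(labels).most_common(1)[0][0])
    best = choose_root(rows, attributes, target)
    node = DecisionNode(best)
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        node.branches[value] = build_tree(
            subset, [a for a in attributes if a != best], target)
    return node
```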
For the following example, we will take a simple and popular dataset to decide whether a game is going to be played based on the weather conditions:
| Outlook | Temperature | Humidity | Wind | Play Soccer |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
In the preceding example, the target is the Play Soccer attribute. Let's assume that our input source is capable of sending n distinct messages and that the probability of sending message i is pi; the entropy of the source is then the sum over all n messages: E = -∑ pi * log2(pi).
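As a quick sanity check, we can apply this formula to the Play Soccer column itself. Of the 13 rows above, 9 are Yes and 4 are No, so E = -(9/13) * log2(9/13) - (4/13) * log2(4/13) ≈ 0.890. The snippet below reproduces this using the entropy helper from the earlier sketch:

```python
# Play Soccer column from the table above: 9 Yes and 4 No.
target_column = ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
                 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']

print(f"{entropy(target_column):.3f}")  # ≈ 0.890
```

The nonzero entropy confirms that the target column is not pure, so at least one split is needed before we can answer.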