In the preceding diagram, you may have noticed that the example tree most likely uses subjectively determined cutpoints in deciding which route to follow. For example, Diamond #5 uses a BMI cutoff of 25, and Diamond #7 uses a BMI cutoff of 30. Nice, round numbers! In the decision analysis field, trees are usually constructed based on human inference and discussion. What if we could objectively determine the best variables to cut (and the corresponding cutpoints at which to cut) in order to minimize the error of the algorithm?
This is just what we do when we train a formal decision tree using a machine learning algorithm. Decision trees evolved in the 1990s and used principles of information theory to optimize the branching variables/points of the tree to maximize the classification accuracy. The most common and simple algorithm for training a decision tree proceeds using what is known as a greedy approach. Starting at the first node, we take the training set of our data and split it based on each variable, using a variety of cutpoints for each variable. After each split, we calculate the entropy or information gain from the resulting split. Don't worry about the formulas for calculating these quantities, just know that they measure how much information is gained from the split, which correlates with how even the split is. For example, using the PUL algorithm shown previously, a split that results in eight normal intrauterine pregnancies and seven ectopic pregnancies would be favored over a split that results in 15 normal intrauterine pregnancies and zero ectopic pregnancies. Once we have the variable and cutpoint for the best split, we proceed and then repeat the method, using the remaining variables. To prevent overfitting the model to the training data, we stop splitting the tree when certain criteria are reached, or alternatively, we could train a big tree with many nodes and then remove (prune) some of the nodes.
Decision trees have some limitations. For one thing, decision trees must split the decision space linearly at each step based on a single variable. Another problem is that decision trees are prone to overfitting. Because of these issues, decision trees typically aren't competitive with most state-of-the-art machine learning algorithms in terms of minimizing errors. However, the random forest, which is basically an ensemble of de-correlated decision trees, is currently among the most popular and accurate machine learning methods in medicine. We will make decision trees and random forests in Chapter 7, Making Predictive Models in Healthcare of this book.