“Data is a precious thing and will last longer than the systems themselves.” – Tim Berners-Lee
To figure out whether they should offer a person a loan, banks will often work through a list of questions to see whether the person would be safe to lend to. These questions might start out simply, such as, “What kind of income do you have?” If the answer is between $30,000 and $70,000, they will continue on to the next question: “How long have you worked at your current job?” If the answer is one to five years, they move on to the next question: “Do you make regular credit card payments?” If the answer is yes, they will offer the loan; if not, the person won’t get one. This is the most basic decision tree.
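To make the idea concrete, here is a minimal sketch of that rule chain in Python. The thresholds simply mirror the questions above, and the function name and the “no loan” outcomes on the other branches are made up for illustration, not a real lending policy.

# A hand-written rule chain: each "if" is one question in the tree.
def loan_decision(income, years_at_job, pays_credit_card_regularly):
    if 30_000 <= income <= 70_000:
        if 1 <= years_at_job <= 5:
            if pays_credit_card_regularly:
                return "offer loan"
    return "no loan"

print(loan_decision(45_000, 3, True))   # offer loan
print(loan_decision(45_000, 3, False))  # no loan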
A decision tree is a non-parametric machine learning technique that is used for classification and regression problems. To find a solution, a decision tree builds a hierarchical, sequential set of decisions about the outcome variable based on the data.
And this means what?
Hierarchical refers to the fact that the model is defined by a series of questions that lead to a label or a value once it has been applied to an observation. Once it is set up, the model works like a protocol made of a bunch of “if this happens, then that happens” conditions that produce a particular result from the data that was fed in.
A non-parametric method means that there are no underlying assumptions about the distribution of the data or of the errors. It basically means that the model is built entirely from the observed data.
Decision trees that use a discrete set of values for the target variable are classification trees. In these trees, the leaf nodes represent class labels, and the branches represent the conjunctions of features that lead to those labels. Decision trees whose target variable takes continuous values, typically numbers, are referred to as regression trees. Together, these two types of decision trees are called CART (Classification and Regression Trees).
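As a quick look at both flavors, here is a sketch assuming the scikit-learn library is available; the tiny toy dataset of incomes and years at a job is made up purely for demonstration.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[30_000, 1], [55_000, 4], [80_000, 10], [25_000, 2]]  # income, years at job

# Classification tree: the target is a discrete label.
clf = DecisionTreeClassifier().fit(X, ["no", "yes", "yes", "no"])
print(clf.predict([[50_000, 3]]))

# Regression tree: the target is a continuous value, e.g. a credit limit.
reg = DecisionTreeRegressor().fit(X, [1_000.0, 5_000.0, 9_000.0, 500.0])
print(reg.predict([[50_000, 3]]))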
Each one of these models is a case of a Directed Acyclic Graph. In these graphs, the nodes represent decision points about the target variable, and the edges between the nodes represent the possible answers to each question. If we continue with the loan scenario, “$30,000 to $70,000” would represent an edge, and “Years at present job” would be a node.
The main goal of the decision tree is to make the best choice at each node, so it needs an algorithm that does exactly that. This is known as Hunt’s algorithm, which works recursively and greedily. Greedily means it makes the best choice available at each step, and recursively means it splits the big question into smaller ones. The choice of how to split a node is decided according to a purity metric. A node is considered 100% impure when its data is split 50/50 between classes, and 100% pure when all of its data belongs to a single class.
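Here is a bare-bones sketch of that recursive, greedy process in Python. It scores candidate splits with a simple misclassification rate rather than the Gini impurity discussed next, and the helper names (best_split, build_tree) are just illustrative.

from collections import Counter

def impurity(labels):
    """Fraction of points that are not in the node's majority class."""
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)

def best_split(rows, labels):
    """Greedy step: try every feature/threshold pair, keep the purest split."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels):
    """Recursive step: stop when the node is pure, otherwise split and recurse."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:  # no useful split left, so return the majority label
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left": build_tree([r for r, _ in left], [l for _, l in left]),
            "right": build_tree([r for r, _ in right], [l for _, l in right])}

print(build_tree([[30_000], [45_000], [80_000], [90_000]], ["no", "no", "yes", "yes"]))
# {'feature': 0, 'threshold': 45000, 'left': 'no', 'right': 'yes'}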
To make sure the model is optimized, you have to maximize purity and stay away from impurity. One way to do this is the Gini impurity, which measures how often a randomly chosen element would be labeled incorrectly if it were labeled at random according to the distribution of labels in the node. It is calculated by taking the probability, p_i, of an item with label i being chosen, multiplying it by the probability (1 − p_i) of miscategorizing it, and summing over all labels. The goal is to reach 0, where the node is pure.
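In code, the Gini impurity described above might look like the following sketch, which also checks the two extremes: a 50/50 two-class node scores the worst possible value of 0.5, and a pure node scores 0.

from collections import Counter

# For each label i, the chance p_i of drawing it times the chance (1 - p_i)
# of mislabeling it, summed over all labels.
def gini_impurity(labels):
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(p * (1 - p) for p in probs)

print(gini_impurity(["yes", "yes", "no", "no"]))    # 0.5, the worst case for two classes
print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0, a pure node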
The other metric it can use is information gain. This is used to decide which feature to split on at each step of the tree, and it can be figured out using this equation:
“Information Gain = Entropy(parent) – Weighted Sum of Entropy(Children)”
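Here is a small sketch of that equation in Python, where the “children” are the lists of labels each candidate split would produce. The example values are made up to show that a perfect split recovers the full parent entropy, while a useless split gains nothing.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
# A split that separates the classes perfectly gains the full parent entropy.
print(information_gain(parent, [["yes"] * 3, ["no"] * 3]))  # 1.0
# A split that changes nothing gains nothing.
print(information_gain(parent, [parent]))                   # 0.0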
This is a pretty good model, but it presents a problem: it only stops once all of the data has been separated into single classes or attributes. At the cost of bias, the model’s variance becomes huge, which ends up leading to overfitting. This can be fought by setting a maximum depth for your tree, or alternatively by specifying the minimum number of points a node needs before it is allowed to split.
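Assuming scikit-learn, those two remedies are a maximum depth and a minimum number of samples per split; the exact values below are illustrative, not recommendations.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree grows until every leaf is pure; the constrained one
# stops early, trading a little bias for much lower variance.
unpruned = DecisionTreeClassifier().fit(X, y)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)

print(unpruned.get_depth(), pruned.get_depth())  # the constrained tree is shallower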
With decision trees come advantages and disadvantages. On the downside, they are greedy algorithms that are only locally optimized, so returning the globally optimal tree isn’t guaranteed. On the positive side, they are super simple to understand because they have a visual representation, and they don’t require all that much data.