
Chapter Seven: Machine Learning Algorithms


K Nearest Neighbors

This is one of the simplest algorithms used in machine learning. Like many machine learning techniques, it is modeled on human reasoning. For instance, if something that had a significant impact on you were to happen in your life, you would memorize the experience and use it as a guideline for future decisions.

Let us consider the following example, where a person has just dropped a glass. While the glass is falling, that person will predict that it will break the moment it touches the ground. How can the person make that assumption? He or she has never seen that particular glass break before. However, he or she has seen similar glasses drop and break. The circumstances will not have been identical, but it is a known fact that a glass falling from a height will usually break. This gives the person confidence that the assumed outcome is accurate. What would happen if that person dropped a glass from a small height onto a carpet? We now understand that the ground surface and the height both play a role in the outcome. This is how the K-NN algorithm works. When a new situation arises, the algorithm scans through the training data and identifies the most similar past experiences. These experiences are called the k nearest neighbors.

If you are given a classification task – for instance, you have to predict whether the glass will break – you first look at the k nearest neighbors. If k=6, and the glass broke in 4 of those experiences, the algorithm will state that the glass will break. Now assume that you want to predict the number of pieces into which the glass will break. Predicting a number like this is called regression, and here the algorithm takes the average of the nearest neighbors. For example, if k=6 and the numbers of pieces the glass broke into were 1 (the glass did not break), 4, 6, 3, 6, and 10, the algorithm will give you a prediction of 5.
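To make this concrete, here is a minimal sketch of the neighbor-voting and neighbor-averaging steps in Python. The feature values are invented for illustration (two made-up inputs, drop height and surface hardness); they are not taken from a real data set.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=6, task="classification"):
    """Predict for x_new from the k most similar training examples."""
    # Euclidean distance from the new point to every stored example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k nearest neighbors
    neighbor_outcomes = y_train[nearest]

    if task == "classification":
        # Majority vote among the neighbors
        values, counts = np.unique(neighbor_outcomes, return_counts=True)
        return values[np.argmax(counts)]
    # Regression: average of the neighbors' outcomes
    return neighbor_outcomes.mean()

# Invented training data: [drop height, surface hardness] -> pieces the glass broke into
X_train = np.array([[0.1, 0.2], [1.5, 0.9], [2.0, 0.8], [1.0, 0.7], [1.8, 0.9], [2.5, 1.0]])
pieces  = np.array([1, 4, 6, 3, 6, 10])
broke   = (pieces > 1).astype(int)               # 1 = broke, 0 = did not break

x_new = np.array([1.6, 0.85])
print(knn_predict(X_train, broke, x_new, k=6, task="classification"))  # will it break?
print(knn_predict(X_train, pieces, x_new, k=6, task="regression"))     # average = 5.0
```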

This algorithm is often called a lazy algorithm, since it does not learn a model from the training data. The data is simply stored, and that stored database is later used to identify the nearest neighbors and predict outcomes.

Naïve Bayes Estimation and Bayesian Networks

In the field of statistics, probability is approached in two ways – the classical approach and the Bayesian approach. Probability is usually taught using the classical, or frequentist, approach; this is the method followed in most beginners’ classes in statistics. In the frequentist approach, the population parameters are fixed constants whose values are not known, and probabilities are defined as the long-run relative frequencies of the various outcomes when the experiment is repeated a very large number of times. For example, if you were to toss a fair coin ten times, it would not be terribly unusual to observe 80 percent heads, but if the coin were tossed ten trillion times, we can be virtually certain that the proportion of heads would be very close to 50 percent. It is this long-run behavior that defines probability in the frequentist approach.

However, certain situations do arise in which the classical definition of probability makes it difficult to reason about the problem. For example, what is the probability that terrorists will strike New York City with a dirty bomb? Given that such an event has never occurred, it is difficult to conceive what the long-term behavior of this gruesome experiment might be. In the frequentist approach, the parameters are fixed, so that the randomness lies only in the data; the data are viewed as a random sample from a given distribution with unknown but fixed parameters.

These assumptions are turned around in the Bayesian approach to probability. In this approach, the parameters are considered to be random variables, and the data are known. The parameters are regarded as coming from a distribution of possible values, and Bayesians look to the observed data to provide information on the likely values of those parameters.
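This updating of beliefs can be written compactly as Bayes’ theorem for a parameter θ: the posterior distribution of the parameter, given the data, is proportional to the likelihood of the data times the prior distribution.

```latex
p(\theta \mid \mathrm{data})
  = \frac{p(\mathrm{data} \mid \theta)\, p(\theta)}{p(\mathrm{data})}
  \;\propto\; p(\mathrm{data} \mid \theta)\, p(\theta)
```

The denominator p(data) is the normalizing factor mentioned later in this section; computing it requires integrating (or summing) over all possible values of the parameter vector.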

Criticism of the Bayesian framework has focused primarily on two potential drawbacks. First, the elicitation of a prior distribution may be subjective. That is, two different subject matter experts may provide two different prior distributions, which will presumably percolate through to result in two different posterior distributions. The solution to this problem is (1) to select non-informative priors if the choice of priors is controversial and (2) to apply a large amount of data so that the relative importance of the prior is diminished. Failing this, model selection can be performed on the two different posterior distributions using model adequacy and efficacy criteria, resulting in the choice of the better model. Is reporting more than one model a bad thing?

The second criticism has been that Bayesian computation has been intractable in data mining terms for most interesting problems where the approach suffered from scalability issues. The curse of dimensionality hits Bayesian analysis rather hard, given that the normalizing factor requires integrating (or summing) over all possible values of the parameter vector, which may be computationally infeasible when applied directly. However, the introduction of Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling and the Metropolis algorithm, has greatly expanded the range of problems and dimensions that Bayesian analysis can handle.
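As an illustration of the MCMC idea, here is a minimal random-walk Metropolis sketch in Python. It draws samples from a posterior that is known only up to the normalizing factor – here an invented example of a coin’s heads probability after observing 14 heads in 20 tosses with a flat prior – so the intractable integral never has to be computed. The specific numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized posterior for a coin's heads probability theta:
# the binomial likelihood (up to a constant) times a flat prior on [0, 1].
heads, tosses = 14, 20
def unnormalized_posterior(theta):
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    return theta**heads * (1.0 - theta)**(tosses - heads)

# Random-walk Metropolis: propose a small step, accept with probability
# min(1, posterior(proposal) / posterior(current)).
samples = []
theta = 0.5
for _ in range(20000):
    proposal = theta + rng.normal(scale=0.1)
    ratio = unnormalized_posterior(proposal) / unnormalized_posterior(theta)
    if rng.uniform() < ratio:
        theta = proposal
    samples.append(theta)

posterior_draws = np.array(samples[5000:])   # discard burn-in
print(posterior_draws.mean())                # close to (14 + 1) / (20 + 2), about 0.68
```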

Regression Modeling

Regression modeling is an elegant and powerful method for estimating the value of a continuous target variable. There are multiple regression models that can be used – one of the simplest is the simple linear regression model, in which a straight line is used to estimate the relationship between a single continuous response variable and a single continuous predictor variable. There is also multiple regression, in which several predictor variables are used to estimate one response.
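A minimal sketch of a simple linear regression fit, using NumPy’s least-squares polynomial fit on a small invented data set (the numbers are for illustration only):

```python
import numpy as np

# Invented data: a single predictor x and a continuous response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9, 12.2])

# Fit y = b0 + b1 * x by ordinary least squares (degree-1 polynomial)
b1, b0 = np.polyfit(x, y, deg=1)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")

# Estimate the response for a new predictor value
x_new = 7.0
print("predicted response:", b0 + b1 * x_new)
```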

Apart from the models mentioned above, there is also the least-squares regression method, which is an extremely powerful tool. However, the method rests on a number of assumptions about the data, and it is important that these assumptions be validated before the model is even built. If the user were to build and use a model based on assumptions that have not been verified, this could lead to failures that cause serious damage to the company or the individual.

Once the user has obtained the results from the model, he or she will need to determine whether a linear relationship truly exists between the variables used in the model. Such a relationship may be subtle and difficult to identify by inspection alone. There is, however, a systematic approach to determining whether a linear relationship exists, and that is statistical inference. Four inferential methods can be used to examine the relationship:

•  The t-test for the relationship between the response variable and the predictor variable

•  The confidence interval for the slope, β1

•  The confidence interval for the mean of the response variable given a particular value of the predictor

•  The prediction interval for a random value of the response variable given a particular value of the predictor

The inferential methods described above all depend on the data adhering to the assumptions made at the beginning of the process. How well the data adhere to these assumptions can be assessed using two graphical methods – a normal probability plot of the residuals and a plot of the standardized residuals against the predicted (fitted) values.

A normal probability plot is a quantile-quantile plot of the quantiles of a particular distribution against the quantiles of the standard normal distribution, used to determine whether the specified distribution deviates from normality. In a normal probability plot, the values observed for the distribution of interest are compared against the same number of values that would be expected from the normal distribution. If the distribution is normal, the bulk of the points in the plot should fall on a straight line; systematic deviations from linearity in this plot indicate non-normality. We evaluate the validity of the regression assumptions by observing whether certain patterns exist in the plot of the residuals versus the fitted values, in which case one of the assumptions has been violated, or whether no such discernible pattern exists, in which case the assumptions remain intact.
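A minimal sketch of these two diagnostic plots in Python, assuming a fitted simple linear regression like the one above (SciPy’s probplot produces the normal probability plot; the data are invented):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Invented data and a simple linear regression fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9, 12.2])
b1, b0 = np.polyfit(x, y, deg=1)

fitted = b0 + b1 * x
residuals = y - fitted
standardized = residuals / residuals.std(ddof=2)   # crude standardization

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Normal probability (Q-Q) plot of the residuals
stats.probplot(residuals, dist="norm", plot=ax1)
ax1.set_title("Normal probability plot of residuals")

# Standardized residuals versus fitted values
ax2.scatter(fitted, standardized)
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Fitted values")
ax2.set_ylabel("Standardized residuals")
ax2.set_title("Residuals versus fits")

plt.tight_layout()
plt.show()
```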

If these graphs indicate violations of the assumptions, we may apply a transformation to the response variable y, such as the ln (natural log, log to the base e) transformation. Transformations may also be called for if the relationship between the predictor and the response variables is not linear. We may use either “Mosteller and Tukey’s ladder of re-expressions” or a “Box-Cox transformation.”
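A short sketch of the Box-Cox transformation using SciPy, which chooses the power parameter lambda by maximum likelihood (the response values are invented and must be strictly positive):

```python
import numpy as np
from scipy import stats

# Invented, strictly positive response values with a skewed shape
y = np.array([1.2, 1.8, 2.5, 4.1, 7.9, 15.3, 30.8, 62.0])

# Box-Cox transformation; lambda is estimated by maximum likelihood.
# A lambda near 0 corresponds to the natural-log transformation.
y_transformed, lam = stats.boxcox(y)
print(f"estimated lambda: {lam:.2f}")

# The natural-log transformation mentioned above, for comparison
y_log = np.log(y)
```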

Support Vector Machine

Support vector machines (SVMs) are among the most popular machine learning algorithms. They came to prominence in the 1990s and are considered high-performing algorithms that require relatively little tuning.

Maximal-Margin Classifier

This classifier is the simplest way to explain how an SVM works. The numeric input variables, labeled x, in your data – indicated by columns – form an n-dimensional space. For example, if you had three input variables, you would obtain a three-dimensional space. A hyperplane is selected to separate the points in the n-dimensional space by class, 0 or 1. In a two-dimensional space, the hyperplane can be visualized as a line, and let us assume that all the input points can be separated perfectly by this line, for instance A0 + (A1 * X1) + (A2 * X2) = 0. The coefficients A1 and A2 determine the slope of the line, and the intercept (the point where the line cuts the Y-axis) is determined by A0. The learning algorithm finds these coefficients; X1 and X2 are the input variables.

Classifications can be made using the line above. You can calculate whether a new point would be above or below the line by inputting the values into the equation.

●  If the point is above the line, the equation would return a value greater than 0 implying that the point represents the first class (Class 0)

●  If a point is below the line, the equation would return a value less than 0 implying that the point would represent the second class (Class 1)

●  A point that lies close to the line would return a value close to 0, making it hard to tell which class the point belongs to. The larger the magnitude of the returned value, the more confident the model can be in its prediction

The perpendicular distance between the data points closest to the line and the line is called the margin. The line that can separate two classes has the largest margin and this is called the Maximal-Margin hyperplane. The points closest to the hyperplane are called support vectors since they define or support the hyperplane. The machine learns the hyperplane using an optimization procedure on the training data to maximize the margin.
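A minimal sketch of this decision rule and of learning a maximum-margin separator in Python, using scikit-learn’s linear SVM on a tiny invented two-dimensional data set (class labels 0 and 1 as in the text). Note that scikit-learn’s C is a penalty parameter, so setting it very high approximates the hard, maximal margin; this convention is discussed again in the soft-margin sketch below.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable points in two dimensions
X = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 1.8],    # class 0
              [5.0, 6.0], [6.0, 5.5], [5.5, 7.0]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM with a very large penalty behaves like the maximal-margin classifier
model = SVC(kernel="linear", C=1e6)
model.fit(X, y)

a1, a2 = model.coef_[0]        # A1 and A2 in the text's notation
a0 = model.intercept_[0]       # A0

# The decision rule from the bullet list: the sign of A0 + A1*X1 + A2*X2
x_new = np.array([3.0, 3.0])
value = a0 + a1 * x_new[0] + a2 * x_new[1]
print("class 1" if value > 0 else "class 0")

# The support vectors are the training points closest to the hyperplane
print("support vectors:\n", model.support_vectors_)
```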

Soft Margin Classifier

In practice, the data fed to the machine often cannot be separated cleanly by a hyperplane. You will need to relax the constraint that the margin must be maximized with every point on the correct side of the line. This is called the soft margin classifier, where some points in the data are allowed to violate the separating line.

To give the margin room to be violated, additional coefficients called slack variables are introduced. These coefficients increase the complexity of the model, since there are now more parameters for the model to fit to the data. A tuning parameter, C, is introduced that defines the magnitude of the violation allowed across all dimensions. If C=0, no violation is allowed and you will obtain results similar to the Maximal-Margin classifier. As the value of C increases, the permitted violations of the hyperplane also increase.

While the hyperplane is being learned from the data, the training instances that lie within the margin will affect where the hyperplane is placed. The parameter C therefore influences the number of support vectors used by the model, as the sketch following the list below illustrates:

●  If the value of C is small, the algorithm is sensitive to the training data

●  If the value of C is large, the algorithm is less sensitive to the training data
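A short sketch of the soft margin in practice. Note that scikit-learn’s C parameter is a penalty on violations, so it runs in the opposite direction to the violation budget described above: a small scikit-learn C permits many violations (a softer margin, more support vectors), while a very large C approaches the hard, maximal-margin behavior. The overlapping data here are randomly generated for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping clouds of points, so a perfect separation is impossible
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=1.5, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Compare a soft margin (small penalty) with a nearly hard margin (large penalty)
for penalty in [0.01, 1.0, 100.0]:
    model = SVC(kernel="linear", C=penalty).fit(X, y)
    print(f"C={penalty:>6}: {len(model.support_vectors_)} support vectors")
```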

Decision Trees

A decision tree model can become complicated and may contain unnecessary structure that is difficult to understand and interpret. This is where tree pruning comes in. Pruning removes unnecessary information and detail from the decision tree, making it more readable and understandable for human beings, and it also reduces overfitting, which in turn improves accuracy on unseen data.

How can an algorithm be represented as a tree?

Let us consider an example where the Titanic data set is used to predict whether a given passenger will survive or die. The model can use three attributes or columns – age, sex, and the number of spouses or children travelling with the passenger.

A decision tree is drawn with its root at the top of the diagram. The attributes form the internal nodes of the model, and the tree splits into branches or edges based on the answers to questions about those attributes. The end of an edge that does not split any further is called a leaf, or decision; the decision in this case is whether or not the passenger survived.

A real data set would have more attributes, and the tree would be larger, but the algorithm remains simple. The importance of the attributes is clear, and the relationships between them can be viewed easily. This process is commonly called learning a decision tree from data. Such trees can be built for discrete targets (classification trees) and continuous targets (regression trees).
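A minimal sketch of learning such a tree with scikit-learn. The handful of rows below imitate the Titanic attributes described above but are invented for illustration; a real analysis would load the full data set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented rows imitating the Titanic attributes:
# [age, sex (0 = male, 1 = female), number of spouses/children aboard]
X = np.array([[22, 0, 1], [38, 1, 1], [26, 1, 0], [35, 1, 1],
              [35, 0, 0], [54, 0, 0], [2, 0, 4], [27, 1, 2]])
y = np.array([0, 1, 1, 1, 0, 0, 0, 1])   # 1 = survived, 0 = died

# max_depth limits the tree's size, a simple form of pruning
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned tree with the root at the top
print(export_text(tree, feature_names=["age", "sex", "sibsp"]))

# Predict for a new passenger: a 30-year-old woman travelling with one child
print(tree.predict([[30, 1, 1]]))
```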

You as the modeler will need to identify the attributes that can be chosen and what conditions can be used to split the tree. You will also need to know when to stop extending your tree. Let us take a look at the simplest way to split a tree.

Recursive Binary Splitting

All the attributes of the data set are considered, and different points at which to split the data are tried and tested using a cost function. The split with the lowest cost is selected. Consider the example above, where the tree is learned from the Titanic data set. At the root, every attribute is considered and the data set is divided into groups based on each candidate split. Since there are three attributes, there are three candidate splits to evaluate. The next step is to calculate the cost of each split – how much accuracy is lost when it is made – and the split that costs the least is chosen.

This algorithm is recursive, since the resulting groups can be divided further using the same strategy. It is also greedy: at every step it chooses the split that reduces the cost the most right now, which is why the root node ends up being the best single predictor.
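A short sketch of the cost calculation at a single node, using the Gini impurity as the cost function (a common choice for classification trees; the tiny data set is invented):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a group of class labels (0 = pure, higher = more mixed)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_cost(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a candidate split."""
    left, right = labels[values < threshold], labels[values >= threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Invented node: ages of eight passengers and whether they survived
ages     = np.array([2, 22, 26, 27, 35, 35, 38, 54])
survived = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# Try every candidate threshold and keep the cheapest split
candidates = np.unique(ages)
best = min(candidates, key=lambda t: split_cost(ages, survived, t))
print("best threshold:", best, "with cost", round(split_cost(ages, survived, best), 3))
```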

Genetic Algorithms

Genetic algorithms (GAs) attempt to mimic computationally the processes by which natural selection operates and apply them to solve business and research problems. Developed by John Holland in the 1960s and 1970s, genetic algorithms provide a framework for studying the effects of such biologically inspired factors as mate selection, reproduction, mutation, and crossover of genetic information. In the natural world, the constraints and stresses of a particular environment force different species (and different individuals within species) to compete to produce the fittest offspring. In the world of genetic algorithms, the fitness of various potential solutions is compared, and the fittest potential solutions evolve to produce ever more optimal solutions.

Unsurprisingly, the field of genetic algorithms borrows heavily from genomic terminology. Every cell in the body contains the same set of chromosomes – strings of DNA that serve as a blueprint for making one of us. Each chromosome can be partitioned into genes, which are blocks of DNA that encode a particular trait, such as eye color. A particular instance of a gene (e.g., brown eyes) is an allele, and each gene is found at a particular locus on the chromosome. Recombination, or crossover, occurs during reproduction, when a new chromosome is formed by combining the characteristics of both parents’ chromosomes. Mutation, the alteration of a single gene in a chromosome of the offspring, may occur randomly and relatively rarely. The offspring’s fitness is then evaluated, either in terms of viability (living long enough to reproduce) or in terms of the offspring’s fertility.

In the field of genetic algorithms, a chromosome refers to one of the candidate solutions to the problem, a gene is a single bit or digit of the candidate solution, and an allele is a particular instance of the bit or digit (e.g., 0 for binary-encoded solutions or the number 7 for real-valued solutions). Recall that binary numbers have base 2, such that the first “decimal” place represents “ones,” the second represents “twos,” the third represents “fours,” the fourth represents “eights,” and so on.

Genetic algorithms use three operators:

Selection: This operator refers to the method used to select the chromosomes that will be used for reproduction. The fitness function evaluates every chromosome (also called a candidate solution), and the fitter the chromosome, the more likely it is to be selected by the algorithm to reproduce.

Crossover: The crossover operator uses recombination to create two new offspring by randomly selecting a focal point, or locus, and then exchanging the subsequences to the left and right of that locus between the two chromosomes chosen during selection. For instance, when a binary representation is used, the two strings 111111 and 000000 can be crossed over at the fourth locus to generate two new offspring – 111000 and 000111.

Mutation: The mutation operator randomly changes the digit or bit at a particular focal point, or locus, in a given chromosome, though with a very small probability. For instance, once the crossover is made, the 111000 child string can then be mutated at the second locus to become 101000. This operator introduces new information into the gene pool and protects the population from converging too quickly.

Genetic algorithms work by iteratively updating a collection of potential solutions called a population. Each member of the population is evaluated for fitness in every cycle, and the old population is then replaced by a new one using the operators described above, with the fittest chromosomes being chosen for cloning or reproduction. The fitness function f(x) is a real-valued function that operates on the chromosome (the potential solution), not on a single gene, so the x in f(x) refers to the numeric value taken by the chromosome at the time of fitness evaluation.
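A compact sketch of these ideas in Python, evolving binary chromosomes toward a simple invented fitness function (the number of ones in the string, the classic “one-max” toy problem); the population size, rates, and chromosome length are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

CHROMOSOME_LENGTH = 12
POPULATION_SIZE = 30
GENERATIONS = 40
MUTATION_RATE = 0.01

def fitness(chromosome):
    """Invented toy fitness: the number of 1-bits in the chromosome."""
    return chromosome.sum()

def select(population):
    """Selection: pick a parent with probability proportional to fitness."""
    scores = np.array([fitness(c) for c in population], dtype=float)
    probs = scores / scores.sum()
    return population[rng.choice(len(population), p=probs)]

def crossover(parent_a, parent_b):
    """Crossover: swap the tails of the two parents at a random locus."""
    point = rng.integers(1, CHROMOSOME_LENGTH)
    return (np.concatenate([parent_a[:point], parent_b[point:]]),
            np.concatenate([parent_b[:point], parent_a[point:]]))

def mutate(chromosome):
    """Mutation: flip each bit with a small probability."""
    flips = rng.random(CHROMOSOME_LENGTH) < MUTATION_RATE
    return np.where(flips, 1 - chromosome, chromosome)

population = [rng.integers(0, 2, CHROMOSOME_LENGTH) for _ in range(POPULATION_SIZE)]
for _ in range(GENERATIONS):
    next_generation = []
    while len(next_generation) < POPULATION_SIZE:
        child_a, child_b = crossover(select(population), select(population))
        next_generation += [mutate(child_a), mutate(child_b)]
    population = next_generation

print("best chromosome:", max(population, key=fitness))
```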