An important part of the machine learning process is called training, where a machine is provided with data about historic events that helps it anticipate future events. When the training data is labeled with the correct answers, this type of learning is called supervised machine learning. The data fed to the machine consists of training examples, each made up of an input and the desired output. These desired outputs are also known as supervisory signals. A supervised learning algorithm then creates an inferred function from the examples, which is used to predict the outputs for future inputs. If the outputs are discrete, the function is called a classifier; if the outputs are continuous, it is known as a regression function. In other words, the algorithm generates a generalized method for reaching the output from the input data. An analogy in the spheres of human and animal learning is concept learning.
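A minimal sketch of the classifier/regression distinction, assuming scikit-learn is available; the tiny in-line data sets and feature values are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical feature vectors (e.g., weight and a binary attribute).
X = [[150, 0], [170, 1], [160, 0], [180, 1]]

# Discrete desired outputs (class labels) -> a classifier is learned.
y_discrete = ["cat", "dog", "cat", "dog"]        # supervisory signals
clf = DecisionTreeClassifier().fit(X, y_discrete)
print(clf.predict([[165, 1]]))                   # predicts a class label

# Continuous desired outputs -> a regression function is learned.
y_continuous = [4.2, 7.9, 5.1, 9.3]              # supervisory signals
reg = DecisionTreeRegressor().fit(X, y_continuous)
print(reg.predict([[165, 1]]))                   # predicts a real number
```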
Solving a supervised learning problem always follows a fixed sequence of steps, listed below:
• The first step in this process is to determine the type of examples that will be used to train the machine. This step is crucial because the engineer needs to know what kind of data to use as examples. For instance, for a speech recognition system, the engineer could use single words, short sentences, or entire paragraphs to train the machine.
• Once the engineer has decided on the type of data, the next step is to collect it and create a training set that is representative of the real-world use of the function. In other words, the engineer gathers the inputs along with their desired outputs for use in the training process.
• Next, the engineer must determine how to represent the input data to the machine. This is very important since the accuracy of the learned function depends heavily on the input representation. Normally, each input is represented as a vector containing its characteristic features. The vector should not include too many features, since this increases training time, and a larger number of features can also lead to prediction mistakes. Ideally, the vector contains just enough information to predict the outputs.
• Once the engineer has decided what data to use as input, a decision must be made on the structure of the function to be learned and on the learning algorithm. Commonly used choices are support vector machines and decision trees.
• The engineer now completes the design by running the chosen algorithm on the collected training set. Some algorithms require the engineer to set certain control parameters; these can be estimated by optimizing performance on a held-out subset of the training data or by using cross-validation.
• Once the algorithm has run and the function has been generated, its accuracy and effectiveness must be evaluated. Engineers use a test set for this: a data set separate from the training set whose correct outputs are already known. The test inputs are fed to the machine, and the outputs it produces are compared against those in the test set. A sketch of this whole workflow follows this list.
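The sketch below walks through these steps end to end, assuming scikit-learn; its built-in iris data set stands in for the engineer's collected examples, and the parameter grid is an illustrative choice, not a rule:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Steps 1-3: gather examples and represent each input as a feature vector
# (here, four measurements per flower).
X, y = load_iris(return_X_y=True)

# Hold out a test set that plays no part in training (used in step 6).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 4: choose the function structure and learning algorithm (an SVM).
# Step 5: estimate the control parameter C by cross-validation on the
# training set only.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Step 6: measure accuracy on the unseen test set.
print(accuracy_score(y_test, search.predict(X_test)))
```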
Many supervised learning algorithms are in common use, and each has its strengths and weaknesses. Since no single algorithm works best in all instances, selecting the learning algorithm is a major step in the procedure.
The use of supervised learning algorithms raises a few recurring issues. Four major ones are given below:
The first issue to keep in mind while working with machine learning is the bias-variance tradeoff. Suppose several equally good training sets are available. A learning algorithm is biased for a particular input if, after being trained on these data sets, it is systematically incorrect when predicting the output for that input. A learning algorithm has high variance for an input if it predicts different outputs for that input depending on which training set it was trained on. The prediction error of the learned classifier is closely tied to the sum of the bias and the variance of the learning algorithm, and there is a tradeoff between the two. Algorithms with low bias must be flexible enough to fit each data set well; however, if they are too flexible, they will fit every training set differently and hence have high variance. Good supervised learning algorithms adjust this tradeoff, either automatically or through a parameter the engineer can tune.
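One concrete form of such an adjustable parameter is k in k-nearest neighbors: small k is flexible (low bias, high variance), large k is rigid (high bias, low variance). A hedged sketch, assuming scikit-learn and a synthetic data set invented for demonstration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy targets

# Cross-validated R^2 scores reveal the tradeoff: k=1 chases the noise
# (high variance), k=50 oversmooths (high bias), and an intermediate k
# typically balances the two.
for k in (1, 5, 50):
    score = cross_val_score(KNeighborsRegressor(n_neighbors=k),
                            X, y, cv=5).mean()
    print(f"k={k}: mean R^2 = {score:.2f}")
```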
The second issue concerns the amount of training data needed relative to the complexity of the classifier or regression function to be learned. If the true function is simple, a relatively inflexible learning algorithm with low variance and high bias can learn it from a small amount of training data. On many occasions, however, the function is complex, either because a large number of input features are involved or because the machine is expected to behave differently in different parts of the input space. Such a function can only be learned from a large amount of training data, using a flexible algorithm with low bias and high variance. Good learning algorithms therefore automatically adjust the bias-variance tradeoff based on the complexity of the function and the amount of training data available.
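One way to probe whether more training data would help is a learning curve. A sketch using scikit-learn's learning_curve; the digits data set and the SVM are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    SVC(), X, y, train_sizes=[0.1, 0.3, 0.6, 1.0], cv=5)

# If the validation score is still rising at the largest training size,
# the function is complex enough that more data would likely help.
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} examples -> validation accuracy {s:.2f}")
```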
Another issue that needs to be dealt with is the dimensionality of the input space. If the input vectors contain a large number of features, the learning problem becomes difficult even when the true function depends on only a few of them. This is because the extra, irrelevant dimensions can confuse the learning algorithm and cause it to have high variance. When the input dimension is large, the classifier is therefore typically tuned to have low variance and high bias to offset this effect. In practice, the engineer can manually remove irrelevant features to improve the accuracy and efficiency of the learning algorithm, but this is not always practical. In recent times, many algorithms have been developed that automatically discard unnecessary features and retain only the relevant ones. This concept is known as dimensionality reduction: the input data is mapped into a lower-dimensional space before learning, which improves the performance of the learning algorithm.
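A minimal sketch of dimensionality reduction, assuming scikit-learn: principal component analysis (one common reduction technique) maps the 64-dimensional digits inputs down to 16 dimensions before a nearest neighbor classifier is trained.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)   # 64 features per image

# Reduce to 16 dimensions, then classify in the reduced space.
model = make_pipeline(PCA(n_components=16), KNeighborsClassifier())
print(cross_val_score(model, X, y, cv=5).mean())
```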
Last but not least is the interference of noise in the desired output values. The outputs recorded in the training data can be wrong because of noise added to them or because of human error in labeling. In such cases, the learning algorithm should not attempt to match the training inputs to their exact outputs, since doing so means fitting the noise. Algorithms with low variance and high bias are the most desirable here.
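A hedged illustration of why a higher-bias model is preferable on noisy outputs, using synthetic data invented for this sketch: both models below share the same flexible polynomial basis, but only the ridge model carries a regularization penalty that biases it away from chasing the label noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=40)  # noisy linear targets

flexible = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))

# The ridge model typically scores higher: it tolerates some bias in
# exchange for not fitting the noise in the training outputs.
for name, model in (("unregularized", flexible), ("ridge", regularized)):
    print(name, cross_val_score(model, X, y, cv=5).mean())
```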
• One important thing to keep in mind is the heterogeneity of the data, which should play a role in dictating which learning algorithm is chosen. Some algorithms, among them support vector machines, logistic regression, neural networks, linear regression, and nearest neighbor methods, work best when the inputs are scaled to small, comparable ranges. Nearest neighbor methods and support vector machines with Gaussian kernels are especially sensitive to input scale (see the sketch after this list). On the other hand, algorithms like decision trees handle heterogeneous data sets very well.
• Another characteristic of the data set that needs to be considered is the amount of redundancy among its features. A few algorithms, such as logistic regression, linear regression, and distance-based methods, perform poorly in the presence of excessive redundancy because of numerical instabilities. For such cases, regularization needs to be included so that these algorithms can perform better.
• While choosing an algorithm, engineers also need to consider the amount of non-linearity in the inputs and the interactions between different features of the input vector. If there is little to no interaction and each feature contributes independently to the output, algorithms based on linear functions and distance functions perform very efficiently. When there are interactions between the input features, algorithms based on decision trees and neural networks are preferable, because they are designed to detect such interactions. If the engineer decides to use a linear algorithm anyway, the interactions that exist must be specified manually.
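The following sketch illustrates the scaling sensitivity mentioned in the first point, assuming scikit-learn; the wine data set is chosen only because its features span very different ranges:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)   # heterogeneous feature ranges

# The Gaussian-kernel SVM typically improves sharply once the inputs
# are standardized to comparable ranges.
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
print(cross_val_score(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                      X, y, cv=5).mean())

# The decision tree, by contrast, is indifferent to feature scale.
print(cross_val_score(DecisionTreeClassifier(random_state=0),
                      X, y, cv=5).mean())
```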
When an engineer is tasked with selecting an algorithm for a specific application, one option is to compare various algorithms experimentally and decide which is best suited for the task. Even so, a large amount of time must be invested in collecting training data and tuning the algorithm. Given ample resources, it is advisable to spend more of that time collecting data than tuning the algorithm, because the latter is extremely tedious. The most commonly used learning algorithms are neural networks, nearest neighbor algorithms, linear and logistic regression, support vector machines, and decision trees. A sketch of such an experimental comparison follows.
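A hedged sketch of comparing candidate algorithms by cross-validation, assuming scikit-learn; the candidate list mirrors the algorithms named above, and the breast cancer data set is an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Range-sensitive algorithms get a scaling step; the tree does not need one.
candidates = {
    "logistic regression":    make_pipeline(StandardScaler(), LogisticRegression()),
    "nearest neighbors":      make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "decision tree":          DecisionTreeClassifier(random_state=0),
}

# Mean cross-validated accuracy gives a first, rough basis for choosing.
for name, model in candidates.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```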