
Classification


Classification refers to the process of predicting the class of a given data point. Classes are also referred to as labels, targets, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

Let’s take the example of spam detection in an email service, which can be framed as a classification problem. This is an example of binary classification because there are only two classes: spam and not spam. A classifier uses training data to learn how a given set of input variables relates to a particular class. In this example, known spam and non-spam emails serve as the training data. Once the classifier is accurately trained, it can be used to classify unseen emails.

Classification is a form of supervised learning, since the targets are provided together with the input data. Classification is applied in many areas of real life, including medical diagnosis, credit approval, target marketing, and many more.

Classification has two types of learners.

1)  Lazy learners

Lazy learners store the training data and wait until test data arrives. When it does, classification is performed based on the most closely related data in the stored training set. Compared to eager learners, lazy learners spend minimal time on training but more time on prediction. Examples include k-nearest neighbors and case-based reasoning, which we shall look at later in the chapter.

2)  Eager learners

Eager learners construct a classification model from the given training data before receiving any data to classify. The model must commit to a single hypothesis that covers the entire instance space. Because of this model construction, eager learners take more time to train but minimal time to predict. Examples of eager learners include artificial neural networks, Naive Bayes, and decision trees.

Classification Algorithms

Many different classification algorithms have been developed, and it is hard to declare one better than the others. The choice depends on several factors, such as the application and the nature of the available data set. For instance, if the classes are linearly separable, linear classifiers such as logistic regression or Fisher’s linear discriminant can outperform far more complex models.

––––––––


Decision Tree


A decision tree builds classification and regression models in the form of a tree structure. It works on the same principle as a set of if-then rules that is mutually exclusive and exhaustive for classification. The rules are learned sequentially from the training data, one at a time. Each time a rule is learned, the tuples the rule covers are removed. This process is repeated on the training set until a termination condition is met.

The tree is built in a top-down, recursive, divide-and-conquer manner. All the features must be categorical; otherwise, they need to be discretized in advance. Decision trees are very prone to overfitting: an overfitted tree produces many branches, some of which may reflect noise and outliers. Such a model performs very poorly on unseen data even though it performs well on the training data.

However, this can be avoided with pruning: pre-pruning halts tree construction early, while post-pruning removes branches from a fully grown tree.
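The greedy, top-down splitting step at the heart of decision tree learning can be sketched in a few lines. This is a minimal illustration with hypothetical toy data, not a full implementation: it finds the single threshold on a one-dimensional feature that minimizes weighted Gini impurity, the step a real tree learner applies recursively to each node.

```python
# A minimal sketch of one greedy split: choose the threshold t on a 1-D
# feature that minimizes the weighted Gini impurity of the two branches.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return (threshold, impurity) of the best binary split x <= t."""
    best_t, best_imp = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

# Toy data: a feature value and a class label per sample (hypothetical).
xs = [1, 2, 3, 10, 11, 12]
ys = ["A", "A", "A", "B", "B", "B"]
t, imp = best_split(xs, ys)
print(t, imp)  # the split at t=3 separates the classes perfectly
```

A full tree learner would now recurse on the left and right subsets until a termination condition (purity, depth, or minimum node size) is met.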

Merits of Decision trees

I)  Transparency

This is one of the most important advantages of the decision tree model. Unlike many other models, a decision tree reveals all possible alternatives and traces each alternative to its conclusion in a single view, which makes it easy to compare the different alternatives. The use of separate nodes to represent user-defined decisions increases transparency in decision making.

II)  Specificity

Another major advantage of decision trees in analysis is the ability to assign a concrete value to each problem and to the outcome of every decision. This helps minimize vagueness in decision making: every possible scenario is represented by a clear fork and node, so all solutions can be seen in a single view. Attaching monetary values to the tree makes the costs and benefits of each course of action explicit.

III)  Ease of use

A decision tree gives a graphical representation of the problem and the different alternatives in such a simple form that anyone can understand it without asking for an explanation.

IV)  Comprehensive nature

The decision tree is one of the best predictive models because it provides a comprehensive analysis of the consequences of every possible decision: what the decision leads to, whether it ends in uncertainty, or whether it raises new issues that require the process to be repeated.

✓  They implicitly perform feature selection.

✓  Decision trees can deal with categorical and numerical data.

✓  Users have little to do with data preparation.

✓  Nonlinear relationships between parameters do not affect their performance.

––––––––


Demerits of Decision trees

1.  Decision trees can be unstable: small variations in the data may lead to a completely different tree being generated.

2.  The greedy algorithm cannot guarantee that it returns the globally optimal decision tree. This can be mitigated by training multiple trees on samples and features drawn randomly with replacement.

3.  Decision tree learners can build overly complex trees that do not generalize well beyond the training data.

4.  Decision tree learners can be biased when some classes dominate. For that reason, it is advisable to balance the data set before fitting a decision tree.
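The idea of training multiple trees on randomly resampled data, mentioned in point 2 above, can be sketched in miniature. This is a hedged illustration of bagging (bootstrap aggregating) on hypothetical data: the base learner here is a trivial one-dimensional threshold classifier rather than a full decision tree, purely to keep the example short.

```python
# A minimal sketch of bagging: train many weak learners on bootstrap
# samples (drawn with replacement), then predict by majority vote.
import random

def fit_threshold(xs, ys):
    """Fit a threshold halfway between the means of the two classes."""
    a = [x for x, y in zip(xs, ys) if y == 0]
    b = [x for x, y in zip(xs, ys) if y == 1]
    if not a or not b:                 # degenerate one-class sample
        return lambda x: ys[0]
    t = (sum(a) / len(a) + sum(b) / len(b)) / 2
    return lambda x: 0 if x <= t else 1

def bagged_predict(models, x):
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)   # majority vote

xs = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]    # hypothetical feature values
ys = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)                  # fixed seed for reproducibility
models = []
for _ in range(25):                     # 25 bootstrap samples
    idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
    models.append(fit_threshold([xs[i] for i in idx], [ys[i] for i in idx]))

print(bagged_predict(models, 1.2), bagged_predict(models, 8.7))
```

Because each model sees a slightly different sample, the ensemble's vote is more stable than any single learner, which is exactly the instability remedy described above.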

––––––––


K-Nearest Neighbor(KNN)


The k-nearest neighbors algorithm is a lazy learning method that stores all instances of the training data as points in n-dimensional space. Given an unseen data point with a discrete target, it examines the k nearest stored instances and returns the most common class among them as its prediction. For real-valued targets, it returns the mean of the k nearest neighbors.

The distance-weighted nearest neighbor variant weights the vote of each of the k nearest neighbors by its distance to the query point x_q, typically as

w_i = 1 / d(x_q, x_i)²

so that closer neighbors count for more.

Typically, KNN is quite robust to noisy data because it averages over the k nearest neighbors.
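The distance-weighted scheme above can be sketched directly. This is a minimal illustration on hypothetical 2-D points, not a production implementation: each of the k nearest neighbors votes for its class with weight 1/d², and the class with the largest total weight wins.

```python
# A minimal sketch of distance-weighted k-nearest neighbors for a
# discrete target: each of the k nearest points votes with weight 1/d^2.
import math

def knn_predict(train, query, k=3):
    """train: list of ((features...), label); query: tuple of features."""
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    weights = {}
    for d, y in nearest:
        if d == 0:                      # exact match: return its label
            return y
        weights[y] = weights.get(y, 0.0) + 1.0 / d ** 2
    return max(weights, key=weights.get)

# Hypothetical 2-D training points with two classes.
train = [((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
         ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue")]
print(knn_predict(train, (1, 1)))   # the 3 nearest neighbors are all red
print(knn_predict(train, (5, 4)))   # the 3 nearest neighbors are all blue
```

Note that there is no training step at all: everything happens at prediction time, which is precisely what makes KNN a lazy learner.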

––––––––


Merits of KNN

›  It is simple to implement and has no explicit training phase.

›  It makes no assumptions about the underlying data distribution.

›  New training data can be added at any time without retraining.

Demerits of KNN

›  Prediction is slow on large data sets, since distances to all stored instances must be computed.

›  It is sensitive to irrelevant features and to the scale of the data.

›  The value of k must be chosen carefully.

Features of KNN

›  It is a non-parametric, instance-based learner that defers all computation to prediction time and relies entirely on a distance metric between points.

Where Can You Apply K-means

K-means works best with data that is numeric, continuous, and of small dimensionality. Imagine needing to group similar items out of a randomly spread collection of things: that is exactly what k-means does. The list below covers a few interesting areas where you can apply k-means.

a)  Classification of documents

Clustering documents into multiple categories based on topics, tags, and content is a classic problem, and k-means is a great algorithm for it. The initial document processing is important: each document is represented as a vector, using term frequencies to identify the terms that characterize the document. The document vectors are then clustered to reveal groups of similar documents.
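The cluster-the-vectors step can be sketched with a tiny from-scratch k-means on toy term-frequency vectors. This is a hedged illustration, not a real pipeline: the "documents" are hypothetical two-word count vectors, and the centroids are fixed at the start so the run is deterministic (real implementations use random restarts and usually TF-IDF rather than raw counts).

```python
# A minimal sketch of k-means: alternate between assigning each point to
# its nearest centroid and moving each centroid to its cluster's mean.
import math

def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return clusters

# Toy "documents" as (count of word 'ball', count of word 'election'):
docs = [(5, 0), (4, 1), (6, 0),     # sports-like documents
        (0, 4), (1, 5), (0, 6)]     # politics-like documents
clusters = kmeans(docs, centroids=[(5.0, 0.0), (0.0, 5.0)])
print(clusters)
```

The two clusters that come out correspond to the two topics, which is the grouping a document-clustering system would then label or route.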

b)  Delivery store Optimization

To improve the delivery process, you might employ drones and use the k-means algorithm to determine the optimal number of launch locations, together with a genetic algorithm to compute the truck routes.

c)  Fantasy League Stat Analysis

Analyzing player statistics is one of the most critical aspects of the sporting world, and with rapidly rising competition, machine learning has an important role to play here. As a useful exercise, if you want to build a fantasy draft team by grouping similar players, k-means is a great option.

Rideshare Data analysis

Ride-share data, such as Uber's, is available to the public. These datasets contain a wealth of valuable information about transit times, traffic, peak pickup localities, and more. Analyzing this data yields insight into urban traffic patterns and helps with planning cities of the future.

Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant links. It draws on criminal histories to provide the investigation division with information that helps categorize the criminals present at a crime.

Automatic clustering of IT Alerts

Large enterprise IT infrastructure, such as networks, generates huge volumes of alert messages. Because alert messages flag operational issues, they must be manually screened for categorization. Clustering this data can provide insight into alert categories and mean time to repair, and it supports failure prediction.

Identify crime localities

Data associated with crime is available for specific localities within a city. The type of crime, the area of the crime, and the relationship between the two can provide quality insight into the most crime-prone areas of a city or locality.

Artificial Neural Network


An artificial neural network is a set of connected input/output units in which every connection carries a weight. During the learning phase, the network adjusts these weights so that it can predict the correct class labels of the input tuples.

Many network architectures are available today, including feed-forward, recurrent, and convolutional networks. The correct architecture depends on the application of the model. In most cases, feed-forward models give reasonably accurate results, and convolutional variants perform especially well in image-processing applications.

A model can have many hidden layers, depending on the complexity of the function it must capture. Many hidden layers facilitate the modeling of complex relationships, as in deep neural networks.

However, the presence of many hidden layers increases the time it takes to train and adjust weights. Another drawback is the poor interpretability when compared to other models such as Decision Trees.

Despite this, ANNs have performed well in the majority of real-world applications. They are highly tolerant of noisy data and can classify patterns they were not trained on. Generally, ANNs work better with continuous-valued inputs and outputs.
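The weight-adjustment idea described above can be sketched with the simplest possible network: a single artificial neuron (a perceptron) learning the logical AND function. This is a hedged toy example, not a deep network; real ANNs stack many such units into layers and train them with backpropagation rather than this simple update rule.

```python
# A minimal sketch of learning by weight adjustment: one neuron with a
# step activation, trained with the perceptron rule on the AND gate.

def step(z):
    return 1 if z > 0 else 0

# Inputs with a constant bias input appended, and AND-gate targets.
data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]                 # one weight per input, incl. bias
lr = 0.1                            # learning rate

for _ in range(20):                 # a few passes over the training data
    for x, target in data:
        pred = step(sum(wi * xi for wi, xi in zip(w, x)))
        err = target - pred         # 0 if correct, +/-1 otherwise
        # Nudge each weight in the direction that reduces the error.
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]

print([step(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data])
```

After training, the neuron outputs 1 only for the input (1, 1): the "knowledge" of the AND function is stored entirely in the learned weights, which is the sense in which an ANN stores information across its connections.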

Advantages of ANN

•  It stores information across the whole network. Unlike traditional programming, where information is kept in a database, in an ANN the information is distributed over the entire network, so losing a few pieces of information in one place does not stop the network from functioning.

•  It has fault tolerance. The corruption of one or more cells of an ANN does not prevent it from producing output, which makes the network fault tolerant.

•  It can work with incomplete knowledge. Once trained, an ANN can produce output even from incomplete input; the loss of performance depends on how important the missing information is.

•  It has the ability to learn: the network learns from examples and generalizes to similar cases.

•  It has parallel processing capability: ANNs have the numerical strength to perform more than one job at the same time.

Disadvantages of ANN

•  It is hardware dependent. By their structure, ANNs require processors with parallel processing power, so realizing the network depends on suitable hardware.

•  Determining the proper network structure is hard. There is no fixed rule for determining the structure of an artificial neural network; the right structure is found through trial and error.

•  The duration of training is unknown. Training is considered complete when the error on the sample falls to a certain value, but reaching that value does not guarantee an optimal result.

•  The behavior of the network is unexplained. This is one of the major problems of ANNs: when an ANN produces a solution, it gives no hint as to why or how, which reduces trust in the network.

Naïve Bayes

The Naïve Bayes algorithm is a probabilistic classifier derived from Bayes’ theorem. It rests on a simple assumption: the attributes are conditionally independent given the class.

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

Classification works by finding the maximum posterior, that is, the class with maximal P(Ci|X) under the independence assumption above. The assumption greatly reduces the computational cost, since only the class distributions need to be counted. Although the assumption often fails in practice, because attributes are frequently dependent, Naïve Bayes continues to perform surprisingly well.

It is a simple algorithm to implement, and in most instances it produces good results out of the box. It scales to massive datasets because it runs in linear time.
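The counting procedure described above can be sketched for categorical features. This is a hedged toy illustration on a hypothetical two-feature spam dataset, not a full library: it counts class priors and per-class feature frequencies, applies add-one (Laplace) smoothing, and picks the class with the highest posterior, working in log space to avoid numerical underflow.

```python
# A minimal sketch of Naive Bayes for categorical features, with
# add-one smoothing. Conditional independence lets the joint likelihood
# factor into a product of simple per-feature counts.
import math

def train_nb(rows, labels):
    """rows: list of feature tuples; returns (class counts, feature counts)."""
    priors, counts = {}, {}
    for x, c in zip(rows, labels):
        priors[c] = priors.get(c, 0) + 1
        for i, v in enumerate(x):
            counts[(c, i, v)] = counts.get((c, i, v), 0) + 1
    return priors, counts

def predict_nb(priors, counts, x, values_per_feature=2):
    total = sum(priors.values())
    best, best_logp = None, -math.inf
    for c, n_c in priors.items():
        # log P(c) + sum_i log P(x_i | c), with add-one smoothing
        logp = math.log(n_c / total)
        for i, v in enumerate(x):
            logp += math.log((counts.get((c, i, v), 0) + 1)
                             / (n_c + values_per_feature))
        if logp > best_logp:
            best, best_logp = c, logp
    return best

# Toy spam data: features = (contains "offer", sender is known).
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = ["spam", "spam", "ham", "ham", "ham", "ham"]
priors, counts = train_nb(rows, labels)
print(predict_nb(priors, counts, (1, 0)))   # "offer" + unknown sender
```

Training is a single counting pass over the data, which is why Naïve Bayes runs in linear time.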

Advantages of Naïve Bayes

›  It is simple and easy to implement.

›  It requires a minimal training data.

›  It handles continuous and discrete data.

›  It can develop probabilistic predictions.

›  It is highly scalable.

Disadvantages of Naïve Bayes

›  Its conditional independence assumption rarely holds in real data.

›  A categorical value seen at prediction time but never during training receives zero probability (the zero-frequency problem), which is usually handled with smoothing.

›  Its probability estimates are known to be unreliable, even when its class predictions are good.

Classification Accuracy Metrics

Classification accuracy is the ratio of the number of correct predictions to the total number of input samples.

Accuracy = Number of correct predictions / Total number of predictions

It works well when each class contains an equal number of samples. With imbalanced classes, however, accuracy can give a false sense of strong performance.

The major problem emerges when the cost of misclassifying the minority class is high. When screening for a rare but dangerous disease, the cost of failing to diagnose a sick individual is far higher than the cost of testing a healthy person.
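The disease example above can be made concrete with a few lines of code. This sketch uses invented numbers purely for illustration: a degenerate "classifier" that always predicts "healthy" scores 95% accuracy on a population where only 5% are sick, while never detecting a single sick patient.

```python
# A short sketch of why accuracy misleads on imbalanced classes.

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = ["sick"] * 5 + ["healthy"] * 95     # 5% positive class
y_pred = ["healthy"] * 100                   # always predict "healthy"
print(accuracy(y_true, y_pred))              # high accuracy, zero value
```

This is exactly why metrics such as log loss, the confusion matrix, and AUC, covered next, are preferred when classes are imbalanced.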

Logarithmic Loss

This works well for multi-class classification. When you use log loss, the classifier must assign a probability to every class for each sample. For N samples and M classes, the log loss is computed as follows:

Log Loss = −(1/N) · Σᵢ Σⱼ yᵢⱼ · log(pᵢⱼ)

where yᵢⱼ indicates whether sample i belongs to class j, and pᵢⱼ is the probability the classifier assigns to sample i belonging to class j.
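The log loss computation can be sketched directly from its definition. Since yᵢⱼ is 1 only for the true class, the double sum reduces to the negative log-probability assigned to each sample's true class; the probabilities below are invented for illustration.

```python
# A minimal sketch of multi-class log loss: average the negative
# log-probability the classifier assigned to each sample's true class.
import math

def log_loss(y_true, probs):
    """y_true: true class indices; probs: per-sample probability vectors."""
    eps = 1e-15                       # clip to avoid log(0)
    total = 0.0
    for c, p in zip(y_true, probs):
        total -= math.log(max(p[c], eps))
    return total / len(y_true)

y_true = [0, 1, 1]
probs = [[0.9, 0.1],                  # confident and correct
         [0.2, 0.8],                  # fairly confident and correct
         [0.6, 0.4]]                  # leaning toward the wrong class
print(round(log_loss(y_true, probs), 4))
```

Note how the third, wrongly confident prediction contributes far more to the loss than the two correct ones: log loss punishes confident mistakes heavily.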

Confusion Matrix

The confusion matrix, as its name suggests, outputs a matrix that describes the complete performance of a model.

Suppose you have a binary classification problem, so each sample belongs to one of two classes, YES or NO, and you have a classifier that predicts a class for every input sample. Testing the model on a set of samples yields counts of four kinds of outcomes.


There are four major terms:

✓  True Positives. The prediction was YES and the actual outcome was YES.

✓  True Negatives. The prediction was NO and the actual outcome was NO.

✓  False Positives. The prediction was YES but the actual outcome was NO.

✓  False Negatives. The prediction was NO but the actual outcome was YES.
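Tallying the four cells from a list of predictions is straightforward. This sketch uses a small invented set of YES/NO labels to show how each (actual, predicted) pair falls into exactly one of the four categories above.

```python
# A small sketch that counts the four confusion-matrix cells for a
# binary YES/NO problem directly from predictions.

def confusion(y_true, y_pred, positive="YES"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

y_true = ["YES", "YES", "NO", "NO", "YES", "NO"]
y_pred = ["YES", "NO",  "NO", "YES", "YES", "NO"]
print(confusion(y_true, y_pred))   # (tp, tn, fp, fn) = (2, 2, 1, 1)
```

Metrics such as precision (tp / (tp + fp)) and recall (tp / (tp + fn)) are then derived directly from these four counts.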

Area Under the Curve

This is one of the most widely used evaluation metrics. It is applied to binary classification problems and measures the area under the ROC curve: the probability that the classifier ranks a randomly chosen positive sample above a randomly chosen negative one.

Other metrics include:

•  Mean absolute error

•  Mean squared error