Supervised learning is performed using a collection of samples together with the corresponding output value (desired output) for each sample. These machine learning methods are called supervised because we know the correct answer for each training example: the algorithm analyzes the training data and makes predictions on it, and these predictions can be corrected based on the difference between the prediction and the corresponding desired output. Based on these corrections, the algorithm learns from its mistakes and adjusts its internal parameters. In this way, in supervised learning, the algorithm iteratively fits a function that best approximates the relationship between the collection of samples and the corresponding desired outputs.
Supervised learning problems can be further grouped into the following categories:
- Classification: When the output variable is a category, such as color (red, green, or blue), size (large, medium, or small), or gender (male or female), the problem can be considered a classification problem. In a classification problem, the algorithm maps the input to discrete output labels.
- Regression: When the output variable is a real value, such as age or weight, the supervised learning problem can be classified as a regression problem. In a regression problem, the algorithm maps the input to a continuous output.
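To make the two categories concrete, here is a minimal sketch, assuming scikit-learn is available; the Iris dataset and the two linear models are illustrative choices rather than something prescribed above. Predicting the species is a classification problem, while predicting the petal length from the other measurements is treated as a regression problem:

```python
# Minimal classification vs. regression sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Classification: map the four measurements to a discrete species label.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: map three of the measurements to a continuous value (petal length).
X_reg, y_reg = X[:, [0, 1, 3]], X[:, 2]
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
print("regression R^2:", reg.score(Xr_test, yr_test))
```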
In supervised learning, there are some major issues to take into account, which, for the sake of completeness, are discussed next:
- Bias-variance trade-off: Bias-variance trade-off is a common term in machine learning that refers to the balance a model must strike between underfitting and overfitting: a model that underfits the data has a high bias, whereas a model that overfits the data has a high variance:
- The bias can be seen as the error arising from erroneous assumptions in the learning algorithm, and can be defined as the difference between the prediction of our model and the correct value we are trying to predict. It leads the algorithm to learn the wrong thing by not taking into account all of the information in the data (underfitting). Therefore, a model with a high bias fails to find all of the patterns in the data, so it does not fit the training set well and it will not fit the test set well either.
- The variance can be defined as the algorithm's tendency to learn random things, irrespective of the real signal, by fitting models that follow the error/noise in the data too closely (overfitting). Therefore, a model with a high variance fits the training set very well, but it fails to generalize to the test set because it has also learned the error/noise in the data. Check out the following diagram for a better understanding:
- Function complexity and the amount of training data: Model complexity refers to the complexity of the function the machine learning algorithm is attempting to learn, in a similar way to the degree of a polynomial. The proper level of model complexity is generally determined by the nature of the training data. For example, if you have a small amount of data to train the model, a low-complexity model is preferable, because a high-complexity model will overfit the small training set, as illustrated in the polynomial-fitting sketch after this list.
- Dimensionality of the input space: When dealing with high-/very high-dimensional feature spaces, the learning problem can be very difficult because the many extra features can confuse the learning process, resulting in high variance. Therefore, when dealing with such feature spaces, a common approach is to modify the learning algorithm to have a high bias and low variance. This problem is related to the curse of dimensionality, which refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces and that do not occur in low-dimensional spaces (see the dimensionality sketch after this list).
- Noise in the output values: If the desired output values are incorrect (due to human or sensor errors), overfitting can occur if the learning algorithm attempts to fit the data too closely. There are several common strategies that can be used to alleviate the effect of error/noise in the output values. For example, detecting and removing the noisy training examples before training the algorithm is a common approach. Another strategy is early stopping, which can be used to prevent overfitting (see the early-stopping sketch after this list).
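To illustrate the first two points (the bias-variance trade-off and model complexity versus the amount of training data), the following minimal sketch, assuming NumPy and scikit-learn are available, fits polynomials of increasing degree to 20 noisy samples of a sine curve. The degree-1 model underfits (high bias: a poor fit to both sets), while the degree-15 model overfits the small training set (high variance: low training error but high test error). The dataset and degrees are illustrative choices:

```python
# Under/overfitting sketch: polynomial regression on a small noisy sample.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.2, 20)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```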
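To illustrate the effect of the dimensionality of the input space, the sketch below, again assuming NumPy and scikit-learn, pads the four informative Iris features with 200 irrelevant noise features; the drop in cross-validated accuracy of a nearest-neighbour classifier shows how extra features can confuse the learning process. The dataset, classifier, and number of noise features are illustrative choices:

```python
# Curse-of-dimensionality sketch: irrelevant features hurt a distance-based model.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = load_iris(return_X_y=True)
# Append 200 pure-noise columns to the 4 informative ones.
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

knn = KNeighborsClassifier(n_neighbors=5)
print("4 informative features:    ", cross_val_score(knn, X, y, cv=5).mean())
print("4 informative + 200 noise: ", cross_val_score(knn, X_noisy, y, cv=5).mean())
```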
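Finally, as a sketch of the early-stopping strategy mentioned in the last point, the example below (assuming scikit-learn is available) flips 15% of the training labels to simulate noisy output values and trains a small neural network with and without scikit-learn's built-in early stopping, which monitors a held-out validation split and halts training when the validation score stops improving. The dataset, network size, and noise rate are illustrative choices, and with noisy labels early stopping typically gives comparable or better test accuracy:

```python
# Early-stopping sketch on deliberately noisy training labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate label noise: flip 15% of the training labels (test labels stay clean).
flip = rng.rand(len(y_train)) < 0.15
y_noisy = np.where(flip, 1 - y_train, y_train)

for early in (False, True):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                        early_stopping=early, validation_fraction=0.2,
                        random_state=0)
    clf.fit(X_train, y_noisy)
    print(f"early_stopping={early}: test accuracy={clf.score(X_test, y_test):.3f}")
```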