Bayesian learning

In the maximum-likelihood approach to learning, we try to find the optimal parameters for our model, that is, the parameters that maximize the likelihood function. But real-life data is usually noisy and, in most cases, does not represent the true underlying distribution. In such cases, the maximum-likelihood approach fails. For example, consider tossing a fair coin a few times. It is possible that all of our tosses result in either heads or tails. If we apply maximum likelihood to this data, it will assign a probability of 1 to either heads or tails, which would suggest that we would never see the other side of the coin. Or, take a less extreme case: suppose we toss a coin 10 times and get three heads and seven tails. Maximum likelihood will then assign a probability of 0.3 to heads and 0.7 to tails, which is not the true distribution for a fair coin. This problem is commonly known as overfitting.
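As a quick illustration, here is a minimal sketch in plain Python; the sequence of tosses is made up for the example:

```python
# Maximum-likelihood estimate of P(heads) from a small, hypothetical sample.
# With 3 heads in 10 tosses of a fair coin, the MLE is 0.3, not 0.5
# (and with 0 heads it would be exactly 0) -- the overfitting described above.
tosses = ['H', 'T', 'T', 'H', 'T', 'T', 'H', 'T', 'T', 'T']
p_heads_mle = tosses.count('H') / len(tosses)
print(p_heads_mle)  # 0.3
```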

Bayesian learning takes a slightly different approach to learning these parameters. We start by assigning a prior distribution over the parameters of our model; the prior makes our assumptions about the model explicit. In the case of the coin toss, we can start with a prior that assigns equal probabilities to heads and tails. We then apply Bayes' theorem to compute the posterior distribution over the parameters given the data. This lets us shift our belief (the prior) toward what the data indicates, resulting in a less extreme estimate of the parameters. In this way, Bayesian learning avoids one of the major drawbacks of maximum likelihood.
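For the coin example, this update can be done in closed form using a Beta prior, the standard conjugate prior for a coin's bias. A sketch, using the hypothetical counts of three heads and seven tails from above:

```python
# Bayesian update for a coin with a uniform Beta(1, 1) prior over
# theta = P(heads). After observing the data, the posterior is
# Beta(1 + heads, 1 + tails), whose mean is less extreme than the MLE.
heads, tails = 3, 7
alpha, beta = 1, 1               # uniform prior
alpha_post = alpha + heads       # conjugate posterior parameters
beta_post = beta + tails
posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)  # 4/12 = 0.333..., pulled toward 0.5 compared to the MLE of 0.3
```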

In more general terms, in the case of Bayesian learning, we try to learn a distribution over the parameters of our model instead of a single parameter value that maximizes the likelihood. To learn this distribution over the parameters, we use Bayes' theorem, given by the following:

P(θ|D) = P(D|θ) P(θ) / P(D)
Here, P(θ) is our prior over the parameters of the model, P(D|θ) is the likelihood of the data given the parameters, and P(D) is the probability of the observed data. P(D) can also be written in terms of the prior and the likelihood as follows:

P(D) = Σθ P(D|θ) P(θ)
Now let's talk about each of these terms separately and see how we can compute them. The prior, P(θ), is a probability distribution over the parameters representing our belief about their values. For example, in the case of coin tossing, our initial belief could be that θ lies between 0 and 1 and is uniformly distributed. The likelihood term, P(D|θ), is the same term that we tried to maximize in Chapter 4, Parameter Inference using Maximum Likelihood. It represents how likely our observed data is, given the parameters of the model. The next term, P(D), is the probability of observing our data, and it acts as the normalizing term. It is computationally difficult because it requires us to sum over all the possible values of θ and, for any sufficiently large number of parameters, it quickly becomes intractable. In the next sections of this chapter, we will see different algorithms that we can use to approximate this value. The term that we are trying to compute, P(θ|D), is known as the posterior. It represents our final probability distribution over the parameters of the model given our observed data. In essence, our prior is updated using the likelihood term to give the final distribution.
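For a single parameter such as the coin's bias, all of these terms can be approximated directly by discretizing θ on a grid. This is only feasible because there is one parameter; the sum that plays the role of P(D) is exactly what becomes intractable in higher dimensions. A minimal sketch in plain Python, again using the hypothetical counts of three heads and seven tails:

```python
# Grid approximation of the posterior over theta = P(heads).
heads, tails = 3, 7
grid = [i / 100 for i in range(101)]             # candidate values of theta
prior = [1 / len(grid)] * len(grid)              # uniform prior
likelihood = [t**heads * (1 - t)**tails for t in grid]
unnormalized = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnormalized)                     # approximates P(D)
posterior = [u / evidence for u in unnormalized]

# The posterior peaks at the MLE (0.3), but unlike the MLE it also
# quantifies how much probability mass sits on nearby values of theta.
theta_map = grid[posterior.index(max(posterior))]
print(theta_map)  # 0.3
```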

Another problem that Bayesian learning helps with is model selection. Since Bayesian learning gives a distribution over the possible models rather than a single model, we have a couple of options for making predictions from these models. The first method is to select the specific model that has the maximum posterior probability, commonly known as the maximum a posteriori (MAP) estimate. The other is to compute the expectation of the prediction over all the models, weighted by the posterior distribution. This regularizes our predictions, since we are averaging over all possible models.
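The two prediction strategies can be compared on a grid posterior for the coin; the grid and counts below are illustrative:

```python
# MAP prediction vs. posterior-averaged prediction for theta = P(heads).
heads, tails = 3, 7
grid = [i / 100 for i in range(101)]
# A uniform prior is a constant factor and cancels in the normalization.
unnorm = [t**heads * (1 - t)**tails for t in grid]
posterior = [u / sum(unnorm) for u in unnorm]

# MAP: predict using the single most probable parameter value.
map_theta = grid[posterior.index(max(posterior))]

# Posterior averaging: expected P(heads) under the whole posterior.
avg_theta = sum(t * p for t, p in zip(grid, posterior))

print(map_theta, avg_theta)  # MAP is 0.3; the average (~0.33) is pulled toward 0.5
```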