MLE in a coin toss

Let's assume that we want to learn a model of a given coin using observations obtained from tossing it. Since a coin toss has only two outcomes, heads or tails, the coin can be modeled using a single parameter. Let's define this parameter as θ, the probability of getting heads when the coin is tossed. The probability of getting tails is then automatically 1-θ, because heads and tails are mutually exclusive and exhaustive outcomes, so their probabilities must sum to 1.
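Written out, if X denotes the outcome of a single toss, the model is:

$$P(X = \text{heads} \mid \theta) = \theta, \qquad P(X = \text{tails} \mid \theta) = 1 - \theta$$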

We have our model ready, so let's move on to computing the likelihood function of this model. Let's assume that we are given some observations of coin tosses as D={H,H,T,H,T,T}. Assuming the tosses are independent, we can write the likelihood function for the given data as follows:
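$$P(D \mid \theta) = \theta \cdot \theta \cdot (1-\theta) \cdot \theta \cdot (1-\theta) \cdot (1-\theta) = \theta^3 (1-\theta)^3$$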

Now, we would like to find the value of θ that would maximize P(D|θ). For that, we take the derivative of our likelihood function, equate it to 0, and then solve it for θ:
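$$\frac{d}{d\theta} P(D \mid \theta) = \frac{d}{d\theta}\left[\theta^3 (1-\theta)^3\right] = 3\theta^2(1-\theta)^3 - 3\theta^3(1-\theta)^2 = 0$$

$$\Rightarrow 3\theta^2(1-\theta)^2\big[(1-\theta) - \theta\big] = 0 \Rightarrow \theta_{MLE} = 0.5$$

Here we ignore the boundary solutions θ = 0 and θ = 1, since they give zero likelihood for our data.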

Therefore, our MLE estimator learned that the probability of getting heads on tossing the coin is 0.5. This matches our intuition, since we have an equal number of heads and tails in the observed data.

Let's now try to write code to learn the parameter θ for our model. But since numerical optimization on a computer can run into precision and convergence issues, is there a way to avoid it and compute θMLE directly? If we look closely at our likelihood equation, we realize that we can write a generic formula for the likelihood of this model. If we assume that our data has n heads and m tails, we can write the likelihood as follows:
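$$P(D \mid \theta) = \theta^{n} (1-\theta)^{m}$$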

Now, we can actually find θMLE in closed form using this likelihood function and avoid relying on any numerical method to compute the optimal value:
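$$\frac{d}{d\theta}\left[\theta^{n}(1-\theta)^{m}\right] = n\theta^{n-1}(1-\theta)^{m} - m\theta^{n}(1-\theta)^{m-1} = 0$$

$$\Rightarrow n(1-\theta) = m\theta \Rightarrow \theta_{MLE} = \frac{n}{n+m}$$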

We can see that we have been able to find a closed-form solution for the MLE of θ. Coding this up is now simply a matter of computing the preceding formula:

import numpy as np


def coin_mle(data):
    """
    Returns the learned probability of getting a heads using MLE.

    Parameters
    ----------
    data: list, array-like
        The list of observations. 1 for heads and 0 for tails.

    Returns
    -------
    theta: The learned probability of getting a heads.
    """
    data = np.array(data)
    n_heads = np.sum(data)
    return n_heads / data.size

Now, let's try out our function for different datapoints:

>>> coin_mle([1, 1, 1, 0, 0])
0.59999999999999998

>>> coin_mle([1, 1, 1, 0, 0, 0])
0.5

>>> coin_mle([1, 1, 1, 0, 0, 0, 0])
0.42857142857142855

The outputs are as we expect, but one of the drawbacks of the MLE approach is that it is very sensitive to the randomness in our data, which, in some cases, can lead it to learn the wrong parameters. This is especially true when the dataset is small. For example, let's say that we toss a fair coin three times and get heads on every toss. In this case, the MLE approach would learn the value of θ to be 1, which is not correct since we tossed a fair coin. The output is as follows:

>>> coin_mle([1, 1, 1])
1.0
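To see how this sensitivity fades as we collect more observations, here is a minimal sketch (the random seed and the sample sizes are arbitrary choices for illustration) that simulates tosses of a fair coin with NumPy and prints the MLE estimate for increasing numbers of tosses. As the sample grows, the estimate tends to settle near the true value of 0.5:

import numpy as np

# Simulate tosses of a fair coin (true theta = 0.5) and watch how the MLE
# estimate behaves as the number of observations grows. The seed and the
# sample sizes below are arbitrary, chosen only for illustration.
rng = np.random.default_rng(42)

for n_tosses in [3, 10, 100, 10000]:
    tosses = rng.binomial(n=1, p=0.5, size=n_tosses)  # 1 for heads, 0 for tails
    print(n_tosses, coin_mle(tosses))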

In Chapter 5, Parameter Inference using Bayesian Approach, we will try to solve this problem with MLE by starting with a prior distribution over the parameters and updating it as more and more data is observed.