6 Probability-based Learning

When my information changes, I alter my conclusions. What do you do, sir?

—John Maynard Keynes

In this chapter we introduce probability-based approaches to machine learning. Probability-based prediction approaches are heavily based on Bayes’ Theorem, and the fundamentals section of this chapter introduces this important cornerstone of probability theory after covering some other fundamentals of probability. We then present the naive Bayes model, the standard probability-based approach to machine learning. The extensions and variations to this standard approach that we describe are the use of smoothing to combat overfitting, the modifications required to the standard naive Bayes model to allow it to handle continuous features, and Bayesian network models, which give us more control than a naive Bayes model over the assumptions that are encoded in a model.

6.1 Big Idea

Imagine that you are at a county fair and a stall owner is offering all comers a game of find the lady. Find the lady is a card game that hucksters have been using to earn money from unsuspecting marks for centuries.1 In a game, the dealer holds three cards—one queen and two aces (as shown in Figure 6.1(a)[248])—and, typically with a little bit of flair, quickly drops these cards face down onto a table. Faced with the backs of three cards (as shown in Figure 6.1(b)[248]), the player then has to guess where the queen has landed. Usually the player bets money on their ability to do this, and the dealer uses a little manual trickery to misdirect the player toward the wrong card.

When you first see the game played, because the dealer lays out the three cards so quickly, you think that there is no way to tell where the queen lands. In this case you can only assume that the queen is equally likely to be in any of the three possible positions: left, center, or right. This is shown in the bar plot in Figure 6.1(c)[248], which shows an equal likelihood for each position. Not feeling quite brave enough to play a game, you decide to instead study the dealer playing games with other people.

Figure 6.1: A game of find the lady: (a) the cards used; (b) the cards dealt face down on a table; (c) the initial likelihoods of the queen ending up in each position; (d) a revised set of likelihoods for the position of the queen based on evidence collected.

After watching the dealer play 30 games with other players, you notice that he has a tendency to drop the queen in the position on the right (19 times) more than the left (3 times) or center (8 times). Based on this, you update your beliefs about where the queen is likely to land based on the evidence that you have collected. This is shown in Figure 6.1(d)[248], where the bars have been redistributed to illustrate the revised likelihoods.

Confident that your study will help you, you lay a dollar down to play a game, ready to guess that the queen is in the position on the right. This time, however, as the dealer drops the cards onto the table, a sudden gust of wind turns over the card on the right to reveal that it is the ace of spades (shown in Figure 6.2(a)[249]). The extra piece of evidence means that you need to revise your belief about the likelihoods of where the queen will be once again. These revised likelihoods are shown in Figure 6.2(b)[249]. Because you now know that the queen is not in the position on the right, the likelihood that you had associated with this position is redistributed amongst the other two possibilities. Based on the new likelihoods, you guess that the queen is in the center position, and happily, this turns out to be correct (see Figure 6.2(c)[249]). The dealer encourages you to play again, but you know that you’ve got to know when to walk away, so you head on with an extra dollar in your pocket.

Figure 6.2: (a) The set of cards after the wind blows over the one on the right; (b) the revised likelihoods for the position of the queen based on this new evidence; (c) the final positions of the cards in the game.

This illustrates the big idea underlying probability-based machine learning. We can use estimates of likelihoods to determine the most likely predictions that should be made. Most importantly, though, we revise these predictions based on data we collect and whenever extra evidence becomes available.

6.2 Fundamentals

In this section we describe Bayes’ Theorem and the important fundamentals of probability theory that are required to use it. This section assumes a basic understanding of probability theory, including the basics of calculating probabilities based on relative frequencies, calculating conditional probabilities, the probability product rule, the probability chain rule, and the Theorem of Total Probability. Appendix B[541] provides a comprehensive introduction to these aspects of probability theory, so we recommend that readers unfamiliar with them review this appendix before continuing with this chapter.

Table 6.1

A simple dataset for MENINGITIS diagnosis with descriptive features that describe the presence or absence of three common symptoms of the disease: HEADACHE, FEVER, and VOMITING.

ID   HEADACHE   FEVER   VOMITING   MENINGITIS
 1   true       true    false      false
 2   false      true    false      false
 3   true       false   true       false
 4   true       false   true       false
 5   false      true    false      true
 6   true       false   true       false
 7   true       false   true       false
 8   true       false   true       true
 9   false      true    false      false
10   true       false   true       true

We will use the dataset2 in Table 6.1[250] to illustrate how the terminology of probability is mapped into the language of machine learning for predictive data analytics. The target being predicted in this dataset is whether a patient is suffering from MENINGITIS, and the descriptive features are common symptoms associated with this disease (HEADACHE, FEVER, and VOMITING).

From a probability point of view, each feature in a dataset is a random variable, and the sample space for the domain associated with a prediction problem is the set of all possible combinations of assignments of values to features. Each row in a dataset represents an experiment, which associates a target feature value with a set of descriptive feature values, and an assignment of values to a set of descriptive features is an event. So, for example, each row in Table 6.1[250] represents an experiment, and the assignment of the descriptive features to the values shown in each row can be referred to as a distinct event.

A probability function, P(), returns the probability of an event. For example, P(FEVER = true) returns the probability of the FEVER feature taking the value true. This probability, which is 0.4, can be counted directly from the dataset. Probability functions for categorical features are referred to as probability mass functions, while probability functions for continuous features are known as probability density functions. In the early part of this chapter, we focus on categorical features, but we return to continuous features in Section 6.4[272]. A joint probability refers to the probability of an assignment of specific values to multiple different features, for example, P(MENINGITIS = true, HEADACHE = true) = 0.2. Last, a conditional probability refers to the probability of one feature taking a specific value given that we already know the value of a different feature, for example, P(MENINGITIS = true | HEADACHE = true) = 0.2857.

It is often useful to talk about the probabilities for all the possible assignments to a feature. To do this we use the concept of a probability distribution. A probability distribution is a data structure that describes the probability of each possible value a feature can take. For example, the probability distribution for the binary feature MENINGITIS from Table 6.1[250] is P(MENINGITIS) = 〈0.3, 0.7〉 (by convention we give the true probability first). We use bold notation to distinguish between a probability distribution, P(), and a probability function, P(). The sum of a probability distribution must equal 1.0.
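As an illustrative aside, the probabilities just quoted can be computed as relative frequencies in a few lines of Python (a sketch of ours, not from the original text; the variable and function names are our own):

    # Table 6.1 as (HEADACHE, FEVER, VOMITING, MENINGITIS) tuples.
    data = [
        (True, True, False, False), (False, True, False, False),
        (True, False, True, False), (True, False, True, False),
        (False, True, False, True), (True, False, True, False),
        (True, False, True, False), (True, False, True, True),
        (False, True, False, False), (True, False, True, True),
    ]
    H, F, V, M = 0, 1, 2, 3   # column indices

    def prob(event):
        """Relative frequency of the rows satisfying the predicate event."""
        return sum(1 for row in data if event(row)) / len(data)

    def cond_prob(event, given):
        """Relative frequency of event within the rows satisfying given."""
        rows = [row for row in data if given(row)]
        return sum(1 for row in rows if event(row)) / len(rows)

    print(prob(lambda r: r[F]))                        # P(FEVER = true) = 0.4
    print(prob(lambda r: r[M] and r[H]))               # P(m, h) = 0.2
    print(cond_prob(lambda r: r[M], lambda r: r[H]))   # P(m | h) = 0.2857
    print((prob(lambda r: r[M]), prob(lambda r: not r[M])))  # P(MENINGITIS) = <0.3, 0.7>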

A joint probability distribution is a probability distribution over more than one feature assignment and is written as a multi-dimensional matrix in which each cell lists the probability of a particular combination of feature values being assigned. The dimensions of the matrix are dependent on the number of features and the number of values in the domains of the features. The sum of all the cells in a joint probability distribution must be 1.0. For example, the joint probability distribution for the four binary features from Table 6.1[250] (HEADACHE, FEVER, VOMITING, and MENINGITIS) is written as3

P(H, F, V, M) = [a 2 × 2 × 2 × 2 matrix listing the probability of each of the 16 possible combinations of assignments to HEADACHE, FEVER, VOMITING, and MENINGITIS]    (6.1)

Given a joint probability distribution, we can compute the probability of any event in the domain that it covers by summing over the cells in the distribution where that event is true. For example, to compute P(h) in the domain specified by the joint probability distribution P(H, F, V, M), we simply sum the values in the cells containing h (the cells in the first column). Calculating probabilities in this way is known as summing out. We can also use summing out to compute conditional probabilities from a joint probability distribution. To calculate the probability P(h | f ) from P(H, F, V, M), we sum the values in all the cells where both h and f are the case (the top four cells in the first column) and then divide by P(f ), which is itself computed by summing out (the sum of the values in the cells where f is the case).
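Summing out is equally mechanical in code. The sketch below (reusing data, H, and F from the previous sketch) estimates a joint distribution from Table 6.1 and then sums over its cells; note the normalizing division needed for the conditional probability:

    from itertools import product

    # Estimate the joint distribution P(H, F, V, M): map each complete
    # assignment of values to the four features to its relative frequency.
    joint = {cell: sum(1 for row in data if row == cell) / len(data)
             for cell in product([True, False], repeat=4)}

    # Summing out: P(h) is the sum of all cells where HEADACHE is true.
    p_h = sum(p for cell, p in joint.items() if cell[H])               # 0.7

    # P(h | f) = P(h, f) / P(f), both computed by summing out.
    p_h_and_f = sum(p for cell, p in joint.items() if cell[H] and cell[F])
    p_f = sum(p for cell, p in joint.items() if cell[F])
    print(p_h_and_f / p_f)                                             # 0.25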

We are now ready to take on Bayes’ Theorem!

6.2.1 Bayes’ Theorem

Bayes’ Theorem4 is so elegant and intuitive that it can be stated in one sentence of plain English:

the probability that an event has happened given a set of evidence for it is equal to the probability of the evidence being caused by the event multiplied by the probability of the event itself

or slightly more succinctly:

P(an event given evidence) = P(the evidence given the event) × P(the event)

Reading from left to right, the theorem shows us how to calculate the probability of an event given the evidence we have of that event in terms of the likelihood of the event causing this evidence. This is useful because reasoning from the evidence to events (inverse reasoning) is often much more difficult than reasoning from an event to the evidence it causes (forward reasoning). Bayes’ Theorem allows us to easily swap back and forth between these two types of reasoning.

The formal definition of Bayes’ Theorem is

P(X | Y) = P(Y | X)P(X) / P(Y)    (6.2)

Bayes’ Theorem defines the conditional probability of an event, X, given some evidence, Y, in terms of the product of the inverse conditional probability, P(Y | X), and the prior probability of the event P(X).

For an illustrative example of Bayes’ Theorem in action, imagine that after a yearly checkup, a doctor informs a patient that there is both bad news and good news. The bad news is that the patient has tested positive for a serious disease and that the test the doctor used is 99% accurate (i.e., the probability of testing positive when a patient has the disease is 0.99, as is the probability of testing negative when a patient does not have the disease). The good news, however, is that the disease is extremely rare, striking only 1 in 10,000 people. So what is the actual probability that the patient has the disease? And why is the rarity of the disease good news given that the patient has tested positive for it?

We can use Bayes’ Theorem to answer both of these questions. To calculate the probability that the patient actually has the disease based on the evidence of the test result, P(d | t), we apply Bayes’ Theorem:

P(d | t) = P(t | d)P(d) / P(t)

The information about the scenario gives us the probability of having the disease as P(d) = 0.0001 and the probability of not having the disease as P(¬d) = 0.9999. The accuracy of the test is captured as P(t | d) = 0.99 and P(t | ¬d) = 0.01. The overall probability of the test returning a positive value, P(t), is not given in the description above, but it can be easily calculated using the Theorem of Total Probability5 as

P(t) = P(t | d)P(d) + P(t | ¬d)P(¬d)
     = (0.99 × 0.0001) + (0.01 × 0.9999) = 0.010098

We can insert these probabilities into the application of Bayes’ Theorem to give

P(d | t) = P(t | d)P(d) / P(t) = (0.99 × 0.0001) / 0.010098 = 0.0098

So, the probability of actually having the disease, in spite of the positive test result, is less than 1%. This is why the doctor said the rarity of the disease was such good news. One of the important characteristics of Bayes’ Theorem is its explicit inclusion of the prior probability of an event when calculating the likelihood of that event based on evidence.6
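The whole calculation fits in a few lines of Python (a sketch reproducing the example above, using the Theorem of Total Probability for the denominator):

    # Scenario quantities: prior of the disease and test accuracy.
    p_d = 0.0001            # P(d)
    p_not_d = 1 - p_d       # P(¬d) = 0.9999
    p_t_d = 0.99            # P(t | d)
    p_t_not_d = 0.01        # P(t | ¬d)

    # Theorem of Total Probability: P(t) = P(t | d)P(d) + P(t | ¬d)P(¬d).
    p_t = p_t_d * p_d + p_t_not_d * p_not_d          # 0.010098

    # Bayes' Theorem: P(d | t) = P(t | d)P(d) / P(t).
    print(p_t_d * p_d / p_t)                         # 0.0098 -- less than 1%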

Let’s look at Bayes’ Theorem in a little more detail. Bayes’ Theorem is easily derived from the product rule.7 We know from the product rule and the logical symmetry of the and operation8 that

P(X | Y)P(Y) = P(Y | X)P(X)

If we divide both sides of this equation by the prior probability on the left hand side, P(Y), we get

P(X | Y)P(Y) / P(Y) = P(Y | X)P(X) / P(Y)

The P(Y) terms on the left hand side of this equation, however, cancel each other out to give us Bayes’ Theorem

P(X | Y) = P(Y | X)P(X) / P(Y)

There are two important observations regarding the division in Bayes’ Theorem by the denominator P(Y). The first is that this division functions as a normalization mechanism ensuring that

0 ≤ P(X | Y) ≤ 1

and

Σi P(Xi | Y) = 1.0

where Σi P(Xi | Y) should be interpreted as summing over the set of events that are a complete assignment to the features in X. The reason that the division functions as a normalization mechanism is that the prior probability of the evidence, P(Y), is not conditional on Xi, and as a result, it is constant for all Xi.

The second interesting observation about the division of the right hand side of Bayes’ Theorem by P(Y) is that we can calculate P(Y) in two different ways. First, we can calculate P(Y) directly from a dataset as

P(Y) = |{rows where Y is the case}| / |{rows in the dataset}|    (6.3)

Alternatively, we can use the Theorem of Total Probability to calculate P(Y):

P(Y) = Σi P(Y | Xi)P(Xi)    (6.4)

Notice that ignoring the subscripts, the expression we are summing in Equation (6.4)[255] is identical to the numerator in Bayes’ Theorem. This gives us a way to calculate the posterior probability distribution over the possible assignment of values to the features in event X conditioned on the event Y, that is, P(X | Y), that avoids explicitly calculating P(Y). If we let

η = 1 / P(Y) = 1 / Σi P(Y | Xi)P(Xi)    (6.5)

then

P(Xi | Y) = η × P(Y | Xi)P(Xi)    (6.6)

where the term η explicitly represents a normalization constant. Because Bayes’ Theorem can be calculated in this way, it is sometimes written as

P(X | Y) = η × P(Y | X)P(X)    (6.7)

where η is as defined in Equation (6.5)[255].

So we have two different definitions of Bayes’ Theorem (Equation (6.2)[252] and Equation (6.7)[255]), but which one should we use? The choice is really a matter of convenience. If we are calculating the probability of a single event given some evidence, then calculating P(Y) directly from the data using Equation (6.2)[252] is the easier option. If, however, we need to calculate the posterior probability distribution over X given Y, that is, P(X | Y), then we will actually be calculating each of the P(Y | Xi)P(Xi) values in Equation (6.5)[255] as part of this calculation, and it is more efficient to use Equation (6.7)[255].

We are now ready to use Bayes’ Theorem to generate predictions based on a dataset. The next section will examine how this is done.

6.2.2 Bayesian Prediction

To make Bayesian predictions, we generate the probability of the event that a target feature, t, takes a specific level, l, given the assignment of values to a set of descriptive features, q, from a query instance. We can restate Bayes’ Theorem using this terminology and generalize the definition of Bayes’ Theorem so that it can take into account more than one piece of evidence (each descriptive feature value is a separate piece of evidence). The Generalized Bayes’ Theorem is defined as

P(t = l | q[1], …, q[m]) = P(q[1], …, q[m] | t = l) × P(t = l) / P(q[1], …, q[m])    (6.8)

To calculate a probability using the Generalized Bayes’ Theorem, we need to calculate three probabilities:

  1. P(t = l), the prior probability of the target feature t taking the level l
  2. P(q[1], …, q[m]), the joint probability of the descriptive features of a query instance taking a specific set of values
  3. P(q[1], …, q[m] | t = l), the conditional probability of the descriptive features of a query instance taking a specific set of values given that the target feature takes the level l

The first two of these probabilities are easy to calculate. P(t = l) is simply the relative frequency with which the target feature takes the level l in a dataset. P(q[1], …, q[m]) can be calculated as the relative frequency in a dataset of the joint event that the descriptive features of an instance take on the values q[1], …, q[m]. As discussed in the previous section, it can also be calculated using the Theorem of Total Probability (in this instance, summing over all the target levels Σk∈levels(t) P(q[1], …, q[m] | t = k)P(t = k)), or replaced entirely with a normalization constant, η.

The final probability that we need to calculate, P(q[1], …, q[m] | t = l), can be calculated either directly from a dataset (by calculating the relative frequency of the joint event q[1], …, q[m] within the set of instances where t = l), or alternatively, it can be calculated using the probability chain rule.9 The chain rule states that the probability of a joint event can be rewritten as a product of conditional probabilities. So, we can rewrite P(q[1], …, q[m]) as

P(q[1], …, q[m]) = P(q[1]) × P(q[2] | q[1]) × ⋯ × P(q[m] | q[m−1], …, q[2], q[1])

We can use the chain rule for conditional probabilities by just adding the conditioning term to each term in the expression, so

P(q[1], …, q[m] | t = l) = P(q[1] | t = l) × P(q[2] | q[1], t = l) × ⋯ × P(q[m] | q[m−1], …, q[2], q[1], t = l)

This transformation from a joint probability conditioned on a single event into a product of conditional probabilities with just one event being conditioned in each term may not appear to achieve much. We will see shortly, however, that this transformation is incredibly useful.

Let’s look at an example of how we can now use Bayes’ Theorem to make predictions based on the meningitis diagnosis dataset in Table 6.1[250] for a query instance with HEADACHE = true, FEVER = false, and VOMITING = true. Returning to the shortened notation that we used previously, a predicted diagnosis for this query instance can be given using Bayes’ Theorem as

P(m | h, ¬f, v) = P(h, ¬f, v | m) × P(m) / P(h, ¬f, v)

There are two values in the domain of the MENINGITIS feature, true and false, so we have to do this calculation once for each. Considering first the calculation for m, we need the following probabilities, which can be computed directly from Table 6.1[250]

P(m) = 0.3
P(h, ¬f, v) = 0.6

We also need to calculate the likelihood of the descriptive feature values of the query given that the target is true. We could calculate this directly from the dataset, but in this example, we will illustrate the chain rule approach just described. Using the chain rule approach, we compute the overall likelihood of the descriptive feature values given a target value of true as the product of a set of conditional probabilities that are themselves calculated from the dataset

P(h, ¬f, v | m) = P(h | m) × P(¬f | h, m) × P(v | ¬f, h, m)
                = 0.6666 × 1.0 × 1.0 = 0.6666

We can now combine the three probabilities just calculated to calculate the overall probability of the target feature taking the level true given the query instance

P(m | h, ¬f, v) = (0.6666 × 0.3) / 0.6 = 0.3333

The corresponding calculation for P(¬m | h, ¬f, v) is:

P(¬m | h, ¬f, v) = P(h | ¬m) × P(¬f | h, ¬m) × P(v | ¬f, h, ¬m) × P(¬m) / P(h, ¬f, v)
                 = (0.7143 × 0.8 × 1.0 × 0.7) / 0.6 = 0.6666

These calculations tell us that it is twice as probable that the patient does not have meningitis as it is that the patient does. This might seem a little surprising given that the patient is suffering from a headache and is vomiting, two key symptoms of meningitis. Indeed, we have a situation where the posterior for a given prediction given the evidence is quite low (here P(m | h, ¬f, v) = 0.3333), even though the likelihood of the evidence if we assume the prediction to be correct is quite high, P(h, ¬f, v | m) = 0.6666.

What is happening here is that, as Bayes’ Theorem states, when calculating a posterior prediction, we weight the likelihood of the evidence given the prediction by the prior of the prediction. In this case, although the likelihood of suffering from a headache and vomiting is quite high when someone has meningitis, the prior probability of having meningitis is quite low. So, even when we take the evidence into account, the posterior probability of having meningitis remains low. This can seem counter-intuitive at first. The mistake is to confuse the probability of a prediction given the evidence with the probability of the evidence given the prediction and is another example of the paradox of the false positive.10
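The chain-rule calculation above can be reproduced directly from the dataset (a sketch reusing data, prob, cond_prob, and the column indices from the earlier sketches):

    # P(h, ¬f, v | m) via the chain rule.
    p1 = cond_prob(lambda r: r[H], lambda r: r[M])                 # P(h | m) = 0.6666
    p2 = cond_prob(lambda r: not r[F], lambda r: r[H] and r[M])    # P(¬f | h, m) = 1.0
    p3 = cond_prob(lambda r: r[V],
                   lambda r: not r[F] and r[H] and r[M])           # P(v | ¬f, h, m) = 1.0
    likelihood = p1 * p2 * p3                                      # 0.6666

    prior = prob(lambda r: r[M])                                   # P(m) = 0.3
    evidence = prob(lambda r: r[H] and not r[F] and r[V])          # P(h, ¬f, v) = 0.6

    print(likelihood * prior / evidence)           # P(m | h, ¬f, v) = 0.3333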

Calculating exact probabilities for each of the possible target levels is often very useful to a human decision maker, for example, a doctor. However, if we are trying to build a predictive model that automatically assigns a target level to a query instance, then we need to decide how the model will make a prediction based on the computed probabilities. The obvious way to do this is to have the model return the target level that has the highest posterior probability given the state of the descriptive features in the query. A prediction model that works in this way is making a maximum a posteriori (MAP) prediction.11 We can formally define a Bayesian MAP prediction model as

𝕄(q) = arg max_{l ∈ levels(t)} P(q[1], …, q[m] | t = l) × P(t = l) / P(q[1], …, q[m])    (6.10)

where 𝕄(q) is the prediction returned by the model 𝕄 using a MAP prediction mechanism for a query, q, composed of descriptive features q[1], …, q[m]; levels(t) is the set of levels the target feature can take; and arg max_{l ∈ levels(t)} specifies that we return the level, l, that has the maximum value computed using the function on the right of the arg max term.

Notice that the denominator in Equation (6.10)[259] is not dependent on the target feature, so it is functioning as a normalization constant. Furthermore, if we want to make a MAP prediction, we don’t necessarily have to calculate the actual probabilities for each level in the target domain; we simply need to know which of the levels in the target domain has the largest probability. Consequently, we don’t necessarily have to normalize the scores for each target level—something we would have to do if we wanted the actual probabilities. Instead we can simply return the target level that has the highest score from the numerator term. Using this simplification, the Bayesian MAP prediction model can be restated as

𝕄(q) = arg max_{l ∈ levels(t)} P(q[1], …, q[m] | t = l) × P(t = l)    (6.11)

Although it might seem that we now have a good solution for building probability-based prediction models, we are not quite done yet. There is one fundamental flaw with the approach that we have developed. To illustrate this, we will consider a second query instance for the meningitis diagnosis problem, this time with descriptive feature values HEADACHE = true, FEVER = true, and VOMITING = false. The probability of MENINGITIS = true given this query is

P(m | h, f, ¬v) = P(h | m) × P(f | h, m) × P(¬v | f, h, m) × P(m) / P(h, f, ¬v)
                = (0.6666 × 0 × P(¬v | f, h, m) × 0.3) / 0.1 = 0

and for MENINGITIS = false

P(¬m | h, f, ¬v) = P(h | ¬m) × P(f | h, ¬m) × P(¬v | f, h, ¬m) × P(¬m) / P(h, f, ¬v)
                 = (0.7143 × 0.2 × 1.0 × 0.7) / 0.1 = 1.0

The calculated posterior probabilities indicate that it is a certainty that the patient does not have meningitis! This is because as we progress along the sequence of conditional probabilities specified by the chain rule, the size of the set of conditioning events for each term increases. As a result, the set of events that fulfill the conditions for each conditional probability in the sequence, and hence that are considered when we compute the probability, get smaller and smaller as more and more conditions are added. The technical term for this splitting of the data into smaller and smaller sets based on larger and larger sets of conditions is data fragmentation. Data fragmentation is essentially an instance of the curse of dimensionality. As the number of descriptive features grows, the number of potential conditioning events grows. Consequently, an exponential increase is required in the size of the dataset as each new descriptive feature is added to ensure that for any conditional probability, there are enough instances in the training dataset matching the conditions so that the resulting probability is reasonable.

Returning to our example query, in order to calculate P(h, f, ¬v | m), the chain rule requires us to define three conditional probabilities, P(h | m), P(f | h, m), and P(¬v | f, h, m). For the first of these terms, P(h | m), only three instances in the dataset fulfill the condition of m (d5, d8 and d10). In two out of these three rows (d8 and d10), h is the case, so the conditional probability P(h | m) = 0.6666. These are also the only two rows that fulfill the conditions of the second term in the chain sequence, P(f | h, m). In neither of these rows is f the case, so the conditional probability for P(f | h, m) is 0. Because the chain rule specifies the product of a sequence of probabilities, if any of the probabilities in the sequence is zero, then the overall probability will be zero. Even worse, because there are no rows in the dataset where f, h, and m are true, there are no rows in the dataset where the conditions for the third term P(¬v | f, h, m) hold, so this probability is actually undefined as calculating it involves a division by zero. Trying to compute the probability of P(h, f, ¬v | m) directly from the data rather than using the chain rule also suffers from the same problem.

In summary, whether we compute the likelihood term for this example using the chain rule or directly from the dataset, we will end up with a probability of zero, or worse, an undefined probability. This is because there are no instances in the dataset where a patient that had meningitis was suffering from a headache and had a fever but wasn’t vomiting. Consequently, the probability for the MENINGITIS feature being true given the evidence in the query using this dataset was zero.

Clearly, the probability of a patient who has a headache and a fever having meningitis should be greater than zero. The problem here is that our dataset is not large enough to be truly representative of the meningitis diagnosis scenario, and our model is overfitting to the training data. The problem is even more serious than this, however, as in practice, it is almost never possible to collect a dataset that is big enough to sufficiently cover all the possible combinations of descriptive feature values that can occur in a dataset so as to avoid this. All is not lost, however, as the concepts of conditional independence and factorization can help us overcome this flaw of our current approach.

6.2.3 Conditional Independence and Factorization

So far our treatment of probability has assumed that the evidence we have collected affects the probability of the event we are trying to predict. This is not always the case. For example, it would seem reasonable to argue that the behavior of an octopus in a swimming tank should not affect the outcome of a soccer match.12 If knowledge of one event has no effect on the probability of another event, and vice versa, then the two events are said to be independent of each other. If two events X and Y are independent, then

P(X | Y) = P(X)
P(X, Y) = P(X) × P(Y)

Full independence between events is quite rare. A more common phenomenon is that two, or more, events may be independent if we know that a third event has happened. This is known as conditional independence. The typical situation where conditional independence holds between events is when the events share the same cause. For example, consider the symptoms of meningitis. If we don’t know whether the patient has meningitis, then knowing that the patient has a headache may increase the probability we assign to the patient of suffering from a fever. This is because having a headache increases the probability of the patient having meningitis, which in turn increases the probability of the patient having a fever. However, if we already know that the patient has meningitis, then also knowing that the patient has a headache will not affect the probability of the patient having a fever. This is because the information we get from knowing that the patient has a headache is already contained within the information that the patient has meningitis. In this situation, knowing that someone has meningitis makes the events of them having a headache and having a fever independent of each other. For two events, X and Y, that are conditionally independent given knowledge of a third event, here Z, we can say that

P(X | Y, Z) = P(X | Z)
P(X, Y | Z) = P(X | Z) × P(Y | Z)

This allows us an important reformulation of the chain rule for situations in which conditional independence applies. Recall that the chain rule for calculating the probability that a set of descriptive features, q[1], …, q[m], takes a specific set of values when a target feature, t, takes a specific level, l, is

P(q[1], …, q[m] | t = l) = P(q[1] | t = l) × P(q[2] | q[1], t = l) × ⋯ × P(q[m] | q[m−1], …, q[2], q[1], t = l)

If the event of the target feature t taking the level l causes the assignment of values to the descriptive features, q[1], …, q[m], then the events of each descriptive feature taking a value are conditionally independent of each other given the value of the target feature. This means that the chain rule definition can be simplified as follows:

P(q[1], …, q[m] | t = l) = P(q[1] | t = l) × P(q[2] | t = l) × ⋯ × P(q[m] | t = l)
                         = ∏_{i=1}^{m} P(q[i] | t = l)

The reason that this simplification is so important is that it allows us to simplify the calculations in Bayes’ Theorem, under the assumption of conditional independence between the descriptive features, given the level l of the target feature, from

P(t = l | q[1], …, q[m]) = P(q[1] | t = l) × P(q[2] | q[1], t = l) × ⋯ × P(q[m] | q[m−1], …, q[1], t = l) × P(t = l) / P(q[1], …, q[m])

to

P(t = l | q[1], …, q[m]) = (∏_{i=1}^{m} P(q[i] | t = l)) × P(t = l) / P(q[1], …, q[m])    (6.14)

Where appropriate, conditional independence not only simplifies the calculations but also enables us to compactly represent the full joint probability distribution for a domain. Rather than calculating and storing the probabilities of all the joint events in a domain, we can break up the distribution into data structures called factors, which define distributions over subsets of features. We can then compute any of the probabilities in the joint probability distribution using the product of these factors.

For example, Equation (6.1)[251] listed the joint probability distribution for the four binary features in the meningitis diagnosis dataset in Table 6.1[250]. This distribution contained 16 entries. If, however, it is in fact the case that HEADACHE, FEVER, and VOMITING are conditionally independent of each other given MENINGITIS, then we would need to store only four factors: P(M), P(H | M), P(F | M), and P(V | M). We can recalculate all the elements of the joint probability distribution using the product of these four factors:

P(H, F, V, M) = P(M) × P(H | M) × P(F | M) × P(V | M)

Because all the features in this example are binary, we need to store only the probabilities for the events where the features are true under the different combinations of values for the conditioning cases, as the probabilities for the complementary events can be computed by subtracting the stored probabilities from 1.0. Consequently, under this factorization, we need to calculate only seven probabilities directly from the data: P(m), P(h | m), P(h | ¬m), P(f | m), P(f | ¬m), P(v | m), and P(v | ¬m). The four factors required to represent the full joint distribution over the features HEADACHE, FEVER, VOMITING, and MENINGITIS (when the first three are assumed to be conditionally independent given MENINGITIS) can be stated as

P(M) = {P(m)}
P(H | M) = {P(h | m), P(h | ¬m)}
P(F | M) = {P(f | m), P(f | ¬m)}
P(V | M) = {P(v | m), P(v | ¬m)}

and the product required to calculate the probability of any joint event in the domain using these four factors is

P(H, F, V, M) = P(M) × P(H | M) × P(F | M) × P(V | M)

So, the assumed conditional independence between the features permits us to factorize the distribution and in doing so reduces the number of probabilities we need to calculate and store from the data. The reduction from 16 to 7 probabilities to represent this domain may not seem to achieve much, but there are two things to bear in mind. First, individually, the 7 probabilities have fewer constraints on them than the 16 in the full joint probability distribution. As a result, it is typically easier to collect the data required to calculate these probabilities. Second, as the number of features in the domain grows, the difference between the number of probabilities required for a factorized representation and the number of probabilities in the full joint probability distribution gets larger. For example, in a domain with one target feature and nine descriptive features, all of which are binary, the full joint probability distribution will contain 2^10 = 1,024 probabilities. However, if all the descriptive features are conditionally independent given the target feature, we can factorize the joint distribution and represent it using just 19 probabilities (one for the prior of the target and two conditional probabilities for each descriptive feature).

Apart from making a model more compact, conditional independence and factorization also increase the coverage of a probability-based prediction model by allowing the model to calculate reasonable probabilities for queries with combinations of evidence that do not occur in the training dataset. To illustrate this, let’s return to the example query instance for the meningitis diagnosis problem, where HEADACHE = true, FEVER = true, and VOMITING = false. When we originally tried to calculate probabilities for this query, a problem arose from the requirement that we have instances in the training dataset where all the evidence events hold. If we treat the evidence events as conditionally independent given the target feature, however, then we can factorize the evidence into its component events and calculate probabilities for each of these events separately. By doing this, we relax the requirement that, to avoid probabilities of zero, all the evidence events must hold in at least one instance for each value in the domain of the target. Instead, to avoid zero probabilities, we require only that for each value in the domain of the target feature, there be at least one instance in the dataset where each event in the evidence holds. For example, this allows us to use the probability of a patient having a fever given that the patient has meningitis, rather than the more constrained conditional probability of the patient having a fever given that the patient has meningitis and is suffering from a headache.

We reiterate the factors required to represent the full joint distribution for the meningitis diagnosis scenario when we assume that the descriptive features are conditionally independent given the target, this time including the actual probabilities calculated from the dataset:

P(M) = {P(m) = 0.3}
P(H | M) = {P(h | m) = 0.6666, P(h | ¬m) = 0.7143}
P(F | M) = {P(f | m) = 0.3333, P(f | ¬m) = 0.4286}
P(V | M) = {P(v | m) = 0.6666, P(v | ¬m) = 0.5714}    (6.15)

Using the factors in Equation (6.15)[266], we calculate the posterior distribution for meningitis given the query instance using Equation (6.14)[263] as

P(m | h, f, ¬v) = η × P(h | m) × P(f | m) × P(¬v | m) × P(m)
                = η × (0.6666 × 0.3333 × 0.3334 × 0.3) = 0.1948
P(¬m | h, f, ¬v) = η × P(h | ¬m) × P(f | ¬m) × P(¬v | ¬m) × P(¬m)
                 = η × (0.7143 × 0.4286 × 0.4286 × 0.7) = 0.8052

where η = 1 / (0.0222 + 0.0919)

As with our previous calculations, the posterior probabilities for meningitis, calculated under the assumption of conditional independence of the evidence, indicate that the patient probably does not have meningitis, and consequently, a MAP Bayesian model would return MENINGITIS = false as the prediction for this query instance. However, the posterior probabilities are not as extreme as those calculated when we did not assume conditional independence. What has happened is that asserting conditional independence has allowed the evidence of the individual symptoms to be taken into account, rather than requiring an exact match across all the symptoms taken together. By doing this, the Bayesian prediction model is able to calculate reasonable probabilities for queries with combinations of evidence that do not occur in the dataset. This results in the model having a higher coverage with respect to the possible queries it can handle. Furthermore, the conditional independence assumption enables us to factorize the distribution of the domain, and consequently we need fewer probabilities with fewer constraints to represent the domain. As we will see, a fundamental component of creating probabilistic prediction models is deciding on the conditional independence assumptions we wish to make and the resulting factorization of the domain.
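A sketch of this factorized calculation in Python, using the probabilities from Equation (6.15) (and computing the probabilities of the complementary events, such as P(¬v | M), as 1 minus the stored probabilities):

    # Factors for the meningitis domain (from Equation (6.15)).
    p_m = 0.3
    p_h = {True: 0.6666, False: 0.7143}    # P(h | m), P(h | ¬m)
    p_f = {True: 0.3333, False: 0.4286}    # P(f | m), P(f | ¬m)
    p_v = {True: 0.6666, False: 0.5714}    # P(v | m), P(v | ¬m)

    # Unnormalized scores for the query h, f, ¬v under each target level.
    scores = {m: p_h[m] * p_f[m] * (1 - p_v[m]) * (p_m if m else 1 - p_m)
              for m in (True, False)}

    eta = 1 / sum(scores.values())          # the normalization constant
    print({m: round(eta * s, 4) for m, s in scores.items()})
    # {True: 0.1948, False: 0.8052}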

In the next section we introduce the naive Bayes model, a probability-based machine learning algorithm that asserts a global conditional independence between the descriptive features given the target. As a result of this conditional independence assumption, naive Bayes models are very compact and relatively robust to overfitting the data, making them one of the most popular predictive modeling approaches.

6.3 Standard Approach: The Naive Bayes Model

A naive Bayes model returns a MAP prediction where the posterior probabilities for the levels of the target feature are computed under the assumption of conditional independence between the descriptive features in an instance given a target feature level. More formally, the naive Bayes model is defined as

𝕄(q) = arg max_{l ∈ levels(t)} (∏_{i=1}^{m} P(q[i] | t = l)) × P(t = l)    (6.16)

where t is a target feature with a set of levels, levels(t), and q is a query instance with a set of descriptive features, q[1], …, q[m].

In Section 6.2[249] we described how a full joint probability distribution could be used to compute the probability for any event in a domain. The problem with this, however, is that generating full joint probability distributions suffers from the curse of dimensionality, and as a result, this approach is not tractable for domains involving more than a few features. In Section 6.2.3[262], however, we showed how conditional independence between features allows us to factorize the joint distribution, and this helps with the curse of dimensionality problem by reducing the number of probabilities we are required to calculate from the data as well as the number of conditioning constraints on these probabilities. The naive Bayes model leverages conditional independence to the extreme by assuming conditional independence between the assignment of all the descriptive feature values given the target level. This assumption allows a naive Bayes model to radically reduce the number of probabilities it requires, resulting in a very compact, highly factored representation of a domain.

We say that the naive Bayes model is naive because the assumption of conditional independence between the features in the evidence given the target level is a simplifying assumption that is made whether or not it is correct. Despite this simplifying assumption, however, the naive Bayes approach has been found to be surprisingly accurate across a large range of domains. This is partly because errors in the calculation of the posterior probabilities for the different target levels do not necessarily result in prediction errors. As we noted when we dropped the denominator of Bayes’ Theorem from the MAP prediction model (Equation (6.11)[260]), for a categorical prediction task, we are primarily interested in the relative size of the posterior probabilities for the different target levels rather than the exact probabilities. Consequently, the relative ranking of the likelihood of the target levels is, to a certain extent, robust to errors in the calculation of the exact probabilities.13

The assumption of conditional independence between the features in the evidence given the level of the target feature also makes the naive Bayes model relatively robust to data fragmentation and the curse of dimensionality. This is particularly important in scenarios with small datasets or with sparse data.14 One application domain where sparse data is the norm rather than the exception is in text analytics (for example, spam filtering), and naive Bayes models are often successful in this domain.

The naive Bayes model can also be easily adapted to handle missing feature values: we simply drop the conditional probabilities for the features whose values are missing from the product of the evidence events. Obviously, doing this may have a negative effect on the accuracy of the posterior probabilities computed by the model, but again this may not translate directly into prediction errors.

A final advantage of the naive Bayes model is how simple it is to train. For a given prediction task, all that is required to train a naive Bayes model is to calculate the priors for each target level and the conditional probability for each feature given each target level. As a result, a naive Bayes model can be trained relatively quickly compared to many other prediction models. A further advantage that results from this simplicity is the compactness of the naive Bayes model with which a very large dataset can be represented.

Overall, although naive Bayes models may not be as powerful as some other prediction models, they often provide reasonable accuracy for prediction tasks with categorical targets while being robust to the curse of dimensionality and easy to train. As a result, a naive Bayes model is often a good prediction model to use to define a baseline accuracy score or when working with limited data.

6.3.1 A Worked Example

We will use the dataset presented in Table 6.2[270] to illustrate how to create and use a naive Bayes model for a prediction problem. This dataset relates to a fraud detection scenario in which we would like to build a model that predicts whether loan applications are fraudulent or genuine. There are three categorical descriptive features in this dataset. CREDIT HISTORY captures the credit history of the applicant, and its levels are none (the applicant has no previous loans), paid (the applicant had loans previously and has paid them off), current (the applicant has existing loans and is current in repayments), and arrears (the applicant has existing loans and is in arrears in repayments). The GUARANTOR/COAPPLICANT feature records whether the loan applicant has a guarantor or coapplicant associated with the application. The levels are none, guarantor, and coapplicant. The ACCOMMODATION feature refers to the applicant’s current accommodation, and the levels are own (the applicant owns their accommodation), rent (the applicant rents their accommodation), and free (the applicant has free accommodation). The binary target feature, FRAUD, tells us whether the loan application turned out to be fraudulent (true or false).

To train a naive Bayes model using this data, we need to compute the prior probabilities of the target feature taking each level in its domain, and the conditional probability of each feature taking each level in its domain conditioned on each level that the target can take. There are two levels in the target feature domain, four levels in the CREDIT HISTORY domain, three in the GUARANTOR/COAPPLICANT domain, and three in the ACCOMMODATION domain. This means that we need to calculate 2 + (2 × 4) + (2 × 3) + (2 × 3) = 22 probabilities. Although this sounds like a lot of probabilities considering the size of the example dataset, it is worth noting that these 22 probabilities would suffice no matter how many new instances are added to the dataset, be it hundreds of thousands, or even millions. This is an example of the compactness of a naive Bayes representation. Be aware, however, that if new descriptive features were added to the dataset, then the number of probabilities required would grow by |domain of target| × |domain of new feature|, and if an extra level were added to the domain of the target, then the number of probabilities required would grow by one extra prior plus one extra conditional probability for each descriptive feature level (1 + 4 + 3 + 3 = 11 in this example). Once the required probabilities are calculated, our naive Bayes model is ready to make predictions for queries. It is that simple! Table 6.3[271] lists the probabilities we need for our naive Bayes fraud detection model.

Table 6.2

A dataset from a loan application fraud detection domain.

ID   CREDIT HISTORY   GUARANTOR/COAPPLICANT   ACCOMMODATION   FRAUD
 1   current          none                    own             true
 2   paid             none                    own             false
 3   paid             none                    own             false
 4   paid             guarantor               rent            true
 5   arrears          none                    own             false
 6   arrears          none                    own             true
 7   current          none                    own             false
 8   arrears          none                    own             false
 9   current          none                    rent            false
10   none             none                    own             true
11   current          coapplicant             own             false
12   current          none                    own             true
13   current          none                    rent            true
14   paid             none                    own             false
15   arrears          none                    own             false
16   current          none                    own             false
17   arrears          coapplicant             rent            false
18   arrears          none                    free            false
19   arrears          none                    own             false
20   paid             none                    own             false

The following is a query instance for the fraud detection domain:

CREDIT HISTORY = paid, GUARANTOR/COAPPLICANT = none, ACCOMMODATION = rent

Table 6.4[272] shows the relevant probabilities needed to make a prediction for this query and the calculation of the scores for each possible prediction. Each calculation applies Equation (6.16)[267] and can be understood as a product of the four factors that the naive Bayes model represents: P(FR), P(CH | FR), P(GC | FR), and P(ACC | FR). The scores are 0.0139 for a prediction of true and 0.0245 for a prediction of false. It is worth emphasizing that the scores calculated are not the actual posterior probabilities for each target level given the query evidence (to get the actual probabilities we would need to normalize these scores), but they do give us enough information to rank the different target levels based on the relative posterior probabilities. A naive Bayes prediction model returns the MAP prediction, so our naive Bayes model would make a prediction of false and so classify this loan application query as not fraudulent.
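For readers who would like to verify these numbers, the following is a minimal naive Bayes trainer and scorer in Python (an illustrative sketch of Equation (6.16), not a reference implementation; the function and variable names are our own):

    from collections import Counter
    from math import prod

    # Table 6.2 as (CREDIT HISTORY, GUARANTOR/COAPPLICANT, ACCOMMODATION, FRAUD).
    rows = [
        ("current", "none", "own", True),   ("paid", "none", "own", False),
        ("paid", "none", "own", False),     ("paid", "guarantor", "rent", True),
        ("arrears", "none", "own", False),  ("arrears", "none", "own", True),
        ("current", "none", "own", False),  ("arrears", "none", "own", False),
        ("current", "none", "rent", False), ("none", "none", "own", True),
        ("current", "coapplicant", "own", False), ("current", "none", "own", True),
        ("current", "none", "rent", True),  ("paid", "none", "own", False),
        ("arrears", "none", "own", False),  ("current", "none", "own", False),
        ("arrears", "coapplicant", "rent", False), ("arrears", "none", "free", False),
        ("arrears", "none", "own", False),  ("paid", "none", "own", False),
    ]

    def train(rows):
        """Compute target priors and per-feature conditional probabilities."""
        target_counts = Counter(r[-1] for r in rows)
        cond_counts = Counter((i, level, r[-1])
                              for r in rows for i, level in enumerate(r[:-1]))
        priors = {t: c / len(rows) for t, c in target_counts.items()}
        conds = {k: c / target_counts[k[2]] for k, c in cond_counts.items()}
        return priors, conds

    def score(query, priors, conds):
        """Unnormalized naive Bayes scores (Equation (6.16)) per target level."""
        return {t: priors[t] * prod(conds.get((i, level, t), 0.0)
                                    for i, level in enumerate(query))
                for t in priors}

    priors, conds = train(rows)
    scores = score(("paid", "none", "rent"), priors, conds)
    print(scores)                          # approximately {True: 0.0139, False: 0.0245}
    print(max(scores, key=scores.get))     # False -- the MAP prediction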

Table 6.3

The probabilities needed by a naive Bayes prediction model, calculated from the data in Table 6.2[270].

P(fr) = 0.3                         P(¬fr) = 0.7
P(CH = none | fr) = 0.1666          P(CH = none | ¬fr) = 0
P(CH = paid | fr) = 0.1666          P(CH = paid | ¬fr) = 0.2857
P(CH = current | fr) = 0.5          P(CH = current | ¬fr) = 0.2857
P(CH = arrears | fr) = 0.1666       P(CH = arrears | ¬fr) = 0.4286
P(GC = none | fr) = 0.8334          P(GC = none | ¬fr) = 0.8571
P(GC = guarantor | fr) = 0.1666     P(GC = guarantor | ¬fr) = 0
P(GC = coapplicant | fr) = 0        P(GC = coapplicant | ¬fr) = 0.1429
P(ACC = own | fr) = 0.6666          P(ACC = own | ¬fr) = 0.7857
P(ACC = rent | fr) = 0.3333         P(ACC = rent | ¬fr) = 0.1429
P(ACC = free | fr) = 0              P(ACC = free | ¬fr) = 0.0714

Notation key: FR = FRAUD, CH = CREDIT HISTORY, GC = GUARANTOR/COAPPLICANT, ACC = ACCOMMODATION.

There is one non-obvious aspect of this example that is particularly interesting. If we look for an instance in the dataset in Table 6.2[270] that matches all the descriptive feature values in the query, we won’t find one. The fact that, despite the lack of any instances that perfectly match the evidence, we were still able to calculate a score for each target level and make a prediction for the query highlights how the conditional independence assumption between the evidence given the target level increases the coverage of the model and allows the model to generalize beyond the data used to induce it.

Table 6.4

The relevant probabilities, from Table 6.3[271], needed by the naive Bayes prediction model to make a prediction for a query with CH = paid, GC = none, and ACC = rent, and the calculation of the scores for each target level.

P(fr) = 0.3                        P(¬fr) = 0.7
P(CH = paid | fr) = 0.1666         P(CH = paid | ¬fr) = 0.2857
P(GC = none | fr) = 0.8334         P(GC = none | ¬fr) = 0.8571
P(ACC = rent | fr) = 0.3333        P(ACC = rent | ¬fr) = 0.1429

Score(fr) = 0.1666 × 0.8334 × 0.3333 × 0.3 = 0.0139
Score(¬fr) = 0.2857 × 0.8571 × 0.1429 × 0.7 = 0.0245

6.4 Extensions and Variations

In this section we discuss extensions and variations of the naive Bayes model that increase its ability to generalize and avoid overfitting (smoothing) and that allow it to handle continuous descriptive features. We also describe Bayesian networks, which are a probability-based modeling approach that allows us to include more subtle assumptions in a model than the global assumption of conditional independence between all descriptive features that the naive Bayes model makes.

6.4.1 Smoothing

Although the assumption of conditional independence extends the coverage of a naive Bayes model and allows it to generalize beyond the contents of the training data, naive Bayes models still do not have complete coverage of the set of all possible queries. We can see the reason for this in Table 6.3[271], where there are still some probabilities equal to zero, for example, P(CH = none | ¬fr). These arise when there are no instances in the training data that match a specific combination of target feature and descriptive feature levels. Consequently, a model is likely to overfit the data for any query where one or more of the evidence events match the conditioned event of one of these zero probabilities. For example, consider the following query:

Table 6.5

The relevant probabilities, from Table 6.3[271], needed by the naive Bayes prediction model to make a prediction for the query with CH = paid, GC = guarantor, and ACC = free, and the calculation of the scores for each possible target level.

P(fr) = 0.3                          P(¬fr) = 0.7
P(CH = paid | fr) = 0.1666           P(CH = paid | ¬fr) = 0.2857
P(GC = guarantor | fr) = 0.1666      P(GC = guarantor | ¬fr) = 0
P(ACC = free | fr) = 0               P(ACC = free | ¬fr) = 0.0714

Score(fr) = 0.1666 × 0.1666 × 0 × 0.3 = 0
Score(¬fr) = 0.2857 × 0 × 0.0714 × 0.7 = 0

CREDIT HISTORY = paid, GUARANTOR/COAPPLICANT = guarantor, ACCOMMODATION = free

Table 6.5[273] lists the relevant probabilities needed to make a prediction for this query, and the calculation of the scores for each of the possible target levels. In this instance, both possible predictions have a score of zero! Both scores are set to zero because one of the conditional probabilities used to calculate them is zero. For fr the probability P(ACC = free | fr) causes the problem, and for ¬fr the probability P(GC = guarantor | ¬fr) is the offender. As a result, the model is unable to return a prediction for this query.

The way to solve this problem is by smoothing the probabilities used by the model. We know from the definition of probability that the sum of the probabilities of a feature taking each of its possible levels should equal 1.0:

Σ_{l ∈ levels(f)} P(f = l) = 1.0

where f is a feature and levels(f ) is the set of levels in the domain of the feature. This means that we have a total probability mass of 1.0 that is shared out between the different assignments of a level to a feature based on their relative frequency. Smoothing involves taking some of the probability mass from the assignments with probability greater than average and spreading it across the probabilities that are below average, or even equal to zero.

Table 6.6

The posterior probability distribution for the GUARANTOR/COAPPLICANT feature under the condition that FRAUD = false.

P(GC = none | ¬fr) = 0.8571
P(GC = guarantor | ¬fr) = 0
P(GC = coapplicant | ¬fr) = 0.1429
Total = 1.0

For example, if we sum across the posterior probability distribution for the GUARANTOR/COAPPLICANT feature under the condition that FRAUD = false, we will get a value of 1.0 (see Table 6.6[274]). Notice that within this set, P(GC = none | ¬fr) is quite large, and at the other extreme, P(GC = guarantor | ¬fr) is equal to zero. Smoothing takes some of the probability mass from the events with high probability and shares this with the events with low probabilities. If this is done correctly, then the total probability mass for the set will remain equal to 1.0, but the spread of probabilities across the set will be smoother (hence the name smoothing).

There are several different ways to smooth probabilities. We will use Laplace smoothing. Note that, in general, it does not make sense to smooth the unconditional (prior) probabilities for the different target feature levels,15 so here we will focus on smoothing the conditional probabilities for the features. Laplace smoothing for conditional probabilities is defined as

P(f = l | t) = (count(f = l | t) + k) / (count(f | t) + k × |Domain(f)|)

where count(f = l | t) is how often the event f = l occurs in the subset of rows in the dataset where the target level is t, count(f | t) is how often the feature, f, took any level in the subset of rows in the dataset where the target level is t, |Domain(f )| is the number of levels in the domain of the feature, and k is a predetermined parameter. Larger values of k mean that more smoothing occurs—that is more probability mass is taken from the larger probabilities and given to the small probabilities. Typically k takes small values such as 1, 2, or 3.
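In code, Laplace smoothing amounts to a one-line adjustment of each count. The following sketch applies it to the raw counts of the GUARANTOR/COAPPLICANT levels among the 14 FRAUD = false instances:

    def laplace_smooth(counts, k=3):
        """Smooth one conditional distribution given as {level: raw count}."""
        total = sum(counts.values())          # count(f | t)
        domain_size = len(counts)             # |Domain(f)|
        return {level: (c + k) / (total + k * domain_size)
                for level, c in counts.items()}

    gc_counts = {"none": 12, "guarantor": 0, "coapplicant": 2}
    print(laplace_smooth(gc_counts, k=3))
    # approximately {'none': 0.6522, 'guarantor': 0.1304, 'coapplicant': 0.2174}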

Table 6.7

Smoothing the posterior probabilities for the GUARANTOR/COAPPLICANT feature conditioned on FRAUD = false.

                              none         guarantor    coapplicant
Raw count (FRAUD = false)     12           0            2
Raw probability               0.8571       0            0.1429
Smoothed (k = 3)              (12+3)/23    (0+3)/23     (2+3)/23
                              = 0.6522     = 0.1304     = 0.2174

where count(GC | ¬fr) = 14 and k × |Domain(GC)| = 9, so each smoothed denominator is 14 + 9 = 23.

Table 6.7[275] illustrates the steps in smoothing the posterior probabilities for the GUARANTOR/COAPPLICANT feature when conditioned on FRAUD = false. We can see that after smoothing, the probability mass is more evenly distributed across the events in the set. Crucially, the posterior probability for P(GC = guarantor | ¬fr) is no longer zero, and as a result, the coverage of the model has been extended to include queries with GUARANTOR/COAPPLICANT values of guarantor.

Table 6.8[276] lists the prior and smoothed conditional probabilities for the fraud domain that are relevant to a naive Bayes model. Notice that there are no zero probabilities, so the model will be able to return a prediction for any query in this domain. We can illustrate the extended coverage of the model by returning to the query from the beginning of this section:

Table 6.8

The Laplace smoothed (with k = 3) probabilities needed by a naive Bayes prediction model, calculated from the dataset in Table 6.2[270].

P(fr) = 0.3                         P(¬fr) = 0.7
P(CH = none | fr) = 0.2222          P(CH = none | ¬fr) = 0.1154
P(CH = paid | fr) = 0.2222          P(CH = paid | ¬fr) = 0.2692
P(CH = current | fr) = 0.3333       P(CH = current | ¬fr) = 0.2692
P(CH = arrears | fr) = 0.2222       P(CH = arrears | ¬fr) = 0.3462
P(GC = none | fr) = 0.5333          P(GC = none | ¬fr) = 0.6522
P(GC = guarantor | fr) = 0.2667     P(GC = guarantor | ¬fr) = 0.1304
P(GC = coapplicant | fr) = 0.2      P(GC = coapplicant | ¬fr) = 0.2174
P(ACC = own | fr) = 0.4667          P(ACC = own | ¬fr) = 0.6087
P(ACC = rent | fr) = 0.3333         P(ACC = rent | ¬fr) = 0.2174
P(ACC = free | fr) = 0.2            P(ACC = free | ¬fr) = 0.1739

Notation key: FR = FRAUD, CH = CREDIT HISTORY, GC = GUARANTOR/COAPPLICANT, ACC = ACCOMMODATION.

CREDIT HISTORY = paid, GUARANTOR/COAPPLICANT = guarantor, ACCOMMODATION = free

Table 6.9[277] illustrates how a naive Bayes model would calculate the scores for each candidate target level for this query using the smoothed probabilities from Table 6.8[276]. Using our smoothed probabilities we are able to calculate a score for both target levels: 0.0036 for true and 0.0043 for false. The target level false has the highest score (if only marginally) and is the MAP prediction for this query. Therefore, our naive Bayes model will predict that this loan application is not fraudulent.

6.4.2 Continuous Features: Probability Density Functions

To calculate the probability of an event, we have simply counted how often the event occurred and divided this number by how often the event could have occurred. A continuous feature can have an infinite number of values in its domain, so any particular value will occur a negligible amount of the time. In fact, the relative frequency of any particular value for a continuous feature will be indistinguishable from zero given a large dataset.

Table 6.9

The relevant smoothed probabilities, from Table 6.8[276], needed by the naive Bayes prediction model to make a prediction for the query with CH = paid, GC = guarantor, and ACC = free, and the calculation of the scores for each target level.

P(fr) = 0.3                          P(¬fr) = 0.7
P(CH = paid | fr) = 0.2222           P(CH = paid | ¬fr) = 0.2692
P(GC = guarantor | fr) = 0.2667      P(GC = guarantor | ¬fr) = 0.1304
P(ACC = free | fr) = 0.2             P(ACC = free | ¬fr) = 0.1739

Score(fr) = 0.2222 × 0.2667 × 0.2 × 0.3 = 0.0036
Score(¬fr) = 0.2692 × 0.1304 × 0.1739 × 0.7 = 0.0043

The way to solve the problem of zero probabilities is to think in terms of how the probability of a continuous feature taking a value is distributed across the range of values that a continuous feature can take. A probability density function (PDF) represents the probability distribution of a continuous feature using a mathematical function, and there are a large number of standard, well-defined probability distributions—such as the normal distribution—that we can use to model the probability of a continuous feature taking different values in its range.

Table 6.10[278] shows the definition of some of the standard probability distributions—the normal, student-t, exponential, and mixture of Gaussians distributions—that are commonly used in probabilistic prediction models, and Figure 6.3[279] illustrates the shapes of the density curves of these distributions. All standard PDFs have parameters that alter the shape of the density curve defining that distribution. The parameters required for each of these PDFs are shown in Table 6.10[278]. In order to use a PDF to represent the probability of a continuous feature taking different values, we need to choose these parameters to fit the characteristics of the data. We have already described the normal distribution, in some detail, in Section 3.2.1[64], so we won’t repeat that introduction here, but we will describe the other distributions in a little detail.

Table 6.10

Definitions of some standard probability distributions.

Normal
  N(x; μ, σ) = (1 / (σ √(2π))) × e^(−(x − μ)² / (2σ²))
  Parameters: μ (mean, a location parameter), σ (standard deviation, a scale parameter)

Student-t
  τ(x; ϕ, ρ, κ) = (Γ((κ + 1)/2) / (√(κπ) × ρ × Γ(κ/2))) × (1 + z²/κ)^(−(κ + 1)/2), where z = (x − ϕ)/ρ
  Parameters: ϕ (location), ρ (scale), κ (degrees of freedom)

Exponential
  E(x; λ) = λ e^(−λx) for x ≥ 0 (and 0 otherwise)
  Parameter: λ (rate)

Mixture of n Gaussians
  Σ_{i=1}^{n} ω_i × N(x; μ_i, σ_i), where Σ_{i=1}^{n} ω_i = 1
  Parameters: μ_1, …, μ_n (means), σ_1, …, σ_n (standard deviations), ω_1, …, ω_n (weights)
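The density functions in Table 6.10 are straightforward to implement; here is a direct Python transcription (a sketch of ours, with the student-t written in its location-scale form):

    from math import exp, pi, sqrt, gamma

    def normal_pdf(x, mu, sigma):
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    def student_t_pdf(x, phi, rho, kappa):
        z = (x - phi) / rho
        coeff = gamma((kappa + 1) / 2) / (sqrt(kappa * pi) * rho * gamma(kappa / 2))
        return coeff * (1 + z ** 2 / kappa) ** (-(kappa + 1) / 2)

    def exponential_pdf(x, lam):
        return lam * exp(-lam * x) if x >= 0 else 0.0

    def gaussian_mixture_pdf(x, components):
        # components is a list of (weight, mu, sigma); weights should sum to 1.
        return sum(w * normal_pdf(x, mu, sigma) for w, mu, sigma in components)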

 

The student-t distribution is symmetric around a single peak. In fact, it looks very similar to a normal distribution, as shown in Figure 6.3(a)[279]. The definition of the student-t probability density function uses the gamma function, Γ(), which is a standard statistical function.16 The student-t distribution is a member of the location-scale family of distributions.17 These distributions take two parameters: a location parameter ϕ, which specifies the position of the peak density of the distribution, and a non-negative scale parameter ρ, which specifies how spread out the distribution is; the higher the scale the more spread out the distribution. The normal distribution is a member of this location-scale family, with the mean μ specifying the location, and the standard deviation σ acting as the scale parameter. We use different notation for location and scale parameters, ϕ and ρ, than we do for mean and standard deviation parameters of the normal, μ and σ, because the values of these parameters are estimated using different techniques: generally, the location and scale parameters for distributions are fitted to the data using a guided search process.18 The student-t distribution, however, takes an extra parameter κ. This parameter is the degrees of freedom of the distribution. In statistics the degrees of freedom of a distribution is the number of variables in the calculation of the statistic that are free to vary. For the student-t distribution, the degrees of freedom is always set to the sample size (number of rows in the dataset) minus one.

art

Figure 6.3

Plots of some well-known probability distributions.

From a distribution perspective, the main distinction between a normal distribution and a student-t is that a normal distribution has light tails whereas the student-t distribution has fat tails. Figure 6.4[280] illustrates the distinction between fat- and light-tailed distributions using histograms of two datasets. The dataset in Figure 6.4(a)[280] follows a light-tailed distribution—the bars at the extreme left and right of the distribution have zero height. The dataset in Figure 6.4(b)[280] has a fat-tailed distribution—the bars on the extreme left and right of the distribution are still above zero, if only just. This distinction between fat- and light-tailed distributions is important because it highlights that when we use a normal distribution, we are implicitly assuming that the likelihood of values that differ from the mean of the distribution drops quite dramatically as we move away from the mean. A common mistake made by many data analysts is to automatically default to modeling unimodally distributed data with a normal distribution.19 There are statistical tests (such as the Kolmogorov-Smirnov test) that can be used to check whether or not a feature is normally distributed, and in cases where the feature is not normally distributed, another unimodal distribution, such as the student-t distribution, may be a better fit.
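As a minimal sketch of such a check, the snippet below applies the Kolmogorov-Smirnov test using SciPy (an assumed library choice). Strictly speaking, estimating the normal parameters from the same data calls for a corrected variant such as the Lilliefors test, so the p-value here should be treated as indicative only.

```python
# Checking whether a feature is plausibly normally distributed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
values = rng.standard_t(df=3, size=500) * 100 + 500  # a fat-tailed sample

mu, sigma = values.mean(), values.std(ddof=1)
stat, p_value = stats.kstest(values, "norm", args=(mu, sigma))

# A small p-value suggests the normal distribution is a poor fit, and a
# fat-tailed alternative such as the student-t may be more appropriate.
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")
```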

art

Figure 6.4

Histograms of two unimodal datasets: (a) the distribution has light tails; (b) the distribution has fat tails.

Another consequence of the normal distribution having light tails is that it is sensitive to outliers in the data. Figure 6.5[281] illustrates how outliers affect normal and student-t distributions. Figure 6.5(a)[281] shows a histogram of a dataset that has been overlaid with the curves of a normal and a student-t distribution that have been fitted to the data. The normal and the student-t distributions are both very similar, and both do a good job of matching the shape of the density histogram. Figure 6.5(b)[281] shows a histogram of the same dataset after some outliers have been added to the extreme right of the distribution. Again, we have overlaid the histogram with plots of the curves for a normal and a student-t distribution that have been fitted to the updated dataset. Comparing Figure 6.5(a)[281] and Figure 6.5(b)[281], we can see clearly that the introduction of outliers has a much larger effect on the normal distribution than it does on the student-t distribution. The robustness of the student-t to outliers is another reason to consider using this distribution, as opposed to a normal distribution, to model unimodal data in situations with relatively small or possibly noisy datasets.

art

Figure 6.5

Illustration of the robustness of the student-t distribution to outliers: (a) a density histogram of a unimodal dataset overlaid with the density curves of a normal and a student-t distribution that have been fitted to the data; (b) a density histogram of the same dataset with outliers added, overlaid with the density curves of a normal and a student-t distribution that have been fitted to the data. (This figure is inspired by Figure 2.16 in Bishop (2006).)

The plot of the density curve for the exponential distribution (Figure 6.3(b)[279]) shows that it assigns a high probability to values near the left of the distribution and that the probability of a value occurring drops dramatically as we move to the right. The standard range for the exponential distribution is from zero upward (i.e., the density assigned to values less than zero is zero). However, we can adjust this by offsetting the values input into the distribution. The exponential distribution takes one parameter, λ, known as the rate. Varying the value of λ changes the rate at which the density drops off. As λ gets larger, the peak of the distribution (on the left) gets larger, and the drop-off in density gets steeper. To fit an exponential distribution to a continuous feature, we set λ equal to 1 divided by the mean of the feature. The exponential distribution is often used to model waiting times (for example, how long it will take for a call to be answered at a help desk, how long you will have to wait for a bus, or how long before a piece of hardware fails), where the parameter λ is set to 1 divided by the average waiting time.
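Because the fit is just 1 divided by the sample mean, fitting an exponential distribution requires only a couple of lines of code; the waiting times below are illustrative values, not data from the book.

```python
import numpy as np

# Illustrative waiting times (e.g., minutes until a help desk call is answered)
waits = np.array([2.1, 0.4, 5.3, 1.2, 0.8, 3.7, 1.9])

lam = 1.0 / waits.mean()  # the rate parameter: 1 divided by the sample mean

def exp_pdf(x, lam):
    """Density of the exponential distribution for x >= 0."""
    return lam * np.exp(-lam * x)

print(f"lambda = {lam:.3f}")
print(f"density at x = 1.0: {exp_pdf(1.0, lam):.3f}")
```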

As the name suggests, the mixture of Gaussians distribution is the distribution that results when a number of normal (or Gaussian) distributions are merged. Mixture of Gaussians distributions are used to represent data that is composed of multiple subpopulations. Figure 6.6(a)[283] illustrates the profile typical of data with multiple subpopulations. The multiple peaks in the density curve arise from the different subpopulations (a distribution with multiple peaks is called multimodal). Using a mixture of Gaussians distribution assumes that all the subpopulations in the data are distributed following a normal distribution, but that each of these subpopulation normal distributions has a different mean and may also have a different standard deviation.

The definition of the mixture of Gaussians distribution in Table 6.10[278] shows how the individual normal distributions in a mixture of Gaussians distribution are combined using a weighted sum. Each normal that is merged is known as a component of the mixture. The weight of a component in the sum determines the contribution of the component to the overall density of the resulting mixture. A mixture of Gaussians distribution is defined by three parameters for each component: a mean, μ, a standard deviation, σ, and a weight, ω. The set of weight parameters for the mixture must sum to 1.

There is no closed-form solution for calculating the parameters that fit a mixture of Gaussians distribution to a set of feature values, as there is for the exponential and normal distributions. Instead, given the set of values for a continuous feature, we fit a mixture of Gaussians distribution to this data by searching for the number of components and the set of parameters for each component that best match the data. Guided search techniques, such as the gradient descent algorithm, are used for this task. Analysts will often input a suggested starting point for this search, based on their own analysis of the data, in order to guide the process.
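A minimal sketch of this search-based fitting, assuming scikit-learn, whose GaussianMixture estimator performs the search with the expectation-maximization (EM) algorithm (one form of guided search) rather than gradient descent:

```python
# Fitting a three-component mixture of Gaussians to a multimodal sample.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# An illustrative sample drawn from three subpopulations
x = np.concatenate([
    rng.normal(-4.0, 1.0, 300),
    rng.normal(0.0, 1.0, 500),
    rng.normal(5.0, 1.0, 200),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(x)

# Each component is described by a weight, a mean, and a standard deviation
for w, mu, var in zip(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel()):
    print(f"omega = {w:.2f}, mu = {mu:.2f}, sigma = {np.sqrt(var):.2f}")
```

Note that we supplied the number of components (3) ourselves; as described above, analysts often provide this kind of starting information to guide the search.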

In Figure 6.6(b)[283] we can see the three normal distributions used to model the multimodal distribution in Figure 6.6(a)[283]. Each normal distribution has a different mean but the same standard deviation. The size of the individual normal density curves is proportional to the weight for that normal used in the mixture. Figure 6.6(c)[283] overlays the multimodal density curve on top of the three weighted normals. It is clear from this figure that the weighted sum of the three normals does an excellent job of modeling the multimodal density distribution.

The fact that we have a range of parameterized distributions to choose from means that in order to define a probability density function (PDF), we must

art

Figure 6.6

Illustration of how a mixture of Gaussians model is composed of a number of normal distributions. The curve plotted using a solid line is the mixture of Gaussians density curve, created using an appropriately weighted summation of the three normal curves, plotted using dashed and dotted lines.

  1. Select which probability distribution we believe will best model the distribution of the values of the feature. The simplest and most direct way to choose a distribution for a feature is to create a density histogram of the feature’s values and compare the shape of this histogram to the shapes of the standard distributions. We should choose whichever standard distribution best matches the shape of the histogram to model the feature.
  2. Fit the parameters of the selected distribution to the feature values in the dataset. It is relatively straightforward to fit the parameters, μ and σ, of the normal distribution to a dataset by using the sample mean and standard deviation of the feature values in a dataset as estimates of μ and σ respectively. Similar to the normal distribution, the λ parameter for the exponential distribution can be easily calculated by using the value of 1 divided by the mean of the data. However, for many of the other statistical distributions, for example, the mixture of Gaussians distribution, we cannot define an equation over the data that estimates the parameters appropriately. For these distributions, the parameters are set using guided search techniques such as gradient descent. Fortunately, most data analytics packages and programming APIs provide functions that implement methods to fit a specified distribution to a given dataset.20

A PDF is an abstraction over a density histogram and, as such, defines a density curve. The shape of the curve is determined by (a) the statistical distribution that is used to define the PDF and (b) the values of the statistical distribution's parameters. To calculate a probability using a PDF, we need to think in terms of the area under an interval of the PDF curve: we first decide on the interval we wish to calculate the probability for, and we then calculate the area under the density curve for that interval to give the probability of a value from that interval occurring. There is no hard and fast rule for deciding on interval size. Instead, this decision is made on a case-by-case basis and is dependent on the precision required in answering a question. In some cases, the size of the interval is defined as part of the problem we are trying to solve, or there may be a natural interval to use because of the domain. For example, when we are dealing with a financial feature, we might use intervals that represent cents, while if we were dealing with temperature, we might define the interval to be 1 degree. Once we have selected the interval size, we need to calculate the area under the density curve for that interval.21
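For instance, assuming SciPy, the area under a normal density curve for a chosen interval can be computed as a difference of cumulative distribution function (CDF) values; the parameter and interval values below are purely illustrative.

```python
from scipy import stats

mu, sigma = 984.26, 460.94  # illustrative normal parameters
low, high = 750.0, 751.0    # a 1-unit interval

# Area under the density curve between low and high
p = stats.norm.cdf(high, loc=mu, scale=sigma) - stats.norm.cdf(low, loc=mu, scale=sigma)
print(f"P({low} <= x < {high}) = {p:.6f}")
```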

When we use a PDF to represent the probability distribution of a descriptive feature in a naive Bayes model, however, we don’t actually need to calculate exact probabilities. We only need to calculate the relative likelihood of a continuous feature taking a value given different levels of a target feature. The height of the density curve defined by a PDF at a particular feature value gives us this, so we can avoid the effort of calculating the actual probability. We can use a value from a PDF as a relative measure of likelihood because when the interval is very small, the actual area under a PDF curve for that interval can be approximated (with a small error proportional to the width of the interval) by the height of the PDF curve at the center of the interval multiplied by the width of the interval. Figure 6.7[285] illustrates this approximation.

If we were to include the interval width when calculating conditional probabilities for a continuous descriptive feature in a naive Bayes prediction model, using Equation (6.16)[267], we would multiply the value returned by the PDF by the same interval width each time we calculated the likelihood score for a level of the target feature. Consequently, we can drop this multiplication and just use the value returned by the PDF as a relative measure of the likelihood that the feature takes a specific value.
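The quality of the height-times-width approximation is easy to check numerically; the following sketch (again assuming SciPy) compares the exact area for a small interval with the approximation depicted in Figure 6.7.

```python
from scipy import stats

mu, sigma = 0.0, 1.0  # a standard normal, for illustration
x, eps = 0.5, 0.01    # an interval of width eps centered on x

exact = stats.norm.cdf(x + eps / 2, mu, sigma) - stats.norm.cdf(x - eps / 2, mu, sigma)
approx = stats.norm.pdf(x, mu, sigma) * eps

print(f"exact  = {exact:.10f}")
print(f"approx = {approx:.10f}")  # the two values agree to several decimal places
```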

art

Figure 6.7

(a) The area under a density curve between the limits x − ε/2 and x + ε/2; (b) the approximation of this area computed by PDF(x) × ε; and (c) the error in the approximation, which is equal to the difference between area A, the area under the curve omitted from the approximation, and area B, the area above the curve erroneously included in the approximation. Both of these areas get smaller as the width of the interval gets smaller, resulting in a smaller error in the approximation.

To ground our discussion of PDFs, and to illustrate how they can be used in making naive Bayes prediction models, we will extend our loan application fraud detection scenario to have two extra continuous features: ACCOUNT BALANCE, which specifies the amount of money in the account of the loan applicant at the time of the application, and LOAN AMOUNT, which specifies the amount of the loan being applied for. Table 6.11[286] lists this extended dataset. We first use only the extra ACCOUNT BALANCE feature in the dataset (ignoring LOAN AMOUNT, which we return to later in this chapter) to demonstrate how PDFs allow us to include continuous features in a naive Bayes model.

To enable the naive Bayes model to handle the ACCOUNT BALANCE feature, we have to extend the set of probabilities used by the model to represent the domain to include the probabilities for this feature. Recall that the naive Bayes domain representation defines a conditional probability for each possible value in the domain of a descriptive feature for each level in the domain of the target. In our example, the target feature, FRAUD, is binary, so we need to define two conditional probabilities for each value in the domain of the new descriptive feature: P(AB = x | fr) and P(AB = x | ¬fr). Because the descriptive feature ACCOUNT BALANCE is continuous, there is an infinite number of values in the feature’s domain. However, we know that using an appropriately defined PDF, we can approximate the probability of the feature taking any value in its domain. As a result, we simply need to define two PDFs for the new feature with each PDF conditioned on a different level of the target feature: P(AB = x | fr) = PDF1(AB = x | fr) and P(AB = x | ¬fr) = PDF2(AB = x | ¬fr). These two PDFs do not have to be defined using the same distribution. Once we have selected the distributions we wish to use, to define a PDF for a descriptive feature that is conditioned on a particular target, we fit the parameters of the selected distribution to the subset of the data where the target has that value.

Table 6.11

The dataset from the loan application fraud detection domain (from Table 6.2[270]) with two continuous descriptive features added: ACCOUNT BALANCE and LOAN AMOUNT.

art

 

The first step in defining the two PDFs is to decide which distribution we will use to define the PDF for each target feature level. To make this decision, we partition the training data based on the target feature and generate histograms of the values of the descriptive feature for each of the splits. We then select the statistical distribution that is most similar in shape to each of the resulting histograms. Figure 6.8[287] shows the histograms of the values of the ACCOUNT BALANCE feature partitioned on the two levels of the FRAUD target feature. It is clear from these histograms that the distribution of values taken by the ACCOUNT BALANCE feature in the set of instances where FRAUD = true follows an exponential distribution, whereas the distribution of values in the set of instances where FRAUD = false is similar to a normal distribution.

art

Figure 6.8

Histograms, using a bin size of 250 units, and density curves for the ACCOUNT BALANCE feature: (a) the fraudulent instances overlaid with a fitted exponential distribution; (b) the non-fraudulent instances overlaid with a fitted normal distribution.

Once we have selected the distributions, the next step is to fit the distributions to the data. To fit the exponential distribution we compute the sample mean of the ACCOUNT BALANCE feature in the set of instances where FRAUD = true and set the λ parameter equal to 1 divided by this value. To fit the normal distribution to the set of instances where FRAUD = false, we compute the sample mean and sample standard deviation for the ACCOUNT BALANCE feature for this set of instances and set the parameters of the normal distribution to these values. Table 6.12[288] shows how these values are calculated, and the dashed lines in Figure 6.8[287] plot the density curves that result from this process. Once distributions have been fitted to the data, we can extend the naive Bayes domain representation to include the PDFs. Table 6.13[289] shows the extended domain representation.
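Using the ACCOUNT BALANCE values listed in Table 6.12, this fitting step amounts to computing a few summary statistics; a minimal sketch:

```python
import numpy as np

# ACCOUNT BALANCE values from Table 6.12, partitioned by the target
ab_fraud = np.array([56.75, 749.50, 928.30, 405.72, 223.89, 103.23])
ab_no_fraud = np.array([1800.11, 1341.03, 1150.00, 250.90, 806.15,
                        1209.02, 550.00, 758.22, 430.79, 675.11,
                        1657.20, 1405.18, 760.51, 985.41])

lam = 1.0 / ab_fraud.mean()      # exponential rate for FRAUD = true
mu = ab_no_fraud.mean()          # normal mean for FRAUD = false
sigma = ab_no_fraud.std(ddof=1)  # sample standard deviation for FRAUD = false

print(f"lambda = {lam:.4f}")                  # ~0.0024, as in Table 6.12(a)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")  # ~984.26 and ~460.94, as in Table 6.12(b)
```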

To use the extended domain representation of the model to make a prediction for a query, we calculate the product of the relevant descriptive feature probabilities and the priors for the different target levels as before, but using PDFs to calculate the probabilities for the continuous feature. Table 6.14[290] shows how a prediction is made for the following query:

Table 6.12

Partitioning the dataset based on the value of the target feature and fitting the parameters of a statistical distribution to model the ACCOUNT BALANCE feature in each partition.

(a) Instances where FRAUD = true and the fitted parameters for the exponential distribution

     ID    ACCOUNT BALANCE    FRAUD
      1              56.75    true
      4             749.50    true
      6             928.30    true
     10             405.72    true
     12             223.89    true
     13             103.23    true

     mean(AB) = 411.22
     λ = 1/mean(AB) = 0.0024

(b) Instances where FRAUD = false and the fitted parameters for the normal distribution

     ID    ACCOUNT BALANCE    FRAUD
      2           1,800.11    false
      3           1,341.03    false
      5           1,150.00    false
      7             250.90    false
      8             806.15    false
      9           1,209.02    false
     11             550.00    false
     14             758.22    false
     15             430.79    false
     16             675.11    false
     17           1,657.20    false
     18           1,405.18    false
     19             760.51    false
     20             985.41    false

     mean(AB) = 984.26
     sd(AB) = 460.94

Note: ACCOUNT BALANCE has been shortened to AB in these tables.

CREDIT HISTORY = paid, GUARANTOR/COAPPLICANT = guarantor, ACCOMMODATION = free, ACCOUNT BALANCE = 759.07

The calculations for the probabilities for the ACCOUNT BALANCE feature are made using the equations for the normal and exponential distributions in Table 6.10[278]. The result is that FRAUD = false still has the highest score and will be returned as the prediction for this query.
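The two density values used for the ACCOUNT BALANCE feature in this calculation can be reproduced directly from the fitted parameters; a sketch assuming SciPy:

```python
from scipy import stats

ab = 759.07
# Exponential PDF (lambda = 0.0024); SciPy parameterizes by scale = 1/lambda
pdf_fraud = stats.expon(scale=1 / 0.0024).pdf(ab)
# Normal PDF with the mean and standard deviation fitted in Table 6.12(b)
pdf_no_fraud = stats.norm(984.26, 460.94).pdf(ab)

print(f"PDF(AB = 759.07 | fr)  = {pdf_fraud:.6f}")
print(f"PDF(AB = 759.07 | ¬fr) = {pdf_no_fraud:.6f}")
```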

Table 6.13

The Laplace smoothed (with k = 3) probabilities needed by a naive Bayes prediction model, calculated from the dataset in Table 6.11[286], extended to include the conditional probabilities for the new ACCOUNT BALANCE feature, which are defined in terms of PDFs.

art

Notation key: FR = FRAUD, CH = CREDIT HISTORY, GC = GUARANTOR/COAPPLICANT, ACC = ACCOMMODATION, AB = ACCOUNT BALANCE.

6.4.3 Continuous Features: Binning

A commonly used alternative to representing a continuous feature using a probability density function is to convert the feature into a categorical feature using binning. In Section 3.6.2[94] we explained two of the best known binning techniques, equal-width binning and equal-frequency binning, and discussed some of the general advantages and disadvantages of each technique. One feature of equal-width binning is that it can result in a very uneven distribution of instances across the bins, with some bins containing a large number of instances and other bins being nearly empty. This uneven distribution of instances across bins can have dramatic and unwanted consequences for probability-based models. Bins that contain only a few instances may have extremely small or extremely large conditional probabilities (depending on how the instances are divided when conditioned on the target feature), and these extreme conditional probabilities may bias a model based on the parameters of the binning technique (for example, the number of bins we choose to have) rather than on real distributions in the data. For this reason, we recommend the use of equal-frequency binning to convert continuous features to categorical ones for probability-based models.
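For example, assuming pandas, equal-frequency binning is a one-liner with qcut; the feature values below are illustrative, not the LOAN AMOUNT column itself.

```python
import pandas as pd

# Illustrative continuous feature values
values = pd.Series([5500, 9850, 10000, 12000, 8000, 15450,
                    16000, 22000, 5000, 9500, 11500, 13000], name="value")

# Four equal-frequency bins: each bin receives (roughly) the same number of instances
binned = pd.qcut(values, q=4, labels=["bin1", "bin2", "bin3", "bin4"])
print(pd.concat([values, binned.rename("bin")], axis=1))
```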

Table 6.14

The probabilities, from Table 6.13[289], needed by the naive Bayes prediction model to make a prediction for the query with CH = paid, GC = guarantor, ACC = free, and AB = 759.07, and the calculation of the scores for each candidate prediction.

art

 

Returning to our loan application fraud detection example, we will show how binning can be used to include the LOAN AMOUNT feature (see Table 6.11[286]) in a naive Bayes prediction model for this scenario. Table 6.15[291] shows the discretization of the LOAN AMOUNT feature into 4 equal-frequency bins. In this table, the instances in the dataset have been reordered in ascending order of their LOAN AMOUNT values. Even when using equal-frequency binning, there is still a chance that the partitioning of the data will give rise to extreme conditional probabilities. For example, all the bin3 values have a target feature value of false. Consequently, the conditional probability of LOAN AMOUNT = bin3 given FRAUD = true will be 0.0, and the conditional probability of LOAN AMOUNT = bin3 given FRAUD = false will be 1.0. Smoothing should be used in conjunction with binning to help with these extreme probabilities.

Table 6.15

The LOAN AMOUNT continuous feature discretized into 4 equal-frequency bins.

art

 

Once we have discretized the data using binning, we need to record the raw continuous feature thresholds between the bins. The reason for this is that we need to be able to bin the features of any query instances appropriately before we make predictions for them. To calculate these thresholds, we take the mid-point in the feature range between the instance with the highest feature value in one bin and the instance with the lowest feature value in the next bin. For example, the instances in Table 6.15[291] are ordered in ascending order based on the magnitude of their original LOAN AMOUNT values. So, the threshold between bin1 and bin2 will be the mid-point between the LOAN AMOUNT values for d12 (9,850) and d4 (10,000), which is 9,925. The threshold boundaries for the 4 bins used to discretize the LOAN AMOUNT feature are

art
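A small helper makes the mid-point rule concrete; the only values taken from the text are 9,850 and 10,000, whose mid-point is the bin1/bin2 threshold.

```python
def bin_thresholds(sorted_values, n_bins):
    """Mid-points between the last value in one bin and the first in the next,
    for values sorted in ascending order and equal-frequency bins."""
    size = len(sorted_values) // n_bins
    return [(sorted_values[i * size - 1] + sorted_values[i * size]) / 2
            for i in range(1, n_bins)]

# The bin1/bin2 threshold from the text: the mid-point of 9,850 and 10,000
print((9850 + 10000) / 2)  # 9925.0
```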

Once we have discretized the continuous features and calculated the thresholds for binning query features, we are ready to create our predictive model. As before, for a naive Bayes model, we calculate the prior probability distribution for the target feature and the posterior distribution for each descriptive feature conditioned on the target feature. Again, we should smooth the resulting probabilities. Table 6.16[293] shows the Laplace smoothed (with k = 3) probabilities required by a naive Bayes prediction model calculated from the dataset in Table 6.11[286]. Notice that in this domain representation, we blend different approaches to continuous features: we retain the PDFs developed in Section 6.4.2[276] for the ACCOUNT BALANCE feature and extend the representation with the binned version of the LOAN AMOUNT feature, BINNED LOAN AMOUNT.

We are now ready to process a query that has the continuous LOAN AMOUNT feature as part of the evidence:

CREDIT HISTORY = paid, GUARANTOR/COAPPLICANT = guarantor, ACCOMMODATION = free, ACCOUNT BALANCE = 759.07, LOAN AMOUNT = 8,000

The LOAN AMOUNT value for this query (8,000) is below the 9,925 threshold between bin1 and bin2. Consequently, the query LOAN AMOUNT feature will be treated as being equal to bin1 during prediction. Table 6.17[294] lists the calculations of the naive Bayes scores for the candidate predictions for this query: 0.000000462 for true and 0.000000633 for false. The target level false has the highest score and will be the prediction made by the model.

6.4.4 Bayesian Networks

In this chapter we have introduced two ways to represent the probabilities of events in a domain, a full joint probability distribution and a naive Bayes model. A full joint probability distribution encodes the probabilities for all joint events in the domain. Using a full joint probability distribution, we can do probabilistic inference by summing out the features we are not interested in. Full joint probability distributions, however, grow at an exponential rate as new features or feature levels are added to the domain. This exponential growth rate is partially due to the fact that a full joint probability distribution ignores the structural relationships between features, such as direct influence and conditional independence relationships. As a result, full joint distributions are not tractable for any domain of reasonable complexity. By contrast, a naive Bayes model uses a very compact representation of a domain. The reason for this is that the model assumes that all the descriptive features are conditionally independent of each other given the value of the target feature. The compactness of the representation is at the cost of making a naive assumption that may adversely affect the predictive accuracy of the model.

Table 6.16

The Laplace smoothed (with k = 3) probabilities needed by a naive Bayes prediction model, calculated from the data in Tables 6.11[286] and 6.15[291].

art

Notation key: FR = FRAUD, CH = CREDIT HISTORY, GC = GUARANTOR/COAPPLICANT, ACC = ACCOMMODATION, AB = ACCOUNT BALANCE, BLA = BINNED LOAN AMOUNT.

Bayesian networks use a graph-based representation to encode the structural relationships—such as direct influence and conditional independence—between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy.

Table 6.17

The relevant smoothed probabilities, from Table 6.16[293], needed by the naive Bayes model to make a prediction for the query with CH = paid, GC = guarantor, ACC = free, AB = 759.07, and LA = 8,000, and the calculation of the scores for each candidate prediction.

art

 

A Bayesian network is a directed acyclic graph (a graph with no cycles) that is composed of three basic elements:

  1. nodes: each feature in the domain is represented by a single node;
  2. edges: directed links between nodes that encode the influence relationships between the features;
  3. conditional probability tables: each node has a conditional probability table (CPT) associated with it, which lists the probability distribution of the feature represented by the node conditioned on the values of its parents.

Figure 6.9(a)[296] illustrates a simple Bayesian network. This network describes a domain consisting of two features A and B. The directed link from A to B indicates that the value of A directly influences the value of B. In probability terms, the directed edge from A to B in Figure 6.9(a)[296] states that

P(A, B) = P(B | A) × P(A)    (6.17)

For example, the probability of the event a and ¬b is

P(a, ¬b) = P(¬b | a) × P(a) = 0.7 × 0.4 = 0.28

where the probabilities used in the calculation are read directly from the CPTs in Figure 6.9(a)[296]. In the terminology of Bayesian networks, node A is a parent node of B, and node B is a child node of A, because there is a direct edge from A into B. The CPT associated with each node defines the probabilities of each feature taking a value given the value(s) of its parent node(s). Node A has no parents, so its CPT simply lists the unconditional probability distribution for A. Notice that each row in a CPT sums to 1. Consequently, for a categorical feature with N levels, we need only N − 1 probabilities in each row, with the final probability understood to be equal to 1 minus the sum of the other N − 1 probabilities. For example, when dealing with binary features, we need only state the probability of each feature being true, and the false value is understood to be 1 minus this probability. The network in Figure 6.9(a)[296] could be simplified in this way, and we will use this simplification for all networks drawn from now on. The standard approach for handling continuous features in a Bayesian network is to use binning. As a result, the CPT representation is sufficient to handle both categorical and (binned) continuous features.
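Representing CPTs as nested dictionaries makes this calculation mechanical. In the sketch below, P(a) = 0.4 and P(¬b | a) = 0.7 are the two values read from the CPTs above; the P(B | ¬a) row is a hypothetical stand-in, since only the row for A = true is used here.

```python
# CPTs for the two-node network in Figure 6.9(a) as Python dicts
p_a = {True: 0.4, False: 0.6}
p_b_given_a = {
    True:  {True: 0.3, False: 0.7},  # P(B | a): the row sums to 1, so P(b | a) = 0.3
    False: {True: 0.5, False: 0.5},  # P(B | ¬a): hypothetical values
}

# P(a, ¬b) = P(¬b | a) × P(a)
p_joint = p_b_given_a[True][False] * p_a[True]
print(f"{p_joint:.2f}")  # 0.28
```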

Equation (6.17)[295] can be generalized to the statement that for any network with n nodes, the probability of an event x1, …, xn can be computed using the following formula:

P(x1, …, xn) = ∏i P(xi | Parents(xi))    (6.18)

where Parents(xi) describes the set of nodes in the graph that directly link into node xi. Using this equation, we can compute the probability of any joint event in the domain represented by the Bayesian network. For example, using the slightly more complex Bayesian network in Figure 6.9(b)[296], we can calculate the probability of the joint event P(a, ¬b, ¬c, d) as

art

Figure 6.9

(a) A Bayesian network for a domain consisting of two binary features. The structure of the network states that the value of feature A directly influences the value of feature B. (b) A Bayesian network consisting of 4 binary features with a path containing 3 generations of nodes: D, C, and B.

art

When we are computing a conditional probability, we need to be aware of the state of both the parents of a node and the children of a node and their parents. This is because knowledge of the state of a child node can tell us something about the state of the parent node. For example, returning to our simple Bayesian network in Figure 6.9(a)[296], we can compute P(a | ¬b) using Bayes’ Theorem as follows:

art

art

Figure 6.10

A depiction of the Markov blanket of a node. The gray nodes define the Markov blanket of the black node. The black node is conditionally independent of the white nodes given the state of the gray nodes.

Essentially, here we are using Bayes' Theorem to invert the dependencies between the nodes. So, when computing a conditional probability for a node, we need to take into account not only the parents of the node but also the state of its children and their parents. If we have knowledge of these parent and child nodes, however, then the node is conditionally independent of the rest of the nodes in the graph. The set of nodes that make a node independent of the rest of the graph is known as the Markov blanket of the node. Figure 6.10[297] illustrates the Markov blanket of a node.

So, the conditional probability of a node xi in a graph with n nodes can be defined as

P(xi | x1, …, xi−1, xi+1, …, xn) = (P(xi | Parents(xi)) × ∏j∈Children(xi) P(xj | Parents(xj))) / (Σl∈levels(xi) P(xi = l | Parents(xi)) × ∏j∈Children(xi) P(xj | Parents(xj)))    (6.19)

where Parents(xi) describes the set of nodes in the graph that directly link into node xi, and Children(xi) describes the set of nodes in the graph that xi directly links into. Applying this definition to the network in Figure 6.9(b)[296], we can calculate the probability of P(c | ¬a, b, d) as

art

We already used Equation (6.19)[297] when we were making predictions for a naive Bayes classifier. A naive Bayes classifier is a Bayesian network with a specific topological structure. Figure 6.11(a)[299] illustrates the network structure of a naive Bayes classifier and how it encodes the conditional independence between the descriptive features given assumed knowledge of the target. Figure 6.11(b)[299] illustrates the network structure of the naive Bayes model for predicting a fraudulent loan application that was built in Section 6.3.1[269]. We can see in this structure that the target feature, FRAUD, has no parents and is the single parent for all the descriptive feature nodes. This structure directly reflects the assumption, made by naive Bayes models, of the conditional independence between descriptive features given knowledge of the target feature and is why the conditional probabilities of the descriptive features in a naive Bayes model are conditioned only on the target feature.

When we computed a conditional probability for the target feature using a naive Bayes model, we used the following calculation

P(t | d[1], …, d[m]) = (∏k P(d[k] | t) × P(t)) / (Σl∈levels(t) ∏k P(d[k] | l) × P(l))

This equation is equivalent to Equation (6.19)[297]. The fact that the probability P(t) is an unconditional probability simply reflects the structure of the naive Bayes network, in which the target feature has no parent nodes (see Figure 6.11(a)[299]).

Computing a conditional probability for a node becomes more complex if the value of one or more of the parent nodes is unknown. In this situation the node becomes dependent on the ancestors of its unknown parents. This is because if a parent node is unknown, then to compute the distribution for the node, we must sum out this parent. However, to do this summing out, we must know the distribution for the unknown parent, which in turn requires us to sum out the parents of the parent, and so on if necessary. As a result of this recursive summing out, the distribution over a node is dependent on knowledge of the ancestors of any of its unknown parent nodes.22 For example, in Figure 6.9(b)[296], if the status of node C is not known, then node B becomes dependent on node D. So, to compute P(b | a, d), we would do the following calculations

art

Figure 6.11

(a) A Bayesian network representation of the conditional independence asserted by a naive Bayes model between the descriptive features given knowledge of the target feature; (b) a Bayesian network representation of the conditional independence assumption for the naive Bayes model in the fraud example.

  1. Compute the distribution for C given D: P(c | d) = 0.2, P(¬c | d) = 0.8
  2. Compute P(b | a, d) by summing out C: P(b | a, d) = Σi P(b | a, Ci) × P(Ci | d)

art

This example illustrates the power of Bayesian networks. When complete knowledge of the state of all the nodes in the network is not available, we clamp the values of nodes that we do have knowledge of and sum out the unknown nodes. Furthermore, during these calculations, we only need to condition a node on its Markov blanket, which dramatically reduces the number of probabilities required by the network.
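The two-step calculation above translates directly into code. In the sketch below, P(c | d) = 0.2 is the value given in the text; the P(b | a, C) entries are hypothetical stand-ins for the CPTs in Figure 6.9(b).

```python
# Summing out the unknown parent C to compute P(b | a, d)
p_c_given_d = {True: 0.2, False: 0.8}     # step 1: distribution for C given D
p_b_given_a_c = {True: 0.9, False: 0.4}   # hypothetical P(b | a, C = c_i) entries

# Step 2: P(b | a, d) = sum_i P(b | a, C_i) × P(C_i | d)
p_b = sum(p_b_given_a_c[c] * p_c_given_d[c] for c in (True, False))
print(f"P(b | a, d) = {p_b:.2f}")  # 0.9×0.2 + 0.4×0.8 = 0.50
```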

6.4.4.1 Building Bayesian Networks

Bayesian networks can be constructed by hand or learned from data. Learning both the topology of a Bayesian network and the parameters in the CPTs in the network is a difficult computational task. One of the things that makes learning the structure of a Bayesian network so difficult is that it is possible to define several different Bayesian networks as representations for the same full joint probability distribution. Consider, for example, a probability distribution for three binary features A, B, and C. The probability for a joint event in this domain P(A, B, C) can be decomposed using the chain rule in the following way:

art

art

Figure 6.12

Two different Bayesian networks, each defining the same full joint probability distribution.

The chain rule, however, doesn’t specify any constraints on which features in the domain we choose to condition on. We could just as easily have decomposed the probability of the joint event as follows:

art

Both of these decompositions are valid, and both define different Bayesian networks for the domain. Figure 6.12(a)[301] illustrates the Bayesian network representing the decomposition defined in Equation (6.20)[300], and Figure 6.12(b)[301] illustrates the Bayesian network representing the decompositions defined in Equation (6.21)[301].

We can show that both of the networks in Figure 6.12[301] represent the same joint probability by using each of them to calculate the probability of an arbitrarily chosen joint event from the domain. We should get the same probability for the joint event from both of the networks. For this example, we will calculate the probability of the event ¬a, b, c. Using the Bayesian network in Figure 6.12(a)[301], we would carry out the calculation as follows:

art

Using the network in Figure 6.12(b)[301], the calculation would be

art

Both networks return the same probability for the joint event. In fact, these networks will return identical probabilities for all events in this domain.

The basic approach to learning the structure of a Bayesian network is to use a local search algorithm that moves through the space of possible networks and parameters, searching for the network topology and CPT parameters that best fit the data. To start the search, the algorithm is given a seed network and then iteratively adapts this network by adding, removing, or reversing links (and/or adding and removing hidden nodes), accompanied by iterations of parameter learning after each adaptation of the network structure. One of the difficulties with learning a network structure is that we can always improve the likelihood of the data given a network by simply adding new links into the network. Each time we add a link to a network, we increase the number of CPT entries in the network. The CPT entries are essentially parameters on the network, and the more parameters a network has, the greater its ability to fit (or overfit) the data. So, care must be taken to ensure that the objective function used by the search process avoids overfitting the data by simply creating a very highly connected graph. Consequently, the objective functions used by these algorithms are often based on the minimum description length principle, which asserts that the solution with the fewest parameters (the shortest description) is the best one. We have already met the minimum description length principle in the more general form of Occam's razor. A popular metric used by these algorithms is the Bayesian information criterion (BIC):

BIC(𝒢, D) = log P(D | 𝒢, Θ𝒢) − (d/2) × log n

where 𝒢 denotes the network graph, D is the training data, Θ𝒢 is the set of entries in the CPTs of 𝒢, d is the number of parameters of 𝒢 (i.e., the number of entries in the CPTs of 𝒢), and n is the number of instances in D. This metric contains a term describing how well the model predicts the data, P(D | 𝒢, Θ𝒢), as well as a term, (d/2) × log n, that punishes complex models. As such, it balances the search goals of model accuracy and simplicity. The term P(D | 𝒢, Θ𝒢) can be computed using metrics such as the Bayesian score or the K2 score.23 The search space for these algorithms is exponential in the number of features. Consequently, developing algorithms to learn the structure of Bayesian networks is an ongoing research challenge.24

It is much simpler to construct a Bayesian network using a hybrid approach, where the topology of the network is given to the learning algorithm, and the learning task involves inducing the CPT entries from the data. This type of learning illustrates one of the real strengths of the Bayesian network framework, namely, that it provides an approach to learning that naturally accommodates human expert information. In this instance, the human expert specifies the topology of the network, and the learning algorithm induces the CPT entries for the nodes in this topology in the same way that we computed the conditional probabilities for the naive Bayes model.25

Given that there are multiple Bayesian networks for any domain, an obvious question to ask is what is the best topological structure to give the algorithm as input? Ideally, we would like to use the network whose structure most accurately reflects the causal relationships in the domain. Specifically, if the value of one feature directly influences, or causes, the value taken by another feature, then this should be reflected in the structure of the graph by having a link from the cause feature to the effect feature. Bayesian networks whose topological structure correctly reflects the causal relationships between the features in a dataset are called causal graphs. There are two advantages to using a causal graph: (1) people find it relatively easy to think in terms of causal relationships, and as a result, networks that encode these relationships are relatively easy to understand; (2) often networks that reflect the causal structure of a domain are more compact in terms of the number of links between nodes and hence are more compact with respect to the number of CPT entries.

We will use an example from social science to illustrate how to construct a causal graph using this hybrid approach. In this example, we will build a Bayesian network that enables us to predict the level of corruption in a country based on a number of macro-economic and social descriptive features. Table 6.18[305] lists some countries described using the following features:26 GINI COEF, a measure of the income equality in a society; SCHOOL YEARS, a measure of the level of education; LIFE EXP, a measure of the health of the population in terms of life expectancy; and the target feature, CPI, the Corruption Perception Index, for which higher values indicate lower perceived levels of corruption.

The original feature values shown in Table 6.18[305] are continuous, so we use the standard approach of converting them to categorical features using equal-frequency binning, with two bins for each feature: low and high. The columns labeled binned feature values in Table 6.18[305] show the data after it has been binned.

Once the data has been prepared, there are two stages to building the Bayesian network. First, we define the topology of the network. Second, we create the network CPTs. The topology of the network will be a causal graph that models this domain. In order to build this, we must have a theory of the causal relationships between the features in the domain. A potential causal theory between the features in this dataset is that

the more equal a society, the higher the investment that society will make in health and education, and this in turn results in a lower level of corruption

Figure 6.13[306] illustrates a Bayesian network with a topology that encodes this causal theory. Equality directly affects both health and education, so there are directed arcs from GINI COEF to both LIFE EXP and SCHOOL YEARS. Health and education directly affect corruption, so there is a directed arc from LIFE EXP and from SCHOOL YEARS to CPI. To complete the network, we need to add the CPTs. To do this, we compute the required conditional probabilities from the binned data in Table 6.18[305]. The CPTs are shown in Figure 6.13[306].

Table 6.18

Some socio-economic data for a set of countries, and a version of the data after equal-frequency binning has been applied.

art

6.4.4.2 Using a Bayesian Network to Make Predictions

Once a network has been created, it is relatively straightforward to use to make a prediction. We simply compute the probability distribution for the target feature conditioned on the state of the descriptive features in the query and return the target feature level with the maximum a posteriori probability:

𝕄(q) = arg max_{l ∈ levels(t)} BayesianNetwork(t = l, q)

where 𝕄(q) is the prediction made by the model for the query q, levels(t) is the set of levels in the domain of the target feature t, and BayesianNetwork(t = l, q) returns the probability computed by the network for the event t = l given the evidence specified in the query q.

art

Figure 6.13

A Bayesian network that encodes the causal relationships between the features in the corruption domain. The CPT entries have been calculated using the binned data from Table 6.18[305].

For example, imagine we wanted to use the Bayesian network in Figure 6.13[306] to predict the CPI for a country with the following profile:

GINI COEF = low, SCHOOL YEARS = high, LIFE EXP = high

Because both the parent nodes for CPI are known (SCHOOL YEARS and LIFE EXP), the probability distribution for CPI is independent of the GINI COEF feature. Therefore, we can read the relevant probability distribution for CPI directly from the CPT for the CPI node. From this CPT we can see that when SCHOOL YEARS = high, and LIFE EXP = high, then the most likely level is CPI = high. As a result, CPI = high is the MAP CPI value for this query, and this is the prediction the model will return. In other words, countries that are relatively equal and that have good education and high life expectancy are likely to have a low level of corruption.

6.4.4.3 Making Predictions with Missing Descriptive Feature Values

One real advantage of Bayesian networks over the other predictive model types that we discuss in this book is that they provide an elegant solution to making predictions for a target feature when one or more of the descriptive feature values in a query instance are missing.27 For example, we may wish to predict the CPI for a country with the following profile:

GINI COEF = high, SCHOOL YEARS = high

where the value of the LIFE EXP feature is unknown for the country. This means that in the network, one of the parents of the target feature node, CPI, is unknown. Consequently, we need to sum out this feature for each level of the target. We can calculate the probability for CPI = high as follows:28

art

We calculate the numerator in this term as follows:

art

and denominator as:

art

We can now calculate the probability for CPI = high as

art

We know from this result that the probability for CPI = low must be 0.8. So, the network will predict CPI = low as the MAP target value for the query. This tells us that an unequal society that has a good education system but for which we have no evidence about the health system is still likely to suffer from corruption.

These calculations make it apparent that even in this small example domain, the calculation of a probability becomes computationally complex very quickly, particularly when we need to sum out one or more features. The complexity of the calculations can be reduced by being careful with the positioning of features with respect to summations and by using dynamic programming techniques to avoid repeated computations. A well-known algorithm that focuses on this approach to reducing the complexity is the variable elimination algorithm (Zhang and Poole, 1994). However, even using the variable elimination algorithm, calculating exact probabilities from a Bayesian network when descriptive feature values are missing is prohibitively complex.

Given the complexity of exact probabilistic inference for Bayesian networks, a popular alternative is to approximate the probability distribution required for a prediction using Monte Carlo methods.29 Monte Carlo methods generate a large number of sample events and then use the relative frequency of an event in the set of generated samples as the approximation for the probability of that event in the real distribution. Monte Carlo methods work well in conjunction with Bayesian networks because a Bayesian network models the probability distribution over the features. More specifically, a Bayesian network can be viewed as defining a Markov chain. A Markov chain is a system that has a set of finite states and a set of transition probabilities that define the likelihood of the system moving from one state to another. When we view a Bayesian network as a Markov chain, a state is a complete assignment of values to all the nodes in the network (for example, GINI COEF = high, SCHOOL YEARS = low, LIFE EXP = high, CPI = high would be a state in the Markov chain defined by the network in Figure 6.13[306]), and the CPTs of the network provide a distributed representation of the transition probabilities of the Markov chain. If the distribution used to generate the samples for a Monte Carlo method is a Markov chain, then the specific algorithms we use to implement this approach come from a family known as Markov chain Monte Carlo (MCMC) algorithms. Gibbs sampling is one of the best known MCMC algorithms and is particularly suitable when we wish to generate probabilities that are conditioned on some evidence, so this is the algorithm we discuss in this section.

The Gibbs sampling algorithm initializes a Bayesian network by clamping the values of the evidence nodes and randomly assigning values to the non-evidence nodes. The algorithm then iteratively generates samples by changing the value of one of the non-evidence nodes. The selection of which non-evidence node to change can be random or follow a predefined list through which the algorithm iterates. The new value for the selected node is drawn from the distribution for the node (the CPT), conditioned on the current state of all the other nodes in the network. Each time a node is updated, a new sample state is generated. More formally, for a network with three nodes x1, x2, x3, using a predefined node selection order of x1, x2, x3, x1, …, and assuming that at iteration τ each node has the values x1^(τ), x2^(τ), x3^(τ), the next four states generated will be

  1. ⟨x1^(τ+1), x2^(τ), x3^(τ)⟩, where x1^(τ+1) is sampled from P(x1 | x2^(τ), x3^(τ))
  2. ⟨x1^(τ+1), x2^(τ+1), x3^(τ)⟩, where x2^(τ+1) is sampled from P(x2 | x1^(τ+1), x3^(τ))
  3. ⟨x1^(τ+1), x2^(τ+1), x3^(τ+1)⟩, where x3^(τ+1) is sampled from P(x3 | x1^(τ+1), x2^(τ+1))
  4. ⟨x1^(τ+2), x2^(τ+1), x3^(τ+1)⟩, where x1^(τ+2) is sampled from P(x1 | x2^(τ+1), x3^(τ+1))

There are three technical requirements that must hold for the distribution of states generated by Gibbs sampling to converge to the distribution that we are sampling from—in this case, the distribution defined by the Bayesian network. The first is that the distribution we are sampling from must be a stationary distribution (also known as an invariant distribution). A stationary distribution is a distribution that doesn't change. The distribution defined by a Bayesian network doesn't change during Gibbs sampling, so this requirement always holds in this context. The second requirement is that the Markov chain used to generate the samples must be ergodic. A Markov chain is ergodic if every state is reachable from every other state and there are no cycles in the chain. The Markov chain defined by a Bayesian network is ergodic if there are no zero entries in any of the CPTs.30 The third requirement is that the generated states should be independent of each other. As each generated state is a modified version of the preceding state, successive states will be correlated with each other. So, to obtain independent sample states, we often subsample from the sequence (subsampling in this way is also known as thinning). Once these three conditions hold (stationary distribution, ergodicity, and independent states), the samples generated will eventually converge to the distribution, and it is appropriate to use Gibbs sampling.

Because we start sampling from a random state, however, we do not know whether the initial state is an appropriate state from which to start generating samples. It may, for example, be a state that has a very low probability in the distribution. As a result, it is a good idea to run the network for a number of iterations before the generated states are recorded as samples. This burn-in time is to allow the Markov chain to settle into a state that is independent of the initial random state and that is a probable state for the distribution we are sampling from. The time it takes for the Markov chain to forget the initial random state is called the mixing time. Unfortunately, estimating how long the burn-in should be is difficult. For some Markov chains, mixing may require only a few iterations, but for others, it may require hundreds or thousands of iterations. The topology of the network can provide some insight into this problem. Larger graphs will tend to have longer mixing times. Also, an evenly connected network typically has a relatively short mixing time (for the size of the graph). If, however, a graph is composed of a number of clusters connected via bottleneck nodes, this would typically indicate a longer mixing time. Another approach used to determine the appropriate burn-in time is to start several Markov chains with different initial states and wait until all the chains are generating states with similar distribution characteristics (mean state, mode state, etc.). When this happens, it indicates that all the chains are sampling from the same distribution and, hence, that it is likely that they have all forgotten their starting states. Once this happens, the target probability can be computed by calculating the relative frequency of the event within the selected subset of generated states.
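The sketch below implements this procedure for the corruption network, clamping the evidence nodes and using the burn-in and thinning values reported below. The CPT numbers are hypothetical stand-ins for Figure 6.13, which is not reproduced here, so the relative frequency the sketch prints will not match the values reported for the real network.

```python
import random

random.seed(0)

# Hypothetical CPT entries for the corruption network (True = high, False = low)
p_le_given_gini = {True: 0.4, False: 0.9}   # P(LIFE EXP = high | GINI COEF)
p_cpi_given_le_sy = {                       # P(CPI = high | LIFE EXP, SCHOOL YEARS)
    (True, True): 0.8, (True, False): 0.4,
    (False, True): 0.3, (False, False): 0.1,
}

def cpi_prob(le, sy, cpi):
    """P(CPI = cpi | LIFE EXP = le, SCHOOL YEARS = sy) from the CPT."""
    p_high = p_cpi_given_le_sy[(le, sy)]
    return p_high if cpi else 1 - p_high

gini, sy = True, True  # clamped evidence: GINI COEF = high, SCHOOL YEARS = high
le, cpi = True, True   # arbitrary initial values for the non-evidence nodes

samples, burn_in, thin = [], 30, 7
for i in range(burn_in + thin * 2000):
    # Resample LIFE EXP from its Markov blanket (parent GINI COEF, child CPI)
    w_hi = p_le_given_gini[gini] * cpi_prob(True, sy, cpi)
    w_lo = (1 - p_le_given_gini[gini]) * cpi_prob(False, sy, cpi)
    le = random.random() < w_hi / (w_hi + w_lo)
    # Resample CPI from its CPT, conditioned on its current parent values
    cpi = random.random() < p_cpi_given_le_sy[(le, sy)]
    if i >= burn_in and (i - burn_in) % thin == 0:
        samples.append(cpi)

print(sum(samples) / len(samples))  # relative frequency of CPI = high
```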

Table 6.19[312] lists some of the samples generated using Gibbs sampling for the Bayesian network in Figure 6.13[306] for the query

GINI COEF = high, SCHOOL YEARS = high

A burn-in of 30 iterations was used, and the samples were thinned by subsampling every 7th iteration. When the algorithm was used to generate 500 samples, the relative frequency of CPI = high was 0.196. When 2,000 samples were generated, the relative frequency rose to 0.1975. This rise in relative frequency illustrates that, as the number of samples generated increases, the resulting distribution approaches the actual distribution. Recall that when we did an exact calculation for this query the probability of CPI = high was 0.2.

Table 6.19

Examples of the samples generated using Gibbs sampling.

art

 

We can make predictions using Gibbs sampling in the same way that we made predictions using exact probabilistic inference by predicting the target level with the maximum a posteriori probability:

𝕄(q) = arg max_{l ∈ levels(t)} Gibbs(t = l, q)

where 𝕄(q) is the prediction made by the model for the query q, levels(t) is the set of levels in the domain of the target feature t, and Gibbs(t = l, q) returns the probability for the event t = l, given the evidence specified in the query q, computed using Gibbs sampling.

6.5 Summary

There are two ways to reason with probabilities: forward and inverse. Forward probability reasons from causes to effects: if we know that a particular causal event has happened, then we increase the probability associated with the known effects that it causes. Inverse probability reasons from effects to causes: if we know that a particular event has occurred, then we can increase the probability that one or more of the events that could cause the observed event have also happened. Bayes' Theorem relates these two views of probability by using the notion of a prior probability. Put in subjective terms, Bayes' Theorem tells us that by modifying our initial beliefs about what has happened (our prior beliefs about the world) proportionally with how our observations relate to their potential causes (inverse probability), we can update our beliefs regarding what has happened to cause our observations (forward probability). Put more formally:

P(t | d) = (P(d | t) × P(t)) / P(d)

The use of prior probabilities in Bayes’ Theorem is what distinguishes between Bayesian and maximum likelihood approaches to probability.

Bayesian prediction is a very intuitive approach to predicting categorical targets. In order to make a prediction, we have to learn two things:

  1. the probability of an instance having a particular set of descriptive feature values given that it has a particular target level P(d | t)
  2. the prior probability of that target level P(t)

Given these two pieces of information, we can compute the relative likelihood of a particular instance having a particular target level as

P(t | d) ∝ P(d | t) × P(t)

Once the relative likelihoods for each target level have been calculated, we simply return the maximum a posteriori (MAP) prediction.

The biggest challenge in creating a Bayesian prediction model is overcoming the exponential growth in the number of probabilities (model parameters) that are required as the dimensionality of the feature space increases. The standard approach to addressing this problem is to use the independence and conditional independence relationships between the features in a domain to factorize the full joint distribution of the domain. Factorizing the domain representation reduces the number of interactions between the features and reduces the number of model parameters.

A naive Bayes model addresses this problem by naively assuming that each of the descriptive features in a domain is conditionally independent of all the other descriptive features, given the state of the target feature. This assumption, although often wrong, enables the naive Bayes model to maximally factorize the representation that it uses of the domain—in other words, to use the smallest possible number of probabilities to represent the domain.

Surprisingly, given the naivety and strength of the assumption it depends upon, naive Bayes models often perform well. This is partly because naive Bayes models are able to make correct predictions even if the probabilities that they calculate are incorrect, so long as the error in the calculated probabilities does not affect the relative rankings of the different target levels. One consequence of this observation is that naive Bayes models are not really suitable for predicting continuous targets. When predicting a continuous target, every error in the calculation of a probability is reflected in reduced model performance.

The conditional independence assumption means that naive Bayes models use very few parameters to represent a domain. One consequence of this is that naive Bayes models can be trained using a relatively small dataset: with so few parameters and so few conditions on each parameter—only the state of the target feature—it is possible to make reasonable estimates for the parameters using a small dataset. Another benefit of the reduced representation of the model is that the behavior of the model is relatively easy to interpret. It is possible to look at the probabilities for each descriptive feature and analyze how that value contributed to the final prediction. This information can be useful in informing the development of more powerful models later in a project. Consequently, a naive Bayes model is often a good model to begin with: it is easy to train and has the potential to provide both a baseline accuracy score and some insight into the problem structure. The major drawback of naive Bayes models is the inability of the model to handle the interactions between features.

Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction. Although the task of inducing the optimal network structure from data is strictly intractable, algorithms that encode various assumptions exist that allow good models to be learned. Also, in domains where the causal relationships between features are known, Bayesian networks have the advantage of providing a natural framework for integrating expert human knowledge with data-driven induction. Bayesian networks have been successfully applied across a range of fields, including medical diagnosis, object recognition, and natural language understanding.

Several parallels can be drawn between probability-based learning and the other approaches to machine learning that we present in this book. Intuitively, the prior probability of a nearest neighbor model predicting a particular target level is simply the relative frequency of that target level in the dataset. For this reason, it is generally wrong to artificially balance the dataset used by a nearest neighbor model,31 as doing so biases the target level priors used by the model.

The relationship between probability-based and information-based learning is simply that the amount of information provided by an observation—such as a descriptive feature taking a particular value—is reflected in the difference between the prior and posterior probabilities caused by the observation. If the prior and posterior probabilities are similar, then the information content in the observation was low. If the prior and posterior probabilities are very different, then the information content in the observation was high.
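One way to make this correspondence precise (an aside rather than a definition used elsewhere in this book) is to measure the information gained from an observation o as the Kullback-Leibler divergence from the prior to the posterior distribution over target levels:

    D_{\mathrm{KL}}\bigl(P(t \mid o) \,\|\, P(t)\bigr) = \sum_{l \in \mathrm{levels}(t)} P(t = l \mid o) \, \log_2 \frac{P(t = l \mid o)}{P(t = l)}

This quantity is zero exactly when the observation leaves the distribution unchanged and grows as the posterior departs from the prior.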

Finally, it can be shown that, under some assumptions, any learning algorithm that minimizes the squared error of the model over the data will output a maximum likelihood prediction.32 The relevance of this finding is that it provides a probabilistic justification for the approach to learning we present in Chapter 7[323].

6.6 Further Reading

McGrayne (2011) is an accessible book on the development and history of Bayes’ Theorem. All data analysts should have at least one good textbook on statistics and probability. We would recommend either Montgomery and Runger (2010) or Tijms (2012) (or both). Jaynes (2003) deals with the use of probability theory in science and is a suitable text for postgraduate students.

Chapter 6 of Mitchell (1997) provides an excellent overview of Bayesian learning. Barber (2012) is a more recent machine learning textbook that adopts a Bayesian approach to learning and inference.

Judea Pearl is recognized as one of the key pioneers in developing the use of Bayesian networks in the field of artificial intelligence, and his books (Pearl, 1988, 2000) are accessible and provide good introductions to the theory and methods of Bayesian networks, as well as to the more general field of graphical models. Neapolitan (2004) is a good textbook on Bayesian networks. Koller and Friedman (2009) is a comprehensive text on the theory and methods of graphical models and is a good reference for postgraduate students doing research using graphical models.

6.7 Exercises

1. a. Three people flip a fair coin. What is the probability that exactly two of them will get heads?

b. Twenty people flip a fair coin. What is the probability that exactly eight of them will get heads?

c. Twenty people flip a fair coin. What is the probability that at least four of them will get heads?
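These are binomial probabilities. As a quick sketch of how they can be computed (standard library only; the helper names are ours):

    from math import comb

    def exactly(k, n, p=0.5):
        # P(exactly k heads in n flips of a coin with P(heads) = p)
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def at_least(k, n, p=0.5):
        # P(at least k heads in n flips)
        return sum(exactly(i, n, p) for i in range(k, n + 1))

    # Part (a) is exactly(2, 3); parts (b) and (c) are exactly(8, 20)
    # and at_least(4, 20).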

     2. The table below gives details of symptoms that patients presented with and whether they were suffering from meningitis.

ID   HEADACHE   FEVER   VOMITING   MENINGITIS
 1   true       true    false      false
 2   false      true    false      false
 3   true       false   true       false
 4   true       false   true       false
 5   false      true    false      true
 6   true       false   true       false
 7   true       false   true       false
 8   true       false   true       true
 9   false      true    false      false
10   true       false   true       true

Using this dataset, calculate the following probabilities:

a. P(VOMITING = true)

b. P(HEADACHE = false)

c. P(HEADACHE = true, VOMITING = false)

d. P(VOMITING = false | HEADACHE = true)

e. P(MENINGITIS | FEVER = true, VOMITING = false)
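All of these can be checked mechanically as relative frequencies. A minimal sketch, assuming the rows above are encoded as a list of dictionaries named rows (the helper names are ours):

    def prob(rows, **conditions):
        # Relative frequency of rows matching every FEATURE=value condition.
        matches = sum(all(row[f] == v for f, v in conditions.items()) for row in rows)
        return matches / len(rows)

    def cond_prob(rows, event, given):
        # P(event | given) = P(event, given) / P(given), both as relative frequencies.
        return prob(rows, **event, **given) / prob(rows, **given)

    # e.g., part (a) is prob(rows, VOMITING="true"); part (d) is
    # cond_prob(rows, {"VOMITING": "false"}, {"HEADACHE": "true"}).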

     3. Predictive data analytics models are often used as tools for process quality control and fault detection. The task in this question is to create a naive Bayes model to monitor a waste water treatment plant.33 The table below lists a dataset containing details of activities at a waste water treatment plant for 14 days. Each day is described in terms of six descriptive features that are generated from different sensors at the plant. SS-IN measures the solids coming into the plant per day; SED-IN measures the sediment coming into the plant per day; COND-IN measures the electrical conductivity of the water coming into the plant.34 The features SS-OUT, SED-OUT, and COND-OUT are the corresponding measurements for the water flowing out of the plant. The target feature, STATUS, reports the current situation at the plant: ok, everything is working correctly; settler, there is a problem with the plant settler equipment; or solids, there is a problem with the amount of solids going through the plant.

art

a. Create a naive Bayes model that uses probability density functions to model the descriptive features in this dataset (assume that all the descriptive features are normally distributed).

b. What prediction will the naive Bayes model return for the following query?

SS-IN = 222, SED-IN = 4.5, COND-IN = 1,518, SS-OUT = 74, SED-OUT = 0.25, COND-OUT = 1,642
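A sketch of how such a model scores this query, assuming each conditional distribution is a fitted normal PDF; the parameter dictionary below is a placeholder to be filled with the means and standard deviations estimated in part (a):

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        # Density of a normal distribution with mean mu and std. dev. sigma at x.
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    def score(query, prior, params):
        # Unnormalized naive Bayes score: P(level) * product of per-feature densities.
        s = prior
        for feature, x in query.items():
            mu, sigma = params[feature]  # params: feature -> (mu, sigma), fitted per level
            s *= normal_pdf(x, mu, sigma)
        return s

    # Compute score(...) once per STATUS level, with that level's prior and its
    # fitted (mu, sigma) per feature; the prediction is the highest-scoring level.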

     4. The following is a description of the causal relationship between storms, the behavior of burglars and cats, and house alarms:

Stormy nights are rare. Burglary is also rare, and if it is a stormy night, burglars are likely to stay at home (burglars don’t like going out in storms). Cats don’t like storms either, and if there is a storm, they like to go inside. The alarm on your house is designed to be triggered if a burglar breaks into your house, but sometimes it can be set off by your cat coming into the house, and sometimes it might not be triggered even if a burglar breaks in (it could be faulty or the burglar might be very good).

a. Define the topology of a Bayesian network that encodes these causal relationships.

b. The table below lists a set of instances from the house alarm domain. Using the data in this table, create the conditional probability tables (CPTs) for the network you created in part (a) of this question.

ID   STORM   BURGLAR   CAT     ALARM
 1   false   false     false   false
 2   false   false     false   false
 3   false   false     false   false
 4   false   false     false   false
 5   false   false     false   true
 6   false   false     true    false
 7   false   true      false   false
 8   false   true      false   true
 9   false   true      true    true
10   true    false     true    true
11   true    false     true    false
12   true    false     true    false
13   true    true      false   true

c. What value will the Bayesian network predict for ALARM given that there is both a burglar and a cat in the house but there is no storm?

d. What value will the Bayesian network predict for ALARM given that there is a storm but we don’t know if a burglar has broken in or where the cat is?
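Queries like part (d), where some variables are unobserved, can be answered by inference by enumeration: sum the joint distribution over every assignment of the hidden variables. A generic sketch, independent of the topology you choose in part (a); joint is any function you supply that returns P(S = s, B = b, C = c, A = a) built from your CPTs:

    from itertools import product

    def prob_alarm(joint, evidence):
        # P(ALARM = true | evidence), where evidence fixes some of 'S', 'B', 'C'.
        totals = {True: 0.0, False: 0.0}
        for s, b, c, a in product((True, False), repeat=4):
            assignment = {"S": s, "B": b, "C": c}
            if all(assignment[var] == val for var, val in evidence.items()):
                totals[a] += joint(s, b, c, a)
        return totals[True] / (totals[True] + totals[False])

    # Part (c): prob_alarm(joint, {"S": False, "B": True, "C": True})
    # Part (d): prob_alarm(joint, {"S": True})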

  art 5. The table below lists a dataset containing details of policy holders at an insurance company. The descriptive features included in the table describe each policy holder's ID, occupation, gender, age, the type of insurance policy they hold, and their preferred contact channel. The preferred contact channel is the target feature in this domain.

ID   OCCUPATION     GENDER   AGE   POLICY TYPE   PREF CHANNEL
1    lab tech       female   43    planC         email
2    farmhand       female   57    planA         phone
3    biophysicist   male     21    planA         email
4    sheriff        female   47    planB         phone
5    painter        male     55    planC         phone
6    manager        male     19    planA         email
7    geologist      male     49    planC         phone
8    messenger      male     51    planB         email
9    nurse          female   18    planC         phone

a. Using equal-frequency binning, transform the AGE feature into a categorical feature with three levels: young, middle-aged, and mature.

b. Examine the descriptive features in the dataset and list the features that you would exclude before using the dataset to build a predictive model. For each feature you decide to exclude, explain why you have made this decision.

c. Calculate the probabilities required by a naive Bayes model to represent this domain.

d. What target level will a naive Bayes model predict for the following query:

GENDER = female, AGE = 30, POLICY TYPE = planA

  art 6. Imagine that you have been given a dataset of 1,000 documents that have been classified as being about entertainment or education. There are 700 entertainment documents and 300 education documents in the dataset. The tables below give the number of documents from each topic that a selection of words occurred in.

Word-document counts for the entertainment dataset

fun          415
is           695
machine       35
christmas      0
family       400
learning      70

Word-document counts for the education dataset

fun          200
is           295
machine      120
christmas      0
family        10
learning     105

a. What target level will a naive Bayes model predict for the following query document: “machine learning is fun”?

b. What target level will a naive Bayes model predict for the following query document: “christmas family fun”?

c. What target level will a naive Bayes model predict for the query document in part (b) of this question, if Laplace smoothing with k = 10 and a vocabulary size of 6 is used?
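As a reminder, one standard formulation of Laplace smoothing for document counts like these replaces each conditional word probability with

    P(w \mid t) = \frac{\mathrm{count}(w, t) + k}{\mathrm{count}(t) + k \times |V|}

where count(w, t) is the number of level-t documents containing w, count(t) is the total number of level-t documents, and |V| is the vocabulary size. With k = 10 and |V| = 6, every denominator grows by 60, and zero counts such as christmas no longer produce zero probabilities.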


_______________

1 It is appropriate to use a game involving gambling to introduce probability-based machine learning. The origins of probability theory come from attempts to understand gambling and games of chance, in particular, the work of Gerolamo Cardano and the later work of Pierre de Fermat and Blaise Pascal.

2 This data has been artificially generated for this example.

3 To save space, throughout this chapter, named features are denoted by the uppercase initial letters of their names (e.g., the MENINGITIS feature is denoted M). Where a named feature is binary, we use the lowercase initial letter of the feature name to denote the feature being true and the lowercase initial letter preceded by the symbol ¬ to denote it being false (e.g., m denotes MENINGITIS = true, and ¬m denotes MENINGITIS = false).

4 Bayes’ Theorem is named after the Reverend Thomas Bayes, who wrote an essay that described how to update beliefs as new information arises. After Thomas Bayes died, this essay was edited and published by the Reverend Richard Price (Bayes and Price, 1763). The modern mathematical form of Bayes’ Theorem, however, was developed by Pierre-Simon Laplace.

5 The Theorem of Total Probability is explained in detail in Section B.3[548] of Appendix B[541].

6 Famously, an experiment in which doctors were asked this question about the probability of the patient having the disease showed that most of them got this question wrong (Casscells et al., 1978).

7 The product rule is explained in detail in Section B.3[548] of Appendix B[541].

8 That is, a and b = b and a.

9 The probability chain rule is explained in detail in Section B.3[548] of Appendix B[541].

10 The paradox of the false positive states that in order to make reliable predictions about a rare event, a model must be as accurate as the event is rare; otherwise there is a significant chance of false positive predictions (i.e., predicting the event when it is not the case). Doctorow (2010) provides an interesting discussion of this phenomenon.

11 The MAP prediction is the prediction mechanism that we assume throughout this book. An alternative mechanism is the Bayesian optimal classifier, but we won’t discuss it in this text. See Mitchell (1997) for more details.

12 During the European Soccer Championships in 2008 and the 2010 Soccer World Cup, an octopus in Germany, called Paul, was attributed with achieving an 85% success rate at predicting the results of the matches involving Germany. Paul’s impressive accuracy should not be taken to suggest that octopus behavior affects soccer matches but rather that independent events may be correlated, at least for an interval of time, without the events actually being dependent. As the oft quoted maxim states: correlation does not imply causation! (See Section 3.5.2[86] for further discussion.)

13 One consequence of this, however, is that a naive Bayes model is not a good approach for predicting a continuous target, because errors in calculating posterior probabilities do directly affect the accuracy of the model. This is the only modeling approach covered in this book for which we will not present a way to predict both continuous and categorical target features.

14 Recall that sparse data, discussed in Section 5.4.5[212], refers to datasets where the majority of descriptive features have a value of zero.

15 The primary reason that we apply smoothing is to remove zero probabilities from a model’s representation of a domain, and in the vast majority of cases, all the unconditional target level probabilities will be non-zero (because there will be at least one instance with each target level in the training data). Even in cases where one of the target levels is very rare, it may not be appropriate to smooth the target level priors. See Bishop (2006, p. 45) for a discussion of how to train a probability-based prediction model in situations where one of the target levels is rare.

16 See Tijms (2012), or any good probability textbook, for an introduction to the gamma function.

17 The student-t distribution can be defined in a number of ways. For example, it can be defined so that it takes only one parameter, degrees of freedom. In this text we use the extended location-scale definition.

18 This guided search process is similar to the gradient descent search we use to fit our regression models in Chapter 7[323]. Many data analytics packages and programming APIs provide functions that implement methods to fit a distribution to a dataset.

19 Taleb (2008) discusses the problems that arise when analysts use normal distributions to model social and economic features, where the assumptions regarding light tails don’t hold.

20 For example, the R language provides the fitdistr() method, as part of the MASS package, that implements a maximum-likelihood fitting of a number of univariate distributions to a given dataset.

21 We can do this either by consulting a probability table or by using integration to calculate the area under the curve within the bounds of the interval. There are many excellent statistical textbooks that explain how to do both of these, for example, Montgomery and Runger (2010).

22 The conditional independence relationship between any two nodes in a Bayesian network can be specified using the framework of d-separation (the “d” stands for directed) (Pearl, 1988). We don’t discuss d-separation in this book as it is not required for our discussion.

23 The K2 score is named after the K2 algorithm, one of the earliest and best known algorithms for learning Bayesian networks (Cooper and Herskovits, 1992).

24 See Koller and Friedman (2009) for a discussion of algorithms that seek to address this research challenge.

25 In some cases we may not have data for all the features, and in these instances, the standard approach to learning the CPT entries is to use a gradient descent approach (similar to the one we introduce in Chapter 7[323]), where the objective function of the local search algorithm is simply how well the product of the induced conditional probabilities match the relative frequency of each joint event in the data. In other words, we choose the set of conditional probabilities that maximize the likelihood of the training data.

26 The data listed in this table is real. The Gini coefficient data is for 2013 (or the most recent year prior to 2013 for which the data was available for a country) and was retrieved from the World Bank (data.worldbank.org/indicator/SI.POV.GINI); the life expectancy and mean years in school data was retrieved from Gapminder (www.gapminder.org) and is for 2010/11 (or the most recent year prior to 2010/11 for which the data was available for a country); and the mean years in school were originally sourced from the Institute for Health Metrics and Evaluation (www.healthdata.org). The Corruption Perception Index is for 2011 and was retrieved from Transparency International (www.transparency.org).

27 The most common way to achieve this for the other model types covered in this book is to impute the missing values in the query instance using one of the techniques described in Section 3.4.1[73].

28 In the following calculations we have abbreviated feature names as follows: GC = GINI COEF, LE = LIFE EXP, and SY = SCHOOL YEARS.

29 Monte Carlo methods are named after the Mediterranean principality that is famous for its casino.

30 If there are one or more zero entries in the CPTs, then the Markov chain may still be ergodic, but it is non-trivial to prove ergodicity in these cases.

31 See Davies (2005, pp. 693–696).

32 See Mitchell (1997, pp. 164–167).

33 The dataset in this question is inspired by the Waste Water Treatment Dataset that is available from the UCI Machine Learning repository (Bache and Lichman, 2013) at archive.ics.uci.edu/ml/machine-learning-databases/water-treatment. The creators of this dataset reported their work in Bejar et al. (1991).

34 The conductivity of water is affected by inorganic dissolved solids and organic compounds, such as oil. Consequently, water conductivity is a useful measure of water purity.