6.3 Conditional Distributions and Bayes Theorem

If f(x1, x2) is the probability density function for continuous random variables (X1, X2), then

$$P(a_1 < X_1 \le b_1,\ a_2 < X_2 \le b_2) = \int_{a_1}^{b_1}\int_{a_2}^{b_2} f(x_1,x_2)\,dx_2\,dx_1. \tag{19}$$

From (19) we see that

$$P(a_1 < X_1 \le b_1) = \int_{a_1}^{b_1}\int_{-\infty}^{\infty} f(x_1,x_2)\,dx_2\,dx_1,$$

which means that

$$f_1(x_1) = \int_{-\infty}^{\infty} f(x_1,x_2)\,dx_2,$$

is the pdf for X1. The corresponding formulas for discrete (X1, X2) are

$$P(a_1 < X_1 \le b_1) = \sum_{a_1 < x_1 \le b_1}\ \sum_{x_2} f(x_1,x_2)$$

so that

$$f_1(x_1) = \sum_{x_2} f(x_1,x_2)$$

is the pmf for X1. These probability densities are distinguished by calling f(x1, x2) the joint pdf or pmf and calling f1(x1) the marginal pdf or pmf. The term joint comes from the fact that f(x1, x2) describes how X1 and X2 vary jointly. When X1 and X2 are discrete, and the sample space finite, the joint probability density can be written in a table and the sums f1(x1) are naturally written on the margins of the table.
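
To make the table picture concrete, here is a minimal sketch in Python (the joint pmf values are invented for illustration) that stores a small joint pmf as a 2-D array and obtains the marginal pmfs by summing across rows and columns, exactly as one would sum along the margins of the table.

```python
import numpy as np

# Hypothetical joint pmf f(x1, x2) for x1 in {0, 1, 2} (rows) and x2 in {0, 1} (columns).
# The values are illustrative only; they just need to be nonnegative and sum to 1.
joint = np.array([[0.10, 0.15],
                  [0.20, 0.25],
                  [0.05, 0.25]])

assert np.isclose(joint.sum(), 1.0)  # a valid joint pmf sums to 1

# Marginal pmfs: sum out the other variable.
f1 = joint.sum(axis=1)  # f1(x1) = sum over x2 of f(x1, x2)
f2 = joint.sum(axis=0)  # f2(x2) = sum over x1 of f(x1, x2)

print("f1:", f1)  # [0.25 0.45 0.30]
print("f2:", f2)  # [0.35 0.65]
```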

Marginal probability densities for (X1, …, Xn) are obtained from the joint density f(x1, …, xn) by integration for continuous random variables and summation for discrete random variables. The marginal probability density for continuous random variable X1 is

$$f_1(x_1) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1,\ldots,x_n)\,dx_2\cdots dx_n.$$

The marginal probability densities f2(x2), …, fn(xn) are also (n − 1)-fold integrals.

Marginal probability densities can describe two or more variables. For example, let (X1, X2, X3, X4) have joint probability density f(x1, x2, x3, x4); then the following are two of its marginal pdfs:

$$f_2(x_2) = \iiint f(x_1,x_2,x_3,x_4)\,dx_1\,dx_3\,dx_4, \qquad f_{13}(x_1,x_3) = \iint f(x_1,x_2,x_3,x_4)\,dx_2\,dx_4.$$

For discrete random variables X1 and X2 the conditional probability that X2 = x2 given X1 = x1 can be defined using

$$P(A_2 \mid A_1) = \frac{P(A_1 \cap A_2)}{P(A_1)} = \frac{P(X_1 = x_1,\, X_2 = x_2)}{P(X_1 = x_1)} = \frac{f(x_1,x_2)}{f_1(x_1)} \tag{20}$$

where Ai is the event {Xi = xi} and the last equality follows from the definitions of the joint and marginal pmfs. It is easily checked that for any x1 such that f1(x1) > 0 the right-hand side of (20), as a function of x2, is nonnegative and sums to 1

$$\sum_{x_2} \frac{f(x_1,x_2)}{f_1(x_1)} = \frac{1}{f_1(x_1)} \sum_{x_2} f(x_1,x_2) = \frac{f_1(x_1)}{f_1(x_1)} = 1.$$

This function is written

$$f(x_2 \mid x_1) = \frac{f(x_1,x_2)}{f_1(x_1)}, \qquad f_1(x_1) > 0, \tag{21}$$

and called the conditional pmf of X2 given X1 = x1. For continuous random variables X1 and X2 the joint and marginal pdfs do not provide probabilities. Nevertheless, f(x2 | x1) defined in Eq. (21) will be used to define the conditional pdf for continuous random variables X1 and X2.
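
Continuing the table sketch above (same hypothetical joint pmf), the conditional pmf f(x2 | x1) is obtained by dividing a row of the table by its row sum; a quick check confirms that each conditional pmf sums to 1, as in the display above.

```python
import numpy as np

# Same illustrative joint pmf as before: rows index x1, columns index x2.
joint = np.array([[0.10, 0.15],
                  [0.20, 0.25],
                  [0.05, 0.25]])
f1 = joint.sum(axis=1)                      # marginal pmf of X1

# Conditional pmf f(x2 | x1): divide each row by its marginal f1(x1).
cond_x2_given_x1 = joint / f1[:, None]

print(cond_x2_given_x1)
print(cond_x2_given_x1.sum(axis=1))         # each row sums to 1
```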

Extensions to n random variables can result in joint conditional pdfs or pmfs. For example,

$$f(x_2,\ldots,x_n \mid x_1) = \frac{f(x_1,\ldots,x_n)}{f_1(x_1)}, \qquad f_1(x_1) > 0,$$

is the joint conditional pdf or pmf for X2, …, Xn given X1 = x1. The conditional pdf or pmf given Xi = xi is obtained by dividing the joint pdf or pmf by fi(xi) provided fi(xi) > 0. The conditional pdf or pmf given Xi = xi and Xj = xj is obtained by dividing the joint pdf or pmf by fij(xi, xj) provided fij(xi, xj) > 0.

Because conditional pdfs (pmfs) are pdfs (pmfs) the expectation operator can be used for these functions as well. The resulting expectation is called the conditional expectation. If X1, …, Xn are continuous random variables, the conditional expectation of u(X2, …, Xn) given X1 = x1 is

$$E\bigl(u(X_2,\ldots,X_n) \mid x_1\bigr) = \int\cdots\int u(x_2,\ldots,x_n)\, f(x_2,\ldots,x_n \mid x_1)\,dx_2\cdots dx_n$$

provided f1(x1) > 0 and the integral exists.
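
As a numerical illustration (not from the text), the sketch below approximates a conditional expectation E(X2 | x1) for an assumed joint pdf f(x1, x2) = x1 + x2 on the unit square, first forming f(x2 | x1) and then integrating x2 against it on a grid.

```python
import numpy as np

# Assumed joint pdf for illustration: f(x1, x2) = x1 + x2 on (0, 1) x (0, 1).
x1 = 0.5                                  # condition on X1 = 0.5
dx = 1e-4
x2 = np.arange(dx / 2, 1.0, dx)           # midpoints of a fine grid on (0, 1)

joint = x1 + x2                           # f(x1, x2) viewed as a function of x2
f1 = np.sum(joint) * dx                   # marginal f1(x1) = x1 + 1/2
cond = joint / f1                         # conditional pdf f(x2 | x1)

E_x2_given_x1 = np.sum(x2 * cond) * dx    # E(X2 | x1) by numerical integration
print(E_x2_given_x1)                      # ~0.5833 = (x1/2 + 1/3) / (x1 + 1/2)
```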

Bayes theorem describes how one can find fX|y(x|y0), the conditional pdf for X given Y = y0, provided we have fX(x), the marginal pdf for X, and fY|x(y|x), the conditional pdf for Y given X = x, for every value x. The proof of Bayes theorem follows directly from the definition of the conditional pdf

$$f_{X|y}(x \mid y_0) = \frac{f(x,y_0)}{f_Y(y_0)} = \frac{f(x,y_0)}{\int f(x,y_0)\,dx} = \frac{f_X(x)\, f_{Y|x}(y_0 \mid x)}{\int f_X(x)\, f_{Y|x}(y_0 \mid x)\,dx} \tag{22}$$

where the first equality follows from the definition of fX|y, the second from the definition of fY, and the third from the definition of fY|x. The same steps hold for discrete random variables X and Y where sums replace integration

$$f_{X|y}(x \mid y_0) = \frac{f(x,y_0)}{f_Y(y_0)} = \frac{f(x,y_0)}{\sum_x f(x,y_0)} = \frac{f_X(x)\, f_{Y|x}(y_0 \mid x)}{\sum_x f_X(x)\, f_{Y|x}(y_0 \mid x)}.$$
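
For a quick numeric illustration of the discrete version (values invented for illustration), suppose X takes the two values 0 and 1 with prior probabilities fX(0) = 0.7 and fX(1) = 0.3, and the conditional pmfs for Y given each x are known; the posterior for X given an observed y0 is just the normalized product of marginal and conditional pmfs.

```python
import numpy as np

# Hypothetical prior fX(x) for x in {0, 1} and conditional pmfs fY|x(y | x) for y in {0, 1}.
prior = np.array([0.7, 0.3])              # fX(0), fX(1)
like = np.array([[0.9, 0.1],              # fY|x(y=0 | x=0), fY|x(y=1 | x=0)
                 [0.2, 0.8]])             # fY|x(y=0 | x=1), fY|x(y=1 | x=1)

y0 = 1                                    # observed value of Y
numer = prior * like[:, y0]               # fX(x) * fY|x(y0 | x) for each x
posterior = numer / numer.sum()           # Bayes formula: normalize over x

print(posterior)                          # [0.2258..., 0.7741...]
```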

Example 26

Let X and Y be continuous random variables with sample space 𝒳 = {(x, y) : 0 < x < 1, 0 < y < 1} and marginal and conditional pdfs given by (24) and (25).

$$f_X(x) = \begin{cases} \tfrac{2}{3}(x+1), & 0 < x < 1 \\ 0, & \text{otherwise.} \end{cases} \tag{24}$$

For each x between 0 and 1,

$$f_{Y|x}(y \mid x) = \begin{cases} \dfrac{x+2y}{x+1}, & 0 < y < 1 \\ 0, & \text{otherwise.} \end{cases} \tag{25}$$

Using (22) we find

$$f_{X|y}(x \mid y_0) = \frac{\tfrac{2}{3}(x+1)\,\dfrac{x+2y_0}{x+1}}{\displaystyle\int_0^1 \tfrac{2}{3}(x+1)\,\frac{x+2y_0}{x+1}\,dx} = \frac{\tfrac{2}{3}(x+2y_0)}{\displaystyle\int_0^1 \tfrac{2}{3}(x+2y_0)\,dx} = \frac{\tfrac{2}{3}(x+2y_0)}{\tfrac{1}{3}(4y_0+1)} = \frac{2x+4y_0}{1+4y_0}, \qquad 0 < x < 1.$$
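
A short sketch (the value y0 = 0.25 is chosen arbitrarily) can verify the Example 26 calculation numerically by building the numerator of Bayes formula from (24) and (25) and normalizing over x on a grid; the result matches (2x + 4y0)/(1 + 4y0).

```python
import numpy as np

y0 = 0.25                                   # arbitrary observed value of Y
dx = 1e-4
x = np.arange(dx / 2, 1.0, dx)              # grid on (0, 1)

fX = (2.0 / 3.0) * (x + 1.0)                # marginal pdf (24)
fY_given_x = (x + 2.0 * y0) / (x + 1.0)     # conditional pdf (25) evaluated at y0

numer = fX * fY_given_x                     # fX(x) * fY|x(y0 | x)
posterior = numer / (np.sum(numer) * dx)    # Bayes formula (22), normalized numerically

exact = (2.0 * x + 4.0 * y0) / (1.0 + 4.0 * y0)
print(np.max(np.abs(posterior - exact)))    # ~0: agrees with the closed form
```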

The Bayes formulas expressed in Eqs. (22) and (20) are not controversial when the random variables are functions defined on a common probability space as in Example 26. For Bayesian inference, explored in greater detail in the next chapter, a joint pdf is created from a family of pdfs by putting a distribution on the parameter space. That is, the distribution for the data Y is modeled by a pdf fY(y; θ) that depends on the parameter θ, which is in turn considered an observation from a second random variable X, so that

$$f(y, \theta) = f_X(\theta)\, f_{Y|\theta}(y \mid \theta),$$

where we have used the more suggestive notation fY|θ(y | θ) for fY(y; θ), but this does not represent a difference of substance. The main difference from non-Bayesian inference that uses a family of probability density models {f(y; θ) : θ ∈ Θ} is the marginal pdf fX(θ) on the index set Θ. This marginal pdf is called the prior and the result of Bayes formula, fX|y(θ | y0), is called the posterior.
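
As a minimal sketch of this construction (the prior and data model are invented for illustration), the posterior over a finite grid of θ values can be computed directly from Bayes formula, with the prior fX(θ) playing the role of the marginal pdf on Θ.

```python
import numpy as np

# Hypothetical setup: theta is the success probability of a Bernoulli model,
# and we observe k = 7 successes in n = 10 trials.
n, k = 10, 7
dtheta = 1e-3
theta = np.arange(dtheta / 2, 1.0, dtheta)       # grid over the parameter space (0, 1)

prior = np.ones_like(theta)                      # fX(theta): flat prior, for illustration
like = theta**k * (1.0 - theta)**(n - k)         # fY|theta(y | theta) at the observed data

numer = prior * like                             # numerator of Bayes formula
posterior = numer / (np.sum(numer) * dtheta)     # posterior pdf on the grid

print(theta[np.argmax(posterior)])               # posterior mode ~ 0.7
```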

7 Independent Random Variables

The ℝ- and ℝ^n-valued random variables considered in the previous sections could be used to model a single observation taken on an individual. For example, (X1, X2, X3) could be the values obtained from measuring height, weight, and diastolic blood pressure. Statistical inference typically deals with a sample of observations. If, for example, X is bmi (body mass index) measured on an individual, then the data collected from a sample of n would be written as X1, …, Xn where Xi is the bmi for the i-th individual. If the sample of n individuals is chosen by a simple random sample then X1, …, Xn are independent random variables. We have described what it means for two events to be independent (Definition 3) and we will extend this to random variables. The observations on height, weight, and blood pressure (X1, X2, X3) would not be expected to be independent, but a sample taken on n individuals might be. To simplify notation we consider ℝ-valued random variables, but this could be extended to ℝ^k-valued random variables for k > 1. For k = 3 and a sample of n individuals the sample would be X1, …, Xn where the observation for the i-th individual is Xi = (X1i, X2i, X3i).

If random variables X1 and X2 were independent we would expect f(x1 | x2) not to be a function of x2. If f(x1 | x2) is not a function of x2 then

$$f_1(x_1) = \int f(x_1,x_2)\,dx_2 = \int f(x_1 \mid x_2)\, f_2(x_2)\,dx_2 = f(x_1 \mid x_2) \int f_2(x_2)\,dx_2 = f(x_1 \mid x_2). \tag{30}$$

From (30) we have

$$f(x_1,x_2) = f(x_1 \mid x_2)\, f_2(x_2) = f_1(x_1)\, f_2(x_2)$$

and that f(x2|x1) is not a function of x1. When this is true for all values of x1 and x2 the random variables X1 and X2 are independent. Similar calculations hold for discrete random variables X1 and X2.

Random variables X1 and X2 are dependent if they are not independent.

Remark 11

The equality in Definition 11 does not need to hold for all (x1, x2) ∈ 𝒳 when X1 and X2 are continuous random variables. The reason is that pdfs are not uniquely defined: these functions can be changed at individual points without changing the value of the integral, and it is the integral that provides probabilities. What is required is that

$$P\bigl(\{(x_1,x_2) : f(x_1,x_2) \ne f_1(x_1)\, f_2(x_2)\}\bigr) = 0 \tag{32}$$

so that (31) holds, in some sense, almost everywhere. Measure-theoretic probability makes this idea of almost everywhere precise, and in advanced treatments (31) would be written

$$f(x_1,x_2) = f_1(x_1)\, f_2(x_2) \quad \text{a.e.}$$

where a.e. means “almost everywhere.” This remark also applies to the equality given in Theorem 7.

To show that two random variables are independent it is often easier to use the following theorem rather than the definition which involves the marginal pdfs or pmfs.

Theorem 7 is often called the factorization theorem for independence since we need only show that the joint pdf or pmf can be factored into two functions, one that does not depend on x1 and the other that does not depend on x2. Note that there is also the requirement that the sample space 𝒳 is a cross-product space. This theorem would not hold if, say,

$$\mathcal{X} = \{(x_1,x_2) : 0 < x_1 < x_2 < 1\}.$$

The factorization theorem shows that the random variables X1 and X2 in Examples 22 and 23 are independent. The random variables X and Y in Examples 26 and 27 are not independent because the conditional distributions depend on the other variable.
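
The factorization idea is easy to check numerically for a discrete table; the sketch below (joint pmf values invented for illustration) tests whether a joint pmf equals the product of its marginals at every cell.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Check whether a discrete joint pmf factors into the product of its marginals."""
    f1 = joint.sum(axis=1)                       # marginal pmf of X1
    f2 = joint.sum(axis=0)                       # marginal pmf of X2
    return np.allclose(joint, np.outer(f1, f2), atol=tol)

# A joint pmf built as an outer product of marginals: independent by construction.
indep = np.outer([0.25, 0.45, 0.30], [0.35, 0.65])

# A hypothetical joint pmf that does not factor: dependent.
dep = np.array([[0.10, 0.15],
                [0.20, 0.25],
                [0.05, 0.25]])

print(is_independent(indep))   # True
print(is_independent(dep))     # False
```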

The following theorem shows that the definition of independence for random variables provides the expected relationship for events defined by these variables. That is, P(A ∩ B) = P(A)P(B) when A = {(x1, x2) : a1 < x1 < b1} and B = {(x1, x2) : a2 < x2 < b2}.

The factorization of the joint pdf or pmf for independent random variables allows expectations to be factored for functions of the form u(X1)v(X2).

By Theorem 9, if random variables X and Y are independent then they are uncorrelated because their covariance is zero

$$E\bigl[(X - \mu_X)(Y - \mu_Y)\bigr] = E(X - \mu_X)\, E(Y - \mu_Y) = 0.$$

In general, random variables being uncorrelated does not mean they are independent; independence is a stronger condition. When random variables are independent there is a restriction on the joint pdf or pmf for (almost) all values in the sample space. Uncorrelated random variables only require the expectation of the single function (X − μX)(Y − μY) to be zero. The definition for n independent random variables is a natural extension of Definition 11.
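
A standard illustration of this distinction (a simulation sketch, not from the text): if X is symmetric about 0 and Y = X², then Cov(X, Y) = E(X³) = 0, so X and Y are uncorrelated, yet Y is completely determined by X and hence dependent on it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # X symmetric about 0
y = x**2                             # Y is a deterministic function of X, so dependent

print(np.cov(x, y)[0, 1])            # sample covariance ~ 0: uncorrelated
print(np.corrcoef(x, y)[0, 1])       # sample correlation ~ 0

# Dependence shows up in probabilities: P(Y > 1 | |X| > 1) = 1, but P(Y > 1) < 1.
print((y > 1).mean(), (y[np.abs(x) > 1] > 1).mean())
```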

The n-fold product in Eq. (33) will be abbreviated using the Π (product) notation

$$f_1(x_1) \cdots f_n(x_n) = \prod_{i=1}^{n} f_i(x_i).$$

For independent random variables X1, …, Xn Theorem 8 becomes

$$P(a_1 < X_1 < b_1, \ldots, a_n < X_n < b_n) = \prod_{i=1}^{n} P(a_i < X_i < b_i)$$

and Theorem 9 becomes

$$E\Bigl(\prod_{i=1}^{n} u_i(X_i)\Bigr) = \prod_{i=1}^{n} E\bigl(u_i(X_i)\bigr).$$

8 Central Limit Theorem

The central limit theorem can be described informally as a justification for treating the distribution of sums and averages of random variables as coming from a normal distribution. It should be noted that the central limit theorem is a theoretical result for what holds when the number of random variables n goes to infinity. While this theorem is not about any finite number n of random variables, in practice, the normal approximation to the sample mean (and related quantities, such as slopes in linear regression) is often very good even for modest values of n. For data that are not highly skewed and for which there are no extreme outliers, samples of size 30 or larger are considered large enough to use the normal approximation. There are many versions of the central limit theorem and the one we consider here is for the sample mean X̄ = (X1 + ⋯ + Xn)/n, where X1, …, Xn are not only independent but also identically distributed which means the marginal pdfs or pmfs are all the same

$$f_i(x_i) = f(x_i), \qquad i = 1, 2, \ldots, n.$$

When the random variables X1, …, Xn are independent and have identical probability distributions they are called a random sample. We assume the first two moments of f(x) exist and write

$$\mu = E(X_i), \qquad \sigma^2 = \operatorname{Var}(X_i) = E(X_i - \mu)^2,$$

where i can be any value from 1, 2, …, n. From the properties of the expectation operator for independent random variables it is easily shown that

$$E(\bar{X}) = \mu, \qquad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$
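
These two facts follow from linearity of expectation and, for the variance, from independence (so covariances between distinct terms vanish); a short derivation under the random-sample assumptions above is

$$E(\bar{X}) = E\Bigl(\tfrac{1}{n}\sum_{i=1}^{n} X_i\Bigr) = \tfrac{1}{n}\sum_{i=1}^{n} E(X_i) = \tfrac{1}{n}\, n\mu = \mu, \qquad \operatorname{Var}(\bar{X}) = \tfrac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \tfrac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.$$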

To discuss the central limit theorem we need to define what is meant by the limiting distribution of a random variable whose distribution depends on n.

Note that for each n, E(Yn) = 0 and Var(Yn) = 1. What changes with n is the shape, and in the limit the shape of the probability distribution function for Yn is that of the normal distribution.
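
A simulation sketch (not from the text; the exponential distribution and sample sizes are arbitrary choices) makes this concrete: taking Yn = √n(X̄ − μ)/σ, the usual standardization consistent with E(Yn) = 0 and Var(Yn) = 1, the distribution of Yn for skewed data moves toward the standard normal as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential(1) is skewed: mu = 1, sigma = 1 (arbitrary choice for illustration).
mu, sigma = 1.0, 1.0
reps = 50_000

for n in (5, 30, 200):
    x = rng.exponential(scale=1.0, size=(reps, n))
    y_n = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # standardized sample mean
    # Compare P(Y_n <= 1) with the standard normal value Phi(1) ~ 0.8413.
    print(n, (y_n <= 1.0).mean())
```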

The assumption that each Xi has the same distribution can be dropped but other assumptions must be made. An important application where the distributions are not identical is regression. For normal linear regression, Y1, …, Yn are independent but fi(yi) = f(yi, μi) where f(y, μi) is the normal pdf with mean μi and a standard deviation σ that is the same for all i. When μi is a linear function of the covariates Xi = (Xi1, …, Xik) this is normal linear regression and has important properties for statistical inference. In particular, normal linear regression has sufficient statistics that are not a function of the number of observations n. Sufficiency will be discussed in the next chapter.

Generalized linear models are an extension of normal linear regression to probability distributions from an exponential family. In a generalized linear model the pdf or pmf f(y, μi) is a member of an exponential family having mean parameter μi and a transformation of μi is linear in Xi. This transformation is called the link function and for each exponential family there is an important function called the canonical link. When the canonical link is used, generalized linear models will have sufficient statistics that do not depend on n.
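
As a small illustration of the link-function idea (a sketch with made-up coefficients, not a statement about any particular software), Poisson regression uses the canonical log link, so the transformation log μi is linear in the covariates:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: n observations, k = 2 covariates plus an intercept.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([0.5, 0.3, -0.2])          # made-up coefficients

eta = X @ beta                              # linear predictor eta_i = x_i' beta
mu = np.exp(eta)                            # canonical log link for Poisson: log(mu_i) = eta_i
y = rng.poisson(mu)                         # responses drawn from the exponential-family model

# Under the canonical link the sufficient statistic for beta is X'y, whose
# dimension (k + 1) does not grow with n.
print((X.T @ y).shape)                      # (3,)
```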