7
Information Theory†
DAVID BLACKWELL
PROFESSOR OF STATISTICS
UNIVERSITY OF CALIFORNIA, BERKELEY
7.1 Introduction
Information theory was founded by C. E. Shannon in 1948. In his fundamental paper,3 Shannon introduced the concepts of entropy and channel capacity and related them by the coding theorem. For an excellent treatment of Shannon’s work and its later development, the reader is referred to the book by Amiel Feinstein.2
7.2 An Example
Suppose that a man, called the sender, is given a 10-digit number and is required to communicate it to a second man, called the receiver. Suppose further that their only means of communication is a set of 100 postcards, labeled serially 1, 2, …, 100. These cards are in the possession of the sender and are addressed to the receiver. On each card is a space in which a single number may be written; this number is required to be 0 or 1. The sender, after being told the 10-digit number, will write a number, 0 or 1, on each of the 100 cards and mail the entire set to the receiver. The latter, on receiving the set of cards and noting the number written on each card, will try to deduce the 10-digit number. Can the sender and receiver agree in advance on a code with which the receiver can make a correct deduction?
We can easily answer this question by counting. The number of 10-digit numbers is 10^10. The number of ways of filling out the cards is 2^100 = 10^30.103 > 10^10. Thus to each 10-digit number there may be assigned a corresponding sequence of 100 0s and 1s in such a way that no two 10-digit numbers are assigned the same sequence. By agreeing in advance on any particular such code, the sender and receiver can communicate perfectly.
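The counting argument can be sketched in a few lines of Python (a modern illustration, not part of the original text): since 2^100 > 10^10, one workable code simply writes the 10-digit number in binary, one binary digit per card.

```python
# Sketch: encode a 10-digit number as a fixed-length binary string,
# one digit per postcard.  2**100 > 10**10, so 100 cards are more than
# enough; in fact ceil(log2(10**10)) = 34 bits already suffice.

def encode(n, length=100):
    """Write n in binary, padded to `length` digits (one per card)."""
    assert 0 <= n < 2**length
    return format(n, f"0{length}b")

def decode(bits):
    """Recover the number from the digits written on the cards."""
    return int(bits, 2)

assert 2**100 > 10**10
card_values = encode(9_876_543_210)
assert len(card_values) == 100
assert decode(card_values) == 9_876_543_210
```

With perfect postal service, any such one-to-one assignment gives error-free communication; the interesting problems begin only when the channel is noisy.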
We now modify the problem by supposing imperfect postal service, so that each card has a probability of only 0.5 that it will be delivered, with the deliveries of the separate cards being independent events. Perfect communication in this case is impossible, since every card could be lost in transit. We now ask, can the sender and receiver agree in advance on a code with which the probability of an error is small, no matter which 10-digit number is selected?
In general, we formulate the problem as follows. There is given a number k, the number of possible messages, of which exactly 1 is to be selected for transmission. In our example, we have k = 10^10. There is given also a channel, i.e., a means of communication. This channel is specified by (a) an input set A, from which the sender selects any one element, (b) an output set B, of which the receiver observes one element, and (c) the probability law p of the channel, which specifies, for each input element a and output element b, the probability p(b|a) that the receiver will observe element b when the sender selects element a. In our example, A consists of the 2^100 sequences of length 100 that can be formed by using the two symbols 0 and 1; B consists of the 3^100 sequences of length 100 that can be formed by using the three symbols 0, 1, and X, where X represents nondelivery of a card; p(b|a) = 2^−100 for any b that has 0 or X wherever a has 0 and 1 or X wherever a has 1; and p(b|a) = 0 otherwise.
A k code consists of (a) a sequence of k input elements a1, …, ak and (b) a division of the set B of outputs into k disjoint parts E1, …, Ek. The code is used as follows. When the sender wants to transmit the ith message, he chooses input ai. When the receiver observes output b, he locates the set Ej to which b belongs, and he concludes that message j was sent. For message i, the probability of error is then

1 − p(Ei|ai) = Σ_{b not in Ei} p(b|ai)
The maximum, over all i, of this error probability is called the maximum error probability for the given code. Our problem is this: For a given channel and a given k, how small can we make the maximum error probability, by proper choice of a k code? We shall see that, in our example, there is a 10^10 code for which the maximum probability of error is less than 0.03.
7.3 Entropy
If there are k possible messages, we can encode each message by a binary sequence of length log2 k. Thus a channel is adequate for transmitting any message from a set of k possible messages only if it is adequate for transmitting any binary sequence of length log2 k. Now suppose that each of the k messages has a certain probability of being presented for transmission and that there is a relatively small number l of these messages such that the total probability of the remaining k − l messages is small. It will then be possible to encode all messages except a set of small total probability by using only log2 l binary digits. The entropy of a set of messages with associated probabilities is defined, roughly, as the number of binary digits per message required to encode all long sequences of messages, except for a set of small total probability. Before making this notion precise, let us consider an example.
Suppose that there are two possible messages, 0 and 1, with probabilities α and 1 − α, respectively. In a long series of N such messages, the law of large numbers1 tells us that there will very probably be about Nα 0s and N(1 − α) 1s. Thus the message sequence that occurs will very likely be one with probability approximately equal to

α^{Nα}(1 − α)^{N(1−α)} = 2^{−Nh(α)}

where

h(α) = −α log2 α − (1 − α) log2 (1 − α)
There are then about 2^{Nh(α)} of these “likely” message sequences, and it is very unlikely that the actual message sequence will not be one of these. Note that, unless α = ½, we have h(α) < 1, so that the likely messages are an extremely small fraction of the possible messages. Instead of the N binary digits required to encode all possible message sequences of length N, we need only Nh(α) binary digits to encode all sequences of length N, except for a set of small total probability. The number h(α) is then the entropy of the pair of messages 0 and 1 with associated probabilities α and 1 − α.
In general, if a random variable X has k different possible values, with probabilities p1, p2, … , pk, respectively, the entropy of the random variable is defined by

H(X) = −(p1 log2 p1 + p2 log2 p2 + ⋯ + pk log2 pk)
An extension of the above argument shows that we can encode all long sequences of messages X, except for a set of small total probability, with NH(X) binary digits, where N is the number of messages in the sequence. The following properties of H are easily established.
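The definitions above translate directly into a short Python sketch (an illustration added here, not part of the original text): a function for H(X), and the special case h(α) for the two-message example.

```python
import math

def entropy(probs):
    """H(X) = -sum p_i log2 p_i  (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# h(alpha): entropy of the two-message example of the text.
h = lambda a: entropy([a, 1 - a])

# Unless alpha = 1/2, the ~2**(N*h(alpha)) "likely" sequences are a
# vanishing fraction of the 2**N possible ones.
assert abs(h(0.5) - 1.0) < 1e-12          # fair case: one full bit
assert 0.46 < h(0.1) < 0.47               # biased case: under half a bit
assert abs(entropy([0.25] * 4) - 2.0) < 1e-12   # uniform on 4 values: log2 4
```

The last assertion illustrates property (a) below: the uniform distribution attains the maximum log2 k.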
a. The function H(X) satisfies the inequality
0 ≤ H(X) ≤ log2 k
with H(X) = log2 k if and only if all values have the same probability 1/k.
b. For any two random variables X, Y, if (X,Y) denotes the random variable whose values are the pairs (x,y) of values of X and Y, then the function H(X,Y) satisfies the inequality
H(X,Y) ≤ H(X) + H(Y)
with equality if and only if X and Y are independent.
The number R(X,Y) = H(X) + H(Y) − H(X,Y) is called the mutual information in X and Y. It represents the number of bits (binary digits) of information that we obtain about X by observing Y. This can be seen as follows: If we observe Y, we observe H(Y) bits of information about the pair (X,Y), so that H(X,Y) − H(Y) additional bits are needed to identify (X,Y), that is, to identify X. Had we not observed Y, we would have needed H(X) bits to identify X. Thus, observing Y reduces the number of bits required to identify X by
H(X) − [H(X,Y) − H(Y)] = R(X,Y)
bits. The number
H(X|Y) = H(X,Y) − H(Y)
is called the conditional entropy of X, given Y; it is, as we have just seen, the additional number of bits required to identify X, once we are given Y.
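These quantities are easy to compute from a joint distribution; the following Python sketch (an added illustration, with two made-up joint distributions) checks the two extreme cases: independence, where Y yields nothing about X, and identity, where Y identifies X completely.

```python
import math

def H(probs):
    """Entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_quantities(joint):
    """joint[(x, y)] -> prob.  Returns H(X), H(Y), H(X,Y), R(X,Y), H(X|Y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    HX, HY = H(px.values()), H(py.values())
    HXY = H(joint.values())
    return HX, HY, HXY, HX + HY - HXY, HXY - HY

# Independent fair bits: observing Y says nothing about X.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
HX, HY, HXY, R, HXgY = info_quantities(indep)
assert abs(R) < 1e-12 and abs(HXgY - 1.0) < 1e-12

# Y identical to X: observing Y identifies X completely.
copy = {(0, 0): 0.5, (1, 1): 0.5}
HX, HY, HXY, R, HXgY = info_quantities(copy)
assert abs(R - 1.0) < 1e-12 and abs(HXgY) < 1e-12
```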
As an example, suppose a fair coin is tossed 100 times. Let X describe the result of the first 60 tosses and Y the result of the last 70. Then

H(X) = 60        H(Y) = 70        H(X,Y) = 100
R(X,Y) = 60 + 70 − 100 = 30        H(X|Y) = 100 − 70 = 30

the 30 bits that Y yields about X being the results of the 30 tosses common to both.
7.4 Capacity of a Channel
Consider a channel with input set A, output set B, and probability law p(b|a). If we select an input a according to some probability distribution q on A, then the probability of the input-output pair (a,b) is
r(a,b) = q(a)p(b|a)
and the probability of the output b is

e(b) = Σ_a q(a)p(b|a)

If X is the input variable and Y is the output variable, we have

R(X,Y) = H(X) + H(Y) − H(X,Y) = Σ_{a,b} r(a,b) log2 [r(a,b)/q(a)e(b)]
With our interpretation of R(X,Y) as the number of bits of information that we get about X from observing Y, we may hope that our channel is capable of transmitting R(X,Y) bits of information, i.e., that a message set with not more than 2R(X,Y) messages can be transmitted with small error probability. We shall see that this is approximately true, provided that we are allowed to encode long sequences of messages rather than individual ones.
For the present, we note that, by varying q, we obtain different numbers R(X,Y). The maximum of R(X,Y) over all input distributions q on A is called the capacity of the channel.
As an example, consider the channel consisting of a single postcard of the type considered in Sec. 7.2. We have A = (0,1), B = (0,1,X), and p(b|a) is as shown below:

          b = 0    b = 1    b = X
a = 0:     ½        0        ½
a = 1:     0        ½        ½
For q(0) = α, q(1) = 1 − α, we obtain

R(X,Y) = ½h(α)
Thus R(X,Y) is maximized for α = 0.5, and its maximum value, which is the channel capacity, is 0.5. This is exactly what we should expect: A channel that transmits one bit of information when it works, and that works half the time, should be able to transmit half as much information as a similar channel that works every time.
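The identity R(X,Y) = ½h(α) and its maximization can be checked numerically; the Python sketch below (an added illustration) builds the joint distribution of the one-card channel directly from the table above.

```python
import math

def h(a):
    """Binary entropy h(alpha), in bits."""
    return -sum(p * math.log2(p) for p in (a, 1 - a) if p > 0)

def R(alpha):
    """Mutual information for the one-card channel with q(0) = alpha.
    Joint distribution over (input, output) pairs:
      (0,0): alpha/2   (0,X): alpha/2
      (1,1): (1-alpha)/2   (1,X): (1-alpha)/2
    """
    joint = {("0", "0"): alpha / 2, ("0", "X"): alpha / 2,
             ("1", "1"): (1 - alpha) / 2, ("1", "X"): (1 - alpha) / 2}
    out = {}
    for (a, b), p in joint.items():
        out[b] = out.get(b, 0) + p
    HX = h(alpha)
    HY = -sum(p * math.log2(p) for p in out.values() if p > 0)
    HXY = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    return HX + HY - HXY

# R(alpha) = h(alpha)/2, maximized at alpha = 1/2 with value 1/2.
assert abs(R(0.5) - 0.5) < 1e-12
for a in (0.1, 0.3, 0.7, 0.9):
    assert abs(R(a) - h(a) / 2) < 1e-12
    assert R(a) < R(0.5)
```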
7.5 The Fundamental Theorem
The main result asserts that, if a channel has capacity C and if we are allowed to use the channel N times, where N is large, then we can transmit any one of about 2^CN messages with small error probability. Note that, since our postcard channel has C = ½, we should be able to transmit, with our 100 postcards, any one of about 2^50 messages with small error probability. Since the message set of interest has only 10^10 = 2^33.2 messages, the theorem implies that we should be able to find a code with small maximum error probability, since N = 100 is large.
The use of a channel N times may be considered as the single use of a large channel—the so-called N extension of the original channel. Thus our 100 postcards constitute the 100 extension of the single-card channel. In general, for a channel with input set A, output set B, and probability function p(b|a), the N extension has input set U consisting of all sequences of length N of elements of A, output set V consisting of all sequences of length N of elements of B, and probability function f(v|u) defined by

f(v|u) = p(b1|a1)p(b2|a2) ⋯ p(bN|aN)        u = (a1, …, aN), v = (b1, …, bN)
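Since the channel is memoryless, the extension law is just a product of single-use probabilities; the Python sketch below (an added illustration, using the single-card channel as the base) computes it directly.

```python
def extension_prob(p, u, v):
    """f(v|u) = p(b1|a1) * ... * p(bN|aN) for the N extension,
    where p[(b, a)] is the single-use channel law p(b|a)."""
    assert len(u) == len(v)
    prob = 1.0
    for a, b in zip(u, v):
        prob *= p.get((b, a), 0.0)   # unlisted pairs have probability 0
    return prob

# Single-card channel: each card arrives (output = input) or is lost
# (output = 'X'), each with probability 1/2.
p = {("0", "0"): 0.5, ("X", "0"): 0.5, ("1", "1"): 0.5, ("X", "1"): 0.5}

u = "01" * 50                        # an input of length N = 100
v = "0X" * 50                        # every second card lost
assert extension_prob(p, u, v) == 0.5 ** 100
assert extension_prob(p, u, "10" * 50) == 0.0   # impossible output
```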
We can now state the following result.
FUNDAMENTAL THEOREM. Given a channel of capacity C and any number ε > 0, there is an integer N0, depending on ε and the channel, such that, for any N ≥ N0 and for the N extension of the channel, there is a 2^{N(C−ε)} code having the property that the maximum error probability of the code is at most ε.
The theorem can be proved easily from a simple inequality, which will also enable us to say something useful about how large N must be; the proof actually describes the construction of a particular code.
INEQUALITY. Given a channel having input set U, output set V, and probability law f(v|u), a probability distribution q on U, a positive number c, and a positive number k, there is a k code for the channel such that the maximum error probability of the code is at most

prob {J(u,v) < c} + k/c

where

J(u,v) = f(v|u)/e(v)        e(v) = Σ_u q(u)f(v|u)

and where the probability of a pair (u,v) is q(u)f(v|u).
We shall not prove the fundamental theorem from our inequality but shall instead illustrate its use on our original postcard problem. We have U consisting of all sequences of 100 0s and 1s; V consisting of all sequences of 100 0s, 1s, and X’s; and f(v|u) = 2^−100 if v has 0 or X wherever u has 0 and 1 or X wherever u has 1. We are interested in k = 10^10. We may choose any distribution q on U and any number c. We shall choose q(u) = 2^−100 for all u and c = 2^40. The second term in our error bound is

k/c = 10^10/2^40 = 0.0091, approximately
Let us evaluate the first term, that is, prob {J(u,v) < c}.
For any possible pair (u,v), we have

prob (u,v) = g(u,v) = q(u)f(v|u) = 2^−100 · 2^−100 = 2^−200
prob (v) = e(v) = 2^−200 × number of u’s that could produce v

so that

e(v) = 2^−200 × 2^x

where x is the number of X’s in v, that is, the number of cards that fail to arrive. Thus we obtain

J(u,v) = f(v|u)/e(v) = 2^−100/(2^−200 × 2^x) = 2^{100−x}
The probability that J(u,v) < c (i.e., that 2^{100−x} < 2^40, or that x > 60) is the probability that more than 60 cards are lost in transit. This probability is found from tables of the binomial distribution to be 0.0176. Thus our inequality asserts the existence of a 10^10 code with maximum error probability at most 0.0176 + 0.0091 = 0.0267.
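Instead of consulting binomial tables, the two terms of the bound can be computed exactly; the Python sketch below (an added illustration) reproduces the numbers of the text.

```python
from math import comb

# The two terms of the error bound for the 100-card code:
#   prob{J(u,v) < c} = prob{more than 60 of 100 cards lost}, each card
#   lost independently with probability 1/2;
#   k/c = 10**10 / 2**40.
tail = sum(comb(100, x) for x in range(61, 101)) / 2**100
second_term = 10**10 / 2**40

assert 0.017 < tail < 0.018            # the 0.0176 of the text
assert 0.0090 < second_term < 0.0092   # the 0.0091 of the text
assert tail + second_term < 0.03       # the promised bound of Sec. 7.2
```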
Proof of the Inequality. Write

g(u,v) = q(u)f(v|u)        e(v) = Σ_u g(u,v)        J(u,v) = f(v|u)/e(v)

and let

ε = prob {J(u,v) < c} + k/c
Let S consist of all pairs (u,v) for which J(u,v) ≥ c. We proceed to construct a code, i.e., a series u1, … , uk of elements of U and a series E1, … , Ek of disjoint subsets of V, such that

Σ_{v not in Ei} f(v|ui) ≤ ε        i = 1, …, k
Choose, if possible, for u1 any element of U for which there is a set E1 in V such that

a. Σ_{v in E1} f(v|u1) ≥ 1 − ε
b. All pairs (u1,v) with v in E1 are in S.
Having found u1 and E1, choose, if possible, for u2 any element of U for which there is a set E2 in V such that

a′. Σ_{v in E2} f(v|u2) ≥ 1 − ε
b′. All pairs (u2,v) with v in E2 are in S.
c′. E2 is disjoint from E1.
Having continued in this manner, and thus having found u1, … , ui and E1, … , Ei, choose, if possible, for ui+1 any element of U for which there is a set Ei+1 in V such that

a″. Σ_{v in Ei+1} f(v|ui+1) ≥ 1 − ε
b″. All pairs (ui+1,v) with v in Ei+1 are in S.
c″. Ei+1 is disjoint from E1, E2, …, Ei.
Clearly, if u1, … , ui and E1, …, Ei can be found, they are an i code with maximum error probability at most ε. It remains to be shown that they can be found for i = k. Suppose, then, that we have found u1, … , ui and E1, … , Ei and are unable to find ui+1 and Ei+1 satisfying the conditions a″, b″, and c″. We must show that i ≥ k. For any u, let Fu consist of all v for which (u,v) is in S and v is not in E1 or E2 or … or Ei. Then, for every u, we have
Σ_{v in Fu} f(v|u) < 1 − ε

since otherwise u and Fu would be a possible choice for ui+1 and Ei+1. Multiplying this inequality by q(u) and summing on u yields

g(F) = Σ_{(u,v) in F} g(u,v) < 1 − ε
where F consists of all pairs (u,v) for which (u,v) is in S and v is not in E1 or E2 or … or Ei. Also, for v in Ej, we have

J(uj,v) ≥ c
so that

e(v) ≤ f(v|uj)/c
whence

Σ_j Σ_{v in Ej} e(v) ≤ (1/c) Σ_j Σ_{v in Ej} f(v|uj) ≤ i/c
But

g(S) = 1 − prob {J(u,v) < c} = 1 − ε + k/c

and every pair in S either belongs to F or has its v in some Ej, so that g(S) ≤ g(F) + Σ_j Σ_{v in Ej} e(v); that is,

1 − ε + k/c < (1 − ε) + i/c

it follows that i ≥ k and the proof is complete.
7.6 Multistate Channels
In constructing the N extension of a given channel, we defined the probability function for the N extension by

f(v|u) = p(b1|a1)p(b2|a2) ⋯ p(bN|aN)
This is correct for a single channel used N successive times only if the probability function of the channel does not depend on the history of the channel. Channels of this type are called memoryless or single-state channels. More generally, a finite-state channel is specified by an input set A, an output set B, a finite set S of states, and a probability function p(b,s′|a,s), which specifies the probability that, if the channel is currently in state s and receives input a, it will produce output b and move to state s′.
Fig. 7.1 A simple two-state channel.
A simple two-state channel is illustrated in Fig. 7.1. At the beginning of a cycle, the channel contains a single ball, numbered 0 or 1, and occupying either compartment. An input consists of a ball, labeled 0 or 1, dropped into the channel from the top. This input ball descends and occupies the remaining compartment. One of the two trap doors, selected at random, then opens, allowing one ball to descend. This ball is the output of the channel, and the cycle is complete. We have A = (0,1), B = (0,1), S = (0,1), and

p(s,a|a,s) = p(a,s|a,s) = ½        for a ≠ s
p(a,a|a,a) = 1

with p(b,s′|a,s) = 0 for all other combinations.
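Under the natural reading of the mechanism (the output is either the ball already inside or the ball just dropped in, each with probability ½, and the ball not released becomes the new state), the transition law can be tabulated by the following Python sketch (an added illustration):

```python
# Sketch of the two-state ball channel: the state s is the ball left
# inside.  Input ball a joins state ball s; a random trap door releases
# one of the two as output b, and the other remains as new state s'.

def transition_law(a, s):
    """Return dict {(b, s_new): prob} = p(b, s' | a, s)."""
    law = {}
    for b, s_new in ((s, a), (a, s)):   # release the old ball or the new one
        law[(b, s_new)] = law.get((b, s_new), 0.0) + 0.5
    return law

assert transition_law(0, 1) == {(1, 0): 0.5, (0, 1): 0.5}
# If input and state agree, output and next state are determined:
assert transition_law(1, 1) == {(1, 1): 1.0}
```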
For a finite-state channel, we can again consider its N extension, which will be a finite-state channel. For a given code, the probability of error will be a function of the initial state of the channel as well as of the particular message being transmitted. Again each finite-state channel has a capacity C, and there is a 2^{(C−ε)N} code for the N extension such that the maximum error probability of the code, over initial states and messages, is small. For these channels, however, C is defined in a more complicated way, as described below, and cannot easily be evaluated. For instance, the capacity of the channel described above is not known.
Indecomposability of a finite-state channel is defined as follows. Let s be any state and let u be any finite sequence a1, …, ak of input letters. A state s′ is u-attainable from s if, when the channel is initially in state s and receives the infinite sequence of inputs a1, … , ak; a1, … , ak; a1, … , ak; … , it is possible that at some time it enters state s′. A channel is indecomposable if for every two states s1, s2, and every u, there is a state s that is u-attainable from both s1 and s2.
Our two-state channel illustrated above is clearly indecomposable. For if u contains at least one 1, the state 1 is u-attainable from both 0 and 1, and similarly for 0.
7.7 Entropy of a Process; Capacity of Finite-state Channels
In defining the entropy of a random variable X, we noted that if X1, X2, …, XN are independent variables with the same distribution as X, then the sequence of values a1, …, aN that occurs is practically certain to be one having probability about 2^{−NH(X)}. More generally, let (A,B,S,p) be any indecomposable finite-state channel, let the initial state of the channel be chosen according to some probability distribution λ, let a specified input a* be used again and again, and let Y1, Y2, … be the resulting sequence of outputs. Sequences Y1, Y2, … defined in this way are called finitary processes. There is then a number H, depending on the channel and on a*, but not on λ, such that the sequence b1, … , bN that occurs is practically certain to have probability about 2^{−HN}. This number H is called the entropy of the process Y1, Y2, ….
For instance, if the output b is identical with the state of the channel, so that Y1, Y2, … are the successive states of the channel, the probability of the sequence s1, … , sN, when the initial state is s0, is

q(s1|s0)q(s2|s1) ⋯ q(sN|sN−1)
where q(s′|s) is the probability of moving to state s′, when the channel is in state s and receives input a*. Among the pairs (s0,s1), (s1,s2), … , (sN−1,sN), let f(s,s′) be the proportion that are equal to the pair (s,s′). Then, approximately,

Σ_{s′} f(s,s′) = Σ_{s′} f(s′,s) = g(s)

say, since both sums represent approximately the relative frequency of state s among s0, s1, … , sN. Now it is very likely that, among all occurrences of s, a proportion equal to about q(s′|s) are followed by s′; i.e., we have, approximately,

f(s,s′) = g(s)q(s′|s)
Summing over s yields the system

g(s′) = Σ_s g(s)q(s′|s)
of linear equations in the proportions g(s). This system, for indecomposable channels, will have a unique solution g*(s). Thus it is very probable that the sequence s0, s1, … , sN that occurs is one for which f(s,s′) is approximately g*(s)q(s′|s), so that its probability is about

∏_{s,s′} q(s′|s)^{Ng*(s)q(s′|s)} = 2^{−NH}

where

H = −Σ_{s,s′} g*(s)q(s′|s) log2 q(s′|s)
Thus, for the case in which the output is identical with the state, the entropy of the process is given by the above formula. No similarly tractable formula is known for the entropy of finitary processes in general.
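For a chain with two states, the stationary system g(s′) = Σ_s g(s)q(s′|s) can be solved in closed form and the entropy evaluated from the solution; the Python sketch below is an added illustration, and the transition matrix in it is made up, not taken from the text.

```python
import math

def stationary(q):
    """Solve g(s') = sum_s g(s) q(s'|s) for a 2-state chain, q[s][s'].
    Balance gives g(0) q(1|0) = g(1) q(0|1)."""
    g0 = q[1][0] / (q[0][1] + q[1][0])
    return [g0, 1 - g0]

def markov_entropy(q):
    """H = -sum_{s,s'} g*(s) q(s'|s) log2 q(s'|s), in bits per step."""
    g = stationary(q)
    return -sum(g[s] * q[s][t] * math.log2(q[s][t])
                for s in range(2) for t in range(2) if q[s][t] > 0)

# A symmetric chain that flips state with probability 1/4: the
# stationary distribution is uniform, and H reduces to h(1/4).
q = [[0.75, 0.25], [0.25, 0.75]]
assert stationary(q) == [0.5, 0.5]
expected = -(0.75 * math.log2(0.75) + 0.25 * math.log2(0.25))
assert abs(markov_entropy(q) - expected) < 1e-12
```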
In terms of the entropy of finitary processes, it is easy to define the capacity of an indecomposable finite-state channel. If (A,B,S,p) is such a channel, let {X1,X2, …} be any finitary process such that the value of each Xn is one of the input letters of the channel. This process has an entropy HX. If we start the channel in an arbitrary state and drive it with the inputs X1, X2, … , the resulting outputs Y1, Y2, … will form a finitary process, with entropy HY, and the pairs (X1,Y1), (X2,Y2), … will form a third finitary process, with entropy HX,Y. The capacity C of the channel is defined as the upper bound over all finitary sources X1, X2, … of the numbers

HX + HY − HX,Y
EXERCISES
1. Find the capacity of each of the following channels:
a. A = (0,1) B = (0,1)
b. A = (0,1) B = (0,1)
2. Under what conditions will a memoryless channel have capacity zero?
3. Let C be any memoryless channel (A,B,p) and let C* be the channel (A,B*,p*), where B* consists of B and a single new symbol x, and
p*(b|a) = ½p(b|a)        p*(x|a) = ½
for all a. Show that the capacity of C* is one-half the capacity of C.
4. A channel has input set (0,1) and output set (0,1). For input 0, outputs 0 and 1 are equally probable. For input 1, the output is 0 if the previous input and output agree and 1 if they differ. Describe this as a two-state channel.
5. Let C = (A,B,p) and C* = (B,D,q) be two memoryless channels, and let C** = (A,D,r), where

r(d|a) = Σ_b p(b|a)q(d|b)
Show that the capacity of C** does not exceed that of C or that of C*.
REFERENCES
1. Brown, George W., Monte Carlo Methods, chap. 12 in “Modern Mathematics for the Engineer,” First Series, edited by E. F. Beckenbach, McGraw-Hill Book Company, Inc., New York, 1956.
2. Feinstein, Amiel, “Foundations of Information Theory,” McGraw-Hill Book Company, Inc., New York, 1958.
3. Shannon, C. E., A Mathematical Theory of Communication, Bell System Tech. J., vol. 27, pp. 379–423, 623–658, 1948.
†This chapter was prepared with the partial support of the Office of Naval Research (Nonr-222-53); it may be reproduced in whole or in part for any purpose of the United States government.