7
Information Theory†
DAVID BLACKWELL
PROFESSOR OF STATISTICS
UNIVERSITY OF CALIFORNIA, BERKELEY
7.1 Introduction
Information theory was founded by C. E. Shannon in 1948. In his fundamental paper,3 Shannon introduced the concepts of entropy and channel capacity and related them by the coding theorem. For an excellent treatment of Shannon’s work and its later development, the reader is referred to the book by Amiel Feinstein.2
7.2 An Example
Suppose that a man, called the sender, is given a 10-digit number and is required to communicate it to a second man, called the receiver. Suppose further that their only means of communication is a set of 100 postcards, labeled serially 1, 2, …, 100. These cards are in the possession of the sender and are addressed to the receiver. On each card is a space in which a single number may be written; this number is required to be 0 or 1. The sender, after being told the 10-digit number, will write a number, 0 or 1, on each of the 100 cards and mail the entire set to the receiver. The latter, on receiving the set of cards and noting the number written on each card, will try to deduce the 10-digit number. Can the sender and receiver agree in advance on a code with which the receiver can make a correct deduction?
We can easily answer this question by counting. The number of 10-digit numbers is 10^10. The number of ways of filling out the cards is 2^100 = 10^30.103 > 10^10. Thus to each 10-digit number there may be assigned a corresponding sequence of 100 0s and 1s in such a way that no two 10-digit numbers are assigned the same sequence. By agreeing in advance on any particular such code, the sender and receiver can communicate perfectly.
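The counting argument can be sketched in a few lines of Python (a modern illustration, not part of the original text): since 2^100 > 10^10, one workable code simply writes the 10-digit number in binary, one binary digit per card.

```python
# Sketch: encode a 10-digit number as a fixed-length binary string,
# one digit per postcard.  2**100 > 10**10, so 100 cards are more than
# enough; in fact ceil(log2(10**10)) = 34 bits already suffice.

def encode(n, length=100):
    """Write n in binary, padded to `length` digits (one per card)."""
    assert 0 <= n < 2**length
    return format(n, f"0{length}b")

def decode(bits):
    """Recover the number from the digits written on the cards."""
    return int(bits, 2)

assert 2**100 > 10**10
card_values = encode(9_876_543_210)
assert len(card_values) == 100
assert decode(card_values) == 9_876_543_210
```

With perfect postal service, any such one-to-one assignment gives error-free communication; the interesting problems begin only when the channel is noisy.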
We now modify the problem by supposing imperfect postal service, so that each card has a probability of only 0.5 that it will be delivered, with the deliveries of the separate cards being independent events. Perfect communication in this case is impossible, since every card could be lost in transit. We now ask, can the sender and receiver agree in advance on a code with which the probability of an error is small, no matter which 10-digit number is selected?
In general, we formulate the problem as follows. There is given a number k, the number of possible messages, of which exactly 1 is to be selected for transmission. In our example, we have k = 10^10. There is given also a channel, i.e., a means of communication. This channel is specified by (a) an input set A, from which the sender selects any one element, (b) an output set B, of which the receiver observes one element, and (c) the probability law p of the channel, which specifies, for each input element a and output element b, the probability p(b|a) that the receiver will observe element b when the sender selects element a. In our example, A consists of the 2^100 sequences of length 100 that can be formed by using the two symbols 0 and 1; B consists of the 3^100 sequences of length 100 that can be formed by using the three symbols 0, 1, and X, where X represents nondelivery of a card; p(b|a) = 2^−100 for any b that has 0 or X wherever a has 0 and 1 or X wherever a has 1; and p(b|a) = 0 otherwise.
A k code consists of (a) a sequence of k input elements a1, …, ak and (b) a division of the set B of outputs into k disjoint parts E1, …, Ek. The code is used as follows. When the sender wants to transmit the ith message, he chooses input ai. When the receiver observes output b, he locates the set Ej to which b belongs, and he concludes that message j was sent. For message i, the probability of error is then

1 − p(Ei|ai) = Σ_{b not in Ei} p(b|ai)
The maximum, over all i, of this error probability is called the maximum error probability for the given code. Our problem is this: For a given channel and a given k, how small can we make the maximum error probability, by proper choice of a k code? We shall see that, in our example, there is a 10^10 code for which the maximum probability of error is less than 0.03.
7.3 Entropy
If there are k possible messages, we can encode each message by a binary sequence of length log2 k. Thus a channel is adequate for transmitting any message from a set of k possible messages only if it is adequate for transmitting any binary sequence of length log2 k. Now suppose that each of the k messages has a certain probability of being presented for transmission and that there is a relatively small number l of these messages such that the total probability of the remaining k − l messages is small. It will then be possible to encode all messages except a set of small total probability by using only log2 l binary digits. The entropy of a set of messages with associated probabilities is defined, roughly, as the number of binary digits per message required to encode all long sequences of messages, except for a set of small total probability. Before making this notion precise, let us consider an example.
Suppose that there are two possible messages, 0 and 1, with probabilities α and 1 − α, respectively. In a long series of N such messages, the law of large numbers1 tells us that there will very probably be about Nα 0s and N(1 − α) 1s. Thus the message sequence that occurs will very likely be one with probability approximately equal to

α^{Nα}(1 − α)^{N(1−α)} = 2^{−Nh(α)}

where

h(α) = −α log2 α − (1 − α) log2 (1 − α)
There are then about 2^{Nh(α)} of these “likely” message sequences, and it is very unlikely that the actual message sequence will not be one of these. Note that, unless α = ½, we have h(α) < 1, so that the likely messages are an extremely small fraction of the possible messages. Instead of the N binary digits required to encode all possible message sequences of length N, we need only Nh(α) binary digits to encode all sequences of length N, except for a set of small total probability. The number h(α) is then the entropy of the pair of messages 0 and 1 with associated probabilities α and 1 − α.
In general, if a random variable X has k different possible values, with probabilities p1, p2, … , pk, respectively, the entropy of the random variable is defined by

H(X) = −(p1 log2 p1 + p2 log2 p2 + ⋯ + pk log2 pk)
An extension of the above argument shows that we can encode all long sequences of messages X, except for a set of small total probability, with NH(X) binary digits, where N is the number of messages in the sequence. The following properties of H are easily established.
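The definitions above translate directly into a short Python sketch (an illustration added here, not part of the original text): a function for H(X), and the special case h(α) for the two-message example.

```python
import math

def entropy(probs):
    """H(X) = -sum p_i log2 p_i  (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# h(alpha): entropy of the two-message example of the text.
h = lambda a: entropy([a, 1 - a])

# Unless alpha = 1/2, the ~2**(N*h(alpha)) "likely" sequences are a
# vanishing fraction of the 2**N possible ones.
assert abs(h(0.5) - 1.0) < 1e-12          # fair case: one full bit
assert 0.46 < h(0.1) < 0.47               # biased case: under half a bit
assert abs(entropy([0.25] * 4) - 2.0) < 1e-12   # uniform on 4 values: log2 4
```

The last assertion illustrates property (a) below: the uniform distribution attains the maximum log2 k.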
a. The function H(X) satisfies the inequality
0 ≤ H(X) ≤ log2 k
with H(X) = log2 k if and only if all values have the same probability 1/k.
b. For any two random variables X, Y, if (X,Y) denotes the random variable whose values are the pairs (x,y) of values of X and Y, then the function H(X,Y) satisfies the inequality
H(X,Y) ≤ H(X) + H(Y)
with equality if and only if X and Y are independent.
The number R(X,Y) = H(X) + H(Y) − H(X,Y) is called the mutual information in X and Y. It represents the number of bits (binary digits) of information that we obtain about X by observing Y. This can be seen as follows: If we observe Y, we observe H(Y) bits of information about the pair (X,Y), so that H(X,Y) − H(Y) additional bits are needed to identify (X,Y), that is, to identify X. Had we not observed Y, we would have needed H(X) bits to identify X. Thus, observing Y reduces the number of bits required to identify X by
H(X) − [H(X,Y) − H(Y)] = R(X,Y)
bits. The number
H(X|Y) = H(X,Y) − H(Y)
is called the conditional entropy of X, given Y; it is, as we have just seen, the additional number of bits required to identify X, once we are given Y.
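These quantities are easy to compute from a joint distribution; the following Python sketch (an added illustration, with two made-up joint distributions) checks the two extreme cases: independence, where Y yields nothing about X, and identity, where Y identifies X completely.

```python
import math

def H(probs):
    """Entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_quantities(joint):
    """joint[(x, y)] -> prob.  Returns H(X), H(Y), H(X,Y), R(X,Y), H(X|Y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    HX, HY = H(px.values()), H(py.values())
    HXY = H(joint.values())
    return HX, HY, HXY, HX + HY - HXY, HXY - HY

# Independent fair bits: observing Y says nothing about X.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
HX, HY, HXY, R, HXgY = info_quantities(indep)
assert abs(R) < 1e-12 and abs(HXgY - 1.0) < 1e-12

# Y identical to X: observing Y identifies X completely.
copy = {(0, 0): 0.5, (1, 1): 0.5}
HX, HY, HXY, R, HXgY = info_quantities(copy)
assert abs(R - 1.0) < 1e-12 and abs(HXgY) < 1e-12
```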
As an example, suppose a fair coin is tossed 100 times. Let X describe the result of the first 60 tosses and Y the result of the last 70. Then

H(X) = 60        H(Y) = 70        H(X,Y) = 100
R(X,Y) = 60 + 70 − 100 = 30        H(X|Y) = 100 − 70 = 30

the 30 bits that Y yields about X being the results of the 30 tosses common to both.
7.4 Capacity of a Channel
Consider a channel with input set A, output set B, and probability law p(b|a). If we select an input a according to some probability distribution q on A, then the probability of the input-output pair (a,b) is
r(a,b) = q(a)p(b|a)
and the probability of the output b is

e(b) = Σ_a q(a)p(b|a)

If X is the input variable and Y is the output variable, we have

R(X,Y) = H(X) + H(Y) − H(X,Y) = Σ_{a,b} r(a,b) log2 [r(a,b)/q(a)e(b)]
With our interpretation of R(X,Y) as the number of bits of information that we get about X from observing Y, we may hope that our channel is capable of transmitting R(X,Y) bits of information, i.e., that a message set with not more than 2R(X,Y) messages can be transmitted with small error probability. We shall see that this is approximately true, provided that we are allowed to encode long sequences of messages rather than individual ones.
For the present, we note that, by varying q, we obtain different numbers R(X,Y). The maximum of R(X,Y) over all input distributions q on A is called the capacity of the channel.
As an example, consider the channel consisting of a single postcard of the type considered in Sec. 7.2. We have A = (0,1), B = (0,1,X), and p(b|a) is as shown below:

          b = 0    b = 1    b = X
a = 0:     ½        0        ½
a = 1:     0        ½        ½
For q(0) = α, q(1) = 1 − α, we obtain

R(X,Y) = ½h(α)
Thus R(X,Y) is maximized for α = 0.5, and its maximum value, which is the channel capacity, is 0.5. This is exactly what we should expect: A channel that transmits one bit of information when it works, and that works half the time, should be able to transmit half as much information as a similar channel that works every time.
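The identity R(X,Y) = ½h(α) and its maximization can be checked numerically; the Python sketch below (an added illustration) builds the joint distribution of the one-card channel directly from the table above.

```python
import math

def h(a):
    """Binary entropy h(alpha), in bits."""
    return -sum(p * math.log2(p) for p in (a, 1 - a) if p > 0)

def R(alpha):
    """Mutual information for the one-card channel with q(0) = alpha.
    Joint distribution over (input, output) pairs:
      (0,0): alpha/2   (0,X): alpha/2
      (1,1): (1-alpha)/2   (1,X): (1-alpha)/2
    """
    joint = {("0", "0"): alpha / 2, ("0", "X"): alpha / 2,
             ("1", "1"): (1 - alpha) / 2, ("1", "X"): (1 - alpha) / 2}
    out = {}
    for (a, b), p in joint.items():
        out[b] = out.get(b, 0) + p
    HX = h(alpha)
    HY = -sum(p * math.log2(p) for p in out.values() if p > 0)
    HXY = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    return HX + HY - HXY

# R(alpha) = h(alpha)/2, maximized at alpha = 1/2 with value 1/2.
assert abs(R(0.5) - 0.5) < 1e-12
for a in (0.1, 0.3, 0.7, 0.9):
    assert abs(R(a) - h(a) / 2) < 1e-12
    assert R(a) < R(0.5)
```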
7.5 The Fundamental Theorem
The main result asserts that, if a channel has capacity C and if we are allowed to use the channel N times, where N is large, then we can transmit any one of about 2^CN messages with small error probability. Note that, since our postcard channel has C = ½, we should be able to transmit, with our 100 postcards, any one of about 2^50 messages with small error probability. Since the message set of interest has only 10^10 = 2^33.2 messages, the theorem implies that we should be able to find a code with small maximum error probability, since N = 100 is large.
The use of a channel N times may be considered as the single use of a large channel—the so-called N extension of the original channel. Thus our 100 postcards constitute the 100 extension of the single-card channel. In general, for a channel with input set A, output set B, and probability function p(b|a), the N extension has input set U consisting of all sequences of length N of elements of A, output set V consisting of all sequences of length N of elements of B, and probability function f(v|u) defined by

f(v|u) = p(b1|a1)p(b2|a2) ⋯ p(bN|aN)        u = (a1, …, aN), v = (b1, …, bN)
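Since the channel is memoryless, the extension law is just a product of single-use probabilities; the Python sketch below (an added illustration, using the single-card channel as the base) computes it directly.

```python
def extension_prob(p, u, v):
    """f(v|u) = p(b1|a1) * ... * p(bN|aN) for the N extension,
    where p[(b, a)] is the single-use channel law p(b|a)."""
    assert len(u) == len(v)
    prob = 1.0
    for a, b in zip(u, v):
        prob *= p.get((b, a), 0.0)   # unlisted pairs have probability 0
    return prob

# Single-card channel: each card arrives (output = input) or is lost
# (output = 'X'), each with probability 1/2.
p = {("0", "0"): 0.5, ("X", "0"): 0.5, ("1", "1"): 0.5, ("X", "1"): 0.5}

u = "01" * 50                        # an input of length N = 100
v = "0X" * 50                        # every second card lost
assert extension_prob(p, u, v) == 0.5 ** 100
assert extension_prob(p, u, "10" * 50) == 0.0   # impossible output
```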
We can now state the following result.
FUNDAMENTAL THEOREM. Given a channel of capacity C and any number ε > 0, there is an integer N0, depending on ε and the channel, such that, for any N ≥ N0 and for the N extension of the channel, there is a 2^{N(C−ε)} code having the property that the maximum error probability of the code is at most ε.
The theorem can be proved easily from a simple inequality, which will also enable us to say something useful about how large N must be; the proof actually describes the construction of a particular code.
INEQUALITY. Given a channel having input set U, output set V, and probability law f(v|u), a probability distribution q on U, a positive number c, and a positive number k, there is a k code for the channel such that the maximum error probability of the code is at most

prob {J(u,v) < c} + k/c

where

J(u,v) = f(v|u)/e(v)        e(v) = Σ_u q(u)f(v|u)

and where the probability of a pair (u,v) is q(u)f(v|u).
We shall not prove the fundamental theorem from our inequality but shall instead illustrate its use on our original postcard problem. We have U consisting of all sequences of 100 0s and 1s; V consisting of all sequences of 100 0s, 1s, and X’s; and f(v|u) = 2^−100 if v has 0 or X wherever u has 0 and 1 or X wherever u has 1. We are interested in k = 10^10. We may choose any distribution q on U and any number c. We shall choose q(u) = 2^−100 for all u and c = 2^40. The second term in our error bound is

k/c = 10^10/2^40 = 0.0091, approximately
Let us evaluate the first term, that is, prob {J(u,v) < c}.
For any possible pair (u,v), we have

prob (u,v) = g(u,v) = q(u)f(v|u) = 2^−100 · 2^−100 = 2^−200
prob (v) = e(v) = 2^−200 × number of u’s that could produce v

so that

e(v) = 2^−200 × 2^x

where x is the number of X’s in v, that is, the number of cards that fail to arrive. Thus we obtain

J(u,v) = f(v|u)/e(v) = 2^−100/(2^−200 × 2^x) = 2^{100−x}
The probability that J(u,v) < c (i.e., that 2^{100−x} < 2^40, or that x > 60) is the probability that more than 60 cards are lost in transit. This probability is found from tables of the binomial distribution to be 0.0176. Thus our inequality asserts the existence of a 10^10 code with maximum error probability at most 0.0176 + 0.0091 = 0.0267.
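Instead of consulting binomial tables, the two terms of the bound can be computed exactly; the Python sketch below (an added illustration) reproduces the numbers of the text.

```python
from math import comb

# The two terms of the error bound for the 100-card code:
#   prob{J(u,v) < c} = prob{more than 60 of 100 cards lost}, each card
#   lost independently with probability 1/2;
#   k/c = 10**10 / 2**40.
tail = sum(comb(100, x) for x in range(61, 101)) / 2**100
second_term = 10**10 / 2**40

assert 0.017 < tail < 0.018            # the 0.0176 of the text
assert 0.0090 < second_term < 0.0092   # the 0.0091 of the text
assert tail + second_term < 0.03       # the promised bound of Sec. 7.2
```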
Proof of the Inequality. Write

g(u,v) = q(u)f(v|u)        e(v) = Σ_u g(u,v)        J(u,v) = f(v|u)/e(v)

and let

ε = prob {J(u,v) < c} + k/c
Let S consist of all pairs (u,v) for which J(u,v) ≥ c. We proceed to construct a code, i.e., a series u1, … , uk of elements of U and a series E1, … , Ek of disjoint subsets of V, such that

Σ_{v not in Ei} f(v|ui) ≤ ε        i = 1, …, k
Choose, if possible, for u1 any element of U for which there is a set E1 in V such that

a. Σ_{v in E1} f(v|u1) ≥ 1 − ε
b. All pairs (u1,v) with v in E1 are in S.
Having found u1 and E1, choose, if possible, for u2 any element of U for which there is a set E2 in V such that

a′. Σ_{v in E2} f(v|u2) ≥ 1 − ε
b′. All pairs (u2,v) with v in E2 are in S.
c′. E2 is disjoint from E1.
Having continued in this manner, and thus having found u1, … , ui and E1, … , Ei, choose, if possible, for ui+1 any element of U for which there is a set Ei+1 in V such that

a″. Σ_{v in Ei+1} f(v|ui+1) ≥ 1 − ε
b″. All pairs (ui+1,v) with v in Ei+1 are in S.
c″. Ei+1 is disjoint from E1, E2, …, Ei.
Clearly, if u1, … , ui and E1, …, Ei can be found, they are an i code with maximum error probability at most ε. It remains to be shown that they can be found for i = k. Suppose, then, that we have found u1, … , ui and E1, … , Ei and are unable to find ui+1 and Ei+1 satisfying the conditions a″, b″, and c″. We must show that i ≥ k. For any u, let Fu consist of all v for which (u,v) is in S and v is not in E1 or E2 or … or Ei. Then, for every u, we have
Σ_{v in Fu} f(v|u) < 1 − ε

since otherwise u and Fu would be a possible choice for ui+1 and Ei+1. Multiplying this inequality by q(u) and summing on u yields

g(F) = Σ_{(u,v) in F} g(u,v) < 1 − ε
where F consists of all pairs (u,v) for which (u,v) is in S and v is not in E1 or E2 or … or Ei. Also, for v in Ej, we have

J(uj,v) ≥ c
so that

e(v) ≤ f(v|uj)/c
whence

Σ_j Σ_{v in Ej} e(v) ≤ (1/c) Σ_j Σ_{v in Ej} f(v|uj) ≤ i/c
But

g(S) = 1 − prob {J(u,v) < c} = 1 − ε + k/c

and every pair in S either belongs to F or has its v in some Ej, so that g(S) ≤ g(F) + Σ_j Σ_{v in Ej} e(v); that is,

1 − ε + k/c < (1 − ε) + i/c

it follows that i ≥ k and the proof is complete.
7.6 Multistate Channels
In constructing the N extension of a given channel, we defined the probability function for the N extension by

f(v|u) = p(b1|a1)p(b2|a2) ⋯ p(bN|aN)
This is correct for a single channel used N successive times only if the probability function of the channel does not depend on the history of the channel. Channels of this type are called memoryless or single-state channels. More generally, a finite-state channel is specified by an input set A, an output set B, a finite set S of states, and a probability function p(b,s′|a,s), which specifies the probability that, if the channel is currently in state s and receives input a, it will produce output b and move to state s′.
Fig. 7.1 A simple two-state channel.
A simple two-state channel is illustrated in Fig. 7.1. At the beginning of a cycle, the channel contains a single ball, numbered 0 or 1, and occupying either compartment. An input consists of a ball, labeled 0 or 1, dropped into the channel from the top. This input ball descends and occupies the remaining compartment. One of the two trap doors, selected at random, then opens, allowing one ball to descend. This ball is the output of the channel, and the cycle is complete. We have A = (0,1), B = (0,1), S = (0,1), and

p(s,a|a,s) = p(a,s|a,s) = ½        for a ≠ s
p(a,a|a,a) = 1

with p(b,s′|a,s) = 0 for all other combinations.
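Under the natural reading of the mechanism (the output is either the ball already inside or the ball just dropped in, each with probability ½, and the ball not released becomes the new state), the transition law can be tabulated by the following Python sketch (an added illustration):

```python
# Sketch of the two-state ball channel: the state s is the ball left
# inside.  Input ball a joins state ball s; a random trap door releases
# one of the two as output b, and the other remains as new state s'.

def transition_law(a, s):
    """Return dict {(b, s_new): prob} = p(b, s' | a, s)."""
    law = {}
    for b, s_new in ((s, a), (a, s)):   # release the old ball or the new one
        law[(b, s_new)] = law.get((b, s_new), 0.0) + 0.5
    return law

assert transition_law(0, 1) == {(1, 0): 0.5, (0, 1): 0.5}
# If input and state agree, output and next state are determined:
assert transition_law(1, 1) == {(1, 1): 1.0}
```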
For a finite-state channel, we can again consider its N extension, which will be a finite-state channel. For a given code, the probability of error will be a function of the initial state of the channel as well as of the particular message being transmitted. Again each finite-state channel has a capacity C, and there is a 2^{(C−ε)N} code for the N extension such that the maximum error probability of the code, over initial states and messages, is small. For these channels, however, C is defined in a more complicated way, as described below, and cannot easily be evaluated. For instance, the capacity of the channel described above is not known.
Indecomposability of a finite-state channel is defined as follows. Let s be any state and let u be any finite sequence a1, …, ak of input letters. A state s′ is u-attainable from s if, when the channel is initially in state s and receives the infinite sequence of inputs a1, … , ak; a1, … , ak; a1, … , ak; … , it is possible that at some time it enters state s′. A channel is indecomposable if for every two states s1, s2, and every u, there is a state s that is u-attainable from both s1 and s2.
Our two-state channel illustrated above is clearly indecomposable. For if u contains at least one 1, the state 1 is u-attainable from both 0 and 1, and similarly for 0.
7.7 Entropy of a Process; Capacity of Finite-state Channels
In defining the entropy of a random variable X, we noted that if X1, X2, …, XN are independent variables with the same distribution as X, then the sequence of values a1, …, aN that occurs is practically certain to be one having probability about 2^{−NH(X)}. More generally, let (A,B,S,p) be any indecomposable finite-state channel, let the initial state of the channel be chosen according to some probability distribution λ, let a specified input a* be used again and again, and let Y1, Y2, … be the resulting sequence of outputs. Sequences Y1, Y2, … defined in this way are called finitary processes. There is then a number H, depending on the channel and on a*, but not on λ, such that the sequence b1, … , bN that occurs is practically certain to have probability about 2^{−HN}. This number H is called the entropy of the process Y1, Y2, ….
For instance, if the output b is identical with the state of the channel, so that Y1, Y2, … are the successive states of the channel, the probability of the sequence s1, … , sN, when the initial state is s0, is

q(s1|s0)q(s2|s1) ⋯ q(sN|sN−1)
where q(s′|s) is the probability of moving to state s′, when the channel is in state s and receives input a*. Among the pairs (s0,s1), (s1,s2), … , (sN−1,sN), let f(s,s′) be the proportion that are equal to the pair (s,s′). Then, approximately,

Σ_{s′} f(s,s′) = Σ_{s′} f(s′,s) = g(s)

say, since both sums represent approximately the relative frequency of state s among s0, s1, … , sN. Now it is very likely that, among all occurrences of s, a proportion equal to about q(s′|s) are followed by s′; i.e., we have, approximately,

f(s,s′) = g(s)q(s′|s)
Summing over s yields the system

g(s′) = Σ_s g(s)q(s′|s)
of linear equations in the proportions g(s). This system, for indecomposable channels, will have a unique solution g*(s). Thus it is very probable that the sequence s0, s1, … , sN that occurs is one for which f(s,s′) is approximately g*(s)q(s′|s), so that its probability is about

∏_{s,s′} q(s′|s)^{Ng*(s)q(s′|s)} = 2^{−NH}

where

H = −Σ_{s,s′} g*(s)q(s′|s) log2 q(s′|s)
Thus, for the case in which the output is identical with the state, the entropy of the process is given by the above formula. No similarly tractable formula is known for the entropy of finitary processes in general.
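For a chain with two states, the stationary system g(s′) = Σ_s g(s)q(s′|s) can be solved in closed form and the entropy evaluated from the solution; the Python sketch below is an added illustration, and the transition matrix in it is made up, not taken from the text.

```python
import math

def stationary(q):
    """Solve g(s') = sum_s g(s) q(s'|s) for a 2-state chain, q[s][s'].
    Balance gives g(0) q(1|0) = g(1) q(0|1)."""
    g0 = q[1][0] / (q[0][1] + q[1][0])
    return [g0, 1 - g0]

def markov_entropy(q):
    """H = -sum_{s,s'} g*(s) q(s'|s) log2 q(s'|s), in bits per step."""
    g = stationary(q)
    return -sum(g[s] * q[s][t] * math.log2(q[s][t])
                for s in range(2) for t in range(2) if q[s][t] > 0)

# A symmetric chain that flips state with probability 1/4: the
# stationary distribution is uniform, and H reduces to h(1/4).
q = [[0.75, 0.25], [0.25, 0.75]]
assert stationary(q) == [0.5, 0.5]
expected = -(0.75 * math.log2(0.75) + 0.25 * math.log2(0.25))
assert abs(markov_entropy(q) - expected) < 1e-12
```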
In terms of the entropy of finitary processes, it is easy to define the capacity of an indecomposable finite-state channel. If (A,B,S,p) is such a channel, let {X1,X2, …} be any finitary process such that the value of each Xn is one of the input letters of the channel. This process has an entropy HX. If we start the channel in an arbitrary state and drive it with the inputs X1, X2, … , the resulting outputs Y1, Y2, … will form a finitary process, with entropy HY, and the pairs (X1,Y1), (X2,Y2), … will form a third finitary process, with entropy HX,Y. The capacity C of the channel is defined as the upper bound over all finitary sources X1, X2, … of the numbers

HX + HY − HX,Y
EXERCISES
1. Find the capacity of each of the following channels:
a. A = (0,1) B = (0,1)
b. A = (0,1) B = (0,1)
2. Under what conditions will a memoryless channel have capacity zero?
3. Let C be any memoryless channel (A,B,p) and let C* be the channel (A,B*,p*), where B* consists of B and a single new symbol x, and
p*(b|a) = ½p(b|a)        p*(x|a) = ½
for all a. Show that the capacity of C* is one-half the capacity of C.
4. A channel has input set (0,1) and output set (0,1). For input 0, outputs 0 and 1 are equally probable. For input 1, the output is 0 if the previous input and output agree and 1 if they differ. Describe this as a two-state channel.
5. Let C = (A,B,p) and C* = (B,D,q) be two memoryless channels, and let C** = (A,D,r), where

r(d|a) = Σ_b p(b|a)q(d|b)
Show that the capacity of C** does not exceed that of C or that of C*.
REFERENCES
1. Brown, George W., Monte Carlo Methods, chap. 12 in “Modern Mathematics for the Engineer,” First Series, edited by E. F. Beckenbach, McGraw-Hill Book Company, Inc., New York, 1956.
2. Feinstein, Amiel, “Foundations of Information Theory,” McGraw-Hill Book Company, Inc., New York, 1958.
3. Shannon, C. E., A Mathematical Theory of Communication, Bell System Tech. J., vol. 27, pp. 379–423, 623–658, 1948.
†This chapter was prepared with the partial support of the Office of Naval Research (Nonr-222-53); it may be reproduced in whole or in part for any purpose of the United States government.