BART KOSKO
Information scientist & professor of electrical engineering & law, University of Southern California; author, Noise
We should worry that much of our science and technology still uses just five main models of probability, even though there are more probability models than there are real numbers. I call these lamplight probabilities. The adjective refers to the old joke about the drunk who lost his keys somewhere in the dark and looks for them under the streetlamp because that’s where the light is.
The five lamplight probabilities do explain a lot of the observed world. They have simple closed-form definitions. So they are easy to teach. Open any book on probability and there they are. We have proved lots of theorems about them. And they generalize to many other simple closed-form probabilities that also seem to explain a lot of the world and find their way into real-world models ranging from finance to communications to nuclear engineering.
But how accurate are they? How well do they match fact rather than just simplify the hard task of picking a good random model of the world?
Their use in practice seldom comes with a statistical hypothesis test that can give some objective measure of how well the assumption of a lamplight probability fits the data at hand. And each model has express technical conditions that the data must satisfy. Yet the data routinely violate these conditions in practice.
Everyone knows the first lamplight probability: the normal bell curve. Indeed, most people think that the normal bell curve is the bell curve. But there are whole families of bell curves with thicker tails that explain a wide range of otherwise unlikely “rare” or “black swan” events, depending on how thick the tails are. And that is still assuming that the bell is regular or symmetric. You just don’t find these bell curves in most textbooks.
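To make the tail difference concrete, here is a rough sketch (an illustration added here, not from the essay) comparing how much probability a standard normal curve and one thicker-tailed bell, a Student-t curve with 3 degrees of freedom, place beyond the same far-out cutoff:

```python
# A rough comparison (illustrative only): tail probability beyond the same
# far-out cutoff under a thin-tailed standard normal curve and under one
# thicker-tailed bell curve, a Student-t with 3 degrees of freedom.
from scipy.stats import norm, t

cutoff = 5.0
p_normal = norm.sf(cutoff)       # P(X > 5) for the standard normal, about 2.9e-07
p_student = t.sf(cutoff, df=3)   # P(X > 5) for the Student-t(3), about 7.7e-03

print(f"normal tail beyond {cutoff}:       {p_normal:.1e}")
print(f"Student-t(3) tail beyond {cutoff}: {p_student:.1e}")
```

At this cutoff the thick-tailed curve assigns the “rare” event tens of thousands of times more probability than the thin-tailed normal curve does.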
There is also a simple test for such thin-tailed “normality” in time-series data such as price histories or sampled human speech: All higher-order cumulants of the process must be zero. A cumulant is a special average of the process. Looking at the higher-order cumulants of an alleged normal process routinely leads to the same finding: They’re not all zero. So the process cannot be normal. Yet under the lamplight we go ahead and assume the process is normal anyway—especially since so many other researchers do the same in similar circumstances. That can lead to severely underestimating the occurrence of rare events, such as loan defaults. That’s just what happened in the engineering models of the recent financial panic, when financial engineers found a way to impose the normal curve on complex correlated derivatives.
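As a rough illustration of that kind of check (a sketch of the idea only, not the author’s exact procedure), one can estimate the third and fourth cumulants from sample data and run a standard skewness-and-kurtosis test; the simulated heavy-tailed data below, standing in for real price changes, fail it badly:

```python
# A sketch of the cumulant check: for a truly normal process the third and
# fourth cumulants are zero, so estimating them from data and running a
# skewness-and-kurtosis test such as Jarque-Bera gives a quick normality check.
import numpy as np
from scipy.stats import kstat, jarque_bera

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=10_000)   # simulated stand-in for price changes

k3 = kstat(returns, 3)                # estimate of the third cumulant
k4 = kstat(returns, 4)                # estimate of the fourth cumulant
stat, p_value = jarque_bera(returns)  # formal test built on those same moments

print(f"k3 = {k3:.3f}, k4 = {k4:.3f}, Jarque-Bera p-value = {p_value:.2e}")
# Here k4 lands far from zero and the p-value is tiny, so normality is rejected.
```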
The second and third lamplight probabilities are the Poisson and exponential probability models. Poisson probabilities model random counting events, such as the number of hits on an Internet site or the number of cars merging onto a freeway or the number of raindrops hitting a sidewalk. Exponential probabilities model how long it takes for the next Poisson event to happen—how long it takes until the next customer walks through the door or the next raindrop hits the pavement. This generalizes into how long you have to wait for the next ten Internet hits or the next ten raindrops. Modern queuing theory rests on these two lamplight probabilities. It’s all about waiting times for Poisson arrivals at queues. And so the Internet itself rests on these two lamplight models.
But Poisson models have an Achilles heel: Their averages must equal their variances (spreads about their means). Again, this routinely fails to hold in practice. Exponential models have a similar problem; their variances must equal the square of their means. This is a fine relationship that also rarely holds exactly in practice, holding only to some fuzzy degree in most cases. Whether the approximation is good enough is a judgment call, and one that the lamplight makes a lot easier to make.
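A small sketch of these dispersion checks, using made-up counts and waiting times purely for illustration, looks like this:

```python
# A sketch of the dispersion checks above, with made-up numbers purely for
# illustration: Poisson data should have variance close to the mean, and
# exponential waiting times should have variance close to the squared mean.
import numpy as np

hits = np.array([3, 0, 7, 12, 1, 0, 9, 15, 2, 22])     # hypothetical counts per minute
waits = np.array([0.2, 3.1, 0.4, 8.7, 0.1, 5.5, 0.9])  # hypothetical waiting times

print(f"counts: mean = {hits.mean():.2f}, variance = {hits.var(ddof=1):.2f}")
print(f"waits:  mean^2 = {waits.mean()**2:.2f}, variance = {waits.var(ddof=1):.2f}")
# A variance far above the mean (overdispersion) breaks the Poisson assumption;
# a variance far from the squared mean breaks the exponential one.
```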
The fourth lamplight probability is the uniform probability model. Everyone also knows the uniform model, because it is the special case where all outcomes are equally likely. It is just what the layman thinks of as doing something “at random,” such as drawing straws or grabbing a numbered Ping-Pong ball from a Bingo hopper. But straws come in varying lengths and thicknesses, and so their draw probabilities may not be exactly equal. It gets harder and harder in practice to produce equally likely outcomes as the number of outcomes increases. It is even a theoretical fact that one cannot draw an integer at random from the set of integers, because of the nature of infinity: equal probabilities over infinitely many integers would have to sum either to zero or to infinity, never to one. So it helps to appeal to common practice under the lamplight and simply assume that the outcomes are all equally likely.
The fifth and final lamplight probability is the binomial probability model. It describes the canonical random metaphor of flipping a coin. The binomial model requires binary outcomes such as heads or tails and further requires independent trials or flips. The probability of getting heads must also stay the same from flip to flip. This seems simple enough. But it can be hard to accept that the next flip of a fair coin is just as likely to be heads as it is to be tails when the preceding three independent coin-flip outcomes have all been heads.
Even the initiated scratch their heads at how the binomial behaves. Consider a fair penny. Fairness here means that the penny is equally likely to come up heads or tails when flipped (and hence the fourth lamplight probability describes this elementary outcome). So the probability of heads is 1/2. Now flip the penny several times. Then answer this question: Are you more likely to get three heads in six coin flips or are you more likely to get three heads in just five coin flips? The correct answer is neither. The probability of getting three heads in both cases is exactly 5/16. That is hardly intuitive, but it comes straight out of counting up all the possible outcomes.
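A few lines of direct counting confirm the claim, since for a fair coin the probability of k heads in n flips is C(n, k)/2^n:

```python
# A quick check of the 5/16 claim by direct counting: for a fair coin the
# probability of k heads in n flips is C(n, k) / 2**n.
from math import comb
from fractions import Fraction

p_three_of_five = Fraction(comb(5, 3), 2**5)   # 10/32
p_three_of_six = Fraction(comb(6, 3), 2**6)    # 20/64

print(p_three_of_five, p_three_of_six)         # both reduce to 5/16
```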
Lamplight probabilities have proven an especially tight restriction on modern Bayesian inference. That’s disappointing, given both the explosion in modern Bayesian computing and the widespread view that learning itself is a form of using Bayes’ Theorem to update one’s belief given fresh data or evidence. Bayes’ Theorem shows how to compute the probability of a logical converse. It shows how to adjust the probability of lung cancer given a biopsy, if we know what the raw probability of lung cancer is and the probability that we would observe such a biopsy in a patient if the patient in fact has lung cancer. But almost all Bayesian models restrict these probabilities not just to the well-known lamplight probabilities. They further restrict them so that they satisfy a very restrictive “conjugacy” relationship. The result has put much of modern Bayesian computing into a straitjacket of its own design.
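In symbols, the lung-cancer example reads as follows, with C standing for lung cancer and B for the observed biopsy result (notation added here only for illustration): the updated probability of cancer given the biopsy is

$$
P(C \mid B) \;=\; \frac{P(B \mid C)\,P(C)}{P(B)} \;=\; \frac{P(B \mid C)\,P(C)}{P(B \mid C)\,P(C) + P(B \mid \lnot C)\,P(\lnot C)},
$$

where the denominator simply totals the two ways the biopsy result can arise.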
Conjugacy is tempting. Suppose the cancer probability is normal. Suppose also that the conditional probability of the biopsy, given the cancer, is normal. These two probabilities are conjugate. Then what looks like mathematical magic happens. The desired probability of lung cancer, given the biopsy, is itself a normal probability curve. And thus normal conjugates with normal to produce a new normal. That lets us take this new normal model as the current estimate of lung cancer and then repeat the process for a new biopsy.
The computer can grind away on hundreds or thousands of such data-based iterations and the result will always be a normal bell curve. But change either of the two input normal curves and in general the result will not be normal. That gives a perverse incentive to stay within the lamplight and keep assuming a thin-tailed normal curve to describe both the data and what we believe. The same thing happens when we try to estimate the probability of heads for a biased coin using the binomial and a beta curve that generalizes the uniform. And a similar conjugacy relation holds between Poisson probabilities and exponentials and their generalizations to gamma probabilities. So that gives three basic conjugacy relations. Most Bayesian applications assume some form of one of these three relations—and almost always for ease of computation or to comply with common practice.
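A minimal sketch of that normal-with-normal update, assuming a known measurement variance and using purely illustrative numbers, shows why the computation is so tempting: every pass through the loop hands back another normal curve, summarized by just a mean and a variance.

```python
# A minimal sketch of the normal-with-normal conjugate update, assuming a
# known measurement variance; the numbers are illustrative, not clinical data.
def normal_update(prior_mean, prior_var, x, meas_var):
    """Posterior mean and variance after one normally distributed observation."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / meas_var)   # precisions add
    post_mean = post_var * (prior_mean / prior_var + x / meas_var)
    return post_mean, post_var

mean, var = 0.0, 4.0                    # prior belief: normal, mean 0, variance 4
for x in [1.2, 0.8, 1.5, 0.9]:          # successive measurements, each with variance 1
    mean, var = normal_update(mean, var, x, meas_var=1.0)
    print(f"posterior: mean = {mean:.3f}, variance = {var:.3f}")  # always another normal
```

Swap a thick-tailed curve in for either input and this tidy closed-form loop disappears, which is exactly the perverse incentive described above.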
That’s no way to run a revolution in probabilistic computing.