17    Software Failure

Software doesn’t fail physically. It doesn’t wear out. Since software doesn’t wear out, its behavior is unchanging (although we noted in chapter 7 that changes in the environment often feel like “bit rot” in software). Unfortunately, wrong behavior by software is also consistent and unchanging. Can we use some kind of redundancy to lessen the impact of a software failure?

We have had some kind of “spare stuff” in each of our previous efforts to survive failure. Some researchers have likewise attempted to avoid common-mode failures in software by using multiple different realizations of the software. These are not identical copies of the same program—identical copies would succeed or fail identically, and so wouldn’t provide any advantage. Instead, the different realizations are deliberately designed to be different ways of achieving the same effect. The multiple versions are spare computational approaches, rather than spare bits or spare disks. The multiple realizations each compute their own independent version of an answer, and then those answers are compared to determine the single right answer. At one level this sounds like a straightforward generalization of error-correcting codes, stable storage, and RAID—but it’s actually a much harder problem and suffers from some difficulties that may not be obvious.
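To make the idea concrete, here is a minimal sketch in Python (the toy specification and all of the function names are invented for illustration): three deliberately different realizations of one simple specification, “given a nonempty list of integers, return the largest one,” each compute their own answer, and those answers are then compared.

# Three deliberately different realizations of one specification:
# "given a nonempty list of integers, return the largest one."

def largest_v1(values):
    # Version 1: a single pass, remembering the biggest value seen so far.
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best

def largest_v2(values):
    # Version 2: sort a copy of the list and take its final element.
    return sorted(values)[-1]

def largest_v3(values):
    # Version 3: divide and conquer, comparing the largest of each half.
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return max(largest_v3(values[:mid]), largest_v3(values[mid:]))

def multiversion_largest(values):
    # Each realization computes its own answer; the answers are compared,
    # and only a unanimous answer is accepted.
    answers = [largest_v1(values), largest_v2(values), largest_v3(values)]
    if answers[0] == answers[1] == answers[2]:
        return answers[0]
    raise RuntimeError("versions disagree: " + repr(answers))

print(multiversion_largest([3, 7, 2, 7, 5]))   # prints 7

In this toy example the three versions agree on every input; the difficulties described below arise when a realistic specification and realistic arithmetic make such agreement much harder to guarantee.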

Specifications Revisited

Recall that a specification tells us what software should do. In contrast, the implementation provides the details of how the software works (we first encountered this distinction in chapter 3). Conceptually, we want a single specification that allows us to build multiple distinct implementations. Because they share the same specification, the multiple versions should all produce the same results; because they have different implementations, they should fail independently.

The first difficulty is that it is hard to develop the specifications for such multiversion software, even beyond the problems we saw previously for specifying any computation (chapter 6). In addition to the usual challenges for a specification, the specification for multiversion software must allow multiple competing realizations to be developed. Multiple distinct implementations are crucial to avoid common-mode failure. Accordingly, the specification cannot be overly prescriptive. If it is too detailed, it may effectively allow only one implementation, so that even versions written independently end up being essentially the same program.

However, it is not enough to write a very “loose” specification so as to ensure that there can be multiple different implementations. The specification must also ensure that all implementations are in fact computing the exact same result. If the specification isn’t sufficiently strict, the different versions may compute different results. The comparison of results then reports a disagreement, not because it has detected an error in any individual implementation, but simply because the specification was ambiguous.
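As an illustration of that ambiguity, consider this sketch in Python (the “loose” specification and both functions are invented for this example): the specification says only “return the position of a largest value in the list,” and both versions below satisfy it, yet they disagree whenever the largest value appears more than once.

def position_of_largest_v1(values):
    # Scans left to right; on a tie, keeps the first largest value.
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i
    return best

def position_of_largest_v2(values):
    # Scans left to right; on a tie, keeps the last largest value.
    best = 0
    for i in range(1, len(values)):
        if values[i] >= values[best]:
            best = i
    return best

data = [5, 9, 9, 2]
print(position_of_largest_v1(data))   # prints 1
print(position_of_largest_v2(data))   # prints 2

# Neither version is wrong, but comparing their results reports a
# "disagreement" that is caused only by the ambiguous specification.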

So, overall, building multiversion software requires specifications that are “specific enough” but not “too specific.” In general, getting this “just right” or “Goldilocks” balance in the specification is hard. In fact, getting the right flexibility in a specification for multiversion software is probably harder than just writing a single correct implementation of a simpler specification. So despite the theoretical attraction of having spare software versions, in practice the approach may not work any better than just producing one good version.

Consistent Comparison

There is an additional subtle problem for the construction of the independent versions: the consistent comparison problem. To understand the consistent comparison problem, we need to understand something about arithmetic as it’s performed in computers.

Computer arithmetic is a digital version of the analog arithmetic mathematicians use. Accordingly, as we know from our previous look at digital vs. analog (chapter 2), computer arithmetic has a “stepped” quality that’s not true for the perfectly smooth, infinitely subdivisible numbers that are familiar from mathematics. That stepped or gritty quality to computer arithmetic means that two different ways of computing the same logical value may actually wind up with values that differ by small amounts, because the step-size differences between a digital version and the smooth analog version accumulate differently in the two computations.

You can see a simple example of this kind of glitch in finite decimal notation by considering what it means to divide 1 by 3 and then multiply that result by 3. Logically, the division produces 1/3, and then the multiplication produces 1 again. But the decimal representation of 1/3 is:

0.3333…

where the “…” captures the idea that the sequence repeats infinitely. Since we’ve said we’re using a finite decimal notation, we can’t represent that infinite sequence. Instead we have to use a finite approximation like:

0.33333

but that doesn’t work quite right. When we multiply that by 3, we get 0.99999, not 1. That result is very close, of course—but if we don’t know how to manage these small gaps and errors, we can get into trouble. Computers don’t usually use decimal representations for calculations, so the specific values that cause problems are different from this example, but the nature of the problem is the same.
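We can reproduce the finite-decimal example directly in Python by using its decimal module and limiting the precision to five significant digits (the choice of five digits simply mirrors the example above); the last two lines show that ordinary binary floating point has the same character, just with different misbehaving values.

from decimal import Decimal, getcontext

# Limit decimal arithmetic to five significant digits, as in the example.
getcontext().prec = 5

third = Decimal(1) / Decimal(3)    # 0.33333 (the infinite tail is cut off)
result = third * 3                 # 0.99999, not 1

print(third, result)               # prints: 0.33333 0.99999
print(result == 1)                 # prints: False

# Binary floating point misbehaves on different values, but in the same way.
print(0.1 + 0.2)                   # prints: 0.30000000000000004
print(0.1 + 0.2 == 0.3)            # prints: False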

Suppose that a program is testing whether a particular value is larger than zero, and its output is simply “yes” or “no.” It’s easy for one version of the calculation to wind up with a very tiny value that is positive while the other version of the calculation winds up with a very tiny value that is negative. Our different approaches have produced results that are quite close, and perhaps the differences are not significant—but the nature of the comparison to zero means that the similarities are lost. Of course, we could fix this particular case now that we have identified it: this particular test with these particular computations can produce a misleading result. But although it is possible to fix specific cases of this problem once they have been identified, in general it is not possible to detect or avoid them all.
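Here is a minimal sketch of that comparison-to-zero hazard in Python, using ordinary binary floating point: both expressions below compute a quantity that is exactly zero in ideal arithmetic (0.1 + 0.2 - 0.3), but the two orderings of the operations land on opposite sides of zero.

version_a = (0.1 + 0.2) - 0.3      # roughly +5.6e-17: tiny, but positive
version_b = (0.3 - 0.2) - 0.1      # roughly -2.8e-17: tiny, but negative

# The two results are extremely close to each other...
print(abs(version_a - version_b) < 1e-15)   # prints: True

# ...but the test against zero throws that closeness away.
print(version_a > 0)               # prints: True
print(version_b > 0)               # prints: False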

In fact, there is a lurking contradiction here that threatens to foil our best efforts. First, we recall that the only reason these comparisons even arise is that we are trying to build multiversion software. Then we also recall that to build multiversion software correctly, we must keep the specific implementation details of the algorithms out of the specification.

Consequently, we have a paradox. On one hand, to implement multiversion software successfully we must avoid actions that could introduce common elements in different versions, since those common elements could fail identically, undercutting the value of multiversion implementation. On the other hand, to dodge the consistent comparison problem we must understand and identify places in the separate implementations that produce meaningless differences in what are logically identical results.

Comparing Results

In our earlier exploration of limits (chapters 6 and 7), we repeatedly identified a limiting factor and then set it aside to ask what else was a limit. In a similar fashion, we will now choose to ignore the expense and difficulty of specifications for multiversion software, assuming instead that we can somehow get past that problem. We’ll also assume that we aren’t troubled by issues related to consistent comparison. Can we then build multiversion software with the desired reliability? No, because there is still a potential common-mode failure when the different results are compared. In multiversion software, there comes a point at which the collection of programs somehow combines the separate results into a single decision or course of action (figure 17.1).

Figure 17.1

Combining diverse computations to produce a single result.

If we simply write a single program that combines the different values, that single program becomes a critical risk for design failures. We can’t mitigate that risk by writing multiple different versions of the combining mechanism, because those different versions would still need some single combining mechanism of their own. The most common solution to this problem is to accept that the single combining mechanism has a high risk of being a source of design failure, and to design it accordingly: it has to be as simple as possible, so as to reduce the likelihood of errors.
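A sketch of such a deliberately simple combining mechanism, in Python (this is an illustration of the idea, not any particular system’s actual voter): a few lines of majority voting over whatever results the separate versions produced.

def majority_vote(results):
    # Return a result that more than half of the versions agree on.
    for candidate in results:
        if results.count(candidate) > len(results) / 2:
            return candidate
    raise RuntimeError("no majority among results: " + repr(results))

print(majority_vote(["yes", "yes", "no"]))   # prints: yes

The point of keeping it this small is that there is very little left to get wrong; every additional feature in the combining mechanism is another opportunity for this single point of failure to fail.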

It is also possible to build multiversion systems without using a single piece of software to combine results. Human spaceflight is an area in which it is critical to have reliable systems despite component failures. The designers of the U.S. space shuttle included multiple redundant computers for navigation of the shuttle, and they had to find a way of reliably combining the outputs of those multiple computers. They didn’t even try to combine the results in software. Instead, each of the computers was physically connected to a shared control lever (think of it as a kind of joystick). Each computer used its available force to move the lever in the “correct” direction. The resolution of any disagreement among the navigation computers was effectively arm-wrestling: by design, a majority of the computers could overpower a minority.

Let’s consider whether we have identified a successful way forward for using redundant computations. In figure 17.2, the “?” at the bottom represents the uncertainty about whether diverse implementations are an improvement overall.

Figure 17.2

Are diverse computations valuable?

The underlying problem here is that we need to replicate goodness but not replicate badness. But we already know from an earlier chapter that one of our limitations is that we cannot reliably distinguish good software from bad software (chapter 7). After all, if we were able to reliably distinguish good software from bad software, we would only build or select good software. Then we would have no need for any of this multiversion complexity.

Byzantine Failure

Instead of supplying “spare computations,” we can change the failure model itself. Rather than assuming a fail-stop model, it’s possible to design systems around a Byzantine model of failure. In a Byzantine model, failed elements might not stop cleanly, but instead may cause further problems. In particular, a Byzantine model means that failed elements are free to behave in precisely the most damaging and confounding ways, no matter how unlikely those sequences of actions might seem.

The Byzantine failure model gets its name from the “Byzantine Generals” problem that was first posed in 1982. There, the problem was one of treachery among military leaders: how could a cohort of loyal generals reach agreement on whether to attack or retreat, despite the efforts of some number of traitorous generals to undercut them?

Intuitively, it might seem that it is impossible to survive Byzantine failures—how can a group continue to function with members that are deliberately choosing to behave in the worst possible way? And yet a contrary intuition suggests that a single traitor in a large group might not be enough to undercut the group’s ability to reach agreement. Both intuitions are correct but incomplete. The crucial issue is how many generals are loyal vs. how many are traitors. In a systems context, we would view that in terms of how many elements are good vs. how many have failed. Some of the time we are pretty sure that the system can survive the failed component; some of the time we are pretty sure the system can’t survive with most of its components working against it. Where is the crossover?

Computer scientists have determined the answer to this question. Although the details depend on the specific arrangement of elements and the problem to be solved, a reasonable high-level summary is that a system can survive these malicious failures as long as fewer than one-third of the elements misbehave, while the remaining elements (more than two-thirds of the total) behave properly and cooperate despite the efforts of the misbehaving elements. This is a useful result to keep in mind when considering other domains of decision making and governance. It is worth noticing how far away this threshold is from the common idea of majority rule: if we have only a bare majority of well-behaving elements, we are in trouble.
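As a rough illustration of that threshold, the classic formulation of the problem requires the total number of elements to be more than three times the number of misbehaving ones; the small Python sketch below simply computes how many such failures a group of a given size can tolerate.

def max_byzantine_faults(n):
    # Largest f satisfying n > 3 * f, i.e. f = floor((n - 1) / 3).
    return (n - 1) // 3

for n in (4, 7, 10, 100):
    print(n, "elements: at most", max_byzantine_faults(n), "may misbehave")

# prints:
# 4 elements: at most 1 may misbehave
# 7 elements: at most 2 may misbehave
# 10 elements: at most 3 may misbehave
# 100 elements: at most 33 may misbehave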

It is not yet common to build systems to tolerate Byzantine failures, but they are practical. We will look at Bitcoin in chapters 25 and 26, and Bitcoin is effectively built around a kind of Byzantine agreement. The system allows for the possibility that some participants will behave badly and try to cheat others; nevertheless, the system still functions well as long as there are enough honest participants.