The term “software crisis” is not heard as much today, but there was a lot of talk about it in the 1960s. As Matti Tedre relates in his well-researched 2015 history The Science of Computing: Shaping a Discipline,
In the course of the 1960s, computing’s development curve was about to break. The complexity of computer systems had all but met the limits of the popular software development methods of the time. The crisis rhetoric entered computing parlance over the first half of the 1960s. … By the end of the 1960s project managers, programmers, and many academics alike had grown so weary of the blame and shame that immediate improvements were deemed necessary.
(Tedre also comments, “The crisis talk that was rooted in the 1960s and popularized in the early 1970s has remained with computing ever since—whether or not a decades-long quagmire of problems should be called ‘crisis’ anymore.”)1
In 1968, the North Atlantic Treaty Organization (NATO) sponsored a conference in Garmisch, Germany, bringing together academia and industry to discuss these problems. This conference was a seminal moment in the history of software engineering (for one thing, it popularized the term), although its net effect is unclear. The conference report laid out all the problems in software that continue to plague us today—reliability, management, scheduling, testing, and so on.2 Then again, many writers were describing the same problems at the same time; the issues weren’t a secret to anybody who had written nontrivial software. Some have described the conference as a critical point in the schism between industry and academia, especially after a follow-up conference in Rome the next year devolved into a more explicit division between theorists and practitioners.3 Then there is a theory that the whole thing was part of a fiendish plot by Dijkstra to get people to pay more attention to his notions of structured programming.4
Whatever the NATO conferences did or did not accomplish, the root cause of all this concern was that programmers couldn’t figure out how to write software without bugs. The problems in software engineering to this day either relate directly to bugs (reliability and testing) or to dealing with the unpredictability of bugs (management and scheduling).
Just what are software bugs? Although it might feel like a bug when software doesn’t perform the way the user expects, from a programmer’s perspective a bug is when the software doesn’t perform the way the programmer expected. To them, bugs are surprises. There are books devoted to the issue of what programmers expect versus what normal people do, and who should be designing the software and setting the expectations, but that is a separate topic.5 Certainly, many changes to software are made because the user wanted the software to operate differently from the way it was designed to work, but I consider those to be “enhancements” rather than “bugs.” Donald Knuth, while categorizing the changes he made to a large software package, provided the cleanest differentiation: “I felt guilty when fixing the bugs, but I felt virtuous when making the enhancements.”6
Actually, the term bug is used too broadly. People who study software errors talk about three levels: defect, fault, and failure.7 Those terms are not used consistently (even by me, I’m sure); typically for software engineering, there has not been enough study of the matter to bring about any standardization. Some writers use infection instead of fault, and the word bug is sprinkled around to mean any or all of them. For our purposes, I’ll define them as follows. A defect is an actual flaw in the code: the mistake that a programmer makes. The running program logic, viewed at a certain level, consists of manipulating data in the memory of a computer (what Dijkstra, in his complaint about GOTOs, called the process); at some point a piece of memory will have the wrong value in it, due to the defect in the code—this is the fault. Finally, this fault will cause an error that is noticeable to the user—the failure.
Not every code defect will cause a fault; it must be executed under the necessary conditions. If you recall the Year 2000, or Y2K, bug, one of the fears was that software that stored the year in two digits as opposed to four would make wrong decisions about durations of time when the year rolled over from 1999 to 2000. For example, the code controlling a nuclear reactor might contain this (where the two vertical bars, ||, mean “or”):
years_since_service = this_year - year_of_last_service;
if (years_since_service == 0 || years_since_service == 1) {
    // we are OK, was serviced this year or last
} else {
    // haven’t been serviced in 2+ years, help!
    initiate_emergency_shutdown();
}
If the variables holding the year values (this_year and year_of_last_service) are storing only the last two digits, then this code has a defect, but it only manifests itself as a fault in the year 2000, when this_year is 00 and year_of_last_service is 99.8 The code will calculate that it has been –99 years since the last service, which is not equal to the values of 0 or 1 that the IF statement is looking for, and decide to initiate an emergency shutdown. That value of –99 in years_since_service is the fault, and the call to initiate_emergency_shutdown() would (presumably) cause a visible failure, but the defect could lurk for years before the fault and then the failure actually arose.
Conversely, there may be no defect; storing dates as two digits doesn’t require that the code behave this way. It could be written such that when it sees an unusual value like –99 for years_since_service, it assumes that we have crossed a century boundary and handles the situation correctly. This is why there was such debate, ahead of time, about what failures would result from the Y2K bug—unlike in civil engineering, say, where you can reasonably assess the condition of a bridge and how much it will cost to repair it, it is very difficult to determine how failure-prone a piece of software is. In the end the year 2000 brought some minor problems, with automated ticket machines not working or websites displaying an incorrect date, but no major disruptions.9 Given that a lot of software had been patched or replaced in the years leading up to 2000, it will likely never be known how severe the problem would have been had it been ignored.
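To return to the point above about tolerant code, here is a sketch of one way such Y2K-safe handling might have looked. It reuses the variable names from the reactor example; it is purely an illustration of one possible defensive check, not a claim about how any real system handled the problem:
years_since_service = this_year - year_of_last_service;
if (years_since_service < 0) {
    // two-digit years wrapped around a century boundary: 00 - 99 gives -99,
    // so add 100 to recover the real gap of 1 year
    years_since_service += 100;
}
if (years_since_service == 0 || years_since_service == 1) {
    // we are OK, was serviced this year or last
} else {
    // haven’t been serviced in 2+ years, help!
    initiate_emergency_shutdown();
}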
The code above is a dramatization of a bug; most are more subtle, and don’t involve nuclear reactors and APIs named initiate_emergency_shutdown(). And to be fair to the authors of Y2K-suspect code, many assumed that their software would be replaced long before the year 2000 rolled around, so they didn’t bother writing it in a Y2K-safe way. The key point to appreciate is that when reading the code above, it is not obvious that there is a defect that will cause an API to be called when it is not supposed to be (and in many cases, the source code for old software isn’t available to be read).
Not every defect will cause a fault, but every fault is caused by a defect (leaving out hardware problems, which I will do). Faults don’t just happen with nobody being to blame; they happen due to defects in the code, and if the code had no defects, they wouldn’t happen. Furthermore, they happen deterministically if the right repro steps are followed. Every time you execute the defective code under the right conditions—in the example above, in the year 2000, when the reactor was last serviced in 1999—the fault and failure will happen.
So you have your fault—the bad data in memory. Yet just as not every defect causes a fault, not every fault will cause a failure. To turn a fault into a failure, the bad value in memory is read by code further along, which may lead to further bad data, which is read by other code, and so on, until the effect of the bad memory becomes visible to the user—a nuclear reactor shuts down, perhaps, but more commonly an icon on the screen is in the wrong place, a character is not visible, the spreadsheet shows the wrong value, or if things are wrong in exactly the right way, the program will crash, which we’ll get to in a moment. It’s also possible, though, that this whole chain of events may not link up; the incorrect value in memory may not be used by any future code, so it may not matter that it was incorrect.
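A minimal sketch of that last possibility, reusing the two-digit-year example: the bad value lands in memory, but no later code ever reads it, so nothing visible goes wrong (the program below is illustrative only):
#include <stdio.h>

int main(void) {
    int this_year = 0;              // the year 2000 stored as two digits
    int year_of_last_service = 99;  // the year 1999 stored as two digits

    // The defect produces the fault: -99 is now sitting in memory.
    int years_since_service = this_year - year_of_last_service;

    // But nothing below ever reads years_since_service, so the fault
    // never propagates and the user never sees a failure.
    printf("status report for year %02d\n", this_year);
    return 0;
}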
Consider an underage person who goes into a bar and shows identification to prove that they are twenty-one years old, the legal drinking age in the United States. Imagine that the bouncer only considers the year in which somebody was born and therefore will think a person is twenty-one even if they will turn twenty-one later this year, but have not yet had their birthday. That flaw in the bouncer’s thinking is the defect (assume the bouncer always makes this mistake in a consistent way, as a computer program would). Now imagine further that the bouncer looks at the twenty-year-old’s identification and decides that the person is of legal age. That is now the fault; in software terms, you could think of this as the bouncer having the “are they of legal age?” variable set to “true” when it should have been “false.” Note that the defect didn’t have to cause this fault—the person might have already had their birthday this year, or maybe they were eighteen or twenty-three, so the defective calculation didn’t matter—but in the exact set of circumstances, the bouncer will make the mistake.
So now we have the fault, whereby the bouncer thinks that the person is of legal drinking age. This will not automatically cause the failure of letting the person into the bar. Perhaps the bouncer will decide that the person is not dressed appropriately and deny them entrance for that reason. Perhaps there will be a second identification check by somebody else who does the math correctly. Maybe the person, as they are about to enter the bar, will receive a text message inviting them to go somewhere else. Still, if none of that happens, then we will have an underage person in the bar, which is the visible failure, which could be traced back to the fault, which could then be traced back to the defect. And if you fixed the defect, by either training the bouncer to calculate ages properly or hiring a new bouncer, then that specific way of generating the fault would be removed and thus would never cause a failure.
This fix would not prevent underage people from ever being admitted to a bar; it would just mean that this particular sequence of defect to fault to failure has been prevented. You may have other faults that lead to the exact same failure, which can make it hard to figure out what is going on. You might fire that bouncer, offer math classes to the rest, and yet next week the police come by and find another twenty-year-old ordering drinks.
The repro steps of software bugs are typically reported in terms of the failure (the problem visible to the user), and debugging them consists of two parts: finding the fault (the bad value in the computer’s memory), and using that to track down the defect (the code error). Finding the fault can be tricky, because what you need to find is the first fault, which may quickly metastasize into a set of faults, as variables are assigned values calculated from the faulty values of other variables. Debugging often consists of running the program for a bit, examining the contents of memory, determining them to be fine, running it a bit more, realizing that the contents of memory have gotten messed up, and then repeating this in smaller increments, trying to narrow down the point of first fault (this is why having reliable repro steps on a bug is so important). Once the first fault is found, the code defect is generally obvious, although this assumes you are familiar with the code—hence the added difficulty of debugging somebody else’s code. The proper way to fix the defect, of course, is subject to the usual debate.
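One common aid in that narrowing-down process is to make the program check its own data as it runs, so that it stops loudly at the point of first fault instead of misbehaving much later. Here is a minimal sketch using the standard C assert macro and the reactor example; the function and its name are my own invention for the illustration, not a technique the chapter prescribes:
#include <assert.h>

int compute_years_since_service(int this_year, int year_of_last_service) {
    int years_since_service = this_year - year_of_last_service;

    // If a defect ever produces a nonsense value, the program stops
    // right here, pointing at the first fault rather than at some
    // distant failure.
    assert(years_since_service >= 0);

    return years_since_service;
}
In a debug build, a failed assertion reports the file and line number, which is exactly the “point of first fault” information the narrowing-down process is trying to find.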
Broadly speaking, there are three types of failures: crashes, hangs, and just plain misbehaviors. Crashes are the most dramatic kind of failure, when the program stops running unexpectedly; they usually happen because the program tries to access memory that has not been allocated to it. As we’ve discussed, in C a pointer is just a number, which is not guaranteed to be the address of a valid memory location. For one thing, pointers are frequently initialized to 0, known as the null pointer, which will cause a crash if it’s used to access memory (although the first byte of memory in a computer is nominally at address 0, that location is marked as off-limits to programs). Since C doesn’t do array bounds checks, any slightly out-of-bounds array index may result in a bad pointer crash (which, arguably, is better than silently reading bad data, which might also happen). In languages that do have runtime bounds checking, the program will be intentionally crashed if an out-of-bounds array index is detected, which is a slightly cleaner experience conceptually, but isn’t much better from the user’s perspective.10
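To make those two common crash causes concrete, here is a minimal C sketch; running it is deliberately a bad idea, and the exact behavior depends on the compiler and operating system:
#include <stdio.h>

int main(void) {
    int small[4] = {0, 1, 2, 3};
    int *p = 0;   // the null pointer: address 0 is off-limits to programs

    // C does no bounds check, so this reads whatever memory happens to
    // sit past the end of the array, or crashes, depending on luck.
    printf("%d\n", small[10]);

    // Dereferencing the null pointer: almost certainly a crash.
    printf("%d\n", *p);
    return 0;
}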
If a program does not crash but instead hangs, it is likely stuck in a loop. I have shown examples of simple loops:
FOR I = 1 TO 10
PRINT I
NEXT
so you might wonder how such code can get stuck, but many loops are much more complicated.
One form of loop is called the WHILE loop, which terminates when a logical expression is false. A widely reported bug in a WHILE loop caused Microsoft’s Zune music player to hang when it tried to boot on December 31, 2008. The clock on the device stored the date as the number of days since January 1, 1980, which took up less space than storing a full date and made calculating date ranges easier; in order to convert that date to a year for displaying to the user, it had this code, which starts with the year 1980 and lops off one year’s worth of days from the date, until it gets to the current year:11
year = 1980;
while (days > 365) {
    if (IsLeapYear(year)) {
        if (days > 366) {
            days -= 366;
            year += 1;
        }
    }
    else {
        days -= 365;
        year += 1;
    }
}
If you want to read this code, it’s best to first pretend that leap years don’t exist, so IsLeapYear(year) is always false; then the code is starting with a days value and going through the ELSE block of the if (IsLeapYear(year)) statement, doing this over and over:
while (days > 365) {
    days -= 365;
    year += 1;
}
which is a reasonable (if slightly inefficient in this case) way to figure it out. If the number of days still to be accounted for, as stored in the variable named days, is more than 365, then add a year and subtract off those 365 days, then loop back and check again.
From there, you can see that the “true” branch of the IF covers the special case of leap years, subtracting off 366 (rather than 365) from days to account for them. Again, there is nothing logically wrong with this.
The code works correctly on any day that is not December 31 of a leap year. Unfortunately, on December 31 of a leap year it loops forever: IsLeapYear(year) is true when year reaches the current year, and once the code is done chopping days down to account for every year between 1980 and last year, then days will be 366 (December 31 being the 366th day of a leap year), but the code only checks if days is greater than 366; if days is exactly 366, then the code won’t change it at all, and the WHILE loop will iterate again and again—it’s the “Lather. Rinse. Repeat.” instructions from the shampoo bottle, writ in code (meanwhile, if days is less than 366, on any earlier day in a leap year, the while (days > 365) loop will have exited already). In addition to checking for days being greater than 366, as it does, the code needs an additional check for days being equal to 366 so it can break out of the loop in that case (there are also myriad other ways to rework the code to avoid the bug).
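To sketch the fix just described, with the missing check for days being exactly 366 added (this is an illustration of the idea, not necessarily how the code was actually patched):
year = 1980;
while (days > 365) {
    if (IsLeapYear(year)) {
        if (days > 366) {
            days -= 366;
            year += 1;
        }
        else if (days == 366) {
            break;   // December 31 of a leap year: we have our answer
        }
    }
    else {
        days -= 365;
        year += 1;
    }
}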
The defect is this missing code; the fault is days remaining at 366 forever; and the failure is the visible hang—the Zune would not boot on that day. As with the Y2K bug, this defect lurked for a while before causing a fault, from the time the Zune was released in November 2006 until the first time the calendar had a 366th day in a year, on December 31, 2008 (when days entered the above loop with a value of exactly 10,593).
Crashes and hangs are rarer than the “just plain misbehaving” category. Whether code bugs cause the application to crash, hang, or misbehave is usually a matter of luck, unrelated to the difficulty of the code being written. In the Zune example, a slight tweak would produce code that didn’t hang but instead reported December 31 of a leap year as “January 0,” and if the code then tried to look up day number 0 in an array, it could easily crash on a bad array access. All these would appear as roughly the same code; there is nothing obvious in a defect that indicates what sort of failure the fault will cause.
Complicating the situation is that the problem may not even be in code that you can look at; as in other areas of programming, you are at the mercy of the API you are calling. You can call an API that normally works, and then suddenly it crashes, or hangs, or does the wrong thing (the Zune hang bug was not in Microsoft’s code but instead in the implementation of an API that it got from another company, which Microsoft’s code called during the Zune start-up sequence). Doing the wrong thing often means that the API returns an error rather than succeeding, at which point the code calling the API has to decide what to do. Does it shrug its shoulders and show the user the unexpected error? Does it call the API again in hopes that it succeeds? Does it do something clever to preserve the user’s work? All this requires that more code be written, so it’s up to the programmer’s judgment about how likely an API is to fail, and how much work it is worth doing to deal with that case in a clean way.
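As a sketch of the kind of extra code this requires, here is one way a program might wrap a failing save operation with a few retries before giving up. The names save_file and save_with_retry are invented for the illustration, with save_file standing in for whatever API the program actually calls:
#include <stdio.h>

// Hypothetical stand-in for a real file-saving API:
// returns 0 on success and a nonzero error code on failure.
static int save_file(const char *path, const char *data) {
    FILE *f = fopen(path, "w");
    if (f == NULL) return 1;
    if (fputs(data, f) == EOF) { fclose(f); return 2; }
    if (fclose(f) != 0) return 3;
    return 0;
}

// The calling code decides what to do when the API fails; this version
// simply retries a few times in case the problem was transient.
int save_with_retry(const char *path, const char *data) {
    int err = 1;
    for (int attempt = 0; attempt < 3 && err != 0; attempt++) {
        err = save_file(path, data);
    }
    return err;   // 0 on success, otherwise the last error seen
}

int main(void) {
    if (save_with_retry("example.txt", "hello\n") != 0) {
        printf("could not save the file\n");   // show the user the error
    }
    return 0;
}
Even this simple wrapper embodies judgment calls: how many times to retry, whether to wait between attempts, and what to tell the user when every attempt fails.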
One notorious example of software being at the mercy of an API it was calling was in the original version of DOS that shipped with the IBM PC in 1981. If a program called an API that DOS provided to save a file to a disk—which back then was a removable 5 ¼-inch floppy disk—and it failed, DOS itself would prompt with a message asking the user to “Abort, Retry, Ignore.” Abort crashed the program immediately, and Ignore pretended that the operation worked even though it hadn’t—neither of which was a particularly useful choice (as the DOS manual noted in regard to Ignore, “This response is not recommended because data is lost when you use it”).12 Retry would try again, which could handle the most trivial case (when the user forgot to insert a floppy disk in the drive), but then you would be stuck in the Retry loop until you inserted a floppy disk. From the perspective of the program calling the API, this was all done under the covers; the API call would not return until the operation had been successfully retried or the error ignored. Programs could avoid getting stuck in this DOS error prompt, but it required writing extra code, which some programmers neglected to do.13 At one point in the early days of the IBM PC, whether a program such as a word processor could handle a missing floppy without crashing and losing data was a point of evaluation in magazine reviews. A fourth option, “Fail,” was eventually added to the choices in an update of DOS, allowing the API to return to the program with an error and giving the program the opportunity to decide how to proceed.
Conceptually the same thing can happen in a car crash: the car is calling the tire “API” and asking it to provide adhesion to the road, and if the tire fails to do so, the car may crash. But the tire has been tested and certified to operate a certain way. Think of all the other engineered objects with which you interact in your daily life, such as every railing you lean on, every electric device you plug in, or every medication you take. You expect these to work every single time, over and over, without randomly “crashing,” and because you expect this, the people who engineer these products put a lot of research and design effort into making sure that they do work. The tire that blows out and causes your car to crash was made by people who expected it never to fail catastrophically if used as intended, and who were subject to regulations designed to prevent such failures. And most important, if it fails, they understand that they have failed in some small way and try to prevent it in the future—despite the fact that tires suffer physical wear and tear that makes failures more likely, unlike software.
When you call an API to figure out what year it is, there is no way to know if it is going to suddenly fail, hang, or crash just because we happen to be on December 31 of a leap year. As with the term worm from the Morris worm that I discussed in the last chapter, talk of viruses and crashes makes them sound like bad luck that can happen to anybody, something attributable to the environment rather than to mistakes in the code, but that’s misleading. When people say, “It’s inevitable that a large program will have bugs,” they don’t mean inevitable in the sense, “It’s inevitable that cars will have accidents.” What they mean is, “We don’t have the proper software engineering techniques to root out all defects so we’re not even going to attempt to remove them all—and we’re not going to improve the techniques either.”
Yet a lot can be achieved by aiming for perfection, even if you don’t reach it. The car company Volvo has set the following goal: “By 2020 no one should be killed or seriously injured in a new Volvo car.”14 Will it achieve this? Probably not. But it sure does help focus Volvo on safety. This institutionalized acceptance of shoddiness is one of the most shameful aspects of software engineering. Software bugs are not inevitable, but trying to write software that never crashes is a nongoal, as they say, for the current crop of programmers. And the implicit comparison to automobile crashes silently enables this attitude.
The early days of software did feature a significant emphasis on how to produce completely bug-free software. In the years following the 1968 and 1969 NATO software engineering conferences, two books appeared on the topic of writing better software; not surprisingly, both were titled Structured Programming. The first came out in 1972, and was written by Ole-Johan Dahl, Edsger Dijkstra, and C. A. R. Hoare. The second came out in 1979, and was written by Richard Linger, Harlan Mills, and Bernard Witt.15 The first book was written by leading academic computer scientists (Dijkstra attended both NATO conferences, and Hoare was at the second), and IBM veterans (none of whom were at either conference, although IBM was well represented at both) wrote the second.16
The first book is three long essays: “Notes on Structured Programming” by Dijkstra, “Notes on Data Structuring” by Hoare, and “Hierarchical Program Structures” by Dahl and Hoare. They lay out the basics of structured programming, as we have seen earlier, as well as talking about a few common data structures and how to break a large problem into smaller ones. This is not to criticize the work; at the time, the “structured programming” debate, which could in hindsight be summarized as the “don’t use GOTOs” debate, was still being fought, so this is a worthwhile accomplishment. As Knuth enthused, “A revolution is taking place in the way we write programs and teach programming. … It is impossible to read the recent book Structured Programming without having it change your life.”17
The problem with the book is summarized in a quote from the book itself, in a section of Dijkstra’s essay titled “On Our Inability to Do Much”:
What I am really concerned about is the composition of large programs, the text of which may be, say, of the same size as the whole text of this chapter. Also I have to include examples to illustrate the various techniques. For practical reasons, the demonstration programs must be small, many times smaller than the “life-size programs” I have in mind. My basic problem is that precisely this difference in scale is one of the major sources of our difficulties in programming!
It would be very nice if I could illustrate the various techniques with small demonstration programs and could conclude with “… and when faced with a program a thousand times as large, you compose it in the same way.” This common educational device, however, would be self-defeating as one of my central themes will be that any two things that differ in some respect by a factor of already a hundred or more, are utterly incomparable.18
Dijkstra’s solution to this problem is to have his essay contain little code and focus more on theoretical insights, and the essays on data structuring and program structure also confine themselves to theory and small fragments of code.
The IBM crew, meanwhile, takes a noble stand with its goals, right from the first paragraph in the book:
There is an old myth about programming today, and there is a new reality. The old myth is that programming must be an error prone, cut-and-try process of frustration and anxiety. The new reality is that you can learn to consistently design and write programs that are correct from the beginning and that prove to be error free in their testing and subsequent use. …
Your programs should ordinarily execute properly the first time you try them, and from then on. If you are a professional programmer, errors in program logic should be extremely rare, because you can prevent them from entering your programs by positive action on your part. Programs do not acquire bugs as people do germs—just by being around other buggy programs. They acquire bugs only from their authors.19
What about Dijkstra’s point about scaling, keeping in mind that this book was written by people who had been involved in writing some of the largest software of the day and so leaving them acutely aware of the problem? The IBM trio’s answer is,
It will be difficult (but not impossible) to achieve no first error in a thousand-line program. But, with theory and discipline, it will not be difficult to achieve no first error in a fifty-line program nine times in ten. The methods of structured programming will permit you to write that thousand-line program in twenty steps of fifty lines each, not as separate subprograms, but as a continuously expanding and executing partial program. If eighteen of those twenty steps have no first error, and the other two are readily corrected, you can have very high confidence in the resulting thousand-line program.20
Worthy goals indeed, but a thousand-line program is pretty short and still within the capacity of one person to produce. The rest of the book is about how to prove your software is correct using a mathematical approach. Still, this is the distilled wisdom of people from IBM. What did I think of this advice, when I went off to college five years later?
I thought nothing of it, naturally, since I wasn’t exposed to any of it. Whether you believe in mathematical proofs or not, I wasn’t taught how to know whether the software I wrote worked. But looking back, it’s clear that the time for mathematical proofs was passing. For simple code like a sort routine, you can prove that your algorithm is correct: the array begins in this state, the loop iterates this many times, after each iteration one more element is sorted, and therefore the array is sorted at the end. This is a standard mathematical technique known as induction. The problem is that modern software is vastly more complicated than that. A failure today along the lines of “my word processor is showing this character in the wrong place” involves extremely complicated code that has to factor in the margins of the document, the font being used, whether the character is super- or subscripted, how line justification is being done, the size of the application window, and a host of other factors; the code winds up being a tangle of IF statements and nonobvious calculations.
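For the simple sort-routine case mentioned above, the induction argument can be written right alongside the code. Here is a minimal sketch; insertion sort is my own choice of example, not one taken from the books discussed:
// Sort a[0..n-1] into increasing order (insertion sort).
// Induction argument:
//   Base case: before the first iteration, a[0..0] is trivially sorted.
//   Inductive step: if a[0..i-1] is sorted, inserting a[i] into its
//   proper place keeps a[0..i] sorted.
//   Conclusion: when the loop ends (i == n), the whole array is sorted.
void insertion_sort(int a[], int n) {
    for (int i = 1; i < n; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];   // shift larger elements one slot right
            j--;
        }
        a[j + 1] = key;        // a[0..i] is now sorted
    }
}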
Think back to the Zune leap year bug: the defect was not in the algorithm but rather in the implementation. Many annoying bugs, when finally excavated, turn out to be nothing more than a typing mistake by the programmer. The IBMers’ Structured Programming book was a curious atavism—a book advocating techniques that were only effective for short programs, produced by people who had worked on large programs.
If there was a guiding spirit of software quality that attended my college years, it came not from the academics or IBM veterans but instead from a third fount of programming wisdom: Bell Labs, the source of UNIX and C. As it happens, Princeton is located near Bell Labs, and professors on leave from there taught several of my courses. I don’t recall them inveighing against formal correctness proofs, but I assume they had a subtle effect on me. Even for students not at Princeton, however, the “UNIX folks” exerted influence due to the books they wrote, not just on UNIX and C, but on other programming topics as well. I’ve already mentioned Programming Pearls by Bentley (who gave a guest lecture to us at Princeton). Kernighan, one of the authors of the original C book, wrote a book called The Elements of Programming Style with P. J. Plauger, another Bell Labs employee. The book is consciously modeled on William Strunk Jr. and E. B. White’s The Elements of Style, and consists of a series of examples supporting a set of maxims, the first two of which are “write clearly—don’t be too clever” and “say what you mean, simply and directly.”21
The book is meant to support structured programming. As the preface to the second edition (1978) says about the first edition (1974), “The first edition avoided any direct mention of the term ‘structured programming,’ to steer well clear of the religious debates then prevalent. Now that the fervor has subsided, we feel comfortable in discussing structured coding techniques that actually work well in practice.” Notwithstanding that, Kernighan and Plauger eschewed both the theory of the first Structured Programming book and the formal proofs of the second to focus on code itself: “The way to learn to program well is by seeing, over and over, how real programs can be improved by the application of a few principles of good practice and a little common sense.”22 It’s incremental: keep improving your programs and they will wind up being good; here are seventy-seven pieces of advice on how to do that. Or consider this comment about GOTO in the original C book by Kernighan and Ritchie: “Although we are not dogmatic about the matter, it does seem that goto statements should be used sparingly, if at all.”23 This is a far cry from Dijkstra’s strident denunciation.
The book Software Tools, also by Kernighan and Plauger, has a similarly nuanced view of GOTOs. Discussing other control structures available in the language they are using (Ratfor, short for Rational Fortran, which as the name implies is a more “sensible” variant of Fortran), they state, “These structures are entirely adequate and comfortable for programming without goto’s. Although we hold no religious convictions about the matter, you may have noticed that there are no goto’s in any of our Ratfor programs. We have not felt constrained by this discipline—with a decent language and some care in coding, goto’s are rarely needed.”24
This message lands quite easily on the ears of a self-taught programmer. It reinforces the idea that there is not a lot of formal knowledge associated with software engineering, and your own personal experience is a valuable guide in how to proceed. Sure, you could improve a bit, but who can’t? And to the extent that I picked up any sense of software engineering in college, it was this vibe. Not only did I adopt C from the UNIX crowd, but I also acquired the incrementalist view of software quality.
And on the question of formally testing your software, I acquired no knowledge at all, even from the Bell Labs visitors; it wasn’t part of the curriculum at Princeton. The algorithm was what was important, and proving that you had accurately translated it into code was less so. Understand that the stakes are lower in college. When I wrote a compiler for a class, it was to learn how to write a compiler, not to use it to compile a lot of programs. If it failed on the programs I tested it with, then I fixed those issues; it was never subject to anything more stressful and at the end of the term was set aside. As Weinberg writes, “Software projects done at universities generally don’t have to be maintainable, usable, or testable by another person.”25 The notion of intentionally trying to make a program break by feeding it unusual inputs or navigating the user interface in an unusual way never occurred to me. If I felt the algorithms were correct, the code looked reasonable, and I hadn’t observed any crashes, then it was perfect as far as I was concerned. The programming contest that I talked about in chapter 4 presented a slight concern, since the actual data it would be run on was kept secret. I recall running it using several randomized data sets that I generated; nonetheless, when my program ran successfully (albeit not as expeditiously as some others) during the actual contest, my reaction was more “thank goodness!” than “well, of course, I tested it.”
My high school friends who studied engineering at Canadian universities participated in an event called the Ritual of the Calling of an Engineer when they graduated. At this ceremony (which was designed by Rudyard Kipling, of all people, back in the 1920s), they were presented with a rough-hewn iron ring, which is worn on the little finger of their working hand and “symbolizes the pride which engineers have in their profession, while simultaneously reminding them of their humility.”26 Now it’s not that I have observed any particularly greater humility or aversion to C string handling in graduates of Canadian versus US universities. Still, I left college with no sense of the impact that buggy software could have on the world. Partly this was because back in 1988, this impact was much more constrained, with computers rarely even being connected to a network. Even the Morris worm, which was released later that year, did nothing to change this attitude. My experience with the effect of bugs had been limited to my IBM PC BASIC Pac-Man game not working correctly or having to work late in von Neumann to get my 3-D billiards animation to display properly. Certainly as a user I had been annoyed when I lost work due to a crash in a program I was using, but I can’t recall ever connecting that feeling with the notion that I would now be the one writing software that others would be entrusting their data to.
Understand that I am not blaming a UNIX slacker attitude for this; the programmers at Bell Labs obviously cared greatly about the quality of their software, since it was supporting critical telephone infrastructure. And in their books they talked about quality a lot, although the approach was incremental improvements in their software until it was of high quality. I just never got any message of “once you graduate, this stuff really matters.”
So what did companies, writing larger pieces of software for customers who were paying them money, do to ensure that the software worked? Unquestionably, to the extent that they were inspired at all, it was by the UNIX approach.
At Dendrite, where I worked immediately after college, there was little testing of the software—not atypical for a small start-up that employed about ten programmers when I started. There were several customer support people working in the office, and they would run through basic operations with a new release before sending it out (by mailing a floppy disk to every sales representative running the software!), but there was no official verification process. They counted on the programmers getting things right (which to our credit, we mostly did, not through application of any formal design process, but just through being careful; the tricky bug I described in chapter 3 was unusual in that it was seen by customers).
Dendrite was small enough—all the programmers sat together in a warren of cubicles—that communication with other programmers was easy: we could yell if we wanted a question answered. When I got to Microsoft, the Windows NT team had around thirty programmers on it, laughably small compared to today, but much larger than Dendrite. In addition, Windows NT had been in development for a year and a half, so already had a sizable body of code that I needed to get familiar with.
Even at Microsoft, a well-established company at the time, there was no training on how to proceed with the task of software engineering. The general attitude was, “You’re smart, so you figure it out.” Trial and error was the foremost technique employed, with imitating existing code that looked similar a close second. You might wonder if this was intentional—if there was a sense that asking somebody for help was a sign of weakness. I never saw any indication that this was a policy; it was just that everyone else had learned to program by figuring things out on their own, so they simply kept doing it that way and never gave the matter much thought. Imagine if new electricians learned this way.
At Microsoft, I encountered a group of people with the title software test engineer—the testers—whose job it was to take the software that the developers produced and give it a stamp of approval before it was released to customers. In its early days, Microsoft had no testers either—IBM PC DOS, the software that started Microsoft on its road to prominence, must therefore have been released without a separate testing team—but eventually management realized that counting on developers to test their own code posed problems. One concern was that developers spending time testing their code was inefficient; users were more common than developers, so you could hire people to pretend to be users while the developers churned out more features. There was also this sense that you couldn’t “trust” developers to test their own code; that if you asked a developer, “Tell me when your code has been adequately tested,” they would immediately respond, “It’s good.”
I always felt that the implied laziness/evilness/delusionality on the part of developers was unfair, and attribute their overoptimism to not having been exposed to testing in college. I found that most developers, when they arrived at a place like Microsoft, fairly quickly adopted a more conscientious approach (for example, I don’t recall DOS or Microsoft BASIC being buggy). In any case, the profession of software tester was already extant at Microsoft when I arrived in 1990. The book Microsoft Secrets, which came out in 1995, fills in some of the background: the first test teams were set up in 1984; there were expensive recalls of two pieces of software, Multiplan (a spreadsheet, which was a precursor to Excel) in 1984, and Word in 1987; and in May 1989, there was an internal meeting on the optimistic subject of “zero-defects code.”27
The testers would come up with test cases, sequences of steps designed to exercise different areas of the code; if a spreadsheet supported adding two cells together, there would be a test case to create two cells, have the spreadsheet add them, and verify that the result was correct, while also keeping an eye out for any untoward behavior, such as a crash or hang. The goal in having the test cases be formalized was to both ensure that nothing was missed and hopefully have a reliable set of repro steps for whatever went wrong.
The first edition of Cem Kaner’s Testing Computer Software, the most recommended testing book at the time, had come out in 1988; people had been writing books about testing for at least a decade before then, and they had been testing as well as thinking about testing software for a while before that. In 1968, Dijkstra stated (in support of structured programming, remember, which advocates up-front proof rather than after-the-fact testing) that testing was “a very inefficient way of convincing oneself of the correctness of programs,” and the following year he formalized this as “program testing can be used to show the presence of bugs, but never to show their absence,” which he repeated in his essay in the 1972 Structured Programming book.28 Mills and the IBMers, in their Structured Programming, state as fact (with the same motivation as Dijkstra), “It is well known that a software system cannot be made reliable by testing.”29 Well known, but not apparently to people at Microsoft a decade later, who indeed were attempting exactly that.
In chapter 2 of his book, Kaner delves into the motivation for testing; his main points are summarized in the titles of sections 2.1 and 2.2, “You Can’t Test a Program Completely” and “It is Not the Purpose of Testing to Verify That a Program Works Correctly,” respectively.30 What, then, is testing for? Section 2.3 explains it: the point of testing is to find problems and get them fixed. Kaner’s reasons for throwing cold water on your dreams of thoroughness come from G. J. Myers’s 1979 book The Art of Software Testing: if you think your task is to find problems, you will look harder for them than if you think your task is to verify that the program has none.31 And while you are puzzling that one out, I’ll throw in this bit of existential angst from Kaner: “You will never find the last bug in a program, or if you do, you won’t know it.”32
Kaner doubles down by stating, “A test that reveals a problem is a success. A test that did not reveal a problem was a waste of time.”33 His point, also borrowed from Myers, is that a program is like a sick patient whom a doctor is diagnosing; if the doctor can’t find anything wrong and the patient really is sick (and software, the analog of the patient here, is assumed to be “sick”—that is, to have bugs), then the doctor is bad.34 Of course, the doctor finding nothing wrong after a battery of tests is different from any given test not finding anything. The baseball player Ichiro Suzuki, who is known to get a lot of hits but also swing at a lot of bad pitches, was once asked why he didn’t ignore the bad pitches and swing at the ones that were going to be hits. Ichiro explained that it didn’t quite work that way.
I believe what Kaner was trying to do, by focusing on finding bugs, was to shift away from the notion of “the testers said the software was good,” which implies they should be blamed if it turns out it wasn’t, toward “the testers said they couldn’t find any more bugs,” which implies that we are all stepping to the same tragic pavane. Despite this, the notion of “testers signing off” was in full force at Microsoft in 1990. Which of course was another way for developers to avoid blame for bugs: it’s the testers’ fault for not finding them. It didn’t help that Myers, back in 1979, had pushed the idea that programmers could not and thus should not try to test their own software: “As many homeowners know, removing wallpaper (a destructive process) is not easy, but it is almost unbearably depressing if you, rather than someone else, originally installed it. Hence most programmers cannot effectively test their own programs because they cannot bring themselves to form the necessary mental attitude: the attitude of wanting to expose errors.”35
What emerged, unfortunately, was the “throw it over the wall” culture.36 Developers would strive to reach “code complete,” meaning that all the code had been written and successfully compiled, and if the stars aligned just right, might even be bug free. They would then hand the software off to the tester, with the implication that they had done their part and the tester was responsible for figuring out if it worked or not. Code complete is not an inconsequential milestone; it meant that none of your early design decisions had painted you into a corner, and the API you had designed to connect the pieces together was at least functionally adequate. But the problem was that code complete was viewed as “the developer’s work is done, pending any bugs received; if the software ships with bugs, it’s the testers’ fault for not finding them.” And woe betide the tester who pushed back, saying they could not test the code in time; they were supposed to deal with whatever they got from the developers. At one point in the Windows NT source code, next to code for a feature that had been disabled because the testers could not find the time to test it, there existed a snarky comment from a developer along the lines of, “Well, now that the testers are designing our software for us, we’ll remove this.”
I don’t mean to make developers look entirely bad. We do have a craftsperson’s pride in our work and don’t want bugs to happen; we certainly feel some sense of guilt if our programs freeze or crash, especially if they lose data. And we generally enjoy the intellectual challenge of fixing bugs. I do believe that I wrote solid code in my early days at Microsoft, but it was not because of any sense that Microsoft’s stock price could be affected if I messed it up. Morris, author of the Morris worm, intended it to spread much more slowly than it did, with the goal of lurking for a while before being revealed. Presumably he was angry with himself that he had not tested the worm first to get a sense of how fast it would replicate, which would have prevented the rapid detection brought on by the havoc that it wreaked; at the time, one observer commented that he should have tested it on a simulator before releasing it.37
The sense that testers were there to guard against devious developers trying to sneak bugs past them often led to an antagonistic attitude between developers and testers. Just as developers viewed their users as an annoying source of failure reports rather than as the people who ultimately paid their salary—shades of the old luser attitude—they came to see testers in the same light, since their interactions with testers had a depressing similarity: every time you heard from a tester, they were reporting a failure, interrupting whatever lovely new technical problem you were working on and replacing it with a bug investigation, which always had the potential to turn into a bottomless pit. The response to a failure report was frequently a heavy sigh followed by a quick attempt to determine if the investigation could be shuttled off to another developer, and finally a “why would you use the software that way?” eye roll. Unfortunately the message that “the tester’s job is to find bugs” led to using bug counts as a measure of the effectiveness of testers, which led to testers sometimes favoring quantity over quality when looking for bugs, filing many bugs for obscure cases rather than looking for problems that users were most likely to hit, which of course did nothing to improve developers’ opinion of testers. But overall the bad actors here were the developers, not the testers.
Worst of all were bug reports with inexact repro steps, such that when a developer took the trouble to walk through the steps, the bug would not reproduce. Ellen Ullman’s novel The Bug captures the general attitude here, as a programmer tries to reproduce a bug report from a tester:
He started up the user interface, followed the directions on the report: He went to the screen, constructed the graphical query, clicked open the RUN menu, slid the mouse out of the menu. Now, he thought, the system should freeze up now. But nothing. Nothing happened. “Shit,” he muttered, and did it all again: screen, query, click, slide, wait for the freeze-up. Nothing. Everything worked fine. “Asshole tester,” he said. And then, annoyance rising, he did it all again. And again nothing.
A kind of rage moved through him. He picked up the first pen that came to hand—a thick red-orange marker—and scrawled “CANNOT REPRODUCE” in the programmer-response area of the bug report. Then in the line below he added, “Probable user error,” underlining the word “user” in thick, angry strokes. The marker spread like fresh blood onto the paper, which he found enormously satisfying … Not a bug. They were idiots, all of them, incompetent.38
Ullman’s book is fiction, but she is a veteran Silicon Valley programmer; the book was written in 2003, but is set in 1984, and that attitude of superiority that programmers had toward testers, possibly with a little less swearing, was still prevalent at Microsoft in the early 1990s. A programmer investigating a failure report very much hoped to reproduce it on their own machine, where they could use debugging tools to determine the fault and then work their way back to the defect in the code. If they couldn’t reproduce the failure, then they would reach for the thick red-orange marker (conceptually, since even back in 1990, Microsoft used an electronic bug-tracking system) and mark it as “Not Repro,” no matter how severe the effect of the failure was.
There was some software where testing was taken seriously and bugs were appreciated rather than dreaded. But this was only in situations where the developers realized that the cost of a failure would be severe—two common examples being software that ran on medical devices and spacecraft. Of course these occasionally had bugs also, such as the Therac-25 machine administering fatal doses of radiation or the Ariane 5 rocket that self-destructed (certainly not the only space mission that aborted due to a bug), but they were engineered with more care than we undertook at Microsoft. As opposed to being viewed as excellent examples of fine software craftspersonship, however, these were considered by programmers to be old, stodgy projects—imagine having to spend all that time simulating every possible situation just to be sure your software worked properly all the time!—as compared to the cool, fast work we were doing at Microsoft.
In 1990, certainly, nobody sat us newly hired developers down and talked about the expensive recalls of the 1980s. They may have harangued the testers about this, but I never heard about it. Nobody read us this quote, from Mills in 1976: “It is well known that you cannot test reliability into a software system.”39
Eventually, though, the needle swung away from testers “testing in” quality and back to the notion that programmers should “design in” quality. This is the same idea that the structured programming movement was going for, but with a different approach—which I will discuss in the next chapter.