The technological last mile in driverless-car design is the development of software that can provide reliable artificial perception. If history is any guide, this last mile could prove to be a long haul. Consider the case of hacker George Hotz, who in 2015 claimed he spent a month putting together his own driverless car in his garage.
Hotz rigged up a 2016 Acura ILX with the usual equipment: lidar, a camera, and a glove compartment packed with a computer, a networking switch, and GPS sensors. According to Hotz, his driverless car worked well. The tension started to build when Hotz boasted that with just a few more months of improvements and practice, the software guiding his do-it-yourself (DIY) Acura could drive better than the autodrive module of a Tesla Model S.
Tesla’s CEO, Elon Musk, was not amused by Hotz’s public challenge. In a response posted on Tesla’s website, Musk deflated Hotz’s claims, pointing out that
the true problem of autonomy: getting a machine learning system to be 99% correct is relatively easy, but getting it to be 99.9999% correct, which is where it ultimately needs to be, is vastly more difficult. One can see this with the annual machine vision competitions, where the computer will properly identify something as a dog more than 99% of the time, but might occasionally call it a potted plant. Making such mistakes at 70 mph would be highly problematic.1
The debate between Hotz and Musk demonstrates the significance of the critical missing piece—artificial perception software—and how important its development will be for the maturation of a commercially viable driverless car. George Hotz’s home-built Acura is evidence that today a skilled developer can build a pretty good driverless car in a fairly short period of time. However, as Musk points out, when it comes to software you will trust with your life, the leap from 99 percent to 99.9999 percent accuracy is a big one.
In the past few years, mobile robots have gotten better at finding their way around their environment. It helps that the performance of computer vision software has improved dramatically, aided by the advent of big data, high-resolution digital cameras, and faster processors. Another catalyst has been the successful application of machine-learning software to solve thorny problems of machine vision, sparking a mini-renaissance in the study of artificial perception.
The last mile for driverless-car technology remains the development of software to oversee the car’s perception and response. While writing this book, we found ourselves confounded when faced with the question of what to call this emerging bundle of various software tools that do not fall neatly under either high- or low-level controls. After some thought, we wound up naming this software the mid-level controls.
In this chapter, we use the phrase mid-level control software to describe the collection of different software tools that provide the car with artificial vision. Mid-level control software enables the car’s operating system to make sense of sensor data, understand the physical layout of the nearby environment, and to select the best reaction to nearby objects and events. The average duration, or what roboticists call an event horizon, of a mid-level activity ranges from a few seconds to a few minutes. In contrast, low-level controls operate within a time horizon of a fraction of a second and high-level controls plan hours ahead.
For a human, an example of a mid-level activity might be the act of selecting a dirty coffee cup from the sink and placing it into the dishwasher. In a disaster response robot such as CHIMP, the robot’s mid-level controls would note that its visual sensors have identified the presence of a round dark object on the floor. CHIMP’s mid-level control software would classify the viewed object as a rock (rather than a shadow, for example) and would guide the robot’s tank treads safely around it. In a driverless car, mid-level controls make critical life-and-death decisions based on data from the car’s visual sensors, for example, identifying a shape at a crosswalk as a man on a bicycle and then making the decision to wait and let him pass.
Mid-level control software guides a driverless car through complex, real-world situations where there are a nearly infinite number of possible responses. To demonstrate the challenges that developers face, imagine you’ve been tasked with writing software to guide a car through a busy intersection. Your software will be held to the same standard as a human driver, so it must obey the rules of the road as laid out in any standard-issue DMV handbook.2
Slow down and be ready to stop. Yield to traffic and pedestrians already in the intersection or just entering the intersection. Also, yield to the vehicle or bicycle that arrives first, or to the vehicle or bicycle on your right if it reaches the intersection at the same time as you.
Easy, right? The subtle complexities begin to emerge when you break down the process into functional steps. Your first task is to write code to enable the car to recognize that the intersection is approaching, so that it can “Slow down and be ready to stop.” Any good developer could engineer this response by using GPS location points and a detailed digital map that contains information on intersections. The visual cue of a stop sign or stoplight helps too. So far, so good.
Now on to the next task, to “Yield to traffic and pedestrians already in the intersection or just entering the intersection.” At this point, the project starts getting trickier. First, what counts as “traffic”? How does software know to recognize a category of object called “pedestrian”?
To perform as well as a human driver, your mid-level control software must somehow correctly sort the objects it encounters into categories such as traffic, pedestrian, and bicycle. One way to tackle this problem could be to write an artificial-intelligence program that uses a rule-based approach. You could attempt to classify each pedestrian, vehicle, and specific traffic situation the car might encounter by writing an exhaustive list of what programmers call if-then statements.
Now on to the next problem: you must somehow describe all the things in these various categories so the software can identify them. One way to identify objects belonging to categories such as “traffic” or “pedestrian” could be according to what these objects are likely to look like. For example, traffic = rectangular boxy object between ten and twenty feet long. Pedestrian = two-legged object less than two feet wide and between two and seven feet high.
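To make the brittleness of this approach concrete, here is a minimal sketch of what such a rule-based classifier might look like. The size thresholds are the hypothetical ones from the example above, not rules drawn from any real system, and the object attributes are assumed to come from some earlier sensor-processing stage.

```python
def classify(obj):
    """Toy rule-based classifier built from hand-written if-then rules.

    `obj` is assumed to carry rough bounding-box dimensions (in feet),
    a leg count, and a shape estimate from an earlier processing stage.
    """
    if obj["shape"] == "boxy" and 10 <= obj["length_ft"] <= 20:
        return "traffic"        # rectangular, car-sized object
    if obj["legs"] == 2 and obj["width_ft"] < 2 and 2 <= obj["height_ft"] <= 7:
        return "pedestrian"     # upright, person-sized object
    return "unknown"            # everything else falls through the rules

# A pedestrian carrying a bulky package slips past the width rule:
print(classify({"shape": "irregular", "length_ft": 2.5, "width_ft": 2.5,
                "height_ft": 5.8, "legs": 2}))   # -> "unknown"
```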
As artificial-intelligence researchers have learned over the past several decades, the problem with this approach is that category definitions don’t hold true for every situation. Inevitably, an exception will appear in the form of a corner case or two that brings the most carefully constructed set of if-then statements to its knees. In the example of the intersection, one pedestrian might be on crutches and another carrying a bulky package that increases his total girth to wider than two feet. Perhaps a motorcycle that’s seven feet long (i.e., too short to count as a vehicle) will roar through the intersection and your software will fail to recognize it as traffic.
The problem that plagues rule-based AI software is that without a robust method to classify every object a car might encounter, it’s impossible to write rules to guide the car’s response. Since your software can’t always be 99.999 percent certain that what it encounters is truly a vehicle or a bicycle, it can’t advise the car how to react appropriately as instructed by the rules of the road. If life were completely logical, writing mid-level control software would be a straightforward task. Instead, on any given busy street corner in any given city, a rich and steady stream of novel corner cases will elude definition. Clearly, although we take it for granted, we humans have mastered the art of perception.
In his landmark book On Intelligence, Palm founder Jeff Hawkins describes a long-standing puzzle that has confounded philosophers and machine vision researchers alike: invariant representation. We humans are equipped with a mental model that enables us to consistently recognize objects, even when they appear in an unfamiliar context or are viewed from a new angle. For example, most people can recognize a friend’s face in a split second, even if she has changed her hair color from blond to brunette. Exactly how the human brain learns invariant representations, however, remains a mystery.
Human eyes stream data to our brain so smoothly we don’t have to consciously pick apart the visual scene to make sense of it. In contrast, software that reads streams of data from visual sensors has to do more work. A stream of visual data is at heart a numerical array. Machine-vision software processes these numerical arrays with aplomb, but is incapable of understanding the visual scene the numbers depict, a conundrum that cuts to the heart of artificial-intelligence research.
The mystery of how humans achieve nearly flawless scene understanding has puzzled philosophers and scientists for centuries. Since the first computers were developed, AI researchers have tried to crack the mystery of scene understanding and its close kin, object recognition, using a wide variety of techniques. Some techniques used logic (“If the object has three sharp corners, then it must be…”). Other techniques used brute force, storing as many images as possible of objects the robot might encounter and having the software check new visual information against that database, using some comparison metric. Both of these approaches worked some of the time but were too slow and still didn’t provide the software with a crucial skill, the ability to consistently recognize objects in unfamiliar settings.
To automate the process of object recognition, it’s necessary to have software that can extract visual information from raw data in order to identify the objects depicted. Over the years, researchers have attempted to do this in several different ways. One of the earliest forms of computer vision software developed in the 1960s worked by distilling digital images into simple line drawings.
A famous example of this approach was a robot named Shakey, described somewhat optimistically by his creator, Stanford Research Institute researcher Charles Rosen, as “the first electronic person.” Shakey’s “body” consisted of a stack of heavy boxes of electronic equipment mounted on a wheeled cart. On top of the boxes perched Shakey’s “head,” a tall, thin, swiveling mast of cameras and cables. Shakey’s vision sensor was a TV camera that generated a stream of images, each containing just a few hundred pixels.
Shakey wowed the world of machine-vision research by responding to a command typed into his system to “go find a red block” and then, gazing around with his TV camera, slowly rolling around in search of a red block. When his vision system concluded that he indeed “saw” his desired object, Shakey would signal that his mission was complete.
To find his way around, Shakey used a type of artificial intelligence called edge detection, part of a larger family of visual recognition techniques known as template-based perception. Shakey could identify objects of a particular shape or color by analyzing images from his TV camera and simplifying those images into simple line drawings. Once an image was broken down this way, Shakey would identify the object by comparing it against a database of line drawings of pyramids, blocks, and cylinders.
Edge detection required relatively little computing power or memory and enabled efficient data storage. It was a speedy way to extract meaningful information from raw visual data, particularly given the crudeness of the hardware of Shakey’s day, when analyzing an entire digital image took several minutes. Breaking images down into simple line drawings enabled Shakey to respond to commands in a time frame humans felt comfortable with.
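As a rough illustration of the idea (not Shakey’s actual code), the sketch below extracts edges from a tiny grayscale image with a simple gradient filter and scores the result against stored line-drawing templates. Real edge detectors and matchers were more involved, but the pipeline was similar; the template names and threshold here are invented for the example.

```python
import numpy as np

def edges(img, thresh=0.5):
    """Crude edge map: mark pixels where brightness changes sharply."""
    gy, gx = np.gradient(img.astype(float))
    return (np.hypot(gx, gy) > thresh).astype(int)

def best_template(img, templates):
    """Return the name of the stored line drawing whose edge map
    overlaps most with the edge map of the new image."""
    e = edges(img)
    scores = {name: (e == t).mean() for name, t in templates.items()}
    return max(scores, key=scores.get)

# Hypothetical templates would be produced offline from idealized line
# drawings of blocks, pyramids, and cylinders, e.g.:
# label = best_template(camera_frame, {"block": block_t, "pyramid": pyr_t})
```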
The limitations of template matching using edge detection are obvious. Shakey could function only in carefully arranged environments where he would encounter only objects he was already familiar with, set against a clean background. If Shakey were to “see” an object that did not match those in his personal collection of line drawings, his software would just have to make its best guess. If a new obstacle were thrown into the mix (a cat, a flying plastic bag, or a colleague inadvertently moseying into the lab after a late morning), Shakey’s sensors would capture this unfamiliar object, but his mid-level controls would not be able to recognize it.
The researchers who followed Shakey developed several types of rule-based AI software programs that processed data in a number of different formats, including digital image files, 3-D point clouds, and videos. Some machine-vision programs calculated depth measurements from lidar range data. Other software specialized in recognizing different textures in digital photographs. Some programs recognized the presence of a distinct visual feature in an image and compared that feature against a database of visual features.
Many of these techniques worked well. In fact, many are still used by modern industrial robots to carry out concrete tasks, such as inspecting complex circuit boards or sorting machine parts into bins. A critical limitation of rule-based machine-vision software, however, is that it works best in structured environments where the robot’s machine vision will encounter only a selected set of objects. Show a banana to an industrial robot that’s programmed to sort only nuts and bolts and it will be dumbfounded, baffled by this novel yellow object that doesn’t match anything in its image library.
Rule-based AI fares especially poorly when used to provide artificial scene understanding to a driverless car. While watching several autonomous vehicles race in the DARPA Challenge of 2007, I witnessed an excellent (and unintentional) demonstration of a rule-based AI program’s failure to understand a traffic scene, a lapse that caused two autonomous vehicles to crash. (For more detail on the DARPA Challenges, see chapter 8).
In 2007, computer-vision software was still largely rule-based. As the race progressed, my university’s entry, a vehicle from Cornell called Skynet, inched cautiously down the road, followed closely by Talos, a car from MIT. For no apparent reason, Cornell’s car suddenly stopped, backed up, then lurched forward and stopped again. The MIT car attempted to pass the faltering Cornell car. As the MIT car slowly moved into passing position, Cornell’s car unexpectedly started moving again and the MIT car crashed into its side.
Fortunately, no one was harmed, and both teams (later) enjoyed the distinction of participating in what was perhaps the first “driverless vs. driverless” accident in history. Analysis of the cars’ driving logs revealed that they crashed because their rule-based machine-vision software failed to do its job. In its postmortem analysis, the Cornell team concluded that the reason its car came to a sudden halt was that “a measurement assignment mistake caused a phantom obstacle to appear, momentarily blocking Skynet’s path.” The MIT team concluded that “Talos perceived Skynet as a cluster of static objects.”3
If the software guiding Cornell’s car had correctly identified MIT’s car as a moving vehicle and vice versa, the crash would not have occurred. This example demonstrates how essential it is that a driverless car’s mid-level control software have the capacity to recognize the objects it sees on and near the road. Subtle but critical differences between two similar objects—a parked motorcycle vs. one being actively steered into traffic—can make the difference between life and death.
We will explain how mid-level control software works by considering it in four modules. The first module is a software tool called an occupancy grid. The second is a software program that recognizes and labels the raw data that flows into the occupancy grid. The third module uses predictive AI software to generate cones of uncertainty. Finally, the fourth module consists of a short-term trajectory planner that guides the car past perceived obstacles, while obeying rules of the road. Let’s begin with the first module, the occupancy grid.
An occupancy grid is a software tool that provides a real-time, continuously updated 3-D digital model of the environment outside the car. Similar to a back-end database that contains digital records, the occupancy grid is a digital repository that stores information about the physical objects around the car. The grid works in tandem with other modules in the mid-level control software and also serves as a visual model for programmers.
Data flows into the occupancy grid from several sources. Some data is static and originates in stored, high-definition maps. Other data flows in from the car’s visual sensors in real time. Most programmers set up their occupancy grid using a system of color coding and easily recognizable icons to represent objects that commonly appear on the road.
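Here is a minimal sketch of the bookkeeping involved, assuming a simple 2-D grid of fixed-size cells centered on the car. Production occupancy grids are 3-D, probabilistic, and updated many times per second; the cell size and coordinate conventions below are invented for illustration.

```python
import numpy as np

class OccupancyGrid:
    """Toy 2-D occupancy grid: each cell records whether it is occupied
    and, optionally, a label supplied by the perception module."""

    def __init__(self, size_m=100.0, cell_m=0.5):
        n = int(size_m / cell_m)
        self.cell_m = cell_m
        self.occupied = np.zeros((n, n), dtype=bool)
        self.labels = {}                       # (row, col) -> label string

    def _cell(self, x_m, y_m):
        # Convert car-relative coordinates (car at grid center) to indices.
        n = self.occupied.shape[0]
        return (int(y_m / self.cell_m) + n // 2,
                int(x_m / self.cell_m) + n // 2)

    def mark(self, x_m, y_m, label="unknown"):
        r, c = self._cell(x_m, y_m)
        self.occupied[r, c] = True
        self.labels[(r, c)] = label

grid = OccupancyGrid()
grid.mark(3.0, 12.5, label="pedestrian")   # sensor return 12.5 m ahead, 3 m to the right
```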
Twenty years ago, a typical occupancy grid was blocky and crude, with the same level of visual appeal as an old Pac-Man arcade game. During the 1980s, when driverless cars hauled around hefty minicomputers and TV cameras, their occupancy grids depicted the scene outside the car as a jerky series of static frames. Today’s occupancy grids operate in real time, building a dynamic model of the world by keeping a running list of all the objects near the car.
At this point, an astute reader would pause and point out that for most of its history, software has not proven very capable of correctly identifying objects. Since this is indeed the case, we now introduce a new and critical piece of information: an occupancy grid is just a spatial model for planning. To interpret the scene, it relies on a second software module to label the raw data flowing in from the car’s sensors. This second module uses deep-learning software to classify the objects moving past the car so the occupancy grid can store this information and make it available to the rest of the car’s operating system.
This second module is the force behind the revolution in driverless cars. By recognizing objects depicted in images, deep-learning software is finally unlocking the puzzle of artificial perception and enabling the development of robust mid-level control software. We’ll explain the inner workings of deep-learning software in depth in chapter 10. Here we will summarize it as a type of machine-learning software that uses artificial neural networks to recognize objects in streams of raw visual data.
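In spirit, the hand-written if-then rules from earlier in this chapter give way to a trained model. The sketch below assumes a hypothetical trained_model (for example, a convolutional network trained offline on labeled road images) and simply routes its labels into the toy occupancy grid sketched above; the function names are illustrative, not any vendor’s API.

```python
LABELS = ["car", "pedestrian", "bicycle", "unknown"]

def label_detections(detections, trained_model, grid):
    """For each detected object, ask the trained network for a label
    and record it in the occupancy grid.

    `detections` is assumed to be a list of (image_patch, x_m, y_m)
    tuples produced by an earlier sensor-processing stage, and
    `trained_model(patch)` is assumed to return one score per label.
    """
    for patch, x_m, y_m in detections:
        scores = trained_model(patch)
        best = max(range(len(LABELS)), key=lambda i: scores[i])
        grid.mark(x_m, y_m, label=LABELS[best])
```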
In the past, without an accurately labeled data feed, an occupancy grid was fairly toothless, a mere crude approximation of a few large physical objects in the nearby environment. Without knowing what objects lurked outside the car, the rest of the car’s software programs could not figure out how best to react to them, or predict what these unidentified objects would do next. Until very recently, driverless vehicles operated reliably only in static environments that contained almost no moving objects, for example in factories, mines, farm fields, and deserts. In these static environments, the control software guiding the vehicle worked pretty well by simply avoiding every obstacle it encountered, whether or not it could classify it.
Deep learning has made possible the use of another technique that has also greatly improved the performance of driverless-car software in dynamic environments. The third module in mid-level control software uses a tool called a cone of uncertainty to predict where and how fast objects near the car are moving. Once the deep-learning module has labeled an object and the occupancy grid made note of its presence, a cone of uncertainty predicts where the object is headed next.
A cone of uncertainty provides a driverless car with artificial scene understanding. When a human driver sees a pedestrian standing too close to traffic, she makes a mental note to veer away to avoid him. A driverless car makes the same “mental note” using a cone of uncertainty. A stationary object, like a fire hydrant, will have a tiny, narrow cone of uncertainty since it’s not likely to move much. In contrast, a rapidly moving object will have a large, broad cone, since the number of potential places it could wind up is larger and its future position is therefore more uncertain.
Human drivers don’t explicitly draw out mental elliptical cones around every nearby object. Yet, we do something that’s roughly equivalent. We keep a running tally in our minds of who and what is around us. Based on our assessment of their appearance combined with our past experience, we guess at the intentions of nearby pedestrians and we predict what they’re going to do next.
Roboticists borrowed the concept of the cone from weather forecasters. If you’ve ever watched a weather report on TV where the meteorologist shows the projected path of a hurricane as it approaches the coast, you’ve seen a cone of uncertainty. The apex of the cone begins at the storm’s current, known position. The broad end of the cone covers the storm’s potential path, where it might wreak havoc over the course of the next few days. The broader the open end of the cone, the more uncertain the storm’s ultimate destination.
Mid-level control software creates cones as follows. Imagine an object depicted on a piece of paper. First, draw a tight circle around that object. Let’s call that the current circle. Next, imagine that the object is moving through space and draw another larger circle around all the potential future locations that the object might occupy in the next 10 seconds. Let’s call that the future circle. Finally, connect the small circle and the larger circle with two lines. Voilà: a cone of uncertainty.
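A minimal sketch of that construction in code, assuming we know an object’s position, speed, heading, and a rough bound on how sharply it might change course. The specific radii and the 30-degree spread are invented for illustration.

```python
import math

def cone_of_uncertainty(x, y, speed_mps, heading_rad, horizon_s=10.0,
                        spread_rad=math.radians(30)):
    """Return the 'current circle' and 'future circle' (center x, center y,
    radius, in meters) of a toy cone of uncertainty."""
    current = (x, y, 0.5)                     # tight circle around where it is now
    # Center of the future circle: where the object ends up if it keeps going.
    fx = x + speed_mps * horizon_s * math.cos(heading_rad)
    fy = y + speed_mps * horizon_s * math.sin(heading_rad)
    # The radius grows with speed and with how much the object could swerve.
    future_r = 0.5 + speed_mps * horizon_s * math.sin(spread_rad)
    return current, (fx, fy, future_r)

fire_hydrant = cone_of_uncertainty(2.0, 5.0, speed_mps=0.0, heading_rad=0.0)
cyclist = cone_of_uncertainty(0.0, 20.0, speed_mps=5.0, heading_rad=math.pi / 2)
# The hydrant's future circle stays tiny; the cyclist's is tens of meters across.
```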
Cones of uncertainty provide a digital substitute for human eye contact. From the vantage point of a driverless car, a pedestrian standing near the curb and looking out onto the street would have a cone with a slightly forward skew, indicating she might step forward at any moment. If she were staring at her phone instead, her cone might have a different shape, perhaps slightly narrower, since she would not appear ready to move forward. If she were to glance at the driverless car, her cone would shrink even further, since the car’s software would recognize that she had seen the car and was unlikely to move into its path.
The more unpredictable the pedestrian, the larger the cone of uncertainty. A wobbling bicycle rider would have a larger cone of uncertainty than a stationary pedestrian. A bewildered dog or a child chasing a ball would have an even larger cone.
Sometimes even a static object can be surrounded by a large cone of uncertainty. Occluding objects, things that block whatever is behind them, are unlikely to move themselves but might be hiding something else that could. A car’s mid-level software would likely draw a large cone of uncertainty near a blind alley, a blind turn, or next to the open door of a parked car that might unexpectedly let out a passenger. A stopped school bus would likely generate a large cone of uncertainty: although the bus itself may not be moving, at any moment a child could dart out from behind it.
In the car’s mid-level controls, when the first three modules have completed their work, the fourth one, the short-term trajectory planner, can kick in. When objects near the car have all been labeled and their cones of uncertainty calculated, a driverless car’s trajectory planner can lay out the car’s best path forward. The trajectory planner uses well-established planning algorithms to calculate the most efficient path forward while still obeying the rules of the road and minimizing travel time and the risk of a collision.
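One way to picture the planner’s job is as a scoring exercise: generate a handful of candidate trajectories, score each against collision risk, travel time, and rule violations, and pick the cheapest. The sketch below is a minimal version of that idea; real planners search far larger trajectory spaces, and the cost weights here are invented for illustration.

```python
def plan(candidates, collision_risk, travel_time, breaks_rules,
         w_risk=100.0, w_time=1.0, w_rules=50.0):
    """Pick the candidate trajectory with the lowest weighted cost.

    `collision_risk`, `travel_time`, and `breaks_rules` are assumed to be
    scoring functions supplied by the other modules: risk from the cones
    of uncertainty, time from the route planner, and violations from an
    encoding of the rules of the road.
    """
    def cost(trajectory):
        return (w_risk * collision_risk(trajectory)
                + w_time * travel_time(trajectory)
                + w_rules * breaks_rules(trajectory))

    return min(candidates, key=cost)
```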
Computers excel at this type of nonlinear trajectory estimation. In the past few years, software programs have improved to the point where a computer can make better predictions about an object’s trajectory than a person can. The more potential outcomes a situation presents, the greater the computer’s advantage in weighing all of the possibilities. Add more variables, such as a broader range of possible pedestrian behaviors, and the trajectory planner’s edge over a human driver only grows.
As mid-level control software continues to improve at its current rate, it won’t be long before the operating system for a driverless car will exhibit perfect mastery of the holy trinity of qualities needed for mainstream application—real-time response, reliability, and intelligence. Does this mean that driverless cars should be permitted to take the road? Probably. But first, human passengers need to demand concrete proof and a clear definition of how reliable, exactly, these new robotic drivers will be.
We’ve discussed the technological challenges that make it difficult for an operating system to be perfectly reliable. What also needs to be defined and then quantified is a legal standard for reliability. Many people would insist that driverless cars not be legal until they achieve perfect 100 percent reliability, meaning no crashes, mishaps, or blunders—ever. The sad truth is that if perfect reliability is required, driverless cars will never attain legal status. No operating system is perfectly reliable all of the time.
A great deal can go wrong with an operating system, and even the best-designed ones fail periodically. On a personal computer, a user might install a new application that’s buggy and hence disrupts the system’s core processes. A system can be infected by a malevolent computer virus. A new hardware peripheral can unexpectedly bring down the entire system.
Computer operating systems have two characteristics that make them unreliable and insecure. One, they contain millions of lines of code, far too many for even a skilled coder to wade through in search of potential bugs. Two, operating systems suffer from what’s called poor fault isolation.
Windows, iOS, and Linux, for example, are built in such a way that there’s little or no isolation between their subparts or procedures. The architecture of Windows XP contains about five million lines of code that run thousands of procedures in a single, monolithic digital “workspace,” or kernel. These thousands of procedures are linked together into a single binary program; like mountain climbers attached to a single line, if one falls, they all fall.
Hardware problems are frequently the root cause of a computer operating system failure, and a robotic operating system would manage an even larger workforce of hardware components. If simple peripherals such as mice, keyboards, and headphones can cause a computer operating system to crash, imagine the unreliability introduced by a car’s “hardware”: its tires, brakes, and steering wheel, not to mention all the sensors needed to drive the car.
Each attached bit of hardware has a special software program called a driver that enables that bit of hardware to speak with the rest of the operating system. Driver problems are another major cause of system failure, and they can wreak havoc: driver code has an error rate three to seven times higher than ordinary software code, and driver-induced crashes cause an estimated 70 percent of operating system failures.4
The more complex the operating system, the harder it is to predict how it’s going to fail. The number of lines of code runs into the tens of millions, and even well-written code can crash. There is also the security threat posed by malevolent hackers installing rogue hardware on a driverless car or tampering with its operating system.
The stakes are higher, too. When a personal computer suffers a catastrophic system failure, the result is frustration. No user ever died from the dreaded Microsoft Windows “blue screen of death.” In a driverless car, however, the equivalent of the blue screen of death could be, literally, death. Other less disastrous but still dangerous failure scenarios would be if a specific piece of hardware stopped responding, or if the entire operating system got bogged down in some sort of systemic problem and began to run very, very slowly.
The reality is that driverless cars will certainly suffer from software failure. The open question, however, is how much failure is acceptable. In the established world of server operating systems, system administrators have made a science of counting and documenting their system’s downtime, the number of hours each year that a server is taken offline because of system failure. Downtime can be planned, as in when an upgrade is required, or unplanned, when there’s catastrophic system failure. Downtime, planned or otherwise, costs businesses money.
Mature computer operating systems benefit from the fact that system administrators have figured out several techniques to improve reliability. As a result, the number of hours of server downtime per year has dropped dramatically in the past few years. The typical well-configured Windows or Linux server now suffers just a few minutes of downtime a year, if that. One question that will need to be addressed is: if a few minutes a year of downtime is acceptable for a server farm that supports several businesses, should that level of downtime also be okay for a driverless car?
Some people, tired of the carnage caused by distracted, drunk, or emotional human drivers, might be comfortable accepting less than perfect reliability for a driverless car. This pragmatic group might agree that driverless cars should become legal once the operating system can drive better than a human driver. Rationally, such a reliability benchmark makes a lot of sense. The challenge, however, would be convincing naysayers to give robots a chance. We’re used to tragedies caused by human drivers. The first tragedy caused by a robotic driver, however, will be cause for outrage, perhaps turning the public against driverless cars for a long time.
Let’s propose a bolder baseline: in order to be considered legal, driverless cars must be twice as safe as the average human driver. Let’s assign some numbers to this baseline. According to car insurance industry data, on average, a human driver in the United States has an accident every seventeen years. Another way to look at this is that the average human driver will file an insurance claim once every 190,000 miles.
While car accidents are common, very few of these collisions involve fatalities—only 0.3 percent.5 In other words, a human driver is likely to have some sort of accident, ranging from a minor fender-bender to a fatal collision, once every 190,000 miles driven. Let’s take this metric and round it up a bit—one accident per 200,000 miles.
One widely used metric to quantify a robot’s reliability is how long it can operate alone, without human intervention. This metric is known as mean time between failures, or MTBF. Like humans, all robots sometimes need help. For example, at home, our Roomba needs to be rescued, on average, once every ten hours when it snarls itself in a jungle of chair legs or gets its wheels stuck in a nasty tangle of electric cords. High-end industrial robots have a much longer MTBF, especially if they’re operating in remote conditions where their human overseer stops by only once a month or so.
If the average driverless car must be twice as safe as the average human driver, that means an acceptable failure metric for the car would be an average of one accident every 400,000 miles driven, twice the number of miles that the average person drives between accidents. Let’s call this metric mean distance between failures, or MDBF. Ideally, a car’s MDBF should be measured across a typical suite of road, traffic, and weather conditions.
Let’s calculate how long it would take for a driverless car to rack up an MDBF of 400,000 miles. If an autonomous car can drive 1,000 miles per day, its MDBF could be verified in 400 days, or just over one year. Alternatively, a fleet of 1,000 cars, each covering 400 miles a day, could accumulate 400,000 miles in twenty-four hours. To add statistical significance, the people testing the cars would probably need to repeat this test a few times.
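The arithmetic behind those figures, worked out in a few lines (the per-day driving rates are the illustrative ones used above, not measurements):

```python
human_miles_per_accident = 200_000            # rounded-up U.S. average from above
target_mdbf = 2 * human_miles_per_accident    # "twice as safe" benchmark: 400,000 miles

# One car driving 1,000 miles per day:
days_single_car = target_mdbf / 1_000         # 400 days, just over a year

# A fleet of 1,000 cars, each covering 400 miles per day:
days_fleet = target_mdbf / (1_000 * 400)      # 1 day

print(days_single_car, days_fleet)            # 400.0  1.0
```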
One of the great advantages of teaching robots how to think is that they have a hive mind. If one robot learns something, that software can be copied to dozens of other robots, who will use that knowledge to continue to learn. As many different robotic systems learn in parallel, their individual learnings can be pooled into a central knowledge base and then poured back into the individual robotic minds, which in turn continue to learn at an even faster rate.
Some call this sort of collective learning fleet learning. One of the great advantages of driverless-car technology is that fleets of cars can learn at an exponential rate. Fleets of cars absorb one another’s combined digital streams of driving data from past miles driven, an approach used by Google, Volvo, and Tesla to hone their driverless-car technology.
Fleet learning involves repetition, lots and lots of repetition. This was demonstrated in a video posted in April 2014, in which a test driver for Google’s Self-Driving Car Project explained that “A big part of our job is to go out into the world and uncover all the potential scenarios that a car might encounter. Then we help the engineers teach the car how to best navigate each one.”6 The blog article posted to accompany the video added, “As we’ve encountered thousands of different situations, we’ve built software models of what to expect from the likely (a car stopping at a red light) to the unlikely (blowing through it). We still have lots of problems to solve, including teaching the car to drive more streets in Mountain View before we tackle another town.”
As the stored knowledge base of driving situations continues to grow exponentially, driverless cars will become more and more capable. Yet, in order to gain full social and legal acceptance, driverless cars will need a transparent set of reliability standards, a clearly defined and public standard for what constitutes safe and acceptable mean distance between failures. Ideally, the companies that manufacture driverless cars would work with the federal government to publicly define acceptable MDBF standards for driverless cars.
In the new driverless era, MDBF levels will become as essential a piece of consumer-facing information about a new set of wheels as its horsepower. When consumers decide what car to buy, the driverless car’s MDBF level will be one of the car’s defining characteristics reported. An MDBF, however, is a bit of a clumsy term. We propose a term that’s easier to say.
A driverless car’s MDBF should instead be expressed as its humansafe level. After all, 100 years ago early carmakers quantified the power of their engines in a metric that people of the time understood, the strength of a horse, or horsepower. Driverless cars should similarly showcase their ability to drive safely in terms of a safety level we’re already familiar with: that of a human.
Here’s how it would work. A car that can drive twice as many accident-free miles as the average human could be advertised as having a humansafe rating of 2.0. A car that’s three and a half times as safe as an average human driver would have a humansafe rating of 3.5, and so on. The humansafe level would depend on the software and computing power, as well as the number and types of hardware sensors.
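The conversion is simple division. A sketch, using the 200,000-mile human baseline from earlier in this chapter:

```python
HUMAN_BASELINE_MILES = 200_000   # average miles between accidents for a human driver

def humansafe_rating(car_mdbf_miles):
    """Express a car's mean distance between failures as a multiple
    of the human baseline."""
    return car_mdbf_miles / HUMAN_BASELINE_MILES

print(humansafe_rating(400_000))   # 2.0 -> twice as safe as the average human driver
print(humansafe_rating(700_000))   # 3.5
```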
A federally regulated system of humansafe levels would be a boon for nearly everyone involved in the driverless-car industry. A driverless car’s humansafe rating would be a core part of its market appeal. Car companies could build special “super safe” cars with high humansafe ratings and charge a premium for the extra safety. Perhaps a driverless car that ferries unaccompanied children would be required by law to have a humansafe rating of 10.0, while cargo-only vehicles could be permitted to travel with a lower rating. Consumers would speak knowledgeably about the trade-off between humansafes and horsepower in various models of sleekly designed sports cars.
Some of this work has already been done. A good example of robotic systems that have already tackled many of these issues is the operating systems that fly commercial aircraft. Commercial airplanes have one of the highest MTBF ratings of all transportation systems. On average, an airplane suffers a fatal system failure once every two million flights7 (not counting pilot error, sabotage, terrorism, and other external factors, which are responsible for over 80 percent of fatalities).
One failure every two million flights is a vast improvement over airplanes’ past reliability, roughly 100 times better than the accident rate in the 1950s. Such an impeccable safety record is even more impressive considering that modern airplanes fly faster, for longer durations, and over greater distances than planes did in the 1950s. Yet, despite these demands, in the past fifty years the number of avionic system failures has dropped by two orders of magnitude.
What can driverless cars learn from commercial airplanes? One lesson is that federal oversight is key. Modern airplanes are subject to more scrutiny, with tighter federal oversight of maintenance and pilot training. Driverless cars need federal oversight that’s robot-savvy and guided by safety data. It’s crucial that safety regulations for driverless cars be defined using a rational and transparent process for weighing and quantifying the merits of robotic drivers.
Federal oversight of standards is just one piece of the puzzle, however. Better operating-system design is also essential. You will recall that one of the fatal weaknesses of computer operating systems is that their software architecture was designed so that all of the system processes ran inside one, monolithic workspace. Dumping everything into one software bucket results in computer operating systems being unreliable and insecure, or leaky. As one paper on operating system security explained it, “Current operating systems are like ships before compartmentalization was invented: every leak can sink the ship.”8
Driverless cars need an operating system that’s highly modular and redundant, similar to those that guide airplanes. Airplanes are famous for their redundant architectural design, much of it strictly regulated by the Federal Aviation Administration. All critical physical subsystems have double or triple redundancy. For example, fuel lines of commercial aircraft are required to be double-sheathed so that a leak in the inner tubing will be captured and detected in the outer tubing.
A plane’s critical software subsystems are also highly modular and redundant. The avionic operating system is composed of several linked but independent electronic subsystems. Separation is key: the software that runs the engines is separated from the software that runs the landing gear, and both are separated from the infotainment software with which passengers interact.
Redundancy and fault-proofing are also important. Each avionic subsystem is designed to detect faults in the other subsystems and to tolerate them when they occur. If several subsystems “disagree,” the disagreement is resolved by a digital majority vote.
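A minimal sketch of that voting idea, assuming three independent perception subsystems that each report a label for the same object:

```python
from collections import Counter

def majority_vote(reports):
    """Resolve disagreement among redundant subsystems by majority vote.

    `reports` is a list of labels, one per independent subsystem.
    If no label wins a strict majority, flag the object so the car
    can handle it cautiously.
    """
    label, count = Counter(reports).most_common(1)[0]
    return label if count > len(reports) / 2 else "uncertain"

print(majority_vote(["pedestrian", "pedestrian", "static object"]))  # pedestrian
print(majority_vote(["pedestrian", "bicycle", "static object"]))     # uncertain
```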
Driverless cars will need a redundant real-time operating system that contains built-in, independent, self-testing subsystems required by law. Perhaps a driverless car could be required to have three separate visual perception subsystems, each using its own visual sensors and its own technique for analyzing the data those sensors gather. To stay safe, a car’s operating system would need periodic tune-ups to confirm that it is running the latest software trained on the latest data sets, and that the mechanical systems are communicating flawlessly with the software systems.
Similar to those on airplanes, the car’s wiring and on-board computers should be physically walled off, safe from the tinkering hands of innocent passengers or malevolent hijackers. The car’s self-driving software should be modular and subject to periodic mandatory inspections by an independent third party that is not the software’s manufacturer. If one subsystem disagrees with the conclusion drawn by another, the oversight software should either “tune” the system’s performance or immediately route the car in for maintenance.
The operating system of a modern driverless car is the fruit of decades of artificial-intelligence and controls-engineering research. The operating system of a driverless car must not only be reliable and smart; it must also “handle the world” in real time, and never crash. Given these challenges, some people wonder whether driverless cars are possible at all. They ask, “Will there ever be an operating system that can handle a car as well as a person can?” A more interesting question to ask is, “Why weren’t driverless cars invented years ago?”