CHAPTER 39

FAILURE ANALYSIS
Anything that can go wrong, will go wrong.
—Murphy’s Law


Legend has it that the original “Murphy” was a U.S. Air Force officer who worked on rocket sleds in the late 1940s and early 1950s. These sleds were used to learn about the effects of high accelerations (g forces) on the human body as well as to design appropriate restraint and safety systems for high-performance aircraft and spaceships. Computer simulations and mechanical models of the human body were not available at the time, so the tests were performed on a living person. The sleds were capable of tremendous speeds (approaching the speed of sound), and there was a very high probability of accidents due to mechanical failure. While readying a sled for a test and working through a myriad of problems, Murphy reportedly muttered the comment for which he was to become immortalized. You have probably already heard of Murphy’s Law; it has been referenced so many times in books that it has become trite. It is included here to introduce the following comment on it, one that you are sure to relate to as you work on your own robot designs:
Murphy was an optimist.
—Anonymous


Before trying to figure out how to fix a problem, you first have to determine where the problem is occurring. The following three sections list the three primary sources of problems and the types of failures typically ascribed to them. Once you determine where the problem lies, you can start looking at how to fix it so that it does not recur. The remaining sections of this chapter give a process for determining the root cause of a failure and for ensuring that the corrective action will keep the robot running until some other problem surfaces.
Mechanical problems are perhaps the most common failures in robots. The typical source of the problem is that the materials or the joining methods you used were not strong enough. Avoid overbuilding your robots (that tends to make them too expensive and heavy), but at the same time strive to make them physically strong. Of course, strong is relative: a lightweight, scarab-sized robot needn't have the structure that a tricycle needs to support a two-year-old. At the very least, however, your robot's construction should support its own weight, including batteries.
When possible, avoid slap-together construction, such as using electrical or duct tape. These methods are acceptable for quick prototypes but are unreliable for long-term operation. When gluing parts in your robot, select an adhesive that is suitable for the materials you are using. Epoxy and hot-melt glues are among the most permanent. You may also have luck with cyanoacrylate (CA) glues, though the bond may become brittle and weak over time (a few years or more, depending on humidity and stress).
Use the pull test to determine if your robot construction methods are sound. Once you have attached something to your robot—using glue, nuts and bolts, or whatever—give it a healthy tug. If it comes off, the construction isn’t good enough. Look for a better way.
Electronics can be touchy, not to mention extremely frustrating, when they don’t work right. Circuits that functioned properly in a solderless breadboard may no longer work once you’ve soldered the components in a permanent circuit, and vice versa. There are many reasons for this, including mistakes in wiring, unexpected capacitive and inductive effects, even variations in tolerances due to heat transfer.
Some electrical problems may be caused by errors in programming, weak batteries, or unreliable sensors. For example, it is not uncommon for sensors to occasionally yield totally unexpected results. This can be caused by design flaws inherent in the sensor itself, spurious data (noise from a motor, for example), or corrupted or out-of-range data. Ideally, the programming of your robot should anticipate occasional bad sensor readings and ignore them. A perfectly acceptable approach is to throw out any sensor reading that falls outside the range you have decided is plausible (e.g., a sonar ping that says an object is 1048 ft away, when the average robotic sonar system has a maximum range of about 35 ft).
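A minimal Python sketch of discarding implausible readings, assuming sonar distances arrive as numbers in feet. The function name and the lower bound are illustrative assumptions; the 35 ft upper bound is the figure quoted in the example above, not the specification of any particular sensor.

```python
MAX_SONAR_RANGE_FT = 35.0  # assumed maximum range, taken from the example in the text
MIN_SONAR_RANGE_FT = 0.0   # assumed lower bound; a real sensor may have a larger minimum


def filter_sonar_readings(readings_ft):
    """Return only the readings that fall inside the plausible range."""
    return [r for r in readings_ft
            if MIN_SONAR_RANGE_FT <= r <= MAX_SONAR_RANGE_FT]


if __name__ == "__main__":
    raw = [4.2, 4.1, 1048.0, 4.3, -1.0]  # made-up sample data
    # The 1048 ft and -1 ft readings are dropped; the rest pass through.
    print(filter_sonar_readings(raw))
```

Dropping a reading is the simplest policy. Depending on the robot, reusing the last good value or averaging several readings may serve better.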
As more and more robots use computers and microcontrollers as their brains, programming errors are fast becoming one of the most common causes of failure. There are three basic kinds of programming bugs.
  • Compile bug or syntax error. You can instantly recognize these because the program compiler or downloader will flag these mistakes and refuse to continue. You must fix the problem before you can transfer the program to the robot’s microcontroller or computer.
  • Run-time bug, caused by a disallowed condition. A run-time bug isn’t caught by the compiler. It occurs when the microcontroller or computer attempts to run the program. An example of a common run-time bug is the use of an out-of-bounds element in an array (for instance, trying to assign a value to the thirty-first element in a 30-element array). Run-time bugs may also be caused by missing data, such as looking for data on the wrong input pin of a microcontroller.
  • Logic bug, caused by a program that simply doesn't work as anticipated. Logic bugs may be due to simple math errors (you meant to add, not subtract) or to mistakes in coding that cause a different behavior than you anticipated.
As you become experienced in programming, you will get a lot faster at finding problems, because you will understand where you normally make mistakes and will learn how to use the tools at your disposal for finding and fixing them. When you first start working on robots, you will probably feel most uncomfortable about your skills in debugging programs, but as you gain experience, you will be amazed at your ability to produce code that has relatively few errors, operates efficiently, and can be debugged easily.
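The out-of-bounds example from the run-time bug description can be sketched in Python as follows. The helper `store_reading` is a hypothetical name used only for illustration. Python reports the bad index as an error, whereas a compiled language on a small microcontroller may silently overwrite neighboring memory instead, so an explicit bounds check is often worth adding.

```python
readings = [0] * 30  # a 30-element array; valid indexes are 0 through 29


def store_reading(buffer, index, value):
    """Store value at index; return False instead of failing when index is out of bounds."""
    if 0 <= index < len(buffer):
        buffer[index] = value
        return True
    return False


if __name__ == "__main__":
    try:
        readings[30] = 5  # the "thirty-first element": a run-time bug
    except IndexError as err:
        print("run-time bug caught:", err)

    print(store_reading(readings, 30, 5))  # rejected by the bounds check
    print(store_reading(readings, 29, 5))  # accepted: the last valid element
```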
When given a problem to fix, most people will try to find the easiest way to resolve it and move on; the term used for this process is debugging. Looking for and implementing the quick fix often yields an effective (and sometimes optimal) repair action. However, it does not resolve all problems and quite often masks them. If your only experience is in working on the most likely cause of a problem, you will quickly become frustrated when that approach fails, unable to identify either the reason for the problem or the most effective repair. For these cases, you will need to perform a root cause failure analysis (often shortened to just failure analysis) to determine exactly what the problem is and what the best way to fix it is.
The failure analysis that is outlined in the rest of this chapter applies to all three classes of failures that you will encounter in your robots (mechanical, electrical, and programming), which may seem surprising because each class seems so different from the others. On the surface, mechanical (chassis and drivetrain design and assembly) problems do not have anything in common with electrical or programming issues. What they do have in common is the process: understanding how the structure, circuit, or program should work, characterizing the problem, developing theories regarding what is actually happening, figuring out how to repair the problem, and testing your solution before finally applying the fix to the robot. The following set of actions may seem like a lot of work to fix a problem like a nut that has fallen off due to vibration, but if you follow it faithfully, the skills you gain by fixing the simple problems will make the more difficult ones a lot easier to solve and will prevent the simple ones from happening again.
Do you understand exactly what should be happening in your robot? Chances are you have a good idea of what the robot, or a part of the robot, should be doing at a given time but you have probably not looked in detail at what is actually happening. For a robot that has a failed glued plastic joint, you may have done an analysis of the forces on the joint when the robot is stationary, but have you looked at what happens during acceleration and deceleration? What about forces caused by vibration or large masses (such as batteries) shifting during operation? The forces during movement as well as changes in movement must be considered when looking at a mechanical failure.
Similarly, for an electrical problem, do you understand what the actual currents flowing through the circuitry are? Starting and stopping robot drive motors under load requires greater currents than running them unloaded during a bench test. Have you calculated the temperatures of different components during operation as well as their effect on components close to them? If the robot seems to miss detecting objects in front of it, have you considered switch bouncing or changing fields of view during operation? Electrical problems can be especially vexing when you are using third-party designs or circuitry. To predict what should be happening, you should review the basic electrical laws and make sure you fully understand the basic electrical formulas and conventions.
Finally, for software, can you trace through the source code to understand what should be happening at any one particular time? How is the operation of the robot controller documented for different situations with varying inputs? A simulator is a very important tool for understanding the operation of software, as is the time spent working out how the application should behave. Many robot software applications are written quickly and debugged continuously to get the robot working as desired; this makes documenting the software a difficult and confusing chore unless you are very careful to keep track of different versions of the software and the changes made to them. You will often find it easier to go back over the source code and try to map out how it is supposed to work and respond to different inputs.
Documenting the expected operation of the robot at the time of failure is a time-consuming task, but one that is critical to finding and ultimately fixing the problem. In many cases, once you understand the operation of the robot at the level of detail needed to find and fix the problem, the reason for the problem will become apparent. Even so, you should refrain from implementing the apparent fix until you have worked through the following five steps.
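As an illustration of applying the basic electrical laws, the sketch below uses Ohm's law to estimate how far a battery's terminal voltage sags under motor load and how much heat a resistive element dissipates. The battery is modeled as an ideal source in series with a fixed internal resistance, which is a simplification, and every number in the example is a made-up placeholder rather than a measurement.

```python
def terminal_voltage(open_circuit_v, internal_resistance_ohm, load_current_a):
    """Battery terminal voltage under load: V = V_open - I * R_internal."""
    return open_circuit_v - load_current_a * internal_resistance_ohm


def power_dissipated_w(current_a, resistance_ohm):
    """Heat generated in a resistive element: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


if __name__ == "__main__":
    # Hypothetical values: 7.2 V pack, 0.3 ohm internal resistance.
    pack_v, pack_r = 7.2, 0.3
    running_a, starting_a = 0.5, 4.0  # assumed running and startup currents

    print(terminal_voltage(pack_v, pack_r, running_a))   # about 7.05 V while cruising
    print(terminal_voltage(pack_v, pack_r, starting_a))  # about 6.0 V at startup
    print(power_dissipated_w(starting_a, pack_r))        # about 4.8 W heating the pack
```

With these placeholder values, the supply drops by more than a volt each time the motors start, which could be enough to upset logic circuitry sharing the same battery.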
After documenting and becoming very familiar with what is supposed to be happening in the robot, you will spend some time setting up experiments to observe what is actually happening. The effort required for this is not trivial and will test your ingenuity to come up with different methods of observing what is happening while having a limited budget and resources for test equipment. Spending a few minutes thinking about the problem can result in some very innovative ways of observing the different aspects of the robot in operation and help guide you to the root cause of the problem.
You will find that some failures are intermittent; that is to say they will happen at seemingly random intervals. By characterizing the operation of the robot and comparing the results to the documented expected operation, you should find situations where the operating parameters are outside the design parameters, leading to the opportunity for failure either immediately or at some later time. Once you become familiar with documenting the expected operation of your robot as well as characterizing different robot problems, you'll discover that there really is no such thing as a random failure. Each failure mode has a unique set of parameters that will cause the failure and allow you to understand exactly what is happening.
The conditions leading up to a mechanical failure can be extremely difficult to observe on the basic robot. Plastic or cardboard arrows attached to different points in the robot's structure will help illustrate flexing that is not easily observed by the naked eye. A small cup of water can also be used to show the operating angle of different components of the robot as well as the acceleration of the robot under different circumstances. A digital photograph of the robot in operation, with indicators such as arrows and cups of water in view, will help you observe deformations of the robot's structure; you can measure them by printing out the picture and measuring the angles with a protractor.
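The cup of water can be turned into a rough accelerometer. Under steady horizontal acceleration a, the water surface tilts from level by an angle θ with tan θ = a/g. The sketch below converts an angle measured from a photograph into an acceleration. It assumes the cup sits on a level part of the robot, the acceleration is steady, and sloshing is ignored; the function name is illustrative.

```python
import math

G = 9.81  # standard gravity in m/s^2


def acceleration_from_tilt(tilt_deg):
    """Horizontal acceleration (m/s^2) implied by the tilt of the water surface."""
    return G * math.tan(math.radians(tilt_deg))


if __name__ == "__main__":
    # A 10 degree tilt measured with a protractor corresponds to about 1.73 m/s^2.
    print(round(acceleration_from_tilt(10.0), 2))
```

If the part of the chassis carrying the cup is itself flexing, the measured angle combines the tilt of the chassis with the tilt caused by acceleration, so a second cup on a rigid part of the robot makes a useful reference.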
When searching for electrical problems during the operation of the robot, your best friend is the LM339 quad comparator along with a few LEDs and potentiometers. The potentiometers are wired as voltage dividers and used to provide different extreme values for the different electrical parameters that are going to be measured (Fig. 39-1). When the robot exceeds one of these parameters, an LED wired to the LM339 comparator output will light. This allows you to easily observe any out-of-tolerance electrical conditions during robot operation, requiring just a few minutes of setup. Depending on how the robot is powered, you may have to add a separate power supply (a 9-V radio battery works well to allow a good range on the potentiometers) to the circuit in order to test it if you suspect the robot’s power supply is sagging.
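To help choose the potentiometer settings, the following sketch models the setup in software: each potentiometer is treated as an unloaded voltage divider across the reference supply, and two comparator channels form a window that flags readings below or above the allowed range. The function names are hypothetical, the 9 V supply is the radio battery mentioned above, and the wiper positions are placeholders. On real hardware, which comparator input the signal is wired to determines whether an LED lights above or below its threshold.

```python
def divider_threshold_v(supply_v, wiper_fraction):
    """Unloaded wiper voltage of a potentiometer wired across the supply."""
    if not 0.0 <= wiper_fraction <= 1.0:
        raise ValueError("wiper_fraction must be between 0 and 1")
    return supply_v * wiper_fraction


def window_leds(measured_v, low_threshold_v, high_threshold_v):
    """Model two comparator channels: one LED for undervoltage, one for overvoltage."""
    return {
        "under": measured_v < low_threshold_v,
        "over": measured_v > high_threshold_v,
    }


if __name__ == "__main__":
    low = divider_threshold_v(9.0, 0.5)   # about 4.5 V
    high = divider_threshold_v(9.0, 0.6)  # about 5.4 V
    print(window_leds(4.2, low, high))    # the "under" LED would light
    print(window_leds(5.0, low, high))    # neither LED would light
```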
If a programming failure is suspected, the best method of characterizing what is happening is to record the inputs followed by the outputs. Again, LEDs are your best tool for observing which inputs are causing the bad outputs. You can also use an LCD (although this will require you to stand over the robot to see exactly what is happening) or output a different sound or message when there are specific inputs to the microcontroller. Once you have the actual inputs and output commands, you can set up a state diagram (showing the changing inputs and outputs) to help you understand exactly what the program is doing in specific cases.
When coming up with methodologies for observing what is happening in the robot when the failure is taking place, remember the observer effect (popularly associated with Heisenberg's Uncertainty Principle): the apparatus used to measure a parameter can affect the measurement itself. This is a very real possibility in robotics when you are trying to characterize a failure; often the equipment used to record the failure will end up changing the behavior of the robot, hiding the true nature of the problem. For example, adding an LCD to display the inputs and outputs of the robot's microcontroller during operation may require you to stand over it to monitor the LCD's output. The robot may then detect you, and your presence may cause it to behave differently than if you were farther away.
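One way to turn a recorded log into the raw material for a state diagram is to keep only the points where the inputs or the output change. The sketch below assumes each sample is a pair of (inputs, output command); the bumper inputs and command names are invented for illustration.

```python
def record_transitions(samples):
    """Collapse a log of (inputs, output) samples into the distinct changes, in order."""
    transitions = []
    previous = None
    for sample in samples:
        if sample != previous:  # keep a sample only when something changed
            transitions.append(sample)
            previous = sample
    return transitions


if __name__ == "__main__":
    # Hypothetical log: inputs are (left_bumper, right_bumper), output is the motor command.
    log = [
        ((0, 0), "forward"),
        ((0, 0), "forward"),
        ((1, 0), "turn_right"),
        ((1, 0), "turn_right"),
        ((0, 0), "forward"),
    ]
    for inputs, output in record_transitions(log):
        print(inputs, "->", output)
```

Each line printed corresponds to one state in the diagram, and comparing the sequence with the expected behavior shows which input combination produced the bad output.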
With a clear understanding of how the robot should behave and how it is actually behaving, the differences should become obvious and allow you to start forming theories about the root cause of the problem. When you are hypothesizing about the problem, it is very important to (a) keep an open mind as to the cause of the problem and (b) avoid trying to come up with solutions, no matter how obvious they seem. It is easy to short-circuit this process and decide upon an obvious fix without working through the rest of the failure analysis.
Keeping an open mind is extremely difficult. To force yourself to look at different explanations, you should try to come up with at least three different possible root causes for the problem.
To illustrate this point, consider the case of a differentially driven robot with a light plastic frame that scrapes along the ground during changes in robot direction at the point where the batteries are mounted (see Fig. 39-2). As well as scraping, the robot seems to operate erratically when the chassis comes into contact with the running surface. These observations are confirmed by photographing the robot during changes in direction.
With this information, you could form the following theories regarding the problems the robot is having:
1. The robot is accelerating too quickly, and the chassis is distorting during starting, stopping, and direction changing.
2. The inertia of the battery pack is causing the chassis to flex.
3. The motors are too powerful, and they are warping the chassis during startup or stopping.
4. The caster is digging into the running surface during operation, causing the chassis to distort.
A generic theory could be that the chassis isn't strong enough, but this will steer you toward a single solution (strengthening the chassis), while the four expanded theories give you a number of ideas to try in order to fix the problem.
Along with these four theories, you could probably come up with more; compare each against the data that you collected in the first two steps and see which hypothesis best fits the data. There’s a good chance that you will have to go back and look at different aspects of the robot; for example, if the caster were the problem, it should show some indications of high drag on a large part of its surface, not just the small area where it was in contact with the running surface.
Once you are comfortable with understanding the different possible root causes of the failure, you can start listing possible corrective actions. As with the multiple possible root causes listed in the previous step, you should also list multiple possible corrective actions. While some may jump out at you, after considering different options, a much more elegant and easier-to-implement solution may become obvious.
In the previous section, it was mentioned that the obvious possible root cause of the robot scraping against the ground is that the chassis is simply not strong enough. The obvious solution to this problem is to strengthen the chassis.
While it may fix the problem, it is probably not the optimal solution, as strengthening the chassis could require you to effectively redesign and rebuild the robot. Before embarking on this large amount of effort, you could review a number of different solutions:
1.  Strengthening the chassis.
2.  Using a lighter battery pack.
3.  Decreasing the robot’s acceleration.
4.  Relocating the battery pack.
5.  Changing the front caster.
6.  Using larger drive wheels.
The amount of work required for each of the different solutions varies. Along with the amount of work for each solution, document its potential cost and the length of time needed to implement it so that you can choose the best possible fix.
It is usually possible to quickly rig up a sample solution to test the effectiveness of a specific repair action before making it permanent. It is actually preferable to do this, as it may become obvious that some repair actions are not going to be effective or are going to cause more problems than they solve. This is why multiple possible root causes are listed along with multiple repair actions for each root cause.
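A notebook table works perfectly well for this comparison, but the same bookkeeping can be sketched in a few lines of Python. The effort and cost figures below are invented placeholders, and the single combined score (cost plus effort valued at an assumed hourly rate) is just one possible way to rank the options.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    name: str
    effort_hours: float
    cost_dollars: float


def rank_fixes(fixes, dollars_per_hour=10.0):
    """Sort candidate fixes by cost plus effort converted to dollars, cheapest first."""
    return sorted(fixes, key=lambda f: f.cost_dollars + f.effort_hours * dollars_per_hour)


if __name__ == "__main__":
    # Placeholder estimates for three of the solutions listed above.
    candidates = [
        Fix("Strengthening the chassis", effort_hours=8.0, cost_dollars=20.0),
        Fix("Relocating the battery pack", effort_hours=1.0, cost_dollars=2.0),
        Fix("Decreasing the robot's acceleration", effort_hours=0.5, cost_dollars=0.0),
    ]
    for fix in rank_fixes(candidates):
        print(fix.name)
```

The ranking only suggests which fix to try first; whether a fix actually cures the problem still has to be established by testing.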
When testing the solution, don't spend a lot of time making it look polished; there's a good chance that it will not be as good as some other solution and you will end up having to tear down the fix and try another one. Don't be surprised if the most obvious cause of the problem and repair action aren't right; over time your ability to suggest the most effective fixes will improve, but when you are starting out, listing as many as possible and trying them all out will guide you to the most effective solution to your problem.
Finally, there's a chance that multiple corrective actions will produce the best result. For the example here, minimizing the distortion of the robot's chassis could be achieved by relocating the battery pack and using a lighter one. Testing multiple fixes together can result in even better solutions than if you were to doggedly look for a single, simple fix.
Remember to record the results of your tests (using the same testing apparatus as you used to characterize the problem). Being able to compare the robot's behavior before and after each fix, measured the same way, gives you confidence that you have found the best solution.
Once you have determined the "best" solution to the problem and implemented it, you should record it in a notebook for future use; chances are that even if you don't experience the same problem again, you will experience something like it. Keeping notes on what you did, along with what you saw and what you expected to see, will make documenting the expected state and characterizing the actual state much easier the next time.
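A simple way to compare recorded results is to average the same measurement before and after a trial fix. The sketch below assumes the measurement is one where lower is better, such as the chassis flex angle read from photographs; the function name and the numbers are made-up examples.

```python
from statistics import mean


def improvement(before, after):
    """Reduction in the mean of a measurement; positive means the fix helped (lower is better)."""
    return mean(before) - mean(after)


if __name__ == "__main__":
    # Hypothetical chassis flex angles in degrees, measured the same way each time.
    flex_before = [6.0, 8.0]
    flex_after_relocating_battery = [2.0, 4.0]
    print(improvement(flex_before, flex_after_relocating_battery))
```

Running the same comparison for each candidate fix, and for combinations of fixes, gives a consistent basis for choosing among them.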
Along with helping you with future problems, the notes will help you if the problem happens again (and there is a good chance that it will). In this case, you have a record of what has been tried and you can try something else.
Ideally the fix should be fairly simple; by using this process you will discover that problems you thought would require you to rebuild the robot can be resolved very easily. You will also find that after a while it takes you less time to work through the full failure analysis process than to perform a simple debug repair action. The added bonus is that your repairs will be a lot more reliable and less likely to break under extreme stress in the future.