FAILURE ANALYSIS
Anything that can go wrong, will go wrong.
—Murphy’s Law
Legend has it that the original “Murphy” was a U.S. Air Force officer who worked on
rocket sleds in the late 1940s and early 1950s. These sleds were used to learn about
the effects of high accelerations (g forces) on the human body as well as to design appropriate
restraint and safety systems for high-performance aircraft and spaceships. Computer
simulations and mechanical models of the human body were not available at the time, so
the tests were performed on a living person. The sleds were capable of tremendous speeds
(approaching the speed of sound) and there was a very high probability of accidents due to
mechanical failure. In readying a sled for a test and working through a myriad of problems,
Murphy reportedly muttered the comment for which he was to become immortalized. You
have probably already heard of Murphy’s Law; it has been referenced so many times in
books that it has become trite. It is included here to introduce the following comment
on it, one that you are sure to relate to as you work on your own robot designs:
Murphy was an optimist.
—Anonymous
Before trying to figure out how to fix a problem, you first have to determine where the
problem is occurring. The following three sections list the three primary sources of
problems and the types of failures typically ascribed to each. Once you determine where
the problem lies, you can start looking at how to fix it so that it does not recur. The later
sections of this chapter give a process for determining the root cause of a failure and
ensuring that the corrective action keeps the robot running until some other problem
surfaces.
39.1.1 MECHANICAL FAILURE
Mechanical problems are perhaps the most common failures in robots. The typical source of
the problem is that the materials or the joining methods you used were not strong enough.
Avoid overbuilding your robots (that tends to make them too expensive and heavy), but at
the same time strive to make them physically strong. Of course, strong is relative: a lightweight,
scarab-sized robot doesn't need the structure that a tricycle needs to support a
two-year-old. At the very least, however, your robot construction should support its own weight,
including batteries.
When possible, avoid slap-together construction, such as using electrical or duct tape.
These methods are acceptable for quick prototypes but are unreliable for long-term operation.
When gluing parts in your robot, select an adhesive that is suitable for the materials
you are using. Epoxy and hot-melt glues are among the most permanent. You may also
have luck with cyanoacrylate (CA) glues, though the bond may become brittle and weak
over time (a few years or more, depending on humidity and stress).
Use the pull test to determine if your robot construction methods are sound. Once you
have attached something to your robot—using glue, nuts and bolts, or whatever—give it a
healthy tug. If it comes off, the construction isn’t good enough. Look for a better way.
39.1.2 ELECTRICAL FAILURE
Electronics can be touchy, not to mention extremely frustrating, when they don’t work
right. Circuits that functioned properly in a solderless breadboard may no longer work once
you’ve soldered the components in a permanent circuit, and vice versa. There are many
reasons for this, including mistakes in wiring, unexpected capacitive and inductive effects,
and even variations in tolerances due to heat transfer.
Certain electronic circuit construction techniques are better suited for an active, mobile
robot. Wire-wrap is a fast way to build circuits, but its construction can invite problems. The
long wire-wrap pins can bend and short out against one another. Loose wires can come off.
Parasitic signals and stray capacitance can cause marginal circuits to work, then not work,
and then work again. For an active robot it may be better to use a soldered circuit board,
perhaps even a printed circuit board of your design (see Chapter 7, "Electronic Construction
Techniques," for more information).
Some electrical problems may be caused by errors in programming, weak batteries, or
unreliable sensors. For example, it is not uncommon for sensors to occasionally yield totally
unexpected results. This can be caused by design flaws inherent in the sensor itself, spurious
data (noise from a motor, for example), or corrupted or out-of-range data. Ideally, the
programming of your robot should anticipate occasional bad sensor readings and
ignore them. A perfectly acceptable approach is to throw out any sensor reading that is outside
the statistical model you have decided on (e.g., a sonar ping that says an object is 1048
ft away, when the average robotic sonar system has a maximum range of about 35 ft).
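As a minimal sketch of this throw-out approach, the following Python fragment rejects any sonar reading outside a plausible window before averaging. The 35-ft upper limit comes from the text; the 0.5-ft lower limit and the function names are assumptions made for illustration only.

```python
# Assumed limits for a hypothetical sonar ranger; substitute your sensor's real values.
SONAR_MIN_FT = 0.5    # assumption: closest distance the sensor can report
SONAR_MAX_FT = 35.0   # typical maximum range quoted in the text


def filter_readings(readings_ft, low=SONAR_MIN_FT, high=SONAR_MAX_FT):
    """Return only the readings that fall inside the plausible range."""
    return [r for r in readings_ft if low <= r <= high]


def average_distance(readings_ft):
    """Average the plausible readings; return None if every reading was rejected."""
    good = filter_readings(readings_ft)
    if not good:
        return None
    return sum(good) / len(good)


pings = [12.0, 12.5, 1048.0, 11.5]   # the 1048-ft ping is discarded
print(average_distance(pings))        # 12.0
```

Returning None when every reading is rejected forces the calling code to decide what to do (stop, retry, or use the last good value) rather than acting on bad data.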
39.1.3 PROGRAMMING FAILURE
As more and more robots use computers and microcontrollers as their brains, programming
errors are fast becoming one of the most common causes of failure. There are three basic
kinds of programming bugs.
- Compile bug or syntax error. You can instantly recognize these because the program
compiler or downloader will flag these mistakes and refuse to continue. You must fix the
problem before you can transfer the program to the robot’s microcontroller or computer.
- Run-time bug, caused by a disallowed condition. A run-time bug isn’t caught by the
compiler. It occurs when the microcontroller or computer attempts to run the program.
An example of a common run-time bug is the use of an out-of-bounds element in an
array (for instance, trying to assign a value to the thirty-first element in a 30-element
array). Run-time bugs may also be caused by missing data, such as looking for data on
the wrong input pin of a microcontroller.
- Logic bug, caused by a program that simply doesn't work as anticipated. Logic bugs
may be due to simple math errors (you meant to add, not subtract) or by mistakes in coding
that cause a different behavior than you anticipated.
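The three kinds of bugs can be illustrated with a short Python sketch. It is only an illustration: the variable and function names are invented, and Python reports an out-of-bounds index with an exception, whereas a C program on a small microcontroller may silently overwrite other memory instead.

```python
# 1. Syntax error: caught before the program runs at all.
try:
    compile("speed = (left + right", "<robot>", "exec")   # missing ")"
    syntax_ok = True
except SyntaxError:
    syntax_ok = False

# 2. Run-time bug: valid code that fails only when it is executed.
readings = [0] * 30            # 30 elements, valid indexes 0..29
try:
    readings[30] = 5           # the thirty-first element does not exist
    runtime_ok = True
except IndexError:
    runtime_ok = False


# 3. Logic bug: runs without complaint but computes the wrong thing.
def total_travel_buggy(leg1, leg2):
    return leg1 - leg2         # meant to add, not subtract


def total_travel_fixed(leg1, leg2):
    return leg1 + leg2


print(syntax_ok, runtime_ok)                                # False False
print(total_travel_buggy(3, 2), total_travel_fixed(3, 2))   # 1 5
```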
As you become experienced in programming, you will get a lot faster at finding problems
as you understand where you normally make mistakes and learn how to use the tools at
your disposal for finding and fixing the problems. When you first start working on robots,
you will probably feel most uncomfortable about your skills in debugging programs, but as
you gain experience, you will be amazed at your ability to produce code that has relatively few
errors, operates efficiently, and can be debugged easily.
39.2 The Process of Fixing Problems
When given a problem to fix, most people try to find the easiest way to resolve it and
move on; the term used for this process is debugging. Looking for and implementing the
quick fix often yields an effective (and sometimes optimal) repair action. It does not resolve
all problems, however, and quite often masks them. If your only approach is to work on the most
likely cause of the problem, you will quickly become frustrated when that fails, unable to identify
either the reason for the problem or the most effective repair. For these cases, you will
need to perform a root cause failure analysis (often shortened to just failure analysis) to
determine exactly what the problem is and the best way to fix it.
The failure analysis outlined in the rest of this chapter applies to all three classes
of failures that you will encounter in your robots (mechanical, electrical, and programming),
which may seem surprising because the classes seem so different from one another. On the
surface, mechanical (chassis and drivetrain design and assembly) problems have nothing in common
with electrical or programming issues. What they do have in common is the process of
understanding how the structure, circuit, or program should work, characterizing the problem,
developing theories regarding what is actually happening, figuring out how to repair
the problem, and testing your solution before finally applying the fix to the robot. The
following set of actions may seem like a lot of work to fix a problem like a nut that has fallen
off due to vibration, but if you follow it faithfully, the skills you gain by fixing the simple
problems will make the more difficult ones a lot easier to solve and will prevent the simple
ones from happening again.
39.2.1 DOCUMENTING THE EXPECTED STATE
Do you understand exactly what should be happening in your robot? Chances are you have
a good idea of what the robot, or a part of the robot, should be doing at a given time but
you have probably not looked in detail at what is actually happening. For a robot that has a
failed glued plastic joint, you may have done an analysis of the forces on the joint when the
robot is stationary, but have you looked at what happens during acceleration and deceleration?
What about forces caused by vibration or large masses (such as batteries) shifting during
operation? The forces during movement as well as changes in movement must be
considered when looking at a mechanical failure.
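To see why the forces during movement matter, it can help to put rough numbers on them. The sketch below applies Newton's second law (F = m × a) to a battery pack; the 0.5-kg mass, the 1 m/s speed change, and the 0.2-s start-up time are made-up values for illustration, not measurements from any particular robot.

```python
G = 9.81  # m/s^2, standard gravity


def inertial_force_n(mass_kg, delta_v_m_s, delta_t_s):
    """Force needed to change the speed of a mass by delta_v in delta_t (F = m * a)."""
    return mass_kg * (delta_v_m_s / delta_t_s)


# Hypothetical battery pack and start-up profile.
battery_kg = 0.5
static_weight_n = battery_kg * G                          # load when stationary
start_force_n = inertial_force_n(battery_kg, 1.0, 0.2)    # extra load when starting

print(f"static weight: {static_weight_n:.2f} N, acting downward")
print(f"start-up load: {start_force_n:.2f} N, acting along the direction of travel")
```

With these assumed numbers, the battery mount sees a sideways load of roughly half the pack's weight every time the robot starts or stops, a load that a stationary analysis never reveals.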
Similarly, for an electrical problem, do you understand what the actual currents flowing
through the circuitry are? Starting and stopping the robot's drive motors under
load requires greater currents than running them unloaded on a test bench. Have you calculated
the temperatures of different components during operation as well as their effect
on components close to them? If the robot seems to miss detecting objects in front of it,
have you allowed for switch bouncing or changing fields of view during
operation? Electrical problems can be especially vexing when you are using third-party
designs or circuitry. To predict what should be happening, you should review the basic
electrical laws and make sure you fully understand the basic electrical formulas and conventions.
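Switch bouncing is one expected behavior that is easy to model before looking at the real hardware. The following sketch shows one common software approach: a change of state is accepted only after several consecutive samples agree. The function name and the three-sample threshold are assumptions for illustration, not a recommendation for any particular switch.

```python
def debounce(samples, stable_count=3):
    """Return the debounced state seen after each raw sample.

    A change of state is accepted only after `stable_count`
    consecutive samples agree on the new value.
    """
    state = samples[0] if samples else None
    candidate, run = state, 0
    out = []
    for s in samples:
        if s == state:
            run = 0                      # back to the accepted state: bounce ignored
        else:
            if s == candidate:
                run += 1                 # same new value seen again
            else:
                candidate, run = s, 1    # a different new value: start counting
            if run >= stable_count:
                state, run = s, 0        # new value held long enough: accept it
        out.append(state)
    return out


raw = [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]     # the lone 1 at index 2 is a bounce
print(debounce(raw))                      # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Comparing a trace like this against what the robot actually reports is one way to tell a genuine missed detection from a bounce that the software filtered out or failed to filter.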
Finally, for software, can you trace through the source code to understand what should
be happening at any one particular time? How is the operation of the robot controller documented
for different situations with varying inputs? A very important tool for understanding
the operation of software is the simulator, together with the time spent working out
how the application should behave. Many robot software applications are written
quickly and debugged continuously to get the robot working as desired. This makes documenting
the software a difficult and confusing chore unless you are very careful to keep
track of different versions of software and the changes made to them. You will often find it
easier to go back over the source code and try to map out how it is supposed to work and
respond to different inputs.
Documenting the expected operation of the robot at the time of failure is a time-consuming
task, but one that is critical to finding and ultimately fixing the problem. In many
cases when you start understanding the operation of the robot at the level of detail needed
to find and fix the problem, the reason for the problem will become apparent, but you
should refrain from implementing the apparent fix until you have worked through the following
five steps.
39.2.2 CHARACTERIZING THE PROBLEM
After documenting and becoming very familiar with what is supposed to be happening in
the robot, you will spend some time setting up experiments to observe what is actually
happening. The effort required for this is not trivial and will test your ingenuity to come up with
different methods of observing what is happening within a limited budget and with limited
test equipment. Spending a few minutes thinking about the problem can result
in some very innovative ways of observing the different aspects of the robot in operation
and help guide you to the root cause of the problem.
You will find that some failures are intermittent; that is to say they will happen at seemingly
random intervals. By characterizing the operation of the robot and comparing the
results to the documented expected operation, you should find situations where the operating
parameters are outside the design parameters, leading to the opportunity for failure
either immediately or at some later time. Once you become familiar with documenting the
expected operation of your robot as well as characterizing different robot problems, you'll
discover that there really is no such thing as a random failure. Each failure mode has a
unique set of parameters that will cause the failure and allow you to understand exactly what
is happening.
The conditions leading up to a mechanical failure can be extremely difficult to observe
on the basic robot. Plastic or cardboard arrows attached to different points in the robot's
structure will help illustrate flexing that is not easily observed by the naked eye. A small cup
of water can also be used to show the operating angle of different components of the robot
as well as the acceleration of the robot during different circumstances. A digital
photograph of the robot in operation, with indicators such as arrows and cups of water in view, will
help you to observe deformations of the robot's structure; you can measure them
by printing out the picture and measuring the angles with a protractor.
When searching for electrical problems during the operation of the robot, your best
friend is the LM339 quad comparator along with a few LEDs and potentiometers. The
potentiometers are wired as voltage dividers and used to set the extreme allowable values
for the different electrical parameters that are going to be measured (Fig. 39-1). When the
robot exceeds one of these limits, an LED wired to the LM339 comparator output
will light. This allows you to easily observe any out-of-tolerance electrical conditions during
robot operation, and it requires just a few minutes of setup. If you suspect that the robot’s
power supply is sagging, you may have to power the test circuit from a separate supply
(a 9-V radio battery works well and allows a good range on the potentiometers).
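Setting a limit with a potentiometer is just the voltage-divider formula, V = Vsupply × Rbottom / (Rtop + Rbottom). The sketch below computes the limit and models the comparator's decision. The 10k potentiometer and its wiper position are assumed values, and the divider is treated as unloaded, which is reasonable only because a comparator input draws very little current. Whether the LED lights above or below the limit depends on how it is wired; the LM339 has open-collector outputs, so an LED connected to one typically lights when the output pulls low.

```python
def threshold_voltage(v_supply, r_top_ohms, r_bottom_ohms):
    """Unloaded voltage-divider output: V = Vsupply * Rbottom / (Rtop + Rbottom)."""
    return v_supply * r_bottom_ohms / (r_top_ohms + r_bottom_ohms)


def limit_exceeded(measured_v, limit_v):
    """Model of the comparator decision: True when the measured voltage is above the limit."""
    return measured_v > limit_v


# Hypothetical setup: 9-V battery, 10k potentiometer with the wiper 40% up from ground.
limit = threshold_voltage(9.0, 6000.0, 4000.0)
print(f"limit = {limit:.2f} V")
for sample in (3.0, 3.5, 4.1):
    print(sample, limit_exceeded(sample, limit))
```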
If a programming failure is suspected, you will discover that the best method of characterizing
what is happening is to record the inputs followed by the outputs. Again, LEDs
are your best tool for observing which inputs are causing the bad outputs. You can also
use an LCD (although this will require you to stand over the robot to see exactly what is happening)
or output a different sound or message when there are specific inputs to the microcontroller.
Once you have the actual inputs and output commands, you can set up a state
diagram (showing the changing inputs and outputs) to help you understand exactly what
the program is doing in specific cases.
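Once the inputs and outputs have been written down, even a few lines of code can turn the log into a table and flag the cases that disagree with the documented behavior. The following is a sketch only: the two-bumper robot, its motor commands, and the expected table are invented for illustration.

```python
from collections import defaultdict

# Hypothetical log: (left_bumper, right_bumper) inputs and the motor command observed.
log = [
    ((0, 0), "forward"),
    ((1, 0), "turn_right"),
    ((0, 1), "turn_left"),
    ((1, 1), "forward"),     # suspicious: both bumpers hit, still driving forward
    ((0, 0), "forward"),
]

# Documented expected state for the same hypothetical robot.
expected = {
    (0, 0): "forward",
    (1, 0): "turn_right",
    (0, 1): "turn_left",
    (1, 1): "reverse",
}


def summarize(entries):
    """Map each input combination to the set of outputs observed for it."""
    table = defaultdict(set)
    for inputs, output in entries:
        table[inputs].add(output)
    return dict(table)


def mismatches(entries, expected_outputs):
    """List (inputs, observed, expected) for every entry that disagrees."""
    return [(i, o, expected_outputs.get(i))
            for i, o in entries if expected_outputs.get(i) != o]


print(summarize(log))
print(mismatches(log, expected))   # [((1, 1), 'forward', 'reverse')]
```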
When coming up with methodologies for observing what is happening in the robot when
the failure is taking place, remember the observer effect (popularly associated with
Heisenberg's Uncertainty Principle): the apparatus used to measure a parameter can affect
the measurement itself. This is a real possibility in robotics when you are trying to
characterize a failure; often the equipment used to record the failure ends up changing the
behavior of the robot, hiding the true nature of the problem. For example, adding an LCD to
display the inputs and outputs of the robot's microcontroller during operation may require you
to stand over it to read the display. The robot may then detect you, and your presence may
cause it to behave differently than it would if you were farther away.
39.2.3 HYPOTHESIZING ABOUT THE PROBLEM
With a clear understanding of how the robot should behave and how it is actually behaving,
the differences should become obvious and allow you to start forming theories regarding
the root cause of the problem. When you are hypothesizing about the problem,
it is very important to (a) keep an open mind as to the cause of the problem and (b) avoid
trying to come up with solutions, no matter how obvious they seem. It is easy to short-circuit
this process and decide upon an obvious fix without working through the rest of the failure
analysis.
Keeping an open mind is extremely difficult. To force yourself to look at different possibilities,
you should try to come up with at least three different possible root causes for the
problem.
To illustrate this point, consider the case of a differentially driven robot with a light plastic
frame that scrapes along the ground during changes in robot direction at the point where
the batteries are mounted (see Fig. 39-2). As well as scraping, the robot seems to operate
erratically when the chassis comes into contact with the running surface.
These observations are confirmed by photographing the robot during changes in direction.
With this information, you could make the following theories regarding the problems the
robot is having:
1. The robot is accelerating too quickly, and the chassis is distorting during starting, stopping,
and direction changing.
2. The inertia of the battery pack is causing the chassis to flex.
3. The motors are too powerful, and they are warping the chassis during startup or stopping.
4. The caster is digging into the running surface during operation, causing the chassis to
distort.
A generic theory could be that the chassis isn't strong enough, but this will steer you
toward a single solution (strengthening the chassis), while the four expanded theories give
you a number of ideas to try in fixing the problem.
Along with these four theories, you could probably come up with more that you can
compare against the data that you collected in the first two steps to see which hypothesis
best fits the data. There’s a good chance that you will have to go back and look at different
aspects of the robot; for example, if the caster were the problem, it should show some indications
of high drag on a large part of its surface, not just the small area where it was in
contact with the running surface.
39.2.4 PROPOSING CORRECTIVE ACTIONS
Once you are comfortable with your understanding of the different possible root causes of the failure,
you can start listing possible corrective actions. As with the multiple possible root
causes listed in the previous step, you should list multiple possible corrective
actions. While some may jump out at you, after considering different options, a more
elegant, easier-to-implement solution may become obvious.
In the previous section, it was mentioned that the obvious possible root cause of the
robot scraping against the ground is that the chassis is simply not strong enough. The obvious
solution to this problem is to strengthen the chassis.
While it may fix the problem, it is probably not the optimal solution, as strengthening the
chassis could require you to effectively redesign and rebuild the robot. Before embarking on
this large amount of effort, you could review a number of different solutions:
1. Strengthening the chassis.
2. Using a lighter battery pack.
3. Decreasing the robot’s acceleration.
4. Relocating the battery pack.
5. Changing the front caster.
6. Using larger drive wheels.
The amount of work required for each of the different solutions varies. Along with documenting
the amount of work for each solution, document its potential cost and the length of time
needed to implement it so that you can choose the best
possible fix.
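One simple way to keep this documentation is a small table of candidate fixes that can be sorted by estimated effort. In the sketch below every number is a placeholder, not data from a real robot, and the scoring rule (hours times an assumed value per hour, plus parts cost) is just one arbitrary choice. The ranking reflects effort only; it says nothing about whether a fix actually works.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    name: str
    hours: float      # estimated work, placeholder value
    cost: float       # estimated parts cost in dollars, placeholder value


def rank_fixes(fixes, hour_value=10.0):
    """Sort candidate fixes by a simple score: hours * hour_value + cost (lowest first)."""
    return sorted(fixes, key=lambda f: f.hours * hour_value + f.cost)


# The six candidate solutions from the list above, with made-up estimates.
candidates = [
    Fix("Strengthening the chassis", 10.0, 30.0),
    Fix("Using a lighter battery pack", 1.0, 25.0),
    Fix("Decreasing the robot's acceleration", 0.5, 0.0),
    Fix("Relocating the battery pack", 2.0, 0.0),
    Fix("Changing the front caster", 1.0, 5.0),
    Fix("Using larger drive wheels", 2.0, 15.0),
]

for fix in rank_fixes(candidates):
    print(f"{fix.name}: {fix.hours} h, ${fix.cost:.2f}")
```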
39.2.5 TESTING FIXES
It is usually possible to quickly rig up a sample solution to test the effectiveness of a specific
repair action before making it permanent. It is actually preferable to do this, as it may
become obvious that some repair actions are not going to be effective or are going to cause
more problems than they solve. This is why multiple possible root causes are listed
along with multiple repair actions for each root cause.
When testing the solution, don't spend a lot of time making it look polished; there's a
good chance that it will not be as good as some other solution and you will end up having
to tear down the fix and try another one. Don't be surprised if the most obvious cause of
the problem and repair action aren't right; over time your ability to suggest the most effective
fixes will improve, but when you are starting out, listing as many as possible and trying
them all out will guide you to the most effective solution to your problem.
Finally, there's a chance that multiple corrective actions will produce the best result. For
the example here, minimizing the distortion of the robot's chassis could be achieved by relocating
the battery pack and using a lighter one. Testing multiple fixes together can result in
even better solutions than if you were to doggedly look for a single, simple fix.
Remember to record the results of your tests (using the same testing apparatus as you
used to characterize the problem). Being able to compare the robot's behavior before and
after each fix will give you confidence that you have found the optimal solution.
39.2.6 IMPLEMENTING AND RELEASING THE SOLUTION
Once you have determined the "best" solution to the problem and implemented it, you
should record it in a notebook for future use; chances are that even if you don't experience
the same problem again, you will experience something like it. Keeping notes listing what you did along with what
you saw and expected to see will make documenting the expected state and
characterizing the actual state much easier next time.
Along with helping you with future problems, the notes will help you if the problem happens
again (and there is a good chance that it will). In this case, you have a record of what
has been tried and you can try something else.
Ideally the fix should be fairly simple; by using this process you will discover that problems
you thought would require you to rebuild the robot can be resolved very easily. You will
also find that after a while it takes you less time to work through the full failure analysis
process than to perform a simple debug repair action. The added bonus is that your repairs
will be a lot more reliable and less likely to break under extreme stress in the future.
To learn more about . . .     Read
Interfacing issues            Chapter 38, "Integrating the Blocks"
Mechanical structures         Chapter 3, "Structural Materials"
Electrical theory             Chapter 5, "Electrical Theory"
Programming operations        Chapter 13, "Programming Fundamentals"