A failure mode is a way in which your software or system fails that you did not expect and from which it does not directly recover. Failure mode analysis is a systematic examination of the general ways the program or system can fail. Development teams at small companies often do not systematically examine how their product or service can break. Instead, most wait for a failure to occur and then patch the system to fix it.
Developers at small companies usually focus on making the product work rather than on looking for what will cause it to fail. Breaking the product is left to quality assurance. However, the QA team does not have the insight into the internals of the code that is required to perform a proper risk analysis. Engineers themselves need to analyze the risks and failure modes of every product or system.
A failure mode review must examine the system as a whole and in parts. Failures can occur in components or in the interactions of several components; some single components might show no obvious failure issues, but their interactions with other components can cause the system to break. In addition, a review must consider how unexpected customer data or usage can affect the system, including the effects of unusual data, overload of data streams, data size issues, data rate issues, and timing issues.
Abnormal external events can also cause problems for the system and should be studied. Using a system diagram as your reference point, ask a series of questions about what could happen, such as the following:
What happens if third-party vendors do not provide the bandwidth needed?
What happens if someone cuts a cable or a machine goes down?
If a system loses data, how does its recovery mechanism work?
What synchronization problems can be identified?
What happens when the wrong data enters the system?
How does the system respond to data provided in the wrong order?
How will the system detect unauthorized access?
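Some of these questions can be turned directly into code and tests. The following is a minimal sketch, not a prescribed implementation, of a guard that answers two of them: what happens when the wrong data enters the system, and how the system responds to data provided in the wrong order. The names Record, IntakeGuard, and MAX_PAYLOAD_BYTES are hypothetical and chosen only for illustration, as is the assumption that each record carries a sender-assigned sequence number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_PAYLOAD_BYTES = 1024  # hypothetical size limit for this sketch


@dataclass
class Record:
    seq: int        # sequence number assigned by the sender (assumed)
    payload: bytes


class IntakeGuard:
    """Rejects records that are empty, oversized, or out of order."""

    def __init__(self) -> None:
        self.last_seq = -1
        # Each rejection is kept as (sequence number, reason) for later review.
        self.rejected: List[Tuple[int, str]] = []

    def accept(self, record: Record) -> bool:
        """Return True if the record passes every check, False otherwise."""
        reason = self._check(record)
        if reason is not None:
            self.rejected.append((record.seq, reason))
            return False
        self.last_seq = record.seq
        return True

    def _check(self, record: Record) -> Optional[str]:
        """Return the reason for rejection, or None if the record is acceptable."""
        if not record.payload:
            return "empty payload"
        if len(record.payload) > MAX_PAYLOAD_BYTES:
            return "payload too large"
        if record.seq <= self.last_seq:
            return "out of order or duplicate"
        return None
```

Writing the expected response down in this form makes the decision explicit: here a bad record is rejected and logged rather than crashing the intake path or being silently accepted. Whether rejection is the right response for a given product is exactly the kind of question the review should settle.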
Scale the analysis to the harm a failure would cause. More rigorous failure mode analysis approaches exist, but most products need only a lighter examination; reserve the intensive approach for cases in which a failure could have an extreme adverse effect on the customer.
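One lightweight way to scale the effort is to score each failure mode and review the highest scores first. The sketch below assumes a simple severity-times-likelihood score on 1-to-5 scales. Both the scales and the threshold for a deeper review are illustrative assumptions, not values taken from any standard, and the names FailureMode, rank, and needs_deep_review are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    description: str
    severity: int    # assumed scale: 1 (minor) to 5 (extreme effect on the customer)
    likelihood: int  # assumed scale: 1 (rare) to 5 (expected)

    @property
    def risk(self) -> int:
        # Simple product score; other weightings are equally plausible.
        return self.severity * self.likelihood


def rank(modes: List[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


def needs_deep_review(mode: FailureMode, severity_threshold: int = 5) -> bool:
    """Flag modes severe enough to justify the more intensive analysis."""
    return mode.severity >= severity_threshold


modes = [
    FailureMode("third-party vendor does not provide needed bandwidth", 3, 3),
    FailureMode("system loses data and recovery fails", 5, 2),
    FailureMode("data arrives in the wrong order", 2, 4),
]

for mode in rank(modes):
    flag = "deep review" if needs_deep_review(mode) else "standard review"
    print(f"{mode.risk:>2}  {flag:<15}  {mode.description}")
```

The scores matter less than the ordering and the discussion they force. A mode with low likelihood but extreme severity still gets the intensive treatment, which matches the exception described above.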
Requiring a systematic analysis of failure modes will improve the reliability of your product or system. As a manager, require an analysis for every major release of a system. Perform this analysis early in the development cycle and act on any issues uncovered.