Failure Mode and Effect Analysis (FMEA) is a widely used structured analysis method for assuring product reliability and safety. It identifies how a product can fail (its failure modes), the causes of those failures, and the effects of the failures on system or product performance.
FMEA is commonly used after a designed product is sold and in use, and uses available data about actual failures to identify and prioritize problems in the product's design.
However, FMEA can and should also be used early in the product development process to help catch and remedy design flaws as early as possible.
In many ways, FMEA is a more structured way to implement the “why, why, why” method of finding root causes of undesired situations (i.e., failures).
FMEA can be used for various other kinds of design analyses, including:
A failure mode is a way in which a system or component might fail. Examples include yielding, ductile rupture, fatigue, wear, and impact. Many other modes result from interaction with other systems or users, e.g., entering the wrong data into a program or executing tasks in the wrong order. Some typical generic modes of failure include:
There are two ways in which FMEA can be applied:
The bottom-up approach is used more often, typically because short lead times leave no room for a top-down analysis during the design process itself.
The main interest here is in the top-down approach because, from an engineering standpoint, it is more likely to yield reliable designs over entire development programs.
NOTES:
Depending on how much information is available about a product, there are three general ways to “measure” failures and their effects.
Failures are generally measured in terms of their criticality, which is typically the sum of a failure's severity and its probability. Details will vary from one implementation of FMEA to another.
Sometimes a third element, detection, is also added. Detection can be a probabilistic or qualitative measure of the likelihood that the failure mode will actually be detected before the failure occurs.
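As a minimal sketch of the additive scoring described above, the Python function below sums integer ratings for severity and probability, and optionally detection. The function name `criticality` and the convention that a larger detection rating means the failure is harder to detect in advance are assumptions made for illustration; actual FMEA implementations vary.

```python
from typing import Optional


def criticality(severity: int, probability: int,
                detection: Optional[int] = None) -> int:
    """Return an additive criticality score (hypothetical helper).

    Assumption: each rating is an integer where a larger value is worse.
    For detection, a larger value means harder to detect in advance.
    """
    score = severity + probability
    if detection is not None:
        score += detection
    return score


# Two-factor and three-factor scores for the same illustrative ratings.
print(criticality(severity=3, probability=2))               # 5
print(criticality(severity=3, probability=2, detection=4))  # 9
```

Keeping detection optional mirrors the text: the two-factor score is the default, and the third factor is added only when it is being considered.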
There is rarely enough information, time, or money to conduct a complete FMEA, so it is important to decide up front the level of detail to which the FMEA will be carried out.
An FMEA should also be scoped to determine the effects of failure modes on specific factors, such as safety, mission outcomes, or repair costs.
Failures occur due to activities by the product or its users/environment. Activities involve interactions between product components. At the very least, then, a preliminary systems design must have been carried out.
Typically, systems are represented in some kind of block diagram indicating the major functional components of the system and their interactions. Such schematics are often called product architectures.
Developing a product architecture is a key element of systems design.
For each system component (not necessarily parts), identify possible failure modes, including how (operationally) the failure might happen.
For example, one failure mode is corrosion, which might cause a metal pipe under a kitchen sink to develop a leak. The how of this failure mode might be: water and foreign materials will act chemically on the metal over time.
This is where the “why, why, why” method can come in again.
The root cause of a failure is the most basic reason for a failure that can be reasonably determined. For example, the root cause of the leaky pipe mentioned above might be a lack of appropriate coating on the inside surface of the pipe.
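One possible way to record these items is sketched below in Python, using the leaky pipe as the entry. The class name `FailureModeRecord` and its field names are hypothetical and chosen only for illustration; the field values are taken from the example in the text, and the effects are left empty to be filled in during the next step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureModeRecord:
    """One FMEA entry for a system component (hypothetical structure)."""
    component: str
    failure_mode: str
    mechanism: str        # how, operationally, the failure happens
    root_cause: str       # most basic reason that can reasonably be determined
    effects: List[str] = field(default_factory=list)  # filled in later


leaky_pipe = FailureModeRecord(
    component="metal pipe under a kitchen sink",
    failure_mode="corrosion",
    mechanism="water and foreign materials act chemically on the metal over time",
    root_cause="lack of appropriate coating on the inside surface of the pipe",
)

print(leaky_pipe.failure_mode, "<-", leaky_pipe.root_cause)
```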
Given a failure mode and its cause, what _effects_ will the failure have on the “containing” system? In the case of the leaky pipe, some effects include:
Classify the severity of the failure effects. One often-used scale (from MIL-STD-882B) is:
Classify the probability of the failure mode. One often-used scale is:
If detection is also being considered, an analogous 4-point scale can be used to determine qualitative values for it.
Typically, the values for severity and probability (and detection) are just added together. The higher the sum, the greater the risk arising from that particular failure mode.
The tabulated form makes it easier to look for the greatest risks and allocate the most resources to dealing with them. FMEA tables are augmented with extra data explaining why particular choices of values were made.
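The tabulation and ranking can be sketched in a few lines of Python. Everything in this example is an assumption for illustration: the class `FmeaRow`, the three entries, their scores, and the rationale strings are invented, and the 4-point scales are assumed to run from 1 (least severe or least likely) to 4 (most severe or most likely).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FmeaRow:
    component: str
    failure_mode: str
    severity: int      # assumed 1 (least severe) .. 4 (most severe)
    probability: int   # assumed 1 (least likely) .. 4 (most likely)
    rationale: str     # extra data: why these values were chosen

    @property
    def risk(self) -> int:
        # Additive risk score, as described in the text.
        return self.severity + self.probability


# Hypothetical entries; the scores are invented for illustration only.
rows = [
    FmeaRow("pipe joint", "wear", 1, 2, "minor drip, seals age slowly"),
    FmeaRow("sink pipe", "corrosion", 2, 3, "slow leak, common in uncoated pipe"),
    FmeaRow("sink pipe", "impact", 3, 1, "sudden rupture, rarely struck"),
]

# Highest risk first, so the top rows receive the most resources.
ranked = sorted(rows, key=lambda r: r.risk, reverse=True)
for r in ranked:
    print(f"{r.risk}  {r.component:<10} {r.failure_mode:<10} {r.rationale}")
```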
Here are two simpler examples of FMEA tables, based on [Pug91]. The weights were calculated using pairwise comparison. The rankings were arrived at by summing individual rankings of severity and probability on the 4-point scale noted above. (Detection was not considered.)
Here is a sample FMEA for the _parts_ of a lead pencil.
Failure modes considered: Breaks, Falls Apart, Smudges, Wears Out. Each entry is a raw ranking followed by its weighted value in parentheses.

Part | Weight | Failure mode rankings | Total | %
---|---|---|---|---
Lead | 40% | 8 (3.2), 4 (1.6), 8 (3.2) | 8 | 48
Wood | 40% | 8 (3.2), 8 (3.2) | 6.4 | 39
Eraser | 5% | 2 (0.1), 2 (0.1), 8 (0.4), 8 (0.4) | 1 | 6
Eraser Holder | 15% | 8 (1.2) | 1.2 | 7
Total | | | 16.6 | 100
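The arithmetic behind the pencil table can be reproduced with a short Python sketch: each raw ranking is multiplied by the part's weight, the weighted values are summed per part, and each part's share of the grand total gives the percentage column. The variable names (`pencil`, `totals`, `shares`) are illustrative; the weights and rankings are those listed in the table.

```python
# Weights (in percent) and raw rankings taken from the pencil table.
pencil = {
    "Lead": (40, [8, 4, 8]),
    "Wood": (40, [8, 8]),
    "Eraser": (5, [2, 2, 8, 8]),
    "Eraser Holder": (15, [8]),
}

# Weighted total per part: weight x sum of that part's raw rankings.
totals = {part: weight * sum(rankings) / 100
          for part, (weight, rankings) in pencil.items()}
grand_total = sum(totals.values())

# Share of the overall risk carried by each part, as a whole percentage.
shares = {part: round(100 * t / grand_total) for part, t in totals.items()}

for part in pencil:
    print(f"{part:<14} {totals[part]:>5.1f} {shares[part]:>4d}%")
print(f"{'Total':<14} {grand_total:>5.1f}")
```

Under these numbers the lead and the wood together account for most of the total, which is where the table suggests design effort should go first.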
Here is a sample FMEA for the _functions_ of a paperclip.
Failure modes considered: Release paper, Wire snaps, Snags, Corrodes, Catches clothing. Each entry is a raw ranking followed by its weighted value in parentheses.

Function | Weight | Failure mode rankings | Total | %
---|---|---|---|---
Hold papers | 50% | 8 (4), 8 (4), 2 (1), 4 (2) | 11 | 71
Reusability | 20% | 4 (0.8), 2 (0.4), 4 (0.8), 2 (0.4) | 2.4 | 16
Toothpick | 10% | 1 (0.1), 3 (0.3) | 0.4 | 3
Stress relief | 20% | 8 (1.6) | 1.6 | 10
Total | | | 15.4 | 100
Once the likely failures have been estimated, the problems can be addressed before they actually occur. Addressing each failure mode can be treated as a design task in itself.
Generally, there are two strategies for treating failure modes: fail-safe (FS) and fail-operational, fail-safe (FOFS). In the first case, failure of a component causes the system to shut down in a controlled (safe) way. In the second case, the system remains operational (possibly only partially) after the first component failure, and shuts down safely after the second failure. Both strategies can be used in the same product. For example, in a multi-engine airplane, failure of one engine causes that engine to shut down in fail-safe mode, while the aircraft itself can continue to fly (FOFS).
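The difference between the two strategies can be sketched as a small state function in Python. The names `State` and `state_after`, and the strategy labels passed as strings, are assumptions for illustration; the sketch only encodes the behavior described above, counting component failures and reporting the resulting system state.

```python
from enum import Enum


class State(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded but operational"
    SAFE_SHUTDOWN = "safe shutdown"


def state_after(strategy: str, failures: int) -> State:
    """System state after a number of component failures (hypothetical helper).

    strategy: "FS" (fail-safe) or "FOFS" (fail-operational, fail-safe).
    """
    if strategy not in ("FS", "FOFS"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if failures < 0:
        raise ValueError("failures must be non-negative")
    if failures == 0:
        return State.OPERATIONAL
    if strategy == "FS":
        # Fail-safe: the first failure triggers a controlled shutdown.
        return State.SAFE_SHUTDOWN
    # FOFS: keep running after the first failure, shut down after the second.
    return State.DEGRADED if failures == 1 else State.SAFE_SHUTDOWN


for strategy in ("FS", "FOFS"):
    print(strategy, [state_after(strategy, n).value for n in range(3)])
```

In the airplane example, a single engine would follow the "FS" path while the aircraft as a whole follows the "FOFS" path.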
Some typical ways that failure modes are addressed are: