Bryan Christiansen

Operations

What Is Fault Tree Analysis and How To Perform It

Reverse engineering the root causes of a potential failure

Published: Wednesday, June 16, 2021 - 12:02

Many techniques can be used to find the root causes of asset failures and other important events we want to analyze. Fault tree analysis is one of those techniques, and it is being utilized by many different companies to improve system reliability.

This guide aims to give a basic to intermediate introduction to the fault tree analysis process. It discusses use cases, types, symbols, processes, examples, and helpful software solutions.

What is fault tree analysis?

Fault tree analysis (FTA) is a graphical and mathematical tool for analyzing the potential for failure of a machine or system. It is a top-down approach that tries to reverse engineer the root causes of a potential failure. It is used as a part of the root cause analysis process.

Fault tree analysis tries to model how failure propagates through a system. It tries to create a graphical model of how component failures lead to systemwide failures. It helps reliability engineers create well-defined systems with proper redundancies where component failures do not always cascade into systemwide failures.

The graphical elements used to model fault tree analysis are called fault trees. They are acyclic graphs that resemble the structure of a tree, hence the name.

To conduct fault tree analysis, the team imagines a catastrophic event for a process and thinks through the conditions that could lead to such an event. Every reason the event could happen is mapped relationally using Boolean logic. The fault tree does not stop there, though. It continues down the list of causes until the root causes are mapped out.

When finished, the fault tree diagram helps us understand how one or more small failure events can lead to a catastrophic failure—so we can apply proper corrective and preventive measures.

Who uses fault tree analysis and why?

In 1962, Bell Telephone Laboratories was designing safeguards for the intercontinental ballistic missile system, called the Minuteman system, for the U.S. Air Force. For such a complex and dangerous technology, safety was paramount. To improve their reliability analysis, Bell Labs created the fault tree analysis method.

This new methodology added a graphical element that helped visualize the concepts of failure mode and effects analysis (FMEA). Later on, the fault tree analysis methodology was adopted and popularized by Boeing. Today, fault tree analysis has widespread use in analyzing the failure potential of critical systems. This rigorous analysis ensures that complex systems operate safely and reliably.

In general, fault tree analysis is very useful for preventing future failures and identifying critical areas of concern for new workflows, products, and services. Various industries use fault tree analysis as a method for safety analysis and risk mitigation, including:
• Aerospace
• Aeronautical
• Power generation
• Defense
• Cybersecurity
• Nuclear operations
• Specialty chemical manufacturing
• Pharmaceuticals
• Healthcare
• Disaster management
• Environmental study

Fault tree analysis can be done at the time of the system’s design or during operation (to anticipate potential failures and take preventive actions). The goal is to bolster the subsystems and components that have a high probability of causing a major incident.

Fault tree analysis symbols and structure

As mentioned earlier, fault tree analysis is performed by building fault trees. Fault trees have a standard set of symbols and nomenclature so that they can be understood across plants and industries.

[Figure: standard fault tree symbols]

Depending on the resource you’re looking at, the visual representation of certain symbols can vary slightly from what is shown in the image above. The differences are small, though, so no one should have an issue recognizing which symbols are used.

The fault tree is a directed acyclic graph (DAG), which represents the flow and relationship between a series of activities. The activities are visually represented as nodes. Fault tree diagrams have two main types of nodes called events and gates.

Event symbols

Events are occurrences in a system, like the failure of a subsystem or a failure of an individual component. Events that come up in fault trees are described below. Each event symbol has at most one input and one output.

[Figure: fault tree event symbols]

Below you can find a short explanation for each event:
Top event (TE): This is the event at the top of the fault tree that is being analyzed. Often, it is a catastrophic event that causes a systemwide outage. The top event is represented by a rectangle that has only an input, without any output.
Basic event (BE): It represents root-cause events that propagate up the chain of the system to cause the top event. The BE is represented by a circle that does not have any input.
Intermediate events: BEs cause intermediate events, which eventually cause the top event (TE). Intermediate events are represented by rectangles that have both an input and an output.
Transfer events: When a fault tree is too large to fit on a single page, a transfer event can be created. This way, we can replace one big part of the fault tree with a single symbol. Transfer events are represented by triangles. A transfer-out event will have a triangle with output to the right of the triangle. Transfer-in events will have input to the top apex of the triangle.
Underdeveloped events: Sometimes, events occur that are not basic events, but there is insufficient information to develop a subtree. Such events are marked as underdeveloped events. Underdeveloped events are represented by the diamond or rhombus symbol.
Conditional events: Conditional events are the ones that act as a condition for an INHIBIT gate, which is defined below. Conditional events are represented by an oval symbol.
House events: An external event that is normally expected to occur. These events can either occur or not, so they carry the probability of 1 or 0, respectively.

Gate symbols

Gates represent how failures propagate through the analyzed system. Sometimes, a single event can culminate in a top-level event (i.e., catastrophic failure). Other times, a combination of two or more different events can cause the top event. Logic gates indicate how events combine to propagate failure. Each gate will have only one output event and can have one or more input events.

[Figure: fault tree gate symbols]

The most frequently used gates in drawing fault trees are described below:
AND gate: This gate can have any number of input events. The output event it is connected to will occur only if all the input events happen. An AND gate has a rounded top, out of which comes the output, as shown in the image of standard FTA gate symbols above.
Priority AND gate: An output event will occur only if all input events happen in a specific sequence. It looks very similar to an AND gate, just with an added line at the bottom.
OR gate: An output event will occur if any one or more of the input events occur. The symbol for the OR gate has a pointed top end, from which the output emerges. As shown in the image above, the other end is curved and is connected to the inputs.
XOR gate: An output will occur only if exactly one input event occurs. It looks like a triangle drawn inside the standard OR gate.
k/N or VOTING gate: For this gate, there will be an “N” number of input events and one output event. The output event will occur if at least “k” of the input events occur. It looks similar to the OR gate with a “k/N” written at the bottom end.
INHIBIT gate: It acts similar to an AND gate. An output event will occur when input events occur and a conditional event also occurs. The symbol for the INHIBIT gate is a hexagon. The input event is connected directly below the gate, and the conditional event is connected to the right of the gate. At the top is the output like in all other symbols.
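
The gate definitions above map naturally onto Boolean functions. The following is a minimal Python sketch, not a standard library or tool API. The function names are hypothetical, and the priority AND gate is modeled, as an assumption, by passing the time at which each input occurred (or None if it has not occurred).

```python
from typing import Optional, Sequence


def and_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs only if all input events occur."""
    return all(inputs)


def or_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs if one or more input events occur."""
    return any(inputs)


def xor_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs only if exactly one input event occurs."""
    return sum(map(bool, inputs)) == 1


def voting_gate(k: int, inputs: Sequence[bool]) -> bool:
    """k/N gate: output occurs if at least k of the N inputs occur."""
    return sum(map(bool, inputs)) >= k


def inhibit_gate(input_event: bool, condition: bool) -> bool:
    """Output occurs if the input event and the conditional event both occur."""
    return input_event and condition


def priority_and_gate(times: Sequence[Optional[float]]) -> bool:
    """Output occurs if every input occurred, in the listed order.

    Assumption: each entry is the occurrence time of one input,
    listed in the required sequence; None means "has not occurred".
    """
    if any(t is None for t in times):
        return False
    return all(a < b for a, b in zip(times, times[1:]))
```

For example, `voting_gate(2, [True, False, True])` evaluates to `True`, because two of the three inputs have occurred.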

Types of fault tree analyses

In addition to standard fault tree analysis methods, there are many extensions of fault tree analysis developed for specific use cases and industries. The extensions would be capable of visualizing features that are not easily expressed by standard fault trees. Some of them are:
• Dynamic FTA
• Repairable FTA
• Extended FTA
• Fuzzy FTA
• State-event FTA

Fault tree analysis methods can be broadly classified into qualitative fault tree analysis and quantitative fault tree analysis. Qualitative analysis is performed every time, while quantitative analysis can be done as an add-on in situations when the probabilities of events are known.

Let’s take a closer look at qualitative and quantitative fault tree analysis.

Qualitative fault tree analysis

Qualitative fault tree analysis is used to gain insight into the structure of fault trees to analyze a system’s vulnerabilities. There are many ways to go about conducting qualitative fault tree analysis, like minimal cut sets (MCS), minimal path sets (MPS), and common cause failures (CCF).

Minimal cut sets (MCS) can be used to identify the vulnerabilities of a system. A cut set is a set of basic events that, if they all occur, cause the top event; a cut set is minimal if no event can be removed from it without losing that property. If a fault tree has minimal cut sets that contain only a few elements, or elements whose failure is highly likely, the system is deemed unreliable. Reducing the failure probability of the identified components or adding redundancies will improve the reliability of the system.

Minimal path sets (MPS) can be used for determining the robustness of a system. A minimal path set tries to identify the minimum set of components that can keep the system functional. After those elements are identified, extra effort is made to ensure they have a lower chance of experiencing failure. This increases the overall reliability of the system.

The common cause failures (CCF) method is used to figure out if multiple failures can be caused by a single element. The components identified through CCF are considered to be critical components. Often, the maintenance team must make sure these components are routinely inspected and replaced.
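
To make the difference between cut sets and path sets concrete, here is a small Python sketch. It assumes a hypothetical three-event tree, TOP = A OR (B AND C), which is not taken from this article, and finds the minimal sets by brute-force enumeration. That approach is practical only for very small trees.

```python
from itertools import combinations

# Hypothetical basic events for illustration only.
EVENTS = ("A", "B", "C")


def top_event(failed):
    """Hypothetical tree: TOP = A OR (B AND C)."""
    return "A" in failed or ("B" in failed and "C" in failed)


def is_cut_set(events):
    # A cut set: if exactly these events fail, the top event occurs.
    return top_event(set(events))


def is_path_set(events):
    # A path set: if these events never fail, the top event cannot
    # occur even when every other event fails.
    return not top_event(set(EVENTS) - set(events))


def minimal_sets(predicate):
    """Enumerate candidates by increasing size and keep only those
    that do not contain an already-found (smaller) set."""
    found = []
    for size in range(1, len(EVENTS) + 1):
        for combo in combinations(EVENTS, size):
            candidate = frozenset(combo)
            if predicate(candidate) and not any(m <= candidate for m in found):
                found.append(candidate)
    return found


minimal_cut_sets = minimal_sets(is_cut_set)
minimal_path_sets = minimal_sets(is_path_set)
```

For this tree, the minimal cut sets are {A} and {B, C}, and the minimal path sets are {A, B} and {A, C}. The single-element cut set {A} is the obvious weak point.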

Quantitative fault tree analysis

Quantitative fault tree analysis can be done to derive the relevant numerical value for the fault tree. Assigning a numerical probability of failure will help to gain a better understanding of the risks faced by the system. The priority of fixing various cut sets can be determined with the help of quantitative analysis.

The result of quantitative fault tree analysis can be in the form of stochastic measures or importance measures. Stochastic measures give the probability of failure for the system. Importance measures indicate how important a cut set or path is to the reliability of the whole system. The most common form of quantitative fault tree analysis is to just calculate the probabilities of failure.

When probabilities of basic events are known, the probabilities of intermediate events can be easily calculated based on the gates that connect basic events with intermediate events. The most common gates are the AND gate and OR gate. A simple example is given below.
[Figure: example fault tree with basic events A, B, C, and D, intermediate event E, and top event TE]

Here, A, B, C, and D are basic events. E is an intermediate event, and TE is the top event. The intermediate event E is connected to the basic events A, B, and C using an AND gate. A, B, and C have to fail for the intermediate event E to happen. The probabilities of failure for A, B, and C are known. Therefore:
[Equation: probability of intermediate event E, computed from the probabilities of A, B, and C through the AND gate]

The top event failure TE is reached by connecting E and D through an OR gate. E in itself is a failure event, and the probability of occurrence of the basic event D is known.
[Equation: probability of top event TE, computed from the probabilities of E and D through the OR gate]

The probability of top event failures can be calculated like this using the quantitative fault tree analysis method.
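
As a worked sketch of the two calculations, assume the basic events A, B, C, and D are statistically independent. The article does not state this, but it is the usual simplifying assumption. The gate formulas can then be written as follows, with purely illustrative numbers that are not taken from the article:

```latex
% AND gate: E occurs only if A, B, and C all occur
P(E) = P(A)\,P(B)\,P(C)

% OR gate: TE occurs if E or D (or both) occur
P(TE) = P(E \cup D) = P(E) + P(D) - P(E)\,P(D)

% Illustrative values: P(A) = 0.1,\ P(B) = 0.2,\ P(C) = 0.5,\ P(D) = 0.05
P(E)  = 0.1 \times 0.2 \times 0.5 = 0.01
P(TE) = 0.01 + 0.05 - (0.01 \times 0.05) = 0.0595
```

When the probabilities are small, the product term is often dropped, giving the approximation P(TE) ≈ P(E) + P(D).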

Steps to follow when conducting fault tree analysis

Below are the general steps that should be followed to successfully complete fault tree analysis.

Step 1: Build a diverse team
First, a team with a diversity of thoughts, opinions, experience, and expertise should be assembled. That’s important because conducting fault tree analysis is, in a way, a speculation based on facts.

Experienced professionals in the field will be able to draw on examples from their own careers. They will also be aware of the precise technical aspects of the system. Other team members with less technical knowledge can contribute by pitching out-of-the-box ideas and sharing other useful information about the analyzed system.

Brainstorming sessions and meetings should be led by a professional with prior experience in conducting fault tree analysis. Engineers of respective fields, industrial engineers, and system design specialists are a must for any team conducting fault tree analysis.

Step 2: Identify failure causes
To draw fault trees and fault tree analysis diagrams, the first objective is to identify potential failures that could occur. It is a critical step because the intermediate events and basic events are reverse engineered from the top event. Fault tree analysis looks from the perspective of the top event and tries to gauge how that event could occur. Trying to identify the root causes will lead down to the basic events needed to draw fault trees. Potential failures, their characteristics, duration, and different impacts of the failure must be defined to start the process.

Step 3: Understand the system’s inner workings
The team performing fault tree analysis needs to have a deep understanding of the inner workings of the system. The engineers closely working with the system should already have a good idea of how everything works. Nonetheless, other team members should raise questions because that can result in an expanded list of failure causes worth exploring.

A professional with knowledge and expertise of the system should be in charge of guiding the discussion. The goal is to get a good grasp of the requirements, connections, and dependencies of the system. The team should collect the schematics of the system, specifications of different components, and other available manufacturer information. Studying these materials should build an understanding of how each subsystem and component is connected to each other.

Step 4: Draw the fault tree analysis diagram
Once the inner workings of the system are well understood, the next step is to graphically present a functional map of the system using Boolean logic. Using the fault tree symbols and structure covered earlier, the team can draw the graphical representation of the analyzed system.

Step 5: Identify MCS, MPS, or CCF
After the fault trees have been created, the team can look to identify MCS, MPS, or CCF, depending on what they want to accomplish. MCS or minimal cut sets are identified to know the most vulnerable parts of the system. An MPS or minimal path set is determined to identify the core components and subsystems required for the system to remain operational. CCF identifies the components that cause the maximum number of failures.

The reason why we are performing fault tree analysis in the first place will determine whether the team needs to find MCS, MPS, CCF, or a combination of those.

Optional step: Assess the probability of failure
More often than not, there are multiple pathways that can lead to the same failure event. For a large system, it would be nearly impossible to address all failure causes at once. To create a list in order of priority, the team can calculate the probabilities of failure for different critical sets. The critical set with the highest probability of failure should be given the top priority.

It is an optional step because accurately assessing the probabilities of different event failures is not an easy task.
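
A minimal Python sketch of this prioritization follows. It assumes independent basic events, so the probability of a cut set is the product of the probabilities of its events. The event names and probabilities are invented for illustration and are not real data.

```python
from math import prod

# Hypothetical failure probabilities for the basic events.
basic_event_probability = {
    "pump": 0.02,
    "valve": 0.01,
    "sensor": 0.05,
    "controller": 0.001,
}

# Hypothetical minimal cut sets identified in the previous step.
cut_sets = [
    {"pump", "valve"},
    {"sensor"},
    {"controller"},
]


def cut_set_probability(cut_set, probabilities):
    """Probability that every event in the cut set occurs (independence assumed)."""
    return prod(probabilities[event] for event in cut_set)


def rank_cut_sets(cut_sets, probabilities):
    """Return the cut sets ordered from most to least likely."""
    return sorted(
        cut_sets,
        key=lambda cs: cut_set_probability(cs, probabilities),
        reverse=True,
    )


ranked = rank_cut_sets(cut_sets, basic_event_probability)
for cs in ranked:
    print(sorted(cs), cut_set_probability(cs, basic_event_probability))
```

With these made-up numbers, the single-event cut set {sensor} ranks first and the two-event cut set {pump, valve} ranks last. This shows why redundancy (an AND gate) tends to push a failure path down the priority list.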

Step 6: Develop risk mitigation strategies
The components and subsystems identified as cut sets should be bolstered. High priority must be given to protect the MPS, which is the minimum set of components that will keep the system operational. Strict maintenance schedules have to be maintained for CCFs as they can cause a multitude of issues.

One potential risk-mitigation strategy, especially for CCFs, is preventive maintenance. The company can use a CMMS to ensure adherence to required maintenance schedules. This includes following the best spare parts management practices so the maintenance team always has needed replacement components in stock. This effort has to be made to minimize the probability of failure.

Fault tree analysis example

For the purpose of this example, let’s say that we have ended up with the following diagram that represents how a server that stores critical data might experience a catastrophic failure.
[Figure: fault tree for the catastrophic failure of a server that stores critical data]

Here are quick explanations for certain elements:
• B is a nonredundant system bus.
• PS is the power supply to the server.
• C1 and C2 are two redundant CPUs for the server, meaning one of the two CPUs can fail without causing total system failure.
• M1, M2, and M3 are memory components that can be shared between both CPUs.

The purpose of this fault tree is to analyze the path, cut sets, and probabilities of the top event (system failure) happening.

Failure propagates from the basic events to the top event through the gates G1–G6. Gate G1 is an INHIBIT gate with the condition that the system failure will happen only when the system is in use. This means that faults can be repaired during scheduled downtimes allocated for maintenance. Gate G2 indicates failure of either basic event B or the failure of the subsystem propagated till G3. Gate G3 fails only when both the CPU subsystems (with C1 and C2) fail.

Each CPU subsystem consists of the power supply (PS), CPU (C1 or C2), and memory component that is propagated through G6. Each of the CPU subsystems will fail if either the power supply, CPUs, or the memory component fails. Failure at a level above will happen only if both the CPU subsystems fail. G6 is a voting gate and for failure to propagate, at least two of the three memory components have to fail.

The Boolean expressions for the system are shown below:
G1 = U ∩ G2
G2 = B ∪ G3

Combining the two gets us:
G1 = U ∩ (B ∪ G3)
G1 = (U ∩ B) ∪ (U ∩ G3)

Continuing in this manner till all the intermediate events are eliminated and only basic events remain will provide the minimal cut sets. This is the top-down approach.

In the bottom-up method, the expressions for the gates at the bottom of the tree are written first and substituted upward. Either way, {U, B} and {U, M1, M2} are two MCSs. Even though {U, M1, M2, M3} is a cut set, it is not an MCS because it contains {U, M1, M2}. A minimal path set (MPS) for this system is {B, C1, M1, M2, PS}, along with the variants that use C2 in place of C1 or a different pair of M1/M2/M3.
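
The substitution process can be sketched in Python by representing every event as a set of cut sets and combining those sets gate by gate. This is a simplified illustration, not the algorithm of any particular tool. The tree is reconstructed from the gate descriptions above: the labels G4 and G5 for the two CPU subsystem gates are assumptions, and the INHIBIT gate G1 is treated as an AND gate with the condition U ("system in use") as an ordinary input.

```python
from itertools import combinations, product


def minimize(cut_sets):
    """Drop every cut set that strictly contains another cut set."""
    return {s for s in cut_sets if not any(other < s for other in cut_sets)}


def basic(name):
    """A basic event is its own single cut set."""
    return {frozenset([name])}


def or_gate(*inputs):
    """OR: any cut set of any input is a cut set of the output."""
    return minimize(set().union(*inputs))


def and_gate(*inputs):
    """AND: pick one cut set per input and merge them."""
    return minimize({frozenset().union(*combo) for combo in product(*inputs)})


def voting_gate(k, *inputs):
    """k/N: OR over every way of choosing k inputs that all occur."""
    return or_gate(*(and_gate(*combo) for combo in combinations(inputs, k)))


U, B, PS, C1, C2, M1, M2, M3 = (
    basic(n) for n in ("U", "B", "PS", "C1", "C2", "M1", "M2", "M3")
)

G6 = voting_gate(2, M1, M2, M3)   # memory fails if 2 of 3 modules fail
G4 = or_gate(PS, C1, G6)          # CPU subsystem 1 (label assumed)
G5 = or_gate(PS, C2, G6)          # CPU subsystem 2 (label assumed)
G3 = and_gate(G4, G5)             # both CPU subsystems must fail
G2 = or_gate(B, G3)               # bus failure or both subsystems
G1 = and_gate(U, G2)              # INHIBIT modeled as AND with U

for cut_set in sorted(G1, key=lambda s: (len(s), sorted(s))):
    print(sorted(cut_set))
```

Under these assumptions the sketch yields six minimal cut sets, each of which includes U because G1 requires it: {U, B}, {U, PS}, {U, C1, C2}, {U, M1, M2}, {U, M1, M3}, and {U, M2, M3}.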

Since the probabilities of the basic events are not stated, we can’t perform quantitative analysis.
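
If probabilities were available, the top-event probability for a tree this small could be computed by enumerating every combination of basic-event states. The Python sketch below does that. Every probability in it is a made-up placeholder, the basic events are assumed to be independent, and the structure function follows the gate descriptions given earlier, with U ("system in use") treated as an ordinary input.

```python
from itertools import product

# Placeholder probabilities for illustration only; not data from the article.
HYPOTHETICAL_PROBABILITY = {
    "U": 0.9, "B": 0.001, "PS": 0.01,
    "C1": 0.02, "C2": 0.02,
    "M1": 0.05, "M2": 0.05, "M3": 0.05,
}


def system_fails(state):
    """Structure function: state maps each basic event to True if it occurs."""
    memory = sum(state[m] for m in ("M1", "M2", "M3")) >= 2   # 2-of-3 voting gate
    cpu_subsystem_1 = state["PS"] or state["C1"] or memory
    cpu_subsystem_2 = state["PS"] or state["C2"] or memory
    both_cpus = cpu_subsystem_1 and cpu_subsystem_2
    return state["U"] and (state["B"] or both_cpus)


def top_event_probability(probabilities):
    """Sum the probability of every state in which the top event occurs."""
    names = list(probabilities)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        if system_fails(state):
            weight = 1.0
            for name in names:
                p = probabilities[name]
                weight *= p if state[name] else 1.0 - p
            total += weight
    return total


print(top_event_probability(HYPOTHETICAL_PROBABILITY))
```

Enumeration grows as 2^n in the number of basic events, so it only suits small trees; larger analyses typically work from the minimal cut sets instead.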

Streamlining fault tree analysis with software

Fault trees for big and complex systems can quickly become so large that they can’t be drawn on a single page or a whiteboard. That can be partially mitigated by using transfer elements. However, even with them, the diagram can become too large to handle, read, and comprehend. Fault tree analysis software is a great solution for this type of problem.

In addition to simplifying graphical representation, some applications have algorithms that can automatically identify qualitative aspects of fault tree analysis like MCS, MPS, and CCF. If the failure probabilities are known for the basic events, the different probabilities for top event and subsystem failures can also be calculated at the click of a button.

Here are a few systems you can try out:
• Visual Paradigm: Feature-rich fault tree analysis software with a free trial
• BlockSim: Fault tree analysis software that is a part of a suite of reliability software applications from ReliaSoft
• ALD Fault Tree Analyzer: Free cloud-based fault tree analysis software

Those are by no means all the available solutions, just the more popular ones. There might be better products for specific purposes and industries.

Additional resources

For those who want to dive deeper into this subject, check out the additional fault tree analysis resources listed below:
• Fault Tree Analysis Primer (CreateSpace Independent Publishing Platform, 2011), by Clifton A. Ericson II
• Fault Tree Analysis: A Complete Guide (5STARCooks, 2020), by Gerardus Blokdyk
• Coursera lecture on fault tree analysis
• Fault tree analysis lecture on YouTube by the Dept. of Industrial and Systems Engineering at IIT Kharagpur, India
• Another fault tree analysis lecture on YouTube by xSeriCon, an engineering consultancy and safety training firm

Conclusion

Root cause analysis is a complicated process that can’t be learned by simply reading one article. You have to perform it on a few real-life examples to really get the hang of it. That being said, we do hope this guide gave you a good introduction to the topic and showed what to expect if you plan to use this method at your company.

First published on the Limble CMMS blog.

About The Author

Bryan Christiansen

Bryan Christiansen is the founder and CEO of Limble CMMS. Limble is a modern, easy-to-use mobile CMMS software that takes the stress and chaos out of maintenance by helping managers organize, automate, and streamline their maintenance operations.