Bryan Christiansen

Management

Applied Root Cause Analysis, Part 1

Theory and steps

Published: Monday, May 10, 2021 - 12:03


Root cause analysis (RCA) is the process of finding the basic underlying cause of an effect we observe or experience. In the context of failure analysis, RCA is used to find the root cause of frequent machine malfunctions or a major machine breakdown. But what exactly is RCA, and how is it done?

In this series we take an in-depth look at how to perform RCA: We outline the steps, describe common tools and techniques, and give a couple of practical examples. Let's start by defining what RCA is.

What is root cause analysis (RCA)?

Root cause analysis is the process of tracing the causes of an observable problem and identifying the basic underlying issue that is causing it. Fixing that underlying issue should stop the recurrence of the other problems that originate from it.

If the problem fixed is not the underlying cause, there is no guarantee that the same fault will not occur again. RCA tries to follow the chain of cause and effect to pinpoint the problem that, when eliminated, makes all the other faults disappear.

RCA is not a process that guarantees an outcome. Conducting RCA can be complicated and generally involves a vast amount of data collection and scrutiny. The result of an RCA is not always black and white. It is not a litmus test that can conclusively indicate whether the problem we identified is the root cause. More often than not, we will get only a strong correlation between cause and effect and not a causal relation. From there, an experienced professional must judge whether to investigate further.

RCA is a craft that requires domain knowledge and experience. Otherwise, any fixes implemented will likely be just a cosmetic solution to the problem. In the worst-case scenario, the changes we make can result in worsening the problem.

Despite this dose of uncertainty, RCA is still a powerful tool for understanding and improving the fundamental nature of systems and procedures.

The origin of RCA

RCA has existed as an investigative tool for centuries, but for most of that time it was never formalized. It was formally introduced to the world of engineering and technology by Sakichi Toyoda, the founder of Toyota Industries Corporation, who is widely considered to be the father of Japanese industrialization.

One could argue that the innovations from Japanese manufacturing like kaizen and other lean manufacturing processes can be attributed to the practice of finding the root cause of problems and fixing them, rather than being satisfied with a cosmetic solution. All these process improvement techniques have helped to improve the efficiency of manufacturing processes all over the world.

Why conduct RCA?

There are two broad ways in which RCA can be used:
1. To identify the root cause for problems (the more common way)
2. To recognize the root causes for positive changes experienced: Sometimes, the procedures we implement give results that are better than expected; when the reason for the phenomenally good results cannot be easily explained, RCA can be used to identify it.

When to conduct RCA

Conducting RCA requires a significant investment of time, manpower, and money. It will cause further disruption in the production line or the system in which RCA is to be conducted. Therefore, RCA should not be done for every single fault. There is no cut-and-dried rule for when to conduct RCA.

Here are some situations in which experienced professionals may decide that RCA is warranted:
• Persistent faults. If the same fault occurs repeatedly, it is worth investigating. Because the same fault is recurring, we can deduce that the fault will not be cleared by fixing the visible problem. There is some underlying reason for the recurring faults. Such incidents should be investigated with RCA.
• Critical failure. How critical a failure is can be measured by its cost to the plant or by the total downtime it causes. When such a failure occurs, it must be investigated to identify its root cause. This will help avoid such occurrences in the future. Explosions at an oil rig and airplane crashes are examples of critical failures that need to be investigated.
• Failure impact. There are critical machines and critical subprocesses in any system. A failure of these will halt the entire operation because there might not be a backup or mitigation plan for that particular machine or process. In essence, the criticality of the machine or process determines whether to conduct RCA for a failure.
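
The three criteria above can be sketched as a simple screening function. This is only a minimal illustration: the `Failure` record, its field names, and every threshold are hypothetical placeholders rather than values from any standard, and an experienced professional would still make the final call.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Failure:
    machine: str
    fault_code: str
    downtime_hours: float
    cost: float


def rca_candidates(failures, critical_machines,
                   repeat_threshold=3,        # assumed: 3+ repeats = persistent
                   downtime_threshold=8.0,    # assumed: hours of downtime
                   cost_threshold=10_000.0):  # assumed: cost to the plant
    """Return {(machine, fault_code): [reasons]} for faults worth an RCA."""
    counts = Counter((f.machine, f.fault_code) for f in failures)
    flagged = {}
    for f in failures:
        key = (f.machine, f.fault_code)
        reasons = set()
        if counts[key] >= repeat_threshold:
            reasons.add("persistent fault")
        if f.downtime_hours >= downtime_threshold or f.cost >= cost_threshold:
            reasons.add("critical failure")
        if f.machine in critical_machines:
            reasons.add("failure impact")
        if reasons:
            flagged.setdefault(key, set()).update(reasons)
    return {key: sorted(reasons) for key, reasons in flagged.items()}


# Made-up failure log for illustration only.
log = [
    Failure("conveyor-2", "belt-slip", 0.5, 200.0),
    Failure("conveyor-2", "belt-slip", 0.5, 250.0),
    Failure("conveyor-2", "belt-slip", 1.0, 300.0),
    Failure("press-1", "hydraulic-leak", 12.0, 4_000.0),
    Failure("boiler", "sensor-glitch", 0.2, 50.0),
    Failure("lathe-4", "tool-chatter", 0.3, 80.0),
]
candidates = rca_candidates(log, critical_machines={"boiler"})
for key, reasons in candidates.items():
    print(key, reasons)
```

In this made-up log, the one-off minor fault on a noncritical machine is the only entry not flagged, which matches the point that RCA should not be done for every single fault.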

RCA process is based on the 3 Rs

Recognize
The true cause for an effect we observe is not always obvious. Cosmetic fixes don’t do much to correct the underlying fault. The elaborate exercise of RCA is conducted to pinpoint the true cause, so we can take corrective actions that will eliminate future issues. As mentioned earlier, RCA can also be done to identify the cause for an unexpected positive outcome.

Rectify
Once the root cause is recognized, corrective action has to be taken. If the root cause is addressed, the same problem should not crop up again. If the same problems reappear, it is highly likely that the cause identified was not the root cause. You will have to conclude that the previous RCA was not comprehensive and that more investigation is needed.

Replicate
Once the root cause is figured out and rectified, you must ensure that the same fault will not occur again in the same system in another location, or at a later time. If the RCA was done to identify the reasons behind unexpectedly good outcomes, you will have to test whether the same factors can be replicated in other scenarios and environments.

In essence, root cause analysis is used to precisely figure out what happened, how it happened, and why it happened, for any incident that occurs.

RCA is applied across many different industries

RCA is in essence a knowledge tool for identifying the root cause of any event or fault. Faults and problems occur in almost every industry, and RCA techniques can be used to investigate the underlying cause and contributing factors.

The most obvious and ubiquitous use we come across is in medical diagnosis. The same symptoms can be caused by a whole set of illnesses. It is the duty of the doctor to identify the underlying cause before a patient can be treated effectively. Almost all episodes of the popular TV show House, M.D. are exercises in root cause analysis, although in an unconventional manner.

Many other industry verticals use root cause analysis on a regular basis. Some of them are:
• Manufacturing (machine failure analysis)
• Industrial engineering and robotics
• Industrial process control and quality control
• Information technology (software testing, incident management, cybersecurity analysis)
• Complex event processing
• Disaster management and accident analysis
• Pharmaceutical research
• Change management
• Risk and safety management

RCA is a structured way of thinking and investigating any type of incident. With that in mind, RCA is not just limited to the areas mentioned above. It can be implemented in any sector or industry where the root cause of a problem needs to be identified.

Root cause analysis steps

RCA can be accomplished using many different tools and techniques. These make use of different conceptual models to identify the problem at the root. Although the tools differ at a cosmetic level, each technique goes through the same conceptual steps to conclude the analysis.

Step 1: Problem statement
A problem statement and definition are essential for any form of analysis, not just RCA. This is a clear description of the problem and symptoms experienced. It gives the scope for the analysis.

Without a precise problem statement, RCA will be like a rudderless boat, without a direction to sail toward and unable to change direction. A well-defined problem statement also helps to determine the scale and scope of the potential solution to be implemented.

Step 2: Data collection
All available data related to the incident should be collected. Take, for example, machine failure in a manufacturing plant. Some of the pertinent information that needs to be collected is given below:
• Age of the machine
• Time of continuous operation
• Operating patterns
• Maintenance schedule
• Operators handling the machine
• Specifications of the machine
• Schematic of the plant infrastructure
• Operating characteristics of the machine
• Characteristics of the operating environment

Inspecting the machine in person also provides information that could be beneficial for RCA. For facilities that collect data for predictive analytics (in other words, that run predictive maintenance), it will be easy to collate data quickly.

Step 3: Chronology, differentiation, and mapping
A timeline of events must be established. This will help to determine which factors among the data collected are worth investigating. RCA needs data points that potentially lead to the root cause. Chronological sequencing of events and data is very helpful in distinguishing causal events from noncausal events.
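
As a rough sketch of what establishing a timeline can look like in practice, the snippet below merges events from several sources, keeps only those in a window leading up to the failure, and sorts them chronologically. The event tuples, the `build_timeline` helper, and the window length are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def build_timeline(events, failure_time, window_hours=24):
    """Return events from the window before the failure, oldest first.

    Each event is a (timestamp, source, description) tuple. Events after
    the failure cannot have caused it, so they are left out.
    """
    start = failure_time - timedelta(hours=window_hours)
    relevant = [e for e in events if start <= e[0] <= failure_time]
    return sorted(relevant, key=lambda e: e[0])


# Made-up events from sensor logs, maintenance records, and operator notes.
events = [
    (datetime(2021, 4, 12, 14, 5), "sensor", "bearing temperature above limit"),
    (datetime(2021, 4, 10, 9, 0), "maintenance", "lubrication skipped"),
    (datetime(2021, 4, 12, 15, 30), "operator", "unusual vibration reported"),
    (datetime(2021, 4, 12, 16, 0), "sensor", "machine stopped"),
    (datetime(2021, 4, 12, 17, 0), "operator", "restart attempted"),
]
failure_time = datetime(2021, 4, 12, 16, 0)

timeline = build_timeline(events, failure_time, window_hours=72)
for when, source, what in timeline:
    print(when.isoformat(sep=" ", timespec="minutes"), source, what)
```

The ordering only shows what preceded what. Whether an earlier event actually contributed to the failure still has to be investigated.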

From the data collected, correlations can be found between various events, their timing, and other data collected. This can be used as an initial step to differentiate between causal and noncausal events. One important thing to remember is that correlation does not mean causation.

Causation vs. correlation

You cannot conclude an analysis just because a correlation has been identified. The causal relation still needs to be investigated.
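
To illustrate why a correlation alone is not enough, here is a small simulated example. The variable names and the data are invented: ambient temperature is assumed to drive both fan speed and vibration, while fan speed and vibration have no direct effect on each other. The two still correlate strongly until the common cause is accounted for.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return [yi - mean_y - slope * (xi - mean_x) for xi, yi in zip(x, y)]


# Simulated data (an assumption for illustration, not real measurements).
rng = random.Random(0)
ambient_temp = [rng.gauss(0, 1) for _ in range(2000)]
fan_speed = [t + rng.gauss(0, 0.5) for t in ambient_temp]
vibration = [t + rng.gauss(0, 0.5) for t in ambient_temp]

raw = pearson(fan_speed, vibration)
controlled = pearson(residuals(fan_speed, ambient_temp),
                     residuals(vibration, ambient_temp))
print(f"raw correlation:              {raw:.2f}")
print(f"after removing ambient temp:  {controlled:.2f}")
```

Removing a known common cause like this only works when that factor was measured in the first place, and a correlation that survives the check is still not proof of causation.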

From the data collected, chronological sequencing, and clustering, we should be able to create a causal graph (or use one of the root cause analysis tools we’ll discuss below). This graph can be used to represent the relationship between various events that occurred and the data collected. The different paths are given different probability weights and can serve as a visual tool to track down the root cause.

Example of causal graph. (Source)
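
A minimal sketch of how such a weighted causal graph could be represented and searched is shown below. The graph contents, the weights, and the `rank_root_causes` helper are hypothetical; the sketch also assumes the graph has no cycles and that the weights along a path can simply be multiplied, which is a simplification.

```python
def rank_root_causes(graph, symptom):
    """Rank candidate root causes of `symptom`, most probable first.

    `graph` maps each effect to a list of (cause, weight) pairs, where the
    weight is the estimated probability that the cause explains the effect.
    A node with no listed causes is treated as a candidate root cause.
    A cause reachable by several paths is reported once per path.
    """
    results = []

    def walk(node, probability, path):
        causes = graph.get(node, [])
        if not causes:
            results.append((node, probability, path))
            return
        for cause, weight in causes:
            walk(cause, probability * weight, path + [cause])

    walk(symptom, 1.0, [symptom])
    return sorted(results, key=lambda r: r[1], reverse=True)


# Made-up graph: effect -> [(possible cause, assumed weight), ...]
graph = {
    "motor trips": [("overheating", 0.7), ("electrical fault", 0.3)],
    "overheating": [("worn bearing", 0.6), ("blocked vent", 0.4)],
    "worn bearing": [("missed lubrication", 0.75), ("misalignment", 0.25)],
}

ranked = rank_root_causes(graph, "motor trips")
for cause, probability, path in ranked:
    print(f"{probability:.3f}  {cause}  ({' <- '.join(path)})")
```

The ranking only suggests where to look first. Each candidate still has to be confirmed by inspection or testing.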

Step 4: Root cause resolution
Once the root cause is identified, the solution to fix it can be easily determined. It can be mapped against the scope defined in the problem statement. If the solution fits the scope, it is implemented.

Fixing the root cause should eliminate the recurrence of the symptoms. If the symptoms occur again, we would need to go back to the drawing board and conduct RCA again.

Once the problem is solved, steps must be taken to avoid its recurrence. More than one solution can be applied to a single problem. For example, the root cause could be the wear of a bearing, which happened much earlier than expected. In such a case the procedure has to be adjusted so that the bearing is changed earlier. Other steps to avoid recurrence of the fault include changes to the maintenance schedule, different modes of maintenance, and changes in design.

The implemented solution will have to be in line with the available resources. For example, if the root cause is that the machine is being pushed too hard, the obvious solution is to shorten machine run time. However, when the production schedule doesn’t allow it, another solution might be to schedule preventive maintenance more frequently.

If you’ve found this information helpful, stay tuned for part two, where we’ll dive into RCA tools and techniques.

First published on April 14, 2021, on the Limble CMMS blog.


About The Author


Bryan Christiansen

Bryan Christiansen is the founder and CEO of Limble CMMS. Limble is a modern, easy-to-use mobile CMMS software that takes the stress and chaos out of maintenance by helping managers organize, automate, and streamline their maintenance operations.

Comments

RCA

Great read!

I have worked and used RCA methods in the Automotive, Oil & Gas, Non-Profit, and now in the Construction industry (*sighs*). Construction has been the most difficult to get others to want to understand why the same errors happen or the same required tasks get forgotten, project-after-project-after-project (*sighs again!*).

But I am persistent and it always comes down to managing People! Thanks, CS

RCA and Healthcare

I work in Patient Safety and we use RCA techniques all the time to look at events that affect patient care. We also use RCA techniques for what might be called proactive risk assessments (PRA). While aviation and nuclear industries have matured the use of root cause analyses, healthcare is still working on making this mainstream and part of a just culture so it is not seen as punitive. 

Root Cause Analysis

I worked for a number of years as QA Manager for a plant making non-woven fabrics from polypropylene. A big market was for use in baby diapers. A customer in Japan complained that our rolls of fabric were non-uniform, being streaky with thin and thick lanes. The thin lanes leaked glue in converting, forcing line outages for cleanup.

Our manager's RCA method was to call department heads together, solicit ideas as to cause, vote to rank them in terms of probability, then retire to his office to call the customer.  The advantage of this method is showing rapid response to a customer complaint.

The solution selected was to install a "randomizer" which moved our fabric stream back and forth about six inches across the width of the machine. The result was rolls of fabric which appeared even with no hills or valleys.

An engineer friend and I were skeptical of this "solution" and wondered how long it would take the clever Japanese engineers to discover how they had been tricked into believing we had fixed the problem. Our solution was to monitor the stream of fabric across its width to ensure even distribution. Took longer but was more effective.

When we started using a few control charts, this same manager was quick to want analysis when a process went out of control in a "bad for the customer" direction, but saw no reason to find the source of an out of control signal in a "benefit to the customer" direction.

Enjoyed your article