Enhance the Ability to Perform Root Cause Analysis With
Reliability Physics
by Mark A. Latino, Sr. V.P. Operations, Reliability
Center, Inc.
www.reliability.com
Over the years, root cause analysis has become a term
associated with many kinds of problems. At one time the
National Transportation Safety Board (NTSB) was unique
in its use of the term, referring to the conclusion of an
airline crash investigation as the root cause analysis
of the mishap. Today the term root cause is heard on the
news far more often and is associated with all kinds of
events, ranging from the root cause of 9/11 to the root
cause of a factory explosion.
The term root cause means many things to many people.
Its use is so diverse that people use it without fully
understanding its meaning. Because of this, the term is
thrown out whenever people are looking for answers, yet
most people don’t understand what it takes to get those
answers. This article aims to give a deeper
understanding of what Root Cause Analysis is and what it
takes to get to the root causes of an event.
To this writer, Root Cause Analysis means analyzing
failures down to their latent root causes: the
deficiencies in management systems and the restraining
cultural norms that allowed the failure to occur.
Seldom is the time or effort taken to drive a root cause
analysis past the physical condition of the failure and
into the human intervention and the system(s) that drive
the human behavior. There is a quantum difference in the
returns to the company when the extra effort is made to
understand the total failure mechanism.
The power of root cause analysis is in its leverage.
Eliminating the failure currently under investigation,
in a “one at a time” fashion, should be a secondary
objective of root cause analysis. When root cause
investigations drive down into the latent roots, you
will discover that those roots are common to other
failures, both past and future. Those failures may occur
in entirely different equipment and systems, or even in
another facility, and may have nothing in common with
the failure being analyzed except the latent roots.
When latent root causes are discovered, their correction
justified, and their elimination completed, the
facilities will have been raised to an unprecedented
level of reliability. This quantum level of reliability
could never be reached by pursuing only the secondary
objective of solving one failure at a time.


In a perfect world, applying the definition above would
earn the root cause analyst a reputation as some kind of
problem-solving genius. In reality this is very
difficult to do because so many agendas are in play at
the same time. The root cause analyst must be equipped
with all the tools necessary to solve problems nearly
single-handedly, which goes against what most root cause
methodologies teach. However, if the event does not
involve a significant loss and is not in the eye of the
highest stakeholders, the root cause becomes less
significant and the resources needed to solve the
problem are often not made readily available to the
analyst.
The root cause analyst must have a solid understanding
of how systems work to understand how systems fail.
Learning the reliability physics of mechanical,
electrical, and human failure can equip the analyst with
most of this knowledge.
A mechanical failure can tell the analyst specific
information about what was physically happening at the
time of the failure, which can be extremely valuable.
It is important to get to the physical cause of a
failure as fast as possible. This is usually not
difficult because the reliability physics of materials
is straightforward. An example is a shaft that failed
in a torsion overload condition, such as the one shown
in Figure 1.
The forty-five degree angle of the break indicates that
the shaft was in torsion at the time of failure, the
surface corrosion indicates that the shaft’s material
fatigue strength was weakened, and the chevron marks on
the fractured surface indicate that the material was
overloaded. This information allows the analyst to rule
out fatigue and erosion and focus on the failure
scenarios that would explain how the shaft came to be
corroded and overloaded.

Figure 1
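The reasoning in the shaft example can be written down as a simple lookup from observed fracture features to what each one suggests. The Python sketch below is only an illustration under stated assumptions: the feature names, the `FRACTURE_INDICATIONS` table, and the `interpret_fracture` function are hypothetical, and the table holds nothing beyond the three observations described for this shaft. It is not a general fractography reference.

```python
# Hypothetical lookup: the feature names are invented for illustration;
# the indications are the three stated in the shaft example.
FRACTURE_INDICATIONS = {
    "45_degree_break": "shaft was in torsion at the time of failure",
    "surface_corrosion": "fatigue strength of the material was weakened",
    "chevron_marks": "material was overloaded",
}


def interpret_fracture(observed_features):
    """Split observations into known indications and unrecognized features."""
    indications = []
    unrecognized = []
    for feature in observed_features:
        if feature in FRACTURE_INDICATIONS:
            indications.append(FRACTURE_INDICATIONS[feature])
        else:
            # Anything outside the table is left for the analyst to judge.
            unrecognized.append(feature)
    return indications, unrecognized


if __name__ == "__main__":
    found, unknown = interpret_fracture(
        ["45_degree_break", "surface_corrosion", "chevron_marks"]
    )
    for indication in found:
        print("Indicates:", indication)
    for feature in unknown:
        print("Not in table, needs review:", feature)
```

A table like this only records what the physical evidence suggests. It says nothing about how the shaft came to be corroded and overloaded, which is the question the analyst still has to pursue.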
The root cause analyst can also benefit from a solid
understanding of electrical failure modes in equipment
such as breakers, relays, batteries, and transformers.
If a relay fails and there is a signal loss that could
have been due to high resistance, the analyst would
look for things that cause high resistance, such as a
connection failure, sulfur or chloride contamination,
insulation breakdown, thermal cycle fatigue, and/or
mechanical fatigue failure of connections. This type of
understanding gives the analyst a great step forward in
solving the failure and understanding the failure
mechanism.
I have spoken so far of the physical failure, which is
the first and easiest part of any failure to understand.
Materials, in most cases, fail in a familiar manner that
can be explained by the properties of the material, the
service environment, and the conditions, such as motion
and flow, under which the material was operating. This
understanding is needed, but it is the lesser concern
because the failure has already happened, the components
will be replaced, and the equipment will quickly be
restored to service. The concern at this point should be
how the material or components came to be in the failed
condition. This is more difficult to uncover because the
human is now introduced into the failure mechanism. The
human element has remained a constant over time. People
who are asked to expose what they may have done wrong to
contribute to a failure will almost always skirt around
the true chain of events, leaving their contribution, if
any, unclear. I have often found that the concern about
being implicated is high enough that people will
withhold information just in case it might look like
they were involved in the event.
It is extremely important for the root cause analyst to
understand human error and how it manifests itself in
the work environment. I am going to talk about this to
some extent because it is the corridor to the true root
causes of events.
The place to start when dealing with human error is with
the supervision of the work force. Good supervision
makes all the difference in reducing the human error
rate. A good supervisor motivates the employees through
leadership. This means that the supervisor has informed
the employees of the expectations and takes an active
role in enforcing those expectations in a way that gets
the point across without being threatening. This is
accomplished through a consistent style of addressing
expectation problems immediately, not waiting until a
problem has become a bad habit (habits are hard to
change). The other part of this is to constantly
reinforce the expectations until they are a part of the
work culture. Good supervision is the number one skill
in reducing human error. There are nine other skills
that a supervisor should strive to master:
- Accountability
- Field Surveillance
- Review & Verification
- Pre-job Briefing
- Complacency Mitigation
- Problem Solving
- Command & Control
- Communication & Coordination
- Crew Turnover
This is the framework that begins the process of
reducing human error.
Other factors that affect the human error rate are
known as human error traps. The first is time pressure
to perform work, which is the top reason that human
error occurs. Employees generally want to do a good job
and deliver for their supervision. Often in our work
culture there are perceptions, reinforced by actions,
that production is paramount over all other management
concerns. This manifests itself through the supervisor
asking, “Are we ready to start up?” or “How much longer
will it take?” This type of pressure, real or perceived,
causes employees to cut corners or miss steps in a
sequence, which leads to secondary events such as
injury, delayed start-ups, and maintenance mishaps.
Now let’s combine a distractive environment with the
pressure to get done on time or to start up on time. A
distractive environment is defined as one with some type
of interruption every 5 to 15 minutes. Interruptions
range from management asking “Is it ready yet?” to field
personnel asking questions about job sequences or job
routines. This occurs in cultures where the employees
are micromanaged or where practice and rehearsal job
training are not routine. Such a culture is one where
supervision is poor and the workforce is not confident
enough to make decisions on its own. There are of course
other things that may come into play as contributors,
but if you encounter this type of environment you can
count on a higher rate of human error. Distraction
combined with time pressure can raise the chance of
human error to as high as 31%.
Other things that contribute are:
1. High Workload
2. First Time Evolution
3. First Working Day After Days Off
4. One-Half Hour After Wake-Up or Meal
5. Vague or Incorrect Guidance
6. Over-Confidence
7. Imprecise Communication
8. Work Stress
If the root cause analyst understands human error and
how it affects the work, they can make recommendations
that will minimize the human error rate. This is usually
learned during a root cause investigation, but it can
become a proactive tool that reduces the need to
perform root cause analysis in the future.
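One way to use the error traps proactively is to keep them as a pre-job checklist. The Python sketch below is a hypothetical illustration, not a method prescribed by this article: the `ERROR_TRAPS` list collects time pressure, the distractive environment, and the eight contributors listed above, while the `traps_present` function and the idea of matching free-text job conditions against the list are assumptions. It only reports which traps are present and makes no attempt to estimate an error probability.

```python
# Error traps named in this article, in the order they are discussed.
ERROR_TRAPS = [
    "Time Pressure",
    "Distractive Environment",
    "High Workload",
    "First Time Evolution",
    "First Working Day After Days Off",
    "One-Half Hour After Wake-Up or Meal",
    "Vague or Incorrect Guidance",
    "Over-Confidence",
    "Imprecise Communication",
    "Work Stress",
]


def traps_present(job_conditions):
    """Return the listed error traps found among the noted job conditions.

    Matching is case-insensitive and ignores surrounding whitespace.
    Conditions that are not on the list are ignored.
    """
    noted = {condition.strip().lower() for condition in job_conditions}
    return [trap for trap in ERROR_TRAPS if trap.lower() in noted]


if __name__ == "__main__":
    # Hypothetical conditions noted by a supervisor before a start-up.
    conditions = ["time pressure", "First Time Evolution", "night shift"]
    for trap in traps_present(conditions):
        print("Error trap present:", trap)
```

A supervisor could review the result during the pre-job briefing and remove or mitigate each trap before the work starts.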
Common latent root causes that are discovered include
procedural problems, training issues, culture issues,
design issues, supervision issues, accountability
issues, and improper tool usage. When the root cause
analyst drives into the latent area, quantum benefits
can be gained by the company, as I expressed at the
beginning of this article. Leveraging what is learned
from the investigation back into the company is the
most important part of the root cause analyst’s job.
At this point cost becomes a large factor in how much
management is willing to invest in correcting the root
causes identified. As an example, let’s say an
identified root cause is that a procedure is inadequate
for the job because it is missing steps that, if
followed, would have avoided the event. Management must
now decide whether to add the missing steps, revamp the
entire procedure, train the personnel involved on how to
properly perform that particular job task, discipline
the individual, or pursue some combination of these
items. This is where the pitfall lies.
From a cost standpoint it is cheaper to add a few steps
and discipline the individual. This covers management as
far as compliance goes, but it does not solve the
problem. It may be that the procedure was too difficult
to follow in the first place, and now, as a quick fix,
we add more steps and make it more complicated. If this
is the case, the root cause investigation has uncovered
the true roots but management’s decision to add steps
has allowed the failure mechanism to stay in place.
Usually the better remedy is a combination: review the
procedure for human error traps, rewrite it without the
traps, and then train personnel on its proper execution.
Taking the cheap way out removes the opportunity for
quantum benefits and leaves you with incremental
improvement at best.
To conclude, you as root cause investigators have a
responsibility to get to the true root causes and to
have the training and tools to allow you to accomplish
this task. The management has the responsibility to
review the root causes and implement recommendations
that will remove the failure mechanism while reporting
the knowledge back into the organization.
About the Author
Mark Latino is Vice President of Operations for
Reliability Center, Inc. (RCI). Mark came to RCI after
19 years in corporate America, during which he acquired
a wealth of reliability, maintenance, and manufacturing
experience. He worked for Weyerhaeuser Corporation in a
production role during the early stages of his career.
He was an active part of Allied Chemical Corporation’s
(now Honeywell) Reliability Strive for Excellence
initiative, which was started in the 1970s to define,
understand, document, and live the reliability culture,
until he left in 1986. Mark spent 10 years with Philip
Morris, primarily in a production capacity and later in
a reliability engineering role. Mark is a graduate of
Old Dominion University and holds a BS degree in
Business Management with a focus on Production &
Operations Management.
Reliability Center, Inc. offers its Human Error
Reduction Workshop at its facility in Virginia or
on-site at the client’s facility. This workshop explains
the underlying reasons why humans make errors and how
you can prevent these errors from happening. The
techniques learned in this course will enable you and
your workers to reduce human errors in the work place by
as much as 20 days per year. For more information please
call 804-458-0645 or email info@reliability.com.
Some content within this paper is based on the Eisenhart
Seminar Series, © Copyright 2004 VATIC. All rights
reserved. Used with Permission.