Reliability Reality in Process Plants - The Archimedean
Leap from the “Bathtub”
by Bernie Price, Polaris Veritas Inc.
This article will explain how Archimedes’ actions could have
been connected to classic failure modes. It goes on to
describe experiences on current projects and offer
suggestions to improve the reliability of modern process
units. More specifically, it will explain and disprove
the over-simplification of the graph of Reliability
(risk of failure) vs. Time – THE BATHTUB CURVE.
Archimedes was born and lived in Syracuse, Sicily, which at that
time was part of the Greek world. A brilliant
scientist, engineer and mathematician, he is credited
with many famous concepts, inventions and statements.
These include the statement, “Given a long enough lever
and a place to stand, I can move the world”.
Other credits include the
Archimedean screw for pumping water, the Compound
Mechanical Pulley, Hinged Mirrors, which could focus the
sun’s rays like a modern laser gun, and Mechanical
Catapults for throwing boulders. In addition, in the
field of mathematics, he invented pre-calculus and was
the first to calculate the value of Pi.
He
is, however, most famous for jumping out of his bathtub
shouting, “Eureka! I have found it!” when he realized
that the volume of water that he had displaced weighed
the same as his floating body. Or, could there have been
another reason why he made his flying exit from the
bathtub?
Fig. 1 - Archimedes Contemplates Taking a Bath
Concepts and Confusions
To
paraphrase a quotation from the ‘Reliability Edge’ web
site, “The concept of flat earth is not now widely
defended, but the unsupported assumption that most
reliability engineering problems can be modeled well by
the exponential distribution is still widely held. In a
quest for simplicity and solutions that we can grasp,
derive and easily communicate, many practitioners have
embraced simple equations derived from the underlying
assumption of an exponential distribution for
reliability prediction, accelerated testing, reliability
growth, maintainability and system reliability analyses.
This practice is perpetuated by some reliability authors
and lecturers, some reliability software makers, and
most military standards that deal with reliability”.
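The point about the exponential distribution can be made
concrete with a minimal sketch. It assumes a two-parameter
Weibull model, which the quotation does not itself prescribe;
the function name, the 10-year characteristic life and the
three shape values are hypothetical and chosen only for
illustration. With a shape of exactly 1 the Weibull reduces
to the exponential case and the failure rate is constant;
any other shape gives a rate that falls or rises with time.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a two-parameter Weibull model."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE_YEARS = 10.0  # hypothetical characteristic life, not from the article

# Hypothetical shape values: below 1, exactly 1 (exponential), above 1.
for shape, label in [(0.5, "early-life"), (1.0, "exponential"), (3.0, "wear-out")]:
    rates = [weibull_hazard(t, shape, SCALE_YEARS) for t in (1.0, 5.0, 9.0)]
    print(label, [round(r, 4) for r in rates])
```

Only the middle row stays flat over time, which is the
assumption the quotation warns against applying everywhere.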
In our experience, these assumptions are alternately
over-simplified and then complicated with
“Polynomial Contrasts”, Latin Squares and Weibull
Analyses, with the result that the bathtub curve is
transformed into a powerful source of misinformation.
It must be added that, until fairly recently, Polaris
consultants were also unwitting “parties to these
accepted fictions”, using them to explain reliability
theory, which was a mistake.
Automotive Analogy 1: Could the bathtub curve
describe the risk of failure against time for a car
driven at reasonable speeds on good roads, being
serviced regularly and having small defects identified
and fixed promptly?
However convenient this concept might be for
mathematicians, and however relevant in the case of
single items of equipment operating in steady state, it
certainly does not represent what happens in typical
process units. In fact, it doesn’t even represent what
happens in a typical automobile operated under perfect
conditions.
To
explain the development of a more realistic
representation, we first came up with the graph below
showing the “modified stepped version”. Each step in the
illustration below represents permanent damage/reduction
in useful life caused by a period of upset operating
conditions.
Before fully developing the explanation, it is necessary
to accept that in most plants these process upsets occur
on an irregular but persistent basis. They also often go
unreported unless they cause major equipment damage or
product loss.
One
of the many ways to recognize that one of these upsets
has occurred is the simultaneous damage, or “cluster
failure”, of minor items of equipment. Frequently, they
do not bring production to a halt and are typified by
pipe and equipment leaks, failures of groups of
mechanical seals in pumps, filter failures, etc.
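As an illustration of how such “cluster failures” might be
picked out of a maintenance log, here is a minimal sketch.
It is not a method described in this article: the function
name, the 24-hour window, the threshold of three distinct
items and the equipment tags are all hypothetical
assumptions. It simply groups minor failures that fall
within one time window of the first event in the group.

```python
from datetime import datetime, timedelta


def find_cluster_failures(events, window=timedelta(hours=24), min_items=3):
    """Group minor failures that occur within `window` of a group's first event.

    events: list of (timestamp, equipment_tag) tuples.
    Only groups with at least `min_items` distinct tags are returned.
    """
    clusters = []
    group = []
    for stamp, tag in sorted(events):
        # Close the current group once an event falls outside its window.
        if group and stamp - group[0][0] > window:
            if len({t for _, t in group}) >= min_items:
                clusters.append(group)
            group = []
        group.append((stamp, tag))
    if len({t for _, t in group}) >= min_items:
        clusters.append(group)
    return clusters


# Hypothetical maintenance log entries, for illustration only.
log = [
    (datetime(2024, 3, 1, 8), "P-101 seal"),
    (datetime(2024, 3, 9, 2), "P-204 seal"),
    (datetime(2024, 3, 9, 5), "F-310 filter"),
    (datetime(2024, 3, 9, 11), "flange leak, line 12"),
    (datetime(2024, 3, 20, 14), "P-101 seal"),
]
clusters = find_cluster_failures(log)
for cluster in clusters:
    print([tag for _, tag in cluster])
```

With this sample log, only the three failures on 9 March
form a cluster, which would be a prompt to ask whether an
unreported upset occurred around that date.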
In simple terms, the steps represent a process upset
involving a damaging change in pressure, temperature,
flow or chemical composition, each one in some way
reducing the life of the unit. The root cause can be
some external failure but, in our experience, it is
much more likely to be due to some lack of operational
discipline, i.e. a failure to stay inside the
operating envelope or a failure to accurately follow
operating procedures. The term “lack of operational
discipline” does not refer to an intentional act or some
malevolent behavior, but to errors caused by deficient
operating procedures, inadequate training in those
procedures, conflicting priorities, inadequate
labeling of equipment and instruments, or a lack of
effective administrative controls.
Automotive Analogy 2: The stepped graph describes
the probability of failure vs. time of a vehicle that is
driven outside normal conditions (too fast) so that it
overheats, is driven in very dusty conditions, gets the
wrong grade of fuel for extended periods or suffers a
series of missed services, but doesn’t break down.
Sometimes a hose will blow or a cooling water pump seal
will fail but, most often, there are no visible effects
of the abuse for a couple of hours.
The
third diagram shows the already modified “stepped
bathtub curve” with the addition of a series of spikes.
Each spike superimposed on the step represents a short
period of greatly increased risk of failure. Typically,
this can occur during a period when the unit is being
either shut down or started up, but it can also be caused
by a sudden loss of utilities or a raw material
variation, such as a power surge, a lightning strike or
a raw material interruption.
Modern continuous production processes are designed to
run without stopping for many years (two to seven years
is typical) and, while such incidents are few in number,
they are very significant. These crash shutdowns and
hard starts play havoc with future equipment
reliability, and every effort should be made to avoid
them.
The concepts hold true for batch processes, but these see
far fewer damaging “steps” and many more, but much
smaller, “spikes” of increased risk.
Comparing
the idealized bathtub curve with this new stepped-spiked
curve, we notice several things:
- The unit’s total life (reliability) is frequently
  reduced by 50% or more.
- In the final months it is possible to get the
  “Tipping Point Failure”, where a small spike
  superimposed on a step causes a crash shutdown.
- As the overall condition of the plant deteriorates
  due to the “steps”, it takes fewer, smaller and
  shorter spikes to tip the unit into such a shutdown.
- Understanding the situation allows a reduction in
  the frequency of these events and mitigation of their
  effects. It is then possible, using sophisticated
  condition monitoring techniques, to know how much
  life all the major items of equipment have left.
  This is critically important in order to plan and
  schedule the next outage.
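The comparison above can be sketched numerically. The
following is a toy model under stated assumptions, not the
curve used in the figures: the piecewise-linear bathtub
shape, the dimensionless risk index, the 100-month design
life, the step and spike sizes and the trip level are all
hypothetical. Steps add risk permanently, spikes add it
only while they last, and a crash shutdown is assumed to
occur the first time the total reaches the trip level.

```python
def bathtub_risk(month, life):
    """Idealized bathtub: early-life risk, flat middle, rising wear-out risk."""
    early = 0.4 * max(0.0, 1.0 - month / (0.1 * life))
    wear = max(0.0, (month - 0.8 * life) / (0.2 * life))
    return 0.1 + early + wear


def stepped_spiked_risk(month, life, steps, spikes):
    """Bathtub risk plus permanent steps plus temporary spikes."""
    risk = bathtub_risk(month, life)
    # Steps are permanent: every upset that has already happened still counts.
    risk += sum(size for start, size in steps if month >= start)
    # Spikes are temporary: they count only while they last.
    risk += sum(size for start, size, length in spikes
                if start <= month < start + length)
    return risk


def first_trip(life, steps, spikes, trip_level):
    """First month in which the risk reaches the trip level, or None."""
    for month in range(life + 1):
        if stepped_spiked_risk(month, life, steps, spikes) >= trip_level:
            return month
    return None


# All values below are hypothetical, chosen only to illustrate the idea.
LIFE = 100                                  # design life in months
STEPS = [(20, 0.2), (40, 0.2), (60, 0.2)]   # (month, permanent risk added)
SPIKES = [(30, 0.3, 1), (70, 0.3, 1)]       # (month, risk added, months it lasts)
TRIP_LEVEL = 0.96                           # risk level assumed to cause a shutdown

ideal = first_trip(LIFE, [], [], TRIP_LEVEL)
actual = first_trip(LIFE, STEPS, SPIKES, TRIP_LEVEL)
print(ideal, actual)  # prints: 98 70
```

In this toy case the two spikes are the same size. The one
at month 30 passes without a shutdown, while the one at
month 70, sitting on top of three accumulated steps, tips
the unit over well before the idealized curve would have.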
Automotive Analogy 3: The graph describes the
probability of failure vs. time of a vehicle given “hard
starts” and “emergency stops” in bad conditions, such as
street drag racing.
Suggestions for Dealing with the Steps
(Archimedes - the Siege of Syracuse and Long levers)
As the
Romans laid siege to Syracuse, Archimedes used all the
war machines in his armory. These included catapults for
long range defense and crane-mounted swinging grappling
hooks to capsize the Roman boats at short-range. He used
his knowledge of “mathematics and mechanical advantage”
in all its many forms. Similarly, we must use a
combination of short and long-range approaches to
minimize, if not get rid of, the “steps” and “spikes”.
Fig. 5 - Suggestions for reducing the size and
frequency of the “Steps of Increased Risk”
To a
very large extent, a much-improved level of “operational
discipline” is required. This can also be described as
the ability of the operating crew to keep the unit
operating within a prescribed operating envelope. As a
first step, this involves an improved method for
compiling clear instructions on how to operate and
trouble-shoot the plant under all conditions, i.e.,
the “standard operating procedures”. (We recommend the
“T” bar or two-column format for its simplified
architecture and easy reference.) The procedures should
be well engineered, cross-functionally prepared and
easy to read. They should identify the consequences of
deviation and the corrective actions that need to be
taken, and provide a separate troubleshooting section.
They should also be based on tightening operating limits
and be set in a background of continuous improvement,
meaning they can be improved by the agreement of the
group at any time.
Monitoring the process very closely using detectable
symptoms other than process instrumentation is, again,
essential. There is almost always some indicator or
grouping of observable defects that will provide earlier
warning of deviation and potential failure. Be mindful
of the chaos that can occur without good alarm
management strategies. Simply lowering the alarm levels
on equipment is not the way it should be done. Search
for specific, critical small changes in product quality,
noise, smell, etc. By putting together the knowledge of
the operators who have been around the process for years
with that of appropriately directed engineers, it is
possible to produce information on some early or
incipient change that will allow pre-emptive action to
be taken.
There is a need to create a participative, open culture
where prolific recording of incidents and near misses,
and follow-up on them, is routine. Formal root-cause
analysis for larger, potentially high-dollar incidents
is one of the ways to achieve the improvement needed,
with medium-sized, single-discipline issues being
handled by expert engineers. All of this is predicated
on not being afraid of reporting and investigating these
“process variations”, which is not the case in many
plants.
The diagram of Stress Level vs Error Rate illustrates two
essentials. First, if a plant operation is very stable
for long periods, operators can be mentally asleep with
their eyes wide open. Secondly, as stress levels
increase, only those actions that were learned by
repetition, or for which clear and simple “Directions
and Signs” are given, will be effectively executed. In
dealing with an “Upset” there is little time for
discussion, multiple references or involved calculation.
We are aware
that this can sound like “motherhood and apple pie”. Our
experience has been that it takes a series of very long
(Archimedean) levers to exert enough force at all levels
of the organization to implement these necessary
changes.
Suggestions for Dealing with Spikes
Spike removal involves a similar but two-part solution.
For the 60% of spikes that occur at scheduled start-ups
and planned shutdowns, more detailed procedures and
comprehensive checklists are needed. Plants should take
a lesson from the aviation industry, where even a small
spike can be fatal, which is why it makes extensive and
effective use of checklists.
This
approach should not be new to the process industry, as
it is an OSHA requirement for plants operating under the
Process Safety Management code. OSHA mandates the use of
the Pre-Start-Up Safety Review Process, which is a
similar use of “checklists”. Take the time to write
checklists, and to train and re-train operators in their
use. Making sure that these thorough checks are made is
critical to dealing with the spikes efficiently. The
written procedures, with individual responsibility for a
set of required actions, must have been set out and
rehearsed beforehand.
For each one
of the remaining 40% of spikes due to sudden utility or
raw materials interruption, a “detailed and rehearsed
mitigation action plan” is needed. A well-written
procedure and a rehearsed response should already be
worked out for each type of spike before it happens.
This comprehensive training and these routine drills are
essential. The safest plants have routine safety drills;
why not drills for predictable emergency operating
conditions?
What Not to Do!
In the past, many plants used the simplistic approach that
if moving slowly reduces the chance of error (spikes) in
“pressure situations”, then moving very slowly is best.
This is, however, patently untrue. Holding processes
through extended periods of low flows through heat
exchangers and filters can render them 50% fouled
before the plant is back up to full operating rates.
As an example of this, the materials of construction
for machinery (pumps, fans, turbines) are selected for
use during normal operating conditions. Operating the
machinery away from its design conditions usually means
misalignment and excessive corrosion rates, which
frequently damage it.
Automotive Analogy 4: No matter how badly you drive, if
you drive slowly enough, you will never have a serious
accident.
Defect Elimination and Reliability Improvement
Led by DuPont in the 1980s, extensive research into the
sources of problems leading to inefficient operation has
produced some surprising results. Based on thousands
of “Root-Cause Analyses” of plant problems, root-cause
defects, whether measured by number or by Total
Financial Impact, are distributed as:
Source of root-cause defects | Share of defects
Maintenance Practices | 18%
Maintenance Materials | 7%
Raw Materials | 5%
Design | 25%
Operating Discipline | 45%
We have
several examples of companies in the process industries
equating a reliability expert with a “good maintenance
guy”. This mind-set is then taken to its illogical
conclusion by the recruitment of reliability engineers
to fix “maintenance problems” when the individual should
instead be working on “reliability problems”. These
management groups then have the engineers work on
maintenance practices and materials problems, which
comprise only 25% of the causes of reliability problems.
Typically, after 2 – 3 years of hard work,
these engineers will have reduced the maintenance number
by 50%, thereby solving or removing 12.5% of the total
number of defects or reliability problems.
Meanwhile, they will have missed the opportunity to go
after the fundamental problem area of operating
discipline, primarily because it involves people,
communication skills and cultural changes rather than
equipment. Had these engineers expended the same effort
in the same amount of time on solving 50% of the
operating discipline problems, they alone could have
made a much more effective improvement in the overall
plant performance by removing 22.5% of the defects.
While all this is understandable, much like “the story
of the king’s new clothes”, operating management’s
belief is often that “it can’t possibly be that bad”
when the evidence clearly indicates that it is.
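The arithmetic behind the 12.5% and 22.5% figures can be
checked with a short sketch. The percentages are the ones
quoted in the defect distribution above; the function name
and the dictionary layout are only illustrative.

```python
# Share of root-cause defects by source, in percent (figures quoted above).
defect_share = {
    "Maintenance Practices": 18,
    "Maintenance Materials": 7,
    "Raw Materials": 5,
    "Design": 25,
    "Operating Discipline": 45,
}


def defects_removed(categories, fraction_solved):
    """Percentage of all defects removed by solving a fraction of some categories."""
    return sum(defect_share[name] for name in categories) * fraction_solved


maintenance = defects_removed(["Maintenance Practices", "Maintenance Materials"], 0.5)
operating = defects_removed(["Operating Discipline"], 0.5)
print(maintenance, operating)  # prints: 12.5 22.5
```

Halving the operating discipline defects therefore removes
nearly twice as many defects as halving both maintenance
categories combined.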
Fig. 6 - With the Steps and Spikes removed, a
Happy Archimedes takes his bath
Working on reliability improvement programs in over 20
plants, we have found that in dealing with “design
defects”, the 80/20 rule applies. Many can be removed
for a very small amount of money (the low-hanging
fruit). However, rectifying fundamental errors, where
essentially unreliable equipment and systems were chosen
for immediate capital effectiveness, does invariably
take major capital expenditure.
Conclusion
Understanding the possible savings and the safety
improvement that can be made by using improved
operational discipline should be one of the first steps,
not the last, in achieving world-class operating
performance. Each plant must have a periodically
reviewed, accurate and easily referenced set of
operating procedures, and the operators must be
trained and routinely retrained in their use.
In addition, as a means of administrative control, most
plants need daily formal reporting and investigation of
even small process variations (process deviation
reports, or PDRs). Even so, we see errors covered up,
and the “Steps of Increased Risk” go unreported.
What is needed is a culture in which minor defects are
identified and fixed earlier, and a situation is created
where even minor changes in process performance are
measured and reported using the daily “process
deviation report”.
If
the difference between expectation and reality is
happiness, then the more we understand, the happier we
will be. We can now see that Archimedes was jumping out
of the reliability bathtub because it wasn’t the smooth
shape he expected it to be. He, like many other
reliability professionals, was in the process of being
impaled on a “spike” of ignorance that his fellow
mathematicians and theorists hadn’t told him about. Just
by knowing that these spikes and steps are there and
having the whole production team work on removing them
will improve the plant’s operating performance.