Detective Maintenance
by V.Narayan (Vee),
Effective Maintenance Ltd.
Author of
Effective Maintenance Management: Risk and Reliability
Strategies for Optimizing Performance. April 2004,
Industrial Press. ISBN 0-8311-3178-0
Abstract
When we think about maintenance strategies, the words
predictive, preventive, corrective, and breakdown spring
to mind.
There is, however, an important class of tasks that we
do to ensure that our equipment and Plant remain safe
and productive. These tasks are based on a Detective
Maintenance strategy. They help us win our
licence-to-operate and ensure long term viability. With
machinery and Plants becoming increasingly complex,
the proportion of such tasks in the total maintenance
program is growing.
Managing a business efficiently means that we have to
manage risks well. In turn, this requires that our
safety devices and systems work on demand. It is
possible to arrive logically at the required
availability of the items in question and find suitable
detective maintenance strategies. While the analysis is
relatively easy, there are several hurdles in
implementing its results. These challenges can be met
by a range of solutions. They are not universal and
need to be tailored to each situation.
The word pro-active is very popular, especially in the
maintenance context. Detective Maintenance strategies
are pro-active. More importantly, they are essential to
long term success.
Keywords
Safety, Integrity, Profitability, Viability, Risk,
Reliability, Failure, Reputation, Hidden, Unrevealed,
Hazard(ous), Protective, Protected, Trip(s), Complexity,
Availability, Repairable, Detective, Maintenance,
Testing, Function.
Business Objectives
A business succeeds by maximizing its assets’ life-cycle
net present value.
In the short term, every Production Plant, Distribution
Company or Service Facility has to ensure that it is
profitable. For this, it must be able to work when it
is required. Stated differently, it has to be available
on demand.
In the long term, an additional requirement is that its
Technical Integrity (TI) and longevity are ensured. TI
is defined as the absence of foreseeable risk of
failure during specified operation that endangers the
safety of people, asset value or the environment.
Implied in this definition is the possible harm to the
reputation of the Company. Its reputation and TI
determine whether society will provide the Company with
its licence-to-operate.
The design must be such that both these objectives can
be met with reasonable effort and cost.
Outputs of Maintenance
We can expect to achieve these results with an effective
maintenance effort. They are the primary role and
justification for doing any maintenance. The raison
d'être of maintenance is to minimize the quantified
risk of loss of TI or production capacity that can
reduce the profitability or viability of the Company
[1].
Single and Multiple Failures
Some failures have an immediate and direct effect on
performance. Thus a seal leak from a pump can have an
immediate impact on safety, environment and/or
production.
A punctured tire means that you cannot use your car; if
the puncture occurs while you are driving, your safety
is at risk.
Other failures have no immediate effect. Thus, if your
car’s brake light bulb has failed, something else has to
happen before there is a consequence. At the time you
use the brake, there must be a driver behind you who
does not notice you are slowing down and hence crashes
into your car. Had the brake light worked, the person
would have known you were braking and would have
automatically started slowing down.
Failures that have no direct and immediate effect are
often much more hazardous and could lead to serious
events. Imagine that your spare tire has a very low
pressure and also that you suffer a tire puncture while
driving. Luckily you are able to pull over safely, but
you may still face several levels of problems. If you
are on a city street, there is the inconvenience and
delay of getting your spare tire checked and
re-pressurized before you can proceed. If you are in a
deserted and lonely place at night, your safety may be
at risk. Obviously there can be many other, more
unpleasant scenarios.
Similarly, if the over-speed trip device on your steam
or gas turbine fails, by itself this may not matter. If
the load drops off suddenly, the turbine will accelerate
rapidly. Now the failure of the over-speed device
matters; the turbine may throw its blades, causing major
damage and perhaps serious injury.
The second type of failure is termed hidden or
unrevealed, as we do not know that the item has failed
until it is called upon to work. Such failures relate to
items that remain dormant for most of their life, and
are called upon to work only when something else
happens.
Protective Systems
These are devices, equipment or systems that protect
other systems or equipment from hazardous situations.
Protective systems may be designed to work at the
equipment, process system or Plant level.
Examples of devices or systems providing protection
at the equipment level include Pressure Relief
Valves (PRV) on vessels, over-speed trip devices on high
speed machines, relays protecting generators or
transformers, centrifugal brakes on elevators and
axial-displacement trips on large rotating machinery.
At the system level, we have for example, pressure
control or blow-down systems, smoke or fire detection
systems, emergency shutdown systems, etc. At the Plant
level, we have for example, emergency depressurizing or
shutdown systems and fire protection systems.
Protective systems lie dormant for most of their life.
They are only required to operate during an emergency
situation. During their idle periods, the moving parts
may become stuck due to gumming or deposits from process
fluids, electrical contacts may be shorted by moisture
or insects, or pressurized actuating fluids may leak
away. Loss of
function of these protective devices or systems exposes
the protected equipment or system to single-event
failures, since the protective barrier has been lost.
Protective devices or systems are installed when the
hazards are high. If they do not work on demand, the
consequences can be extremely severe. For example, in
the Bhopal disaster, the scrubber system which was
designed to prevent toxic vapours from escaping into the
atmosphere had been left out of service. By itself,
this action did not cause a problem. However, when
toxic vapours were generated on December 3rd 1984, these
were released to atmosphere, because the scrubber system
was not available [2]. Several thousand people died and
many more were seriously injured.
Equipment Complexity
The design of equipment is increasingly complex, and
many items have software based control systems. Operator
intervention and surveillance tasks have reduced
substantially. Many of the operators’ data logging and
other routine tasks have been transferred to the control
and supervisory systems. Equipment productivity and
costs have both gone up with the increased complexity.
As a result, downtime and repair costs have also
increased significantly.
One of the features of modern equipment is the level of
protection provided to guard it from damage.
Similarly, modern Plant designs are more automated and
require fewer operators. For these reasons, we rely
increasingly on protective devices and systems to
safeguard our assets. Earlier we noted that failures of
protective devices and systems are hidden, so the
operator does not know at any point in time whether they
will work on demand.
Assuring Technical Integrity
There are many inputs required to ensure TI. One of the
important elements is the availability of protective
devices and systems.
A new or completely overhauled device or system (to As
Good As New or AGAN standards) has a survival
probability of 100% on installation. Thereafter the
survival probability (or reliability) falls. We can
never know its reliability at any point in time, but we
can estimate its value. We call this its expected value
and to estimate it, we have to make certain
assumptions. One of them is that the reliability at any
time t is represented by a continuous curve.
Another is that this curve follows a known distribution;
often, we assume it is exponential.
If we accept this basis, certain conclusions follow,
namely,
-
The reliability is 1 only at the time of
installation
-
It falls thereafter, the rate of decline depending
on the distribution curve
-
At some point in time, the reliability will be
unacceptable; this level is determined by the risk
we are willing to tolerate
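As a worked illustration of these three conclusions, take the exponential case mentioned above with a constant failure rate. The symbols λ (failure rate), R_min (lowest reliability we are willing to tolerate) and t_max (latest time to act) are introduced here for illustration only and do not appear in the original text.

```latex
\begin{align}
R(t) &= e^{-\lambda t}, \qquad R(0) = 1 \\
\frac{dR}{dt} &= -\lambda\, e^{-\lambda t} < 0 \quad \text{for all } t \ge 0 \\
R(t_{\max}) = R_{\min} \;&\Longrightarrow\; t_{\max} = -\frac{\ln R_{\min}}{\lambda}
\end{align}
```

The first line shows that the reliability is 1 only at installation, the second that it falls continuously thereafter, and the third gives the time at which it reaches the tolerable limit under this assumed distribution.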
Repairable and non-repairable items
We classify items that are replaced as a whole as
non-repairable. Strictly speaking, some of them may in
fact be dismantled and repaired off-site later. Others
will be discarded and replaced with new items. In both
cases, the replacement is AGAN and has a reliability of
1. Examples of such items include light bulbs, printed
circuit boards, gas detectors and ball or roller
bearings.
We treat another group of items as non-repairable, those
that are subject to hidden failures. In this case, the
operator cannot know whether the item is working or not
under normal circumstances, since it is lying dormant.
We can find out its condition by testing it
periodically. As discussed earlier, such items affect
TI, so when we know that one of them is defective, we
will replace it immediately with a new one. For
example, once you know that your car’s brake light is
burnt out, you will replace it fairly quickly.
For such items, the availability on demand, i.e., the
probability that it will work at any time t, is
identical to its reliability. Between tests nobody
knows of a failure, so nothing is repaired, and the item
works at time t only if it has survived until then.
Thus the availability of an item or system subject to
hidden failures is the same as its reliability. If we
know its failure distribution
curve, we can compute the reliability and hence its
availability. Note that this argument applies in the
special case of hidden failures.
Expected Availability
The first question we have to ask is about the level of
risk we are willing to tolerate. As you are aware,
quantitative risk has two components, the probability
and the consequence of failure. The latter is dependent
on specific situations, each with different exposure
levels. For example, a gas turbine (GT) blade fracture
is a very serious failure. It could cause significant
equipment damage, injure people and may result in
multiple fatalities. If the GT was driving a gas
compressor in a remote and unattended compression
station, the consequences may be limited to equipment
damage and production loss. If it was located in a
Process Plant, it could injure or kill people in its
vicinity. If it was fitted on a Jumbo jet aircraft,
hundreds of passengers may lose their lives due to hull
damage. Such an incident occurred with United Airlines
flight UA 232 in July 1989. In this accident, the
tail-mounted No. 2 engine of the DC-10 aircraft suffered
an uncontained failure, shedding fan disk and blade
fragments. The shrapnel cut through the redundant
(3 x 100%) hydraulic lines, causing loss of all flight
controls [3]. In all, 112 people died, but as a result
of some extraordinary disaster management, 184 people
survived the crash-landing at Sioux City.
Clearly the risk levels will vary, depending on the
circumstances. The expectations of protective system
availability must match our evaluation of risks. For
protective systems, we can expect their availability to
be from about 97.5% to over 99.5% depending on the
circumstances. We must carry out the maintenance work
(whether preventive, predictive, corrective or
detective) on or before the time the value of survival
probability reaches this level.
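The latest time to act can be estimated once a failure distribution is assumed. The following is a minimal sketch, assuming the exponential distribution discussed earlier; the function name and the failure rate of 0.02 per year are hypothetical values chosen only to illustrate the arithmetic, not data from the article.

```python
import math


def time_to_reach(target_reliability: float, failure_rate: float) -> float:
    """Time at which R(t) = exp(-failure_rate * t) falls to the target.

    Solves exp(-failure_rate * t) = target_reliability for t.
    The result is in the same time unit as 1 / failure_rate.
    """
    if not 0.0 < target_reliability <= 1.0:
        raise ValueError("target_reliability must be in (0, 1]")
    if failure_rate <= 0.0:
        raise ValueError("failure_rate must be positive")
    return -math.log(target_reliability) / failure_rate


# Hypothetical failure rate (failures per year), for illustration only.
ASSUMED_FAILURE_RATE = 0.02

for target in (0.975, 0.995):
    years = time_to_reach(target, ASSUMED_FAILURE_RATE)
    print(f"Survival probability falls to {target:.1%} after {years:.2f} years")
```

The tighter the availability target, the sooner the work has to be done: moving from 97.5% to 99.5% shortens the permissible interval roughly fivefold for the same assumed failure rate.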
Detective Maintenance
The only way to find out if protective devices or
systems will work is to call on them to operate. If
there is a real demand and the item works, then of
course we know it is in good order. But we cannot wait
for a real demand to find out its state, so the
alternative is to test it.
In their seminal work on Reliability Centered
Maintenance (RCM), Nowlan and Heap [4] called these
tests Failure Finding Tasks. Moubray [5] uses the term
detective tasks. Adopting the latter terminology, we
call the strategy which employs detective tasks
Detective Maintenance.
Detective maintenance is a necessary (but not
sufficient) activity required to guarantee TI.
Detective tasks include, for example:
-
Testing of smoke, gas and fire detectors
-
Periodically starting fire pumps
-
Testing the ejector seats of fighter aircraft
-
Building evacuation tests
-
Stroking of valves that stay in one position for
most of the time
-
Annual vehicle inspection
-
Testing of emergency disconnect/release systems on
cargo ships
-
Pre-overhaul testing of PRVs
-
Testing control loops of safety devices
-
Testing of relays protecting electrical equipment
-
Over-speed trip tests
-
Furnace/Boiler trip tests
-
Periodic testing of (fire-protection) deluge valves
and sprinkler systems
This is not an exhaustive list, and you will recognize
many of them from your own experience. The testing may
be done by operators or maintainers, but the work itself
is a maintenance activity. The test reveals the
following:
-
Whether the item is in a working or failed state,
and
-
The failure mode
Detective tasks are not condition-monitoring
activities. The latter track ongoing failures, measure
trends, and predict the time-to-failure. Examples of
condition-monitoring tasks include vibration
monitoring, oil-debris analysis and thermography. In
these cases, the degradation process has commenced, but
the item has not yet functionally failed. Detective
maintenance tasks are only applicable to items in one of
two discrete states. They are either working
or have already failed.
Protective devices or systems can fail in one of two
ways. The first is when the item does not work when
there is a demand. Responding to such a demand is why
the item was installed in the first place, so this
failure is likely to cause an unsafe condition. We call
these fail-to-danger events. The second situation is
when the item works
when there is no demand. It may then cause, for
example, a spurious trip resulting in loss of
production. We call these fail-to-safe or nuisance
events. If there are frequent spurious events, there is
a good chance that operators will routinely ignore all
such events, whether they are genuine or spurious, and
restart equipment without proper checks. This in turn
can lead to unsafe situations.
Detective maintenance only reveals the state of the
item; with this knowledge we have to take further action
if the item has failed. So this task can generate
further corrective maintenance tasks. Failed items are
usually replaced immediately, to minimize downtime. The
replacements are AGAN. (As an exception, instrument
drifts or span changes are often corrected on detection,
without generating corrective work orders, since the
actual corrective work is quite small.) These actions
bring the item back to a 100% reliable condition.
Practical difficulties with Detective Maintenance
Some protective devices can be tested during normal
operation, without requiring a shutdown of the whole
Plant or Facility. The sub-system or system may be
isolated for short periods while the tests are in
progress. Thus smoke detectors or fire pumps can be
periodically tested without direct impact on production
or safety.
Trips and shutdown systems have three active elements.
The sensor or detector identifies unacceptable
deviations from the norm. The output from the sensor(s)
goes to a logic unit. This uses a given set of
algorithms or software codes to compute an output
signal. The output signal is sent to the actuator of
the executive element. Some examples of executive
elements are:
-
Emergency shutdown valve
-
Ship’s rudder
-
Trip valve of a turbine
-
Circuit breaker
-
Deluge valve in a sprinkler system
-
Ejector seat of a fighter aircraft
Testing the sensors or logic unit usually does not pose
a problem. The output from the logic unit can be
defeated during the test so that the executive element
is not actuated. The problem lies with the testing of
the executive elements. Actuating them during normal
operation will result in a system or Plant shutdown with
direct production losses. Such trips may also create
unwanted safety hazards.
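The difference between testing the sensor and logic unit and testing the executive element can be sketched in a few lines. This is a toy model only: the class, its field names and the set-point value are hypothetical, and a real trip system has far more detail.

```python
from dataclasses import dataclass


@dataclass
class TripSystem:
    """Toy model of a trip: sensor -> logic unit -> executive element."""

    trip_setpoint: float               # hypothetical limit on the measured value
    output_defeated: bool = False      # True while the logic output is defeated for a test
    final_element_stuck: bool = False  # hidden failure of the executive element
    machine_running: bool = True

    def demand(self, measured_value: float) -> dict:
        """Apply a real or simulated demand and report what each element did."""
        sensor_alarm = measured_value > self.trip_setpoint
        logic_trip = sensor_alarm  # simplest possible logic: one sensor, one vote
        signal_sent = logic_trip and not self.output_defeated
        if signal_sent and not self.final_element_stuck:
            self.machine_running = False
        return {
            "sensor_alarm": sensor_alarm,
            "logic_trip": logic_trip,
            "signal_sent": signal_sent,
            "machine_running": self.machine_running,
        }


# Test with the output defeated: sensor and logic are proven, and the
# machine keeps running, but the stuck executive element stays hidden.
system = TripSystem(trip_setpoint=100.0, output_defeated=True,
                    final_element_stuck=True)
print("Test with output defeated:", system.demand(120.0))

# Real demand later on: the signal is sent, but the machine does not stop.
system.output_defeated = False
print("Real demand:", system.demand(120.0))
```

In this sketch the test with the output defeated passes even though the executive element has failed, which is why the testing of executive elements needs separate attention.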
Executive elements can always be tested during planned
shutdowns. These are not very frequent, so waiting for
them may not give us the required protective system test
interval to assure its availability. Plant shutdowns
are getting even less frequent, so there are fewer
opportunities to carry out detective maintenance.
This of course raises a practical difficulty: how can we
ensure TI without being able to carry out detective
tasks at the right frequency? One answer is to use any
short cleanout shutdowns that we have to plan; these
provide an excellent opportunity. Any trip of the Plant
can also be used, provided we are adequately prepared.
For these, the Planning system must be nimble!
Function tests
The traditional answer has been to carry out function
tests. In these tests the executive element is defeated
or gagged. As long as the actuator moves or the
appropriate hydraulic oil pressure is seen, it is
assumed that the executive element will work. Large
rotating machinery such as centrifugal compressors or
steam turbines have trip valves which stop the machines
in an emergency. Such trip devices often have
‘dead-man’ controls, which require a hydraulic or
mechanical force to be continually applied to keep the
power source live. On receiving a trip signal, the
hydraulic oil is ‘dumped’ to sump, or a trigger releases
the mechanical force holding the valve open. Based on
this data, it is assumed that the steam, gas or
electrical power source will be isolated, stopping the
machine.
This last assumption is not always correct. In
practice, a number of factors may prevent the steam or
electrical power source from being disconnected. Common
faults include sticking of the valve or trigger
mechanism due to gumming, dust and other deposits; bent
or distorted stems; and welded contacts in electrical
circuits.
Over-speed devices have an important safety function.
It is not enough to function-test them. The machine
must be run up to the trip speed, and this may require
the coupling to be removed.
Function tests may be acceptable as intermediate tests,
since they prove that most of the system works – only
the executive element does not move, so the machine does
not actually stop. We can reduce the risks by keeping
the trigger mechanism, stems etc. dry, clean and
lubricated, thus increasing the chances that the
executive element works. Remember however that the only
guarantee is when the machine actually trips!
Using installed spares and on-line testing methods
We are legally required to test PRVs at an acceptable
frequency. If there are two full-capacity PRVs
installed with one in operation (1 out of 2) with
interlocked isolation valves, we can take one out during
normal operation for testing.
Commercial on-line relief valve testing methods are also
available. These can be used as intermediate tests when
shutdowns are not possible. They do test the operation
of the PRVs under simulated conditions and verify that
the valve lifts and reseats properly. However, PRVs
have to be visually inspected internally for deposits,
and cleaned if dirty. For this reason, we cannot avoid
removing them periodically from their location.
Similarly, we can start and load standby equipment such
as pumps, compressors or emergency power generator
sets. Such tests confirm that the equipment does start
and that it will take up the full load.
We test individual gas detectors in-situ by exposing
their sensor units to a known concentration of test
gas. Similarly smoke and fire detectors are exposed to
appropriate tests. Control panel lights are checked by
lighting up all of them using a test switch. In all
these cases, the tests are under controlled conditions,
isolating signals from reaching the executive elements.
Opportunistic maintenance
If the plant comes down for any reason, be it a trip or
cleanout shutdown, we can use the opportunity. For
example:
-
We can test PRVs at that time. They don’t need an
overhaul, only a test, so we confirm that the
mechanism operates and gets de-gummed. We reset
only those relief valves that need adjustment. This
way, we can re-install them quickly, and not prolong
the plant outage.
-
We do over-speed tests at the beginning of planned
shutdowns, so that if there is a fault, we get some
time to fix it.
-
Whenever a Plant or Unit has to be shut down for
planned work, we use a different trip system each
time to do so, thus using the opportunity to test
each one in turn.
Such work needs careful planning effort and rapid
response.
Partial closure/opening tests
Emergency blow-down or shutdown valves and similar
mechanical devices can be operated partially, by
limiting the movement of their shafts to say, 2 to 3
mm. This can often be done by inserting a mechanical
stop to prevent further movement. For large
hydraulically operated valves, special devices are
available to allow controlled partial movement. These
tests ensure that the valve actually moves slightly,
breaking off any gumming or jamming deposits. Partial
closure tests give some, but not complete assurance of
integrity, so they are useful intermediate tests.
Whenever a plant has to be shut down on a planned basis,
we should operate these emergency valves to full stroke,
so that we have confidence in their integrity.
Test frequencies
It can be shown that for a given failure rate of a
protective device, the required availability can be
obtained by adjusting the test frequency (see pages
34-40, reference 1). The failure rates must be for the
appropriate failure modes. For example, if we wish to
compute the test frequency for a trip valve closing and
stopping a machine, the corresponding failure mode is
‘fail to close’.
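One common way to relate test interval and availability, which may or may not match the derivation in reference 1, is to assume exponentially distributed hidden failures and take the mean availability over the test interval. The sketch below uses that model; the function names and the ‘fail to close’ rate of 0.02 per year are hypothetical and serve only to show the calculation.

```python
import math


def mean_availability(failure_rate: float, test_interval: float) -> float:
    """Average of R(t) = exp(-failure_rate * t) over one test interval."""
    x = failure_rate * test_interval
    if x == 0.0:
        return 1.0
    return -math.expm1(-x) / x  # (1 - exp(-x)) / x, computed accurately


def approx_interval(target: float, failure_rate: float) -> float:
    """First-order estimate, from mean availability ~ 1 - failure_rate * T / 2."""
    return 2.0 * (1.0 - target) / failure_rate


def exact_interval(target: float, failure_rate: float) -> float:
    """Test interval giving the target mean availability, found by bisection."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if failure_rate <= 0.0:
        raise ValueError("failure_rate must be positive")
    # Search on x = failure_rate * T; mean availability decreases as x grows.
    low, high = 0.0, 1.0
    while mean_availability(1.0, high) > target:
        high *= 2.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if mean_availability(1.0, mid) > target:
            low = mid
        else:
            high = mid
    return high / failure_rate


# Hypothetical 'fail to close' rate for a trip valve, in failures per year.
ASSUMED_FAIL_TO_CLOSE_RATE = 0.02

for target in (0.975, 0.99, 0.995):
    print(f"target {target:.1%}: "
          f"approx {approx_interval(target, ASSUMED_FAIL_TO_CLOSE_RATE):.2f} y, "
          f"exact {exact_interval(target, ASSUMED_FAIL_TO_CLOSE_RATE):.2f} y")
```

For high availability targets the first-order estimate and the bisection result are close, so the simple formula is usually adequate for a first pass; the failure rate used must still be the one for the relevant failure mode.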
Summary
A successful business must generate profits while
operating safely. The latter needs an acceptable level
of Technical Integrity (TI). Hidden failures are major
contributors to the loss of TI. Detective maintenance
strategies help identify such failures, and are
therefore important.
Equipment and Plant designs are increasingly
complex. They are generally larger, more efficient and
require less operator attention. If they are down for
any reason, the cost of lost production can be very
high. They are therefore equipped with protective
devices.
Protective devices are generally dormant for much of
their life. The operator does not know if they will
work on demand, and their failure modes are hidden. In
order to manage risks properly, we have to be sure that
their availability is acceptable. We do this by
testing the item at the right frequency, and call this
strategy ‘Detective Maintenance’.
There are some practical difficulties in implementing
this strategy. These have several possible solutions,
none of which is perfect, but each is suitable in
specific applications. They do not replace the normal
test procedures, but can be used to extend the interval
between the normal tests.
What matters is that we recognize the importance of
detective maintenance strategies and demonstrate our
commitment to reaching the required TI levels. Our
licence-to-operate depends on a successful detective
maintenance program.
References
1. Narayan, V. 2004. Effective Maintenance Management:
Risk and Reliability Strategies for Optimizing
Performance. New York. Industrial Press, Inc. ISBN
0-8311-3178-0.
2. http://www.bhopal.org/whathappened.html (accessed
3rd February 2005)
3. http://www.airdisaster.com/eyewitness/ua232.shtml
(accessed 3rd February 2005)
4. Nowlan, F.S., and H.F. Heap. 1978. Reliability-Centered
Maintenance. Washington D.C. U.S. Department of Defense.
Unclassified, MDA 903-75-C-0349.
5. Moubray, J. 2001. Reliability-Centered
Maintenance. New York. Industrial Press, Inc. ISBN
0-8311-3146-2.
Editor's Note: Vee is a frequent contributor on the
MaintenanceForums.com discussion forums. You can post
comments about this article there or ask Vee a question.
You can also see Vee present a
one day workshop "Reliability Engineering for
Maintenance Professionals" at RCM-2006 - The Reliability
Centered Maintenance Managers' Forum March 8-10, 2006 in
Las Vegas.