The Seven Questions of Reliability-Centered Maintenance
by Bill Keeter and Doug Plucknette, Allied Reliability
Abstract
Reliability-Centered
Maintenance (RCM) is a phrase coined thirty years ago to
describe a cost effective way of maintaining complex systems.
The RCM method uses the answers to seven very basic questions to
help determine the best maintenance tasks to implement in an
Equipment Maintenance Plan (EMP). This paper focuses on those
seven questions and how they help determine the EMP.
Introduction
On December 29th, 1978,
F. Stanley Nowlan and Howard F. Heap published report number
A066-579, "Reliability-Centered Maintenance". The report was the
culmination of several years of work aimed at determining a new,
more cost effective way of maintaining complex systems. They
called it Reliability-Centered Maintenance (RCM) because
programs developed through RCM "are centered on achieving the
inherent safety and reliability capabilities of equipment at a
minimum cost". RCM is a time consuming, resource intensive
process. Many practitioners have tried to reduce the amount of
time and resources required to accomplish RCM projects with
varying degrees of success. The most successful ones have
focused on understanding the basic goals of RCM, and on the
seven basic questions that need to be asked about each asset. In
this paper we will concentrate on understanding each of the
seven questions and how the answers to those questions help
determine a Reliability-Centered approach to asset management.
The Definition of
Reliability
In the book
Maintainability, Availability, and Operational Readiness
Engineering, Dimitri Kececioglu defines reliability as:
"The probability
that a system will perform satisfactorily for a given period of
time under stated conditions."
Nowlan and Heap define
Inherent Reliability as:
"…the level of
reliability achieved with an effective maintenance program. This
level is established by the design of each item and the
manufacturing processes that produced it. …"
In The Fault Tree
Analysis Guide a system is defined as:
"A composite of
equipment, skills, and techniques capable of performing or
supporting an operational role, or both. A complete system
includes all equipment, related facilities, material, software,
services, and personnel required for its operation and support
to the degree that it can be considered self-sufficient in its
intended operational environment."
When we look at these
definitions in conjunction, it becomes very evident that any
asset management program must address system development through
all phases of a system's life. There is no maintenance program
that can improve the reliability of a poorly designed system.
Additionally, whatever maintenance program is developed is
determined by the design of the system and the goals of the
organization.
The Goal of
Reliability-Centered Maintenance (RCM)
The primary goal of
Reliability-Centered Maintenance (RCM) should therefore be to
ensure that the right maintenance activity is performed at the
right time with the right people, and that the equipment is
operated in a way that maximizes its opportunity to achieve a
reliability level that is consistent with the safety,
environmental, operational, and profit goals of the
organization. This is achieved by addressing the basic causes of
system failures and ensuring that there are organizational
activities designed to prevent them, predict them, or mitigate
the business impact of the functional failures associated with
them.
The Seven Questions
of RCM
There are seven basic
questions used to help practitioners determine the causes of
system failures and develop activities targeted to prevent them.
The questions are designed to focus on maintaining the required
functions of the system.
1. What are the
functions of the asset?
2. In what way can the
asset fail to fulfill its functions?
3. What causes each
functional failure?
4. What happens when
each failure occurs?
5. What are the
consequences of each failure?
6. What should be done
to prevent or predict the failure?
7. What should be done
if a suitable proactive task cannot be found?
What Are The
Functions of the Asset?
Every facility is
uniquely designed to produce some desired output. Whether it is
tires, gold, gasoline, or paper, the equipment is put together
into systems that will produce the end product. Each facility
may have some unique equipment items, but in many cases common
types of equipment are just put together in different ways.
Within every RCM analysis we have two types of functions. First
is the Main or Primary Function; this function statement
describes the reason we have acquired the asset and the
performance standard we expect it to maintain. Second are the
Support Functions, which list the function of each component or
maintainable item that makes up the system. The Support
Functions are provided by the bottom level of equipment in most
facilities such as pumps, electric motors, valves, rollers, etc.
Each of those maintainable items has one or more easily
identifiable functions that enable the system to produce its
required output. It is the loss of these functions that leads to
variation in the Main or Primary function of the system and the
safety, environmental, operational, and profit output of the
facility.

The
key thing to remember when describing equipment functions is
that we are interested in what the equipment does in relation to
its operating context, not what it is capable of doing. For
example, a cooling tower pump may be capable of pumping 100 gpm
at 275 ft of head, but may only need to pump 75 ± 10 gpm at that
same pressure. It is necessary to focus on the required and
secondary functions within the system operating context in order
to analyze asset functions. Our main function statement for
this system would address the functionality within the operating
context: “Be able to pump cooling tower water at a rate of 75 ± 10
gpm at 275 ± 15 ft of head while maintaining all quality,
health, safety and environmental standards.”
The
rate, the head requirement, quality, health, safety and
environmental standards are all performance standards for the
pump.
Functions need to be well defined. Statements such as “pump
water from the pond” don’t lend themselves well to understanding
what a functional failure would look like. A statement such as
“pump 1000 ± 100 gpm at 275 ± 15 ft of head from the pond” makes
it easy to understand what a functional failure might look like.
If we can only pump 800 gpm, then we obviously have an
unacceptable variation in output.
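The pond pump statement can be turned into a simple check. The
sketch below is only an illustration in Python: the class and
function names are our own assumptions, not part of any RCM
standard or EAM product, and it assumes each performance standard
can be written as a target value with a symmetric tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceStandard:
    """A required value with a symmetric tolerance, e.g. 1000 ± 100 gpm."""
    name: str
    target: float
    tolerance: float
    unit: str

    def is_met(self, measured: float) -> bool:
        # The standard is met when the measurement lies inside the band.
        return abs(measured - self.target) <= self.tolerance


def functional_failures(standards, measurements):
    """Return the names of the standards the measurements fail to meet."""
    return [s.name for s in standards if not s.is_met(measurements[s.name])]


# Function statement from the text: pump 1000 ± 100 gpm at 275 ± 15 ft of head.
pond_pump = [
    PerformanceStandard("flow", 1000.0, 100.0, "gpm"),
    PerformanceStandard("head", 275.0, 15.0, "ft"),
]

failed = functional_failures(pond_pump, {"flow": 800.0, "head": 275.0})
print(failed)  # ['flow']
```

Because the function statement carries numbers, the 800 gpm
reading is flagged without any debate about whether the pump has
"failed"; a vague statement such as “pump water from the pond”
gives the check nothing to test against.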

In
What Way Can the Asset Fail to Fulfill its Functions?
Nowlan
and Heap said there are two types of failures. There are
functional failures and potential failures. Functional failures
are usually found by operators, and potential failures are
usually found by maintenance personnel. In many organizations
there are great debates about what constitutes a failure. In
their original work Nowlan and Heap used a very good definition
for failure. “A failure is an unsatisfactory condition.” Using
this definition allows us to grasp the idea that equipment can
continue to operate yet be considered failed. Many condition
monitoring programs don’t achieve their desired output because
those running the program do not recognize that a failure has
occurred as soon as an unsatisfactory condition is detected.
They often try to run the equipment as long as possible or until
they get closer to the F of the P-F curve. At Allied
Reliability we call this “managing to the F”. More mature
programs manage to the P, meaning that they take action as soon
as the unsatisfactory condition is recognized. Remember, the
further we go along the P-F curve the higher the level of
business risk we are accepting.
It is
equally important to recognize that there is significant value
in ensuring that equipment is installed and commissioned
properly.

The I-P-F curve
shown above is the standard P-F curve with an I-P portion
added. Point I is defined as the point of installation of the
component. The I-P portion of the I-P-F curve is the failure
free period. This is the time during which the operation is
defect free. The I-P interval for machines that were installed
improperly may be just a few seconds. The I-P interval for
machines installed by well trained crafts people using well
designed procedures, precision techniques, and precise measuring
equipment, and commissioned by operators using well designed
operating procedures may be years.
The graphic above
shows what the I-P-F curve for two differently installed
identical machines might look like. The machine with the longer
I-P interval was installed by well trained crafts personnel
using a properly designed procedure and precision measuring
devices, and commissioned by operators using a well designed
operating procedure. The machine with the shorter I-P interval
was installed by inadequately trained personnel using either no
procedure or a poorly designed procedure without precision
measuring devices and techniques, and commissioned by operators
using either no procedure or a poorly designed procedure. The
difference in lengths of the I-P portions of the curve for the
two pieces of equipment may represent large sums of money. The
dollars represent the additional cost of parts and labor and
also the amount of additional foregone production as a result of
the extra maintenance work that had to be performed.
Looking at an
organization’s shift in focus from F toward I is a more
effective way to determine its maturity than looking at the
age of its maintenance program. Many organizations reactively
maintain equipment for a long time. An organization that is
constantly focused on Point F and staying clear of it will
undoubtedly have a reactive culture. Typical things heard around
this organization might be “How long can we run it before it
fails?” and “Just how bad is it?”.
An organization’s
first step toward maturity will be to shift its focus from Point
F to Point P. The organization then focuses its efforts on
understanding how things fail and on its ability to detect these
failures early. Typical things overheard in this organization
may be something like: “Is this the best way to detect these
defects early?” or “I appreciate you letting me know about this
problem, even though it’s very early.”
Further maturation
results in a transition from focusing on Point P to focusing on
Point I. Overheard in the hallways of this organization are
things like “Take the time to do it right, it will pay big
dividends for us not too far down the road” and “Let’s update
the procedures for that job to reflect what we just learned”.
This organization is trying to prevent failures from occurring
in the first place by applying best practices with fits,
tolerances, alignment standards, contamination control and well
documented procedures. They will see the step change in
performance, and they are the ones we label “mature”, not the
organizations that have been doing it poorly but for a longer
period of time.
The functional
failure statement describes the loss of the equipment’s
function, not what is wrong with the equipment. A good
functional failure statement will most likely not have the noun
name of an equipment part in it.

What Causes Each
Functional Failure?
At the end of the
day we will be building maintenance tasks designed to prevent
functional failures from occurring. In order to do this we must
understand what causes each functional failure. The cause may
be the failure of some equipment part, but it can just as easily
be a failure in some human activity. Improper operation and
improper maintenance are likely to be the causes of failures.
Remember the definition of a system. Everything and everybody
in the facility has some impact on system reliability.
It is very
important to describe these causes or failure modes in a way
that allows us to create a living program for improving asset
management. Easy to use codes in the Enterprise Asset
Management (EAM) system will allow us to capture data about what
types of failures are occurring and to react to that data by
reengineering the maintenance plan, training plan, or equipment
design associated with the equipment. A well designed Failure
Reporting, Analysis, and Corrective Action System (FRACAS) is a
must for continuously improving system performance.
For part failures
we may want to use a simple three part code that consists of the
part name, part defect, and defect cause.
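As a rough illustration of such a three part code, the Python
sketch below stores the part name, part defect, and defect cause
as one record and counts repeats. The "PART-DEFECT-CAUSE" text
layout and every code value shown are assumptions made up for
this example; a real EAM system will have its own code tables.

```python
from collections import Counter
from typing import NamedTuple


class FailureCode(NamedTuple):
    part: str     # part name
    defect: str   # part defect
    cause: str    # defect cause

    def __str__(self) -> str:
        return f"{self.part}-{self.defect}-{self.cause}"


def parse_failure_code(text: str) -> FailureCode:
    """Parse 'PART-DEFECT-CAUSE' into its three fields."""
    fields = [f.strip().upper() for f in text.split("-")]
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected PART-DEFECT-CAUSE, got {text!r}")
    return FailureCode(*fields)


# Hypothetical work order history; the values are invented for illustration.
history = ["BRG-WORN-LUBE", "SEAL-LEAK-MISALIGN", "brg-worn-lube", "BRG-SEIZED-CONTAM"]
codes = [parse_failure_code(h) for h in history]

# Count repeated part/defect/cause combinations to see where the maintenance
# plan, training plan, or equipment design may need re-engineering.
top_code, top_count = Counter(codes).most_common(1)[0]
print(top_code, top_count)  # BRG-WORN-LUBE 2
```

Keeping the three parts as separate fields is what lets the
FRACAS sort the history by part, by defect, or by cause.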

What Happens When
Each Failure Occurs?
Known as Failure
Effects, these statements clearly describe what happens when a
failure occurs and what events are required to bring the process
back to normal operating conditions. Different things can
happen when a failure occurs. Not all failures are created
equal. When listing failure effect statements we should fulfill
the following criteria:
- Events that led up to the failure – Any immediate notable
  effects of wear or imminent failure
- First Sign of Evidence – Is the failure evident to the
  operating crew as they perform their normal duties? If so
  explain how.
- Secondary Effects – The effects of failure on the next higher
  indenture level under consideration.
- Events Required to Bring the Process Back to Normal Operating
  Conditions
What Are the
Consequences of Each Failure?
What makes
failures matter is their impact on the business. Every business
has goals for profitability, safety performance, environmental
performance, and operational performance. Each failure has a
different impact on business performance, and it is important
for the RCM team to understand the consequences of each one.
Some failures are of little to no consequence, and some can
result in the loss of lives, or in extreme cases total failure
of the business.
Most organizations
use some sort of severity matrix to define the consequences of
failures. The tables below represent just some of the ways this
can be done.

How would your
company handle creating severity rankings for failures?
In most cases each
failure will be ranked according to what is known as
criticality. The criticality is the result of combining
probability and consequence rankings together to yield a single
number. The criticality will be biased toward the business’s
philosophy of safety, environmental, and operational risk. The
tasks in the Equipment Maintenance Plan (EMP) generated from the
RCM analysis are designed to lower the criticality of the
significant failures in the system. Tasks can be rank ordered
for implementation by implementing first those that yield the
highest reduction in criticality.
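The ranking idea can be sketched in a few lines of Python. The
paper does not say how the probability and consequence rankings
are combined, so this example assumes a simple product of two 1
to 5 rankings; the task names and rankings are hypothetical.

```python
from dataclasses import dataclass


def criticality(probability_rank: int, consequence_rank: int) -> int:
    # Assumption: the two rankings are combined by multiplication.
    return probability_rank * consequence_rank


@dataclass(frozen=True)
class CandidateTask:
    name: str
    probability_before: int
    probability_after: int
    consequence_before: int
    consequence_after: int

    @property
    def reduction(self) -> int:
        """Criticality removed by implementing the task."""
        before = criticality(self.probability_before, self.consequence_before)
        after = criticality(self.probability_after, self.consequence_after)
        return before - after


# Hypothetical tasks and rankings, for illustration only.
tasks = [
    CandidateTask("vibration route on pump bearings", 4, 2, 3, 3),   # 12 -> 6
    CandidateTask("relief valve failure-finding test", 3, 1, 5, 5),  # 15 -> 5
    CandidateTask("stock a spare motor", 3, 3, 4, 3),                # 12 -> 9
]

# Implement first the tasks that remove the most criticality.
ranked = sorted(tasks, key=lambda t: t.reduction, reverse=True)
for task in ranked:
    print(task.reduction, task.name)
```

A predictive or preventive task usually lowers the probability
ranking, while a mitigating task such as stocking a spare lowers
the consequence ranking; the same calculation handles both.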
What Should be
Done to Predict or Prevent the Failure?
Each failure mode
must be examined to determine what type of maintenance task, if
any, should be used to prevent or predict it. Nowlan and Heap
recognized four basic types of PM tasks.
- Scheduled inspection of an item at regular intervals to find
  any potential failures
- Scheduled rework of an item at or before some specified age
  limit
- Scheduled discard of an item (or one of its parts) at or
  before some specified life limit
- Scheduled inspection of a hidden-function item to find any
  functional failures
When and how these
tasks are performed depends on the failure mechanism that is
present. In the original report six failure shapes were
investigated. The team determined that only 11% of the failure
modes present in their study of aircraft part failures would
lend themselves to scheduled rework or replacement. In this
instance 89% of the failure modes present would require some
sort of inspection. The majority of the failure modes, 68%,
could actually be made worse by time based overhaul or
replacement. Clearly, some good non-invasive method of
inspecting for potential failures would be very beneficial.

Figure 3:
Failure Shapes (John Moubray, Nowlan and Heap)
In some cases it
is not possible to detect functional failures during normal
operations. Those undetectable failures are called hidden
failures. Hidden failures are usually associated with some sort
of protective system that is designed to minimize the impact or
prevent the high consequences associated with a failure of the
protected system. Items such as pressure safety valves (relief
valves), circuit breakers, high temperature interlocks, and high
level interlocks are just a few examples of devices that could
have hidden failures. The bad news is that the consequences of
failure can be extremely high. The good news is the probability
of the catastrophic event is often quite low. It requires that
both the protecting and the protected item fail at the same
time. In cases where functional failure is not immediately
detectable during normal operations a failure finding task must
be done to prevent the high consequences associated with
multiple failures.
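To see why the failure finding interval matters, the multiple
failure can be estimated roughly as follows. This Python sketch
is not taken from the paper. It assumes the common approximation
that a periodically tested device is unavailable, on average,
for the test interval divided by twice its mean time between
failures, which only holds for random failures and an interval
that is short relative to the MTBF. The relief valve numbers are
hypothetical.

```python
def average_unavailability(test_interval_years: float, mtbf_years: float) -> float:
    """Approximate fraction of time a hidden-function device sits failed.

    Assumption (not from the paper): failures are random, the device is
    restored as soon as a test finds it failed, and the test interval is
    short compared with the device's mean time between failures (MTBF).
    """
    if test_interval_years <= 0 or mtbf_years <= 0:
        raise ValueError("interval and MTBF must be positive")
    return test_interval_years / (2.0 * mtbf_years)


def multiple_failure_rate(demands_per_year: float,
                          test_interval_years: float,
                          mtbf_years: float) -> float:
    """Expected multiple failures per year: demand rate x unavailability."""
    return demands_per_year * average_unavailability(test_interval_years, mtbf_years)


# Hypothetical relief valve: MTBF of 50 years, one overpressure demand
# every 10 years on average.
yearly_test = multiple_failure_rate(0.1, 1.0, 50.0)
four_yearly_test = multiple_failure_rate(0.1, 4.0, 50.0)
print(f"{yearly_test:.4f} per year")       # 0.0010 per year
print(f"{four_yearly_test:.4f} per year")  # 0.0040 per year
```

Under these assumptions, stretching the test from one year to
four years quadruples the expected rate of the multiple failure,
even though nothing about the valve itself has changed.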
Table 6,
reproduced from the Nowlan and Heap report, presents a comparison
of the four types of tasks and their applicability. For
non-critical failures the order of preference will generally be
inspection, rework, and lastly discard or replacement of the
item.
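The order of preference for non-critical failures can be written
as a small lookup. In this Python sketch the task type labels
and the function name are our own shorthand, and deciding which
task types are applicable to a failure mode is left to the RCM
team.

```python
from typing import Iterable, Optional

# Order of preference stated in the text for non-critical failures.
PREFERENCE_ORDER = ("inspection", "rework", "discard")


def preferred_task(applicable_tasks: Iterable[str]) -> Optional[str]:
    """Return the most preferred task type among those judged applicable."""
    applicable = set(applicable_tasks)
    for task_type in PREFERENCE_ORDER:
        if task_type in applicable:
            return task_type
    return None  # no scheduled task applies


print(preferred_task(["discard", "rework"]))  # rework
print(preferred_task([]))                     # None
```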

When Nowlan and
Heap published their report in 1978, condition monitoring methods
such as vibration analysis (VA), ultrasonic inspection (UE),
ultra-violet inspection (UV), and other non-invasive technology
based inspection methods were in their infancy and were very
expensive to deploy. Now, nearly thirty years later, technology
based inspection methods are relatively inexpensive and easy to
deploy. These methods are really nothing more than inspection
methods that can be used on a periodic basis to determine the
condition of equipment. We can be almost certain that Nowlan
and Heap would have recommended extensive use of these
technologies had they been readily available.
In any case, the
task chosen must either lower safety, environmental, or
operational risk to an acceptable level, or for non-critical
failures be economically effective. Risk is always the top
driver in the decision making process. We may have to spend
more money to ensure that we meet our risk goals.

What Should be
Done if a Suitable Proactive Task Cannot be Found?
There may be a
couple of reasons why we wouldn’t be able to find a suitable
proactive task. We are either unable to find a task that will
lower business risk to an acceptable level, or we are unable to
find a task that is economically feasible. Each case requires a
different response. In the first case, the system will have to
be redesigned to achieve an acceptable level of risk. In the
second case, we can choose a run to failure approach for the
failure mode. It is important to remember that when a run to
failure strategy is employed we should then put in place
consequence reduction tasks to mitigate the impact of the
failure. The RCM team must ensure that appropriate steps are
taken to have written procedures in place to deal with the
failure mode, and that proper spares levels are maintained.
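The two cases can be laid out as a small decision function. The
Python sketch below is one way to read the paragraph above, not
a formal RCM decision diagram; the names and the three yes/no
inputs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Response(Enum):
    PROACTIVE_TASK = "proactive task"
    REDESIGN = "redesign"
    RUN_TO_FAILURE = "run to failure"


@dataclass
class Decision:
    response: Response
    follow_ups: List[str] = field(default_factory=list)


def decide(risk_is_unacceptable: bool,
           task_lowers_risk: bool,
           task_is_economic: bool) -> Decision:
    """Choose a response for one failure mode."""
    if risk_is_unacceptable:
        # Risk drives the decision: without a task that brings risk down
        # to an acceptable level, the system has to be redesigned.
        if task_lowers_risk:
            return Decision(Response.PROACTIVE_TASK)
        return Decision(Response.REDESIGN)
    if task_is_economic:
        return Decision(Response.PROACTIVE_TASK)
    # Non-critical failure with no economically feasible task.
    return Decision(
        Response.RUN_TO_FAILURE,
        ["written procedure for dealing with the failure",
         "proper spares levels maintained"],
    )


decision = decide(risk_is_unacceptable=False,
                  task_lowers_risk=False,
                  task_is_economic=False)
print(decision.response.value)  # run to failure
print(decision.follow_ups)
```

Note that run to failure never comes back with an empty list of
follow-ups: the consequence reduction steps are part of the
decision, not an afterthought.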
Conclusion
Answering the
seven questions of RCM properly will yield a cost effective EMP
that achieves the business’ goals for safety, environmental, and
operational risk. Answering the questions properly requires a
cross-functional team of maintenance, operations, and
engineering personnel who have a thorough understanding of how
the asset works, and what the organization’s risk and profit
goals are.
