|
There were already a large number of engineers actively engaged
in “reliability” related activities. I thought that most people
already understood and supported “reliability” as a core value.
I wondered why a “Reliability Workshop” was needed.
Now, twenty years later, I have a much better answer to that
question. Many, if not most, companies still need to have a
reliability workshop, and the objective of each workshop should
be to answer the question, “What level of reliability do I
have a right to expect?” To answer that question, companies must
first understand all the elements that affect “reliability” and
then evaluate how well they are dealing with each of them.

What do you have a right to expect?

“What you have a right to expect” is the result of the
reliability characteristics that were designed and built into
your systems and equipment, and that have been preserved through
proper operation and maintenance. If you are able to recognize
the things that cause failures, you will be able to make a
realistic evaluation of what you have a “right to expect”
because you can determine the amount of failure prevention that
has already been applied. You will also have a head start on
achieving the level of performance you desire.
When a device is designed and manufactured, the included
components have a certain level of robustness and the
configuration provides a certain amount of redundancy.
The combination of component choices and configuration leads to
the characteristic best described as “inherent reliability”. No
matter how well you operate and maintain a system, it cannot
perform better than its inherent reliability. If the system is
operated and maintained as well as possible, you will harvest
the maximum inherent reliability. If you operate or maintain the
system in a manner less than optimum, you will experience a
reliability performance that is something less than the inherent
reliability.
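
As a rough sketch of how component choice and configuration combine, the snippet below uses the standard series and parallel formulas for independent components. The independence assumption, the function names, and the pump and controller figures are illustrative assumptions, not values from this text.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must work, so the individual reliabilities multiply."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Redundant components: the group fails only if every member fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical figures, for illustration only.
pump = 0.90
controller = 0.95

single_pump_system = series_reliability([pump, controller])
redundant_pumps = parallel_reliability([pump, pump])
redundant_system = series_reliability([redundant_pumps, controller])

print(f"one pump + controller:  {single_pump_system:.4f}")
print(f"two pumps + controller: {redundant_system:.4f}")
```

Under these assumptions the redundant configuration has the higher inherent reliability, and no amount of operating or maintenance care can lift either configuration above its own figure.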
Each part of an item of equipment tends to deteriorate over
time. There are things you can do to minimize the deterioration.
Generally, this deterioration will lead to failure at some point
in time. If you understand the deterioration rate and the
current status of deteriorating components, it is possible to
intervene before the failure takes place. These actions that
minimize the deterioration, or intervene before failure, are
best described as “proactive maintenance”. Proactive maintenance
that is intended to simply monitor the current situation is a
form of “predictive maintenance”. Proactive maintenance that is
intended to change deteriorated components before they fail is
a form of “preventive maintenance”. To capture all the available
“inherent reliability”, you need to implement an optimum program
of proactive maintenance.

Despite how “perfect” you believe your proactive maintenance
program to be, there are always new defects or forms of
deterioration that you did not expect. These new defects will
result in unexpected failures, and the way you deal with them
through “reactive maintenance” will affect the reliability. If
your reactive maintenance system responds properly, does a good
job of diagnosing and troubleshooting the problem, repairs it
correctly, verifies the repair, records “as found” and “as left”
conditions (so that the deterioration rate can be calculated),
keeps good records, and installs proactive tasks that will
intervene before the next failure, then you will harvest all the
available inherent reliability.
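
To illustrate why the “as found” and “as left” records matter, here is a minimal sketch that estimates a deterioration rate from two such readings and projects when a limit would be reached. It assumes deterioration is linear over time, which is a simplification. The function names and the wall-thickness numbers are hypothetical.

```python
def deterioration_rate(as_left_previous, as_found_now, elapsed_days):
    """Loss per day between the last repair and the current inspection."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    return (as_left_previous - as_found_now) / elapsed_days


def days_until_limit(current_value, limit, rate_per_day):
    """Days remaining before the limit is reached, assuming a constant rate."""
    if rate_per_day <= 0:
        return None  # no measurable deterioration, so no projection
    return max(0.0, (current_value - limit) / rate_per_day)


# Hypothetical example: pipe wall thickness in millimetres.
as_left = 10.0        # recorded when the previous repair was completed
as_found = 9.2        # recorded at the current inspection
minimum_allowed = 8.0

rate = deterioration_rate(as_left, as_found, elapsed_days=400)
remaining = days_until_limit(as_found, minimum_allowed, rate)

print(f"rate: {rate:.4f} mm/day, about {remaining:.0f} days to the limit")
```

A projection like this is only as good as the records behind it, which is why both readings need to be captured at every repair.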

Stated more succinctly, if you have
calculated the expected inherent reliability and you are certain
you have good proactive and reactive maintenance programs, then
you know “what you have a right to expect” for reliability
performance.

“Reliability” or “reliability”

Many people who are not reliability experts tend to group
several other characteristics under the heading of reliability.
For the sake of discussion, I use the term “Reliability” (with a
capital R) to describe the concept of reliability that includes
all these elements.
First, reliability (with a small r) is defined as a measure of
the instantaneous likelihood that a system or device will not
fail in a given period of time. My best analogy for reliability is a
die (half of a pair of dice). Assume that the number one is a
defect and when the one comes out on top a failure will occur.
The reliability then is five-sixths and the unreliability is
one-sixth. As long as the defect exists, there is some
likelihood that a failure will occur. The only way to eliminate
or reduce the likelihood of failure is to eliminate the defect.
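
The die analogy can be sketched in a few lines of Python. The extension to several rolls assumes each roll is independent and stands for one operating period. That assumption, and the function name, are mine and are added only for illustration.

```python
from fractions import Fraction


def survival_probability(defect_faces, total_faces, rolls):
    """Probability that no defect face comes up in `rolls` independent rolls."""
    per_roll_reliability = Fraction(total_faces - defect_faces, total_faces)
    return per_roll_reliability ** rolls


one_roll = survival_probability(defect_faces=1, total_faces=6, rolls=1)
ten_rolls = survival_probability(defect_faces=1, total_faces=6, rolls=10)
defect_removed = survival_probability(defect_faces=0, total_faces=6, rolls=10)

print("reliability for one roll:", one_roll)
print("unreliability for one roll:", 1 - one_roll)
print("chance of surviving ten rolls:", round(float(ten_rolls), 3))
print("with the defect eliminated:", defect_removed)
```

As long as the defect remains on the die, repeated exposure steadily erodes the chance of getting through without a failure. Only removing the defect brings that chance back to one.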

Second, many people tend to roll the characteristic of
“availability” into their perception of a system’s Reliability.
Availability is a ratio of “up-time”, or time the system or
device can perform its intended function, to “total time”.
Total “down-time” or out-of-service time is the sum of all
planned down-time and unplanned down-time. “Planned
availability” is determined by how long a system can operate
between planned outages, and how much time is needed to conduct
the outage. “Unplanned availability” is determined by how
frequently unplanned outages occur (reliability), and the amount
of time needed to respond to unplanned interruptions.
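
As a small worked example of the ratio described above, the sketch below computes availability from total time and the two kinds of down-time. The function name and the hour figures are hypothetical.

```python
def availability(total_hours, planned_down_hours, unplanned_down_hours):
    """Ratio of up-time to total time, with down-time split by cause."""
    down_hours = planned_down_hours + unplanned_down_hours
    if total_hours <= 0 or down_hours > total_hours:
        raise ValueError("down-time must fit within a positive total time")
    return (total_hours - down_hours) / total_hours


# Hypothetical year of operation (8760 hours).
overall = availability(8760, planned_down_hours=240, unplanned_down_hours=98)
planned_only = availability(8760, planned_down_hours=240, unplanned_down_hours=0)

print(f"overall availability: {overall:.3f}")
print(f"availability if no unplanned outages occurred: {planned_only:.3f}")
```

Keeping planned and unplanned down-time as separate inputs makes it easy to see how much of the shortfall comes from each source.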
Third, many people also tend to roll many of the characteristics
of “maintainability” into their concept of Reliability.
Maintainability is a measure of your capability to return a
system or device to full inherent reliability in a predictable
period of time. If you were to say, “It will take three hours to
fix it, but I don’t know how reliable it will be”, the device
would not be maintainable. Likewise, if you were to say, “I don’t
know how long it will take to fix it, but when I finish it will
be right”, it is also not maintainable. To be “maintainable”,
you need to be able to both restore the inherent reliability and
to do it in a known amount of time.
Not all the elements of Reliability are strictly parts of
reliability, but many asset managers tend to include those
characteristics when demanding Reliability improvements.
If you are attempting to answer the question, “What do you have
a right to expect?” and the question is intended to address
aspects of availability and maintainability, you will need to be
ready to answer some additional questions.

Drinking out of a Firehose

You are probably asking yourself, “How can this individual
expect to explain the whole subject of reliability in such a
small text?” The answer is that I am not. I have two objectives
for this text. The first is to describe a good starting point
for individuals and organizations who are willing to admit they
are still new to reliability. Second, I am going to try to fill
a gap that most reliability experts have ignored. That gap is
the one between being a highly reactive organization (and
gathering little information) and having sufficient information
to begin the journey to becoming a proactive organization. I
will describe an approach that will be useful and valuable to
reliability engineers, as well as being worthy of resource
investment by asset managers.
A number of years ago I purchased a text entitled “The Little
Black Book of Project Management”. At the time it was published
and for some time thereafter, there were few comparable texts on
Project Management. Few individuals seemed to come “equipped”
with project management skills. As a result, I repeatedly loaned
the book to subordinates to help increase their knowledge and
improve their skills. As usual, that practice turned out to be a
good way to lose a book, so I was without a copy for several
years. A few years ago I spotted another copy on the shelf of a
used bookstore and purchased it. Again, I am regularly making
the mistake of lending my copy out.
In exchange, I get improved performance by young engineers
assigned to manage projects. My objective here is to create a
text that is as useful to others who are trying to improve
reliability performance as the “Little Black Book of Project
Management” has been for me. The key characteristics are:
• The book is relatively short.
• Application of the approach described in the book is
straightforward.
• Usefulness is not limited by scale. The book is equally
useful to both large and small enterprises.
• You don’t need to be an expert to use the knowledge.

Prepare for Change

Like most things of any value, application of the techniques
found in this book requires change. The most significant change
confronting the individual is a change in roles. The most
significant change confronting an organization is a change in
the corporate culture. To implement the approaches described in
this text, both kinds of change will be required. A number of
individuals will need to add tasks and change the way they are
performing some of their current tasks. The organization will
need to become unwilling to accept sloppiness in gathering facts
and using information.
It is important to keep in mind that the tasks discussed here
are closely integrated with tasks currently being accomplished
within most organizations. Current meetings, current planning
and scheduling protocols, and current organizational structures,
will need to be modified in a thoughtful, integrated manner if
optimum results are to be achieved.
One last issue: when you think about how things have changed
in the last fifteen or twenty years, there are few that have
changed as dramatically as those that have been affected by
computerization. Before Reliability Centered Maintenance (RCM)
became popular, there were a few organizations that seemed to be
light-years ahead of everyone else in terms of equipment
reliability. One might ask how they achieved their performance.
The answer was that those companies exercised the patience and
discipline to track failures and their causes. At some point
they were able to recognize patterns and relationships that were
hidden in the data, and use that information to prevent failures
before they occurred.
One cannot over-emphasize the dedication needed to record, store
and analyze information before computers became available. There
were cabinets full of paper files. The files in those drawers
were faithfully maintained and the data was transferred to
manual graphs that made the patterns and relationships more
apparent. Most organizations that successfully accomplished this
effort were led by a single-minded individual and staffed over a
long period by a group of highly dedicated people.
In today’s work environment, individuals are seldom allowed the
luxury of single-mindedness. They are expected to dilute their
thoughts and standards to fit in with other members of their
team. There are far fewer individuals in each work group, and
few assignments last more than a few years. So the “corporate
memory” must come from some other mechanism.
Fortunately, many of these shortcomings can be addressed by
supporting the processes with computerized files. In fact,
without the use of a computerized file system, this initiative
would be a foolish undertaking. Few organizations have the
determination and discipline to make it work without having key
functions automated by a computerized filing system.
This is not to say that a well-designed computerized system will
eliminate the need for human interaction and administration.
There are a wide variety of elements that can go astray if they
are not properly managed.
One example we will discuss is “bucketing”, a term used for
classifying initial failure reports (Failure Notifications) and
closing reports (Failure Modes). One way in which many systems
become corrupted is by allowing too many people to define
classes of failure. If individuals are allowed to create a new
class every time they cannot find an exact fit, there will soon
be too many classes that are only slightly different from each
other. When only a portion of each true failure mode is assigned
to each of a number of similar failure descriptions, the final
statistics can point to an incorrect failure description as
being the most statistically likely. This improper result will
cause inappropriate corrective actions to be taken.
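
The effect is easy to reproduce with a toy data set. In the sketch below, the report labels, the counts, and the mapping table are all invented for illustration. The point is only that splitting one true failure mode across near-duplicate classes can change which class looks most frequent.

```python
from collections import Counter

# Hypothetical closing reports. The three bearing labels describe one true mode.
raw_reports = (
    ["seal leak"] * 4
    + ["bearing wear"] * 2
    + ["bearing worn"] * 2
    + ["worn bearing"] * 2
)

# A controlled mapping that folds near-duplicates into one approved class.
approved_class = {
    "bearing worn": "bearing wear",
    "worn bearing": "bearing wear",
}


def most_frequent(labels):
    """Return the (label, count) pair with the highest count."""
    return Counter(labels).most_common(1)[0]


raw_top = most_frequent(raw_reports)
merged_reports = [approved_class.get(label, label) for label in raw_reports]
merged_top = most_frequent(merged_reports)

print("top class with uncontrolled buckets:", raw_top)
print("top class with controlled buckets:  ", merged_top)
```

With uncontrolled buckets, the seal leak appears to be the leading problem. Once the duplicates are merged, bearing wear clearly dominates, and it is the one that deserves corrective action first.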
The most successful system will combine a well-designed
computerized database and the right amount of human interaction
to ensure it is not misused or corrupted by individuals who lack
an overall understanding of the system design and objectives.
If this is beginning to sound like a lot of book-keeping, it is.
Good reliability management is a matter of understanding how
your equipment fails. This understanding must depend on facts,
not speculations or beliefs. In many ways, your equipment, and
the way it operates and is maintained, is unique. As a result,
your reliability information will be unique.
|