Effective Maintenance Management -
Risk & Reliability Strategies for Optimizing
Performance
By Vee Narayan
Providing a clear explanation of the value and
benefits of maintenance, this unique guide is
written in a language and style that practicing
engineers and managers can understand and apply
easily. Effective Maintenance Management examines
the role of maintenance in minimizing the risk of
safety or environmental incidents, adverse
publicity, and loss of profitability. In addition to
discussing risk reduction tools, it explains their
applicability to specific situations, thereby
enabling you to select the tool that best fits your
requirements. Intended to bridge the gap between
designers/maintainers and reliability engineers,
this guide is sure to help businesses utilize their
assets more effectively, safely, and profitably.
An excerpt has been provided below courtesy of
Industrial Press.
Maintenance can mean different things to different
people. Quite often,
senior managers and accountants see maintenance as a
cost burden that
should be minimized. At the working level, some of us
see it as a set of preventive,
corrective, or breakdown rectification activities. Some
classify it as reactive or proactive work. To still
others, it means predictive, planned, or unplanned
activity. All these are merely the various dimensions of
maintenance. They are valid descriptions, but do not
address its functional
aspects. We prefer to look at the role or function of
maintenance and its strategic
contribution to the health of a business. In Chapter 8,
we examined the
role of maintenance in preventing event escalation and
how it helps retain the integrity and productive
capacity of the facility over its life.
This is its strategic
role; maintenance helps maximize the profitability of a
business over its
life.
In this chapter, we will see how appropriate maintenance
strategies can help manage risk effectively.
In Chapter 2, we noted that the capability of an item of
equipment, system
or plant may deteriorate over time, due to fouling,
wear, corrosion, or
fatigue. At some point in time, the capability falls
below the required performance
level. We can restore the performance before this point,
or shortly thereafter. We term such restoration activity
as maintenance. There is another situation where we
require maintenance. This is when the operator
does not know the state of an item, whether it is
working or has failed.
These
are the items that can have hidden or unrevealed
failures.
In these cases, the
role of maintenance is to identify the state by carrying
out a test.
If the item is
in a failed state, we need to carry out further
on-failure maintenance to
restore
it to a working state.
9.1 MAINTENANCE AT THE ACTIVITY LEVEL: AN EXPLANATION OF TERMINOLOGY
9.1.1 Types of maintenance: terminology and application rationale
When the consequence of failure in service is
negligible, we can afford to do the restoration work
after the item has failed. We call this strategy
on-failure
or breakdown maintenance.
Unfortunately, many failures have an
unacceptable consequence, so we cannot always apply a
breakdown strategy. If we can measure the deterioration and note the period of incipiency, it is
possible to predict the time of failure. In such a
case, we can schedule the work to ensure minimum
disruption of production. This ability to schedule the
work facilitates a quick and efficient turnaround. We
call this strategy on-condition maintenance, where we
can detect and rectify a deteriorating condition
before there is functional failure.
In the case of hidden failures, we have to test the equipment
periodically.
This will identify whether it is in working condition.
When we carry out the tests, we carry out
failure-finding tasks. If we find the item in a failed
state, we rectify it by carrying out breakdown
maintenance. Under certain conditions, periodic repair or replacement of the item
is warranted, even though it is still in working
condition.
Planned maintenance includes all
of the following:
· Testing for hidden failures;
· Condition monitoring of incipient failures;
· Pre-emptive repair or replacement action based on time (running
hours, number of starts, number of cycles in operation,
or other equivalents of time).
We can summarize the terminology
discussed above with the following descriptions of the
types of maintenance.
Breakdown Maintenance –
repair is done after functional failure
of equipment, so it is not possible to schedule the
repair work. It is also termed on-failure
maintenance.
Corrective Maintenance –
repair is done after initiation of
failure, leading to degraded performance. Usually
condition monitoring or inspections will reveal such
degradation. The actual repair may be done before or
after
functional failure, based on our evaluation of consequences
of failure, but the key difference from breakdown
maintenance is this: we were aware of the impending
functional failure before it occurred, so we had an opportunity to
schedule the repair.
Scheduled overhaul or replacement or hard-time
maintenance –
repair
is done based on age (calendar time, number of cycles,
number of starts or similar
measures of age as appropriate). This strategy
is applicable when the age at failure is
predictable, i.e., the failure distribution curve is
peaky. Fouling, corrosion, fatigue and wear related
failures typically exhibit such distributions.
On-condition maintenance –
repair is based on the result of inspections or
condition-monitoring activities which are themselves
scheduled on calendar time to discover if failure has already
commenced.
Vibration monitoring
and on-stream inspections are typical examples
of on-condition tasks. Monitoring of some parameters may be
continuous, with the use of dedicated instrumentation.
All on-condition maintenance is corrective in nature.
Testing or failure-finding
is aimed at finding out whether an item
is able to work if required to do so on demand. It is
applicable
to hidden failures and non-repairable items, i.e.,
the item must be removed from service
if we know it
has failed. Thereafter, if the item has failed, we do
corrective maintenance.
Predictive maintenance –
repair is based on predicted time of
functional failure, generally by extrapolating from the
results of on-condition activities
or continuously monitored condition readings.
It is synonymous with on-condition maintenance.
Preventive maintenance –
repair or inspection task is carried out
before
functional failure. It is carried out on the basis
of age-in-service and the anticipated time of failure. Thus, if the estimate is
pessimistic,
it may be done even when the equipment is in perfect operating condition.
Scheduled overhauls or replacement, on-condition and
failure-finding tasks (themselves time-based), are all
part of the preventive maintenance program.
When we do work on a predictive or anticipatory basis,
we call it proactive
maintenance. If we work on it after it has functionally
failed, we call it reactive maintenance. When the
incipiency period is relatively small, there is
insufficient time available to plan the work.
Opportunities to minimize production
losses are smaller, and some losses may be
unavoidable. In this case,
the timing of the work is not in our control, and the
corrective maintenance is reactive.
Hence corrective maintenance work can be proactive or
reactive, depending on the circumstances.
In Chapter 5, we defined planning as the process of thinking through the
execution of work. In the course of preparing a plan, we
can identify potential pitfalls. We can find solutions
in anticipation of the problems, thereby improving the
quality and speed of execution. Planned maintenance is
that which is correctly prepared sufficiently ahead of
its execution. All preventive maintenance can be planned
and scheduled.
In most cases, we can plan corrective maintenance
as well, but there is less time available to schedule
the work, since the onset of failure has already
occurred. The term
scheduling
means the allocation of materials and
resources as well as assigning a start and finish
date to the work.
When it comes to breakdown maintenance however,
we do not know the exact scope and timing in advance. It
is difficult to plan such work, except in
the most generic terms. Hence, breakdown maintenance
tends to be less efficient in terms of resource utilization and control
of duration.
People tend to regard preventive and
predictive maintenance as good while they frown on
breakdown maintenance. This view is fashionable but
incorrect. It has resulted in unnecessary maintenance
expenditure and equipment downtime. There are many
failure modes that have little or no effect in terms of
consequences on the system or plant as a whole. In such
cases, it is economical
to allow the failures to take place before taking any
action. Preventive maintenance became very popular
after the Second World War, when the mass production
industries enjoyed a period of rapid growth.
It became fashionable
to apply preventive maintenance strategies as a matter
of policy, even
in industries where the economic logic was different.
The result was that
items of equipment became ‘due’ for maintenance, even
though they were performing
perfectly well.
There are situations where each of the strategies is appropriate and one
must base the selection on the most appropriate way to
reduce risks. When
the consequences are negligible, the risk is usually
low, so a breakdown strategy
is appropriate. If there is a threat to safety,
production, or the environment, preventive strategies
are appropriate.
9.1.2 Applicable maintenance tasks
As the Weibull distribution has wide applicability in maintenance
analysis, we will be using the Weibull shape and scale
factors in the discussion that follows.
In Chapter 3 (refer Figure 3.16), we discussed the significance of the
Weibull shape factor of the
pdf
curve. Let us now address the effect of the
Weibull shape factor in cases where the failure is
evident.
When the Weibull shape factor is less than 1, the stresses on the
components reduce with
time. This can be due to the physical characteristics of
the failure mode or to
in-built quality problems, and results in an early-failure
pattern.
When this is
a result of underlying quality problems
introduced during the design, maintenance, or
operational phases, we may do more harm than good by
carrying out maintenance. What we need is an analysis of
the root cause of the failure, and
suitable corrective actions to
improve work quality. Similarly when the Weibull shape
factor is 1 (or close to 1), the probability of failure
does not decrease as a result of planned maintenance
work. In this case, we should only do the work when
performance has already started deteriorating.
We
should use the incipiency curve to predict the
functional failure.
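As a rough illustration of using an incipiency curve, the sketch below fits a straight line to a few condition readings and extrapolates it to the level at which we consider the item functionally failed. The linear trend is an assumption made for simplicity (real incipiency curves are often non-linear), and the function name, readings, and failure level are hypothetical.

```python
def predict_failure_time(times, readings, failure_level):
    """Extrapolate a least-squares straight line to failure_level.

    Assumption: the deterioration trend is roughly linear over the
    period of incipiency. Returns the predicted time of functional
    failure, or None if the trend never reaches failure_level after
    the latest reading.
    """
    n = len(times)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_r = sum(readings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times, readings))
    if sxx == 0:
        raise ValueError("observation times must not all be equal")
    slope = sxy / sxx
    if slope == 0:
        return None  # flat trend: no deterioration detected
    intercept = mean_r - slope * mean_t
    t_fail = (failure_level - intercept) / slope
    # A crossing at or before the latest reading is not a usable prediction.
    return t_fail if t_fail > max(times) else None
```

For example, with hypothetical vibration readings of 2.0, 3.0, 4.0, and 5.0 mm/s taken at days 0, 10, 20, and 30, and an assumed failure level of 7.0 mm/s, the line reaches the limit at about day 50, which leaves roughly 20 days to schedule the work.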
Time-based
maintenance strategies are applicable
when the Weibull shape factor is
greater than 1, since this indicates a wear out
pattern. The higher the value of the Weibull
shape factor, the more definite
we can be about the time of failure.
When this is high, we can easily justify
preventive time-based maintenance as it will improve
performance. We can determine the maintenance interval
by using the
pdf
curve to determine the required survival probability at
the time of maintenance intervention.
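One way to work out such an interval is sketched below. It assumes a two-parameter Weibull distribution, for which the survival probability at age t is exp(-(t/scale)^shape), and solves for the age at which survival falls to the required value. The function names and the numbers in the usage note are illustrative only.

```python
import math


def weibull_survival(t, shape, scale):
    """Survival probability at age t for a two-parameter Weibull."""
    return math.exp(-((t / scale) ** shape))


def maintenance_interval(required_survival, shape, scale):
    """Age at which survival probability falls to required_survival.

    Obtained by inverting R(t) = exp(-(t/scale)**shape) for t.
    """
    if not 0.0 < required_survival < 1.0:
        raise ValueError("required_survival must be between 0 and 1")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return scale * (-math.log(required_survival)) ** (1.0 / shape)
```

With an assumed shape factor of 3 and a scale factor of 1000 hours, a required survival probability of 0.9 gives an interval of roughly 470 hours. For the same scale factor and survival target, a higher shape factor gives a longer interval, which matches the point that a peakier distribution lets us be more definite about the time of failure.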
Turning our attention to hidden failures next, we require a time-based
test
to identify whether the item is in a failed state. If the item has failed
already,
we have to carry out breakdown maintenance to bring it
back in service.
As you can see from the above discussion, only certain
tasks are applicable
in addressing the failures. The kind of failure, namely,
whether it is evident or
hidden, and the shape of the
pdf
curve help determine the applicable task.
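The selection logic in this section can be summarized as a small decision function. This is only a sketch of the reasoning above: the tolerance used to decide when a shape factor is "close to 1" is an assumed value, and the function name and returned labels are hypothetical.

```python
def applicable_task(hidden, shape_factor, tolerance=0.1):
    """Suggest the applicable task for a failure mode.

    hidden       -- True if the failure is hidden (unrevealed)
    shape_factor -- Weibull shape factor of the failure distribution
    tolerance    -- assumed band around 1 treated as "close to 1"
    """
    if hidden:
        # Hidden failures need a time-based test, followed by
        # breakdown maintenance if the item is found failed.
        return "failure-finding test"
    if shape_factor < 1.0 - tolerance:
        # Early-failure pattern: planned maintenance may do harm.
        return "root cause analysis"
    if shape_factor <= 1.0 + tolerance:
        # Planned work does not reduce the failure probability.
        return "on-condition maintenance"
    # Wear-out pattern: age at failure is reasonably predictable.
    return "time-based maintenance"
```

In practice the consequence of failure also matters: where it is negligible, a breakdown strategy remains an option whatever the shape factor.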
9.1.3 How much preventive maintenance should we do?
The ratio of preventive maintenance work
volume to the total is a popular indicator used in
monitoring maintenance performance.
With a high ratio, we can plan more of the work. As discussed earlier,
planning
improves performance, so people aim to get a high ratio. In some cases we
know that a breakdown maintenance strategy is perfectly
applicable and effective. The proportion of such
breakdown work will vary from system to system, and
plant to
plant. There is therefore no ideal ratio of preventive
maintenance work to the total. In cases where there is a
fair amount of redundancy or buffer storage
capacity, we can manage with a very high proportion
of breakdown maintenance. In these cases, it will be the lowest total
cost option. In a plant assembling automobiles, the stoppage of the production
line for a few minutes can prove to be extremely
expensive. Here the regime would swing towards a high
proportion of preventive maintenance.
This is why it is important to analyze the situation before we choose the strategy. The
saying,
look before you leap,
is certainly applicable in this context!
We have to analyze at the failure mode level and in the
applicable operating context. The tasks identified by
such analysis would usually consist of
some failure
modes requiring preventive work, others requiring corrective work, and
some others allowed to run to failure. We can work out
the correct ratio for each system in a plant, and should
align the performance indicators to this ratio.
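To show what a system-specific ratio might look like, the sketch below computes the preventive share of the planned workload from the hours assigned to each strategy. The category names and the hours in the usage note are hypothetical; in practice they would come from the failure mode analysis of each system.

```python
def preventive_ratio(hours_by_strategy):
    """Preventive work hours as a fraction of total maintenance hours.

    hours_by_strategy maps a strategy name (for example "preventive",
    "corrective", "breakdown") to the annual hours expected for it.
    """
    total = sum(hours_by_strategy.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    return hours_by_strategy.get("preventive", 0.0) / total
```

A system with ample buffer storage might show, say, 200 preventive hours out of 1000, a target ratio of 0.2, while an automobile assembly line might show 700 out of 1000, a target of 0.7. Neither figure is better in itself; each is simply the indicator target that fits that system.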
9.2 THE RAISON D’ÊTRE OF MAINTENANCE
In Chapter 8, we examined the process of escalation
of minor failures into serious
incidents. If a serious incident such as an explosion
has already taken place, it is important to limit the
damage.
We can combine the escalation and damage limitation
models and obtain a
composite picture of how minor events can eventually
lead to serious environmental
damage, fatalities, major property damage, or serious
loss of production capacity. Figure 9.1 shows this
model.
We can now describe the primary role of maintenance
as follows:
The raison d’être of maintenance [1] is to
minimize the quantified risk of serious safety,
environmental, adverse publicity or production incidents
that can reduce the viability and profitability of an
organization, both in the short and long term, and to do
so at the lowest total cost.
This is a positive role of keeping the revenue stream
flowing at rated capacity, not merely that of fixing or
finding failures.
We have to avoid or minimize
trips, breakdowns, and predictable failures that affect
safety and production. If these do occur, we have to
rectify them so as to minimize the severity of safety
and production losses. This helps keep the plant safe
and profitability high. In the long term, maintaining the
integrity
of the plant ensures that safety and environmental incidents are minimal.
An organization’s good safety and environmental
performance keeps the staff morale high and minimizes adverse publicity. It enhances the reputation
and helps the organization to retain its right to operate. This assures
the viability of the plant. Note that maintenance can
reduce the quantified risks, but in the process it can
also help reduce the qualitative risks.
Compare this view of maintenance with the
conventional view—namely
that it is an interruption of normal operations and an
unavoidable cost burden. We recognize that every
organization is susceptible to serious incidents
that may result in large losses.
Only a few of the minor events will escalate into
serious incidents, so it is not possible to predict
precisely when they will occur. One could take the view
that one cannot anticipate such incidents.
Often, we can see that the situation is ripe
and ready for a serious incident, as in
the case of Piper Alpha, but even so, we cannot predict
the timing.
Sometimes these losses are so large that they may result
in the closure or bankruptcy of the organization itself.
As an example from a service industry,
consider the collapse of Barings Bank [2].
Their Singapore branch trader Nick
Leeson speculated heavily in arbitraging deals,
losing very large sums of
money in the process. He did this over a relatively long
period of time, using a large number of ordinary or
routine looking transactions. There were deviations from
the Bank’s policies, which an observant management could
have noticed. In our model, these deviations from the
norm constitute the process demand rate. Leeson was a
high performing trader, and in order to operate
effectively, he needed to make quick decisions.
So the Bank removed some of
the normal checks and balances. These controls included,
for example, the separation
of the authority to buy or sell on the one hand,
and on the other hand, to settle the payments. Thus,
they defeated a Procedure barrier, permitting an
opportunity for event escalation.
With the benefit of
hindsight, we can question
whether the reliability of the People barrier was
sufficiently high to justify this confidence. Barings
had carried out an internal audit a few months before
Leeson’s activities came to light.
In terms of our model, this was a test to identify
hidden failures. The auditors did find some areas of
concern, and recommended that Leeson’s authority be
limited to trading or settlements, but not
both. The Bank did not implement this recommendation.
By January 1995 the
London
office was providing more than $10 million per
day to cover the margin
payment to the Singapore Exchange. There were clear
indications that something
was amiss, but all the people involved ignored
them. The Bank of England,
which supervised the operations of Barings
Bank, wondered how Barings Singapore was
so profitable but did not pursue the matter further.
Hence the People barrier in the damage limitation
level was also weak. When you compare
this disaster with Piper Alpha, Bhopal, or Chernobyl,
some of the similarities become evident. With so many
barriers defeated, a disaster was looming,
and it was only a matter of time before it happened.
Integrity issues are quite often the
result of unrevealed failures.
We can minimize escalation of minor
events by taking the following steps:
· Reduce the process variability to reduce the demand rate;
· Increase the barrier availability. We can do this by
increasing the intrinsic reliability, through an improvement in
the design or configuration. Alternatively, we can increase the test
frequency to achieve the same results;
· Do the above in a cost effective way.
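The first two steps can be illustrated with a simple calculation. The sketch below assumes that the barriers fail independently of one another, so that the frequency of escalation is the demand rate multiplied by the unavailability of each barrier in turn. That independence is a simplifying assumption, and the function name and numbers are hypothetical.

```python
def escalation_rate(demand_rate, barrier_availabilities):
    """Expected frequency of demands that pass every barrier.

    Assumption: barriers fail independently, so the fraction of
    demands passing all of them is the product of their individual
    unavailabilities (1 - availability).
    """
    rate = demand_rate
    for availability in barrier_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        rate *= 1.0 - availability
    return rate
```

With 10 demands per year and two barriers with availabilities of 0.9 and 0.99, about 0.01 events per year would escalate. Halving the demand rate halves that figure, and so does raising the first barrier's availability from 0.9 to 0.95.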
We discussed the effect of the law of diminishing returns
and how to determine the most cost-effective strategy,
in Chapter 8. In order to achieve the
required level of availability in the case of each
barrier, we have to determine its
intrinsic reliability.
We can then calculate the test interval to produce
the required level of availability.
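A minimal sketch of this calculation follows. It assumes a constant failure rate for the hidden failure, a perfect test of negligible duration, and the common approximation that the mean unavailability over a test interval T is about half the failure rate multiplied by T. The function names and figures are illustrative.

```python
import math


def mean_availability(failure_rate, test_interval):
    """Mean availability over one test interval.

    Assumes a constant failure rate and a perfect, instantaneous test:
    A = (1 - exp(-rate * T)) / (rate * T).
    """
    x = failure_rate * test_interval
    if x <= 0:
        raise ValueError("failure_rate and test_interval must be positive")
    return -math.expm1(-x) / x


def required_test_interval(required_availability, failure_rate):
    """Approximate test interval giving the required mean availability.

    Uses the approximation A ~ 1 - rate * T / 2, which is reasonable
    when rate * T is small.
    """
    if not 0.0 < required_availability < 1.0:
        raise ValueError("required_availability must be between 0 and 1")
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 2.0 * (1.0 - required_availability) / failure_rate
```

For a failure rate of 0.1 per year and a required availability of 0.99, the approximation gives a test interval of 0.2 years, or about ten weeks. Feeding that interval back into mean_availability gives a value marginally above 0.99, so the approximation errs slightly on the safe side here.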
At this stage, we encounter a practical problem.
How does one measure the reliability of the People or
Procedures barriers? There is no simple metric to use, and even if there were one, a consistent
and repeatable methodology is not available. If we take
the case of the People barrier, their knowledge,
competence, and motivation are all important factors
contributing to the barrier availability. As we
discussed in Chapter 8, motivation can change with
time, and is easily influenced by unrelated outside
factors.
There would be an element of similarity in motivation
due to the company culture, working conditions, and the
level of involvement and participation. As long as the
average value is high and the deviations small, there is
no
problem. Also, if there are at least two people
available
to do a job in an emergency,
the redundancy can help improve the barrier
availability. We can test the knowledge and competence
of an individual from time to time, either by formal
tests or by observing their performance under conditions
of stress. In
an environment where people help one another, the
People
barrier availability
can be quite high. In this context, salary and reward
structures that favor individual performance in
contrast to that of the team can be counterproductive.
Procedures
used on a day-to-day basis will receive comments frequently. These will initiate
revisions, so they will be up-to-date. Those used
infrequently will gather dust and become out-of-date. If
they affect critical functions, they need more frequent
review. We should verify
Procedures
relating to damage limitation periodically, with tests
(such as building evacuation
drills).
The predominance of soft issues in the
case of the
People
and
Procedures
barriers means that estimating their reliability is a
question of judgment.
Redundancy helps, at least up to a point, in the case of
the
People
barrier. Illustrations, floor plans, and memory-jogger
cards are useful aids in improving the availability of
the
Procedures
barrier. It is a good practice to keep some drawings and
procedures permanently at the work site. Thus we
see some wiring diagrams on the doors of control
cabinets.
Similarly we get
help screens with the click of a mouse button and see
fire-escape instructions
on the doors of hotel rooms. Obviously, we have to ensure
that these are kept
up to date by periodic replacement.
9.3 THE CONTINUOUS IMPROVEMENT CYCLE
Once the plant enters its operational phase, we can
monitor its performance.
This enables us to improve the effectiveness of
maintenance. This process can
be represented by a model based on the Shewhart [3]
cycle.
In this model, we represent the maintenance process in four phases. The
first of these is the planning phase, where we think
through the execution of
the work. In this phase, we evaluate alternative
maintenance strategies in
terms of the probability of success as well as costs and
benefits.
In the next
phase, we schedule the work. At this point, we allocate
resources and finalize the timing. In the third phase,
we execute the work, and at the same time we generate
data. Some of this data is very useful in the next
phase, namely that
of analysis, and we will discuss the data we need and
how to collect it in Chapter
11. The results of the analysis are useful in improving
the planning of
future work. This completes the continuous improvement
cycle.
Figure 9.2 below shows these four
phases.
9.3.1 Planning
We begin the planning process by
defining the objectives. The production
plant has to achieve a level of system
effectiveness that is compatible with the production
targets. We have to demonstrate that the availability of
the safety
systems installed in the plant
meets the required barrier availability. Using
reliability block diagrams, we
can translate these requirements to availability
requirements at the sub-system and equipment level.
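For simple series and parallel arrangements, the arithmetic behind a reliability block diagram can be sketched as follows. The functions assume that items fail independently, and the equal-allocation helper is just one hypothetical way of spreading a system target across identical items in series.

```python
import math


def series(availabilities):
    """All items must work: multiply the availabilities."""
    return math.prod(availabilities)


def parallel(availabilities):
    """Any one item is enough: 1 minus the product of unavailabilities."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def equal_series_allocation(system_target, n_items):
    """Availability each of n identical series items needs to meet the target."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return system_target ** (1.0 / n_items)
```

Two pumps with an availability of 0.95 each, arranged in parallel, give 0.9975. Placing that pair in series with a compressor at 0.98 gives about 0.9776 for the sub-system. Working in the other direction, a target of 0.97 across three identical items in series requires roughly 0.99 from each.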
The next step consists of identifying those failure
modes that will prevent
us from achieving the target availability. Next, we
evaluate alternative ways to
resolve these problems. We have to execute the selected
tasks at the correct
frequencies, with the specified skilled resources.
We can bundle a number of
these tasks together. We can do so if the work is on the
same equipment, using the same
trade skills at the same frequency. We call such an
assembly of tasks
a maintenance routine. These routines will cover
all time-based tasks including
condition monitoring and failure finding tasks.
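The bundling rule can be expressed as a grouping on three attributes. The sketch below uses hypothetical field names (equipment, trade, interval_days, name) for a task record; it is not the data model of any particular maintenance system.

```python
from collections import defaultdict


def bundle_into_routines(tasks):
    """Group tasks sharing equipment, trade skill, and frequency.

    Each task is a dict with the hypothetical keys "name",
    "equipment", "trade", and "interval_days". Returns a dict
    mapping (equipment, trade, interval_days) to the task names
    in that routine.
    """
    routines = defaultdict(list)
    for task in tasks:
        key = (task["equipment"], task["trade"], task["interval_days"])
        routines[key].append(task["name"])
    return dict(routines)
```

For example, a monthly vibration check and a monthly oil top-up on the same pump, both done by a mechanical fitter, would fall into one routine, while a six-monthly trip test on the same pump by an instrument technician would form a separate one.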
When we execute condition monitoring tasks, we will
detect incipient failures. This will result in the generation of corrective
maintenance work. We carry out failure finding tasks, to
identify whether items subject to hidden failures are
in a working state. If they are in a failed state, we
have to carry out breakdown maintenance work to restore
them to a working condition. Lastly, we will allow certain
items of equipment to run to failure, and some others
will fail in service as a result of poor operation or
maintenance. These will also require breakdown
maintenance. We have to make a provision for such
corrective and breakdown work in our plan. Various
tools are available to assist us in planning this work,
and we will review some of these in the next chapter.
We cannot execute all the work during
normal operations and so some of it will require a
plant shutdown.
Planning of maintenance encompasses all
the routine and corrective work done during normal
operations as well as during shutdowns. There is an element of generic planning that we can do with
respect to
breakdowns. For example, in a plant using process steam, we can expect
leaks from flanges, screwed
connections, and valve glands from time to time.
These leaks can grow rapidly,
especially if the pressures involved are high, or the
steam is wet. The prompt availability of leak-sealing
equipment and skilled personnel can prevent the event
from escalating into a plant shutdown. In the case of
plans made to cope with breakdowns, the work scope is
usually not definable in advance. We
require a generic plan that will cater to a variety
of situations. Note that while
such a plan may be in place, we still cannot schedule
the work till there is a failure. If a breakdown does take place, we
will have to postpone some low priority
work, so that we can divert resources to
the breakdown.
9.3.2 Scheduling
We have to schedule maintenance work in
such a way that we minimize production losses. The
scheduler’s task is to find windows of opportunity to
minimize the losses. We can schedule
maintenance work during weekends or month-ends if there
are calendar-based production quotas. We schedule the
work so that it commences towards the
end of the week or month, and complete it in the early
part of the next week or month. By boosting the production rate before and after the
transition point, we can build up sufficient additional
production volumes to compensate for the production lost
during the maintenance activity.
We can avoid loss of production if intermediate storage or installed
spares
are available. When carrying out long duration maintenance work on
protective system equipment such as fire pumps, the scheduler must evaluate the
risks and take suitable action. For example, we can
bring in additional portable
equipment to fulfill the function of the equipment under
maintenance.
If
this is not possible, we have to reduce the demand rate,
for example, by not
permitting hot work. Using this logic, one can see why
the Piper Alpha situation
was vulnerable. The fire deluge systems were in poor
shape, the fire
pumps on manual, at a time when there was a high
maintenance and project
workload with a large volume of hot work.
We have to prioritize the work, with jobs affecting integrity at the
highest
level. This means that testing protective devices and systems has the
highest
priority. Work affecting production is next in importance. Within this
set, we
can prioritize the work according to the potential or actual losses.
All other work falls in the third category of
priorities. When scheduling maintenance
work, we have to allocate resources to the high priority
work and thereafter to
the remaining work. If the available resources are
inadequate to liquidate all the work on an ongoing
basis, we have to mobilize additional resources.
We
can use contractors to execute such work as a
peak-shaving exercise.
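The prioritization and resource allocation described above could be sketched as below. The three category names, the use of a single pool of hours, and the simple first-fit allocation are all assumptions made to keep the illustration short. A real schedule would also have to consider trades, timing, and materials.

```python
# Lower number means higher priority, following the three levels above.
PRIORITY = {"integrity": 0, "production": 1, "other": 2}


def allocate_work(jobs, available_hours):
    """Split jobs into in-house work and overflow for contractors.

    Each job is a dict with hypothetical keys "name", "category",
    "hours", and optionally "loss" (potential or actual loss, used
    to rank jobs within the same category). Jobs are taken in
    priority order; a job that does not fit in the remaining hours
    goes to the overflow list.
    """
    ordered = sorted(
        jobs,
        key=lambda job: (PRIORITY[job["category"]], -job.get("loss", 0.0)),
    )
    in_house, overflow = [], []
    remaining = available_hours
    for job in ordered:
        if job["hours"] <= remaining:
            in_house.append(job["name"])
            remaining -= job["hours"]
        else:
            overflow.append(job["name"])
    return in_house, overflow
```

The overflow list is the workload that would need additional resources, for instance contractors brought in for peak-shaving.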
The available pool of skills may not meet the requirements on a
day-to-day
basis. If each person has a primary skill and one or two
other skills, scheduling
becomes easier. This requires flexible work-practices
and a properly trained workforce. On the other hand, if
restrictive work practices apply, scheduling becomes
more difficult.
We then have to firm up the duration and
timing of each item of work,
arrange materials and spare parts,
special tools if required, cranes and lifting gear, and
transportation for the crew. When overhauling complex
machinery,
we may need the vendor’s engineer.
Similarly we may require specialist
machining facilities. We have to plan
all these requirements in advance.
It is the scheduler’s job to ensure that
the required facilities are available at the
right time and place and to communicate
the information to the relevant people.
A good computerized maintenance
management system (CMMS) can help
us greatly in scheduling the work
efficiently.
9.3.3 Execution
The most important aspects in the
execution of maintenance work are safety
and quality. We have to make every
effort to ensure the safety of the workers. Toolbox talks, which we discussed
in Chapter 6, are a good way of ensuring two-way communications. They are
like safety refresher training
courses. A more formal Job Safety Analysis (JSA),
used in some high hazard
industries, helps increase safety awareness in maintenance
and operational
staff. JSA cards are used not just for hazardous
activities; they are also used for
increasing awareness during routine
maintenance activities. The worker
needs protective apparel such as a hard
hat, gloves, goggles, overalls, and special shoes.
These ensure that even if an accident occurs, there is
no injury to
the worker. Note that protective apparel
is the
Plant
barrier in this case. If the
work is hazardous, for example, involving the potential
release of toxic gases,
we must ensure that the workers use respiratory
protection. In cases where
the consequence of accidents can be very high, escape
routes need advance
planning. We have noted earlier that redundancy
increases the availability of
Plant.
Hence in high risk cases, we should prepare two
independent escape
routes. In addition to the normal toolbox talk, the
workers should carry out a
dry run before starting the hazardous work. During this
dry run, they will practice their escape in full
protective gear. The damage limitation barriers must
also be in place. For example, in the case discussed
above, we must
arrange standby medical attention and rescue equipment.
In a practical sense,
the management of risk requires us to ensure that the
People, Plant,
and
Procedure
barriers are in-place and in good working condition.
The quality of work determines the operational reliability of the
equipment. In order to reach the intrinsic or built-in
reliability levels, we must
operate the equipment as designed, and maintain it
properly.
Both require
knowledge, skills, and motivation. One can acquire
knowledge and skills by
suitable training. We can test and confirm the worker’s
competence.
Pride of
ownership and motivation are more difficult issues, and
they require a lot of effort and attention. The
employees and contractors must share the values of
the organization, feel that they get a fair treatment,
and enjoy the work they are
doing. This is an area in which managers are not always
very comfortable.
As a result, their effort goes into the areas in which
they are comfortable
and they tend to concentrate on items relating to
technology, knowledge and
skills. Quality is a frame of mind, and motivation is an
important contributor.
Good planning and organization are necessary for
efficient execution of
work. A number of things must be in-place, in good time.
These include the following:
· Permits to work;
· Drawings and documentation;
· Tools;
· Logistic support, spare parts, and consumables;
· Safety gear;
· Scaffolding and other site preparation.
If these are not in place, we will waste resources while waiting for the
required item or service. The efficiency of execution is
dependent on the quality of planning and organization.
The two drivers of maintenance cost are the operational reliability of
the equipment, and the efficiency with which we execute
the work. We require good quality work from both
operators and maintainers to achieve high levels
of reliability. The number of maintenance interventions
falls as the reliability improves. This also means that
equipment will be in operation for longer periods. When
we carry out maintenance work efficiently, there is
minimum
wastage of resources. As a result, we can minimize the
maintenance cost. As
we have already noted, good work quality improves
equipment reliability, and good planning helps raise the
efficiency of execution. These two factors, work quality
and good planning, are where we must focus our
attention.
There are many reasons for delays in
commencing the planned maintenance work. There may be a delay in the release of
equipment
due to production pressures. Similarly, if critical spares, logistic
support, or skilled resources are not available, we may
have to postpone the work to a more convenient time. While we can tolerate some slippage,
it is counterproductive to
spend a lot of time and money deciding when to do
maintenance,
and then not do it at the correct time. When planned
work is done on schedule, we have achieved compliance.
For practical purposes, we accept it as compliant as
long as it is completed within a small range, usually
defined as a percentage of the scheduled
interval. As a guideline, we should commence items of
work that we consider safety critical, within +/-10% of
the planned maintenance interval, from the scheduled
date. For safety critical work that is planned every
month, e.g., lubricating oil top-up of the gearbox of
fire pumps, we would consider it compliant if it was
executed some time between 27 and 33 days from the
scheduled date on the previous occasion. If the work was
considered production critical, again planned as a
monthly routine, e.g., lubricating
oil top-up of the gearbox of a single process
pump, as long as the work
was done within +/-25%, or in this case between 23
and 37 days of the previous due date,
it would be considered compliant. Finally, if the same
work was planned on non-critical equipment, e.g., the
gearbox of a duty pump (with a 100% standby pump
available), a wider band of, say +/-50% is acceptable.
In this case, for a monthly routine, if the work was
done between 15 and 45 days
of the previous scheduled date, it would be considered
compliant. Progressive
slippage is not a good idea. Thus, we must retain the
original scheduled dates
even if there was a delay on the previous occasion. If
the work falls outside these ranges, the maintenance
manager must approve and record the deviations. This
step will ensure that we have an audit trail.
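The tolerance bands above can be sketched as a small calculation. This is a minimal illustration, not part of the original text: the function names and criticality labels are hypothetical, the percentages are the guideline values quoted above, and the window is rounded inward to whole days, which reproduces the 23 to 37 day band given for production critical work.

```python
# Guideline tolerance bands from the text, as a percentage of the planned interval.
# The dictionary keys are illustrative labels, not terms defined in the source.
TOLERANCE_PERCENT = {
    "safety_critical": 10,
    "production_critical": 25,
    "non_critical": 50,
}


def compliance_window(interval_days, criticality):
    """Return (earliest, latest) whole days after the previous scheduled date."""
    pct = TOLERANCE_PERCENT[criticality]
    # Integer arithmetic avoids floating-point surprises; round inward to whole days.
    earliest = -(-interval_days * (100 - pct) // 100)  # ceiling division
    latest = interval_days * (100 + pct) // 100        # floor division
    return earliest, latest


def is_compliant(days_since_previous_due, interval_days, criticality):
    """True if the work was done inside the tolerance band for its criticality."""
    earliest, latest = compliance_window(interval_days, criticality)
    return earliest <= days_since_previous_due <= latest


# A monthly routine, taken as 30 days as in the examples above.
print(compliance_window(30, "safety_critical"))
print(compliance_window(30, "production_critical"))
print(compliance_window(30, "non_critical"))
```

Because the window is always measured from the original scheduled date, a late job does not shift the following due dates, which is the point made above about avoiding progressive slippage.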
Procedural delays are sometimes encountered, caused, for example, by a permit-to-work (PTW) system
that needs a dozen or more signatures.
The author
has audited one location where technicians sat around
every morning for 1.5-2 hours, waiting for the
permits-to-work. No work started before this time, and
the site considered this practice normal. The PTW for
simple low-hazard activities needed 12 signatures,
mostly to ‘inform’ various operating staff that
work was going on. Over the years, the PTW had evolved
into a work slowdown process, instead of being the enabler of safe
and productive work.
The timely execution of work is very important, so we should measure and
report compliance. This is simply a ratio of the number
of jobs completed on
the due date (within the tolerance bands discussed
earlier), to those scheduled in a month,
quarter, or year. This is a key performance indicator to
judge the output of maintenance.
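As a rough sketch of this indicator, the ratio can be computed from a list of scheduled jobs for the reporting period. The `ScheduledJob` record and its field names are assumptions made for illustration; a real maintenance management system will hold this data in its own form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduledJob:
    """One planned job in the reporting period (hypothetical record layout)."""
    earliest_day: int              # start of tolerance band, days after previous due date
    latest_day: int                # end of tolerance band
    completed_day: Optional[int]   # day the work was done, or None if not done


def compliance_ratio(jobs):
    """Jobs completed within their tolerance band, divided by jobs scheduled."""
    if not jobs:
        raise ValueError("no jobs were scheduled in this period")
    compliant = sum(
        1
        for job in jobs
        if job.completed_day is not None
        and job.earliest_day <= job.completed_day <= job.latest_day
    )
    return compliant / len(jobs)


jobs = [
    ScheduledJob(27, 33, 30),    # done on time
    ScheduledJob(27, 33, 35),    # done, but outside the band
    ScheduledJob(15, 45, None),  # not done in the period
    ScheduledJob(23, 37, 37),    # done on the last acceptable day
]
print(f"Compliance: {compliance_ratio(jobs):.0%}")
```

Jobs that were not done at all count against compliance in this sketch, since the denominator is the number of jobs scheduled.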
We noted earlier that whenever we do work, we generate
data. Such data can be very useful in monitoring the
quality and efficiency of execution. By analyzing this
data, we can improve the planning of maintenance work in
future, as discussed below.
9.3.4 Analysis
The purpose of analysis is to evaluate
the performance of each phase of maintenance
work—planning, scheduling, and execution. The quality
and efficiency of the work depend on how well we carry
out each phase. There is a tendency to concentrate on
execution, but if we do not look at how well we plan and
schedule the work, we may end up doing unnecessary or
incorrect work efficiently!
In the planning phase, it is important to ensure that we do work on those
systems, sub-systems, and equipment that matter. Failure
of these items will
result in safety, environmental, and production
consequences. How well we increase the revenue streams
and decrease the cost streams determines the
value added. Quite often, the existing maintenance
plan may simply be a collection of tasks recommended by
the vendors, or a set of routines established by custom
and practice. So we may end up doing maintenance on
items whose failures do not matter.
The objective of planning is to maximize
the value added. We do this by
carrying out a structured analysis to
establish
the strategy at the failure mode level. This task can be large and
time-consuming, so we have to break it up into small
manageable portions. We must analyze only those systems
that matter, so that we use our planning
resources effectively. We identify progress milestones
after estimating the selection and analysis workload. In
effect, we make a plan for the plan. To achieve this
objective, we have to measure the progress using these
milestones. Such an analysis can help monitor the
planning process.
At the time of execution, we may find that some spare part, tool,
resource,
or other requirement is not available. This can happen if the planner
did not
identify it in the first place or the scheduler did not make suitable
arrangements. There will then be an avoidable delay. We can attribute such
delays to
defective planning or scheduling. A measure of the
quality of planning and
scheduling is the ratio of the time lost in such delays to the total time.
In the execution phase, we can identify a number of performance parameters
to monitor. The danger is that we pick too many of them.
In keeping with
our objectives, safety and the environment are at the
top of our list, so we will measure the number of
high potential safety and environmental incidents.
We discussed the importance of hidden failures in the
context of barrier
availability. We maintain system availability at the
required level by testing
those items of equipment that perform a protective
function. Operators or maintainers may carry out such
tests, the practice varying from plant to
plant. The result of the test is what is important, not
who does it.
We have to
record failures as well as successful tests. Sometimes
people carry out
pre-tests in advance of the official tests. Pre-tests
defeat the objective of the
test, since the first test is the only way to know if
the protective device would
have functioned in a real emergency. In such a case, we
should report the
results of the pre-test as if it were the real test, so
that the availability calculations
are meaningful. If a spurious trip takes place, this is
a fail-to-safe event.
By
recording such spurious events, we can carry out
meaningful analysis of these
events.
One can use some simple indicators to measure the quality of maintenance.
These include, for example, the number of days since the
last trip of
the production system, sub-system, or critical
equipment. Another measure
is the number of days that critical safety or production
systems are down for
maintenance. If we concentrate on trends, we can get a
reasonable picture of
the maintenance quality. Note that workforce
productivity and costs do not
feature here, as safety and quality are the first order
of business.
Earlier, we discussed the importance of doing the planned work at or
close
to the original scheduled time. Compliance is an important parameter
that we should measure and analyze. The ratio of planned
work to the total, and the associated costs, are other useful indicators. In measuring parameters such as
costs, it is useful to try to normalize them in a way
that is meaningful and reasonable,
to enable comparison with similar items elsewhere. For
this purpose,
we use some unit representing the complexity and size of
the plant such
as the volumes processed or plant replacement value in
the denominator.
Finally, we can evaluate the analysis phase itself, by
measuring the improvements made to the plan as a result
of the analysis. In a
Thermal Cracker unit in a petroleum
refinery, the six-monthly clean-out shutdowns
used to take 21 days. Over a period of
three years, the shutdown manager
reduced the duration to 9 days, while
stretching
the shutdown intervals to 8
months. The value added by this plant
was $60,000
per day, so these changes meant that the profitability increased by about
$1.7 million per annum. This required careful analysis
of the activities, new ways of working, and minor
modifications to the design to reduce the duration and
increase the run lengths. The plant was located in the
Middle East, where day temperatures could be 40 to 50°C.
Working inside columns and vessels under these
conditions could be very tiring and, therefore, took a
long time. One suggestion
was to cool the fractionator
column and soaker vessel
internally, using portable air-conditioning units. In
the past, these had been used to cool reactors in
Hydro-Cracker shutdowns, to
reduce the cooldown
time. Use of these units for the comfort of people was a
new application. When the shutdown manager introduced
air-conditioning, the productivity rose sharply, and
this helped reduce the duration by about 36 hours.
Another change was to relocate two pairs of 10-inch
flanges on transfer lines from the furnace to the
soaker. This clipped an additional six hours. There were
many more such innovations, each contributing just a few
hours, but the overall improvement
was quite dramatic. This case study illustrates
how one can measure the success of the analysis phase in
improving the plan and thus the profitability.
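The figure of about $1.7 million per annum can be checked with a short calculation. This is a simplified sketch: it assumes the number of shutdowns per year is 12 divided by the shutdown interval in months, and ignores the small effect of the shutdown duration on the cycle length. The function name is illustrative only.

```python
VALUE_ADDED_PER_DAY = 60_000  # dollars per operating day, from the case study


def annual_downtime_days(interval_months, shutdown_days):
    """Approximate shutdown days per year, assuming 12 / interval shutdowns a year."""
    return 12 / interval_months * shutdown_days


before = annual_downtime_days(6, 21)  # six-monthly shutdowns of 21 days
after = annual_downtime_days(8, 9)    # eight-monthly shutdowns of 9 days
days_gained = before - after
annual_gain = days_gained * VALUE_ADDED_PER_DAY

print(f"Downtime before: {before} days/year, after: {after} days/year")
print(f"Extra operating days: {days_gained}, worth ${annual_gain:,.0f} per annum")
```

Under these assumptions the downtime falls from 42 to 13.5 days a year, a gain of 28.5 operating days, which is consistent with the figure quoted in the case study.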
It is easy to fall into the trap of carrying out
analysis for its own sake. In order to keep the focus on
the improvements to the plan, we need to record changes
to the plan as a result of the analysis. Further, we
have to estimate the
value added by these changes and bank them. Hence,
analysis must focus on improvements to
all four phases of the maintenance process.
9.4 SYSTEM EFFECTIVENESS AND MAINTENANCE
The primary role of maintenance is to minimize the risk of minor events
escalating into major incidents.
We achieve this by ensuring the
required level of
barrier availability. Let us examine how we can do
this in practice, with some examples.