
Effective Maintenance Management Excerpt



Effective Maintenance Management - Risk & Reliability Strategies for Optimizing Performance By Vee Narayan

Providing a clear explanation of the value and benefits of maintenance, this unique guide is written in a language and style that practicing engineers and managers can understand and apply easily. Effective Maintenance Management examines the role of maintenance in minimizing the risk of safety or environmental incidents, adverse publicity, and loss of profitability. In addition to discussing risk reduction tools, it explains their applicability to specific situations, thereby enabling you to select the tool that best fits your requirements. Intended to bridge the gap between designers/maintainers and reliability engineers, this guide is sure to help businesses utilize their assets more effectively, safely, and profitably.
An excerpt has been provided below courtesy of Industrial Press

Maintenance can mean different things to different people. Quite often, senior managers and accountants see maintenance as a cost burden that should be minimized. At the working level, some of us see it as a set of preventive, corrective, or breakdown rectification activities. Some classify it as reactive or proactive work. To still others, it means predictive, planned, or unplanned activity. All these are merely the various dimensions of maintenance. They are valid descriptions, but do not address its functional aspects. We prefer to look at the role or function of maintenance and its strategic contribution to the health of a business. In Chapter 8, we examined the role of maintenance in preventing event escalation and how it helps retain the integrity and productive capacity of the facility over its life. This is its strategic role; maintenance helps maximize the profitability of a business over its life. In this chapter, we will see how appropriate maintenance strategies can help manage risk effectively.

In Chapter 2, we noted that the capability of an item of equipment, system or plant may deteriorate over time, due to fouling, wear, corrosion, or fatigue. At some point in time, the capability falls below the required perfor­mance level. We can restore the performance before this point, or shortly thereafter. We term such restoration activity as maintenance. There is another situation where we require maintenance. This is when the operator does not know the state of an item, whether it is working or has failed. These are the items that can have hidden or unrevealed failures. In these cases, the role of maintenance is to identify the state by carrying out a test. If the item is in a failed state, we need to carry out further on-failure maintenance to restore it to a working state.
 

9.1 MAINTENANCE AT THE ACTIVITY LEVEL—AN EXPLANATION OF TERMINOLOGY

9.1.1 Types of maintenance: terminology and application rationale

When the consequence of failure in service is negligible, we can afford to do the restoration work after the item has failed. We call this strategy on-failure or breakdown maintenance.

Unfortunately, many failures have an unacceptable consequence, so we cannot always apply a breakdown strategy. If we can measure the deteriora­tion and note the period of incipiency, it is possible to predict the time of fail­ure. In such a case, we can schedule the work to ensure minimum disruption of production. This ability to schedule the work facilitates a quick and effi­cient turnaround. We call this strategy on-condition maintenance, where we can detect and rectify a deteriorating condition before there is functional fail­ure.

In the case of hidden failures, we have to test the equipment periodically. This will identify whether it is in working condition. When we carry out the tests, we carry out failure-finding tasks. If we find the item in a failed state, we rectify it by carrying out breakdown maintenance. Under certain conditions, periodic repair or replacement of the item is warranted, even though it is still in working condition. Planned maintenance includes all of the following:

·    Testing for hidden failures;

·    Condition monitoring of incipient failures;

·    Pre-emptive repair or replacement action based on time (running hours, number of starts, number of cycles in operation, or other equivalents of time).

We can summarize the terminology discussed above with the following descriptions of the types of maintenance.

Breakdown Maintenance – repair is done after functional failure of equipment, so it is not possible to schedule the repair work. It is also termed on-failure maintenance.

Corrective Maintenance – repair is done after initiation of failure, leading to degraded performance. Usually condition monitoring or inspections will reveal such degradation. The actual repair may be done before or after functional failure, based on our evaluation of consequences of failure, but the key difference from breakdown maintenance is this – we were aware of the impending functional failure before it occurred, so we had an opportunity to schedule the repair.

Scheduled overhaul or replacement or hard-time maintenance – repair is done based on age (calendar time, number of cycles, number of starts or sim­ilar measures of age as appropriate). This strategy is applicable when the age at failure is predictable, i.e., the failure distribution curve is peaky. Fouling, corrosion, fatigue and wear related failures typically exhibit such distribu­tions.

On-condition maintenance – repair is based on the result of inspections or condition-monitoring activities which are themselves scheduled on calen­dar time to discover if failure has already commenced. Vibration monitoring and on-stream inspections are typical examples of on-condition tasks. Monitoring of some parameters may be continuous, with the use of dedicated instrumentation. All on-condition maintenance is corrective in nature.

Testing or failure-finding is aimed at finding out whether an item is able to work if required to do so on demand. It is applicable to hidden failures and non-repairable items, i.e., the item must be removed from service if we know it has failed. Thereafter, if the item has failed, we do corrective maintenance.

Predictive maintenance – repair is based on predicted time of functional failure, generally by extrapolating from the results of on-condition activities or continuously monitored condition readings. It is synonymous with on-con­dition maintenance.

Preventive maintenance – repair or inspection task is carried out before functional failure. It is carried out on the basis of age-in-service and the anticipated time of failure. Thus, if the estimate is pessimistic, it may be done even when the equipment is in perfect operating condition. Scheduled overhauls or replacement, on-condition and failure-finding tasks (themselves time-based), are all part of the preventive maintenance program.

When we do work on a predictive or anticipatory basis, we call it proactive maintenance. If we work on it after it has functionally failed, we call it reac­tive maintenance. When the incipiency period is relatively small, there is insufficient time available to plan the work. Opportunities to minimize pro­duction losses are smaller, and some losses may be unavoidable. In this case, the timing of the work is not in our control, and the corrective maintenance is reactive. Hence corrective maintenance work can be proactive or reactive, depending on the circumstances.

In Chapter 5, we defined planning as the process of thinking through the execution of work. In the course of preparing a plan, we can identify potential pitfalls. We can find solutions in anticipation of the problems, thereby improving the quality and speed of execution. Planned maintenance is that which is correctly prepared sufficiently ahead of its execution. All preventive maintenance can be planned and scheduled.

In most cases, we can plan corrective maintenance as well, but there is less time available to schedule the work, since the onset of failure has already occurred. The term scheduling means the allocation of materials and resources as well as assigning a start and finish date to the work.

When it comes to breakdown maintenance however, we do not know the exact scope and timing in advance. It is difficult to plan such work, except in the most generic terms. Hence, breakdown maintenance tends to be less effi­cient in terms of resource utilization and control of duration.

People tend to regard preventive and predictive maintenance as good while they frown on breakdown maintenance. This view is fashionable but incorrect. It has resulted in unnecessary maintenance expenditure and equipment downtime. There are many failure modes that have little or no effect in terms of consequences on the system or plant as a whole. In such cases, it is economical to allow the failures to take place before taking any action. Preventive maintenance became very popular after the Second World War, when the mass production industries enjoyed a period of rapid growth. It became fashionable to apply preventive maintenance strategies as a matter of policy, even in industries where the economic logic was different. The result was that items of equipment became ‘due’ for maintenance, even though they were performing perfectly well.

There are situations where each of the strategies is appropriate and one must base the selection on the most appropriate way to reduce risks. When the consequences are negligible, the risk is usually low, so a breakdown strat­egy is appropriate. If there is a threat to safety, production, or the environ­ment, preventive strategies are appropriate.

9.1.2 Applicable maintenance tasks

As the Weibull distribution has wide applicability in maintenance analysis, we will be using the Weibull shape and scale factors in the discussion that follows.

In Chapter 3 (refer to Figure 3.16), we discussed the significance of the Weibull shape factor of the pdf curve. Let us now address the effect of the Weibull shape factor in cases where the failure is evident. When the Weibull shape factor is less than 1, the stresses on the components reduce with time. This can be due to the physical characteristics of the failure mode or to in-built quality problems, and results in an early-failure pattern. When this is a result of underlying quality problems introduced during the design, maintenance, or operational phases, we may do more harm than good by carrying out maintenance. What we need is an analysis of the root cause of the failure, and suitable corrective actions to improve work quality. Similarly, when the Weibull shape factor is 1 (or close to 1), the probability of failure does not decrease as a result of planned maintenance work. In this case, we should only do the work when performance has already started deteriorating. We should use the incipiency curve to predict the functional failure. Time-based maintenance strategies are applicable when the Weibull shape factor is greater than 1, since this indicates a wear-out pattern. The higher the value of the Weibull shape factor, the more definite we can be about the time of failure. When this is high, we can easily justify preventive time-based maintenance as it will improve performance. We can determine the maintenance interval from the pdf curve, by choosing the survival probability required at the time of maintenance intervention.
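The effect of the shape factor can be illustrated numerically. The sketch below is not taken from the book; it assumes the standard two-parameter Weibull model, with hazard rate h(t) = (β/η)(t/η)^(β-1) and survival probability R(t) = exp(-(t/η)^β). The function names and the 1,000-hour scale factor are hypothetical, chosen only for illustration.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) for a two-parameter Weibull model."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability R(t) that the item survives beyond age t."""
    return math.exp(-((t / scale) ** shape))


def interval_for_survival(target_survival, shape, scale):
    """Age at which R(t) falls to the target: t = scale * (-ln R) ** (1 / shape)."""
    if not 0.0 < target_survival < 1.0:
        raise ValueError("target_survival must lie strictly between 0 and 1")
    return scale * (-math.log(target_survival)) ** (1.0 / shape)


if __name__ == "__main__":
    scale = 1000.0  # hypothetical scale factor (characteristic life), in hours
    for shape in (0.5, 1.0, 3.0):
        early = weibull_hazard(100.0, shape, scale)
        late = weibull_hazard(900.0, shape, scale)
        if late < early:
            trend = "falling (early-failure pattern)"
        elif late > early:
            trend = "rising (wear-out pattern)"
        else:
            trend = "constant"
        print(f"shape {shape}: failure rate is {trend}")
    # Interval that keeps survival probability at or above 90% for a wear-out item
    print(f"{interval_for_survival(0.9, 3.0, scale):.0f} hours")
```

With a shape factor of 3 the failure rate climbs steeply with age, so replacing the item at the calculated age removes most of the exposure. With a shape factor of 1 the rate is the same at 100 and 900 hours, which is why time-based replacement brings no benefit in that case.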

Turning our attention to hidden failures next, we require a time-based test to identify whether the item is in a failed state. If the item has failed already, we have to carry out breakdown maintenance to bring it back in service.

As you can see from the above discussion, only certain tasks are applicable in addressing the failures. The kind of failure, namely, whether it is evident or hidden, and the shape of the pdf curve help determine the applicable task.
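The selection logic of this section can be written down as a small decision function. This is only a sketch of the rules stated above: the function name, the returned labels, and the tolerance used to decide when a shape factor is "close to 1" are assumptions made for illustration.

```python
def applicable_task(hidden: bool, shape: float, tolerance: float = 0.1) -> str:
    """Suggest the applicable task from the failure type and the Weibull shape factor.

    `tolerance` is a hypothetical band around 1 within which the shape
    factor is treated as "close to 1".
    """
    if shape <= 0:
        raise ValueError("the Weibull shape factor must be positive")
    if hidden:
        # Hidden failures need a time-based test; repair follows only if the test fails.
        return "failure-finding test"
    if shape < 1.0 - tolerance:
        # Early-failure pattern: maintenance may do more harm than good.
        return "root cause analysis"
    if shape <= 1.0 + tolerance:
        # Constant failure rate: act only once deterioration has started.
        return "on-condition maintenance"
    # Wear-out pattern: the age at failure is reasonably predictable.
    return "time-based preventive maintenance"


if __name__ == "__main__":
    for hidden, shape in [(False, 0.5), (False, 1.0), (False, 3.0), (True, 1.0)]:
        print(hidden, shape, "->", applicable_task(hidden, shape))
```

A fuller version would first check the consequence of failure, since a breakdown strategy remains appropriate whenever the consequences are negligible.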

9.1.3 How much preventive maintenance should we do?

The ratio of preventive maintenance work volume to the total is a popular indi­cator used in monitoring maintenance performance. With a high ratio, we can plan more of the work. As discussed earlier, planning improves performance, so people aim to get a high ratio. In some cases we know that a breakdown maintenance strategy is perfectly applicable and effective. The proportion of such breakdown work will vary from system to system, and plant to plant. There is therefore no ideal ratio of preventive maintenance work to the total. In cases where there is a fair amount of redundancy or buffer storage capacity, we can manage with a very high proportion of breakdown mainte­nance. In these cases, it will be the lowest total cost option. In a plant assem­bling automobiles, the stoppage of the production line for a few minutes can prove to be extremely expensive. Here the regime would swing towards a high proportion of preventive maintenance. This is why it is important to ana­lyze the situation before we choose the strategy. The saying, look before you leap, is certainly applicable in this context! We have to analyze at the failure mode level and in the applicable operating context. The tasks identified by such analysis would usually consist of some failure modes requiring preven­tive work, others requiring corrective work, and some others allowed to run to failure. We can work out the correct ratio for each system in a plant, and should align the performance indicators to this ratio.
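As an illustration of working out the ratio system by system, the sketch below adds up the planned work hours that result from a failure-mode analysis. The system names, strategies, and hours are invented for the example, and the grouping of time-based, on-condition, and failure-finding tasks as preventive follows the terminology of section 9.1.1.

```python
from collections import defaultdict

# Strategies counted as preventive work, following the definitions in section 9.1.1
PREVENTIVE = {"time-based", "on-condition", "failure-finding"}


def preventive_ratio_by_system(tasks):
    """Return {system: preventive hours / total hours}.

    `tasks` is an iterable of (system, strategy, annual_hours) tuples.
    """
    preventive = defaultdict(float)
    total = defaultdict(float)
    for system, strategy, hours in tasks:
        total[system] += hours
        if strategy in PREVENTIVE:
            preventive[system] += hours
    return {s: preventive[s] / total[s] for s in total if total[s] > 0}


if __name__ == "__main__":
    # Hypothetical analysis results: an assembly line with no buffer capacity,
    # and a tank farm with ample redundancy.
    tasks = [
        ("assembly line", "time-based", 800.0),
        ("assembly line", "breakdown", 200.0),
        ("tank farm", "failure-finding", 100.0),
        ("tank farm", "breakdown", 300.0),
    ]
    for system, ratio in preventive_ratio_by_system(tasks).items():
        print(f"{system}: target preventive ratio {ratio:.0%}")
```

The two systems end up with very different target ratios, and each should be measured against its own target rather than a plant-wide figure.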

9.2 THE RAISON D’ÊTRE OF MAINTENANCE

In Chapter 8, we examined the process of escalation of minor failures into seri­ous incidents. If a serious incident such as an explosion has already taken place, it is important to limit the damage.

We can combine the escalation and damage limitation models and obtain a composite picture of how minor events can eventually lead to serious environmental damage, fatalities, major property damage, or serious loss of produc­tion capacity. Figure 9.1 shows this model. We can now describe the primary role of maintenance as follows:

The raison d’être of maintenance¹ is to minimize the quantified risk of serious safety, environmental, adverse publicity or production incidents that can reduce the viability and profitability of an organization, both in the short and long term, and to do so at the lowest total cost.

This is a positive role of keeping the revenue stream flowing at rated capac­ity, not merely that of fixing or finding failures. We have to avoid or minimize trips, breakdowns, and predictable failures that affect safety and produc­tion. If these do occur, we have to rectify them so as to minimize the severity of safety and production losses. This helps keep the plant safe and profitabil­ity high. In the long-term, maintaining the integrity of the plant ensures that safety and environmental incidents are minimal. An organization’s good safety and environmental performance keeps the staff morale high and mini­mizes adverse publicity. It enhances the reputation and helps the organization to retain its right to operate. This assures the viability of the plant. Note that maintenance can reduce the quantified risks, but in the process it can also help reduce the qualitative risks.

Compare this view of maintenance with the conventional view—namely that it is an interruption of normal operations and an unavoidable cost bur­den. We recognize that every organization is susceptible to serious incidents that may result in large losses. Only a few of the minor events will escalate into serious incidents, so it is not possible to predict precisely when they will occur. One could take the view that one cannot anticipate such inci­dents. Often, we can see that the situation is ripe and ready for a serious inci­dent, as in the case of Piper Alpha, but even so, we cannot predict the timing.

Sometimes these losses are so large that they may result in the closure or bankruptcy of the organization itself. As an example from a service industry, consider the collapse of the Barings Bank². Their Singapore branch trader Nick Leeson speculated heavily in arbitraging deals, losing very large sums of money in the process. He did this over a relatively long period of time, using a large number of ordinary or routine looking transactions. There were deviations from the Bank’s policies, which an observant management could have noticed. In our model, these deviations from the norm constitute the process demand rate. Leeson was a high performing trader, and in order to operate effectively, he needed to make quick decisions. So the Bank removed some of the normal checks and balances. These controls included, for example, the separation of the authority to buy or sell on the one hand, and on the other hand, to settle the payments. Thus, they defeated a Procedure barrier, permitting an opportunity for event escalation. With the benefit of hindsight, we can question whether the reliability of the People barrier was sufficiently high to justify this confidence.

Barings had carried out an internal audit a few months before Leeson’s activities came to light. In terms of our model, this was a test to identify hidden failures. The auditors did find some areas of concern, and recommended that Leeson’s authority be limited to trading or settlements, but not both. The Bank did not implement this recommendation. By January 1995 the London office was providing more than $10 million per day to cover the margin payment to the Singapore Exchange. There were clear indications that something was amiss, but all the people involved ignored them. The Bank of England, which supervised the operations of Barings Bank, wondered how Barings Singapore was so profitable but did not pursue the matter further. Hence the People barrier in the damage limitation level was also weak. When you compare this disaster with Piper Alpha, Bhopal, or Chernobyl, some of the similarities become evident. With so many barriers defeated, a disaster was looming, and it was only a matter of time before it happened.

Integrity issues are quite often the result of unrevealed failures. We can minimize escalation of minor events by taking the following steps:

·    Reduce the process variability to reduce the demand rate;

·    Increase the barrier availability. We can do this by increasing the intrinsic reliability, through an improvement in the design or configuration. Alternatively, we can increase the test frequency to achieve the same results;

·    Do the above in a cost effective way.

We discussed the effect of the law of diminishing returns and how to deter­mine the most cost-effective strategy, in Chapter 8. In order to achieve the required level of availability in the case of each barrier, we have to determine its intrinsic reliability. We can then calculate the test interval to produce the required level of availability.
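The calculation of a test interval can be sketched as follows. This is an illustration under stated assumptions, not the book's own worked method: the barrier is assumed to fail at a constant rate (mean time between failures, MTBF), the test is assumed to be perfect, and test and repair times are ignored. Under those assumptions the mean availability over a test interval T is (1 - exp(-T/MTBF)) / (T/MTBF), which is approximately 1 - T/(2·MTBF) when T is much shorter than the MTBF. The function names and the 10-year MTBF are hypothetical.

```python
import math


def mean_availability(interval, mtbf):
    """Mean availability between tests for a constant failure rate (assumed model)."""
    x = interval / mtbf
    if x == 0:
        return 1.0
    return -math.expm1(-x) / x  # equals (1 - exp(-x)) / x


def interval_approx(required, mtbf):
    """First-order estimate: T = 2 * (1 - A) * MTBF, valid when T << MTBF."""
    return 2.0 * (1.0 - required) * mtbf


def interval_exact(required, mtbf):
    """Solve mean_availability(T) = required for T by bisection."""
    if not 0.0 < required < 1.0:
        raise ValueError("required availability must lie strictly between 0 and 1")
    lo, hi = 0.0, mtbf
    while mean_availability(hi, mtbf) > required:
        hi *= 2.0  # widen the bracket until availability drops below the target
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_availability(mid, mtbf) > required:
            lo = mid  # interval can be longer and still meet the target
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    mtbf_years = 10.0  # hypothetical intrinsic reliability of the barrier
    target = 0.99      # required barrier availability
    print(f"approximate interval: {interval_approx(target, mtbf_years):.3f} years")
    print(f"exact interval:       {interval_exact(target, mtbf_years):.3f} years")
```

The two answers agree closely at high target availabilities, so the simple approximation is usually adequate for setting test frequencies.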

At this stage, we encounter a practical problem. How does one measure the reliability of the People or Procedures barriers? There is no simple metric to use, and even if there were one, a consistent and repeatable methodology is not available. If we take the case of the People barrier, their knowledge, competence, and motivation are all important factors contributing to the barrier availability. As we discussed in Chapter 8, motivation can change with time, and is easily influenced by unrelated outside factors.

There would be an element of similarity in motivation due to the company culture, working conditions, and the level of involvement and participa­tion. As long as the average value is high and the deviations small, there is no problem. Also, if there are at least two people available to do a job in an emer­gency, the redundancy can help improve the barrier availability. We can test the knowledge and competence of an individual from time to time, either by formal tests or by observing their performance under conditions of stress. In an environment where people help one another, the People barrier availability can be quite high. In this context, salary and reward structures that favor indi­vidual performance in contrast to that of the team can be counterproductive.

Procedures used on a day-to-day basis will receive comments fre­quently. These will initiate revisions, so they will be up-to-date. Those used infrequently will gather dust and become out-of-date. If they affect critical functions, they need more frequent review. We should verify Procedures relating to damage limitation periodically, with tests (such as building evacua­tion drills).

The predominance of soft issues in the case of the People and Procedures barriers means that estimating their reliability is a question of judg­ment. Redundancy helps, at least up to a point, in the case of the People bar­rier. Illustrations, floor plans, and memory-jogger cards are useful aids in improving the availability of the Procedures barrier. It is a good practice to keep some drawings and procedures permanently at the work site. Thus we see some wiring diagrams on the doors of control cabinets. Similarly we get help screens with the click of a mouse button and see fire-escape instructions on the doors of hotel rooms. Obviously, we have to ensure that these are kept up to date by periodic replacement.

9.3 THE CONTINUOUS IMPROVEMENT CYCLE

Once the plant enters its operational phase, we can monitor its performance. This enables us to improve the effectiveness of maintenance. This process can be represented by a model, based on the Shewhart³ cycle.

In this model, we represent the maintenance process in four phases. The first of these is the planning phase, where we think through the execution of the work. In this phase, we evaluate alternative maintenance strategies in terms of the probability of success as well as costs and benefits. In the next phase, we schedule the work. At this point, we allocate resources and finalize the timing. In the third phase, we execute the work, and at the same time we generate data. Some of this data is very useful in the next phase, namely that of analysis, and we will discuss the data we need and how to collect it in Chap­ter 11. The results of the analysis are useful in improving the planning of future work. This completes the continuous improvement cycle. Figure 9.2 below shows these four phases.

9.3.1 Planning

We begin the planning process by defining the objectives. The production plant has to achieve a level of system effectiveness that is compatible with the production targets. We have to demonstrate that the availability of the safety systems installed in the plant meets the required barrier availability. Using reliability block diagrams, we can translate these requirements to availability requirements at the sub-system and equipment level.

The next step consists of identifying those failure modes that will prevent us from achieving the target availability. Next, we evaluate alternative ways to resolve these problems. We have to execute the selected tasks at the correct frequencies, with the specified skilled resources. We can bundle a number of these tasks together. We can do so if the work is on the same equipment, using the same trade skills at the same frequency. We call such an assembly of tasks a maintenance routine. These routines will cover all time-based tasks includ­ing condition monitoring and failure finding tasks.

When we execute condition monitoring tasks, we will detect incipient failures. This will result in the generation of corrective maintenance work. We carry out failure finding tasks to identify whether items subject to hidden failures are in a working state. If they are in a failed state, we have to carry out breakdown maintenance work to restore them to a working condition. Lastly, we will allow certain items of equipment to run to failure, and some others will fail in service as a result of poor operation or maintenance. These will also require breakdown maintenance. We have to make a provision for such corrective and breakdown work in our plan. Various tools are available to assist us in planning this work, and we will review some of these in the next chapter.

We cannot execute all the work during normal operations, so some of it will require a plant shutdown.

Planning of maintenance encompasses all the routine and corrective work done during normal operations as well as during shutdowns. There is an ele­ment of generic planning that we can do with respect to breakdowns. For exam­ple, in a plant using process steam, we can expect leaks from flanges, screwed connections, and valve glands from time to time. These leaks can grow rapidly, especially if the pressures involved are high, or the steam is wet. The prompt availability of leak-sealing equipment and skilled personnel can prevent the event from escalating into a plant shutdown. In the case of plans made to cope with breakdowns, the work scope is usually not definable in advance. We require a generic plan that will cater to a variety of situations. Note that while such a plan may be in place, we still cannot schedule the work till there is a failure. If a breakdown does take place, we will have to postpone some low priority work, so that we can divert resources to the breakdown.

9.3.2 Scheduling

We have to schedule maintenance work in such a way that we minimize production losses. The scheduler’s task is to find windows of opportunity to minimize the losses. We can schedule maintenance work during weekends or month-ends if there are calendar-based production quotas. We schedule the work so that it commences towards the end of the week or month, and is completed in the early part of the next week or month. By boosting the production rate before and after the transition point, we can build up sufficient additional production volumes to compensate for the production lost during the maintenance activity.

We can avoid loss of production if intermediate storage or installed spares are available. When carrying out long duration maintenance work on protec­tive system equipment such as fire pumps, the scheduler must evaluate the risks and take suitable action. For example, we can bring in additional porta­ble equipment to fulfill the function of the equipment under maintenance. If this is not possible, we have to reduce the demand rate, for example, by not permitting hot work. Using this logic, one can see why the Piper Alpha situa­tion was vulnerable. The fire deluge systems were in poor shape, the fire pumps on manual, at a time when there was a high maintenance and project workload with a large volume of hot work.

We have to prioritize the work, with jobs affecting integrity at the highest level. This means that testing protective devices and systems has the highest priority. Work affecting production is next in importance. Within this set, we can prioritize the work according to the potential or actual losses. All other work falls in the third category of priorities. When scheduling maintenance work, we have to allocate resources to the high priority work and thereafter to the remaining work. If the available resources are inadequate to liquidate all the work on an ongoing basis, we have to mobilize additional resources. We can use contractors to execute such work as a peak-shaving exercise.
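The three-level priority scheme lends itself to a simple sort. The sketch below is illustrative only: the `WorkOrder` fields, the category names, the greedy allocation rule, and the sample jobs are all assumptions, and a real CMMS would also account for skills, materials, and timing.

```python
from dataclasses import dataclass

# Lower number means higher priority, following the three levels described above
PRIORITY = {"integrity": 0, "production": 1, "other": 2}


@dataclass
class WorkOrder:
    name: str
    category: str                 # "integrity", "production", or "other"
    hours: float                  # estimated labor hours
    potential_loss: float = 0.0   # hypothetical estimate of potential or actual loss


def prioritize(orders):
    """Order by category first, then by larger potential loss within a category."""
    return sorted(orders, key=lambda o: (PRIORITY[o.category], -o.potential_loss))


def allocate(orders, available_hours):
    """Greedy allocation: fill in-house hours in priority order.

    Work that does not fit is returned separately as a candidate for
    contractor support (the peak-shaving exercise).
    """
    in_house, overflow = [], []
    remaining = available_hours
    for order in prioritize(orders):
        if order.hours <= remaining:
            in_house.append(order)
            remaining -= order.hours
        else:
            overflow.append(order)
    return in_house, overflow


if __name__ == "__main__":
    backlog = [
        WorkOrder("paint handrails", "other", 8),
        WorkOrder("pump B seal", "production", 6, potential_loss=50_000),
        WorkOrder("test fire pump", "integrity", 4),
        WorkOrder("compressor valve", "production", 10, potential_loss=200_000),
    ]
    in_house, overflow = allocate(backlog, available_hours=16)
    print("in-house:", [o.name for o in in_house])
    print("contract out:", [o.name for o in overflow])
```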

The available pool of skills may not meet the requirements on a day-to-day basis. If each person has a primary skill and one or two other skills, schedul­ing becomes easier. This requires flexible work-practices and a properly trained workforce. On the other hand, if restrictive work practices apply, scheduling becomes more difficult.

We then have to firm up the duration and timing of each item of work, arrange materials and spare parts, special tools if required, cranes and lifting gear, and transportation for the crew. When overhauling complex machinery, we may need the vendor’s engineer. Similarly we may require specialist machining facilities. We have to plan all these requirements in advance. It is the scheduler’s job to ensure that the required facilities are available at the right time and place and to communicate the information to the relevant people.

A good computerized maintenance management system (CMMS) can help us greatly in scheduling the work efficiently.

9.3.3 Execution

The most important aspects in the execution of maintenance work are safety and quality. We have to make every effort to ensure the safety of the workers. Toolbox talks, which we discussed in Chapter 6, are a good way of ensuring two-way communications. They are like safety refresher training courses. A more formal Job Safety Analysis (JSA), used in some high hazard industries, helps increase safety awareness in maintenance and operational staff. JSA cards are used not just for hazardous activities; they are also used for increasing awareness during routine maintenance activities. The worker needs protective apparel such as a hard hat, gloves, goggles, overalls, and special shoes. These ensure that even if an accident occurs, there is no injury to the worker. Note that protective apparel is the Plant barrier in this case. If the work is hazardous, for example, involving the potential release of toxic gases, we must ensure that the workers use respiratory protection. In cases where the consequence of accidents can be very high, escape routes need advance planning. We have noted earlier that redundancy increases the availability of Plant. Hence in high risk cases, we should prepare two independent escape routes. In addition to the normal toolbox talk, the workers should carry out a dry run before starting the hazardous work. During this dry run, they will practice their escape in full protective gear. The damage limitation barriers must also be in place. For example, in the case discussed above, we must arrange standby medical attention and rescue equipment. In a practical sense, the management of risk requires us to ensure that the People, Plant, and Procedure barriers are in place and in good working condition.

The quality of work determines the operational reliability of the equipment. In order to reach the intrinsic or built-in reliability levels, we must operate the equipment as designed, and maintain it properly. Both require knowledge, skills, and motivation. One can acquire knowledge and skills by suitable training. We can test and confirm the worker’s competence. Pride of ownership and motivation are more difficult issues, and they require a lot of effort and attention. The employees and contractors must share the values of the organization, feel that they get fair treatment, and enjoy the work they are doing. This is an area in which managers are not always very comfortable. As a result, their effort goes into the areas in which they are comfortable and they tend to concentrate on items relating to technology, knowledge and skills. Quality is a frame of mind, and motivation is an important contributor.

Good planning and organization are necessary for efficient execution of work. A number of things must be in place, in good time. These include the following:

·    Permits to work;

·    Drawings and documentation;

·    Tools;

·    Logistic support, spare parts, and consumables;

·    Safety gear;

·    Scaffolding and other site preparation.

If these are not in place, we will waste resources while waiting for the required item or service. The efficiency of execution is dependent on the quality of planning and organization.

The two drivers of maintenance cost are the operational reliability of the equipment, and the efficiency with which we execute the work. We require good quality work from both operators and maintainers to achieve high levels of reliability. The number of maintenance interventions falls as the reliability improves. This also means that equipment will be in operation for longer periods. When we carry out maintenance work efficiently, there is minimum wastage of resources. As a result, we can minimize the maintenance cost. As we have already noted, good work quality improves equipment reliability, and good planning helps raise the efficiency of execution. These two factors, work quality and good planning, are where we must focus our attention.

There are many reasons for delays in commencing the planned maintenance work. There may be a delay in the release of equipment due to production pressures. Similarly, if critical spares, logistic support, or skilled resources are not available, we may have to postpone the work to a more convenient time. While we can tolerate some slippage, it is counterproductive to spend a lot of time and money deciding when to do maintenance, and then not do it at the correct time. When planned work is done on schedule, we have achieved compliance. For practical purposes, we accept it as compliant as long as it is completed within a small range, usually defined as a percentage of the scheduled interval.

As a guideline, we should commence items of work that we consider safety critical within +/-10% of the planned maintenance interval, measured from the scheduled date. For safety critical work that is planned every month, e.g., lubricating oil top-up of the gearbox of fire pumps, we would consider it compliant if it was executed some time between 27 and 33 days from the scheduled date on the previous occasion. If the work was considered production critical, again planned as a monthly routine, e.g., lubricating oil top-up of the gearbox of a single process pump, it would be considered compliant as long as the work was done within +/-25%, or in this case between 23 and 37 days of the previous due date. Finally, if the same work was planned on non-critical equipment, e.g., the gearbox of a duty pump (with a 100% standby pump available), a wider band of, say, +/-50% is acceptable. In this case, for a monthly routine, if the work was done between 15 and 45 days of the previous scheduled date, it would be considered compliant. Progressive slippage is not a good idea. Thus, we must retain the original scheduled dates even if there was a delay on the previous occasion. If the work falls outside these ranges, the maintenance manager must approve and record the deviations. This step will ensure that we have an audit trail.
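The tolerance bands above can be turned into a small calculation. The following is only an illustrative sketch: the function names and criticality labels are invented here, and rounding the window inward to whole days is an assumption (it reproduces the 23 to 37 day band quoted for the +/-25% case).

```python
import math

# Tolerance bands quoted in the text, as a percentage of the planned interval.
# The dictionary keys are labels chosen for this sketch only.
TOLERANCE_PCT = {
    "safety_critical": 10,
    "production_critical": 25,
    "non_critical": 50,
}


def compliance_window(interval_days, criticality):
    """Return (earliest, latest) compliant day, counted from the
    previous scheduled date (not the previous completion date)."""
    pct = TOLERANCE_PCT[criticality]
    # Assumption: round inward to whole days, so 22.5 -> 23 and 37.5 -> 37.
    earliest = math.ceil(interval_days * (100 - pct) / 100)
    latest = math.floor(interval_days * (100 + pct) / 100)
    return earliest, latest


def is_compliant(days_since_previous_due, interval_days, criticality):
    """True if the job was done inside its tolerance window."""
    earliest, latest = compliance_window(interval_days, criticality)
    return earliest <= days_since_previous_due <= latest
```

For a monthly (30-day) routine, `compliance_window(30, "safety_critical")` gives the 27 to 33 day band described above. Because the window is always counted from the original scheduled date, a late job does not shift later due dates, which is how progressive slippage is avoided.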

Procedural delays are sometimes encountered, caused, for example, by a permit-to-work (PTW) system that needs a dozen or more signatures. The author has audited one location where technicians sat around every morning for 1.5 to 2 hours, waiting for the permits-to-work. No work started before this time, and the site considered this practice normal. The PTW for simple low-hazard activities needed 12 signatures, mostly to ‘inform’ various operating staff that work was going on. Over the years, the PTW had evolved into a work slowdown process, instead of being the enabler of safe and productive work.

The timely execution of work is very important, so we should measure and report compliance. This is simply a ratio of the number of jobs completed on the due date (within the tolerance bands discussed earlier), to those scheduled in a month, quarter, or year. This is a key performance indicator to judge the output of maintenance.
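As a rough sketch of this indicator, the ratio can be computed from a list of the jobs scheduled in the reporting period. The function name and the sample data below are hypothetical and serve only to illustrate the calculation.

```python
def compliance_ratio(scheduled_jobs):
    """Fraction of scheduled jobs completed within their tolerance band.

    scheduled_jobs: one boolean per job scheduled in the period;
    True means the job was completed within its tolerance band,
    False means it was late, early, or not done at all.
    """
    if not scheduled_jobs:
        raise ValueError("no jobs were scheduled in this period")
    return sum(scheduled_jobs) / len(scheduled_jobs)


# Hypothetical quarter: 8 jobs scheduled, 6 completed within tolerance.
quarter = [True, True, False, True, True, True, False, True]
ratio = compliance_ratio(quarter)
```

The denominator is the number of jobs scheduled, not the number completed, so work that was never started counts against compliance.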

We noted earlier that whenever we do work, we generate data. Such data can be very useful in monitoring the quality and efficiency of execution. By analyzing this data, we can improve the planning of maintenance work in future, as discussed below.

9.3.4 Analysis

The purpose of analysis is to evaluate the performance of each phase of maintenance work: planning, scheduling, and execution. The quality and efficiency of the work depend on how well we carry out each phase. There is a tendency to concentrate on execution, but if we do not look at how well we plan and schedule the work, we may end up doing unnecessary or incorrect work efficiently!

In the planning phase, it is important to ensure that we do work on those systems, sub-systems, and equipment that matter. Failure of these items will result in safety, environmental, and production consequences. How well we increase the revenue streams and decrease the cost streams determines the value added. Quite often, the existing maintenance plan may simply be a collection of tasks recommended by the vendors, or a set of routines established by custom and practice. So we may end up doing maintenance on items whose failures do not matter.

The objective of planning is to maximize the value added. We do this by carrying out a structured analysis to establish the strategy at the failure mode level. This task can be large and time-consuming, so we have to break it up into small manageable portions. We must analyze only those systems that matter, so that we use our planning resources effectively. We identify progress milestones after estimating the selection and analysis workload. In effect, we make a plan for the plan. To achieve this objective, we have to measure the progress using these milestones. Such an analysis can help monitor the planning process.

At the time of execution, we may find that some spare part, tool, resource, or other requirement is not available. This can happen if the planner did not identify it in the first place or the scheduler did not make suitable arrangements. There will then be an avoidable delay. We can attribute such delays to defective planning or scheduling. A measure of the quality of planning and scheduling is the ratio of the time lost in such delays to the total time.

In the execution phase, we can identify a number of performance parameters to monitor. The danger is that we pick too many of them. In keeping with our objectives, safety and the environment are at the top of our list, so we will measure the number of high potential safety and environmental incidents. We discussed the importance of hidden failures in the context of barrier availability. We maintain system availability at the required level by testing those items of equipment that perform a protective function. Operators or maintainers may carry out such tests, the practice varying from plant to plant. The result of the test is what is important, not who does it. We have to record failures as well as successful tests. Sometimes people carry out pre-tests in advance of the official tests. Pre-tests defeat the objective of the test, since the first test is the only way to know if the protective device would have functioned in a real emergency. In such a case, we should report the results of the pre-test as if it were the real test, so that the availability calculations are meaningful. If a spurious trip takes place, this is a fail-to-safe event. By recording such spurious events, we can carry out meaningful analysis of them.

One can use some simple indicators to measure the quality of maintenance. These include, for example, the number of days since the last trip of the production system, sub-system, or critical equipment. Another measure is the number of days that critical safety or production systems are down for maintenance. If we concentrate on trends, we can get a reasonable picture of the maintenance quality. Note that work force productivity and costs do not feature here, as safety and quality are the first order of business.

Earlier, we discussed the importance of doing the planned work at or close to the original scheduled time. Compliance is an important parameter that we should measure and analyze. The ratio of planned work to the total, and the associated costs, are other useful indicators. In measuring parameters such as costs, it is useful to try to normalize them in a way that is meaningful and reasonable, to enable comparison with similar items elsewhere. For this purpose, we use some unit representing the complexity and size of the plant, such as the volumes processed or plant replacement value, in the denominator.

Finally, we can evaluate the analysis phase itself, by measuring the improvements made to the plan as a result of the analysis. In a Thermal Cracker unit in a petroleum refinery, the six-monthly clean-out shutdowns used to take 21 days. Over a period of three years, the shutdown manager reduced the duration to 9 days, while stretching the shutdown intervals to 8 months. The value added by this plant was $60,000 per day, so these changes meant that the profitability increased by about $1.7 million per annum. This required careful analysis of the activities, new ways of working, and minor modifications to the design to reduce the duration and increase the run lengths. The plant was located in the Middle East, where day temperatures could be 40 to 50°C. Working inside columns and vessels under these conditions could be very tiring and, therefore, took a long time. One suggestion was to cool the fractionator column and soaker vessel internally, using a portable air-conditioning unit. Such units had previously been used to cool reactors in Hydro-Cracker shutdowns, to reduce the cooldown time. Using them for the comfort of people was a new application. When the shutdown manager introduced air-conditioning, the productivity rose sharply, and this helped reduce the duration by about 36 hours. Another change was to relocate two pairs of 10-inch flanges on transfer lines from the furnace to the soaker. This clipped an additional six hours. There were many more such innovations, each contributing just a few hours, but the overall improvement was quite dramatic. This case study illustrates how one can measure the success of the analysis phase in improving the plan and thus the profitability.
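The figure of about $1.7 million per annum can be checked with a short calculation. This is a sketch under one stated assumption: the shutdown interval is treated as the time from the start of one shutdown to the start of the next, so a 6-month interval means two shutdowns per year and an 8-month interval means 1.5 per year. The function and variable names are chosen for illustration.

```python
VALUE_ADDED_PER_DAY = 60_000  # dollars per day, as quoted in the case study


def shutdown_days_per_year(interval_months, duration_days):
    """Average days per year lost to clean-out shutdowns.

    Assumption: interval_months is measured start-to-start, so the
    number of shutdowns per year is 12 / interval_months.
    """
    shutdowns_per_year = 12 / interval_months
    return shutdowns_per_year * duration_days


before = shutdown_days_per_year(6, 21)  # 2 shutdowns x 21 days = 42 days
after = shutdown_days_per_year(8, 9)    # 1.5 shutdowns x 9 days = 13.5 days

days_gained = before - after            # 28.5 extra production days per year
annual_gain = days_gained * VALUE_ADDED_PER_DAY  # 1,710,000 dollars
```

On this basis, the plant gains 28.5 production days per year, worth $1.71 million, which is consistent with the figure quoted in the case study.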

It is easy to fall into the trap of carrying out analysis for its own sake. In order to keep the focus on the improvements to the plan, we need to record changes to the plan as a result of the analysis. Further, we have to estimate the value added by these changes and bank them. Hence, analysis must focus on improvements to all four phases of the maintenance process.

9.4 SYSTEM EFFECTIVENESS AND MAINTENANCE

The primary role of maintenance is to minimize the risk of minor events escalating into major incidents. We achieve this by ensuring the required level of barrier availability. Let us examine how we can do this in practice, with some examples.
