Which Reliability Tool should I use? By H. Paul Barringer, Barringer & Associates Inc. 
Publisher's note:  When a person has a hammer - everything looks like a nail.  Once a maintenance engineer learns techniques like Reliability Centered Maintenance (RCM) or Weibull analysis, it seems like they apply the technique to every potential area of failure they can find - whether RCM or Weibull analysis can add value or not.  Reliability tools must be used in the proper context to create the best result and the more tools we understand the better we can apply them.
We asked our favorite reliability guru, Mr. H. Paul Barringer, to help us understand what reliability tools are available to us as maintenance professionals, when we can and should use them, and what results we can expect if we apply them correctly.
To get a better education in reliability engineering for maintenance professionals, be sure to visit Mr. Barringer's web site after you read this article.  - Terrence O'Hanlon, CMRP, Publisher

Reliability Tools:

Reliability tools exist by the dozens: what are the tools, why use the tools, when should I use the tools, and where should I use the tools?  The entries below give brief answers for each tool.

Accelerated Testing
Bathtub Curves
Block Diagram Models
Configuration Control
Contract For Reliability
Cost Of Unreliability
Critical Items List
Decision Trees
Design Review
Failure Forecast
Failure Rates
Fault Tree Analysis
FRACAS Systems
Life Cycle Cost
Life Units
Maintenance Engineering
Mean Time
Mechanical Component Interactions
Monte Carlo
Normal Distribution
Pareto Distribution
Poisson Distribution
Probability Plots
Reliability Audits
Reliability Engineering
Reliability Growth
Reliability Policies
Reliability Testing
Simultaneous Testing
Software Reliability
Sudden Death Testing
Weibayes Estimates
Weibull Analysis
Weibull Database


The details about these tools will be brief as books are written about each item.  Think of the presentations below as hors d’oeuvres (a little snack food or starters)—not the main course.   

The most important reliability tool is a Pareto distribution based on money—specifically based on the cost of unreliability which directs attention to work on the most important money problem first.  No magic bullet exists for reliability issues—don’t waste your time looking for a single magic tool—none exist!

Accelerated Testing-

What:     A test method of increasing loads to quickly produce age-to-failure data with only a few data points which are then scaled to reflect normal loads.

Why:      The benefit of accelerated testing is to save time and money while quantifying the relationships between stress and performance along with identifying design and manufacturing deficiencies to get useful data quickly and at low cost.

When:    Usually performed during the development of devices, components, or systems.  Also applies to items that have been in service to obtain a metric needed to show how the item is performing under heavy loads.  Accelerated testing is also a useful method for solving old, nagging problems within a production process.

Where:  Used for correlating test results with real life conditions.

Availability-

What:    A tool for measuring the percentage of time an item or system is in a state of readiness where it is operable and can be committed to use when called upon.  Availability ceases because of a downing event which causes the item/system to become unavailable to initiate a mission when called upon.  In the simplest view the metric is availability = uptime/(uptime + downtime).  For many other definitions see MIL-HDBK-338, section 5.

Why:      The measure is important for knowing the commitment of time for performing the mission and it usually only involves the use of arithmetic.

When:    Often the measurement tool is based on past experiences and the complement of the measurement tool addresses unavailability to perform the task.

Where:  In design of a system it is a calculated value, and in operation of a system it is a performance index that is often easy to use and understandable to the average person.  Today there is a great tendency to “Enronize” availability metrics by using uptime metrics that present data in the best light (an issue of data integrity) to maximize managerial bonuses by excusing (deducting) downtime from the calculations to put lipstick on the pig.  Use the KISS principle.  Think of availability in terms of the investor’s typical year of 8760 hours.  The no-excuse annual metric in hours is availability = uptime/8760.  Suddenly you’ll find a metric of great interest to investors that can be benchmarked as a financial issue, and thus motivate the management team to solve real issues of importance to the business.  Please note, you can have high availability but many failures and thus low reliability, as availability ≠ reliability.  Likewise, you can have high availability but little output, so team the metric with effectiveness to get the complete story.
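
The two availability formulas in this entry can be compared side by side.  The following is a minimal Python sketch; the function names and the plant hours are hypothetical and chosen only to illustrate how excusing downtime flatters the metric.

```python
HOURS_PER_YEAR = 8760  # the investor's typical year used in the text


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Simplest view: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_hours / total


def annual_availability(uptime_hours: float) -> float:
    """No-excuse annual metric: uptime / 8760, with no downtime excused."""
    if not 0 <= uptime_hours <= HOURS_PER_YEAR:
        raise ValueError("uptime must lie between 0 and 8760 hours")
    return uptime_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical plant: 8000 h running, 300 h unplanned downtime, and
    # 460 h of planned outage that a flattering metric would "excuse".
    uptime, unplanned = 8000.0, 300.0
    print(f"metric with excused downtime: {availability(uptime, unplanned):.3f}")
    print(f"no-excuse annual metric:      {annual_availability(uptime):.3f}")
```

The gap between the two printed numbers is the downtime that was deducted from the calculation.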


Bathtub Curves-

What:    The concept is derived from the human life experience involving infant mortality, chance failures, plus a wear out period of life since data for births and deaths is accumulated by government agencies.  Most equipment lacks the birth/death recording by government agencies and most non-human systems can be regenerated to live/die many times before relegation to the scrap heap.

Why:      Failure rates are different for both people and equipment at different phases of operation, and the medicine to be applied to both humans and equipment needs to be considered for effectively treating the roots of the problem.

When:    The concept is useful during design, operation, and maintenance of equipment and systems to understand the failure mechanisms.

Where:  It explains the human experiences to the ordinary person to relate equipment/system failures to those experienced in real life so as to coordinate the design, operation and maintenance of equipment.  For other definitions see MIL-HDBK-338, section 9.
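
The text does not prescribe a mathematical model for the bathtub curve.  One common way to sketch its three regions, used here purely as an assumption, is the Weibull hazard function with shape factor beta and scale factor eta: beta below 1 gives a falling failure rate (infant mortality), beta equal to 1 gives a constant rate (chance failures), and beta above 1 gives a rising rate (wear out).  The numeric values are hypothetical.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


if __name__ == "__main__":
    eta = 1000.0  # hypothetical characteristic life in hours
    for label, beta in [("infant mortality", 0.5),
                        ("chance failures", 1.0),
                        ("wear out", 3.0)]:
        early = weibull_hazard(100.0, beta, eta)
        late = weibull_hazard(2000.0, beta, eta)
        print(f"{label:17s} beta={beta}: h(100)={early:.6f}  h(2000)={late:.6f}")
```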


Block Diagram Model (same as Reliability Block Diagram Models)-

What:    Reliability block diagram (RBD) models are graphical representations of a calculation methodology for reliability systems.

Why:      The RBD models allow calculation of system reliability based on knowing/assuming failure details of the components starting with the least component and growing the model to the greatest system to predict performance from the elements.

When:    RBDs are used in upfront designs as a performance parameter and after the system is constructed to ferret out poor performing blocks that limit the system performance.

Where:  Frequently used as a trade-off tool to search for the lowest long-term cost of ownership and to help sell alternative courses of action for moderating the effects of reliability issues or overcoming poor performance by alternative designs.  The results can be calculated before building the system, and the calculations provide knowledge about availability, maintenance interventions required for failures, and the number of spare parts required to sustain operations.  For other definitions see MIL-HDBK-338, sections 4 and 6.
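
As a small illustration of growing a model from the least component to the system, the sketch below combines block reliabilities in series and in parallel.  It assumes the blocks fail independently, which the text does not state, and the block names and values are hypothetical.

```python
from math import prod


def _check(reliabilities):
    if not reliabilities:
        raise ValueError("at least one block is required")
    if any(not 0.0 <= r <= 1.0 for r in reliabilities):
        raise ValueError("each reliability must lie between 0 and 1")


def series(*reliabilities: float) -> float:
    """All blocks must work: R = R1 * R2 * ... (independent blocks)."""
    _check(reliabilities)
    return prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """At least one block must work: R = 1 - (1-R1)*(1-R2)*..."""
    _check(reliabilities)
    return 1.0 - prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    # Hypothetical system: two redundant pumps feeding a motor and a controller.
    pumps = parallel(0.90, 0.90)
    system = series(pumps, 0.95, 0.98)
    print(f"redundant pumps: {pumps:.4f}")
    print(f"whole system:    {system:.4f}")
```

Comparing the block values against the system result shows which block limits performance; here the motor at 0.95 is the weakest link once the pumps are made redundant.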

Capability-

What:    A measure of how well the product performance meets objectives.  In short how well are the outputs actually accomplished against a standard?  Capability is frequently the product of efficiency * utilization.

Why:      Capability is a component of the effectiveness equation and usually under the control of production.

When:    Data for this metric is frequently produced by the Accounting department each month as a segment of the financial reports for the purpose of handling variances against the standards.

Where:  Frequently in the effectiveness measure it is a weak point [as a measure of how well the production process does the job for which it was purchased] requiring substantial improvement that cannot be solved by the usual reliability and maintainability (RAM) tools.  However, this metric may be deficient from the original design [an issue of design effectiveness] of the system or from the way the system is operated [an issue of use effectiveness].


Configuration Control-

What:    Configuration control is involved with the management of change by providing traceability of failures back into the design standard.  If the design details are not specified, the design will not contain the requirements and thus implementation of the project will be hit or miss for achieving the desired end results beginning with the conceptual design and resulting in the operating facility.

Why:      With active configuration control you know where items are used and contained, where and why they were installed, where signals originate, what items are used where and in what environments, what drawing revisions have occurred and whether the product conforms to the drawings and specifications, what alternate materials/components have been used, and whether test reports/certifications are available as original documents for review.

When:    Configuration control begins after the first design review to build an unbroken chain of traceability to aid in avoiding surprises in the field which would destroy the designed-in criteria for availability, reliability, maintainability, and cost effectiveness established as a portion of the original design criteria.

Where:  Frequently these documentation details are assembled into a dossier with third party witnessing for use in validating conformance to the design requirements and provided to the owner of the equipment as witness documents.


Contracting For Reliability-

What:    Say what you want and want what you say to your vendors.  Provide explanations of the objectives in contracts in terms the vendors will understand.

Why:      If you can’t clearly spell out the requirements for availability, reliability, and maintainability, the contractors cannot make these issues features of the design.  Thus it is important to be specific about the features the design must manifest.  Explanations such as: “You know what I want and what I need, just do it quickly” are self-defeating expressions of vague generalities that lead to inferior designs and constant arguments.  Be specific about requirements for building reliability block diagrams, using quality function deployment, performing failure mode and effects analysis, conducting fault tree analysis and finally conducting design reviews for reliability.

When:    Write the specifications before procurement begins.  Plan to spend time with your own Purchasing Department to explain the details and sell the team on the financial advantages of including reliability requirements in the specifications; likewise, spend time selling your vendors on the requirements and why they are stated.

Where:  These are up front decisions to avoid replication of previous problems that are built into previous designs and never corrected.


Cost Of Unreliability-

What:    The cost of unreliability is a big picture view of system failure costs, described in annual terms, for a manufacturing plant as if the key elements were reduced to a series block diagram for simplicity.  It looks at the production system and reduces the complexity to a simple series system where failure of a single item/equipment/system/processing-complex causes the loss of productive output along with the total cost incurred for the failure.  If the system IS sold out, then the cost of unreliability must include all appropriate business costs such as lost gross margin plus repair costs, scrap incurred, etc.  If the system is NOT sold out, and make-up time is available in the financial year, then lost gross margin for the failure cannot be counted.  The cost of unreliability is a management concern connected to management’s two favorite metrics: time and money.

Why:      In private enterprise, failures must be considered from a financial viewpoint and not by a gear-head approach of simply counting the number of failures; you must speak the language of the enterprise, which describes events by monetary measures over a period of time.  The annual cost of failures is usually not stated in a clear-cut manner, nor are failure costs summarized by system/sub-system to identify the weak links in a monetary fashion.  Building a clear Pareto distribution of these costs lets you attack the vital (high cost) areas with an action plan to reduce failures (unreliability) and the cost of unreliability.

When:    For a new plant, this can be a design criterion to limit costs of unreliability for competitive reasons in the marketplace, i.e., by plan, the hidden costs of failures are made obvious as a portion of the strategic plan.  For an existing plant, this can be an exercise in defining the cost of unreliability and building a long term plan to reduce the cost of failures as a portion of the tactical plan.

Where:  This activity is best performed with high level involvement of the management team to provide fundamental understanding of the size of the icebergs about to rip out the underbelly of the plant and to involve the organization in a plan to reduce the costs so that profits are pushed upward because of the improvements.  If the cost of unreliability cannot be reduced, then the costs become extra weight for the saddle bags in the race for survival.
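
The sold-out rule and the money-based Pareto ranking described in this entry can be sketched as follows.  This is an illustration only: the record layout, system names, and dollar figures are hypothetical, and a real study would include whatever other business costs are appropriate.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    system: str
    lost_hours: float      # production hours lost to the failure
    repair_cost: float     # labor and materials
    scrap_cost: float = 0.0


def cost_of_unreliability(rec: FailureRecord,
                          gross_margin_per_hour: float,
                          sold_out: bool) -> float:
    """Repair and scrap always count; lost gross margin counts only if sold out."""
    cost = rec.repair_cost + rec.scrap_cost
    if sold_out:
        cost += rec.lost_hours * gross_margin_per_hour
    return cost


def pareto_by_cost(records, gross_margin_per_hour: float, sold_out: bool):
    """Sum annual cost per system and rank from highest to lowest cost."""
    totals = {}
    for rec in records:
        cost = cost_of_unreliability(rec, gross_margin_per_hour, sold_out)
        totals[rec.system] = totals.get(rec.system, 0.0) + cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    year = [
        FailureRecord("compressor", lost_hours=40, repair_cost=50_000),
        FailureRecord("pump", lost_hours=10, repair_cost=8_000, scrap_cost=2_000),
        FailureRecord("conveyor", lost_hours=100, repair_cost=5_000),
    ]
    for sold_out in (True, False):
        print(f"sold out = {sold_out}")
        for system, cost in pareto_by_cost(year, 2_000.0, sold_out):
            print(f"  {system:10s} ${cost:>10,.0f}")
```

With these numbers the ranking changes with the market: the conveyor leads the list when the plant is sold out, while the compressor leads when make-up time is available.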


Critical Items List-

What:    The critical items list is a top level summary of problems/cost used for discussions with management about key reliability issues.  The summary list converts technical details to a summary of costs and time while placing the issues into a Pareto distribution explained in terms of money and the vital few problems to be solved for competitive reasons.

Why:      The purpose of the critical items list is to focus management’s attention on items that need to be resolved during the design phase as a corrective action loop for influencing the life time costs.

When:    The list starts with the first design review as issues are disclosed in design reviews for reliability.

Where:  The critical items list is presented to top level management as issues to be accepted or resolved before paper plans become steel and concrete.

Data-

What:    Data is the informational energy which runs the reliability improvement machine.  Data is acquired at great cost.  Data needs to be retained and used to prevent future failure events.  Proper use of data provides an understanding of failure mechanisms and prevents recurrence of bad events which cause safety or high cost failures.  Reliability data requires definition of a failure.  Failures can be catastrophic failures or slow degradation; you decide by defining the failures.  The units of measure for the data must be the units of the degradation: sometimes hours, sometimes miles, and so forth; in short, whatever motivates the failure.  Reliability always ceases with a failure or with a removal from service in some aged condition, and the latter generates a category of data called a suspension or censored data.  Data is information in the form of facts, figures, or engineering databases which is obtained from engineering tests, experiments, or actual operating conditions.  Reliability data is often incomplete, as the exact times to failure are rarely known or recorded with much precision, so that only partial information is available for analysis.  Reliability data comes in two forms: 1) age-to-failure data, and 2) censored/suspended data such as occurs when unfailed items are removed from service or when they fail due to a different failure mode than the one under study.  Censored data is useful information and part of the data set.  Some data is better than no data for resolving reliability issues.

Why:      Data is the information that, when used in an informed manner, helps prevent repetition of bad history and allows an enlightened approach to rationally solving a reliability issue using facts and figures.  Intelligent use of data for reliability issues provides the objective evidence needed for helping to solve the root cause of failures.

When:    Databases of reliability information from past experience are very helpful for predicting future failure events.  The data is helpful if it is described as failure rates or as their reciprocal, mean times to failure, which reduces the information to an average failure rate or average time to failure.  The reliability data is particularly valuable if retained for components as a Weibull database with shape factor beta and scale factor eta.

Where:  The data is useful for understanding failure modes, and for predicting future failures for a population of equipment during the design stage and for predicting future failures with subsequent increases in the aging of equipment.  The role of the reliability engineer is to acquire the failure data and convert the data into useful information for both current and future use.


Decision Trees-

What:    Most business decisions have considerable uncertainty, which implies at least two outcomes if you choose a course of action.  Making decisions in the face of uncertainty requires the cost and probability of the outcomes for taking action along with the cost and probability of the outcomes for not taking action.  In most cases the probabilities are not well known (maybe to one significant digit) and the costs are not well known (maybe to $10000).  The quantitative assessment is called risk assessment.  The issue is to take these poorly identified inputs and devise a strategy which can minimize exposure to risk for the business.  The graphical representation of the methodology is called a decision tree, which is used to reach the expected values for the decision to take or not take action.

Why:      Most business decisions have no exact answers, i.e., no black and white answers but rather shades of grey.  The use of the tool is to help decide which course of action may be to the advantage of the business given the best estimates that can be made.

When:    Decisive details will only be known in the future and decisions have to be made today, so decision trees are tools to help span wisely from today into the future with the best decisions that can be made from sketchy data.

Where:  If you have absolute data, use it.  But most decisions must be made with indecisive information, which requires decisions about the odds for a given event, usually based on estimates.  The wiser the estimate the better the decision, taking into account the probabilities of the outcomes and the money involved in the decision.  Use this tool when few details are available and you must be the pioneer to cut through the forest to reach the promised land of opportunity and profitable ventures.
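
The expected-value comparison behind a decision tree can be sketched in a few lines.  The example below is a minimal illustration with hypothetical options, probabilities, and costs; in keeping with the text, the inputs are rough estimates only.

```python
def expected_cost(branches) -> float:
    """Expected value of one option; branches is a list of (probability, cost)."""
    total_probability = sum(p for p, _ in branches)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * cost for p, cost in branches)


def choose(options):
    """Return the option with the lowest expected cost and all expected costs."""
    values = {name: expected_cost(branches) for name, branches in options.items()}
    best = min(values, key=values.get)
    return best, values


if __name__ == "__main__":
    # Hypothetical decision: replace an aging gearbox now or run it to failure.
    options = {
        "replace now": [(0.95, 20_000), (0.05, 120_000)],   # small residual risk
        "run to failure": [(0.70, 0), (0.30, 100_000)],
    }
    best, values = choose(options)
    for name, value in values.items():
        print(f"{name:15s} expected cost ${value:,.0f}")
    print(f"lowest expected cost: {best}")
```

Because the probabilities are known to perhaps one significant digit, it is worth rerunning the comparison with a range of estimates to see whether the preferred option changes.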

Dependability-

What:    The International Electrotechnical Commission (IEC) defines dependability as follows: “Dependability describes the availability performance and its influencing factors: reliability performance, maintainability performance and maintenance support performance.”  MIL-HDBK-338 defines dependability differently, as a measure of the degree to which an item is operable and capable of performing its required function at any (random) time during a specified mission profile, given that the item is available at mission start.  (Item state during a mission includes the combined effects of the mission-related system R&M parameters but excludes non-mission time; see availability.)  Dependability is related to reliability with the intention that dependability would be a more general concept than the measurable issues of reliability, maintainability, and maintenance.

Why:      The key dependability issue is to make equipment and processes work as advertised, that is, without failure.  Dependability aims at facilitating co-operation by all parties concerned (supplier, organization, and customer) by fostering an understanding of the dependability needs and value to achieve the overall dependability objectives, so it involves harmonizing conflicting issues.  Dependability is better viewed from the end user’s viewpoint than from the designer’s viewpoint or the maintainer’s viewpoint.  From a system effectiveness viewpoint, reliability and maintainability provide system availability and dependability.

When:    You cannot repair yourself to happiness with a failure prone system, as the failure prone system will be viewed as lacking the dependability to function as required when you need it.  Thus dependability is viewed over the longer term and not in convenient snap-shots, and dependability also involves life cycle cost issues.

Where:  Reliability contributes directly to uptime by avoiding failures whereas maintainability contributes directly to reducing downtime by faster repairs.  Thus reliability and maintainability jointly provide impact on dependability of the system.  Dependable systems must be ready to function, in an operable state, to produce the desired output, upon demand by the end user, at the specified quantity and quality of output.


Design Reviews For Reliability-

What:    Specific questions to ask the design engineers during a review specifically for reliability, using failure data from operations and maintenance, are: 1) show the calculated availability for the system based on a reliability and maintainability (RAM) model, 2) show the calculated number of failures during the specified mission time between turnarounds based on a RAM model, 3) show details of FMEA studies, 4) show details of FTA calculations, 5) show the calculated mean times between downing events, 6) show the calculated mean time between cutbacks from full production capability and the losses thus incurred, 7) show the QFD matrix and details, and 8) show the calculated cost of unreliability.

Why:      Design reviews should demonstrate by calculation or through the use of models and reliability tools that the system is capable of achieving the design objectives rather than making a giant leap of faith that all will be well and good.

When:    Design reviews for reliability should be a part of the design process starting with conceptual designs and ending when the drawings are revised for the as-built system.

Where:  This is a logical extension of the design process to "show me" rather than "tell me" how the system will function, and it is performed as a portion of the up-front, design-by-the-numbers process.

Effectiveness-

What:    The potential or actual probability of a system to perform a mission for a given level of performance under specified operating conditions defined as the product of reliability*availability*maintainability*capability.  Many variants of the effectiveness equation exist, e.g., OEE, and others.

Why:      The effectiveness equation defines the ability of a product, operating under specified conditions, to meet operational demands when called upon.  This is a practical measure of how well the system is performing: not how well we want it to perform, but how it is actually doing.  Since all the elements are measured between 0 and 1, the elements of the equation quickly draw the eye to where opportunities exist for making improvements.

When:    The effectiveness equation is useful for trade-off boxes for various alternatives when plotted on an X-Y scale for effectiveness vs net present value (NPV) for improvement alternative selections.  For the elements:
reliability defines the probability of a failure free interval (or the complement unreliability which describes the probability of failure),
availability defines the probability of the system being up and alive to handle the demand (or the complement, unavailability which describes the probability of the system being down),
maintainability defines the probability of making repairs within the allowed repair standard,
capability defines the probability of production achieving the desired production results [a measure of how well the product performs compared to the standard] and frequently it is described as the product of efficiency * utilization where
      efficiency is an output/input relationship such as (output achieved)/(the standard required) and
      utilization is how time is used such as (direct labor)/(direct labor + labor lost)
                      [in the old days, if this index decreased to as low as 80% we went berserk—today,
                      you can’t get this high because of wasted time when noses are not to the grindstone!!!].

Where:  It is used to describe the performance of new systems and old systems.  Consider this example for effectiveness:  If we are comparing a heavy duty truck versus a sports car for transportation, the truck may be more effective for heavy loads whereas the sports car may be more effective for acceleration and high speeds.  Neither is defined by the effectiveness equation until the mission is defined.
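
The effectiveness equation and its capability element can be written out directly.  The sketch below uses hypothetical element values; the helper that reports the weakest element is an addition for illustration, reflecting the remark that the elements draw the eye to improvement opportunities.

```python
def capability(efficiency: float, utilization: float) -> float:
    """Capability as the product of efficiency * utilization."""
    return efficiency * utilization


def effectiveness(elements: dict) -> float:
    """Product of reliability * availability * maintainability * capability."""
    required = ("reliability", "availability", "maintainability", "capability")
    result = 1.0
    for name in required:
        value = elements[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
        result *= value
    return result


def weakest_element(elements: dict) -> str:
    """Name of the element with the lowest value (hypothetical helper)."""
    return min(elements, key=elements.get)


if __name__ == "__main__":
    elements = {
        "reliability": 0.95,
        "availability": 0.98,
        "maintainability": 0.90,
        "capability": capability(efficiency=0.90, utilization=0.80),
    }
    print(f"effectiveness = {effectiveness(elements):.4f}")
    print(f"weakest element: {weakest_element(elements)}")
```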


Environmental Stress Screening (ESS)-

What:    A series of screens is conducted under environmental stresses to disclose weak parts and workmanship defects which require correction.  This requires an understanding of burn-in testing and ESS, both of which identify weak points and eliminate them by motivating early failures.  Burn-in is usually a long process of operating under load(s) at a fixed temperature (in short, it is a special case of ESS), or it can be operated at varying loads and accelerated temperatures to achieve a shorter burn-in period.  ESS is a scientifically planned and conducted test which is usually run under accelerated loads to produce the same test/use results in a shorter period of time by increasing the stress on the components or assemblies.  The objective of these screens is to produce a failure free product when released into operations.  ESS is not intended as a test to validate compliance to a design; it is intended to force latent defects to show up as failures before the end user finds them in day-to-day usage.

Why:      The extremes of operating conditions such as high power levels, high temperatures, high vibration levels, etc. produce failures not anticipated from testing at nominal conditions.  Generally ESS is directly applicable and interpreted to be applicable to electrical/electronic equipment, however the same issues/concepts apply to mechanical equipment when the stressing conditions are loads/pressures/temperatures/vibrations/thermal shocks/etc., so as for all reliability issues—think broadly!

When:    When acquiring data, the tests are done upfront of production.  When controlling early failures that would otherwise be discovered by the end user, these tests are done as a portion of the production process to eliminate weak units, control warranty costs, and improve customer satisfaction.

Where:  Some tests are conducted in the laboratory for quick results and then the data is used to control product testing/release for the purpose of limiting costs and preventing the loss of customers from unsatisfactory performance in the field.

Events-

What:    For reliability purposes, events/incidents are single occurrences, especially particularly significant ones, that result in a failure from a non-aging mechanism.  Usually the event/incident results in the serious consequence of the loss of functional life of a component or system.  The death of the device must be recorded as censored (suspended) data.

Why:      For reliability purposes, the component, device, subassembly, or system has been a success up to the point in life where a failure from a non-aging event took place.  This means the event-age was a success (up to the point it was killed by an event/incident), and inclusion of the data is required as censored/suspended data.  This is important data.

When:    Include the suspended/censored data into every analysis.  Young suspensions/censored data have little impact on the results of an analysis but old suspensions have major effect on the analysis.

Where:  The data is used for MTBF/MTTF analysis and particularly for Weibull analysis.
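
To see why suspensions belong in the analysis, consider a simple mean-time estimate.  The sketch below assumes the usual constant-failure-rate estimate of total accumulated age divided by the number of failures, which is an assumption rather than something the text specifies; the ages are hypothetical.

```python
def mtbf_with_suspensions(ages, failed) -> float:
    """Total age of all units (failed and suspended) / number of failures."""
    if len(ages) != len(failed):
        raise ValueError("ages and failed flags must have the same length")
    failures = sum(1 for flag in failed if flag)
    if failures == 0:
        raise ValueError("at least one failure is needed for this estimate")
    return sum(ages) / failures


if __name__ == "__main__":
    ages = [1000.0, 1500.0, 2000.0, 3000.0]   # hours in service
    failed = [True, True, False, False]       # False = suspension (censored)

    print("suspensions included:", mtbf_with_suspensions(ages, failed))
    print("suspensions ignored: ", mtbf_with_suspensions(ages[:2], failed[:2]))

    # A young suspension barely moves the estimate; an old one moves it a lot.
    print("add 50 h suspension:  ",
          mtbf_with_suspensions(ages + [50.0], failed + [False]))
    print("add 5000 h suspension:",
          mtbf_with_suspensions(ages + [5000.0], failed + [False]))
```

Dropping the suspended units cuts the estimate from 3750 hours to 1250 hours in this example, which is why the successes must stay in the data set.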


Exponential Distribution-

What:    The exponential distribution describes the probability of survival and of failure of components or equipment under the condition of chance failure, which means a constant instantaneous failure rate where the die-off rate is the same for any surviving (unfailed) population.  An old part is as good as a new part.  For any survivors in this memoryless system that have survived to time t, a fixed percentage of the survivors will die in a specified interval of time (for example, by 2*t).  The reliability of a system is often described by the exponential distribution because many times a system is made up of mixed failure modes which in the aggregate function like a constant failure rate system.  The reliability of the exponential distribution is described mathematically as R(t) = e^(-λt) = e^(-t/θ), where t is the mission time, λ is the failure rate, and θ is the mean time, given that λ = 1/θ.  The exponential distribution is frequently used as a first approximation to describe reliability based on a simple failure rate or a simple mean time to failure, particularly if the system or component has multiple failure modes.

Why:      The constant hazard rate, λ, is usually a result of combining many failure rates into a single number.

When:    The exponential distribution is frequently used for reliability calculations as a first cut, based on its simplicity, to generate the first estimate of reliability when more detailed failure modes are not described.

Where:  Electronic systems can have many different types of failure modes, and any electrical/electronic system is an amalgam of many different components, so the simple assumption is that the electrical/electronic package will have a constant failure rate defined by the exponential distribution.  When in doubt about the failure mechanisms, it is common to assume the exponential distribution with its constant failure rate for simplicity.
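
The exponential reliability formula in this entry is easy to evaluate, and the memoryless property can be checked numerically.  This is a minimal sketch with a hypothetical mean time of 1000 hours.

```python
import math


def reliability(mission_time: float, mean_time: float) -> float:
    """Exponential reliability: exp(-t / mean_time), failure rate = 1 / mean_time."""
    if mission_time < 0 or mean_time <= 0:
        raise ValueError("mission_time must be >= 0 and mean_time > 0")
    return math.exp(-mission_time / mean_time)


def conditional_reliability(age: float, extra_time: float, mean_time: float) -> float:
    """Probability of surviving extra_time more, given survival to age."""
    return reliability(age + extra_time, mean_time) / reliability(age, mean_time)


if __name__ == "__main__":
    mean_time = 1000.0  # hypothetical mean time to failure in hours
    print(f"R(100 h)  = {reliability(100.0, mean_time):.4f}")
    print(f"R(1000 h) = {reliability(1000.0, mean_time):.4f}")
    # Memoryless: an old part is as good as a new part.
    print(f"new part, next 100 h:       {reliability(100.0, mean_time):.4f}")
    print(f"5000 h old part, next 100 h: "
          f"{conditional_reliability(5000.0, 100.0, mean_time):.4f}")
```

Note that reliability at a mission time equal to the mean time is only about 37%, a point that is often overlooked when a mean time is quoted on its own.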

Failure-

What:    Failure is the loss of function when you needed the function to occur.  Failures for reliability purposes must be precisely defined so they are recorded correctly.  Much life data is incomplete because failures are mixed up with censored/suspended data, where aged items may not have failed, or they represent removals from service before failure, or they have not yet failed for the mode of failure under study.  In short, these censored/suspended items represent successes and are a portion of the data set for study.

Why:      We study failed items for the same reason we do autopsies on humans—we want the data and we want it categorized correctly for making important decisions.  Failures require: 1) a time origin which must be unambiguously defined, 2) a scale for measuring the passage of time/starts/stops/etc. which motivates failure, and 3) the meaning of failure must be entirely clear for recording the event.

When:    Failure data must be recorded as it occurs to prevent loss of information.

Where:  The CMMS system is frequently where most data resides but usually in crude fashion.  The failure data is often transferred into the FRACAS system for converting the symptoms of the failure into the root causes of failure.  The failure data must be converted into action items for making management decisions about future failures and the corrective action needed.



Failure Forecast-

What:    Failure forecasting is a projection of failures into the future based on assumed or documented failure details.  It is also known as risk analysis of future failures.  For a constant failure mode system this is very straightforward.  However, for complicated failure modes where the failure rate increases with time (wear out failure modes) or where failure rates decrease with time (infant mortality failure modes), this becomes a more complicated analysis known as the Abernethy Risk, which is described in The New Weibull Handbook and implemented in the software package WinSMITH Weibull for predicting future failures.  Likewise, reliability block diagrams are useful for predicting future failures when the authentic failure details are supplied to the Monte Carlo models.
Please note manufacturers follow two general strategies for their equipment:
      1) build the equipment to avoid failures even though this increases the original capital costs, or
      2) build equipment and sell the original equipment at a low cost (or even at a break-even cost)
          expecting to make profits with the sale of replacement parts.
Thus for end users of the procured equipment, it is important to know the forecasted failures in the face of supplier protests that “our equipment never fails”.  In that case, ask to see the sales of spare parts for similar equipment and an estimate of the number of units working to get a crude estimate of the strategy employed by the equipment supplier.
A failure is an event which renders equipment non-useful for the intended or specified purpose during a designated time interval.  The failure can be sudden, partial, one-shot, intermittent, gradual, complete, or catastrophic.  The degree of failure can be degradation or gradual, sudden, or one-shot, from weakness, from imperfections, from misuse, or so forth.
A failure mechanism includes a variety of physical processes which result in failure from chemical, electrical, thermal, or other insults.

Why:      Future failures cost money and frequently increase the risk for safety or environmental problems.  For manufacturers, the forecasted failures predict impending high costs for warranty expenses which can make/break a company.  With a good failure forecast, you can anticipate expected failures now (after x-usage), future failures when failed units are not replaced, and future failures when failed units are replaced either with the same failure modes or with differently designed components with different failure details.

When:    This analysis is wisely performed during the design of the equipment; however, many surprises arise from different failure modes built into the assembled product or incurred by unanticipated usage in operations.

Where:  Generally this analysis is made during the up-front design effort—with much disbelief the products could be “this bad”.  Follow-up analysis occurs when unexpected failure modes arise during operation of the equipment which causes loss of service of the equipment and high costs for the end users.
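
One common way to sketch a failure forecast is to sum the conditional probabilities of failure over the units in service.  The Python below assumes a two-parameter Weibull distribution with shape beta and characteristic life eta; the fleet ages and parameter values are hypothetical.  It is a simplified illustration of the idea for the case where failed units are not replaced, not the procedure implemented in WinSMITH Weibull, so consult The New Weibull Handbook for the authoritative method.

```python
import math

def weibull_cdf(t, beta, eta):
    """Cumulative probability of failure by age t for a two-parameter Weibull."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def expected_failures_now(ages, beta, eta):
    """Expected failures to date: sum of F(age) over every unit in the fleet."""
    return sum(weibull_cdf(age, beta, eta) for age in ages)

def expected_future_failures(ages, added_usage, beta, eta):
    """Expected failures over the next added_usage, with no replacement.

    Each surviving unit contributes its conditional probability of failing
    in the interval, given that it has survived to its current age.
    """
    total = 0.0
    for age in ages:
        f_now = weibull_cdf(age, beta, eta)
        f_later = weibull_cdf(age + added_usage, beta, eta)
        total += (f_later - f_now) / (1.0 - f_now)
    return total

# Hypothetical fleet: current ages in hours, wear out mode (beta > 1).
fleet_ages = [500.0, 1200.0, 1800.0, 2500.0, 3000.0]
BETA, ETA = 2.5, 4000.0  # assumed Weibull parameters

print(f"expected failures now:       {expected_failures_now(fleet_ages, BETA, ETA):.2f}")
print(f"expected in next 1000 hours: {expected_future_failures(fleet_ages, 1000.0, BETA, ETA):.2f}")
```

With beta = 1 the forecast reduces to the constant failure rate case, where every surviving unit has the same chance of failing in the interval regardless of age.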



Failure Rates-

What:    Failure rates, in the simplest form, are Σ(number of failures)/Σ(time in use), which is the reciprocal of mean times to/between failure.  For more sophisticated failure databases such as Weibull databases, the failure rates can be disclosed without giving away proprietary data such as the shape factors, beta, which tell the failure mode for the equipment.

Why:      Simple failure rates are a precursor of maintenance events and production interruptions that will occur in the future, which drive up costs and cause chaos.

When:    Failure rates derive from the history of operation or from well known data sources such as OREDA, IEEE 500, IEEE 493, EPRI, and other sources listed in reading lists for reliability including Weibull databases.

Where:  The failure rates are used as an awareness criterion for the average person, just as you use automobile fuel consumption rates for understanding the health of your automobile as well as anticipating your weekly/monthly/annual out-of-pocket expenditures for gasoline or diesel fuel.  The failure rates drive the maintenance interventions, spare parts, and maintenance cost for the Maintenance Department.  Similarly they predict the interruptions to the process, lead to misses on promised deliveries, and result in negative variances for production costs.  In short, failure rates are precursors for the misery expected for the organization.
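
The simple arithmetic can be sketched in a few lines of Python.  The run hours and failure counts below are hypothetical values for three similar pumps, used only to show that the failure rate and the mean time between failures are reciprocals of each other.

```python
def failure_rate(times_in_use, failure_counts):
    """Simple failure rate: total failures divided by total time in use."""
    total_time = sum(times_in_use)
    if total_time <= 0:
        raise ValueError("total time in use must be positive")
    return sum(failure_counts) / total_time

def mean_time_between_failures(times_in_use, failure_counts):
    """Mean time between failures: total time in use divided by total failures."""
    total_failures = sum(failure_counts)
    if total_failures == 0:
        raise ValueError("no failures recorded; the mean time is undefined")
    return sum(times_in_use) / total_failures

# Hypothetical history for three similar pumps.
run_hours = [8000.0, 7500.0, 8760.0]
failures = [2, 1, 3]

rate = failure_rate(run_hours, failures)
mtbf = mean_time_between_failures(run_hours, failures)
print(f"failure rate = {rate:.6f} failures/hour")
print(f"MTBF         = {mtbf:.1f} hours/failure")
```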



Fault Tree Analysis-

What:    Fault tree analysis (FTA) is a top-down process of defining the top level problems and, through a deductive approach using parallel and series combinations of possible malfunctions, finding the root of the problem and correcting it before the failure occurs.  The reliability tool can be used as a qualitative or quantitative method.

Why:      The tool aids the design process, shows weak links that cause failures, and in the critical legs of the trees helps to define maintenance strategies for which pieces of equipment and processes should be defended with the greatest maintenance vigor to prevent “Murphy” from shutting down the process or causing serious safety issues.  The technique provides a graphical aid for the analysis and it allows many failure modes including common cause failures.  Results from an FTA are usually more pessimistic than those from other analysis tools such as RBDs, as you can see from a study of the Space Shuttle reliability analysis where each system is studied by multiple reliability tools because of the high cost/profile of failures.

When:    FTA is widely used in the design phase of nuclear power plants, subsea control and distribution systems, and for oversight studies in layers of protection studies for process safety and loss control in chemical plants and refineries so as to prevent accidents and control the costs of risks.  The technique is helpful for identifying critical fault paths, observing vague failure combinations before they occur in reality, comparing alternate designs for safety, and setting a methodology to provide management with a tool to evaluate the overall hazards in a system and avoid single sources of critical failures.  Finally when thinking top down about failures and where/how they can occur, the methodology gives a diagram for setting maintenance strategies for protecting key pieces of equipment/processes to prevent failures.

Where:  FTA is helpful for defining potential event sequences and potential incidents, evaluating the incident consequences of outcomes, and estimating the risks of events occurring.  FTAs work in the design room and on the operating floor where first hand knowledge has been gained for preventing failures.
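
For the quantitative use of a fault tree, the parallel and series combinations are usually evaluated with AND and OR gates.  The Python sketch below assumes the basic events are independent, which real trees with common cause failures will violate, and the tree structure and probabilities are hypothetical.

```python
import math

def and_gate(probabilities):
    """Output event occurs only if ALL input events occur (independent events)."""
    return math.prod(probabilities)

def or_gate(probabilities):
    """Output event occurs if ANY input event occurs (independent events)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)

# Hypothetical tree: loss of cooling flow occurs if the pump fails,
# OR if both the primary and the backup power supplies fail.
P_PUMP_FAILS = 0.01
P_PRIMARY_POWER_FAILS = 0.05
P_BACKUP_POWER_FAILS = 0.10

loss_of_power = and_gate([P_PRIMARY_POWER_FAILS, P_BACKUP_POWER_FAILS])
top_event = or_gate([P_PUMP_FAILS, loss_of_power])

print(f"P(loss of power)        = {loss_of_power:.5f}")
print(f"P(loss of cooling flow) = {top_event:.5f}")
```

In this made-up case the single pump dominates the top event, which is the kind of weak link the tree is meant to expose.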




What:    Failure mode and effect analysis (FMEA) is the study of potential failures that might occur in any part of a system to determine the probable effect of each failure on all other parts of the system and on probable operational success.  When criticality analysis is added for sophisticated studies the method is known as FMECA.  In the automotive world where FMEA is a required portion of the quality systems, it is frequently known as PFMEA for potential failure mode and effect analysis.  The basic thrust of the analysis tool is to prevent failures using a simple and cost effective analysis that draws on the collective information of the team to find problems and resolve them before they occur.

Why:      The analysis is known as a bottom up (inductive) approach to finding each potential mode of failure that might occur for every component of a system.  It determines the probable effect of each failure mode, in turn, on system operation and on probable operational success, and the results are ranked in order of seriousness.  FMEA can be performed from different viewpoints such as safety, mission success, availability, repair costs, failure modes, reliability reputation, production processes, follow-on service, and so forth.

When:    The FMEA is most productive when performed during the design process to eliminate potential failures.  It can also be performed on existing systems where operations personnel and maintainers are made team members to add real-life experiences to educate the team in a problem solving forum that is constructive to eliminating existing problems.

Where:  The analysis can be conducted in the design room or on the shop floor and it is an excellent tool for sharing experiences to make the team aware of details that are known to one person but seldom shared with the team.  It is also an extremely productive tool for educating young engineers, young maintainers, and young operators about details they should be aware of that can kill the system.



FRACAS Systems-

What:    A failure reporting, analysis, and corrective action system (FRACAS) is an organized database for aiding in solving reliability problems using a common sense approach by systematically and permanently removing failure mechanisms.  Good historical data from this system can populate a Weibull database.

Why:      Use data to solve problems by attacking root causes to reduce failures and make reliability grow.  Fixing failures requires data, not opinions, and uses the data acquisition system in a closed loop to record, analyze, correct, and verify improvements have been achieved.  The first data reported is usually a symptom of a failure; with a failure investigation, the symptom can be converted into a root cause, which requires the system to be editable to correctly report failures.

When:    The maintenance repair order system usually generates evidence of a failure.  Failures with significant costs (repair costs + collateral damage + lost margin from the failure + other appropriate business costs) must be investigated and evaluated to reduce failures and to reduce failure costs.  Little is to be gained by spending big money to investigate trivial failures.

Where:  This is an engineering tool requiring clerical effort to input the data and build the Pareto distributions for identifying significant events requiring corrective action and thus it also becomes a management tool for controlling costs.




What:    Highly accelerated life test (HALT) is an offspring of older environmental stress screening (ESS) tests and it is a testing process for ruggedization of pre-production products by heavily stressing the product to identify failure modes quickly and to verify weak links in the system.

Why:      HALT tests are intended to quickly find failures and accelerate the improvement program so that when products are delivered to end users, they will be mature products by elimination of potential failure modes that would normally generate a reliability growth program.  Usually the HALT programs reduce time, cost, and delays experienced in new products by recalls, warranty costs, etc.  HALT is similar to HASS but the stresses are more severe.  In the HALT process, design and process flaws are found, root causes identified, and corrective actions implemented quickly.

When:    HALT is used during the development program to get engineers to acknowledge and correct fatal problems in designs by adding loads (generally temperature, vibrations, pressures, physical stresses, etc.) and by rapidly changing the load conditions over and above normal operating loads.

Where:  HALT is frequently used for electronic systems but is also applicable to mechanical systems where thermal shocks are used to validate designs for extreme conditions of loads.  The tests are performed in the laboratory for engineering evaluation.




What:    Highly accelerated stress screen (HASS) uses the same stresses as HALT, but at a lower stress level.  Compared to HALT testing, temperature and voltage extremes may be reduced by 10-15%, vibration levels reduced 50%, etc. depending upon the design although all the stresses may be above rated product specifications with the motivation to produce test results quickly for verifying product compliance.

Why:      HASS testing verifies that product performance is on target and has not shifted toward inferior performance in the manufacturing process.  Note that higher stresses often produce accelerated failures out of proportion to the increased stress applied.

When:    Products are periodically screened by HASS to verify no shifts have occurred in the manufacturing process.

Where:  HASS tests are performed as a quality assurance test in manufacturing facilities to learn what you don’t know about each product, as it is faster than a simple burn-in test.  When only a percentage of the finished goods is screened by HASS, rather than 100%, this is called a highly accelerated stress audit (HASA).



Life Cycle Cost-

What:    Life cycle cost (LCC) is the total of all costs associated with the acquisition and ownership of a system over its full life.  The usual figure of merit is net present value (NPV).  Projects are considered most favorable for large positive NPVs.  However, for many individual cost cases, decisions are made for the least negative NPVs.  In all cases, the default position for accounting is to know the NPV for making no change, and this is usually the last alternative for most people associated with change.

Why:      The first cost for capital equipment (acquisition) is between ½ and 1/20 of the total life time cost!  The first cost, acquisition cost, is usually definable by a firm quotation and sustaining costs must be estimated and put into the appropriate time slots for discounting to obtain the NPV for the project life.   Typical values used in industry for LCC are: discount rate = 12%, tax rate = 38%, and project life is usually between 10 and 20 years.

When:    Life cycle cost is usually calculated as an up-front decision making effort either for projects or for cost reduction efforts.  It does not work well for doing the analysis after the project is underway.

Where:  LCC is the business of investing money to make changes occur.  The NPV values add the voice of investments to technical decisions to work for the lowest long term cost of ownership.
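
A bare-bones NPV comparison can be sketched as follows.  The Python below uses the 12% discount rate mentioned above but, as a simplifying assumption, ignores taxes and depreciation; the two pump alternatives and their costs are hypothetical.  Costs are entered as negative cash flows, with the acquisition cost in year 0 and the sustaining costs in the following years.

```python
def net_present_value(cash_flows, discount_rate):
    """Discount a list of yearly cash flows (index 0 = year 0) to present value."""
    return sum(cash / (1.0 + discount_rate) ** year
               for year, cash in enumerate(cash_flows))

DISCOUNT_RATE = 0.12  # typical industry value quoted in the text
PROJECT_LIFE = 10     # years (hypothetical)

# Hypothetical alternatives: acquisition cost in year 0, then sustaining costs.
cheap_pump = [-10_000.0] + [-4_000.0] * PROJECT_LIFE
robust_pump = [-18_000.0] + [-2_000.0] * PROJECT_LIFE

npv_cheap = net_present_value(cheap_pump, DISCOUNT_RATE)
npv_robust = net_present_value(robust_pump, DISCOUNT_RATE)

print(f"NPV, low first cost alternative:  {npv_cheap:,.0f}")
print(f"NPV, high first cost alternative: {npv_robust:,.0f}")
print("Preferred:", "robust pump" if npv_robust > npv_cheap else "cheap pump")
```

Both NPVs are negative here, so the decision goes to the least negative value, which in this made-up case is the alternative with the higher first cost.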



Life Units-

What:    A measure of use duration applicable to an item.  For example, the life units may be starts-stops, run hours, hot-cold cycles, distances traveled, emergency starts or starts, shelf life, and other measurements which motivate failures.

Why:      Life is consumed by usage of life units.  Some life units occur as a sum of different cases.  For example, on a gas turbine aircraft engine, take-offs consume more life than landings or en route conditions, which requires a synthetic value for how life is consumed on a mission.  For a land based, heavy-duty gas turbine used in the generation of electrical power, the number of starts is not equivalent to hours of operation as other wear mechanisms are involved.  However, 1 trip cycle = 8 normal shutdown cycles, and thus trips decrease the time between required maintenance actions.

When:    Development of a life consuming profile may be more important than the literal measurement of an elapsed time to adequately measure consumption of life that in the end will result in a failure.

Where:  Life units have different measures and must be considered to obtain the proper “common denominator” for calculations.
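
The “common denominator” idea can be illustrated with the trip cycle equivalence given above.  In the Python sketch below the weighting of 8 comes from the text, while the inspection interval and the operating history are hypothetical.

```python
TRIP_WEIGHT = 8  # from the text: 1 trip cycle = 8 normal shutdown cycles

def equivalent_shutdown_cycles(normal_shutdowns, trips):
    """Convert a mix of normal shutdowns and trips into one common life unit."""
    return normal_shutdowns + TRIP_WEIGHT * trips

def cycles_remaining(inspection_interval, normal_shutdowns, trips):
    """Equivalent cycles left before the next required maintenance action."""
    return inspection_interval - equivalent_shutdown_cycles(normal_shutdowns, trips)

# Hypothetical turbine history and a hypothetical inspection interval.
INSPECTION_INTERVAL = 400  # equivalent shutdown cycles (assumed)
consumed = equivalent_shutdown_cycles(normal_shutdowns=120, trips=10)
remaining = cycles_remaining(INSPECTION_INTERVAL, normal_shutdowns=120, trips=10)

print(f"equivalent cycles consumed:  {consumed}")
print(f"equivalent cycles remaining: {remaining}")
```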



Load-Strength Interactions-

What:    For reliability successes, loads must always be less than strengths.  When loads are greater than strengths, failures occur.  The issue is determining the probability of load-strength interference which is a joint probability of when loads exceed strengths.  The loads should include expected conditions plus the foolishness of people to violate rules and overload equipment, plus the vagaries of Mother Nature to impose unexpected static and dynamic loads from hurricanes, tornadoes, earthquakes, wildfires, and so forth.

Why:      Neither loads nor strengths are unmovable point estimates, although most designers use point values.  Failures occur and reliability terminates when loads exceed strengths.

When:    Loads usually increase over time (e.g., airplanes, like people, gain weight over time from accumulation of dirt and extra equipment), and strengths usually decrease over time (small fatigue cracks appear with many cycles and load bearing strengths decline).

Where:  Bridges have finite lives because of load-strength interactions, wings break off of airplanes from fatigue, etc.  A few failures are dramatic but most failures sneak up from the unknown in a variety of ways to cause loss of reliability.  To prevent loss of the system requires many physical inspections to learn what you don’t know!
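
When both load and strength are treated as distributions rather than point values, the interference probability can be estimated directly.  The Python sketch below assumes independent, normally distributed load and strength, which is only one of several possible modeling choices, and all of the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def interference_probability(load_mean, load_sd, strength_mean, strength_sd):
    """P(load > strength) for independent, normally distributed load and strength.

    The safety margin (strength - load) is itself normal, so the failure
    probability is the chance that the margin falls below zero.
    """
    margin_mean = strength_mean - load_mean
    margin_sd = math.hypot(load_sd, strength_sd)
    return NormalDist(margin_mean, margin_sd).cdf(0.0)

# Hypothetical values in consistent stress units.
p_new = interference_probability(load_mean=300.0, load_sd=30.0,
                                 strength_mean=450.0, strength_sd=40.0)
# Same part after its strength has declined with age.
p_aged = interference_probability(load_mean=300.0, load_sd=30.0,
                                  strength_mean=400.0, strength_sd=40.0)

print(f"P(failure), new part:  {p_new:.5f}")
print(f"P(failure), aged part: {p_aged:.5f}")
```

A point value calculation would show a comfortable margin in both cases, yet the overlap of the two distributions grows quickly as the strength declines.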




What:    Lognormal distributions are continuous life functions that have long tails to the right (display positive skewness) in time or usage.  A lognormal distribution plotted on semi-log paper would appear as a normal curve.

Why:      The lognormal distribution is a common competitor to the Weibull distribution for life data.  It is also adequate for 85-95% of all repair times.

When:    Lognormal distributions are motivated by multiplicative (or proportional) events that grow with time like crack growth, molecular diffusion, and some wear out problems.

Where:  In the days when plots had to be made by hand, it was the first widely used transform to convert plotted data into straight lines.  Today it is simply one of an arsenal of probability tools used to obtain good curve fits to data with multiplicative type events.




What:    The measure of the ability of an item to be retained in or restored to specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources.

Why:      Maintainability measures the percent of maintenance jobs completed to a standard time for the repair with repair times for the task usually plotted on a lognormal probability plot.

When:    First you set a standard repair time for the task, second you set a skills level, third you measure how you’re doing against the standard.

Where:  Applies to major tasks where many repetitions are expected and where considerable time is required.
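
The percent of jobs completed within a standard time can be sketched in Python by fitting a lognormal distribution to recorded repair times.  The repair hours and the 4 hour standard below are hypothetical, and the fit simply uses the mean and standard deviation of the logged times rather than a probability plot.

```python
import math
from statistics import NormalDist, mean, stdev

def fit_lognormal(repair_times):
    """Estimate lognormal parameters from the logarithms of the repair times."""
    logs = [math.log(t) for t in repair_times]
    return mean(logs), stdev(logs)

def maintainability(standard_time, mu, sigma):
    """Probability that a repair is completed within the standard time."""
    return NormalDist(mu, sigma).cdf(math.log(standard_time))

# Hypothetical repair times (hours) for one repetitive major task.
repair_hours = [1.5, 2.0, 2.2, 3.1, 3.5, 4.0, 5.5, 8.0]
STANDARD_TIME = 4.0  # assumed standard repair time in hours

mu, sigma = fit_lognormal(repair_hours)
fitted = maintainability(STANDARD_TIME, mu, sigma)
observed = sum(t <= STANDARD_TIME for t in repair_hours) / len(repair_hours)

print(f"fitted:   {100 * fitted:.1f}% of jobs within {STANDARD_TIME} h")
print(f"observed: {100 * observed:.1f}% of jobs within {STANDARD_TIME} h")
```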




What:    All actions necessary, both technical and administrative, for retaining an item in or restoring it to a specified condition so it can perform a required function.  The actions include servicing, repair, modification, overhaul, inspection, reclamation, and restored condition determination. 

Why:      Equipment deteriorates because of entropy changes, because of errors both overt and covert, and because of the use of incorrect procedures.

When:    Maintenance is generally routine and recurring. 

Where:  The effort includes fault location, diagnosis, repair, test, adjustment, replacement, administration, and overhauls wherever equipment is located.



Maintenance Engineering-

What:    A tactical job for rapidly repairing equipment to operable conditions by studying operating and repair manuals.  Acquires failure data and prepares maintenance plans for restoring equipment to operable condition in a minimum amount of time.  Prepares general diagrams, charts, drawings, and spare parts requirements for maintenance planners.  Makes recommendations for improving the repair cycle.  Provides manning level forecasts for supervisors and estimates the duration of outages.  Determines the cost advantages of alternatives for developing action plans to comply with internal/external customer demands for timely repairs of processes/equipment.  The purpose of these activities is to restore equipment to service in a timely manner.

Why:      Facilitates speedy repairs by providing maintenance technology above the craftsman level and up to but not including reliability engineering principles.

When:    Provides expertise for more complicated maintenance tasks or when organization and oversight is required and time is of the essence for fast repairs.

Where:  Provides on-site expertise to aid craftsmen to solve non-standard repairs without hands-on tool contact.  Maintenance engineers serve as liaison with reliability engineers.



Mean Time-

What:    A density figure-of-merit metric often referred to as the average or expected value.  In the simplest form it appears as the arithmetic Σ(time)/Σ(events) or in complicated situations as a statistical metric.  It applies to mean life (ML), mean down time (MDT), mean maintenance time (MMT), mean time between failures (MTBF for repairable items), mean time to failures (MTTF for replacement items), mean time between maintenance (MTBM), mean time between maintenance scheduled (MTBMs), mean maintenance time unscheduled (MMTu), mean maintenance time scheduled (MMTs), mean time between overhauls (MTBO), mean time between unscheduled removals (MTBRu), mean time to restore (MTR), mean time between downing events (MTBDE), and so forth.  The units will be time/metric, e.g., hours/failure.  The reciprocal of the metric provides an incident rate, e.g., failures/hour.

Why:      The metric provides an awareness factor for deciding central tendency numbers and for the expected number of events which will occur into the future based on historical situations.  The arithmetic simplicity of mean time is a reason to establish the metric, and listen to the information derived from it to gain insight.  The arithmetic provides immediate answers to categorize facts for starting continuous improvement rather than postponing a metric while searching for delayed perfection!

When:    The metrics are used as criteria of performance.  Variations from the central tendency numbers are expected; however, for the long term the variations are expected to be controlled to prevent distortion of the measurement.

Where:  The metrics are used from the shop floor to the management levels as criteria for “How are we doing?”.



Mechanical Components Interaction-

What:    Mechanical components suffer from interactions and degradations of overloads, strength deterioration, wear, corrosion, process variations during the fabrication process, effects of special processes where the procedures must be controlled as discovery of the end results would result in destruction of the component, and removal of safety factors by increasing loads.

Why:      The naïve expectation is that individually the impact of a single insult will not destroy reliability of the component.  However, you frequently have multiple insults occurring which result in failures that are not predicted up front but which can be perfectly explained after the components have failed.

When:    The multiple destructive events are more predominant in complex devices and highly stressed devices, which too often have small safety factors that cannot cope with the overload conditions, and thus failures occur.

Where:  The foolishness of humans adds further insults to the interactions of many different failure mechanisms, which demands high maintenance interventions and frequent inspections.  Of course the solution to many of these cases where failures occur is to increase safety factors by adding extra material (when possible) but this adds extra weight and extra costs.



Monte Carlo Simulation-

What:    Monte Carlo simulation (modeling) is a method to solve engineering problems by sampling methods.  The method applies to such things as system reliability and availability modeling by simulating random processes such as life to failure and repair times.

Why:      The technique is used when: 1) many variables are present and their interrelationships are unclear, 2) the system can’t be analyzed by direct and formal methods, 3) building analytical models would be time consuming, complex, and just too hard, 4) you cannot do direct experiments, 5) the input details such as equipment life and repair times are not discrete and they vary over time according to a distribution, and 6) you need to do some tweaking of the system to understand where opportunities lie for improving uptime, reliability, and costs.

When:    Build models before you commit systems to bricks and mortar so you know their performance on paper.  Revise the models after the systems are in operation to help correct the unknown weaknesses and improve costs for future cases.

Where:  Monte Carlo models are used for gaining insight about how things work, and data is collected from the model at an accelerated rate compared to real life.
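
A minimal Monte Carlo availability model can be sketched in a few lines of Python.  The sketch assumes a single item that alternates between running and being repaired, with exponentially distributed lives and lognormally distributed repair times; the mean life, repair parameters, horizon, and random seed are all hypothetical.

```python
import math
import random

def simulate_availability(mean_life, repair_mu, repair_sigma, horizon, seed=1):
    """Simulate one item alternating between operation and repair.

    Lives are drawn from an exponential distribution and repair times from a
    lognormal distribution.  Returns uptime divided by the simulated horizon.
    """
    rng = random.Random(seed)
    clock = 0.0
    uptime = 0.0
    while clock < horizon:
        life = rng.expovariate(1.0 / mean_life)
        run = min(life, horizon - clock)  # truncate the last run at the horizon
        uptime += run
        clock += run
        if clock >= horizon:
            break
        clock += rng.lognormvariate(repair_mu, repair_sigma)  # repair downtime
    return uptime / horizon

MEAN_LIFE = 1000.0           # hours, assumed
REPAIR_MU = math.log(8.0)    # lognormal parameters for repair hours, assumed
REPAIR_SIGMA = 0.5

simulated = simulate_availability(MEAN_LIFE, REPAIR_MU, REPAIR_SIGMA,
                                  horizon=1_000_000.0)
mean_repair = math.exp(REPAIR_MU + REPAIR_SIGMA ** 2 / 2.0)
analytical = MEAN_LIFE / (MEAN_LIFE + mean_repair)

print(f"simulated availability:  {simulated:.4f}")
print(f"analytical availability: {analytical:.4f}")
```

This simple case has a closed form answer, which makes it a useful check on the model before adding the complications (multiple items, spares, crews, changing distributions) that make simulation worthwhile.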



Normal Distribution-

What:    A fundamental frequency distribution that produces a symmetrical bell-shaped diagram based on the Gaussian distribution to form a normal law of errors. 

Why:      The distribution is easily described with two statistics, the mean (X-bar, which is a location parameter) and the standard deviation (sigma, which is a scale parameter carrying the units of the location parameter), which estimate the parameters of the population.

When:    The distribution is widely used for quality issues where errors are frequently symmetrically distributed and for a few cases of reliability problems where life data is also symmetrically distributed.  For symmetrical life data, normal data makes a good Weibull plot whereas Weibull data usually makes a poor normal plot; thus Weibull plots have almost displaced normal plots for reliability data.

Where:  The distribution is used where the statistics simplify descriptions of the distribution so it is easy to describe and explain.




What:    Overall equipment effectiveness (OEE) is a manufacturing index to reduce complexity of discrete systems for problem solving and benchmarking.  In many ways, it is a subset of effectiveness.  OEE = availability*performance*quality, where availability = (operating time)/(planned production time), performance = (ideal cycle time)/(operating time/total pieces), and quality = (good pieces)/(total pieces).  It is best suited to discrete manufacturing.  The index is larger than for effectiveness and allows for acceptance of down time without having a hard measure for utilization losses in the capability (although it does have a performance index which takes elements from both efficiency and utilization), and it accepts planned downtime as OK in the availability index.  The effectiveness index looks at the system from the perspective of the investor, whereas OEE looks at the system from the perspective of operations management, which excuses many losses such as planned outages.  This creates a propensity for the indices to be “Enronized” so they look good when in fact, from the investor’s viewpoint, the results are not good, which is a violation of the principle of Esse Quam Videri (To be rather than to seem).

Why:      It’s a simple and easy to use index for the big picture summary of performance in industry and it can be benchmarked against similar industries.

When:    Use for a quick assessment and approximation of the effectiveness equation.

Where:  Widely used for a first cut at improving manufacturing operations in lieu of the more stringent and complete effectiveness equation.
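
The OEE arithmetic is easy to sketch in Python using the three ratios defined above.  The shift data below (minutes and piece counts) are hypothetical.

```python
def overall_equipment_effectiveness(planned_production_time, operating_time,
                                    ideal_cycle_time, total_pieces, good_pieces):
    """Return (availability, performance, quality, OEE) from the basic ratios."""
    availability = operating_time / planned_production_time
    performance = ideal_cycle_time / (operating_time / total_pieces)
    quality = good_pieces / total_pieces
    return availability, performance, quality, availability * performance * quality

# Hypothetical shift: times in minutes, ideal cycle time in minutes per piece.
availability, performance, quality, oee = overall_equipment_effectiveness(
    planned_production_time=480.0,
    operating_time=400.0,
    ideal_cycle_time=1.0,
    total_pieces=350,
    good_pieces=340,
)

print(f"availability = {availability:.3f}")
print(f"performance  = {performance:.3f}")
print(f"quality      = {quality:.3f}")
print(f"OEE          = {oee:.3f}")
```

Note that the operating time and total pieces cancel out of the product, so OEE reduces to (ideal cycle time * good pieces)/(planned production time), which shows how planned downtime never enters the index.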



Pareto Distribution-

What:    Vilfredo Pareto, an Italian economist in the late 1800s, described the unequal distribution of wealth in the world.  The concept was improved by Joe Juran for manufacturing operations when he said it was a methodology for separating the vital few problems from the trivial many problems.  When the Pareto distribution is listed in order of money lost (or the risk for money lost) it becomes a work priority for attacking business problems that have the greatest impact on the enterprise.  Winners in the organization work on the vital few important items, as they put their reputations at stake, while the losers in the organization work on the trivial many problems, which if solved, would have little impact on the enterprise.

Why:      The Pareto distribution sets work priorities and assuming a one year pay back period describes how much money can be spent to resolve the issues.  Most reliability engineers need to be working on the top 5 or 6 items all the time as data and solutions are developed slowly and the key items always need to be on the mind for active consideration.  The mentality is to think like a bank robber—go for where the big money is located and get it back.

When:    At least quarterly reviews of the Pareto distribution are important for accountability of who has solved what problems and to define what new targets have come over the horizon that require immediate attention.

Where:  Pareto distributions are used throughout the organization to keep attention on the vital few issues.  They are favored by management when engineers employ them based on money.  Pareto distributions help set work priorities and avoid focusing on love affairs with equipment or process which often occurs to the detriment of the business.  Pareto distributions explain why some work orders always get maintenance priority while other tasks are relegated to the category of “whenever we get time to solve the problem”.
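
Building the Pareto list is a simple sort on money.  The Python sketch below ranks problems by annual loss and reports the cumulative percentage; the problem names and dollar values are hypothetical.

```python
def pareto_ranking(losses):
    """Rank problems by money lost and add the cumulative percent of the total.

    losses maps a problem name to its annual cost.  Returns a list of
    (name, cost, cumulative_percent) tuples, largest loss first.
    """
    total = sum(losses.values())
    ranked = sorted(losses.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0.0
    for name, cost in ranked:
        running += cost
        rows.append((name, cost, 100.0 * running / total))
    return rows

# Hypothetical annual losses in dollars.
annual_losses = {
    "Pump seal failures": 420_000,
    "Compressor trips": 910_000,
    "Instrument faults": 60_000,
    "Heat exchanger fouling": 250_000,
    "Conveyor jams": 35_000,
}

rows = pareto_ranking(annual_losses)
for name, cost, cumulative in rows:
    print(f"{name:<24} ${cost:>9,}   {cumulative:5.1f}%")
```

In this made-up list the top two items account for roughly 80% of the money lost, which is where the work priority belongs.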



Poisson Distribution-

What:    Poisson distributions are discrete distributions and the simplest statistical process, where Poisson events are random in time with a stable average rate of occurrence of counted events.  The Poisson is frequently used as a first approximation to describe failures expected with time.  The calculations are driven by an average value, e.g., failures/year, defects/meter², hurricanes/year, etc.  Answers from the Poisson will come as probabilities for 1 failure, 2 failures, etc. or the probability for 1 hurricane in a year or 2 hurricanes in a year, etc.  The average value is obtained from a constant*time-interval which is usually expressed as λ*t.  Frequently charts are used to obtain solutions to the Poisson equation such as the Thorndike Chart from Bell Labs or the Abernethy-Weber chart from The New Weibull Handbook.  The equation is often described in two formats: 1) probability = (np)^r * e^(-np) / r! where n = number of trials, r = number of occurrences, and p = probability of an occurrence, or 2) probability = Z^C * e^(-Z) / C! where Z = expected number (i.e., the mean) and C = number of events, in counting numbers.  Of course for the two different formats np = Z and r = C.  When n is large and p (or 1-p) is small, the Poisson is an excellent approximation to the binomial distribution.

Why:      Simplicity is the major reason for use of the Poisson distribution.

When:    Use the Poisson when an answer is needed quickly and the answer deals with counting terms.

Where:  When you know the average number of events the Poisson is easy to use to find the probability of 1, 2, 3,…events occurring.
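
The Poisson equation in its second format, with Z as the expected number of events and C as the count, can be evaluated directly instead of reading a chart.  In the Python sketch below the failure rate of 2 failures/year and the one year interval are hypothetical.

```python
import math

def poisson_probability(count, expected):
    """Probability of exactly `count` events when `expected` events is the mean."""
    return expected ** count * math.exp(-expected) / math.factorial(count)

FAILURE_RATE = 2.0  # failures/year, assumed
INTERVAL = 1.0      # years
expected_failures = FAILURE_RATE * INTERVAL  # the average value, rate * time

for count in range(5):
    p = poisson_probability(count, expected_failures)
    print(f"P({count} failures in the interval) = {p:.4f}")

# Probability of more than 4 failures, from the complement of the listed terms.
p_more_than_4 = 1.0 - sum(poisson_probability(c, expected_failures) for c in range(5))
print(f"P(more than 4 failures)          = {p_more_than_4:.4f}")
```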



Probability Plots-

What:    Probability plots make sense of the chaos of failure data on an X-Y plot.  Each type of plot is divided differently on the X and Y axis based on the fundamental mathematics for a given distribution.  The decision which type of graph paper to use is based on: 1) a simple pragmatic approach (use the one that gives the best curve fit to the data), and 2) the physics of failure or the mechanism driving the data for non-failures.  For reliability data, 85% to 95% of the data will adequately fit a Weibull distribution.  For repair data, 85% to 95% of the data will adequately fit a lognormal distribution.  Often Weibull plots and lognormal plots compete over which distribution best fits the failure data.

Why:      The acquired data is plotted in the units acquired on the X-axis of a probability plot and the data is plotted in rank order.  The Y-axis in most cases is determined using Benards median rank approximation to provided the probability percentage.  The result is often a straight line on the properly divided X-Y graph paper.  Please note, over the years many different plotting positions have been tried with Benard’s plot position being the strongest survivor for tailed data.

When:    Use when you have failure data or repair data.  They work best when age-failure plots are made by individual failure modes or individual repair modes.  They also will handle high level failure data and repair times where the data represent how the system is behaving.

Where:  Use probability plots to get complicated data summarized onto one side of one sheet of paper.  When the plots have the cumulative distribution plotted on the Y-axis, it tells what percent of the population will have a life (or repair time) less than the corresponding X-value.
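
As a rough illustration of the Y-axis plotting positions, the sketch below uses the commonly quoted form of Benard’s median rank approximation, (i - 0.3)/(n + 0.4), for the i-th ranked value out of n complete failures.  The article does not spell out the formula, and the ages-to-failure are hypothetical, so treat both as assumptions.  Suspended (censored) data would need adjusted ranks, which this sketch does not handle.

```python
def benard_median_ranks(ages):
    """Pair each age (sorted into rank order) with its Benard median rank.

    Assumes every value is a failure (no suspensions).
    """
    ordered = sorted(ages)
    n = len(ordered)
    return [(age, (rank - 0.3) / (n + 0.4))
            for rank, age in enumerate(ordered, start=1)]


# Hypothetical ages-to-failure, in hours.
ages_to_failure = [150, 85, 250, 40, 135]

for age, median_rank in benard_median_ranks(ages_to_failure):
    # X value is the age; Y value is the cumulative percent failed.
    print(f"{age:>5} hours -> {100 * median_rank:.1f}%")
```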



Quality Function Deployment-

What:    QFD is a bad translation of a good reliability technique for getting the voice of the customer into the design process so the product delivered is the product the customer desires.  In particular it is applicable to soft issues that are difficult to specify.

Why:      The method helps pinpoint: 1) what to do, 2) the best ways to accomplish the objective, 3) the best order for achieving the design objectives, and 4) the staffing/assets required to complete the task.

When:    QFD is a major up-front effort (as is the case with most Japanese techniques) to learn and understand the customer’s requirements and the approach that will satisfy their objectives.

Where:  The methodology is used as a team approach to solving problems and satisfying customers beginning with a listing of customer requirements, converting customer requirements into engineering characteristics (the house of quality), converting engineering characteristics into parts characteristics (the house of parts deployment), converting parts characteristics into process characteristics (the house of process planning), and finally converting the process characteristics into production characteristics (the house of production planning).  As with all Japanese techniques the up-front costs are high and many clever graphical tools exist for transferring information with the intention of decreasing costs downstream while satisfying the customer’s needs.




Reliability-

What:    Reliability is the probability that a device, system, or process will perform its prescribed duty without failure for a given time when operated correctly in a specified environment.

Why:      Reliability has two broad ranges of meanings: 1) qualitatively, operating without failure for long periods of time just as the advertisements for sale suggest, and 2) quantitatively, where life is predictably long and measurable in test to assure satisfactory field conditions are achieved to meet customer requirements.  Reliability is concerned with failure-free operation for periods of time, whereas quality is concerned with avoiding non-conformances at a specified time prior to shipment; thus reliability measures a dynamic situation but quality measures a static situation.  As in physics, statics is easier to understand and calculate than dynamics which involves higher levels of math and greater mental capabilities for comprehension.

When:    Reliability is expected for new equipment to start, run, and continue to function for long periods of time without failure.  Reliability is also expected when the equipment is dormant and called to duty.  Reliability is also expected upon service or restoration and resumption of long life.  Reliability is designed into the system by up-front activities, and reliability is sustained by careful operation of the system along with careful nurturing of the system with sustaining maintenance activities.  Reliability always terminates in a failure and the roots of failure can be due to design, fabrication, installation, operation, maintenance (repair and periodic servicing), and management of the system.  In short there are many ways and means to kill the system but few ways to keep it operating without failure.

Where:  The adage says the proof of the pudding is in the eating; and for reliability, the proof of the system is in the long failure free interval.  Reliability tools are used from stem to stern to demonstrate high reliability (the absence of failures for long periods of time) by use of many tools such as:
reliability acceptance test to demonstrate long life,
reliability analysis to compute the expected results,
reliability and maintainability the mathematical tasks which predict the expected results from the elements,
reliability apportionment to allocate life issues in a top-down manner to meet an overall reliability goal,
reliability assessment determines the achieved level of reliability of an existing system using data gathered during test or use,
reliability assurance implements planned management and technical measures to provide confidence that a reliability target is obtained and maintained,
reliability block diagrams to graphically and mathematically calculate reliability results prior to building a system,
reliability-centered maintenance is the systematic approach to identify preventive support and service according to a set of procedures to reduce and avoid failures,
reliability confidence limits demonstrate the limits for reliability within a given confidence limit,
reliability control is the coordination and direction of system dependability through design activities and management planning,
reliability critical item identification whereby failure significantly affects system safety/cost or operational success or maintenance/logistics support costs,
reliability data is the basic age-to-failure data as life unit information relating to the time-to-failure when organized by probability distributions,
reliability degradation which incurs loss of the failure-free performance due to poor workmanship or bad parts or improper operation or abuse or inadequate maintenance,
reliability design practices are a series of trade-off tools to meet or beat the design specification for reliability,
reliability development/growth test are the evaluations to disclose deficiencies and verify corrective actions to prevent reoccurrence of the failures to achieve the design specifications and sustain reliability growth toward longer times between failure,
reliability estimates are life values used prior to statistical experimentation with the end products to make predictions or assessments, or stress analysis evaluations,
reliability function is the graphical representation of life characteristics plotted against operating time,
reliability growth achievement is the systematic improvement of an item’s/system’s dependability by removing failure mechanisms through corrective actions to eliminate deficiencies and flaws, often achieved by means of test, analyze, and fix,
reliability growth models (Crow-AMSAA) measure reliability growth by means of log-log plots of cumulative failures on the Y-axis and cumulative time on the X-axis to demonstrate with statistics that failures are coming more slowly and reliability goals have been achieved,
reliability guarantee is the commitment by suppliers to provide a given mean time between replacements or maintenance and overhaul intervals for equipment,
reliability improvement is the identification of failure modes and effects having a critical impact on the system failure potential of the design along with the systematic removal of the failures to produce long life without failures,
reliability index is the ratio of the mean reliability level achieved to the acceptable level specified in the design as a figure of merit,
reliability measurement is failure free endurance assessment activity for making decisions about reliability and demonstrating compliance,
reliability mission is the mission time for demonstrating failure free performance,
reliability prediction is the process of quantitatively assessing whether a proposed or existing design meets a specified life requirement,
reliability prediction functions estimate the life characteristics for setting goals and evaluating the design benchmarks and needs,
reliability prediction limitations describes the shortcomings in life values by analytical methods,
reliability prediction requirements describes life assumptions, environmental data, and failure rates for the design,
reliability prediction summary is a report providing conclusions and recommendations based upon a reliability assessment analysis,
reliability program is the activities to organize and achieve a system to ensure reliability goals are achieved and deficient areas shored-up,
reliability program plan is the formal written definition of the specific tasks to fulfill the reliability requirements,
reliability qualification test (RQT) is an evaluation conducted under specified conditions using items representative of the approved product configuration,
reliability quantitative elements are the life characteristics and factors considered in predicting and measuring reliability performance,
reliability requirements are the numerical values representing a specified failure-free life or dependability performance characteristic,
reliability sequential tests are evaluations of the number of failures and the time required to reach a decision based on the accumulated results of the reliability tests,
reliability tasks describe the activities required to achieve a reliability program,
reliability tests are the formal evaluation to determine a product’s longevity for the failure-free interval or stability relative to time/usage,
and finally
reliability with repair is the failure-free performance achieved by redundancy with permitted online repairs without interrupting equipment operation.



Reliability Audits-

What:    Reliability audits verify your reliability program is effective and find areas of weakness for corrective action.  They are inquiries by factual examination of elements of the system against written, objective criteria for performance, beginning with an assessment of how management is involved and whether it is effective in building a productive reliability program.

Why:      Most organizations know where they are strong.  On an objective basis, few organizations know where they are weak.  Reliability audits are fact-finding exercises similar to financial and quality audits to ferret out weaknesses for corrective action.  The questions to be answered are: 1) how well are you doing what you promised against your reliability policy, 2) how well is upper management doing against company objectives for reliability, 3) how well are reliability plans, systems, and procedures working, 4) how well are plans, systems, and procedures being executed against the policy, 5) how well is the productive effort for reliability working for achieving the goals, 6) how well has the reliability system been communicated to employees and are they committed to understanding and implementing the improvements, and 7) are financial objectives being met as a result of ongoing reliability improvements (which is the main objective of the audit, not just a rigid procedural/bureaucratic compliance to details).

When:    Detailed audits should occur annually, with a follow-up six months later to ensure that corrective action has been implemented.  Without a six month deadline few tasks will be completed because of procrastination.

Where:  Audits are needed for 1) reliability system management, 2) new techniques, technology, developments, and controls, 3) supplier control (internal and external), 4) process operation and control, 5) reliability data programs, 6) problem solving techniques, 7) control of reliability measurements, 8) human resources involvement, 9) customer satisfaction assessment (internal/external), and 10) software reliability (excluding Microsoft products used in the office environment).



Reliability Block Diagrams-

What:    Reliability block diagram (RBD) models are graphical representations of a calculation methodology for reliability systems.

Why:      The RBD models allow calculation of system reliability based on knowing/assuming failure details of the components starting with the least component and growing the model to the greatest system to predict performance from the elements.

When:    RBDs are used in upfront designs as a performance parameter and after the system is constructed to ferret out poor performing blocks that limit the system performance.

Where:  Frequently used as a trade-off tool to search for the lowest long term cost of ownership and to help sell alternative courses of action for moderating the effects of reliability issues or overcoming poor performance by alternative designs.  The results can be calculated before building the system, and the calculations provide knowledge about availability, maintenance interventions required for failures, and the number of spare parts required to sustain operations.  For other definitions see MIL-HDBK-338, sections 4 and 6.
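
The block arithmetic behind an RBD can be sketched as follows, assuming independent failures and the two textbook arrangements: blocks in series (all must work) and blocks in parallel (at least one must work).  The function names and component reliabilities are hypothetical and chosen only to show how a model grows from the elements to the system.

```python
import math


def series(*reliabilities):
    """All blocks must survive: multiply the block reliabilities."""
    return math.prod(reliabilities)


def parallel(*reliabilities):
    """At least one block must survive: 1 minus the product of unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical system: one motor feeding two redundant pumps.
motor = 0.95
pump = 0.90

redundant_pumps = parallel(pump, pump)
system = series(motor, redundant_pumps)

print(f"redundant pumps: {redundant_pumps:.4f}")
print(f"system:          {system:.4f}")
```

Changing one block at a time and recomputing the system value is the simplest way to find the block that limits system performance.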



Reliability-Centered Maintenance-

What:    Reliability-Centered maintenance (RCM) is a systematic planning process used to determine the maintenance requirements for a system.  RCM expects the system has an inherent reliability, and maintenance requirements are imposed upon the baseline of inherent safety and inherent reliability, which can be no better than what was designed into the system.

Why:      RCM does what is required to make sure the systems continue to do what the users want done.  If even excellent maintenance programs fail to deliver the reliability expected, then the system must be improved by design changes to physical assets or the manner in which the assets are used.

When:    RCM requires a cultural change in both management and the work forces to “do maintenance by the numbers”.  This requires discipline in the organization to perform the FMEAs that drive the work process for maintenance and it also requires defining functional failures.

Where:  RCM works better in top quartile manufacturers who have a disciplined work force and are interested in achieving excellence in 1) safety, 2) operability, 3) reduced maintenance downtime by a disciplined approach to the maintenance activities, 4) high uptimes and 5) a reduction in failures.  Lacking one or more of the five efforts at excellence generally results in a failed RCM program.



Reliability Engineering-

What:    A strategic job for preparing plans to reduce the failures and the cost of failures as a preventative measure to reduce the cost of unreliability.  Acquires failure data and analyzes the  data to quantify the financial impact and prepare long term solutions to prevent reoccurrences to improve reliability and uptime.  Determines the cost advantages and proposes alternatives for solving the problem and recommends the alternative with the lowest long term cost of ownership.  The purpose of these actions is to prevent failures.

Why:      Prevents future failures by working on medium and long term projects using technology to solve the problems.  As required, provides technical assistance to maintenance engineers to aid their efforts for quickly restoring equipment to service.

When:    Provides expertise for avoiding failures by means of a technical solution to reduce the high cost reliability problems on the Pareto distribution.

Where:  Provides technical support and solutions for management on longer range problems, and as required, supplies technical assistance to maintenance engineers for immediate and difficult restoration projects as a liaison effort.  Supports task improvements to accomplish longer term objectives (think months and quarters) which will result in smoother operations, at lower costs, without failures.



Reliability Growth Models-

What:    Reliability growth models are important management concepts for making reliability visual with simple displays.  The simple log-log plots of cumulative failures on the Y-axis against cumulative time on the X-axis often make straight lines where the slope β of the trend line is highly significant for telling if failures are coming faster (β>1) which is undesirable, slower (β<1) which is desirable, or without improvement/deterioration (β=1), which usually drifts toward undesirable results.  The reliability growth models are frequently called Crow-AMSAA plots in honor of Larry Crow’s proof of why the charts work as described in MIL-HDBK-189 when he worked with AMSAA.

Why:      You must see reliability problems to fix them.  The simple log-log plots make the models visible.  The task of the reliability engineer is to put favorable cusps on the Crow-AMSAA trend lines to make failures come more slowly and thus decrease the long term cost of ownership.  If you’re doing your improvement job correctly, you’ll never have many failures until you have a cusp.

When:    The plots are useful for development tasks (where they first were used) or for long term operations.  They work for safety programs, plant improvement programs, environmental programs, or for cost problems.  Use the plots to “show me, don’t tell me” how the projects are proceeding; the key metric in the form of line slope is easy to understand and easy to communicate in less than 60 seconds.

Where:  They are used for technical development issues or for management reviews.  A picture is worth 1,000 words for getting management’s attention for focusing on a problem.  Likewise the charts are highly useful for showing the reductions in failures that have occurred from making a desirable and permanent fix.
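
A rough way to read the trend line slope off the log-log plot is an ordinary least-squares fit of log(cumulative failures) against log(cumulative time), sketched below.  This simple regression is an assumption made for illustration; MIL-HDBK-189 describes more formal estimation methods, and the failure times shown are hypothetical.

```python
import math


def crow_amsaa_fit(cumulative_failure_times):
    """Fit N(t) = lam * t**slope by least squares on the log-log plot.

    `cumulative_failure_times` holds the cumulative time at the 1st, 2nd,
    3rd,... failure.  Returns (slope, lam).
    """
    xs = [math.log(t) for t in cumulative_failure_times]
    ys = [math.log(k) for k in range(1, len(cumulative_failure_times) + 1)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    lam = math.exp(y_mean - slope * x_mean)
    return slope, lam


# Hypothetical cumulative hours at each successive failure.
times = [100, 250, 480, 800, 1250, 1900]
slope, lam = crow_amsaa_fit(times)

trend = "slower (desirable)" if slope < 1 else "faster or unchanged (undesirable)"
print(f"slope = {slope:.2f}: failures are coming {trend}")
```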



Reliability Policies-

What:    Management communicates with their staffs through important policy statements.  Management policies are general and relate to procedures and rules which are specific for implementing policies.  Written statements of policy regarding reliability are decisive documents about avoiding system failures in the same way as safety policies address the need for absence of human injuries, quality policies address the need for absence of product discrepancies, and environmental policies address the need for avoiding spills and releases.  Management also needs to state a reliability policy, which may read like this:  We will build an economical and failure free process which will operate for 5 years between planned outages.  This statement will clearly communicate that failures to the process (which is the money machine) are to be abhorred and avoided!

Why:      Process failures are clearly money issues because when the process ceases to run, the company has no income, thus process failures are to be abhorred for killing the money machine.

When:    Implementing a policy before construction of new facilities is important so the policy can be used as design criteria.  When implemented with older facilities the task is more difficult and old facilities may never be able to comply with the objectives at a reasonable cost.

Where:  Responsibility for implementing the policy lies with: 1) the chief operating officer, who must authorize the policy and ensure the policy is applied throughout the operations under the administrative directive which sets the guidelines for financial and engineering measures, 2) the engineering/R&D executives, who are responsible for ensuring the policy is implemented by systems engineering, design engineering, project engineering, pilot plant engineering and test engineering, 3) the manufacturing executive, who is responsible for ensuring that the reliability policy is carried out by the materials and procurement functions, industrial engineering functions, manufacturing engineering functions, operations functions, and maintenance functions, 4) the quality assurance executive, who is responsible for the dissemination of the reliability policy, its annual review and auditing for compliance to the spirit of the policy, and for making recommendations to the chief operating officer concerning continued relevance, applicability, and effectiveness, and 5) the human resources executive, who is responsible for ensuring that all new employees are indoctrinated into the purpose and implementation of the reliability policy as a part of the operation’s mission, goals, and priorities.



Reliability Testing-

What:    Suppliers have two strategies for testing: 1) test for success and 2) test for failures.  Reliability testing produces failures, particularly when the tests are accelerated with extra loads, and this may be troublesome to have in the records for future lawsuits.  Thus it is often to everyone’s advantage to perform reliability tests under code names to protect against the broad rules of legal discovery.

Why:      The reliability tests will determine a product’s longevity and failure-free performance.  This requires data recording and data integrity.  Plans must be set for how the tests are to be conducted, loads to be handled, duration of the tests, environmental conditions, operating modes, failure definitions, and documentation for recording/analyzing the test data.

When:    Reliability tests are usually run prior to release of the product for sale or after the product has been released and troublesome failures appear in field applications where no problems were expected.

Where:  Laboratory tests are conducted in many cases but in other cases the data may simply come from field use.  Note the failures induced require extra components which must be expected and budgeted along with the extra costs for data acquisition/analysis.



Simultaneous Testing-

What:    For inexpensive components and inexpensive tests, simultaneous tests involve many components under test loads/conditions at the same time for the purpose of quickly acquiring data and producing test analysis as the failures occur.   In simultaneous testing the suspensions (censored data) become important details for use in the statistical analysis.  Most simultaneous tests are accelerated to generate the data in a short period of time although this carries the risk of introducing unexpected failure modes (but this can also be useful information for anticipating field failures).

Why:      Conducting analysis of the early test results, when only a few failures have occurred, will give precursors as to passing/failing the longer term tests.  If the early test results look encouraging, the larger test may be allowed to run to conclusion.   However if early test results are disappointing, the test may be abandoned without using all of the testing budget so that remedial action can occur prior to completing the full scale planned test.

When:    This testing is usually conducted prior to release of products.  However, a similar watch may be set up for warranty repairs so as to anticipate the cost and extra supplies required to cope with an unexpected failure which was not forecasted.

Where:  This strategy is appropriate for inexpensive components in the test laboratory.  However, for warranty problems, the issues are very appropriate for expensive components or assemblies.



Software Reliability-

What:    Software does not wear out but it does fail and most failures are due to specification errors and code errors with only a few errors in copying or use.  The only software repair is by reprogramming, and adding safety factors is almost impossible.  Software reliability improves by finding errors and fixing the errors but estimating the number of errors which cause failures is extremely difficult as many branches of software code may lie dormant and unused until special events occur to make the latent failures obvious.  Software failures are not often time related but are more software code path dependent.  Software reliability is improved by extensive testing to disclose the failures and then fixing them, then repeating the test all over again to validate that the fix did not generate more failures and to continue the search for other latent defects.

Why:      More than 50% of the software bugs (failures) occur from specifications with lesser amounts of failures from system design and the coding process and this is due to the lack of visibility in the software process along with problems from those specifying the requirements with problem roots in ambiguities, inconsistencies, incomplete statements, and lack of logical requirements.  This requires that both inputs and outputs for software must be specified in greater detail than for mechanical, electrical, or system data to avoid the errors and conflicts.

When:    “Clean room” software procedures are a technique for extracting details from the customers to insure the programmers and they are used up-front to reduce errors and wasted code.  The acquiring of the data is tedious and roughly 80% of the software budget is spent getting the details “right” before programming commences.

Where:  Disciplined software specialists carefully work the plan up-front to reduce errors and testing time.  Undisciplined, so called “neo-experts” want to see busyness in code writing up-front and thus their software reliability is worse from not having a firm foundation from which to work.



Sudden Death Testing-

What:    For expensive components and expensive tests, sudden death tests involve a few components that tie up a test frame as they are heavily loaded under the same test loads/conditions with several items being run at the same time.  When one of the items fails the entire test frame is shut down so that you have 1 failure (this is the sudden death!) and several suspensions because the unfailed units are survivors; the test is halted until the test frame is loaded with new samples for resumption of the life test.  Opening the test frame (instead of tying up the frame until all samples have failed) is cost effective.  If three units can be tested simultaneously and the test is halted on the first failure, then 12 samples run as four groups of three will yield only 4 failures and 8 suspensions for preparing the Weibull analysis.  Will the 4 failure + 8 suspension data set be different than if all 12 samples had been run to failure?  The answer is yes, they will be different; but will they be significantly different?  The answer is no.  So, as with simultaneous testing, the suspensions (censored data) become important details for use in the statistical analysis.  Most sudden death tests are accelerated to generate the data in a short period of time although this carries the risk of introducing unexpected failure modes (but this can also be useful information for anticipating field failures).

Why:      Sudden death testing is all about the economics and shorter elapsed time for results.

When:    Sudden death testing is used for product acceptance tests.

Where:  It is a quick test for many products and the on-going test for production lots.



Total Productive Maintenance-

What:    Total productive maintenance (TPM) is a corporate-wide effort involving all employees to fully use equipment to the maximum limit employing an equipment-oriented management concept to reduce failures and increase utilization of equipment and processes in a productive manner.  TPM programs are teamwork programs and require a corporate culture of teamwork devoid of us vs. them issues.  All employees are expected to accept ownership of the equipment and processes to do many small things all the time to insure high levels of availability by eliminating failures in the early stages with low cost actions.  The employees approach the process equipment as owners rather than renters.

Why:      Maximizing equipment uptime with lower costs by all employees working to reduce the many small incidents which lead to a failure.

When:    Major maintenance tasks are handled by the craftsmen.  Most small tasks are handled by operators in a never ending effort of cleaning, lubricating, and tightening to find problems early when they can be solved simply instead of letting the problem grow to a major issue.

Where:  TPM is a system wide effort of providing care to the equipment rather than “it’s not my job” and “we’ve got to fill out the paperwork before “they” can do anything”.   The technique makes good use of the 5 human senses but technical details must be taught to the work force to understand good from bad and when action must be taken along with what must be done—this requires a sharing environment where the work team works for the common good of higher performance.  If the culture is me, me, me, TPM will not work.



Weibayes Estimates-

What:    If you’ve got one piece of failure data and nothing else, you’re a poor person without much hope.  If you’ve got one piece of failure data and a Weibull database, you’re a rich person with a map on the back of an envelope and a compass by your side to get you out of the abysmal swamp of ignorance and misunderstanding.

Why:      The Weibayes technique uses your failure data and past experience to make Weibull analysis forecasts about what you should expect in the future, and in many cases, given a hypothesis of worst-case/best-case, a failure forecast can be generated.

When:    Use the technique when you lack specific details but you know something from your past experience; often the past experience reduces errors of Weibull analysis.  Use Weibayes analysis to make sense out of emotional nonsense.

Where:  Use the technique to say something and point noses in the right direction rather than playing the role of Chicken Little with the sky falling.  Some data is better than no data in most cases and when you can keep your wits and everyone else is in panic mode, it quiets the problem to allow reason to prevail. 
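
The article does not give the Weibayes equation.  As a hedged sketch, the form usually quoted from The New Weibull Handbook takes the Weibull shape factor from past experience and estimates the characteristic life as (sum of t^shape over all units / number of failures)^(1/shape).  The function name, the shape factor of 2.5, and the ages below are assumptions for illustration only.

```python
def weibayes_characteristic_life(ages, shape, failures=1):
    """Weibayes estimate of the Weibull characteristic life.

    `ages` holds the age of every unit (failed or still running),
    `shape` is the assumed Weibull shape factor from past experience,
    `failures` is the number of failures observed (at least 1).
    """
    if failures < 1:
        raise ValueError("failures must be at least 1")
    total = sum(age ** shape for age in ages)
    return (total / failures) ** (1.0 / shape)


# Hypothetical case: one failure at 1200 hours, three survivors still running,
# and a shape factor of 2.5 taken from a (hypothetical) Weibull database entry.
unit_ages = [1200, 900, 1500, 1100]
life = weibayes_characteristic_life(unit_ages, shape=2.5, failures=1)
print(f"estimated characteristic life: {life:.0f} hours")
```

Running the estimate with a best-case and a worst-case shape factor gives the bracket for a failure forecast.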



Weibull Analysis-

What:    Weibull analysis is the tool of choice for most reliability engineers when they consider what to do with age-to-failure data.  It uses the Weibull distribution, which says mathematically that reliability is R(t) = e^(-(t/η)^β), where t is time, η is a scale factor known as the characteristic life, and β is a shape factor.  Most Weibull distributions have tailed data and lack an easy way to describe central tendency, as the mode≠median≠mean.  However, regardless of the β-value, all of the cumulative distribution function curves pass through 63.2% at t = η, which entitles η to be known as the single point characteristic life.

Why:      The Weibull distribution is so frequently used for reliability analysis because one set of math (based on the idea that the weakest link in the chain will cause failure) describes infant mortality, chance failures, and wear-out failures.

When:    Use Weibull analysis when you have age-to-failure data.  When you have age-to-failure data by component, the analysis is very helpful because the values of the shape factor β will tell you the modes of failure, which no other distribution will do!  When you have age-to-failure data by system, the β-values have NO physical significance and the shape factor β and characteristic life η only explain how the system is functioning, which means you lose significant information for problem solving.

Where:  When in doubt, use the Weibull distribution to analyze age-to-failure data.  It works with test data.  It works with field data.  It works with warranty data.  It works with accelerated testing data.  The Weibull distribution is valid for ~85-95% of all life data, so play the odds and start with Weibull analysis.  The major competing distribution for Weibull analysis is the lognormal distribution.   For additional information read The New Weibull Handbook, 5th edition by Dr. Robert B. Abernethy and use the WinSMITH Weibull and WinSMITH Visual software for analyzing the data (both programs are bundled at a reduced price as SuperSMITH).
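
As a small sketch of the Weibull reliability equation in this section, the function below evaluates reliability at time t for a given characteristic life (scale factor) and shape factor, then shows that the cumulative fraction failed at t equal to the characteristic life is about 63.2% whatever the shape factor.  The parameter names and numerical values are hypothetical.

```python
import math


def weibull_reliability(t, characteristic_life, shape):
    """Reliability R(t) = exp(-(t / characteristic_life) ** shape)."""
    return math.exp(-((t / characteristic_life) ** shape))


characteristic_life = 1000.0  # hypothetical, in hours

# Shape factors below 1, equal to 1, and above 1 correspond to infant
# mortality, chance failures, and wear-out failures respectively.
for shape in (0.5, 1.0, 3.0):
    fraction_failed = 1.0 - weibull_reliability(characteristic_life,
                                                characteristic_life, shape)
    print(f"shape {shape}: {100 * fraction_failed:.1f}% failed at the characteristic life")
```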



Weibull Database-

What:    The smartest way to maintain a reliability database is in Weibull format, and Weibull databases are available.  Seldom do you see Weibull databases from vendors because they jealously protect their data for proprietary reasons; they live or die financially by the Weibull database information.

Why:      The Weibull databases simplify the complications of failure data into two statistical values of great importance: 
b tells you HOW things fail, and
h tells you WHEN things fail. 
The results are key benchmark data that tell you how you’re doing.
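As a rough idea of what one row of such a database might hold, the sketch below stores b and h per component and reads the failure pattern off b.  The class name, the example entries, and the tolerance band around b = 1 are all assumptions made for illustration; they are not taken from any published database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WeibullRecord:
    component: str
    b: float            # shape factor: HOW it fails
    h: float            # characteristic life: WHEN it fails
    units: str = "hours"

    def failure_mode(self, tol=0.1):
        """Classify by b; tol is an arbitrary band around b = 1."""
        if self.b < 1.0 - tol:
            return "infant mortality"
        if self.b > 1.0 + tol:
            return "wear-out"
        return "chance failure"

# Hypothetical entries; real values must come from your own failure data
database = [
    WeibullRecord("pump seal", b=0.7, h=9000.0),
    WeibullRecord("control relay", b=1.0, h=40000.0),
    WeibullRecord("drive belt", b=2.8, h=15000.0),
]
for rec in database:
    print(f"{rec.component}: {rec.failure_mode()}, h = {rec.h:.0f} {rec.units}")
```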

When:    Gather your failure data and create your own database.  No one is going to give you their database because they put much sweat and tears into cleaning up the data so it is useful.  The data needs to be locally generated because it tells you: 1) the life from the grade of equipment you purchase, 2) the grade of operation of the equipment (do you operate it like 16-year-old teenagers or wise old men/women of 65?), 3) the grade of maintenance you use to renew its life, and 4) management’s expectations for how to treat the system.

Where:  The data starts out as a silly exercise by maintenance to accumulate data, with much ridicule from the unknowledgeable about why you are spending this effort to build a Weibull database.  Then suddenly, when adversity arises, it becomes everyone’s prized possession.  Remember the words of Rudyard Kipling about the English soldier:  In peacetime it’s Tommy this and Tommy that, and Tommy get out of the way... but let the bullets fly in wartime and it’s Mr. This and Mr. That and Mr. If You Please!  Everyone wants the baby but no one wants the dirty diapers that go with every baby!  If you don’t have a Weibull database, you’re already too late because your competitor has one started and is using it to your disadvantage, and he’s not going to tell you why you’re left in the dirt!




© Copyright 2000-2008 NetexpressUSA Inc. All rights reserved