Reliability tools exist by the dozens:
what are the tools,
why use the tools,
when should I use the
tools, and where
should I use the tools? See the entries below for
answers.
The details about these tools will be brief
as books are written about each item. Think of the
presentations below as hors d’oeuvres (a little snack food
or starters)—not the main course.
The most important reliability tool is a
Pareto distribution based on money—specifically based on
the
cost of unreliability which directs attention to
work on the most important money problem first. No magic
bullet exists for reliability issues—don’t waste your time
looking for a single magic tool—none exist!
Accelerated Testing-
What:
A test
method of increasing loads to quickly produce age-to-failure
data with only a few data points which are then scaled to
reflect normal loads.
Why:
The
benefit of accelerated testing is to save time and money
while quantifying the relationships between stress and
performance along with identifying design and manufacturing
deficiencies to get useful data quickly and at low cost.
When:
Usually
performed during the development of devices, components, or
systems. Also applies to items that have been in service to
obtain a metric needed to show how the item is performing
under heavy loads. Accelerated testing is a useful method
for solving old, nagging problems within a production
process.
Where:
Used for correlating test results with real
life conditions.
Availability-
What:
A tool for
measuring the % of time an item or system is in a state of
readiness where it is operable and can be committed to use
when called upon. Availability ceases because of a downing
event which causes the item/system to become unavailable to
initiate a mission when called upon. In the simplest view
the metric is availability = uptime/(uptime + downtime).
For many other definitions see
MIL-HDBK-338, section 5.
Why:
The
measure is important for knowing the commitment of time for
performing the mission and it usually only involves the use
of arithmetic.
When:
Often the
measurement tool is based on past experiences and the
complement of the measurement tool addresses unavailability
to perform the task.
Where:
In design of a system it is a calculated
value and in operation of a system it is a performance index
that is often easy to use and provides an index that is
understandable to the average person. Today there is a
great tendency to “Enronize” availability metrics by using
uptime metrics that present data in the best light (an
issue of data integrity) to maximize managerial bonuses by
excusing (deducting) downtime from the calculations to put
lipstick on the pig. Use the
KISS principle. Think of availability in terms of the
investor’s typical year of 8760 hours. The no-excuse annual
metric in hours is availability = uptime/8760. Suddenly
you’ll find a metric of great interest to investors that can
be benchmarked as a financial issue, and thus motivate the
management team to solve real issues of importance to the
business. Please note, you can
have high availability but many failures and thus low
reliability as
availability ≠ reliability. Likewise, you can have high
availability but little output so team the metric with
effectiveness to get the complete story.
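As a minimal sketch of the two availability formulas above, the Python below compares the simple uptime/(uptime + downtime) view with the no-excuse annual metric uptime/8760. The function names and the plant hours are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 8760  # the investor's typical year used in the text


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Simplest view: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_hours / total


def annual_availability(uptime_hours: float) -> float:
    """No-excuse annual metric: uptime / 8760 hours."""
    if not 0 <= uptime_hours <= HOURS_PER_YEAR:
        raise ValueError("uptime must lie between 0 and 8760 hours")
    return uptime_hours / HOURS_PER_YEAR


# Hypothetical plant: 8000 h up, 400 h of counted downtime,
# and 360 h of downtime "excused" from the reported metric.
reported = availability(8000, 400)
no_excuse = annual_availability(8000)
print(f"reported availability:  {reported:.3f}")
print(f"no-excuse availability: {no_excuse:.3f}")
```

Excusing downtime shrinks the denominator, so the reported figure always looks at least as good as the no-excuse figure.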
Bathtub Curves-
What:
The
concept is derived from the human life experience involving
infant mortality, chance failures, plus a wear out period of
life since data for births and deaths is accumulated by
government agencies. Most equipment lacks the birth/death
recording by government agencies and most non-human systems
can be regenerated to live/die many times before relegation
to the scrap heap.
Why:
Failure
rates are different for both people and equipment at
different phases of operation, and the medicine to be applied
to both humans and equipment needs to be considered for
effectively treating the roots of the problem.
When:
The
concept is useful during design, operation, and maintenance
of equipment and systems to understand the failure
mechanisms.
Where:
It uses human experience to help the
ordinary person relate equipment/system failures to those
experienced in real life so as to coordinate the design,
operation and maintenance of equipment. For other
definitions see
MIL-HDBK-338, section 9.
Block Diagram Model (same as Reliability Block Diagram Models)-
What:
Reliability block diagram (RBD) models are graphical
representations of a calculation methodology for reliability
systems.
Why:
The RBD
models allow calculation of system reliability based on
knowing/assuming failure details of the components, starting
with the smallest component and growing the model to the
largest system to predict performance from the elements.
When:
RBDs are
used in upfront designs as a performance parameter and after
the system is constructed to ferret out poor performing
blocks that limit the system performance.
Where:
Frequently used as a trade-off tool to search
for the lowest long-term cost of ownership and to help sell
alternative courses of action for moderating the effects of
reliability issues or overcoming poor performance with
alternative designs. The results can be calculated
before building the system, and the
calculations provide knowledge about availability,
maintenance interventions required for failures, and the
number of spare parts required to sustain operations. For
other definitions see
MIL-HDBK-338, section 4 and 6.
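The calculation behind a simple block diagram can be sketched as follows. This is a minimal illustration, assuming independent blocks and the standard series and parallel combination rules; the block reliabilities are hypothetical, and real RBD models (for example Monte Carlo models) carry far more detail.

```python
from math import prod


def series(reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one block must work: 1 minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical system: a feed block, two redundant pumps, and a control block.
pump_pair = parallel([0.90, 0.90])
system = series([0.95, pump_pair, 0.98])
print(f"redundant pump pair: {pump_pair:.4f}")
print(f"system reliability:  {system:.4f}")
```

Swapping block values in and out of a model like this is one way to see which block limits system performance before anything is built.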
Capability-
What:
A measure
of how well the product performance meets objectives. In
short how well are the outputs actually accomplished against
a standard? Capability is frequently the product of
efficiency * utilization.
Why:
Capability
is a component of the
effectiveness equation and usually under the control of
production.
When:
Data for
this metric is frequently produced by the Accounting
department each month as a segment of the financial reports
for the purpose of handling variances against the standards.
Where:
Frequently in the effectiveness measure it is
a weak point [as a measure of how well the production
process does the job for which it was purchased] requiring
substantial improvement that cannot be solved by the usual
reliability and maintainability (RAM) tools. However, this
metric may be deficient from the original design [an issue
of design effectiveness] of the system or from the way the
system is operated [an issue of use effectiveness].
Configuration Control-
What:
Configuration control is involved with the management of
change by providing traceability of failures back into the
design standard. If the design details are not specified,
the design will not contain the requirements and thus
implementation of the project will be hit or miss for
achieving the desired end results beginning with the
conceptual design and resulting in the operating facility.
Why:
With
active configuration control you know where items are used
and contained, where and why they were installed, where
signals originate, what items are used where and in what
environments, what drawing revisions have occurred, whether the
product conforms to the drawings and specifications, what
alternate materials/components have been used, and whether test
reports/certifications are available as original documents
for review.
When:
Configuration control begins after the first design review
to build an unbroken chain of traceability to aid in
avoiding surprises in the field which would destroy the
designed-in criteria for availability, reliability,
maintainability, and cost effectiveness established as a
portion of the original design criteria.
Where:
Frequently these documentation details are
assembled into a dossier with third party witnessing for use
in validating conformance to the design requirements and
provided to the owner of the equipment as witness documents.
Contracting For Reliability-
What:
Say what
you want and want what you say to your vendors. Provide
explanations of the objectives in contracts in terms the
vendors will understand.
Why:
If you
can’t clearly spell out the requirements for
availability, reliability, and maintainability the
contractors cannot make these issues features of the
design. Thus it is important to be specific in the features
the design must manifest. Explanations such as: “You know
what I want and what I need, just do it quickly” are
self-defeating expressions of vague generalities that lead to
inferior designs and constant arguments. Be specific about
requirements for building
reliability block diagrams, using
quality function deployment, performing
failure mode and effects analysis, conducting
fault tree analysis and finally conducting
design reviews for reliability.
When:
Write the
specifications before procurement begins. Plan to spend
time with your own Purchasing Department to explain the
details and sell the team on the financial advantages of
including reliability requirements in the specifications;
and likewise, spend time selling your vendors on the
requirements and why they are stated.
Where:
These are up front decisions to avoid
replication of previous problems that are built into
previous designs and never corrected.
Cost Of Unreliability-
What:
The cost
of unreliability is a big picture view of system
failure costs, described in annual terms, for a
manufacturing plant as if the key elements were reduced to a
series block diagram for simplicity. It looks at the
production system and reduces the complexity to a simple
series system where failure of a single
item/equipment/system/processing-complex causes the loss of
productive output along with the total cost incurred for the
failure. If the system IS sold out, then the cost of unreliability
must include all appropriate business costs such as lost
gross margin plus repair costs, scrap incurred, etc. If the
system is NOT sold out, and make-up time is available in the
financial year, then lost gross margin for the failure
cannot be counted. The cost of unreliability is a
management concern connected to management’s two favorite
metrics: time and money.
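The sold-out rule above can be sketched in a few lines of Python. This is only an illustration: the `FailureCost` record, its field names, and the plant figures are hypothetical, and the sketch assumes that make-up time is available whenever the system is not sold out.

```python
from dataclasses import dataclass


@dataclass
class FailureCost:
    item: str
    downtime_hours: float
    gross_margin_per_hour: float
    repair_cost: float
    scrap_cost: float = 0.0


def cost_of_unreliability(failure: FailureCost, sold_out: bool) -> float:
    """Lost gross margin counts only when the system is sold out."""
    lost_margin = (
        failure.downtime_hours * failure.gross_margin_per_hour if sold_out else 0.0
    )
    return lost_margin + failure.repair_cost + failure.scrap_cost


def pareto(failures, sold_out):
    """Rank items by annual cost, highest first."""
    costs = [(f.item, cost_of_unreliability(f, sold_out)) for f in failures]
    return sorted(costs, key=lambda pair: pair[1], reverse=True)


# Hypothetical annual failure summary for a series system.
plant = [
    FailureCost("pump", 10, 5000, 8000, 2000),
    FailureCost("compressor", 40, 5000, 60000),
    FailureCost("extruder", 25, 5000, 20000, 15000),
]
ranked_sold_out = pareto(plant, sold_out=True)
ranked_not_sold_out = pareto(plant, sold_out=False)
for item, cost in ranked_sold_out:
    print(f"{item:12s} {cost:>10,.0f}")
```

The ranked list is the money-based Pareto distribution: the item at the top is the problem to work on first.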
Why:
In private
enterprise, failures must be considered from a financial
viewpoint and not a gear-head approach of simply counting the
number of failures; and you must speak the language of the
enterprise which describes events by monetary measures over
a period of time. The annual cost of failures is usually
not stated in a clear-cut manner, nor are failure costs
summarized by system/sub-system to identify the weak links
in a monetary fashion. Summarizing costs this way allows
appropriate action to
reduce the annual cost of unreliability by building a
clear
Pareto distribution to attack the vital (high cost)
areas with an action plan to reduce failures (unreliability)
and to reduce the cost of unreliability.
When:
For a
new plant, this can be a design criterion to limit costs of
unreliability for competitive reasons in the
marketplace, i.e., by plan, the hidden costs of failures are
made obvious as a portion of the strategic plan. For an
existing plant, this can be an exercise in defining the cost
of unreliability and building a long term plan to reduce the
cost of failures as a portion of the tactical plan.
Where:
This activity is best performed with high
level involvement of the management team to provide
fundamental understanding of the size of the icebergs about
to rip out the underbelly of the plant and to involve the
organization in a plan to reduce the costs so that profits
are pushed upward because of the improvements. If the cost
of unreliability cannot be reduced, then the costs
become extra weight for the saddle bags in the race for
survival.
Critical Items List-
What:
The
critical items list is a top level summary of problems/cost
used for discussions with management about key reliability
issues. The summary list converts technical details to a
summary of costs and time while placing the issues into a
Pareto distribution explained in terms of money and the
vital few problems to be solved for competitive reasons.
Why:
The
purpose of the critical items list is to focus management’s
attention on items that need to be resolved during the
design phase as a corrective action loop for influencing the
life time costs.
When:
The list
starts with the first design review as issues are disclosed
in design reviews for reliability.
Where:
The critical items list is presented to top
level management as issues to be accepted or resolved before
paper plans become steel and concrete.
Data-
What:
Data is the
informational energy which runs the reliability improvement
machine. Data is acquired at great cost. Data needs to be
retained and used to prevent future failure events. Proper
use of data provides an understanding of failure mechanisms
and prevents reoccurrence of bad events which cause safety
or high cost failures to occur. Reliability data requires
definition of a failure. Failures can be catastrophic
failures or slow degradation—you decide by defining the
failures. The units of the measure for the data must be in
units of the degradation: sometimes it is hours, sometimes
it is miles, and so forth. In short, use whatever motivates the
failure. Reliability always ceases with a failure or a
removal from service in some aged condition which then
generates a category of data called a suspension or censored
data. Data is information in the form of facts, figures, or
engineering databases which is obtained from engineering
tests, experiments, or actual operating conditions.
Reliability data is often incomplete as the exact times to
failure are rarely known or recorded with much precision so
that only partial information is available for analysis.
Reliability data comes in two forms: 1) age-to-failure data,
and 2) censored/suspended data such as occurs when unfailed
items are removed from service or when they fail due to a
different failure mode than we are studying—this is useful
information and part of the data set. Some data is better
than no data for resolving reliability issues.
Why:
Data is
the information that, when used in an informed manner, helps
prevent repetition of bad history and allows an enlightened
approach to rationally solving a reliability issue using
facts and figures. Intelligent use of data for reliability
issues provides the objective evidence needed for helping to
solve the root cause of failures.
When:
Databases
of reliability information from past experience are very
helpful for predicting future failure events. The data is
helpful if it is described as failure rates, or as the reciprocal
of failure rates (mean times to failure), which reduces
the information to an average failure rate or average time
to failure. The reliability data is particularly valuable
if retained for components as a Weibull database with shape
factor beta and scale factor eta.
Where:
The data is useful for understanding failure
modes, and for predicting future failures for a population
of equipment during the design stage and for predicting
future failures with subsequent increases in the aging of
equipment. The role of the reliability engineer is to
acquire the failure data and convert the data into useful
information for both current and future use.
Decision Trees-
What:
Most
business decisions have considerable uncertainty which
implies at least two outcomes if you choose a course of
action. Making decisions in the face of uncertainty
requires the cost of taking action and the probability of its
outcome, along with the cost of not taking action and the
probability of that occurrence. In most cases the
probabilities are not well known (maybe to one significant
digit) and the costs are not well known (maybe to $10000).
The quantitative assessment is called risk assessment. The
issue is to take these poorly identified issues and devise
a strategy which can minimize exposure to risk for the
business. The graphical representation of the methodology is
called a decision tree, which is used to reach the expected values for the
decision to take/not-take action.
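The expected-value arithmetic behind a two-branch decision tree can be sketched as below. This is a minimal illustration, assuming each branch has an up-front cost plus a set of outcomes whose probabilities sum to one; the function name and all figures are hypothetical.

```python
def expected_cost(action_cost, outcomes):
    """Expected cost of one branch.

    outcomes is a list of (probability, cost) pairs that must sum to 1.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return action_cost + sum(p * cost for p, cost in outcomes)


# Hypothetical decision: spend $20,000 on an upgrade that cuts the chance
# of a $200,000 failure from roughly 0.3 to roughly 0.1.
take_action = expected_cost(20_000, [(0.1, 200_000), (0.9, 0)])
no_action = expected_cost(0, [(0.3, 200_000), (0.7, 0)])

decision = "take action" if take_action < no_action else "do not take action"
print(f"take action: ${take_action:,.0f}")
print(f"no action:   ${no_action:,.0f}")
print(f"decision:    {decision}")
```

Because the probabilities are only good to about one significant digit, it is worth rerunning the comparison with the estimates nudged up and down to see whether the decision flips.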
Why:
Most
business decisions have no exact answers, i.e., no black and
white answers but rather shades of grey. The use of the
tool is to help decide which course of action may be to the
advantage of the business given the best estimates that can
be made.
When:
Decisive
details will only be known in the future and decisions have
to be made today, so decision trees are tools to help
span wisely from today into the future with the wisest
decisions that can be made from sketchy data.
Where:
If you have absolute data, use it. But most
decisions must be made with indecisive information which
requires decisions about the odds for a given event, usually
based on estimates—the wiser the estimate the better the
decision, taking into account the probabilities of the
outcomes and the money involved in the decision. Use this
tool when few details are available and you must be the
pioneer to cut through the forest to reach the promised land
of opportunity and profitable ventures.
Dependability-
What:
The
International Electrotechnical Commission (IEC)
defines dependability as “Dependability describes the
availability performance and its influencing factors:
reliability performance,
maintainability performance and
maintenance support performance.”
MIL-HDBK-338 defines dependability differently as a
measure of the degree to which an item is operable and
capable of performing its required function at any (random)
time during a specified mission profile, given that the item
is available at mission start. (Item state during a mission
includes the combined effects of the mission-related system
R&M parameters but excludes non-mission time; see
availability.) Dependability is related to reliability
with the intention that dependability would be a more
general concept than the measurable issues of reliability,
maintainability, and maintenance.
Why:
The key
dependability issue is to make equipment and processes work as
advertised, that is, without failure. Dependability aims
at facilitating co-operation by all parties concerned
(supplier, organization, and customer) by fostering an
understanding of the dependability needs and value to
achieve the overall dependability objectives, so it involves
harmonizing conflicting issues. Dependability is better
viewed from the standpoint of the end user of the equipment or system than
from the designer’s viewpoint or the maintainer’s
viewpoint. From a system effectiveness viewpoint,
reliability and maintainability provide system availability
and dependability.
When:
You cannot
repair yourself to happiness with a failure prone system as
the failure prone system will be viewed as lacking
the dependability to function as required when you need it.
Thus dependability is viewed over the longer term and not in
convenient snap-shots and dependability also involves life
cycle cost issues.
Where:
Reliability contributes directly to uptime by
avoiding failures whereas maintainability contributes
directly to reducing downtime by faster repairs. Thus
reliability and maintainability jointly provide impact on
dependability of the system. Dependable systems must be
ready to function, in an operable state, to produce the
desired output, upon demand by the end user, at the
specified quantity and quality of output.
Design Reviews For Reliability-
What:
Specific
questions to ask the design engineers during a review
specifically for reliability, using failure data from
operations and maintenance, are: 1) show the calculated
availability for the system based on a
RAM model, 2) show the calculated number of failures
during the specified mission time between turnarounds based
on a reliability and maintainability (RAM) model, 3) show
details of
FMEA studies, 4) show details of
FTA calculations, 5) show the calculated mean times
between downing events, 6) show the calculated mean time
between cutbacks from full production capability and losses
thus incurred, 7) show the
QFD matrix and details, and 8) show the calculated
cost of unreliability.
Why:
Design
reviews should demonstrate by calculation or through the use
of models and reliability tools that the system is capable
of achieving the design objectives rather than making a giant
leap of faith that all will be well and good.
When:
Design
reviews for reliability should be a part of the design
process starting with conceptual designs and ending when the
drawings are revised for the as-built system.
Where:
This is a logical extension of the design process to “show me”
rather than “tell me” how the system will function, and it is
performed as a portion of the up-front design-by-the-numbers
process.
Effectiveness-
What:
The
potential or actual probability of a system to perform a
mission for a given level of performance under specified
operating conditions defined as the product of
reliability*availability*maintainability*capability.
Many variants of the effectiveness equation exist, e.g., OEE,
and others.
Why:
The
effectiveness equation defines the ability of a product,
operating under specified conditions, to meet operational
demands when called upon. This is a practical measure of
how well the system is performing—not how well we want it to
perform but a practical measure of how it’s doing. Since
all the elements are measured between 0 and 1, the elements
of the equation quickly draw the eye to where opportunities
exist for making improvements.
When:
The
effectiveness equation is useful for building trade-off boxes for
various alternatives, plotted on an X-Y scale of
effectiveness vs net present value (NPV), when selecting among
improvement alternatives. For the elements:
reliability defines the probability of a failure free
interval (or the complement unreliability which describes
the probability of failure),
availability defines the probability of the system
being up and alive to handle the demand (or the complement,
unavailability which describes the probability of the system
being down),
maintainability defines the probability of making
repairs within the allowed repair standard,
capability defines the probability of production
achieving the desired production results [a measure of how
well the product performs compared to the standard] and
frequently it is described as the product of efficiency *
utilization where
efficiency is an output/input relationship such
as (output achieved)/(the standard required) and
utilization is how time is used such as (direct
labor)/(direct labor + labor lost)
[in the old days, if this index
decreased to as low as 80% we went berserk—today,
you can’t get this high because of
wasted time when noses are not to the grindstone!!!].
Where:
It is used to describe new systems and old
systems performance. Consider this example for
effectiveness: If we are comparing a heavy duty truck
versus a sports car for transportation, the truck may be
more effective for heavy loads whereas the sports car may be
more effective for acceleration and high speeds. Neither is
defined by the effectiveness equation until the mission is
defined.
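A minimal sketch of the effectiveness equation follows, with capability taken as efficiency * utilization as described above. The function names and every numeric value are hypothetical; the point is only to show how the product is formed and how the smallest element draws the eye.

```python
def capability(efficiency, utilization):
    """Capability as the product of efficiency and utilization."""
    return efficiency * utilization


def effectiveness(elements):
    """Multiply the elements, each expected to lie between 0 and 1."""
    result = 1.0
    for name, value in elements.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        result *= value
    return result


# Hypothetical values for one system and one defined mission.
elements = {
    "reliability": 0.95,
    "availability": 0.92,
    "maintainability": 0.90,
    "capability": capability(efficiency=0.90, utilization=0.85),
}
overall = effectiveness(elements)
weakest = min(elements, key=elements.get)
print(f"effectiveness:   {overall:.3f}")
print(f"weakest element: {weakest} = {elements[weakest]:.3f}")
```

In this made-up case the weakest element is capability, which matches the remark elsewhere on this page that capability is frequently the weak point.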
Environmental Stress Screening (ESS)-
What:
A series
of screens conducted under environmental stresses to
disclose weak parts and workmanship defects which require
correction. This requires an understanding of both burn-in
testing and ESS, as both techniques identify weak
points and eliminate them by motivating early failures.
Burn-in is usually a long process of operating under load(s)
and at fixed temperature (in short, this is a special case
of ESS) or it can be operated at varying loads and
accelerated temperatures to achieve a shorter burn-in
period, whereas ESS is a scientifically planned and
conducted test which is usually conducted under accelerated
loads to produce the same test/use results in a shorter
period of time by increasing the stress on the components or
assemblies. The objective of these screens is to produce a
failure free product when released into operations. ESS is
not intended as a test to validate compliance to a design;
rather, it is intended to force latent defects into becoming
observable failures before the end user finds them in day-to-day usage.
Why:
The
extremes of operating conditions such as high power levels,
high temperatures, high vibration levels, etc. produce
failures not anticipated from testing at nominal
conditions. Generally ESS is interpreted to be directly
applicable to electrical/electronic
equipment, however the same issues/concepts apply to
mechanical equipment when the stressing conditions are
loads/pressures/temperatures/vibrations/thermal shocks/etc.,
so, as for all reliability issues, think broadly!
When:
When
acquiring data, the tests are done upfront of production.
When controlling early failures that would be discovered by
the end user, these tests are done as a portion of the
production process to eliminate weak units to control
warranty costs and improve customer satisfaction.
Where:
Some tests are conducted in the laboratory
for quick results and then the data is used to control
product testing/release for the purpose of limiting costs
and preventing the loss of customers from unsatisfactory
performance in the field.
Events/Incidents-
What:
For reliability purposes, events/incidents are single occurrences,
especially particularly significant ones,
that result in a failure from a non-aging mechanism.
Usually the event/incident results in
the serious consequence of losing the functional life of a
component or system. The death of the device must be
recorded as censored (suspended) data.
Why:
For
reliability purposes, the component, device,
subassembly, or system has been a success up to the point in
life where a failure from a non-aging event took place. This
means the event-age was a success (up to the point it was
killed by an event/incident) and inclusion of the data is
required as censored/suspended data. This is important data.
When:
Include
the suspended/censored data into every analysis. Young
suspensions/censored data have little impact on the results
of an analysis but old suspensions have major effect on the
analysis.
Where:
The data is used for MTBF/MTTF analysis and
particularly for Weibull analysis.
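To illustrate why suspensions belong in the analysis, here is a minimal sketch of an MTTF estimate that includes them. It assumes a constant failure rate, in which case the usual estimate is the total accumulated age of all units (failed and suspended) divided by the number of failures. The function name and the ages are hypothetical; a Weibull analysis would treat suspensions differently.

```python
def mttf_with_suspensions(records):
    """Estimate MTTF from a list of (age, failed) pairs.

    failed=False marks a suspension (censored item). Assumes a constant
    failure rate: total accumulated age divided by the number of failures.
    """
    total_age = sum(age for age, _ in records)
    failures = sum(1 for _, failed in records if failed)
    if failures == 0:
        raise ValueError("at least one failure is needed for an estimate")
    return total_age / failures


# Hypothetical ages in hours; two failures and two suspensions.
failed_only = [(1200, True), (800, True)]
all_records = failed_only + [(1500, False), (2500, False)]

ignoring = mttf_with_suspensions(failed_only)
including = mttf_with_suspensions(all_records)

# Effect of adding one young versus one old suspension to the failures.
with_young = mttf_with_suspensions(failed_only + [(50, False)])
with_old = mttf_with_suspensions(failed_only + [(5000, False)])
print(ignoring, including, with_young, with_old)
```

Dropping the suspensions understates the mean time to failure, and the old suspension moves the estimate far more than the young one.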
Exponential Distribution-
What:
The exponential distribution describes the
probability of survival and of failure of components or
equipment under the condition of chance failure, which
means a constant instantaneous failure rate where the
die-off rate is the same for any surviving (unfailed)
population. An old part is as good as a new part. For any
survivors in this memory-less system that have survived to
time t, a certain percent of the survivors will die in a
specified interval of time such as 2*t. The reliability of
the system is often described by the exponential
distribution because many times a system is made-up of mixed
failure modes which in the aggregate will function like a
constant failure rate system. The reliability of an
exponential distribution is described mathematically as
R(t) = e^(-λt) = e^(-t/Θ) where t
is the mission time, λ is the failure rate, and Θ is the
mean time, given that λ = 1/Θ. The exponential distribution
is frequently used as a first approximation to describe
reliability based on a simple failure rate or a simple mean
time to failure—particularly if the system or component has
multiple failure modes.
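As a small worked example of the exponential reliability equation, the sketch below computes reliability for a mission time given a failure rate equal to the reciprocal of the mean time. The 10,000 hour mean time and the one-year mission are hypothetical values chosen for illustration.

```python
import math


def reliability(mission_time, failure_rate):
    """Exponential reliability: exp(-failure_rate * mission_time)."""
    if mission_time < 0 or failure_rate < 0:
        raise ValueError("mission time and failure rate must be non-negative")
    return math.exp(-failure_rate * mission_time)


mean_time = 10_000.0      # hypothetical mean time to failure, hours
rate = 1.0 / mean_time    # failure rate is the reciprocal of the mean time

one_year = reliability(8760, rate)
at_mean = reliability(mean_time, rate)

# Memory-less property: a unit that has already survived 5000 h has the
# same chance of surviving the next 8760 h as a new unit.
survived = 5000
conditional = reliability(survived + 8760, rate) / reliability(survived, rate)

print(f"R(8760 h)             = {one_year:.4f}")
print(f"R(mean time)          = {at_mean:.4f}")
print(f"R(8760 h | survived)  = {conditional:.4f}")
```

Note that reliability at a mission time equal to the mean time is only about 37%, so a mean time to failure is not a failure-free period.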
Why:
The
constant hazard rate, λ, is usually a result of combining
many failure rates into a single number.
When:
The
exponential distribution is frequently used for reliability
calculations as a first cut, based on its simplicity, to
generate the first estimate of reliability when more detailed
failure modes are not described.
Where:
In electronic systems (which can have many
different types of failure modes, and where any
electrical/electronic system is an amalgam of many different
components) the simple assumption is that the
electrical/electronic package will have a constant failure
rate defined by the exponential distribution. When
in doubt about the failure mechanisms, it is common to
assume use of the exponential distribution with its
constant failure rate for simplicity.
Failure-
What:
Failure is
the loss of function when you needed the function to occur.
Failures for reliability purposes must be precisely defined
so they are recorded correctly. Much life data is
incomplete because failures are mixed-up with
censored/suspended data where aged items may not have failed
or they represent removals from service before failure, or
they have not yet failed for the mode of failure under
study. In short, these censored/suspended items represent
successes and are a portion of the data set for study.
Why:
We study
failed items for the same reason we do autopsies on
humans—we want the data and we want it categorized correctly
for making important decisions. Failures require: 1) a time
origin which must be unambiguously defined, 2) a scale for
measuring the passage of time/starts/stops/etc. which
motivates failure, and 3) the meaning of failure must be
entirely clear for recording the event.
When:
Failure
data must be recorded as it occurs to prevent loss of
information.
Where:
The CMMS system is frequently where most data
resides but usually in crude fashion. The failure data is
often transferred into the
FRACAS system for converting the symptoms of the failure
into the root causes of failure. The failure data must be
converted into action items for making management decisions
about future failures and the corrective action needed.
Failure Forecast-
What:
Failure
forecasting is a projection of failures into the future
based on assumed or documented failure details. It is also
known as risk analysis of future failures. For a constant
failure rate system this is very straightforward. However
for complicated failure modes where the failure rate
increases with time (wear out failure modes) or where
failure rates decrease with time (infant mortality failure
modes) this becomes a more complicated analysis. This
analysis, known as the Abernethy Risk, is described in
The New Weibull Handbook and implemented in the software
package
WinSMITH Weibull for predicting future failures. Likewise,
reliability block diagrams are useful for predicting
future failures when the authentic failure details are
supplied to the Monte Carlo models.
Please note manufacturers follow two general strategies for
their equipment:
1) build the equipment to avoid failures even though
this increases the original capital costs, or
2) build equipment and sell the original equipment at a
low cost (or even a break-even cost)
expecting to make profits with the sale of
replacement parts.
Thus for end users of the procured equipment, it is
important to know the forecasted failures in the face of
supplier protests that “our equipment never fails”. In that
case ask to see the sales of spare parts for similar
equipment and an estimate of the number of units working to
get a crude estimate of the strategy employed by the
equipment supplier.
A failure is an event which renders equipment non-useful
for the intended or specified purpose during a designated
time interval. The failure can be sudden, partial,
one-shot, intermittent, gradual, complete, or catastrophic.
The failure can arise from degradation, from weakness,
from imperfections, from misuse,
and so forth.
A failure mechanism includes a variety of physical processes
which result in failure from chemical, electrical, thermal,
or other insults.
Why:
Future
failures cost money and frequently increase the risk of
safety or environmental problems. For manufacturers, the
forecasted failures predict impending high costs for
warranty expenses which can make/break a company. With good
failure forecasts, you can anticipate expected failures now
(after x-usage), future failures when failed units are not
replaced, and future failures when failed units are replaced
either with the same failure modes or with differently
designed components with different failure details.
When:
This
analysis is wisely performed during the design of the
equipment; however, many surprises arise from different
failure modes built into the assembled product or incurred
by unanticipated usage in operations.
Where:
Generally this analysis is made during the
up-front design effort, often with much disbelief that the
products could be “this bad”. Follow-up analysis occurs when
unexpected failure modes arise during operation of the
equipment, causing loss of service of the equipment and
high costs for the end users.
Failure Rates-
What:
Failure
rates, in the simplest form, are
Σ(number of failures)/Σ(time in use), which is the
reciprocal of the mean time to/between failures. For more
sophisticated failure databases such as Weibull databases,
the failure rates can be disclosed without giving away
proprietary data such as the shape factors, beta, which tell
the failure mode for the equipment.
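As a minimal sketch of the simple arithmetic, the Python below totals failures and time in use and takes the reciprocal to get a mean time between failures. The pump hours and failure counts are hypothetical values chosen only for illustration, and the function name is an assumption.

```python
def failure_rate(times_in_use, failure_counts):
    """Return failures per unit time: sum(failures) / sum(time in use)."""
    total_time = sum(times_in_use)
    total_failures = sum(failure_counts)
    if total_time <= 0:
        raise ValueError("total time in use must be positive")
    return total_failures / total_time


# Hypothetical fleet: hours in service and failures logged for three pumps.
hours = [8000.0, 6500.0, 5500.0]
failures = [2, 1, 1]

rate = failure_rate(hours, failures)  # failures per hour
mtbf = 1.0 / rate                     # hours per failure (reciprocal of the rate)
print(f"failure rate = {rate:.6f} failures/hour, MTBF = {mtbf:.0f} hours")
```

The same arithmetic applies to any life unit (starts, cycles, distance) as long as the numerator and denominator use consistent units.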
Why:
Simple
failure rates are a precursor of maintenance events and
production interruptions that will occur into the future
which drive up costs and cause chaos.
When:
Failure
rates derive from the history of operation or from well
known data sources such as OREDA, IEEE 500, IEEE 493, EPRI,
and other sources listed in
reading lists for reliability including
Weibull databases.
Where:
The failure rates are used as an awareness
criterion for the average person, just as you use automobile
fuel consumption rates for understanding the health of your
automobile as well as anticipating your
weekly/monthly/annual out-of-pocket expenditures for
gasoline or diesel fuel. The failure rates drive the
maintenance interventions, spare parts, and maintenance cost
for the Maintenance Department. Similarly they predict the
interruptions to the process and lead to misses on promised
deliveries and result in negative variances for production
costs. In short, failure rates are precursors for the misery
expected for the organization.
Fault Tree Analysis-
What:
Fault tree
analysis (FTA) is a top-down process of defining the
top-level problems and then, through a deductive approach
using parallel and series combinations of possible
malfunctions, finding the root of the problem and correcting
it before the failure occurs. The reliability tool can be
used as a qualitative or quantitative method.
Why:
The tool
aids the design process, shows weak links that cause
failures, and in the critical legs of the trees helps to
define maintenance strategies for which pieces of equipment
and processes should be defended with the greatest
maintenance vigor to prevent “Murphy” from shutting down the
process or causing serious safety issues. The technique
provides a graphical aid for the analysis and it allows many
failure modes including common cause failures. Results from
an FTA are usually more pessimistic than other analysis tools
such as
RBDs as you can see from a study of the Space Shuttle
reliability analysis where each system is studied by
multiple reliability tools because of the high cost/profile
of failures.
When:
FTA is
widely used in the design phase of nuclear power plants,
subsea control and distribution systems, and for oversight
studies in layers of protection studies for process safety
and loss control in chemical plants and refineries so as to
prevent accidents and control the costs of risks. The
technique is helpful for identifying critical fault paths,
observing vague failure combinations before they occur in
reality, comparing alternate designs for safety, and setting
a methodology to provide management with a tool to evaluate
the overall hazards in a system and avoid single sources of
critical failures. Finally when thinking top down about
failures and where/how they can occur, the methodology gives
a diagram for setting maintenance strategies for protecting
key pieces of equipment/processes to prevent failures.
Where:
FTA is helpful for defining potential event
sequences and potential incidents, evaluating the incident
consequences of outcomes, and estimating the risks of events
occurring. FTAs work in the design room and on the
operating floor where first hand knowledge has been gained
for preventing failures.
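For the quantitative use of a fault tree, a rough sketch of the gate arithmetic is shown below. It assumes the basic events are independent (so it does not model common cause failures), and the small cooling-system tree and its probabilities are hypothetical.

```python
from math import prod


def and_gate(probs):
    """Output event occurs only if ALL inputs occur (redundant/parallel items)."""
    return prod(probs)


def or_gate(probs):
    """Output event occurs if ANY input occurs (series items)."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical tree: loss of cooling occurs if the power supply fails
# OR both redundant pumps fail. Probabilities are illustrative only.
p_pump_a = 0.05
p_pump_b = 0.05
p_power = 0.01

p_both_pumps = and_gate([p_pump_a, p_pump_b])
p_top = or_gate([p_power, p_both_pumps])
print(f"P(both pumps fail) = {p_both_pumps:.4f}")
print(f"P(top event)       = {p_top:.6f}")
```

In this sketch the single power supply dominates the top event even though each pump is less reliable, which is the kind of critical leg the tree is meant to expose.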
FMEA-
What:
Failure
mode and effect analysis (FMEA) is the study of potential
failures that might occur in any part of a system to
determine the probable effect of each failure on all other
parts of the system and on probable operations success.
When criticality analysis is added for sophisticated studies
the method is known as FMECA. In the automotive world where
FMEA is a required portion of the quality systems, it is
frequently known as PFMEA for potential failure mode
and effect analysis. The basic thrust of the analysis tool
is to prevent failures using a simple and cost effective
analysis that draws on the collective information of the
team to find problems and resolve them before they occur.
Why:
The
analysis is known as a bottom-up (inductive) approach
to finding each potential mode of failure that might occur
for every component of a system and preventing it. It
determines the probable effect of each failure mode, in
turn, on system operation and on probable operational
success, and the results are ranked in order of
seriousness. FMEA can be performed from different
viewpoints such as safety, mission success, availability,
repair costs, failure modes, reliability reputation,
production processes, follow-on service, and so forth.
When:
The FMEA
is most productive when performed during the design process
to eliminate potential failures. It can also be performed
on existing systems where operations personnel and
maintainers are made team members to add real-life
experiences to educate the team in a problem solving forum
that is constructive to eliminating existing problems.
Where:
The analysis can be conducted in the design
room or on the shop floor and it is an excellent tool for
sharing experiences to make the team aware of details that
are known to one person but seldom shared with the team. It
is also an extremely productive tool for educating young
engineers, young maintainers, and young operators about
details they should be aware of that can kill the system.
FRACAS Systems-
What:
Failure
reporting and corrective action system (FRACAS) is an
organized database for aiding in solving reliability
problems using a common sense approach by systematically and
permanently removing failure mechanisms. Good historical
data from this system can populate a
Weibull database.
Why:
Use data
to solve problems by attacking root causes to reduce
failures and make reliability grow. Fixing failures requires
data, not opinions, and requires using the data acquisition
system in a closed loop to record, analyze, correct, and
verify that improvements have been achieved. The first data
reported is usually a symptom of a failure; with a failure
investigation the symptom can be converted into a root
cause, which requires the system to be editable so that
failures are reported correctly.
When:
The
maintenance repair order system usually generates evidence
of a failure. Failures with significant costs (repair costs
+ collateral damage + lost margin from the failure + other
appropriate business costs) must be investigated and
evaluated to reduce failures and to reduce failure costs.
Little is to be gained by spending big money to investigate
trivial failures.
Where:
This is an engineering tool requiring
clerical effort to input the data and build the Pareto
distributions for identifying significant events requiring
corrective action and thus it also becomes a management tool
for controlling costs.
HALT-
What:
Highly
accelerated life test (HALT) is an offspring of older
environmental stress screening (ESS) tests and it is a
testing process for ruggedization of pre-production products
by heavily stressing the product to identify failure modes
quickly and to verify weak links in the system.
Why:
HALT tests
are intended to quickly find failures and accelerate the
improvement program so that when products are delivered to
end users, they will be mature products by elimination of
potential failure modes that would normally generate a
reliability growth program. Usually the HALT programs
reduce time, cost, and delays experienced in new products by
recalls, warranty costs, etc. HALT is similar to
HASS but the stresses are more severe. In the HALT
process, design and process flaws are found, root causes
identified, and corrective actions implemented quickly.
When:
HALT is
used during the development program to get engineers to
acknowledge and correct fatal problems in designs by adding
loads (generally temperature, vibrations, pressures,
physical stresses, etc.) and by rapidly changing the load
conditions over and above normal operating loads.
Where:
HALT is frequently used for electronic
systems but is also applicable to mechanical systems where
thermal shocks are used to validate designs for extreme
conditions of loads. The tests are performed in the
laboratory for engineering evaluation.
HASS-
What:
Highly
accelerated stress screen (HASS) uses the same stresses as
HALT, but at a lower stress level. Compared to HALT
testing, temperature and voltage extremes may be reduced by
10-15%, vibration levels reduced 50%, etc. depending upon
the design although all the stresses may be above rated
product specifications with the motivation to produce test
results quickly for verifying product compliance.
Why:
HASS
testing verifies that product performance is on
target and has not shifted toward inferior performance in
the manufacturing process. Note that higher stresses often
produce accelerated failures out of proportion to the
increased stress applied.
When:
Products
are periodically screened by HASS to verify no shifts have
occurred in the manufacturing process.
Where:
HASS tests are performed as a quality
assurance test in manufacturing facilities to learn what you
don’t know about each product, and HASS is faster than a
simple burn-in test. When only a percentage of the finished
goods is screened by HASS rather than 100%, the test is
called a highly accelerated stress audit (HASA).
Life Cycle Cost-
What:
Life cycle
cost (LCC) is the sum of all costs associated with the
acquisition and ownership of a system over its full life.
The usual figure of merit is net present value (NPV).
Projects are considered most favorable for large positive
NPVs. However, for many individual cases that involve only
costs, decisions are made for the least negative NPV. In all
cases, the default position for accounting is to know the
NPV for making no change, and this is usually the last
alternative for most people associated with change.
Why:
The first
cost for capital equipment (acquisition) is between ½ and
1/20 of the total life time cost! The first cost,
acquisition cost, is usually definable by a firm quotation
and sustaining costs must be estimated and put into the
appropriate time slots for discounting to obtain the NPV for
the project life. Typical values used in industry for LCC
are: discount rate = 12%, tax rate = 38%, and project life
is usually between 10 and 20 years.
When:
Life cycle
cost is usually calculated as an up-front decision making
effort either for projects or for cost reduction efforts. It
does not work well for doing the analysis after the project
is underway.
Where:
LCC is the business of investing money to
make changes occur. The NPV values add the voice of
investments to technical decisions to work for the lowest
long term cost of ownership.
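A simplified sketch of the NPV comparison follows. It uses the 12% discount rate mentioned above, ignores taxes and depreciation for brevity (a real LCC study would include them), and compares two hypothetical alternatives whose acquisition and annual sustaining costs are invented for illustration.

```python
def npv(cash_flows, discount_rate):
    """Net present value; cash_flows[0] occurs now (year 0), then one per year."""
    return sum(cf / (1.0 + discount_rate) ** year
               for year, cf in enumerate(cash_flows))


def life_cycle_npv(acquisition_cost, annual_sustaining_cost, years,
                   discount_rate=0.12):
    """NPV of a cost-only case: costs are entered as negative cash flows."""
    flows = [-acquisition_cost] + [-annual_sustaining_cost] * years
    return npv(flows, discount_rate)


# Hypothetical alternatives over a 10-year project life (taxes ignored).
option_a = life_cycle_npv(acquisition_cost=100_000,
                          annual_sustaining_cost=30_000, years=10)
option_b = life_cycle_npv(acquisition_cost=150_000,
                          annual_sustaining_cost=18_000, years=10)

best = "B" if option_b > option_a else "A"
print(f"NPV A = {option_a:,.0f}  NPV B = {option_b:,.0f}  least negative: {best}")
```

Both NPVs are negative because only costs are modeled; under these assumed numbers the alternative with the higher first cost has the less negative NPV.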
Life Units-
What:
A measure
of use duration applicable to an item. For example, the
life units may be starts-stops, run hours, hot-cold cycles,
distances traveled, emergency starts or starts, shelf life,
and other measurements which motivate failures.
Why:
Life is
consumed by usage of life units. Some life units occur as a
sum of the different cases. For example, on a gas turbine
aircraft engine, take-offs consume more life than landings
or enroute conditions, which requires a synthetic value for
how life is consumed on a mission. For a land-based,
heavy-duty gas turbine used in the generation of electrical
power, the number of starts is not equivalent to hours of
operation as other wear mechanisms are involved; however,
1 trip cycle = 8 normal shutdown cycles and thus decreases
the time between required maintenance actions.
When:
Development of a life consuming profile may be more
important than the literal measurement of an elapsed time to
adequately measure consumption of life that in the end will
result in a failure.
Where:
Life units have different measures and must
be considered to obtain the proper “common denominator” for
calculations.
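As a small illustration of putting different events on a common denominator, the sketch below converts trips into equivalent normal shutdown cycles using the 1 trip = 8 normal shutdowns weighting quoted above. The event counts and the inspection interval are hypothetical.

```python
TRIP_WEIGHT = 8  # from the text: 1 trip cycle = 8 normal shutdown cycles


def equivalent_shutdown_cycles(normal_shutdowns, trips, trip_weight=TRIP_WEIGHT):
    """Express mixed events in one life unit (equivalent normal shutdowns)."""
    return normal_shutdowns + trip_weight * trips


# Hypothetical operating history and maintenance interval.
INSPECTION_INTERVAL = 400  # equivalent cycles between inspections (assumed)
consumed = equivalent_shutdown_cycles(normal_shutdowns=120, trips=5)
remaining = INSPECTION_INTERVAL - consumed
print(f"consumed = {consumed} equivalent cycles, remaining = {remaining}")
```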
Load-Strength Interactions-
What:
For
reliability successes, loads must always be less than
strengths. When loads are greater than strengths, failures
occur. The issue is determining the probability of
load-strength interference which is a joint probability of
when loads exceed strengths. The loads should include
expected conditions plus the foolishness of people to
violate rules and overload equipment, plus the vagaries of
Mother Nature to impose unexpected static and dynamic loads
from hurricanes, tornadoes, earthquakes, wildfires, and so
forth.
Why:
Neither
loads nor strengths are unmovable point estimates, although
most designers use point values. Failures occur and
reliability terminates when loads exceed strengths.
When:
Loads
usually increase over time (e.g., airplanes like people gain
weight over time from accumulation of dirt and extra
equipment), and strengths usually decrease over time (small
fatigue cracks appear with many cycles and load bearing
strengths decline).
Where:
Bridges have finite lives because of
load-strength interactions, wings break off of airplanes
from fatigue, etc. A few failures are dramatic but most
failures sneak up from the unknown in a variety of ways to
cause loss of reliability. To prevent loss of the system
requires many physical inspections to learn what you don’t
know!
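One common textbook way to estimate the interference probability is to treat load and strength as independent normal distributions, so that the margin (strength minus load) is also normal. The sketch below makes that assumption; real loads and strengths are often not normal, and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interference_probability(load_mean, load_sd, strength_mean, strength_sd):
    """P(load > strength) assuming independent, normally distributed values."""
    margin_mean = strength_mean - load_mean
    margin_sd = sqrt(strength_sd ** 2 + load_sd ** 2)
    # Failure occurs when the margin (strength - load) falls below zero.
    return NormalDist(margin_mean, margin_sd).cdf(0.0)


# Hypothetical values in consistent stress units.
p_fail = interference_probability(load_mean=300.0, load_sd=30.0,
                                  strength_mean=420.0, strength_sd=40.0)
reliability = 1.0 - p_fail
print(f"P(load exceeds strength) = {p_fail:.4f}, reliability = {reliability:.4f}")
```

Rerunning the sketch with a larger load mean or a smaller strength mean shows how the overlap grows as loads creep up and strengths decline over time.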
Lognormal-
What:
Lognormal
distributions are continuous life functions that have long
tails to the right (display positive skewness) in time or
usage. A lognormal distribution plotted on semi-log paper
would appear as a normal curve.
Why:
The
lognormal distribution is a common competitor to the
Weibull distribution for life data. It is also adequate
for 85-95% of all repair times.
When:
Lognormal
distributions are motivated by multiplicative (or
proportional) events that grow with time like crack growth,
molecular diffusion, and some wear out problems.
Where:
In the days when plots had to be made by
hand, it was the first widely used transform to convert
plotted data into straight lines. Today it is simply one of
an arsenal of probability tools used to obtain good curve
fits to data with multiplicative type events.
Maintainability-
What:
The
measure of the ability of an item to be retained in or
restored to specified condition when maintenance is
performed by personnel having specified skill levels, using
prescribed procedures and resources.
Why:
Maintainability measures the percent of maintenance jobs
completed to a standard time for the repair with repair
times for the task usually plotted on a lognormal
probability plot.
When:
First you
set a standard repair time for the task, second you set a
skills level, third you measure how you’re doing against the
standard.
Where:
Applies to major tasks where many repetitions
are expected and where considerable time is required.
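The measurement against a standard repair time can be sketched as follows, assuming repair times follow a lognormal distribution fitted by taking the mean and standard deviation of the logged times. The repair hours and the 6-hour standard are hypothetical.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev

# Hypothetical repair times (hours) for one repetitive task.
repair_hours = [2.1, 3.4, 2.8, 5.9, 4.2, 3.1, 7.5, 2.6, 3.8, 4.9]

# Lognormal fit: the logarithms of the repair times are treated as normal.
logs = [log(t) for t in repair_hours]
fit = NormalDist(mean(logs), stdev(logs))
median_hours = exp(fit.mean)


def fraction_within(standard_time):
    """Estimated fraction of jobs completed within the standard repair time."""
    return fit.cdf(log(standard_time))


standard = 6.0  # assumed standard repair time, hours
print(f"median repair time = {median_hours:.2f} h")
print(f"jobs within {standard} h: {100 * fraction_within(standard):.1f}%")
```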
Maintenance-
What:
All
actions necessary, both technical and administrative, for
retaining an item in or restoring it to a specified
condition so it can perform a required function. The
actions include servicing, repair, modification, overhaul,
inspection, reclamation, and restored condition
determination.
Why:
Equipment deteriorates because of entropy changes, because
of errors both overt and covert, and because of the use of
incorrect procedures.
When:
Maintenance is generally routine and recurring.
Where:
The effort includes fault location,
diagnosis, repair, test, adjustment, replacement,
administration, and overhauls wherever equipment is located.
Maintenance Engineering-
What:
A tactical
job for rapidly repairing equipment to operable conditions
by studying operating and repair manuals. Acquires failure
data and prepares maintenance plans for restoring equipment
to operable condition in a minimum amount of time. Prepares
general diagrams, charts, drawings, and spare parts
requirements for maintenance planners. Makes
recommendations for improving the repair cycle. Provides
manning level forecasts for supervisors and estimates the
duration of outages. Determines the cost advantages of
alternatives for developing action plans to comply with
internal/external customer demands for timely repairs of
processes/equipment. The purpose of these activities is to
restore equipment to service in a timely manner.
Why:
Facilitates speedy repairs by providing maintenance
technology above the craftsman level and up to but not
including reliability engineering principles.
When:
Provides
expertise for more complicated maintenance tasks or when
organization and oversight is required and time is of the
essence for fast repairs.
Where:
Provides on-site expertise to aid craftsmen
to solve non-standard repairs without hands-on tool
contact. Maintenance engineers serve as liaison with
reliability engineers.
Mean Time-
What:
A density
figure-of-merit metric often referred to as the average or
expected value. In the simplest form it appears as
arithmetic Σ(time)/Σ(events) or in complicated situations as
a statistical metric. It applies to mean life (ML),
mean down time (MDT), mean maintenance time (MMT),
mean time between failures (MTBF for repairable
items), mean time to failures (MTTF for replacement
items), mean time between maintenance (MTBM), mean
time between maintenance scheduled (MTBMs), mean
maintenance time unscheduled (MMTu), mean maintenance
time scheduled (MMTs), mean time between overhauls (MTBO),
mean time between unscheduled removals (MTBRu), mean
time to restore (MTR), mean time between downing
events (MTBDE), and so forth. The units will be
time/metric, e.g., hours/failure. The reciprocal of the
metric provides an incident rate, e.g., failures/hour.
Why:
The metric
provides an awareness factor for deciding central tendency
numbers and for the expected number of events which will
occur into the future based on historical situations. The
arithmetic simplicity of mean time is a reason to establish
the metric, and listen to the information derived from it to
gain insight. The arithmetic provides immediate answers to
categorize facts for starting continuous improvement rather
than postponing a metric while searching for delayed
perfection!
When:
The
metrics are used as criteria of performance. Variations
from the central tendency numbers are expected; however, for
the long term the variations are expected to be controlled
to prevent distortion of the measurement.
Where:
The metrics are used from the shop floor to
the management levels as criteria for “How are we doing?”.
Mechanical Components Interaction-
What:
Mechanical
components suffer from interactions and degradations of
overloads, strength deterioration, wear, corrosion, process
variations during the fabrication process, effects of
special processes where the procedures must be controlled as
discovery of the end results would result in destruction of
the component, and removal of safety factors by increasing
loads.
Why:
The naïve
expectation is that individually the impact of a single
insult will not destroy reliability of the component.
However, you frequently have multiple insults occurring
which result in failures that are not predicted up front
but which can be perfectly explained after the components
have failed.
When:
The
multiple destructive events are more predominant in complex
devices and highly stressed devices which too often have
small safety factors which cannot cope with the overload
conditions and thus failures occur.
Where:
The foolishness of humans adds further
insults to the interactions of many different failure
mechanisms, which demands high maintenance interventions and
frequent inspections. Of course the solution to many of
these cases where failures occur is to increase safety
factors by adding extra material (when possible) but this
adds extra weight and extra costs.
Monte Carlo Simulation-
What:
Monte
Carlo simulation (modeling) is a method to solve engineering
problems by sampling methods. The method applies to such
things as system reliability and availability modeling by
simulating random processes such as life to failure and
repair times.
Why:
The
technique is used when: 1) many variables are present and
their interrelationships are unclear, 2) the system can’t be
analyzed by direct and formal methods, 3) building
analytical models would be time consuming, complex, and just
too hard, 4) you cannot do direct experiments, 5) the
input details such as equipment life and repair times are
not discrete and they vary over time according to a
distribution, and 6) you need to do some tweaking of the
system to understand where opportunities lie for improving
uptime, reliability, and costs.
When:
Build
models before you commit systems to bricks and mortar so you
know their performance on paper. Revise the models after
they are in operation to help improve the unknown weaknesses
and improve costs for future cases.
Where:
Monte Carlo models are used for gaining
insight about how things work, and data is collected from the
model at an accelerated rate compared to real life.
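A minimal sketch of such a model is shown below for a single repairable item. It assumes life to failure follows a Weibull distribution and repair time follows a lognormal distribution, with the item as good as new after each repair; the parameter values are hypothetical, and a real model would chain many blocks together.

```python
import random
from math import log


def simulate_availability(mission_hours, weibull_eta, weibull_beta,
                          repair_mu, repair_sigma, trials=2000, seed=1):
    """Estimate availability = uptime / mission time by random sampling."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_up = 0.0
    for _ in range(trials):
        clock = 0.0
        up = 0.0
        while clock < mission_hours:
            # Sample a life to failure (scale eta, shape beta).
            life = rng.weibullvariate(weibull_eta, weibull_beta)
            run = min(life, mission_hours - clock)
            up += run
            clock += run
            if clock >= mission_hours:
                break
            # Sample a repair time; the item is down while under repair.
            clock += rng.lognormvariate(repair_mu, repair_sigma)
        total_up += up
    return total_up / (trials * mission_hours)


# Hypothetical item: eta = 1000 h, beta = 1.5, median repair time = 24 h.
availability = simulate_availability(mission_hours=8760.0,
                                     weibull_eta=1000.0, weibull_beta=1.5,
                                     repair_mu=log(24.0), repair_sigma=0.5)
print(f"simulated availability = {availability:.4f}")
```

Changing one input at a time (for example, halving the median repair time) and rerunning is the kind of tweaking that shows where the opportunities lie.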
Normal Distribution-
What:
A
fundamental frequency distribution that produces a
symmetrical bell-shaped diagram based on the Gaussian
distribution to form a normal law of errors.
Why:
The
distribution is easily described with two statistics, the
mean (X-bar, which is a location parameter) and the standard
deviation (sigma, which is a scale parameter carrying
units of the location parameter) as these are parameters of
the population.
When:
The
distribution is widely used for quality issues where errors
are frequently symmetrically distributed and for a few cases
of reliability problems where life data is also
symmetrically distributed. For symmetrical life data, normal
data makes a good
Weibull plot whereas Weibull data usually makes a poor
normal plot; thus Weibull plots have almost displaced normal
plots for reliability data.
Where:
The distribution is used where the statistics
simplify descriptions of the distribution so it is easy to
describe and explain.
OEE-
What:
Overall
equipment effectiveness (OEE) is a manufacturing index to
reduce complexity of discrete systems for problem solving
and benchmarking. In many ways, it is a subset of
effectiveness. OEE = availability*performance*quality
where availability = (operating time)/(planned
production time), performance = (ideal cycle
time)/(operating time/total pieces), and quality = (good
pieces)/(total pieces); it is best suited to discrete
manufacturing. The index is larger than for effectiveness
and allows for acceptance of down time without having a hard
measure for utilization losses in the capability (although
it does have a performance index which takes elements from
both efficiency and utilization) and it accepts planned
downtime as OK in the availability index. The effectiveness
index looks at the system from the perspective of the
investor, whereas OEE looks at the system from the
perspective of operations management, which excuses many
losses such as planned outages. This gives the indices a
propensity to be “Enronized” so they look good when in fact,
from the investor’s viewpoint, the results are not good,
which is a violation of the principle of Esse Quam Videri
(to be rather than to seem).
Why:
It’s a
simple and easy to use index for the big picture summary of
performance in industry and it can be benchmarked against
similar industries.
When:
Use for a
quick assessment and approximation of the effectiveness
equation.
Where:
Widely used for a first cut at improving
manufacturing operations in lieu of the more stringent and
complete effectiveness equation.
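The OEE arithmetic defined above can be sketched in a few lines. The shift data (minutes and piece counts) are hypothetical and serve only to show how the three factors multiply.

```python
def oee(planned_production_time, operating_time, ideal_cycle_time,
        total_pieces, good_pieces):
    """Return (availability, performance, quality, OEE) as fractions."""
    availability = operating_time / planned_production_time
    performance = ideal_cycle_time / (operating_time / total_pieces)
    quality = good_pieces / total_pieces
    return availability, performance, quality, availability * performance * quality


# Hypothetical shift: 480 planned minutes, 400 minutes running,
# ideal cycle time 0.5 minutes/piece, 700 pieces made, 665 good.
a, p, q, index = oee(planned_production_time=480.0, operating_time=400.0,
                     ideal_cycle_time=0.5, total_pieces=700, good_pieces=665)
print(f"availability = {a:.3f}, performance = {p:.3f}, quality = {q:.3f}")
print(f"OEE = {index:.3f}")
```

Note that planned downtime never enters the calculation, which is the reason the index reads higher than the effectiveness equation.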
Pareto Distribution-
What:
Vilfredo
Pareto, an Italian economist in the late 1800s,
described the unequal distribution of wealth in the world.
The concept was improved by Joe Juran for manufacturing
operations when he said it was a methodology for separating
the vital few problems from the trivial many problems. When
the Pareto distribution is listed in order of money lost (or
the risk for money lost) it becomes a work priority for
attacking business problems that have the greatest impact on
the enterprise. Winners in the organization work on the
vital few important items, as they put their reputations at
stake, while the losers in the organization work on the
trivial many problems, which if solved, would have little
impact on the enterprise.
Why:
The Pareto
distribution sets work priorities and assuming a one year
pay back period describes how much money can be spent to
resolve the issues. Most reliability engineers need to be
working on the top 5 or 6 items all the time as data and
solutions are developed slowly and the key items always need
to be on the mind for active consideration. The mentality
is to think like a bank robber—go for where the big money is
located and get it back.
When:
At least
quarterly reviews of the Pareto distribution are important
for accountability of who has solved what problems and to
define what new targets have come over the horizon that
require immediate attention.
Where:
Pareto distributions are used throughout the
organization to keep attention on the vital few issues.
They are favored by management when engineers employ them
based on money. Pareto distributions help set work
priorities and avoid focusing on love affairs with equipment
or process which often occurs to the detriment of the
business. Pareto distributions explain why some work orders
always get maintenance priority while other tasks are
relegated to the category of whenever we get time to solve
the problem.
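Building a money-based Pareto list is mostly sorting and a running total, as the sketch below suggests. The problem names and annual costs are hypothetical.

```python
def pareto_by_cost(cost_by_problem):
    """Rank problems by cost, largest first, with cumulative share of total."""
    ranked = sorted(cost_by_problem.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(cost for _, cost in ranked)
    rows = []
    running = 0.0
    for name, cost in ranked:
        running += cost
        rows.append((name, cost, running / total))
    return rows


# Hypothetical annual cost of unreliability by problem (currency units).
annual_cost = {
    "pump seal failures": 420_000,
    "compressor trips": 250_000,
    "instrument air leaks": 40_000,
    "heat exchanger fouling": 180_000,
    "valve packing leaks": 60_000,
    "lighting outages": 50_000,
}

rows = pareto_by_cost(annual_cost)
for name, cost, cumulative in rows:
    print(f"{name:<24} {cost:>9,} {100 * cumulative:5.1f}%")
```

With these assumed numbers the top three items carry most of the money, which is where the vital few effort would go first.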
Poisson Distribution-
What:
Poisson
distributions are discrete distributions and the simplest
statistical process, where Poisson events are random in time
and are described by a stable average rate of occurrence of
counted events. The Poisson is frequently used as a first
approximation to describe failures expected with time. The
calculations are driven by an average value, e.g.,
failures/year, defects/meter², hurricanes/year,
etc. Answers from the Poisson will come as probabilities
for 1 failure, 2 failures, etc. or the probability for 1
hurricane in a year or 2 hurricanes in a year, etc. The
average value is obtained from a constant*time-interval
which is usually explained as λ*t. Frequently charts are
used to obtain solutions to the Poisson equation such as the
Thorndike Chart from Bell Labs or the Abernethy-Weber chart
from
The New Weibull Handbook. The equation is often
described in two formats: 1) probability = $(np)^{r} e^{-np}/r!$
where n = number of trials, r =
number of occurrences, and p = probability of an occurrence,
or 2) probability = $Z^{C} e^{-Z}/C!$
where Z = expected number (i.e.,
the mean) and C = the number of events, in counting
numbers, whose probability is wanted. Of course for the two
different formats np=Z and r=C. When n is large and p (or
1-p) is small, the Poisson is an excellent approximation to
the binomial distribution.
Why:
Simplicity
is the major reason for use of the Poisson distribution.
When:
Use the
Poisson when an answer is needed quickly and the answer
deals with counting terms.
Where:
When you know the average number of events
the Poisson is easy to use to find the probability of 1, 2,
3,…events occurring.
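As a sketch of the calculation that the charts replace, the function below evaluates the Poisson probability for a given count from the expected number of events. The rate of 2 failures per year is a hypothetical value.

```python
from math import exp, factorial


def poisson_pmf(count, expected):
    """Probability of exactly `count` events when `expected` is the mean."""
    return expected ** count * exp(-expected) / factorial(count)


# Hypothetical: an average of 2 failures/year observed over a 1-year interval.
rate_per_year = 2.0
interval_years = 1.0
expected = rate_per_year * interval_years  # the constant * time-interval

probs = [poisson_pmf(c, expected) for c in range(4)]
p_at_least_one = 1.0 - poisson_pmf(0, expected)
for c, pr in enumerate(probs):
    print(f"P({c} failures) = {pr:.4f}")
print(f"P(at least 1 failure) = {p_at_least_one:.4f}")
```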
Probability Plots-
What:
Probability plots make sense of the chaos of failure data on
an X-Y plot. Each type of plot is divided differently on
the X and Y axis based on the fundamental mathematics for a
given distribution. The decision which type of graph paper
to use is based on: 1) a simple pragmatic approach (use the
one that gives the best curve fit to the data), and 2) the
physics of failure or the mechanism driving the data for
non-failures. For reliability data, 85% to 95% of the data
will adequately fit a
Weibull distribution. For repair data, 85% to 95% of
the data will adequately fit a
lognormal distribution. Often Weibull plots and
lognormal plots compete over which distribution best fits the
failure data.
Why:
The
acquired data is plotted in the units acquired on the X-axis
of a probability plot and the data is plotted in rank
order. The Y-axis in most cases is determined using
Benard’s median rank approximation to
provide the probability percentage. The result is
often a straight line on the properly divided X-Y graph
paper. Please note, over the years many
different plotting positions have been tried, with
Benard’s plot position being the strongest survivor for
tailed data.
When:
Use when
you have failure data or repair data. They work best when
age-failure plots are made by individual failure modes or
individual repair modes. They also will handle high level
failure data and repair times where the data represent how
the system is behaving.
Where:
Use probability plots to get complicated data
summarized onto one side of one sheet of paper. When the
plots have the cumulative distribution plotted on the
Y-axis, it tells what percent of the population will have a
life (or repair time) less than the corresponding X-value.
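The plotting positions can be sketched numerically. Benard's approximation assigns the i-th of n ranked failures a median rank of (i - 0.3)/(n + 0.4). The fit below draws a straight line through the Weibull-plot coordinates using ordinary least squares of Y on X, which is a simplifying assumption since other regression conventions exist; it also assumes complete data with no suspensions. The failure ages are hypothetical.

```python
from math import exp, log


def benard_median_ranks(n):
    """Benard's approximation of the median rank for ranks 1..n."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def weibull_plot_fit(failure_times):
    """Fit a line on Weibull-plot axes; return (beta, eta)."""
    times = sorted(failure_times)           # data plotted in rank order
    ranks = benard_median_ranks(len(times))
    xs = [log(t) for t in times]            # X-axis: ln(age to failure)
    ys = [log(-log(1.0 - f)) for f in ranks]  # Y-axis: ln(-ln(1 - F))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    beta = slope                      # shape factor = slope of the line
    eta = exp(-intercept / slope)     # characteristic life (63.2% point)
    return beta, eta


# Hypothetical ages to failure (hours) for one failure mode.
ages = [410.0, 620.0, 780.0, 950.0, 1130.0, 1400.0]
beta, eta = weibull_plot_fit(ages)
print(f"beta = {beta:.2f}, eta = {eta:.0f} hours")
```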
Quality Function Deployment-
What:
QFD is a
bad translation of a good reliability technique for getting
the voice of the customer into the design process so the
product delivered is the product the customer desires. In
particular it is applicable to soft issues that are
difficult to specify.
Why:
The method
helps pinpoint: 1) what to do, 2) the best ways to
accomplish the objective, 3) the best order for achieving
the design objectives, and 4) the staffing/assets required
to complete the task.
When:
QFD is a
major up-front effort (as is the case with most Japanese
techniques) to learn and understand the customer’s
requirements and the approach that will satisfy their
objectives.
Where:
The methodology is used as a team approach to
solving problems and satisfying customers beginning with a
listing of customer requirements, converting customer
requirements into engineering characteristics (the house of
quality), converting engineering characteristics into parts
characteristics (the house of parts deployment), converting
parts characteristics into process characteristics (the house
of process planning), and finally converting the process
characteristics into production characteristics (the house
of production planning). As with all Japanese techniques
the up front costs are high and many clever graphical tools
exist for transferring information with the intention of
decreasing costs downstream while satisfying customer’s
needs.
Reliability-
What:
Reliability is the probability that a device, system, or
process will perform its prescribed duty without
failure for a given time when operated correctly in
a specified environment.
Why:
Reliability has two broad ranges of meanings: 1)
qualitatively, operating without failure for long periods
of time just as the advertisements for sale suggest, and 2)
quantitatively, where life is predictably long and
measurable in test to assure satisfactory field conditions
are achieved to meet customer requirements. Reliability is
concerned with failure-free operation for periods of time,
whereas quality is concerned with avoiding non-conformances
at a specified time prior to shipment thus reliability
measures a dynamic situation but quality measures a static
situation. As in physics, statics is easier to understand
and calculate than dynamics which involves higher levels of
math and greater mental capabilities for comprehension.
When:
Reliability is expected for new equipment to start, run, and
continue to function for long periods of time without
failure. Reliability is also expected when the equipment is
dormant and called to duty. Reliability is also expected
upon service or restoration and resumption of long life.
Reliability is designed into the system by up-front
activities, and reliability is sustained by careful
operation of the system along with careful nurturing of the
system with sustaining maintenance activities. Reliability
always terminates in a failure and the roots of failure can
be due to design, fabrication, installation, operation,
maintenance (repair and periodic servicing), and management of
the system. In short, there are many ways and means to kill
the system but few ways to keep it operating without
failure.
Where:
The adage says the proof of the pudding is in
the eating; and for reliability, the proof of the system is
in the long failure free interval. Reliability tools are
used from stem to stern to demonstrate high reliability (the
absence of failures for long periods of time) by use of many
tools such as:
reliability acceptance test to demonstrate long life,
reliability analysis to compute the expected results,
reliability and maintainability the mathematical
tasks which predict the expected results from the elements,
reliability apportionment to allocate life issues in
a top-down manner to meet an overall reliability goal,
reliability assessment determines the achieved level
of reliability of an existing system using data gathered
during test or use,
reliability assurance implements planned management
and technical measures to provide confidence that a
reliability target is obtained and maintained,
reliability block diagrams to graphically and
mathematically calculate reliability results prior to
building a system,
reliability-centered maintenance is the systematic
approach to identify preventive support and service
according to a set of procedures to reduce and avoid
failures,
reliability confidence limits demonstrate the limits
for reliability within a given confidence limit,
reliability control is the coordination and direction
of system
dependability through design activities and management
planning,
reliability critical item identification whereby
failure significantly affects system safety/cost or
operational success or maintenance/logistics support costs,
reliability data is the basic age-to-failure
data as
life unit information relating to the time-to-failure
when organized by
probability distributions,
reliability degradation which incurs loss of the
failure-free performance due to poor workmanship or bad
parts or improper operation or abuse or inadequate
maintenance,
reliability design practices
are a series of trade-off-tools to meet or beat the design
specification for reliability,
reliability development/growth test are the
evaluations to disclose deficiencies and verify corrective
actions to prevent reoccurrence of the failures to achieve
the design specifications and sustain
reliability growth toward longer times between failure,
reliability estimates are life values used prior to
statistical experimentation with the end products to make
predictions or assessments, or stress analysis evaluations,
reliability function is the graphical representation
of life characteristics plotted against operating time,
reliability growth achievement is the systematic
improvement of an item’s/system’s dependability by removing
failure mechanisms through corrective actions to eliminate
deficiencies and flaws, often achieved by means of
test, analyze, and fix,
reliability growth models (Crow-AMSAA) measure the
reliability growth by means of log-log plots of
cumulative failures on the Y-axis and cumulative time on the
X-axis to demonstrate with statistics that failures are
coming more slowly and reliability goals have been achieved,
reliability guarantee is the commitment by suppliers
to provide a given mean time between replacements or
maintenance and overhaul intervals for equipment,
reliability improvement is the identification of
failure modes and effects having a critical impact on the
system failure potential of the design along with the
systematic removal of the failures to produce long life
without failures,
reliability index is the ratio of the mean
reliability level achieved to the acceptable level specified
in the design as a figure of merit,
reliability measurement is failure free endurance
assessment activity for making decisions about reliability
and demonstrating compliance,
reliability mission is the mission time for
demonstrating failure free performance,
reliability prediction is the process of
quantitatively assessing whether a proposed or existing
design meets a specified life requirement,
reliability prediction functions estimate the life
characteristics for setting goals and evaluating the design
benchmarks and needs,
reliability prediction limitations describe the shortcomings in
life values obtained by analytical methods,
reliability prediction requirements describe life
assumptions, environmental data, and failure rates for the
design,
reliability prediction summary
is a report providing conclusions and recommendations based
upon a reliability assessment analysis,
reliability program is the activities to organize and
achieve a system to ensure reliability goals are achieved
and deficient areas shored up,
reliability program plan is the formal written
definition of the specific tasks to fulfill the reliability
requirements,
reliability qualification test (RQT) is an evaluation
conducted under specified conditions using items
representative of the approved product configuration,
reliability quantitative elements are the life
characteristics and factors considered in predicting and
measuring reliability performance,
reliability requirements are the numerical values
representing a specified failure-free life or dependability
performance characteristic,
reliability sequential tests are evaluations of the
number of failures and the time required to reach a decision
based on the accumulated results of the reliability tests,
reliability tasks describe the activities required to
achieve a reliability program,
reliability tests are the formal evaluation to
determine a product’s longevity for the failure-free
interval or stability relative to time/usage,
and finally
reliability with repair is the failure-free
performance achieved by redundancy with permitted online
repairs without interrupting equipment operation.
Reliability Audits-
What:
Reliability audits verify your reliability program is
effective and find areas of weakness for corrective action.
They are inquiries by factual examination of elements of the
system against written, objective criteria for performance,
beginning with an assessment of how management is involved
and whether it is effective in building a productive reliability
program.
Why:
Most
organizations know where they are strong. On an objective
basis, few organizations know where they are weak.
Reliability audits are fact-finding exercises similar to
financial and quality audits to ferret out weaknesses for
corrective action. The questions to be answered are: 1) how
well are you doing what you promised against your
reliability policy, 2) how well is upper management
doing against company objectives for reliability, 3) how
well are reliability plans, systems, and procedures working,
4) how well are plans, systems, and procedures being
executed against the policy, 5) how well is the productive
effort for reliability working toward achieving the goals, 6)
how well has the reliability system been communicated to
employees and are they committed to understanding and
implementing the improvements, and 7) are financial
objectives being met as a result of ongoing reliability
improvements (which is the main objective of the audit, not
just rigid procedural/bureaucratic compliance with details).
When:
Detailed
audits should occur annually, with a follow-up
six months later to ensure that
corrective action has been implemented. Without a six
month deadline few tasks will be completed because of
procrastination.
Where:
Audits are needed for 1) reliability system
management, 2) new techniques, technology, developments, and
controls, 3) supplier control (internal and external), 4)
process operation and control, 5) reliability data programs,
6) problem solving techniques, 7) control of reliability
measurements, 8) human resources involvement, 9) customer
satisfaction assessment (internal/external), and 10)
software reliability (excluding Microsoft products used in
the office environment).
Reliability Block Diagrams-
What:
Reliability block diagram (RBD) models are graphical
representations of a calculation methodology for reliability
systems.
Why:
The RBD
models allow calculation of system reliability based on
knowing/assuming failure details of the components starting
with the least component and growing the model to the
greatest system to predict performance from the elements.
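As a rough illustration of growing the model from the least component to the greatest system, the sketch below combines block reliabilities in series and in parallel. It assumes the blocks fail independently, and the block names and reliability values are hypothetical, chosen only to show the arithmetic.

```python
from math import prod

def series(reliabilities):
    """Every block must work: multiply the block reliabilities."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one block must work: 1 minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical system: two redundant pumps feeding a motor and a controller.
pump_pair = parallel([0.90, 0.90])
system = series([pump_pair, 0.95, 0.98])
print(f"pump pair = {pump_pair:.4f}, system = {system:.4f}")
```

Swapping a single block value and rerunning the calculation is the kind of trade-off study the diagrams are meant to support before anything is built.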
When:
RBDs are
used in upfront designs as a performance parameter and after
the system is constructed to ferret out poor performing
blocks that limit the system performance.
Where:
Frequently used as a trade-off tool to search
for the lowest long term cost of ownership and to help sell
alternative courses of action for moderating the effects of
reliability issues or overcoming poor performance with
alternative designs. The results can be calculated
before building the system, and the
calculations provide knowledge about availability,
maintenance interventions required for failures, and the
number of spare parts required to sustain operations. For
other definitions see
MIL-HDBK-338, sections 4 and 6.
Reliability-Centered Maintenance-
What:
Reliability-Centered maintenance (RCM) is a
systematic planning process used to determine the
maintenance requirements for a system. RCM assumes the
system has an inherent reliability, and maintenance
requirements are imposed upon the baseline of inherent
safety and inherent reliability, which can be no better than
what was designed into the system.
Why:
RCM does
what is required to make sure the systems continue to do
what the users want done. If even excellent maintenance
programs fail to deliver the reliability expected, then
the system must be improved by design changes to physical
assets or the manner in which the assets are used.
When:
RCM
requires a cultural change in both management and the work
forces to “do maintenance by the numbers”. This requires
discipline in the organization to perform the
FMEAs that drive the work process for maintenance and it
also requires defining
functional failures.
Where:
RCM works better in top quartile
manufacturers who have a disciplined work force and are
interested in achieving excellence in 1) safety, 2)
operability, 3) reduced maintenance
downtime by a disciplined approach to the maintenance
activities, 4) high uptimes and 5) a reduction in failures.
Lacking one or more of the five efforts at excellence
generally results in a failed RCM program.
Reliability Engineering-
What:
A
strategic job for preparing plans to reduce the failures and
the cost of failures as a preventative measure to reduce the
cost of unreliability. Acquires failure data and
analyzes the data to quantify
the financial impact and prepare long term solutions to
prevent reoccurrences to improve reliability and uptime.
Determines the cost advantages and proposes alternatives for
solving the problem and recommends the alternative with the
lowest long term cost of ownership. The purpose of
these actions is to prevent failures.
Why:
Prevents
future failures by working on medium and long term projects
using technology to solve the problems. As required,
provides technical assistance to maintenance engineers to
aid their efforts for quickly restoring equipment to
service.
When:
Provides
expertise for avoiding failures by means of a technical
solution to reduce the high cost reliability problems on the
Pareto distribution.
Where:
Provides technical support and solutions for
management on longer range problems, and as required,
supplies technical assistance to maintenance engineers for
immediate and difficult restoration projects as a liaison
effort. Supports task improvements to accomplish longer
term objectives (think months and quarters) which will
result in smoother operations, at lower costs, without
failures.
Reliability Growth Models-
What:
Reliability growth models are important management concepts
for making reliability visual with simple displays. The
simple log-log plots of cumulative failures on the Y-axis
against cumulative time on the X-axis often make straight
lines where the slope of the trend line is highly
significant for telling if failures are coming faster (b>1)
which is undesirable, slower (b<1) which is desirable, or
without improvement/deterioration (b=1), which usually
drifts toward undesirable results. The reliability growth
models are frequently called Crow-AMSAA plots in honor of
Larry Crow’s proof of why the charts work as described in
MIL-HDBK-189 when he worked with
AMSAA.
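The slope b of the trend line can be estimated with a few lines of code. The sketch below fits a straight line to log(cumulative failures) versus log(cumulative time) by ordinary least squares, which mirrors reading the slope off the log-log plot; it is only an illustration, other estimation methods exist, and the failure times are made up.

```python
import math

def crow_amsaa_fit(cum_times):
    """Fit log(N) = log(lam) + b*log(t) by least squares.

    cum_times: cumulative time at each successive failure, in ascending order.
    Returns (b, lam), the slope and the scale of the trend line.
    """
    xs = [math.log(t) for t in cum_times]
    ys = [math.log(n) for n in range(1, len(cum_times) + 1)]
    count = len(xs)
    x_bar = sum(xs) / count
    y_bar = sum(ys) / count
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    lam = math.exp(y_bar - b * x_bar)
    return b, lam

# Made-up data: each failure takes longer to arrive than the one before.
failure_times = [100, 250, 500, 900, 1500, 2400]
b, lam = crow_amsaa_fit(failure_times)
trend = "slower (desirable)" if b < 1 else "faster or unchanged"
print(f"b = {b:.2f}: failures are coming {trend}")
```

For this made-up data set the slope comes out below 1, which is the desirable case described above.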
Why:
You must
see reliability problems to fix them. The simple log-log
plots make the models visible. The task of the reliability
engineer is to put favorable cusps on the Crow-AMSAA trend
lines to make failures come more slowly and thus decrease
the long term cost of ownership. If you’re doing your
improvement job correctly, you’ll never have many failures
until you have a cusp.
When:
The plots
are useful for development tasks (where they first were
used) or to long term operations. They work for safety
programs, plant improvement programs, environmental
programs, or for cost problems. Use the plots to show, not
tell, how the projects are proceeding; the key
metric in the form of line slope is easy to understand and
easy to communicate in less than 60 seconds.
Where:
They are used for technical development
issues or for management reviews. A picture is worth 1000
words for getting management’s attention for focusing on a
problem. Likewise the charts are highly useful for showing
the reductions in failures that have occurred from making a
desirable and permanent fix.
Reliability Policies-
What:
Management
communicates with their staffs through important policy
statements. Management policies are general and relate to
procedures and rules which are specific for implementing
policies. Written statements of policy regarding
reliability are decisive documents about avoiding system
failures in the same way as safety policies address the need
for absence of human injuries, quality policies address the
need for absence of product discrepancies, and environmental
policies address the need for avoiding spills and releases.
Management also needs to state a
reliability policy, which may read like this: We will
build an economical and failure free process which will
operate for 5 years between planned outages. This
statement will clearly communicate that failures of the
process (which is the money machine) are to be abhorred and
avoided!
Why:
Process
failures are clearly money issues because when the process
ceases to run, the company has no income, thus process
failures are to be abhorred for killing the money machine.
When:
Implementing a policy before construction of new facilities
is important so the policy can be used as design criteria. When
implemented with older facilities the task is more difficult,
and old facilities may never be able to comply with the
objectives at a reasonable cost.
Where:
Responsibility for implementing the policy
lies with: 1) the chief operating officer, who must authorize the
policy and ensure the policy is applied throughout the
operations under the administrative directive which sets the
guidelines for financial and engineering measures, 2) the
engineering/R&D executives, who are responsible for ensuring the
policy is implemented by systems engineering, design
engineering, project engineering, pilot plant engineering,
and test engineering, 3) the manufacturing executive, who is
responsible for ensuring that the reliability policy is
carried out by the materials and procurement functions,
industrial engineering functions, manufacturing engineering
functions, operations functions, and maintenance functions,
4) the quality assurance executive, who is responsible for the
dissemination of the reliability policy, its annual review
and auditing for compliance with the spirit of the policy, and
for making recommendations to the chief operating officer
concerning continued relevance, applicability, and
effectiveness, and 5) the human resources executive, who is
responsible for ensuring that all new employees are
indoctrinated into the purpose and implementation of the
reliability policy as a part of the operation’s mission,
goals, and priorities.
Reliability Testing-
What:
Suppliers
have two strategies for testing: 1) test for success and 2)
test for failures. Reliability testing produces
failures, particularly when the tests are accelerated with
extra loads, and this may be troublesome to have in the
records for future lawsuits. Thus it is often to everyone’s
advantage to perform reliability tests under code names to
protect against the broad rules of legal discovery.
Why:
The
reliability tests will determine a product’s longevity and
failure-free performance. This requires data recording and
data integrity. Plans must be set for how the tests are to
be conducted, loads to be handled,
duration of the tests, environmental conditions,
operating modes, failure definitions, and documentation for
recording/analyzing the test data.
When:
Reliability tests are usually run prior to release of the
product for sale or after the product has been released and
troublesome failures appear in field applications where no
problems were expected.
Where:
Laboratory tests are conducted in many cases
but in other cases the data may simply come from field use.
Note the failures induced require extra components which
must be expected and budgeted along with the extra costs for
data acquisition/analysis.
Simultaneous Testing-
What:
For
inexpensive components and inexpensive tests, simultaneous
tests involve many components under test loads/conditions at
the same time for the purpose of quickly acquiring data and
producing test analysis as the failures occur. In
simultaneous testing the suspensions (censored data) become
important details for use in the statistical analysis. Most
simultaneous tests are accelerated to generate the data in a
short period of time although this carries the risk of
introducing unexpected failure modes (but this can also be
useful information for anticipating field failures).
Why:
Conducting
analysis of the early test results, when only a few failures
have occurred, will give precursors as to passing/failing
the longer term tests. If the early test results look
encouraging, the larger test may be allowed to run to
conclusion. However if early test results are
disappointing, the test may be abandoned without using all
of the testing budget so that remedial action can occur
prior to completing the full scale planned test.
When:
This
testing is usually conducted prior to release of products.
However, a similar watch may be set up for warranty repairs
so as to anticipate the cost and extra supplies required to
cope with an unexpected failure which was not forecast.
Where:
This strategy is appropriate for inexpensive
components in the test laboratory. However, for warranty
problems, the approach is also appropriate for expensive
components or assemblies.
Software Reliability-
What:
Software
does not wear out but it does fail and most failures are due
to specification errors and code errors with only a few
errors in copying or use. The only software repair is by
reprogramming and adding safety factors is almost
impossible. Software reliability improves by finding errors
and fixing the errors, but estimating the number of errors
which cause failures is extremely difficult as many branches
of software code may lie dormant and unused until special
events occur to make the latent failures obvious. Software
failures are not often time related but are more software
code page dependent. Software reliability is improved by
extensive testing to disclose the failures, then fixing
them and repeating the test all over again to validate that the fix
did not generate more failures, and continuing the search for
other latent defects.
Why:
More than
50% of the software bugs (failures) occur from
specifications with lesser amounts of failures from system
design and the coding process and this is due to the lack of
visibility in the software process along with problems from
those specifying the requirements with problem roots in
ambiguities, inconsistencies, incomplete statements, and
lack of logical requirements. This requires that both
inputs and outputs for software must be specified in greater
detail than for mechanical, electrical, or system data to
avoid the errors and conflicts.
When:
“Clean
room” software procedures are a technique for extracting
details from the customers to guide the programmers, and
they are used up-front to reduce errors and wasted code.
Acquiring the data is tedious, and roughly 80% of the
software budget is spent getting the details “right” before
programming commences.
Where:
Disciplined software specialists carefully
work the plan up-front to reduce errors and testing time.
Undisciplined, so-called “neo-experts” want to see busyness
in code writing up-front and thus their software reliability
is worse from not having a firm foundation from which to
work.
Sudden Death Testing-
What:
For
expensive components and expensive tests, sudden death tests
involve a few components that tie-up a test frame as they
are heavily loaded under the same test loads/conditions with
several items being run at the same time. When one of the
items fails the entire test frame is shut down so that you
have 1 failure (this is the sudden death!) and several
suspensions because the unfailed units are survivors as the
test is halted until the test frame is loaded with new
samples for resumption of the life test. Opening the test
frame (instead of tying up the frame until all samples have
failed) is cost effective. If three units can be tested
simultaneously and the test is halted on the first failure,
then twelve samples run as four loads of three will give
only 4 failures and 8 suspensions for preparing the
Weibull analysis. Will the 4 failure + 8 suspension
data set be different than if all 12 samples had been run to
failure? The answer is yes, they will be different. But
will they be significantly different? The answer is no.
So, as with
simultaneous testing the suspensions (censored data)
become important details for use in the statistical
analysis. Most sudden death tests are accelerated to
generate the data in a short period of time although this
carries the risk of introducing unexpected failure modes
(but this can also be useful information for anticipating
field failures).
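To show how the suspensions enter the statistical analysis, here is a minimal sketch of a Weibull fit for a 4 failure + 8 suspension data set. It assumes median rank regression with adjusted ranks for the suspensions and Benard's approximation for the median ranks, fitted by an ordinary least-squares line; other regression choices and maximum likelihood are also used in practice. The failure ages and function names are hypothetical.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def weibull_fit_with_suspensions(data):
    """data: list of (age, failed) pairs; failed is False for a suspension.

    Returns (b, h): the shape factor and characteristic life.
    """
    # Sort by age; when ages tie, count the failure ahead of the suspensions.
    ordered = sorted(data, key=lambda item: (item[0], not item[1]))
    n = len(ordered)
    previous_rank = 0.0
    xs, ys = [], []
    for position, (age, failed) in enumerate(ordered):
        if not failed:
            continue  # suspensions are not plotted but shift the later ranks
        reverse_rank = n - position
        rank = (reverse_rank * previous_rank + n + 1) / (reverse_rank + 1)
        previous_rank = rank
        median_rank = (rank - 0.3) / (n + 0.4)  # Benard's approximation
        xs.append(math.log(age))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    b, intercept = least_squares(xs, ys)
    h = math.exp(-intercept / b)
    return b, h

# Hypothetical sudden death test: four loads of three units each.
first_failures = [310.0, 420.0, 560.0, 700.0]  # made-up hours to first failure
data = []
for age in first_failures:
    data.append((age, True))          # the unit that failed
    data.extend([(age, False)] * 2)   # two survivors suspended at the same age
b, h = weibull_fit_with_suspensions(data)
print(f"b = {b:.2f}, h = {h:.0f} hours")
```

Dropping the eight suspensions from the data and refitting would push the characteristic life lower, which is why the survivors must be kept in the analysis.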
Why:
Sudden
death testing is all about the economics and shorter elapsed
time for results.
When:
Sudden
death testing is used for product acceptance tests.
Where:
It is a quick test for many products and the
on-going test for production lots.
Total Productive Maintenance-
What:
Total
productive maintenance (TPM) is a corporate-wide effort
involving all employees to fully use equipment to the
maximum limit employing an equipment-oriented management
concept to reduce failures and increase utilization of
equipment and processes in a productive manner. TPM
programs are teamwork programs and require a corporate
culture of teamwork devoid of us vs. them issues. All
employees are expected to accept ownership of the equipment
and processes to do many small things all the time to insure
high levels of availability by eliminating failures in the
early stages with low cost actions. The employees approach
the process equipment as owners rather than
renters.
Why:
Maximizing
equipment uptime with lower costs by all employees working
to reduce the many small incidents which lead to a failure.
When:
Major
maintenance tasks are handled by the craftsmen. Most small
tasks are handled by operators in a never ending effort of
cleaning, lubricating, and tightening to find problems early
when they can be solved simply instead of letting the
problem grow to a major issue.
Where:
TPM is a system wide effort of providing care
to the equipment rather than “it’s not my job” and “we’ve
got to fill out the paperwork before “they” can do
anything”. The technique makes good use of the 5 human
senses but technical details must be taught to the work
force to understand good from bad and when action must be
taken along with what must be done—this requires a sharing
environment where the work team works for the common good of
higher performance. If the culture is me, me,
me, TPM will not work.
Weibayes Estimates-
What:
If you’ve
got one piece of failure data and nothing else, you’re a
poor person without much hope. If you’ve got one piece of
failure data and a
Weibull database, you’re a
rich person with a map on the back of an envelope and a
compass by your side to get you out of the abysmal swamp of
ignorance and misunderstanding.
Why:
The
Weibayes technique uses your
failure data and past experience to make
Weibull analysis forecasts about what you should expect
in the future, and in many cases, given a hypothesis of
worst-case/best-case, a failure
forecast can be generated.
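A minimal sketch of the idea follows, assuming the commonly cited Weibayes estimate in which the shape factor b is taken from past experience (for example a Weibull database) and the characteristic life h is computed from the ages of all units, failed and unfailed. The ages, the assumed b, and the function names are hypothetical.

```python
import math

def weibayes_h(ages, failures, b):
    """Characteristic life h from the ages of all units and an assumed shape b."""
    # With no failures yet, assuming one failure gives a conservative estimate.
    r = max(failures, 1)
    return (sum(age ** b for age in ages) / r) ** (1.0 / b)

def fraction_failed(t, h, b):
    """Weibull cumulative fraction failed by age t."""
    return 1.0 - math.exp(-((t / h) ** b))

# Made-up fleet: one failure at 1000 hours, three unfailed units at lesser ages.
ages = [1000.0, 800.0, 600.0, 400.0]
assumed_b = 2.0  # hypothetical value drawn from past experience
h = weibayes_h(ages, failures=1, b=assumed_b)
forecast = fraction_failed(500.0, h, assumed_b)
print(f"h = {h:.0f} hours, fraction failed by 500 hours = {forecast:.3f}")
```

Rerunning the calculation with a low and a high value of the assumed b is one way to bracket the worst-case/best-case forecast.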
When:
Use the
technique when you lack specific details but you know
something from your past experience—often the past
experience reduces errors of Weibull analysis. Use Weibayes
analysis to make sense out of emotional nonsense.
Where:
Use the technique to say something and point
noses in the right direction rather than playing the role of
Chicken Little with the sky falling. Some data is better
than no data in most cases and when you can keep your wits
and everyone else is in panic mode, it quiets the problem to
allow reason to prevail.
Weibull Analysis-
What:
Weibull
analysis is the tool of choice for most reliability
engineers when they consider what to do with age-to-failure
data. It uses the Weibull distribution which says
mathematically that reliability, R(t) = exp[-(t/h)^b],
where t is time, h is a scale factor known as
the characteristic life, and b is a shape factor (h and b are
commonly written as the Greek letters eta and beta). Most
Weibull distributions have tailed data and lack an easy way
to describe central tendency as the mode≠median≠mean;
however, regardless of the b-value, all of the cumulative
distribution functions pass through 63.2% at t = h,
which thus entitles h to be known as the single point
characteristic life.
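The 63.2% property is easy to check numerically. The short sketch below evaluates R(t) = exp[-(t/h)^b] at t = h for several shape factors; the particular b and h values are arbitrary choices for illustration.

```python
import math

def weibull_reliability(t, h, b):
    """R(t) = exp[-(t/h)^b] with characteristic life h and shape factor b."""
    return math.exp(-((t / h) ** b))

h = 1000.0  # arbitrary characteristic life, in hours
for b in (0.5, 1.0, 3.0):
    fraction_failed_at_h = 1.0 - weibull_reliability(h, h, b)
    print(f"b = {b}: fraction failed at t = h is {fraction_failed_at_h:.3f}")
```

Every line prints 0.632 because (h/h)^b equals 1 for any b, so the fraction failed at t = h is always 1 - 1/e.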
Why:
The
Weibull distribution is so frequently used for reliability
analysis because one set of math (based on the idea that the weakest link
in the chain will cause failure) describes infant mortality,
chance failures, and wear-out failures.
When:
Use
Weibull analysis when you have age-to-failure data. When
you have age-to-failure data by component, the
analysis is very helpful because the b-values will tell you
the modes of failure, which no other distribution will
do! When you have age-to-failure data by system, the b-values
have NO physical significance and the b- and h-values only
explain how the system is functioning. This means you lose
significant information for problem solving.
Where:
When in doubt, use the Weibull distribution
to analyze age-to-failure data. It works with test data.
It works with field data. It works with warranty data. It
works with accelerated testing data. The Weibull
distribution is valid for ~85-95% of all life data, so play
the odds and start with Weibull analysis. The major
competing distribution for Weibull analysis is the
lognormal distribution. For additional information
read
The New Weibull Handbook, 5th edition by Dr.
Robert B. Abernethy and use the
WinSMITH Weibull and
WinSMITH Visual software for analyzing the data (both
programs are bundled for a reduced price as
SuperSMITH).
Weibull Database-
What:
The
smartest way to maintain a reliability database is in
Weibull format and
Weibull databases are available. Seldom do you see
Weibull databases from vendors because they jealously
protect their data for proprietary reasons; they live/die
financially from the Weibull database information.
Why:
The
Weibull databases simplify the complications of failure data
into two statistical values of great importance:
b tells you HOW things fail, and
h tells you WHEN things fail.
The results are key benchmark data that tell you how you’re
doing.
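One possible shape for such a database is sketched below: each component maps to its b and h, and two small helpers turn those values into the HOW and the WHEN. The entries are invented for illustration, the tolerance band around b = 1 is an assumption, and the mean life uses the standard Weibull relation mean = h × Γ(1 + 1/b).

```python
import math

# Hypothetical entries: component -> (b, h in hours). All values are invented.
weibull_db = {
    "pump seal": (1.4, 18000.0),
    "electronic card": (0.7, 90000.0),
    "ball bearing": (2.0, 40000.0),
}

def how_it_fails(b, band=0.1):
    """Classify the failure pattern from the shape factor b.

    The band around b = 1 is an assumed tolerance, not a standard value.
    """
    if b < 1.0 - band:
        return "infant mortality"
    if b > 1.0 + band:
        return "wear-out"
    return "chance failures"

def mean_life(b, h):
    """Mean age to failure for a Weibull distribution."""
    return h * math.gamma(1.0 + 1.0 / b)

for name, (b, h) in weibull_db.items():
    print(f"{name}: {how_it_fails(b)}, mean life about {mean_life(b, h):.0f} hours")
```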
When:
Gather
your failure data and create your own database. No one is
going to give you their database because they put much sweat
and tears into cleaning up the data so it is useful. The
data needs to be locally generated because it: 1) tells you
the life from the
grade of equipment you purchase, 2) describes the
grade of operation of the equipment (do you operate it like
16 year-old teenagers or wise old men/women of 65?), 3)
describes the grade of maintenance you use to renew its
life, and 4) tells you management’s expectations for how
to treat the system.
Where:
The data starts out as a silly exercise by
maintenance to accumulate data with much ridicule from the
unknowledgeable about why you are
spending this effort to build a Weibull database. Then
suddenly when adversity arises, it becomes everyone’s prized
possession. Remember the words of Rudyard Kipling about
the English soldier: In peace time it’s
Tommy this and Tommy that, and Tommy get out of the
way... but you let the bullets fly in wartime and it’s Mr. This
and Mr. That and Mr. if you please! Everyone wants the baby
but no one wants the dirty diapers that go with every baby!
If you don’t have a Weibull database, you’re already too
late because your competitor has one started and is using it
to your disadvantage and he’s not going to tell you why
you’re left in the dirt!
Comments:
Refer to the caveats on the
Problem Of The Month Page about the
limitations of the following solution. Maybe you have a
better idea on how to solve the problem. Maybe you find
where I've screwed-up the solution and you can point out my
errors as you check my calculations. E-mail your comments,
criticism, and corrections to Paul Barringer.