Reliability tools exist by the dozens: what are the tools,
why use the tools, when should I use the tools, and where
should I use the tools? Click on the tools below for
answers.
The details about these tools will be brief
as books are written about each item. Think of the
presentations below as hors d’oeuvres (a little snack food
or starters)—not the main course.
The most important reliability tool is a
Pareto distribution based on money—specifically based on
the
cost of unreliability which directs attention to
work on the most important money problem first. No magic
bullet exists for reliability issues—don’t waste your time
looking for a single magic tool—none exist!
Accelerated Testing-
What:
A test
method of increasing loads to quickly produce age-to-failure
data with only a few data points which are then scaled to
reflect normal loads.
Why:
The
benefit of accelerated testing is to save time and money
while quantifying the relationships between stress and
performance along with identifying design and manufacturing
deficiencies to get useful data quickly and at low cost.
When:
Usually
performed during the development of devices, components, or
systems. Also applies to items that have been in service to
obtain a metric needed to show how the item is performing
under heavy loads. Accelerated testing is a useful method
for solving old, nagging problems within a production
process.
Where:
Used for correlating test results with real
life conditions.
Availability-
What:
A tool for
measuring the % of time an item or system is in a state of
readiness where it is operable and can be committed to use
when called upon. Availability ceases because of a downing
event which causes the item/system to become unavailable to
initiate a mission when called upon. In the simplest view
the metric is availability = uptime/(uptime
+ downtime). For many other definitions see
MIL-HDBK-338, section 5.
Why:
The
measure is important for knowing the commitment of time for
performing the mission and it usually only involves the use
of arithmetic.
When:
Often the
measurement tool is based on past experiences and the
complement of the measurement tool addresses unavailability
to perform the task.
Where:
In design of a system it is a calculated
value, and in operation of a system it is a performance index
that is often easy to use and provides an index that is
understandable to the average person. Today there is a
great tendency to “Enronize” availability metrics by using
uptime metrics that present data in the best light (an
issue of data integrity) to maximize managerial bonuses by
excusing (deducting) downtime from the calculations to put
lipstick on the pig. Use the
KISS principle. Think of availability in terms of the
investor’s typical year of 8760 hours. The no-excuse annual
metric in hours is availability = uptime/8760. Suddenly
you’ll find a metric of great interest to investors that can
be benchmarked as a financial issue, and thus motivate the
management team to solve real issues of importance to the
business. Please note, you can
have high availability but many failures and thus low
reliability as
availability ≠ reliability. Likewise, you can have high
availability but little output so team the metric with
effectiveness to get the complete story.
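The two availability formulas above can be sketched in a few lines. This is a minimal illustration only; the function names and the sample hours are assumptions made for the example, not values from any standard.

```python
HOURS_PER_YEAR = 8760  # the investor's typical year used in the text


def availability(uptime_hours, downtime_hours):
    """Simplest view: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total


def annual_availability(uptime_hours):
    """No-excuse annual metric: uptime / 8760, with no downtime deducted."""
    return uptime_hours / HOURS_PER_YEAR


# Hypothetical year: 8000 h running, 400 h of "excused" planned outage,
# and 360 h of unplanned outage.
excused = availability(8000, 360)      # planned outage left out of the sum
no_excuse = annual_availability(8000)  # every hour of the year counts
print(f"excused: {excused:.3f}, no-excuse: {no_excuse:.3f}")
```

The gap between the two numbers is the downtime that was excused from the calculation.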
Bathtub Curves-
What:
The
concept is derived from the human life experience involving
infant mortality, chance failures, plus a wear out period of
life since data for births and deaths is accumulated by
government agencies. Most equipment lacks the birth/death
recording by government agencies and most non-human systems
can be regenerated to live/die many times before relegation
to the scrap heap.
Why:
Failure
rates differ for both people and equipment at
different phases of operation, and the medicine applied
to both humans and equipment needs to be considered for
effectively treating the roots of the problem.
When:
The
concept is useful during design, operation, and maintenance
of equipment and systems to understand the failure
mechanisms.
Where:
It explains the human experiences to the
ordinary person to relate equipment/system failures to those
experienced in real life so as to coordinate the design,
operation and maintenance of equipment. For other
definitions see
MIL-HDBK-338, section 9.
Block Diagram Model (same as Reliability Block Diagram Models)-
What:
Reliability block diagram (RBD) models are graphical
representations of a calculation methodology for reliability
systems.
Why:
The RBD
models allow calculation of system reliability based on
knowing/assuming failure details of the components starting
with the least component and growing the model to the
greatest system to predict performance from the elements.
When:
RBDs are
used in upfront designs as a performance parameter and after
the system is constructed to ferret out poor performing
blocks that limit the system performance.
Where:
Frequently used as a trade-off tool to search
for the lowest long-term cost of ownership and to help sell
alternative courses of action for moderating the effects of
reliability issues or overcoming poor performance through
alternative designs. The results can be calculated
before building the system, and the
calculations provide knowledge about availability,
maintenance interventions required for failures, and the
number of spare parts required to sustain operations. For
other definitions see
MIL-HDBK-338, sections 4 and 6.
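As a rough sketch of how an RBD grows from components to a system, series blocks multiply their reliabilities and redundant (parallel) blocks multiply their unreliabilities. The block names and reliability values here are hypothetical, and the arithmetic assumes the blocks fail independently.

```python
from math import prod


def series(reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Redundant blocks: the system fails only if every block fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical system: a feed pump, two redundant compressors, a reactor.
pump, compressor, reactor = 0.95, 0.90, 0.98
system = series([pump, parallel([compressor, compressor]), reactor])
print(f"system reliability: {system:.5f}")
```

Swapping a block value and recomputing is the simplest form of the trade-off study described above.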
Capability-
What:
A measure
of how well the product performance meets objectives. In
short how well are the outputs actually accomplished against
a standard? Capability is frequently the product of
efficiency * utilization.
Why:
Capability
is a component of the
effectiveness equation and usually under the control of
production.
When:
Data for
this metric is frequently produced by the Accounting
department each month as a segment of the financial reports
for the purpose of handling variances against the standards.
Where:
Frequently in the effectiveness measure it is
a weak point [as a measure of how well the production
process does the job for which it was purchased] requiring
substantial improvement that cannot be solved by the usual
reliability and maintainability (RAM) tools. However, this
metric may be deficient from the original design [an issue
of design effectiveness] of the system or from the way the
system is operated [an issue of use effectiveness].
Configuration Control-
What:
Configuration control is involved with the management of
change by providing traceability of failures back into the
design standard. If the design details are not specified,
the design will not contain the requirements and thus
implementation of the project will be hit or miss for
achieving the desired end results beginning with the
conceptual design and resulting in the operating facility.
Why:
With
active configuration control you know where items are used
and contained, where and why they were installed, where
signals originate, what items are used where and in what
environments, what drawing revisions have occurred, that the
product conforms to the drawings and specifications, what
alternate materials/components have been used, and that test
reports/certifications are available as original documents
for review.
When:
Configuration control begins after the first design review
to build an unbroken chain of traceability to aid in
avoiding surprises in the field which would destroy the
designed-in criteria for availability, reliability,
maintainability, and cost effectiveness established as a
portion of the original design criteria.
Where:
Frequently these documentation details are
assembled into a dossier with third party witnessing for use
in validating conformance to the design requirements and
provided to the owner of the equipment as witness documents.
Contracting For Reliability-
What:
Say what
you want and want what you say to your vendors. Provide
explanations of the objectives in contracts in terms the
vendors will understand.
Why:
If you
can’t clearly spell out the requirements for
availability, reliability, and maintainability, the
contractors cannot make these issues features of the
design. Thus it is important to be specific in the features
the design must manifest. Explanations such as: “You know
what I want and what I need, just do it quickly” are self
defeating expressions of vague generalities that lead to
inferior designs and constant arguments. Be specific about
requirements for building
reliability block diagrams, using
quality function deployment, performing
failure mode and effects analysis, conducting
fault tree analysis and finally conducting
design reviews for reliability.
When:
Write the
specifications before procurement begins. Plan to spend
time with your own Purchasing Department to explain the
details and sell the team on the financial advantages for
including reliability requirements into the specifications;
and likewise, spend time selling your vendors on the
requirements and why they are stated.
Where:
These are up front decisions to avoid
replication of previous problems that are built into
previous designs and never corrected.
Cost Of Unreliability-
What:
The cost
of unreliability is a big picture view of system
failure costs, described in annual terms, for a
manufacturing plant as if the key elements were reduced to a
series block diagram for simplicity. It looks at the
production system and reduces the complexity to a simple
series system where failure of a single
item/equipment/system/processing-complex causes the loss of
productive output along with the total cost incurred for the
failure. If the system IS sold out, then the cost of unreliability
must include all appropriate business costs such as lost
gross margin plus repair costs, scrap incurred, etc. If the
system is NOT sold out, and make-up time is available in the
financial year, then lost gross margin for the failure
cannot be counted. The cost of unreliability is a
management concern connected to management’s two favorite
metrics: time and money.
Why:
In private
enterprise, failures must be considered from a financial
viewpoint and not by a gear-head approach of simply counting the
number of failures; you must speak the language of the
enterprise, which describes events by monetary measures over
a period of time. The annual cost for failures is usually
not stated in a clear-cut manner, nor are failure costs
summarized by system/sub-system to identify the weak links
in a monetary fashion. Building a
clear
Pareto distribution lets you attack the vital (high cost)
areas with an action plan to reduce failures (unreliability)
and to reduce the cost of unreliability.
When:
For a
new plant, this can be a design criterion to limit costs of
unreliability for competitive reasons in the
marketplace, i.e., by plan, the hidden costs of failures are
made obvious as a portion of the strategic plan. For an
existing plant, this can be an exercise in defining the cost
of unreliability and building a long term plan to reduce the
cost of failures as a portion of the tactical plan.
Where:
This activity is best performed with high
level involvement of the management team to provide
fundamental understanding of the size of the icebergs about
to rip out the underbelly of the plant and to involve the
organization in a plan to reduce the costs so that profits
are pushed upward because of the improvements. If the cost
of unreliability cannot be reduced, then the costs
become extra weight for the saddle bags in the race for
survival.
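A minimal sketch of the cost roll-up described in this entry: each block of the simplified series model gets an annual cost, lost gross margin is counted only when the plant is sold out, and the blocks are then ranked Pareto style. All block names, hours, and dollar figures are hypothetical.

```python
def cost_of_unreliability(downtime_hours, repair_cost, scrap_cost,
                          gross_margin_per_hour, sold_out):
    """Annual cost for one block of the series model.

    Lost gross margin is counted only when the plant is sold out,
    following the rule stated in the text.
    """
    lost_margin = downtime_hours * gross_margin_per_hour if sold_out else 0.0
    return lost_margin + repair_cost + scrap_cost


# Hypothetical annual figures per block: (downtime h, repair $, scrap $)
blocks = {
    "compressor": (40, 120_000, 5_000),
    "extruder": (90, 60_000, 30_000),
    "boiler": (10, 15_000, 0),
}
MARGIN_PER_HOUR = 2_000  # assumed gross margin, $/h

costs = {
    name: cost_of_unreliability(d, r, s, MARGIN_PER_HOUR, sold_out=True)
    for name, (d, r, s) in blocks.items()
}
# Pareto ranking: highest annual cost first.
pareto = sorted(costs.items(), key=lambda item: item[1], reverse=True)
for name, cost in pareto:
    print(f"{name:12s} ${cost:>10,.0f}")
```

The block at the top of the list is the money problem to work on first.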
Critical Items List-
What:
The
critical items list is a top level summary of problems/cost
used for discussions with management about key reliability
issues. The summary list converts technical details to a
summary of costs and time while placing the issues into a
Pareto distribution explained in terms of money and the
vital few problems to be solved for competitive reasons.
Why:
The
purpose of the critical items list is to focus management’s
attention on items that need to be resolved during the
design phase as a corrective action loop for influencing the
life time costs.
When:
The list
starts with the first design review as issues are disclosed
in design reviews for reliability.
Where:
The critical items list is presented to top
level management as issues to be accepted or resolved before
paper plans become steel and concrete.
Data-
What:
Data is the
informational energy which runs the reliability improvement
machine. Data is acquired at great cost. Data needs to be
retained and used to prevent future failure events. Proper
use of data provides an understanding of failure mechanisms
and prevents reoccurrence of bad events which cause safety
or high cost failures to occur. Reliability data requires
definition of a failure. Failures can be catastrophic
failures or slow degradation—you decide by defining the
failures. The units of measure for the data must be the
units of the degradation: sometimes hours, sometimes
miles, and so forth; in short, whatever motivates the
failure. Reliability always ceases with a failure or a
removal from service in some aged condition which then
generates a category of data called a suspension or censored
data. Data is information in the form of facts, figures, or
engineering databases which is obtained from engineering
tests, experiments, or actual operating conditions.
Reliability data is often incomplete as the exact times to
failure are rarely known or recorded with much precision so
that only partial information is available for analysis.
Reliability data comes in two forms: 1) age-to-failure data,
and 2) censored/suspended data such as occurs when unfailed
items are removed from service or when they fail due to a
different failure mode than we are studying—this is useful
information and part of the data set. Some data is better
than no data for resolving reliability issues.
Why:
Data is
the information that, when used in an informed manner, helps
prevent repetition of bad history and allows an enlightened
approach to rationally solving a reliability issue using
facts and figures. Intelligent use of data for reliability
issues provides the objective evidence needed for helping to
solve the root cause of failures.
When:
Databases
of reliability information from past experience are very
helpful for predicting future failure events. The data is
helpful if it is described as failure rates or, as the
reciprocal of failure rates, mean times to failure, which reduces
the information to an average failure rate or average time
to failure. The reliability data is particularly valuable
if retained for components as a Weibull database with shape
factor beta and scale factor eta.
Where:
The data is useful for understanding failure
modes, and for predicting future failures for a population
of equipment during the design stage and for predicting
future failures with subsequent increases in the aging of
equipment. The role of the reliability engineer is to
acquire the failure data and convert the data into useful
information for both current and future use.
Decision Trees-
What:
Most
business decisions have considerable uncertainty, which
implies at least two outcomes if you choose a course of
action. Making decisions in the face of uncertainty
requires the cost of taking action and its probability,
along with the cost of not taking action and the
probability of the occurrence. In most cases the
probabilities are not well known (maybe to one significant
digit) and the costs are not well known (maybe to $10000).
The quantitative assessment is called risk assessment. The
issue is to take these poorly identified quantities and devise
a strategy which can minimize exposure to risk for the
business. The graphical representation of the methodology is
called a decision tree, which is used to reach the expected values for the
decision to take/not-take action.
Why:
Most
business decisions have no exact answers, i.e., no black and
white answers but rather shades of grey. The use of the
tool is to help decide which course of action may be to the
advantage of the business given the best estimates that can
be made.
When:
Decisive
details will only be known in the future, yet decisions have
to be made today, so decision trees are tools to help
span from today into the future with the wisest
decisions that can be made from sketchy data.
Where:
If you have absolute data, use it. But most
decisions must be made with indecisive information which
requires decisions about the odds for a given event, usually
based on estimates—the wiser the estimate the better the
decision, taking into account the probabilities of the
outcomes and the money involved in the decision. Use this
tool when few details are available and you must be the
pioneer to cut through the forest to reach the promised land
of opportunity and profitable ventures.
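The expected-value comparison behind a take/not-take decision can be sketched as below. Every probability and cost is a hypothetical estimate of the rough quality the entry describes, and the two-outcome branches are the simplest possible tree.

```python
def expected_cost(branches):
    """Expected value of one decision branch.

    branches: list of (probability, cost) outcomes; probabilities must sum to 1.
    """
    total_p = sum(p for p, _ in branches)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * cost for p, cost in branches)


# Hypothetical numbers, known only to about one significant digit.
ACTION_COST = 50_000    # cost to replace the item now
FAILURE_COST = 400_000  # cost if the item fails in service

take_action = expected_cost([
    (0.02, ACTION_COST + FAILURE_COST),  # replaced, but fails anyway
    (0.98, ACTION_COST),                 # replaced and survives
])
no_action = expected_cost([
    (0.2, FAILURE_COST),  # left alone and fails
    (0.8, 0),             # left alone and survives
])
choice = "take action" if take_action < no_action else "do not take action"
print(f"take: ${take_action:,.0f}  not-take: ${no_action:,.0f}  -> {choice}")
```

Because the inputs are soft, it is worth rerunning the comparison with the probabilities nudged up and down to see whether the choice flips.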
Dependability-
What:
The
International Electrotechnical Commission (IEC)
defines dependability as “Dependability describes the
availability performance and its influencing factors:
reliability performance,
maintainability performance and
maintenance support performance.”
MIL-HDBK-338 defines dependability differently as a
measure of the degree to which an item is operable and
capable of performing its required function at any (random)
time during a specified mission profile, given that the item
is available at mission start. (Item state during a mission
includes the combined effects of the mission-related system
R&M parameters but excludes non-mission time; see
availability.) Dependability is related to reliability
with the intention that dependability would be a more
general concept than the measurable issues of reliability,
maintainability, and maintenance.
Why:
The key
dependability issue is to make equipment and processes work as
advertised, that is, without failure. Dependability aims
at facilitating co-operation by all parties concerned
(supplier, organization, and customer) by fostering an
understanding of the dependability needs and value to
achieve the overall dependability objectives, so it involves
harmonizing conflicting issues. Dependability is better
judged from the viewpoint of the end user of the equipment or
system than from the designer’s viewpoint or the maintainer’s
viewpoint. From a system effectiveness viewpoint,
reliability and maintainability provide system availability
and dependability.
When:
You cannot
repair yourself to happiness with a failure prone system as
the failure prone system will be viewed as lacking
dependability to function as required when you need it.
Thus dependability is viewed over the longer term and not in
convenient snap-shots and dependability also involves life
cycle cost issues.
Where:
Reliability contributes directly to uptime by
avoiding failures whereas maintainability contributes
directly to reducing downtime by faster repairs. Thus
reliability and maintainability jointly provide impact on
dependability of the system. Dependable systems must be
ready to function, in an operable state, to produce the
desired output, upon demand by the end user, at the
specified quantity and quality of output.
Design Reviews For Reliability-
What:
Specific
questions to ask the design engineers during a review
specifically for reliability using failure data from
operations and maintenance are: 1) show the calculated
availability for the system based on a
RAM model, 2) show the calculated number of failures
during the specified mission time between turnarounds based
on a reliability and maintainability (RAM) model, 3) show
details of
FMEA studies, 4) show details of
FTA calculations, 5) show the calculated mean times
between downing events, 6) show the calculated mean time
between cutbacks from full production capability and losses
thus incurred, 7) show the
QFD matrix and details, and 8) show the calculated
cost of unreliability.
Why:
Design
reviews should demonstrate by calculation or through the use
of models and reliability tools that the system is capable
of achieving the design objectives rather than making a giant
leap of faith that all will be well and good.
When:
Design
reviews for reliability should be a part of the design
process starting with conceptual designs and ending when the
drawings are revised for the as-built system.
Where:
This is a logical extension of the design process to “show me”
rather than “tell me” how the system will function, and it is
performed as a portion of the up-front design-by-the-numbers
process.
Effectiveness-
What:
The
potential or actual probability of a system to perform a
mission for a given level of performance under specified
operating conditions defined as the product of
reliability*availability*maintainability*capability.
Many variants of the effectiveness equation exist, e.g., OEE,
and others.
Why:
The
effectiveness equation defines the ability of a product,
operating under specified conditions, to meet operational
demands when called upon. This is a practical measure of
how well the system is performing—not how well we want it to
perform but a practical measure of how it’s doing. Since
all the elements are measured between 0 and 1, the elements
of the equation quickly draw the eye to where opportunities
exist for making improvements.
When:
The
effectiveness equation is useful for trade-off boxes for
various alternatives when plotted on an X-Y scale for
effectiveness vs net present value (NPV) for improvement
alternative selections. For the elements:
reliability defines the probability of a failure free
interval (or the complement unreliability which describes
the probability of failure),
availability defines the probability of the system
being up and alive to handle the demand (or the complement,
unavailability which describes the probability of the system
being down),
maintainability defines the probability of making
repairs within the allowed repair standard,
capability defines the probability of production
achieving the desired production results [a measure of how
well the product performs compared to the standard] and
frequently it is described as the product of efficiency *
utilization where
efficiency is an output/input relationship such
as (output achieved)/(the standard required) and
utilization is how time is used such as (direct
labor)/(direct labor + labor lost)
[in the old days, if this index
decreased to as low as 80% we went berserk—today,
you can’t get this high because of
wasted time when noses are not to the grindstone!!!].
Where:
It is used to describe new systems and old
systems performance. Consider this example for
effectiveness: If we are comparing a heavy duty truck
versus a sports car for transportation, the truck may be
more effective for heavy loads whereas the sports car may be
more effective for acceleration and high speeds; neither is
defined by the effectiveness equation until the mission is
defined.
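The effectiveness equation is simple enough to sketch directly. The element values below are hypothetical, and capability is taken as efficiency * utilization as described in this entry.

```python
from math import prod


def capability(efficiency, utilization):
    """Capability as the product of efficiency and utilization."""
    return efficiency * utilization


def effectiveness(elements):
    """Product of the elements, each a value between 0 and 1."""
    for name, value in elements.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return prod(elements.values())


# Hypothetical values for one system and one defined mission.
elements = {
    "reliability": 0.90,
    "availability": 0.95,
    "maintainability": 0.98,
    "capability": capability(efficiency=0.85, utilization=0.80),
}
overall = effectiveness(elements)
weakest = min(elements, key=elements.get)  # where the eye should be drawn
print(f"effectiveness = {overall:.3f}, weakest element: {weakest}")
```

Listing the elements side by side is what makes the improvement opportunity stand out; in this made-up case it is capability, not the usual RAM elements.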
Environmental Stress Screening (ESS)-
What:
A series
of screens is conducted under environmental stresses to
disclose weak parts and workmanship defects which require
correction. This requires an understanding of burn-in
testing and ESS, both of which identify weak
points and eliminate them by motivating early failures.
Burn-in is usually a long process of operating under load(s)
and at fixed temperature (in short, this is a special case
of ESS), or it can be operated at varying loads and
accelerated temperatures to achieve a shorter burn-in
period, whereas ESS is a scientifically planned and
conducted test which is usually conducted under accelerated
loads to produce the same test/use results in a shorter
period of time by increasing the stress on the components or
assemblies. The objective of these screens is to produce a
failure free product when released into operations. ESS is
not intended as a test to validate compliance to a design;
rather, it is intended to force latent defects into becoming
visible failures before the end user finds them in day-to-day usage.
Why:
The
extremes of operating conditions such as high power levels,
high temperatures, high vibration levels, etc. produce
failures not anticipated from testing at nominal
conditions. Generally ESS is directly applicable and
interpreted to be applicable to electrical/electronic
equipment, however the same issues/concepts apply to
mechanical equipment when the stressing conditions are
loads/pressures/temperatures/vibrations/thermal shocks/etc.,
so as for all reliability issues—think broadly!
When:
When
acquiring data, the tests are done upfront of production.
When controlling early failures that would be discovered by
the end user, these tests are done as a portion of the
production process to eliminate weak units to control
warranty costs and improve customer satisfaction.
Where:
Some tests are conducted in the laboratory
for quick results and then the data is used to control
product testing/release for the purpose of limiting costs
and preventing the loss of customers from unsatisfactory
performance in the field.
Events/Incidents-
What:
Events/incidents are single occurrences,
especially particularly significant ones,
that result in a failure from a non-aging mechanism for
reliability purposes. Usually the event/incident results in
a serious consequence: the loss of functional life of a
component or system. The death of the device must be
recorded as censored (suspended) data.
Why:
For
reliability purposes, the component, device,
subassembly, or system has been a success up to the point in
life where a failure from a non-aging event took place. This
means the event-age was a success (up to the point it was
killed by an event/incident) and inclusion of the data is
required as censored/suspended data; this is important data.
When:
Include
the suspended/censored data into every analysis. Young
suspensions/censored data have little impact on the results
of an analysis but old suspensions have a major effect on the
analysis.
Where:
The data is used for MTBF/MTTF analysis and
particularly for Weibull analysis.
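To illustrate why suspensions belong in the analysis, here is a minimal MTTF estimate that credits suspended units for the age they survived. It assumes a constant failure rate (total accumulated age divided by the number of true failures), which is a simplification; the ages are hypothetical.

```python
def mttf_with_suspensions(failure_ages, suspension_ages):
    """Estimate mean time to failure when some units are suspended.

    Assumes a constant failure rate: total accumulated age of all units
    divided by the number of true failures.
    """
    if not failure_ages:
        raise ValueError("at least one failure is needed")
    total_age = sum(failure_ages) + sum(suspension_ages)
    return total_age / len(failure_ages)


failures = [1200, 1500, 1800]  # hypothetical ages at failure, hours

base = mttf_with_suspensions(failures, [])
with_young = mttf_with_suspensions(failures, [100])   # young suspension
with_old = mttf_with_suspensions(failures, [3000])    # old suspension
print(f"base {base:.0f} h, young {with_young:.0f} h, old {with_old:.0f} h")
```

The young suspension barely moves the estimate while the old one shifts it substantially, which matches the remark above about their relative influence.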
Exponential Distribution-
What:
The
probability of survival and of failure of components or
equipment is under the condition of chance failure which
means a constant instantaneous failure rate where the
die-off rate is the same for any surviving (unfailed)
population. An old part is as good as a new part. For any
survivors in this memory-less system that have survived to
time t, a certain percent of the survivors will die in a
specified interval of time such as 2*t. The reliability of
the system is often described by the exponential
distribution because many times a system is made-up of mixed
failure modes which in the aggregate will function like a
constant failure rate system. The reliability of
exponential distributions is described mathematically as
R(t) = e^(-λt) = e^(-t/Θ) where t
is the mission time, λ is the failure rate, and Θ is the
mean time, given that λ = 1/Θ. The exponential distribution
is frequently used as a first approximation to describe
reliability based on a simple failure rate or a simple mean
time to failure—particularly if the system or component has
multiple failure modes.
Why:
The
constant hazard rate, λ, is usually a result of combining
many failure rates into a single number.
When:
The
exponential distribution is frequently used for reliability
calculations as a first cut based on its simplicity to
generate the first estimate of reliability when more detailed
failure modes are not described.
Where:
In electronic systems (which can have many
different types of failure modes and the fact that any
electrical/electronic system is an amalgam of many different
components) the simple assumption is that the
electrical/electronic package will have a constant failure
rate system defined by the exponential distribution. When
in doubt about the failure mechanisms, it is common to
assume use of the exponential distribution with its
constant failure rate for simplicity.
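A short sketch of the exponential reliability calculation, including a check of the "old part is as good as a new part" property. The mean time and mission times are hypothetical values chosen for illustration.

```python
from math import exp


def exponential_reliability(mission_time, mean_time):
    """R(t) = exp(-t / mean_time), i.e. exp(-failure_rate * t)."""
    if mean_time <= 0:
        raise ValueError("mean time must be positive")
    failure_rate = 1.0 / mean_time
    return exp(-failure_rate * mission_time)


MEAN_TIME = 1000.0  # hypothetical mean time to failure, hours

# Reliability for a 100 h mission.
r_mission = exponential_reliability(100.0, MEAN_TIME)

# Memoryless check: a unit that has already survived 500 h has the same
# chance of surviving another 200 h as a brand-new unit.
r_new = exponential_reliability(200.0, MEAN_TIME)
r_old = (exponential_reliability(700.0, MEAN_TIME)
         / exponential_reliability(500.0, MEAN_TIME))
print(f"R(100 h) = {r_mission:.4f}, new = {r_new:.4f}, aged = {r_old:.4f}")
```

Note that a mission as long as the mean time leaves only about a 37% chance of survival, which often surprises people who read a mean time as a guaranteed life.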
Failure-
What:
Failure is
the loss of function when you needed the function to occur.
Failures for reliability purposes must be precisely defined
so they are recorded correctly. Much life data is
incomplete because failures are mixed-up with
censored/suspended data where aged items may not have failed
or they represent removals from service before failure, or
they have not yet failed for the mode of failure under
study. In short, these censored/suspended items represent
successes and are a portion of the data set for study.
Why:
We study
failed items for the same reason we do autopsies on
humans—we want the data and we want it categorized correctly
for making important decisions. Failures require: 1) a time
origin which must be unambiguously defined, 2) a scale for
measuring the passage of time/starts/stops/etc. which
motivates failure, and 3) the meaning of failure must be
entirely clear for recording the event.
When:
Failure
data must be recorded as it occurs to prevent loss of
information.
Where:
The CMMS system is frequently where most data
resides but usually in crude fashion. The failure data is
often transferred into the
FRACAS system for converting the symptoms of the failure
into the root causes of failure. The failure data must be
converted into action items for making management decisions
about future failures and the corrective action needed.
Failure Forecast-
What:
Failure
forecasting is a projection of failures into the future
based on assumed or documented failure details. It is also
known as risk analysis of future failures. For a constant
failure rate system this is very straightforward. However,
for complicated failure modes where the failure rate
increases with time (wear out failure modes) or where
failure rates decrease with time (infant mortality failure
modes) this becomes a more complicated analysis, the
Abernethy Risk, which is described in
The New Weibull Handbook and implemented in the software
package
WinSMITH Weibull for predicting future failures. Likewise,
reliability block diagrams are useful for predicting
future failures when the authentic failure details are
supplied to the Monte Carlo models.
Please note manufacturers follow two general strategies for
their equipment:
1) build the equipment to avoid failures even though
this increases the original capital costs, or
2) build equipment and sell the original equipment at a
low cost (or even a break-even cost)
expecting to make profits with the sale of
replacement parts.
Thus for end users of the procured equipment, it is
important to know the forecasted failures in the face of
supplier protests that “our equipment never fails”. In that
case ask to see the sales of spare parts for similar
equipment and an estimate of the number of units working to
get a crude estimate of the strategy employed by the
equipment supplier.
A failure is an event which renders equipment non-useful
for the intended or specified purpose during a designated
time interval. The failure can be sudden, partial,
one-shot, intermittent, gradual, complete, or catastrophic.
The degree of failure can be degradation or gradual, sudden,
or one-shot, from weakness, from imperfections, from misuse,
and so forth.
A failure mechanism includes a variety of physical processes
which result in failure from chemical, electrical, thermal,
or other insults.
Why:
Future
failures cost money and frequently increase the risk of
safety or environmental problems. For manufacturers, the
forecasted failures predict impending high costs for
warranty expenses which can make/break a company. With a good
failure forecast, you can anticipate expected failures now
(after x-usage), future failures when failed units are not
replaced, and future failures when failed units are replaced
either with the same failure modes or with differently
designed components with different failure details.
When:
This
analysis is wisely performed during the design of the
equipment; however, many surprises arise from different
failure modes built into the assembled product or incurred
through unanticipated usage in operations.
Where:
Generally this analysis is made during the
up-front design effort, often with much disbelief that the products
could be “this bad”. Follow-up analysis occurs when
unexpected failure modes arise during operation of the
equipment and cause loss of service of the equipment and
high costs for the end users.
Failure Rates-
What:
Failure
rates, in the simplest form, are
Σ(number of failures)/Σ(time
in use), which is the reciprocal of the mean
time to/between failures. For more sophisticated failure
databases such as Weibull databases, the failure rates can
be disclosed without giving away proprietary data such as
the shape factor, beta, which tells the failure mode for the
equipment.
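Taking the failure rate as the reciprocal of the mean time between failures, the arithmetic can be sketched as follows. The function names and the pump data are hypothetical and serve only as an illustration of the simplest form of the calculation.

```python
import math


def failure_rate(hours_in_use, failure_counts):
    """Failures per hour: total failures divided by total time in use."""
    total_hours = sum(hours_in_use)
    if total_hours <= 0:
        raise ValueError("total time in use must be positive")
    return sum(failure_counts) / total_hours


def mean_time_between_failures(hours_in_use, failure_counts):
    """Hours per failure: the reciprocal of the failure rate."""
    total_failures = sum(failure_counts)
    if total_failures == 0:
        return math.inf  # no failures observed yet
    return sum(hours_in_use) / total_failures


# Hypothetical history for three pumps over one reporting period.
pump_hours = [8000.0, 8760.0, 6000.0]
pump_failures = [2, 1, 3]

rate = failure_rate(pump_hours, pump_failures)
mtbf = mean_time_between_failures(pump_hours, pump_failures)
print(f"failure rate = {rate:.6f} failures/hour, MTBF = {mtbf:.1f} hours")
```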
Why:
Simple
failure rates are a precursor of the maintenance events and
production interruptions that will occur in the future,
which drive up costs and cause chaos.
When:
Failure
rates derive from the history of operation or from well
known data sources such as OREDA, IEEE 500, IEEE 493, EPRI,
and other sources listed in
reading lists for reliability including
Weibull databases.
Where:
The failure rates are used as an awareness
criterion for the average person, just as you use automobile
fuel consumption rates for understanding the health of your
automobile as well as anticipating your
weekly/monthly/annual out-of-pocket expenditures for
gasoline or diesel fuel. The failure rates drive the
maintenance interventions, spare parts, and maintenance costs
for the Maintenance Department. Similarly, they predict the
interruptions to the process, which lead to misses on promised
deliveries and result in negative variances for production
costs. In short, failure rates are precursors of the misery
expected for the organization.
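To show how a failure rate turns into a budget number, here is a minimal sketch that assumes a constant failure rate over the year. Every name and figure in it (repair cost, downtime per failure, lost margin per hour) is a hypothetical placeholder, not data from any of the sources listed.

```python
def expected_annual_failures(failure_rate_per_hour, operating_hours_per_year, units=1):
    """Expected failure count per year under an assumed constant failure rate."""
    return failure_rate_per_hour * operating_hours_per_year * units


def expected_annual_cost(failures_per_year, repair_cost,
                         downtime_hours_per_failure, lost_margin_per_hour):
    """Repair cost plus lost production margin for the expected failures."""
    cost_per_failure = repair_cost + downtime_hours_per_failure * lost_margin_per_hour
    return failures_per_year * cost_per_failure


# Hypothetical plant: 10 units, each failing on average once per 4000 hours.
failures = expected_annual_failures(1.0 / 4000.0, 8000.0, units=10)
cost = expected_annual_cost(failures, repair_cost=5000.0,
                            downtime_hours_per_failure=8.0,
                            lost_margin_per_hour=1000.0)
print(f"expected failures per year: {failures:.1f}")
print(f"expected annual cost: {cost:,.0f}")
```

Ranking equipment by this kind of cost figure is one simple way to build the money-based Pareto list.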
Fault Tree Analysis-
What:
Fault tree
analysis (FTA) is a top-down process that defines the top
level problem and then, through a deductive approach using
parallel and series combinations of possible malfunctions,
finds the root of the problem so it can be corrected before the
failure occurs. The reliability tool can be used as
a qualitative or a quantitative method.
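For the quantitative use, a minimal sketch of how probabilities combine through the gates of a small tree is given below. It assumes the basic events are statistically independent, which rules out the common cause failures a fuller analysis would treat separately. The tree, the event names, and the probabilities are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if every input event occurs."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical top event: loss of cooling, which occurs if the power
# supply fails OR both redundant pumps fail.
p_power_fails = 0.01
p_pump_a_fails = 0.05
p_pump_b_fails = 0.05

p_both_pumps_fail = and_gate([p_pump_a_fails, p_pump_b_fails])
p_top_event = or_gate([p_power_fails, p_both_pumps_fail])

print(f"P(both pumps fail) = {p_both_pumps_fail:.6f}")
print(f"P(loss of cooling) = {p_top_event:.6f}")
```

In this example the single power supply dominates the top event even though each pump is individually less reliable, which is the kind of weak link the tree is meant to expose.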
Why:
The tool
aids the design process, shows weak links that cause
failures, and in the critical legs of the trees helps to
define maintenance strategies for which pieces of equipment
and processes should be defended with the greatest
maintenance vigor to prevent “Murphy” from shutting down the
process or causing serious safety issues. The technique
provides a graphical aid for the analysis and it accommodates many
failure modes including common cause failures. Results from
an FTA are usually more pessimistic than those from other analysis tools
such as
RBDs, as you can see from a study of the Space Shuttle
reliability analysis where each system is studied by
multiple reliability tools because of the high cost/profile
of failures.
When:
FTA is
widely used in the design phase of nuclear power plants,
subsea control and distribution systems, and for oversight
in layers of protection studies for process safety
and loss control in chemical plants and refineries, so as to
prevent accidents and control the costs of risks. The
technique is helpful for identifying critical fault paths,
observing vague failure combinations before they occur in
reality, comparing alternate designs for safety, and setting
a methodology to provide management with a tool to evaluate
the overall hazards in a system and avoid single sources of
critical failures. Finally, when thinking top down about
failures and where/how they can occur, the methodology gives
a diagram for setting maintenance strategies for protecting
key pieces of equipment/processes to prevent failures.
Where:
FTA is helpful for defining potential event
sequences and potential incidents, evaluating the incident
consequences of outcomes, and estimating the risks of events
occurring. FTAs work in the design room and on the
operating floor where first hand kn