| Brian
wrote:
Before the conversation
between Ray and Peter deteriorated they were discussing the reliability
of a proposed new system. In the course of that conversation they used
the phrase Mean Time Between Failure
(MTBF) quite a lot. I'm sure they both knew what they were talking
about, but I'm also pretty sure that many of the other members of the
forum did not fully understand the term. I have to confess that it was
not until this year that I really started to understand what it meant
even though I have been hearing and using the phrase for decades. I always
took it to mean what it said -- the
amount of time, on average, which a device would operate without a failure.
Wrong!
For those who care to read on, here's my "I
learned about MTBF from that!"
story.
We started flying the fleet of airplanes I worry
about 10 years ago.
Recently we started hearing from the pilots and
maintainers of the airplanes that a particular widget was failing an awful
lot. In the process of investigating the problem, our reliability folks
provided the MTBF of the widget based on field history.
I looked at the number they generated and complained,
"Wait. This can't be right. You say the MTBF is X, but the high time
airplane in the fleet only has 60% of X hours and most have much less.
If the MTBF is X, how come these folks are complaining? Except for the
odd widget which has a premature failure, they should not be seeing any
problem."
The reliability guy comes back and says, "Our
number is right. MTBF is equal to total fleet
hours divided by number of failures. Think about it."
Well I'd known that equation for quite some time,
but it never dawned on me that the equation gives something different
from my intuitive view of what MTBF meant. I guess I was in the mode of,
"I've got my mind made up; don't bother me with facts."
In order to "think
about it," consider a hypothetical airline introducing a new aircraft
into service. Let's say they take delivery of one airplane a month and
fly each airplane in the fleet 500 hours a month. Let's further hypothesize
that the airplanes all have a widget that fails after exactly
6000 hours. That
takes the randomness of the failure out of the picture. By my intuitive
definition, the MTBF should be 6000 hours, but let's see what the reliability
guy tells us.
At the end of
the first year the fleet has 500 + 1000 + 1500 + ... + 6000 =
39000 hours with
no failures. Without a failure, my reliability guy can't come up with
a value for MTBF, but as soon as the next year starts, that first airplane
goes over the 6000 hour mark and the first failure comes, and he can start
generating numbers. By the end of the second year there have been 12 failures
-- one on each of the original 12 airplanes. The total hours on the fleet
are 39000 from the first year plus 39000 from the 12 new airplanes plus
500 x 12 x 12 hours on the airplanes we got the first year.
That gives a total of 150,000 hours on the fleet.
150,000 divided by 12 gives a MTBF for the widget of 12,500 hours. This
is close to the situation my real life fleet has; i.e., the statisticians
are saying the MTBF is more than the flying time of my high time airframe,
but I've had a failure a month for the last 12 months.
In the third year the MTBF works out to 9250 hours.
Eventually the number from the reliability guy and my intuitive number
will get pretty close, but ......
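For anyone who wants to check the arithmetic in the hypothetical airline above, here is a minimal Python sketch. The function name fleet_mtbf and its parameters are made up for illustration. It assumes one delivery at the start of each month, 500 hours per airplane per month, a widget that fails the moment it goes past 6000 hours, and (my assumption, needed for the third-year figure) that a failed widget is replaced at once with a new one.

```python
def fleet_mtbf(months, hours_per_month=500, widget_life=6000):
    """Return (total fleet hours, failures, fleet MTBF) after `months` months.

    Fleet MTBF is total fleet hours divided by number of failures,
    or None while there have been no failures.
    """
    total_hours = 0
    failures = 0
    for delivered in range(1, months + 1):
        # Hours on the airplane delivered at the start of month `delivered`.
        hours = hours_per_month * (months - delivered + 1)
        total_hours += hours
        # A widget sitting at exactly widget_life hours has not failed yet;
        # each full widget_life exceeded counts as one failure.
        failures += (hours - 1) // widget_life
    mtbf = total_hours / failures if failures else None
    return total_hours, failures, mtbf


# Years 2 and 3 reproduce the 12,500 and 9,250 hour figures above;
# the later years show how slowly the number drifts down toward 6000.
for year in (1, 2, 3, 5, 10, 20):
    print(year, fleet_mtbf(12 * year))
```

The fleet number stays above 6000 because the denominator counts only failures while the numerator also counts all the hours on widgets that have not failed yet.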
So the message here for the line pilot is that
when the engineers say the MTBF is so high you don't have to really worry
about it, worry about it anyway. The more reliable the widget is and the
more recent its introduction to the fleet, the less likely the MTBF derived
from fleet history is to be accurate. If we are talking about the MTBF
on a DC-3 tail wheel, I'd believe it. If we are talking about a new gizmo
to fly the airplane from the ground, I'm sceptical. That's
where redundancy and fallback modes must come into the reliability story.
End of story.
Now a question for the engineers: is there a statistic
which corresponds to my original idea of MTBF? It would be derived by
dividing the operating time of all the FAILED widgets by the number of
failures. If it exists, what is it called? Why is it not used more? In
my hypothetical case it would have given the right number after the first
failure. In a more realistic case it would approach the real MTBF from
below (rather than above) which would still be handy in safety analyses.
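To make the question concrete, here is a rough Python sketch that computes both numbers for the hypothetical fleet: total fleet hours divided by failures, and operating hours of only the FAILED widgets divided by failures. The names hypothetical_fleet and compare_estimates are invented for this illustration, and immediate replacement of a failed widget is assumed.

```python
def hypothetical_fleet(months, hours_per_month=500, widget_life=6000):
    """List every widget ever fitted as (operating_hours, has_failed)."""
    widgets = []
    for delivered in range(1, months + 1):
        hours = hours_per_month * (months - delivered + 1)
        # Each time the airplane passes widget_life hours, one widget
        # has failed and a fresh one has been fitted.
        while hours > widget_life:
            widgets.append((widget_life, True))
            hours -= widget_life
        widgets.append((hours, False))  # the widget still flying
    return widgets


def compare_estimates(widgets):
    """Return (fleet hours / failures, failed-widget hours / failures)."""
    failures = sum(1 for _, failed in widgets if failed)
    if failures == 0:
        return None, None
    all_hours = sum(hours for hours, _ in widgets)
    failed_hours = sum(hours for hours, failed in widgets if failed)
    return all_hours / failures, failed_hours / failures


for months in (13, 24, 36):
    print(months, compare_estimates(hypothetical_fleet(months)))
```

In this fixed-life case the failed-widget figure is 6000 hours from the first failure onward. Because failed-widget hours can never exceed total fleet hours, the second number is always at or below the first.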
Still learning, Mike
|
| BP King
wrote:
<<<In the course of that conversation
they used the phrase Mean Time Between Failure (MTBF) quite a lot. I'm
sure they both knew what they were talking about, but I'm also pretty
sure that many of the other members of the forum did not fully understand
the term. >>>>
Let me throw a little fuel on the flame of discussion with a slightly
different take on safety, reliability, and MTBF:
First: Reliability numbers,
MTBF numbers, and any other numbers you care to throw out are merely indicators
of safety.
They are no different from price-to-earnings ratios used in financially
rating a company's worth. Just because the numbers are great doesn't mean
that it's a done deal, that it can't miss, or that it is safe. They do mean
that, for the type of analysis being done, the result looks "promising".
And just as there are umpteen ways to price a company's value, there are
umpteen ways to evaluate safety with variations occurring in each regulator,
manufacturer and engineer.
In my humble opinion, safety is still an art form.
It involves corporate and individual culture, commitment, commitment,
and more commitment at every level of design, test, and manufacturing.
For all the science around it, if the engineer, assembler, technician,
or operator is upset, lazy, incompetent, or just doesn't care; safety
is quickly lost. We use scientific tools in all of these areas to try
to control it, but are really only achieving scientific measurement of
its "probability of being there".
And as you have recently seen, even that is hotly disputable.
Second: Aircraft system safety
requirements are not based on, nor do they even reliably (pun intended)
correlate with LRU MTBF numbers. I have attempted to provide a researched
discussion of this below, but it is lengthy and has lots of interesting
(at least to safety engineers) official guidance so let me try to cut
to the bottom lines:
1) MTBF is an LRU (line-replaceable unit) measurement. It
reflects any failure. It is
usually defined in procurement contracts as a quality indicator having
to do with monetary or other incentives to lower cost of ownership,
warranty costs, etc. In most
cases (especially military) a recorded failure can even be caused by a box
having to be removed to gain access to another box.
2) Aircraft safety reliability requirements
are imposed upon a function. That function may cover many boxes or only
a small part of a single box. It would be unusual if it actually occurred
at the physical boundary of a box. Failures
are evaluated to determine their effect on the aircraft and are collected
and separated into "buckets" labeled "No
Effect", "Minor", "Major", "Hazardous/Severe
Major", and "Catastrophic" along with their probability
of failure. A specific limit of probability is set for each of the buckets
(1E-9 typical for "Catastrophic", etc.). The
system gets certified only if the aggregate (summed) probability
of all the failures in each of the buckets is below that bucket's assigned
limit.
3) It is possible and even common
for equipment performing catastrophic tasks to have terrible MTBFs,
provided the failures are detected, alerted to the crew, cause no harm or undue
excitement, and an operational alternative is available. Availability
(much closer to MTBF) of an auto-land function could be as low as 99%
(1E-2 unavailability) and it would probably have a good chance of being certified if it had only
extremely improbable failures that could cause harm or prevent continuation
of the approach to a safe landing and rollout below the alert height.
4) It is possible, but not common, for single string systems to provide
catastrophic functions if it can be shown that they have no failure mode
which can stimulate the catastrophic function and all existing failure
modes meet the aggregate bucket requirements described above.
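The bucket bookkeeping in point 2) can be sketched in a few lines of Python. This is only an illustration: the function name check_buckets and the example failure conditions are hypothetical, and of the limits below only the 1E-9 figure for "Catastrophic" comes from this post. The other limits are placeholder values assumed for the example.

```python
# Limit on the summed probability (per flight hour) for each bucket.
# Only the "Catastrophic" value is taken from the post; the rest are
# assumed placeholder values.
LIMITS = {
    "Minor": 1e-3,
    "Major": 1e-5,
    "Hazardous/Severe Major": 1e-7,
    "Catastrophic": 1e-9,
}


def check_buckets(failure_conditions, limits=LIMITS):
    """failure_conditions: iterable of (name, bucket, probability_per_hour).

    Returns {bucket: (summed probability, True if below the bucket limit)}.
    """
    totals = {bucket: 0.0 for bucket in limits}
    for _name, bucket, probability in failure_conditions:
        if bucket == "No Effect":
            continue  # no limit applies to this bucket
        totals[bucket] += probability
    return {b: (total, total < limits[b]) for b, total in totals.items()}


# Hypothetical failure conditions, purely for illustration.
example = [
    ("display flicker", "No Effect", 1e-4),
    ("loss of one data bus", "Major", 2e-6),
    ("undetected wrong steering command", "Catastrophic", 6e-10),
    ("uncommanded surface movement", "Catastrophic", 6e-10),
]
for bucket, (total, ok) in check_buckets(example).items():
    print(bucket, total, "OK" if ok else "OVER LIMIT")
```

Note that each of the two catastrophic conditions in the example is below 1E-9 on its own, but their sum is not, so that bucket is over its limit.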
I posted something of explanation about how the FAA looks at system safety
certification some years ago on Bluecoat that might be helpful, along
with some follow-on comments by Don Armstrong (FAA -now retired- Flight
Test).
.... well it was my intent to give the URL to get the original posting
off the Bluecoat site, but I couldn't get the search engine to work for
me. So much for being a computer type! I will either get the URL or re-post
it.
Finally, US system safety has its roots in the military and NASA, but
was probably most formalized by the Nuclear Regulatory Commission in the
late 1950's. When I started work as a NASA Mission Controller we were
using the NRC document along with some NASA generated material. Soon afterward
NASA went entirely with its own dialect of how to get safe. It's a small
community and somewhat incestuous in that the Nuclear Regulatory Commission
(now DOE) was knocking at NASA's safety door after Three Mile Island.
When I first started working aircraft, there was my old friend the NRC
document! It was then replaced with AC25.1309-1, then 25.1309-1A and (soon?)
the SAE documents as the aviation world creates its own dialect to fit
its environment. Should we be at all surprised if there are significant
differences between medical, commercial, nuclear, and aviation safety?
Not at all. Each has its own unique challenges (threats) and environments.
However, for licensed products such as aircraft and nuclear reactors
there is a clear need to satisfy that regulatory agency's safety system
and not much incentive to also fiddle with the other guy's requirements.
Probably similar to pilots not getting commercial driver's licenses in
order to taxi aircraft.
Here is a little more detail (maybe more than you want!) in how system
safety gets handled in civil aviation.
Working from the top downward:
The actual requirements are very simply
stated in (USA) 14 CFR 25.1309 which is harmonized with the JAR. These
eight simple sentences, and especially Parts (b) and
(d), probably cause more trees to die per word than anything else in the
entire rulebook!
Here they are:
|
Part (a):
The equipment, systems, and installations
whose functioning is required by this subchapter, must be designed
to ensure that they perform their intended functions under any
foreseeable operating condition.
Part (b):
The airplane systems and associated components,
considered separately and in relation to other systems, must
be designed so that:
(1) The occurrence of any failure
which would prevent the continued safe flight and landing of
the airplane is extremely improbable, and (2)
The occurrence of any other failure conditions which would reduce
the capability of the airplane or the ability of the crew to
cope with adverse operating conditions is improbable.
Part (c):
Warning information must be provided
to alert the crew to unsafe system operating conditions, and
to enable them to take appropriate corrective action.
Systems, controls, and associated monitoring
and warning means must be designed to minimize crew errors which
could create additional hazards.
Part (d):
Compliance with the requirements of paragraph
(b) of this section must be shown by analysis, and where necessary,
by appropriate ground, flight, or simulator tests.
The analysis must consider
(1) Possible modes of failure, including
malfunctions and damage from external sources.
(2) The probability of multiple failures
and undetected failures.
(3) The resulting effects on the
airplane and its components, considering the stage of flight
and operating conditions, and
(4) The crew warning cues, corrective
action required, and the capability of detecting faults.
Parts (e) and (f): skipped as
not directly dealing with system safety analysis (they talk
about generating capacity vs loads for various engine-out conditions).
Part (g):
In showing compliance with paragraphs
(a) and (b) of this section with regard to the electrical system
and equipment design and installation, critical environmental
conditions must be considered.
For electrical generation, distribution,
and utilization equipment required by or used in complying with
this chapter, except equipment covered by Technical Standard
Orders containing environmental test procedures, the ability
to provide continuous, safe service under foreseeable environmental
conditions may be shown by environmental tests, design analysis,
or reference to previous comparable service experience on other
aircraft.
**** End of 25.1309 ****
|
Now to show you ops guys just why aviation safety guys can get into a
good yelling match, the FAA's "advisory circular"
AC 25.1309-1A "System Design And Analysis" just takes the (b),
(c), and (d) from above and provides 18 pages of explanation. I hope you
will excuse the length of the following quote from the AC, but it is key
to understanding aircraft safety analysis:
|
*** begin AC 25.1309-1A excerpt ****
5.
THE FAA FAIL-SAFE DESIGN CONCEPT.
The Part 25 airworthiness standards are based on, and incorporate,
the objectives, and principles or techniques, of the fail-safe
design concept, which considers the effects of failures and
combinations of failures in defining a safe design.
a. The following basic objectives pertaining to failures
apply:
(1) In any system or subsystem, the failure of any single
element, component, or connection during any one flight
(brake release through ground deceleration to stop) should
be assumed, regardless of its probability.
Such single failures should not prevent continued safe
flight and landing, or significantly reduce the capability
of the airplane or the ability of the crew to cope with
the resulting failure conditions.
(2) Subsequent failures during the same flight, whether
detected or latent, and combinations thereof, should also
be assumed, unless their joint probability with the first
failure is shown to be extremely improbable.
b. The fail-safe design concept uses the following
design principles or techniques in order to ensure a safe
design.
The use of only one of these principles or techniques is
seldom adequate. A combination of two or more is usually needed
to provide a fail-safe design; that is, to ensure that major
failure conditions are improbable and that catastrophic failure
conditions are extremely improbable.
(1) Designed Integrity and Quality, including Life-Limits,
to ensure intended function and prevent failures.
(2)
Redundancy or Backup Systems to enable continued function
after any single (or other defined number of)
failure(s); for example,
two or more engines, hydraulic systems, flight control systems,
etc.
(3) Isolation of Systems, Components, and Elements so that
the failure of one does not cause the failure of another.
Isolation is also termed independence.
(4) Proven Reliability so that multiple, independent failures
are unlikely to occur during the same flight.
(5) Failure Warning or Indication to provide detection.
(6) Flight-crew Procedures for use after failure detection,
to enable continued safe flight and landing by specifying
crew corrective action.
(7) Checkability: the capability to check a component's
condition.
(8) Designed Failure Effect Limits, including the Capability
to sustain damage, to limit the safety impact or effects
of a failure.
(9) Designed Failure Path to control and direct the effects
of a failure in a way that limits its safety impact.
(10) Margins or Factors of Safety to allow for any undefined
or unforeseeable adverse conditions.
(11) Error Tolerance that considers adverse effects of
foreseeable errors during the airplane's design, test, manufacture,
operation, and maintenance.
*** end AC 25.1309-1A excerpt ****
|
By the way, the AC got everybody straightened out and working together
well enough that the SAE went off and produced not one, but two documents
on the same subject!
One explains the concepts and the other shows examples and guidance in
using many of the popular techniques. They are:
Certification Considerations for Highly-Integrated or Complex Aircraft
Systems (ARP 4754)
Guidelines and Methods For Conducting The Safety Assessment Process On
Civil Airborne Systems and Equipment (ARP 4761)
Now, if I still have any of you with me, there is a pretty good description
of what the FAA thinks they said (sometimes this is "constructively"
disputed in the real world) in the Preamble to the new requirements for
gas tank protection (see Federal Register May 7, 2001 - volume 66 number
88, pages 23085-23131). I have included the pertinent section below.
|
*** Start FAA Commentary FR 66 88 page 23108 ***
As for §25.1309, the commenter
appears to be confusing the objective of the rule (i.e., to
prevent the occurrence of catastrophic failure conditions that
can be anticipated) with
a conditionally acceptable means of demonstrating compliance,
as described in AC
25.1309-1A (i.e., that catastrophic failure conditions must
have an "average probability per flight hour" of less
than 1 x 10^-9).
Since this same misconception has presented itself many times
before, the following discussion is intended to clarify the
intent of the term "extremely improbable" and the
role of "average probability" in demonstrating that
a condition is "extremely improbable."
The term "extremely improbable"
(or its predecessor term, "extremely remote") has
been used in 14 CFR part 25 for many years. The objective of
this term has been to describe a condition (usually a failure
condition) that has a probability of occurrence so remote that
it is not anticipated to occur in service on any transport category
airplane. While a rule sets a minimum standard for all
the airplanes to which it applies, compliance determinations
are necessarily limited to individual type designs.
Consequently, all that has been required of applicants is a
sufficiently conservative demonstration that a condition is
not anticipated to occur in service on the type design being
assessed.
The means of demonstrating that the occurrence of an event
is extremely improbable varies widely, depending on the type
of system, component, or situation that must be assessed.
There has been a tendency, as evidenced by the comment, to
confuse the meaning of this term with the particular means used
to demonstrate compliance in those various contexts.
This has led to a misunderstanding that the term has a different
meaning in different sections of part 25.
As a rule, failure conditions arising from a single failure
are not considered extremely improbable; thus, probability assessments
normally involve failure conditions arising from multiple failures.
Both qualitative and quantitative assessments are used in practice,
and both are often necessary to some degree to support a conclusion
that an event is extremely improbable.
Qualitative methods are techniques used to structure a logical
foundation for any credible assessment. While a best-estimate
quantitative analysis is often valuable, there are many situations
where the qualitative aspects of the assessment and engineering
judgment must be relied on to a much greater degree. These situations
include those where:
• There is insufficient reliability information (e.g., unknown
operating time or conditions associated with failure data);
• Dependencies among assessment variables are subtle or
unpredictable (e.g., independence of two circuit failures
on the same microchip, size and shape of impact damage due
to foreign objects);
• The range of an assessment variable is extreme or indeterminate;
and
• Human factors play a significant role (e.g., safe outcome
dependent totally upon the flight-crew immediately, accurately,
and completely identifying and mitigating an obscure failure
condition).
Qualitative compliance guidance usually involves selecting
combinations of failures that, based on experience and engineering
judgment, are considered to be just short of "extremely
improbable", and then demonstrating that they will not
cause a catastrophe. In some cases, [[Page 23109]] examples
of combinations of failures necessary for a qualitative assessment
are directly provided in the rule.
For example, § 25.671 (concerning flight controls) sets forth
several examples of combinations of failures that are intended
to help define the outermost boundary of events that are not
"extremely improbable." Judgment would dictate that
other combinations, equally likely or more likely, would also
be included as not "extremely improbable."
However, combinations less likely than the examples would be
considered so remote that they are not expected to occur and
are, therefore, considered extremely improbable. Another common
qualitative compliance guideline is to assume that any failure
condition anticipated to be present for more than one flight,
occurring in combination with any other single failure, is not
"extremely improbable." This is the guideline, often
used to find compliance with § 25.901(c), that the FAA is adopting
as a standard in § 25.981(a)(3).
Quantitative methods are those numerical
techniques used to predict the frequency or the probability
of the various occurrences within a qualitative analysis.
Quantitative methods are vital for supporting the conclusion
that a complex condition is extremely improbable. When a quantitative
probability analysis is used, one has to accept the fact that
the probability of zero is not attainable for the occurrence
of a condition that is physically possible.
Therefore, a probability level is chosen
that is small enough that, when combined with a conservative
assessment and good engineering judgment, it provides convincing
evidence that the condition would not occur in service.
For conditions that lend themselves
to average probability analysis, a guideline on the order
of 1 in 1 billion is commonly used as the maximum average
probability that an "extremely improbable" condition
can have during a typical flight hour. This 1 in 1 billion
"average probability per flight hour" criterion
was originally derived in an effort to assure the proliferation
of critical systems would not increase the historical accident
rate. This criterion was based on an assumption that there
would be no more than 100 catastrophic failure conditions
per airplane.
This criterion was later adopted as
guidance in AC 25.1309.
The
historical derivation of this criterion should not be misinterpreted
to mean that the rule is only intended to limit the frequency
of catastrophe to that historic
1 x 10^-7 level. The FAA conditionally accepts the use of this
guidance only because, when combined with a conservative assessment
and good engineering judgment, it has been an effective indicator
that a condition is not anticipated to occur, at least not
for the reasons identified and assessed in the analysis. Furthermore,
decreasing this criterion to anything greater than
1 x 10^-12 would not result in substantially improved designs,
only increased line maintenance. The FAA has concluded that
the resulting increased exposure to maintenance error would
likely counteract any benefits from such a change.
An ARAC working group has validated
these conclusions.
When using "averages,"
care must be taken to assure that the anticipated deviations
around that "average" are not so extreme that the
"peak" values are unacceptably susceptible to inherent
uncertainties. That is to say, the risk on one flight cannot
be extremely high simply because the risk on another flight
is extremely low. An important example of the flaw in relying
solely on consideration of "average" risk
is the "specific risk" that results from operation
with latent (not operationally detectable) failures. It is this
risk that is being addressed by § 25.981(a)(3), as adopted in
this final rule. For example, latent failures have been identified
as the primary or contributing cause of several accidents. In
1991, a thrust reverser deployment occurred during climb from
Bangkok, Thailand, on a Boeing Model 767 due to a latent
failure in the reversing system.
In 1996, a thrust reverser deployment on a Fokker Model F-100
airplane occurred following takeoff from Sao Paulo, Brazil,
due to a latent failure in the system. As noted earlier, the
NTSB determined that the probable cause of the TWA 800 accident
was ignition of fuel vapours in the center wing fuel tank from an
ignition source:
* * * The source of ignition energy for the explosion could
not be determined with certainty but, of the sources evaluated
by the investigation, the most likely was a short circuit outside
of the center wing tank that allowed excessive voltage to enter
it through electrical wiring associated with the fuel quantity
indication system [FQIS].
A latent failure or condition creating a reduced arc gap in
the FQIS would have to be present to result in an ignition source.
This rule is intended to require designs that prevent operation
of an airplane with a pre-existing condition or failure such
as a reduced arc gap in the FQIS (latent failure) and a subsequent
single failure resulting in a short circuit that causes an electrical
arc inside the fuel tank.
Due to variability and uncertainty in the analytical process,
predicting an average probability of 1 in 1 billion does not
necessarily mean that a condition is extremely improbable; it
is simply evidence that can be used to support the conclusion
that a condition is extremely improbable. Wherever part 25 requires
that a condition be "extremely improbable," the compliance
method, whether qualitative, quantitative, or a combination
of the two, along with engineering judgment, must provide convincing
evidence that the condition will not occur in service.
*** END FAA Commentary FR 66 88 page 23109 ***
|
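One reading of the arithmetic in the FAA commentary quoted above: the historic catastrophic rate of 1 x 10^-7 per flight hour, spread over an assumed maximum of 100 catastrophic failure conditions per airplane, gives the 1 in 1 billion figure. The commentary does not spell out the division, so treat the short Python sketch below as an interpretation; the function name per_condition_budget is made up for the illustration.

```python
from fractions import Fraction


def per_condition_budget(total_rate, conditions_per_airplane):
    """Split an overall per-flight-hour rate evenly across failure conditions."""
    return total_rate / conditions_per_airplane


historic_rate = Fraction(1, 10**7)  # the "historic 1 x 10^-7 level"
max_conditions = 100                # "no more than 100 catastrophic failure conditions"

budget = per_condition_budget(historic_rate, max_conditions)
print(budget)          # exact fraction
print(float(budget))   # same value as a float
```

Exact fractions are used here only to keep the powers of ten free of floating-point rounding.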
And you wondered why we don't always agree on what is real safety!
I hope this helps more than it hurts!
Stuart
Stuart Law
Winged Systems Corp.
1600 American Way
Caldwell, Texas 77836
voice 979-567-6370
fax 979-567-7439
stuart@wingsys.com
www.wingsys.com
|
Charlie Falke wrote:
The ARP4754 is what we're certifying the PW6000 control system to, and any
future engines, until the next wave of fashion. This is viewed here as DO-178B
for hardware. Before the ARP, safety for the hardware was almost purely
analytical, whereas for the software you had to do a lot of tests. Now
that much more of the analysis of the hardware is validated with tests,
you occasionally get surprising results. :-)
Good Day Charlie.
The history of ARP-4754 (Certification Considerations for Highly-Integrated
or Complex Aircraft Systems) and ARP-4761 (Guidelines and Methods for
Conducting the Safety Assessment Process on Civil Airborne Systems and
Equipment) is interesting. RTCA formed SC-167 in the mid-'80's to deal
with the "problems" of DO-178A, Software Considerations in Airborne
Systems and Equipment Certification. DO-178() takes as its basic premise
that software does not fail. It can perform "anomalously," that
is in a manner which is not in accordance with its requirements, but it
does not have the exposure to the traditional "random" failures
to which hardware falls victim. So DO-178() defines processes, which
if properly followed, will try to ensure that the software is free from
any "anomalous behavior." DO-178(A) did a very poor job of providing
criteria for measuring how good the process should be (a common supplier
statement was "Well it doesn't say that I have to do that.")
so SC-167 was formed to provide more specific guidance on how to recognize
"good process." In the Systems team of SC-167 two players were
key intellectual resources: Dr. Nancy Leveson, then at the University of
California-Irvine, and Mike DeWalt, then the FAA Software National
Resource Specialist. Dr.
Leveson's position was (and still is) brutally frank - without good requirements
the best that "process" can do is to ensure that bad requirements
produce the expected results even if the results are bad.
(Her book Safeware should be required reading for all Systems Engineers
and Software Project Managers.) This view on requirements as the key to
safety resonated with the systems folks and we went forward trying to
define what good systems activities would be. However it was pointed out
to us that writing system requirements was not in the charter for SC-167,
and that by defining a systems process in a software document we had exceeded
the Terms of Reference. Mike led the charge in stripping the work down
to the absolute minimum, which is what remains in DO-178B today. To sum
up the systems activities, Systems defines requirements and allocates
some of these requirements to Software. Systems also performs a System
Safety Assessment to define the safety objectives for the software. Systems
also reviews requirements generated within the software process (the derived
requirements) and ensures that safety objectives are not violated.
Notice the order. Systems is responsible for defining a set of requirements
which meet aircraft and system safety objectives. Software, using the
processes of DO-178(B), does all the "right stuff" to ensure
that the software is consistent with the requirements allocated to it.
DO-178B was found acceptable to the FAA, and Advisory Circular 20-115B
was issued calling attention to DO-178B as "a means but not the only
means for showing compliance with" regulatory requirements. Mike
took the systems material from SC-167, went to the Society of Automotive
Engineers, and SAE produced Aerospace Recommended Practices ARP-4754 and ARP-4761.
Both are excellent documents, clearly written, insightful, technically
correct, and all round "good stuff." HOWEVER - the FAA has never
given the "stamp of approval" to these documents in the same
way they did with DO-178B. Nor have the Europeans. I'll refrain from speculating
on why; suffice it to say that some big players in industry took offense
to what would be called for as "required" and so have successfully
blocked "official recognition" by any regulatory agency for
these two excellent documents.
An excellent site for this information is the FAA Wichita ACO at http://www.faa.gov/avr/air/ace/wichita_aco/safety.htm#References
|