RoboLander - Redundancy, Reliability, Fall-back Modes - and The Potential for Failure

Good Job Brian (below) - and very close.

There is a difference between fleet MTBF and LRU* MTBF, and the difference is experience. A typical number used in safety analysis for an individual LRU is around 1 failure in 1,000 - 10,000 operating hours. (The 1,000 hours is for a typical mechanical instrument, which has a much higher failure rate than electronics.) This is a "let's take into account all the random failures which can pop up that we can't design out" number. Notice that the figure is in operating hours, not flight hours. Yet another problem: leave the avionics on all the time, and a 1,000-hour unit can be expected to fail in about 41 days even if the aircraft never leaves the blocks! But MTBF from a maintenance perspective is not the same as the reliability of a function. For example, the failure of all airspeed indication is defined in AC 25.11 to be catastrophic. In Advisory Circular 25.1309-1A, the failure probability for a Catastrophic event is defined to be less than 1/1,000,000,000 per flight hour (i.e. the infamous 10^-9 number). But this is for the function, not for the individual components which provide the function. Three "independent" units, each with a calculated reliability of 10^-3, will meet the requirement for a highly reliable function. (Specifically, ALL three of the units must fail before the required function is no longer available.)
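The arithmetic behind that last point can be sketched in a couple of lines (an illustration only, assuming the three units really are fully independent):

```python
# Illustration (assumed numbers): three independent units, each with a
# per-flight-hour failure probability of 1e-3, protect a single function.
# The function is lost only if ALL three fail.

def function_failure_prob(unit_prob: float, n_units: int) -> float:
    """Probability that all n independent units fail."""
    return unit_prob ** n_units

print(function_failure_prob(1e-3, 3))   # ~1e-9, the "extremely improbable" guideline

# And the operating-hours point: a 1000-hour-MTBF unit that is left
# powered around the clock reaches its MTBF in about 41.7 days.
print(1000 / 24)                        # ~41.7 days of continuous operation
```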

In an earlier post I pointed out that Ray was dealing with the probability of function failure, while Dr. Ladkin was talking about individual component failure resulting in functional failure. It is impossible, given reasonable dollar budgets, to field a single LRU which will meet the accepted failure probability of one in a million flight hours. Take three units of lesser (but more cost effective) reliability, throw in a standby instrument (to ensure independence) and the safety analysis is satisfied.

Another example: the autoland system is typically a triplicated system (at least). If all three "lanes" or channels are not fully operational, autoland will not be enabled. The reason? There is an exposure time which starts when autoland is armed and ends when the plane is on the ground. During this time there is a finite probability that one "lane" will fail. Having two remaining "lanes" allows the system to degrade to a dual system but still continue to operate (i.e. fail active or fail operable). It is only when another "lane" fails that the function also fails and control is "dumped" back onto the flight crew (fail passive). [Or, in the case of RoboLander losing its ground link (or the air side of that link), that failure causes a data-link "unlatch" and flight control automatically reverts to the onboard pilot.] Failure of the autoland function, once engaged and below decision height, must be shown through calculations to be less than 10^-9 during that very short exposure period.
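The degradation sequence described above can be sketched as a toy decision function (an illustration of the concept only, not any real autoland logic):

```python
# Toy sketch of fail-active / fail-passive degradation in a triplex system.
# The thresholds are the ones described in the post, not a real design.

def autoland_status(healthy_lanes: int) -> str:
    """Map the number of healthy lanes to the system's operating mode."""
    if healthy_lanes >= 3:
        return "fail-operational: full triplex, autoland may be armed"
    if healthy_lanes == 2:
        return "fail-active: degraded to dual, autoland stays engaged"
    return "disengaged: control reverts to the flight crew (fail passive)"

for lanes in (3, 2, 1):
    print(lanes, "->", autoland_status(lanes))
```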

Three "lanes," each with a calculated reliability of 1 failure in more than 1000 critical hours, will meet the minimum design criteria for safety (where "critical" is defined to be the time from decision height to rollout, in the case of Cat IIIc).
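As a back-of-the-envelope check on those numbers (the exposure time is invented for illustration, and the lanes are treated as fully independent):

```python
import math

# Assumed numbers for illustration: per-lane failure rate of 1e-3 per
# critical hour, and an exposure window of about 3 minutes (0.05 h) from
# decision height to rollout. The function is lost only if all 3 lanes fail.

lam = 1e-3                               # failures per critical hour, one lane
exposure = 0.05                          # hours at risk (assumed ~3 minutes)

p_lane = 1 - math.exp(-lam * exposure)   # one lane fails within the window
p_all_three = p_lane ** 3                # total loss of the function
print(p_all_three)                       # on the order of 1e-13, well below 1e-9
```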

The reliability guarantees provided by the equipment suppliers will be fleet averages, like you have said, but the numbers guaranteed by contract for MTBF are significantly greater than the numbers used during design analysis. We will design a system using 10^-3 (or perhaps 10^-4) because this is a conservative number and will result in a conservative design. A number this low is unacceptable in the real world, so to make the sale the number will be increased closer to a "service experience" value. (I.e., if we tried to "sell" the reliability numbers calculated by MIL-HDBK-217, our product would never be sold because of the "low reliability.")

This is the context in which Ray was discussing reliability and MTBF, and it is in this context that the system and equipment designers understand MTBF as well. Regulatory guidance will give functional "reliability" in terms of a minimum probability of failure during some unit of time. From this, using industry standard practices (for example Society of Automotive Engineers Aerospace Recommended Practice 4761), the minimum reliability for each individual piece of a system is allocated to those units. Typically this results in redundancy in the architecture (i.e. three 10^-3 units, each able to independently perform a required function, will meet the functional reliability requirements). This is also where the MMEL comes from: a system is allowed to degrade by a specific amount and still meet the functional reliability if operated with some set of restrictions. Hope that this helps,

*LRU = Line Replaceable Unit (in a modular design you rip out the failed avionics box and insert another) - The FAA System Safety Handbook

Brian wrote:

Before the conversation between Ray and Peter deteriorated they were discussing the reliability of a proposed new system. In the course of that conversation they used the phrase Mean Time Between Failure (MTBF) quite a lot. I'm sure they both knew what they were talking about, but I'm also pretty sure that many of the other members of the forum did not fully understand the term. I have to confess that it was not until this year that I really started to understand what it meant even though I have been hearing and using the phrase for decades. I always took it to mean what it said -- the amount of time, on average, which a device would operate without a failure. Wrong!

For those who care to read on, here's my, "I learned about MTBF from that!"


We started flying the fleet of airplanes I worry about 10 years ago.

Recently we started hearing from the pilots and maintainers of the airplanes that a particular widget was failing an awful lot. In the process of investigating the problem, our reliability folks provided the MTBF of the widget based on field history.

I looked at the number they generated and complained, "Wait. This can't be right. You say the MTBF is X, but the high time airplane in the fleet only has 60% of X hours and most have much less. If the MTBF is X, how come these folks are complaining? Except for the odd widget which has a premature failure, they should not be seeing any problem."

The reliability guy comes back and says, "Our number is right. MTBF is equal to total fleet hours divided by number of failures. Think about it."

Well I'd known that equation for quite some time, but it never dawned on me that the equation gives something different from my intuitive view of what MTBF meant. I guess I was in the mode of, "I've got my mind made up; don't bother me with facts."

In order to "think about it," consider a hypothetical airline introducing a new aircraft into service. Let's say they take delivery of one airplane a month and fly each airplane in the fleet 500 hours a month. Let's further hypothesize that the airplanes all have a widget that fails after exactly 6000 hours. That takes the randomness of the failure out of the picture. By my intuitive definition, the MTBF should be 6000 hours, but let's see what the reliability guy tells us.

At the end of the first year the fleet has 500 + 1000 + 1500 + ... + 6000 = 39000 hours with no failures. Without a failure, my reliability guy can't come up with a value for MTBF, but as soon as the next year starts, that first airplane goes over the 6000 hour mark and the first failure comes, and he can start generating numbers. By the end of the second year there have been 12 failures -- one on each of the original 12 airplanes. The total hours on the fleet are 39000 from the first year plus 39000 from the 12 new airplanes plus 500 x 12 x 12 hours on the airplanes we got the first year.

That gives a total of 150,000 hours on the fleet. 150,000 divided by 12 gives an MTBF for the widget of 12,500 hours. This is close to the situation my real life fleet has; i.e., the statisticians are saying the MTBF is more than the flying time of my high time airframe, but I've had a failure a month for the last 12 months.

In the third year the MTBF works out to 9250 hours. Eventually the number from the reliability guy and my intuitive number will get pretty close, but ......
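The walk-through above can be reproduced in a few lines (same assumptions as the hypothetical: one delivery per month, 500 hours per month per airplane, and a widget that fails the moment it passes exactly 6000 hours and is then replaced):

```python
# Reproduces the fleet-history MTBF arithmetic from the post above.

HOURS_PER_MONTH = 500
WIDGET_LIFE = 6000       # the widget fails the moment it exceeds 6000 hours

def fleet_mtbf(months: int) -> float:
    """Fleet-history MTBF: total fleet hours divided by number of failures."""
    total_hours = 0
    failures = 0
    for delivery_month in range(1, months + 1):
        flown = (months - delivery_month + 1) * HOURS_PER_MONTH
        total_hours += flown
        # One failure each time a widget goes past the 6000-hour mark.
        failures += (flown - 1) // WIDGET_LIFE
    return total_hours / failures if failures else float("inf")

print(fleet_mtbf(24))    # 12500.0 at the end of year two
print(fleet_mtbf(36))    # 9250.0 at the end of year three
```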

So the message here for the line pilot is that when the engineers say the MTBF is so high you don't have to really worry about it, worry about it anyway. The more reliable the widget is and the more recent its introduction to the fleet, the less likely the MTBF derived from fleet history is to be accurate. If we are talking about the MTBF on a DC-3 tail wheel, I'd believe it. If we are talking about a new gizmo to fly the airplane from the ground, I'm sceptical. That's where redundancy and fallback modes must come into the reliability story.

End of story.

Now a question for the engineers: is there a statistic which corresponds to my original idea of MTBF? It would be derived by dividing the operating time of all the FAILED widgets by the number of failures. If it exists, what is it called? Why is it not used more? In my hypothetical case it would have given the right number after the first failure. In a more realistic case it would approach the real MTBF from below (rather than above) which would still be handy in safety analyses.
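To illustrate the statistic Mike is asking about (the mean operating time of only the failed widgets), here is a small sketch with assumed random lifetimes; the exponential distribution and the 6000-hour true mean are invented for the example. In a young fleet, the failed-units mean tends to run below the true value (the long-lived widgets have not failed yet), while fleet-hours-per-failure runs above it:

```python
import random

random.seed(42)                      # fixed seed so the sketch is repeatable
TRUE_MTBF = 6000.0                   # assumed true mean widget life, hours
MONTHS, HOURS_PER_MONTH = 36, 500    # same fleet shape as the hypothetical

total_hours = 0.0
failed_lives = []                    # operating time of each FAILED widget
for delivery_month in range(1, MONTHS + 1):
    hours_available = (MONTHS - delivery_month + 1) * HOURS_PER_MONTH
    total_hours += hours_available
    used = 0.0
    while True:
        life = random.expovariate(1.0 / TRUE_MTBF)   # one widget's lifetime
        if used + life > hours_available:            # still running at cutoff
            break
        used += life
        failed_lives.append(life)

fleet_estimate = total_hours / len(failed_lives)          # fleet-history MTBF
failed_units_mean = sum(failed_lives) / len(failed_lives) # Mike's statistic
print(f"fleet-history MTBF: {fleet_estimate:.0f} h")
print(f"failed-units mean:  {failed_units_mean:.0f} h")
```

Since the failed widgets' hours are a subset of the fleet's total hours, the failed-units mean can never exceed the fleet-history number, which makes it the more conservative of the two for a safety analysis.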

Still learning, Mike

BP King wrote:

<<<In the course of that conversation they used the phrase Mean Time Between Failure (MTBF) quite a lot. I'm sure they both knew what they were talking about, but I'm also pretty sure that many of the other members of the forum did not fully understand the term. >>>>

Let me throw a little fuel on the flame of discussion with a slightly different take on safety, reliability, and MTBF:

First: Reliability numbers, MTBF numbers, and any other numbers you care to throw out are merely indicators of safety.

They are no different than price-to-earnings ratios are in financially rating a company's worth. Just because the numbers are great doesn't mean that it's a done deal, can't miss, or that it is safe. They do mean that, for the type of analysis being done, the result looks "promising". And just as there are umpteen ways to price a company's value, there are umpteen ways to evaluate safety, with variations occurring in each regulator, manufacturer, and engineer.

In my humble opinion, safety is still an art form.

It involves corporate and individual culture, commitment, commitment, and more commitment at every level of design, test, and manufacturing. For all the science around it, if the engineer, assembler, technician, or operator is upset, lazy, incompetent, or just doesn't care, safety is quickly lost. We use scientific tools in all of these areas to try to control it, but are really only achieving scientific measurement of its "probability of being there". And as you have recently seen, even that is hotly disputable.

Second: Aircraft system safety requirements are not based on, nor do they even reliably (pun intended) correlate with LRU MTBF numbers. I have attempted to provide a researched discussion of this below, but it is lengthy and has lots of interesting (at least to safety engineers) official guidance so let me try to cut to the bottom lines:

1) MTBF is an LRU measurement. It reflects any failure. It is usually defined in procurement contracts as a quality indicator having to do with monetary or other incentives to lower cost of ownership, warranty costs, etc. In most cases (especially military) a "failure" can even be charged when a box has to be removed simply to gain access to another box.

2) Aircraft safety reliability requirements are imposed upon a function. That function may cover many boxes or only a small part of a single box. It would be unusual if it actually occurred at the physical boundary of a box. Failures are evaluated to determine their effect on the aircraft and are collected and separated into "buckets" labeled "No Effect", "Minor", "Major", "Hazardous/Severe Major", and "Catastrophic" along with their probability of failure. A specific limit of probability is set for each of the buckets (1E-9 typical for "Catastrophic", etc.). The system gets certified only if the aggregate (summed) probability of all the failures in each of the buckets is below that bucket's assigned limit.

3) It is possible and even common for equipment performing catastrophic tasks to have terrible MTBFs. The failures were detected, alerted to the crew, caused no harm or undue excitement, and an operational alternative was available. Availability (much closer to MTBF) of an autoland function could be as low as 99% (1E-2) and it would probably have a good chance of being certified if it had only extremely improbable failures that could cause harm or prevent continuation of the approach to a safe landing and rollout below the alert height.

4) It is possible, but not common, for single-string systems to provide catastrophic functions if it can be shown that the system has no failure mode which can stimulate the catastrophic event and that all existing failure modes meet the aggregate bucket requirements described above.
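The "bucket" test described above can be sketched as a small check (every condition name, probability, and limit here is invented for illustration; the real numbers come out of the safety assessment, not a ten-line script):

```python
# Toy sketch of the aggregate "bucket" check described above.

BUCKET_LIMITS = {                    # assumed per-flight-hour limits
    "Catastrophic": 1e-9,
    "Hazardous/Severe Major": 1e-7,
    "Major": 1e-5,
}

# Hypothetical failure conditions: (description, bucket, probability per fh)
conditions = [
    ("loss of all airspeed indication", "Catastrophic", 4e-10),
    ("autoland hardover below DH",      "Catastrophic", 3e-10),
    ("loss of one hydraulic system",    "Major",        2e-6),
]

def buckets_ok(conds, limits):
    """Sum each bucket and compare the aggregate against its limit."""
    totals = {bucket: 0.0 for bucket in limits}
    for _description, bucket, prob in conds:
        totals[bucket] += prob
    return all(totals[b] <= limits[b] for b in limits), totals

ok, totals = buckets_ok(conditions, BUCKET_LIMITS)
print("certifiable:", ok)            # True: every bucket is under its limit
```

Note that the check is on the aggregate per bucket, so one function with margin can be "spent" by another condition in the same bucket.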

I posted something of an explanation about how the FAA looks at system safety certification some years ago on Bluecoat that might be helpful, along with some follow-on comments by Don Armstrong (FAA Flight Test, now retired).

.... well it was my intent to give the URL to get the original posting off the Bluecoat site, but I couldn't get the search engine to work for me. So much for being a computer type! I will either get the URL or re-post it.

Finally, US system safety has its roots in the military and NASA, but was probably most formalized by the Nuclear Regulatory Commission in the late 1950's. When I started work as a NASA Mission Controller we were using the NRC document along with some NASA-generated material. Soon afterward NASA went entirely with its own dialect of how to get safe. It's a small community and somewhat incestuous, in that the Nuclear Regulatory Commission (now DOE) was knocking at NASA's safety door after Three Mile Island. When I first started working aircraft, there was my old friend the NRC document! It was then replaced with AC 25.1309-1, then 25.1309-1A, and (soon?) the SAE documents as the aviation world creates its own dialect to fit its environment. Should we be at all surprised if there are significant differences between medical, commercial, nuclear, and aviation safety? Not at all. Each has its own unique challenges (threats) and environments.

However, for licensed products such as aircraft and nuclear reactors there is a clear need to satisfy that regulatory agency's safety system and not much incentive to also fiddle with the other guy's requirements. Probably similar to pilots not getting commercial drivers' licenses in order to taxi aircraft.

Here is a little more detail (maybe more than you want!) in how system safety gets handled in civil aviation.

Working from the top downward:

The actual requirements are very simply stated in (USA) 14 CFR 25.1309, which is harmonized with the JAR. These eight simple sentences, and especially Parts (b) and (d), probably cause more trees to die per word than anything else in the entire rulebook!

Here they are:

Part (a):

The equipment, systems, and installations whose functioning is required by this subchapter, must be designed to ensure that they perform their intended functions under any foreseeable operating condition.

Part (b):

The airplane systems and associated components, considered separately and in relation to other systems, must be designed so that:

(1) The occurrence of any failure which would prevent the continued safe flight and landing of the airplane is extremely improbable, and (2) The occurrence of any other failure conditions which would reduce the capability of the airplane or the ability of the crew to cope with adverse operating conditions is improbable.

Part (c):

Warning information must be provided to alert the crew to unsafe system operating conditions, and to enable them to take appropriate corrective action.

Systems, controls, and associated monitoring and warning means must be designed to minimize crew errors which could create additional hazards.

Part (d):

Compliance with the requirements of paragraph (b) of this section must be shown by analysis, and where necessary, by appropriate ground, flight, or simulator tests.

The analysis must consider 

(1) Possible modes of failure, including malfunctions and damage from external sources.

(2) The probability of multiple failures and undetected failures.

(3) The resulting effects on the airplane and its components, considering the stage of flight and operating conditions, and 

(4) The crew warning cues, corrective action required, and the capability of detecting faults.

Parts (e) and (f): skipped as not directly dealing with system safety analysis (they talk about generating capacity vs loads for various engine-out conditions).

Part (g):

In showing compliance with paragraphs (a) and (b) of this section with regard to the electrical system and equipment design and installation, critical environmental conditions must be considered.

For electrical generation, distribution, and utilization equipment required by or used in complying with this chapter, except equipment covered by Technical Standard Orders containing environmental test procedures, the ability to provide continuous, safe service under foreseeable environmental conditions may be shown by environmental tests, design analysis, or reference to previous comparable service experience on other aircraft.

**** End of 25.1309 ****


Now, to show you ops guys just why aviation safety guys can get into a good yelling match, the FAA's advisory circular AC 25.1309-1A, "System Design And Analysis," takes the (b), (c), and (d) from above and provides 18 pages of explanation. I hope you will excuse the length of the following quote from the AC, but it is key to understanding aircraft safety analysis:

*** begin AC 25.1309-1A excerpt ****


The Part 25 airworthiness standards are based on, and incorporate, the objectives, and principles or techniques, of the fail-safe design concept, which considers the effects of failures and combinations of failures in defining a safe design.

a. The following basic objectives pertaining to failures apply:

(1) In any system or subsystem, the failure of any single element, component, or connection during any one flight (brake release through ground deceleration to stop) should be assumed, regardless of its probability.

Such single failures should not prevent continued safe flight and landing, or significantly reduce the capability of the airplane or the ability of the crew to cope with the resulting failure conditions.

(2) Subsequent failures during the same flight, whether detected or latent, and combinations thereof, should also be assumed, unless their joint probability with the first failure is shown to be extremely improbable.

b. The fail-safe design concept uses the following design principles or techniques in order to ensure a safe design.

The use of only one of these principles or techniques is seldom adequate. A combination of two or more is usually needed to provide a fail-safe design; that is, to ensure that major failure conditions are improbable and that catastrophic failure conditions are extremely improbable.

(1) Designed Integrity and Quality, including Life-Limits, to ensure intended function and prevent failures.

(2) Redundancy or Backup Systems to enable continued function after any single (or other defined number of) failure(s); for example, two or more engines, hydraulic systems, flight control systems, etc.

(3) Isolation of Systems, Components, and Elements so that the failure of one does not cause the failure of another. Isolation is also termed independence.

(4) Proven Reliability so that multiple, independent failures are unlikely to occur during the same flight.

(5) Failure Warning or Indication to provide detection.

(6) Flight-crew Procedures for use after failure detection, to enable continued safe flight and landing by specifying crew corrective action.

(7) Checkability: the capability to check a component's condition.

(8) Designed Failure Effect Limits, including the Capability to sustain damage, to limit the safety impact or effects of a failure.

(9) Designed Failure Path to control and direct the effects of a failure in a way that limits its safety impact.

(10) Margins or Factors of Safety to allow for any undefined or unforeseeable adverse conditions.

(11) Error Tolerance that considers adverse effects of foreseeable errors during the airplane's design, test, manufacture, operation, and maintenance.

*** end AC 25.1309-1A excerpt ****


By the way, the AC got everybody straightened out and working together well enough that the SAE went off and produced not one, but two documents on the same subject!

One explains the concepts and the other shows examples and guidance in using many of the popular techniques. They are:

Certification Considerations for Highly-Integrated or Complex Aircraft Systems (ARP 4754)

Guidelines and Methods For Conducting The Safety Assessment Process On Civil Airborne Systems and Equipment (ARP 4761)

Now, if I still have any of you with me, there is a pretty good description of what the FAA thinks they said (sometimes this is "constructively" disputed in the real world) in the Preamble to the new requirements for gas tank protection (see Federal Register May 7, 2001 - volume 66 number 88, pages 23085-23131). I have included the pertinent section below.

*** Start FAA Commentary FR 66 88 page 23108 ***

As for 25.1309, the commenter appears to be confusing the objective of the rule (i.e., to prevent the occurrence of catastrophic failure conditions that can be anticipated) with a conditionally acceptable means of demonstrating compliance, as described in AC 25.1309-1A (i.e., that catastrophic failure conditions must have an "average probability per flight hour" of less than 1 x 10^-9).

Since this same misconception has presented itself many times before, the following discussion is intended to clarify the intent of the term "extremely improbable" and the role of "average probability" in demonstrating that a condition is "extremely improbable."

The term "extremely improbable" (or its predecessor term, "extremely remote") has been used in 14 CFR part 25 for many years. The objective of this term has been to describe a condition (usually a failure condition) that has a probability of occurrence so remote that it is not anticipated to occur in service on any transport category airplane. While a rule sets a minimum standard for all the airplanes to which it applies, compliance determinations are necessarily limited to individual type designs.

Consequently, all that has been required of applicants is a sufficiently conservative demonstration that a condition is not anticipated to occur in service on the type design being assessed.

The means of demonstrating that the occurrence of an event is extremely improbable varies widely, depending on the type of system, component, or situation that must be assessed.

There has been a tendency, as evidenced by the comment, to confuse the meaning of this term with the particular means used to demonstrate compliance in those various contexts.

This has led to a misunderstanding that the term has a different meaning in different sections of part 25.

As a rule, failure conditions arising from a single failure are not considered extremely improbable; thus, probability assessments normally involve failure conditions arising from multiple failures. Both qualitative and quantitative assessments are used in practice, and both are often necessary to some degree to support a conclusion that an event is extremely improbable.

Qualitative methods are techniques used to structure a logical foundation for any credible assessment. While a best-estimate quantitative analysis is often valuable, there are many situations where the qualitative aspects of the assessment and engineering judgment must be relied on to a much greater degree. These situations include those where:

There is insufficient reliability information (e.g., unknown operating time or conditions associated with failure data); 

Dependencies among assessment variables are subtle or unpredictable (e.g., independence of two circuit failures on the same microchip, size and shape of impact damage due to foreign objects); 

The range of an assessment variable is extreme or indeterminate; and 

Human factors play a significant role (e.g., safe outcome dependent totally upon the flight-crew immediately, accurately, and completely identifying and mitigating an obscure failure condition).

Qualitative compliance guidance usually involves selecting combinations of failures that, based on experience and engineering judgment, are considered to be just short of "extremely improbable", and then demonstrating that they will not cause a catastrophe. In some cases, [[Page 23109]] examples of combinations of failures necessary for a qualitative assessment are directly provided in the rule.

For example, 25.671 (concerning flight controls) sets forth several examples of combinations of failures that are intended to help define the outermost boundary of events that are not "extremely improbable." Judgment would dictate that other combinations, equally likely or more likely, would also be included as not "extremely improbable."

However, combinations less likely than the examples would be considered so remote that they are not expected to occur and are, therefore, considered extremely improbable. Another common qualitative compliance guideline is to assume that any failure condition anticipated to be present for more than one flight, occurring in combination with any other single failure, is not "extremely improbable." This is the guideline, often used to find compliance with 25.901(c), that the FAA is adopting as a standard in 25.981(a)(3).

Quantitative methods are those numerical techniques used to predict the frequency or the probability of the various occurrences within a qualitative analysis. Quantitative methods are vital for supporting the conclusion that a complex condition is extremely improbable. When a quantitative probability analysis is used, one has to accept the fact that the probability of zero is not attainable for the occurrence of a condition that is physically possible.

Therefore, a probability level is chosen that is small enough that, when combined with a conservative assessment and good engineering judgment, it provides convincing evidence that the condition would not occur in service.

For conditions that lend themselves to average probability analysis, a guideline on the order of 1 in 1 billion is commonly used as the maximum average probability that an "extremely improbable" condition can have during a typical flight hour. This 1 in 1 billion "average probability per flight hour" criterion was originally derived in an effort to assure the proliferation of critical systems would not increase the historical accident rate. This criterion was based on an assumption that there would be no more than 100 catastrophic failure conditions per airplane.

This criterion was later adopted as guidance in AC 25.1309.

The historical derivation of this criterion should not be misinterpreted to mean that the rule is only intended to limit the frequency of catastrophe to that historic 1 x 10^-7 level. The FAA conditionally accepts the use of this guidance only because, when combined with a conservative assessment and good engineering judgment, it has been an effective indicator that a condition is not anticipated to occur, at least not for the reasons identified and assessed in the analysis. Furthermore, decreasing this criterion to anything greater than 1 x 10^-12 would not result in substantially improved designs, only increased line maintenance. The FAA has concluded that the resulting increased exposure to maintenance error would likely counteract any benefits from such a change.

An ARAC working group has validated these conclusions.

When using "averages," care must be taken to assure that the anticipated deviations around that "average" are not so extreme that the "peak" values are unacceptably susceptible to inherent uncertainties. That is to say, the risk on one flight cannot be extremely high simply because the risk on another flight is extremely low. An important example of the flaw in relying solely on consideration of "average" risk is the "specific risk" that results from operation with latent (not operationally detectable) failures. It is this risk that is being addressed by 25.981(a)(3), as adopted in this final rule. For example, latent failures have been identified as the primary or contributing cause of several accidents. In 1991, a thrust reverser deployment occurred during climb from Bangkok, Thailand, on a Boeing Model 767 due to a latent failure in the reversing system.

In 1996, a thrust reverser deployment on a Fokker Model F-100 airplane occurred following takeoff from Sao Paulo, Brazil, due to a latent failure in the system. As noted earlier, the NTSB determined that the probable cause of the TWA 800 accident was ignition of fuel vapours in the center wing fuel tank from an ignition source:

* * * The source of ignition energy for the explosion could not be determined with certainty but, of the sources evaluated by the investigation, the most likely was a short circuit outside of the center wing tank that allowed excessive voltage to enter it through electrical wiring associated with the fuel quantity indication system [FQIS].

A latent failure or condition creating a reduced arc gap in the FQIS would have to be present to result in an ignition source. This rule is intended to require designs that prevent operation of an airplane with a pre-existing condition or failure such as a reduced arc gap in the FQIS (latent failure) and a subsequent single failure resulting in a short circuit that causes an electrical arc inside the fuel tank.

Due to variability and uncertainty in the analytical process, predicting an average probability of 1 in 1 billion does not necessarily mean that a condition is extremely improbable; it is simply evidence that can be used to support the conclusion that a condition is extremely improbable. Wherever part 25 requires that a condition be "extremely improbable," the compliance method, whether qualitative, quantitative, or a combination of the two, along with engineering judgment, must provide convincing evidence that the condition will not occur in service.

*** END FAA Commentary FR 66 88 page 23109 ***


And you wondered why we don't always agree on what is real safety!

I hope this helps more than it hurts!

Stuart

Stuart Law, Winged Systems Corp.
voice 979-567-6370, fax 979-567-7439
1600 American Way, Caldwell, Texas 77836


"Let me throw a little fuel on the flame of discussion with a slightly different take on safety, reliability, and MTBF:

First: Reliability numbers, MTBF numbers, and any other numbers you care to throw out are merely indicators of safety."

Stuart, That was completely fascinating. The ARP4754 is what we're certifying the PW6000 control system to, and any future engines, until the next wave of fashion. This is viewed here as DO-178B for hardware. Before the ARP, safety for the hardware was almost purely analytical, whereas for the software you had to do a lot of tests. Now there's much more of the analysis of the hardware being validated with tests... and with occasionally surprising results. :-)

I regard this as an entirely wonderful thing.

Your note also put in clear terms that no matter how many trees are killed, it is easy for unsubstantiated assumptions to work their way into safety analyses. Safety follows from diligence, no matter which method is being used.

My favourite 10E-9 story is the FADEC alternator, which used to be the sole source of power for the FADEC, which has one rotor and one housing, and which was supposed to suffer common-mode faults only once in that famous interval. When we mounted the very first one on the test bench, the housing was out of tolerance and the samarium-cobalt (SmCo) magnet rubbed the inside of the stator. After 10 minutes the resulting heat meant it was not a magnet anymore, and produced a lot of strange colours.

The joke was that now we had had the one failure, it would be 10E9 hours before we had another one. They are backed up with aircraft power now. 

Charlie Falke wrote:
The ARP4754 is what we're certifying the PW6000 control system to, and any future engines, until the next wave of fashion. This is viewed here as DO-178B for hardware. Before the ARP, safety for the hardware was almost purely analytical, whereas for the software you had to do a lot of tests. Now, that much more of the analysis of the hardware is validated with tests, you get occasionally surprising results. :-)

Good Day Charlie.

The history of ARP-4754 (Certification Considerations for Highly-Integrated or Complex Aircraft Systems) and ARP-4761 (Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment) is interesting. RTCA formed SC-167 in the mid-'80s to deal with the "problems" of DO-178A, Software Considerations in Airborne Systems and Equipment Certification. DO-178() takes as its basic premise that software does not fail. It can perform "anomalously," that is, in a manner which is not in accordance with its requirements, but it does not have the exposure to the traditional "random" failures to which hardware falls victim. So DO-178() defines processes which, if properly followed, will try to ensure that the software is free from any "anomalous behavior." DO-178A did a very poor job of providing criteria for measuring how good the process should be (a common supplier statement was "Well, it doesn't say that I have to do that"), so SC-167 was formed to provide more specific guidance on how to recognize "good process."

In the Systems team of SC-167, two players were key intellectual resources: Dr. Nancy Leveson, then at the University of California-Irvine, and Mike DeWalt, then the FAA Software National Resource Specialist. Dr. Leveson's position was (and still is) brutally frank: without good requirements, the best that "process" can do is to ensure that bad requirements produce the expected results, even if the results are bad. (Her book Safeware should be required reading for all Systems Engineers and Software Project Managers.) This view of requirements as the key to safety resonated with the systems folks, and we went forward trying to define what good systems activities would be. However, it was pointed out to us that writing system requirements was not in the charter for SC-167, and that by defining a systems process in a software document we had exceeded the Terms of Reference. Mike led the charge in stripping the work down to the absolute minimum, which is what remains in DO-178B today. To sum up the systems activities: Systems defines requirements and allocates some of these requirements to Software. Systems also performs a System Safety Assessment to define the safety objectives for the software. Finally, Systems reviews requirements generated within the software process (the derived requirements) and ensures that safety objectives are not violated.

Notice the order. Systems is responsible for defining a set of requirements which meet aircraft and system safety objectives. Software, using the processes of DO-178B, does all the "right stuff" to ensure that the software is consistent with the requirements allocated to it. DO-178B was found acceptable to the FAA, and Advisory Circular 20-115B was issued calling attention to DO-178B as "a means but not the only means for showing compliance with" regulatory requirements. Mike took the systems material from SC-167 to the Society of Automotive Engineers, and SAE produced ARP-4754 and ARP-4761. Both are excellent documents: clearly written, insightful, technically correct, and all-round "good stuff." HOWEVER, the FAA has never given the "stamp of approval" to these documents in the same way it did with DO-178B. Nor have the Europeans. I'll refrain from speculating on why; suffice it to say that some big players in industry took offense at what would be called for as "required," and so have successfully blocked "official recognition" by any regulatory agency for these two excellent documents.

An excellent site for this information is the FAA Wichita ACO at
