393SYS

Aviation reliability:
Programs & calculation
Chapter 19 text
March 16
Reliability
Definition (in statistical terms):
‘the probability of failure free operation of an item in
a specified environment for a specified amount of
time’
Examples:
If eight delays and cancellations are experienced in 200 flights, then 96%
of the airline's flights were dispatched on time.
Effective February 15, 2007, the FAA ruled that US-registered ETOPS-207
operators can fly over most of the world provided that the IFSD rate does not
exceed 1 in 100,000 engine hours. This limit is more stringent than that for
ETOPS-180 (2 in 100,000 engine hours).
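The two examples above are simple rate calculations. A minimal Python sketch follows; the function names and the fleet figures in the second call are hypothetical and only illustrate the arithmetic.

```python
def dispatch_reliability(flights: int, delays_and_cancellations: int) -> float:
    """Fraction of scheduled flights that were dispatched on time."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    return 1.0 - delays_and_cancellations / flights


def ifsd_rate_per_100k(shutdowns: int, engine_hours: float) -> float:
    """In-flight shutdowns per 100,000 engine hours."""
    if engine_hours <= 0:
        raise ValueError("engine_hours must be positive")
    return shutdowns / engine_hours * 100_000


# Example from the text: 8 delays and cancellations in 200 flights.
print(f"{dispatch_reliability(200, 8):.0%}")   # 96%

# Hypothetical fleet: 3 shutdowns in 250,000 engine hours.
# The result can be compared with a limit of 1 (ETOPS-207) or 2 (ETOPS-180).
print(ifsd_rate_per_100k(3, 250_000))
```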
Two main approaches to reliability in
the aviation industry


The first approach is overall airline reliability, which essentially means
dispatch reliability: how often the airline achieves an
on-time departure of its scheduled flights. The reasons for delay
are categorized as maintenance, procedures, personnel, flight
operations, air traffic control (ATC), etc.
The second approach is to consider reliability as programs
specifically designed to address maintenance problems,
whether or not they cause delays, and to provide analysis of and
corrective actions for those items to improve the overall
reliability of equipment. This contributes to dispatch
reliability as well as to the overall operation.
Reliability Program (for maintenance)
A set of rules and practices for managing and
controlling a maintenance program. The
main function is to monitor the performance
of the vehicles and their associated
equipment and call attention to any need for
corrective action.
Additional functions:

Monitor the effectiveness of those corrective actions

Provide data to justify adjusting the maintenance interval or
maintenance program procedure as appropriate
Maintenance programs have four
types of reliability




Statistical reliability
Historical reliability
Event-oriented reliability
Dispatch reliability
Statistical reliability

Based upon collection and analysis of ‘events’
such as failure, removal, and repair rates of
systems or components.
Historical reliability

Comparison of current event rates with those of past
experience. Commonly used when new equipment is
introduced and no established statistics are available.
Event-oriented reliability
Events like bird strikes, hard landings, in-flight
shutdowns (IFSD), lightning strikes or other accidents
that do not occur on a regular basis and therefore
produce no usable statistical or historical data. In
ETOPS, the FAA has designated certain events to be tracked
by an ‘event-oriented reliability program’. Each
occurrence of these events must be investigated to
determine the cause and prevent recurrence.
Examples of IFSD causes: flameout, internal failure,
crew-initiated shutoff, foreign object ingestion, icing, inability to
obtain and/or control desired thrust.
Dispatch reliability
A measure of an airline's operation with respect
to on-time departure. It receives considerable
attention from regulatory authorities (e.g. the
FAA), airlines and passengers. It is actually
just a special form of the event-oriented
reliability approach.
Danger of misinterpreted
reliability data (1)
A pilot experiences a rudder control problem two hours out from the destination
airport. He writes up the problem in the aircraft logbook and reports it by
radio to the flight operations unit at the airport. Upon arrival, the maintenance
crew checks the log, finds the write-up and begins troubleshooting. The repair
actions take a little longer than the scheduled turnaround time and cause a delay.
Since maintenance is at work and the rudder is the problem, the delay is charged to
the maintenance department.
If the pilot and the flight operations unit had passed the problem on to
maintenance two hours before landing, the maintenance people could have spent the
time prior to landing on troubleshooting analysis and the delay could
have been prevented. So a change in airline procedure can avoid the delay.
A good reliability program should prevent the same delay in the future by altering
the procedure, regardless of who or what is to blame.
Danger of misinterpreted reliability
data (2)
If there were 12 write-ups of rudder problems during the month and only
one of them caused a delay, there are actually two problems to
investigate:
1. The delay, which may or may not have been caused by the rudder problems
2. The 12 rudder write-ups, which may in fact be related to an underlying
maintenance problem
Dispatch delay constitutes one problem and the rudder system malfunction
constitutes another. They may overlap, but they are two different
problems. The delay is an event-oriented reliability problem that must be
investigated on its own; the 12 rudder problems should be addressed
separately by the statistical (or historical) reliability program.
Elements of a Reliability Program
1. Data collection
2. Problem area alerting
3. Data display
4. Data analysis
5. Corrective actions
6. Follow-up analysis
7. Monthly report
Data collection allows the operator to compare
present performance with the past. Typical data
types are:
1. Flight time and cycles for each aircraft
2. Cancellations and delays over 15 minutes
3. Unscheduled component removals
4. Unscheduled engine removals
5. In-flight shutdowns of engines
6. Pilot reports or logbook write-ups
7. Cabin logbook write-ups
8. Component failures (shop maintenance)
9. Maintenance check package findings
10. Critical failures
Problem detection: alerting systems
Alerting systems quickly identify areas where performance is
significantly different from normal so that possible problems
can be investigated. Standards for event rates are set
according to past performance.
Problem detection 2:
setting & adjusting alert levels
Alert levels are recalculated (yearly) and
false alarms are filtered out
Quality Control Charts and the Seven
Run Rule


A control chart is a graphic display of data that illustrates
the results of a process over time. It helps prevent
defects and allows you to determine whether a process is
in control or out of control
The seven run rule states that if seven data points in a
row are all below the mean, above the mean, or
increasing or decreasing, then the process needs to be
examined for non-random problems
Control Chart of 12” ruler
Control Chart (continued)
The output of a production process will fluctuate. The causes of
fluctuation can be random, or non-random due to a
desirable or undesirable process change. Control charts graph and
measure process data against control limits. Control charts can
distinguish random variation from assignable (non-random) causes.
We cannot adjust random variation out of a process. Process
adjustments for random variation are neither necessary nor desirable.
This is over-adjustment or tampering, and it makes the process worse.
We can and must investigate assignable causes (or non-random
causes). Points outside the control limits are evidence of process
problems. Analysts must investigate every out-of-control point for an
assignable cause. They must record their findings and any corrective
actions. For example, a tool adjustment, a change in Formal Technical
Review format, or replacement of worn tooling may correct the problem.
Pattern analysis of Control Charts
7-run rule
The 7-run rule is used to filter out the random variation in a production
process. It shows the ‘trends’ that are caused by ‘assignable
causes’ (non-random causes), which require investigation and
possible corrective action.
7-run-rule patterns:
• seven points in a row above the mean value;
• seven points in a row below the mean value;
• seven points in a row all increasing; or
• seven points in a row all decreasing.
These patterns are indicators of non-random problems, which can be a
symptom of a process that is out of control.
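The four patterns can be checked mechanically. The sketch below is one possible reading of the rule, not a standard implementation: it assumes the mean is computed from the same series being examined and that "increasing" and "decreasing" mean strictly monotone. The function name is hypothetical.

```python
from statistics import mean


def seven_run_violations(values, run_length=7):
    """Return (start_index, pattern) for every window of `run_length`
    consecutive points that matches one of the four run patterns.

    Only the first matching pattern is reported for each window.
    """
    centre = mean(values)
    hits = []
    for start in range(len(values) - run_length + 1):
        window = values[start:start + run_length]
        pairs = list(zip(window, window[1:]))
        if all(v > centre for v in window):
            hits.append((start, "above mean"))
        elif all(v < centre for v in window):
            hits.append((start, "below mean"))
        elif all(a < b for a, b in pairs):
            hits.append((start, "increasing"))
        elif all(a > b for a, b in pairs):
            hits.append((start, "decreasing"))
    return hits


# Hypothetical monthly event rates: a steady upward drift.
print(seven_run_violations([1, 2, 3, 4, 5, 6, 7]))   # [(0, 'increasing')]
```

In practice the centre line would normally come from a baseline period rather than from the points under test; that choice is left open here.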
To develop a Control Chart to
determine project stability





1. Plot the individual metric values on a chart.
2. Compute the mean of the metric values and plot it as a line.
3. Plot the Upper Control Limit and Lower Control Limit.
4. Compute a standard deviation as (Upper Control Limit - mean)/3. Plot lines
one and two standard deviations above and below the mean. If any of the
standard deviation lines is less than 0.0, it need not be plotted unless the
metric being evaluated takes on values that are less than 0.0.
The standard deviation lines are then plotted on the control chart.
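The steps above can be sketched in Python. The text does not say how the control limits themselves are obtained, so this sketch assumes an individuals (XmR) chart, where the limits are the mean plus or minus 2.66 times the average moving range. That constant, and both function names, are assumptions rather than part of the procedure described here.

```python
from statistics import mean


def control_lines(values):
    """Centre line, control limits and the 1- and 2-sd lines.

    ASSUMPTION: limits are mean +/- 2.66 x average moving range
    (individuals/XmR chart); the source text does not specify this.
    """
    centre = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    half_width = 2.66 * mean(moving_ranges)
    ucl, lcl = centre + half_width, centre - half_width
    sigma = (ucl - centre) / 3          # step 4 of the text
    lines = {"mean": centre, "UCL": ucl, "LCL": lcl}
    for k in (1, 2):
        lines[f"+{k} sd"] = centre + k * sigma
        lines[f"-{k} sd"] = centre - k * sigma
    return lines


def lines_to_plot(lines, metric_can_be_negative=False):
    """Drop standard deviation lines below 0.0 unless the metric can be negative."""
    if metric_can_be_negative:
        return dict(lines)
    return {name: y for name, y in lines.items()
            if y >= 0.0 or not name.endswith("sd")}


# Hypothetical metric values (for example, errors found per review).
lines = control_lines([2, 4, 2, 4])
for name, y in lines_to_plot(lines).items():
    print(f"{name:>6}: {y:.2f}")
```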
Other Pattern Analysis of Control
Charts
The 7-run rule is just one method of filtering false alarms. Other
pattern analysis methods include, but are not limited to:
• a metric value lies outside the UCL or LCL;
• 2 out of 3 successive metric values lie more than 2
standard deviations away from the mean;
• 4 out of 5 successive metric values lie more than 1
standard deviation away from the mean;
• others…
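The three listed checks can be expressed as a short scan over the series. This is a sketch under two assumptions: the control limits sit three standard deviations from the mean, and "away from the mean" counts distance on either side. The function name is hypothetical.

```python
def zone_rule_signals(values, centre, sigma):
    """Return (index, rule) pairs; index is the last point of the window
    in which the rule fired.

    ASSUMPTION: UCL/LCL are centre +/- 3 sigma, and distance from the
    mean is measured on either side.
    """
    dist = [abs(v - centre) / sigma for v in values]
    signals = []
    for i, d in enumerate(dist):
        if d > 3:
            signals.append((i, "outside control limits"))
        if i >= 2 and sum(x > 2 for x in dist[i - 2:i + 1]) >= 2:
            signals.append((i, "2 of 3 beyond 2 sd"))
        if i >= 4 and sum(x > 1 for x in dist[i - 4:i + 1]) >= 4:
            signals.append((i, "4 of 5 beyond 1 sd"))
    return signals


# Hypothetical data with mean 0 and standard deviation 1.
print(zone_rule_signals([0, 0, 4, 0, 0], centre=0, sigma=1))
# [(2, 'outside control limits')]
```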
Reliability :
Basic Calculation & Application
CHARACTERIZING FAILURE
OCCURRENCES IN TIME
Four general ways:
1. time of failure
2. time interval between failures
3. cumulative failures experienced up to a given time
4. failures experienced in a time interval
Time-based failure specification
Failure number   Failure time (sec)   Failure interval (sec)
1                10                   10
2                19                   9
3                32                   13
4                43                   11
Failure-based failure specification
Time (sec)   Cumulative failures   Failures in interval
30           2                     2
60           5                     3
90           7                     2
120          8                     1
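The four ways of characterizing failures are interchangeable views of the same data. As a minimal sketch (all function names are hypothetical), the following Python converts between them, using the values from the time-based and failure-based tables as separate examples.

```python
from itertools import accumulate


def intervals_from_times(failure_times):
    """Time between successive failures (the first interval starts at t = 0)."""
    return [t - prev for prev, t in zip([0] + failure_times[:-1], failure_times)]


def times_from_intervals(intervals):
    """Failure times recovered from the intervals between failures."""
    return list(accumulate(intervals))


def cumulative_failures(failure_times, checkpoints):
    """Number of failures experienced up to and including each checkpoint."""
    return [sum(1 for t in failure_times if t <= c) for c in checkpoints]


def failures_per_interval(cumulative):
    """Failures in each interval, from the cumulative counts."""
    return [c - prev for prev, c in zip([0] + cumulative[:-1], cumulative)]


# Time-based table: failure times 10, 19, 32, 43 sec.
print(intervals_from_times([10, 19, 32, 43]))      # [10, 9, 13, 11]

# Failure-based table: cumulative failures 2, 5, 7, 8 at 30, 60, 90, 120 sec.
print(failures_per_interval([2, 5, 7, 8]))         # [2, 3, 2, 1]
```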
TABLE 3
Typical probability distribution of failures
Value of random variable     Probability   Product of value
(failures in time period)                  and probability
0                            0.10          0
1                            0.18          0.18
2                            0.22          0.44
3                            0.16          0.48
4                            0.11          0.44
5                            0.08          0.40
6                            0.05          0.30
7                            0.04          0.28
8                            0.03          0.24
9                            0.02          0.18
10                           0.01          0.10
Mean failures                              3.04
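The mean in Table 3 is the expected value of the distribution, the sum of each value multiplied by its probability. A short Python check of that arithmetic (the function name is hypothetical):

```python
# Probabilities from Table 3; the list index is the number of failures.
probabilities = [0.10, 0.18, 0.22, 0.16, 0.11, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01]


def mean_failures(pmf):
    """Expected number of failures: sum of value x probability."""
    return sum(k * p for k, p in enumerate(pmf))


# The probabilities of a complete distribution must sum to 1.
assert abs(sum(probabilities) - 1.0) < 1e-9

print(round(mean_failures(probabilities), 2))   # 3.04
```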
TIME VARIATION


Mean value function - represents the average
cumulative failures associated with each time point.
Failure intensity function - is the rate of change
of the mean value function, i.e. the number of failures per
unit time.
Probability distributions at times tA and tB
Value of random variable     Probability at            Probability at
(failures in time period)    elapsed time tA = 1 hr    elapsed time tB = 5 hr
0                            0.10                      0.01
1                            0.18                      0.02
2                            0.22                      0.03
3                            0.16                      0.04
4                            0.11                      0.05
5                            0.08                      0.07
6                            0.05                      0.09
7                            0.04                      0.12
8                            0.03                      0.16
9                            0.02                      0.13
10                           0.01                      0.10
11                           0                         0.07
12                           0                         0.05
13                           0                         0.03
14                           0                         0.02
15                           0                         0.01
Mean failures                3.04                      7.77
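The two means are two points on the mean value function. Since the failure intensity is its rate of change, the slope between the two points gives an average intensity over the interval. This finite-difference estimate is an illustration added here, not a figure from the source, and the function names are hypothetical.

```python
# Probabilities from the table; the list index is the number of failures.
p_tA = [0.10, 0.18, 0.22, 0.16, 0.11, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01]
p_tB = [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.09, 0.12,
        0.16, 0.13, 0.10, 0.07, 0.05, 0.03, 0.02, 0.01]


def mean_failures(pmf):
    """Mean value function at one time point: sum of value x probability."""
    return sum(k * p for k, p in enumerate(pmf))


def average_failure_intensity(pmf_a, t_a, pmf_b, t_b):
    """Average slope of the mean value function between t_a and t_b."""
    return (mean_failures(pmf_b) - mean_failures(pmf_a)) / (t_b - t_a)


print(round(mean_failures(p_tA), 2), round(mean_failures(p_tB), 2))  # 3.04 7.77
# (7.77 - 3.04) / (5 - 1): about 1.18 failures per hour on average.
print(average_failure_intensity(p_tA, 1, p_tB, 5))
```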
[Figure: mean value function F(t) (mean failures) and failure intensity function f(t) (failures/hr) plotted against time, with times tA = 1 and tB = 5 marked]
f(x) -- probability density function
F(x) -- cumulative probability distribution function
E.g.: when we toss three coins, there are eight possible events.
If we let x be the number of coins that land head side up, we have:
Event   HHH   HHT   HTH   THH   HTT   THT   TTH   TTT
x       3     2     2     2     1     1     1     0
H = Head side up; T = Tail side up
x       0     1     2     3
f(x)    1/8   3/8   3/8   1/8
F(x)    1/8   1/2   7/8   1
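The coin example can be reproduced by enumerating the eight events. A minimal Python sketch, using exact fractions so the values match the table:

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# All 2**3 equally likely events, e.g. ('H', 'H', 'T').
outcomes = list(product("HT", repeat=3))

# Number of events for each head count x.
events_per_x = Counter(o.count("H") for o in outcomes)

# f(x): probability of exactly x heads.
f = {x: Fraction(events_per_x[x], len(outcomes)) for x in range(4)}

# F(x): probability of x heads or fewer (running sum of f).
F = dict(zip(range(4), accumulate(f[x] for x in range(4))))

for x in range(4):
    print(x, f[x], F[x])
```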
DISCRETE FAILURE FUNCTION
f(t), the failure density function over a time interval [t1, t2], is defined
as the ratio of the number of failures occurring in the interval to the size
of the original population, divided by the length of the time interval:
$$ f(t) = \frac{[n(t_1) - n(t_2)]/N}{t_2 - t_1} $$
where n(t) is the number of survivors at time t and N is the size of the
original population.
f(t) measures the overall speed at which failures are occurring.
DISCRETE HAZARD FUNCTION
Z(t), the failure rate or hazard function, is the probability that
a failure occurs in some time interval [t1, t2], given that the
system has survived up to time t1. It is the ratio of the number of
failures occurring in the interval to the number of survivors at the
start of the interval, divided by the length of the time interval:
$$ Z(t) = \frac{[n(t_1) - n(t_2)]/n(t_1)}{t_2 - t_1} $$
Z(t) measures the instantaneous speed of failure.
FAILURE CURVES
PROBABILITY OF SUCCESS
F(t) is the probability of failure
(= Cumulative Distribution)
R(t) is the probability of success
(= Reliability)
Therefore, F(t) + R(t) = 1
DISCRETE FUNCTION EXAMPLE
Failure data for 10 hypothetical electrical components
Failure number   Operating time, h
1                8
2                20
3                34
4                46
5                63
6                86
7                111
8                141
9                186
10               266
$$ f(t) = \frac{[n(t_1) - n(t_2)]/N}{t_2 - t_1} \qquad Z(t) = \frac{[n(t_1) - n(t_2)]/n(t_1)}{t_2 - t_1} $$

Interval 0-8 h:
f(t) = [10-9]/10/(8-0) = 1/10/8 = 0.0125 per hr
Z(t) = [10-9]/10/(8-0) = 1/10/8 = 0.0125 per hr
F(t) = 1/10, R(t) = 9/10
MTTF = 8 * 10 = 80; overall MTTF = 8/1 * 10 = 80

Interval 8-20 h:
f(t) = [9-8]/10/(20-8) = 1/10/12 = 0.0083 per hr
Z(t) = [9-8]/9/(20-8) = 1/9/12 = 0.0093 per hr
F(t) = 2/10, R(t) = 8/10
MTTF = 12 * 9 = 108; overall MTTF = 20/2 * 10 = 100

Interval 20-34 h:
f(t) = [8-7]/10/(34-20) = 1/10/14 = 0.0071 per hr
Z(t) = [8-7]/8/(34-20) = 1/8/14 = 0.0089 per hr
F(t) = 3/10, R(t) = 7/10
MTTF = 14 * 8 = 112; overall MTTF = 34/3 * 10 = 113
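The worked example can be extended to all ten intervals with a short loop. This sketch applies the discrete f(t) and Z(t) definitions to the failure data for the 10 components; the function name is hypothetical, and it assumes exactly one failure ends each interval, as in the data.

```python
# Operating time (h) at each failure, from the failure data table.
failure_times = [8, 20, 34, 46, 63, 86, 111, 141, 186, 266]


def discrete_functions(times, population):
    """Per-interval failure density f, hazard rate Z, F and R.

    Assumes one failure at the end of each interval.
    """
    rows = []
    t1 = 0
    survivors = population                    # n(t1)
    for t2 in times:
        width = t2 - t1
        f = 1 / population / width            # [n(t1) - n(t2)] / N / (t2 - t1)
        z = 1 / survivors / width             # [n(t1) - n(t2)] / n(t1) / (t2 - t1)
        survivors -= 1                        # n(t2)
        reliability = survivors / population  # R at the end of the interval
        rows.append({"interval": (t1, t2), "f": f, "Z": z,
                     "F": 1 - reliability, "R": reliability})
        t1 = t2
    return rows


for row in discrete_functions(failure_times, len(failure_times)):
    print(row["interval"], f"f={row['f']:.4f}", f"Z={row['Z']:.4f}",
          f"F={row['F']:.1f}", f"R={row['R']:.1f}")
```

The first three rows should agree with the hand calculation: f falls slowly while Z is measured against the shrinking number of survivors.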
Achieving a reliable system
ref. Ian Sommerville, 7e Ch20
Three basic strategies to achieve reliability:
• Fault Avoidance: build fault-free systems from the start
• Fault Tolerance: build facilities into the system to let the system continue
when faults cause system failures
• Fault Detection: use software validation techniques to discover faults prior to
the system being put into operation
For most systems, fault avoidance and fault detection suffice to
provide the required level of reliability.
Implementing Fault Avoidance



Availability of a formal and unambiguous system
specification
Adoption of a quality philosophy by developers.
Developers should be expected to write bug-free
systems
…
Implementing Fault Tolerance





Even if somehow we build a fault-free system, we still need
fault-tolerance in critical systems
Fault-free does not mean failure-free
Fault-free means that the system correctly meets its
specifications
Specifications may be incomplete or faulty or unaware of a
requirement of the environment
Can never conclusively prove that a system is fault-free
Aspects of Fault Tolerance

• Failure Detection
System must be able to detect that the current state of the system
has caused a failure or will cause a failure
• Damage Assessment
System must detect what damage the system failure has caused
• Fault Recovery
System must change the state of the system to a known “safe” state
Can correct the damaged state (forward error recovery - harder)
Can restore to a previous known “safe” state (backward error
recovery - easier)
• Fault Repair
Modifying the system so that the failure does not recur
Many software failures are transient and need no repair, and normal
processing can resume after fault recovery
Implementing Fault Tolerance

Hardware - Triple-Modular Redundancy (TMR)




Hardware unit is replicated three (or more) times
Output is compared from three units
If one unit fails, its output is ignored
Space Shuttle is a classic example
[Diagram: Machine 1, Machine 2 and Machine 3 feed a Comparator, which produces the Output]
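The comparator in a triple-modular arrangement can be illustrated in software as a majority voter. This is only a sketch of the voting idea, not of real TMR hardware; the function names and the three example units are hypothetical, and outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of units.

    Raises RuntimeError when no value has a majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: units disagree")
    return value


def run_redundant(units, x):
    """Feed the same input to every unit and vote on the outputs."""
    return majority_vote([unit(x) for unit in units])


# Three hypothetical replicated units; the third one is faulty.
units = [lambda x: x * 2, lambda x: x * 2, lambda x: x * 2 + 1]
print(run_redundant(units, 21))   # 42, the faulty output is outvoted
```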
Implementing Fault Tolerance (2)


Using Software
N-Version programming




Have multiple teams build different versions of the software
and then execute them in parallel
Assumes teams are unlikely to make the same mistakes
Not necessarily a valid assumption, if teams all work from the
same specification
…
N-Version Programming
A commonly used approach in railway signaling, aircraft systems and reactor
protection systems
System Configuration for Failure Event Diagram




Divide the system into a hierarchical set of components. The reliability of the
components should be known or easy to estimate or measure.
Each component is represented as a switch. If the component is functioning,
the switch is viewed as CLOSED and if not functioning, as OPEN.
System success occurs if there is a continuous path through the
configuration.
The components are described as a combination of 2 types, AND & OR
configurations, with independent failures.
Express the reliability relationship between the components with a Failure
Event diagram.
EVENT DIAGRAM
[Diagram: event diagram for the AND-OR configuration, with components A, B, C, D and E]
45
EVENT EXPRESSION
RS = (A + B + C) * D * E
where RS = Reliability of system
= Probability of System success
TRUTH TABLE
A   B   A+B [OR]   A*B [AND]
1   1   1          1
1   0   1          0
0   1   1          0
0   0   0          0
.AND. CONFIGURATION
Rs = Reliability of the system
= R1 × R2
where R1 & R2 are the reliabilities of
components C1 & C2.
For n components
arranged in logical .AND.,
Rs = R1 × R2 × … × Rn
.OR. CONFIGURATION
Rs = Reliability of the system.
R1, R2, ..Rn = Reliability of component 1, 2,
..n.
in this case, it is easier to calculate by the
probability of failure F
Fs = (1 - Rs)
F1 = (1 - R1); F2 = (1 - R2)
Fs = F1 * F2
= (1 - R1) * (1 - R2)
Rs = 1 - Fs
= 1 - [(1 - R1) * (1 - R2) * … (1 - Rn)]
for n components arranged in logical .OR.
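The .AND. and .OR. formulas combine directly to evaluate an event expression such as RS = (A + B + C) * D * E. In this minimal sketch the function names and the component reliabilities are hypothetical, and failures are assumed to be independent, as the configuration rules require.

```python
from math import prod


def and_config(reliabilities):
    """All components must work: Rs = R1 x R2 x ... x Rn."""
    return prod(reliabilities)


def or_config(reliabilities):
    """At least one component must work: Rs = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical component reliabilities over a common time period.
r = {"A": 0.90, "B": 0.90, "C": 0.90, "D": 0.99, "E": 0.99}

# RS = (A + B + C) * D * E
rs = and_config([or_config([r["A"], r["B"], r["C"]]), r["D"], r["E"]])
print(round(rs, 4))   # about 0.979
```

With these numbers the OR group contributes 0.999, so the two AND components D and E dominate the result.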
Reliability Acronyms





MTBF - Mean Time Between Failures
MTTF - Mean Time To Failure
MTTR - Mean Time To Repair
MTBF = MTTF + MTTR
Many people consider MTBF to be far more useful
than measuring fault rate per LOC
Difference between Reliability and
Availability

Reliability - means no failures in the interval, say, 0 to t.

Availability - means only that the system is up at time t.
AVAILABILITY
$$ A(T) = \frac{\text{Time(up)}}{\text{Time(up)} + \text{Time(down)}} = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF} $$
where MTTF = Mean Time To Failure
MTTR = Mean Time To Repair
MTBF = Mean Time Between Failures = MTTF + MTTR
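As a quick numerical illustration of the availability formula, the sketch below uses hypothetical MTTF and MTTR figures; the function name is also an assumption.

```python
def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


# Hypothetical unit: fails on average every 990 h and takes 10 h to repair.
print(availability(990, 10))   # 0.99
```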
Failure rate and Reliability during the random
failure period
$$ R_s = e^{-t/m} = e^{-\lambda t} $$
where:
Rs = probability of failure-free operation for a period >= t
e = 2.718
t = specified period of failure-free operation
m = mean time between failures (MTBF)
λ = failure rate = 1/m
SYSTEM CONFIGURATION

Suppose we have Qp components with constant failure intensities.
Their reliabilities are measured over a common time period. Assume
that all must function successfully for system success. Then the
system failure intensity λ is given by
$$ \lambda = \sum_{k=1}^{Q_p} \lambda_k $$
where λk are the component failure intensities.
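Summing the failure intensities is consistent with multiplying the component reliabilities, because a product of exponentials is the exponential of the sum. A minimal Python check, with hypothetical failure rates and function names:

```python
from math import exp, isclose, prod


def reliability(t, failure_rate):
    """Probability of failure-free operation for a period t at a constant rate."""
    return exp(-failure_rate * t)


def system_failure_intensity(component_rates):
    """All components must work, so the constant failure intensities add."""
    return sum(component_rates)


rates = [1e-4, 2e-4, 5e-5]     # hypothetical failures per hour
t = 1000.0                     # hours

lam = system_failure_intensity(rates)
r_system = reliability(t, lam)

# Same result as multiplying the component reliabilities (.AND. configuration).
assert isclose(r_system, prod(reliability(t, r) for r in rates))
print(lam, round(r_system, 4))
```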
Confusion surrounds MTBF :
[Figure: curves labelled "Exp. failure" and "Intuitive", plotted against time in multiples of the MTBF (0 to 3)]
If the failure rate is constant, the probability that a product will
operate without failure for a time equal to or greater than its MTBF
is only 37%. This is based on the exponential distribution and it is
contrary to the intuitive feeling that there is a 50-50 chance of
exceeding an MTBF.
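The 37% figure follows from the exponential reliability formula with t set equal to the MTBF m. The second line, which gives the time at which the chance of survival really is 50-50, is an added illustration:

```latex
R_s(m) = e^{-m/m} = e^{-1} \approx 0.368

R_s(t_{50}) = e^{-t_{50}/m} = 0.5
\quad\Longrightarrow\quad
t_{50} = m \ln 2 \approx 0.693\, m
```

So under a constant failure rate, half of the units are expected to have failed by about 69% of the MTBF.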
Does a computer hard disk with an MTBF rating of 1 million hours mean that
the average unit will run for 114 years before it fails?
Hard disks have not even existed for 114 years, so how can we know
that one can be used for that long?
The MTBF of a drive is obtained by multiplying the number of drives in a large
test batch by the number of running hours between failures in the
batch. For example, when a disk manufacturer batch-tests 1500
hard disks and averages 30 days of operation across the
batch between individual unit failures, then
MTBF of the disk is 1500 x 30 x 24 hours
= 1,080,000 hours, or roughly 1 million hours (for the period of testing time).
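The arithmetic behind both numbers is easy to check. A small Python sketch (the function name is hypothetical, and a year is taken as 365 days):

```python
HOURS_PER_YEAR = 24 * 365


def batch_mtbf_hours(units: int, days_between_failures: float) -> float:
    """Unit-hours accumulated by the batch between successive failures."""
    return units * days_between_failures * 24


# 1500 drives with a failure somewhere in the batch every 30 days on average.
print(batch_mtbf_hours(1500, 30))               # 1080000

# 1 million hours expressed in years: the source of the "114 years" figure.
print(round(1_000_000 / HOURS_PER_YEAR, 1))     # 114.2
```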