S2P2 - Lyle School of Engineering

EMIS 7300
SYSTEMS ANALYSIS METHODS
Spring 2006
Dr. John Lipp
Copyright © 2002 - 2006 John Lipp
Today’s Session Topics
• Part 2: The Statistics You Thought You Knew.
– What is Statistics?
– Populations and Samples.
– Mean, Variance, Standard Deviation.
– Mode, Range, Quartiles, Percentiles.
– Frequency, Relative Frequency.
– Dot Diagram and Box Plot.
– Mechanistic vs. Empirical Models.
– Deterministic vs. Statistical Modeling.
What is Statistics?
• Statistics is the mathematics branch dealing with applied
probability theory.
• Statistics is very prescriptive.
• Statistics’ main emphasis is on decision making.
• Engineering is well populated with decisions to be made
based on random or imperfect data:
– Is this radar measurement just cosmic radiation, or is it a
stealth fighter?
– How many high-pressure hoses in this lot should be
destructively tested to be confident the whole lot is good?
– Is there a correlation between system performance and
missile mass, antenna gain, bad FLIR pixels, etc.?
What is Statistics? (cont.)
• Statistics is also concerned with estimation of unknown
quantities (statistical parameters like mean and variance).
• Estimation is the more prevalent statistics problem found in
engineering:
– Curve fitting (Regression)
• Linear,
• Logistic,
– The Kalman filter (a course in and of itself).
– Design of Experiments (another course).
Populations
• Missiles are built in lots of 10. The following parameters are
measured as percentages of the requirement specifications.
Missile   Weight   Motor   Seeker   Range   Labor
   1         99      96       99      96     105
   2        101     102      101      99     101
   3        102      98      101      95     101
   4        101     105      102     103      97
   5        103      99      101      95      96
   6        101     103       99     100      99
   7        102     102      100      98      90
   8         98      98       99     100     101
   9        100      94      100      94     105
  10        100     101      100     100     100
Populations (cont.)
Dot Diagrams
[Figure: dot diagrams of Weight, Motor, Seeker, Range, and Labor, each drawn on an axis from 90 to 110. The number of dots represents the frequency of the data value.]
Population Mean, Variance, and Standard Deviation
• The size of a population is denoted N. The number of unique
data values will be denoted M.
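• The population mean is a measure of a population’s central
tendency. It is commonly denoted by the Greek letter μ and is
computed from data via
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\text{or}\quad \mu = \frac{1}{N}\sum_{i=1}^{M} x_i N_i = \sum_{i=1}^{M} x_i \frac{N_i}{N} = \sum_{i=1}^{M} x_i f_i$$
where N_i is the number of occurrences (frequency) of the i-th unique
value and f_i = N_i / N is its relative frequency.
• The population variance is a measure of a population’s
variability about the population mean. It is commonly
denoted by the Greek letter σ² and is computed from data via
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \quad\text{or}\quad \sigma^2 = \sum_{i=1}^{M} (x_i - \mu)^2 f_i$$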
Population Mean, Variance, and Standard Deviation (cont.)
• The population standard deviation, σ, is the square root of the
population variance. It is also a measure of variability about
the mean.
– Unlike the population variance, the population standard
deviation has the same units as the population mean and
the raw data.
– In engineering σ² is usually proportional to power, while
σ is proportional to magnitude (voltage, current, force,
velocity, etc.).
– μ ± 1σ contains 68.3% of “normal” data
– μ ± 2σ contains 95.5% of “normal” data
– μ ± 3σ contains 99.7% of “normal” data
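The three percentages can be reproduced from the normal distribution. This is a small sketch that assumes exactly normal data and uses the standard identity P(|X − μ| < kσ) = erf(k/√2).

```matlab
k = 1:3;                  % number of standard deviations on either side of the mean
p = erf(k / sqrt(2));     % fraction of a normal population within mu +/- k*sigma
disp(100 * p);            % percentages for k = 1, 2, 3
```

Real data is only approximately normal, so these fractions are a guide rather than a guarantee.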
Population Mean, Variance, and Standard Deviation (cont.)
[Figure: dot diagrams of Weight, Motor, Seeker, Range, and Labor, each drawn on an axis from 90 to 110.]
Population Range, Median, Quartiles, and Percentiles
• The population range is another measure of variability. It is
the difference between the largest and smallest data values.
• The population mode is the most frequently occurring value in
the population, that is, the most probable value. Ties are
allowed.
• The population range and mode are rarely used. If the size of
the population is infinite they can be undefined.
Population Range, Median, Quartiles, and Percentiles
• The population median is another measure of central
tendency. It is computed via sorting the data (from lowest to
highest) and dividing this ordered data into two equal halves
at the data mid-point.
– If N is odd, the median is the “left over” data point after
dividing at the mid-point into equal halves.
– If N is even, the median is the average of the two data
points on either side of the mid-point.
– Regardless of N’s value, the same number of data points
are above and below the median’s value (ties with the
median are allocated above/below as necessary).
Population Range, Median, Quartiles, and Percentiles (cont.)
• The division points which divide ordered data into four
“equal” portions are the population quartiles:
– The first or lower quartile is denoted q1.
– The second quartile is the median, q2.
– The third or upper quartile is denoted q3.
• The difference q3 – q1 is called the interquartile range and is
yet another measure of variability. For “normal” data the
interquartile range should be about 4σ/3.
• The division points which divide ordered data into 100
“equal” portions are the population percentiles.
– Percentiles are most commonly denoted as i.
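The rule of thumb for the interquartile range of “normal” data can be checked with one line of arithmetic. This sketch assumes an exactly normal population, whose quartiles lie about 0.6745 standard deviations on either side of the mean:

```latex
q_3 - q_1 \approx (\mu + 0.6745\,\sigma) - (\mu - 0.6745\,\sigma)
          = 1.349\,\sigma \approx \tfrac{4}{3}\,\sigma
```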
Population Range, Median, Quartiles, and Percentiles (cont.)
• What constitutes “equal” portions is defined by the following
algorithm
– Sort (order) the x data in increasing order, call the result y
$$(y_1, y_2, y_3, \ldots, y_N) = \mathrm{sort}(x_1, x_2, x_3, \ldots, x_N)$$
– Determine the quartile / percentile point z by computing
$$z = \frac{Q(N+1)}{4} \;\text{(quartiles)} \qquad z = \frac{K(N+1)}{100} \;\text{(percentiles)}$$
– Q ∈ {1..3} for the first quartile, second quartile (median),
or third quartile, respectively.
– K ∈ {1..100} for the K-th percentile.
Population Range, Median, Quartiles, and Percentiles (cont.)
– Linear interpolation is used to compute the quartiles and
percentiles from the two closest data points to z via
$$f(z) = (z - \lfloor z \rfloor)\, y_{\lceil z \rceil} + (\lceil z \rceil - z)\, y_{\lfloor z \rfloor}$$
– ⌊z⌋ = closest integer less than or equal to z (floor).
– ⌈z⌉ = closest integer greater than or equal to z (ceiling).
– If z is itself an integer, both weights are zero and the
result is simply the data point y_z.
• For the missile lot, N = 10. Thus the value of z for q1 is 2.75
and for q3 it is 8.25. The quartile equations are then
$$q_1 = (2.75 - 2)\, y_3 + (3 - 2.75)\, y_2 = \frac{3 y_3 + y_2}{4}$$
$$q_3 = (8.25 - 8)\, y_9 + (9 - 8.25)\, y_8 = \frac{3 y_8 + y_9}{4}$$
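The interpolation can be sketched for the Weight column as follows. The helper name interp_at is hypothetical, and the formula is written in the equivalent form y(⌊z⌋) + (z − ⌊z⌋)(y(⌈z⌉) − y(⌊z⌋)), which is chosen here because it also returns y(z) when z is an integer.

```matlab
% Sorted weight data for the missile lot (N = 10)
y = sort([99 101 102 101 103 101 102 98 100 100]);
N = numel(y);

% Linear interpolation between the two data points that bracket position z
interp_at = @(y, z) y(floor(z)) + (z - floor(z)) * (y(ceil(z)) - y(floor(z)));

z  = (1:3) * (N + 1) / 4;      % positions for Q = 1, 2, 3: 2.75, 5.5, 8.25
q1 = interp_at(y, z(1));       % lower quartile
q2 = interp_at(y, z(2));       % median
q3 = interp_at(y, z(3));       % upper quartile
disp([q1 q2 q3]);
```

Other software may use a different definition of the position z, so built-in quantile functions will not necessarily return the same values.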
Population Range, Median, Quartiles, and Percentiles (cont.)
• The population statistics (means, variances, quartiles, etc.) for
the missile lot are computed in the table below.
Parameter   Weight    Motor     Seeker    Range     Labor
μ           100.7     99.8      100.2     98.0      99.5
σ²          2.01      10.36     0.96      7.60      17.65
σ           1.42      3.22      0.98      2.76      4.20
range       5         11        3         9         15
q3 – q1     2.25      4.75      2.00      5.00      5.25
q1          99.75     97.50     99.00     95.00     96.75
q2          101.0     100.0     100.0     98.5      100.5
q3          102.00    102.25    101.00    100.00    102.00
Population Range, Median, Quartiles, and Percentiles (cont.)
• Below is a common method of illustrating the statistical
quantiles called a box plot.
– The box is drawn from q1 to q3 (the interquartile range)
and has the median, q2, marked in the middle.
– A line and mark known as a whisker extends from the
box’s q1 end to the smallest data point within 1.5
interquartile ranges from q1.
– Likewise, a whisker is drawn from the box’s q3 end to the
largest data point within 1.5 interquartile ranges from q3.
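The whisker rule can be sketched for the Motor column. The quartiles are the Motor values from the missile-lot statistics; treating points beyond the 1.5 interquartile-range limits as outliers is the usual box-plot convention and is an assumption here, as are the variable names.

```matlab
x  = [96 102 98 105 99 103 102 98 94 101];   % Motor column of the missile lot
q1 = 97.50;                                  % lower quartile for Motor
q3 = 102.25;                                 % upper quartile for Motor
iqr_x = q3 - q1;                             % interquartile range

lo_limit = q1 - 1.5 * iqr_x;                 % farthest a lower whisker may reach
hi_limit = q3 + 1.5 * iqr_x;                 % farthest an upper whisker may reach

whisker_lo = min(x(x >= lo_limit));          % smallest data point within the limit
whisker_hi = max(x(x <= hi_limit));          % largest data point within the limit
outliers   = x(x < lo_limit | x > hi_limit); % points beyond the whiskers, if any
```

Note that the whiskers end at actual data points, not at the limits themselves.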
[Figure: box plot of Labor on an axis from 90 to 110, with one data point labeled “Outlier”.]
Population Range, Median, Quartiles, and Percentiles (cont.)
[Figure: box plots of Weight and Motor, each on an axis from 90 to 110.]
Population Range, Median, Quartiles, and Percentiles (cont.)
[Figure: box plots of Seeker and Range, each on an axis from 90 to 110.]
Population Range, Median, Quartiles, and Percentiles (cont.)
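• The code below is for MATLAB.
• Notice that MATLAB has built-in functions to compute most of the
statistical values.
• std(x,1) divides by N to give the population standard
deviation, while std(x) divides by N-1 to give the
sample standard deviation.
% Population analysis for Missile Lot
% Columns: Weight, Motor, Seeker, Range, Labor (one row per missile)
x = [  99   96   99   96  105
      101  102  101   99  101
      102   98  101   95  101
      101  105  102  103   97
      103   99  101   95   96
      101  103   99  100   99
      102  102  100   98   90
       98   98   99  100  101
      100   94  100   94  105
      100  101  100  100  100 ];
y = sort(x);                   % sort each column in increasing order
x_bar = mean(x)
sigma = std(x,1)               % population standard deviation (divides by N)
sigma2 = sigma.^2              % population variance (avoids shadowing var())
q2 = median(x)
rng = range(x)                 % Statistics Toolbox; same as max(x) - min(x)
q1 = (3*y(3,:) + y(2,:)) / 4   % interpolation at z = 2.75
q3 = (3*y(8,:) + y(9,:)) / 4   % interpolation at z = 8.25
q31 = q3 - q1                  % interquartile range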
In Class Assignment – Deal or Dud?
• The customer is unhappy with his lot of missiles and is
canceling the contract.
– They claim the missiles don’t meet the contract
requirements!!!
• Divide up into two teams of statisticians,
– One representing the plaintiff (the customer), and
– The other the defendant (Missile King).
• Prepare to argue WITH STATISTICS the case for your side!
Samples
• An entire population may not or cannot be measured to
determine the population’s statistics:
– The measurement process is destructive.
– The measurement costs are excessive.
– The population is evolving, i.e., the statistics fluctuate.
– The population is theoretical (N = ∞).
• Instead of measuring the population, a sub-set or sample of
the population can be measured and the parameters estimated
by statistical inference.
• The number of items in the sub-sample is typically denoted as
n in statistics. (Note n < N.) Denote the number of unique
data values as m.
Sample Mean, Variance, and Standard Deviation
• The sample mean is an estimate of the population mean. It is
commonly denoted by $\bar{x}$ and is computed from data via
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{or}\quad \bar{x} = \sum_{i=1}^{m} x_i f_i$$
• The sample variance is an estimate of the population
variance. It is commonly denoted by s² and is computed from
data via
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
$$\text{or}\quad s^2 = \frac{n}{n-1}\sum_{i=1}^{m} (x_i - \bar{x})^2 f_i$$
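The three forms of the sample variance can be checked against each other numerically. This is a minimal sketch: the sample (the weights of missiles 1 through 3) is picked arbitrarily for illustration, and the variable names are assumptions.

```matlab
x = [99 101 102];                 % hypothetical sample: weights of missiles 1-3
n = numel(x);
x_bar = sum(x) / n;               % sample mean

% Form 1: squared deviations about the sample mean
s2_a = sum((x - x_bar).^2) / (n - 1);

% Form 2: computational shortcut using sums of x and x.^2
s2_b = (sum(x.^2) - sum(x)^2 / n) / (n - 1);

% Form 3: relative frequencies of the m unique values
[u, ~, idx] = unique(x);
f = accumarray(idx(:), 1).' / n;  % relative frequency of each unique value
s2_c = n / (n - 1) * sum((u - x_bar).^2 .* f);

s = sqrt(s2_a);                   % sample standard deviation
disp([s2_a s2_b s2_c var(x)]);    % var(x) also divides by n - 1
```

The shortcut form subtracts two large, nearly equal numbers, so the first form is generally the safer choice in floating-point arithmetic.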
Sample Range, Median, Quartiles, and Percentiles (cont.)
• The sample standard deviation, s, is likewise the square root
of the sample variance and is an estimate of the population
standard deviation.
• The sample range is the difference between the largest and
smallest data sample values.
• Similarly, the sample median, sample quartiles, and sample
percentiles are found by using the sorted data samples and
substituting n for N in the population equations.
• The sample mode is the most frequently occurring value in the
data samples. Ties are allowed.
Sample Statistics
• Consider the missile lot. If only 3 of the 10 missiles are
sampled, what is the sample mean?
• Let i, j, and k denote the missiles selected for the sample.
Then the sample mean is
$$\bar{x} = \frac{1}{3}\left(x_i + x_j + x_k\right)$$
• Since {i, j, k} are selected at random, so are the {xi, xj, xk}
data values. That implies the sample mean
– Is itself a random variable,
– Has a population, and
– Has its own mean, variance, quartiles, etc.
Sample Statistics (cont.)
• The first step in determining the sample mean’s statistics is to
determine the population:
– Select i first. Since N = 10, i can take on one of 10 values.
– Select j next. Since i ≠ j, j can take on one of 9 values.
– Finally select k. Since k ≠ i and k ≠ j, k is one of 8 values.
– The total population size is 10 × 9 × 8 = 720.
• The process applicable to determining the population in this
case is known as selection without replacement. The general
formula for the size is known as the number of permutations
$${}_{n}P_{r} = \frac{n!}{(n-r)!}$$
where n is the population size and r is the sample size.
Sample Statistics (cont.)
• However, the order in which {xi, xj, xk} are added does not
change the value of the sample mean.
• Regardless of the values of i, j, and k, there are six orders in
which they can be randomly drawn:
– {i, j, k} {j, i, k} {i, k, j} {j, k, i} {k, i, j} {k, j, i}
– Thus, the population size can be reduced to 720 / 6 = 120.
• Generally, the number of orders in which r things can be
arranged is r! (= rPr).
• When the order of draw is not important, the formula for the
number of combinations is
$${}_{n}C_{r} = \frac{n!}{(n-r)!\, r!}$$
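The counts for the missile lot follow directly from the two formulas. A short sketch using MATLAB's built-in factorial and nchoosek, with n = 10 missiles and r = 3 samples as in the example:

```matlab
n = 10;                                      % population size (missiles in the lot)
r = 3;                                       % sample size

n_perm = factorial(n) / factorial(n - r);    % ordered draws, nPr
orders = factorial(r);                       % orderings of any one sample, r!
n_comb = nchoosek(n, r);                     % unordered draws, nCr

disp([n_perm orders n_comb n_perm / orders]);
```

Dividing the number of permutations by r! gives the same count as nchoosek.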
Sample Statistics (cont.)
• Consider the n = 3 sample mean for the weight of the missile
lot. Note that
– Largest possible value = (103 + 102 + 102) / 3 = 102 1/3.
– Smallest possible value = (98 + 99 + 100) / 3 = 99.
– Range is 3 1/3, with discrete steps every 1/3.
– That is, the number of unique values is 11 !?!
• What is different about the proposed populations? For
– N=720: each permutation of i, j, and k is equally probable.
– N=120: each combination of i, j, and k is equally probable.
– N=11: each sample mean is unequally probable!
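The three views of the population can be compared by brute force. This sketch enumerates all 120 combinations for the Weight column and counts how often each sample mean occurs; it works with integer sums rather than means so that equal values compare exactly, and the variable names are assumptions.

```matlab
w = [99 101 102 101 103 101 102 98 100 100];   % Weight column of the missile lot

c = nchoosek(1:10, 3);          % every combination of 3 missile indices (120 rows)
s = sum(w(c), 2);               % sum of the three weights in each combination

[vals, ~, idx] = unique(s);     % distinct sums; vals / 3 are the distinct means
counts = accumarray(idx(:), 1); % number of combinations producing each sum

disp([vals / 3, counts]);       % each distinct sample mean and how often it occurs
```

The counts are not all equal, which is why the distinct sample means are not equally probable even though each combination is.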
Sample Statistics (cont.)
[Figure: sampling distribution of the mean. Mean weight (n = 3) on an axis from 98 to 102.]
Sample Statistics (cont.)
[Figure: plot on an axis from 98 to 102.]
Sample Statistics (cont.)
• The code below and on the next page is for MATLAB.
• This code assumes that you have already run the code on page
1-24.
• The first code section formulates the population for the n = 3 mean.
• The second code section computes the statistics.
• The third code section computes the data for dot diagrams /
histograms and plots them.
% Population of N = 120
%
i = zeros(120,1);
j = zeros(120,1);
k = zeros(120,1);
ijkdx = zeros(120,3);
N = 0;
for idx = 1:10,
  for jdx = (idx+1):10,
    for kdx = (jdx+1):10,
      N = N + 1;
      i(N) = idx;
      j(N) = jdx;
      k(N) = kdx;
      ijkdx(N,:) = [idx jdx kdx];
    end
  end
end
% Check that i,j,k are all different
%
sum(i==j),
sum(i==k),
sum(j==k),
Sample Statistics (cont.)
% Sample mean population statistical analysis
%
x3 = (x(i,:) + x(j,:) + x(k,:)) / 3;   % one row per combination of 3 missiles
x_bar = mean(x3)
sigma = std(x3,1)                      % population form (divides by N)
sigma2 = sigma.^2                      % variance (avoids shadowing var())
q2 = median(x3)
y3 = sort(x3);
z1 = 1 * (N+1) / 4;
z3 = 3 * (N+1) / 4;
q1 = (z1 - floor(z1)) * y3(ceil(z1),:) + ...
     (ceil(z1) - z1) * y3(floor(z1),:)
q3 = (z3 - floor(z3)) * y3(ceil(z3),:) + ...
     (ceil(z3) - z3) * y3(floor(z3),:)
q31 = q3 - q1
% Compute / plot dot diagrams (but use the
% built-in histogram function) for all columns
%
for loop = 1:min(size(x3)),
  rng = (3*min(x3(:,loop))):(3*max(x3(:,loop)));   % bins in steps of 1/3
  figure(loop);
  hist(3*x3(:,loop),rng);
  xx = get(gca,'xtick');
  set(gca,'xticklabel',xx/3);                      % relabel axis in original units
end
Sample Statistics (cont.)
Sample Mean (n = 3) of
Parameter   Weight      Motor       Seeker      Range      Labor
μ           100.7       99.8        100.2       98.0       99.5
σ²          0.52        2.69        0.25        1.97       4.58
σ           0.72        1.64        0.50        1.40       2.14
q3 – q1     1.0000      2.3333      0.6667      2.0000     3.0000
q1          100.3333    98.6667     100.0000    97.0000    98.0000
q2          100.6667    99.6667     100.3333    98.0000    99.6667
q3          101.3333    101.0000    100.6667    99.0000    101.0000
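The variance row can be cross-checked against the population variances without enumerating the 120 combinations. The slides do not state this formula. It is the standard result for the variance of a sample mean when sampling without replacement from a finite population, shown here for the Weight column with N = 10, n = 3, and σ² = 2.01:

```latex
\sigma_{\bar{x}}^{2} = \frac{\sigma^{2}}{n}\cdot\frac{N-n}{N-1}
                     = \frac{2.01}{3}\cdot\frac{10-3}{10-1}
                     \approx 0.52
```

The same calculation gives about 2.69 for Motor, 0.25 for Seeker, 1.97 for Range, and 4.58 for Labor, while the mean of the sample means equals the population mean in every column.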
Statistical Modeling
• Engineering analysis often begins with a mechanistic model
of a physical system using scientific first principles, for
example, F = ma, V = IR, etc.
• The analysis results of such a design are deterministic, exact
and reproducible.

[Diagram: inputs x1, x2, …, xn enter a transfer function y = f(x1, x2, …, xn), which produces the output y.]
Statistical Modeling (cont.)
• Sources of experimental variability are many:
– Imperfect hardware and measurement devices.
– Assumptions are approximate (frictionless surfaces really
aren’t, missiles flex during maneuvers, etc.).
• A mechanistic model can be augmented with random errors to
represent this lack of knowledge:

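[Diagram: inputs x1, x2, …, xn and random errors e1, e2, …, em enter the transfer function y = f(x1, x2, …, xn, e1, e2, …, em), which produces the output y.]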
Statistical Modeling (cont.)
• The primary parameters in a statistical model of a system are
– The mean of the response, Y-bar.
– The standard deviation of the response, S.
– The random distribution of the response errors.
• A deterministic model only considers the mean of the
response:
– The response’s standard deviation is effectively 0.
– The response’s random distribution is an indeterminate
concept.
Statistical Modeling (cont.)
• When a mechanistic model is unavailable, an empirical model
based on experimental evidence can be constructed by
considering the system to be a “black box.”
[Diagram: black-box empirical model. Input factors x1, x2, …, xn and random errors / noise enter the box, and the output response(s) y leave it. The response mean and the response variation are each modeled as the row vector [1 x1 x2 … xn] multiplied by a column vector of n + 1 coefficients (indexed 0 through n).]
Modeling (cont.)
• Neither a mechanistic nor an empirical model is appropriate in
some cases! Some phenomena are purely random, possibly
even irreducibly random.
Regardless of the problem, statistics boils down to modeling!
Statistical Model for Sample Mean and Variance
• Simplest Transfer Function: Y-bar and S are constants.
[Figure: scatter of the response y (axis from 2.5 to 7.5) against x (axis from 0 to 100) for Y-bar = 5 and S = 1.]
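Data like that in the constant-parameter example can be simulated in a few lines. This sketch assumes normally distributed errors, which the slide does not state; Y-bar = 5 and S = 1 are the slide's values, and the variable names are chosen here.

```matlab
y_bar = 5;                        % constant mean of the response
S     = 1;                        % constant standard deviation of the response

x = (0:100).';                    % input values along the horizontal axis
y = y_bar + S * randn(size(x));   % response = mean + random error (assumed normal)

disp([mean(y) std(y)]);           % sample estimates of Y-bar and S
```

Calling plot(x, y, '.') afterwards draws the scatter; the sample estimates will differ slightly from 5 and 1 on every run because the errors are random.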