Probability Density Functions

advertisement
Probability Plots
Jake Blanchard
Spring 2010
Uncertainty Analysis for Engineers
1
Introduction
Probability plots allow us to assess the
degree to which a set of data fits a
particular distribution
 The idea is to scale the x-axis of a CDF
such that the result would be a straight
line if the data conforms to the assumed
distribution

Uncertainty Analysis for Engineers
2
An Example
Suppose we have a set of data that we
suspect is normal.
 First, we form an empirical cdf

◦ [f,xx]=ecdf(x)

Then scale cdf so that each unit on the
axis corresponds to 1 standard deviation
◦ z=norminv(ff);

Then we plot the data (sorted) against
this new axis
◦ figure, plot(y,z,'+')
Uncertainty Analysis for Engineers
3
The Script
n=100;
x=normrnd(10,3,n,1);
y=sort(x);
[f, xx]=ecdf(x)
for i=1:n
ff(i)=(f(i)+f(i+1))/2;
end
z=norminv(ff);
figure, plot(y,z,'+')
Uncertainty Analysis for Engineers
4
3
2
1
0
-1
-2
-3
0
2
4
6
8
10
12
14
16
18
Uncertainty Analysis for Engineers
5
Matlab has an alternative
probplot('norm',x)
 Options are

◦
◦
◦
◦
◦
◦
exponential
extreme value
lognormal
normal
rayleigh
weibull
Uncertainty Analysis for Engineers
6
Probability plot for Normal distribution
0.995
0.99
0.95
0.9
Probability
0.75
0.5
0.25
0.1
0.05
0.01
0.005
0
2
4
6
8
10
12
14
16
18
Data
Vertical axis here is cdf, not
number of standard deviations
Uncertainty Analysis for Engineers
7
Look at some lognormal data
Probability plot for Normal distribution
0.995
0.99
0.95
0.9
Probability plot for Normal distribution
0.995
0.99
Normal
probability plot
0.5
0.25
0.1
0.95
0.9
0.75
Probability
0.05
0.01
0.005
0.5
0.25
0
2
4
6
8
10
Probability plotData
for Lognormal distribution
12
0.1
6
x 10
0.05
0.995
0.99
0.01
0.005
2
0.95
0.9
4
6
8
10
Data
12
14
16
18
0.75
Probability
Probability
0.75
0.5
0.25
Lognormal
probability plot
0.1
0.05
Normal
probability plot
of log of data
0.01
0.005
0
10
2
10
4
10
Data
6
10
8
10
Uncertainty Analysis for Engineers
8
Now some exponential data
Probability plot for Lognormal distribution
0.995
0.99
0.95
0.9
0.5
0.25
0.1
Probability plot for Exponential distribution
0.05
0.995
0.01
0.005
0.99
-3
10
-2
10
-1
0
10
10
1
2
10
10
Data
Lognormal
probability plot
Probability
Probability
0.75
0.95
Exponential
probability plot
0.9
0.75
0.5
0.25
0.1
0
10
20
30
40
Data
50
60
Uncertainty Analysis for Engineers
70
80
9
Facts
On normal probability plots, the intercept
is the mean
 On exponential paper, the slope is 1/
 Results at the extremes are expected to
deviate from the straight line more than
those in the middle
 On the other hand, for some data,
multiple distributions will fit in the center,
but not in the tails

Uncertainty Analysis for Engineers
10
Results

We are dealing with samples, so our
conclusions tend to be one of
◦ The model appears to be adequate
◦ The model is questionable
◦ The model is not adequate
Uncertainty Analysis for Engineers
11
Comparison
Take some wind data (maximum
measured wind velocity over a given
period)
 20 data points taken over 20 years
 Compare all 6 Matlab probability plots
 Compare looking at CDFs
 Compare other error measures

Uncertainty Analysis for Engineers
12
0.25
0.1
0.25
0.95
0.9
0.1
0.05
0.05
70
80
Data
90
100
60
Probability plot for Weibull distribution
0.999
0.99
0.95
0.9
0.75
0.25
0.1
0.05
1.84
1.93
10
10
Data
70
80
Data
90
0.5
0.25
1.84
100
1.93
10
10
Data
Probability plot for Exponential distribution
Probability plot for Rayleigh distribution
0.95
0.95
Probability
0.5
0.75
0.1
0.05
0.9
Probability
60
Probability
0.5
Probability plot for Lognormal distribution
Probability
Probability
Probability
Probability plot for Extreme value distribution
Probability plot for Normal distribution
0.999
0.99
0.95
0.95
0.9
0.9
0.75
0.75
0.5
0.9
0.75
0.5
0.25
0.1
0.005
60
70
80
Data
90
100
0.75
0.5
0.25
0.1
0.05
0.01
0.001
60
70
80
Data
90
Uncertainty Analysis for Engineers
100
13
Empirical CDF
1
0.9
0.8
0.7
F(x)
0.6
0.5
EV
Norm
Exp
LogN
Weibull
Rayleigh
0.4
0.3
0.2
0.1
0
68
70
72
74
76
78
x
80
82
84
Uncertainty Analysis for Engineers
86
14
Goodness of Fit Statistics

For discrete and continuous sampled data
distributions
◦
◦
◦
◦
◦
Chi-square statistic
Kolmogorov-Smirnoff (K-S) statistic
Anderson-Darling (A-D) statistic
Root Mean Square Error (RMS).
Value is limited if there are fewer than about
30 data points.
◦ The lower the value, the closer the
distribution appears to fit the data. But they
do not provide a measure that the data
actually come from the distribution.
Chi-square statistic

This goodness-of-fit statistic measures
◦ The oldest, most commonly used
◦ Data are grouped into frequency cells and compared
to the expected number of observations based on the
proposed distribution.
◦ Definition
2
N






O
i

E
i
2  
i 1
E i 
 Where O(i) is the observed frequency of the ith histogram bar
and
 E(i) is the expected frequency from the fitted distribution of x
values falling within the x range of the ith histogram bar.
◦ It can be overly sensitive to large errors
Chi-Squared Tests

First we divide values into groups;
suggestion is
k  n/5


2


k

4
0
.
75
n

1




0.2
n  200
n  200
For example, if we have n=500 data points,
then this gives us about 45 groups (I would
use 50 for convenience)
Example (cont.)
Sort data and divide into 50 cells
 Find upper and lower bound of values in
each cell
 Calculate expected number of data points
in each cell by subracting cdf of lower
bound from cdf of upper bound

Example (cont.)

Compare this value to the value of the chisquared distribution for k-np-1 degrees of
freedom and a desired confidence level,
where np is the number of parameters in
the model (eg, 2 for a normal distribution)
Example (cont.)
In Matlab, we can get this chi-squared
distribution from chi2inv(p,v), where p
is the confidence level (0<p<1) and v is
the number of DOF
 If our calculated value for chi-squared is
less than chi-square distribution, then the
fit is OK

Kolmogorov-Smirnov Test
Compare measured cumulative frequency
with CDF of assumed theoretical
distribution
 Compare the maximum discrepancy
between these two with a critical value of
a test statistic and reject fit if former
exceeds latter
 Good when we don’t have many data
points

Kolmogorov-Smirnoff Statistic
D  maxF x   F x 
n
◦
◦
◦
◦

n
Where Dn is the K-S distance,
n is the total number of data points,
F(x) is the distribution function of the fitted distribution,
and Fn(x)=i/n and i is the cumulative rank of the data point.
K-S is better than χ2 because data are assessed at all
points—
◦ avoids problem of number of bars (bins).

But value determined by the one largest discrepancy
◦ So it takes no account of lack of fit across entire distribution
More on K-S statistic

The position of Dn along the x-axis is more
likely to occur away from the low probability
tails.
◦ This insensitivity to lack of fit at the extremes is
corrected for in the Anderson-Darling statistic.

Some statistical literature is critical about
distribution fitting software that use this
statistic as a goodness-of-fit test.
◦ Because the statistic assumes the fitted distribution is
fully specified so that the critical region of the curve
can be checked.
Process
Sort n data points
 Make a step-wise cdf

◦ (cdfplot in Matlab)
0
i
S n ( x)  
n
1
x  x1
xi  x  xi 1
x  xn
Fit data to a model to obtain model cdf
 Find maximum difference between these
two cdf’s over each of the steps in the first
cdf
 Look up comparison data in tables

KS Critical Values
n
=0.2
=0.1
=0.05
=0.01
10
0.32
0.37
0.41
0.49
20
0.23
0.26
0.29
0.36
30
0.19
0.22
0.24
0.29
40
0.17
0.19
0.21
0.25
50
0.15
0.17
0.19
0.23
>50
1.07/sqrt(n)
1.22/sqrt(n)
1.36/sqrt(n)
1.63/sqrt(n)
What is the Significance Level?
Our hypothesis is that the fit is a good fit.
 If difference in cdf’s exceeds that of the
test statistic, we reject the hypothesis
 There are two possibilities:

◦ Fit really is bad, or
◦ We are rejecting a good fit

Significance level is probability that we are
rejecting a good fit
Fit Tests in Matlab
chi2gof(x) – normal distribution only
 kstest(x,CDF)

w1=data(:,4); n=numel(w1);
Matlab Code
hist(w1,25); ncells=128; npoints=51;
nused=ncells*npoints;
datasort=sort(w1(1:nused));
a=4.52997; b=2.72931; %from dfittol
chisqr=0;
upperbound=min(datasort);
for i=1:ncells
indx=(i-1)*(npoints)+1;
thisset=datasort(indx:indx+npoints-1);
lowerbound=upperbound;
upperbound=max(thisset);
ee=n*(gamcdf(upperbound,a,b)gamcdf(lowerbound,a,b));
chisqr=chisqr+(npoints-ee)^2/ee;
end
chisqr
param=2; v=ncells-param-1;
chpdf=chi2inv(0.95,v)
KS Test
a=13.9761; %from dfittool
b=2.34506; %from dfittol
cdfvals=wblcdf(xvals,a,b);
kstest(datasort,[xvals' cdfvals'])
Anderson-Darling Statistic
This is a more sophisticated and complex
version of the K-S,
 It is more powerful because

◦ The f(x) weights the observed distances by the
probability that the value will be generated at that x
value.
 This helps focus the difference measure more equitably.
◦ The vertical distances are integrated over ALL values
of x rather than just looking at the maximum.
 This makes maximum use of the observed data
Root Mean Square Error
RMS error is available as a test statistic in
BESTFIT for expert data that is sampled
using percentiles.
 Measures the area between the
distribution fit and the data.

◦ The smaller the better. Does not provide fine
distinction.
Back to Example - Tests
All pass KS test except exponential and
rayleigh at 5% significance level
 Same holds at 2% level

Uncertainty Analysis for Engineers
32
Script
vel=[78.2 75.8 81.8 85.2 75.9 78.2 72.3
69.3 76.1 74.8 ...
78.4 76.4 72.9 76.0 79.3 77.4 77.1 80.8
70.6 73.5];
sortv=sort(vel);
subplot(2,3,1),probplot('ev',vel)
subplot(2,3,2),probplot('norm',vel)
subplot(2,3,3),probplot('lognorm',vel)
subplot(2,3,4),probplot('weibull',vel)
subplot(2,3,5),probplot('exponential',vel)
subplot(2,3,6),probplot('rayleigh',vel)
x=68:0.1:86;
alpha=0.02;
zev=evfit(vel);
sev=evcdf(x,zev(1),zev(2));
kev=kstest(sortv,[x' sev'],alpha)
[an bn]=normfit(vel);
sn=normcdf(x,an,bn);
kn=kstest(sortv,[x' sn'],alpha)
Rayleigh and Exponential fail test (kx=kr=1)
zx=expfit(vel);
sx=expcdf(x,zx);
kx=kstest(sortv,[x' sx'],alpha)
zl=lognfit(vel);
sl=logncdf(x,zl(1),zl(2));
kl=kstest(sortv,[x' sl'],alpha)
zw=wblfit(vel);
sw=wblcdf(x,zw(1),zw(2));
kw=kstest(sortv,[x' sw'],alpha)
zr=raylfit(vel);
sr=raylcdf(x,zr);
kr=kstest(sortv,[x' sr'],alpha)
figure,plot(x,sev,'o',x,sn,'r',x,sx,'b',x,sl,'+',x,s
w,'m',x,sr,'c',...
'LineWidth',2)
legend('EV','Norm','Exp','LogN','Weibull','Ra
yleigh')
hold on
cdfplot(vel)
Uncertainty Analysis for Engineers
33
A Second Example
A new controller was installed on 96 diesel
locomotives
 The mileage at failure for each was recorded
 37 failed at less than 135,000 miles
 All we know about the others is that each
lasted beyond 135k miles
 Thus, we have to “censor” the data
 We assume we have 96 data points, but only
plot 37
 This is important when we compute CDF
 Goal is 80,000 mile warranty

Uncertainty Analysis for Engineers
34
Censoring in Matlab
Tell Matlab which data points are
censored and how many of each there are
 We’ll use a lognormal plot in the example
because failure mechanism indicates this
is appropriate

Uncertainty Analysis for Engineers
35
Uncensored “Normal” probability
plot
Probability plot for Normal distribution
0.99
0.95
0.9
Probability
0.75
0.5
0.25
0.1
0.05
0.01
20
40
60
80
Data
100
120
140
Uncertainty Analysis for Engineers
36
Un-Censored “Lognormal” Plot
Probability plot for Lognormal distribution
0.99
0.95
0.9
Probability
0.75
0.5
0.25
0.1
0.05
0.01
1
10
2
10
Data
3
10
Uncertainty Analysis for Engineers
37
Censored “Lognormal” Plot
Probability plot for Lognormal distribution
0.5
Probability
0.25
0.1
0.05
0.01
0.005
1
10
2
10
Data
3
10
Uncertainty Analysis for Engineers
38
Conclusions
The lognormal distribution looks like a
good fit
 Probability of failure is approximately 15%

Uncertainty Analysis for Engineers
39
Script
d=[22.2 37.5 46.0 48.5 51.5 ...
53 54.5 57.5 66.5 68 ...
69.5 76.5 77 78.5 80 ...
81.5 82 83 84 91.5 93.5 ...
102.5 107 108.5 112.5 113.7 ...
116 117 118.5 119 120 ...
122.5 123 127.5 131.0 132.5 134]
data=[d'; 135];
bools=zeros(size(data));
bools(end)=1;
freq=bools*59+1;
probplot('logn',data)
figure, probplot('logn',data,bools,freq)
Uncertainty Analysis for Engineers
40
Download