Summary Statistics

Jake Blanchard
Spring 2008
Uncertainty Analysis for Engineers
Summarizing and Interpreting Data
It is useful to have some metrics for summarizing statistical data (both input and output).
Three key characteristics are:
◦ Central tendency (mean, median, mode)
◦ Dispersion (variance)
◦ Shape (skewness, kurtosis)
Central Tendency

Mean (discrete and continuous forms):
$$E(x) = \sum_{i=1}^{n} x_i p_i$$
$$E(x) = \int_{-\infty}^{\infty} x f(x)\,dx$$
Median = point z such that exactly half of the probability is associated with lower values and half with greater values:
$$\int_{-\infty}^{z} f(x)\,dx = 0.5$$
Mode = most likely value (maximum of the pdf)
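For One Die
Mean:
$$E(x) = \sum_{x_i=1}^{6} x_i\,p(x_i) = 1\cdot\frac{1}{6} + 2\cdot\frac{1}{6} + 3\cdot\frac{1}{6} + 4\cdot\frac{1}{6} + 5\cdot\frac{1}{6} + 6\cdot\frac{1}{6}$$
$$E(x) = 3.5$$
Median (by convention, the midpoint of 3 and 4):
$$x = 3.5$$
Mode: not unique, since all six faces are equally likely.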
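As a quick check of the arithmetic above, here is a minimal MATLAB sketch that evaluates the discrete expectation for one fair die. The variable names `faces` and `p` are illustrative and do not come from the slides.

```matlab
% Expected value of one fair six-sided die from E(x) = sum(x_i * p(x_i)).
faces = 1:6;              % possible outcomes
p = ones(1,6)/6;          % each face has probability 1/6
Ex = sum(faces .* p);     % discrete expectation
med = median(faces);      % midpoint of the two middle faces, 3 and 4
fprintf('mean = %.1f, median = %.1f\n', Ex, med);
```

Taking the median of the list of faces only works here because every face is equally likely; with unequal probabilities the cumulative probability would have to be used instead.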
Radioactive Decay

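For our example, with pdf f(t) = λe^(-λt) for t ≥ 0, the mean, median, and mode are given by
Mean:
$$E(t) = \int_{0}^{\infty} t f(t)\,dt = \int_{0}^{\infty} t \lambda e^{-\lambda t}\,dt = \frac{1}{\lambda}$$
Median:
$$\int_{0}^{z} \lambda e^{-\lambda t}\,dt = 0.5 \quad\Rightarrow\quad z = \frac{\ln(2)}{\lambda}$$
The mode is t = 0.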
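The closed-form mean and median can be checked numerically. The sketch below assumes a decay constant of λ = 0.5 (an arbitrary value chosen for illustration, not taken from the slides) and uses MATLAB's `integral` function.

```matlab
% Numerical check of the exponential-decay results, f(t) = lambda*exp(-lambda*t).
lambda = 0.5;                                   % assumed decay constant (1/time)
f = @(t) lambda .* exp(-lambda .* t);           % pdf of the decay time
meanT = integral(@(t) t .* f(t), 0, Inf);       % should be close to 1/lambda
z = log(2) / lambda;                            % closed-form median
probBelow = integral(f, 0, z);                  % should be close to 0.5
fprintf('mean = %.4f (1/lambda = %.4f)\n', meanT, 1/lambda);
fprintf('P(t < median) = %.4f\n', probBelow);
```

The pdf decreases monotonically from its largest value at t = 0, which is why the mode sits at the origin even though the mean and median do not.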
Other Characteristics

We can calculate the expected value of any function h(x) of our random variable as
$$E[h(x)] = \begin{cases} \int_{-\infty}^{\infty} h(x) f(x)\,dx & \text{(continuous)} \\ \sum_{i} h(x_i)\,p(x_i) & \text{(discrete)} \end{cases}$$
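To illustrate the discrete form, the following sketch evaluates E[h(x)] for a fair die. The choice h(x) = x^2 is an assumption made only for this example.

```matlab
% E[h(x)] for a fair die with the illustrative choice h(x) = x^2.
x = 1:6;                  % outcomes
p = ones(1,6)/6;          % probabilities
h = @(x) x.^2;            % assumed example function, not from the slides
Eh = sum(h(x) .* p);      % discrete form: sum of h(x_i) * p(x_i)
fprintf('E[x^2] = %.4f\n', Eh);   % 91/6, about 15.1667
```

Note that E[x^2] = 91/6 differs from (E[x])^2 = 12.25, so the expectation of a function is generally not the function of the expectation.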
Some Results
$$E(c) = c$$
$$E(cx) = c\,E(x)$$
$$E\left(\sum_{j=1}^{n} x_j\right) = \sum_{j=1}^{n} E(x_j)$$
$$E\left(\sum_{j=1}^{n} b_j x_j\right) = \sum_{j=1}^{n} b_j E(x_j)$$
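The linearity results above also hold for sample means, which makes them easy to check. In this sketch the weights `b` and the uniform random samples are assumptions chosen for illustration.

```matlab
% Check E(sum_j b_j x_j) = sum_j b_j E(x_j) using sample means.
X = rand(1000, 3);             % 1000 samples of three variables x_1..x_3
b = [2; -1; 0.5];              % assumed weights b_j
lhs = mean(X * b);             % mean of the weighted sum
rhs = mean(X) * b;             % weighted sum of the means
fprintf('difference = %.2e\n', abs(lhs - rhs));
```

The two sides agree to round-off for any sample, because the sample mean is itself a linear operation.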
Moments of Distributions

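We can define many of these parameters in terms of moments of the distribution:
$$\mu_1 = \int_{-\infty}^{\infty} x f(x)\,dx$$
$$\mu_k = E\left[(x-\mu_1)^k\right] = \begin{cases} \int_{-\infty}^{\infty} (x-\mu_1)^k f(x)\,dx & \text{(continuous)} \\ \sum_{i} (x_i-\mu_1)^k\,p(x_i) & \text{(discrete)} \end{cases}$$
◦ The mean is the first moment.
◦ The variance is the second central moment (k = 2).
◦ The third and fourth central moments are related to skewness and kurtosis.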
Spread (Variance)

Variance is a measure of spread or dispersion:
$$\sigma^2 = \mu_2 = E\left[(x-\mu_1)^2\right] = \int_{-\infty}^{\infty} (x-\mu_1)^2 f(x)\,dx$$
For discrete data sets, the biased variance is
$$s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
and the unbiased variance is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The standard deviation is the square root of the variance.
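The two sample formulas differ only in the divisor. The sketch below applies both to a small made-up data set (not from the slides) and compares them with MATLAB's built-in `var`, which divides by n-1 by default and by n when called as `var(x,1)`.

```matlab
% Biased (1/n) and unbiased (1/(n-1)) sample variance on a small data set.
x = [2 4 4 4 5 5 7 9];                      % illustrative data, not from the slides
n = numel(x);
xbar = mean(x);
sBiased2   = sum((x - xbar).^2) / n;        % divides by n
sUnbiased2 = sum((x - xbar).^2) / (n - 1);  % divides by n-1
fprintf('biased = %.4f, var(x,1) = %.4f\n', sBiased2, var(x,1));
fprintf('unbiased = %.4f, var(x) = %.4f\n', sUnbiased2, var(x));
fprintf('standard deviation (unbiased) = %.4f\n', sqrt(sUnbiased2));
```

For this data set the mean is 5 and the squared deviations sum to 32, so the biased variance is 4 and the unbiased variance is 32/7.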
Skewness

Skewness is a measure of asymmetry:
$$\mu_3 = E\left[(x-\mu_1)^3\right] = \int_{-\infty}^{\infty} (x-\mu_1)^3 f(x)\,dx$$
For discrete data sets, the biased skewness is related to
$$m_3 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3$$
The skewness is often defined as
$$\gamma_1 = \frac{\mu_3}{\sigma^3}$$
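A sample version of the skewness can be formed from the biased central moments. The data below are made up for illustration, with one large value to produce a long right tail.

```matlab
% Sample skewness g1 = m3 / m2^(3/2) from the biased central moments.
x = [1 2 2 3 3 3 4 10];            % illustrative data with a long right tail
xbar = mean(x);
m2 = mean((x - xbar).^2);          % biased second central moment
m3 = mean((x - xbar).^3);          % biased third central moment
g1 = m3 / m2^1.5;                  % positive for a right-skewed sample
fprintf('skewness = %.4f\n', g1);
```

A positive result indicates a longer right tail, and a symmetric sample gives zero. If the Statistics Toolbox is available, its `skewness` function should give a comparable value.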
Kurtosis

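Kurtosis is a measure of peakedness (and of how heavy the tails are):
$$\mu_4 = E\left[(x-\mu_1)^4\right] = \int_{-\infty}^{\infty} (x-\mu_1)^4 f(x)\,dx$$
For discrete data sets, the biased kurtosis is related to
$$m_4 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4$$
The kurtosis is often defined as
$$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3$$
Subtracting 3 makes γ₂ zero for a normal distribution (this form is the excess kurtosis).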
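The same approach gives a sample kurtosis. This sketch uses a small symmetric data set invented for the example and subtracts 3, following the definition above.

```matlab
% Excess kurtosis g2 = m4 / m2^2 - 3 from the biased central moments.
x = [-2 -1 -1 0 0 0 0 1 1 2];      % illustrative symmetric data
xbar = mean(x);
m2 = mean((x - xbar).^2);          % biased second central moment (1.2 here)
m4 = mean((x - xbar).^4);          % biased fourth central moment (3.6 here)
g2 = m4 / m2^2 - 3;                % zero would match a normal distribution
fprintf('excess kurtosis = %.4f\n', g2);   % 3.6/1.44 - 3 = -0.5
```

Be aware that the Statistics Toolbox `kurtosis` function reports the ratio without subtracting 3, so its value for the same data would be larger by 3.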
Kurtosis

Figure: pdf of Pearson type VII distributions with kurtosis of infinity (red), 2 (blue), and 0 (black)
Using Matlab
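The sample data are the lengths of time a person was able to hold their breath (40 attempts).
Try a scatter plot:

% Load the data file; it is assumed to define the vector breathholds (seconds)
load RobPracticeHolds;
% Plot every attempt at the same height (y = 1) for a one-dimensional scatter
y = ones(size(breathholds));
h1 = figure('Position',[100 100 400 100],'Color','w');
scatter(breathholds,y);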
Adding Information
% Report the mean and median, then mark them on the scatter plot
disp(['The mean is ',num2str(mean(breathholds)),' seconds (green line).']);
disp(['The median is ',num2str(median(breathholds)),' seconds (red line).']);
hold on;   % keep the scatter plot while adding the lines
line([mean(breathholds) mean(breathholds)],[0.5 1.5],'color','g');
line([median(breathholds) median(breathholds)],[0.5 1.5],'color','r');
Box Plot
% Label the scatter plot from the previous slide
title('Scatter with Min, 25%iqr, Median, Mean, 75%iqr, & Max lines');
xlabel('');
% Draw a horizontal box plot of the same data in a new figure
h3 = figure('Position',[100 100 400 100],'Color','w');
boxplot(breathholds,'orientation','horizontal','widths',.5);
set(gca,'XLim',[40 140]);
title('A Boxplot of the same data'); xlabel(''); set(gca,'Yticklabel',[]);
ylabel('');
Box Plot
Figure annotations: Min; Box (represents the inter-quartile range, half of the data); Median; Max; Outlier
Empirical cdf
h3 = figure('Position',[100 100 600 400],'Color','w');
cdfplot(breathholds);
Multivariate Data Sets

When there are multiple input variables, we need some additional ways to characterize the data:
$$E[h(x,y)] = \begin{cases} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x,y)\,f(x,y)\,dx\,dy & \text{(continuous)} \\ \sum_{i}\sum_{j} h(x_i,y_j)\,p(x_i,y_j) & \text{(discrete)} \end{cases}$$
$$\mathrm{Cov}(x,y) = E(xy) - E(x)E(y)$$
If x and y are independent, then Cov(x,y) = 0.
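The covariance identity can be verified on sample data by comparing it with the definition E[(x - E(x))(y - E(y))]. The two short vectors below are made up for the example, and both expressions use the 1/n (biased) sample averages.

```matlab
% Check Cov(x,y) = E(xy) - E(x)E(y) against the definition E[(x-mx)(y-my)].
x = [1 2 3 4 5]';                  % illustrative data, not from the slides
y = [2 1 4 3 6]';
covShortcut = mean(x .* y) - mean(x) * mean(y);          % 11.6 - 3*3.2
covDefinition = mean((x - mean(x)) .* (y - mean(y)));    % mean of deviation products
fprintf('shortcut = %.4f, definition = %.4f\n', covShortcut, covDefinition);
```

Both expressions evaluate to 2 for this data. MATLAB's built-in `cov` divides by n-1 by default, so it would report a slightly larger value unless the 1/n normalization is requested.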
Correlation Coefficients
Two random variables may be related.
Define the correlation coefficient of input (x) and output (y) as
$$\rho_{x,y} = \frac{\sum_{k=1}^{m} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_k - \bar{x})^2 \,\sum_{k=1}^{m} (y_k - \bar{y})^2}} = \frac{\mathrm{Cov}(x,y)}{\sigma(x)\,\sigma(y)}$$
◦ ρ = 1 implies linear dependence, positive slope
◦ ρ = 0 implies no linear dependence
◦ ρ = -1 implies linear dependence, negative slope
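The sum formula can be evaluated directly and compared with MATLAB's `corrcoef`. The data vectors are invented for the example; passing them as the columns of one matrix makes `corrcoef` return a 2-by-2 matrix whose off-diagonal entry is the coefficient.

```matlab
% Correlation coefficient from the sum formula, compared with corrcoef.
x = [1 2 3 4 5]';                  % illustrative data, not from the slides
y = [2 1 4 3 6]';
dx = x - mean(x);                  % deviations from the mean of x
dy = y - mean(y);                  % deviations from the mean of y
rho = sum(dx .* dy) / sqrt(sum(dx.^2) * sum(dy.^2));
R = corrcoef([x y]);               % 2x2 matrix; off-diagonal entry is rho
fprintf('formula = %.4f, corrcoef = %.4f\n', rho, R(1,2));
```

For this data the numerator is 10 and the denominator is sqrt(148), giving a coefficient of about 0.822.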
Example
Figure: four scatter plots, with ρ = 0.98, ρ = 1, ρ = -0.98, and ρ = -0.38
Example
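% Four 25-point data sets with different degrees of correlation
x=rand(25,1)-0.5;
% 1) y equals x exactly: correlation coefficient of 1
y=x;
corrcoef(x,y)
subplot(2,2,1), plot(x,y,'o')
% 2) x plus uniform noise: strong positive correlation
y2=x+0.2*rand(25,1);
corrcoef(x,y2)
subplot(2,2,2), plot(x,y2,'o')
% 3) -x plus uniform noise: strong negative correlation
y3=-x+0.2*rand(25,1);
corrcoef(x,y3)
subplot(2,2,3), plot(x,y3,'o')
% 4) generated independently of x: weak correlation expected
y4=rand(25,1)-0.5;
corrcoef(x,y4)
subplot(2,2,4), plot(x,y4,'o')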