Summary Statistics
Jake Blanchard
Spring 2008
Uncertainty Analysis for Engineers

Summarizing and Interpreting Data
It is useful to have some metrics for summarizing statistical data (both input and output).
The 3 key characteristics are
◦ Central tendency (mean, median, mode)
◦ Dispersion (variance)
◦ Shape (skewness, kurtosis)

Central Tendency
Mean:
$E(x) = \sum_{i=1}^{n} x_i p_i$ (discrete)
$E(x) = \int_{-\infty}^{\infty} x f(x)\,dx$ (continuous)
Median = point $z$ such that exactly half of the probability is associated with lower values and half with greater values:
$\int_{-\infty}^{z} f(x)\,dx = 0.5$
Mode = most likely value (maximum of the pdf)

For One Die
mean: $E(x) = \sum_{i=1}^{6} x_i p(x_i) = 1\cdot\frac{1}{6} + 2\cdot\frac{1}{6} + 3\cdot\frac{1}{6} + 4\cdot\frac{1}{6} + 5\cdot\frac{1}{6} + 6\cdot\frac{1}{6} = 3.5$
median: $x = 3.5$
mode: not unique (all six faces are equally likely, so the probability function has no single maximum)

Radioactive Decay
For our example, with pdf $f(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, the mean, median, and mode are given by
mean: $E(t) = \int_{0}^{\infty} t f(t)\,dt = \int_{0}^{\infty} \lambda t e^{-\lambda t}\,dt = \frac{1}{\lambda}$
median: $\int_{0}^{z} \lambda e^{-\lambda t}\,dt = 0.5 \;\Rightarrow\; z = \frac{\ln(2)}{\lambda}$
The mode is $t = 0$.

Other Characteristics
We can calculate the expected value of any function of our random variable as
$E[h(x)] = \int_{-\infty}^{\infty} h(x) f(x)\,dx$ (continuous)
$E[h(x)] = \sum_{i} h(x_i) p(x_i)$ (discrete)

Some Results
$E(c) = c$
$E(cx) = c\,E(x)$
$E\left(\sum_{j=1}^{n} x_j\right) = \sum_{j=1}^{n} E(x_j)$
$E\left(\sum_{j=1}^{n} b_j x_j\right) = \sum_{j=1}^{n} b_j E(x_j)$

Moments of Distributions
We can define many of these parameters in terms of moments of the distribution. Here $\mu_1$ is the mean and, for $k \ge 2$, $\mu_k$ is the $k$-th moment about the mean.
$\mu_1 = \int_{-\infty}^{\infty} x f(x)\,dx$
$\mu_k = \int_{-\infty}^{\infty} (x - \mu_1)^k f(x)\,dx$ (continuous)
$\mu_k = E[(x - \mu_1)^k] = \sum_{i} (x_i - \mu_1)^k p(x_i)$ (discrete)
Mean is the first moment.
Variance is the second moment (taken about the mean).
Third and fourth moments are related to skewness and kurtosis.

Spread (Variance)
Variance is a measure of spread or dispersion. With $\mu_1$ denoting the mean,
$\sigma^2 = E[(x - \mu_1)^2] = \mu_2 = \int_{-\infty}^{\infty} (x - \mu_1)^2 f(x)\,dx$
For discrete data sets, the biased variance is
$s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
and the unbiased variance is
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
The standard deviation is the square root of the variance.

Skewness
Skewness is a measure of asymmetry.
$\mu_3 = E[(x - \mu_1)^3] = \int_{-\infty}^{\infty} (x - \mu_1)^3 f(x)\,dx$
For discrete data sets, the biased skewness is related to
$m_3 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3$
The skewness is often defined as
$\gamma_1 = \frac{\mu_3}{\sigma^3}$

Skewness
[Figure]

Kurtosis
Kurtosis is a measure of peakedness.
$\mu_4 = E[(x - \mu_1)^4] = \int_{-\infty}^{\infty} (x - \mu_1)^4 f(x)\,dx$
For discrete data sets, the biased kurtosis is related to
$m_4 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4$
The kurtosis is often defined as
$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3$

Kurtosis
[Figure: pdf of the Pearson type VII distribution with kurtosis of infinity (red), 2 (blue), and 0 (black)]

Using Matlab
Sample data is the length of time a person was able to hold their breath (40 attempts).
Try a scatter plot:
load RobPracticeHolds;                 % provides the vector breathholds
y = ones(size(breathholds));           % plot every point at height 1
h1 = figure('Position',[100 100 400 100],'Color','w');
scatter(breathholds,y);

Adding Information
disp(['The mean is ',num2str(mean(breathholds)),' seconds (green line).']);
disp(['The median is ',num2str(median(breathholds)),' seconds (red line).']);
hold all;                              % keep the scatter plot while adding lines
line([mean(breathholds) mean(breathholds)],[0.5 1.5],'color','g');
line([median(breathholds) median(breathholds)],[0.5 1.5],'color','r');

Box Plot
% Label the scatter figure, then draw the box plot in a new figure
title('Scatter with Min, 25%iqr, Median, Mean, 75%iqr, & Max lines');
xlabel('');
h3 = figure('Position',[100 100 400 100],'Color','w');
boxplot(breathholds,'orientation','horizontal','widths',.5);
set(gca,'XLim',[40 140]);
title('A Boxplot of the same data');
xlabel('');
set(gca,'Yticklabel',[]);
ylabel('');

Box Plot
[Figure: annotated box plot. Labels: Min, Median, Max, Outlier. The box represents the inter-quartile range (half of the data).]

Empirical cdf
h3 = figure('Position',[100 100 600 400],'Color','w');
cdfplot(breathholds);

Multivariate Data Sets
When there are multiple input variables, we need some additional ways to characterize the data.
$E[h(x,y)] = \int\!\!\int h(x,y) f(x,y)\,dx\,dy$ (continuous)
$E[h(x,y)] = \sum_{i} \sum_{j} h(x_i, y_j)\, p(x_i, y_j)$ (discrete)
$\mathrm{Cov}(x,y) = E(xy) - E(x)E(y)$
If x and y are independent, then Cov(x,y) = 0.

Correlation Coefficients
Two random variables may be related.
Define the correlation coefficient of input (x) and output (y) as
$\rho_{x,y} = \frac{\sum_{k=1}^{m} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_k - \bar{x})^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}} = \frac{\mathrm{Cov}(x,y)}{\sigma(x)\,\sigma(y)}$
$\rho = 1$ implies linear dependence, positive slope
$\rho = 0$ implies no linear dependence (the variables may still be related nonlinearly)
$\rho = -1$ implies linear dependence, negative slope

Example
[Figure: four scatter plots with $\rho = 0.98$, $\rho = 1$, $\rho = -0.98$, and $\rho = -0.38$]

Example
x=rand(25,1)-0.5;
y=x;                        % exact linear dependence
corrcoef(x,y)
subplot(2,2,1), plot(x,y,'o')
y2=x+0.2*rand(25,1);        % positive slope plus noise
corrcoef(x,y2)
subplot(2,2,2), plot(x,y2,'o')
y3=-x+0.2*rand(25,1);       % negative slope plus noise
corrcoef(x,y3)
subplot(2,2,3), plot(x,y3,'o')
y4=rand(25,1)-0.5;          % generated independently of x
corrcoef(x,y4)
subplot(2,2,4), plot(x,y4,'o')
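
Checking the Definition
As a cross-check on the correlation coefficient slide, the sum-based definition can be evaluated directly and compared with the off-diagonal entry returned by corrcoef. This is a minimal sketch, not part of the original example: the fixed seed rng(0) and the variable names dx, dy, rho_manual, and rho_builtin are assumptions added here for illustration.

```matlab
% Sketch: correlation coefficient from its definition vs. corrcoef.
rng(0);                        % fixed seed (assumption, for repeatability)
x = rand(25,1) - 0.5;          % same kind of input as the slide example
y = -x + 0.2*rand(25,1);       % negative slope plus noise

dx = x - mean(x);              % deviations from the sample means
dy = y - mean(y);

% Sum of products of deviations over the root of the product of
% the summed squared deviations
rho_manual = sum(dx.*dy) / sqrt(sum(dx.^2) * sum(dy.^2));

R = corrcoef(x, y);            % 2x2 matrix; off-diagonal entries hold rho
rho_builtin = R(1,2);

disp(['From the definition: ', num2str(rho_manual)]);
disp(['From corrcoef:       ', num2str(rho_builtin)]);
```

The two values should agree to within round-off, because the 1/n or 1/(n-1) normalization cancels between the covariance and the two standard deviations.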