Nonparametrics.Zip
(a compressed version of nonparametrics)

Tom Hettmansperger
Department of Statistics, Penn State University

References:
1. Higgins (2004) Introduction to Modern Nonparametric Statistics
2. Hollander and Wolfe (1999) Nonparametric Statistical Methods
3. Arnold Notes
4. Johnson, Morrell, and Schick (1992) Two-Sample Nonparametric Estimation and Confidence Intervals Under Truncation, Biometrics, 48, 1043-1056.
5. Beers, Flynn, and Gebhardt (1990) Measures of Location and Scale for Velocities in Clusters of Galaxies: A Robust Approach. Astron J, 100, 32-46.
6. Website: http://www.stat.wmich.edu/slab/RGLM/

Robustness and a little philosophy

Robustness: Insensitivity to assumptions
a. Structural Robustness of Statistical Procedures
b. Statistical Robustness of Statistical Procedures

Structural Robustness (Developed in 1970s)

Influence: How does a statistical procedure respond to a single outlying observation as it moves farther from the center of the data?
Want: Methods with bounded influence.

Breakdown Point: What proportion of the data must be contaminated in order to move the procedure beyond any bound?
Want: Methods with positive breakdown point.

Beers et al. is an excellent reference.

Statistical Robustness (Developed in 1960s)

Hypothesis Tests:
• Level Robustness, in which the significance level is not sensitive to model assumptions.
• Power Robustness, in which the statistical power of a test to detect important alternative hypotheses is not sensitive to model assumptions.

Estimators:
• Variance Robustness, in which the variance (precision) of an estimator is not sensitive to model assumptions.

"Not sensitive to model assumptions" means that the property remains good throughout a neighborhood of the assumed model.

Examples
1. The sample mean is not structurally robust and is not variance robust.
2. The sample median is structurally robust and is variance robust.
3. The t-test is level robust (asymptotically) but is neither structurally robust nor power robust.
4. The sign test is structurally and statistically robust. Caveat: it is not very powerful at the normal model.
5. Trimmed means are structurally and variance robust.
6. The sample variance is neither structurally robust nor variance robust.
7. The interquartile range is structurally robust.

Recall that the sample mean minimizes

$$\sum_i (x_i - \theta)^2 .$$

Replace the quadratic by a function $\rho(x)$ which does not increase like a quadratic. Then minimize:

$$\sum_i \rho(x_i - \theta) .$$

The result is an M-Estimator, which is structurally robust and variance robust. See Beers et al.

The nonparametric tests described here are often called distribution free because their significance levels do not depend on the underlying model assumption. Hence, they are level robust. They are also power robust and structurally robust.

The estimators that are associated with the tests are structurally and variance robust.

Opinion: The importance of nonparametric methods resides in their robustness, not in the distribution free property (level robustness) of nonparametric tests.

Single Sample Methods
• Robust Data Summaries: location, spread, shape
• Graphical Displays: CI-boxplots (notched boxplots), histograms, dotplots, kernel density estimates
• Inference: Confidence Intervals and Hypothesis Tests
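The M-estimator idea described earlier (replace the quadratic by a slower-growing $\rho$ and minimize) can be sketched in Python. This is only an illustration under stated assumptions, not the procedure from Beers et al.: it assumes Huber's $\rho$ (quadratic near zero, linear in the tails), the common tuning constant c = 1.345, a scale held fixed at 1.4826 x MAD, and a simple iteratively reweighted mean. The function name `huber_location` and the data values are hypothetical.

```python
import statistics

def huber_location(xs, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means.

    Assumptions (not from the source): Huber's rho with tuning constant c,
    scale held fixed at 1.4826 * MAD, starting value = sample median.
    """
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    scale = 1.4826 * mad
    if scale == 0:
        return med  # at least half the data coincide; fall back to the median
    theta = med
    for _ in range(max_iter):
        weights = []
        for x in xs:
            r = abs(x - theta) / scale
            # weight 1 in the quadratic zone, c/|r| in the linear zone
            weights.append(1.0 if r <= c else c / r)
        new_theta = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

# Hypothetical data: five well-behaved values plus one gross outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [50.0]
print(statistics.mean(contaminated))   # pulled strongly toward the outlier
print(huber_location(contaminated))    # stays close to 10
```

The outlier receives a weight near zero, which is the bounded-influence behavior the slides ask for; the sample mean, by contrast, gives every observation weight one.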
Absolute Magnitude, Planetary Nebulae, Milky Way

Abs Mag (n = 81):
-5.140 -6.700 -6.970 -7.190 -7.273 -7.365 -7.509 -7.633 -7.741
-8.000 -8.079 -8.359 -8.478 -8.558 -8.662 -8.730 -8.759 -8.825 …

[Figure: Dotplot of Abs Mag; horizontal axis Abs Mag from -14.4 to -6.0]

Summary for Abs Mag

Anderson-Darling Normality Test
  A-Squared      0.30
  P-Value        0.567

  Mean         -10.324
  StDev          1.804
  Variance       3.253
  Skewness       0.305015
  Kurtosis      -0.048362
  N             81

  Minimum      -14.205
  1st Quartile -11.564
  Median       -10.557
  3rd Quartile  -9.144
  Maximum       -5.140

  85% Confidence Interval for Mean     (-10.615, -10.032)
  85% Confidence Interval for Median   (-10.699, -10.208)
  85% Confidence Interval for StDev    (1.622, 2.039)

[Figure: histogram of Abs Mag with 85% confidence intervals for the mean and median]

[Figure: Probability Plot of Abs Mag, Normal - 95% CI. Mean -10.32, StDev 1.804, N 81, AD 0.303, P-Value 0.567]

But don't be too quick to "accept" normality:

[Figure: Probability Plot of Abs Mag, 3-Parameter Weibull - 95% CI; horizontal axis Abs Mag - Threshold. Shape 2.680, Scale 5.027, Thresh -14.79, N 81, AD 0.224, P-Value >0.500]

[Figure: Histogram of Abs Mag with fitted 3-Parameter Weibull. Shape 2.680, Scale 5.027, Thresh -14.79, N 81]

Weibull Distribution:

$$f(x) = \frac{c\,(x-t)^{c-1}}{b^{c}} \exp\left\{-\left(\frac{x-t}{b}\right)^{c}\right\} \quad \text{for } x > t, \text{ and } 0 \text{ otherwise}$$

t = threshold, b = scale, c = shape

Null hypothesis: the population distribution F(x) is normal.

The Kolmogorov-Smirnov Statistic

$$D = \max_x |F_n(x) - F(x)|$$

The Anderson-Darling Statistic

$$AD = n \int (F_n(x) - F(x))^2 \, [F(x)(1 - F(x))]^{-1} \, dF(x)$$

[Figure: Boxplot of Abs Mag (with 95% CI), annotated with: outlier, whisker, 3rd quartile, 95% confidence interval for the median (in red), median, 1st quartile]

Anatomy of a 95% CI-Boxplot
• Box formed by quartiles and median
• IQR (interquartile range) = Q3 - Q1
• Whiskers extend from the end of the box to the farthest point within 1.5 x IQR.
For a normal benchmark distribution, IQR = 1.349 Stdev and 1.5 x IQR = 2 Stdev (approximately). Outliers beyond the whiskers are more than 2.7 stdevs from the median. For a normal distribution this should happen about 0.7% of the time.

Pseudo Stdev = .75 x IQR

Estimation of scale or variation.

Recall that the sample standard deviation is not robust.

From the boxplot we can find the interquartile range. Define a pseudo standard deviation by .75 IQR. This is a consistent estimate of the population standard deviation in a normal population. It is robust.

The MAD is even more robust. Define the MAD by Med|x - med(x)|. Further define another pseudo standard deviation by 1.5 MAD. Again this is calibrated to a normal population.

Suppose we enter 5.14 rather than -5.14 in the data set. The table shows the effect on the non-robust stdev and on the robust IQR and MAD.

          No Outlier   Outlier   Breakdown
Stdev     1.80         2.43      1 obs
.75IQR    1.82         1.82      25%
1.5MAD    1.78         1.78      50%

The confidence interval and hypothesis test

A population is located at $d_0$ if the population median is $d_0$.

Sample $X_1, \ldots, X_n$ from the population.

Say $X_1, \ldots, X_n$ is located at $d$ if $X_1 - d, \ldots, X_n - d$ is located at 0.

$S(d) = S(X_1 - d, \ldots, X_n - d)$ is a statistic useful for location analysis if $E_{d_0}(S(d_0)) = 0$ when the population is located at $d_0$.

Sign Statistic:

$$S(d) = \sum_i \mathrm{sgn}(X_i - d) = \#\{X_i > d\} - \#\{X_i < d\} = S^+(d) - S^-(d) = 2S^+(d) - n$$

Estimate $d_0$ from the data. Note: $E_{d_0} S(d_0) = 0$.

Find $\hat d$ such that $S(\hat d) = 0$ [or $S^+(\hat d) = n/2$].

Solution: $\hat d = \mathrm{median}(X_i)$

HYPOTHESIS TEST of $H_0: d = d_0$ vs. $H_A: d \ne d_0$

Rule: reject $H_0$ if $|S(d_0)| = |2S^+(d_0) - n| \ge c$, where $P_{d_0}(|2S^+(d_0) - n| \ge c) = \alpha$.

Equivalently, reject if

$$S^+(d_0) \le \frac{n - c}{2} = k \quad \text{or} \quad S^+(d_0) \ge \frac{n + c}{2} = n - k .$$

Under $H_0: d = d_0$, $S^+(d_0)$ is distributed Binomial$(n, \tfrac{1}{2})$. Distribution free.

CONFIDENCE INTERVAL

$d$ is the population location, and

$$P_d(k < S^+(d) < n - k) = 1 - \alpha .$$

Find the smallest $d$ such that $\#\{X_i > d\} < n - k$:

$d = X_{(k)}$: $\#\{X_i > X_{(k)}\} = n - k$
$d_{\min} = X_{(k+1)}$: $\#\{X_i > X_{(k+1)}\} = n - k - 1$

Likewise $d_{\max} = X_{(n-k)}$.

Then $[X_{(k+1)}, X_{(n-k)}]$ is a $(1 - \alpha)100\%$ confidence interval.
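The order-statistic interval $[X_{(k+1)}, X_{(n-k)}]$ can be computed directly from the Binomial(n, 1/2) distribution. The sketch below is a minimal illustration: because the binomial is discrete, it assumes we take the largest k with $P(S^+ \le k) \le \alpha/2$, so the achieved confidence is at least $1 - \alpha$. The function name `sign_ci` and the example data are hypothetical.

```python
from math import comb

def sign_ci(xs, alpha=0.05):
    """Distribution-free CI [X_(k+1), X_(n-k)] for the population median.

    Assumption: k is the largest integer with P(S+ <= k) <= alpha/2 under
    S+ ~ Binomial(n, 1/2). Returns (lower, upper, achieved_confidence).
    """
    x = sorted(xs)
    n = len(x)
    cdf = 0.0
    k = -1
    for j in range(n + 1):
        cdf += comb(n, j) / 2 ** n      # running value of P(S+ <= j)
        if cdf > alpha / 2:
            break
        k = j
    if k < 0:
        raise ValueError("sample too small for the requested confidence level")
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n   # P(S+ <= k)
    # order statistics are 1-indexed in the text, 0-indexed in Python
    return x[k], x[n - k - 1], 1 - 2 * tail

# Hypothetical sample of size 10: here k = 1, so the interval is
# [X_(2), X_(9)] with achieved confidence of about 0.979.
lower, upper, conf = sign_ci([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], alpha=0.05)
print(lower, upper, round(conf, 3))
```

No distributional shape enters the calculation; only the sample size n and the level determine k.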
The confidence interval is distribution free.

SUMMARY:

$X_1, \ldots, X_n$ is a sample from a population located at $d_0$.

SIGN STATISTIC: $S(d) = S^+(d) - S^-(d) = \#\{X_i > d\} - \#\{X_i < d\}$

ESTIMATE: $\hat d$ such that $S(\hat d) = 0$, giving $\hat d = \mathrm{median}(X_i)$

TEST of $H_0: d = d_0$ vs. $H_A: d \ne d_0$, equivalently

$$H_0: P(X > d_0) = \tfrac{1}{2} \quad \text{vs.} \quad H_A: P(X > d_0) \ne \tfrac{1}{2}$$

Reject $H_0$ if $S^+(d_0) \le k$ or $S^+(d_0) \ge n - k$, where $P_{d_0}(S^+(d_0) \le k) = \alpha/2$ and $S^+(d_0)$ is binomial$(n, 1/2)$.

CONFIDENCE INTERVAL: if $P_d(S^+(d) \le k) = \alpha/2$ then $[X_{(k+1)}, X_{(n-k)}]$ has confidence coefficient $(1 - \alpha)100\%$.

[Figure: Boxplot of Abs Mag (with 95% CI)]

Q1      Median   SE Med   Q3      IQR
-11.5   -10.7    .18      -9.14   2.42

Additional Remarks:

The median is a robust measure of location. It is not affected by outliers. It is efficient when the population has heavier tails than a normal population.

The sign test is also robust and insensitive to outliers. It is efficient when the tails are heavier than those of a normal population. Similarly for the confidence interval.

In addition, the test and the confidence interval are distribution free: they do not depend on the shape of the underlying population to determine critical values or confidence coefficients.

They are only 64% efficient relative to the mean and t-test when the population is normal.

If the population is symmetric then the Wilcoxon Signed Rank statistic can be used. It is robust against outliers and 95% efficient relative to the t-test.

Two-Sample Methods

Two-Sample Comparisons

85% CI-Boxplots

Mann-Whitney-Wilcoxon Rank Sum Statistic
• Estimate of difference in locations
• Test of difference in locations
• Confidence Interval for difference in locations

Levene's Rank Statistic for differences in scale or variance.
[Figure: 85% CI-Boxplots of MW and M-31]

[Figure: Boxplot of App Mag, M-31]

[Figure: Dotplot of App Mag, M-31]

Summary for App Mag, M-31

Anderson-Darling Normality Test
  A-Squared      1.79
  P-Value      < 0.005

  Mean          14.458
  StDev          1.195
  Variance       1.427
  Skewness      -0.396822
  Kurtosis       0.366104
  N            360

  Minimum       10.749
  1st Quartile  13.849
  Median        14.540
  3rd Quartile  15.338
  Maximum       18.052

  85% Confidence Interval for Mean     (14.367, 14.549)
  85% Confidence Interval for Median   (14.453, 14.610)
  85% Confidence Interval for StDev    (1.134, 1.263)

Summary for App Mag (low outliers removed)

Anderson-Darling Normality Test
  A-Squared      1.01
  P-Value        0.012

  Mean          14.522
  StDev          1.115
  Variance       1.243
  Skewness      -0.172496
  Kurtosis       0.057368
  N            353

  Minimum       11.685
  1st Quartile  13.887
  Median        14.550
  3rd Quartile  15.356
  Maximum       18.052

  85% Confidence Interval for Mean     (14.436, 14.607)
  85% Confidence Interval for Median   (14.483, 14.639)
  85% Confidence Interval for StDev    (1.058, 1.179)

[Figure: Probability Plot of App Mag, Normal - 95% CI. Mean 14.46, StDev 1.195, N 360, AD 1.794, P-Value <0.005]

Why 85% Confidence Intervals?

We have the following test of

$$H_0: d = d_1 - d_2 = 0 \quad \text{vs.} \quad H_A: d = d_1 - d_2 \ne 0$$

Rule: reject the null hypothesis if the 85% confidence intervals do not overlap.

The significance level is close to 5% provided the ratio of sample sizes is less than 3.

Mann-Whitney-Wilcoxon Statistic: The sign statistic on the pairwise differences.

$X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, with X from population F and Y from population G, with $d = d_Y - d_X$.
$$U(d) = \sum_i \sum_j \mathrm{sgn}(Y_j - d - X_i) = U^+(d) - U^-(d) = \#\{Y_j - X_i > d\} - \#\{Y_j - X_i < d\}$$

Unlike the sign test (64% efficiency for a normal population), the MWW test has 95.5% efficiency for a normal population. And it is robust against outliers in either sample.

SUMMARY:

MWW STATISTIC: $U(d) = U^+(d) - U^-(d) = \#\{Y_j - X_i > d\} - \#\{Y_j - X_i < d\}$

ESTIMATE: $\hat d$ such that $U(\hat d) = 0$, giving $\hat d = \mathrm{median}_{i,j}(Y_j - X_i)$

TEST of $H_0: d = 0$ vs. $H_A: d \ne 0$, equivalently

$$H_0: P(Y > X) = \tfrac{1}{2} \quad \text{vs.} \quad H_A: P(Y > X) \ne \tfrac{1}{2}$$

Reject $H_0$ if $U^+(0) \le k$ or $U^+(0) \ge mn - k$, where $P_0(U^+(0) \le k) = \alpha/2$ and $U^+(0)$ has a tabled distribution.

CONFIDENCE INTERVAL: if $P_d(U^+(d) \le k) = \alpha/2$ then $[D_{(k+1)}, D_{(mn-k)}]$ has confidence coefficient $(1 - \alpha)100\%$, where $D_{(1)} \le \ldots \le D_{(mn)}$ are the ordered pairwise differences.

Mann-Whitney Test and CI: App Mag, Abs Mag

                   N     Median
App Mag (M-31)     360   14.540
Abs Mag (MW)       81    -10.557

Point estimate for d is 24.900
95.0 Percent CI for d is (24.530, 25.256)
W = 94140.0
Test of d=0 vs d not equal 0 is significant at 0.0000

What is W?

$$U^+ = \#\{Y_j > X_i\}$$

$$W = U^+ + \frac{n(n+1)}{2} = \sum_{j=1}^{n} R_j$$

$R_1, \ldots, R_n$ are the ranks of $Y_1, \ldots, Y_n$ in the combined data.

$$\bar R_Y - \bar R_X = \left(\frac{1}{n} + \frac{1}{m}\right) U^+ - \frac{n + m}{2}$$

Hence MWW can be written as the difference in average ranks, rather than $\bar Y - \bar X$ as in the t test.

What about spread or scale differences between the two populations?

Below we shift the MW observations to the right by 24.9 to line up with M-31.

[Figure: Dotplot of MW (shifted) and M-31. Each symbol represents up to 2 observations.]

Variable   StDev   IQR     PseudoStdev
MW         1.804   2.420   1.815
M-31       1.195   1.489   1.117

Levene's Rank Test

Compute |Y - Med(Y)| and |X - Med(X)|, called absolute deviations.

Apply MWW to the absolute deviations. (Rank the absolute deviations.)

The test rejects equal spreads in the two populations when the difference in average ranks of the absolute deviations is too large.

Idea: After we have centered the data, then if the null hypothesis of no difference in spreads is true, all permutations of the combined data are roughly equally likely.
(Permutation Principle)

So randomly select a large set of the permutations, say B permutations. For each one, assign the first n to the Y sample and the remaining m to the X sample, and compute MWW on the absolute deviations.

The approximate p-value is the number of permuted MWW values that exceed the original MWW value, divided by B.

Difference of rank mean absolute deviations: 51.9793

[Figure: Histogram of levenerk with normal curve. Mean 0.1644, StDev 16.22, N 1000; the observed value 52 is marked in the far right tail.]

So we easily reject the null hypothesis of no difference in spreads and conclude that the two populations have significantly different spreads.

One Sample Methods
Two Sample Methods
k-Sample Methods

Variable     Mean     StDev   Median   .75IQR   Skew    Kurtosis
Messier 31   22.685   0.969   23.028   1.069    -0.67   -0.67
Messier 81   24.298   0.274   24.371   0.336    -0.49   -0.68
NGC 3379     26.139   0.267   26.230   0.317    -0.64   -0.48
NGC 4494     26.654   0.225   26.659   0.252    -0.36   -0.55
NGC 4382     26.905   0.201   26.974   0.208    -1.06    1.08

All one-sample and two-sample methods can be applied one at a time or two at a time: plots, summaries, inferences.

We begin k-sample methods by asking if the location differences between the NGC nebulae are statistically significant.

We will briefly discuss issues of truncation.

[Figure: 85% CI-Boxplot, Planetary Nebula Luminosities, for M-31, M-81, NGC-3379, NGC-4494, NGC-4382]

Extending MWW to several samples

Let N be the total sample size, rank the combined data, and let $\bar R_1$, $\bar R_2$, and $\bar R_3$ be the average ranks of the three samples. Construct:

$$KW = \frac{12}{N(N+1)} \left\{ \frac{n_1 n_2}{N} (\bar R_1 - \bar R_2)^2 + \frac{n_1 n_3}{N} (\bar R_1 - \bar R_3)^2 + \frac{n_2 n_3}{N} (\bar R_2 - \bar R_3)^2 \right\}$$

$$= \frac{12}{N(N+1)} \left\{ n_1 \left(\bar R_1 - \frac{N+1}{2}\right)^2 + n_2 \left(\bar R_2 - \frac{N+1}{2}\right)^2 + n_3 \left(\bar R_3 - \frac{N+1}{2}\right)^2 \right\}$$

Generally use a chi-square distribution with k - 1 degrees of freedom (here k - 1 = 2) as the approximate sampling distribution for KW.

Kruskal-Wallis Test on NGC

sub       N     Median   Ave Rank   Z
1         45    26.23    29.6       -9.39
2         101   26.66    104.5      0.36
3         59    26.97    156.4      8.19
Overall   205            103.0

KW = 116.70   DF = 2   P = 0.000

This test can be followed by multiple comparisons.
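The two forms of the KW statistic given above can be checked numerically. The sketch below is a minimal illustration in plain Python: it ranks the combined data (tied values share their average rank, an assumption the slides do not spell out), applies no tie correction, and returns both the centered and the pairwise form. The function names `midranks` and `kruskal_wallis` and the toy data are hypothetical.

```python
def midranks(values):
    """Ranks 1..N of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def kruskal_wallis(samples):
    """KW statistic (no tie correction) in its centered and pairwise forms."""
    combined = [x for s in samples for x in s]
    N = len(combined)
    ranks = midranks(combined)
    ns, rbar, start = [], [], 0
    for s in samples:
        n = len(s)
        ns.append(n)
        rbar.append(sum(ranks[start:start + n]) / n)   # average rank of sample
        start += n
    const = 12 / (N * (N + 1))
    centered = const * sum(n * (r - (N + 1) / 2) ** 2 for n, r in zip(ns, rbar))
    pairwise = const * sum(
        ns[a] * ns[b] / N * (rbar[a] - rbar[b]) ** 2
        for a in range(len(ns)) for b in range(a + 1, len(ns))
    )
    return centered, pairwise

# Toy data (hypothetical): three completely separated samples of size 3.
centered, pairwise = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(centered, pairwise)   # the two forms agree
```

As a rough check against the table, plugging the rounded sample sizes (45, 101, 59) and average ranks (29.6, 104.5, 156.4) into the centered form gives about 116.8, in line with the reported KW = 116.70.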
For example, in the multiple comparisons that follow the Kruskal-Wallis test, if we assign a family error rate of .09, then we would conduct 3 MWW tests, each at a level of .03 (Bonferroni).

[Figure: 85% CI-Boxplot for NGC3379, NGC4494, NGC4382]

What to do about truncation.

1. See a statistician.
2. Read the Johnson, Morrell, and Schick reference, and then see a statistician.

Here is the problem: Suppose we want to estimate the difference in locations between two populations: F(x) and G(y) = F(y - d).

But (with right truncation at a) the observations come from

$$F_a(x) = \frac{F(x)}{F(a)} \ \text{for } x \le a \quad \text{and} \quad 1 \ \text{for } x > a$$

$$G_a(y) = \frac{F(y - d)}{F(a - d)} \ \text{for } y \le a \quad \text{and} \quad 1 \ \text{for } y > a$$

Suppose d > 0, so we want to shift the X-sample to the right, toward the truncation point. As we shift the Xs, some will pass the truncation point and will be eliminated from the data set. This changes the sample sizes and requires adjustment when computing the corresponding MWW to see if it is equal to its expectation. See the reference for details.

Comparison of NGC4382 and NGC 4494

Data multiplied by 100 and 2600 subtracted. Truncation point taken as 120.

Point estimate for d is 25.30
W = 6595.5
m = 101 and n = 59

Computation of shift estimate with truncation

d      m    n            W        E(W)
25.3   88   59    5.10   4750.5   4366.0
28.3   84   59    3.60   4533.5   4248.0
30.3   83   59    2.10   4372.0   4218.5
32.3   81   59    0.80   4224.5   4159.5
33.3   81   59   -0.20   4144.5   4159.5
33.1   81   59   -0.00   4161.5   4159.5   ($\hat d$)

Robust regression fitting and correlation (association)

Dataset (http://astrostatistics.psu.edu/datasets/HIP_star.html)

We have extracted a sample of 50 from the subset of 2719 Hipparcos stars.

Vmag = Visual band magnitude. This is an inverted logarithmic measure of brightness.

Plx = Parallactic angle (mas = milliarcseconds). 1000/Plx gives the distance in parsecs (pc).

B-V = Color of star (mag)

The HR diagram is logL vs. B-V, where (roughly) the log-luminosity in units of solar luminosity is constructed as logL = (15 - Vmag - 5 log Plx)/2.5. All logs are base 10.
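The distance and log-luminosity conversions just stated are simple enough to write out. The sketch below only restates those two formulas in Python; the function names and the example star are hypothetical and are not taken from the Hipparcos sample.

```python
import math

def distance_pc(plx_mas):
    """Distance in parsecs from a parallax given in milliarcseconds."""
    return 1000.0 / plx_mas

def log_luminosity(vmag, plx_mas):
    """Approximate log10 luminosity in solar units:
    logL = (15 - Vmag - 5 * log10(Plx)) / 2.5, with Plx in mas."""
    return (15.0 - vmag - 5.0 * math.log10(plx_mas)) / 2.5

# Hypothetical star: Vmag = 5.0 and Plx = 100 mas.
print(distance_pc(100.0))          # 10.0 parsecs
print(log_luminosity(5.0, 100.0))  # 0.0, i.e. roughly solar luminosity
```

Because Vmag enters with a negative sign, brighter (smaller Vmag) and more distant (smaller Plx) stars both get larger logL, which is the "inverted logarithmic" behavior noted above.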
Row   LogL       BV
1      0.69233   0.593
2      1.75525   0.935
3     -0.30744   0.830
4     -0.17328   0.685
5      0.57038   0.529
6     -1.04471   1.297
7      0.51396   0.510
8      0.52149   0.607
9     -1.06306   1.288
10     0.41990   0.677
11    -0.76152   0.950
12    -1.10608   1.260
13     0.42593   0.651
14    -0.44066   0.909
15    -0.90039   1.569
16    -0.74118   1.065
17    -0.66820   1.049
18    -0.26810   0.884
19     0.56722   0.480
20    -0.93809   0.490
21    -0.38095   1.160
22    -0.19267   0.810
23     0.54619   0.498
24     0.20161   0.614
25     0.37348   0.538
26    -0.38556   0.879
27    -0.22978   0.723
28     0.57671   0.455
29    -1.00092   1.110
30    -0.00215   0.637
31    -0.95768   1.616
32     0.10378   0.606
33    -1.43872   1.365
34     1.23674   0.395
35     0.10866   0.630
36    -1.60621   *
37     0.06468   0.599
38    -0.18214   0.709
39     0.37988   0.561
40     1.23793   0.257
41    -0.16896   0.864
42    -0.59331   0.955
43     1.78028   1.010
44    -0.63099   1.100
45     0.61900   0.664
46    -0.28520   0.706
47    -0.71404   0.898
48     0.35061   0.616
49     0.55002   0.466
50     0.37922   0.548

[Figure: Fitted Line Plot of LogL vs. BV. Resistant line in black, least squares line in blue. Equations shown on the plot: LogL = 1.253 - 1.605 BV and logL = 1.513 - 2.067 BV.]

The resistant line is robust and not affected by the outliers. It follows the bulk of the data much better than the non-robust least squares regression line.

There are various ways to find a robust or resistant line. The most typical is to use the ideas of M-estimation and minimize:

$$\sum_i \rho(x_i - a - b c_i)$$

where $x_i$ is the response, $c_i$ is the predictor, and $\rho(x)$ does not increase as fast as a quadratic.

The strength of the relationship between variables is generally measured by correlation.

Next we look briefly at non-robust and robust measures of correlation or association.

Pearson product moment correlation is not robust.

Spearman's rank correlation coefficient is simply the Pearson coefficient with the data replaced by their ranks.

Spearman's coefficient measures association, or the tendency of the two measurements to increase or decrease together.
Pearson's measures the degree to which the measurements cluster along a straight line.

For the example:
Pearson r = -.673
Spearman rs = -.743

Significance tests:

Pearson r: refer z to a standard normal distribution, where

$$z = r \sqrt{\frac{n - 2}{1 - r^2}} = -6.28$$

Spearman rs: refer z to a standard normal distribution, where

$$z = \sqrt{n - 1}\; r_s = -5.201$$

Kendall's tau coefficient is defined as

$$\tau = \frac{2P}{n(n-1)/2} - 1$$

where P is the number of concordant pairs out of n(n-1)/2 total pairs.

For example, (1, 3) and (2, 7) are concordant since 2 > 1 and 7 > 3.

Note that Kendall's tau estimates the probability of concordance minus the probability of discordance in the bivariate population.

For the example: Kendall's tau = -0.63095

Significance test: refer z to a standard normal distribution, where

$$z = \frac{3 \sqrt{n(n-1)}\; \tau}{\sqrt{2(2n + 5)}} = -6.47$$

What more can we do?

1. Multiple regression
2. Analysis of designed experiments (AOV)
3. Analysis of covariance
4. Multivariate analysis

These analyses can be carried out using the website:
http://www.stat.wmich.edu/slab/RGLM/

[Cartoon: Professor Lundquist, in a seminar on compulsive thinkers, illustrates his brain stapling technique.]

The End