Nonparametric Methods Featuring the Bootstrap
Jon Atwood
November 12, 2013

Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of statistics:
Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...)
Collaboration: from our website, request a meeting for personalized statistical advice. Coauthor not required, but encouraged.
Walk-In Consulting: Monday-Friday, 1-3 PM, GLC (Monday-Thursday during the summer).
Great advice right now: meet with LISA before collecting your data.
Short Courses: designed to help graduate students apply statistics in their research.
All services are FREE for VT researchers. We assist with research, not class projects or homework.
www.lisa.stat.vt.edu

Short Course Goals
By the end of this course, you will know…
• The fundamental differences between nonparametric and parametric methods
• In what general situations nonparametric methods would be advantageous
• How to implement nonparametric alternatives to t-tests, ANOVAs, and simple linear regression
• What nonparametric bootstrapping is and when you can use it effectively

What we will do today…
Give a brief overview of nonparametric statistical methods.
Take a look at real-world data sets! (and some non-real-world data sets)
Implement the following methods in R…
• Wilcoxon rank-sum and signed-rank tests (alternatives to t-tests)
• Kruskal-Wallis test (alternative to one-way ANOVA)
• Spearman correlation (alternative to Pearson correlation)
• Nonparametric bootstrapping
• A bonus topic?

What does nonparametric mean?
Well, first of all, what does parametric mean? Parametric is a word that can have several meanings. In statistics, parametric usually means some specific probability distribution is assumed:
• Normal (regression, ANOVA, etc.)
• Exponential (survival)
It may also involve assumptions about parameters (e.g., the variance).

And now a closer look…

Regression Analysis
We want to see the relationship between a set of predictor variables (x1, x2, …, xp) and a response variable. For example, suppose we wanted to see the relationship between weight and blood pressure. A usual simple linear regression model looks something like this:
Yi = β0 + β1xi + εi, with error term εi ~ N(0, σ²)
[Plot: simple linear regression line fit through a scatter of points]
The error terms are assumed to come from a normal distribution with mean 0 and variance σ². The usual methods for testing the significance of our β estimates are based directly on this assumption of normality. If the assumption of normality is false, these methods are invalid.

So onto nonparametric…
A statistical method is nonparametric if it does not require most of these assumptions: the data are not assumed to follow some underlying distribution. In some cases, it means that variables do not take predetermined forms (nonparametric regression, for example).
Assumptions nonparametric methods do make:
• Randomness
• Independence
• In multi-sample tests, the distributions have the same shape

Nonparametric Methods
Advantages:
• Free from distributional assumptions about the data
• Easy to interpret
• Usually computationally straightforward
Disadvantages:
• Loss of power when the data do follow the usual assumptions
• Reduces the data's information
• Larger sample sizes needed because of the lower efficiency

With that said… Rank Tests
Rank tests are a simple group of nonparametric tests. Instead of using the actual numerical value of an observation, we use its rank: its position on the number line relative to the other observations.

As an example… here's some data. Basically, these are just numbers I picked out of the sky. Any ideas on what we should call it? Wigs in a wig shop?

Y value   Ascending position   Rank
   32              1            1
   45              2            2
   54              3            3
   64              4            4
  311              5            5
 2000              6            6

What about ties? The tied observations' ascending positions are added together and divided by the number of ties:

Y value   Ascending position   Rank
   32              1            1
   64              2            (2+3+4)/3 = 3
   64              3            (2+3+4)/3 = 3
   64              4            (2+3+4)/3 = 3
  311              5            5
 2000              6            6

Wilcoxon rank tests
These provide alternatives to the standard t-tests, which test mean differences between two groups. T-tests assume that the data are normally distributed and can be highly sensitive to outliers (we'll see that in an example soon), which may reduce power (the ability to detect significant differences). Wilcoxon tests alleviate these problems by using ranks, not actual values.

First, the Wilcoxon Rank-Sum Test
This is an alternative to the independent-sample t-test (recall that the independent-sample t-test compares two independent samples, testing whether the means are equal). The t-test assumes normality of the data and equal variances (though adjustments can be made for unequal variances). The Wilcoxon rank-sum test assumes only that the two samples follow continuous distributions, along with the usual randomness and independence.

Let's try it out with some data!
The source of this data is… me, Jon Atwood. The data were randomly generated in R:
• First group: 15 observations from a normal distribution with mean 30,000 and standard deviation 2,500, plus one extreme outlier (1,000,000) added.
• Second group: 16 observations from a normal distribution with mean 40,000 and standard deviation 2,500.

Group 1: 32897.08, 29383.80, 30539.39, 27448.50, 33712.17, 28166.07, 26613.30, 30105.70, 25786.81, 30761.04, 27427.12, 29060.02, 25593.89, 29815.82, 32453.97, 1000000
Group 2: 40273.0, 35610.5, 42547.4, 41214.4, 40183.0, 36460.8, 45040.8, 39248.6, 33901.7, 40985.1, 42412.0, 38247.2, 36384.2, 41849.5, 38222.8, 39739.4

So what is this test… testing?
Remember, for the t-test we are testing H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2, where μ1 is the mean of group 1 and μ2 is the mean of group 2.
In the Wilcoxon rank-sum test, we are testing H0: F1 = F2 vs. Ha: F1(y) = F2(y - α) for some α ≠ 0, where F1 is the distribution of group 1, F2 is the distribution of group 2, and α is the "location shift." What this really means is that under the alternative, F1 and F2 are basically the same, except one is shifted to the right of the other by α.
[Plot: two identical distribution curves, one shifted to the right of the other by the amount α]

In our problem…

Group 1 value   Rank      Group 2 value   Rank
 25593.89         1         33902          16
 25786.81         2         35611          17
 26613.3          3         36384          18
 27427.12         4         36461          19
 27448.5          5         38223          20
 28166.07         6         38247          21
 29060.02         7         39249          22
 29383.8          8         39739          23
 29815.82         9         40183          24
 30105.7         10         40273          25
 30539.39        11         40985          26
 30761.04        12         41214          27
 32453.97        13         41850          28
 32897.08        14         42412          29
 33712.17        15         42547          30
 1000000         32         45041          31
 Rank sum       152         Rank sum      376

So group 2 has a rank sum 224 higher than group 1 (376 - 152 = 224). The question is: what is the probability of observing a split of the ranks between the two groups where the difference in rank sums is at least this large? R computes p-values using the normal approximation for sample sizes greater than 50.

So let's do some R coding! We will view graphs and results in the R program we run (this applies to all examples).
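Here is a minimal sketch of what that R session might look like, assuming we simulate the two groups rather than load the exact values from the slides (the variable names and seed are mine):

# Two groups like the ones above: group 1 is normal with an extreme outlier added,
# group 2 is normal with a higher mean
set.seed(1)
group1 <- c(rnorm(15, mean = 30000, sd = 2500), 1000000)   # 15 normal values plus the outlier
group2 <- rnorm(16, mean = 40000, sd = 2500)

# Parametric comparison: two-sample t-test (badly affected by the outlier)
t.test(group1, group2)

# Nonparametric comparison: Wilcoxon rank-sum test
wilcox.test(group1, group2)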
Moving to a paired situation…
Suppose we have two samples, but they are paired in some way: pairs of people are couples, plants are paired off into different plots, dogs are paired by breed, or whatever. Then the rank-sum test will not be the optimal test. Instead, we use the signed-rank test.

Signed-Rank Test
The null hypothesis is that the medians of the two distributions are equal.
Procedure:
1. Calculate the difference between the two groups for each pair
2. Take the absolute value of those differences
3. Convert the absolute differences to ranks (like before)
4. Attach a + or - sign, depending on whether the difference is positive or negative
5. Add these signed ranks together to get the test statistic W

Example
This data comes from Laureysens et al. (2004), who took 13 poplar tree clones and measured aluminum levels in August and November. Below is a table with the raw data, the differences, and the ranks.

Clone            Aug    Nov    Abs. diff   Rank   Signed rank
Balsam Spire      8.1   11.2      3.1        6        +6
Beaupre          10.0   16.3      6.3        9        +9
Hazendans        16.5   15.3      1.2        3        -3
Hoogvorst        13.6   15.6      2.0        4        +4
Raspalje          9.5   10.5      1.0        2        +2
Unal              8.3   15.5      7.2       10       +10
Columbia River   18.3   12.7      5.6        8        -8
Fritzi Pauley    13.3   11.1      2.2        5        -5
Trichobel         7.9   19.9     12.0       11       +11
Gaver             8.1   20.4     12.3       12       +12
Gibecq            8.9   14.2      5.3        7        +7
Primo            12.6   12.7      0.1        1        +1
Wolterson        13.4   36.8     23.4       13       +13

W = 59 (the sum of the signed ranks). The p-value asks: out of all the possible ways of attaching signs to the ranks, how many would produce a W at least as extreme as this one? R gives an exact p-value for sample sizes less than 50; otherwise, a normal approximation is used.

R time!
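A sketch of the paired analysis in R, using the August and November aluminum values from the table above (the vector names are mine):

aug <- c(8.1, 10.0, 16.5, 13.6, 9.5, 8.3, 18.3, 13.3, 7.9, 8.1, 8.9, 12.6, 13.4)
nov <- c(11.2, 16.3, 15.3, 15.6, 10.5, 15.5, 12.7, 11.1, 19.9, 20.4, 14.2, 12.7, 36.8)

# Wilcoxon signed-rank test for paired data
# (note: R reports V, the sum of the positive ranks, rather than the signed-rank sum W above)
wilcox.test(nov, aug, paired = TRUE)

# Parametric analogue, for comparison: the paired t-test
t.test(nov, aug, paired = TRUE)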
More than 2 groups
Suppose we are comparing more than two groups. Normally (no pun intended), we would use an analysis of variance, or ANOVA. But, of course, we need to assume normality for this as well.

Kruskal-Wallis
In this situation, we use the Kruskal-Wallis test. Again, we convert the actual numeric values to ranks, pooling all of the groups together. Ultimately, we compute the test statistic
H = [12 / (N(N+1))] Σ (Ri² / ni) - 3(N+1),
where N is the total sample size, ni is the size of group i, and Ri is the sum of the ranks in group i; H is compared to a chi-squared distribution with (number of groups - 1) degrees of freedom.

An example
We'll use the built-in R dataset airquality. These are daily air quality measurements in New York from May to September 1973, reported in Chambers et al. (1983). The question is: does the air quality differ from month to month?

Break out the R!
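A sketch of the Kruskal-Wallis test on the built-in airquality data. The slides do not say which air-quality variable is used, so taking Ozone here is my assumption:

data(airquality)
boxplot(Ozone ~ Month, data = airquality)        # quick look at the monthly distributions

# Kruskal-Wallis test: do the monthly ozone distributions differ?
kruskal.test(Ozone ~ Month, data = airquality)

# Parametric comparison: one-way ANOVA on the same data
summary(aov(Ozone ~ factor(Month), data = airquality))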
Onto Part 2…
In this part of the course, we will do the following things:
1. Look at the Spearman correlation (an alternative to the Pearson correlation) between an x variable and a y variable
2. Examine nonparametric bootstrapping, and how it can help us when our data do not look normal

Spearman correlation
Suppose you want to discover the association between infant mortality and GDP by country.
[Scatterplot: infant mortality vs. GDP by country, 2003 data from www.indexmundi.com]
In this example, the Pearson correlation is about .63. That is still significant, but it perhaps understates the monotone nature of the relationship between GDP and infant mortality rate. In addition, the Pearson correlation assumes linearity, which is clearly not present here.

We can use the Spearman correlation instead. This is a correlation coefficient based on ranks, which are computed in the y variable and the x variable independently, with sample size n. To calculate the coefficient, we do the following:
1. Take each xi and each yi and convert them into ranks (the x's and the y's are ranked separately)
2. Subtract rxi from ryi to get di, the difference in ranks for observation i
3. Plug the di into the formula rs = 1 - 6 Σ di² / (n(n² - 1))

In this case, the hypotheses are H0: rs = 0 vs. Ha: rs ≠ 0. Basically, we are attempting to see whether the two variables are independent or whether there is evidence of an association.

Let's return to the GDP data. We will now plot the data in R and see how to get the Spearman correlation. R tests this using an exact p-value for small sample sizes and an approximate t-distribution for larger ones; the test statistic in that case follows a t-distribution with n - 2 degrees of freedom.

Turn to R now!!!

Nonparametric Bootstrapping
Suppose you are fitting a multiple linear regression model
Yi = β0 + β1x1i + β2x2i + … + βkxki + εi.
Recall that εi ~ N(0, σ²). But what if we have reason to suspect this assumption is not met? Then the regular way of testing the significance of our coefficients is invalid. So what do we do? Depending on the situation, there are several options. The one we will talk about today is called nonparametric bootstrapping.

Bootstrapping is a resampling method, where we take the data and repeatedly sample from it at random to draw inferences. There are several types of bootstrapping; we will focus on the simpler, nonparametric type.

How do we do nonparametric bootstrapping?
1. Assign each of the n observations a 1/n probability of being selected
2. Take a random sample with replacement, usually of size n
3. Compute the estimated coefficients based on this resample
4. Repeat the process many times, say 10,000, or 100,000, or 100,000,000,000,000…

For example, in regression: suppose we want to test whether or not β1 = 0. We sort our newly generated sample of (10,000, or however many) β1 estimates. For a 95 percent confidence interval, we take the 250th smallest and the 9,750th smallest of those sorted estimates as the endpoints. If this interval does not contain 0, we conclude that there is evidence that β1 is not equal to 0.

Possible issue? Theoretically, this method comes from the idea that the distribution of a population is approximated by the distribution of its sample. This assumption becomes less and less valid the smaller your sample size.

Example
This data was taken from The Practice of Econometrics by E. R. Berndt (1991). We will regress wage on education and experience, and use graphs to check whether the residuals look approximately normal.

R-code now!
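Two sketches for this part of the course. The actual GDP/infant-mortality and wage data sets are not reproduced here, so both examples use simulated stand-in data with a similar flavor; the variable names and generating models are my own assumptions.

# --- Spearman correlation ---
set.seed(2)
gdp <- exp(rnorm(50, mean = 9, sd = 1))                          # skewed, GDP-like values
infant_mortality <- 200 / (1 + gdp / 2000) + rnorm(50, sd = 3)   # monotone decreasing, nonlinear

plot(gdp, infant_mortality)
cor(gdp, infant_mortality)                                # Pearson correlation
cor.test(gdp, infant_mortality, method = "spearman")      # Spearman rank correlation

# --- Nonparametric bootstrap for regression coefficients ---
n <- 100
education  <- sample(8:18, n, replace = TRUE)
experience <- sample(0:40, n, replace = TRUE)
wage <- 2 + 0.5 * education + 0.1 * experience + rexp(n, rate = 1/3)   # deliberately non-normal errors
dat <- data.frame(wage, education, experience)

B <- 10000                        # number of bootstrap resamples
boot_beta1 <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:n, size = n, replace = TRUE)                  # resample rows with replacement
  fit <- lm(wage ~ education + experience, data = dat[idx, ])
  boot_beta1[b] <- coef(fit)["education"]                       # keep the education coefficient
}

# Percentile 95% confidence interval (the 250th and 9,750th ordered values)
quantile(boot_beta1, probs = c(0.025, 0.975))

The boot package (boot() together with boot.ci()) automates the same resampling idea; the loop above just makes the mechanics explicit.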
In Summary
We now understand that certain parametric methods, like t-tests and regression, depend on assumptions that may or may not be met. We know that nonparametric methods do not make distributional assumptions, and are therefore applicable when the data do not meet those assumptions. And we can implement these methods in R, in case our data does not meet these assumptions.

Bonus Topic! Indicator Variables in Regression
These are used when we have categorical variables as predictors in regression. As a start, let's re-examine the wage data. Here, we will drop education and just look at experience and sex. The full model in a case like this would look like
Wi = β0 + β1Ei + β2Si + β3SiEi + εi
where W = wage, E = experience, S = 1 if sex = "male" and 0 otherwise, and ε is the error term.

Separate regressions for the different sexes:
For women, the reference group: Wia = β0a + β1aEi + εia
For men, the non-reference group: Wib = β0b + β1bEi + εib

With that in mind, let's return to the full model Wi = β0 + β1Ei + β2Si + β3SiEi + εi.
Suppose we want to see how women do in this model. We can set S = 0, and we are left with Wi = β0 + β1Ei + εi. But since this is for women, it is equivalent to the women-only model Wia = β0a + β1aEi + εia. Thus β0 = β0a and β1 = β1a.

Now look at "male." We set S = 1:
Wi = β0 + β1Ei + β2(1) + β3(1)Ei + εi
   = (β0 + β2) + (β1 + β3)Ei + εi
   = (β0a + β2) + (β1a + β3)Ei + εi,
which must equal the men-only model β0b + β1bEi + εib.
So β2 = β0b - β0a and β3 = β1b - β1a.

Example in R!

Thank you!

References
Berndt, E. R. (1991). The Practice of Econometrics. New York: Addison-Wesley.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
Laureysens, I., Blust, R., De Temmerman, L., Lemmens, C. and Ceulemans, R. (2004). Clonal variation in heavy metal accumulation and biomass production in a poplar coppice culture. I. Seasonal variation in leaf, wood and bark concentrations. Environmental Pollution 131: 485-494.