Nonparametric Statistical Methods

Definition
When the data are generated from a process (model) that is known except for a finite number of unknown parameters, the model is called a parametric model. Otherwise, the model is called a nonparametric model. Statistical techniques that assume a nonparametric model are called nonparametric.

For example
If you assume that your data come from a normal distribution with mean $\mu$ and standard deviation $\sigma$ (both unknown), then the data are generated from a process (model) that is known except for two parameters ($\mu$ and $\sigma$). The model is a parametric model. Models that do not assume normality (or some other distribution with a finite number of parameters) are nonparametric.

Single sample nonparametric tests for central location
We will consider two nonparametric tests:
1. The sign test
2. Wilcoxon’s signed rank test

These are tests for the central location of a population. They are alternatives to the z-test and the t-test for the mean of a normal population:

$$ z = \frac{\bar{x} - m_0}{\sigma/\sqrt{n}}, \qquad t = \frac{\bar{x} - m_0}{s/\sqrt{n}} $$

Both the z-test and the t-test assume the data come from a normal population. If the data do not come from a normal population, properties of the z-test and the t-test that require this assumption will no longer hold. In particular, the probability of a type I error may differ from the desired value (0.05 or 0.01).

If the data do not come from a normal population, we should use one of the two nonparametric tests listed above. These tests do not assume the data come from a normal population.

The sign test
A nonparametric test for the central location of a distribution.

We want to test
H0: median = m0
against
HA: median ≠ m0 (or against a one-sided alternative)

The test statistic: S = the number of observations that exceed m0.

Comment: If H0: median = m0 is true, we would expect 50% of the observations to be above m0 and 50% of the observations to be below m0. In that case S has a binomial distribution with p = 0.50 and n = sample size.

If H0 is not true, then S still has a binomial distribution, but p is not equal to 0.50, where p = the probability that an observation is greater than m0:
• If m0 > median then p < 0.50
• If m0 < median then p > 0.50

Summarizing: If H0 is true then S has a binomial distribution with p = 0.50, n = sample size.

The binomial distribution with n = 10, p = 0.5:

x      0       1       2       3       4       5       6       7       8       9       10
p(x)   0.0010  0.0098  0.0439  0.1172  0.2051  0.2461  0.2051  0.1172  0.0439  0.0098  0.0010

[Figure: bar chart of p(x) against x = 0, 1, ..., 10]

The critical and acceptance region
Choose the critical region so that $\alpha$ is close to 0.05 or 0.01.
e.g. If the critical region is {0, 1, 9, 10} then $\alpha$ = .0010 + .0098 + .0098 + .0010 = .0216
e.g. If the critical region is {0, 1, 2, 8, 9, 10} then $\alpha$ = .0010 + .0098 + .0439 + .0439 + .0098 + .0010 = .1094

Example
Suppose that we are interested in determining if a new drug is effective in reducing cholesterol. Hence we administer the drug to n = 10 patients with high cholesterol and measure the reduction.
The data

Case   Initial cholesterol   Final cholesterol   Reduction
1      240                   228                 12
2      237                   222                 15
3      264                   262                 2
4      233                   224                 9
5      236                   240                 -4
6      234                   237                 -3
7      264                   264                 0
8      241                   219                 22
9      261                   252                 9
10     256                   254                 2

Suppose we want to test
H0: the drug is not effective (median reduction ≤ 0)
against
HA: the drug is effective (median reduction > 0)

The Sign test
The test statistic: S = the number of positive observations = 8. (Seven reductions are strictly positive and one, for case 7, is exactly zero; the value S = 8 corresponds to the seven positive reductions plus the zero. With S = 7 the p-value below would be P[S ≥ 7] = 0.1719 and the conclusion would be the same.)

We will use the p-value approach:
p-value = P[S ≥ 8] = 0.0439 + 0.0098 + 0.0010 = 0.0547
Since the p-value > 0.05 we cannot reject H0.

Summarizing: To carry out the Sign Test we
1. Compute S = the number of observations greater than m0.
2. Let $s_{observed}$ = the observed value of S.
3. Compute the p-value from the binomial distribution (p = ½, n = sample size): P[S ≥ $s_{observed}$] when HA states that the median is greater than m0, P[S ≤ $s_{observed}$] when HA states that the median is less than m0, and twice the smaller of these two tail probabilities for a two-tailed test.
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).

Sign Test for Large Samples
If n is large we can use the Normal approximation to the Binomial. Namely, S has a Binomial distribution with p = ½ and n = sample size. Hence for large n, S has approximately a Normal distribution with mean

$$ \mu_S = np = \frac{n}{2} $$

and standard deviation

$$ \sigma_S = \sqrt{npq} = \sqrt{n \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}} = \frac{\sqrt{n}}{2} $$

Hence for large n, use as the test statistic (in place of S)

$$ z = \frac{S - \mu_S}{\sigma_S} = \frac{S - n/2}{\sqrt{n}/2} $$

Choose the critical region for z from the Standard Normal distribution, i.e. reject H0 if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ (two-tailed; a one-tailed test can also be set up).

Nonparametric Confidence Intervals
Assume that the data, x1, x2, x3, … , xn, is a sample from an unknown distribution.
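Before moving on to the interval construction, the sign test calculations above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `sign_test_upper_p` and `sign_test_z` are illustrative choices, not part of the lecture, and only the upper-tailed p-value is shown.

```python
from math import comb, sqrt


def sign_test_upper_p(s_observed: int, n: int) -> float:
    """Exact upper-tail p-value P[S >= s_observed] for S ~ Binomial(n, 1/2)."""
    return sum(comb(n, x) for x in range(s_observed, n + 1)) / 2 ** n


def sign_test_z(s_observed: int, n: int) -> float:
    """Large-sample statistic z = (S - n/2) / (sqrt(n)/2)."""
    return (s_observed - n / 2) / (sqrt(n) / 2)


# Cholesterol example: S = 8 out of n = 10, as used in the lecture.
p_value = sign_test_upper_p(8, 10)
print(round(p_value, 4))  # 0.0547
```

The exact sum avoids the rounding in the tabled probabilities; `sign_test_z` is meant for larger n, where its value is compared with standard normal critical values.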
Now arrange the data x1, x2, x3, … , xn in increasing order

x(1) < x(2) < x(3) < … < x(n)

Hence
x(1) = the smallest observation
x(2) = the 2nd smallest observation
x(n) = the largest observation

Consider the kth smallest observation and the kth largest observation in the data: x(k) and x(n – k + 1).

If at least k observations lie below the median then x(k) < median.
If at least k observations lie above the median then median < x(n – k + 1).

Thus
P[x(k) < median < x(n – k + 1)]
= P[at least k observations lie below the median and at least k observations lie above the median]
= P[the number of observations below the median is at least k and at most n – k]
= P[k ≤ S ≤ n – k]
where S = the number of observations below the median. S has a binomial distribution with n = the sample size and p = 1/2.

Hence
P[x(k) < median < x(n – k + 1)] = P[k ≤ S ≤ n – k] = p(k) + p(k + 1) + … + p(n – k) = P
where the p(i)’s are binomial probabilities with n = the sample size and p = 1/2.

Summarizing
x(k) to x(n – k + 1) is a P·100% confidence interval for the median, where P = p(k) + p(k + 1) + … + p(n – k) and the p(i)’s are binomial probabilities with n = the sample size and p = 1/2.

Example: n = 10 and k = 2

Binomial probabilities (n = 10, p = 0.5):

x      0       1       2       3       4       5       6       7       8       9       10
p(x)   0.0010  0.0098  0.0439  0.1172  0.2051  0.2461  0.2051  0.1172  0.0439  0.0098  0.0010

P = p(2) + p(3) + p(4) + p(5) + p(6) + p(7) + p(8) = .9784
Hence x(2) to x(9) is a 97.84% confidence interval for the median.

Example
Suppose that we are interested in determining if a new drug is effective in reducing cholesterol. Hence we administer the drug to n = 10 patients with high cholesterol and measure the reduction.
The data

Case   Initial cholesterol   Final cholesterol   Reduction
1      240                   228                 12
2      237                   222                 15
3      264                   262                 2
4      233                   224                 9
5      236                   240                 -4
6      234                   237                 -3
7      264                   264                 0
8      241                   219                 22
9      261                   252                 9
10     256                   254                 2

The data arranged in order:

k      1    2    3    4    5    6    7    8    9    10
x(k)   -4   -3   0    2    2    9    9    12   15   22

x(2) = -3 to x(9) = 15 is a 97.84% confidence interval for the median.

Example
Suppose the study in the previous example is repeated with n = 20 patients with high cholesterol.

The data

Case   Initial cholesterol   Final cholesterol   Reduction
1      231                   212                 19
2      241                   235                 6
3      270                   268                 2
4      264                   255                 9
5      256                   252                 4
6      240                   234                 6
7      269                   256                 13
8      242                   243                 -1
9      269                   264                 5
10     230                   221                 9
11     234                   220                 14
12     266                   267                 -1
13     253                   252                 1
14     267                   242                 25
15     248                   248                 0
16     238                   239                 -1
17     266                   268                 -2
18     266                   240                 26
19     244                   247                 -3
20     249                   250                 -1

The binomial distribution with n = 20, p = 0.5:

x      p(x)        x      p(x)
0      0.0000      11     0.1602
1      0.0000      12     0.1201
2      0.0002      13     0.0739
3      0.0011      14     0.0370
4      0.0046      15     0.0148
5      0.0148      16     0.0046
6      0.0370      17     0.0011
7      0.0739      18     0.0002
8      0.1201      19     0.0000
9      0.1602      20     0.0000
10     0.1762

Note: p(6) + p(7) + p(8) + p(9) + p(10) + p(11) + p(12) + p(13) + p(14)
= 0.0370 + 0.0739 + 0.1201 + 0.1602 + 0.1762 + 0.1602 + 0.1201 + 0.0739 + 0.0370 = 0.9586
Hence x(6) to x(15) is a 95.86% confidence interval for the median reduction in cholesterol.

The data arranged in order:

k      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
x(k)   -3  -2  -1  -1  -1  -1  0   1   2   4   5   6   6   9   9   13  14  19  25  26

x(6) = -1 to x(15) = 9 is a 95.86% confidence interval for the median.

For large values of n one can use the normal approximation to the Binomial to find the value of k so that x(k) to x(n – k + 1) is a 95% confidence interval for the median, i.e. we want to find k so that

$$ P[S < k] = P\left[\frac{S - n/2}{\sqrt{n}/2} < \frac{k - n/2}{\sqrt{n}/2}\right] \approx P\left[Z < \frac{2k - n}{\sqrt{n}}\right] = 0.025 $$

but $P[Z < -1.960] = 0.025$, hence

$$ \frac{2k - n}{\sqrt{n}} = -1.960 \quad \text{or} \quad k = \frac{n - 1.96\sqrt{n}}{2} $$

Using $k = (n - 1.96\sqrt{n})/2$:

n     20     40     100    200
k     5.6    13.8   40.2   86.1

Next we will consider:
1. The Wilcoxon signed rank test

The Wilcoxon signed rank test is an alternative to the Sign test, a test for the central location of a single population.

Review: the sign test
A nonparametric test for the central location of a distribution. We want to test H0: median = m0 against HA: median ≠ m0 (or against a one-sided alternative).

To carry out the Sign test:
1. Compute the test statistic: S = the number of observations that exceed m0 = $s_{observed}$. If H0: median = m0 is true then the distribution of S is binomial with n = sample size, p = 0.50.
2. Compute the p-value of the test statistic $s_{observed}$: p-value = P[S ≥ $s_{observed}$] for an upper-tailed alternative (twice the smaller tail probability for a 2-tailed test), where S is binomial, n = sample size, p = 0.50.
3. Reject H0 if the p-value is low (< 0.05).

Review: nonparametric confidence intervals for the median of a population
x(k) to x(n – k + 1) is a (1 – $\alpha$)100% = P·100% confidence interval for the median, where x(k) = the kth smallest xi, x(n – k + 1) = the kth largest xi, P = p(k) + p(k + 1) + … + p(n – k), and the p(i)’s are binomial probabilities with n = the sample size and p = 1/2.

The Wilcoxon Signed Rank Test
An alternative to the sign test

Situation
• A sample of size n, (x1, x2, … , xn), from an unknown distribution, and we want to test H0: the centre of the distribution, m = m0, against HA: m ≠ m0.
• For the sign test we would count S, the number of positive values of (x1 – m0, x2 – m0, … , xn – m0).
• We would reject H0 if S was not close to n/2.
• For Wilcoxon’s signed-rank test we would assign ranks to the absolute values of (x1 – m0, x2 – m0, … , xn – m0).
• A rank of 1 goes to the value of xi – m0 which is smallest in absolute value.
• A rank of n goes to the value of xi – m0 which is largest in absolute value.

W+ = the sum of the ranks associated with positive values of xi – m0.
W- = the sum of the ranks associated with negative values of xi – m0.
Note: W+ + W- = 1 + 2 + 3 + … + n = n(n + 1)/2

If H0 is true then W+ ≈ W- ≈ n(n + 1)/4.

If H0 is not true then either
1. W+ will be small (W- large), or
2. W+ will be large (W- small)

• m0 > true median: W- large, W+ small
• m0 < true median: W- small, W+ large

Note: It is possible to work out the sampling distribution of W+ (and W-) when H0 is true.
Note: We use the fact that if H0 is true there is an equal probability (1/2) that the sign attached to any rank is plus (+) or minus (-).

Example: n = 4

Sign attached to rank
1   2   3   4      W+    W-    Prob
-   -   -   -      0     10    1/16
+   -   -   -      1     9     1/16
-   +   -   -      2     8     1/16
-   -   +   -      3     7     1/16
-   -   -   +      4     6     1/16
+   +   -   -      3     7     1/16
+   -   +   -      4     6     1/16
+   -   -   +      5     5     1/16
-   +   +   -      5     5     1/16
-   +   -   +      6     4     1/16
-   -   +   +      7     3     1/16
+   +   +   -      6     4     1/16
+   +   -   +      7     3     1/16
+   -   +   +      8     2     1/16
-   +   +   +      9     1     1/16
+   +   +   +      10    0     1/16

The distribution of W+ and W- (n = 4). W+ and W- have the same distribution. If T = W+ or W-:

t      P[T = t]    P[T ≤ t]
0      1/16        0.0625
1      1/16        0.1250
2      1/16        0.1875
3      2/16        0.3125
4      2/16        0.4375
5      2/16        0.5625
6      2/16        0.6875
7      2/16        0.8125
8      1/16        0.8750
9      1/16        0.9375
10     1/16        1.0000

These are the values found in Table A.6 in the textbook.

Table A.6 (page A-15): Distribution of the test statistic for the Wilcoxon signed-rank test, P[T ≤ t]

       Sample size
T      2        3        4        5        6
1      0.5000   0.2500   0.1250   0.0625   0.0313
2               0.3750   0.1875   0.0938   0.0469
3               0.6250   0.3125   0.1563   0.0782
4                        0.4375   0.2188   0.1094
5                        0.5625   0.3125   0.1563
6                                 0.4063   0.2188
7                                 0.5000   0.2813
8                                          0.3438
9                                          0.4219
10                                         0.5000

Table A.6 only goes up to n = sample size = 12.

For sample sizes n > 12 we can use the fact that T (W+ or W-) has approximately a normal distribution with mean

$$ \mu_T = \frac{n(n+1)}{4} $$

and standard deviation

$$ \sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}} $$

so that

$$ P[T \le t] = P\left[\frac{T - \mu_T}{\sigma_T} \le \frac{t - \mu_T}{\sigma_T}\right] \approx P\left[Z \le \frac{t - \mu_T}{\sigma_T}\right] $$

Comparison of exact values of P[T ≤ t] with the normal approximation, n = 12:

t      Exact value    Normal approximation
1      0.0005         0.0014
2      0.0008         0.0019
3      0.0013         0.0024
4      0.0018         0.0030
5      0.0025         0.0038
6      0.0035         0.0048
7      0.0047         0.0060
8      0.0062         0.0075
9      0.0081         0.0093
10     0.0105         0.0115
11     0.0135         0.0140
12     0.0171         0.0171
13     0.0213         0.0207
14     0.0262         0.0249
15     0.0320         0.0299

Example
• In this example we are interested in the quantity FVC (Forced Vital Capacity) in patients with cystic fibrosis.
• FVC (Forced Vital Capacity) = the volume of air that a person can expel from the lungs in a 6 sec period.
• This will be reduced with time for cystic fibrosis patients.

The research question: Will this reduction be less when a new experimental drug is administered?

The Experimental Design
• The design will be a matched pair design.
• Pairs of patients are matched (using initial FVC readings).
• One member of the pair is given the new drug; the other member is given a placebo.
• We measure the reduction in FVC for each member and compute the difference
  xi = Reduction in FVC (placebo) – Reduction in FVC (drug)

These values will be generally positive if the drug is effective in minimizing the deterioration in Forced Vital Capacity (FVC).
In that case W+ will be large (W- will be small).

Table: Reduction in forced vital capacity (FVC) for a matched pair sample of patients with cystic fibrosis

           Reduction in FVC
Subject    Placebo    Drug     Difference xi    Rank    Signed rank
1          224        213      11               1       1
2          80         95       -15              2       -2
3          75         33       42               3       3
4          541        440      101              4       4
5          74         -32      106              5       5
6          85         -28      113              6       6
7          293        445      -152             7       -7
8          -23        -178     155              8       8
9          525        367      158              9       9
10         -38        140      -179             10      -10
11         508        323      185              11      11
12         255        10       245              12      12
13         525        65       460              13      13
14         1023       343      680              14      14

W+ = 86, W- = 19

We have to judge if W+ = 86 is large (or W- = 19 is small). With n = 14:

$$ \mu_W = \frac{n(n+1)}{4} = \frac{14(15)}{4} = 52.5 $$

$$ \sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}} = \sqrt{\frac{14(15)(29)}{24}} = 15.92953 $$

$$ \text{p-value} = P[W^- \le 19] = P\left[\frac{W^- - 52.5}{15.92953} \le \frac{19 - 52.5}{15.92953}\right] \approx P[Z \le -2.10] = 0.0179 $$

Since the p-value is small (< 0.05) we conclude the drug is effective in reducing the deterioration of FVC.

Summarizing: To carry out Wilcoxon’s signed rank test we
1. Compute T = W+ or W- (usually it would be the smaller of the two).
2. Let $t_{observed}$ = the observed value of T.
3. Compute the p-value = P[T ≤ $t_{observed}$] (2 P[T ≤ $t_{observed}$] for a two-tailed test).
   i. For n ≤ 12 use the table.
   ii. For n > 12 use the Normal approximation.
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).

Alternative tests for this example
1. The t-test, with test statistic (here m0 = 0)

$$ t = \frac{\bar{x} - m_0}{s/\sqrt{n}} = \frac{\bar{x}}{s/\sqrt{n}} $$

2. The sign test, with test statistic S = the number of positive xi.

Comments
1. The t-test
   i. This test requires the assumption of normality.
   ii. If the data are not normally distributed the test is invalid:
       • The probability of a type I error may not be equal to its desired value (0.05 or 0.01).
   iii. If the data are normally distributed, the t-test commits type II errors with a smaller probability than any other test (in particular Wilcoxon’s signed rank test or the sign test).
2. The sign test
   i. This test does not require the assumption of normality (true also for Wilcoxon’s signed rank test).
   ii. This test ignores the magnitude of the observations completely.
Wilcoxon’s test takes the magnitude into account by ranking the observations.

Review: behaviour of the test statistics

The Sign test (S = the number of observations greater than m0)
• m0 = true median: S ≈ n/2
• m0 > true median: S small (close to 0)
• m0 < true median: S large (close to n)

Wilcoxon’s signed-rank test (W+ and W- = the sums of the ranks of the positive and negative values of xi – m0)
• m0 = true median: W+ ≈ W- ≈ n(n + 1)/4
• m0 > true median: W- large, W+ small
• m0 < true median: W- small, W+ large

Two-sample nonparametric tests

Mann-Whitney Test
A nonparametric two sample test for comparison of central location

The Mann-Whitney Test
• This is a nonparametric alternative to the two sample t test (or z test) for independent samples.
• These tests (t and z) assume the data are normal.
• The Mann-Whitney test does not make this assumption.
• Sample of n from population 1: x1, x2, x3, … , xn
• Sample of m from population 2: y1, y2, y3, … , ym

The Mann-Whitney test statistic U1 counts the number of times an observation in sample 1 precedes an observation in sample 2. An equivalent statistic U2, which counts the number of times an observation in sample 2 precedes an observation in sample 1, can also be computed.

Example
n = m = 4 measurements of bacteria counts per unit volume were made for two types of cultures.
The n = 4 measurements for culture 1 were 27, 31, 26, 25.
The m = 4 measurements for culture 2 were 32, 29, 35, 28.

To compute the Mann-Whitney test statistics U1 and U2, arrange the observations from the two samples combined in increasing order (retaining sample membership):

25(1), 26(1), 27(1), 28(2), 29(2), 31(1), 32(2), 35(2)

For each observation in sample 2 let ui denote the number of observations in sample 1 that precede that value:
u1 = 3, u2 = 3, u3 = 4, u4 = 4
Then U1 = u1 + u2 + u3 + u4 = 3 + 3 + 4 + 4 = 14.

To compute U2, repeat the process for the other sample. For each observation in sample 1 let vi denote the number of observations in sample 2 that precede that value:
v1 = 0, v2 = 0, v3 = 0, v4 = 2
Then U2 = v1 + v2 + v3 + v4 = 0 + 0 + 0 + 2 = 2.

Note: U1 + U2 = mn = 16. This is true in general. For each pair (xi, yj) either xi < yj or xi > yj (assuming no ties). In one case U1 is increased by 1, while in the other case U2 is increased by 1. There are mn such pairs.
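The pair-counting definition of U1 and U2 translates directly into code. The following is a minimal sketch, assuming no ties between the samples as in the note above; the function name `mann_whitney_counts` is an illustrative choice.

```python
def mann_whitney_counts(xs, ys):
    """Count pairs (x, y) with x < y (U1) and with y < x (U2).

    xs is sample 1 and ys is sample 2; ties between the samples are
    assumed not to occur, so every pair is counted exactly once.
    """
    u1 = sum(1 for x in xs for y in ys if x < y)
    u2 = sum(1 for x in xs for y in ys if y < x)
    return u1, u2


culture1 = [27, 31, 26, 25]  # sample 1 from the bacteria example
culture2 = [32, 29, 35, 28]  # sample 2 from the bacteria example
u1, u2 = mann_whitney_counts(culture1, culture2)
print(u1, u2)  # 14 2
```

The double loop visits all mn pairs, which makes the identity U1 + U2 = mn immediate when there are no ties.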
An alternative way of computing the Mann-Whitney test statistic U

Arrange the observations from the two samples combined in increasing order (retaining sample membership) and assign ranks to the observations.

Observation   25(1)   26(1)   27(1)   28(2)   29(2)   31(1)   32(2)   35(2)
Rank          1       2       3       4       5       6       7       8

Let W1 = the sum of the ranks for sample 1 = 1 + 2 + 3 + 6 = 12.
Let W2 = the sum of the ranks for sample 2 = 4 + 5 + 7 + 8 = 24.

It can be shown that

$$ U_1 = nm + \frac{n(n+1)}{2} - W_1 \quad \text{and} \quad U_2 = nm + \frac{m(m+1)}{2} - W_2 $$

Note:

$$ W_1 + W_2 = 1 + 2 + 3 + \dots + (n+m) = \frac{(n+m)(n+m+1)}{2} $$

so that

$$ U_1 + U_2 = 2nm + \frac{n(n+1)}{2} + \frac{m(m+1)}{2} - \frac{(n+m)(n+m+1)}{2} $$

$$ = \frac{4nm + n(n+1) + m(m+1) - (n+m)(n+m+1)}{2} = \frac{4nm + n^2 + n + m^2 + m - n^2 - 2nm - m^2 - n - m}{2} = nm $$

• The distribution function of U (U1 or U2) has been tabled for various values of n and m (< n) when the two samples come from the same distribution.
• These tables can be used to set up critical regions for the Mann-Whitney U test.

Example
A researcher was interested in comparing the “brightness” of paper prepared by two different processes. A measure of brightness in paper was made for n = m = 9 samples drawn randomly from each of the two processes. The data, with ranks in the combined sample, are presented below:

Process A          Process B
xi      rank       yj      rank
6.1     1          9.1     16
9.2     17         8.2     8
8.7     12         8.6     11
8.9     13.5       6.9     2
7.6     5          7.5     4
7.1     3          7.9     7
9.5     18         8.3     9.5
8.3     9.5        7.8     6
9.0     15         8.9     13.5

Ranks are averaged when observations are tied.

WA = 94, WB = 77

It can be shown that

$$ U_A = nm + \frac{n(n+1)}{2} - W_A = 9(9) + \frac{9(10)}{2} - 94 = 32 $$

and

$$ U_B = nm + \frac{m(m+1)}{2} - W_B = 9(9) + \frac{9(10)}{2} - 77 = 49 $$

From the table for n = m = 9, P[Ui ≤ 18] = .025. Hence we will reject H0 if either UA or UB ≤ 18.
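The rank-sum route to U can be sketched in Python for the paper brightness data. This is an illustrative sketch, not a library routine: the helper `average_ranks` is a hypothetical name, and it implements the tie rule stated above (tied observations share the average of the ranks they would otherwise occupy).

```python
def average_ranks(values):
    """Ranks 1..N for values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2  # average of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


process_a = [6.1, 9.2, 8.7, 8.9, 7.6, 7.1, 9.5, 8.3, 9.0]
process_b = [9.1, 8.2, 8.6, 6.9, 7.5, 7.9, 8.3, 7.8, 8.9]
n, m = len(process_a), len(process_b)

ranks = average_ranks(process_a + process_b)
w_a, w_b = sum(ranks[:n]), sum(ranks[n:])
u_a = n * m + n * (n + 1) / 2 - w_a
u_b = n * m + m * (m + 1) / 2 - w_b
print(w_a, w_b, u_a, u_b)  # 94.0 77.0 32.0 49.0
```

Both U values are then compared with the tabled critical value 18.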
Since UA = 32 and UB = 49 are both greater than 18, H0 is accepted.

The Mann-Whitney test for large samples
For large samples (n > 10 and m > 10) the statistics U1 and U2 have approximately a Normal distribution with mean and standard deviation

$$ \mu_{U_i} = \frac{nm}{2}, \qquad \sigma_{U_i} = \sqrt{\frac{nm(n+m+1)}{12}} $$

Thus we can convert Ui to a standard normal statistic

$$ z = \frac{U_i - \mu_{U_i}}{\sigma_{U_i}} = \frac{U_i - \frac{nm}{2}}{\sqrt{\frac{nm(n+m+1)}{12}}} $$

and reject H0 if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ (for a two-tailed test).

The Kruskal-Wallis Test
• Comparing the central location for k populations
• A nonparametric alternative to the one-way ANOVA F-test

Situation: Data are collected from k populations. The sample size from population i is $n_i$. The data from population i are

$$ x_{i1}, x_{i2}, \dots, x_{i n_i}, \qquad i = 1, 2, \dots, k $$

The computation of the Kruskal-Wallis statistic
We group the $N = n_1 + n_2 + \dots + n_k$ observations from the k populations together and rank these observations from 1 to N. Let $r_{ij}$ be the rank associated with the observation $x_{ij}$.

Handling of “tied” observations
If a group of observations are equal, the ranks that would have been assigned to those observations are averaged.

Let

$$ \bar{r}_i = \frac{r_{i1} + r_{i2} + \dots + r_{i n_i}}{n_i} = \text{the average rank for the } i\text{th sample} $$

$$ \bar{r} = \frac{1 + 2 + 3 + \dots + N}{N} = \frac{N+1}{2} = \text{the average rank for all the observations} $$

Note: If the k populations do not differ in central location then $\bar{r}_1, \bar{r}_2, \dots, \bar{r}_k$ should be approximately equal and close to $\bar{r}$.

The Kruskal-Wallis statistic

$$ K = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i (\bar{r}_i - \bar{r})^2 = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{U_i^2}{n_i} - 3(N+1) $$

where

$$ U_i = \sum_{j=1}^{n_i} r_{ij} = r_{i1} + r_{i2} + \dots + r_{i n_i} = \text{the sum of the ranks for the } i\text{th sample} $$

The Kruskal-Wallis test
Reject H0: the k populations have the same central location, if $K > \chi^2_{\alpha}$ with d.f. = k – 1.

Example
In this example we are measuring an enzyme level in three groups of patients who have received “open heart” surgery. The three groups of patients differ in age:
1. Age 30 - 45
2. Age 46 - 60
3. Age 61+

The data

Age group    Enzyme level
30 - 45      218.3   140.8   268.4   201.0
46 - 60      166.1   124.1   84.3    124.2   70.2   70.7
61+          197.9   120.6   81.1    181.6   118.9

Computation of the Kruskal-Wallis statistic

The data ranked

Age group    Ranks                          $n_i$    $\bar{r}_i$    $U_i$
30 - 45      14   9    15   13              4        12.75          51
46 - 60      10   7    4    8    1    2     6        5.33           32
61+          12   6    3    11   5          5        7.40           37

$$ \sum_{i=1}^{3} \frac{U_i^2}{n_i} = \frac{51^2}{4} + \frac{32^2}{6} + \frac{37^2}{5} = 1094.717 $$

The Kruskal-Wallis statistic, with N = 15:

$$ K = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{U_i^2}{n_i} - 3(N+1) = \frac{12}{15(16)}(1094.717) - 3(16) = 6.736 $$

Since $K > \chi^2_{0.05} = 5.99$ for d.f. = k – 1 = 2, H0 is rejected. There are significant differences in the central enzyme levels between the three age groups.
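The Kruskal-Wallis computation for the enzyme data can be reproduced with a short Python sketch. It assumes the usual rank-sum form of the statistic, K = 12/(N(N+1)) Σ U_i²/n_i − 3(N+1), and it assumes there are no tied observations (true for this data set); the function name `kruskal_wallis` is an illustrative choice.

```python
def kruskal_wallis(groups):
    """Kruskal-Wallis statistic K = 12/(N(N+1)) * sum(U_i^2 / n_i) - 3(N+1).

    groups is a list of samples; all observations are assumed distinct,
    so each one gets a unique rank from 1 to N in the pooled sample.
    """
    pooled = sorted(v for g in groups for v in g)
    rank_of = {v: r for r, v in enumerate(pooled, start=1)}
    big_n = len(pooled)
    # U_i is the rank sum of group i; accumulate U_i^2 / n_i over the groups.
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (big_n * (big_n + 1)) * total - 3 * (big_n + 1)


age_30_45 = [218.3, 140.8, 268.4, 201.0]
age_46_60 = [166.1, 124.1, 84.3, 124.2, 70.2, 70.7]
age_61_up = [197.9, 120.6, 81.1, 181.6, 118.9]

k_stat = kruskal_wallis([age_30_45, age_46_60, age_61_up])
print(round(k_stat, 3))  # 6.736
```

The result is compared with the chi-square critical value 5.99 (d.f. = 2) to decide whether to reject H0.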