Math 3080 § 1. Treibergs Hydroxylation Example: One Sample Wilcoxon Signed-Rank Test Name: Example April 12, 2014 Ths program is to do a Wilcoxon Signed-Rank test for location on a single variable. The data set comes from Navidi, Statistics for Engineers and Scientists, 2nd ed., MacGraw Hill, New York, problem 6.9.3 which quotes the article by S. Nandi, et. al, “Reaction Modeling and Optimization Using Neural Networks and Genetic Algorithms: Case Study Involving TS-1Catalyzed Hydroxylation of Benzene” in Industrial and Engineering Chemistry Research, 2002. The study presents benzene conversions (in mole percent) for 24 benzenehydroxylation reactions. The question is whether the mean conversion rate differs from µ0 = 55 mole percent. The Wilcoxon Signed-Rank test assumes that the data comes from a continuous distribution that is left-right symmetric about µ which is both the mean and median of the population. The null and alternative hypotheses in this problem are H0 : µ = µ0 ; Ha : µ 6= µ0 . The absolute deviations from µ0 are sorted from lowest to highest and |X(i) − µ0 | and assigned ranks from 1 to n, the number of observations. The statistic V is the sum of ranks corresponding to the positive Xi − µ0 . By the symmetry assumption, the sign of Xi − µ0 is positive or negative with equal probability under the null hypothesis, so that sum will be near its expected value n(n + 1) . If this number is high or low compared to µV we reject the null µV = E(V ) = 4 hypothesis. In this problem, V = 71 compared to the expected 150. The p-value, computed by c from the exact distribution of V (assuming no ties) ended up being .02293, thus we reject R the null hypothesis: thus there is significant evidence that µ is not µ0 . When n is large, (n > 20) then the statistic is distributed approximately normally and we may n(n + 1)(2n + 1) use normal distribution to compute the p-value. Since σV2 = , the standardized 24 variable is approximately normal V − n(n+1) 4 Z=q n(n+1)(2n+1) 24 c will compute this number if Hence the p-value is 2Φ(−|Z|) if Z is the observed value. R the exact calculation is turned off and the continuity correction is not used. If the continuity correction is included (in case of no ties as in this data) then the corrected formula (assuming V < µV ) is used V + 0.5 − n(n+1) 4 Z= q n(n+1)(2n+1) 24 c returns the normal approxiIn case of ties, the exact calculation cannot be performed, and R mation to the p-value. This data turned out not to have any ties. We redid an analysis on the same data xi except that, arbitrarily, x10 and x11 were replaced by x12 and x15 was replaced by x16 . Then the |Xi −µ0 | have ties and so the tied observations are all given their average rank. In this case, the variance in the approximation must be corrected. The correction in the normalization is V + .5 − Z=q n(n+1) 4 n(n+1)(2n+1) 24 − (1) T 48 where T = X (τi − 1)τi (τi + 1) 1 where τi is the frequency of the ith value of |Xi − µ0 | or of the rank. Thus if the ith datum is not tied, τi = 1 and contributes nothing to T . If Xi is tied then τi is the number of occurences of this value of |Xi − µ0 |. In the example above, there are two sets of ties: |x10 − µ0 | = |x11 − µ0 | = |x12 − µ0 | and |x15 − µ0 | = |x16 − µ0 | and no others. Thus one τi = 3 and another τj = 2. The nonzero terms of T give T = (3 − 1)3(3 + 1) + (2 − 1)2(2 + 1) = 30. Note that τi is added once for any tied value, not three times for the three i’s that are tied. The points removed changed the data enough so that the text does not reject H0 . c is not able to compute the exact p-value. It uses both Note that when there are ties, R the continuity correction and the variance correction to compute an approzimate p-value using formula (1). Data Set Used in this Analysis : # Math 3080-1 Hydroxylation Data April 12, 2014 # Treibergs # # From Navidi, "Statistics for Engineers and Scientists," 2nd ed., MacGraw # Hill, New York, problem 6.9.3 who quotes the article by S. Nandi, et. al, # "Reaction Modeling and Optimization Using Neural Networks and Genetic # Algorithms: Case Study Involving TS-1-Catalyzed Hydroxylation of # Benzene" in Industrial and Engineering Chemistry Research, 2002. # The study presents benzene conversions (in mole percent) for 24 # benzenehydroxylation reactions. Conversion 52.3 41.1 28.8 67.8 78.6 72.3 9.1 19 30.3 41 63 80.8 26.8 37.3 38.1 33.6 14.3 30.1 33.4 36.2 34.6 40 81.2 59.4 2 R Session: R version 2.13.1 (2011-07-08) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: i386-apple-darwin9.8.0/i386 (32-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type ’license()’ or ’licence()’ for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type ’contributors()’ for more information and ’citation()’ on how to cite R or R packages in publications. Type ’demo()’ for some demos, ’help()’ for on-line help, or ’help.start()’ for an HTML browser interface to help. Type ’q()’ to quit R. [R.app GUI 1.41 (5874) i386-apple-darwin9.8.0] [History restored from /Users/andrejstreibergs/.Rapp.history] > tt=read.table("M3082DataHydroxylation.txt",header=T) > tt Conversion 1 52.3 2 41.1 3 28.8 4 67.8 5 78.6 6 72.3 7 9.1 8 19.0 9 30.3 10 41.0 11 63.0 12 80.8 13 26.8 14 37.3 15 38.1 16 33.6 17 14.3 18 30.1 19 33.4 20 36.2 21 34.6 22 40.0 23 81.2 24 59.4 > 3 > > > > > > attach(tt) ################## HISTOGRAM TO CHECK ASSUMPTIONS ############## hist(Conversion) ################## HISTOGRAM LOOKS SYMMETRIC ################### ######## RUN CANNED WILCOXON SIGN-RANK TEST ################## > wilcox.test(Conversion,mu=55,alternative="two.sided") Wilcoxon signed rank test data: Conversion V = 71, p-value = 0.02293 alternative hypothesis: true location is not equal to 55 > ############# DO SIGNED-RANK TEST BY HAND #################### > mu=55 > n=length(Conversion); n [1] 24 > ############# RANK FUNCTION ASSIGNS RANKS "IN PLACE" ######### > ra=rank(abs(Conversion-mu));ra [1] 1 5 19 4 15 9 24 22 16 6 3 18 21 10 8 13 23 17 14 11 12 > ############# ADD UP RANKS OF POSITIVE X-mu0’S > SP=sum(ra[Conversion>mu]);SP [1] 71 ############### > ############# UNCORRECTED NORMALIZED STATISTIC ############### > z=(SP-n*(n+1)/4)/sqrt(n*(n+1)*(2*n+1)/24); z [1] -2.257143 > ############## P-VALUE ######################################## > 2*pnorm(z) [1] 0.02399915 > ### SAME AS CANNED WITH NO EXACT P-VALUE, NO CONT. CORR. ####### > wilcox.test(Conversion,mu=55,exact = FALSE, correct = FALSE) Wilcoxon signed rank test data: Conversion V = 71, p-value = 0.024 alternative hypothesis: true location is not equal to 55 4 7 20 2 > ########## WITH CONTINUITY CORRECTION ######################### > zc=(SP+.5-n*(n+1)/4)/sqrt(n*(n+1)*(2*n+1)/24); zc [1] -2.242857 > 2*pnorm(zc) [1] 0.02490603 > wilcox.test(Conversion,mu=55,exact = FALSE, correct = TRUE) Wilcoxon signed rank test with continuity correction data: Conversion V = 71, p-value = 0.02491 alternative hypothesis: true location is not equal to 55 > ########## DATA WITH TIES. CONSTRUCT TIED DATA > x=Conversion; x[10]=x[11]=x[12]; x[15]=x[16] > x [1] 52.3 41.1 28.8 67.8 78.6 72.3 9.1 19.0 30.3 [16] 33.6 14.3 30.1 33.4 36.2 34.6 40.0 81.2 59.4 > ra2 = rank(abs(x-mu)) > ra2 [1] 1.0 4.0 19.0 3.0 13.0 6.0 24.0 22.0 14.0 [16] 10.5 23.0 15.0 12.0 8.0 9.0 5.0 20.0 2.0 > sort(ra2) [1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 [16] 17.0 17.0 17.0 19.0 20.0 21.0 22.0 23.0 24.0 SET. ########## 80.8 80.8 80.8 26.8 37.3 33.6 17.0 17.0 17.0 21.0 10.5 10.5 12.0 13.0 14.0 15.0 > ############## TABULATE FREQUENCIES OF |X - mu0| ################ > t2=table(ra2); t2 ra2 1 2 3 4 5 6 7 8 9 10.5 12 13 14 15 1 1 1 1 1 1 1 1 1 2 1 1 1 1 19 20 21 22 23 24 1 1 1 1 1 1 > ### NOTE THAT |X - mu0| = 80.8 AND |X - mu0| = 33.6 ARE TIED ### > #### CORRECTION FOR VARIANCE ##################################### > fixtie=function(t){(t-1)*t*(t+1)} > T=sum(fixtie(t2)); T [1] 30 > ############ CANNED TEST ON X ################################### > wilcox.test(x,mu=55,exact = FALSE, correct = FALSE) Wilcoxon signed rank test data: x V = 95, p-value = 0.116 alternative hypothesis: true location is not equal to 55 5 7.0 10.5 17 3 > ########### TEST X BY HAND ##################################### > V2 = sum(ra2[x>mu]); V2 [1] 95 > z2=(V2-n*(n+1)/4)/sqrt(n*(n+1)*(2*n+1)/24); z2 [1] -1.571429 > 2*pnorm(z2) [1] 0.1160831 > ########### CANNED TEST WITH CONTINUITY CORRECTION ############## > wilcox.test(x,mu=55,exact = FALSE, correct = TRUE) Wilcoxon signed rank test with continuity correction data: x V = 95, p-value = 0.1193 alternative hypothesis: true location is not equal to 55 > #### TEST BY HAND WITH CONTINUITY AND VARIANCE CORRECTION ####### > z3 = (V2+.5-n*(n+1)/4)/sqrt(n*(n+1)*(2*n+1)/24-T/48); z3 [1] -1.55754 > pvalue3 = 2*pnorm(z3); pvalue3 [1] 0.1193423 > ##### IN TEST DATA WITH TIES, EXACT P-VALUE FAILS ############## > wilcox.test(x,mu=55,exact = TRUE) Wilcoxon signed rank test with continuity correction data: x V = 95, p-value = 0.1193 alternative hypothesis: true location is not equal to 55 Warning message: In wilcox.test.default(x, mu = 55, exact = TRUE) : cannot compute exact p-value with ties > 6 4 2 0 Frequency 6 8 Histogram of Conversion 0 20 40 Conversion 7 60 80