The Wisdom Tree
Richard M. Scriven
Statistics 104 Midterm II

Introduction

Most commonly known as Giant Sequoias, Sequoiadendron giganteum are the largest trees on the planet. These giant redwoods grow primarily in groves in the western foothills of the Sierra Nevada Mountains of California and have been known to grow as tall as 95 meters (311 feet) and as much as 17 meters (56 feet) in diameter. This paper analyzes sample data collected from two distinct Giant Sequoia groves and seeks inference about the populations of trees in the two groves. No assumptions are made about the distributions of the tree heights, so non-parametric methods are used for inference.

Material & Methods

Since very few of these giant trees remain in existence, the data set is quite small. The collected heights, in meters, of several trees from two Giant Sequoia groves in the western Sierra Nevada Mountains are as follows:

    Grove 1: 62 89 77 92 74
    Grove 2: 63 48 82 91 49 75 69
    Source: National Park Service

The non-parametric methods used in this analysis include a permutation test, a Wilcoxon rank-sum test, a Mann-Whitney confidence interval for the shift parameter Δ (with Hodges-Lehmann estimate), Van der Waerden scores, tests for variability, a Kolmogorov-Smirnov test, and a large-sample approximation for hypothesis testing. We use R-Studio statistical analysis software throughout the analysis.

Analysis

Permutation Test

In a two-sample permutation test, we wish to test the hypotheses

    H0: F1(x) = F2(x) vs. Ha: F1(x) < F2(x)

where F1(x) and F2(x) are the cdf's of the two populations. The null hypothesis states that the distributions are the same for the two groves. The alternative hypothesis states that the observations in grove 1 tend to be larger than the observations in grove 2. We test these hypotheses with the perm.test() function in R-Studio.
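The permutation test described here can be sketched by brute force, since only 792 ways exist to assign 5 of the 12 pooled heights to grove 1. The sketch below is written in Python rather than R and is not the perm.test() implementation; it assumes the test statistic is the sum of the grove 1 heights, and the function name permutation_tail_counts is purely illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Heights in meters, copied from the data listing in Material & Methods.
grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]


def permutation_tail_counts(x, y):
    """Enumerate every assignment of len(x) pooled values to group 1.

    Returns the observed sum of x, the number of assignments, and the
    counts of assignments whose sum is <= and >= the observed sum.
    """
    pooled = x + y
    observed = sum(x)
    sums = [sum(c) for c in combinations(pooled, len(x))]
    n_le = sum(s <= observed for s in sums)
    n_ge = sum(s >= observed for s in sums)
    return observed, len(sums), n_le, n_ge


observed, n_perms, n_le, n_ge = permutation_tail_counts(grove1, grove2)
p_lower = Fraction(n_le, n_perms)  # alternative: grove 1 tends to be smaller
p_upper = Fraction(n_ge, n_perms)  # alternative: grove 1 tends to be larger

print(f"T = {observed}, assignments = {n_perms}")
print(f"lower-tail p = {float(p_lower):.4f}")
print(f"upper-tail p = {float(p_upper):.4f}")
```

Under these assumptions the sketch gives T = 394 over 792 assignments, a lower-tail p-value of 700/792 ≈ 0.8838, and an upper-tail p-value of 101/792 ≈ 0.1275. The upper tail is the one that corresponds to Ha as stated, because a smaller cdf for grove 1 means larger heights.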
The perm.test() output is:

    2-sample Permutation Test
    data: x and y
    T = 394, p-value = 0.8838
    alternative hypothesis: true mu is less than 0

The statistic T = 394 is the sum of the grove 1 heights. The reported p-value is for the lower-tail alternative (true mu is less than 0), which is the opposite direction from the alternative that grove 1 tends to be larger. A p-value of 0.8838 is far above any conventional significance level, so this output does not lead us to reject the null hypothesis. Testing whether the observations in grove 1 tend to be larger than the observations in grove 2 requires the upper-tail alternative.

Wilcoxon Rank-Sum Test

Next we perform a Wilcoxon rank-sum test on the data. This test combines the two groups into one and then ranks the observations. The test is then based on the rank-sum of grove 1 (or grove 2). Our results from R-Studio are as follows:

    Wilcoxon rank sum test
    data: x and y
    W = 24, p-value = 0.3434
    alternative hypothesis: true location shift is not equal to 0

The rank-sum of grove 1 in the original data is 39. The statistic W = 24 reported by R is this rank-sum minus its smallest possible value, n1(n1 + 1)/2 = 15. The p-value is two-sided: 136 of the 792 possible assignments of ranks to grove 1 give a W greater than or equal to 24, and doubling this count gives

$$ p\text{-value} = \frac{2 \times 136}{792} = \frac{272}{792} = 0.3434 $$

With a p-value this large, the rank-sum test does not detect a difference in location between the two groves.

Mann-Whitney Confidence Interval for the Shift Parameter Δ

The Mann-Whitney confidence interval gives a 95% confidence interval for the shift parameter Δ, the difference in location between the two distributions. Using the function HL.diff() in the pairwiseCI package in R, we find that the 95% confidence interval is (-8, 29). Because the interval contains 0, a shift of zero is plausible at this confidence level; the negative lower endpoint says nothing about the number of observations in either group. The Hodges-Lehmann estimate, the median of the 35 pairwise differences between grove 1 and grove 2 heights, is 11.

    conf.int
    [1] -8 29
    estimate
    difference in location
    11

Scoring Systems

Next we would like to take a look at the Van der Waerden scores for our data. In the Van der Waerden scoring formula, $\Phi^{-1}$ denotes the inverse of the cdf of the standard normal distribution and N is the total number of observations. Since there are no ties in the Giant Sequoia data, our scores are the standard Van der Waerden scores. The scores can be found at the end of this analysis.
This alternative to the normal scoring system assigns the i-th smallest observation the score

$$ V(i) = \Phi^{-1}\left(\frac{i}{N+1}\right), \quad i = 1, \ldots, N $$

Tests for Variability

We are interested in whether the variability differs between the two groups. To test the difference in deviances, we state the hypotheses

    H0: σ1 = σ2 vs. Ha: σ1 < σ2

After writing a short R code, we find that the lower-tail p-value for this test is 0.0284. It follows that we reject the null hypothesis at the 5% level and find evidence to support the alternative hypothesis, that σ1 < σ2.

Next we perform a Siegel-Tukey test as a second check on which group has the larger variability. Fortunately, someone has already written the code for the Siegel-Tukey test and posted it online. Due to space constraints, the results of this test appear after the Van der Waerden scores at the end of this analysis. In that output, group 0 is grove 1 and group 1 is grove 2. Group 1 (grove 2) has the lower mean Siegel-Tukey rank (6.14 versus 7), which points to larger variability in grove 2, in agreement with Ha. However, the Siegel-Tukey p-value of 0.7453 means that this difference is not statistically significant.

Kolmogorov-Smirnov Test

The K-S statistic is the maximum difference between the sample cdf's of the two groups, and a p-value is calculated from that statistic. We use this test when the nature of the difference between the two groups is not known: it might cause observations in one treatment to be larger than observations in the other, or it might affect the variability of the observations or the shapes of the distributions. From the Giant Sequoia data, we obtain a K-S statistic of 0.3714 and a p-value of 0.7374, so the test does not detect a difference between the two distributions.

    Two-sample Kolmogorov-Smirnov test
    data: x and y
    D = 0.3714, p-value = 0.7374
    alternative hypothesis: two-sided

Large-Sample Approximations

Our p-values for the large-sample approximations are based on a Wilcoxon rank-sum test; they are approximations rather than exact permutation p-values. After some simple calculations in R, we find that our large-sample p-values for groups 1 and 2, respectively, are

    > p1
    [1] 0.1560902
    > p2
    [1] 0.1382092

Conclusion

In conclusion, we have found that we can perform a variety of tests on the two-sample non-parametric data, and can gain much insight into the population of Giant Sequoia trees in California.
The permutation test as run (lower-tail alternative, p-value = 0.8838) and the two-sided Wilcoxon rank-sum test (p-value = 0.3434, that is, 272 of the 792 possible rank assignments, counted across both tails, are at least as extreme as the original data) did not give significant evidence that the observations in group 1 tend to be larger than those in group 2. The 95% confidence interval for the shift parameter is (-8, 29), which contains 0. We also found the Van der Waerden scores for our data. The test of deviances gave evidence that group 1 has smaller variability than group 2 (p-value = 0.0284), although the Siegel-Tukey test did not find a significant difference in variability (p-value = 0.7453). This is much more information than we had to begin with, and we now have further knowledge about the populations of Giant Sequoia trees in California.

Van der Waerden Scores:

    Study:
    Van der Waerden (Normal Scores) test's
    Value : 11
    Pvalue: 0.4432633
    Degrees of freedom: 11

    c(x, y), means of the normal score

            c.x..y.  std.err  r
    48  -1.42554404       NA  1
    49  -1.01942762       NA  1
    62  -0.73555756       NA  1
    63  -0.50152740       NA  1
    69  -0.29237490       NA  1
    74  -0.09539637       NA  1
    75   0.09539637       NA  1
    77   0.29237490       NA  1
    82   0.50152740       NA  1
    89   0.73555756       NA  1
    91   1.01942762       NA  1
    92   1.42554404       NA  1

    t-Student: NaN
    Alpha    : 0.05
    LSD      : NaN

    Means with the same letter are not significantly different

Results of the Siegel-Tukey test:

    Median of group 1 = 77
    Median of group 2 = 69

    Testing median differences...

    Wilcoxon rank sum test
    data: data$x[data$y == 0] and data$x[data$y == 1]
    W = 24, p-value = 0.3434
    alternative hypothesis: true location shift is not equal to 0

    Performing Siegel-Tukey rank transformation...

        sort.x  sort.id  unique.ranks
    1       48        1             1
    2       49        1             4
    3       62        0             5
    4       63        1             8
    5       69        1             9
    6       74        0            12
    7       75        1            11
    8       77        0            10
    9       82        1             7
    10      89        0             6
    11      91        1             3
    12      92        0             2

    Performing Siegel-Tukey test...

    Mean rank of group 0: 7
    Mean rank of group 1: 6.142857

    Wilcoxon rank sum test with continuity correction
    data: ranks0 and ranks1
    W = 20, p-value = 0.7453
    alternative hypothesis: true location shift is not equal to 0
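The rank transformation and test statistic in the Siegel-Tukey output above can be reproduced with a short Python sketch. It is not the script that was downloaded for this analysis. It assumes the usual Siegel-Tukey alternation (rank 1 to the smallest value, ranks 2 and 3 to the two largest, ranks 4 and 5 to the next two smallest, and so on) and a normal approximation with continuity correction and no tie adjustment; the function name siegel_tukey_ranks is illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Group labels follow the output above: 0 = grove 1, 1 = grove 2.
grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]


def siegel_tukey_ranks(sorted_values):
    """Assign ranks alternately from the low and high ends of sorted data."""
    n = len(sorted_values)
    ranks = [0] * n
    lo, hi = 0, n - 1
    rank = 1
    take_from_low = True
    block = 1  # the first block holds one value, later blocks hold two
    while lo <= hi:
        for _ in range(block):
            if lo > hi:
                break
            if take_from_low:
                ranks[lo] = rank
                lo += 1
            else:
                ranks[hi] = rank
                hi -= 1
            rank += 1
        take_from_low = not take_from_low
        block = 2
    return ranks


pooled = sorted([(v, 0) for v in grove1] + [(v, 1) for v in grove2])
st_ranks = siegel_tukey_ranks([v for v, _ in pooled])
ranks0 = [r for (_, g), r in zip(pooled, st_ranks) if g == 0]
ranks1 = [r for (_, g), r in zip(pooled, st_ranks) if g == 1]

n0, n1 = len(ranks0), len(ranks1)
w = sum(ranks0) - n0 * (n0 + 1) / 2      # Mann-Whitney form of the rank sum
mu = n0 * n1 / 2                         # null mean of w
sigma = sqrt(n0 * n1 * (n0 + n1 + 1) / 12)  # null standard deviation, no ties
correction = 0.5 if w > mu else (-0.5 if w < mu else 0.0)
z = (w - mu - correction) / sigma
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print("Siegel-Tukey ranks:", st_ranks)
print("mean ranks:", sum(ranks0) / n0, sum(ranks1) / n1)
print(f"W = {w:g}, two-sided p = {p_two_sided:.4f}")
```

Low Siegel-Tukey ranks go to the extremes of the pooled sample, so the group with the lower mean rank is the more spread-out one. Under the stated assumptions the sketch gives the same ranks, mean ranks, and W = 20 as the output above, with a two-sided p-value of about 0.745.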