Statistics 104 - Richard M. Scriven

The Wisdom Tree
Richard M. Scriven
Statistics 104
Midterm II
Introduction
Most commonly known as Giant Sequoias,
the Sequoiadendron giganteum are the
largest trees on the planet. These giant
redwoods primarily grow in groves in the
western foothills of the Sierra Nevada
Mountains of California and have been
known to grow as tall as 95 meters (311
feet) and as much as 17 meters (56 feet) in
diameter. This paper analyzes sample data
collected from two distinct Giant Sequoia
groves and draws inferences about the
populations of trees in those groves. No
assumptions are made about the distribution
of the tree heights, so non-parametric
methods are used for all inference.
Material & Methods
Since there are a very limited number of
these giant trees left in existence, the data
set is quite small. The collected heights in
meters of several trees from two Giant
Sequoia groves in the western Sierra Nevada
Mountains are as follows:
Grove 1: 62 89 77 92 74
Grove 2: 63 48 82 91 49 75 69
Source: National Park Service
The non-parametric methods used in this
analysis include a permutation test, a
Wilcoxon rank-sum test, a Mann-Whitney
confidence interval for the shift parameter Δ
(with Hodges-Lehmann estimate), Van der
Waerden scores, tests for variability, a
Kolmogorov-Smirnov test, and a large-sample
approximation for hypothesis testing. We
use RStudio statistical software throughout
the analysis.
Analysis
Permutation Test
In a two-sample permutation test, we wish to
test the hypotheses
H0: F1(x) = F2(x) vs. Ha: F1(x) < F2(x)
where F1(x) and F2(x) are the cdf's of the
two populations. The null hypothesis states
that the distributions are the same for the
two groves. The alternative hypothesis
states that the observations in grove 1 tend
to be larger than the observations in grove
2. Testing these hypotheses with the
perm.test() function in RStudio, we obtain
the following results:
2-sample Permutation Test
data: x and y
T = 394, p-value = 0.8838
alternative hypothesis: true mu is less than 0
Here T = 394 is the sum of the five grove 1
heights. The reported p-value of 0.8838 is a
lower-tail probability, since the alternative
shown in the output is "less than 0"; this is
the opposite tail from the Ha stated above.
A p-value this large gives no evidence
against the null hypothesis in the direction
tested, so we fail to reject H0 on the basis
of this output. Because the output tests the
lower tail, it does not by itself support the
claim that the observations in grove 1 tend
to be larger than those in grove 2.
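The permutation distribution behind this test
is small enough to enumerate directly. The
following is a minimal sketch in Python, not
the perm.test() call used in this paper; the
variable names are illustrative assumptions.
It uses the sum of the grove 1 heights as the
statistic and reports both tails, so that the
tail matching each alternative can be read
off.

```python
from itertools import combinations

grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]

pooled = grove1 + grove2
observed = sum(grove1)  # test statistic: sum of the grove 1 heights

# Every way of choosing which 5 of the 12 trees are labelled "grove 1".
sums = [sum(c) for c in combinations(pooled, len(grove1))]
n_perms = len(sums)

# Lower tail: alternative that grove 1 tends to be smaller ("less than 0").
# Upper tail: alternative that grove 1 tends to be larger.
p_lower = sum(s <= observed for s in sums) / n_perms
p_upper = sum(s >= observed for s in sums) / n_perms

print(f"T = {observed}, permutations = {n_perms}")
print(f"lower-tail p = {p_lower:.4f}, upper-tail p = {p_upper:.4f}")
```

The two tails overlap at the observed value,
so they sum to slightly more than 1.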
Wilcoxon Rank-Sum Test
Next we perform a Wilcoxon rank-sum test
on the data. This test combines the two
groups into one and then ranks the
observations. The test is then based on the
rank-sum of grove 1 (or grove 2). Our
results from RStudio are as follows:
Wilcoxon rank sum test
data: x and y
W = 24, p-value = 0.3434
alternative hypothesis: true location shift is not equal to 0
The ranks of the grove 1 observations sum to
39. R reports this rank-sum minus its
smallest possible value, n(n+1)/2 = 15 for
n = 5, so W = 39 - 15 = 24. Of the 792
possible ways to assign 5 of the 12 ranks to
grove 1, 136 give W greater than or equal to
24. Doubling this count for the two-sided
alternative gives
$p\text{-value} = \frac{2 \times 136}{792} = \frac{272}{792} \approx 0.3434$
At the 5% level we fail to reject the null
hypothesis of no location shift.
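The link between the rank-sum, the W
statistic, and the exact p-value can be
checked by brute force. This is a minimal
sketch in Python rather than the R call used
in this paper, with illustrative variable
names, and it assumes no ties (true for these
data).

```python
from itertools import combinations

grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]
n, m = len(grove1), len(grove2)

# Rank the pooled sample; with no ties the rank is the sorted position.
pooled = sorted(grove1 + grove2)
rank_of = {value: i + 1 for i, value in enumerate(pooled)}

rank_sum = sum(rank_of[x] for x in grove1)
offset = n * (n + 1) // 2      # smallest possible rank-sum for grove 1
w_stat = rank_sum - offset     # the W statistic in the R output

# Exact null distribution of W over all C(12, 5) rank assignments.
null_w = [sum(c) - offset for c in combinations(range(1, n + m + 1), n)]
upper_count = sum(w >= w_stat for w in null_w)

# W lies above its null mean n*m/2, so the two-sided p-value
# doubles the upper tail (capped at 1).
p_two_sided = min(1.0, 2 * upper_count / len(null_w))

print(rank_sum, w_stat, upper_count, round(p_two_sided, 4))
```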
Mann-Whitney Confidence Interval for the Shift Parameter Δ
The Mann-Whitney confidence interval gives a
95% confidence interval for the shift
parameter Δ, the difference in location
between the two distributions. Using
the function HL.diff() in the pairwiseCI
package in R, we find that the 95%
confidence interval is (-8, 29). Because the
interval contains 0, a shift of zero is
consistent with the data at the 5% level, in
agreement with the rank-sum test. The
Hodges-Lehmann estimate of Δ is 11, so
heights in grove 1 (x) are estimated to be
about 11 meters larger than heights in
grove 2 (y).
conf.int
[1] -8 29
estimate
difference in location
11
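Both numbers come from the 35 pairwise
differences x - y. The sketch below, in
Python rather than the HL.diff() call used in
this paper, takes the median of those
differences as the Hodges-Lehmann estimate
and picks the interval endpoints from the
ordered differences using the exact null
distribution of the Mann-Whitney count. The
names and the exact-quantile rule are
assumptions made for illustration.

```python
from itertools import combinations
from statistics import median

grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]
n, m = len(grove1), len(grove2)

# All n*m pairwise differences, ordered.
diffs = sorted(x - y for x in grove1 for y in grove2)
hl_estimate = median(diffs)

# Exact null distribution of the Mann-Whitney count U.
offset = n * (n + 1) // 2
null_u = [sum(c) - offset for c in combinations(range(1, n + m + 1), n)]
total = len(null_u)

# Find k such that P(U <= k - 1) <= alpha/2 < P(U <= k).
alpha = 0.05
k = 0
while sum(u <= k for u in null_u) / total <= alpha / 2:
    k += 1

# Endpoints: the k-th smallest and k-th largest differences.
ci_low, ci_high = diffs[k - 1], diffs[len(diffs) - k]

print(hl_estimate, (ci_low, ci_high))
```

Because the null distribution is discrete,
the actual coverage of this interval is at
least 95% rather than exactly 95%.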
Scoring Systems
Next we would like to take a look at the Van
der Waerden scores for our data. This
alternative to the normal scoring system uses
the equation
$V(i) = \Phi^{-1}\left(\frac{i}{N+1}\right)$
where $\Phi^{-1}$ denotes the inverse of the
cdf of the standard normal distribution, i is
the rank of the observation, and N is the
total number of observations. Since there
are no ties in the Giant Sequoia data, our
scores are the standard Van der Waerden
scores. The scores can be found at the end
of this analysis.
Tests for Variability
We are interested in whether there is a
difference in the variability of the two
groups. To test the difference in deviances,
we state the hypotheses
H0: σ1 = σ2 vs. Ha: σ1 < σ2
After writing a short R code, we find that the
lower-tail p-value for this test is 0.0284. It
follows that we reject the null hypothesis at
the 5% level and find evidence to support
the alternative hypothesis, that σ1 < σ2.
Next we perform a Siegel-Tukey test as a
second check on variability. Fortunately,
someone has already written the code for the
Siegel-Tukey test and posted it online. Due
to space constraints, the results of this test
are after the Van der Waerden scores at the
end of this analysis. In that output, group 0
is grove 1 and group 1 is grove 2. The
Siegel-Tukey ranking gives the lowest ranks
to the most extreme observations, so the
group with the smaller mean rank is the more
variable one. Grove 2 has the smaller mean
rank (6.14 versus 7), which agrees in
direction with σ1 < σ2, but the p-value of
0.7453 indicates that this difference is not
statistically significant.
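The Siegel-Tukey ranking alternates between
the two ends of the ordered sample: rank 1
goes to the smallest value, ranks 2 and 3 to
the two largest, ranks 4 and 5 to the next
two smallest, and so on. The Python sketch
below is an illustration of that ranking
under the assumption of no ties; it is not
the online R code used in this paper, and
the function name is hypothetical.

```python
def siegel_tukey_ranks(n):
    """Siegel-Tukey ranks for sorted positions 0..n-1 (no ties)."""
    ranks = [0] * n
    lo, hi = 0, n - 1
    from_low, block = True, 1  # first block takes one value, later blocks two
    rank = 1
    while lo <= hi:
        for _ in range(block):
            if lo > hi:
                break
            if from_low:
                ranks[lo] = rank
                lo += 1
            else:
                ranks[hi] = rank
                hi -= 1
            rank += 1
        from_low = not from_low
        block = 2
    return ranks


grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]

pooled = sorted(grove1 + grove2)
st_ranks = siegel_tukey_ranks(len(pooled))
rank_of = dict(zip(pooled, st_ranks))

# A smaller mean rank means more observations near the extremes.
mean_rank_1 = sum(rank_of[x] for x in grove1) / len(grove1)
mean_rank_2 = sum(rank_of[y] for y in grove2) / len(grove2)

print(st_ranks)
print(mean_rank_1, round(mean_rank_2, 6))
```

A Wilcoxon rank-sum test is then applied to
these ranks in place of the ordinary ranks.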
Kolmogorov-Smirnov Test
The K-S statistic is the maximum difference
between the two sample cdf's, and a p-value
is calculated from that statistic. This test
is useful when the nature of the difference
between the two groups is not known in
advance: the groups may differ in location,
in variability, or in the shapes of their
distributions. From the Giant Sequoia data,
we obtain a K-S statistic of 0.3714 and a
p-value of 0.7374, so there is no evidence
that the two distributions differ.
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.3714, p-value = 0.7374
alternative hypothesis: two-sided
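The D statistic can be reproduced by
comparing the two sample cdf's at every
observed height. This is a minimal Python
sketch with illustrative names, not the R
call used in this paper, and it computes
only the statistic, not the p-value.

```python
grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]


def ecdf(sample, t):
    """Proportion of the sample that is less than or equal to t."""
    return sum(v <= t for v in sample) / len(sample)


# The sample cdf's only change at observed values, so the maximum
# gap must occur at one of them.
d_stat = max(abs(ecdf(grove1, t) - ecdf(grove2, t)) for t in grove1 + grove2)

print(round(d_stat, 4))
```

The maximum occurs at a height of 69 meters,
where 1 of 5 grove 1 trees and 4 of 7 grove 2
trees are at or below that height.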
Large-Sample Approximations
Our large-sample approximations are based
on the Wilcoxon rank-sum statistic. After
some simple calculations in R, we find that
the approximate p-values for groups 1 and
2, respectively, are
> p1
[1] 0.1560902
> p2
[1] 0.1382092
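For reference, the usual normal approximation
to the rank-sum statistic standardizes the
grove 1 rank-sum by its null mean n(N+1)/2
and null variance nm(N+1)/12. The Python
sketch below applies that textbook formula
without a continuity correction. The
calculation behind p1 and p2 is not shown in
this paper, so this sketch is an assumption
about the general approach and is not
expected to reproduce those two values.

```python
from math import sqrt
from statistics import NormalDist

grove1 = [62, 89, 77, 92, 74]
grove2 = [63, 48, 82, 91, 49, 75, 69]
n, m = len(grove1), len(grove2)
N = n + m

# Rank-sum of grove 1 in the pooled sample (no ties in these data).
pooled = sorted(grove1 + grove2)
rank_of = {value: i + 1 for i, value in enumerate(pooled)}
rank_sum = sum(rank_of[x] for x in grove1)

# Null mean and variance of the rank-sum.
mean_w = n * (N + 1) / 2
var_w = n * m * (N + 1) / 12

# Standardize and take the upper tail (grove 1 tends to be larger).
z = (rank_sum - mean_w) / sqrt(var_w)
p_upper = 1 - NormalDist().cdf(z)

print(f"z = {z:.4f}, approximate upper-tail p = {p_upper:.4f}")
```

With only 12 observations the exact
permutation distribution is preferable; the
approximation is shown for comparison.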
Conclusion
In conclusion, we have performed a variety
of two-sample non-parametric tests on the
data and gained some insight into the
populations of Giant Sequoia trees in
California. The Hodges-Lehmann estimate
suggests that heights in grove 1 are about
11 meters larger than those in grove 2, but
this difference is not statistically
significant: the two-sided Wilcoxon rank-sum
p-value is 0.3434 (272 of the 792 possible
assignments are at least as extreme as the
original data), and the 95% confidence
interval for the shift parameter, (-8, 29),
contains 0. We also found the Van der
Waerden scores for our data. The tests for
variability gave mixed results: the test on
deviances supported σ1 < σ2 (p = 0.0284),
while the Siegel-Tukey test found no
significant difference (p = 0.7453). The
Kolmogorov-Smirnov test likewise found no
evidence that the two distributions differ
(p = 0.7374). This is more information than
we had to begin with, and we now have
further knowledge about the populations of
Giant Sequoia trees in California.
Van der Waerden Scores:
Study:
Van der Waerden (Normal Scores) test's
Value : 11
Pvalue: 0.4432633
Degrees of freedom: 11
c(x, y), means of the normal score
       c.x..y. std.err r
48 -1.42554404      NA 1
49 -1.01942762      NA 1
62 -0.73555756      NA 1
63 -0.50152740      NA 1
69 -0.29237490      NA 1
74 -0.09539637      NA 1
75  0.09539637      NA 1
77  0.29237490      NA 1
82  0.50152740      NA 1
89  0.73555756      NA 1
91  1.01942762      NA 1
92  1.42554404      NA 1
t-Student: NaN
Alpha    : 0.05
LSD      : NaN
Means with the same letter are not significantly different
Results of the Siegel-Tukey test:
Median of group 1 = 77
Median of group 2 = 69
Testing median differences...
Wilcoxon rank sum test
data: data$x[data$y == 0] and data$x[data$y == 1]
W = 24, p-value = 0.3434
alternative hypothesis: true location shift is not equal to 0
Performing Siegel-Tukey rank transformation...
   sort.x sort.id unique.ranks
1      48       1            1
2      49       1            4
3      62       0            5
4      63       1            8
5      69       1            9
6      74       0           12
7      75       1           11
8      77       0           10
9      82       1            7
10     89       0            6
11     91       1            3
12     92       0            2
Performing Siegel-Tukey test...
Mean rank of group 0: 7
Mean rank of group 1: 6.142857
Wilcoxon rank sum test with continuity correction
data: ranks0 and ranks1
W = 20, p-value = 0.7453
alternative hypothesis: true location shift is not equal to 0