Project 1 - gozips.uakron.edu

Christopher Knapp
University of Akron, Fall 2011
Statistical Data Management
Project #1
Problem Statement
Problem #1:
Please research all the non-parametric techniques discussed in lecture. Please report the source,
formula, characteristics, and an example of usage for each.
Problem #2:
Please use the hospital data that was included in assignment #5. Please make sure you use the data
set after you have fixed all the issues. Recall this was a set of observations on 27 people from two
separate hospitals.
In SAS:
Ho: µLead,M = µLead,F
Ho: µNeutro = 22.1
Ho: µLead = µHaemo
In SPSS:
Ho: µLympho, Hosp A, F = µLympho, Hosp B, F
Ho: µLead = 16.2
Ho: µLympho = µNeutro
For each hypothesis test perform the appropriate t-test and the appropriate nonparametric test.
Discuss which test should be used and justify your choice as well as possible. Discuss
everything possible with regards to hypothesis testing (set-up, assumptions, …). When the
parametric approach is appropriate, complete the corresponding confidence interval technique.
Please type up your report and include all necessary output (SAS and SPSS). Be sure to use the
appropriate labels and formats. Email your SAS code.
For the students enrolled in 480, complete only the hypotheses that are bolded.
Contents
Review of Non-parametric Techniques
    Single Sample: Sign Test
        Description
        Example
    Single Sample: Wilcoxon Signed-rank Test
        Description
        Example
    Two Independent Samples: Wilcoxon Rank Sum Test
        Description
        Example
    Matched Pairs Tests
        Description
        Example
Hypothesis Tests in SAS
    H0: µLead,M = µLead,F
    H0: µNeutro = 22.1
    H0: µLead = µHaemo
Hypothesis Tests in SPSS
    H0: µLympho,Hosp A,F = µLympho,Hosp B,F
    H0: µLead = 16.2
    H0: µLympho = µNeutro
References and Appendices
    References
    Appendix A: SAS Code
    Appendix B: SPSS Code
Section 1
Review of Non-parametric Techniques
Project 1
Section 1: Review of Non-parametric Techniques
Single Sample: Sign Test
Goals and Assumptions:
This test checks whether the population median equals a given value; that is, H0: η = η0.
If the population can be assumed symmetric (a weaker assumption than normality), then η = µ,
so the hypothesis H0: µ = η0 is equivalent to H0: η = η0. The parametric counterpart is the
one-sample t-test, which can be used if the population is normally distributed or if the sample
size is larger than 30. The sign test is less powerful than the t-test, so it should be used only
when these conditions are not met.
The alternative hypotheses: Ha: η ≠ η0, Ha: η > η0, or Ha: η < η0.
Description:
The decision of the sign test is based on the binomial distribution. In a population with
median η0, a randomly observed value is larger than η0 with probability 50%. If n random
observations are made (a sample of size n), then the number of values larger than η0
(random variable X) follows a binomial distribution with parameters n and 50%. Let x be the
observed number of values larger than η0, and define the critical values as follows (depending on Ha):
• If Ha: η > η0, then k is the smallest integer between 0 and n such that P[X ≥ k] < α. The
null hypothesis is rejected if x is larger than or equal to k.
• If Ha: η < η0, then k is the largest integer between 0 and n such that P[X ≤ k] < α. The
null hypothesis is rejected if x is smaller than or equal to k.
• If Ha: η ≠ η0, then k1 is the largest integer between 0 and n such that P[X ≤ k1] < α/2, and
k2 is the smallest integer between 0 and n such that P[X ≥ k2] < α/2. The null hypothesis is
rejected if x is smaller than or equal to k1 or larger than or equal to k2.
Note: an observation equal to η0 is discarded. This applies even when continuous data have been
rounded, so that the probability of observing exactly η0 is nonzero. A discarded observation is
counted in neither x nor n.
Example:
(Example from Mathematical Statistics, page 475). The following are measurements of the
breaking strength of a certain kind of 2-inch cotton ribbon in pounds:
163, 165, 160, 189, 161, 171, 158, 151, 169, 162, 163, 139, 172, 165, 148, 166, 172, 163, 187, 173
The goal of this example is to test the null hypothesis H0: η = 160 against the alternative
Ha: η > 160 at a significance level of α = .05. The sample mean and median are 164.85 and 164,
both higher than η0, so this alternative hypothesis makes sense.
One of the 20 values equals η0, so n = 19. Furthermore, x = 15 values are larger than η0.
For b(19, .5), the value k = 14 satisfies the definition, since P[X ≥ 14] ≈ .032 < .05 while
P[X ≥ 13] ≈ .084. Because x ≥ k, the null hypothesis is rejected, and at a significance level of
α = .05 we conclude that the median breaking strength exceeds 160 pounds.
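The arithmetic in this example can be reproduced with a short script. The following is a minimal sketch in Python (not part of the original SAS/SPSS analysis); the function name `sign_test_upper` is hypothetical and chosen only for illustration, and it covers only the one-sided alternative Ha: η > η0.

```python
from math import comb

def sign_test_upper(data, eta0, alpha=0.05):
    """One-sided sign test of H0: median = eta0 against Ha: median > eta0."""
    kept = [v for v in data if v != eta0]   # observations equal to eta0 are discarded
    n = len(kept)
    x = sum(v > eta0 for v in kept)         # number of values larger than eta0

    def upper_tail(k):
        # P[X >= k] for X ~ Binomial(n, 1/2)
        return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

    # smallest k with P[X >= k] < alpha (None if no such k exists)
    k = next((c for c in range(n + 1) if upper_tail(c) < alpha), None)
    return n, x, k, upper_tail(x)

ribbon = [163, 165, 160, 189, 161, 171, 158, 151, 169, 162,
          163, 139, 172, 165, 148, 166, 172, 163, 187, 173]
n, x, k, p_value = sign_test_upper(ribbon, 160)
print(n, x, k, round(p_value, 4))   # 19 15 14 0.0096
```

The null hypothesis is rejected when x ≥ k, which matches the decision rule stated in the description.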
Single Sample: Wilcoxon Signed-rank Test
Goals and Assumptions:
The sign test considers only the signs of the differences between the observations and η0.
Ignoring the magnitudes of the differences wastes information, so the Wilcoxon Signed-rank
Test was created to take them into account. The null hypothesis is the same (H0: η = η0, where
η represents the population median). This test requires a symmetric population, in which case
η = µ and the null hypothesis is equivalent to H0: µ = η0.
The alternative hypotheses: Ha: µ ≠ η0, Ha: µ > η0, or Ha: µ < η0.
Description:
The algorithm is simple. Consider the sequence S* = (xi - η0) of differences for the observations
xi of the sample. Discard any values in the sequence that equal 0, so that n values remain.
Rearrange S* in ascending order of absolute value and call the resulting sequence
S = (a1, ..., an). Assign the rank i to the term ai. For each maximal run of terms with equal
absolute values, replace the rank of each term in the run with the average of the ranks within
the run. Let T+ be the sum of the ranks assigned to the positive terms. For small values of n,
the sampling distribution of T+ is taken from a special table (which is easily derived with a
computer by considering all possible assignments of positive and negative signs); when n is
larger than 14, T+ is approximately normal with mean n(n+1)/4 and variance n(n+1)(2n+1)/24.
Note that the distribution of T+ follows from the formula T+ = 1·I1 + 2·I2 + ∙∙∙ + n·In,
where each Ii is the indicator of ai being positive. Under H0 the Ii are independent
Bernoulli(1/2) variables, so T+ is a linear function of Bernoulli variables.
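The remark that the small-sample table can be derived by computer can be made concrete. The sketch below, in Python, counts the sign patterns that produce each value of T+; it assumes no ties and no zero differences, and the function name `signed_rank_null_distribution` is hypothetical. It also checks the mean and variance formulas quoted above for n = 5.

```python
def signed_rank_null_distribution(n):
    """Exact null distribution of T+ for n untied, nonzero differences.

    Under H0 each rank 1..n belongs to a positive difference with probability
    1/2, independently, so all 2**n sign patterns are equally likely.
    Returns a list prob where prob[t] = P[T+ = t].
    """
    max_t = n * (n + 1) // 2
    counts = [0] * (max_t + 1)
    counts[0] = 1                                # the pattern with no positive ranks
    for rank in range(1, n + 1):
        # iterate downward so that each rank is used at most once per pattern
        for t in range(max_t, rank - 1, -1):
            counts[t] += counts[t - rank]
    return [c / 2 ** n for c in counts]

prob = signed_rank_null_distribution(5)
mean = sum(t * p for t, p in enumerate(prob))
var = sum((t - mean) ** 2 * p for t, p in enumerate(prob))
print(mean, var)   # 7.5 13.75, i.e. n(n+1)/4 and n(n+1)(2n+1)/24 for n = 5
```

Summing `prob[t]` over the upper tail gives the exact p-value that a printed table would supply.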
Example:
Consider the following dataset and test the hypothesis H0: η = 0 and α = .05 with Ha: η > 0:
9.1, 7.3, 13.1, 0, -2.2, 4.0, 4.9, -3.6, 12.2, -1.3, 12.7, 8.3, -2.5, 1.0, 8.1, 0.1, -1.5, 0
After the two zeros are discarded, n = 16. Because n is larger than 14, normality can be assumed
for T+. The Excel spreadsheet to the right describes the decision to reject the null hypothesis
and to conclude, for α = .05, that η > 0. Also notice that no two nonzero observations have the
same absolute value, so the ranks remain simple. If, for example, the smallest nonzero
observation were 1 instead of 0.1, it would tie with the observation 1.0, and both would
receive the rank 1.5.
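The steps of this example can also be traced in code. The following Python sketch (an illustration, not the spreadsheet used in the report) ranks the absolute differences with averaged ranks for ties and applies the normal approximation; the function name `signed_rank_upper` is hypothetical, and the variance is used without a tie correction, which is an assumption that is harmless here because the data contain no ties.

```python
from math import erf, sqrt

def signed_rank_upper(data, eta0=0.0):
    """Wilcoxon signed-rank test of H0: eta = eta0 against Ha: eta > eta0."""
    diffs = [v - eta0 for v in data if v != eta0]       # discard zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over a maximal run of equal absolute values
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1                  # mean of ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        i = j + 1
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (t_plus - mean) / sqrt(var)
    p_upper = 0.5 * (1 - erf(z / sqrt(2)))              # P[Z >= z], standard normal
    return n, t_plus, z, p_upper

data = [9.1, 7.3, 13.1, 0, -2.2, 4.0, 4.9, -3.6, 12.2,
        -1.3, 12.7, 8.3, -2.5, 1.0, 8.1, 0.1, -1.5, 0]
n, t_plus, z, p_upper = signed_rank_upper(data)
print(n, t_plus, round(z, 2))   # 16 111.0 2.22
```

The one-sided p-value `p_upper` falls below .05, which agrees with the decision to reject the null hypothesis.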
Two Independent Samples: Wilcoxon Rank Sum Test
Goals and Assumptions:
This test (also called the Mann-Whitney Test) checks whether two independent samples come
from identically distributed populations. The null hypothesis is H0: the two populations have the
same distribution (hence µ1=µ2). Its parametric counterpart is the t-test for 2 independent samples
(Pooled for equal variances and Satterthwaite for unequal variances), which is sensitive to
departures from normality. Unlike the t-test, the rank sum test does not require population
normality; equal sample sizes are not required either.
The alternative hypothesis: Ha: One population distribution tends to be higher than the other.
Note: this test applies only to two samples. Comparing more than two samples requires a different
procedure (such as ANOVA or its rank-based analogue, the Kruskal-Wallis test).
Method:
The idea is to first combine the data from sample A and sample B into one set C = (ci).
Sort C in ascending order and assign the rank i to the term ci. For each maximal run of terms
with equal values, replace the rank of each term in the run with the average of the ranks within
the run. Let W1 be the sum of the ranks of the values coming from sample A, and let n1 and n2
be the numbers of observations in samples A and B, respectively. Define U1 = W1 – n1(n1+1)/2.
U1 takes values from 0 to n1n2, with a sampling distribution that is symmetric about n1n2/2
under H0. When n1 and n2 are both larger than 8, it is reasonable to assume that the distribution
of U1 is approximately normal with E[U1]= n1n2/2 and VAR[U1]= n1n2(n1 + n2 + 1)/12. For smaller
values of ni, special tables are used for U1.
Example:
Consider the following two datasets and test the hypothesis H0: Population distributions are the
same (µ1=µ2) for α = .05 with Ha: µ1<µ2:
Sample 1: 14.9, 11.3, 13.2, 16.6, 17.0, 14.1, 15.4, 13.0, 16.9
Sample 2: 15.2, 19.8, 14.7, 18.3, 16.2, 21.2, 18.9, 12.2, 15.3, 19.4
The Excel worksheet below demonstrates the steps. The conclusion, at α = .05, is that the null
hypothesis is rejected: the values in the second population tend to be larger than those in the
first (µ1<µ2).
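The ranking and the normal approximation for this example can be checked with the following Python sketch. It is an illustration only (the report's own computation was done in an Excel worksheet); the function name `rank_sum_test` is hypothetical, and no tie correction is applied to the variance, which is an assumption that does not matter here because the combined data contain no ties.

```python
from math import erf, sqrt

def rank_sum_test(sample_a, sample_b):
    """Wilcoxon rank sum statistics for sample A, with a lower-tail normal p-value."""
    # label 0 marks sample A, label 1 marks sample B
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n1, n2 = len(sample_a), len(sample_b)
    w1 = 0.0
    i = 0
    while i < len(combined):
        j = i
        # extend j over a maximal run of equal values
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1                  # mean of ranks i+1 .. j+1
        w1 += average_rank * sum(1 for m in range(i, j + 1) if combined[m][1] == 0)
        i = j + 1
    u1 = w1 - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    z = (u1 - mean) / sqrt(var)
    p_lower = 0.5 * (1 + erf(z / sqrt(2)))              # P[Z <= z], for Ha: mu1 < mu2
    return w1, u1, z, p_lower

sample1 = [14.9, 11.3, 13.2, 16.6, 17.0, 14.1, 15.4, 13.0, 16.9]
sample2 = [15.2, 19.8, 14.7, 18.3, 16.2, 21.2, 18.9, 12.2, 15.3, 19.4]
w1, u1, z, p_lower = rank_sum_test(sample1, sample2)
print(w1, u1, round(z, 2))   # 69.0 24.0 -1.71
```

The lower-tail p-value `p_lower` is below .05, consistent with rejecting H0 in favor of Ha: µ1<µ2.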
Matched Pairs Tests
Description:
When a sample involves two dependent variables, V1 and V2 (for example, pretest and posttest
scores for the same students), the matched pairs tests can be applied. The null hypothesis
is H0: µ1=µ2 (where µi represents the population mean of variable i). Normality of the
differences V1 - V2 is not required, but symmetry is.
The method applies the tests from the previous sections: for each observation compute
V1 - V2 and apply either of the single sample tests to these differences, with null hypothesis
H0: µ[V1 - V2] = 0. Notice that the alternative hypothesis Ha: µ[V1 - V2] > 0 is equivalent to
Ha: µ1>µ2; Ha: µ[V1 - V2] < 0 is equivalent to Ha: µ1<µ2; and Ha: µ[V1 - V2] ≠ 0 is equivalent to
Ha: µ1 ≠ µ2.
Example:
Consider the following data of pretest scores (out of 20), posttest scores (out of 20), and the
difference between the two. We wish to test, at α=.05, the hypothesis H0: µpre=µpost against the
alternative hypothesis Ha: µpre > µpost. This alternative makes sense because the average
pretest score is almost 4 points higher than the average posttest score.
Notice that the differences are identical to the data in the Wilcoxon Signed-rank Test example.
Applying the same test to these differences gives the same result: reject the null hypothesis
and conclude the alternative. That is, at a significance level of .05, the pretest mean is higher
than the posttest mean.
Section 2
Hypothesis Tests in SAS
H0: µLead,M = µLead,F
Hypothesis:
At a significance level of α=.05:
H0: µLead,M = µLead,F
Ha: µLead,M ≠ µLead,F
Parametric Test:
If the assumptions for the parametric approach are met, the high p-value for the F test implies it
is reasonable to assume the variances are equal, so the Pooled test is the appropriate one. Its
p-value is .7507, so the null hypothesis cannot be rejected. That is, the data are consistent
with µLead,M = µLead,F.
Non-parametric Test:
If the assumptions for the parametric approach are not met, the Wilcoxon Rank Sum Test can be
applied here (under some assumptions), since we have two independent samples: lead measurements
for males and lead measurements for females. The SAS output is displayed to the right. The
p-value for the two-sided test is .8845, so we cannot reject the null hypothesis; the data are
consistent with µLead,M = µLead,F.
Choosing the Appropriate Test:
The tests for normality for the males all result in small p-values, so there is substantial
evidence that normality does not hold. With a mean of 20.7, median of 19, and standard deviation
of 5.5, the Pearson check passes because the difference of 1.7 is less than 2/3 of a standard
deviation, but this is not a very good indicator of normality. In the picture to the right, the
most extreme values lie on the right side of the graph. The most extreme value is 2.8 standard
deviations away from the mean, which is high considering the small sample size. Furthermore, the
QQ plot suggests a right-skewed distribution because it is concave up rather than linear. Lastly,
skewness and kurtosis are much larger than 0, which is inconsistent with a normal distribution.
There is enough evidence to conclude that this sample did not come from a normal population, so a
nonparametric approach should be applied.
To apply the Wilcoxon Rank Sum Test, the assumption of equal shape must be met. From the
side-by-side histogram to the right, this appears to be a reasonable assumption: both shapes are
skewed to the right. With such a small sample, however, it is hard to determine how accurate this
assumption truly is. Because small deviations from this assumption do not greatly affect the
validity of the test, the nonparametric approach will be utilized.
Conclusion:
As described above, the nonparametric test is appropriate. The p-value for the two-sided test
is .8845, so we cannot reject the null hypothesis; there is not enough evidence to conclude that
µLead,M differs from µLead,F.
H0: µNeutro = 22.1
Hypothesis:
At a significance level of α=.05:
H0: µNeutro = 22.1
and
Ha: µNeutro ≠ 22.1
Parametric Test:
If normality holds for the population, then Student's t-test should be used. From the SAS
output below, the p-value is .0756 for the two-tailed test. Therefore, at a significance level of
α=.05, the null hypothesis cannot be rejected. That is, the data are consistent with µNeutro = 22.1.
Non-parametric Test:
Both the sign test and the signed rank test produce the same conclusion. At a significance level
of .05, there is not enough evidence to reject the null hypothesis (p-values of .23 and .111 for
the sign and signed rank tests, respectively). That is, the data are consistent with
µNeutro = 22.1.
Choosing the Appropriate Test:
The tests for normality all result in p-values larger than .05. Also, the difference between the
mean and median is .12, which is smaller than the Pearson restriction. Furthermore, from the
histogram, the data have no extreme values and appear to follow the empirical rule. Lastly, the
normality plot appears fairly linear. It is therefore likely that this data came from a
population that is close to being normally distributed, so the t-test is the appropriate test.
Conclusion:
Because the t-test is appropriate, we use its conclusion: at a significance level of α=.05, the
null hypothesis cannot be rejected. That is, the data are consistent with µNeutro = 22.1.
H0: µLead = µHaemo
Hypothesis:
At a significance level of α=.05:
H0: µLead = µHaemo
and
Ha: µLead ≠ µHaemo
Parametric Test:
Notice that this is a matched pairs test, so if normality holds for the population of differences
lead - haemo, then Student's t-test should be used. From the SAS output to the right, the p-value
is .0001 for the two-tailed test. Therefore the null hypothesis can be rejected, even at
significance levels far smaller than α=.05. That is, the evidence supports µLead ≠ µHaemo.
Non-parametric Test:
Both the sign test and the signed rank test produce the same conclusion. Each p-value is smaller
than .0001, so there is evidence to reject the null hypothesis. That is, the evidence supports
µLead ≠ µHaemo.
Choosing the Appropriate Test:
The tests for normality result in p-values between .035 and .083, so at a significance level of
α=.05 some of the tests reject normality and others do not. The difference between the mean and
median is .57, which is smaller than the Pearson restriction and so supports normality. From the
histogram, the data contain values that are more extreme than expected under normal conditions
and appear to be slightly skewed to the right. Lastly, the normality plot looks close to linear
but has a slight curve, which suggests some skewness. The skewness statistic is 1.03842 with a
standard error of .448, so it is more than two standard errors away from 0. This indicates a
right-skewed distribution. The kurtosis is 1.74367 with a standard error of .872, so it is about
two standard errors away from 0. This indicates that the tails are longer than what is expected
from a normal distribution.
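The standard errors quoted above depend only on the sample size. The Python sketch below assumes they come from the usual formulas for the standard errors of sample skewness and excess kurtosis (an assumption, since the report does not say how the software computed them); with n = 27 these formulas reproduce .448 and .872. The function names are hypothetical.

```python
from math import sqrt

def skewness_se(n):
    """Standard error of sample skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def kurtosis_se(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2 * skewness_se(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))

n = 27
print(round(skewness_se(n), 3), round(kurtosis_se(n), 3))   # 0.448 0.872

# ratio of each statistic to its standard error
skew_ratio = 1.03842 / skewness_se(n)
kurt_ratio = 1.74367 / kurtosis_se(n)
print(round(skew_ratio, 2), round(kurt_ratio, 2))
```

A ratio beyond roughly 2 in absolute value is the rule of thumb used in the paragraph above to flag a departure from normality.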
These tests are inconclusive, because some support normality while others reject normality.
Notice that the single outlier may be the cause for this deviation from normality, and without
more information on the dataset we don’t know if this is a meaningful value or not. This outlier
certainly has an effect on the kurtosis and skewness values. Because slight deviations from
normality are acceptable, and because the sample size of 27 is close to 30, the appropriate test is
the parametric one.
Conclusion:
There is enough evidence (at α=.05 and p-value=.0001) to reject the null hypothesis; that is, the
evidence supports µLead ≠ µHaemo.
Section 3
Hypothesis Tests in SPSS
H0: µLympho,Hosp A,F = µLympho,Hosp B,F
Hypothesis:
At a significance level of α=.05:
H0: µLympho,Hosp A,F = µLympho,Hosp B,F
and
Ha: µLympho,Hosp A,F ≠ µLympho,Hosp B,F
Parametric Test:
I first built a new dataset containing only the females, then ran a two sample t-test comparing
Lympho between hospital A and hospital B. The output is displayed below. Whether or not equal
variances are assumed, the 95% confidence interval for the difference includes 0, so there is not
enough evidence to reject H0. That is, the data are consistent with
µLympho,Hosp A,F = µLympho,Hosp B,F.
Non-parametric Test:
The table to the right displays the results of two nonparametric tests. The first is the
Mann-Whitney U Test, which was described in section one. The second is the default test SPSS
selected for this dataset. The Mann-Whitney U Test does not reject the null hypothesis
(p-value .902).
Choosing the Appropriate Test:
Hospital A is fairly normal, as you can tell from the four displays on the right. The skewness
and kurtosis values are fairly small, which is consistent with normality. Furthermore, the
QQ plot is close to linear and the boxplot appears to be closer to symmetric than skewed. SPSS
offers two tests for normality, Kolmogorov-Smirnov and Shapiro-Wilk, and neither rejects the null
hypothesis of normality. Therefore we can assume hospital A has a normally distributed population
for Lympho and Female.
Hospital B is also normal, as you can tell from the four displays on the left. The skewness and
kurtosis values are fairly small, which is consistent with normality. Furthermore, the QQ plot is
close to linear and the boxplot appears to be closer to symmetric than skewed. Again, neither the
Kolmogorov-Smirnov test nor the Shapiro-Wilk test rejects the null hypothesis of normality.
Therefore we can assume hospital B has a normally distributed population for Lympho and Female.
Therefore, the appropriate test is the t-test.
Conclusion:
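Because the t-test is appropriate, we use its conclusion: the 95% confidence interval for the
difference includes 0, so the null hypothesis cannot be rejected. The Mann-Whitney U Test agrees
(p-value .902). There is not enough evidence to conclude that µLympho,Hosp A,F differs from
µLympho,Hosp B,F.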
H0: µLead = 16.2
Hypothesis:
At a significance level of α=.05:
H0: µLead = 16.2
and
Ha: µLead ≠ 16.2
Parametric Test:
From the one sample t-test on the right, the value 16.2 does not fall within the 95% confidence
interval. Therefore the null hypothesis is rejected.
Non-parametric Test:
Applying the one-sample Wilcoxon Signed Rank Test, the table to the right displays the decision
to reject the null hypothesis at a significance level of α=.05.
If symmetry cannot be assumed, then the basic sign test can be applied by first computing the
variable “lead-16.2” and observing that 24 of the 27 values were positive. Notice that
P[X ≥ 24] ≈ .0000246 for the binomial variable X~b(27,50%), which is far below α/2 = .025, so
the two sided test at α=.05 results in a rejection of H0. The differences in SPSS are sorted and
displayed to the right.
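The binomial tail probability used in the sign test for this hypothesis can be computed directly. The following Python sketch is an illustration outside of SPSS; the function name `binomial_upper_tail` is hypothetical, and it assumes, as in the text, that none of the 27 differences equals 0.

```python
from math import comb

def binomial_upper_tail(n, k):
    """P[X >= k] for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

one_sided = binomial_upper_tail(27, 24)     # 24 of the 27 differences are positive
two_sided = min(1.0, 2 * one_sided)         # two-sided p-value for Ha: not equal
print(one_sided, two_sided)
```

The two-sided test rejects H0 at α=.05 when `one_sided` is below α/2 = .025.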
Choosing the Appropriate Test:
All of the evidence points toward a rejection of normality. Skewness and kurtosis are both more
than 3 standard errors away from 0, the QQ plot is concave down, the boxplot looks skewed with an
outlier, and the two tests for normality reject H0: normally distributed. Therefore a
nonparametric test should be used. More specifically, the sign test should be applied because the
distribution is skewed, which violates the symmetry assumption of the signed rank test.
Conclusion:
The conclusion of the sign test is to reject the null hypothesis at a significance level of
α=.05. That is, the evidence shows µLead ≠ 16.2.
H0: µLympho = µNeutro
Hypothesis:
At a significance level of α=.05:
H0: µLympho = µNeutro
and
Ha: µLympho ≠ µNeutro
Parametric Test:
The 95% confidence interval, displayed to the right, does not include 0, so the null hypothesis
is rejected. That is, µLympho ≠ µNeutro. It should be noted that the p-value (.049) is only
slightly smaller than .05.
Non-parametric Test:
The conclusion of the Wilcoxon Signed Rank Test is that the null hypothesis should be retained.
That is, at a significance level of α=.05 there is not enough evidence to conclude that the
means differ.
Choosing the Appropriate Test:
The statistics on the right support normality of the difference between the two variables. The
Kolmogorov-Smirnov test rejects normality; however, all other evidence points to normality,
including the Shapiro-Wilk test. Furthermore, the QQ plot is fairly linear and the boxplot
contains no outliers. The skewness and kurtosis are also both within one standard error of 0. If
a deviation from normality exists, it is likely small. A small deviation from normality, combined
with a sample size close to 30, indicates that the parametric test is the appropriate one.
Conclusion:
At α=.05, the null hypothesis is rejected. That is, µLympho ≠ µNeutro.
Section 4
References and Appendices
References
Freund and Walpole. Mathematical Statistics. 3rd ed. Prentice-Hall, 1980.
Fridline, Mark. Class Lecture. Statistical Data Management. University of Akron, Akron, OH.
Fall 2011.
Hollander, Myles, and Douglas A. Wolfe. Nonparametric Statistical Methods. 2nd ed.
Wiley-Interscience, 1999. <http://books.google.com/books/feeds/volumes?q=0471190454>.
Appendix A
SAS Code
/* IMPORT DATA INTO NEW LIBRARY */
libname mylib '\\uanet.edu\ZIPSpace\C\crk32\Classes\F11 Statistical Data Management\Project 1\mylib\';

/* store the gender format in the library so later steps can find it */
proc format library=mylib;
    value $gender
        'M'='Male'
        'F'='Female';
run;
options fmtsearch=(mylib);

data mylib.dataset;
    /* tab-delimited input file ('09'x is the tab character) */
    infile '\\uanet.edu\ZIPSpace\C\crk32\Classes\F11 Statistical Data Management\Project 1\dataset.txt' dsd delimiter='09'x;
    input hospital$ gender$ haemo pcv wbc lympho neutro lead;
    hospital = upcase(hospital);
    gender = upcase(gender);
    difference = lead - haemo;   /* paired difference used in problem3 */
    format
        gender $gender.;
    label
        hospital = 'Hospital Data was Collected From'
        gender = 'Gender of Person'
        haemo = 'Iron in Blood (Hemoglobin)'
        pcv = 'Packed Cell Volume'
        wbc = 'White Blood Cell Count'
        lympho = 'Number of Lymphocytes'
        neutro = 'Neutrophil'
        lead = 'Serum Lead Concentration';
run;

/* problem1 - GET DESCRIPTIVE STATS FOR LEAD BY GENDER */
proc means data=mylib.dataset mean;
    var lead;
    class gender;
run;

/* problem1 - NON PARAMETRIC TEST (Wilcoxon rank sum) */
proc npar1way data=mylib.dataset wilcoxon;
    class gender;
    var lead;
run;

/* problem2 - ONE-SAMPLE TESTS (t, sign, signed rank) AND NORMALITY CHECKS */
proc univariate data=mylib.dataset loccount mu0 = 22.1 normal;
    var neutro;
    histogram neutro / midpoints =10 to 45 by 5 normal;
    qqplot neutro;
run;

/* problem3 - MATCHED PAIRS: SAME TESTS APPLIED TO difference = lead - haemo */
proc univariate data=mylib.dataset loccount mu0 = 0 normal;
    var difference;
    histogram difference / midpoints =-15 to 25 by 2.5 normal;
    qqplot difference;
run;
Appendix B
SPSS Code
T-TEST GROUPS=Hospital('A' 'B')
  /MISSING=ANALYSIS
  /VARIABLES=Lympho
  /CRITERIA=CI(.95).
*Nonparametric Tests: Independent Samples.
NPTESTS
  /INDEPENDENT TEST (Lympho) GROUP (Hospital) MANN_WHITNEY
    MEDIAN(TESTVALUE=SAMPLE COMPARE=PAIRWISE)
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
  /CRITERIA ALPHA=0.05 CILEVEL=95.
DESCRIPTIVES VARIABLES=Neutro
  /STATISTICS=KURTOSIS SKEWNESS.
EXAMINE VARIABLES=Lympho
  /PLOT BOXPLOT STEMLEAF NPPLOT
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
COMPUTE difference=Lympho-Neutro.
EXECUTE.