Nonparametric tests

If data do not come from a normal population (and if the sample is
not large), we cannot use a t-test. One useful approach to creating
test statistics is through the use of rank statistics.
Sign test for one population
The sign test assumes the data come from a continuous
distribution with model
Yi = η + εi , i = 1, . . . , n.
• η is the population median and the εi follow an unknown, continuous
distribution.
• Want to test H0 : η = η0 where η0 is known versus one of the
three common alternatives: Ha : η < η0 , Ha : η ≠ η0 , or
Ha : η > η0 .
• Test statistic is B∗ = Σ_{i=1}^n I{yi > η0}, the number of y1 , . . . , yn
larger than η0 .
• Under H0 , B ∗ ∼ bin(n, 0.5).
• Reject H0 if B ∗ is “unusual” relative to this binomial
distribution.
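As a quick sketch in R (the data below are made up for illustration; the eye
relief data from the example are not reproduced in these notes), the sign test
is just an exact binomial test applied to B∗:

y     <- c(5.2, 4.7, 6.1, 5.9, 4.5, 6.3, 5.7, 5.1)   # hypothetical sample
eta0  <- 5                                            # hypothesized median
Bstar <- sum(y > eta0)                                # number of yi above eta0
binom.test(Bstar, length(y), p = 0.5)                 # exact sign test of H0: eta = eta0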
Example: Eye relief data.
Question: How would you form a “large sample” test statistic from
B ∗ ? You would not need to do that here, but this is common with
more complex test statistics with non-standard distributions.
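One hedged sketch of an answer: under H0 we have B∗ ∼ bin(n, 0.5), so
E(B∗) = n/2 and var(B∗) = n/4. For large n the CLT then suggests the
standardized statistic
Z = (B∗ − n/2) / sqrt(n/4),
which is approximately N(0, 1) under H0.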
Wilcoxon signed rank test
• Again, test H0 : η = η0 . However, this method assumes a
symmetric pdf around the median η.
• Test statistic built from ranks of
{|y1 − η0 |, |y2 − η0 |, . . . , |yn − η0 |}, denoted R1 , . . . , Rn .
• The signed rank for observation i is
Ri+ = Ri if yi > η0 , and Ri+ = 0 if yi ≤ η0 .
• The “signed rank” statistic is W+ = Σ_{i=1}^n Ri+ .
• If W + is large, this is evidence that η > η0 .
• If W + is small, this is evidence that η < η0 .
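A minimal sketch in R, again with made-up data (with no ties among the |yi − η0|
and no yi exactly equal to η0, the built-in wilcox.test reports the same
statistic, labelled V):

y    <- c(5.2, 4.7, 6.1, 5.9, 4.5, 6.3, 5.7, 5.1)   # hypothetical sample
eta0 <- 5
d    <- y - eta0
R    <- rank(abs(d))              # ranks of |yi - eta0|
Wplus <- sum(R[d > 0])            # signed-rank statistic W+
Wplus
wilcox.test(y, mu = eta0)         # built-in signed rank test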
• Both the sign test and the signed-rank test can be used with
paired data (e.g. we could test whether the median difference is
zero).
• When to use what? Use the t-test when data are approximately
normal, or when the sample size is large. Use the sign test when data
are highly skewed, multimodal, etc. Use the signed rank test when data
are approximately symmetric but non-normal (e.g. heavy- or
light-tailed, multimodal yet symmetric, etc.)
Example: Weather station data
Note: The sign test and signed-rank test are more flexible than the
t-test because they require less strict assumptions, but the t-test has
more power when the data are approximately normal.
Mann-Whitney-Wilcoxon nonparametric test (pp. 795–796)
The Mann-Whitney test assumes
Y11 , . . . , Y1n1 iid ∼ F1 , independent of Y21 , . . . , Y2n2 iid ∼ F2 ,
where F1 is the cdf of data from the first group and F2 is the cdf of
data from the second group. The null hypothesis is H0 : F1 = F2 , i.e.
that the distributions of data in the two groups are identical.
The alternative is H1 : F1 ≠ F2 . One-sided tests can also be
performed.
Although the test statistic is built assuming F1 = F2 , the alternative
is often taken to be that the population medians are unequal. This is
a fine way to report the results of the test. Additionally, a CI will
give a plausible range for ∆ in the shift model F2 (x) = F1 (x − ∆). ∆
can be the difference in medians or means.
The Mann-Whitney test is intuitive. The data are
y11 , y12 , . . . , y1n1 and y21 , y22 , . . . , y2n2 .
For each observation j in the first group, count the number cj of
observations in the second group that are smaller; ties result in
adding 0.5 to the count.
Assuming H0 is true, on average half the observations in group 2
would be above Y1j and half would be below if they come from the
same distribution. That is E(cj ) = 0.5n2 .
The sum of these counts is U = Σ_{j=1}^{n1} cj , which has mean
E(U ) = 0.5 n1 n2 . The variance is a bit harder to derive, but is
Var(U ) = n1 n2 (n1 + n2 + 1)/12.
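To make the construction concrete, here is a minimal R sketch with hypothetical
two-sample data (not the dental data from the example below):

y1 <- c(23, 27, 31, 26)                 # group 1, n1 = 4
y2 <- c(22, 25, 28, 30, 33)             # group 2, n2 = 5
# cj = number of group-2 values below y1j, counting ties as 0.5
cj <- sapply(y1, function(v) sum(y2 < v) + 0.5 * sum(y2 == v))
U  <- sum(cj)
n1 <- length(y1); n2 <- length(y2)
c(U = U, EU = 0.5 * n1 * n2, VarU = n1 * n2 * (n1 + n2 + 1) / 12)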
Something akin to the CLT tells us that, when H0 is true,
Z0 = (U − E(U )) / sqrt( n1 n2 (n1 + n2 + 1)/12 )
is approximately N (0, 1). Seeing a U far away from what we expect under the
null gives evidence that H0 is false; U is then standardized as usual
(subtract off the mean we expect under the null and divide by the
standard deviation of U under the null).
A p-value can be computed as usual, as can a CI.
Example: Dental measurements.
Note: This test essentially boils down to replacing the observed
values with their ranks and carrying out a simple pooled t-test!
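A hedged sketch of the built-in version in R, reusing the hypothetical y1 and
y2 from above (R's wilcox.test reports the statistic as W, which matches the U
described here, and conf.int = TRUE gives a CI for the shift ∆):

wilcox.test(y1, y2, conf.int = TRUE)   # Mann-Whitney-Wilcoxon test plus CI for the shift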
Regression models
Model: a mathematical approximation of the relationship among real
quantities (equation & assumptions about terms).
• We have seen several models for a single outcome variable from
either one or two populations.
• Now we’ll consider models that relate an outcome to one or more
continuous predictors.
Section 1.1: relationships between variables
• A functional relationship between two variables is
deterministic, e.g. Y = cos(2.1x) + 4.7. Although often an
approximation to reality (e.g. the solution to a differential
equation under simplifying assumptions), the relation itself is
“perfect.” (e.g. page 3)
• A statistical or stochastic relationship introduces some “error”
in seeing Y , typically a functional relationship plus noise.
(e.g. Figures 1.1, 1.2, and 1.3; pp. 4–5).
Statistical relationship: not a perfect line or curve, but a general
tendency plus slop.
Example:
  x   1   2   3   4   5
  y   1   1   2   2   4
Let’s obtain the plot in R: x=c(1,2,3,4,5); y=c(1,1,2,2,4)
We must decide on the proper functional form for this relationship,
i.e. the trend: Linear? Curved? Piecewise continuous?
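For completeness, the corresponding plotting call in base R:

x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
plot(x, y)    # scatterplot of the five (x, y) pairs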
Section 1.3: Simple linear regression model
For a sample of pairs (xi , Yi ), i = 1, . . . , n, let
Yi = β0 + β1 xi + εi , i = 1, . . . , n,
where
• Y1 , . . . , Yn are realizations of the response variable,
• x1 , . . . , xn are the associated predictor variables,
• β0 is the intercept of the regression line,
• β1 is the slope of the regression line, and
• ε1 , . . . , εn are unobserved, uncorrelated random errors.
This model assumes that x and Y are linearly related, i.e. the mean
of Y changes linearly with x.
Assumptions about the random errors
We assume that E(εi ) = 0, var(εi ) = σ² , and corr(εi , εj ) = 0 for
i ≠ j: mean zero, constant variance, uncorrelated.
• β0 + β1 xi is the deterministic part of the model. It is fixed but
unknown.
• εi represents the random part of the model.
The goal of statistics is often to separate signal from noise; which is
which here?
Note that
E(Yi ) = E(β0 + β1 xi + εi ) = β0 + β1 xi + E(εi ) = β0 + β1 xi ,
and similarly
var(Yi ) = var(β0 + β1 xi + εi ) = var(εi ) = σ² .
Also, corr(Yi , Yj ) = 0 for i ≠ j.
(We will draw a picture...)
Example: Pages 10–11 (see picture). We have
E(Y ) = 9.5 + 2.1x.
When x = 45, our expected y-value is 104, but we will actually
observe a value somewhere around 104.
What does 9.5 represent here? Is it sensible/interpretable?
How is 2.1 interpreted here?
In general, β1 represents how the mean response changes when x is
increased one unit.
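A small simulation sketch of this idea in R (σ is not given in the notes, so
the value below is made up purely for illustration):

set.seed(1)
x     <- 45
sigma <- 3                                        # hypothetical error SD
Y     <- 9.5 + 2.1 * x + rnorm(1000, 0, sigma)    # simulated observations at x = 45
mean(Y)                                           # close to E(Y) = 9.5 + 2.1*45 = 104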
Matrix form: Note the simple linear regression model can be
written in matrix terms as

  [ Y1 ]   [ 1  x1 ]            [ ε1 ]
  [ Y2 ]   [ 1  x2 ]  [ β0 ]    [ ε2 ]
  [ .  ] = [ .   . ]  [ β1 ]  + [ .  ]
  [ .  ]   [ .   . ]            [ .  ]
  [ Yn ]   [ 1  xn ]            [ εn ]

or equivalently

  Y = Xβ + ε,

where Y = (Y1 , Y2 , . . . , Yn )′ is the n × 1 vector of responses, X is the
n × 2 matrix whose ith row is (1, xi ), β = (β0 , β1 )′ , and
ε = (ε1 , ε2 , . . . , εn )′ is the n × 1 vector of errors.

This will be useful later on.
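A one-line illustration in R of what the design matrix X looks like (x here is
just the small example vector from earlier):

x <- c(1, 2, 3, 4, 5)
X <- cbind(1, x)        # column of 1s for the intercept, then the predictor
X                       # compare with model.matrix(~ x), which has the same columns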
Section 1.6: Estimation of (β0 , β1 )
β0 and β1 are unknown parameters to be estimated from the data:
(x1 , Y1 ), (x2 , Y2 ), . . . , (xn , Yn ).
They completely determine the unknown mean at each value of x:
E(Y ) = β0 + β1 x.
Since we expect the various Yi to be both above and below β0 + β1 xi
roughly the same amount (E(i ) = 0), a good-fitting line b0 + b1 x will
go through the “heart” of the data points in a scatterplot.
The method of least-squares formalizes this idea by minimizing the
sum of the squared deviations of the observed yi from the line
b0 + b1 xi .
Least squares method for estimating (β0 , β1 ). The sum of squared
deviations about the line is
Q(β0 , β1 ) = Σ_{i=1}^n [Yi − (β0 + β1 xi )]² .
Least squares minimizes Q(β0 , β1 ) over all (β0 , β1 ).
Calculus shows that the least squares estimators are
  b1 = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ ) / Σ_{i=1}^n (xi − x̄)²
  b0 = Ȳ − b1 x̄
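As a hedged check in R (using the small example data from earlier), the hand
formulas and the built-in lm() fit should give the same estimates:

x  <- c(1, 2, 3, 4, 5)
y  <- c(1, 1, 2, 2, 4)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))       # least squares fit via lm(); should agree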
Proof:

  ∂Q/∂β1 = Σ_{i=1}^n 2(Yi − β0 − β1 xi )(−xi )
         = −2 [ Σ_{i=1}^n xi Yi − β0 Σ_{i=1}^n xi − β1 Σ_{i=1}^n xi² ],

  ∂Q/∂β0 = Σ_{i=1}^n 2(Yi − β0 − β1 xi )(−1)
         = −2 [ Σ_{i=1}^n Yi − nβ0 − β1 Σ_{i=1}^n xi ].
Setting these equal to zero, and dropping indexes on the summations,
we have

  Σ xi Yi = b0 Σ xi + b1 Σ xi²     ⇐ the “normal” equations
  Σ Yi = n b0 + b1 Σ xi

Multiply the first by n, multiply the second by Σ xi , and subtract,
yielding

  n Σ xi Yi − (Σ xi )(Σ Yi ) = b1 [ n Σ xi² − (Σ xi )² ].

Solving for b1 we get

  b1 = [ n Σ xi Yi − (Σ xi )(Σ Yi ) ] / [ n Σ xi² − (Σ xi )² ]
     = [ Σ xi Yi − n Ȳ x̄ ] / [ Σ xi² − n x̄² ].
The second normal equation immediately gives
b0 = Ȳ − b1 x̄.
Our solution for b1 is correct but not as aesthetically pleasing as the
purported solution
  b1 = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ ) / Σ_{i=1}^n (xi − x̄)² .

Show

  Σ (xi − x̄)(Yi − Ȳ ) = Σ xi Yi − n Ȳ x̄
  Σ (xi − x̄)² = Σ xi² − n x̄²
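A quick numerical sanity check of these two identities in R (any data will do;
the small example vectors are reused here):

x <- c(1, 2, 3, 4, 5); y <- c(1, 1, 2, 2, 4); n <- length(x)
all.equal(sum((x - mean(x)) * (y - mean(y))), sum(x * y) - n * mean(y) * mean(x))
all.equal(sum((x - mean(x))^2), sum(x^2) - n * mean(x)^2)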
The line Ŷ = b0 + b1 x is called the least squares estimated regression
line. Why are the least squares estimates (b0 , b1 ) “good?”
1. They are unbiased: E(b0 ) = β0 and E(b1 ) = β1 .
2. Among all linear unbiased estimators, they have the smallest
variance. They are best linear unbiased estimators, BLUEs.
We will show the first property in the next class. The second
property is formally called the “Gauss-Markov” theorem and is
proved in linear models. (Look it up on Wikipedia if interested.)