An Introduction to Statistical Inference Mike Wasikowski June 12, 2008

Statistics
• Up to now, we have looked at probability; statistics analyzes data in whose development chance played some part
• Two main branches:
  • Estimation of parameters
  • Testing hypotheses about parameters
• To use statistical analysis, we must ensure we have a random sample of the population
• The methods described here are classical methods: "probability of data D given hypothesis H"
• Bayesian methods are also sometimes used: "probability of hypothesis H given data D"
Contents
• What is an estimator?
• Unbiased estimators
• Biased estimators
• Parametric hypothesis tests
• Nonparametric hypothesis tests
• Multiple tests/experiments
Classical Estimation Methods
• Probability distributions PX(x;θ) and density functions fX(x;θ) have parameters
• We can use the observed value x of X to estimate θ
• To estimate the parameters, we must use multiple iid observations x1, x2, ..., xn
• An estimator of the parameter θ is a function of the RV's X1, X2, ..., Xn, written as either θ̂(X1, X2, ..., Xn) or simply θ̂
• The observed value of θ̂ is the estimate of θ
Desirable Properties of Estimators
• Unbiased: E(θ̂) = θ
• Small variance: the observed value of θ̂ should be close to θ
• Normal distribution, either exactly or approximately: allows us to use the properties of the normal distribution to derive properties of θ̂
Estimating μ
• Use X̄ to estimate μ
• The mean value of X̄ is μ, so X̄ is unbiased
• The variance of X̄ is σ²/n, so it is small when n is large
• The central limit theorem tells us the distribution of X̄ will be approximately normal for a large number of observations
• Our estimated value of μ is x̄
Confidence Intervals
• From section 1.10.2, for large n, P(X̄ - 2σ/sqrt(n) < μ < X̄ + 2σ/sqrt(n)) ≈ 0.95
• The probability of the random interval (X̄ - 2σ/sqrt(n), X̄ + 2σ/sqrt(n)) containing μ is approximately 95%
• The observed value of the interval, given all xi, is (x̄ - 2σ/sqrt(n), x̄ + 2σ/sqrt(n))
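As a concrete sketch, the interval above can be computed directly; the data and σ below are invented for illustration:

```python
import math

def mean_ci(xs, sigma):
    """Return (xbar, lower, upper) for the large-sample 95% interval
    xbar +/- 2*sigma/sqrt(n), assuming sigma is known."""
    n = len(xs)
    xbar = sum(xs) / n
    half = 2 * sigma / math.sqrt(n)
    return xbar, xbar - half, xbar + half

# Five made-up observations with an assumed sigma of 0.5
xbar, lo, hi = mean_ci([4.8, 5.1, 5.0, 4.9, 5.2], sigma=0.5)
```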
Estimating σ²
• An unbiased estimator of σ² is σ̂² = Σ(Xi - X̄)²/(n - 1)
• Our estimated value of σ² is s² = Σ(xi - x̄)²/(n - 1)
• One potential problem: unless n is very large, the variance of this estimator will itself typically be large
• The estimated variance of X̄ is s²/n = Σ(xi - x̄)²/(n(n - 1))
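The n - 1 denominator is what makes the estimator unbiased; a minimal sketch, cross-checked against Python's statistics module, which uses the same denominator:

```python
import statistics

def unbiased_variance(xs):
    """s^2 = sum((x_i - xbar)^2) / (n - 1), the unbiased estimator of sigma^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

# Agrees with the standard library's sample variance
s2 = unbiased_variance([1, 2, 3, 4])
```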
Estimated Confidence Intervals
• We then have the 95% confidence interval for μ as (X̄ - 2S/sqrt(n), X̄ + 2S/sqrt(n))
• The observed interval from the data is (x̄ - 2s/sqrt(n), x̄ + 2s/sqrt(n))
• Again, a warning: unless n is very large, this interval will be wide and may not be useful
Binomial and Multinomial Probability Estimates
• Consider the binomial RV Y(p,n), where p is the parameter and n is the index
• The mean value of Y/n is p, and the variance of Y/n is p(1 - p)/n
• By the above, p̂ = Y/n is an unbiased estimator of p
• The typical estimate of the variance of p̂ is p̂(1 - p̂)/n = y(n - y)/n³, where y = number of successes
• This estimate is biased; the unbiased estimate is y(n - y)/(n²(n - 1)), similarly to the σ² estimate
• An estimate of {pi} is calculated similarly, by converting the multinomial problem into a series of binomial problems
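The binomial formulas above can be sketched in a few lines (the function name and the counts in the example are illustrative):

```python
def binomial_estimates(y, n):
    """Given y successes in n trials, return the unbiased estimate of p,
    the biased plug-in variance estimate y(n-y)/n^3, and the unbiased
    variance estimate y(n-y)/(n^2 (n-1))."""
    p_hat = y / n
    biased_var = y * (n - y) / n ** 3
    unbiased_var = y * (n - y) / (n ** 2 * (n - 1))
    return p_hat, biased_var, unbiased_var

# e.g. 40 successes in 100 trials
p_hat, bv, uv = binomial_estimates(40, 100)
```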
Biased Estimators
• Not all estimators are unbiased
• A biased estimator θ̂ is one where E(θ̂) differs from θ
• Bias = E(θ̂) - θ
• Assess the accuracy of θ̂ by MSE rather than variance
• MSE(θ̂) = E((θ̂ - θ)²) = Var(θ̂) + Bias(θ̂)²
• When E(θ̂) = θ + O(n⁻¹), we call the estimator asymptotically unbiased
• The MSE and variance then differ by O(n⁻²)
Why use biased estimators?
• Some parameters cannot be estimated in an unbiased manner
• A biased estimator is preferable to an unbiased one when its MSE is smaller than the unbiased estimator's variance
Hypothesis Testing
• Test a null hypothesis (H0) versus an alternate hypothesis (H1 or Ha)
• Five steps:
  1) Declare the null hypothesis and alternate hypothesis
  2) Select the significance level α
  3) Determine the test statistic to be used
  4) Determine what observed values of the test statistic would lead to rejection of H0
  5) Use the data to determine whether the observed value of the test statistic meets or exceeds the significance point from step 4
Declaring Hypotheses
• Must declare the null hypothesis and alternate hypothesis before seeing any data, to avoid bias
• Hypotheses can be simple (specifying all values of the unknown parameters) or complex (not specifying all values of the unknown parameters)
• The natural alternate hypothesis is usually complex
• Alternate hypotheses can be either one-sided (θ > θ0 or θ < θ0) or two-sided (θ != θ0)
Selecting Significance Level
• Two types of errors can be made in a hypothesis test:
  • Type I: reject H0 when it is true
  • Type II: fail to reject H0 when it is false
• Unless we have limitless observations, we cannot make the probabilities of both errors arbitrarily small at once
• The typical method is to focus on type I errors and fix α to be acceptably low
• Common values of α are 1% and 5%
Choosing Test Statistic
• There is much theory available for choosing good test statistics
• Chapter 9 (Alex) discusses finding the optimal test statistic that, for a given type I error rate and number of observations, minimizes the rate of type II errors
Finding Significance Points
• Find the value of the significance point K for the test statistic
• General example: for α = 0.05, P(type I error) = P(X >= K | H0) = 0.05
• If the RV is discrete, it may be impossible to find a value of K such that the rate of type I errors is exactly α
• In practice, we err conservatively and round up the value of K
Finding Conclusions
• Compare the value of the test statistic computed from the observations to the significance point K
• Two conclusions can be drawn from a hypothesis test: fail to reject the null, or reject the null in favor of the alternate
• A hypothesis test never tells you whether a hypothesis is true or false
P-values
• An equivalent method skips calculating the significance point K
• Instead, calculate the achieved significance level (p-value) of the test statistic
• Then compare the p-value to α:
  • If p-value <= α, reject H0
  • If p-value > α, fail to reject H0
Power of Hypothesis Tests
• Recall that step 3 involves choosing an optimal test statistic
• If both hypotheses are simple, the choice of α implicitly determines β, the rate of type II errors
• Power of a hypothesis test = 1 - β, the rate of avoiding type II errors
• With a complex alternate hypothesis, the probability of rejecting H0 depends on the actual value of the parameters in the test, so there is no unique value of β
• Chapter 9 discusses how to find the power of tests with complex alternate hypotheses
Z-test
• Classic example: what is the mean of data drawn from a normal distribution?
• H0: μ = μ0, H1: μ > μ0
• Use X̄ as our optimal test statistic
• The RV Z = (X̄ - μ0)sqrt(n)/σ has distribution N(0,1) when H0 is true
• For α = 0.05, the significance point is Z ≥ 1.645
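A minimal sketch of this Z-test using only the standard library (NormalDist supplies the N(0,1) cdf); the decision rule rejects when z ≥ 1.645 at α = 0.05:

```python
from statistics import NormalDist

def z_test(xs, mu0, sigma):
    """One-sided Z-test of H0: mu = mu0 vs H1: mu > mu0, sigma known.
    Returns (z, p); reject H0 at alpha = 0.05 when z >= 1.645."""
    n = len(xs)
    xbar = sum(xs) / n
    z = (xbar - mu0) * n ** 0.5 / sigma
    p = 1 - NormalDist().cdf(z)   # one-sided p-value
    return z, p
```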
One-sample t-test
• When σ is unknown, we must estimate the sample variance with s²
• We then use the one-sample t-test, t = (x̄ - μ0)sqrt(n)/s
• If we know that X1, X2, ..., Xn are NID(μ, σ²), the H0 distribution of T = (X̄ - μ0)sqrt(n)/S is well known
• It is called the t-distribution with n - 1 degrees of freedom
• T is asymptotically equal to Z but differs greatly for small n
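The t statistic itself is easy to compute; the p-value then comes from a t-table (or a library such as scipy.stats) with n - 1 degrees of freedom, which this stdlib-only sketch omits:

```python
import math
import statistics

def one_sample_t(xs, mu0):
    """t = (xbar - mu0) * sqrt(n) / s; compare against the t-distribution
    with n - 1 degrees of freedom."""
    n = len(xs)
    xbar = statistics.mean(xs)
    s = statistics.stdev(xs)   # sample std dev, n - 1 denominator
    return (xbar - mu0) * math.sqrt(n) / s
```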
Two-sample t-test






What if we need to compare between two different RV's?
Ex: repeated experiment comparing two methods
H0: μ1 = μ2, H1 = μ1 != μ2
Consider X11, X12, ..., X1m ~ NID(μ1, σ2) and X21, X22, ...,
X2n ~ NID(μ2,σ2) to be RV's from which our observations
are drawn
Use two-sample t-test
Large positive or negative values cause rejection of H0
Two-sample t-test
• T-distribution RV: T = (X̄1 - X̄2)/(Sp sqrt(1/m + 1/n)), where Sp² = (Σ(X1i - X̄1)² + Σ(X2j - X̄2)²)/(m + n - 2) is the pooled variance estimate; under H0, T has the t-distribution with m + n - 2 degrees of freedom
• Observed value of the RV: t = (x̄1 - x̄2)/(sp sqrt(1/m + 1/n))
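A sketch of the pooled (equal-variance) two-sample t statistic; the function name is illustrative:

```python
import math
import statistics

def two_sample_t(x1, x2):
    """Pooled two-sample t statistic; under H0: mu1 = mu2 it follows the
    t-distribution with m + n - 2 degrees of freedom."""
    m, n = len(x1), len(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    sp2 = ((m - 1) * v1 + (n - 1) * v2) / (m + n - 2)   # pooled variance
    return (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(sp2 * (1 / m + 1 / n))
```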
Paired t-test
• Suppose the values of X1i and X2i are logically paired in some manner
• We can instead perform a paired t-test, using Di = X1i - X2i for our test
• H0: μD = 0, H1: μD != 0
• Then use T = D̄sqrt(n)/SD as our test statistic
• This method can eliminate sources of variance
• It is the beginning of the source for ANOVA, where we break variation into different components
• It is also the foundation for the F-test, a test of the ratio between two variances
Chi-square test
• Consider a multinomial distribution
• H0: pi = a specified value for each i = 1..k, H1: at least one pi differs from its specified value
• Use X² as our test statistic, X² = Σ(Yi - npi)²/(npi)
• Larger observed values of X² lead to rejection of H0
• When H0 is true and n is large, X² ~ chi-square distribution with k - 1 degrees of freedom
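The X² statistic can be sketched as below; the die-roll counts are invented, and 11.07 is the 5% critical value of the chi-square distribution with 5 degrees of freedom:

```python
def chi_square_stat(observed, probs, n):
    """X^2 = sum (Y_i - n p_i)^2 / (n p_i); under H0 and large n this is
    approximately chi-square with k - 1 degrees of freedom."""
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

# 120 rolls of a die under H0: fair (each p_i = 1/6)
x2 = chi_square_stat([22, 17, 18, 21, 13, 29], [1 / 6] * 6, 120)
reject = x2 > 11.07   # 5% point, chi-square with 5 df
```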
Association tests
• Compare elements of a population by placing each into one of a number of categories for two properties
• Fisher's exact test compares two binary properties of a population
• H0: the two properties are independent of one another, H1: the two properties are dependent in some manner
• A chi-square test can also be used on tables with an arbitrary number of rows and columns
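For a 2x2 table, the one-sided Fisher's exact p-value is a sum of hypergeometric probabilities over tables at least as extreme with the same margins; a sketch with math.comb (two-sided versions need extra care and are omitted):

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]:
    probability under independence of cell a being this large or larger,
    holding the row and column totals fixed."""
    row1, col1, total = a + b, a + c, a + b + c + d
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)
    return p
```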
Hypothesis Testing with Maximum as Test Statistic
• Bioinformatics has several areas where the maximum of many RV's is a useful test statistic
• BLAST, local alignment of sequences: we only care about the most likely alignment
• Let X1, X2, ..., Xn ~ N(μi, 1)
• H0: μi = 0 for all i, H1: one μi > 0 with the rest μi = 0
• Optimal test statistic: Xmax
• Reject H0 if P(Xmax > xmax | H0) < α
• Use 1 - F(xmax)^n to find the p-value, where F is the N(0,1) cdf
• Some options still exist if we cannot calculate the cdf; one possibility is the total variation distance
Nonparametric Tests
• The two-sample t-test is a distribution-dependent test: it relies on the RV's having the normal distribution
• If we use the t-test when at least one of the underlying RV's is not normal, the calculated p-value yields an invalid testing procedure
• Nonparametric, or distribution-free, tests avoid the problems of using tests specific to a distribution
Permutation Tests
• Avoid the assumption of a normal distribution
• We have RV's X11, X12, ..., X1m iid and X21, X22, ..., X2n iid, with possibly differing distributions
• Assume X1i is independent of X2j for all (i, j)
• H0: the X1i's are distributed identically to the X2j's, H1: the distributions differ
• There are Q = C(m + n, m) possible placements ("permutations") of X11, X12, ..., X1m, X21, X22, ..., X2n into the two groups
• H0 says each of the Q placements has the same probability of arising
Permutation Tests
• Calculate the test statistic for each permutation
• Reject H0 if the observed value of the statistic is among the most extreme 100α% of values of the test statistic
• The choice of test statistic depends on what we think may differ between the two distributions
• t-tests could be used if we suspect different means, the F-test if different variances
• Problems with these tests: granularity with too few samples, computational complexity with too many
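For small samples, the exact permutation p-value can be enumerated directly; this sketch uses |difference of means| as the test statistic:

```python
from itertools import combinations
from math import comb

def permutation_pvalue(x1, x2):
    """Exact two-sided permutation p-value with |mean(x1) - mean(x2)| as
    the test statistic, enumerating all C(m + n, m) regroupings."""
    m, n = len(x1), len(x2)
    pooled = list(x1) + list(x2)
    total = sum(pooled)

    def stat(group_sum):
        # |mean of the group - mean of the rest|, from the group's sum alone
        return abs(group_sum / m - (total - group_sum) / n)

    observed = stat(sum(x1))
    extreme = sum(1 for g in combinations(pooled, m)
                  if stat(sum(g)) >= observed - 1e-12)
    return extreme / comb(m + n, m)
```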
Mann-Whitney Test
• Frequently used alternative to the two-sample t-test
• The observed values x11, x12, ..., x1m and x21, x22, ..., x2n are listed in increasing order
• Associate each observation with its rank in this list
• The sum of all ranks is (m + n)(m + n + 1)/2
• H0: the X1i's and X2j's are identically distributed, H1: at least one parameter of the distributions differs
• For large sample sizes, use the central limit theorem to test the null hypothesis with a z-score
• For small sample sizes, an exact p-value can be calculated as in a permutation test
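The large-sample z-score for the rank sum can be sketched as below (no correction for ties, so it assumes continuous data; scipy.stats.mannwhitneyu handles the general case):

```python
from statistics import NormalDist

def rank_sum_z(x1, x2):
    """Normal approximation to the Mann-Whitney (rank-sum) test: z-score
    of the rank sum of the first sample, plus a two-sided p-value."""
    m, n = len(x1), len(x2)
    pooled = sorted((v, i) for i, v in enumerate(list(x1) + list(x2)))
    rank = {i: r + 1 for r, (v, i) in enumerate(pooled)}
    w = sum(rank[i] for i in range(m))          # rank sum of sample 1
    mean_w = m * (m + n + 1) / 2                # E(W) under H0
    var_w = m * n * (m + n + 1) / 12            # Var(W) under H0
    z = (w - mean_w) / var_w ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided normal p-value
    return z, p
```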
Wilcoxon Signed-rank Test
• A test for the value of the median of a generic continuous RV; if the distribution is symmetric, it also tests the mean
• H0: med = M0, H1: med != M0
• Calculate the absolute differences |xi - M0|, order them from smallest to largest, and assign ranks
• Observed test statistic = sum of the ranks of the positive differences
• Use the central limit theorem to compare groups with a large number of samples
• An exact p-value can also be calculated as a permutation test for small sample sizes
Multiple Associated Tests
• If we test many associated hypotheses where each H0 is true, chance alone will lead to one or more being rejected
• A family-wide p-value can be used to avoid this result
• If we want a family-wide significance level of 0.05, each test should use α = 0.05/g, where g is the number of tests we are performing
• This (Bonferroni) correction applies even if the tests are not independent of one another; recall the indicator variable discussion
• Obvious problem: if we perform many tests, this procedure results in a very low required p-value to reject H0 in each individual test
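The correction is one line in practice; this sketch reports which of a family of p-values survive (the example p-values are invented):

```python
def bonferroni_rejections(p_values, family_alpha=0.05):
    """Bonferroni correction: with g tests, run each at alpha/g so the
    family-wide type I error rate is at most family_alpha."""
    g = len(p_values)
    per_test_alpha = family_alpha / g
    return [p <= per_test_alpha for p in p_values]

# With g = 3 tests, each runs at 0.05/3 ~ 0.0167
decisions = bonferroni_rejections([0.004, 0.03, 0.2])
```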
Multiple Experiments
• In science, it is common to repeat tests to verify results
• What if the p-values of each test are close to α but not below it?
• Use a combined p-value to show the significance of each p-value in conjunction with the others
• V = -2 log(P1 P2 ... Pk) has a chi-square distribution with 2k degrees of freedom when every H0 is true
• This can show significant results even when no individual null hypothesis was rejected
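The combined statistic is straightforward to compute; for example, three experiments each with p = 0.06 give V ≈ 16.88, exceeding 12.59, the 5% chi-square critical value with 2k = 6 degrees of freedom:

```python
import math

def fisher_combined_statistic(p_values):
    """Fisher's method: V = -2 * ln(p1 * p2 * ... * pk) is chi-square with
    2k degrees of freedom when every H0 is true."""
    return -2 * sum(math.log(p) for p in p_values)

# Three borderline experiments combine into a significant result
v = fisher_combined_statistic([0.06, 0.06, 0.06])
reject = v > 12.59   # 5% point, chi-square with 6 df
```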
Questions?