An Introduction to Statistical Inference
Mike Wasikowski
June 12, 2008

Statistics
• Up till now, we have looked at probability: analyzing data in which chance played some part of its development
• Statistics has two main branches:
  • Estimation of parameters
  • Testing hypotheses about parameters
• To use statistical analysis, we must ensure we have a random sample of the population
• The methods described here are classical methods: "probability of data D given hypothesis H"
• Bayesian methods are also sometimes used: "probability of hypothesis H given data D"

Contents
• What is an estimator?
• Unbiased estimators
• Biased estimators
• Parametric hypothesis tests
• Nonparametric hypothesis tests
• Multiple tests/experiments

Classical Estimation Methods
• Probability distributions P_X(x; θ) and density functions f_X(x; θ) have parameters
• We can use the observed value x of X to estimate θ
• To estimate the parameters, we must use multiple iid observations x1, x2, ..., xn
• An estimator of the parameter θ is a function of the RVs X1, X2, ..., Xn, written either θ̂(X1, X2, ..., Xn) or simply θ̂
• The observed value of θ̂ is the estimate of θ

Desirable Properties of Estimators
• Unbiased: E(θ̂) = θ
• Small variance: the observed value of θ̂ should be close to θ
• Normal distribution, either exactly or approximately: lets us use the properties of the normal distribution to describe θ̂

Estimating μ
• Use the sample mean X̄ to estimate μ
• The mean value of X̄ is μ, so X̄ is unbiased
• The variance of X̄ is σ²/n, so it is small when n is large
• The central limit theorem tells us that the distribution of X̄ will be approximately normal for a large number of observations
• Our estimated value of μ is x̄

Confidence Intervals
• From section 1.10.2, for large n, P(X̄ − 2σ/√n < μ < X̄ + 2σ/√n) ≈ 0.95
• The probability that the random interval (X̄ − 2σ/√n, X̄ + 2σ/√n) contains μ is approximately 95%
• The observed value of the interval given the xi is (x̄ − 2σ/√n, x̄ + 2σ/√n)

Estimating σ²
• An unbiased estimator of σ² is σ̂² = Σ(Xi − X̄)²/(n − 1)
• Our estimated value of σ² is s² = Σ(xi − x̄)²/(n − 1)
• One potential problem: unless n is very large, this estimator will itself typically have large variance
• The estimated variance of X̄ is σ̂²/n = Σ(Xi − X̄)²/(n(n − 1))

Estimated Confidence Intervals
• We then have the approximate 95% confidence interval for μ as (X̄ − 2S/√n, X̄ + 2S/√n)
• The observed interval from the data is (x̄ − 2s/√n, x̄ + 2s/√n)
• Again, a warning: unless n is very large, this interval will be wide and may not be useful

Binomial and Multinomial Probability Estimates
• Consider the binomial RV Y with parameter p and index n
• The mean value of Y/n is p, and the variance of Y/n is p(1 − p)/n
• By the above, p̂ = Y/n is an unbiased estimator of p
• The typical estimate of the variance of p̂ is p̂(1 − p̂)/n = y(n − y)/n³, where y is the observed number of successes
• This estimate is biased; the unbiased estimate is y(n − y)/(n²(n − 1)), similarly to the σ² estimate
• Estimates of the {pi} are calculated similarly, by converting the multinomial problem into a series of binomial problems

Biased Estimators
• Not all estimators are unbiased
• A biased estimator θ̂ is one where E(θ̂) differs from θ
• Bias = E(θ̂) − θ
• Assess the accuracy of θ̂ by its mean squared error (MSE) rather than its variance
• MSE(θ̂) = E((θ̂ − θ)²) = Var(θ̂) + Bias(θ̂)²
• When E(θ̂) = θ + O(n⁻¹), the estimator is called asymptotically unbiased; its MSE and variance then differ by O(n⁻²)
• Why use biased estimators?
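One classical answer can be seen in a quick simulation. The sketch below (plain Python, not taken from the text) compares the unbiased variance estimator that divides by n − 1 with the biased maximum-likelihood estimator that divides by n; for normal data the biased estimator has the smaller mean squared error, matching the theory values 2σ⁴/(n − 1) and (2n − 1)σ⁴/n².

```python
import random

random.seed(0)
mu, sigma2, n, trials = 0.0, 4.0, 10, 20000

sq_err_unbiased = sq_err_biased = 0.0
for _ in range(trials):
    xs = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)            # sum of squared deviations
    sq_err_unbiased += (ss / (n - 1) - sigma2) ** 2  # unbiased estimator s^2
    sq_err_biased += (ss / n - sigma2) ** 2          # biased (MLE) estimator

mse_unbiased = sq_err_unbiased / trials  # theory: 2*sigma2^2/(n-1) ≈ 3.56
mse_biased = sq_err_biased / trials      # theory: (2n-1)*sigma2^2/n^2 ≈ 3.04
print(mse_biased < mse_unbiased)         # the biased estimator wins on MSE
```

Dividing by n shrinks the estimate toward zero, trading a little bias for a larger reduction in variance.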
• Some parameters cannot be estimated in an unbiased manner
• A biased estimator is preferable to an unbiased one when its MSE is smaller than the unbiased estimator's variance

Hypothesis Testing
• Test a null hypothesis (H0) versus an alternate hypothesis (H1 or Ha)
• Five steps:
  1) Declare the null hypothesis and alternate hypothesis
  2) Select the significance level α
  3) Determine the test statistic to be used
  4) Determine what observed values of the test statistic would lead to rejection of H0
  5) Use the data to determine whether the observed value of the test statistic meets or exceeds the significance point from step 4

Declaring Hypotheses
• Must declare the null and alternate hypotheses before seeing any data, to avoid bias
• Hypotheses can be simple (specifying all values of the unknown parameters) or composite (not specifying all values of the unknown parameters)
• The natural alternate hypothesis is composite
• Alternate hypotheses can be either one-sided (θ > θ0 or θ < θ0) or two-sided (θ ≠ θ0)

Selecting Significance Level
• Two types of errors can be made in a hypothesis test:
  • Type I: reject H0 when it is true
  • Type II: fail to reject H0 when it is false
• Unless we have limitless observations, we cannot make the probability of both errors arbitrarily small
• The typical method is to focus on Type I errors and fix α at an acceptably low value
• Common values of α are 1% and 5%

Choosing Test Statistic
• There is much theory available for choosing good test statistics
• Chapter 9 (Alex) discusses finding the optimal test statistic: the one that, for a given Type I error rate and number of observations, minimizes the rate of Type II errors

Finding Significance Points
• Find the value of the significance point K for the test statistic
• General example: for α = 0.05, P(Type I error) = P(X ≥ K | H0) = 0.05
• If the RV is discrete, it may be impossible to find a value of K such that the rate of Type I errors is exactly α
• In practice, we err conservatively and round up the value of K

Finding Conclusions
• Compare the observed value of the test statistic to the significance point K
• Two conclusions can be drawn from a hypothesis test: fail to reject the null, or reject the null in favor of the alternate
• A hypothesis test never tells you whether a hypothesis is true or false

P-values
• An equivalent method skips calculating the significance point K
• Instead, calculate the achieved significance level (p-value) of the observed test statistic
• Then compare the p-value to α: if p ≤ α, reject H0; if p > α, fail to reject H0

Power of Hypothesis Tests
• Recall that step 3 involves choosing an optimal test statistic
• If both hypotheses are simple, the choice of α implicitly determines β, the rate of Type II errors
• The power of a hypothesis test is 1 − β, the probability of avoiding a Type II error
• With a composite alternate hypothesis, the probability of rejecting H0 depends on the actual parameter values, so there is no unique value of β
• Chapter 9 discusses how to find the power of tests with composite alternate hypotheses

Z-test
• Classic example: what is the mean of data drawn from a normal distribution with known variance σ²?
• H0: μ = μ0, H1: μ > μ0
• Use X̄ as our optimal test statistic
• The RV Z = (X̄ − μ0)√n/σ has distribution N(0, 1) when H0 is true
• For α = 0.05, the significance point is Z ≥ 1.645

One-sample t-test
• When σ² is unknown, we must estimate it with the sample variance s²
• Now use the one-sample t-test, t = (x̄ − μ0)√n/s
• If we know that X1, X2, ..., Xn are NID(μ, σ²), the H0 distribution of T = (X̄ − μ0)√n/S is well known: the t-distribution with n − 1 degrees of freedom
• T is asymptotically equal to Z but differs greatly from it for small n

Two-sample t-test
• What if we need to compare two different RVs?
• Example: a repeated experiment comparing two methods
• H0: μ1 = μ2, H1: μ1 ≠ μ2
• Consider X11, X12, ..., X1m ~ NID(μ1, σ²) and X21, X22, ..., X2n ~ NID(μ2, σ²) to be the RVs from which our observations are drawn
• Use the two-sample t-test
• Large positive or negative values cause rejection of H0

Two-sample t-test
• T-distributed RV: T = (X̄1 − X̄2)/(S√(1/m + 1/n)), where S² = (Σ(X1i − X̄1)² + Σ(X2j − X̄2)²)/(m + n − 2) is the pooled variance estimate; under H0, T has the t-distribution with m + n − 2 degrees of freedom
• Observed value of the RV: t = (x̄1 − x̄2)/(s√(1/m + 1/n))

Paired t-test
• Suppose the values of X1i and X2i are logically paired in some manner
• Can instead perform a paired t-test: use the differences Di = X1i − X2i for our test
• H0: μD = 0, H1: μD ≠ 0
• Then use T = D̄√n/SD as our test statistic
• This method can eliminate sources of variance
• Beginnings of the basis for ANOVA, where we break variation into different components
• Also the foundation for the F-test, a test of the ratio between two variances

Chi-square test
• Consider a multinomial distribution
• H0: pi equals a specified value for each i = 1..k; H1: at least one pi differs from its predefined value
• Use X² as our test statistic: X² = Σ(Yi − npi)²/(npi)
• Larger observed values of X² lead to rejection of H0
• When H0 is true and n is large, X² has approximately the chi-square distribution with k − 1 degrees of freedom

Association tests
• Compare elements of a population by placing each into one of a number of categories for two properties
• Fisher's exact test compares two different binary properties of a population
• H0: the two properties are independent of one another; H1: the two properties are dependent in some manner
• Can also use the chi-square test on tables with an arbitrary number of rows and columns

Hypothesis Testing with Maximum as Test Statistic
• Bioinformatics has several areas where the maximum of many RVs is a useful test statistic
• BLAST, local alignment of sequences: we only care about the most likely alignment
• Let X1, X2, ..., Xn ~ N(μi, 1)
• H0: μi = 0 for all i; H1: one μi > 0, with the rest μi = 0
• Optimal test statistic: Xmax
• Reject H0 if P(Xmax > xmax | H0) < α
• Use 1 − F(xmax)ⁿ to find the p-value, where F is the N(0, 1) cdf
• Some options still exist if we cannot calculate the cdf; one possibility is the total variation distance

Nonparametric Tests
• The two-sample t-test is a distribution-dependent test: it relies on the RVs having the normal distribution
• If we use the t-test when at least one of the underlying RVs is not normal, the calculated p-value gives an invalid testing procedure
• Nonparametric, or distribution-free, tests avoid the problems of using tests specific to a distribution

Permutation Tests
• Avoid the assumption of a normal distribution
• Have RVs X11, X12, ..., X1m iid and X21, X22, ..., X2n iid, with possibly differing distributions
• Assume that X1i is independent of X2j for all (i, j)
• H0: the X1i are distributed identically to the X2j; H1: the distributions differ
• There are Q = C(m + n, m) possible placements of X11, ..., X1m, X21, ..., X2n into two groups of sizes m and n ("permutations")
• H0 says each of the Q placements has the same probability of arising

Permutation Tests
• Calculate the test statistic for each permutation
• Reject H0 if the observed value of the statistic is among the most extreme 100α% of the values of the test statistic
• The choice of test statistic depends on what we think may differ between the two distributions: a t-statistic if we suspect different means, an F-statistic if different variances
• Problems with these tests: granularity with too few samples, computational complexity with too many

Mann-Whitney Test
• A frequently used alternative to the two-sample t-test
• The observed values x11, x12, ..., x1m and x21, x22, ..., x2n are listed in increasing order
• Associate every observation with its rank in this list
• The sum of all ranks is (m + n)(m + n + 1)/2
• H0: the X1i and X2j are identically distributed; H1: at least one parameter of the distributions differs
• For large sample sizes, use the central limit theorem to test the null hypothesis with a z-score
• For small sample sizes, can calculate an exact p-value as a permutation test

Wilcoxon Signed-rank Test
• A test for the value of the median of a generic continuous RV; if the distribution is symmetric, it also tests the mean
• H0: med = M0, H1: med ≠ M0
• Calculate the absolute differences |xi − M0|, order them from smallest to largest, and give ranks to each value
• The observed test statistic is the sum of the ranks of the positive differences
• Use the central limit theorem to compare groups with a large number of samples
• Can also calculate an exact p-value as a permutation test for small sample sizes

Multiple Associated Tests
• If we test many associated hypotheses where each H0 is true, chance will lead to one or more being rejected
• A family-wide p-value can be used to avoid this result
• If we want a family-wide significance level of 0.05 over g different tests, each test should use α = 0.05/g
• This correction applies even if the tests are not independent of one another (recall the indicator variable discussion)
• Obvious problem: if we perform many different tests, this procedure results in a very low required p-value to reject H0 in each individual test

Multiple Experiments
• In science, it is common to repeat tests to verify results
• What if the p-values of each test are close to α but not less?
• Use a combined p-value to show the significance of each p-value in conjunction with the others
• V = −2 ln(P1 P2 ... Pk) gives a quantity with a chi-square distribution with 2k degrees of freedom when every H0 is true
• Can result in seeing significant results even when no individual null hypothesis was rejected

Questions?
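The combined p-value from the Multiple Experiments slide can be sketched in a few lines of plain Python. The three p-values below are hypothetical illustrative numbers, and 12.592 is the standard 5% critical value of the chi-square distribution with 2k = 6 degrees of freedom, taken from a chi-square table.

```python
import math

# Hypothetical p-values from k = 3 repeated experiments, each close to
# but not below alpha = 0.05: none is individually significant.
p_values = [0.06, 0.08, 0.07]
k = len(p_values)

# Combining statistic: V = -2 * ln(p1 * p2 * ... * pk), which has the
# chi-square distribution with 2k degrees of freedom when every H0 is true.
V = -2.0 * sum(math.log(p) for p in p_values)

# 5% critical value of chi-square with 2k = 6 degrees of freedom.
critical = 12.592
print(V > critical)  # combined evidence is significant at alpha = 0.05
```

Here V ≈ 16.0 exceeds 12.592, so the combined test rejects even though every individual test failed to, exactly the situation the slide describes.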