732A36 Theory of Statistics
Course within the Master's program in Statistics and Data Mining
Fall semester 2011

Course details

Course web: www.ida.liu.se/~732A36
Course responsible, tutor and examiner: Anders Nordgaard
Course period: Nov 2011 - Jan 2012
Examination: Written exam in January 2012, compulsory assignments
Course literature: Garthwaite PH, Jolliffe IT and Jones B (2002). Statistical Inference. 2nd ed. Oxford University Press, Oxford. ISBN 0-19-857226-3

Course contents

• Statistical inference in general
• Point estimation (unbiasedness, consistency, efficiency, sufficiency, completeness)
• Information and likelihood concepts
• Maximum-likelihood and method-of-moments estimation
• Classical hypothesis testing (power functions, the Neyman-Pearson lemma, maximum likelihood ratio tests, Wald's test)
• Confidence intervals
• …

Course contents, cont.

• Statistical decision theory (loss functions, risk concepts, prior distributions, sequential tests)
• Bayesian inference (estimation, hypothesis testing, credible intervals, predictive distributions)
• Non-parametric inference
• Computer-intensive methods for estimation

Details about teaching and examination

Teaching is (as usual) sparse: a mixture of lectures and problem seminars.

Lectures: an overview and some details of each chapter covered; not full coverage of the contents.

Problem seminars: discussions of solutions to recommended exercises. Students should be prepared to present solutions on the board!

Towards the end of the course a couple of larger compulsory assignments (whose solutions need to be worked out with the help of a computer) will be distributed. The course finishes with a written exam.

Prerequisites

• Good understanding of calculus and algebra
• Good understanding of the concept of expectation (including variance calculations)
• Familiarity with families of probability distributions (Normal, Exponential, Binomial, Poisson, Gamma (Chi-square), Beta, …)
• Skills in computer programming (e.g. with R, SAS, Matlab)

Statistical inference in general

[Diagram: Population – Model – Sample]

Conclusions about the population are drawn from the sample with the assistance of a specified model.

The two paradigms: Neyman-Pearson (frequentist) and Bayesian

• Neyman-Pearson: the model specifies the probability distribution for the data obtained in a sample, including a number of unknown population parameters.
• Bayesian: the model specifies the probability distribution for the data obtained in a sample and a probability distribution (prior) for each of the unknown population parameters of that distribution.
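To make the contrast concrete, here is a minimal R sketch (illustrative, not part of the course materials): the same Bernoulli sample analysed under both paradigms. The sample size, the success probability 0.4 and the uniform Beta(1, 1) prior are arbitrary choices for the example.

```r
# Minimal sketch: the same data analysed under the two paradigms.
# Assumptions (not from the slides): a Bernoulli(p) sample and an
# illustrative Beta(1, 1) (uniform) prior for p.
set.seed(1)
x <- rbinom(30, size = 1, prob = 0.4)    # observed sample

# Frequentist: p is a fixed unknown constant, estimated from data alone
p_hat <- mean(x)                         # maximum-likelihood estimate of p

# Bayesian: p has a prior; prior and data combine into a posterior
a <- 1; b <- 1                           # Beta(1, 1) prior parameters
post_a <- a + sum(x)                     # Beta posterior parameters
post_b <- b + length(x) - sum(x)
post_mean <- post_a / (post_a + post_b)  # posterior mean as point estimate

c(ML = p_hat, Bayes = post_mean)
```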
How is inference made?

• Point estimation: find the "best" approximation of an unknown population parameter.
• Interval estimation: find a range of values that with high certainty covers the unknown population parameter. Can be extended to regions if the parameter is multidimensional.
• Hypothesis testing: make statements about the population (values of parameters, probability distributions, issues of independence, …) along with a quantitative measure of "certainty".

Tools for making inference

• Criteria for a point estimate to be "good"
• "Algorithmic" methods to find point estimates (maximum likelihood, least squares, method of moments)
• Classical methods of constructing hypothesis tests (the Neyman-Pearson lemma, maximum likelihood ratio tests, …)
• Classical methods of constructing confidence intervals (regions)
• Decision theory (making use of loss and risk functions, utility and cost) to find point estimates and hypothesis tests
• Using prior distributions to construct tests, credible intervals and predictive distributions (Bayesian inference)

Tools for making inference, cont.

• Using the theory of randomization to form non-parametric tests (tests not depending on any probability distribution behind the data)
• Computer-intensive methods (bootstrap and cross-validation techniques)
• Advanced models for data that make use of auxiliary information (explanatory variables): generalized linear models, generalized additive models, spatio-temporal models, …

The univariate population-sample model

The population to be investigated is such that the values that come out in a sample $x_1, x_2, \dots$ are governed by a probability distribution. The probability distribution is represented by a probability density (or mass) function $f(x)$.

Alternatively, the sample values can be seen as the outcomes of independent random variables $X_1, X_2, \dots$, all with probability density (or mass) function $f(x)$.

Point estimation (frequentist paradigm)

We have a sample $x = (x_1, \dots, x_n)$ from a population. The population contains an unknown parameter $\theta$. The functional forms of the distributional functions may be known or unknown, but they depend on the unknown $\theta$. Denote generally by $f(x; \theta)$ the probability density or mass function of the distribution.

A point estimate of $\theta$ is a function of the sample values,

$$\hat\theta = \hat\theta(x) = \hat\theta(x_1, \dots, x_n),$$

such that its values should be close to the unknown $\theta$.
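As a concrete illustration of the definition, a minimal R sketch (an assumed example, not from the text) where the unknown parameter is the mean $\theta$ of an exponential population and the point estimate is the function $\hat\theta(x_1, \dots, x_n) = \bar x$:

```r
# Minimal sketch (assumed example): a point estimate is just a function
# of the sample values. Here theta is the mean of an exponential
# population and theta_hat(x) = mean(x).
set.seed(2)
theta <- 5
x <- rexp(n = 50, rate = 1 / theta)   # sample x_1, ..., x_n
theta_hat <- function(x) mean(x)      # the function theta_hat(x_1, ..., x_n)
theta_hat(x)                          # should be close to theta = 5
```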
"Standard" point estimates

The sample mean $\bar x$ is a point estimate of the population mean $\mu$:

$$\hat\mu(x_1, \dots, x_n) = \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The sample variance $s^2$ is a point estimate of the population variance $\sigma^2$:

$$\hat\sigma^2(x_1, \dots, x_n) = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2$$

The sample proportion $p$ of a specific event (a specific value or range of values) is a point estimate of the corresponding population proportion:

$$p = \frac{\#\{x_i : \text{the event is satisfied}\}}{n}$$

Assessing a point estimate

A point estimate has a sampling distribution: replace the sample observations $x_1, \dots, x_n$ with their corresponding random variables $X_1, \dots, X_n$ in the functional expression:

$$\hat\theta = \hat\theta(X_1, \dots, X_n)$$

The result is a random variable, the point estimator, whose observed value in the sample is the point estimate. As a random variable, the point estimator has a probability distribution that can be deduced from $f(x; \theta)$. The point estimator/estimate is assessed by investigating its sampling distribution, in particular the mean and the variance.

Unbiasedness

A point estimator is unbiased for $\theta$ if the mean of its sampling distribution is equal to $\theta$:

$$E[\hat\theta] = E[\hat\theta(X_1, \dots, X_n)] = \theta$$

The bias of a point estimator for $\theta$ is

$$\operatorname{bias}(\hat\theta) = E[\hat\theta] - \theta$$

Thus, a point estimator with bias = 0 is unbiased; otherwise it is biased.

Examples (within the univariate population-sample model)

The sample mean is always unbiased for estimating the population mean:

$$E[\bar X] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n\mu = \mu$$

Is the sample mean an unbiased estimate of the population median?

Why do we divide by $n-1$ in the sample variance (and not by $n$)? Note that

$$E\left[\sum_{i=1}^{n} \left(X_i - \bar X\right)^2\right] = \sum_{i=1}^{n} E[X_i^2] - nE[\bar X^2]$$

Consistency

A point estimator is (weakly) consistent if

$$\Pr\left(|\hat\theta - \theta| > \varepsilon\right) \to 0 \text{ as } n \to \infty, \text{ for any } \varepsilon > 0$$

Thus, the point estimator should converge in probability to $\theta$.

Theorem: A point estimator is consistent if

$$\operatorname{bias}(\hat\theta) \to 0 \text{ and } \operatorname{Var}(\hat\theta) \to 0 \text{ as } n \to \infty$$

Proof: Use Chebyshev's inequality in terms of

$$\operatorname{MSE}(\hat\theta) = E\left[\left(\hat\theta - \theta\right)^2\right] = \operatorname{Var}(\hat\theta) + \left(\operatorname{bias}(\hat\theta)\right)^2$$

Examples

The sample mean is a consistent estimator of the population mean. What probability law can be applied? (See the simulation sketch below.)

What do we require for the sample variance to be a consistent estimator of the population variance?

$$\operatorname{Var}(s^2) = \operatorname{Var}\left(\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar X^2\right)\right) = \frac{1}{(n-1)^2}\left(\sum_{i=1}^{n}\operatorname{Var}(X_i^2) + n^2\operatorname{Var}(\bar X^2) - 2n\operatorname{Cov}\left(\sum_{i=1}^{n} X_i^2,\, \bar X^2\right)\right) = \dots$$
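For the first question, a minimal R sketch (illustrative, not from the text) pointing to the weak law of large numbers: the probability that $\bar X$ deviates from $\mu$ by more than a fixed $\varepsilon$ shrinks as $n$ grows. The exponential population and the value $\varepsilon = 0.1$ are arbitrary choices for the example.

```r
# Minimal sketch (illustrative): consistency of the sample mean.
# As n grows, Pr(|Xbar - mu| > eps) shrinks towards 0, in line with
# the weak law of large numbers.
set.seed(3)
mu <- 2; eps <- 0.1
for (n in c(10, 100, 1000, 10000)) {
  xbar <- replicate(5000, mean(rexp(n, rate = 1 / mu)))
  cat("n =", n, " Pr(|Xbar - mu| > eps) ~", mean(abs(xbar - mu) > eps), "\n")
}
```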
Efficiency

Assume we have two unbiased estimators of $\theta$, i.e.

$$\hat\theta_1, \hat\theta_2 : \quad E[\hat\theta_1] = E[\hat\theta_2] = \theta$$

If

$$\operatorname{Var}(\hat\theta_1) \le \operatorname{Var}(\hat\theta_2)$$

with strict inequality for at least one value of $\theta$, then $\hat\theta_1$ is said to be more efficient than $\hat\theta_2$.

The efficiency of an unbiased estimator $\hat\theta_j$ is defined as

$$\operatorname{eff}(\hat\theta_j) = \frac{\min_i \operatorname{Var}(\hat\theta_i)}{\operatorname{Var}(\hat\theta_j)} \le 1$$

Example

Let

$$\hat\theta_1 = \bar X = \frac{1}{n}\sum_{i=1}^{n} X_i \;\; (n > 2) \quad \text{and} \quad \hat\theta_2 = \frac{X_1 + X_n}{2}$$

Then

$$E[\hat\theta_1] = \mu; \qquad E[\hat\theta_2] = \frac{E[X_1] + E[X_n]}{2} = \frac{\mu + \mu}{2} = \mu$$

so both estimators are unbiased. Further,

$$\operatorname{Var}(\hat\theta_1) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}$$

$$\operatorname{Var}(\hat\theta_2) = \frac{\operatorname{Var}(X_1) + \operatorname{Var}(X_n)}{4} = \frac{2\sigma^2}{4} = \frac{\sigma^2}{2} > \frac{\sigma^2}{n} \text{ since } n > 2$$

$\Rightarrow \hat\theta_1$ is more efficient than $\hat\theta_2$.

Likelihood function

For a sample $x$, the likelihood function for $\theta$ is defined as

$$L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta)$$

and the log-likelihood function is

$$l(\theta; x) = \ln L(\theta; x) = \sum_{i=1}^{n} \ln f(x_i; \theta)$$

Both measure how likely (or expected) the sample is.

Fisher information

The (Fisher) information about $\theta$ contained in a sample $x$ is defined as

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\, l(\theta; X)\right)^2\right], \qquad X = (X_1, \dots, X_n)$$

Theorem: Under some regularity conditions (interchangeability of integration and differentiation),

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\, l(\theta; X)\right]$$

In particular, the range of $X$ cannot depend on $\theta$ (as it does, for example, in a population where $X \sim U(0, \theta)$).

Why is it a measure of information?

$L(\theta; x)$ and $l(\theta; x)$ are related to the probability of the obtained sample. How does this probability change with $\theta$? This is measured by $\partial l / \partial\theta$.

If $\partial l / \partial\theta$ is close to 0, the probability is not much affected by slight changes of $\theta$ $\Rightarrow$ the sample does not contain much information about $\theta$.

If $\partial l / \partial\theta$ is largely positive or largely negative, the probability changes a lot if $\theta$ is slightly changed.

$\left(\partial l / \partial\theta\right)^2$ measures the amount of information about $\theta$ in the particular sample; $E\left[\left(\partial l / \partial\theta\right)^2\right]$ measures generally the amount of information about $\theta$ in a sample from the current distribution.
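The example that follows derives $I(\theta) = n/\theta^2$ for the $\operatorname{Exp}(\theta)$ model. A minimal R simulation (illustrative, not from the text) can check that value by averaging the squared score $\left(\partial l / \partial\theta\right)^2$ over many samples:

```r
# Minimal sketch (illustrative): numeric check of the Fisher information
# for X ~ Exp(theta) with density f(x; theta) = exp(-x / theta) / theta.
# The score is dl/dtheta = -n/theta + sum(x)/theta^2; E[score^2] = n/theta^2.
set.seed(4)
theta <- 2; n <- 20
score <- replicate(10000, {
  x <- rexp(n, rate = 1 / theta)
  -n / theta + sum(x) / theta^2
})
c(simulated = mean(score^2), theoretical = n / theta^2)
```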
Example

$X \sim \operatorname{Exp}(\theta)$, i.e. $f(x; \theta) = \theta^{-1} e^{-x/\theta}$:

$$L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} \theta^{-1} e^{-x_i/\theta} = \theta^{-n} e^{-\theta^{-1}\sum_{i=1}^{n} x_i}$$

$$l(\theta; x) = \ln L(\theta; x) = -n\ln\theta - \theta^{-1}\sum_{i=1}^{n} x_i$$

$X \sim \operatorname{Exp}(\theta)$ fulfils the regularity conditions, and

$$\frac{\partial l}{\partial\theta} = -\frac{n}{\theta} + \theta^{-2}\sum_{i=1}^{n} x_i, \qquad \frac{\partial^2 l}{\partial\theta^2} = \frac{n}{\theta^2} - 2\theta^{-3}\sum_{i=1}^{n} x_i$$

$$I(\theta) = -E\left[\frac{\partial^2 l}{\partial\theta^2}\right] = -\frac{n}{\theta^2} + 2\theta^{-3}\sum_{i=1}^{n} E[X_i] = -\frac{n}{\theta^2} + 2\theta^{-3}\cdot n\theta = \frac{n}{\theta^2}$$

Cramér-Rao inequality

Under the same regularity conditions as for the previous theorem, the following holds for any unbiased estimator $\hat\theta$:

$$\operatorname{Var}(\hat\theta) \ge \frac{1}{I(\theta)}$$

The lower bound is attained if and only if

$$\frac{\partial l}{\partial\theta} = I(\theta)\left(\hat\theta - \theta\right)$$

Proof:

$$\frac{\partial}{\partial\theta} E[\hat\theta] = \frac{\partial}{\partial\theta}\int_{y_1}\cdots\int_{y_n} \hat\theta(y_1, \dots, y_n)\, f(y_1; \theta)\cdots f(y_n; \theta)\, dy_1\cdots dy_n = \frac{\partial}{\partial\theta}\int_{y} \hat\theta(y)\, L(\theta; y)\, dy$$

But as $\hat\theta$ is unbiased, $E[\hat\theta] = \theta$, so

$$\frac{\partial}{\partial\theta} E[\hat\theta] = \frac{\partial}{\partial\theta}\,\theta = 1$$

On the other hand, by the regularity conditions and the rule $\frac{d}{dx}\ln g(x) = \frac{g'(x)}{g(x)}$, i.e. $\frac{\partial L(\theta; y)}{\partial\theta} = \frac{\partial l(\theta; y)}{\partial\theta}\, L(\theta; y)$,

$$\frac{\partial}{\partial\theta}\int_{y} \hat\theta(y)\, L(\theta; y)\, dy = \int_{y} \hat\theta(y)\, \frac{\partial L(\theta; y)}{\partial\theta}\, dy = \int_{y} \hat\theta(y)\, \frac{\partial l(\theta; y)}{\partial\theta}\, L(\theta; y)\, dy = E\left[\hat\theta(X)\, \frac{\partial l(\theta; X)}{\partial\theta}\right]$$

Hence

$$E\left[\hat\theta(X)\, \frac{\partial l(\theta; X)}{\partial\theta}\right] = 1$$

Moreover,

$$E\left[\frac{\partial l(\theta; X)}{\partial\theta}\right] = \int_{y} \frac{\partial l(\theta; y)}{\partial\theta}\, L(\theta; y)\, dy = \int_{y} \frac{\partial L(\theta; y)}{\partial\theta}\, dy = \frac{\partial}{\partial\theta}\int_{y} L(\theta; y)\, dy = \frac{\partial}{\partial\theta}\, 1 = 0$$

so that

$$\operatorname{Cov}\left(\hat\theta, \frac{\partial l}{\partial\theta}\right) = E\left[\hat\theta\cdot\frac{\partial l}{\partial\theta}\right] - E[\hat\theta]\, E\left[\frac{\partial l}{\partial\theta}\right] = 1 - \theta\cdot 0 = 1$$

The theoretical correlation between two variables $U$ and $V$ satisfies

$$|\rho(U, V)| = \frac{|\operatorname{Cov}(U, V)|}{\sqrt{\operatorname{Var}(U)\operatorname{Var}(V)}} \le 1 \;\Rightarrow\; \left(\operatorname{Cov}(U, V)\right)^2 \le \operatorname{Var}(U)\operatorname{Var}(V)$$

Let $U = \hat\theta$ and $V = \partial l / \partial\theta$. Since $E[\partial l/\partial\theta] = 0$, we have $\operatorname{Var}(\partial l/\partial\theta) = E\left[(\partial l/\partial\theta)^2\right] = I(\theta)$, and therefore

$$1 = \left(\operatorname{Cov}\left(\hat\theta, \frac{\partial l}{\partial\theta}\right)\right)^2 \le \operatorname{Var}(\hat\theta)\,\operatorname{Var}\left(\frac{\partial l}{\partial\theta}\right) = \operatorname{Var}(\hat\theta)\, I(\theta) \;\Rightarrow\; \operatorname{Var}(\hat\theta) \ge \frac{1}{I(\theta)} \qquad\blacksquare$$

Example

$X \sim \operatorname{Exp}(\theta)$:

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2 = \int_0^\infty x^2\, \theta^{-1} e^{-x/\theta}\, dx - \theta^2 = \left[-x^2 e^{-x/\theta}\right]_0^\infty + 2\int_0^\infty x\, e^{-x/\theta}\, dx - \theta^2 = 2\theta\int_0^\infty x\, \theta^{-1} e^{-x/\theta}\, dx - \theta^2 = 2\theta^2 - \theta^2 = \theta^2$$

$$\Rightarrow\; \operatorname{Var}(\bar X) = \frac{\theta^2}{n} = \frac{1}{I(\theta)}$$

$\Rightarrow \bar X$ as an estimator of $\theta$ attains the Cramér-Rao lower bound.
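A minimal R simulation (illustrative, not from the text) confirming that the variance of $\bar X$ in the $\operatorname{Exp}(\theta)$ model matches the Cramér-Rao bound $\theta^2/n$ just derived; the values of $\theta$ and $n$ are arbitrary:

```r
# Minimal sketch (illustrative): Var(Xbar) for the Exp(theta) model
# should match the Cramér-Rao bound 1/I(theta) = theta^2 / n.
set.seed(5)
theta <- 2; n <- 25
xbar <- replicate(20000, mean(rexp(n, rate = 1 / theta)))
c(simulated = var(xbar), bound = theta^2 / n)
```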
Sufficiency

A function $T$ of the sample values of a sample $x$, i.e. $T = T(x) = T(x_1, \dots, x_n)$, is a statistic that is sufficient for the parameter $\theta$ if the conditional distribution of the sample random variables given $T$ does not depend on $\theta$, i.e.

$$f_{X_1, \dots, X_n \mid T(X_1, \dots, X_n) = t}(y_1, \dots, y_n \mid t) \text{ cannot be written as a function of } \theta$$

What does it mean in practice? If $T$ is sufficient for $\theta$, then no more information about $\theta$ than what is contained in $T$ can be obtained from the sample. It is enough to work with $T$ when deriving point estimates of $\theta$.

Example

Assume $x = (x_1, x_2)$ is a sample from $\operatorname{Exp}(\theta)$. Let $T(x_1, x_2) = x_1 + x_2$ and assume $T = t$ is observed, so that $x_2 = t - x_1$:

$$f_{X_1, X_2, T}(y_1, y_2, t) = f_{X_1, X_2}(y_1, t - y_1) = \theta^{-1} e^{-y_1/\theta}\cdot \theta^{-1} e^{-(t - y_1)/\theta} = \theta^{-2} e^{-t/\theta}$$

$f_T(t) = {}?$ Derive it by differentiating

$$F_T(t) = \Pr(T \le t) = \Pr(X_1 + X_2 \le t) = \iint_{y_1 + y_2 \le t} \theta^{-1} e^{-y_1/\theta}\, \theta^{-1} e^{-y_2/\theta}\, dy_1\, dy_2$$

$$= \int_{y_1=0}^{t} \theta^{-1} e^{-y_1/\theta} \left(\int_{y_2=0}^{t - y_1} \theta^{-1} e^{-y_2/\theta}\, dy_2\right) dy_1 = \int_{y_1=0}^{t} \theta^{-1} e^{-y_1/\theta} \left(1 - e^{-(t - y_1)/\theta}\right) dy_1$$

$$= \int_{y_1=0}^{t} \theta^{-1} e^{-y_1/\theta}\, dy_1 - \int_{y_1=0}^{t} \theta^{-1} e^{-t/\theta}\, dy_1 = 1 - e^{-t/\theta} - t\,\theta^{-1} e^{-t/\theta}$$

$$\Rightarrow\; f_T(t) = F_T'(t) = \theta^{-1} e^{-t/\theta} - \theta^{-1} e^{-t/\theta} + t\,\theta^{-2} e^{-t/\theta} = t\,\theta^{-2} e^{-t/\theta}$$

$$\Rightarrow\; f_{X_1, X_2 \mid T = t}(y_1, y_2) = \frac{\theta^{-2} e^{-t/\theta}}{t\,\theta^{-2} e^{-t/\theta}} = \frac{1}{t}$$

which does not depend on $\theta$ $\Rightarrow$ $T = X_1 + X_2$ is sufficient for $\theta$.

The factorization theorem

$T$ is sufficient for $\theta$ if and only if the likelihood function can be written

$$L(\theta; x) = K_1\left(T(x); \theta\right)\cdot K_2(x)$$

i.e. it can be factorized using two non-negative functions such that the first depends on $x$ only through the statistic $T$ (and also on $\theta$), and the second does not depend on $\theta$.

Example, cont.

$X \sim \operatorname{Exp}(\theta)$. Let $T(x) = \sum_{i=1}^{n} x_i$. Then

$$L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} \theta^{-1} e^{-x_i/\theta} = \theta^{-n} e^{-\theta^{-1}\sum_{i=1}^{n} x_i} = \underbrace{\theta^{-n} e^{-\theta^{-1}\sum x_i}}_{K_1\left(\sum x_i;\, \theta\right)} \cdot \underbrace{1}_{K_2(x)}$$

$\Rightarrow \sum_{i=1}^{n} x_i$ is sufficient for $\theta$.
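A minimal R sketch (illustrative, not from the text) of what this factorization means in practice: since $L(\theta; x)$ depends on $x$ only through $\sum x_i$, two samples of the same size with the same sum produce identical (log-)likelihood functions. The particular samples and grid of $\theta$-values are arbitrary.

```r
# Minimal sketch (illustrative): for the Exp(theta) model the likelihood
# depends on the data only through sum(x). Two different samples with
# the same n and the same sum give identical log-likelihood curves.
loglik <- function(theta, x) -length(x) * log(theta) - sum(x) / theta

x1 <- c(1.0, 2.0, 6.0)    # sum = 9
x2 <- c(3.0, 3.0, 3.0)    # same n, same sum
theta_grid <- seq(0.5, 10, by = 0.5)
all.equal(sapply(theta_grid, loglik, x = x1),
          sapply(theta_grid, loglik, x = x2))   # TRUE
```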