Data, Models, Parameters, and Statistics

In this lecture, we will see more datasets and give a brief introduction to some typical models and setups.

In statistics, our starting point is a collection of data $X_1, X_2, \ldots, X_n$. Each $X_i$ could be a number, a vector, or even a matrix. Our goal is to draw useful information from the data.

Examples:

1. Old Faithful data.

data(faithful)
faithful

eruptions: numeric. Eruption time in minutes.
waiting: numeric. Waiting time to the next eruption (in minutes).

2. ChickWeight data.

data(ChickWeight)
ChickWeight

weight: a numeric vector giving the body weight of the chick (gm).
Time: a numeric vector giving the number of days since birth when the measurement was made.
Chick: an ordered factor with levels 18 < ... < 48 giving a unique identifier for the chick.
Diet: a factor with levels 1, ..., 4 indicating which experimental diet the chick received.

3. Longley's Economic Regression Data.

data(longley)
longley

This is a macroeconomic data set which provides a well-known example of highly collinear regression.

GNP.deflator: GNP implicit price deflator (1954 = 100).
GNP: Gross National Product.
Unemployed: number of unemployed.
Armed.Forces: number of people in the armed forces.
Population: 'noninstitutionalized' population >= 14 years of age.
Year: the year (time).
Employed: number of people employed.

lm(Employed ~ GNP, data = longley)

4. Air passenger data.

data(AirPassengers)
AirPassengers

Assumptions:

Once we have a dataset, we need proper assumptions to do statistical inference (estimation, testing, prediction, confidence intervals, etc.). Common assumptions include:

1. The samples are independent.
2. The samples are identically distributed.
3. A particular relationship holds among the coordinates of each sample (linear, for example).
4. The samples follow a particular distribution (normal, exponential, uniform, etc.).
5. ...

We should be careful when applying these assumptions to a dataset.

Parameters:

If we assume the samples follow some particular distribution, the distribution will have parameters, which are generally unknown.

Example: Michelson-Morley speed of light data.

data(morley)
morley
attach(morley)
hist(Speed)
qqnorm(Speed)

The samples of Speed look approximately normal, so it is reasonable to assume that Speed follows a $N(\mu, \sigma^2)$ distribution. But the parameters $\mu$ and $\sigma^2$ are unknown, and we need to estimate them in some cases.

Basic Models and Goals

1. Estimation. We observe i.i.d. samples $X_1, X_2, \ldots, X_n$ following some distribution with parameter $\theta$. Our goal is to estimate $\theta$, or more generally a function of $\theta$, $q(\theta)$.

2. Confidence interval. We do not need an actual estimate of the parameter; instead, we want to find an interval $(C_l, C_u)$ that covers the true parameter with high probability (for example, 95%).

3. Hypothesis testing. We want a yes-or-no answer to some question, for example $\theta = 0$ versus $\theta \neq 0$, or $\mu_1 = \mu_2$ versus $\mu_1 \neq \mu_2$. For instance, in the ChickWeight data we want to compare the weights of chicks on different diets.

4. Prediction. Predict the value of the next observation, as in the air passenger data.

5. Linear regression model. We observe paired data $(x_1, y_1), \ldots, (x_n, y_n)$. We assume $x_1, x_2, \ldots, x_n$ are nonrandom and $y_1, y_2, \ldots, y_n$ are realizations of the random variables $Y_i = \alpha + \beta x_i + \epsilon_i$, where the $\epsilon_i$ are independent random variables with expectation 0 and variance $\sigma^2$. Here $\alpha$, $\beta$, and $\sigma^2$ are unknown parameters. The line $y = \alpha + \beta x$ is called the regression line, and we want to estimate it (see the sketch after this list).

data(trees)
attach(trees)
plot(Volume, Girth)
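To make item 5 concrete, here is a minimal R sketch that fits and draws a regression line for the trees data plotted above. Regressing Girth on Volume matches the axes of the plot(Volume, Girth) call; treating Volume as the predictor here is purely illustrative.

# Fit a simple linear regression of Girth on Volume and overlay the fitted line
data(trees)
fit <- lm(Girth ~ Volume, data = trees)
coef(fit)    # estimates of the intercept (alpha) and slope (beta)
plot(trees$Volume, trees$Girth, xlab = "Volume", ylab = "Girth")
abline(fit)  # draw the estimated regression line y = alpha-hat + beta-hat * x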
Measurement of Performance

Once we have an answer to a statistical problem, we need to know how good it is; that is, we need to measure the performance of our decision. Common criteria include unbiasedness, mean squared error, efficiency, and so on.

Unbiased Estimator

In this lecture, we will study the estimation problem. Our goal here is to use the dataset to estimate a quantity of interest. We will focus on the case where the quantity of interest is a function of the parameter of the distribution of the samples.

Examples:

1. data(morley)
We want to estimate the speed of light, under the normal assumption.

2. Exponential distribution (lifetime of a machine).
X <- rexp(100, rate = 2)
Let us pretend that we do not know the true rate (which is 2) and estimate it from the samples.

An estimate is a value that depends only on the dataset $x_1, x_2, \ldots, x_n$; that is, the estimate $t$ is a function of the dataset, $t = h(x_1, x_2, \ldots, x_n)$. One can often think of several estimates for the parameter of interest. In Example 1, we could use the sample mean or the sample median. In Example 2, we could use the reciprocal of the sample mean, or $\log 2$ divided by the sample median.

Then we need to answer the following questions: When is one estimate better than another? Does there exist a best estimate?

The dataset $x_1, x_2, \ldots, x_n$ is a realization of the random variables $X_1, X_2, \ldots, X_n$, so the estimate $t$ is a realization of the random variable $T = h(X_1, X_2, \ldots, X_n)$. $T$ is called an estimator.

Example:

y <- rep(0, 50)
z <- rep(0, 50)
for (i in 1:50) {
  X <- rexp(100, rate = 2)
  y[i] <- 1 / mean(X)          # reciprocal of the sample mean
  z[i] <- log(2) / median(X)   # log(2) divided by the sample median
}

For each set of samples we get one estimate, so the estimator $T = \frac{100}{X_1 + \cdots + X_{100}}$ is a random variable. We need to investigate the behavior of the estimators.

hist(y); mean(y); var(y)
hist(z); mean(z); var(z)

The mean squared error of an estimator is defined as $\mathrm{MSE}(T) = E_\theta[(T - \theta)^2]$.

mean((y - 2)^2)
mean((z - 2)^2)

Now we know that an estimator is a random variable. The probability distribution of $T = h(X_1, X_2, \ldots, X_n)$ is also called the sampling distribution of $T$.

Definition: An estimator $T$ is called an unbiased estimator of the parameter $\theta$ if $E_\theta(T) = \theta$ for all $\theta$. In general, the difference $E_\theta(T) - \theta$ is called the bias of $T$.

Consider the normal mean problem. Suppose $X_1, X_2, \ldots, X_n$ follow the $N(\mu, 1)$ distribution and we want to estimate $\mu$. Since $\mu$ is the expectation of the distribution, an intuitive estimator is $T = \frac{X_1 + \cdots + X_n}{n} = \bar{X}_n$. This is an unbiased estimator.

Unbiased estimators for the expectation and variance

Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Then we have the following unbiased estimators for both of them:

$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$ is an unbiased estimator of $\mu$, and

$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ is an unbiased estimator of $\sigma^2$.

Remarks: Unbiased estimators do not necessarily exist. Moreover, unbiasedness does not always carry over: $T$ being an unbiased estimator of $\theta$ does not mean that $g(T)$ is an unbiased estimator of $g(\theta)$, unless $g$ is a linear function.
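The claim that the $n - 1$ divisor makes $S_n^2$ unbiased can be checked by simulation. The sketch below compares the $n - 1$ and $n$ divisors; the sample size, the number of replications, and the $N(0, 4)$ distribution are illustrative choices.

# Compare the n-1 divisor (unbiased) and the n divisor (biased) for the variance
set.seed(1)    # illustrative seed, for reproducibility
n <- 10
reps <- 10000
s2.unbiased <- rep(0, reps)
s2.biased <- rep(0, reps)
for (i in 1:reps) {
  X <- rnorm(n, mean = 0, sd = 2)   # true variance is 4
  s2.unbiased[i] <- sum((X - mean(X))^2) / (n - 1)
  s2.biased[i] <- sum((X - mean(X))^2) / n
}
mean(s2.unbiased)  # close to 4
mean(s2.biased)    # close to 4 * (n - 1) / n = 3.6

The version with divisor $n$ systematically underestimates $\sigma^2$ by the factor $(n-1)/n$, which is exactly the bias that the $n - 1$ divisor removes.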
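The remark about nonlinear $g$ can also be seen numerically. For instance, $\bar{X}_n$ is unbiased for $\mu$, but $g(\bar{X}_n) = \bar{X}_n^2$ is not unbiased for $\mu^2$: since $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, we have $E(\bar{X}_n^2) = \mu^2 + \sigma^2/n$. A minimal simulation sketch, with an illustrative choice of $n$ and distribution:

# Xbar is unbiased for mu, but Xbar^2 overestimates mu^2 by sigma^2 / n
set.seed(2)    # illustrative seed
n <- 5
reps <- 10000
xbar2 <- rep(0, reps)
for (i in 1:reps) {
  X <- rnorm(n, mean = 1, sd = 1)   # mu = 1, sigma^2 = 1
  xbar2[i] <- mean(X)^2
}
mean(xbar2)   # close to mu^2 + sigma^2/n = 1 + 1/5 = 1.2, not 1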
Method of Moments

From the normal example above, we can see that if the parameter of interest is the expectation or the variance of the distribution, we can use the sample mean or the sample variance to estimate it, and this estimator is reasonable.

Suppose we have i.i.d. samples $X_1, X_2, \ldots, X_n$ from some distribution with unknown parameter $\theta$, and we want to estimate $\theta$. First calculate the expectation of the distribution, $E(X)$. Usually this is a function of $\theta$ (think about the normal or exponential distribution). Suppose $E(X) = g(\theta)$; then, under suitable conditions, $\theta$ can be written as $g^{-1}(E(X))$. Since we can always use the sample mean to estimate the expectation, we have an intuitive estimator for $\theta$:

$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n) = g^{-1}(\bar{X}_n)$.

In general, we can calculate the expectation of a function of $X$. Suppose $E(m(X)) = h(\theta)$ for some function $m(x)$. In the previous discussion, $m(x) = x$; actually, $m(x)$ could be any function, for example $x^2$, $x^3$, $e^x$, ..., as long as its expectation is easy to compute. Then an estimator of $\theta$ would be

$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n) = h^{-1}\left( \frac{m(X_1) + m(X_2) + \cdots + m(X_n)}{n} \right)$.

This method is called the method of moments. By the Law of Large Numbers, the sample moment converges to the corresponding population moment, so these estimators generally behave well for large samples.
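As an illustration, here is a method-of-moments sketch for the exponential rate from Example 2 above. For $X \sim \mathrm{Exp}(\lambda)$ we have $E(X) = 1/\lambda$ and $E(X^2) = 2/\lambda^2$, so both $m(x) = x$ and $m(x) = x^2$ lead to moment estimators of $\lambda$; the sample size and seed below are illustrative choices.

# Method of moments for the exponential rate lambda, using two choices of m(x)
set.seed(3)                  # illustrative seed
X <- rexp(1000, rate = 2)    # true rate lambda = 2

# m(x) = x: E(X) = 1/lambda, so lambda-hat = 1 / sample mean
lambda.hat1 <- 1 / mean(X)

# m(x) = x^2: E(X^2) = 2/lambda^2, so lambda-hat = sqrt(2 / sample second moment)
lambda.hat2 <- sqrt(2 / mean(X^2))

lambda.hat1   # both estimates should be close to the true rate 2
lambda.hat2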