Statistics 550 Notes 3

Reading: Section 1.3

I. Background for Problem 1.1.9 in Homework 1

The model
$$
Y_i = \sum_{j=1}^{p} z_{ij}\beta_j + \epsilon_i, \qquad \epsilon_i \sim N(0,\sigma^2),
$$
is the multiple linear regression model. We have $E(Y_i \mid z_{i1},\ldots,z_{ip}) = \sum_{j=1}^{p} z_{ij}\beta_j$. The coefficient $\beta_j$ can be interpreted as the change in the mean of $Y$ that is associated with a one-unit change in $z_j$ when $z_1,\ldots,z_{j-1},z_{j+1},\ldots,z_p$ are held fixed.

Example: The 1966 Coleman Report on "Equality of Educational Opportunity" sought to explain how student achievement in schools was associated with the resources of the school and the socioeconomic background of the student, e.g.,

$Y$ = verbal achievement score in school (6th graders)
$z_1$ = staff salaries per pupil
$z_2$ = % of students in the school's 6th grade whose father has a white-collar occupation
$z_3$ = SES (socioeconomic status)
$z_4$ = teachers' average verbal scores
$z_5$ = mothers' average education

The variables $z_1, z_2, z_3, z_4, z_5$ would be collinear if one variable were a linear function of the other variables. There was concern that this was approximately true because socioeconomic status was highly correlated with the resources of the school (staff salaries per pupil) prior to the desegregation of schools.

II. Bayesian Inference for the Normal Distribution

Suppose $X_1,\ldots,X_n$ are iid $N(\theta,\sigma^2)$ with $\sigma^2$ known, and our prior on $\theta$ is $N(\eta, b^2)$. The posterior distribution is proportional to
$$
f(x\mid\theta)\,\pi(\theta)
= \left[\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\theta)^2}{2\sigma^2}\right\}\right]
\frac{1}{\sqrt{2\pi b^2}}\exp\left\{-\frac{(\theta-\eta)^2}{2b^2}\right\}
$$
$$
\propto \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n}x_i^2 - 2n\bar{x}\theta + n\theta^2\right) - \frac{\theta^2 - 2\eta\theta}{2b^2}\right\}
$$
$$
\propto \exp\left\{-\frac{\theta^2}{2}\left(\frac{n}{\sigma^2}+\frac{1}{b^2}\right) + \theta\left(\frac{n\bar{x}}{\sigma^2}+\frac{\eta}{b^2}\right)\right\}
$$
$$
\propto \exp\left\{-\frac{1}{2}\left(\frac{n}{\sigma^2}+\frac{1}{b^2}\right)\left(\theta - \frac{n\bar{x}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2}\right)^2\right\}.
$$
Thus, the posterior distribution is
$$
N\!\left(\frac{n\bar{x}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2},\; \frac{1}{n/\sigma^2 + 1/b^2}\right).
$$

III. Bickel and Doksum's perspective on Bayesian models

(a) Bayesian models are useful as a way to generate statistical procedures that incorporate prior information when appropriate. However, statistical procedures should be evaluated in a frequentist (repeated sampling) way. For example, in the iid Bernoulli trials example, if we use the posterior mean under a uniform prior distribution to estimate $p$, i.e., $\hat{p} = \left(\sum_{i=1}^{n} x_i + 1\right)/(n+2)$, then we should look at how this estimate would perform in many repetitions of the experiment if the true parameter is $p$, for various values of $p$. More to come on this frequentist perspective in Chapter 1.3.

(b) We can view the parameter as random and view the Bayesian model as providing a joint probability distribution on the parameter and the data. Consider the model $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ for the probability distribution of the data $X$.

The subjective Bayesian perspective is that there is a true unknown $\theta$ and our goal is to describe our beliefs about $\theta$ after seeing the data $X$. This requires specifying a prior distribution $\pi(\theta)$ for our beliefs about $\theta$; the posterior distribution then describes our beliefs about $\theta$ after seeing the data $X$.

Bickel and Doksum's viewpoint is to see the Bayesian model $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$, where we put a prior distribution $\pi(\theta)$ on $\theta$, as providing a joint distribution for the parameter and the data $(\theta, X)$. For example, if the data $X_1,\ldots,X_n$ are iid Bernoulli trials with probability $p$ of success and the prior distribution for $p$ is Beta$(r,s)$, then the joint probability distribution for $(p, X_1,\ldots,X_n)$ is generated by:
1. We first generate $p$ from a Beta$(r,s)$ distribution.
2. Conditional on $p$, we generate $X_1,\ldots,X_n$ as iid Bernoulli trials with probability $p$ of success.
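To make this two-step generative description concrete, here is a minimal Python sketch (not part of the original notes; the function names and the choice $n = 25$ are purely illustrative). It draws $(p, X_1,\ldots,X_n)$ exactly as in steps 1-2 and computes the posterior mean of $p$, which for a Beta$(r,s)$ prior is $(r + \sum x_i)/(r + s + n)$; the special cases $r=s=1$ and $r=s=2$ give the estimates used later in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_joint(r, s, n):
    """Step 1: draw p from a Beta(r, s) prior.
    Step 2: given p, draw X_1, ..., X_n iid Bernoulli(p)."""
    p = rng.beta(r, s)
    x = rng.binomial(1, p, size=n)
    return p, x

def posterior_mean(x, r, s):
    """Posterior mean of p under a Beta(r, s) prior; the posterior is
    Beta(r + sum(x), s + n - sum(x))."""
    return (r + x.sum()) / (r + s + len(x))

p_true, x = generate_joint(r=1, s=1, n=25)   # uniform prior = Beta(1, 1)
print(p_true, posterior_mean(x, 1, 1))       # (sum(x) + 1) / (n + 2)
print(posterior_mean(x, 2, 2))               # Wilson estimate: (sum(x) + 2) / (n + 4)
```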
IV. Motivating Example for the Decision Theoretic Framework (Section 1.3)

A cofferdam protecting a construction site was designed to withstand flows of up to 1870 cubic feet per second (cfs). An engineer wishes to estimate the probability that the dam will be overtopped during the upcoming year. Over the previous 25 years, the annual maximum flood level has exceeded 1870 cfs 5 times. The engineer models the data on whether the flood level has exceeded 1870 cfs as independent Bernoulli trials with the same probability $p$ that the flood level will exceed 1870 cfs in each year.

Some possible estimates of $p$ based on iid Bernoulli trials $X_1,\ldots,X_n$:

(1) $\hat{p} = \sum_{i=1}^{n} X_i / n$;

(2) $\hat{p} = \left(\sum_{i=1}^{n} X_i + 1\right)/(n+2)$, the posterior mean for a uniform prior on $p$;

(3) $\hat{p} = \left(\sum_{i=1}^{n} X_i + 2\right)/(n+4)$, the posterior mean for a Beta(2,2) prior on $p$ (called the Wilson estimate, recommended by Moore and McCabe, Introduction to the Practice of Statistics).

How should we decide which of these estimates to use? The answer depends in part on how errors in the estimation of $p$ affect us.

Example 1 of decision problem: The firm wants the engineer to provide her best "guess" of $p$, the probability of an overflow, i.e., to estimate $p$ by $\hat{p}$. The firm wants the probability of an overflow to be at most 0.05. Based on the estimate $\hat{p}$ of $p$, the engineer's firm plans to spend an additional $f(\max(0, \hat{p} - 0.05))$ dollars to shore up the dam, where $f(0)=0$ and $f$ is an increasing function. By spending this money, the firm will make the probability of an overflow $\max(0,\, p - \max(0, \hat{p} - 0.05))$. The cost of an overflow to the firm is $C$ dollars. The expected cost to the firm of using the estimate $\hat{p}$ (for a true initial probability of overflow of $p$) is
$$
f(\max(0, \hat{p} - 0.05)) + C \cdot \max(0,\, p - \max(0, \hat{p} - 0.05)).
$$
We want to choose an estimate that provides low expected cost.

Example 2 of decision problem: Another decision problem besides estimating $p$ might be that the firm wants to decide whether $p \le 0.15$ or $p > 0.15$; if $p > 0.15$, the firm would like to build additional support for the dam. This is an example of a testing problem: deciding which of two subsets forming a partition of the parameter space the parameter lies in. The cost to the firm of making the wrong decision about whether $p \le 0.15$ or $p > 0.15$ depends on which type of error was made (deciding that $p > 0.15$ when in fact $p \le 0.15$, or deciding that $p \le 0.15$ when in fact $p > 0.15$).

The decision theoretic framework involves:
(1) clarifying the objectives of the study;
(2) pointing to what the different possible actions are;
(3) providing assessments of risk, accuracy, and reliability of statistical procedures;
(4) providing guidance in the choice of procedures for analyzing outcomes of experiments.

History: Abraham Wald (1950, Statistical Decision Functions) developed the foundations of the decision theoretic framework for statistics.

V. Components of the Decision Theory Framework (Section 1.3.1)

As in Section 1.1, we observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ (the statistical model). The true parameter vector $\theta$ is sometimes called the "state of nature."

Action space: The action space $\mathcal{A}$ is the set of possible actions, decisions, or claims that we can contemplate making after observing the data $X$. For Example 1 of decision problem, the action space is the set of possible estimates of $p$, $\mathcal{A} = [0,1]$. For Example 2 of decision problem, the action space is {decide that $p \le 0.15$, decide that $p > 0.15$}.

Loss function: The loss function $l(\theta, a)$ is the loss incurred by taking the action $a$ when the true parameter vector is $\theta$. The loss function is assumed to be nonnegative. We want the loss to be small.
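As an illustration of a loss criterion tied to the economics of a problem, here is a minimal Python sketch (not part of the original notes) of Example 1's expected-cost criterion applied to the flood data (5 exceedances of 1870 cfs in 25 years), comparing the three candidate estimates of $p$. The shoring-up cost function $f$, the overflow cost $C$, and the assumed true $p$ are hypothetical numbers chosen only for illustration.

```python
def expected_cost(p_hat, p_true, f, C):
    """Expected cost from Example 1: spend f(max(0, p_hat - 0.05)) to shore up
    the dam, which lowers the overflow probability to
    max(0, p_true - max(0, p_hat - 0.05)); an overflow costs C dollars."""
    reduction = max(0.0, p_hat - 0.05)
    return f(reduction) + C * max(0.0, p_true - reduction)

# Flood data from the motivating example: 5 exceedances in n = 25 years.
n, s = 25, 5
estimates = {
    "(1) sample proportion":        s / n,
    "(2) uniform-prior post. mean": (s + 1) / (n + 2),
    "(3) Wilson / Beta(2,2)":       (s + 2) / (n + 4),
}

# Hypothetical cost structure, purely for illustration:
# linear shoring cost f(x) = 200000 x, overflow cost C = 1,000,000, true p = 0.20.
f = lambda x: 200_000 * x
C, p_true = 1_000_000, 0.20

for name, p_hat in estimates.items():
    print(f"{name}: p_hat = {p_hat:.3f}, "
          f"expected cost = {expected_cost(p_hat, p_true, f, C):,.0f}")
```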
The loss function can be thought of as the negative of a utility function in economics. Ideally, we choose the loss function based on the economics of the decision problem, as in Example 1 of decision problem. More commonly, however, the loss function is chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.

Commonly used loss functions for point estimation of a real-valued parameter $q(\theta)$ (denote our estimate of $q(\theta)$ by $a$): The most commonly used loss function is quadratic (squared error) loss,
$$
l(\theta, a) = (q(\theta) - a)^2.
$$
Other choices that are less computationally convenient but that perhaps more realistically penalize large errors less heavily are:

(1) absolute value loss, $l(\theta, a) = |q(\theta) - a|$;

(2) Huber's loss functions,
$$
l(\theta, a) = \begin{cases} (q(\theta)-a)^2 & \text{if } |q(\theta)-a| \le k \\ 2k\,|q(\theta)-a| - k^2 & \text{if } |q(\theta)-a| > k \end{cases}
$$
for some constant $k$;

(3) the zero-one loss function,
$$
l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta)-a| \le k \\ 1 & \text{if } |q(\theta)-a| > k \end{cases}
$$
for some constant $k$.

Decision procedures: A decision procedure or decision rule $\delta$ specifies how we use the data to choose an action $a$. A decision procedure is a function $\delta(x)$ from the sample space of the experiment to the action space. For Example 1 of decision problem, decision procedures include $\delta(X) = \sum_{i=1}^{n} X_i / n$ and $\delta(X) = \left(\sum_{i=1}^{n} X_i + 1\right)/(n+2)$.

Risk function: The loss of a decision procedure will vary over repetitions of the experiment because the data $X$ from the experiment is random. The risk function $R(\theta, \delta)$ is the expected loss from using the decision procedure $\delta$ when the true parameter vector is $\theta$:
$$
R(\theta, \delta) = E_\theta[l(\theta, \delta(X))].
$$

Example: For quadratic loss in point estimation of $q(\theta)$, the risk function is the mean squared error:
$$
R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = E_\theta[(q(\theta) - \delta(X))^2].
$$
This mean squared error can be decomposed as bias squared plus variance.

Proposition 3.1:
$$
E_\theta[(q(\theta) - \delta(X))^2] = (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta\{(\delta(X) - E_\theta[\delta(X)])^2\}.
$$

Proof: We have
$$
E_\theta[(q(\theta) - \delta(X))^2] = E_\theta\!\left[\big(\{q(\theta) - E_\theta[\delta(X)]\} + \{E_\theta[\delta(X)] - \delta(X)\}\big)^2\right]
$$
$$
= (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta\!\left[\{E_\theta[\delta(X)] - \delta(X)\}^2\right]
$$
since the cross term vanishes ($E_\theta\{E_\theta[\delta(X)] - \delta(X)\} = 0$), and this equals
$$
\{\text{Bias}[\delta(X)]\}^2 + \text{Variance}[\delta(X)].
$$
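Proposition 3.1 makes risks under squared error loss easy to compute. As an illustration (a minimal sketch, not part of the notes), the risk functions of the two decision procedures above for the Bernoulli flood example can be computed exactly by adding bias squared and variance:

```python
def mse_sample_mean(p, n):
    """Risk under squared error loss of delta(X) = sum(X_i)/n:
    unbiased, so MSE = Var = p(1 - p)/n."""
    return p * (1 - p) / n

def mse_posterior_mean(p, n):
    """Risk under squared error loss of delta(X) = (sum(X_i) + 1)/(n + 2),
    via Proposition 3.1: E[delta] = (np + 1)/(n + 2),
    Var[delta] = n p (1 - p)/(n + 2)^2."""
    bias = (n * p + 1) / (n + 2) - p
    var = n * p * (1 - p) / (n + 2) ** 2
    return bias ** 2 + var

n = 25
for p in (0.05, 0.20, 0.50):
    print(p, round(mse_sample_mean(p, n), 5), round(mse_posterior_mean(p, n), 5))
```

For $n = 25$, the shrunken estimator has smaller risk near $p = 0.5$ but larger risk for small $p$ such as $p = 0.05$, previewing the kind of frequentist comparison pursued in Chapter 1.3.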
Example: Suppose that an iid sample $X_1,\ldots,X_n$ is drawn from the uniform distribution on $[0,\theta]$, where $\theta$ is an unknown parameter and the density of $X_i$ is
$$
f_X(x;\theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{elsewhere.} \end{cases}
$$

Several point estimators:

1. $W_1 = \max_i X_i$.

2. $W_2 = \dfrac{n+1}{n}\max_i X_i$. Note: unlike $W_1$, $W_2$ is unbiased because $E_\theta(W_2) = \dfrac{n+1}{n}E_\theta(W_1) = \dfrac{n+1}{n}\cdot\dfrac{n}{n+1}\theta = \theta$.

3. $W_3 = 2\bar{X}$. Note: $W_3$ is unbiased, since $E_\theta[X] = \int_0^\theta x\,\dfrac{1}{\theta}\,dx = \dfrac{x^2}{2\theta}\Big|_0^\theta = \dfrac{\theta}{2}$, so $E_\theta[W_3] = 2E_\theta[\bar{X}] = 2\cdot\dfrac{\theta}{2} = \theta$.

Comparison of the three estimators for the uniform example using the mean squared error criterion:

1. $W_1 = \max_i X_i$. The sampling distribution of $W_1$ is
$$
f_{W_1}(w_1) = \begin{cases} \dfrac{n w_1^{\,n-1}}{\theta^n} & 0 \le w_1 \le \theta \\ 0 & \text{elsewhere} \end{cases}
$$
and
$$
E_\theta[W_1] = \int_0^\theta w_1 f_{W_1}(w_1)\,dw_1 = \int_0^\theta w_1\,\frac{n w_1^{\,n-1}}{\theta^n}\,dw_1 = \frac{n w_1^{\,n+1}}{(n+1)\theta^n}\Big|_0^\theta = \frac{n}{n+1}\theta,
$$
so
$$
\text{Bias}(W_1) = E_\theta[W_1] - \theta = -\frac{\theta}{n+1}.
$$
To calculate $\text{Var}(W_1)$, we calculate $E_\theta(W_1^2)$ and use the formula $\text{Var}(X) = E(X^2) - [E(X)]^2$:
$$
E_\theta(W_1^2) = \int_0^\theta w_1^2 f_{W_1}(w_1)\,dw_1 = \int_0^\theta w_1^2\,\frac{n w_1^{\,n-1}}{\theta^n}\,dw_1 = \frac{n w_1^{\,n+2}}{(n+2)\theta^n}\Big|_0^\theta = \frac{n}{n+2}\theta^2,
$$
$$
\text{Var}(W_1) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\theta\right)^2 = \frac{n\,\theta^2}{(n+2)(n+1)^2}.
$$
Thus,
$$
\text{MSE}(W_1) = \{\text{Bias}(W_1)\}^2 + \text{Var}(W_1) = \frac{\theta^2}{(n+1)^2} + \frac{n\,\theta^2}{(n+2)(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}.
$$

2. $W_2 = \dfrac{n+1}{n}\max_i X_i$. Note that $W_2 = \dfrac{n+1}{n}W_1$. Thus $E_\theta(W_2) = \dfrac{n+1}{n}E_\theta(W_1) = \theta$, $\text{Bias}(W_2) = 0$, and
$$
\text{Var}(W_2) = \text{Var}\!\left(\frac{n+1}{n}W_1\right) = \left(\frac{n+1}{n}\right)^2\text{Var}(W_1) = \left(\frac{n+1}{n}\right)^2\frac{n\,\theta^2}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}.
$$
Because $W_2$ is unbiased,
$$
\text{MSE}(W_2) = \text{Var}(W_2) = \frac{\theta^2}{n(n+2)}.
$$

3. $W_3 = 2\bar{X}$. To find the mean squared error, we use the fact that if $X_1,\ldots,X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X} = \dfrac{X_1+\cdots+X_n}{n}$ has mean $\mu$ and variance $\sigma^2/n$. We have
$$
E_\theta(X) = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2}, \qquad
E_\theta(X^2) = \int_0^\theta x^2\,\frac{1}{\theta}\,dx = \frac{x^3}{3\theta}\Big|_0^\theta = \frac{\theta^2}{3},
$$
$$
\text{Var}(X) = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12}.
$$
Thus $E_\theta(\bar{X}) = \theta/2$ and $\text{Var}(\bar{X}) = \theta^2/(12n)$, so $E_\theta(W_3) = 2E_\theta(\bar{X}) = \theta$ and $\text{Var}(W_3) = 4\,\text{Var}(\bar{X}) = \theta^2/(3n)$. $W_3$ is unbiased and has mean squared error $\theta^2/(3n)$.

The mean squared errors of the three estimators are:

$W_1$: $\text{MSE} = \dfrac{2\theta^2}{(n+1)(n+2)}$
$W_2$: $\text{MSE} = \dfrac{\theta^2}{n(n+2)}$
$W_3$: $\text{MSE} = \dfrac{\theta^2}{3n}$

For $n=1$, the three estimators have the same MSE. For $n>1$,
$$
\frac{\theta^2}{n(n+2)} < \frac{2\theta^2}{(n+1)(n+2)} \le \frac{\theta^2}{3n},
$$
with equality of the last two only at $n=2$. So $W_2$ is best, $W_1$ is second best (tied with $W_3$ when $n=2$), and $W_3$ is the worst.
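A quick Monte Carlo check of these MSE formulas (a minimal sketch, not part of the notes; $\theta = 1$ and $n = 10$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_mse(theta, n, reps=200_000):
    """Monte Carlo estimates of the MSEs of W1, W2, W3 when
    X_1, ..., X_n are iid Uniform(0, theta)."""
    x = rng.uniform(0.0, theta, size=(reps, n))
    w1 = x.max(axis=1)                 # W1 = max_i X_i
    w2 = (n + 1) / n * w1              # W2 = (n+1)/n * max_i X_i
    w3 = 2.0 * x.mean(axis=1)          # W3 = 2 * Xbar
    return {name: float(np.mean((w - theta) ** 2))
            for name, w in [("W1", w1), ("W2", w2), ("W3", w3)]}

theta, n = 1.0, 10
print(simulated_mse(theta, n))
print({"W1": 2 * theta**2 / ((n + 1) * (n + 2)),   # exact MSE formulas above
       "W2": theta**2 / (n * (n + 2)),
       "W3": theta**2 / (3 * n)})
```

For $\theta = 1$ and $n = 10$ the exact values are approximately 0.0152, 0.0083, and 0.0333, which the simulated values should closely reproduce.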