Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/ We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use. Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers. 1 / 157 Least Squares Fitting and Inference for Linear Models Kerby Shedden Department of Statistics, University of Michigan September 30, 2015 2 / 157 Goals and terminology The goal is to learn about an unknown function f that relates variables Y ∈ R and X ∈ Rp through a relationship Y ≈ f (X ). The variables Y and X have distinct roles: I Independent variables (predictors, regressors, covariates, exogeneous variables): X = (X1 , . . . , Xp )0 I Dependent variable (response, outcome, endogeneous variables): Y Note that the terms “independent” and “dependent” do not imply statistical independence or linear algebraic independence. They refer to the setting of an experiment where the value of X can be manipulated, and we observe the consequent changes in Y . 3 / 157 Goals and terminology The analysis is empirical (based on a sample of data): Yi ∈ R Xi = (Xi1 , . . . , Xip )0 ∈ Rp i = 1, . . . , n. where n is the sample size. The data (Yi , Xi ) tell us what is known about the i th “analysis unit”, “case”, or “observation”. 4 / 157 Why do we want to do this? I Prediction: Estimate f (X ) to predict the typical value of Y at a given X point (not necessarily one of the Xi points in the data). I Model inference: Inductive learning about the relationship between X and Y , such as understanding which predictors or combinations of predictors are associated with particular changes in Y . “Model inference” often has the goal of better understanding the physical (biological, social, etc.) mechanism underlying the relationship between X and Y . We will see that inferences about mechanisms can be misleading, particularly when based on observational data. 5 / 157 Examples: I An empirical model for the weather conditions 48 hours from now could be based on current and historical weather conditions. Such a model could have a lot of practical value, but it would not necessarily provide a lot of insight into the atmospheric processes that underly changes in the weather. I A study of the relationship between childhood lead exposure and subsequent behavioral problems would primarily be of interest for inference, rather than prediction. Such a model could be used to assess whether there is any risk due to lead exposure, and to estimate the overall effects of lead exposure in a large population. 
The effect of lead exposure on an individual child is probably too small in relation to numerous other risk factors for such a model to be of predictive value at the individual level. 6 / 157 Statistical interpretations of the regression function The most common way of putting “curve fitting” into a statistical framework is to define f as the conditional expectation f (x) ≡ E [Y |X = x]. Less commonly, the regression function is defined as a conditional quantile, such as the median f (x) ≡ median(Y |X = x) or even some other quantile f (x) ≡ Qp (Y |X = x). This is called quantile regression. 7 / 157 The conditional expectation function The conditional expectation E [Y |X ] can be viewed in two ways: 1. As a deterministic function of X , essentially what we would get if we sampled a large number of X , Y pairs from their joint distribution, and took the average of the Y values that are coupled with a specific X value. If there are densities we can write: Z E [Y |X = x] = y · f (y |X = x)dy . 2. As a scalar random variable. A realization is obtained by sampling X from its marginal distribution, then plugging this value into the deterministic function described in 1 above. In regression analysis, 1 is much more commonly used than 2. 8 / 157 Least Squares Fitting In a linear model, the independent variables X are postulated to be related to the dependent variable Y via a linear relationship Yi ≈ p X βj Xij = β 0 Xi . j=1 This is called a “linear model” because it is a linear function of β, not because it is a linear function of X . To estimate f , we need to estimate the βj . One approach to doing this is to minimize the following function of β: L(β) = X i (Yi − X βj Xij )2 = j X (Yi − β 0 Xi )2 i This is called least squares estimation. 9 / 157 Simple linear regression A special case of the linear model is simple linear regression, when there is p = 1 covariate and an intercept (a covariate whose value is always 1). L(α, β) = X (Yi − α − βXi )2 i We can differentiate with respect to α and β: ∂L/∂α ∂L/∂β P = −2 Pi (Yi − α − βXi ) = −2 i (Yi − α − βXi )Xi P = −2 P Ri = −2 i Ri Xi Ri = Yi − α − βXi is the “working residual” (requires working values for α and β). 10 / 157 Simple linear regression Setting ∂L/∂α = ∂L/∂β = 0 and solving for α and β yields α̂ = Ȳ − β̂ X̄ P Yi Xi /n − Ȳ X̄ β̂ = P 2 Xi /n − X̄ 2 P P where Ȳ = Yi /n and X̄ = Xi /n are the sample mean values (averages). The “hat” notation (ˆ) distinguishes the least-squares optimal values of α and β from an arbitrary pair of parameter values. 11 / 157 Two important identities We will call α̂ and β̂ the “least squares estimates” of the model parameters α and β. At the least squares solution, ∂L/∂α ∂L/∂β P = 2 P Ri = −2 i Ri Xi = = 0 0 we have the following basic properties for the least squares estimates: I The residuals Ri = Yi − α − βXi sum to zero I The residuals are orthogonal to the independent variable X . Recall that two vectors v , w ∈ Rd are orthogonal if P j vj wj = 0. 12 / 157 Two important identities Centered and uncentered sums of squares can be related as follows: X Xi2 /n − X̄ 2 = X (Xi − X̄ )2 /n. i Centered and uncentered cross-products can be related as follows: X Yi Xi /n − Ȳ X̄ = X Yi (Xi − X̄ )/n = X (Yi − Ȳ )(Xi − X̄ )/n. i 13 / 157 Simple linear regression Note that X (Xi − X̄ )2 /n and i X (Yi − Ȳ )(Xi − X̄ )/n. i are essentially the sample variance of X1 , . . . , Xn , and the sample covariance of the (Xi , Yi ) pairs. 
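These two quantities determine the fitted slope. As a quick numerical check of the closed-form estimates above, here is a minimal sketch in Python/NumPy; the simulated data, sample size, and true parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true alpha = 1.5, beta = 2.0

# Closed-form simple linear regression estimates
beta_hat = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Same slope written as a ratio of sample covariance to sample variance
# (the denominators cancel, so n versus n - 1 does not matter here)
beta_hat_cov = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)

# Compare with a library fit (np.polyfit returns [slope, intercept])
beta_lib, alpha_lib = np.polyfit(x, y, 1)
print(alpha_hat, beta_hat, beta_hat_cov)
print(alpha_lib, beta_lib)
```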
Since β̂ is their ratio, we can replace n in the denominator with n − 1 so that β̂ = cd ov(Y , X ) var(X c ) where cd ov and var c are the usual unbiased estimates of variance and covariance. 14 / 157 Convex functions A map Q : Rd → R is convex if for any v , w ∈ Rd , Q(λv + (1 − λ)w ) ≤ λQ(v ) + (1 − λ)Q(w ), for 0 ≤ λ ≤ 1. If the inequality is strict for 0 < λ < 1 and all v 6= w , then Q is strictly convex. λ=1 Q(v) λQ(v)+(1−λ)Q(w) λ=0 Q(w) Q(λv+(1−λ)w) v w 15 / 157 Convex functions A key property of strictly convex functions is that they have at most one global minimizer. That is, there exists at most one v ∈ Rd such that Q(v ) ≤ Q(w ) for all w ∈ Rd . The proof is simple. Suppose there exists v 6= w such that Q(v ) = Q(w ) = inf u∈Rd Q(u). If Q is strictly convex and λ = 1/2, then Q(v /2 + w /2) < (Q(v ) + Q(w ))/2 = inf u∈Rd Q(u), Thus z = (v + w )/2 has the property that Q(z) < inf u∈Rd Q(u), a contradiction. 16 / 157 Convexity of quadratic functions A general quadratic function can be written Q(v ) = v 0 Av + b 0 v + c where A is a d × d matrix, b and v are vectors in Rd , and c is a scalar. Note that v 0 Av = X i,j vi vj Aij b0 v = X bj vj . j 17 / 157 Convexity of quadratic functions If b ∈ col(A), we can complete the square to eliminate the linear term, giving us Q(v ) = (v − f )0 A(v − f ) + s. where f is any vector satisfying Af = −b/2, and s = c − f 0 Af . If A is invertible, we can take f = −A−1 b/2. Since the property of being convex is invariant to translations in both the domain and range, without loss of generality we can assume f = 0 and s = 0 for purposes of analyzing the convexity of Q. 18 / 157 Convexity of quadratic functions Two key definitions: I A square matrix A is positive definite if v 0 Av > 0 for all vectors v 6= 0. I A square matrix is positive semidefinite if v 0 Av ≥ 0 for all v . We will now show that the quadratic function Q(v ) = v 0 Av is strictly convex if and only if A is positive definite. Note that without loss of generality, A is symmetric, since otherwise à ≡ (A + A0 )/2 gives the same quadratic form as A. 19 / 157 Convexity of quadratic functions Q(λv + (1 − λ)w ) = (λv + (1 − λ)w )0 A(λv + (1 − λ)w ) = λ2 v 0 Av + (1 − λ)2 w 0 Aw + 2λ(1 − λ)v 0 Aw . λQ(v ) + (1 − λ)Q(w ) − Q(λv + (1 − λ)w ) = λ(1 − λ)(v 0 Av + w 0 Aw − 2w 0 Av ) = λ(1 − λ)(v − w )0 A(v − w ) ≥ 0. If 0 < λ < 1, this is a strict inequality for all v 6= w if and only if A is positive definite. 20 / 157 Convexity of quadratic functions The gradient, or Jacobian of a scalar-valued function y = f (x) (y ∈ R, x ∈ Rd ) is (∂f /∂x1 , . . . , ∂f /∂xd ), which is viewed as a row vector by convention. 21 / 157 Convexity of quadratic functions If A is symmetric, the Jacobian of Q(v ) = v 0 Av is 2v 0 A0 . To see this, write v 0 Av = X vi2 Aii + i X vi vj Aij i6=j and differentiate with respect to v` to get the `th component of the Jacobian: 2v` A`` + X j6=` vj A`j + X vi Ai` = 2(Av )` . i6=` 22 / 157 Convexity of quadratic functions The Hessian of a scalar-valued function y = f (x) (y ∈ R, x ∈ Rd ) is the matrix Hij = ∂ 2 f /∂xj ∂xi . If A is symmetric, the Hessian of Q is 2A. To see this, note that the `th component of 2v 0 A0 is the inner product of v with the `th row (or column) of A. The derivative of this inner product with respect to a second index `0 is 2A``0 . 
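As a numerical sanity check of these two formulas, the sketch below compares the analytic gradient 2Av and Hessian 2A of Q(v) = v'Av with finite-difference approximations (Python/NumPy; the symmetric matrix A and the point v are arbitrary examples).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.normal(size=(d, d))
A = (M + M.T) / 2              # symmetric matrix defining the quadratic form
Q = lambda v: v @ A @ v        # Q(v) = v'Av

v = rng.normal(size=d)
h = 1e-4
I = np.eye(d)

# Central-difference gradient versus the analytic gradient 2Av
grad_fd = np.array([(Q(v + h * I[j]) - Q(v - h * I[j])) / (2 * h) for j in range(d)])
print(np.allclose(grad_fd, 2 * A @ v, atol=1e-6))

# Central-difference Hessian versus the analytic Hessian 2A
hess_fd = np.array([[(Q(v + h*I[i] + h*I[j]) - Q(v + h*I[i] - h*I[j])
                      - Q(v - h*I[i] + h*I[j]) + Q(v - h*I[i] - h*I[j])) / (4 * h * h)
                     for j in range(d)] for i in range(d)])
print(np.allclose(hess_fd, 2 * A, atol=1e-6))
```

Because Q is quadratic, the central differences are exact up to floating-point roundoff, so both checks should print True.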
It follows that a quadratic function is strictly convex iff its Hessian is positive definite (more generally, any continuous, twice differentiable function is convex on Rd if and only if its Hessian matrix is everywhere positive definite). 23 / 157 Uniqueness of the simple linear regression least squares fit The least squares solution for simple linear regression, α̂, β̂, is unique as long as var[X c ] (the sample variance of the covariate) is positive. To see this, note that the Hessian (second derivative matrix) of L(α, β) is H= where X· = P i ∂ 2 L/∂α2 ∂ 2 L/∂α∂β ∂ 2 L/∂α∂β ∂ 2 L/∂β 2 = 2n 2X· 2X· P 2 Xi2 Xi . 24 / 157 Uniqueness of the simple linear regression least squares fit If var(X c ) > 0 then this is a positive definite matrix since all the principal submatrices have positive determinants: |2n| > 0 2n 2X· 2X· P = 2 Xi2 = 4n X Xi2 − 4(X· )2 4n(n − 1)var(X c ) > 0. 25 / 157 The fitted line and the data center The fitted line Y = α̂ + β̂X passes through the center of mass of the data (X̄ , Ȳ ). This can be seen by substituting X = X̄ into the equation of the fitted line, which yields Ȳ . 26 / 157 Regression slopes and the Pearson correlation The sample Pearson correlation coefficient between X and Y is ρ̂ = cd ov(Y , X ) c c ) SD(Y )SD(X The relationship between β̂ and ρ̂ is β̂ = ρ̂ · c ) SD(Y . c ) SD(X Thus the fitted slope has the same sign as the Pearson correlation coefficient between Y and X . 27 / 157 Reversing X and Y If X and Y are reversed in a simple linear regression, the slope is β̂∗ = c ) SD(X cd ov(Y , X ) = cor(Y c ,X) . c ) var(Y c ) SD(Y If the data fall exactly on a line, then cor(Y , X ) = 1, so β̂∗ = 1/β̂, which is consistent with algebraically rearranging Y = α + βX to X = −α/β + Y /β. But if the data do not fit a line exactly, this property does not hold. 28 / 157 Norms The Euclidean norm on vectors (the most commonly used norm) is: kV k = sX Vi2 = √ V 0V . i Here are some useful identities, for vectors V , W ∈ Rp : kV + W k2 = kV k2 + kW k2 + 2V 0 W kV − W k2 = kV k2 + kW k2 − 2V 0 W 29 / 157 Fitting multiple linear regression models For multiple regression (p > 1), the covariate data define the design matrix: 1 1 X= 1 x11 x21 ··· ··· x12 x22 ··· ··· ··· ··· ··· xn1 xn2 ··· x1p x2p xnp Note that in some situations the first column of 10 s (the intercept) will not be included. 30 / 157 Fitting multiple linear regression models The linear model coefficients are written as a vector β = (β0 , β1 , . . . , βp )0 where β0 is the intercept and βk is the slope corresponding to the k th covariate. For a given working covariate vector β, the vector of fitted values is given by the matrix-vector product Ŷ = Xβ, which is an n-dimensional vector. The vector of residuals Y − Xβ is also an n-dimensional vector. 31 / 157 Fitting multiple linear regression models The goal of least-squares estimation is to minimize the sum of squared differences between the fitted and observed values. L(β) = X (Yi − Ŷi )2 = kY − Xβk2 . i Estimating β by minimizing L(β) is called ordinary least squares (OLS). 32 / 157 The multivariate chain rule Suppose g (·) is a map from Rm to Rn and f (·) is a scalar-valued function on Rn . If h = f ◦ g , i.e. h(z) = f (g (z)). Let fj (x) = ∂f (x)/∂xj , let ∇f (x) = (f1 (x), · · · , fn (x))0 denote the gradient of f , and let J denote the Jacobian of g Jij (z) = ∂gi (z)/∂zj . 
33 / 157 The multivariate chain rule Then ∂h(z)/∂zj = X fi (g (z))∂gi (z)/∂zj i = [J(z)0 ∇f (g (z))]j Thus we can write the gradient of h as a matrix-vector product between the (transposed) Jacobian of g and the gradient of f : ∇h = J 0 ∇f where J is evaluated at z and ∇f is evaluated at g (z). 34 / 157 The least squares gradient function For the least squares problem, the gradient of L(β) with respect to β is ∂L/∂β = −2X0 (Y − Xβ). This can be seen by differentiating L(β) = X (Yi − β0 − i p X Xij βj )2 j=1 element-wise, or by differentiating kY − Xβk2 using the Pmultivariate chain rule, letting g (β) = Y − Xβ and f (x) = j xj2 . 35 / 157 Normal equations Setting ∂L/∂β = 0 yields the “normal equations:” X0 Xβ = X0 Y Thus calculating the least squares estimate of β reduces to solving a system of p + 1 linear equations. Algebraically we can write β = (X0 X)−1 X0 Y , which is often useful for deriving analytical results. However this expression should not be used to numerically calculate the coefficients. 36 / 157 Solving the normal equations The most standard numerical approach is to calculate the QR decomposition of X = QR where Q is a n × p + 1 orthogonal matrix (i.e. Q0 Q = I) and R is a p + 1 × p + 1 upper triangular matrix. The QR decomposition can be calculated rapidly, and highly precisely. Once it is obtained, the normal equations become Rβ = Q0 Y , which is an easily solved p + 1 × p + 1 triangular system. 37 / 157 Matrix products In multiple regression we encounter the matrix product X0 X. Let’s review some ways to think about matrix products. If A ∈ Rn×m and B ∈ Rn×p are matrices, we can form the product A0 B ∈ Rm×p . Suppose we partition A and B by rows: − a1 − a2 ··· A= ··· − an − − − − b1 − b2 ··· B= ··· − bn − − − 38 / 157 Matrix products Then A0 B = a10 b1 + a20 b2 + · · · + an0 bn , where each aj0 bj term is an outer product aj1 bj1 aj2 bj1 0 aj bj = ··· aj1 bj2 aj2 bj2 ··· ··· ··· ∈ Rm×p . 39 / 157 Matrix products Now if we partition A and B by columns | A = a1 | | a2 | | a3 | | B = b1 | | b2 | | b3 , | then A0 B is a matrix of inner products a10 b1 a20 b1 A0 B = ··· a10 b2 a20 b2 ··· ··· ··· . ··· 40 / 157 Matrix products Thus we can view the product X0 X involving the design matrix in two different ways. If we partition X by rows (the cases) − x1 − x2 ··· X= ··· − xn − − − then X0 X = n X xi0 xi i=1 is the sum of the outer product matrices of the cases. 41 / 157 Matrix products If we partition X by the columns (the variables) | X = x1 | | x2 | | x3 , | then X0 X is a matrix whose entries are the pairwise inner products of the variables ([X0 X]ij = xi0 xj ). 42 / 157 Mathematical properties of the multiple regression fit The multiple least square solution is unique as long as the columns of X are linearly independent. Here is the proof: 1. The Hessian of L(β) is 2X0 X. 2. For v 6= 0, v 0 (X0 X)v = (Xv )0 Xv = kXv k2 > 0, since the columns of X are linearly independent. Therefore the Hessian of L is positive definite. 3. Since L(β) is quadratic with a positive definite Hessian matrix, it is convex and hence has a unique global minimizer. 43 / 157 Projections Suppose S is a subspace of Rd , and V is a vector in Rd . The projection operator PS maps V to the vector in S that is closest to V : PS (V ) = argminη∈S kV − ηk2 . 44 / 157 Projections Property 1: (V − PS (V ))0 s = 0 for all s ∈ S. To see this, let s ∈ S. Without loss of generality ksk = 1 and (V − PS (V ))0 s ≤ 0. Let λ ≥ 0, and write kV − PS V + λsk2 = kV − PS V k2 + λ2 + 2λ(V − PS (V ))0 s. 
If (V − PS (V ))0 s 6= 0, then for sufficiently small λ > 0, λ2 + 2λ(V − PS (V ))0 s < 0. This means that PS (V ) − λs is closer to V than PS (V ), contradicting the definition of PS (V ). 45 / 157 Projections Property 2: Given a subspace S of Rd , any vector V ∈ Rd can be written uniquely in the form V = VS + VS ⊥ , where VS ∈ S and s 0 VS ⊥ = 0 for all s ∈ S. To prove uniqueness, suppose V = VS + VS ⊥ = ṼS + ṼS ⊥ . Then 0 = kVS − ṼS + VS ⊥ − ṼS ⊥ k2 = kVS − ṼS k2 + kVS ⊥ − ṼS ⊥ k2 + 2(VS − ṼS )0 (VS ⊥ − ṼS ⊥ ) = kVS − ṼS k2 + kVS ⊥ − ṼS ⊥ k2 . which is only possible if VS = ṼS and VS ⊥ − ṼS ⊥ . Existence follows from Property 1, with VS = PV (S) and VS ⊥ = V − PV (S). 46 / 157 Projections Property 3: The projection PS (V ) is unique. This follows from property 2 – if there exists a V for which V1 6= V2 ∈ S both minimize the distance from V to S, then V = V1 + U1 and V = V2 + U2 (Uj = V − Vj ) would be distinct decompositions of V as a sum of a vector in S and a vector in S ⊥ , contradicting property 2. 47 / 157 Projections Property 4: PS (PS (V )) = PS (V ). The proof of this is simple, since PS (V ) ∈ S, and any element of S has zero distance to itself. A matrix or linear map with this property is called idempotent. 48 / 157 Projections Property 5: PS is a linear operator. Let A, B be vectors with θA = PS (A) and θB = PS (B). Then we can write A + B = θA + θB + (A + B − θA − θB ), where θA + θB ∈ S, and s 0 (A + B − θA − θB ) = s 0 (A − θA ) + s 0 (B − θB ) = 0 for all s ∈ S. By Property 2 above, this representation is unique, so θA + θB = PS (A + B). 49 / 157 Projections Property 6: Suppose PS is the projection operator onto a subspace S. Then I − PS , where I is the identity matrix, is the projection operator onto the subspace S ⊥ ≡ {u ∈ Rd |u 0 s = 0 for all s ∈ S}. To prove this, write V = (I − PS )V + PS V , and note that ((I − PS )V )0 s = 0 for all s ∈ S, so (I − PS )V ∈ S ⊥ , and u 0 PS V = 0 for all u ∈ S ⊥ . By property 2 this decomposition is unique, and therefore I − PS is the projection operator onto S ⊥ . 50 / 157 Projections Property 7: Since PS (V ) is linear, it can be represented in the form PS (V ) = PS · V for a suitable square matrix PS . Suppose S is spanned by the columns of a non-singular matrix X. Then PS = X(X0 X)−1 X0 . To prove this, let Q = X(X0 X)−1 X0 , so for an arbitrary vector V , V = QV + (V − QV ) = X(X0 X)−1 X0 V + (I − X(X0 X)−1 X0 )V and note that the first summand is in S. 51 / 157 Projections Property 7 (continued): To show that the second summand is in S ⊥ , take s ∈ S and write s = Xb. Then s 0 (I − X(X0 X)−1 X0 )V = b 0 X0 (I − X(X0 X)−1 X0 )V = 0. Therefore this is the unique decomposition from Property 2, above, so PS must be the projection. 52 / 157 Projections Property 7 (continued) An alternate approach to property 7 is constructive. Let θ = Xγ, and suppose we wish to minimize the distance between θ and V . Using calculus, we differentiate with respect to γ and solve for the stationary point: ∂kV − Xγk2 /∂γ = ∂(V 0 V − 2V 0 Xγ + γ 0 X0 Xγ)/∂γ = −2X 0 V + 2X0 Xγ = 0. The solution is γ = (X0 X)−1 X0 V so θ = X(X0 X)−1 X0 V . 53 / 157 Projections Property 8: The composition of projections onto S and S ⊥ (in either order) is identically zero. PS ◦ PS ⊥ = PS ⊥ ◦ PS ≡ 0. This can be shown by direct calculation using the representations of PS and PS ⊥ given above. 54 / 157 Least squares and projections The least squares problem of minimizing kY − Xβk2 is equivalent to minimizing kY − ηk2 over η ∈ col(X). 
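The projection machinery can also be checked numerically. The sketch below verifies idempotence, the explicit formula X(X'X)⁻¹X', and the fact that projecting Y onto col(X) reproduces the least squares fitted values (Python/NumPy; the simulated design and response are arbitrary, and the explicit inverse is used only because the example is tiny).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto col(X) (Property 7)

print(np.allclose(P @ P, P))               # idempotent (Property 4)
print(np.allclose(P.T, P))                 # symmetric (Property 9, shown later in the notes)
print(np.allclose(P @ (np.eye(n) - P), 0)) # composition with the complementary projection is zero (Property 8)

# Projecting y onto col(X) gives the least squares fitted values X @ beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(P @ y, X @ beta_hat))
```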
Therefore the minimizing value η̂ is the projection of Y onto col(X). If the columns of X are linearly independent, there is a unique vector β̂ such that Xβ̂ = η̂. These are the least squares coefficient estimates. 55 / 157 Properties of the least squares fit • The fitted regression surface passes through the mean point (X̄, Ȳ ). To see this, note that the fitted surface at X̄ (the vector of column-wise means) is X̄0 β̂ = (10 X/n)β̂ = 10 X(X0 X)−1 X0 Y /n, where 1 is a column vector of 1’s. Since 1 ∈ col(X), it follows that X(X0 X)−1 X0 1 = 1, which gives the result, since Ȳ = 10 Y /n. 56 / 157 Properties of the least squares fit • The multiple regression residuals sum to zero. The residuals are R ≡ Y − Ŷ = Y − PS Y = (I − PS )Y = PS ⊥ Y , where S = col(X ). The sum of residuals can be written 10 (I − X(X0 X)−1 X0 )Y , where 1 is a vector of 1’s. If X includes an intercept, PS 1 = 1, so 10 I = 10 X(X0 X)−1 X0 = 10 , so 10 (I − X(X0 X)−1 X0 ) = 0. 57 / 157 Orthogonal matrices An orthogonal matrix X satisfies X0 X = I . This is equivalent to stating that the columns of X are mutually orthonormal. If X is square and orthogonal, then X0 = X−1 and also XX0 = I . If X is orthogonal then the projection onto col(X) simplifies to X(X0 X)−1 X0 = XX0 If X is orthogonal and the first column of X is constant, it follows that the remaining columns of X are centered and have sample variance 1/(n − 1). 58 / 157 Orthogonal matrices • If X is orthogonal, the slopes obtained by using multiple regression of Y on X = (X1 , . . . , Xp ) are the same as the slopes obtained by carrying out p simple linear regressions of Y on each covariate separately. To see this, note that the multiple regression slope estimate for the i th covariate is β̂m,i = X0:i Y where X:i is column i of X. Since X0 X = I it follows that each covariate has zero sample mean, and sample variance equal to 1/(n − 1). Thus the simple linear regression slope for covariate i is c :i , Y )/var(X c :i ) = X0:i Y = β̂m,i . β̂i = cov(X 59 / 157 Comparing multiple regression and simple regression slopes • The signs of the multiple regression slopes need not agree with the signs of the corresponding simple regression slopes. For example, suppose there are two covariates, both with mean zero and variance 1, and for simplicity assume that Y has mean zero and variance 1. Let r12 be the correlation between the two covariates, and let r1y and r2y be the correlations between each covariate and the response. It follows that n/(n − 1) 0 0 1 X0 X/(n − 1) = 0 r12 0 r12 1 0 X0 Y /(n − 1) = r1y . r2y 60 / 157 Comparing multiple regression and simple regression slopes So we can write β̂ = (X0 X/(n − 1))−1 (X0 Y /(n − 1)) = 1 r1y 2 1 − r12 r2y 0 − r12 r2y . − r12 r1y Thus, for example, if r1y , r2y , r12 ≥ 0, then if r12 > r1y /r2y , then β̂1 has opposite signs in single and multiple regression. Note that if r1y > r2y it is impossible for r12 > r1y /r2y . Thus, the effect direction for the covariate that is more strongly marginally correlated with Y cannot be reversed. This is an example of “Simpson’s paradox”. 61 / 157 Comparing multiple regression and simple regression slopes Numerical example: if r1y = 0.1, r2y = 0.6, r12 = 0.6, then X1 is (marginally) positively associated with Y , but for fixed values of X2 , the association between Y and X1 is negative. 62 / 157 Formulation of regression models in terms of probability distributions Up to this point, we have primarily expressed regression models in terms of the mean structure, e.g. E [Y |X = x] = β 0 x. 
Regression models are commonly discussed in terms of moment structures (E [Y |X = x], var[Y |X = x]) or quantile structures (Qp [Y |X = x]), rather than as fully-specified probability distributions. 63 / 157 Formulation of regression models in terms of probability distributions If we want to specify the model more completely, we can think in terms of a random “error term” that describes how the observed value y deviates from the ideal value f (x), where (x, y ) is generated according to the regression model. A very general regression model is Y = f (X , ). where is a random variable with expected value zero. If we specify the distribution of , then we have fully specified the distribution of Y |X . 64 / 157 Formulation of regression models in terms of probability distributions A more restrictive “additive error” model is: Y = f (X ) + . Under this model, E [Y |X ] = E [f (X ) + |X ] = E [f (X )|X ] + E [|X ] = f (X ) + E [|X ]. Without loss of generality, E [|X ] = 0, so E [Y |X ] = f (X ). 65 / 157 Regression model formulations and parameterizations A parametric regression model is: Y = f (X ; θ) + , where θ is a finite dimensional parameter vector. Examples: 1. The linear response surface model f (X ; θ) = θ0 X 2. The quadratic response surface model f (X ; θ) = θ1 + θ2 X + θ3 X 2 3. The Gompertz curve f (X ; θ) = θ1 exp(θ2 exp(θ3 X )) θ2 , θ3 ≤ 0. Models 1 and 2 are both “linear models” because they are linear in θ. The Gompertz curve is a non-linear model because it is not linear in θ. 66 / 157 Basic inference for simple linear regression This section deals with statistical properties of least square fits that can be derived under minimal conditions. Specifically, we will assume derive properties of β̂ that hold for the generating model y = x 0 β + , where: i E [|x] = 0 ii var[|x] = σ 2 exists and is constant across cases iii the random variables are uncorrelated across cases (given x). We will not assume here that follows a particular distribution, e.g. a Gaussian distribution. 67 / 157 The relationship between E [|X ] and cov(, X ) If we treat X as a random variable, the condition that E [|X ] = 0 for all X implies that cor(X , ) = 0. This follows from the double expectation theorem: cov(X , ) = E [X ] − E [X ] · E [] = E [X ] = EX [E [X |X ]] = EX [XE [|X ]] = EX [X · 0] = 0. 68 / 157 The relationship between E [|X ] and cov(, X ) The converse is not true. If cor(X , ) = 0 and E [] = 0 it may not be the case that E [|X ] = 0. For example, if X ∈ {−1, 0, 1} and ∈ {−1, 1}, with joint distribution -1 0 1 -1 1/12 4/12 1/12 1 3/12 0 3/12 then E = 0 and cor(X , ) = 0, but E [|X ] is not identically zero. When and X are jointly Gaussian, cor(, X ) = 0 implies that and X are independent, which in turn implies that E [|X ] = E = 0. 69 / 157 Data sampling To be able to interpret the results of a regression analysis, we need to know how the data were sampled. In particular, it important to consider whether X and/or Y and/or the pair X , Y should be considered as random draws from a population. Here are two important situations: I Designed experiment: We are studying the effect of temperature on reaction yield in a chemical synthesis. The temperature X is controlled and set by the experimenter. In this case, Y is randomly sampled conditionally on X , but X is not randomly sampled, so it doesn’t make sense to consider X to be a random variable. I Observational study: We are interested in the relationship between cholesterol level X and blood pressure Y . 
We sample people at random from a well defined population (e.g. all registered nurses in the state of Michigan) and measure their blood pressure and cholesterol levels. In this case, X and Y are sampled from their joint distribution, and both can be viewed a random variables. 70 / 157 Data sampling There is another design that we may encounter: I Case/control study: Again suppose we are interested in the relationship between cholesterol level X , and blood pressure Y . But now suppose we have an exhaustive list of blood pressure measurements for all nurses in the state of Michigan. We wish to select a subset of 500 nurses to contact for acquiring cholesterol measures, and it is decided that studying the 250 nurses with greatest blood pressure together with the 250 nurses with lowest blood pressure will be most informative. In this case X is randomly sampled conditionally on Y , but Y is not randomly sampled. Summary: Regression models are formulated in terms of the conditional distribution of Y given X . The statistical properties of β̂ are easiest to calculate and interpret as being conditional on X . The way that the data are sampled also affects our interpretation of the results. 71 / 157 Basic inference for simple linear regression via OLS Above we showed that the slope and intercept estimates are P Yi (Xi − X̄ ) cd ov(Y , X ) = Pi β̂ = , 2 var[X c ] i (Xi − X̄ ) α̂ = Ȳ − β̂ X̄ . Note that we are using the useful fact that X i (Yi − Ȳ )(Xi − X̄ ) = X i Yi (Xi − X̄ ) = X (Yi − Ȳ )Xi . i 72 / 157 Sampling means of parameter estimates First we will calculate the sampling means of α̂ and β̂. A useful identity is that β̂ = X (α + βXi + i )(Xi − X̄ )/ i = β+ X (Xi − X̄ )2 i X i (Xi − X̄ )/ i X (Xi − X̄ )2 . i From this identity it is clear that E [β̂|X ] = β. Thus β̂ is unbiased (a parameter estimate is unbiased if its sampling mean is the same as the population value of the parameter). The intercept is also unbiased: E [α̂|X ] = E (Ȳ − β̂ X̄ |X ) = α + β X̄ + E [¯ |X ] − E [β̂ X̄ |X ] = α 73 / 157 Sampling variances of parameter estimates Next we will calculate the sampling variances of α̂ and β̂. These values capture the variability of the parameter estimates over replicated studies or experiments from the same population. First we will need the following result: cov(β̂, ¯|X ) = X = 0, cov(i , ¯|X )(Xi − X̄ )/ i X (Xi − X̄ )2 i since cov(i , ¯|X ) = σ 2 /n does not depend on i. 74 / 157 Sampling variances of parameter estimates To derive the sampling variances, start with the identity: β̂ = β + X i (Xi − X̄ )/ X (Xi − X̄ )2 . i i The sampling variances are var[β̂|X ] = σ2 / X (Xi − X̄ )2 i = σ 2 / ((n − 1)var[X c ]) . var[α̂|X ] = var[Ȳ − β̂ X̄ |X ] = var[α + β X̄ + ¯ − β̂ X̄ |X ] = var[¯ |X ] + X̄ 2 var[β̂|X ] − 2X̄ cov[¯ , β̂|X ] = σ 2 /n + X̄ 2 σ 2 /((n − 1)var[X c ]). 75 / 157 Sampling covariance of parameter estimates The sampling covariance between the slope and intercept is cov(α̂, β̂|X ) = cov(Ȳ , β̂|X ) − X̄ var[β̂|X ] = −σ 2 X̄ /((n − 1)var[X c ]). When X̄ > 0, it’s easy to see what the expression for cov(α̂, β̂|X ) is telling us: Underestimate slope, overestimate intercept Y − Overestimate slope, underestimate intercept 0.5 0.0 0.5 1.0 X 1.5 2.0 2.5 76 / 157 Basic inference for simple linear regression via OLS Some observations: I All variances scale with sample size like 1/n. I β̂ does not depend on X̄ . I var[α̂] is minimized if X̄ = 0. I α̂ and β̂ are uncorrelated if X̄ = 0. 
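The simulation sketch below checks these variance and covariance formulas by holding the X values fixed and redrawing the errors many times (Python/NumPy; σ, the X values, and the number of replications are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, alpha, beta = 30, 1.5, 1.0, 2.0
x = rng.uniform(0.5, 2.5, size=n)            # X is held fixed across replications

nrep = 20000
est = np.empty((nrep, 2))
for r in range(nrep):
    y = alpha + beta * x + rng.normal(scale=sigma, size=n)
    b = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    est[r] = [y.mean() - b * x.mean(), b]    # (alpha_hat, beta_hat)

vx = np.var(x, ddof=1)
# Empirical versus theoretical var(beta_hat)
print(np.var(est[:, 1]), sigma**2 / ((n - 1) * vx))
# Empirical versus theoretical var(alpha_hat)
print(np.var(est[:, 0]), sigma**2 / n + x.mean()**2 * sigma**2 / ((n - 1) * vx))
# Empirical versus theoretical cov(alpha_hat, beta_hat); negative here since x.mean() > 0
print(np.cov(est[:, 0], est[:, 1])[0, 1], -sigma**2 * x.mean() / ((n - 1) * vx))
```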
77 / 157 Some properties of residuals Start with the following useful expression: Ri ≡ Yi − α̂ − β̂Xi = Yi − Ȳ − β̂(Xi − X̄ ). Since Yi = α + βXi + i and therefore Ȳ = α + β X̄ + ¯, we can subtract to get Yi − Ȳ = β(Xi − X̄ ) + i − ¯ it follows that Ri = (β − β̂)(Xi − X̄ ) + i − ¯. Since E β̂ = β and E [i − ¯] = 0, itP follows that ERi = 0. Note that this is a distinct fact from the identity i Ri = 0. 78 / 157 Some properties of residuals It is important to distinguish the residual Ri from the “observation errors” i . The identity Ri = (β − β̂)(Xi − X̄ ) + i − ¯ shows us that the centered observation error i − ¯ is one part of the residual. The other term (β − β̂)(Xi − X̄ ) reflects the fact that the residuals are also influenced by how well we recover the true slope β through our estimate β̂. To illustrate, consider two possibilities: I We overestimate the slope (β − β̂ < 0). The residuals to the right of the right of the mean (i.e. when X > X̄ ) are shifted down (toward −∞), and the residuals to the left of the mean are shifted up. I We underestimate the slope (β − β̂ > 0). The residuals to the right of the right of the mean are shifted up, and the residuals to the left of the mean are shifted down. 79 / 157 Some properties of residuals We can use the identity Ri = (β − β̂)(Xi − X̄ ) + i − ¯ to derive the variance of each residual. We have: var[(β − β̂)(Xi − X̄ )] = σ 2 (Xi − X̄ )2 / X (Xj − X̄ )2 j var[i − ¯|X ] = var[i |X ] + var[¯ |X ) − 2cov(i , ¯|X ) = σ 2 + σ 2 /n − 2σ 2 /n = σ 2 (n − 1)/n. cov (β − β̂)(Xi − X̄ ), i − ¯|X = −(Xi − X̄ )cov(β̂, i |X ) X = −σ 2 (Xi − X̄ )2 / (Xj − X̄ )2 , j 80 / 157 Some properties of residuals Thus the variance of the i th residual is var[Ri |X ] = var (β − β̂)(Xi − X̄ ) + i − ¯|X = var (β − β̂)(Xi − X̄ )|X + var[i − ¯|X ] + 2cov (β − β̂)(Xi − X̄ ), i − ¯|X X = (n − 1)σ 2 /n − σ 2 (Xi − X̄ )2 / (Xj − X̄ )2 . j Note the following identity, which ensures that this expression is non-negative: (Xi − X̄ )2 / X (Xj − X̄ )2 ≤ (n − 1)/n. j 81 / 157 Some properties of residuals I The residuals are not iid – they are correlated with each other, and they have different variances. I var[Ri ] < var[i ] – the residuals are less variable than the errors. 82 / 157 Some properties of residuals 2 Since E [R Pi ] = 20 it follows that var[Ri ] = E [Ri ]. Therefore the expected value of i Ri is (n − 1)σ 2 − σ 2 = (n − 2)σ 2 since the (Xi − X̄ )2 / P j (Xj − X̄ )2 sum to one. 83 / 157 Estimating the error variance σ 2 Since E X Ri2 = (n − 2)σ 2 i it follows that X Ri2 /(n − 2) i is an unbiased estimate of σ 2 . That is ! E X Ri2 /(n − 2) = σ2 . i 84 / 157 Basic inference for multiple linear regression via OLS We have the following useful identity for the multiple linear regression least squares fit: β̂ = (X0 X)−1 X0 Y = (X0 X)−1 X0 (Xβ + ) = β + (X0 X)−1 X0 . Letting η = (X0 X)−1 X0 we see that E [η|X] = 0 and var[η|X] = var[β̂|X] = (X0 X)−1 X0 var[|X]X(X0 X)−1 = σ 2 (X0 X)−1 • var[|X] = σ 2 I since (i) the i are uncorrelated and (ii) the i have constant variance given X . 85 / 157 Variance of β̂ in multiple regression OLS Let ui be the i th row of the design matrix X. Then X0 X = X ui0 ui i where ui0 ui is an outer-product (a p + 1 × p + 1 matrix). If we have a limiting behavior n−1 X ui0 ui → Q, i for a fixed p + 1 × p + 1 matrix Q, then X0 X ≈ nQ, so cov(β̂|X) ≈ σ 2 Q −1 /n. Thus we see the usual influence of sample size on the standard errors of the regression coefficients. 
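The same kind of check works in the multiple regression case: holding the design matrix fixed, the empirical covariance of β̂ across simulated replications should be close to σ²(X'X)⁻¹. A sketch in Python/NumPy, with arbitrary simulated values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 100, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -0.5, 2.0])

nrep = 5000
bhat = np.empty((nrep, p + 1))
for r in range(nrep):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    bhat[r], *_ = np.linalg.lstsq(X, y, rcond=None)

emp_cov = np.cov(bhat, rowvar=False)           # Monte Carlo covariance of beta_hat
theory = sigma**2 * np.linalg.inv(X.T @ X)     # sigma^2 (X'X)^{-1}
print(np.round(emp_cov, 4))
print(np.round(theory, 4))
```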
86 / 157 Level sets of quadratic forms Suppose 1 0.5 2 3 C= 0.5 1 so that C −1 = 2 −1 −1 2 Left side: level curves of g (x) = x 0 Cx Right side: level curves of h(x) = x 0 C −1 x 3 3 2 1 1.0 0 0 0.5 1 1 4.0 2 33 0 4.0 3. 1.0 0.5 3 2.0.0 1 5.0 5.0 2.0 2 2 2 1 0 1 2 3 33 2 1 0 1 2 3 87 / 157 Eigen-decompositions and quadratic forms The dominant eigenvector of C maximizes the “Rayleigh quotient” g (x) = x 0 Cx/x 0 x, thus it points in the direction of greatest change of g . If C= 1 0.5 0.5 1 , the dominant eigenvector points in the (1, 1) direction. The dominant eigenvector of C −1 points in the (−1, 1) direction. P If the eigendecomposition of C is j λj vj vj0 then the eigendecomposition P 0 of C −1 is j λ−1 j vj vj – thus directions in which the level curves of C are most spread out are the directions in which the level curves of C −1 are most compressed. 88 / 157 Variance of β̂ in multiple regression OLS If p = 2 and 1 X 0 X /n = 0 0 0 1 r 0 r , 1 then 1 cov(β̂) = σ 2 n−1 0 0 0 0 1/(1 − r 2 ) −r /(1 − r 2 ) . −r /(1 − r 2 ) 1/(1 − r 2 ) So if p = 2 and X1 and X2 are positively colinear (meaning that (X1 − X̄1 )0 (X2 − X̄2 ) > 0), the corresponding slope estimates β̂1 and β̂2 are negatively correlated. 89 / 157 The residual sum of squares The residual sum of squares (RSS) is the squared norm of the residual vector: RSS = kY − Ŷ k2 = kY − PY k2 = k(I − P)Y k2 = Y 0 (I − P)(I − P)Y = Y 0 (I − P)Y , where P is the projection matrix onto col(X). The last equivalence follows from the fact that I − P is a projection and hence is idempotent. 90 / 157 The expected value of the RSS The expression RSS = Y 0 (I − P)Y is a quadratic form in Y , and we can write Y 0 (I − P)Y = tr (Y 0 (I − P)Y ) = tr ((I − P)YY 0 ) , where the second equality uses the circulant property of the trace. For three factors, the circulent property states that tr(ABC ) = tr(CAB) = tr(BCA). 91 / 157 The expected value of the RSS By linearity we have E tr[(I − P)YY 0 ] = tr[(I − P) · EYY 0 ], and EYY 0 = E Xββ 0 X0 + EX β0 + E β 0 X0 + E 0 = Xββ 0 X0 + E 0 = Xββ 0 X0 + σ 2 I . Since PX = X and hence (I − P)X = 0, (I − P)E [YY 0 ] = σ 2 (I − P). Therefore the expected value of the RSS is E [RSS] = σ 2 tr(I − P). 92 / 157 Four more properties of projection matrices Property 9: A projection matrix P is symmetric. One way to show this is to let V1 , . . . , Vq be an orthonormal basis for S, where P is the projection onto S. Then complete the Vj with Vq+1 , . . . , Vd to get a basis. By direct calculation, (P − q X Vj Vj0 )Vk = 0 j=1 for all k, hence P = Pq j=1 Vj Vj0 which is symmetric. 93 / 157 Four more properties of projection matrices Property 10: A projection matrix is positive semidefinite. Let V be an arbitrary vector and write V = V1 + V2 , where V1 ∈ S and V2 ∈ S ⊥ . Then (V1 + V2 )0 P(V1 + V2 ) = V10 V1 ≥ 0. 94 / 157 Four more properties of projection matrices Property 11: The eigenvalues of a projection matrix P must be zero or one. Suppose λ, v is an eigenvalue/eigenvector pair: Pv = λv . If P is the projection onto a subspace S, this implies that λv is the closest element of S to v . But if λv ∈ S then v ∈ S, and is strictly closer to v than λv , unless λ = 1 or v = 0. Therefore only 0 and 1 can be eigenvalues of P. 95 / 157 Four more properties of projection matrices Property 12: The trace of a projection matrix is its rank. The rank of a matrix is the number of nonzero eigenvalues. The trace of a matrix is the sum of all eigenvalues. 
Since the nonzero eigenvalues of a projection matrix are all 1, the rank and the trace must be identical. 96 / 157 The expected value of the RSS We know that E [RSS] = σ 2 tr(I − P). Since I − P is the projection onto col(X)⊥ , I − P has rank n − rank(X) = n − p − 1. Thus E [RSS] = σ 2 (n − p − 1), so RSS/(n − p − 1) is an unbiased estimate of σ 2 . 97 / 157 Covariance matrix of residuals Since E β̂ = β, it follows that E Ŷ = Xβ = EY . Therefore we can derive the following simple expression for the covariance matrix of the residuals. cov(Y − Ŷ ) = E (Y − Ŷ )(Y − Ŷ )0 = (I − P)EYY 0 (I − P) = (I − P)(Xββ 0 X0 + σ 2 I )(I − P) = σ 2 (I − P) 98 / 157 Distribution of the RSS The RSS can be written RSS = = tr[(I − P)YY 0 ] tr[(I − P)0 ] Therefore, the distribution of the RSS does not depend on β. It also depends on X only through col(X). 99 / 157 Distribution of the RSS If the distribution of is invariant under orthogonal transforms, i.e. d = Q when Q is a square orthogonal matrix, then we can make the stronger statement that the distribution of the RSS only depends on X through its rank. To see this, construct a square orthogonal matrix Q so that Q 0 (I − P)Q is the projection onto a fixed subspace S of dimension n − p − 1 (so Q 0 maps col(I − P) to S). Then tr[(I − P)0 ] = d tr[(I − P)Q(Q)0 ] = tr[Q 0 (I − P)Q0 ] Note that since Q is square we have QQ 0 = Q 0 Q = I . 100 / 157 Optimality For a given design matrix X, there are many linear estimators that are unbiased for β. That is, there are many matrices M ∈ Rp+1×n such that E [MY |X] = β for all β. The Gauss-Markov theorem states that among these, the least squares estimate is “best,” in the sense that its covariance matrix is “smallest.” Here we are using the definition that a matrix A is “smaller” than a matrix B if B −A is positive definite. 101 / 157 Optimality (BLUE) Letting β ∗ = MY be any linear unbiased estimator of β, when cov(β̂|X) ≤ cov(β ∗ |X), this implies that for any fixed vector θ, var[θ0 β̂|X] ≤ var[θ0 β ∗ |X]. The Gauss-Markov theorem implies that the least squares estimate β̂ is the “BLUE” (best linear unbiased estimator) for the least squares model. 102 / 157 Optimality (proof of GMT) The idea of the proof is to show that for any linear unbiased estimate β ∗ of β, β ∗ − β̂ and β̂ are uncorrelated. It follows that cov(β ∗ |X) = cov(β ∗ − β̂ + β̂|X) = cov(β ∗ − β̂|X) + cov(β̂|X) ≥ cov(β̂|X). To prove the theorem note that E [β ∗ |X] = M · E [Y |X] = MXβ = β for all β, and let B = (X0 X)−1 X0 , so that E [β̂|X] = BXβ = β for all β, so (M − B)X ≡ 0. 103 / 157 Optimality (proof of GMT) Therefore cov(β ∗ − β̂, β̂|X) = E [(M − B)Y (BY − β)0 |X] = E [(M − B)YY 0 B 0 |X] − E [(M − B)Y β 0 |X] = (M − B)(Xββ 0 X0 + σ 2 I )B 0 − (M − B)Xββ 0 = σ 2 (M − B)B 0 = 0. Note that we have an explicit expression for the gap between cov(β̂) and cov(β ∗ ): cov(β ∗ − β̂|X) = σ 2 (M − B)(M − B)0 . 104 / 157 Multivariate density families If C is a covariance matrix, many families of multivariate densities have the form c · φ((x − µ)0 C −1 (x − µ)), where c > 0 and φ : R+ → R+ is a function, typically decreasing with a mode at the origin (e.g. for the multivariate normal density, φ(u) = exp(−u/2) and for the multivariate t-distribution with d degrees of freedom, φ(u) = (1 + u/d)−(d+1)/2 ) . 
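For the multivariate normal member of this family, where φ(u) = exp(−u/2) and c = (2π)^{−p/2}|C|^{−1/2}, the density can be evaluated directly from the quadratic form. A small sketch in Python/NumPy/SciPy; the mean, covariance, and evaluation point are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
x = np.array([0.3, 0.7])

# Density of the form c * phi((x - mu)' C^{-1} (x - mu)) with phi(u) = exp(-u/2)
p = len(mu)
u = (x - mu) @ np.linalg.solve(C, x - mu)           # quadratic form, avoiding an explicit inverse
c = (2 * np.pi) ** (-p / 2) * np.linalg.det(C) ** (-0.5)
print(c * np.exp(-u / 2))

# Reference value from scipy
print(multivariate_normal(mean=mu, cov=C).pdf(x))
```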
Left side: level sets of g (x) = x 0 Cx Right side: level sets of a density c · φ((x − µ)0 C −1 (x − µ)) 3 3 2 0 0.5 1 2 2 1 0 1 2 3 3 1 1 0 2 3 2 3 2 1 0.0 5 0.5 0.0 01 1 2 1 0 0.0 5.0 2.0 4.0 3.0 1.0 1 33 2 0.10.15 0 33 3 2 1 0.1 0.05 1 4.0 2 33 01 1 1.0 0 0.1 5 1 0.0 1 5 3 .0 2.0.0 0.0 2 2 2 1 0 1 2 3 33 2 1 0 1 105 / 157 Regression inference with Gaussian errors The random vector X = (X1 , . . . , Xp )0 has a p-dimensional standard multivariate normal distribution if its components are independent and follow standard normal marginal distributions. The density of X is the product of p standard normal densities: p(X ) = (2π)−p/2 exp(− 1 1X 2 Xj ) = (2π)−p/2 exp(− X 0 X ). 2 2 j 106 / 157 Regression inference with Gaussian errors If we transform Z = µ + AX , we get a random variable satisfying EZ cov(Z ) = µ = Acov(X )A0 = AA0 ≡ Σ. 107 / 157 Regression inference with Gaussian errors The density of Z can be obtained using the change of variables formula: 1 p(Z ) = (2π)−p/2 |A−1 | exp − (Z − µ)0 A−T A−1 (Z − µ) 2 1 = (2π)−p/2 |Σ|−1/2 exp − (Z − µ)0 Σ−1 (Z − µ) 2 This distribution is denoted N(µ, Z ). The log-density is 1 1 − log |Σ| − (Z − µ)0 Σ−1 (Z − µ), 2 2 with the constant term dropped. 108 / 157 Regression inference with Gaussian errors The joint log-density for an iid sample of size n is 1X n n n (Xi − µ)0 Σ−1 (Xi − µ) = − log |Σ| − tr Sxx Σ−1 − log |Σ| − 2 2 2 2 i where Sxx = X (Xi − µ)(Xi − µ)0 /n. 109 / 157 The Cholesky decomposition If Σ is a non-singular covariance matrix, there is a lower triangular matrix A with positive diagonal elements such that AA0 = Σ This matrix can be denoted Σ1/2 , and is called the Cholesky square root. 110 / 157 Properties of the multivariate normal distribution: • A linear function of a multivariate normal random vector is also multivariate normal. Specifically, if X ∼ N(µ, Σ) is p-variate normal, and θ is a q × p matrix with q ≤ p, then Y ∼ θX has a N(θµ, θΣθ0 ) distribution. To prove this fact, write X = µ + AZ where AA0 = Σ is the Cholesky decomposition. 111 / 157 Properties of the multivariate normal distribution: Next, extend θ to a square invertible matrix θ θ∗ θ̃ = . where θ∗ ∈ Rp−q×p . The matrix θ∗ can be chosen such that θΣθ∗0 = 0, by the Gram-Schmidt procedure. Let Ỹ = θ̃X = Y Y∗ = θµ + θAZ θ∗ µ + θ∗ AZ . 112 / 157 Properties of the multivariate normal distribution: Therefore cov(Ỹ ) = θΣθ0 0 0 θ∗ Σθ∗0 , and cov(Ỹ )−1 = (θΣθ0 )−1 0 0 (θ∗ Σθ∗0 )−1 , Using the change of variables formula, and the structure of the multivariate normal density, it follows that p(Ỹ ) = p(Y )p(Y ∗ ). This implies that Y and Y ∗ are independent, and by inspecting the form of their densities, both are seen to be multivariate normal. 113 / 157 Properties of the multivariate normal distribution: • A consequence of the above argument is that in general, uncorrelated components of a multivariate normal vector are independent. • If X is a standard multivariate normal vector, and Q is a square orthogonal matrix, then QX is also standard multivariate normal. This follows directly from the fact that QQ 0 = I . 114 / 157 The χ2 distribution If z is a standard normal random variable, the density of z 2 can be calculated directly as √ p(z) = z −1/2 exp(−z/2)/ 2π. This is the χ21 distribution. The χ2p distribution is defined to be the distribution of p X zj2 j=1 where z1 , . . . , zp are iid standard normal random variables. 115 / 157 The χ2 distribution By direct calculation, if F ∼ χ21 , EF = 1 var[F ] = 2. 
Therefore the mean of the χ2p distribution is p and the variance is 2p. 116 / 157 The χ2 distribution The χ2p density is p(x) = 1 x p/2−1 exp(−x/2). Γ(p/2)2p/2 To prove this by induction, assume that the χ2p−1 density is 1 x (p−1)/2−1 exp(−x/2). Γ((p − 1)/2)2(p−1)/2 The χ2p density is the density of the sum of a χ2p−1 random variable and a χ21 random variable, which can be written 117 / 157 The χ2 distribution 1 Γ((p − 1)/2)Γ(1/2)2p/2 x Z s (p−1)/2−1 exp(−s/2)(x − s)−1/2 exp(−(x − s)/2)ds Z x 1 exp(−x/2) s (p−1)/2−1 (x − s)−1/2 ds Γ((p − 1)/2)Γ(1/2)2p/2 0 Z 1 1 p/2−1 exp(−x/2)x u p/2−3/2 (1 − u)−1/2 du. Γ((p − 1)/2)Γ(1/2)2p/2 0 = 0 = where u = s/x. The density for χ2p can now be obtained by applying the identities Z Γ(1/2) = u α−1 (1 − u)β−1 du = √ π 1 Γ(α)Γ(β)/Γ(α + β). 0 118 / 157 The χ2 distribution and the RSS Let P be a projection matrix and Z be a iid vector of standard normal values. For any square orthogonal matrix Q, Z 0 PZ = (QZ )0 QPQ 0 (QZ ). Since QZ is equal in distribution to Z , Z 0 PZ is equal in distribution to Z 0 QPQ 0 Z . If the rank of P is k, we can choose Q so that QPQ 0 is the projection onto the first k canonical basis vectors, i.e. 0 QPQ = Ik 0 0 0 , where Ik is the k × k identity matrix. 119 / 157 The χ2 distribution and the RSS This gives us d Z 0 PZ = Z 0 QPQ 0 Z = k X Zj2 j=1 which follows a χ2k distribution. It follows that n−p−1 2 σ̂ σ2 = Y 0 (I − P)Y /σ 2 = (/σ)0 (I − P)(/σ) ∼ χ2n−p−1 . 120 / 157 The χ2 distribution and the RSS Thus when the errors are Gaussian, we have var[σ̂ 2 ] = 2σ 4 . n−p−1 121 / 157 The t distribution Suppose Z is standard normal, V ∼ χ2p , and V is independent of Z . Then T = √ √ pZ / V has a “t distribution with p degrees of freedom.” Note that by the law of large numbers, V /p converges almost surely to 1. Therefore T converges in distribution to a standard normal distribution. To derive the t density apply the change of variables formula. Let U W ≡ √ Z/ V V 122 / 157 The t distribution The Jacobian is √ 1/ V J = 0 −Z /2V 3/2 −1/2 = W −1/2 . =V 1 Since the joint density of Z and V is p(Z , V ) ∝ exp(−Z 2 /2)V p/2−1 exp(−V /2), it follows that the joint density of U and W is p(U, W ) ∝ exp(−U 2 W /2)W p/2−1/2 exp(−W /2) = exp(−W (U 2 + 1)/2)W p/2−1/2 . 123 / 157 The t distribution Therefore Z p(U) = p(U, W )dW Z exp(−W (U 2 + 1)/2)W p/2−1/2 dW Z 2 −p/2−1/2 = (U + 1) exp(−Y /2)Y p/2−1/2 dY ∝ ∝ (U 2 + 1)−p/2−1/2 . where Y = W (U 2 + 1). Finally, write T = √ pU to get that p(T ) ∝ (T 2 /p + 1)−(p+1)/2 . 124 / 157 Confidence intervals The linear model residuals are R ≡ Y − Ŷ = (I − P)Y and the estimated coefficients are β̂ = (X0 X)−1 X0 Y . 125 / 157 Confidence intervals Recalling that (I − P)X = 0, and E [Y − Ŷ |X] = 0, it follows that cov(Y − Ŷ , β̂) = E [(Y − Ŷ )β̂ 0 ] = E [(I − P)YY 0 X(X0 X)−1 ] = (I − P)(Xββ 0 X0 + σ 2 I )X(X0 X)−1 = 0. Therefore every estimated coefficient is uncorrelated with every residual. If is Gaussian, they are also independent. Since σ̂ 2 is a function of the residuals, it follows that if is Gaussian, then β̂ and σ̂ 2 are independent. 126 / 157 Confidence interval for a regression coefficient Let Vk = [(X0 X)−1 ]kk so that var[β̂k |X] = σ 2 Vk . If the are multivariate Gaussian N(0, σ 2 I ), then (β̂k − βk )/σ p Vk ∼ N(0, 1). 127 / 157 Confidence intervals for a regression coefficient Therefore (n − p − 1)σ̂ 2 /σ 2 ∼ χ2n−p−1 and using the fact that β̂ and σ̂ 2 are independent, it follows that the pivotal quantity β̂k − βk √ σ̂ 2 Vk has a t-distribution with n − p − 1 degrees of freedom. 
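A simulation sketch of this distributional fact: with Gaussian errors, the pivot for a single coefficient should match the t distribution with n − p − 1 degrees of freedom (Python/NumPy/SciPy; the design, coefficient index, and replication count are arbitrary choices).

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(5)
n, p, sigma, k = 25, 2, 1.0, 1            # k indexes the coefficient being studied
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -0.25])
V_k = np.linalg.inv(X.T @ X)[k, k]

nrep = 20000
pivots = np.empty(nrep)
for r in range(nrep):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ bhat
    s2 = resid @ resid / (n - p - 1)      # unbiased estimate of sigma^2
    pivots[r] = (bhat[k] - beta[k]) / np.sqrt(s2 * V_k)

# About 5% of |pivots| should exceed the two-sided t critical value
q = t.ppf(0.975, df=n - p - 1)
print(np.mean(np.abs(pivots) > q))        # should be close to 0.05
```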
128 / 157 Confidence intervals for a regression coefficient Therefore if QT (q, k) is the q th quantile of the t-distribution with k degrees of freedom, then for 0 ≤ α ≤ 1, and qα = QT (1 − (1 − α)/2, n − p − 1), p P −qα ≤ (β̂k − βk )/ σ̂ 2 Vk ≤ qα = α. Rearranging terms we get the confidence interval p p P β̂k − σ̂qα Vk ≤ βk ≤ β̂k + σ̂qα Vk = α which has coverage probability α. 129 / 157 Confidence intervals for the expected response Let x ∗ be any point in Rp+1 . The expected response at X = x ∗ is E [Y |X = x ∗ ] = β 0 x ∗ . A point estimate for this value is β̂ 0 x ∗ , which is unbiased since β̂ is unbiased, and has variance var[β̂ 0 x ∗ ] = σ 2 x ∗0 (X0 X)−1 x ∗ ≡ σ 2 Vx ∗ . 130 / 157 Confidence intervals for the expected response As above we have that β̂ 0 x ∗ − β 0 x ∗ √ σ̂ 2 Vx ∗ has a t-distribution with n − p − 1 degrees of freedom. Therefore P(β̂ 0 x ∗ − σ̂qα p Vx ∗ ≤ β 0 x ∗ ≤ β̂ 0 x ∗ + σ̂qα p Vx ∗ ) = α defines a CI for E [Y |X = x ∗ ] with coverage probability α. 131 / 157 Prediction intervals Suppose Y ∗ is a new observation at X = x ∗ , independent of the data used to estimate β̂ and σ̂ 2 . If the errors are Gaussian, then Y ∗ − β̂ 0 x ∗ is Gaussian, with the following mean and variance: E [Y ∗ − β̂ 0 x ∗ |X] = β 0 x ∗ − β 0 x ∗ = 0 and var[Y ∗ − β̂ 0 x ∗ |X] = σ 2 (1 + Vx ∗ ), 132 / 157 Prediction intervals It follows that Y ∗ − β̂ 0 x ∗ p σ̂ 2 (1 + Vx ∗ ) has a t-distribution with n − p − 1 degrees of freedom. Therefore a prediction interval at x ∗ with coverage probability α is defined by p p P β̂ 0 x ∗ − σ̂qα (1 + Vx ∗ ) ≤ Y ∗ ≤ β̂ 0 x ∗ + σ̂qα (1 + Vx ∗ ) = α. 133 / 157 Wald tests Suppose we want to carry out a formal hypothesis test for the null hypothesis βk = 0, for some specified index k. Since var(β̂k |X) = σ 2 Vk , the “Z-score” β̂k / p σ 2 Vk follows a standard normal distribution under the null hypothesis. The “Z-test” √ or “asymptotic Wald test” rejects the null hypothesis if |β̂k |/ σ 2 Vk > F −1 (1 − α/2). More generally, we can also test hypotheses of the form βk = c using the Z-score (β̂k − c)/ p σ 2 Vk 134 / 157 Wald tests If the errors are normally distributed, then β̂k / p σ̂ 2 Vk follows a t-distribution with n − p − 1 √ degrees of freedom, and the Wald test rejects the null hypothesis if |β̂k |/ σ 2 Vk > Ft−1 (1 − α/2, n − p − 1), where Ft (·, d) is the student t CDF with d degrees of freedom, and α is the type-I error probability (e.g. α = 0.05). 135 / 157 Wald tests for contrasts More generally, we can consider the contrast θ0 β defined by a fixed vector θ ∈ Rp+1 . The population value of the contrast can be estimated by the plug-in estimate θ0 β̂. We can test the null hypothesis θ0 β = 0 with the Z-score √ θ0 β̂/ σ̂ 2 θ0 V θ. This approximately follows a standard normal distribution, and more “exactly” (under specified assumptions) it follows a student t-distribution with n − p − 1 degrees of freedom (all under the null hypothesis). 136 / 157 F-tests Suppose we have two nested design matrices X1 ∈ Rn×p1 and X2 ∈ Rn×p2 , such that col(X1 ) ⊂ col(X2 ). We may wish to compare the model defined by X1 to the model defined by X2 . To do this, we need a test statistic that discriminates between the two models. Let P1 and P2 be the corresponding projections, and let Ŷ (1) = P1 Y (2) = P2 Y Ŷ be the fitted values. 137 / 157 F-tests Due to the nesting, P2 P1 = P1 and P1 P2 = P1 . Therefore (P2 − P1 )2 = P2 − P1 , so P2 − P1 is a projection that projects onto col(X2 ) − col(X1 ), the complement of col(X1 ) in col(X2 ). 
Since (I − P2 )(P2 − P1 ) = 0, it follows that if E [Y ] ∈ col(X2 ) then cov(Y − Ŷ (2) , Ŷ (2) − Ŷ (1) ) = E [(I − P2 )YY 0 (P1 − P2 )] = 0. If the linear model errors are Gaussian, Y − Ŷ (2) and Ŷ (2) − Ŷ (1) are independent. 138 / 157 F-tests Since P2 X1 = P1 X1 = X1 , we have (I − P2 )X1 = (P2 − P1 )X1 = 0. Now suppose we take as the null hypothesis that EY ∈ col(X1 ), so we can write Y = θ + , where θ ∈ col(X1 ). Therefore under the null hypothesis kY − Ŷ (2) k2 = tr[(I − P2 )YY 0 ] = tr[(I − P2 )0 ] and kŶ (2) − Ŷ (1) k2 = tr[(P2 − P1 )YY 0 ] = tr[(P2 − P1 )0 ]. 139 / 157 F-tests Since I − P2 and P2 − P1 are projections onto subspaces of dimension n − p2 and p2 − p1 , respectively, it follows that kY − Ŷ (2) k2 /σ 2 = k(I − P2 )Y k2 /σ 2 ∼ χ2n−p2 and under the null hypothesis kŶ (2) − Ŷ (1) k2 /σ 2 = k(P2 − P1 )Y k2 /σ 2 ∼ χ2p2 −p1 . Therefore kŶ (2) − Ŷ (1) k2 /(p2 − p1 ) kY − Ŷ (2) k2 /(n − p2 ) . Since kŶ (2) − Ŷ (1) k2 will tend to be large when EY 6∈ col(X1 ), i.e. when the null hypothesis is false, this quantity can be used as a test-statistic. It is called the F-test statistic. 140 / 157 The F distribution The F-test statistic is the ratio between two rescaled independent χ2 draws. If U ∼ χ2p and V ∼ χ2q , then U/p V /q has an “F-distribution with p, q degrees of freedom,” denoted Fp,q . To derive the kernel of the density, let X Y ≡ U/V V . 141 / 157 The F distribution The Jacobian of the map is 1/V 0 −U/V 2 = 1/V . 1 The joint density of U and V is p(U, V ) ∝ U p/2−1 exp(−U/2)V q/2−1 exp(−V /2). The joint density of X and Y is p(X , Y ) ∝ X p/2−1 Y (p+q)/2−1 exp(−Y (X + 1)/2). 142 / 157 The F distribution Now let Z = Y (X + 1), so p(X , Z ) ∝ X p/2−1 Z (p+q)/2−1 (X + 1)−(p+q)/2 exp(−Z /2) and hence p(X ) ∝ X p/2−1 /(X + 1)(p+q)/2 . 143 / 157 The F distribution Now if we let F = U/p q = X V /q p then p(F ) ∝ F p/2−1 /(pF /q + 1)(p+q)/2 . 144 / 157 The F distribution Therefore, the F-test statistic follows an Fp2 −p1 ,n−p2 distribution kŶ (2) − Ŷ (1) k2 /(p2 − p1 ) kY − Ŷ (2) k2 /(n − p2 ) ∼ Fp2 −p1 ,n−p2 . 145 / 157 Simultaneous confidence intervals If θ is a fixed vector, we can cover the value θ0 β with probability α by pivoting on the tn−p−1 -distributed pivotal quantity θ0 β̂ − θ0 β √ , σ̂ Vθ where Vθ = θ0 (X0 X)−1 θ. Now suppose we have a set T ⊂ Rp+1 of vectors θ, and we want to construct a set of confidence intervals such that P(all θ0 β covered, θ ∈ T ) = α. We call this a set of simultaneous confidence intervals for {θ0 β; θ ∈ T }. 146 / 157 The Bonferroni approach to simultaneous confidence intervals The Bonferroni approach can be applied when T is a finite set, |T | = k. Let Ij = I(CI j covers θj0 β) and Ij0 = I(CI j does not cover θj0 β). Then the union bound implies that P(I1 and I2 · · · and Ik ) 1 − P(I10 or I20 · · · or Ik0 ) X ≥ 1− P(Ij0 ). = i 147 / 157 The Bonferroni approach to simultaneous confidence intervals As long as 1−α≥ X P(Ij0 ), i the intervals cover simultaneously. One way to achieve this is if each interval individually has probability α0 ≡ 1 − (1 − α)/k of covering its corresponding true value. To do this, use the same approach as used to construct single confidence intervals, but with a much larger value of qα . 148 / 157 The Scheffé approach to simultaneous confidence intervals The Scheffé approach can be applied if T is a linear subspace of Rp+1 . Begin with the pivotal quantity θ0 β̂ − θ0 β p , σ̂ 2 θ0 (X0 X)−1 θ and postulate that a symmetric interval can be found so that P ! θ0 β̂ − θ0 β −Qα ≤ p ≤ Qα for all θ ∈ T σ̂ 2 θ0 (X0 X)−1 θ = α. 
Equivalently, we can write P supθ∈T (θ0 β̂ − θ0 β)2 ≤ Qα2 σ̂ 2 θ0 (X0 X)−1 θ ! = α. 149 / 157 The Scheffé approach approach to simultaneous confidence intervals Since β̂ − β = (X0 X)−1 X0 , we have (θ0 β̂ − θ0 β)2 σ̂ 2 θ0 (X0 X)−1 θ = = θ0 (X0 X)−1 X0 0 X(X0 X)−1 θ σ̂ 2 θ0 (X0 X)−1 θ Mθ0 0 Mθ , σ̂ 2 Mθ0 Mθ where Mθ = X(X0 X)−1 θ. 150 / 157 The Scheffé approach to simultaneous confidence intervals Note that Mθ0 0 Mθ = h, Mθ /kMθ ki2 /σ̂ 2 , σ̂ 2 Mθ0 Mθ i.e. it is the squared length of the projection of onto the line spanned by Mθ (divided by σ̂ 2 ). The quantity h, Mθ /kMθ ki2 is maximized at kPk2 , where P is the projection onto the linear space {X(X0 X)−1 θ | θ ∈ T } = {Mθ }. Therefore supθ∈T h, Mθ /kMθ ki2 /σ̂ 2 = kPk2 /σ̂ 2 , and since {Mθ } ⊂ col(X), it follows that P and σ̂ 2 are independent. 151 / 157 The Scheffé approach to simultaneous confidence intervals Moreover, kPk2 /σ 2 ∼ χ2q where q = dim(T ), and as we know, n−p−1 2 σ̂ ∼ χ2n−p−1 . σ2 Thus kPk2 /q ∼ Fq,n−p−1 . σ̂ 2 152 / 157 The Scheffé approach to simultaneous confidence intervals Let QF be the α quantile of the Fq,n−p−1 distribution. Then P p |θ0 β̂ − θ0 β| p ≤ σ̂ qQF for all θ θ0 (X0 X)−1 θ ! = α, so p p P θ0 β̂ − σ̂ qQF Vθ ≤ θ0 β ≤ θ0 β̂ + σ̂ qQF Vθ for all θ = α defines a level α simultaneous confidence set for {θ0 β | θ ∈ T }, where Vθ = θ0 (X0 X)−1 θ. 153 / 157 Polynomial regression The conventional linear model has the mean specification E [Y |X1 , . . . , Xq ] = β0 + β1 X1 + · · · βp Xp . It is possible to accomodate nonlinear relationships while still working with linear models. Polynomial regression is a traditional approach to doing this. If there is only one predictor variable, polynomial reqression uses the mean structure E [Y |X ] = β0 + β1 X + β2 X 2 + · · · + βp X p . Note that this is still a linear model, as it is linear in the coefficients β. Multiple regression techniques (e.g. OLS) can be used for estimation and inference. 154 / 157 Functional linear regression A drawback of polynomial regression is that the polynomials can be highly colinear (e.g. cor(U, U 2 ) ≈ 0.97 if U is uniform on 0, 1). Also, polynomials are unbounded and it is often desirable to model E [Y |X ] as a bounded function. If there is a single covariate X , we can use a model of the form E [Y |X ] = β1 φ1 (X ) + · · · + βp φp (X ) where the φj (·) are basis functions that are chosen based on what we think E [Y |X ] might look like. Splines, sinusoids, wavelets, and radial basis functions are possible choices. Since the mean function is linear in the unknown parameters βj , this is a linear model and can be estimated using multiple linear regression (OLS) techniques. 155 / 157 Confidence bands Suppose we have a functional linear model of the form E [Y |X , t] = β 0 X + f (t), and we model f (t) as f (t) = q X γj φj (t) j=1 where the φj (·) are basis functions. A confidence band for f with coverage probability α is an expression of the form fˆ(t) ± M(t) such that P fˆ(t) − M(t) ≤ f (t) ≤ fˆ(t) + M(t) ∀t = α. 156 / 157 Confidence bands We can use as a point estimate fˆ(t) ≡ q X γ̂j φj (t) j=1 Pq For each fixed t, j=1 γj φj (t) is a linear combniation of the γ̂j , so if we simultaneously cover all linear combinations of the γj , we will have our confidence band. Thus the Scheffé procedure can be applied with T = Rq . 157 / 157
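To close, here is a sketch of how the Scheffé multiplier could be computed in practice for a confidence band, and how it compares with the pointwise t multiplier (Python/NumPy/SciPy). The cubic polynomial basis, the coverage level, and the simulated data are arbitrary illustrative choices; since this toy model has only the q basis coefficients, the residual degrees of freedom are n − q rather than n − p − 1.

```python
import numpy as np
from scipy.stats import f, t

rng = np.random.default_rng(6)
n = 60
x = np.sort(rng.uniform(0, 1, size=n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Functional linear model E[Y|t] = sum_j gamma_j phi_j(t) with a polynomial basis
B = np.column_stack([np.ones(n), x, x**2, x**3])
q = B.shape[1]
gamma_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
resid = y - B @ gamma_hat
s2 = resid @ resid / (n - q)                       # estimate of sigma^2
XtX_inv = np.linalg.inv(B.T @ B)

coverage = 0.95
grid = np.linspace(0, 1, 101)
G = np.column_stack([np.ones_like(grid), grid, grid**2, grid**3])
V = np.einsum("ij,jk,ik->i", G, XtX_inv, G)        # theta'(X'X)^{-1}theta along the grid

scheffe = np.sqrt(q * f.ppf(coverage, q, n - q))   # simultaneous (Scheffe) multiplier
pointwise = t.ppf(1 - (1 - coverage) / 2, n - q)   # pointwise t multiplier
fhat = G @ gamma_hat
lower = fhat - scheffe * np.sqrt(s2 * V)
upper = fhat + scheffe * np.sqrt(s2 * V)
print(scheffe, pointwise)                          # the Scheffe multiplier is larger
```

The wider multiplier is the price of covering f(t) at every t simultaneously rather than one t at a time.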