What is a Regression Model?

Statistical model for: Y (random variable) depending on x (non-random variable).
Aim: understand the effect of x on the random quantity Y.
General formulation(1): Y ~ Distribution{g(x)}.
Statistical problem: estimate (learn) g(·) from data {(x_i, y_i)}_{i=1}^n.
Use for: description, inference, prediction, data compression (parsimonious representations), ...

(1) Often books/people write Y | x ~ Distribution{g(x)}, but this implies that (X, Y) have a joint distribution; this assumption is unnecessary (e.g., in a designed experiment we choose the values of x). Despite this, we write Y | x to remind ourselves that the distribution of Y depends on x.

Linear Models — Victor Panaretos, Institut de Mathématiques – EPFL, victor.panaretos@epfl.ch

Example: Gas mileage
[Figure: histogram of Auto$mpg — miles per gallon for 392 car models]

Example: Honolulu tide
[Figure: recorded tide heights in Honolulu]

Example: Honolulu tide with time covariate
[Figure: tide heights plotted against time]

Example: Gas mileage with horsepower covariate
[Figure: scatterplot of gas mileage (MPG) against horsepower]

Great Variety of Models

x can be: continuous, discrete, categorical, a vector; x may arrive randomly, or be chosen by the experimenter, or both. However x arises, we treat it as constant in the analysis.
Distribution can be: Gaussian (Normal), Laplace, binomial, Poisson, gamma, a general exponential family, ...
The function g(·) can be: g(x) = β0 + β1 x, g(x) = Σ_{k=1}^K β_k e^{iω_k x}, a cubic spline, ...

Fundamental Case: Normal Linear Regression

Remember the general model: Y ~ Distribution{g(x)}.
Take Y, x ∈ R, g(x) = β0 + β1 x, Distribution = Gaussian:
Y | x ~ N(β0 + β1 x, σ²)  ⟺  Y = β0 + β1 x + ε,  ε ~ N(0, σ²).
The second version is useful for mathematical work, but is puzzling statistically, since we do not observe ε.
x could also be a vector (Y, β0 ∈ R, x ∈ R^p, β ∈ R^p):
Y | x ~ N(β0 + β^T x, σ²)  ⟺  Y = β0 + β^T x + ε,  ε ~ N(0, σ²).

Example: Professor's Van
[Figures: photographs introducing the professor's van example]

Tools of the trade

Start from the Normal linear model → gradually generalise.
Important features of the Normal linear model: the Gaussian distribution and linearity. These two combine well and give geometric insights that solve the estimation problem.
Thus we need to revise some linear algebra and probability: projections, spectra, the Gaussian law.
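To make the "second version" concrete, here is a minimal R sketch (not from the slides) that simulates data from the simple Normal linear model and recovers g(x) = β0 + β1 x with lm(); the parameter values are arbitrary choices for illustration.

## Simulate Y = beta0 + beta1 * x + eps, eps ~ N(0, sigma^2), and fit it.
set.seed(1)
n     <- 100
beta0 <- 2; beta1 <- -0.5; sigma <- 1      # hypothetical parameter values
x     <- runif(n, 0, 10)                   # x chosen by the "experimenter"
y     <- beta0 + beta1 * x + rnorm(n, 0, sigma)
fit   <- lm(y ~ x)                         # estimates g(x) = beta0 + beta1 * x
coef(fit)                                  # should be close to (2, -0.5)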
Will base the course on the Gaussian assumption, but relax linearity later:
linear Gaussian regression → nonlinear Gaussian regression → nonparametric Gaussian regression.
Many further generalisations are possible ...

Reminder: Subspaces, Spectra, Projections

If Q is an n × p real matrix, we define the column space (or range) of Q, M(Q), to be the set spanned by its columns:
M(Q) = {y ∈ R^n : ∃ θ ∈ R^p, y = Qθ}.
M(Q) is a subspace of R^n.
This allows an interpretation of the system of linear equations Qθ = y:
- existence of a solution ↔ is y an element of M(Q)?
- uniqueness of a solution ↔ is there a unique coordinate vector θ?
The columns of Q provide a coordinate system for the subspace M(Q). If Q is of full column rank (p), then the coordinates θ corresponding to a y ∈ M(Q) are unique.

Two further important subspaces are associated with a real n × p matrix Q:
- the null space (or kernel) of Q is the subspace ker(Q) = {x ∈ R^p : Qx = 0};
- the orthogonal complement of M(Q) is the subspace M⊥(Q) = {y ∈ R^n : y^T Qx = 0, ∀ x ∈ R^p} = {y ∈ R^n : y^T v = 0, ∀ v ∈ M(Q)}.
The orthogonal complement may be defined for arbitrary subspaces by using the second equality.

Theorem (Spectral Theorem)
A p × p matrix Q is symmetric if and only if there exist a p × p orthogonal matrix U and a diagonal matrix Λ such that Q = U Λ U^T. In particular:
1. the columns of U = (u1, ..., up) are eigenvectors of Q, i.e. there exist λj such that Q uj = λj uj, j = 1, ..., p;
2. the entries of Λ = diag(λ1, ..., λp) are the corresponding eigenvalues of Q, which are real;
3. the rank of Q is the number of non-zero eigenvalues.
Note: if the eigenvalues are distinct, the eigenvectors are unique (up to re-ordering).

Theorem (Singular Value Decomposition)
Any n × p real matrix Q can be factorised as Q = U_{n×n} D_{n×p} V^T_{p×p}, where U and V are orthogonal, with columns called left singular vectors and right singular vectors respectively, and D is diagonal with real entries called singular values.
1. The right singular vectors are eigenvectors of Q^T Q.
2. The left singular vectors are eigenvectors of Q Q^T.
3. The squares of the singular values are eigenvalues of both Q Q^T and Q^T Q.
4. The left singular vectors corresponding to non-zero singular values form an orthonormal basis for M(Q).
5. The left singular vectors corresponding to zero singular values form an orthonormal basis for M⊥(Q).

Reminder: Subspaces, Spectra, Projections

Recall that a matrix Q is called idempotent if Q² = Q. An orthogonal projection (henceforth projection) onto a subspace V is a symmetric idempotent matrix H such that M(H) = V.

Proposition. The only possible eigenvalues of a projection matrix are 0 and 1.

Proposition. Let V be a subspace and H be a projection onto V. Then I − H is the projection matrix onto V⊥.
Proof. (I − H)^T = I − H^T = I − H since H is symmetric, and (I − H)² = I − 2H + H² = I − H. Thus I − H is a projection matrix. It remains to identify the column space of I − H. Let H = U Λ U^T be the spectral decomposition of H. Then I − H = U U^T − U Λ U^T = U (I − Λ) U^T.
Hence the column space of I − H is spanned by the eigenvectors of H corresponding to zero eigenvalues of H, which coincides with M⊥(H) = V⊥.

Proposition. Let V be a subspace and H be a projection onto V. Then Hy = y for all y ∈ V.

Proposition. If P and Q are projection matrices onto a subspace V, then P = Q.

Proposition. If x1, ..., xp are linearly independent and are such that span(x1, ..., xp) = V, then the projection onto V can be represented as H = X (X^T X)^{-1} X^T, where X is a matrix with columns x1, ..., xp.

Proposition. Let V be a subspace of R^n and H be a projection onto V. Then
‖x − Hx‖ ≤ ‖x − v‖,  ∀ v ∈ V.
Proof. Let H = U Λ U^T be the spectral decomposition of H, with U = (u1, ..., un) and Λ = diag(λ1, ..., λn). Letting p = dim(V), we have λ1 = ... = λp = 1 and λ_{p+1} = ... = λn = 0; u1, ..., un is an orthonormal basis of R^n and u1, ..., up is an orthonormal basis of V. Then
‖x − Hx‖² = Σ_{i=1}^n (x^T ui − (Hx)^T ui)²      [orthonormal basis]
          = Σ_{i=1}^n (x^T ui − x^T H ui)²       [H is symmetric]
          = Σ_{i=1}^n (x^T ui − λi x^T ui)²      [u's are eigenvectors of H]
          = 0 + Σ_{i=p+1}^n (x^T ui)²            [eigenvalues 0 or 1]
          ≤ Σ_{i=1}^p (x^T ui − v^T ui)² + Σ_{i=p+1}^n (x^T ui)²,  ∀ v ∈ V
          = ‖x − v‖².

Proposition. Let V1 ⊆ V ⊆ R^n be two nested linear subspaces. If H1 is the projection onto V1 and H is the projection onto V, then H H1 = H1 = H1 H.
Proof. First we show that H H1 = H1, and then that H1 H = H H1. For all y ∈ R^n we have H1 y ∈ V1. But then H1 y ∈ V, since V1 ⊆ V. Therefore H H1 y = H1 y. We have shown that (H H1 − H1) y = 0 for all y ∈ R^n, so that H H1 = H1, as its kernel is all of R^n. (Or, take n linearly independent vectors y1, ..., yn ∈ R^n, and use them as columns of the n × n matrix Y. Now Y is invertible, and (H H1 − H1) Y = 0, so H H1 − H1 = 0, giving H H1 = H1.) To prove that H1 H = H H1, note that symmetry of projection matrices and the first part of the proof give
H1 H = H1^T H^T = (H H1)^T = (H1)^T = H1 = H H1.

Positive-Definite Matrices

Definition (Non-Negative Matrix – Quadratic Form Definition)
A p × p real symmetric matrix Σ is called non-negative definite (written Σ ⪰ 0) if and only if x^T Σ x ≥ 0 for all x ∈ R^p. If x^T Σ x > 0 for all x ∈ R^p \ {0}, then we call Σ positive definite (written Σ ≻ 0).

An equivalent definition is:

Definition (Non-Negative Matrix – Spectral Definition)
A p × p real symmetric matrix Σ is called non-negative definite (written Σ ⪰ 0) if and only if the eigenvalues of Σ are non-negative. If the eigenvalues of Σ are strictly positive, then Σ is called positive definite (written Σ ≻ 0).

Lemma (Exercise). Prove that the two definitions are equivalent.
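As a quick numerical companion to the propositions above, the following R sketch (not from the slides) builds H = X(X^T X)^{-1} X^T for an arbitrary full-rank X and checks symmetry, idempotency, that its eigenvalues are 0 or 1, that Hy = y on M(X), and that I − H is orthogonal to H.

## Projection onto the column space of a small arbitrary design matrix.
set.seed(2)
X <- cbind(1, rnorm(6), rnorm(6))               # 6 x 3, full column rank
H <- X %*% solve(crossprod(X)) %*% t(X)
max(abs(H - t(H)))                              # symmetry: ~ 0
max(abs(H %*% H - H))                           # idempotency: ~ 0
round(eigen(H, symmetric = TRUE)$values, 10)    # three 1's and three 0's
max(abs(H %*% X - X))                           # Hy = y for y in M(X)
max(abs(H %*% (diag(6) - H)))                   # M(I - H) is orthogonal to M(H)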
Covariance Matrices

Definition (Covariance Matrix)
Let Y = (Y1, ..., Yn)^T be a random n × 1 vector such that E‖Y‖² < ∞. The covariance matrix of Y, say Σ, is the n × n symmetric matrix with entries
Σij = cov(Yi, Yj) = E[(Yi − E[Yi])(Yj − E[Yj])],  1 ≤ i ≤ j ≤ n.
That is, the covariance matrix encodes the variances of the coordinates of Y (on the diagonal) and the covariances between the coordinates of Y (off the diagonal). If we write μ = E[Y] = (E[Y1], ..., E[Yn])^T for the mean vector of Y, then the covariance matrix of Y can be written as
Σ = E[(Y − μ)(Y − μ)^T] = E[Y Y^T] − μ μ^T.
Whenever Y is a random vector, we will write cov(Y) or var(Y) for the covariance matrix of Y.

Lemma. Let Y be a random d-vector such that E‖Y‖² < ∞. Let μ be the mean vector and Σ the covariance matrix of Y. If A is a p × d real matrix, the mean vector and covariance matrix of AY are Aμ and A Σ A^T, respectively.
Proof. Exercise.

Corollary (Covariance of Projections)
Let Y be a random d-vector such that E‖Y‖² < ∞, and let α, β ∈ R^d be fixed vectors. If Σ denotes the covariance matrix of Y, the variance of α^T Y is α^T Σ α; the covariance of α^T Y with β^T Y is α^T Σ β.

Non-negative Matrices

Proposition (Non-Negative and Covariance Matrices)
Let Σ be a real symmetric matrix. Then Σ is non-negative definite if and only if Σ is the covariance matrix of some random variable Y.
Proof. Exercise.

Principal Component Analysis

Let Y be a random vector in R^d with covariance matrix Σ.
1. Find the direction v1 ∈ S^{d−1} such that the projection of Y onto v1 has maximal variance.
2. For j = 2, 3, ..., d, find the direction vj ⊥ v1, ..., v_{j−1} such that the projection of Y onto vj has maximal variance.
Solution: maximise Var(v1^T Y) = v1^T Σ v1 over ‖v1‖ = 1. Writing Σ = U Λ U^T,
v1^T Σ v1 = v1^T U Λ U^T v1 = ‖Λ^{1/2} U^T v1‖² = Σ_{i=1}^d λi (ui^T v1)²      [change of basis].
Now Σ_{i=1}^d (ui^T v1)² = ‖v1‖² = 1, so we have a convex combination of the {λj}_{j=1}^d:
Σ_{i=1}^d pi λi,  Σ pi = 1,  pi ≥ 0,  i = 1, ..., d.
Since λi ≥ 0, this sum is clearly maximised when p1 = 1 and pj = 0 for all j ≠ 1, i.e. v1 = u1. Iteratively, vj = uj, i.e. the principal components are the eigenvectors of Σ.

Principal Component Analysis: Optimal Linear Dimension Reduction

Theorem. Let Y be a mean-zero random variable in R^d with d × d covariance Σ. Let H be the projection matrix onto the span of the first k eigenvectors of Σ. Then
E‖Y − HY‖² ≤ E‖Y − QY‖²
for any d × d projection operator Q of rank at most k.

Intuitively: if you want to approximate a mean-zero random variable taking values in R^d by a random variable that ranges over a subspace of dimension at most k ≤ d, the optimal choice is the projection of the random variable onto the space spanned by its first k principal components (eigenvectors of the covariance). "Optimal" is with respect to the mean squared error.

For the proof, use the lemma below (it follows immediately from the spectral decomposition).

Lemma. Q is a rank-k projection matrix if and only if there exist orthonormal vectors {vj}_{j=1}^k such that Q = Σ_{j=1}^k vj vj^T.

Proof of the theorem. Write Q = Σ_{j=1}^k vj vj^T for some orthonormal {vj}_{j=1}^k. Then,
E‖Y − QY‖² = E[Y^T (I − Q)^T (I − Q) Y]
           = E tr{(I − Q) Y Y^T (I − Q)^T}
           = tr{(I − Q) E[Y Y^T] (I − Q)^T}
           = tr{(I − Q)^T (I − Q) Σ}
           = tr{(I − Q) Σ}
           = tr Σ − tr{Q Σ}
           = Σ_{i=1}^d λi − Σ_{j=1}^k vj^T Σ vj
           = Σ_{i=1}^d λi − Σ_{j=1}^k Var(vj^T Y).
If we can minimise this expression over all {vj}_{j=1}^k with vi^T vj = 1{i = j}, then we're done. By PCA (previous slide), this is done by choosing the top k eigenvectors of Σ.
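The following R sketch (not from the slides) illustrates the PCA construction above on arbitrary simulated data: the principal directions are the eigenvectors of the (empirical) covariance, the first one attains the maximal projected variance, and prcomp() reproduces the same quantities.

## PCA as an eigendecomposition of the covariance matrix.
set.seed(3)
Y  <- matrix(rnorm(200 * 3), 200, 3) %*% matrix(c(2, 1, 0, 0, 1, 0, 0, 0, 0.2), 3, 3)
S  <- cov(Y)                      # empirical covariance
eS <- eigen(S)                    # eigenvectors = principal directions
v1 <- eS$vectors[, 1]
var(Y %*% v1)                     # maximal projected variance ...
eS$values[1]                      # ... equals the largest eigenvalue
prcomp(Y)$sdev^2                  # same eigenvalues, via prcomp()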
Principal Component Analysis

Corollary. Let {x1, ..., xn} ⊂ R^d be such that x1 + ... + xn = 0, and let X be the matrix with columns {xi}_{i=1}^n. The best approximating k-hyperplane to the points {x1, ..., xn} is given by the span of the first k eigenvectors of the matrix X X^T (equivalently, the first k left singular vectors of X), i.e. if H is the projection onto this span, it holds that
Σ_{i=1}^n ‖xi − H xi‖² ≤ Σ_{i=1}^n ‖xi − Q xi‖²
for any d × d projection operator Q of rank at most k.
Proof. Define the discrete random vector Y by P[Y = xi] = 1/n, and use optimal linear dimension reduction.

Gaussian Vectors and Affine Transformations

Definition (Multivariate Gaussian Distribution)
A random vector Y in R^d has the multivariate normal distribution if and only if α^T Y has the univariate normal distribution, ∀ α ∈ R^d.

Observation: from the definition it follows that Y must have a well-defined mean vector and a well-defined covariance matrix. To see this, note that since E{(α^T Y)²} < ∞ for all α, we can successively pick α equal to each canonical basis vector and conclude that each coordinate has finite variance, and thus E‖Y‖² < ∞. So all the means, variances and covariances of its coordinates are well defined. Then, the mean vector μ (say) and covariance matrix Σ (say) can be (uniquely) determined entrywise by equating
μi = E[ei^T Y]  &  Σij = cov{ei^T Y, ej^T Y},
where ej is the j-th canonical basis vector ej = (0, 0, ..., 1, ..., 0, 0)^T (1 in the j-th position).

Gaussian Vectors and Affine Transformations

How can we use this definition to determine basic properties? The moment generating function (MGF) of a random vector W in R^d is defined as
M_W(ξ) = E[e^{ξ^T W}],  ξ ∈ R^d,
provided the expectation exists. When the MGF exists it characterises the distribution of the random vector. Furthermore, two random vectors are independent if and only if their joint MGF is the product of their marginal MGFs, i.e.
X_{n×1} independent of Y_{m×1}  ⟺  E[e^{θ^T X + ξ^T Y}] = E[e^{θ^T X}] E[e^{ξ^T Y}],  ∀ θ ∈ R^n & ξ ∈ R^m.

Useful facts:
1. The moment generating function of Y ~ N(μ, Σ) is M_Y(u) = exp(u^T μ + ½ u^T Σ u).
2. If Y ~ N(μ_{p×1}, Σ_{p×p}) and we are given B_{n×p} and α_{n×1}, then α + BY ~ N(α + Bμ, B Σ B^T).
3. The N(μ, Σ) density, assuming Σ nonsingular, is f_Y(y) = (2π)^{-p/2} |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}.
4. Constant density isosurfaces are ellipsoidal.
5. Marginals of a Gaussian are Gaussian (the converse is NOT true).
6. Σ diagonal ⟺ independent coordinates Yj.
7. If Y ~ N(μ_{p×1}, Σ_{p×p}), then AY independent of BY ⟺ A Σ B^T = 0.

Proposition (Property 1: Moment Generating Function)
The moment generating function of Y ~ N(μ, Σ) is M_Y(u) = exp(u^T μ + ½ u^T Σ u).
Proof. Let v ∈ R^d be arbitrary. Then v^T Y is scalar Gaussian with mean v^T μ and variance v^T Σ v. Hence it has moment generating function
M_{v^T Y}(t) = E[e^{t v^T Y}] = exp{t (v^T μ) + (t²/2)(v^T Σ v)}.
Now take t = 1 and observe that M_{v^T Y}(1) = E[e^{v^T Y}] = M_Y(v). Combining the two, we conclude that
M_Y(v) = exp(v^T μ + ½ v^T Σ v),  v ∈ R^d.
Proposition (Property 2: Affine Transformation)
If Y ~ N(μ_{p×1}, Σ_{p×p}) and we are given B_{n×p} and α_{n×1}, then α + BY ~ N(α + Bμ, B Σ B^T).
Proof.
M_{α+BY}(u) = E exp{u^T (α + BY)}
            = exp(u^T α) E exp{(B^T u)^T Y}
            = exp(u^T α) M_Y(B^T u)
            = exp(u^T α) exp{(B^T u)^T μ + ½ u^T B Σ B^T u}
            = exp{u^T α + u^T Bμ + ½ u^T B Σ B^T u}
            = exp{u^T (α + Bμ) + ½ u^T B Σ B^T u}.
And this last expression is the MGF of a N(α + Bμ, B Σ B^T) distribution.

Proposition (Property 3: Density Function)
Let Σ_{p×p} be nonsingular. The density of N(μ_{p×1}, Σ_{p×p}) is
f_Y(y) = (2π)^{-p/2} |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}.
Proof. Let Z = (Z1, ..., Zp)^T be a vector of iid N(0, 1) random variables. Then, because of independence,
(a) the density of Z is f_Z(z) = Π_{i=1}^p f_{Zi}(zi) = Π_{i=1}^p (2π)^{-1/2} exp(−½ zi²) = (2π)^{-p/2} exp(−½ z^T z);
(b) the MGF of Z is M_Z(u) = E exp(Σ_{i=1}^p ui Zi) = Π_{i=1}^p E{exp(ui Zi)} = exp(u^T u / 2), which is the MGF of a p-variate N(0, I) distribution.
By the spectral theorem, Σ admits a square root Σ^{1/2}. Furthermore, since Σ is nonsingular, so is Σ^{1/2}. Now observe that, from Property 2, we have Y =d Σ^{1/2} Z + μ ~ N(μ, Σ). By the change of variables formula,
f_Y(y) = f_{Σ^{1/2} Z + μ}(y) = |Σ^{-1/2}| f_Z{Σ^{-1/2}(y − μ)} = (2π)^{-p/2} |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}.
[Recall that to obtain the density of W = g(X) at w, we need to evaluate f_X at g^{-1}(w), but also multiply by the Jacobian determinant of g^{-1} at w.]

Proposition (Property 4: Isosurfaces)
The isosurfaces of a N(μ_{p×1}, Σ_{p×p}) density are p-dimensional ellipsoids centred at μ, with principal axes given by the eigenvectors of Σ and with anisotropies given by the ratios of the square roots of the corresponding eigenvalues of Σ.
Exercise: use Property 3 and the spectral theorem.

Proposition (Property 5: Coordinate Distributions)
Let Y = (Y1, ..., Yp)^T ~ N(μ_{p×1}, Σ_{p×p}). Then Yj ~ N(μj, Σjj).
Proof. Observe that Yj = (0, 0, ..., 1, ..., 0, 0) Y (1 in the j-th position) and use Property 2.

Proposition (Property 6: Diagonal Independence)
Let Y = (Y1, ..., Yp)^T ~ N(μ_{p×1}, Σ_{p×p}). Then the Yi are mutually independent if and only if Σ is diagonal.
Proof. Suppose that the Yj are independent. Property 5 yields Yj ~ N(μj, σj²) for some σj > 0. Thus the density of Y is
f_Y(y) = Π_{j=1}^p f_{Yj}(yj) = Π_{j=1}^p (σj √(2π))^{-1} exp{−(yj − μj)²/(2σj²)}
       = (2π)^{-p/2} |diag(σ1², ..., σp²)|^{-1/2} exp{−½ (y − μ)^T diag(σ1^{-2}, ..., σp^{-2}) (y − μ)}.
Hence Y ~ N{μ, diag(σ1², ..., σp²)}, i.e. the covariance is diagonal. Conversely, assume Σ is diagonal, say Σ = diag(σ1², ..., σp²). Then we can reverse the steps of the first part to see that the joint density f_Y(y) can be written as a product of the marginal densities f_{Yj}(yj), thus proving independence.
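A minimal R sketch (not from the slides) of the construction used in the proof of Property 3: draws from N(μ, Σ) generated as μ + Σ^{1/2} Z with Z ~ N(0, I) and Σ^{1/2} obtained from the spectral theorem; μ and Σ below are arbitrary illustrative values.

## Simulating N(mu, Sigma) via the spectral square root of Sigma.
set.seed(4)
mu    <- c(1, -1)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
eS    <- eigen(Sigma)
Sroot <- eS$vectors %*% diag(sqrt(eS$values)) %*% t(eS$vectors)  # Sigma^{1/2}
Z     <- matrix(rnorm(2 * 1e4), nrow = 2)       # iid N(0,1) coordinates
Y     <- mu + Sroot %*% Z                       # each column ~ N(mu, Sigma)
rowMeans(Y)                                     # ~ mu
cov(t(Y))                                       # ~ Sigma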
Proposition (Property 7: Independence of Linear Forms)
Let Y ~ N(μ_{p×1}, Σ_{p×p}) and let A_{m×p}, B_{d×p} be real matrices. Then AY is independent of BY ⟺ A Σ B^T = 0.
Proof. It suffices to prove the result assuming μ = 0 (and it simplifies the algebra). Let W_{(m+d)×1} = (AY; BY) and ξ_{(m+d)×1} = (u; v), with u ∈ R^m, v ∈ R^d. Then,
M_W(ξ) = E[exp{W^T ξ}] = E exp{Y^T A^T u + Y^T B^T v} = E exp{Y^T (A^T u + B^T v)} = M_Y(A^T u + B^T v)
       = exp{½ (A^T u + B^T v)^T Σ (A^T u + B^T v)}
       = exp{½ (u^T A Σ A^T u + v^T B Σ B^T v + u^T A Σ B^T v + v^T B Σ A^T u)}.
First assume A Σ B^T = 0. Then
M_W(ξ) = exp{½ u^T A Σ A^T u} exp{½ v^T B Σ B^T v} = M_{AY}(u) M_{BY}(v),
i.e., the joint MGF is the product of the marginal MGFs, proving independence.
For the converse, assume that AY and BY are independent. Then, for all u, v,
M_W(ξ) = M_{AY}(u) M_{BY}(v)
⟹ exp{½ (u^T A Σ A^T u + v^T B Σ B^T v + u^T A Σ B^T v + v^T B Σ A^T u)} = exp{½ u^T A Σ A^T u} exp{½ v^T B Σ B^T v}
⟹ exp{½ · 2 u^T A Σ B^T v} = 1
⟹ u^T A Σ B^T v = 0,  ∀ u ∈ R^m, v ∈ R^d
⟹ the orthocomplement(*) of the column space of A Σ B^T is the whole of R^m
⟹ A Σ B^T = 0.
(* recall that for Q_{m×d} we have M⊥(Q) = {y ∈ R^m : y^T Qx = 0, ∀ x ∈ R^d}.)

Gaussian Quadratic Forms and the χ² Distribution

Definition (χ² distribution)
Let Z ~ N(0, I_{p×p}). Then ‖Z‖² = Σ_{j=1}^p Zj² is said to have the chi-square distribution with p degrees of freedom; we write ‖Z‖² ~ χ²_p.
[Thus, χ²_p is the distribution of the sum of squares of p real independent standard Gaussian random variates.]

Definition (F distribution)
Let V ~ χ²_p and W ~ χ²_q be independent random variables. Then (V/p)/(W/q) is said to have the F distribution with p and q degrees of freedom; we write (V/p)/(W/q) ~ F_{p,q}.

Proposition (Gaussian Quadratic Forms)
1. If Z ~ N(0_{p×1}, I_{p×p}) and H is a projection of rank r ≤ p, then Z^T H Z ~ χ²_r.
2. If Y ~ N(μ_{p×1}, Σ_{p×p}) with nonsingular Σ, then (Y − μ)^T Σ^{-1} (Y − μ) ~ χ²_p.
Exercise: prove these results.

Approximately χ² Quadratic Forms

What if the random vector is not Gaussian? Here's a CLT that helps:

Theorem (Weighted Sum CLT)
Let {Xn} be an i.i.d. sequence of real random variables, with common mean 0 and variance 1. Let {αn} be a sequence of real constants. Then,
sup_{1≤j≤n} αj² / Σ_{i=1}^n αi²  →  0  as n → ∞   ⟹   (Σ_{i=1}^n αi²)^{-1/2} Σ_{i=1}^n αi Xi →d N(0, 1).
The supremum condition amounts to saying that, in the limit, any single component contributes a negligible proportion of the total variance. The coefficient sequence {αn} might very well diverge, without contradicting the negligibility condition.

Linear Models: Likelihood and Geometry

Simple Normal Linear Regression

General formulation: Yi | xi ~ind Distribution{g(xi)},  i = 1, ..., n.
Simple Normal Linear Regression: Distribution = N{g(x), σ²}, g(x) = β0 + β1 x.
Resulting model:
Yi ~ind N(β0 + β1 xi, σ²)  ⟺  Yi = β0 + β1 xi + εi,  εi ~ind N(0, σ²).

Example: Professor's Van

Fillup:  1     2     3     4     5     6     7     8     9     10
Km/L:    7.72  8.54  8.35  8.55  8.16  8.12  7.46  6.43  6.74  6.72

Simple Normal Linear Regression

Jargon: Y is the response variable and x is the explanatory variable (or covariate).
Linearity: linearity is in the parameters, not the explanatory variable.
Example (flexibility in what we define as explanatory): instead of xj,
Yj = β0 + β1 sin(xj) + εj,  εj iid Normal(0, σ²).
Example (sometimes a transformation may be required):
Yj = β0 e^{β1 xj} ηj,  ηj iid Lognormal  →(take logs)→  log Yj = log β0 + β1 xj + log ηj,  log ηj iid Normal.
Data structure: for i = 1, ..., n, pairs (xi, yi), where the xi are fixed values of x, and yi is treated as a realisation of Yi at xi.
Multiple Normal Linear Regression

Instead of xi ∈ R we could have xi ∈ R^q, with xi^T = (xi1, ..., xiq):
Yi = β0 + β1 xi1 + β2 xi2 + ... + βq xiq + εi,  εi ~ind N(0, σ²).
Letting p = q + 1, this can be summarised via matrix notation: with Y = (Y1, ..., Yn)^T, β = (β0, β1, ..., βq)^T, ε = (ε1, ..., εn)^T, and X the n × p matrix whose i-th row is (1, xi1, ..., xiq),
Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},  ε ~ Nn(0, σ² I).
X is called the design matrix.

Example: Cement Heat Evolution

Case  3CaO·Al2O3  3CaO·SiO2  4CaO·Al2O3·Fe2O3  2CaO·SiO2  Heat
1     7           26         6                 60         78.50
2     1           29         15                52         74.30
3     11          56         8                 20         104.30
4     11          31         8                 47         87.60
5     7           52         6                 33         95.90
6     11          55         9                 22         109.20
7     3           71         17                6          102.70
8     1           31         22                44         72.50
9     2           54         18                22         93.10
10    21          47         4                 26         115.90
11    1           40         23                34         83.80
12    11          66         9                 12         113.30
13    10          68         8                 12         109.40

[Figure: heat evolved plotted against the percent weight of each of the four compounds]

Example: polynomial terms for MPG vs Horsepower

[Figure: scatterplot of gas mileage (MPG) against horsepower]
Perhaps more fitting than Yj = β0 + β1 xj + εj would be
Yj = β0 + β1 xj + β2 xj² + εj.
This is still a linear model, but now with 2 covariates: xj and xj².
Normally we would require a (hyper)plane to visualise the dependence of the mean on 2 or more covariates. When the additional covariates are variable transformations, we can visualise the mean dependence via a non-linear curve, even though the model is linear.

Likelihood for Normal Linear Regression

The model is:
Yi = β0 + β1 xi1 + β2 xi2 + ... + βq xiq + εi,  εi iid N(0, σ²)  ⟺  Y = Xβ + ε,  ε ~ Nn(0, σ² I).
Observe y = (y1, ..., yn)^T for a given fixed design matrix X, i.e. data
(y1, x11, ..., x1q), ..., (yi, xi1, ..., xiq), ..., (yn, xn1, ..., xnq).

Likelihood and Loglikelihood

L(β, σ²) = (2πσ²)^{-n/2} exp{−(1/(2σ²)) (y − Xβ)^T (y − Xβ)}
ℓ(β, σ²) = −½ {n log 2π + n log σ² + (1/σ²) (y − Xβ)^T (y − Xβ)}.

Maximum Likelihood Estimation

Whatever the value of σ², the log-likelihood is maximised when (y − Xβ)^T (y − Xβ) is minimised.
Hence, the MLE of β is
β̂ = arg max_β ℓ(β, σ²) = arg min_β (y − Xβ)^T (y − Xβ).
β̂ is called the least squares estimator because it results from minimising the sum of squares
(y − Xβ)^T (y − Xβ) = Σ_{i=1}^n (yi − β0 − β1 xi1 − β2 xi2 − ... − βq xiq)².
Thus we are trying to find the β that gives the hyperplane with minimum sum of squared vertical distances from our observations.
Obtain the minimum by solving:
0 = ∂/∂β {(y − Xβ)^T (y − Xβ)}
0 = {∂(y − Xβ)/∂β}^T ∂{(y − Xβ)^T (y − Xβ)}/∂(y − Xβ)      (chain rule)
0 = −X^T (y − Xβ)
X^T X β = X^T y      (normal equations)
β̂ = (X^T X)^{-1} X^T y      (if X has rank p).

Maximum Likelihood Estimation

Since the MLE of β is β̂ = (X^T X)^{-1} X^T y for all values of σ², we have
σ̂² = arg max_{σ²} max_β ℓ(β, σ²) = arg max_{σ²} ℓ(β̂, σ²) = arg max_{σ²} −½ {n log 2π + n log σ² + (1/σ²) (y − Xβ̂)^T (y − Xβ̂)}.
Differentiating and setting equal to zero yields
σ̂² = (1/n) (y − Xβ̂)^T (y − Xβ̂).
Next week we will see that a better (unbiased) estimator is
S² = (1/(n − p)) (y − Xβ̂)^T (y − Xβ̂).

Maximum Likelihood Estimation

Residuals: ei = yi − β̂0 − β̂1 xi1 − β̂2 xi2 − ... − β̂q xiq, i.e. e = y − Xβ̂, with e = (e1, ..., en)^T. The "regression line" is such that Σ ei² is minimised over all β.
Fitted values: ŷ = Xβ̂, so that ŷ = (ŷ1, ..., ŷn)^T, with ŷi = β̂0 + β̂1 xi1 + ... + β̂q xiq.

Example: Professor's Van
[Figure: fitted regression line for the van data]
Parameter estimates: β̂0 = 8.6, β̂1 = 0.068, S² = 17.4.

Example: MPG vs Horsepower — model with linear term only
[Figure: scatterplot with fitted line]
Parameter estimates: β̂0 = 39.94, β̂1 = −0.16, S² = 24.06.

Example: MPG vs Horsepower — model with linear and quadratic terms
[Figure: scatterplot with fitted quadratic curve]
Parameter estimates: β̂0 = 56.90, β̂1 = −0.47, β̂2 = 0.0012, S² = 19.13.
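The normal equations can be solved directly; the following R sketch (not from the slides) does so on arbitrary simulated data and checks that the result agrees with lm(), also computing the biased MLE σ̂² and the unbiased S².

## Least squares via the normal equations, compared with lm().
set.seed(5)
n  <- 50; x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
X  <- cbind(1, x1, x2)                               # design matrix, p = 3
beta.hat   <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
e          <- y - X %*% beta.hat
sigma2.hat <- sum(e^2) / n                           # MLE (biased)
S2         <- sum(e^2) / (n - ncol(X))               # unbiased estimator
cbind(beta.hat, coef(lm(y ~ x1 + x2)))               # identical columns
c(S2, summary(lm(y ~ x1 + x2))$sigma^2)              # identical values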
The Geometry of Least Squares

There are two dual geometrical viewpoints that one may adopt for the model Y_{n×1} = X_{n×p} β + ε_{n×1}, with X the design matrix with rows (1, xi1, ..., xiq):
- Row geometry: focus on the n OBSERVATIONS.
- Column geometry: focus on the p EXPLANATORIES.
Both are useful, usually for different things: row geometry is useful for exploratory analysis; column geometry is useful for theoretical analysis. Both geometries give useful, but different, intuitive interpretations of the least squares estimators.

Row Geometry (Observations)

Corresponds to the "scatterplot geometry" (data space): n points in R^p, each corresponding to an observation. The least squares parameters give the parametric equation of a hyperplane. The hyperplane has the property that it minimizes the sum of squared vertical distances of the observations from the plane itself, over all possible hyperplanes. Fitted values are vertical projections (NOT orthogonal projections!) of the observations onto the plane; residuals are signed vertical distances of the observations from the plane.

Column Geometry (Variables)

Adopt the dual perspective: consider the entire vector y as a single point living in R^n; then consider each variable (column) as a point also in R^n. What is the interpretation of the p-dimensional vector β̂, and of the n-dimensional vectors ŷ and e, in this dual space? It turns out there is another important plane here: the plane spanned by the variable vectors (the column vectors of X). Recall that this is the column space of X, denoted by M(X).

Column Geometry (Variables)

Recall: M(X) := {Xβ : β ∈ R^p} (the column space).
Q: What does Y = Xβ + ε mean? A: Y is [some element of M(X)] + [Gaussian disturbance].
Any realisation y of Y will lie outside M(X) (almost surely). The MLE estimates β by minimising (y − Xβ)^T (y − Xβ) = ‖y − Xβ‖². Thus we search for a β giving the element of M(X) with minimum distance from y. Hence ŷ = Xβ̂ is the projection of y onto M(X):
ŷ = Xβ̂ := X (X^T X)^{-1} X^T y = Hy.
H is the hat matrix (it puts a hat on y!).

Another derivation of the MLE of β: choose β̂ to minimise (y − Xβ)^T (y − Xβ) = ‖y − Xβ‖², so β̂ = arg min_β ‖y − Xβ‖². Now
min_{β ∈ R^p} ‖y − Xβ‖² = min_{μ ∈ M(X)} ‖y − μ‖².
But the unique μ that yields min_{μ ∈ M(X)} ‖y − μ‖² is μ = Py, where P is the projection onto the column space of X, M(X). Since X is of full rank, P = X (X^T X)^{-1} X^T, so μ = X (X^T X)^{-1} X^T y. Now β̂ will be the unique (since X has full rank) vector of coordinates of μ with respect to the basis of columns of X. So
Xβ̂ = X (X^T X)^{-1} X^T y,  which implies  β̂ = (X^T X)^{-1} X^T y.

The (Column) Geometry of Least Squares

So what is β̂? If the columns of X are linearly independent, they are a (non-orthogonal) basis for M(X). Hence for any z ∈ M(X), there exists a unique γ ∈ R^p such that z = Xγ. So γ contains the coordinates of z with respect to the X-column basis. Consequently, β̂ contains the coordinates of ŷ with respect to the X-column basis. But ŷ = Hy = X (X^T X)^{-1} X^T y = Xu, so u is the unique vector that gives the coordinates of ŷ with respect to the X-column basis. Hence we must have β̂ = u = (X^T X)^{-1} X^T y.

The (Column) Geometry of Least Squares

Facts:
1. e = (I − H) y = (I − H) ε.
2. ŷ and e are orthogonal, i.e. ŷ^T e = 0.
3. Pythagoras: y^T y = ŷ^T ŷ + e^T e = y^T H y + ε^T (I − H) ε.
Derivation:
1. e = y − Xβ̂ = y − Hy = (I − H) y = (I − H)(Xβ + ε) = (I − H) X β + (I − H) ε = (I − H) ε.
2. e = y − ŷ = (I − H) y ⟹ ŷ^T e = y^T H^T (I − H) y = 0.
3. y^T y = {Hy + (I − H) y}^T {Hy + (I − H) y} = ŷ^T ŷ + e^T e + 2 y^T H (I − H) y, and the last term is 0.
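A minimal R sketch (not from the slides) of the column geometry: the hat matrix projects y onto M(X), the fitted values are orthogonal to the residuals, and the Pythagorean decomposition holds numerically; the design here is an arbitrary simulated one.

## Hat matrix, orthogonality of fitted values and residuals, Pythagoras.
set.seed(6)
n <- 40; X <- cbind(1, rnorm(n)); y <- X %*% c(1, 2) + rnorm(n)
H     <- X %*% solve(crossprod(X)) %*% t(X)   # the hat matrix
y.hat <- H %*% y
e     <- y - y.hat
sum(y.hat * e)                                # ~ 0 : y.hat orthogonal to e
sum(y^2) - (sum(y.hat^2) + sum(e^2))          # ~ 0 : Pythagoras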
Weighted Least Squares

Assume a slightly different model:
Yi = β0 + β1 xi1 + β2 xi2 + ... + βq xiq + εi/√wi,  εi ~ind N(0, σ²),  wi > 0
⟺  Yi ~ind N(β0 + β1 xi1 + β2 xi2 + ... + βq xiq, σ²/wi),
with the wj known weights (example: each Yj is an average of wj measurements). This arises often in practice (e.g., in sample surveys), but also arises in theory.

The transformation y′ = W^{1/2} y, X′ = W^{1/2} X, with W_{n×n} = diag(w1, ..., wn), leads to the usual scenario. In this notation we obtain:
β̂ = {(X′)^T X′}^{-1} (X′)^T y′ = (X^T W X)^{-1} X^T W y.
Similarly: S² = (1/(n − p)) y^T {W − W X (X^T W X)^{-1} X^T W} y.

Distribution Theory of Least Squares

Least Squares Estimators

Gaussian Linear Model: Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},  ε ~ Nn(0, σ² I).
We have derived the estimators:
β̂ = (X^T X)^{-1} X^T y
σ̂² = (1/n) (y − Xβ̂)^T (y − Xβ̂) = (1/n) ‖ŷ − y‖²
S² = (1/(n − p)) ‖ŷ − y‖².
We need to study the distribution of these estimators for the purpose of: understanding their precision; building confidence intervals; testing hypotheses; comparing them to other candidate estimators; ...

Joint Distribution of LSE's

Theorem. Let Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1} with ε ~ Nn(0, σ² I), and assume that X has full rank p < n. Then,
1. β̂ ~ Np{β, σ² (X^T X)^{-1}};
2. (n − p) S²/σ² ~ χ²_{n−p}, where χ²_ν denotes the chi-square distribution with ν degrees of freedom;
3. the random variables β̂ and S² are independent.

Corollary. S² is unbiased whereas σ̂² is biased (so we prefer S²).

Corollary. Let Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1} with ε ~ Nn(0, σ² I). The statistic Hy is sufficient for the parameter β. If X has full rank p < n, then β̂ is also sufficient for β.

[Geometry Reminder — figure]

Joint Distribution of LSE's

Proof of the Theorem.
1. Recall our results for linear transformations of Gaussian variables:
β̂ = (X^T X)^{-1} X^T Y,  Y ~ Nn(Xβ, σ² I)  ⟹  β̂ ~ Np{β, σ² (X^T X)^{-1}}.
2. If e is independent of ŷ = Xβ̂, then S² = e^T e/(n − p) will be independent of β̂ (why?). Now notice that:
e = (I − H) y,  ŷ = Hy,  y ~ N(Xβ, σ² I).
Therefore, from the properties of the Gaussian distribution, e is independent of ŷ, since (I − H)(σ² I) H = σ² (I − H) H = 0, by idempotency of H.
3. For the last part, recall that e = (I − H) ε ⟹ (n − p) S² = e^T e = ε^T (I − H) ε, by idempotency of H. But ε ~ Nn(0, σ² In), so σ^{-1} ε ~ Nn(0, In). Therefore, by the properties of normal quadratic forms,
(n − p) S²/σ² = (σ^{-1} ε)^T (I − H)(σ^{-1} ε) ~ χ²_{n−p}.

Proof of the first Corollary.
Write y = Hy + (I − H) y = ŷ + e. If we can show that the conditional distribution of the 2n-dimensional vector W = (ŷ^T, e^T)^T given ŷ does not depend on β, then we will also know that the conditional distribution of y = ŷ + e given ŷ does not depend on β either, proving the proposition. But we have proven that ŷ is independent of e. Therefore, conditional on ŷ, e always has the same distribution N{0, σ²(I − H)}. It follows that, conditional on ŷ, the vector W has a distribution whose first n coordinates equal ŷ almost surely, and whose last n coordinates are N{0, σ²(I − H)}. Neither of those two depends on β, and the proof is complete.
When X has full rank p: β̂ is a 1-1 function of Hy, and β̂ is also sufficient for β.

Proof of the second Corollary.
Recall that if Q ~ χ²_d, then E[Q] = d.
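A small Monte Carlo sketch (not from the slides), consistent with the theorem: over repeated samples from an arbitrary fixed design, the empirical covariance of β̂ approaches σ²(X^T X)^{-1}, and (n − p)S²/σ² has mean n − p, as a χ²_{n−p} variable should.

## Monte Carlo check of the joint distribution of the LSE's.
set.seed(7)
n <- 30; p <- 2; sigma <- 1.5
X <- cbind(1, seq_len(n))                        # arbitrary fixed design
beta <- c(1, 0.3)
XtXinv <- solve(crossprod(X))
sim <- replicate(5000, {
  y   <- drop(X %*% beta) + rnorm(n, 0, sigma)
  fit <- lm.fit(X, y)
  c(fit$coefficients, sum(fit$residuals^2) / (n - p))
})
cov(t(sim[1:2, ])) / (sigma^2 * XtXinv)          # entrywise ratios ~ 1
mean((n - p) * sim[3, ] / sigma^2)               # ~ n - p = 28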
Confidence and Prediction Intervals

How to construct a CI for a linear combination of the parameters, c^T β?
We obtain c^T β̂ ~ N1(c^T β, σ² c^T (X^T X)^{-1} c) = N1(c^T β, τ²), say.
Therefore Q = (c^T β̂ − c^T β)/τ ~ N1(0, 1).
Hence Q² ~ χ²_1, and Q² is independent of S² (since β̂ is independent of S²), while (n − p) S²/σ² ~ χ²_{n−p}. We have
Q² / [{(n − p) S²/σ²}/(n − p)] ~ F_{1,n−p}  ⟹  (c^T β̂ − c^T β)² / {S² c^T (X^T X)^{-1} c} ~ F_{1,n−p}.
But for real W, W² ~ F_{1,n−p} ⟺ W ~ t_{n−p}, so base the CI on:
(c^T β̂ − c^T β) / √{S² c^T (X^T X)^{-1} c} ~ t_{n−p}.
(1 − α) × 100% CI:
c^T β̂ ∓ t_{n−p}(α/2) √{S² c^T (X^T X)^{-1} c}.

What about a (1 − α) CI for βr (the r-th coordinate)? Let cr = (0, 0, ..., 0, 1, 0, ..., 0) (1 in the r-th position). Then βr = cr^T β, and
(cr^T β̂ − cr^T β)/√{S² cr^T (X^T X)^{-1} cr} = (β̂r − βr)/√(S² v_{rr}) ~ t_{n−p},
where v_{rs} is the (r, s) element of (X^T X)^{-1}. We obtain the (1 − α) × 100% CI:
β̂r ∓ t_{n−p}(α/2) √(S² v_{rr}).

Confidence and Prediction Intervals

Suppose we want to predict the value of y+ for an x+ ∈ R^p. Our model predicts y+ by x+^T β̂. But y+ = x+^T β + ε+, so a prediction interval is DIFFERENT from an interval for a linear combination c^T β (extra uncertainty due to ε+):
E[x+^T β̂ − y+] = 0,  var[x+^T β̂ − y+] = var[x+^T β̂] + var[ε+] = σ² {x+^T (X^T X)^{-1} x+ + 1}.
Base the prediction interval on:
(x+^T β̂ − y+)/√[S² {1 + x+^T (X^T X)^{-1} x+}] ~ t_{n−p}.
We obtain the (1 − α) prediction interval:
x+^T β̂ ∓ t_{n−p}(α/2) √[S² {1 + x+^T (X^T X)^{-1} x+}].

The Coefficient of Determination, R²

R² is a measure of fit of the model to the data. We are trying to best approximate y through an element of the column space of X. How successful are we? The squared error is e^T e. How large is this, relative to the data variation? Look at
‖e‖²/‖y‖² = e^T e / y^T y = y^T (I − H) y / y^T y = 1 − ŷ^T ŷ / y^T y.
Define
R0² = ŷ^T ŷ / y^T y = ‖ŷ‖²/‖y‖².
Note that 0 ≤ R0² ≤ 1.
Interpretation: what proportion of the squared norm of y does our fitted value ŷ explain?

[Geometry Reminder — figure]

Different Versions of R²

"Centred (in fact, usual) R²": compares the empirical variance of ŷ to the empirical variance of y, instead of the empirical norms. In other words:
R² = {(1/n) Σ_{i=1}^n (ŷi − ȳ)²} / {(1/n) Σ_{i=1}^n (yi − ȳ)²} = (Σ ŷi²/n − ȳ²) / (Σ yi²/n − ȳ²).
(Note that (1/n) Σ ŷi = (1/n) Σ (yi − ei) = ȳ, because e ⊥ 1, where 1 is the vector of 1's = first column of the design matrix X, so Σ ei = 0.)
Note that
R² = ‖ŷ − ȳ1‖² / ‖y − ȳ1‖².
R0² is mathematically more natural (it does not treat the first column of X as special). R² is statistically more relevant (it expresses variance — the first column of X usually is special, in statistical terms!). R0² and R² may differ a lot when ȳ is large.

Different Versions of R²

Geometrical interpretation of R²: project y and ŷ on the orthogonal complement of 1, then compare the norms (of the projections):
1(1^T 1)^{-1} 1^T y = (1/n) 1 Σ yi = ȳ 1,
1(1^T 1)^{-1} 1^T ŷ = (1/n) 1 Σ ŷi = ȳ 1.
So
R² = ‖ŷ − ȳ1‖²/‖y − ȳ1‖² = ‖(I − 1(1^T 1)^{-1} 1^T) ŷ‖² / ‖(I − 1(1^T 1)^{-1} 1^T) y‖².
Intuition: we should not take into account the part of ‖y‖ that is explained by a constant; we only want to see the effect of the explanatory variables.
NOTE: statistical packages (e.g., R) provide R² (and/or Ra², see below), not R0².
Exercise: show that R² = [corr({ŷi}_{i=1}^n, {yi}_{i=1}^n)]².
Exercise: show that R² ≤ R0².
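The two versions of R² can be computed by hand; the following R sketch (not from the slides) does so for an arbitrary fit with a large intercept, where R0² and the centred R² indeed differ a lot, and checks against summary().

## Uncentred R0^2 versus the usual centred R^2.
set.seed(8)
x <- runif(60, 0, 5); y <- 10 + 0.4 * x + rnorm(60)   # large mean, modest signal
fit   <- lm(y ~ x)
y.hat <- fitted(fit)
R2.0  <- sum(y.hat^2) / sum(y^2)                           # uncentred version
R2    <- sum((y.hat - mean(y))^2) / sum((y - mean(y))^2)   # usual (centred) R^2
c(R2.0, R2, summary(fit)$r.squared)                        # R reports the centred one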
Horsepower and MPG of cars

R² coefficients for the linear and quadratic models:
            R0²    R²
linear      0.96   0.61
quadratic   0.97   0.69
[Figure: scatterplot of gas mileage (MPG) against horsepower]

Different Versions of R²

The adjusted R² takes into account the number of variables employed. It is defined as:
Ra² = R² − (1 − R²)(p − 1)/(n − p).
This corrects for the fact that we can always increase R² by adding variables. One can also correct the un-centred R0² by evaluating R0² − (1 − R0²) p/(n − p).

Example: Cement Heat Evolution

[Data table and scatterplots of heat evolved against the percent weight of each compound, as shown earlier]

> cement.lm <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
> summary(cement.lm)

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   62.4054   70.0710      0.89    0.3991
x1             1.5511    0.7448      2.08    0.0708
x2             0.5102    0.7238      0.70    0.5009
x3             0.1019    0.7547      0.14    0.8959
x4            -0.1441    0.7091     -0.20    0.8441

Residual standard error: 2.446 on 8 degrees of freedom
R-Squared: 0.9824

Example: Cement Heat Evolution

> x.plus
[1] 25 25 25 25

> predict(cement.lm, x.plus, interval = "confidence", se.fit = T, level = 0.95)
   Fit     Lower   Upper
   112.8   97.5    128.2

> predict(cement.lm, x.plus, interval = "prediction", se.fit = T, level = 0.95)
   Fit     Lower   Upper
   112.8   96.5    129.2
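One way the prediction output above could be reproduced is sketched below (not from the slides); it assumes the data frame cement and the variable names x1, ..., x4, y used in the slides' call, and passes the new point as a data frame, which is what predict() expects for newdata.

## Hypothetical reproduction of the predictions at x1 = x2 = x3 = x4 = 25;
## "cement" and "cement.lm" are assumed to be as defined on the slides.
x.plus <- data.frame(x1 = 25, x2 = 25, x3 = 25, x4 = 25)
predict(cement.lm, newdata = x.plus, interval = "confidence", level = 0.95)
predict(cement.lm, newdata = x.plus, interval = "prediction", level = 0.95)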
Horsepower and MPG of cars

[Figure: scatterplot of gas mileage (MPG) against horsepower, with the fitted linear and quadratic curves]

> auto.lm <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
> summary(auto.lm)

                 Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)      56.9000   1.8004       31.60    < 2e-16
horsepower       -0.4662   0.0311      -14.98    < 2e-16
I(horsepower^2)   0.0012   0.0001       10.08    < 2e-16

Residual standard error: 4.374 on 389 degrees of freedom
R-Squared: 0.6876

Horsepower and MPG of cars

> x.plus
  horsepower
  120

> predict(auto.lm, x.plus, interval = "confidence", se.fit = T, level = 0.95)
   Fit     Lower   Upper
   18.68   18.03   19.33

> predict(auto.lm, x.plus, interval = "prediction", se.fit = T, level = 0.95)
   Fit     Lower   Upper
   18.68   10.05   27.30

Gauss–Markov & Optimal Estimation

Gaussian Linear Model: Efficiency of LSE (Optimality)

Q: The geometry suggests that the LSE β̂ is a sensible estimator. But is it the best we can come up with?
A: Yes, β̂ is the unique minimum variance unbiased estimator of β. (To be seen in the Statistical Theory course, since β̂ is sufficient and complete.)
Thus, in the Gaussian linear model, the LSE are the best we can do as far as unbiased estimators go. (Actually one can show that S² is the optimal unbiased estimator of σ², by similar arguments.)

Second Order Assumptions: Optimality in a weaker setting?

The crucial assumption so far was Normality: ε ~ Nn(0, σ² I).
What if we drop this strong assumption and assume something weaker?
Uncorrelatedness: E[ε] = 0 & var[ε] = σ² I (notice we do not assume any particular distribution).
(Note that in this setup the observations may not be independent — uncorrelatedness implies independence only in the Gaussian case.)
How well do our LSE estimators perform in this case?
Second Order Assumptions

For a start, we retain unbiasedness:

Lemma. If we only assume E[ε] = 0 and var[ε] = σ² I, instead of ε ~ N(0, σ² I), then the following remain true:
1. E[β̂] = β;
2. Var[β̂] = σ² (X^T X)^{-1};
3. E[S²] = σ².

But what about optimality properties?

Gauss–Markov Theorem

Theorem. Let Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}, with p < n, X having rank p, and E[ε] = 0, var[ε] = σ² I. Then β̂ = (X^T X)^{-1} X^T Y is the best linear unbiased estimator of β, that is, for any linear unbiased estimator β̃ of β, it holds that
var(β̃) − var(β̂) ⪰ 0.

Proof. Let β̃ be linear and unbiased, in other words:
β̃ = AY for some A_{p×n};  E[β̃] = β for all β ∈ R^p.
These two properties combine to yield
β = E[β̃] = E[AY] = E[AXβ + Aε] = AXβ,  ∀ β ∈ R^p  ⟹  (AX − I) β = 0,  ∀ β ∈ R^p.
We conclude that the null space of (AX − I) is the entire R^p, and so AX = I. Then
var[β̃] − var[β̂] = A σ² I A^T − σ² (X^T X)^{-1} = σ² {A A^T − AX (X^T X)^{-1} X^T A^T} = σ² A (I − H) A^T = σ² A (I − H)(I − H)^T A^T ⪰ 0.

Large Sample Distribution of β̂

Gauss–Markov says that β̂ is the optimal linear unbiased estimator when cov(ε) = σ² I, regardless of whether or not ε is Gaussian.
Question: what can we say about the distribution of β̂ when cov(ε) = σ² I, but ε is not necessarily Gaussian?
Note that we can always write β̂ − β = (X^T X)^{-1} X^T ε.
Since there is a huge variety of candidate distributions for ε that would be compatible with the property cov(ε) = σ² I, we cannot say very much about the exact distribution of β̂ − β = (X^T X)^{-1} X^T ε.
Can we at least hope to say something about this distribution asymptotically, as the sample becomes large?

Large Sample Distribution of β̂

"Large sample" ⟺ increasing number of observations: we let n → ∞ (the number of rows of X tends to infinity), with the number of columns of X, i.e. p, held fixed.

Theorem (Large Sample Distribution of β̂)
Let {Xn}_{n≥1} be a sequence of n × p design matrices and Yn = Xn β + εn. If
1. Xn is of full rank p for all n,
2. max_{1≤i≤n} [xi^T (Xn^T Xn)^{-1} xi] → 0 as n → ∞ (where xi^T is the i-th row of Xn),
3. εn is a zero-mean n-vector with i.i.d. coordinates of variance σ²,
then the least squares estimator β̂n = (Xn^T Xn)^{-1} Xn^T Yn satisfies
(Xn^T Xn)^{1/2} (β̂n − β) →d Np(0, σ² I).
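A minimal simulation sketch (not from the slides) of the theorem with non-Gaussian errors: with centred exponential errors (mean 0, variance 1, so σ² = 1) and an arbitrary well-behaved design, the standardised slope estimate behaves like a N(0, 1) variable.

## Large-sample normality of the LSE with non-Gaussian errors.
set.seed(10)
n <- 200; X <- cbind(1, runif(n)); beta <- c(0, 1)
b1 <- replicate(2000, {
  eps <- rexp(n) - 1                        # mean 0, variance 1, not Gaussian
  y   <- drop(X %*% beta) + eps
  lm.fit(X, y)$coefficients[2]
})
z <- (b1 - beta[2]) / sqrt(solve(crossprod(X))[2, 2])   # sigma^2 = 1 here
c(mean(z), var(z))                          # ~ 0 and ~ 1
qqnorm(z); qqline(z)                        # close to the 45-degree line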
Large Sample Distribution of β̂

The theorem's conclusion can be interpreted as: for n "large enough", β̂ ≈d N{β, σ² (Xn^T Xn)^{-1}}, i.e. the distribution of β̂ gradually becomes the same as what it would be if ε were Gaussian ... provided the design matrix X satisfies the extra condition (2).
Condition (2) can be shown to be equivalent to: the diagonal elements of Hn = Xn (Xn^T Xn)^{-1} Xn^T, say hjj(n), converge to zero uniformly in j as n → ∞.
Note that trace(Hn) = p, so that the average of the hjj(n) is p/n → 0 — the question is whether all the hjj(n) → 0 uniformly.
This has a very clear interpretation in terms of the form of the design, which we will see when we discuss the notions of leverage and influence.

Large Sample Distribution of β̂

To understand Condition (2), consider the simple linear model
Yi = β0 + β1 ti + εi,  i = 1, ..., n.
Here, p = 2. One can show that
hjj(n) = 1/n + (tj − t̄)² / Σ_{k=1}^n (tk − t̄)².
Suppose ti = i, for i = 1, ..., n (regular grid). Then
hjj(n) = 1/n + {j − (n + 1)/2}² / {n(n² − 1)/12},
so
max_{1≤j≤n} hjj(n) = hnn(n) = 1/n + 3(n − 1)/{n(n + 1)} → 0 as n → ∞.

Large Sample Distribution of β̂

Now consider ti = 2^i (grid points spread apart as n grows). The centre of mass and sum of squares of the grid points are now
t̄ = 2(2^n − 1)/n,  Σ_{i=1}^n (ti − t̄)² = (4/3)(4^n − 1) − (4/n)(2^n − 1)²,
and so
max_{1≤j≤n} hjj(n) = hnn(n) → 3/4 as n → ∞.

Proof (of the theorem).
Recall that β̂n − β = (Xn^T Xn)^{-1} Xn^T εn. We will show that for any unit vector u,
u^T (Xn^T Xn)^{-1/2} Xn^T εn →d N(0, σ²),
and then the theorem will be proven by the Cramér–Wold device(*). Now notice that u^T (Xn^T Xn)^{-1/2} Xn^T εn = αn^T εn, where:
1. αn = (α_{n,1}, ..., α_{n,n})^T = (u^T (Xn^T Xn)^{-1/2} x1, ..., u^T (Xn^T Xn)^{-1/2} xn)^T;
2. α_{n,i}² = {u^T (Xn^T Xn)^{-1/2} xi}² ≤ ‖u‖² xi^T (Xn^T Xn)^{-1} xi (Cauchy–Schwarz);
3. ‖αn‖² = Σ_i α_{n,i}² = u^T (Xn^T Xn)^{-1/2} (Xn^T Xn)(Xn^T Xn)^{-1/2} u = 1.
Consequently, the result follows from the weighted sum CLT upon noticing:
max_{1≤i≤n} α_{n,i}² / Σ_{k=1}^n α_{n,k}² ≤ max_{1≤i≤n} xi^T (Xn^T Xn)^{-1} xi → 0.
(* Cramér–Wold: ξn →d ξ in R^d if and only if u^T ξn →d u^T ξ in R for all unit vectors u.)
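The two designs just discussed can be checked numerically; the following R sketch (not from the slides) evaluates the maximal leverage max_j hjj(n) using the formula above, for a few values of n.

## Maximal leverage under a regular grid versus an exponentially spreading grid.
max.leverage <- function(t) {
  n <- length(t)
  max(1 / n + (t - mean(t))^2 / sum((t - mean(t))^2))
}
sapply(c(10, 100, 1000), function(n) max.leverage(1:n))      # tends to 0
sapply(c(5, 10, 20),     function(n) max.leverage(2^(1:n)))  # tends to 3/4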
Diagnostics

Assumptions to Check for

Four basic assumptions are inherent in the Gaussian linear regression model:
- Linearity: E[Y] is linear in X.
- Homoskedasticity: var(εj) = σ² for all j = 1, ..., n.
- Gaussian distribution: the errors are Normally distributed.
- Independent errors: εi independent of εj for i ≠ j.
How do we check these assumptions?

Scientific reasoning: it is impossible to validate model assumptions. We cannot prove that the assumptions hold; we can only provide evidence in favour of (or against!) them.
Strategy: find implications of each assumption that we can check graphically (mostly concerning residuals); construct appropriate plots and assess them (requires experience). When one of these assumptions fails clearly, then Gaussian linear regression is inappropriate as a model for the data.
"Magical thinking": beware of overinterpreting plots!
Isolated problems, such as outliers and influential observations, also deserve investigation. They may or may not decisively affect model validity.

Residuals Revisited

Residuals e: the basic tool for checking assumptions.
e = y − ŷ = y − Xβ̂ = (I − H) y = (I − H) ε.
Intuition: the residuals represent the aspects of y that cannot be explained by the columns of X. Since ε ~ Nn(0, σ² I), if the model is correct we should have e ~ Nn{0, σ² (I − H)}.
So if the assumptions hold: ei ~ N{0, σ² (1 − hii)} and cov(ei, ej) = −σ² hij (i ≠ j).
Note that the residuals are correlated, and that they have unequal variances. Define the standardised residuals:
ri := ei / {s √(1 − hii)},  i = 1, ..., n.
These are still correlated but have variance 1. (One can decorrelate by U^T e, where H = U Λ U^T — why?)

Checking for Linearity

A first impression can be drawn by looking at plots of the response against each of the explanatory variables. Other plots to look at? Notice that under the assumption of linearity we have X^T e = 0. Hence, no correlation should appear between explanatory variables and residuals.
Plot the standardised residuals r against each explanatory variable (columns of X).
→ No systematic patterns should appear in these plots. A systematic pattern would suggest incorrect dependence of the response on the particular explanatory (e.g. the need to add a transformation of that explanatory as an additional variable).
Plot the standardised residuals r against explanatories left out of the model.
→ No systematic patterns should appear in these plots. A systematic pattern suggests that we have left out an explanatory variable that should have been included.

[Figures: "Linearity OK"; "Linearity NOT OK – need to add sin(x1) in model"; "Important Covariate Left Out"]

Checking for Homoskedasticity

Homoskedastic = "homo" (same) + "skedasis" (spread).
According to our model assumptions, the variance of the errors εj should be the same across indices: var(εj) = σ².
Plot r against the fitted values ŷ (why not against y?).
→ A random scatter should appear, with approximately constant spread of the values of r for the different values of ŷ.
→ "Trumpet" or "bulging" effects indicate failure of the homoskedasticity assumption.
Since ŷ^T e = 0, this plot can also be used to check linearity, as before.

[Figures: "Homoskedasticity OK"; "Heteroskedasticity (i.e. lack of Homoskedasticity)"]

Checking for Normality

Idea: compare the distribution of the standardised residuals against a Normal distribution. How? Compare the empirical with the theoretical quantiles.
The p-quantile (p ∈ [0, 1]) of a distribution F is the value γ defined as γ := inf{α ∈ R : F(α) ≥ p}. Notation: γ = F^{-1}(p) (although the inverse may not be well defined).
Given a sample X1, ..., Xn, the empirical p-quantile is the value γ̂ defined as γ̂ := inf{α ∈ R : #{Xi ≤ α}/n ≥ p}. Notation: γ̂ = F̂n^{-1}(p).

Checking for Normality

A quantile plot for a given sample plots certain empirical quantiles against the corresponding theoretical quantiles (i.e. those under the assumed distribution). If the sample at hand originates from F, then we expect the points of the plot to fall close to the 45° line.
Plot the empirical {k/n}_{k=1}^n quantiles of the standardised residuals r(1) ≤ r(2) ≤ ... ≤ r(n) against the theoretical quantiles Φ^{-1}{1/(n+1)}, ..., Φ^{-1}{n/(n+1)} of a N(0, 1) distribution. (Think why we pick k/(n+1) instead of k/n.)
→ If the points of the quantile plot deviate significantly from the 45° line, there is evidence against the normality assumption.
→ Outliers, skewness and heavy tails are easily revealed.
Beware of overinterpretation when n is small!

[Figures: QQ plots of standardised residuals for n = 50, n = 100, n = 300; "Normality NOT OK"]
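A minimal R sketch (not from the slides) of the normal quantile plot just described, for the standardised residuals of an arbitrary simulated fit; rstandard() computes the ri, and the points are compared with the 45° line.

## QQ plot of standardised residuals against N(0, 1) quantiles.
set.seed(17)
x <- runif(100); y <- 1 + x + rnorm(100)
fit <- lm(y ~ x)
r   <- rstandard(fit)                          # standardised residuals
qqnorm(r); qqline(r)                           # should hug the 45-degree line
## equivalently, plot sorted residuals against Phi^{-1}(k/(n+1))
plot(qnorm((1:100) / 101), sort(r)); abline(0, 1)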
identifying groups of related individuals with correlated responses ACF 0.0 Difficult to check this assumption in practice. One thing to check for is clustering, which may suggest dependence. Partial ACF −0.5 0.0 0.5 1.0 Series std.residual 1.0 Series std.residual Under assumption of normality this is equivalent to independence Victor Panaretos (EPFL) 126 / 299 Checking for Independence It is assumed that var " ,! Linear Models −1.0 QQ Plot for Linear Models 5 10 Lag 15 128 / 299 Identifying Influential Observations Outliers An influential observation can usually be categorised as an: An outlier is an observation that stands out in some way from the rest of the observations, causing Surprise! Exact mathematical definition exists (Tukey) but we will not pursue it. outlier (relatively easier to spot by eye) In regression, outliers are points falling far from the cloud surrounding the regression line (or surface). OR leverage point (not as easy to spot by eye) They have the effect of “pulling” the regression line (surface) toward them. Influential observations Outliers can be checked for visually through: The regression scatterplot. May or may not decisively affect model validity. Require scrutiny on an individual basis and consultation with the data expert. David Brillinger (Berkeley): You will not find your Nobel prize in the fit, you will find it in the outliers! Influential observations may reveal unanticipated aspects of the scientific problem that are worth studying, and so must not simply be scorned as “non-conformists”! Victor Panaretos (EPFL) Linear Models 129 / 299 An Outlier Victor Panaretos (EPFL) ,! Points that can be seen to fall relatively far from the point cloud surrounding the regression line (surface) Residual Plots. ,! Points that fall beyond ( 2; 2) in the (^y ; r ) plot. Outliers may result from a data registration error, or a single extreme event. They can, however, result because of a deeper inadequacy of our model (especially if there are many!). Victor Panaretos (EPFL) Linear Models 130 / 299 Linear Models 132 / 299 Professor’s Van: Outliers Linear Models 131 / 299 Victor Panaretos (EPFL) Leverage and Leverage Points A (very) Noticeable Leverage Point Outliers may be influential: they “stand out” in the “y -dimension”. However an observation may also be influential because of unusual values in the “x -dimension”. Such influential observations cannot be so easily detected through plots. Call (xj ; yj ) the j -th case and notice that var(yj y^j ) = var(ej ) = 2 (1 hjj ): If hjj 1, then the model is constrained so y^j = xj> ^ ' yj ! (i.e., need a separate parameter entirely devoted to fitting this observation!) hjj is called the leverage of the j -th case. Pn since trace(H ) = j =1 hjj = p , cannot have low leverage for all cases a good design corresponds to hjj ' p =n for all j n !0 (i.e. assumption maxj n hjj ! 0 satisfied in asymptotic thm). Leverage point: (rule of thumb) if hjj > 2p =n observation needs further scrutiny—e.g., fitting again without j -th case and studying effect. Outlier+Leverage Point = TROUBLE Victor Panaretos (EPFL) Linear Models 133 / 299 Assessing the Influence of an Observation Victor Panaretos (EPFL) Linear Models 134 / 299 Linear Models 136 / 299 A Cook Distance Plot How to find cases having strong effect on fitted model? Idea: see effect when case j , i.e., (xj ; yj ), is dropped. ^ j be the LSE when model is fitted to data without case j , and let y^ j = X ^ j be the corresponding fitted value. 
Let Define Cook’s distance Cj = ps1 2 (^y y^ j )> (^y y^ j ); which measures scaled distance between Can show that j. r 2h Cj = p (1j jjh ) ; jj implies large rj and/or large hjj . Cases with Cj > = n p worth a closer look (rule of thumb) so large Cj y^ and y^ 2) Plot Cj against index j = 1; : : : ; n and compare with 8=(n 2p ) level. Victor Panaretos (EPFL) 8( Linear Models 135 / 299 Victor Panaretos (EPFL) Summary Diagnostic plots usually constructed: y against columns of X ,! check for linearity and outliers standardized residual r against columns of X ,! check for linearity ,! check for variables left out ,! check for homoskedasticity Detour: Reminder on Hypothesis Tests r against explanatories not included r against fitted value y^ Normal quantile plot ,! check for normality Cook’s distance plot ,! check for influential observations Victor Panaretos (EPFL) Linear Models 137 / 299 Detour: Very brief Reminder on Testing Hypotheses Victor Panaretos (EPFL) Linear Models 138 / 299 Hypothesis Testing Setup H0 : The null hypothesis Scientific theories lead to assertions that are testable using empirical data. Data may discredit the theory (call it the hypothesis) or not (i.e., empirical findings reasonable under hypothesis). ,! scientific theory under scrutiny Y; T (); ,! data test statistic, assumed positive the experimental setup to test theory INTUITION: Example: Theory of “luminoferous aether” in late 19th century to explain light travelling in vacuum. Discredited by Michelson-Morley experiment. Similarities with the logical/mathematical concept of a necessary condition. Victor Panaretos (EPFL) Linear Models 139 / 299 The null hypothesis would predict a certain plausible range of values for T Y (plausible results of the experiment). ( ) We would say that the assertion made by the null hypothesis (theory) is not supported by the data if T Y is an extreme (unlikely) observation given the range of plausible values predicted by the hypothesis (if the experimental evidence appears to be inconsistent with the theory). ( ) Victor Panaretos (EPFL) Linear Models 140 / 299 Hypothesis Testing Setup Example: Mean of a Normal Distribution () ( ) PH [T (Y ) 2 ] Suppose that we perform the experiment T (Y ) and the result is T (Y ) = t . The Plausibility of different values of T under the theory H0 ! described by the distribution of T Y under the null hypothesis: 0 result t is judged to be incompatible with the hypothesis when p = PH [T (Y ) t ] 0 is small. The value p is called the p-value. Small values of p suggest that we have observed something which is unlikely to happen if H0 holds true. Large values of true. p suggest that what we have observed is plausible if H0 holds (Choice of T often guided by an alternative hypothesis should be large) Thus we reject the null hypothesis when Victor Panaretos (EPFL) H1 , under which T p is small. Linear Models 141 / 299 Example continued and two comments Let X N (; 2 ), unknown mean, known variance H0 : = 0 d Data: Y = (X1 ; : : : ; Xn ), Xi = X , Xi indep Xj for i 6= j . P 2 X i i p . (tends to be large when 6= 0). Test statistic: T (Y ) = n Perform experiment (i.e., obtain values y = (x1 ; : : : ; xn )) and observe T (y ) = t . H Under the null hypothesis: T (Y ) 21 . Hence: p = PH [T (Y ) t ] = P[21 t ] p p = P[fN (0; 1) t g [ fN (0; 1) t g]: Usually reject when p < 0:05. 0 0 Victor Panaretos (EPFL) Example: Testing for Linear Models 142 / 299 c > = 0 in a Gaussian Regression Y N (X ; 2 I ), unknown H0 : c > = 0 Data: (y ; X ). 
Let , unknown variance !2 c> ^ p Test statistic: T (Y ) = S c > (X > X ) 1 c Suppose we observe T (y ) = and let W tn p . Then, p = PH [T (Y ) ] = P[fW p g [ fW p g]: Reject the null hypothesis if p < , some small . Identical to building a 1 confidence interval for c > based on > > ^ pc c and rejecting the hypothesis H0 if and only if the interval S c > (X > X ) c 0 For continuous test statistics with everywhere positive densities, if we reject H0 whenever p < , then our (type I) error probability is . ,! The probability of rejecting H0 when in fact H0 is true is 1 does not contain zero. There is a close link with confidence intervals. ,! We will only illustrate this link in a specific example Victor Panaretos (EPFL) Linear Models 143 / 299 Victor Panaretos (EPFL) Linear Models 144 / 299 Many many issues remain (this was just a reminder!) The role of an alternative hypothesis. How do we choose a test statistic? Nested Model Selection & ANOVA Are there optimal tests in a given situation? Simple and composite hypotheses. One and two-sided tests. Limitations of hypothesis testing . . . ... Review your 2nd year Probability/Statistics course! Victor Panaretos (EPFL) Linear Models 145 / 299 Comparing Nested Models Victor Panaretos (EPFL) Linear Models 146 / 299 Rephrasing The Question: Gaussian Linear Model y = X + " with " Nn (0; 2 I ). Estimate: ^ = (X >X ) 1X >y : Interpretation: y^ = X ^ = Hy is the projection of y into the column space of X, M(X ). This subspace has dimension p , when X is of full column rank p . Now for q < p write X in block notation as X = ( nX1q X2 ): Model is Consider the model: y = 0 + 1 x1 + 2 x2 + 3 x3 + 4 x4 + ": This will always have higher R 2 than the sub-model: y = 0 + 1 x1 + ": n (p q ) Why? (think of geometry. . . ) The question is: is the first model significantly better than the second one? ,! Interpretation: Similarly write i.e. does the first model explain the variation adequately enough, or should we incorporate extra explanatory variables? Need a quantitative answer. X1 is built by the first q columns of X =( 1 2 )> so that: and X2 by the rest. y = X + " = (X1 X2 ) 1 + " = X1 1 + X2 2 + ": 2 Our question can now be stated as: Is Victor Panaretos (EPFL) Linear Models 147 / 299 2 = 0? Victor Panaretos (EPFL) Linear Models 148 / 299 Residual Sums of Squares Let Geometry Revisited H1 = X1 (X1> X1 ) 1 X1> , and y^1 = H1 y , e1 = y y^1 . Pythagoras tells us that: k| y {zy^1 k}2 = |ky {zy^ k}2 + RSS ( ^1 )=ke1 k2 Notice that RSS ( ^)=ke k2 k| y^ {zy^1 k}2 RSS ( ^1 ) RSS ( ^)=ke e1 k2 RSS ( ^1 ) RSS ( ^) always (think why!) So the idea is simple: to see if it is worthwhile to include much larger RSS 1 is compared to RSS . (^ ) ( ^) Equivalently, we can look at a ratio like 2 we will compare how fRSS ( ^1 ) RSS ( ^)g=RSS ( ^) To construct a test based on this quantity, we need to figure out distributions ::: Victor Panaretos (EPFL) Linear Models 149 / 299 Victor Panaretos (EPFL) Linear Models 150 / 299 Distributions of Sums of Squares proof continued Theorem Therefore, We have the following properties: (A) (B) (C) (D) e e1 ? e ; ke k2 = RSS ( ^) and ke1 e k2 = RSS ( ^1 ) RSS ( ^) are independent; ke k2 2 2n p ; under the hypothesis H0 : 2 = 0, ke1 e k2 2 2p q . e e1 = (I H )y (I H1 H )y = y Hy y + H1 Hy = (H1 I )Hy : But recall that y N (X ; 2 I ). Therefore, to prove independence of e e1 = (H1 I )Hy and e = (I H )y , we need to show that Proof. (H1 I )H [2 I ](I H )> = 0: This is clearly the case since H (I H ) = 0, proving (B). 
To show (B), we notice that (C) follows immediately, since we have already proven last time that when 2 ) (A) holds since e e1 = y y^ y + y^1 = y^ + y^1 2 M(X1 ; X2 ) but e 2 [M(X1 ; X2 )]? . =0 e1 = (I H1 )y = (I H1 H )y because M(X1 ) M(X1 ; X2 ). Victor Panaretos (EPFL) Linear Models 151 / 299 Victor Panaretos (EPFL) RSS ( ^) 2 2n Linear Models 8 (even p 152 / 299 proof continued. Relative Sum of Square Reductions To prove (D), we note that e e1 = (H1 I )Hy Nf(H1 I )HX ; 2 |(H1 I )HH{z> (H1 I )>}g: =H H1 HX = X (X > X ) 1 X > X = X . So, in block notation, e e1 N ((H1 I )X1 1 + (H1 I )X2 2 ; 2 (H H1 )): Now (I H1 )X1 1 = 0 always, since I H1 projects onto M? (X1 ). Therefore, But Corollary We conclude that, under the hypothesis 2 = 0, ! RSS ( ^1 ) RSS ( ^) p q ! Fp RSS ( ^) n p e e1 N (0; 2 (H H1 )); when 2 = 0: Now observe that (H H1 )> = (H H1 ) and (H H1 )2 = (H H1 ) (because M(X1 ) M(X1 ; X2 )). Thus, q ;n p e e1 N (0; 2 (H H1 )2 ) =) e e1 =d (H H1 )" =) RSS ( ^1) RSS ( ^) = ke e1k2 =d ">(H H1)" 2 2p q : since (H H1 ) is symmetric idempotent with trace p q and " N (0; 2 In ). Victor Panaretos (EPFL) The Linear Models 153 / 299 F -Test Victor Panaretos (EPFL) Linear Models 154 / 299 Example: Nested Models in Cement Data IWe fitted the model: y = 0 + 1 x1 + 2 x2 + 3 x3 + 4 x4 + " Distributional results suggest the following test: Have Y N (X1 1 + X2 2 ; 2 I ) H0 : 2 = 0 Data: (y ; X1 ; X2 ). I But would the following simpler model be in fact adequate? RSS ( ^1 ) RSS ( ^) p q ! Test statistic: T = RSS ( ^) n p Then, under H0 , it holds that T Fp q ;n p . Suppose we observe T = . Then, p = PH [T (Y ) ] = P[Fp q ;n p ] Reject the null hypothesis if p < , some small , usually 0.05. 0 Victor Panaretos (EPFL) y = 0 + 1 x1 + " ! Linear Models 155 / 299 I Intuitively: is the extra explanatory power of the “larger” model significant enough in order to justify its use instead of a simpler model? (i.e., is the residual vector for the “larger” model significantly smaller than that of the simpler model?) I In this case, n ,p ,q and = 13 = 5 = 2 RSS ( ^) = 47:86; RSS ( ^1 ) = 1265:7 yielding :86)=(5 2) = (1265(47:7:86)47=(13 5) = 67:86 Ip = P[F3;8 67:86] = 4:95 10 6 , so we reject the hypothesis H0 : 2 = 3 = 4 = 0. Victor Panaretos (EPFL) Linear Models 156 / 299 Example: MPG vs Horsepower IWe can fit the quadratic model: + 2 horsepower2 +" 40 1 horsepower MPG = 0+ 1 horsepower Gas mileage (MPG) I But would the model only with linear term suffice? +" I Intuitively: is the reduction of RSS afforded by the “complex” model substantial enough in order to justify its use instead of a simpler model? I In this case, n ,p ,q and RSS ( ^1 ) = 1943:9 yielding (14433:1 1943:9) = 12489:2 = (3 2) Ip = P[F1;389 12489:2] = 2:2 10 6 , so we reject the hypothesis H0 : 2 = 0. Victor Panaretos (EPFL) Linear Models ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 100 ● ● ● 150 ●● ● ● ● ● ● ● ● 200 Horsepower Victor Panaretos (EPFL) Linear Models 158 / 299 Proceed similarly as before. Define: X1 , . . . ,Xr be groups of columns of X (the “terms”), such that X = (n11 nX1q nX2q : : : nXrq ); = ( 0 1 2 : : : r )> 2 r 11 1q1 1q2 X0 := 1 and Xk = (X0 X1 X2 : : : Xk ); k 2 f0; : : : ; r g Hk := Xk (Xk> Xk ) 1 Xk> ; k 2 f0; : : : ; r g y^k := Hk y ; k 2 f0; : : : ; r g ek = y y^k ; k 2 f0; : : : ; r g Note that y^0 = y 1. 1qr y = X + " = 1 0 + X1 1 + + Xr r + " I Would like to do the same “F-test investigation”, but this time do it term-by-term. 
That is, we want to look at the following sequence of nested models: I As before, Pythagoras implies k| y {zy^0 k}2 = k| y {zy^r k}2 + k| y^ {zy^r 1 k}2 + + k| y^1 {zy^0 k}2 y =1 0+" y = 1 0 + X1 1 + " y = 1 0 + X1 1 + X2 2 + " ke0 k2 y = 1 0 + X1 1 + X2 2 + + Xr r + " RSSr with Linear Models ker k2 = k|e{zr k}2 + .. . Victor Panaretos (EPFL) ● The Analysis of Variance 1 We have ● ● ● ● ●●● ● ● ● ●● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ● ●●●● ● ●● ● ● ● ● ●● ● ●●● ● ● ●● ● ● ● ●● ●●● ● ●●●●● ●● ●● ● ● ● ●●● ● ●●● ● ● ●● ●● ● ●● ● ●● ●●● ● ● ●●●●● ●● ● ● ● ● ● ● ● ●●●● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ●● ● ●● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ●● ●●●●● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●●●● ●●● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● 50 157 / 299 The Analysis of Variance I Let 1, ● 10 = 392 = 3 = 2 RSS ( ^) = 14433:1; 30 = 0+ 20 MPG ● ● ● ● ● 159 / 299 | r 1 X ker er 1 k2 {z ker e0 k2 ke1 e0 k2 } k| ek +1{z ek k}2 k =0 RSSk RSSk +1 RSSk the residual sum of squares for y^k , with k degrees of freedom. Victor Panaretos (EPFL) Linear Models 160 / 299 The Analysis of Variance ANOVA Table Terms Some observations: RSSk RSSk +1 is the reduction in residual sum of squares caused by adding Xk +1 , when the model already contains X0 ; : : : ; Xk . RSSr and fRSSk RSSk +1 gkr =01 are all mutually independent. Obviously, 0 1 2 r k +1 = k if Xk +1 2 M(Xk ). IGiven this information, we want to see how adding each term in the model sequentially, affects the explanatory capacity of the model. ,! In other words, we want to investigate the reduction in the residual sum of squares achieved by adding each term to the model. Is this significant? df n 1 1 1; X1 1; X1 ; X2 .. . 1; X1 ; : : : ; Xr 1 2 .. . r 161 / 299 Example: Nested Sequence in Cement Data Reductions in overall sum of squares when sequentially entering terms x1 x2 x3 x4 Residual SSq n 1 1 Xr r .. . 2 .. . 1 r Fk Fk 1 F-test RSS0 RSS1 RSS1 RSS2 RSSr .. . 1 RSSr RSS when Xk is RSSk )=(k 1 k ) ; Fk = (RSSk 1 RSS r =r k ;r under the null hypothesis H0 : k = 0. Fk relative to the null distribution are evidence against H0 . Victor Panaretos (EPFL) Linear Models 162 / 299 x1 , x2 , Significance of entering a term depends on how the sequence is defined: when entering terms in different order get different results! (why?) Does adding extra variables improve model significantly? Red Sum Sq 1450.08 1207.78 9.79 0.25 47.86 RSSr X1 X2 1 Reduction in RSS Warning! x3 and x4 . Df 1 1 1 1 8 .. . df 1 Large values of Linear Models RSS0 RSS1 RSS2 Terms added The F -statistic for testing the significance of the reduction in added to the model containing terms ; X1 ; : : : ; Xk is and Victor Panaretos (EPFL) Residual RSS F value ( ) 242.37 201.87 1.64 0.04 p -value 2.8810 7 5.8610 7 When a term is entered “early” and is significant, this does not tell us much (why?) When a term is entered “late” is significant, then this is quite informative (why?) 0.2366 0.8441 I Why is this true? Are there special cases when the order of entering terms doesn’t matter? I In this case, each term is a single column (variable). Victor Panaretos (EPFL) Linear Models 163 / 299 Victor Panaretos (EPFL) Linear Models 164 / 299 The Effect of Orthogonality X0 = 1; X1 ; X2 from X , so X = (nX01 nX1q nX2q ); = ( 0 1 2 )> 11 1q 1q IAssume orthogonality of terms, i.e. 
Xi> Xj = 0; i 6= j IConsider terms 1 2 1 2 Model Selection / Collinearity / Shrinkage Notice that in this case 1 0 0 ^ = @ 0 X1>X1 0 A X0 X1 X2 > y 0 0 X2>X2 =) ^0 = y ; ^1 = (X1>X1) 1 X1>y ; ^2 = (X2> X2) 1X2>y 0 1 X0> X0 It follows that the reductions of sums of squares are unique, in the sense that they do not depend upon the order of entry of the terms in the model. (show this!) Intuition: Xi contains completely independent linear information from Xj for y , i 6= j Victor Panaretos (EPFL) Linear Models 165 / 299 Theory VS Practice Victor Panaretos (EPFL) Linear Models 166 / 299 Albert Einstein (1879–1955) I Theory: We are given a relationship y =X +" and asked to provide estimators, tests, confidence intervals, optimality properties ... . . . and we can do it with complete success! I Practice: We are given data (y ; X ) and suspect a linear relationship between y and some of the columns of X . We don’t know a priori which exactly! ,! Need to select a “most appropriate” subset of the columns of X ,! General principle: parsimony (Latin parsimo nia: sparingness; simplicity and least number of requisites and assumptions; economy or frugality of components and associations). Victor Panaretos (EPFL) Linear Models ‘Everything should be made as simple as possible, but no simpler.’ 167 / 299 Victor Panaretos (EPFL) Linear Models 168 / 299 William of Ockham (?1285–1347) Exploratory Data Analysis Graphical exploration provides initial picture: y against candidate variables; plots of transformations of y against candidate variables; plots of transformations of certain variables against y ; plots of plots of pairs of candidate variables. This will often provide a starting point, but: Automatic Model Selection: Need objective model comparison criteria, as a screening device. ,! We saw how to do an nested? F -test, but what if models to be compared are not Automatic Model Building: Situations when possible models. ,! Victor Panaretos (EPFL) Linear Models 169 / 299 Automatic Model Selection 2 Denote set of all models generated by (1 + ) Automatic methods for building a model? We saw that ANOVA depends on the order of entry of variables in the model . . . Victor Panaretos (EPFL) Linear Models 170 / 299 Model Selection Criteria Occam’s razor: is vainXtowith do with more what can be done with fewer. Consider design Itmatrix p variables. Given pseveral explanations of the same phenomenon, we should prefer the simplest. possible models! If wish to consider becomes kp p large, so there are lots of X 2 by X (model powerset) k different transformations of each variable, then p Many possible choices, none universally accepted. Some (classical) possibilities: Prediction error based criteria (CV) Information criteria (AIC, BIC, . . . ) Mallow’s Cp statistic Before looking at these, let’s introduce terminology: Suppose that the truth is y = X + " but with Fast algorithms (branch and bound, leaps in R) exist to fit them, but they don’t work for large p , and anyway : : : . . . need criterion for comparison. So given a collection of models, we need an automatic (objective) way to pick out a “best” one (unfortunately cannot look carefully at all of them, BUT NOTHING replaces careful scrutiny of the final model by an experienced researcher). r = 0 for some subset r of . The true model contains only the columns for which r ,! 6= 0 Equivalently, the true model uses X~ as the design matrix, the latter being the matrix of columns of X corresponding to non-zero coefficients. 
A correct model is the true model plus extra columns. ,! Equivalently, a correct model has a design matrix M(X~ ) M(X} ). X} , such that A wrong model is a model that does not contain all the columns of the true model. ,! Victor Panaretos (EPFL) Linear Models 171 / 299 Equivalently, a wrong model has a design matrix M(X~ ) \ M(X} ) 6= M(X~ ). Victor Panaretos (EPFL) Linear Models X} , such that 172 / 299 Expected Prediction Error The Bias/Variance Tradeoff I We may wish to choose a model by minimising the error we make on average, when predicting a future observation given our model. Our “experiment is”: X response y at X Every model f 2 2X , will yield fitted values y^ (f ) = Hf y . And suppose we now obtain new independent responses y+ for the same “experimental setup” X . Let X be a design matrix, and let X} (n p ) and X~ (n q ) be matrices built using columns of X . Suppose that the true relationship between y and X is y =X ~ +" | {z } Design matrix Then, one approach is to select the model 1 E ky+ y^ (f )k2 ; f = arg min X n f 22 where expectation is taken over both Victor Panaretos (EPFL) | {z (f ) } y^ = (X}> X} ) 1 X}> y = H} y : Now suppose that we obtain new observations y+ corresponding to the same design X y+ = X~ + "+ = + "+ : y+ y^ = + "+ H} ( + ") = (I H}) + "+ H}": y and y+ . Linear Models 173 / 299 Victor Panaretos (EPFL) 174 / 299 I Impossible to calculate (depends on unknown and 2 ), so we must find a proxy (estimator) b . Suppose that n is large so that we can split the data in two pieces: ky+ y^ k2 = (y+ y^ )> (y+ y^ ) = >(I H}) + ">H}" + ">+"+ + [cross terms]: ] = 0 (why?), we observe that n 1 > (I H} ) + (1 + p =n )2 ; = : (1 + p =n )22 ; (1 + q =n ) ; 8 < The estimator of the prediction error will be if model wrong; if model correct; if model true: b = (n 0) 1ky 0 X 0 ^k2: In practice n can be small and we often cannot afford to split the data (variance of is too large). Instead we use the leave-one-out cross validation sum of squares: ^ Selecting a correct model instead of the true model brings in additional variance, because q < p . Selecting a wrong model instead of the true model results in bias, since I H} 6 when is not in the column space of X} . n b CV = CV = ) =0 Must find a balance between small variance (few columns in the model) and small bias (all columns in the model). Linear Models X , y used to estimate the model X 0 , y 0 used to estimate the prediction error for the model Since E cross terms Victor Panaretos (EPFL) Linear Models Cross Validation It follows that ( X} instead of X~ (i.e., we fit a different model). Therefore Then, observe that The Bias/Variance Tradeoff [ but we use the matrix our fitted values are 175 / 299 where ^ n X j =1 (yj xj> ^ j )2; j is the estimate produced when dropping the j th case. Victor Panaretos (EPFL) Linear Models 176 / 299 Cross Validation Akaike’s Information Criterion No need to perform n regressions since CV = Criteria can be obtained based on the notion of information (relative entropy). n (y X j =1 j x > ^)2 j (1 hjj )2 Same basic idea as for prediction error: aim to choose candidate model to minimise information distance: ; Z so the full regression may be used (show this!). Alternatively one may use a more stable version: (yj xj> ^)2 ; GCV = (1 trace (H )=n )2 j =1 n X where “G” stands for “generalised”, and we guard against any hjj It holds that: [ E GCV > 2 ] = (1 (I p =Hn ))2 + 1 n p =n n : Linear Models 1. 
g ( y ) log f (y ) g (y )dy 0; where g y represents true model—equivalent to maximising expected log likelihood Z Can show that (apart from constants) information distance is estimated by AIC ^ = 2`^ + 2p ( n log ^ 2 + 2p in linear model) where ` is maximised log likelihood for given model, and parameters. 177 / 299 Other Information Criteria Victor Panaretos (EPFL) p is number of Linear Models 178 / 299 Simulation Experiment 2f AICc AIC + n2p (pp+ 1)1 : n Also can use Bayes’ information criterion BIC g = 2`^ + p log n : 10 Cp BIC AIC AICc Mallows suggested p Cp = SS s 2 + 2p n ; where SSp is RSS for fitted model and s 2 estimates 2 . 20 Cp BIC AIC AICc Comments: AIC tends to choose models that are too complicated, buts AICc cures this somewhat; BIC is model selection consistent—if the true model is among those fitted, BIC chooses it with probability ! 1 as n ! 1 (for fixed p ). Linear Models For each n 10; 20; 40 we construct 20 n 7 design matrices. We multiply each of these design matrices from the right with = (1; 2; 3; 0; 0; 0; 0)> and we add a n 1 Gaussian error. We do this independently 50 times, obtaining 1000 regressions with p = 3. Selected models with 1 or 2 covariates have a bias term, and those with 4 or more covariates have excess variance. Improved (corrected) version of AIC for regression problems: Victor Panaretos (EPFL) log f (y )g (y )dy : B Suggests strategy: pick variables to minimise (G)CV. Victor Panaretos (EPFL) () f (y ) 179 / 299 40 Cp BIC AIC AICc Victor Panaretos (EPFL) 1 15 2 131 72 52 398 4 6 2 8 Number 3 504 373 329 565 of covariates 4 5 91 63 97 83 97 91 18 4 6 83 109 125 7 128 266 306 673 781 577 859 121 104 144 94 88 52 104 30 61 30 76 8 53 27 97 1 712 904 673 786 107 56 114 105 73 20 90 52 66 15 69 41 42 5 54 16 Linear Models 180 / 299 Automatic Model Building Forward/Backward/Stepwise Selection I We saw so far: Automatic Model Selection: build a set of models and select the “best” one. I Now look at different philosophy: Automatic Model Building: construct a single model in a way that would hopefully provide a good one. Forward selection: starting from the model with constant only, 1 2 3 add each remaining term separately to the current model; if none of these terms is significant, stop; otherwise update the current model to include the most significant new term; go to step 1. Backward elimination: starting from the model with all terms, There are three standard methods for doing this: 1 2 Forward Selection Backward Elimination 1 CAUTION: Although widely used, these have no theoretical basis. Element of arbitrariness . . . Linear Models 2 181 / 299 Forward/Backward/Stepwise Selection statistic; go consider three options—add a term, delete a term, swap a term in the model for one not in the model, and choose the most significant option; if model unchanged, stop; otherwise go to step 1. Victor Panaretos (EPFL) Linear Models 182 / 299 Example: Nuclear Power Station Data Some thoughts: Each procedure may produce a different model. Systematic search minimising Prediction Error, AIC or similar over all possible models is preferable— BUT not always feasible (e.g., when p large). Stepwise methods can fit ‘highly significant’ models to purely random data! Main problem is lack of objective function. Can be improved by comparing Prediction Error/AIC for different models at each step — uses objective function, but no systematic search. 
Victor Panaretos (EPFL) F Stepwise: starting from an arbitary model, Stepwise Selection Victor Panaretos (EPFL) if all terms are significant, stop; otherwise update current model by dropping the term with the smallest to step 1. Linear Models 183 / 299 Data on light water reactors (LWR) constructed in the USA. The covariates are date (date construction permit issued), T1 (time between application for and issue of permit), T2 (time between issue of operating license and construction permit), capacity (power plant capacity in MWe), PR (=1 if LWR already present on site), NE (=1 if constructed in north-east region of USA), CT (=1 if cooling tower used), BW (=1 if nuclear steam supply system manufactured by Babcock–Wilcox), N (cumulative number of power plants constructed by each architect-engineer), PT (=1 if partial turnkey plant). 1 2 3 4 5 6 7 8 .. . 32 cost 460.05 452.99 443.22 652.32 642.23 345.39 272.37 317.21 date 68.58 67.33 67.33 68.00 68.00 67.92 68.17 68.42 T1 14 10 10 11 11 13 12 14 T2 46 73 85 67 78 51 50 59 capacity 687 1065 1065 1065 1065 514 822 457 PR 0 0 1 0 1 0 0 0 NE 1 0 0 1 1 1 0 0 CT 0 1 1 1 1 1 0 0 BW 0 0 0 0 0 0 0 0 N 14 1 1 12 12 3 5 1 PT 0 0 0 0 0 0 0 0 270.71 67.83 7 80 886 1 0 0 1 11 1 Victor Panaretos (EPFL) Linear Models 184 / 299 Example: Nuclear Power Station Data Full model Est Int. date logT1 logT2 logcap PR NE CT BW log(N) PT s (df) 14:24 0.2 0:092 0:29 0:694 0:092 0:25 0:12 0:033 0:08 0:22 t 3:37 3.21 0.38 1.05 5.10 1:20 3.35 1.82 0.33 1:74 1:83 0.164 (21) Victor Panaretos (EPFL) More Dangers of “Big” Models Backward Est 13:26 0:21 0:72 t 4:22 4.91 6.09 0.24 0.14 3.36 0:08 0:22 2:11 1:99 Forward Est Recall: ,! Adding more variables (columns) into X t 7:62 0:13 2:66 0:67 4:75 y^ is projection of y onto M(X ) 3.38 Consider two extremes Adding a new variable ,! 0.159 (25) 0:49 Xp+1 2= M(X ) 4:77 0.195 (28) but 185 / 299 Victor Panaretos (EPFL) Linear Models Problem of Multicollinearity Using block matrix properties, have Multicollinearity: when dimension q < p ( ^) = 2 (X Xp+1 )>(X Xp+1 ) 1 186 / 299 p explanatories concentrate around a subspace of [simplest case: pairs of variables that are correlated] BUT: might exist even if pairs of variables appear uncorrelated! with (X Xp+1 )>(X Xp+1) 1 = CA DB X (X > X ) 1 X > Xp+1 = HXp+1 ' Xp+1 ? We can certainly fit the regression, but what will happen? Unstable Matrix Inversion Xp+1 2 M(X ) Gives no “new” information — cannot even do least squares (why not?) What if we are between the two extremes? What if Linear Models var Xp+1 2 M? (X ) Gives us completely “new” information. Adding a new variable ,! ( ) “enlarges” M X . . . IF the rank increases by the # of new variables Can be caused by: Poor design [can try designing again], Inherent relationships [other remedies needed]. where A = (X > X ) 1 + (X > X ) 1 X > Xp+1 (Xp>+1 Xp+1 Xp>+1 HXp+1 ) 1 Xp>+1 X (X > X ) 1 ; B = (X > X ) 1 X > Xp+1 (Xp>+1 Xp+1 Xp>+1 HXp+1 ) 1 ; C = (Xp>+1 Xp+1 Xp>+1 HXp+1 ) 1 Xp>+1 X (X > X ) 1 ; D = (Xp>+1 Xp+1 Xp>+1 HXp+1 ) 1 : So what are the results? Huge variances of the estimators! ,! Can even flip signs for different data, to give the impression of inverse effects. Individual coefficients insignificant: ,! t -test p -values inflated. But global Victor Panaretos (EPFL) Linear Models 187 / 299 F -test might give significant result! 
Victor Panaretos (EPFL) Linear Models 188 / 299 The Picket-Fence (Hocking & Pendleton) Diagnosing Multicollinearity Simple first steps: Look at scatterplots, Look at correlation matrix of explanatories, Might not reveal more complex linear constraints, though. Look at the variance inflation factors: 2 ^ VIFj = var( j)2kXj k = kXj k2 (X > X ) 1 jj : VIFj = 1 1R2 j Can show that Rj2 is the coefficient of determination for the regression Xj = 0;j + 1;j X1 + + j 1;j Xj 1 + j +1;j Xj +1 + + measuring linear dependence of Xj on the other columns of X . where Victor Panaretos (EPFL) Linear Models 189 / 299 Victor Panaretos (EPFL) p ;j Xp + "; Linear Models 190 / 299 More on Diagnosing Multicollinearity Let X j be the design matrix without the j -th variable. Then Rj2 = is close to kX j (X >j X j ) 1 X >j Xj k2 kXj k2 Consider the spectral decomposition of X > X , diagf1 ; : : : ; p g and U > U I . Then = 2 [0; 1] = rank(X > X ) = #fj : j 1 if |X j (X >j X{z j ) 1 X }j Xj ' Xj . H j Large values of VIFj indicate that of the design matrix. Xj is linearly dependent on the other columns Rule of thumb: Hence “small” j ’s mean “almost” reduced rank, revealing the effect of collinearity. Measure using condition index: q max =j q CN (X > X ) = max =min 30 Rule of thumb: CN > indicates moderate to significant collinearity, CN > indicates severe collinearity (choices vary). 100 Linear Models j =1 Global “instability” measured by the condition number, VIFj > 5 or VIFj > 10 considered to be “large”. Victor Panaretos (EPFL) p Y det(X > X ) = j : 6= 0g; CIj (X > X ) := Interpretation: how much the variance is inflated when including variable j as compared to the variance we would obtain if Xj were orthogonal to the other variables—how much worse are we doing as compared to the ideal case. X > X = U U > with 191 / 299 Victor Panaretos (EPFL) Linear Models 192 / 299 Remedies? Example: Body Fat Data Body fat is measure of health ! not easy to measure! Collect 252 measurements on body fat and some explanatory variables. If design faulty, may redesign. ,! Otherwise? Inherent relationships between explanatories. Variable deletion - attempt to remove problematic variables ! Can we use measuring tape and scales only to find body fat? E.g., by backward elimination. Explanatory variables: (X ) and use its elements as explanatories Choose an orthogonal basis for M ! ! ! from spectrum, X > X = U U > Use columns of U OK for prediction Problem: lose interpretability age neck hip weight chest thigh height abdomen knee a biceps forearm ankle wrist Other approaches? Victor Panaretos (EPFL) Linear Models 193 / 299 Some Scatterplots [library(car);scatterplot.matrix( . . . 
)] | |||||||||||||||||||||| ||||||||||||||||||||| |||||| ||| |||||||||||||||||||||| || 350 | 200 250 300 350 16 ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●●●●●● ● ●●● ●● ●● ●● ● ● ●● ● ● ●● ●● ●● ● ● ●● ●● ● ●● ●●●● ● ● ●● ●● ● ● ● ● ●●● ● ● ●●● ●● ●● ● ●● ● ●● ● ● ● ● ●● ● ●●●●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●● ● ●●●● ● ● ● ● ● ● ●● ●●● ● ●●●● ● ● ●●● ●●● ●● ● ●● ●● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ●● ●●● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●●●● ●● ● ●● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● 17 18 19 20 ● ● ● ●●● ●●●●●●●●● ●● ●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●● ● ● ●●● ●●●● ●●●●● ●● ● ●● ● ● ●● ●●● ● ● ●● ● ● ●● ● ●● ●● ● ●● ●●●● ●● ● ● ●● ●● ●●● ● ● ●● ● ● ● ●●●●● ●●● ●●● ● ● ● ● ●●●● ● ●● ●● ● ● ● ● ● ●●● ●●●●●●● ● ●●●● ●●● ●● ● ● ● ● ● ● 21 ● ● ● ● ● ● ● ● ● | || |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| ||||||| ||| ||||| ||| ||| | ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ●●● ●●● ●●● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ●● ●● ● ● ● ●●● ●●● ●●● ●● ●● ● ●● ●● ● ● ●● ● ● ● ●● ● ●●●● ●● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●● ●● ● ● ●●●● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●●● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ●●● ● ●●● ● ● ● ● ●● ●● ● ● ● ● 20 18 16 ● ● 30 40 50 60 70 ● abdomen ● ● ● || ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||| || || | | | | | ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ●● ● ●● ● ● ● ●● ●●● ●●●●● ● ●● ● ● ● ●● ● ● ●● ●●● ● ● ● ●● ● ● ●●● ●●●●●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ●● ●●●●● ●● ●● ● ●● ● ● ● ● ● ● ● ●● ● ●●● ● ●● ● ●● ●●● ● ● ● ●● ● ● ● ●● ●●● ● ●● ● ● ●● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●●● ● ●● ● ●● ●●●●●● ● ● ● ● ●● ● ● ● ●● ●● ●●● ● ● ● ● ●● ● ●● ● ●● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ●● ●● ● ●● ●● ● ● ● ● ●● ●●●●● ● ● ● ●●● ● ● ●● ● ● ● ●● ●● ● ● ● ●●●● ● ●● ● ● ● ● ●● ● ●●●● ●● ● ● ● ●●● ● ●● ● ●●●● ●● ●● ● ●● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ●●● ● ● ● ● ● ●● ●● ●● ● ● ● ●● ●● ● ● ● ●● ● ● ● ●●● ●●● ●●● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ●●●● ● ● ● ●●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ●● ● ●● ● ●● ● ● ● ●●● ● ●● ●● ● ●● ●● ● ● ● ● ●●● ●●● ●● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ●●● ● ● ● ● ●● ● ● ●●● ●● ●●●●● ●●● ● ● ● ● ● ● ●●●●●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ●●● ●● ●● ● ●● ●●● ● ● ● ● ●● ● ●●● ● ●● ●●● ● ●● ●● ●● ●● ● ● ● ● ● ● ● ● | ● ● ●● ● ● ●● ● ● ●● ● ●●● ● ● ● ● ●● ● ● ● ●● ●●●● ● ● ● ●● ● ●●●●● ● ●●●●●● ●●● ●● ● ●●● ● ●● ● ●● ●● ●● ● ● ●● ●●● ● ●●●● ●● ●●● ●● ● ●● ● ● ● ●● ●●● ●●●● ● ● ●● ●●● ●●● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●● ● ●●● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●●● ●●● ● ●● ● ●●● ● ● ●●●● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● 80 100 ● wrist ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ● ● ● ● ●●● ●● ● ●● ● ●●● ● ● ●● ●●● ●● ● ● ●● ●●●●● ●● ●●●● ● ●●● ● ●●●● ●● ● ●●● ● ●● ●● ●● ● ●● ●●●● ●●●● ● ●● ●● ● ● ● ● ● ● ●● ●●●● ● ● ●● ●● ●● ●● ● ● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ●●●●● ● ● ● ●● ● ● ● ●● ● ●● ●● ●●● ● ● ●● ● ● ●●● ● ● ●● ●● ● ●●● ●● ● ● ●●● ●● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ●●●● ●● ● ●● ●● ●●●● ●● ●●●● ●● ● ● ●● ●●●● ●● ●● ● ● ● ●● ● ● ● ●● ●● ●● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ●● ●●●● ● ●● ● ● ● ● ●● ●● ●● ● ●●● ● ● ● ●● ●●● ●● ● ● ● ●●● ● ● ● ●●● ●● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ● ●●●● ●● ●● ● ● 
● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● 100 150 ● ● ● ●● ● ● ●● ● ● ●● ●● ●● ● ● ● ● ●● ● ● ●●●● ● ● ● ●● ●● ● ●● ● ●● ● ●● ●● ● ● ●●● ●● ●●● ● ● ● ●● ●● ●● ● ● ●● ●●●● ●● ● ● ●● ●● ●● ●● ●● ● ● ● ● ● ●● ●● ● ● ●●● ● ● ● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●●● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ● ●●●●● ●●● ● ● ●● ● ● ●● ●● ● ● ● ●● ● ● 140 ● 80 250 weight ● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ●●● ● ● ●●● ●●●● ●● ●● ● ●● ● ●● ● ●● ●● ● ●● ● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● 120 | | | |||||||||||||||||||||||||||||||||||||| | | | 140 Looks like we’re in trouble. Let’s go ahead and fit anyway . . . Victor Panaretos (EPFL) Linear Models Linear Models 194 / 299 Model Fit Summary 30 40 50 60 70 150 ● ● ● ●● ●●●●● ● ●● ● ● ●●●●●● ● ●● ●● ●● ● ● ●● ●●● ●●● ●● ●● ● ●● ● ●●● ●● ●● ● ● ● ● ● ● ●●●● ● ●● ●● ●● ● ● ●● ● ● ●●● ●●●● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ●●● ● ● ● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●●●●● ● ●● ● ●● ● ●●● ●● ● ●●● ● ●● ●● ● ● ● ●● ●●●● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● height Victor Panaretos (EPFL) (Intercept) age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist Estimate 18.1885 0.0621 0.0884 0.0696 0.4706 0.0239 0.9548 0.2075 0.2361 0.0153 0.1740 0.1816 0.4520 1.6206 Std. Error 17.3486 0.0323 0.0535 0.0960 0.2325 0.0991 0.0864 0.1459 0.1444 0.2420 0.2215 0.1711 0.1991 0.5349 t value 1.05 1.92 1.65 0.72 2.02 0.24 11.04 1.42 1.64 0.06 0.79 1.06 2.27 3.03 Pr(>jtj) 0.2955 0.0562 0.0998 0.4693 0.0440 0.8100 0.0000 0.1562 0.1033 0.9497 0.4329 0.2897 0.0241 0.0027 R2 = 0:749, F -test: p < 2:2 10 16 . 195 / 299 Victor Panaretos (EPFL) Linear Models 196 / 299 Victor Panaretos (EPFL) Pr(>jtj) 0.9730 0.6252 0.8223 0.7284 0.2635 0.5877 0.0000 0.9130 0.8793 0.7366 0.3516 0.1179 0.3750 0.0560 Linear Models age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist 197 / 299 Variable Deletion: Backward Elimination (Intercept) age weight neck abdomen hip thigh forearm wrist Victor Panaretos (EPFL) Singular Values Eigenvalue Roots ● ● ● 2 ● ● 4 ● ● 6 ● 8 ● ● ● 10 ● ● 12 ● 14 Index Condition Number ' 556 ! Linear Models 198 / 299 Variable Transformation: Eigenvector Basis 0.7466, < 2.2e-16 Estimate 22.6564 0.0658 0.0899 0.4666 0.9448 0.1954 0.3024 0.5157 1.5367 1 2 3 4 5 6 7 8 9 10 11 12 13 CI 1.00 17.47 25.30 58.61 83.59 100.63 137.90 175.29 192.62 213.01 228.16 268.21 555.67 Victor Panaretos (EPFL) Define Multiple R-Squared: F-statistic p-value: VIF 2.25 33.51 1.67 4.32 9.46 11.77 14.80 7.78 4.61 1.91 3.62 2.19 3.38 4000 Estimate 1.2221 0.0256 0.0237 0.1005 0.4619 0.0910 0.8924 0.0265 0.0334 0.1310 0.5037 0.4458 0.2247 1.5902 2000 Pr(>jtj) 0.1393 0.0153 0.0502 0.5207 0.0721 0.9002 0.0000 0.1265 0.0565 0.5111 0.0694 0.5485 0.0174 0.0309 0 (Intercept) age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist Estimate 32.6564 0.1048 0.1285 0.0666 0.5086 0.0168 0.9750 0.2891 0.3850 0.2218 0.4377 0.1297 0.8871 1.7378 Diagnostic Check values Split Data in Two and Fit Separately (Picket Fence) Std. 
Error 11.7139 0.0308 0.0399 0.2246 0.0719 0.1385 0.1290 0.1863 0.5094 Linear Models t value 1.93 2.14 2.25 2.08 13.13 1.41 2.34 2.77 3.02 Pr(>jtj) 0.0543 0.0336 0.0252 0.0388 0.0000 0.1594 0.0199 0.0061 0.0028 Z = XU as design matrix. (Intercept) Z[, 1] Z[, 2] Z[, 3] Z[, 4] Z[, 5] Z[, 6] Z[, 7] Z[, 8] Z[, 9] Z[, 10] Z[, 11] Z[, 12] Z[, 13] VIF 2.05 18.82 4.08 8.23 13.47 6.28 1.94 3.09 199 / 299 Victor Panaretos (EPFL) R2 =0.749, F-test Estimate 18.1885 0.1353 0.0168 0.2372 0.7188 0.0248 0.4546 0.5903 0.1207 0.0836 0.5043 0.5735 0.3007 1.5168 Std. Error 17.3486 0.0619 0.0916 0.1070 0.0571 0.0827 0.1001 0.1366 0.1742 0.1914 0.2082 0.2254 0.2628 0.5447 Linear Models p-value< t value 1.05 2.19 0.18 2.22 12.58 0.30 4.54 4.32 0.69 0.44 2.42 2.54 1.14 2.78 2:2 10 16 Pr(>jtj) 0.2955 0.0297 0.8546 0.0276 0.0000 0.7649 0.0000 0.0000 0.4890 0.6627 0.0162 0.0116 0.2536 0.0058 200 / 299 From Rotation to Shrinkage Ridge Regression Multicollinearity problem is that det [i.e. X > X almost not invertible] Eigenvector approach rotates space so as to “free” the dependence of one coefficient j on others f i gi 6=j ,! Imposes constraint on X (orthogonal columns) (X >X ) 1 0 A Solution: add a “small amount” of a full rank matrix to X >X . For reasons to become clear soon, we standardise the design matrix: Problem: lose interpretability! (prediction OK) Write Recentre/rescale the covariates defining: Example: most significant “rotated” term in fat data: Z[,4]=-0.01*age -0.058*weight -0.011*height +0.46*neck -0.144*chest -0.441*abdomen +0.586*hip +0.22*thigh -0.197*knee -0.044*ankle -0.07*biceps -0.33*forearm -0.249*wrist ,! ,! Other approach to reduce this strong dependence? ,! Impose constraint on ! How? (introduces bias) Victor Panaretos (EPFL) X = (1 W ), = ( 0 )> 1W j ) Coefficients now have common scale Interpretation of j slightly different: not “mean impact on response per unit change of explanatory variable”, but now “mean impact on response per unit deviation of explanatory variable from its mean, measured in units of standard deviation” The Zj are all orthogonal to 1 and are of unit norm. Linear Models 201 / 299 Ridge Regression Victor Panaretos (EPFL) Linear Models 202 / 299 Ridge Regression: Shrinkage Viewpoint ! Ridge term I Since Zj ? 1 for all, j , we can estimate (orthogonality). 0 and by two separate regressions seems slightly ad-hoc. Motivation? ,! Can see that ( ^0 Least squares estimators become ^0 = Y ; ^ = (Z >Z ) 1 Z >Y : Ridge regression replaces Z >Z by Z > Z + I (i.e. adds a “ridge”) 01 Z k22 p 1 X subject to j =1 2 = k k2 r () 2 j instead of least squares estimator which minimizes : kY Adding I to Z > Z makes inversion more stable ,! called ridge parameter. Linear Models ^) = (Y (Z >Z + I ) 1Z >Y ) minimizes kY 0 1 Z k22 + k k22 or equivalently kY ^0 = Y ; ^ = (Z >Z + I ) 1Z >Y Victor Panaretos (EPFL) p Zj = sd(Wn j ) (Wj 01 Z k22 : Idea: in the presence of collinearity, coefficients are ill-defined: a wildly positive coefficient can be cancelled out by a largely negative coefficient (many coefficient combinations can produce the same effect). By imposing a size constraint, we limit the possible coefficient combinations! 203 / 299 Victor Panaretos (EPFL) Linear Models 204 / 299 L2 Shrinkage [Ridge Regression] Ridge Regression Proposition Let Zn q be a matrix of rank r Given q with centred column vectors of unit norm. > 0, the unique minimiser of Q ( ^0 ; ^) = ky ^0 1 Z ^k22 + k^k22 is ( ^0 ; ^) = (y ; (Z > Z + I ) 1 Z >y ): Proof. Write y = |(y {zy 1}) + |{z} y 1 =y 2M? (1) 2M(1) (Z ). 
Therefore by Pythagoras’ theorem Z ^k22 = k( y ^0 )1 + (y Z ^)k2 = k(y ^0 )1k22 + k(y Z ^)k22 : | {z } | {z } 2 Note also that by assumption 1 2 M? ky ^0 1 Victor Panaretos (EPFL) Linear Models 205 / 299 n 2 2 ^ ^ 2 min ^ ;^ Q ( 0 ; ^) = min ^ k(y 0 )1k2 + min ^ k(y Z ^)k2 + k^k2 0 Clearly, Note that if the SVD of used to show that ^0 )1k22 = y^ while the second component can be written pZ ^ min q ^2R Iq q y Note that and p Linear Models 206 / 299 0q 1 Iq q ) p Z Iq q 1 (Z > ; p Iq q ) {z Victor Panaretos (EPFL) q X !j > 2j + (vj y )uj ; ! j =1 V and U , respectively. Compare this to the ordinary least squares solution, when y = (Z > Z + I ) 1 Z > y 0 ^= q 1 which is not even defined if Role of = diagf| 1 ; :{z : : ; r ; r +1 ; : : : ; q g (Z > Z 0 & rank (Z > Z ) = rank (Z )). } | ^= where the vj s and uj s are the columns of 2 Z > Z + I is indeed invertible. Writing Z > Z = U U > , we have Z > Z + I = U U > + U (Iq q )U > = U ( + Iq q )U > >0 =0 To complete the proof, observe that Z is Z = V U > , last steps of previous proof may be 2 using block notation. This is the usual least squares problem with solution (Z > ; Victor Panaretos (EPFL) o 0 arg min ^0 k(y 2M(Z ) The Effect of Shrinkage (proof ctd) Therefore, 2M(1) Z q X = 0: 1 (v >y )uj ; j =1 !j j is of reduced rank. is to reduce the size of 1=!j when !j becomes very small. } Z > y = Z > y yZ > 1 = Z > y . Linear Models 207 / 299 Victor Panaretos (EPFL) Linear Models 208 / 299 Bias and Variance Domination over Least Squares Proposition Corollary Let ^ be the ridge regression estimator of . Then bias(^ ; ) = Z > Z + Iq and Assume that 1 E Proof. Since E(^ ) = (Z > Z + I ) 1 Z > E(y ) = (Z > Z + I ) 1 Z > Z , the bias is bias(^ ; ) = E(^ ) = f(Z > Z + I ) 1 Z > Z I g 1 1 1 > > = Z Z +I Z Z +I I I = I 1 Z >Z + I 1 I = for all (~ )(~ )> E (^ )(^ )> 0 2 (0; 22 =k k2 ). Gauss-Markov only covers unbiased estimators – but ridge estimator biased. 1 Z >Z + I 1 A bit of bias can improve the MSE by reducing variance! : Also, there is a catch! The “right” range for Choosing a good Linear Models 209 / 299 depends on unknowns. is all about balancing bias and variance. Victor Panaretos (EPFL) Linear Models 210 / 299 Bias–Variance Tradeoff Proof. From our bias/variance calculations on the ridge estimator, we have E 2 (Z > Z ) Ridge estimator uniformly better than least squares! How can this be? (What happened to Gauss-Markov?) The covariance term is obvious. Victor Panaretos (EPFL) ^ = (Z >Z + I ) 1Z >y be the least squares estimator and ridge estimator, respectively. Then, cov (^) = 2 (Z > Z + I ) 1 Z > Z (Z > Z + I ) 1 : rank(Zn q ) = q and let ~ = (Z >Z )1Z >y & 1 (~γ γ )(~γ γ )> E (Z > Z +I ) 1 2 Z > Z (Z > Z +I ) = (Z > Z + I ) 1 (^γ 1 γ )(^ γ γ )> = 2 Z > Z + I 2 (2I + (Z > Z ) 1 ) 1 γγ > Z > Z + I 1 γγ > (Z > Z + I ) 1 : Role of : Regulates Bias–Variance tradeoff " decreases variance (collinearity) but increases bias # decreases bias but variance inflated if collinearity exists Recall: Ek ^ To go from 2nd to 3rd line, we wrote 2 (Z > Z ) = 2 (Z > Z + I ) 1 (Z > Z + I )(Z > Z ) 1 (Z > Z + I )(Z > Z + I ) = (Z > Z + I ) 1 (2 Z > Z + 22 I + 2 2 (Z > Z ) 1 )(Z > Z + I ) 1 1 Note that if 1 and did the tedious (but straighforward) algebra. The final term can be made positive definite if 22 I + 2 (Z > Z ) 1 k2 = Ek | ^ {zE^k}2 + k|E^ {z k}2 + 2( E^ | Variance =trace [cov (^)] Bias 2 Z > Z = U U > trace (cov (^)) = Pq i j =1 2i + 2 ){z>E[^ =0 E ^}] So choose so as to optimally increase bias/decrease variance Use cross validation! 
γγ > 0: Noting that we can always write f k k g > Pq 1 > j =1 θj θj an orthonormal basis of Rq we see that 2 (0; 2 2 =kγ k2 ) suffices for I = kγγ + γ k2 for γ = γ ; θ1 ; :::; θq 1 positive definiteness to hold true. Victor Panaretos (EPFL) Linear Models 211 / 299 Victor Panaretos (EPFL) Linear Models 212 / 299 L1 Shrinkage? Orthogonal Design When the explanatory variables are orthogonal (i.e. performs model selection via thresholding: Motivated from Ridge Regression formulation can consider: kY min! 01 Z k22 subject to () kY min! 01 p 1 X j =1 j j j = k k1 r () Consider the linear model Y = 0 n11 + n (Zp 1) + " 11 (p 1)1 n 1 where Z > 1 = 0 and Z > Z = I . Let ^ be the least squares estimator of ^ = (Z >Z ) 1 Z >Y = Z >Y : n 1 Z k22 + k k1 : Y No explicit form available (unless Z > Z = I ), needs quadratic programming Resulting estimator non–linear in algorithm Why choose a different type of norm? Because L1 penalty (almost) produces a “continuous” model selection! Linear Models min 2R; 2Rp kY 01 Z k22 + k k1 is given by ( ^0 ; ) = ( 0 ; 1 ; : : : ; p 1 ), defined as 0 213 / 299 Note that since Z > 1 = 0 and since 0 does not appear in the L1 penalty, we have ^0 = (1>1) 11Y = Y : & Victor Panaretos (EPFL) 0 kY 01 Z ^k22 + k^k1 = 2min ku Z ^k22 + k^k1 : Rp 1 u = Y Y 1 for tidiness. Expanding the squared norm gives > Z ) = u > u 2Y > Z + 2Y 1> Z + ku Z ^k22 = u > u 2u > Z + > (|Z{z | {z } | {z } } where =^> =I Since u > u does not depend on > =0 , we see that the LASSO objective function is 2^> + k k22 + k k1: 1=2, i.e. ^i i + 12 i2 + 2 j i j : Clearly, this has the same minimizer if multiplied across by ^> + 12 k k22 + 12 k k1 = Ppi =11 Victor Panaretos (EPFL) Linear Models 215 / 299 1 i = sgn (^i ) j^i j 2 + ; i = 1; :::; p 1: Linear Models 214 / 299 1 Notice that we now have a sum of p objective functions, each depending only on one i . We can thus optimise each separately. That is, for any given i p , we must minimise 1 ^i i + 12 i2 + 2 j i j: Thus, the LASSO problem reduces to 1 , Then, the unique solution to the LASSO problem ^0 = Y Proof. min 2R; 2Rp Z > Z = I ), then the LASSO exactly Theorem Shrinks coefficient size by different version of magnitude. Victor Panaretos (EPFL) $ Model Selection We distinguish 3 cases: 1 2 Case i . In this case, the objective function becomes 12 i2 2 j i j and it is clear that it is minimised when i . So in this case i . 1 2 Case i > . In this case, the objective function i i 2 i 2 j i j is minimised somewhere in the range i 2 ; 1 because the term i i is negative there (and all other terms are positive). But when i , the objective function becomes ^ =0 ^ 0 =0 [0 ) + =0 ^ + + 0 ^ 1 2 ^i i + 2 i + 2 i = 2 ^i i + 12 i2: If 2 ^i 0, then the minimum over i 2 [0; 1) is clearly at i = 0. Otherwise, when 2 ^i < 0, we differentiate and find the minimum at i = ^i =2 > 0. In summary, i = (^i =2)+ = sgn (^i )(j ^i j =2)+ . Victor Panaretos (EPFL) Linear Models 216 / 299 3 Case i < . In this case, the objective function i i 21 i2 2 j i j is minimised somewhere in the range i 2 1; because the term i i is negative there (and all other terms are positive). But when i , the objective function becomes ^ 0 ( 0] ^ + + 0 ^ L1 Shrinkage [The LASSO] 1 1 2 2 ^i i + 2 i + 2 ( i ) = 2 + ^i ( i )+ 2 i = 2 j^i j ( i )+ 12 i2 : If 2 j ^i j 0, then the minimum over i 2 ( 1; 0] is clearly at i = 0, since i ranges over [0; 1). 
Otherwise, when 2 j ^i j < 0, we differentiate and find the minimum at i = j ^i j + =2 < 0, which we may re-write as: j^i j + =2 = (j^i j =2) = sgn (^i ) (j^i j =2) : In summary, i = sgn (^i )(j ^i j =2)+ . The proof is now complete, as we can see that all three cases yield i = sgn (^i ) j^i j 2 Victor Panaretos (EPFL) + ; i = 1; :::; p 1: Linear Models 217 / 299 LASSO as the Relaxation of Best Subsets Intuition: Victor Panaretos (EPFL) Linear Models 218 / 299 LASSO profile for Bodyfat Data [LARS algorithm] L1 norm induces “sharp” balls! LASSO and CV for different values of r ()=k^k1 LASSO 0 Induces model selection by regulating the lasso (through ) Extreme case: L0 “Norm”, gives best subsets selection! 7 8 10 13 70 150 * * * * ● ● ● ** ● 50 ● cv ● ● 40 ● ** ** * * * ** * ** ** ** * * * * 0.0 0.2 0.4 0.6 0.8 * ** ** * ** ** ** ● 8 ** * * * * ** ** ** * ** 1.0 |beta|/max|beta| Linear Models 219 / 299 Victor Panaretos (EPFL) ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ●●●●● ●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1 * ** * −50 ** **** ** ** 30 * ** * * * * ** ** ** ** ** * * ** * * 20 * * * * Victor Panaretos (EPFL) ● ● 3 1 p j =1 j j j , sharp balls for 0 < p 1 Pp ● * 4 : j 6= 0g * 100 =0g = #fj j6 60 ** 1f ● ● * 2 = 5 50 k j =1 4 * 0 Generally: kpp j =1 j j j0 = p 1 X 3 ** Standardized Coefficients k k0 = p 1 X 1 6 Balls more concentrated around the axes 0.0 0.2 0.4 0.6 0.8 1.0 fraction Linear Models 220 / 299 Robust/Resistant Methods The “success” of the LSE in a regression model depends on “assumptions”: Normality (LSE optimal in this case) Not many “extreme” observations (LSE affected from “extremities”) Robust Linear Modeling Picture: Resistant procedure: not strongly affected by changes to data. Robust procedure: not strongly affected by departures from distribution. Often: Robust , Resistant Victor Panaretos (EPFL) Linear Models 221 / 299 Motivating Example: Estimating a Mean Let X1 ; : : : ; Xn iid F , estimate = x = 1 n X n i =1 1 xF (dx ) by n X 2R i =1 (xi m = arg min )2 jxi ; n = 2k + 1; j = xx(kk +1) +x k ; n = 2k : 2 ( ) ( +1) Median much less sensitive to bad values. Higher breakdown point: must blow up at least 50% of obs to blow Median is optimal (MLE) when x 7! x + =) x 7! x + =n . large relative to n ! disaster . . . May not be optimal for other possible m up. F is Laplace. But how well does m perform when F ' Normal (relative efficiency)? x is optimal (MLE) when F is Normal. Victor Panaretos (EPFL) n X 2R i =1 Extremely sensitive to outliers (low breakdown point). If 222 / 299 Can we “cure” sensitivity by using different distance function? R1 xi = arg min Blows up from a single value: Linear Models Motivating Example: Estimating a Mean Some observations: Average Victor Panaretos (EPFL) Remember picture: F ’s . . . Linear Models 223 / 299 Victor Panaretos (EPFL) Linear Models 224 / 299 Motivating Example: Estimating a Mean Regression Setup Other alternatives? Regression situation is similar. Have: Y = X + "; " F ; I -Trimmed mean: throw away most extreme observations: trm = jE1c j X i 2= E xi ; LSE for n F = Normal Disastrous if yi 7! yi + c with c large: How to objectively/automatically choose weight? Linear Models 225 / 299 Robust/Resistant Alternatives ^ 7! ^ + (X T X ) 1xi c Gauss-Markov: optimal linear for any ,! 
May not be overall optimal for other Victor Panaretos (EPFL) F ’s Linear Models 226 / 299 Seek a unifying approach: 2R = arg min PKi=1 (yi xi> )2(i ), where we set 2Rp K = bn =2c + b(p + 1)=2c Weighted least squares: = (X > V 1 X ) 1 X > V 1 Y for a diagonal weight Trimmed least squares: Instead of ()2 or j j, consider a more general distance function (). MLE when errors are Gaussian is obtained as maximising loglikelihood kernel n ^ = arg max 1 X yi xi> 2 i =1 2Rp (recall earlier lecture): 0 B V =B B @ w1 w2 0 0 .. . wn 1 C C C: A Replacing b ! Find a general formulation of which above are special cases. Linear Models 2 (u ) = u 2 by general () yields: Would like to formalise the concept of robust/resistant estimation Victor Panaretos (EPFL) F M-Estimators Pn > L1 regression: ~ = arg min k =1 jyi xi j p V 2R k =1 Optimal at Weights downplaying certain observations (i.e., give less weight to extremes ...) matrix given by ^ = (X >X ) 1X >y = arg min X(yi xi> )2 p E being subset of n most extreme observations from each end. Both m and trm may ‘throw away’ information. View as special cases of the IWeighted estimate: Pn wi xi wm = Pi =1 n w : i =1 i Victor Panaretos (EPFL) [ ] = 0; cov["] = 2 I E" := arg min p 2R n X i =1 y x> i i Call this an M(aximum likelihood like)-Estimator. 227 / 299 Victor Panaretos (EPFL) Linear Models 228 / 299 M-Estimation as Weighted Regression Obtaining Pn arg min i =1 p 2R yi xi> n X i =1 with xi> Examples of Distance Functions and Weight Functions reduces to solving yi xi> Idea: choose to have desirable properties (reduce/eliminate impact of outliers) — same as choosing weight function. =0 Some typical examples are: (z ) = z 2 (z ) = jz j (t ) = d (t )=dt . Letting w (u ) = (u )=u this reduces to n X i =1 wi xi> (yi xi> ) = 0; where > wi = w yi xi : But this is simply the weighting scenario! I Robust Regression can be written as a Weighted Regression, but the weights depend on the data. Distance functions are in Victor Panaretos (EPFL) 1 1 correspondence with loss functions. Linear Models 229 / 299 Examples of Distance Functions and Weight Functions Victor Panaretos (EPFL) , w (u ) = 2 , w (u ) = 1=ju j 2 z ; if jz j H Huber: (z ) = 2H jz j H 2; otherwise 8 n < 1 2 2 o3 ; jz j B ; B 1 1 ( z = B ) 6 Bisquare: (z ) = : 1 2 6 B ; otherwise: Linear Models Victor Panaretos (EPFL) Linear Models 230 / 299 Examples of Distance Functions and Weight Functions 231 / 299 Victor Panaretos (EPFL) Linear Models 232 / 299 Computing a Regression M-Estimator (Asymptotic) Distribution of M-Estimators IObtained M-Estimator as the solution to the system X > ψ( ) = 0 I Explicit expression for LSE I M-Estimation: non-linear optimisation problem — use iterative approach instead of I Iteratively re-weighted least squares: 1 2 3 4 5 6 ^(0) Obtain initial estimate (0) = ( ^ ) Form normalised residuals ui yi xi> (0) =MAD yi (0) w u (0) for the chosen weight function w Obtain wi i = ( ) ( () X > (y X ) = 0. Here we defined ψ xi> ^(0) ( )= ;:::; yn xn> > X > ( ) = 0; 8 2 Rp ; then under mild regularity conditions, as n ! 1, we can show that E Iterate until convergence (?) 
(Asymptotic) Distribution of M-Estimators

If these estimating equations are unbiased, i.e.
$$\mathbb{E}\left[X^\top \psi(y - X\beta)\right] = 0 \qquad \forall\, \beta \in \mathbb{R}^p,$$
then, under mild regularity conditions, as $n \to \infty$ we can show that
$$\hat\beta_n \overset{d}{\approx} N_p\!\left(\beta,\; \mathbb{E}[X^\top \nabla\psi]^{-1}\, X^\top \mathbb{E}[\psi\psi^\top] X\, \mathbb{E}[X^\top \nabla\psi]^{-1}\right).$$

Example: Professor's Van

$\hat\beta = 0.07$ (with $p = 0.06$) while $\tilde\beta = 0.09$ (with $p \simeq 0$).
[Figure: Professor's van data with least squares and robust fits.]

Asymptotic Relative Efficiency (ARE)

Remember our picture of possible $F$'s. ARE measures the quality of one estimator $\tilde\theta$ of $\theta \in \mathbb{R}^p$ relative to another, often the MLE $\hat\theta$, for which $\mathrm{var}(\hat\theta) \approx I^{-1}(\theta)$ for large sample sizes. The ARE of $\tilde\theta$ relative to $\hat\theta$ is
$$\left(\frac{|\mathrm{var}(\hat\theta)|}{|\mathrm{var}(\tilde\theta)|}\right)^{1/p} \; (\times 100\%).$$
Generally the ARE of $\tilde\theta$ relative to the MLE $\hat\theta$ is less than 1 (100%): low ARE is bad, high ARE is good. For a single coefficient, the ARE of $\tilde\beta_r$ relative to $\hat\beta_r$ is
$$\frac{\mathrm{var}(\hat\beta_r)}{\mathrm{var}(\tilde\beta_r)} \; (\times 100\%).$$

ARE in the Linear Model

Linear model $y = X\beta + \varepsilon$, with $\varepsilon_j \overset{iid}{\sim} g(\cdot)$; assume $\mathrm{var}(\varepsilon_j) = \sigma^2 < \infty$ and $g$ known. Assume the MLE is regular, with
$$i_g = \int -\frac{\partial^2 \log g(u)}{\partial u^2}\, g(u)\, du = \int \left(\frac{\partial \log g(u)}{\partial u}\right)^2 g(u)\, du.$$
The ARE of the LSE of $\beta_r$ relative to the MLE of $\beta_r$ is $\dfrac{1}{\sigma^2\, i_g}$.
Examples:
  $g(\cdot)$ Gaussian: ARE of the LSE is 1.
  $g(\cdot)$ Laplace: ARE of the LSE is 1/2.
  ARE of Huber at $g(\cdot)$ Gaussian is 95% with $H = 1.345$.

Mallow's Rule

"A simple and useful strategy is to perform one's analysis both robustly and by standard methods and to compare the results. If the differences are minor, either set may be presented. If the differences are not minor, one must perforce consider why not, and the robust analysis is already at hand to guide the next steps."
  Perform the analysis both ways and compare the results.
  Plot the weights to see which observations were downweighted. Try to understand why.

Nonlinear and Nonparametric Models

The Big Picture

Recall the most general version of regression given in Week 1:
$$Y_i \mid x_i \overset{ind}{\sim} \mathrm{Dist}\{g(x_i^\top)\}, \qquad i = 1, \ldots, n.$$
So far we have investigated what happens when
$$g(x^\top) = x^\top\beta, \qquad \mathrm{Dist} = N(x^\top\beta, \sigma^2), \qquad \beta \in \mathbb{R}^p.$$
We now consider a more general situation:
$$Y_i \mid x_i \overset{ind}{\sim} N\{\eta(x_i^\top, \beta), \sigma^2\}, \qquad i = 1, \ldots, n,$$
where $\eta(x_i^\top, \beta)$ is a KNOWN function that depends on a parameter $\beta \in \mathbb{R}^p$, but is not linear in $\beta$.

Example: Logistic Growth

Decennial population data from the US, for 1790–1990: $y$ is population in millions, $x$ is time.
Regression model:
$$Y_i = \frac{\beta_1}{1 + \exp(\beta_2 + \beta_3 x_i)} + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2), \qquad i = 1, \ldots, n.$$
Here $\eta(x, \beta) = \beta_1 / \{1 + \exp(\beta_2 + \beta_3 x)\}$.
  The distribution remains Gaussian.
  Cannot transform into a linear regression problem.
  Coefficient interpretation differs from that in a linear model.
  Related to the differential equation $\dfrac{d\eta(x)}{dx} = C\,\eta(x)\{1 - \eta(x)\}$.
[Figure: US decennial population with fitted logistic growth curve.]

Basic Observations and Notation

Still assume independent random variables $Y_1, \ldots, Y_n$, with observed values $y_1, \ldots, y_n$, and explanatories $x_1, \ldots, x_n$. The distribution is still Gaussian. Introduce notation:
  $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$,
  $\eta(\beta) = (\eta_1(\beta), \ldots, \eta_n(\beta))^\top = (\eta(x_1^\top, \beta), \ldots, \eta(x_n^\top, \beta))^\top$, i.e.
$$\eta(\cdot): \mathbb{R}^p \to \mathbb{R}^n, \qquad \beta \in \mathbb{R}^p \mapsto \eta(\beta) \in \mathbb{R}^n.$$
Therefore $\eta(\beta)$ is a vector-valued function. Analogy with the linear case: $\eta(\beta)$ plays the role of $X\beta$, but is no longer linear in $\beta$.
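As an aside, here is a sketch of fitting a logistic growth curve of this form with R's nls(); the data are simulated and the starting values are rough guesses of ours (with real data, good starting values matter — see "Choosing $\beta^{(0)}$" below).

# Fitting a logistic growth curve with nls() -- illustrative sketch on simulated data
set.seed(1)
x   <- seq(0, 20, by = 1)                        # e.g. decades since 1790
eta <- 300 / (1 + exp(4 - 0.4 * x))              # "true" logistic mean curve
pop <- eta + rnorm(length(x), sd = 5)            # simulated responses
dat <- data.frame(x = x, pop = pop)
fit <- nls(pop ~ b1 / (1 + exp(b2 + b3 * x)), data = dat,
           start = list(b1 = 250, b2 = 3, b3 = -0.3))
summary(fit)   # estimates and approximate standard errors based on the local linearisation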
The model now is
$$y = \underbrace{\eta(\beta)}_{n\times 1} + \underbrace{\varepsilon}_{n\times 1}, \qquad \beta \in \mathbb{R}^p, \qquad \varepsilon \sim N_n(0, \sigma^2 I).$$

Likelihood and . . . least squares — Again!

Since $\varepsilon \sim N_n(0, \sigma^2 I)$, we have $y \sim N\{\eta(\beta), \sigma^2 I\}$, so the likelihood and loglikelihood are
$$L(\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - \eta(\beta))^\top (y - \eta(\beta)) \right\},$$
$$\ell(\beta, \sigma^2) = -\frac{1}{2}\left[ n\log 2\pi + n\log\sigma^2 + \frac{1}{\sigma^2}(y - \eta(\beta))^\top(y - \eta(\beta)) \right],$$
. . . exactly as in the linear case, but with $\eta(\beta)$ replacing $X\beta$. Hence this suggests the least squares estimators
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - \eta(\beta)\|^2 \quad \text{(assuming identifiability)}, \qquad \hat\sigma^2 = \frac{1}{n}\|y - \eta(\hat\beta)\|^2.$$

Model Fitting by Taylor Expansions

The main problem is non-linearity — we cannot obtain a closed-form solution in general.
Idea: linearise locally, assuming that $\eta$ is sufficiently smooth. A first-order Taylor expansion approximates $\eta$ as
$$\underbrace{\eta(\beta)}_{n\times 1} \;\simeq\; \underbrace{\eta(\beta^{(0)})}_{n\times 1} + \underbrace{[\nabla\eta]_{\beta = \beta^{(0)}}}_{n\times p}\; \underbrace{(\beta - \beta^{(0)})}_{p\times 1},$$
provided $\beta$ is sufficiently close to $\beta^{(0)}$. We dropped higher-order terms by appealing to the smoothness of $\eta(\cdot)$ (smoothness $\approx$ higher derivatives "close to zero").

The linearised representation suggests a Newton–Raphson iteration. Suppose an initial estimate $\beta^{(0)}$ is available. Let $D^{(0)} = [\nabla\eta]_{\beta=\beta^{(0)}}$ and $u^{(0)} = \beta - \beta^{(0)}$. The Taylor expansion yields
$$y - \eta(\beta^{(0)}) \simeq D^{(0)} u^{(0)} + \varepsilon.$$
To get $\beta$ we need $u^{(0)}$. Consider the following iteration:
1. Initialise with $\beta^{(0)}$.
2. $u^{(1)} = \arg\min_{u \in \mathbb{R}^p} \|y - \eta(\beta^{(0)}) - D^{(0)} u\|^2$ :-) (but this is just a linear least squares problem, with $X \leftarrow D^{(0)}$ and $y \leftarrow y - \eta(\beta^{(0)})$!).
3. Thus set $u^{(1)} = ([D^{(0)}]^\top D^{(0)})^{-1} [D^{(0)}]^\top \{y - \eta(\beta^{(0)})\}$.
4. Let $\beta^{(1)} = \beta^{(0)} + u^{(1)}$ and iterate until a convergence criterion ($\|\beta^{(k)} - \beta^{(k-1)}\| < \epsilon$) is satisfied. Return the last $\beta^{(k)}$ as $\hat\beta$.

Geometry of Nonlinear Least Squares

As $\beta$ ranges over $\mathbb{R}^p$, $\eta(\beta)$ traces a $p$-dimensional differentiable manifold (smooth surface) in $\mathbb{R}^n$,
$$\mathcal{M}(\eta) = \{\eta(\beta) : \beta \in \mathbb{R}^p\},$$
and $\beta$ provides the intrinsic coordinates on that manifold. The response $y$ is obtained by selecting a point $\eta(\beta)$ on the manifold and adding a mean-zero Gaussian vector $\varepsilon$. Regression asks us to find the coordinates of the point on the manifold that generated $y$. We would like to project $y$ on the manifold, but do not have a closed-form expression!

Geometry of Linear Approximation

The Newton–Raphson algorithm is interpretable via differential geometry. The $p$-dimensional tangent plane at a point $\eta(\beta^{(0)}) \in \mathcal{M}(\eta)$ is spanned by $[\nabla\eta]_{\beta=\beta^{(0)}} u$, $u \in \mathbb{R}^p$. Hence we may write
$$T_{\beta^{(0)}}\mathcal{M}(\eta) = \left\{\eta(\beta^{(0)}) + D^{(0)} u : u \in \mathbb{R}^p\right\}.$$
In other words, the $p$ columns of $D^{(0)}$, when translated by $\eta(\beta^{(0)})$, form a basis for the tangent plane at $\eta(\beta^{(0)})$. The Taylor expansion merely says that if $\beta$ is close to $\beta^{(0)}$, we approximately have $\eta(\beta) \in T_{\beta^{(0)}}\mathcal{M}(\eta)$. This is equivalent to the expression
$$\eta(\beta) - \eta(\beta^{(0)}) \simeq \underbrace{[\nabla\eta]_{\beta=\beta^{(0)}}}_{D^{(0)}}\; \underbrace{(\beta - \beta^{(0)})}_{u^{(0)}}.$$
Therefore, $y - \eta(\beta^{(0)}) \simeq D^{(0)} u^{(0)} + \varepsilon$ means that $\mathbb{E}[y]$ approximately lies in $T_{\beta^{(0)}}\mathcal{M}(\eta)$. The Newton–Raphson algorithm is iterated projection onto approximating linear subspaces.
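A compact R sketch of the iteration just described, for a user-supplied mean function and Jacobian (the code and function names are ours; nls() implements a refined version of the same idea). The example reuses the simulated x and pop from the nls sketch above.

# Gauss-Newton / Newton-Raphson iteration for nonlinear least squares (illustrative sketch)
gauss_newton <- function(y, eta, jac, beta0, maxit = 100, tol = 1e-10) {
  beta <- beta0
  for (k in seq_len(maxit)) {
    r <- y - eta(beta)                      # current residuals y - eta(beta)
    D <- jac(beta)                          # n x p Jacobian, D = grad eta at beta
    u <- solve(t(D) %*% D, t(D) %*% r)      # linear LS step: u = (D'D)^{-1} D' r
    beta <- beta + drop(u)
    if (sum(u^2) < tol) break               # convergence criterion
  }
  list(beta = beta, rss = sum((y - eta(beta))^2), iters = k)
}

# The logistic growth mean function and its analytic Jacobian
eta_fun <- function(b) b[1] / (1 + exp(b[2] + b[3] * x))
jac_fun <- function(b) {
  e <- exp(b[2] + b[3] * x); den <- (1 + e)^2
  cbind(1 / (1 + e), -b[1] * e / den, -b[1] * x * e / den)
}
# gauss_newton(pop, eta_fun, jac_fun, beta0 = c(250, 3, -0.3))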
Summarising: suppose we consider $\eta(\beta^{(0)})$ as the origin of space (i.e., the tangent space is now a subspace). Then $y - \eta(\beta^{(0)})$ is approximately the response obtained when adding $\varepsilon$ to an element $D^{(0)} u^{(0)}$ of $T_{\beta^{(0)}}\mathcal{M}(\eta)$. So, approximately, we have our usual linear problem, and we can use orthogonal projection to solve it. This amounts to approximating the manifold $\mathcal{M}(\eta)$ locally around $\eta(\beta^{(0)})$ by a plane $T_{\beta^{(0)}}\mathcal{M}(\eta)$. Once the initial value $\beta^{(0)}$ is updated to $\beta^{(1)}$, use a new tangent plane approximation and repeat the whole procedure. But how do we obtain our initial $\beta^{(0)}$?

Choosing $\beta^{(0)}$

Successful linearisation depends on a good initial value. Occasionally, one can find initial values by inspection in simple problems. More generally, it takes some experimentation. For example, one can try fitting polynomial models to the data, use these to find fitted values at fixed design points, and solve a system of equations to get initial values.
Example: consider the model $y_j = \theta_0 + \theta_1 \exp\{-x_j/\phi\} + \varepsilon_j$.
1. Fit a polynomial regression to the data.
2. Find fitted values $\tilde{y}_0, \tilde{y}_1, \tilde{y}_2$ at $x_0$, $x_0 + \delta$, $x_0 + 2\delta$.
3. Equate fitted values with the model expectation: $\tilde{y}_k = \theta_0 + \theta_1\exp\{-(x_0 + k\delta)/\phi\}$, $k = 0, 1, 2$.
4. The system yields the initial estimate $\phi^{(0)} = \delta / \log\left[(\tilde{y}_0 - \tilde{y}_1)/(\tilde{y}_1 - \tilde{y}_2)\right]$.
5. Get initial values for $\theta_0, \theta_1$ by linear regression, once $\phi^{(0)}$ is at hand.

Approximate CIs for Parameters

Under smoothness conditions on $\eta$, one can in general prove that
$$S^{-1}\left\{\nabla\eta(\hat\beta)^\top \nabla\eta(\hat\beta)\right\}^{1/2}(\hat\beta - \beta) \overset{d}{\approx} N_p(0, I_p)$$
for large $n$, where $S^2 = (n - p)^{-1}\|e\|^2$. We may thus mimic the linear case:
$$\frac{c^\top\hat\beta - c^\top\beta}{\sqrt{S^2\, c^\top\left\{\nabla\eta(\hat\beta)^\top\nabla\eta(\hat\beta)\right\}^{-1} c}} \overset{d}{\approx} N(0, 1),$$
which gives a $(1-\alpha)100\%$ CI:
$$c^\top\hat\beta \mp z_{\alpha/2}\sqrt{S^2\, c^\top\left\{\nabla\eta(\hat\beta)^\top\nabla\eta(\hat\beta)\right\}^{-1} c}.$$
So base confidence intervals (and tests) on $c^\top\hat\beta \overset{d}{\approx} N_1\!\left(c^\top\beta,\; S^2 c^\top\{\nabla\eta(\hat\beta)^\top\nabla\eta(\hat\beta)\}^{-1}c\right)$.

A More Flexible Regression Model

Until today we have discussed the following setup:
$$Y_i \mid x_i \overset{ind}{\sim} \mathrm{Dist}[y \mid \theta_i], \qquad \theta_i = g(x_i; \beta), \qquad \beta \in B \subseteq \mathbb{R}^p, \quad x_i \in \mathbb{R},$$
with $g(\cdot\,;\cdot)$ known up to $\beta$, to be estimated from data, e.g.
  $\mathrm{Dist}(\cdot\mid\theta) = N(\cdot\mid\theta)$ and $\theta = g(x\mid\beta) = x^\top\beta$,
  $\mathrm{Dist}(\cdot\mid\theta) = N(\cdot\mid\theta)$ and $\theta = g(x\mid\beta) = \eta(x, \beta)$.
We would now like to extend the model to a more flexible dependence:
$$Y_i \mid x_i \overset{ind}{\sim} \mathrm{Dist}[y \mid \theta_i], \qquad \theta_i = g(x_i), \qquad g \in \mathcal{F} \subseteq L^2(\mathbb{R}) \ \text{(say)},$$
with $g$ unknown, to be estimated given data $\{(y_i, x_i)\}_{i=1}^n$. This is a nonparametric problem (the parameter is infinite-dimensional)! $\mathcal{F}$ is usually assumed to be a class of smooth functions. How do we estimate $g$ in this context?

Scatterplot Smoothing

Start from the simplest problem, $\mathrm{Dist} = N(\cdot, \sigma^2)$:
$$Y_i = g(x_i) + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$
[Figure: Motorcycle accident data — head acceleration (g) against time after impact (ms).]

Exploiting Smoothness

Ideally we would have multiple $y$'s at each $x_i$ ($n \to \infty$ and large covariate classes); then we could average the $y$'s at each $x_i$ and interpolate . . . But this is never the case . . . Usually the $x_i$ are distinct, with a unique $y$ at each $x_i$.
[Figure: idealised case with multiple responses at each $x$.]
Here is where the smoothness assumption comes in. Since we have a unique $y$ at each $x_i$, we need to borrow information from nearby . . . use continuity (or, even better, smoothness)!
Recall: a function $g : \mathbb{R} \to \mathbb{R}$ is continuous at $x_0$ if
$$\forall\, \epsilon > 0\ \exists\, \delta > 0 : |x - x_0| < \delta \implies |g(x) - g(x_0)| < \epsilon.$$
So maybe average the $y_i$'s corresponding to $x_i$'s in a $\delta$-neighbourhood of $x_0$ as $\hat{g}(x_0)$? This motivates the use of a kernel smoother . . .
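A two-line R sketch of this naive $\delta$-neighbourhood average (our own illustration), before refining it into a kernel smoother below.

# Naive local averaging: mean of the y_i whose x_i lie within delta of each x0
naive_smooth <- function(x0, x, y, delta) {
  sapply(x0, function(z) mean(y[abs(x - z) <= delta]))
}
# e.g. with the motorcycle data (available as mcycle in the MASS package):
# naive_smooth(seq(5, 55, by = 1), mcycle$times, mcycle$accel, delta = 2)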
Kernel Smoothing

Naive idea: $\hat{g}(x_0)$ should be the average of the $y_i$-values with $x_i$'s "close" to $x_0$,
$$\hat{g}(x_0) = \frac{1}{\sum_{i=1}^n \mathbf{1}\{|x_i - x_0| \le \lambda\}} \sum_{i=1}^n y_i\, \mathbf{1}\{|x_i - x_0| \le \lambda\}.$$
A weighted average! Choose other weights? Kernel estimator:
$$\hat{g}(x_0) = \frac{1}{\sum_{i=1}^n K\left(\frac{x_i - x_0}{\lambda}\right)} \sum_{i=1}^n y_i\, K\left(\frac{x_i - x_0}{\lambda}\right).$$
  $K$ is a weight function (kernel), e.g. a pdf — usually symmetric, non-negative, and decreasing away from zero.
  $\lambda$ is the bandwidth parameter: small $\lambda$ gives local behaviour, large $\lambda$ gives global behaviour.
  The choice of $K$ is not so important; the choice of $\lambda$ is very important!
The resulting fitted values are linear in the responses, i.e. $\hat{y} = S_\lambda y$, where the smoothing matrix $S_\lambda$ depends on $x_1, \ldots, x_n$, $K$ and $\lambda$. It is analogous to a projection matrix in linear regression, but $S_\lambda$ is NOT a projection.

Visualising a Kernel at Work

[Figure: a moving kernel window producing the local weighted average on the motorcycle data.]

Motorcycle Data Kernel Smooth

plot(time,accel,xlab="Time After Impact (ms)",ylab="Head Accelaration (g)")
lines(ksmooth(time,accel,kernel="normal",bandwidth=0.7))
lines(ksmooth(time,accel,kernel="normal",bandwidth=5),col="red")
lines(ksmooth(time,accel,kernel="normal",bandwidth=10),col="blue")
(Data and code: http://smat.epfl.ch/courses.html)
[Figure: motorcycle data with kernel smooths for three bandwidths.]

Penalised Likelihood

Find $g \in C^2$ that minimises
$$\underbrace{\sum_{i=1}^n \{y_i - g(x_i)\}^2}_{\text{Fit Penalty}} + \lambda \underbrace{\int_I \{g''(t)\}^2\, dt}_{\text{Roughness Penalty}}.$$
This is a Gaussian likelihood with a roughness penalty: if we used only the likelihood, any interpolating function would be an MLE! The parameter $\lambda$ balances fidelity to the data against smoothness of the estimated $g$. Remarkably, the problem has a unique explicit solution: a Natural Cubic Spline with knots at $\{x_i\}_{i=1}^n$, i.e.
  piecewise polynomials of degree 3, with pieces defined at the knots,
  with two continuous derivatives at the knots,
  and linear outside the data boundary.

Natural Cubic Spline Details

We can represent such splines via the natural spline basis functions $B_j$, as
$$s(t) = \sum_{j=1}^n \gamma_j B_j(t),$$
where
$$B_1(t) = 1, \qquad B_2(t) = t, \qquad B_{m+2}(t) = \delta_m(t) - \delta_{n-1}(t), \qquad \delta_m(t) = \frac{(t - x_m)_+^3 - (t - x_n)_+^3}{x_n - x_m},$$
with $x_m$ the knot locations and $(\cdot)_+ = \max\{\cdot, 0\}$ the positive part of any function.
[Figure: the $n = 4$ natural spline basis functions for knots at $x_1 = 0.2$, $x_2 = 0.4$, $x_3 = 0.6$ and $x_4 = 0.8$.]

Where does this come from?

We wish to find a basis for natural cubic splines with knot locations $\{x_i\}_{i=1}^n$. Observe that any piecewise polynomial $PP_3(t)$ of order 3 with two continuous derivatives at the knots can be expanded in the truncated power series basis
$$PP_3(t) = \sum_{j=0}^3 \beta_j t^j + \sum_{i=1}^n \theta_i (t - x_i)_+^3.$$
The $n + 4$ coefficients $\{\beta_j\}_{j=0}^3 \cup \{\theta_i\}_{i=1}^n$ must satisfy constraints to ensure linearity beyond the boundary knots:
$$\beta_2 = 0, \qquad \beta_3 = 0, \qquad \sum_{i=1}^n \theta_i = 0, \qquad \sum_{i=1}^n \theta_i x_i = 0.$$
We can then use these relations to re-express the basis in the form on the previous slide, with only $n$ (rather than $n + 4$) basis functions, and unconstrained coefficients.

Back to Penalised Likelihood

Letting
$$g(t) = \sum_{i=1}^n \gamma_i B_i(t), \qquad \gamma = (\gamma_1, \ldots, \gamma_n)^\top, \qquad B = \{B_{ij}\} = \{B_j(x_i)\}, \qquad \Omega_{ij} = \int B_i''(t) B_j''(t)\, dt,$$
our penalised likelihood
$$\sum_{i=1}^n \{y_i - g(x_i)\}^2 + \lambda \int_I \{g''(t)\}^2\, dt$$
becomes
$$(y - B\gamma)^\top(y - B\gamma) + \lambda\, \gamma^\top \Omega\, \gamma.$$
Differentiating and equating to zero yields
$$(B^\top B + \lambda\Omega)\hat\gamma = B^\top y \implies \hat\gamma = (B^\top B + \lambda\Omega)^{-1} B^\top y.$$
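In code the solution is a single linear solve. The sketch below only mirrors that formula; it assumes the basis matrix B, the penalty matrix Omega, the responses y and a value of lambda have already been computed — in practice smooth.spline() does all of this for you.

# Penalised least squares in a spline basis (sketch; B, Omega, y, lambda assumed available)
# B[i, j] = B_j(x_i);  Omega[i, j] = integral of B_i'' B_j''
gamma_hat <- solve(t(B) %*% B + lambda * Omega, t(B) %*% y)  # (B'B + lambda*Omega)^{-1} B'y
fitted    <- B %*% gamma_hat                                 # fitted values at the design points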
The smoothing matrix is $S_\lambda = B(B^\top B + \lambda\Omega)^{-1} B^\top$. The natural cubic spline fit is approximately a kernel smoother.

Motorcycle Example: Cubic Spline Fit

lines(smooth.spline(time,accel),col="red")
[Figure: cubic spline fit to the motorcycle data; residuals against time and normal Q–Q plot of the residuals.]

Equivalent degrees of freedom

Least squares estimation: for $y = X_{n\times p}\beta + \varepsilon$ we have $\hat{y} = Hy$, with $\mathrm{trace}(H) = p$, in terms of the projection matrix $H = X(X^\top X)^{-1}X^\top$. Here
$$\hat{y} = \underbrace{B(B^\top B + \lambda\Omega)^{-1}B^\top}_{S_\lambda}\; y.$$
Idea: define the equivalent degrees of freedom of the smoother $S_\lambda$ as
$$\mathrm{trace}(S_\lambda) = \sum_{j=1}^n \frac{1}{1 + \lambda\lambda_j},$$
where the $\lambda_j$ are the eigenvalues of $K = (B^\top B)^{-1/2}\,\Omega\,(B^\top B)^{-1/2}$. Hence $\mathrm{trace}(S_\lambda)$ is monotone decreasing in $\lambda$, with $\mathrm{trace}(S_\lambda) \to 2$ as $\lambda \to \infty$ ($K$ has two zero eigenvalues) and $\mathrm{trace}(S_\lambda) \to n$ as $\lambda \to 0$. Each eigenvalue of $S_\lambda$ lies in $(0, 1]$, so this is a smoothing, NOT a projection, matrix. Note the 1–1 map $\lambda \leftrightarrow \mathrm{trace}(S_\lambda) =$ df, so we usually determine the roughness using df (interpretation is easier).

Bias/Variance Tradeoff

Focus on the fit at the given grid $x_1, \ldots, x_n$:
$$\hat{g} = (\hat{g}(x_1), \ldots, \hat{g}(x_n)), \qquad g = (g(x_1), \ldots, g(x_n)).$$
Consider the mean squared error:
$$\mathbb{E}\|\hat{g} - g\|^2 = \underbrace{\mathbb{E}\{\|\hat{g} - \mathbb{E}(\hat{g})\|^2\}}_{\text{variance}} + \underbrace{\|g - \mathbb{E}(\hat{g})\|^2}_{\text{bias}^2}.$$
When the estimator is potentially biased, we need to worry about both! In the case of a linear smoother, for which $\hat{g} = S_\lambda y$, we find that
$$\mathbb{E}\|\hat{g} - g\|^2 = \mathrm{trace}(S_\lambda S_\lambda^\top)\,\sigma^2 + (g - S_\lambda g)^\top(g - S_\lambda g),$$
so
$$\lambda \uparrow \implies \text{variance} \downarrow \text{ but bias} \uparrow, \qquad \lambda \downarrow \implies \text{bias} \downarrow \text{ but variance} \uparrow.$$
We would like to choose $\lambda$ to find the optimal bias–variance tradeoff. Unfortunately, the optimal $\lambda$ will generally depend on the unknown $g$!

Cross-Validation

Fitted values are $\hat{y} = S_\lambda y$. The fitted value $\hat{y}_j^{(-j)}$ obtained when $y_j$ is dropped from the fit satisfies $S_{jj}(\lambda)(y_j - \hat{y}_j^{(-j)}) = \hat{y}_j - \hat{y}_j^{(-j)}$, where $S_{jj}(\lambda)$ is the $(j, j)$ element of $S_\lambda$. The cross-validation sum of squares is
$$\mathrm{CV}(\lambda) = \sum_{j=1}^n \left(y_j - \hat{y}_j^{(-j)}\right)^2 = \sum_{j=1}^n \left(\frac{y_j - \hat{y}_j}{1 - S_{jj}(\lambda)}\right)^2,$$
and the generalised cross-validation sum of squares is
$$\mathrm{GCV}(\lambda) = \sum_{j=1}^n \left(\frac{y_j - \hat{y}_j}{1 - \mathrm{trace}(S_\lambda)/n}\right)^2.$$

Orthogonal Series: "Parametrising" The Problem

Depending on what $\mathcal{F}$ is (a Hilbert space), we can write
$$g(\cdot) = \sum_{k=1}^\infty \beta_k \phi_k(\cdot) \quad \text{(in an appropriate sense)},$$
with $\{\phi_k\}_{k=1}^\infty$ known (orthogonal) basis functions for $\mathcal{F}$, e.g. $\mathcal{F} = L^2(-\pi, \pi)$, $\{\phi_k\} = \{e^{ikx}\}_{k\in\mathbb{Z}}$, $\phi_i \perp \phi_j$, $i \neq j$. This gives the Fourier series expansion, $\beta_k = \frac{1}{2\pi}\int g(x) e^{-ikx}\, dx$.
Idea: if we truncate the series, then we have a simple linear regression!
$$Y_i = \sum_{k=1}^{\lambda} \beta_k \phi_k(x_i) + \varepsilon_i, \qquad \lambda < \infty.$$
Notice: truncation has implications; e.g. in the Fourier case, truncating implies assuming $g \in \mathcal{G} \subset L^2$. Interpret this as a smoothness assumption on $g$. How to choose $\lambda$ optimally?

Convolution: Series Truncation $\simeq$ Smoothing

An easy exercise in Fourier analysis:
$$\sum_{k=-\lambda}^{\lambda} \beta_k e^{ikx} = \frac{1}{2\pi}\int g(y)\, D_\lambda(x - y)\, dy,$$
with the Dirichlet kernel of order $\lambda$, $D_\lambda(u) = \sin\{(\lambda + 1/2)u\}/\sin(u/2)$.
Recall the kernel smoother:
$$\hat{g}(x_0) = \frac{\sum_{i=1}^n y_i\, K\left(\frac{x_i - x_0}{\lambda}\right)}{\sum_{i=1}^n K\left(\frac{x_i - x_0}{\lambda}\right)} = c \int_I y(x)\, K(x - x_0)\, dx, \qquad \text{with } y(x) = \sum_{i=1}^n y_i\, \delta(x - x_i).$$
So if $K$ is the Dirichlet kernel, we can do series approximation via kernel smoothing. This works for other series expansions with other kernels too (e.g., Fourier with convergence factors).
[Figure: the Dirichlet kernel.]
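A small illustration (ours, on simulated data) of the "truncate and regress" idea: fit the first few Fourier basis functions by ordinary least squares with lm().

# Truncated Fourier series fit by ordinary linear regression (illustrative sketch)
set.seed(2)
x <- seq(-pi, pi, length.out = 200)
y <- sin(x) + 0.5 * cos(3 * x) + rnorm(length(x), sd = 0.3)   # smooth signal plus noise
K <- 4                                                        # truncation level lambda
Phi <- do.call(cbind, lapply(1:K, function(k) cbind(cos(k * x), sin(k * x))))
fit <- lm(y ~ Phi)                                            # the intercept plays the role of beta_0
plot(x, y); lines(x, fitted(fit), col = "red")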
From $x \in \mathbb{R}$ to $(x_1, \ldots, x_p) \in \mathbb{R}^p$

So far: how to estimate $g : \mathbb{R} \to \mathbb{R}$ (assumed smooth) in
$$Y_i = g(x_i) + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$
given data $\{(y_i, x_i)\}_{i=1}^n$. How do we generalise to include multivariate explanatories? The "immediate" generalisation is $g : \mathbb{R}^p \to \mathbb{R}$ (smooth),
$$Y_j = g(x_{j1}, \ldots, x_{jp}) + \varepsilon_j, \qquad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2),$$
with estimation by (e.g.) a multivariate kernel method. There are two basic drawbacks to this approach:
  the shape of the kernel (the definition of "local");
  the curse of dimensionality.

What is "local" in $\mathbb{R}^p$?

We need some definition of "local" in the space of explanatories — some metric on $\mathbb{R}^p \ni (x_1, \ldots, x_p)$. But which one? The choice of metric $\iff$ a choice of geometry (e.g., curvature reflects the intertwining of dimensions). The geometry should reflect structure in the explanatories:
  potentially different units of measurement (variable stretching of space);
  $g$ may be of higher variation in some dimensions (need finer neighbourhoods there);
  statistical dependencies present in the explanatories ("local" should reflect these).

Curse of Dimensionality

"Neighbourhoods with a fixed number of points become less local as the dimensions increase" — Bellman (1961).
The notion of "local" in terms of a percentage of the data fails in high dimensions: there is too much space! Hence, to allow for reasonably small bandwidths, the density of sampling must increase — we need ever larger samples as the dimension grows.
[Figure: side length of a cube in $[0,1]^p$ needed to capture a given average percentage of a uniform $U[0,1]^p$ sample, for $p = 1, 2, 5, 10$.]

Tackling the Dimensionality Issue

Attempt to find a link/compromise between:
  our mastery of the 1D case (at least we can do that well . . . ),
  and higher-dimensional explanatories (and the associated difficulties).
One approach: Projection-Pursuit Regression,
$$Y = \sum_{k=1}^K h_k(\vartheta_k^\top \mathbf{x}) + \varepsilon, \qquad \|\vartheta_k\| = 1, \qquad \varepsilon \sim N(0, \sigma^2).$$
This additively decomposes $g$ into smooth functions $h_k : \mathbb{R} \to \mathbb{R}$. Each function depends on a global feature, a linear combination of the explanatories; each $h_k$ is a ridge function of $\mathbf{x}$, varying only in the direction defined by $\vartheta_k$. The projection directions are chosen for best fit (similarities to tomography).

Projection Pursuit Regression

How is the model fitted to data? Assume only one term, $K = 1$, and consider the penalised likelihood
$$\min_{h_1 \in C^2,\ \|\vartheta\| = 1} \left\{ \sum_{i=1}^n \{y_i - h_1([\vartheta^\top \mathbf{x}]_i)\}^2 + \lambda \int_I \{h_1''(t)\}^2\, dt \right\}.$$
Two steps:
  Smooth: given a direction $\vartheta$, fitting $h_1(\vartheta^\top \mathbf{x})$ is done via 1D spline smoothing.
  Pursue: given $h_1$, we have a non-linear regression problem with respect to $\vartheta$.
Hence, iterate between the two steps. A complication is that $h_1$ is not explicitly known, so we need numerical derivatives; the procedure is computationally intensive (impractical in the '80s). Further terms are added in a forward stepwise manner.
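R's ppr() function (in the stats package) implements a forward-stepwise projection pursuit fit of this kind; below is a small sketch on simulated ridge-function data (the simulation and settings are ours).

# Projection pursuit regression with stats::ppr on simulated ridge-function data
set.seed(3)
n <- 400
X <- matrix(rnorm(n * 3), n, 3)
y <- drop(sin(X %*% c(0.8, 0.6, 0)) + 0.5 * (X %*% c(0, 0, 1))^2) + rnorm(n, sd = 0.2)
dat <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2], x3 = X[, 3])
fit <- ppr(y ~ x1 + x2 + x3, data = dat, nterms = 2)   # K = 2 ridge terms
fit$alpha                                              # estimated projection directions
plot(fit)                                              # estimated ridge functions h_k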
Additive Models

Projection pursuit:
  (+) can uniformly approximate continuous functions on compact subsets of $\mathbb{R}^p$ arbitrarily well as $K \to \infty$ (very useful for prediction);
  (−) interpretability? What do the terms $h_k(\vartheta_k^\top \mathbf{x})$ mean within the problem?
We need something that can be interpreted variable-by-variable. Compromise: the Additive Model
$$Y_j = \mu + \sum_{k=1}^p f_k(x_{jk}) + \varepsilon_j, \qquad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2),$$
with the $f_k$'s univariate smooth functions and $\sum_j f_k(x_{jk}) = 0$. In our standard setting, we have
$$Y_j \mid \tilde{x}_j \overset{ind}{\sim} \mathrm{Dist}(y \mid \theta_j) = N(\theta_j, \sigma^2), \qquad \theta_j = \mu + \sum_{k=1}^p f_k(x_{jk}).$$

The Backfitting Algorithm

How do we fit an additive model? We know how to fit each $f_k$ separately quite well — take advantage of this . . .
Motivation: fix $k$ and drop the subscript $j$ for ease; then
$$\mathbb{E}\left[\,Y - \mu - \sum_{m \neq k} f_m(x_m)\,\middle|\, x_k\right] = f_k(x_k).$$
This suggests the Backfitting Algorithm:
(1) Initialise: $\mu = \mathrm{ave}\{y_j\}$, $f_k = f_k^0$, $k = 1, \ldots, p$.
(2) Cycle: $f_k = S_k\left(y - \mu - \sum_{m \neq k} f_m\right)$, $k = 1, \ldots, p, 1, \ldots, p, \ldots$
(3) Stop when the individual functions no longer change.
Here $S_k$ is an arbitrary scatterplot smoother.

Example: Diabetes Data

[Figure: additive model fit for the diabetes data — C-peptide against age and base deficit, with estimated smooth terms s(age, 2.29) and s(base, 1.87).]

Example: Rock Permeability Data

Measurements on 48 rock samples from a petroleum reservoir:
rock.gam<-gam(perm~1+s(peri)+s(area),family=gaussian)

Family: gaussian
Link function: identity
Formula: perm ~ 1 + s(peri) + s(area)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   415.45      27.18   15.29   <2e-16 ***
Approximate significance of smooth terms:
          edf Est.rank      F  p-value
s(peri) 8.739        9 18.286 9.49e-11 ***
s(area) 3.357        7  6.364 7.41e-05 ***
R-sq.(adj) = 0.815   Deviance explained = 86.3%
[Figure: estimated smooth terms s(peri, 8.74) and s(area, 3.36).]

Where does the cubic spline magic come from??

To show the result, we will first consider the problem of Smooth Interpolation (with boundary constraints). Given design points $0 < x_1 < \ldots < x_n < 1$ and corresponding response values $\{y_i\}_{i=1}^n$, find a function
$$h \in \mathcal{H} := \left\{ g : [0,1] \to \mathbb{R} \;:\; \int_0^1 (g''(u))^2\,du < \infty,\; g(0) = g'(0) = 0 \right\}$$
such that:
  $h$ interpolates the points $\{(x_i, y_i)\}_{i=1}^n$, i.e. $h(x_i) = y_i$, $i = 1, \ldots, n$;
  $h$ is smoothest (has minimal curvature) among all $\mathcal{H}$-interpolants,
$$\underbrace{\int_0^1 \left(h''(x)\right)^2 dx}_{\mathcal{C}(h)} \;\le\; \underbrace{\int_0^1 \left(f''(x)\right)^2 dx}_{\mathcal{C}(f)} \qquad \forall\, f \in \mathcal{H} \text{ which interpolate the points}.$$

A 'magic' function

Our analysis will hinge on a seemingly magical function (which is in fact the covariance of integrated Brownian motion):
$$k(x, y) = \left(\frac{x^2 y}{2} - \frac{x^3}{6}\right)\mathbf{1}\{0 \le x < y\} + \left(\frac{xy^2}{2} - \frac{y^3}{6}\right)\mathbf{1}\{y \le x \le 1\}.$$
[Figure: the functions $x \mapsto k(x, 1/4)$, $x \mapsto k(x, 1/2)$, $x \mapsto k(x, 3/4)$.]

Now write $k_{x_0}(x) = k(x, x_0)$ and notice that, for any $f \in C^2$,
$$\int_0^1 k_{x_0}''(u)\, f''(u)\, du = \int_0^{x_0} (x_0 - u)\, f''(u)\, du + \int_{x_0}^1 0\, du = f(x_0) - f(0) - f'(0)x_0,$$
the second equality by Taylor's formula with integral remainder,
$$f(x) = f(0) + f'(0)x + \int_0^x f''(u)(x - u)\, du.$$
So if $h(0) = h'(0) = 0$, we can evaluate $h$ at $x_0$ by integrating it against $k_{x_0}''$:
$$\int_0^1 k_{x_0}''(u)\, h''(u)\, du = h(x_0).$$
Note that $k_y(x)$ is itself $C^2$ and such that $k_y(0) = k_y'(0) = 0$, so that
$$\int_0^1 k_y''(u)\, k_x''(u)\, du = k_y(x) = k(x, y).$$
We say that $k$ reproduces itself.
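A quick numerical sanity check of these two properties (our own, using the fact that $k_{x_0}''(u) = (x_0 - u)_+$ on $[0,1]$): with $f(x) = x^3$ (which satisfies $f(0) = f'(0) = 0$), the first integral should return $f(x_0)$, and the second should return $k(0.3, 0.7)$.

# Numerical check of the evaluation and reproducing properties of the 'magic' kernel
k2 <- function(u, x0) pmax(x0 - u, 0)                  # second derivative of k_{x0}
f  <- function(x) x^3; f2 <- function(x) 6 * x         # f with f(0) = f'(0) = 0, and f''
x0 <- 0.7
integrate(function(u) k2(u, x0) * f2(u), 0, 1)$value   # evaluation: equals f(0.7) = 0.343
integrate(function(u) k2(u, 0.3) * k2(u, 0.7), 0, 1)$value
0.3^2 * 0.7 / 2 - 0.3^3 / 6                            # reproducing: equals k(0.3, 0.7) = 0.027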
Theorem ($k$ is a non-negative kernel)
Given any $\{x_i\}_{i=1}^n \subset (0, 1)$ we have
$$\sum_{i=1}^n \sum_{j=1}^n \theta_i \theta_j\, k(x_i, x_j) \ge 0 \qquad \forall\, (\theta_1, \ldots, \theta_n)^\top \in \mathbb{R}^n;$$
in other words, $K = \{k(x_i, x_j)\}_{i,j=1}^n$ is nonnegative. If, furthermore, the $\{x_i\}$ are distinct, we have strict inequality in the display, and so $K$ is strictly positive.

Proof. By $k$ reproducing itself we can see that
$$\sum_{i=1}^n \sum_{j=1}^n \theta_i \theta_j\, k(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n \theta_i \theta_j \int_0^1 k_{x_i}''(u)\, k_{x_j}''(u)\, du = \int_0^1 \left( \sum_{i=1}^n \theta_i k_{x_i}''(u) \right)^2 du \;\ge\; 0.$$
Each function $k_{x_i}''$ is a linear function supported on $[0, x_i)$; hence, if the $\{x_i\}$ are distinct, these supports are distinct and the $k_{x_j}''$ are linearly independent. Consequently the sum can be zero only if $\theta_1 = \theta_2 = \cdots = \theta_n = 0$. $\square$

Step 2: Guessing a Feasible Candidate Solution

We guess the form
$$h(x) = \sum_{i=1}^n \theta_i\, k_{x_i}(x),$$
where the $\theta_i$ are chosen so that the interpolation property is valid, $\sum_{j=1}^n \theta_j\, k(x_j, x_i) = y_i$, $i = 1, \ldots, n$, by solving a linear system of $n$ equations in $n$ unknowns, $y = K\theta$, where
$$y = (y_1, \ldots, y_n)^\top, \qquad K = \{k(x_j, x_i)\}_{i,j=1}^n, \qquad \theta = (\theta_1, \ldots, \theta_n)^\top.$$
The solution $\theta = K^{-1} y$ exists uniquely, because $K \succ 0$ since $k$ is a nonnegative kernel and the $x_i$ are distinct.
Check that $h$ is feasible: we can directly verify that $h \in C^2$ and $h(0) = h'(0) = 0$, so $h \in \mathcal{H}$. By the reproducing kernel's evaluation property and the interpolation constraints, $h$ also interpolates:
$$h(x_i) = \sum_{j=1}^n \theta_j \int_0^1 k_{x_i}''(u)\, k_{x_j}''(u)\, du = \sum_{j=1}^n \theta_j\, k(x_j, x_i) = y_i.$$

Step 3: Showing Optimality

Consider another interpolant $\tilde{h} \in \mathcal{H}$. Letting $\Delta = \tilde{h} - h$, we have $\Delta \in \mathcal{H}$ and
$$\mathcal{C}(\tilde{h}) = \int (\tilde{h}'')^2 = \int (h'' + \Delta'')^2 = \int (h'')^2 + \int (\Delta'')^2 + 2\int h''\Delta''.$$
By the evaluation property,
$$\int h''\Delta'' = \int \Delta'' \sum_{i=1}^n \theta_i k_{x_i}'' = \sum_{i=1}^n \theta_i \int \Delta''\, k_{x_i}'' = \sum_{i=1}^n \theta_i \Delta(x_i) = \sum_i \theta_i\{\tilde{h}(x_i) - h(x_i)\} = 0,$$
since both interpolate. Therefore
$$\mathcal{C}(\tilde{h}) = \int (h'')^2 + \int (\Delta'')^2 \;\ge\; \int (h'')^2 = \mathcal{C}(h),$$
showing optimality. The inequality is strict unless $\int(\Delta'')^2 = 0$, in which case $\Delta'' \equiv 0$. Taylor's formula (with integral remainder) then says
$$\Delta(x) = \Delta(0) + \Delta'(0)x + \int_0^x \Delta''(u)(x - u)\, du = \{\tilde{h}(0) - h(0)\} + \{\tilde{h}'(0) - h'(0)\}x + 0 = 0,$$
establishing uniqueness of the optimum. $\square$

Recap

Theorem (Smooth Interpolation)
Given design points $0 < x_1 < \ldots < x_n < 1$ and corresponding response values $\{y_i\}_{i=1}^n$, there exists a function of the form
$$h(x) = \sum_{i=1}^n \theta_i\, k_{x_i}(x)$$
that uniquely minimises the curvature functional
$$\mathcal{C}(g) = \int_0^1 (g''(u))^2\, du$$
among all $g \in \mathcal{H} = \left\{g : [0,1] \to \mathbb{R} : \int_0^1 (g''(u))^2 < \infty,\; g(0) = g'(0) = 0\right\}$ that interpolate $\{(x_i, y_i)\}_{i=1}^n$. The coefficients $\theta = (\theta_1, \ldots, \theta_n)^\top$ are uniquely given by $\theta = K^{-1}y$, where $y = (y_1, \ldots, y_n)^\top$ and $K = \{k(x_j, x_i)\}_{i,j=1}^n$.
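The interpolant is easy to compute — a short sketch of our own that builds K, solves $K\theta = y$, and evaluates $h$ on arbitrary points (the design points and responses below are made up).

# Smooth interpolation via the 'magic' kernel: solve K theta = y and evaluate h
kfun <- function(x, y) {                             # k(x, y) as defined on the slides
  ifelse(x < y, x^2 * y / 2 - x^3 / 6, x * y^2 / 2 - y^3 / 6)
}
x <- c(0.15, 0.35, 0.55, 0.8)                        # distinct design points in (0, 1)
y <- c(0.3, -0.2, 0.4, 0.1)
K <- outer(x, x, kfun)                               # K_{ij} = k(x_i, x_j)
theta <- solve(K, y)                                 # interpolation constraints y = K theta
h <- function(t) drop(outer(t, x, kfun) %*% theta)   # h(t) = sum_i theta_i k(t, x_i)
round(h(x) - y, 10)                                  # interpolates: numerically zero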
Smoothing vs Interpolation

Interpolating would be overfitting; we want to smooth: find $g \in C^2$ minimising
$$\underbrace{\sum_{i=1}^n \{y_i - g(x_i)\}^2}_{\text{Fit Penalty}} + \lambda \underbrace{\int_{x_1}^{x_n} \{g''(t)\}^2\, dt}_{\text{Roughness Penalty}}$$
for design points $x_1 < \ldots < x_n$ and responses $\{y_i\}_{i=1}^n$. Note that we can always re-scale the $x$-axis, so without loss of generality we may assume $0 < x_1 < \ldots < x_n < 1$. We proceed in two steps:
1. Split the candidate function into a linear part plus an $\mathcal{H}$-part.
2. Use smooth interpolation as a technical device.

Smoothing vs Interpolation, Step 1: Linear Plus $\mathcal{H}$

Any candidate $g \in C^2$ can be decomposed into two components by Taylor's formula with integral remainder,
$$g(x) = g(0) + g'(0)x + \int_0^x g''(u)(x - u)\, du = \alpha + \beta x + h(x),$$
where $h \in \mathcal{H} = \left\{g : [0,1] \to \mathbb{R} : \int_0^1 (g''(u))^2\, du < \infty,\; g(0) = g'(0) = 0\right\}$: indeed $h$ is obviously $C^2$ and can be verified to satisfy $h(0) = 0$ and $h'(0) = 0$. So we may re-write the problem as minimising
$$\sum_{i=1}^n \{y_i - (\alpha + \beta x_i + h(x_i))\}^2 + \lambda \int_0^1 \{h''(t)\}^2\, dt$$
over all $(\alpha, \beta, h) \in \mathbb{R} \times \mathbb{R} \times \mathcal{H}$. Note that the penalty term, $\int_0^1 \{g''\}^2 = \int_0^1 \{h''\}^2$, is indifferent to $(\alpha, \beta)$.

Smoothing vs Interpolation, Step 2: Guesswork and Leveraging Interpolation

Given our smooth interpolation theorem, we guess that the optimum has the form
$$s(x) = \alpha + \beta x + \sum_{j=1}^n \theta_j\, k_{x_j}(x).$$
To show this, let $g(x)$ be any $C^2$ function, so that $g(x) = \alpha_1 + \beta_1 x + h(x)$, with $\alpha_1, \beta_1 \in \mathbb{R}$ and $h \in \mathcal{H}$, and define $z_i = h(x_i)$. Letting $f$ be the smoothest interpolant of the $(x_i, z_i)$ among $\mathcal{H}$-functions, we have
$$\sum_{i=1}^n \{y_i - \alpha_1 - \beta_1 x_i - \underbrace{h(x_i)}_{=z_i}\}^2 + \lambda\int_0^1 \{h''\}^2 \;\ge\; \sum_{i=1}^n \{y_i - \alpha_1 - \beta_1 x_i - \underbrace{f(x_i)}_{=z_i}\}^2 + \lambda\int_0^1 \{f''\}^2,$$
where the inequality is strict unless $g$ is itself of the form $s(x)$ (whence $f = h$, by our smooth interpolation theorem).

And the final step

Observe that $k_y(\cdot) = k(\cdot, y)$ is $C^2$, is a cubic on $[0, y)$ and linear on $[y, 1]$. Hence $\sum_{j=1}^n \theta_j\, k_{x_j}(x)$ is
  a cubic in any subinterval $[x_i, x_{i+1})$,
  $C^2$ at the joins (knots) $x_j$,
  zero left of $x_1$ and linear right of $x_n$.
Therefore $s(x) = \alpha + \beta x + \sum_{j=1}^n \theta_j\, k_{x_j}(x)$ is
  a cubic polynomial between knots,
  $C^2$ at the knots,
  linear outside $[x_1, x_n]$,
i.e. it is a natural cubic spline with knots $\{x_1, \ldots, x_n\}$. Conversely, it turns out that any natural cubic spline with knots $\{x_1, \ldots, x_n\}$ can also be written as $\alpha + \beta x + \sum_{j=1}^n \theta_j\, k_{x_j}(x)$ (we will not prove the latter).

Therefore:
1. The solution $s(x)$ is of the form $\sum_{k=1}^n \gamma_k B_k(x)$.
2. $\displaystyle \theta^\top K\,\theta = \int_0^1 \Big(\sum_{j=1}^n \theta_j\, k_{x_j}''(u)\Big)^2 du = \int_0^1 [s''(u)]^2\, du = \int_0^1 \Big(\sum_{j=1}^n \gamma_j B_j''(u)\Big)^2 du = \gamma^\top \Omega\, \gamma$, where $\Omega_{ij} = \int_0^1 B_i''(t) B_j''(t)\, dt$.
The first equality follows from the reproducing kernel property. Moreover $\Omega \succeq 0$: if there existed a $\gamma$ for which $\gamma^\top\Omega\,\gamma < 0$, then there would exist $(\alpha, \beta, \theta)$ such that $\sum_{k=1}^n \gamma_k B_k(x) = \alpha + \beta x + \sum_{j=1}^n \theta_j\, k_{x_j}(x)$ and $\theta^\top K\theta = \gamma^\top\Omega\,\gamma < 0$, contradicting our proof that $K \succeq 0$.
Theorem (Roughness Penalised Least Squares is Spline Smoothing)
Given covariates $0 < x_1 < \ldots < x_n < 1$ and responses $\{y_i\}_{i=1}^n$, the functional
$$L(g) = \sum_{i=1}^n \left\{y_i - g(x_i)\right\}^2 + \lambda \int_0^1 (g''(u))^2\, du$$
is uniquely minimised over $g \in C^2(0, 1)$ at
$$\hat{g}(x) = \sum_{j=1}^n \hat\gamma_j B_j(x), \qquad \text{with} \qquad (\hat\gamma_1, \ldots, \hat\gamma_n)^\top = \hat\gamma = (B^\top B + \lambda\Omega)^{-1} B^\top y,$$
where
  $B_j(x)$ is the $j$th natural spline basis function with knots $\{x_j\}_{j=1}^n$,
  $y = (y_1, \ldots, y_n)^\top$,
  $B = \{B_j(x_i)\}_{i,j=1}^n$,
  $\Omega = \left\{\int_0^1 B_i''(t) B_j''(t)\, dt\right\}_{i,j=1}^n$ is a nonnegative definite penalty matrix.
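To close the loop computationally, here is a sketch (ours) of the equivalent kernel-basis formulation derived above: since $\int (s'')^2 = \theta^\top K\theta$, the penalised criterion becomes a ridge-type linear system in the coefficients $(\alpha, \beta, \theta)$. In practice one would simply call smooth.spline(), which works in a B-spline basis; this code, with an arbitrary value of lambda and simulated data, is only meant to make the derivation concrete.

# Smoothing spline in the kernel basis s(x) = alpha + beta*x + sum_j theta_j k_{x_j}(x)
kfun <- function(x, y) ifelse(x < y, x^2 * y / 2 - x^3 / 6, x * y^2 / 2 - y^3 / 6)
set.seed(4)
x <- sort(runif(40)); y <- sin(2 * pi * x) + rnorm(40, sd = 0.3)
Kmat   <- outer(x, x, kfun)                 # K_{ij} = k(x_i, x_j)
M      <- cbind(1, x, Kmat)                 # columns: 1, x, k_{x_1}, ..., k_{x_n}
lambda <- 0.001
P      <- matrix(0, ncol(M), ncol(M))
P[-(1:2), -(1:2)] <- Kmat                   # roughness penalty theta' K theta; alpha, beta unpenalised
coefs  <- solve(t(M) %*% M + lambda * P, t(M) %*% y)
plot(x, y); lines(x, drop(M %*% coefs), col = "red")
# compare with: lines(smooth.spline(x, y), col = "blue")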