Contents

1 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3 Mixed Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
  3.1 Example: One way Random Effects Model . . . . . . . . . . . . . . . . 57
  3.2 Example: Two way Mixed Effects Model without Interaction . . . . . . 59
  3.3 Estimation of Parameters . . . . . . . . . . . . . . . . . . . . . . . 59
  3.4 Anova Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Bootstrap Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . 88
  5.1 Members of the Natural Exponential Family . . . . . . . . . . . . . . 88
  5.2 Inference in GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . 90
  5.3 Binomial distribution of response . . . . . . . . . . . . . . . . . . 91
  5.4 Likelihood Ratio Tests (Deviance) . . . . . . . . . . . . . . . . . . 93
6 Model Free Curve Fitting . . . . . . . . . . . . . . . . . . . . . . . . . 96
  6.1 Bin Smoother . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
  6.2 Kernel Smoothers . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

1 Linear Models

Basic linear model structure:

    Y = Xβ + ε,

where

    Y = (y1, y2, ..., yn)'          vector of (observable) random variables,

        ( x11  x12  ...  x1p )
        ( x21  x22  ...  x2p )
    X = (  .    .    .    .  )      n × p matrix of known constants,
        ( xn1  xn2  ...  xnp )

    β = (β1, β2, ..., βp)'          vector of unknown parameters,

    ε = (ε1, ε2, ..., εn)'          vector of (unobservable) random errors.

Almost always the assumption is that Eε = 0, often Var ε = σ²Id; often ε ~ MVN_n (in the presence of Var ε = σ²Id this means that the εi are i.i.d. N(0, σ²)).

Examples

(a) Multiple regression

    yi = α + β1 xi1 + β2 xi2 + εi,    for i = 1, ..., n,

translates to

    ( y1 )   ( 1  x11  x12 )              ( ε1 )
    ( y2 )   ( 1  x21  x22 )   ( α  )     ( ε2 )
    (  . ) = ( .   .    .  ) · ( β1 )  +  (  . )
    ( yn )   ( 1  xn1  xn2 )   ( β2 )     ( εn )

(b) One-way Anova

    Version 1:  yij = µi + εij
    Version 2:  yij = µ + τi + εij

Version 1 (3 treatments, 2 observations per treatment):

    ( y11 )   ( 1  0  0 )            ( ε11 )
    ( y12 )   ( 1  0  0 )   ( µ1 )   ( ε12 )
    ( y21 ) = ( 0  1  0 )   ( µ2 ) + ( ε21 )
    ( y22 )   ( 0  1  0 )   ( µ3 )   ( ε22 )
    ( y31 )   ( 0  0  1 )            ( ε31 )
    ( y32 )   ( 0  0  1 )            ( ε32 )

Version 2 (3 treatments, 2 observations per treatment):

    ( y11 )   ( 1  1  0  0 )            ( ε11 )
    ( y12 )   ( 1  1  0  0 )   ( µ  )   ( ε12 )
    ( y21 ) = ( 1  0  1  0 )   ( τ1 ) + ( ε21 )
    ( y22 )   ( 1  0  1  0 )   ( τ2 )   ( ε22 )
    ( y31 )   ( 1  0  0  1 )   ( τ3 )   ( ε31 )
    ( y32 )   ( 1  0  0  1 )            ( ε32 )

Assume two solutions to version 2 are

    (µ, τ1, τ2, τ3)' = (5, 1, 2, 3)'    and    (µ, τ1, τ2, τ3)' = (7, −1, 0, 1)'.

Both produce EY = (6, 6, 7, 7, 8, 8)', i.e. they produce the same set of mean values for the observations. There is no way of telling the solutions apart based on the data. This is because the matrix X in version 2 does not have full rank (the first column is the sum of the last three columns).

Definition: Column Space
The column space of matrix X is the space of vectors that can be reached by linear combinations of the columns of X. In example (b) both versions have the same column space:

    C(X) = { (a, a, b, b, c, c)' : a, b, c real numbers }

The dimension of the column space C(X) turns out to be the same as the rank of the matrix X.

All linear models can be written in the form Y = Xβ + ε. Our wish list is to:

• estimate Xβ = E[Y] = Ŷ,
• make sensible point estimates for σ², β, and c'β for "interesting" linear combinations c of β,
• find confidence intervals for σ² and c'β,
• get prediction intervals for new responses,
• test hypotheses H0: βj = βj+1 = ... = βj+r = 0.

Example: Pizza Delivery

Experiment conducted by Bill Afantenou, second year statistics student at QUT. Here is his description of the experiment: As I am a big pizza lover, I had much pleasure in involving pizza in my experiment.
I became curious to find out the time it took for a pizza to be delivered to the front door of my house. I was interested to see how, by varying whether I ordered thick or thin crust, whether Coke was ordered with the pizza and whether garlic bread was ordered with the pizza, the response would be affected. Variables: Variable Crust Coke Bread Driver Hour Delivery Using R to Description Thin=0, Thick=1 No=0, Yes=1 Garlic bread. No=0, Yes=1 Male=M, Female=F Time of order in hours since midnight Delivery time in minutes read the data: > pizza <- read.table("http://www.statsci.org/data/oz/pizza.txt", + header=T,sep="\t") % ASCII file with a header line and tabulator separated entries > pizza Crust Coke Bread Driver Hour Delivery 1 0 1 1 M 20.87 14 2 1 1 0 M 20.78 21 3 0 0 0 M 20.75 18 4 0 0 1 F 20.60 17 5 1 0 0 M 20.70 19 6 1 0 1 M 20.95 17 7 0 1 0 F 21.08 19 8 0 0 0 M 20.68 20 9 0 1 0 F 20.62 16 10 1 1 1 M 20.98 19 11 0 0 1 M 20.78 18 12 1 1 0 M 20.90 22 13 1 0 1 M 20.97 19 14 0 1 1 F 20.37 16 15 1 0 0 M 20.52 20 16 1 1 1 M 20.70 18 > attach(pizza) 3 Make boxplots of the data (result in figure 1): > > > > > par(mfrow=c(2,2)) boxplot(Delivery~Crust,col=c(2,3),main="Crust") boxplot(Delivery~Bread,col=c(2,3),main="Bread") boxplot(Delivery~Coke,col=c(2,3),main="Coke") boxplot(Delivery~Driver,col=c(2,3),main="Driver") Crust B Bread 0 C Coke F m 14 16 1 18 20 2 22 D M Driver M 4read 6 8 0 2 rust oke river 20 18 16 14 14 16 18 20 22 Bread 22 Crust 0 1 0 20 18 16 14 14 16 18 20 22 Driver 22 Coke 1 0 1 F M Figure 1: On average, pizzas with a thin crust are delivered faster; the delivery seems to be faster, if additionally garlic bread is ordered. There does not seem to be difference in delivery times dependent on whether coke is ordered, but the variance in time is increased, if coke is ordered. Women drivers seem to deliver faster than men, but, looking at the data directly, we see that there were only four deliveries by female drivers. Checking the interaction between bread and crust, we get four boxplots (see figure 2) - one for each combination of the two binary variables. > boxplot(Delivery~Crust*Bread,col=c(2,3),main="Crust and Bread") This example is written mathematically as yijk = µ |{z} average delivery time + αj |{z} effect of thick/thin crust + βk |{z} effect of bread/no bread where i = 1, ..., 4, j = 1, 2 and k = 1, 2. 4 + αβjk | {z } interaction effect bread/crust +ijk , 0.0 1.0 0 0.1 1.1 m 14 16 1 18 20 2 22 Crust M C M .0 .1 4 6 8 0 2 rust and Bread 14 16 18 20 22 Crust and Bread 0.0 1.0 0.1 1.1 Figure 2: Boxplots of delivery times comparing all combinations of bread (yes/no) and crust (thin/thick). The difference between delivery times seems to be the same regardless of whether gralic bread was ordered. This is a hint, that in a model the interaction term might not be necessary. In matrix notation this translates to y111 1 y112 1 y121 1 y122 1 y211 1 y212 1 = y221 1 y222 1 y311 1 .. .. . . y422 1 1 1 0 0 1 1 0 0 1 .. . 0 0 1 1 0 0 1 1 0 .. . 1 0 1 0 1 0 1 0 1 .. . 0 1 0 1 0 1 0 1 0 .. . 1 0 0 0 1 0 0 0 1 .. . 0 1 0 0 0 1 0 0 0 .. . 0 0 1 0 0 0 1 0 0 .. . 0 0 0 1 0 0 0 1 0 .. . 0 1 0 1 0 0 0 1 µ α1 α2 β1 β2 αβ11 αβ12 αβ21 αβ22 + 111 112 121 122 211 212 221 222 311 .. . 422 Here, the matrix X does not have full rank - we need to look at the column space of X closer. Excursion: Vector Spaces V is a real-valued vector space, if and only if: (i) V ⊂ IRn , i.e. V is a subset of IRn . (ii) 0 ∈ V , i.e. the origin is in V . (iii) For v, w ∈ V also v + w ∈ V , i.e. sums of vectors are in V . 
(iv) For v ∈ V also λv ∈ V for all λ ∈ IR, i.e. scalar products of vectors are in V . Examples of vector spaces in IRn are 0 or IRn itself; lines or planes through the origin are also vector spaces. Lemma 1.1 The column space C(X) of matrix X is a vector space. 5 Proof: For matrix X ∈ IRn×p , i.e. the matrix X has n rows and p columns, the column space C(X) is defined as: C(X) = {Xb | for all vectors b = (b1 , b2 , ..., bp )0 ∈ IRp } ⊂ IRn The origin is included in C(X), since for b = (0, 0, ..., 0) Xb = (0, ..., 0) = 0. If a vector v is in C(X), this means that there exists bv , such that v = Xbv . For two vectors v, w ∈ C(X), we therefore have bv and bw , such that v = Xbv and w = Xbw . With that v + w = Xbv + Xbw = X(bv + bw ) ∈ C(X) Similarly, for v ∈ C(X) we have λv = λXbv = X(λbv ) ∈ C(X). C(X) therefore fulfills all four conditions of a vector space. We can think of C(X) as being a line or a plane or some other higher-dimensional space. 2 Why do we care about the column space C(X) at all? With E = 0 we have EY = Xβ, i.e. EY (= Ŷ ) is in C(X)! The ordinary least squares solution for Ŷ is therefore done by finding the point in C(X) that is closest to Y . Ŷ is therefore the orthogonal projection of Y onto C(X) (see figure 3 for the three-dimensional case). y3 C(X) Y ^ Y y1 y2 Figure 3: Ŷ is the orthogonal projection of Y onto C(X). 6 How do we find Ŷ ? Let’s start by finding Ŷ by hand in one of our first examples: Example: One-way Anova (3 treatments, 2 repetitions each) yij = µ + αj + ij with Y = 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 µ α1 + . α2 α3 This model has column space C(X) = a a b b c c a, b, c ∈ IR Therefore Ŷ = (ao , ao , bo , bo , co , co )0 for some ao , bo , co ∈ IR, which minimize kŶ − Y k2 : X kY − Ŷ k2 = (yij − ŷij )2 = i,j = (y11 − a)2 + (y12 − a)2 + +(y21 − b)2 + (y22 − b)2 + +(y31 − c)2 + (y32 − c)2 . This is minimal, when a = (y11 + y12 )/2 = y1. , b = y2. , and c = y3. . This gives a solution Ŷ as y1. y1. y2. . Ŷ = y2. y3. y3. In order to find this result directly, we need a bit of math now: Projection Matrices Since C(X) is a vector space, there exists a projection matrix PX for which • v ∈ C(X) then PX v = v (identity on the column space) • w ∈ C(X)⊥ then PX w = 0 (null on the space perpendicular to the column space) Then for any y ∈ IRn , we have y = y1 + y2 with y1 ∈ C(X) and y2 ∈ C(X)⊥ : PX y = PX (y1 + y2 ) = PX y1 + PX y2 = y1 + 0 = y1 . Some properties of projection matrices: 7 1. idempotence: 2 PX = PX (relatively easy to see - apply PX twice to the y above - since this does not change anything, and y could be just any vector, this proves that PX is idempotent) 2. symmetry: 0 PX = PX Proof: let v, w be any vector in IRn . then there exist v1 , w1 ∈ C(X) and v2 , w2 ∈ C(X)⊥ with v = v1 + v2 and w = w1 + w2 . 0 Then have a look at matrix PX (I − PX ): 0 v 0 PX (I − PX )w = (PX v)0 (w − PX w) = = v10 (w − w1 ) = v10 w2 = 0, because v1 ∈ C(X) and w2 ∈ C(X)⊥ . Since we have chosen v and w arbitrarily, this proves that 0 0 0 PX (I − PX ) = 0. This is equal to PX = PX PX . The second matrix is symmetric, which implies the 0 and PX . symmetry of PX PX = X(X 0 X)− X 0 - sometimes PX is called the hat matrix H. Here, (X 0 X)− is a generalized inverse of X 0 X. If X is a full (column) rank matrix, we can use the regular inverse (X 0 X)−1 instead. Excursion: Generalized Inverse Definition: A− is a general inverse of matrix A, iff AA− A = A Properties: 1. A− exists for all matrices A, but is not necessarily unique. 2. If A is square and full rank, A− = A−1 . 
3. If A is symmetric, there exists at least one symmetric generalized inverse A− . 4. Note, that A− A does not need to be the identity matrix! In the example of the one-way anova: we have 6 2 X 0X = 2 2 2 2 0 0 2 0 2 0 2 0 , 0 2 with a generalized inverse of 3 1 1 1 1 1 11 −5 −5 (X 0 X)− = 1 −5 11 −5 32 1 −5 −5 11 I got this inverse by using R: library(MASS); ginv(A) 8 Note that (X 0 X)X 0 X is not the identity matrix: 3 1 1 1 1 1 3 −1 1 (X 0 X)X 0 X = 4 1 −1 3 −1 1 −1 −1 3 The projection matrix PX is PX 1 0 − 0 = X(X X) X = 32 16 16 0 0 0 0 16 0 16 0 0 16 0 16 0 0 0 0 0 0 16 16 0 0 0 0 0 0 16 16 0 0 0 0 16 16 Then Ŷ = PX Y = (y1. , y1. , y2. , y2. , y3. , y3. )0 (the same result we found manually already). 9 How do we find a generalized inverse? A11 A12 n×p Let A ∈ IR with rank r, assume A = where A22 ∈ IRr×r is a full rank matrix. Then A21 A22 0 0 − A := . 0 A−1 22 Proof: 0 A12 A−1 A12 A−1 A12 22 22 A21 AA− A = A= 0 Id A21 A22 −1 Id −A12 A22 −1 in order to show that A12 A22 A21 equals A11 define B = . The rank of B is r, therefore 0 Id the rank of BA is also r. A11 − A12 A−1 A21 0 22 BA = A21 A22 Since A22 has rank r, the first column of matrices in BA has to be a linear combination of the second column, i.e. there exists some orthogonal matrix Q (get from a Gaussian elimination process) with 0 0 A11 − A12 A−1 22 A21 = Q= A22 A22 Q A21 −1 Therefore A11 − A12 A−1 22 A21 = 0, i.e. A11 = A12 A22 A21 . 2 Example: 6 2 X 0X = 2 2 2 2 0 0 2 0 2 0 2 0 has inverse (X 0 X)0 = 1 0 2 2 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 Generalized Inverse in five steps: 1. identify square sub-matrix C of X with full rank r (do not need to be adjacent rows or columns) 2. find inverse C −1 of C 3. replace elements of C in X by elements of (C −1 )0 4. replace all other entries in X by 0 5. transpose to get X − . Example: 1 x = 2 has inverses x− = (1, 0, 0) or x− = (0, 0, 1/3) or x− = (a, b/2, c/3) with a + b + c = 1. 3 Remember: Ŷ = PX Y = X(X 0 X)X 0 Y Claim: PX is the orthogonal projection onto C(X). Proof: still to show for v ∈ C(X) : PX v = X(X 0 X)− X 0 v = X(X 0 X)− X 0 Xc = Xc = v. in order to show X(X 0 X)− X 0 X = X, use y ∈ IRn with y = y1 + y2 : y 0 X(X 0 X)− X 0 X = y10 X(X 0 X)− X 0 X y1 =Xb = b0 X 0 X(X 0 X)− X 0 X = b0 X 0 X = y10 X = y 0 X 10 2 0 for w ∈ C(X)⊥ : PX w = X(X 0 X)− X w = 0 = 0. |{z} With Ŷ = PX Y we get = Y − Ŷ = (I − PX )Y . Properties: • I − PX orthogonal projection onto C(X)⊥ . • C(PX ) = C(X) and rk(X) = rk(PX ) = tr(PX ) Proof: (only for rk(PX ) = tr(PX )) P from linear algebra we know that tr(PX ) = i λi , i.e. the trace of a matrix is equal to the sum of its 2 eigenvalues. Since PX = PX , this matrix has only eigenvalues 0 and 1. The sum of the eigenvalues is therefore equal to the rank of PX . 2 • C(I − PX ) = C(X)⊥ and rk(I − PX ) = n − rk(PX ) • Pythagorean theorem, anova identity kY k2 0 = Y 0 Y = [(PX + I − PX )Y ] [(PX + I − PX )Y ] = 0 0 = (PX Y )0 (PX Y ) + ((I − PX )Y ) ((I − PX )Y ) + Y 0 PX (I − PX ) Y + Y 0 (I − PX )0 PX Y = {z } | {z } | =0 2 2 = kPX Y k + k(I − PX )Y k 11 =0 Identifying β: Since Ŷ = Xβ, we have X 0 · X(X 0 X)− XY = X 0 · Xβ If X has full rank, then X 0 X is full rank and (X 0 X)− = (X 0 X)−1 , then β̂OLS = (X 0 X)−1 XY . For full rank X, this solution is unique. If X is not full rank, there are infinitely many β that solve Xβ = Ŷ . What do we do without full rank X? 
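Before answering that, here is a small numerical check, in R, of the projection machinery above. It is only a sketch: it rebuilds the one-way anova design matrix (3 treatments, 2 observations each), recomputes the generalized inverse quoted above with MASS::ginv, and verifies that PX is symmetric, idempotent, and maps a response vector to its treatment means. The response used is the mean vector (6, 6, 7, 7, 8, 8)' from example (b), purely for illustration.

library(MASS)                       # for ginv()

# one-way anova design matrix, columns (1, treatment 1, treatment 2, treatment 3)
X <- cbind(1, kronecker(diag(3), rep(1, 2)))

XtX  <- t(X) %*% X                  # the 4 x 4 matrix X'X shown above
XtXg <- ginv(XtX)                   # a generalized inverse (here the Moore-Penrose one)
round(32 * XtXg)                    # matches the generalized inverse quoted above

P <- X %*% XtXg %*% t(X)            # projection matrix P_X = X (X'X)^- X'
all.equal(P %*% P, P)               # idempotent
all.equal(P, t(P))                  # symmetric

y <- c(6, 6, 7, 7, 8, 8)            # illustrative response (EY from example (b))
drop(P %*% y)                       # fitted values (y1., y1., y2., y2., y3., y3.)
drop(XtXg %*% t(X) %*% y)           # one of infinitely many b with Xb = Yhat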
Example: One way anova : Means Model: Effects Model: yij = µ + αj + ij yij = µj + ij has full rank matrix X and unique solution µ̂i = y.i has not full rank and generally not a unique solution. But clearly, µ + α1 = µ1 This can be written as µ α1 0 (1100) α2 = c β α3 Question: what makes c special? (it changes the ambiguous β to an unambiguous c0 β) We have to look at linear combinations c0 β more closely: Theorem 1.2 Estimability For some c ∈ IRp the following properties are equivalent: 1. If Xβ1 = Xβ2 ⇒ c0 β1 = c0 β2 2. c ∈ C(X 0 ) 3. There exists a ∈ IRn such that a0 Xβ = c0 β for all β. Definition: if any of the three properties above holds for some c (and with that all will hold), the expression c0 β is called estimable. Proof: 1) ⇒ 2): Xβ1 = Xβ2 is equivalent to X(β1 − β2 ) = 0, which is equivalent to β1 − β2 ∈ C(X 0 )⊥ . Since ⊥ 1) holds, c0 (β1 − β2 ) = 0 for all β1 , β2 , i.e. c ⊥ β1 − β2 , therefore c ∈ C(X 0 )⊥ = C(X 0 ). 2) ⇒ 3): c ∈ C(X) ⇒ ∃ a such that X 0 a = c ⇒ c0 = a0 X ⇒ c0 β = a0 Xβ for all β 3) ⇒ 1): For Xβ1 = Xβ2 there exists a such that c0 β1 = a0 Xβ1 = a0 Xβ2 = c0 β2 . 2 12 If c0 β is estimable, then c0 β = a0 Xβ = a0 PX Y = |{z} a0 X (X 0 X)− X 0 Y = c0 (X 0 X)− X 0 Y. c This generalizes the formula for ordinary least square estimators with a full rank matrix X. We can therefore define ˆ = a0 Ŷ . cˆ0 β OLS := a0 Xβ Example: One-way Anova Y = 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 µ α1 + α2 α3 β itself is not estimable. Let c = (0, 1, −1, 0)0 Is c0 β estimable? Yes, because there exists a ∈ IR6 such that X 0 a = c: 0 1 1−1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 X 0a = 0 0 1 1 0 0 −1 = −1 0 0 0 0 0 0 1 1 0 0 1 = −1 0 c is therefore in C(X 0 ), which makes c0 β estimable. In practice we want to estimate several c01 β, c02 β, ..., c0l β simultaneously. Define C= c01 c02 .. . c0l ∈ Rl×p , estimate by 0 − 0 c Cβ OLS = C(X X) X Y Testability We want to test hypotheses of the form H0 : Cβ = d. This is a very general way of writing hypotheses - and fits our standard way, e.g. Example: Simple Linear Regression yi = a + bxi + i H0 : b = 0 translates to H0 : (0, 1) a b =0 13 Example: One-way anova yij = µ + αj + ij , with i = 1, 2 and j = 1, 2, 3 H0 : α1 = α2 = α3 translates to 1 −1 0 1 0 −1 0 0 H0 : µ α1 0 = α2 0 α3 First condition: each row of C has to be estimable (i.e. the rows in C are linear combinations of rows of X) Estimability alone is not enough, look at (assuming a regression model): 1 1 0 0 0 0 α0 3 α1 = 7 α2 Both rows are estimable, but the expression is still nonsensical. These leads to a necessary second condition: rows in C have to be linearly independent. Definition 1.3 (Testability) The hypothesis H0 : Cβ = d is estimable, if 1. every row in C is estimable, 2. the rank of C is l. The concept of testability is sometimes strange: 1 1 0 0 0 0 α0 3 α1 = is not testable 3 α2 but 1 0 How does a hypothesis ‘look’ like? a01 a02 Therefore A ∈ IRl×n AX = C, A = . .. 0 α0 α1 = α2 3 is testable. n 0 0 Since every row in C is estimable, i.e. ∃a ∈ IR with a X = c . a0l Most hypothesis are of the form H0 : Cβ = 0. With Cβ = 0, A · Xβ = 0. This means, that Xβ is perpendicular to each row in A, i.e. Ŷ = Xβ ∈ C(A0 )⊥ . On the other hand, we know, that Ŷ ∈ C(X). Under H0 the predicted value Ŷ is in the intersection of these two spaces, i.e. Ŷ ∈ C(X) ∩ C(A0 )⊥ 14 Example y1 1 y2 = 1 y3 0 0 µ1 0 + µ2 1 µ1 Consider H0 : µ1 = µ2 . 
This can be written as H0 : (1 − 1) = 0, which is equivalent to µ2 1 0 µ1 =0 (1, 0, −1) 1 0 µ2 | {z } 0 1 A In this setting = {v ∈ IR3 C(A0 ) = {v ∈ IR3 C(A0 )⊥ = {v ∈ IR3 C(X) ∩ C(A0 )⊥ = {v ∈ IR3 C(X) ∩ C(A′) v3 C(X) v2 v1 15 T C(X) a : v1 = v2 } = a : a, b ∈ IR b a : v1 = −v3 , v2 = 0} = 0 : a ∈ IR −a a : v1 = v3 } = b : a, b ∈ IR a a : v1 = v2 = v3 } = a : a ∈ IR a Since the null hypothesis H0 : Cβ = 0 is equivalent to the notion that Ŷ ∈ C(X) ∩ C(A0 )⊥ with AX = C, we need to talk about the distribution of errors (and with that the distribution of Y ). We are going to look at two different models closer: the Gauss Markov Model, and the Aitken Model. Both make assumptions on mean and variance of the error term , but do not specify a full distribution. For a linear model of the form Y = Xβ + , the Gauss-Markov assumptions are E = 0 V ar = σ 2 I, i.e. we are assuming independence among errors, and identical variances. The Aitken assumptions for the above linear model are E = 0 V ar = σ 2 V, where V is a known (symmetric and positive definite) matrix. Gauss-Markov Based on Gauss Markov error terms, we are want to derive means and variances for observed and predicted responses Y and Ŷ , draw conclusions about estimators c0 β and get an estimate s2 for σ 2 . With E = 0, V ar = σ 2 I we get • EY = E[Xβ + ] = Xβ + E = Xβ, V arY = V ar[Xβ + ] = V ar = σ 2 I. • E Ŷ = E [PX Y ] = PX EY = PX Xβ = Xβ, |{z} ∈C(X) 0 = arY PX 0 = σ 2 PX . PX · σ 2 I · PX V arŶ = V ar [PX Y ] = PX V h i h i • E Y − Ŷ = 0, V ar Y − Ŷ = (I − PX )V arY (I − PX )0 = σ 2 (I − PX ). 0 − 0 c • for estimable Cβ, we get for the least squares estimate Cβ OLS = C(X X) X Y : h i c E Cβ = C(X 0 X)− X 0 EY = C(X 0 X)− X 0 Xβ = OLS = A X(X 0 X)− X 0 Xβ = AXβ = Cβ i.e. unbiased estimator | {z } PX h c V ar Cβ OLS i = C(X 0 X)− X 0 V ar[Y ] C(X 0 X)− X 0 0 = = σ 2 C (X 0 X)− X 0 · X (X 0 X)− C 0 = | {z } symmetric | {z =(X 0 X)− , see (*) } = σ 2 C(X 0 X)− C 0 (*) holds because for a generalized inverse A0 of matrix A, matrix A is a generalized inverse of A− , i.e. A− AA− = A− . For a full rank model and C = I the above expression for the variance of β̂ simplifies to the usual expression: V arβ̂OLS = σ 2 (X 0 X)−1 . 16 P • For the sum of squared errors e0 e = i e2i we get h i E[e0 e] = E (Y − Ŷ )0 (Y − Ŷ ) = E [(Y − PX Y )0 (Y − PX Y )] = E [Y 0 (I − PX )Y ] = E[Y 0 Y ] + E[Ŷ 0 Ŷ ] for E[Y 0 Y ] we have E[Y 0 Y ] X = E[yi2 ] = X i X 2 V aryi + (Eyi )2 = σ + (Xβ)2i = i i = nσ 2 + (Xβ)0 Xβ And, similarly: E[Ŷ 0 Ŷ ] = X E[yˆi 2 ] = i 2 X X 2 σ (PX )ii + (Xβ)2i = V arŷi + (E ŷi )2 = i i 0 = σ tr(PX ) + (Xβ) Xβ = σ 2 rk(X) + (Xβ)0 Xβ Therefore, for the sum of squared errors we get: E[e0 e] = σ 2 (n − rk(X)) This last result suggests an estimate s2 for σ 2 as the standard estimate: s2 = e0 e SSE = = M SE n − rk(X) df E Properties of the Least Squares Estimator 0 0 − 0 0β • Linearity: since cc OLS = c (X X) X Y , the least squares estimator is the result from a linear | {z } row vector re-combination of entries in Y . h i 0 0β • Unbiased: E cc OLS = c β, the least squares estimator is unbiased. The least squares estimator is a “BLUE” (best linear unbiased estimator), if the variance is minimal. That is the next theorem’s content: Theorem 1.4 Gauss-Markov Estimates In the linear model Y = Xβ + with E = 0 and V ar = σ 2 I for estimates c0 β 0 0 − 0 0β the BLUE is cc OLS = c (X X) X Y . 17 Theorem 1.5 Gauss-Markov Estimates In the linear model Y = Xβ + with E = 0 and V ar = σ 2 I for estimates c0 β 0 0 − 0 0β the BLUE is cc OLS = c (X X) X Y . 
We have to show, that among the linear unbiased estimators the least squares estimator is the one with the smallest variance. Proof: Idea: take one arbitrary unbiased linear estimator. Show that its variance is at least as big as the variance of the least squares estimator. 0 0β Notation: let %0 := c0 (X 0 X)− X 0 , then cc OLS = % Y . First, an observation: PX % = %, because % = X · (X 0 X)− X 0 c ∈ C(X) {z } | some vector Let v ∈ IRn with E [v 0 Y ] = c0 β for all β (linear, unbiased estimator). Since c0 β = E [v 0 Y ] = v 0 EY = v 0 Xβ for all β, this implies c0 = v 0 X. The variance of v 0 Y is then: V ar[v 0 Y ] = V ar [(v 0 Y − %0 Y ) − %0 Y ] = = V ar [(v − %)0 Y ] +V ar [%0 Y ] + 2 | {z } ≥0 Cov ((v − %)0 Y, %0 Y ) | {z } ≥ V ar(%0 Y ) =0 still has to be shown, see (*) We still have to show (*): Cov((v − %)0 Y, %0 Y ) = (v − %)0 V ar(Y )% = σ 2 (v − %)0 % = = σ 2 (v 0 % −%0 %) = σ 2 (v 0 PX % − %0 %) = |{z} =PX % 0 = σ 2 (v X (X 0 X)− X 0 % − %0 %) = |{z} =c0 0 0 = σ (c (X X)− X 0 % − %0 %) = 0 {z } | 2 =%0 2 Under Gauss-Markov assumption OLS estimator are optimal. This does not hold in a general setting. Example: Anova yij = µi + ij with i = 1, 2 repetitions and j = 1, 2, 3 treatments. It is known that the second repetition has a lower variance: y11 1 0 0 11 y21 1 0 0 21 µ1 y12 0 1 0 = µ2 + 12 , with cov() = σ 2 diag(1, 0.01, 1, 0.01, 1, 0.01) y22 0 1 0 22 µ3 y13 0 0 1 13 y23 0 0 1 23 y.1 The OLS estimate b = y.2 is then the wrong idea, since y.3 1 1 1 (y11 + y21 ) = σ 2 (1 + 0.01) ≥ σ 2 V arb1 = V ar 2 4 4 18 But, if we decide to ignore the first repetition and use b̃1 = y21 , we get a variance of V arb̃1 = V ar (y21 ) = 0.01σ 2 < 1 2 σ 4 What do we do under the even more general assumptions of the Aitken Model? Aitken Let V be a symmetric, positive definite matrix with V ar = σ 2 V for linear model Y = Xβ + , where E = 0. Then there exists a symmetric matrix V −1/2 with V −1/2 V −1/2 = V −1 and V 1/2 V 1/2 = V . Small Excursion: Why does V 1/2 exist? • V is a symmetric matrix, therefore there exists orthonormal matrix Q which diagonalizes V , i.e. V = QDV Q0 , where DV = diag(λ1 , ..., λn ) is the matrix of eigenvalues λi of V . • V is positive definite, therefore all its eigenvalues λi are strictly positive (which implies that we can take the square roots of λi ). These two properties of V lead to defining the square root matrix V 1/2 as p p p 1/2 1/2 V 1/2 = QDV Q0 , where DV = diag( λ1 , λ2 , ..., λn ). 1/2 1/2 1/2 1/2 Then V 1/2 · V 1/2 = QDV Q0 · QDV Q0 = QDV DV Q0 = QDV Q0 = V , and V 1/2 is symmetric. −1 −1/2 The inverse V −1/2 = V 1/2 = QDV Q0 . End of Small Excursion Define U := V −1/2 Y . We will have a look at EU and V arU . 19 EU V arU = V −1/2 EY = V −1/2 Xβ 0 = V −1/2 V arY V −1/2 = V −1/2 · σ 2 V · V −1/2 = σ 2 V −1/2 · V 1/2 V 1/2 · V −1/2 = σ 2 I. These are the Gauss Markov assumptions. Looking at the model U = W β + ∗ , with U = V −1/2 Y, W = V −1/2 X and ∗ = V −1/2 , we have found a transformation of the original model, that transforms the Aitken’s assumptions into Gauss-Markov assumptions. The gain is (hopefully) obvious: we will be able to apply all the theory we know for models with GaussMarkov assumptions to models with Aitken assumption, including BLUEs. 
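Numerically, the transformation can be carried out directly from an eigendecomposition of V, mirroring the construction of V^{1/2} above. The sketch below is only illustrative: the helper gls_whiten is not from the notes and the response values are made up; the Anova example that follows works through the same computation by hand.

# Aitken model -> Gauss-Markov model via the whitening transform:
# U = V^{-1/2} Y, W = V^{-1/2} X, then ordinary least squares on (U, W)
gls_whiten <- function(X, Y, V) {
  eV <- eigen(V, symmetric = TRUE)                                   # V = Q D Q'
  Vmh <- eV$vectors %*% diag(1 / sqrt(eV$values)) %*% t(eV$vectors)  # V^{-1/2}
  U <- Vmh %*% Y
  W <- Vmh %*% X
  drop(MASS::ginv(t(W) %*% W) %*% t(W) %*% U)  # OLS in the transformed model
}

# the setting used below: 3 treatments, 2 repetitions,
# the second repetition has variance 0.01 * sigma^2
X <- kronecker(diag(3), rep(1, 2))            # means model design matrix
V <- diag(rep(c(1, 0.01), 3))
Y <- c(21, 20, 30, 29, 40, 41)                # made-up responses for illustration
gls_whiten(X, Y, V)   # each estimate equals (y1j + 100 y2j)/101, i.e. it is
                      # pulled towards the low-variance (second) repetition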
Example: Anova µ1 We are going to find the BLUE for β = µ2 in the model µ3 yij = µi + ij with i = 1, 2 repetitions and j = 1, 2, 3 treatments, where y11 y21 y12 y22 y13 y23 = 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 µ1 µ2 + µ3 11 21 12 22 13 23 , with cov() = σ 2 diag(1, 0.01, 1, 0.01, 1, 0.01) = σ 2 V The transformed model U = W β + ∗ in this example is then: V 1/2 V −1/2 = diag(1, 0.1, 1, 0.1, 1, 0.1) = diag(1, 10, 1, 10, 1, 10) Therefore U = V −1/2 Y = W = V −1/2 X = y11 10y21 y12 10y22 y13 10y23 1 0 10 0 0 1 0 10 0 0 0 0 0 0 0 0 1 10 The BLUE for model U = W β + ∗ is the OLS estimator β̂ = βOLS = (W 0 W )− W 0 U . 20 W 0W 1 = 0 0 − (W 0 W ) (W 0 W ) W 0 1 101 = 0 0 − 10 0 0 1 101 = 0 0 0 1 0 0 1 101 0 10 0 0 0 10 0 0 1 101 = 0 0 0 101 0 0 0 101 0 0 0 1 101 10 101 0 0 1 101 10 101 0 0 0 0 1 0 0 10 0 0 0 1 0 0 10 0 0 0 1 0 0 10 0 0 1 101 0 0 10 101 The BLUE is then \ 1 µ1 101 µ2 = (W 0 W )− W 0 U = 0 µ3 0 10 101 0 0 0 0 1 101 10 101 0 0 0 0 1 101 0 0 10 101 y11 10y21 y12 10y22 y13 10y23 = y11 +100y21 101 y12 +100y22 101 y13 +100y23 101 We can use the same estimate for the original model Y = Xβ + , since: Û = W β̂OLS(U ) = V −1/2 X β̂OLS(U ) = µ1 10µ1 µ2 10µ2 µ3 10µ3 and Ŷ = V 1/2 Û = (µ1 , µ1 , µ2 , µ2 , µ3 , µ3 )0 = µ̂OLS(U ) ∈ C(X). The OLS for U minimizes (U − Û )0 (U − Û ) over choices of C(W ). (U − Û )0 (U − Û ) = (V −1/2 Y − V −1/2 Ŷ )0 (V −1/2 Y − V −1/2 Ŷ ) = (Y − Ŷ )0 V −1 (Y − Ŷ ) This expression is minimized over choices of Ŷ in C(V 1/2 W ) = C(X). This is the generalized weighted least squares based on Y . The BLUE in the Aitken model is therefore 0 0 −1 0 β = cc 0β cc X)− X 0 V −1 Y OLS(U ) = c (X V What happens outside the Aitken model? Example: Anova yij = µ + ij , 21 where ij are independent with V arij = σi2 for i = 1, 2. For 3 groups and 2 observations per group this gives 11 1 y11 21 y21 1 y12 1 = µ + 12 , with cov() = diag(σ12 , σ22 , σ12 , σ22 , σ12 , σ22 ) 22 y22 1 13 y13 1 23 1 y23 we get the BLUE for µ as µ̂ = 1/σ12 y.1 + 1/σ22 y.2 σ22 y.1 + σ12 y.2 = 1/σ12 + 1/σ22 σ12 + σ22 But: usually we don’t know σ12 /σ22 . We could use the sample variances s21 and s22 , but that takes us out of the framework of linear estimators. 22 Reparametrizations Consider the two models (Gauss-Markov assumptions): Version (I) Version (II) Y = Xβ + Y = Wγ + If C(X) = C(W ) these models are the same, i.e. same predictions Ŷ and Y − Ŷ , same estimable functions c0 β. What estimable functions in (II) correspond to estimable functions in (I)? Let c0 β be an estimable function in (I), i.e. c0 ∈ C(X). If C(X) = C(W ), then W is a linear re-combination of columns in X, i.e. ∃ F with W = XF, then c0 β |{z} est. in (I) = a0 Xβ = a0 W γ = a0 XF γ = (c0 F )γ . | {z } est. in (II) Example: Anova: 3 groups, 2 reps per group Version (I): full rank means model Version (II): rank deficient effects model yij = µ + αj + ij yij = µj + ij 1 0 0 1 1 0 0 1 0 0 1 1 0 0 µ µ 1 0 1 0 1 0 1 0 α1 µ2 + Y = Y = 0 1 0 1 0 1 0 α2 µ 3 0 0 1 1 0 0 1 α3 0 0 1 1 0 0 1 | | {z } {z } X W 1 1 0 0 then W = XF for F = 1 0 1 0 . 1 0 0 1 µ1 So e.g. µ1 = (1, 0, 0) µ2 is estimable in (I). This corresponds to µ3 µ α1 c0 F γ = (c0 F ) α2 = (1, 1, 0, 0) α3 µ α1 = µ + α1 . α2 α3 0 0β [ Clearly: cc OLS = y.1 = c F γ OLS . What is to choose between these models? 1. the full rank model behaves nicer mathematically and is computational simpler. 2. scientific interpretability of parameters pushes for other models sometimes. Note: we can use any matrix B with same column space as X. 
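The practical content of this equivalence is easy to check in R. The sketch below (with made-up data for the 3-group, 2-replicate layout) fits the full rank means model and the rank-deficient effects model, which R makes full rank through its default baseline constraint, and confirms that both give the same fitted values and the same estimate of the estimable function µ + α1 = µ1.

set.seed(1)
treatment <- factor(rep(1:3, each = 2))
y <- c(6, 6, 7, 7, 8, 8) + rnorm(6, sd = 0.2)      # made-up responses

fit.means   <- lm(y ~ treatment - 1)   # version (I): y_ij = mu_j + e_ij
fit.effects <- lm(y ~ treatment)       # version (II): y_ij = mu + alpha_j + e_ij
                                       # (R imposes alpha_1 = 0, a baseline constraint)

all.equal(fitted(fit.means), fitted(fit.effects))  # same column space, same Y-hat
coef(fit.means)["treatment1"]          # mu_1 in version (I)
coef(fit.effects)["(Intercept)"]       # mu + alpha_1 in version (II): identical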
23 + Example: two way factorial model Let i = 1, 2; j = 1, 2; k = 1, 2, consider the model yijk = µ + αi + βj + (αβ)ij + ijk . This corresponds to the design matrix W : 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 µ α1 α2 β1 β2 αβ11 αβ12 αβ21 αβ22 Idea: find a second matrix X with full rank and the same column space as W , to resolve problems of rank deficiency. The way we do this is by introducing constraints for the parameters. There are different choices of constraints: Statistician’s favorite: zero sum constraints Most software uses: baseline constraints R: first effects are zero X αi = 0 i α1 = 0 X β1 = 0 βj = 0 j αβ1j = 0 for all j X αβij = 0 for all j αβi1 = 0 for all i i X αβij = 0 for all i j In the 2 × 2 factorial model with 2 repetitions the sum restrictions translate to: α1 = −α2 β1 = −β2 αβ11 = −αβ21 , αβ12 = −αβ22 , and αβ11 = −αβ12 = αβ22 . This leads to a different version of W : 1 1 1 1 1 1 1 1 1 1 1 1 −1 −1 −1 −1 1 1 −1 −1 1 1 −1 −1 1 1 −1 −1 −1 −1 1 1 24 µ α1 β1 αβ11 Baseline restrictions give a different matrix: 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 1 25 µ α2 β2 αβ22 We are about to make distribution assumptions now. Since we are going to be talking about normal distributions quite a bit, it might be a good idea to review some of its properties. Some Useful Facts About Multivariate Distributions (in Particular Multivariate Normal Distributions) Here are some important facts about multivariate distributions in general and multivariate normal distributions specifically. 1. If a random vector X has mean vector µ and covariance matrix Σ , then Y = B X + d has k×1 k×k k×1 l×1 l×kk×1 l×1 mean vector EY = Bµ + d and covariance matrix VarY = BΣB0 . 2. The MVN distribution is most usefully defined as the distribution of X = A Z + µ , for Z a k×1 k×pp×1 k×1 vector of independent standard normal random variables. Such a random vector has mean vector µ and covariance matrix Σ = AA0 . (This definition turns out to be unambiguous. Any dimension p and any matrix A giving a particular Σ end up producing the same k-dimensional joint distribution.) 3. If is X is multivariate normal, so is Y = B X + d . k×1 l×1 l×kk×1 l×1 4. If X is MVNk (µ, Σ), its individual marginal distributions are univariate normal. Further, any subvector of dimension l < k is multivariate normal (with mean vector the appropriate sub-vector of µ and covariance matrix the appropriate sub-matrix of Σ). 5. If X is MVNk (µ1 , Σ11 ) and independent Y of which is MVNl (µ2 , Σ22 ), then X Y Σ11 µ 1 ∼ MVNk+l , µ2 0 0 Σ22 k×l l×k 6. For non-singular Σ, the MVNk (µ, Σ) distribution has a (joint) pdf on k-dimensional space given by −k 2 fX (x) = (2π) − 12 |det Σ| 1 0 exp − (x − µ) Σ−1 (x − µ) 2 7. The joint pdf given in 6 above can be studied and conditional distributions (given values for part of the X vector) identified. For X1 µ1 X = l×1 ∼ MVNk l×1 , k×1 X2 (k−l)×1 µ2 (k−l)×1 Σ11 Σ12 l×l l×(k−l) Σ21 Σ22 (k−l)×l (k−l)×(k−l) the conditional distribution of X1 given that X2 = x2 is −1 X1 |X2 = x2 ∼ MVNl µ1 + Σ12 Σ−1 22 (x2 − µ2 ) , Σ11 − Σ12 Σ22 Σ21 8. All correlations between two parts of a MVN vector equal to 0 implies that those parts of the vector are independent. The next paragraph looks at the distribution of quadratic forms, i.e. distributions of Y 0 AY . This will help in the analysis of all sum of squares involved (SSE, SST , ...) 
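Before moving on to quadratic forms, it may help to see how the two constraint systems above appear in software: R builds exactly such full rank design matrices through its contrast settings. A small sketch for the 2 × 2 factorial with 2 repetitions; the columns produced may differ in sign or ordering from the matrices written out above, since contr.treatment reproduces R's "first effects are zero" scheme rather than the last-effects-zero baseline.

A <- factor(rep(1:2, each = 4))              # 2 x 2 factorial, 2 repetitions
B <- factor(rep(rep(1:2, each = 2), 2))

# baseline constraints, "first effects are zero" (R's default)
model.matrix(~ A * B,
             contrasts.arg = list(A = "contr.treatment", B = "contr.treatment"))

# zero sum constraints (the "statistician's favorite")
model.matrix(~ A * B,
             contrasts.arg = list(A = "contr.sum", B = "contr.sum"))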
26 Distribution of Quadratic Forms χ2k density functions for various degrees of freedom k: 5=10 10 15 m 0.0 0.2 0.4 0.6 0 0.8 1.0 1 y1.2 k=3 k=2 k=1 kk=5 M k=10 =3 =2 =1 =5 0 5 .4 .6 .8 .0 .2 1.2 Definition 1.6 (χ2 distribution) P For Z ∼ M V N (0, Ik×k ) then Y := Z 0 Z = i Zi2 ∼ χ2k , where k is the degrees of freedom. 1.0 Then E[Z] = k, V ar[Z] = 2k. 0.4 0.6 0.8 k=1 k=2 0.2 k=3 k=5 0.0 k=10 0 5 10 15 y χ23 (λ) density functions for various noncentrality parameters λ: 5ambda=0 10 1 15 m 0.00 0.05 0.10 0.15 0.20 0 x0.25 lambda=10 lambda=5 lambda=2 llambda=1 M lambda=0 ambda=10 ambda=5 ambda=2 ambda=1 0 5 .00 .05 .10 .15 .20 .25 0.25 Definition 1.7 (noncentral χ2 distribution) P For Z ∼ M V N (µ, Ik×k ) then Y := Z 0 Z = i Zi2 ∼ χ2k (λ), where k is the degrees of freedom and λ = µ0 µ the noncentrality parameter. 0.20 lambda=0 Then E[Z] = k + λ, V ar[Z] = 2k + 4λ. 0.15 lambda=1 0.10 lambda=2 0.05 lambda=5 0.00 lambda=10 0 5 10 15 x Theorem 1.8 Let Y ∼ M V N (µ, Σ) with positive definite covariance Σ. Let A be a symmetric n × n matrix with rk(A) = k ≤ n. If AΣ is idempotent, then Y 0 AY ∼ χ2rk(A) (µ0 Aµ) If Aµ = 0 then Y 0 AY ∼ χ2rk(A) . Proof: The overall goal is to find a random variable Z ∼ M V N (µ, I) with Z 0 Z = Y 0 AY . We will do that in several steps: • AΣ is idempotent, i.e. AΣ · AΣ = AΣ. • Σ is positive definite, i.e. Σ has full rank, therefore Σ−1 exists. With that, the above formula becomes AΣAΣ · Σ−1 = AΣ · Σ−1 ⇐⇒ AΣA = A 27 (1) • then we can show that A itself is semi-positive definite: A0 =A (1) x0 Ax = x0 · AΣA · x = x0 · A0 ΣA · x = (Ax)0 · Σ · (Ax) Σ pos. def. ≥ 0. A is semi positive, therefore it has k strictly positive eigenvalues λ1 , ..., λk and n − k eigenvalues that are 0. • A is semi pos. def, i.e. ∃ Q ∈ IRn×k with Q0 Q = Ik×k and QQ0 = In×n and A = QDA Q0 , where DA = diag(λ1 , ..., λk ) −1/2 • Let B = QDA • Define Z := B 0 AY . Then Z ∼ M V N (B 0 Aµ, B 0 A · Σ · A0 B) = N (B 0 Aµ, Ik×k ), because 0 B0 A {z· A} B | ·Σ −1/2 = B 0 AB = DA A −1/2 Q0 · A · Q D | {z } A = Ik×k DA Therefore Z 0 Z ∼ χ2k (µ0 A0 BB 0 Aµ). • For this Z we have to show two things, first that Z 0 Z = Y 0 Y , second that µ0 A0 BB 0 Aµ = µ0 Aµ. Both are equivalent to showing that A0 BB 0 A = A: −1/2 A0 BB 0 A = QDA Q0 · QDA −1/2 · DA −1 0 Q Q DA Q0 = QDA Q0 = A Q0 · QDA Q0 = QDA Q0 Q DA |{z} |{z} Ik×k Ik×k 2 Example: Gauss-Markov Model Y ∼ M V N (Xβ, σ 2 In×n ) In a linear model, the sum of squared errors SSE is SSE 1 1 1 0 = 2 (Y − Ŷ )0 (Y − Ŷ ) = 2 [(I − PX )Y ] [(I − PX )Y ] = 2 Y 0 (I − PX )Y. 2 σ σ σ σ We want to find a distribution for this quadratic form Y 0 (I − PX )σ −2 Y . Since (I − PX )σ −2 · σ 2 I = I − PX is an idempotent matrix, we can use the previous theorem: Y 0 (I − PX )σ −2 Y ∼ χ2rk(I−PX ) (β 0 X 0 (I − PX )σ −2 Xβ) = χ2n−rk(X) , since (I − PX )X = X − PX X = X − X = 0 and rk(I − PX ) = dim C(X)⊥ = n − rk(X). 28 2 2 2 Since SSE σ 2 ∼ χn−rk(X) in model Y ∼ M V N (Xβ, σ I), we can use this to get confidence intervals for σ . For α ∈ (0, 1) the (1 − α)100% confidence interval for σ 2 is given as: α α SSE ≤ (1 − ) quantile of χ2n−rk(X) ) = 1 − α quantile of χ2n−rk(X) ≤ 2 σ2 2 SSE SSE P( ≤ σ2 ≤ α )=1−α (1 − α2 ) quantile 2 quantile P( ⇐⇒ i.e. ( (1− αSSE ) quantile , 2 α 2 SSE quantile ) is (1 − α)100% confidence interval. Theorem 1.9 cf. Christensen 1.3.7 Let Y ∼ M V N (µ, σ 2 I) and A, B ∈ IRn×n with BA = 0. Then 1. if A symmetric, then Y 0 AY and BY are independent. 2. if both A and B symmetric, then Y 0 AY and Y 0 BY are independent. 
Proof: Both statements can be shown similarly. In order to get a statement about independence, we will introduce a new random variable and analyze its covariance structure. Look at A AY Y = . B BY This has covariance structure A A AA0 cov(Y )(A0 , B 0 ) = σ 2 I(A0 , B 0 ) = σ 2 B B BA0 AB 0 BB 0 = σ2 AA0 0 0 BB 0 , A=A0 because BA0 = BA = 0 and AB 0 = (BA0 )0 = 0. Therefore, AY and BY are independent. Any function in AY is then also independent of BY . Write Ais symm. A=AA− A Y 0 AY = Y 0 AA− AY = (AY )0 A− (AY ). Therefore A0 AY is independent of BY . If B is also symmetric, then with the same argument, Y 0 AY and Y 0 BY are independent. 2 Example: estimable fundtions in Gauss Markov model Let Y ∼ M V N (Xβ, σ 2 I) and take A = I = PX and B = PX . Then BA = 0, both A and B are symmetric. Therefore Y 0 AY = Y 0 (I − PX )Y = (Y − Ŷ )0 (Y − Ŷ ) = SSE is independent of BY = PX Y = Ŷ . For an estimable function c0 β we get: 0 0 − 0 0β cc OLS = c (X X) X Y ∃a:a0 X=x0 = a0 X(X 0 X)− X 0 Y = a0 PX Y = a0 Ŷ , i.e. any estimable function c0 β can be written as a linear combination of Ŷ . Since Ŷ is independent of SSE, so is c0 β. 29 Definition 1.10 (t distribution) Let Z ∼ N (0, 1) and W ∼ χ2k with Z, W independent and 0.4 Z ∼ tk T := p W/k tν density functions for various degrees of freedom k: 0.3 − − − − − 0.0 0.1 0.2 Then E[T ] = 0, V ar[T ] = k/(k − 2). normal curve k=1 k=2 k=5 k=10 −4 −2 0 2 x In the previous example, we have SSE 0 2 0 0 − 0β ∼ χ2n−rk(X) and cc OLS ∼ N (c β, σ c (X X) c) ind. σ2 Then 0β cc − c0 β p OLS σ 2 c0 (X 0 X)− c ,s 0 0β SSE cc OLS − c β p √ = ∼ tn−rk(X) σ 2 (n − rk(X)) c0 (X 0 X)− c · M SE We can test H0 : c0 β = # using test statistic 0β cc OLS − # √ T =p and tn−rk(X) as null distribution 0 0 c (X X)− c M SE and get confidence intervals for c0 β as 0 0β cc OLS − c β √ ≤ t∗ )1 − α, c0 (X 0 X)− c · M SE P (−t∗ ≤ p where t∗ is the (1 − α/2) quantile of tn−rk(X) distribution. Therefore c0 β has (1 − α)100% C.I. ∗ 0β cc OLS ± t p √ c0 (X 0 X)− c · M SE. Theorem 1.11 Pk n×n 0 Cochran’s Theorem Let Y ∼ M V N (0, In×n with Y 0 Y = i=1 Qi , where Qi = Y Bi Y and Bi ∈ IR positive semi-definite matrices with rank ri ≤ n. Then the following equivalences hold” P i. i ri = n ii. Qi ∼ χ2ri iii. Qi are mutually independent Applications: Predictions Assume that for the linear model Y ∼ M V N (Xβ, σ 2 I) new observations become available. Let c0 β an estimable function and y ∗ be a vector of (statistics of) the new observations, with Ey ∗ = c0 β and V ary ∗ = γσ 2 , where γ is some known constant. 30 4 Example: Assume a simple means model yij = µj + ij , with j = 1, 2, 3 treatments and i = 3, 2, 2 replications respectively. Two additional experiments for treatment 3 will be done, i.e. c0 = (0, 0, 1). Let y ∗ be the mena of the new observations, then V ary ∗ = 1/2σ 2 . ∗ 0β For the difference between cc OLS and y we have distribution: ∗ 2 0 0 − 0β cc OLS − y ∼ N (0, σ (c (X X) c + γ)). 0β Since M SE and cc OLS are independent, the ratio has a t distribution: ∗ 0β cc OLS − y p √ ∼ tn−rk(X) c0 (X 0 X)− c + γ · M SE Then use ∗ 0β cc OLS ± t p √ c0 (X 0 X)− c + γ · M SE as 1 − α level prediction limits for t∗ . 31 Application: Testing Assuming a Gauss-Markov Model Y ∼ M V N (Xβ, σ 2 I), let H0 : Cβ = dl×1 be a testable hypothesis. c Build a test on Cβ OLS − d, then 2 0 − 0 c Cβ OLS − d ∼ M V N (Cβ − d, σ C(X X) C ) and the expression c Cβ OLS − d 0 c (σ 2 C(X 0 X)− C 0 )−1 Cβ OLS − d c is a measure of mismatch between Cβ OLS and d. What is its distribution? 
It has the form Z 0 AZ, with AΣ = (σ 2 C(X 0 X)− C 0 )−1 σ 2 C(X 0 X)− C 0 = Il×l , which clearly is idempotent. A is also symmetric. Theorem 1.8 therefore holds, and Y 0 AY has a χ2l distribution with non-centrality parameter δ 2 given as 0 2 0 − 0 −1 c c δ 2 = Cβ Cβ OLS − d OLS − d (σ C(X X) C ) Define the sum of squares of hypothesis H0 : Cβ = d as 0 0 − 0 −1 c c SSH0 := Cβ Cβ OLS − d = σ 2 δ 2 OLS − d (C(X X) C ) Then SSH0 ∼ χ2l (δ 2 ) σ2 When the null hypothesis H0 holds, then δ 2 = 0 and SSH0 /σ 2 has a central chi2 distribution. If H0 does not hold, SSH0 /σ 2 tends to have a larger value. Idea for testing: compare SSH0 to SSE. We already know, that SSE/σ 2 ∼ χ2n−rk(X) , and SSE and SSH0 are independent, because: Ŷ and c and Cβ SSH0 and SSE independent c = SSE independent, because for estimable Cβ, there exists A, such that AX = C. Therefore Cβ 0 − 0 AX · (X X) X Y = AŶ , function in Ŷ . 0 0 − 0 −1 c c c SSE independent, because SSH0 = Cβ − d (C(X X) C ) Cβ − d is function in Cβ OLS OLS and therefore function in Ŷ . Definition 1.12 (F distribution) Let U ∼ χ2ν1 and V ∼ χ2ν2 independent. Then U/ν1 ∼ Fν1 ,ν2 , V /ν2 the (Snedecor) F distribution with ν1 and ν2 degrees of freedom. EFν1 ,ν2 = ν2 2ν22 (ν1 + ν2 − 2) and V arFν1 ,ν2 = . ν2 − 2 ν1 (nu2 − 2)2 (ν2 − 4) Definition 1.13 (non-central F distribution) Let U ∼ χ2ν1 (λ2 ) and V ∼ χ2ν2 independent. Then U/ν1 ∼ Fν1 ,ν2 ,λ2 , V /ν2 32 the non-central F distribution with ν1 and ν2 degrees of freedom and non-centrality parameter λ2 . EFν1 ,ν2 ,λ2 = (λ2 + ν1 )ν2 2ν 2 (ν 2 + (2λ2 + ν2 − 2)ν1 + λ2 (λ2 + 2ν2 − 4) . and V arFν1 ,ν2 ,λ2 = 2 1 ν1 (ν2 − 2) ν12 (nu2 − 2)2 (ν2 − 4) Define F = SSH0 /(l · σ 2 ) ∼ Fl,n−rk(X),δ2 SSE/((n − rk(X))σ 2 ) For α level hypothesis test of H0 : Cβ = d we reject, if F > (1 − α) quantile of central Fl,n−rk(X) . Let f ∗ = (1 − α) quantile of central Fl,n−rk(X) . The p value of this test is then P (F > f ∗ ). (Remember: the p value of a test is the probability to observe a value as given by the test statistic or something that’s more extreme, given the null hypothesis is true. We are able to reject the null hypothesis, if the p value is small) The power of a test is the probability that the null hypothesis is rejected given that the null hypothesis is false, i.e. P ( reject H0 | H0 false ) If the null hypothesis is false, the test statistic F has a non-central F distribution. We can therefore compute the power as P ( reject H0 | H0 false ) = P (F > f ∗ ) = 1 − Fl,n−rk(X),δ2 (f ∗ ) For F3,5 this gives a power function in δ 2 as sketched below: 0.0 0.2 0.4 prob 0.6 0.8 1.0 Power of F(3,5) test 0 20 40 60 80 100 delta The R code for the above graphic is delta <- seq(0,100,by=0.5) # varying non-centrality parameter fast <- qf(1-0.05,3,5) # cut-off value of central F distribution for alpha=0.05 plot(delta, 1-pf(fast, 3,5,delta),type="l",col=2,main="Power of F(3,5) test",ylab="prob",ylim=c(0,1)) 33 Normal Theory & Maximum Likelihood (as justification for least squares estimates) Definition 1.14 (ML estimates) Suppose r.v. U has a pmf or pdf f (u | θ). If U = u is observed and ˆ f (u | theta) = max f (u | θ) θ ˆ is the maiximum likelihood estimate of θ. theta For a normal Gauss Markov model: f (y | Xβ, σ 2 I) = = −1/2 1 (2π)−n/2 det σ 2 I exp − (y − Xβ)0 (σ 2 I)−1 (y − Xβ) = 2 1 −n/2 2 −n/2 (2π) (σ ) exp − 2 (y − Xβ)0 (y − Xβ). 2σ d d For fixed σ 2 this expression is maximized, if (y−Xβ)0 (y−Xβ) is minimal, i.e. Xβ M L = Xβ OLS = PX Y = Ŷ . 
2 For an ML estimate of σ consider the log likelihood function: log f (y | Xβ, σ 2 I) = − n 1 n log(2π) − log σ 2 − 2 SSE 2 2 2σ Then d n 1 log f (y | Xβ, σ 2 I) = 0 − 2 + 4 SSE dσ 2 2σ 2σ Therefore c2 M L = SSE/n = n − rk(X) M SE. σ n The MLE of (Xβ, σ 2 ) is (Ŷ , SSE/n). Note: the MLE of σ 2 is biased low: E(SSE/n) = 1/nE [(n − rk(X))M SE] = n − rk(X) 2 σ < σ2 . n Regression Analysis as special case of GLIMs A regression equation yi = β0 + β1 x1i + β2 x2i + ... + βr xri + i corresponds to Y = Xβ + , with X= 1 1 .. . x11 x12 x21 x22 ... 1 x1n x2n ... β0 xr1 β1 xr2 and β = β2 .. . xrn n×(r+1) βr Unless n is very small or we are extremely unlucky, matrix X has full rank. 34 Example: Biomass Data Source: Rick A. Linthurst: Aeration, nitrogen, pH, and salinity as factors affecting Spartina alterniflora growth and dieback. Ph.D. thesis, North Carolina State University, 1979. Description: These data were obtained from a study of soil characteristics on aerial biomass production of the marsh grass Spartina alterniflora, in the Cape Fear Estuary of North Carolina. Number of cases: 45 Variable Description Location Type Type of Spartina vegetation: revegetated areas, short grass areas, Tall grass areas biomass aerial biomass (gm−2 ) salinity soil salinity (o/oo) pH soil acidity as measured in water (pH) K soil potassium (ppm) Na soil sodium (ppm) Zn soil zinc (ppm) At first, we want to get an overview of the data. We will draw pairwise scatterplots and add smooth lines to try to get an idea of possible trends. 7 ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● 500 1000 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ●● 30000 ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● 2000 ● ● ●● ● ● ● ● ● 36 ●● 20000 ● ● ● ●● ● 32 10000 ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● 28 7 ● ● ●● ● 6 ● 6 ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● salinity ● 5 ● 24 4 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●●● ● ● 25000 10000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2500 ● ● ● ● ●● ●●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● 500 ● ● ● ● ● ● ● 24 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 28 32 ● ● ● ● 36 ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ●● ● ●● ● 400 Zn ● ● ●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● 800 ● ●● ● ● ●● ●● ● ● ●●● ● ● ● ● ●● ●● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ●● ● ●● ● ●●● ● ● ● ●● ● ● ● ● 1200 ● ● ● ● ● ● ● ● ● ●● ● ● ● 0 ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ●● ● ●●● ● ● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● Na ● ● ●● ● ●● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ●● ● ●● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1500 ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ●● ● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ●● ● ●● ● ● ●● ● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ●● ●● ●● ●● ● ● K ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● 
● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ●● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● 1200 ● ● ● ● ● ● ● 800 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 400 ● ● ●●● ● ● ● ● ●● ● 30 ● ●● ●●●●● ● ● ● ● ● ● ●● ● ● ● ●● 20 ●● ● ● ● ● ● ●● ● ● ● ● 10 pH ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● 0 4 5 ● 5 ● y 10 15 20 25 30 The strongest linear relationship between biomass (y) and any of the explanatory variables seems to be the one between y and pH. Salinity shows an almost zero trend with very noisy data, the trend between potassium, sodium, zinc and y is slightly negative. > biomass <- read.table("biomass.txt",header=T) > dim(biomass) # check whether read worked [1] 45 8 > names(biomass) # variables 35 [1] "Location" "Type" "biomass" "salinity" "pH" [7] "Na" "Zn" > y <- biomass$biomass > x <- biomass[,4:8] > > # get overview > points.lines <- function(x,y) { + points(x,y) + lines(loess.smooth(x,y,0.9),col=3) + } > pairs(cbind(x,y),panel=points.lines) "K" A check of the rank ensures, that we indeed have a full rank matrix X in the model: > X <- as.matrix(cbind(rep(1,45),x)) > qr(X)$rank # compute rank of X matrix [1] 6 Closer Look at the Hat matrix In order to get predictions in a linear model, we use the projection matrix PX . Usually, in a regression, we deal with hat matrix H, for which holds: Ŷ = HY, i.e. the hat matrix is identical to the projection matrix, H = PX = X(X 0 X)−1 X 0 . Sometimes, the diagonal elements hii are used as a measure of influence of observation i in the model. What can we say about hii ? Since H is a semi positive matrix, hii ≥ 0. Pn On the other hand, i=1 hii = tr(H) = rk(X) = r + 1. On average, hii = rk(X)/n = (r + 1)/n. as “influential”. It is standard to flag observations with hii > 2 · r+1 n Example: Biomass Data - Predictions and Influential points > X <- as.matrix(cbind(rep(1,45),x)) > qr(X)$rank # compute rank of X matrix [1] 6 > > H <- X %*% solve(t(X) %*% X) %*% t(X) # compute hat matrix > > hist(diag(H), main="Histogram of H_ii") > oi <- diag(H) > 6/length(y)*2 # flag influential points > sum(oi) [1] 2 A histogram of the diagonal elements in the hat matrix reveals, that almost all are well below the threshold of 2(r + 1)/n. Two points show up as being influential. 36 0 5 10 Frequency 15 20 Histogram of H_ii 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 diag(H) These points are marked red in the graphic below. Here’s the code for doing so. 
points.lines <- function(x,y) {
  points(x,y)
  points(x[oi],y[oi],col=2)    # mark influential points red
  lines(loess.smooth(x,y,0.9),col=3)
}
pairs(cbind(x,y),panel=points.lines)

[Figure: pairwise scatterplot matrix of salinity, pH, K, Na, Zn and biomass (y) with loess smooths; the two influential points are marked in red.]

The variance of both predicted values and residuals is connected to the values of the hat matrix, too:

    Var(Ŷ) = Var(HY) = σ²H,    therefore    Var(Ŷi) = σ² · hii.

An estimate of the standard deviation of Ŷi is then √MSE · √hii, with

    MSE = SSE / (n − (r + 1)).

For residuals e = Y − Ŷ, the variance structure is given as:

    Var(e) = Var(Y − Ŷ) = (I − H)Var(Y)(I − H)' = σ²(I − H),    therefore    Var(ei) = σ² · (1 − hii).

Adjusted residuals (standardized residuals) are given as

    e*_i = e_i / (√MSE · √(1 − hii)) ~ N(0, 1).

Both predicted values Ŷ and explanatory variables Xi, i = 1, ..., r, are independent of the residuals, i.e.

    Cov(Ŷ, e) = 0    and    Cov(Xi, e) = 0 for all i = 1, ..., r.

This is the reason for looking at residual plots. We do not want to see any structure or patterns in these plots.

Example: Biomass Data

> yhat <- H %*% y
> e <- (diag(rep(1,length(y))) - H) %*% y      # vector of residuals
> sse <- crossprod(y - yhat)
> estd <- e/(sqrt(sse/(length(y)-qr(X)$rank)) * sqrt(diag(H)))
> par(mfrow=c(1,2))
> plot(yhat,e)
> plot(yhat,estd)
> par(mfrow=c(1,5))
> for (i in 1:5) {
+   plot(X[,i+1],estd)
+   points(X[oi,i+1],estd[oi],col=2)           # mark influential points red
+   lines(loess.smooth(X[,i+1],estd,0.9),col=3)
+ }

In the plot of residuals versus predicted values, a slight asymmetry is apparent: positive residual values are larger than negative residuals. For standardized residuals the scale is -4 to 8. More residuals than expected under normality have absolute values over 4, indicating a poorly fitting model.
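For reference, the same diagnostics are available from R's built-in lm machinery. A small sketch, assuming the biomass data frame read in earlier; note that rstandard() divides by √(1 − hii), exactly as in the definition of e*_i above.

fit <- lm(biomass ~ salinity + pH + K + Na + Zn, data = biomass)

h <- hatvalues(fit)                      # diagonal elements h_ii of the hat matrix
which(h > 2 * (5 + 1) / nrow(biomass))   # the 2(r+1)/n rule for influential points

r.std <- rstandard(fit)                  # e_i / (sqrt(MSE) * sqrt(1 - h_ii))
par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit))
plot(fitted(fit), r.std)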
= βr = 0: Source of variation Regression (x1 , ..., xr ) Regression (x1 , ..., xp ) Regression (x1 , ..., xr ) | (x1 , ..., xp ) Error Total Sum of Squares SSRfull SSRreduced SSH0 SSEfull SST degrees of freedom rk(PX − P1I ) = r rk(PXp − P1I ) = p rk(PX − PXp ) = r − p n − (r + 1) n−1 Since σ1 (Y − Ŷ ) ∼ N (0, 1), we can apply Cochran’s theorem to the partition (2): since the ranks of the matrices on the right hand side add up to n − 1, the rank of the left hand side, Cochran’s Theorem tells us that the terms on the right are mutually independent and have chi square distributions with degrees of freedom p, r − p, and n − (r + 1) respectively. Other partitions exist, e.g. Type I sequential sum of squares: Y 0 Y − Y 0 P1I Y = Y 0 (PX1 − P1I )Y + Y 0 (PX2 − PX1 )Y +... Y 0 (PXr − PXr−1 )Y + Y 0 (I − PXr )Y . | {z } | {z } | {z } {z } | {z } | R(β0 ) | {z R(β1 |β0 ) R(β2 |β1 ,β0 ) SSE R(βr |β0 ,β1 ,...,βr−1 ) } (Y −Ȳ )0 (Y −Ȳ ) Again, Cochran’s theorem gives us independence of all the terms. Use R(β2 | β0 , β1 ) as numerator to test β2 = 0 in model of constant, X1 and X2 . Use R(βi | β0 , ..., βi−1 , βi+1 , ..., βr ) to test βi = 0 in full model. Factorial Model as special case of GLIMs Let A and B two factors in a linear model with I and J levels respectively. Consider the means model yijk = µij + ijk , for i = 1, ..., I , j = 1, ..., J , | {z } | {z } factor A factor B k = 1, ..., nij | {z } . repetition for treatment (i,j) Then µij is estimable as long as nij ≥ 1 (i.e. we have at least one observation for each treatment). In the following assume, that nij ≥ 1 for all i, j. Since all µij estimable, all linear combinations of µij are estimable. “Interesting” combinations are: y.j = yi. = µ.. = 1X µij = row j average mean I i 1X µij = column i average mean J j 1 X µij = grand average mean IJ i,j 40 Contrasts: “main effects” for factor A: = µ̄i. − µ̄.. µ̄i. − µ̄k. i = 1, 2, . . . , I i 6= k effect of level jβj = µ̄.j − µ̄.. difference in effects i and ` µ̄.j − µ̄.` j = 1, 2, . . . , J j 6= ` effect of level iαi difference in effects i and k “main effects” for factor B: Interaction Effects: = µij − (µ.. + µi. + µ.j ) αβij For these effects, sum restriction constraints hold, i.e.: X X X X αi = βj = αβij = αβij = 0 i j Different view on interaction effects: i 1 j 2 ... 1 2 . . . J I ij kj il kl If there is no interaction effect, the difference in effects for factor A is independent from the level in B: αβij − αβkj = αi − αk for all j = 1, ..., J Similarly, the difference in effects for factor B does not dependent on the level in A: αβij − αβi` = βj − β` for all i = 1, ..., I Without an interaction effect, the effects for factor B show as parallel lines: j=3 j=2 j=1 1 2 3 4 ... levels of A Often, the first test done in this setting, is to check H0 : αβij = 0 versus the alternative that at least one of the effects is non-zero. We could do that for H0 : αβij = 0 ⇐⇒ H0 : µij − (µ.. + µ.j + µi. ) = 0 More natural might be to consider an effects model yijk = µ + αi + βj + αβij + ijk 41 This model does not have full rank, though - with additional constraints, we can make it full rank: Baseline restrictions (last effects are zero) αI βJ αβiJ αβIj = 0 = 0 = 0 for all i = 1, . . . , I = 0 for all j = 1, . . . 
, J For I = 2 and J = 3, this gives estimates µij : Factor A i=1 i=2 j=1 µ11 = µ + α1 +β1 + αβ11 Factor B j=2 µ12 = µ + α1 +β2 + αβ12 j=3 µ13 = µ + α1 µ21 = µ + β1 µ22 = µ + β2 µ23 = µ A Means µ + α1 + +(β1 + β2 )/3+ +(αβ11 + αβ12 )/3 µ+ B Means µ+ α1 2 + β1 + αβ11 2 µ+ α1 2 + β2 + αβ12 2 µ+ β1 +β2 3 α1 2 Baseline restrictions (first effects are zero) α1 β1 αβi1 αβ1j = = = = 0 0 0 for all i = 1, . . . , I 0 for all j = 1, . . . , J For I = 2 and J = 3, this gives estimates µij : Factor A i=1 i=2 B Means j=1 µ11 = µ µ21 = µ + α2 µ + α2 /2 Factor B j=2 µ12 = µ + β2 j=3 µ13 = µ + β3 µ22 = µ + α2 + +β2 + αβ22 µ23 = µ + α2 + +β3 + αβ23 µ+ α2 2 + β2 + αβ22 2 µ+ α2 2 + β3 + A Means µ + (β2 + β3 )/3 µ + α2 + +(β2 + β3 )/3+ +(αβ22 + αβ23 )/3 αβ23 2 The corresponding full rank matrix X ∗ stemming from an effects model with imposed restrictions has then the form X ∗ = (1I | Xα∗ | Xβ ∗ | Xαβ ∗ ), where Xα∗ consists of I − 1 columns for effects of factor A, Xβ ∗ consists of J − 1 columns for effects of factor B and Xαβ ∗ has (I − 1)(J − 1) linearly independent columns for interaction effects. A test of H0 : αβij = 0 then translates to a goodness of fit test of the reduced model corresponding to (1I | Xα∗ | Xβ ∗ ) versus the full model corresponding to X ∗ . The sum of squares of the null hypothesis is then written as SSH0 = Y 0 (PX ∗ − P(1I|Xα∗ |Xβ∗ ) )Y, 42 where SSH0 /σ 2 has a χ2 distribution with (I − 1)(J − 1) degrees of freedom and non-centrality parameter δ 2 , where δ 2 = (Cβ)0 (C(X ∗ 0 X ∗ )− C 0 )−1 Cβ and C defined as 1 .. C= 0 . . (I−1)(J−1)×(1+(I−1)+(J−1)) 1(I−1)(J−1)×(I−1)(J−1) Under H0 δ 2 = 0. 43 Testing for effects How do we test hypothesis of the form H0 : αi = 0 for all i or, similarly, H0 : βj = 0 for all j? In a 2 by 3 factorial model using zero sum constraints we know: 1 1 (µ11 + µ12 + µ13 ) − (µ21 + µ22 + µ23 ) 6 6 The null hypothesis can then be written as: µ11 1 1 1 1 1 1 H0 : α1 = 0 ⇐⇒ H0 : ( , , , − , − , − ) ... 6 6 6 6 6 6 µ23 α1 = Similarly, β1 = µ+1 − µ++ = 1 1X 1 µ11 + µ21 − µij = 2 2 6 i,j 1 1 1 1 1 1 µ11 − µ12 − µ13 + µ21 − µ22 − µ23 , 3 6 6 3 6 6 1 1 1 1 1 1 = − µ11 + µ12 − µ13 − µ21 + µ22 − µ23 . 6 3 6 6 3 6 = β2 , which is used in a hypothesis test as H0 : 1 6 2 −1 −1 2 −1 −1 −1 2 −1 −1 2 −1 µ11 µ12 µ13 µ21 µ22 µ23 0 = 0 −1 To test these hypotheses, we can come up with SSH0 = (Cβ − d)0 (C(X 0 X)− C 0 ) (Cβ − d) Alternatively, we could come up with other test statistics: For hypothesis H0 : α = 0, we might test for factor A in different ways: test for A in model µ + αi∗ R(α∗ | µ) = Y 0 (P(1I|Xα∗ ) − P1I )Y ∗ ∗ test for A in model µ + αi + βj R(α∗ | µ, β ∗ ) = Y 0 (P(1I|Xα∗ |Xβ∗ ) − P(1I|β∗ ) )Y ∗ ∗ ∗ test for A in model µ + αi + βj + αβij R(α∗ | µ, β ∗ , αβ ∗ ) = Y 0 (P(1I|Xα∗ |Xβ∗ |Xαβ∗ ) − P(1I|β∗ |Xαβ∗ ) )Y Which of these sum of squares should (or could?) we consider. Since they all test for the same effect, all are in some sense valid. In a balanced design (and all orthogonal designs for that matter) it turns out, that all of these sums coincide anyway. For an unbalanced design, (i.e. nij > 0 for all i, j, but not equal), these sums give different results. 
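The simulation that follows uses a helper function P() for the orthogonal projection matrix onto the column space of its argument. The notes do not show its definition, so the version below is an assumption (the name P is taken from the code that follows; ginv() is from MASS, which is also used further below):

library(MASS)    # for the generalized inverse ginv()

# orthogonal projection matrix onto the column space of X: P_X = X (X'X)^- X'
P <- function(X) {
  X <- as.matrix(X)
  X %*% ginv(t(X) %*% X) %*% t(X)
}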
Simulation: get: For a small 2 × 3 example, with factors A and B on two and three levels, respectively, we j=1 2 3 µij based on i = 1 2.25 22.00 18.33 2 7.17 26.25 18.80 In this situation, the three sums of squares yield different values: > drop(t(Y) %*% (P(cbind(one,A)) - P(one)) %*% Y) [1] 58.90667 44 nij i=1 2 j=1 4 6 2 3 4 3 3 5 > drop(t(Y) %*% (P(cbind(one,A,B)) - P(cbind(one,B))) %*% Y) [1] 66.52381 > drop(t(Y) %*% (P(X) - P(cbind(one,B,A*B))) %*% Y) [1] 60.52246 where A <- c(rep(1,15), rep(-1,10)) B1 <- c(rep(1,5), rep(0,4), rep(-1,6), rep(1,3), B2 <- c(rep(0,5), rep(1,4), rep(-1,6), rep(0,3), B <- cbind(B1,B2) one <- rep(1,25) rep(0,3), rep(-1,4)) rep(1,3), rep(-1,4)) and Y [1] 21 18 24 10 21 30 23 28 24 13 8 -1 10 5 8 10 25 20 17 28 21 -3 13 0 -1 and SSH0 for H0 : α = 0 is > C <- c(0,1,0,0,0,0) > b <- ginv(t(X) %*% X) %*% t(X) %*% Y > SSH <- t(C %*% b) %*% solve( t(C) %*% ginv(t(X) %*% X) %*% C) %*% C %*% b > drop(SSH) [1] 60.52246 which is the same as the last of the three sum of squares above. This is generally true: SSH0 is the same as the sum of squares we get, by comparing the full model to a model without the factor of interest. Different software produces and regards different sum of squares - we need to know, which ones. SAS introduced the concepts of type I, type II, and type III sum of squares: • The Type I (sequential) sum of squares is computed by fitting the model in steps according to the order of the effects specified in the design and recording the difference in the sum of squares of errors (SSE) at each step. • A Type II sum of squares is the reduction in SSE due to adding an effect after all other terms have been added to the model except effects that contain the effect being tested. For any two effects F and F̃ , F is contained in F̃ if the following conditions are true: – Both effects F and F̃ involve the same covariate, if any. – F̃ consists of more factors than F . – All factors in F also appear in F̃ . • The Type III sum of squares for an effect F is the sum of squares for F adjusted for effects that do not contain it, and orthogonal to effects (if any) that contain it. If we assume the order A, B, and AB on effects, we get the following three types of sum of squares: Factor Type I A R(α∗ | µ) B R(β ∗ | µ, α∗ ) ∗ ∗ ∗ R(αβ | µ, α , β ) R(αβ ∗ | µ, α∗ , β ∗ )AB Type II Type III R(α∗ | µ, β ∗ ) R(α∗ | µ, β ∗ , αβ ∗ ) ∗ ∗ R(β | µ, α ) R(β ∗ | µ, α∗ , αβ ∗ ) ∗ ∗ ∗ R(αβ | µ, α , β ) The advantage of Type I sum of squares is their additivity: the sum of type I sum of squares adds to the total sum of square T SS = Y 0 (PX − P1I )Y = R(αβ ∗ , α∗ , β ∗ | µ). 45 The obvious disadvantage is the obvious order-dependency of type I sums. Changing the order of factors changes their effect. Type III sum of squares do have order invariance, but they do not add up to any interpretable value, nor are they independent of each other. The two properties - order invariance and additivity (sometimes called orthogonality) - usually do not occur together. Balanced designs do have them, though. Example For the simulated example above, type I sums of squares in R are given as: > ## Type I Sum of Squares: order A,B, AB > lmfit <- lm(Y~1+A+B+A*B) > anova(lmfit) Analysis of Variance Table Response: Y Df Sum Sq Mean Sq F value Pr(>F) A 1 58.91 58.91 1.8660 0.1879 B 2 1695.07 847.53 26.8475 2.91e-06 *** A:B 2 22.87 11.43 0.3622 0.7009 Residuals 19 599.80 31.57 --Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > > ## Type I Sum of Squares: order B,A, AB > lmfit <- lm(Y~1+B+A+A*B) > anova(lmfit) Analysis of Variance Table Response: Y Df Sum Sq Mean Sq F value Pr(>F) B 2 1687.45 843.73 26.7269 3.003e-06 *** A 1 66.52 66.52 2.1073 0.1629 B:A 2 22.87 11.43 0.3622 0.7009 Residuals 19 599.80 31.57 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Depending on the order of effects specified in the formula, the output varies. Type II and III sums of squares are produced by Anova in the package car: > Anova(lmfit,type="III") Anova Table (Type III tests) Response: Y Sum Sq Df F value Pr(>F) (Intercept) 5861.1 1 185.6638 2.946e-11 *** B 1679.4 2 26.6000 3.105e-06 *** A 60.5 1 1.9172 0.1822 B:A 22.9 2 0.3622 0.7009 Residuals 599.8 19 --46 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > Anova(lmfit,type="II") Anova Table (Type II tests) Response: Y Sum Sq Df F value Pr(>F) B 1695.07 2 26.8475 2.91e-06 *** A 66.52 1 2.1073 0.1629 B:A 22.87 2 0.3622 0.7009 Residuals 599.80 19 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 47 Related to the problem of unbalanced designs is the problem of how to handle experiments that are not complete, i.e. designs, in which nij = 0 for some is and js. In a small factorial 2 × 3 design, we might not have any values for cell (2, 3). Clearly, this means that µ23 is note estimable. On top of that, we cannot estimate anything involving µ23 , such as µ.3 or µ2. . A work-around would, of course be, to estimate the row and column averages based on only the information we have, i.e. 1 µ2. = (µ21 + µ22 ) 2 Similarly, by using αβij = µij − µil − (µkj − µkl ) we can find estimates for interaction effects as long, as there are four cells with data (in the corners of an imaginary rectangle). Generally, the following is true: In an experiment with K non-empty cells (K < IJ), we can get matrix Xn×K . Provided K is big enough (at least K ≥ 1 + (I − 1) + (J − 1)) and the pattern of empty cells is not nasty (i.e. there are not whole rows or columns missing), all parameters µ, α∗ and β ∗ are estimable and we can construct matrix X ∗ = (1I | Xα∗ | Xβ ∗ ). A test on interaction effects is the done (in the reduced/full model framework): H0 : all estimable interaction effects are zero ∗ )Y , and SSH0 = Y 0 (PX − PX F = SSH0 /(K − 1 − (I − 1) − (J − 1)) ∼ FK−1−(I−1)−(J−1),n−K Y 0 (I − PX )Y /(n − K) Under the assumption that all interaction effects are zero, we can get estimates for µ, α∗ and β ∗ out of a minimum of 1 + (I − 1) + (J − 1) cells by moving “hand over hand” from one level to the next: α1 α2 α3 α4 α5 β1 β2 β3 β4 β5 2 Nonlinear Regression So far, we have been dealing with models of the form Y = Xβ + , i.e. models that were linear as functions of the parameter vector β. The new situation is that Y = f (X, β) + , where f is a known function, non-linear in β1 , β2 , ..., βk and in some sense “well behaved” (we want f to be at least continuous, at a later stage we will also want it to be differentiable, ...) 48 Example: Chemical Process Assume, we have an irreversible chemical progress of reactors A, B, and C: θ2 θ1 C B→ A→ Let A(t) denote the amount of A at time t (and similarly, we have B(t) and C(t)). θ1 and θ2 are the reaction rates, and are typically at the center of the problem, i.e. we want to find estimates for them. Then we get the set of differential equations: d A(t) = −θ1 A(t) dt d B(t) = θ1 A(t) − θ2 B(t) dt d C(t) = θ2 B(t) dt with A(0) = 1, B(0) = C(0) = 0. 
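Before solving this system analytically, one can integrate it numerically to see how B(t) behaves over time. This is only a side check; it assumes the deSolve package, which is not used elsewhere in these notes, and uses the parameter values 0.3 and 0.4 from the simulation further below.

library(deSolve)

reaction <- function(t, state, pars) {
  with(as.list(c(state, pars)), {
    dA <- -theta1 * A
    dB <-  theta1 * A - theta2 * B
    dC <-  theta2 * B
    list(c(dA, dB, dC))
  })
}

out <- ode(y = c(A = 1, B = 0, C = 0), times = seq(0, 10, by = 0.1),
           func = reaction, parms = c(theta1 = 0.3, theta2 = 0.4))
head(out)    # B(t) starts at 0, rises to a maximum and then slowly decays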
Calculus tells us that B(t) can be written as B(t) = B(t, θ1 , θ2 ) = θ1 e−θ1 t − e−θ2 t θ1 − θ2 0.4 Usually, we observe yi at time ti where yi = B(ti , θ1 , θ2 ) + i and we want to find estimates of θ1 and θ2 . Simulation example: ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ● ● ● B 0.2 0.3 ● 0.1 ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● 0.0 ● ● ● ● ● ● ● 0 2 4 6 8 10 t For the simulation shown in the graphic, the parameters have been chosen as θ1 = 0.3 and θ2 = 0.4, and ∼ M V N (0, σ 2 I), with σ = 0.05. The R code for that is > > + + > > > > > > > > # function B(t) with default parameters specified f <- function (x,theta1=0.3,theta2=0.4) { return (theta1/(theta1-theta2)*(exp(-theta2*x) -exp(-theta1*x))) } # assuming measurements at every 0.05 units between 0 and 10 t <- seq(0,10,by=0.05) e <- rnorm(length(x),mean=0,sd=0.05) # simulated observations B <- f(t)+e 49 > plot(t,B) > > # overlay graph of B(t) > points(t,f(t),type="l",col=2) Using a non-linear least squares estimation in R, we can get a good estimate for θ1 and θ2 : # finding non-linear least squares estimates fit <- nls(B ~ f(t,theta1=t1,theta2=t2), start=list(t1=0.55,t2=0.45),trace = TRUE ) 1.458538 : 0.55 0.45 0.7911527 : 0.2248839 0.4364552 0.4879797 : 0.2985262 0.4087470 0.485227 : 0.3007713 0.4013401 0.4852263 : 0.3008083 0.4014700 > > summary(fit) Formula: B ~ f(t, theta1 = t1, theta2 = t2) Parameters: Estimate Std. Error t value Pr(>|t|) t1 0.300808 0.009702 31.00 <2e-16 *** t2 0.401470 0.007745 51.84 <2e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.04938 on 199 degrees of freedom Correlation of Parameter Estimates: t1 t2 0.284 50 For a non-linear model yi = f (xi , β) + i least squares estimates of β can be found by minimizing X (yi − f (xi , β))2 (∗) i This expression does usually not have a closed form solution. We can find solutions using numerical analysis: 1. grid based search: for low dimensional problems (in terms of the dimension of vector β) we might use a grid of possible values b for β and take the minimum as our solution. Sometimes several steps of this grid based search are done: The disadvantage of this method is its complexity in terms of computation: since the method is multiplicative in the number of parameters, it becomes computationally infeasible quickly. 2. Gradient Method (Hill Climbing, Greedy Algorithm) To minimize (*) the idea is to find a zero point of the first derivative of f w.r.t. β: 0 0k×1 = Dk×n (Yn×1 − f (X, β)n×1 ) with D= f (x1 , β) ! ∂ f (x2 , β) f (xi , β) and f (X, β) = .. ∂βj . β=b f (xn , β) In the case of a linear model f (X, β) = Xβ, with f (xi , β) = x0i β and D= ! ∂ f (xi , β) = (xij ) = X ∂βj β=b 0 Therefore 0k×1 = Dk×n (Yn×1 − f (X, β)n×1 ) becomes 0 = X 0 Y − X 0 Xb - the normal equations! A standard method of solving g(z) = 0 is the 51 Gauss-Newton Algorithm 1. Identify some starting value z ∗ 2. Find a linear approximation g̃ of g(z) in point z = z ∗ . 3. For linear approximation find z ∗∗ with g̃(z ∗∗ ) = 0. 4. Replace z ∗ by z ∗∗ and repeat from (2) until convergence, i.e. until z ∗ is reasonably close to z ∗∗ . 
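A minimal sketch of these four steps for a scalar function: for differentiable g, the linear approximation in z* is g̃(z) = g(z*) + g'(z*)(z − z*), so step (3) gives z** = z* − g(z*)/g'(z*). The function g below is made up purely for illustration.

g  <- function(z) z^3 - 2 * z - 5      # illustrative example, root near 2.0946
dg <- function(z) 3 * z^2 - 2          # its derivative

z <- 3                                 # (1) starting value
repeat {
  z.new <- z - g(z) / dg(z)            # (2)+(3) zero of the linear approximation
  if (abs(z.new - z) < 1e-10) break    # (4) stop when z* and z** are close
  z <- z.new
}
z
g(z)                                   # essentially zero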
x z1 x z2 x z3 x z z4 Obviously: • The choice of starting value is critical • If function g is monotone the algorithm converges: • Convergence is slow, if function is flat • Results might jump between several solutions: any starting point in the shaded area to the left or right of the x axis below will lead to a series jumping between the left and right hand side. Math in more detail: Let b0 = (b01 , ..., b0p )0 the starting value, br := parameter after the rth iteration, and ∂ Dr := ∂β f (x , β) r i j β=b A linear approximation of f (x, b) is (Taylor expansion): f (X, β) ≈ f (X, br ) + Dr (β − br ) with Y = f (X, β) + ≈ f (X, br ) + Dr (β − br ) + . Then the approximation gives rise to a linear model: Y − f (X, br ) = |{z} Dr (β − br ) + {z } | | {z } ∗ X Y∗ γ The ordinary least squares solution helps to find an iterative process: 0 0 γ̂OLS = (Dr Dr )−1 Dr (Y − f (X, br )) | {z } β−br This suggests 0 0 br+1 = br + (Dr Dr )−1 Dr (Y − f (X, br )) for the iteration. 52 The iteration step 0 0 br+1 = br + (Dr Dr )−1 Dr (Y − f (X, br )) is repeated until convergence. For convergence, we need to define a stopping criterion. There are different options: 1. by comparing the solutions of step r and r + 1 we get an idea of how much progress the algorithm still does. If the relative difference ! r+1 b − brj j max br + c j j is sufficiently small, we can stop. c can be chosen as any small number (it’s a technicality to prevent us from dividing by zero) 2. “Deviance”: SSE r := X 2 (yi − f (X, br )) , i the error sum of squares after r iterations. Stop, if SSE r+1 is close to SSE r . For the non-linear model y = f (x, β) + we can find β̂ by using Gauss-Newton: Guess some starting value b0 . Then iterate where Dr = ∂ ∂βj f (xi , β) β=β r br+1 = br + (Dr 0 Dr )−1 Dr 0 (Y − f (X, br )), . Inference for β̂, σ̂ 2 ? Large Sample Inference Assumption ∼ M V N (0, σ 2 I), then (bOLS , SSE/n) are MLE of (β, σ 2 ). The following holds: 1. the ordinary least squares estimate bols ahs an approximate multivariate normal distribution: ! ∂ . bOLS ∼ M V Nk (β, σ 2 (D0 D)−1 ) with D = f (xi , b) ∂bj b=β 2. mean squared error is approximately chi square M SE = SSE = σ̂ 2 n−k and SSE . 2 ∼ χn−k n−k 3. estimating derivative D̂ = ! ∂ f (xi , b) ∂bj b=bOLS 53 4. for some smooth differentiable function h : IRk → IRq we get (Delta method): . 0 2 −1 h(bOLS ) ∼ M V N (h(β), σ G(D D) 0 G) with G = ! ∂ hi (b) ∂bj b=β 5. Estimating G: ! ∂ f (xi , b) ∂bj b=bOLS Ĝ = Inference fro single βj From 1) bOLSj − bj . q ∼ N (0, 1) σ (D0 D)−1 jj From 2 and 3) √ bOLSj − bj . q ∼ tn−k −1 0 M SE (D̂ D̂)jj hypothesis test: H0 : βj = #: T =√ (1 − α) C.I. for βj : bOLSj − # q M SE (D̂0 D̂)−1 jj . then T ∼ tn−k q √ bOLSj ± t1−α/2,n−k M SE (D̂0 D̂)−1 jj Inference for single mean f (X, β) . For function f : IRk → IR we get (from 4): 2 0 −1 f (x, bOLS ) ∼ N (f (x, β), σ G(D D) 0 G ), with Ĝ = ! ∂ f (x, b) ∂bj b=bOLS A (1 − α) C.I. for f (x, β) is then: q √ f (x, bOLS ) ± t1−α/2,n−k M SE Ĝ(D̂0 D̂)−1 Ĝ0 Prediction In the future we observe y ∗ independently from y1 , ..., yn with mean h(β) and variance γσ 2 , e.g. h(β) = f (x, β) and γ = 1 for one new observation at x; h(β) = f (x1 , β) − f (x2 , β) and γ = 2 for the difference between of observations at x − 1 and x2 . Then (1 − α) prediction limits are given as: q h(bOLS ) ± t1−α,n−k sqrtM SE γ + Ĝ(D̂0 D̂)−1 Ĝ0 54 Inference based on large sample behavior: Profile likelihood Again, ∼ M V N (0, σ 2 I). 
Then L(β, σ 2 | Y ) = (2π)−n/2 1/σ 2 n/2 exp −1/(2σ 2 ) X 2 (yi f (xi , β)) i and the log-likelihood is `(β, σ 2 | Y ) = log L(β, σ 2 | Y ) = −n/2 log(2π) − n log σ − 1/(2σ 2 ) X 2 (yi f (xi , β)) i Idea of profile likelihood: assume, the parameter vector θ can be split into two parts, θ= θ1 θ2 with θ1 ∈ IRp , θ2 ∈ IRr−p . We then can write the log-likelihood function ` as `(θ) = `(θ1 , θ2 ). The idea of profile likelihoods is that we want to draw inference on θ1 and ignore θ2 . Suppose now that for every (fixed) θ1 the function θ̂2 (θ1 ) maximizes the loglikelihood `(θ1 , θ̂2 (θ1 )) =: `∗ (θ1 ). A (large sample) 1 − α confidence set for θ1 is: 1 θ1 | `(θ1 , θ2 (θ1 )) > θM L − cα 2 where cα is the 1 − α quantile of χ2p . The confidence set for θ1 is therefore the set of all parameters within a cα range of the absolute maximum (given by θM L ). The function `∗ (θ1 ) is also called the profile log-likelihood function for θ1 l(θML) - 1/ 2 χp 2 θ2 l(θML) - 1/ 2 χr 2 x θML (1−α) confidence region for θ (1−α)confidence interval for θ1 θ1 Use this in nonlinear model: Application # 1: θ1 = σ 2 For fixed σ 2 , the loglikelihood `(β, σ 2 ) is maximized by bOLS . Then the profile log likelihood for σ 2 is: `∗ (σ 2 ) = `(bOLS , σ 2 ) = − n n SSE log 2π − log σ 2 − 2 2 2σ 2 55 l(b OLS , l(b OLS , SSE/ n) \ ` (β, σ 2 )M L − `∗ (σ 2 ) = } n SSE n log − 2 n 2 n SSE 2 = χ21,1−α − − log σ − 2 2σ 2 σ2) 1/ 2 χ1 2 = − . This is an alternative to SSE/σ 2 ∼ χ2n−k SSE/ n (1−α)confidence interval for σ2 σ2 Application # 2: θ1 = β For given β the likelihood `(β, σ 2 ) is maximized by σ̂ 2 (β) = 1X 2 (yi f (xi , β)) n i The profile log likelihood function for β is `∗ = `(β, σ̂ 2 (β)) and \ ` (β, σ 2 )M L n SSE n n − `∗ (β) = − log + log σ̂ 2 (β) = 2 n 2 2 ! log X 2 (yi f (xi , β)) − log i An approximate confidence region for β is (see picture) ( ) X 1 2 2 β | log (yi f (xi , β)) − log SSE < χk,1−α = n i ( ) X 1 2 2 = β| (yi f (xi , β)) < SSEe n χk,1−α i β2 (β1,β2) where n σ2(β) is not m uch larger than SSE b OLS β1 In a linear model the exact confidence region is the “Beale” region: ( ) X k 2 (yi f (xi , β)) < SSE(1 + β| Fk,n−k,1−α ) n−k i This is carried over directly to non-linear models. 56 X i 2 (yi f (xi , bOLS )) 3 Mixed Linear Models Another extension of linear models: introduce term to capture “random effects” in the model. A mixed effects model has the following form Yn×1 = Xn×p βp×1 + Zn×q uq×1 + n×1 , X, Z where β u, We will make are matrices of known constants, is a vector of unknown parameters, vectors of unobservable random variables. standard assumptions: E = 0 Eu = 0 V ar = R V aru = G where R and G are known except for some parameters, called the variance components. and u are assumed to be independent, i.e. cov(u, ) = G 0 0 R Then EY V arY 3.1 = E [Xβ + Zu + ] = Xβ = V ar [Zu + ] = V arZu + V ar = ZGZ 0 + R Example: One way Random Effects Model Let some batch process making widgets. Randomly select three batches (j = 1, 2, 3). Out of these, randomly select two widgets (i = 1, 2). Measure hardness yij = measured hardness of widget i from batch j. 
Could be modeled as yij = µ |{z} + overall hardness αj |{z} + random effect of batch j ij |{z} within batch of effect Assumptions: α1 E α2 = α3 E = α1 0 V ar α2 = σα2 I3×3 = G = V aru α3 0 V ar = σ 2 I6×6 = R Then y11 y21 y12 y22 y31 y32 = 1 1 1 1 1 1 µ + 57 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 α1 α2 + α3 With that, expected value and variance of Y is: EY = µ V arY 0 2 = ZGZ + R = σα σ 2 + σα2 σα2 = N 1 1 0 0 0 0 σα2 2 σ + σα2 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 + σ2 I = σ 2 + σα2 σα2 σα2 2 σ + σα2 σ 2 + σα2 σα2 σα2 σ + σα2 2 O = σα2 I3×3 J2×2 + σ 2 I6×6 is called the Kronecker product: An×m O a11 B Br×s = a21 B .. . a12 B .. . ... mr×ns V arY is not an Aitken matrix, because we have two unknown parameters σ 2 and σα2 . 58 3.2 Example: Two way Mixed Effects Model without Interaction 2 analytical chemists; each make 2 analyses on 2 specimen (each specimen is cut into four parts). yijk is one result from these analyses (e.g. the content of some component A) for the kth analysis of specimen i done by chemist j. We can model this by yijk = µ |{z} + average content + αi |{z} random effect of specimeni bj |{z} +ijk fixed effect of chemistj Assumptions are E α1 α2 α1 α2 = σα2 I2×2 = G = 0 and V ar E = 0 and V ar = σ 2 I = R 1 1 0 0 1 1 0 0 Then y111 y112 y121 y122 y211 y212 y221 y222 = 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 µ β1 + β2 Then EY = Xβ and V arY = σα2 ZZ 0 + σ 2 I = σα2 I2×2 3.3 N 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 α1 + α2 J4×4 + σ 2 I Estimation of Parameters Assume Y ∼ M V N (Xβ, V ) with V = var(Y ) = R + ZGZ 0 with variance components σ12 , σ22 , ..., σp2 . Consider the normal likelihood function: L(Xβ, σ 2 ) = f (Y | Xβ, σ 2 ) = p = (2π)−n/2 1/ | det V (σ 2 )| exp 1/2(Y − Xβ)0 V (σ 2 )−1 (Y − Xβ) For fixed σ 2 = (σ12 , σ22 , ..., σp2 ) this is maximized by weighted linear squares in Xβ: d 2 ) = Ŷ ∗ (σ 2 ) = X(X 0 V −1 X)− X 0 V −1 Y Xβ(σ Plugging this into L(Xβ, σ 2 ) gives profile likelihood for vector σ12 , σ22 , ..., σp2 of variance components: L∗ (σ 2 ) = L(Ŷ ∗ (σ 2 ), σ 2 ) There is no closed form for a maximum of this function, i.e. we need iterative procedure, such as GaussNewton. At every step we need to find the inverse of V (σ 2 ) to get L∗ . This is a computation of order p3 , i.e. it becomes computationally infeasible fairly quickly as the number of variance components increases. Even if it is possible to estimate variance components - remember MLE of variances are biased - and underestimate: 59 Small Example For Y ∼ N (µ, σ 2 ) with Y = 1 1 .. . 2 µ + Maximum likelihood gives σ̂M L = 1 n P i (yi − ŷi )2 , yet we know 1 P 1 2 that s2 = n−1 (y − ŷ ) is an unbiased estimator of the variance. i i i There are several ways to try to “fix” the biasedness of ML estimates - the one that we will be looking at leads to restricted maximum likelihood estimates: The idea is to replace Y by Y − ŶOLS , i.e. E[Y − Ŷ ] = 0, then do ML. Doing so, removes all fixed effects from the model. In the small example this gives: Y − Ŷ ∼ M V N (0, (I − n1 J)σ 2 I(I − n1 J)), since Ŷ = PX Y = 1I(1I0 1I)−1 1I0 Y = 1 n Jn×n Y . The covariance matrix of Y − Ŷ does not have full rank: n−1 − n1 ... − n1 n .. − 1 n−1 . n n cov(Y − Ŷ ) = . . .. − 1 .. n − n1 ... − n1 n−1 n For finding a ML drop the last row of the residual vector to Y1 − Ȳ .. e= . 
make the model full rank: Yn−1 − Ȳ ML estimation for σ 2 based on this vector turns out to be n 2 σM Le = 1 X (yi − ȳ)2 , n − 1 i=1 A generalization of this leads to REML (restricted maximum likelihood estimates) of variance components: Estimates of variance components Let B ∈ Rm×n with rk(B) = m = n − rk(X) and BX = 0. Define r = BY . The REML estimate of σ 2 is maximizer of likelihood based on r: −1/2 Lr (σ 2 ) = f (r | σ 2 ) = (2π)−m/2 det BV (σ 2 )B 0 exp −1/2 r0 (BV (σ 2 )B 0 )−1 2 2 2 σ̂REM L is larger than σM L , σREM L does not depend on the specific choice of B (all B fulfilling the above conditions will yield same estimate). 60 Estimable functions Cβ Estimability of functions only depends on the form of X, i.e. c0 β is estimable if c ∈ C(X 0 ). The BLUE in model Y ∼ M V N (Xβ, V ) is given as weighted least squares estimate: 0 −1 c Cβ X)− X 0 V −1 Y W LS = C(X V C=AX = AŶ (σ 2 ) with variance 0 −1 c V ar(Cβ X)− C 0 W LS ) = C(X V This is not useful in practice, because we usually do not know σ 2 . Plugging in an estimate σ̂ 2 gives us estimates, which generally are neither linear nor have minimal variance: \ 0 −1 c Cβ X)− X 0 V̂ −1 Y W LS = C(X V̂ C=AX = AŶ (σ̂ 2 ) and \ 0 −1 c V ar(Cβ X)− C 0 W LS ) = C(X V̂ Predicting random vector u Remember: for multivariate normal variables X and Y the conditional expectation is E[Y | X] = E[Y ] + cov(Y, X)V ar(X)−1 (X − E[X]) (see Rencher Theorem 4.4D for a proof) E[u | Y ] = E[u] + cov(Y, u)V −1 (Y − Xβ) = GZ 0 V −1 (Y − Xβ), since cov(u, Y ) = cov(u, Xβ + Zu + ) = cov(u, Zu + ) = cov(Iu, Zu) + 0 = IGZ 0 + 0 = GZ 0 For the predictions of u we then get: û d = GZ 0 V −1 (I − X(X 0 V −1 X)− X 0 V −1 ) Y = GZ 0 V −1 P Y = GZ 0 V −1 (Y − Xβ) {z } | P û is linear in Y . û is the best linear unbiased predictor (BLUP) of u. ˆ = ĜZ 0 P̂ Y to make expression useful. Use û û is best in the sense, that var(u − û) has minimal variance over the choice of linear predictions û with E û = 0. For the BLUP, V ar(u − û) = G − GZ 0 P ZG. ˆ and hope Both û and V ar(u − û) depend on σ 2 - after estimating the variance via ML or REML we get û that \ V ar(u − û) = Ĝ − ĜZ 0 P̂ Z Ĝ is a sensible estimate for the variance. Example: Ergometrics experiment with stool types Four different types of stools were tested on nine persons. For each person the “easiness” to get up from the stool was measured on a Borg scale. Variables: effort effort (Borg scale) required to arise from a stool. Type factor of stool type. Subject factor for the subject in the experiment. 61 library(lattice) trellis.par.set(theme=col.whitebg()) plot(ergoStool) ● T1 2 T2 ● 7 Subject ● ● 6 ● 9 ● 4 ● 5 8 T4 ● 1 3 T3 ● ● 8 10 12 14 Effort required to arise (Borg scale) What becomes apparent in the plot is, that for different subjects we have different values on the Borg scale. The overall range is similar, though. The different types of stools seem to get a similar ordering by all of the test persons. This indicates that the stool effect will be significant in the model. We model these data by a mixed effects model: yij = µ + βj + bi + ij , i = 1, ..., 9, j = 1, ..., 4 with bi ∼ N (0, σb2 ), ij ∼ N (0, σ 2 ) where µ is the overall “easiness” to get up from the stool, βj is the effect of the type of stool, and bi is the random effect for each person. We want to draw inference for the whole population - this is why we think of the test persons as a sample of the population, and treat bi as random. 
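In the notation of the mixed model section, each subject contributes a block Zi = 1 (a column of four 1s) to Z, so the four measurements on one subject have marginal covariance σb² J4×4 + σ² I4×4 (compound symmetry within subject). A small sketch of this structure; the two standard deviations are chosen close to the REML estimates reported further below, purely for illustration.

sigma.b <- 1.33   # between-subject standard deviation (illustrative value)
sigma   <- 1.10   # within-subject standard deviation (illustrative value)

Z <- matrix(1, nrow = 4, ncol = 1)       # random-intercept design for one subject
G <- matrix(sigma.b^2, 1, 1)
R <- sigma^2 * diag(4)

V <- Z %*% G %*% t(Z) + R                # Var(Y_i) = Z G Z' + R
V
cov2cor(V)   # all off-diagonal correlations equal sigma.b^2 / (sigma.b^2 + sigma^2)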
For each subject we get a subset model i = 1, 2, ..., 9: yi1 1 1 0 0 0 1 yi2 1 0 1 0 0 1 yi3 = 1 0 0 1 0 β + 1 ui + i yi4 1 0 0 0 1 1 62 The matrix X for each subject is singular: the last three columns add to the first. We need to define some side conditions to make the problem doable computationally: Helmert Contrasts Helmert contrasts is the default in Splus. In Helmert Contrasts, the jth linear combination is the difference between level j + 1 and the average of the first j. The following example returns a Helmert parametrization based on four levels: > options(contrasts=c("contr.helmert",contrasts=T)) > contrasts(ergoStool$Type) [,1] [,2] [,3] T1 -1 -1 -1 T2 1 -1 -1 T3 0 2 -1 T4 0 0 3 Helmert contrasts have the advantage of being orthogonal, i.e. estimates will be independent. For each subject the model matrix looks like this: > model.matrix(effort~Type,data=ergoStool[ergoStool$Subject==1,]) (Intercept) Type1 Type2 Type3 1 1 -1 -1 -1 2 1 1 -1 -1 3 1 0 2 -1 4 1 0 0 3 attr(,"assign") [1] 0 1 1 1 attr(,"contrasts") attr(,"contrasts")$Type [1] "contr.helmert" Interpretation: for Helmert contrasts β1 is overall mean effort of stool types, β2 is difference between T2 and T1, β3 is difference between T3 and and average effects of T1 and T2, ... Model Fit Also included in the nlme package is the function lme, which fits a linear mixed effects model. > fm1Stool <- lme(effort~Type, random = ~1|Subject, data=ergoStool) > summary(fm1Stool) Linear mixed-effects model fit by REML Data: ergoStool AIC BIC logLik 139.4869 148.2813 -63.74345 Random effects: Formula: ~1 | Subject (Intercept) Residual StdDev: 1.332465 1.100295 Fixed effects: effort Value (Intercept) 10.250000 Type1 1.944444 Type2 0.092593 Type3 -0.342593 ~ Type Std.Error 0.4805234 0.2593419 0.1497311 0.1058759 DF t-value p-value 24 21.330905 0.0000 24 7.497610 0.0000 24 0.618392 0.5421 24 -3.235794 0.0035 63 Correlation: (Intr) Type1 Type2 Type1 0 Type2 0 0 Type3 0 0 0 Standardized Within-Group Residuals: Min Q1 Med Q3 -1.80200344 -0.64316591 0.05783115 0.70099706 Max 1.63142053 Number of Observations: 36 Number of Groups: 9 The figure below shows effort averages for both stool types and subjects: T2 11 T3 3 7 6 9 10 mean of effort 12 2 1 4 9 T4 T1 5 8 Subject Type Factors Alternative side conditions: Treatment Contrasts > options(contrasts=c("contr.treatment",contrasts=T)) > contrasts(ergoStool$Type) T2 T3 T4 T1 0 0 0 T2 1 0 0 T3 0 1 0 T4 0 0 1 > model.matrix(effort~Type,data=ergoStool[ergoStool$Subject==1,]) (Intercept) TypeT2 TypeT3 TypeT4 1 1 0 0 0 2 1 1 0 0 64 3 1 0 4 1 0 attr(,"assign") [1] 0 1 1 1 attr(,"contrasts") attr(,"contrasts")$Type [1] "contr.treatment" 1 0 0 1 > > fm2Stool <- lme(effort~Type, random = ~1|Subject, data=ergoStool) > summary(fm2Stool) Linear mixed-effects model fit by REML Data: ergoStool AIC BIC logLik 133.1308 141.9252 -60.5654 Random effects: Formula: ~1 | Subject (Intercept) Residual StdDev: 1.332465 1.100295 Fixed effects: effort ~ Type Value Std.Error (Intercept) 8.555556 0.5760122 TypeT2 3.888889 0.5186838 TypeT3 2.222222 0.5186838 TypeT4 0.666667 0.5186838 Correlation: (Intr) TypeT2 TypeT3 TypeT2 -0.45 TypeT3 -0.45 0.50 TypeT4 -0.45 0.50 0.50 DF t-value p-value 24 14.853079 0.0000 24 7.497609 0.0000 24 4.284348 0.0003 24 1.285304 0.2110 Standardized Within-Group Residuals: Min Q1 Med Q3 -1.80200341 -0.64316592 0.05783113 0.70099704 Number of Observations: 36 Number of Groups: 9 Checking Residuals > plot(fm2Stool, col=1) > intervals(fm1Stool) Approximate 95% confidence intervals Fixed effects: lower 
(Intercept) 7.3667247 TypeT2 2.8183781 TypeT3 1.1517114 TypeT4 -0.4038442 est. 8.5555556 3.8888889 2.2222222 0.6666667 upper 9.744386 4.959400 3.292733 1.737178 65 Max 1.63142052 attr(,"label") [1] "Fixed effects:" Random Effects: Level: Subject lower est. upper sd((Intercept)) 0.7494109 1.332465 2.369145 Within-group standard error: lower est. upper 0.8292432 1.1002946 1.4599434 ● ● ● 1 ● ● Standardized residuals ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ●● −1 ● ● ● ● 8 10 12 14 Fitted values (Borg scale) The residual plot does not show any outliers nor any remaining pattern - the model seems to fit well. Measuring the precision of σ 2 Since all the estimates are based on ML and REML, we could use ML theory to find variances - but that’s hard. Instead, we will use large sample theory for MLEs. Idea: Let θ be a q dimensional vector of parameters with likelihood function `(θ) maximized by θ̂. Then −1 ∂2 [ V arθ̂ = − 2 `(θ) =V ∂ θi θj θ=θ̂ A confidence interval for θi is: θˆi ± z p Vii Application to estimation of σ 2 : Let γ be a vector of the log variance components: γ = (γ1 , ..., γq ) = (log σ12 , ..., log σq2 ) With likelihood functions `(β, σ 2 ) and restricted likelihood function `∗r (σ 2 ), we get corresponding likelihood functions `(β, γ) `∗r (γ) = `(β, eγ ) = `∗r (eγ ) 66 Construct covariance matrix: Let M= M11 M21 M12 M22 with M11 = M21 = ! ∂2 `(β, γ) ∂ 2 γi γj [ (β,γ) ML q×q ! 2 ∂ `(β, γ) 2 ∂ βi γj [ (β,γ) ML p×q 67 M12 = M22 = ! ∂2 `(β, γ) ∂ 2 γi βj [ (β,γ) ML q×p ! 2 ∂ `(β, γ) 2 ∂ βi βj [ (β,γ) ML p×p With M containing second partial derivatives of the parameters, we can define Q to be an estimate of the variance covariance structure by using Q = −M −1 . Confidence limits for γi are then p γˆi ± z qi i which yields approximate confidence limits for σi2 as (eγˆi −z √ qi i , eγˆi +z √ qi i ) Similarly, for REML estimates, we set up matrix ! ∂ 2 ∗ ` (γ) ∂ 2 γi γj γ bREM L Mr = q×q A variance-covariance structure is then obtained in the same way as before for ML estimates. Both of these versions are implemented in R command lme in package nlme. You can choose between REML and ML estimates by specifying the option method = "REML" or method = "ML" in the call of lme. Default are REML estimates. 3.4 Anova Models A different (and more old-fashioned) approach on variance estimation is inference based on anova tables. Let M S1 , ..., M Sl be a set of independent random variables with dfi M Si ∼ χ2dfi E[M Si ] For convenience, we will use EM Si for E[M Si ]. The idea now is to write variances as linear combinations of these MSs. Let s2 be some linear combination of these MSs: s2 = a1 M S1 + a2 M S2 + ... + al M Sl S 2 then has the following properties: Es2 = X ai EM Si i 2 V ars = X a2i V arM Si = i X a2i i EM Si dfi 2 M Si V ar dfi = E[M Si ] | {z } 2dfi = X i a2i EM Si dfi 2 · 2dfi = 2 X i 2 a2i (EM Si ) dfi We have now several choices for an estimate of the variance of V ars2 : • plug in M Si as an estimate of EM Si : V\ ars2 = 2 X i 68 a2i M Si2 dfi 2 Si ) • more subtle: since E[M Si2 ] = V arM Si + (EM Si )2 = 2a2i (EM + (EM Si )2 = (EM Si )2 dfi have therefore dfi (EM Si )2 = E[M Si2 ] dfi + 2 2+dfi dfi , we This yields a different estimate for V ars2 : V\ ars2 = 2 X a2i i M Si2 , dfi + 2 which is smaller than the first one and therefore provides us with the smaller confidence intervals. To get an approximate density for s2 we use Cochran-Satterthwaite, which says, that for a linear combination s2 , s2 = a1 M S1 + a2 M S2 + ... 
+ al M Sl with dfi M Si ∼ χ2dfi E[M Si ] the distribution of s2 is approximately multiplicative χ2 , i.e. ν s2 ∼ χ2ν E[s2 ] with ν = E[s2 ]2 1/2V ar[s2 ] The degrees of freedom ν come from setting ν = E[s2 ] and 2ν = V ar[νs2 /E[s2 ]]. An approximate confidence interval for the expected value of s2 is then: ( νs2 χ2ν,upper , νs2 ) χ2ν,lower This is not realizable because ν depends on EM Si . Estimate ν by nu: ˆ nu ˆ =2 (s2 )2 V\ ar(s2 ) This gives (very) approximate confidence intervals for Es2 : ( ν̂s2 , ν̂s2 χ2ν̂,upper χ2ν̂,lower 69 ) Example: Machines Goal: Compare three brands of machines A, B, and C used in an industrial process. Six workers were chosen randomly to operate each machine three times. Response is overall productivity score taking into account the number and quality of components produced. Variables: bi Worker a factor giving a unique identifier for workers. βj Machine a factor with levels A, B, and C identifying the machine brand. yijk score a productivity score. i = 1, ..., 6, j = 1, 2, 3, k = 1, 2, 3 (repetitions) library(lattice) trellis.par.set(theme=col.whitebg()) plot(Machines) The graphic below gives a first overview of the data: productivity score is plotted against workers. The ordering of workers on the vertical axis is decided on an individual’s highest productivity, worker 6 showed the overall lowest performance, worker 5 had the highest productivity score. Color and symbol correspond to machines A to C. Productivity scores for each machine seem to be fairly stable across workers, i.e. each worker got the highest productivity score on machine C, most of the workers (except worker 6) had the lowest score on machine A. Repetitions on the same brand gave very similar results in terms of the productivity score. ● 5 A B C ● ● ● 3 ● 1 ●● Worker ● ●● 4 ● 2 6 ● ● ● ● 45 ● ●● ● 50 55 60 65 70 Productivity score Model Machines Data A first idea for modeling these data is to treat machines as a fixed effect and workers as random (in order to be able to draw inference on the workers’ population) independently from machine: yijk = βj + bi + ijk Assume bi ∼ N (0, σb2 ) and ijk ∼ N (0, σ 2 ) The variance assumptions seem to be justified based on the first plot. > mm1 <- lme(score~Machine, random=~1|Worker, data=Machines) > summary(mm1) Linear mixed-effects model fit by REML Data: Machines 70 AIC BIC logLik 296.8782 306.5373 -143.4391 Random effects: Formula: ~1 | Worker (Intercept) Residual StdDev: 5.146552 3.161647 Fixed effects: score ~ Machine Value Std.Error (Intercept) 52.35556 2.229312 MachineB 7.96667 1.053883 MachineC 13.91667 1.053883 Correlation: (Intr) MachnB MachineB -0.236 MachineC -0.236 0.500 DF t-value p-value 46 23.48507 0 46 7.55935 0 46 13.20514 0 Standardized Within-Group Residuals: Min Q1 Med Q3 -2.7248806 -0.5232891 0.1327564 0.6513056 Max 1.7559058 Number of Observations: 54 Number of Groups: 6 REML estimates for σβ2 and σ 2 are c2 = 3.161647 σ cβ = 5.146552, and σ The effect of machine brands are in the same direction as we already saw in the plot: machine A (effect set to zero) is estimated to be less helpful than machine B, which itself is estimated to be less helpful than machine C in order for workers to get a high productivity score. In order to inspect the fit of the model further, we might want to plot the fitted value and compare the fit to the raw data. 
That way we also see what exactly the above model is doing: different productivity scores for each machine are fitted, each worker gets an “offset” by which these scores are moved horizontally. The biggest difference between the raw data and the fitted seems to be for the productivity scores of worker 6, where estimated productivity scores of machines A and B switch their ranking. Visualizing Fitted Values ● 1 2 5 Worker attach(Machines) MM1 <- groupedData( fitted(mm1)~factor(Machine) | Worker, data.frame(cbind(Worker,Machine,fitted(mm1))) ) plot(MM1) detach(Machines) ● 6 ● 4 ● 3 ● 2 1 3 ● ● 45 50 55 60 65 70 fitted(mm1) Since we still see major differences between raw and fitted values, we might want to include an interaction effect to the model: 71 Assessing Interaction Effects Worker 50 55 60 65 5 3 1 4 2 6 45 mean of score 70 attach(Machines) interaction.plot(Machine, Worker, score,col=2:7) detach(Machines) A B C Machine Model: Add Interaction Effect yijk = βj + bi + bij + ijk Assume bi ∼ N (0, σ12 ), bij ∼ N (0, σ22 ) and ijk ∼ N (0, σ 2 ) Now we are dealing with random effects on two levels: random effects for each worker, random effects of each machine for each worker (Machine within Worker). Whenever we are dealing with the interaction of a random effect and a fixed effect the resulting interaction effect has to be treated as a random effect. mm2 <- update(mm1, random=~1| Worker/Machine) The resulting parameter estimates are shown in summary(mm2): the overall residual standard deviation is reduced to σ̂ 2 = 0.9615768. > summary(mm2) Linear mixed-effects model fit by REML Data: Machines AIC BIC logLik 227.6876 239.2785 -107.8438 Random effects: Formula: ~1 | Worker (Intercept) StdDev: 4.781049 Formula: ~1 | Machine %in% Worker (Intercept) Residual StdDev: 3.729536 0.9615768 Fixed effects: score ~ Machine Value Std.Error (Intercept) 52.35556 2.485829 MachineB 7.96667 2.176974 MachineC 13.91667 2.176974 Correlation: DF t-value p-value 36 21.061606 0.0000 10 3.659514 0.0044 10 6.392665 0.0001 72 (Intr) MachnB MachineB -0.438 MachineC -0.438 0.500 Standardized Within-Group Residuals: Min Q1 Med Q3 -2.26958756 -0.54846582 -0.01070588 0.43936575 Max 2.54005852 Number of Observations: 54 Number of Groups: Worker Machine %in% Worker 6 18 A plot of the fitted value also indicates that the values are much closer to the raw data. ● 6 1 2 ● 5 Worker 3 ● 4 ● 3 ● 2 ● 1 ● 45 50 55 60 65 70 fitted(mm2) MM2 <- groupedData(fitted(mm2)~factor(Machine) | Worker, data.frame(cbind(Worker,Machine,fitted(mm2)))) plot(MM2) Statistically, we can test whether we actually need the interaction effect by using an anova table (in the framework of a reduced/full model test): the difference between mm1 and mm2 turns out to be highly significant, meaning that we have to reject the hypothesis that we do not need the interaction effect of model mm2 (i.e. we need the interaction effect). This does not give an indication, however, whether model mm2 is a “good enough” model. Comparison of Models > plot(MM2) > anova(mm1,mm2) Model df AIC BIC logLik Test L.Ratio p-value mm1 1 5 296.8782 306.5373 -143.4391 mm2 2 6 227.6876 239.2785 -107.8438 1 vs 2 71.19063 <.0001 We were able to estimate interaction effects because of the repetitions (unlike the stool data example). 73 4 Bootstrap Methods Different approach in getting variance estimates for parameters. Situation: assume sample X1 , ..., Xn F̃ i.i.d. for some distribution F . The Xi are possibly vector valued observations. 
We want to make some inference on a characteristic θ of F , e.g. mean, median, variance, correlation. Compute estimate tn of θ from the observed sample: P mean: tn = n1 i Xi 2 P 1 standard deviation: t2n = n−1 i Xi − X̄ ... What can we say about the distribution of tn ? - We want to have suggestions for E[tn ], V ar[tn ], distribution, C.I., probabilities, ... 1. If F is known, one way of producing estimates is by doing simulation: (a) Draw samples X1b , ..., Xnb from F for b = 1, ..., B. (b) For each sample compute tbn (c) Use these B realizations of tn to compute P i. mean: Etn = B1 b tbn P b 1 ¯ 2 ii. variance: V artn = B−1 b tn − tn iii. empirical (cumulative) distribution function of F : # samples tbn < t B 2. If F is not known, but we know the family F is in, i.e. we know that F = Fθ for some unknown θ, we could estimate θ̂ from the sample, and follow through simulation based on B samples from F̂ = Fθ̂ . This is called parametric bootstrap. F̂ (t) = 3. If F is completely unknown, we cannot generate random numbers from this distribution. X1 , ..., Xn are the only realizations of F known to us. The idea is to approximate F by F̂ of X1 , ..., Xn : (a) draw B samples X1b , ..., Xnb from X1 , ..., Xn with replacement. This is then called a bootstrap sample. (b) for each bootstrap sample compute tbn . (c) Use these B realizations of tn to compute P i. mean: Etn = B1 b tbn P b ¯ 2 ii. variance: V artn = 1 b tn − tn B−1 iii. empirical (cumulative) distribution function of F : # samples tbn < t B If used properly this yields consistent estimates: for n → ∞, B → ∞ P EF̂ tn = B1 b tbn → EF tn P b 1 ¯ 2 → V arF tn V arF̂ tn = B−1 b tn − tn F̂ (t) = What are good values for B - the number of bootstrap samples? standard deviation B ≈ 200 C.I. B ≈ 1000 more demanding B ≈ 5000 74 Example: Law Schools 15 law schools report admission conditions on LSAT and GPA scores. 660 ● ● ● 620 600 ● ● ● ● ● ● ● 560 580 LSAT 640 ● ● ● ● ● 2.8 2.9 3.0 3.1 GPA Interested in correlation between scores > > > > options(digits=4) library(bootstrap) data(law) law LSAT GPA 1 576 339 2 635 330 3 558 281 4 578 303 5 666 344 6 580 307 7 555 300 8 661 343 9 651 336 10 605 313 11 653 312 12 575 274 13 545 276 14 572 288 15 594 296 > attach(law) > cor(GPA,LSAT) [1] 0.7764 How can we compute a C.I. of Correlation for scores? The first two bootstrap samples look like this: > > > > ID <- 1:15 #### C.I. for correlation? # first bootstrap sample b1 <- sample(ID, size=15, replace=T); b1 75 3.2 3.3 3.4 [1] 10 9 8 8 5 7 11 6 10 12 > law[b1,] LSAT GPA 10 605 313 9 651 336 8 661 343 8.1 661 343 5 666 344 7 555 300 11 653 312 6 580 307 10.1 605 313 12 575 274 3 558 281 2 635 330 12.1 575 274 11.1 653 312 6.1 580 307 > cor(law[b1,]$LSAT, law[b1,]$GPA) [1] 0.8393 3 > # second bootstrap sample > b2 <- sample(ID, size=15, replace=T); b2 [1] 7 12 1 11 2 3 9 6 7 2 4 5 13 > law[b2,] LSAT GPA 7 555 300 12 575 274 1 576 339 11 653 312 2 635 330 3 558 281 9 651 336 6 580 307 7.1 555 300 2.1 635 330 4 578 303 5 666 344 13 545 276 1.1 576 339 12.1 575 274 > cor(law[b2,]$LSAT, law[b2,]$GPA) [1] 0.6534 2 12 11 6 1 12 Iterating 5000 times: > > > + + + > # not recommended - because of for loop boot.cor <- rep(NA,5000) # output dummy for (i in 1:5000) { b <- sample(ID, size=15, replace=T) boot.cor[i] <- cor(law[b,]$LSAT, law[b,]$GPA) } summary(boot.cor) Min. 1st Qu. Median Mean 3rd Qu. Max. 
76 0.08493 0.68830 0.78850 0.76900 0.87360 0.99440 > sd(boot.cor) [1] 0.1334 > > # rather > scor <- function(x) { + b <- sample(ID, size=15, replace=T) + return(cor(law[b,]$LSAT, law[b,]$GPA)) + } > > boot.cor <- sapply(1:5000,FUN=scor) > summary(boot.cor) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.05641 0.69120 0.78860 0.76980 0.87130 0.99520 > sd(boot.cor) [1] 0.1337 > > # maybe easiest: use bootstrap command in package bootstrap > boot.cor <- bootstrap(ID,5000,scor) > summary(boot.cor) Length Class Mode thetastar 5000 -none- numeric func.thetastar 0 -none- NULL jack.boot.val 0 -none- NULL jack.boot.se 0 -none- NULL call 4 -none- call > summary(boot.cor$thetastar) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0677 0.6920 0.7930 0.7720 0.8740 0.9980 > sd(boot.cor$thetastar) [1] 0.1326 Histogram of 5000 Bootstrap Correlations > hist(boot.cor) 400 200 0 Frequency 600 5000 bootstrap correlations 0.2 0.4 0.6 boot.cor 77 0.8 1.0 Influence of Number of Bootstrap Samples: > n <- c(10,50,100,200,250, 500, 600,700,800, 900,1000,2000,3000) > for(i in 1:length(n)) { + b <- sapply(1:n[i],FUN=scor) + print(c(n[i],mean(b),sd(b))) + } [1] 10.0000000 0.8565279 0.1370131 [1] 50.0000000 0.7996950 0.1059167 [1] 100.0000000 0.7756764 0.1306812 [1] 200.0000000 0.7777854 0.1299204 [1] 250.0000000 0.7567017 0.1347792 [1] 500.0000000 0.7716557 0.1341673 [1] 600.0000000 0.7673644 0.1328523 [1] 700.0000000 0.7715935 0.1312316 [1] 800.0000000 0.7633297 0.1379392 [1] 900.0000000 0.7737205 0.1334535 [1] 1000.0000000 0.7724381 0.1276535 [1] 2000.0000000 0.7742773 0.1315137 [1] 3000.0000000 0.7683884 0.1341691 The Real Value: Data of all 82 Law Schools For the law school data information for all of the population (all 82 schools at that time) is available. Data points sampled for the first example are marked by filled circles. ● ● 3.4 ● ● ● ● ● ● 3.2 3.0 GPA ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● 2.8 2.6 ●● ● ● ● ● ● ●● ● ● ● ● ● ● 500 550 600 650 700 LSAT We are in the unique situation, of being able to determine the true parameter value: the value for the correlation coefficient turns out to be 0.7600 > data(law82) > plot(GPA~LSAT,data=law82) > points(LSAT,GPA,pch=17) > > cor(law82$GPA,law82$LSAT) [1] 0.7600 Having all the data also makes it possible to get values for the true distribution of correlation coefficients based on samples of size 15 from the 82 schools. How many samples of size 15 are there (with replacement)? 78 > # binomial coefficent: choose 15 from 82 > fac(82)/(fac(15)*fac(67)) # lower limit - does not include replicates [1] 9.967e+15 > 82^15 # upper limit - regards ordering of draws [1] 5.096e+28 Clearly, these are too many possibilities to regard. We will therefore only have a look at 100.000 samples of size 15 from 82 Schools: > > > + + + > N <- 100000 corN <- rep(NA,N) for (i in 1:N) { b <- sample(1:82,size=15,replace=T) corN[i] <- cor(law82[b,2],law82[b,3]) } summary(corN); Min. 1st Qu. Median Mean 3rd Qu. Max. -0.2653 0.6794 0.7702 0.7476 0.8413 0.9885 > sd(corN); [1] 0.1291587 The true value for the correlation based on 15 samples therefore is 0.7476. In the comparison with the true value of the population, we see a difference - the bias of our estimate is -0.0124. 
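For comparison, the parametric bootstrap (case 2 at the beginning of this chapter) could also be applied here, by fitting a bivariate normal distribution to (LSAT, GPA) and resampling from the fitted model. A sketch, assuming MASS is available; whether the normal model is adequate for these data is an extra assumption.

library(MASS)    # for mvrnorm()

mu    <- colMeans(law[, c("LSAT", "GPA")])
Sigma <- cov(law[, c("LSAT", "GPA")])

pboot.cor <- replicate(5000, {
  b <- mvrnorm(n = 15, mu = mu, Sigma = Sigma)   # sample of size 15 from fitted model
  cor(b[, 1], b[, 2])
})
summary(pboot.cor)
sd(pboot.cor)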
Bias of Bootstrap Estimates BiasF (tn ) = −θ EF [tn ] | {z } average of all possible values of size n from population Law school: 0.7476 - 0.7600 = -0.0124 Since we usually do not know F nor θ, we have to use estimates for the bias: Estimate of Bias: BiasF̂ (tn ) = E [tn ] − tn |{z} | F̂{z } average of B bootstrap samples value from original sample Law school: 0.7698 - 0.7764 = -0.0066 Efron & Tibshirani (2003) suggested a correction fro the bootstrap estimate to reduce the bias: Improved Bias corrected Estimates t˜n = tn − BiasF̂ (tn ) = 2tn − EF̂ [tn ] Law school: 0.7764 - (-0.0066) = 0.7830 In the law school example, this correction fires backwards - after the correction the estimate is further from the true value than the raw bootstrap average. This is something that happens quite frequently in practice, because it turns out that the mean squared error of the bias corrected estimate can be larger than the mean square error of the raw estimate. Confidence Intervals The idea for confidence intervals based on a bootstrap sample, is to get the empirical distribution function from the samples by sorting the values from smallest to largest and identifying the α% extremest cases: 1. sort bootstrap samples tb(1) ≤ tb(2) ≤ ... ≤ tb(n) 79 2. compute upper and lower α/2 percentiles: α α kL = largest integer ≤ (B + 1) = b(B + 1) c 2 2 kU = B + 1 − kL 3. a (1 − α)100% C.I. for θ is then (t(kL ) , t(kU ) ) Bootstrap 90% C.I. for correlation in Law School Data > N <- 5000 > alpha <- 0.1 > kL <- trunc((N+1)*alpha/2); kL [1] 250 > kU <- N+1 - kL; kU [1] 4751 > > round(sort(boot.cor)[c(kL,kU)],4) [1] 0.5196 0.9465 80 Properties of percentile bootstrap C.I.: • Bootstrap confidence interval is consistent for B → ∞ • C.I. is invariant under transformation • bootstrap approximation becomes more accurate for larger sample (n → ∞) • for smaller samples bootstrap coverage tends to be smaller than the nominal (1 − α)100% level. • percentile interval is entirely inside the parameter space. In order to overcome the problem of biasedness in the C.I. limits and to correct for the under coverage problem, we look at an alternative to percentile C.I.: Bias corrected accelerated BCa bootstrap C.I.: Draw B bootstrap samples and order estimates: t∗n(1) ≤ t∗n(2) ≤ ... ≤ t∗n(B) BCa interval of intended coverage of (1 − α) is given by (t∗nbα1 (B+1)c , t∗nbα2 Bc ) where z0 − zα/2 Φ z0 + , and 1 − a(z0 − zα/2 ) z0 + zα/2 Φ z0 + , 1 − a(z0 + zα/2 ) α1 = α2 = and zα/2 is the upper α/2 quantile of the standard normal. z0 Φ−1 ( proportion of bootstrap samples t∗ni lower than tn ) = ∗ #tni < tn = Φ−1 , B = z0 is a bias correction, which measures the median bias of tn in normal units. If tn is close to the median of t∗ni then z0 is close to 0. a is the acceleration parameter: 3 P j tn,(.) − tn,−j a= 2 3/2 P 6 j tn,(.) − tn,−j P where tn,−j is the value of statistic tn computed without value xj ; and tn,(.) is the average tn,(.) = n1 j tn,−j . • BCa are second order accurate, i.e. α Clower + 2 n α Cupper P (θ > upper end of BCa interval ) = + 2 n P (θ < lower end of BCa interval ) = 81 Percentile intervals are only 1st order accurate: ∗ α Clower + √ 2 n ∗ C α upper P (θ > upper limit ) = + √ 2 n P (θ < lower limit ) = • like percentile intervals, BCa intervals are transformation respecting • disadvantage: BCa intervals are computationally intensive In the example of the Law School Data we get a BCa interval of ( 0.4288, 0.9245 ). 
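The three steps above can be wrapped into a small helper function, so the percentile interval can be reused for other statistics; applied to the vector of bootstrap correlations used above it should reproduce the interval just computed.

perc.ci <- function(t.boot, alpha = 0.05) {
  B  <- length(t.boot)
  kL <- trunc((B + 1) * alpha / 2)    # largest integer <= (B+1)*alpha/2
  kU <- B + 1 - kL
  sort(t.boot)[c(kL, kU)]
}
perc.ci(boot.cor, alpha = 0.1)        # 90% percentile interval for the correlation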
One way to come up with this interval in R, is part of the library boot: > + + > > > scor <- function(x,b) { # x is the dataframe, b is an index of the samples included return(cor(x[b,]$LSAT, x[b,]$GPA)) } boot.cor <- boot(law,scor,5000) boot.cor ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = law, statistic = scor, R = 5000) Bootstrap Statistics : original bias std. error t1* 0.7764 -0.008126 0.1346 > > summary(boot.cor) Length Class Mode t0 1 -nonenumeric t 5000 -nonenumeric R 1 -nonenumeric data 2 data.frame list seed 626 -nonenumeric statistic 1 -nonefunction sim 1 -nonecharacter call 4 -nonecall stype 1 -nonecharacter strata 15 -nonenumeric weights 15 -nonenumeric > > boot.cor$t0 [1] 0.7764 > > summary(boot.cor$t) 82 object Min. :-0.0203 1st Qu.: 0.6907 Median : 0.7913 Mean : 0.7703 3rd Qu.: 0.8731 Max. : 0.9967 A set of bootstrap confidence intervals is then given by boot.ci: > boot.ci(boot.cor, conf=0.9) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 5000 bootstrap replicates CALL : boot.ci(boot.out = boot.cor, conf = 0.9) Intervals : Level Normal 90% ( 0.5655, 1.0000 ) Basic ( 0.6065, 1.0234 ) Level Percentile BCa 90% ( 0.5293, 0.9462 ) ( 0.4288, 0.9245 ) Calculations and Intervals on Original Scale Warning message: Bootstrap variances needed for studentized intervals in: boot.ci(boot.cor, conf = 0.9) “Manually” this could be done, too: > z0 <- qnorm(sum(boot.cor$t < boot.cor$t0)/5000); z0 [1] -0.06673 > > corj <- function(x) { + return(cor(LSAT[-x], GPA[-x])) + } > tnj <- sapply(1:15, corj) > > acc <- sum((tnj-mean(tnj))^3)/(6*sum((tnj-mean(tnj))^2)^1.5) > acc [1] 0.07567 > > alpha <- 0.1 > alpha1 <- pnorm(z0 + (z0+qnorm(alpha/2))/(1-acc*(z0+qnorm(1-alpha/2)))); alpha1 [1] 0.02219 > alpha2 <- pnorm(z0 + (z0-qnorm(alpha/2))/(1-acc*(z0-qnorm(1-alpha/2)))); alpha2 [1] 0.9578 For an effective 95% coverage 2.2% and 96% limits are used in the biased corrected & accelerated confidence interval. ABC intervals ABC intervals (approximate bias confidence intervals) are a computationally less intensive alternative of BCa s. Only a small percentage of computation needed compared to a BCa . ABC intervals are also 2nd order accurate. 83 Bootstrap can fail Let Xi ∼ U [0, θ] be i.i.d. An estimate of θ is given as θ̂ = maxi Xi . A C.I. for θ based on a bootstrap tends to be too short: let θ = 2: > theta <- 2 > x <- runif(100, 0,theta) > max(x) [1] 1.998 > > maxb <- function(x,b) return(max(x[b])) > boot.theta <- boot(x,maxb,5000) > boot.ci(boot.theta,conf=0.9) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 5000 bootstrap replicates CALL : boot.ci(boot.out = boot.theta, conf = 0.9) Intervals : Level Normal 90% ( 1.981, 2.030 ) Basic ( 1.998, 2.058 ) Level Percentile BCa 90% ( 1.939, 1.998 ) ( 1.939, 1.998 ) Calculations and Intervals on Original Scale Warning message: Bootstrap variances needed for studentized intervals in: boot.ci(boot.theta, conf = 0.9) Both percentile intervals and BCa bootstrapping Intervals do not extend far enough to the right to cover θ. 84 Example: Stormer Data Viscometer measure viscosity of fluids by measuring the time an inner cylinder needs to perform a certain number of rotations. Calibration is done by adding weights to the cylinder in fluids of known viscosity Physics gives us the nonlinear model: β1 V T = + W − β2 This is equivalent to W · T = β1 V + β2 T + (W − β2 ) For a start we can treat this equation as a linear model by not regarding the error term. 
We fit a linear model to get an idea for initial values of β1 , β2 : > library(MASS) > data(stormer) > names(stormer) [1] "Viscosity" "Wt" > > > lm1 <- lm(Wt*Time ~ > summary(lm1) "Time" Viscosity + Time-1, data=stormer) Call: lm(formula = Wt * Time ~ Viscosity + Time - 1, data = stormer) Residuals: Min 1Q Median -304.4 -144.1 84.2 3Q 209.1 Coefficients: Estimate Std. Viscosity 28.876 Time 2.844 --Signif. codes: 0 ‘***’ Max 405.5 Error t value Pr(>|t|) 0.554 52.12 <2e-16 *** 0.766 3.71 0.0013 ** 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 236 on 21 degrees of freedom Multiple R-Squared: 0.998,Adjusted R-squared: 0.997 F-statistic: 4.24e+03 on 2 and 21 DF, p-value: <2e-16 A nonlinear model is then fit by nls using the coeficients of the linear model as starting values. The resulting parameter estimates are not very far off the initial values: > bc <- coef(lm1) > fit <- nls(Time~b1*Viscosity/(Wt-b2),data=stormer, + start=list(b1=bc[1],b2=bc[2]),trace=T) 885.4 : 28.876 2.844 825.1 : 29.393 2.233 825 : 29.401 2.218 825 : 29.401 2.218 > coef(fit) b1 b2 29.401 2.218 85 We would now like to find confidence intervals for these values. We can do that by using bootstrap samples and fitting a nonlinear model for each of it. Bootstrap nonlinear regression: > + + + > > nlsb <- function(x,b) { return(coef(nls(Time~b1*Viscosity/(Wt-b2),data=x[b,], start=list(b1=bc[1],b2=bc[2])))) } stormer.boot <- boot(stormer,nlsb, 1000) stormer.boot ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = stormer, statistic = nlsb, R = 1000) 8 Bootstrap Statistics : original bias std. error t1* 29.401 -0.04682 0.6629 t2* 2.218 0.07708 0.7670 > > plot(stormer.boot$t[,1], stormer.boot$t[,2]) ● 6 ● ● ● 4 ● 2 ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●●● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ●● ● ●● ● ● ●● ●● ● ●●●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ●● ●● ● ● ●●●● ●● ● ● ● ●●● ●● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ●● ● ●● ● ●● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ●●● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●●● ● ● ● ●● ● ● ●●● ● ●● ● ● ● ●● ●● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●●●● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●●●●● ●●●●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ●● ●● ●● ● ●● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ●●● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ●● ●● ● ● ●●● ●● ● ● ● ●●● ● ● ● ●●● ●● ●●●● ● ● ●● ● ● ●● ●● ● ● ● ●●● ● ●● ●● ● ●● ● ● ●● ● ● ●●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●●●●●● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 stormer.boot$t[, 2] ● 27 28 29 stormer.boot$t[, 1] 86 30 31 > boot.ci(stormer.boot) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 1000 bootstrap replicates CALL : boot.ci(boot.out = stormer.boot) Intervals : Level Normal 95% (28.15, 30.75 ) Basic (28.13, 30.77 ) Studentized (27.54, 30.49 ) Level Percentile BCa 95% (28.03, 30.68 ) (28.07, 30.73 ) Calculations and Intervals on Original Scale > 
5 Generalized Linear Model

Generalized Linear Models (GLMs) consist of three components:

random component: response variable Y with independent observations y1, y2, ..., yN. Y has a distribution in the exponential family, i.e. the distribution of Y has parameters θ and φ, and we can write the density in the form

    f(yi; θi, φ) = exp{ (yi θi − b(θi)) / a(φ) − c(yi, φ) },

with functions a(.), b(.), and c(.). θi is called the canonical (or natural) parameter and φ is the dispersion. The exponential family is very general and includes the normal, binomial, Poisson, Γ, inverse Γ, ...

systematic component: X1, ..., Xp explanatory variables, where xij is the value of variable j for subject i. The systematic component is a linear combination of the Xj:

    ηi = Σ_{j=1}^p βj xij

This corresponds to the design matrix in a standard linear model.

link function: the link function builds the connection between the systematic and the random component. Let µi be the expected value of Yi, i.e. E[Yi] = µi; then

    h(µi) = ηi = Σj βj xij    for i = 1, ..., N,

where h is a differentiable, monotonic function. h(µ) = µ is the identity link; the link function that transforms the mean to the natural parameter is called the canonical link: h(µ) = θ.

5.1 Members of the Natural Exponential Family

The natural exponential family is large and includes many familiar distributions, among them the Binomial and the Poisson distribution. For many distributions it is just a question of re-writing the density to identify appropriate functions a(), b() and c(). We will do that for both the Normal and the Binomial distribution:

Normal Distribution
Let Y ∼ N(µ, σ²). The density function of Y is

    f(y; µ, σ²) = (2πσ²)^(−1/2) exp{ −(y − µ)²/(2σ²) }

Using the basic idea that θ = µ and φ = σ², this can be re-written as

    f(y; θ, φ) = exp{ −(y − θ)²/(2φ) + log (2πφ)^(−1/2) }
               = exp{ (yθ − θ²/2)/φ − [ y²/(2φ) + (1/2) log(2πφ) ] }

The normal distribution is therefore part of the exponential family with

    a(φ) = φ,    b(θ) = θ²/2,    c(y, φ) = y²/(2φ) + (1/2) log(2πφ).

The canonical link is h(µ) = µ, the identity link. Ordinary least squares regression with a normal response is thus a GLM with identity link. Usually we transform Y if Y is not normal; with a GLM we instead use the link function to connect the mean to the explanatory variables, and the fitting process maximizes the likelihood for the chosen distribution of Y (not necessarily normal).

Binomial Distribution
Let Y count the number of successes among n binary trials, each with success probability π, so that Y has a Binomial distribution. The density function is

    f(y; π) = (n choose y) π^y (1 − π)^(n−y)

This can be re-expressed as

    f(y; π) = exp{ y log π + (n − y) log(1 − π) + log (n choose y) }
            = exp{ y log(π/(1 − π)) + n log(1 − π) + log (n choose y) }

The Binomial distribution is a member of the exponential family with φ = 1, a(φ) = 1 and

    θ = log(π/(1 − π))   ⇒   π = e^θ/(1 + e^θ),
    b(θ) = −n log(1 − π) = n log(1 + e^θ),
    c(y, φ) = −log (n choose y).

The natural parameter of the Binomial is log(π/(1 − π)), which is also called the logit of π; the canonical link is therefore h(nπ) = log(π/(1 − π)), the logit link.
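A quick numerical check of this decomposition (my own sketch, not part of the notes): plugging θ = log(π/(1 − π)), b(θ) = n log(1 + e^θ) and c(y, φ) = −log(n choose y) into the exponential family form should reproduce the Binomial density exactly.

# Verify the exponential-family representation of the Binomial(n, p) density
# at an arbitrary point:
n  <- 10; p <- 0.3; y <- 4
theta <- log(p/(1-p))                 # natural parameter (logit of p)
b     <- n*log(1 + exp(theta))        # b(theta) = -n log(1 - p)
exp(y*theta - b + lchoose(n, y))      # exp{ (y*theta - b(theta))/a(phi) - c(y, phi) }
dbinom(y, size = n, prob = p)         # same value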
Let Y be a binary response with P(Y = 1) = π = E[Y] and Var(Y) = π(1 − π).

Linear Probability Model

    π(x) = α + βx

This is a GLM with identity link function, i.e. h(π) = π. Problem: sufficiently large or small values of x will make π(x) leave the proper range [0, 1].

Logistic Regression Model

The relationship between x and π(x) might not be linear: an increase in x will mean something different if π(x) is close to 0.5 than if π(x) is near 0 or 1. An S-shaped curve as sketched below might be more appropriate to describe the relationship.

[Sketch: S-shaped curves of π(x) against x, one increasing and one decreasing.]

These curves can be described with two parameters α, β as

    π(x) = exp(α + βx) / (1 + exp(α + βx));

solving for α + βx gives

    α + βx = log( π(x) / (1 − π(x)) ).

This is a GLM with logit link - the canonical link for a model with binomial response.

5.2 Inference in GLMs

Idea: develop large sample theory for the exponential family in general. That way we can apply all results to any situation with a familiar distribution belonging to the exponential family.

Let ℓ(y; θ, φ) be the log likelihood function in the exponential family, i.e.

    ℓ(y; θ, φ) = (yθ − b(θ))/a(φ) − c(y, φ),

and the partial derivative of ℓ with respect to θ is

    ∂/∂θ ℓ(y; θ, φ) = (y − b'(θ))/a(φ).

Assume the weak regularity conditions of maximum likelihood:

    E[ ∂/∂θ ℓ(y; θ, φ) ] = 0   and   Var( ∂/∂θ ℓ(y; θ, φ) ) = −E[ ∂²/∂θ² ℓ(y; θ, φ) ].

This gives us the following results for the expected value and variance of Y:

    E[ ∂/∂θ ℓ(y; θ, φ) ] = 0  ⇐⇒  E[ (y − b'(θ))/a(φ) ] = 0  ⇐⇒  E[y] = b'(θ),

and

    Var( ∂/∂θ ℓ(y; θ, φ) ) = −E[ ∂²/∂θ² ℓ(y; θ, φ) ]  ⇐⇒  Var(y)/a(φ)² = b''(θ)/a(φ)  ⇐⇒  Var(y) = b''(θ) a(φ).

Now it is up to us to choose a function h(.) with h(E[Y]) = Xβ, i.e. h(b'(θ)) = Xβ. Whenever h(b'(θ)) = θ, h is the canonical link.

Normal distribution
b(θ) = θ²/2, so b'(θ) = θ = E[Y]; the identity link is therefore the canonical link.

Binomial distribution
b(θ) = n log(1 + e^θ), so b'(θ) = n e^θ/(1 + e^θ) = nπ = E[Y]; the canonical link is therefore the logit link: θ = log(π/(1 − π)).

Other possibilities for link functions:

• logit link: h(π) = log(π/(1 − π)) = Xβ, therefore π = e^θ/(1 + e^θ)
• probit link: h(π) = Φ^(−1)(π), therefore π = Φ(θ)
• cloglog link (complementary log-log link): h(π) = log(−log(1 − π)) = Xβ, therefore π = 1 − e^(−e^θ)

[Figure 4: Comparison of links in the binomial model: the logit link is drawn in black, the probit link in green, and the complementary log-log link in red.]

The differences between the links are subtle: the probit link shows a steeper increase than the logit, whereas the cloglog link shows an asymmetric increase.
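A plot like Figure 4 can be reproduced with the built-in distribution functions (a small sketch of my own; the colours follow the caption):

# Inverse link functions for a binomial response, plotted over theta in [-4, 4]:
curve(plogis(x),        from = -4, to = 4, xlab = "theta", ylab = "pi")  # logit (black)
curve(pnorm(x),         add = TRUE, col = "green")                       # probit
curve(1 - exp(-exp(x)), add = TRUE, col = "red")                         # cloglog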
5.3 Binomial distribution of response

Example: Beetle Mortality
Insects were exposed to gaseous carbon disulphide for a period of 5 hours. Eight experiments were run with different concentrations of carbon disulphide.

Variable    Description
Dose        Dose of carbon disulphide
Exposed     Number of beetles exposed
Mortality   Number of beetles killed

The scatterplot below shows the rate of death versus dose:

[Scatterplot of Rate of Death against Dose.]

Let Yj count the number of killed beetles at dose dj (j = 1, ..., 8), with Yj ∼ B(nj, πj) independently, i.e. each beetle at dose dj has probability πj of dying. Using a logit link, we get

    logit πj = (Xβ)j = β0 + β1 dj

We can then set up the likelihood as

    L(β; y, d) = Π_{j=1}^J f(yj; θj, φ) = Π_{j=1}^J exp{ yj(β0 + β1 dj) − nj log(1 + e^(β0 + β1 dj)) + log (nj choose yj) }

Maximize over β: set the first partial derivatives of the log-likelihood to zero:

    ∂ℓ(β; y, d)/∂β0 = Σ_{j=1}^J [ yj − nj e^(β0 + β1 dj)/(1 + e^(β0 + β1 dj)) ] = Σ_{j=1}^J (yj − nj πj) = 0

    ∂ℓ(β; y, d)/∂β1 = Σ_{j=1}^J [ dj yj − nj dj e^(β0 + β1 dj)/(1 + e^(β0 + β1 dj)) ] = Σ_{j=1}^J dj (yj − nj πj) = 0

• these equations generally do not have a closed-form solution
• if a solution exists, it is unique
• the system can be summarized as the likelihood equations (score function):

    0 = X'(Y − m) =: Q    (*)

where m = E[Y] has components nj πj; find m such that (*) holds.

Use the Newton-Raphson algorithm to find the maximum of the log-likelihood, i.e. set up the matrix of second partial derivatives H = ( −∂²ℓ(β)/∂βi ∂βj )_{i,j} and iterate

    β^(t+1) = β^t + H^(−1) Q,

where both H and Q are evaluated at β^t, and β^0 is an initial guess.

Large sample inference gives us

    β̂ approximately ∼ N(β, H^(−1)),   if nj πj > 5 and nj (1 − πj) > 5,

where H = E[ −∂²ℓ(β)/∂βi ∂βj ]_{i,j} is the Fisher information matrix.

A fit of a GLM with Binomial response and logit link gives the following model:

fit1 <- glm(cbind(Mortality,Exposed-Mortality)~Dose, family=binomial(link=logit))
points(Dose,fit1$fitted,type="l")

[Scatterplot of Rate of Death against Dose with the fitted logit curve overlaid.]

Additionally, we can fit models corresponding to a probit and a complementary log-log link:

fit2 <- glm(cbind(Mortality,Exposed-Mortality)~Dose, family=binomial(link=probit))
points(Dose,fit2$fitted,type="l",col="red")

fit3 <- glm(cbind(Mortality,Exposed-Mortality)~Dose, family=binomial(link=cloglog))
points(Dose,fit3$fitted,type="l",col=3)

[Scatterplot of Rate of Death against Dose with the logit (black), probit (red), and cloglog (green) fits overlaid.]

Visually, the green fit corresponding to the cloglog link seems to give the best model. How can we compare GLMs statistically?

5.4 Likelihood Ratio Tests (Deviance)

Model M1 is nested within model M2 if M1 is a simpler model than M2, i.e. all terms of M1 also appear in M2 and M2 has more terms than M1. The deviance between models M1 and M2 is defined as

    D(M1, M2) = −2 (log likelihood_M1 − log likelihood_M2).

Then, approximately, D(M1, M2) ∼ χ² with (# par M2 − # par M1) degrees of freedom, and D(M1, M2) is a test statistic for H0: M2 does not explain significantly more than M1.

Based on this concept, some special deviances are defined:

• residual deviance of model M: D(M, Mfull) = −2 (log likelihood_M − log likelihood_Mfull).
  This expression makes sense, as by definition every model is nested within the full model. It gives a goodness-of-fit statistic for model M: the null hypothesis H0 states that the full model does not explain significantly more than model M. If we cannot reject this null hypothesis, model M is a reasonably good model.

• null deviance: D(Mnull, Mfull) = −2 (log likelihood_Mnull − log likelihood_Mfull).
  This deviance gives an upper bound for the residual deviance of all models: we cannot possibly find a model with a worse (= higher) residual deviance.

• explanatory power of model M: D(Mnull, M) = −2 (log likelihood_Mnull − log likelihood_M).
  This measures how much better the current model is than the null model, i.e. it tests whether M is a significant improvement.

The χ² approximation holds in each case under Cochran's rule: all expected counts should be > 1 and at least 80% of all cells should be > 5.
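In R the null and residual deviances are part of the summary() output of a glm fit, and the corresponding p-values can be computed by hand (a sketch, assuming the beetle fit fit1 from above):

# Goodness of fit: does the full model explain significantly more than fit1?
# A large p-value means fit1 cannot be rejected as an adequate model.
pchisq(deviance(fit1), df.residual(fit1), lower.tail = FALSE)

# Explanatory power: is fit1 a significant improvement over the null model?
pchisq(fit1$null.deviance - deviance(fit1),
       fit1$df.null - fit1$df.residual, lower.tail = FALSE)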
In the Beetle Mortality example, likelihood ratio tests give us a statistical handle for comparing models:

> summary(fit1)

Call:
glm(formula = cbind(Mortality, Exposed - Mortality) ~ Dose, family = binomial(link = logit))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.594  -0.394   0.833   1.259   1.594

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -60.72       5.18   -11.7   <2e-16 ***
Dose           34.27       2.91    11.8   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.202  on 7  degrees of freedom
Residual deviance:  11.232  on 6  degrees of freedom
AIC: 41.43

Number of Fisher Scoring iterations: 4

> summary(fit2)

Call:
glm(formula = cbind(Mortality, Exposed - Mortality) ~ Dose, family = binomial(link = probit))

Deviance Residuals:
  Min     1Q Median     3Q    Max
-1.57  -0.47   0.75   1.06   1.34

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -34.94       2.65   -13.2   <2e-16 ***
Dose           19.73       1.49    13.3   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.202  on 7  degrees of freedom
Residual deviance:  10.120  on 6  degrees of freedom
AIC: 40.32

Number of Fisher Scoring iterations: 4

> summary(fit3)

Call:
glm(formula = cbind(Mortality, Exposed - Mortality) ~ Dose, family = binomial(link = cloglog))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8033  -0.5513   0.0309   0.3832   1.2888

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -39.57       3.24   -12.2   <2e-16 ***
Dose           22.04       1.80    12.2   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.2024  on 7  degrees of freedom
Residual deviance:   3.4464  on 6  degrees of freedom
AIC: 33.64

Number of Fisher Scoring iterations: 4

All models have 6 residual degrees of freedom. The probit model has a slightly smaller residual deviance than the logit model; the cloglog model shows a huge improvement in the residual deviance, making it the best overall model, which coincides with the visual impression.

6 Model Free Curve Fitting

Situation: response variable Y and p explanatory variables X1, ..., Xp with

    Y = f(X1, ..., Xp) + ε

Goal: summarize the relationship between Y and the Xi (i.e. find f̂)

Advantage: data-driven method, no modelling assumptions necessary

Used to
• get an initial idea about the relationship between Y and the Xi (particularly useful for noisy data)
• make predictions using interpolation
• extrapolation?
• check the fit of a parametric model

Example: Diabetes Data
Data are available on 43 children diagnosed with diabetes. Of interest in the study are factors affecting insulin-dependent diabetes. Variables:

Cpeptide   level of serum C-peptide at diagnosis
age        age in years at diagnosis

> diabetes <- read.table("http://www.public.iastate.edu/~hofmann/stat511/data/cpeptide.txt",
+                        sep="\t", header=T)
> diabetes
   subject  age basedef Cpeptide
1        1  5.2    -8.1      4.8
2        2  8.8   -16.1      4.1
3        3 10.5    -0.9      5.2
...
41      41 13.2    -1.9      4.6
42      42  8.9   -10.0      4.9
43      43 10.8   -13.5      5.1

[Scatterplot of Cpeptide against age.]

What is the relationship between age and Cpeptide? Using a parametric approach, we might fit polynomials of various degrees in age to get an estimate of Cpeptide: from left to right, polynomials of degree 1, 2, and 3 are fitted.
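The polynomial fits themselves are not shown in the notes; they could be obtained along these lines (a sketch; the names fit2 and fit3 are chosen to match the anova() call below, and they re-use and overwrite the names of the beetle mortality fits):

# Polynomial regressions of Cpeptide on age, degrees 1 to 3:
fit1 <- lm(Cpeptide ~ age,                       data = diabetes)
fit2 <- lm(Cpeptide ~ age + I(age^2),            data = diabetes)
fit3 <- lm(Cpeptide ~ age + I(age^2) + I(age^3), data = diabetes)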
[Three scatterplots of Cpeptide against age with fitted polynomials of degree 1, 2, and 3 overlaid.]

The parabola peaks at around age 10; afterwards the fit indicates a downward trend in Cpeptide, which seems to be dictated by the shape of a parabola rather than by the data. A polynomial of degree 3 shows an increase in Cpeptide after age 10, but the third-degree term is not significant when compared with the parabola:

> anova(fit2, fit3)
Analysis of Variance Table

Model 1: Cpeptide ~ age + I(age^2)
Model 2: Cpeptide ~ age + I(age^2) + I(age^3)
  Res.Df   RSS Df Sum of Sq    F Pr(>F)
1     40 14.62
2     39 13.85  1      0.77 2.16   0.15

In terms of parametric fitting we are caught between statistical theory and the interpretation of the data. What we can do instead, to get an answer about the nature of the relationship between age and Cpeptide, is to use a smoothing approach.

6.1 Bin Smoother

Idea: partition the data into disjoint and exhaustive regions with about the same number of cases in each bin. Let NK(x) be the set of K nearest neighbors of x. There are different ways of computing this set:

1. symmetric nearest neighbors: NK(x) contains the K nearest cases to the left of x and the K nearest cases to the right.
2. nearest neighbors: NK(x) contains the K nearest cases of x.

Running Mean: for a value x, an estimate for y is given as the mean (or median) of the response values corresponding to the set of neighbors NK(x) of x, i.e.

    ŷ = (1/|NK(x)|) Σ_{xi ∈ NK(x)} yi

Running means are
• easy to compute,
• may not be smooth enough,
• tend to flatten out boundary trends.

Example: Diabetes
The scatterplot below shows running means for neighborhoods of 11 (red line), 15 (green), and 19 (blue line) points. The positive trend between age and Cpeptide for low values of age is only hinted at for a neighborhood of 11 points, and is not visible at all for 15 and 19 points.

[Scatterplot of Cpeptide against age with running-mean curves for neighborhoods of 11 (red), 15 (green), and 19 (blue) points.]

library(gregmisc)
K <- 5
runmean <- running(Cpeptide[order(age)], fun=mean, width=2*K+1, allow.fewer=F, align="center")
runmean
plot(Cpeptide~age, data=diabetes, pch=20)
points(sort(age)[-c(1:K,(44-K):43)], runmean, type="l", col=2)

K <- 7
runmean <- running(Cpeptide[order(age)], fun=mean, width=2*K+1, allow.fewer=F, align="center")
points(sort(age)[-c(1:K,(44-K):43)], runmean, type="l", col=3)

K <- 9
runmean <- running(Cpeptide[order(age)], fun=mean, width=2*K+1, allow.fewer=F, align="center")
points(sort(age)[-c(1:K,(44-K):43)], runmean, type="l", col=4)

Running Lines: fit a line to the points near x and predict the mean response at x to get an estimate for y:

    ŷx = b0,x + b1,x x

• larger neighborhoods give smoother curves
• points inside the neighborhood have equal weight → jaggedness
• idea: give more weight to points closer to x
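A running-line smoother is not provided by gregmisc, but it is straightforward to write one (a minimal sketch of my own, using symmetric nearest neighbors; the function name runline and the choice K = 7 are mine):

# Running-line smoother: for each point, fit a least-squares line to its
# symmetric neighborhood and predict at that point.
runline <- function(x, y, K = 7) {
  ord <- order(x); x <- x[ord]; y <- y[ord]
  n <- length(x)
  yhat <- numeric(n)
  for (i in 1:n) {
    idx <- max(1, i - K):min(n, i + K)      # indices of the neighborhood of x[i]
    fit <- lm(y[idx] ~ x[idx])
    yhat[i] <- coef(fit)[1] + coef(fit)[2] * x[i]
  }
  list(x = x, y = yhat)
}

plot(Cpeptide ~ age, data = diabetes, pch = 20)
lines(runline(diabetes$age, diabetes$Cpeptide, K = 7), col = 2)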
6.2 Kernel Smoothers

One of the reasons why the previous smoothers are wiggly is that when we move from xi to xi+1, two points are usually changed in the group we average over. If the two new points are very different, then ŷ(xi) and ŷ(xi+1) may be quite different. One way to try and fix this is by making the transition smoother. That is the idea behind kernel smoothers.

Generally speaking, a kernel smoother defines a set of weights {Wi(x)}_{i=1}^n for each x and computes a weighted average of the response values:

    s(x) = Σ_{i=1}^n Wi(x) yi.

What is called a kernel smoother in practice uses a simple approach to represent the weight sequence {Wi(x)}_{i=1}^n: the shape of the weight function Wi(x) is described by a density function with a scale parameter that adjusts the size and form of the weights near x. It is common to refer to this shape function as a kernel K. The kernel is a continuous, bounded, and symmetric real function K which integrates to one,

    ∫ K(u) du = 1.

For a given scale parameter h, the weight sequence is then defined by

    Whi(x) = K( (x − xi)/h ) / Σ_{i=1}^n K( (x − xi)/h ).

Notice that Σ_{i=1}^n Whi(x) = 1. The kernel smoother is then defined for any x, as before, by

    s(x) = Σ_{i=1}^n Whi(x) Yi.

A natural candidate for K is the standard Gaussian density (this is computationally inconvenient because it is never exactly 0):

    K( (xi − x)/h ) = (2πh²)^(−1/2) exp{ −(xi − x)²/(2h²) }.

The minimum variance kernel provides an estimator with minimal variance:

    K( (xi − x)/h ) = (3/(8h)) (3 − 5(xi − x)²/h²)   if |xi − x|/h < 1,
                    = 0                              otherwise.
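In practice a Gaussian kernel smoother is available directly via ksmooth(); here is a small sketch for the diabetes data (the bandwidth value is an arbitrary choice of mine):

# Gaussian kernel smoother of Cpeptide as a function of age:
plot(Cpeptide ~ age, data = diabetes, pch = 20)
lines(ksmooth(diabetes$age, diabetes$Cpeptide, kernel = "normal", bandwidth = 3),
      col = 2)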