Contents

1 Linear Models
2 Nonlinear Regression
3 Mixed Linear Models
  3.1 Example: One way Random Effects Model
  3.2 Example: Two way Mixed Effects Model without Interaction
  3.3 Estimation of Parameters
  3.4 Anova Models
4 Bootstrap Methods
5 Generalized Linear Model
  5.1 Members of the Natural Exponential Family
  5.2 Inference in GLMs
  5.3 Binomial distribution of response
  5.4 Likelihood Ratio Tests (Deviance)
6 Model Free Curve Fitting
  6.1 Bin Smoother
  6.2 Kernel Smoothers
1 Linear Models
Basic linear model structure:
  Y = Xβ + ε,

where

  Y = (y1, y2, ..., yn)', the vector of (observable) random variables,
  X = [xij], i = 1, ..., n, j = 1, ..., p, the n × p matrix of known constants,
  β = (β1, β2, ..., βp)', the vector of unknown parameters,
  ε = (ε1, ε2, ..., εn)', the vector of (unobservable) random errors.

Almost always the assumption is that Eε = 0, often Var ε = σ²I; often ε ∼ MVNn (together with Var ε = σ²I this means that the εi are i.i.d. N(0, σ²)).
Examples
(a) Multiple regression
  yi = α + β1 xi1 + β2 xi2 + εi,   for i = 1, ..., n,

translates to

  (y1, y2, ..., yn)' = X (α, β1, β2)' + (ε1, ε2, ..., εn)',   with

  X =
    [ 1 x11 x12 ]
    [ 1 x21 x22 ]
    [ .  .    . ]
    [ 1 xn1 xn2 ]
(b) One-way Anova
Version 1: yij = µi + εij
Version 2: yij = µ + τi + εij

Version 1 (3 treatments, 2 observations per treatment):

  (y11, y12, y21, y22, y31, y32)' = X (µ1, µ2, µ3)' + (ε11, ε12, ε21, ε22, ε31, ε32)',

  X =
    [ 1 0 0 ]
    [ 1 0 0 ]
    [ 0 1 0 ]
    [ 0 1 0 ]
    [ 0 0 1 ]
    [ 0 0 1 ]

Version 2 (3 treatments, 2 observations per treatment):

  (y11, y12, y21, y22, y31, y32)' = X (µ, τ1, τ2, τ3)' + (ε11, ε12, ε21, ε22, ε31, ε32)',

  X =
    [ 1 1 0 0 ]
    [ 1 1 0 0 ]
    [ 1 0 1 0 ]
    [ 1 0 1 0 ]
    [ 1 0 0 1 ]
    [ 1 0 0 1 ]

Assume two solutions to version 2 are

  (µ, τ1, τ2, τ3)' = (5, 1, 2, 3)'   and   (µ, τ1, τ2, τ3)' = (7, −1, 0, 1)'.

Both produce

  EY = (6, 6, 7, 7, 8, 8)',
i.e. they produce the same set of mean values for the observations. There is no way of telling the solutions
apart based on the data. This is because the matrix X in version 2 does not have full rank (the first column
is the sum of the last three columns).
Definition: Column Space
The column space of matrix X is the space of vectors that can be reached by linear combinations of the
columns of X.
In the example (b) both versions have the same column space:

  C(X) = { (a, a, b, b, c, c)' : a, b, c real numbers }.
The dimension of the column space C(X) turns out to be the same as the rank of the matrix X.
All linear models can be written in the form
  Y = Xβ + ε.

Our wish list is to:
• estimate Xβ = E[Y] by Ŷ,
• make sensible point estimates for σ², β, and c'β for "interesting" linear combinations c'β of β,
• find confidence intervals for σ² and c'β,
• get prediction intervals for new responses,
• test hypotheses H0 : βj = βj+1 = ... = βj+r = 0.
Example: Pizza Delivery Experiment conducted by Bill Afantenou, second year statistics student at
QUT. Here is his description of the experiment:
As I am a big pizza lover, I had much pleasure in involving pizza in my experiment. I became
curious to find out the time it took for a pizza to be delivered to the front door of my house. I
was interested to see how, by varying whether I ordered thick or thin crust, whether Coke was
ordered with the pizza and whether garlic bread was ordered with the pizza, the response would
be affected.
Variables:

  Variable   Description
  Crust      Thin=0, Thick=1
  Coke       No=0, Yes=1
  Bread      Garlic bread. No=0, Yes=1
  Driver     Male=M, Female=F
  Hour       Time of order in hours since midnight
  Delivery   Delivery time in minutes

Using R to read the data:
> pizza <- read.table("http://www.statsci.org/data/oz/pizza.txt",
+   header=T, sep="\t")  # ASCII file with a header line and tab-separated entries
> pizza
   Crust Coke Bread Driver  Hour Delivery
1      0    1     1      M 20.87       14
2      1    1     0      M 20.78       21
3      0    0     0      M 20.75       18
4      0    0     1      F 20.60       17
5      1    0     0      M 20.70       19
6      1    0     1      M 20.95       17
7      0    1     0      F 21.08       19
8      0    0     0      M 20.68       20
9      0    1     0      F 20.62       16
10     1    1     1      M 20.98       19
11     0    0     1      M 20.78       18
12     1    1     0      M 20.90       22
13     1    0     1      M 20.97       19
14     0    1     1      F 20.37       16
15     1    0     0      M 20.52       20
16     1    1     1      M 20.70       18
> attach(pizza)
Make boxplots of the data (result in figure 1):
> par(mfrow=c(2,2))
> boxplot(Delivery~Crust,col=c(2,3),main="Crust")
> boxplot(Delivery~Bread,col=c(2,3),main="Bread")
> boxplot(Delivery~Coke,col=c(2,3),main="Coke")
> boxplot(Delivery~Driver,col=c(2,3),main="Driver")
[Figure 1: four boxplots of Delivery by Crust, Bread, Coke and Driver.]
Figure 1: On average, pizzas with a thin crust are delivered faster; delivery also seems to be faster if garlic bread is ordered in addition. There does not seem to be a difference in delivery times depending on whether Coke is ordered, but the variance in delivery time is larger when Coke is ordered. Female drivers seem to deliver faster than male drivers, but looking at the data directly we see that there were only four deliveries by female drivers.
Checking the interaction between bread and crust, we get four boxplots (see figure 2) - one for each combination of the two binary variables.
> boxplot(Delivery~Crust*Bread,col=c(2,3),main="Crust and Bread")
This example is written mathematically as

  yijk = µ + αj + βk + (αβ)jk + εijk,

where µ is the average delivery time, αj is the effect of thick/thin crust, βk is the effect of bread/no bread, (αβ)jk is the interaction effect of bread and crust, and i = 1, ..., 4, j = 1, 2, k = 1, 2.
Figure 2: Boxplots of delivery times comparing all combinations of bread (yes/no) and crust (thin/thick). The difference between delivery times seems to be the same regardless of whether garlic bread was ordered. This is a hint that in a model the interaction term might not be necessary.
In matrix notation this translates to

  (y111, y112, y121, y122, y211, y212, y221, y222, y311, ..., y422)'
    = X (µ, α1, α2, β1, β2, αβ11, αβ12, αβ21, αβ22)'
      + (ε111, ε112, ε121, ε122, ε211, ε212, ε221, ε222, ε311, ..., ε422)',

where the design matrix X repeats the following block of four rows (one row per crust/bread combination (j, k) = (1,1), (1,2), (2,1), (2,2)) once for each of the four replicates i = 1, ..., 4:

  [ 1 1 0 1 0 1 0 0 0 ]
  [ 1 1 0 0 1 0 1 0 0 ]
  [ 1 0 1 1 0 0 0 1 0 ]
  [ 1 0 1 0 1 0 0 0 1 ]
Here, the matrix X does not have full rank - we need to look more closely at the column space of X.
Excursion: Vector Spaces. V is a (real) vector space (a linear subspace of IRn) if and only if:
(i) V ⊂ IRn, i.e. V is a subset of IRn.
(ii) 0 ∈ V, i.e. the origin is in V.
(iii) For v, w ∈ V also v + w ∈ V, i.e. sums of vectors are in V.
(iv) For v ∈ V also λv ∈ V for all λ ∈ IR, i.e. scalar multiples of vectors are in V.
Examples of vector spaces in IRn are {0} or IRn itself; lines or planes through the origin are also vector spaces.
Lemma 1.1 The column space C(X) of matrix X is a vector space.
Proof:
For a matrix X ∈ IRn×p, i.e. a matrix with n rows and p columns, the column space C(X) is defined as:

  C(X) = { Xb | b = (b1, b2, ..., bp)' ∈ IRp } ⊂ IRn.

The origin is included in C(X), since for b = (0, 0, ..., 0)' we get Xb = (0, ..., 0)' = 0.
If a vector v is in C(X), this means that there exists bv such that v = Xbv.
For two vectors v, w ∈ C(X), we therefore have bv and bw such that v = Xbv and w = Xbw. With that

  v + w = Xbv + Xbw = X(bv + bw) ∈ C(X).

Similarly, for v ∈ C(X) we have

  λv = λXbv = X(λbv) ∈ C(X).

C(X) therefore fulfills all four conditions of a vector space. We can think of C(X) as being a line or a plane or some other higher-dimensional space. □
Why do we care about the column space C(X) at all?
With Eε = 0 we have EY = Xβ, i.e. EY is in C(X)! The ordinary least squares estimate Ŷ of EY is therefore found by taking the point in C(X) that is closest to Y; Ŷ is the orthogonal projection of Y onto C(X) (see figure 3 for the three-dimensional case).
Figure 3: Ŷ is the orthogonal projection of Y onto C(X).
How do we find Ŷ ?
Let’s start by finding Ŷ by hand in one of our first examples:
Example: One-way Anova (3 treatments, 2 repetitions each)
  yij = µ + αj + εij

with

  Y = X (µ, α1, α2, α3)' + ε,   X =
    [ 1 1 0 0 ]
    [ 1 1 0 0 ]
    [ 1 0 1 0 ]
    [ 1 0 1 0 ]
    [ 1 0 0 1 ]
    [ 1 0 0 1 ]
This model has column space

  C(X) = { (a, a, b, b, c, c)' : a, b, c ∈ IR }.
Therefore Ŷ = (a0, a0, b0, b0, c0, c0)' for some a0, b0, c0 ∈ IR which minimize ||Ŷ − Y||²:

  ||Y − Ŷ||² = Σ_{i,j} (yij − ŷij)²
    = (y11 − a)² + (y12 − a)²
    + (y21 − b)² + (y22 − b)²
    + (y31 − c)² + (y32 − c)².

This is minimal when a = (y11 + y12)/2 = ȳ1., b = ȳ2., and c = ȳ3.. This gives the solution

  Ŷ = (ȳ1., ȳ1., ȳ2., ȳ2., ȳ3., ȳ3.)'.
In order to find this result directly, we need a bit of math now:
Projection Matrices
Since C(X) is a vector space, there exists a projection matrix PX for which
• v ∈ C(X) then PX v = v (identity on the column space)
• w ∈ C(X)⊥ then PX w = 0 (null on the space perpendicular to the column space)
Then for any y ∈ IRn , we have y = y1 + y2 with y1 ∈ C(X) and y2 ∈ C(X)⊥ :
PX y = PX (y1 + y2 ) = PX y1 + PX y2 = y1 + 0 = y1 .
Some properties of projection matrices:
1. Idempotence: PX² = PX.
   (Relatively easy to see: apply PX twice to the y above - since this does not change anything, and y could be just any vector, this proves that PX is idempotent.)
2. Symmetry: PX' = PX.
   Proof: let v, w be any vectors in IRn. Then there exist v1, w1 ∈ C(X) and v2, w2 ∈ C(X)⊥ with v = v1 + v2 and w = w1 + w2.
   Then have a look at the matrix PX'(I − PX):

     v' PX' (I − PX) w = (PX v)'(w − PX w) = v1'(w − w1) = v1' w2 = 0,

   because v1 ∈ C(X) and w2 ∈ C(X)⊥. Since we have chosen v and w arbitrarily, this proves that PX'(I − PX) = 0. This is equivalent to PX' = PX' PX. The second matrix is symmetric, which implies the symmetry of PX' and PX.

PX = X(X'X)⁻X' - sometimes PX is called the hat matrix H.
Here, (X'X)⁻ is a generalized inverse of X'X. If X is a full (column) rank matrix, we can use the regular inverse (X'X)⁻¹ instead.
Excursion: Generalized Inverse. Definition: A⁻ is a generalized inverse of a matrix A iff

  A A⁻ A = A.
Properties:
1. A− exists for all matrices A, but is not necessarily unique.
2. If A is square and full rank, A− = A−1 .
3. If A is symmetric, there exists at least one symmetric generalized inverse A− .
4. Note, that A− A does not need to be the identity matrix!
In the example of the one-way anova we have

  X'X =
    [ 6 2 2 2 ]
    [ 2 2 0 0 ]
    [ 2 0 2 0 ]
    [ 2 0 0 2 ],

with a generalized inverse of

  (X'X)⁻ = (1/32) ·
    [ 3  1  1  1 ]
    [ 1 11 −5 −5 ]
    [ 1 −5 11 −5 ]
    [ 1 −5 −5 11 ]
I got this inverse by using R:
library(MASS); ginv(A)
Note that (X'X)⁻X'X is not the identity matrix:

  (X'X)⁻X'X = (1/4) ·
    [ 3  1  1  1 ]
    [ 1  3 −1 −1 ]
    [ 1 −1  3 −1 ]
    [ 1 −1 −1  3 ]
The projection matrix PX is

  PX = X(X'X)⁻X' = (1/32) ·
    [ 16 16  0  0  0  0 ]
    [ 16 16  0  0  0  0 ]
    [  0  0 16 16  0  0 ]
    [  0  0 16 16  0  0 ]
    [  0  0  0  0 16 16 ]
    [  0  0  0  0 16 16 ]

Then Ŷ = PX Y = (ȳ1., ȳ1., ȳ2., ȳ2., ȳ3., ȳ3.)' (the same result we found manually already).
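The same computation can be reproduced in R; a minimal sketch (the response vector y is made up for illustration):

library(MASS)  # for ginv(), a generalized inverse

# design matrix of the rank-deficient effects model, 3 treatments, 2 reps each
X <- cbind(1, kronecker(diag(3), rep(1, 2)))
P <- X %*% ginv(t(X) %*% X) %*% t(X)   # projection (hat) matrix

all.equal(P %*% P, P)   # idempotent
all.equal(t(P), P)      # symmetric

y <- c(5, 7, 6, 8, 9, 7)   # hypothetical observations
P %*% y                    # fitted values = within-treatment means
tapply(y, rep(1:3, each = 2), mean)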
How do we find a generalized inverse?
Let A ∈ IRn×p with rank r, and assume

  A = [ A11 A12 ]
      [ A21 A22 ],

where A22 ∈ IRr×r is a full rank matrix. Then

  A⁻ := [ 0   0     ]
        [ 0   A22⁻¹ ].

Proof:

  A A⁻ A = [ 0   A12 A22⁻¹ ] A = [ A12 A22⁻¹ A21   A12 ]
           [ 0   Id        ]     [ A21             A22 ]

In order to show that A12 A22⁻¹ A21 equals A11, define

  B = [ Id   −A12 A22⁻¹ ]
      [ 0    Id         ].

B is nonsingular, therefore the rank of BA equals the rank of A, which is r.

  BA = [ A11 − A12 A22⁻¹ A21   0   ]
       [ A21                   A22 ]

Since A22 has rank r, the first block column of BA has to be a linear combination of the second block column, i.e. there exists some matrix Q (obtained from a Gaussian elimination process) with

  [ A11 − A12 A22⁻¹ A21 ]  =  [ 0   ] Q  =  [ 0     ]
  [ A21                 ]     [ A22 ]       [ A22 Q ].

Therefore A11 − A12 A22⁻¹ A21 = 0, i.e. A11 = A12 A22⁻¹ A21. □
Example:

  X'X =
    [ 6 2 2 2 ]
    [ 2 2 0 0 ]
    [ 2 0 2 0 ]
    [ 2 0 0 2 ]

has generalized inverse

  (X'X)⁻ = (1/2) ·
    [ 0 0 0 0 ]
    [ 0 1 0 0 ]
    [ 0 0 1 0 ]
    [ 0 0 0 1 ]
Generalized Inverse in five steps:
1. identify a square sub-matrix C of X with full rank r (its rows and columns need not be adjacent),
2. find the inverse C⁻¹ of C,
3. replace the elements of C in X by the elements of (C⁻¹)',
4. replace all other entries in X by 0,
5. transpose to get X⁻.
Example:
  x = (1, 2, 3)' has generalized inverses x⁻ = (1, 0, 0) or x⁻ = (0, 0, 1/3) or, more generally, x⁻ = (a, b/2, c/3) with a + b + c = 1.
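As a quick check of the recipe, the following R sketch builds a generalized inverse of the X'X from the example above out of its full-rank lower-right 3 × 3 block and verifies the defining property (a minimal sketch under those assumptions):

A <- matrix(c(6, 2, 2, 2,
              2, 2, 0, 0,
              2, 0, 2, 0,
              2, 0, 0, 2), nrow = 4, byrow = TRUE)

# five-step recipe: use the full-rank 3x3 block A[2:4, 2:4]
C <- A[2:4, 2:4]
G <- matrix(0, 4, 4)
G[2:4, 2:4] <- t(solve(C))   # insert (C^{-1})' in place of C, zeros elsewhere
G <- t(G)                    # transpose

all.equal(A %*% G %*% A, A)  # defining property A G A = A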
Remember: Ŷ = PX Y = X(X'X)⁻X'Y.

Claim: PX is the orthogonal projection onto C(X).

Proof:
For v ∈ C(X), write v = Xc. Then PX v = X(X'X)⁻X'v = X(X'X)⁻X'Xc, which equals Xc = v once we show that X(X'X)⁻X'X = X.
In order to show X(X'X)⁻X'X = X, use y ∈ IRn with y = y1 + y2, y1 = Xb ∈ C(X), y2 ∈ C(X)⊥:

  y'X(X'X)⁻X'X = y1'X(X'X)⁻X'X = b'X'X(X'X)⁻X'X = b'X'X = y1'X = y'X.

Since y was arbitrary, X(X'X)⁻X'X = X.
For w ∈ C(X)⊥: PX w = X(X'X)⁻X'w = X(X'X)⁻ · 0 = 0, since X'w = 0. □
With Ŷ = PX Y we get the residual vector e = Y − Ŷ = (I − PX)Y.
Properties:
• I − PX is the orthogonal projection onto C(X)⊥.
• C(PX) = C(X) and rk(X) = rk(PX) = tr(PX).
  Proof (only for rk(PX) = tr(PX)): from linear algebra we know that tr(PX) = Σi λi, i.e. the trace of a matrix is equal to the sum of its eigenvalues. Since PX² = PX, this matrix only has eigenvalues 0 and 1. The sum of the eigenvalues is therefore equal to the rank of PX. □
• C(I − PX) = C(X)⊥ and rk(I − PX) = n − rk(PX).
• Pythagorean theorem (anova identity):

  ||Y||² = Y'Y = [(PX + I − PX)Y]' [(PX + I − PX)Y]
         = (PX Y)'(PX Y) + ((I − PX)Y)'((I − PX)Y) + Y'PX'(I − PX)Y + Y'(I − PX)'PX Y
         = ||PX Y||² + ||(I − PX)Y||²,

  since the two cross terms are 0.
Identifying β:
Since Ŷ = Xβ̂, we have X'X(X'X)⁻X'Y = X'Xβ̂.
If X has full rank, then X'X is full rank and (X'X)⁻ = (X'X)⁻¹, so β̂OLS = (X'X)⁻¹X'Y. For full rank X, this solution is unique.
If X is not full rank, there are infinitely many β that solve Xβ = Ŷ.
What do we do without full rank X?

Example: One way anova:

  Means Model:   yij = µj + εij        has full rank matrix X and the unique solution µ̂j = ȳ.j.
  Effects Model: yij = µ + αj + εij    does not have full rank and in general no unique solution. But clearly,

  µ + α1 = µ1.

This can be written as

  (1 1 0 0) (µ, α1, α2, α3)' = c'β.

Question: what makes c special? (It changes the ambiguous β to an unambiguous c'β.)
We have to look at linear combinations c'β more closely:

Theorem 1.2 (Estimability)
For some c ∈ IRp the following properties are equivalent:
1. If Xβ1 = Xβ2 then c'β1 = c'β2.
2. c ∈ C(X').
3. There exists a ∈ IRn such that a'Xβ = c'β for all β.

Definition: if any of the three properties above holds for some c (and with that all of them hold), the expression c'β is called estimable.

Proof:
1) ⇒ 2): Xβ1 = Xβ2 is equivalent to X(β1 − β2) = 0, which is equivalent to β1 − β2 ∈ C(X')⊥. Since 1) holds, c'(β1 − β2) = 0 for all such β1, β2, i.e. c ⊥ (β1 − β2), therefore c ∈ (C(X')⊥)⊥ = C(X').
2) ⇒ 3): c ∈ C(X') ⇒ ∃ a such that X'a = c ⇒ c' = a'X ⇒ c'β = a'Xβ for all β.
3) ⇒ 1): For Xβ1 = Xβ2 there exists a such that c'β1 = a'Xβ1 = a'Xβ2 = c'β2. □
If c'β is estimable, then c'β = a'Xβ; replacing Xβ by its estimate Ŷ = PX Y gives

  a'PX Y = a'X(X'X)⁻X'Y = c'(X'X)⁻X'Y,

using a'X = c'. This generalizes the formula for ordinary least squares estimators with a full rank matrix X. We can therefore define

  c'β̂_OLS := a'Xβ̂ = a'Ŷ.
Example: One-way Anova

  Y = X (µ, α1, α2, α3)' + ε,   X =
    [ 1 1 0 0 ]
    [ 1 1 0 0 ]
    [ 1 0 1 0 ]
    [ 1 0 1 0 ]
    [ 1 0 0 1 ]
    [ 1 0 0 1 ]

β itself is not estimable.
Let c = (0, 1, −1, 0)'.
Is c'β estimable?
Yes, because there exists a ∈ IR6 such that X'a = c: take for instance a = (0, 1, −1, 0, 0, 0)'; then

  X'a = (0 + 1 − 1 + 0 + 0 + 0,  0 + 1,  −1 + 0,  0 + 0)' = (0, 1, −1, 0)' = c.

c is therefore in C(X'), which makes c'β estimable.
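A quick numerical check of estimability can be done in R: c'β is estimable exactly when c lies in C(X') = C(X'X), which we can test by checking whether (X'X)(X'X)⁻c reproduces c. A minimal sketch using the design matrix above:

library(MASS)

X <- cbind(1, kronecker(diag(3), rep(1, 2)))  # one-way anova effects design, 2 reps
XtX <- t(X) %*% X

is_estimable <- function(c_vec) {
  # c'beta is estimable iff c is in C(X') = C(X'X)
  isTRUE(all.equal(drop(XtX %*% ginv(XtX) %*% c_vec), c_vec))
}

is_estimable(c(0, 1, -1, 0))  # TRUE: alpha1 - alpha2 is estimable
is_estimable(c(0, 1, 0, 0))   # FALSE: alpha1 alone is not estimable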
In practice we want to estimate several c1'β, c2'β, ..., cl'β simultaneously. Define

  C = (c1, c2, ..., cl)' ∈ IR^{l×p},

i.e. the matrix with rows c1', ..., cl', and estimate by

  Cβ̂_OLS = C(X'X)⁻X'Y.
Testability
We want to test hypotheses of the form H0 : Cβ = d.
This is a very general way of writing hypotheses - and fits our standard way, e.g.
Example: Simple Linear Regression
  yi = a + bxi + εi

H0 : b = 0 translates to H0 : (0, 1) (a, b)' = 0.
Example: One-way anova
  yij = µ + αj + εij, with i = 1, 2 and j = 1, 2, 3.

H0 : α1 = α2 = α3 translates to

  H0 :  [ 0 1 −1  0 ] (µ, α1, α2, α3)'  =  ( 0 )
        [ 0 1  0 −1 ]                      ( 0 )
First condition: each row of C has to be estimable (i.e. the rows in C are linear combinations of rows of X)
Estimability alone is not enough; look at (assuming a regression model):

  [ 1 0 0 ] (α0, α1, α2)'  =  ( 3 )
  [ 1 0 0 ]                   ( 7 )

Both rows are estimable, but the expression is still nonsensical. This leads to a necessary second condition:
rows in C have to be linearly independent.
Definition 1.3 (Testability)
The hypothesis H0 : Cβ = d is testable, if
1. every row in C is estimable,
2. the rank of C is l.
The concept of testability is sometimes strange:

  [ 1 0 0 ] (α0, α1, α2)'  =  ( 3 )     is not testable,
  [ 1 0 0 ]                   ( 3 )

but

  ( 1 0 0 ) (α0, α1, α2)'  =  3         is testable.

How does a hypothesis 'look'?
Since every row ci' of C is estimable, there exists ai ∈ IRn with ai'X = ci'. Therefore there is a matrix A ∈ IR^{l×n} with AX = C, namely A = (a1, a2, ..., al)'.
Most hypotheses are of the form H0 : Cβ = 0. With Cβ = 0 we have A·Xβ = 0. This means that Xβ is perpendicular to each row of A, i.e. under H0 the mean Xβ lies in C(A')⊥. On the other hand, we know that Ŷ ∈ C(X). Under H0 the predicted value Ŷ is therefore sought in the intersection of these two spaces, i.e.

  Ŷ ∈ C(X) ∩ C(A')⊥.
Example

  (y1, y2, y3)' = [ 1 0 ] (µ1, µ2)' + ε
                  [ 1 0 ]
                  [ 0 1 ]

Consider H0 : µ1 = µ2. This can be written as H0 : (1, −1)(µ1, µ2)' = 0, which is equivalent to

  (1, 0, −1) [ 1 0 ] (µ1, µ2)'  =  0,   i.e.  A = (1, 0, −1).
             [ 1 0 ]
             [ 0 1 ]

In this setting

  C(X)           = { v ∈ IR3 : v1 = v2 }            = { (a, a, b)' : a, b ∈ IR }
  C(A')          = { v ∈ IR3 : v1 = −v3, v2 = 0 }   = { (a, 0, −a)' : a ∈ IR }
  C(A')⊥         = { v ∈ IR3 : v1 = v3 }            = { (a, b, a)' : a, b ∈ IR }
  C(X) ∩ C(A')⊥  = { v ∈ IR3 : v1 = v2 = v3 }       = { (a, a, a)' : a ∈ IR }

[Sketch: the plane C(X) and the line C(X) ∩ C(A')⊥ in (v1, v2, v3)-space.]
Since the null hypothesis H0 : Cβ = 0 is equivalent to the notion that Ŷ ∈ C(X) ∩ C(A')⊥ with AX = C, we need to talk about the distribution of errors (and with that the distribution of Y).
We are going to look at two different models more closely: the Gauss-Markov Model and the Aitken Model. Both make assumptions on mean and variance of the error term ε, but do not specify a full distribution.
For a linear model of the form

  Y = Xβ + ε,

the Gauss-Markov assumptions are

  Eε = 0,   Var ε = σ²I,

i.e. we are assuming uncorrelated errors with identical variances.
The Aitken assumptions for the above linear model are

  Eε = 0,   Var ε = σ²V,

where V is a known (symmetric and positive definite) matrix.
Gauss-Markov
Based on Gauss-Markov error terms, we want to derive means and variances for observed and predicted responses Y and Ŷ, draw conclusions about estimators c'β̂, and get an estimate s² for σ².
With Eε = 0, Var ε = σ²I we get
• EY = E[Xβ + ε] = Xβ + Eε = Xβ,
  Var Y = Var[Xβ + ε] = Var ε = σ²I.

• EŶ = E[PX Y] = PX EY = PX Xβ = Xβ (since Xβ ∈ C(X)),
  Var Ŷ = Var[PX Y] = PX Var Y PX' = PX · σ²I · PX' = σ²PX.

• E[Y − Ŷ] = 0,  Var[Y − Ŷ] = (I − PX) Var Y (I − PX)' = σ²(I − PX).

• For estimable Cβ, we get for the least squares estimate Cβ̂_OLS = C(X'X)⁻X'Y:

  E[Cβ̂_OLS] = C(X'X)⁻X' EY = C(X'X)⁻X'Xβ = A X(X'X)⁻X' Xβ = A PX Xβ = AXβ = Cβ,

  i.e. an unbiased estimator.

  Var[Cβ̂_OLS] = C(X'X)⁻X' Var[Y] (C(X'X)⁻X')'
              = σ² C (X'X)⁻X' · X (X'X)⁻ C'     (using a symmetric choice of (X'X)⁻)
              = σ² C(X'X)⁻C',

  where we used (X'X)⁻X'X(X'X)⁻ = (X'X)⁻ (*).
  (*) holds because for a generalized inverse A⁻ of a matrix A, the matrix A is a generalized inverse of A⁻, i.e. A⁻AA⁻ = A⁻.

  For a full rank model and C = I the above expression for the variance of β̂ simplifies to the usual expression:

  Var β̂_OLS = σ²(X'X)⁻¹.
• For the sum of squared errors e'e = Σi ei² we get

  E[e'e] = E[(Y − Ŷ)'(Y − Ŷ)] = E[(Y − PX Y)'(Y − PX Y)] = E[Y'(I − PX)Y] = E[Y'Y] − E[Ŷ'Ŷ].

  For E[Y'Y] we have

  E[Y'Y] = Σi E[yi²] = Σi ( Var yi + (Eyi)² ) = Σi ( σ² + (Xβ)i² ) = nσ² + (Xβ)'Xβ.

  And, similarly:

  E[Ŷ'Ŷ] = Σi E[ŷi²] = Σi ( Var ŷi + (Eŷi)² ) = Σi ( σ²(PX)ii + (Xβ)i² )
         = σ² tr(PX) + (Xβ)'Xβ = σ² rk(X) + (Xβ)'Xβ.

  Therefore, for the sum of squared errors we get:

  E[e'e] = σ²(n − rk(X)).

This last result suggests the standard estimate s² for σ²:

  s² = e'e / (n − rk(X)) = SSE / dfE = MSE.
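In R this estimate can be computed directly from the projection matrix; a minimal sketch using the one-way anova design from above (the response vector y is hypothetical):

library(MASS)

X <- cbind(1, kronecker(diag(3), rep(1, 2)))  # 3 treatments, 2 reps each
y <- c(5, 7, 6, 8, 9, 7)                      # hypothetical data

P   <- X %*% ginv(t(X) %*% X) %*% t(X)        # projection onto C(X)
e   <- (diag(6) - P) %*% y                    # residuals
sse <- sum(e^2)
rkX <- qr(X)$rank                             # rank of X (here 3)
s2  <- sse / (length(y) - rkX)                # MSE = SSE / dfE
s2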
Properties of the Least Squares Estimator
• Linearity: since c'β̂_OLS = c'(X'X)⁻X'Y, with c'(X'X)⁻X' a row vector, the least squares estimator is a linear re-combination of the entries of Y.
• Unbiasedness: E[c'β̂_OLS] = c'β, the least squares estimator is unbiased.
The least squares estimator is a "BLUE" (best linear unbiased estimator) if its variance is minimal among the linear unbiased estimators. That is the next theorem's content:
Theorem 1.4 (Gauss-Markov Estimates)
In the linear model Y = Xβ + ε with Eε = 0 and Var ε = σ²I, for estimable c'β the BLUE is c'β̂_OLS = c'(X'X)⁻X'Y.
We have to show that among the linear unbiased estimators the least squares estimator is the one with the smallest variance.

Proof:
Idea: take an arbitrary unbiased linear estimator and show that its variance is at least as big as the variance of the least squares estimator.
Notation: let ρ' := c'(X'X)⁻X', so that c'β̂_OLS = ρ'Y.
First, an observation: PX ρ = ρ, because ρ = X · ((X'X)⁻)'c, i.e. X times some vector, so ρ ∈ C(X).
Let v ∈ IRn with E[v'Y] = c'β for all β (a linear, unbiased estimator).
Since c'β = E[v'Y] = v'EY = v'Xβ for all β, this implies c' = v'X.
The variance of v'Y is then:

  Var[v'Y] = Var[(v'Y − ρ'Y) + ρ'Y]
           = Var[(v − ρ)'Y] + Var[ρ'Y] + 2 Cov((v − ρ)'Y, ρ'Y)
           ≥ Var(ρ'Y),

since Var[(v − ρ)'Y] ≥ 0 and the covariance term is 0, which still has to be shown, see (*).
We still have to show (*):

  Cov((v − ρ)'Y, ρ'Y) = (v − ρ)' Var(Y) ρ = σ²(v − ρ)'ρ = σ²(v'ρ − ρ'ρ)
    = σ²(v'PX ρ − ρ'ρ)              (since PX ρ = ρ)
    = σ²(v'X(X'X)⁻X'ρ − ρ'ρ)
    = σ²(c'(X'X)⁻X'ρ − ρ'ρ)         (since v'X = c')
    = σ²(ρ'ρ − ρ'ρ) = 0.  □
Under the Gauss-Markov assumptions OLS estimators are optimal. This does not hold in a more general setting.
Example: Anova

  yij = µj + εij, with i = 1, 2 repetitions and j = 1, 2, 3 treatments.

It is known that the second repetition has a lower variance:

  (y11, y21, y12, y22, y13, y23)' = X (µ1, µ2, µ3)' + (ε11, ε21, ε12, ε22, ε13, ε23)',

  X =
    [ 1 0 0 ]
    [ 1 0 0 ]
    [ 0 1 0 ]
    [ 0 1 0 ]
    [ 0 0 1 ]
    [ 0 0 1 ],

with cov(ε) = σ² diag(1, 0.01, 1, 0.01, 1, 0.01).

The OLS estimate b = (ȳ.1, ȳ.2, ȳ.3)' is then the wrong idea, since

  Var b1 = Var( ½ (y11 + y21) ) = ¼ σ²(1 + 0.01) ≥ ¼ σ².
But, if we decide to ignore the first repetition and use b̃1 = y21, we get a variance of

  Var b̃1 = Var(y21) = 0.01 σ² < ¼ σ².
What do we do under the even more general assumptions of the Aitken Model?
Aitken
Let V be a symmetric, positive definite matrix with Var ε = σ²V for the linear model

  Y = Xβ + ε, where Eε = 0.

Then there exists a symmetric matrix V^{−1/2} with V^{−1/2}V^{−1/2} = V^{−1} and V^{1/2}V^{1/2} = V.

Small Excursion: Why does V^{1/2} exist?
• V is a symmetric matrix, therefore there exists an orthonormal matrix Q which diagonalizes V, i.e.

  V = Q D_V Q',

  where D_V = diag(λ1, ..., λn) is the matrix of eigenvalues λi of V.
• V is positive definite, therefore all its eigenvalues λi are strictly positive (which implies that we can take the square roots of λi).
These two properties of V lead to defining the square root matrix V^{1/2} as

  V^{1/2} = Q D_V^{1/2} Q',   where D_V^{1/2} = diag(√λ1, √λ2, ..., √λn).

Then V^{1/2} · V^{1/2} = Q D_V^{1/2} Q' · Q D_V^{1/2} Q' = Q D_V^{1/2} D_V^{1/2} Q' = Q D_V Q' = V, and V^{1/2} is symmetric.
The inverse V^{−1/2} = (V^{1/2})^{−1} = Q D_V^{−1/2} Q'.
End of Small Excursion
Define U := V^{−1/2} Y. We will have a look at EU and Var U:

  EU = V^{−1/2} EY = V^{−1/2} Xβ,
  Var U = V^{−1/2} Var Y (V^{−1/2})' = V^{−1/2} · σ²V · V^{−1/2} = σ² V^{−1/2} · V^{1/2}V^{1/2} · V^{−1/2} = σ²I.

These are the Gauss-Markov assumptions. Looking at the model

  U = Wβ + ε*,

with U = V^{−1/2}Y, W = V^{−1/2}X and ε* = V^{−1/2}ε, we have found a transformation of the original model that turns the Aitken assumptions into Gauss-Markov assumptions.
The gain is (hopefully) obvious: we can apply all the theory we know for models with Gauss-Markov assumptions to models with Aitken assumptions, including BLUEs.
Example: Anova
We are going to find the BLUE for β = (µ1, µ2, µ3)' in the model

  yij = µj + εij, with i = 1, 2 repetitions and j = 1, 2, 3 treatments,

where

  (y11, y21, y12, y22, y13, y23)' = X (µ1, µ2, µ3)' + (ε11, ε21, ε12, ε22, ε13, ε23)',

  X =
    [ 1 0 0 ]
    [ 1 0 0 ]
    [ 0 1 0 ]
    [ 0 1 0 ]
    [ 0 0 1 ]
    [ 0 0 1 ],

with cov(ε) = σ² diag(1, 0.01, 1, 0.01, 1, 0.01) = σ²V.
The transformed model U = Wβ + ε* in this example is obtained from

  V^{1/2}  = diag(1, 0.1, 1, 0.1, 1, 0.1),
  V^{−1/2} = diag(1, 10, 1, 10, 1, 10).

Therefore

  U = V^{−1/2}Y = (y11, 10y21, y12, 10y22, y13, 10y23)',

  W = V^{−1/2}X =
    [  1  0  0 ]
    [ 10  0  0 ]
    [  0  1  0 ]
    [  0 10  0 ]
    [  0  0  1 ]
    [  0  0 10 ].

The BLUE for the model U = Wβ + ε* is the OLS estimator β̂_OLS(U) = (W'W)⁻W'U.


  W'W = diag(101, 101, 101),

  (W'W)⁻ = (1/101) I3,

  (W'W)⁻W' = (1/101) ·
    [ 1 10 0  0 0  0 ]
    [ 0  0 1 10 0  0 ]
    [ 0  0 0  0 1 10 ].

The BLUE is then

  (µ̂1, µ̂2, µ̂3)' = (W'W)⁻W'U = ( (y11 + 100 y21)/101, (y12 + 100 y22)/101, (y13 + 100 y23)/101 )'.
We can use the same estimate for the original model Y = Xβ + ε, since

  Û = Wβ̂_OLS(U) = V^{−1/2}Xβ̂_OLS(U) = (µ̂1, 10µ̂1, µ̂2, 10µ̂2, µ̂3, 10µ̂3)'

and Ŷ = V^{1/2}Û = (µ̂1, µ̂1, µ̂2, µ̂2, µ̂3, µ̂3)' = Xβ̂_OLS(U) ∈ C(X).

The OLS for U minimizes (U − Û)'(U − Û) over choices of Û in C(W):

  (U − Û)'(U − Û) = (V^{−1/2}Y − V^{−1/2}Ŷ)'(V^{−1/2}Y − V^{−1/2}Ŷ) = (Y − Ŷ)' V^{−1} (Y − Ŷ).

This expression is minimized over choices of Ŷ in C(V^{1/2}W) = C(X). This is the generalized (weighted) least squares criterion based on Y.
The BLUE in the Aitken model is therefore

  c'β̂ = c'β̂_OLS(U) = c'(X'V^{−1}X)⁻X'V^{−1}Y.
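The two routes (transforming to U = Wβ + ε* and applying OLS, or using the generalized least squares formula directly) give the same estimate; a minimal R sketch for the example above, with a hypothetical response vector:

library(MASS)

X <- kronecker(diag(3), rep(1, 2))           # means-model design, 2 reps per treatment
V <- diag(c(1, 0.01, 1, 0.01, 1, 0.01))      # known error covariance (up to sigma^2)
y <- c(5.2, 5.0, 7.1, 7.0, 8.3, 8.1)         # hypothetical observations

# route 1: transform and use OLS
W <- diag(1 / sqrt(diag(V))) %*% X           # V^{-1/2} X (V is diagonal here)
U <- diag(1 / sqrt(diag(V))) %*% y
beta_trans <- ginv(t(W) %*% W) %*% t(W) %*% U

# route 2: generalized least squares formula
beta_gls <- ginv(t(X) %*% solve(V) %*% X) %*% t(X) %*% solve(V) %*% y

cbind(beta_trans, beta_gls)                  # identical, e.g. (y11 + 100*y21)/101 for mu1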
What happens outside the Aitken model?
Example: Anova

  yij = µ + εij,

where the εij are independent with Var εij = σi² for i = 1, 2. For 3 groups and 2 observations per group this gives

  (y11, y21, y12, y22, y13, y23)' = (1, 1, 1, 1, 1, 1)' µ + (ε11, ε21, ε12, ε22, ε13, ε23)',
  with cov(ε) = diag(σ1², σ2², σ1², σ2², σ1², σ2²).

We get the BLUE for µ as

  µ̂ = ( (1/σ1²) ȳ1. + (1/σ2²) ȳ2. ) / ( 1/σ1² + 1/σ2² ) = ( σ2² ȳ1. + σ1² ȳ2. ) / ( σ1² + σ2² ),

where ȳ1. and ȳ2. are the averages of the first- and second-repetition observations.
But: usually we don't know σ1²/σ2². We could use the sample variances s1² and s2², but that takes us out of the framework of linear estimators.
Reparametrizations
Consider the two models (Gauss-Markov assumptions):

  Version (I):  Y = Xβ + ε
  Version (II): Y = Wγ + ε

If C(X) = C(W) these models are the same, i.e. they give the same predictions Ŷ and residuals Y − Ŷ, and the same estimable functions c'β.
What estimable functions in (II) correspond to estimable functions in (I)?
Let c'β be an estimable function in (I), i.e. c ∈ C(X'), so c' = a'X for some a.
If C(X) = C(W), then each column of W is a linear combination of columns of X, i.e.

  ∃ F with W = XF,

and then

  c'β = a'Xβ = a'Wγ = a'XFγ = (c'F)γ,

so the estimable function c'β in (I) corresponds to the estimable function (c'F)γ in (II).
Example: Anova, 3 groups, 2 reps per group

  Version (I): full rank means model            yij = µj + εij
  Version (II): rank deficient effects model    yij = µ + αj + εij

  (I):  Y = X (µ1, µ2, µ3)' + ε,        X =
          [ 1 0 0 ]
          [ 1 0 0 ]
          [ 0 1 0 ]
          [ 0 1 0 ]
          [ 0 0 1 ]
          [ 0 0 1 ]

  (II): Y = W (µ, α1, α2, α3)' + ε,     W =
          [ 1 1 0 0 ]
          [ 1 1 0 0 ]
          [ 1 0 1 0 ]
          [ 1 0 1 0 ]
          [ 1 0 0 1 ]
          [ 1 0 0 1 ]

Then W = XF for

  F = [ 1 1 0 0 ]
      [ 1 0 1 0 ]
      [ 1 0 0 1 ].

So e.g. µ1 = (1, 0, 0)(µ1, µ2, µ3)' is estimable in (I). This corresponds to

  (c'F)γ = (1, 1, 0, 0)(µ, α1, α2, α3)' = µ + α1.

Clearly: c'β̂_OLS = ȳ.1 = (c'F)γ̂_OLS.
What is there to choose between these models?
1. The full rank model behaves more nicely mathematically and is computationally simpler.
2. Scientific interpretability of parameters sometimes pushes for other models.
Note: we can use any matrix B with the same column space as X.
Example: two way factorial model
Let i = 1, 2; j = 1, 2; k = 1, 2, and consider the model

  yijk = µ + αi + βj + (αβ)ij + εijk.

This corresponds to the design matrix W (rows ordered as (i, j) = (1,1), (1,2), (2,1), (2,2), two replicates each) for the parameter vector (µ, α1, α2, β1, β2, αβ11, αβ12, αβ21, αβ22)':

  W =
    [ 1 1 0 1 0 1 0 0 0 ]
    [ 1 1 0 1 0 1 0 0 0 ]
    [ 1 1 0 0 1 0 1 0 0 ]
    [ 1 1 0 0 1 0 1 0 0 ]
    [ 1 0 1 1 0 0 0 1 0 ]
    [ 1 0 1 1 0 0 0 1 0 ]
    [ 1 0 1 0 1 0 0 0 1 ]
    [ 1 0 1 0 1 0 0 0 1 ]
Idea: find a second matrix X with full rank and the same column space as W , to resolve problems of rank
deficiency.
The way we do this is by introducing constraints for the parameters. There are different choices of constraints:
  Statistician's favorite: zero-sum constraints      Most software uses: baseline constraints
                                                     (R: first effects are zero)
  Σi αi = 0                                          α1 = 0
  Σj βj = 0                                          β1 = 0
  Σi αβij = 0 for all j                              αβ1j = 0 for all j
  Σj αβij = 0 for all i                              αβi1 = 0 for all i

In the 2 × 2 factorial model with 2 repetitions the sum restrictions translate to:

  α1 = −α2,   β1 = −β2,
  αβ11 = −αβ21,   αβ12 = −αβ22,   and   αβ11 = −αβ12 = αβ22.
This leads to a different version of W, now for the parameter vector (µ, α1, β1, αβ11)':

    [ 1  1  1  1 ]
    [ 1  1  1  1 ]
    [ 1  1 −1 −1 ]
    [ 1  1 −1 −1 ]
    [ 1 −1  1 −1 ]
    [ 1 −1  1 −1 ]
    [ 1 −1 −1  1 ]
    [ 1 −1 −1  1 ]
Baseline restrictions give a different matrix, for the parameter vector (µ, α2, β2, αβ22)':

    [ 1 0 0 0 ]
    [ 1 0 0 0 ]
    [ 1 0 1 0 ]
    [ 1 0 1 0 ]
    [ 1 1 0 0 ]
    [ 1 1 0 0 ]
    [ 1 1 1 1 ]
    [ 1 1 1 1 ]
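In R these two full-rank parameterizations correspond to different contrast settings in model.matrix; a minimal sketch (the data frame d and its factor levels are made up, but match the 2 × 2 design with 2 replicates):

# 2 x 2 factorial with 2 replicates per cell
d <- expand.grid(rep = 1:2, B = factor(1:2), A = factor(1:2))

# baseline (treatment) constraints -- R's default, first-level effects are zero;
# retained columns are (Intercept), A2, B2, A2:B2, as in the baseline matrix above
model.matrix(~ A * B, data = d,
             contrasts.arg = list(A = "contr.treatment", B = "contr.treatment"))

# zero-sum constraints: columns coded +1 / -1, as in the zero-sum matrix above
model.matrix(~ A * B, data = d,
             contrasts.arg = list(A = "contr.sum", B = "contr.sum"))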






We are about to make distribution assumptions now. Since we are going to be talking about normal distributions quite a bit, it might be a good idea to review some of their properties.

Some Useful Facts About Multivariate Distributions (in Particular Multivariate Normal Distributions)
Here are some important facts about multivariate distributions in general and multivariate normal distributions specifically.

1. If a random vector X (k × 1) has mean vector µ (k × 1) and covariance matrix Σ (k × k), then Y = BX + d (with B an l × k matrix and d an l × 1 vector) has mean vector EY = Bµ + d and covariance matrix Var Y = BΣB'.

2. The MVN distribution is most usefully defined as the distribution of X = AZ + µ, for Z a p × 1 vector of independent standard normal random variables and A a k × p matrix. Such a random vector has mean vector µ and covariance matrix Σ = AA'. (This definition turns out to be unambiguous: any dimension p and any matrix A giving a particular Σ end up producing the same k-dimensional joint distribution.)

3. If X is multivariate normal, so is Y = BX + d.

4. If X is MVNk(µ, Σ), its individual marginal distributions are univariate normal. Further, any sub-vector of dimension l < k is multivariate normal (with mean vector the appropriate sub-vector of µ and covariance matrix the appropriate sub-matrix of Σ).

5. If X is MVNk(µ1, Σ11) and Y, independent of X, is MVNl(µ2, Σ22), then

  (X, Y)' ∼ MVN_{k+l}( (µ1, µ2)',  [ Σ11  0   ] )
                                   [ 0    Σ22 ]

6. For non-singular Σ, the MVNk(µ, Σ) distribution has a (joint) pdf on k-dimensional space given by

  fX(x) = (2π)^{−k/2} |det Σ|^{−1/2} exp( −½ (x − µ)' Σ⁻¹ (x − µ) ).

7. The joint pdf given in 6 above can be studied and conditional distributions (given values for part of the X vector) identified. For

  X = (X1, X2)' ∼ MVNk( (µ1, µ2)',  [ Σ11  Σ12 ] ),
                                    [ Σ21  Σ22 ]

where X1 is l × 1 and X2 is (k − l) × 1, the conditional distribution of X1 given that X2 = x2 is

  X1 | X2 = x2 ∼ MVNl( µ1 + Σ12 Σ22⁻¹ (x2 − µ2),  Σ11 − Σ12 Σ22⁻¹ Σ21 ).

8. All correlations between two parts of a MVN vector being equal to 0 implies that those parts of the vector are independent.

The next paragraph looks at the distribution of quadratic forms, i.e. distributions of Y'AY. This will help in the analysis of all the sums of squares involved (SSE, SST, ...).
Distribution of Quadratic Forms
[Plot: χ²k density functions for various degrees of freedom k = 1, 2, 3, 5, 10.]

Definition 1.6 (χ² distribution)
For Z ∼ MVN(0, I_{k×k}), Y := Z'Z = Σi Zi² ∼ χ²k, where k is the degrees of freedom.
Then E[Y] = k, Var[Y] = 2k.
[Plot: χ²3(λ) density functions for various noncentrality parameters λ = 0, 1, 2, 5, 10.]

Definition 1.7 (noncentral χ² distribution)
For Z ∼ MVN(µ, I_{k×k}), Y := Z'Z = Σi Zi² ∼ χ²k(λ), where k is the degrees of freedom and λ = µ'µ the noncentrality parameter.
Then E[Y] = k + λ, Var[Y] = 2k + 4λ.
Theorem 1.8
Let Y ∼ MVN(µ, Σ) with positive definite covariance Σ.
Let A be a symmetric n × n matrix with rk(A) = k ≤ n.
If AΣ is idempotent, then

  Y'AY ∼ χ²_{rk(A)}(µ'Aµ).

If Aµ = 0 then Y'AY ∼ χ²_{rk(A)}.

Proof:
The overall goal is to find a random vector Z ∼ MVN(ν, I_{k×k}) for some ν with Z'Z = Y'AY. We will do that in several steps:
• AΣ is idempotent, i.e. AΣ · AΣ = AΣ.
• Σ is positive definite, i.e. Σ has full rank, therefore Σ⁻¹ exists. With that, the above formula becomes

  AΣAΣ · Σ⁻¹ = AΣ · Σ⁻¹   ⟺   AΣA = A.   (1)

• Then we can show that A itself is positive semi-definite:

  x'Ax = x' · AΣA · x = x' · A'ΣA · x = (Ax)' Σ (Ax) ≥ 0,

  using A' = A, (1), and the positive definiteness of Σ. A is positive semi-definite, therefore it has k strictly positive eigenvalues λ1, ..., λk and n − k eigenvalues that are 0.
• Since A is symmetric positive semi-definite of rank k, there exists Q ∈ IR^{n×k} with Q'Q = I_{k×k} and

  A = Q D_A Q',   where D_A = diag(λ1, ..., λk).

• Let B = Q D_A^{−1/2}.
• Define Z := B'AY. Then Z ∼ MVN(B'Aµ, B'A · Σ · A'B) = MVN(B'Aµ, I_{k×k}), because

  B'A · Σ · A' B = B'AB = D_A^{−1/2} Q' · A · Q D_A^{−1/2} = D_A^{−1/2} D_A D_A^{−1/2} = I_{k×k},

  using AΣA = A and Q'AQ = D_A. Therefore Z'Z ∼ χ²k(µ'A'BB'Aµ).
• For this Z we have to show two things: first that Z'Z = Y'AY, second that µ'A'BB'Aµ = µ'Aµ. Both are equivalent to showing that A'BB'A = A:

  A'BB'A = Q D_A Q' · Q D_A^{−1/2} · D_A^{−1/2} Q' · Q D_A Q' = Q D_A D_A^{−1} D_A Q' = Q D_A Q' = A.  □
Example: Gauss-Markov Model
  Y ∼ MVN(Xβ, σ²I_{n×n}).
In a linear model, the sum of squared errors SSE satisfies

  SSE/σ² = (1/σ²)(Y − Ŷ)'(Y − Ŷ) = (1/σ²)[(I − PX)Y]'[(I − PX)Y] = (1/σ²) Y'(I − PX)Y.

We want to find a distribution for the quadratic form Y'(I − PX)σ⁻²Y.
Since (I − PX)σ⁻² · σ²I = I − PX is an idempotent matrix, we can use the previous theorem:

  Y'(I − PX)σ⁻²Y ∼ χ²_{rk(I−PX)}(β'X'(I − PX)σ⁻²Xβ) = χ²_{n−rk(X)},

since (I − PX)X = X − PX X = X − X = 0 and rk(I − PX) = dim C(X)⊥ = n − rk(X).
Since SSE/σ² ∼ χ²_{n−rk(X)} in the model Y ∼ MVN(Xβ, σ²I), we can use this to get confidence intervals for σ². For α ∈ (0, 1) the (1 − α)100% confidence interval for σ² is given as:

  P( (α/2) quantile of χ²_{n−rk(X)}  ≤  SSE/σ²  ≤  (1 − α/2) quantile of χ²_{n−rk(X)} ) = 1 − α
  ⟺  P( SSE / ((1 − α/2) quantile)  ≤  σ²  ≤  SSE / ((α/2) quantile) ) = 1 − α,

i.e. ( SSE / ((1 − α/2) quantile),  SSE / ((α/2) quantile) ) is a (1 − α)100% confidence interval.
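In R this interval is obtained from qchisq; a minimal sketch, assuming SSE, n and rk(X) have already been computed for some model (the numbers below are hypothetical):

sse   <- 12.4
n     <- 20
rkX   <- 4
alpha <- 0.05

df <- n - rkX
ci <- c(sse / qchisq(1 - alpha / 2, df),   # lower limit uses the upper quantile
        sse / qchisq(alpha / 2, df))       # upper limit uses the lower quantile
ci   # (1 - alpha)100% confidence interval for sigma^2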
Theorem 1.9 (cf. Christensen 1.3.7)
Let Y ∼ MVN(µ, σ²I) and A, B ∈ IR^{n×n} with BA = 0. Then
1. if A is symmetric, then Y'AY and BY are independent;
2. if both A and B are symmetric, then Y'AY and Y'BY are independent.

Proof:
Both statements can be shown similarly. In order to get a statement about independence, we will introduce a new random variable and analyze its covariance structure. Look at

  ( A ) Y = ( AY )
  ( B )     ( BY ).

This has covariance structure

  ( A ) cov(Y) (A', B') = ( A ) σ²I (A', B') = σ² [ AA'  AB' ] = σ² [ AA'  0   ],
  ( B )                   ( B )                   [ BA'  BB' ]      [ 0    BB' ]

because BA' = BA = 0 (using A = A') and AB' = (BA')' = 0.
Therefore AY and BY are independent (jointly normal with zero covariance). Any function of AY is then also independent of BY.
Write

  Y'AY = Y'AA⁻AY = (AY)' A⁻ (AY),

using that A is symmetric and A = AA⁻A. Therefore Y'AY is a function of AY and hence independent of BY.
If B is also symmetric, then by the same argument Y'AY and Y'BY are independent. □
Example: estimable functions in the Gauss-Markov model
Let Y ∼ MVN(Xβ, σ²I) and take A = I − PX and B = PX. Then BA = 0, and both A and B are symmetric.
Therefore Y'AY = Y'(I − PX)Y = (Y − Ŷ)'(Y − Ŷ) = SSE is independent of BY = PX Y = Ŷ.
For an estimable function c'β we get:

  c'β̂_OLS = c'(X'X)⁻X'Y = a'X(X'X)⁻X'Y = a'PX Y = a'Ŷ   (using ∃ a with a'X = c'),

i.e. the estimate of any estimable function c'β can be written as a linear combination of Ŷ. Since Ŷ is independent of SSE, so is c'β̂.
Definition 1.10 (t distribution)
Let Z ∼ N(0, 1) and W ∼ χ²k with Z, W independent. Then

  T := Z / √(W/k) ∼ tk.

Then E[T] = 0, Var[T] = k/(k − 2) (for k > 2).

[Plot: tk density functions for various degrees of freedom k = 1, 2, 5, 10, together with the normal curve.]
In the previous example, we have

  SSE/σ² ∼ χ²_{n−rk(X)}   and   c'β̂_OLS ∼ N(c'β, σ² c'(X'X)⁻c),   independent.

Then

  ( (c'β̂_OLS − c'β) / √(σ² c'(X'X)⁻c) ) / √( SSE / (σ²(n − rk(X))) )
    = (c'β̂_OLS − c'β) / √( c'(X'X)⁻c · MSE )  ∼  t_{n−rk(X)}.

We can test H0 : c'β = # using the test statistic

  T = (c'β̂_OLS − #) / √( c'(X'X)⁻c · MSE )

with t_{n−rk(X)} as null distribution, and get confidence intervals for c'β from

  P( −t* ≤ (c'β̂_OLS − c'β) / √( c'(X'X)⁻c · MSE ) ≤ t* ) = 1 − α,

where t* is the (1 − α/2) quantile of the t_{n−rk(X)} distribution. Therefore c'β has the (1 − α)100% C.I.

  c'β̂_OLS ± t* √( c'(X'X)⁻c · MSE ).
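A minimal R sketch of this confidence interval for the estimable contrast c'β = α1 − α2 in the one-way anova design used earlier (the data are hypothetical):

library(MASS)

X <- cbind(1, kronecker(diag(3), rep(1, 2)))   # rank-deficient effects design
y <- c(5, 7, 6, 8, 9, 7)                       # hypothetical data
c_vec <- c(0, 1, -1, 0)                        # estimable: alpha1 - alpha2

XtX_inv <- ginv(t(X) %*% X)
est   <- drop(t(c_vec) %*% XtX_inv %*% t(X) %*% y)   # c'(X'X)^- X'Y
rkX   <- qr(X)$rank
mse   <- sum(((diag(6) - X %*% XtX_inv %*% t(X)) %*% y)^2) / (length(y) - rkX)
se    <- sqrt(drop(t(c_vec) %*% XtX_inv %*% c_vec) * mse)
tstar <- qt(1 - 0.05 / 2, df = length(y) - rkX)

est + c(-1, 1) * tstar * se                    # 95% confidence interval for c'beta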
Theorem 1.11 (Cochran's Theorem)
Let Y ∼ MVN(0, I_{n×n}) with Y'Y = Σ_{i=1}^k Qi, where Qi = Y'BiY and the Bi ∈ IR^{n×n} are positive semi-definite matrices with rank ri ≤ n. Then the following are equivalent:
i.   Σi ri = n,
ii.  Qi ∼ χ²_{ri} for all i,
iii. the Qi are mutually independent.
Applications: Predictions
Assume that for the linear model Y ∼ MVN(Xβ, σ²I) new observations become available. Let c'β be an estimable function and y* a statistic of the new observations, with

  Ey* = c'β   and   Var y* = γσ²,

where γ is some known constant.
Example:
Assume a simple means model

  yij = µj + εij,

with j = 1, 2, 3 treatments and 3, 2, 2 replications respectively. Two additional experiments for treatment 3 will be done, i.e. c' = (0, 0, 1). Let y* be the mean of the two new observations, so Var y* = ½ σ².
For the difference between c'β̂_OLS and y* we have the distribution

  c'β̂_OLS − y* ∼ N( 0, σ²( c'(X'X)⁻c + γ ) ).

Since MSE and c'β̂_OLS are independent, the ratio has a t distribution:

  (c'β̂_OLS − y*) / √( (c'(X'X)⁻c + γ) · MSE )  ∼  t_{n−rk(X)}.

Then use

  c'β̂_OLS ± t* √( (c'(X'X)⁻c + γ) · MSE )

as 1 − α level prediction limits for y*.
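A minimal R sketch of these prediction limits for the example above (hypothetical data; the new statistic y* is the mean of 2 future observations on treatment 3, so γ = 1/2):

X <- cbind(c(1,1,1,0,0,0,0), c(0,0,0,1,1,0,0), c(0,0,0,0,0,1,1))  # means model, n_j = 3, 2, 2
y <- c(10, 11, 9, 14, 15, 20, 19)                                  # hypothetical data
c_vec <- c(0, 0, 1)                                                # c'beta = mu_3
gamma <- 1/2                                                       # Var(y*) = gamma * sigma^2

XtX_inv <- solve(t(X) %*% X)                        # full rank here
est   <- drop(t(c_vec) %*% XtX_inv %*% t(X) %*% y)  # = mean of treatment 3
rkX   <- qr(X)$rank
mse   <- sum((y - X %*% XtX_inv %*% t(X) %*% y)^2) / (length(y) - rkX)
tstar <- qt(1 - 0.05 / 2, df = length(y) - rkX)

est + c(-1, 1) * tstar * sqrt((drop(t(c_vec) %*% XtX_inv %*% c_vec) + gamma) * mse)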
Application: Testing
Assuming a Gauss-Markov Model Y ∼ MVN(Xβ, σ²I), let H0 : Cβ = d (d ∈ IR^l) be a testable hypothesis.
Build a test on Cβ̂_OLS − d; then

  Cβ̂_OLS − d ∼ MVN( Cβ − d, σ²C(X'X)⁻C' )

and the expression

  (Cβ̂_OLS − d)' (σ²C(X'X)⁻C')⁻¹ (Cβ̂_OLS − d)

is a measure of mismatch between Cβ̂_OLS and d.
What is its distribution? It has the form Z'AZ with Z = Cβ̂_OLS − d and A = (σ²C(X'X)⁻C')⁻¹, and

  AΣ = (σ²C(X'X)⁻C')⁻¹ σ²C(X'X)⁻C' = I_{l×l},

which clearly is idempotent. A is also symmetric. Theorem 1.8 therefore applies, and Z'AZ has a χ²_l distribution with non-centrality parameter δ² given as

  δ² = (Cβ − d)' (σ²C(X'X)⁻C')⁻¹ (Cβ − d).

Define the sum of squares of the hypothesis H0 : Cβ = d as

  SSH0 := (Cβ̂_OLS − d)' (C(X'X)⁻C')⁻¹ (Cβ̂_OLS − d).

Then

  SSH0/σ² ∼ χ²_l(δ²).

When the null hypothesis H0 holds, then δ² = 0 and SSH0/σ² has a central χ² distribution. If H0 does not hold, SSH0/σ² tends to have a larger value.
Idea for testing: compare SSH0 to SSE.
We already know that SSE/σ² ∼ χ²_{n−rk(X)}, and SSE and SSH0 are independent, because:
• Ŷ and SSE are independent.
• Cβ̂ and SSE are independent, because for estimable Cβ there exists A such that AX = C. Therefore Cβ̂ = AX(X'X)⁻X'Y = AŶ, a function of Ŷ.
• SSH0 and SSE are independent, because SSH0 = (Cβ̂_OLS − d)'(C(X'X)⁻C')⁻¹(Cβ̂_OLS − d) is a function of Cβ̂_OLS and therefore a function of Ŷ.
Definition 1.12 (F distribution)
Let U ∼ χ²_{ν1} and V ∼ χ²_{ν2} be independent. Then

  (U/ν1) / (V/ν2) ∼ F_{ν1,ν2},

the (Snedecor) F distribution with ν1 and ν2 degrees of freedom.

  E F_{ν1,ν2} = ν2/(ν2 − 2)   and   Var F_{ν1,ν2} = 2ν2²(ν1 + ν2 − 2) / ( ν1(ν2 − 2)²(ν2 − 4) ).

Definition 1.13 (non-central F distribution)
Let U ∼ χ²_{ν1}(λ²) and V ∼ χ²_{ν2} be independent. Then

  (U/ν1) / (V/ν2) ∼ F_{ν1,ν2,λ²},

the non-central F distribution with ν1 and ν2 degrees of freedom and non-centrality parameter λ².

  E F_{ν1,ν2,λ²} = (λ² + ν1)ν2 / ( ν1(ν2 − 2) )   and
  Var F_{ν1,ν2,λ²} = 2ν2² ( ν1² + (2λ² + ν2 − 2)ν1 + λ²(λ² + 2ν2 − 4) ) / ( ν1²(ν2 − 2)²(ν2 − 4) ).
Define

  F = ( SSH0 / (l · σ²) ) / ( SSE / ((n − rk(X)) σ²) ) ∼ F_{l, n−rk(X), δ²}.

For an α level hypothesis test of H0 : Cβ = d we reject if F > (1 − α) quantile of the central F_{l,n−rk(X)}. Let f* = (1 − α) quantile of the central F_{l,n−rk(X)}.
The p value of this test is then P(F_{l,n−rk(X)} ≥ Fobs), where Fobs is the observed value of the test statistic. (Remember: the p value of a test is the probability of observing a value as given by the test statistic or something more extreme, given the null hypothesis is true. We are able to reject the null hypothesis if the p value is small.)
The power of a test is the probability that the null hypothesis is rejected given that the null hypothesis is false, i.e.

  P( reject H0 | H0 false ).

If the null hypothesis is false, the test statistic F has a non-central F distribution. We can therefore compute the power as

  P( reject H0 | H0 false ) = P(F > f*) = 1 − F_{l,n−rk(X),δ²}(f*).

For F3,5 this gives a power function in δ² as sketched below:
[Figure: power of the F(3,5) test as a function of the noncentrality parameter δ².]
The R code for the above graphic is
delta <- seq(0,100,by=0.5) # varying non-centrality parameter
fast <- qf(1-0.05,3,5) # cut-off value of central F distribution for alpha=0.05
plot(delta, 1-pf(fast, 3,5,delta),type="l",col=2,main="Power of F(3,5) test",ylab="prob",ylim=c(0,1))
Normal Theory & Maximum Likelihood
(as justification for least squares estimates)

Definition 1.14 (ML estimates)
Suppose a r.v. U has a pmf or pdf f(u | θ). If U = u is observed and

  f(u | θ̂) = max_θ f(u | θ),

then θ̂ is the maximum likelihood estimate of θ.

For a normal Gauss-Markov model:

  f(y | Xβ, σ²I) = (2π)^{−n/2} (det σ²I)^{−1/2} exp( −½ (y − Xβ)'(σ²I)⁻¹(y − Xβ) )
                 = (2π)^{−n/2} (σ²)^{−n/2} exp( −(1/(2σ²)) (y − Xβ)'(y − Xβ) ).

For fixed σ² this expression is maximized if (y − Xβ)'(y − Xβ) is minimal, i.e. Xβ̂_ML = Xβ̂_OLS = PX Y = Ŷ.
For an ML estimate of σ² consider the log likelihood function:

  log f(y | Xβ, σ²I) = −(n/2) log(2π) − (n/2) log σ² − SSE/(2σ²).

Then

  d/dσ² log f(y | Xβ, σ²I) = 0 − n/(2σ²) + SSE/(2σ⁴).

Therefore

  σ̂²_ML = SSE/n = ((n − rk(X))/n) MSE.

The MLE of (Xβ, σ²) is (Ŷ, SSE/n).
Note: the MLE of σ² is biased low:

  E(SSE/n) = (1/n) E[(n − rk(X)) MSE] = ((n − rk(X))/n) σ² < σ².
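A quick numerical illustration of this bias, using simulated data (all values hypothetical):

set.seed(1)
n <- 20; x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 1)      # true sigma^2 = 1

fit <- lm(y ~ x)
sse <- sum(resid(fit)^2)
rkX <- fit$rank                        # rank of the design matrix (here 2)

sse / n                                # ML estimate, biased low
sse / (n - rkX)                        # MSE, unbiased for sigma^2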
Regression Analysis as special case of GLIMs
A regression equation

  yi = β0 + β1 x1i + β2 x2i + ... + βr xri + εi

corresponds to

  Y = Xβ + ε,

with

  X =
    [ 1 x11 x21 ... xr1 ]
    [ 1 x12 x22 ... xr2 ]
    [ .  .   .       .  ]
    [ 1 x1n x2n ... xrn ]    (n × (r + 1))

and β = (β0, β1, β2, ..., βr)'.

Unless n is very small or we are extremely unlucky, the matrix X has full rank.







Example: Biomass Data
Source: Rick A. Linthurst: Aeration, nitrogen, pH, and salinity as factors affecting Spartina alterniflora
growth and dieback. Ph.D. thesis, North Carolina State University, 1979.
Description: These data were obtained from a study of soil characteristics on aerial biomass production of
the marsh grass Spartina alterniflora, in the Cape Fear Estuary of North Carolina.
Number of cases: 45
  Variable    Description
  Location
  Type        Type of Spartina vegetation: revegetated areas, short grass areas, tall grass areas
  biomass     aerial biomass (g m−2)
  salinity    soil salinity (o/oo)
  pH          soil acidity as measured in water (pH)
  K           soil potassium (ppm)
  Na          soil sodium (ppm)
  Zn          soil zinc (ppm)
At first, we want to get an overview of the data. We will draw pairwise scatterplots and add smooth lines
to try to get an idea of possible trends.
[Figure: pairwise scatterplot matrix of salinity, pH, K, Na, Zn, and biomass (y), with loess smooth curves added to each panel.]
The strongest linear relationship between biomass (y) and any of the explanatory variables seems to be
the one between y and pH. Salinity shows an almost zero trend with very noisy data, the trend between
potassium, sodium, zinc and y is slightly negative.
> biomass <- read.table("biomass.txt",header=T)
> dim(biomass) # check whether read worked
[1] 45  8
> names(biomass) # variables
[1] "Location" "Type"     "biomass"  "salinity" "pH"       "K"
[7] "Na"       "Zn"
> y <- biomass$biomass
> x <- biomass[,4:8]
>
> # get overview
> points.lines <- function(x,y) {
+   points(x,y)
+   lines(loess.smooth(x,y,0.9),col=3)
+ }
> pairs(cbind(x,y),panel=points.lines)
A check of the rank ensures that we indeed have a full rank matrix X in the model:
> X <- as.matrix(cbind(rep(1,45),x))
> qr(X)$rank # compute rank of X matrix
[1] 6
Closer Look at the Hat Matrix
In order to get predictions in a linear model, we use the projection matrix PX. Usually, in a regression, we deal with the hat matrix H, for which

  Ŷ = HY,

i.e. the hat matrix is identical to the projection matrix, H = PX = X(X'X)⁻¹X'.
Sometimes the diagonal elements hii are used as a measure of the influence of observation i in the model. What can we say about hii?
Since H is positive semi-definite, hii ≥ 0.
On the other hand, Σ_{i=1}^n hii = tr(H) = rk(X) = r + 1. On average, hii = rk(X)/n = (r + 1)/n.
It is standard to flag observations with hii > 2(r + 1)/n as "influential".
Example: Biomass Data - Predictions and Influential points
> X <- as.matrix(cbind(rep(1,45),x))
> qr(X)$rank # compute rank of X matrix
[1] 6
>
> H <- X %*% solve(t(X) %*% X) %*% t(X) # compute hat matrix
>
> hist(diag(H), main="Histogram of H_ii")
> oi <- diag(H) > 6/length(y)*2 # flag influential points
> sum(oi)
[1] 2
A histogram of the diagonal elements of the hat matrix reveals that almost all are well below the threshold of 2(r + 1)/n. Two points show up as being influential.
[Figure: histogram of the diagonal elements hii of the hat matrix.]
These points are marked red in the graphic below. Here’s the code for doing so.
points.lines <- function(x,y) {
points(x,y)
points(x[oi],y[oi],col=2) # mark influential points red
lines(loess.smooth(x,y,0.9),col=3)
}
pairs(cbind(x,y),panel=points.lines)
[Figure: the same pairwise scatterplot matrix, now with the two influential observations marked in red.]
The variances of both predicted values and residuals are connected to the values of the hat matrix, too:

  Var(Ŷ) = Var(HY) = σ²H,   therefore   Var(Ŷi) = σ² · hii.

An estimate of the standard deviation of Ŷi is then √MSE · √hii, with

  MSE = SSE / (n − (r + 1)).

For the residuals e = Y − Ŷ, the variance structure is given as:

  Var(e) = Var(Y − Ŷ) = (I − H) Var(Y) (I − H)' = σ²(I − H),   therefore   Var(ei) = σ² · (1 − hii).

Adjusted residuals (standardized residuals) are given as

  ei* = ei / ( √MSE · √(1 − hii) ) ∼ N(0, 1)   (approximately).

Both the predicted values Ŷ and the explanatory variables Xi, i = 1, ..., r, are independent of the residuals, i.e.

  Cov(Ŷ, e) = 0   and   Cov(Xi, e) = 0 for all i = 1, ..., r.

This is the reason for looking at residual plots. We do not want to see any structure or patterns in these plots.
Example: Biomass Data
> yhat <- H %*% y
> e <- (diag(rep(1,length(y))) - H) %*% y # vector of residuals
> sse <- crossprod(y - yhat)
> estd <- e/(sqrt(sse/(length(y)-qr(X)$rank)) * sqrt(1 - diag(H))) # standardized residuals
> par(mfrow=c(1,2))
> plot(yhat,e)
> plot(yhat,estd)
> par(mfrow=c(1,5))
> for (i in 1:5) {
+   plot(X[,i+1],estd)
+   points(X[oi,i+1],estd[oi],col=2) # mark influential points red
+   lines(loess.smooth(X[,i+1],estd,0.9),col=3)
+ }
In the plot of residuals versus predicted values, a slight asymmetry is apparent: positive residual values are
larger than negative residuals. For standardized residuals the scale is -4 to 8. More residuals than expected
under normality have absolute values over 4, indicating a poorly fitting model.
[Figure: residuals e and standardized residuals plotted against the fitted values yhat.]
In the partial residual plots smooth curves are included to reveal possible problems: under the model, we would expect the smoothed line to be close to the line y = 0. This is not true in this example - plots number 2 and 3 in particular show a "bump" along the x-axis. This suggests that including these variables linearly is not enough to explain all of the variation. Quadratic terms could, for example, be included.
[Figure: standardized residuals plotted against each of the five explanatory variables, with loess smooth curves; influential points marked in red.]
More details about F testing in regressions:
Common hypothesis (after re-ordering variables appropriately):

  H0 : βp+1 = βp+2 = ... = βr = 0,

in testable form H0 : Cβ = 0 with

  C = [ 0_{(r−p)×(p+1)}   I_{(r−p)×(r−p)} ]

and

  SSH0 = (Cβ̂_OLS)' (C(X'X)⁻¹C')⁻¹ (Cβ̂_OLS).

This hypothesis fits the framework of full model / reduced model testing. Full/reduced model tests are usually reported in anova tables using expressions Y'AY, where A is some projection matrix. We will have to look at partitions of Y'Y:

  Y'Y = Y'P_1 Y + Y'(P_Xp − P_1)Y + Y'(P_X − P_Xp)Y + Y'(I − P_X)Y,

where 1 is the vector consisting of 1s only and Xp is the design matrix of the reduced model (constant and x1, ..., xp). Subtracting Y'P_1 Y gives

  Y'Y − Y'P_1 Y  =  Y'(P_Xp − P_1)Y  +  Y'(P_X − P_Xp)Y  +  Y'(I − P_X)Y,        (2)
      SST            SSR reduced model     SSH0                SSE

with SSR full model = SSR reduced model + SSH0.
Anova table for H0: βp+1 = βp+2 = ... = βr = 0:

  Source of variation                       Sum of Squares   degrees of freedom
  Regression (x1, ..., xr)                  SSR_full         rk(P_X − P_1I) = r
  Regression (x1, ..., xp)                  SSR_reduced      rk(P_Xp − P_1I) = p
  Regression (x1, ..., xr) | (x1, ..., xp)  SSH0             rk(P_X − P_Xp) = r − p
  Error                                     SSE_full         n − (r + 1)
  Total                                     SST              n − 1
Since (1/σ)(Y − EY) ∼ MVN(0, I), we can apply Cochran's theorem to the partition (2): the ranks of the matrices on the right-hand side add up to n − 1, the rank of the left-hand side, so Cochran's theorem tells us that the terms on the right are mutually independent and (divided by σ²) have chi-square distributions with degrees of freedom p, r − p, and n − (r + 1), respectively.
Other partitions exist, e.g. Type I (sequential) sums of squares:

  (Y − Ȳ)'(Y − Ȳ) = Y'Y − Y'P_1I Y
                   = Y'(P_X1 − P_1I)Y + Y'(P_X2 − P_X1)Y + ... + Y'(P_Xr − P_Xr−1)Y + Y'(I − P_Xr)Y
                   =   R(β1 | β0)     +   R(β2 | β0, β1)  + ... + R(βr | β0, β1, ..., βr−1)   +   SSE,

with Y'P_1I Y = R(β0).
Again, Cochran's theorem gives us independence of all the terms. Use R(β2 | β0, β1) as numerator to test β2 = 0 in the model containing a constant, X1 and X2.
Use R(βi | β0, ..., βi−1, βi+1, ..., βr) to test βi = 0 in the full model.
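The projections appearing in these decompositions are easy to compute directly. The sketch below is my own illustration (X1, X2 and y are simulated here, and the helper P() mirrors the projection function used in the simulation further down in this section); it evaluates one of the sequential sums of squares numerically.

> library(MASS)                                      # for ginv
> P <- function(X) X %*% ginv(t(X) %*% X) %*% t(X)   # projection onto C(X)
> set.seed(3)
> n <- 20
> X1 <- rnorm(n); X2 <- rnorm(n)
> y  <- 1 + 2*X1 + rnorm(n)
> one <- rep(1, n)
> # R(beta2 | beta0, beta1) = Y'(P_{1,X1,X2} - P_{1,X1})Y
> drop(t(y) %*% (P(cbind(one, X1, X2)) - P(cbind(one, X1))) %*% y)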
Factorial Model as special case of GLIMs
Let A and B be two factors in a linear model with I and J levels, respectively. Consider the means model

  y_ijk = µ_ij + ε_ijk,    i = 1, ..., I (factor A),  j = 1, ..., J (factor B),  k = 1, ..., n_ij (repetitions for treatment (i, j)).

Then µ_ij is estimable as long as n_ij ≥ 1 (i.e. we have at least one observation for each treatment).
In the following assume that n_ij ≥ 1 for all i, j.
Since all µ_ij are estimable, all linear combinations of the µ_ij are estimable.
“Interesting” combinations are:

  µ̄_i. = (1/J) Σ_j µ_ij   — average mean for level i of factor A,
  µ̄_.j = (1/I) Σ_i µ_ij   — average mean for level j of factor B,
  µ̄_.. = (1/(IJ)) Σ_{i,j} µ_ij   — grand average mean.
Contrasts:

“Main effects” for factor A:
  effect of level i:               α_i = µ̄_i. − µ̄_..,   i = 1, 2, ..., I
  difference in effects i and k:   µ̄_i. − µ̄_k.,          i ≠ k

“Main effects” for factor B:
  effect of level j:               β_j = µ̄_.j − µ̄_..,   j = 1, 2, ..., J
  difference in effects j and ℓ:   µ̄_.j − µ̄_.ℓ,          j ≠ ℓ

Interaction effects:
  αβ_ij = µ_ij − (µ̄_.. + α_i + β_j) = µ_ij − µ̄_i. − µ̄_.j + µ̄_..

For these effects, sum restriction constraints hold, i.e.:

  Σ_i α_i = Σ_j β_j = Σ_i αβ_ij = Σ_j αβ_ij = 0.
Different view on interaction effects: pick two rows i, k and two columns j, ℓ of the I × J table of cell means and compare the four cells µ_ij, µ_iℓ, µ_kj, µ_kℓ. If there is no interaction effect, the difference in effects for factor A does not depend on the level of B:

  µ_ij − µ_kj = α_i − α_k    for all j = 1, ..., J.

Similarly, the difference in effects for factor B does not depend on the level of A:

  µ_ij − µ_iℓ = β_j − β_ℓ    for all i = 1, ..., I.
Without an interaction effect, the effects for factor B show as parallel lines:

[Figure: mean response profiles for levels j = 1, 2, 3 of factor B plotted against the levels 1, 2, 3, 4, ... of factor A; without interaction the profiles are parallel.]
Often the first test done in this setting is to check H0: αβ_ij = 0 for all i, j versus the alternative that at least one of the effects is non-zero. We could do that via

  H0: αβ_ij = 0  ⟺  H0: µ_ij − µ̄_i. − µ̄_.j + µ̄_.. = 0.

More natural might be to consider an effects model

  y_ijk = µ + α_i + β_j + αβ_ij + ε_ijk
This model does not have full rank, though - with additional constraints, we can make it full rank:
Baseline restrictions (last effects are zero):

  α_I = 0,   β_J = 0,   αβ_iJ = 0 for all i = 1, ..., I,   αβ_Ij = 0 for all j = 1, ..., J.
For I = 2 and J = 3, this gives the cell means µ_ij:

            Factor B
Factor A    j = 1                       j = 2                       j = 3          A Means
i = 1       µ11 = µ + α1 + β1 + αβ11    µ12 = µ + α1 + β2 + αβ12    µ13 = µ + α1   µ + α1 + (β1 + β2)/3 + (αβ11 + αβ12)/3
i = 2       µ21 = µ + β1                µ22 = µ + β2                µ23 = µ        µ + (β1 + β2)/3
B Means     µ + α1/2 + β1 + αβ11/2      µ + α1/2 + β2 + αβ12/2      µ + α1/2
Baseline restrictions (first effects are zero):

  α_1 = 0,   β_1 = 0,   αβ_i1 = 0 for all i = 1, ..., I,   αβ_1j = 0 for all j = 1, ..., J.
For I = 2 and J = 3, this gives the cell means µ_ij:

            Factor B
Factor A    j = 1              j = 2                         j = 3                         A Means
i = 1       µ11 = µ            µ12 = µ + β2                  µ13 = µ + β3                  µ + (β2 + β3)/3
i = 2       µ21 = µ + α2       µ22 = µ + α2 + β2 + αβ22      µ23 = µ + α2 + β3 + αβ23      µ + α2 + (β2 + β3)/3 + (αβ22 + αβ23)/3
B Means     µ + α2/2           µ + α2/2 + β2 + αβ22/2        µ + α2/2 + β3 + αβ23/2
The corresponding full rank matrix X ∗ stemming from an effects model with imposed restrictions has then
the form X ∗ = (1I | Xα∗ | Xβ ∗ | Xαβ ∗ ), where Xα∗ consists of I − 1 columns for effects of factor A, Xβ ∗
consists of J − 1 columns for effects of factor B and Xαβ ∗ has (I − 1)(J − 1) linearly independent columns
for interaction effects.
A test of H0 : αβij = 0 then translates to a goodness of fit test of the reduced model corresponding to
(1I | Xα∗ | Xβ ∗ ) versus the full model corresponding to X ∗ . The sum of squares of the null hypothesis is then
written as
SSH0 = Y 0 (PX ∗ − P(1I|Xα∗ |Xβ∗ ) )Y,
where SSH0/σ² has a χ² distribution with (I − 1)(J − 1) degrees of freedom and non-centrality parameter δ² = (Cβ)'(C(X*'X*)⁻C')⁻¹Cβ, with C the (I − 1)(J − 1) × (1 + (I − 1) + (J − 1) + (I − 1)(J − 1)) matrix

  C = ( 0_{(I−1)(J−1) × (1+(I−1)+(J−1))}  |  I_{(I−1)(J−1)} ).

Under H0, δ² = 0.
Testing for effects
How do we test hypotheses of the form H0: α_i = 0 for all i or, similarly, H0: β_j = 0 for all j?
In a 2 by 3 factorial model using zero-sum constraints we know:

  α1 = (1/6)(µ11 + µ12 + µ13) − (1/6)(µ21 + µ22 + µ23).

The null hypothesis can then be written as:

  H0: α1 = 0  ⟺  H0: (1/6, 1/6, 1/6, −1/6, −1/6, −1/6) (µ11, µ12, µ13, µ21, µ22, µ23)' = 0.
Similarly,

  β1 = µ̄_.1 − µ̄_.. = (1/2)(µ11 + µ21) − (1/6) Σ_{i,j} µ_ij
     = (1/3)µ11 − (1/6)µ12 − (1/6)µ13 + (1/3)µ21 − (1/6)µ22 − (1/6)µ23,
  β2 = −(1/6)µ11 + (1/3)µ12 − (1/6)µ13 − (1/6)µ21 + (1/3)µ22 − (1/6)µ23,

which is used in a hypothesis test as

  H0:  (1/6) [  2  −1  −1   2  −1  −1 ]  (µ11, µ12, µ13, µ21, µ22, µ23)'  =  (0, 0)'.
             [ −1   2  −1  −1   2  −1 ]

To test these hypotheses, we can use SSH0 = (Cβ̂ − d)'(C(X'X)⁻C')⁻¹(Cβ̂ − d).
Alternatively, we could come up with other test statistics. For the hypothesis H0: α = 0, we might test for factor A in different ways:

  test for A in model µ + α_i*:                  R(α* | µ) = Y'(P_(1I|Xα*) − P_1I)Y
  test for A in model µ + α_i* + β_j*:           R(α* | µ, β*) = Y'(P_(1I|Xα*|Xβ*) − P_(1I|Xβ*))Y
  test for A in model µ + α_i* + β_j* + αβ_ij*:  R(α* | µ, β*, αβ*) = Y'(P_(1I|Xα*|Xβ*|Xαβ*) − P_(1I|Xβ*|Xαβ*))Y
Which of these sums of squares should (or could) we consider? Since they all test for the same effect, all are in some sense valid.
In a balanced design (and in all orthogonal designs, for that matter) it turns out that all of these sums coincide anyway.
For an unbalanced design (i.e. n_ij > 0 for all i, j, but not all equal), these sums give different results.
Simulation:  For a small 2 × 3 example, with factors A and B on two and three levels, respectively, we get the following cell means µ_ij and cell counts n_ij:

  µ_ij    j = 1   j = 2   j = 3        n_ij    j = 1   j = 2   j = 3
  i = 1   18.80   26.25    7.17        i = 1       5       4       6
  i = 2   18.33   22.00    2.25        i = 2       3       3       4

In this situation, the three sums of squares yield different values:

> drop(t(Y) %*% (P(cbind(one,A)) - P(one)) %*% Y)
[1] 58.90667
> drop(t(Y) %*% (P(cbind(one,A,B)) - P(cbind(one,B))) %*% Y)
[1] 66.52381
> drop(t(Y) %*% (P(X) - P(cbind(one,B,A*B))) %*% Y)
[1] 60.52246
where

A  <- c(rep(1,15), rep(-1,10))
B1 <- c(rep(1,5), rep(0,4), rep(-1,6), rep(1,3), rep(0,3), rep(-1,4))
B2 <- c(rep(0,5), rep(1,4), rep(-1,6), rep(0,3), rep(1,3), rep(-1,4))
B  <- cbind(B1,B2)
one <- rep(1,25)

and

Y
 [1] 21 18 24 10 21 30 23 28 24 13  8 -1 10  5  8 10 25 20 17 28 21 -3 13  0 -1
and SSH0 for H0 : α = 0 is
> C <- c(0,1,0,0,0,0)
> b <- ginv(t(X) %*% X) %*% t(X) %*% Y
> SSH <- t(C %*% b) %*% solve( t(C) %*% ginv(t(X) %*% X) %*% C) %*% C %*% b
> drop(SSH)
[1] 60.52246
which is the same as the last of the three sums of squares above. This is generally true: SSH0 is the same as the sum of squares we get by comparing the full model to a model without the factor of interest.
Different software packages produce and report different sums of squares - we need to know which ones.
SAS introduced the concepts of type I, type II, and type III sum of squares:
• The Type I (sequential) sum of squares is computed by fitting the model in steps according to the order
of the effects specified in the design and recording the difference in the sum of squares of errors (SSE)
at each step.
• A Type II sum of squares is the reduction in SSE due to adding an effect after all other terms have
been added to the model except effects that contain the effect being tested. For any two effects F and
F̃ , F is contained in F̃ if the following conditions are true:
– Both effects F and F̃ involve the same covariate, if any.
– F̃ consists of more factors than F .
– All factors in F also appear in F̃ .
• The Type III sum of squares for an effect F is the sum of squares for F adjusted for effects that do not
contain it, and orthogonal to effects (if any) that contain it.
If we assume the order A, B, and AB on the effects, we get the following three types of sums of squares:

  Factor   Type I                 Type II                Type III
  A        R(α* | µ)              R(α* | µ, β*)          R(α* | µ, β*, αβ*)
  B        R(β* | µ, α*)          R(β* | µ, α*)          R(β* | µ, α*, αβ*)
  AB       R(αβ* | µ, α*, β*)     R(αβ* | µ, α*, β*)     R(αβ* | µ, α*, β*)
The advantage of Type I sums of squares is their additivity: the Type I sums of squares add up to the total sum of squares TSS = Y'(P_X − P_1I)Y = R(αβ*, α*, β* | µ).
The obvious disadvantage is their order-dependency: changing the order of the factors changes their sums of squares.
Type III sums of squares are order invariant, but they do not add up to any interpretable value, nor are they independent of each other.
The two properties - order invariance and additivity (sometimes called orthogonality) - usually do not occur together. Balanced designs do have both, though.
Example
For the simulated example above, type I sums of squares in R are given as:
> ## Type I Sum of Squares: order A,B, AB
> lmfit <- lm(Y~1+A+B+A*B)
> anova(lmfit)
Analysis of Variance Table
Response: Y
          Df  Sum Sq Mean Sq F value   Pr(>F)    
A          1   58.91   58.91  1.8660   0.1879    
B          2 1695.07  847.53 26.8475 2.91e-06 ***
A:B        2   22.87   11.43  0.3622   0.7009    
Residuals 19  599.80   31.57                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
> ## Type I Sum of Squares: order B,A, AB
> lmfit <- lm(Y~1+B+A+A*B)
> anova(lmfit)
Analysis of Variance Table
Response: Y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
B          2 1687.45  843.73 26.7269 3.003e-06 ***
A          1   66.52   66.52  2.1073    0.1629    
B:A        2   22.87   11.43  0.3622    0.7009    
Residuals 19  599.80   31.57                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Depending on the order of effects specified in the formula, the output varies.
Type II and III sums of squares are produced by Anova in the package car:
> Anova(lmfit,type="III")
Anova Table (Type III tests)
Response: Y
            Sum Sq Df  F value    Pr(>F)    
(Intercept) 5861.1  1 185.6638 2.946e-11 ***
B           1679.4  2  26.6000 3.105e-06 ***
A             60.5  1   1.9172    0.1822    
B:A           22.9  2   0.3622    0.7009    
Residuals    599.8 19                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Anova(lmfit,type="II")
Anova Table (Type II tests)
Response: Y
           Sum Sq Df F value   Pr(>F)    
B         1695.07  2 26.8475 2.91e-06 ***
A           66.52  1  2.1073   0.1629    
B:A         22.87  2  0.3622   0.7009    
Residuals  599.80 19                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
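One practical caveat (not from the original example, but a common pitfall): when A and B are coded as R factors rather than as the numeric ±1 columns used above, Type III tests from car::Anova are only meaningful if the factors use sum-to-zero (or Helmert) contrasts. A minimal reminder:

> # only needed when A and B are factors; treatment contrasts would make
> # the Type III main-effect tests depend on the chosen baseline level
> options(contrasts = c("contr.sum", "contr.poly"))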
Related to the problem of unbalanced designs is the problem of how to handle experiments that are not complete, i.e. designs in which n_ij = 0 for some i and j.
In a small 2 × 3 factorial design, we might not have any values for cell (2, 3). Clearly, this means that µ23 is not estimable. On top of that, we cannot estimate anything involving µ23, such as µ̄_.3 or µ̄_2. .
A work-around would, of course, be to estimate the row and column averages based on only the information we have, i.e.

  µ̄_2. = (1/2)(µ21 + µ22).

Similarly, by using interaction contrasts of the form

  µ_ij − µ_iℓ − (µ_kj − µ_kℓ)

we can find estimates for interaction effects as long as there are four cells with data (in the corners of an imaginary rectangle).
Generally, the following is true:
In an experiment with K non-empty cells (K < IJ), we can build a matrix X_{n×K}. Provided K is big enough (at least K ≥ 1 + (I − 1) + (J − 1)) and the pattern of empty cells is not nasty (i.e. there are no whole rows or columns missing), all parameters µ, α* and β* are estimable and we can construct the matrix X* = (1I | Xα* | Xβ*).
A test on interaction effects is then done (in the reduced/full model framework):

  H0: all estimable interaction effects are zero,
  SSH0 = Y'(P_X − P_X*)Y,   and
  F = [ SSH0/(K − 1 − (I − 1) − (J − 1)) ] / [ Y'(I − P_X)Y/(n − K) ]  ∼  F_{K−1−(I−1)−(J−1), n−K}.

Under the assumption that all interaction effects are zero, we can get estimates for µ, α* and β* out of a minimum of 1 + (I − 1) + (J − 1) cells by moving "hand over hand" from one level to the next:

[Diagram: levels α1, ..., α5 of A and β1, ..., β5 of B linked through the non-empty cells.]
2  Nonlinear Regression
So far, we have been dealing with models of the form
Y = Xβ + ,
i.e. models that were linear as functions of the parameter vector β.
The new situation is that
Y = f (X, β) + ,
where f is a known function, non-linear in β1 , β2 , ..., βk and in some sense “well behaved” (we want f to be
at least continuous, at a later stage we will also want it to be differentiable, ...)
Example: Chemical Process  Assume we have an irreversible chemical reaction involving species A, B, and C:

  A --(θ1)--> B --(θ2)--> C
Let A(t) denote the amount of A at time t (and similarly, we have B(t) and C(t)). θ1 and θ2 are the reaction
rates, and are typically at the center of the problem, i.e. we want to find estimates for them.
Then we get the set of differential equations:

  dA(t)/dt = −θ1 A(t)
  dB(t)/dt = θ1 A(t) − θ2 B(t)
  dC(t)/dt = θ2 B(t)

with A(0) = 1, B(0) = C(0) = 0.
Calculus tells us that B(t) can be written as

  B(t) = B(t, θ1, θ2) = θ1/(θ1 − θ2) · ( e^(−θ2 t) − e^(−θ1 t) ).
Usually, we observe yi at time ti where yi = B(ti , θ1 , θ2 ) + i and we want to find estimates of θ1 and θ2 .
Simulation example:
[Figure: simulated observations of B plotted against t ∈ [0, 10] (B roughly between 0 and 0.4), with the true curve B(t) overlaid.]
For the simulation shown in the graphic, the parameters have been chosen as θ1 = 0.3 and θ2 = 0.4, and ε ∼ MVN(0, σ²I) with σ = 0.05.
The R code for that is

> # function B(t) with default parameters specified
> f <- function (x,theta1=0.3,theta2=0.4) {
+   return (theta1/(theta1-theta2)*(exp(-theta2*x) - exp(-theta1*x)))
+ }
> # assuming measurements at every 0.05 units between 0 and 10
> t <- seq(0,10,by=0.05)
> e <- rnorm(length(t),mean=0,sd=0.05)
> # simulated observations
> B <- f(t)+e
> plot(t,B)
>
> # overlay graph of B(t)
> points(t,f(t),type="l",col=2)
Using a non-linear least squares estimation in R, we can get a good estimate for θ1 and θ2 :
# finding non-linear least squares estimates
fit <- nls(B ~ f(t,theta1=t1,theta2=t2), start=list(t1=0.55,t2=0.45),trace = TRUE )
1.458538 : 0.55 0.45
0.7911527 : 0.2248839 0.4364552
0.4879797 : 0.2985262 0.4087470
0.485227 : 0.3007713 0.4013401
0.4852263 : 0.3008083 0.4014700
>
> summary(fit)

Formula: B ~ f(t, theta1 = t1, theta2 = t2)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
t1 0.300808   0.009702   31.00   <2e-16 ***
t2 0.401470   0.007745   51.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04938 on 199 degrees of freedom

Correlation of Parameter Estimates:
      t1
t2 0.284
For a non-linear model y_i = f(x_i, β) + ε_i, least squares estimates of β can be found by minimizing

  Σ_i (y_i − f(x_i, β))²     (*)
This expression usually does not have a closed-form solution. We can find solutions using numerical methods:
1. Grid-based search: for low-dimensional problems (in terms of the dimension of the vector β) we might use a grid of possible values b for β and take the minimum as our solution. Sometimes several refinement steps of this grid-based search are done.
   The disadvantage of this method is its computational complexity: since the grid size is multiplicative in the number of parameters, it becomes computationally infeasible quickly.
2. Gradient Method (Hill Climbing, Greedy Algorithm)
To minimize (*), the idea is to find a zero of the first derivative of the sum of squares w.r.t. β:

  0_{k×1} = D'_{k×n} (Y_{n×1} − f(X, β)_{n×1}),

with

  D = ( ∂f(x_i, β)/∂β_j )|_{β=b}   (an n × k matrix)   and   f(X, β) = ( f(x_1, β), f(x_2, β), ..., f(x_n, β) )'.

In the case of a linear model, f(X, β) = Xβ with f(x_i, β) = x_i'β and

  D = ( ∂f(x_i, β)/∂β_j )|_{β=b} = (x_ij) = X.

Therefore 0_{k×1} = D'(Y − f(X, β)) becomes 0 = X'Y − X'Xb - the normal equations!
A standard method of solving g(z) = 0 is the Gauss-Newton Algorithm:

1. Identify some starting value z*.
2. Find a linear approximation g̃ of g(z) at the point z = z*.
3. For the linear approximation find z** with g̃(z**) = 0.
4. Replace z* by z** and repeat from (2) until convergence, i.e. until z* is reasonably close to z**.

[Figure: successive linearizations of g producing iterates z1, z2, z3, z4 that approach the root z.]
Obviously:
• The choice of starting value is critical.
• If the function g is monotone, the algorithm converges.
• Convergence is slow if the function is flat.
• Results might jump between several solutions: any starting point in the shaded area to the left or right of the root (as sketched in the original figure) will lead to a series jumping between the left- and right-hand side.
Math in more detail:
Let b⁰ = (b⁰_1, ..., b⁰_p)' be the starting value, b^r the parameter vector after the r-th iteration, and

  D^r := ( ∂f(x_i, β)/∂β_j )|_{β = b^r}.

A linear approximation of f(X, β) (Taylor expansion) is

  f(X, β) ≈ f(X, b^r) + D^r (β − b^r),

so that

  Y = f(X, β) + ε ≈ f(X, b^r) + D^r (β − b^r) + ε.
Then the approximation gives rise to a linear model:

  Y − f(X, b^r)  =  D^r  (β − b^r)  +  ε
       Y*            X*      γ

The ordinary least squares solution helps to find an iterative process:

  γ̂_OLS = (D^r' D^r)⁻¹ D^r' (Y − f(X, b^r)),    estimating γ = β − b^r.

This suggests

  b^(r+1) = b^r + (D^r' D^r)⁻¹ D^r' (Y − f(X, b^r))

for the iteration.
The iteration step

  b^(r+1) = b^r + (D^r' D^r)⁻¹ D^r' (Y − f(X, b^r))

is repeated until convergence. For convergence, we need to define a stopping criterion. There are different options:

1. By comparing the solutions of steps r and r + 1 we get an idea of how much progress the algorithm still makes. If the relative difference

     max_j | (b^(r+1)_j − b^r_j) / (b^r_j + c) |

   is sufficiently small, we can stop. Here c can be chosen as any small number (it is a technicality to prevent us from dividing by zero).

2. "Deviance": SSE^r := Σ_i (y_i − f(x_i, b^r))², the error sum of squares after r iterations. Stop if SSE^(r+1) is close to SSE^r.
For the non-linear model y = f(x, β) + ε we can find β̂ by using Gauss-Newton: guess some starting value b⁰, then iterate

  b^(r+1) = b^r + (D^r' D^r)⁻¹ D^r' (Y − f(X, b^r)),

where D^r = ( ∂f(x_i, β)/∂β_j )|_{β = b^r}.
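A minimal R sketch of this iteration for the B(t) model from the simulation above (my own illustration: the derivative matrix is coded analytically, and the starting value and stopping rule are arbitrary choices, not part of the original example):

> # Gauss-Newton for B(t) = theta1/(theta1-theta2)*(exp(-theta2*t)-exp(-theta1*t));
> # uses t, B and f from the simulation above
> gradB <- function(x, t1, t2) {            # n x 2 matrix D of partial derivatives
+   d1 <- -t2/(t1-t2)^2*(exp(-t2*x)-exp(-t1*x)) + t1/(t1-t2)*x*exp(-t1*x)
+   d2 <-  t1/(t1-t2)^2*(exp(-t2*x)-exp(-t1*x)) - t1/(t1-t2)*x*exp(-t2*x)
+   cbind(d1, d2)
+ }
> b <- c(0.55, 0.45)                         # starting value
> for (r in 1:20) {
+   D     <- gradB(t, b[1], b[2])
+   resid <- B - f(t, b[1], b[2])
+   step  <- drop(solve(crossprod(D), crossprod(D, resid)))
+   b     <- b + step
+   if (max(abs(step)/(abs(b) + 1e-8)) < 1e-8) break
+ }
> b                                          # close to coef(fit) from nls()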
Inference for β̂, σ̂ 2 ?
Large Sample Inference
Assumption: ε ∼ MVN(0, σ²I); then (b_OLS, SSE/n) are the MLEs of (β, σ²).
The following holds:

1. The ordinary least squares estimate b_OLS has an approximate multivariate normal distribution:

     b_OLS ∼ MVN_k( β, σ²(D'D)⁻¹ )   approximately,   with   D = ( ∂f(x_i, b)/∂b_j )|_{b = β}.

2. The mean squared error is approximately (scaled) chi-square:

     MSE = SSE/(n − k) = σ̂²   and   SSE/σ² ∼ χ²_{n−k}   approximately.

3. D is estimated by D̂ = ( ∂f(x_i, b)/∂b_j )|_{b = b_OLS}.

4. For a smooth differentiable function h: R^k → R^q we get (Delta method):

     h(b_OLS) ∼ MVN( h(β), σ² G(D'D)⁻¹G' )   approximately,   with   G = ( ∂h_i(b)/∂b_j )|_{b = β}.

5. G is estimated by Ĝ = ( ∂h_i(b)/∂b_j )|_{b = b_OLS}.
Inference for a single β_j
From 1):

  ( b_OLS,j − β_j ) / ( σ √((D'D)⁻¹_jj) )  ∼  N(0, 1)   approximately.

From 2) and 3):

  ( b_OLS,j − β_j ) / ( √MSE √((D̂'D̂)⁻¹_jj) )  ∼  t_{n−k}   approximately.

Hypothesis test of H0: β_j = #:

  T = ( b_OLS,j − # ) / ( √MSE √((D̂'D̂)⁻¹_jj) ),   then T ∼ t_{n−k} approximately.

(1 − α) C.I. for β_j:

  b_OLS,j ± t_{1−α/2, n−k} √MSE √((D̂'D̂)⁻¹_jj).
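For the nls fit of the B(t) model above, essentially this interval can be computed from the estimated covariance matrix of the parameters (a short sketch; vcov() on an nls object returns the estimate of σ²(D̂'D̂)⁻¹):

> est <- coef(fit)
> se  <- sqrt(diag(vcov(fit)))
> cbind(lower = est - qt(0.975, df.residual(fit)) * se,
+       upper = est + qt(0.975, df.residual(fit)) * se)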
Inference for a single mean f(x, β)
For a function f: R^k → R we get (from 4):

  f(x, b_OLS) ∼ N( f(x, β), σ² G(D'D)⁻¹G' )   approximately,   with   Ĝ = ( ∂f(x, b)/∂b_j )|_{b = b_OLS}.

A (1 − α) C.I. for f(x, β) is then:

  f(x, b_OLS) ± t_{1−α/2, n−k} √MSE √( Ĝ(D̂'D̂)⁻¹Ĝ' ).
Prediction  In the future we observe y* independently of y_1, ..., y_n, with mean h(β) and variance γσ²; e.g. h(β) = f(x, β) and γ = 1 for one new observation at x, or h(β) = f(x_1, β) − f(x_2, β) and γ = 2 for the difference between new observations at x_1 and x_2.
Then (1 − α) prediction limits are given as:

  h(b_OLS) ± t_{1−α/2, n−k} √MSE √( γ + Ĝ(D̂'D̂)⁻¹Ĝ' ).
Inference based on large sample behavior: Profile likelihood
Again ε ∼ MVN(0, σ²I). Then

  L(β, σ² | Y) = (2π)^(−n/2) (1/σ²)^(n/2) exp( −1/(2σ²) Σ_i (y_i − f(x_i, β))² )

and the log-likelihood is

  ℓ(β, σ² | Y) = log L(β, σ² | Y) = −(n/2) log(2π) − n log σ − 1/(2σ²) Σ_i (y_i − f(x_i, β))².
Idea of profile likelihood: assume the parameter vector θ can be split into two parts,

  θ = (θ1', θ2')'   with θ1 ∈ R^p, θ2 ∈ R^(r−p).

We can then write the log-likelihood function ℓ as ℓ(θ) = ℓ(θ1, θ2). The idea of profile likelihoods is that we want to draw inference on θ1 and ignore θ2.
Suppose now that for every (fixed) θ1 the value θ̂2(θ1) maximizes the log-likelihood; set ℓ*(θ1) := ℓ(θ1, θ̂2(θ1)).
A (large sample) 1 − α confidence set for θ1 is

  { θ1 | ℓ*(θ1) > ℓ(θ̂_ML) − (1/2) c_α },

where c_α is the 1 − α quantile of χ²_p. The confidence set for θ1 is therefore the set of all parameters within a c_α/2 range of the absolute maximum (attained at θ̂_ML).
The function ℓ*(θ1) is also called the profile log-likelihood function for θ1.
[Figure: contours of ℓ(θ1, θ2). Cutting the log-likelihood at ℓ(θ̂_ML) − χ²_r/2 gives a (1 − α) confidence region for θ; cutting the profile log-likelihood at ℓ(θ̂_ML) − χ²_p/2 gives a (1 − α) confidence interval for θ1.]
Use this in the nonlinear model:

Application # 1: θ1 = σ²
For fixed σ², the log-likelihood ℓ(β, σ²) is maximized by b_OLS. The profile log-likelihood for σ² is then

  ℓ*(σ²) = ℓ(b_OLS, σ²) = −(n/2) log 2π − (n/2) log σ² − SSE/(2σ²).
The overall maximum is attained at (b_OLS, SSE/n), and

  ℓ(β̂, σ̂²)_ML − ℓ*(σ²) = −(n/2) log(SSE/n) − n/2 + (n/2) log σ² + SSE/(2σ²).

The set of σ² values for which this difference is at most (1/2) χ²_{1,1−α} is an approximate (1 − α) confidence interval for σ². This is an alternative to the interval based on SSE/σ² ∼ χ²_{n−k} (approximately).

[Figure: profile log-likelihood ℓ*(σ²), maximized at σ² = SSE/n; cutting (1/2) χ²_1 below the maximum gives the (1 − α) confidence interval for σ².]
Application # 2: θ1 = β
For given β, the likelihood ℓ(β, σ²) is maximized by

  σ̂²(β) = (1/n) Σ_i (y_i − f(x_i, β))².

The profile log-likelihood function for β is ℓ*(β) = ℓ(β, σ̂²(β)), and

  ℓ(β̂, σ̂²)_ML − ℓ*(β) = −(n/2) log(SSE/n) + (n/2) log σ̂²(β) = (n/2) [ log Σ_i (y_i − f(x_i, β))² − log SSE ],

where SSE = Σ_i (y_i − f(x_i, b_OLS))².
An approximate confidence region for β is (see picture)

  { β | log Σ_i (y_i − f(x_i, β))² − log SSE < (1/n) χ²_{k,1−α} }
  = { β | Σ_i (y_i − f(x_i, β))² < SSE · e^( χ²_{k,1−α} / n ) }.

[Figure: the confidence region is the set of (β1, β2) around b_OLS where n σ̂²(β) is not much larger than SSE.]
In a linear model the exact confidence region is the “Beale” region:

  { β | Σ_i (y_i − f(x_i, β))² < SSE ( 1 + (k/(n − k)) F_{k,n−k,1−α} ) },

where SSE = Σ_i (y_i − f(x_i, b_OLS))².
This is carried over directly to non-linear models.
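In R, profile-likelihood intervals for the parameters of an nls fit can be obtained directly (a one-line sketch for the B(t) fit above; confint() profiles the residual sum of squares in each parameter):

> confint(fit, level = 0.95)   # profile-based intervals for t1 and t2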
3  Mixed Linear Models

Another extension of linear models: introduce a term to capture "random effects" in the model.
A mixed effects model has the following form:

  Y_{n×1} = X_{n×p} β_{p×1} + Z_{n×q} u_{q×1} + ε_{n×1},

where X and Z are matrices of known constants, β is a vector of unknown parameters, and u and ε are vectors of unobservable random variables. We will make standard assumptions:
  Eε = 0,   Eu = 0,   Var(ε) = R,   Var(u) = G,

where R and G are known except for some parameters, called the variance components.
ε and u are assumed to be independent, i.e. the joint covariance matrix is

  Var( (u', ε')' ) = [ G  0
                       0  R ].
Then

  EY = E[Xβ + Zu + ε] = Xβ,
  Var(Y) = Var[Zu + ε] = Var(Zu) + Var(ε) = ZGZ' + R.

3.1  Example: One way Random Effects Model
Consider a batch process making widgets. Randomly select three batches (j = 1, 2, 3); out of each batch, randomly select two widgets (i = 1, 2) and measure their hardness: y_ij = measured hardness of widget i from batch j.
This could be modeled as

  y_ij = µ (overall hardness) + α_j (random effect of batch j) + ε_ij (within-batch effect).
Assumptions:

  E(α1, α2, α3)' = 0,   Var(α1, α2, α3)' = σ_α² I_{3×3} = G = Var(u),
  Eε = 0,   Var(ε) = σ² I_{6×6} = R.
Then

  (y11, y21, y12, y22, y13, y23)'  =  1_{6×1} µ  +  Z (α1, α2, α3)'  +  ε,    with    Z = [ 1 0 0
                                                                                            1 0 0
                                                                                            0 1 0
                                                                                            0 1 0
                                                                                            0 0 1
                                                                                            0 0 1 ].
With that, the expected value and variance of Y are:

  EY = µ 1I,
  Var(Y) = ZGZ' + R = σ_α² ZZ' + σ² I = σ_α² ( I_{3×3} ⊗ J_{2×2} ) + σ² I_{6×6},

i.e. Var(Y) is block diagonal with 2 × 2 blocks that have σ² + σ_α² on the diagonal and σ_α² off the diagonal (observations from different batches are uncorrelated).
⊗ is called the Kronecker product:

  A_{n×m} ⊗ B_{r×s} = [ a11 B   a12 B   ...
                        a21 B    ...
                          ...             ]_{nr×ms}.
V arY is not an Aitken matrix, because we have two unknown parameters σ 2 and σα2 .
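A quick numerical sketch of this variance structure (my own illustration; the values of σ_α and σ are arbitrary choices):

> sigma.a <- 2; sigma <- 1
> I3 <- diag(3); J2 <- matrix(1, 2, 2)
> V <- sigma.a^2 * kronecker(I3, J2) + sigma^2 * diag(6)
> V   # block diagonal: sigma^2 + sigma.a^2 on the diagonal, sigma.a^2 within a batch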
3.2  Example: Two way Mixed Effects Model without Interaction

Two analytical chemists each make 2 analyses on 2 specimens (each specimen is cut into four parts).
y_ijk is one result from these analyses (e.g. the content of some component A) for the k-th analysis of specimen i done by chemist j.
We can model this by

  y_ijk = µ (average content) + α_i (random effect of specimen i) + b_j (fixed effect of chemist j) + ε_ijk.
Assumptions are

  E(α1, α2)' = 0,   Var(α1, α2)' = σ_α² I_{2×2} = G,
  Eε = 0,   Var(ε) = σ² I = R.
Then

  (y111, y112, y121, y122, y211, y212, y221, y222)'  =  X (µ, β1, β2)'  +  Z (α1, α2)'  +  ε,

with

  X = [ 1 1 0        Z = [ 1 0
        1 1 0              1 0
        1 0 1              1 0
        1 0 1              1 0
        1 1 0              0 1
        1 1 0              0 1
        1 0 1              0 1
        1 0 1 ],           0 1 ].

Then EY = Xβ and Var(Y) = σ_α² ZZ' + σ² I = σ_α² ( I_{2×2} ⊗ J_{4×4} ) + σ² I.

3.3  Estimation of Parameters
Assume Y ∼ MVN(Xβ, V) with V = Var(Y) = R + ZGZ' and variance components σ1², σ2², ..., σp².
Consider the normal likelihood function:

  L(Xβ, σ²) = f(Y | Xβ, σ²) = (2π)^(−n/2) det(V(σ²))^(−1/2) exp( −(1/2)(Y − Xβ)' V(σ²)⁻¹ (Y − Xβ) ).

For fixed σ² = (σ1², σ2², ..., σp²) this is maximized over Xβ by weighted least squares:

  Xβ̂(σ²) = Ŷ*(σ²) = X(X'V⁻¹X)⁻X'V⁻¹Y.

Plugging this into L(Xβ, σ²) gives the profile likelihood for the vector σ1², σ2², ..., σp² of variance components:

  L*(σ²) = L(Ŷ*(σ²), σ²).

There is no closed form for the maximum of this function, i.e. we need an iterative procedure such as Gauss-Newton. At every step we need to invert the n × n matrix V(σ²) to evaluate L*, a computation of order n³, so this becomes computationally expensive fairly quickly.
Even if it is possible to estimate the variance components - remember that MLEs of variances are biased, and underestimate:
59
Small Example



For Y ∼ N (µ, σ 2 ) with Y = 

1
1
..
.



2
 µ + Maximum likelihood gives σ̂M
L =

1
n
P
i (yi
− ŷi )2 , yet we know
1
P
1
2
that s2 = n−1
(y
−
ŷ
)
is
an
unbiased estimator of the variance.
i
i i
There are several ways to try to “fix” the biasedness of ML estimates - the one that we will be looking at
leads to restricted maximum likelihood estimates:
The idea is to replace Y by Y − ŶOLS , i.e. E[Y − Ŷ ] = 0, then do ML. Doing so, removes all fixed effects
from the model.
In the small example this gives: Y − Ŷ ∼ M V N (0, (I − n1 J)σ 2 I(I − n1 J)), since Ŷ = PX Y = 1I(1I0 1I)−1 1I0 Y =
1
n Jn×n Y .
The covariance matrix of Y − Ŷ does not have full rank:
 n−1

− n1
... − n1
n

.. 
 − 1 n−1
. 
n
n


cov(Y − Ŷ ) =  .

.
.. − 1 
 ..
n
− n1
... − n1 n−1
n
For finding a ML drop the last row of the residual vector to

Y1 − Ȳ

..
e=
.
make the model full rank:



Yn−1 − Ȳ
ML estimation for σ 2 based on this vector turns out to be
n
2
σM
Le =
1 X
(yi − ȳ)2 ,
n − 1 i=1
A generalization of this leads to REML (restricted maximum likelihood) estimates of variance components:

Estimates of variance components
Let B ∈ R^{m×n} with rk(B) = m = n − rk(X) and BX = 0. Define r = BY.
The REML estimate of σ² is the maximizer of the likelihood based on r:

  L_r(σ²) = f(r | σ²) = (2π)^(−m/2) det( B V(σ²) B' )^(−1/2) exp( −(1/2) r' (B V(σ²) B')⁻¹ r ).

σ̂²_REML is larger than σ̂²_ML, and σ̂²_REML does not depend on the specific choice of B (all B fulfilling the above conditions yield the same estimate).
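A small numerical sketch of this in the simple mean model (my own illustration; the data are simulated and B is one particular valid choice): maximizing the likelihood of r = BY reproduces the unbiased estimator s².

> set.seed(1)
> n <- 10
> y <- rnorm(n, mean = 5, sd = 2)
> X <- matrix(1, n, 1)
> # B: (n-1) x n with B %*% X = 0 and full row rank;
> # here: the first n-1 rows of the centering matrix I - J/n
> B <- (diag(n) - matrix(1/n, n, n))[1:(n-1), ]
> max(abs(B %*% X))          # B annihilates the fixed-effect design (all zeros)
> r <- B %*% y
> sigma2.reml <- drop(t(r) %*% solve(B %*% t(B)) %*% r) / (n - 1)
> c(sigma2.reml, var(y))     # identical: REML gives the unbiased estimator s^2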
Estimable functions Cβ
Estimability of functions depends only on the form of X, i.e. c'β is estimable if c ∈ C(X').
The BLUE in the model Y ∼ MVN(Xβ, V) is given by the weighted least squares estimate:

  Cβ̂_WLS = C(X'V⁻¹X)⁻X'V⁻¹Y   ( = A Ŷ(σ²) for C = AX ),

with variance

  Var(Cβ̂_WLS) = C(X'V⁻¹X)⁻C'.

This is not useful in practice, because we usually do not know σ². Plugging in an estimate σ̂² (so that V̂ = V(σ̂²)) gives us estimates which in general are neither linear nor of minimal variance:

  Cβ̂_WLS-hat = C(X'V̂⁻¹X)⁻X'V̂⁻¹Y = A Ŷ(σ̂²)   and   Var-hat(Cβ̂_WLS) = C(X'V̂⁻¹X)⁻C'.
Predicting the random vector u
Remember: for multivariate normal variables X and Y the conditional expectation is

  E[Y | X] = E[Y] + cov(Y, X) Var(X)⁻¹ (X − E[X])

(see Rencher, Theorem 4.4D, for a proof). Here

  E[u | Y] = E[u] + cov(u, Y) V⁻¹ (Y − Xβ) = GZ'V⁻¹(Y − Xβ),

since

  cov(u, Y) = cov(u, Xβ + Zu + ε) = cov(u, Zu + ε) = cov(Iu, Zu) + 0 = IGZ' + 0 = GZ'.

For the prediction of u we then get (replacing Xβ by its weighted least squares estimate):

  û = GZ'V⁻¹(Y − Xβ̂) = GZ'V⁻¹( I − X(X'V⁻¹X)⁻X'V⁻¹ ) Y = GZ'PY,    with  P := V⁻¹( I − X(X'V⁻¹X)⁻X'V⁻¹ ).

û is linear in Y, and û is the best linear unbiased predictor (BLUP) of u.
In practice we use û̂ = ĜZ'P̂Y to make the expression useful.
û is best in the sense that Var(u − û) is minimal over all linear predictors û with E(û) = 0.
For the BLUP, Var(u − û) = G − GZ'PZG.
Both û and Var(u − û) depend on σ² - after estimating the variance components via ML or REML we get û̂ and hope that

  Var-hat(u − û) = Ĝ − ĜZ'P̂ZĜ

is a sensible estimate for the variance.
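A small simulated sketch of these formulas (my own illustration, using the one-way random effects setup from Section 3.1 with assumed values for the variance components):

> set.seed(2)
> q <- 3; nper <- 2; n <- q * nper
> sig.a <- 1.5; sig <- 0.5
> u <- rnorm(q, 0, sig.a)
> Z <- kronecker(diag(q), matrix(1, nper, 1))
> X <- matrix(1, n, 1)
> Y <- 5 + Z %*% u + rnorm(n, 0, sig)
> G <- sig.a^2 * diag(q); R <- sig^2 * diag(n)
> V <- Z %*% G %*% t(Z) + R
> b <- solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% Y)  # GLS estimate of beta
> uhat <- G %*% t(Z) %*% solve(V) %*% (Y - X %*% b)             # BLUP of u
> cbind(u, uhat)                                                # predictions shrunken towards 0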
Example: Ergometrics experiment with stool types
Four different types of stools were tested on nine persons. For each person the “easiness” to get up from the
stool was measured on a Borg scale.
Variables:

  effort    effort (Borg scale) required to arise from a stool
  Type      factor of stool type
  Subject   factor for the subject in the experiment
library(lattice)
trellis.par.set(theme=col.whitebg())
plot(ergoStool)
[Figure: dotplot of the effort required to arise (Borg scale) by subject, with separate symbols for stool types T1-T4.]
What becomes apparent in the plot is, that for different subjects we have different values on the Borg scale.
The overall range is similar, though. The different types of stools seem to get a similar ordering by all of the
test persons. This indicates that the stool effect will be significant in the model.
We model these data by a mixed effects model:
  y_ij = µ + β_j + b_i + ε_ij,    i = 1, ..., 9,  j = 1, ..., 4,

with

  b_i ∼ N(0, σ_b²),   ε_ij ∼ N(0, σ²),
where µ is the overall “easiness” to get up from the stool, βj is the effect of the type of stool, and bi is the
random effect for each person. We want to draw inference for the whole population - this is why we think of
the test persons as a sample of the population, and treat bi as random.
For each subject i = 1, 2, ..., 9 we get a subset model:

  (y_i1, y_i2, y_i3, y_i4)'  =  [ 1 1 0 0 0
                                  1 0 1 0 0
                                  1 0 0 1 0
                                  1 0 0 0 1 ] β  +  (1, 1, 1, 1)' u_i  +  ε_i.
The matrix X for each subject is singular: the last four columns add up to the first. We need to define some side conditions to make the problem computationally doable:
Helmert Contrasts
Helmert contrasts is the default in Splus. In Helmert Contrasts, the jth linear combination is the difference
between level j + 1 and the average of the first j. The following example returns a Helmert parametrization
based on four levels:
> options(contrasts=c("contr.helmert",contrasts=T))
> contrasts(ergoStool$Type)
   [,1] [,2] [,3]
T1   -1   -1   -1
T2    1   -1   -1
T3    0    2   -1
T4    0    0    3
Helmert contrasts have the advantage of being orthogonal, i.e. estimates will be independent.
For each subject the model matrix looks like this:
> model.matrix(effort~Type,data=ergoStool[ergoStool$Subject==1,])
  (Intercept) Type1 Type2 Type3
1           1    -1    -1    -1
2           1     1    -1    -1
3           1     0     2    -1
4           1     0     0     3
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Type
[1] "contr.helmert"
Interpretation: with Helmert contrasts, β1 is the overall mean effort across stool types, β2 measures the difference between T2 and T1, β3 the difference between T3 and the average effect of T1 and T2, and so on.
Model Fit Also included in the nlme package is the function lme, which fits a linear mixed effects model.
> fm1Stool <- lme(effort~Type, random = ~1|Subject, data=ergoStool)
> summary(fm1Stool)
Linear mixed-effects model fit by REML
 Data: ergoStool 
       AIC      BIC    logLik
  139.4869 148.2813 -63.74345

Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:    1.332465 1.100295

Fixed effects: effort ~ Type 
                Value Std.Error DF   t-value p-value
(Intercept) 10.250000 0.4805234 24 21.330905  0.0000
Type1        1.944444 0.2593419 24  7.497610  0.0000
Type2        0.092593 0.1497311 24  0.618392  0.5421
Type3       -0.342593 0.1058759 24 -3.235794  0.0035
 Correlation: 
      (Intr) Type1 Type2
Type1 0                 
Type2 0      0          
Type3 0      0     0    

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.80200344 -0.64316591  0.05783115  0.70099706  1.63142053 

Number of Observations: 36
Number of Groups: 9
The figure below shows effort averages for both stool types and subjects:
[Figure: interaction plot of mean effort by stool type (T1-T4) and subject.]
Alternative side conditions: Treatment Contrasts
> options(contrasts=c("contr.treatment",contrasts=T))
> contrasts(ergoStool$Type)
T2 T3 T4
T1 0 0 0
T2 1 0 0
T3 0 1 0
T4 0 0 1
> model.matrix(effort~Type,data=ergoStool[ergoStool$Subject==1,])
  (Intercept) TypeT2 TypeT3 TypeT4
1           1      0      0      0
2           1      1      0      0
3           1      0      1      0
4           1      0      0      1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Type
[1] "contr.treatment"
>
> fm2Stool <- lme(effort~Type, random = ~1|Subject, data=ergoStool)
> summary(fm2Stool)
Linear mixed-effects model fit by REML
 Data: ergoStool 
       AIC      BIC   logLik
  133.1308 141.9252 -60.5654

Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:    1.332465 1.100295

Fixed effects: effort ~ Type 
               Value Std.Error DF   t-value p-value
(Intercept) 8.555556 0.5760122 24 14.853079  0.0000
TypeT2      3.888889 0.5186838 24  7.497609  0.0000
TypeT3      2.222222 0.5186838 24  4.284348  0.0003
TypeT4      0.666667 0.5186838 24  1.285304  0.2110
 Correlation: 
       (Intr) TypeT2 TypeT3
TypeT2 -0.45               
TypeT3 -0.45   0.50        
TypeT4 -0.45   0.50   0.50 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.80200341 -0.64316592  0.05783113  0.70099704  1.63142052 

Number of Observations: 36
Number of Groups: 9
Checking Residuals

> plot(fm2Stool, col=1)
> intervals(fm1Stool)
Approximate 95% confidence intervals

 Fixed effects:
                 lower      est.    upper
(Intercept)  7.3667247 8.5555556 9.744386
TypeT2       2.8183781 3.8888889 4.959400
TypeT3       1.1517114 2.2222222 3.292733
TypeT4      -0.4038442 0.6666667 1.737178
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: Subject 
                    lower     est.    upper
sd((Intercept)) 0.7494109 1.332465 2.369145

 Within-group standard error:
    lower      est.     upper 
0.8292432 1.1002946 1.4599434 
[Figure: standardized residuals plotted against fitted values (Borg scale) for the stool model.]
The residual plot does not show any outliers nor any remaining pattern - the model seems to fit well.
Measuring the precision of σ̂²
Since all the estimates are based on ML and REML, we could use ML theory to find variances - but that is hard. Instead, we use large-sample theory for MLEs.
Idea: Let θ be a q-dimensional vector of parameters with likelihood function ℓ(θ), maximized by θ̂. Then

  Var-hat(θ̂) = ( − ∂²ℓ(θ)/∂θ_i ∂θ_j |_{θ=θ̂} )⁻¹ = V.

A confidence interval for θ_i is θ̂_i ± z √(V_ii).

Application to the estimation of σ²:
Let γ be the vector of log variance components:

  γ = (γ_1, ..., γ_q) = (log σ_1², ..., log σ_q²).

With likelihood function ℓ(β, σ²) and restricted likelihood function ℓ*_r(σ²), we get corresponding likelihood functions

  ℓ(β, γ) = ℓ(β, e^γ),   ℓ*_r(γ) = ℓ*_r(e^γ).
Construct the covariance matrix: let

  M = [ M11  M12
        M21  M22 ]

with

  M11 = ( ∂²ℓ(β, γ)/∂γ_i ∂γ_j )|_(β,γ)_ML   (q × q),      M12 = ( ∂²ℓ(β, γ)/∂γ_i ∂β_j )|_(β,γ)_ML   (q × p),
  M21 = ( ∂²ℓ(β, γ)/∂β_i ∂γ_j )|_(β,γ)_ML   (p × q),      M22 = ( ∂²ℓ(β, γ)/∂β_i ∂β_j )|_(β,γ)_ML   (p × p).

With M containing the second partial derivatives with respect to the parameters, we can define Q = −M⁻¹ as an estimate of the variance-covariance structure.
Confidence limits for γ_i are then

  γ̂_i ± z √(q_ii),

which yields approximate confidence limits for σ_i² as

  ( e^( γ̂_i − z √(q_ii) ),  e^( γ̂_i + z √(q_ii) ) ).

Similarly, for REML estimates we set up the matrix

  M_r = ( ∂²ℓ*_r(γ)/∂γ_i ∂γ_j )|_{γ = γ̂_REML}   (q × q).

A variance-covariance structure is then obtained in the same way as before for ML estimates.
Both of these versions are implemented in the R function lme in package nlme. You can choose between REML and ML estimates by specifying method = "REML" or method = "ML" in the call to lme; the default is REML.
3.4  Anova Models
A different (and more old-fashioned) approach to variance estimation is inference based on anova tables.
Let MS_1, ..., MS_l be a set of independent random variables with

  df_i · MS_i / E[MS_i] ∼ χ²_{df_i}.

For convenience, we will write EMS_i for E[MS_i].
The idea now is to write variances as linear combinations of these mean squares. Let s² be some linear combination of the MS_i:

  s² = a_1 MS_1 + a_2 MS_2 + ... + a_l MS_l.

s² then has the following properties:

  E(s²) = Σ_i a_i EMS_i,
  Var(s²) = Σ_i a_i² Var(MS_i) = Σ_i a_i² (EMS_i/df_i)² Var( df_i MS_i / EMS_i ) = Σ_i a_i² (EMS_i/df_i)² · 2 df_i = 2 Σ_i a_i² (EMS_i)² / df_i.
We now have several choices for an estimate of Var(s²):

• Plug in MS_i as an estimate of EMS_i:

    Var-hat(s²) = 2 Σ_i a_i² MS_i² / df_i.

• More subtle: since E[MS_i²] = Var(MS_i) + (EMS_i)² = 2(EMS_i)²/df_i + (EMS_i)² = (EMS_i)² (2 + df_i)/df_i, we have

    (EMS_i)² = df_i/(df_i + 2) · E[MS_i²].

  This yields a different estimate for Var(s²):

    Var-hat(s²) = 2 Σ_i a_i² MS_i² / (df_i + 2),

  which is smaller than the first one and therefore gives smaller confidence intervals.
To get an approximate distribution for s² we use the Cochran-Satterthwaite approximation, which says that for a linear combination

  s² = a_1 MS_1 + a_2 MS_2 + ... + a_l MS_l    with    df_i · MS_i / E[MS_i] ∼ χ²_{df_i},

the distribution of s² is approximately a multiple of a χ², i.e.

  ν s² / E[s²] ∼ χ²_ν    with    ν = E[s²]² / ( (1/2) Var[s²] ).

The degrees of freedom ν come from matching the mean and variance of ν s²/E[s²] to those of a χ²_ν (ν and 2ν).
An approximate confidence interval for the expected value of s² is then

  ( ν s² / χ²_{ν, upper},  ν s² / χ²_{ν, lower} ).

This is not directly usable because ν depends on the EMS_i. Estimate ν by

  ν̂ = 2 (s²)² / Var-hat(s²).

This gives (very) approximate confidence intervals for E(s²):

  ( ν̂ s² / χ²_{ν̂, upper},  ν̂ s² / χ²_{ν̂, lower} ).
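A small numerical sketch of this recipe (my own illustration; the coefficients, mean squares and degrees of freedom are made-up values, not from any example in these notes):

> a  <- c(1, -1/3)          # coefficients a_i
> MS <- c(24.0, 6.0)        # observed mean squares
> df <- c(5, 12)            # their degrees of freedom
> s2     <- sum(a * MS)
> var.s2 <- 2 * sum(a^2 * MS^2 / df)   # plug-in estimate of Var(s^2)
> nu.hat <- 2 * s2^2 / var.s2          # estimated (Satterthwaite) degrees of freedom
> alpha <- 0.05
> c(nu.hat * s2 / qchisq(1 - alpha/2, nu.hat),
+   nu.hat * s2 / qchisq(alpha/2, nu.hat))   # approximate 95% C.I. for E[s^2]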
Example: Machines
Goal: Compare three brands of machines A, B, and C used in an industrial process. Six workers were chosen
randomly to operate each machine three times. Response is overall productivity score taking into account
the number and quality of components produced.
Variables:

  Worker    a factor giving a unique identifier for the workers (random effect b_i)
  Machine   a factor with levels A, B, and C identifying the machine brand (fixed effect β_j)
  score     a productivity score (y_ijk)

with i = 1, ..., 6, j = 1, 2, 3, k = 1, 2, 3 (repetitions).
library(lattice)
trellis.par.set(theme=col.whitebg())
plot(Machines)
The graphic below gives a first overview of the data: the productivity score is plotted against workers. The ordering of workers on the vertical axis is by each individual's highest productivity; worker 6 showed the overall lowest performance, worker 5 the highest productivity score. Color and symbol correspond to machines A to C. Productivity scores for each machine seem to be fairly stable across workers: each worker got the highest productivity score on machine C, and most of the workers (except worker 6) had the lowest score on machine A. Repetitions on the same brand gave very similar productivity scores.
[Figure: productivity score by worker (1-6), with colors/symbols for machines A, B, and C.]
Model Machines Data
A first idea for modeling these data is to treat machines as a fixed effect and workers as random (in order
to be able to draw inference on the workers’ population) independently from machine:
  y_ijk = β_j + b_i + ε_ijk,    with b_i ∼ N(0, σ_b²) and ε_ijk ∼ N(0, σ²).

The variance assumptions seem to be justified based on the first plot.
> mm1 <- lme(score~Machine, random=~1|Worker, data=Machines)
> summary(mm1)
Linear mixed-effects model fit by REML
 Data: Machines 
       AIC      BIC    logLik
  296.8782 306.5373 -143.4391

Random effects:
 Formula: ~1 | Worker
        (Intercept) Residual
StdDev:    5.146552 3.161647

Fixed effects: score ~ Machine 
               Value Std.Error DF  t-value p-value
(Intercept) 52.35556  2.229312 46 23.48507       0
MachineB     7.96667  1.053883 46  7.55935       0
MachineC    13.91667  1.053883 46 13.20514       0
 Correlation: 
         (Intr) MachnB
MachineB -0.236       
MachineC -0.236  0.500

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.7248806 -0.5232891  0.1327564  0.6513056  1.7559058 

Number of Observations: 54
Number of Groups: 6
The REML estimates of the standard deviations are σ̂_b = 5.146552 (between workers) and σ̂ = 3.161647 (residual).
The effects of the machine brands point in the same direction as we already saw in the plot: machine A (effect set to zero) is estimated to be less helpful than machine B, which in turn is estimated to be less helpful than machine C for achieving a high productivity score.
In order to inspect the fit of the model further, we might want to plot the fitted value and compare the fit
to the raw data. That way we also see what exactly the above model is doing: different productivity scores
for each machine are fitted, each worker gets an “offset” by which these scores are moved horizontally. The
biggest difference between the raw data and the fitted seems to be for the productivity scores of worker 6,
where estimated productivity scores of machines A and B switch their ranking.
Visualizing Fitted Values
attach(Machines)
MM1 <- groupedData(
  fitted(mm1)~factor(Machine) | Worker,
  data.frame(cbind(Worker,Machine,fitted(mm1)))
)
plot(MM1)
detach(Machines)

[Figure: fitted values from mm1 by worker and machine.]
Since we still see major differences between raw and fitted values, we might want to include an interaction
effect to the model:
71
Assessing Interaction Effects

attach(Machines)
interaction.plot(Machine, Worker, score, col=2:7)
detach(Machines)

[Figure: interaction plot of mean score versus machine (A, B, C), with one line per worker.]
Model: Add Interaction Effect
  y_ijk = β_j + b_i + b_ij + ε_ijk,    with b_i ∼ N(0, σ1²), b_ij ∼ N(0, σ2²) and ε_ijk ∼ N(0, σ²).

Now we are dealing with random effects on two levels: random effects for each worker, and random effects of each machine for each worker (Machine within Worker). Whenever we are dealing with the interaction of a random effect and a fixed effect, the resulting interaction effect has to be treated as a random effect.

mm2 <- update(mm1, random=~1| Worker/Machine)

The resulting parameter estimates are shown in summary(mm2): the overall residual standard deviation is reduced to σ̂ = 0.9615768.
> summary(mm2)
Linear mixed-effects model fit by REML
 Data: Machines 
       AIC      BIC    logLik
  227.6876 239.2785 -107.8438

Random effects:
 Formula: ~1 | Worker
        (Intercept)
StdDev:    4.781049

 Formula: ~1 | Machine %in% Worker
        (Intercept)  Residual
StdDev:    3.729536 0.9615768

Fixed effects: score ~ Machine 
               Value Std.Error DF   t-value p-value
(Intercept) 52.35556  2.485829 36 21.061606  0.0000
MachineB     7.96667  2.176974 10  3.659514  0.0044
MachineC    13.91667  2.176974 10  6.392665  0.0001
 Correlation: 
         (Intr) MachnB
MachineB -0.438       
MachineC -0.438  0.500

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.26958756 -0.54846582 -0.01070588  0.43936575  2.54005852 

Number of Observations: 54
Number of Groups: 
             Worker Machine %in% Worker 
                  6                  18
A plot of the fitted value also indicates that the values are much closer to the raw data.
[Figure: fitted values from mm2 by worker and machine; they track the raw data closely.]

MM2 <- groupedData(fitted(mm2)~factor(Machine) | Worker,
                   data.frame(cbind(Worker,Machine,fitted(mm2))))
plot(MM2)
Statistically, we can test whether we actually need the interaction effect by using an anova table (in the
framework of a reduced/full model test): the difference between mm1 and mm2 turns out to be highly
significant, meaning that we have to reject the hypothesis that we do not need the interaction effect of model
mm2 (i.e. we need the interaction effect). This does not give an indication, however, whether model mm2 is
a “good enough” model.
Comparison of Models

> plot(MM2)
> anova(mm1,mm2)
    Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mm1     1  5 296.8782 306.5373 -143.4391                        
mm2     2  6 227.6876 239.2785 -107.8438 1 vs 2 71.19063  <.0001

We were able to estimate the interaction effects because of the repetitions (unlike in the stool data example).
4  Bootstrap Methods

A different approach to getting variance estimates for parameters.
Situation: assume a sample X_1, ..., X_n ∼ F i.i.d. for some distribution F. The X_i are possibly vector-valued observations.
We want to make some inference on a characteristic θ of F, e.g. mean, median, variance, correlation.
Compute an estimate t_n of θ from the observed sample:

  mean: t_n = (1/n) Σ_i X_i
  standard deviation: t_n² = (1/(n−1)) Σ_i (X_i − X̄)²
  ...

What can we say about the distribution of t_n? We want estimates of E[t_n], Var[t_n], the distribution itself, confidence intervals, probabilities, ...
1. If F is known, one way of producing estimates is by simulation:
   (a) Draw samples X_1^b, ..., X_n^b from F, for b = 1, ..., B.
   (b) For each sample compute t_n^b.
   (c) Use these B realizations of t_n to compute
       i. mean: Ê[t_n] = (1/B) Σ_b t_n^b
       ii. variance: Var-hat[t_n] = (1/(B−1)) Σ_b ( t_n^b − t̄_n )²
       iii. the empirical (cumulative) distribution function of t_n: F̂(t) = #{ samples with t_n^b < t } / B.

2. If F is not known, but we know the family F belongs to, i.e. we know that F = F_θ for some unknown θ, we can estimate θ̂ from the sample and run the simulation based on B samples from F̂ = F_θ̂. This is called the parametric bootstrap.

3. If F is completely unknown, we cannot generate random numbers from this distribution; X_1, ..., X_n are the only realizations of F known to us. The idea is to approximate F by the empirical distribution F̂ of X_1, ..., X_n:
   (a) Draw B samples X_1^b, ..., X_n^b from X_1, ..., X_n with replacement. Each such sample is called a bootstrap sample.
   (b) For each bootstrap sample compute t_n^b.
   (c) Use these B realizations of t_n to compute mean, variance, and empirical distribution function as in 1.(c).

If used properly this yields consistent estimates: for n → ∞, B → ∞,

  E_F̂[t_n] = (1/B) Σ_b t_n^b → E_F[t_n],    Var_F̂[t_n] = (1/(B−1)) Σ_b ( t_n^b − t̄_n )² → Var_F[t_n].

What are good values for B, the number of bootstrap samples?

  standard deviation   B ≈ 200
  C.I.                 B ≈ 1000
  more demanding       B ≈ 5000
Example: Law Schools 15 law schools report admission conditions on LSAT and GPA scores.
[Figure: scatterplot of LSAT versus GPA for the 15 sampled law schools.]
Interested in correlation between scores
> options(digits=4)
> library(bootstrap)
> data(law)
> law
LSAT GPA
1
576 339
2
635 330
3
558 281
4
578 303
5
666 344
6
580 307
7
555 300
8
661 343
9
651 336
10 605 313
11 653 312
12 575 274
13 545 276
14 572 288
15 594 296
> attach(law)
> cor(GPA,LSAT)
[1] 0.7764
How can we compute a C.I. for the correlation between the scores?
The first two bootstrap samples look like this:

> ID <- 1:15
> #### C.I. for correlation?
> # first bootstrap sample
> b1 <- sample(ID, size=15, replace=T); b1
 [1] 10  9  8  8  5  7 11  6 10 12  3  2 12 11  6
> law[b1,]
LSAT GPA
10
605 313
9
651 336
8
661 343
8.1
661 343
5
666 344
7
555 300
11
653 312
6
580 307
10.1 605 313
12
575 274
3
558 281
2
635 330
12.1 575 274
11.1 653 312
6.1
580 307
> cor(law[b1,]$LSAT, law[b1,]$GPA)
[1] 0.8393
> # second bootstrap sample
> b2 <- sample(ID, size=15, replace=T); b2
 [1]  7 12  1 11  2  3  9  6  7  2  4  5 13  1 12
> law[b2,]
LSAT GPA
7
555 300
12
575 274
1
576 339
11
653 312
2
635 330
3
558 281
9
651 336
6
580 307
7.1
555 300
2.1
635 330
4
578 303
5
666 344
13
545 276
1.1
576 339
12.1 575 274
> cor(law[b2,]$LSAT, law[b2,]$GPA)
[1] 0.6534
Iterating 5000 times:
> # not recommended - because of the for loop
> boot.cor <- rep(NA,5000) # output dummy
> for (i in 1:5000) {
+   b <- sample(ID, size=15, replace=T)
+   boot.cor[i] <- cor(law[b,]$LSAT, law[b,]$GPA)
+ }
> summary(boot.cor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.08493 0.68830 0.78850 0.76900 0.87360 0.99440
> sd(boot.cor)
[1] 0.1334
>
> # rather
> scor <- function(x) {
+ b <- sample(ID, size=15, replace=T)
+ return(cor(law[b,]$LSAT, law[b,]$GPA))
+ }
>
> boot.cor <- sapply(1:5000,FUN=scor)
> summary(boot.cor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05641 0.69120 0.78860 0.76980 0.87130 0.99520
> sd(boot.cor)
[1] 0.1337
>
> # maybe easiest: use bootstrap command in package bootstrap
> boot.cor <- bootstrap(ID,5000,scor)
> summary(boot.cor)
Length Class Mode
thetastar
5000
-none- numeric
func.thetastar
0
-none- NULL
jack.boot.val
0
-none- NULL
jack.boot.se
0
-none- NULL
call
4
-none- call
> summary(boot.cor$thetastar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0677  0.6920  0.7930  0.7720  0.8740  0.9980
> sd(boot.cor$thetastar)
[1] 0.1326
Histogram of 5000 Bootstrap Correlations

> hist(boot.cor)

[Figure: histogram of the 5000 bootstrap correlations, ranging from about 0.2 to 1.0.]
Influence of Number of Bootstrap Samples:
> n <- c(10,50,100,200,250, 500, 600,700,800, 900,1000,2000,3000)
> for(i in 1:length(n)) {
+ b <- sapply(1:n[i],FUN=scor)
+ print(c(n[i],mean(b),sd(b)))
+ }
[1]   10.0000000    0.8565279    0.1370131
[1]   50.0000000    0.7996950    0.1059167
[1]  100.0000000    0.7756764    0.1306812
[1]  200.0000000    0.7777854    0.1299204
[1]  250.0000000    0.7567017    0.1347792
[1]  500.0000000    0.7716557    0.1341673
[1]  600.0000000    0.7673644    0.1328523
[1]  700.0000000    0.7715935    0.1312316
[1]  800.0000000    0.7633297    0.1379392
[1]  900.0000000    0.7737205    0.1334535
[1] 1000.0000000    0.7724381    0.1276535
[1] 2000.0000000    0.7742773    0.1315137
[1] 3000.0000000    0.7683884    0.1341691
The Real Value: Data of all 82 Law Schools
For the law school data information for all of the population (all 82 schools at that time) is available. Data
points sampled for the first example are marked by filled circles.
[Figure: scatterplot of GPA versus LSAT for all 82 law schools; the 15 schools sampled in the first example are marked by filled symbols.]
We are in the unique situation, of being able to determine the true parameter value: the value for the
correlation coefficient turns out to be 0.7600
> data(law82)
> plot(GPA~LSAT,data=law82)
> points(LSAT,GPA,pch=17)
>
> cor(law82$GPA,law82$LSAT)
[1] 0.7600
Having all the data also makes it possible to get values for the true distribution of correlation coefficients
based on samples of size 15 from the 82 schools. How many samples of size 15 are there (with replacement)?
78
> # binomial coefficient: choose 15 from 82
> fac(82)/(fac(15)*fac(67))   # lower limit - does not include replicates
[1] 9.967e+15
> 82^15                       # upper limit - regards ordering of draws
[1] 5.096e+28

Clearly, these are too many possibilities to consider. We will therefore only look at 100,000 samples of size 15 from the 82 schools:
> N <- 100000
> corN <- rep(NA,N)
> for (i in 1:N) {
+   b <- sample(1:82,size=15,replace=T)
+   corN[i] <- cor(law82[b,2],law82[b,3])
+ }
> summary(corN)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -0.2653  0.6794  0.7702  0.7476  0.8413  0.9885
> sd(corN)
[1] 0.1291587
The true value for the correlation based on 15 samples therefore is 0.7476. In the comparison with the true
value of the population, we see a difference - the bias of our estimate is -0.0124.
Bias of Bootstrap Estimates

  Bias_F(t_n) = E_F[t_n] − θ,

where E_F[t_n] is the average of t_n over all possible samples of size n from the population.
Law school: 0.7476 − 0.7600 = −0.0124.
Since we usually know neither F nor θ, we have to use estimates for the bias:

Estimate of Bias:

  Bias_F̂(t_n) = E_F̂[t_n] − t_n,

where E_F̂[t_n] is the average over the B bootstrap samples and t_n is the value from the original sample.
Law school: 0.7698 − 0.7764 = −0.0066.
Efron & Tibshirani (2003) suggested a correction for the bootstrap estimate to reduce the bias:

Improved bias-corrected estimate:

  t̃_n = t_n − Bias_F̂(t_n) = 2 t_n − E_F̂[t_n].

Law school: 0.7764 − (−0.0066) = 0.7830.
In the law school example this correction backfires - after the correction the estimate is further from the true value than the raw bootstrap average.
This happens quite frequently in practice, because the mean squared error of the bias-corrected estimate can be larger than the mean squared error of the raw estimate.
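A two-line sketch of this computation in R (using the vector of bootstrap correlations from the bootstrap() call above, stored in boot.cor$thetastar):

> tn <- cor(law$GPA, law$LSAT)              # 0.7764, estimate from the original sample
> mean(boot.cor$thetastar) - tn             # bootstrap estimate of the bias (about -0.007)
> 2 * tn - mean(boot.cor$thetastar)         # bias-corrected estimate (about 0.78)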
Confidence Intervals  The idea for confidence intervals based on a bootstrap sample is to use the empirical distribution of the bootstrap estimates, sorting the values from smallest to largest and identifying the most extreme α% of cases:

1. Sort the bootstrap estimates: t^b_(1) ≤ t^b_(2) ≤ ... ≤ t^b_(B).
2. Compute upper and lower α/2 percentiles:
     k_L = largest integer ≤ (B + 1) α/2 = ⌊(B + 1) α/2⌋,
     k_U = B + 1 − k_L.
3. A (1 − α)100% C.I. for θ is then ( t_(k_L), t_(k_U) ).
Bootstrap 90% C.I. for correlation in Law School Data
> N <- 5000
> alpha <- 0.1
> kL <- trunc((N+1)*alpha/2); kL
[1] 250
> kU <- N+1 - kL; kU
[1] 4751
>
> round(sort(boot.cor)[c(kL,kU)],4)
[1] 0.5196 0.9465
Properties of the percentile bootstrap C.I.:
• The bootstrap confidence interval is consistent for B → ∞.
• The C.I. is invariant under transformation.
• The bootstrap approximation becomes more accurate for larger samples (n → ∞).
• For smaller samples the bootstrap coverage tends to be smaller than the nominal (1 − α)100% level.
• The percentile interval lies entirely inside the parameter space.

In order to overcome the problem of bias in the C.I. limits and to correct for the undercoverage problem, we look at an alternative to the percentile C.I.:
Bias-corrected accelerated (BCa) bootstrap C.I.:
Draw B bootstrap samples and order the estimates:

  t*_n(1) ≤ t*_n(2) ≤ ... ≤ t*_n(B).

The BCa interval of intended coverage (1 − α) is given by

  ( t*_n(⌊α1 (B+1)⌋),  t*_n(⌊α2 B⌋) ),

where

  α1 = Φ( z0 + (z0 − z_{α/2}) / (1 − a (z0 − z_{α/2})) ),
  α2 = Φ( z0 + (z0 + z_{α/2}) / (1 − a (z0 + z_{α/2})) ),

and z_{α/2} is the upper α/2 quantile of the standard normal.

  z0 = Φ⁻¹( proportion of bootstrap samples t*_ni lower than t_n ) = Φ⁻¹( #{ t*_ni < t_n } / B ).

z0 is a bias correction, which measures the median bias of t_n in normal units. If t_n is close to the median of the t*_ni, then z0 is close to 0.
a is the acceleration parameter:

  a = Σ_j ( t_n,(.) − t_n,−j )³ / ( 6 [ Σ_j ( t_n,(.) − t_n,−j )² ]^(3/2) ),

where t_n,−j is the value of the statistic t_n computed without value x_j, and t_n,(.) is the average t_n,(.) = (1/n) Σ_j t_n,−j.

• BCa intervals are second-order accurate, i.e.

    P( θ < lower end of BCa interval ) = α/2 + C_lower/n,
    P( θ > upper end of BCa interval ) = α/2 + C_upper/n.
81
Percentile intervals are only 1st order accurate:
∗
α Clower
+ √
2
n
∗
C
α
upper
P (θ > upper limit ) = + √
2
n
P (θ < lower limit ) =
• like percentile intervals, BCa intervals are transformation respecting
• disadvantage: BCa intervals are computationally intensive
In the example of the Law School Data we get a BCa interval of (0.4288, 0.9245). One way to compute this interval in R is provided by the library boot:

> scor <- function(x,b) { # x is the dataframe, b is an index of the samples included
+   return(cor(x[b,]$LSAT, x[b,]$GPA))
+ }
> boot.cor <- boot(law,scor,5000)
> boot.cor
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = law, statistic = scor, R = 5000)
Bootstrap Statistics :
    original      bias    std. error
t1*   0.7764   -0.008126      0.1346
>
> summary(boot.cor)
          Length Class      Mode     
t0           1   -none-     numeric  
t         5000   -none-     numeric  
R            1   -none-     numeric  
data         2   data.frame list     
seed       626   -none-     numeric  
statistic    1   -none-     function 
sim          1   -none-     character
call         4   -none-     call     
stype        1   -none-     character
strata      15   -none-     numeric  
weights     15   -none-     numeric  
>
> boot.cor$t0
[1] 0.7764
>
> summary(boot.cor$t)
 Min.   :-0.0203  
 1st Qu.: 0.6907  
 Median : 0.7913  
 Mean   : 0.7703  
 3rd Qu.: 0.8731  
 Max.   : 0.9967  
A set of bootstrap confidence intervals is then given by boot.ci:

> boot.ci(boot.cor, conf=0.9)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.cor, conf = 0.9)

Intervals : 
Level      Normal              Basic         
90%   ( 0.5655,  1.0000 )   ( 0.6065,  1.0234 )  

Level     Percentile            BCa          
90%   ( 0.5293,  0.9462 )   ( 0.4288,  0.9245 )  
Calculations and Intervals on Original Scale
Warning message:
Bootstrap variances needed for studentized intervals in: boot.ci(boot.cor, conf = 0.9)
“Manually” this could be done, too:
> z0 <- qnorm(sum(boot.cor$t < boot.cor$t0)/5000); z0
[1] -0.06673
>
> corj <- function(x) {
+
return(cor(LSAT[-x], GPA[-x]))
+ }
> tnj <- sapply(1:15, corj)
>
> acc <- sum((tnj-mean(tnj))^3)/(6*sum((tnj-mean(tnj))^2)^1.5)
> acc
[1] 0.07567
>
> alpha <- 0.1
> alpha1 <- pnorm(z0 + (z0+qnorm(alpha/2))/(1-acc*(z0+qnorm(1-alpha/2)))); alpha1
[1] 0.02219
> alpha2 <- pnorm(z0 + (z0-qnorm(alpha/2))/(1-acc*(z0-qnorm(1-alpha/2)))); alpha2
[1] 0.9578
For an effective 95% coverage 2.2% and 96% limits are used in the biased corrected & accelerated confidence
interval.
ABC intervals
ABC (approximate bootstrap confidence) intervals are a computationally less intensive alternative to BCa intervals: only a small fraction of the computation is needed compared to a BCa interval. ABC intervals are also second-order accurate.
Bootstrap can fail
Let X_i ∼ U[0, θ] be i.i.d. An estimate of θ is given by θ̂ = max_i X_i. A C.I. for θ based on the bootstrap tends to be too short:
let θ = 2:
> theta <- 2
> x <- runif(100, 0,theta)
> max(x)
[1] 1.998
>
> maxb <- function(x,b) return(max(x[b]))
> boot.theta <- boot(x,maxb,5000)
> boot.ci(boot.theta,conf=0.9)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

CALL :
boot.ci(boot.out = boot.theta, conf = 0.9)

Intervals :
Level      Normal              Basic
90%   ( 1.981,  2.030 )   ( 1.998,  2.058 )

Level     Percentile            BCa
90%   ( 1.939,  1.998 )   ( 1.939,  1.998 )
Calculations and Intervals on Original Scale
Warning message:
Bootstrap variances needed for studentized intervals in:
boot.ci(boot.theta, conf = 0.9)
Neither the percentile interval nor the BCa interval extends far enough to the right to cover θ = 2: the bootstrap replicates of the maximum can never exceed the observed maximum θ̂, so the resampling distribution cannot reflect that θ̂ systematically underestimates θ.
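A quick calculation shows how severe the problem is. The probability that a bootstrap resample contains the observed maximum, and therefore reproduces θ̂ exactly, is
    P( θ̂* = θ̂ ) = 1 − (1 − 1/n)^n → 1 − e^{−1} ≈ 0.632   as n → ∞,
so well over half of the bootstrap replicates sit on the single value θ̂, while the true sampling distribution of θ̂ is continuous and concentrated just below θ.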
Example: Stormer Data
A stormer viscometer measures the viscosity of a fluid by the time T an inner cylinder needs to perform a fixed number of rotations. Calibration is done by adding weights W to the cylinder in fluids of known viscosity V.
Physics gives us the nonlinear model:
    T = β1 V / (W − β2) + ε
This is equivalent to
    W · T = β1 V + β2 T + (W − β2) ε
For a start we can treat this equation as a linear model by ignoring the error term.
We fit a linear model to get an idea for initial values of β1 , β2 :
> library(MASS)
> data(stormer)
> names(stormer)
[1] "Viscosity" "Wt"        "Time"
>
> lm1 <- lm(Wt*Time ~ Viscosity + Time - 1, data=stormer)
> summary(lm1)
Call:
lm(formula = Wt * Time ~ Viscosity + Time - 1, data = stormer)

Residuals:
   Min     1Q Median     3Q    Max
-304.4 -144.1   84.2  209.1  405.5

Coefficients:
          Estimate Std. Error t value Pr(>|t|)
Viscosity   28.876      0.554   52.12   <2e-16 ***
Time         2.844      0.766    3.71   0.0013 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 236 on 21 degrees of freedom
Multiple R-Squared: 0.998, Adjusted R-squared: 0.997
F-statistic: 4.24e+03 on 2 and 21 DF,  p-value: <2e-16
A nonlinear model is then fit by nls, using the coefficients of the linear model as starting values. The resulting parameter estimates are not very far off the initial values:
> bc <- coef(lm1)
> fit <- nls(Time~b1*Viscosity/(Wt-b2),data=stormer,
+            start=list(b1=bc[1],b2=bc[2]),trace=T)
885.4 : 28.876  2.844
825.1 : 29.393  2.233
825   : 29.401  2.218
825   : 29.401  2.218
> coef(fit)
    b1     b2
29.401  2.218
We would now like to find confidence intervals for these estimates. We can do that by drawing bootstrap samples and fitting a nonlinear model to each of them.
Bootstrap nonlinear regression:
> nlsb <- function(x,b) {
+   return(coef(nls(Time~b1*Viscosity/(Wt-b2),data=x[b,],
+               start=list(b1=bc[1],b2=bc[2]))))
+ }
> stormer.boot <- boot(stormer,nlsb, 1000)
> stormer.boot
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = stormer, statistic = nlsb, R = 1000)

Bootstrap Statistics :
    original      bias    std. error
t1*   29.401   -0.04682       0.6629
t2*    2.218    0.07708       0.7670
>
> plot(stormer.boot$t[,1], stormer.boot$t[,2])
[Scatterplot of the bootstrap replicates: stormer.boot$t[, 2] (the b2 estimates) plotted against stormer.boot$t[, 1] (the b1 estimates).]
> boot.ci(stormer.boot)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = stormer.boot)

Intervals :
Level      Normal              Basic            Studentized
95%   (28.15, 30.75 )   (28.13, 30.77 )   (27.54, 30.49 )

Level     Percentile            BCa
95%   (28.03, 30.68 )   (28.07, 30.73 )
Calculations and Intervals on Original Scale
> boot.ci(stormer.boot,index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = stormer.boot, index = 2)

Intervals :
Level      Normal              Basic
95%   ( 0.670,  3.635 )   ( 0.735,  3.551 )

Level     Percentile            BCa
95%   ( 0.885,  3.701 )   ( 0.650,  3.391 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
Warning message:
Bootstrap variances needed for studentized intervals in:
boot.ci(stormer.boot, index = 2)
5 Generalized Linear Model
Generalized Linear Models (GLMs) consist of three components:
random component:
response variable Y with independent observations y1, y2, ..., yN; Y has a distribution in the exponential family, i.e. the distribution of Y has parameters θ and φ, and we can write the density in the following form:
    f(yi; θi, φ) = exp[ ( yi θi − b(θi) ) / a(φ) − c(yi, φ) ],
with functions a(·), b(·), and c(·).
θi is called the canonical (or natural) parameter and φ is the dispersion.
The exponential family is very general and includes the normal, binomial, Poisson, Γ, inverse Γ, ...
systematic component:
X1, ..., Xp explanatory variables, where xij is the value of variable j for subject i. The systematic component is then a linear combination of the Xj:
    ηi = Σ_{j=1}^p βj xij
This is equivalent to a design matrix in a standard linear model.
link function
The link function builds the connection between the systematic and the random component: let µi be the expected value of Yi, i.e. E[Yi] = µi; then
    h(µi) = ηi = Σ_j βj xij
for i = 1, ..., N, where h is a differentiable, monotonic function.
h(µ) = µ is the identity link; the link function that transforms the mean to the natural parameter is called the canonical link: h(µ) = θ.
5.1 Members of the Natural Exponential Family
The natural exponential family is large and includes many familiar distributions, among them the Binomial and the Poisson distribution. For many distributions it is just a matter of re-writing the density to identify the appropriate functions a(·), b(·) and c(·). We will do that for both the Normal and the Binomial distribution:
Normal Distribution  Let Y ∼ N(µ, σ²). The density function of Y is then
    f(y; µ, σ²) = ( 1 / √(2πσ²) ) exp( −(y − µ)² / (2σ²) )
Using the basic idea that θ = µ and φ = σ², this can be re-written as
    f(y; θ, φ) = exp( log( 1/√(2πφ) ) ) exp( −(y − θ)² / (2φ) )
               = exp( −(1/2)(y − θ)²/φ + log( 1/√(2πφ) ) )
               = exp( (yθ − θ²/2)/φ − [ y²/(2φ) + (1/2) log(2πφ) ] )
The normal distribution is part of the exponential family with
    a(φ) = φ,
    b(θ) = θ²/2,
    c(y, φ) = y²/(2φ) + (1/2) log(2πφ).
The canonical link is then h(µ) = µ, the identity link.
Ordinary least squares regression with a normal response is a GLM with identity link. In classical linear modelling we usually transform Y if it is not normal; with a GLM we instead use the link function to connect the mean to the explanatory variables, and the fitting process maximizes the likelihood for the chosen distribution of Y (not necessarily normal).
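As a small illustration of this point (with made-up toy data, not from the notes), a Gaussian GLM with identity link reproduces the ordinary least squares fit:

x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)                             # hypothetical toy data
coef(lm(y ~ x))                                          # least squares estimates
coef(glm(y ~ x, family = gaussian(link = "identity")))   # same estimates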
Binomial Distribution  Let Y be the number of successes in n independent binary trials, each with success probability π, so Y ∼ B(n, π). The density function is then given as
    f(y; π) = (n choose y) π^y (1 − π)^{n−y}
This function can be re-expressed as
    f(y; π) = (n choose y) π^y (1 − π)^{n−y}
            = exp( y log π + (n − y) log(1 − π) + log (n choose y) )
            = exp( y log( π/(1 − π) ) + n log(1 − π) + log (n choose y) )
The Binomial distribution is a member of the exponential family with
    φ = 1 and a(φ) = 1,
    θ = log( π/(1 − π) )  ⇒  π = e^θ / (1 + e^θ),
    b(θ) = −n log( 1 − e^θ/(1 + e^θ) ) = n log(1 + e^θ),
    c(y, φ) = −log (n choose y).
The natural parameter of the Binomial is log( π/(1 − π) ), which is also called the logit of π. The canonical link is therefore the logit link, h(µ) = log( π/(1 − π) ) with µ = nπ.
Let Y be a binary response with P(Y = 1) = π = E[Y] and Var(Y) = π(1 − π).
Linear Probability Model
π(x) = α + βx
GLM with identity link function, i.e. g(π) = π.
Problem: sufficiently large or small values of x will make π(x) leave the proper range of [0, 1].
Logistic Regression Model  The relationship between x and π(x) might not be linear: an increase in x means something different when π(x) is close to 0.5 than when it is near 0 or 1. An S-shaped curve as sketched below might be more appropriate to describe the relationship.
[Sketch: S-shaped curves of π(x) plotted against x.]
These curves can be described with two parameters α, β as
    π(x) = exp(α + βx) / ( 1 + exp(α + βx) );
solving for α + βx gives
    α + βx = log( π(x) / (1 − π(x)) ).
This is a GLM with logit link - the canonical link for a model with binomial response.
5.2 Inference in GLMs
Idea: develop large sample theory for the exponential family in general; the results then apply to all familiar distributions belonging to the exponential family.
Let ℓ(y; θ, φ) be the log likelihood function in the exponential family, i.e.
    ℓ(y; θ, φ) = ( yθ − b(θ) ) / a(φ) − c(y, φ),
and the partial derivative of ℓ in θ is
    ∂/∂θ ℓ(y; θ, φ) = ( y − b'(θ) ) / a(φ).
Assume the weak regularity conditions of Maximum Likelihood:
    E[ ∂/∂θ ℓ(y; θ, φ) ] = 0   and   Var[ ∂/∂θ ℓ(y; θ, φ) ] = −E[ ∂²/∂θ² ℓ(y; θ, φ) ].
This gives us the following results for the expected value and variance of Y:
    E[ ∂/∂θ ℓ(y; θ, φ) ] = 0
        ⟺  E[ (y − b'(θ))/a(φ) ] = 0
        ⟺  E[y] = b'(θ),
and
    Var[ ∂/∂θ ℓ(y; θ, φ) ] = −E[ ∂²/∂θ² ℓ(y; θ, φ) ]
        ⟺  Var[ (y − b'(θ))/a(φ) ] = E[ b''(θ)/a(φ) ]
        ⟺  Var(y) = b''(θ) a(φ).
Now it is up to us to choose a function h(·) with h(E[Y]) = Xβ, i.e. h(b'(θ)) = Xβ. Whenever h(b'(θ)) = θ, h is the canonical link.
Normal distribution
b(θ) = θ²/2, so b'(θ) = θ = E[Y]; the identity link is therefore the canonical link.
Binomial distribution
b(θ) = n log(1 + e^θ), so b'(θ) = n e^θ/(1 + e^θ) = nπ = E[Y]; the canonical link is therefore the logit link:
    θ = log( π/(1 − π) ).
Other possibilities for link functions:
• logit link: h(π) = log( π/(1 − π) ) = Xβ, therefore π = e^θ/(1 + e^θ)
• probit link: h(π) = Φ^{-1}(π), then π = Φ(θ)
• cloglog link (complementary log-log link): h(π) = log( −log(1 − π) ) = Xβ, and π = 1 − e^{−e^θ}.
Figure 4: Comparison of links in the binomial model: logit link is drawn in black, probit link in green, and
complementary log log link is drawn in red. The differences between the links are subtle - the probit link
shows a steeper increase than the logit, whereas the cloglog link shows an asymmetric increase.
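A minimal R sketch that reproduces a comparison like Figure 4, using the colours named in the caption:

theta <- seq(-4, 4, length = 200)
plot(theta, exp(theta)/(1 + exp(theta)), type = "l", ylab = "H(theta)")  # logit (black)
lines(theta, pnorm(theta), col = "green")                                # probit
lines(theta, 1 - exp(-exp(theta)), col = "red")                          # cloglog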
5.3 Binomial distribution of response
Example: Beetle Mortality
Insects were exposed to gaseous carbon disulphide for a period of 5 hours. Eight experiments were run with
different concentrations of carbon disulphide.
Variable    Description
Dose        Dose of carbon disulphide
Exposed     Number of beetles exposed
Mortality   Number of beetles killed
The scatterplot below shows rate of death versus dose:
[Scatterplot: Rate of Death versus Dose.]
Let Yj count the number of killed beetles at dose dj (j = 1, ..., 8) with Yj ∼ B(nj , πj ) independently, i.e.
each beetle has a probability of πj of dying.
Using a logit link, we get
logit πj = (Xβ)j = β0 + β1 dj
We can then set up the likelihood as
    L(β; y, x) = ∏_{j=1}^J f(yj; θj, φ) = ∏_{j=1}^J exp( yj (β0 + β1 dj) − nj log(1 + e^{β0 + β1 dj}) + log (nj choose yj) ).
To maximize over β, we set the first partial derivatives of the log-likelihood to zero:
    ∂/∂β0 ℓ(β; y, x) = Σ_{j=1}^J [ yj − nj e^{β0 + β1 dj}/(1 + e^{β0 + β1 dj}) ] = Σ_{j=1}^J ( yj − nj πj ) = 0
    ∂/∂β1 ℓ(β; y, x) = Σ_{j=1}^J [ xj yj − nj xj e^{β0 + β1 dj}/(1 + e^{β0 + β1 dj}) ] = Σ_{j=1}^J ( xj yj − nj xj πj ) = 0
where xj = dj and πj = e^{β0 + β1 dj}/(1 + e^{β0 + β1 dj}).
• these equations generally do not have a closed form solution
• if a solution exists, it is unique
• summarize the system as the likelihood equations (score function):
    0 = X'(Y − m) =: Q    (*)
Find m such that (*) holds. Use the Newton-Raphson algorithm: set up the matrix of second partial derivatives
    H = −[ ∂²/∂βi ∂βj ℓ(β) ]_{i,j}
and iterate
    β^{t+1} = β^t + H^{-1} Q,
where both H and Q are evaluated at β^t and β^0 is an initial guess.
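A minimal sketch of this Newton-Raphson iteration for the logit model (an illustration only; glm() itself fits by Fisher scoring / iteratively reweighted least squares). The design matrix X, the success counts y and the group sizes n are assumed to be supplied by the user:

newton_logit <- function(X, y, n, beta = rep(0, ncol(X)), tol = 1e-8) {
  repeat {
    eta <- as.vector(X %*% beta)
    pi  <- exp(eta) / (1 + exp(eta))
    Q   <- t(X) %*% (y - n * pi)                # score vector X'(Y - m)
    H   <- t(X) %*% (X * (n * pi * (1 - pi)))   # matrix of negative 2nd derivatives
    step <- solve(H, Q)                         # H^{-1} Q
    beta <- beta + as.vector(step)
    if (max(abs(step)) < tol) return(as.vector(beta))
  }
}
# hypothetical use with the beetle data: newton_logit(cbind(1, Dose), Mortality, Exposed)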
Large sample inference gives us
    β̂ ∼ approx. N( β, H^{-1} ),   if n πj > 5 and n(1 − πj) > 5,
where H = E[ −∂²/∂βi ∂βj ℓ(β) ]_{i,j} is the Fisher information matrix.
A fit of a GLM with Binomial response and logit link gives the following model:
fit1 <- glm(cbind(Mortality,Exposed-Mortality)~Dose, family=binomial(link=logit))
points(Dose,fit1$fitted,type="l")
[Scatterplot: Rate of Death versus Dose with the fitted logit curve overlaid.]
Additionally, we can fit models corresponding to a probit and a complementary log log link:
fit2 <- glm(cbind(Mortality,Exposed-Mortality)~Dose, family=binomial(link=probit))
points(Dose,fit2$fitted,type="l",col="red")
fit3 <- glm(cbind(Mortality,Exposed-Mortality)~Dose, family=binomial(link=cloglog))
points(Dose,fit3$fitted,type="l",col=3)
[Scatterplot: Rate of Death versus Dose with the logit (black), probit (red), and cloglog (green) fits overlaid.]
Visually, the green fit corresponding to the cloglog link seems to give the best model.
How can we compare GLMs statistically?
5.4 Likelihood Ratio Tests (Deviance)
Model M1 is nested within model M2 if M1 is a simpler model than M2, i.e. all terms in M1 are in M2 and M2 has more terms than M1.
The deviance between models M1 and M2 is defined as
    D(M1, M2) = −2 ( log likelihood_{M1} − log likelihood_{M2} );
then D(M1, M2) is approximately χ² distributed with (# parameters of M2 − # parameters of M1) degrees of freedom, and is a test statistic for H0: M2 does not explain significantly more than M1.
Based on this concept, some special deviances are defined:
• residual deviance of model M:
    D(M, M_full) = −2 ( log likelihood_M − log likelihood_{M_full} )
  This expression makes sense, as by definition every model is nested within the full model. It gives us a goodness-of-fit statistic for model M: the null hypothesis H0 assumes that the full model does not explain significantly more than model M. If we cannot reject this null hypothesis, model M is a reasonably good model.
• null deviance:
    D(M_null, M_full) = −2 ( log likelihood_{M_null} − log likelihood_{M_full} )
  This deviance gives an upper bound for the residual deviance of all models: we cannot possibly find a model with a worse (= higher) residual deviance.
• explanatory power of model M:
    D(M_null, M) = −2 ( log likelihood_{M_null} − log likelihood_M )
  This gives a measure of how much better the current model is than the null model, i.e. it tests whether M is a significant improvement.
The χ² approximation holds in each case under Cochran's rule: all expected counts should be > 1 and at least 80% of all cells should be > 5.
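In R, these deviance tests can be read directly off a fitted glm object. A minimal sketch using fit1, the logit fit from above; deviance(), df.residual() and the null.deviance component are standard parts of a glm fit:

# goodness of fit: residual deviance of fit1 against its chi-square reference
pchisq(deviance(fit1), df = df.residual(fit1), lower.tail = FALSE)
# explanatory power: null deviance minus residual deviance
pchisq(fit1$null.deviance - deviance(fit1),
       df = fit1$df.null - fit1$df.residual, lower.tail = FALSE)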
In the Beetle Mortality example, likelihood ratio tests give us a statistical handle for comparing models:
> summary(fit1)

Call:
glm(formula = cbind(Mortality, Exposed - Mortality) ~ Dose, family = binomial(link = logit))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.594  -0.394   0.833   1.259   1.594

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -60.72       5.18   -11.7   <2e-16 ***
Dose           34.27       2.91    11.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.202  on 7  degrees of freedom
Residual deviance:  11.232  on 6  degrees of freedom
AIC: 41.43

Number of Fisher Scoring iterations: 4

> summary(fit2)

Call:
glm(formula = cbind(Mortality, Exposed - Mortality) ~ Dose, family = binomial(link = probit))

Deviance Residuals:
  Min     1Q Median     3Q    Max
-1.57  -0.47   0.75   1.06   1.34

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -34.94       2.65   -13.2   <2e-16 ***
Dose           19.73       1.49    13.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.202  on 7  degrees of freedom
Residual deviance:  10.120  on 6  degrees of freedom
AIC: 40.32

Number of Fisher Scoring iterations: 4

> summary(fit3)

Call:
glm(formula = cbind(Mortality, Exposed - Mortality) ~ Dose, family = binomial(link = cloglog))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8033  -0.5513   0.0309   0.3832   1.2888

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -39.57       3.24   -12.2   <2e-16 ***
Dose           22.04       1.80    12.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.2024  on 7  degrees of freedom
Residual deviance:   3.4464  on 6  degrees of freedom
AIC: 33.64

Number of Fisher Scoring iterations: 4
All of the models have 6 degrees of freedom for the residual deviance. The probit model has a slightly smaller residual deviance than the logit model; the cloglog model shows a large improvement in the residual deviance, making it the best overall model, which agrees with the visual impression.
6 Model Free Curve Fitting
Situation: response variable Y and p explanatory variables X1, ..., Xp with
    Y = f(X1, ..., Xp) + ε
Goal: summarize the relationship between Y and the Xi (i.e. find f̂)
Advantage: data driven method, no modelling assumptions necessary
Used to
• get initial idea about relationship between Y and Xi (particularly useful in noisy data)
• make predictions using interpolation
• extrapolation?
• check fit of parametric model
Example: Diabetes Data  Data is available on 43 children diagnosed with diabetes. Of interest in the study are factors affecting insulin-dependent diabetes.
Variables:
Cpeptide    level of serum C-peptide at diagnosis
age         age in years at diagnosis
> diabetes <- read.table("http://www.public.iastate.edu/~hofmann/stat511/data/cpeptide.txt",
+                        sep="\t", header=T)
> diabetes
   subject  age basedef Cpeptide
1        1  5.2    -8.1      4.8
2        2  8.8   -16.1      4.1
3        3 10.5    -0.9      5.2
...
41      41 13.2    -1.9      4.6
42      42  8.9   -10.0      4.9
43      43 10.8   -13.5      5.1
[Scatterplot: Cpeptide versus age.]
What is the relationship between Age and Cpeptide?
Using a parametric approach, we might fit polynomials of various degrees in age to get an estimate of Cpeptide: from left to right, polynomials of degree 1, 2, and 3 are fitted.
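The fits compared in the anova output further down were presumably obtained along these lines; a minimal sketch (the names fit1, fit2, fit3 for the degree 1, 2, 3 polynomials are assumed, and the formulas for fit2 and fit3 match the anova output below):

fit1 <- lm(Cpeptide ~ age, data = diabetes)                        # degree 1
fit2 <- lm(Cpeptide ~ age + I(age^2), data = diabetes)             # degree 2 (parabola)
fit3 <- lm(Cpeptide ~ age + I(age^2) + I(age^3), data = diabetes)  # degree 3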
[Three scatterplots of Cpeptide versus age with fitted polynomials of degree 1, 2, and 3 (left to right).]
The parabola peaks at around age 10; afterwards the fit indicates a downward trend in Cpeptide, which seems to be dictated by the shape of a parabola rather than by the data. A polynomial of degree 3 shows an increase in Cpeptide after age 10, but the third degree term is not significant when compared with the quadratic model:
> anova(fit2,fit3)
Analysis of Variance Table

Model 1: Cpeptide ~ age + I(age^2)
Model 2: Cpeptide ~ age + I(age^2) + I(age^3)
  Res.Df   RSS Df Sum of Sq    F Pr(>F)
1     40 14.62
2     39 13.85  1      0.77 2.16   0.15
With a parametric fit we are thus caught between statistical theory and the interpretation of the data. To get at the nature of the relationship between age and Cpeptide, we can instead use a smoothing approach.
6.1 Bin Smoother
Idea: partition data into (disjoint &) exhaustive regions with about the same number of cases in each bin.
Let NK (x) be the set of K nearest neighbors of x. There are different ways of computing this set:
1. symmetric nearest neighbors
   NK(x) contains the K nearest cases to the left of x and the K nearest cases to the right.
2. nearest neighbors
   NK(x) contains the K nearest cases to x.
Running Mean: for a value x, an estimate for y is given by the mean (or median) of the response values corresponding to the set of K nearest neighbors NK(x) of x, i.e.
    ŷ = ( 1 / |NK(x)| ) Σ_{xi ∈ NK(x)} yi
Running means are
• easy to compute,
• often not smooth enough,
• prone to flattening out trends near the boundaries.
Example: Diabetes The scatterplot below shows running means for a neighborhood of 11 (red line),
15 (green) and 19 (blue line) points. The positive trend between Age and Cpeptide for low values of
age is only hinted at for a neighborhood of 11 points, but is not visible at all for 15 and 19 points.
[Scatterplot: Cpeptide versus age with running means for neighborhoods of 11 (red), 15 (green), and 19 (blue) points.]
library(gregmisc)
K <- 5
runmean <- running(Cpeptide[order(age)], fun=mean, width=2*K+1, allow.fewer=F, align="center")
runmean
plot(Cpeptide~age,data=diabetes,pch=20)
points(sort(age)[-c(1:K,(44-K):43)],runmean,type="l",col=2)
K <- 7
runmean <- running(Cpeptide[order(age)], fun=mean, width=2*K+1, allow.fewer=F, align="center")
points(sort(age)[-c(1:K,(44-K):43)],runmean,type="l",col=3)
K <- 9
runmean <- running(Cpeptide[order(age)], fun=mean, width=2*K+1, allow.fewer=F, align="center")
points(sort(age)[-c(1:K,(44-K):43)],runmean,type="l",col=4)
Running Lines: fit a line to the points near x; predict the mean response at x to get an estimate for y:
    ŷ_x = b_{0,x} + b_{1,x} x
• larger neighborhoods give smoother curves
• points inside the neighborhood have equal weight → jaggedness
• idea: give more weight to points closer to x
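A minimal sketch of a running-lines smoother using symmetric nearest neighbours (an illustration only, not the implementation used for the plots in these notes):

running_lines <- function(x, y, K) {
  ord <- order(x); xs <- x[ord]; ys <- y[ord]
  sapply(seq_along(xs), function(i) {
    idx <- max(1, i - K):min(length(xs), i + K)  # up to K cases on each side of x_i
    b <- coef(lm(ys[idx] ~ xs[idx]))             # local least squares line
    b[1] + b[2] * xs[i]                          # yhat_x = b_{0,x} + b_{1,x} * x
  })
}
# e.g.: lines(sort(diabetes$age), running_lines(diabetes$age, diabetes$Cpeptide, K = 7))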
6.2 Kernel Smoothers
One of the reasons why the previous smoothers are wiggly is because when we move from xi to xi+1 two
points are usually changed in the group we average. If the two new points are very different then ŷ(xi ) and
ŷ(xi+1 ) may be quite different. One way to try and fix this is by making the transition smoother. That’s the
idea behind kernel smoothers.
Generally speaking, a kernel smoother defines a set of weights {Wi(x)}_{i=1}^n for each x and forms a weighted average of the response values:
    s(x) = Σ_{i=1}^n Wi(x) yi.
In practice, a kernel smoother represents the weight sequence {Wi(x)}_{i=1}^n by describing the shape of the weight function Wi(x) through a density function with a scale parameter that adjusts the size and the form of the weights near x. It is common to refer to this shape function as a kernel K. The kernel is a continuous, bounded, and symmetric real function K which integrates to one,
    ∫ K(u) du = 1.
For a given scale parameter h, the weight sequence is then defined by
    W_{hi}(x) = K( (x − xi)/h ) / Σ_{i=1}^n K( (x − xi)/h ).
Notice: Σ_{i=1}^n W_{hi}(x) = 1.
The kernel smoother is then defined for any x, as before, by
    s(x) = Σ_{i=1}^n W_{hi}(x) Yi.
A natural candidate for K is the standard Gaussian density (computationally this is inconvenient because it is never 0):
    K( (xi − x)/h ) = ( 1 / √(2πh²) ) exp( −(xi − x)² / (2h²) ).
The minimum variance kernel provides an estimator with minimal variance:
    K( (xi − x)/h ) = (3/(8h)) ( 3 − 5 (xi − x)²/h² )   if |xi − x|/h < 1,
                    = 0                                  otherwise.
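A minimal sketch of a Gaussian kernel smoother applied to the diabetes data; the bandwidth h = 2 is an arbitrary choice for illustration, and base R's ksmooth() offers a ready-made alternative:

kernel_smooth <- function(x0, x, y, h) {
  sapply(x0, function(x.) {
    w <- dnorm((x - x.) / h)   # Gaussian kernel weights K((x - x0)/h)
    sum(w * y) / sum(w)        # weighted average s(x0) = sum_i W_hi(x0) y_i
  })
}
grid <- seq(min(diabetes$age), max(diabetes$age), length = 100)
plot(Cpeptide ~ age, data = diabetes, pch = 20)
lines(grid, kernel_smooth(grid, diabetes$age, diabetes$Cpeptide, h = 2), col = 2)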