18.650 – Fundamentals of Statistics
7. Generalized linear models
Linear model

A linear model assumes

    Y | X = x ∼ N(µ(x), σ²I),

and

    IE(Y | X = x) = µ(x) = x⊤β.

Throughout we drop the boldface notation for vectors.
Components of a linear model

The two model components (that we are going to relax) are
1. Random component: the response variable Y is continuous and Y | X = x is Gaussian with mean µ(x).
2. Regression function: µ(x) = x⊤β.
Kyphosis

The Kyphosis data consist of measurements on 81 children following corrective spinal surgery. The binary response variable, Y, indicates the presence or absence of a postoperative deformity. The three covariates are:
- X^(1): age of the child in months,
- X^(2): number of vertebrae involved in the operation, and
- X^(3): start of the range of the vertebrae involved.
Write X = (1, X^(1), X^(2), X^(3))⊤ ∈ IR⁴.
Kyphosis

- The response variable is binary so there is no choice: Y | X = x is Bernoulli with expected value µ(x) = IE[Y | X = x] ∈ (0, 1).
- We cannot write µ(x) = x⊤β, because the right-hand side ranges through all of IR.
- We need an invertible function f such that f(x⊤β) ∈ (0, 1).
Generalization

A generalized linear model (GLM) generalizes normal linear regression models in the following directions.
1. Random component:
       Y | X = x ∼ some distribution (e.g. Bernoulli, exponential, Poisson).
2. Regression function:
       g(µ(x)) = x⊤β,
   where g is called the link function and µ(x) = IE(Y | X = x) is the regression function.
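In code, the two relaxed components translate directly. Here is a minimal simulation sketch in Python (numpy assumed; n, beta and the covariate design are illustrative) for a Bernoulli response with g⁻¹ the inverse logit:

    import numpy as np

    rng = np.random.default_rng(0)

    n, beta = 100, np.array([-1.0, 2.0])                   # illustrative coefficients
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate

    eta = X @ beta                  # linear predictor x'beta
    mu = 1 / (1 + np.exp(-eta))     # regression function: g^{-1}(x'beta) in (0, 1)
    Y = rng.binomial(1, mu)         # random component: Y | X = x ~ Bernoulli(mu(x))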
Predator/Prey

Consider the following model for the number of prey Y that a predator (hawk) catches per day, given the number X of prey (mice) in its hunting territory.

Random component: Y > 0 and the variance of the capture rate is known to be approximately equal to its expectation, so we propose the following model:

    Y | X = x ∼ Poisson(µ(x)),

where µ(x) = IE[Y | X = x].

Regression function: We assume

    µ(x) = mx / (h + x),

for some unknown m, h > 0, where:
- m is the maximum expected number of daily prey the predator can cope with,
- h is the number of prey such that µ(h) = m/2.
The regression function µ(x) for m = h = 10

[Figure: plot of µ(x) = mx/(h + x) for m = h = 10.]
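The figure is easy to reproduce; a short matplotlib sketch (assumed available) of this regression function:

    import numpy as np
    import matplotlib.pyplot as plt

    m = h = 10
    x = np.linspace(0, 100, 400)
    mu = m * x / (h + x)            # saturates at m; mu(h) = m / 2

    plt.plot(x, mu)
    plt.axhline(m, linestyle="--")  # asymptote: max expected daily prey
    plt.xlabel("x (prey available)")
    plt.ylabel("mu(x)")
    plt.show()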
Example 2: Prey Capture Rate

Obviously µ(x) is not linear, but using the reciprocal link g(x) = 1/x, the right-hand side can be made linear in the parameters:

    g(µ(x)) = 1/µ(x) = (h + x)/(mx) = 1/m + (h/m)·(1/x) = β₀ + β₁·(1/x),

with β₀ = 1/m and β₁ = h/m.
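A quick numeric sanity check of this linearization, as a Python sketch (numpy assumed; the values of m, h, x are illustrative):

    import numpy as np

    m, h = 10.0, 10.0                # the slide's example values
    x = np.linspace(1.0, 50.0, 5)

    mu = m * x / (h + x)             # the regression function
    lhs = 1 / mu                     # reciprocal link applied to mu(x)
    rhs = 1/m + (h/m) * (1/x)        # beta0 + beta1 * (1/x)

    assert np.allclose(lhs, rhs)     # linear in 1/x, as claimed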
Exponential Family

A family of distributions {IPθ : θ ∈ Θ}, Θ ⊂ IR^k, is said to be a k-parameter exponential family on IR^q if there exist real-valued functions:
- η₁, η₂, · · · , ηₖ and B of θ,
- T₁, T₂, · · · , Tₖ, and h of y ∈ IR^q,
such that the density function (pmf or pdf) of IPθ can be written as

    fθ(y) = exp[ Σᵢ₌₁ᵏ ηᵢ(θ)Tᵢ(y) − B(θ) ] h(y).
Normal distribution example

- Consider Y ∼ N(µ, σ²), θ = (µ, σ²). The density is

      fθ(y) = exp( (µ/σ²)y − (1/(2σ²))y² − µ²/(2σ²) ) · 1/(σ√(2π)),

  which forms a two-parameter exponential family with

      η₁ = µ/σ², η₂ = −1/(2σ²), T₁(y) = y, T₂(y) = y²,
      B(θ) = µ²/(2σ²) + log(σ√(2π)), h(y) = 1.

- When σ² is known, it becomes a one-parameter exponential family on IR:

      η = µ/σ², T(y) = y, B(θ) = µ²/(2σ²), h(y) = e^(−y²/(2σ²)) / (σ√(2π)).
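As a sanity check, the exponential-family form can be compared numerically against the usual normal pdf; a short Python sketch (numpy and scipy assumed; µ, σ and the grid are illustrative):

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 1.5, 2.0
    y = np.linspace(-5, 5, 7)

    # exponential-family form: exp(eta1*T1(y) + eta2*T2(y) - B), with h(y) = 1
    eta1, eta2 = mu / sigma**2, -1 / (2 * sigma**2)
    B = mu**2 / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi))
    f = np.exp(eta1 * y + eta2 * y**2 - B)

    assert np.allclose(f, norm.pdf(y, loc=mu, scale=sigma))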
Examples of discrete distributions

The following distributions form discrete exponential families of distributions with pmf:
- Bernoulli(p): p^y (1 − p)^(1−y), y ∈ {0, 1};
- Poisson(λ): (λ^y / y!) e^(−λ), y = 0, 1, . . . .
Examples of continuous distributions

The following distributions form continuous exponential families of distributions with pdf:
- Gamma(a, b): y^(a−1) e^(−y/b) / (Γ(a) bᵃ);
  - above, a: shape parameter, b: scale parameter;
  - reparametrize with µ = ab, the mean parameter:
        (1/Γ(a)) (a/µ)ᵃ y^(a−1) e^(−ay/µ);
- Inverse Gamma(α, β): (β^α / Γ(α)) y^(−α−1) e^(−β/y);
- Inverse Gaussian(µ, σ²): √(σ²/(2πy³)) e^(−σ²(y−µ)²/(2µ²y)).

Others: Chi-square, Beta, Binomial, Negative Binomial distributions.
One-parameter canonical exponential family

- Canonical exponential family for k = 1, y ∈ IR:

      fθ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )

  for some known functions b(·) and c(·, ·).
- If φ is known, this is a one-parameter exponential family with θ being the canonical parameter.
- If φ is unknown, this may/may not be a two-parameter exponential family.
- φ is called the dispersion parameter.
- In this class, we always assume that φ is known.
Normal distribution example

- Consider the following Normal density function with known variance σ²:

      fθ(y) = (1/(σ√(2π))) e^(−(y−µ)²/(2σ²))
            = exp{ (yµ − µ²/2)/σ² − (1/2)(y²/σ² + log(2πσ²)) }.

- Therefore θ = µ, φ = σ²,

      b(θ) = θ²/2, and c(y, φ) = −(1/2)(y²/φ + log(2πφ)).
Other distributions

Table 1: Exponential Family

             Notation    Range of y       φ     b(θ)            c(y, φ)
Normal       N(µ, σ²)    (−∞, ∞)          σ²    θ²/2            −(1/2)(y²/φ + log(2πφ))
Poisson      P(µ)        {0, 1, 2, . . .} 1     e^θ             −log(y!)
Bernoulli    B(p)        {0, 1}           1     log(1 + e^θ)    0
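To connect the table back to the canonical form, a quick Python check (numpy assumed; p is illustrative) that the Bernoulli row reproduces the pmf p^y (1 − p)^(1−y):

    import numpy as np

    p = 0.3
    theta = np.log(p / (1 - p))            # canonical parameter
    b = np.log(1 + np.exp(theta))          # b(theta) from the table; phi = 1, c = 0

    for y in (0, 1):
        f = np.exp(y * theta - b)          # exp((y*theta - b(theta))/phi + c(y, phi))
        print(f, p**y * (1 - p)**(1 - y))  # the two columns agree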
Likelihood

Let ℓ(θ) = log fθ(Y) denote the log-likelihood function. The mean IE(Y) and the variance var(Y) can be derived from the following identities:
- First identity:
      IE(∂ℓ/∂θ) = 0.
- Second identity:
      IE(∂²ℓ/∂θ²) + IE((∂ℓ/∂θ)²) = 0.
Expected value

Note that
    ℓ(θ) = (Yθ − b(θ))/φ + c(Y; φ).
Therefore
    ∂ℓ/∂θ = (Y − b′(θ))/φ.
It yields
    0 = IE(∂ℓ/∂θ) = (IE(Y) − b′(θ))/φ,
which leads to
    IE(Y) = b′(θ).
Variance

On the other hand we have
    ∂²ℓ/∂θ² + (∂ℓ/∂θ)² = −b″(θ)/φ + ((Y − b′(θ))/φ)²,
and from the previous result,
    (Y − b′(θ))/φ = (Y − IE(Y))/φ.
Together with the second identity, this yields
    0 = −b″(θ)/φ + var(Y)/φ²,
which leads to
    var(Y) = b″(θ)φ.
Example: Poisson distribution

Example: Consider a Poisson likelihood,
    f(y) = (µ^y / y!) e^(−µ) = exp( y log µ − µ − log(y!) ).
Thus,
    θ = log µ, b(θ) = e^θ, φ = 1, c(y, φ) = −log(y!).
So
    µ = e^θ, b′(θ) = e^θ = µ, b″(θ) = e^θ = µ,
and indeed IE(Y) = var(Y) = µ.
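These identities are easy to check by simulation; a Python sketch (numpy assumed; µ and the sample size are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    mu = 4.0
    theta, phi = np.log(mu), 1.0           # canonical and dispersion parameters
    Y = rng.poisson(mu, size=200_000)

    # b(theta) = e^theta, so b'(theta) = b''(theta) = e^theta = mu
    print(Y.mean(), np.exp(theta))         # ~ IE(Y) = b'(theta)
    print(Y.var(), np.exp(theta) * phi)    # ~ var(Y) = b''(theta) * phi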
Link function

- β is the parameter of interest, and needs to appear somehow in the likelihood function to use maximum likelihood.
- A link function g relates the linear predictor X⊤β to the mean parameter µ:
      X⊤β = g(µ).
- g is required to be monotone increasing and differentiable, so that
      µ = g⁻¹(X⊤β).
Examples of link functions

- For LM, g(·) = identity.
- Poisson data. Suppose Y | X ∼ Poisson(µ(X)).
  - µ(X) > 0;
  - log(µ(X)) = X⊤β;
  - In general, a link function for count data should map (0, +∞) to IR.
  - The log link is a natural one.
- Bernoulli/Binomial data.
  - 0 < µ < 1;
  - g should map (0, 1) to IR;
  - 3 choices:
    1. logit: log( µ(X)/(1 − µ(X)) ) = X⊤β;
    2. probit: Φ⁻¹(µ(X)) = X⊤β, where Φ(·) is the standard normal cdf;
    3. complementary log-log: log(−log(1 − µ(X))) = X⊤β.
  - The logit link is the natural choice.
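The three candidate links can be written down directly; a Python sketch (numpy and scipy assumed; the function names are illustrative):

    import numpy as np
    from scipy.stats import norm

    def logit(mu):
        return np.log(mu / (1 - mu))

    def probit(mu):
        return norm.ppf(mu)              # inverse of the standard normal cdf

    def cloglog(mu):
        return np.log(-np.log(1 - mu))

    mu = np.array([0.1, 0.5, 0.9])
    for g in (logit, probit, cloglog):   # each maps (0, 1) onto IR
        print(g.__name__, g(mu))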
Examples of link functions for Bernoulli response

[Figure: the two links plotted on (0, 1), with values on the vertical axis from −5 to 5.
 In blue: g₁(x) = f₁⁻¹(x) = log( x/(1 − x) ) (logit link).
 In red: g₂(x) = f₂⁻¹(x) = Φ⁻¹(x) (probit link).]
Examples of link functions for Bernoulli response

[Figure: the two inverse links plotted on (−5, 5), with values in (0, 1).
 In blue: f₁(x) = e^x / (1 + e^x).
 In red: f₂(x) = Φ(x) (Gaussian CDF).]
Canonical Link

- The function g that links the mean µ to the canonical parameter θ is called the canonical link:
      g(µ) = θ.
- Since µ = b′(θ), the canonical link is given by
      g(µ) = (b′)⁻¹(µ).
- If φ > 0, the canonical link function is strictly increasing. Why?
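Anticipating the next slide, the canonical link can also be recovered numerically by inverting b′; a Python sketch for the Bernoulli case (scipy assumed; canonical_link is an illustrative name):

    import numpy as np
    from scipy.optimize import brentq

    def b_prime(theta):                  # b(theta) = log(1 + e^theta) for Bernoulli
        return np.exp(theta) / (1 + np.exp(theta))

    def canonical_link(mu):              # g(mu) = (b')^{-1}(mu), found by root-finding
        return brentq(lambda t: b_prime(t) - mu, -50, 50)

    for mu in (0.1, 0.5, 0.9):
        print(canonical_link(mu), np.log(mu / (1 - mu)))  # matches the logit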
Example: the Bernoulli distribution

- We can check that
      b(θ) = log(1 + e^θ).
- Hence we solve
      b′(θ) = exp(θ)/(1 + exp(θ)) = µ  ⟺  θ = log( µ/(1 − µ) ).
- The canonical link for the Bernoulli distribution is the logit link.
Other examples

            b(θ)            g(µ)
Normal      θ²/2            µ
Poisson     exp(θ)          log µ
Bernoulli   log(1 + e^θ)    log( µ/(1 − µ) )
Gamma       −log(−θ)        −1/µ
Model and notation

- Let (Xi, Yi) ∈ IR^p × IR, i = 1, . . . , n, be independent random pairs such that the conditional distribution of Yi given Xi = xi has density in the canonical exponential family:
      fθi(yi) = exp{ (yi θi − b(θi))/φ + c(yi, φ) }.
- Y = (Y1, . . . , Yn)⊤, X = (X1, . . . , Xn)⊤.
- Here the mean µi = IE[Yi | Xi] is related to the canonical parameter θi via
      µi = b′(θi),
- and µi depends linearly on the covariates through a link function g:
      g(µi) = Xi⊤β.
Back to β

- Given a link function g, note the following relationship between β and θ:
      θi = (b′)⁻¹(µi) = (b′)⁻¹( g⁻¹(Xi⊤β) ) ≡ h(Xi⊤β),
  where h is defined as
      h = (b′)⁻¹ ∘ g⁻¹ = (g ∘ b′)⁻¹.
- Remark: if g is the canonical link function, h is the identity.
Log-likelihood

- The log-likelihood is given by
      ℓn(Y, X, β) = Σᵢ (Yi θi − b(θi))/φ = Σᵢ (Yi h(Xi⊤β) − b(h(Xi⊤β)))/φ,
  up to a constant term.
- Note that when we use the canonical link function, we obtain the simpler expression
      ℓn(Y, X, β) = Σᵢ (Yi Xi⊤β − b(Xi⊤β))/φ.
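For concreteness, here is a Python sketch of this expression and its gradient in the Bernoulli case, where b(θ) = log(1 + e^θ) and φ = 1 (numpy assumed; the function names are illustrative):

    import numpy as np

    def log_likelihood(beta, X, Y):
        # canonical link: sum_i Y_i * x_i'beta - b(x_i'beta)
        eta = X @ beta
        return np.sum(Y * eta - np.log1p(np.exp(eta)))

    def gradient(beta, X, Y):
        # sum_i (Y_i - b'(x_i'beta)) x_i, with b' the inverse logit
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        return X.T @ (Y - mu)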
Strict concavity

- The log-likelihood ℓn is strictly concave under the canonical link when φ > 0. Why?
- As a consequence, the maximum likelihood estimator is unique.
- On the other hand, if another parameterization is used, the likelihood function may not be strictly concave, leading to several local maxima.
Concluding remarks

- Maximum likelihood for Bernoulli Y and the logit link is called logistic regression.
- In general, there is no closed form for the MLE and we have to use iterative methods, e.g. Newton-Raphson (as sketched below).
- The asymptotic normality of the MLE also applies to GLMs.
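As an illustration of such an iterative method, here is a minimal Newton-Raphson sketch for logistic regression in Python (numpy assumed; fit_logistic is an illustrative name, not a library routine):

    import numpy as np

    def fit_logistic(X, Y, n_iter=25):
        # Newton-Raphson for the logistic-regression MLE (canonical link, phi = 1)
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            mu = 1 / (1 + np.exp(-(X @ beta)))   # b'(X beta): fitted means
            W = mu * (1 - mu)                    # b''(X beta): Newton weights
            grad = X.T @ (Y - mu)                # gradient of the log-likelihood
            hess = X.T @ (X * W[:, None])        # negative Hessian (Fisher information)
            beta += np.linalg.solve(hess, grad)  # Newton step
        return beta

Starting from β = 0, a few iterations typically suffice; with data simulated as in the earlier Generalization sketch, fit_logistic(X, Y) should land near the true β.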