On Statistical Modelling with Random Effects using Mixture Models

Jochen Einbeck
National University of Ireland, Galway

München, 29 November 2005

Joint work with John Hinde, National University of Ireland, Galway.

Outline

• Generalized linear models with random effects, nonparametric maximum likelihood (NPML) estimation
• Example for NPML estimation: Irish suicide data
• Problems of NPML; methodological improvements
• Outlook: Mode trees

Generalized linear model with random effect

$$\mu_i \equiv E(y_i \mid z_i, \beta) = h(\eta_i) \equiv h(x_i'\beta + z_i),$$

with $y_i \mid z_i$ exponential family distributed. The random effect $z_i$ with distribution $g(\cdot)$ accounts for
• unobserved covariates
• model misspecification
• individual unit variability

Parameter estimation requires maximizing the marginal likelihood

$$L(\beta, g(z)) = \prod_{i=1}^{n} \int f(y_i \mid z_i, \beta)\, g(z_i)\, dz_i.$$

Integral often intractable!

Lots of approaches to solve this problem

Consider
(1) Marginal models → GEE
(2) Conditional models → 3 families:

• Random part fixed, fixed part fixed: JML, CML
• Random part random, fixed part fixed; by random effect distribution:
  – conjugate: gives analytical solution, works only for special combinations of Z and Y|Z
  – normal: classical GLMM literature
  – unspecified: NPML
  – all other: ?
• Random part random, fixed part random: Bayes, MCMC

Overview: Estimating GLMMs with normal random effects

There exist approaches to nearly all possible combinations of the following 3 levels:

• Integration: Laplace approximation, Gauss quadrature, adaptive quadrature, MC integration, ...
• Maximization objective: ML, REML
• Maximization tool: EM, Newton-Raphson, Fisher scoring, quasi-Newton, ...

See also McCulloch & Searle (2001), Reithinger (2003), Skrondal & Rabe-Hesketh (2004).

Nonparametric maximum likelihood (NPML, Aitkin, 1996).

Idea: Approximate random effect Z by a finite discrete distribution: K mass points $\{z_k\}$ with masses $\{\pi_k\}$.

The marginal likelihood can then be approximated by a finite mixture (Laird, 1978)

$$L = \prod_{i=1}^{n} \int f(y_i \mid z_i, \beta)\, g(z_i)\, dz_i \approx \prod_{i=1}^{n} \left\{ \sum_{k=1}^{K} f(y_i \mid z_k, \beta)\, \pi_k \right\}$$

with mass points $z_k$ and masses $\pi_k$.

• No parametric assumption about the random effect distribution $g(\cdot)$.
• Simple simultaneous estimation of $\beta$, $z_k$ and $\pi_k$ via EM algorithm.
• Fitted model is a K component mixture model.
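
The finite-mixture approximation can be evaluated directly once a response distribution is chosen. The following is a minimal sketch, assuming a Poisson response with log link and a single covariate; the function names and all numbers are illustrative and not taken from the talk.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability f(y | mu) for a non-negative integer y."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def npml_likelihood(y, x, beta, mass_points, masses):
    """Finite-mixture approximation of the marginal likelihood:
    L = prod_i sum_k pi_k * f(y_i | z_k, beta), with mu_ik = exp(x_i * beta + z_k)."""
    lik = 1.0
    for yi, xi in zip(y, x):
        lik *= sum(pk * poisson_pmf(yi, math.exp(xi * beta + zk))
                   for zk, pk in zip(mass_points, masses))
    return lik

# Illustrative data and parameter values (assumptions, not from the talk).
y = [0, 2, 1, 4]
x = [0.0, 1.0, 0.0, 1.0]
L_mix = npml_likelihood(y, x, beta=0.5, mass_points=[-0.5, 0.5], masses=[0.4, 0.6])
# With a single mass point at 0 the mixture collapses to an ordinary Poisson GLM likelihood.
L_one = npml_likelihood(y, x, beta=0.5, mass_points=[0.0], masses=[1.0])
```

With K = 1 the sum has one term, which is why `L_one` equals the plain product of Poisson probabilities.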

Estimation

For a fixed number of mass points K, consider the log-likelihood

$$\ell = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k f(y_i \mid z_k, \beta) \right\} \equiv \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k f_{ik} \right\}.$$

The score equations for $\beta$ and $z_k$ turn out to be weighted versions of the single-distribution score equations, with weights

$$w_{ik} = \frac{\pi_k f_{ik}}{\sum_{\ell} \pi_\ell f_{i\ell}} = P(\text{obs. } y_i \text{ comes from comp. } k).$$

The score equation for $\pi_k$ gives the simple solution $\hat\pi_k = \frac{1}{n} \sum_i w_{ik}$.

⟹ can be solved by a standard EM algorithm.
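
For completeness, a short derivation sketch of the weighted score equations and of the solution for the masses. The Lagrange multiplier $\lambda$ (enforcing $\sum_k \pi_k = 1$) and the summation index $l$ are notation introduced here, not on the slide.

```latex
\begin{align*}
\frac{\partial \ell}{\partial \beta}
  &= \sum_{i=1}^{n} \frac{\sum_{k=1}^{K} \pi_k \, \partial f_{ik} / \partial \beta}
                         {\sum_{l=1}^{K} \pi_l f_{il}}
   = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \, \frac{\partial \log f_{ik}}{\partial \beta},
  \qquad w_{ik} = \frac{\pi_k f_{ik}}{\sum_{l} \pi_l f_{il}}, \\
\frac{\partial}{\partial \pi_k} \Big\{ \ell - \lambda \Big( \sum_{l} \pi_l - 1 \Big) \Big\}
  &= \sum_{i=1}^{n} \frac{f_{ik}}{\sum_{l} \pi_l f_{il}} - \lambda = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} w_{ik} = \lambda \pi_k
  \;\Longrightarrow\; \lambda = n, \quad
  \hat\pi_k = \frac{1}{n} \sum_{i=1}^{n} w_{ik}.
\end{align*}
```

The value $\lambda = n$ follows by summing $\sum_i w_{ik} = \lambda \pi_k$ over $k$, since the weights of each observation sum to one.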

EM algorithm for NPML estimation

Starting points: Select starting values $\beta^0$, $z_k^0$, $\pi_k^0$, $k = 1, \ldots, K$.
E-Step: Adjust weights $w_{ik}$ given current parameter estimates.
M-Step: Update parameter estimates by fitting a weighted GLM, with weights $w_{ik}$.
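
The three steps can be sketched for the simplest case. This is a minimal illustration, assuming a normal response with identity link, no covariates and a common standard deviation, so that the weighted GLM fit in the M-step reduces to weighted means; with covariates the M-step would be a weighted regression instead. Function names and data are hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def npml_em(y, z0, pi0, sd0, max_iter=500, eps=1e-10):
    """EM for y_i = z_k + error, error ~ N(0, sd^2), with K mass points z_k.
    Returns mass points, masses, sd and the disparity (-2 log L) from the last E-step."""
    z, pi, sd = list(z0), list(pi0), sd0
    n, K = len(y), len(z)
    disparity = float("inf")
    for _ in range(max_iter):
        # E-step: w_ik = pi_k f_ik / sum_l pi_l f_il
        w, loglik = [], 0.0
        for yi in y:
            joint = [pi[k] * normal_pdf(yi, z[k], sd) for k in range(K)]
            total = sum(joint)
            loglik += math.log(total)
            w.append([j / total for j in joint])
        new_disparity = -2.0 * loglik
        converged = abs(disparity - new_disparity) < eps
        disparity = new_disparity
        if converged:
            break
        # M-step: weighted updates of masses, mass points and common sd
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            pi[k] = wk / n
            z[k] = sum(w[i][k] * y[i] for i in range(n)) / wk
        sd = math.sqrt(sum(w[i][k] * (y[i] - z[k]) ** 2
                           for i in range(n) for k in range(K)) / n)
    return z, pi, sd, disparity

# Toy data with two well-separated groups (illustrative only).
y = [9.8, 10.1, 10.3, 9.9, 20.2, 19.7, 20.1, 20.4, 19.9]
z_hat, pi_hat, sd_hat, disp = npml_em(y, z0=[12.0, 18.0], pi0=[0.5, 0.5], sd0=2.0)
```

The sketch stops when the disparity changes by less than `eps` between cycles; it does not guard against a component receiving zero total weight.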

Example: Irish Suicide rates

• 13 'health regions' (8 health boards + Cork, Dublin, Galway, Limerick, Waterford)
• For each region, we have a total count of suicides over the 10 years, and a corresponding 'crude death rate' out of a population of 100000.
• Explanatory variables: sex, age

Crude rates:

Region(s)       Gender   deaths   population   crude death rate
Cork            Female       45        65925               6.83
Cork            Male        144        61298              23.49
Dublin          Female      127       253118               5.02
Dublin          Male        358       227372              15.75
Galway          Female       10        28805               3.47
Galway          Male         41        25897              15.83
...
SHB \ Cork      Female       97       204327               4.75
SHB \ Cork      Male        413       212499              19.44
WHB \ Galway    Female       56       143648               3.9
WHB \ Galway    Male         29       150303              19.29

[Figure: crude rate out of 100000, axis from 5 to 20.]

• Crude rates: based on small observed counts, very variable
• Overall rate hides differences of interest
Need something in between.
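
The crude rates in the table can be reproduced from deaths and population. The sketch below assumes that the rate is an annual rate per 100000, obtained by dividing the 10-year death count by population times 10 years; this reading is inferred from the figures rather than stated explicitly, and the function name is illustrative.

```python
def crude_rate(deaths, population, years=10, per=100_000):
    """Deaths per `per` persons per year, from a count accumulated over `years`."""
    return deaths / (population * years) * per

# Rows taken from the table above: (region, gender, deaths, population).
rows = [
    ("Cork", "Female", 45, 65925),
    ("Cork", "Male", 144, 61298),
    ("Dublin", "Female", 127, 253118),
    ("Dublin", "Male", 358, 227372),
]
rates = {(region, gender): round(crude_rate(d, m), 2) for region, gender, d, m in rows}
```

A count as small as 45 deaths changes the rate noticeably with every additional case, which is the variability the slide points to.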

Modelling suicide rates

We model the number Y of suicides out of a population of size m by a binomial distribution $Y \sim B(m, p)$, with rate p given by

$$\log\left(\frac{p}{1-p}\right) = \text{'RegionalEffect'} + \beta \cdot \text{sex} + \ldots$$

• Fixed effect models: 'RegionalEffect' $= \sum_r \alpha_r I_r$, with $I_r$ a regional indicator and $\alpha_r$ a parameter for each region. Too many parameters!
• Random effect models: Random effect Z at any appropriate level:
  – observation ⟹ overdispersion
  – region ⟹ regional heterogeneity ("Variance component model", "Two-level model")

Variance component model for regional random effect: NPML estimation

Coefficients:
    sex   MASS1   MASS2   MASS3
  1.432  -8.124  -7.757  -7.548
Mixture proportions:
  MASS1   MASS2   MASS3
 0.0996  0.7128  0.1874
-2 log L: 213.1

The 3 mass points correspond to regions with high, medium, and low suicide rates.

Disparity trend and EM trajectories of mass points $z_k$:
[Figure: two panels against EM iterations (1 to 8): −2logL (axis from 220 to 300) and mass points (axis from −8.0 to −7.4).]

Posterior probabilities $w_{ik}$

Region                  z1     z2     z3
Cork                  0.00   0.00   1.00
Dublin                0.00   1.00   0.00
Galway                0.06   0.92   0.01
Limerick              0.00   0.62   0.38
Waterford             0.23   0.76   0.01
EHB \ Dublin          1.00   0.00   0.00
Mid WHB \ Limerick    0.00   1.00   0.00
Midland HB            0.00   1.00   0.00
NEHB                  0.00   1.00   0.00
NWHB                  0.00   1.00   0.00
SEHB \ Waterford      0.00   0.01   0.99
SHB \ Cork            0.00   0.97   0.03
WHB \ Galway          0.00   1.00   0.00
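
As a plausibility check, the reported mixture proportions should be close to the averages of the posterior probabilities, since $\hat\pi_k = \frac{1}{n}\sum_i w_{ik}$. The sketch below assumes that the average runs over the 13 regions and uses the two-decimal values as read from the table above, so agreement can only be approximate.

```python
# (region, w_i1, w_i2, w_i3) as read from the table of posterior probabilities.
posterior = [
    ("Cork", 0.00, 0.00, 1.00),
    ("Dublin", 0.00, 1.00, 0.00),
    ("Galway", 0.06, 0.92, 0.01),
    ("Limerick", 0.00, 0.62, 0.38),
    ("Waterford", 0.23, 0.76, 0.01),
    ("EHB \\ Dublin", 1.00, 0.00, 0.00),
    ("Mid WHB \\ Limerick", 0.00, 1.00, 0.00),
    ("Midland HB", 0.00, 1.00, 0.00),
    ("NEHB", 0.00, 1.00, 0.00),
    ("NWHB", 0.00, 1.00, 0.00),
    ("SEHB \\ Waterford", 0.00, 0.01, 0.99),
    ("SHB \\ Cork", 0.00, 0.97, 0.03),
    ("WHB \\ Galway", 0.00, 1.00, 0.00),
]
reported = [0.0996, 0.7128, 0.1874]  # mixture proportions from the fitted model

n = len(posterior)
# Column means of the weights: pi_hat_k = (1/n) * sum_i w_ik
pi_hat = [sum(row[k + 1] for row in posterior) / n for k in range(3)]
# Most probable component (1, 2 or 3) for each region
classification = {row[0]: 1 + max(range(3), key=lambda k: row[k + 1]) for row in posterior}
```

The small remaining differences between `pi_hat` and `reported` are of the size expected from rounding the weights to two decimals.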

'Suicide league table' for men

From crude to modelled ("shrunk") rates (empirical Bayes):

Region                Shrunk rate   Crude rate
SEHB \ Waterford            22.08        21.93
Cork CB                     21.78        23.49
Limerick                    19.81        22.23
WHB \ Galway                18.66        19.29
SHB \ Cork                  18.64        19.44
Mid WHB \ Limerick          18.59        18.80
NEHB                        18.42        17.98
NWHB                        18.23        17.83
Midland HB                  17.59        17.02
Galway                      17.11        15.83
Dublin                      16.01        15.75
Waterford                   14.83        12.76
EHB \ Dublin                12.43        12.17

Inclusion of age (and interaction sex/age)

[Figure: Crude rates over regions. One panel per region (Cork CB, Dublin CB, EHB \ Dublin, Galway CB, Limerick CB, Midland HB, Mid WHB \ Limerick, NEHB, NWHB, SEHB \ Waterford, SHB \ Cork, Waterford CB, WHB \ Galway); x axis: age (groups 1 to 16); y axis: crude rates (0.000 to 0.006); separate curves for men and women.]

Modelled ("shrunk") rates over regions

[Figure: One panel per region, same layout as for the crude rates; x axis: age (groups 1 to 16); y axis: EB rates (0.001 to 0.003).]

Summary

• Suicide rates are highest in Cork City and SEHB without Waterford, and lowest in region Dublin.
• The modelled ("shrunk") suicide rates of smaller districts (in particular the cities Cork and Waterford) are more reliable for use in a league table than the crude rates.
• Suicide rates tend to be higher for men than for women, but increase for women and decrease for men with increasing age.
• NPML gives nicely interpretable results (posterior probabilities, ...) beyond the pure parameter estimates.

Example: Galaxy Data

Recession velocities (in km/s) of 82 galaxies.

[Figure: galaxy$velocity, axis from 10000 to 30000.]

Finite Gaussian mixture?

Fitting a finite Gaussian mixture

Log-likelihood for fixed K:

$$\ell = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k f(y_i \mid z_k, \sigma_{(k)}^2) \right\},$$

E.g. K=5, unequal variances:

Coefficients:
 MASS1  MASS2  MASS3  MASS4  MASS5
  9.71  16.13  22.78  19.72  33.04
Mixture proportions:
 0.085  0.024  0.512  0.342  0.037
Standard deviations:
 0.423  0.043  1.721  0.626  0.922
-2 log L: 380.9

EM Trajectories:
[Figure: mass points (axis from 10 to 35) against EM iterations (0 to 80).]
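
The fitted mixture density can be written down from the reported estimates. A minimal sketch, assuming that the k-th coefficient, proportion and standard deviation belong to the same component and that velocities are measured in units of 1000 km/s; the function names are illustrative.

```python
import math

# Estimates reported above for K=5 with unequal variances (velocity/1000).
means = [9.71, 16.13, 22.78, 19.72, 33.04]
props = [0.085, 0.024, 0.512, 0.342, 0.037]
sds = [0.423, 0.043, 1.721, 0.626, 0.922]

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(y):
    """Fitted density sum_k pi_k * f(y | z_k, sigma_k^2)."""
    return sum(p * normal_pdf(y, m, s) for p, m, s in zip(props, means, sds))

# Riemann sum over [0, 45] as a sanity check that the density integrates to about 1.
step = 0.005
area = sum(mixture_pdf(i * step) for i in range(9001)) * step
```

The second component has a very small standard deviation (0.043), so any grid used for plotting or integration needs a correspondingly fine step.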

Problems of current NPML implementations (as in GLIM 4)

• Results depend heavily on the choice of starting points $z_k^0$, usually defined as
$$z_k^0 = \bar y + tol \cdot \hat\sigma \cdot g_k,$$
where tol: scaling parameter, $g_k$: Gauss-Hermite mass points, $\hat\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar y)^2}$.
• Finding the optimal solution requires a tedious grid search for tol.
• The EM trajectories behave quite erratically in the first cycles, and tend to cross.
• The positions of the 'optimal' starting points apparently 'have nothing to do' with the optimal mass points. This makes automatic starting point selection difficult.
• General problem: There does not exist an automatic routine to select K.
"one of the things you do not know is the number of things that you do not know" (Richardson & Green, 1997)
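
The starting-point rule can be illustrated for K = 3. The sketch assumes Gauss-Hermite points on the standard normal scale, for which the 3-point rule has nodes $-\sqrt{3}, 0, \sqrt{3}$; whether GLIM 4 uses exactly this scaling is an assumption here, and the data are made up.

```python
import math

# 3-point Gauss-Hermite nodes for a standard normal weight function (assumed scaling).
GH3_NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))

def starting_points(y, tol, nodes=GH3_NODES):
    """z_k^0 = ybar + tol * sigma_hat * g_k, with sigma_hat using divisor n."""
    n = len(y)
    ybar = sum(y) / n
    sigma_hat = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)
    return [ybar + tol * sigma_hat * g for g in nodes]

y = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative data
points = starting_points(y, tol=0.5)
```

Varying `tol` moves the outer starting points towards or away from the mean, which is the grid search the slide describes as tedious.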

Improvement: Damping the EM algorithm

Shrink the estimated standard deviation $\hat\sigma_{(k)}$ of the mixture components in the j-th cycle by the factor

$$d_j = 1 - (1 - tol)^j, \qquad (0 < tol \le 1)$$

i.e. $d_1 = tol$ and $d_j \to 1$ for $j \to \infty$.

• Damping has main effect in the first cycles.
• Reduces fluctuations and dependence on tol.
• Starting in an 'optimal' solution, the damped version does not escape.
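
A small sketch of the damping factor and of how it would be applied to a current standard deviation estimate. The function names and the example value of the standard deviation are hypothetical.

```python
def damping_factor(j, tol):
    """d_j = 1 - (1 - tol)^j for EM cycle j = 1, 2, ... and 0 < tol <= 1."""
    if not 0.0 < tol <= 1.0:
        raise ValueError("tol must lie in (0, 1]")
    return 1.0 - (1.0 - tol) ** j

def damped_sd(sd_hat, j, tol):
    """Standard deviation estimate shrunk by d_j in cycle j."""
    return damping_factor(j, tol) * sd_hat

factors = [damping_factor(j, tol=0.5) for j in range(1, 6)]
sd_cycle_1 = damped_sd(2.0, j=1, tol=0.5)  # illustrative sd_hat = 2.0
```

With tol = 0.5 the factors are 0.5, 0.75, 0.875, ..., so the shrinkage is strong in the first cycles and fades quickly; tol = 1 switches damping off.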

EM Trajectories for galaxy data (equal variances)

[Figure: mass points (axis from 10 to 30) against EM iterations for K=4, K=5 and K=6; one row of panels undamped, one row damped.]

EM Trajectories for galaxy data (unequal variances)

[Figure: mass points (axis from 10 to 30) against EM iterations for K=4, K=5 and K=6; one row of panels undamped, one row damped.]

Sensitivity to tuning parameter reduced

Disparity (i.e. −2 log L) as a function of tol, damped (- -) and undamped (–):
[Figure: three panels (k=4, k=5, k=6); x axis: tol (0.0 to 1.0); y axis: disparity.]

• Damping gives a significant improvement in stability and performance of NPML estimation.
• Damping is straightforwardly adapted to other exponential families with dispersion parameter, e.g. Gamma distribution (Einbeck & Hinde, 2005, Austrian Journal of Statistics).

Finding optimal starting points

Idea: Consider density estimate

$$\hat f(y, h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{Y_i - y}{h}\right).$$

Carreira-Perpiñán & Williams, 2003: The number of modes is a lower bound for the number of components of a Gaussian mixture. But: Number of modes depends on bandwidth h.

[Figure: two panels against velocity/1000 (10 to 35): estimated density f, and est. mixture components pi_k*f_k.]
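
The dependence of the number of modes on the bandwidth can be seen in a few lines. A minimal sketch, assuming a Gaussian kernel K and locating modes on an equally spaced grid; the toy data stand in for the galaxy velocities and are not the real data set.

```python
import math

def kde(y, data, h):
    """Gaussian kernel density estimate f_hat(y, h) = 1/(n h) * sum_i K((Y_i - y) / h)."""
    n = len(data)
    return sum(math.exp(-0.5 * ((yi - y) / h) ** 2) for yi in data) / (n * h * math.sqrt(2.0 * math.pi))

def modes(data, h, grid_size=400):
    """Grid locations of the local maxima of the density estimate."""
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [kde(g, data, h) for g in grid]
    return [grid[i] for i in range(1, grid_size - 1)
            if dens[i - 1] < dens[i] >= dens[i + 1]]

# Toy data with three groups (illustrative only).
data = [9.5, 10.0, 10.5, 19.0, 19.5, 20.0, 20.5, 21.0, 33.0]
modes_small_h = modes(data, h=1.0)
modes_large_h = modes(data, h=10.0)
```

The mode locations found at a suitable bandwidth are natural candidates for starting values of the mass points.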

The mode tree (Minnotte & Scott, 1993)

[Figure: mode tree for the galaxy data; x axis: $y_i$ (velocity/1000), 10 to 35; the other axis is the bandwidth h. Annotations along h:]
←− fixed effect model (K = 1)
..
←− random effect models (1 < K < 82)
..
saturated model (K = 82)

"zoom into the random effect distribution"

More generally, applied on the 'residuals' $h^{-1}(y_i) - x_i'\hat\beta$ of a GLM.

Examples for bandwidth selection

Bandwidth selectors:
• BCV (1 mode)
• Silverman (3 modes)
• AIC (22 modes)

[Figure: mode tree against velocity/1000 (10 to 35), bandwidth h from 0.1 to 5.]

Bandwidth selection in 2 steps

• Calculate Silverman's optimal bandwidth
$$h_{opt} = 0.9\, A\, n^{-1/5}, \quad \text{where } A = \min\{\hat\sigma, \mathrm{IQR}/1.34\}.$$
• From that bandwidth, climb down the mode tree until the next critical bandwidth
$$h_{crit} = \inf\{h : \hat f(\cdot, h) \text{ has at most } k \text{ modes}\} \quad \text{(Silverman, 1981)}$$
is reached.
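
Both steps can be sketched in plain Python. The code assumes a Gaussian kernel, the standard exponent −1/5 of Silverman's rule of thumb, the divisor-n standard deviation, and quartiles as computed by `statistics.quantiles`; the critical bandwidth is found by bisection, which relies on the number of modes not increasing with h for the Gaussian kernel. Function names and data are illustrative.

```python
import math
import statistics

def silverman_bandwidth(data):
    """h_opt = 0.9 * A * n^(-1/5), A = min(sigma_hat, IQR / 1.34)."""
    n = len(data)
    sigma_hat = statistics.pstdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * min(sigma_hat, (q3 - q1) / 1.34) * n ** (-1.0 / 5.0)

def count_modes(data, h, grid_size=2000):
    """Number of local maxima of the Gaussian kernel density estimate on a grid."""
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [sum(math.exp(-0.5 * ((yi - g) / h) ** 2) for yi in data) for g in grid]
    return sum(1 for i in range(1, grid_size - 1) if dens[i - 1] < dens[i] >= dens[i + 1])

def critical_bandwidth(data, k, n_bisect=30):
    """Approximate h_crit = inf{h : at most k modes} by bisection.
    Assumes more than k modes for very small h and at most k at h = data range."""
    h_hi = max(data) - min(data)
    h_lo = 1e-3 * h_hi
    for _ in range(n_bisect):
        mid = 0.5 * (h_lo + h_hi)
        if count_modes(data, mid) <= k:
            h_hi = mid
        else:
            h_lo = mid
    return h_hi

# Toy data with three groups (illustrative only).
data = [9.5, 10.0, 10.5, 19.0, 19.5, 20.0, 20.5, 21.0, 33.0]
h_opt = silverman_bandwidth(data)
h_crit_3 = critical_bandwidth(data, k=3)
h_crit_1 = critical_bandwidth(data, k=1)
```

The normalising constant of the density is omitted in `count_modes` because it does not affect where the maxima lie.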

Climbing down the tree

Using $h_{crit}$, the mode tree gives
• an estimate for the number of modes, and thus for the number of components
• a very accurate estimate for the location of the mass points, which then can be used as starting points for the EM algorithm. In many cases, these starting points are so accurate that one hardly needs EM at all!

[Figure: mode trees (bandwidth h from 0.1 to 5 against z_k) next to the corresponding EM trajectories (z_k against iter).]

General R Package {npml} (under construction) at
www.nuigalway.ie/maths/je/npml.html
in joint work with J. Hinde, based on initial work by R. Darnell.

• Normal, Binomial, Poisson, Gamma distributed response
• NPML and Gaussian Quadrature
• Variance component models
• Random coefficient models

References

AITKIN, M. (1996): A general maximum likelihood analysis of overdispersion in generalized linear models. Statistics and Computing 6, 251–262.
AITKIN, M., FRANCIS, B. and HINDE, J. (2005): Statistical Modelling in GLIM 4 (Second edition). Oxford, UK.
CARREIRA-PERPIÑAN, M. A. and WILLIAMS, C. K. I. (2003): On the number of modes of a Gaussian mixture. Lecture Notes in Computer Science, 2695, 625–640.
EINBECK, J. and HINDE, J. (2005): A note on NPML estimation for exponential family regression models with unspecified dispersion parameter. Austrian Journal of Statistics, to appear.
HINDE, J. (1982): Compound Poisson regression models. Lecture Notes in Statistics 14, 109–121.
LAIRD, N. M. (1978): Nonparametric maximum likelihood estimation of a mixing distribution. JASA, 73, 805–811.
McCULLOCH, C. E. and SEARLE, R. (2001): Generalized, linear, and mixed models. New York: Wiley.
MINNOTTE, M. C. and SCOTT, D. W. (1993): The mode tree: A tool for visualization of nonparametric density features. JCGS, 2, 51–68.
REITHINGER, F. (2003): Generalized linear models with random effects and smooth components. Diplomarbeit, LMU München.
RICHARDSON, S. and GREEN, P. (1997): On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion). JRSSB, 59, 731–792.
SILVERMAN, B. W. (1981): Using kernel density estimates to investigate multimodality. JRSSB, 43, 97–99.
SKRONDAL, A. and RABE-HESKETH, S. (2004): Generalized latent variable modelling. Boca Raton: Chapman & Hall/CRC.