On Statistical Modelling with Random Effects using Mixture Models

Jochen Einbeck
National University of Ireland, Galway
Munich, 29 November 2005

Joint work with John Hinde, National University of Ireland, Galway.
Work supported by [funder logo]


Outline

• Generalized linear models with random effects, nonparametric maximum likelihood (NPML) estimation
• Example for NPML estimation: Irish suicide data
• Problems of NPML; methodological improvements
• Outlook: Mode trees


Generalized linear model with random effect

$$\mu_i \equiv E(y_i \mid z_i, \beta) = h(\eta_i) \equiv h(x_i'\beta + z_i),$$

with $y_i \mid z_i$ exponential family distributed. The random effect $z_i$ with distribution $g(\cdot)$ accounts for

• unobserved covariates
• model misspecification
• individual or unit variability

Parameter estimation requires maximizing the marginal likelihood

$$L(\beta, g(z)) = \prod_{i=1}^{n} \int f(y_i \mid z_i, \beta)\, g(z_i)\, dz_i.$$

Integral often intractable!


Lots of approaches to solve this problem

Consider
(1) Marginal models → GEE
(2) Conditional models → 3 families:

    Random part   Fixed part   Approaches
    fixed         fixed        JML, CML
    random        random       Bayes, MCMC
    random        fixed        all other

Random effect distribution:

    conjugate     gives analytical solution, works only for special combinations of Z and Y|Z
    normal        classical GLMM literature
    unspecified   NPML


Overview: Estimating GLMMs with normal random effects

There exist approaches to nearly all possible combinations of the following 3 levels:

    Integration        Maximization Objective   Maximization Tool
    Laplace approx.    ML                       EM
    Gauss quadr.       REML                     Newton-Raphson
    Adaptive quadr.    ...                      Fisher scoring
    MC integr.                                  Quasi-Newton
    ...                                         ...

See also McCulloch & Searle (2001), Reithinger (2003), Skrondal & Rabe-Hesketh (2004).


Nonparametric maximum likelihood (NPML, Aitkin, 1996)

Idea: Approximate the distribution of the random effect Z by a finite discrete distribution: K mass points $\{z_k\}$ with masses $\{\pi_k\}$.

The marginal likelihood can then be approximated by a finite mixture (Laird, 1978)

$$L = \prod_{i=1}^{n} \int f(y_i \mid z_i, \beta)\, g(z_i)\, dz_i \approx \prod_{i=1}^{n} \left\{ \sum_{k=1}^{K} f(y_i \mid z_k, \beta)\, \pi_k \right\}$$

with mass points $z_k$ and masses $\pi_k$.

• No parametric assumption about the random effect distribution $g(\cdot)$.
• Simple simultaneous estimation of $\beta$, $z_k$ and $\pi_k$ via EM algorithm.
• Fitted model is a K component mixture model.


Estimation

For a fixed number of mass points K, consider the log-likelihood

$$\ell = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k f(y_i \mid z_k, \beta) \right\} \equiv \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k f_{ik} \right\}.$$

The score equations for $\beta$ and $z_k$ turn out to be weighted versions of the single-distribution score equations, with weights

$$w_{ik} = \frac{\pi_k f_{ik}}{\sum_{\ell} \pi_\ell f_{i\ell}} = P(\text{obs. } y_i \text{ comes from comp. } k).$$

The score equation for $\pi_k$ gives the simple solution $\hat\pi_k = \frac{1}{n} \sum_i w_{ik}$.

⟹ can be solved by a standard EM algorithm.


EM algorithm for NPML estimation

Starting points   Select starting values $\beta^0$, $z_k^0$, $\pi_k^0$, $k = 1, \ldots, K$.
E-Step            Adjust weights $w_{ik}$ given current parameter estimates.
M-Step            Update parameter estimates fitting a weighted GLM, with weights $w_{ik}$.


Example: Irish Suicide rates

• 13 'health regions' (8 health boards + Cork, Dublin, Galway, Limerick, Waterford)
• For each region, we have a total count of suicides over the 10 years, and a corresponding 'crude death rate' out of a population of 100000.
• Explanatory variables: sex, age

    Region(s)      Gender   deaths   population   crude death rate
    Cork           Female       45        65925         6.83
    Cork           Male        144        61298        23.49
    Dublin         Female      127       253118         5.02
    Dublin         Male        358       227372        15.75
    Galway         Female       10        28805         3.47
    Galway         Male         41        25897        15.83
    ...
    SHB \ Cork     Female       97       204327         4.75
    SHB \ Cork     Male        413       212499        19.44
    WHB \ Galway   Female       56       143648         3.9
    WHB \ Galway   Male        290       150303        19.29

(A \ B denotes health board A without city B.)

Crude rates:
[Figure: crude rate out of 100000, axis from 5 to 20]

• Crude rates: based on small observed counts, very variable
• Overall rate hides differences of interest

need something in between


Modelling suicide rates

We model the number Y of suicides out of a population of size m by a binomial distribution $Y \sim B(m, p)$, with rate p given by

$$\log\left(\frac{p}{1-p}\right) = \text{'RegionalEffect'} + \beta \cdot \text{sex} + \ldots$$

• Fixed effect models: 'RegionalEffect' $= \sum_r \alpha_r I_r$, with $I_r$ regional indicator, $\alpha_r$: parameter for each region. Too many parameters!
• Random effect models: Random effect Z at any appropriate level:
    – observation ⟹ overdispersion
    – region ⟹ regional heterogeneity ("Variance component model", "Two-level model")


NPML estimation

Variance component model for regional random effect:

    Coefficients:
      sex      MASS1     MASS2     MASS3
      1.432    -8.124    -7.757    -7.548

    Mixture proportions:
      MASS1     MASS2     MASS3
      0.0996    0.7128    0.1874

    -2 log L: 213.1

The 3 mass points correspond to regions with high, medium, and low suicide rates.

Disparity trend and EM trajectories of mass points $z_k$:
[Figure: -2 log L (220 to 300) and mass points (-8.0 to -7.4) against EM iterations 1 to 8]


Posterior probabilities $w_{ik}$

    Region               z1      z2      z3
    Cork                 0.00    0.00    1.00
    Dublin               0.00    1.00    0.00
    Galway               0.06    0.92    0.01
    Limerick             0.00    0.62    0.38
    Waterford            0.23    0.76    0.01
    EHB \ Dublin         1.00    0.00    0.00
    Mid WHB \ Limerick   0.00    1.00    0.00
    Midland HB           0.00    1.00    0.00
    NEHB                 0.00    1.00    0.00
    NWHB                 0.00    1.00    0.00
    SEHB \ Waterford     0.00    0.01    0.99
    SHB \ Cork           0.00    0.97    0.03
    WHB \ Galway         0.00    1.00    0.00
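The E-step weights $w_{ik}$, the update $\hat\pi_k = \frac{1}{n}\sum_i w_{ik}$ and the M-step from the estimation slides can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the GLIM 4 or R implementation. Covariates are left out, so the weighted GLM of the M-step reduces to a weighted proportion per mass point. The function name `npml_binomial`, the equally spaced starting rule and the toy counts are hypothetical.

```python
import math


def npml_binomial(y, m, K=2, tol=0.5, max_iter=500, eps=1e-10):
    """EM for y_i ~ Binomial(m_i, p_k) with logit(p_k) = z_k (no covariates)."""
    if K < 2:
        raise ValueError("need at least two mass points")
    n = len(y)
    p0 = sum(y) / sum(m)
    eta0 = math.log(p0 / (1.0 - p0))
    # Hypothetical starting rule: K equally spaced points in [eta0 - tol, eta0 + tol].
    z = [eta0 + tol * (2.0 * k / (K - 1) - 1.0) for k in range(K)]
    pi = [1.0 / K] * K
    history = []
    w = []
    for _ in range(max_iter):
        # E-step: w_ik = pi_k f_ik / sum_l pi_l f_il, computed on the log scale.
        w, loglik = [], 0.0
        for yi, mi in zip(y, m):
            a = []
            for zk, pk in zip(z, pi):
                log_p = -math.log1p(math.exp(-zk))  # log p_k
                log_q = -math.log1p(math.exp(zk))   # log(1 - p_k)
                log_pi = math.log(pk) if pk > 0.0 else -math.inf
                a.append(log_pi + yi * log_p + (mi - yi) * log_q)
            amax = max(a)
            lse = amax + math.log(sum(math.exp(v - amax) for v in a))
            w.append([math.exp(v - lse) for v in a])
            loglik += lse  # binomial coefficients omitted (constant in the parameters)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < eps:
            break
        # M-step: pi_k is the mean weight; z_k is the logit of the weighted proportion
        # (the closed form of the weighted GLM when there are no covariates).
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            num = sum(w[i][k] * y[i] for i in range(n))
            den = sum(w[i][k] * m[i] for i in range(n))
            pi[k] = wk / n
            if 0.0 < num < den:
                z[k] = math.log(num / (den - num))
    return {"z": z, "pi": pi, "w": w, "loglik": history}


# Made-up counts out of 10000 with two clearly separated rate levels.
fit = npml_binomial([25, 25, 30, 474, 474, 480], [10000] * 6, K=2)
print(fit["z"], fit["pi"])
```

With covariates $x_i'\beta$ in the linear predictor, the closed-form M-step above would have to be replaced by a weighted logistic regression on the expanded data, as stated on the slide.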
From crude to modelled ("shrunk") rates

crude rates ⟶ (Emp. Bayes) ⟶ modelled ("shrunk") rates


'Suicide league table' for men

    Region               Crude Rates   Shrunk Rates
    SEHB \ Waterford        21.93         22.08
    Cork CB                 23.49         21.78
    Limerick                22.23         19.81
    WHB \ Galway            19.29         18.66
    SHB \ Cork              19.44         18.64
    Mid WHB \ Limerick      18.80         18.59
    NEHB                    17.98         18.42
    NWHB                    17.83         18.23
    Midland HB              17.02         17.59
    Galway                  15.83         17.11
    Dublin                  15.75         16.01
    Waterford               12.76         14.83
    EHB \ Dublin            12.17         12.43


Inclusion of age (and interaction sex/age)

[Figure: Crude rates over regions. One panel per region (Cork CB, Dublin CB, EHB \ Dub., Gal. CB, Lim. CB, Mid HB, MWHB \ Lim., NEHB, NWHB, SEHB \ Wat., SHB \ Cork, Wat. CB, WHB \ Gal.); crude rates (0.000 to 0.006) against age group (1 to 16), separate symbols for men and women.]

[Figure: Modelled ("shrunk") rates over regions. Same panels; EB rates (0.001 to 0.003) against age group (1 to 16).]


Summary

• Suicide rates are highest in City Cork and SEHB without Waterford, and lowest in region Dublin.
• The modelled ("shrunk") suicide rates of smaller districts (in particular cities Cork, Waterford) are more reliable for the use in a league table than the crude rates.
• Suicide rates tend to be bigger for men than for women, but increase for women and decrease for men with increasing age.
• NPML gives nicely interpretable results (posterior probabilities, ...) beyond the pure parameter estimates.


Example: Galaxy Data

Recession velocities (in km/s) of 82 galaxies.

[Figure: galaxy$velocity, axis from 10000 to 30000]

Finite Gaussian mixture?


Fitting a finite Gaussian mixture

log-likelihood for fixed K:

$$\ell = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k f(y_i \mid z_k, \sigma_{(k)}^2) \right\}$$

E.g. K=5, unequal variances:

                           MASS1    MASS2    MASS3    MASS4    MASS5
    Coefficients:           9.71    16.13    22.78    19.72    33.04
    Mixture proportions:    0.085    0.024    0.512    0.342    0.037
    Standard deviations:    0.423    0.043    1.721    0.626    0.922

    -2 log L: 380.9

EM Trajectories:
[Figure: mass points (10 to 35) against EM iterations (0 to 80)]


Problems of current NPML implementations (as in GLIM 4)

• Results depend heavily on the choice of starting points $z_k^0$, usually defined as
  $$z_k^0 = \bar y + tol \cdot \hat\sigma \cdot g_k$$
  where tol: scaling parameter, $g_k$: Gauss-Hermite mass points, $\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar y)^2$.
• Finding the optimal solution requires a tedious grid search for tol.
• The EM trajectories behave quite erratically in the first cycles, and tend to cross.
• The positions of the 'optimal' starting points apparently 'have nothing to do' with the optimal mass points. This makes automatic starting point selection difficult.
• General problem: There does not exist an automatic routine to select K.

"one of the things you do not know is the number of things that you do not know" (Richardson & Green, 1997)


Improvement: Damping the EM algorithm

Shrink estimated standard deviation $\hat\sigma_{(k)}$ of the mixture components in the j-th cycle by the factor

$$d_j = 1 - (1 - tol)^j, \quad (0 < tol \le 1)$$

i.e. $d_1 = tol$ and $d_j \to 1$ for $j \to \infty$.

• Damping has main effect in the first cycles.
• Reduces fluctuations and dependence on tol.
• Starting in an 'optimal' solution, the damped version does not escape.


EM Trajectories for galaxy data (equal variances)

[Figure: mass points (10 to 30) against EM iterations for K=4, K=5, K=6; left column undamped, right column damped.]


EM Trajectories for galaxy data (unequal variances)

[Figure: mass points (10 to 30) against EM iterations for K=4, K=5, K=6; left column undamped, right column damped.]


Sensitivity to tuning parameter reduced

Disparity (i.e. −2 log L) in dependence of tol, damped (dashed) and undamped (solid):
[Figure: panels k=4, k=5, k=6; disparity against tol from 0.0 to 1.0]

• Damping gives a significant improvement in stability and performance of NPML estimation.
• Damping is straightforwardly adapted to other exponential families with dispersion parameter, e.g. Gamma distribution (Einbeck & Hinde, 2005, Austrian Journal of Statistics).


Finding optimal starting points

Idea: Consider density estimate

$$\hat f(y, h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{y_i - y}{h}\right).$$

[Figure: estimated density f and estimated mixture components pi_k*f_k against velocity/1000 (10 to 35)]

Carreira & Williams, 2003: The number of modes is a lower bound for the number of components of a Gaussian mixture.

But: Number of modes depends on bandwidth h.


The mode tree (Minnotte & Scott, 1993)

[Figure: mode tree for the galaxy data; bandwidth h (5, 1, 0.5, 0.1) against $y_i$ (velocity/1000, 10 to 35)]

More generally, applied on the 'residuals' $h^{-1}(y_i) - x_i'\hat\beta$ of a GLM: "zoom into the random effect distribution"

    ← fixed effect model (K = 1)
    ← random effect models (1 < K < 82)
    ← saturated model (K = 82)


Examples for bandwidth selection

Bandwidth selectors: BCV (1 mode), Silverman (3 modes), AIC (22 modes)

[Figure: mode tree with the selected bandwidths marked; h (5, 1, 0.5, 0.1) against velocity/1000 (10 to 35)]


Bandwidth selection in 2 steps

• Calculate Silverman's optimal bandwidth
  $$h_{opt} = 0.9\, A\, n^{-1/5}, \quad \text{where } A = \min\{\hat\sigma, IQR/1.34\}$$
• From that bandwidth, climb down the mode tree until the next critical bandwidth
  $$h_{crit} = \inf\{h : \hat f(\cdot, h) \text{ has at most } k \text{ modes}\}$$
  is reached. (Silverman, 1981)


Using $h_{crit}$, the mode tree gives

• an estimate for the number of modes, and thus for the number of components
• a very accurate estimate for the location of the mass points, which then can be used as starting points for the EM algorithm. In many cases, these starting points are so accurate that one hardly needs EM at all!

[Figure: mode tree (h from 0.1 to 5) next to EM trajectories of $z_k$ (10 to 35) against iterations]


Climbing down the tree

[Figure: mode trees (h from 0.1 to 5) with the corresponding EM trajectories of $z_k$ against iterations]


General R Package {npml} (under construction) at
www.nuigalway.ie/maths/je/npml.html
in joint work with J. Hinde, based on initial work by R. Darnell.

• Normal, Binomial, Poisson, Gamma-distributed response
• NPML and Gaussian Quadrature
• Variance component models
• Random coefficient models


References

AITKIN, M. (1996): A general maximum likelihood analysis of overdispersion in generalized linear models. Statistics and Computing 6, 251–262.
AITKIN, M., FRANCIS, B. and HINDE, J. (2005): Statistical Modelling in GLIM 4 (Second edition). Oxford, UK.
CARREIRA-PERPIÑAN, M. A. and WILLIAMS, C.K.I. (2003): On the number of modes of a Gaussian mixture. Lecture Notes in Computer Science, 2695, 625–640.
EINBECK, J. and HINDE, J. (2005): A note on NPML estimation for exponential family regression models with unspecified dispersion parameter. Austrian Journal of Statistics, to appear.
HINDE, J. (1982): Compound Poisson regression models. Lecture Notes in Statistics 14, 109–121.
LAIRD, N. M. (1978): Nonparametric maximum likelihood estimation of a mixing distribution. JASA, 73, 805–811.
McCULLOCH, C. E. and SEARLE, R. (2001): Generalized, linear, and mixed models. New York: Wiley.
MINNOTTE, M. C. and SCOTT, D. W. (1993): The mode tree: A tool for visualization of nonparametric density features. JCGS, 2, 51–68.
REITHINGER, F. (2003): Generalized linear models with random effects and smooth components. Diploma thesis, LMU München.
RICHARDSON, S. and GREEN, P. (1997): On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion). JRSSB, 59, 731–792.
SILVERMAN, B. W. (1981): Using kernel density estimates to investigate multimodality. JRSSB, 43, 97–99.
SKRONDAL, A. and RABE-HESKETH, S. (2004): Generalized latent variable modelling. Boca Raton: Chapman & Hall/CRC.
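The damping idea from the slides (shrink the estimated component standard deviations in the j-th EM cycle by $d_j = 1 - (1 - tol)^j$) can be illustrated with a small Gaussian mixture EM in Python. This is a sketch of one possible reading of the slide, not the authors' implementation: the factor is applied to each $\hat\sigma_{(k)}$ directly after the M-step, the algorithm runs a fixed number of cycles instead of checking convergence, and the names `damping_factor`, `e_step`, `damped_em` as well as the toy data are hypothetical.

```python
import math


def damping_factor(j, tol):
    """d_j = 1 - (1 - tol)^j for cycle j = 1, 2, ... and 0 < tol <= 1."""
    return 1.0 - (1.0 - tol) ** j


def e_step(y, z, sigma, pi):
    """Posterior weights w_ik and log-likelihood of a Gaussian mixture."""
    w, loglik = [], 0.0
    for v in y:
        a = [math.log(max(p, 1e-300)) - math.log(s)
             - 0.5 * math.log(2.0 * math.pi) - 0.5 * ((v - mu) / s) ** 2
             for p, mu, s in zip(pi, z, sigma)]
        amax = max(a)
        lse = amax + math.log(sum(math.exp(t - amax) for t in a))
        w.append([math.exp(t - lse) for t in a])
        loglik += lse
    return w, loglik


def damped_em(y, start, tol=0.5, cycles=100):
    """EM for a K-component Gaussian mixture with unequal variances and damping."""
    n, K = len(y), len(start)
    ybar = sum(y) / n
    sd = math.sqrt(sum((v - ybar) ** 2 for v in y) / n)
    z, sigma, pi = list(start), [sd] * K, [1.0 / K] * K
    for j in range(1, cycles + 1):
        w, _ = e_step(y, z, sigma, pi)
        d = damping_factor(j, tol)  # close to tol at first, tends to 1 later
        for k in range(K):
            wk = sum(row[k] for row in w)
            pi[k] = wk / n
            if wk > 0.0:
                z[k] = sum(row[k] * v for row, v in zip(w, y)) / wk
                var = sum(row[k] * (v - z[k]) ** 2 for row, v in zip(w, y)) / wk
                # Assumption: damping shrinks the M-step estimate of sigma_k by d_j;
                # the floor 1e-6 only guards against a zero standard deviation.
                sigma[k] = max(d * math.sqrt(var), 1e-6)
    w, loglik = e_step(y, z, sigma, pi)
    return {"z": z, "sigma": sigma, "pi": pi, "w": w, "disparity": -2.0 * loglik}


# Made-up data with two tight clusters around 1 and 5.
fit = damped_em([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], start=[2.0, 4.0], tol=0.5)
print(fit["z"], fit["sigma"])
```

Because $d_j \to 1$, the damped iteration ends up at a fixed point of the ordinary EM algorithm; the damping only changes the path taken in the first cycles.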
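The two-step bandwidth selection (Silverman's rule of thumb, then climbing down to the next critical bandwidth) can also be sketched in plain Python. The sketch rests on several assumptions: a Gaussian kernel, the usual form of the rule of thumb $h = 0.9\,A\,n^{-1/5}$, $\hat\sigma$ computed with divisor n, modes located on a grid, and a bisection search that stops at one hundredth of the starting bandwidth. All function names and the toy data are hypothetical; this is not the mode tree code used for the slides.

```python
import math


def silverman_bandwidth(y):
    """Rule of thumb h = 0.9 * A * n^(-1/5) with A = min(sd, IQR / 1.34)."""
    n, ys = len(y), sorted(y)
    ybar = sum(ys) / n
    sd = math.sqrt(sum((v - ybar) ** 2 for v in ys) / n)

    def quantile(q):  # linear interpolation between order statistics
        pos = (n - 1) * q
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return ys[lo] + (pos - lo) * (ys[hi] - ys[lo])

    iqr = quantile(0.75) - quantile(0.25)
    a = min(sd, iqr / 1.34) if iqr > 0.0 else sd
    return 0.9 * a * n ** (-0.2)


def kde_modes(y, h, per_h=10):
    """Grid locations of the local maxima of a Gaussian kernel density estimate."""
    lo, hi = min(y) - 3.0 * h, max(y) + 3.0 * h
    G = int((hi - lo) / h * per_h) + 1
    xs = [lo + (hi - lo) * g / G for g in range(G + 1)]
    # The normalising constant 1/(n h sqrt(2 pi)) does not affect the modes.
    f = [sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in y) for x in xs]
    return [xs[g] for g in range(1, G) if f[g - 1] < f[g] >= f[g + 1]]


def critical_bandwidth(y, k, h_hi, iters=30):
    """Bisection for the smallest h below h_hi with at most k modes."""
    h_lo = h_hi / 100.0  # assumption: do not search below h_hi / 100
    if len(kde_modes(y, h_lo)) <= k:
        return h_lo
    for _ in range(iters):
        mid = 0.5 * (h_lo + h_hi)
        if len(kde_modes(y, mid)) <= k:
            h_hi = mid
        else:
            h_lo = mid
    return h_hi


def mode_starting_points(y):
    """Number of modes at h_opt, critical bandwidth, and modes as starting points."""
    h_opt = silverman_bandwidth(y)
    k = len(kde_modes(y, h_opt))
    h_crit = critical_bandwidth(y, k, h_opt)
    return k, h_crit, kde_modes(y, h_crit)


# Made-up data: five observations at -1 and five at +1.
toy = [-1.0] * 5 + [1.0] * 5
print(mode_starting_points(toy))
```

The bisection relies on the number of modes of a Gaussian kernel estimate not increasing with h, the property behind the critical bandwidth of Silverman (1981). The returned mode locations could then serve as starting values $z_k^0$ for the EM algorithm.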