Latent class binary regression models – identification and estimation

Anders Holm(I) and Morten Pedersen(II)

Abstract

In this paper we analyse the identification of the latent class binary regression model. In this model the latent classes are thought to represent unobserved heterogeneity. We show that ignoring unobserved heterogeneity might lead to very biased results. We furthermore illustrate that the model is well identified using panel data, but that identification is fragile using cross-sectional data. Based on insight into the model and on simulations, we propose that a simplified model might work almost as well for cross-sectional data. Finally, we illustrate the applicability and performance of the latent class logit regression model, as opposed to an ordinary logit regression model, in two different applications.

Keywords: Latent class, logit model, regression model, panel data, unobserved heterogeneity.

1. Introduction

In many social science applications of regression analysis one does not observe all the relevant independent variables. In linear models, problems with unobserved independent variables depend on whether these variables are correlated with the observed independent variables, see e.g. Ejernæs and Holm (2006). In contrast, in non-linear regression models it is well known that unobserved independent variables potentially lead to bias in the effect of the observed independent variables, even when the unobserved and observed variables are uncorrelated, see Cameron and Heckman (1998), Bretagnolle and Huber-Carol (1988) or Abramson et al. (2000). Consequently, ignoring omitted independent variables, even when they are uncorrelated with the observed independent variables, might lead to incorrect conclusions about the effect of the observed independent variables. In order to illustrate how to take into account omitted independent variables in non-linear regression models, we propose a simple version of the latent class regression logit model.
Furthermore, we illustrate why, with this model, the observed data can reveal information on the effects of both the observed and the unobserved independent variables on the dependent variable. We also illustrate why this model is sometimes hard to estimate, especially with cross-sectional data, due to weak identification, and we propose a simple strategy to improve identification in this case.

(I) Danish School of Education, University of Aarhus, Tuborgvej 164, DK-2400 Copenhagen NV, phone +45 8888 9566, e-mail: ahol@dpu.dk.
(II) Department of Sociology, University of Copenhagen.

In our approach the latent classes need not represent any particular type of omitted variables, but can be seen as a non-parametric approximation to any unknown distribution of omitted variables. Hence, whether the omitted variables are continuous, discrete, or both, we can think of the latent classes as a non-parametric approximation to the unknown distribution of these variables. The justification of our approach comes from Lindsay (1983a, 1983b), who showed that any mixing distribution representing unobserved heterogeneity can be sufficiently approximated by a latent class distribution with a fixed number of classes. However, the number of classes is proportional to the number of observations. Although this argument makes sense intuitively, it leads to a breakdown of some of the regularity conditions of maximum likelihood theory (a finite parameter space with parameters in the interior of the parameter space) and, hence, it is impossible to use the classical inference of maximum likelihood theory for these types of models. However, by assuming that the number of latent classes is known in advance, one can use standard maximum likelihood for the parameters of the model. Furthermore, the exact number of latent classes can be determined by alternative goodness-of-fit measures, e.g. the Bayesian information criterion (BIC), see e.g. Dayton (1999).
Although this strategy tends to understate the estimated standard errors of the parameter estimates, it has been shown to be a feasible solution in the applied literature, see Greene (2003). In practical applications, see Heckman and Singer (1984), Davies (1993) or Holm (2002), one often finds that a small number of latent classes is sufficient to capture the significant features of the distribution of the omitted variables.

The remainder of the paper is organized as follows: section 2 introduces the model, section 3 discusses identification, section 4 presents simulation results, sections 5 and 6 contain two applications, and section 7 concludes.

2. The model

We analyze a latent class binary logit regression model. The dependent variable is Y and takes the values y = 0 and y = 1. We formulate the latent class logit regression model with J latent classes as:

$$P(Y=1\mid x)=\sum_{j=1}^{J}P(Y=1\mid x,\delta_j)\,P(\delta_j)=\sum_{j=1}^{J}\frac{\exp(\alpha+\beta x+\delta_j)}{1+\exp(\alpha+\beta x+\delta_j)}\,P(\delta_j)\qquad(1)$$

where $\alpha$ is a constant term, $x$ is a vector of explanatory variables, $\beta$ is a corresponding row vector of regression coefficients, $\delta_j$ is the effect of the j'th latent class on the probability of observing Y = 1, and finally $P(\delta_j)$ is the frequency of the j'th latent class in the population. The parameters of the model to be estimated are $\alpha$, $\beta$, $\delta_j$, $P(\delta_j)$, j = 1,…,J, where J is the number of latent classes. This model takes into account unobserved heterogeneity arising from omitting independent variables. The unobserved heterogeneity might either be thought of as the representation of a true discrete distribution of unobserved heterogeneity or as an approximation to any unknown distribution of unobserved heterogeneity, discrete or continuous.

The latent class frequencies, $P(\delta_j)$, must meet the restrictions $P(\delta_j)\ge 0$ and $\sum_{j=1}^{J}P(\delta_j)=1$. Hence, the following re-parameterization is useful when estimating the frequencies:

$$P(\delta_j)=\frac{\exp(\theta_j)}{\sum_{k=1}^{J}\exp(\theta_k)}$$

where now $\theta_j$, j = 1,…,J are parameters to be estimated.
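The mixture probability in (1), together with the exponential re-parameterization of the class frequencies, can be sketched in a few lines of Python. This is only an illustration for a single explanatory variable; the function names and the numerical values are our own assumptions and are not taken from the paper.

```python
import math

def class_frequencies(theta):
    """Map unrestricted parameters theta_1..theta_J to frequencies P(delta_j)."""
    m = max(theta)  # subtract the maximum before exponentiating to avoid overflow
    weights = [math.exp(t - m) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]

def prob_y1(x, alpha, beta, delta, theta):
    """P(Y = 1 | x) as in equation (1), for a single explanatory variable x."""
    freqs = class_frequencies(theta)
    prob = 0.0
    for d, f in zip(delta, freqs):
        z = alpha + beta * x + d
        prob += f / (1.0 + math.exp(-z))  # logistic response within class j
    return prob

# Hypothetical values for J = 2 latent classes (illustration only).
delta = [0.0, 1.5]
theta = [0.0, -0.5]
freqs = class_frequencies(theta)
p_mix = prob_y1(0.3, alpha=-0.2, beta=1.0, delta=delta, theta=theta)
```

By construction the frequencies are positive and sum to one for any real-valued `theta`, so an optimizer can search over `theta` without constraints. Adding the same constant to every element of `theta` leaves the frequencies unchanged, which already signals that not all J of these parameters can be identified.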
Furthermore, dividing the numerator and the denominator by $\exp(\theta_1)$ gives:

$$P(\delta_j)=\frac{\exp(\theta_j-\theta_1)}{1+\sum_{k=2}^{J}\exp(\theta_k-\theta_1)}=\frac{\exp(\theta_j)}{1+\sum_{k=2}^{J}\exp(\theta_k)}$$

where, in the last expression, $\theta_j$ has been re-defined as $\theta_j-\theta_1$ (so that $\theta_1=0$). It follows that the number of identifiable parameters for the latent class frequencies is J – 1. Furthermore, we also find that re-defining $\alpha^{*}=\alpha+c$ and $\delta_j^{*}=\delta_j-c$ for any constant $c$ leaves $P(Y=1\mid x,\delta_j^{*})=P(Y=1\mid x,\delta_j)$, j = 1,…,J; hence a normalization of the effect of the latent classes is warranted. We follow the so-called dummy coding and normalize $\delta_1=0$.

The purpose of this paper is to discuss the intuition behind identification, not to give a fully rigorous proof of identification of the latent class model. We therefore work with a simplified two-class model with one independent variable. The model is then written as:

$$P(Y=1\mid x)=\sum_{j=1}^{2}\frac{\exp(\alpha+\beta x+\delta_j)}{1+\exp(\alpha+\beta x+\delta_j)}\,P(\delta_j)\qquad(2)$$

where $x$ is now a single continuous variable, $\beta$ is a regression coefficient, and where $\delta_1=0$ and $\delta_2=\delta$. From (2) we construct the log-likelihood function for a sample of n independent observations:

$$\ln L=\sum_{i=1}^{n}\Bigl[y_i\ln P(Y=1\mid x_i)+(1-y_i)\ln\bigl(1-P(Y=1\mid x_i)\bigr)\Bigr]\qquad(3)$$

where $P(Y=1\mid x_i)=pP_{0i}+(1-p)P_{\delta i}$ and

$$P_{0i}=\frac{\exp(\alpha+\beta x_i)}{1+\exp(\alpha+\beta x_i)},\qquad P_{\delta i}=\frac{\exp(\alpha+\beta x_i+\delta)}{1+\exp(\alpha+\beta x_i+\delta)}$$

and finally where $P(\delta=0)=p$ and $P(\delta)=1-p$. Note that we now implicitly use:

$$P(\delta_1=0)=\frac{1}{1+\exp(\theta_2)};\qquad P(\delta_2=\delta)=\frac{\exp(\theta_2)}{1+\exp(\theta_2)}$$

In the following example we illustrate how the latent class logistic regression model and the standard logit regression model might lead to very different estimates of the effect of the observed independent variable. Consider the following two-way table:

- TABLE 1 HERE -

From the table we find the log-odds ratio to be, on average, roughly one. However, the table can be thought of as comprised of the following two sub-tables:

- TABLE 2 HERE -

From table 2 it is evident that in both sub-samples the log-odds ratio is approximately two. Hence, ignoring grouping, we estimate the log-odds ratio, $\beta$, with about 100 % bias. This is confirmed by the following ML estimates from an ordinary logistic regression model and a latent class model with two classes.
- TABLE 3 HERE -

The likelihood values of the ordinary regression model and the latent class model do not indicate a dramatically different fit to the data. The ratio of the log-likelihoods is 1.003, even though the estimate of $\beta$ differs dramatically between the two models. To illustrate this, consider the following figure:

- FIGURE 1 HERE -

The figure shows observed and predicted probabilities of Y = 1. From the figure it is clear that there are only small differences between the predicted probabilities of the logistic regression model and the latent class model (which in this case yields a perfect fit to the data because it is a saturated model). It is likely that the variation in x will only yield minor discrepancies in predicted probabilities between the logistic regression model and the latent class model. And often, from these variations it will be difficult to determine whether these discrepancies are due to non-linear effects of x on the log-odds of Y or to the presence of latent classes.[1]

3. Identification

Going back to the log-likelihood function, and writing $P_i=P(Y=1\mid x_i)=pP_{0i}+(1-p)P_{\delta i}$, we find the following log-likelihood equations:

$$\frac{\partial\ln L}{\partial\alpha}=\sum_i\left[y_i\frac{\mathrm{Var}(y_i)}{P_i}-(1-y_i)\frac{\mathrm{Var}(y_i)}{1-P_i}\right]$$

$$\frac{\partial\ln L}{\partial\beta}=\sum_i\left[y_i\frac{x_i\mathrm{Var}(y_i)}{P_i}-(1-y_i)\frac{x_i\mathrm{Var}(y_i)}{1-P_i}\right]$$

$$\frac{\partial\ln L}{\partial\delta}=\sum_i\left[y_i\frac{\mathrm{Var}(y_i\mid\delta)}{P_i}-(1-y_i)\frac{\mathrm{Var}(y_i\mid\delta)}{1-P_i}\right]$$

$$\frac{\partial\ln L}{\partial p}=\sum_i\left[y_i\frac{P_{0i}-P_{\delta i}}{P_i}-(1-y_i)\frac{P_{0i}-P_{\delta i}}{1-P_i}\right]$$

where $\mathrm{Var}(y_i)=pP_{0i}(1-P_{0i})+(1-p)P_{\delta i}(1-P_{\delta i})$ and $\mathrm{Var}(y_i\mid\delta)=P_{\delta i}(1-P_{\delta i})$. The equation for $\delta$ is stated up to the constant factor $1-p$, which does not affect its solution. From the log-likelihood equations we find that when $\delta=0$, $P_{0i}=P_{\delta i}$. This means that whenever $\delta=0$, $\partial\ln L/\partial p=0$ for all values of $p$ (equivalently, of $\theta_2$), i.e. there is no information on the value of p. In this case the last equation becomes redundant and identification of p is not possible. In practice this means that when $\delta$ is close to 0, the likelihood function might behave badly and identification might be problematic.
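The loss of information on p when the class effect is zero can be checked numerically. The sketch below codes the two-class log-likelihood (3) and the likelihood equation for p, and evaluates them on a tiny hypothetical data set; the data, the parameter values and the function names are assumptions made for illustration only.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def loglik(y, x, alpha, beta, delta, p):
    """Two-class log-likelihood (3) with P(delta = 0) = p and P(delta) = 1 - p."""
    total = 0.0
    for yi, xi in zip(y, x):
        p0 = logistic(alpha + beta * xi)           # P_0i
        pd = logistic(alpha + beta * xi + delta)   # P_delta,i
        pi = p * p0 + (1.0 - p) * pd               # P(Y = 1 | x_i)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total

def score_p(y, x, alpha, beta, delta, p):
    """Derivative of the log-likelihood with respect to p."""
    total = 0.0
    for yi, xi in zip(y, x):
        p0 = logistic(alpha + beta * xi)
        pd = logistic(alpha + beta * xi + delta)
        pi = p * p0 + (1.0 - p) * pd
        total += (yi / pi - (1 - yi) / (1.0 - pi)) * (p0 - pd)
    return total

# Hypothetical data and parameter values (illustration only).
y = [0, 1, 1, 0, 1]
x = [-1.0, 0.0, 1.0, 2.0, 0.5]
flat = [loglik(y, x, 0.0, 1.0, 0.0, p) for p in (0.2, 0.5, 0.8)]
score_at_zero = score_p(y, x, 0.0, 1.0, 0.0, 0.5)
score_at_one = score_p(y, x, 0.0, 1.0, 1.0, 0.5)
```

With the class effect set to zero, the three log-likelihood values in `flat` coincide and `score_at_zero` vanishes, whatever value of p is used. With a class effect of one, the derivative in `score_at_one` is no longer zero, so the data carry information on p.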
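In order to study how the observed information (the distribution of Y and X) may or may not lead to identification of the distribution of the latent classes, we find the posterior distribution of $\delta$ conditional on Y and X:

$$P(\delta\mid Y=y,X=x)=\frac{P(Y=y\mid X=x,\delta)\,(1-p)}{\sum_{j=1}^{2}P(Y=y\mid X=x,\delta_j)\,P(\delta_j)}=\frac{\dfrac{\exp(\alpha+\beta x+\delta)^{y}}{1+\exp(\alpha+\beta x+\delta)}\,(1-p)}{\sum_{j=1}^{2}\dfrac{\exp(\alpha+\beta x+\delta_j)^{y}}{1+\exp(\alpha+\beta x+\delta_j)}\,P(\delta_j)}=\frac{1}{1+\dfrac{p\,\bigl(1+\exp(\alpha+\beta x+\delta)\bigr)}{(1-p)\exp(\delta y)\bigl(1+\exp(\alpha+\beta x)\bigr)}}$$

Now differentiate with respect to x and equate to zero to obtain:

$$\frac{\partial P(\delta\mid Y=y,X=x)}{\partial x}=\frac{\beta\exp(\alpha+\beta x+\delta y)\,\bigl(1-\exp(\delta)\bigr)\,p(1-p)}{\Bigl[(1-p)\exp(\delta y)\bigl(1+\exp(\alpha+\beta x)\bigr)+p\bigl(1+\exp(\alpha+\beta x+\delta)\bigr)\Bigr]^{2}}=0\;\Rightarrow\;\delta=0$$

(for $\beta\neq 0$), as the denominator is always defined. This means that whenever x varies, so does the posterior probability of observing a latent class membership, except when the latent class membership effect is zero.

If we have panel data, i.e., repeated observations of both Y and X, in general we have that $P(Y_1=1)=1-P(Y_2=0)$, where the subscript t = 1, 2 denotes which part of the panel the observation belongs to. Hence changing values of not only x but also y along the panel might lead to information on $\delta$. This can be seen by:

$$P(\delta\mid Y_1=1,X=x)-P(\delta\mid Y_2=0,X=x)=0$$

$$\Leftrightarrow\;\frac{1}{1+\dfrac{p\,\bigl(1+\exp(\alpha+\beta x+\delta)\bigr)}{(1-p)\exp(\delta)\bigl(1+\exp(\alpha+\beta x)\bigr)}}-\frac{1}{1+\dfrac{p\,\bigl(1+\exp(\alpha+\beta x+\delta)\bigr)}{(1-p)\bigl(1+\exp(\alpha+\beta x)\bigr)}}=0$$

$$\Leftrightarrow\;\exp(\delta)\bigl(1+\exp(\alpha+\beta x)\bigr)=1+\exp(\alpha+\beta x)\;\Leftrightarrow\;\delta=0$$

Hence whenever y varies, so does the posterior probability of observing a latent class membership, except when the latent class membership effect is zero. We also find that:

$$P(\delta\mid Y_1=1,X=x)=P(\delta\mid Y_2=0,X=x')\;\Rightarrow\;\delta=0\quad\text{or}\quad\delta=\ln\left(-\frac{1+\exp(\alpha+\beta x')}{\exp(\alpha+\beta x')\bigl(1+\exp(\alpha+\beta x)\bigr)}\right)$$

As the term inside the bracket will always be negative, the second root is not a feasible solution. Hence, identification of $\delta$ improves when both Y and X vary. Finally, note that $P(\delta\mid Y_1=1,X=x)=P(\delta\mid Y_2=1,X=x')$ does not lead to any conclusion about the value of $\delta$. That is, the observations that only change the values of the independent variable do not contribute to the identification of the latent classes.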
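The behaviour of the posterior class probability can be illustrated with a short sketch that evaluates it directly from Bayes' rule and compares it with the closed-form expression, using P(δ = 0) = p and P(δ) = 1 − p. The parameter values and function names are hypothetical and chosen for illustration only.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior_delta(y, x, alpha, beta, delta, p):
    """P(delta | Y = y, X = x) by Bayes' rule, with P(delta = 0) = p."""
    a = alpha + beta * x
    like_0 = logistic(a) if y == 1 else 1.0 - logistic(a)
    like_d = logistic(a + delta) if y == 1 else 1.0 - logistic(a + delta)
    return (1.0 - p) * like_d / ((1.0 - p) * like_d + p * like_0)

def posterior_closed_form(y, x, alpha, beta, delta, p):
    """The same posterior written as 1 / (1 + ratio)."""
    a = alpha + beta * x
    ratio = (p * (1.0 + math.exp(a + delta))
             / ((1.0 - p) * math.exp(delta * y) * (1.0 + math.exp(a))))
    return 1.0 / (1.0 + ratio)

# Hypothetical parameter values (illustration only).
alpha, beta, p = -0.5, 1.0, 0.6
grid = [(y, x) for y in (0, 1) for x in (-1.0, 0.0, 2.0)]
degenerate = [posterior_delta(y, x, alpha, beta, 0.0, p) for y, x in grid]
informative = [posterior_delta(y, x, alpha, beta, 1.5, p) for y, x in grid]
```

When the class effect is zero, every entry of `degenerate` equals the prior probability 1 − p, so neither x nor y is informative about class membership. With a non-zero class effect, the entries of `informative` change with both x and y.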
We may summarize these findings in the following proposition:

Proposition: If $P(Y=1\mid x,\delta)=\dfrac{\exp(\alpha+\beta x+\delta)}{1+\exp(\alpha+\beta x+\delta)}$, with $\alpha$, $\beta$ known, then

$$P(\delta\mid Y=y,X=x)=P(\delta\mid Y=y',X=x'),\quad y\neq y'\ \text{with}\ P(Y=y)=1-P(Y=y')\ \text{or}\ x\neq x'\;\Rightarrow\;\delta\mid x,x',y,y'=0.$$

Proof: See the appendix.

The proposition states that if two posterior probabilities are equal for different x (the case of cross-sectional data), or for different y and/or x (the case of panel data), this must be because the distribution of the latent classes is degenerate, at least for the observed information used in the comparison. Hence, this observed information is non-informative with respect to the distribution of the latent classes. Vice versa, if the posterior probabilities differ for different observed (non-redundant) information, this information is informative on the distribution of the latent classes.

4. Some Simulations

In order to study the effect of cross-sectional and panel data on identification of the latent class model, we run a number of simulations. We run 100 simulations, each on datasets with 500 observations, including repeated observations in panels. The simulations have varying degrees of identification in terms of the number of panels and the variation in x. The results are shown in table 4 below.

- TABLE 4 HERE -

From table 4 it is evident that the latent class model (LCM) with continuous x (infinite outcomes) and five panels yields estimates which are close to the true values and have a small root mean square error (RMSE). However, it is also clear that in the case of only one panel (i.e. cross-sectional data) and two outcomes of x, the LCM performs poorly, although it still estimates the slope coefficient of x, $\beta$, with much less bias than the logit model. As the slope of x is our parameter of interest, we may try to improve the fit of the model by reducing the number of nuisance parameters.
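A simulation of this general kind can be sketched as follows. The sketch draws cross-sectional data from a two-class model and fits only the ordinary logit model, to show how the bias and RMSE of the slope are computed over replications. The parameter values, the number of replications (20 rather than 100, to keep it quick) and the Newton-Raphson fitting routine are our own assumptions; they are not the design behind table 4, and the latent class estimators themselves are not implemented here.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n, alpha, beta, delta, p, rng):
    """Draw (y, x) from a two-class model; class membership stays unobserved."""
    y, x = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)                 # continuous x
        di = 0.0 if rng.random() < p else delta  # latent class effect
        yi = 1 if rng.random() < logistic(alpha + beta * xi + di) else 0
        x.append(xi)
        y.append(yi)
    return y, x

def fit_logit(y, x, iterations=15):
    """Ordinary logit of y on a constant and x, fitted by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, x):
            pi = logistic(a + b * xi)
            w = pi * (1.0 - pi)
            g0 += yi - pi            # gradient with respect to the constant
            g1 += (yi - pi) * xi     # gradient with respect to the slope
            h00 += w                 # elements of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical design: true slope 1, two equally sized classes, class effect 4.
rng = random.Random(2024)
beta_true = 1.0
estimates = []
for _ in range(20):
    y, x = simulate(500, alpha=-2.0, beta=beta_true, delta=4.0, p=0.5, rng=rng)
    estimates.append(fit_logit(y, x)[1])
mean_b = sum(estimates) / len(estimates)
bias = mean_b - beta_true
rmse = math.sqrt(sum((b - beta_true) ** 2 for b in estimates) / len(estimates))
```

Under these assumed values the logit slope is pulled clearly towards zero, even though class membership is drawn independently of x.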
To reduce the number of nuisance parameters, we therefore fix the parameter for the weight of the latent classes (the transformed probabilities of the latent classes, $\theta_2$) to arrive at the Latent Class Model with Fixed Weights (LCMFW).[2] In order to assess the impact of this in real applications, we have fixed $\theta_2$ at a value different from the true value.[3] From table 4 we find that in the weakly identified case this approach actually leads to better estimates, whereas it leads to worse results in the better identified cases.

But why fix the weight in particular? Why not any of the other parameters? First of all, we found from the likelihood equations that the equation for the weight was redundant when the effect of the latent class approached zero. Hence, for some values of the other parameters, there is no information on how to choose a particular value of $\theta_2$. Further, from a principal component analysis (PCA) of the estimates in the simulations in table 4, we find the following eigenvalues and eigenvectors of the estimated parameters:

- TABLE 5 HERE -

By comparing the two top panels of table 5, representing PCA of the simulations on panel data, with the three lower panels, representing PCA on cross-sectional data, we find that the sum of the eigenvalues is much lower in the simulations pertaining to panel data than in those pertaining to cross-sectional data. This reflects the increased accuracy of panel data estimation compared to cross-section estimation. The first and largest eigenvalue corresponds to an eigenvector with large loadings on the constant term, $\alpha$, and especially on the effect of the latent class, $\delta$. Hence, large parts of the RMSE on these two parameters are due to the fact that they are correlated. The second largest eigenvalue, which is still of considerable relative size in the cross-sectional simulations, pertains to an eigenvector with a large loading on the weight of the latent class, $\theta_2$.
Therefore we conclude that a large part of the RMSE on this parameter is not correlated with any of the other parameters, or in other words, $\theta_2$ can, on cross-sectional data, take a wide range of values that leave the other parameters relatively unaffected. Hence, when identification is fragile, it seems relevant to fix $\theta_2$ in estimations.

In the simulations presented in table 4 we fix the number of observations at 500. In order to investigate the impact of the sample size on the estimates, we run a number of simulations for different sample sizes, keeping the number of simulations for each sample size at 100. In figures 2a and 2b we show bias and RMSE for a panel model with continuous x and five panels and an increasing sample size.

- FIGURE 2a + 2b HERE -

From the figures it is evident that both bias and RMSE decrease substantially for both the LCM and the LCMFW, even for relatively small samples. In no case does the bias of the LCM or the LCMFW exceed that of the logit model. Even in relatively small samples both the LCM and the LCMFW outperform the logit model in terms of RMSE. Hence, estimation of latent class models seems very feasible with panel data and rich variation in the independent variables. However, from inspecting the RMSE, it seems that in very small samples (i.e. fewer than 400 observations) the LCM is rather unstable. In this case reasonable results can be obtained with the LCMFW: we do not get much bias, but we get much better precision compared to the LCM, where all parameters are estimated. And the LCMFW still outperforms the logit model in terms of both bias and RMSE.

In figures 3a and 3b we show the bias and RMSE for a cross-sectional simulation with limited variation in the independent variable (two levels).

- FIGURE 3a + 3b HERE -

From the figures it is evident that the LCM has considerable bias and a large RMSE. In fact, for small sample sizes it displays as much bias as the logit model.
For small samples the LCM also has a huge RMSE compared to the logit model. And for larger sample sizes the LCM still has a much larger RMSE than the logit model. Hence, it appears that the LCM is not very feasible in this case. However, the LCMFW seems to perform much better. It has a much smaller bias than the logit model for all sample sizes and also a smaller RMSE than the logit model, at least for samples larger than 200 observations. Therefore it seems relevant to use the LCMFW when identification of the LCM fails or is very weak.

5. First application – trust in the parliament

In this section we use a two-class latent class logistic regression model to analyse the relationship between trust in the parliament and the position on a left/right political scale, see Arts and Gelissen (2001), age and gender. We use data from the Danish part of the international value study, see Gundelach (2002). A total of 640 respondents were interviewed in both the 1990 and the 1999 panel, establishing a panel data set. After cleaning the data we are left with 484 individuals. In table 6 we show descriptive statistics for both waves for the variables in the analysis.

- TABLE 6 HERE -

From the table we see that a little more than half of the respondents were males. This fraction does not change, as all individuals are in both panels. The average age in 1990 is 40 and hence 49 in 1999. More than two thirds live with a partner in both panels. Average household income is 293,000 DKK in 1990 and has risen to 386,000 DKK in 1999. Trust is a binary variable recoded from a four-category ordinal variable running from very much trust to very little trust in the parliament. In 1990, about 42 percent of the respondents express either very much or some trust in the parliament and are hence coded as one in the data for this analysis. In 1999, average trust has increased to 50 percent. This increase in trust along the panel could indicate that trust increases with age.
In table 7 we show estimation results from applying a logistic regression model, the LCM and the LCMFW to the entire panel.

- TABLE 7 HERE -

The table contains two important findings. First, the effect of the only significant independent variable, household income, is estimated at very different values by the logit model on the one hand and by the LCM and LCMFW on the other. Hence, taking into account unobserved heterogeneity in terms of two latent classes seems to be important. Second, the estimate of the class effect in the LCM is very large and the standard error of this estimate is also huge, indicating weak identification. On the other hand, the class effect in the LCMFW is much lower and has a much lower standard error as well. Hence we find that the LCMFW is much better identified than the LCM, and it also yields very similar estimates of the effects of the variables of interest, namely the independent variables. The BIC of the LCMFW is also lower than that of the LCM. This is due to the lower number of parameters in the LCMFW, as the two models have almost identical fit, reflected in the values of -2 ln L. The estimated value of the weight parameter, $\theta_2$, in the LCM is very different from the value found by the grid search with the LCMFW. This also indicates why estimation of the LCM is problematic. That two very different values of the weight parameter yield almost the same fit indicates a very flat likelihood function in the dimension of the weight parameter. Hence it seems sensible to fix it to improve identification of the remaining parameters in the model. This, of course, only makes sense if we get similar estimates of the parameters of interest, which is indicated by the simulations and which also shows up in this application.

6. Second application - Unemployment and the dual labour market

In this application we look at the probability of being unemployed or employed in a sample of Danish males.
The theoretical background is the theory of dual labour markets, which argues that some individuals have a strong attachment to the labour market and a low risk of unemployment, while other individuals have much weaker ties to the labour market and face a high risk of unemployment. Individuals with low qualifications are more often employed on fixed-term contracts or in temporary positions. Hence, they are more often at risk of being unemployed. As a consequence, the theory hypothesizes that at least two distinct groups exist in the labour market: one group that has strong labour market attachment and no or only little unemployment, and a second group that experiences most of the unemployment in the labour market. Empirical evidence in favour of the dual labour market is mixed. Sakamoto and Chen (1991) find significant support for the dual labour market, whereas Launov (2004) uses a latent class count model to analyze turnover and finds no evidence to support the dual labour market hypothesis.

In our application we use register data from the Danish administrative registers. The data cover the years 1980 to 1995 with yearly information on employment status, socio-economic information and marital status. We confine the analysis to males who are either employed or unemployed throughout their observation period (which might not cover the entire sample period). Summary statistics for the data are shown in table 8 below.

- TABLE 8 HERE -

From the table we find that respondents were on average unemployed 8 % of the time during the observation period, and that 11 % had children aged zero to two years. The respondents were on average 38.14 years old, 30 % were living alone, 46 % had a vocational education and 18 % had further education, leaving 36 % unskilled. Note that some or all of these variables might change for an individual through the observation period.
One of the virtues of register data is that there is no missing data due to attrition and no measurement error due to recall bias. Hence, in these respects, the data are of very high quality compared to longitudinal survey data.

We study dual labour market theory by applying a latent class model and a logit model to analyse whether the respondent is employed (Y = 1) or unemployed (Y = 0), conditional on a number of independent variables. In the latent class model unobserved heterogeneity is also taken into account by the latent classes. As we have panel data with multiple panels for most of the observations, we expect that the latent class model can be reliably estimated. Table 9 below shows estimation results for the two models.

- TABLE 9 HERE -

We find that a latent class model with three classes yields a reasonable fit to the data. BIC did not improve by adding more latent classes to the model. From -2 ln L and BIC we see that the latent class model provides a much better fit to the data than the logit model. We also find that the three latent classes represent three distinct groups in the labour market: one class with a very low unemployment probability (represented by a mass-point equal to -3.236), a group with an intermediate unemployment risk (the baseline case, with a mass-point normalized to 0) and a group with a very high unemployment risk (represented by a mass-point equal to 2.045). With respect to the effects of the observed independent variables, we find that middle-aged individuals with vocational or further education and living with a spouse have a lower risk of unemployment than other individuals. In sum, we find that unemployment risks are very unevenly distributed in the population, both according to observed variables and according to unobserved variables. Furthermore, we find that the latent class model yields quite different estimated effects of the observed independent variables compared to the estimates from the logit model.
Hence, relying on the estimates from the logit model might result in incorrect inference on the effect of the observed variables on the risk of unemployment.

7. Conclusions

In this paper we have demonstrated how to identify the latent class logistic regression model, and we have shown how it performs with cross-sectional and panel data. Two sources underlie the identification of the model. One source is variation in the independent variables and the other source is repeated measurement in terms of panel data. The latter source of identification is much more powerful than the first. When identification is “weak” we suggest fixing the parameter for the weight of one or more of the latent classes. Based on simulation evidence, fixing the weight can be a feasible strategy when the LCM proves unreliable. Although it might lead to some bias, this strategy is much better than the conventional logit model, both in terms of bias and precision. Finally, we present an application analyzing the dual labour market. We show that taking into account unobserved heterogeneity is important both in terms of obtaining information on the dual labour market and in terms of obtaining unbiased parameters of the observed independent variables of the model.

Appendix. Proof of proposition.

Proof: let $e=\exp(\alpha+\beta x)$ and $e'=\exp(\alpha+\beta x')$. Then equality of the two posterior probabilities reads:

$$\frac{\dfrac{e^{y}e^{\delta y}}{1+ee^{\delta}}\,P(\delta)}{\sum_{j=1}^{2}\dfrac{e^{y}e^{\delta_j y}}{1+ee^{\delta_j}}\,P(\delta_j)}=\frac{\dfrac{e'^{\,y'}e^{\delta y'}}{1+e'e^{\delta}}\,P(\delta)}{\sum_{j=1}^{2}\dfrac{e'^{\,y'}e^{\delta_j y'}}{1+e'e^{\delta_j}}\,P(\delta_j)}\;\Leftrightarrow\;e^{\delta y'}\bigl(1+ee^{\delta}\bigr)(1+e')=e^{\delta y}\bigl(1+e'e^{\delta}\bigr)(1+e)\qquad(*)$$

If $y=y'$, $x\neq x'$ (cross-sectional data), (*) reduces to:

$$ee^{\delta}+e'=e'e^{\delta}+e\;\Leftrightarrow\;(e-e')\bigl(e^{\delta}-1\bigr)=0\;\Rightarrow\;e^{\delta}-1=0.$$

If $y=0$, $y'=1$ (panel data, the case $y=1$, $y'=0$ being similar), from (*) we get:

$$e(1+e')\,e^{2\delta}+(1-ee')\,e^{\delta}-(1+e)=0\qquad(**)$$

This is a second-order polynomial in $e^{\delta}$ with roots $-\dfrac{1+e}{e(1+e')}$ and 1. As the first root is negative, we can discard this solution and are left only with $e^{\delta}-1=0$.

References

Abramson, C., R. L. Andrews, I. S. Currim and M.
Jones (2000), Parameter Bias from Unobserved Effects in the Multinomial Logit Model of Consumer Choice. Journal of Marketing Research, 37, pp. 410-426.
Arts, W. and J. Gelissen (2001), Welfare States, Solidarity and Justice Principles: Does the Type Really Matter? Acta Sociologica, 44, pp. 284-299.
Bretagnolle, J. and C. Huber-Carol (1988), Effects of Omitting Covariates in Cox's Model for Survival Data. Scandinavian Journal of Statistics, 15 (2), pp. 125-138.
Cameron, S. V. and J. J. Heckman (1998), Life Cycle Schooling and Dynamic Selection Bias: Models and Evidence for Five Cohorts of American Males. Journal of Political Economy, 106 (2), pp. 262-333.
Davies, R. B. (1993), Nonparametric control for residual heterogeneity in modelling recurrent behaviour. Computational Statistics & Data Analysis, 16 (2), pp. 143-160.
Dayton, M. C. (1999), Latent Class Scaling Analysis. Sage, Series in Quantitative Applications in the Social Sciences, 126.
Ejernæs, M. and A. Holm (2006), Comparing fixed effect and covariance structure estimators. Sociological Methods and Research, vol. 35.
Greene, W. (2003), A latent class model for discrete choice analysis: contrasts with mixed logit. Transportation Research, Part B: Methodological, 37 (8), pp. 681-698.
Gundelach, P. (Ed.) (2002), Danskernes værdier 1981-1999. Copenhagen: Hans Reitzels Forlag.
Heckman, J. J. and B. Singer (1984), A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data. Econometrica, 52 (2), pp. 271-320.
Holm, A. (2002), The Effect of Training on Search Durations; A Random Effects Approach. Labour Economics, vol. 9.
Launov, A. (2004), An Alternative Approach to Testing Dual Labour Market Theory. IZA Discussion Paper No. 1289.
Lindsay, B. G. (1983a), The geometry of mixture likelihoods: a general theory. Annals of Statistics, 11, pp. 86-94.
Lindsay, B. G. (1983b), The geometry of mixture likelihoods, Part II: the exponential family. Annals of Statistics, 11, pp. 783-792.
Murphy, S. A. and A. W.
van der Vaart (2000), On Profile Likelihood. Journal of the American Statistical Association, 95 (450), pp. 449-465.
Sakamoto, A. and M. D. Chen (1991), Inequality and Attainment in a Dual Labour Market. American Sociological Review, 56 (3), pp. 295-308.

Acknowledgements

We gratefully acknowledge comments made by participants at the seminar in applied statistics held at the University of Aarhus and from Jan Høgelund and Mads Meyer Jæger.

Anders Holm is professor at the School of Education, University of Aarhus, and has published in quantitative methods, social mobility and labor market research. Some of his previous papers have appeared in Social Science Research and Sociological Methods and Research.

Morten Pedersen is a research assistant at the Department of Sociology, University of Copenhagen, and has extensive computer programming knowledge in R, SAS, Gauss and SPSS.

[1] However, sometimes it may be conceivable that very erratic deviations from a smooth effect of x are due to unobserved effects. For example, if we observe clear non-smooth effects of age, this is very likely not due to misspecification of the age effect, but rather to something else, e.g. omitted independent variables. But identifying such effects usually requires a lot of data. We shall return to this below.

[2] One could also argue that one might pursue a profile likelihood approach, see Murphy and van der Vaart (2000), where one iterates between maximizing the likelihood with fixed weights and fixing the remaining parameters while estimating the weights. We have tried simulations with this approach but found that it yields very similar behaviour of the likelihood function to that of the full model.
Hence, in terms of obtaining a better behaved likelihood function, fixing the weight completely (even at a “wrong” value) yields much more precision on the remaining parameters than both the full maximum likelihood approach and the profile likelihood approach.