A Comparison of Parameter Estimates in Standard Logistic Regression using WinBUGS MCMC and MLE Methods in R for Different Sample Sizes

Masoud Karimlou 1,* Ph.D., Gholamraza Jandaghi 2 Ph.D., Kazem Mohammad 3 Ph.D., Rory Wolfe 4 Ph.D., Kmal Azam 3 Ph.D.

1,* Corresponding author: Department of Biostatistics, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran. E-mail: mkarimlo@uswr.ac.ir
2 University of Tehran, Qom campus
3 Department of Epidemiology and Biostatistics, School of Public Health Researches, Tehran University of Medical Sciences
4 Department of Epidemiology and Preventive Medicine, Monash Medical School, Alfred Hospital, Melbourne, Australia

Summary. Estimation of parameters in statistical models is a serious concern among statisticians, and frequentists and Bayesians propose different approaches. In this paper we examine MCMC and MLE approaches to estimating the parameters of a standard logistic regression with two covariates, and compare their efficiency across different sample sizes. We used WinBUGS for the MCMC estimation and R for the MLE estimation. The results show that as the sample size increases, the MCMC estimates get closer to the MLEs and the MCMC standard errors decrease.

Key Words: logistic regression, MCMC, MLE, WinBUGS, R, sample size.

1. Introduction

One of the obstacles in developing Bayesian inference is the complexity of the posterior distributions of the parameters of interest: except in special cases, no closed form for the posterior distribution can be obtained. Bayesian statisticians struggled with this problem until the method of Markov Chain Monte Carlo (MCMC) was established, which made sampling from a posterior distribution possible, so that inference about the posterior distributions of unknown parameters could be based on the drawn samples (Carlin & Louis, 2000).

MCMC is essentially Monte Carlo integration using Markov chains. Bayesians, and sometimes also frequentists, need to integrate over possibly high-dimensional probability distributions to make inference about model parameters or to make predictions. Bayesians need to integrate over the posterior distribution of the model parameters given the data, and frequentists may need to integrate over the distribution of observables given parameter values. Monte Carlo integration draws samples from the required distribution and then forms sample averages to approximate expectations. Markov Chain Monte Carlo draws these samples by running a cleverly constructed Markov chain for a long time. There are many ways of constructing these chains, but all of them, including the Gibbs sampler (Geman and Geman, 1984), are special cases of the general framework of Metropolis et al. (1953) and Hastings (1970).
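As a simple illustration of the Metropolis framework just mentioned (not part of the analysis in this paper), the following R sketch runs a minimal random-walk Metropolis sampler. The target is deliberately hypothetical: the posterior of a single proportion with a uniform prior and a binomial likelihood, i.e. a Beta density, so the Monte Carlo average of the draws can be checked against the exact posterior mean. The counts, proposal scale and chain length are all illustrative choices.

# Minimal random-walk Metropolis sampler (illustrative example, not the paper's model).
# Target: posterior of a proportion p with a uniform prior and y successes in n trials,
# i.e. a Beta(y + 1, n - y + 1) density, so the answer can be checked exactly.
set.seed(1)
y <- 28; n <- 66                          # illustrative counts
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)      # outside the support of p
  dbeta(p, y + 1, n - y + 1, log = TRUE)
}
iters <- 20000
p <- numeric(iters)
p[1] <- 0.5                               # starting value
for (t in 2:iters) {
  prop <- p[t - 1] + rnorm(1, 0, 0.05)    # random-walk proposal
  log_ratio <- log_post(prop) - log_post(p[t - 1])
  p[t] <- if (log(runif(1)) < log_ratio) prop else p[t - 1]
}
mean(p[-(1:2000)])                        # Monte Carlo estimate of E(p | data), after burn-in
(y + 1) / (n + 2)                         # exact posterior mean, for comparison

Averaging the retained draws is exactly the Monte Carlo integration described above; the sampler only needs the target density up to a constant, which is what makes MCMC attractive for posteriors with no closed form.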
However, there were still problems in using Bayesian statistical methods because of the lack of powerful software able to carry out a large amount of computation quickly and efficiently. Berger (2000) argues that the flexibility of Bayesian inference, the existence of different priors for a problem, the subjectivity of Bayesian inference and the complexity of the computations in modern MCMC methods are among the difficulties in developing good software for Bayesian inference.

The BUGS software (Bayesian inference Using Gibbs Sampling), developed in the biostatistics unit of the UK Medical Research Council (MRC) at Cambridge University, has been able to overcome many of the difficulties of inference for statistical models. The Windows version of BUGS, called WinBUGS, has been available to researchers since 1997, and various features have been added to it since then (Spiegelhalter et al., 2003). One important feature, in the "Tricks" section of WinBUGS, is that the user can introduce an arbitrary sampling distribution, i.e. any likelihood function, to the software. WinBUGS can therefore handle a wide variety of statistical inference problems quickly, without limitation on sample size, on a personal computer. In this paper we estimate the parameters of a standard logistic regression with two covariates, both by direct use of the Bernoulli distribution with a logit link function and by using the unconditional logistic likelihood in WinBUGS, and we compare the results with the maximum likelihood estimates obtained in R.

2. Estimation of parameters in standard logistic regression using the MCMC method

Consider the artificial dataset in the following table, in which X and Z are covariates and Y is the dependent (response) variable for a binary outcome (typically disease in epidemiologic studies), so that Y = 1 indicates diseased participants and Y = 0 indicates non-diseased participants.

X   Z   Y     n
0   0   280   660
0   1   400   660
1   0   220   340
1   1   180   340

In this table, for example, the first row shows that 280 of the 660 persons in the X = 0, Z = 0 category have the disease. These data mimic a study in which Z is distributed 50-50 percent, X is distributed 34-66 percent, and the odds ratio (OR) lies between 2 and 3, much as in real data. We analyzed the data with three Bayesian approaches, as follows.

2.1 First approach:

Let $\pi(x_i)$ be the probability that the ith subject has the disease, so that

$$Y_i \sim \mathrm{Bernoulli}(\pi(x_i)).$$

Then

$$\mathrm{logit}(\pi(x_i)) = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_{12} x_i z_i \qquad (1)$$

where $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_{12}$ are the parameters of the logit model without random effects, which are to be estimated. The graphical model for this example is shown in Figure 1; it displays the relations among the parameters, from which the full conditional distributions can be obtained.

Figure 1: Graphical display (DAG) for the standard logistic regression, with nodes beta0, beta1, beta2, beta12, x[i], z[i], p[i], n[i], d[i] and a plate for i in 1:100.

To fit a logistic regression model to these data, and because no informative prior is available, we use a non-informative normal prior with mean zero and a large variance (precision 1.0E-6 in the WinBUGS code of Appendix 1). This does not produce a closed form for the posterior and marginal distributions of the parameters, so inference about the model parameters requires an MCMC technique. We performed the MCMC simulation in WinBUGS. After a burn-in of 10000 iterations, we drew 20000 samples from the posterior distribution. Figure 2 shows the posterior distributions of the parameters, and Figure 3, the trace plots, depicts the convergence of the chain. The WinBUGS code is given in Appendix 1.

Figure 2: Posterior distributions of the standard logistic regression parameters (beta0, beta1, beta2, beta12; 20000 samples each), from WinBUGS.

Figure 3: Trace plots of the standard logistic regression parameters (beta0, beta1, beta2, beta12), from WinBUGS.
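Before turning to the alternative likelihood-based formulations, the following short R sketch (not part of the authors' analysis) makes the interpretation of model (1) concrete: it computes the fitted disease probabilities and the odds ratios for X at each level of Z implied by a set of coefficients. The coefficient values plugged in are simply the MLE point estimates reported later in Table 1 and are used here for illustration only.

# Fitted probabilities and odds ratios implied by the logit model (1).
# Coefficient values are the MLE point estimates from Table 1 (illustrative use only).
b0 <- -0.30538; b1 <- 0.91147; b2 <- 0.7362; b12 <- -1.2245

expit <- function(eta) 1 / (1 + exp(-eta))          # inverse logit

# Disease probability pi(x, z) under model (1)
pi_xz <- function(x, z) expit(b0 + b1 * x + b2 * z + b12 * x * z)

grid <- expand.grid(x = 0:1, z = 0:1)
grid$pi <- with(grid, pi_xz(x, z))
grid   # approximately reproduces the observed proportions 280/660, 220/340, 400/660, 180/340

# Odds ratio for X within each level of Z (the interaction term makes them differ)
exp(b1)         # OR for X when Z = 0
exp(b1 + b12)   # OR for X when Z = 1

Because the model is saturated for this 2 x 2 table of covariate patterns, the fitted probabilities match the observed cell proportions, which is a useful check on the coefficients.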
2.2 Second approach:

In this approach we used the logit link function, relation (1), together with the likelihood function

$$L(\beta) = \prod_{i=1}^{n} \left[\pi(x_i)\right]^{y_i} \left[1 - \pi(x_i)\right]^{1 - y_i}.$$

We estimated the model parameters using the WinBUGS "zeros trick". The results are given in column 4 of Table 1.

2.3 Third approach:

In this approach we used the likelihood function of the standard logistic regression with the two covariates Z and X written directly in terms of the linear predictor:

$$L(\beta) = \prod_{i=1}^{n} \frac{\left(e^{\beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_{12} x_i z_i}\right)^{y_i}}{1 + e^{\beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_{12} x_i z_i}}.$$

With standard WinBUGS programming, the output was identical to that of the second approach.

3. Maximum likelihood estimation of the standard logistic regression model:

At this stage we used the R software to obtain maximum likelihood estimates of the model parameters for the example of Section 2. The results are given in column 3 of Table 1.

4. Comparison of MLE and MCMC estimates:

Although the theories behind the MLE and MCMC methods should lead to essentially the same answers, Table 1 shows some differences between the estimates. To identify the cause of this difference, we increased the sample size beyond 100 while keeping the sample proportions in the X and Z categories unchanged, and in this way built several datasets with increasingly large sample sizes. The results of applying the approaches of Section 2 to these datasets are given in Table 1, which has five columns:

Column 1: parameter name
Column 2: sample size (eight different sample sizes)
Column 3: MLE estimates from R
Column 4: Bayesian estimates using MCMC in WinBUGS (zeros-trick code)
Column 5: Bayesian estimates using MCMC (direct use of the Bernoulli distribution in WinBUGS)

As can be seen in Table 1, column 3 has the same values for each parameter at every sample size, because the cell proportions are identical in all the datasets, so the MLE does not depend on the sample size here. The values in columns 4 and 5 show that the two Bayesian specifications give almost identical results and do not differ appreciably as the sample size increases; the small remaining difference may be due to column 5 defining the logistic model directly through the Bernoulli distribution, whereas column 4 is obtained by supplying the logistic likelihood function as the model. Because the purpose of this paper is to compare the MLE and MCMC estimates, we focus on columns 3 and 4 of the table. The values in these columns show that once the sample size exceeds 2000, the Bayesian estimates are almost identical to the classical ones. We also see that the standard errors of the MCMC estimates decrease as the sample size increases, roughly equalling the MLE standard errors when the sample size reaches 2000; for sample sizes beyond 2000, the MCMC standard errors become even smaller than those of the MLEs.

Table 1: Parameter estimates for the standard logistic regression with two covariates, using the MLE and MCMC methods, for different sample sizes (values in parentheses are SDs).
Parameter  Sample size   MLE                 WinBUGS (MCMC),     WinBUGS (MCMC),
                                             zeros trick         dbern
Beta0      n=100         -0.30538 (0.0788)   -0.3229 (0.3538)    -0.3189 (0.3569)
           n=200         "                   -0.3023 (0.2447)    -0.3012 (0.2498)
           n=500         "                   -0.3175 (0.1523)    -0.3133 (0.1583)
           n=1000        "                   -0.3081 (0.1053)    -0.3055 (0.1101)
           n=2000        "                   -0.3068 (0.0759)    -0.3069 (0.0774)
           n=4000        "                   -0.306  (0.0543)    -0.3064 (0.0562)
           n=6000        "                   -0.3061 (0.0438)    -0.3063 (0.0446)
           n=10000       "                   -0.3065 (0.0325)    -0.3061 (0.0351)
Beta1      n=100         0.91147 (0.1381)    0.9707 (0.6233)     0.9647 (0.6354)
           n=200         "                   0.9286 (0.4362)     0.9171 (0.4491)
           n=500         "                   0.9347 (0.2635)     0.9227 (0.2823)
           n=1000        "                   0.9227 (0.1953)     0.9134 (0.1935)
           n=2000        "                   0.9192 (0.1387)     0.9137 (0.1345)
           n=4000        "                   0.9130 (0.0949)     0.9136 (0.0990)
           n=6000        "                   0.9128 (0.0783)     0.9131 (0.0792)
           n=10000       "                   0.9141 (0.0590)     0.9130 (0.0625)
Beta2      n=100         0.7362 (0.1120)     0.7705 (0.5002)     0.7641 (0.5076)
           n=200         "                   0.7401 (0.3521)     0.7292 (0.3610)
           n=500         "                   0.7525 (0.2210)     0.7434 (0.2269)
           n=1000        "                   0.7417 (0.1516)     0.7373 (0.1589)
           n=2000        "                   0.7373 (0.1928)     0.7394 (0.1130)
           n=4000        "                   0.7373 (0.0780)     0.7381 (0.0810)
           n=6000        "                   0.7377 (0.0624)     0.7382 (0.0636)
           n=10000       "                   0.7378 (0.0462)     0.7376 (0.0504)
Beta12     n=100         -1.2245 (0.1929)    -1.290 (0.8712)     -1.285 (0.8903)
           n=200         "                   -1.246 (0.6142)     -1.217 (0.6284)
           n=500         "                   -1.250 (0.3851)     -1.236 (0.3961)
           n=1000        "                   -1.241 (0.2708)     -1.226 (0.2740)
           n=2000        "                   -1.233 (0.1077)     -1.228 (0.1944)
           n=4000        "                   -1.225 (0.1344)     -1.228 (0.1386)
           n=6000        "                   -1.227 (0.1097)     -1.228 (0.1118)
           n=10000       "                   -1.228 (0.0817)     -1.227 (0.0878)

5. Discussion and conclusions:

The main question is: for the different sample sizes, which estimate is more precise and reliable? To answer it, note that in the MLE method we construct a function and, by means of differentiation and iterative methods, find the point at which it is maximized, and we use that point as the parameter estimate. We therefore deal only with the mode of the distribution rather than with its shape. In MCMC methods, by contrast, we deal with a posterior distribution, which arises from combining a prior distribution with the likelihood function. In this structure, if the prior is taken to be a known distribution concentrated on a point, then, despite carrying the same unit weight as the likelihood, it dominates the likelihood, continues to play a considerable role in estimating the parameters even as the sample size increases, and can shift the maximum point of the likelihood. If, on the other hand, the prior is unknown, as in many situations, we use a non-informative prior such as U(-100, 100) or N(0, 10E6) to neutralize the effect of the prior on the structure of the posterior. Such a prior then has a negligible effect on the mode of the posterior; conversely, as the sample size increases, the contribution of the likelihood to the posterior grows, and the mode of the posterior coincides with the maximum likelihood estimate. Hence, with a non-informative prior, the posterior approaches the likelihood function as the sample size increases.
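The mode-versus-mean distinction discussed above can be illustrated with a deliberately simple, hypothetical example rather than the paper's logistic model: a single proportion with a uniform (flat) prior. With a flat prior the posterior is proportional to the likelihood, the MLE is its mode, and an MCMC-style estimate is its mean; the R sketch below shows the gap between the two shrinking as the sample size grows, mirroring the pattern in Table 1. The proportion and the grid of sample sizes are illustrative choices.

# Posterior mean vs. mode for a proportion with a uniform (flat) prior.
# With y successes in n trials the posterior is Beta(y + 1, n - y + 1):
#   mode = y / n             (equals the MLE)
#   mean = (y + 1) / (n + 2) (what averaging MCMC draws would report)
compare <- function(n, prop = 0.28) {      # prop is an illustrative true proportion
  y <- round(prop * n)
  c(n = n, mle_mode = y / n, posterior_mean = (y + 1) / (n + 2))
}
t(sapply(c(20, 100, 2000, 10000), compare))
# The difference between the mode (MLE) and the mean (MCMC-style estimate)
# shrinks toward zero as n increases.

For small n the skewness of the posterior makes the mean and the mode visibly different, which is exactly the small-sample discrepancy between the MLE and MCMC columns of Table 1.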
On the other hand, although MLEs are widely used and often very useful, they too have problems. Indeed, unless one adopts a very narrow view of what is required in statistical inference, it turns out that all proposed approaches and techniques have some difficulties associated with them. When the usual regularity conditions hold, the properties of MLEs are generally good, but this is not necessarily the case in non-regular situations. Cheng and Traylor (1995) give a good discussion of the problems that can arise in non-regular cases and of how such problems may be overcome. There are examples where the MLE is neither sufficient nor asymptotically efficient. We can construct examples where MLEs are inconsistent (Korn, 1990) and where they do not exist or are not unique (Bickel and Doksum, 1977); non-uniqueness in turn opens up the possibility that an MLE need not be a function of the minimal sufficient statistic (Levy, 1985). Hengartner (1999) gives an example where the precision of the MLE decreases when additional independent data become available. At a more practical level, in real-world examples the likelihood function may take a complex form, especially when several parameters are estimated. This can lead to problems in actually finding the MLE because of multiple maxima or flat regions on the likelihood surface (Gates, 1993). When a distribution or model contains several parameters, this can also lead to the problem of parameter redundancy (Morgan, 2000). Another type of problem arises when the MLE occurs on the boundary of the region of allowable values; this will often ruin the 'nice' asymptotic properties of the MLE (Catchpole and Morgan, 1994).

We also know that MLE interval estimates are based on an assumption of asymptotic normality of the likelihood, whereas in MCMC simulation the results are exact (up to Monte Carlo error): we sample from the posterior, or from the likelihood function directly when uniform priors are used, and we use the mean of the samples as the parameter estimate (Spiegelhalter, 2005). Now, if the sample size is small, the shape of the posterior, or of the likelihood function, is clearly not quite symmetric, especially when proportions are being estimated. In such distributions the mean, median and mode are not equal; the MLE is the mode of the distribution, which differs from the mean of the distribution reported as the MCMC estimate. As the sample size increases, the shape of the likelihood function, and hence of the posterior distribution, becomes more nearly symmetric and the MLE and MCMC estimates get closer to each other. In column 4 of Table 1 we clearly see the MCMC and MLE estimates approaching each other as the sample size grows, and also that the MCMC standard errors decrease as the sample size increases. The important conclusion from the above discussion is that even for small sample sizes the MCMC estimates are more precise than the MLE estimates, because the MCMC estimates are obtained from the posterior distribution itself, while the MLE estimates rest on an assumption of asymptotic normality of the likelihood function. In summary, it can be said that in the MCMC method (with uniform priors) we in fact obtain the mean of the likelihood rather than its maximum.

References:

Berger, J. O. (2000), Bayesian analysis: a look at today and thoughts of tomorrow. Journal of the American Statistical Association, 95, 1269-1276.

Bickel, P. J. and Doksum, K. A. (1977), Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day, San Francisco.

Carlin, B. P. and Louis, T. A. (2000), Bayes and Empirical Bayes Methods for Data Analysis, second edition. Chapman & Hall.

Catchpole, E. A. and Morgan, B. J. T. (1994), Boundary estimation in ring recovery models. Journal of the Royal Statistical Society, Series B, 56, 385-391.
Cheng, R. C. H. and Traylor, L. (1995), Non-regular maximum likelihood problems (with discussion). Journal of the Royal Statistical Society, Series B, 57, 30-44.

Gates, J. (1993), Testing for circularity of spatially located objects. Journal of Applied Statistics, 20, 95-103.

Geman, S. and Geman, D. (1984), Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Hastings, W. K. (1970), Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.

Hengartner, N. W. (1999), A note on maximum likelihood estimation. The American Statistician, 53, 123-125.

Korn, E. L. (1990), Projecting power from a previous study: maximum likelihood estimation. The American Statistician, 44, 290-292.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953), Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.

Morgan, B. J. T. (2000), Applied Stochastic Modelling. Arnold, London.

Spiegelhalter, D., Thomas, A., Best, N. and Lunn, D. (2003), WinBUGS 1.4 Manual.

Spiegelhalter, D. (2005), personal communication by e-mail, 1 February 2005.

Appendix 1. WinBUGS code

Model 1: direct use of the Bernoulli distribution with a logit link (first approach).

model {
   for (i in 1:100) {
      d[i] ~ dbern(pi[i])                    # binary disease indicator for subject i
      logit(pi[i]) <- beta0 + beta1*x[i] + beta2*z[i] + beta12*x[i]*z[i]
   }
   # non-informative normal priors (mean 0, precision 1.0E-6)
   beta0  ~ dnorm(0, 1.0E-6)
   beta1  ~ dnorm(0, 1.0E-6)
   beta2  ~ dnorm(0, 1.0E-6)
   beta12 ~ dnorm(0, 1.0E-6)
}

Model 2: zeros trick with the Bernoulli likelihood written explicitly (second approach).

model {
   c <- 1000                                 # constant added to keep the Poisson mean positive
   for (i in 1:100) {
      zeros[i] <- 0
      logit(pi[i]) <- beta0 + beta1*x[i] + beta2*z[i] + beta12*x[i]*z[i]
      L[i] <- pow(pi[i], d[i]) * pow(1 - pi[i], 1 - d[i])   # likelihood contribution of subject i
      zeros[i] ~ dpois(phi[i])
      phi[i] <- -log(L[i]) + c               # minus log-likelihood plus constant
   }
   beta0  ~ dnorm(0, 1.0E-6)
   beta1  ~ dnorm(0, 1.0E-6)
   beta2  ~ dnorm(0, 1.0E-6)
   beta12 ~ dnorm(0, 1.0E-6)
}

Model 3: zeros trick with the logistic likelihood written directly in terms of the linear predictor (third approach).

model {
   c <- 1000
   for (i in 1:100) {
      zeros[i] <- 0
      zeros[i] ~ dpois(phi[i])
      phi[i] <- -log(pow(exp(beta0 + beta1*x[i] + beta2*z[i] + beta12*x[i]*z[i]), d[i]) / (1 + exp(beta0 + beta1*x[i] + beta2*z[i] + beta12*x[i]*z[i]))) + c
   }
   beta0  ~ dnorm(0, 1.0E-6)
   beta1  ~ dnorm(0, 1.0E-6)
   beta2  ~ dnorm(0, 1.0E-6)
   beta12 ~ dnorm(0, 1.0E-6)
}
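For completeness, here is a short R sketch (not part of the original appendix) of the maximum likelihood fit referred to in Section 3, using the aggregated counts from the table in Section 2. Because model (1) is saturated for these four covariate patterns, the fitted coefficients and standard errors should match the MLE column of Table 1 up to rounding.

# Maximum likelihood fit of the logit model (1) in R, using the aggregated data of Section 2.
x <- c(0, 0, 1, 1)
z <- c(0, 1, 0, 1)
y <- c(280, 400, 220, 180)   # number diseased in each cell
n <- c(660, 660, 340, 340)   # cell sizes

fit <- glm(cbind(y, n - y) ~ x * z, family = binomial(link = "logit"))
summary(fit)$coefficients    # estimates and standard errors, cf. the MLE column of Table 1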