A Comparison of Parameter Estimates in Standard Logistic Regression using
WinBugs MCMC and MLE methods in R for different sample sizes
Masoud Karimlou 1,* Ph.D., Gholamraza Jandaghi 2 Ph.D., Kazem Mohammad 3 Ph.D.,
Rory Wolfe 4 Ph.D., Kamal Azam 3 Ph.D.
Summary. Estimation of parameters in statistical models is a central concern for statisticians, and frequentists and Bayesians propose different approaches. In this paper we compare MCMC and MLE estimates of the parameters of a standard logistic regression with two covariates and examine their efficiency for different sample sizes, using WinBUGS for the MCMC estimation and R for the maximum likelihood estimation.
The results show that as the sample size increases, the MCMC estimates approach the MLEs and the MCMC standard errors decrease.
Key Words: logistic regression, MCMC, MLE, WinBUGS, R, sample size.
1. Introduction
One of the obstacles in developing Bayesian inference is the complexity of the posterior distributions of the parameters of interest: except in special cases, no closed form for the posterior distribution can be obtained. Bayesian statisticians grappled with this problem until the method of Markov Chain Monte Carlo (MCMC) was established, which made sampling from a posterior distribution possible and allowed inference about the posterior distributions of unknown parameters with the aid of the drawn samples (Carlin & Louis 2000). MCMC is essentially Monte Carlo integration using Markov chains. Bayesians, and sometimes also frequentists, need to integrate over possibly high-dimensional probability distributions to make inference about model parameters or to make predictions. Bayesians need to integrate over the posterior distribution of model parameters given the data, and frequentists may need to integrate over the distribution of observables given parameter values. Monte Carlo integration draws samples from the required distribution and then forms sample averages to approximate expectations. Markov Chain Monte Carlo draws these samples by running a cleverly constructed Markov chain for a long time. There are many ways of constructing these chains, but all of them, including the Gibbs sampler (Geman and Geman, 1984), are special cases of the general framework of Metropolis et al. (1953) and Hastings (1970). However, there were still problems in using Bayesian statistical methods because of the lack of powerful software able to carry out a large amount of computation quickly and efficiently. Berger (2000) argues that the flexibility of Bayesian inference, the existence of different priors for the same problem, the subjectivity of Bayesian inference, and the computational complexity of modern MCMC methods are among the difficulties in developing good software for Bayesian inference.
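To make the Metropolis-Hastings framework mentioned above concrete, the following minimal R sketch implements a random-walk Metropolis sampler. It is an illustration of the general idea rather than code from this paper; the function name metropolis, the tuning constants and the log_post argument (any function returning the log of the target density) are our own illustrative choices.

metropolis <- function(log_post, init, n_iter = 20000, step = 0.1) {
  p <- length(init)
  draws <- matrix(NA_real_, n_iter, p)
  cur <- init
  cur_lp <- log_post(cur)
  for (t in 1:n_iter) {
    prop <- cur + rnorm(p, 0, step)           # symmetric random-walk proposal
    prop_lp <- log_post(prop)
    if (log(runif(1)) < prop_lp - cur_lp) {   # accept with probability min(1, ratio)
      cur <- prop
      cur_lp <- prop_lp
    }
    draws[t, ] <- cur                         # otherwise keep the current state
  }
  draws
}
# Example: sample a standard normal target, discard a burn-in, then average.
draws <- metropolis(function(th) dnorm(th, 0, 1, log = TRUE), init = 0, step = 1)
mean(draws[-(1:5000), ])                      # Monte Carlo estimate of the mean (about 0)

Averages of the retained draws approximate the corresponding expectations, which is precisely the Monte Carlo integration described above.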
1,* Corresponding Author
Department of Biostatistics, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran
E-mail: mkarimlo@uswr.ac.ir
2 University of Tehran, Qom campus
3 Department of Epidemiology and Biostatistics, School of Public Health Researches, Tehran University of Medical Sciences
4 Department of Epidemiology and Preventive Medicine, Monash Medical School, Alfred Hospital, Melbourne, Australia
The BUGS software (Bayesian inference Using Gibbs Sampling), developed at the Biostatistics Unit of the UK Medical Research Council (MRC) in Cambridge, has been able to overcome many difficulties in the inference of statistical models. The Windows version of BUGS, called WinBUGS, has been available to researchers since 1997, and different features have since been added to it (Spiegelhalter et al. 2003). One of its important features, described in the "Tricks" section of the manual, is that the user can supply an arbitrary sampling distribution, i.e. any likelihood function, to the software. WinBUGS can therefore handle a wide variety of statistical inference problems in a short time, without limitation on sample size, on a personal computer. In this paper, we estimate the parameters of a standard logistic regression with two covariates in WinBUGS, both by direct use of the Bernoulli distribution with a logit link function and by specifying the unconditional logistic likelihood, and we compare the results with the maximum likelihood estimates obtained in R.
2. Estimation of parameters in standard logistic regression using MCMC method
Consider the artificial dataset in the following table, in which X and Z are covariates and Y is the dependent (response) variable corresponding to a binary outcome (typically disease status in epidemiologic studies), with Y = 1 indicating a diseased participant and Y = 0 a non-diseased participant. Each row of the table gives the number of diseased participants (Y) out of the cell total (n) for one combination of X and Z.
X   Z   Y (diseased)   n (cell total)
0   0   280            660
0   1   400            660
1   0   220            340
1   1   180            340
In this table, for example, the first row shows that 280 of the 660 persons in the X = 0, Z = 0 category have the disease. These data mimic a study in which Z is split roughly 50-50 percent and X roughly 34-66 percent, with an odds ratio (OR) between 2 and 3, much as in real data. We analysed the data by three Bayesian approaches, described below.
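Before turning to the three approaches, the following minimal R sketch shows one way to set up the artificial data above, both as cell counts and expanded to one record per subject as required by the individual-level WinBUGS models of Appendix 1; the object names (cells, d, x, z) are illustrative, not taken from the paper.

cells <- data.frame(
  x = c(0, 0, 1, 1),
  z = c(0, 1, 0, 1),
  y = c(280, 400, 220, 180),   # number diseased (Y = 1) in each cell
  n = c(660, 660, 340, 340)    # cell size
)

# Expand to one binary record per subject
d <- unlist(mapply(function(y, n) rep(c(1, 0), c(y, n - y)), cells$y, cells$n))
x <- rep(cells$x, cells$n)
z <- rep(cells$z, cells$n)

mean(z)   # about 0.50: Z is split roughly 50-50
mean(x)   # about 0.34: X is split roughly 34-66
with(cells, (y[2] / (n[2] - y[2])) / (y[1] / (n[1] - y[1])))   # odds ratio for Z when X = 0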
2.1 First approach: Let π(x_i) be the probability that the i-th subject has the disease, so that
Y_i ~ Bernoulli(π(x_i)).
Then we have
logit(π(x_i)) = β0 + β1 x_i + β2 z_i + β12 x_i z_i,        (1)
where β0, β1, β2 and β12 are the parameters of the logit model (without random effects) to be estimated. The graphical model for this example is shown in the following figure, in which the relations among the parameters can be seen and from which the full conditional distributions can be obtained.
[Figure 1: Graphical display (directed acyclic graph) of the standard logistic regression model, with nodes beta0, beta1, beta2, beta12, x[i], z[i], p[i], n[i] and d[i] inside a plate for i in 1:100.]
To fit a logistic regression model to these data, and because no informative prior is available, we use non-informative normal priors with mean zero and a large variance (the code in Appendix 1 uses precision 1.0E-6, i.e. variance 1.0E+6). Such priors do not yield a closed form for the posterior or marginal distributions of the parameters, so inference about the model parameters requires an MCMC technique. To perform the MCMC simulation we used WinBUGS. After a burn-in of 10,000 iterations, we drew 20,000 samples from the posterior distribution. Figure 2 shows the posterior distributions of the parameters, and Figure 3 shows the trace plots, which indicate convergence of the chain; the WinBUGS code is given in Appendix 1.
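The MCMC run itself was done in WinBUGS. For readers who prefer to drive WinBUGS from R, the following is a minimal sketch of how the first model of Appendix 1 could be run with the same burn-in and number of retained draws; it assumes the R2WinBUGS package and a local WinBUGS installation, and the object names (bugs_data, logistic_model.txt, and the d, x, z vectors from the earlier sketch) are illustrative rather than taken from the paper.

library(R2WinBUGS)   # assumed add-on package; requires a local WinBUGS 1.4 installation

# Individual-level vectors d, x, z as in the earlier data sketch; the loop
# length in the model file ("for (i in 1:100)") must match length(d).
bugs_data <- list(d = d, x = x, z = z)
inits <- function() list(beta0 = 0, beta1 = 0, beta2 = 0, beta12 = 0)

fit <- bugs(data = bugs_data,
            inits = inits,
            parameters.to.save = c("beta0", "beta1", "beta2", "beta12"),
            model.file = "logistic_model.txt",   # first model of Appendix 1 saved to file
            n.chains = 1,
            n.iter = 30000,          # total iterations per chain
            n.burnin = 10000,        # discarded as burn-in, leaving 20,000 draws
            n.thin = 1)
print(fit, digits = 3)               # posterior means and SDs of the saved parameters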
[Figure 2: Posterior distributions of the standard logistic regression parameters beta0, beta1, beta2 and beta12, each based on 20,000 WinBUGS samples.]
[Figure 3: Trace plots of the standard logistic regression parameters beta0, beta1, beta2 and beta12 over the WinBUGS iterations.]
2.2 Second approach: In this approach we used the logit link function of Relation (1) together with the likelihood function
L(β) = ∏_{i=1}^{n} [π(x_i)]^{y_i} [1 − π(x_i)]^{1−y_i}.
We estimated the model parameters with the WinBUGS "zeros trick" (second model of Appendix 1; see Spiegelhalter et al. 2003). The results are given in column 4 of Table 1.
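The zeros trick works because an artificial observation zeros[i] = 0, modelled as Poisson with mean phi[i] = -log(L_i) + C, contributes exp(-phi[i]) = L_i·exp(-C) to the joint density, i.e. a term proportional to the desired likelihood contribution. The following minimal R sketch, with purely illustrative numbers, checks this identity on the log scale.

L_i   <- 0.37                    # an arbitrary likelihood contribution
C     <- 1000                    # constant that keeps the Poisson mean positive
phi_i <- -log(L_i) + C
dpois(0, phi_i, log = TRUE)      # log Poisson density at zero: equals -phi_i
log(L_i) - C                     # the same value: the log-likelihood term shifted by a constant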
2.3 Third approach: In this approach we used the likelihood function of the standard logistic regression with the two covariates X and Z written out explicitly:
L(β) = ∏_{i=1}^{n} (e^{β0 + β1 x_i + β2 z_i + β12 x_i z_i})^{y_i} / (1 + e^{β0 + β1 x_i + β2 z_i + β12 x_i z_i}).
After standard WinBUGS programming (third model of Appendix 1), the output was identical to that of the second approach.
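As a cross-check on this likelihood, the following minimal R sketch codes L(β) directly and maximizes its logarithm numerically; it uses the d, x, z vectors from the earlier data sketch and the generic optimizer optim(), neither of which is part of the paper's own code.

neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x + beta[3] * z + beta[4] * x * z
  -sum(d * eta - log(1 + exp(eta)))       # minus log L(beta) for the Bernoulli-logit model
}
fit_ml <- optim(c(0, 0, 0, 0), neg_loglik, hessian = TRUE)
fit_ml$par                                # MLEs of beta0, beta1, beta2, beta12
sqrt(diag(solve(fit_ml$hessian)))         # asymptotic standard errors from the observed information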
3. Maximum likelihood estimation of the standard logistic regression model: At this stage we used R to estimate the model parameters for the example of Section 2. The results are given in column 3 of Table 1.
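A minimal sketch of such a fit, assuming the cells data frame (or the individual-level d, x, z vectors) from the earlier sketch, is:

fit_glm <- glm(cbind(y, n - y) ~ x * z, family = binomial, data = cells)   # grouped-data fit
summary(fit_glm)$coefficients             # estimates and standard errors for beta0, beta1, beta2, beta12
# Equivalent individual-level call: glm(d ~ x * z, family = binomial)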
4. Comparison of MLE and MCMC estimates: Although the MLE and MCMC methods target the same parameters, Table 1 shows some differences between the estimates. To identify the cause of this difference, we increased the sample size beyond 100 in such a way that the sample proportions in the X and Z categories did not change, building several datasets with progressively larger sample sizes and analysing each of them. The results of applying the approaches of Section 2 are given in Table 1.
This table has 5 columns as follows:
Column1: parameter name
Column 2: sample size, including 8 different sample sizes
Column 3: MLE estimates in R
Column 4: Bayesian estimates using MCMC in WinBUGS (zeros trick)
Column 5: Bayesian estimates using MCMC (direct use of Bernoulli distribution in WinBUGS)
As can be seen in Table 1, column 3 contains the same value for each parameter at every sample size, because the datasets were constructed to keep the sample proportions fixed, so the maximum likelihood estimates do not change. The values in columns 4 and 5 show that the two Bayesian formulations give almost identical results, and the small remaining difference does not grow as the sample size increases; it may be due to column 5 defining the logistic model directly through the Bernoulli distribution, whereas column 4 results from supplying the logistic likelihood function itself to WinBUGS. Because the purpose of this paper is to compare MLE and MCMC estimates, we focus on columns 3 and 4 of the table. The values in these columns show that once the sample size exceeds 2000, the Bayesian estimates are almost identical to the classical ones.
We also see that the standard errors of the MCMC estimates decrease as the sample size increases, matching the MLE standard errors when the sample size reaches 2000; for sample sizes above 2000, the MCMC standard errors even become smaller than those of the MLEs.
Table 1. Parameter estimates for standard logistic regression with two covariates using the MLE and MCMC methods for different sample sizes (values in parentheses are SDs).
Parameter   Sample size   MLE (R)             WinBUGS MCMC (zeros trick)   WinBUGS MCMC (dbern)
Beta0       n=100         -0.30538 (0.0788)   -0.3229 (0.3538)             -0.3189 (0.3569)
Beta0       n=200         -0.30538 (0.0788)   -0.3023 (0.2447)             -0.3012 (0.2498)
Beta0       n=500         -0.30538 (0.0788)   -0.3175 (0.1523)             -0.3133 (0.1583)
Beta0       n=1000        -0.30538 (0.0788)   -0.3081 (0.1053)             -0.3055 (0.1101)
Beta0       n=2000        -0.30538 (0.0788)   -0.3068 (0.0759)             -0.3069 (0.0774)
Beta0       n=4000        -0.30538 (0.0788)   -0.306  (0.0543)             -0.3064 (0.0562)
Beta0       n=6000        -0.30538 (0.0788)   -0.3061 (0.0438)             -0.3063 (0.0446)
Beta0       n=10000       -0.30538 (0.0788)   -0.3065 (0.0325)             -0.3061 (0.0351)
Beta1       n=100          0.91147 (0.1381)    0.9707 (0.6233)              0.9647 (0.6354)
Beta1       n=200          0.91147 (0.1381)    0.9286 (0.4362)              0.9171 (0.4491)
Beta1       n=500          0.91147 (0.1381)    0.9347 (0.2635)              0.9227 (0.2823)
Beta1       n=1000         0.91147 (0.1381)    0.9227 (0.1953)              0.9134 (0.1935)
Beta1       n=2000         0.91147 (0.1381)    0.9192 (0.1387)              0.9137 (0.1345)
Beta1       n=4000         0.91147 (0.1381)    0.9130 (0.0949)              0.9136 (0.0990)
Beta1       n=6000         0.91147 (0.1381)    0.9128 (0.0783)              0.9131 (0.0792)
Beta1       n=10000        0.91147 (0.1381)    0.9141 (0.0590)              0.9130 (0.0625)
Beta2       n=100          0.7362  (0.1120)    0.7705 (0.5002)              0.7641 (0.5076)
Beta2       n=200          0.7362  (0.1120)    0.7401 (0.3521)              0.7292 (0.3610)
Beta2       n=500          0.7362  (0.1120)    0.7525 (0.2210)              0.7434 (0.2269)
Beta2       n=1000         0.7362  (0.1120)    0.7417 (0.1516)              0.7373 (0.1589)
Beta2       n=2000         0.7362  (0.1120)    0.7373 (0.1928)              0.7394 (0.1130)
Beta2       n=4000         0.7362  (0.1120)    0.7373 (0.0780)              0.7381 (0.0810)
Beta2       n=6000         0.7362  (0.1120)    0.7377 (0.0624)              0.7382 (0.0636)
Beta2       n=10000        0.7362  (0.1120)    0.7378 (0.0462)              0.7376 (0.0504)
Beta12      n=100         -1.2245  (0.1929)   -1.290  (0.8712)             -1.285  (0.8903)
Beta12      n=200         -1.2245  (0.1929)   -1.246  (0.6142)             -1.217  (0.6284)
Beta12      n=500         -1.2245  (0.1929)   -1.250  (0.3851)             -1.236  (0.3961)
Beta12      n=1000        -1.2245  (0.1929)   -1.241  (0.2708)             -1.226  (0.2740)
Beta12      n=2000        -1.2245  (0.1929)   -1.233  (0.1077)             -1.228  (0.1944)
Beta12      n=4000        -1.2245  (0.1929)   -1.225  (0.1344)             -1.228  (0.1386)
Beta12      n=6000        -1.2245  (0.1929)   -1.227  (0.1097)             -1.228  (0.1118)
Beta12      n=10000       -1.2245  (0.1929)   -1.228  (0.0817)             -1.227  (0.0878)
Discussion and conclusions:
The main question is: for different sample sizes, which estimate is more precise and reliable? To answer this question, note that in the MLE method we typically construct a function and, with the aid of differentiation and iterative methods, find its maximum and use that point as the parameter estimate. We therefore deal only with the mode of that function, not with its shape. In MCMC methods, by contrast, we deal with a posterior distribution that arises from combining a prior distribution with the likelihood function.
In this structure, if the prior distribution is concentrated on a point, its weight relative to that of the likelihood is considerable, and it continues to play a substantial role in estimating the parameters even as the sample size increases; such a prior can shift the maximum of the posterior away from the maximum of the likelihood. If, on the other hand, little is known a priori, which is the case in many situations, we use a non-informative prior such as U(-100, 100) or N(0, 10E6) so that the prior has a negligible effect on the structure of the posterior. As the sample size increases, the likelihood increasingly dominates the posterior, and the posterior mode coincides with the maximum likelihood estimate. Therefore, with a non-informative prior, the posterior becomes essentially the same as the likelihood function as the sample size increases.
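The point can be illustrated with the simplest possible case, a single proportion; the following minimal R sketch, with illustrative counts (y and n are our own choices), shows that under a flat prior the posterior mode coincides, up to grid resolution, with the maximum likelihood estimate.

y <- 28; n <- 66                                      # illustrative counts
loglik   <- function(p) y * log(p) + (n - y) * log(1 - p)
logprior <- function(p) dunif(p, 0, 1, log = TRUE)    # flat, non-informative prior
logpost  <- function(p) loglik(p) + logprior(p)
p_grid <- seq(0.001, 0.999, by = 0.001)
p_grid[which.max(logpost(p_grid))]                    # posterior mode
y / n                                                 # maximum likelihood estimate: essentially the same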
On the other hand, although MLEs are widely used and often very useful, they also have problems. Indeed, unless one adopts a very narrow view of what is required in statistical inference, it turns out that all proposed approaches and techniques have some difficulties associated with them. When the usual regularity conditions hold, the properties of MLEs are generally good, but this is not necessarily the case in non-regular situations. Cheng and Traylor (1995) give a good discussion of the problems that can arise in non-regular cases and of how such problems may be overcome. There are examples where the MLE is neither sufficient nor asymptotically efficient. One can construct examples where MLEs are inconsistent (Korn, 1990) and where they do not exist or are not unique (Bickel and Doksum, 1977); non-uniqueness in turn opens up the possibility that an MLE need not be a function of the minimal sufficient statistic (Levy, 1985). Hengartner (1999) gives an example where the precision of the MLE decreases when additional independent data become available. At a more practical level, in real-world examples the likelihood function may take a complex form, especially when several parameters are estimated. This can make it difficult to actually find the MLE because of multiple maxima or flat regions of the likelihood surface (Gates, 1993). When a distribution or model contains several parameters, this can also lead to the problem of parameter redundancy (Morgan, 2000). Another type of problem arises when the MLE occurs on the boundary of the parameter space, which will often ruin the 'nice' asymptotic properties of the MLE (Catchpole and Morgan, 1994).
We also know that MLE interval estimates are based on an assumption of asymptotic normality of the likelihood, whereas in MCMC simulation the results are exact (up to Monte Carlo error): we sample directly from the posterior, or effectively from the likelihood when uniform priors are used, and report the mean of the samples as the parameter estimate (Spiegelhalter 2005). If the sample size is small, the shape of the posterior, or of the likelihood, is clearly not symmetric, especially when proportions are being estimated. In such distributions the mean, median and mode are not equal: the mode of the distribution is used as the MLE, which differs from the mean of the distribution reported as the MCMC estimate. As the sample size increases, the likelihood, and hence the posterior, becomes more symmetric, and the MLE and MCMC estimates move closer to each other. Column 4 of Table 1 clearly shows the MCMC and MLE estimates converging as the sample size grows, and the MCMC standard errors decreasing with increasing sample size. The important conclusion from this discussion is that, even for small sample sizes, the MCMC estimates are more precise than the MLE estimates, because the MCMC estimates are obtained from the posterior distribution itself whereas the MLE inference rests on the asymptotic normality assumption for the likelihood.
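This behaviour is easy to verify for a single proportion with a uniform prior, where the posterior is Beta(y + 1, n − y + 1): its mode is the MLE y/n, while its mean is (y + 1)/(n + 2). The short R sketch below, with illustrative counts of our own choosing, shows the gap for a small sample and its near disappearance for a large one.

y <- 2;   n <- 10
c(mle_mode = y / n, posterior_mean = (y + 1) / (n + 2))   # 0.200 versus 0.250
y <- 200; n <- 1000
c(mle_mode = y / n, posterior_mean = (y + 1) / (n + 2))   # 0.2000 versus 0.2006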
In summary, it can be said that in the MCMC method (with uniform priors) we in effect obtain the mean of the likelihood rather than its maximum.
References:
Berger J.O. (2000), Bayesian Analysis: A look at today and thought of tomorrow. Journal of the
American Statistical Association, 95, 1269-1276.
Bickel P. J. and Doksum K. A. (1977), Mathematical Statistics; Basic Ideas and Selected Topics.
Holden-Day, San Francisco.
Carlin B.P. and Louis T.A. (2000), Bayes and Empirical Bayes Methods for Data Analysis. Second
edition, Chapman & Hall.
Catchpole, E. A. and Morgan, B. J. T. (1994). Boundary estimation in ring recovery models. J. Roy.
Statist. Soc. Ser. B 56 385-391.
Cheng R. C. H. and Traylor L. (1995), Non-regular Maximum Likelihood Problem (with discussion).
J. R. Statist. Soc. B57, 30-44.
Gates J. (1993), Testing for Circularity of Spatially Located Objects. J. Appl. Statistics 20, 95-103.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. Pattn. Anal. Mach. Intel., 6, 721-741.
Hastings W. K. (1970), Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57, 97-109.
Hengartner N. W. (1999), A Note on Maximum Likelihood Estimation. Amer. Statistician 53, 123-125.
Korn E. L. (1990), Projecting Power From a Previous Study: Maximum Likelihood Estimation. Amer.
Statistician 44, 290-292.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953)
Equation of state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.
Morgan B. J. T.(2000), Applied Stochastic Modelling. Arnold, London.
Spiegelhalter D., Thomas A., Best N., Lunn D. (2003), WinBUGS 1.4 Manual.
Spiegelhalter D., personal communication by e-mail on Tue,01 Feb 2005.
Appendix 1.
# Approach 1 (Section 2.1): direct Bernoulli likelihood with logit link
model {
for (i in 1:100){
d[i]~dbern(pi[i])
logit(pi[i])<-beta0+beta1*x[i]+beta2*z[i]+beta12*x[i]*z[i]
}
# vague normal priors: mean 0, precision 1.0E-6 (variance 1.0E+6)
beta0~dnorm(0,1.0E-6)
beta1~dnorm(0,1.0E-6)
beta2~dnorm(0,1.0E-6)
beta12~dnorm(0,1.0E-6)
}

# Approach 2 (Section 2.2): likelihood supplied through the "zeros trick"
model {
c<-1000                        # constant keeping the Poisson mean positive
for (i in 1:100){
zeros[i]<-0                    # artificial zero observation
logit(pi[i])<-beta0+beta1*x[i]+beta2*z[i]+beta12*x[i]*z[i]
L[i]<-(pow(pi[i],d[i]))*(pow((1-pi[i]),(1-d[i])))   # Bernoulli likelihood term
zeros[i]~dpois(phi[i])         # contributes exp(-phi[i]) = L[i]*exp(-c)
phi[i]<- -log(L[i]) +c
}
beta0~dnorm(0,1.0E-6)
beta1~dnorm(0,1.0E-6)
beta2~dnorm(0,1.0E-6)
beta12~dnorm(0,1.0E-6)
}

# Approach 3 (Section 2.3): logistic likelihood written out explicitly (zeros trick)
model {
c<-1000
for (i in 1:100){
zeros[i]<-0
zeros[i]~dpois(phi[i])
phi[i]<- -log((pow(exp(beta0+beta1*x[i]+beta2*z[i]+beta12*x[i]*z[i]) , d[i]))
/(1+exp(beta0+beta1*x[i]+beta2*z[i]+beta12*x[i]*z[i]))) +c
}
beta0~dnorm(0,1.0E-6)
beta1~dnorm(0,1.0E-6)
beta2~dnorm(0,1.0E-6)
beta12~dnorm(0,1.0E-6)
}