Beta Regression: Practical Issues in Estimation (Supplementary Material)
Michael Smithson
The Australian National University
Michael.Smithson@anu.edu.au
and
Jay Verkuilen
University of Illinois at Urbana-Champaign
jvverkuilen@gmail.com
November 19, 2005
General Procedures
This document supplements the paper by Smithson and Verkuilen (2005) on beta
regression, and focuses on maximum likelihood estimation procedures in several statistical
packages. Our discussion here should not be considered a substitute for thoroughly reading
the documentation for any software you plan to use.
Maximizing the likelihood function must be done numerically. Newton-Raphson or
quasi-Newton methods work well and generate asymptotic standard errors as a byproduct of
the estimation, unlike derivative-free methods such as the simplex method. Ferrari and
Cribari-Neto (2004) use Fisher scoring, which substitutes the expected Hessian for the
observed Hessian. Buckley (2002) used MCMC estimation in WinBUGS, which provides a
Bayesian posterior density. Buckley (2002) also provided Stata code and Paolino (2001)
provided Gauss code, both of which compute maximum likelihood estimates. We have
estimated beta regressions using R/S-Plus, SAS, SPSS, Mathematica, and WinBUGS. Syntax
and/or script files for all of these packages are freely available on this site.
Different packages require more or less information from the user. SAS, for instance, only
requires the likelihood and analytically computes the derivatives for Newton methods,
whereas Mathematica’s Newton-Raphson routine requires the user to supply expressions for
the derivatives. The major difference between Newton-Raphson and quasi-Newton is in the
number of function evaluations per iteration (more for Newton-Raphson) and the number of
iterations necessary (more for quasi-Newton). The domain of convergence, and hence the
importance of good starting values, also differs across algorithms.
The trust-region Newton method implemented in SAS seems to be particularly stable on
hard problems and we recommend its use when convergence might be a problem. In general,
speed of execution for ML estimation is proportional to N + p², where N is the sample size
and p is the number of parameters, but this will depend on the algorithm. We have never
observed a well-specified model given good starting values taking longer than a few seconds
to converge, even on a fairly modest laptop, but these models have had no more than a dozen
or so parameters. Obviously, experience will vary for larger data sets and more complicated
models.
If a Newton or quasi-Newton method is used, asymptotic standard errors usually are
estimated from the inverse of the final Hessian matrix. Bayesian estimation gives posterior
densities from which the Bayesian analogs of frequentist stability measures can be taken, e.g.,
the 2.5% and 97.5% quantiles of the posterior density as analogous quantities to a 95%
confidence interval. Though it is generally recommended in the literature that the Newton
estimate of the Hessian be used to provide asymptotic standard errors, we have tried both
methods on several data sets and it has never seemed to make an appreciable difference. It
may be desirable to use heteroscedasticity-consistent standard errors in a beta regression.
Constructing these values is discussed in detail in Hardin & Hilbe (2003, pp. 28-32). SAS
proc glimmix implements this for location-only models using the “empirical” keyword.
We intend to put this calculation in the examples but have not at this point.
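In the meantime, a rough sketch of how such a sandwich calculation might look in R (not part of our scripts; it assumes the betareg function and the ydata, xdata, and wdata objects defined in the R code at the end of this document, a vector of starting values start, and the numDeriv package, and is only in the spirit of the estimators discussed by Hardin and Hilbe, 2003):
# Sketch only: heteroscedasticity-consistent ("sandwich") standard errors built
# from the observed Hessian and per-observation score contributions.
fit <- optim(start, betareg, hessian = TRUE, x = xdata, y = ydata,
             z = wdata, method = "BFGS")
H <- fit$hessian                      # Hessian of the negative log-likelihood
bread <- solve(H)                     # usual asymptotic covariance matrix
scores <- t(sapply(seq_len(nrow(xdata)), function(i) {
  # numerical score of observation i's negative log-likelihood contribution
  numDeriv::grad(function(h) betareg(h, ydata[i], xdata[i, , drop = FALSE],
                                     wdata[i, , drop = FALSE]), fit$par)
}))
meat <- crossprod(scores)             # sum of outer products of the scores
vcov_hc <- bread %*% meat %*% bread   # sandwich covariance matrix
se_hc <- sqrt(diag(vcov_hc))          # heteroscedasticity-consistent standard errors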
Well-chosen starting values are needed to ensure convergence when more than a few
variables are included in the model, particularly when dispersion covariates are used. (For
location-only models, convexity provides a global optimum.) We have found three effective
approaches to generating starting-values. Ferrari and Cribari-Neto (2004) suggest using the
OLS estimators from the regression on the link-transformed dependent variable for the
location sub-model. For example, if the location sub-model link is the logit, then starting-values for the β would be obtained via
ln(yᵢ/(1 – yᵢ)) = Xβ_OLS + εᵢ.
In general the OLS estimates track the location model tolerably well, though of course the
standard errors will be inefficient and, more importantly, the variance structure will be
limited to an intercept parameter. Unfortunately, to our knowledge there seems to be no
similar proposal for starting-values of the coefficients in the dispersion sub-model.
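For illustration, a minimal R sketch of this suggestion (not part of the scripts below; IV1 and IV2 are hypothetical covariate names, and y is assumed to lie strictly inside (0,1)):
# Sketch: OLS starting values for the location sub-model under a logit link.
ylogit <- log(y / (1 - y))            # link-transformed dependent variable
ols <- lm(ylogit ~ IV1 + IV2)         # ordinary least squares on the logits
beta_start <- coef(ols)               # starting values for the location coefficients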
The second approach is to begin with the null model (i.e., intercept-only sub-models with
coefficients β₀ and δ₀), using the method of moments to provide starting-values for β₀ and δ₀.
Then the resultant maximum-likelihood estimates of β₀ and δ₀ are used along with starting-values
close to 0 for β₁ and/or δ₁ (β₁ = δ₁ = 0.1, say), and maximum-likelihood estimates
for β₀, δ₀, β₁ and/or δ₁ are obtained. These in turn are used as starting-values for the next
more complex model, with starting-values near 0 used for each new term being introduced
into the model. Models can be built up one or more terms at a time in this way, although
adding more than one term at a time is riskier.
A third approach is to provide a grid of starting values and estimate the model over this
grid, looking for changes in the computed likelihood function and the estimates. If the
computed solution is essentially the same over the grid, the global optimum is (probably)
achieved. This procedure is very effective at avoiding local optima but it is very time
consuming for a model with many parameters because the size of the grid grows
multiplicatively with the number of parameters. For p parameters and k values for each
parameter, the number of estimations is k^p, which is large even for moderate p and k; e.g., p =
5 and k = 3 requires 243 estimations. An intermediate strategy is to use a grid over only a
subset of the parameters, e.g., the dispersion sub-model only. SAS provides a facility to
automate this procedure, but in other environments it will probably be necessary to do some
programming.
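For illustration, a minimal R sketch of such a grid search (assuming the betareg log-likelihood function and the ydata, xdata, and wdata objects defined in the R code at the end of this document, and a dispersion sub-model containing an intercept and one covariate):
# Sketch: a coarse grid of starting values over the dispersion sub-model only.
loc_start <- rep(0, ncol(xdata))                        # crude location starts
grid <- expand.grid(d0 = c(-2, 0, 2), d1 = c(-1, 0, 1)) # dispersion start grid
fits <- lapply(seq_len(nrow(grid)), function(i) {
  start_i <- c(loc_start, grid$d0[i], grid$d1[i])
  optim(start_i, betareg, x = xdata, y = ydata, z = wdata, method = "BFGS")
})
# If the attained minima and estimates essentially agree across the grid,
# the solution is probably the global optimum.
round(sapply(fits, function(f) f$value), 4)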
Another estimation issue is that covariates with particularly large absolute values may
result in a loss of precision in estimates or create problems for the estimation algorithms. This
is due to the evaluation of exponents, and we have found that absolute values greater than
around 30 can cause difficulties. These covariates may need to be rescaled to smaller ranges
for estimation, such as by z-scoring. Commercial packages typically do this silently with
models such as logistic regression. The loss of numerical precision due to variables with
widely different scales, e.g., one variable in millions and the other in thousandths, is common
to all numerical procedures and is not unique to beta regression. Rescaling in this case is
required.
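For example (a trivial sketch; income is a hypothetical covariate measured in dollars):
# Sketch: rescale a covariate with very large absolute values before estimation.
income_z <- as.vector(scale(income))   # z-score: mean 0, standard deviation 1
income_k <- income / 1000              # or simply change the units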
A third practical issue is the treatment of endpoints. Although 0 and 1 may be genuine
outcome values, their logits are undefined. Two obvious remedies are proportionally
“shrinking” the range to a sub-range nearly covering the unit interval (e.g., [.01, .99]) or
simply adding a small amount to 0-valued observations and subtracting the same amount
from 1-valued observations while leaving the other observations unchanged. Both methods
bias the estimates toward no effect. A method that is frequently used in practice in areas such
as signal detection theory is to add 1/(2N) to a 0 observation and subtract 1/(2N) from a 1
observation, where N is the total number of observations (Macmillan and Creelman, 2005, pp.
8-9). This has no effect on the interior points but could introduce bias if there are a non-trivial
number of boundary values. The best option may be to experiment with different endpoint
handling schemes and see whether the parameter estimates change in any appreciable way.
Our recommendation applies to a sample of scores from a continuous bounded scale that
has already been linearly transformed to the [0,1] interval. We may transform the sample
scores to a variable in the open (0,1) interval by the weighted average:
y′ = [y(N – 1) + s]/N,
where N is the sample size and s is a constant between 0 and 1. From a Bayesian standpoint, s
acts as if we are taking a prior into account. A reasonable choice for s would be .5.
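A one-line R helper implementing this transformation (a sketch; the name squeeze01 is ours, and y is assumed already scaled to [0,1]):
# Sketch: compress [0,1] scores into the open interval (0,1) with s = .5.
squeeze01 <- function(y, s = 0.5) (y * (length(y) - 1) + s) / length(y)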
Related to this point, it is occasionally possible to create an overflow in the objective
function. To see how this is possible, consider the log-likelihood function:
lnL = ln() – ln() – ln((1 – )) + ( – 1)lny + ((1 – ) – 1)ln(1 – y).
While the data are bounded away from 0 or 1, it is possible for the predicted values at any
given iteration to become very small or very large. This can cause the log-gamma function or
the polygamma function (derivative of the log-gamma function) to overflow. This usually
causes program termination and definitely causes meaningless results. This is unlikely to
happen for most data, but we have observed it once in a fairly complex model with random
effects. This seemed to occur when μ and φ were both very small, meaning their product was
even smaller. A practical solution is to alter the predicted values μ and φ before substituting
them into the likelihood function, as follows:
μ_adjust = ε₁ + (1 – 2ε₁)μ,
φ_adjust = ε₂ + (1 – ε₂)φ,
where ε₁ and ε₂ are small non-negative constants, e.g., .0001. The effect of these transformations
is to gently bound the predicted values away from 0 and 1 for μ, and away from 0 for φ. Using this
transformation biases results towards being non-significant, so the constants should be chosen
as small as possible.
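A minimal R sketch of these adjustments (the function names are ours; eps1 and eps2 play the roles of ε₁ and ε₂ above):
# Sketch: gently bound the predicted values away from the boundaries before
# they enter the log-likelihood.
adjust_mu  <- function(mu,  eps1 = 1e-4) eps1 + (1 - 2 * eps1) * mu
adjust_phi <- function(phi, eps2 = 1e-4) eps2 + (1 - eps2) * phi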
Finally, researchers will need to consider whether beta regressions can be applied to a
discrete dependent variable, even an interval-level one. It is common practice to apply
normal-theory regression to variables that have only a few scale values, e.g., survey items
with a response range from 1 to 7, and researchers are likely to be motivated to do the same
with beta regression. Tamhane, Ankenman and Yang (2002) discuss the use of a beta
distributed underlying variable for ordinal data. They indicate that using the raw discrete
scores works reasonably well, though they provide a “continuizing” procedure that they show
improves the mean square error somewhat. Their procedure simply amounts to uniformly and
randomly assigning values to identical observations within the range of their “bin.”
For the present, researchers will have to rely on practical experience in the absence of
theoretical results regarding this issue. A simulation study showing how beta regression
degrades in the presence of discretization would be desirable; if in doubt, compare an ordinal
regression such as ordinal logit with beta regression. It is possible to use symmetry constraints
on the thresholds to estimate a model that is in between the usual ordinal regression and the
beta model, but most packages do not support constraints on the thresholds. In the meantime,
we recommend a linear transformation of a discrete equal-interval scale that assigns values to
bin midpoints. A variable Y with n + 1 scale values {a, a+b, a+2b, …, a+nb} is transformed
into
y = [2(Y – a) + b]/[2b(n + 1)].
For example, the seven scale values {1, 2, 3, 4, 5, 6, 7} are transformed into {1/14, 3/14, …,
13/14} which are the midpoints of the seven bins equally dividing the [0,1] interval.
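A minimal R sketch of this transformation (the function name is ours):
# Sketch: map an equally spaced discrete scale {a, a+b, ..., a+nb} onto the
# midpoints of the n+1 equal bins dividing [0,1].
to_bin_midpoints <- function(Y, a, b, n) (2 * (Y - a) + b) / (2 * b * (n + 1))
to_bin_midpoints(1:7, a = 1, b = 1, n = 6)   # 1/14, 3/14, ..., 13/14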
Implementation in statistical packages
In this section we briefly describe the implementations in R/S-Plus, SPSS, and SAS. All
three implementations require the user to input model terms and starting-values, but
otherwise make no special demands. Whenever possible, we recommend fitting beta
regression models in more than one package and using more than one set of starting-values
and optimization method to check convergence. We have done this for many models and beta
regression seems to converge readily to the optimum, assuming starting values are good.
Models with several terms in the dispersion sub-model, however, are more difficult to
estimate.
Our current implementation in R and S-Plus provides the maximum likelihood value
attained by the model, coefficients, gradient for the coefficients, asymptotic standard-error
estimates and p-values. Predicted values and residuals also can be output or saved in the data
file. It uses two functions ("betareg" and "grad") and a small set of output commands, all of
which can be used in the Commands window or saved as scripts. The S-Plus version uses the
nlminb quasi-Newton routine for MLE and the Venables and Ripley (1999) vcov.nlminb function in
the MASS library for computing the Hessian and the asymptotic estimates. The R version
uses the optim routine for MLE with the BFGS quasi-Newton method. A version of optim for
S-Plus also is available in the MASS library.
In SPSS, beta regression models can be fitted under its Nonlinear Regression (NLR and
CNLR) procedure. We have written a syntax shell and documentation so that users need only
supply a model and starting-values. The CNLR procedure outputs an iteration history and a
convergence message. SPSS uses numerical approximations to the derivatives, so the analytic
score function and Hessian are not needed. There are subcommands that provide predicted
values, residuals, gradient values, and bootstrap standard-error estimates for the coefficients.
However, the Hessian and asymptotic standard error estimates are not currently available.
In SAS, proc nlmixed, proc nlin, or proc glimmix will estimate a beta
regression. Our examples all use nlmixed, but SAS syntax is fairly similar across all
procedures. The glimmix program does not allow a heteroscedasticity model but does
generate reasonable estimates for variance component models with beta responses via
penalized quasi-likelihood, whereas the random location effects models we have tried using
nlmixed computed by adaptive Gaussian quadrature typically did not converge. To
compute a heteroscedasticity model, nlmixed or nlin will be necessary. Anyone who has
a basic familiarity with SAS can code up their own model in nlmixed using our heavily
commented syntax files as an example. We recommend thoroughly reading the program
documentation. These programs have many options and the documentation provides a wealth
of advice on model convergence, fit, etc.
There are many choices for the numerical optimizer to be used. Even for Newton methods,
the analytic score function and Hessian are not needed as SAS uses automatic differentiation
to obtain the derivatives (a flag can be added to display the equation for the gradient and
Hessian). The quasi-Newton method BFGS seems to be a solid all-around choice in practice,
with the trust region Newton method being useful for more difficult problems. The output of
nlmixed is extensive. Detailed iteration information is available for each estimated
parameter, along with the likelihood, AIC, BIC, and asymptotic confidence intervals. In
addition to the basic output, it is possible to use additional command statements to generate
predicted values, residuals, and to save output in RTF and LaTeX format. To obtain bootstrap
or jackknife statistics, SAS Institute has the freely downloadable jackboot macro on their
website.
References
Buckley, J. (2002). Estimation of models with beta-distributed dependent variables: A
replication and extension of Paolino (2001). Political Analysis, 11, 1-12.
Cribari-Neto, F. and Vasconcellos, K. L. P. (2002). Nearly unbiased maximum likelihood
estimation for the beta distribution. Journal of Statistical Computation and Simulation,
72, 107-118.
Ferrari, S. L. P. and Cribari-Neto, F. (2004). Beta Regression for Modeling Rates and
Proportions. Journal of Applied Statistics, 31, 799-815.
Hardin, J. W. and Hilbe, J. E. (2003). Generalized Estimating Equations, Boca Raton, FL:
CRC/Chapman & Hall.
Macmillan, N. A. and Creelman, C. D. (2005). Detection Theory: A User’s Guide, Mahwah,
NJ: Lawrence Erlbaum Associates.
Paolino, P. (2001). Maximum likelihood estimation of models with beta-distributed
dependent variables. Political Analysis 9, 325-346.
SAS Institute (2005). The GLIMMIX Procedure. Cary, NC: SAS Institute.
Smithson, M. and Verkuilen, J. (2005). A Better Lemon-Squeezer? Maximum Likelihood
Regression with Beta-Distributed Dependent Variables.
Tamhane, A. C., Ankenman, B.E. and Yang, Y. (2002). The Beta Distribution as a Latent
Response Model for Ordinal Data (I): Estimation of Location and Dispersion
Parameters. Unpublished manuscript, Department of Industrial Engineering and
Management Sciences, Northwestern University, Evanston, IL.
Venables, W. N. and Ripley, B. D. (1999). Modern Applied Statistics with S-PLUS. New
York: Springer-Verlag.
S-Plus/R Code for Examples 1-3
#This code is for S-PLUS (but the functions also run in R):
#Be sure to load the MASS library before using this.
#The IVs and DVs are defined in the four statements below.
#Substitute your dependent variable name for 'DV' in the ydata statement
#and your two sets of independent variables for 'IV, …'
#in the xdata and wdata statements.
##
ydata <- cbind(DV);
const <- rep(1,length(ydata));
xdata <- cbind(const, IV, …);
wdata <- cbind(const, IV, …);
##
#The initial starting values for the null model are
#generated by using the method of moments on ydata.
start <- c(log(mean(ydata)/(1-mean(ydata))), -log((mean(ydata) - (mean(ydata))^2 - var(ydata))/var(ydata)))
##
#The betareg function computes the log-likelihood. It removes missing data.
##
betareg <- function(h, y, x, z)
{
hx = x%*%h[1: length(x[1,])]
mu = exp(hx)/(1+exp(hx))
gz = z%*%h[length(x[1,])+1: length(z[1,])]
phi = exp(-gz)
loglik = lgamma(phi) - lgamma(mu*phi) - lgamma(phi - mu*phi) + mu*phi*log(y) + (phi - mu*phi)*log(1 - y) - log(y) - log(1 - y)
-sum(loglik, na.rm = TRUE)
}
##
#The grad function computes the gradient.
##
grad <- function(h, y, x, z)
{
hx = x%*%h[1: length(x[1,])]
gz = z%*%h[length(x[1,])+1: length(z[1,])]
gd <- cbind(x*rep(exp(-gz+hx)*(log(y/(1-y)) + digamma(exp(-gz)/(1+exp(hx))) - digamma(exp(-gz+hx)/(1+exp(hx))))/(1+exp(hx))^2, length(x[1,])),
-z*rep(exp(-gz)*(log(1-y) + exp(hx)*log(y) + (1+exp(hx))*digamma(exp(-gz)) - digamma(exp(-gz)/(1+exp(hx))) - exp(hx)*digamma(exp(-gz+hx)/(1+exp(hx))))/(1+exp(hx)), length(z[1,])))
colSums(gd, na.rm = TRUE)
}
##
#This is the optimizer function:
##
betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
##
#These are the parameters, standard errors, z-stats and significance-levels.
##
outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat", "prob"); outfit
##
#These are the negative log-likelihood and convergence message, followed by the gradient at the solution.
##
c(betafit$objective,betafit$message)
grad(betafit$parameters,ydata,xdata,wdata)
#Example 2 from Smithson and Verkuilen (2005) in an S-Plus session
> attach(Example2)
> ydata <- cbind(Anxiety);
> const <- rep(1,length(ydata));
> start <- c(log(mean(ydata)/(1-mean(ydata))), -log((mean(ydata) - (mean(ydata))^2 - var(ydata))/var(ydata)))
> start
[1] -2.299012 -1.313792
> #We enter the betareg and grad functions.
[SNIP—FUNCTIONS ENTERED AS IN S-PLUS CODE]
> #We start with the null model.
> xdata <- cbind(const); wdata <- cbind(const);
> betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
> #This is the negative log-likelihood and convergence message.
> c(betafit$objective,betafit$message)
[1] "-239.448005712066"
"RELATIVE FUNCTION CONVERGENCE"
> #These are the estimates, asymptotic standard errors and z-statistics.
> outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat
<- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat",
"prob"); outfit
              x.1          x.2
estim  -2.24396066   -1.7956442
serr    0.09757744    0.1240222
zstat -22.99671641  -14.4784088
prob    0.00000000    0.0000000
> #Now we estimate a model with Stress in the location submodel.
> #We also specify new values for “start”.
> xdata <- cbind(const, Stress); start <- c(-2.244, 0.1, -1.796);
> betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
> c(betafit$objective,betafit$message)
[1] "-283.006657335412"
"RELATIVE FUNCTION CONVERGENCE"
> #Check that the gradient is 0.
> grad(betafit$parameters,ydata,xdata,wdata)
        const        Stress         const
 1.5892e-006 8.478869e-007  1.07663e-006
> outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat
<- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat",
"prob"); outfit
             x.1         x.2        x.3
estim  -3.4790186   3.7495930  -2.457410
serr    0.1529186   0.3379374   0.126031
zstat -22.7507931  11.0955267 -19.498451
prob    0.0000000   0.0000000   0.000000
> #Finally, we estimate a model that includes Stress in the dispersion submodel.
> wdata <- cbind(const, Stress); start <- c(-3.479, 3.75, -2.46, 0.1);
> betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
> c(betafit$objective,betafit$message)
[1] "-301.9600873253"
"RELATIVE FUNCTION CONVERGENCE"
> grad(betafit$parameters,ydata,xdata,wdata)
         const        Stress          const         Stress
 8.977042e-007 2.552195e-007 -1.098784e-006 -2.781475e-007
> outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat
<- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat",
"prob"); outfit
             x.1         x.2          x.3            x.4
estim  -4.0237159   4.9413672   -3.9608469  4.273312e+000
serr    0.1438478   0.4480357    0.2121417  6.360746e-001
zstat -27.9720401  11.0289595  -18.6707588  6.718256e+000
prob    0.0000000   0.0000000    0.0000000  9.195644e-012
#This code is for R (but the optimizer also runs in S-Plus via the MASS library).
#It’s a good idea to load the MASS library before using this.
#You have to run the betareg and grad functions before this next step.
#This is the optimizer function using BFGS
#(which is what we would recommend):
##
betaopt <- optim(start, betareg, hessian = T, x = xdata, y = ydata, z = wdata, method = "BFGS")
##
#These are the parameters, standard errors, z-stats and significance-levels.
##
outopt <- rbind(estim <- betaopt$par, serr <- sqrt(diag(solve(betaopt$hessian))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outopt) <- c("estim", "serr", "zstat", "prob"); outopt
##
#These are the negative log-likelihood, gradient, and convergence message.
##
betaopt$value
grad(betaopt$par,ydata,xdata,wdata)
betaopt$convergence
#Example 3 from Smithson and Verkuilen (2005) in an R session
#This code shows how to obtain the final model in Example 3.
> ex3 <- read.table("Example3.txt", header = TRUE); attach(ex3)
> library(MASS)
> ydata <- cbind(accur01);
> const <- rep(1,length(ydata));
> xdata <- cbind(const, grpctr, ziq, grpctr*ziq);
> wdata <- cbind(const, grpctr, ziq);
[SNIP—FUNCTIONS ENTERED AS IN S-PLUS CODE]
> start <- c(1.0, -1.0, 0.5, -0.5, -3.0, -1.5, -1.5);
> betaopt <- optim(start, betareg, hessian = T, x = xdata, y = ydata, z = wdata, method = "BFGS")
> c(betaopt$value, betaopt$convergence)
[1] -65.90186   0.00000
> grad(betaopt$par,ydata,xdata,wdata)
       const      grpctr         ziq                    const       grpctr          ziq
 0.019330881 0.017862284 0.001995802 -0.002906731 -0.005810820 -0.005330392  0.003144999
> outopt <- rbind(estim <- betaopt$par, serr <- sqrt(diag(solve(betaopt$hessian))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outopt) <- c("estim", "serr", "zstat", "prob"); outopt
[SNIP—THIS PART OF THE OUTPUT DELETED]
#Example 1 from Smithson and Verkuilen (2005) in an R session
#This code shows how to obtain the final model in Example 1.
> ex1 <- read.table("Example1.txt", header = TRUE); attach(ex1)
> library(MASS)
> ydata <- cbind(crc99);
> const <- rep(1,length(ydata));
> xdata <- cbind(const, vert, confl, vert*confl);
> wdata <- cbind(const, vert, confl, vert*confl);
[SNIP—FUNCTIONS ENTERED AS IN S-PLUS CODE]
> start <- c(.88, -.01, .15, .19, -1.1, .34, -.22, .1);
[SNIP—FUNCTIONS ENTERED AS IN EXAMPLE 3]
___________________________________________________________________________
SPSS Nonlinear Regression syntax components
The beta-regression models can be run in SPSS under its Nonlinear Regression (NLR and CNLR)
procedure. This can be done either via the GUI or syntax, but we shall restrict our attention to the syntax
approach here. SPSS requires three main components for these models:
1. A formula for the location and dispersion submodels,
2. A formula for the negative log-likelihood kernel, which is the loss-function to be minimized, and
3. Names and starting-values for the model parameters.
Formula for the location and dispersion submodels
In keeping with our notation, we will denote the location submodel parameters by Bj, where j is an index
number starting with 0 for the intercept. The location submodel formula takes the form
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU))
where #MU is a linear combination of predictors,
COMPUTE
#MU = B0 + … .
For example, in a model that predicts anxiety (ANXIETY) by stress (STRESS) in the location submodel, the
location submodel formula is defined by
COMPUTE
#MU = B0 + B1*STRESS .
Likewise, we denote the dispersion submodel parameters by Dk, where k is also an index number. The
dispersion submodel formula takes the form
COMPUTE PHI = EXP(-D0 - … ) .
For example, in a model that has STRESS in the dispersion submodel, the dispersion submodel formula is
COMPUTE PHI = EXP(-D0 - D1*STRESS) .
Model parameters and starting-values
The easiest way to generate starting-values for the location and dispersion model coefficients is to begin
with a null model (i.e., a model that just has B0 and D0), and then add coefficients one variable at a time in each
model. The procedure for this is described below.
To start with, obtain the mean and variance of the dependent variable. In the ANALYZE menu under the
DESCRIPTIVE STATISTICS item choose the DESCRIPTIVES option. The DESCRIPTIVES dialog box
appears. Click the Options… button and ensure that the Mean and Variance check-boxes have ticks in them. In
our example, the resulting syntax should look like this:
DESCRIPTIVES
VARIABLES=ANXIETY
/STATISTICS=MEAN VARIANCE .
The output should look like this:
Descriptive Statistics
                       N     Mean   Variance
Anxiety              166   .09120       .018
Valid N (listwise)   166
To obtain a more precise estimate of the variance, double-click on the table and then double-click inside the cell
containing “.018.” In the example here we will use the more precise value of .01756.
Now, given the mean μ and variance σ², we obtain our starting-values for B0 and D0 as follows:
B0 = ln[μ/(1 – μ)]
and
D0 = –ln[(μ – μ² – σ²)/σ²].
In our example these formulas yield the following results:
B0 = ln[.09120/(1 – .09120)] = –2.29907
D0 = –ln[(.09120 – (.09120)² – .01756)/.01756] = –1.31371.
Once the null model has been estimated then the coefficients for that model are used as starting-values for a
model that includes one Bj or Dk. The starting-values for the additional Bj or Dk can be set to arbitrary values
near 0 (e.g., 0.1). Likewise, when the new model has been estimated then its coefficients become the starting-values for the next model. This process is illustrated in the example provided below.
Example 2 from Smithson & Verkuilen (2005): Anxiety predicted by Stress
Let  = Anxiety, x0 = w0 =1, and x1 = w1 = Stress. We will end up fitting the following model:
i = exp( + x1i)/[1 + exp( + x1i)], and
i = exp(–– w1i).
First, however, we fit the null model. Open the SPSS file called Example2.sav.
Null Model
We start by computing the null model for our example, using the syntax shell with the appropriate substitutions
for parameter values, submodel formulae, and dependent variable name. The starting-values for B0 and D0 are
the ones derived earlier from the formulas using the sample mean and variance.
MODEL PROGRAM B0 = -2.29907 D0 = -1.31371 .
COMPUTE
#DV = ANXIETY .
COMPUTE
#MU = B0 .
COMPUTE PHI = EXP(-D0) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) +
PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR ANXIETY
/OUTFILE='C:\SPSSFNLR.TMP'
/PRED PRED_
/LOSS LOSS_
/CRITERIA STEPLIMIT 2 ISTEP 1E+20 .
The relevant output shows the iteration history and the final negative log-likelihood and parameter values:
Iteration   Loss funct          B0           D0
   0.1     -224.838337    -2.2990700   -1.3137100
   1.1     -230.237861    -2.2407348   -1.3781099
   ...
   7.1     -239.448006    -2.2439540   -1.7956404
   8.1     -239.448006    -2.2439602   -1.7956430

Run stopped after 8 major iterations.
Optimal solution found.
Now we use the final parameter values for B0 and D0 as starting-values for the next model, which adds
STRESS to the location submodel. The starting-value for B1 is set at 0.1:
MODEL PROGRAM B0 = -2.2439602 B1 = 0.1 D0 = -1.7956430 .
COMPUTE
#DV = ANXIETY .
COMPUTE
#MU = B0 + B1*STRESS .
COMPUTE PHI = EXP(-D0) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) +
PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR ANXIETY
/OUTFILE='C:\SPSSFNLR.TMP'
/PRED PRED_
/LOSS LOSS_
/CRITERIA STEPLIMIT 2 ISTEP 1E+20 .
The negative log-likelihood is –283.007, substantially lower (better) than the –239.448 of the null model. The chi-square
change statistic is 2*(283.007 – 239.448) = 87.118, which is large and of course significant (p < .0001). The B1
coefficient is large and positive, so higher STRESS predicts higher ANXIETY.
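For reference, the p-value for a likelihood-ratio chi-square change statistic of this kind can be computed directly, e.g., in R:
pchisq(87.118, df = 1, lower.tail = FALSE)   # 1 df for the single added parameter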
Now we compute the final model in our example with STRESS in the dispersion submodel, again using the
syntax shell with the starting values from the preceding model and D1 = 0.1.
MODEL PROGRAM B0 = -3.4790199 B1 = 3.74959458 D0 = -2.4574117 D1 = 0.1 .
COMPUTE
#DV = ANXIETY .
COMPUTE
#MU = B0 + B1*STRESS .
COMPUTE PHI = EXP(-D0 - D1*STRESS) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) +
PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR ANXIETY
/OUTFILE='C:\SPSSFNLR.TMP'
/PRED PRED_
/LOSS LOSS_
/CRITERIA STEPLIMIT 2 ISTEP 1E+20 .
An abbreviation of the relevant output is shown below. Note that the negative log-likelihood (-301.960) is
lower than for the preceding model (–283.007), indicating improved fit. For a loss of 1 degree of freedom, the
chi-square gain is 2*(301.960 – 283.007) = 37.906 (p < .0001), once again a significant improvement.
Iteration   Loss funct          B0           B1           D0           D1
   0.1     -283.913317    -3.4790199   3.74959458   -2.4574117   .100000000
   1.1     -284.607693    -3.3541934   3.77969286   -2.6867715   .617696966
   ...
   8.1     -301.960080    -4.0239012   4.94243205   -3.9614916   4.27552137
   9.1     -301.960087    -4.0237352   4.94148628   -3.9608697   4.27341043
  10.1     -301.960087    -4.0237166   4.94137080   -3.9608459   4.27330871
This output does not tell us anything about the standard errors for the individual coefficients, but we can
obtain that information and more by using subcommands under the NLR procedure. These matters are taken up
in the next section.
Extracting More from CNLR
The CNLR procedure can yield more than just the coefficients and log-likelihood, but we recommend
ensuring that a solution has been found by the procedure before including the extra commands. In this
section we cover two additional features: saving predicted values, residuals, and gradient values; and
obtaining bootstrap standard-error estimates for the coefficients.
Saving computed variables
Predicted values, residuals, gradient values, and loss-function values for all cases can be obtained by
inserting the /SAVE subcommand before the /CRITERIA subcommand. For instance,
/SAVE PRED RESID DERIVATIVES
will save the predicted values, residuals, and gradient values (derivatives) to the working data-file. The saved
variables may then be used in the usual diagnostic fashion. For instance, summing the derivatives for the final
model in our example shows the gradient is near 0 at the solution for each parameter, thereby supporting the
claim that the solution is the true optimum. This is a good diagnostic for checking whether a stable solution has
been found.
Obtaining bootstrap standard-error estimates
SPSS does not compute the Hessian at the solution, so we cannot obtain asymptotic standard-error estimates
in the usual way that we can from SAS or R. However, SPSS does provide bootstrap estimates (the default is
100 samples). To obtain bootstrap estimates of the standard errors (and the corresponding confidence intervals
and correlation matrix of the estimates), insert the following subcommand before the /CRITERIA subcommand:
/BOOTSTRAP = N
where N is the number of samples desired. Usually 1000-2000 samples suffice for accurate estimates. This may
take some time for your computer to complete. Two different kinds of “95% confidence intervals” are
displayed: Bootstrap intervals using the standard error estimates and intervals based on excluding the bottom
and top 2.5% of the bootstrap distribution.
Example 3: Reading accuracy, dyslexia, and IQ
Here is the syntax for obtaining the final model presented in Example 3.
MODEL PROGRAM B0 = 1.0 B1 = -1.0 B2 = 0.5 B3 = -0.5 D0 = -3.0 D1 = -1.5 D2 = -1.5 .
COMPUTE
#DV = accur01 .
COMPUTE
#MU = B0 + B1*grpctr + B2*ziq + B3*ziq*grpctr .
COMPUTE PHI = EXP(-D0 - D1*grpctr - D2*ziq) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) +
PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR accur01
/OUTFILE='C:\SPSSFNLR.TMP'
/PRED PRED_
/LOSS LOSS_
/CRITERIA STEPLIMIT 2 ISTEP 1E+20 .
Example 1: Verdict, conflict, and confidence
Here is the syntax for obtaining the final model presented in Example 1. In this example, the BOOTSTRAP
subcommand has been included in order to get standard errors and confidence intervals.
MODEL PROGRAM B0 = 0.88 B1 = -0.01 B2 = 0.15 B3 = 0.19 D0 = -1.1 D1 = 0.34 D2 = -0.22 D3 = 0.1 .
COMPUTE
#DV = crc99 .
COMPUTE
#MU = B0 + B1*vert + B2*confl + B3*vert*confl .
COMPUTE PHI = EXP(-D0 - D1*vert - D2*confl - D3*vert*confl) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) +
PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR crc99
/OUTFILE='C:\SPSSFNLR.TMP'
/PRED PRED_
/LOSS LOSS_
/BOOTSTRAP = 2000
/CRITERIA STEPLIMIT 2 ISTEP 1E+20 .
SAS Syntax Files
*This file has the code for examples 1-3 in Smithson & Verkuilen (2006).
We assume you have imported the data into SAS.
Code to fit models in the paper using NLMIXED is shown below. Comments
in the syntax of each model should help you write your own models.
SAS has two other programs that can be used to fit beta regressions:
GLIMMIX and NLIN. NLIN is fairly similar to NLMIXED. GLIMMIX has syntax
very similar to MIXED (for linear mixed models) and the beta distribution
is a built-in. It can fit random effects models but cannot fit
heteroscedasticity-covariate models.
We HIGHLY recommend that you familiarize yourself with whatever program you
use to fit a beta regression. Nonlinear optimization is an inherently
difficult problem that does not have a guaranteed global solution. Just
because the optimizer converges does not mean the global optimum was
attained. In our experience, if many iterations are required, something is
usually wrong: the model is misspecified, not identified, or has bad
starting values. Finally, the most common cause of an error message is an
out-of-bounds observation.;
;
****Example 1****;
*Note: Two different, but statistically equivalent, ways to lay out the
model.
*Simple transformations turn one into the other.
*;
proc nlmixed data=jury tech = trureg hess cov itdetails;
title 'One-Way Layout';
*linear predictors;
Xb = b1*group1 + b2*group2 + b3*group3 + b4*group4;
Wd = d1*group1 + d2*group2 + d3*group3 + d4*group4;
*link functions transform linear predictors;
mu = exp(Xb)/(1 + exp(Xb));
phi = exp(-Wd);
*transform to standard parameterization for easy entry to log-likelihood;
w = mu*phi;
t = (1-mu)*phi;
*log-likelihood;
ll = lgamma(w+t) - lgamma(w) - lgamma(t) + (w-1)*log(crc99) + (t-1)*log(1 - crc99);
model crc99 ~ general(ll);
run;
proc nlmixed data=jury tech = trureg hess cov itdetails;
title 'Two-Way Layout';
*linear predictors;
Xb = b0 + b1*vert + b2*confl + b3*(vert*confl);
Wd = d0 + d1*vert + d2*confl + d3*(vert*confl);
*link functions transform linear predictors;
mu = exp(Xb)/(1 + exp(Xb));
phi = exp(-Wd);
*transform to standard parameterization for easy entry to log-likelihood;
w = mu*phi;
t = (1-mu)*phi;
*log-likelihood;
ll = lgamma(w+t) - lgamma(w) - lgamma(t) + (w-1)*log(crc99) + (t-1)*log(1 - crc99);
model crc99 ~ general(ll);
run;
****Example 2****;
proc nlmixed data = anxiety tech = trureg hess cov itdetails;
*The options here tell SAS to use the data set just defined and then
request trust-region Newton, detailed iteration history, the final Hessian
and asymptotic covariance matrix. There are many other options available--see the NLMIXED documentation;
title 'null model anxiety--fits a two-parameter beta distribution as a
reference';
*These statements set up the model equations. Any variables that are not in
the data set declared above are quantities SAS will calculate.;
xb = b0;
mu = exp(xb)/(1 + exp(xb));
wd = d0;
phi = exp(-1*wd);
*These numbers are used to simplify the log-likelihood by converting back
to the standard parameterization for the beta distribution. There is no
really good reason to do it this way but it's easier to read the log-likelihood function and you can monitor these values using "predict" for
plotting if you want.;
p = mu*phi;
q = phi - mu*phi;
*This statement sets up the log-likelihood. An alternate to forcing the DV
to be in the open unit interval would be to trap any values of the DV that
are equal to 0 or 1 exactly--see the NLMIXED examples in SAS doc for how to
do this. Otherwise, you can do the transformation using standard SAS data
step commands;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(anxiety) + (q-1)*log(1
- anxiety);
*The model statement tells SAS to use anxiety as the DV and then use the
log-likelihood just defined.;
model anxiety ~ general(ll);
run;
proc nlmixed data = anxiety tech = trureg hess cov itdetails;
title 'anxiety on stress, heteroscedastic model';
xb = b0 + b1*stress;
mu = exp(xb)/(1 + exp(xb));
wd = d0 + d1*stress;
phi = exp(-1*wd);
p = mu*phi;
q = phi - mu*phi;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(anxiety) + (q-1)*log(1
- anxiety);
model anxiety ~ general(ll);
*The next statement generates predicted values as another SAS data set.
This file contains many useful quantities and you can look at residuals,
etc. It is possible to use "predict" to generate quantities that are
functions of estimates, which is really useful for plotting. You can use
GPLOT or open the data set in Excel and export it to another program. The
estimate command can also be used to generate estimates that are functions
of parameters. This is particularly convenient because confidence intervals
are also computed for you by the delta method.;
predict mu out = anxiety_pred_het;
run;
proc nlmixed data = anxiety tech = trureg hess cov itdetails;
title 'anxiety on stress location only';
xb = b0 + b1*stress;
mu = exp(xb)/(1 + exp(xb));
wd = d0;
phi = exp(-1*wd);
p = mu*phi;
q = phi - mu*phi;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(anxiety) + (q-1)*log(1
- anxiety);
model anxiety ~ general(ll);
predict mu out = anxiety_pred_loc;
run;
****Example 3****;
proc nlmixed data = dyslexia tech = trureg hess cov itdetails;
*This statement tells SAS to use trust region Newton optimization and to
output the final Hessian (second derivative matrix), parameter covariance
matrix,
and give a detailed iteration history.;
title 'accur01 on ziq and grpctr (dyslexia effect code), heteroscedastic
model';
*The model equations are constructed here and substituted into the
likelihood below;
xb = bconst + b_ziq*ziq + b_dys*grpctr;
mu = exp(xb)/(1 + exp(xb));
wd = d0 + d_ziq*ziq + d_dys*grpctr;
phi = exp(-1*wd);
p = mu*phi;
q = phi - mu*phi;
*This next command is the beta likelihood;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(accur01) + (q-1)*log(1
- accur01);
model accur01 ~ general(ll);
*Request predicted values for the mean from this model, to be saved in the
specified data set.;
predict mu out = dyslexia_pred_het;
run;