Beta Regression: Practical Issues in Estimation (Supplementary Material)

Michael Smithson
The Australian National University
Michael.Smithson@anu.edu.au

and

Jay Verkuilen
University of Illinois at Urbana-Champaign
jvverkuilen@gmail.com

November 19, 2005

General Procedures

This document supplements the paper by Smithson and Verkuilen (2005) on beta regression, and focuses on maximum likelihood estimation procedures in several statistical packages. Our discussion here should not be considered a substitute for thoroughly reading the documentation for any software you plan to use.

Maximizing the likelihood function must be done numerically. Newton-Raphson or quasi-Newton methods work well and generate asymptotic standard errors as a byproduct of the estimation, unlike derivative-free methods such as the simplex method. Ferrari and Cribari-Neto (2004) use Fisher scoring, which substitutes the expected Hessian for the observed Hessian. Buckley (2002) used MCMC estimation in WinBUGS, which provides a Bayesian posterior density. Buckley (2002) also provided Stata code and Paolino (2001) provided Gauss code, both of which compute maximum likelihood estimates. We have estimated beta regressions using R/S-Plus, SAS, SPSS, Mathematica, and WinBUGS. Syntax and/or script files for all of these packages are freely available on this site.

Different packages require more or less information from the user. SAS, for instance, requires only the likelihood and analytically computes the derivatives for Newton methods, whereas Mathematica's Newton-Raphson routine requires the user to supply expressions for the derivatives. The major differences between Newton-Raphson and quasi-Newton methods lie in the number of function evaluations per iteration (more for Newton-Raphson) and the number of iterations necessary (more for quasi-Newton). The domain of convergence, and hence the importance of good starting values, also differs across algorithms. The trust-region Newton method implemented in SAS seems to be particularly stable on hard problems, and we recommend its use when convergence might be a problem. In general, speed of execution for ML estimation is proportional to N + p², where N is the sample size and p is the number of parameters, but this will depend on the algorithm. We have never observed a well-specified model given good starting values taking longer than a few seconds to converge, even on a fairly modest laptop, but these models have had no more than a dozen or so parameters. Obviously, experience will vary for larger data sets and more complicated models.

If a Newton or quasi-Newton method is used, asymptotic standard errors usually are estimated from the inverse of the final Hessian matrix. Bayesian estimation gives posterior densities from which the Bayesian analogs of frequentist stability measures can be taken, e.g., the 2.5% and 97.5% quantiles of the posterior density as analogous quantities to a 95% confidence interval. Though the literature generally recommends that the observed (Newton) Hessian be used to provide asymptotic standard errors, we have tried both the observed and the expected Hessian on several data sets and it has never seemed to make an appreciable difference. It may be desirable to use heteroscedasticity-consistent standard errors in a beta regression. Constructing these values is discussed in detail in Hardin and Hilbe (2003, pp. 28-32). SAS proc glimmix implements this for location-only models using the "empirical" keyword. We intend to add this calculation to the examples but have not done so at this point.
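To convey the basic idea (this sketch is ours, not the Hardin and Hilbe construction verbatim), a sandwich-type covariance matrix can be assembled from the Hessian of the negative log-likelihood at the solution and the per-observation score vectors evaluated there. A minimal R sketch, assuming a p x p Hessian H from the optimizer and a hypothetical helper that returns the n x p matrix of per-observation scores (the grad function given later in this document returns only the summed gradient):

#Sandwich (robust) covariance estimate: inverse Hessian as the "bread",
#outer product of per-observation scores as the "meat".
robust.vcov <- function(H, S) {
  bread <- solve(H)      #H = Hessian of the negative log-likelihood at the MLE
  meat <- crossprod(S)   #S = n x p matrix of per-observation scores at the MLE (hypothetical)
  bread %*% meat %*% bread
}
#robust.se <- sqrt(diag(robust.vcov(H, S)))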
Well-chosen starting values are needed to ensure convergence when more than a few variables are included in the model, particularly when dispersion covariates are used. (For location-only models, convexity provides a global optimum.) We have found three effective approaches to generating starting-values; an R sketch of the first two appears at the end of this subsection.

Ferrari and Cribari-Neto (2004) suggest using the OLS estimators from the regression on the link-transformed dependent variable for the location sub-model. For example, if the location sub-model link is the logit, then starting-values for the β coefficients would be obtained via ln(yi/(1 – yi)) = Xβ_OLS + εi. In general the OLS estimates track the location model tolerably well, though of course the standard errors will be inefficient and, more importantly, the variance structure will be limited to an intercept parameter. Unfortunately, to our knowledge there is no similar proposal for starting-values of the coefficients in the dispersion sub-model.

The second approach is to begin with the null model (i.e., intercept-only sub-models with coefficients β0 and δ0), using the method of moments to provide starting-values for β0 and δ0. The resultant maximum-likelihood estimates of β0 and δ0 are then used along with starting-values close to 0 for the first new coefficients β1 and/or δ1 (β1 = δ1 = 0.1, say), and maximum-likelihood estimates of β0, δ0, β1, and/or δ1 are obtained. These in turn are used as starting-values for the next more complex model, with starting-values near 0 used for each new term being introduced into the model. Models can be built up one or more terms at a time in this way, although adding more than one term at a time is riskier.

A third approach is to provide a grid of starting values and estimate the model over this grid, looking for changes in the computed likelihood function and the estimates. If the computed solution is essentially the same over the grid, the global optimum is (probably) achieved. This procedure is very effective at avoiding local optima, but it is very time consuming for a model with many parameters because the size of the grid grows multiplicatively with the number of parameters. For p parameters and k values per parameter, the number of estimations is k^p, which is large even for moderate p and k; e.g., p = 5 and k = 3 requires 3^5 = 243 estimations. An intermediate strategy is to use a grid over only a subset of the parameters, e.g., the dispersion sub-model only. SAS provides a facility to automate this procedure, but in other environments it will probably be necessary to do some programming.

Another estimation issue is that covariates with particularly large absolute values may result in a loss of precision in estimates or create problems for the estimation algorithms. This is due to the evaluation of exponentials, and we have found that absolute values greater than around 30 can cause difficulties. These covariates may need to be rescaled to smaller ranges for estimation, such as by z-scoring. Commercial packages typically do this silently with models such as logistic regression. The loss of numerical precision due to variables with widely different scales, e.g., one variable in millions and the other in thousandths, is common to all numerical procedures and is not unique to beta regression. Rescaling in this case is required.
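The following is a minimal R sketch of the first two starting-value strategies. It assumes a response vector y already in the open (0, 1) interval and design matrices X (location) and Z (dispersion), each with a leading column of ones; these object names are ours and would need to be adapted to your own data.

#Strategy 1: OLS on the logit-transformed response gives location starting-values.
start.beta <- as.vector(coef(lm(log(y/(1 - y)) ~ X - 1)))

#Strategy 2 (null-model step): method-of-moments values for the two intercepts.
#Since Var(y) = mu(1 - mu)/(1 + phi), phi = (mu - mu^2 - s2)/s2 and delta0 = -log(phi).
mu <- mean(y); s2 <- var(y)
beta0 <- log(mu/(1 - mu))
delta0 <- -log((mu - mu^2 - s2)/s2)

#Dispersion starting-values: the method-of-moments intercept plus values near 0
#for any dispersion covariates; the model is then built up one term at a time.
start.delta <- c(delta0, rep(0.1, ncol(Z) - 1))
start <- c(start.beta, start.delta)

The sign convention for delta0 matches the φ = exp(–δ0 – δ1w1 – …) parameterization used throughout this document and in the code below.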
A third practical issue is the treatment of endpoints. Although 0 and 1 may be genuine outcome values, their logits are undefined. Two obvious remedies are proportionally "shrinking" the range to a sub-range nearly covering the unit interval (e.g., [.01, .99]) or simply adding a small amount to 0-valued observations and subtracting the same amount from 1-valued observations while leaving the other observations unchanged. Both methods bias the estimates toward no effect. A method frequently used in practice in areas such as signal detection theory is to add 1/(2N) to a 0 observation and subtract 1/(2N) from a 1 observation, where N is the total number of observations (Macmillan and Creelman, 2005, pp. 8-9). This has no effect on the interior points but could introduce bias if there are a non-trivial number of boundary values. The best option may be to experiment with different endpoint-handling schemes and see whether the parameter estimates change in any appreciable way.

Our recommendation applies to a sample of scores from a continuous bounded scale that has already been linearly transformed to the [0,1] interval. We may transform the sample scores to a variable in the open (0,1) interval by the weighted average y′ = [y(N – 1) + s]/N, where N is the sample size and s is a constant between 0 and 1. From a Bayesian standpoint, s acts as if we are taking a prior into account. A reasonable choice for s is .5.

Related to this point, it is occasionally possible to create an overflow in the objective function. To see how this is possible, consider the log-likelihood function:

lnL = lnΓ(φ) – lnΓ(μφ) – lnΓ((1 – μ)φ) + (μφ – 1)ln y + ((1 – μ)φ – 1)ln(1 – y).

While the data are bounded away from 0 and 1, it is possible for the predicted values at any given iteration to become very small or very large. This can cause the log-gamma function or the polygamma functions (derivatives of the log-gamma function) to overflow. This usually causes program termination and definitely produces meaningless results. It is unlikely to happen for most data, but we have observed it once in a fairly complex model with random effects. It seemed to occur when μ and φ were both very small, meaning their product was even smaller. A practical solution is to alter the predicted values μ and φ before substituting them into the likelihood function, as follows:

μ_adjust = ε1 + μ(1 – 2ε1),   φ_adjust = ε2 + φ(1 – ε2),

where ε1 and ε2 are small non-negative constants, e.g., .0001. The effect of these transformations is to gently bound the predicted values away from 0 and 1 for μ, and away from 0 for φ. Using this transformation biases results towards being non-significant, so the constants should be chosen as small as possible.

Finally, researchers will need to consider whether beta regression can be applied to a discrete dependent variable, even an interval-level one. It is common practice to apply normal-theory regression to variables that have only a few scale values, e.g., survey items with a response range from 1 to 7, and researchers are likely to be motivated to do the same with beta regression. Tamhane, Ankenman and Yang (2002) discuss the use of a beta-distributed underlying variable for ordinal data. They indicate that using the raw discrete scores works reasonably well, though they provide a "continuizing" procedure that they show improves the mean squared error somewhat. Their procedure simply amounts to uniformly and randomly assigning values to identical observations within the range of their "bin." For the present, researchers will have to rely on practical experience in the absence of theoretical results regarding this issue. A simulation study showing how beta regression degrades in the presence of discretization would be desirable; if in doubt, compare an ordinal regression such as ordinal logit with beta regression. It is possible to use symmetry constraints on the thresholds to estimate a model that is in between the usual ordinal regression and the beta model, but most packages do not support constraints on the thresholds. In the meantime, we recommend a linear transformation of a discrete equal-interval scale that assigns values to bin midpoints. A variable Y with n + 1 scale values {a, a + b, a + 2b, …, a + nb} is transformed into y = [2(Y – a) + b]/[2b(n + 1)]. For example, the seven scale values {1, 2, 3, 4, 5, 6, 7} are transformed into {1/14, 3/14, …, 13/14}, which are the midpoints of the seven bins equally dividing the [0,1] interval.
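Both of the transformations just described are one-liners in R. A minimal sketch, assuming y holds continuous scores already rescaled to [0, 1] and Y holds discrete scores with smallest value a, step size b, and n + 1 categories (the function and argument names are ours):

#Compress continuous [0,1] scores into the open interval (0,1); N is the sample size.
squeeze01 <- function(y, s = 0.5) (y*(length(y) - 1) + s)/length(y)

#Map a discrete equal-interval scale {a, a+b, ..., a+nb} to bin midpoints in (0,1).
midpoints01 <- function(Y, a, b, n) (2*(Y - a) + b)/(2*b*(n + 1))

#Example: a 1-7 rating scale (a = 1, b = 1, n = 6) maps to 1/14, 3/14, ..., 13/14.
#midpoints01(1:7, a = 1, b = 1, n = 6)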
Implementation in statistical packages

In this section we briefly describe the implementations in R/S-Plus, SPSS, and SAS. All three implementations require the user to input model terms and starting-values, but otherwise make no special demands. Whenever possible, we recommend fitting beta regression models in more than one package and using more than one set of starting-values and optimization method to check convergence. We have done this for many models, and beta regression seems to converge readily to the optimum, provided the starting values are good. Models with several terms in the dispersion sub-model, however, are more difficult to estimate.

Our current implementation in R and S-Plus provides the maximum likelihood value attained by the model, coefficients, gradient for the coefficients, asymptotic standard-error estimates, and p-values. Predicted values and residuals can also be output or saved in the data file. It uses two functions ("betareg" and "grad") and a small set of output commands, all of which can be used in the Commands window or saved as scripts. The S-Plus version uses the nlminb quasi-Newton routine for MLE and the Venables and Ripley (1999) vcov.nlminb function in the MASS library for computing the Hessian and the asymptotic estimates. The R version uses the optim routine for MLE with the BFGS quasi-Newton method. A version of optim for S-Plus is also available in the MASS library.

In SPSS, beta regression models can be fitted under its Nonlinear Regression (NLR and CNLR) procedure. We have written a syntax shell and documentation so that users need only supply a model and starting-values. The CNLR procedure outputs an iteration history and a convergence message. SPSS uses numerical approximations to the derivatives, so the analytic score function and Hessian are not needed. There are subcommands that provide predicted values, residuals, gradient values, and bootstrap standard-error estimates for the coefficients. However, the Hessian and asymptotic standard-error estimates are not currently available.

In SAS, proc nlmixed, proc nlin, or proc glimmix will estimate a beta regression. Our examples all use nlmixed, but SAS syntax is fairly similar across all procedures.
The glimmix program does not allow a heteroscedasticity model, but it does generate reasonable estimates for variance-component models with beta responses via penalized quasi-likelihood, whereas the random location-effects models we have tried in nlmixed, computed by adaptive Gaussian quadrature, typically did not converge. To compute a heteroscedasticity model, nlmixed or nlin will be necessary. Anyone who has a basic familiarity with SAS can code up their own model in nlmixed using our heavily commented syntax files as an example.

We recommend thoroughly reading the program documentation. These programs have many options, and the documentation provides a wealth of advice on model convergence, fit, etc. There are many choices for the numerical optimizer. Even for Newton methods, the analytic score function and Hessian are not needed, as SAS uses automatic differentiation to obtain the derivatives (a flag can be added to display the equations for the gradient and Hessian). The quasi-Newton method BFGS seems to be a solid all-around choice in practice, with the trust-region Newton method being useful for more difficult problems.

The output of nlmixed is extensive. Detailed iteration information is available for each estimated parameter, along with the likelihood, AIC, BIC, and asymptotic confidence intervals. In addition to the basic output, it is possible to use additional command statements to generate predicted values and residuals, and to save output in RTF and LaTeX format. To obtain bootstrap or jackknife statistics, SAS Institute has the freely downloadable jackboot macro on its website.

References

Buckley, J. (2002). Estimation of models with beta-distributed dependent variables: A replication and extension of Paolino (2001). Political Analysis, 11, 1-12.

Cribari-Neto, F. and Vasconcellos, K. L. P. (2002). Nearly unbiased maximum likelihood estimation for the beta distribution. Journal of Statistical Computation and Simulation, 72, 107-118.

Ferrari, S. L. P. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799-815.

Hardin, J. W. and Hilbe, J. M. (2003). Generalized Estimating Equations. Boca Raton, FL: Chapman & Hall/CRC.

Macmillan, N. A. and Creelman, C. D. (2005). Detection Theory: A User's Guide. Mahwah, NJ: Lawrence Erlbaum Associates.

Paolino, P. (2001). Maximum likelihood estimation of models with beta-distributed dependent variables. Political Analysis, 9, 325-346.

SAS Institute (2005). The GLIMMIX Procedure. Cary, NC: SAS Institute.

Smithson, M. and Verkuilen, J. (2005). A better lemon-squeezer? Maximum likelihood regression with beta-distributed dependent variables.

Tamhane, A. C., Ankenman, B. E. and Yang, Y. (2002). The beta distribution as a latent response model for ordinal data (I): Estimation of location and dispersion parameters. Unpublished manuscript, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.

Venables, W. N. and Ripley, B. D. (1999). Modern Applied Statistics with S-PLUS. New York: Springer-Verlag.

Splus/R Code for Examples 1-3

#This code is for S-PLUS (but the functions also run in R):
#Be sure to load the MASS library before using this.
#The IVs and DVs are defined in the four statements below.
#Substitute your dependent variable name for 'DV' in the ydata statement and
#your two sets of independent variables for 'IV, …' in the xdata and wdata statements.
##
ydata <- cbind(DV);
const <- rep(1,length(ydata));
xdata <- cbind(const, IV, …);
wdata <- cbind(const, IV, …);
##
#The initial starting values for the null model are
#generated by using the method of moments on ydata.
start <- c(log(mean(ydata)/(1-mean(ydata))),
  -log((mean(ydata) - (mean(ydata))^2 - var(ydata))/var(ydata)))
##
#The betareg function computes the negative log-likelihood. It removes missing data.
##
betareg <- function(h, y, x, z) {
  hx = x%*%h[1: length(x[1,])]
  mu = exp(hx)/(1+exp(hx))
  gz = z%*%h[length(x[1,])+1: length(z[1,])]
  phi = exp(-gz)
  loglik = lgamma(phi) - lgamma(mu*phi) - lgamma(phi - mu*phi) + mu*phi*log(y) +
    (phi - mu*phi)*log(1 - y) - log(y) - log(1 - y)
  -sum(loglik, na.rm = TRUE)
}
##
#The grad function computes the gradient.
##
grad <- function(h, y, x, z) {
  hx = x%*%h[1: length(x[1,])]
  gz = z%*%h[length(x[1,])+1: length(z[1,])]
  gd <- cbind(x*rep(exp(-gz+hx)*(log(y/(1-y)) + digamma(exp(-gz)/(1+exp(hx))) -
    digamma(exp(-gz+hx)/(1+exp(hx))))/(1+exp(hx))^2, length(x[1,])),
    -z*rep(exp(-gz)*(log(1-y) + exp(hx)*log(y) + (1+exp(hx))*digamma(exp(-gz)) -
    digamma(exp(-gz)/(1+exp(hx))) - exp(hx)*digamma(exp(-gz+hx)/(1+exp(hx))))/(1+exp(hx)),
    length(z[1,])))
  colSums(gd, na.rm = TRUE)
}
##
#This is the optimizer function:
##
betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
##
#These are the parameters, standard errors, z-stats and significance-levels.
##
outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))),
  zstat <- estim/serr, prob <- 1-pnorm(abs(zstat)));
row.names(outfit) <- c("estim", "serr", "zstat", "prob"); outfit
##
#These are the log-likelihood and convergence message, followed by the gradient at the solution.
##
c(betafit$objective,betafit$message)
grad(betafit$parameters,ydata,xdata,wdata)

#Example 2 from Smithson and Verkuilen (2005) in an S-Plus session
> attach(Example2)
> ydata <- cbind(Anxiety);
> const <- rep(1,length(ydata));
> start <- c(log(mean(ydata)/(1-mean(ydata))), -log((mean(ydata) - (mean(ydata))^2 - var(ydata))/var(ydata)))
> start
[1] -2.299012 -1.313792
> #We enter the betareg and grad functions.
[SNIP—FUNCTIONS ENTERED AS IN S-PLUS CODE]
> #We start with the null model.
> xdata <- cbind(const); wdata <- cbind(const);
> betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
> #This is the negative log-likelihood and convergence message.
> c(betafit$objective,betafit$message)
[1] "-239.448005712066" "RELATIVE FUNCTION CONVERGENCE"
> #These are the estimates, asymptotic standard errors and z-statistics.
> outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat", "prob"); outfit
               x.1         x.2
estim  -2.24396066  -1.7956442
serr    0.09757744   0.1240222
zstat -22.99671641 -14.4784088
prob    0.00000000   0.0000000
> #Now we estimate a model with Stress in the location submodel.
> #We also specify new values for "start".
> xdata <- cbind(const, Stress); start <- c(-2.244, 0.1, -1.796);
> betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
> c(betafit$objective,betafit$message)
[1] "-283.006657335412" "RELATIVE FUNCTION CONVERGENCE"
> #Check that the gradient is 0.
> grad(betafit$parameters,ydata,xdata,wdata)
        const       Stress        const
 1.5892e-006 8.478869e-007 1.07663e-006
> outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat", "prob"); outfit
              x.1         x.2        x.3
estim  -3.4790186   3.7495930  -2.457410
serr    0.1529186   0.3379374   0.126031
zstat -22.7507931  11.0955267 -19.498451
prob    0.0000000   0.0000000   0.000000
> #Finally, we estimate a model that includes Stress in the dispersion submodel.
> wdata <- cbind(const, Stress); start <- c(-3.479, 3.75, -2.46, 0.1);
> betafit <- nlminb(start, betareg, x = xdata, y = ydata, z = wdata)
> c(betafit$objective,betafit$message)
[1] "-301.9600873253" "RELATIVE FUNCTION CONVERGENCE"
> grad(betafit$parameters,ydata,xdata,wdata)
         const        Stress         const        Stress
 8.977042e-007 2.552195e-007 -1.098784e-006 -2.781475e-007
> outfit <- rbind(estim <- betafit$parameters, serr <- sqrt(diag(vcov.nlminb(betafit))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outfit) <- c("estim", "serr", "zstat", "prob"); outfit
              x.1         x.2         x.3            x.4
estim  -4.0237159   4.9413672  -3.9608469  4.273312e+000
serr    0.1438478   0.4480357   0.2121417  6.360746e-001
zstat -27.9720401  11.0289595 -18.6707588  6.718256e+000
prob    0.0000000   0.0000000   0.0000000  9.195644e-012

#This code is for R (but the optimizer also runs in S-Plus via the MASS library).
#It's a good idea to load the MASS library before using this.
#You have to run the betareg and grad functions before this next step.
#This is the optimizer function using BFGS (which is what we would recommend):
##
betaopt <- optim(start, betareg, hessian = T, x = xdata, y = ydata, z = wdata, method = "BFGS")
##
#These are the parameters, standard errors, z-stats and significance-levels.
##
outopt <- rbind(estim <- betaopt$par, serr <- sqrt(diag(solve(betaopt$hessian))),
  zstat <- estim/serr, prob <- 1-pnorm(abs(zstat)));
row.names(outopt) <- c("estim", "serr", "zstat", "prob"); outopt
##
#These are the log-likelihood, gradient, and convergence message.
##
betaopt$value
grad(betaopt$par,ydata,xdata,wdata)
betaopt$convergence

#Example 3 from Smithson and Verkuilen (2005) in an R session
#This code shows how to obtain the final model in Example 3.
> ex3 <- read.table("Example3.txt", header = TRUE); attach(ex3)
> library(MASS)
> ydata <- cbind(accur01);
> const <- rep(1,length(ydata));
> xdata <- cbind(const, grpctr, ziq, grpctr*ziq);
> wdata <- cbind(const, grpctr, ziq);
[SNIP—FUNCTIONS ENTERED AS IN S-PLUS CODE]
> start <- c(1.0, -1.0, 0.5, -0.5, -3.0, -1.5, -1.5);
> betaopt <- optim(start, betareg, hessian = T, x = xdata, y = ydata, z = wdata, method = "BFGS")
> c(betaopt$value, betaopt$convergence)
[1] -65.90186   0.00000
> grad(betaopt$par,ydata,xdata,wdata)
       const      grpctr         ziq                    const      grpctr         ziq
 0.019330881 0.017862284 0.001995802 -0.002906731 -0.005810820 -0.005330392 0.003144999
> outopt <- rbind(estim <- betaopt$par, serr <- sqrt(diag(solve(betaopt$hessian))), zstat <- estim/serr, prob <- 1-pnorm(abs(zstat))); row.names(outopt) <- c("estim", "serr", "zstat", "prob"); outopt
[SNIP—THIS PART OF THE OUTPUT DELETED]
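As an additional check (this sketch is ours, not part of the original scripts), the analytic grad function can be compared against a crude central-difference approximation to the gradient of the betareg objective at the fitted values. The helper below assumes the betareg function defined earlier and a fitted parameter vector such as betaopt$par; note that betareg returns the negative log-likelihood, so the two gradients should agree up to sign and rounding.

#Numerical gradient of the betareg objective by central differences.
numgrad <- function(h, y, x, z, eps = 1e-6) {
  sapply(seq_along(h), function(j) {
    hp <- h; hm <- h
    hp[j] <- hp[j] + eps
    hm[j] <- hm[j] - eps
    (betareg(hp, y, x, z) - betareg(hm, y, x, z))/(2*eps)
  })
}
#Compare with the analytic gradient at the solution:
#numgrad(betaopt$par, ydata, xdata, wdata)
#grad(betaopt$par, ydata, xdata, wdata)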
#Example 1 from Smithson and Verkuilen (2005) in an R session
#This code shows how to obtain the final model in Example 1.
> ex1 <- read.table("Example1.txt", header = TRUE); attach(ex1)
> library(MASS)
> ydata <- cbind(crc99);
> const <- rep(1,length(ydata));
> xdata <- cbind(const, vert, confl, vert*confl);
> wdata <- cbind(const, vert, confl, vert*confl);
[SNIP—FUNCTIONS ENTERED AS IN S-PLUS CODE]
> start <- c(.88, -.01, .15, .19, -1.1, .34, -.22, .1);
[SNIP—FUNCTIONS ENTERED AS IN EXAMPLE 3]
___________________________________________________________________________

SPSS Nonlinear Regression syntax components

The beta-regression models can be run in SPSS under its Nonlinear Regression (NLR and CNLR) procedure. This can be done either via the GUI or syntax, but we shall restrict our attention to the syntax approach here. SPSS requires three main components for these models:
1. A formula for the location and dispersion submodels,
2. A formula for the negative log-likelihood kernel, which is the loss function to be minimized, and
3. Names and starting-values for the model parameters.

Formula for the location and dispersion submodels

In keeping with our notation, we will denote the location submodel parameters by Bj, where j is an index number starting with 0 for the intercept. The location submodel formula takes the form
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)) .
where #MU is a linear combination of predictors,
COMPUTE #MU = B0 + … .
For example, in a model that predicts anxiety (ANXIETY) by stress (STRESS) in the location submodel, the location submodel formula is defined by
COMPUTE #MU = B0 + B1*STRESS .
Likewise, we denote the dispersion submodel parameters by Dk, where k is also an index number. The dispersion submodel formula takes the form
COMPUTE PHI = EXP(-D0 - … ) .
For example, in a model that has STRESS in the dispersion submodel, the dispersion submodel formula is
COMPUTE PHI = EXP(-D0 - D1*STRESS) .

Model parameters and starting-values

The easiest way to generate starting-values for the location and dispersion model coefficients is to begin with a null model (i.e., a model that has just B0 and D0), and then add coefficients one variable at a time in each submodel. The procedure for this is described below. To start with, obtain the mean and variance of the dependent variable. In the ANALYZE menu, under the DESCRIPTIVE STATISTICS item, choose the DESCRIPTIVES option. The DESCRIPTIVES dialog box appears. Click the Options… button and ensure that the Mean and Variance check-boxes have ticks in them. In our example, the resulting syntax should look like this:
DESCRIPTIVES VARIABLES=ANXIETY /STATISTICS=MEAN VARIANCE .
The output should look like this:

Descriptive Statistics
                      N     Mean   Variance
Anxiety             166   .09120       .018
Valid N (listwise)  166

To obtain a more precise estimate of the variance, double-click on the table and then double-click inside the cell containing ".018." In the example here we will use the more precise value of .01756. Now, given the mean μ and variance σ², we obtain our starting-values for B0 and D0 as follows: B0 = ln[μ/(1 – μ)] and D0 = –ln[(μ – μ² – σ²)/σ²]. (These are method-of-moments values: for a beta distribution with mean μ and precision φ the variance is μ(1 – μ)/(1 + φ), so φ = (μ – μ² – σ²)/σ² and D0 = –ln φ.) In our example these formulas yield the following results:
B0 = ln[.09120/(1 – .09120)] = –2.29907
D0 = –ln[(.09120 – (.09120)² – .01756)/.01756] = –1.31371.
Once the null model has been estimated, the coefficients for that model are used as starting-values for a model that includes one additional Bj or Dk. The starting-values for the additional Bj or Dk can be set to arbitrary values near 0 (e.g., 0.1).
Likewise, when the new model has been estimated, its coefficients become the starting-values for the next model. This process is illustrated in the example provided below.

Example 2 from Smithson & Verkuilen (2005): Anxiety predicted by Stress

Let y = Anxiety, x0 = w0 = 1, and x1 = w1 = Stress. We will end up fitting the following model:
μi = exp(β0 + β1x1i)/[1 + exp(β0 + β1x1i)], and φi = exp(–δ0 – δ1w1i).
First, however, we fit the null model. Open the SPSS file called Example2.sav.

Null Model

We start by computing the null model for our example, using the syntax shell with the appropriate substitutions for parameter values, submodel formulae, and dependent variable name. The starting-values for B0 and D0 are the ones derived earlier from the formulas using the sample mean and variance.

MODEL PROGRAM B0 = -2.29907 D0 = -1.31371 .
COMPUTE #DV = ANXIETY .
COMPUTE #MU = B0 .
COMPUTE PHI = EXP(-D0) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) + PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR ANXIETY /OUTFILE='C:\SPSSFNLR.TMP' /PRED PRED_ /LOSS LOSS_ /CRITERIA STEPLIMIT 2 ISTEP 1E+20 .

The relevant output shows the iteration history and the final negative log-likelihood and parameter values:

Iteration   Loss funct          B0          D0
  0.1      -224.838337  -2.2990700  -1.3137100
  1.1      -230.237861  -2.2407348  -1.3781099
  ...
  7.1      -239.448006  -2.2439540  -1.7956404
  8.1      -239.448006  -2.2439602  -1.7956430

Run stopped after 8 major iterations.
Optimal solution found.

Now we use the final parameter values for B0 and D0 as starting-values for the next model, which adds STRESS to the location submodel. The starting-value for B1 is set at 0.1:

MODEL PROGRAM B0 = -2.2439602 B1 = 0.1 D0 = -1.7956430 .
COMPUTE #DV = ANXIETY .
COMPUTE #MU = B0 + B1*STRESS .
COMPUTE PHI = EXP(-D0) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) + PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR ANXIETY /OUTFILE='C:\SPSSFNLR.TMP' /PRED PRED_ /LOSS LOSS_ /CRITERIA STEPLIMIT 2 ISTEP 1E+20 .

The negative log-likelihood is –283.007, very different from –239.448 in the null model. The chi-square change statistic is 2*(283.007 – 239.448) = 87.118, which is large and of course significant (p < .0001). The B1 coefficient is large and positive, so higher STRESS predicts higher ANXIETY. Now we compute the final model in our example, with STRESS in the dispersion submodel, again using the syntax shell with the starting-values from the preceding model and D1 = 0.1.

MODEL PROGRAM B0 = -3.4790199 B1 = 3.74959458 D0 = -2.4574117 D1 = 0.1 .
COMPUTE #DV = ANXIETY .
COMPUTE #MU = B0 + B1*STRESS .
COMPUTE PHI = EXP(-D0 - D1*STRESS) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) + PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR ANXIETY /OUTFILE='C:\SPSSFNLR.TMP' /PRED PRED_ /LOSS LOSS_ /CRITERIA STEPLIMIT 2 ISTEP 1E+20 .

An abbreviation of the relevant output is shown below. Note that the negative log-likelihood (–301.960) is lower than for the preceding model (–283.007), indicating improved fit.
At the cost of 1 degree of freedom, the chi-square gain is 2*(301.960 – 283.007) = 37.906 (p < .0001), once again a significant improvement.

Iteration   Loss funct          B0          B1          D0          D1
  0.1      -283.913317  -3.4790199  3.74959458  -2.4574117  .100000000
  1.1      -284.607693  -3.3541934  3.77969286  -2.6867715  .617696966
  ...
  8.1      -301.960080  -4.0239012  4.94243205  -3.9614916  4.27552137
  9.1      -301.960087  -4.0237352  4.94148628  -3.9608697  4.27341043
 10.1      -301.960087  -4.0237166  4.94137080  -3.9608459  4.27330871

This output does not tell us anything about the standard errors for the individual coefficients, but we can obtain that information and more by using subcommands under the NLR procedure. These matters are taken up in the next section.

Extracting More from CNLR

Fortunately, the CNLR procedure can yield more than just the coefficients and log-likelihood, but we recommend ensuring that a solution is found by the procedure before including the extra commands. In this section we cover two additional features: saving predicted values, residuals, and gradient values; and obtaining bootstrap standard-error estimates for the coefficients.

Saving computed variables

Predicted values, residuals, gradient values, and loss-function values for all cases can be obtained by inserting the /SAVE subcommand before the /CRITERIA subcommand. For instance,
/SAVE PRED RESID DERIVATIVES
will save the predicted values, residuals, and gradient values (derivatives) to the working data file. The saved variables may then be used in the usual diagnostic fashion. For instance, summing the derivatives for the final model in our example shows the gradient is near 0 at the solution for each parameter, thereby supporting the claim that an optimum has been reached. This is a good diagnostic for checking whether a stable solution has been found.

Obtaining bootstrap standard-error estimates

SPSS does not compute the Hessian at the solution, so we cannot obtain asymptotic standard-error estimates in the usual way that we can from SAS or R. However, SPSS does provide bootstrap estimates (the default is 100 samples). To obtain bootstrap estimates of the standard errors (and the corresponding confidence intervals and correlation matrix of the estimates), insert the following subcommand before the /CRITERIA subcommand:
/BOOTSTRAP = N
where N is the number of samples desired. Usually 1000-2000 samples suffice for accurate estimates. This may take some time for your computer to complete. Two different kinds of "95% confidence intervals" are displayed: bootstrap intervals using the standard-error estimates, and intervals based on excluding the bottom and top 2.5% of the bootstrap distribution.

Example 3: Reading accuracy, dyslexia, and IQ

Here is the syntax for obtaining the final model presented in Example 3.

MODEL PROGRAM B0 = 1.0 B1 = -1.0 B2 = 0.5 B3 = -0.5 D0 = -3.0 D1 = -1.5 D2 = -1.5 .
COMPUTE #DV = accur01 .
COMPUTE #MU = B0 + B1*grpctr + B2*ziq + B3*ziq*grpctr .
COMPUTE PHI = EXP(-D0 - D1*grpctr - D2*ziq) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) + PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR accur01 /OUTFILE='C:\SPSSFNLR.TMP' /PRED PRED_ /LOSS LOSS_ /CRITERIA STEPLIMIT 2 ISTEP 1E+20 .

Example 1: Verdict, conflict, and confidence

Here is the syntax for obtaining the final model presented in Example 1.
In this example, the BOOTSTRAP subcommand has been included in order to get standard errors and confidence intervals.

MODEL PROGRAM B0 = 0.88 B1 = -0.01 B2 = 0.15 B3 = 0.19 D0 = -1.1 D1 = 0.34 D2 = -0.22 D3 = 0.1 .
COMPUTE #DV = crc99 .
COMPUTE #MU = B0 + B1*vert + B2*confl + B3*vert*confl .
COMPUTE PHI = EXP(-D0 - D1*vert - D2*confl - D3*vert*confl) .
COMPUTE PRED_ = EXP(#MU)/(1+EXP(#MU)).
COMPUTE RESID_ = #DV - PRED_ .
COMPUTE LL = LNGAMMA(PHI) - LNGAMMA(PRED_*PHI) - LNGAMMA(MAX(0.001,PHI - PRED_*PHI)) + PRED_*PHI*LN(#DV) + (PHI - PRED_*PHI)*LN(1 - #DV) - LN(#DV) - LN(1 - #DV) .
COMPUTE LOSS_ = -LL .
CNLR crc99 /OUTFILE='C:\SPSSFNLR.TMP' /PRED PRED_ /LOSS LOSS_ /BOOTSTRAP = 2000 /CRITERIA STEPLIMIT 2 ISTEP 1E+20 .

SAS Syntax Files

*This file has the code for examples 1-3 in Smithson & Verkuilen (2005). We assume you have
imported the data into SAS. Code to fit the models in the paper using NLMIXED is shown below.
Comments in the syntax of each model should help you write your own models. SAS has two other
procedures that can be used to fit beta regressions: GLIMMIX and NLIN. NLIN is fairly similar to
NLMIXED. GLIMMIX has syntax very similar to MIXED (for linear mixed models) and the beta
distribution is built in. It can fit random-effects models but cannot fit heteroscedasticity-covariate
models. We HIGHLY recommend that you familiarize yourself with whatever program you use to
fit a beta regression. Nonlinear optimization is an inherently difficult problem that does not have a
guaranteed global solution. Just because the optimizer converges does not mean the global optimum
was attained. In our experience, if many iterations are required, something is usually wrong: the
model is misspecified, not identified, or has bad starting values. Finally, the most common cause of
an error message is an out-of-bounds observation.;

****Example 1****;
*Note: Two different, but statistically equivalent, ways to lay out the model.
 Simple transformations turn one into the other.;

proc nlmixed data=jury tech = trureg hess cov itdetails;
title 'One-Way Layout';
*linear predictors;
Xb = b1*group1 + b2*group2 + b3*group3 + b4*group4;
Wd = d1*group1 + d2*group2 + d3*group3 + d4*group4;
*link functions transform linear predictors;
mu = exp(Xb)/(1 + exp(Xb));
phi = exp(-Wd);
*transform to standard parameterization for easy entry to log-likelihood;
w = mu*phi;
t = (1-mu)*phi;
*log-likelihood;
ll = lgamma(w+t) - lgamma(w) - lgamma(t) + (w-1)*log(crc99) + (t-1)*log(1 - crc99);
model crc99 ~ general(ll);
run;

proc nlmixed data=jury tech = trureg hess cov itdetails;
title 'Two-Way Layout';
*linear predictors;
Xb = b0 + b1*vert + b2*confl + b3*(vert*confl);
Wd = d0 + d1*vert + d2*confl + d3*(vert*confl);
*link functions transform linear predictors;
mu = exp(Xb)/(1 + exp(Xb));
phi = exp(-Wd);
*transform to standard parameterization for easy entry to log-likelihood;
w = mu*phi;
t = (1-mu)*phi;
*log-likelihood;
ll = lgamma(w+t) - lgamma(w) - lgamma(t) + (w-1)*log(crc99) + (t-1)*log(1 - crc99);
model crc99 ~ general(ll);
run;

****Example 2****;
proc nlmixed data = anxiety tech = trureg hess cov itdetails;
*The options here tell SAS to use the data set just defined and then request trust-region Newton
optimization, a detailed iteration history, and the final Hessian and asymptotic covariance matrix.
There are many other options available--see the NLMIXED documentation;
title 'null model anxiety--fits a two-parameter beta distribution as a reference';
*These statements set up the model equations. Any variables that are not in the data set declared
above are quantities SAS will calculate.;
xb = b0;
mu = exp(xb)/(1 + exp(xb));
wd = d0;
phi = exp(-1*wd);
*These quantities are used to simplify the log-likelihood by converting back to the standard
parameterization for the beta distribution. There is no really good reason to do it this way, but it is
easier to read the log-likelihood function, and you can monitor these values using "predict" for
plotting if you want.;
p = mu*phi;
q = phi - mu*phi;
*This statement sets up the log-likelihood. An alternative to forcing the DV to be in the open unit
interval would be to trap any values of the DV that are equal to 0 or 1 exactly--see the NLMIXED
examples in the SAS documentation for how to do this. Otherwise, you can do the transformation
using standard SAS data step commands.;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(anxiety) + (q-1)*log(1 - anxiety);
*The model statement tells SAS to use anxiety as the DV and then use the log-likelihood just
defined.;
model anxiety ~ general(ll);
run;

proc nlmixed data = anxiety tech = trureg hess cov itdetails;
title 'anxiety on stress, heteroscedastic model';
xb = b0 + b1*stress;
mu = exp(xb)/(1 + exp(xb));
wd = d0 + d1*stress;
phi = exp(-1*wd);
p = mu*phi;
q = phi - mu*phi;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(anxiety) + (q-1)*log(1 - anxiety);
model anxiety ~ general(ll);
*The next statement generates predicted values as another SAS data set. This file contains many
useful quantities and you can look at residuals, etc. It is possible to use "predict" to generate
quantities that are functions of estimates, which is really useful for plotting. You can use GPLOT or
open the data set in Excel and export it to another program. The estimate command can also be used
to generate estimates that are functions of parameters. This is particularly convenient because
confidence intervals are also computed for you by the delta method.;
predict mu out = anxiety_pred_het;
run;

proc nlmixed data = anxiety tech = trureg hess cov itdetails;
title 'anxiety on stress location only';
xb = b0 + b1*stress;
mu = exp(xb)/(1 + exp(xb));
wd = d0;
phi = exp(-1*wd);
p = mu*phi;
q = phi - mu*phi;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(anxiety) + (q-1)*log(1 - anxiety);
model anxiety ~ general(ll);
predict mu out = anxiety_pred_loc;
run;

****Example 3****;
proc nlmixed data = dyslexia tech = trureg hess cov itdetails;
*This statement tells SAS to use trust-region Newton optimization and to output the final Hessian
(second derivative matrix), the parameter covariance matrix, and a detailed iteration history.;
title 'accur01 on ziq and grpctr (dyslexia effect code), heteroscedastic model';
*The model equations are constructed here and substituted into the likelihood below.;
xb = bconst + b_ziq*ziq + b_dys*grpctr;
mu = exp(xb)/(1 + exp(xb));
wd = d0 + d_ziq*ziq + d_dys*grpctr;
phi = exp(-1*wd);
p = mu*phi;
q = phi - mu*phi;
*This next statement is the beta log-likelihood.;
ll = lgamma(p+q) - lgamma(p) - lgamma(q) + (p-1)*log(accur01) + (q-1)*log(1 - accur01);
model accur01 ~ general(ll);
*Request predicted values for the mean from this model, to be saved in the specified data set.;
predict mu out = dyslexia_pred_het;
run;