On iterative adjustment of responses for the reduction of bias in binary regression models

Ioannis Kosmidis
Research Fellow, Department of Statistics
IWSM 2009

Outline

1. Introduction
2. Parameter-dependent adjustments to the data
3. Illustration
4. Discussion and further work

Introduction: Constant adjustment schemes

Why adjust the binomial data?

- To improve the frequentist properties of the estimators (especially bias).
- To avoid sparseness issues, which may result in infinite maximum likelihood estimates.

Landmark studies:

- Haldane (1955), Anscombe (1956): add 1/2 to the binomial response y and 1 to the binomial totals m, and then replace the actual data with the adjusted data in the usual log-odds estimator.
- Simple logistic regressions: Hitchcock (1962), Gart et al. (1985).

Estimation can be conveniently performed by the following procedure:

1. Calculate the values of the adjusted responses and totals.
2. Proceed with the usual estimation methods, treating the adjusted data as actual.

No constant adjustment scheme can be optimal for every possible binary regression model, in terms of improving the frequentist properties of the estimators.

Constant adjustment schemes (continued)

Estimation of the parameters β_1, ..., β_p of the general logistic regression model

  log{π_r / (1 − π_r)} = η_r = Σ_{t=1}^p β_t x_{rt}   (r = 1, ..., n),

where x_{rt} is the (r, t)th component of an n × p design matrix X.

Clogg et al. (1991) developed an adjustment scheme where

  y*_r = y_r + (p/n) (Σ_{r=1}^n y_r)/(Σ_{r=1}^n m_r)   and   m*_r = m_r + p/n   (r = 1, ..., n).

- The resultant estimators are not invariant to different representations of the data (for example, aggregated and disaggregated views).

Bias-reducing adjusted score functions

Binomial observations y_1, ..., y_n with totals m_1, ..., m_n, respectively.

Binary regression:

  g(π_r) = η_r = Σ_{t=1}^p β_t x_{rt}   (r = 1, ..., n),

where g : [0, 1] → R is the link function.

Score functions:

  U_t = Σ_{r=1}^n (w_r/d_r) (y_r − m_r π_r) x_{rt}   (t = 1, ..., p),

with d_r = m_r dπ_r/dη_r and w_r = d_r² / {m_r π_r (1 − π_r)}.

Estimators with second-order bias are obtained by solving the adjusted score equations U*_t = 0 (t = 1, ..., p).

Bias-reducing adjusted score functions (Kosmidis & Firth, 2009):

  U*_t = Σ_{r=1}^n (w_r/d_r) {y_r + h_r d'_r/(2 w_r) − m_r π_r} x_{rt}   (t = 1, ..., p),

with d'_r = m_r d²π_r/dη_r² and h_r the rth diagonal element of H = X(X^T W X)^{-1} X^T W.

- Adjusted responses (a pseudo-data representation):

  y*_r = y_r + h_r d'_r/(2 w_r)   (r = 1, ..., n).
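For the logit link the ingredients simplify: d_r = w_r = m_r π_r (1 − π_r) and d'_r = m_r π_r (1 − π_r)(1 − 2π_r), so the adjustment reduces to h_r (1 − 2π_r)/2. A minimal numpy sketch of the adjusted responses at a given β (function name and any data are illustrative, not from the talk):

```python
import numpy as np

def adjusted_responses_logit(X, y, m, beta):
    """Bias-reducing adjusted responses y*_r = y_r + h_r (1 - 2 pi_r)/2
    for the logit link, evaluated at a given beta."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = m * pi * (1.0 - pi)                     # for logit, w_r = d_r = m_r pi_r (1 - pi_r)
    A = np.linalg.inv(X.T @ (w[:, None] * X))   # (X'WX)^{-1}
    h = w * np.einsum("ij,jk,ik->i", X, A, X)   # diagonal of H = X (X'WX)^{-1} X'W
    return y + h * (1.0 - 2.0 * pi) / 2.0
```

At π_r = 1/2 the adjustment vanishes, and elsewhere it pulls the responses towards m_r/2, which is the shrinkage effect discussed later in the talk.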
Parameter-dependent adjustments to the data

Use the adjusted responses y_r + h_r d'_r/(2 w_r) in existing maximum likelihood implementations, iteratively.

Practical issues can arise relating to the sign of d'_r: possibly y*_r > m_r or y*_r < 0, violating the range of the actual data (0 ≤ y_r ≤ m_r).

Pseudo-data representations

Pseudo-data representation: the pair {y*, m*}, where y* is the adjusted binomial response and m* the adjusted totals.

By the form of the adjusted score function U*_t, a bias-reducing pseudo-data representation is

  {y + h d'/(2w), m}.
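The sign problem is easy to exhibit with the probit link, for which d'_r = −m_r η_r φ(η_r) changes sign with η_r. A small numerical sketch for a single observation (the leverage value h is hypothetical):

```python
import math

def probit_adjustment(eta, m, h):
    """The additive adjustment h d'/(2w) for one probit observation:
    d = m phi(eta), d' = -m eta phi(eta), w = d^2 / (m pi (1 - pi))."""
    phi = math.exp(-eta * eta / 2.0) / math.sqrt(2.0 * math.pi)
    p = 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))
    d = m * phi
    dprime = -m * eta * phi
    w = d * d / (m * p * (1.0 - p))
    return h * dprime / (2.0 * w)

# With eta > 0 the adjustment is negative, so an observation with y = 0
# gets an adjusted response y* = y + adj below 0, outside [0, m].
adj = probit_adjustment(eta=2.0, m=10, h=0.5)   # h = 0.5 is a hypothetical leverage
```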
A set of bias-reducing pseudo-data representations

There is a countable set of equivalent bias-reducing pseudo-data representations, where any pseudo-data representation in the set can be obtained from any other by the following operations:

- add and subtract a quantity to either the adjusted responses or the adjusted totals;
- move summands from the adjusted responses to the adjusted totals after division by −π.

Special case (Firth, 1992): For logistic regressions

  U*_t = Σ_{r=1}^n {y_r + (h_r/2)(1 − 2π_r) − m_r π_r} x_{rt}   (t = 1, ..., p).

- Bias-reducing pseudo-data representations: {y + h(1 − 2π)/2, m}, {y + h/2, m + h}, etc.

An appropriate pseudo-data representation

Within this set of bias-reducing pseudo-data representations, a representation with 0 ≤ y* ≤ m* is

  y* = y + (h π/2) {1 + (m d'/d²) I_{d' > 0}},
  m* = m + (h/2) {1 + (m d'/d²) (π − I_{d' ≤ 0})},

where I_E = 1 if E holds and I_E = 0 otherwise.
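A quick numerical check (probit link; the values of y, m, η and the leverage h are hypothetical) that this representation stays in range and reproduces the same score adjustment, i.e. y* − m*π = y + h d'/(2w) − mπ:

```python
import math

def probit_quantities(eta, m):
    """pi, d = m dpi/deta, d' = m d2pi/deta2 and w for the probit link."""
    phi = math.exp(-eta * eta / 2.0) / math.sqrt(2.0 * math.pi)
    p = 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))
    d = m * phi
    dprime = -m * eta * phi
    w = d * d / (m * p * (1.0 - p))
    return p, d, dprime, w

def in_range_pseudo_data(y, m, eta, h):
    """The range-respecting bias-reducing pseudo-data {y*, m*} above."""
    p, d, dprime, w = probit_quantities(eta, m)
    r = m * dprime / (d * d)
    y_star = y + 0.5 * h * p * (1.0 + r * (dprime > 0))
    m_star = m + 0.5 * h * (1.0 + r * (p - (dprime <= 0)))
    return y_star, m_star

# Hypothetical point: y = 0, m = 10, eta = 2, leverage h = 0.4.
y, m, eta, h = 0.0, 10, 2.0, 0.4
y_star, m_star = in_range_pseudo_data(y, m, eta, h)
p, d, dprime, w = probit_quantities(eta, m)
```

The defining property is that y* − m*π and the adjusted-score term y + h d'/(2w) − mπ agree term by term, while 0 ≤ y* ≤ m* always holds.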
Local maximum likelihood fits on pseudo-data representations

1. Update to {y*_{r,(j+1)}, m*_{r,(j+1)}} (r = 1, ..., n), evaluating all the quantities involved at β_(j).
2. Use maximum likelihood to fit the model with responses y*_{r,(j+1)} and totals m*_{r,(j+1)} (r = 1, ..., n), with β_(j) as starting value.

β_(0): the maximum likelihood estimates, possibly after adding c > 0 to the responses and 2c to the binomial totals.

Convergence criterion: Σ_{t=1}^p |U*_t(β_(j+1))| ≤ ε, for some ε > 0.

Demonstration: The beetle mortality data

The beetle mortality data (Agresti, 2002, Table 6.14):

  Killed:    6      13     18     28     52     53     61     60
  Total:     59     60     62     56     63     59     62     60
  Log-dose:  1.691  1.724  1.755  1.784  1.811  1.837  1.861  1.884
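The two-step scheme above can be sketched for the logit link, where the pseudo-data representation {y + h/2, m + h} applies, using a plain IRLS fit as the inner maximum likelihood step. This is an illustrative sketch, not the author's brglm implementation, and it stops on parameter change rather than on the adjusted-score criterion above:

```python
import numpy as np

def irls_logit(X, y, m, beta, n_iter=25):
    """Inner step: binomial-logit maximum likelihood by IRLS,
    treating the (possibly pseudo-) data y, m as actual."""
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = m * pi * (1.0 - pi)
        z = X @ beta + (y - m * pi) / w                   # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

def bias_reduced_logit(X, y, m, tol=1e-8, max_outer=50):
    """Outer loop: refresh the pseudo-data {y + h/2, m + h} at the
    current estimates, refit by maximum likelihood, repeat."""
    n, p = X.shape
    beta = irls_logit(X, y + 0.5, m + 1.0, np.zeros(p))   # beta_(0): ML after a constant adjustment
    for _ in range(max_outer):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = m * pi * (1.0 - pi)
        A = np.linalg.inv(X.T @ (w[:, None] * X))
        h = w * np.einsum("ij,jk,ik->i", X, A, X)         # leverages h_r
        beta_new = irls_logit(X, y + 0.5 * h, m + h, beta)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

At a fixed point the maximum likelihood fit to the pseudo-data evaluated at β is β itself, which is exactly the root of the adjusted score equations for the logit link.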
Model: complementary log-log link,

  log{−log(1 − π_r)} = β_1 + β_2 x_r.

Demonstration: iterative fits

  Current estimates   ML estimates   Iteration 1   Iteration 2   Iteration 3
  β_1                 -39.5222       -39.0581      -39.0469      -39.0466
  β_2                 22.0147        21.7545       21.7482       21.7481

  Adjusted responses
  y*_(1):  6.1316  13.1500  18.1414  28.0900  52.1025  53.1620  61.1461  60.0375
  y*_(2):  6.1321  13.1497  18.1406  28.0893  52.1005  53.1590  61.1484  60.0414
  y*_(3):  6.1322  13.1497  18.1405  28.0893  52.1004  53.1589  61.1484  60.0415

  Adjusted totals
  m*_(1):  59.2457  60.2640  62.2293  56.1372  63.1751  59.2824  62.2616  60.0697
  m*_(2):  59.2464  60.2631  62.2277  56.1361  63.1715  59.2770  62.2653  60.0769
  m*_(3):  59.2464  60.2631  62.2277  56.1361  63.1715  59.2769  62.2654  60.0771

Demonstration: Estimates of β_1 and β_2 for the beetle mortality data.
  Parameter   ML                  BR
  β_1         -39.5222 (3.2356)   -39.0466 (3.1919)
  β_2         22.0147 (1.7970)    21.7481 (1.7723)

(Estimated standard errors in parentheses.)

Invariance of the estimates to the structure of the data

Aggregated view: responses y_r with totals m_r and covariate vectors x_r (r = 1, ..., n). Disaggregated view: each observation split into sub-observations z_{r1}, ..., z_{r k_r} with totals l_{r1}, ..., l_{r k_r}, all sharing the covariate vector x_r, so that

  y_r = Σ_{s=1}^{k_r} z_{rs}   and   m_r = Σ_{s=1}^{k_r} l_{rs}   (r = 1, ..., n),

with 0 ≤ z_{rs} ≤ l_{rs}. Then the adjusted data satisfy

  y*_r = Σ_{s=1}^{k_r} z*_{rs}   and   m*_r = Σ_{s=1}^{k_r} l*_{rs}.

Illustration: A complete enumeration study

Binomial observations y_1, ..., y_5, each with totals m, made independently at the five design points x = (−2, −1, 0, 1, 2).

Binary regression model:

  g(π_r) = β_1 + β_2 x_r   (r = 1, ..., 5).

g(.): logit, probit and complementary log-log links. The "true" parameter values are set to β_1 = −1 and β_2 = 1.5:

        logit    probit   cloglog
  π_1   0.01799  0.00003  0.01815
  π_2   0.07586  0.00621  0.07881
  π_3   0.26894  0.15866  0.30780
  π_4   0.62246  0.69146  0.80770
  π_5   0.88080  0.97725  0.99938

→ increased probability of infinite maximum likelihood estimates.

Through complete enumeration for m = 4, 8, 16, calculate: the bias and mean squared error of the bias-reduced estimator, and the coverage of the nominally 95% Wald-type confidence interval.

Results

The parenthesized probabilities refer to the event of encountering at least one infinite ML estimate for the corresponding m and link function.
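Computationally, complete enumeration means summing over all (m + 1)^5 possible outcome vectors (y_1, ..., y_5), each weighted by its joint binomial probability, with the estimator recomputed for every outcome. A minimal sketch of that machinery (the fitting step is omitted; the loop is checked against the exact mean m π_r):

```python
import itertools
import math

def outcome_probabilities(pis, m):
    """Yield every outcome vector in {0, ..., m}^len(pis)
    together with its joint binomial probability."""
    for ys in itertools.product(range(m + 1), repeat=len(pis)):
        prob = 1.0
        for y, p in zip(ys, pis):
            prob *= math.comb(m, y) * p**y * (1.0 - p)**(m - y)
        yield ys, prob

# Exact expectations for m = 4 at the logit-link "true" probabilities.
pis = [0.01799, 0.07586, 0.26894, 0.62246, 0.88080]
total = 0.0
mean_y = [0.0] * 5
for ys, prob in outcome_probabilities(pis, 4):
    total += prob
    for r in range(5):
        mean_y[r] += prob * ys[r]
```

In the study itself, the summand would be (an indicator or loss of) the estimator computed from each outcome vector, giving the exact bias, mean squared error and coverage rather than Monte Carlo approximations.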
All the quantities for the ML estimator are calculated conditionally on the finiteness of its components.

  Link      m   Par.  Bias (×10²)       MSE (×10)     Coverage
                      ML      BR        ML    BR      ML     BR
  logit     4   β_1   -8.79   0.52      5.84  6.07    0.971  0.972
  (0.1621)      β_2   14.44   -0.13     3.62  4.73    0.960  0.939
            8   β_1   -12.94  -0.68     3.93  3.11    0.972  0.964
  (0.0171)      β_2   20.17   1.11      3.70  2.68    0.972  0.942
            16  β_1   -7.00   -0.19     1.75  1.42    0.961  0.957
  (0.0002)      β_2   10.55   0.29      1.59  1.16    0.960  0.948
  probit    4   β_1   17.89   13.54     1.44  2.61    0.968  0.911
  (0.5475)      β_2   -18.84  -16.93    0.98  3.07    0.960  0.897
            8   β_1   0.80    3.24      1.07  1.82    0.964  0.938
  (0.2296)      β_2   6.08    -3.81     1.26  2.13    0.972  0.908
            16  β_1   -7.06   0.24      1.03  1.08    0.974  0.949
  (0.0411)      β_2   12.54   -0.17     1.39  1.22    0.973  0.933
  cloglog   4   β_1   2.97    3.18      2.97  3.07    0.959  0.962
  (0.3732)      β_2   -2.93   -12.97    1.35  3.51    0.955  0.880
            8   β_1   -8.42   0.84      2.49  1.89    0.962  0.953
  (0.1000)      β_2   15.63   -5.40     2.33  2.36    0.972  0.906
            16  β_1   -6.45   0.17      1.32  0.98    0.964  0.957
  (0.0071)      β_2   13.13   -1.74     1.60  1.23    0.965  0.921

  ML: maximum likelihood; BR: bias reduction.

Shrinkage

[Figure: Fitted probabilities using bias-reduced estimates versus fitted probabilities using maximum likelihood estimates, for the complementary log-log link and m = 4. The marked point is (c, c), where c = 1 − exp(−1).]

Discussion and further work

- Bias reduction can be easily applied using pseudo-data representations, along with existing maximum likelihood fitting procedures. The brglm R package implements this.
- The bias-reduced estimator is invariant to the representation of the binomial data.
- The bias-reduced estimates are always finite and, because of their improved statistical properties, their routine use in applications is appealing.

Further work: confidence intervals

- Wald-type approximate confidence intervals based on the bias-reduced estimator perform poorly (mainly because of finiteness and shrinkage).
- Heinze & Schemper (2002), in the case of logistic regressions, used the penalized likelihood interpretation of the adjusted score functions to construct profile penalized likelihood confidence intervals.
- Such a penalized likelihood interpretation is not always possible for links other than the logit (Kosmidis & Firth, 2009).
- The adjusted-score statistic

    U*_t(β̂_1, ..., β̂_{t−1}, β_t, β̂_{t+1}, ..., β̂_p)² F^{tt}(β̂_1, ..., β̂_{t−1}, β_t, β̂_{t+1}, ..., β̂_p)

  could be used for the construction of confidence intervals for the parameter β_t (t = 1, ..., p), where the β̂_u are the bias-reduced estimates when the tth component of the parameter vector is fixed at β_t, and F^{tt} is the (t, t)th component of the inverse of the Fisher information.

References

Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika, 43, 461–464.

Clogg, C. C., D. B. Rubin, N. Schenker, B. Schultz, and L. Weidman (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68–78.

Firth, D. (1992). Bias reduction, the Jeffreys prior and GLIM. In L. Fahrmeir, B. Francis, R. Gilchrist, and G. Tutz (Eds.), Advances in GLIM and Statistical Modelling: Proceedings of the GLIM 92 Conference, Munich, pp. 91–100. New York: Springer.

Gart, J. J., H. M. Pettigrew, and D. G. Thomas (1985).
The effect of bias, variance estimation, skewness and kurtosis of the empirical logit on weighted least squares analyses. Biometrika, 72, 179–190.

Haldane, J. (1955). The estimation of the logarithm of a ratio of frequencies. Annals of Human Genetics, 20, 309–311.

Heinze, G. and M. Schemper (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21, 2409–2419.

Hitchcock, S. E. (1962). A note on the estimation of parameters of the logistic function using the minimum logit χ² method. Biometrika, 49, 250–252.

Kosmidis, I. and D. Firth (2009). Bias reduction in exponential family nonlinear models. Technical Report 8-5, CRiSM working paper series, University of Warwick. Accepted for publication in Biometrika.