PETER A. LACHENBRUCH
OREGON STATE UNIVERSITY
DEPARTMENT OF PUBLIC HEALTH

Clumping at 0
Some subjects show no response; others have a continuous, or at least ordered, response.
Examples:
– Hospitalization expense in an HMO
– Cell growth on plates
– Urinary output in shock patients
The usual normal theory doesn't apply.

Urinary Output (Afifi & Azen)
[Figure: histograms of urinary output (uo) by survival status, panels surv==1 and surv==2; vertical axis Fraction (0 to .883721), horizontal axis uo (0 to 510).]

UO Analysis
Survivors: 27/70 had UO=0; mean=127.9, s=148.13, skewness=1.13
Deaths: 22/43 had UO=0; mean=31.0, s=71.76, skewness=3.37
For these data:
– t=3.01 (p=0.0032)
– Wilcoxon z=2.794 (p=0.0052)
– Kolmogorov-Smirnov p=0.001
– Two-part chi-squared=15.86 (p=0.00036)

Statistical Model
\[ f_i(x,d) = p_i^{1-d}\,\{(1-p_i)\,h_i(x)\}^{d} \]
H0: p1=p2 and h1=h2
Tests:
– t-test on full data set
– Wilcoxon rank sum test
– Kolmogorov-Smirnov
– Two-part models: Bin+Z; Bin+W; Bin+KS

What are the relative properties?
– Right size? Is α=0.05 when it's supposed to be?
– Are the null distributions correct?
– What is the power of these procedures under various alternatives? (Use log-normal model)
  – Difference only in proportions
  – Difference only in means
  – Difference in both

Tests
\[ z = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{2/n}} \]
\[ W = \frac{\sum_{i \in G_1} R_i - n(n+m+1)/2}{\sqrt{n\,m\,(n+m+1)/12}} \]
\[ D_{mn} = \sup_y \left| F_m(y) - G_n(y) \right| \]

Two-part Tests
Define
\[ B = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\,(2/n)}} \]
Then the two-part tests are B²+Z² (denoted BZ), B²+W² (denoted BW) and B²+K² (denoted BK), where K² is the chi-squared value corresponding to the p-value of the KS statistic.
Since the two components are independent, each two-part statistic is the sum of two 1 d.f. (central) chi-squared statistics under the null, that is, a chi-squared with 2 d.f.

Size of Tests
n1=n2=50, equal means

    P1=P2   W       K       Z       BZ      BW      BK
    0.1     0.0432  0.0624  0.0440  0.0424  0.0471  0.0541
    0.2     0.0462  0.0658  0.0466  0.0468  0.0475  0.0549

Probability Plots for null case
n1=n2=100, p1=p2=0.2, means=0
[Figure: six probability plots. Chi-plots of the Wilcoxon (W) and z-test (Z) statistics against expected chi-squared with 1 d.f.; Kolmogorov-Smirnov plot against Uniform(0,1); chi-plots of the BZ, BW and BK tests against expected chi-squared with 2 d.f.]

Power: n=50, 100
P1=0.1, P2=0.2; mean difference=0

    N     W      K      Z      BZ     BW     BK
    50    0.142  0.156  0.065  0.198  0.197  0.206
    100   0.222  0.222  0.092  0.415  0.413  0.424

Power: n=50, 100
Differ only in means; P=0.1, 0.2; mean=0.5

    p, n       W      K      Z      BZ     BW     BK
    0.1, 50    0.467  0.504  0.464  0.374  0.506  0.432
    0.1, 100   0.774  0.769  0.703  0.626  0.821  0.733
    0.2, 50    0.309  0.400  0.370  0.310  0.450  0.377
    0.2, 100   0.592  0.646  0.652  0.583  0.771  0.686

Power: n=100, p1=0.1, p2=0.2, mean=0.3, 0.5
Proportion and mean are consonant

    Mean   W      K      Z      BZ     BW     BK
    0.3    0.784  0.700  0.579  0.627  0.696  0.670
    0.5    0.964  0.945  0.886  0.848  0.936  0.906

Power: n=100, p1=0.2, p2=0.1, mean=0.3, 0.5
Proportion and mean are dissonant

    Mean   W      K      Z      BZ     BW     BK
    0.3    0.055  0.162  0.132  0.616  0.706  0.672
    0.5    0.214  0.459  0.452  0.852  0.933  0.892

Conclusions
– These results are similar to those for other sample sizes and parameter combinations
– Size is appropriate
– Distributions match expectations, except for the largest values
– For differences only in proportions (low proportions), the BZ, BW and BK methods did well; Z did poorly

Conclusions (2)
– For differences only in means, W, K, Z, BW and BK did well
– For consonant differences (mean and proportion in the same direction), W, K, BW and BK did well; Z and BZ did poorly
– For dissonant differences, BW, BK and BZ were far superior to the others

Conclusions (3)
– Theoretical results indicate that computing sample size or power with the non-central chi-squared distribution gives excellent agreement with the simulated powers
– Papers:
  – Comparisons - Statistics in Medicine 2001, p. 1215
  – Non-central - Statistics in Medicine 2001, p. 1235

Peter A. Lachenbruch and John Molitor
Oregon State University

The Two-part Model
Some data have an excess of zero values. These aren't easily modeled because of the spike at 0.
– Can use a mixture model if one cannot distinguish a sampling zero from a structural zero. Example: telephone calls in a short period of time. If the phone is turned on, some time periods may have no calls. If the phone is turned off, no calls are registered.
– Can use a two-part model if all zeros are structural. Examples: hospitalization cost when an insured was not hospitalized; size of growth on an agar plate if all activity is inhibited.

An equation or two
Let y be the response. It is zero if there is no response, and non-zero otherwise.
Let h(y) be the conditional distribution of y given y>0.
Let d be an indicator of non-zero response and p = P(d=0), the probability of a zero response.
For a two-part model, we have
\[ f(y,d) = p^{1-d}\,[(1-p)\,h(y)]^{d} \]
The log-likelihood is easy to compute: it separates into a term involving only p and a term involving only h, so the solution is simply the maximum likelihood estimate for p together with the estimate of the mean (regression) of the non-zero y.

Inference
One estimates parameters using the individual components of the likelihood. These are standard estimates.
For the zero-nonzero part we use a logistic regression, and for the nonzero values we use a multiple regression.
An issue is how to select variables for inclusion in a model.
– Select variables separately for each part of the model?
– Select variables for the model as a whole, using the 0 as if it were a regular observation?

Variable selection criteria
What criterion?
– R2 = 1-RSS/SST
– R2adj = 1-(n-1)/(n-k-1)*RSS/SST
– AIC = n*ln(RSS/n)+2k+n+n*ln(2*pi)
– BIC = n*ln(RSS/n)+k*ln(n)+n+n*ln(2*pi)
(these are for normal distribution models)
Use forward or backward stepping
– P to enter 0.15, 0.05
– P to remove 0.15, 0.05
Best subsets models?
For generalized linear models, the deviance is proposed.

Variable Selection
For the multiple regression, we can use stepwise regression. There are the usual concerns about stepwise.
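As an aside on the likelihood given earlier, the way the two-part log-likelihood splits into a zero/non-zero piece and a piece for the positive values can be checked numerically. The sketch below is in Python rather than Stata and assumes, purely for illustration, that h is log-normal (the model used in the power simulations) with no covariates. The function names and the toy data are hypothetical.

```python
import math

def two_part_loglik(y, p, mu, sigma):
    """Log-likelihood of f(y,d) = p^(1-d) * [(1-p) h(y)]^d, where p = P(y = 0)
    and h is ASSUMED to be a log-normal(mu, sigma) density."""
    ll = 0.0
    for v in y:
        if v == 0:
            ll += math.log(p)                      # d = 0 contributes p
        else:
            z = (math.log(v) - mu) / sigma
            log_h = (-math.log(v) - math.log(sigma)
                     - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)
            ll += math.log(1 - p) + log_h          # d = 1 contributes (1-p) h(y)
    return ll

def two_part_mle(y):
    """Closed-form estimates: each part of the likelihood is maximized separately."""
    positives = [v for v in y if v > 0]
    p_hat = 1 - len(positives) / len(y)            # fraction of zeros
    logs = [math.log(v) for v in positives]
    mu_hat = sum(logs) / len(logs)                 # mean of log(y) over y > 0
    sigma_hat = math.sqrt(sum((l - mu_hat) ** 2 for l in logs) / len(logs))
    return p_hat, mu_hat, sigma_hat

# Toy data (made up): two zeros and three positive responses.
y = [0.0, 0.0, 1.0, math.e, math.e ** 2]
p_hat, mu_hat, sigma_hat = two_part_mle(y)
ll_hat = two_part_loglik(y, p_hat, mu_hat, sigma_hat)
```

Because the binomial part depends only on p and the continuous part only on (mu, sigma), moving any one estimate away from its separately computed value lowers the joint log-likelihood. The same separation is what allows a logistic regression and a regression on the non-zero values to be fitted independently.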
We can use AIC, BIC, or R2 to select the best model.
AIC and BIC penalize the selection based on the number of variables in the model. For normal distributions we have
– AIC = n*ln(RSS/n)+2k+n+n*ln(2*pi)
– BIC = n*ln(RSS/n)+k*ln(n)+n+n*ln(2*pi)
Bias-adjusted versions of R2 and AIC are also available.

More on selection
For the logistic part of the model, we use stepwise logistic regression and specify a p(enter) or p(remove). This is based on the test of the odds ratio for each candidate variable.
Most programs implement this selection with a stepwise routine that uses the test on the odds ratio (basically a normal theory test).

Single model methods
There are two single model methods we consider:
– Include the 0 values in a multiple regression. This is obviously inappropriate, but users have often done this. In practice, it selects more variables and includes the ones that have been selected by the logistic and multiple regression models.
– Conduct a Bayesian analysis of the variable selection problem. This is work in progress.

Computing - Stata
We use Stata for computing because it has some convenient selection commands.
The recently developed command vselect, due to Lindsey and Sheather, allows one to do variable selection using AIC, BIC, R2 and forward or backward stepping, as well as finding the best set of variables for each number of variables.
The Best subsets option uses the "leaps and bounds" algorithm, due to Furnival and Wilson, which vastly reduces the amount of computation.

More on selection
Unfortunately, at present, vselect works only for multiple regression and not for logistic regression. Thus, we considered two strategies:
– Use stepwise logistic regression directly
– Fit the 0-1 variable by ordinary multiple regression and perform the variable selection operation on the results.
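To make the "best set of variables for each number of variables" idea concrete, here is a minimal Python sketch that finds it by brute force, fitting every subset by least squares. This is not the leaps-and-bounds algorithm and not how vselect is implemented; it only shows what that algorithm avoids computing. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, k):
            f = M[r][i] / M[i][i]
            for j in range(i, k + 1):
                M[r][j] -= f * M[i][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (M[i][k] - sum(M[i][j] * x[j] for j in range(i + 1, k))) / M[i][i]
    return x

def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] * n] + [list(c) for c in columns]
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in X] for ci in X]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in X]
    beta = solve(XtX, Xty)
    fitted = [sum(beta[j] * X[j][t] for j in range(len(X))) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))

def best_subsets(columns, y):
    """For each subset size, keep the subset with the smallest RSS."""
    n, p = len(y), len(columns)
    ybar = sum(y) / n
    sst = sum((v - ybar) ** 2 for v in y)
    best, n_fits = {}, 0
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            rss = ols_rss([columns[j] for j in subset], y)
            n_fits += 1
            if size not in best or rss < best[size]["rss"]:
                adj_r2 = 1 - (n - 1) / (n - size - 1) * rss / sst
                best[size] = {"subset": subset, "rss": rss, "adj_r2": adj_r2}
    return best, n_fits

# Toy data (made up): y depends on x0 and x1 only; x2 is unrelated.
x0 = [0, 1, 2, 3, 4, 5, 6, 7]
x1 = [1, 0, 1, 0, 0, 1, 1, 0]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
noise = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
y = [1 + 2 * a - 3 * b + e for a, b, e in zip(x0, x1, noise)]
best, n_fits = best_subsets([x0, x1, x2], y)
```

With p candidate predictors the brute-force search fits 2^p - 1 regressions (7 here, 16383 non-empty subsets for 14 predictors), which is why a pruning method such as leaps and bounds matters in practice.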
The vselect command first computes a multiple regression on all variables, then it computes the stepwise variable selection from the X'X matrix.
It allows the use of R2, AIC, BIC, Mallows' C, and Best subsets regression.
In the example, we use the Best option, which gives all of the above.
The Bayesian methods will be presented separately.

Example data
We use a data set courtesy of Lisa Rider.
– lald = ln(aldosterone) (response)
– aldind – indicator for 0-1
– Dx2 – Polymyositis (1) or Dermatomyositis (2)
– Agedx – age at diagnosis
– Yeardx – year of diagnosis
– Ild – interstitial lung disease Y/N
– Fever >100 – Y/N
– Mechhand – mechanics hands Y/N
– Dysphagia Y/N
– Race – W/NW
– gender – male (0) female (1)
– Arthritis – Y/N
– Raynaud's sign Y/N
– palpitations Y/N
– Proximal weakness Y/N
– Realonspeed – onset speed 1

The prediction problem
We wish to predict laldo. However, 72 out of 420 values are 0. This leads to a clump of zero values.
We may wish to have a single set of predictors for lald, or we may wish to have a set of predictors for the non-zero values and a (possibly distinct) set of predictors for the 0 values.
A related question is how we can evaluate the prediction ability of the resulting equations.

Example of vselect

    . regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed

          Source |       SS       df       MS              Number of obs =     347
    -------------+------------------------------           F( 14,   332) =    4.45
           Model |  44.1754461    14  3.15538901           Prob > F      =  0.0000
        Residual |   235.26075   332  .708616718           R-squared     =  0.1581
    -------------+------------------------------           Adj R-squared =  0.1226
           Total |  279.436196   346  .807619065           Root MSE      =  .84179

    ------------------------------------------------------------------------------
           laldo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           agedx |     0.0061     0.0120     0.51  6.1e-01    -0.0176      0.0298
          yeardx |    -0.0015     0.0086    -0.18  8.6e-01    -0.0185      0.0154
             dx2 |    -0.7198     0.1617    -4.45  1.2e-05    -1.0379     -0.4016
          gender |    -0.1017     0.1016    -1.00  3.2e-01    -0.3015      0.0982
             ild |    -0.0200     0.1802    -0.11  9.1e-01    -0.3744      0.3345
       arthritis |     0.0548     0.0957     0.57  5.7e-01    -0.1334      0.2430
           fever |    -0.0830     0.1000    -0.83  4.1e-01    -0.2798      0.1138
         raynaud |     0.3457     0.1490     2.32  2.1e-02     0.0526      0.6389
        mechhand |    -0.0275     0.1822    -0.15  8.8e-01    -0.3859      0.3310
         palpita |    -0.2085     0.1973    -1.06  2.9e-01    -0.5966      0.1797
         dysphag |     0.2590     0.0983     2.63  8.8e-03     0.0656      0.4525
        proxweak |     0.4575     0.8487     0.54  5.9e-01    -1.2119      2.1270
         racewnw |    -0.0937     0.0991    -0.95  3.4e-01    -0.2887      0.1012
     realonspeed |    -0.1849     0.0445    -4.16  4.1e-05    -0.2723     -0.0974
           _cons |     6.6862    17.2356     0.39  7.0e-01   -27.2186     40.5910
    ------------------------------------------------------------------------------

The next slide gives the vselect command and output. Note the restriction that lald>0 and u80 (an indicator variable that the patient was first diagnosed after 1980).

Vselect output
This is the vselect output on the non-zero values. We truncated at 5 variables selected – the actual output includes all 14 variables.
    . vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed, best

    1 Observations Containing Missing Predictor Values

    Response :             laldo
    Fixed Predictors :
    Selected Predictors:   dx2 realonspeed dysphag raynaud palpita gender racewnw fever arthritis proxweak agedx yeardx mechhand ild

    Actual Regressions      37
    Possible Regressions    16384

    Optimal Models Highlighted:

    # Preds   R2ADJ      C           AIC        AICC       BIC
    1         .0663986   24.09272    888.755    1873.568   896.4537
    2         .1044985   10.09118    875.2897   1860.15    886.8377
    3         .1207073   4.734216    869.9412   1854.861   885.3385
    4         .1356839   -.1055272   864.9669   1849.957   884.2135
    5         .1361631   .7231399    865.7583   1850.832   888.8543
    6         .1365321   1.595634    866.591    1851.76    893.5363

    Selected Predictors
    1 : dx2
    2 : dx2 realonspeed
    3 : dx2 realonspeed raynaud
    4 : dx2 realonspeed dysphag raynaud
    5 : dx2 realonspeed dysphag raynaud racewnw
    6 : dx2 realonspeed dysphag raynaud palpita racewnw

In this case, the program computed only 37 regressions out of 16384 (= 2^14) possible regressions.

Selecting predictors for 0 indicator
For the logistic regressions we use stepwise logistic regression, which selects variables based on odds ratios. We use forward stepping with a p-to-enter of 0.15.

    . stepwise, pe(.15): logistic aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80
    note: proxweak dropped because of estimability
    note: 1 obs. dropped because of estimability
                          begin with empty model
    p = 0.0036 <  0.1500  adding  palpita
    p = 0.0322 <  0.1500  adding  arthritis
    p = 0.0340 <  0.1500  adding  gender

    Logistic regression                               Number of obs   =        418
                                                      LR chi2(3)      =      17.40
                                                      Prob > chi2     =     0.0006
    Log likelihood = -183.34326                       Pseudo R2       =     0.0453

    ------------------------------------------------------------------------------
          aldind | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         palpita |     0.3060     0.1217    -2.98  2.9e-03     0.1403      0.6674
       arthritis |     1.8598     0.5150     2.24  2.5e-02     1.0809      3.2000
          gender |     0.4839     0.1657    -2.12  3.4e-02     0.2474      0.9466
    ------------------------------------------------------------------------------

    . estat ic

    -----------------------------------------------------------------------------
           Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
    -------------+---------------------------------------------------------------
               . |    418   -192.0435   -183.3433      4     374.6865    390.8284
    -----------------------------------------------------------------------------
    Note: N=Obs used in calculating BIC; see [R] BIC note

We see that the dx2 and onset speed variables did not enter, so somewhat different variables predict 0-ness than predict the magnitude of response.

Selecting predictors for 0 with regression, ignoring binomial form
We display only results for the first five selected variables.

    . regress aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

          Source |       SS       df       MS              Number of obs =     419
    -------------+------------------------------           F( 14,   404) =    1.84
           Model |  3.56544676    14  .254674768           Prob > F      =  0.0319
        Residual |  56.0622382   404  .138767916           R-squared     =  0.0598
    -------------+------------------------------           Adj R-squared =  0.0272
           Total |   59.627685   418  .142649964           Root MSE      =  .37252

    ------------------------------------------------------------------------------
          aldind |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           agedx |    -0.0053     0.0047    -1.14  2.5e-01    -0.0145      0.0038
          yeardx |     0.0017     0.0035     0.50  6.2e-01    -0.0051      0.0085
             dx2 |    -0.0281     0.0646    -0.43  6.6e-01    -0.1550      0.0988
          gender |    -0.0857     0.0416    -2.06  4.0e-02    -0.1675     -0.0039
             ild |    -0.0459     0.0714    -0.64  5.2e-01    -0.1862      0.0944
       arthritis |     0.0789     0.0380     2.08  3.8e-02     0.0043      0.1535
           fever |     0.0636     0.0396     1.61  1.1e-01    -0.0143      0.1414
         raynaud |     0.0049     0.0599     0.08  9.4e-01    -0.1129      0.1226
        mechhand |     0.0803     0.0765     1.05  2.9e-01    -0.0701      0.2306
         palpita |    -0.2003     0.0701    -2.86  4.5e-03    -0.3382     -0.0624
         dysphag |    -0.0360     0.0390    -0.92  3.6e-01    -0.1127      0.0407
        proxweak |    -0.2055     0.3751    -0.55  5.8e-01    -0.9429      0.5319
         racewnw |     0.0280     0.0395     0.71  4.8e-01    -0.0496      0.1057
     realonspeed |    -0.0053     0.0178    -0.30  7.6e-01    -0.0404      0.0297
           _cons |    -2.2499     6.9270    -0.32  7.5e-01   -15.8673     11.3676
    ------------------------------------------------------------------------------

Selecting predictors for 0 with regression, ignoring binomial form, 2

    . vselect aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best

    2 Observations Containing Missing Predictor Values

    Response :             aldind
    Fixed Predictors :
    Selected Predictors:   palpita arthritis gender fever agedx mechhand dysphag racewnw ild proxweak yeardx dx2 realonspeed raynaud

    Actual Regressions      62
    Possible Regressions    16384

    Optimal Models Highlighted:

    # Preds   R2ADJ      C          AIC        AICC       BIC
    1         .0197545   5.197552   366.7613   1555.89    374.837
    2         .028156    2.597088   364.1486   1553.316   376.2622
    3         .0365444   .0194683   361.5079   1550.724   377.6594
    4         .0389249   .0159628   361.4605   1550.735   381.6499
    5         .0403595   .4189426   361.8213   1551.164   386.0485

    Selected Predictors
    1 : palpita
    2 : palpita arthritis
    3 : palpita arthritis gender
    4 : palpita arthritis gender fever
    5 : palpita arthritis gender fever agedx

Note that the first three selected variables (palpita, arthritis, gender) are identical to those chosen by the stepwise logistic regression.
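The AIC and BIC columns in the vselect table above can be cross-checked against the normal-theory formulas. The Python sketch below back-computes RSS from the printed adjusted R2 and the total sum of squares in the regress output for aldind. It assumes that the constant term is n*ln(2*pi), that BIC carries the same +n term as AIC, and that k counts the intercept. These assumptions are inferred from the printed numbers, not taken from the vselect documentation, and the function names are hypothetical.

```python
import math

def rss_from_adj_r2(adj_r2, sst, n, n_pred):
    """Invert R2adj = 1 - (n-1)/(n-n_pred-1) * RSS/SST to recover RSS."""
    return (1 - adj_r2) * sst * (n - n_pred - 1) / (n - 1)

def aic_normal(rss, n, k):
    """Normal-theory AIC; k is ASSUMED to include the intercept."""
    return n * math.log(rss / n) + 2 * k + n + n * math.log(2 * math.pi)

def bic_normal(rss, n, k):
    """Normal-theory BIC with the same constants as aic_normal (an assumption)."""
    return n * math.log(rss / n) + k * math.log(n) + n + n * math.log(2 * math.pi)

# Values copied from the output for aldind: 419 observations,
# total SS 59.627685, and R2ADJ .0365444 for the best 3-predictor model.
n, sst, n_pred = 419, 59.627685, 3
rss = rss_from_adj_r2(0.0365444, sst, n, n_pred)
k = n_pred + 1                      # three slopes plus the intercept
aic_3 = aic_normal(rss, n, k)
bic_3 = bic_normal(rss, n, k)
print(round(aic_3, 4), round(bic_3, 4))   # compare with 361.5079 and 377.6594
```

If the two printed values agree with the table to rounding error, the criteria used in the best-subsets output are the usual Gaussian ones, and differences in AIC or BIC between models of different size depend only on RSS and the number of estimated coefficients.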
Multiple regression with 0 in the data set
We now consider the model including 0 as part of the data. This may be made a bit easier by having taken logs of the non-zero values, so the 0s aren't quite so obviously different.

    . regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

          Source |       SS       df       MS              Number of obs =     419
    -------------+------------------------------           F( 14,   404) =    2.84
           Model |    62.68539    14  4.47752786           Prob > F      =  0.0004
        Residual |  638.017201   404   1.5792505           R-squared     =  0.0895
    -------------+------------------------------           Adj R-squared =  0.0579
           Total |  700.702591   418  1.67632199           Root MSE      =  1.2567

    ------------------------------------------------------------------------------
           laldo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           agedx |    -0.0075     0.0157    -0.48  6.4e-01    -0.0383      0.0234
          yeardx |     0.0024     0.0117     0.21  8.4e-01    -0.0206      0.0254
             dx2 |    -0.6763     0.2178    -3.11  2.0e-03    -1.1044     -0.2482
          gender |    -0.3182     0.1404    -2.27  2.4e-02    -0.5941     -0.0423
             ild |    -0.1800     0.2408    -0.75  4.6e-01    -0.6533      0.2933
       arthritis |     0.2548     0.1280     1.99  4.7e-02     0.0031      0.5065
           fever |     0.1069     0.1336     0.80  4.2e-01    -0.1557      0.3695
         raynaud |     0.3104     0.2021     1.54  1.3e-01    -0.0868      0.7076
        mechhand |     0.2043     0.2580     0.79  4.3e-01    -0.3029      0.7115
         palpita |    -0.7101     0.2366    -3.00  2.9e-03    -1.1753     -0.2449
         dysphag |     0.1165     0.1315     0.89  3.8e-01    -0.1422      0.3751
        proxweak |    -0.0250     1.2653    -0.02  9.8e-01    -2.5124      2.4625
         racewnw |    -0.0079     0.1332    -0.06  9.5e-01    -0.2698      0.2541
     realonspeed |    -0.1742     0.0601    -2.90  4.0e-03    -0.2924     -0.0560
           _cons |    -0.8421    23.3682    -0.04  9.7e-01   -46.7806     45.0964
    ------------------------------------------------------------------------------

Using vselect on the full data set
Displaying best five

    . vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best

    2 Observations Containing Missing Predictor Values

    Response :             laldo
    Fixed Predictors :
    Selected Predictors:   dx2 palpita realonspeed gender arthritis raynaud dysphag fever mechhand ild agedx yeardx racewnw proxweak

    Actual Regressions      47
    Possible Regressions    16384

    Optimal Models Highlighted:

    # Preds   R2ADJ      C          AIC        AICC       BIC
    1         .0154376   20.79848   1401.003   2590.131   1409.079
    2         .0322276   14.33945   1394.79    2583.957   1406.904
    3         .048014    8.358132   1388.891   2578.106   1405.042
    4         .0580737   4.926931   1385.429   2574.703   1405.618
    5         .0673386   1.865516   1382.274   2571.617   1406.501
    6         .0695667   1.901132   1382.256   2571.677   1410.521
    7         .0699354   2.752656   1383.071   2572.582   1415.374

    Selected Predictors
    1 : dx2
    2 : dx2 palpita
    3 : dx2 palpita realonspeed
    4 : dx2 palpita realonspeed arthritis
    5 : dx2 palpita realonspeed gender arthritis
    6 : dx2 palpita realonspeed gender arthritis raynaud
    7 : dx2 palpita realonspeed gender arthritis raynaud dysphag

There are some differences in the variables selected by logistic regression and multiple regression. Raynaud's and dysphagia were selected in the multiple regression.

Future Steps
– Develop a full Bayesian analysis/model
  – May include a model that involves selection of variables with 0 values in the variable selection set, or may involve a Bayesian model on the non-zero values and a model for the variable of zero and non-zero values
– Develop a model using a bootstrap and select based on Wald statistics
– Stay tuned…