Comparisons of Two-Part Models with Competitors

PETER A. LACHENBRUCH
OREGON STATE UNIVERSITY
DEPARTMENT OF PUBLIC HEALTH
Clumping at 0
 Some subjects show no response, others have a
continuous, or at least ordered response
 Examples:
 Hospitalization expense in an HMO
 Cell growth on plates
 Urinary output in shock patients
 Usual normal theory doesn’t apply
2
Urinary Output
(Afifi & Azen)
[Figure: histograms of urinary output (uo) by survival status, one panel for surv==1 and one for surv==2; horizontal axis uo from 0 to 510, vertical axis fraction.]
3
UO Analysis
 Survivors: 27/70 had UO=0; mean=127.9, s=148.13, skewness=1.13
 Deaths: 22/43 had UO=0; mean=31.0, s=71.76, skewness=3.37
 For these data:
– t=3.01 (p=0.0032)
– Wilcoxon z=2.794 (p=0.0052)
– Kolmogorov-Smirnov p=0.001
– Two-part $\chi^2$=15.86 (p=0.00036)
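As a quick check of the last line, the upper tail of a chi-squared distribution with 2 d.f. has the closed form exp(-x/2), so the quoted p-value can be reproduced directly. The sketch below also computes a binomial component from the zero counts on this slide using a pooled-proportion z statistic; that form is an assumption, since the slide does not report the components separately, and the function names are illustrative.

```python
import math

def chi2_2df_pvalue(x):
    """Upper-tail probability of a chi-squared variate with 2 d.f."""
    return math.exp(-x / 2.0)

def binomial_chi2(zeros1, n1, zeros2, n2):
    """Squared pooled two-sample z statistic for the proportion of zeros."""
    p1, p2 = zeros1 / n1, zeros2 / n2
    pooled = (zeros1 + zeros2) / (n1 + n2)
    return (p1 - p2) ** 2 / (pooled * (1 - pooled) * (1 / n1 + 1 / n2))

p_value = chi2_2df_pvalue(15.86)      # two-part statistic quoted on the slide
b2 = binomial_chi2(27, 70, 22, 43)    # zero counts quoted on the slide
print(f"p = {p_value:.5f}, binomial component = {b2:.2f}")
```

Under these assumptions the binomial part accounts for only a small share of the 15.86, which suggests that most of the evidence comes from the non-zero values.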
4
Statistical Model
 $f_i(x,d) = p_i^{1-d}\,\{(1-p_i)\,h_i(x)\}^{d}$
 $H_0$: $p_1 = p_2$, $h_1 = h_2$
 Tests:
– t-test on full data set
– Wilcoxon rank sum test
– Kolmogorov-Smirnov
– Two part Models: Bin+Z; Bin+W; Bin+KS
5
What are the relative properties?
 Right size? Is $\alpha$=0.05 when it’s supposed to be?
 Are the null distributions correct?
 What is the power of these procedures under various
alternatives? (Use log-normal model)
– Difference only in proportions
– Difference only in means
– Difference in both
6
Tests
$z = \dfrac{\bar{y}_1 - \bar{y}_2}{\sqrt{2 s_p^2 / n}}$
$W = \dfrac{\sum_{i \in G_1} R_i - n(n+m+1)/2}{\sqrt{n\,m\,(n+m+1)/12}}$
$D_{mn} = \sup_y |F_m(y) - G_n(y)|$
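The Kolmogorov-Smirnov statistic $D_{mn}$ is the largest gap between the two empirical distribution functions, and that gap is always attained at an observed value. A minimal sketch, with an illustrative function name:

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov distance sup |F_m - G_n|."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    # bisect_right(xs, v) counts observations <= v, so dividing by the
    # sample size gives the empirical c.d.f. evaluated at v.
    return max(
        abs(bisect_right(xs, v) / m - bisect_right(ys, v) / n)
        for v in xs + ys
    )

print(ks_statistic([1, 3], [2, 4]))
```

With clumping at 0, the shared zeros are ties between the two samples; the sketch handles them because both empirical distribution functions jump at the same point.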
7
Two-part Tests
 Define $B = \dfrac{p_1 - p_2}{\sqrt{2\,p(1-p)/n}}$
 Then the two-part tests are: $B^2+Z^2$ (denoted as BZ),
$B^2+W^2$ (denoted as BW) and $B^2+K^2$ (denoted as BK),
where $K^2$ is the chi-squared value corresponding to
the p-value of the KS statistic.
 Since these are independent, we have the sum of
two 1 d.f. (central) chi-squared statistics (under the
null)
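A minimal sketch of the BZ version of the test follows. The slides give the formulas for equal group sizes; the version below uses the usual pooled forms for unequal sizes, which is an assumption, as are the function name and the choice to treat any value other than 0 as a response.

```python
import math
from statistics import mean, variance

def two_part_bz(sample1, sample2):
    """Return (B2, Z2, X2, p_value) for the BZ two-part test.

    Assumes each group has at least two non-zero values and that the
    pooled proportion of zeros is strictly between 0 and 1.
    """
    n1, n2 = len(sample1), len(sample2)
    pos1 = [v for v in sample1 if v != 0]
    pos2 = [v for v in sample2 if v != 0]

    # Binomial part: compare the proportions of zeros.
    p1 = 1 - len(pos1) / n1
    p2 = 1 - len(pos2) / n2
    p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)
    b2 = (p1 - p2) ** 2 / (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    # Continuous part: pooled-variance z statistic on the non-zero values.
    m1, m2 = len(pos1), len(pos2)
    sp2 = ((m1 - 1) * variance(pos1) + (m2 - 1) * variance(pos2)) / (m1 + m2 - 2)
    z2 = (mean(pos1) - mean(pos2)) ** 2 / (sp2 * (1 / m1 + 1 / m2))

    x2 = b2 + z2
    # Chi-squared with 2 d.f.: upper tail is exp(-x/2).
    return b2, z2, x2, math.exp(-x2 / 2)

b2, z2, x2, p_value = two_part_bz([0, 0, 1, 2, 3], [0, 4, 5, 6, 7])
print(b2, z2, x2, p_value)
```

Replacing the continuous part with a squared Wilcoxon z or with $K^2$ would give BW or BK in the same way.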
8
Size of Tests
 n1=n2=50, equal means

P1=P2   W        K        Z        BZ       BW       BK
0.1     0.0432   0.0624   0.0440   0.0424   0.0471   0.0541
0.2     0.0462   0.0658   0.0466   0.0468   0.0475   0.0549
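Sizes such as these can be estimated by simulating under the null. The sketch below is not the authors' simulation code: it assumes a log-normal model with unit log-scale variance, applies the z statistic to the logs of the non-zero values, and uses an arbitrary seed and replication count.

```python
import math
import random
from statistics import mean, variance

def bz_statistic(x1, x2):
    """B^2 + Z^2, with Z computed on logs of the non-zero values (assumed)."""
    n1, n2 = len(x1), len(x2)
    y1 = [math.log(v) for v in x1 if v > 0]
    y2 = [math.log(v) for v in x2 if v > 0]
    p1, p2 = 1 - len(y1) / n1, 1 - len(y2) / n2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    # If no zeros (or only zeros) occur, the binomial part carries no information.
    b2 = (p1 - p2) ** 2 / (p * (1 - p) * (1 / n1 + 1 / n2)) if 0 < p < 1 else 0.0
    m1, m2 = len(y1), len(y2)
    sp2 = ((m1 - 1) * variance(y1) + (m2 - 1) * variance(y2)) / (m1 + m2 - 2)
    return b2 + (mean(y1) - mean(y2)) ** 2 / (sp2 * (1 / m1 + 1 / m2))

def draw(n, p_zero, mu, rng):
    """n observations: zero with probability p_zero, else log-normal(mu, 1)."""
    return [0.0 if rng.random() < p_zero else rng.lognormvariate(mu, 1.0)
            for _ in range(n)]

def estimated_size(n=50, p_zero=0.1, reps=2000, alpha=0.05, seed=1):
    """Share of null replications in which BZ exceeds the 2-d.f. critical value."""
    rng = random.Random(seed)
    critical = -2.0 * math.log(alpha)   # solves exp(-c/2) = alpha
    rejections = sum(
        bz_statistic(draw(n, p_zero, 0.0, rng), draw(n, p_zero, 0.0, rng)) > critical
        for _ in range(reps)
    )
    return rejections / reps

size = estimated_size()
print(size)
```

Giving the two groups different values of `p_zero` or `mu` in the calls to `draw` would turn the same loop into a power estimate.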
9
Probability Plots for null case
n1=n2=100, p1=p2=0.2, Means=0
[Figure: six probability plots of the simulated null distributions. Chi-plot for Wilcoxon (W against expected chi-squared, d.f. = 1); Kolmogorov-Smirnov plot vs. Uniform(0,1); chi-plot of z-test (Z against expected chi-squared, d.f. = 1); chi-plots of the BZ, BW and BK tests (each against expected chi-squared, d.f. = 2).]
10
Power: n=50, 100
 P1=0.1, P2=0.2; mean difference=0

N     W       K       Z       BZ      BW      BK
50    0.142   0.156   0.065   0.198   0.197   0.206
100   0.222   0.222   0.092   0.415   0.413   0.424
11
Power: n=50, 100
Differ only in means
 P=0.1, 0.2; mean=0.5

p, n       W       K       Z       BZ      BW      BK
0.1, 50    0.467   0.504   0.464   0.374   0.506   0.432
0.1, 100   0.774   0.769   0.703   0.626   0.821   0.733
0.2, 50    0.309   0.400   0.370   0.310   0.450   0.377
0.2, 100   0.592   0.646   0.652   0.583   0.771   0.686
12
Power: n=100, p1=0.1, p2=0.2
mean=0.3, 0.5
 Proportion and mean are consonant

Mean   W       K       Z       BZ      BW      BK
0.3    0.784   0.700   0.579   0.627   0.696   0.670
0.5    0.964   0.945   0.886   0.848   0.936   0.906
13
Power: n=100, p1=0.2, p2=0.1
mean=0.3, 0.5
 Proportion and mean are dissonant

Mean   W       K       Z       BZ      BW      BK
0.3    0.055   0.162   0.132   0.616   0.706   0.672
0.5    0.214   0.459   0.452   0.852   0.933   0.892
14
Conclusions
 These results are similar to those for other sample
sizes and parameter combinations
 Size is appropriate
 Distributions match expectations, except for largest
values
 For differences only in proportions (low proportions),
the BZ, BW and BK methods did well, Z did poorly
15
Conclusions (2)
 For differences only in means, the W, K, Z, BW and
BK did well
 For consonant differences (mean and proportion in
same direction), W, K, BW and BK did well, Z and
BZ poorly
 For dissonant differences, BW, BK and BZ were far
superior to the others
16
Conclusions (3)
 Theoretical results indicate that computing sample
size or power with the non-central $\chi^2$ distribution
gives an excellent agreement with the simulated
powers
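For a 2-d.f. test this power calculation needs no special library: a non-central chi-squared variate is a Poisson mixture of central chi-squared variates, and central chi-squared tails with even degrees of freedom have a closed form. The sketch below takes the noncentrality parameter as given; how it is derived from the proportions and means is in the cited paper and is not reproduced here. The function name and the series length are illustrative choices.

```python
import math

def power_chi2_2df(noncentrality, alpha=0.05, terms=200):
    """P(X > c) for X ~ non-central chi-squared with 2 d.f.

    c is the central 2-d.f. critical value, which solves exp(-c/2) = alpha.
    The series is truncated at `terms`; that is ample when noncentrality/2
    is well below `terms`.
    """
    half_c = -math.log(alpha)                 # c / 2
    half_l = noncentrality / 2.0
    poisson_weight = math.exp(-half_l)        # Poisson(half_l) mass at j = 0
    term = 1.0                                # (c/2)^j / j!
    partial_sum = 0.0
    power = 0.0
    for j in range(terms):
        partial_sum += term
        # Central chi-squared tail with 2(j+1) d.f. is exp(-c/2) * partial_sum.
        power += poisson_weight * math.exp(-half_c) * partial_sum
        term *= half_c / (j + 1)
        poisson_weight *= half_l / (j + 1)
    return power

for lam in (0.0, 5.0, 9.63):
    print(lam, round(power_chi2_2df(lam), 3))
```

At zero noncentrality the function returns the nominal level, which is a convenient sanity check before using it for sample-size work.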
 Papers:
 Comparisons - Statistics in Medicine 2001, p. 1215
 Non-central - Statistics in Medicine 2001, p. 1235
17
Peter A. Lachenbruch and John Molitor
Oregon State University
The Two-part Model
 Some data have an excess of zero values. These can’t
be easily modeled because of the spike at 0.
 Can use a mixture model if one cannot distinguish a
sampling zero from a structural zero. Example:
telephone calls in a short period of time. If phone is
turned on, some time periods may have no calls. If
phone is turned off, there are no calls registered.
 Can use two-part model if all zeros are structural.
Example: hospitalization cost when an insured was not
hospitalized. Size of growth on an agar plate if all
activity is inhibited.
19
An equation or two
 Let y be the response. It is zero if no response, and
non-zero otherwise. Let h(y) be the conditional
distribution of y given y>0
 Let d be an indicator of non-zero response and
p the probability that d=0 (no response)
 For a two-part model, we have
$f(y,d) = p^{1-d}\,[(1-p)\,h(y)]^{d}$
 The log-likelihood is easy to compute and the
solution is simply the maximum likelihood estimate for p and for
the mean (regression) of y.
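Writing out the log-likelihood for n independent observations shows why the two parts can be estimated separately. This is a sketch in the slide's notation, reading p as the probability of a zero response as the density implies; the observation index i and the count $n_0$ of zero responses are introduced here for illustration.

```latex
\begin{aligned}
\log L &= \sum_{i=1}^{n}\Big[(1-d_i)\log p + d_i\log(1-p)\Big]
          \;+\; \sum_{i:\,d_i=1}\log h(y_i),\\[4pt]
\hat{p} &= \frac{n_0}{n}, \qquad n_0 = \sum_{i=1}^{n}(1-d_i).
\end{aligned}
```

The first sum involves only p and the second only the parameters of h, so each can be maximized without reference to the other.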
20
Inference
 One estimates parameters using the individual
components of the likelihood. These are standard
estimates. For the zero-nonzero part we use a logistic
regression, and for the nonzero values we use a multiple
regression.
 An issue is how to select variables for inclusion in a
model.
 Select variables separately for each part of the model?
 Select variables for the model as a whole using the 0 as if it were
a regular observation.
21
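Variable selection criteria
 What criterion:
 $R^2 = 1 - \mathrm{RSS}/\mathrm{SST}$
 $R^2_{adj} = 1 - \dfrac{n-1}{n-k-1}\,\mathrm{RSS}/\mathrm{SST}$
 $\mathrm{AIC} = n\ln(\mathrm{RSS}/n) + 2k + n + n\ln(2\pi)$
 $\mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln(n) + n + n\ln(2\pi)$ (these are for normal distribution
models)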
 Use forward or backward stepping
 P to enter 0.15, 0.05
 P to remove 0.15, 0.05
 Best subsets models?
 For generalized linear models, the deviance is proposed.
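The normal-theory criteria can be computed from the residual and total sums of squares alone. The sketch below is illustrative: the function names are made up, both information criteria carry the Gaussian log-likelihood constant n + n ln(2π) (which does not change the ranking of models fitted to the same data), and the parameter count is left to the caller because the slide does not say whether k includes the intercept.

```python
import math

def r_squared(rss, sst):
    return 1.0 - rss / sst

def adjusted_r_squared(rss, sst, n, k):
    """k is the number of predictors, excluding the intercept."""
    return 1.0 - (n - 1) / (n - k - 1) * rss / sst

def aic_normal(rss, n, n_params):
    return n * math.log(rss / n) + 2 * n_params + n + n * math.log(2 * math.pi)

def bic_normal(rss, n, n_params):
    return (n * math.log(rss / n) + n_params * math.log(n)
            + n + n * math.log(2 * math.pi))

# Sums of squares from the 14-predictor regression of laldo shown later in the deck.
rss, sst, n, k = 235.26075, 279.436196, 347, 14
print(round(r_squared(rss, sst), 4), round(adjusted_r_squared(rss, sst, n, k), 4))
```

The two printed values can be compared with the R-squared and Adj R-squared lines of that regression output. The AIC and BIC columns of the vselect tables later in the deck appear to count the intercept as a parameter.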
22
Variable Selection
 For the multiple regression, we can use stepwise
regression. There are the usual concerns about
stepwise.
 We can use AIC, BIC, $R^2$ to select the best model.
AIC and BIC penalize the selection based on the
number of variables in the model. For normal
distributions we have
 $\mathrm{AIC} = n\ln(\mathrm{RSS}/n) + 2k + n + n\ln(2\pi)$
 $\mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln(n) + n + n\ln(2\pi)$
 Bias-adjusted versions of $R^2$ and AIC are also available
23
More on selection
 For the logistic part of the model, we use stepwise
logistic regression and specify a p(enter) or
p(remove) – this is based on the test of the odds
ratio for each candidate variable.
 For variable selection, most programs use a
stepwise routine that selects on the basis of the test
on the odds ratio (basically a normal theory test).
24
Single model methods
 There are two single model methods we consider:
 Include the 0 values in a multiple regression
 This is obviously inappropriate, but users have often done this
 In practice, it selects more variables and includes the ones
that have been selected by the logistic and multiple
regression models.
 Conduct a Bayesian analysis of the variable selection
problem. This is work in progress.
25
Computing - Stata
 We use Stata for computing because it has some
convenient selection commands.
 The recently developed command, vselect, due to
Lindsey and Sheather, allows one to do variable
selection using AIC, BIC, $R^2$ and forward or backward
stepping, as well as finding the best set of variables for
each number of variables.
 The Best subsets option uses the “leaps and bounds”
algorithm, due to Furnival and Wilson, which vastly
reduces the amount of computation.
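To see what leaps and bounds saves, it helps to look at the brute-force alternative: fit every subset and keep the lowest-RSS model of each size. The sketch below does exactly that with a small hand-written least-squares solver. It is not the vselect algorithm, the function names are illustrative, and it assumes the predictors in each subset are not collinear.

```python
from itertools import combinations

def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

def best_subsets(predictors, y):
    """Map each subset size to (rss, names) of its lowest-RSS subset."""
    names = list(predictors)
    return {
        size: min((ols_rss([predictors[v] for v in subset], y), subset)
                  for subset in combinations(names, size))
        for size in range(1, len(names) + 1)
    }

data = {"x1": [1, 2, 3, 4, 5, 6], "x2": [1, 0, 1, 0, 1, 0]}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
best = best_subsets(data, y)
print(best[1][1], best[2][1])
```

With k candidate predictors this loop fits 2^k - 1 models, which is why a pruning method is needed once k reaches the teens.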
26
More on selection
 Unfortunately, at present, vselect works only for multiple
regression and not for logistic regression. Thus, we
considered two strategies:
 Use stepwise logistic regression directly
 Fit the 0-1 indicator by ordinary multiple regression and perform the variable
selection on that fit.
 The vselect command first computes a multiple regression on
all variables, then it computes the stepwise variable selection
from the X’X matrix
 It allows the use of $R^2$, AIC, BIC, Mallows’ $C_p$, and Best subsets
regression. In the example, we use the Best option, which gives all of
the above
 The Bayesian methods will be presented separately.
27
Example data
 We use a data set courtesy of Lisa Rider.
 lald = ln(aldosterone) (response)
 aldind – indicator for 0-1
 Dx2 – Polymyositis (1) or Dermatomyositis (2)
 Agedx – age at diagnosis
 Yeardx – year of diagnosis
 Ild – interstitial lung disease Y/N
 Fever >100 – Y/N
 Mechhand – mechanic’s hands Y/N
 Dysphagia – Y/N
 Race – W/NW
 Gender – male (0) female (1)
 Arthritis – Y/N
 Raynaud’s sign – Y/N
 Palpitations – Y/N
 Proximal weakness – Y/N
 Realonspeed – onset speed 1
28
The prediction problem
 We wish to predict laldo. However, 72 out of 420 are
0. This leads to a clump of zero values.
 We may wish to have a single set of predictors for
lald, or we may wish to have a set of predictors for
the non-zero values and a (possibly distinct) set of
predictors for the 0 values.
 A related question is how can we evaluate the
prediction ability of the resulting equations?
29
Example of vselect

. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita
    dysphag proxweak racewnw realonspeed

      Source |       SS       df       MS              Number of obs =     347
-------------+------------------------------           F( 14,   332) =    4.45
       Model |  44.1754461    14  3.15538901           Prob > F      =  0.0000
    Residual |   235.26075   332  .708616718           R-squared     =  0.1581
-------------+------------------------------           Adj R-squared =  0.1226
       Total |  279.436196   346  .807619065           Root MSE      =  .84179

------------------------------------------------------------------------------
       laldo |      Coef.  Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agedx |     0.0061     0.0120     0.51   6.1e-01     -0.0176     0.0298
      yeardx |    -0.0015     0.0086    -0.18   8.6e-01     -0.0185     0.0154
         dx2 |    -0.7198     0.1617    -4.45   1.2e-05     -1.0379    -0.4016
      gender |    -0.1017     0.1016    -1.00   3.2e-01     -0.3015     0.0982
         ild |    -0.0200     0.1802    -0.11   9.1e-01     -0.3744     0.3345
   arthritis |     0.0548     0.0957     0.57   5.7e-01     -0.1334     0.2430
       fever |    -0.0830     0.1000    -0.83   4.1e-01     -0.2798     0.1138
     raynaud |     0.3457     0.1490     2.32   2.1e-02      0.0526     0.6389
    mechhand |    -0.0275     0.1822    -0.15   8.8e-01     -0.3859     0.3310
     palpita |    -0.2085     0.1973    -1.06   2.9e-01     -0.5966     0.1797
     dysphag |     0.2590     0.0983     2.63   8.8e-03      0.0656     0.4525
    proxweak |     0.4575     0.8487     0.54   5.9e-01     -1.2119     2.1270
     racewnw |    -0.0937     0.0991    -0.95   3.4e-01     -0.2887     0.1012
 realonspeed |    -0.1849     0.0445    -4.16   4.1e-05     -0.2723    -0.0974
       _cons |     6.6862    17.2356     0.39   7.0e-01    -27.2186    40.5910
------------------------------------------------------------------------------
The next slide gives the vselect command and output. Note the restriction to lald>0 and u80 (an
indicator variable that the patient was first diagnosed after 1980).
30
Vselect output
This is the vselect output on the non-zero values.
We truncated at 5 variables selected – the actual output includes all 14 variables

. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita
    dysphag proxweak racewnw realonspeed ,best
1 Observations Containing Missing Predictor Values

Response :             laldo
Fixed Predictors :
Selected Predictors:   dx2 realonspeed dysphag raynaud palpita gender racewnw fever a
> rthritis proxweak agedx yeardx mechhand ild

Actual Regressions        37
Possible Regressions   16384

Optimal Models Highlighted:

   # Preds      R2ADJ           C        AIC       AICC        BIC
         1   .0663986    24.09272    888.755   1873.568   896.4537
         2   .1044985    10.09118   875.2897    1860.15   886.8377
         3   .1207073    4.734216   869.9412   1854.861   885.3385
         4   .1356839   -.1055272   864.9669   1849.957   884.2135
         5   .1361631    .7231399   865.7583   1850.832   888.8543
         6   .1365321    1.595634    866.591    1851.76   893.5363

Selected Predictors
  1 : dx2
  2 : dx2 realonspeed
  3 : dx2 realonspeed raynaud
  4 : dx2 realonspeed dysphag raynaud
  5 : dx2 realonspeed dysphag raynaud racewnw
  6 : dx2 realonspeed dysphag raynaud palpita racewnw
In this case, the program computed 37 regressions out of the 16384 (= $2^{14}$) possible regressions
31
Selecting predictors for 0 indicator
For the logistic regressions we use stepwise logistic regression that selects variables based on odds ratios. We use forward stepping with a p-to-enter of 0.15.

stepwise, pe(.15): logistic aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak
    racewnw realonspeed if u80
note: proxweak dropped because of estimability
note: 1 obs. dropped because of estimability
                      begin with empty model
p = 0.0036 <  0.1500  adding  palpita
p = 0.0322 <  0.1500  adding  arthritis
p = 0.0340 <  0.1500  adding  gender

Logistic regression                               Number of obs   =        418
                                                  LR chi2(3)      =      17.40
                                                  Prob > chi2     =     0.0006
Log likelihood = -183.34326                       Pseudo R2       =     0.0453

------------------------------------------------------------------------------
      aldind | Odds Ratio  Std. Err.        z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     palpita |     0.3060     0.1217    -2.98   2.9e-03      0.1403     0.6674
   arthritis |     1.8598     0.5150     2.24   2.5e-02      1.0809     3.2000
      gender |     0.4839     0.1657    -2.12   3.4e-02      0.2474     0.9466
------------------------------------------------------------------------------

estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           . |    418   -192.0435   -183.3433      4     374.6865    390.8284
-----------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note
We see that the dx2 and onset speed variables did not enter, so somewhat different variables predict 0-ness than the magnitude of response
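The summary figures in the logistic output all follow from the two maximized log-likelihoods, which makes them easy to check by hand. A minimal sketch, using the standard definitions (likelihood-ratio chi-squared, McFadden pseudo R-squared, AIC and BIC) and an illustrative function name:

```python
import math

def information_criteria(ll_null, ll_model, df, n_obs):
    """Summaries that depend only on the two maximized log-likelihoods."""
    return {
        "lr_chi2": 2.0 * (ll_model - ll_null),
        "pseudo_r2": 1.0 - ll_model / ll_null,
        "aic": -2.0 * ll_model + 2.0 * df,
        "bic": -2.0 * ll_model + df * math.log(n_obs),
    }

# Values copied from the estat ic table for the three-variable logistic model.
fit = information_criteria(ll_null=-192.0435, ll_model=-183.3433, df=4, n_obs=418)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

Here df = 4 counts the three selected variables plus the constant, as in the estat ic table.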
32
Selecting predictors for 0 with
regression, ignoring binomial form
We display only results for first five selected
variables.

regress aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpi
> ta dysphag proxweak racewnw realonspeed if u80

      Source |       SS       df       MS              Number of obs =     419
-------------+------------------------------           F( 14,   404) =    1.84
       Model |  3.56544676    14  .254674768           Prob > F      =  0.0319
    Residual |  56.0622382   404  .138767916           R-squared     =  0.0598
-------------+------------------------------           Adj R-squared =  0.0272
       Total |   59.627685   418  .142649964           Root MSE      =  .37252

------------------------------------------------------------------------------
      aldind |      Coef.  Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agedx |    -0.0053     0.0047    -1.14   2.5e-01     -0.0145     0.0038
      yeardx |     0.0017     0.0035     0.50   6.2e-01     -0.0051     0.0085
         dx2 |    -0.0281     0.0646    -0.43   6.6e-01     -0.1550     0.0988
      gender |    -0.0857     0.0416    -2.06   4.0e-02     -0.1675    -0.0039
         ild |    -0.0459     0.0714    -0.64   5.2e-01     -0.1862     0.0944
   arthritis |     0.0789     0.0380     2.08   3.8e-02      0.0043     0.1535
       fever |     0.0636     0.0396     1.61   1.1e-01     -0.0143     0.1414
     raynaud |     0.0049     0.0599     0.08   9.4e-01     -0.1129     0.1226
    mechhand |     0.0803     0.0765     1.05   2.9e-01     -0.0701     0.2306
     palpita |    -0.2003     0.0701    -2.86   4.5e-03     -0.3382    -0.0624
     dysphag |    -0.0360     0.0390    -0.92   3.6e-01     -0.1127     0.0407
    proxweak |    -0.2055     0.3751    -0.55   5.8e-01     -0.9429     0.5319
     racewnw |     0.0280     0.0395     0.71   4.8e-01     -0.0496     0.1057
 realonspeed |    -0.0053     0.0178    -0.30   7.6e-01     -0.0404     0.0297
       _cons |    -2.2499     6.9270    -0.32   7.5e-01    -15.8673    11.3676
------------------------------------------------------------------------------
33
Selecting predictors for 0 with
regression, ignoring binomial form, 2

. vselect aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpi
> ta dysphag proxweak racewnw realonspeed if u80,best
2 Observations Containing Missing Predictor Values

Response :             aldind
Fixed Predictors :
Selected Predictors:   palpita arthritis gender fever agedx mechhand dysphag racewnw
> ild proxweak yeardx dx2 realonspeed raynaud

Actual Regressions        62
Possible Regressions   16384

Optimal Models Highlighted:

   # Preds      R2ADJ           C        AIC       AICC        BIC
         1   .0197545    5.197552   366.7613    1555.89    374.837
         2    .028156    2.597088   364.1486   1553.316   376.2622
         3   .0365444    .0194683   361.5079   1550.724   377.6594
         4   .0389249    .0159628   361.4605   1550.735   381.6499
         5   .0403595    .4189426   361.8213   1551.164   386.0485

Selected Predictors
  1 : palpita
  2 : palpita arthritis
  3 : palpita arthritis gender
  4 : palpita arthritis gender fever
  5 : palpita arthritis gender fever agedx
 Note that the first three variables selected (palpita, arthritis,
gender) are identical to those chosen by the stepwise logistic regression.
34
Multiple regression with 0 in the
data set
We now consider the model including 0 as part of the data. This may be made a bit easier having taken logs of the
non-zero values, so the 0s aren’t quite so obviously different.

. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita
    dysphag proxweak racewnw realonspeed if u80

      Source |       SS       df       MS              Number of obs =     419
-------------+------------------------------           F( 14,   404) =    2.84
       Model |    62.68539    14  4.47752786           Prob > F      =  0.0004
    Residual |  638.017201   404   1.5792505           R-squared     =  0.0895
-------------+------------------------------           Adj R-squared =  0.0579
       Total |  700.702591   418  1.67632199           Root MSE      =  1.2567

------------------------------------------------------------------------------
       laldo |      Coef.  Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agedx |    -0.0075     0.0157    -0.48   6.4e-01     -0.0383     0.0234
      yeardx |     0.0024     0.0117     0.21   8.4e-01     -0.0206     0.0254
         dx2 |    -0.6763     0.2178    -3.11   2.0e-03     -1.1044    -0.2482
      gender |    -0.3182     0.1404    -2.27   2.4e-02     -0.5941    -0.0423
         ild |    -0.1800     0.2408    -0.75   4.6e-01     -0.6533     0.2933
   arthritis |     0.2548     0.1280     1.99   4.7e-02      0.0031     0.5065
       fever |     0.1069     0.1336     0.80   4.2e-01     -0.1557     0.3695
     raynaud |     0.3104     0.2021     1.54   1.3e-01     -0.0868     0.7076
    mechhand |     0.2043     0.2580     0.79   4.3e-01     -0.3029     0.7115
     palpita |    -0.7101     0.2366    -3.00   2.9e-03     -1.1753    -0.2449
     dysphag |     0.1165     0.1315     0.89   3.8e-01     -0.1422     0.3751
    proxweak |    -0.0250     1.2653    -0.02   9.8e-01     -2.5124     2.4625
     racewnw |    -0.0079     0.1332    -0.06   9.5e-01     -0.2698     0.2541
 realonspeed |    -0.1742     0.0601    -2.90   4.0e-03     -0.2924    -0.0560
       _cons |    -0.8421    23.3682    -0.04   9.7e-01    -46.7806    45.0964
------------------------------------------------------------------------------
35
Using vselect on the full data set
Displaying best five

. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80,best
2 Observations Containing Missing Predictor Values

Response :             laldo
Fixed Predictors :
Selected Predictors:   dx2 palpita realonspeed gender arthritis raynaud dysphag fever
> mechhand ild agedx yeardx racewnw proxweak

Actual Regressions        47
Possible Regressions   16384

Optimal Models Highlighted:

   # Preds      R2ADJ           C        AIC       AICC        BIC
         1   .0154376    20.79848   1401.003   2590.131   1409.079
         2   .0322276    14.33945    1394.79   2583.957   1406.904
         3    .048014    8.358132   1388.891   2578.106   1405.042
         4   .0580737    4.926931   1385.429   2574.703   1405.618
         5   .0673386    1.865516   1382.274   2571.617   1406.501
         6   .0695667    1.901132   1382.256   2571.677   1410.521
         7   .0699354    2.752656   1383.071   2572.582   1415.374

Selected Predictors
  1 : dx2
  2 : dx2 palpita
  3 : dx2 palpita realonspeed
  4 : dx2 palpita realonspeed arthritis
  5 : dx2 palpita realonspeed gender arthritis
  6 : dx2 palpita realonspeed gender arthritis raynaud
  7 : dx2 palpita realonspeed gender arthritis raynaud dysphag
There are some differences in the variables selected by logistic regression and
multiple regression. Raynaud’s and dysphagia were selected in the multiple
regression
36
Future Steps
 Develop a full Bayesian analysis/model
 May include a model that involves selection of
variables with 0 values in the variable selection set or
may involve a Bayesian model on the non-zero values
and a model for the variable of zero and non-zero
values
 Develop a model using a bootstrap and select based
on Wald statistics
 Stay tuned…
37