Exploring Statistical Tools for Predicting Binary Outcomes
Mark A. Rizzardi
Joseph E. Carroll
The Problem
Hepatitis C is a problem worth addressing in
Humboldt County: its incidence rate of about 0.2%
has run roughly 1½ to 2 times that of California as
a whole over the last six years, and its prevalence rate is 2.3%.
70%–85% of those acutely infected go on to
chronic infection.
20% of those with chronic infection go on to
cirrhosis, i.e., liver failure.
The Current Treatment
At present, treatment consists of interferon α
and ribavirin, given for 48 weeks for genotype 1 and
24 weeks for the other genotypes.
Side effects of treatment include continuous
flu-like symptoms, among other problems.
Although treatment succeeds in 75%–80% of
patients with the other genotypes, it is effective
in genotype 1 patients only about 40% of the time.
What Might Be Useful
Since the treatment costs at least $2000 per
month and has unpleasant side effects, it would
be useful to have a tool, based on some easily
obtained patient parameters, to predict whether a
given individual will or will not respond to
treatment for hepatitis C.
That is the subject of this talk. We shall discuss some generalized linear approaches to
this problem.
Solving the Problem
Geometrically, we shall think of the patient as
described by a vector of parameters (x1, …, xn) ∈ Rn.
These might include, e.g., age, amount
of alcohol consumed, certain lab values, etc.
The treated patients then comprise two point
sets in Rn, corresponding to responders and
nonresponders. Usually the sets intermix, in
the sense that their convex hulls have nonvoid
intersection.
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured and Not cured; the two groups intermix.]
Generalized Linear Models
We need a rule to separate these two sets, hoping the rule will apply to future patients.
The simplest solution would be a hyperplane
in Rn which separates the two groups with a
minimal number of errors.
If its equation were a1x1 +…+ anxn = -a0, the
rule would be the sign of a0 + a1x1 +…+ anxn.
Generalized Linear Models
A generalized linear model is one for which
the parameters x1, …, xn act through a function
of the form A(x1,…,xn) = a0 + a1x1 +…+ anxn,
where the ai’s are constants.
The coefficients a0, a1, …, an are fit by some
method chosen to optimize a criterion.
In evaluating each such model, it is important
to consider both the model's assumptions and
exactly what it is optimizing.
What Has Already Been Done?
Graham Foster, a British hepatologist, published a paper which looked at data from two
multinational studies and collected the explanatory variables age, race, weight, BMI, viral
genotype and load, ALT, and histology.
The retained variables were x = viral load, y =
age, z = ALT, u = BMI (all continuous), and v
= histology (categorical: 0 = cirrhosis; 1 = no
cirrhosis).
What Has Already Been Done?
From logistic regression analysis on genotype 1
patients only, Foster concluded:
• Viral load - lower is better
• Age - younger is better
• ALT - higher is better
• BMI - lower is better
• Cirrhosis is worse
More about logistic regression later…
What Has Already Been Done?
Only the genotype 1 patients were analyzed,
using logistic regression, so the paper posits
P(x,y,z,u,v) = e^A(x,y,z,u,v) / (1 + e^A(x,y,z,u,v))
as the probability of response, where
A(x,y,z,u,v) = a0 − 1.446x − 1.236y + 1.376z − 1.134u + 2.322v
for some constant a0 and appropriate units.
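For concreteness, here is a minimal Python sketch of the reported fit; the intercept a0 is left unspecified on the slide, so the default below is only a placeholder, and the inputs are assumed to be in the paper's units.

```python
import math

def foster_response_prob(x, y, z, u, v, a0=0.0):
    """P(response) under the reported logistic fit.

    x = viral load, y = age, z = ALT, u = BMI (continuous, in the
    paper's units); v = histology (0 = cirrhosis, 1 = no cirrhosis).
    a0 is unspecified on the slide, so 0.0 is a placeholder.
    """
    A = a0 - 1.446 * x - 1.236 * y + 1.376 * z - 1.134 * u + 2.322 * v
    return 1.0 / (1.0 + math.exp(-A))   # equals e^A / (1 + e^A)
```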
Predicting the Response to
Treatment of Hepatitis C in
Humboldt County, California
Joseph E. Carroll, ODCHC and HSU
Mark A. Rizzardi, Statistics HSU
Donald J. Iverson, Humboldt Neurology
Adil Wakil, Hepatology CPMC
Jennifer Hampton
Mia R. Kumar
The Data
We collected information from the charts of
patients treated for hepatitis C in Humboldt
and Del Norte counties from about 2001 onward by the
Eureka Liver Clinic (California Pacific Medical Center) and the Open Door Clinics in
Eureka and Crescent City.
Other patients have been treated by the San
Francisco VA and Stanford's local clinic.
The Data
The information retrieved included outcome,
demographic parameters (e.g., age, gender,
ethnic background, substance use), findings on
physical exam (e.g., weight, BMI), numerous
laboratory results dated before, but as close as
possible to, the onset of treatment, reports of
pathology on liver biopsy and of liver ultrasound, and the interferon/ribavirin combination
used. The parameters totaled 56.
The Data
We started with about 170 patients but, on
account of missing data, especially missing
outcomes (responder or not), and because we
eliminated patients with genotypes other than 1, the
analysis you'll see today is based on only
about 60 patients.
We are working, so far unsuccessfully, to
obtain the data from Stanford's local liver
clinic.
Logistic Regression
[Figure: the logistic curve P(cure) = e^(β0 + β1x1 + β2x2) / (1 + e^(β0 + β1x1 + β2x2)) plotted against the linear predictor, with cured patients plotted at 1 and not cured patients at 0.]
Logistic Regression (continued)
• Why logistic instead of linear regression?
– Binary data
– Predicting a probability: 0 ≤ P ≤ 1
– Nonnormal error terms
– Nonconstant variance
• Commonly used in the medical field: odds ratios
• Solved via maximum likelihood estimation
l(y1, y2, …, yn | β̂) = Σi=1..n [ yi ln π̂i + (1 − yi) ln(1 − π̂i) ],
where π̂i = exp(Xi β̂) / (1 + exp(Xi β̂)).
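As a sketch of how the maximization works, the following Python snippet fits β̂ by numerically minimizing the negative of the log-likelihood above; the data here are synthetic, made up purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in data: intercept column plus two covariates.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([0.5, 1.0, -2.0])
y = (rng.random(100) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def neg_log_lik(beta):
    p = 1 / (1 + np.exp(-X @ beta))   # pi_hat_i = exp(X_i beta) / (1 + exp(X_i beta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = minimize(neg_log_lik, np.zeros(3)).x   # maximum likelihood estimate
print(beta_hat)
```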
Logistic Regression for the Dataset
[Figure: predicted probability of cure plotted against the fitted linear predictor in 1{viral load > 500}, AST/ALT, and bilirubin (coefficient magnitudes 11.14, 11.79, 3.97, and 11.74), with cured and not cured patients marked. Range of AST/ALT: [0.5, 2.7]; range of bilirubin: [0.2, 1.8].]
Linear Discriminant Analysis
[Figure: scatter plot of variable 2 against variable 1, points marked Cured and Not cured.]
LDA (continued)
[Figure: the same scatter plot with the LDA boundary drawn; misclassified points are marked.]
LDA (continued)
Maximize the variance between the groups relative to the variance within them.
LDA (continued)
[Figure: the scatter plot again, with the fitted discriminant line and misclassified points marked.]
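A minimal sketch of LDA in Python with scikit-learn, on made-up two-variable data standing in for the patient measurements:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two made-up groups in the plane, standing in for cured / not cured.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(1.5, 1.0, (30, 2))])
y = np.array([1] * 30 + [0] * 30)   # 1 = cured, 0 = not cured

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))   # fraction of training points correctly classified
```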
Classification trees
Example: classifying Egyptian skull time periods using four skull measurements.
[Figure: fitted classification tree, splitting on BL (97.5, 98.5, 103.5), MB (134.5, 135.5, 138.5), and NH (51.5, 53.5); leaves labeled with predicted periods −4000, −3300, −1850, and −200.]
MB = maximal breadth
BH = basibregmatic height
BL = basialveolar length
NH = nasal height
Time periods: 4000 BC, 3300 BC, 1850 BC, 200 BC, AD 150
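As a sketch of the technique, here is how one might grow such a tree in Python with scikit-learn; the measurements below are randomly generated stand-ins, not the real skull data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Randomly generated stand-ins for the four skull measurements.
rng = np.random.default_rng(2)
X = rng.normal(loc=[134.0, 132.0, 99.0, 51.0], scale=3.0, size=(150, 4))
epoch = rng.choice([-4000, -3300, -1850, -200, 150], size=150)

tree = DecisionTreeClassifier(max_depth=3).fit(X, epoch)
print(export_text(tree, feature_names=["MB", "BH", "BL", "NH"]))
```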
Artificial Neural Networks
We can also employ a three-layer feed-forward
neural network, composed of input nodes for
each patient variable, a second layer of hidden
nodes, and an output node.
ANNs – Structure
If the hidden layer has m nodes, the network
operates on an input x ∈ Rn as a composition
of two functions, from Rn to Rm and then from
Rm to R, each itself being a composition of a
linear function followed by a threshold function.
Specifically, there are weights wij and vi and
thresholds θi and θ, i = 1, …, m and j = 1, …, n,
such that if we let
ANNs – Structure
yi = Σj wij xj (i.e., y = Wx) and zi = 0 or 1
according to whether yi < θi or yi > θi,
then the network output is success or failure
depending on whether Σi vi zi = v·z exceeds or
is exceeded by θ.
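A direct transcription of this forward pass into Python (numpy), purely to make the composition of linear maps and threshold functions concrete:

```python
import numpy as np

def threshold_net(x, W, theta, v, theta_out):
    """Forward pass of the threshold network described above.

    x: input in R^n; W: m-by-n weight matrix; theta: m hidden thresholds;
    v: m output weights; theta_out: output threshold.
    Returns True for "success", False for "failure".
    """
    y = W @ x                       # y_i = sum_j w_ij x_j
    z = (y > theta).astype(float)   # z_i = 1 if y_i > theta_i, else 0
    return v @ z > theta_out        # success iff sum_i v_i z_i exceeds theta
```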
ANNs – Geometry
Given an input x, its image in the ith hidden
node depends only on whether or not yi > θi.
This is equivalent to locating x as being on one
side or the other of a hyperplane in Rn.
Therefore, x goes to a point in the discrete
hypercube {0,1}m ⊂ Rm which codes for the
position of x with respect to the m hyperplanes.
The output depends on which side of an (m−1)-dimensional hyperplane the hypercube point lies.
ANNs – Training
The network starts with arbitrary weights and
is trained by successively processing each input, retaining the weights if the computed output is correct and adjusting them by
the backpropagation method, a steepest-descent technique, if it is not.
Unlike classical binary methods, this training
tends to decrease miscategorizations directly,
which may explain why neural networks often
outperform other binary methods.
Our ANNs – The Sordid Truth
Training by backpropagation requires differentiability, which threshold
functions lack. Therefore, these step functions are
replaced by "activation" functions, e.g., a logistic or hyperbolic tangent, while the linear
functions give way to affine linear functions.
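A minimal sketch of such a network in Python, assuming scikit-learn's MLPClassifier; the logistic activation stands in for the step function, and the data are synthetic stand-ins for our three predictors.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))                        # stand-ins for vG500, astalt, bili
y = (X @ np.array([1.0, -2.0, -1.5]) > 0).astype(int)

# A logistic activation replaces the step function, so backpropagation
# (gradient descent) applies; bias terms make the hidden maps affine.
net = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                    solver="sgd", max_iter=5000, random_state=0).fit(X, y)
print(net.score(X, y))
```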
Neural Network
[Diagram: the fitted neural network, with input nodes vG500 (indicator of viral load > 500), astalt (AST/ALT), and bili (bilirubin), one hidden layer, and an output node for response; the edges are labeled with the fitted weights.]
Support Vector Machine
Let y1, y2, …, yk be the patient vectors for responders and z1, z2, …, zl those for nonresponders, and
suppose that there is a hyperplane in Rn separating the y's and z's, with equation w·x = −b
for some w ∈ Rn and b ∈ R. Then w·yi + b > 0
and w·zj + b < 0.
If we alter w and b by multiplying both by the
same adequately large positive number, we can
ensure that w·yi + b ≥ 1 and w·zj + b ≤ −1.
Support Vector Machine
[Figure: scatter plot of variable 2 against variable 1 showing the two groups and the maximum-margin separating line.]
Support Vector Machine
An SVM works by finding a widest rectangular prism separator. This is equivalent to a
convex programming problem.
That problem is to find the w0 and b0 which minimize ‖w‖² subject to w·yi + b ≥ 1 for i = 1, …, k
and w·zj + b ≤ −1 for j = 1, …, l. This optimization problem can be solved by a generalization of the Lagrange multiplier method and,
because it is a convex programming problem,
it has a unique solution.
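A sketch of the hard-margin problem in Python using scikit-learn's SVC; a very large penalty C approximates the constraint that every training vector be classified with margin at least 1 (toy points, for illustration only):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable groups: responders (y's) and nonresponders (z's).
Y = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5]])
Z = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])
X = np.vstack([Y, Z])
labels = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 subject to w.y_i + b >= 1 and w.z_j + b <= -1.
svm = SVC(kernel="linear", C=1e6).fit(X, labels)
print(svm.coef_[0], svm.intercept_[0])   # w0 and b0
print(svm.support_vectors_)              # the support vectors
```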
Support Vector Machine
We call the vectors yi or zj satisfying w0·yi + b0 = 1
or w0·zj + b0 = −1 support vectors.
It can be shown that the error rate for future
patients should be no worse than the lesser of
the number of support vectors and n + 1,
divided by the number of training vectors.
More commonly, however, the two sets will
not be linearly separable. There are two ways
to solve this problem.
Support Vector Machine
One can embed the patient vectors nonlinearly in a higher-dimensional
space and try to find a
best separating hyperplane there. This corresponds to finding a more general separating hypersurface in Rn. For example, if we are unable to separate by a line in R2, we can map
(x, y) ↦ (x, y, x², xy, y², x³, x²y, xy², y³)
and try to separate by a hyperplane in R9, which corresponds to separating by a cubic back in R2.
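In practice the embedding is handled implicitly by a kernel; here is a sketch using scikit-learn's polynomial kernel on toy data that no line in R2 separates:

```python
import numpy as np
from sklearn.svm import SVC

# One group near the origin, one on a ring around it: no line separates them.
rng = np.random.default_rng(4)
inner = rng.normal(0.0, 0.5, (30, 2))
angle = rng.uniform(0.0, 2 * np.pi, 30)
outer = np.column_stack([2.5 * np.cos(angle), 2.5 * np.sin(angle)])
outer += rng.normal(0.0, 0.2, (30, 2))
X = np.vstack([inner, outer])
y = np.array([1] * 30 + [-1] * 30)

# kernel="poly", degree=3 separates with a hyperplane in the implicit space
# of monomials of degree <= 3, i.e. by a cubic curve back in R^2.
svm = SVC(kernel="poly", degree=3, coef0=1.0).fit(X, y)
print(svm.score(X, y))
```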
Support Vector Machine
Dimension increases rapidly, but the number of support vectors may not, keeping the error rate low.
Incidentally, there is always a polynomial separator. Suppose y1, y2, …, yk and z1, z2, …, zl
are as above, and let P(x) = Πi ‖x − yi‖². Then
P(x) ≥ 0 for all x, and its zeros are exactly the
yi's. Therefore P has a positive minimum on
the zj's, say c, so if Q(x) = P(x) − ½c, then
Q(zj) > 0 and Q(yi) < 0.
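This argument is easy to check numerically; the following sketch builds Q for a few made-up points and verifies the signs:

```python
import numpy as np

def P(x, ys):
    """Product of squared distances of x to the y_i's."""
    return np.prod([np.sum((x - yi) ** 2) for yi in ys])

ys = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # made-up responders
zs = [np.array([2.0, 0.5]), np.array([0.5, 2.0])]   # made-up nonresponders

c = min(P(z, ys) for z in zs)        # positive minimum of P on the z's
Q = lambda x: P(x, ys) - c / 2.0
print([Q(yi) for yi in ys])          # each equals -c/2 < 0
print([Q(zj) for zj in zs])          # each is >= c/2 > 0
```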
Support Vector Machine
Alternatively, there is a method, again using
generalized Lagrange multipliers, to find a hyperplane
which minimizes the sum of the
distances of the misclassified vectors to it.
Then, as previously, this hyperplane can be
expanded to a rectangular prism with bordering support vectors.
How well do the models predict
future patients’ responses to
treatment?
Training set vs. Test set
• Objective:
– Avoid overfitting the model to a particular dataset.
– Simulate the prediction of future data.
• General approach (sketched below):
– Fit the model to a large, randomly selected subset of
the data (the training set).
– Use the model to predict the outcomes of the remaining data
(the test set).
– Select the model/method which "best" predicts the test
set.
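A minimal sketch of this approach in Python with scikit-learn, on stand-in data of the same size as ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(59, 3))        # stand-ins for our three predictors
y = rng.integers(0, 2, size=59)     # stand-in cured / not cured outcomes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out patients
```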
Data: n=59 patients
Variables: AST/ALT, Bilirubin, Viral load
[Figure: two scatter plots of bilirubin against AST/ALT, one for patients with viral load < 500 and one for viral load > 500, points marked Cured and Not cured.]
Leave-one-out Cross-validation
• k-fold cross-validation:
– Divide the data set into k random, equal-size groups.
– Rotate each group through the role of test set.
– Fit the model k different times.
– Note the model/method with the "best" prediction.
• Leave-one-out cross-validation (sketched below):
– k = n, where n = sample size.
– Requires fitting the model n times.
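A sketch of leave-one-out cross-validation with scikit-learn, again on stand-in data (logistic regression is used here just as an example model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(59, 3))
y = rng.integers(0, 2, size=59)

# Fits the model n = 59 times, each time predicting the one held-out patient.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(scores.mean())   # leave-one-out estimate of prediction accuracy
```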
Neural network: 80% correct

                      Actually cured   Actually not cured
Predicted cured             18                  4
Predicted not cured          8                 29
Total                       26                 33
[Figure: boxplots of the neural network output for cured and not cured patients.]
Logistic Regression: 75% correct

                      Actually cured   Actually not cured
Predicted cured             18                  7
Predicted not cured          8                 26
Total                       26                 33
[Figure: predicted probability of cure plotted against the fitted linear predictor β0 + β1·1{viral load > 500} + β2·(AST/ALT) + β3·Bilirubin, with cured and not cured patients marked.]

Something to think about:
• Which type of error is worse?
• Should something other than 0.5 be used as the dividing line? (A sketch follows below.)
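To make the second question concrete, the following sketch tabulates the confusion matrix at several cutoffs for made-up predicted probabilities; moving the cutoff trades one kind of error for the other:

```python
import numpy as np

def confusion(p_hat, y, cutoff):
    """2x2 counts when cure is predicted whenever p_hat >= cutoff."""
    pred = p_hat >= cutoff
    return np.array([[np.sum(pred & (y == 1)), np.sum(pred & (y == 0))],
                     [np.sum(~pred & (y == 1)), np.sum(~pred & (y == 0))]])

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=59)                         # made-up outcomes
p_hat = np.clip(0.4 * y + rng.random(59) * 0.6, 0, 1)   # made-up probabilities

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, confusion(p_hat, y, cutoff).ravel())
```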
Linear Discriminant Analysis: 80% correct

                      Actually cured   Actually not cured
Predicted cured             16                  2
Predicted not cured         10                 31
Total                       26                 33
Quadratic Discriminant Analysis: 78% correct

                      Actually cured   Actually not cured
Predicted cured             16                  3
Predicted not cured         10                 30
Total                       26                 33
Support Vector Machine: 78% correct

                      Actually cured   Actually not cured
Predicted cured             16                  3
Predicted not cured         10                 30
Total                       26                 33
Classification tree: 76% correct

                      Actually cured   Actually not cured
Predicted cured             19                  7
Predicted not cured          7                 26
Total                       26                 33

[Figure: the fitted tree, splitting first on astalt < 1.58787, then on vG500 < 0.5 and bili < 0.55; leaves labeled cure or no cure.]
Random Forest:
• Use a random subset of the variables when
building each branch of a tree.
• Grow a forest of many trees.
• The forest of trees votes on the classification of each
observation.
• The classification with the greatest number of votes
wins.
• Hepatitis data results: from 78% to 83%
correct (a sketch follows below).
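A sketch of a random forest fit in Python with scikit-learn, on stand-in data (max_features="sqrt" gives the random variable subset at each split):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(59, 3))
y = rng.integers(0, 2, size=59)

# Each tree is grown on a bootstrap sample; each split considers a random
# subset of the variables; the forest classifies by majority vote.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```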
Bibliography
Foster, Graham R., et al. (2007). Prediction of sustained virological response in chronic hepatitis C patients treated with peginterferon α-2a (40KD) and ribavirin. Scandinavian Journal of Gastroenterology 42, 247–55.
Breiman, Leo (2001). Statistical modeling: the two cultures. Statistical Science 16(3), 199–231.
Cortes, Corinna and Vladimir Vapnik (1995). Support-vector networks. Machine Learning 20, 273–97.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.