Exploring Statistical Tools for Predicting a Binary Outcome
Mark A. Rizzardi
Joseph E. Carroll

The Problem
Hepatitis C is a problem worth addressing in Humboldt County, with an incidence rate of about 0.2% (about 1½ to 2 times that of California as a whole over the last six years) and a prevalence rate of 2.3%. 70% - 85% of those acutely infected go on to chronic infection, and 20% of those with chronic infection go on to cirrhosis, which can progress to liver failure.

The Current Treatment
Treatment currently consists of interferon α and ribavirin, for 48 weeks for genotype 1 and 24 weeks for the other genotypes. Side effects of treatment include continuous flu-like symptoms, among other problems. Although treatment succeeds in 75% - 80% of patients with the other genotypes, treatment of genotype 1 patients is effective only about 40% of the time.

What Might Be Useful
Since the treatment costs at least $2000 per month and has unpleasant side effects, it would be useful to have a tool, based on easily obtained patient parameters, to predict whether a given individual will respond to treatment for hepatitis C. That is the subject of this talk. We shall discuss some generalized linear approaches to this problem.

Solving the Problem
Geometrically, we shall think of the patient as described by a vector of parameters (x1,…,xn) ∈ Rn. These might include, e.g., age, amount of alcohol consumed, certain lab values, etc. The treated patients then comprise two point sets in Rn, corresponding to responders and nonresponders. Usually the sets intermix, in the sense that their convex hulls have nonvoid intersection.

[Figure: scatter plot of variable 1 vs. variable 2, with cured and not-cured patients intermixed]

Generalized Linear Models
We need a rule to separate these two sets, hoping the rule will apply to future patients. The simplest solution would be a hyperplane in Rn which separates the two groups with a minimal number of errors. If its equation were a1x1 + … + anxn = -a0, the rule would be the sign of a0 + a1x1 + … + anxn.

Generalized Linear Models
A generalized linear model is one for which the parameters x1, …, xn act through a function of the form A(x1,…,xn) = a0 + a1x1 + … + anxn, where the ai's are constants. The coefficients a0, a1, …, an are fit by some method to optimize a value. In evaluating each such model, it is important to consider both the model assumptions and what exactly is being optimized.

What Has Already Been Done?
Graham Foster, a British hepatologist, published a paper which looked at data from two multinational studies and collected the explanatory variables age, race, weight, BMI, viral genotype and load, ALT, and histology. The retained variables were x = viral load, y = age, z = ALT, u = BMI (all continuous), and v = histology (categorical: 0 = cirrhosis; 1 = no cirrhosis).

What Has Already Been Done?
From logistic regression analysis on genotype 1 patients only, Foster concluded:
• Viral load - lower is better
• Age - younger is better
• ALT - higher is better
• BMI - lower is better
• Cirrhosis is worse
More about logistic regression later…

What Has Already Been Done?
Only the genotype 1 patients were analyzed, using logistic regression, so the paper posits
P(x,y,z,u,v) = e^A(x,y,z,u,v) / (1 + e^A(x,y,z,u,v))
as the probability of response, where
A(x,y,z,u,v) = a0 - 1.446x - 1.236y + 1.376z - 1.134u + 2.322v
for some constant a0 and appropriate units.
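To make the arithmetic of this model concrete, here is a minimal Python sketch of the quoted formula. It is our illustration, not Foster's code: the intercept a0 and the units of the inputs are left unspecified above, so both are treated as assumptions supplied by the caller.

    import math

    def foster_probability(a0, viral_load, age, alt, bmi, no_cirrhosis):
        """Probability of response under the logistic model quoted above.

        a0 is the unspecified intercept; viral_load, age, alt, and bmi must
        already be expressed in the (unstated) units of the original paper;
        no_cirrhosis is 1 for no cirrhosis, 0 for cirrhosis.
        """
        A = (a0 - 1.446 * viral_load - 1.236 * age
             + 1.376 * alt - 1.134 * bmi + 2.322 * no_cirrhosis)
        return math.exp(A) / (1.0 + math.exp(A))

Note that the signs reproduce Foster's conclusions: larger viral load, age, and BMI decrease the predicted probability, while larger ALT and the absence of cirrhosis increase it.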
Predicting the Response to Treatment of Hepatitis C in Humboldt County, California
Joseph E. Carroll, ODCHC and HSU
Mark A. Rizzardi, Statistics, HSU
Donald J. Iverson, Humboldt Neurology
Adil Wakil, Hepatology, CPMC
Jennifer Hampton
Mia R. Kumar

The Data
We collected information from the charts of patients treated for hepatitis C in Humboldt and Del Norte counties, from about 2001 on, by the Eureka Liver Clinic (California Pacific Medical Center) and the Open Door clinics in Eureka and Crescent City. Other patients have been treated by the San Francisco VA and Stanford's local clinic.

The Data
The information retrieved included the outcome, demographic parameters (e.g., age, gender, ethnic background, substance use), findings on physical exam (e.g., weight, BMI), numerous laboratory results dated before, but as close as possible to, the onset of treatment, reports of pathology on liver biopsy and of liver ultrasound, and the interferon/ribavirin combination used. The parameters totaled 56.

The Data
We started with about 170 patients but, on account of missing data, especially missing outcomes (responder or not), and because we eliminated patients with genotypes other than 1, the analysis you'll see today is based on only about 60 patients. We are working, so far unsuccessfully, to obtain the data from Stanford's local liver clinic.

Logistic Regression
P(cure) = e^(β0 + β1x1 + β2x2) / (1 + e^(β0 + β1x1 + β2x2))
[Figure: S-shaped logistic curve of P(cure) against the linear predictor, with not-cured patients clustered at low predicted probability and cured patients at high]

Logistic Regression (continued)
• Why logistic instead of linear regression?
  – Binary data
  – Predicting a probability: 0 ≤ P ≤ 1
  – Nonnormal error terms
  – Variance is not constant
• Commonly used in the medical field: odds ratios
• Solved via maximum likelihood estimation:
l(y1, y2, …, yn | β̂) = Σ(i=1 to n) [ yi ln(π̂i) + (1 - yi) ln(1 - π̂i) ],  where π̂i = exp(Xi β̂) / (1 + exp(Xi β̂))

Logistic Regression for the dataset
[Figure: predicted probability of cure vs. the fitted linear predictor, a function of 1{viral load < 500}, AST/ALT, and bilirubin; range of bilirubin = [0.2, 1.8], range of AST/ALT = [0.5, 2.7]; cured and not-cured patients marked]

Linear Discriminant Analysis
[Figure: scatter plot of variable 1 vs. variable 2, cured vs. not cured]

LDA (continued)
[Figure: the same scatter plot with the LDA boundary and the misclassified points marked]

LDA (continued)
Maximize the variance between the groups relative to the variance within them.

Classification Trees
Classifying Egyptian skull time periods:
[Figure: classification tree with splits on BL < 97.5, MB < 135.5, BL < 98.5, MB < 134.5, MB < 138.5, NH < 51.5, BL < 103.5, and NH < 53.5; leaves labeled with the predicted period (e.g., -4000, -3300, -1850, -200)]
MB = maximal breadth; BH = basibregmatic height; BL = basialveolar length; NH = nasal height
Time periods: 4000 BC, 3300 BC, 1850 BC, 200 BC, 150 AD

Artificial Neural Networks
We can also employ a three-layer feedforward neural network, composed of input nodes for each patient variable, a second layer of hidden nodes, and an output node.

ANNs – Structure
If the hidden layer has m nodes, the network operates on an input x ∈ Rn as a composition of two functions, from Rn → Rm and then from Rm → R, each itself being a composition of a linear function followed by a threshold function. Specifically, there are weights wij and vi and thresholds θi and θ, i = 1,…,m and j = 1,…,n, such that if we let yi = Σj wij xj (i.e., y = Wx) and zi = 0 or 1 according to whether yi < θi or yi > θi, then the network output is success or failure depending on whether Σi vi zi = v·z exceeds or is exceeded by θ.
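As a concrete illustration of this structure, the following Python sketch (using numpy; the function name and example weights are ours, not from the talk) computes the forward pass of exactly this threshold network.

    import numpy as np

    def threshold_network(x, W, theta, v, theta_out):
        """Forward pass of the threshold network described above.

        x: input in R^n; W: m-by-n weight matrix; theta: hidden-node
        thresholds theta_i; v: output weights; theta_out: output threshold.
        Returns 1 (success) if v.z exceeds theta_out, else 0 (failure).
        """
        y = W @ x                          # y_i = sum_j w_ij x_j, i.e. y = Wx
        z = (y > theta).astype(float)      # z_i = 1 iff y_i exceeds its threshold
        return int(v @ z > theta_out)      # which side of the output threshold?

    # Example with n = 2 inputs and m = 2 hidden nodes (arbitrary weights):
    W = np.array([[1.0, -1.0],
                  [0.5,  2.0]])
    print(threshold_network(np.array([0.3, 0.7]), W,
                            theta=np.array([0.0, 1.0]),
                            v=np.array([1.0, 1.0]), theta_out=0.5))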
ANNs – Geometry
Given an input x, its image in the ith hidden node depends only on whether or not yi > θi. This is equivalent to locating x on one side or the other of a hyperplane in Rn. Therefore, x goes to a point in the discrete hypercube {0,1}m ⊂ Rm which codes for the position of x with respect to the m hyperplanes. The output then depends on which side of an (m-1)-dimensional hyperplane in Rm the hypercube point lies.

ANNs – Training
The network starts with arbitrary weights and is trained by successively processing each input, retaining the weights if the computed output is correct and adjusting them by the back-propagation method, a steepest-descent technique, if it is not. Unlike classical binary methods, this training tends to decrease miscategorizations directly, which may explain why neural networks often outperform other binary methods.

Our ANNs – The Sordid Truth
Training by back-propagation requires differentiability, which threshold functions lack. Therefore, these step functions are replaced by "activation" functions, e.g., a logistic or hyperbolic tangent, while the linear functions yield to affine linear functions.

Neural Network
[Figure: fitted network diagram with input nodes vG500, astalt, and bili, a hidden layer, and the output node response; edges labeled with the fitted weights]

Support Vector Machine
Let y1, y2, …, yk be the patient vectors for responders and z1, z2, …, zl for nonresponders, and suppose that there is a hyperplane in Rn separating the y's and z's, with equation w·x = -b for some w ∈ Rn and b ∈ R. Then w·yi + b > 0 and w·zj + b < 0. If we alter w and b by multiplying both by the same adequately large positive number, we can ensure that w·yi + b ≥ 1 and w·zj + b ≤ -1.

Support Vector Machine
[Figure: scatter plot of variable 1 vs. variable 2 showing two linearly separable groups with a separating hyperplane and its margin]

Support Vector Machine
An SVM works by finding a widest rectangular prism separator. This is equivalent to a convex programming problem: find the w0 and b0 which minimize ||w||² subject to w·yi + b ≥ 1 for i = 1,…,k and w·zj + b ≤ -1 for j = 1,…,l. This optimization problem can be solved by a generalization of the Lagrange multiplier method and, because it is a convex programming problem, it has a unique solution.

Support Vector Machine
We call the vectors yi or zj satisfying w0·yi + b0 = 1 or w0·zj + b0 = -1 support vectors. It can be shown that the error rate for future patients should be no worse than the ratio of the lesser of the number of support vectors and n+1 to the number of training vectors. More commonly, however, the two sets will not be linearly separable. There are two ways to address this problem.

Support Vector Machine
One can embed the patient vectors nonlinearly in a higher-dimensional space and try to find a best separating hyperplane there. This corresponds to finding a more general separating hypersurface in Rn. For example, if we are unable to separate by a line in R2, we can map (x, y) ↦ (x, y, x², xy, y², x³, x²y, xy², y³) and try to separate by a hyperplane in R9, which corresponds to separating by a cubic curve back in R2.

Support Vector Machine
The dimension increases rapidly, but the number of support vectors may not, keeping the error-rate bound low. Incidentally, there is always a polynomial separator. Suppose y1, y2, …, yk and z1, z2, …, zl are as above, and let P(x) = Πi ||x - yi||². Then P(x) ≥ 0 for all x, and its zeros are exactly the yi's. Therefore P has a positive minimum on the zj's, say c, so if Q(x) = P(x) - ½c, then Q(zj) > 0 and Q(yi) < 0.
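The construction in the last paragraph translates directly into code. Below is a minimal numpy sketch (our own illustration, assuming no nonresponder coincides with a responder, so that c > 0); since the product of squared distances can under- or overflow for large training sets, it is illustrative rather than practical.

    import numpy as np

    def polynomial_separator(responders, nonresponders):
        """Build Q(x) = P(x) - c/2, with P(x) = prod_i ||x - y_i||^2.

        P vanishes exactly on the responders y_i and is positive elsewhere;
        c is P's (positive) minimum over the nonresponders z_j, so Q < 0 on
        every y_i and Q > 0 on every z_j.
        """
        def P(x):
            return np.prod([np.sum((x - y) ** 2) for y in responders])
        c = min(P(z) for z in nonresponders)
        return lambda x: P(x) - 0.5 * c

    # Tiny check in R^2:
    ys = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
    zs = [np.array([0.0, 1.0]), np.array([2.0, 2.0])]
    Q = polynomial_separator(ys, zs)
    print([Q(y) < 0 for y in ys], [Q(z) > 0 for z in zs])  # all True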
Support Vector Machine
Alternatively, there is a method, again using generalized Lagrange multipliers, to find a hyperplane which minimizes the sum of the distances of the misclassified vectors to it. Then, as before, this hyperplane can be expanded to a rectangular prism with bordering support vectors.

How well do the models predict future patients' responses to treatment?

Training Set vs. Test Set
• Objective:
  – Avoid overfitting the model to a particular dataset.
  – Simulate fitting future data.
• General approach:
  – Fit the model to a large randomly selected subset of the data (the training set).
  – Use the model to predict the outcomes of the remaining data (the test set).
  – Select the model/method which "best" predicts the test set.

Data: n = 59 patients. Variables: AST/ALT, bilirubin, viral load.
[Figure: bilirubin vs. AST/ALT in two panels, viral load < 500 and viral load > 500, with cured and not-cured patients marked]

Leave-One-Out Cross-Validation
• k-fold cross-validation:
  – Divide the data set into k random equal-size groups.
  – Rotate each group as the test set.
  – Fit the model k different times.
  – Note the model/method with the "best" prediction.
• Leave-one-out cross-validation:
  – k = n, where n = sample size.
  – Requires fitting the model n times.
(A minimal code sketch of this comparison appears after the bibliography.)

Neural Network: 80% correct

                        Actually cured   Actually not cured
  Predicted cured             18                  4
  Predicted not cured          8                 29
  Total                       26                 33

[Figure: boxplots of the neural network output for cured vs. not-cured patients]

Logistic Regression: 75% correct

                        Actually cured   Actually not cured
  Predicted cured             18                  7
  Predicted not cured          8                 26
  Total                       26                 33

[Figure: predicted probability of cure vs. the fitted linear predictor β0 + β1·1{viral load < 500} + β2·(AST/ALT) + β3·bilirubin, with cured and not-cured patients marked]

Something to think about:
• Which type of error is worse?
• Should something other than 0.5 be used as the dividing line?

Linear Discriminant Analysis: 80% correct

                        Actually cured   Actually not cured
  Predicted cured             16                  2
  Predicted not cured         10                 31
  Total                       26                 33

Quadratic Discriminant Analysis: 78% correct

                        Actually cured   Actually not cured
  Predicted cured             16                  3
  Predicted not cured         10                 30
  Total                       26                 33

Support Vector Machine: 78% correct

                        Actually cured   Actually not cured
  Predicted cured             16                  3
  Predicted not cured         10                 30
  Total                       26                 33

Classification Tree: 76% correct

                        Actually cured   Actually not cured
  Predicted cured             19                  7
  Predicted not cured          7                 26
  Total                       26                 33

[Figure: fitted classification tree with splits astalt < 1.58787, vG500 < 0.5, and bili < 0.55; leaves labeled cure / no cure]

Random Forest:
• Use random subsets of the variables when building each branch of a tree.
• Grow a forest of many trees.
• The forest of trees votes on the classification of each observation.
• The classification with the greatest number of votes wins.
• Hepatitis data results: from 78% to 83% correct.

Bibliography
Foster, Graham R., et al. (2007). Prediction of sustained virological response in chronic hepatitis C patients treated with peginterferon α-2a (40KD) and ribavirin. Scandinavian Journal of Gastroenterology 42, 247-55.
Breiman, Leo (2001). Statistical modeling: the two cultures. Statistical Science 16(3), 199-231.
Cortes, Corinna and Vladimir Vapnik (1995). Support-vector networks. Machine Learning 20, 273-97.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
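Finally, as promised above, here is a minimal sketch of the leave-one-out comparison in Python with scikit-learn. The model list and settings are our illustration, not the authors' actual analysis, and the random stand-in data below would be replaced by the n = 59 chart records (AST/ALT, bilirubin, the viral-load indicator, and the cure outcome).

    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in for the chart data: columns are AST/ALT, bilirubin,
    # and 1{viral load < 500}; y is 1 = cured, 0 = not cured.
    rng = np.random.default_rng(0)
    X = np.column_stack([rng.uniform(0.5, 2.7, 59),   # AST/ALT range from the plot
                         rng.uniform(0.2, 1.8, 59),   # bilirubin range from the plot
                         rng.integers(0, 2, 59)])     # viral-load indicator
    y = rng.integers(0, 2, 59)

    models = {
        "logistic regression": LogisticRegression(),
        "LDA": LinearDiscriminantAnalysis(),
        "QDA": QuadraticDiscriminantAnalysis(),
        "SVM": SVC(),
        "classification tree": DecisionTreeClassifier(),
        "random forest": RandomForestClassifier(n_estimators=500),
    }

    # Each leave-one-out fold scores 0 or 1, so the mean is the accuracy.
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
        print(f"{name}: {100 * acc:.0f}% correct")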