Prediction of HIV-1 Drug Resistance: Representation of Target Sequence Mutational Patterns via an n-Grams Approach
Majid Masso
School of Systems Biology, George Mason University
Manassas, Virginia
BIBM 2012, Philadelphia, Pennsylvania

Graphical Outline of Presentation

HIV-1 Protein Sequence Datasets
• Data available from Stanford HIV Drug Resistance Database
• 548 protease (PR) and 331 reverse transcriptase (RT) sequences with distinct mutational patterns defined by residue substitutions
• For each of 8 PR and 11 RT inhibitors, PhenoSense assay used to measure degree to which mutant target proteins are susceptible
• PR/RT genotyping much faster and cheaper than phenotyping
• Hence accurate predictive models of drug susceptibility from target sequence alone are in high demand
• Here we develop 19 inhibitor-specific predictive classification and regression models trained on the available phenotype data

HIV-1 Protein Sequence Datasets

                        Isolate Phenotypes (%) a
Drug                     S     I     R     Total
Protease Inhibitors
Amprenavir (APV)        63    26    11      495
Atazanavir (ATV)        49    29    22      200
Indinavir (IDV)         53    26    21      502
Lopinavir (LPV)         46    22    32      320
Nelfinavir (NFV)        39    28    33      526
Ritonavir (RTV)         50    20    30      473
Saquinavir (SQV)        61    18    21      509
Tipranavir (TPV)        78    11    11       47
Nucleoside / Nucleotide RT Inhibitors
Lamivudine (3TC)        29    18    53      244
Abacavir (ABC)          28    45    27      237
Zidovudine (AZT)        50    23    27      240
Stavudine (d4T)         53    36    11      242
Zalcitabine (ddC)       39    52     9      161
Didanosine (ddI)        51    43     6      243
Emtricitabine (FTC)     31    13    56       52
Tenofovir (TDF)         65    25    10      167
Nonnucleoside RT Inhibitors
Delavirdine (DLV)       53    20    27      304
Efavirenz (EFV)         53    22    25      296
Nevirapine (NVP)        43    11    46      307
a. S, sensitive; I, intermediate; R, resistant

Sequence Feature Vectors Using n-Grams
• Used successfully by other groups for sequence representation to study proteins; first application in this context (HIV-1 PR/RT)
• Each of the 19 inhibitor sequence datasets encoded separately
• Relative frequency method: sliding window of size n = 2 captures all ordered 2-grams of the seqs; calc. rel. freq. for all 400 types of 2-grams; represent each seq. as ordered vector of rel. freqs.
• Counts method: each seq. represented as a 400-dim. vector, each component represents a specific 2-gram type whose value is the absolute freq. of its occurrence in that seq.
• Dataset sequences have inhibitor susceptibility (phenotype) values (regression models), which can be placed into 3 (S/I/R) groups (classification models)

Classification and Regression Models
• Algorithms: random forest (RF) for classification, reduced-error pruned tree (REPTree) for regression, implemented in Weka
• Testing: stratified tenfold cross-validation applied to each dataset
• Reported results on each dataset:
  • RF classification: accuracy (% correct), out-of-bag (OOB) error, balanced error rate (BER), area under ROC curve (AUC)
  • REPTree regression: corr coeff (r²), mean-squared error (mse), accuracy (% correct) based on where predicted numerical susceptibility values fall relative to S/I/R category thresholds

Accuracy Results

         Relative Frequency       Counts
Drug     REPTree     RF       REPTree     RF       Drug Mean
Protease Inhibitors
APV       0.81      0.80       0.80      0.80        0.80
ATV       0.74      0.75       0.76      0.76        0.75
IDV       0.78      0.80       0.75      0.80        0.78
LPV       0.80      0.82       0.80      0.81        0.81
NFV       0.80      0.80       0.79      0.82        0.80
RTV       0.87      0.86       0.87      0.84        0.86
SQV       0.80      0.79       0.80      0.80        0.80
TPV       0.75      0.79       0.75      0.81        0.78
AVG       0.79      0.80       0.79      0.81        0.80
Nucleoside / Nucleotide RT Inhibitors
3TC       0.89      0.87       0.87      0.90        0.88
ABC       0.68      0.68       0.66      0.67        0.67
AZT       0.75      0.75       0.73      0.70        0.73
d4T       0.74      0.79       0.76      0.78        0.77
ddC       0.80      0.75       0.80      0.76        0.78
ddI       0.69      0.73       0.69      0.71        0.71
FTC       0.96      0.83       0.94      0.89        0.91
TDF       0.75      0.75       0.68      0.74        0.73
AVG       0.78      0.77       0.77      0.77        0.77
Nonnucleoside RT Inhibitors
DLV       0.76      0.70       0.76      0.71        0.73
EFV       0.78      0.74       0.76      0.73        0.75
NVP       0.84      0.79       0.82      0.77        0.81
AVG       0.79      0.74       0.78      0.74        0.76
Rhee, et al. (Stanford), by drug class: PIs, 0.78; NRTIs, 0.76; NNRTIs, 0.83

Information-Rich REPTree Attributes

Drugs    Root Node a    Level 1 Nodes a    Level 2 Nodes a
PIs (Protease Inhibitors)
APV          10           84, 87             32, 34, 53
ATV          54           73                 32, 50
IDV          54           45, 53             72, 83, 90
LPV          54           45                 77, 84
NFV          10           54, 87             29, 75, 83, 90
RTV          54           9, 84              19, 82, 84
SQV          70           10, 83             47, 54, 90
TPV          90           52, 56             40, 73
NRTIs (Nucleoside / Nucleotide RT Inhibitors)
3TC         183           64                 66
ABC         183           115, 214           64, 101, 114, 118
AZT          67           166, 210           76, 214
d4T         209           76, 177            66, 67
ddC         115           134, 183           65, 117
ddI         150           43, 61             39, 183
FTC         183           123, 214           40
TDF         214           34, 65             68, 227, 285
NNRTIs (Non-nucleoside RT Inhibitors)
DLV         102           165, 180           69, 100, 190, 209
EFV         102           189                99, 188
NVP         189           103, 172           173, 180
a. Regular font, both IAS and TSM sets of positions; bold, TSM only; underlined, neither.

• Based on relative frequency method for generating sequence feature vectors
• Node attribute i is a vector component number, whose value is the rel. freq. for the (i, i + 1) sequence 2-gram
• Ex.: root node 10 for APV corresponds to PR sequence positions (10, 11), and at least one of these is known to be an important drug resistance position (10 is in both IAS and TSM subsets)
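The two 2-gram encodings described under "Sequence Feature Vectors Using n-Grams" can be sketched as follows. This is an illustrative reading of the slides, not the author's code: the 20-letter alphabet, the function names, the toy sequences, and the choice to compute relative frequencies over all 2-gram occurrences in one dataset are all assumptions. Under this reading, component i of the relative frequency vector belongs to sequence positions (i, i + 1), which matches how the REPTree node attributes are interpreted.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids, giving 20 * 20 = 400 2-gram types.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TWO_GRAM_TYPES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def two_grams(seq):
    """Slide a window of size n = 2 along the sequence, keeping order."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def counts_vector(seq):
    """Counts method: 400-dim vector of absolute 2-gram counts in one sequence."""
    counts = Counter(two_grams(seq))
    return [counts[gram] for gram in TWO_GRAM_TYPES]


def relative_frequencies(seqs):
    """Relative frequency of each 2-gram type over a whole dataset (assumed scope)."""
    counts = Counter(gram for seq in seqs for gram in two_grams(seq))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def rel_freq_vector(seq, rel_freq):
    """Relative frequency method: list index i - 1 holds the relative
    frequency of the 2-gram at sequence positions (i, i + 1)."""
    return [rel_freq.get(gram, 0.0) for gram in two_grams(seq)]


if __name__ == "__main__":
    toy_dataset = ["PQITL", "PQVTL", "PQITL"]  # toy strings, not real PR data
    rel_freq = relative_frequencies(toy_dataset)
    print(rel_freq_vector("PQVTL", rel_freq))
    print(sum(counts_vector("PQITL")))
```

With the relative frequency encoding, the vector length tracks sequence length, so a tree split on attribute 10 can be read back as positions (10, 11); the counts encoding is position-free and always has 400 components.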
Application: Drug Cocktail Effectiveness
• Used relative frequency method and REPTree regression
• Train with one inhibitor dataset, test with another
• High corr coeff (r) between actual and predicted susceptibility values on test set ⇒ both inhibitors (train and test sets) have similar resistance patterns and/or likely not good taken together
• Low or slightly negative r ⇒ potentially good in combination

                  NRTIs                                                   NNRTIs
Train / Test      3TC    ABC    AZT    d4T    ddC    ddI    FTC    TDF    DLV    EFV    NVP
NRTIs
3TC              0.98   0.85   0.11   0.18   0.57   0.45   0.94  -0.42  -0.14  -0.13  -0.10
ABC              0.69   0.91   0.44   0.51   0.63   0.68   0.69  -0.06  -0.15  -0.02  -0.01
AZT             -0.08   0.29   0.91   0.79   0.16   0.21   0.03   0.68   0.02   0.14   0.15
d4T              0.01   0.42   0.78   0.91   0.47   0.56   0.05   0.48  -0.03   0.05   0.09
ddC              0.45   0.62   0.27   0.57   0.90   0.86   0.41  -0.19  -0.10  -0.10  -0.13
ddI              0.38   0.63   0.35   0.58   0.79   0.91   0.36  -0.05  -0.10  -0.06  -0.06
FTC              0.99   0.93   0.32   0.29   0.08   0.84   1.00  -0.34  -0.20  -0.13  -0.11
TDF             -0.31   0.05   0.60   0.53  -0.07   0.03  -0.27   0.82  -0.07   0.06   0.02
NNRTIs
DLV             -0.13  -0.10  -0.07  -0.07  -0.13  -0.10  -0.13   0.04   0.87   0.55   0.60
EFV             -0.17  -0.05  -0.01  -0.02  -0.17  -0.07  -0.17   0.11   0.51   0.91   0.73
NVP             -0.25  -0.14  -0.05  -0.06  -0.22  -0.13  -0.24   0.10   0.60   0.72   0.92

Annotations:
• Callouts: known bad pairing; known good pairing
• 3TC/ABC or FTC/ABC pairs are effective, but high risk of severe adverse events that require stoppage
• Shaded areas: NRTI/NNRTI pairs (known good together)
• Two NNRTIs should NOT be taken together (based on clinical trials)

Acknowledgements and References
• Thanks to the Stanford HIV Drug Resistance Database (http://hivdb.stanford.edu/) for the genotype-phenotype correlation data characterizing HIV-1 PR and RT sequences
• This study was inspired by Rhee, et al., PNAS (2006)
• Effective cocktails, and drugs not to co-administer, based on Antiretroviral Guidelines for Adults and Adolescents from the U.S. Department of Health and Human Services: http://www.aidsinfo.nih.gov/ContentFiles/AdultandAdolescentGL.pdf
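
The train-on-one, test-on-another scoring used in the drug cocktail application comes down to a Pearson correlation between actual and predicted susceptibility values. The sketch below is a minimal, library-free illustration under stated assumptions: `model_predict` is a hypothetical callable standing in for a regression tree trained on one inhibitor's dataset (the slides used Weka's REPTree, which is not reproduced here), and the function names are made up for this example.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_p)


def cross_inhibitor_r(model_predict, test_features, test_susceptibility):
    """Score a model trained on inhibitor A against inhibitor B's dataset.

    model_predict is a hypothetical stand-in: any callable mapping one
    feature vector to a predicted susceptibility value.
    """
    predicted = [model_predict(x) for x in test_features]
    return pearson_r(test_susceptibility, predicted)


if __name__ == "__main__":
    # Toy numbers only; a stand-in "model" that sums its feature vector.
    features = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
    susceptibility = [1.0, 2.0, 3.0]
    print(cross_inhibitor_r(sum, features, susceptibility))
```

Filling an 11 x 11 matrix like the one on the slide would mean calling `cross_inhibitor_r` once per (train, test) inhibitor pair; a high r then suggests shared resistance patterns, and a low or slightly negative r suggests a potentially useful combination.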