Prediction of HIV-1 Drug Resistance

Prediction of HIV-1 Drug Resistance:
Representation of Target Sequence Mutational
Patterns via an n-Grams Approach
Majid Masso
School of Systems Biology, George Mason University
Manassas, Virginia
BIBM 2012, Philadelphia, Pennsylvania
Graphical Outline of Presentation
HIV-1 Protein Sequence Datasets
• Data available from Stanford HIV Drug Resistance Database
• 548 protease (PR) and 331 reverse transcriptase (RT) sequences
with distinct mutational patterns defined by residue substitutions
• For each of 8 PR and 11 RT inhibitors, the PhenoSense assay was used
to measure how susceptible the mutant target proteins are to the drug
• PR/RT genotyping much faster and cheaper than phenotyping
• Hence accurate models that predict drug susceptibility from the
target sequence alone are in high demand
• Here we develop 19 inhibitor-specific predictive classification
and regression models trained on the available phenotype data
HIV-1 Protein Sequence Datasets
                                  Isolate Phenotypes (%) (a)
Drug                              S       I       R       Total

Protease Inhibitors
Amprenavir (APV)                  63      26      11      495
Atazanavir (ATV)                  49      29      22      200
Indinavir (IDV)                   53      26      21      502
Lopinavir (LPV)                   46      22      32      320
Nelfinavir (NFV)                  39      28      33      526
Ritonavir (RTV)                   50      20      30      473
Saquinavir (SQV)                  61      18      21      509
Tipranavir (TPV)                  78      11      11      47

Nucleoside / Nucleotide RT Inhibitors
Lamivudine (3TC)                  29      18      53      244
Abacavir (ABC)                    28      45      27      237
Zidovudine (AZT)                  50      23      27      240
Stavudine (d4T)                   53      36      11      242
Zalcitabine (ddC)                 39      52      9       161
Didanosine (ddI)                  51      43      6       243
Emtricitabine (FTC)               31      13      56      52
Tenofovir (TDF)                   65      25      10      167

Nonnucleoside RT Inhibitors
Delavirdine (DLV)                 53      20      27      304
Efavirenz (EFV)                   53      22      25      296
Nevirapine (NVP)                  43      11      46      307

a. S, sensitive; I, intermediate; R, resistant
Sequence Feature Vectors Using n-Grams
• Used successfully by other groups for sequence representation to
study proteins; first application in this context (HIV-1 PR/RT)
• Each of the 19 inhibitor sequence datasets encoded separately
• Relative frequency method: a sliding window of size n = 2 captures
all ordered 2-grams of the sequences; calculate the relative frequency
of each of the 400 (20 × 20) 2-gram types; represent each sequence as
an ordered vector of relative frequencies
• Counts method: each sequence is represented as a 400-dimensional
vector, each component corresponding to a specific 2-gram type whose
value is the absolute frequency of its occurrence in that sequence
• Dataset sequences have inhibitor susceptibility (phenotype)
values (regression models), which can be placed into 3 (S/I/R)
groups (classification models)
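The two encodings can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function names are hypothetical, and the sketch assumes that relative frequencies are pooled over all sequences of one inhibitor dataset and that the relative frequency vector is ordered by sequence position.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered 2-gram types, in a fixed order.
TWO_GRAM_TYPES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def two_grams(seq):
    """Slide a window of size 2 along the sequence."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def counts_vector(seq):
    """Counts method: 400-dim vector of absolute 2-gram frequencies."""
    counts = Counter(two_grams(seq))
    return [counts[g] for g in TWO_GRAM_TYPES]


def relative_frequencies(seqs):
    """Relative frequency of each 2-gram type.

    Assumption: frequencies are pooled over every sequence in the dataset.
    """
    pooled = Counter(g for s in seqs for g in two_grams(s))
    total = sum(pooled.values())
    return {g: c / total for g, c in pooled.items()}


def relative_frequency_vector(seq, rel_freq):
    """Relative frequency method: component i holds the relative
    frequency of the 2-gram at positions (i, i + 1) of the sequence."""
    return [rel_freq.get(g, 0.0) for g in two_grams(seq)]


# Toy sequences, far shorter than real PR or RT sequences.
dataset = ["ACAC", "ACGT"]
rel_freq = relative_frequencies(dataset)
vec_rel = relative_frequency_vector("ACAC", rel_freq)
vec_cnt = counts_vector("ACAC")
```

In this sketch the counts vector always has 400 components, while the relative frequency vector has one component per adjacent pair of positions.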
Classification and Regression Models
• Algorithms: random forest (RF) for classification, reduced-error
pruned tree (REPTree) for regression, implemented in Weka
• Testing: stratified tenfold cross-validation applied to each dataset
• Reported results on each dataset:
• RF classification: accuracy (% correct), out-of-bag (OOB) error,
balanced error rate (BER), area under ROC curve (AUC)
• REPTree regression: squared correlation coefficient (r^2),
mean-squared error (MSE), accuracy (% correct) based on where predicted
numerical susceptibility values fall relative to S/I/R category thresholds
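Two of the reported measures can be illustrated with a short Python sketch. The balanced error rate is taken here as the mean of the per-class error rates, and the regression accuracy is obtained by binning predicted values with two cutoffs. The function names are hypothetical, the cutoffs (3.0 and 20.0) are placeholders rather than the drug-specific thresholds used in the study, and the treatment of values that fall exactly on a cutoff is an assumption.

```python
def balanced_error_rate(actual, predicted, classes=("S", "I", "R")):
    """Mean of the per-class error rates; classes absent from
    `actual` are skipped."""
    rates = []
    for c in classes:
        idx = [i for i, a in enumerate(actual) if a == c]
        if not idx:
            continue
        wrong = sum(1 for i in idx if predicted[i] != c)
        rates.append(wrong / len(idx))
    return sum(rates) / len(rates)


def categorize(value, lower, upper):
    """Bin a predicted susceptibility value into S/I/R.

    Assumption: a value equal to a cutoff goes to the higher category.
    """
    if value < lower:
        return "S"
    if value < upper:
        return "I"
    return "R"


def accuracy(actual, predicted):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Toy example with placeholder cutoffs (not from the study).
actual = ["S", "S", "I", "R"]
predicted_values = [1.2, 4.0, 6.5, 35.0]
predicted = [categorize(v, lower=3.0, upper=20.0) for v in predicted_values]
acc = accuracy(actual, predicted)
ber = balanced_error_rate(actual, predicted)
```

Because the balanced error rate weights each class equally, it is less flattering than plain accuracy when a dataset is dominated by one phenotype class.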
Accuracy Results
          Relative Frequency      Counts                Drug      Rhee, et al.
Drug      REPTree     RF          REPTree     RF        Mean      (Stanford)

Protease Inhibitors
APV       0.81        0.80        0.80        0.80      0.80
ATV       0.74        0.75        0.76        0.76      0.75
IDV       0.78        0.80        0.75        0.80      0.78
LPV       0.80        0.82        0.80        0.81      0.81
NFV       0.80        0.80        0.79        0.82      0.80
RTV       0.87        0.86        0.87        0.84      0.86
SQV       0.80        0.79        0.80        0.80      0.80
TPV       0.75        0.79        0.75        0.81      0.78
AVG       0.79        0.80        0.79        0.81      0.80      0.78

Nucleoside / Nucleotide RT Inhibitors
3TC       0.89        0.87        0.87        0.90      0.88
ABC       0.68        0.68        0.66        0.67      0.67
AZT       0.75        0.75        0.73        0.70      0.73
d4T       0.74        0.79        0.76        0.78      0.77
ddC       0.80        0.75        0.80        0.76      0.78
ddI       0.69        0.73        0.69        0.71      0.71
FTC       0.96        0.83        0.94        0.89      0.91
TDF       0.75        0.75        0.68        0.74      0.73
AVG       0.78        0.77        0.77        0.77      0.77      0.76

Nonnucleoside RT Inhibitors
DLV       0.76        0.70        0.76        0.71      0.73
EFV       0.78        0.74        0.76        0.73      0.75
NVP       0.84        0.79        0.82        0.77      0.81
AVG       0.79        0.74        0.78        0.74      0.76      0.83
Information-Rich REPTree Attributes
Drugs     Root Node (a)     Level 1 Nodes (a)     Level 2 Nodes (a)

PIs (Protease Inhibitors)
APV       10                84, 87                32, 34, 53
ATV       54                73                    32, 50
IDV       54                45, 53                72, 83, 90
LPV       54                45                    77, 84
NFV       10                54, 87                29, 75, 83, 90
RTV       54                9, 84                 19, 82, 84
SQV       70                10, 83                47, 54, 90
TPV       90                52, 56                40, 73

NRTIs (Nucleoside / Nucleotide RT Inhibitors)
3TC       183               64                    66
ABC       183               115, 214              64, 101, 114, 118
AZT       67                166, 210              76, 214
d4T       209               76, 177               66, 67
ddC       115               134, 183              65, 117
ddI       150               43, 61                39, 183
FTC       183               123, 214              40
TDF       214               34, 65                68, 227, 285

NNRTIs (Non-nucleoside RT Inhibitors)
DLV       102               165, 180              69, 100, 190, 209
EFV       102               189                   99, 188
NVP       189               103, 172              173, 180
• Based on relative frequency method for generating sequence
feature vectors
• Node attribute i is a vector component number, whose value is the
relative frequency for the (i, i + 1) sequence 2-gram
• Ex.: root node 10 for APV corresponds to PR sequence positions
(10, 11), and at least one of these is known to be an important drug
resistance position (10 is in both IAS and TSM subsets)
a. Regular font, both IAS and TSM sets of positions; bold, TSM only; underlined, neither.
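The link between a node attribute number and the sequence can be made concrete with a small Python sketch. The helper names and the toy sequence are hypothetical; the only point carried over from the slide is that attribute i refers to the 1-based positions (i, i + 1).

```python
def attribute_positions(i):
    """1-based sequence positions covered by node attribute i."""
    return (i, i + 1)


def attribute_two_gram(seq, i):
    """Return the 2-gram at 1-based positions (i, i + 1) of seq."""
    if i < 1 or i + 1 > len(seq):
        raise IndexError("attribute index outside the sequence")
    # Python strings are 0-based, so position i is seq[i - 1].
    return seq[i - 1:i + 1]


# Toy 13-letter string standing in for a protein sequence.
toy_seq = "ABCDEFGHIJKLM"
positions = attribute_positions(10)
gram = attribute_two_gram(toy_seq, 10)
```

Here attribute 10 maps to positions (10, 11), mirroring the APV root node example.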
Application: Drug Cocktail Effectiveness
• Used relative frequency method and REPTree regression
• Train with one inhibitor dataset, test with another
• A high correlation coefficient (r) between actual and predicted
susceptibility values on the test set suggests that both inhibitors
(train and test sets) have similar resistance patterns and/or are
likely not good taken together
• A low or slightly negative r suggests the two inhibitors are
potentially good in combination
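The quantity compared across inhibitor pairs is the ordinary Pearson correlation between the measured susceptibility values of the test set and the values predicted by a model trained on a different inhibitor. A minimal Python sketch follows; the function name and the toy numbers are illustrative and are not data from the study.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_a * var_p)


# Toy values: predictions that track the measurements give r near 1,
# predictions that move in the opposite direction give r near -1.
r_similar = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

No numeric cutoff for "high" or "low" r is given on the slide, so none is assumed here.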
                NRTIs                                                           NNRTIs
                -------------------------------------------------------------   ---------------------
Train / Test    3TC     ABC     AZT     d4T     ddC     ddI     FTC     TDF     DLV     EFV     NVP

NRTIs
3TC             0.98    0.69    -0.08   0.01    0.45    0.38    0.99    -0.31   -0.13   -0.17   -0.25
ABC             0.85    0.91    0.29    0.42    0.62    0.63    0.93    0.05    -0.10   -0.05   -0.14
AZT             0.11    0.44    0.91    0.78    0.27    0.35    0.32    0.60    -0.07   -0.01   -0.05
d4T             0.18    0.51    0.79    0.91    0.57    0.58    0.29    0.53    -0.07   -0.02   -0.06
ddC             0.57    0.63    0.16    0.47    0.90    0.79    0.08    -0.07   -0.13   -0.17   -0.22
ddI             0.45    0.68    0.21    0.56    0.86    0.91    0.84    0.03    -0.10   -0.07   -0.13
FTC             0.94    0.69    0.03    0.05    0.41    0.36    1.00    -0.27   -0.13   -0.17   -0.24
TDF             -0.42   -0.06   0.68    0.48    -0.19   -0.05   -0.34   0.82    0.04    0.11    0.10

NNRTIs
DLV             -0.14   -0.15   0.02    -0.03   -0.10   -0.10   -0.20   -0.07   0.87    0.51    0.60
EFV             -0.13   -0.02   0.14    0.05    -0.10   -0.06   -0.13   0.06    0.55    0.91    0.72
NVP             -0.10   -0.01   0.15    0.09    -0.13   -0.06   -0.11   0.02    0.60    0.73    0.92

• Shaded areas: NRTI/NNRTI pairs (known good together)
• Two NNRTIs should NOT be taken together (based on clinical trials)
• 3TC/ABC or FTC/ABC pairs are effective, but high risk of severe
adverse events that require stoppage
• Callout labels on the slide: "Known bad pairing" and "Known good pairing"
Acknowledgements and References
• Thanks to the Stanford HIV Drug Resistance Database
(http://hivdb.stanford.edu/) for the genotype-phenotype
correlation data characterizing HIV-1 PR and RT sequences
• This study was inspired by Rhee, et al., PNAS (2006)
• Effective cocktails, and drugs not to co-administer, based on
Antiretroviral Guidelines for Adults and Adolescents from
the U.S. Department of Health and Human Services:
http://www.aidsinfo.nih.gov/ContentFiles/AdultandAdolescentGL.pdf