Exciting Bioinformatics Adventures
Limsoon Wong
Institute for Infocomm Research
Plan
• Treatment optimization of childhood ALL
• Treatment prognosis of DLBC lymphoma
• Prediction of translation initiation site
• Prediction of vaccine target
• Reliability assessment of Y2H expts
Treatment Optimization of Childhood Leukemia
Image credit: FEER
Childhood ALL
• Major subtypes are: T-ALL, E2A-PBX, TEL-AML,
MLL genome rearrangements,
Hyperdiploid >50, BCR-ABL
• Diff subtypes respond
differently to same Tx
• Over-intensive Tx
– Development of
secondary cancers
– Reduction of IQ
• Under-intensive Tx
– Relapse
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
• The subtypes look
similar
• Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
• Unavailable in most
ASEAN countries
Single-Test Platform of
Microarray & Machine Learning
Image credit: Affymetrix
Multidimensional Scaling Plot
Subtype Diagnosis
Is there a new subtype?
• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL
Conclusions
Conventional Tx:
• intermediate intensity to everyone
⇒ 10% suffer relapse
⇒ 50% suffer side effects
⇒ costs US$150m/yr
Our optimized Tx:
• high intensity to 10%
• intermediate intensity to 40%
• low intensity to 50%
• costs US$100m/yr
• High cure rate of 80%
• Less relapse
• Less side effects
• Save US$51.6m/yr
References
• E.-J. Yeoh et al., “Classification, subtype discovery, and
prediction of outcome in pediatric acute lymphoblastic
leukemia by gene expression profiling”, Cancer Cell,
1:133-143, 2002
Treatment Prognosis for DLBC Lymphoma
Image credit: Rosenwald et al, 2002
Diffuse Large B-Cell Lymphoma
• DLBC lymphoma is the
most common type of
lymphoma in adults
• Can be cured by
anthracycline-based
chemotherapy in 35 to
40 percent of patients
⇒ DLBC lymphoma
comprises several
diseases that differ in
responsiveness to
chemotherapy
Copyright © 2005 by Limsoon Wong. Adapted from Huiqing Liu
• Intl Prognostic Index
(IPI)
– age, “Eastern Cooperative
Oncology Group” Performance
status, tumor stage, lactate
dehydrogenase level, sites of
extranodal disease, ...
• Not very good for
stratifying DLBC
lymphoma patients for
therapeutic trials
⇒ Use gene-expression
profiles to predict
outcome of
chemotherapy?
Knowledge Discovery from Gene
Expression of “Extreme” Samples
[Figure: workflow diagram]
• 240 samples
• “Extreme” sample selection (< 1 yr vs > 8 yrs): 47 short-term survivors, 26 long-term survivors
• Knowledge discovery from gene expression: 7399 genes → 84 genes
• 80 samples: T is long-term if S(T) < 0.3; T is short-term if S(T) > 0.7
Copyright © 2005 by Jinyan Li, Huiqing Liu, and Limsoon Wong
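The selection and scoring rules on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Patient` record, its field names, and the requirement that a short-term survivor has an observed death (rather than being lost to follow-up) are assumptions. The 1-year and 8-year cut-offs and the 0.3 / 0.7 thresholds come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patient:
    # Hypothetical record layout; the slide does not specify one.
    patient_id: str
    followup_years: float
    died: bool


def select_extreme(patients: List[Patient],
                   short_cutoff: float = 1.0,
                   long_cutoff: float = 8.0) -> dict:
    """Keep only the "extreme" training samples: < 1 yr vs > 8 yrs."""
    # Assumption: a short-term survivor must have an observed death.
    short_term = [p for p in patients
                  if p.died and p.followup_years < short_cutoff]
    long_term = [p for p in patients if p.followup_years > long_cutoff]
    return {"short_term": short_term, "long_term": long_term}


def classify_by_risk(score: float,
                     low: float = 0.3,
                     high: float = 0.7) -> Optional[str]:
    """Apply the slide's thresholds to a risk score S(T)."""
    if score < low:
        return "long-term"
    if score > high:
        return "short-term"
    # The slide defines no label for scores between the two thresholds.
    return None
```

Samples that fall between the two survival cut-offs are simply left out of training, which is the point of the "extreme" selection.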
Kaplan-Meier Plot for 80 Test Cases
p-value of log-rank test: < 0.0001
Risk score thresholds: 0.7, 0.3
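For readers unfamiliar with the plot type, a Kaplan-Meier curve can be computed with the standard product-limit estimator. The sketch below is a generic, dependency-free version; the function name and input layout are assumptions, and it does not reproduce the slide's data or the log-rank test.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float],
                 events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) steps of the product-limit estimate.

    events[i] is True if a death was observed at times[i],
    False if the patient was censored at that time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        leaving = 0
        # Group all patients that share the same time point.
        while i < len(order) and times[order[i]] == t:
            deaths += 1 if events[order[i]] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

One curve would be computed per predicted risk group, and the groups then compared with a log-rank test.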
Improvement Over IPI
(A) IPI low,
p-value = 0.0063
(B) IPI intermediate,
p-value = 0.0003
Merit of “Extreme” Samples
(A) W/o sample selection (p = 0.38)
(B) With sample selection (p = 0.009)
Without training sample selection, there is no clear difference in the overall survival of the 80 samples in the validation
group of the DLBCL study
References
• H. Liu et al., “Selection of patient samples and genes
for outcome prediction”, Proc. CSB2004, pages 382-392
Protein Translation Initiation Site Recognition
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• What makes the second ATG the TIS?
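To make the question concrete, the candidate start codons in the sample can be listed with a few lines of code. This is only an illustrative sketch: the constant `CDNA` is the four sequence lines from the slide joined together, and `find_atg_positions` is a hypothetical helper name. In the slide's annotation the `i` appears to mark the initiation codon and the `E`s the coding region that follows.

```python
# The four sequence lines shown on the slide, concatenated.
CDNA = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)


def find_atg_positions(seq: str) -> list:
    """Return the 0-based start index of every ATG in seq."""
    positions = []
    start = seq.find("ATG")
    while start != -1:
        positions.append(start)
        start = seq.find("ATG", start + 1)
    return positions


candidates = find_atg_positions(CDNA)
print(candidates)  # every ATG is a candidate; only one is the true TIS
```

Pattern matching alone yields several candidates, so the sequence context around each ATG has to supply the distinguishing signal.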
Copyright © 2005 by Limsoon Wong
Approach
• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection
– Entropy, 2, CFS, t-test, domain know-how...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
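The signal generation and signal selection steps above can be sketched with k-gram counting followed by an entropy-based ranking. This is a simplified illustration under stated assumptions, not the authors' pipeline: features are reduced to presence or absence of a k-gram in a window, ranking uses plain information gain, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def kgram_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-gram in seq (signal generation)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def entropy(labels: List[bool]) -> float:
    """Shannon entropy (bits) of a list of boolean class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def information_gain(present: List[bool], labels: List[bool]) -> float:
    """Entropy reduction from splitting samples on presence of one k-gram."""
    with_f = [y for x, y in zip(present, labels) if x]
    without_f = [y for x, y in zip(present, labels) if not x]
    n = len(labels)
    remainder = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - remainder


def rank_kgrams(windows: List[str], labels: List[bool], k: int) -> List[str]:
    """Rank k-grams by information gain, best first (signal selection)."""
    counts = [kgram_counts(w, k) for w in windows]
    vocab = sorted(set().union(*counts))
    gains: Dict[str, float] = {
        g: information_gain([c[g] > 0 for c in counts], labels) for g in vocab
    }
    return sorted(vocab, key=lambda g: -gains[g])
```

The top-ranked k-grams would then be passed to whichever classifier is used for signal integration.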
Amino-Acid Features
Amino Acid K-grams Discovered (by entropy)
Validation Results (on Hatzigeorgiou’s dataset)
• Using top 100 features selected by entropy
and trained on Pedersen & Nielsen’s dataset
Validation Results (on Chr X and Chr 21)
[Figure: our method compared with ATGpr]
• Using top 100 features selected by entropy and
trained on Pedersen & Nielsen’s dataset
References
• L. Wong et al., “Using feature generation and
feature selection for accurate prediction of
translation initiation sites”, GIW 13:192-200,
2002
Vaccine Target Prediction
Image credit: Asif Khan
T-Cell Epitope Prediction
• Why?
• Challenges:
– Only 1%-5% of peptides
from a protein bind to any
one HLA molecule
– Traditional approaches
are slow, & inapplicable to
large-scale screening
– There are ~2000 variants
of HLA classified in ~20
supertypes
– Relatively small number of
expt data on peptides that
bind HLA molecules
– For the majority of HLA
molecules, expt data do
not exist
⇒ Computer Modeling
– Enable systematic
screening for HLA binders
– Minimize number of expts
– Reduce cost 10x
[Figure: promiscuous peptides (P1-P4) and the HLA molecules (H1-H4) of one supertype]
Copyright © 2005 by Limsoon Wong. Adapted from Asif Khan.
Multipred Approach
Copyright © 2005 by Asif Khan, Guanglan Zhang, Vladimir Brusic
Expt Validation
[Figure labels: FP, FN, cut-off threshold; HCV IB protein sequence; DR supertype]
Accuracy of Multipred
[Bar chart: ANN, HMM, and SVM accuracy per HLA variant, scale 0.00 to 1.00]
Model   A-0201  A-0202  A-0204  A-0205  A-0206  average
ANN     0.87    0.76    0.88    0.93    0.91    0.87
HMM     0.93    0.73    0.92    0.94    0.88    0.88
SVM     0.90    0.81    0.93    0.97    0.85    0.89
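As a quick consistency check, the averages can be recomputed from the per-variant values listed above. The snippet is only a sketch; the dictionary layout and function name are assumptions made for illustration.

```python
# Per-variant accuracy values as reported on the slide.
ALLELES = ["A-0201", "A-0202", "A-0204", "A-0205", "A-0206"]
SCORES = {
    "ANN": [0.87, 0.76, 0.88, 0.93, 0.91],
    "HMM": [0.93, 0.73, 0.92, 0.94, 0.88],
    "SVM": [0.90, 0.81, 0.93, 0.97, 0.85],
}


def mean_accuracy(values):
    """Arithmetic mean, rounded to two decimals like the slide's values."""
    return round(sum(values) / len(values), 2)


# Average per model, and the best-scoring model for each HLA variant.
averages = {model: mean_accuracy(vals) for model, vals in SCORES.items()}
best_per_allele = {
    allele: max(SCORES, key=lambda m: SCORES[m][i])
    for i, allele in enumerate(ALLELES)
}
print(averages)
print(best_per_allele)
```

No single model is best on every variant, which is worth keeping in mind when reading the averages.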
Conclusions
• Computer models are necessary to aid in
identification of vaccine targets
• Prediction models built are both sensitive and
specific
• MULTIPRED can identify promiscuous
peptides and immunological hot-spots which
are useful for vaccine design
• Hot-spots are ideal for development of epitope-based vaccines
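The ideas of promiscuous peptides and hot-spots can be illustrated with a small sketch. It assumes that some predictor has already produced, for each HLA variant, the start positions of the peptides it predicts to bind. That input format, the function names, the 9-residue peptide length, and the definition of a hot-spot as a cluster of overlapping promiscuous peptides are all assumptions of this example, not a description of how MULTIPRED works internally.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def promiscuous_positions(binders: Dict[str, Set[int]],
                          min_alleles: int) -> List[int]:
    """Start positions predicted to bind at least min_alleles HLA variants."""
    votes = Counter(pos for positions in binders.values() for pos in positions)
    return sorted(pos for pos, n in votes.items() if n >= min_alleles)


def hot_spots(positions: List[int],
              peptide_len: int = 9) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent peptides into (start, end) regions.

    end is exclusive. Treating a hot-spot as a cluster of overlapping
    promiscuous peptides is a simplifying assumption of this sketch.
    """
    regions: List[Tuple[int, int]] = []
    for pos in sorted(positions):
        if regions and pos <= regions[-1][1]:
            last_start, last_end = regions[-1]
            regions[-1] = (last_start, max(last_end, pos + peptide_len))
        else:
            regions.append((pos, pos + peptide_len))
    return regions
```

For example, with three variants of one supertype and `min_alleles=2`, only peptides predicted for at least two variants are kept before merging.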
References
• K.N. Srinivasan, et al. “Predictions of Class I T-cell epitopes: Evidence of presence of
immunological hot spots inside antigens”,
Bioinformatics, 20:i297-i302, 2004.
Assessing Reliability of Protein-Protein Interaction Expts
% of TP based on shared cellular role (I = .95)
% of TP based on shared cellular role (I = 1)
% of TP based on co-localization
TP = ~50%
Image credit: Sprinzak et al, 2003
Some Protein Interaction Data Sets
Large disagreement betw methods
• Can we find a way to rank candidate interacting pairs
according to their reliability?
Copyright © 2005 by Limsoon Wong. Adapted from Sprinzak et al, 2003
Some “Reasonable” Speculations
• A true interacting pair is often connected by at
least one alternative path (reason: a biological
function is performed by a highly
interconnected network of interactions)
• The shorter the alternative path, the more likely
the interaction (reason: evolution of life is
through “add-on” interactions of other or newer
folds onto existing ones)
⇒ Existence of a strong short alternative path
connecting an interacting pair indicates that the
interaction is “reliable”
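The notion of a short alternative path can be made concrete with a breadth-first search that ignores the direct edge between the two proteins. This is a minimal sketch on an unweighted, undirected graph stored as an adjacency dictionary. It does not model path "strength" and is not the reliability measure used by the authors; the function name and data layout are assumptions.

```python
from collections import deque
from typing import Dict, Optional, Set


def shortest_alternative_path(graph: Dict[str, Set[str]],
                              a: str, b: str) -> Optional[int]:
    """Length (in edges) of the shortest a-b path that avoids the direct edge.

    Returns None when no alternative path exists.
    """
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if node == a and nxt == b:
                continue  # skip the direct interaction being assessed
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Under the speculations above, a pair with an alternative path of length 2 would be considered better supported than a pair with none.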
Copyright © 2005 by Limsoon Wong. Adapted from Chen et al, 2004
Interaction Pathway Reliability
Evaluation wrt
Reproducible Interactions
The number of pairs not in the
intersection of Ito & Uetz is not
changed much wrt the ipr value
of the pairs
The number of pairs in the
intersection of Ito & Uetz
increases wrt the ipr value
of the pairs
• “ipr” correlates well to “reproducible” interactions
• “ipr” seems to work
Evaluation wrt
Common Cellular Role, etc
• “ipr” correlates well
to common cellular
roles, localization, &
expression
At the ipr threshold
that eliminated 80%
of pairs, ~85% of the
remaining pairs
have common cellular
roles
Evaluation wrt
“Many-few” Interactions
Part of the network of
physical interactions
reported by
Ito et al., PNAS, 2001
• Number of “Many-few” interactions increases when
more “reliable” IPR threshold is used to filter interactions
• Consistent with the Maslov-Sneppen prediction
Evaluation wrt “Cross-Talkers”
• A MIPS functional cat:
– 02 | ENERGY
– 02.01 | glycolysis and gluconeogenesis
– 02.01.01 | glycolysis methylglyoxal bypass
– 02.01.03 | regulation of glycolysis & gluconeogenesis
• First 2 digits are the top cat
• Other digits add more
granularity to the cat
⇒ Compare non-colocalized high- & low-IPR
pairs to find the number that
fall into the same cat. If more
high-IPR pairs are in the same
cat, then IPR works
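Checking whether two proteins fall into the same category at a chosen granularity amounts to comparing prefixes of the dotted codes. The sketch below assumes each protein is annotated with a list of such codes; the function names are hypothetical.

```python
def category_prefix(code: str, levels: int) -> str:
    """Truncate a dotted MIPS-style code to its first `levels` fields."""
    return ".".join(code.split(".")[:levels])


def share_category(codes_a, codes_b, levels: int = 1) -> bool:
    """True if two proteins share any category at the given granularity.

    levels=1 compares top categories (the first 2 digits);
    larger values compare finer-grained categories.
    """
    prefixes_a = {category_prefix(c, levels) for c in codes_a}
    prefixes_b = {category_prefix(c, levels) for c in codes_b}
    return bool(prefixes_a & prefixes_b)
```

With the codes shown above, `02.01.01` and `02.01.03` agree at the top level and at two levels, but differ at full granularity. Note that a code with fewer fields than `levels` is compared as-is, which is a simplification.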
• For top cat
– 148/257 high-IPR pairs
are in same cat
– 65/260 low-IPR pairs are
in same cat
• For fine-granularity cat
– 135/257 high-IPR pairs
are in same cat
– 37/260 low-IPR pairs are
in same cat
⇒ IPR works
⇒ IPR pairs that are not
co-localized are real
cross-talkers!
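The counts on this slide are easier to compare as rates. The following sketch just converts them to percentages and a high-to-low ratio; the dictionary layout and function name are assumptions for illustration, and no significance test is implied.

```python
# (pairs in same category, total pairs), as reported on the slide.
COUNTS = {
    "top": {"high_ipr": (148, 257), "low_ipr": (65, 260)},
    "fine": {"high_ipr": (135, 257), "low_ipr": (37, 260)},
}


def same_cat_rate(hits: int, total: int) -> float:
    """Fraction of pairs whose two proteins fall in the same category."""
    return hits / total


for level, groups in COUNTS.items():
    high = same_cat_rate(*groups["high_ipr"])
    low = same_cat_rate(*groups["low_ipr"])
    print(f"{level}: high-IPR {high:.1%}, low-IPR {low:.1%}, "
          f"ratio {high / low:.1f}x")
```

At both granularities the high-IPR pairs share a category more than twice as often as the low-IPR pairs, which is the basis for the conclusion above.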
Conclusions
• There are latent local & global “motifs” that
indicate the likelihood of protein interactions
• These motifs can be exploited in computational
elimination of false positives from high-throughput Y2H expts
References
• J. Chen et al, “Mining high-throughput
experimental data for reliable protein
interaction data using network”, 16th
IEEE International Conference on Tools with
Artificial Intelligence (ICTAI 2004), Florida,
November 15-17, 2004
Acknowledgements
• Childhood ALL:
– Jinyan Li, Huiqing Liu
– Allen Yeoh
• DLBC Lymphoma:
– Jinyan Li, Huiqing Liu
• Translation Initiation:
– Fanfan Zeng, Roland Yap
– Huiqing Liu
• T-Cell Epitopes:
– Vladimir Brusic, Asif Khan,
Guanglan Zhang
– Tom August, KN Srinivasan
• Protein Interaction
Reliability:
– Jin Chen, Mong Li Lee,
Wynne Hsu
– See-Kiong Ng
– Prasanna Kolatkar, Jer-Ming Chia