Limsoon Wong
Institute for Infocomm Research
Singapore
What is Bioinformatics?
Themes of Bioinformatics
Bioinformatics =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
Benefits of Bioinformatics
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
Timeline (years and host institutes, in slide order): 1994, ISS, 1996, 1998, KRDL, 2000, 2002, LIT/I²R
Projects:
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Cleansing & Warehousing (FIMM)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Integration Technology (Kleisli)
Venom Informatics
Data Integration
A DOE “impossible query”:
For each gene on a given cytogenetic band, find its non-human homologs.
Source   Type     Location    Remarks
GDB      Sybase   Baltimore   Flat tables; SQL joins; location info
Entrez   ASN.1    Bethesda    Nested tables; keywords; homolog info
Data Integration Results
• Using Kleisli:
  • Clear
  • Succinct
  • Efficient
  • Handles
    • heterogeneity
    • complexity

sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
  #accn: g.#genbank_ref,
  #nonhuman-homologs: H
from
  L as c, E as g,
  { select u
    from g.#genbank_ref.na-get-homolog-summary as u
    where not(u.#title string-islike "%Human%")
    andalso not(u.#title string-islike "%H.sapien%") } as H
where c.#chrom_num = "22"
andalso g.#object_id = c.#locus_id
andalso not (H = { });
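To make the shape of that query easier to follow, here is a minimal Python sketch of the same logic over in-memory data. The sample rows, the accession strings, and the get_homolog_summary helper are hypothetical stand-ins for the GDB views and the remote Entrez call; this is not how Kleisli evaluates the query.

```python
# Hypothetical in-memory stand-ins for the two GDB views.
locus_cyto_location = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "7"},
]
object_genbank_eref = [
    {"object_id": 1, "genbank_ref": "ACC_A"},
    {"object_id": 2, "genbank_ref": "ACC_B"},
]
# Made-up homolog summaries keyed by accession.
homolog_summaries = {
    "ACC_A": [{"title": "Human gene X"}, {"title": "Mus musculus gene X"}],
    "ACC_B": [{"title": "Rattus norvegicus gene Y"}],
}


def get_homolog_summary(accession):
    """Stand-in for the remote na-get-homolog-summary lookup."""
    return homolog_summaries.get(accession, [])


def nonhuman_homologs(chrom_num):
    """Join loci to GenBank refs, then attach the set of non-human homologs."""
    results = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom_num:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested sub-query: keep summaries whose title does not mention human.
            homologs = [
                u for u in get_homolog_summary(g["genbank_ref"])
                if "Human" not in u["title"] and "H.sapien" not in u["title"]
            ]
            if homologs:  # mirrors: not (H = { })
                results.append({"accn": g["genbank_ref"],
                                "nonhuman_homologs": homologs})
    return results


result = nonhuman_homologs("22")
```

Each result record carries a nested collection, which is the part a flat SQL join cannot express directly.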
Data Warehousing
Motivation: efficiency, availability, "denial of service", data cleansing
Requirements: efficient to query, easy to update, model data naturally
{(#uid: 6138971,
#title: "Homo sapiens adrenergic ...",
#accession: "NM_001619",
#organism: "Homo sapiens",
#taxon: 9606,
#lineage: ["Eukaryota", "Metazoa", …],
#seq: "CTCGGCCTCGGGCGCGGC...",
#feature: {
(#name: "source",
#continuous: true,
#position: [
(#accn: "NM_001619",
#start: 0, #end: 3602,
#negative: false)],
#anno: [
(#anno_name: "organism",
#descr: "Homo sapiens"), …] ), …)}
Data Warehousing Results
Relational DBMS is insufficient because it forces us to fragment data into 3NF.
Kleisli turns flat relational
DBMS into nested relational
DBMS.
It can use flat relational
DBMS such as Sybase, Oracle,
MySQL, etc. to be its update-able complex object store.
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP from aa-get-seqfeat-general "PTP" as x using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title from GP as x where x.#uid = 131470;
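The idea of using a flat table as a complex object store can be illustrated with a small Python sketch. It assumes SQLite and JSON serialisation purely for illustration; the record contents are made up, and Kleisli's actual storage encoding is not described on the slide.

```python
import json
import sqlite3

# A flat two-column table, analogous to GP (#uid, #detail) above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")

# A hypothetical nested report; the whole object goes into one column.
report = {
    "uid": 131470,
    "title": "hypothetical GenPept report title",
    "feature": [{"name": "source", "position": [{"start": 0, "end": 100}]}],
}
conn.execute("INSERT INTO GP (uid, detail) VALUES (?, ?)",
             (report["uid"], json.dumps(report)))


def title_of(uid):
    """Fetch the nested object by key and project out its title field."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return None if row is None else json.loads(row[0])["title"]
```

The relational engine only sees a key and an opaque value; the nesting is restored when the object is read back.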
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLY VNVFSNNAK EIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENV KNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRP RGDNSSVQK PEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRN IPYSPLPPK VLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLAL LACAGLAYK FVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
Prediction by our ANN model for HLA-A11
29 predictions
22 epitopes
76% specificity
Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders by BIMAS rank (axis marks 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y
Looking for patterns that are
valid
novel
useful
understandable
Gene Expression Analysis
Classifying gene expression profiles
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression
Medical Record & Gene
Expression Analysis Results
PCL, a novel "emerging pattern" method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
Works well for gene expressions
Cancer Cell, March 2002, 1(2)
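An emerging pattern is usually described as a set of attribute values whose support rises sharply from one class to another. The sketch below computes that support ratio (growth rate) on the five sample medical records shown earlier. It illustrates the notion only and is not the PCL classifier itself; the function names are assumptions.

```python
records = [
    {"age": 49, "sex": "M", "chol": 266, "ecg": "Hyp", "heart": 171, "sick": "N"},
    {"age": 64, "sex": "M", "chol": 211, "ecg": "Norm", "heart": 144, "sick": "N"},
    {"age": 58, "sex": "F", "chol": 283, "ecg": "Hyp", "heart": 162, "sick": "N"},
    {"age": 58, "sex": "M", "chol": 284, "ecg": "Hyp", "heart": 160, "sick": "Y"},
    {"age": 58, "sex": "M", "chol": 224, "ecg": "Abn", "heart": 173, "sick": "Y"},
]


def support(pattern, rows):
    """Fraction of rows that match every attribute value in the pattern."""
    if not rows:
        return 0.0
    hits = sum(all(r[k] == v for k, v in pattern.items()) for r in rows)
    return hits / len(rows)


def growth_rate(pattern, rows, target="Y"):
    """Support in the target class divided by support in the other class."""
    pos = [r for r in rows if r["sick"] == target]
    neg = [r for r in rows if r["sick"] != target]
    s_pos, s_neg = support(pattern, pos), support(pattern, neg)
    if s_neg == 0:
        return float("inf") if s_pos > 0 else 0.0
    return s_pos / s_neg
```

On this tiny sample, the pattern {age: 58, sex: M} occurs only among sick patients, so its growth rate is unbounded; with five records this says nothing about real data.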
Protein Interaction Extraction
“What are the protein-protein interaction pathways from the latest reported discoveries?”
Protein Interaction Extraction Results
Rule-based system for processing free texts in scientific abstracts
Specialized in
extracting protein names
extracting protein-protein interactions
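As a rough illustration of what a single rule over free text might look like, here is a toy Python pattern. The verb list, the name pattern, and the example sentences are all hypothetical; the rules used by the actual system are not given on the slide.

```python
import re

# Toy rule: <Name> <interaction verb> <Name>, where a "name" is a
# capitalised alphanumeric token. Real protein names are far messier.
INTERACTION_RULE = re.compile(
    r"\b([A-Z][A-Za-z0-9-]*)\s+"
    r"(interacts with|binds to|phosphorylates)\s+"
    r"([A-Z][A-Za-z0-9-]*)"
)


def extract_interactions(text):
    """Return (protein, verb, protein) triples matched by the toy rule."""
    return INTERACTION_RULE.findall(text)


abstract = "We show that Raf1 binds to MEK1 in vivo. Cdc25 phosphorylates Cdk1."
triples = extract_interactions(abstract)
```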
Behind the Scene
Vladimir Bajic
Vladimir Brusic
Jinyan Li
See-Kiong Ng
Limsoon Wong
Louxin Zhang
Allen Chong
Judice Koh
SPT Krishnan
Huiqing Liu
Seng Hong Seah
Soon Heng Tan
Guanglan Zhang
Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators…
A more detailed example of post-genome knowledge discovery
Translation Initiation Recognition
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCC ATG GCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGC ATG GCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAG ATG AGAAGAGGGAG ATG GCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
Approach
Training data gathering
Signal generation
k-grams, distance, domain know-how, ...
Signal selection
Entropy, χ², CFS, t-test, domain know-how, ...
Signal integration
SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data
Vertebrate dataset of Pedersen & Nielsen
[ISMB’97]
3312 sequences
13503 ATG sites
3312 (24.5%) are TIS
10191 (75.5%) are non-TIS
Use for 3-fold x-validation expts
Signal Generation
K-grams (i.e., k consecutive letters)
K = 1, 2, 3, 4, 5, …
Window size vs. fixed position
Up-stream, downstream vs. any where in window
In-frame vs. any frame
[Figure: bar chart over A, C, G, T for seq1, seq2, seq3 (y-axis 0 to 3)]
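The signal-generation options above can be sketched as a k-gram counter around a candidate ATG. This is one possible reading of the slide: three regions (upstream, downstream, anywhere in the window) crossed with two frame modes (in-frame, any frame). The function name, the window size of 99, and the feature keys are assumptions.

```python
from collections import Counter


def kgram_signals(seq, atg, k, window=99):
    """Count k-grams near the candidate ATG at index `atg`.

    Keys are (region, frame_mode, kgram). A k-gram is in-frame when its
    start offset from the ATG is a multiple of 3.
    """
    start = max(0, atg - window)
    end = min(len(seq), atg + 3 + window)
    counts = Counter()
    for j in range(start, end - k + 1):
        if j + k <= atg:
            side = "upstream"
        elif j >= atg + 3:
            side = "downstream"
        else:
            side = None  # overlaps the candidate ATG itself
        gram = seq[j:j + k]
        in_frame = (j - atg) % 3 == 0
        regions = (side, "window") if side else ("window",)
        for region in regions:
            counts[(region, "any-frame", gram)] += 1
            if in_frame:
                counts[(region, "in-frame", gram)] += 1
    return counts


# Tiny made-up sequence with the candidate ATG at index 2.
signals = kgram_signals("CCATGGCTTAA", atg=2, k=3)
```

Here the in-frame downstream TAA is counted once, which is the kind of signal a stop codon would produce.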
Too Many Signals
For each value of k, there are
4^k * 3 * 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have
4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
This is too many for most machine learning algorithms
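The growth in feature count can be checked with a few lines of Python. The sketch assumes the factors 3 and 2 stand for the three regions and two frame modes from the previous slide. Note that the formula alone gives 8184 for k = 1 to 5; the slide's total of 8188 includes an additional leading term of 4 that the formula does not account for.

```python
def n_signals(k, regions=3, frames=2, alphabet=4):
    """Number of distinct k-gram signals: alphabet^k * regions * frames."""
    return alphabet ** k * regions * frames


per_k = [n_signals(k) for k in range(1, 6)]  # k = 1..5
total = sum(per_k)
```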
Signal Selection (Basic Idea)
Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance
Which of the following 3 signals is good?
Signal Selection
(e.g., t-statistics)
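A minimal sketch of scoring one signal by a t-statistic follows. It uses the unequal-variance (Welch) form, which is an assumption; the slide does not say which variant was used. A large score means the class means are far apart relative to the spread within each class.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(values_a, values_b):
    """|mean difference| over its standard error (sample variances).

    Raises ZeroDivisionError if both classes have zero variance.
    """
    na, nb = len(values_a), len(values_b)
    std_err = sqrt(variance(values_a) / na + variance(values_b) / nb)
    return abs(mean(values_a) - mean(values_b)) / std_err


# Made-up signal values for the two classes.
well_separated = t_statistic([1, 2, 3], [7, 8, 9])
overlapping = t_statistic([1, 5, 9], [2, 6, 10])
```

Signals would then be ranked by this score and the top ones kept.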
Signal Selection
(e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS
A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
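The group-scoring idea is commonly written as a merit function: merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch below uses absolute Pearson correlation as a stand-in correlation measure, which is an assumption; CFS implementations typically use an entropy-based measure for discrete signals.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cfs_merit(signals, labels):
    """Merit of a group of signals: high class correlation, low redundancy."""
    k = len(signals)
    r_cf = sum(abs(pearson(s, labels)) for s in signals) / k
    pairs = list(combinations(signals, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # tracks the class exactly
duplicate = [0, 0, 1, 1]     # redundant copy of `informative`
irrelevant = [0, 1, 0, 1]    # uncorrelated with the class
```

In this toy case adding the duplicate leaves the merit unchanged and adding the irrelevant signal lowers it, so neither is worth including.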
Sample k-grams Selected by CFS
Position –3: Kozak consensus
in-frame upstream ATG: leaky scanning
in-frame downstream TAA, TAG, TGA: stop codon
in-frame downstream CTG, GAC, GAG, and GCC: codon bias?
Signal Integration
kNN
Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
SVM
Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
Naïve Bayes, ANN, C4.5, ...
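The kNN rule described above is short enough to sketch directly. Squared Euclidean distance and the toy two-dimensional samples are assumptions made for illustration.

```python
from collections import Counter


def knn_predict(train, test_x, k=3):
    """train: list of (feature_vector, label). Majority label of the k nearest."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], test_x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training samples.
train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
    ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS"),
]
```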
Results
(3-fold x-validation)
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
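The four column measures can be computed from confusion counts as follows. The counts in the example are made up for illustration and are not taken from the experiments.

```python
def tis_metrics(tp, fn, tn, fp):
    """The four measures used as column headers in the results tables."""
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP + FN)
        "specificity": tn / (tn + fp),   # TN/(TN + FP)
        "precision": tp / (tp + fp),     # TP/(TP + FP)
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts: 100 true TIS and 300 non-TIS sites.
example = tis_metrics(tp=80, fn=20, tn=270, fp=30)
```

Because non-TIS sites outnumber TIS sites, a classifier can score high on specificity and accuracy while precision stays noticeably lower.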
Improvement by Voting
Apply any 3 of Naïve Bayes, SVM, Neural
Network, & Decision Tree. Decide by majority.
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB+SVM+NN        79.2%          92.1%          76.5%          88.9%
NB+SVM+Tree      78.8%          92.0%          76.2%          88.8%
NB+NN+Tree       77.6%          94.5%          82.1%          90.4%
SVM+NN+Tree      75.9%          94.3%          81.2%          89.8%
Best of 4        84.3%          94.4%          81.1%          89.4%
Worst of 4       73.9%          86.1%          66.3%          85.7%
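Majority voting over three classifiers amounts to the following sketch. The prediction lists are made up; in practice each would come from one trained model applied to the same ATG sites.

```python
def majority_vote(predictions):
    """predictions: one list of booleans (True = TIS) per classifier.

    Returns one combined decision per site: True when more than half agree.
    """
    return [sum(votes) * 2 > len(votes) for votes in zip(*predictions)]


# Hypothetical outputs of three classifiers on four ATG sites.
nb = [True, True, False, False]
svm = [True, False, False, True]
nn = [False, True, False, True]
combined = majority_vote([nb, svm, nn])
```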
Improvement by Scanning
Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS.
Naïve Bayes & SVM models were trained using
TIS vs. Up-stream ATG
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB               84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
NB+Scanning      87.3%          96.1%          87.9%          93.9%
SVM+Scanning     88.5%          96.3%          88.6%          94.4%
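The scanning rule can be sketched as a left-to-right loop that stops at the first ATG the classifier accepts. The classifier here is a toy stand-in (it accepts an ATG followed by G), not a trained Naïve Bayes or SVM model, and the sequence is made up.

```python
def scan_for_tis(seq, is_tis):
    """Return the index of the first ATG accepted by `is_tis`, or None."""
    pos = seq.find("ATG")
    while pos != -1:
        if is_tis(seq, pos):
            return pos
        pos = seq.find("ATG", pos + 1)
    return None


def toy_classifier(seq, pos):
    """Hypothetical stand-in: accept an ATG immediately followed by G."""
    return seq[pos + 3:pos + 4] == "G"


# ATGs occur at indices 2, 8 and 13; the one at 2 is followed by A.
tis_index = scan_for_tis("CCATGACCATGGCATGG", toy_classifier)
```

Everything downstream of the first accepted ATG is never examined, which is why false positives after the true TIS stop hurting precision.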
Performance Comparisons
                   TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB                 84.3%          86.1%          66.3%          85.7%
Decision Tree      74.0%          94.4%          81.1%          89.4%
NB+NN+Tree         77.6%          94.5%          82.1%          90.4%
SVM+Scanning       88.5%          96.3%          88.6%          94.4%*
Pedersen&Nielsen   78%            87%            -              85%
Zien               69.9%          94.1%          -              88.1%
Hatzigeorgiou      -              -              -              94%*
* result not directly comparable
Technique Comparisons
Pedersen&Nielsen [ISMB’97]
Neural network
No explicit features
Zien [Bioinformatics’00]
SVM+kernel engineering
No explicit features
Hatzigeorgiou [Bioinformatics’02]
Multiple neural networks
Scanning rule
No explicit features
Our approach
Explicit feature generation
Explicit feature selection
Use any machine learning method w/o any form of complicated tuning
Scanning rule is optional
Acknowledgements
A.G. Pedersen
H. Nielsen
Roland Yap
Fanfan Zeng