From Informatics
to Bioinformatics
Limsoon Wong
Laboratories for Information Technology
Singapore
What is Bioinformatics?
Themes of Bioinformatics
Bioinformatics =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
Benefits of Bioinformatics
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
[Timeline figure: 1994, 1996, 1998, 2000, 2002; institutes ISS, KRDL, LIT]
• Integration Technology (Kleisli)
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Cleansing & Warehousing (FIMM)
• Gene Feature Recognition (Dragon)
• Venom Informatics
Quick Samplings
Data Integration
A DOE “impossible query”:
For each gene on a given cytogenetic band, find its
non-human homologs.
source   type     location    remarks
GDB      Sybase   Baltimore   Flat tables; SQL joins; Location info
Entrez   ASN.1    Bethesda    Nested tables; Keywords; Homolog info
Data Integration Results
• Using Kleisli:
  • Clear
  • Succinct
  • Efficient
  • Handles
    • heterogeneity
    • complexity

sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
  #accn: g.#genbank_ref, #nonhuman-homologs: H
from
  L as c, E as g,
  {select u
   from g.#genbank_ref.na-get-homolog-summary as u
   where not(u.#title string-islike "%Human%") andalso
         not(u.#title string-islike "%H.sapien%")} as H
where
  c.#chrom_num = "22" andalso
  g.#object_id = c.#locus_id andalso
  not (H = { });
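To make the shape of that query concrete, here is a minimal Python sketch of the same logic over in-memory stand-ins. It is not Kleisli and does not contact GDB or Entrez: the three data structures, their records, and the function names are all invented for illustration.

```python
# Toy stand-ins for the two sources. Every record below is invented.
locus_cyto_location = [  # plays the role of view L (GDB, flat table)
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "22"},
    {"locus_id": 3, "chrom_num": "7"},
]
object_genbank_eref = [  # plays the role of view E (GDB, flat table)
    {"object_id": 1, "genbank_ref": "ACC0001"},
    {"object_id": 2, "genbank_ref": "ACC0002"},
    {"object_id": 3, "genbank_ref": "ACC0003"},
]
homolog_summaries = {  # plays the role of the Entrez homolog lookup
    "ACC0001": [{"title": "Mouse example gene"}, {"title": "Human example gene"}],
    "ACC0002": [{"title": "H.sapiens example gene"}],
    "ACC0003": [{"title": "Rat example gene"}],
}


def nonhuman_homologs(accn):
    """Inner select: keep homolog summaries whose title does not look human."""
    return [
        u for u in homolog_summaries.get(accn, [])
        if "Human" not in u["title"] and "H.sapien" not in u["title"]
    ]


def genes_with_nonhuman_homologs(chrom):
    """Outer select: join L and E, attach the nested set H, drop empty H."""
    result = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            h = nonhuman_homologs(g["genbank_ref"])
            if h:  # corresponds to: not (H = { })
                result.append({"accn": g["genbank_ref"], "nonhuman_homologs": h})
    return result


hits = genes_with_nonhuman_homologs("22")
```

The point of the sketch is the nesting: each result row carries a set of homolog records, which a flat SQL join cannot return directly.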
Data Warehousing
Motivation
efficiency
availability
“denial of service”
data cleansing
Requirements
efficient to query
easy to update
model data naturally
{(#uid: 6138971,
#title: "Homo sapiens adrenergic ...",
#accession: "NM_001619",
#organism: "Homo sapiens",
#taxon: 9606,
#lineage: ["Eukaryota", "Metazoa", …],
#seq: "CTCGGCCTCGGGCGCGGC...",
#feature: {
(#name: "source",
#continuous: true,
#position: [
(#accn: "NM_001619",
#start: 0, #end: 3602,
#negative: false)],
#anno: [
(#anno_name: "organism",
#descr: "Homo sapiens"), …] ), …)}
Data Warehousing Results
A relational DBMS alone is insufficient because it forces us to fragment data into 3NF.
Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use a flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store.
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP
  (#uid: "NUMBER", #detail: "LONG")
  using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP
from aa-get-seqfeat-general "PTP" as x
using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title
from GP as x
where x.#uid = 131470;
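The idea of keeping a whole nested record in one column of a flat table, keyed by a simple identifier, can be sketched in Python. This is a swapped-in illustration using SQLite and JSON, not Kleisli's actual storage encoding; the table layout mirrors the (#uid, #detail) pair above, and the sample report content is invented.

```python
import json
import sqlite3

# An in-memory flat relational store with one key column and one blob-like column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")

# A nested record (invented content) that would not fit one flat row.
report = {
    "uid": 131470,
    "title": "hypothetical protein report",
    "feature": [{"name": "source", "position": [{"start": 0, "end": 10}]}],
}


def store(connection, obj):
    """Serialise the whole nested object into the detail column."""
    connection.execute(
        "INSERT OR REPLACE INTO GP (uid, detail) VALUES (?, ?)",
        (obj["uid"], json.dumps(obj)),
    )


def fetch(connection, uid):
    """Look up by uid and rebuild the nested object, or return None."""
    row = connection.execute(
        "SELECT detail FROM GP WHERE uid = ?", (uid,)
    ).fetchone()
    return json.loads(row[0]) if row else None


store(conn, report)
title = fetch(conn, 131470)["title"]
```

Updating a record is a single-row replace, and the relational engine still provides indexing on the key.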
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
• Prediction by our ANN model for HLA-A11
  • 29 predictions
  • 22 epitopes
  • 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
  [Chart: number of experimental binders by BIMAS rank; rank axis marked 1, 66, 100; binder counts 19 (52.8%), 5 (13.9%), 12 (33.3%)]
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
age   sex   chol   ecg    heart   sick
49    M     266    Hyp    171     N
64    M     211    Norm   144     N
58    F     283    Hyp    162     N
58    M     284    Hyp    160     Y
58    M     224    Abn    173     Y
• Looking for patterns that are
  • valid
  • novel
  • useful
  • understandable
Gene Expression Analysis
• Classifying gene expression profiles
  • find stable differentially expressed genes
  • find significant gene groups
  • derive coordinated gene expression
Medical Record & Gene Expression Analysis Results
• PCL, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works well for gene expression data
Cancer Cell, March 2002, 1(2)
Protein Interaction Extraction
“What are the protein-protein interaction pathways
from the latest reported discoveries?”
Protein Interaction Extraction Results
• Rule-based system for processing free texts in scientific abstracts
• Specialized in
  • extracting protein names
  • extracting protein-protein interactions
Behind the Scene
• Vladimir Bajic
• Vladimir Brusic
• Jinyan Li
• See-Kiong Ng
• Limsoon Wong
• Louxin Zhang
• Allen Chong
• Judice Koh
• SPT Krishnan
• Huiqing Liu
• Seng Hong Seah
• Soon Heng Tan
• Guanglan Zhang
• Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators….
A More Detailed Account
What is Datamining?
Jonathan’s blocks
Jessica’s blocks
Whose block is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
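Written as code, the two rules form a tiny classifier. The sketch below assumes each block is described by a colour and a shape given as lowercase strings; the function name and that encoding are illustrative choices, not part of the slides.

```python
def owner(colour, shape):
    """Apply the rules: blue or circle means Jonathan, everything else Jessica."""
    if colour == "blue" or shape == "circle":
        return "Jonathan"
    return "Jessica"


# A red square matches neither of Jonathan's conditions.
unknown_block_owner = owner("red", "square")
```

Data mining is the reverse task: given only labelled blocks, recover a rule like this one.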
What is Datamining?
Question: Can you explain how?
The Steps of Data Mining
• Training data gathering
• Signal generation
  • k-grams, colour, texture, domain know-how, ...
• Signal selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...
Translation Initiation Recognition
A Sample mRNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation
initiation site?
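Locating the candidate start codons is the mechanical part of the problem. The following sketch scans the first 120 nucleotides of the sample above for ATG and checks whether the first ATG lies in the same reading frame as the second; the helper name is illustrative. It only finds candidates and does not answer the question of why the second one is used.

```python
# First 120 nucleotides of the sample mRNA shown above (HSU27655.1).
MRNA_PREFIX = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
)


def find_atgs(seq):
    """Return the 0-based start index of every ATG in seq."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


atg_positions = find_atgs(MRNA_PREFIX)
first_atg, second_atg = atg_positions[:2]

# An offset of 0 would mean both ATGs share one reading frame.
frame_offset = (second_atg - first_atg) % 3
```

A non-zero offset means the first ATG is out of frame with the annotated coding region, which is the kind of positional evidence the later signal-generation slides turn into features.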
Signal Generation
• K-grams (i.e., k consecutive letters)
  • K = 1, 2, 3, 4, 5, …
  • Window size vs. fixed position
  • Upstream, downstream vs. anywhere in window
  • In-frame vs. any frame
[Chart: values for seq1, seq2, seq3 over the letters A, C, G, T; vertical axis 0 to 3]
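One possible way to turn these choices into features is sketched below: count the in-frame k-grams in a fixed window upstream and downstream of a candidate ATG. The function names, the window convention, and the short example sequence are assumptions made for illustration, not the encoding used in the talk.

```python
from collections import Counter


def kgrams(seq, k, start, end, in_frame_with=None):
    """Yield k-grams starting in seq[start:end].

    If in_frame_with is given, keep only k-grams whose start index is a
    multiple of 3 away from that position (same reading frame).
    """
    for i in range(max(start, 0), min(end, len(seq) - k + 1)):
        if in_frame_with is None or (i - in_frame_with) % 3 == 0:
            yield seq[i:i + k]


def signals(seq, atg, k, window):
    """Count in-frame k-grams upstream and downstream of the ATG at index atg."""
    up = Counter(kgrams(seq, k, atg - window, atg, in_frame_with=atg))
    down = Counter(kgrams(seq, k, atg + 3, atg + 3 + window, in_frame_with=atg))
    features = {("up", gram): n for gram, n in up.items()}
    features.update({("down", gram): n for gram, n in down.items()})
    return features


# Invented 15-letter example with the candidate ATG at index 6.
example = "GCCACCATGGCGTAA"
example_features = signals(example, atg=6, k=3, window=6)
```

Each (side, k-gram) pair becomes one numeric feature, so the feature count grows quickly with k.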
Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
• This is too many for most machine learning algorithms
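The growth can be checked by evaluating the per-k formula directly. The sketch below computes 4^k * 3 * 2 for k = 1 to 5. The slide's sum also includes a leading term of 4, whose origin is not stated; it is kept as a separate, labelled constant rather than explained.

```python
def kgram_feature_count(k):
    """Number of k-gram features for one k, per the formula 4^k * 3 * 2."""
    return 4 ** k * 3 * 2


per_k = {k: kgram_feature_count(k) for k in range(1, 6)}
formula_total = sum(per_k.values())

# The slide's sum lists an extra leading 4 (source not stated on the slide).
EXTRA_TERM_ON_SLIDE = 4
slide_total = formula_total + EXTRA_TERM_ON_SLIDE
```

Either way the count is in the thousands after only five values of k, which motivates the selection step that follows.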
Signal Selection (e.g., χ²)
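Assuming the statistic named here is Pearson's chi-square test, a feature can be scored by how strongly its presence or absence is associated with the class label. The sketch below computes the statistic for a contingency table of counts; the function name and the example counts are invented, and it assumes no row or column total is zero.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if feature and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: feature present / absent. Columns: positive / negative class.
associated = chi_square([[30, 10], [10, 30]])
independent = chi_square([[20, 20], [20, 20]])
```

Features are then ranked by the statistic and only the highest-scoring ones are kept.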
Sample k-grams Selected
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream
  • TAA, TAG, TGA (stop codon)
  • CTG, GAC, GAG, and GCC (codon bias)
Signal Integration
• kNN
  Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM
  Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes.
• Naïve Bayes, ANN, C4.5, ...
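The kNN description translates almost directly into code. The sketch below is a minimal version that assumes numeric feature vectors and squared Euclidean distance as the similarity measure; both are illustrative choices, and the training points are invented.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for query from train, a list of (vector, label) pairs."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Find the k training samples most similar (closest) to the query.
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    # Let the majority class among them win.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
    ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos"),
]
near_pos = knn_predict(train, (5, 4), k=3)
near_neg = knn_predict(train, (1, 1), k=3)
```

With two classes, an odd k avoids tied votes.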
Results (on Pedersen & Nielsen’s mRNA)
                 TP/(TP+FN)   TN/(TN+FP)   TP/(TP+FP)   Accuracy
Naïve Bayes      84.3%        86.1%        66.3%        85.7%
SVM              73.9%        93.2%        77.9%        88.5%
Neural Network   77.6%        93.2%        78.8%        89.4%
Decision Tree    74.0%        94.4%        81.1%        89.4%
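The four columns are standard ratios over the confusion-matrix counts: TP/(TP+FN) is sensitivity, TN/(TN+FP) is specificity, TP/(TP+FP) is precision, and accuracy is the fraction of all samples classified correctly. A small sketch of the computation follows; the counts in the example are invented and are not the counts behind the table.

```python
def metrics(tp, fn, tn, fp):
    """Compute the four evaluation ratios from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP+FN)
        "specificity": tn / (tn + fp),   # TN/(TN+FP)
        "precision": tp / (tp + fp),     # TP/(TP+FP)
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Invented counts: 100 true sites and 300 non-sites.
example_metrics = metrics(tp=80, fn=20, tn=270, fp=30)
```

Reporting all four matters when the classes are unbalanced, since accuracy alone can look high for a classifier that misses many true sites.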
Acknowledgements
• Roland Yap
• Zeng Fanfan
• A.G. Pedersen
• H. Nielsen
Questions?