From Informatics to Bioinformatics

Limsoon Wong
Kent Ridge Digital Labs
Singapore
Show & Tell
What is Bioinformatics?
What are the Themes of Bioinformatics?
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
What are the Benefits of Bioinformatics?

To the patient:
• Better drug, better treatment
To the pharma:
• Save time, save cost, make more $
To the scientist:
• Better science
Data Integration

A DOE “impossible query”:
For each gene on a given cytogenetic band,
find its non-human homologs.
source   type     location    remarks
GDB      Sybase   Baltimore   Flat tables; SQL joins; Location info
Entrez   ASN.1    Bethesda    Nested tables; Keywords; Homolog info
Data Integration Results
• Using Kleisli:
  • Clear
  • Succinct
  • Efficient
  • Handles
    • heterogeneity
    • complexity
sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
  #accn: g.#genbank_ref, #nonhuman-homologs: H
from
  L as c, E as g,
  (select u
   from g.#genbank_ref.na-get-homolog-summary as u
   where not(u.#title string-islike "%Human%") andalso
         not(u.#title string-islike "%H.sapien%")) as H
where
  c.#chrom_num = "22" andalso
  g.#object_id = c.#locus_id andalso
  not (H = { });
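
For readers unfamiliar with Kleisli's query language, the logic of the query above can be sketched in plain Python. This is only an illustration of the join-and-filter structure, not how Kleisli executes it. The field names follow the query; every row, accession value, and helper function below is hypothetical, and the substring test is a rough stand-in for string-islike.

```python
# Hypothetical in-memory stand-ins for the two sources. All rows are made up.
LOCUS_CYTO_LOCATION = [  # plays the role of view L (from GDB)
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "22"},
    {"locus_id": 3, "chrom_num": "7"},
]
OBJECT_GENBANK_EREF = [  # plays the role of view E (from GDB)
    {"object_id": 1, "genbank_ref": "ACC0001"},
    {"object_id": 2, "genbank_ref": "ACC0002"},
    {"object_id": 3, "genbank_ref": "ACC0003"},
]
HOMOLOG_SUMMARIES = {  # plays the role of the Entrez homolog lookup
    "ACC0001": [{"title": "Human example gene"}, {"title": "Mouse example gene"}],
    "ACC0002": [{"title": "H.sapiens example mRNA"}],
    "ACC0003": [{"title": "Rat example gene"}],
}


def get_homolog_summary(accession):
    """Hypothetical stand-in for na-get-homolog-summary."""
    return HOMOLOG_SUMMARIES.get(accession, [])


def is_human(title):
    """Approximates the two string-islike filters with substring tests."""
    return "Human" in title or "H.sapien" in title


def nonhuman_homologs(chrom_num):
    """For each locus on the chromosome, collect its non-human homologs."""
    results = []
    for c in LOCUS_CYTO_LOCATION:
        if c["chrom_num"] != chrom_num:
            continue
        for g in OBJECT_GENBANK_EREF:
            if g["object_id"] != c["locus_id"]:
                continue
            homologs = [u for u in get_homolog_summary(g["genbank_ref"])
                        if not is_human(u["title"])]
            if homologs:  # mirrors "not (H = { })"
                results.append({"accn": g["genbank_ref"],
                                "nonhuman_homologs": homologs})
    return results


print(nonhuman_homologs("22"))
```

Note that the result is nested: each accession carries its own set of homolog records, which is the shape the Kleisli query returns directly.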
Data Warehousing

Motivation
• efficiency
• availability
• “denial of service”
• data cleansing
Requirements
• efficient to query
• easy to update
• model data naturally
{(#uid: 6138971,
  #title: "Homo sapiens adrenergic ...",
  #accession: "NM_001619",
  #organism: "Homo sapiens",
  #taxon: 9606,
  #lineage: ["Eukaryota", "Metazoa", …],
  #seq: "CTCGGCCTCGGGCGCGGC...",
  #feature: {
    (#name: "source",
     #continuous: true,
     #position: [
       (#accn: "NM_001619",
        #start: 0, #end: 3602,
        #negative: false)],
     #anno: [
       (#anno_name: "organism",
        #descr: "Homo sapiens"), …] ), …)}
Data Warehousing Results

• Relational DBMS is insufficient because it forces us to fragment data into 3NF.
• Kleisli turns flat relational DBMS into nested relational DBMS. It can use flat relational DBMS such as Sybase, Oracle, MySQL, etc. to be its updatable complex object store. It can even use all of these systems simultaneously!
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP
  (#uid: "NUMBER", #detail: "LONG")
  using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP
from aa-get-seqfeat-general "PTP" as x
using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title
from GP as x
where x.#uid = 131470;
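
The general idea of keeping a whole nested object in one column of a flat table can be sketched with Python's standard sqlite3 and json modules. This is an assumption-laden illustration, not Kleisli's actual storage encoding: SQLite stands in for Oracle, JSON stands in for whatever serialization Kleisli uses, and the record contents are hypothetical.

```python
import json
import sqlite3


def make_store():
    """Create a flat two-column table, mirroring GP (#uid, #detail)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
    return conn


def put(conn, record):
    """Store the whole nested record, serialized, in the detail column."""
    conn.execute(
        "INSERT OR REPLACE INTO GP (uid, detail) VALUES (?, ?)",
        (record["uid"], json.dumps(record)),
    )


def get_title(conn, uid):
    """Fetch one record by uid and project a field out of the nested value."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return None if row is None else json.loads(row[0])["title"]


conn = make_store()
# Hypothetical record: only the uid comes from the slide; the rest is made up.
put(conn, {
    "uid": 131470,
    "title": "hypothetical report title",
    "feature": [
        {"name": "source",
         "position": [{"start": 0, "end": 10, "negative": False}]},
    ],
})
print(get_title(conn, 131470))
```

The nested value never has to be split across several normalized tables, which is the point the slide makes about avoiding fragmentation into 3NF.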
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results

• Prediction by our ANN model for HLA-A11
  • 29 predictions
  • 22 epitopes
  • 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
  [Figure: number of experimental binders by BIMAS rank; axis marks at ranks 1, 66, 100; values 19 (52.8%), 5 (13.9%), 12 (33.3%)]
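
The percentages on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the 76% figure is the share of the 29 ANN predictions that turned out to be epitopes, and that the three BIMAS counts are shares of one common total; neither reading is stated on the slide.

```python
def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# ANN model: 22 epitopes among 29 predictions.
ann_rate = percent(22, 29)

# BIMAS: three counts of experimental binders taken from the figure.
bimas_counts = [19, 5, 12]
total = sum(bimas_counts)
shares = [percent(n, total) for n in bimas_counts]

print(ann_rate)  # 75.9, which the slide rounds to 76%
print(total)     # 36
print(shares)    # [52.8, 13.9, 33.3]
```

Under these assumptions the three BIMAS counts add up to 36 experimental binders, and each quoted percentage is consistent with that total.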
Gene Expression Analysis

• Clustering gene expression profiles
• Classifying gene expression profiles
  • find stable differentially expressed genes
Gene Expression Analysis Results
The Discovery System
• Correlation test
• Voter selection
• Class prediction
Protein Interaction Extraction
“What are the protein-protein interaction pathways
from the latest reported discoveries?”
Protein Interaction Extraction Results
• Rule-based system for processing free texts in scientific abstracts
• Specialized in
  • extracting protein names
  • extracting protein-protein interactions
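
To give a feel for what a rule-based extractor does, here is a toy sketch built on a single regular expression. The slides do not show the system's actual rules, so the name pattern, the verb list, and the sample sentences (with invented protein names) are all hypothetical; a real system would need far richer rules.

```python
import re

# Hypothetical name rule: a capitalized token that contains a digit.
NAME = r"[A-Z][A-Za-z]*\d+[A-Za-z0-9]*"
# Hypothetical interaction verbs; longer alternatives come first.
VERBS = r"interacts with|binds to|binds|phosphorylates|activates|inhibits"

RULE = re.compile(rf"\b({NAME})\s+({VERBS})\s+({NAME})\b")


def extract_interactions(text):
    """Return (protein, relation, protein) triples matched by the rule."""
    return RULE.findall(text)


# Made-up abstract text with invented protein names.
abstract = ("We found that Abc1 interacts with Xyz2 in yeast. "
            "In contrast, Abc1 inhibits Def3.")
print(extract_interactions(abstract))
# [('Abc1', 'interacts with', 'Xyz2'), ('Abc1', 'inhibits', 'Def3')]
```

Chaining such triples across many abstracts is one way the pathway question posed on the previous slide could be approached.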
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y

Looking for patterns that are
• valid
• novel
• useful
• understandable
Medical Record Analysis Results
• DeEPs, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works for gene expression data
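
An emerging pattern is a combination of attribute values whose support rises sharply from one class to another. The sketch below computes support and growth rate on the five sample records from the Medical Record Analysis slide. It illustrates the concept only and is not the DeEPs algorithm; the function names and the choice of example patterns are assumptions.

```python
# The five sample records from the Medical Record Analysis slide.
RECORDS = [
    {"age": 49, "sex": "M", "chol": 266, "ecg": "Hyp", "heart": 171, "sick": "N"},
    {"age": 64, "sex": "M", "chol": 211, "ecg": "Norm", "heart": 144, "sick": "N"},
    {"age": 58, "sex": "F", "chol": 283, "ecg": "Hyp", "heart": 162, "sick": "N"},
    {"age": 58, "sex": "M", "chol": 284, "ecg": "Hyp", "heart": 160, "sick": "Y"},
    {"age": 58, "sex": "M", "chol": 224, "ecg": "Abn", "heart": 173, "sick": "Y"},
]


def support(pattern, rows):
    """Fraction of rows that match every attribute=value pair in pattern."""
    if not rows:
        return 0.0
    hits = sum(all(r[k] == v for k, v in pattern.items()) for r in rows)
    return hits / len(rows)


def growth_rate(pattern, rows, label="sick", source="N", target="Y"):
    """Support in the target class divided by support in the source class."""
    s_src = support(pattern, [r for r in rows if r[label] == source])
    s_tgt = support(pattern, [r for r in rows if r[label] == target])
    if s_src == 0.0:
        return float("inf") if s_tgt > 0.0 else 0.0
    return s_tgt / s_src


# {age=58, sex=M} matches both sick patients and no healthy one.
print(growth_rate({"age": 58, "sex": "M"}, RECORDS))  # inf
# {ecg=Hyp} is slightly less common among the sick, so it is not emerging.
print(growth_rate({"ecg": "Hyp"}, RECORDS))
```

With only five records these numbers mean little statistically; the example is there to show what a pattern, its support, and its growth rate are.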
Behind the Scene

Research
• Vladimir Bajic
• Vladimir Brusic
• Jinyan Li
• See-Kiong Ng
• Limsoon Wong
• Louxin Zhang
Business
• Peter Saunders
Industry Assignees
• Hao Han (gX)
• Rahul Despande (MC)
Engineering
• Allen Chong
• Judice Koh
• SPT Krishnan
• Seng Hong Seah
• Guanglan Zhang
• Zhuo Zhang
Students
• Huiqing Liu
• Song Zhu
• Kun Yu