From Informatics to Bioinformatics Limsoon Wong Institute for Infocomm Research


From Informatics to Bioinformatics

Limsoon Wong

Institute for Infocomm Research

Singapore

What is Bioinformatics?

Themes of Bioinformatics

Bioinformatics = Data Mgmt + Knowledge Discovery

Data Mgmt = Integration + Transformation + Cleansing

Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics

To the patient: Better drug, better treatment

To the pharma: Save time, save cost, make more $

To the scientist: Better science

From Informatics to Bioinformatics

8 years of bioinformatics R&D in Singapore

Projects:
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Cleansing & Warehousing (FIMM)
• Gene Expression & Medical Record Datamining (PCL)
• Gene Feature Recognition (Dragon)
• Integration Technology (Kleisli)
• Venom Informatics

Timeline: 1994, 1996, 1998, 2000, 2002
Host institutes, in order: ISS, KRDL, LIT/I2R

Data Integration

A DOE “impossible query”:

For each gene on a given cytogenetic band, find its non-human homologs.

source   type     location    remarks
GDB      Sybase   Baltimore   Flat tables; SQL joins; Location info
Entrez   ASN.1    Bethesda    Nested tables; Keywords; Homolog info

Data Integration Results

Using Kleisli:
• Clear
• Succinct
• Efficient
• Handles heterogeneity and complexity

sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
  #accn: g.#genbank_ref,
  #nonhuman-homologs: H
from
  L as c,
  E as g,
  { select u
    from g.#genbank_ref.na-get-homolog-summary as u
    where not(u.#title string-islike "%Human%")
    andalso not(u.#title string-islike "%H.sapien%") } as H
where c.#chrom_num = "22"
andalso g.#object_id = c.#locus_id
andalso not (H = { });
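The shape of this query (a flat join followed by a nested lookup per gene) may be easier to see in a general-purpose language. The following is only a rough Python sketch, not Kleisli itself: the three in-memory collections stand in for the GDB tables and the Entrez homolog lookup, and every value in them is made up for illustration.

```python
# Toy stand-ins for the two sources; all values are hypothetical.
locus_cyto_location = [  # GDB-like flat table: locus -> chromosome
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "7"},
]
object_genbank_eref = [  # GDB-like flat table: locus -> GenBank accession
    {"object_id": 1, "genbank_ref": "ACC0001"},
    {"object_id": 2, "genbank_ref": "ACC0002"},
]
homolog_summaries = {  # Entrez-like nested lookup: accession -> homolog records
    "ACC0001": [{"title": "Human example gene"}, {"title": "Mouse example gene"}],
    "ACC0002": [{"title": "Rat example gene"}],
}


def nonhuman_homologs(chrom):
    """For each gene on the given chromosome, collect its non-human homologs."""
    results = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested sub-query: keep homolog records whose title does not mention human.
            hits = [
                u for u in homolog_summaries.get(g["genbank_ref"], [])
                if "Human" not in u["title"] and "H.sapien" not in u["title"]
            ]
            if hits:  # mirrors "not (H = { })"
                results.append({"accn": g["genbank_ref"], "nonhuman_homologs": hits})
    return results


print(nonhuman_homologs("22"))
```

The result is itself nested (each accession carries a list of homolog records), which is the kind of output a flat relational join cannot return directly.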

Data Warehousing

Motivation
• efficiency
• availability
• “denial of service”
• data cleansing

Requirements
• efficient to query
• easy to update
• model data naturally

{(#uid: 6138971,

#title: "Homo sapiens adrenergic ...",

#accession: "NM_001619",

#organism: "Homo sapiens",

#taxon: 9606,

#lineage: ["Eukaryota", "Metazoa", …],

#seq: "CTCGGCCTCGGGCGCGGC...",

#feature: {

(#name: "source",

#continuous: true,

#position: [

(#accn: "NM_001619",

#start: 0, #end: 3602,

#negative: false)],

#anno: [

(#anno_name: "organism",

#descr: "Homo sapiens"), …] ), …)}

Data Warehousing Results

Relational DBMS is insufficient because it forces us to fragment data into 3NF.

Kleisli turns flat relational DBMS into nested relational DBMS.

It can use flat relational DBMS such as Sybase, Oracle, MySQL, etc. to be its update-able complex object store.

! Log in
oracle-cplobj-add (#name: "db", ...);

! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;

! Populate table with GenPept reports
select #uid: x.#uid, #detail: x
into GP
from aa-get-seqfeat-general "PTP" as x
using db;

! Map GP to that table
create view GP from GP using db;

! Run a query to get title of 131470
select x.#detail.#title from GP as x where x.#uid = 131470;

Epitope Prediction

TRAP-559AA

MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE

EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN

LNDNAIHLY VNVFSNNAK EIIRLHSDASKNKEKALIIIRS

LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL

TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR

FLVGCHPSDGKCNLYADSAWENV KNVIGPFMKAVCVEVEK

TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ

CEEERCPPKWEPLDVPDEPEDDQPRP RGDNSSVQK PEENI

IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ

KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN

QNNLPNDKSDRN IPYSPLPPK VLDNERKQSDPQSQDNNGN

RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE

KPDNNKKKGESDNKYKIAGGIAGGLAL LACAGLAYK FVVP

GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results

• Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101

[Chart: number of experimental binders by BIMAS rank (axis marks 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis

age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y

Looking for patterns that are
• valid
• novel
• useful
• understandable

Gene Expression Analysis

• Classifying gene expression profiles
• find stable differentially expressed genes
• find significant gene groups
• derive coordinated gene expression

Medical Record & Gene Expression Analysis Results

• PCL, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works well for gene expressions

Cancer Cell, March 2002, 1(2)

Protein Interaction Extraction

“What are the protein-protein interaction pathways from the latest reported discoveries?”

Protein Interaction Extraction Results

• Rule-based system for processing free texts in scientific abstracts
• Specialized in
  • extracting protein names
  • extracting protein-protein interactions

Behind the Scene

Vladimir Bajic

Vladimir Brusic

Jinyan Li

See-Kiong Ng

Limsoon Wong

Louxin Zhang

• Allen Chong
• Judice Koh
• SPT Krishnan
• Huiqing Liu
• Seng Hong Seah
• Soon Heng Tan
• Guanglan Zhang
• Zhuo Zhang

and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators.

Using Feature Generation & Feature Selection for Accurate Prediction of Translation Initiation Sites

A more detailed example of post-genome knowledge discovery

Translation Initiation Recognition

A Sample cDNA

299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCC ATG GCTGAACACTGACTCCCAGCTGTG 80

CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGC ATG GCTTTTGGCTGTCAGGGCAGCTGTA 160

GGAGGCAG ATG AGAAGAGGGAG ATG GCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240

CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

............................................................ 80

................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

What makes the second ATG the translation initiation site?
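Before any ATG can be classified, the candidate sites have to be enumerated. A minimal Python sketch of that first step follows; the function name and the toy sequence are illustrative assumptions, not part of the original work.

```python
def find_atg_sites(cdna):
    """Return the 0-based start position of every ATG in the sequence."""
    seq = cdna.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


toy_cdna = "CCATGGCATGA"  # hypothetical toy sequence, not the cDNA on the slide
print(find_atg_sites(toy_cdna))  # [2, 7]
```

Every position returned is only a candidate; deciding which one is the true translation initiation site is the classification problem the rest of this talk addresses.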

Approach

• Training data gathering
• Signal generation: k-grams, distance, domain know-how, ...
• Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

Training & Testing Data

• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Use for 3-fold x-validation expts
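A 3-fold cross-validation over these ATG sites could be set up roughly as below. This is a sketch under two assumptions that the slides do not confirm: the split is stratified by class, and it is made per ATG site rather than per sequence. The function names are illustrative.

```python
import random


def stratified_folds(labels, n_folds=3, seed=0):
    """Assign sample indices to folds so each fold keeps a similar class mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # deal indices out round-robin
    return folds


def cross_validation_splits(labels, n_folds=3, seed=0):
    """Yield (train_indices, test_indices), one pair per fold."""
    folds = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Class sizes taken from the slide: 3312 TIS (label 1) and 10191 non-TIS (label 0).
labels = [1] * 3312 + [0] * 10191
for train, test in cross_validation_splits(labels):
    print(len(train), len(test), sum(labels[i] for i in test))
```

With these class sizes each test fold holds 1104 TIS and 3397 non-TIS sites, so the roughly 1:3 class ratio is preserved in every fold.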

Signal Generation

• K-grams (i.e., k consecutive letters)
  • K = 1, 2, 3, 4, 5, …
  • Window size vs. fixed position
  • Up-stream, downstream vs. anywhere in window
  • In-frame vs. any frame

[Chart: A, C, G, T values for seq1, seq2, seq3; vertical axis 0 to 3]
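One way to turn these choices into features is to count each k-gram separately upstream and downstream of a candidate ATG, once in any frame and once restricted to the ATG's reading frame. The Python sketch below does that; the window of 99 bases and the feature-key layout are assumptions made for illustration, not values taken from the slides.

```python
from collections import Counter


def kgram_features(seq, atg_pos, k, window=99):
    """Count k-grams around a candidate ATG, by region (up/down) and frame."""
    feats = Counter()
    lo = max(0, atg_pos - window)
    hi = min(len(seq), atg_pos + 3 + window)
    for p in range(lo, hi - k + 1):
        if p + k <= atg_pos:
            region = "up"
        elif p >= atg_pos + 3:
            region = "down"
        else:
            continue  # this k-gram overlaps the candidate ATG itself
        gram = seq[p:p + k]
        feats[(region, "any-frame", gram)] += 1
        if (p - atg_pos) % 3 == 0:  # starts on a codon boundary relative to the ATG
            feats[(region, "in-frame", gram)] += 1
    return feats


feats = kgram_features("GGCATGGCTTAA", atg_pos=3, k=3)
print(feats[("down", "in-frame", "TAA")])  # 1: an in-frame downstream stop codon
```

A feature such as the in-frame downstream count of TAA captures domain knowledge (a stop codon shortly after a candidate ATG) without anyone having to code that rule by hand.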

Too Many Signals

• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
• This is too many for most machine learning algorithms
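The feature count can be checked with a few lines of Python. The sketch assumes the per-k formula is 4^k * 3 * 2 (the exponent was lost in extraction); reading the factors 3 and 2 as region and frame variants is a guess.

```python
def kgram_feature_count(k):
    """Number of k-gram features for one k: 4**k k-grams times 3 * 2 variants."""
    return 4 ** k * 3 * 2


counts = [kgram_feature_count(k) for k in range(1, 6)]
print(counts, sum(counts))  # [24, 96, 384, 1536, 6144] 8184
```

The formula alone gives 8184 for k = 1 to 5. The slide's total of 8188 includes an additional leading term of 4, which the slide does not explain.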

Signal Selection (Basic Idea)

• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
• Which of the following 3 signals is good?

Signal Selection

(eg., t-statistics)
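A t-statistic puts the "low intra-class, high inter-class distance" idea into a single score per signal: the gap between the class means, divided by a measure of the spread within the classes. The sketch below uses the unequal-variance (Welch) form; which exact variant was used in the original work is not stated, so treat that choice as an assumption.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(pos_values, neg_values):
    """Absolute Welch t-statistic of one signal between two classes.

    Each class needs at least two values, and the two classes must not both
    have zero variance (the denominator would be zero).
    """
    m1, m2 = mean(pos_values), mean(neg_values)
    v1, v2 = variance(pos_values), variance(neg_values)
    return abs(m1 - m2) / sqrt(v1 / len(pos_values) + v2 / len(neg_values))


# Hypothetical signal values for TIS and non-TIS samples.
well_separated = t_statistic([5, 6, 7], [1, 2, 3])
overlapping = t_statistic([1, 5, 9], [2, 4, 8])
print(well_separated > overlapping)  # True
```

Signals would then be ranked by this score and only the top-ranked ones kept.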

Signal Selection (e.g., CFS)

• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS: A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
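The usual way to score a group in correlation-based feature selection is Hall's merit heuristic, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff is the mean signal-signal correlation. The slide does not spell the formula out, so the sketch below assumes this standard form; the correlation values are made up.

```python
from math import sqrt


def cfs_merit(class_corrs, mean_inter_corr):
    """CFS merit of a group of signals.

    class_corrs: correlation of each signal in the group with the class.
    mean_inter_corr: average pairwise correlation among the signals.
    """
    k = len(class_corrs)
    r_cf = sum(class_corrs) / k
    return k * r_cf / sqrt(k + k * (k - 1) * mean_inter_corr)


# Two equally predictive signals: independent vs. fully redundant.
print(cfs_merit([0.6, 0.6], 0.0))  # higher merit
print(cfs_merit([0.6, 0.6], 1.0))  # lower merit: the second signal adds nothing
```

The denominator grows with redundancy, so adding a signal that merely duplicates one already in the group lowers the merit instead of raising it.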

Sample k-grams Selected by CFS

• Position -3 (Kozak consensus)
• in-frame upstream ATG (Leaky scanning)
• in-frame downstream TAA, TAG, TGA (Stop codon)
• in-frame downstream CTG, GAC, GAG, and GCC (Codon bias?)

Signal Integration

• kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, ...
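The kNN rule is short enough to write out in full. The Python sketch below assumes numeric feature vectors and Euclidean distance as the similarity measure; both are illustrative choices, as are the toy training points.

```python
from collections import Counter


def knn_predict(train, test_vec, k=3):
    """Classify test_vec by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs.
    """
    def sq_dist(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, test_vec))

    nearest = sorted(train, key=lambda item: sq_dist(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-feature training set.
train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"),
    ((5, 5), "TIS"), ((6, 5), "TIS"), ((5, 6), "TIS"),
]
print(knn_predict(train, (5, 4)))  # TIS
```

Using an odd k with two classes avoids tied votes.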

Results (3-fold x-validation)

                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
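The four columns are not independent: given TP/(TP + FN), TN/(TN + FP) and the class sizes, the other two follow. The short Python check below applies this to the Naïve Bayes figures (84.3% and 86.1%) with the dataset's 3312 TIS and 10191 non-TIS sites. The function name is illustrative.

```python
def precision_and_accuracy(sensitivity, specificity, n_pos, n_neg):
    """Derive TP/(TP + FP) and accuracy from sensitivity, specificity and class sizes."""
    tp = sensitivity * n_pos   # true positives
    tn = specificity * n_neg   # true negatives
    fp = n_neg - tn            # false positives
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (n_pos + n_neg)
    return precision, accuracy


precision, accuracy = precision_and_accuracy(0.843, 0.861, 3312, 10191)
print(f"{precision:.1%} {accuracy:.1%}")  # 66.3% 85.7%
```

Because non-TIS sites outnumber TIS sites about 3 to 1, a specificity of 86.1% still yields many false positives, which is why TP/(TP + FP) is much lower than the other columns.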

Improvement by Voting

• Apply any 3 of Naïve Bayes, SVM, Neural Network, & Decision Tree. Decide by majority.

              TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB+SVM+NN     79.2%          92.1%          76.5%          88.9%
NB+SVM+Tree   78.8%          92.0%          76.2%          88.8%
NB+NN+Tree    77.6%          94.5%          82.1%          90.4%
SVM+NN+Tree   75.9%          94.3%          81.2%          89.8%
Best of 4     84.3%          94.4%          81.1%          89.4%
Worst of 4    73.9%          86.1%          66.3%          85.7%
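Majority voting over three classifiers needs only the per-sample predictions of each one. A minimal Python sketch follows, with made-up prediction vectors (1 = TIS, 0 = non-TIS).

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers for one sample."""
    return Counter(predictions).most_common(1)[0][0]


def vote_over_samples(per_classifier_predictions):
    """Combine several classifiers' prediction lists sample by sample."""
    return [majority_vote(column) for column in zip(*per_classifier_predictions)]


# Hypothetical outputs of three classifiers on four ATG sites.
nb = [1, 0, 1, 0]
svm = [1, 1, 0, 0]
nn = [0, 1, 1, 0]
print(vote_over_samples([nb, svm, nn]))  # [1, 1, 1, 0]
```

With three voters and two classes there is never a tie, which is presumably why groups of three are used.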

Improvement by Scanning

• Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS.
• Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG

               TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB             84.3%          86.1%          66.3%          85.7%
SVM            73.9%          93.2%          77.9%          88.5%
NB+Scanning    87.3%          96.1%          87.9%          93.9%
SVM+Scanning   88.5%          96.3%          88.6%          94.4%
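The scanning rule can be sketched in Python as a left-to-right loop that stops at the first ATG the classifier accepts. The classifier is passed in as a function; the one used in the example is a hypothetical stand-in (a purine at position -3), not the trained Naïve Bayes or SVM model.

```python
def scan_for_tis(seq, is_tis):
    """Return the position of the first ATG accepted by is_tis, or None."""
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "ATG" and is_tis(seq, i):
            return i
    return None


def toy_classifier(seq, i):
    """Hypothetical stand-in for a trained model: accept if position -3 is A or G."""
    return i >= 3 and seq[i - 3] in "AG"


print(scan_for_tis("CCTATGCCAGCATGG", toy_classifier))  # 11
```

Here the ATG at position 3 is rejected and the one at position 11 is returned. Every ATG after the first accepted one is implicitly labelled non-TIS, which is how scanning removes many downstream false positives.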

Performance Comparisons

                   TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB                 84.3%          86.1%          66.3%          85.7%
Decision Tree      74.0%          94.4%          81.1%          89.4%
NB+NN+Tree         77.6%          94.5%          82.1%          90.4%
SVM+Scanning       88.5%          96.3%          88.6%          94.4%*
Pedersen&Nielsen   78%            87%            -              85%
Zien               69.9%          94.1%          -              88.1%
Hatzigeorgiou      -              -              -              94%*

* result not directly comparable

Technique Comparisons

• Pedersen&Nielsen [ISMB’97]
  • Neural network
  • No explicit features
• Zien [Bioinformatics’00]
  • SVM+kernel engineering
  • No explicit features
• Hatzigeorgiou [Bioinformatics’02]
  • Multiple neural networks
  • Scanning rule
  • No explicit features
• Our approach
  • Explicit feature generation
  • Explicit feature selection
  • Use any machine learning method w/o any form of complicated tuning
  • Scanning rule is optional

Acknowledgements

• A.G. Pedersen
• H. Nielsen
• Roland Yap
• Fanfan Zeng
