For written notes, please read chapters 3, 4, and 7 of The Practical Bioinformatician, http://sdmc.i2r.a-star.edu.sg/~limsoon/psZ/practical-bioinformatician
Recognition of Gene Features
Limsoon Wong
Institute for Infocomm Research
BI6103 guest lecture on ?? February 2004
Lecture Plan
• experiment design, result interpretation
• central dogma
• recognition of translation initiation sites
• recognition of transcription start sites
• survey of some ANN-based systems for recognizing gene features
What is Accuracy?
What is Accuracy?
Accuracy = No. of correct predictions / No. of predictions = (TP + TN) / (TP + TN + FP + FN)
Examples (Balanced Population)

classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%
• Clearly, B, C, D are all better than A
• Is B better than C, D?
• Is C better than B, D?
• Is D better than B, C?
Accuracy may not tell the whole story
Examples (Unbalanced Population)

classifier   TP   TN   FP   FN   Accuracy
A            25   75   75   25   50%
B             0  150    0   50   75%
C            50    0  150    0   25%
D            30  100   50   20   65%
• Clearly, D is better than A
• Is B better than A, C, D?
High accuracy is meaningless if the population is unbalanced
What is Sensitivity (aka Recall)?
Sensitivity (wrt positives) = No. of correct positive predictions / No. of positives = TP / (TP + FN)
Sometimes sensitivity wrt negatives is termed specificity
What is Precision?
Precision (wrt positives) = No. of correct positive predictions / No. of positive predictions = TP / (TP + FP)
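These measures follow directly from the four confusion-matrix counts. Below is a minimal Python sketch (not from the original slides) that computes them, using classifier D from the balanced-population table as an example.

```python
# Minimal sketch: accuracy, sensitivity (recall), specificity, and precision
# computed from confusion-matrix counts.  Example counts are classifier D
# from the balanced-population table (TP=37, TN=37, FP=13, FN=13).

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):      # recall, i.e. sensitivity wrt positives
    return tp / (tp + fn)

def specificity(tn, fp):      # sensitivity wrt negatives
    return tn / (tn + fp)

def precision(tp, fp):
    return tp / (tp + fp)

tp, tn, fp, fn = 37, 37, 13, 13
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.0%}")   # 74%
print(f"sensitivity = {sensitivity(tp, fn):.0%}")        # 74%
print(f"specificity = {specificity(tn, fp):.0%}")        # 74%
print(f"precision   = {precision(tp, fp):.0%}")          # 74%
```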
Precision-Recall Trade-off
• A predicts better than B if A has better recall and precision than B
• There is a trade-off between recall and precision
• In some applications, once you reach a satisfactory precision, you optimize for recall
• In some applications, once you reach a satisfactory recall, you optimize for precision
Comparing Prediction Performance
• Accuracy is the obvious measure
– But it conveys the right intuition only when the positive and negative populations are roughly equal in size
• Recall and precision together form a better measure
– But what do you do when A has better recall than B and B has better precision than A?
Adjusted Accuracy
• Weigh by the importance of the classes
Adjusted accuracy = α * Sensitivity + β * Specificity, where α + β = 1; typically, α = β = 0.5

classifier   TP   TN   FP   FN   Accuracy   Adj Accuracy
A            25   75   75   25   50%        50%
B             0  150    0   50   75%        50%
C            50    0  150    0   25%        50%
D            30  100   50   20   65%        63%

But people can’t always agree on values for α and β
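As a quick check of the table above, here is a small sketch (assumed, not from the slides) of the adjusted-accuracy formula applied to classifier D of the unbalanced-population example.

```python
# Adjusted accuracy = alpha * sensitivity + beta * specificity, alpha + beta = 1.
# Classifier D (unbalanced population): TP=30, TN=100, FP=50, FN=20.

def adjusted_accuracy(tp, tn, fp, fn, alpha=0.5, beta=0.5):
    sens = tp / (tp + fn)     # 30/50   = 60%
    spec = tn / (tn + fp)     # 100/150 ≈ 67%
    return alpha * sens + beta * spec

print(f"{adjusted_accuracy(30, 100, 50, 20):.0%}")   # ~63%
```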
ROC Curves
• By varying the decision threshold, we get a range of sensitivities and specificities for a classifier
• A predicts better than B if A has better sensitivity than B at most specificity values
• This leads to the ROC curve, which plots sensitivity vs. (1 – specificity)
• The larger the area under the ROC curve, the better
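A minimal sketch (my own illustration, not from the lecture) of how an ROC curve and its area can be computed by sweeping the decision threshold over the classifier's scores; the score and label values below are made up.

```python
# Build ROC points (1 - specificity, sensitivity) by sweeping the threshold t,
# then estimate the area under the curve with the trapezoidal rule.

def roc_points(scores, labels):
    P = sum(labels)                # number of positives (labels are 1/0)
    N = len(labels) - P            # number of negatives
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return sorted(pts)

def auc(pts):
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # made-up classifier scores
labels = [1,   1,   0,   1,   0,    1,   0,   0]     # made-up true classes
print("AUC =", auc(roc_points(scores, labels)))
```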
What is Cross Validation?
Construction of a Classifier
Training samples
Test instance
Build Classifier
Apply Classifier
Classifier
Prediction
Estimate Accuracy: Wrong Way
Training samples
Build
Classifier
Apply
Classifier
Estimate
Accuracy
Classifier
Predictions
Accuracy
Why is this way of estimating accuracy wrong?
Recall ...
…the abstract model of a classifier
• Given a test sample S
• Compute scores p(S), n(S)
• Predict S as negative if p(S) < t * n(S)
• Predict S as positive if p(S) ≥ t * n(S)
t is the decision threshold of the classifier
K-Nearest Neighbour Classifier (k-NN)
• Given a sample S, find the k observations S_i in the known data that are “closest” to it, and average their responses
• Assume S is well approximated by its neighbours:
p(S) = number of samples in N_k(S) that are in D_P (the positives)
n(S) = number of samples in N_k(S) that are in D_N (the negatives)
where N_k(S) is the neighbourhood of S defined by the k nearest samples to it
• Assume the distance between samples is Euclidean distance for now
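A minimal sketch (assumed, not the lecture's own code) of these k-NN scores: p(S) and n(S) simply count how many of the k nearest training samples, under Euclidean distance, are positive and negative.

```python
import math

# k-NN scores: p(S) = positives among the k nearest neighbours, n(S) = negatives.
def knn_scores(sample, training, k=3):
    """training: list of (feature_vector, label) pairs, label 'pos' or 'neg'."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(training, key=lambda item: dist(sample, item[0]))[:k]
    p = sum(1 for _, label in neighbours if label == "pos")
    return p, k - p

training = [((1.0, 1.0), "pos"), ((1.2, 0.9), "pos"), ((1.1, 1.2), "pos"),
            ((0.0, 0.1), "neg"), ((0.2, 0.0), "neg")]
p, n = knn_scores((0.9, 1.0), training, k=3)
print("p(S) =", p, "n(S) =", n, "->", "positive" if p >= n else "negative")
```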
Estimate Accuracy: Wrong Way
Training samples
Build
1-NN
1-NN
Apply
1-NN
Predictions
Estimate
Accuracy
100%
Accuracy
For sure k-NN (k = 1) has 100% accuracy in the
“accuracy estimation” procedure above. But does this accuracy generalize to new test instances?
Estimate Accuracy: Right Way
Training samples
Testing samples
Build
Classifier
Apply
Classifier
Estimate
Accuracy
Classifier
Predictions
Accuracy
Testing samples are NOT to be used during “Build Classifier”
How Many Training and Testing
Samples?
• No fixed ratio between training and testing samples; but typically 2:1 ratio
• Proportion of instances of different classes in testing samples should be similar to proportion in training samples
• What if there are insufficient samples to reserve 1/3 for testing?
• Ans: Cross validation
Cross Validation
1.Test 2.Train 3.Train 4.Train 5.Train
1.Train 2.Test 3.Train 4.Train 5.Train
1.Train 2.Train 3.Test 4.Train 5.Train
1.Train 2.Train 3.Train 4.Test 5.Train
1.Train 2.Train 3.Train 4.Train 5.Test
• Divide samples into k roughly equal parts
• Each part has similar proportion of samples from different classes
• In turn, use each part for testing and the remaining parts for training
• Total up the accuracy over all the parts
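A minimal sketch (assumed) of the procedure just described; `train` and `evaluate` stand for whatever classifier-building and scoring routines are being used, and the toy demo at the end uses a trivial majority-class predictor.

```python
# k-fold cross validation: hold out each part in turn, train on the rest,
# and average the per-fold accuracies.  The simple interleaved split below
# does NOT guarantee similar class proportions per part (stratification),
# which the real procedure should also ensure.

def cross_validate(samples, labels, k, train, evaluate):
    n = len(samples)
    accuracies = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        accuracies.append(evaluate(model,
                                   [samples[i] for i in test_idx],
                                   [labels[i] for i in test_idx]))
    return sum(accuracies) / k

# Toy demo: a "classifier" that always predicts the majority training label.
train = lambda xs, ys: max(set(ys), key=ys.count)
evaluate = lambda model, xs, ys: sum(1 for y in ys if y == model) / len(ys)
print(cross_validate(list(range(12)), [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
                     k=3, train=train, evaluate=evaluate))
```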
How Many Folds?
• If the samples are divided into k parts, we call this k-fold cross validation
• Choose k so that
– the k-fold cross validation accuracy does not change much from (k–1)-fold
– each part within the k-fold cross validation has similar accuracy
• k = 5 or 10 are popular choices
Bias-Variance Decomposition
• Suppose classifiers C_j and C_k were trained on different sets S_j and S_k of 1000 samples each
• Then C_j and C_k might have different accuracy
• What is the expected accuracy of a classifier C trained this way?
• Let Y = f(X) be what C is trying to predict
• The expected squared error at a test instance x, averaged over all such training sets, is
E[f(x) – C(x)]^2 = E[C(x) – E[C(x)]]^2 + [E[C(x)] – f(x)]^2, where the first term is the variance and the second the (squared) bias
Bias-Variance Trade-Off
• In k-fold cross validation,
– small k tends to underestimate accuracy (i.e., a large downward bias)
– large k has smaller bias, but can have high variance
Curse of Dimensionality
Curse of Dimensionality
• How much of each dimension is needed to cover a proportion r of the total sample space?
• Calculate by e_p(r) = r^(1/p)
• So, to cover 1% of a 15-D space, we need about 74% of each dimension!
[Plot: e_p(r) against p = 3, 6, 9, 12, 15 for r = 0.01 and r = 0.1]
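Plugging numbers into e_p(r) = r^(1/p) (a quick sketch, added for illustration) reproduces the values behind the plot, e.g. about 0.74 of each dimension for r = 0.01 in 15 dimensions.

```python
# Edge length of each dimension needed to cover a fraction r of a
# p-dimensional unit cube: e_p(r) = r**(1/p).
for p in (3, 6, 9, 12, 15):
    for r in (0.01, 0.1):
        print(f"p={p:2d}  r={r:4.2f}  e_p(r) = {r ** (1 / p):.2f}")
```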
Consequence of the Curse
• Suppose the number of samples given to us in the total sample space is fixed
• Let the dimension increase
• Then the distance to the k nearest neighbours of any point increases
• Then the k nearest neighbours are less and less useful for prediction, and can confuse the k-NN classifier
What is Feature Selection?
Tackling the Curse
• Given a sample space of p dimensions
• It is possible that some dimensions are irrelevant
• Need to find ways to separate those dimensions (aka features) that are relevant (aka signals) from those that are irrelevant (aka noise)
Signal Selection (Basic Idea)
• Choose a feature w/ low intra-class distance
• Choose a feature w/ high inter-class distance
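One common way to score a single feature on both criteria at once is a two-sample t-statistic, as on the next slide; here is a minimal sketch (assumed, with made-up feature values) of that scoring.

```python
import math

# t-statistic of one feature: difference of class means divided by the pooled
# standard error.  Large |t| = high inter-class distance relative to the
# intra-class spread; rank features by |t| and keep the top ones.
def t_statistic(pos_values, neg_values):
    mean = lambda xs: sum(xs) / len(xs)
    var = lambda xs: sum((x - mean(xs)) ** 2 for x in xs) / (len(xs) - 1)
    return (mean(pos_values) - mean(neg_values)) / math.sqrt(
        var(pos_values) / len(pos_values) + var(neg_values) / len(neg_values))

pos = [2.1, 2.3, 1.9, 2.2]   # made-up feature values in the positive class
neg = [1.0, 1.2, 0.9, 1.1]   # made-up feature values in the negative class
print(f"t = {t_statistic(pos, neg):.2f}")
```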
Signal Selection
(e.g., t-statistics)
Signal Selection
(e.g., MIT-correlation)
Signal Selection
(e.g., entropy)
Signal Selection
(e.g., χ2)
Signal Selection
(e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
Self-fulfilling Oracle
• Construct artificial dataset with 100 samples, each with
100,000 randomly generated features and randomly assigned class labels
• Select the 20 features with the best t-statistics (or other methods)
• Evaluate accuracy by cross validation using only the 20 selected features
• The resultant estimated accuracy can be ~90%
• But the true accuracy should be 50%, as the data were derived randomly
What Went Wrong?
• The 20 features were selected from the whole dataset
• Information in the held-out testing samples has thus been “leaked” to the training process
• The correct way is to re-select the 20 features at each fold; better still, use a totally new set of samples for testing
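A minimal sketch (assumed) of the correct protocol: feature selection is redone inside every fold using only that fold's training portion, so nothing about the held-out samples leaks into the classifier. `select_features`, `train`, and `evaluate` are placeholders for any concrete choices (e.g. t-statistic ranking plus k-NN).

```python
# Cross validation with per-fold feature selection (no information leakage).
def cv_with_feature_selection(samples, labels, k, select_features, train, evaluate):
    n = len(samples)
    accuracies = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in set(test_idx)]

        # Select features from the training portion of this fold ONLY.
        feats = select_features([samples[i] for i in train_idx],
                                [labels[i] for i in train_idx])
        project = lambda idxs: [[samples[i][f] for f in feats] for i in idxs]

        model = train(project(train_idx), [labels[i] for i in train_idx])
        accuracies.append(evaluate(model, project(test_idx),
                                   [labels[i] for i in test_idx]))
    return sum(accuracies) / k
```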
Short Break
Central Dogma of Molecular Biology
What is a gene?
Central Dogma
Transcription: DNA → nRNA
Splicing: nRNA → mRNA
Translation: mRNA → protein
[Codon wheel figure: the genetic code mapping codons to amino acids]
What does DNA data look like?
• A sample GenBank record from NCBI
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=19743934&dopt=GenBank
What does protein data look like?
• A sample GenPept record from NCBI
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=19743935&dopt=GenPept
Recognition of
Translation Initiation Sites
An introduction to the World’s simplest TIS recognition system
A simple approach to accuracy and understandability
Translation Initiation Site
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCC ATG GCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGC ATG GCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAG ATG AGAAGAGGGAG ATG GCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• What makes the second ATG the TIS?
Approach
• Training data gathering
• Signal generation
k-grams, distance, domain know-how, ...
• Signal selection
Entropy, χ2, CFS, t-test, domain know-how, ...
• Signal integration
SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen
[ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments
Signal Generation
• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Up-stream, down-stream vs. anywhere in window
– In-frame vs. any frame
[Bar chart: counts of the 1-grams A, C, G, T in seq1, seq2, seq3]
Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGC AGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGC ATG GCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCC GAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• Window = 100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1…
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2…
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
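A minimal sketch (assumed) of generating such counts: take the window on one side of a candidate ATG and count 3-grams either in-frame (step 3) or in any frame (step 1). The sequence used in the demo is the second line of the cDNA above.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, window=100, k=3, in_frame=True, downstream=True):
    """Count k-grams in the window up-stream or down-stream of the ATG at atg_pos."""
    if downstream:
        region = seq[atg_pos + 3 : atg_pos + 3 + window]
    else:
        region = seq[max(0, atg_pos - window) : atg_pos]
        if in_frame:
            region = region[len(region) % k :]   # align frames to end at the ATG
    step = k if in_frame else 1
    return Counter(region[i : i + k] for i in range(0, len(region) - k + 1, step))

seq = "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
print(kgram_counts(seq, seq.find("ATG"), in_frame=True, downstream=True))
```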
Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
• This is too many for most machine learning algorithms
Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream stop codons TAA, TAG, TGA
• In-frame downstream CTG, GAC, GAG, and GCC (codon bias?)
Signal Integration
• kNN
Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM
Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, ...
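A minimal sketch of such integration, assuming scikit-learn is available (the lecture does not prescribe any particular library), using the two classifiers described above; X holds toy k-gram count vectors for ATG candidates and y the TIS / non-TIS labels.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy data: one row of k-gram counts per candidate ATG, label 1 = TIS, 0 = non-TIS.
X = [[1, 0, 2, 0], [0, 1, 0, 3], [1, 1, 1, 0],
     [0, 0, 0, 2], [2, 0, 1, 0], [0, 2, 0, 1]]
y = [1, 0, 1, 0, 1, 0]

for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=3)),
                  ("SVM", SVC(kernel="linear"))]:
    scores = cross_val_score(clf, X, y, cv=3)    # 3-fold cross validation
    print(name, "mean accuracy:", scores.mean())
```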
Results
(3-fold x-validation)
                 TP/(TP+FN)   TN/(TN+FP)   TP/(TP+FP)   Accuracy
Naïve Bayes        84.3%        86.1%        66.3%       85.7%
SVM                73.9%        93.2%        77.9%       88.5%
Neural Network     77.6%        93.2%        78.8%       89.4%
Decision Tree      74.0%        94.4%        81.1%       89.4%
Performance Comparisons
                     TP/(TP+FN)   TN/(TN+FP)   TP/(TP+FP)   Accuracy
NB                     84.3%        86.1%        66.3%       85.7%
Decision Tree          74.0%        94.4%        81.1%       89.4%
NN                     77.6%        93.2%        78.8%       89.4%
SVM                    73.9%        93.2%        77.9%       88.5%
Pedersen & Nielsen     78%          87%          -           85%
Zien                   69.9%        94.1%        -           88.1%
Hatzigeorgiou          -            -            -           94%*

* result not directly comparable due to different dataset and ribosome-scanning model
Improvement by Scanning
• Apply Naïve Bayes or SVM left-to-right until first
ATG predicted as positive. That’s the TIS.
• Naïve Bayes & SVM models were trained using
TIS vs. Up-stream ATG
               TP/(TP+FN)   TN/(TN+FP)   TP/(TP+FP)   Accuracy
NB               84.3%        86.1%        66.3%       85.7%
SVM              73.9%        93.2%        77.9%       88.5%
NB+Scanning      87.3%        96.1%        87.9%       93.9%
SVM+Scanning     88.5%        96.3%        88.6%       94.4%
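A minimal sketch (assumed) of the scanning rule described above: walk the sequence left to right and report the first ATG that the trained model predicts as positive. `featurize` and `model` are placeholders for whatever feature generation and classifier (e.g. NB or SVM) are used.

```python
def scan_for_tis(seq, model, featurize):
    """Return the position of the first ATG predicted positive, else None."""
    for pos in range(len(seq) - 2):
        if seq[pos:pos + 3] == "ATG":
            if model.predict([featurize(seq, pos)])[0] == 1:
                return pos           # first positive ATG is taken to be the TIS
    return None
```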
Technique Comparisons
• Pedersen&Nielsen
[ISMB’97]
– Neural network
– No explicit features
• Zien
[Bioinformatics’00]
– SVM+kernel engineering
– No explicit features
• Hatzigeorgiou
[Bioinformatics’02]
– Multiple neural networks
– Scanning rule
– No explicit features
• Our approach
– Explicit feature generation
– Explicit feature selection
– Use any machine learning method w/o any form of complicated tuning
– Scanning rule is optional
Can we do even better?
• How about using k-grams from the translation, i.e., the amino-acid sequence?
[Codon wheel figure: the genetic code mapping codons to amino acids]
Amino-Acid Features
Amino-Acid Features
Amino Acid K-grams
Discovered by Entropy
Results (on Pedersen & Nielsen’s mRNA)
Performance based on the top 100 amino-acid features is better than performance based on DNA sequence features
Independent Validation Sets
• A. Hatzigeorgiou:
– 480 fully sequenced human cDNAs
– 188 left after eliminating sequences similar to training set (Pedersen & Nielsen’s)
– 3.42% of ATGs are TIS
• Our own:
– well characterized human gene sequences from chromosome X (565 TIS) and chromosome 21 (180 TIS)
Validation Results (on Hatzigeorgiou’s)
– Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset
Validation Results
(on Chr X and Chr 21)
Our method
ATGpr
– Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
Recognition of
Transcription Start Sites
An introduction to the World’s best TSS recognition system
A heavy tuning approach
Transcription Start Site
Approach taken in Dragon
• Multi-sensor integration via ANNs
• Multi-model system structure
– for different sensitivity levels
– for GC-rich and GC-poor promoter regions
Structure of Dragon Promoter Finder
-200 to +50 window size
Model selected based on desired sensitivity
Each model has two submodels based on GC content
GC-rich submodel
GC-poor submodel
(C+G) content = (#C + #G) / window size
Data Analysis Within Submodel
• Promoter, exon, and intron sensors s_p, s_e, s_i
• K-gram (k = 5) positional weight matrices
Promoter, Exon, Intron Sensors
• These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)
• Each sensor is calculated as a score s over the input window, using promoter, exon, and intron training data respectively: for each position i in the window, the frequency of the j-th pentamer observed at the i-th position of the training windows contributes to the score
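The exact sensor formula is not recoverable from the slide, so the following is only a rough, assumed illustration of a positional-weight-matrix score: sum the training frequency of the pentamer observed at each position of the input window.

```python
def sensor_score(window, freq, k=5):
    """freq[i] maps pentamers to their training frequency at position i (assumed layout)."""
    n = len(window) - k + 1
    total = sum(freq[i].get(window[i:i + k], 0.0) for i in range(n))
    return total / n             # average positional frequency over the window
```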
Data Preprocessing & ANN
• Tuning parameters
• The preprocessed sensor signals s_E, s_I, s_IE are combined, with weights w_i, by a simple feedforward ANN trained by the Bayesian regularisation method, and a tuned threshold is applied to its output
• net = Σ_i s_i * w_i, output = tanh(net), where tanh(x) = (e^x – e^–x) / (e^x + e^–x)
Accuracy Comparisons (with vs. without C+G submodels)
Training Data Criteria & Preparation
• Contain both positive and negative sequences
• Sufficient diversity, resembling different transcription start mechanisms
• Sufficient diversity, resembling different non-promoters
• Sanitized as much as possible
• TSS taken from
– 793 vertebrate promoters from EPD
– -200 to +50 bp of TSS
• non-TSS taken from
– GenBank,
– 800 exons
– 4000 introns,
– 250 bp,
– non-overlapping,
– <50% identities
Tuning Data Preparation
• To tune adjustable system parameters in Dragon, we need a separate tuning data set
• TSS taken from
– 20 full-length gene seqs with known TSS
– -200 to +50 bp of TSS
– no overlap with EPD
• Non-TSS taken from
– 1600 human 3’UTR seqs
– 500 human exons
– 500 human introns
– 250 bp
– no overlap
Testing Data Criteria & Preparation
• Seqs should be from the training or evaluation of other systems (no bias!)
• Seqs should be disjoint from training and tuning data sets
• Seqs should have TSS
• Seqs should be cleaned to remove redundancy,
<50% identities
• 159 TSS from 147 human and human virus seqs
• cumulative length of more than 1.15 Mbp
• Taken from
GENESCAN, GeneId,
Genie, etc.
Survey of Neural Network Based
Systems for Recognizing Gene
Features
NNPP (TSS Recognition)
• NNPP2.1
– use 3 time-delayed
ANNs
– recognize TATA-box,
Initiator, and their mutual distance
– Dragon is 8.82 times more accurate
• Makes about 1 prediction per 550 nt at 0.75 sensitivity
Promoter 2.0 (TSS Recognition)
• Promoter 2.0
– use ANN
– recognize 4 signals commonly present in eukaryotic promoters:
TATA-box, Initiator,
GC-box, CCAAT-box, and their mutual distances
– Dragon is 56.9 times more accurate
Promoter Inspector (TSS Recognition)
• statistics-based
• the most accurate reported system for finding promoter region
• uses sensors for promoters, exons, introns, 3’UTRs
• Strong bias for CpG-related promoters
• Dragon is 6.88 times better
– to compare with Dragon, we consider Promoter Inspector to have made a correct prediction if the TSS falls within a promoter region it predicts
Grail’s Promoter Prediction Module
Makes about 1 prediction per 230000 nt at 0.66 sensitivity
LVQ Networks for TATA Recognition
• Achieves 0.33 sensitivity at 47 FP on the Fickett & Hatzigeorgiou (1997) data set
Hatzigeorgiou’s DIANA-TIS
• Get local TIS score of ATG and -7 to +5 bases flanking
• Get coding potential of 60 in-frame bases up-stream and down-stream
• Get coding score by subtracting down-stream from up-stream
• ATG may be TIS if product of two scores is > 0.2
• Choose the first such ATG
Pedersen & Nielson’s NetStart
• Predict TIS by ANN
• -100 to +100 bases as input window
• feedforward 3-layer ANN
• 30 hidden neurons
• sensitivity = 78%
• specificity = 87%
Notes
References
(expt design, result interpretation)
• John A. Swets, “Measuring the accuracy of diagnostic systems”, Science 240:1285--1293,
June 1988
• Trevor Hastie, Robert Tibshirani, Jerome
Friedman, The Elements of Statistical
Learning: Data Mining, Inference, and
Prediction , Springer, 2001. Chapters 1, 7
• Lance D. Miller et al., “Optimal gene expression analysis by microarrays”, Cancer
Cell 2:353--361, November 2002
References
(TIS recognition)
• A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997
• L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192--200, 2002
• A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”,
Bioinformatics 16:799--807, 2000
• A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”,
Bioinformatics 18:343--350, 2002
References
(TSS Recognition)
• V.B.Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol.
Graph. & Mod. 21:323--332, 2003
• V.B.Bajic et al., “Dragon Promoter Finder:
Recognition of vertebrate RNA polymerase II promoters”, Bioinformatics 18:198--199, 2002.
• V.B.Bajic et al., “Intelligent system for vertebrate promoter recognition”, IEEE Intelligent Systems
17:64--70, 2002.
References
(TSS Recognition)
• J.W.Fickett, A.G.Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861--878, 1997
• Y.Xu, et al., “GRAIL: A multi-agent neural network system for gene identification”, Proc. IEEE
84:1544--1552, 1996
• M.G.Reese, “Application of a time-delay neural network to promoter annotation in the
D. melanogaster genome”, Comp. & Chem.
26:51--56, 2001
• A.G.Pedersen et al., “The biology of eukaryotic promoter prediction--a review”, Comp. & Chem.
23:191--207, 1999
References
(TSS Recognition)
• S.Knudsen, “Promoter 2.0 for the recognition of
Pol II promoter sequences”, Bioinformatics
15:356--361, 1999.
• H.Wang, “Statistical pattern recognition based on
LVQ ANN: Application to TATAbox motif”, M.Tech
Thesis , Technikon Natal, South Africa
• M.Scherf, et al., “Highly specific localisation of promoter region in large genome sequences by
Promoter Inspector: A novel context analysis approach”, JMB 297:599--606, 2000
References
(feature selection)
• M. A. Hall, “Correlation-based feature selection for machine learning”, PhD thesis, Dept of Comp. Sci., Univ. of Waikato, New Zealand, 1998
• U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous-valued attributes”, IJCAI 13:1022--1027, 1993
• H. Liu, R. Setiono, “Chi2: Feature selection and discretization of numeric attributes”, IEEE Intl. Conf. Tools with Artificial Intelligence 7:338--391, 1995
Acknowledgements (for TIS)
• A.G. Pedersen
• H. Nielsen
• Roland Yap
• Fanfan Zeng
• Jinyan Li
• Huiqing Liu
Acknowledgements (for TSS)
The lecture will be on Thursday, 20th, from 6.30--9.30pm at LT07 (which is very close to the entrance of the School of Computer Engineering, 2nd floor). If you come to my office, Blk N4-2a05, which is close to the General Office of SCE, I am happy to usher you to the lecture theater.