Gene Finding and Gene Feature Recognition by Computational Analysis.

Copyright © 2004 by Limsoon Wong
Gene Finding & Gene
Feature Recognition
by Computational
Analysis
Limsoon Wong
Institute for Infocomm Research
November 2004
Lecture Plan
• Gene structure basics
• Gene finding overview
• GRAIL
• Indel & frame-shift in coding regions
• Histone promoters: A cautionary case study
• Knowledge discovery basics
• TIS recognition
• Poly-A signal recognition
• TSS recognition
(Slide labels: Basic materials; Advanced materials)
Gene Structure Basics
A brief refresher
Some slides here are “borrowed” from Ken Sung
Body
• Our body consists of a number of organs
• Each organ is composed of a number of tissues
• Each tissue is composed of cells of the same type
Cell
• Performs two types of function
– Chemical reactions necessary to maintain life
– Passing info for maintaining life to the next generation
• In particular
– Proteins perform chemical reactions
– DNA stores & passes on info
– RNA is the intermediate between DNA & proteins
Protein
• A protein is a sequence composed from an alphabet of 20 amino acids
– Length is usually 20 to 5000 amino acids
– Average is around 350 amino acids
• Folds into a 3D shape, forming the building blocks & performing most of the chemical reactions within a cell
Amino Acid
• Each amino acid consists of
– Amino group (NH2)
– Carboxyl group (COOH)
– R group
[Figure: an amino acid; the amino group, carboxyl group, R group, and an H are all attached to the central carbon]
Classification of Amino Acids
• Amino acids can be classified into 4 types
• Positively charged (basic)
– Arginine (Arg, R)
– Histidine (His, H)
– Lysine (Lys, K)
• Negatively charged (acidic)
– Aspartic acid (Asp, D)
– Glutamic acid (Glu, E)
Classification of Amino Acids
• Polar (overall uncharged, but uneven charge distribution; can form hydrogen bonds with water; called hydrophilic)
– Asparagine (Asn, N)
– Cysteine (Cys, C)
– Glutamine (Gln, Q)
– Glycine (Gly, G)
– Serine (Ser, S)
– Threonine (Thr, T)
– Tyrosine (Tyr, Y)
• Nonpolar (overall uncharged and uniform charge distribution; cannot form hydrogen bonds with water; called hydrophobic)
– Alanine (Ala, A)
– Isoleucine (Ile, I)
– Leucine (Leu, L)
– Methionine (Met, M)
– Phenylalanine (Phe, F)
– Proline (Pro, P)
– Tryptophan (Trp, W)
– Valine (Val, V)
Protein & Polypeptide Chain
• Formed by joining amino acids via peptide bonds
• One end is the amino group, called the N-terminus
• The other end is the carboxyl group, called the C-terminus
[Figure: two amino acids, with side chains R and R’, joined into a dipeptide; the link between the C of the first carboxyl group and the N of the second amino group is the peptide bond]
DNA
• DNA stores the instructions needed by the cell to perform its daily functions
• Consists of two strands interwoven to form a double helix
• Each strand is a chain of small molecules called nucleotides
[Photo: Francis Crick shows James Watson the model of DNA in their room number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge]
Nucleotide
• Consists of three parts:
– Deoxyribose
– Phosphate (bound to the 5’ carbon)
– Base (bound to the 1’ carbon)
[Figure: a nucleotide with base adenine, showing the phosphate and carbons 1’ to 5’ of the deoxyribose]
Classification of Nucleotides
• 5 diff nucleotides, distinguished by their bases: adenine (A), cytosine (C), guanine (G), thymine (T), & uracil (U)
• A, G are purines. They have a 2-ring structure
• C, T, U are pyrimidines. They have a 1-ring structure
• DNA only uses A, C, G, & T
[Figure: structures of A, C, G, T, U]
Watson-Crick rules
• Complementary bases:
– A with T (two hydrogen bonds)
– C with G (three hydrogen bonds)
[Figure: A–T and C–G base pairs, each labelled 10Å]
Orientation of a DNA
• One strand of DNA is generated by chaining together nucleotides, forming a phosphate-sugar backbone
• It has direction: from 5’ to 3’, because new nucleotides are always added at the 3’ end
– Upstream: towards the 5’ end
– Downstream: towards the 3’ end
[Figure: a strand A C G T A with phosphate (P) links, ends labelled 5’ and 3’]
Double Stranded DNA
• DNA is double stranded in a cell. The two
strands are anti-parallel. One strand is
reverse complement of the other
• The double strands are interwoven to
form a double helix
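
The reverse-complement relation between the two strands can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `reverse_complement` is chosen here for convenience, and the input is assumed to contain only A, C, G, T.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

# The forward strand used later in the reading-frame slide.
print(reverse_complement("ATGGCTTACGCTTGA"))
```

Applied to ATGGCTTACGCTTGA, this gives TCAAGCGTAAGCCAT, the reverse strand shown on the reading-frame slide.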
Locations of DNAs in a Cell?
• Two types of organisms
– Prokaryotes (single-celled organisms with no nuclei, e.g., bacteria)
– Eukaryotes (organisms with single or multiple cells whose cells have nuclei, e.g., plants & animals)
• In prokaryotes, DNA floats freely within the cell
• In eukaryotes, DNA is located within the nucleus
Chromosome
• DNA is usually tightly wound around histone proteins and forms a chromosome
• The total info stored in all chromosomes constitutes a genome
• In most multi-cell organisms, every cell contains the same complete set of chromosomes
– May have some small differences due to mutation
• Human genome has 3G base pairs, organized in 23 pairs of chromosomes
Gene
• A gene is a sequence of DNA that encodes a protein or an RNA molecule
• About 30,000 – 35,000 (protein-coding) genes in human genome
• For a gene that encodes protein
– In a prokaryotic genome, one gene corresponds to one protein
– In a eukaryotic genome, one gene can correspond to more than one protein because of the process of “alternative splicing”
Complexity of Organism
vs. Genome Size
• Human genome: 3G base pairs
• Amoeba dubia (a single-cell organism): 600G base pairs
⇒ Genome size has no relationship with the complexity of the organism
Number of Genes vs. Genome Size
• Prokaryotic genome (e.g., E. coli)
– Number of base pairs: 5M
– Number of genes: 4k
– Average length of a gene: 1000 bp
• Eukaryotic genome (e.g., human)
– Number of base pairs: 3G
– Estimated number of genes: 30k – 35k
– Estimated average length of a gene: 1000-2000 bp
• ~90% of the E. coli genome consists of coding regions
• < 3% of the human genome is believed to be coding regions
⇒ Genome size has no relationship with the number of genes!
RNA
• RNA has properties of both DNA & protein
– Similar to DNA, it can store & transfer info
– Similar to protein, it can form complex 3D structure & perform some functions
• A nucleotide of RNA has three parts:
– Ribose sugar (has an extra OH group at 2’)
– Phosphate (bound to 5’ carbon)
– Base (bound to 1’ carbon)
[Figure: an RNA nucleotide with base adenine, showing the phosphate and carbons 1’ to 5’ of the ribose sugar]
RNA vs DNA
• RNA is single stranded
• Nucleotides of RNA are similar to those of DNA, except that they have an extra OH at position 2’
– Due to this extra OH, RNA can form more hydrogen bonds than DNA
– So RNA can form complex 3D structure
• RNA uses the base U instead of T
– U is chemically similar to T
– In particular, U is also complementary to A
Mutation
• Mutation is a sudden
change of genome
• Basis of evolution
• Cause of cancer
• Can occur in DNA, RNA,
& Protein
Central Dogma
• Gene expression consists of two steps
– Transcription: DNA → mRNA
– Translation: mRNA → Protein
Transcription
• Synthesize mRNA from one strand of DNA
– An enzyme, RNA polymerase, temporarily separates double-stranded DNA
– It begins transcription at the transcription start site
– Relative to the coding strand: A → A, C → C, G → G, & T → U
– Once RNA polymerase reaches the transcription stop site, transcription stops
• Additional “steps” for eukaryotes
– Transcription produces pre-mRNA that contains both introns & exons
– 5’ cap & poly-A tail are added to pre-mRNA
– RNA splicing removes introns & mRNA is made
– mRNA is transported out of the nucleus
Translation
• Synthesize protein from mRNA
• Each amino acid is encoded by a consecutive seq of 3 nucleotides, called a codon
• The decoding table from codon to amino acid is called the genetic code
• 4^3 = 64 diff codons
⇒ Codons are not in 1-to-1 correspondence with the 20 amino acids
• Nearly all organisms use the same decoding table
• Recall that amino acids can be classified into 4 groups. A single-base change in a codon is usually not sufficient to cause a codon to code for an amino acid in a different group
Genetic Code
• Start codon: ATG (code for M)
• Stop codon: TAA, TAG, TGA
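
Since the genetic code table itself was an image, here is a small Python sketch of decoding with the standard genetic code. It is an illustration only: the compact table layout (codons enumerated in T, C, A, G order against a 64-letter amino acid string) is a common convention, and the function name `translate` is chosen here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = GENETIC_CODE[dna[i:i + 3]]
        if amino_acid == "*":          # stop codon: TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCTTACGCTTGA"))
```

The example sequence starts with the start codon ATG (M) and ends with the stop codon TGA.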
Ribosome
• Translation is handled by a molecular complex,
ribosome, which consists of both proteins &
ribosomal RNA (rRNA)
• Ribosome reads mRNA & the translation starts
at a start codon (the translation start site)
• With help of tRNA, each codon is translated to
an amino acid
• Translation stops once ribosome reads a stop
codon (the translation stop site)
Introns and exons
• Eukaryotic genes contain introns & exons
– Introns are seqs that are ultimately spliced out of mRNA
– Introns normally satisfy the GT-AG rule, viz. begin w/ GT & end w/ AG
– Each gene can have many introns & each intron can have thousands of bases
• Introns can be very long
• An extreme example is a gene associated with cystic fibrosis in human:
– Length of 24 introns ~1Mb
– Length of exons ~1kb
Typical Eukaryotic Gene Structure
Image credit: Xu
• Unlike eukaryotic genes, a prokaryotic gene typically consists of
only one contiguous coding region
Reading Frame
• Each DNA segment has six possible reading frames

Forward strand: ATGGCTTACGCTTGA
Reading frame #1: ATG GCT TAC GCT TGA
Reading frame #2: TGG CTT ACG CTT GA.
Reading frame #3: GGC TTA CGC TTG A..

Reverse strand: TCAAGCGTAAGCCAT
Reading frame #4: TCA AGC GTA AGC CAT
Reading frame #5: CAA GCG TAA GCC AT.
Reading frame #6: AAG CGT AAG CCA T..
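
The six reading frames of the example can be enumerated mechanically. The sketch below is illustrative; the function name `six_frames` and the frame numbering 1 to 6 (three forward offsets, then three offsets on the reverse complement) follow the slide's layout but are otherwise choices made here.

```python
def six_frames(forward: str) -> dict:
    """Map frame number (1-6) to the list of codons read in that frame."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse = forward.translate(complement)[::-1]   # reverse complement
    frames = {}
    for strand_index, strand in enumerate((forward, reverse)):
        for offset in range(3):
            # The last chunk may be shorter than 3 bases.
            codons = [strand[i:i + 3] for i in range(offset, len(strand), 3)]
            frames[3 * strand_index + offset + 1] = codons
    return frames

for number, codons in six_frames("ATGGCTTACGCTTGA").items():
    print(f"Reading frame #{number}:", " ".join(codons))
```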
Open Reading Frame (ORF)
• An ORF is a segment of DNA with two in-frame stop codons at the two ends and no in-frame stop codon in the middle
stop |------------ ORF ------------| stop
• Each ORF has a fixed reading frame
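
Under the definition above (a stop-free stretch bounded by two in-frame stop codons), ORFs on one strand can be found with a single scan per frame. This is a minimal sketch; the function name `find_orfs`, the `min_codons` parameter, and the (frame offset, start, end) output format are assumptions made for illustration. To cover all six frames, the same function would be applied to the reverse complement as well.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 1) -> list:
    """Return (offset, start, end) triples; seq[start:end] lies between two in-frame stops."""
    orfs = []
    for offset in range(3):
        after_last_stop = None            # index just past the previous in-frame stop
        for i in range(offset, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if after_last_stop is not None and (i - after_last_stop) // 3 >= min_codons:
                    orfs.append((offset, after_last_stop, i))
                after_last_stop = i + 3
    return orfs

seq = "TAAATGGCTTGACC"
for offset, start, end in find_orfs(seq):
    print(offset, seq[start:end])
```

Stretches at the ends of the sequence that are not closed by a stop codon on both sides are deliberately ignored here.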
Coding Region
• Each coding region (exon or whole gene) has a
fixed translation frame
• A coding region always sits inside an ORF of
same reading frame
• All exons of a gene are on the same strand
• Neighboring exons of a gene could have
different reading frames
Frame Consistency
• Neighbouring exons of a gene should be
frame-consistent
ATG GCT TGG GCT TTA A -------------- GT TTC CCG GAG AT ------ T GGG
(exon 1: ATG … TTA A; exon 2: GT … GAG AT; exon 3: T GGG)
Exercise: Define frame consistency mathematically
Any Question?
Overview of
Gene Finding
Some slides here are “borrowed” from Mark Craven
What is Gene Finding?
• Find all coding regions
from a stretch of DNA
sequence, and construct
gene structures from the
identified exons
• Can be decomposed into
– Find coding potential of a
region in a frame
– Find boundaries betw
coding & non-coding
regions
Image credit: Xu
Approaches
• Search-by-signal: find genes by identifying the
sequence signals involved in gene expression
• Search-by-content: find genes by statistical
properties that distinguish protein coding DNA
from non-coding DNA
• Search-by-homology: find genes by homology
(after translation) to proteins
• State-of-the-art systems for gene finding
usually combine these strategies
Relevant Signals
for Search-by-Signals
• Transcription initiation
– Promoter
• Transcription termination
– Terminators
• Translation initiation
– Ribosome binding sites
– Initiation codons
• Translation termination
– Stop codons
• RNA processing
– Splice junction
Image credit: Xu
How Search-by-Signal Works
• There are 2 impt regions in a promoter seq
– –10 region, ~10bp before TSS
– –35 region, ~35bp before TSS
• Consensus for the –10 region in E. coli is TATAAT, but few promoters actually have this exact seq
⇒ Recognize promoters by
– weight matrices
– probabilistic models
– neural networks, …
How Search-by-Content Works
• Encoding a protein affects stats properties of a DNA seq
– some amino acids are used more frequently
– diff number of codons for diff amino acids
– for a given amino acid, usually one codon is used more frequently than others
⇒ Estimate prob that a given region of seq was “caused by” its being a coding seq
Image credit: Craven
How Search-by-Homology Works
• Translate DNA seq in all reading frames
• Search against protein db
• High-scoring matches suggest presence of homologous genes in DNA
⇒ You can use BLASTX for this
Search-by-Content Example:
Codon Usage Method
• Staden & McLachlan, 1982
• Process a seq w/ “window” of length L
• Assume seq falls into one of 7 categories, viz.
– Coding in frame 0, frame 1, …, frame 5
– Non-coding
• Use Bayes’ rule to determine prob of each category
• Assign seq to category w/ max prob
[Three figures illustrating the method. Image credit: Craven]
• Pr(coding_i) is the same for each frame if window size fits same number of codons in each frame
• Otherwise, consider relative number of codons in window in each frame
Image credit: Craven
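
The formulas for this method were on image slides, so the following Python sketch is a reconstruction under stated assumptions rather than the Staden & McLachlan procedure itself. It assumes a codon usage table `codon_freq` estimated from known coding regions, independence between codons, a uniform (1/64 per codon) model for non-coding sequence, and a prior split evenly over the six coding frames. All names and numbers in it are hypothetical.

```python
import math
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def codons_in_frame(seq, offset):
    """Complete codons of seq when reading starts at offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def classify_window(window, codon_freq, prior_noncoding=0.5):
    """Posterior probability of each of the 7 categories via Bayes' rule."""
    reverse = window.translate(COMPLEMENT)[::-1]
    prior_coding = (1.0 - prior_noncoding) / 6
    log_joint = {}
    for frame in range(6):                       # frames 3-5 are on the reverse strand
        strand = window if frame < 3 else reverse
        codons = codons_in_frame(strand, frame % 3)
        log_joint[f"coding_{frame}"] = math.log(prior_coding) + sum(
            math.log(codon_freq[c]) for c in codons)
    # Assumed non-coding model: every codon equally likely (1/64).
    n_codons = len(codons_in_frame(window, 0))
    log_joint["non-coding"] = math.log(prior_noncoding) + n_codons * math.log(1 / 64)
    # Normalise: Pr(category | window) = joint / sum of joints.
    top = max(log_joint.values())
    weights = {k: math.exp(v - top) for k, v in log_joint.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Toy codon usage table (hypothetical numbers): GCT and GAA are favoured.
toy_freq = {"".join(c): 0.5 / 62 for c in product("ACGT", repeat=3)}
toy_freq["GCT"] = toy_freq["GAA"] = 0.25
posterior = classify_window("GCTGAAGCTGAAGC", toy_freq)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

The window length of 14 is chosen so that every frame holds the same number of complete codons, which is the case in which the slide says Pr(coding_i) can be taken as equal across frames.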
Search-by-Homology Example:
Gene Finding Using BLAST
• High seq similarity typically implies homologous genes
⇒ Search for genes in yeast seq using BLAST
⇒ Extract features for gene identification
[Figure: candidate gene → BLAST search against Genbank or nr → sequence alignments with known genes, alignment p-values. Image credit: Xu]
• Searching all ORFs against known genes in nr db helps identify an initial set of (possibly incomplete) genes
[Figure: sequence with BLAST hits; distributions (%) of coding potential and gene length for known genes vs known non-genes. Image credit: Xu]
• A (yeast) gene starts w/ ATG and ends w/ a stop codon, in same reading frame of ORF
• Has “strong” coding potential, measured by preference models, Markov chain models, ...
• Has “strong” translation start signal, measured by weight matrix model, ...
• Has characteristic distributions wrt length, G+C composition, ...
• Has special seq signals in flanking regions, ...
Any Question?
GRAIL,
An Important Gene
Finding Program
• Signals assoc w/ coding regions
• Models for coding regions
• Signals assoc w/ boundaries
• Models for boundaries
• Other factors & information fusion
Some slides here are “borrowed” from Ying Xu
Coding Signal
• Freq distribution of dimers in protein sequence
• E.g., Shewanella
– Ave freq is 5%
– Some amino acids prefer to be next to each other
– Some amino acids prefer not to be next to each other
Image credit: Xu
Exercise: What is Shewanella?
Coding Signal
• Dimer preference implies dicodon (6-mers like AAA TTT) bias in coding vs non-coding regions
• Relative freq of a dicodon in coding vs non-coding
– Freq of dicodon X (e.g., AAA AAA) in coding regions = total number of occurrences of X in coding regions divided by total number of dicodon occurrences in coding regions
– Freq of dicodon X (e.g., AAA AAA) in non-coding regions = total number of occurrences of X in non-coding regions divided by total number of dicodon occurrences in non-coding regions
• Exercise: In human genome, freq of dicodon “AAA AAA” is ~1% in coding region vs ~5% in non-coding region. If you see a region with many “AAA AAA”, would you guess it is a coding or non-coding region?
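
The frequency definition above can be sketched as a simple counting routine. This is an illustration, not GRAIL's code: the function name `dicodon_freq` and the `step` parameter (1 for any-frame counting, 3 for counting only at codon boundaries of a region that starts in frame) are assumptions made here.

```python
from collections import Counter

def dicodon_freq(regions, step=1):
    """Relative frequency of each 6-mer over a list of sequences."""
    counts = Counter()
    for seq in regions:
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {dicodon: n / total for dicodon, n in counts.items()}

# Toy input; real estimates need large sets of known coding and non-coding regions.
coding_freq = dicodon_freq(["ATGTTGGATGCC"], step=3)
print(coding_freq)
```

Running it once on known coding regions and once on known non-coding regions gives the two frequency tables that the preference model compares.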
Why Dicodon (6-mer)?
• Codon (3-mer)-based models are not as info rich as dicodon-based models
• Tricodon (9-mer)-based models need too many data points
• There are
– 4^3 = 64 codons
– 4^6 = 4096 dicodons
– 4^9 = 262144 tricodons
• To make stats reliable, need ~15 occurrences of each X-mer
⇒ For tricodon-based models, need at least 15*262144 = 3932160 coding bases in our training data, which is probably not going to be available for most genomes
Coding Signal
• Most dicodons show bias towards either coding or non-coding regions
⇒ Foundation for coding region identification: regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise non-coding regions
⇒ Dicodon freq are a key signal used for coding region detection; all gene finding programs use this info
Coding Signal
• Dicodon freq in coding vs non-coding are
genome-dependent
[Figure: dicodon freq in coding vs non-coding regions for Shewanella and for bovine. Image credit: Xu]
Coding Signal
• In-frame vs any-frame dicodons
• In-frame dicodon freq provide a more sensitive measure than any-frame dicodon freq
Example: ATG TTG GAT GCC CAG AAG .....
– In-frame dicodons: ATG TTG, GAT GCC, CAG AAG
– Not in-frame dicodons: TGTTGG, ATGCCC, AGAAG., GTTGGA, TGCCCA, GAAG..
– Any-frame dicodons: all of the above
Dicodon Preference Model
• The preference value P(X) of a dicodon X is defined as
P(X) = log ( F_C(X) / F_N(X) )
where
F_C(X) is freq of X in coding regions
F_N(X) is freq of X in non-coding regions
Dicodon Preference Model’s
Properties
• P(X) = 0 if X has same freq in coding and non-coding regions
• P(X) > 0 if X has higher freq in coding than in non-coding regions; the larger the diff, the more positive the score is
• P(X) < 0 if X has higher freq in non-coding than in coding regions; the larger the diff, the more negative the score is
Dicodon Preference Model Example
• Suppose AAA ATT, AAA GAC, AAA TAG have the following freq:
F_C(AAA ATT) = 1.4%    F_N(AAA ATT) = 5.2%
F_C(AAA GAC) = 1.9%    F_N(AAA GAC) = 4.8%
F_C(AAA TAG) = 0.0%    F_N(AAA TAG) = 6.3%
• Then (logs are base 10)
P(AAA ATT) = –0.57
P(AAA GAC) = –0.40
P(AAA TAG) = –∞, treating STOP codons differently
⇒ A region consisting of only these dicodons is probably a non-coding region
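
The worked numbers can be checked with a few lines of Python. This is a sketch: the function name `preference` is chosen here, the logarithm is taken as base 10 because that reproduces the slide's values, and a zero coding frequency is mapped to minus infinity as in the AAA TAG case.

```python
import math

def preference(freq_coding: float, freq_noncoding: float) -> float:
    """P(X) = log10(F_C(X) / F_N(X)); minus infinity if X never occurs in coding regions."""
    if freq_coding == 0:
        return float("-inf")
    return math.log10(freq_coding / freq_noncoding)

for name, fc, fn in [("AAA ATT", 1.4, 5.2), ("AAA GAC", 1.9, 4.8), ("AAA TAG", 0.0, 6.3)]:
    print(name, round(preference(fc, fn), 2))
```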
Frame-Insensitive
Coding Region Preference Model
• A frame-insensitive coding preference S_is(R) of a region R can be defined as
S_is(R) = Σ_{X is a dicodon in R} P(X)
• R is predicted as coding region if S_is(R) > 0
• NB. This model is not commonly used
In-Frame
Dicodon Preference Model
• The in-frame + i preference value P_i(X) of a dicodon X is defined as
P_i(X) = log ( F_Ci(X) / F_N(X) )
where
F_Ci(X) is freq of X in coding regions at in-frame + i positions
F_N(X) is freq of X in non-coding regions
[Figure: sequence ATG TGC CGC GCT with dicodon positions labelled P_0, P_1, P_2]
In-Frame
Coding Region Preference Model
• The in-frame + i preference S_i(R) of a region R can be defined as
S_i(R) = Σ_{X is a dicodon at in-frame + i position in R} P_i(X)
• R is predicted as coding if Σ_{i=0,1,2} S_i(R) / |R| > 0
• NB. This coding preference model is commonly used
Coding Region Prediction: An
Example Procedure
• Calculate all ORFs of a DNA segment
• For each ORF
– Slide thru ORF w/ increment of 10bp
– Calculate in-frame coding region preference score, in same frame as ORF, within window of 60bp
– Assign score to center of window
• E.g., forward strand in a particular frame...
[Figure: preference scores along the sequence, y-axis from –5 to +5. Image credit: Xu]
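
The sliding-window procedure can be sketched as follows. This is a simplified illustration, not GRAIL's implementation: it assumes the ORF string starts in frame, uses only the in-frame (P_0) preference values supplied in a dictionary `pref`, and treats dicodons missing from the table as neutral (score 0). The function name `window_scores` is chosen here.

```python
def window_scores(orf, pref, window=60, step=10):
    """Return (centre position, summed in-frame dicodon preference) per window."""
    scores = []
    for start in range(0, len(orf) - window + 1, step):
        first = start + (-start) % 3           # first codon boundary inside the window
        total = sum(pref.get(orf[i:i + 6], 0.0)
                    for i in range(first, start + window - 5, 3))
        scores.append((start + window // 2, total))
    return scores

# Toy preference table (hypothetical values).
toy_pref = {"ATGGCT": 1.0, "GCTATG": 0.5}
for centre, score in window_scores("ATGGCT" * 20, toy_pref):
    print(centre, score)
```

A threshold on these scores, chosen from training data, would then separate likely coding from likely non-coding windows.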
Problem with Coding Region
Boundaries
• Making the call: coding or non-coding, and where the boundaries are
[Figure: preference-score plot annotated “coding region?”, “where to draw the line?”, “where to draw the boundaries?”. Image credit: Xu]
⇒ Need training set with known coding and non-coding regions to select threshold that includes as many known coding regions as possible, and at the same time excludes as many known non-coding regions as possible
Types of Coding Region
Boundaries
• Knowing boundaries of coding regions helps identify them more accurately
• Possible boundaries of an exon
– Start: { translation start, acceptor site }
– End: { translation stop, donor site }
• Splice junctions:
– Donor site: coding region | GT
– Acceptor site: CAG or TAG | coding region
• Translation start
– in-frame ATG
Image credit: Xu
Signals for Coding Region
Boundaries
• Splice junction sites and translation starts have
certain distribution profiles
• For example, ...
Acceptor Site (Human Genome)
• If we align all known acceptor sites (with their
splice junction site aligned), we have the
following nucleotide distribution
Image credit: Xu
• Acceptor site: CAG or TAG | coding region
Donor Site (Human Genome)
• If we align all known donor sites (with their
splice junction site aligned), we have the
following nucleotide distribution
• Donor site: coding region | GT
Image credit: Xu
What Positions Have “High”
Information Content?
• For a weight matrix, information content of each column is calculated as
Σ_{X ∈ {A,C,G,T}} F(X) * log ( F(X) / 0.25 )
⇒ When a column has evenly distributed nucleotides, its information content is lowest (zero)
⇒ Only need to look at positions having high information content
Information Content Around Donor
Sites in Human Genome
Image credit: Xu
• Information content (logs are base 10)
column –3 = .34*log (.34/.25) + .363*log (.363/.25) + .183*log (.183/.25) + .114*log (.114/.25) = 0.04
column –1 = .092*log (.092/.25) + .033*log (.033/.25) + .803*log (.803/.25) + .073*log (.073/.25) = 0.30
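
The two column values can be reproduced with a short Python check. This is a sketch: the function name `information_content` is chosen here, and the base-10 logarithm and the positive sum are what reproduce the stated results of 0.04 and 0.30.

```python
import math

def information_content(freqs, background=0.25):
    """Sum of F(X) * log10(F(X) / background) over the nucleotides of one column."""
    return sum(f * math.log10(f / background) for f in freqs if f > 0)

column_minus3 = [0.34, 0.363, 0.183, 0.114]   # A, C, G, T frequencies from the slide
column_minus1 = [0.092, 0.033, 0.803, 0.073]
print(round(information_content(column_minus3), 2))
print(round(information_content(column_minus1), 2))
print(information_content([0.25, 0.25, 0.25, 0.25]))   # evenly distributed column
```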
Weight Matrix Model for Splice Sites
• Weight matrix model
– build a weight matrix for donor, acceptor, translation
start site, respectively
– use positions of high information content
Image credit: Xu
Splice Site Prediction: A Procedure
Image credit: Xu
• Add up freq of corr letter in corr positions:
AAGGTAAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24
TGTGTCTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56
• Make prediction on splice site based on some threshold
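
The scoring step can be sketched in Python. The full donor-site weight matrix was an image, so the matrix below contains only the entries that appear in the two worked sums above; every other entry is missing and is scored as 0 here, which is an assumption. The threshold value and the function names are hypothetical.

```python
# Partial donor-site matrix: one dict per position, holding only the
# frequencies quoted in the two worked examples.
DONOR_MATRIX = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

def site_score(candidate, matrix=DONOR_MATRIX):
    """Add up the frequency of each letter at its position."""
    if len(candidate) != len(matrix):
        raise ValueError("candidate must have one base per matrix column")
    return sum(column.get(base, 0.0) for column, base in zip(matrix, candidate))

def is_donor_site(candidate, threshold=5.0):   # hypothetical threshold
    return site_score(candidate) >= threshold

for candidate in ("AAGGTAAGT", "TGTGTCTCA"):
    print(candidate, round(site_score(candidate), 2), is_donor_site(candidate))
```

In practice the threshold would be tuned on known splice sites and known non-sites.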
Other Factors Considered by GRAIL
• G+C composition affects dicodon distributions
• Length of exons follows certain distribution
• Other signals associated with coding regions
– periodicity
– structure information
– .....
• Pseudo genes
• ........
Info Fusion by ANN in GRAIL
Image credit: Xu
Remaining Challenges in GRAIL
• Initial exon
• Final exon
• Indels & frame shifts
Any Question?
Indel & Frame-Shift in
Coding Regions
• Problem definition
• Indel & frameshift identification
• Indel correction
• An iterative strategy
Some slides here are “borrowed” from Ying Xu
Indels in Coding Regions
• Indel = insertion or deletion in coding region
• Indels are usually caused by sequencing errors
ATG GAT CCA CAT ….. (original)
ATG GAT CA CAT ….. (deletion)
ATG GAT CTCA CAT ….. (insertion)
Effects of Indels on Exon
Prediction
• Indels may cause shifts in reading frames & affect prediction algos for coding regions
[Figure: preference scores along an exon containing an indel. Image credit: Xu]
Key Idea for Detecting Frame-Shift
• Preferred reading frame is the reading frame w/ highest coding score
• Diff DNA segments may have diff preferred reading frames
Image credit: Xu
⇒ Segmenting a coding sequence into regions w/ consistent preferred reading frames corresponds well w/ indel positions
⇒ Indel identification problem can be solved as a sequence segmentation problem!
Frame-Shift Detection by Sequence
Segmentation
• Partition seq into segs so that
– Chosen frames of adjacent segs are diff
– Each seg has >30 bps to avoid small fluctuations
– Sum of coding scores in the chosen frames over all segs is maximized
• This combinatorial optimization problem can be solved in 6 steps...
Frame-Shift Detection: Step 1
• Given DNA sequence a_1 … a_n
• Define key quantities
C(i, r, 1) = max score on a_1 … a_i, w/ the last segment in frame r
C(i, r, 0) = same as C(i, r, 1) except that the last seg may have <30 bps
Frame-Shift Detection: Step 2
• Determine relationships among the quantities and the optimization problem, viz.
max_{r ∈ {0, 1, 2}} C(i, r, 1) is the optimal solution
• Can calculate C(i, r, 0) & C(i, r, 1) from C(i–k, r, 0) & C(i–k, r, 1) for some k > 0
Frame-Shift Detection: Step 2, C(i,r,0)
• To calculate C(i, r, 0), there are 3 possible cases for each position i:
– Case 1: no indel occurred at position i
– Case 2: a_{i–1} is an inserted base
– Case 3: a base has been deleted in front of a_i
⇒ C(i, r, 0) = max { Case 1, Case 2, Case 3 }
Frame-Shift Detection: Step 2, Case 1
• No indel occurs at position i. Then
C(i, r, 0) = C(i–1, (2+r) mod 3, 0) + P_{(1+r) mod 3}(a_{i–5} … a_i)
where the second term is the di-codon preference of a_{i–5} … a_i
a_1 a_2 …… a_{i–5} a_{i–4} a_{i–3} a_{i–2} a_{i–1} a_i
Frame-Shift Detection: Step 2, Case 2
• a_{i–1} is an inserted base. Then
C(i, r, 0) = C(i–2, (r+2) mod 3, 1) + P_{(1+r) mod 3}(a_{i–6} … a_{i–2} a_i)
where the second term is the di-codon preference of the 6-mer that skips a_{i–1}
a_1 a_2 …… a_{i–6} a_{i–5} a_{i–4} a_{i–3} a_{i–2} a_{i–1} a_i
Frame-Shift Detection: Step 2, Case 3
• A base has been deleted in front of a_i. Then
C(i, r, 0) = C(i–1, (r+1) mod 3, 1) + P_r(a_{i–5} … a_{i–1} C) + P_{(1+r) mod 3}(a_{i–4} … a_{i–1} C a_i)
where a neutral base “C” is added in place of the deleted base
a_1 a_2 …… a_{i–5} a_{i–4} a_{i–3} a_{i–2} a_{i–1} a_i
Frame-Shift Detection: Step 2, C(i,r,1)
• To calculate C(i, r, 1),
C(i, r, 1) = C(i–30, r, 0) + Σ_{i–30 < j ≤ i–5} P_{(j+r) mod 3}(a_j … a_{j+5})
where the sum is the coding score in frame r, i.e., the summed di-codon preference over a_{i–30+1} … a_i
Exercise: This formula is not quite right. Fix it.
Frame-Shift Detection: Step 2,
Initiation
• Initial conditions,
C(k, r, 0) = –∞, k < 6
C(6, r, 0) = P_{(1+r) mod 3}(a_1 … a_6)
C(i, r, 1) = –∞, i < 30
• This is a dynamic programming (DP) algorithm; the equations are DP recurrences
Frame-Shift Detection: Step 3
• Calculation of max_{r ∈ {0, 1, 2}} C(i, r, 1) gives an optimal segmentation of a DNA sequence
• Tracing back the transition points, viz. case 2 & case 3, gives the segmentation results
[Figure: segmentation of a sequence into frame 0, frame 1, frame 2 segments. Image credit: Xu]
Frame-Shift Detection: Step 4
• Determination of coding regions
– For given H1 and H2 (e.g., 0.25 and 0.75), partition a DNA seq into segs so that each seg has >30 bases & coding values of each seg are consistently closer to one of H1 or H2 than the other
[Figure: coding values with levels H1 and H2, and the segmentation result. Image credit: Xu]
Frame-Shift Detection: Step 5
• Overlaying “preferred reading-frame segs” & “coding segs” gives coding region predictions
[Figure: regions w/ indels. Image credit: Xu]
Frame-Shift Detection: Step 6
• We still need to correct the identified indels...
• If an “insertion” is detected, delete the base at the transition point
• If a “deletion” is detected, add a neutral base “C” at the transition point
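
The correction step itself is simple string surgery, sketched below. The function name `correct_indel` and its arguments (a 0-based transition position and a label for the detected indel type) are assumptions made for illustration; locating the transition points is the job of the segmentation steps above.

```python
def correct_indel(seq: str, position: int, kind: str) -> str:
    """Undo one detected indel at a 0-based transition position."""
    if kind == "insertion":
        return seq[:position] + seq[position + 1:]    # delete the extra base
    if kind == "deletion":
        return seq[:position] + "C" + seq[position:]  # add a neutral base "C"
    raise ValueError("kind must be 'insertion' or 'deletion'")

# The two damaged versions of ATG GAT CCA CAT from the indel slide.
print(correct_indel("ATGGATCTCACAT", 7, "insertion"))
print(correct_indel("ATGGATCACAT", 7, "deletion"))
```

Adding "C" restores the reading frame but not necessarily the original base; in this toy example the deleted base happens to be a C.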
What Happens When Indels Are
Close Together?
• Our procedure works well when indels are not too close together (i.e., >30 bases apart)
• When indels are too close together, they will be missed...
[Figure: actual indels vs predicted indels]
Handling Indels That Are Close
Together
• Employ an iterative process, viz. find one set of indels and correct them, & then iterate until no more indels can be found
[Figure: actual indels, predicted indels, and predicted indels in iteration 2]
Any Question?
Modeling &
Recognition of Histone
Promoters
Some slides here are “borrowed” from Rajesh Chowdhary
Histone
• Basic proteins of eukaryotic cell nucleus
• Form a major part of chromosomal proteins
• Help in packaging DNA in the chromatin complex
• Five types, namely H1, H2A, H2B, H3 and H4
• Highly conserved across species
– H1 least conserved, H3 & H4 most conserved
• Play essential role in chromosomal processes
– gene transcription, regulation,
– chromosome condensation, recombination & replication
Histone Transcription
• TFs bind in core, proximal, distal promoter & enhancer regions
• TFIID binds to TATA box & identifies TSS with help of TAFs & TBP
• RNA Pol-II, supplemented by GTFs (A,B,D,E,F,H), is recruited to core promoter to form pre-initiation complex
• Transcription initiated
– Basal/Activated, depending on space & time
Histone Promoter Modeling
Werner 1999
• Three promoter types: core, proximal and distal
• Characterised by the presence of specific TFBSs
– CAAT box, TATA box, Inr, & DPE
– Order and mutual distance of TFBS modules are specific & determine function
Histone H1t Gene Regulation
Grimes et al. 2003
• One gene can be expressed in diff ways in diff cells
• Same binding site can have diff functions in diff cells
Why Model Histone promoters
• To understand histone’s regulatory mechanism
– To characterise regulatory features from known
promoters
– To identify promoter from uncharacterised genomic
sequence (promoter recognition)
– To find other genes with similar regulatory behaviour
and gene-products
– To define potential gene regulatory networks
Difficulties of Histone Promoter
Modeling
• Not a plain sequence alignment problem
• Not all features are common among different
groups
• Not only TFBSs’ presence, but their location,
order, mutual distance and orientation are
critical to promoter function
• Not all TFs & TFBSs have been characterized
yet
Tools for Promoter Modeling
• Genomic signals in promoter vs. non-promoter
– Core promoter (TATA box, Inr, DPE) and/or a few TFBSs outside core promoter
– Entire promoter (core, proximal & distal) with whole ensemble of TFBSs
• Genomic content in promoter vs. non-promoter
– CpG islands, GC content
• 2D-3D DNA structural features
• Model with a scoring system based on training data (good data not always available)
– Input seq is scanned for desired patterns & those whose scores are above a certain threshold are reported
Promoter Recognition Programs
• Programs have different objectives
• Use various combinations of genomic signals and content
• Typically analyse 5’ region [-1000,+500]
• Due to low accuracy, programs are developed for sub-classes of promoters
Image credit: Rajesh
Steps for Building Histone
Promoter Recognizer
• Exercise: What do you think these steps are?
MEME
• MEME is a powerful method for finding motifs in biological sequences
• T. L. Bailey & C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", ISMB, 2:28-36, 1994
Motifs Discovered by MEME in
Histone Gene 5’ Region [-1000,+500]
[Figure: motifs found in H2A. Image credit: Rajesh]
Motifs Discovered by MEME in
Histone Gene 5’ Region [-1000,+500]
[Figure: motifs found in H2B. Image credit: Rajesh]
Are These Really Motifs of H2A and
H2B Promoters?
• One could use the motifs discovered by MEME to detect H2A & H2B promoters
• But… it is strange that the motifs for H2A and H2B are generally the same, but in opposite orientation
• Exercise: Suggest a possible explanation
[Figure: motifs of H2A vs H2B. Image credit: Rajesh]
The Real Common Promoter
Region of H2A & H2B is at [-250,-1]!
• MEME was overwhelmed by coding region & did not find the right motifs!
[Figure: H2A and H2B. Image credit: Rajesh]
Motifs Discovered by MEME in
Histone Promoter 5’ Region [-250,-1]
• Discovered 9 motifs among all 127 histone promoters
• All 9 motifs are experimentally proven TFBSs (TRANSFAC)
Image credit: Rajesh
Deriving Histone Promoter Models
• Divide H1 seqs into
5 subgroups
• Aligned seqs within
each subgroup
• Consensus alignment matches biologically known H1 subgroup models
⇒ Can apply same approach to find promoter models for H2A, H2B, H3, H4...
Image credit: Rajesh
Any Questions?
Knowledge Discovery
Basics
• Knowledge discovery in brief
• K-Nearest Neighbour
• Support Vector Machines
• Bayesian Approach
• Hidden Markov Models
• Artificial Neural Networks
Some slides are from a tutorial jointly taught with Jinyan Li
What is Knowledge Discovery?
[Figure: Jonathan’s blocks, Jessica’s blocks, and one unlabelled block. Whose block is this?] Image credit: Tan
• Jonathan’s rules: Blue or Circle
• Jessica’s rules: All the rest
What is Knowledge Discovery?
Question: Can you explain how?
Image credit: Tan
Steps of Knowledge Discovery
• Training data gathering
• Feature generation
– k-grams, colour, texture, domain know-how, ...
• Feature selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Feature integration
– Some classifiers/learning methods: SVM, ANN, PCL, CART, C4.5, kNN, ...
Some Knowledge Discovery Methods
• K-Nearest Neighbour
• Support Vector Machines
• Bayesian Approach
• Hidden Markov Models
• Artificial Neural Networks
How kNN Works
• Given a new case
• Find k “nearest” neighbours, i.e., k most similar points in the training data set
• Assign new case to the same class to which most of these neighbours belong
• A common “distance” measure betw samples x and y is
$$d(x, y) = \sqrt{\sum_f (x[f] - y[f])^2}$$
where f ranges over features of the samples
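The kNN procedure can be sketched in a few lines. This is a minimal illustration, assuming the Euclidean distance as the "distance" measure; the function names and the toy training points are made up for the example.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Distance between two samples given as equal-length feature sequences."""
    return math.sqrt(sum((xf - yf) ** 2 for xf, yf in zip(x, y)))

def knn_predict(train, new_case, k):
    """train: list of (features, label). Returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda sample: euclidean(sample[0], new_case))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data (hypothetical): two well-separated classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
prediction = knn_predict(train, (1.1, 1.0), k=3)
```

An odd k avoids ties in a two-class problem; with more classes a tie-breaking rule would be needed.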
Illustration of kNN (k=8)
[Figure: the neighbourhood of the new case holds 5 points of one class and 3 of the other, so the new case is assigned to the majority class] Image credit: Zaki
Prediction of Compound Signature
Based on Gene Expression Profiles
• Hamadeh et al, Toxicological
Sciences 67:232-240, 2002
• Store gene expression
profiles corr to biological
responses to exposures
to known compounds
whose toxicological and
pathological endpoints
are well characterized
• Use kNN to infer effects of unknown compound based on gene expr profiles induced by it
[Figure: compound classes shown are peroxisome proliferators and enzyme inducers]
Basic Idea of SVM
Image credit: Zien
(a) Linear separation not possible w/o errors
(b) Better separation by nonlinear surfaces in input space
(c) Nonlinear surface corr to linear surface in feature space. Map from input to feature space by “kernel” function Φ
⇒ “Linear learning machine” + kernel function as classifier
Linear Learning Machines
• Hyperplane separating the x’s and o’s points is given by $(W \cdot X) + b = 0$, with $(W \cdot X) = \sum_j W[j] \cdot X[j]$
⇒ Decision function is $llm(X) = \mathrm{sign}((W \cdot X) + b)$
Linear Learning Machines
• Solution is a linear combination of training points $X_k$ with labels $Y_k$
$W[j] = \sum_k \alpha_k \cdot Y_k \cdot X_k[j]$, with $\alpha_k > 0$, and $Y_k = \pm 1$
⇒ $llm(X) = \mathrm{sign}(\sum_k \alpha_k \cdot Y_k \cdot (X_k \cdot X) + b)$
“data” appears only in dot product!
Kernel Function
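• $llm(X) = \mathrm{sign}(\sum_k \alpha_k \cdot Y_k \cdot (X_k \cdot X) + b)$
• $svm(X) = \mathrm{sign}(\sum_k \alpha_k \cdot Y_k \cdot (\Phi X_k \cdot \Phi X) + b)$
⇒ $svm(X) = \mathrm{sign}(\sum_k \alpha_k \cdot Y_k \cdot K(X_k, X) + b)$
where $K(X_k, X) = (\Phi X_k \cdot \Phi X)$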
Kernel Function
• $svm(X) = \mathrm{sign}(\sum_k \alpha_k \cdot Y_k \cdot K(X_k, X) + b)$
⇒ $K(A, B)$ can be computed w/o computing $\Phi$
• In fact, can replace it w/ lots of more “powerful” kernels besides $(A \cdot B)$. E.g.,
– $K(A, B) = (A \cdot B)^d$
– $K(A, B) = \exp(-\|A - B\|^2 / (2 \cdot \sigma))$, ...
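To see why a kernel can be evaluated without computing the feature map, the sketch below compares the degree-2 polynomial kernel with an explicit map into feature space for 2-D inputs. The explicit map `phi2` and the sample vectors are illustrative choices, and `sigma` follows the slide's form of the Gaussian kernel.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b, d=2):
    """K(A, B) = (A . B)^d, computed entirely in input space."""
    return dot(a, b) ** d

def gaussian_kernel(a, b, sigma=1.0):
    """K(A, B) = exp(-||A - B||^2 / (2 * sigma)), as written on the slide."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma))

def phi2(a):
    """Explicit feature map matching the degree-2 polynomial kernel on 2-D inputs."""
    a1, a2 = a
    return (a1 * a1, math.sqrt(2) * a1 * a2, a2 * a2)

A, B = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(A, B, d=2)      # never leaves input space
explicit = dot(phi2(A), phi2(B))       # same value via the feature space
```

Both routes give the same number; the kernel route simply skips building the feature vectors, which matters when the feature space is large.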
How SVM Works
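• $svm(X) = \mathrm{sign}(\sum_k \alpha_k \cdot Y_k \cdot K(X_k, X) + b)$
• To find $\alpha_k$ is a quadratic programming problem
max: $\sum_k \alpha_k - 0.5 \cdot \sum_k \sum_h \alpha_k \cdot \alpha_h \cdot Y_k \cdot Y_h \cdot K(X_k, X_h)$
subject to: $\sum_k \alpha_k \cdot Y_k = 0$
and for all $\alpha_k$, $C \geq \alpha_k \geq 0$
• To find b, estimate by averaging
$Y_h - \sum_k \alpha_k \cdot Y_k \cdot K(X_h, X_k)$
for all $\alpha_h \neq 0$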
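The pieces of the dual formulation can be checked on a problem small enough to solve by hand. The sketch below does not solve the quadratic program; it assumes a two-point training set with a linear kernel, for which the optimal multipliers are 0.5 each, and then evaluates the objective, the estimate of b, and the decision function. All names are illustrative.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_objective(alphas, X, Y, K):
    """Sum of alphas minus 0.5 * the double sum over alpha, label and kernel terms."""
    n = len(X)
    quad = sum(alphas[k] * alphas[h] * Y[k] * Y[h] * K(X[k], X[h])
               for k in range(n) for h in range(n))
    return sum(alphas) - 0.5 * quad

def estimate_b(alphas, X, Y, K):
    """Average of Y_h - sum_k alpha_k * Y_k * K(X_h, X_k) over points with nonzero alpha."""
    support = [h for h, a in enumerate(alphas) if a != 0]
    residuals = [Y[h] - sum(alphas[k] * Y[k] * K(X[h], X[k]) for k in range(len(X)))
                 for h in support]
    return sum(residuals) / len(residuals)

def svm(x, alphas, X, Y, K, b):
    score = sum(alphas[k] * Y[k] * K(X[k], x) for k in range(len(X))) + b
    return 1 if score >= 0 else -1

# Two 1-D training points, one per class (toy data).
X = [(1.0,), (-1.0,)]
Y = [1, -1]
alphas = [0.5, 0.5]    # hand-derived optimum for this toy problem
b = estimate_b(alphas, X, Y, linear_kernel)
objective = dual_objective(alphas, X, Y, linear_kernel)
```

In practice the multipliers come from a quadratic programming solver; only the evaluation steps are shown here.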
Prediction of Gene Function From
Gene Expression Data Using SVM
• Brown et al., PNAS 97:262-267, 2000
• Use SVM to identify sets of genes w/ a common function based on their expression profiles
• Use SVM to predict
functional roles of
uncharacterized yeast
ORFs based on their
expression profiles
Bayes Theorem
$$P(h \mid d) = \frac{P(d \mid h) \cdot P(h)}{P(d)}$$
• P(h) = prior prob that hypothesis h holds
• P(d|h) = prob of observing data d given h holds
• P(h|d) = posterior prob that h holds given observed data d
Bayesian Approach
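• Let H be all possible classes. Given a test instance w/ feature vector {f1 = v1, …, fn = vn}, the most probable classification is given by
$$\arg\max_{h_j \in H} P(h_j \mid f_1 = v_1, \ldots, f_n = v_n)$$
• Using Bayes Theorem, rewrites to
$$\arg\max_{h_j \in H} \frac{P(f_1 = v_1, \ldots, f_n = v_n \mid h_j) \cdot P(h_j)}{P(f_1 = v_1, \ldots, f_n = v_n)}$$
• Since denominator is indep of hj, simplifies to
$$\arg\max_{h_j \in H} P(f_1 = v_1, \ldots, f_n = v_n \mid h_j) \cdot P(h_j)$$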
Naïve Bayes
• But estimating P(f1=v1, …, fn=vn|hj) accurately may not be feasible unless training data set is sufficiently large
• “Solved” by assuming f1, …, fn are indep given the class
• Then
$$\arg\max_{h_j \in H} P(h_j) \cdot \prod_i P(f_i = v_i \mid h_j)$$
• where P(hj) and P(fi=vi|hj) can often be estimated reliably from typical training data set
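A minimal sketch of naïve Bayes on categorical features follows: the prior of each class and the per-feature conditional probabilities are estimated as simple frequency ratios, and the class with the largest product wins. No smoothing is applied, so an unseen value zeroes out a class; the block data and all names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (feature_dict, label). Collects the counts behind P(h) and P(f=v|h)."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)    # (label, feature) -> Counter of values
    for features, label in samples:
        for f, v in features.items():
            value_counts[(label, f)][v] += 1
    return class_counts, value_counts

def classify(features, class_counts, value_counts):
    """Return the class maximising P(h) * product over features of P(f=v|h)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total
        for f, v in features.items():
            score *= value_counts[(label, f)][v] / n
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical blocks, loosely following the Jonathan/Jessica example.
samples = [
    ({"colour": "blue", "shape": "circle"}, "jonathan"),
    ({"colour": "blue", "shape": "square"}, "jonathan"),
    ({"colour": "red", "shape": "circle"}, "jonathan"),
    ({"colour": "red", "shape": "square"}, "jessica"),
    ({"colour": "green", "shape": "square"}, "jessica"),
    ({"colour": "red", "shape": "triangle"}, "jessica"),
]
model = train_naive_bayes(samples)
owner = classify({"colour": "blue", "shape": "circle"}, *model)
```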
Bayesian Design of Screens for
Macromolecular Crystallization
• Hennessy et al., Acta Cryst
D56:817-827, 2000
• Xtallization of proteins requires search of expt settings to find right conditions for diffraction-quality xtals
• BMCD is a db of known
xtallization conditions
• Use Bayes to determine
prob of success of a set
of expt conditions based
on BMCD
How HMM Works
• HMM is a stochastic
generative model for
sequences
• Defined by
– finite set of states S
– finite alphabet A
– transition prob matrix T
– emission prob matrix E
• Move from state to state according to T while emitting symbols according to E
[Figure: states s1, s2, …, sk emitting symbols a1, a2, …]
How HMM Works
• In nth order HMM, T & E depend on all n previous states
• E.g., for 1st order HMM, given emissions X = x1, x2, …, & states S = s1, s2, …, the prob of this seq is
$$Prob(X, S) = \prod_i Prob(x_i \mid s_i) \cdot Prob(s_i \mid s_{i-1})$$
• If seq of emissions X is given, use Viterbi algo to get seq of states S such that
S = argmaxS Prob(X, S)
• If emission & transition probs are unknown, use Baum-Welch algo to estimate them
Example: Dishonest Casino
• Casino has two dice:
– Fair die
• P(i) = 1/6, i = 1..6
– Loaded die
• P(i) = 1/10, i = 1..5
• P(i) = 1/2, i = 6
• Casino switches betw fair & loaded die with prob 1/2. Initially, die is always fair
• Game:
– You bet $1
– You roll
– Casino rolls
– Highest number wins $2
• Question: Suppose we played 2 games, and the sequence of rolls was 1, 6, 2, 6. Were we likely to be cheated?
“Visualization” of Dishonest Casino
[Figure: state diagram of the casino HMM applied to the rolls 1, 6, 2, 6]
We were probably cheated...
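The casino question can be answered with a small Viterbi implementation. The sketch below encodes the two dice as states "F" (fair) and "L" (loaded), using the probabilities given in the example; the state names and the dictionary layout are choices made here for illustration. Plain probabilities are used instead of logs because the sequence is short.

```python
def viterbi(observations, states, start, trans, emit):
    """Return (most probable state path, its joint probability Prob(X, S))."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for x in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][x], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob

states = ["F", "L"]
start = {"F": 1.0, "L": 0.0}                      # the die is initially fair
trans = {"F": {"F": 0.5, "L": 0.5}, "L": {"F": 0.5, "L": 0.5}}
emit = {
    "F": {i: 1 / 6 for i in range(1, 7)},
    "L": {**{i: 1 / 10 for i in range(1, 6)}, 6: 1 / 2},
}
path, prob = viterbi([1, 6, 2, 6], states, start, trans, emit)
```

Under these parameters the most probable explanation uses the loaded die for both 6s, which matches the conclusion that we were probably cheated.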
Protein Families Modelling By HMM
• Baldi et al., PNAS 91:1059-1063, 1994
• HMM is used to model
families of biological
sequences, such as
kinases, globins, &
immunoglobulins
• Bateman et al., NAR
32:D138-D141, 2004
• HMM is used to model
6190 families of protein
domains in Pfam
What are ANNs?
• ANNs are highly connected networks of “neural
computing elements” that have ability to respond to
input stimuli and learn to adapt to the environment...
Computing Element
• Behaves as a monotone function y = f(net), where net is cumulative input stimuli to the neuron
• net is usually defined as weighted sum of inputs
• f is usually a sigmoid
How ANN Works
• Computing elements are
connected into layers in
a network
• The network is used for
classification as follows:
– Inputs xi are fed into input
layer
– each computing element
produces its corr output
– which are fed as inputs to
next layer, and so on
– until outputs are produced
at output layer
• What makes ANN work is how the weights on the links are learned
• Usually achieved using “back propagation”
Back Propagation
• vij = weight on link betw xi and jth computing element in 1st layer
• wj = weight of link betw jth computing element in 1st layer and computing element in last layer
• zj = output of jth computing element in 1st layer
• Then
$$z_j = f\Big(\sum_i v_{ij} \cdot x_i\Big), \qquad y = f\Big(\sum_j w_j \cdot z_j\Big)$$
[Figure: network diagram labelled with vij, wj, zj]
Back Propagation
• For given sample, y may differ from target output t by amt δ
• Need to propagate this error backwards by adjusting weights in proportion to the error gradient
• For math convenience, define the squared error as
$$E = \frac{1}{2} (t - y)^2$$
⇒ To find an expression for weight adjustment, we differentiate E wrt vij and wj to obtain error gradients for these weights
[Figure: network diagram labelled with vij, wj, zj]
Applying chain rule a few times and recalling definitions of y, zj, E, and f, we derive...
Back Propagation
[Figure: network diagram labelled with vij, wj, zj]
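The chain-rule derivation can be made concrete for a network with one hidden layer and a single output. The sketch below assumes a sigmoid f, no bias terms, and the squared error E = 0.5 * (t - y)^2; the gradient expressions are what the chain rule gives under those assumptions, not a transcription of the slide. The learning rate and the starting weights are arbitrary.

```python
import math

def f(net):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, v, w):
    """v[i][j]: weight from input i to hidden element j; w[j]: hidden j to output."""
    z = [f(sum(v[i][j] * x[i] for i in range(len(x)))) for j in range(len(w))]
    y = f(sum(w[j] * z[j] for j in range(len(w))))
    return z, y

def error(x, t, v, w):
    _, y = forward(x, v, w)
    return 0.5 * (t - y) ** 2

def gradients(x, t, v, w):
    """dE/dw_j and dE/dv_ij, using f'(net) = f(net) * (1 - f(net)) for the sigmoid."""
    z, y = forward(x, v, w)
    delta_out = -(t - y) * y * (1 - y)
    grad_w = [delta_out * z[j] for j in range(len(w))]
    grad_v = [[delta_out * w[j] * z[j] * (1 - z[j]) * x[i] for j in range(len(w))]
              for i in range(len(x))]
    return grad_w, grad_v

def train_step(x, t, v, w, rate=0.5):
    """Move each weight against its error gradient."""
    grad_w, grad_v = gradients(x, t, v, w)
    new_w = [w[j] - rate * grad_w[j] for j in range(len(w))]
    new_v = [[v[i][j] - rate * grad_v[i][j] for j in range(len(w))]
             for i in range(len(x))]
    return new_v, new_w

x, t = [1.0, 0.5], 1.0
v = [[0.1, -0.2], [0.4, 0.3]]
w = [0.2, -0.1]
v_next, w_next = train_step(x, t, v, w)
```

Repeating `train_step` over many samples is the usual training loop; a finite-difference check of `gradients` is a cheap way to catch derivation mistakes.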
T-Cell Epitopes Prediction By ANN
• Honeyman et al., Nature
Biotechnology 16:966-969,
1998
• Use ANN to predict
candidate T-cell epitopes
Image credit: Brusic
Any Questions?
Translation Initiation
Site Recognition
An introduction to the World’s
simplest TIS recognition system
Translation Initiation Site
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• What makes the second ATG the TIS?
Approach
• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen
[ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Use for 3-fold x-validation expts
Signal Generation
• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Up-stream, downstream vs. anywhere in window
– In-frame vs. any frame
[Figure: bar chart of the counts (0 to 3) of A, C, G, T in seq1, seq2, seq3]
Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• Window = 100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1…
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2…
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
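The counts in this example can be reproduced with a short script. The sketch below uses the four sequence lines as shown (239 bases), takes the second ATG as the TIS, and assumes that the downstream window starts right after the ATG, that any-frame counting allows overlapping k-grams, and that upstream codons are aligned to end just before the ATG. The helper name `kgrams` is made up here.

```python
from collections import Counter

# The four sequence lines of the sample cDNA as shown (239 bases).
CDNA = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)
TIS = 92        # 0-based index of the second ATG
WINDOW = 100    # window size in bases

def kgrams(window, k=3, in_frame=False, offset=0):
    """Count k-grams; in-frame counting steps codon by codon starting at `offset`."""
    start, step = (offset, 3) if in_frame else (0, 1)
    return Counter(window[i:i + k] for i in range(start, len(window) - k + 1, step))

downstream = CDNA[TIS + 3:TIS + 3 + WINDOW]
upstream = CDNA[max(0, TIS - WINDOW):TIS]    # only 92 bases are available upstream

down_in = kgrams(downstream, in_frame=True)
down_any = kgrams(downstream)
# Align upstream codons so that the last one ends right before the ATG.
up_in = kgrams(upstream, in_frame=True, offset=len(upstream) % 3)
```

Each (k-gram, region, frame) combination then becomes one numeric feature of the candidate ATG.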
Too Many Signals
• For each value of k, there are $4^k \cdot 3 \cdot 2$ k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class dist
• Choose a signal w/ high inter-class dist
Image credit: Slonim
Signal Selection (eg., t-statistics)
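One way to score a signal with a t-statistic is sketched below. It assumes the common two-sample form, the absolute difference of class means divided by the square root of the summed per-class variance over sample size, which may differ in detail from the formula on the original slide. The signal names and values are invented.

```python
import math
from statistics import mean, variance

def t_statistic(pos, neg):
    """Two-sample t-statistic between a signal's values in the two classes."""
    std_err = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return abs(mean(pos) - mean(neg)) / std_err

def rank_signals(table, labels):
    """table: {signal name: values, one per sample}; labels: 1 = TIS, 0 = non-TIS."""
    scores = {}
    for name, values in table.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = t_statistic(pos, neg)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical counts for six candidate ATGs.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "in_frame_upstream_ATG": [0, 0, 1, 2, 3, 2],   # separates the classes well
    "any_frame_GCT": [1, 2, 3, 1, 2, 3],           # same distribution in both classes
}
ranking = rank_signals(table, labels)
```

A signal scores high when its class means are far apart (high inter-class distance) and its values vary little within each class (low intra-class distance).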
Signal Selection (eg., χ²)
Signal Selection (eg., CFS)
• Instead of scoring individual signals, how about
scoring a group of signals as a whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly
correlated with the class, and yet uncorrelated with
each other
Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias?)
Signal Integration
• kNN
– Given a test sample, find the k training samples that
are most similar to it. Let the majority class win.
• SVM
– Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes.
• Naïve Bayes, ANN, C4.5, ...
Results (3-fold x-validation)
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
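The four columns of the results table are standard ratios of confusion-matrix counts. A small helper makes the definitions explicit; the counts in the example are hypothetical and are not taken from the experiments.

```python
def metrics(tp, fn, tn, fp):
    """Compute the four reported measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                  # TP/(TP + FN)
        "specificity": tn / (tn + fp),                  # TN/(TN + FP)
        "precision": tp / (tp + fp),                    # TP/(TP + FP)
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical counts: 100 true TIS and 300 non-TIS ATGs.
m = metrics(tp=80, fn=20, tn=270, fp=30)
```

Because non-TIS ATGs outnumber TIS about three to one in the dataset, accuracy alone can look good even when precision is modest, which is why all four measures are reported.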
Improvement by Scanning
• Apply Naïve Bayes or SVM left-to-right until first ATG
predicted as positive. That’s the TIS.
• Naïve Bayes & SVM models were trained using TIS vs.
Up-stream ATG
               TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB             84.3%          86.1%          66.3%          85.7%
SVM            73.9%          93.2%          77.9%          88.5%
NB+Scanning    87.3%          96.1%          87.9%          93.9%
SVM+Scanning   88.5%          96.3%          88.6%          94.4%
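The scanning rule is simple to express in code: walk the sequence left to right and stop at the first ATG that the classifier accepts. In the sketch below `is_tis` stands for any trained model (naïve Bayes, SVM, ...); `toy_classifier` is a hypothetical stand-in that only checks for a purine at position -3, and the test sequence is made up.

```python
import re

def scan_for_tis(seq, is_tis):
    """Return the 0-based index of the first ATG accepted by the classifier, else None."""
    for match in re.finditer("ATG", seq):
        if is_tis(seq, match.start()):
            return match.start()
    return None

def toy_classifier(seq, pos):
    """Hypothetical stand-in for a trained model: require A or G at position -3."""
    return pos >= 3 and seq[pos - 3] in "AG"

seq = "CCTATGCCGCCACCATGGCG"    # ATGs at indices 3 and 14
tis = scan_for_tis(seq, toy_classifier)
```

Stopping at the first positive mirrors the ribosome scanning model, and it removes all downstream ATGs from consideration, which is one plausible reason why specificity improves in the table above.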
Performance Comparisons
                   TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
NB                 84.3%          86.1%          66.3%          85.7%
Decision Tree      74.0%          94.4%          81.1%          89.4%
NB+NN+Tree         77.6%          94.5%          82.1%          90.4%
SVM+Scanning       88.5%          96.3%          88.6%          94.4%*
Pedersen&Nielsen   78%            87%            -              85%
Zien               69.9%          94.1%          -              88.1%
Hatzigeorgiou      -              -              -              94%*
* result not directly comparable
mRNA → protein
[Figure: genetic code table translating mRNA codons into amino acids and stop]
How about using k-grams
from the translation?
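Generating k-grams from the translation takes two steps: translate the cDNA in the frame of the candidate ATG, then count amino-acid k-grams. The sketch below builds the standard genetic code from the usual TCAG ordering; the function names are illustrative, and "X" marks any codon containing a letter outside A, C, G, T.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, "*" marks a stop codon.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(cdna, start=0):
    """Translate codon by codon from `start`; a trailing partial codon is ignored."""
    codons = (cdna[i:i + 3] for i in range(start, len(cdna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def amino_kgrams(protein, k):
    """Count overlapping amino-acid k-grams."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))

protein = translate("ATGGCTTTTGGCTGTCAG")    # first codons after the TIS of the sample cDNA
features = amino_kgrams(protein, 2)
```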
Amino-Acid Features
[Figure: amino-acid features, shown over two slides] Image credit: Liu
Amino Acid K-grams Discovered
(by entropy)
Independent Validation Sets
• A. Hatzigeorgiou:
– 480 fully sequenced human cDNAs
– 188 left after eliminating sequences similar to
training set (Pedersen & Nielsen’s)
– 3.42% of ATGs are TIS
• Our own:
– well characterized human gene sequences from
chromosome X (565 TIS) and chromosome 21 (180
TIS)
Validation Results (on
Hatzigeorgiou’s)
• Using top 100 features selected by entropy
and trained on Pedersen & Nielsen’s dataset
Validation Results
(on Chr X & Chr 21)
• Using top 100 features selected by entropy and
trained on Pedersen & Nielsen’s
[Figure: results of our method compared with ATGpr] Image credit: Liu
Any Questions?
Human
Polyadenylation Signal
Prediction
Some slides are “borrowed” from Huiqing Liu
Cleavage & Polyadenylation of Pre-mRNAs in Mammalian Cells
PAS in Human Pre-mRNA 3’end
Processing Site
• Selection of poly-A site is primarily determined
by a hexameric poly-A signal (PAS) of
sequence AAUAAA (or a one-base variant) &
downstream U-rich or GU-rich elements
PAS Prediction: Step 1
[Flowchart: BEGIN → incoming sequences → feature generation → feature selection → feature integration → END]
PAS Prediction: Step 2
[Flowchart: BEGIN → incoming sequences → feature generation → feature selection → feature integration → END]
• Entropy measure for feature selection (Fayyad & Irani, IJCAI, 1993)
• Discard all the features without cut points.
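The "no cut point" rule can be illustrated with a simplified version of entropy-based discretization. The sketch below finds the single binary cut with the highest information gain and accepts it only if it passes the minimum description length test of Fayyad & Irani; the full method also recurses into the two halves, which is omitted here. The function names and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return an accepted cut point for one numeric feature, or None if there is none."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    n, ent_s, k = len(ys), entropy(ys), len(set(ys))
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # cannot cut between equal values
        left, right = ys[:i], ys[i:]
        ent_l, ent_r = entropy(left), entropy(right)
        gain = ent_s - (len(left) * ent_l + len(right) * ent_r) / n
        if best is None or gain > best[0]:
            best = (gain, i, ent_l, ent_r, len(set(left)), len(set(right)))
    if best is None:
        return None
    gain, i, ent_l, ent_r, k_l, k_r = best
    # MDL acceptance test: the gain must exceed the cost of encoding the cut.
    delta = math.log2(3 ** k - 2) - (k * ent_s - k_l * ent_l - k_r * ent_r)
    if gain <= (math.log2(n - 1) + delta) / n:
        return None
    return (pairs[i - 1][0] + pairs[i][0]) / 2

values = [1, 2, 3, 4, 5, 6, 7, 8]
informative = best_cut(values, [0, 0, 0, 0, 1, 1, 1, 1])   # classes split cleanly
noisy = best_cut(values, [0, 1, 0, 1, 0, 1, 0, 1])         # classes alternate
```

A feature for which `best_cut` returns `None` carries too little class information to justify any partition, so it is discarded.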
PAS Prediction: Step 3
[Flowchart: BEGIN → incoming sequences → feature generation → feature selection → feature integration → END]
• Feature integration: SVM in Weka
Data Set I (From Erpin)
• Training set: 2327 seqs
– 1632 “unique” & 695
“strong” poly(A) sites
• All seq trimmed to
contain 206 bases,
having a false or true
PAS in the center
• Positive testing set:
– 982 seq w/ annotated
PASes from EMBL
• Negative testing set:
– 982 CDS seqs
– 982 seqs of 1st intron
– 982 randomized UTR
seqs using same 1st order
Markov model as human
3’ UTRs
– 982 randomized UTR
seqs of same mono
nucleotide composition as
human 3’ UTRs
Data Set II (mRNA data)
• Positive set:
– 312 human mRNA seqs
from RefSeq release 1
– Each contains a “poly(A)-signal” feature tag
carrying an
“evidence=experimental”
label
– 767 human mRNA sequences
from RefSeq containing a
“poly(A)-site” feature tag carrying
an “evidence=experimental”
label. Similar sequences have
been removed
• Negative set:
– Generated by scanning
“AATAAA” at coding
region (exclude those
near the end of seq)
Experimental Results
• Preliminary test: To compare with the performance of Erpin and Polyadq, we also adjust our prediction accuracy on the 982 true PASes (i.e., sensitivity) to around 56%.
Experimental Results
• Testing results on Erpin using validation sets
All the numbers regarding the performance of Erpin and Polyadq are copied or derived from Legendre & Gautheret 2003
Experimental Results
• Top ranked features
It is clear that both upstream and downstream regions are well characterized by G/U rich segments (consistent w/ reported motifs)
Any Questions?
Recognition of
Transcription Start
Sites
An introduction to the World’s
best TSS recognition system:
A heavy tuning approach
Transcription Start Site
Structure of Dragon Promoter Finder
[Figure: system structure; window size is -200 to +50; model is selected based on desired sensitivity] Image credit: Bajic
Each model has two submodels
based on GC content
• GC-rich submodel
• GC-poor submodel
$$(C+G) = \frac{\#C + \#G}{\text{Window Size}}$$
Image credit: Bajic
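Computing the C+G fraction of a window and choosing a submodel from it can be sketched as follows. The threshold of 0.5 is a hypothetical placeholder; the value that separates the GC-rich and GC-poor submodels in the actual system is not given here.

```python
def gc_content(window):
    """(C+G) = (#C + #G) / window size."""
    window = window.upper()
    return (window.count("C") + window.count("G")) / len(window)

def pick_submodel(window, threshold=0.5):
    # threshold is a hypothetical placeholder, not the tuned value of the original system
    return "GC-rich" if gc_content(window) >= threshold else "GC-poor"

choice = pick_submodel("GGCCAT")
```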
Data Analysis Within Submodel
[Figure: within each submodel, promoter (p), exon (e), and intron (i) sensors, each a k-gram (k = 5) positional weight matrix] Image credit: Bajic
Promoter, Exon, Intron Sensors
• These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)
• They are calculated as σ below using promoter, exon, intron data respectively
[Formula for σ not recovered. Its annotated terms are: the pentamer at the ith position in the input; the window size; the jth pentamer at the ith position in the training window; and the frequency of the jth pentamer at the ith position in the training window]
Data Preprocessing & ANN
• Simple feedforward ANN trained by the Bayesian regularisation method
[Figure: ANN diagram with labels: tuning parameters; signals sE, sI, sIE; weights wi; activation tanh(net); tuned threshold]
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad net = \sum_i s_i \cdot w_i$$
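The output computation named on this slide, a weighted sum passed through tanh and compared with a threshold, can be sketched for a single computing element. This is a simplification: the actual system is a feedforward network whose layout, weights, and threshold are tuned and not given here, so every number below is hypothetical.

```python
import math

def ann_output(signals, weights):
    """tanh(net), with net = sum of s_i * w_i."""
    net = sum(s * w for s, w in zip(signals, weights))
    return math.tanh(net)

def predict_tss(signals, weights, threshold):
    """Report a TSS when the output exceeds the (tuned) threshold."""
    return ann_output(signals, weights) > threshold

# Hypothetical signal values and weights.
signals = [1.0, 0.5, -0.5]
weights = [0.2, 0.4, 0.6]
output = ann_output(signals, weights)
```

Raising the threshold trades sensitivity for fewer false positives, which fits the idea of selecting a model based on the desired sensitivity.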
Accuracy Comparisons
[Figure: accuracy with C+G submodels vs. without C+G submodels] Image credit: Bajic
Any Questions?
Acknowledgements
I “borrowed” a lot of materials in this lecture from
• Xu Ying, Univ of Georgia
• Mark Craven, Univ of Wisconsin
• Ken Sung, NUS
• Rajesh Chowdhary, I2R
• Jinyan Li, I2R
• Huiqing Liu, I2R
Primary References
• Y. Xu et al. “GRAIL: A Multi-agent neural network system for gene
identification”, Proc. IEEE, 84:1544--1552, 1996
• R. Staden & A. McLachlan, “Codon preference and its use in
identifying protein coding regions in long DNA sequences”, NAR,
10:141--156, 1982
• Y. Xu, et al., "Correcting Sequencing Errors in DNA Coding
Regions Using Dynamic Programming", Bioinformatics, 11:117--124, 1995
• Y. Xu, et al., "An Iterative Algorithm for Correcting DNA
Sequencing Errors in Coding Regions", JCB, 3:333--344, 1996
• R. Chowdhary et al., “Modeling 5' regions of histone genes using
Bayesian Networks”, APBC 2005, accepted
• H. Liu et al., "Data Mining Tools for Biological Sequences", JBCB,
1:139--168, 2003
• H. Liu et al., "An in-silico method for prediction of polyadenylation
signals in human sequences", GIW, 14:84--93, 2003
• V. B. Bajic et al., "Dragon Gene Start Finder: An advanced system
for finding approximate locations of the start of gene
transcriptional units", Genome Research, 13:1923--1929, 2003
Other Useful Readings
• L. Wong. The Practical Bioinformatician. World Scientific, 2004
• T. Jiang et al. Current Topics in Computational Molecular Biology.
MIT Press, 2002
• R. V. Davuluri et al., "Computational identification of promoters
and first exons in the human genome", Nat. Genet., 29:412--417,
2001
• J. E. Tabaska et al., "Identifying the 3'-terminal exon in human
DNA", Bioinformatics, 17:602--607, 2001
• J. E. Tabaska et al., "Detection of polyadenylation signals in
human DNA sequences", Gene, 231:77--86, 1999
• A. G. Pedersen & H. Nielsen, “Neural network prediction of
translation initiation sites in eukaryotes”, ISMB, 5:226--233, 1997
• C. Burge & S. Karlin. “Prediction of Complete Gene Structures in
Human Genomic DNA”, JMB, 268:78--94, 1997
• V. Solovyev et al. "Predicting internal exons by oligonucleotide
composition and discriminant analysis of spliceable open reading
frames", NAR, 22:5156--5163, 1994
• V. Solovyev & A. Salamov. “The Gene-Finder computer tools for
analysis of human and model organisms genome sequences",
ISMB, 5:294--302, 1997
• T. A. Down & T. J. P. Hubbard. “Computational Detection and
Location of Transcription Start Sites in Mammalian Genomic
DNA”, Genome Research, 12:458--461, 2002
• T. L. Bailey & C. Elkan, "Fitting a mixture model by expectation
maximization to discover motifs in biopolymers", ISMB, 2:28--36,
1994
• A. Zien et al., “Engineering support vector machine kernels that
recognize translation initiation sites”, Bioinformatics, 16:799--807,
2000
• A. G. Hatzigeorgiou, “Translation initiation start prediction in human
cDNAs with high accuracy”, Bioinformatics, 18:343--350, 2002
• V.B.Bajic et al., “Computer model for recognition of functional
transcription start sites in RNA polymerase II promoters of
vertebrates”, J. Mol. Graph. & Mod., 21:323--332, 2003
• J.W.Fickett & A.G.Hatzigeorgiou, “Eukaryotic promoter recognition”,
Genome Research, 7:861--878, 1997
• A.G.Pedersen et al., “The biology of eukaryotic promoter prediction: a review”, Computers & Chemistry, 23:191--207, 1999
• M.Scherf et al., “Highly specific localisation of promoter regions in
large genome sequences by PromoterInspector”, JMB, 297:599--606, 2000
• M. A. Hall, “Correlation-based feature selection for machine learning”,
PhD thesis, Univ. of Waikato, New Zealand, 1998
• U. M. Fayyad, K. B. Irani, “Multi-interval discretization of
continuous-valued attributes”, IJCAI, 13:1022--1027, 1993
• H. Liu, R. Setiono, “Chi2: Feature selection and discretization of
numeric attributes”, IEEE Intl. Conf. Tools with Artificial
Intelligence, 7:388--391, 1995
• C. P. Joshi et al., “Context sequences of translation initiation
codon in plants”, PMB, 35:993--1001, 1997
• D. J. States, W. Gish, “Combined use of sequence similarity and
codon bias for coding region identification”, JCB, 1:39--50, 1994
• G. D. Stormo et al., “Use of Perceptron algorithm to distinguish
translational initiation sites in E. coli”, NAR, 10:2997--3011, 1982
• Legendre & Gautheret, “Sequence determinants in human
polyadenylation site selection”, BMC Genomics, 4(1):7, 2003