Gene Prediction: Statistical Approaches

Genome Annotation
Haixu Tang
School of Informatics
Genome and genes
• Genome: an organism’s genetic material (Car encyclopedia)
• Gene: a discrete unit of hereditary information located on the
chromosomes and consisting of DNA. (Chapters on how to make the
components of a car, or how to use and drive a car).
Gene Prediction: Computational Challenge
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa
tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc
taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt
taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa
tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga
tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc
gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat
ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct
gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc
gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat
gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga
tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc
gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc
ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat
gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct
aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat
ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc
tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat
gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc
ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg
cggctatgctaatgcatgcggctatgctaagctcatgcgg
Gene!
Gene Prediction: Computational Challenge
• Gene: A sequence of nucleotides coding
for protein
• Gene Prediction Problem: Determine the
beginning and end positions of genes in a
genome
Central Dogma: DNA -> RNA -> Protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
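The DNA -> RNA -> protein example above can be reproduced in a few lines. This is a minimal sketch that assumes the standard genetic code; the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative and do not come from the slides.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G ordering
# ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Transcription as the slide shows it: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an RNA string codon by codon, starting at its first base."""
    dna = rna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The table holds 64 entries for only 20 amino acids plus stop, which is the redundancy that the next slides discuss.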
Translating Nucleotides into Amino Acids
• Codon: 3 consecutive nucleotides
• 4^3 = 64 possible codons
• Genetic code is degenerate and redundant
– Includes start and stop codons
– An amino acid may be coded by more than
one codon (codon degeneracy)
Codons
• In 1961 Sydney Brenner and Francis Crick
discovered frameshift mutations
• Systematically deleted nucleotides from DNA
– Single and double deletions dramatically
altered protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides, each
codon, codes for exactly one amino acid in a
protein
Genetic Code and Stop Codons
UAA, UAG and UGA are the 3 stop codons that (together with the start
codon AUG, written ATG in DNA) delineate Open Reading Frames
Six Frames in a DNA Sequence
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
• Stop codons: TAA, TAG, TGA
• Start codon: ATG
Open Reading Frames (ORFs)
• Detect potential coding regions by looking at ORFs
– A genome of length n is comprised of (n/3) codons
– Stop codons break genome into segments
between consecutive Stop codons
– The subsegments of these that start from the Start
codon (ATG) are ORFs
• ORFs in different frames may overlap
[Figure: an open reading frame on the genomic sequence, running from ATG to TGA]
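The ORF definition above can be turned into a short scanner over all six frames. This is a minimal sketch, not a production gene finder: the function names and the `min_codons` parameter are illustrative, only the first ATG after each stop is reported (the longest ORF in that segment), and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(dna):
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) for ORFs in one frame; end is exclusive and
    includes the stop codon."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == START_CODON:
            start = i                      # first ATG since the last stop
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None                   # look for the next ORF


def find_orfs(dna, min_codons=0):
    """Scan 3 forward and 3 reverse frames; return
    (strand, frame, start, end, sequence) tuples."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                if (end - start) // 3 >= min_codons:
                    results.append((strand, frame, start, end, seq[start:end]))
    return results


print(find_orfs("CCATGAAATGAGG"))
# [('+', 2, 2, 11, 'ATGAAATGA')]
```

Raising `min_codons` applies the length filter discussed on the following slide.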
Long vs. Short ORFs
• A long open reading frame may be a gene
– At random, we should expect one stop codon
every (64/3) ~= 21 codons
– However, genes are usually much longer
than this
• A basic approach is to scan for ORFs whose
length exceeds a certain threshold
– This is naïve because some genes (e.g. some
neural and immune system genes) are
relatively short
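The 64/3 figure above follows from treating each codon as an independent draw in which 3 of the 64 possibilities are stops. The sketch below makes that arithmetic explicit under the assumption of uniform, independent bases; real genomes deviate from this, and the function names are illustrative.

```python
# Assumption: all 64 codons are equally likely and independent,
# so each codon is a stop with probability 3/64.
P_STOP = 3 / 64


def expected_codons_between_stops():
    """Mean of a geometric distribution with success probability P_STOP."""
    return 1 / P_STOP


def prob_no_stop(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - P_STOP) ** n_codons


print(round(expected_codons_between_stops(), 1))  # 21.3
print(prob_no_stop(100))  # chance of a stop-free run of 100 codons (300 bp)
```

Under this model a stop-free stretch of 100 codons arises by chance less than 1% of the time, which is why a length threshold filters out most random ORFs while still missing genuinely short genes.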
Testing ORFs: Codon Usage
• Create a 64-element hash table and count
the frequencies of codons in an ORF
• Amino acids typically have more than one
codon, but in nature certain codons are
used more often than others
• Uneven use of the codons may
characterize a real gene
• This compensates for the pitfalls of the ORF
length test
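Counting codons in an ORF needs nothing more than a hash table. The sketch below uses Python's `Counter` for that; the function names are illustrative, and the per-thousand scaling is chosen to match the "/1000" column of the usage tables that follow.

```python
from collections import Counter


def codon_counts(orf):
    """Count codons in an ORF, reading from its first base."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


def codon_usage_per_thousand(orf):
    """Codon frequencies expressed per 1000 codons."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


toy_orf = "ATGCTGCTGCTATAA"   # hypothetical ORF: ATG CTG CTG CTA TAA
print(codon_counts(toy_orf)["CTG"])              # 2
print(codon_usage_per_thousand(toy_orf)["CTG"])  # 400.0
```

Comparing such a profile against the organism's known usage table is one way to test whether an ORF looks like a real gene.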
Codon Usage in Human Genome
Codon Usage in Mouse Genome
AA   codon   /1000   frac
Ser  TCG      4.31   0.05
Ser  TCA     11.44   0.14
Ser  TCT     15.70   0.19
Ser  TCC     17.92   0.22
Ser  AGT     12.25   0.15
Ser  AGC     19.54   0.24
Pro  CCG      6.33   0.11
Pro  CCA     17.10   0.28
Pro  CCT     18.31   0.30
Pro  CCC     18.42   0.31
Leu  CTG     39.95   0.40
Leu  CTA      7.89   0.08
Leu  CTT     12.97   0.13
Leu  CTC     20.04   0.20
Ala  GCG      6.72   0.10
Ala  GCA     15.80   0.23
Ala  GCT     20.12   0.29
Ala  GCC     26.51   0.38
Gln  CAG     34.18   0.75
Gln  CAA     11.51   0.25
Transcription in prokaryotes
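[Figure: layout of a prokaryotic gene from 5’ to 3’. Labels: promoter (upstream), transcription start site, untranslated regions, start codon, coding region, stop codon, transcription stop site (downstream), and the transcribed region spanning the start and stop sites.]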
Microbial gene finding
• Microbial genomes tend to be gene-rich
(80%-90% of the sequence is coding
sequence)
• Major problem: finding genes without a
known homologue.
Open Reading Frame
An Open Reading Frame (ORF) is a sequence of codons
that starts with a start codon, ends with a stop codon, and
has no stop codons in between.
Searching for ORFs – consider all 6 possible
reading frames: 3 forward and 3 reverse
Is the ORF a coding sequence?
1. Must be long enough (roughly 300 bp or more)
2. Should have average amino-acid composition specific for a
given organism.
3. Should have codon usage specific for the given organism.
Gene finding using codon frequency
1. Take the input sequence
2. Compare its codon frequencies with the frequency in coding regions and the
frequency in non-coding regions
3. Decide: coding region or non-coding region
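Example
Base frequency at each codon position:
Codon position        A     C     T     G
1                    28%   33%   18%   21%
2                    32%   16%   21%   32%
3                    33%   15%   14%   38%
frequency in genome  31%   18%   19%   31%

Assume: the bases making up a codon are independent

\frac{P(x \mid \text{in coding})}{P(x \mid \text{random})} = \prod_i \frac{P(A_i \text{ at } i\text{th position})}{P(A_i \text{ in the sequence})}

where A_i is the ith base of x and its position is counted within its codon (1, 2, or 3).

Score of AAAGAT:
\frac{.28 \cdot .32 \cdot .33 \cdot .21 \cdot .32 \cdot .14}{.31 \cdot .31 \cdot .31 \cdot .31 \cdot .31 \cdot .19}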
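The ratio can be computed directly from the position-specific table. This is a minimal sketch that takes every factor from the table's rows and assumes, as the slide does, that bases within a codon are independent; the names `POSITION_FREQ`, `GENOME_FREQ`, and `coding_ratio` are illustrative.

```python
import math

# Base frequencies at codon positions 1, 2, 3 (from the example table).
POSITION_FREQ = {
    1: {"A": 0.28, "C": 0.33, "T": 0.18, "G": 0.21},
    2: {"A": 0.32, "C": 0.16, "T": 0.21, "G": 0.32},
    3: {"A": 0.33, "C": 0.15, "T": 0.14, "G": 0.38},
}
# Overall base frequencies in the genome (last row of the table).
GENOME_FREQ = {"A": 0.31, "C": 0.18, "T": 0.19, "G": 0.31}


def coding_ratio(seq):
    """P(seq | coding) / P(seq | random), reading seq in frame 1."""
    ratio = 1.0
    for i, base in enumerate(seq.upper()):
        codon_position = i % 3 + 1
        ratio *= POSITION_FREQ[codon_position][base] / GENOME_FREQ[base]
    return ratio


score = coding_ratio("AAAGAT")
print(score)             # ratio above 1 favours coding, below 1 favours background
print(math.log2(score))  # the same score as a log-odds value in bits
```

For longer sequences it is safer to sum the logarithms of the per-base ratios than to multiply them, since the product shrinks or grows quickly.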
Using codon frequency to find the correct reading frame
Consider a sequence x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 …
where x_i is a nucleotide, and let p_{abc} be the frequency of codon abc in coding regions
let p_1 = p_{x_1 x_2 x_3} p_{x_4 x_5 x_6} …
    p_2 = p_{x_2 x_3 x_4} p_{x_5 x_6 x_7} …
    p_3 = p_{x_3 x_4 x_5} p_{x_6 x_7 x_8} …
then the probability that the ith reading frame is the coding frame is:
P_i = \frac{p_i}{p_1 + p_2 + p_3}
Algorithm:
• Slide a window along the sequence and compute P_i
• Plot the results
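The sliding-window computation can be sketched as follows. The codon frequency table `codon_freq` is a placeholder that the caller must supply from known coding sequences; the `floor` value for unseen codons, the default window size, and the function names are assumptions made for illustration. Products are accumulated as log sums to avoid numerical underflow, and each frame is scored over the same number of codons so the three products are comparable.

```python
import math


def frame_probabilities(window, codon_freq, floor=1e-6):
    """Return (P1, P2, P3): the normalised codon-frequency products
    for the three forward reading frames of `window`."""
    n_codons = (len(window) - 2) // 3      # codons available in every frame
    log_p = []
    for offset in range(3):
        total = 0.0
        for k in range(n_codons):
            codon = window[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_freq.get(codon, floor))
        log_p.append(total)
    top = max(log_p)                       # subtract the max before exp()
    weights = [math.exp(v - top) for v in log_p]
    norm = sum(weights)
    return tuple(w / norm for w in weights)


def sliding_frame_probabilities(seq, codon_freq, window=30, step=3):
    """Compute (start, (P1, P2, P3)) for each window along seq.
    With step=3 the frame labels stay consistent between windows."""
    seq = seq.upper()
    return [
        (start, frame_probabilities(seq[start:start + window], codon_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy usage with a made-up two-codon frequency table.
toy_freq = {"ATG": 0.5, "GCC": 0.5}
for start, probs in sliding_frame_probabilities("ATGGCCATGGCCATGGCC", toy_freq, window=12):
    print(start, [round(p, 3) for p in probs])
```

Plotting the three values against the window start shows which frame dominates along the sequence.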
Eukaryotic gene finding
• On average, a vertebrate gene is about 30 kb
long
• The coding region takes up about 1 kb
• Exon sizes vary from tens of bases to
kilobases
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp, but both
can be much longer.
Exons and Introns
• In eukaryotes, the gene is a combination
of coding segments (exons) that are
interrupted by non-coding segments
(introns)
• This makes computational gene prediction
in eukaryotes even more difficult
• Prokaryotes don’t have introns: genes in
prokaryotes are continuous
Gene Structure
Gene structure in eukaryotes
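[Figure: gene structure in eukaryotes, 5’ to 3’. Labels: promoter, transcription start site, untranslated regions, start codon, initial exon, internal exons, final exon, stop codon, transcription stop site, transcribed region, and the GT and AG dinucleotides at the donor and acceptor sites.]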
Central Dogma and Splicing
[Figure: a gene made of exon1, intron1, exon2, intron2, exon3 is transcribed, spliced to remove the introns, and then translated.]
exon = coding
intron = non-coding
Splicing Signals
Exons are interspersed with introns; an intron typically
begins with GT (donor site) and ends with AG (acceptor site)
Splice site detection
Donor site (5’ to 3’), base frequency in % by position:
Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    1   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78   99    0   41   …   27
T          23   …   13    8    1   98    3   …   25
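A position-specific table like this one can be used to score candidate donor sites. The sketch below uses only the five columns the table gives in full (positions -2 to 2) and scores a 5-base window by log-odds; the uniform background of 0.25 per base and the pseudocount that avoids log(0) are assumptions, and the function names are illustrative.

```python
import math

# Percentages for positions -2, -1, 0, 1, 2 of the donor-site table.
DONOR_PCT = {
    "A": [60, 9, 0, 1, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 99, 0, 41],
    "T": [13, 8, 1, 98, 3],
}
BACKGROUND = 0.25   # assumption: uniform background base frequency
PSEUDO = 0.5        # assumption: pseudocount in percentage points


def donor_score(site):
    """Log-odds score (bits) of a 5-base window covering positions -2..2."""
    site = site.upper()
    if len(site) != 5:
        raise ValueError("expected a 5-base window")
    score = 0.0
    for i, base in enumerate(site):
        p = (DONOR_PCT[base][i] + PSEUDO) / (100 + 4 * PSEUDO)
        score += math.log2(p / BACKGROUND)
    return score


def best_donor_sites(seq, top=3):
    """Return the top-scoring (score, window_start) pairs in seq."""
    seq = seq.upper()
    scored = [(donor_score(seq[i:i + 5]), i) for i in range(len(seq) - 4)]
    return sorted(scored, reverse=True)[:top]


print(best_donor_sites("TTTAGGTATTT", top=1))  # best window starts at index 3 (AGGTA)
```

The highest possible score belongs to AGGTA, the per-column consensus, with the GT at positions 0 and 1.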
Consensus splice sites
Donor: 7.9 bits
Acceptor: 9.4 bits
Promoters
• Promoters are DNA segments upstream
of transcripts that initiate transcription
[Figure: the promoter sits at the 5’ end, upstream of the transcribed region]
• Promoter attracts RNA Polymerase to the
transcription start site
Two Approaches to Eukaryotic Gene Prediction
• Statistical: coding segments (exons) have typical
sequences on either end and use different
subwords than non-coding segments (introns).
• Similarity-based: many human genes are similar
to genes in mice, chicken, or even bacteria.
Therefore, already known mouse, chicken, and
bacterial genes may help to find human genes.
Ribosomal Binding Site
Donor and Acceptor Sites: Motif Logos
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Similarity-based gene finding
• Alignment of
– Genomic sequence and (assembled) EST
sequences
– Genomic sequence and known (similar)
protein sequences
– Two or more similar genomic sequences
Expressed Sequence Tags
1. Start from a cell or tissue sample
2. Isolate mRNA and reverse transcribe it into cDNA
3. Clone the cDNA into a vector to make a cDNA library
4. Pick a clone and sequence the 5’ and 3’ ends of the cDNA insert (the 5’ and 3’ ESTs)
5. Submit the ESTs to dbEST
Splicing Sequence Alignment
Potential splicing sites
Using Similarities to Find the Exon Structure
• Human EST (mRNA) sequence is aligned to
different locations in the human genome
• Find the “best” path to reveal the exon structure
of the human gene
EST sequence
Human Genome
An annotated gene in the human genome