Genome sequencing

advertisement
Computational Genomics
Izabela Makalowska
July 15, 2006
The main task in modern biology is to
find out how this…
TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG
TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG
CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG
CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA
GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC
TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT
AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT
GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA
AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT
AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC
CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT
AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC
…becomes this
DNA sequence contains all information but
we need to decipher it.
TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG
TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG
CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG
CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA
GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC
TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT
AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT
GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA
AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT
AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC
CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT
AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC
Program for life



DNA in our cells store information in a way that is very similar to the way computers
do.
Instead of being a binary memory, where everything is either 0 or 1, DNA is a 4 letter
alphabet: A, C, G, T
Using computer metaphor we can say that:
 Plant cell do not look like a mouse cell because their “programs” are different
 Liver cells work differently than lung cells because of different input to the


program
Children look like parents because their program is a “revision” of parents
program
Many diseases are caused by “bugs” in program:
 Tay-Sach’s disease: A simple mistake in one line of code
 Huntington’s disease: A “line” of code gets repeated a bunch of times

by accident
Different ways to solve the same problem:
 Plants: photosynthesis = turn light into sugar
 Animals: eat plant or other animals
What exactly are we looking for in the
DNA sequence?
 Genes
Protein coding
 RNA genes
 Retrogenes
 Regulatory elements
 Promotors
 Enhancers
 siRNA
 Repetitive elements
 LINES
 SINES
 Simple repeats

Are genes just protein or RNA coding
elements?
Makorin1-p1 is a non-coding pseudogene of Makorin1.
Makorin1-p1 regulates the expression of its related coding
gene. It acts by stabilizing the Makorin1 gene by blocking of
a cis-acting RNA decay element within the 5’ region of
Makorin1.
Are repeats just a junk DNA?
Translation of mRNA containing
Alu-cassette results in soluble
form of the protein
Caras, I.W., Davitz, M.A., Rhee, L., Weddell, G.,
Martin Jr., D.W., Namba, T., Sugimoto, Y., Negishi,
M., Irie, A., Ushikubi, F., Kaki-Nussenzweig,V.,
Cloning of decay-accelerating factor suggests
novel use of splicing to generate two proteins.
Nature. 1987 Feb 5-11;325(6104):545-9.
Getting all genes
 The most direct way to identify a gene is to document the
transcription of a fragment of the genome - EST sequencing
 Requires less sequencing since it is focused on coding
sequence only
 Small rate of false positives, although even 10% of EST
sequences could be artifacts
 Genes with very restricted expression may newer be
discovered
 In most cases gives only partial sequences
 Genome sequencing
 Access to entire genome, allows to learn more about
genome organization
 Regulatory elements
 Only small percentage of the genome codes for genes
 Hard to identify less typical genes
 High rate of false positives
Constructing EST
Cell or tissue
Isolate mRNA and
reverse transcribe into cDNA
Analyze
Clone cDNA into a vector
to make a cDNA library
5' EST
3' EST
cDNA
vector
Pick individual
clones
Sequence the
5' and 3' ends
of cDNA inserts
Problems with EST data
 Contamination
 Low quality – the error rates are high in individual
ESTs
 Highly redundant, for highly expressed genes we
can have hundreds of ESTs representing a single
gene
 The databases are skewed for sequences near 3’
end of mRNA
 For most ESTs there is no indication as to the
gene from which it was derived
 Overlapping genes
 Splice variants
Chromatogram
An example of a good chromatogram showing
well-resolved peaks and no ambiguities
This is a region of a chromatogram fairly far
along the sequence where some bases in runs
of 2 or more are no longer visible as single
peaks. Many peaks are beginning to broaden
and smear into one another, interpretation of
the peaks has become more difficult, and the
basecalling software has begun to use 'N's.
This is a region of a chromatogram where the
traces have become too ambiguous for
accurate basecalling. While some parts of this
region of the chromatogram can be useful for
linking to existing sequences following manual
editing, it should not be considered accurate.
Sequence quality screening
 ABI sequencing software contains a program
for quality screening
 PHRED - reads DNA sequence data, calls
bases, and writes the base calls and quality
values to output files
Quality
file
Phred quality scores
Phred quality score
Probability that the
base is called wrong
Accuracy of the base
call
10
1 in 10
90%
20
1 in 100
99%
30
1 in 1,000
99.9%
40
1 in 10,000
99.99%
50
1 in 100,000
99.999%
Contamination
 Vectors - DNA/cDNAs from the biological source organism/organelle are
usually inserted into a cloning vector so that they can be cloned, propagated,
and manipulated. Sequencing of such constructs frequently produces raw
sequences that include segments derived from vector.
 Adapters, linkers, and PCR primers - Various oligonucleotides can
be attached to the DNA/RNA under investigation as part of the cloning or
amplification process.
 Impurities in the DNA/RNA - Nucleic acid preparations may contain
DNA/RNA from sources other than the intended one.
 nucleic acids from an organelle
 mRNA/DNA present in a reagent used in the isolation, purification, or
cloning procedures
 nucleic acids from other organisms present in the material from which the
DNA/RNA was isolated
 other DNAs/RNAs used in the laboratory (e.g., from accidental mixing of
samples or cross contamination from dirty pipettes, tips, tubes, or
equipment)
Consequences of contamination
 Time and effort wasted on meaningless
analyses
 Erroneous conclusions drawn about the
biological significance of the sequence
 Misassembly of sequence contigs and
false clustering of Expressed Sequence
Tags (ESTs)
Why contamination is causing problems
Gene A
Gene B
Assembly
Contigs
VecScreen
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen_docs.html
Cross_match
 Cross_match is a general purpose application for
comparing any two DNA sequence sets. For example,
it can be used to compare a set of reads to a set of
vector sequences and produce vector-masked
versions of the reads. It is slower but more sensitive
than BLAST.
GCACGCACAACCAGACCATGCTCGGACGACCCGCTGTACATCGGCCTGCGGCAGAGGCGCGTGCGCGGCGCCGCG
TACGACGAGTTCGTCGACGAGTTCATGCAGGCGGTCGTCAAGCGCTTCGGGCAGAACTGCCTCATACAGTTCGAG
GACTTCGCCAACGCGAACGCGTTCCGCCTGCTCGAGAAGTACCGCGGCAGGTACTGCACGTTCAACGACGACCTC
CAGGGCACGGCGGCGGTGGCGGTGGCCGGGCTGCTCGCGTCGCTGCGCATCACCGGCAAGCGGCTCTCCGACAAC
GTGTTCGTGTTCCAGGGAGCCGGCGAGGCATCTCTGGGTATCGCCGAGCTGTGCGTGATGGCGATGAAGAACGAG
GGTACATCGGACGCCGATGCCCGCTGCAAGATTTGGATGGTGGACTCCAAGGGTCTCATCGTGAAGAACCGTCCT
GAAGGTGGACTGAACGAACACAAGGAGAAGTTTGCCCAGAACTGCTCCCCCATTCGGACACTTGCCGAAGTTATA
AATGTTGCTAAGCCTTCTGTACTGATTGGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXGCTCGACGGCGGAAGGAAAA
Qscreen
 Sequencing data management and quality
checking system







Quality screening
Relational database for sequence data
management and archive
Easy data access via web interface
Easy data sharing
Sequence and trace view
Project statistics
Password protected
Qscreen: project and user management
Gene discovery strategy
 Cluster and assemble EST sequences to
lower redundancy and to increase the
length of transcripts
 Find coding regions and reading frame
 Use deducted protein to search databases
and assign function to the gene
Cluster and assemble EST resources
 Assembly tools:



TGICL http://www.tigr.org/tdb/tgi/software/
Cap 3 http://genome.cs.mtu.edu/sas.html
Phrap http://www.phrap.org/
 EST clusters databases


UniGene http://www.ncbi.nlm.nih.gov/
TGI http://www.tigr.org/
 EST analysis pipeline

SMAPi http://smapi.cbio.psu.edu
Assembling ESTs
 Two stage process:


Clustering ESTs based on the similarity and clone ID
Assembling inside each cluster
Important parameters
Length and similarity level
of overlapping fragments
Length of overhanging fragments
 Criteria too stringent = many ESTs
will not be assembled and genes will
stay fragmented
 Criteria too loose = ESTs from
genes from the same family will be
assembled into one gene
Contig quality
ATGTCTCTNTCACTGA
TCTGTCCC-CAGTCACGATCGAN
ATGTCTCTGTCNCTNAGTCACGATCGAN
ATGTCTCTNTCACTGA
TCTGTCCC-CAGTCACGATCGAN
ATGTCTCGGTCAC-CAGTCACGATCGAT
ATGTCTCGGTCAC-CAGTCACGATCGAT
TTGTCTGGGTCAC-CTCC
GGTGGC-CAGTCACGATNGAN
ATGTCTCGGTCAC-CAGTCACGATCGAT
ATGTCTCTGTCNCTNAGTCACGATCGAN
ATGTCTCGGTCAC-CAGTCACGATCGAT
One gene one cluster?
Joined based on clone ID
Joined based on similarity
5’ESTs
Joined based on similarity
3’ESTs
One cluster one gene?
Cap3
 Use of forward-reverse constraints to correct
assembly errors and link contigs.
 Use of base quality values in alignment of
sequence reads.
 Automatic clipping of 5' and 3' poor regions of
reads.
 Generation of assembly results in ‘ace’ file
format for Consed.
Input files
 CAP3 takes as input a file of sequence reads
in FASTA format. CAP3 takes two optional
files: a file of quality values in FASTA format
and a file of forward-reverse constraints.
The file of quality values must be named
"xyz.qual", and the file of forward-reverse
constraints must be named "xyz.con", where
"xyz" is the name of the sequence file. CAP3
uses the same format of a quality file as
Phrap.
http://bio.ifom-firc.it/ASSEMBLY/assemble.html
Web interface
Output files
TGICL – TIGR Gene Indices
clustering tool
 Clustering – uses modified megablast
program to cluster sequences together
 Assembly – CAP3 is used to assemble
sequences inside each cluster
TGICL – TIGR Gene Indices
clustering tool
 Sequences need to be cleaned before using
TGICL (Lucy, UniVEc, SeqClean)
 mRNA sequences may be used as ‘seeds’ for
clustering. Caution: partial mRNAs
mislabeled as complete can prevent cluster
extension beyond the seed.
 Difficulty with highly expressed genes that
have several thousand ESTs in a single
cluster (assembly program may run out of
memory)
Phrap
 part of the Phred/Phrap/Consed
 program for assembling shotgun DNA sequence data.
 allows use of the entire read and not just the trimmed high





quality part
uses a combination of user-supplied and internally computed
data quality information to improve assembly accuracy
constructs the contig sequence as a mosaic of the highest
quality read segments rather than a consensus
provides extensive assembly information to assist in troubleshooting assembly problems
handles large datasets
It is strongly recommended that phrap be used in conjunction
with the base calls and base quality values produced by the
basecaller, phred; and with the sequence editor/assembly
viewer, consed.
Phrap output in Consed
Apple EST assembly
183, 732 ESTs
 Phrap:
 24,199
contigs; 9,765 singletons
 CAP3
 19,791
contigs; 18,927 singletons
 TGICL

22,481 contigs; 28,279 singletons
NCBI UniGene
Clustering at NCBI
 EST must have at least 100bp after removing contaminants
 The overlap between similar ESTs must be at least 70 bp
 Similarity between overlapping area must be at least 96%
over the 70% of overlapping region
>100bp
>70% of overlap with at
least 96% of similarity
>100bp
>70bp
TIGR Gene
Indices
www.tigr.org
What are TIGR gene indices?
 Clustered and assembled ESTs and mRNA
sequences
 Each gene and each splice variant is
represented by a single consensus sequence
 Provides ORF annotation, genome mapping,
expression profiles, domain annotation,
unique oligomers
<30bp
>40bp of overlap with at
least 95% of similarity
<30bp
TGI annotations
TGI sequence
annotations
BLAST search
SMAPi – Sequence Management and
Analysis Pipeline
 The Parasitic Plant
Genome Project
Comparative
genomics of related
 Online access to data
parasitic genera
 Easy sequence management
 Host-parasite lateral
 Custom as well as automated
transfer

annotation
 Sharing among researchers
 Integrated usage of various
analysis tools (BLAST,
Phred/Phrap.)
Supported by grant from
Pennsylvania Greenhouse for
Life Sciences
NCBI ORF finder
Mapping contigs to the genome
 Splign http://www.ncbi.nlm.nih.gov/sutils/splign/
Genome sequencing
Access to entire genome
 Regulatory elements
 Only small percentage of the genome codes for
genes
 Hard to identify less typical genes
 High rate of false positives

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG
TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG
CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG
CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA
GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC
TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT
AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT
GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA
AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT
AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC
CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT
AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC
Genome sequencing strategies
Gene identification methods
 Molecular techniques
 Very laborious
 Time consuming
 Expensive
 Low rate of false positives
 Computational methods
 Fast
 Relatively low cost
 High rate of false positives
 Poor performance on less typical genes
Before we start analysis…
 We have to:
 Check
sequences quality
 Remove contamination
 Assembly sequence reads into longer
contigs
 Close gaps (in perfect situation)
Gene finding strategies
Genomic Sequence
Model based
Content-Based
Bulk properties of
sequence:
Open reading frame
Codon usage
Repeat periodicity
Compositional complexity
Site-Based
Absolute properties of
sequence:
Consensus sequences
Donor and acceptor splice
sites
Transcription factor binding
sites
Polyadenylation signals
Start codon
Stop codon
Comparative
Inferences based on
sequence homology:
Protein sequence with
similarity to translated
product of query
Similarity to EST sequences
Conserved regions between
species (comparative
genomics)
Gene finding methods classification
Similarity based predictors:
make use of similarity to already
known genes and proteins coded
by these genes as well as
expression data including
sequences from cDNAs and data
from hybridization experiments
(tiling arrays for example)
Dual- and multi-genome predictors:
rely on the fact that functional regions
of a genome sequence are more
conserved during evolution
Model based predictors: use a
single genome sequence and
exon/intron structure is predicted
based on absolute and bulk
properties of the sequence
Comparative genomics - MultiPipmaker
http://pipmaker.bx.psu.edu/pipmaker/
Model based methods
 We take advantage of what we already
learned about gene structures and features of
coding sequences. Based on this knowledge
we can build theoretical model, develop an
algorithm to search for important features,
train it on known data and use to search for
coding sequences in anonymous genomic
fragments
Pattern recognition and matching
 The ability of a program to compare novel and known
patterns and determine the degree of similarity forms the
basis of sequence analysis including gene identification. In
similarity based methods we search the genome directly
for nucleotide or amino acid pattern observed in one or
more already known genes; in comparative genomics we
look for similar sequence pattern in two or more genomes,
and in method based prediction we look for patterns in
sequence composition and signals.
 One of the major challenges associated with using pattern
matching is in that, in most cases, we need to identify
patterns that are ‘similar’ to a target pattern, but the
concept of similarity isn’t well defined from programmatic
and biological sense. Also, only already known pattern
may be used for searches, therefore genes with unusual
patterns may not be discovered using these methods.
Coding regions in Prokaryotes
INTERGENIC
REGION
START
CODING
SEQUENCE
STOP
Genetic code
Gene Model in Prokaryotes
A
T
G
1
2
START
3
T
A
A
T
A
G
T
G
A
CODON
STOP
Open Reading Frames
+1
+2
+3
-1
-2
-3
Properties of the sequence
 We can check if sequence in particular
ORF has some other features which
could tell us if this is a putative coding
sequence or the ORF is false positive.
We can look at the sequence content
and compare it with known coding
sequence and non-coding sequence
and check to which of these two the
ORF sequence is more similar to.
Markov chain
 A Markov chain describes at successive times the
states of a system. At these times the system may
have changed from the state it was in the moment
before to another or stayed in the same state. The
changes of state are called transitions. The Markov
property means the system is memoryless, i.e. it
does not "remember" the states it was in before, just
"knows" its present state, and hence bases its
"decision" to which future state it will transit purely on
the present, not considering the past.
Examples of Markov chain
 A game of Monopoly, snakes and ladders or any other game
whose moves are determined entirely by dice is a Markov chain.
This is in contrast to card games such as poker or blackjack,
where the cards represent a 'memory' of the past moves. To see
the difference, consider the probability for a certain event in the
game. In the above mentioned dice games, the only thing that
matters is the current state of the board. The next state of the
board depends on the current state, and the next roll of the dice.
It doesn't depend on how things got to their current state. In a
game such as poker or blackjack, a player can gain an
advantage by remembering which hands have already been
played (and hence which cards are no longer in the deck), so
the next state (or hand) of the game is not independent of the
past states.
Markov chain – weather model
 The matrix P represents the weather model in which a sunny
day is 90% likely to be followed by another sunny day, and a
rainy day is 50% likely to be followed by another rainy day.
0.1
0.9
0.5
0.5
0.9
0.1
0.5
0.5
P =
Markov chain – sequence properties
A
T
C
G
How can we do this?
CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCG
TATATCGCGATCGCATCGCGTACGCGATCGCTATCTCGCATCGCGTACTGCAT
CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCG
TATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT
CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCG
TATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT
CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCG
TATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT
20 G
7 GA
1 GG
5 GT
7 GC
7/20
1/20
5/20
7/20
For non-coding sequence we assume that
probability of each is equal. The more
‘popular’ in coding sequence transition, the
higher probability the sequence is coding
Markov Chains
GCGCTAGCGCCGATCATCTACTCG
GCGCTAGCGCCGATCATCTACTCG
GCGCTAGCGCCGATCATCTACTCG
GCGCTAGCGCCGATCATCTACTCG
GCGCTAGCGCCGATCATCTACTCG
GCGCTAGCGCCGATCATCTACTCG
GCGCTAGCGCCGATCATCTACTCG
}
}
}
first order
second order
fifth order
How far can we go?
 Order of our model will have influence
on specificity and sensitivity of our
program. Too short sequences may not
be specific enough and program may
return a lot of false positives. Long
chains may be to specific and our
program will not be sensitive enough
and will return a lot of false negatives.
Probability matrix
first order Markov Model - matrix of 16 probabilities
p(A/A),p(A/T),p(A/C),p(A/G)
p(T/A),p(T/T),p(T/C),p(T/G)
p(C/A),p(C/T),p(C/C),p(C/G)
p(G/A),p(G/T),p(G/C),p(G/G)
4
K+1
4
1+1
2
= 4 =16
4
2+1
4
3+1
3
= 4 =64
4
= 4 =256
Probabilities in different reading frames
GCGCTAGCGCCGATCATCTACTCG
GCG CTA GCG CCG ATC ATC TAC TCG
G CGC TAG CGC CGA TCA TCT ACT CG
GC GCT AGC GCC GAT CAT CTA CTC G
Does G more often follow GC when it is in third
position in codon or if it is in second or first
position?
Number of probabilities
1+1
2
4 = 4 = 16
2
1+1
4 = 4 = 3x16 = 48
Estimating coding potential
To estimate if the sequence is coding we have to calculate
the probability that the sequence is coding and the
probability the sequence is non-coding. Next we calculate
the logarithm of the ratio of these two probability values
i
LP (S) = log P (S)
P 0(S)
If the calculated value is > 0 the likelihood that the sequence
is coding is higher than the sequence is not coding. If the
value is < 0 there is higher likelihood that sequence is not
coding.
Markov Models - probabilities
i
LP(S) = log P (S)
P0(S)
S=AGGACG
1
1
2
3
1
2
P(S) =f(A,1)F(G,A)F(G,G)F(A,G)F(C,A)F(G,C)
P(S) = 0.27 x 0.19 x 0.27 x 0.24 x 0.21 x 0.12 = 0.00008377
P(S) = 0.25 x 0.25 x 0.25 x 0.25 x 0.25 x 0.25 =0.0002441
LP(S) = log(0.00008377/0.0002441) = -0.4644
Calculating LP
i
LP(S) = log P (S)
P0(S)
0.27
0.21
LP(S) = log 0.27 + log0.19 + log
+ log0.24 + log
+ log 0.12
0.25
0.25
0.25
0.25
0.25
0.25
LP(S) = log 1.08 + log 0.76 + log 1.08 + log 0.96 + log 0.84 + log 0.48
LP(S) = 0.0334 + (-0.1191) +0.0334 + (-0.0177) + (-0.0757) + (-0.3187)
LP(S) = -0.4644
GLIMMER
 Gene finding program for Prokaryotes
 Saltzberg et. al, 1998
 For prediction uses:
Start
Stop
Sequence composition
Interpolated Markov Models
The GLIMMER system
 Part 1 – Program is trained for given data set
(species)
 Part 2 – Program identifies putative genes in the
genomic sequence
 Identify all ORFs longer than a threshold
 Score each ORF in each reading frame and select
these which gets highest scores in correct
reading frame
 Score overlapping genes in each frame
separately to see which frame score the highest
Running the program
 First run build-imm on a set of sequences to
make the Markov models (long ORFs from
the same or closely related species)
build-imm train.seq
 Then run GLIMMER to find genes in your
sequence
glimmer your.seq train.seq <options>
GLIMMER options
 -g
set minimum gene length
 -o
set minimum overlap
 -p
set minimum overlap percentage
 +r/-r independent probability score ON/OFF
 -t
set threshold score for calling as gene
GLIMMER output
Minimum gene length = 180
Minimum overlap length = 30
Minimum overlap percent = 10.0%
Threshold score = 90
Use independent scores = True
Use first start codon = True
ID#
1
2
3
4
***
5
***
6
7
8
Orf
Fr
Start
F2
302
R1
660
F2
620
R3
1114
F3
1119
R3
2026
F3
1815
Overlaps #3
Reject #4
R2
2600
Overlaps #4
F1
2710
R3
4153
F1
3403
R2
4700
R2
68906
R1
101574
R3
193228
Gene
Start
305
633
650
1105
1140
1999
1830
by 170
2597
by 120
2719
4153
3403
4679
68897
101544
193204
End
616
220
901
638
1466
1118
2054
1935
3399
3962
4230
4455
68670
101296
193022
Lengths
Gene
-- Frame
Orf Gene
Score
F1 F2 F3
315
312
0
_ 0 _
441
414
99
_ _ _
282
252
0
_ 0 _
477
468
99
_ _ _
348
327
0
_ _ 0
909
882
99
_ _ _
240
225
99
_ _ 99
Overlap Region Scores: _ _ 0
666
Overlap
690
192
828
246
237
279
207
663
99
_
Region Scores: _
681
99
99
192
0
99
828
99
99
225
0
_
228
13
_
249
96
_
183
56
_
_
_
_
_
_
_
_
_
_
Scores R1 R2 R3
99 _ _
99 _ _
_ _ 99
_ _ 99
_ _ 99
_ _ 99
_ _ _
_ _ 99
_ _ 99 _
0 _ 99 _
_ _ _ _
_ _ _ 0
_ _ _ _
_ _ 0 _
_ _ 13 _
_ 96 _ _
_ _ _ 56
Indep
Score
0
0
0
0
0
0
0
0
0
0
0
0
0
99
86
3
43
List of putative genes
Putative Genes:
1
633
220
2 1105
638
3 1999 1118
5 2597 1935
6 2719 3399
7 3403 4230
...
39 38472 38741 [Shorter 40 80 74]
40 38662 39450 [Bad Overlap 39 80 25]
...
482 464206 464424 [Shadowed by 483]
...
636 616213 615965 [Delay by 33 637 50 0]
Output description
39
40
38472
38662
38741 [Shorter 40 80 74]
39450 [Bad Overlap 39 80 25]
[Bad Overlap a b c] means that gene number a overlapped this one and
was shorter but scored higher on the overlap region. b is the length
of the overlap region and c is the score of *this* gene on the overlap
region. There should be a [Shorter ...] notation with gene a
giving its score.
[Shorter a b c] means that gene number a overlapped this one and
was longer but scored lower on the overlap region. b is the length
of the overlap region and c is the score of *this* gene on the overlap
region. There should be a [Bad overlap ...] notation with gene a
giving its score.
Output description - 2
482 464206 464424 [Shadowed by 483]
...
636 616213 615965 [Delay by 33 637 50 0]
[Shadowed by a] means that this gene was completed contained as part
of gene a 's region, but in another frame.
[Delay by a b c d] means that this gene was tentatively rejected
because of an overlap with gene b , but if the start codon is postponed
by a positions, then this would be a valid gene. The start position
reported for this gene includes the delay. c is the length of the overlap
region that caused the rejection and d is the score in this gene's frame
on that overlap region.
Prokaryotic vs. Eukaryotic Genes
 Prokaryotes







small genomes
high gene density
no introns (or splicing)
no RNA processing
similar promoters
terminators important
overlapping genes
 Eukaryotes







large genomes
low gene density
introns (splicing)
RNA processing
heterogeneous promoters
terminators not important
polyadenylation
Coding regions in Prokaryotes
INTERGENIC
REGION
START
CODING
SEQUENCE
STOP
From DNA to protein
Gene structure
Pseudogenes and repetitive elements
Complicated gene structures
Overlapping
genes
Eukaryotic gene structure
E xon 1
Intron 1
E xon 2
E xon 3
Intron 2
E xon 4
Intron 3
5’
Promoter
TA TA
DNA
3’
Splice site
GGTGA G
Translation
Initiation
A TG
Splice site
CA G
Pyrimidine
tract
B ranchpoint
CTG A C
polyA signal
Stop codon
TA G/TGA /TA A
3’ UTR
5’ UTR
Ex1
In1
Ex2
Ex2
In2
Ex3
In3
Ex4
GT
In4
Ex5
AG
Markov Models
E0
E1
E2
I0
I1
I2
Eterm
Einit
5’ UTR
Single
exon
gene
3’ UTR
Poly A
promoter
Signal
Intergenic
region
Ex5
Searching for coding sequences using
Markov chain
In this case we do not want check if given sequence fragment is
coding or not but we rather want to identify coding fragments in
a long sequence. In most cases this is done by calculating
statistics in overlapping windows.
AGTACGATATTAGCGGCAATCGTATGACTACGTCTTGCTACGTCTTCTCTCGTCTGCTCTAG
Windows coding potentials are used to
create profiles. This example shows a profile
for a sequence analyzed using 120bp
windows with 10 bp distance between them.
Codon usage
Gly
Gly
Gly
Gly
GGG
GGA
GGT
GGC
17.08
19.31
13.66
24.94
0.23
0.26
0.18
0.33
Arg
Arg
Ser
Ser
AGG
AGA
AGT
AGC
12.09
11.73
10.18
18.54
0.22
0.21
0.14
0.25
Trp
End
Cys
Cys
TGG
TGA
TGT
TGC
Glu
Glu
Asp
Asp
GAG
GAA
GAT
GAC
38.82
27.51
21.45
27.06
0.59
0.41
0.44
0.56
Lys
Lys
Asn
Asn
AAG
AAA
AAT
AAC
33.79
22.32
16.43
21.30
0.60
0.40
0.44
0.56
End
End
Tyr
Tyr
TAG
TAA
TAT
TAC
Val
Val
Val
Val
GTG
GTA
GTT
GTC
28.60
6.09
10.30
15.01
0.48
0.10
0.17
0.25
Met
Ile
Ile
Ile
ATG
ATA
ATT
ATC
21.86
6.05
15.03
22.47
1.00
0.14
0.35
0.52
Leu
Leu
Phe
Phe
TTG
TTA
TTT
TTC
Ala
Ala
Ala
Ala
GCG
GCA
GCT
GCC
7.27
15.50
20.23
28.43
0.10
0.22
0.28
0.40
Thr
Thr
Thr
Thr
ACG
ACA
ACT
ACC
6.80
15.04
13.24
21.52
0.12
0.27
0.23
0.38
Ser
Ser
Ser
Ser
TCG
TCA
TCT
TCC
1.00
0.61
0.42
0.58
Arg
Arg
Arg
Arg
CGG
CGA
CGT
CGC
10.40
5.63
5.16
10.82
0.19
0.10
0.09
0.19
0.17
0.22
0.42
0.58
Gln
Gln
His
His
CAG
CAA
CAT
CAC
32.95
11.94
9.56
14.00
0.73
0.27
0.41
0.59
11.43
5.55
15.36
20.72
0.12
0.06
0.43
0.57
Leu
Leu
Leu
Leu
CTG
CTA
CTT
CTC
39.93
6.42
11.24
19.14
0.43
0.07
0.12
0.20
4.38
10.96
13.51
17.37
0.06
0.15
0.18
0.23
Pro
Pro
Pro
Pro
CCG
CCA
CCT
CCC
7.02
17.11
18.03
20.51
0.11
0.27
0.29
0.33
14.74
2.64
9.99
13.86
0.73
0.95
11.80
16.48
Codon usage
S=AGGACGGGATCA
DNA sequence can be divided into noneoverlapping
codons in three reading frames
C=C1C2...Cm
C 11=AGG
AGG ACG GGA TCA
A GGA CGG GAT CA
AG GAC GGG ATC A
2
1
C =GGA
3
2
C =GGG
Probability that sequence is coding
Probability that sequence is coding is equal probability that
sequence of codons is coding. Assuming independence
between adjacent codons the probabilty that sequence is
coding will be equal to the product of codon frequencies.
P(C) = F(C1)F(C2)..F(Cm)
0.022
0.038
AGG ACG GGA TCA
A GGA CGG GAT CA
AG GAC GGG ATC A
P(C) = F(AGG)F(ACG) = 0.022 x 0.038 =0.000836
Probability that sequence is
non-coding
If the sequence is non-coding the codon frequency will
be random and each codon will be equally prbable. In
this case frequency for each codon will be 0.0156. This
is because we have 64 codons and each of them is
equally possible.
Therefore probability that the sequence is non-coding
will be:
P(C) = F(AGG)F(AGC) = 0.0156 x 0.0156 = 0.000244
Log-likelihood ratio
i
LP (S) = log P (S)
P 0(S)
LP(S) = log 1.4102+log 2.4358 = 0.1493 + 0.3866 = 0.53 > 0
Rule based method
Minimal
length ORF
Codon usage
Splicing
sites
Putative exon
Neural network
Minimal
length ORF
Splicing
sites
Przypuszczalny
Putative exon
ekson
Codon
usage
GRAIL




Gene recognition and Analysis Internet Link
Uberbacher and Mural, 1991
First gene prediction program
GRAIL1


Neural network recognizing coding potential within fixed-size
window (100 bp)
Evaluates coding potential without looking for additional features
(e.g. splice junctions, start and stop codons)
 GRAIL2
 Variable size of windows
 Incorporated genomic context information (splice sites, start and
stop, polyadenylation signals)
 GRAILEXP
 http://compbio.ornl.gov/grailexp/
GRAILEXP
MZEF
 M. Zhang 1997
 Predicts exons only, does not build gene
structure
 Uses ‘quadratic discriminant analysis”
 Variable measures:
 Exon
length
 Intron-exon transition
 Branch site scores
 http://rulai.cshl.org/tools/genefinder/
MZEF
FGENES
 Solovyev et al., 1994
 Predicts internal exons
 Linear discriminant analysis
 Donor and acceptor splice sites
 Putative coding regions
 Intronic regions both 5’ and 3’ to the putative
exon
 Passes results to a dynamic programming
algorithm to come up with coherent gene
model
 http://www.softberry.com/berry.phtml
FGENESH
FGENESH
GENSCAN
 Burge and Karlin
 Search for general and specific compositional
properties of distinct functional units in eukaryotic
genes
 General fifth-order markov model of coding regions
 Analyze both DNA strands
 Sequences may contain multiple and/or partial
genes
 http://genes.mit.edu/GENSCAN
GENSCAN options
Organism
vertebrate
Maize
Arabidopsis
Output
predicted peptides only
predicted CDS and peptides
Suboptimal exon cutoff
1.00
0.50
0.25
...
0.01
GENSCAN output
Prom = promoter
Init = Initial exon
Intr = Internal exon
Term = Terminal exon
Sngl = Single exon gene
PlyA = poly-A signal
P-range
0.00-0.50
0.50-0.75
0.75-0.90
0.95-0.99
0.99-1.00
Accuracy
29.8%
54.1%
74.8%
92.4%
97.7%
Graphical output
Are predictions the same?
GRAILEXP
MZEF
FGENESH
GenScan
Evaluation statistic
TP
TP - true positive
FP - false positive
FN - false negative
TN - true negative
FN
FP
TN
TP
Sensitivity
Fraction of actual coding regions that are correctly
predicted as coding, ranging from 0 to 1
Sn = TP/(TP+FN)
Specificity
Fraction of the prediction that is actually correct,
ranging from 0 to 1
Sp = TP/(TP+FP)
Correlation
Combined measure of sensitivity and specificity,
ranging from -1 (always wrong) to +1 (always right)
CC =
TP x TN + FPx FN
(PP)(PN)(AP)(AN)
Experimental validation of predicted genes
20 not annotated human BAC clones
3 finished
17 unfinished
Genes that had at least two exons, each predicted by at least two
programs
the overlap of the predicted exons did not have to be perfect
similarity to ESTs or known genes was used as supporting evidence
but was not required
40 genes (number of exons 2-11)
Six single exons predicted by three or four programs
Three two-exon genes predicted by one program only but strongly
supported by similarities to EST sequences
12 genes were eliminated from further studies as they contained
repetitive elements and were most likely false positives
Total: 37 putative transcripts
Prediction programs performance
37 genes were tested, 16 of them (43%) were confirmed.
At the exon level there were 159 exons and 58 (36%)
were found to be real
predicted exons
specificity
sensitivity
MZEF
34
0.51
0.56
GRAIL
11
0.48
0.19
GENSCAN
52
0.46
0.91
FGENES
45
0.37
0.75
Gene structure and alternative splicing
Repetitive elements
12 genes containing large fragments of repetitive
elements were eliminated from experimental
validation. However some proteins are partially coded
by Alu or MER (Makalowski 1994, 2000)
Gene gm121 contains MER at 5' end of cDNA and this
is experimentally confirmed gene.
Gene gm124 was eliminated from our experimental
validation based on the presence of repetitive element.
Further analysis showed that this gene is a part of
transcript FLJ23129, GenBank accession AK026782
Repetitive elements
50% of mammalian genome are repeats:
DNA transposons
retrotransposons
LINEs
SINEs
tandem repeats
masking before similarity search - helps avoid getting
similarities caused by the presence of repetitive
elements, not because of sequences homology
predicted gene with repetitive elements are less likely
to be real, although sometimes repeats are true parts of
coding sequence
RepeatMasker
Searches for Alu, MIR, LINE, LTR and other repeats by
comparison to sequences in RepBase library
RepBase is a database of repetitive DNA sequence
elements found in a variety of eukaryotic organisms
including primates, rodents, cow, dog, chicken, fugu,
drosophila, arabidopsis, rice
Accepts local databases with repetitive elements
Repetitive elements are masked by replacing nucleotides
with string of letters N
Masking can be limited to certain types of repeats only
RepeatMasker
http://www.repeatmasker.org/
RepeatMasker outputs
Functional annotation- Gene Ontology
 Gene Ontology: A controlled vocabulary to describe gene
products - proteins and RNA - in any organism.
 Gene is described in three categories:
 Cellular component: where a gene
product acts
Molecular function: activities or “jobs”
of a gene product

Biological process: a commonly
recognized series of events

Annotation source
ISS
IDA
IPI
TAS
NAS
IMP
IGI
IEP
IC
ND
Inferred from Sequence/Structural Similarity
Inferred from Direct Assay
Inferred from Physical Interaction
Traceable Author Statement
Non-traceable Author Statement
Inferred from Mutant Phenotype
Inferred from Genetic Interaction
Inferred from Expression Pattern
Inferred by Curator
No Data available
IEA
Inferred from electronic annotation
Gene annotation
Annotating genes
 Model organism: look at curated databases
Annotating novel genes
 Similarity search against curated and well
annotated database: functional annotations
deducted from close and significant match
 Functional domains identification: mapping
domains and terms to Gene Ontology
Annotating novel genes
 Similarity search against curated and well
annotated database: functional annotations
deducted from close and significant match
 Identification of functional domains: mapping
domain to GO term
Regulatory sequences
 Difficult to identify using computer programs
 Most
regulatory elements are still not
identified
 They are usually very short: 6-10 base pairs
 They may differ in one or more places
Finding regulatory elements
 Phylogenetic footprinting – comparative
genomics used to identify conserved non-coding
sequences (MultiPipMaker)
 Phylogenetic shadowing – conceptually similar
to phylogenetic footprinting, with the difference
that closely related species are compared
(eShadow)
 Gibbs sampling – random iterations of the
promoter regions alignments to identify blocks of
conserved residues (Gibbs Motif Sampler)
 Hidden Markov Models – searching for known
regulatory elements using probability matrices
(Cister)
Phylogenetic shadowing
Gibbs sampling
ACGGTACGTTGG
GGTACGTAGGAC
TGGCTACCTTGG
CGGTACGTAGGT
******* **
Building model from existing alignment
ACA
TCA
ACA
AGA
ACC
- - - ATG
ACT ATC
C - - AGC
- - - ATC
G - - ATC insertion
Transition probabilities
Output Probabilities
A HMM model for a DNA motif alignments, The transitions are
shown with arrows whose thickness indicate their probability. In
each state, the histogram shows the probabilities of the four
bases.
Cister
Detects cis-elements clusters by using Hidden Markov
Model
For each element uses separate matrix with frequencies of
each nucleotide in each position; user can input matrix for
elements not included in the basic option
User can specify:
ƒ distance between neighboring cis-elements within a
cluster
ƒ number of cis-elements in the cluster
ƒ distance between clusters
ƒ half-width of the sliding window
Frequency matrix
NA
AML-1a
XX
DE
runt-factor AML-1
XX
BF
T02256; AML1a; Species: human,
sapiens.
XX
P0
A
C
G
T
01
5
1
2
49
02
2
2
52
1
03
4
14
1
38
04
0
0
57
0
05
1
0
55
1
06
1
4
0
52
Homo
T
G
T
G
G
T
TGTGGT
TGCGGT
TGTGGT
AGTGGT
TGTGGC
Sequences of experimentaly identified elements are aligned
and frequencies in each position are calculated
Cister
http://zlab.bu.edu/~mfrith/cister.shtml
Download