Gene Prediction Methods

Chap 9. Gene Discovery
DNA
RNA
protein
cDNA
EST (Expressed Seq. Tag)
Gene Discovery



A major application of bioinformatics
Matching known patterns of genes
A gene


Promoter + 5’ UTR + Protein coding sequence + 3’ UTR
Coding sequence starts with ATG and stops with TAG, TGA, or TAA

Coding sequence is called an open reading frame (ORF)
Gene Structure
ORF (Open Reading Frame): DNA can be
read in six reading frames
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
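The six frames listed above can be enumerated programmatically. The following is a minimal sketch rather than anything taken from the slides; the function names `reverse_complement` and `six_frames` are illustrative assumptions, and the input is assumed to be upper-case A/C/G/T only.

```python
# Translation table used to complement a DNA strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split both strands into codons for each of the three offsets.

    Keys are "+1", "+2", "+3" for the given strand and "-1", "-2", "-3"
    for the reverse complement. Trailing partial codons are dropped.
    """
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


# Top strand from the slide.
top = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(top)
for name, codons in frames.items():
    print(name, " ".join(codons[:2]))
```

Printing the first two codons of each frame reproduces the six short reads shown on the slide (CAT CAA, ATC AAC, TCA ACT on the top strand; GTG GGT, TGG GTA, GGG TAG on the bottom strand).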
Transcription

Gene sequence is copied from one strand
 Sense strand = mRNA sequence
 Antisense strand is used to generate mRNA sequence

5’CGCTATAGCGTTTCAT 3’ -- antisense, template strand
3’GCGATATCGCAAAGTA 5’ – sense, coding strand
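The relationship between the two strands and the mRNA can be checked with a few lines of code. This is a minimal sketch; the function name `transcribe` is an illustrative assumption, and the template strand is assumed to be given 5'->3' in upper-case A/C/G/T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(template_5to3):
    """Return the mRNA (5'->3') copied from a template strand given 5'->3'.

    The mRNA has the same sequence as the sense (coding) strand,
    with U in place of T.
    """
    coding = template_5to3.translate(COMPLEMENT)[::-1]
    return coding.replace("T", "U")


# Antisense (template) strand from the slide.
template = "CGCTATAGCGTTTCAT"
mrna = transcribe(template)
print(mrna)
```

The result is the sense strand of the slide read 5'->3' (ATGAAACGCTATAGCG) with U substituted for T, so the transcript begins with the start codon AUG.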

Transcription initiation

Double-helix DNA strands are separated in the gene coding region
 Which enzyme detects the beginning of a gene?
 RNA polymerase (a multi-subunit enzyme that synthesizes RNA)
binds to the promoter
 RNA polymerase I – 28S, 5.8S and 18S rRNA genes
 RNA polymerase II – coding genes, snRNA
 RNA polymerase III – tRNA, 5S rRNA, snoRNA
 Other enzymes
 General (Basal) Transcription Factor (GTF)
 TFIIA, TFIIB, TFIID
 TFIID – recognizes the promoter sequence
 http://www.youtube.com/watch?v=MkUgkDLp2iE
Promoter in E.coli
Transcription initiation in E.coli
Transcription initiation in eukaryotes

Promoter consists of
 -25 box, or TATA box (TATAWAW; W = A or T)
 and Inr (initiator) sequence (YYCARR; Y = C or T; R = A or G)
Transcription initiation in eukaryotes
Initial contact is made by general
transcription factor (GTF) TFIID, which
consists of TATA-binding protein (TBP)
and at least 12 TBP-associated factors
(TAF)
Transcription Start Site (TSS)





www.cs.uml.edu/~Kim/580/review_polII_11_Kadonaga.pdf
TSS – the first base copied to mRNA
Core promoter – region around a TSS
Conventionally, core promoter has
 TATA box about 30 bp upstream of the Inr (Initiator)
 Transcription factors (TFs) bind to the TATA box, the Inr sequence, and
other sites; bend the DNA by 90 degrees; and recruit the general TFs
 CpG islands: 300-3000 bp stretches rich in C and G, found in 40% of promoters
More recently,
 TATA box found in only 10-20% of promoters
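Whether a stretch of sequence looks like a CpG island can be estimated from its base composition. The sketch below is an illustration, not a method from the slides: the minimum length of 300 bp is taken from the lower bound quoted above, while the thresholds of 50% G+C and an observed/expected CpG ratio of 0.6 are commonly cited values assumed here. The function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")                   # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0      # expected CpG count from base frequencies
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_len=300, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, G+C, and observed/expected thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and ratio >= min_ratio


rich = "CG" * 150   # toy sequence, 300 bp, all CpG
poor = "AT" * 150   # toy sequence, 300 bp, no C or G
print(looks_like_cpg_island(rich), looks_like_cpg_island(poor))
```

In practice such a test would be applied in a sliding window along the promoter region rather than to one fixed sequence.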
Core Promoter Elements





TFIIB Recognition Element (BRE) (SSRCGCC)
 BREu (BREd) suppresses (enhances) transcription
TATA box – TATAWAAR (metazoans)
 W = A or T; S = C or G; R = A or G (purine); Y = T or C (pyrimidine); N = any base
Inr – YYANWYY (A+1)
DPE (downstream Core Promoter Element)
MTE (Motif Ten Element)
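The degenerate consensus sequences above can be searched for by expanding each IUPAC code into a character class. This is a minimal sketch; the helper names and the toy promoter sequence are made up for illustration, and only the IUPAC codes that occur in the motifs above are included.

```python
import re

# IUPAC nucleotide codes used by the motifs on this slide.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif; the lookahead allows overlapping matches."""
    body = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + body + "))")


def find_motif(seq, motif):
    """Return 0-based start positions of every match on the given strand."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq.upper())]


# Hypothetical toy promoter: a BRE, a TATA box, a spacer, then an Inr.
promoter = "GGGCGCCTATAAAAGGCTGACGTGCAGAGCAGCGAAGTCAGTCTGA"

for name, motif in (("BRE", "SSRCGCC"), ("TATA", "TATAWAAR"), ("Inr", "YYANWYY")):
    print(name, find_motif(promoter, motif))
```

Short degenerate motifs such as these match frequently by chance, so a hit on its own is weak evidence; position relative to the other elements matters.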
Focused/Dispersed TSS


Focused (Sharp) TSS
 Distinct TSS site
 Usually TATA box in sharp TSS
 Primarily in tissue-specific expressions
Dispersed (Broad) TSS
 Multiple weak start sites in 50-100 nt
 A few Inr or Inr-like seq in the neighborhood
 Generally associated with ubiquitously expressed genes
 Thought to be related to CpG islands
How to recognize the end of transcription ?

Terminator seq. stalls polymerase
Splicing

Alternative splicing to produce mRNA

Spliceosome – a complex of snRNAs and proteins
Function of Introns


www.cs.uml.edu/~kim/580/review_intron.pdf
When inserted into the promoter region, introns boost expression level
First introns are long
 Alternative exons are flanked by long introns
 However, no association between intron length and expression breadth
has been found in human
 Removal of 2nd intron of human beta-globin gene reduces the
efficiency of 3’-end formation
RNA pol II elongation rate – 3.8 kb/min
 Introns may serve as time delays between activation of a gene
and the appearance of its product
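The size of such a delay follows directly from the elongation rate. The sketch below uses the 3.8 kb/min figure quoted above; the function name and the 38 kb example intron are illustrative assumptions.

```python
POL2_RATE_KB_PER_MIN = 3.8  # elongation rate quoted on the slide


def transcription_minutes(length_bp, rate_kb_per_min=POL2_RATE_KB_PER_MIN):
    """Minutes RNA pol II needs to transcribe a stretch of the given length."""
    return (length_bp / 1000) / rate_kb_per_min


# Hypothetical 38 kb intron: roughly ten minutes of extra transcription time.
print(round(transcription_minutes(38_000), 1))
```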


Annotation: How do I get from this…
>mouse_ear_cress_1080
AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACC
GGTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGA
AAGCGGGTTGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAA
TTTACCAAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAG
AGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGT
TTTGGGATGTAGAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGA
ATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACA
AACTCTTTAAGAACGTATCTTTCAGTTTTCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACT
GAACCGAATTTAAACCGGAGGGAGGGTTTGACTTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGA
AGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAAGCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGA
CCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCCCAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTC
AACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGGAAAGGTTGATATTTTCCCCTTCGCTTTGGTCTT
ATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTTGGCTAAGAAGAGATCTTTACTCTCTGTAT
TTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAATAAAGTATTGAGCTTTACTAAGCTT
TCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTTCTCCAGCTCGACTACACTGAA
GGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAAAGAGAGTAATTGCTTTG
CGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACACTTCTCTAATTGAT
AACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTTTACTGTCTG
TGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATATTTGA
…to this?
Meaning?
Comparative Tools (Database searches)
What do we know about genes?

Expressed (Transcribed)
- Transcriptional start & termination sites (TXSS, TXTS)
- Transcription artifacts (cDNA & ESTs (Expressed Sequence Tags))
Spliced
- Splice sites (GT-AG)
Regulated
- Promoters (TATAAA)
- Transcription Factor Binding Sites
- CpG
Meaningful (Translated)
- 3n basepairs
- Codon usage
- Translational start & stop/termination codons (TLSS, TLTS)
- Translation artifacts (proteins)
Derived (Homology: Paralogy/Orthology)
- Search for known genes, proteins (BLAST)
How might this knowledge help to find genes?

Predict genes
- Look for potential starts and stops.
- Connect them into open reading frames (ORFs).
- Filter for “correct” length & codon usage.
Search databases
- Known genes: UniGene
- Known proteins: UniProt
Use transcript evidence
- cDNA
- ESTs (Expressed Sequence Tags)
- proteins
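The "starts and stops" strategy above can be sketched as a simple ORF scan. This is a minimal illustration, not the algorithm of any particular gene finder: it scans only the given strand (run it on the reverse complement for the other three frames), filters by length only, and does not model codon usage. The function name and the `min_codons` parameter are assumptions.

```python
STOPS = {"TAG", "TGA", "TAA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the given strand.

    start is 0-based, end is exclusive and includes the stop codon,
    and min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


demo = "CCATGAAACGCTAGGG"   # toy sequence containing ATG AAA CGC TAG
orfs = find_orfs(demo)
print(orfs)
```

Because the scan is frame by frame, the TGA that overlaps the start codon in another frame of the toy sequence is correctly ignored.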
Canonical splice sites
[Figure: pre-mRNA with two exons flanking an intron; the 5’ splice site and the 3’ splice site mark the intron boundaries]
Of 1588 examined predicted splice sites in Arabidopsis
1470 sites (93%) followed the canonical GT…AG
consensus. (Plant (2004) 39, 877–885)
Reddy, S.N. Annu. Rev. Plant Biol. 2007 58:267-94
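The canonical GT…AG rule is easy to test for a candidate intron. The following is a minimal sketch on a made-up toy sequence; the function names are illustrative assumptions, coordinates are 0-based half-open intervals, and the DNA alphabet (GT rather than GU) is used as on the slide.

```python
def is_canonical_intron(pre_mrna, start, end):
    """True if the interval [start, end) begins with GT and ends with AG."""
    intron = pre_mrna[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(pre_mrna, introns):
    """Remove introns given as sorted, non-overlapping (start, end) intervals."""
    exons, pos = [], 0
    for start, end in introns:
        exons.append(pre_mrna[pos:start])
        pos = end
    exons.append(pre_mrna[pos:])
    return "".join(exons)


# Toy gene: exon "ATGGCC", intron "GTAAGTTTTCAG", exon "GGATAA".
pre = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(is_canonical_intron(pre, 6, 18), splice(pre, [(6, 18)]))

# Share of canonical sites reported above for Arabidopsis.
print(round(100 * 1470 / 1588))
```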
Alternative Splicing
The primary transcript of a gene is spliced into different
mRNAs leading to multiple proteins generated from the
same gene.
- Contributes to protein diversity.
- Can occur in any part of the transcript, including UTRs.
- Can alter start codons, stop codons, reading frame, CDS,
UTRs.
- May alter transcript stability (half-life), translation (time, location, duration),
protein sequence, or several of these.
The dogmas – they are a-changing…
One gene, one enzyme
One gene, one polypeptide
One gene, one set of transcripts (> 0)
Alternative splicing in metazoans (Animalia)
Splice statistics for human genes
Alternative splicing in animals. Nature
Genetics Research 36; 2004
Bridging the gap between genome and
transcriptome Nucleic Acids Research 32, 2004.
• Alternative splicing well characterized in animals.
• As many as 96% of human genes may have multiple splice forms.
• Functional significance of alternative splicing still poorly understood.
Alternative splicing in plants
Alternative splicing of rubisco activase was one of the first plant examples:
“The data presented here demonstrate the existence of alternative
splicing in plant systems, but the physiological significance of synthesizing
two forms of rubisco activase remains unclear. However, this process may
have important implications in photosynthesis. If these polypeptides were
functionally equivalent enzymes in the chloroplast, there would be no
need for the production of both….”
Biological significance of AS in plants
…includes:
- regulation of flowering;
- resistance to diseases;
- enzyme activity (timing, duration, turn-over time, location).
Most genome databases give alternatively spliced plant gene variants
Example: Jasmonate signaling in Arabidopsis
- Plant hormone; affects cell division, growth, reproduction and responses to
insects, pathogens, and abiotic stress factors.
- Jasmonate signaling repressor protein JAZ10: splice variants JAZ10.1,
JAZ10.3, and JAZ10.4 differ in susceptibility to degradation.
- Phenotypic consequences include male sterility and altered root growth.
Example: Jasmonate signaling in Arabidopsis
- Alternative splice sites C’ and D’ lead to different splice variants
JAZ10.3: premature stop codon in D exon, intact JAS domain
JAZ10.4: truncated C exon, protein lacks JAS domain
JAZ10 is encoded by At5G13220
AS in different Reading Frames
Gene Prediction
Gene Prediction Methods

Intrinsic or template methods (ab initio)



Search by signal
 Signals (short, functional DNA elements involved in gene specification)
 Four basic signals defining coding exons
 Translation start site, 5’ (donor) splice site, 3’ (acceptor) splice site, translation stop site
Search by content
Extrinsic or look-up methods


Homology-based
 Compare sequence of interest against known coding sequences
Comparative gene prediction
 Compare sequence of interest against anonymous sequences
Gene Prediction Methods

Sequence-based


Alignment-based


Search for orthologous genes of other organisms
 Search for strong conservation of a genome region
Content-based


Search for ORFs, and consensus sequences
Search for patterns such as nucleotide or codon frequency,
characteristic of coding sequences
Probabilistic

Prediction algorithms
Typical Computational Steps in Gene
Prediction




Identify and score suitable splice sites and start/stop signals along the
query sequence
Predict candidate exons as detected by these signals
Score exons as a function of signals and coding stats
 Factor in the quality of alignment between the query and known
coding sequences
Assemble a subset of these exon candidates into a predicted gene
structure
 Assemble so as to maximize a particular scoring function
Prediction and Scoring of Exons

Protein coding regions have characteristic compositional
bias


e.g., A triplet pattern in coding region
Hexamer frequency method with 5th-order Markov models is
widely used

The likelihood of a particular base at a given position depends on
the five preceding bases
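A 5th-order Markov model amounts to a table of hexamer statistics: the probability of each base given the five bases before it. The sketch below trains one model on coding and one on non-coding sequences and scores a query by log-odds. It is an illustration under stated assumptions, not the scoring used by any specific program: the training sequences are toy data, add-one smoothing is assumed, sequences must be upper-case A/C/G/T, and real gene finders typically also keep separate models per codon position.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 5  # five preceding bases, so each observation is a hexamer


def train(seqs, order=ORDER, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[ctx].values()) + 4 * pseudocount
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in "ACGT"}
    return model


def log_odds(seq, coding, noncoding, order=ORDER):
    """Sum of log P_coding / P_noncoding over positions; > 0 favors coding."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding[ctx][base] / noncoding[ctx][base])
    return score


# Toy training data (hypothetical, far too small for real use).
coding_model = train(["ATGGCCGCCGCCGCCGCCGCCTAA"])
noncoding_model = train(["ATATATATATATATATATATATAT"])

print(log_odds("GCCGCCGCCGCC", coding_model, noncoding_model) > 0)
print(log_odds("ATATATATATAT", coding_model, noncoding_model) < 0)
```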
From Exons to RNA



Assembly of several Exons to a gene
 Combinatorially difficult
 Can use dynamic programming
 GRAIL (Gene Recognition and Analysis Internet Link),
FGENESH, GENEID
HMM (Hidden Markov Model)
 GENSCAN
Sequence Similarity-Based Gene Prediction
 GENEWISE
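To see why dynamic programming tames the combinatorial exon-assembly problem mentioned above, consider choosing the highest-scoring set of non-overlapping candidate exons. The following is a simplified sketch (weighted interval scheduling), not the actual algorithm of GRAIL, FGENESH, or GENEID: it ignores reading-frame compatibility and splice-site pairing, and the function name and example scores are made up.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick non-overlapping exons with maximal total score.

    exons: list of (start, end, score) with half-open coordinates.
    Returns (best_score, chain), where chain is in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [(0.0, [])]  # best[k] = optimum over the first k exons
    for k, (start, end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this exon starts.
        j = bisect_right(ends, start, 0, k)
        take_score = best[j][0] + score
        if take_score > best[k][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[k])
    return best[-1]


# Hypothetical candidate exons with made-up scores.
candidates = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (160, 260, 3.0)]
score, chain = best_exon_chain(candidates)
print(score, chain)
```

Each candidate is considered once, so the work grows roughly as n log n in the number of candidates instead of exponentially with the number of possible exon subsets.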
How Well Do Predictions Work ?



Sensitivity (Sn) = TP / (TP+FN)
Specificity (Sp) = TP / (TP+FP)
Correlation coefficient (CC)
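These measures can be computed directly from nucleotide-level counts. The sketch below assumes that CC is the usual correlation coefficient between predicted and true coding labels (the Matthews correlation), since the slide does not spell out the formula; the counts in the example are made up. Note that Sp as defined above, TP / (TP + FP), is the quantity called precision in other fields.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true coding positions that were predicted as coding."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted coding positions that are truly coding."""
    return tp / (tp + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """Assumed form: (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


# Hypothetical counts for a 1000-nt test sequence.
tp, fn, fp, tn = 80, 20, 10, 890
print(sensitivity(tp, fn), round(specificity(tp, fp), 3),
      round(correlation_coefficient(tp, tn, fp, fn), 3))
```

Unlike Sn and Sp, CC also takes true negatives into account, which is why a single summary number is often reported alongside the other two.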
Accuracy of Gene Finding Programs
• Sanja Rogic, Alan K. Mackworth, and Francis B.F. Ouellette (2001) Genome Research 11
Promoter Analysis
Annotation Cheat Sheet
A. DNA Subway
• Open existing project or generate new (Red square)
• Run RepeatMasker
• Generate evidence (Predictions, BLAST searches)
• Synthesize evidence into gene models (Apollo)
• Browse results locally and in context (Phytozome)
• Conduct functional analysis (link from Browser)
• Prospect for gene family (Yellow Line from Browser)
B. Apollo
• Select region that holds biological gene evidence
• Optimize work space and zoom to region (View tab)
• Expand all tiers (Tiers tab)
• Drag evidence item(s) onto workspace (mouse)
• Edit to match biol. evidence (right-click item for tools)
• Record what was done in Annotation Info Editor
• Assess necessity to build alternative model(s)
• Upload model(s) to DNA Subway (File tab)