Annotation of eukaryote genomes and
transcriptomes
Some possibilities and some pitfalls
What is annotation?
• Identification of regions of interest in
sequence data
From a genome…
…to an annotated gene
An annotation GTF file
A reminder, you know this already…
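As a refresher on the layout: a GTF line has nine tab-separated fields (seqname, source, feature, start, end, score, strand, frame, attributes). The sketch below parses one such line in Python. The example line and its identifiers are invented for illustration, and the attribute parsing assumes no semicolons inside quoted values.

```python
def parse_gtf_line(line):
    """Split one GTF line into a dict; the attribute column becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    attributes = {}
    # Assumption: attribute values contain no ';' characters.
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Hypothetical example line (identifiers invented for illustration).
example = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";'
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```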
There are two major parts of annotation
• 1) Structural: Find out where the regions of
interest (usually genes) are in the genome and
what they look like. How many exons/introns?
UTRs? Isoforms?
• 2) Functional: Find out what the regions do.
What do they code for?
The challenge: Which one is correct?
Transcriptomes are different but have their own challenges
• No introns, but
where are the
start and stop
codons?
• Still needs
functional
annotation
Why is annotation important?
RNA-seq reads
Genome
Data used - Proteins
• Conserved in sequence => conserved annotation
with little noise
• Proteins from model organisms often used =>
bias?
• Proteins can be incomplete => problems as many
annotation procedures are heavily dependent on
protein alignments
>ENSTGUP00000017616 pep:novel chromosome:taeGut3.2.4:8_random:2849599:2959678:-1 gene:ENSTGUG00000017338 transcript:ENSTGUT000000180
RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER
DGKELIKKPKTFKFTFLKKKKKKKKKTFK
>ENSTGUP00000017615 pep:novel chromosome:taeGut3.2.4:23_random:205321:209117:1 gene:ENSTGUG00000017337 transcript:ENSTGUT00000018017
PDLRELVLMFEHLHRVRNGGFRNSEVKKWPDRSPPPYHSFTPAQKSFSLAGCSGESTKMG
IKERMRLSSSQRQGSRGRQQHLGPPLHRSPSPEDVAEATSPTKVQKSWSFNDRTRFRASL
RLKPRIPAEGDCPPEDSGEERSSPCDLTFEDIMPAVKTLIRAVRILKFLVAKRKFKETLR
PYDVKDVIEQYSAGHLDMLGRIKSLQTRVEQIVGRDRALPADKKVREKGEKPALEAELVD
ELSMMGRVVKVERQVQSIEHKLDLLLGLYSRCLRKGSANSLVLAAVRVPPGEPDVTSDYQ
SPVEHEDISTSAQSLSISRLASTNMD
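Neither of the two proteins above starts with a methionine, which hints that they may be incomplete. A rough Python sketch for flagging such entries in a protein FASTA file follows. The function names are made up for this example, the first header is shortened to its ID, the second entry is hypothetical, and a missing leading M is only a heuristic, not proof of truncation.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def lacks_start_methionine(sequence):
    """Heuristic only: True if the protein does not begin with M."""
    return not sequence.upper().startswith("M")


fasta = """>ENSTGUP00000017616
RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER
DGKELIKKPKTFKFTFLKKKKKKKKKTFK
>hypothetical_complete_protein
MKTAYIAKQR
""".splitlines()

# Collect the IDs of proteins that look N-terminally incomplete.
flagged = [h.split()[0] for h, s in read_fasta(fasta) if lacks_start_methionine(s)]
print(flagged)
```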
Data used - RNA-seq
• Should always be included in an annotation
project
• From the same organism as the genomic data
=> unbiased
• Can be very noisy (tissue/species dependent),
can include pre-mRNA
• PASA, or some other filtering method, often
needed
Adding RNA-seq - Spliced reads
Pre-mRNA
Includes everything that is transcribed
Stranded RNA-seq
Take-home message: Combine data
General recommendations
• Always combine different types of evidence!
• One single method is not enough!
• Use Maker!
Types of evidence
1. De novo gene finders, some in combination
with RNA-seq
2. Transcripts assembled from mapped RNA-seq reads
3. Assemble transcripts de novo followed by
mapping
4. Lifting over annotations from a closely
related genome
5. Protein alignments
De novo gene prediction
• Commonly used programs: GeneMark-ES,
Augustus, GlimmerHMM, SNAP, Genscan,
FGENESH…
• These use hidden Markov models (HMMs) to
describe how introns, exons, UTRs etc. are structured
• The HMMs need to be trained!
Different training files
• glimmerhmm_linux fasta.file -d
GlimmerHMM/trained_dir/arabidopsis/
• Each program comes with information on how to
train it on "your" organism
• GeneMark-ES trains itself automatically
CEGMA can help
De novo gene prediction
• Pros: Can give you both exons and CDSs, fast
• Cons: Very dependent on good training; because
less "real" data is used, accuracy can suffer
Transcripts assembled from mapped RNA-seq
reads
• Bowtie/TopHat/Cufflinks, PASA
• Reads are mapped to the genome, and
transcripts are assembled based on the
distribution of the reads
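To make "based on the distribution of the reads" a bit more concrete, here is a toy Python sketch that merges overlapping read alignments into contiguous expressed regions. This is only an illustration of the idea. It is not the algorithm used by Cufflinks or PASA, it ignores splicing and strand, and the read coordinates are invented.

```python
def merge_read_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals.

    Coordinates are 1-based and inclusive. Returns a sorted list of tuples.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Read overlaps or touches the current region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap in coverage: start a new region.
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Invented read alignments on one chromosome.
reads = [(100, 150), (120, 170), (165, 210), (500, 550), (540, 590)]
regions = merge_read_intervals(reads)
print(regions)  # [(100, 210), (500, 590)]
```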
Cufflinks
Transcripts assembled from mapped RNA-seq
reads
• Pros: Based on what is expressed, retains
isoforms, not biased
• Cons: Sensitive to "dirt" in data, can create too
many isoforms, does not include CDS
annotations
Assemble transcripts de novo followed by mapping
• Overlapping reads are assembled into
transcripts, often without reference
• Trinity is commonly used, but Velvet with
Oases can also be used, or Newbler if you
have 454 RNA-seq
• If the data is clean and the coverage is good
enough, full length transcripts will be
assembled. These can be mapped to the
genome using BLAT etc.
Mapped Trinity-assembled transcripts
Assemble transcripts de novo followed by mapping
• Pros: Isoforms are retained, UTRs are
included, only based on evidence from the
studied organism
• Cons: Can be "noisy" with some intronic or
intergenic sequences retained, CDSs not
annotated
Lifting over annotations from a closely related genome
• Kraken
• Align the two genomes and then transfer
annotations between aligned regions
• Alignment is best done using Satsuma
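The principle of transferring annotations between aligned regions can be sketched in Python as follows. This is a simplified illustration, not how Kraken or Satsuma work internally. It assumes ungapped, same-strand alignment blocks, only lifts a feature that lies entirely inside one block, and uses invented coordinates.

```python
def lift_interval(start, end, blocks):
    """Map a feature (start, end) from a source genome to a target genome.

    Each block is (source_start, source_end, target_start), 1-based inclusive,
    and is assumed to be an ungapped alignment on the same strand.
    Returns the lifted (start, end), or None if no single block contains
    the whole feature.
    """
    for src_start, src_end, tgt_start in blocks:
        if src_start <= start and end <= src_end:
            offset = tgt_start - src_start
            return start + offset, end + offset
    return None


# Invented alignment blocks between two assemblies.
blocks = [(1000, 2000, 5000), (3000, 3500, 7200)]

lifted = lift_interval(1200, 1300, blocks)    # inside the first block
unlifted = lift_interval(1900, 3100, blocks)  # spans two blocks
print(lifted, unlifted)  # (5200, 5300) None
```

Features that fall outside the aligned regions are simply lost in this sketch, which is one reason a lift-over depends so heavily on a good alignment.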
Lift-overs
Lifting over annotations from a closely related genome
• Pros: Works great for closely related
organisms, or re-assemblies, uses synteny
which helps in orthology determination and
functional annotation
• Cons: Not really an annotation, not based on
the target organism, dependent on good
alignment
Protein alignments
• Not really an annotation, but very useful in
combination with other data
• Exonerate often used, but Scipio is also good
• Conserved, little noise
Protein alignments using Scipio
Or combine everything
• Combine protein-alignments, aligned
transcripts, ESTs, and de novo gene
predictions into one single annotation
• Maker is the best "light-weight" choice, but
there is also EVM. The Ensembl pipeline needs
considerable expertise and resources.
EVidenceModeler
Maker
Transcript annotation
• Here the transcript is already defined. The
challenge is to find where the coding region
starts and stops
• Transdecoder
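The simplest form of the problem, finding the longest stretch from a start codon to an in-frame stop codon, can be sketched in Python as below. This is a naive forward-strand scan for illustration only. It is not Transdecoder's method, and the example transcript is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return (start, end, frame) of the longest ATG..stop ORF, forward strand only.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Returns None if no complete ORF (ATG followed by an in-frame stop) exists.
    """
    seq = sequence.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if orf_start is None and codon == "ATG":
                orf_start = pos  # first ATG opens an ORF in this frame
            elif orf_start is not None and codon in STOP_CODONS:
                candidate = (orf_start, pos + 3, frame)
                if best is None or candidate[1] - candidate[0] > best[1] - best[0]:
                    best = candidate
                orf_start = None  # look for the next ORF in this frame
    return best


# Invented transcript: 2 nt of 5' UTR, a short ORF, 2 nt of 3' UTR.
transcript = "CCATGAAACCCGGGTAAGG"
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])  # (2, 17, 2) ATGAAACCCGGGTAA
```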
Transdecoder
Transcript annotation
• Trinotate
Trinotate
2) Find out what the regions do, what do they code for?
• BLAST-based
• Functional annotation based on
functional domains
• Transferring names based on overlapping
coordinates
• Blast2GO rocks!
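Transferring names based on overlapping coordinates can be sketched in Python like this: each new gene model takes the name of the already-named feature it overlaps most on the same sequence. The gene IDs, names and coordinates are invented, and strand is ignored. A real pipeline would also want a minimum-overlap threshold.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of shared bases between two 1-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def transfer_names(new_genes, named_genes):
    """Assign each new gene the name of the best-overlapping named gene.

    Both arguments map an identifier to (seqname, start, end).
    Genes without any overlap get None.
    """
    result = {}
    for gene_id, (seqname, start, end) in new_genes.items():
        best_name, best_overlap = None, 0
        for name, (n_seqname, n_start, n_end) in named_genes.items():
            if n_seqname != seqname:
                continue
            shared = overlap_length(start, end, n_start, n_end)
            if shared > best_overlap:
                best_name, best_overlap = name, shared
        result[gene_id] = best_name
    return result


# Invented example data.
new_genes = {"g1": ("chr1", 100, 500), "g2": ("chr1", 900, 1200), "g3": ("chr2", 100, 500)}
named_genes = {"GENE_A": ("chr1", 80, 450), "GENE_B": ("chr1", 480, 950)}

names = transfer_names(new_genes, named_genes)
print(names)  # {'g1': 'GENE_A', 'g2': 'GENE_B', 'g3': None}
```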
Blast2GO
KEGG-mapping
Some take home messages
• Annotations are never final and are not the
absolute truth
• Annotations depend a lot on the method used
• Maker is currently the best light-weight tool
to combine different types of data
• If you intend to use RNA-seq for annotation,
stranded RNA-seq is great
• Get help!
BILS annotation resource
• Starting up this autumn
• Uses the Ensembl pipeline and/or Maker
• Delivers high quality annotations
• Also available for consultation
Biosupport.se