Module 9. Genome annotation with Maker: Putting it all together

advertisement
Module 9. Genome annotation with Maker:
Putting it all together
Background
The Maker Gene Annotation Pipeline (Cantarel et al. 2008; Holt and Yandell 2012) combines the outputs
of several programs to generate a high quality set of gene annotations. Maker was designed primarily
for use on non-model organisms and to not over predict genes as is often the case solely using gene
predictors. Maker uses two types of evidence, intrinsic and extrinsic, to generate gene annotations.
Extrinsic evidence is sequence homology to known repeats and mRNA assembled from the organism or
closely related organisms. Homology is predicted from BLAST alignments which are then polished using
exonerate. Intrinsic evidence is evidence is evidence such as start and stop codons and intron-exon
boundaries predicted from gene predictors. The two gene predictors used by Maker are SNAP and
Augustus. Annotations are only produced if they are supported by multiple lines of evidence. For
example, a BLAST alignment must have at least a corresponding gene predictor result. A gene predictor
result without additional evidence will not be annotated as a gene. Maker assigns each annotation an
Annotation Evidence Distance (AED) score which corresponds to the amount of dissimilarity between the
evidence and the annotation. For example, an annotation that perfectly matches the evidence would
have an AED score of 0, an annotation with 50 % overlap with the evidence would have an AED of 0, and
a theoretical annotation that did not overlap any evidence would have an AED of 1. The most important
parts of Maker's output are a gff file, the predicted transcripts, and predicted proteins for each scaffold.
The program Apollo can be used to view and edit gene annotations produced by Maker.
V&C core competencies addressed
1) Ability to apply the process of science: Experimental design, Evaluation of experimental evidence,
Developing problem-solving strategies
2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical
methods to diverse data, Managing and analyzing large data sets
3) Use modeling and simulation to understand complex biological systems: Applying informatics tools,
Managing and analyzing large data sets
GCAT-SEEK sequencing requirements
None
Computer/program requirements for data analysis
Linux OS, Maker, GCATSEEK Linux Virtual Machine or access to Juniata HHMI cluster
Apollo requires a GUI and can be run on Mac or PC.
1
If using cluster from Window OS: Putty
If using from Mac or Linux OS: SSH
Protocols
This tutorial assumes you are using a fasta file named genome.fa you could, however, use any fasta file
simple replacing genome.fa with the name of the fasta file being used. This also assumes that you are
using the custom repeat library created in the RepeatScout tutorial, a protein file named proteins.fa, a
snap HMM file named genome.hmm, and an Augustus configuration directory species.
Also, running any of the listed commands without any additional input (or running them with the -h
option) will display the help message including a description of the program, the usage, and all options.
1) Make a Maker directory, move into it, and generate the Maker control files
$mkdir maker
$cd maker
$maker -CTL
This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl)
2) The maker_opts.ctl file contains all of the performance options for maker.
For this run, edit the following lines (nano -c maker_opts.ctl)
Line 2: genome=genome.fa
Line 22: protein=proteins.fa
Line 26: model_org=danio
Line 27: rmlib=genome.secondfilter.lib #created by repeatscout
Line 34: snaphmm=./44Sru.hmm
Line 36: augustus_species=SebastesRubrivinctus
Line 56: predstats=1
Note: The make_opts.ctl and maker_bopts.ctl (shown in full at the bottom) can be further edited to
adjust Maker further to adjust stringency of the blast searches. The maker_exe.ctl simply tells Maker
where to find the executables and should not need to be edited.
3) Once maker_opts.ctl and maker_bopts.ctl have been edited, run Maker.
If you are running on multiple processors:
$mpirun -n N maker
If you are running worker nodes on a cluster, use the line above as the last (command) line of the Qsub
script, and then run Qsub (see module 3 for details on Qsub).
If only running on a single processor:
$maker
2
Maker's Results
Maker will take each scaffold and place it in it's own directory along with the output that Maker
generates for that scaffold. The three most important files that are created are the gff, Maker's predicted
transcripts, and Maker's predicted proteins.
Maker's directory structure:
Directory_maker_ran_in -> genome.maker.output -> genome_datastore -> first_number ->
second_number -> scaffold_directory
Maker divides the scaffold directories into several layers of subdirectories so the filesystem will not be
slowed down by having to handle hundreds or even thousands of directories at one time.
Individual gffs can be viewed in genome browsers such as Apollo.
The most efficient way to make use of Maker's output is by combining Maker's output into a relatively
small number of large files. More information on handling these files is provided in Module 10.
1) Gather the predicted proteins (the predicted proteome).
In the genome_datastore directory
$cat ./*/*/*/*.proteins.fasta > genome.predictedproteins.fasta
2) Gather the predicted transcripts (the predicted transcriptome)
In the same directory
$cat ./*/*/*/*.transcripts.fasta > genome.predictedtranscripts.fasta
3) Gather all of the GFFs into one conglomerate GFF.
In the genome.maker.output directory
Several options:
Everything (including Maker's prediction, all evidence, masked repeats, and DNA)
$gff3_merge -d genome_master_datastore_index.log
No DNA at the end
$gff3_merge -n -d genome_master_datastore_index.log
Only Maker annotations
$gff3_merge -n -g -d genome_master_datastore_index.log
3
GFF file format
General Feature Format (GFF) is a standard file format, meaning that it is supported and used by several
programs and appears in the same form regardless of what program generated it. GFF is a tab delimited
file featuring nine columns of sequence data.
Column
Feature
Description
1
seqname
The name of the sequence
2
source
The program that created the
feature
3
feature
What the sequence is (gene, contig,
exon, match)
4
start
First base in the sequence
5
end
Last base in the sequence
6
Score
Score of the feature
7
strand
+ or -
8
frame
0,1, or 2 first, second, or thrid
frame
9
attribute
List of miscellaneous features not
covered in the first 8 columns, each
feature separated by a ;
Example gff (Shows part of a single exon maker gene found on scaffold 1)
Sequence 1 .
contig 1 1138 . . . name=sequence1
Sequence 1 maker
gene 42 501 . + . ID=maker-scaffold1-gene
Sequence 1 snap-masked match 42 501 . + . ID=snap-match
____________________________________________________
maker_opts.ctl
#-----Genome (these are always required)
genome= #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
4
#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=_#protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=_ #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for
RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=_#SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=_#Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
5
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
maker_bopts.ctl
#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'
pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Aligments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)
pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Aligments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)
pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Aligments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)
pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposbale Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold
____________________________________________________
6
Running Apollo (requires a GUI)
1. Open terminal and launch Apollo as follows:
$apollo (click "OK" if any pop-ups crop up)
2. Choose "GFF3" as the data source. Under "GFF" file, upload a GFF file from your folder
/Maker/genome.maker.output/genome_datastore/##/##/scaffold#/*.gff.

The horizontal scale at the center shows the coordinates of your sequence

Panels above and below this scale relate to the DNA strand you are analyzing,
representing the forward and reverse strands respectively

Adjacent blue panels are workspaces for building gene models

Adjacent black panels display evidence for that gene model

The windows at the bottom display detailed information about selected features
3. Maximize the Apollo window. Focus on the strand that contains the gene you are working on by
clicking on the "view" tab in the menu bar at the top of the window and unchecking the strand
that does NOT contain the gene of interest.
7
4. Click the "X2" button (for example) at the bottom of the screen to zoom in (or out) on your
sequence. The structure of the genes will become apparent--with alternating exons (boxes) and
introns (lines). You can continue zooming until the center scale displays the nucleotide
sequence. The red and green bars at the top and the bottom of the screen represent potential
start and stop codons in each of the three reading frames on the forward (top) and reverse
(bottom) strands.
5. Clicking "reset" returns you to the normal view.
6. Click "Tiers" in the menu bar and select "Expand all tiers." This displays each piece of evidence
on a separate line, allowing you to see all of the results of your analyses. The results of each
prediction program (e.g. Augustus, Maker, blastx) display separately in white in the black
evidence panel.
7. Clicking on any of the pieces of evidence will display additional information at the bottom of the
window (see “selected element” in picture above).
8. Zoom in to focus on one set of gene models and BLAST evidence. Consider several reasons why
various pieces of evidence may not align perfectly (notice how the BLASTX predicted exon is
longer than the other predicted exons in the figure below):

The first and last exons are the hardest to predict because they have only one intron
boundary.

Gene predictors may miss exons of an alternatively spliced gene.

BLAST evidence may come from different species, whose exons differ in length.

The BLAST algorithm does not determine splice sites.
8

BLAST results may include false matches.

Scroll left and right for a gene that has one or more predictions (FGenesH, Augustus, SNAP), as
well as BLASTN or/and BLASTX evidence.
9. Use the "zoom" function to enlarge the gene and clearly display the exon/intron structure.
10. Compare the gene predictions from FGenesH, SNAP, and/or Augustus with evidence from
BLASTN and BLASTX

The gene prediction algorithms are quite good at identifying locations of genes and
correct splice sites (exon/intron boundaries). However, they have difficulty identifying
first and last exons because these have only one splice site.

BLAST results are the most authoritative evidence for exons because they are retrieved
from databases for biological evidence from mRNA or protein sequence from a given
organism. However, BLAST results often do not accurately reflect splice sites. Also the
exon structure may be different for a related gene from a different organism. Finally,
BLAST may return incomplete results as not all genes or proteins are represented in the
database.
11. Click on any exon in the evidence panel to highlight "edge matches" (making sure “show edge
matches” is checked in the “view” tab of the menu bar). White bars on the exon indicate
consensus exon-intron boundaries within the evidence. (However, these are not necessarily the
correct splice sites.)
9
12. The predicted gene model may contain several errors that need to be fixed:

A green or red arrow indicates that the gene model is missing a start or stop codon

A yellow arrow indicates a splice site that does not follow the GT/AG rule (aka, “noncanonical”)

Annotations can be edited using the Exon Detail Editor. You can adjust annotations by
dragging the boundaries of exons and save the changes directly into the GFF.
13. Double-clicking on a gene model will select an entire annotation. Right click, select sequence,
then hit peptide or protein. From this window you can copy the predicted protein sequence
across all exons, paste in the NCBI blastp web page and compare single annotations to other
species of interest.
Assessment
The number of genes that Maker annotates is sensitive to Blast stringency as specified in the bopts
control file. Students could run the default sensitivity, then increase or decrease stringency to see the
effect on number of genes predicted.
Use Apollo to pull a protein sequence from one gene and use gaps to see if any internal exons are
missing. Compare full length of the homologous NCBI proteins to the Maker predicted protein to see if
and edge exons were missed by Maker.
10
Time line of module
One two hour lab
Discussion topics for class
Look at the annotation of several scaffolds. Is there evidence that repeats disrupted the assembly?
Are all your genes complete?
References
Maker
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M.
2008.MAKER: an easy-to-use annotation pipeline designed for emerging model organism
genomes. Genome Res.18:188–196.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for secondgeneration genome projects. BMC Bioinformatics. 2011;12:491. doi: 10.1186/1471-2105-12-491.
SNAP
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59
Augustus
Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron
submodel.Bioinformatics. 2003;19:ii215–ii225. doi: 10.1093/bioinformatics/btg1080.
Stanke M, Schöffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized
hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006;7:62. doi:
10.1186/1471-2105-7-62.
11
Download