Module 9. Genome annotation with Maker: Putting it all together

Background

The Maker Gene Annotation Pipeline (Cantarel et al. 2008; Holt and Yandell 2011) combines the outputs of several programs to generate a high-quality set of gene annotations. Maker was designed primarily for use on non-model organisms and to avoid over-predicting genes, as often happens when gene predictors are used alone.

Maker uses two types of evidence, intrinsic and extrinsic, to generate gene annotations. Extrinsic evidence is sequence homology to known repeats and to mRNA assembled from the organism or closely related organisms. Homology is predicted from BLAST alignments, which are then polished using exonerate. Intrinsic evidence consists of signals such as start and stop codons and intron-exon boundaries identified by gene predictors. The two gene predictors used by Maker are SNAP and Augustus.

Annotations are only produced if they are supported by multiple lines of evidence. For example, a BLAST alignment must have at least a corresponding gene predictor result. A gene predictor result without additional evidence will not be annotated as a gene. Maker assigns each annotation an Annotation Edit Distance (AED) score, which measures the dissimilarity between the evidence and the annotation. For example, an annotation that perfectly matches the evidence would have an AED score of 0, an annotation with 50% overlap with the evidence would have an AED of 0.5, and a theoretical annotation that did not overlap any evidence would have an AED of 1.

The most important parts of Maker's output are a GFF file, the predicted transcripts, and the predicted proteins for each scaffold. The program Apollo can be used to view and edit gene annotations produced by Maker.
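The three AED examples above can be reproduced with a small calculation. The sketch below is an illustration only, not Maker's own code: it assumes the nucleotide-level definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by the evidence. The function names covered_bases and aed are hypothetical, and a single evidence set is assumed.

```python
def covered_bases(intervals):
    """Return the set of 1-based positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation, evidence):
    """Simplified AED between two lists of exon intervals (assumed definition)."""
    ann = covered_bases(annotation)
    evi = covered_bases(evidence)
    shared = len(ann & evi)
    if shared == 0:
        # No overlap at all (this also guards against empty inputs).
        return 1.0
    sensitivity = shared / len(evi)   # evidence bases recovered by the annotation
    specificity = shared / len(ann)   # annotation bases supported by the evidence
    return 1.0 - (sensitivity + specificity) / 2.0


print(aed([(1, 100)], [(1, 100)]))    # perfect match: 0.0
print(aed([(1, 100)], [(51, 150)]))   # half of each overlaps: 0.5
print(aed([(1, 100)], [(201, 300)]))  # no overlap: 1.0
```

Lower values therefore mean better agreement between an annotation and its supporting evidence.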
V&C core competencies addressed

1) Ability to apply the process of science: Experimental design, Evaluation of experimental evidence, Developing problem-solving strategies
2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical methods to diverse data, Managing and analyzing large data sets
3) Use modeling and simulation to understand complex biological systems: Applying informatics tools, Managing and analyzing large data sets

GCAT-SEEK sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, Maker, GCATSEEK Linux Virtual Machine or access to Juniata HHMI cluster
Apollo requires a GUI and can be run on Mac or PC.
If using the cluster from Windows OS: PuTTY
If using the cluster from Mac or Linux OS: SSH

Protocols

This tutorial assumes you are using a fasta file named genome.fa. You could, however, use any fasta file by simply replacing genome.fa with the name of the fasta file being used. It also assumes that you are using the custom repeat library created in the RepeatScout tutorial, a protein file named proteins.fa, a SNAP HMM file named genome.hmm, and an Augustus species configuration directory. Running any of the listed commands without any additional input (or with the -h option) will display the help message, including a description of the program, the usage, and all options.

1) Make a Maker directory, move into it, and generate the Maker control files
$mkdir maker
$cd maker
$maker -CTL
This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl).

2) The maker_opts.ctl file contains all of the performance options for Maker.
For this run, edit the following lines (nano -c maker_opts.ctl):
Line 2: genome=genome.fa
Line 22: protein=proteins.fa
Line 26: model_org=danio
Line 27: rmlib=genome.secondfilter.lib #created by repeatscout
Line 34: snaphmm=./44Sru.hmm
Line 36: augustus_species=SebastesRubrivinctus
Line 56: pred_stats=1
Note: The maker_opts.ctl and maker_bopts.ctl files (shown in full at the bottom) can be edited further to tune Maker, for example to adjust the stringency of the BLAST searches. The maker_exe.ctl file simply tells Maker where to find the executables and should not need to be edited.

3) Once maker_opts.ctl and maker_bopts.ctl have been edited, run Maker.
If you are running on multiple processors (N is the number of processes to start):
$mpirun -n N maker
If you are running worker nodes on a cluster, use the line above as the last (command) line of the qsub script, and then submit it with qsub (see Module 3 for details on qsub).
If only running on a single processor:
$maker

Maker's Results

Maker will take each scaffold and place it in its own directory along with the output that Maker generates for that scaffold. The three most important files that are created are the GFF, Maker's predicted transcripts, and Maker's predicted proteins.

Maker's directory structure:
Directory_maker_ran_in -> genome.maker.output -> genome_datastore -> first_number -> second_number -> scaffold_directory

Maker divides the scaffold directories into several layers of subdirectories so the filesystem will not be slowed down by having to handle hundreds or even thousands of directories at one time. Individual GFFs can be viewed in genome browsers such as Apollo. The most efficient way to make use of Maker's output is to combine it into a relatively small number of large files. More information on handling these files is provided in Module 10.

1) Gather the predicted proteins (the predicted proteome).
In the genome_datastore directory:
$cat ./*/*/*/*.proteins.fasta > genome.predictedproteins.fasta

2) Gather the predicted transcripts (the predicted transcriptome).
In the same directory:
$cat ./*/*/*/*.transcripts.fasta > genome.predictedtranscripts.fasta

3) Gather all of the GFFs into one conglomerate GFF.
In the genome.maker.output directory, there are several options:
Everything (including Maker's prediction, all evidence, masked repeats, and DNA)
$gff3_merge -d genome_master_datastore_index.log
No DNA at the end
$gff3_merge -n -d genome_master_datastore_index.log
Only Maker annotations
$gff3_merge -n -g -d genome_master_datastore_index.log

GFF file format

General Feature Format (GFF) is a standard file format, meaning that it is supported and used by several programs and appears in the same form regardless of what program generated it. GFF is a tab-delimited file featuring nine columns of sequence data.

Column  Feature    Description
1       seqname    The name of the sequence
2       source     The program that created the feature
3       feature    What the sequence is (gene, contig, exon, match)
4       start      First base in the sequence
5       end        Last base in the sequence
6       score      Score of the feature
7       strand     + or -
8       frame      0, 1, or 2 (first, second, or third frame)
9       attribute  List of miscellaneous features not covered in the first 8 columns, each feature separated by a ;

Example GFF (shows part of a single-exon Maker gene found on scaffold 1)

Sequence 1    .            contig   1    1138   .   .   .   name=sequence1
Sequence 1    maker        gene     42   501    .   +   .   ID=maker-scaffold1-gene
Sequence 1    snap-masked  match    42   501    .   +   .   ID=snap-match

____________________________________________________

maker_opts.ctl

#-----Genome (these are always required)
genome= #genome sequence (fasta file or fasta embedded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic.
Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely related species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=_#protein sequence file in fasta format (i.e. from multiple organisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=_ #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e.
seg and dust filtering)

#-----Gene Prediction
snaphmm=_#SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=_#Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1
= yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

maker_bopts.ctl

#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'
pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)
pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)
pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)
pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold
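Because both control files use a simple option=value #comment layout, the settings used for a run can be read back programmatically, for example to record which options differed between two runs with different BLAST stringency. The following is a minimal sketch under stated assumptions: parse_ctl and changed_options are hypothetical helper names (they are not part of Maker), and values are assumed never to contain a # character.

```python
def parse_ctl(text):
    """Parse control-file text into a dict of option name -> value (as strings)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as #-----Genome
        setting, _, _comment = line.partition("#")  # drop the trailing comment
        key, sep, value = setting.partition("=")
        if not sep:
            continue  # not an option=value line
        options[key.strip()] = value.strip()
    return options


def changed_options(defaults, edited):
    """Return {option: (default_value, edited_value)} for options that differ."""
    return {
        key: (defaults.get(key), value)
        for key, value in edited.items()
        if defaults.get(key) != value
    }


default_text = """#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
pred_stats=0 #report AED and QI statistics for all predictions as well as models
"""
edited_text = """model_org=danio #select a model organism for RepBase masking in RepeatMasker
pred_stats=1 #report AED and QI statistics for all predictions as well as models
"""

print(changed_options(parse_ctl(default_text), parse_ctl(edited_text)))
```

To use it on real files, read each control file with open(path).read() and pass the text to parse_ctl.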
____________________________________________________

Running Apollo (requires a GUI)

1. Open a terminal and launch Apollo as follows:
$apollo
(click "OK" if any pop-ups appear)

2. Choose "GFF3" as the data source. Under "GFF" file, upload a GFF file from your folder /Maker/genome.maker.output/genome_datastore/##/##/scaffold#/*.gff.
   The horizontal scale at the center shows the coordinates of your sequence.
   Panels above and below this scale relate to the DNA strand you are analyzing, representing the forward and reverse strands respectively.
   Adjacent blue panels are workspaces for building gene models.
   Adjacent black panels display evidence for that gene model.
   The windows at the bottom display detailed information about selected features.

3. Maximize the Apollo window. Focus on the strand that contains the gene you are working on by clicking on the "view" tab in the menu bar at the top of the window and unchecking the strand that does NOT contain the gene of interest.

4. Click the "X2" button (for example) at the bottom of the screen to zoom in (or out) on your sequence. The structure of the genes will become apparent, with alternating exons (boxes) and introns (lines). You can continue zooming until the center scale displays the nucleotide sequence. The red and green bars at the top and the bottom of the screen represent potential start and stop codons in each of the three reading frames on the forward (top) and reverse (bottom) strands.

5. Clicking "reset" returns you to the normal view.

6. Click "Tiers" in the menu bar and select "Expand all tiers." This displays each piece of evidence on a separate line, allowing you to see all of the results of your analyses. The results of each prediction program (e.g. Augustus, Maker, blastx) display separately in white in the black evidence panel.

7. Clicking on any of the pieces of evidence will display additional information at the bottom of the window (see “selected element” in picture above).

8.
Zoom in to focus on one set of gene models and BLAST evidence. Consider several reasons why various pieces of evidence may not align perfectly (notice how the BLASTX predicted exon is longer than the other predicted exons in the figure below):
   The first and last exons are the hardest to predict because they have only one intron boundary.
   Gene predictors may miss exons of an alternatively spliced gene.
   BLAST evidence may come from different species, whose exons differ in length.
   The BLAST algorithm does not determine splice sites.
   BLAST results may include false matches.
Scroll left and right to find a gene that has one or more predictions (FGenesH, Augustus, SNAP), as well as BLASTN and/or BLASTX evidence.

9. Use the "zoom" function to enlarge the gene and clearly display the exon/intron structure.

10. Compare the gene predictions from FGenesH, SNAP, and/or Augustus with evidence from BLASTN and BLASTX.
   The gene prediction algorithms are quite good at identifying locations of genes and correct splice sites (exon/intron boundaries). However, they have difficulty identifying first and last exons because these have only one splice site.
   BLAST results are the most authoritative evidence for exons because they come from databases of biological evidence: mRNA or protein sequences from a given organism. However, BLAST results often do not accurately reflect splice sites. The exon structure may also be different for a related gene from a different organism. Finally, BLAST may return incomplete results, as not all genes or proteins are represented in the database.

11. Click on any exon in the evidence panel to highlight "edge matches" (making sure “show edge matches” is checked in the “view” tab of the menu bar). White bars on the exon indicate consensus exon-intron boundaries within the evidence. (However, these are not necessarily the correct splice sites.)

12.
The predicted gene model may contain several errors that need to be fixed:
   A green or red arrow indicates that the gene model is missing a start or stop codon.
   A yellow arrow indicates a splice site that does not follow the GT/AG rule (i.e., it is “noncanonical”).
Annotations can be edited using the Exon Detail Editor. You can adjust annotations by dragging the boundaries of exons and save the changes directly into the GFF.

13. Double-clicking on a gene model will select an entire annotation. Right-click, select sequence, then choose peptide or protein. From this window you can copy the predicted protein sequence across all exons, paste it into the NCBI blastp web page, and compare single annotations to other species of interest.

Assessment

The number of genes that Maker annotates is sensitive to BLAST stringency as specified in the maker_bopts.ctl control file. Students could run the default sensitivity, then increase or decrease stringency to see the effect on the number of genes predicted.
Use Apollo to pull a protein sequence from one gene and use gaps in its blastp alignment to see if any internal exons are missing. Compare the full length of the homologous NCBI proteins to the Maker predicted protein to see if any edge exons were missed by Maker.

Timeline of module

One two-hour lab

Discussion topics for class

Look at the annotation of several scaffolds. Is there evidence that repeats disrupted the assembly? Are all your genes complete?

References

Maker
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. 2008. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18:188–196.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011;12:491. doi: 10.1186/1471-2105-12-491.

SNAP
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59.

Augustus
Stanke M, Waack S.
Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19:ii215–ii225. doi: 10.1093/bioinformatics/btg1080.
Stanke M, Schöffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006;7:62. doi: 10.1186/1471-2105-7-62.