MAKER-P Genome Annotation using Atmosphere Rationale and background: MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes. MAKER-P identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER-P was developed by the Yandell Lab. Its predecessor, MAKER, is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading. MAKER-P v2.28 is currently available as an Atmosphere image and is MPI-enabled for parallel processing. This tutorial will take users through steps of: 1. Launching the MAKER-P Atmosphere image 2. Setting up and running MAKER-P on an example small genome assembly 3. Copying output back to your iPlant Data Store using icommands 4. Visualizing MAKER-P output using the Integrated Genome Viewer (IGV) Note: Parts of this tutorial requires stand-alone VNC Viewer. Please download VNC Viewer (For windows, if you are unsure of which version to download, select the 32bit.exe file) Part 1: Connect to an instance of an Atmosphere Image (virtual machine) Step 1. Go to https://atmo.iplantcollaborative.org and log in with your iPlant credentials. Step 2. Click on the Launch New Instance button and search for MAKER-P_2.28 Step 3. Select the image MAKER-P_2.28 (emi-F13821D0) and click Launch Instance. It will take 10-15 minutes for the cloud instance to be launched. You will be notified by email when the image is ready. Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with the small instance size, m1.small (default). For real genome annotation larger instances will be required. Step 4. Launch the standalone VNC Viewer application on your desktop computer and enter your VM's IP address following by :1 Once you have connected, you will see your virtual Desktop, a Linux server running in the Atmosphere cloud computing platform. You will interact with it the same way as you would a physical computer. Part 2: Set up and run MAKER-P using the Terminal window Step 1. On the desktop click Terminal. A terminal window will open giving a command-line interface. Step 2. Get oriented. You will find staged example data in folder "MAKER-P_example_data" within the folder "Desktop". List its contents with the ls command: $ ls Desktop/MAKER-P_example_data/ README mRNA.fasta maker_exe.ctl example_output maker_bopts.ctl maker_opts.ctl msu-irgsp-proteins.fasta plant_repeats.fasta test_genome.fasta The fasta files include a scaled-down genome (test_genome.fasta) comprised of the first 100kb of each chromosome of rice. The remaining fasta files provide evidence for annotation: mRNA sequences from NCBI, publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0), and a collection of plant repeats. The maker_*.ctl files are configuration files for MAKER which can be optionally used for this exercise or generated as described later. Executables for running MAKER-P are located in /opt/maker/bin and /opt/maker/exe: $ ls /opt/maker/bin cegma2zff fasta_merge chado2gff3 fasta_tool compare genemark_gtf2gff3 iprscan2gff3 iprscan_wrap maker maker2jbrowse maker2wap maker2zff maker_functional_gff maker_map_ids map2assembly map_gff_ids mpi_evaluator mpi_iprscan cufflinks2gff3 evaluator gff3_merge ipr_update_gff $ ls /opt/maker/exe/ RepeatMasker augustus blast maker2chado maker2eval_gtf exonerate maker_functional maker_functional_fasta mpich2-1.5 mpich2.tar.gz map_data_ids map_fasta_ids tophat2gff3 snap As the names suggest the /opt/maker/bin directory includes many useful auxiliary scripts. For example cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used for input as evidence for MAKER-P. Both Cufflinks and cufflinks2gff3 are available as tools the iPlant Discovery Environment (DE). Other auxiliary scripts now available in the DE include: tophat2gff3, maker2jbrowse, and maker2zff. RepeatMasker, Augustus, Blast, Exonerate, and SNAP are programs that MAKER-P uses in its pipeline. We recommend the MAKER Tutorial at GMOD for additional reading about these. Step 3. Set up your environment. Note: this step is not essential for the purpose of this tutorial. Skipping this step you will need to specify the complete path when running the maker command (/opt/maker/bin/maker). $ export PATH=/opt/maker/bin/:$PATH $ export ZOE=/opt/maker/exe/snap $ export AUGUSTUS_CONFIG_PATH=/opt/maker/exe/augustus/config Step 4. Set up a MAKER-P run. Create a working directory called "maker_run" using the mkdir command and use cd to move into that directory: $ mkdir maker_run $ cd maker_run Step 5. Create another directory called "test_data" and copy the staged fasta data into test_data using cp. Verify using the ls command $ cp ~/Desktop/MAKER-P_example_data/*.fasta test_data/. $ ls test_data $ ls test_data/ mRNA.fasta msu-irgsp-proteins.fasta plant_repeats.fasta test_genome.fasta Step 6. Run the maker command with the --help flag to get a usage statement and list of options: $ maker --help MAKER version 2.28 Usage: maker [options] <maker_opts> <maker_bopts> <maker_exe> Description: MAKER is a program that produces gene annotations in GFF3 format using evidence such as EST alignments and protein homology. MAKER can be used to produce gene annotations for new genomes as well as update annotations from existing genome databases. The three input arguments are control files that specify how MAKER should behave. All options for MAKER should be set in the control files, but a few can also be set on the command line. Command line options provide a convenient machanism to override commonly altered control file values. MAKER will automatically search for the control files in the current working directory if they are not specified on the command line. Input files listed in the control options files must be in fasta format unless otherwise specified. Please see MAKER documentation to learn more about control file configuration. MAKER will automatically try and locate the user control files in the current working directory if these arguments are not supplied when initializing MAKER. It is important to note that MAKER does not try and recalculated data that it has already calculated. For example, if you run an analysis twice on the same dataset you will notice that MAKER does not rerun any of the BLAST analyses, but instead uses the blast analyses stored from the previous run. To force MAKER to rerun all analyses, use the -f flag. MAKER also supports parallelization via MPI on computer clusters. Just launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be configured during the MAKER installation process for this to work though Options: -genome|g <file> -RM_off|R -datastore/ nodatastore -old_struct -base <string> -tries|t <integer> -cpus|c <integer> -force|f -again|a -quiet|q -qq -dsindex Overrides the genome file path in the control files Turns all repeat masking options off. Forcably turn on/off MAKER's two deep directory structure for output. Always on by default. Use the old directory styles (MAKER 2.26 and lower) Set the base name MAKER uses to save output files. MAKER uses the input genome file name by default. Run contigs up to the specified number of tries. Tells how many cpus to use for BLAST analysis. Note: this is for BLAST and not for MPI! Forces MAKER to delete old files before running again. This will require all blast analyses to be rerun. recaculate all annotations and output files even if no settings have changed. Does not delete old analyses. Regular quiet. Only a handlful of status messages. Even more quit. There are no status messages. Quickly generate datastore index file. Note that this -nolock -TMP -CTL -OPTS -BOPTS -EXE -MWAS <option> will not check if run settings have changed on contigs Turn off file locks. May be usful on some file systems, but can cause race conditions if running in parallel. Specify temporary directory to use. Generate empty control files in the current directory. Generates just the maker_opts.ctl file. Generates just the maker_bopts.ctl file. Generates just the maker_exe.ctl file. Easy way to control mwas_server for web-based GUI options: STOP START RESTART -version Prints the MAKER version. -help|? Prints this usage statement. Step 7. Create control files that tell MAKER-P what to do. Three files are required: • maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER-P behavior • maker_exe.ctl - gives path information for the underlying executables. • maker_bopt.ctl - sets parameters for filtering BLAST and Exonerate alignment results To create these files run the maker command with the -CTL flag. Verify with ls: $ maker -CTL $ ls maker_bopts.ctl maker_exe.ctl maker_opts.ctl test_data The maker_exe.ctl is automatically generated with the correct paths to executables and does not need to be modified. The maker_bopt.ctl is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters. The automatically generated maker_opts.ctl file does need to be modified in order to specify the genome file and evidence files to be used as input. Several text editors are available, including emacs, nano and gedit. If you have not tried any of these gedit probably is easiest to learn as it has a familiar graphical user interface. It can be started by typing gedit & on the command line or by selecting it from the menu on the VNC desktop (Applications=>Accessories=>Text Editor). Note: If pressed for time a pre-edited version of the maker_opts.ctl file is staged in ~/Desktop/MAKER-P_example_data/. Delete the current file and copy the staged version here. Then skip to Step 8. $ rm maker_opts.ctl $ cp ~/Desktop/MAKER-P_example_data/maker_opts.ctl . Here are the sections of the maker_opts.ctl file you need to edit. Add path information to files as shown. Do not allow any spaces after the equal sign or anywhere else. You can specify a complete path or relative path as shown here. In general you can specify multiple files by separating with a comma without spaces. This section pertains to specifying the genome assembly to be annotated and setting organism type: #-----Genome (these are always required) genome=./test_data/test_genome.fasta #genome sequence (fasta file or fasta embeded in GFF3 file) organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic The following section pertains to EST and other mRNA expression evidence. Here we are only using same species data, but one could specify data from a related species using the altest parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter: #-----EST Evidence (for best results provide a file for at least one) est=./test_data/mRNA.fasta #set of ESTs or assembled mRNA-seq in fasta format altest= #EST/cDNA sequence file in fasta format from an alternate organism est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file altest_gff= #aligned ESTs from a closly relate species in GFF3 format The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database: #-----Protein Homology Evidence (for best results provide a file for at least one) protein=./test_data/msu-irgsp-proteins.fasta #protein sequence file in fasta format (i.e. from mutiple oransi protein_gff= #aligned protein homology evidence from an external GFF3 file This next section pertains to repeat identification: #-----Repeat Masking (leave values blank to skip repeat masking) model_org= #select a model organism for RepBase masking in RepeatMasker rmlib=./test_data/plant_repeats.fasta #provide an organism specific repeat library in fasta format for RepeatM repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner rm_gff= #pre-identified repeat elements from an external GFF3 file prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering) Various programs for ab initio gene prediction can be specified in the next section. Here we are using SNAP set to use an HMM trained on rice. Specifying the entire path to the hmm file would not be necessary if the ZOE environment variable is set as described in step 3 above. #-----Gene Prediction snaphmm=/opt/maker/exe/snap/HMM/O.sativa.hmm #SNAP HMM file gmhmm= #GeneMark HMM file augustus_species= #Augustus gene prediction species model fgenesh_par_file= #FGENESH parameter file pred_gff= #ab-initio predictions from an external GFF3 file model_gff= #annotated gene models from an external GFF3 file est2genome=0 #infer gene predictions directly from ESTs, 1 = protein2genome=0 #infer predictions from protein homology, 1 unmask=0 #also run ab-initio prediction programs on unmasked (annotation pass-through) yes, 0 = no = yes, 0 = no sequence, 1 = yes, 0 = no These lines control other behaviors of MAKER-P and do not need to be changed for this demonstration: #-----Other Annotation Feature Types (features MAKER doesn't recognize) other_gff= #extra features to pass-through to final MAKER generated GFF3 file #-----External Application Behavior Options alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI) #-----MAKER Behavior Options max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage) min_contig=1 #skip genome contigs below this length (under 10kb are often useless) pred_flank=200 #flank for extending evidence clusters sent to gene predictors pred_stats=1 #report AED and QI statistics for all predictions as well as models AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1) min_protein=0 #require at least this many amino acids in predicted proteins alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no always_complete=1 #extra steps to force start and stop codons, 1 = yes, 0 = no map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1) split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments) single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no single_length=250 #min length required for single exon ESTs if 'single_exon is enabled' correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes tries=3 #number of times to try a contig if there is a failure for some reason clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no TMP=/tmp #specify a directory other than the system default temporary directory for temporary files Step 8. Run MAKER-P Make sure you are in the maker_run directory and all of your files are in place. Perform these steps to check: $ pwd /home/steinj/maker_run $ ls maker_bopts.ctl maker_exe.ctl maker_opts.ctl stderr test_data $ ls test_data/ mRNA.fasta msu-irgsp-proteins.fasta plant_repeats.fasta test_genome.fasta Starting the MAKER-P run is as simple as entering the command maker. If you did not set the PATH environmental variable in Step 3 then you will need to type the entire path: /opt/maker/bin/maker. Because your maker control files are in the present directory you do not have to explicitly specify these in the command; they will be found automatically. MAKER-P automatically outputs thousands of lines of STDERR, reporting on its progress and producing warnings and errors if they arise. It is a good practice to redirect this output to a log file so you have a record of it, especially if something goes wrong. Since we are in the bash shell you can redirect the output by typing the command as follows: maker 2> log_file &. Another useful practice is to employ the unix time command to report statistics on how long the run took to complete. The output of time will appear at the end of the captured standard error file. Putting all of this together we type the following command to start MAKER-P: (time /opt/maker/bin/maker) 2> log_file & This part of the tutorial usually takes about 30 minutes to complete. You will know MAKER-P is finished when the log_file announces "Maker is now finished!!!". Use the tail command to look at the last 10 lines of the log_file: $ tail log_file processing chunk output processing contig output Maker is now finished!!! real user sys 31m11.250s 36m52.210s 1m34.106s Step 9. Examining MAKER-P output Output data appears in a new directory called test_genome.maker.output. Move to that directory and examine its contents. Note if your run is not yet complete move instead to the staged example output (cd ~/Desktop/MAKERP_example_data/example_output/test_genome.maker.output/): $ cd test_genome.maker.output/ $ ls maker_bopts.log maker_exe.log maker_opts.log mpi_blastdb seen.dbm test_genome.all.gff test_genome_datastore test_genome.db test_genome_master_datastore_index.log • The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER. • The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases. • test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored. • The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file. Check the test_genome_master_datastore_index.log to see if there were any failures: $ less test_genome_master_datastore_index.log Chr1 Chr1 Chr10 Chr10 Chr11 Chr11 Chr12 Chr12 Chr2 Chr2 Chr3 Chr3 Chr4 Chr4 Chr5 Chr5 Chr6 Chr6 Chr7 Chr7 Chr8 test_genome_datastore/41/30/Chr1/ test_genome_datastore/41/30/Chr1/ test_genome_datastore/7C/72/Chr10/ test_genome_datastore/7C/72/Chr10/ test_genome_datastore/1E/AA/Chr11/ test_genome_datastore/1E/AA/Chr11/ test_genome_datastore/1B/FA/Chr12/ test_genome_datastore/1B/FA/Chr12/ test_genome_datastore/E9/36/Chr2/ test_genome_datastore/E9/36/Chr2/ test_genome_datastore/CC/EF/Chr3/ test_genome_datastore/CC/EF/Chr3/ test_genome_datastore/A3/11/Chr4/ test_genome_datastore/A3/11/Chr4/ test_genome_datastore/8A/9B/Chr5/ test_genome_datastore/8A/9B/Chr5/ test_genome_datastore/13/44/Chr6/ test_genome_datastore/13/44/Chr6/ test_genome_datastore/91/B7/Chr7/ test_genome_datastore/91/B7/Chr7/ test_genome_datastore/9A/9E/Chr8/ STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED FINISHED STARTED Chr8 Chr9 Chr9 test_genome_datastore/9A/9E/Chr8/ test_genome_datastore/87/90/Chr9/ test_genome_datastore/87/90/Chr9/ FINISHED STARTED FINISHED All completed. Other possible status entries include: • FAILED - indicates a failed run on this contig, MAKER will retry these • RETRY - indicates that MAKER is retrying a contig that failed • SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in maker_opt.ctl) • DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opt.ctl) The actual output data is stored in in nested set of directories under* test_genome_datastore* in a nested directory structure. A typical set of outputs for a contig looks like this: $ ls test_genome_datastore/41/30/Chr1/ Chr1.gff Chr1.maker.non_overlapping_ab_initio.proteins.fasta Chr1.maker.non_overlapping_ab_initio.transcripts.fasta Chr1.maker.proteins.fasta Chr1.maker.snap_masked.proteins.fasta Chr1.maker.snap_masked.transcripts.fasta Chr1.maker.transcripts.fasta run.log theVoid.Chr1 • The Chr1.gff file is in GFF3 format and contains the maker gene models and underlying evidence such as repeat regions, alignment data, and ab initio gene predictions, as well as fasta sequence. Having all of these data in one file is important to enable visualization of the called gene models and underlying evidence, especially using tools like Apollo which enable manual editing and curation of gene models. • The fasta files Chr1.maker.proteins.fasta and Chr1.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER-P gene calls. • The Chr1.maker.non_overlapping_ab_initio.proteins.fasta and Chr1.maker.non_overlapping_ab_initio.transcripts.fasta files are models that don't overlap MAKER-P genes that were rejected for lack of support. • The Chr1.maker.snap_masked.proteins.fasta and Chr1.maker.snap_masked.transcript.fasta are the initial SNAP predicted models not further processed by MAKER-P The output directory theVoid.Chr1 contains raw output data from all of the pipeline steps. One useful file found here is the repeat-masked version of the contig, query.masked.fasta. Step 10. Consolidate output using utility scripts fasta_merge and gff3_merge. Before running make sure you are in the same directory with the test_genome_master_datastore_index.log file. $ /opt/maker/bin/gff3_merge -d test_genome_master_datastore_index.log $ /opt/maker/bin/fasta_merge -d test_genome_master_datastore_index.log This results in the following new files. Note that the fasta files were merged in a fashion that maintains categories. test_genome.all.gff test_genome.all.maker.non_overlapping_ab_initio.proteins.fasta test_genome.all.maker.non_overlapping_ab_initio.transcripts.fasta test_genome.all.maker.proteins.fasta test_genome.all.maker.snap_masked.proteins.fasta test_genome.all.maker.snap_masked.transcripts.fasta test_genome.all.maker.transcripts.fasta Step 11. Copy data back to your iPlant Data Store. We will use icommands for this. First you will need to initialize icommands using the iinit command. Provide the information shown in red: $ iinit One or more fields in your iRODS environment file (.irodsEnv) are missing; please enter them. Enter the host name (DNS) of the server to connect to:data.iplantcollaborative.org Enter the port number:1247 Enter your irods user name:<your user name> Enter your irods zone:iplant Those values will be added to your environment file (for use by other i-commands) if the login succeeds. Enter your current iRODS password:<your iplant password> Typing ils will show the files and directories in your Data Store—give it a try. Create a folder for your data called “maker_atmosphere_tutorial_output” and move to that folder as follows: $ imkdir maker_atmosphere_tutorial_output $ icd maker_atmosphere_tutorial_output $ ils /iplant/home/steinj/maker_atmosphere_tutorial_output: Copy files to your Data Store using iput. For example: $ iput test_genome.all.gff $ iput test_genome.all.maker.proteins.fasta $ iput test_genome.all.maker.transcripts.fasta $ ils /iplant/home/steinj/maker_atmosphere_tutorial_output: test_genome.all.gff test_genome.all.maker.proteins.fasta test_genome.all.maker.transcripts.fasta For more on icommands… See https://pods.iplantcollaborative.org/wiki/display/start/Using+icommands Part 3: Visualize annotation data using the Integrated Genome Viewer (IGV): The MAKER-P Atmosphere image is loaded with several software packages for viewing genomics data. On the VNC Desktop open the folder NGS_Viewers to access and select IGV-2.0.7 The IGV will likely open with a pre-loaded genome, such as human. To load your genome go under the file menu and select Import Genome … Next use the dialog box to enter the genome assembly fasta “test_genome.fasta” and gene file “test_genome.all.gff”. Once loaded you can select a chromosome using the drop-down menus. By default the gene track is collapsed. To expand, right click on the track name and select expanded. Because the GFF3 files contain all of the evidence features in addition to genes so to will your track show data not limited to genes. This could be changed by parsing the GFF3 file into individual files containing individual features, e.g. repeat data and maker gene features.