Module 8. Genome annotation with Maker: Gene finder training Background Finding and characterizing genes in eukaryotic genomes is still a difficult and unsolved problem. Genomes differ enough from one another that the accurate gene finding necessitates training of gene finding algorithms for the genome of interest unless there are gene models from close relatives available for the gene finders you are using. Gene Finder Training An important step in genome annotation is optimizing the gene finders that will be used to annotate the genome. Maker employs annotation from SNAP (Korf 2004) and Augustus (Stanke & Waack 2003; Stanke et al. 2006) as part of its gene annotation pipeline. Gene finders are trained using an initial set (several hundred) of high quality gene annotations. We will generate a high quality gene set by doing the following: Isolate the longest scaffolds available for the species of interest (done already) Blastx the long scaffolds to NCBI and download protein hits as a multifasta file (done already) Generate a draft training set of gene annotations by running Protein2genome program (in Maker) that will align these proteins to the scaffolds and construct annotations based the blastx hits and general knowledge of eukaryotic splice sites, start and stop codons, etc. SNAP’s fathom algorithm will be used to filter out incomplete genes. Fathom will generate a hidden markov model (HMM) file that summarizes what a typical gene looks like from your pilot data. The HMM file is used by SNAP in the maker pipeline to predict genes Augustus can be trained on the web using the long scaffolds and homologous proteins by using web training site. Goals Generate gene training files for a novel genome Run Maker in parallel on a high performance cluster for a whole genome V&C core competencies addressed 1 Ability to use quantitative reasoning: Applying statistical methods to diverse data, Mathematical modeling, Managing and analyzing large data sets Use modeling and simulation to understand complex biological systems: Computational modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data sets, Incorporating stochasticity into biological models GCAT-SEEK sequencing requirements None Computer/program requirements for data analysis Linux OS, SNAP, Maker, GCAT-SEEK Virtual Machine or Juniata’s HHMI Cluster Protocols Generating a Draft Training Set This tutorial assumes you are using: A multifasta file named genome.fa containing the longest scaffolds in your genome The custom repeat library created in the RepeatScout Module A protein file named proteins.fa Running any of the listed commands without any additional input (or running them with the -h option) will display the help message including a description of the program, the usage, and all options. 1) Generate the Maker control files $Maker -CTL This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl) 2) The maker_opts.ctl file contains all of the performance options for maker. For this run, edit the following lines (nano -c maker_opts.ctl) _______________________________ Line 2: genome=genome.fa Line 22: protein=proteins.fa Line 27: rmlib=genome.secondfilter.lib Line 41: protein2genome=1 2 _______________________________ 3) Run protein2genome on the genome. If you are running on multiple processors on your personal computer, specify number of processors (N) and run Maker using: $mpirun -n N maker If running on a cluster, edit a Qsub script to distribute the job to a worker node, using the command line above and then run Qsub as described in Module 3. If only running on a single processor: maker Training SNAP: Filter out incomplete genes, generate an HMM file 1) Once Maker has finished, the annotations need to be compiled into a single file. $cd genome.maker.output $gff3_merge -d genome_master_datastore_index.log -g 2) Make a new directory for training SNAP and put the gff in there. $mkdir SNAPTraining $mv genome.all.gff SNAPTraining $cd SNAPTraining 3) Convert the gff to zff (the file format that SNAP recognizes for training). $maker2zff -n genome.all.gff This will create two new files: genome.ann and genome.dna 4) Fathom and Forge will be used to train SNAP. Fathom has several subroutines that help filter and analyze gene annotations. First, examine the annotations to get an impression of what annotations there are. $fathom -validate genome.ann genome.dna $fathom -gene-stats genome.ann genome.dna 5) Next, split the annotations into four categories: unique genes, warnings, alternative spliced genes, overlapping genes, and errors. 3 $fathom -categorize 1000 genome.ann genome.dna 6) Next, export the genes. $fathom -export 1000 uni.ann uni.dna This will generate four new files (export.aa, export.ann, export.dna, and export.tx) 7) Make a new directory for the new parameters. $mkdir parameters $cd parameters 8) Generate the new parameters. The command below will run forge on the directory above parameters (i.e. the parent directory for parameters), but put the output in the working directory (parameters). $forge ../export.ann ../export.dna 9) Generate the new HMM (the file that SNAP uses to generate gene predictions), by moving into the directory above parameters, and entering the command that follows. $cd .. $hmm-assembler.pl genome parameters > genome.hmm 4 Training Augustus 1) Go to http://bioinf.uni-greifswald.de/webaugustus/training/create 2) In the designated spaces fill in your email and the species name 3) Upload export.dna as the genome file. 4) Upload export.aa as the protein file. 5) Type in the verification string and click “Start Training” 6) After the Augustus web training finishes, download the directory of results. If using HHMI Cluster: Rename the directory by the species name and copy into /share/apps/maker/exe/augustus/config/species directory. If using Virtual Box: Rename the directory by the species name and copy into /User/lib/maker/exe/augustus/config/species directory. Assessment After reviewing and practicing HMM generation (see suggestion below) students should be able to do it by hand on small datasets from a nucleotide alignment. They could then score a short DNA sequence for presence of a motif by hand. Students should be able to repeat the steps starting with long (> 10x average gene size; see module 1 for average gene size given genome size) scaffolds from any genome. Time line of module 1 hour of lab Discussion topics for class Relevant background topics include HMM generation. Chapter 16 from Genomics and Bioinformatics: An Introduction to Programming Tools for Life Scientists by Tore Samuelsson is a good reference. The Korf 2004 paper is required reading on this topic. References Maker Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. 5 2008.MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res.18:188–196. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011;12:491. doi: 10.1186/14712105-12-491. SNAP Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59 Augustus Stanke M, Waack S. 2003. Gene prediction with a hidden Markov model and a new intron submodel.Bioinformatics.19:ii215–ii225. Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 7:62. 6