Genome annotation with Maker: Gene finder training

advertisement
Module 8. Genome annotation with Maker:
Gene finder training
Background
Finding and characterizing genes in eukaryotic genomes is still a difficult and unsolved problem.
Genomes differ enough from one another that the accurate gene finding necessitates training of gene
finding algorithms for the genome of interest unless there are gene models from close relatives available
for the gene finders you are using.
Gene Finder Training
An important step in genome annotation is optimizing the gene finders that will be used to annotate the
genome. Maker employs annotation from SNAP (Korf 2004) and Augustus (Stanke & Waack 2003;
Stanke et al. 2006) as part of its gene annotation pipeline. Gene finders are trained using an initial set
(several hundred) of high quality gene annotations. We will generate a high quality gene set by doing
the following:







Isolate the longest scaffolds available for the species of interest (done already)
Blastx the long scaffolds to NCBI and download protein hits as a multifasta file (done
already)
Generate a draft training set of gene annotations by running Protein2genome program
(in Maker) that will align these proteins to the scaffolds and construct annotations
based the blastx hits and general knowledge of eukaryotic splice sites, start and stop
codons, etc.
SNAP’s fathom algorithm will be used to filter out incomplete genes.
Fathom will generate a hidden markov model (HMM) file that summarizes what a
typical gene looks like from your pilot data.
The HMM file is used by SNAP in the maker pipeline to predict genes
Augustus can be trained on the web using the long scaffolds and homologous proteins
by using web training site.
Goals


Generate gene training files for a novel genome
Run Maker in parallel on a high performance cluster for a whole genome
V&C core competencies addressed
1


Ability to use quantitative reasoning: Applying statistical methods to diverse data, Mathematical
modeling, Managing and analyzing large data sets
Use modeling and simulation to understand complex biological systems: Computational
modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data
sets, Incorporating stochasticity into biological models
GCAT-SEEK sequencing requirements
None
Computer/program requirements for data analysis
Linux OS, SNAP, Maker, GCAT-SEEK Virtual Machine or Juniata’s HHMI Cluster
Protocols
Generating a Draft Training Set
This tutorial assumes you are using:



A multifasta file named genome.fa containing the longest scaffolds in your genome
The custom repeat library created in the RepeatScout Module
A protein file named proteins.fa
Running any of the listed commands without any additional input (or running them with the -h option)
will display the help message including a description of the program, the usage, and all options.
1) Generate the Maker control files
$Maker -CTL
This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl)
2) The maker_opts.ctl file contains all of the performance options for maker.
For this run, edit the following lines (nano -c maker_opts.ctl)
_______________________________
Line 2: genome=genome.fa
Line 22: protein=proteins.fa
Line 27: rmlib=genome.secondfilter.lib
Line 41: protein2genome=1
2
_______________________________
3) Run protein2genome on the genome.
If you are running on multiple processors on your personal computer, specify number of
processors (N) and run Maker using:
$mpirun -n N maker
If running on a cluster, edit a Qsub script to distribute the job to a worker node, using the
command line above and then run Qsub as described in Module 3.
If only running on a single processor:
maker
Training SNAP: Filter out incomplete genes, generate an HMM file
1) Once Maker has finished, the annotations need to be compiled into a single file.
$cd genome.maker.output
$gff3_merge -d genome_master_datastore_index.log -g
2) Make a new directory for training SNAP and put the gff in there.
$mkdir SNAPTraining
$mv genome.all.gff SNAPTraining
$cd SNAPTraining
3) Convert the gff to zff (the file format that SNAP recognizes for training).
$maker2zff -n genome.all.gff
This will create two new files: genome.ann and genome.dna
4) Fathom and Forge will be used to train SNAP. Fathom has several subroutines that help filter
and analyze gene annotations. First, examine the annotations to get an impression of what
annotations there are.
$fathom -validate genome.ann genome.dna
$fathom -gene-stats genome.ann genome.dna
5) Next, split the annotations into four categories: unique genes, warnings, alternative spliced
genes, overlapping genes, and errors.
3
$fathom -categorize 1000 genome.ann genome.dna
6) Next, export the genes.
$fathom -export 1000 uni.ann uni.dna
This will generate four new files (export.aa, export.ann, export.dna, and export.tx)
7) Make a new directory for the new parameters.
$mkdir parameters
$cd parameters
8) Generate the new parameters. The command below will run forge on the directory above
parameters (i.e. the parent directory for parameters), but put the output in the working
directory (parameters).
$forge ../export.ann ../export.dna
9) Generate the new HMM (the file that SNAP uses to generate gene predictions), by moving
into the directory above parameters, and entering the command that follows.
$cd ..
$hmm-assembler.pl genome parameters > genome.hmm
4
Training Augustus
1) Go to http://bioinf.uni-greifswald.de/webaugustus/training/create
2) In the designated spaces fill in your email and the species name
3) Upload export.dna as the genome file.
4) Upload export.aa as the protein file.
5) Type in the verification string and click “Start Training”
6) After the Augustus web training finishes, download the directory of results.
If using HHMI Cluster: Rename the directory by the species name and copy into
/share/apps/maker/exe/augustus/config/species directory.
If using Virtual Box: Rename the directory by the species name and copy into
/User/lib/maker/exe/augustus/config/species directory.
Assessment
After reviewing and practicing HMM generation (see suggestion below) students should be able to do it
by hand on small datasets from a nucleotide alignment. They could then score a short DNA sequence
for presence of a motif by hand.
Students should be able to repeat the steps starting with long (> 10x average gene size; see module 1 for
average gene size given genome size) scaffolds from any genome.
Time line of module
1 hour of lab
Discussion topics for class
Relevant background topics include HMM generation. Chapter 16 from Genomics and
Bioinformatics: An Introduction to Programming Tools for Life Scientists by Tore Samuelsson is a
good reference. The Korf 2004 paper is required reading on this topic.
References
Maker
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M.
5
2008.MAKER: an easy-to-use annotation pipeline designed for emerging model organism
genomes. Genome Res.18:188–196.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for
second-generation genome projects. BMC Bioinformatics. 2011;12:491. doi: 10.1186/14712105-12-491.
SNAP
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59
Augustus
Stanke M, Waack S. 2003. Gene prediction with a hidden Markov model and a new intron
submodel.Bioinformatics.19:ii215–ii225.
Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a
generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 7:62.
6
Download