Daniel Ence
Yandell Lab, University of Utah

Annotations are descriptions of features of the genome
• Structural: exons, introns, UTRs, splice forms, etc.
• Coding & non-coding genes

Annotations should include an evidence trail
• Assists in quality control of genome annotations
• Examples of evidence supporting a structural annotation:
  • Ab initio gene predictions
  • ESTs
  • Protein homology

• Protein domains and families (InterPro, Pfam)
• GO and other ontologies
• Pathways

SUCCESS
>Smg5
MEVTFSSGGSSNASSECAIDGGTNRCRGL
EPNNGTCILSQEVKDLYRSLYTASKQLDD
AKRNVQSVGQLFQHEIEEKRSLLVQLCKQ
IIFKDYQSVGKKVREVMWRRGYYEFIAFV

MAKER
An annotation pipeline and genome-database management tool for “next-generation” genome projects

MAKER
User Requirements:
• Can be run by a single individual with little bioinformatics experience
System Requirements:
• Can run on Linux or Mac OS X based systems
Program Output:
• Output is compatible with popular annotation tools like WebApollo and JBrowse
Availability:
• Free for the academic community (including source code)

• mRNA-seq integration
• Integrating new evidence into existing databases
• Update/revise legacy annotation sets

[Diagram: Legacy Annotation Set 1, Legacy Annotation Set 2, Legacy Annotation Set n; new data; current assembly]
• Identify the legacy annotation most consistent with the new data
• Automatically revise it in light of the new data
• If no existing annotation, create a new one

• Supports Message Passing Interface (MPI), a communication protocol for computer clusters that essentially allows multiple computers to act like a single powerful machine.

MAKER-P
• Plant
• Parallelized
• Publication: MAKER-P: a tool-kit for the rapid creation, management, and quality control of plant genome annotations. Campbell, Law, Holt et al., Plant Phys. 2013

Atmosphere
• MPI enabled for parallel computation
• Maximum instance size 16 CPUs
• http://www.iplantcollaborative.org

TACC Lonestar
• Supercomputer with 22,656 CPUs
• MPI enabled for parallel computation
• Can complete the entire rice genome in ~2 hrs (1,152 cores; 96 CPUs per chromosome)
• Currently being integrated into the iPlant Discovery Environment
• http://www.iplantcollaborative.org

XSEDE
• https://www.xsede.org

Performance on the Zea mays (maize) genome (~2 Gb)
• 8,640 CPUs on TACC
• ~37 hours with queue (runtime 14 hours 37 minutes)
• Throughput of > 1 Gb/hour

Assembly & Annotation at iPlant
[Workflow diagram. Panels: Genome Assembly; Transcriptome Assembly (de novo: Trinity, SOAPdenovo-Trans, Velvet/Oasis, Trans-ABySS; reference-guided: Tophat, Cufflinks); Genome input; Evidence input; Data Commons (reference genomes, reference annotations, SNAP HMM models, repeat libraries, transcriptome data); SNAP Training (Fathom/Forge); MPI-MAKER on TACC Lonestar (22,656 cores); MAKER output; Post Annotation (InterProScan, InterPro2GO); ncRNA Annotation (miRDeep2); Conversion tools (maker2jbrowse, maker2zff, tophat2gff, cufflinks2gff); Visualization (JBROWSE, Web-Apollo). Other tools named in the diagram: ALLPATHS-LG, Newbler, SOAPdenovo, SCARF, ABySS, Oasis, HMM-assembler, Velvet, Ray, Augustus, SNAP, Exonerate, BLAST, RepeatMasker. Key: DE, TACC, in progress.]

• Non-coding RNA support
• Better repeat annotation
• Better pseudogene annotation

tRNAscan support
• Will run from inside MAKER
• Doesn’t install automatically

snoScan support
• Can supply a data file for annotation
• Will run from inside MAKER automatically
• Doesn’t install automatically

In the past: custom repeat library, generated de novo (RepeatModeler)
Now: RepeatModeler, but better.
Step-by-step guide available at: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic
To be automated in the future

• Expanded ncRNA support
• MAKER-EVM
• Expanded Augustus/BAM support
• Better integration with iPlant’s Discovery Environment
More of a feeling than a to-do list

lncRNAs

MAKER-EVM
• EVM: Haas et al., Genome Biology 2008
• MAKER: Cantarel et al., 2008; Holt and Yandell, 2010

• MAKER gives Augustus hints
• Augustus can take better hints from a BAM file
• Users will be able to supply a BAM file in the MAKER control file
• BAM files open up a world of possibilities!

• Trichomonas vaginalis
• Pinus taeda
• Apis dorsata
• Cronartium quercuum
• Common Pigeon
• Cardiocondyla obscurior

• Southern right whale
• Tardigrade
• Spotted Gar
• Gibbon
• Turkey
• Nine-spined stickleback
• Golden Eagle

• I’d like to thank and recognize all contributions from Mark Yandell at the University of Utah, as well as lab members Barry Moore, Michael Campbell, Daniel Ence, and former lab member Meiyee Law.
• Special thank you to Scott Cain, Robert Buels, and Amelia Ireland.
• I would also like to recognize collaborator Ian Korf at UC Davis.
• MAKER-P and integration into iPlant infrastructure:
  • Josh Stein (CSHL)
  • Kevin Childs (MSU)
  • Gaurav Moghe (MSU)
  • David Hufnagel (MSU)
  • Jikai Lei (MSU)
  • Rujira Achawanantakun (MSU)
  • Carolyn Lawrence (USDA-ARS CICGRU)
  • Doreen Ware (CSHL)
  • Shin-Han Shiu (MSU)
  • Yanni Sun (MSU)
  • Ning Jiang (MSU)
  • Matt Vaughn (TACC)
  • Dian Jiao (TACC)
  • Zhenyuan Lu (CSHL)
  • Nirav Merchant (U. Arizona)
• Pinus taeda genome project:
  • Jill Wegrzyn (UConn)
  • John Liechty (UC Davis)
  • Kristian Stevens (UC Davis)
  • Carol Loopstra (Texas A&M)
  • Hans Vasquez-Gross (UC Davis)
  • Brian Lin (UC Davis)
  • Matt Dougherty (UC Davis)
  • Jacob Zieve (UC Davis)
  • Pedro J Martinez-Garcia (UC Davis)
  • James A Yorke (U. Maryland)
  • Marc Crepeau (UC Davis)
  • Daniela Puiu (Johns Hopkins)
  • Steven L Salzberg (Johns Hopkins)
  • Pieter J. deJong (CHORI-BACPAC Resources Center)
  • Keithanne Mockaitis (Indiana University)
  • Dorrie Main (Washington State)
  • Chuck Langley (UC Davis)
  • David Neale (UC Davis)

MAKER-devel community

Funding from the NHGRI through an R01 grant entitled “Software for the creation and quality control of genome annotations.”

Mailing List: maker-devel at yandell-lab.org
Download: http://yandell-lab.org/software/maker.html
Email me: dence at genetics.utah.edu