ASM General Meeting, Boston.

Annotating Metagenomes Using the NMPDR
Rob Edwards
See also poster: B-179 (126B) Aziz et al.
Department of Computer Sciences, San Diego State University
Mathematics and Computer Sciences Division, Argonne National Laboratory
www.nmpdr.org www.theseed.org

How much has been sequenced?
[Figure: number of known sequences by year. Labeled milestones: first bacterial genome; 100 bacterial genomes; 1,000 bacterial genomes; environmental sequencing.]

How much will be sequenced?
[Figure labels: Everybody in USA; Everybody in Boston; 100 people; All cultured Bacteria; One genome from every species; Most major microbial environments.]

The Problem
How do you generate consistent and accurate annotations for metagenomes?

The SEED Family

Annotations using subsystems
• FIG has developed the notion of Subsystem – a generalization of “pathway” as a collection of functional roles jointly involved in a biological process or complex
• Extended subsystems into FIGfams – protein families that perform the same functions

Wikipedia Metabolism
http://en.wikipedia.org/wiki/Portal:Metabolism

Subsystems make up metabolism

SEED Viewer

Populated Subsystem

Subsystems Are Not Just Pathways
• genome context (virulence islands, prophages, conserved gene clusters)
• virulence mechanism
• enzymatic activity
• cellular localization
• predicted or measured co-regulation
• common phenotype
• combinations of criteria

Automated Annotations of Complete Genomes
http://rast.nmpdr.org/
• Automated user-originated processing
• Takes 1-7 hours depending on size and complexity of the genome
• ~1,500 external submissions, including 150 genomes not yet publicly released
• Reannotation of >500 genomes complete
• 789 users, 160 organizations,

Automated Annotations of Complete Metagenomes
http://metagenomics.theseed.org/
MG-RAST Server
• Accurate and consistent annotations in a few days
• Automatic metabolic reconstruction
• Freely available after registration

Metagenome Annotation
Automated pipeline
– upload sequences in FASTA, with or without Qscores
– removes exact duplicates (454 artefact)
– renumbers sequences (mapping provided)
– BLAST against SEED nr, 16S rDNA
– annotations and metabolic reconstruction
– taxonomic summary

Metagenome Metabolic Reconstruction

Phylogenomics

Comparing Metagenomes to Genomes (or other metagenomes!)

Metabolic potential in environments

MG-RAST computation
[Figure: hours of compute time versus input size (MB).]
~19 hours of compute per input megabyte

How much so far
• ~200 GS20, ~200 FLX, ~200 Sanger
• 676 metagenomes
• 10,012,793,995 bp (10 Gbp)
• Average: ~15 Mbp per metagenome
• Compute time (on a single CPU): 190,243 hours = 7,926 days = 21 years

Lots of sequences
all pyrosequencing

From Sequences To Environments
[Figure: CDA plot with axes CDA 60.2% and CDA 21.7%. Function labels: Stress, Membrane transport, Sulfur, Signaling, Capsule, Motility, Phosphorus, RNA, Respiration. Environment labels: Mine, Saltern, Coral, Fish, Marine, Microbialites, Animals, Freshwater.]
Dinsdale et al, Nature 200

Upcoming Features
• More user options (removing sequences, E-values, percent identities, etc.)
• More databases (ACLAME, human, etc.)
• More user-generated content (mashups) via web services and published API

Accessing Data via Web Services
Thanks: Bahador Nosrat (SDSU)

Workshops
Free workshops on NMPDR, RAST, MG-RAST, SEED
Upcoming workshops: Greece, Argonne, Urbana-Champaign, San Diego
Contact Leslie McNeil (lkmcneil@ncsa.uiuc.edu) or visit http://www.nmpdr.org/

Acknowledgements
Metagenomics Annotation Server
FIG
Rick Stevens
Ross Overbeek
Daniel Paarman
Veronika Vonstein
Folker Meyer
Annotators
Bob Olsen
Mark D'Souza
Statistics & Web services
Liz Dinsdale
Dana Hall
Environmental Genomics
Beltran Rodriguez-Brito
Forest Rohwer
Bahador Nosrat
and the labs that provided sequence
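
Supplementary sketch (not part of the original slides)
The Metagenome Annotation slide lists two preprocessing steps, removing exact duplicates and renumbering sequences with a mapping, and the MG-RAST computation slide quotes ~19 hours of compute per input megabyte. The Python below is a minimal illustration of those ideas only, not the MG-RAST implementation. Every name in it (parse_fasta, dedupe_and_renumber, estimate_compute_hours, the "seq" prefix) is hypothetical, "exact duplicate" is assumed to mean an identical sequence string, and one base is assumed to occupy roughly one byte, with 1 MB taken as 1,000,000 bytes.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Approximate rate quoted on the MG-RAST computation slide.
HOURS_PER_MB = 19.0


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_and_renumber(
    records: Iterable[Tuple[str, str]], prefix: str = "seq"
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Drop records whose sequence was already seen, then renumber the rest.

    Returns the kept (new_id, sequence) pairs and a mapping from each
    new_id back to the original header of the first occurrence.
    """
    seen = set()
    kept: List[Tuple[str, str]] = []
    mapping: Dict[str, str] = {}
    for header, seq in records:
        if seq in seen:  # assumption: duplicate means identical string
            continue
        seen.add(seq)
        new_id = f"{prefix}{len(kept) + 1}"
        kept.append((new_id, seq))
        mapping[new_id] = header
    return kept, mapping


def estimate_compute_hours(total_bases: int,
                           hours_per_mb: float = HOURS_PER_MB) -> float:
    """Estimate single-CPU hours, assuming ~1 byte per base."""
    return total_bases / 1_000_000 * hours_per_mb


if __name__ == "__main__":
    fasta = ">r1\nACGT\n>r2\nACGT\n>r3\nGGCC\n"
    kept, mapping = dedupe_and_renumber(parse_fasta(fasta.splitlines()))
    print(kept)
    print(mapping)
    print(round(estimate_compute_hours(10_012_793_995)))
```

Under these assumptions, the total of 10,012,793,995 bp from the "How much so far" slide gives about 190,243 hours, which matches the single-CPU figure quoted there.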