Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?] Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future? Why Metagenomics? • What is there? • How many are there? • What are they doing? How do you sequence the environment? • Extract DNA CsCl step gradient 1.1 g ml-1 1.35 g ml-1 1.5 g ml-1 1.7 g ml-1 CsCl step gradient How do you sequence the environment? • Extract DNA • Create library Linker-Amplified Shotgun Libraries (LASLs) Soil Extraction Kit Hydroshear Blunt-ending Hydroshear Blunt-ending Addition of Linkers Amplification of Fragments This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA Addition of Linkers Amplification of Fragments David Mead - Breitbart (2002) PNAS How do you sequence the environment? • Extract DNA • Create library • Sequence fragments Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future? Why Phages? • Phages are viruses that infect bacteria – 10:1 ratio of phages:bacteria – 1031 phages on the planet • Specific interactions (probably) – one virus : one host • Small genome size – Higher coverage • Horizontal gene transfer – 1025-1028 bp DNA per year in the oceans Uncultured Viruses 200 liters water 5-500 g fresh fecal matter Concentrate and purify viruses Epifluorescent Microscopy Extract nucleic acids DNA/RNA LASL Sequence Bioinformatics • BLASTagainst NR – blastx, tblastn, tblastx • BLAST against boutique databases – Complete phage genomes, ACLAME, Other libraries, 16S • Parsing to present data in a useful format BLAST and Parsing • http://phage.sdsu.edu/blast • Submit BLAST to local and remote databases – Local (as fast as possible) – NCBI (one search every 3 seconds) • Many concurrent searches – One search versus 1,000 searches • Parse data into tables – Access to taxonomy etc Most Viral Genes are Unknown Known 22% Unknown 78% TBLAST (E<0.001) 3,093 sequences Breitbart (2002) PNAS Rohwer (2003) Cell GenBank has more than doubled since 2001 … 60 billion base pairs 60 million sequences GenBank has more than doubled since 2001 … but the fraction of unknowns remains constant Edwards (2005) Nature Rev. Microbiol. All of the new genes in the databases are coming from environmental sequences Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future? Human-associated viruses • More bacteria than somatic cells by at least an order of magnitude • More phages than bacteria by an order of magnitude • Sample the bacteria in the intestine by sampling their phage Most Viral DNA Sequences in Adult Human Feces are Unknown Phages Eukaryotic Viruses 6% Known 40% Unknown 60% TBLAST (E<0.001) 532 sequences Phages 94% Breitbart (2003) J. Bacteriol. Adults Versus Babies No bacteria or viruses in 1st fecal sample Abundant bacterial and viral communities by 1 week of age >108 VLP/g feces Baby Feces Viruses • Most sequences are unknown (≈70%) • Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts • From microarray studies, sequences are stable in the baby over a 3 month period • Same types of phage as present in adult feces – one identical sequence to an unrelated adult! DNA viruses in feces are phages. Feces ≠ intestines. RNA viruses? Most Human RNA Viruses are Known Unknown 8% Known 92% TBLAST (E<0.001) ≈36,000 sequences Other Plant Viruses 9% Pepper Mild Mottle Virus 65% Other 26% Zhang (2006) PLoS Biology Pepper Mild Mottle Virus (PMMV) • ssRNA virus; ≈6 kb genome • Related to Tobacco Mosaic Virus • Infects members of Capsicum family • Widely distributed – spread through seeds • Fruits are small, malformed, mottled • Rod-shaped virions Viral particles in fecal sample TOBACCO MOSAIC VIRUS http://www.rothamsted.bbsrc.ac.u k/ppi/links/pplinks/virusems/ PMMV is common in Human Feces Fecal samples Extract total RNA RT-PCR for PMMV S1 S2 S3 S4 S5 S6 S7 S8 S9 PMMV San Diego : 78% people are positive Singapore : 67% people are positive 10-50 fold increase in feces compared to food 106-109 PMMV copies per gram dry weight of feces Which Foods Contain PMMV? Chili powder Chili sauces NOT FOUND IN FRESH PEPPERS Koch’s Postulates Thesunmachine.net http://www.sweatnspice.com Human microbial metagenome is more important than human genome Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future? How do you sequence the environment? • Extract DNA • Create library • Sequence fragments How do you sequence the environment? • Extract DNA • Pyrosequence 454 Pyrosequencing •DNA extraction from environment •Whole genome amplification • Emulsion-based PCR • Luciferase-based sequencing } SDSU } 454 Inc. Margulies (2005) Nature 454 Sequence Data (Only from Rohwer Lab) • 21 libraries – 10 microbial, 11 phage • 597,340,328 bp total – 20% of the human genome – 50% of all complete and partial microbial genomes • 5,769,035 sequences – Average 274,716 per library • Average read length 103.5 bp – Av. read length has not increased in 7 months Growth of sequence data 600 million bp 6 million reads Cost of sequencing • • • • • • One reaction = $10,000 One reaction = 250,000 reads 250 reads = $10 1 read = 4¢ 454 sequencing does 1 read = 100bp cot require cloning, arraying 1 bp = 0.04¢ etc. ($400 per 1x 1,000,000 bp) • Sanger sequencing ca. $1/rxn, 0.2¢/bp – real cost ca. $5/rxn, 1¢/bp Bioinformatics • 597,340,328 bp total • 5,769,035 sequences • 7 months • Existing tools are not sufficient Current Pipeline http://phage.sdsu.edu/~rob/Pyrosequencing/ • Dereplicate • BLAST against – 16S – Complete phage – nr (SEED) – subsystems Sequencing is cheap and easy. Bioinformatics is neither. Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future? The Soudan Mine, Minnesota Red Stuff Black Stuff Oxidized Reduced Red and Black Samples Are Different Black stuff Cloned and 454 sequenced 16S are indistinguishable Cloned Red Red Annotation of metagenomes by subsystems A subsystem is a group of genes that work together – Metabolism – Pathway – Cellular structures – Anything an annotator thinks is interesting There are different amounts of metabolism in each environment There are different amounts of substrates in each environment Red Stuff Black Stuff But are the differences significant? • Sample 10,000 proteins from site 1 • Count frequency of each subsystem • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from sites 1 and 2 with 95% CI Rodriguez-Brito (2006). In Review Examples of significantly different subsystems Red Stuff Arg, Trp, His Ubiquinone FA oxidation Chemotaxis, Flagella Methylglyoxal metabolism Black Stuff Ile, Leu, Val Siderophores Glycerolipids NiFe hydrogenase Phenylpropionate degradation Subsystem differences & metabolism Iron acquisition Black Stuff Siderophore enterobactin biosynthesis ferric enterobactin transport ABC transporter ferrichrome ABC transporter heme Black stuff: ferrous iron (Fe2+, ferroan [(Mg,Fe)6(Si,Al)4O10(OH)8]) Red stuff: ferric iron (goethite [FeO(OH)]) Nitrification differentiates the samples Edwards (2006) In review Not all biochemistry happens in a single organism Anaerobic methane oxidation Boetius et al. Nature, 2000. CH4 + SO42- -> HCO3- + HS- + H2S Archaea CH4 + H2O -> HCO3- + OH + H2 -> CO2 + H2O Bacteria SO42- + H2O -> HS- + OH + 2O2 The challenge is explaining the differences between samples Red Sample Arg, Trp, His Ubiquinone FA oxidation Chemotaxis, Flagella Methylglyoxal metabolism Black Sample Ile, Leu, Val Siderophores Glycerolipids NiFe hydrogenase Phenylpropionate degradation We are moving away from one organism one reaction and towards studying the biochemistry of whole environments Bacteria don’t live alone Summary From 454 sequence: – Identify microbial composition – Identify metabolic function – Identify statistically significant differences in metabolism – Who, what, why of microbial ecology Metazoan associated Sampling Sites Marine Near-shore water (~100 samples) Off-shore water (~50 samples) Near- and off-shore sediments Corals Fish Human blood Human stool Freshwater Aquifer Glacial lake Extreme Terrestrial/Soil Amazon rainforest Konza prairie Joshua Tree desert Singapore Air Hot springs (84oC; 78oC) Soda lake (pH 13) Solar saltern (>35% salt) SDSU FIG Forest Rohwer Mya Breitbart Beltran Rodriguez-Brito Rohwer Lab: Linda Wegley Florent Angly Matt Haynes Also at SDSU Anca Segall Willow Segall Stanley Maloy Genome Institute of Singapore: Zhang Tao Charlie Lee Chia Lin Wei Yijun Ruan MIT: Ed DeLong Veronika Vonstein Ross Overbeek Annotators Math Guys@SDSU Peter Salamon Joe Mahaffy James Nulton Ben Felts David Bangor Steve Rayhawk Jennifer Mueller NSF - Biotic Surveys and Inventories - Biological Oceanography - Biocomplexity Viral Community Structure • Contigs assembled from fragments with >= 98% identity over 20 bp are a resampling of a single phage genome • Contig specturm is the number of contigs that have one sequence, the number that have two sequences, and so on • Use both analytical and Monte-Carlo simulations to predict community structure from contig spectrum The Math Guys (2006) In preparation Abundance of the species (%) Determine the actual contig spectrum of the sample 2.00 1.80 1.60 1.40 1.20 1.00 0.80 Predict a contig spectrum using a species abundance model 0.60 0.40 0.20 0.00 0 10 20 30 40 50 Species Rank Continue this procedure until we obtain the smallest error Compute the error between the actual and predicted Adjust the parameters in the species abundance model to minimize errors Error Model parameters Find the smallest error, a global minimum Viral Communities are Extremely Diverse Fecal Seawater Marine Sediments Lots of rare viral genotypes 10 Sediment Viruses 9 Seawater Viruses Seawater Viruses 8 7 Fecal Viruses Shannon- 6 Wiener 5 Index Bacteria on Corals Agriculture Soil Bacteria Soil Nematodes 4 3 2 1 0 Cropland Earthworms Rainforest Spiders Amazon Fish Rainforest Birds Forest Mammals Temperate Forest Beetles River Bacteria Forest Amphibians Fossil Corals