Signature: _______________________________
Name: __________________________________
Section #: _______________________________

12 • Bioinformatic Analysis of Metagenomes

INTRODUCTION

Summary

New sequencing technologies are capable of sequencing millions of base pairs of DNA, bringing us closer than ever to understanding the total genetic potential of a microbial community. Unfortunately, this vast amount of sequence data presents a problem: how can you compare and analyze millions of individual sequence reads to come up with useful conclusions and generalizations about the population you are analyzing? It is not an easy undertaking. The problem is comparable to examining all of the stars in the known universe and trying to come up with patterns and explanations for how things came to be as they are.

Enter bioinformatics: this approach applies information technology to solve practical biological problems that would otherwise be intractable. Perched on the imaginary wall that separates biology from computer science, bioinformaticians get to see the best of both fields, and they use their computer programming expertise and raw sequence data to answer pertinent scientific questions.

As mentioned during the last lab period, sequence-based metagenomics analysis encompasses the sequencing of entire genomes and the sequencing of 16S rDNA to learn more about the organisms that make up a particular community. Today we are primarily concerned with the latter. Last time you analyzed the results of your 16S rDNA PCR reaction using gel electrophoresis. In principle you could sequence the DNA you obtained and learn about the organisms that make up the community where you acquired your sample. Unfortunately, this is expensive and time-consuming. Instead, we’ll primarily be examining metagenomics data obtained by scientists to investigate the bacteria found in the human gut and in the Mississippi River.
While you will be working on a computer, it is important to remember that you will be analyzing REAL data obtained from REAL samples by REAL scientists. Before we can examine this data, however, we must first learn a little bit about how organisms are classified.

For videos about cutting-edge metagenomics research on human gut microbes see: www.nature.com/nature/videoarchive/gutmicrobes/

PLEASE REVIEW THE “How organisms are classified” SECTION OF THE MICROBIAL WORLD LAB. This information will be invaluable to understanding this lab.

Genomes, Species, and Operational Taxonomic Units (OTUs)

As stated above, assigning organisms to groups and species is often done with the help of DNA from their genome. It is worth asking what constitutes an organism’s genome. Defining a genome is actually more complex than you might think. Scientists agree that a genome represents the entirety of an organism’s hereditary information, but it is unclear whether differences that arise between individuals are variations of the same genome or whether they represent separate and distinct genomes. For example, many microbes (especially pathogens) can rearrange their genes in novel combinations. This ability helps these microbes survive in diverse and varying environmental conditions. Are two of these rearranged genomes still the same genome, or are they different? The matter is further complicated by the fact that many microbes contain extrachromosomal plasmids which can readily be transferred to other microbes, including different species. Horizontal gene transfer (or lateral gene transfer) is the process through which organisms acquire genes from sources other than descent, such as through viral infection or the uptake of DNA from the environment. Even Escherichia coli (E. coli), arguably one of the most studied organisms on the planet, cannot be said to have a single ‘E. coli genome’. The genome of the harmless K12 laboratory strain of E.
coli differs from the pathogenic O157:H7 strain and other strains by more than 25%! Large differences such as these are not uncommon among microbes, so researchers are beginning to reconsider how they think about genomes. Instead of focusing on a single invariant genome, researchers now study the pangenome of a species. The pangenome is the full set of genes present in all strains of a species. It contains the genes shared by all members of the species (only 40% of E. coli genes are shared between all the strains), as well as genes which are found only in a subset of strains¹.

Just as it is challenging to define an organism’s genome, it can also be difficult to assign an organism to a particular species. On the surface a species seems like a simple concept. With larger organisms, a species is generally a group of organisms which are similar in their appearance and habits and which interbreed (if they reproduce sexually). Similarity between genomes is also used to identify and classify organisms as belonging to a particular species. Human DNA and chimpanzee DNA are thought to be 95-98% identical, but we are clearly different species. In contrast, microscopic organisms with genomes that vary by 3-25% are often considered the same species¹. Microbes whose genomes differ by 3% or less are usually considered to be members of the same species², but this cutoff, while convenient, is rather arbitrary.

The question of whether organisms are considered to be the same species is not merely academic, as it has profound ramifications for regulatory agencies and for medicine. What does it mean to say that your beef must be free of E. coli if the E. coli genome is so variable? How would you know for sure whether your beef was safe? Which drug do you use to treat a pathogen that is only 80% similar to Staphylococcus aureus? Are some microbes used in industry and academia actually the same species as the microbe that causes anthrax?
How we name an organism does make a difference for public perceptions and public policy decisions; however, the definition of a species is less important than the ability to intelligently compare the degree of relatedness across groups to understand their evolutionary history and to make meaningful generalizations.

One way to help determine which taxon your organism belongs to is to perform an rDNA analysis. Scientists often perform PCR on an environmental sample to amplify the conserved and variable regions of ribosomal RNA genes (rDNA). You did this a couple of weeks ago when you performed 16S rDNA PCR on the DNA extracted from your microbe sample. Though it is often used in conjunction with metagenomics projects, rDNA analysis by itself is not considered to be metagenomics research because it focuses on a single gene and not ‘the genome’. This technique can tell you what types of organisms are out there, but it is not without its problems. Horizontal gene transfer of rDNA genes from one species to another can make them seem much more related than they really are³. In addition, a single bacterium or eukaryote may harbor multiple different copies of the 16S or 18S rDNA, respectively. It is known, for example, that bacteria usually have between 1 and 15 copies of 16S rDNA genes⁴,⁵, and eukaryotic copy number of 18S rDNA varies even more. For these reasons, many other housekeeping genes including rpoB, amoA, pmoA, nirS, nirK, nosZ, and pufM have been suggested as alternatives to 16S rDNA for classifying microorganisms⁴⁻⁶.

Because the concept of a species is intricate, metagenomics studies use computer-defined operational taxonomic units (OTUs) and often focus on the most common phyla or genera in the sample. When assigning an OTU to a particular taxon, you must select a cutoff value which represents how similar your OTU has to be to a taxonomic group before it can be classified as belonging to that group.
If your OTU is 100% identical to Euglenoids then you would feel pretty confident that your sequences came from this group. What if your sequences were only 90% the same? 50%? The cutoff value can be difficult to select, but an 80% cutoff is commonly used. That means that if your OTU is more than 80% identical to a known taxon, it will be considered a part of that group. If it is ≤ 80% similar to existing taxa, then it will not be assigned to any group and will instead be considered unclassified.

Bioinformatics and MOTHUR

Put simply, bioinformatics is the application of computer science to biological data. Bioinformaticians are scientists and programmers who create algorithms and applications to analyze complex biological data. Bioinformatics programs are used in diverse applications and many are used for metagenomics. These programs are the workhorses behind many of the tasks we’ve been discussing so far; they assemble genomes from sequences, call and annotate genes, compare metagenomics data between different sample sites, and analyze the diversity of microbes present in a particular community.

Extracting meaningful information from the millions of new genomic sequences presents a serious challenge for bioinformaticians. Metagenomics sequence data tends to be noisy and partial, as it comes from heterogeneous communities of microorganisms which sometimes number in the tens of thousands⁷. In any given sample, DNA from bacteria, archaea, eukaryotes, and viral species may be present at different levels of diversity and abundance. Bioinformaticians face many new and exciting challenges to create innovative solutions for genome assembly, gene calling, and function prediction. Because sequences obtained from environmental sources are fragmented, it is frequently difficult or impossible to determine the species from which a specific sequence came.
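The 80% cutoff rule described earlier in this section can be written out as a few lines of Python. This is only a sketch to make the rule concrete: the function name `assign_taxon` and the similarity values are hypothetical, and it is not how MOTHUR itself assigns taxa.

```python
def assign_taxon(similarities, cutoff=0.80):
    """Pick the best-matching taxon, or 'unclassified' if nothing beats the cutoff.

    similarities: dict mapping a taxon name to the fraction of identity
                  (0.0 to 1.0) between the OTU and that taxon (made-up values).
    cutoff:       the OTU must be MORE than this similar to be assigned.
    """
    if not similarities:
        return "unclassified"
    best_taxon = max(similarities, key=similarities.get)
    if similarities[best_taxon] > cutoff:
        return best_taxon
    # Exactly at or below the cutoff: leave the OTU unassigned.
    return "unclassified"


# Hypothetical OTUs compared against two phyla.
clear_match = assign_taxon({"Firmicutes": 0.93, "Bacteroidetes": 0.71})
borderline = assign_taxon({"Firmicutes": 0.80, "Bacteroidetes": 0.62})
print(clear_match, borderline)
```

Notice that an OTU that is exactly 80% similar stays unclassified, because the rule requires more than 80% identity.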
Bioinformatics programs have been designed to perform a variety of important functions including digital filtering, diversity analysis, and comparative metagenomics. The digital filtering performed by bioinformatics programs is a lot like the physical filtering you did during your ‘Microbial World’ lab. However, instead of physically removing undesirable microbes, the computer inspects sequences and removes the ones from unwanted organisms. If you are interested in bacteria, for example, you can filter out all sequences which appear to be eukaryotic in origin⁷. While this is a powerful technique, it sometimes calls sequences incorrectly, so it is not a substitute for physical filtering. Bioinformatics programs can also examine the diversity of microorganisms within and between samples. They can identify taxonomic groups and species and can tell you their relative abundance in your sample. Finally, bioinformatics programs have also been designed to perform comparative metagenomics to compare bacterial samples taken at different times or places. Here, the GC content, microbial genome size, and the taxonomic and functional content can be compared. You can also look for correlations between your metagenomics results and environmental variables such as temperature.

Today we’ll use a program called MOTHUR to perform a number of functions on the provided data set⁸. First, we’ll use it to convert raw sequencing data into a form that can easily be used for further analysis. Second, MOTHUR will identify the different taxonomic groups that are present in some human gut samples. Finally, we’ll perform a simple comparative metagenomic analysis to compare the bacteria present in different human gut and Mississippi River samples.

OBJECTIVES

1. Define ‘bioinformatics’ and explain how it can be used in metagenomics.
2. Describe how organisms are classified.
3. Explain the difference between an Operational Taxonomic Unit (OTU) and a species.
4.
List the steps that MOTHUR performs to process raw sequences to prepare them for analysis.
5. Make a hypothesis about how microbial populations in the human gut and Mississippi River will change at different times or different locations.
6. Use MOTHUR to analyze data in the form of rarefaction curves, phylogenetic trees, Venn diagrams, and histograms to test your hypothesis and make conclusions about microbes present in different sample locations.

MATERIALS

Bioinformatics:
Computer with Treeview X, SVGview, Strawberry Perl (for PC), and MOTHUR installed, along with human gut and Mississippi microbe data sets

Function-based Metagenomics:
Mixed E. coli fosmid libraries on antibiotic plates after 3 days of growth
Ruler
Magic marker (for marking colonies)
Colony counter (optional)

Function-based Metagenomics

Examine the results of your functional selection for resistance to your chosen antibiotics. You may do this step right away, or you may save time by examining your plates while MOTHUR is processing data.

1. Carefully examine your plates as a group of four. What do the controls tell you? How many resistant clones were identified at each site?

                 Resistant Clones from Sample Site #____    Resistant Clones from Sample Site #____
Antibiotic 1:
Antibiotic 2:

2. What can you infer from a colony’s size, and what does a colony’s distance from the antibiotic disc tell you?

Hypothesis 1

Reasons for making this hypothesis

1. Did your data support your hypothesis about differences in the prevalence of antibiotic resistance in microbes between sites? Why or why not?
2. If your results did not correspond with your predictions, propose an explanation for the observed results.
3. If you had to do this experiment over again, what would you change? What new questions would you ask?

Hypothesis 2

Reasons for making this hypothesis

1. Did your data support your hypothesis about differences in the prevalence of antibiotic resistance in microbes between sites? Why or why not?
2.
If your results did not correspond with your predictions, propose an explanation for the observed results.
3. If you had to do this experiment over again, what would you change? What new questions would you ask?

During today’s lab your TA will call a brief break and ask for a show of hands for how many groups found data to support or refute their hypothesis. Your TA will also call on one or more groups to briefly present their data. If you are called, simply tell your class what your hypothesis was, what results you got, and what you might do differently in the future if you were to redo this experiment.

Sequence Based Metagenomics-16S rRNA profiling

1. Exploring the Human Gut Metagenome using MOTHUR

Although MOTHUR is a powerful program, it does not have a graphical user interface like you may be used to. Instead, you must manually enter commands or run a batch file. A batch file is simply a text file that contains a list of instructions or commands that tell the MOTHUR program what to do. In order for scientists to create batch files for their projects, they must first understand what commands they can give to MOTHUR and what those commands do. Today you’ll be running a batch file and then following along with the worksheets below to learn more about what MOTHUR is doing.

In this first exercise you’ll be working with data obtained from studying the human gut metagenome. The data consists of DNA sequences of 16S rDNA PCR products similar to the 16S rDNA PCR products you generated from your bacterial sample. Recall that at least ten trillion microbes live in the human gut and these cells outnumber human cells ten to one⁹. If all of these microbes were placed on a scale, they would weigh about 2 pounds! It is estimated that we each have about 1,000 species of microbes in our guts and some can affect our health in important ways.
The data set you’ll be working with was generated to determine how eating a probiotic (a food which contains living microorganisms) like yogurt might impact the composition of microorganisms living in your gut. Yogurt contains Lactobacillus acidophilus, a bacterium commonly found in the human mouth and gut. It ferments sugars into lactic acid and it belongs to the Firmicutes phylum. Stool samples were collected from three different people before and after they started eating yogurt containing Lactobacillus acidophilus. Samples A and B belong to one person, samples C and D belong to a second person, and samples E and F belong to a third person. The first letter of each pair represents the person’s baseline intestinal bacteria (e.g. Sample A) while the second letter represents their bacterial content after consuming yogurt (e.g. Sample B). Yogurt is known to increase the number of species of the Bacteroidetes phylum in the gut, and increased amounts of these microbes are associated with slim people¹⁰. (This raises the question of whether eating yogurt can change the composition of your gut bacteria in a way that helps you lose weight. The jury is still out.)

To obtain the data within this file, scientists had to do a number of steps similar to the ones you’ve been learning about in lab. First, they had to physically filter their sample to enrich for what they were interested in studying: bacteria in the human gut. Next, they extracted DNA from those bacteria. Third, they performed 16S rDNA PCR reactions on each sample. The 16S rDNA PCR products were then sequenced and used to generate an output file containing all the individual reads (or sequences) obtained from the stool samples. This file, called stool.fasta, is the one you’ll be working with today. This data set originally contained 100,000 reads, but it has been reduced to only 6,000 reads for this exercise so that we can complete these operations within a single class period.
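If you have never looked inside a FASTA file such as stool.fasta, the following Python sketch may help. In the FASTA format each read starts with a header line beginning with ">" followed by one or more lines of sequence. The two short reads below are made up for illustration (they are not taken from stool.fasta), and the helper name `read_fasta` is hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Two invented reads; the first has its sequence split over two lines.
example = io.StringIO(">read1 sampleA\nACGTAC\nGT\n>read2 sampleB\nTTGACA\n")
reads = list(read_fasta(example))
lengths = [len(seq) for _, seq in reads]
print(len(reads), "reads; shortest", min(lengths), "bases; longest", max(lengths), "bases")
```

Counting reads and looking at their lengths in this way is the same kind of bookkeeping that a read summary reports, only on a much smaller scale.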
By running MOTHUR functions on this dataset you will be able to make conclusions about the relative levels of sampling completeness, sample relatedness, and sample diversity. You will also be able to determine which taxa are present in the samples.

To begin, run mothur_config, which is located in the dock. This will copy the MOTHUR files to the user’s desktop. Wait 3-5 minutes for it to finish copying the data. DO NOT OPEN THE MOTHUR FOLDER OR DO ANYTHING ELSE WHILE MOTHUR IS COPYING DATA. The terminal will then automatically open. Type “cd Desktop/mothur” and press enter to select the proper directory. Then type “./mothur gutbatchmac.txt” to have MOTHUR run the gutbatchmac.txt batch file. It will take about 5-10 minutes to run this batch file. This would be a good time to examine your antibiotic plates from the previous lab.

When MOTHUR has finished processing the batch file, open the mothur folder on the desktop and look for the most recent log file. It will be named ‘mothur.’ followed by a ten-digit number (e.g. ‘mothur.1331139964’). Open this file by holding the ‘command key’, clicking on the file, selecting ‘Open With’, and then choosing ‘Text Edit’. This log file contains the complete record of everything MOTHUR did when it was running the batch file. Use this file to follow along with the exercise below.

1. The “summary.seqs(fasta=stool.fasta)” command tells MOTHUR to open a file called stool.fasta and to summarize this file for you. At the bottom, you can see it tells you that your file contains a total of 6,000 reads. The Start and End categories at the top signify where sequences begin and end. Right now all sequences start at position 1 and end at some position downstream. The NBases category tells you the number of bases within each sequence. Currently the NBases and the End categories should be the same because all sequences start at position #1.
The Minimum, Median, Maximum, and percentile rows are shown to give you basic statistics about your sequences.

2. The trim function, “trim.seqs(fasta=current,oligos=oligos.txt,maxambig=0, maxhomop=8,minlength=300, processors=2)”, removes reads and parts of sequences that we don’t want to look at, including primer and barcode sequences. The minlength=300 parameter removes any sequences shorter than 300 bases because they are considered to be errors. The “summary.seqs()” command gives you a new summary. Note that now there are only 5,809 sequences. The new minimum size is 300 (which is equal to our old minimum size without the primer sequences).

3. The unique function, “unique.seqs(fasta=current)”, will remove extra copies of sequences generated by PCR so that only unique sequences will be analyzed further. The “summary.seqs()” command is run again to allow you to compare the total number of sequences before and after you ran unique.seqs. How many sequences were removed by the unique.seqs function?___________

4. Remember that all of the sequences you have are just different versions of the 16S rRNA gene. The align function, “align.seqs(reference=silva.bacteria.fasta, processors=2)”, will align your sequences with known 16S rDNA sequences from a variety of species so that you can start comparing the differences between species.

Example of aligned sequence reads. Hyphens represent locations where some known species have extra bases that aren’t present in the displayed sequences. THIS DATA IS NOT DISPLAYED BY MOTHUR, but is shown to illustrate what MOTHUR is doing behind the scenes.

5. Because it is difficult to compare sequences that don’t overlap with each other, the screen function, “screen.seqs(fasta=current, name=current, start=3103, end=7922, group=stool.groups, processors=2)”, is used to eliminate sequences which don’t fully overlap over a specified range.
Keep in mind, screen.seqs is NOT truncating your sequences so that they exactly overlap; it is removing entire sequence reads which don’t overlap well with the others. This function will eliminate sequences that start after nucleotide 3103 and/or end before nucleotide 7922. Only about 5% of the total number of sequences will be removed because these values were the 97.5% Start and 2.5% End values (as shown on the summary). The summary.seqs() command is run again to give you a new summary.

On the diagram above left, the box represents the desired range. Draw a larger box on this diagram to represent a larger range. If you ran the screen function with your new larger range, any sequence which doesn’t span the entire length of this range would be removed. If you used this larger box as your range [MORE or LESS] (circle one) sequences would be removed from the analysis compared to the previous range (nucleotides 3103-7922). Why do you think this is so?

6. The filter function, “filter.seqs(fasta=current,vertical=T,trump=., processors=2)”, does not eliminate any sequences but instead truncates them so they all begin and end at the same position. This makes them easier to compare. The summary.seqs() command is then run again. How long is the filtered alignment?______________ Notice that most sequences now have the same beginning and end points.

7. To counteract random sequencing errors, which occur about once in every 100 base pairs, the pre.cluster function, “pre.cluster(fasta=current,name=current,diffs=1)”, combines sequences which are less than 1% different from each other. The summary.seqs() command is run so you can see how many sequences were eliminated.

8. The “dist.seqs(fasta=current,cutoff=0.25, processors=2)” command calculates how similar each sequence is to every other sequence. For example, if you compare two sequences, you might find they are 98% identical. Because there are lots of sequences, the software builds a grid (a distance matrix) that compares every sequence with every other sequence. This should take about 16 seconds.
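The idea behind a distance score can be sketched in a few lines of Python. This is a simplified illustration, not MOTHUR's actual dist.seqs code: it assumes the sequences are already aligned to the same length, it counts every mismatched position equally, and it ignores the special handling of gaps that a real tool needs. The function names and the three ten-base sequences are made up for the example.

```python
def distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(seqs):
    """Compare every sequence with every other sequence (a square grid)."""
    return [[distance(a, b) for b in seqs] for a in seqs]


# Three invented, pre-aligned sequences of ten bases each.
seqs = ["ATGCCGTAGG", "AAGCCTAGGG", "ATGCCGTAGC"]

# The first two share 6 of 10 bases (60% similar), so the distance is 0.4.
print(distance(seqs[0], seqs[1]))

for row in distance_matrix(seqs):
    print(row)
```

Each sequence has a distance of 0 to itself, and the grid is symmetric because comparing sequence 1 with sequence 2 gives the same answer as comparing sequence 2 with sequence 1.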
The tables shown below WILL NOT BE DISPLAYED ON YOUR MOTHUR SCREEN. They are meant to show you what the computer is doing during this step.

HOW DISTANCE IS CALCULATED

Pair A
Seq # 1    Seq # 2
A          A
T          A
G          G
C          C
C          C
G          T
T          A
A          G
G          G
G          G

These two sequences share 6/10 bases (60% similar) so the distance score is 0.400

Pair B
Seq # 1    Seq # 3
A          A
T          A
G          C
C          C
C          G
G          T
T          A
A          G
T          G
G          G

What is the distance between these two sequences?_________

EXAMPLE OF A DISTANCE MATRIX

         Seq#1   Seq#2   Seq#3   Seq#4   Seq#5   Seq#6   Seq#7
Seq#1    0.000   0.900   0.800   0.450   0.320   0.950   0.010
Seq#2    0.900   0.000   0.550   0.230   0.030   0.001   0.001
Seq#3    0.800   0.550   0.000   0.001   0.220   0.670   0.530
Seq#4    0.450   0.230   0.001   0.000   0.030   0.001   0.001
Seq#5    0.320   0.030   0.220   0.030   0.000   0.780   0.970
Seq#6    0.950   0.001   0.670   0.001   0.780   0.000   0.880
Seq#7    0.010   0.001   0.530   0.001   0.970   0.880   0.000

Are sequences #1 and #7 more related than sequences #2 and #7? YES or NO (Circle One)

9. The “cluster(column=current,name=current)” function will cluster your reads into Operational Taxonomic Units (OTUs) based on the dist.seqs results. Remember that OTUs are the rough equivalent of a species or a genus. In reality, there could be multiple species within an OTU, but all the sequences in the OTU have a similar 16S rDNA sequence. Sequences which are at least 97% similar (distance ≤ 0.03) will be clustered together into the same OTU.

EXAMPLE CLUSTER DIAGRAM

The cluster diagram represents what the computer is doing. It WILL NOT be displayed by MOTHUR. Each dot in this cluster diagram represents a single sequence and lines represent the evolutionary distance between them. Dots that are close to each other are part of the same cluster or OTU. Sequences connected by thick black lines are ≥97% similar (distance is ≤ 0.03) while sequences connected by thin dotted lines are < 97% similar (distance is > 0.03).

10.
The next two commands, “make.shared(list=current, group=current,label=0.03)” and “rarefaction.single(shared=current,freq=50)”, will help make a rarefaction curve to give you an estimate of your “sampling completeness”. It is essentially a graph of the number of reads (AKA: PCR product sequences) vs. the number of OTUs. Eventually, as you get more and more sequences, you stop finding more species, so the curve plateaus. Based on the slope of your rarefaction curve you can estimate the completeness of your data. If the slope is steep, you need more reads and more sample to represent the microbial community at the specified sample site. If the slope is flat, as it gets when it plateaus, then you have sampled nearly all of the microbes living in that type of environment and more sampling is unlikely to give you any new sequences. You can also use rarefaction curves to estimate diversity. In general, steeper slopes correlate with higher diversity.

system(perl ./rarefaction.pl stool.trim.unique.good.filter.precluster.an.groups.rarefaction)
system(perl ./result.pl gut)

The above commands will transfer the rarefaction data onto an Excel spreadsheet so that you’ll be able to graph it. If you open up the MOTHUR/gut_results/taxa_rarefaction folder you’ll find you now have an Excel file called rarefaction. The number of reads will be shown on the far left in column A and the number of OTUs for each sample will be on the right in columns B-G. It should look like this:

Select all the data up to 900 reads, then insert a line graph. Do not include data after this point as it is less complete. You should now see a rarefaction curve that looks something like this:

Like all the rarefaction curves we’ve seen, the y-axis is the number of OTUs and the x-axis is the number of sequences.

Answer the following questions with the aid of the graph on the previous page.

Approximately how many reads do you think you would need to get a good representation of all the species in sample C?
(Hint: You may have to estimate)______________________

Which sample plateaus the soonest? ______________________________________

Did different people have equally diverse gut microbiota before yogurt was consumed (Samples A, C, & E)? YES or NO Why do you think this might be so?

Which sample most likely contains the greatest number of microbial species and is therefore the most diverse? ______________

Did the diversity change after eating yogurt? If so, how? (Hint: recall that samples A+B belong to person #1, samples C+D belong to person #2, and samples E+F belong to person #3. Samples A, C, and E were taken before yogurt consumption and samples B, D, and F were taken after.)

11. The classify function, “classify.seqs(fasta=current,template=trainset6_032010.rdp.fasta, taxonomy=trainset6_032010.rdp.tax, iters=1000,probs=F)”, will assign each pyrosequence read a taxonomic name. The computer will give you its best guess as to what family your sequence belongs to, and it sometimes provides information about the genus and species as well. The system(perl ./unifrac.pl) and system(perl ./result.pl gut) commands will output the data into your results folder. The ‘Phylum’, ‘Class’, ‘Order’, ‘Family’, ‘Genus’, ‘Species’ and ‘taxa’ files will be in the taxa_rarefaction folder.

Open the taxa file. The taxa file shows the OTU name followed by its classification at each of the different taxonomic levels (e.g. domain, kingdom, phylum, class, order, family, genus, species). Because most species in the environment (and the intestine) are unknown, most of the OTUs only get classified to the level of family; they can’t be assigned a species name because they differ from known species.

Open the Phylum file. On the left you’ll see a list of the phyla present in the stool samples. Each sample is listed across the top. The numbers below represent the number of sequence reads within each sample that belong to a particular phylum.
For example, sample B contains about 246 Firmicutes reads (although this number may vary). Manually select the entire chart (except for the top empty row), and click on ‘Charts’ on the top menu bar. Select ‘Column’, then ‘100% stacked’, and it should make a graph. On the menu bar click the button called ‘switch plot’ to switch the x and y axes. Each bar tells you the proportion of each microbe present in each sample. For example, Firmicutes make up about 36% of all microbes in sample A. Note: You have to make the graph larger to see the full key. If you were successful, you should get a figure that looks something like this example:

[Figure: example 100% stacked column chart, y-axis 0% to 100%; legend: Acidobacteria, Planctomycetes, Chloroflexi, Synergistetes, Actinobacteria, Bacteroidetes, Tenericutes, Lentisphaerae, Proteobacteria, Firmicutes]

Save this file as phylum2 and keep it open for comparison with Mississippi River samples later on.

How many reads belong to the Firmicutes phylum in Sample A? __________

What are the three most common phyla in Sample A? __________________________________ (Hint: Move the mouse cursor over the bars to identify the taxon.)

Approximately what percentage of the organisms in Sample D belong to Phylum Proteobacteria? __________

The hypothesis was that Bacteroidetes would increase after eating yogurt. What happens to the relative amount of Bacteroidetes after these three people ate yogurt? What can you conclude? How might you do the experiment differently to be more confident of your results?

12. Now the tree command, “tree.shared(calc=thetayc)”, will tell MOTHUR to make a phylogenetic tree and “system(perl ./result.pl gut)” will make a file that will let you see the results. The results files will be placed in a folder called tree within the gut_results folder. Click on the stool.trim.unique.good.filter.precluster.an.thetayc.0.03 file to open it in Treeview X in order to view the tree.
It should look like this:

This tree is similar to a phylogenetic tree of individual species, but you should keep in mind that many species are present within each sample. Samples which are closer together have more OTUs, and therefore more species, in common. When creating the trees, the tree.shared function ignores OTUs which are only found in one group (e.g. points with no overlap on the Venn diagram).

In general, what types of samples are the most similar: samples obtained from the same person, or samples obtained from different people either before or after eating yogurt?

Which two individuals have the most similar set of gut microbes? ________________________ (Hint: recall samples A+B belong to person #1, samples C+D belong to person #2, and samples E+F belong to person #3)

13. Next we will examine species that are shared between samples by making Venn diagrams using the following two commands.

venn(nseqs=T,permute=t)
system(perl ./result.pl gut)

Venn diagrams consist of overlapping circles of different colors. Open the MOTHUR>gut_results>Venn folder to find the correct files. You will find several files inside. MOTHUR can only make a Venn diagram of four samples at once. Because we have six groups, MOTHUR has to make 15 different Venn diagrams to compare these groups. Open the file called:

stool.trim.unique.good.filter.precluster.an.0.03.sharedsobs.sampleA-sampleB-sampleC-sampleD

It should look something like this:

The numbers in each area of the Venn diagram represent the number of OTUs in that category. Overlapping regions tell you the number of OTUs shared between samples. Regions which do not overlap represent the number of OTUs which were only found in one group.

How many OTUs are shared between Sample A and Sample B? _____

How many OTUs overlap between Sample A and Sample C? _______

Open the other Venn diagram files and determine how many OTUs are shared by Sample A and Sample F. __________________

2.
Analyze the Mississippi River Metagenome Using MOTHUR

The same type of analysis can also be performed on metagenomics data from the Mississippi River. A data set has been prepared which contains sequence reads from ten different Mississippi River locations within Minnesota. To save time and to focus on data analysis, the Mississippi River Results folder has already been prepared for you. You won’t need to run any more MOTHUR commands.

Come up with a hypothesis BEFORE lab. Prior to coming to lab, develop a hypothesis about how land use patterns or a physical property of the water could influence microbial populations in the Mississippi River. Indicate how your chosen factor will influence overall species diversity, the relative amounts of specific microbial taxa, or the degree of similarity between microbial populations at different sites. Also, explain why you think this might be so. Your hypothesis must be testable with the data you have available. For example, you might propose that a certain group of bacteria is more likely to be present when iron concentrations are high. Alternatively, you might propose that river sites which are similar in location or properties have similar bacterial populations. Finally, you could propose that certain physical factors in the water alter overall bacterial diversity between sites. Based on your hypothesis, predict which sites might have similar microbial populations. Ambitious students may choose to test more than one hypothesis.

Experiment: After learning the basics of the MOTHUR bioinformatics program, you’ll analyze MOTHUR results for the 16S rDNA sequencing data from the Mississippi River to test your hypothesis.

Experimental Outcome: The data you examine should support or refute your hypothesis. It will also show you which sites along the Mississippi River have similar microbial populations.

Resources to use to form a hypothesis:

A.
Table 1: A list of bioinformatics functions, the output they provide, and what they can tell you.
B. Figure 1: Map of Minnesota Mississippi Metagenomics Project (M3P) Sampling Sites.
C. Figure 2: Maps of land use patterns at each site.
D. Honors Moodle Site: An Excel spreadsheet of physical factors (including nutrients, metals, and pollutants) at different sampling sites on the Mississippi.

Note: Online you can find information about the nutrient and growth requirements of selected microbial groups. (Just Google one you are interested in. For example, you can type in ‘iron loving bacteria’ and see what comes up.)

A. Table 1. MOTHUR Bioinformatics Functions

Before lab, choose one or more bioinformatics functions you could use to test your hypothesis.

Function: rarefaction.single()
Output: Rarefaction curve
What it tells you: Shows if representative samples were obtained at a given site and suggests diversity differences between sites.

Function: tree.shared(calc=thetayc)
Output: Tree diagram
What it tells you: Shows the overall degree of genetic relatedness between sites. Sites composed of similar species are more related to each other.

Function: venn()
Output: Venn diagram
What it tells you: Identifies unique and shared taxa present at each site. Shows how many bacterial taxa are shared between sites.

Function: classify.seqs()
Output: Spreadsheet of taxa
What it tells you: Shows what taxa are present in your sample.
Output: 100% stacked bar graph
What it tells you: Relative proportion of different taxa at each sample site.

B. Figure 1. Map of M3P Project Sampling Sites.

The metagenomics data obtained will be used to educate the public and to help guide regulations and policies to protect this important resource.

C. Figure 2. Land Use Data at Different M3P Sample Sites

D. An Excel Spreadsheet of Physical Factors (see the Biol 1009 honors Moodle site)

With this Excel spreadsheet, you can determine at which Minnesota Mississippi Metagenomics Project sampling site(s) nutrients, metals, and pollutants can be found, as well as their respective concentrations.
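To make the outputs in Table 1 more concrete, the short Python sketch below shows the kind of arithmetic that sits behind the stacked bar graph, the Venn diagram, and the thetayc tree. It is only an illustration under stated assumptions: the function names and the OTU counts are made up for this example, and the code is not MOTHUR's own implementation. The thetayc calculation follows the commonly cited Yue and Clayton dissimilarity (one minus the sum of the products of relative abundances, divided by the sum of squared differences plus that same sum of products). Unlike tree.shared, this sketch does not drop OTUs found in only one group.

```python
def relative_abundance(counts):
    """Convert read counts per taxon into percentages (the 100% stacked bar)."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def shared_and_unique(otus_a, otus_b):
    """Venn-style counts: OTUs only in A, OTUs in both, OTUs only in B."""
    present_a = {otu for otu, n in otus_a.items() if n > 0}
    present_b = {otu for otu, n in otus_b.items() if n > 0}
    only_a = len(present_a - present_b)
    both = len(present_a & present_b)
    only_b = len(present_b - present_a)
    return only_a, both, only_b


def thetayc_dissimilarity(otus_a, otus_b):
    """Yue & Clayton dissimilarity: 0 = identical communities, 1 = nothing shared."""
    total_a = sum(otus_a.values())
    total_b = sum(otus_b.values())
    all_otus = set(otus_a) | set(otus_b)
    # Relative abundance of every OTU in each sample (0 if absent).
    a = {otu: otus_a.get(otu, 0) / total_a for otu in all_otus}
    b = {otu: otus_b.get(otu, 0) / total_b for otu in all_otus}
    sum_products = sum(a[otu] * b[otu] for otu in all_otus)
    sum_sq_diff = sum((a[otu] - b[otu]) ** 2 for otu in all_otus)
    return 1.0 - sum_products / (sum_sq_diff + sum_products)


# Hypothetical OTU read counts for two river sites (illustrative values only).
site_1 = {"Otu1": 50, "Otu2": 30, "Otu3": 20}
site_2 = {"Otu1": 10, "Otu2": 30, "Otu4": 60}

print(relative_abundance(site_1))
print(shared_and_unique(site_1, site_2))
print(thetayc_dissimilarity(site_1, site_2))
```

A tree like the one produced by tree.shared is built from a table of such pairwise dissimilarities, so two sites with a value near 0 end up close together and two sites with a value near 1 end up far apart.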
Using the resources provided, develop your own hypothesis for how land use and physical data (the presence of nutrients, metals, and pollutants) affect the metagenomic data (i.e. the overall level of diversity, similarity, and abundance of taxa between sites). For example:

1. Humans release toxic chemicals into the river, which kill off microbes.
2. Therefore, river sites near human settlements (e.g. site #4) will have reduced bacterial diversity compared to a site with fewer human settlements (e.g. site #1).

Use the prepared MOTHUR results files to test your hypothesis about how physical factors can affect the diversity or abundance of Mississippi River microbes.

Hypothesis:

Reasons for making this hypothesis:

Results examined:

Do these results support your hypothesis? If not, propose a possible explanation for the observed results:

What additional information might you wish to know if you were going to repeat this experiment?

Additional Questions about Mississippi River Samples:

1. Open the rarefaction file in the Mothur\river results\taxa_rarefaction folder and use the data to construct a rarefaction curve as you did with the gut samples above. Select all the data (including the labels like “WS5”) using control + ‘a’ and go to insert chart. Select ‘line’ from the chart types, then select the first line graph type available (also called ‘line’).

The _______ axis shows the number of OTUs while the ________ axis shows the number of sequence reads.

Which sample appears to be the most diverse based on the rarefaction curves? __________

Which is the least diverse? __________

Is more sampling needed to obtain representative microbes? Why or why not?

2. Examine the tree file found in mothur/river results/tre

Which water samples have similar populations of microbes? ______________________________
_________________________________________________________________________________

3. Open the phylum file in the taxa_rarefaction folder and construct a stacked chart as before.
What are the most common phyla?

Name three phyla in the river sample that were also present in the gut sample (do not include “Unclassified”).
1. _______________________________
2. _______________________________
3. _______________________________

References

1. National Academy of Sciences & NRC. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. (National Academies Press: Washington, DC, 2007). Available at <http://www.nap.edu/catalog.php?record_id=11902>
2. Gevers, D. et al. Opinion: Re-evaluating prokaryotic species. Nat Rev Microbiol 3, 733-9 (2005).
3. Schouls, L.M., Schot, C.S. & Jacobs, J.A. Horizontal transfer of segments of the 16S rRNA genes between species of the Streptococcus anginosus group. J Bacteriol 185, 7241-6 (2003).
4. Case, R.J. et al. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol 73, 278-88 (2007).
5. DeSantis, T.Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72, 5069-72 (2006).
6. Klappenbach, J.A., Saxman, P.R., Cole, J.R. & Schmidt, T.M. rrndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Res 29, 181-4 (2001).
7. Wooley, J.C., Godzik, A. & Friedberg, I. A primer on metagenomics. PLoS Comput Biol 6, e1000667 (2010).
8. Schloss, P.D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75, 7537-41 (2009).
9. Mullard, A. Microbiology: the inside story. Nature 453, 578-80 (2008).
10. Turnbaugh, P.J. et al. A core gut microbiome in obese and lean twins. Nature 457, 480-4 (2009).

Notes to Lab Coordinators/TAs

1. Ensure that MOTHUR and the Treeview software have been installed.
2. Every time students run MOTHUR, it creates a number of output files. When students in the next lab period run it, the old files will be overwritten.
There is nothing you have to do to manage these files.
3. Students should come up with a hypothesis before they come to lab and test that hypothesis using the data in the River results folder near the end of the lab.
4. In the event of an unrecoverable error, all the results can be generated by running the batch file from the command prompt.
5. Be sure to allow students to briefly present their findings to the group. Also, take note of any intriguing results so they can be re-examined by an M3P project scientist.