Spring 2015 BIOL 312: Microbiology
A Town on Fire
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire
Instructor: Dr. Tammy Tobin, Susquehanna University
E-Mail: tobinjan@susqu.edu

Team Application Activity #2: Microbial Community Analysis Using MacQIIME

Names of Team Members:

Introduction:

During the last class period, you learned how to use Linux and QIIME, and used QIIME to sort and quality filter your 16S rRNA metagenomic sequences. In the interim, you and your team were asked to begin thinking about a single genus that is likely to be present in Centralia soils, given what you know about the chemistry and temperature of those soils. Today, you will formulate and test your hypotheses.

Which Species Do You Expect to Find in Centralia?

At the beginning of this case study, you were asked to read an article about the species present in a geothermal environment in China. You have also been introduced to Centralia’s unique ecology and have access to thermal and chemical data from each sampling site (in the mapping file in the Centralia Case Study folder on your desktops; open it with TextEdit). Using this information, as well as that found in Chapter 18 of Microbiology, An Evolving Science, pick one species that you think may be present in one of the sample sites in Centralia.

1. Which species have you chosen? Give the complete taxonomic assignment of this species (Domain, Phylum, Class, (Subclass, if needed), Order, (Suborder, if needed), Family, Genus and Species).

You may find it helpful to also use the Tree of Life at: http://tolweb.org/tree/phylogeny.html
To browse the Tree of Life, click on “Root of the Tree” under “Browse the Site”. You will see “Eubacteria”; this is the domain (we will use the name Bacteria instead of Eubacteria). If you click on Eubacteria, the next page will show you all of the bacterial phyla.
Clicking on these phyla will allow you to delve deeper, eventually reaching the genus and species, which are always italicized. Note that some taxa are more completely described than others: a newly described phylum with only one or two known species may have few defined taxonomic levels between phylum and species.

2. Defend your choice using biological/ecological information about that species, as well as information about the sampling site in which you expect to find it.

Testing Your Hypothesis Using QIIME

Last time, you practiced with Linux and QIIME, and then split and quality filtered your metagenomic sequences. Today, you will assign each of the 16S rRNA gene sequences to Operational Taxonomic Units (OTUs) based on sequence similarity, then use a representative member of each OTU to make and view taxonomic assignments. By default, 16S rRNA bacterial sequences that show at least 97% identity to each other are clustered together as members of the same species, with lower levels of identity used as the cut-off for genus, etc. These clusters are then compared to the sequences of known species in a curated database, allowing us to identify whether each OTU belongs to a known taxon or represents a potentially new one. We use the Greengenes 16S rRNA database in this analysis because it is of high quality and is regularly updated.

Don’t forget to type the MacQIIME command each time you start your QIIME analysis in the Terminal, and then to cd to the Centralia_Case_Study folder to begin your analysis. Also, make use of the Filename Completion function to avoid typos. If you type part of the name of a command, file or directory and press the [Tab] key, the shell will complete the rest of the name automatically.
If the shell finds more than one name beginning with the letters you have typed, it will pause, prompting you to type a few more letters before pressing the [Tab] key again. Finally, if you get an error message, you can type history to view your previous commands and see what went wrong.

1. List the contents of the Centralia_Case_Study folder. You should now see a folder named split_library_centralia_output. cd to that folder and list its contents. There should be three files. Write the file names below:

2. cd back to the Centralia_Case_Study folder (remember, you simply type cd .. to move up one directory). List the contents to make sure you are there.

3. The seqs.fna file that you saw in the split library folder is the FASTA file with all of the sequences that were retained after passing the quality controls: a Phred score of at least 25, a sequence that is neither too short nor too long, not too many ‘homopolymers’, and barcode and primer sequences that match the expected sequences exactly. This is the file that we will use for OTU picking.

4. To pick OTUs you will use the command pick_otus.py, with the seqs.fna file as your input file and picked_otus_default as your output folder. The exact command you should type after the $ prompt is:

    pick_otus.py -i split_library_centralia_output/seqs.fna -o picked_otus_default

5. List the contents of your Centralia_Case_Study folder now. You should see a new folder named picked_otus_default. Inside that folder is a text file called seqs_otus.txt. This contains all of the quality-filtered sequences assigned to the highest taxonomic level possible (in this case, it will be genus, but we will get to that later). We will use that file to pick a representative sequence from each of the OTUs for further analysis. Why do we do this? Quite simply, to keep the computing as simple (fast) as possible.
If two (or two hundred) sequences have been grouped together into the same taxon by the OTU analysis, we do not need to analyze all of them, only one representative.

6. The command for picking a representative set (make sure you are in the Centralia_Case_Study folder before you begin) is:

    pick_rep_set.py -i picked_otus_default/seqs_otus.txt -f split_library_centralia_output/seqs.fna -o rep_set1.fna

    pick_rep_set.py is the command
    seqs_otus.txt is the input file
    seqs.fna is the original quality-trimmed file that has been sorted by sample site
    rep_set1.fna will be the name of the output text file containing the representative sequence of each taxon present in Centralia

7. If all goes well, there should now be a file called rep_set1.fna in your Centralia_Case_Study folder. Look at the QIIME flowchart at the end of this handout. You have now completed the third step. Congratulations!

8. Even though the next step in the flowchart is to align the sequences, we will skip that step for now. It will become necessary before we do phylogenetic and statistical analysis, but it is not needed for assigning taxonomy or for viewing the taxonomic assignments.

9. We must now assign taxonomy to the representative sequences. This is done using the command assign_taxonomy.py, with your representative sequence set as the input file and centralia_assigned_taxonomy as the output folder. The command is:

    assign_taxonomy.py -i rep_set1.fna -o centralia_assigned_taxonomy

10. Now you will have QIIME make a table of all of that information so that it can be viewed more easily. The command for this is:

    make_otu_table.py -i picked_otus_default/seqs_otus.txt -t centralia_assigned_taxonomy/rep_set1_tax_assignments.txt -o output.biom

This command tells the computer which program to run, the input file to use (-i), the taxonomic assignment file to use (-t), and the output file name to use (-o).

11. Finally (for today)! You will summarize the information in the biom table into graphical plots that are easy to view and analyze. This is done using the command summarize_taxa_through_plots.py, with your biom table as the input file and an output folder called centralia_taxa_summary_plots. The program also needs the information from the original mapping file so that the graphical output can be organized by sampling site. The command is:

    summarize_taxa_through_plots.py -i output.biom -m mapping_centralia.txt -o centralia_taxa_summary_plots

12. We are done with QIIME for today! Go ahead and type exit, and then quit the Terminal program.

13. Next, double click on the Centralia_Case_Study folder on your desktop to open it, then on the centralia_taxa_summary_plots folder, and finally on the taxa_summary_plots folder. In that folder you will see an html file called bar_charts.html. Double click on it to open it in a web browser. You will see six different bar charts that will look something like the figure below. As you scroll down the page, the taxonomic level shown changes, with the first chart (below) showing only kingdom (k; Bacteria) and phylum (p). The last chart goes all the way down to the genus (g) level, if it is possible to do so based on the representative sequence.

Under each bar chart you will also see a legend. From the left-hand column to the right, the legend shows a colored box, then the taxon that the color represents, and finally the percentage of the population that the taxon makes up in each of the three sample locations. In this case, red is unassigned; that is, it represents potentially new phyla: bacteria that have never been described before. Note that in S2 in this dataset, these make up 47.9% of the population! In these legends, k=kingdom (Bacteria), p=phylum, c=class, o=order, f=family and g=genus.

14. Does the data support your hypothesis? Answer this question by first referring back to the taxonomy of your species, and then by looking at the bar charts. If the exact genus is not there, look for evidence of the correct family, order, class, etc., recognizing that your species may well be in there but simply not be detectable with this sequence dataset. (For example, your representative sequence may have been long enough, or of good enough quality, to get a good match for ‘Firmicutes’, but not for any deeper taxa. Another sequencing reaction or DNA isolation on another day might have had better luck.)

Note that if a taxon is present at <0.1% of the population, it will be listed in the legend as 0%, but it may still be present in the population. To see if it really is there, click on the “View Table (.txt)” link located between the chart and the legend. Scroll down until you find your taxon, and then determine whether it really is present.

15. Which taxonomic level (if any) shows evidence for the presence of your hypothesized species? What percent of the total population does this taxon comprise (from “View Table (.txt)”)?
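
For steps 14 and 15, the table behind the “View Table (.txt)” link can also be searched from the Terminal instead of by scrolling. The sketch below is a minimal, hypothetical helper (find_taxon is not part of QIIME). It assumes the table is a tab-delimited text file with the taxon string in the first column and header lines that start with #. The file name in the usage comment is also an assumption, so substitute whichever file the link actually opens.

```shell
#!/bin/sh
# Hypothetical helper (not a QIIME command): print the header line(s) of a
# tab-delimited taxa summary table, followed by every row whose taxon string
# (assumed to be the first column) contains the given name, ignoring case.
# Usage: find_taxon TABLE_FILE TAXON_NAME
find_taxon() {
    if [ "$#" -ne 2 ] || [ -z "$2" ]; then
        echo "usage: find_taxon TABLE_FILE TAXON_NAME" >&2
        return 1
    fi
    awk -F '\t' -v name="$2" '
        BEGIN { name = tolower(name) }            # compare in lower case
        /^#/  { print; next }                     # keep header/comment lines
        index(tolower($1), name) > 0 { print }    # keep rows that mention the taxon
    ' "$1"
}

# Example (the file name is an assumption; use the file the link opens):
#   find_taxon centralia_taxa_summary_plots/output_L6.txt firmicutes
```

The header row shows which column belongs to which sample, so a matching row can be read directly as that taxon's share of each sample. Searching for a family, order or class name instead of a genus name works the same way, which helps when the exact genus is absent.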