Part 2 - Susquehanna University

Spring 2015
BIOL 312: Microbiology
A Town on Fire
Metagenomic Analysis of Bacterial
Communities in Soils Overlying the
Centralia, Pennsylvania Mine Fire
Instructor: Dr. Tammy Tobin
Susquehanna University
E-Mail: tobinjan@susqu.edu
Team Application Activity #2: Microbial Community Analysis Using
MacQIIME
Names of Team Members:
Introduction: During the last class period, you learned how to use Linux and QIIME, and used QIIME to sort and
quality-filter your 16S rRNA metagenomic sequences. In the interim, you and your team were asked to begin thinking about a
single genus that is likely to be present in Centralia soils, given what you know about the chemistry and temperature of those
soils. Today, you will formulate and test your hypotheses.
Which Species Do You Expect to Find in Centralia?
At the beginning of this case study, you were asked to read an article about the species present in a geothermal environment in
China. You have also been introduced to Centralia’s unique ecology and have access to thermal and chemical data from each
sampling site (in the mapping file in the Centralia Case Study folder on your desktops…open it up with TextEdit). Using this
information, as well as that found in Chapter 18 of Microbiology, An Evolving Science, pick one species that you think may be
present in one of the sample sites in Centralia.
1. Which species have you chosen? Give the complete taxonomic assignment of this species (Domain, Phylum, Class,
(Subclass, if needed), Order, (Suborder, if needed), Family, Genus and Species).
You may find it helpful to also use the Tree of Life at:
http://tolweb.org/tree/phylogeny.html
To browse the Tree of Life, click on “Root of the Tree” under “Browse the Site”. You will see
“Eubacteria”; this is the domain (we will use the name Bacteria instead of Eubacteria). If you click on Eubacteria, the next
page will show you all of the bacterial phyla. Clicking on these phyla will allow you to delve deeper, eventually to the
genus and species, which will always be italicized. Note that some taxa are more completely described than others. That
is, there may be few defined taxonomic levels between phylum and species if a phylum is newly described
and only one or two species exist.
2. Defend your choice using biological/ecological information about that species, as well as information about the sampling
site in which you expect to find it.
Testing Your Hypothesis Using QIIME
Last time, you practiced with Linux and QIIME, and then split and quality-filtered your metagenomic sequences. Today, you
will assign each of the 16S rRNA gene sequences to Operational Taxonomic Units (OTUs) based on sequence similarity, then use a
representative member of each taxon to make and view taxonomic assignments. By default, if 16S rRNA bacterial sequences show
97% identity to each other, they are clustered together as members of the same species, with lower levels of identity used as the
cut-off for genus, etc. These taxonomic clusters are then compared to the sequences of known species from a curated database,
allowing us to identify whether the OTUs belong to a known taxon or represent a potentially new one. We use the
Greengenes 16S rRNA database in this analysis because it is of high quality and is regularly updated.
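To make the 97% rule concrete, here is a minimal Python sketch of identity-based clustering. It is an illustration only, not the algorithm that QIIME actually runs: it assumes the reads are already aligned to the same length, and the function names (percent_identity, greedy_cluster) and the toy reads are hypothetical.

```python
def percent_identity(a, b):
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_cluster(seqs, threshold=97.0):
    """Put each read in the first cluster whose seed it matches at >= threshold."""
    clusters = []  # each entry is (seed_sequence, [member_ids])
    for seq_id, seq in seqs:
        for seed, members in clusters:
            if percent_identity(seed, seq) >= threshold:
                members.append(seq_id)
                break
        else:
            # No existing seed was similar enough, so this read starts a new OTU.
            clusters.append((seq, [seq_id]))
    return [members for _, members in clusters]


# Toy 100-base reads (made-up data, not Centralia sequences).
base = "ACGT" * 25
one_mismatch = "N" + base[1:]        # differs from base at 1 of 100 positions
five_mismatches = "NNNNN" + base[5:]  # differs from base at 5 of 100 positions

reads = [("read1", base), ("read2", one_mismatch), ("read3", five_mismatches)]
clusters = greedy_cluster(reads)
print(clusters)
```

In this toy example, read2 is 99% identical to read1 and joins its cluster, while read3 is only 95% identical and starts a cluster of its own.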
Don’t forget to type the MacQIIME command each time you start your QIIME analysis in the Terminal, and then to cd to the
Centralia_Case_Study folder to begin your analysis. Also, make use of the filename completion function to avoid typos. If you
type part of the name of a command, file or directory and press the [Tab] key, the shell will complete the rest of the
name automatically. If the shell finds more than one name beginning with the letters you have typed, it will pause, prompting
you to type a few more letters before pressing the [Tab] key again. Finally, if you get an error message, you can type history to
view your previous commands and see what went wrong.
1. List the contents of the Centralia_Case_Study folder. You should now see a folder that is named
split_library_centralia_output. cd to that folder and list its contents. There should be three files. Write the file names
below:
2. cd back to the Centralia_Case_Study folder (remember you simply type cd .. to move up one directory). List the
contents to make sure you are there.
3. The seqs.fna file that you saw in the split library folder is the FASTA file containing all of the sequences that were
retained after passing the quality controls (a Phred score of at least 25, a sequence that is neither too short nor too
long, that does not have too many ‘homopolymers’, and for which barcode and primer sequences match the
expected sequences exactly). This is the file that we will use for the OTU picking.
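The quality checks described above can be sketched in a few lines of Python. This is only an illustration of the idea, not QIIME's own code. It assumes that the Phred cut-off applies to the mean score of a read, and the length limits (200 to 1000 bases) and homopolymer limit (6) are assumed values chosen for the example; barcode and primer matching is left out.

```python
def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base in seq."""
    longest = run = 0
    previous = None
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def passes_quality(seq, quals, min_qual=25, min_len=200, max_len=1000,
                   max_homopolymer=6):
    """Return True if a read passes the length, homopolymer and Phred checks.

    All thresholds other than min_qual=25 are assumptions for this sketch.
    """
    if not (min_len <= len(seq) <= max_len):
        return False  # too short or too long
    if longest_homopolymer(seq) > max_homopolymer:
        return False  # too many identical bases in a row
    mean_qual = sum(quals) / len(quals)
    return mean_qual >= min_qual


good_read = "ACGT" * 60                 # 240 bases, no long runs
short_read = "ACGT" * 10                # only 40 bases
sticky_read = "A" * 10 + "ACGT" * 60    # starts with a long run of A

print(passes_quality(good_read, [30] * len(good_read)))      # high quality
print(passes_quality(good_read, [20] * len(good_read)))      # low quality
print(passes_quality(short_read, [30] * len(short_read)))
print(passes_quality(sticky_read, [30] * len(sticky_read)))
```

Only the first read passes; the others fail on quality, length and homopolymer content, respectively.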
4. In order to pick OTUs you will use the command pick_otus.py, using the seqs.fna file as your input file, and
designating picked_otus_default as your output folder. The exact command you should type after the $ prompt is:
pick_otus.py -i split_library_centralia_output/seqs.fna -o picked_otus_default
5. List the contents of your Centralia_Case_Study folder now. You should see a new folder named
picked_otus_default. Inside that folder is a text file called seqs_otus.txt. This contains all of the quality-filtered
sequences assigned to the highest taxonomic level possible (in this case, it will be genus, but we will get to that
later). We will use that file to pick a representative sequence from each of the OTUs for further analysis.
Why do we do this? Quite simply, to keep the computing as simple (fast) as possible. If two (or two hundred)
sequences have been grouped together into the same taxon by the OTU analysis, we do not need to analyze all of
them – only one representative.
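The idea of keeping one representative per OTU can be sketched in Python as follows. The sketch assumes that each line of seqs_otus.txt holds an OTU ID followed by the IDs of its member sequences, separated by tabs, and it uses one simple rule (take the first sequence listed). The example lines and sequence IDs are made up.

```python
def parse_otu_map(lines):
    """Map each OTU ID to the list of sequence IDs clustered into it."""
    otu_map = {}
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        otu_map[fields[0]] = fields[1:]
    return otu_map


def pick_representatives(otu_map):
    """Keep only the first sequence listed for each OTU."""
    return {otu_id: members[0] for otu_id, members in otu_map.items()}


# Hypothetical OTU map lines: OTU ID, then member sequence IDs.
example_lines = [
    "0\tS1_12\tS1_40\tS2_7\n",
    "1\tS3_3\n",
    "2\tS2_15\tS3_21\n",
]

otu_map = parse_otu_map(example_lines)
reps = pick_representatives(otu_map)
total_sequences = sum(len(members) for members in otu_map.values())
print(f"{total_sequences} sequences reduced to {len(reps)} representatives")
print(reps)
```

Here six sequences collapse to three representatives, so every later step has half as many sequences to process.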
6. The command for picking a representative set (make sure you are in the Centralia_Case_Study folder before you
begin) is:
pick_rep_set.py -i picked_otus_default/seqs_otus.txt -f split_library_centralia_output/seqs.fna -o rep_set1.fna
pick_rep_set.py is the command
seqs_otus.txt is the input file
seqs.fna is the original quality trimmed file that has been sorted by sample site
rep_set1.fna will be the name of the output text file containing the representative sequences of each taxon present
in Centralia
7. If all goes well, there should now be a file called rep_set1.fna in your Centralia_Case_Study folder. Look at the
QIIME flowchart at the end of this handout…you have now completed the third step…congratulations!
8. Even though the next step in the flow chart is to align the sequences, we will skip that step for now. It will
become necessary before we do phylogenetic and statistical analysis, but is not necessary for assigning taxonomy
or for viewing the taxonomic assignments.
9. We must now assign taxonomy to the representative sequences. This is done using the command
assign_taxonomy.py with your representative sequence set as your input file and centralia_assigned_taxonomy as
the output folder. The command is:
assign_taxonomy.py -i rep_set1.fna -o centralia_assigned_taxonomy
10. Now you will have QIIME make a table of all of that information so that it can be more easily viewed. The
command for this is:
make_otu_table.py -i picked_otus_default/seqs_otus.txt -t centralia_assigned_taxonomy/rep_set1_tax_assignments.txt -o output.biom
This command tells the computer which program to run, the input file to use, the taxonomic assignment file to
use, and the output file name to use.
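Conceptually, the table combines two pieces of information: how many sequences from each sample fall into each OTU, and which taxon each OTU was assigned to. The Python sketch below illustrates this. It assumes that each sequence ID has the form SampleID_number, so that the sample name can be read off the ID; the OTU map, sample names and taxonomy strings are made up, and the real output.biom file uses its own format.

```python
from collections import Counter


def build_otu_table(otu_map):
    """Count, for each OTU, how many member sequences come from each sample."""
    table = {}
    for otu_id, seq_ids in otu_map.items():
        # "S1_12" -> sample "S1" (assumes IDs look like SampleID_number)
        table[otu_id] = Counter(seq_id.rsplit("_", 1)[0] for seq_id in seq_ids)
    return table


# Hypothetical inputs for illustration.
otu_map = {
    "0": ["S1_12", "S1_40", "S2_7"],
    "1": ["S3_3"],
}
taxonomy = {
    "0": "k__Bacteria; p__Firmicutes",
    "1": "Unassigned",
}

table = build_otu_table(otu_map)
samples = ["S1", "S2", "S3"]
print("OTU", "\t".join(samples), "taxonomy", sep="\t")
for otu_id, counts in table.items():
    row = [str(counts[s]) for s in samples]
    print(otu_id, "\t".join(row), taxonomy[otu_id], sep="\t")
```

Each row of the printed table is one OTU, each column is one sample, and the last column carries the taxonomic assignment.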
11. Finally (for today)! You will summarize the information in the biom table into graphical plots that are easy to
view and analyze. This is done using the command summarize_taxa_through_plots.py with your biom table as
the input file and an output folder called centralia_taxa_summary_plots. The program also needs the information
from the original mapping file so that the graphical output can be organized by sampling site. The command is:
summarize_taxa_through_plots.py -i output.biom -m mapping_centralia.txt -o centralia_taxa_summary_plots
12. We are done with QIIME for today! Go ahead and type exit and then quit the program.
13. Next, double click on the Centralia_Case_Study folder on your desktop to open it, then on the
centralia_taxa_summary_plots folder, and then, finally, on the taxa_summary_plots folder. In that folder you will
see an HTML file called bar_charts.html. Double click on that to open it in a web browser. You will see six different
bar charts that will look something like the figure below. As you scroll down the page, the taxonomic level shown
changes, with the first chart (below) showing only kingdom (k; Bacteria) and phylum (p). The last chart goes all the
way down to genus (g) level, if it is possible to do so based on the representative sequence.
Under each bar chart you will also see a legend. From the left-hand column to the right, the legend shows a colored box,
then the taxon that the color represents, and finally the percentage of the population that the taxon represents in each of
the three sample locations. In this case, red is unassigned; that is, it represents potentially new phyla: bacteria that have
never been described before. Note that in S2 in this dataset, these make up 47.9% of the population! In these legends,
k=kingdom (Bacteria), p=phylum, c=class, o=order, f=family and g=genus.
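The one-letter prefixes in the legend can be unpacked with a short Python sketch. It assumes that each legend entry is written as prefix, two underscores, then the name, with levels separated by semicolons (for example k__Bacteria; p__Firmicutes). The function names and example lineages are hypothetical.

```python
LEVELS = {
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
}


def parse_lineage(lineage):
    """Turn 'k__Bacteria; p__Firmicutes' into {'kingdom': 'Bacteria', ...}."""
    ranks = {}
    for part in lineage.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in LEVELS and name:
            ranks[LEVELS[prefix]] = name  # levels with an empty name are skipped
    return ranks


def deepest_rank(lineage):
    """Return the most specific level that was actually assigned, or None."""
    ranks = parse_lineage(lineage)
    deepest = None
    for level in LEVELS.values():  # ordered from kingdom down to genus
        if level in ranks:
            deepest = level
    return deepest


full = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Bacillus"
partial = "k__Bacteria; p__Firmicutes; c__"

print(parse_lineage(full))
print(deepest_rank(partial))
print(deepest_rank("Unassigned"))
```

A lineage that stops at phylum tells you only that the representative sequence matched that phylum; it does not rule out any particular genus within it.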
14. Do the data support your hypothesis? Answer this question by first referring back to the taxonomy of your
species, and then by looking at the bar charts. If the exact genus is not there, look for evidence for the presence of
the correct family, order, class, etc., recognizing that your species may well be in there but simply not be
detectable with this sequence dataset. (For example, your representative sequence may have been long enough, or of good
enough quality, to get a good match for ‘Firmicutes’, but not for any deeper taxa. Another sequencing reaction or
DNA isolation on another day might have had better luck.)
Note that if a taxon is present at <0.1% of the population, it will be listed in the table as 0%, but may
still be present in the population. In order to see if it really is there, click on the “View Table (.txt)” link
located between the chart and the legend. Scroll down until you find your taxon, and then determine if
your taxon really is present or not.
15. Which taxonomic level (if any) shows evidence for the presence of your hypothesized species? What percent of
the total population does this taxon comprise (from “View Table (.txt)”)?
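The reason a rare taxon can appear as 0% in the legend while still being present is simple arithmetic, as the Python sketch below shows. The counts are made up, and the sketch assumes the legend displays percentages to one decimal place.

```python
def relative_abundance(counts):
    """Convert raw sequence counts per taxon into percent of the total."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical counts for one sample (10,000 sequences in total).
counts = {"Firmicutes": 2400, "Proteobacteria": 7596, "Chloroflexi": 4}
percents = relative_abundance(counts)

for taxon, pct in percents.items():
    # One decimal place mimics the legend; three decimals mimics the full table.
    print(f"{taxon}: legend shows {pct:.1f}%, table shows {pct:.3f}%")
```

In this example Chloroflexi is represented by four sequences, so it is present, but it rounds to 0.0% at one decimal place. Only the full table reveals it.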