Part 3 - Susquehanna University

advertisement
Spring 2015
BIOL 312: Microbiology
A Town on Fire
Metagenomic Analysis of Bacterial
Communities in Soils Overlying the
Centralia, Pennsylvania Mine Fire
Instructor: Dr. Tammy Tobin
University
E-Mail: tobinjan@susqu.edu
Susquehanna
Team Application Activity #3: Statistical Analysis of Microbial
Community Diversity and Composition.
Names of Team Members:
Introduction: During the last class period, you assigned your metagenomic sequences to OTUs, picked representative
sequences for each OTU, made a .biom table to summarize your OTU data, and then processed the .biom tables into bar graphs in
order to better visualize whether or not the results supported your hypothesis regarding the presence or absence of a single species
in a Centralia soil sample. You were then asked to hypothesize which environmental parameter, from the ones in the mapping file
(temperature, pH, ammonia, nitrate, sulfate or total sulfur concentration) you believe plays the largest role in determining
microbial community diversity and structure in Centralia. In this activity you will use QIIME to prepare your sequence data for
phylogenetic analysis, to generate a phylogenetic tree from your sequence data (we will analyze that tree during the last class
period), and to analyze how microbial community diversity varies with environmental conditions.
Diversity Analyses
We will be performing two different types of diversity analysis in this case study. Alpha diversity looks at the diversity within
samples, in this case, the OTU diversity within each borehole. Beta diversity describes the differences between samples. Both types
of diversity analysis can be computed using QIIME. Since some of the diversity metrics we will be using require the existence of a
phylogenetic tree, we will construct that first. We will not actually analyze the tree until the next session.
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
1
Using QIIME to Construct Phylogenetic Trees
The underlying assumption for all DNA sequenced-based phylogenetic analyses is that the more closely related two species are,
evolutionarily, the more closely related their DNA sequences will be. This underlying assumption does have some flaws that need
to be kept in mind. As you have already learned, horizontal gene transfer between species can make two species look more (or
less) related than they truly are. Also, not all DNA changes are equal in terms of phenotypic outcome. Some mutations are
selectively neutral while others are not. Thus, selective pressures will have an impact on the rate of nucleotide changes observed in
different parts of the genome over time. Some analysis metrics take this latter situation into account by weighting base changes in
different codon positions differently (to account for silent mutations, etc.). This is not done in 16S rRNA sequence analysis
because there are no codons (no protein is produced).
1. Aligning sequences. In order to construct a phylogenetic tree, all of the 16S rRNA sequences in our quality-filtered fasta files will
first need to be aligned to make sure that the base changes observed between sequences are due to mutations to that site in the gene,
rather than to comparison of two completely different parts of the gene. QIIME will also insert gaps, as needed, to account for the
fact that insertions and deletions of bases also occur during evolution. By way of example, take the three related phrases below:
AFATCAT
AFFATCAT
TINYRATFEAREDAFFATCAT
If these phrases were compared without adjusting the default alignment (left justified) above, they would show almost no similarity
at all. Even phrases 1 and 2, which are obviously very similar, would have only two letters in common as they are currently
aligned: the first A and F. The third sequence (TI) would not match at all. After those first two letters, almost every subsequent
letter is different. Aligned versions of these phrases are shown below:
A_FATCAT
AFFATCAT
TINYRATFEAREDAFFATCAT
In this scenario, QIIME has shifted phrases 1 and 2 over to the right, so they match the corresponding phrase in 3, and has also
inserted a gap in phrase 1, to account for the additional F’s in phrases 2 and 3. This will give a much more accurate picture of the
overall sequence identity.
The command for aligning sequences in QIIME is align_seqs.py, with an input of your representative OTU fasta files, and an
output directory of centralia_repset1_aligned. This step will take several minutes, so be patient and do not hit return or enter
another command until you see the $ prompt. The total command is:
align_seqs.py -i rep_set1.fna -o aligned_sequences
2. Next you will filter out uninformative sequence data. This script will remove positions that are gaps in every sequence (which can
happen with some alignment programs), as well as those, such as ‘TINYRATFEARED’ in phrase 3 above that are non-conserved
positions (it is only present in one phrase), and thus are uninformative for tree building.
The command for this is: filter_alignment.py. The input file will be the aligned sequence file, and the output will be a folder
entitled ‘filtered_alignments’. The total command is:
filter_alignment.py -i aligned_sequences/rep_set1_aligned.fasta -o filtered_alignments
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
2
3. Finally, you will ask QIIME to generate a phylogenetic tree from your filtered alignments using FastTree, a modified NeighborJoining method (uses pairwise sequence comparisons to build the tree) that is much faster at metagenomics level analyses than
many other methods, but is still reliable (see article in References section for more details). The command is:
make_phylogeny.py -i filtered_alignments/rep_set1_aligned_pfiltered.fasta
You will be able to visualize trees in three basic ways, as shown below. The first two trees are ‘rooted’, while that last is not.
Remember that no matter which way the tree is diagrammed, they show phylogenetic relationships in the same way.
1.
In the first tree above, which taxon is most closely related to Nimravidae?
2.
In the second tree above, which taxon is most closely related to Spirochetes?
That’s it for phylogeny for now. We can next move on to calculating alpha diversity.
Calculating Alpha Diversity
To calculate alpha diversity, QIIME must first generate alpha rarefaction tables (in biom format). As you know from your
readings, rarefaction data will not only provide information regarding the amount of diversity present within each sample, but will
also help you determine if you have sampled at a sufficient depth to reveal an acceptable level of the diversity present in your
sample. We will use three different methods to actually analyze alpha diversity. Faith's Phylogenetic Diversity, Heip's Evenness
and total number of OTUs.
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
3
Faith's Phylogenetic Diversity is based on the phylogenetic tree you generated in the first part of this activity. This method adds up
all the branch lengths in the tree as a measure of alpha diversity. Branch lengths are roughly analogous to the number of nucleotide
changes represented in the tree. So, if you add a new OTU to a dataset that is closely related to another OTU in the sample (very
few nucleotide changes between them), it will only cause a small increase in diversity. However, if you add a new OTU that comes
from a totally different lineage than anything else in the sample, it will cause a much larger increase in the diversity.
Heip's Evenness measures how close in numbers the OTUs an environment are. If a sample contains 100 sequences, and there are
roughly equal numbers of sequences from each OTU, then Heip’s evenness will be close to one. If, however, only one of the
sequences came from one OTU, and 99 came from another, then the score will be very close to zero.
Total number of OTUs is exactly what it sounds like. If there are more OTUs, then the samples will be considered to be more
diverse.
Alpha diversity is generated in four steps:
1.
Generating rarefaction tables. In this step, QIIME will subsample the original OTU table at a variety of specified sequence
depths, and will report the number of OTUs revealed at each depth. In the script we will use, the OTU will first be subsampled
10 times at a depth of 10 sequences/sample, then 10 times at a depth of 120 sequences/sample, and so on until the maximum
rarefaction depth is reached (we will stop at 1000 sequences/sample, although we could well choose to look at even more
sequences if our data indicates this would be beneficial). The step size is 110, which means that each sampling depth will be
increased by 110 until 1000 sequences/sample is reached. Since each subsampling depth will be repeated 10 times, a total of 100
subsampled OTU tables will be generated in our output (multiple_rarefactions) folder.
The script for this analysis is:
multiple_rarefactions.py -i output.biom -o multiple_rarefactions -m 10 -x 1000 -n 10 -s 110
2.
Next, the alpha diversity of the rarefied samples will be computed using the three different metrics.
alpha_diversity.py -i multiple_rarefactions/ -o alpha_diversity -m PD_whole_tree,observed_species,heip_e -t
filtered_alignments/rep_set1_aligned_pfiltered.tre
3.
At this stage there are still a ton of separate files, so those will need to be collated into a single file for graphing purposes. The
command is:
collate_alpha.py -i alpha_diversity/ -o collated_alpha_diversity
4.
Now we can plot the results, including the original mapping data so that we can see if alpha diversity varies with sample site,
chemical or temperature parameters.
make_rarefaction_plots.py -i collated_alpha_diversity/ -m mapping_centralia.txt -o rarefaction_plots
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
4
5.
Open the rarification_plots folder in your Centralia_Case_Study folder. Double click on rarefication_plots.html to open it. You
should see a window that looks like this:
From the “Select a Metric” drop-down, choose PD_whole_tree and
from the ‘Select a category” choose “Sample ID”. You will now see
Faith’s Phylogenetic Diversity rarefication curves for all three sample
sites. Which sample has the highest species richness? Do you believe
we have sufficiently sampled this location in order to see all of its
microbial diversity? Justify your answer.
6.
Change the metric to “Observed Species”. Does this metric support your conclusions in question 6? Explain.
7.
Finally, change the metric to “Heip’s”. Which microbial community shows the greatest evenness? How does species evenness
differ from species richness?
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
5
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
6
Download