Bioinformatic Analysis of Metagenomes

Name: _______________________________
Signature: __________________________________
Section #: _______________________________
12 • Bioinformatic Analysis of Metagenomes
INTRODUCTION
Summary
New sequencing technologies are capable of sequencing millions of base pairs of DNA, bringing us closer
than ever to understanding the total genetic potential of a microbial community. Unfortunately this vast
amount of sequence data presents a problem: How can you compare and analyze millions of individual
sequence reads to come up with useful conclusions and generalizations about the population you are
analyzing? It is not an easy undertaking. The problem is comparable to examining all of the stars in the
known universe and trying to come up with patterns and explanations for how things came to be as they
are. Enter bioinformatics: This high-tech, all-encompassing approach applies the awesome power of
information technology to solve practical biological problems that would otherwise be intractable.
Perched on the imaginary wall that separates biology from computer science, bioinformaticians get to see
the best of both fields and they get to use their computer programming expertise and raw sequence data to
answer pertinent scientific questions.
As mentioned during the last lab period, sequence-based metagenomics analysis encompasses the
sequencing of entire genomes, and the sequencing of 16S rDNA sequences to learn more about the
organisms that make up a particular community. Today we are primarily concerned with the latter. Last
time you analyzed the results of your 16S rDNA PCR reaction using gel electrophoresis. In principle you
could sequence the DNA you obtained and learn about the organisms that make up the community where
you acquired your sample. Unfortunately this is expensive and time-consuming. Instead, we’ll primarily
be examining metagenomics data obtained by scientists to investigate the bacteria found in the human gut
and in the Mississippi river. While you will be working on a computer it is important to remember that
you will be analyzing REAL data obtained from REAL samples by REAL scientists. Before we can
examine this data however, we must first learn a little bit about how organisms are classified.
For videos about cutting edge metagenomics research on human gut microbes see:
www.nature.com/nature/videoarchive/gutmicrobes/
PLEASE REVIEW THE “How organisms are classified” SECTION OF THE MICROBIAL
WORLD LAB. This information will be invaluable to understanding this lab.
Genomes, Species, and Operational Taxonomic Units (OTUs)
As stated above, organisms are often assigned to groups and species with the help of DNA
from their genome. It is worth asking what constitutes an organism’s genome. Defining a
genome is actually more complex than you might think. Scientists agree that a genome represents the
entirety of an organism’s hereditary information but it is unclear whether differences that arise between
individuals are variations of the same genome, or whether they represent separate and distinct genomes.
For example, many microbes (especially pathogens) can rearrange their genes in novel combinations.
This ability helps these microbes survive in diverse and varying environmental conditions. Are two of
these rearranged genomes still the same genome, or are they different? The matter is further complicated
by the fact that many microbes contain extrachromosomal plasmids which can readily be transferred to
other microbes, including different species. Horizontal gene transfer (or lateral gene transfer) is the
process through which organisms can acquire genes from sources other than descent, such as through viral
infection, or the uptake of DNA from the environment. Even Escherichia coli (E. coli), arguably one of
the most studied organisms on the planet, cannot be said to have a single ‘E. coli genome’. The genome of
the harmless K12 laboratory strain of E. coli differs from the pathogenic O157:H7 strain and other strains
by more than 25%! Large differences such as these are not uncommon among microbes so researchers are
beginning to reconsider how they think about genomes. Instead of focusing on a single invariant genome,
researchers now study the pangenome of a species. The pangenome is the full set of the genes present in
all strains of a species. It contains all of the genes shared by all members of the species (only 40% of E.
coli genes are shared between all the strains), as well as genes which are only found in a subset of strains [1].
Just as it is challenging to define an organism’s genome, it can also be difficult to assign an organism to a
particular species. On the surface a species seems like a simple concept. With larger organisms, a species
is generally a group of organisms which are similar in their appearance and habits and which interbreed
(if they reproduce sexually). Similarity between genomes is also used to identify and classify organisms
as belonging to a particular species. Human DNA and chimpanzee DNA are thought to be 95-98%
identical, but we are clearly different species. In contrast, microscopic organisms with genomes that vary
by 3-25% are often considered the same species [1]. Microbes whose genomes differ by 3% or less are
usually considered to be members of the same species [2], but this cutoff, while convenient, is rather
arbitrary. The question of whether organisms are considered to be the same species is not merely
academic, as it has profound ramifications for regulatory agencies and for medicine. What does it mean to
say that your beef must be free of E. coli if the E. coli genome is so variable? How would you know for
sure whether your beef was safe? Which drug do you use to treat a pathogen that is only 80% similar to
Staphylococcus aureus? Are some microbes used in industry and academia actually the same species as
the microbe that causes anthrax? How we name an organism does make a difference for public
perceptions and public policy decisions; however, the definition of a species is less important than the
ability to intelligently compare the degree of relatedness across groups to understand their evolutionary
history and to make meaningful generalizations.
One way to help determine which taxon your organism belongs to is to perform an rDNA analysis.
Scientists often perform PCR on an environmental sample to obtain DNA from the conserved and
variable regions of the ribosomal RNA genes. You did this a couple of weeks ago when you performed 16S rDNA
PCR on the DNA extracted from your microbe sample. Though it is often used in conjunction with
metagenomics projects, it should be noted that rDNA analysis by itself is not considered to be
metagenomics research because it only focuses on a single gene and not ‘the genome’. This technique can
tell you what types of organisms are out there, but it is not without its problems. Horizontal gene transfer
of rDNA genes from one species to another can make them seem much more related than they really are [3].
In addition, a single bacterium or eukaryote may harbor multiple different copies of the 16S or 18S rDNA,
respectively. It is known, for example, that bacteria usually have between 1 and 15 copies of 16S rDNA
genes [4,5] and eukaryotic copy number of 18S rDNA varies even more. For these reasons, many other
housekeeping genes including rpoB, amoA, pmoA, nirS, nirK, nosZ, and pufM have been suggested as
alternatives to 16S rDNA for classifying microorganisms [4–6].
Because the concept of a species is intricate, metagenomics studies use computer defined operational
taxonomic units (OTUs) and often focus on the most common phyla or genera in the sample. When
assigning an OTU to a particular taxon, you must select a cutoff value which represents how similar your
OTU has to be to a taxonomic group before it can be classified as belonging to that group. If your OTU is
100% identical to Euglenoids then you would feel pretty confident that your sequences came from this
group. What if your sequences were only 90% the same? 50%? The cutoff value can be difficult to select
but an 80% cutoff is commonly used. That means that if your OTU is more than 80% identical to a known
taxon, it will be considered a part of that group. If it is ≤ 80% similar to existing taxa, then it will not be
assigned to any group and will instead be considered unclassified.
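The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration and is not part of MOTHUR: the function name assign_taxon and the example identity values are made up for this sketch, and the 80% default simply mirrors the cutoff mentioned in the text.

```python
def assign_taxon(best_match, percent_identity, cutoff=80.0):
    """Return the best-matching taxon only if identity is strictly above the cutoff."""
    if percent_identity > cutoff:
        return best_match
    return "unclassified"


# Hypothetical OTUs compared against the same reference group.
print(assign_taxon("Euglenoids", 100.0))  # well above the cutoff
print(assign_taxon("Euglenoids", 90.0))   # still above the cutoff
print(assign_taxon("Euglenoids", 80.0))   # exactly at the cutoff, so not assigned
```

Raising the cutoff makes the assignment stricter, so more OTUs end up unclassified; lowering it does the opposite.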
Bioinformatics and MOTHUR
Put simply, bioinformatics is the application of computer science to biological data. Bioinformaticians are
scientists and programmers who create algorithms and applications to analyze complex biological data.
Bioinformatics programs are used in diverse applications and many are used for metagenomics. These
programs are the workhorses behind many of the tasks we’ve been discussing so far; they assemble
genomes from sequences, call and annotate genes, compare metagenomics data between different sample
sites, and analyze the diversity of microbes present in a particular community.
Extracting meaningful information from the millions of new genomic sequences presents a serious
challenge for bioinformaticians. Metagenomics sequence data tends to be noisy and partial as it comes
from heterogeneous communities of microorganisms which sometimes number in the tens of thousands [7].
In any given sample, DNA from bacteria, archaea, eukaryotes, and viral species may be present at
different levels of diversity and abundance. Bioinformaticians face many new and exciting challenges to
create innovative solutions for genome assembly, gene calling, and function prediction. Because
sequences obtained from environmental sources are fragmented, it is frequently difficult or impossible to
determine the species from which a specific sequence came.
Bioinformatics programs have been designed to perform a variety of important functions including digital
filtering, diversity analysis, and comparative metagenomics. The digital filtering performed by
bioinformatics programs is a lot like the physical filtering you did during your ‘Microbial World’ lab.
However, instead of physically removing undesirable microbes, the computer can inspect sequences and
remove ones from unwanted organisms. If you are interested in bacteria for example, you can filter out all
sequences which appear to be eukaryotic in origin [7]. While this is a powerful technique, it sometimes calls
sequences incorrectly so it is not a substitute for physical filtering. Bioinformatics programs can also
examine the diversity of microorganisms within and between samples. They can identify taxonomic
groups and species and can tell you their relative abundance in your sample. Finally, bioinformatics
programs have also been designed to perform comparative metagenomics to compare bacterial samples
taken at different times or places. Here, the GC content, microbial genome size, and the taxonomic and
functional content can be compared. You can also look for correlations between your metagenomics
results and environmental variables such as temperature.
Today we’ll use a program called MOTHUR to perform a number of functions on the provided data set [8].
First, we’ll use it to convert raw sequencing data into a form that can easily be used for further analysis.
Second, MOTHUR will identify the different taxonomic groups that are present in some human gut
samples. Finally, we’ll perform a simple comparative metagenomic analysis to compare the bacteria
present in different human gut and Mississippi river samples.
OBJECTIVES
1. Define ‘bioinformatics’ and explain how it can be used in metagenomics.
2. Describe how organisms are classified.
3. Explain the difference between an Operational Taxonomic Unit (OTU) and a species.
4. List steps that MOTHUR performs to process raw sequences to prepare them for analysis.
5. Make a hypothesis about how microbial populations in the human gut and Mississippi river will change
at different times or different locations.
6. Use MOTHUR to analyze data in the form of rarefaction curves, phylogenetic trees, Venn diagrams,
and histograms to test your hypothesis and make conclusions about microbes present in different sample
locations.
MATERIALS
Bioinformatics
Computer with Treeview X, SVGview, Strawberry Perl (for PC), and MOTHUR installed along with
human gut and Mississippi microbe data sets
Function-based Metagenomics
Mixed E. coli fosmid libraries on antibiotic plates after 3 days of growth
Ruler
Magic marker (for marking colonies)
Colony counter (optional)
Function-based Metagenomics
Examine the results of your functional selection for resistance to your chosen antibiotics
You may do this step right away, or you may save time by examining your plates while MOTHUR is
processing data.
1. Carefully examine your plates as a group of four. What do the controls tell you? How many
resistant clones were identified at each site?
                 Resistant Clones from      Resistant Clones from
                 Sample Site #____          Sample Site #____
Antibiotic 1:    _____________________      _____________________
Antibiotic 2:    _____________________      _____________________
2. What can you infer from a colony’s size and what does a colony’s distance from the antibiotic
disc tell you?
Hypothesis 1
Reasons for making this hypothesis
1. Did your data support your hypothesis about differences in the prevalence of antibiotic resistance
in microbes between sites? Why or why not?
2. If your results did not correspond with your predictions, propose an explanation for the observed
results.
3. If you had to do this experiment over again, what would you change? What new questions would
you ask?
Hypothesis 2
Reasons for making this hypothesis
1. Did your data support your hypothesis about differences in the prevalence of antibiotic resistance
in microbes between sites? Why or why not?
2. If your results did not correspond with your predictions, propose an explanation for the observed
results.
3. If you had to do this experiment over again, what would you change? What new questions would
you ask?
During today’s lab your TA will call a brief break and ask for a show of hands for how many groups found
data to support or refute their hypothesis. Your TA will also call on one or more groups to briefly present
their data. If you are called on, simply tell your class what your hypothesis was, what results you got,
and what you might do differently in the future if you were to redo this experiment.
Sequence Based Metagenomics-16S rRNA profiling
1. Exploring the Human Gut Metagenome using MOTHUR
Although MOTHUR is a powerful program, it does not have a user interface like you may be used to.
Instead, you must manually enter in commands or run a batch file. A batch file is simply a text file that
contains a list of instructions or commands that tell the MOTHUR program what to do. In order for
scientists to create batch files for their projects, they must first understand what commands they can give
to MOTHUR and what they do. Today you’ll be running a batch file and then following along with the
worksheets below to learn more about what MOTHUR is doing.
In this first exercise you’ll be working with data obtained from studying the human gut metagenome. The
data consists of DNA sequences of 16S rDNA PCR products similar to the 16S rDNA PCR products you
generated from your bacterial sample. Recall that at least ten trillion microbes live in the human gut and
these cells outnumber human cells ten to one [9]. If all of these microbes were placed on a scale, they would
weigh about 2 pounds! It is estimated that we each have about 1,000 species of microbes in our guts and
some can affect our health in important ways.
The data set you’ll be working with was generated to determine how eating a probiotic (a food which
contains living microorganisms) like yogurt might impact the composition of microorganisms living in
your gut. Yogurt contains Lactobacillus acidophilus, a bacterium commonly found in the human mouth
and gut. It ferments sugars into lactic acid and it belongs to the Firmicutes phylum. Stool samples were
collected from three different people before and after they started eating Lactobacillus acidophilus
containing yogurt. Samples A and B belong to one person, samples C and D belong to a second person,
and E and F belong to a third person. The first letter represents their baseline intestinal bacteria (e.g.
Sample A) while the second letter represents their bacteria content after consuming yogurt (e.g. Sample
B). Yogurt is known to increase the abundance of species of the Bacteroidetes phylum in the gut, and
increased amounts of these microbes are associated with slim people [10]. (This raises the question of
whether or not eating yogurt can change the composition of your gut bacteria in a way that helps
you lose weight. The jury is still out.)
To obtain the data within this file, scientists had to do a number of steps similar to the ones you’ve been
learning about in lab. First they had to physically filter their sample to maximize what they were
interested in studying: bacteria in the human gut. Next they extracted DNA from those bacteria. Third,
they performed 16S rDNA PCR reactions on each sample. The 16S rDNA PCR products were then
sequenced and used to generate an output file containing all the individual reads (or sequences) obtained
from the stool samples. This file, called stool.fasta, is the one you’ll be working with today. This data set
originally contained 0.1 million reads but it has been reduced to only 6,000 reads for this exercise so that
we can complete these operations within a single class period. By running MOTHUR functions on this
dataset you will be able to make conclusions about the relative levels of sampling completeness, sample
relatedness, and sample diversity. You will also be able to determine which taxa are present in the
samples.
To begin, run mothur_config, which is located in the dock. This will copy the MOTHUR files to the
user’s desktop. Wait 3-5 minutes for it to finish copying the data. DO NOT OPEN THE MOTHUR
FOLDER OR DO ANYTHING ELSE WHILE MOTHUR IS COPYING DATA. The terminal will then open
automatically. Type “cd Desktop/mothur” and press enter to select the proper directory. Then type
“./mothur gutbatchmac.txt” to have MOTHUR run the gutbatchmac.txt batch file. It will take about
5-10 minutes to run this batch file. This would be a good time to examine your antibiotic plates from
the previous lab. When MOTHUR has finished processing the batch file, open the mothur folder on the
desktop and look for the most recent log file. Its name will begin with ‘mothur.’ followed by a
ten-digit number (e.g. mothur.1331139964). Open this file by holding the ‘command key’, clicking on
the file, selecting open with, and then choosing ‘Text Edit’. This log file contains the complete
record of everything MOTHUR did when it was running the batch file. Use this file to follow along
with the exercise below.
[Figure: the Command key]
1. The “summary.seqs(fasta=stool.fasta)” command tells MOTHUR to open a file called stool.fasta and
to summarize this file for you. On the bottom, you can see it tells you that your file contains a total of
6,000 reads. The Start and End categories on the top signify where sequences begin and end. Right now
all sequences start at position 1 and end at some position downstream. The NBases category tells you the
number of bases within each sequence. Currently the NBases and the End categories should be the same
because all sequences start at position #1. The minimum, median, maximum, and percentile values are
shown to give you basic statistics about your sequences.
2. The trim function, “trim.seqs(fasta=current, oligos=oligos.txt, maxambig=0, maxhomop=8,
minlength=300, processors=2)”, removes reads and parts of sequences that we don’t want to look at,
including primer and barcode sequences. The minlength=300 parameter removes any sequences shorter
than 300 bases because they are considered to be errors.
The “summary.seqs() ” command gives you a new summary. Note that now there are only 5,809
sequences. The new minimum size is 300 (which is equal to our old minimum size without the primer
sequences).
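To get a feel for what read filters of this kind do, here is a minimal Python sketch. It is not MOTHUR's own code. It assumes that maxambig is the number of ambiguous bases allowed, maxhomop is the longest allowed run of a single base, and minlength is the shortest allowed read. The function names are invented for this example, and primer and barcode removal is left out.

```python
import re


def longest_homopolymer(seq):
    """Length of the longest run of one repeated character in seq."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def passes_filters(seq, maxambig=0, maxhomop=8, minlength=300):
    """True if the read is long enough and has few enough ambiguous bases and repeats."""
    seq = seq.upper()
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    return (
        ambiguous <= maxambig
        and longest_homopolymer(seq) <= maxhomop
        and len(seq) >= minlength
    )


good_read = "ACGT" * 75          # 300 bases, no long repeats, no ambiguous bases
short_read = "ACGT" * 74         # only 296 bases
print(passes_filters(good_read), passes_filters(short_read))
```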
3. The Unique function, “unique.seqs(fasta=current) ”, will remove extra copies of sequences generated
by PCR so that only unique sequences will be analyzed further. The “summary.seqs()” command is run
again to allow you to compare the total number of sequences before and after you ran unique.seqs.
How many sequences were removed by the unique.seqs function?___________
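The idea of collapsing identical reads can be illustrated with a short Python sketch. The five reads below are made up, and this is not how MOTHUR is implemented; it only shows that duplicates can be set aside while still remembering how many copies of each sequence were seen.

```python
from collections import Counter

# Hypothetical reads; three of them are identical copies.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]

counts = Counter(reads)        # sequence -> number of copies observed
unique_reads = list(counts)    # one representative of each distinct sequence

print(len(reads), "reads in total")
print(len(unique_reads), "unique sequences")
print(counts["ACGT"], "copies of ACGT")
```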
4. Remember that all of the sequences you have are just different versions of the 16S rRNA gene. The
align function, align.seqs(reference=silva.bacteria.fasta, processors=2), will align your sequences with
known 16S rDNA sequences from a variety of species so that you can start comparing the differences
between species.
Example of aligned sequence reads. Hyphens represent locations where some known species have extra
bases that aren’t present in the displayed sequences. THIS DATA IS NOT DISPLAYED BY MOTHUR,
but is shown to illustrate what MOTHUR is doing behind the scenes.
5. Because it is difficult to compare sequences that don’t overlap with each other, the Screen.seqs
function, screen.seqs(fasta=current, name=current, start=3103, end=7922, group=stool.groups,
processors=2), is used to eliminate sequences which don’t fully overlap over a specified range. Keep in
mind, screen.seqs is NOT truncating your sequences so that they exactly overlap, it is removing entire
sequence reads which don’t overlap well with the others.
This function will eliminate sequences that start after nucleotide 3103 and/or end before nucleotide 7922.
Only 5% of the total number of sequences will be removed because these values were the 97.5% Start and
2.5% End values (as shown on the summary).
The summary.seqs() command is run again to give you a new summary.
On the diagram above left, the box represents the desired range. Draw a larger box on this diagram to
represent a larger range. If you ran the Screen function with your new larger range, any sequence which
doesn’t span the entire length of this range would be removed.
If you used this larger box as your range [MORE or LESS](circle one) sequences would be removed
from the analysis compared to the previous range (nucleotides 3103-7922). Why do you think this is so?
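The screening rule can be written as a simple test on where each aligned read starts and ends. The Python sketch below is only an illustration: the function name spans_range and the three reads are hypothetical, and the default positions are the ones used in the batch file.

```python
def spans_range(read_start, read_end, start=3103, end=7922):
    """True if the read begins at or before `start` and finishes at or after `end`."""
    return read_start <= start and read_end >= end


# Hypothetical (start, end) alignment positions for three reads.
reads = {
    "read1": (3000, 8000),   # covers the whole range
    "read2": (3500, 8000),   # starts too late
    "read3": (3103, 7900),   # ends too early
}

kept = [name for name, (s, e) in reads.items() if spans_range(s, e)]
print(kept)
```

Note that whole reads are kept or discarded; nothing is trimmed at this step.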
6. The Filter.seqs command, filter.seqs(fasta=current,vertical=T,trump=., processors=2), does not
eliminate any sequences but instead truncates them so they all begin and end at the same position to make
it easier to compare them with summary.seqs().
How long is the filtered alignment?______________
Notice that most sequences now have the same beginning and end points.
7. To counteract random sequencing errors, which occur roughly once in every 100 base pairs, the pre.cluster
function, pre.cluster(fasta=current,name=current,diffs=1), combines sequences which are less than
1% different from each other. The summary.seqs() command is run so you can see how many sequences
were eliminated.
8. The “dist.seqs(fasta=current,cutoff=0.25, processors=2)” command calculates how similar each
sequence is to every other sequence. For example, if you compare two sequences, you might find they are
98% identical. Because there are lots of sequences, the software stores the results in a grid (a distance
matrix) with one row and one column for each sequence. This should
take about 16 seconds. The tables shown below WILL NOT BE DISPLAYED ON YOUR MOTHUR
SCREEN. They are meant to show you what the computer is doing during this step.
HOW DISTANCE IS CALCULATED
Pair A
Seq # 1:  A T G C C G T A G G
Seq # 2:  A A G C C T A G G G
These two sequences share 6/10 bases (60% similar) so the distance score is 0.400
Pair B
Seq # 1:  A T G C C G T A T G
Seq # 3:  A A C C G T A G G G
What is the distance between these two sequences?_________
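The Pair A calculation can be reproduced with a few lines of Python. This is a simplified sketch: it just counts mismatched positions in two equal-length sequences and ignores the alignment gaps that a real program has to deal with. The function name distance is made up for this example, and the Pair A sequences are read from the table as ATGCCGTAGG and AAGCCTAGGG.

```python
def distance(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Pair A from the worked example: 6 of 10 bases match, so 4 of 10 differ.
pair_a = distance("ATGCCGTAGG", "AAGCCTAGGG")
print(f"{pair_a:.3f}")
```

You can check your answer for Pair B by hand in the same way: count the positions that differ and divide by the sequence length.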
EXAMPLE OF A DISTANCE MATRIX
        Seq#1   Seq#2   Seq#3   Seq#4   Seq#5   Seq#6   Seq#7
Seq#1   0.000   0.900   0.800   0.450   0.320   0.950   0.010
Seq#2   0.900   0.000   0.550   0.230   0.030   0.001   0.001
Seq#3   0.800   0.550   0.000   0.001   0.220   0.670   0.530
Seq#4   0.450   0.230   0.001   0.000   0.030   0.001   0.001
Seq#5   0.320   0.030   0.220   0.030   0.000   0.780   0.970
Seq#6   0.950   0.001   0.670   0.001   0.780   0.000   0.880
Seq#7   0.010   0.001   0.530   0.001   0.970   0.880   0.000
Are sequences #1 and #7 more related than sequences #2 and #7? YES or NO (Circle One)
9. The “cluster(column=current,name=current)” function will cluster your reads into Operational
Taxonomic Units (OTUs) based on the dist.seqs results. Remember that OTUs are the rough equivalent of
a species or a genus. In reality, there could be multiple species within an OTU but all the sequences in the
OTU have a similar 16S rDNA sequence. Sequences which are more than 97% similar will be clustered
together into the same OTU.
EXAMPLE CLUSTER DIAGRAM
The cluster diagram represents what the computer is
doing. It WILL NOT be displayed by MOTHUR.
Each dot in this cluster diagram represents a single
sequence and lines represent the evolutionary distance
between them. Dots that are close to each other are part
of the same cluster or OTU.
Sequences connected by thick black lines are ≥97%
similar (distance is ≤ 0.03) while sequences connected by
thin dotted lines are < 97% similar (distance is > 0.03).
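One simple way to turn pairwise distances into clusters is sketched below in Python. It is not MOTHUR's algorithm: the ".an." in this lab's file names suggests MOTHUR is using an average neighbor method, while this sketch uses a simpler nearest-neighbor rule in which any two reads joined by a chain of distances ≤ 0.03 land in the same OTU. The read names and distances are invented for the example.

```python
def cluster_otus(names, distances, cutoff=0.03):
    """Group reads so that any pair with distance <= cutoff shares an OTU."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of this group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), d in distances.items():
        if d <= cutoff:
            parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical reads and pairwise distances.
names = ["r1", "r2", "r3", "r4"]
distances = {
    ("r1", "r2"): 0.01, ("r1", "r3"): 0.20, ("r1", "r4"): 0.25,
    ("r2", "r3"): 0.22, ("r2", "r4"): 0.24, ("r3", "r4"): 0.02,
}
otus = cluster_otus(names, distances)
print(otus)
```

Different clustering rules can give different OTUs from the same distance matrix, which is one reason OTUs are only a rough stand-in for species.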
10. The next two commands, “make.shared(list=current, group=current,label=0.03) ” and
“rarefaction.single(shared=current,freq=50) ”, will help make a rarefaction curve to give you an
estimate of your “sampling completeness”. It is essentially a graph of number of reads (AKA: PCR
product sequences) vs. number of OTUs. Eventually as you get more and more sequences, you stop
finding more species so the curve plateaus. Based on the slope of your rarefaction curve you can estimate
the completeness of your data. If the slope is steep, you need more reads and more sample to represent the
microbial community at the specified sample site. If the slope is flat, as it gets when it plateaus, then
you’ve sampled nearly all of the OTUs living in that type of environment and more sampling is unlikely
to get you any new OTUs. You can also use rarefaction curves to estimate diversity. In general, steeper slopes
correlate with higher diversity.
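The logic behind a rarefaction curve can be sketched in Python: repeatedly draw random subsets of reads of increasing size and count how many different OTUs turn up on average. This is a simplified illustration rather than MOTHUR's implementation. The community below is invented, and the step of 50 reads is chosen only to echo the freq=50 setting in the batch file.

```python
import random


def rarefaction_curve(otu_labels, step=50, trials=100, seed=0):
    """Average number of distinct OTUs seen in random subsamples of growing size."""
    rng = random.Random(seed)
    curve = []
    for size in range(step, len(otu_labels) + 1, step):
        total = 0
        for _ in range(trials):
            # Sample reads without replacement and count the OTUs observed.
            total += len(set(rng.sample(otu_labels, size)))
        curve.append((size, total / trials))
    return curve


# Hypothetical sample: each entry is the OTU that one read was assigned to.
labels = ["otu1"] * 100 + ["otu2"] * 50 + [f"rare{i}" for i in range(50)]
curve = rarefaction_curve(labels)
for size, mean_otus in curve:
    print(size, round(mean_otus, 1))
```

In this made-up sample the curve keeps climbing because of the many rare OTUs, which is the pattern you would expect from an undersampled, diverse community.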
system(perl ./rarefaction.pl stool.trim.unique.good.filter.precluster.an.groups.rarefaction)
system(perl ./result.pl gut)
The above commands will transfer the rarefaction data onto an Excel spreadsheet so that you’ll be able to
graph it. If you open up the MOTHUR/gut results/taxa_rarefaction folder you’ll find you now have an
Excel file called rarefaction. The number of reads will be shown on the far left in column A and the
number of OTUs for each sample will be on the right in columns B-G.
It should look like this:
Select all the data up to 900 reads then insert a line graph. Do not include data after this point as it is less
complete.
You should now see a rarefaction curve that looks something like this:
Like all the rarefaction curves we’ve seen, the y-axis is the number of OTUs and the x-axis is the number
of sequences. Answer the following questions with the aid of the graph on the previous page.
Approximately how many reads do you think you would need to get a good representation of all the
species in sample C? (Hint: You may have to estimate)______________________
Which sample plateaus the soonest? ______________________________________
Do different people have equally diverse gut microbiota before yogurt was consumed (Samples A,
C, & E)? YES or NO
Why do you think this might be so?
Which sample most likely contains the greatest number of microbial species and is therefore the
most diverse? ______________
Did the diversity change after eating yogurt? If so how? (Hint: recall that A+B belong to person #1,
samples C+D belong to person #2, and samples E+F belong to person #3. Samples A, C, and E were
taken before yogurt consumption and samples B, D, and F were taken after.)
11. The classify.seqs function, classify.seqs(fasta=current,template=trainset6_032010.rdp.fasta,
taxonomy=trainset6_032010.rdp.tax, iters=1000,probs=F), will assign each pyrosequence read a
taxonomic name. The computer will give you its best guess as to what family your sequence belongs to
and it also sometimes provides information about the genus and species as well.
The system(perl ./unifrac.pl) and system(perl ./result.pl gut) commands will output the data into your
results folder.
The ‘Phylum’, ‘Class’, ‘Order’, ‘Family’, ‘Genus’, ‘Species’ and ‘taxa’ files will be in the
taxa_rarefaction folder. Open the taxa file. The taxa file shows the OTU name followed by its
classification at each of the different taxonomic levels (e.g. Domain, kingdom, phylum, class, order,
family, genus, species). Because most species in the environment (and the intestine) are unknown, most
of the OTUs only get classified to the level of Family and they can’t be assigned a species name because
they are different than known species.
Open the Phylum file. On the left you’ll see a list of the phyla present in the stool samples. Each
sample is listed across the top. The numbers below represent the number of sequence reads within each
sample that belong to a particular phylum. For example, sample B contains about 246 Firmicutes reads
(although this number may vary). Manually select the entire chart (except for the top empty row), and
click on ‘Charts’ on the top menu bar. Select Column, 100% stacked and it should make a graph. On the
menu bar click a button called ‘switch plot’ to switch the x and y axes. Each bar tells you the proportion
of each microbe present in each sample. For example, Firmicutes make up about 36% of all microbes in
sample A. Note: You have to make the graph larger to see the full key.
If you were successful, you should get a figure that looks something like this example:
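[Figure: example 100% stacked column chart showing the proportion of each phylum in each sample; the
y-axis runs from 0% to 100%. Legend: Acidobacteria, Planctomycetes, Chloroflexi, Synergistetes,
Actinobacteria, Bacteroidetes, Tenericutes, Lentisphaerae, Proteobacteria, Firmicutes]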
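A 100% stacked chart simply converts the read counts for each sample into percentages of that sample's total. The Python sketch below shows the arithmetic. The counts are invented for illustration and are not taken from the lab data set.

```python
def relative_abundance(counts):
    """Convert read counts per phylum into percentages of the sample total."""
    total = sum(counts.values())
    return {phylum: 100.0 * n / total for phylum, n in counts.items()}


# Hypothetical read counts for one sample.
sample = {"Firmicutes": 90, "Bacteroidetes": 120, "Proteobacteria": 40}
percentages = relative_abundance(sample)
for phylum, pct in percentages.items():
    print(f"{phylum}: {pct:.1f}%")
```

Because every sample is rescaled to 100%, samples with different numbers of reads can be compared side by side.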
Save this file as phylum2 and keep it open for comparison with Mississippi River samples later on.
How many reads belong to the Firmicutes phylum in Sample A? __________
What are the three most common phyla in Sample A? __________________________________
(Hint: Move mouse cursor over the bars to identify the taxon.)
Approximately what percentage of the organisms in Sample D belong to Phylum Proteobacteria?
__________
The hypothesis was that Bacteroidetes would increase after eating yogurt. What happens to the
relative amount of Bacteroidetes after these three people ate yogurt?
What can you conclude?
How might you do the experiment differently to be more confident of your results?
12. Now the tree command, “tree.shared(calc=thetayc) ”, will tell MOTHUR to make a phylogenetic
tree and “system(perl ./result.pl gut) ” will make a file that will let you see the results. The results files
will be placed in a folder called tree within the gut_results folder.
Click on the stool.trim.unique.good.filter.precluster.an.thetayc.0.03 file to open it in Treeview X in
order to view the tree.
It should look like this:
This tree is similar to a phylogenetic tree of individual species but you should keep in mind that
many species are present within each sample. Samples which are closer together have more OTUs and
therefore more species in common. When creating the trees, the tree.shared function ignores OTUs which
are only found in one group (e.g. points with no overlap on the Venn diagram).
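The calc=thetayc setting refers to the Yue and Clayton theta measure of community similarity. As a rough illustration, and assuming the usual definition (theta = Σ a·b / (Σ a² + Σ b² − Σ a·b), where a and b are the relative abundances of each OTU in the two samples), the distance between two samples can be sketched in Python as 1 − theta. The function name and the OTU counts below are hypothetical.

```python
def theta_yc_distance(counts_a, counts_b):
    """1 - theta_YC for two samples given read counts for the same list of OTUs."""
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    a = [x / total_a for x in counts_a]   # relative abundances in sample A
    b = [y / total_b for y in counts_b]   # relative abundances in sample B
    shared = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - shared
    return 1.0 - shared / denominator


# Hypothetical read counts for four OTUs in three samples.
sample_1 = [50, 30, 20, 0]
sample_2 = [50, 30, 20, 0]   # identical community structure
sample_3 = [0, 0, 0, 100]    # shares no OTUs with sample_1

print(theta_yc_distance(sample_1, sample_2))
print(theta_yc_distance(sample_1, sample_3))
```

Samples with the same OTUs in the same proportions get a distance near 0, and samples with no OTUs in common get a distance of 1.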
In general, what types of samples are the most similar, samples obtained from the same person, or
samples obtained from different people either before or after eating yogurt?
Which two individuals have the most similar set of gut microbes? ________________________
(Hint: recall samples A+B belong to person #1, samples C+D belong to person #2, and samples E+F
belong to person #3)
13. Next we will examine species that are shared between samples by making Venn diagrams using the
following two commands.
venn(nseqs=T,permute=t)
system(perl ./result.pl gut)
Venn diagrams consist of overlapping circles of different colors. Open the MOTHUR > gut_results > Venn
folder to find the correct files; you will find several files inside. MOTHUR can only make a Venn
diagram of four samples at once. Because we have six groups, MOTHUR has to make 15 different Venn
diagrams to compare these groups.
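The figure of 15 is the number of ways to choose four samples out of six. A short sketch of that count follows; the sample names sampleE and sampleF are assumed by analogy with the file name pattern and may not match the actual labels.

```python
from itertools import combinations

# Six gut samples; names assumed to follow the pattern used in the file names.
samples = ["sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF"]

# Every way of choosing 4 samples out of 6, ignoring order.
four_way_sets = list(combinations(samples, 4))

print(len(four_way_sets))  # prints 15
for group in four_way_sets:
    print("-".join(group))
```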
Open the file called: stool.trim.unique.good.filter.precluster.an.0.03.sharedsobs.sampleA-sampleB-sampleC-sampleD
It should look something like this:
The numbers in each area of the Venn diagram represent the number of OTUs in that category.
Overlapping regions tell you the number of OTUs shared between samples. Regions which do not overlap
represent the number of OTUs which were only found in one group.
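The regions of a two-circle Venn diagram correspond to simple set operations. A minimal sketch, using hypothetical OTU labels invented for illustration (they are not taken from the lab data):

```python
# Hypothetical OTU labels, invented for illustration only.
otus_in_a = {"Otu01", "Otu02", "Otu03", "Otu05"}
otus_in_b = {"Otu02", "Otu03", "Otu04"}

shared = otus_in_a & otus_in_b  # overlapping region: OTUs found in both samples
only_a = otus_in_a - otus_in_b  # non-overlapping part of circle A
only_b = otus_in_b - otus_in_a  # non-overlapping part of circle B

print(len(shared), len(only_a), len(only_b))
```

The three printed numbers are the counts that would be written in the overlap and in each non-overlapping region of the diagram.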
How many OTUs are shared between Sample A and Sample B? _____
How many OTUs overlap between Sample A and Sample C? _______
Open the other Venn diagram files and determine how many OTUs are shared by Sample A and
Sample F. __________________.
2. Analyze the Mississippi River Metagenome Using MOTHUR
The same type of analysis can also be performed on metagenomics data from the Mississippi River. A
data set has been prepared which contains sequence reads from ten different Mississippi River locations
within Minnesota. To save time and to focus on data analysis, the Mississippi River Results folder has
already been prepared for you. You won’t need to run any more MOTHUR commands.
Come up with a hypothesis BEFORE lab. Prior to coming to lab, develop a hypothesis about how
land use patterns or a physical property of the water could influence microbial populations in the
Mississippi River. Indicate how your chosen factor will influence overall species diversity, the relative
amounts of specific microbial taxa, or the degree of similarity between microbial populations at different
sites. Also, explain a reason why you think this might be so. Your hypothesis must be testable with the
data you have available. For example, you might propose that a certain group of bacteria might be more
likely to be present when iron concentrations are high. Alternatively you might propose that river sites
which are similar in location or properties may have similar bacterial populations. Finally, you could
propose that certain physical factors in the water might alter overall bacterial diversity between sites.
Based on your hypothesis, predict which sites might have similar microbial populations. Ambitious
students may choose to test more than one hypothesis.
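One way to check a hypothesis of the form "taxon X is more abundant where factor Y is high" is to compute a correlation across sites. The sketch below is only an illustration: the function name is hypothetical, and the iron concentrations and taxon percentages are invented values, not data from the Moodle spreadsheet or the river results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five sites, invented for illustration only.
iron = [0.1, 0.3, 0.2, 0.8, 0.6]           # iron concentration at each site
taxon_percent = [2.0, 4.5, 3.0, 9.0, 7.5]  # relative abundance of one taxon (%)

r = pearson_r(iron, taxon_percent)
print(f"r = {r:.2f}")
```

A value of r near +1 or -1 would be consistent with the hypothesis, while a value near 0 would not. With only ten river sites, treat any correlation as suggestive rather than conclusive.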
Experiment: After learning the basics of the MOTHUR bioinformatics program, you’ll analyze
MOTHUR results for the 16S rDNA sequencing data from the Mississippi River to test your hypothesis.
Experimental Outcome: The data you examine should support or refute your hypothesis. It will also
show you which sites along the Mississippi River have similar microbial populations.
Resources to use to form a hypothesis:
A. Table 1: A list of bioinformatics functions, the output they provide, and what they can tell you.
B. Figure 1: Map of Minnesota Mississippi Metagenomics Project (M3P) Sampling Sites.
C. Figure 2: Maps of land use patterns at each site.
D. Honors Moodle Site: An Excel spreadsheet of physical factors (including nutrients, metals, and
pollutants) at different sampling sites on the Mississippi.
Note: Online you can find information about the nutrient and growth requirements of selected microbial
groups. Just Google one you are interested in; for example, you can type in ‘iron loving bacteria’ and
see what comes up.
A. Table 1. MOTHUR Bioinformatics Functions
Before lab, choose one or more bioinformatics functions you could use to test your hypothesis.
Function | Output | What it tells you
rarefaction.single() | Rarefaction curve | Shows if representative samples were obtained at a given site and suggests diversity differences between sites.
tree.shared(calc=thetayc) | Tree diagram | Shows the overall degree of genetic relatedness between sites. Sites comprised of similar species are more related to each other.
venn() | Venn diagram | Identifies unique and shared taxa present at each site. Shows how many bacterial taxa are shared between sites.
classify.seqs() | Spreadsheet of taxa | Shows what taxa are present in your sample.
 | 100% stacked bar graph | Relative proportion of different taxa at each sample site.
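To illustrate what a rarefaction curve measures, here is a minimal sketch that counts how many distinct OTUs have been seen as more and more reads are sampled. It uses a single random ordering of the reads, whereas dedicated tools typically average over many orderings, so it is not a reimplementation of rarefaction.single(). The function name and OTU counts are hypothetical.

```python
import random


def rarefaction_curve(otu_counts, step, seed=0):
    """Return (reads sampled, OTUs observed) pairs for one random read order."""
    # Expand the counts into one OTU label per read, then shuffle the reads.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    random.Random(seed).shuffle(reads)
    seen = set()
    curve = []
    for i, otu in enumerate(reads, start=1):
        seen.add(otu)
        # Record a point every `step` reads and at the final read.
        if i % step == 0 or i == len(reads):
            curve.append((i, len(seen)))
    return curve


# Hypothetical read counts per OTU at one site, invented for illustration only.
site_counts = {"Otu01": 60, "Otu02": 25, "Otu03": 10, "Otu04": 4, "Otu05": 1}

curve = rarefaction_curve(site_counts, step=20)
for n_reads, n_otus in curve:
    print(n_reads, n_otus)
```

A curve that flattens out suggests that further sequencing would uncover few new OTUs; a curve that is still climbing suggests the site has not been sampled deeply enough.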
B. Figure 1. Map of M3P Project Sampling Sites.
The metagenomics data obtained will be used to educate the public, and help guide regulations and
policies to protect this important resource.
C. Figure 2. Land Use Data At Different M3P Sample Sites
D. An Excel Spreadsheet of Physical Factors (see the Biol 1009 honors Moodle site)
With this Excel spreadsheet, you can determine at which Minnesota Mississippi Metagenomics Project
sampling site(s) nutrients, metals and pollutants can be found, as well as their respective concentrations.
Using the resources provided, develop your own hypothesis for how land use and physical data (the
presence of nutrients, metals and pollutants) affect the metagenomic data (i.e. the overall level of
diversity, similarity and abundance of taxa between sites).
For example:
1. Humans release toxic chemicals into the river, which kill off microbes.
2. Therefore, river sites near human settlements (e.g. site #4) will have reduced bacterial diversity
compared to a site with fewer human settlements (e.g. site #1).
Use the prepared MOTHUR results files to test your hypothesis about how physical factors can affect the
diversity or abundance of Mississippi River microbes.
Hypothesis:
Reasons for making this hypothesis:
Results examined:
Do these results support your hypothesis? If not, propose a possible explanation for the observed results:
What additional information might you wish to know if you were going to repeat this experiment?
Additional Questions about Mississippi River Samples:
1. Open the rarefaction file in the Mothur\river results\taxa_rarefaction folder and use the data to
construct a rarefaction curve as you did with the gut samples above. Select all the data (including the
labels like “WS5”) using control + ‘a’ and go to insert chart. Select ‘line’ from the chart types then select
the first line graph type available (also called ‘line’).
The _______ axis shows the number of OTUs while the ________ axis shows the number of
sequence reads.
Which sample appears to be the most diverse based on the rarefaction curves?__________
Which is the least diverse?__________
Is more sampling needed to obtain representative microbes? Why or why not?
2. Examine the tree file found in mothur/river results/tre
Which water samples have similar populations of microbes?______________________________
_________________________________________________________________________________
3. Open the phylum file in the taxa_rarefaction folder and construct a stacked chart as before.
What are the most common phyla?
Name three phyla in the river sample that were also present in the gut sample (Do not include
“Unclassified”).
1. _______________________________
2. _______________________________
3. _______________________________
References
1. Sciences, N.A. of & NRC The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. Design 171 (National Academies Press: Washington, DC, 2007). at <http://www.nap.edu/catalog.php?record_id=11902>
2. Gevers, D. et al. Opinion: Re-evaluating prokaryotic species. Nature Reviews Microbiology 3, 733-9 (2005).
3. Schouls, L.M., Schot, C.S. & Jacobs, J.A. Horizontal transfer of segments of the 16S rRNA genes between species of the Streptococcus anginosus group. J Bacteriol 185, 7241-6 (2003).
4. Case, R.J. et al. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol 73, 278-88 (2007).
5. DeSantis, T.Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72, 5069-72 (2006).
6. Klappenbach, J.A., Saxman, P.R., Cole, J.R. & Schmidt, T.M. rrndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Res 29, 181-4 (2001).
7. Wooley, J.C., Godzik, A. & Friedberg, I. A primer on metagenomics. PLoS Comput Biol 6, e1000667 (2010).
8. Schloss, P.D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75, 7537-7541 (2009).
9. Mullard, A. Microbiology: the inside story. Nature 453, 578-80 (2008).
10. Turnbaugh, P.J. et al. A core gut microbiome in obese and lean twins. Nature 457, 480-4 (2009).
Notes to Lab Coordinators/TAs
1. Ensure that MOTHUR and the Treeview software have been installed.
2. Every time students run MOTHUR it will create a number of files. When students in the next lab period
run it, the old files will be overwritten. There is nothing you have to do to manage these files.
3. Students should come up with a hypothesis before they come to lab and test that hypothesis using the
data in the River results folder near the end of the lab.
4. In the event of an unrecoverable error, all the results can be generated by activating the batch file from
the command prompt.
5. Be sure to allow students to briefly present their findings to the group. Also, take note of any intriguing
results so they can be re-examined by a M3P project scientist.