ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2015
UNIVERSITY OF KENTUCKY AGTC
Class 9
Analyzing Metagenomic Data with QIIME
Metagenomics
OBJECTIVE: We will use the open source software package, Quantitative Insights Into Microbial
Ecology (QIIME, pronounced ‘chime’) to process and analyze 16S ribosomal RNA (16S rRNA)
sequence reads. Specific aims will be to:
1. Check the format of a sample mapping file and preprocess the sequence reads
2. Filter sequencing reads based on quality
3. Group reads into Operational Taxonomic Units (OTUs), which represent groups of closely
related species
4. Compare species (OTU) abundance between 16S rRNA datasets acquired from different
experimental conditions
QIIME is a popular software pipeline that handles metagenomic data analysis all the way from raw data
input, through sequence analysis, to the deposition of data into databases. Rather than being a single
program (like TopHat or Cufflinks), QIIME consists of a number of scripts written in the Python
programming language. Some of these handle a single task, while others wrap other Python scripts
or other software tools (for example BLAST). This allows multiple scripts to be combined
into an analysis pipeline that can be invoked with a single command. The options/switches that control
the behavior of each script (methods used, input/output file paths, etc.) can be set in a configuration file,
which controls the default behavior of the program.
Caporaso et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature
Methods 7:335-336.
http://qiime.org
QIIME requires the following input files:
Essentials of Next Generation Sequencing 2015
Page 1 of 10

Sequence data in FASTA plus QUAL format (or FASTA only) or, for 454 data, sff.txt files if they are available

A “mapping” file. This is a tab-delimited text file that describes each sample in the data set,
including the sample ID, the sample barcode sequence, the primer sequences for clipping, and any other
information about the project, such as grouping categories and descriptions

A BLAST database with taxonomic information. For 16S rRNA studies, the database is
named “Greengenes,” and it can be downloaded along with the QIIME software from
http://qiime.wordpress.com/2013/05/20/greengenes-13_5/
The following exercises will present workflows for a typical metagenomic analysis of 454 sequences and
will closely follow the tutorial presented on the QIIME website (http://www.qiime.org).
9.1 Validate the mapping file
The mapping file is essential for successful deconvolution of the sequence data. At a minimum, it should
contain the name of each sample, the barcode sequence used, the linker/primer sequence used to amplify
the sample, and a description column. It may also contain additional metadata relating to the
samples. The mapping file is tab-delimited, and sample names may contain only alphanumeric characters and
‘.’ (periods). Empty lines are ignored, as are comment lines starting with #. The category represented in
each tab-delimited column is described in the first row (which should begin with the ‘#’ character).
Example:
#SampleID  BarcodeSequence  LinkerPrimerSequence   Treatment  DOB       Description
PC.354     AGCACGAGCCTA     YATGCTGCCTCCCGTAGGAGT  Control    20061218  Control_mouse__I.D._354
PC.355     AACTCGTCGATG     YATGCTGCCTCCCGTAGGAGT  Control    20061218  Control_mouse__I.D._355
PC.356     ACAGACCACTCA     YATGCTGCCTCCCGTAGGAGT  Control    20061126  Control_mouse__I.D._356
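The formatting rules described above can be illustrated with a short Python sketch. This is not the logic of validate_mapping_file.py, which performs many more checks; the function name check_mapping and the error messages are hypothetical, and only the rules stated in this section are tested.

```python
import re

# Columns named in the text as the minimum; spellings follow the example table.
REQUIRED_COLUMNS = ["#SampleID", "BarcodeSequence",
                    "LinkerPrimerSequence", "Description"]
# Sample names may contain only alphanumeric characters and periods.
VALID_SAMPLE_ID = re.compile(r"[A-Za-z0-9.]+")


def check_mapping(text):
    """Return a list of problems found in tab-delimited mapping-file text."""
    rows = [line for line in text.splitlines() if line.strip()]  # skip empty lines
    if not rows:
        return ["mapping file is empty"]
    header = rows[0].split("\t")
    problems = ["missing column: " + c for c in REQUIRED_COLUMNS if c not in header]
    for line in rows[1:]:
        if line.startswith("#"):  # comment lines are ignored
            continue
        fields = line.split("\t")
        if not VALID_SAMPLE_ID.fullmatch(fields[0]):
            problems.append("invalid sample ID: " + fields[0])
        if len(fields) != len(header):
            problems.append("wrong number of fields for: " + fields[0])
    return problems


header = "\t".join(["#SampleID", "BarcodeSequence", "LinkerPrimerSequence",
                    "Treatment", "DOB", "Description"])
good_row = "\t".join(["PC.354", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT",
                      "Control", "20061218", "Control_mouse__I.D._354"])
bad_row = good_row.replace("PC.354", "PC_354")  # underscore is not allowed

print(check_mapping(header + "\n" + good_row))  # no problems expected
print(check_mapping(header + "\n" + bad_row))   # one invalid sample ID
```

For real data, rely on the validate_mapping_file.py command shown below; the sketch only makes the rules concrete.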
 Make sure you are in the qiime directory. All of the following commands assume
that you are in the qiime directory. Remember, you do not need to change
directories to list, copy or move files
 To get started with QIIME, you must first “source” its activation script to load settings into
your shell that allow QIIME to find the specific software versions it requires. If you log
out and log back in, you must run this command again!

source /opt/qiime_software/activate.sh
 Check the mapping file that has been provided for this exercise

validate_mapping_file.py -m Fasting_Map.txt -o mapping_output
 Hopefully, this will return a message that no errors were found
9.2 Use split_libraries.py to split the input files according to
barcode
split_libraries.py is a multi-functional script that identifies reads by their barcodes and renames them
(along with their corresponding quality scores) accordingly.
Reads without barcodes are rejected, as are those without the correct primer sequence. This script can
also perform quality trimming (or, if desired, outright sequence rejection) based on user-defined quality
standards. After this step each read is renamed to incorporate the sample identifier along with a unique
(consecutive) number within that sample.
 Make sure you are still in the qiime directory
 Run split_libraries.py on the example dataset:

split_libraries.py -m Fasting_Map.txt -f Fasting_Example.fna \
-q Fasting_Example.qual -o split_library_output
-o  name of output directory (this directory will be created when the script is run)
-f  specifies the name of the input multi-FASTA (.fna) file
-q  specifies the name of the input quality (.qual) file [optional]
-m  specifies the name of the mapping file. For a single, non-barcoded sample,
    split_libraries.py can be provided with a mapping file that has an empty
    BarcodeSequence field, together with the -b 0 option.
After this step, the original read names are removed and are replaced with the sampleID code and a
unique number.
 View the output documents histograms.txt and seqs.fna to check for proper performance of the
script
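The barcode-splitting logic described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the split_libraries.py implementation: it assumes exact barcode and primer matches (the real primer contains the degenerate base Y), ignores quality scores, and uses a hypothetical "SampleID_number" naming pattern with toy barcodes.

```python
def demultiplex(reads, barcode_to_sample, primer):
    """Assign (name, sequence) reads to samples; return renamed, trimmed reads."""
    per_sample = {}  # running read count for each sample
    kept = []
    for _name, seq in reads:
        for barcode, sample_id in barcode_to_sample.items():
            if seq.startswith(barcode):
                break
        else:
            continue  # no known barcode: reject the read
        rest = seq[len(barcode):]
        if not rest.startswith(primer):
            continue  # primer missing: reject the read
        per_sample[sample_id] = per_sample.get(sample_id, 0) + 1
        new_name = "{}_{}".format(sample_id, per_sample[sample_id])
        kept.append((new_name, rest[len(primer):]))
    return kept


# Toy data: 4-base barcodes and a 4-base primer, invented for illustration.
barcodes = {"AAGG": "S1", "CCTT": "S2"}
reads = [
    ("r1", "AAGGACGTTTTT"),
    ("r2", "CCTTACGTGGGG"),
    ("r3", "GGGGACGTAAAA"),  # unknown barcode
    ("r4", "AAGGTTTTACGT"),  # primer not directly after the barcode
    ("r5", "AAGGACGTCCCC"),
]
for name, seq in demultiplex(reads, barcodes, "ACGT"):
    print(name, seq)
```

Reads r3 and r4 are dropped, and the remaining reads carry the sample identifier and a consecutive number, which is the pattern to look for in seqs.fna.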
9.3 Pick Operational Taxonomic Units (OTUs)
All of the sequences from each mouse sample will be clustered into Operational Taxonomic Units
(OTUs) based on their sequence similarity. OTUs in QIIME are clusters of sequences, frequently
intended to represent some degree of taxonomic relatedness.
There are 3 main protocols for OTU picking:
a) pick_de_novo_otus.py: Reads are clustered against one another without any external reference
sequence collection. Useful for studying populations where there is poor characterization of
existing data. In addition to clustering, the script also performs taxonomy assignment, sequence
alignment, and tree-building steps.
b) pick_closed_reference_otus.py: Reads are clustered against a reference sequence collection and
any reads which do not hit a sequence in the reference sequence collection are excluded. Useful
when existing taxa are well characterized and one is not interested in novel ones. Taxonomic
assignments in the reference database are assigned to the OTUs.
c) pick_open_reference_otus.py: Reads are clustered against a reference sequence collection and
any reads which do not hit the reference sequence collection are clustered de novo. Applicable
when existing taxa are well characterized and one also wishes to discover novel ones.
All three scripts have similar usage and options:
Example usage (do not enter this command):
pick_open_reference_otus.py -r reference_taxa.fna \
-i split_library_output/seqs.fna -o otus
-r  file path to reference sequences (not required for pick_de_novo_otus.py)
-i  file path to input sequences (post-splitting if applicable)
-o  name of output directory (this directory will be created when the script is run)
 So let’s go ahead and run the pick_de_novo_otus.py script:
 Make sure you are still in the qiime directory

pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus
This will run a number of analyses, as follows:
a. Pick OTUs: all of the sequences from all of the samples will be clustered into Operational
Taxonomic Units (OTUs) based on the selected clustering method and a user-defined sequence
similarity threshold. Several clustering methods are already implemented in QIIME, including
BLAST, uclust and usearch.
b. Pick a representative sequence set, one sequence, most abundant, per OTU: This
representative sequence will be used for taxonomic identification of the OTU and for
phylogenetic alignment. The relevant file (seqs_rep_set.fasta) can be found in the
otus/rep_set/ directory, along with a log file.
c. Align the representative sequence set - necessary if phylogenetic tools such as UniFrac will
be used. Alignments can be generated de novo with programs such as MUSCLE, or to an
existing reference using tools like PyNAST. The alignment file is created in the directory
otus/pynast_aligned_seqs/, along with a log file.
d. Assign taxonomies to the representative sequences in the seqs_rep_set.fasta file. The
sequences are compared with established databases to define taxonomic identities. QIIME
uses the uclust classification system as default. Assigned taxonomies (including confidence
values) are written to a text file in the otus/uclust_assigned_taxonomy/ directory.
e. Filter the alignment prior to tree building to remove positions that are phylogenetically
uninformative. The filtered sequence file is created in the directory
otus/pynast_aligned_seqs/.
f. Build a phylogenetic tree. Tree building is necessary for UniFrac diversity measurements and
other phylogeny based analyses. A Newick format tree is written to rep_set.tre, in the otus/
directory and can be viewed using tree visualization software (for example TreeVector
[Pethica, R. and Barker, G. and Kovacs, T. and Gough, J. (2010). "TreeVector: scalable,
interactive, phylogenetic trees for the web". PLoS ONE 5: e8934.
doi:10.1371/journal.pone.0008934]).
g. Build an OTU table: Using the OTU map (step a) and taxonomic assignments (step d),
QIIME assembles a matrix of OTU abundance in each sample with meaningful taxonomic
identifiers for each OTU. The results are written to a file (otu_table.biom) in the otus/
directory.
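To make the last step concrete, here is a minimal Python sketch of how an OTU map can be turned into an abundance matrix. It assumes that each read name has the form "SampleID_number" and that the OTU map is available as a dictionary; both the data structure and the function name build_otu_table are illustrative, not QIIME's own.

```python
from collections import Counter


def build_otu_table(otu_map):
    """Count, for every OTU, how many of its reads come from each sample."""
    table = {}
    for otu_id, read_ids in otu_map.items():
        # Strip the trailing "_number" to recover the sample identifier.
        table[otu_id] = Counter(r.rsplit("_", 1)[0] for r in read_ids)
    return table


# Toy OTU map: OTU identifier -> reads clustered into that OTU.
otu_map = {
    "otu0": ["PC.354_1", "PC.354_2", "PC.355_1"],
    "otu1": ["PC.355_2"],
}
table = build_otu_table(otu_map)
for otu_id in sorted(table):
    print(otu_id, dict(table[otu_id]))
```

Each row of the resulting table is an OTU and each column a sample, which is the same information that otu_table.biom stores (together with the taxonomy for each OTU).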
 Let’s examine the results of the pick_de_novo_otus.py script. First look at the otu_table.biom
file in the otus directory just created by the previous script.
Kind of hard to interpret, isn’t it?
 We can use another utility, biom summarize-table, to obtain a very brief summary, which
lists the number of sequences (counts) assigned to each sample (Counts/sample detail):

biom summarize-table -i otus/otu_table.biom -o table_summary.txt
 Examine the table_summary.txt output file
 Now let’s use summarize_taxa_through_plots.py to group OTUs at different taxonomic
levels (phylum, class, order, etc.) and plot the results:

summarize_taxa_through_plots.py -i otus/otu_table.biom \
-o wf_taxa_summary -m Fasting_Map.txt
This will generate a new directory, wf_taxa_summary, in which information on the proportional
representation of taxonomic groups within each sample is provided. The taxonomic level that is
summarized is designated with the -L option, and its meaning depends on the format of the taxonomy
strings: for RDP classifier taxonomy, Level 2=Domain, 3=Phylum, 4=Class, 5=Order, 6=Family and
7=Genus, whereas for Greengenes-style taxonomy (as used in section 9.7) Level 2=Phylum, 3=Class
and 4=Order. Taxa summary tables will be output in both classic (tab-separated) and BIOM formats.
In addition, we can group samples using the -c option to summarize
taxa by group of samples (for example -c Treatment will provide taxa summaries for fasting vs.
standard diet). The BIOM-formatted taxa summary tables can be used as input to other QIIME
scripts that accept BIOM files.
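Summarizing at a taxonomic level amounts to truncating each OTU's taxonomy string and adding up the counts that share the same prefix. The Python sketch below illustrates this for one sample; it assumes Greengenes-style strings (ranks separated by semicolons, starting at kingdom), and the function name and toy counts are invented for illustration.

```python
def summarize_taxa(counts, taxonomy, level):
    """Collapse OTU counts for one sample to relative abundance per taxon."""
    total = sum(counts.values())
    summed = {}
    for otu_id, n in counts.items():
        ranks = [r.strip() for r in taxonomy[otu_id].split(";")]
        taxon = ";".join(ranks[:level])  # keep the first `level` ranks
        summed[taxon] = summed.get(taxon, 0) + n
    return {taxon: n / total for taxon, n in summed.items()}


# Toy data for a single sample.
counts = {"otu0": 6, "otu1": 2, "otu2": 2}
taxonomy = {
    "otu0": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "otu1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "otu2": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
}
level2 = summarize_taxa(counts, taxonomy, 2)
for taxon in sorted(level2):
    print(taxon, level2[taxon])
```

With these strings, level 2 merges the two Firmicutes OTUs into one phylum-level entry, while level 3 would keep the classes separate.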
 To view the resulting charts, use SCP/WinSCP to transfer the taxa_summary_plots folder to
the desktop of the Mac/PC and open the bar_charts.html file (note: on the Mac, you will need
to use the -r flag to transfer the whole folder):

scp -r taxa_summary_plots user.name@csurs11.csr.uky.edu:
-r  recursively copy the entire directory
9.4 Calculate Alpha Diversity
R. H. Whittaker defined three levels of species diversity: alpha diversity corresponds to species diversity
within sites/habitats at a local scale; beta diversity comprises the difference in species diversity among
sites/habitats; and gamma diversity represents the diversity across an entire landscape. Conveniently, the
QIIME package contains tools for estimating alpha and beta diversities – the components that together
determine gamma diversity.
To start, we will look at alpha diversity - a measure of the bacterial species diversity within an individual
experimental sample – in this case, mice that have been fed a specific diet.
First we must consider the fact that the number of sequences in each dataset will affect the estimated
species diversity (fewer sequences → lower species diversity). Consequently, we must normalize the
datasets to account for the uneven sequencing depth, using a process known as rarefaction. In QIIME,
this task is performed by the multiple_rarefactions.py script, which takes the OTU table as input and
creates a folder containing many new tables representing random subsamples of the original OTU table.
Usage:
multiple_rarefactions.py [options] -i path/to/OTU_table -o output_directory \
-m min_seqs_per_sample -x max_seqs_per_sample -s step_size
-i  path to OTU table
-o  output directory name
-m  smallest number of sequences per sample in the rarefied tables
-x  largest number of sequences per sample in the rarefied tables
-s  step size between -m and -x
-n  number of rarefied tables to generate at each depth
 Rarefy the OTU table produced by the pick_de_novo_otus.py script.

multiple_rarefactions.py -i otus/otu_table.biom -m 20 -x 100 \
-s 20 -n 10 -o rarefied_20-100
Running the above script creates OTU tables for subsets containing 20, 40, 60, 80 and 100 randomly
selected sequences per sample (10 tables per depth), and saves them in a folder named rarefied_20-100.
These new tables will be used in the alpha diversity calculation. When selecting sample sizes for
rarefaction we need to consider the size of each sample and our computing power. In our example each
sample consists of about 150 reads, so choosing 100 reads will yield a fairly random selection.
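Rarefaction itself is simply random subsampling without replacement. The following Python sketch shows the idea for a single sample; it is an illustration under simple assumptions (counts held in a dictionary, a seeded random generator for repeatability) and not the code used by multiple_rarefactions.py. The toy counts are invented.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Randomly draw `depth` sequences (without replacement) from one sample."""
    pool = [otu_id for otu_id, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer sequences than the requested depth")
    return Counter(rng.sample(pool, depth))


# One toy sample with 150 sequences spread over three OTUs.
sample = {"otu0": 100, "otu1": 40, "otu2": 10}
rng = random.Random(0)  # fixed seed so the example is repeatable

# Same depths as -m 20 -x 100 -s 20 in the command above.
depths = list(range(20, 101, 20))
for depth in depths:
    subsample = rarefy(sample, depth, rng)
    print(depth, dict(subsample))
```

Rare OTUs such as otu2 often disappear at shallow depths, which is why diversity estimates depend on the number of sequences sampled.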
There are many methods for calculating alpha diversity, so it's important to think about what is most
meaningful for your experiment, and your biological question. The QIIME script for calculating
alpha diversity in samples is called alpha_diversity.py.
Usage: alpha_diversity.py [options] -i path/to/OTU_directory
-o output_directory
 Calculate the alpha diversity using as inputs the rarefied tables produced in the previous step:

alpha_diversity.py -i rarefied_20-100 \
-o alpha_diversity_results -t otus/rep_set.tre
-t, --tree_path  file path to Newick tree; this is required when using
                 phylogenetic metrics to calculate alpha diversity
The results are written to the alpha_diversity_results directory but are not very easy to interpret because
an output file is produced for each of the rarefied tables.
 List the contents of the alpha_diversity_results directory to see for yourself
 Take a look at one or two files. Each file lists the three calculated diversity measures in the
different samplings
Interpretation of the results is best accomplished by using the collate_alpha.py script and plotting the
results:
 Collate the alpha diversity calculations into a visualizable format

collate_alpha.py -i alpha_diversity_results -o alpha_div_collated
The results are organized to facilitate spreadsheet analysis with MS Office or Open Office, with one file
for each metric being considered. The data can also be graphed to show the diversity for each sample at
varying depths of rarefaction.
 Generate plots of the collated alpha diversity data:

make_rarefaction_plots.py -i alpha_div_collated -m Fasting_Map.txt \
-o rarefaction_plots
This will generate an .html file within a directory named rarefaction_plots. This directory will also
contain a number of subdirectories containing image files for displaying in the interactive webpage
specified by the html file.
 Use SCP/WinSCP to transfer the entire rarefaction_plots directory to the Mac/PC desktop
 Open the directory and double-click on the rarefaction_plots.html file
 Select the metric “PD_whole_tree.”
 Select category “treatment.”
This particular plot shows the minimum phylogenetic tree size (total branch lengths) at different
levels of sequencing depth. Clearly the Fasting treatment shows greater diversity, in terms of
phylogenetic distances between its constituent community members. This difference is not so
pronounced, however, when using the “observed species” metric; and is absent with the
“Chao1” measure.
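Two of the metrics mentioned above are easy to compute by hand. The Python sketch below counts observed species and computes Chao1 using the bias-corrected formula S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and exactly twice. Whether QIIME applies exactly this variant in every case is an assumption here, and the counts are invented.

```python
def observed_species(counts):
    """Number of OTUs with at least one sequence in the sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate for one sample."""
    singletons = sum(1 for c in counts if c == 1)  # F1
    doubletons = sum(1 for c in counts if c == 2)  # F2
    unseen = singletons * (singletons - 1) / (2.0 * (doubletons + 1))
    return observed_species(counts) + unseen


# Toy OTU counts for one rarefied sample.
counts = [10, 5, 1, 1, 1, 2, 2, 0]
print(observed_species(counts))
print(chao1(counts))
```

Chao1 adds an estimate of the OTUs that were missed, based on how many OTUs are rare, so it is never smaller than the observed species count. PD_whole_tree is not shown because it also requires the phylogenetic tree.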
9.5 Calculate Beta Diversity
Beta diversity is a term for the comparison of samples to one another. A beta diversity metric calculates a
distance between a pair of samples. If you have many samples (for this tutorial we have nine), a beta
diversity metric will return a matrix showing the distances of all samples to all other samples. If you have
experience in phylogenetics, you may know that a distance matrix can be visualized as a tree. Distance
matrices can also be visualized as a graph of points, a network, or any other creative method you can
come up with. In this tutorial, we'll use principal coordinates analysis to visualize distances between
samples on an x-y-z plot.
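As a small illustration of what a beta diversity metric returns, the Python sketch below builds a distance matrix for three toy samples. It deliberately swaps in the Bray-Curtis dissimilarity, a non-phylogenetic metric, because the UniFrac metrics used later in this section also need the phylogenetic tree. Sample names and counts are invented.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two vectors of OTU counts."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def distance_matrix(samples):
    """Return sample names and the all-against-all distance matrix."""
    names = sorted(samples)
    matrix = [[bray_curtis(samples[i], samples[j]) for j in names] for i in names]
    return names, matrix


# Toy OTU counts (same OTU order in every sample).
samples = {"A": [1, 0, 3], "B": [1, 2, 1], "C": [1, 0, 3]}
names, matrix = distance_matrix(samples)
print(names)
for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal; identical samples (A and C) have distance 0, and principal coordinates analysis takes such a matrix as its input.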
 Make sure you are still in the qiime directory
 Use the workflow script jackknifed_beta_diversity.py to run a beta diversity analysis
Usage:
jackknifed_beta_diversity.py [options] -i otu_table -o output_dir
-e num_seqs_per_jackknife_rep -m mapping_file
This script performs the following workflow:
a. Computes a beta diversity distance matrix from the complete data set
b. Performs multiple rarefactions at a single depth
c. Computes distance matrices for all the rarefied OTU tables
d. Builds UPGMA trees for the rarefactions
e. Compares all the trees to determine the consensus tree and support values for each branch
f. Performs principal coordinates analysis on the rarefied distance matrices
g. Generates plots of the principal coordinates
 Start the beta diversity pipeline:

jackknifed_beta_diversity.py -i otus/otu_table.biom -o jackknifed_beta_diversity \
-e 90 -m Fasting_Map.txt -t otus/rep_set.tre
-t (or --tree_fp)  Path to the tree file [default: None; but REQUIRED for
                   phylogenetic measures]
The jackknifed_beta_diversity.py workflow creates many new files inside the parent directory
jackknifed_beta_diversity. Here you will find a subdirectory for each of the different distance metrics we
selected (by way of the qiime_parameters.txt file). We will look at the output of one representative analysis
to gain insight into the “within” and “between” treatment diversity. The unweighted_unifrac subdirectory
contains a folder, emperor_pcoa_plots, which holds the file index.html. This file contains the results of the
principal coordinates analysis.

Use SCP/WinSCP to transfer the entire emperor_pcoa_plots directory to the Mac/PC desktop

Open the directory and double-click on the index.html file. You may get some security
warnings. Just hit the options that allow you to proceed and run the program.
Click your mouse on the plot and drag to rotate the axes in different directions. The colored clouds
represent the variation due to rarefied sampling within each sequence set, while the distances
between the clouds represent the variation among the sequence sets. Note how the samples from
each treatment tend to cluster in 3D space and are separable on the PC1 axis. In other words, most
of the variation between the nine samples (the first principal coordinate) is explained by differences
between the Control and Fasting populations.
9.6 Test whether any OTU is significantly associated with a
particular experimental category
QIIME provides the otu_category_significance.py script (named group_significance.py in QIIME 1.8 and
later, which is the name used in the commands below) to find OTUs that are associated with
particular experimental treatments or measured variables. A number of statistical tests are available
for this purpose, including ANOVA, the G test of independence, Pearson correlation, or a paired t-test.
Specific tests that are performed are as follows:
g_test: Uses the G test of independence to determine if OTU presence/absence is
associated with a category (e.g. are certain OTUs more likely to be
associated with fasting mice?).
ANOVA: Determines if OTU relative abundance is different in one condition versus
another.
Pearson correlation: Tests whether OTU abundance is correlated with a continuous variable (e.g.
which OTUs show positive or negative correlations with the length of time
in a fasting condition).
Paired t-test: This is for when measurements are taken “before” and “after” a treatment.
There must be exactly two measurements for each individual/site. The
category mapping file must have an individual column, indicating
which sample is from which individual, and a reference_sample column
that has a 1 for the before time point and a 0 for the after.
Usage:
otu_category_significance.py [options] -i otu_table -m category_mapping_file \
-c category -o output_directory
-i  path to input OTU table
-m  path to category mapping file
-c  category to test for differential representation
-s  statistical test to run (default = ANOVA)
Here we will learn how to use the ANOVA test to identify OTUs whose abundance is different
between the control and fasting conditions.
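For readers who want to see what the ANOVA test computes for a single OTU, here is a minimal Python sketch of the one-way ANOVA F statistic: the between-group mean square divided by the within-group mean square. It stops at the F statistic (turning it into a p-value requires the F distribution), and the abundance values are invented.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of measurements."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        mean = sum(group) / len(group)
        ss_between += len(group) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in group)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy abundances of one OTU in control and fasting samples.
control = [1.0, 2.0, 3.0]
fasting = [2.0, 3.0, 4.0]
print(anova_f([control, fasting]))
```

A large F means the group means differ by more than the scatter within groups would suggest; the same calculation is repeated for every OTU in the table.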
 Make sure you are still in the qiime directory
 Let’s perform an ANOVA test to identify OTUs that might be significantly overrepresented
in one condition relative to the other:

group_significance.py -i otus/otu_table.biom -m Fasting_Map.txt \
-s ANOVA -c Treatment -o single_anova.txt
 Take a look at the single_anova.txt results file. Hopefully you can see that, once we factor in
the False Discovery Rate (FDR column), none of the OTUs show any significant difference.
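Because one test is run per OTU, the raw p-values must be corrected for multiple testing. The Python sketch below shows the standard Benjamini-Hochberg false discovery rate adjustment; QIIME's own implementation may differ in detail, so treat this as an illustration of the idea rather than a reproduction of the FDR column. The p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(raw))
```

Each p-value is scaled by the number of tests divided by its rank, so an OTU needs a much smaller raw p-value to remain significant when thousands of OTUs are tested.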
9.7 Testing for differences in OTU abundance at different
taxonomic levels
Closer inspection of the single_anova.txt file reveals that the absence of significant differences is
probably because the mean OTU abundances in the control and fasting samples are very low. So,
let’s boost the OTU numbers by re-examining the data at higher taxonomic levels:
 Make sure you are in the qiime directory!
 First we need to summarize our OTUs at different taxonomic levels (L2, L3, L4, etc.). We
use the summarize_taxa.py script for this purpose:

summarize_taxa.py -i otus/otu_table.biom -L 2 -o taxa_summaries
-i  input OTU table
-L  taxonomic level to summarize
-o  output directory to store summaries
 Generate additional summary tables at taxonomic levels 3 and 4
 Now test the OTU category significances at each level. For each of L2, L3, and L4, run:

group_significance.py -i taxa_summaries/otu_table_L2.biom \
-m Fasting_Map.txt -s ANOVA -c Treatment -o anova_tests_L2.txt
-i  a single .biom file to use as input
-o  filename for the output summary
A message about missing metadata is expected and can be ignored.
 View the output file and examine the significance values in the FDR_P column. (Note: we
are focusing on this column, as opposed to the Bonferroni_P column, because the latter
correction tends to be too conservative and often obscures true differences). You should
now see that there are significant differences (FDR_P < 0.05) in OTU abundance at the
levels of Phylum and Class but not Order.
 Use awk to report the taxa that show significant differences in abundance.
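For readers who prefer Python to awk, the same filter can be sketched as follows. It assumes that the results file is tab-delimited with a header row containing a column named FDR_P and that the taxon is in the first column; the other column names and all values in the toy table are invented for illustration.

```python
def significant_taxa(results_text, alpha=0.05):
    """Return (taxon, FDR_P) pairs for rows with FDR_P below alpha."""
    lines = [line for line in results_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    fdr_col = header.index("FDR_P")  # locate the column by name
    hits = []
    for line in lines[1:]:
        fields = line.split("\t")
        fdr = float(fields[fdr_col])
        if fdr < alpha:
            hits.append((fields[0], fdr))
    return hits


# Toy results table; a real file would be read with open(path).read().
toy = "\n".join([
    "\t".join(["OTU", "Test-Statistic", "P", "FDR_P", "Bonferroni_P"]),
    "\t".join(["k__Bacteria;p__Firmicutes", "20.1", "0.004", "0.02", "0.03"]),
    "\t".join(["k__Bacteria;p__Bacteroidetes", "15.3", "0.010", "0.03", "0.08"]),
    "\t".join(["k__Bacteria;p__Proteobacteria", "0.9", "0.400", "0.60", "1.0"]),
])
for taxon, fdr in significant_taxa(toy):
    print(taxon, fdr)
```

To apply it to the workshop output, pass in the contents of anova_tests_L2.txt (and the L3 and L4 files) instead of the toy table.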