Essentials of Next Generation Sequencing Workshop 2015
University of Kentucky AGTC

Class 9: Metagenomics - Analyzing Metagenomic Data with QIIME

OBJECTIVE: We will use the open-source software package Quantitative Insights Into Microbial Ecology (QIIME, pronounced 'chime') to process and analyze 16S ribosomal RNA (16S rRNA) sequence reads. Specific aims will be to:

1. Check the format of a sample mapping file and preprocess the sequence reads
2. Filter sequencing reads based on quality
3. Group reads into Operational Taxonomic Units (OTUs), which represent groups of closely related species
4. Compare species (OTU) abundance between 16S rRNA datasets acquired from different experimental conditions

QIIME is a popular software pipeline that handles metagenomic data analysis all the way from raw data input, through sequence analysis, to the deposition of data into databases. Rather than being a single program (like TopHat or Cufflinks), QIIME consists of a number of scripts written in the Python programming language. Some of these handle a single task, while others wrap other Python scripts or external software tools (for example, BLAST). This allows multiple scripts to be combined into an analysis pipeline that can be invoked with a single command. The options/switches that control the behavior of each script (methods used, input/output file paths, etc.) can be set in a configuration file, which controls the default behavior of the program.

Caporaso et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7:335-336. http://qiime.org

QIIME requires the following input files:

- Study data in fasta+qual format (or fasta only), or, for 454 data, sff.txt files if they are available
- A "mapping" file. This is a tab-delimited text file that describes each sample in the data set, including the sample ID, sample barcode sequence, and primer sequences for clipping, plus any other information about the project, such as grouping categories and descriptions
- A reference database with taxonomic information. For 16S rRNA studies, the reference database is "greengenes," and it is downloadable along with the QIIME software from http://qiime.wordpress.com/2013/05/20/greengenes-13_5/

The following exercises will present workflows for a typical metagenomic analysis of 454 sequences and will closely follow the tutorial presented on the QIIME website (http://www.qiime.org).

9.1 Validate the mapping file

The mapping file is essential for successful deconvolution of the sequence data. At a minimum, it should contain the name of each sample, the barcode sequence used, the linker/primer sequence used to amplify the sample, and a description column. It may also contain additional metadata relating to the samples. The mapping file is tab-delimited, and sample names may contain only alphanumeric characters and '.' (periods). Empty lines are ignored, as are comment lines starting with '#'. The category represented in each tab-delimited column is described in the first row (which should be prefaced with the '#' character).

Example:

#SampleID  BarcodeSequence  LinkerPrimerSequence   Treatment  DOB       Description
PC.354     AGCACGAGCCTA     YATGCTGCCTCCCGTAGGAGT  Control    20061218  Control_mouse__I.D._354
PC.355     AACTCGTCGATG     YATGCTGCCTCCCGTAGGAGT  Control    20061218  Control_mouse__I.D._355
PC.356     ACAGACCACTCA     YATGCTGCCTCCCGTAGGAGT  Control    20061126  Control_mouse__I.D._356
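Before running QIIME's formal validator (below), a quick optional sanity check is to confirm that every non-empty row of the mapping file contains the same number of tab-separated fields. This is only an illustrative one-liner, assuming the mapping file is named Fasting_Map.txt as in this exercise:

awk -F'\t' 'NF > 0 {print NF}' Fasting_Map.txt | sort -n | uniq -c

Every row should report the same field count (six for the example above); a row with a different count usually indicates a missing tab or stray whitespace.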
Make sure you are in the qiime directory. All of the following commands assume that you are in the qiime directory. Remember, you do not need to change directories to list, copy or move files.

To get started with QIIME, you must first "source" its activation script to load settings into your shell that allow QIIME to find the specific software versions it requires. If you log out and log back in, you must run this command again!

source /opt/qiime_software/activate.sh

Check the mapping file that has been provided for this exercise:

validate_mapping_file.py -m Fasting_Map.txt -o mapping_output

Hopefully, this will return a message that no errors were found.

9.2 Use split_libraries.py to split the input files according to barcode

split_libraries.py is a multi-functional script that identifies and renames reads (and their corresponding quality records) according to their barcodes. Reads without barcodes are rejected, as are those without the correct primer sequence. The script can also perform quality trimming (or, if desired, outright sequence rejection) based on user-defined quality standards. After this step, each read is renamed to incorporate the sample identifier along with a unique (consecutive) number within that sample.

Make sure you are still in the qiime directory.

Run split_libraries.py on the example dataset:

split_libraries.py -m Fasting_Map.txt -f Fasting_Example.fna \
  -q Fasting_Example.qual -o split_library_output

-m specifies the name of the mapping file
-f specifies the name of the input multifasta (.fasta) file
-q specifies the name of the input quality (.qual) file [optional]
-o name of the output directory (this directory will be created when the script is run)

For a single, non-barcoded sample, split_libraries.py can be provided with a mapping file that has an empty field for the BarcodeSequence, together with the -b 0 option.

After this step, the original read names are removed and replaced with the sample ID code and a unique number.

View the output documents histograms.txt and seqs.fna to check for proper performance of the script.

9.3 Pick Operational Taxonomic Units (OTUs)

All of the sequences from each mouse sample will be clustered into Operational Taxonomic Units (OTUs) based on their sequence similarity. OTUs in QIIME are clusters of sequences, frequently intended to represent some degree of taxonomic relatedness. There are three main protocols for OTU picking:

a) pick_de_novo_otus.py: Reads are clustered against one another without any external reference sequence collection. Useful for studying populations that are poorly characterized in existing reference data. In addition to clustering, the script also performs taxonomy assignment, sequence alignment, and tree-building steps.

b) pick_closed_reference_otus.py: Reads are clustered against a reference sequence collection, and any reads that do not hit a sequence in the reference collection are excluded. Useful when existing taxa are well characterized and one is not interested in novel ones. Taxonomic assignments from the reference database are assigned to the OTUs.
c) pick_open_reference_otus.py: Reads are clustered against a reference sequence collection, and any reads that do not hit the reference collection are clustered de novo. Applicable when existing taxa are well characterized and one also wishes to discover novel ones.

All three scripts have similar usage and options. Example usage (do not enter this command):

pick_open_reference_otus.py -r reference_taxa.fna -i split_library_output/seqs.fna -o otus

-r file path to reference sequences (not required for the de novo script)
-i file path to input sequences (post-splitting, if applicable)
-o name of the output directory (this directory will be created when the script is run)

So let's go ahead and run the pick_de_novo_otus.py script.

Make sure you are still in the qiime directory.

pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus

This will run a number of analyses, as follows:

a. Pick OTUs: all of the sequences from all of the samples are clustered into Operational Taxonomic Units (OTUs) based on the selected clustering method and a user-defined sequence similarity threshold. Many clustering methods are already implemented in QIIME, such as BLAST, uclust and usearch.

b. Pick a representative sequence set (one sequence, the most abundant, per OTU). This representative sequence will be used for taxonomic identification of the OTU and for phylogenetic alignment. The relevant file (seqs_rep_set.fasta) can be found in the otus/rep_set/ directory, along with a log file.

c. Align the representative sequence set - necessary if phylogenetic tools such as UniFrac will be used. Alignments can be generated de novo with programs such as MUSCLE, or against an existing reference using tools like PyNAST. The alignment file is created in the otus/pynast_aligned_seqs/ directory, along with a log file.

d. Assign taxonomies to the representative sequences in the seqs_rep_set.fasta file. The sequences are compared with established databases to define taxonomic identities. QIIME uses the uclust classifier by default. Assigned taxonomies (including confidence values) are written to a text file in the otus/uclust_assigned_taxonomy/ directory.

e. Filter the alignment prior to tree building to remove positions that are phylogenetically uninformative. The filtered sequence file is created in the otus/pynast_aligned_seqs/ directory.

f. Build a phylogenetic tree. Tree building is necessary for UniFrac diversity measurements and other phylogeny-based analyses. A Newick-format tree is written to rep_set.tre in the otus/ directory and can be viewed using tree visualization software (for example TreeVector; Pethica R, Barker G, Kovacs T, Gough J (2010) TreeVector: scalable, interactive, phylogenetic trees for the web. PLoS ONE 5:e8934. doi:10.1371/journal.pone.0008934).

g. Build an OTU table: using the OTU map (step a) and the taxonomic assignments (step d), QIIME assembles a matrix of OTU abundance in each sample, with meaningful taxonomic identifiers for each OTU. The results are written to a file (otu_table.biom) in the otus/ directory.

Let's examine the results of the pick_de_novo_otus.py script.

First, look at the otu_table.biom file in the otus directory just created by the previous script.

Kind of hard to interpret, isn't it?
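If you would like a human-readable version, the biom utility can convert the table to tab-separated text. This is a suggested sketch; the exact flags depend on the installed biom-format version, so check biom convert --help if this form is not accepted:

biom convert -i otus/otu_table.biom -o otus/otu_table.txt --to-tsv --header-key taxonomy

The resulting otus/otu_table.txt lists one OTU per row and one sample per column, with the assigned taxonomy in the final column.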
We can use another utility, biom summarize-table, to obtain a very brief summary, which lists the total sequence counts per sample (the "Counts/sample detail"):

biom summarize-table -i otus/otu_table.biom -o table_summary.txt

Examine the table_summary.txt output file.

Now let's use summarize_taxa_through_plots.py to group OTUs at different taxonomic levels (phylum, class, order, etc.) and plot the results:

summarize_taxa_through_plots.py -i otus/otu_table.biom \
  -o wf_taxa_summary -m Fasting_Map.txt

This will generate a new directory, wf_taxa_summary, containing information on the proportional representation of taxonomic groups within each sample. The taxonomic level to summarize is designated with the -L option; for the Greengenes taxonomy used here, Level 2 = Phylum, 3 = Class, 4 = Order, 5 = Family, 6 = Genus and 7 = Species. Taxa summary tables will be output in both classic (tab-separated) and BIOM formats. In addition, we can group samples using the -c option to summarize taxa by group of samples (for example, -c Treatment will provide taxa summaries for fasting vs. standard diet). The BIOM-formatted taxa summary tables can be used as input to other QIIME scripts that accept BIOM files.

To view the resulting charts, use SCP/WinSCP to transfer the taxa_summary_plots folder to the desktop of the Mac/PC and open the bar_charts.html file (note: on the Mac, you will need to use the -r flag to transfer the whole folder):

scp -r taxa_summary_plots user.name@csurs11.csr.uky.edu:

-r recursively copy the entire directory

9.4 Calculate Alpha Diversity

R. H. Whittaker defined three levels of species diversity: alpha diversity corresponds to species diversity within sites/habitats at a local scale; beta diversity comprises species diversity among sites/habitats; and gamma diversity represents the diversity across an entire landscape. Conveniently, the QIIME package contains tools for estimating alpha and beta diversities - the components that together determine gamma diversity.

To start, we will look at alpha diversity - a measure of the bacterial species diversity within an individual experimental sample - in this case, mice that have been fed a specific diet.

First, we must consider the fact that the number of sequences in each dataset will affect the estimated species diversity (fewer sequences means lower estimated species diversity). Consequently, we must normalize the datasets to account for uneven sequencing depth, using a process known as rarefaction. In QIIME, this task is performed by the multiple_rarefactions.py script, which takes the OTU table as input and creates a folder containing many new tables, each representing a random subsample of the original OTU table.

Usage: multiple_rarefactions.py [options] -i path/to/OTU_table -o output_directory -m min_seqs_per_sample -x max_seqs_per_sample -s step_size

-i path to the OTU table
-o output directory name
-m smallest rarefaction depth (number of sequences per sample) in the rarefied tables
-x largest rarefaction depth in the rarefied tables
-s step size between depths
-n number of rarefied tables to build at each depth

Rarefy the OTU table produced by the pick_de_novo_otus.py script:

multiple_rarefactions.py -i otus/otu_table.biom -m 20 -x 100 \
  -s 20 -n 10 -o rarefied_20-100

Running the above script creates OTU tables for subsets containing 20, 40, 60, 80 and 100 randomly selected sequences (10 tables per depth) and saves them in a folder named rarefied_20-100. These new tables will be used in the alpha diversity calculation.
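As a quick optional check, list the new directory to confirm that the rarefied tables were created (the exact file names may vary slightly between QIIME versions, but each file corresponds to one depth/iteration combination):

ls rarefied_20-100

You should see 50 rarefied .biom tables in total (5 depths x 10 iterations per depth).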
When selecting sample sizes for rarefaction, we need to consider the size of each sample and our computing power. In our example, each sample consists of about 150 reads, so a maximum depth of 100 reads still yields a reasonably random selection.

There are many methods for calculating alpha diversity, so it is important to think about which is most meaningful for your experiment and your biological question. The QIIME script for calculating alpha diversity in samples is called alpha_diversity.py.

Usage: alpha_diversity.py [options] -i path/to/OTU_directory -o output_directory

Calculate the alpha diversity using as inputs the rarefied tables produced in the previous step:

alpha_diversity.py -i rarefied_20-100 \
  -o alpha_diversity_results -t otus/rep_set.tre

-t, --tree_path file path to a Newick tree; required when using phylogenetic metrics to calculate alpha diversity

The results are written to the alpha_diversity_results directory but are not very easy to interpret, because a separate output file is produced for each of the rarefied tables.

List the contents of the alpha_diversity_results directory to see for yourself.

Take a look at one or two files. Each file lists the three calculated diversity measures for the different samplings.

Interpretation of the results is best accomplished by using the collate_alpha.py script and plotting the results.

Collate the alpha diversity calculations into a visualizable format:

collate_alpha.py -i alpha_diversity_results -o alpha_div_collated

The results are organized into one file per metric, formatted to facilitate spreadsheet analysis with MS Office or OpenOffice. The data can also be graphed to show the diversity of each sample at varying depths of rarefaction.

Generate plots of the collated alpha diversity data:

make_rarefaction_plots.py -i alpha_div_collated \
  -m Fasting_Map.txt -o rarefaction_plots

This will generate an .html file within a directory named rarefaction_plots. This directory will also contain a number of subdirectories containing image files for display in the interactive webpage specified by the html file.

Use SCP/WinSCP to transfer the entire rarefaction_plots directory to the Mac/PC desktop.

Open the directory and double-click on the rarefaction_plots.html file.

Select the metric "PD_whole_tree." Select the category "Treatment."

This particular plot shows the minimum phylogenetic tree size (total branch length) at different levels of sequencing depth. Clearly, the Fasting treatment shows greater diversity in terms of the phylogenetic distances between its constituent community members. This difference is not so pronounced, however, when using the "observed species" metric, and is absent with the "Chao1" measure.

9.5 Calculate Beta Diversity

Beta diversity is a term for the comparison of samples to one another. A beta diversity metric calculates a distance between a pair of samples. If you have many samples (for this tutorial we have nine), a beta diversity metric will return a matrix of the distances of all samples to all other samples. If you have experience in phylogenetics, you may know that a distance matrix can be visualized as a tree. Distance matrices can also be visualized as a graph of points, a network, or any other creative method you can come up with. In this tutorial, we will use principal coordinates analysis (PCoA) to visualize the distances between samples on an x-y-z plot.
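For reference, the distance matrix for a single metric can be computed directly with QIIME's beta_diversity.py script. This is only a sketch of the step that the workflow below automates, and you do not need to run it for this exercise (it assumes the same unweighted UniFrac metric and rep_set.tre tree used later):

beta_diversity.py -i otus/otu_table.biom -m unweighted_unifrac \
  -t otus/rep_set.tre -o beta_div

The output is a tab-delimited sample-by-sample distance matrix that can be inspected in a spreadsheet or passed to downstream QIIME scripts.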
Make sure you are still in the qiime directory.

Use the workflow script jackknifed_beta_diversity.py to run a beta diversity analysis.

Usage: jackknifed_beta_diversity.py [options] -i otu_table -o output_dir -e num_seqs_per_jackknife_rep -m mapping_file

This script performs the following workflow:

a. Computes a beta diversity distance matrix from the complete data set
b. Performs multiple rarefactions at a single depth
c. Computes distance matrices for all the rarefied OTU tables
d. Builds UPGMA trees for the rarefactions
e. Compares all the trees to determine the consensus tree and support values for each branch
f. Performs principal coordinates analysis on the rarefied distance matrices
g. Generates plots of the principal coordinates

Start the beta diversity pipeline:

jackknifed_beta_diversity.py -i otus/otu_table.biom \
  -o jackknifed_beta_diversity -e 90 -m Fasting_Map.txt -t otus/rep_set.tre

-t (or --tree_fp) path to the tree file [default: None; REQUIRED for phylogenetic measures]

The jackknifed_beta_diversity.py workflow creates many new files inside the parent directory jackknifed_beta_diversity. Here you will find a subdirectory for each of the distance metrics we selected (by way of the qiime_parameters.txt file). We will look at the output of one representative analysis to gain insight into the "within" and "between" treatment diversity. The unweighted_unifrac subdirectory contains a folder, emperor_pcoa_plots, containing the file index.html. This file presents the results of the principal coordinates analysis.

Use SCP/WinSCP to transfer the entire emperor_pcoa_plots directory to the Mac/PC desktop.

Open the directory and double-click on the index.html file. You may get some security warnings; just choose the options that allow you to proceed and run the program.

Click your mouse on the plot and drag to rotate the axes in different directions. The colored clouds represent the variation due to rarefied sampling within each sequence set, while the distances between the clouds represent the variation among the sequence sets. Note how the samples from each treatment tend to cluster in 3D space and are separable on the PC1 axis. In other words, most of the variation between the nine samples (the principal component) is explained by differences between the Control and Fasting populations.

9.6 Test whether any OTU is significantly associated with a particular experimental category

QIIME provides the otu_category_significance.py script (in newer QIIME releases this functionality is provided by group_significance.py, which is the command we will actually run below) to find OTUs that are associated with particular experimental treatments or measured variables. A number of statistical tests are available for this purpose, including ANOVA, the G test of independence, Pearson correlation, and a paired t-test. The specific tests are as follows:

g_test: uses the G test of independence to determine whether OTU presence/absence is associated with a category (e.g. are certain OTUs more likely to be associated with fasting mice?).

ANOVA: determines whether OTU relative abundance is different in one condition versus another.

Pearson correlation: tests whether OTU abundance is correlated with a continuous variable (e.g. which OTUs show positive or negative correlations with the length of time in a fasting condition).

Paired t-test: for when measurements are taken "before" and "after" a treatment. There must be exactly two measurements for each individual/site.
The category mapping file must, in this case, also have an individual column, indicating which individual each sample comes from, and a reference_sample column containing 1 for the "before" time point and 0 for the "after" time point.

Usage (group_significance.py takes the same core options): otu_category_significance.py [options] -i otu_table -m category_mapping_file -c category -o output_file

-i path to the input OTU table
-m path to the category mapping file
-c category to test for differential representation
-o path for the output file
-s statistical test to run (default = ANOVA)

Here we will learn how to use the ANOVA test to identify OTUs whose abundance differs between the control and fasting conditions.

Make sure you are still in the qiime directory.

Let's perform an ANOVA test to identify OTUs that might be significantly overrepresented in one condition relative to the other:

group_significance.py -i otus/otu_table.biom -m Fasting_Map.txt \
  -s ANOVA -c Treatment -o single_anova.txt

Take a look at the single_anova.txt results file. Hopefully you can see that, once we factor in the False Discovery Rate (the FDR_P column), none of the OTUs shows any significant difference.

9.7 Testing for differences in OTU abundance at different taxonomic levels

Closer inspection of the single_anova.txt file reveals that the absence of significant differences is probably because the mean OTU abundances in the control and fasting samples are very low. So, let's boost the OTU numbers by re-examining the data at higher taxonomic levels.

Make sure you are in the qiime directory!

First, we need to summarize our OTUs at different taxonomic levels (L2, L3, L4, etc.). We use the summarize_taxa.py script for this purpose:

summarize_taxa.py -i otus/otu_table.biom -L 2 -o taxa_summaries

-i input OTU table
-L taxonomic level to summarize
-o output directory to store the summaries

Generate additional summary tables at taxonomic levels 3 and 4.

Now test the OTU category significance at each level. For each of L2, L3 and L4, run:

group_significance.py -i taxa_summaries/otu_table_L2.biom \
  -m Fasting_Map.txt -s ANOVA -c Treatment -o anova_tests_L2.txt

-i a single .biom file to use as input
-o filename for the output summary

A message about missing metadata is expected and can be ignored.

View the output file and examine the significance values in the FDR_P column. (Note: we focus on this column, as opposed to the Bonferroni_P column, because the latter correction tends to be too conservative and often obscures true differences.) You should now see that there are significant differences (FDR_P < 0.05) in OTU abundance at the levels of Phylum and Class, but not Order.

Use awk to report the taxa that show significant differences in abundance.
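One way to do this (a sketch, assuming FDR_P is the fourth tab-separated column, as in typical group_significance.py output; check the header line of your own file and adjust the column number if necessary):

awk -F'\t' 'NR == 1 || $4 < 0.05' anova_tests_L2.txt

This prints the header line plus every taxon whose FDR-corrected P value is below 0.05; repeat for the L3 and L4 output files.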