Tutorial 2
Taxonomical classification of SSU rRNA (16S / 18S) data

Extracting ecological signal from noise: introduction to tools for the analysis of NGS data from microbial communities
Bergen, 19-20 April 2012
Anders Lanzén

Overview

• The two amplicon datasets from Tutorial 1, de-noised
• Alignment to SSU rRNA reference database using Megablast
• Prediction of taxonomical composition and diversity with LCAClassifier
• Comparison of composition between datasets
• “Grepping” sequences from a particular taxon
• Classification using the Greengenes taxonomy
• Classification using the RDP Classifier
• Optional: Using MEGAN with the CREST databases for classification

Copy data and check installation

LCAClassifier is the program in the package CREST (Classification Resources for Environmental Sequence Tags) that is responsible for taxonomical classification of sequences, based on Megablast alignments to a reference database. On the course lab computers, this program is installed in the directory /usr/local/lcaclassifier. The installation directory contains a configuration file that defines the location of the database files that the LCAClassifier needs in order to map Megablast results to a taxonomy. Have a look at this file:

    less /usr/local/lcaclassifier/parts/etc/lcaclassifier.conf

By default, two databases are available: SilvaMod, a modified version of the SILVA taxonomy, and Greengenes. These are downloaded to the folder parts/flatdb when LCAClassifier is installed. Each database consists of the following files, named after the database itself, e.g. for silvamod:

1) silvamod.fasta.n* (.nhr, .nin and .nsq) - binary search index files for BLAST / Megablast. The FASTA-formatted sequences themselves are not needed by BLAST and are not downloaded by default, to speed up the installation.
2) silvamod.tre - a text file in Newick tree format that defines the topology of the taxonomical tree
3) silvamod.map - a tab-separated text file that specifies a name and rank for each taxon ID in the tree file

In this tutorial we will classify the de-noised files resulting from the previous AmpliconNoise tutorial, called SnotDNA_F_Good.fa and SnotRNA_F_Good.fa. We will also use the sequence representatives for each OTU (Both_Good.otus.fasta). If you reached the end of the last tutorial, make a new directory under your $HOME called Tutorial2 and copy these files to it:

    cd
    mkdir Tutorial2
    cp Tutorial1_AN/*_F_Good.fa Tutorial2/
    cp Tutorial1_AN/Both_Good.otus.fasta Tutorial2/

Hint: In case you did not have time to finish Tutorial 1, the files are also found in the folder Solution inside the Tutorial1 folder.

Then go to the new directory you have created for this tutorial:

    cd Tutorial2

Aligning files to the reference database (SilvaMod) using Megablast

The first step in the CREST workflow is alignment to a reference database using Megablast, which is a faster version of BLAST (Basic Local Alignment Search Tool) for nucleotide sequence alignments. We will use the SilvaMod database, made from SILVA’s SSURef alignment of full-length SSU rRNA sequences, release 106. Using the resulting alignments (BLAST results), taxonomical classification is then done with the LCAClassifier program (or alternatively, MEGAN).

The LCAClassifier only supports results from the NCBI blastall implementation of Megablast, in XML format. It does not work with results from the newer BLAST+ implementation.

Align the sequences in SnotDNA_F_Good.fa to SilvaMod using:

    megablast -i SnotDNA_F_Good.fa -b 100 -v 100 -m 7 -d /usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.fasta -a 4 -o SnotDNA_silvamod.xml

The arguments given to the megablast command mean:

    -i : the nucleotide sequence input file
    -b 100 -v 100 : report only the 100 best alignments for each sequence
    -m 7 : produce output in XML format
    -d : the reference database to use*
    -a 4 : use four CPU cores
    -o : destination file of the Megablast output

*Note that the file silvamod.fasta itself is actually missing by default, but it is not needed, since BLAST only uses the search index files produced from silvamod.fasta by the command formatdb.

If you don’t get any error message, the Megablast alignment has probably worked fine. You can also have a look at the resulting file in a text editor, to familiarise yourself with the BLAST XML format, which is the same for Megablast and normal BLAST. However, it is not really intended to be read by human beings and is a bit too structured for us.

Then, align the other two FASTA files:

    megablast -i SnotRNA_F_Good.fa -b 100 -v 100 -m 7 -d /usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.fasta -a 4 -o SnotRNA_silvamod.xml
    megablast -i Both_Good.otus.fasta -b 100 -v 100 -m 7 -d /usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.fasta -a 4 -o OTUs_silvamod.xml

This should only take a couple of minutes per file, but if things are slow, it’s a good opportunity for a short coffee break.

Taxonomical classification using LCAClassifier

Classification with the LCAClassifier program using default parameters is quite easy and is carried out with the command classify. To classify the Megablast-aligned SnotDNA, simply type:

    classify SnotDNA_silvamod.xml

This will result in two files. The first, named SnotDNA_silvamod_Composition.txt, is a tab-separated text file presenting the number of reads, relative abundance, number of unique sequences and a Chao estimate of minimum diversity for each taxon, at different rank levels (domain, phylum, class, order, family and genus). The number of reads classified at each rank level is also summarised.
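To get a feeling for what a Chao estimate of minimum diversity expresses, here is a minimal Python sketch. It assumes the classic Chao1 form, which extrapolates from the number of unique sequences seen once (singletons) and twice (doubletons); whether the LCAClassifier uses exactly this variant is an assumption, and the function name chao1 and the example counts are made up for illustration.

```python
from collections import Counter


def chao1(read_counts):
    """Classic Chao1 richness estimate (illustrative, not LCAClassifier code).

    read_counts: number of reads for each unique sequence in one taxon.
    """
    counts = [c for c in read_counts if c > 0]
    observed = len(counts)           # unique sequences actually seen
    freq = Counter(counts)
    singletons = freq[1]             # sequences seen exactly once
    doubletons = freq[2]             # sequences seen exactly twice
    if doubletons > 0:
        return observed + (singletons ** 2) / (2.0 * doubletons)
    # Bias-corrected form, used here to avoid dividing by zero
    return observed + singletons * (singletons - 1) / 2.0


if __name__ == "__main__":
    # Hypothetical taxon: 7 unique sequences, 3 singletons, 2 doubletons
    print(chao1([10, 5, 2, 2, 1, 1, 1]))  # prints 9.25
```

The estimate can never be lower than the number of unique sequences observed, which is why it is read as a minimum diversity.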
Open this file in a spreadsheet editor (like OpenOffice).

- What proportion of the original reads could be classified to at least family level?
- What is the relative abundance of Sulfurimonas and how many unique sequences are represented in this genus? What is the Chao estimate?

The other file, SnotDNA_silvamod_Tree.txt, lists the number of reads assigned in a simple space-delimited tree format. Here are the first ten lines of the file:

    head SnotDNA_silvamod_Tree.txt
    root: 718
    No hits: 2
    Cellular organisms: 716
    Archaea: 213
    Korarchaeota: 2
    Marine Benthic Group B - Deep Sea Archaeal Group (DSAG): 2
    Miscellaneous Crenarchaeotic Group: 16
    Thaumarchaeota: 9
    Group 1A - pSL12: 2
    FS243A-60: 2

Comparison of composition between datasets

The LCAClassifier can also classify several BLAST result files at the same time, provided that they were aligned to the same reference database. Try this out:

    classify -o -p *.xml

This results in the same two types of result files for each dataset (SnotDNA, SnotRNA and the OTU representative sequences). However, for each taxon present in at least one dataset, an empty assignment with zero abundance is inserted in every dataset where no assignment to that taxon was made. This makes it easier to compare datasets. The options -o and -p specify that results in alternative output formats should also be written.

Open SnotDNA_silvamod_Composition.txt and SnotRNA_silvamod_Composition.txt in a spreadsheet editor (e.g. LibreOffice). Copy the last four columns (“Abundance” to “Chao”) of one file and paste them to the right of the columns of the other one, so that each taxon is compared for the two datasets side by side.

- Can you find any taxa with an average abundance > 1% and less than half the relative abundance in RNA compared to DNA?
- How about a dominant taxon with almost twice the abundance in RNA?
- In the DNA dataset, what is the most diverse genus, in terms of number of unique de-noised sequences?
- Have a look at the file All_Composition.txt. Can you figure out how many DNA sequences were assigned to the domain Bacteria but could not be assigned to any particular phylum?

“Grepping” sequences from a particular taxon

The LCAClassifier can also write output files that specify the assignment for each individual sequence. Using option -p (or --rdp), the identifier of each sequence and its predicted taxonomical path are written to a file with the suffix “_Assignments.txt”. For example:

    SnotRNA_47_5    Cellular organisms;Archaea;Euryarchaeota;Methanomicrobia;ANME-1;ANME-1a

Another useful option is to write a new FASTA file with taxonomical annotations added to the FASTA header. This is done using option -a (or --fasta). By default, only the aligned part of each sequence is written, and entries that could not be aligned are omitted. Alternatively, the entire sequences can be written. The FASTA file to read the sequences from must then be specified using -i (or --fastain). To produce these files for the OTU representative sequences, type:

    classify -i Both_Good.otus.fasta -a -p OTUs_silvamod.xml

This produces the files OTUs_silvamod_Assignments.fasta and OTUs_silvamod_Assignments.txt. The first can be used for extracting all sequences from a particular taxon, using the Linux command grep. For example, to save all archaeal sequences to the file A.fa:

    grep Archaea OTUs_silvamod_Assignments.fasta -A1 > A.fa

- How many are there? (Hint: grep -ca '>' A.fa)

Some of the sequences are assigned to “Unknown” taxa. This means that the minimum sequence similarity filter of the LCAClassifier has prevented them from being classified to a more specific rank. These sequences may be interesting because of their novelty. A.fa contains five such sequences. Have a look at them:

    grep Unknown -A1 A.fa

The last sequence (OTU23_1) is assigned to “Unknown Halobacteria order”.
This means that it is less than 90% similar to the closest reference sequence and thus cannot be assigned to a particular order, but only to the class Halobacteria under Euryarchaeota. Copy the sequence to NCBI Blast (http://www.ncbi.nlm.nih.gov/blast) and have a look at the closest sequence matches. The first two are from the same sequence (as it is submitted), but the third is from another hydrothermal sediment and is only 91% similar.

To read an overview of all options available with the LCAClassifier, type:

    classify --help

Classification using Greengenes

As an alternative to the SilvaMod taxonomy, the LCAClassifier can also be used together with the Greengenes taxonomy. Like the SILVA taxonomy (and SilvaMod), this is basically a reference database consisting of environmental sequences and type strains, annotated taxonomically based on their clustering, in other words the topology of the distance tree. The difference from SILVA is that the annotations in Greengenes were made automatically, using a heuristic algorithm. To read more about the Greengenes taxonomy, see McDonald et al (2012), 'An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea', ISME J, 6:610-618.

As the reference database is different, the first step is to align the sequences to the Greengenes reference database, which can be found in /usr/local/lcaclassifier/parts/flatdb/greengenes. To align e.g. the SnotDNA sequences:

    megablast -i SnotDNA_F_Good.fa -b 100 -v 100 -m 7 -d /usr/local/lcaclassifier/parts/flatdb/greengenes/greengenes.fasta -a 4 -o SnotDNA_GG.xml

To classify the sequences based on the Megablast result, you need to tell the classify command to use the Greengenes taxonomy, since SilvaMod is the default:

    classify -d greengenes -p SnotDNA_GG.xml

Open SnotDNA_GG_Composition.txt in a spreadsheet editor.

- What proportion of reads could be classified at genus level?
- Compare it to the community composition based on the SilvaMod taxonomy. What major differences can you find between the results based on the two reference databases / taxonomies, at family level?

Classification using the RDP Classifier

The RDP Classifier is a popular tool for taxonomic classification. Instead of BLAST alignments, it uses nucleotide composition for classification. It is an open source Java program that can be downloaded together with a default training dataset, currently in version 2.4 with training dataset v7. The default training dataset includes mainly cultured type strains and is much smaller than Greengenes or SILVA SSURef, with about 10,000 sequences, annotated with the RDP taxonomy. The RDP Classifier is also integrated in the MG-RAST webserver and in the amplicon analysis program package QIIME, where a training set from Greengenes with corresponding taxonomical classifications is also included.

First, download the RDP Classifier program, which is distributed as a Java jar file from http://sf.net/projects/rdp-classifier/. You can also do this using the command wget:

    wget http://downloads.sf.net/project/rdp-classifier/rdp-classifier/rdp_classifier_2.4.zip

Unzip it:

    unzip rdp_classifier_2.4.zip

Then, to classify the SnotDNA dataset with it, use:

    java -jar rdp_classifier_2.4/rdp_classifier-2.4.jar -q SnotDNA_F_Good.fa -o SnotDNA_RDP.txt

The tab-separated (default) output format looks like this:

    $ head SnotDNA_RDP.txt |tail -1
    SnotDNA_s60_c01_T220_P_BC_s30_c08_10_2 Root norank 1.0 Bacteria domain 1.0 "Proteobacteria" phylum 0.51 Deltaproteobacteria class 0.25 Bdellovibrionales order 0.25 Bdellovibrionaceae family 0.25 Bdellovibrio genus 0.25

The sequence name is followed by the classification at each rank, starting with “Root”. Each classification consists of the taxon name, the name of its rank and a bootstrap “confidence” value. The developers recommend a cut-off at 0.8, which in this case means that the sequence can only be classified at domain level.
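The effect of the confidence cut-off on one output line can be sketched in Python as follows. This is a minimal sketch, not the rdpclassifierParse script from the LCAClassifier package: the function name parse_rdp_line is hypothetical, and the column layout (sequence name, an optional empty or "-" field, then repeating taxon / rank / confidence triplets separated by tabs) is an assumption based on the example line shown above.

```python
def parse_rdp_line(line, cutoff=0.8):
    """Return (sequence name, [(taxon, rank), ...]) for one RDP output line.

    Ranks are kept from the root downwards until the first confidence
    value below the cut-off. The column layout is assumed, see lead-in.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    name = fields[0]
    rest = fields[1:]
    # Skip an empty or "-" field directly after the name, if present
    if rest and rest[0] in ("", "-"):
        rest = rest[1:]
    path = []
    usable = len(rest) - len(rest) % 3   # ignore an incomplete trailing triplet
    for i in range(0, usable, 3):
        taxon, rank, confidence = rest[i:i + 3]
        if float(confidence) < cutoff:
            break
        path.append((taxon.strip('"'), rank))
    return name, path


if __name__ == "__main__":
    example = "\t".join([
        "SnotDNA_s60_c01_T220_P_BC_s30_c08_10_2", "",
        "Root", "norank", "1.0",
        "Bacteria", "domain", "1.0",
        '"Proteobacteria"', "phylum", "0.51",
        "Deltaproteobacteria", "class", "0.25",
    ])
    print(parse_rdp_line(example))
```

With the default cut-off of 0.8, the example sequence keeps only Root and Bacteria. Lowering the cut-off to 0.5 would also keep the phylum Proteobacteria.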
- How was this sequence classified by the LCAClassifier using SilvaMod? Hint:

    grep SnotDNA_s60_c01_T220_P_BC_s30_c08_10_2 SnotDNA_silvamod_Assignments.txt

- How about with Greengenes?

There is a Python script called rdpclassifierParse in the LCAClassifier package for converting the RDP Classifier output into the same spreadsheet-friendly composition overview format as that produced by the classify command. To do this, use:

    cd /usr/local/lcaclassifier
    bin/python src/LCAClassifier/rdpclassifierParse.py ~/Tutorial2/SnotDNA_RDP.txt 0.8 > ~/Tutorial2/SnotDNA_RDP_Compo.txt
    cd ~/Tutorial2

Open the file SnotDNA_RDP_Compo.txt in a spreadsheet editor.

- What proportion of reads could be classified to at least phylum level?
- Compare it to the community composition based on the SilvaMod taxonomy. What differences can you find between the results at phylum level?

Classification with MEGAN (optional)

MEGAN is not installed in this course lab, so you have to download and install the latest version from the program’s website at http://ab.inf.uni-tuebingen.de/software/megan/

Download the UNIX installation script using a web browser or with wget:

    wget http://ab.inf.uni-tuebingen.de/data/software/megan4/download/MEGAN_unix_4_67_1.sh

Then install the program by executing the downloaded script:

    chmod +x MEGAN_unix_4_67_1.sh
    ./MEGAN_unix_4_67_1.sh

This will open a dialog box asking where to install the program. Choose your $HOME directory under “megan” and tick the box “Create symlinks”. Then start MEGAN simply by typing “MEGAN” (or by ticking “Run MEGAN”).

The first time you start it, you may have to enter a command that enables the use of alternative taxonomies (this is because the option is “locked” for now). In the Window menu, click Command Input, which opens a command line interface window to MEGAN. In this window, first type:

    setprop allow-read-weights=true;

Then press Apply.
Then type the following and click Apply:

    setprop allow-read-weights-underscore=true;

Then re-start MEGAN. Now click “Use Alternative Taxonomy...” under Edit>Preferences. Then navigate to the directory /usr/local/lcaclassifier/parts/flatdb/silvamod and select the file silvamod.tre. If everything works correctly, the SilvaMod taxonomy should now be enabled and you should see the following output in the Messages window:

    Executing: load treefile='/usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.tre'
    [..]
    File name: /usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.tre'
    Load mapping: taxId2TaxLevel: 1179077 done: 302383
    Load tree: done: 302379 nodes, 302378 edges
    Number of taxa: 302382

Now, import the BLAST results using File>Import from BLAST. Select one of the Megablast XML files, then go to the sheet LCA Params and change the settings to the following: Min support=1, Min score=155 and Top Percent=2. Also, tick the box “Use Percent Identity Filters”. Then, click Apply!

An interactive taxonomy tree will now appear, listing the number of assignments to different taxa and also symbolising the abundance of each taxon by a circle whose area is proportional to the number of assignments. Try to explore the tree by collapsing and uncollapsing nodes and by right-clicking a node and selecting Inspect.
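To illustrate how the Min score and Top Percent settings interact, here is a minimal Python sketch of a lowest common ancestor (LCA) assignment. It follows the general idea of the LCA approach and is not the actual code of MEGAN or the LCAClassifier: the function name lca_assign, the input format and the example hits are hypothetical, and the Min support setting and the percent identity filters are deliberately left out.

```python
def lca_assign(hits, min_score=155.0, top_percent=2.0):
    """Assign a read to the lowest common ancestor of its best hits.

    hits: list of (bit_score, path), where path lists taxon names from the
    root downwards. This input format is assumed for illustration only.
    """
    # 1) Discard hits below the minimum bit score
    kept = [(score, path) for score, path in hits if score >= min_score]
    if not kept:
        return []                      # corresponds to "No hits"
    # 2) Keep only hits scoring within top_percent of the best hit
    best = max(score for score, _ in kept)
    threshold = best * (1.0 - top_percent / 100.0)
    paths = [path for score, path in kept if score >= threshold]
    # 3) Walk down from the root while all remaining hits agree
    lca = []
    for names in zip(*paths):
        if all(n == names[0] for n in names):
            lca.append(names[0])
        else:
            break
    return lca


if __name__ == "__main__":
    example_hits = [
        (500.0, ["Archaea", "Euryarchaeota", "Methanomicrobia"]),
        (495.0, ["Archaea", "Euryarchaeota", "Halobacteria"]),
        (300.0, ["Bacteria", "Proteobacteria"]),
    ]
    print(lca_assign(example_hits))
```

In the example, the third hit falls outside the top 2% and is ignored, and the two remaining hits disagree at class level, so the read is assigned to Euryarchaeota. A larger Top Percent lets more hits take part and tends to push assignments towards the root.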