Tutorial 2 - Taxonomical classification

advertisement
Tutorial 2
Taxonomical classification of
SSU rRNA (16 / 18S) data
Extracting ecological signal from noise:
introduction to tools for the analysis of
NGS data from microbial communities
Bergen, 19-20 April 2012
Anders Lanzén
Overview
• The two amplicon datasets from Tutorial 1, de-noised
• Alignment to SSU rRNA reference database using Megablast
• Prediction of taxonomical composition and diversity with
LCAClassifier
• Comparison of composition between datasets
• “Grepping” sequences from a particular taxon
• Classification using the Greengenes taxonomy
• Classification using the RDP Classifier
• Optional: Using MEGAN with the CREST databases for
classification
2
Copy data and check installation
LCAClassifier is the program in the package CREST (Classification
Resources for Environmental Sequence Tags) that is responsible for
taxonomical classification of sequences based on Megablast
alignments to a reference database. On the course lab computers, this
program is installed in the directory /usr/local/lcaclassifier. The
installation directory contains a configuration file that defines the
location of database files that the LCAClassifier needs in order to map
Megablast results to a taxonomy. Have a look at this file:
less /usr/local/lcaclassifier/parts/etc/lcaclassifier.conf
By default, two databases are available: the modified version of the
Silva Taxonomy, called SilvaMod and Greengenes. These are
downloaded when LCAClassifier is installed under the folder
parts/flatdb. Each database contains the following files, named like
the database itself, e.g. for silvamod:
1) silvamod.fasta.n* (.nhr, .nin and .nsq) –binary search index
files for BLAST / Megablast. The FASTA-formatted sequences
themselves are not needed by BLAST and not downloaded by
default, to speed up the installation.
2) silvamod.tre - a text file in Newick tree format that defines the
topology of the taxonomical tree
3) silvamod.map - a tab-separated text file that specifies a name
and rank for each taxon ID in the tree-file
In this tutorial we will classify the de-noised files resulting from the
previous AmpliconNoise tutorial, called SnotDNA_F_Good.fa and
SnotRNA_F_Good.fa. We will also use the sequence representatives for
each OTU (Both_Good.otus.fasta). If you reached the end of the last
tutorial, make a new directory under your $HOME called Tutorial2
and copy these files to it:
cd
mkdir Tutorial2
cp Tutorial1_AN/*_F_Good.fa Tutorial2/
cp Tutorial1_AN/Both_Good.otus.fasta Tutorial2/
3
Hint: In case you did not have time to finish Tutorial1, they are also
found in the folder Solution inside of the Tutorial1 folder.
Then go to the new directory you have created for this tutorial:
cd Tutorial2
4
Aligning files to the reference
database (SilvaMod) using Megablast
The first step in the CREST workflow is alignment to a reference
database, using Megablast, which is a faster version of BLAST (Basic
Local Alignment Search Tool) for nucleotide sequence alignments.
We will use the SilvaMod database made from SILVA’s SSURef
alignment of full-length SSU rRNA sequences, release 106. Using the
resulting alignments (BLAST results), taxonomical classification is
then done with the LCAClassifier program (or alternatively, MEGAN).
The LCAClassifier only supports results from the NCBI blastall
implementation of Megablast, in XML format. It does not work with
results from the newer BLAST+ implementation.
Align the sequences in SnotDNA_F_Good.fa to SilvaMod using:
megablast -i SnotDNA_F_Good.fa -b 100 -v 100 -m 7
-d
/usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.
fasta -a 4 -o SnotDNA_silvamod.xml
The arguments given to the megablast command mean:
-i : the nucleotide sequence input file
-b 100 -v 100 : Report only the 100 best alignments for each
sequence
-m 7: Produce output in XML format
-d : the reference database to use*.
-a 4 : Use four CPU cores.
-o : destination file of the Megablast output
*Note that the file silvamod.fasta itself is actually missing by default, but is not
needed since BLAST only uses the search index files produced by the
command formatdb from silvamod.fasta.
If you don’t get any error message, the Megablast alignment has
probably worked fine. You can also have a look at the resulting file in
a text editor, to familiarise yourself with the BLAST XML format,
which is the same for Megablast and normal BLAST. However, it is
not really intended for being read by human beings and is a bit too
structured for us.
5
Then, align the other two FASTA-files :
megablast -i SnotRNA_F_Good.fa -b 100 -v 100 -m 7 -d
/usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.fasta
-a 4 -o SnotRNA_silvamod.xml
megablast -i Both_Good.otus.fasta -b 100 -v 100 -m 7 -d
/usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.fasta
-a 4 -o OTUs_silvamod.xml
This should only take a couple of minutes per file, but if things are
slow, it’s a good opportunity for a short coffee break.
Taxonomical classification using
LCAClassifier
Classification with the LCAClassifier program using default
parameters is quite easy and carried out using the command classify.
To classify the Megablast-aligned SnotDNA, simply type:
classify SnotDNA_silvamod.xml
This will result in two files. The first, named
SnotDNA_silvamod_Composition.txt, is a tab-separated text file,
presenting the number of reads, relative abundance, unique
sequences and a Chao-estimate of minimum diversity for each taxon,
at different rank levels (domain, phylum, class, order, family and
genus). The number of reads classified at each rank level is also
summarised. Open this file in a spreadsheet editor (like OpenOffice)
- What proportion of the original reads could be classified to at
least family level?
- What is the relative abundance of Sulfurimonas and how many
unique sequences are represented in this genus? What is the
Chao estimate?
The other file, SnotDNA_silvamod_Tree.txt, lists the number of reads
assigned in a simple space-delimited tree-format. Here are the first
ten lines of the file:
6
head SnotDNA_silvamod_Tree.txt
root: 718
No hits: 2
Cellular organisms: 716
Archaea: 213
Korarchaeota: 2
Marine Benthic Group B - Deep Sea Archaeal Group (DSAG): 2
Miscellaneous Crenarchaeotic Group: 16
Thaumarchaeota: 9
Group 1A - pSL12: 2
FS243A-60: 2
Comparison of composition between
datasets
The LCAClassifier can also classify several Blast result files at the
same time, providing that they were aligned to the same reference
database. Try this out:
classify -o -p *.xml
This results in the same two types of result files for each dataset
(SnotDNA, SnotRNA and the OTU representative sequences). However,
empty assignments are inserted with zero abundance for each taxon
present in at least one dataset, where no assignments to the
corresponding taxon was made in a particular dataset. This helps to
compare datasets. The options –o and –b specify that results in
alternative output formats should also be written.
Open SnotDNA_Silvamod_Composition.txt and
SnotRNA_silvamod_Composition.txt in a spreadsheet editor (e.g.
LibreOffice). Copy the last four columns (“Abundance” to “Chao”) of
one file into the right of the columns of the other one, so that each
taxon is compared for the two datasets side by side.
- Can you find any taxa with an average abundance > 1% with
less than half the relative abundance in RNA compared to DNA?
7
- How about a dominant taxon with almost twice the abundance
in RNA?
- In the DNA dataset, what is the most diverse genus, in terms of
number of unique de-noised sequences?
- Have a look at the file All_Composition.txt. Can you figure out
how many DNA sequences that were assigned to the domain
Bacteria but could not be assigned to any particular phylum.
“Grepping” sequences from a
particular taxon
The LCAClassifier can also write output files that specify the
assignment for each individual sequence. Using option -p (or --rdp),
the identifier of each sequence and its predicted taxonomical path is
written with suffix “_Assignments.txt”. For example:
SnotRNA_47_5 Cellular organisms;Archaea;Euryarchaeota;
Methanomicrobia;ANME-1;ANME-1a↵
Another useful option is to write a new FASTA file with taxonomical
annotations added in the FASTA header. This is done using option -a
(or --fasta). By default, only the aligned part of each sequence is
written and entries that could not be aligned are omitted.
Alternatively, the entire sequences can be written. The fasta file to
read in sequences from then must be specified using -i (or --fastain).
To produce these files for the OTU representative sequences, type:
classify -i Both_Good.otus.fasta -a -p OTUs_silvamod.xml
This produces the files OTUs_silvamod_Assignments.fasta and
OTUs_silvamod_Assignments.txt. The first can be used for extracting
all sequences from a particular taxon, using the LINUX command
grep. For example, to save all archaeal sequences to the file A.fa:
grep Archaea OTUs_silvamod_Assignments.fasta -A1 > A.fa
8
-
How many are there?
(Hint: grep -ca ‘>’ A.fa)
Some of the sequences are assigned to “Unknown” taxa. This means
that the minimum sequence similarity filter of the LCAClassifer has
prevented it from being classified to a higher rank. These sequences
may be interesting because of their novelty. A.fa contains five such
sequences. Have a look at them:
grep Unknown -A1 A.fa
The last sequence (OTU23_1), is assigned to “Unknown Halobacteria
order”. This means that it is less than 90% similar to the closest
reference sequence and thus cannot be assigned to a particular order,
but instead to the class Halobacteria under Euryarchaeota. Copy the
sequence to NCBI Blast (http://www.ncbi.nlm.nih.gov/blast ) and
have a look at the closest sequence matches. The first two are from
the same sequence (as it is submitted), but the third is from another
hydrothermal sediment and is only 91% similar.
To read an overview of all options available with the LCAClassifier,
type:
classify --help
9
Classification using Greengenes
As an alternative to the SilvaMod Taxonomy, the LCAClassifier can
also be used together with the Greengenes Taxonomy. Similarly to
SILVA’s (and “SilvaMod”), this is basically a reference database
consisting of environmental sequences and type strains annotated
taxonomically based on the clustering, or in other words the topology
of the distance tree. The difference from SILVA is that the annotations
in Greengenes were made automatically, using a heuristic algorithm.
To read more about the Greengenes taxonomy, see McDonald et al
(2012), 'An improved Greengenes taxonomy with explicit ranks for
ecological and evolutionary analyses of bacteria and archaea', ISME J,
6:610-618.
As the reference database is different, the first step is to align the
sequences to the Greengenes reference database, which can be found
in /usr/local/lcaclassifier/parts/flatdb/greengenes. To align e.g. the
SnotDNA sequences:
megablast -i SnotDNA_F_Good.fa -b 100 -v 100 -m 7 -d
/usr/local/lcaclassifier/parts/flatdb/greengenes/greenge
nes.fasta -a 4 -o SnotDNA_GG.xml
To classify the sequences based on the Megablast result, you need to
tell the classify command to use the Greengenes taxonomy, since
SilvaMod is the default.
classify -d greengenes -p SnotDNA_GG.xml
Open SnotDNA_GG_Composition.txt in a spreadsheet editor.
- What proportion of reads could be classified at genus level?
- Compare it to the community composition based on the
SilvaMod taxonomy. What major differences can you find
between the results based o the two reference databases /
taxonomies, on family level?
10
Classification using the RDP Classifier
The RDP Classifier is a popular tool for taxonomic classification.
Instead of BLAST alignments, it uses nucleotide composition for
classification. It is an open source Java-program that can be
downloaded together with a default training dataset, currently in
version 2.4 with training dataset v7. The default training dataset
includes mainly cultured type strains and is much smaller than
Greengenes or SILVA SSURef, with about 10,000 sequences,
annotated with the RDP Taxonomy. RDP Classifier is also integrated
in the MG-RAST webserver and in the amplicon analysis program
package QIIME, where a training set from Greengenes with
corresponding taxonomical classifications is also included.
First, download the RDP Classifier program, which is distributed as a
java jar-file from http://sf.net/projects/rdp-classifier/. You can also
do this using the command wget:
wget http://downloads.sf.net/project/rdpclassifier/rdp-classifier/rdp_classifier_2.4.zip
Unzip it:
unzip rdp_classifier_2.4.zip
Then, to classify the Snot DNA dataset with it, use:
java -jar rdp_classifier_2.4/rdp_classifier-2.4.jar -q
SnotDNA_F_Good.fa -o SnotDNA_RDP.txt
The tab-separated (default) output format looks like this:
$ head SnotDNA_RDP.txt |tail -1
SnotDNA_s60_c01_T220_P_BC_s30_c08_10_2
Root norank
1.0
Bacteria
domain
1.0
"Proteobacteria" phylum
0.51
Deltaproteobacteria
class 0.25 Bdellovibrionales order 0.25
Bdellovibrionaceae
family
0.25 Bdellovibrio
genus
0.25↵
After the sequence name follows the classification at each rank,
starting with “Root”, followed by the name of that rank and then a
bootstrap “confidence” value. The developers recommend a cut-off at
11
0.8, which in this case means the sequence can only be classified at
domain level.
-
How was this sequence classified by the LCAClassifier using
SilvaMod?
Hint:
grep SnotDNA_s60_c01_T220_P_BC_s30_c08_10_2
SnotDNA_silvamod_Assignments.txt
-
How about with Greengenes?
There is a python script called rdpclassifierParse in the LCAClassifier
package for converting the RDP Classifier output into the same
spreadsheet-friendly composition overview format as that produced
by the classify command. To do this, use:
cd /usr/local/lcaclassifier
bin/python src/LCAClassifier/rdpclassifierParse.py
~/Tutorial2/SnotDNA_RDP.txt 0.8 >
~/Tutorial2/SnotDNA_RDP_Compo.txt
cd ~/Tutorial2
Open the file SnotDNA_RDP_Compo in a spreadsheet editor.
- What proportion of reads could be classified to at least phylum
level?
- Compare it to the community composition based on the
SilvaMod taxonomy. What differences can you find between the
results at phylum level?
12
Classification with MEGAN (optional)
MEGAN is not installed in this course lab, so you have to download
and install the latest version from the program’s website at
http://ab.inf.uni-tuebingen.de/software/megan/
Download the UNIX installation script using a web browser or with
wget:
wget http://ab.inf.unituebingen.de/data/software/megan4/download/MEGAN_unix_4
_67_1.sh
Then install the program by executing the downloaded script:
chmod +x MEGAN_unix_4_67_1.sh
./MEGAN_unix_4_67_1.sh
This will open a dialog box asking where to install the program.
Chose your $HOME directory under “megan” and tick the box “Create
symlinks”. Then start MEGAN simply by typing “MEGAN” (or ticking
“Run MEGAN”). The first time you start it you may have to enter a
command that opens up the possibility for using alternative
taxonomies (this is because the option is “locked” for now). In the
Window menu, click Command Input, which opens a command line
interface window to MEGAN. In this window, type first:
setprop allow-read-weights=true;
Then press Apply. Then type the following and click Apply:
setprop allow-read-weights-underscore=true;
Then re-start MEGAN. Now click “Use Alternative Taxonomy...” under
Edit>Preferences. Then navigate to the directory
/usr/local/lcaclassifier/parts/flatdb/silvamod and select the file
silvamod.tre. If everything works correctly, the SilvaMod taxonomy
should now be enabled and you should see the following output in
the Messages window:
13
Executing: load
treefile='/usr/local/lcaclassifier/parts/flatdb/silvamod/silv
amod.tre' [..]
File name:
/usr/local/lcaclassifier/parts/flatdb/silvamod/silvamod.tre'
Load mapping:
taxId2TaxLevel: 1179077
done: 302383
Load tree:
done: 302379 nodes, 302378 edges
Number of taxa: 302382
Now, import the BLAST results, using File>Import from BLAST. Select
one of the Megablast XML files, then go to the sheet LCA Params and
change the settings to the following: Min support=1, Min score=155
and Top Percent=2. Also, tick the box “Use Percent Identity Filters”.
Then, click Apply!
An interactive taxonomy tree will now appear, listing the number of
assignments to different taxa and also symbolising the abundance of
a taxon by a circle, whose area is proportional to the number of
assignments. Try to explore the tree by collapsing and uncollapsing
nodes and by right-clicking a node and selecting Inspect.
14
Download