Metagenomics

Metagenomics or marker gene studies
Jarno Tuimala, PhD
RS-koulutus / SPR Veripalvelu
Program
10.15-11.45 data preprocessing and alignment
11.45-12.30 lunch
12.30-14.00 sequence filtering
14.00-14.15 coffee break
14.15-16.15 sequence classification, statistics
16.15-16.45 wrap-up, feedback etc.
And other breaks when needed
Aims of the course
• To learn the essential analysis steps for
marker gene studies
• To learn how to use mothur software for
performing the analyses
• Alternating between lectures and hands-on
exercises
Metagenomics
1. Shotgun sequencing of niche-specific samples
– Needs assembly (putting the sequence pieces in a
correct order and to the correct species)
2. Sequencing certain marker genes, such as
16S rRNA, RecA, RpoB
– Sequence alignment is enough
• In fact, calling the latter metagenomics is
misleading. Marker gene studies have been
performed for at least 20 years.
Sanger Sequencing
from Wikipedia
Roche 454 pyrosequencing
Review: Medini et al, 2008
16S rRNA sequences from 454
• Primer (adaptor key in the end) + tag + template specific
primer + 16S rRNA sequence
• Adaptor contains a biotin tag and is needed during
sequencing of the template
• Tag allows discrimination between samples (barcoding)
• Primer is needed for amplification of the sequence
• These are usually removed from the sequences before the
actual analysis
• Example: TCAGTACTCGGCCTACGGGAGGCAGCAG
Electrofluorogram
Base calling bias (454)
http://www.biomedcentral.com/1471-2164/12/245
Files needed on this course
• Demodata 1 (SOP) and Demodata 2 (Turku)
– Extract to the Desktop
• Software
– Extract to the Desktop
– Copy mothur.exe and uchime.exe to the demo data folder
– Seaview
– R
• Reference sequences
– Originally from Silva.bacteria.zip, Silva.gold.bacteria.zip and
Trainset9_032012.rdp.zip (from mothur wiki), but repackaged for the
course
– Extract to the Desktop
– After preparing the input data, copy to the data folder
• Scripts
– Used for the statistical analyses
Demodata 1
• SOP data from mothur
• Mouse gut microbiome followed after weaning
• Data is for one animal only, but ten time points
• We are using sequence files in fasta format and
corresponding quality info files
• If you need to process sff files, please see the SOP
data example in the mothur wiki, and the example on
the forthcoming slides
– You’ll need a lookup file from
http://www.mothur.org/wiki/Lookup_files
Demodata 2
• Ruminant data from Turku AMK
• Sample numbers indicate the individual cows,
the sample character indicates the primer set.
• Altogether 12 samples.
• All in SFF format!
Keep a lab book!
• Use some text editor (Notepad!) to first type in
the commands, and then copy and paste them
into mothur.
• This creates a file of mothur commands that can
then later on be used as batch file for processing
the data.
• It also documents what you have done.
• Better to adopt this habit sooner rather than
later.
Work flow
• Prepare input data
• Trim sequences
• Unique sequences
• Align sequences
• Screen sequences
• Filter sequences
• Pre-cluster
• Remove chimeric sequences
• Classify sequences
• Analyses...
File formats
• FastA
>ERR051700.1 FXCMJDV02HWG0G length=10
TCAGTACTCG
>ERR051700.2 FXCMJDV02GE0HC length=10
TCAGTACTCG
• Qual
>ERR051700.1 FXCMJDV02HWG0G length=10
37 37 37 37 34 28 29 29 29 26
>ERR051700.2 FXCMJDV02GE0HC length=10
37 37 37 37 40 40 40 40 40 40
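To make the relation between the two formats concrete, here is a minimal Python sketch (not part of mothur; the function name read_records is hypothetical) that reads the example records above and checks that each base has one quality score.

```python
from io import StringIO

def read_records(handle):
    """Yield (identifier, body_lines) for FASTA-style text (fasta or qual)."""
    name, body = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            # The identifier is the first word after '>'.
            name, body = line[1:].split()[0], []
        else:
            body.append(line)
    if name is not None:
        yield name, body

fasta_text = """>ERR051700.1 FXCMJDV02HWG0G length=10
TCAGTACTCG
>ERR051700.2 FXCMJDV02GE0HC length=10
TCAGTACTCG
"""
qual_text = """>ERR051700.1 FXCMJDV02HWG0G length=10
37 37 37 37 34 28 29 29 29 26
>ERR051700.2 FXCMJDV02GE0HC length=10
37 37 37 37 40 40 40 40 40 40
"""

seqs = {name: "".join(body)
        for name, body in read_records(StringIO(fasta_text))}
quals = {name: [int(q) for q in " ".join(body).split()]
         for name, body in read_records(StringIO(qual_text))}

for name, seq in seqs.items():
    # One quality score per base is expected.
    print(name, seq, len(seq) == len(quals[name]))
```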
Preparing the data, option 1
• Open the command shell (cmd.exe), and
change to the directory (cd) containing the
data files
• Append all the fasta and all the qual files
together
– copy /b *.fasta all.fasta
– copy /b *.qual all.qual
Preparing the data, option 2
• If the data is in sff format, fasta and qual files
can be extracted from it with the mothur
command sffinfo():
– sffinfo(sff=454Reads.RL1.sff)
• And then continue with the fasta and qual
files as laid out on the previous slide
Exercise A
• Extract demo data to Desktop (or some other
folder where you have write rights)
• Prepare the data for further analysis by
appending the fasta and qual files into a single
fasta file and a single qual file
• Once done, extract mothur to the same place,
and copy mothur.exe and uchime.exe from the
subfolder mothur to the demo data folder
Mothur
• Mothur is one of the tools meant for analysis of
marker gene studies
• Developed by Pat Schloss’ research group at
the University of Michigan
• Freely available command line program
• http://www.mothur.org/
• http://www.mothur.org/wiki/Mothur_manual
• http://www.mothur.org/wiki/Analysis_examples
Using mothur
• Installation, several possibilities
– Copy to a certain folder, add to Windows’ path
variable, invoke from command line (cmd.exe)
– Copy to the folder where the data resides, and
start by double clicking on it
• Once mothur starts, it takes you to its own
environment.
– mothur >
Mothur commands
• See the manual on the web for commands.
– http://www.mothur.org/wiki
• Alternatively, type help() at the mothur prompt to
list all commands.
• Help on a particular command can be invoked by,
e.g., summary.seqs(help).
• Commands are always followed by parentheses,
and the command arguments are given inside the
parentheses.
– summary.seqs(fasta=all.fasta)
Mothur data
• Mothur does all the operations in the memory,
but saves the data files directly on the hard drive
to the working directory.
• It also keeps a logfile that records everything that
is written on the screen (mothur console), and
sometimes a few extra things.
• After you’ve run a command, please follow
the screen output closely, and check the files
that were created in the working directory.
Trim sequences 1
• Input
– Fasta file (prepared at prepare data step)
– Quality file (prepared at prepare data step)
– Oligo file (Will be prepared now)
• Output
– Fasta file of good sequences (all.trim.fasta)
– Fasta file of scrap sequences (all.scrap.fasta)
– Groups file (indicates which group/sample each
sequence belongs to) (all.groups)
Trim sequences 2
• Oligo file
– Tab-delimited
– Lines can be commented out with a #
– “The oligos option takes a file that can contain the sequences of the
forward and reverse primers and barcodes and their sample identifier.
Each line of the oligos file can start with the key words "forward",
"reverse", "barcode", "linker" and "spacer" or it can start with a "#" to
tell mothur to ignore that line of the oligos file.” [Mothur wiki]
• Example:
forward	CCTACGGGAGGCAGCAG
#reverse	GTATTACCGCGGCTGCTG
barcode	TCAGTAGCGCG	ERR051699
barcode	TCAGTACTCGG	ERR051702
barcode	TCAGTAGCGCC	ERR051705
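How such an oligo file is used can be sketched in Python. This is an illustration only: it matches the barcode and forward primer exactly at the start of the read, whereas mothur's trim.seqs also tolerates mismatches (bdiffs, pdiffs). The function names and the trailing ACGT of the read are hypothetical; the start of the read is the example sequence from the 454 slide.

```python
oligos_text = """forward\tCCTACGGGAGGCAGCAG
#reverse\tGTATTACCGCGGCTGCTG
barcode\tTCAGTAGCGCG\tERR051699
barcode\tTCAGTACTCGG\tERR051702
barcode\tTCAGTAGCGCC\tERR051705
"""

def parse_oligos(text):
    """Return (forward_primers, {barcode: sample}) from oligo file text."""
    forward, barcodes = [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip empty and commented-out lines
        key = fields[0].lower()
        if key == "forward":
            forward.append(fields[1])
        elif key == "barcode":
            barcodes[fields[1]] = fields[2]
    return forward, barcodes

def assign_read(read, forward, barcodes):
    """Return (sample, remaining sequence), or (None, read) if no match."""
    for barcode, sample in barcodes.items():
        if read.startswith(barcode):
            rest = read[len(barcode):]
            for primer in forward:
                if rest.startswith(primer):
                    return sample, rest[len(primer):]
    return None, read

forward, barcodes = parse_oligos(oligos_text)
read = "TCAGTACTCGGCCTACGGGAGGCAGCAG" + "ACGT"
sample, rest = assign_read(read, forward, barcodes)
print(sample, rest)
```

The barcode decides the sample, and both the barcode and the primer are stripped from the read before the analysis.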
Trim sequences 3
• Options
– fasta: fasta file
– oligos: oligo file
– qfile: quality file
– qaverage: average sequence quality
– qwindowaverage: average quality of window
– qwindowsize: window width
– flip: reverse complement seqs
– maxambig: max. number of ambiguous bases
– maxhomop: max. length of homopolymer
– bdiffs: max. diffs in barcode
– pdiffs: max. diffs in primers
– ldiffs: max. diffs in linkers
– sdiffs: max. diffs in spacers
– tdiffs: max. diffs in total
Trim sequences 4
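T C A G T A C T C G
40 40 40 40 40 40 40 40 37 35
• qaverage = 40 -> delete the sequence (its average
quality, 39.2, is below 40)
T C A G T A C T C G
40 40 40 40 40 40 40 40 37 35
• qwindowaverage = 40, qwindowsize = 3 -> the average is
computed over a sliding window of 3 bases; the first window
whose average falls below 40 is (40, 40, 37), so the
sequence is trimmed there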
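The two quality criteria can be checked by hand with a short Python sketch. It only computes the averages for the example scores; exactly where mothur cuts the sequence is not reproduced here, and the function names are hypothetical.

```python
def mean_quality(quals):
    """Average quality over the whole sequence (cf. qaverage)."""
    return sum(quals) / len(quals)

def first_bad_window(quals, threshold, size):
    """Return the 0-based start of the first window whose average is
    below the threshold (cf. qwindowaverage/qwindowsize), or None."""
    for start in range(len(quals) - size + 1):
        window = quals[start:start + size]
        if sum(window) / size < threshold:
            return start
    return None

quals = [40, 40, 40, 40, 40, 40, 40, 40, 37, 35]

average = mean_quality(quals)
bad_start = first_bad_window(quals, threshold=40, size=3)
print(average)    # average over all ten bases
print(bad_start)  # start position of the first failing window
```

With these scores the whole-sequence average is 39.2, and the first failing window starts at the seventh base (index 6).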
Trim sequences 4
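• flip=T
– TCAGTACTCG -> CGAGTACTGA (reverse complement)
– Intermediate step, the complement: AGTCATGAGC
• maxambig = 1
– TCAGTACNCG (one ambiguous base, N: still accepted)
• maxhomop = 3
– TCAGTTTTCG (a homopolymer of four T's: removed)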
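The three checks on this slide are simple enough to sketch in Python. This illustrates the ideas only and is not mothur's code; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence (cf. flip=T)."""
    return seq.translate(COMPLEMENT)[::-1]

def count_ambiguous(seq):
    """Number of bases that are not A, C, G or T (cf. maxambig)."""
    return sum(1 for base in seq.upper() if base not in "ACGT")

def longest_homopolymer(seq):
    """Length of the longest run of one repeated base (cf. maxhomop)."""
    longest = run = 0
    previous = None
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest

print(reverse_complement("TCAGTACTCG"))
print(count_ambiguous("TCAGTACNCG"))
print(longest_homopolymer("TCAGTTTTCG"))
```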
Common problems
• The sequences are not in the same order in the
fasta and in the qual files.
• But, mothur assumes that they are, and gives an
error if this is not the case.
• Solutions:
– Do not use quality file when trimming the sequences
– Sort the sequences and the quality files
– Sometimes not all sequences in the fasta file have a
match in the qual file; then the only option is to
trim without the quality file
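One way to sort the quality records into the fasta order is sketched below in Python. It assumes the records have already been read into memory (names and scores here are hypothetical) and also reports the reads that have no quality record at all.

```python
def reorder_quals(fasta_names, qual_records):
    """Return the quality records in fasta order, plus the names of
    reads that have no quality record."""
    ordered, missing = [], []
    for name in fasta_names:
        if name in qual_records:
            ordered.append((name, qual_records[name]))
        else:
            missing.append(name)
    return ordered, missing

# Hypothetical data: the qual file is in a different order and lacks read3.
fasta_names = ["read2", "read1", "read3"]
qual_records = {"read1": [40, 38], "read2": [37, 35]}

ordered, missing = reorder_quals(fasta_names, qual_records)
print(ordered)
print(missing)
```

If the list of missing reads is not empty, those reads would have to be dropped from the fasta file as well, or the trimming done without the quality file.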
Trim sequences 5
• Example commands
> summary.seqs(fasta=all.fasta)
> trim.seqs(fasta=all.fasta,
qfile=all.qual, oligos=oligo.tsv,
pdiffs=2, bdiffs=1, ldiffs=1, sdiffs=1,
qaverage=25, flip=T)
> trim.seqs(fasta=all.fasta,
oligos=GQY1XT001.oligos, pdiffs=2,
bdiffs=1, ldiffs=1, sdiffs=1,
maxambig=0, maxhomop=8, flip=T)
> get.current()
> summary.seqs(fasta=all.trim.fasta)
Exercise B
• Run mothur by double clicking on it. It should open in a
new DOS window.
• Trim the sequences of demodata using a) an average
quality score of 25 for each sequence OR b) a sliding
window method with a window quality score of 25 OR
c) removal of oligo sequences and of sequences with too
many ambiguous bases or too long homopolymers (this is
handy if the other options fail).
• Run summary.seqs() after each trimming to check how
many sequences were retained in the data.
• Which method would you pick for further processing
steps, and why?
Remove non-unique seqs
• Why?
– Sequence amplification can sometimes produce large
amounts of identical sequences, even if there are really not
that many organisms present that have that particular
sequence -> artefacts.
unique.seqs(fasta=all.trim.fasta)
get.current()
summary.seqs(fasta=current)
•Output files:
–Sequences (all.trim.unique.fasta)
–List of sequence groups (all.trim.names)
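What the two output files contain can be illustrated with a small Python sketch. It is a simplified illustration, not mothur's implementation: identical sequences are collapsed to one representative, and a names table records which reads each representative stands for. The read names are hypothetical.

```python
def unique_seqs(records):
    """records: iterable of (name, sequence).
    Returns (unique sequences, names table)."""
    groups = {}  # sequence -> read names, in first-seen order
    for name, seq in records:
        groups.setdefault(seq, []).append(name)
    # The first read seen with a given sequence acts as its representative.
    unique = [(names[0], seq) for seq, names in groups.items()]
    names_table = [(names[0], ",".join(names)) for names in groups.values()]
    return unique, names_table

records = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA"), ("r4", "ACGT")]
unique, names_table = unique_seqs(records)
print(unique)
print(names_table)
```

No reads are lost in this step: the names table keeps track of the abundance of each unique sequence.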
Sequence alignment
• Idea is to line up the sequences so that the similarity
(score) of sequences is maximized (description of what the
computer does)
• Example of a pairwise global alignment (an alignment of
two sequences along their whole lengths)
tgagttgaact
tgagt-gagc-
• Minus (-) signs are called gaps
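• Gaps can be modelled using gap opening and extension
penalties, e.g., match=5, mismatch=0, opening=-2,
extension=-1. The alignment above has 8 matches, 1 mismatch
and 2 one-position gaps (counting the gap at the end of the
second sequence); charging each gap the opening penalty
gives 8*5 + 1*0 + 2*(-2) = 36. The computer tries to
maximize this alignment score.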
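Scoring a given alignment can be sketched in Python as follows. The sketch assumes one particular convention: the first position of a gap costs the opening penalty and every further position costs the extension penalty. Other conventions (e.g., free end gaps) give other totals. The two sequences are hypothetical and were chosen to include a two-position gap.

```python
def score_alignment(a, b, match=5, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two already aligned sequences of equal length.
    The first position of a gap costs gap_open, later ones gap_extend."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

# 4 matches (4 * 5) and one gap of length 2 (-2 - 1).
total = score_alignment("acgtta", "ac--ta")
print(total)
```

An alignment program searches over all possible gap placements for the one with the highest such score; here only a single, fixed alignment is scored.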
Alignment in mothur
• Sequences are aligned, one by one, against a set
of reference sequences.
• This usually makes the 16S rRNA alignment
better than other options (multiple alignment or
alignment of several sequences), but this does
not necessarily hold for all possible genes!
• By default, mothur performs a global pairwise
alignment (Needleman-Wunsch algorithm) where
gap opening and extension are penalized equally.
• See, e.g.,
http://koti.mbnet.fi/tuimala/oppaat/bioinfolaaja.pdf for more info on alignments.
Alignment in mothur
• Example commands
align.seqs(fasta=all.trim.unique.fasta,
reference=silva.bacteria.fasta)
summary.seqs(fasta=current, name=current)
• The reference set silva.bacteria.fasta or some other
offered by, e.g., mothur site, is obligatory!
• After the alignment is ready, it can be checked in some
alignment software, such as Seaview.
• Output files:
– aligned sequences (all.trim.unique.align)
– alignment report listing the sequences and the reference
sequences they were aligned against (all.trim.unique.align.report)
Exercise C
• Remove the non-unique sequences
• Copy the reference sequence file
silva.bacteria.fasta to the data folder
• Align the unique sequences against the
reference sequence set
• Open the resulting alignment file in the
Seaview editor, and check the results visually
Filtering aligned sequences
• After aligning the sequences, it is a good idea to tidy up
the alignment
tgagttgaact (reference)
..agt-gag..
..agg-gaa..
• Remove all gaps-only columns (-)
• Remove all columns containing a specific character (.).
align.seqs() fills the positions before the start and after
the end of each sequence with dots. But a single mis-aligned
sequence can lead to the whole alignment being
deleted!
Filtering
• Example mothur command
> filter.seqs(fasta=current, vertical=T,
trump=.)
> summary.seqs(fasta=current, name=current)
• Result:
agtgaa (reference)
agtgag
agggaa
•Output files:
– Trimmed and aligned sequences (all.trim.unique.filter.fasta)
– Filter mask (all.filter)
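The effect of the two filter options can be reproduced on the example alignment with a short Python sketch. It illustrates the idea only and is not mothur's implementation; the function name is hypothetical.

```python
def filter_alignment(seqs, trump=".", vertical=True):
    """Remove columns from aligned sequences of equal length.
    trump: drop a column if any sequence has this character in it.
    vertical: drop columns that contain only gap characters."""
    keep = []
    for col in range(len(seqs[0])):
        column = [s[col] for s in seqs]
        if trump is not None and trump in column:
            continue
        if vertical and all(c in "-." for c in column):
            continue
        keep.append(col)
    return ["".join(s[c] for c in keep) for s in seqs]

aligned = ["..agt-gag..", "..agg-gaa.."]
filtered = filter_alignment(aligned)
print(filtered)
```

Because one dot is enough to remove a column, a single sequence that aligns to a different region removes every column, which is why the number of remaining sequences and their length should be checked after this step.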
Preclustering
• The aim is to remove sequences that are
probably due to sequencing errors.
• Common sequences are assumed to generate
erroneous sequences more often than rare
sequences.
• Sequences are first ranked according to
abundance, and then the list is walked through.
• Sequences at a certain (edit) distance from each
other (threshold) are grouped together and
merged into a single sequence.
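The steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not mothur's pre.cluster code: sequences are aligned and of equal length, the distance is the number of differing positions, and the sequences and counts are hypothetical.

```python
def count_diffs(a, b):
    """Number of positions where two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def pre_cluster(counts, diffs=1):
    """counts: dict of aligned sequence -> abundance.
    Rarer sequences within `diffs` differences of a more abundant
    sequence are merged into it. Returns the merged abundances."""
    # Most abundant first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    merged = {}
    absorbed = set()
    for i, seq in enumerate(ranked):
        if seq in absorbed:
            continue
        total = counts[seq]
        for other in ranked[i + 1:]:
            if other not in absorbed and count_diffs(seq, other) <= diffs:
                total += counts[other]
                absorbed.add(other)
        merged[seq] = total
    return merged

counts = {"acgtacgt": 10, "acgtacga": 2, "acgtaaga": 1, "ttgtacgt": 3}
merged = pre_cluster(counts, diffs=1)
print(merged)
```

Here the sequence seen twice differs from the most abundant one at a single position and is merged into it; the other two are further away and stay separate.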
Preclustering
• Example commands
– Threshold of one
> pre.cluster(fasta=current, name=current,
diffs=1)
> summary.seqs()
• Output files:
– Aligned sequences (all.trim.unique.filter.precluster.fasta)
– Grouping of sequences (all.trim.unique.filter.precluster.names)
– Groupwise sequence alignments
(all.trim.unique.filter.precluster.map)
Remove chimeras
• Chimeric sequences are artifacts that contain
parts of at least two different parent
sequences.
• These are typically removed from the data before
further processing steps.
• Several methods, we’ll use UCHIME, because it is
currently thought to be the most exact (and the fastest)
algorithm
• Two options for UCHIME:
– Compare against a reference set
– Compare inside the sequenced set only (more accurate)
Remove chimeras
• Example commands
> chimera.uchime(fasta=all.trim.unique.filter.precluster.fasta, name=all.trim.unique.filter.precluster.names)
> remove.seqs(accnos=all.trim.unique.filter.precluster.uchime.accnos, fasta=all.trim.unique.filter.precluster.fasta, name=all.trim.unique.filter.precluster.names, group=all.groups)
> summary.seqs()
> get.current()
• Output files:
– Grouping of sequences into clusters (all.trim.unique.filter.precluster.pick.names)
– Sequences (all.trim.unique.filter.precluster.pick.fasta)
– Grouping of sequences into samples or lanes (all.pick.groups)
Exercise D
• Perform a) filtering, b) preclustering and c)
chimera check for the sequence set you’re
working with.
• Be careful with the filtering, and check
meticulously that you still have sequences left
after that step. If not, skip it altogether.
• How many sequences have been filtered away
from the original data after completing these
preprocessing steps?
Analyses
• Operational taxonomic unit (OTU)-based
analyses
– Usually OTUs (~species) are delineated with a 3%
sequence dissimilarity, and higher taxa with
increasingly larger dissimilarity
– OTUs or groups of OTUs can later be assigned
taxonomic names
• Phylotype analyses
– Sequences are directly assigned to taxa on the
basis of reference sequences
Classification
• Sequences are assigned to groups of species
(taxa) (class/order/family/genus/species)
• Sequences are compared to a reference set, and
the taxon that the sequence is most similar to is
assigned to the sequence.
• Comparison is done using short stretches of
sequence at a time (kmers). This is combined
with bootstrapping, so we get a confidence
estimate for the classification.
• The method is described by Wang et al. and is also used by
RDP. Some consider it rather good.
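The idea of combining k-mers with bootstrapping can be sketched in Python. This is a strongly simplified illustration and NOT the method of Wang et al.: instead of their probabilistic model it just counts shared k-mers. The word length of 8, the subsample of one eighth of the words, the taxon labels and the sequences are all assumptions of this sketch.

```python
import random

def kmers(seq, k=8):
    """All overlapping words of length k in the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def classify(query, references, k=8, iters=100, seed=1):
    """Return (taxon, bootstrap support between 0 and 1).
    Assumes the query is at least k bases long."""
    rng = random.Random(seed)
    ref_kmers = {taxon: set(kmers(seq, k)) for taxon, seq in references.items()}
    query_kmers = kmers(query, k)
    sample_size = max(1, len(query_kmers) // 8)

    def best_taxon(words):
        # Taxon whose reference shares the most of the given words.
        return max(ref_kmers, key=lambda t: sum(w in ref_kmers[t] for w in words))

    assigned = best_taxon(query_kmers)
    hits = 0
    for _ in range(iters):
        # Bootstrap: classify again using a random subset of the words.
        sample = [rng.choice(query_kmers) for _ in range(sample_size)]
        if best_taxon(sample) == assigned:
            hits += 1
    return assigned, hits / iters

references = {
    "taxon_A": "ACGTTGCAAGTCGAGGGGCAGCATGGTCTTAGCTTGC",
    "taxon_B": "TTGACCGGTAGCCATTGACGGTACCTGACTAAGAAGC",
}
query = references["taxon_A"][:24]
taxon, support = classify(query, references)
print(taxon, support)
```

The support is the fraction of bootstrap rounds that agree with the assignment; a low value means that the classification depends on only a few of the words.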
Classification
• Example command
> classify.seqs(fasta=all.trim.unique.filter.precluster.pick.fasta, template=trainset9_032012.rdp.fasta, taxonomy=trainset9_032012.rdp.tax, iters=1000)
• Output files:
– Taxonomy assignment for each sequence
(all.trim.unique.filter.precluster.pick.rdp.wang.taxonomy)
– Taxa specific overview of the classification
(all.trim.unique.filter.precluster.pick.rdp.wang.tax.summary)
Final touch
system(copy all.trim.unique.filter.precluster.pick.rdp.wang.taxonomy final.all.taxonomy)
system(copy all.trim.unique.filter.precluster.pick.names final.all.names)
system(copy all.trim.unique.filter.precluster.pick.fasta final.all.fasta)
system(copy all.pick.groups final.all.groups)
Exercise E
• Classify the sequences using the RDP
templates.
• Copy the final result files to new names as
suggested on the previous slide.
• All downstream analyses are performed on
the final result files.
Analysis methods
• Visual
– Rarefaction curves
• Checking whether the species sampling has been thorough
enough
– Frequency-based visualizations
• Rank-abundance plots
– Heatmap
– Ordination methods
• Effect of ”environmental” factors on species composition
• Statistical (hypothesis testing approach)
– AMOVA
– etc.
Hypothesis testing approach
• Tests for comparing ”populations”
– Homogeneity of molecular variance (HOMOVA)
– Analysis of molecular variance (AMOVA)
– Analysis of similarity (ANOSIM)
– Libshuff
– Indicator species approach
• Ordination methods
– Redundancy analysis (RDA)
– Canonical correspondence analysis (CCA)
Ordination analysis
• Takes a species count table (rows=taxa, columns=samples,
cells=frequency)
• Additionally takes a comparable matrix of environmental
measurements
• Creates an image, and allows testing for the significance of
the environmental factors to the species occurrence or
frequency
• Software:
– Ginkgo
• http://biodiver.bio.ub.es/veganaweb/main/?section=../bvegana/content.jsp
– R
– Canoco
Ordination approaches
By Pierre Legendre
Jarno Tuimala, 2011
RDA, an example plot
Statistical analyses - diversity
• Contributed diversity
– alpha
• diversity inside an area or ecosystem (species richness)
– beta
• diversity between ecosystems
– gamma
• overall diversity of all ecosystems in a particular area
• Diversity can be measured with different indexes, such
as Shannon entropy or just the count of species (but
the species count is dependent on the sampling depth,
which can be checked using the rarefaction curves)
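Both measures mentioned above are easy to compute from a vector of taxon counts. The Python sketch below uses the natural logarithm for the Shannon entropy; the counts are hypothetical.

```python
import math

def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon entropy H = -sum(p * ln p) over the observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

even = [10, 10, 10, 10]    # four equally common taxa
skewed = [37, 1, 1, 1]     # same richness, one dominant taxon

print(richness(even), shannon(even))
print(richness(skewed), shannon(skewed))
```

Both samples have the same species count, but the Shannon entropy is lower for the sample dominated by one taxon, because the index also reflects how evenly the reads are spread over the taxa.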
Statistical analyses – comparing groups
• Do the groups differ in species composition?
– Permutational Multivariate Analysis of Variance
Using Distance Matrices
– Multivariate homogeneity of groups dispersions
(variances)
– Analysis of Molecular Variance
• Based on a (euclidean) distance matrix between
sequences
• Distances (or their variance, to be more exact) are
partitioned according to a grouping variable into a
within group and between groups variance (this is
similar to standard one-way ANOVA)
Indicator species approach
• Which taxa differentiate between
the groups in the best possible way?
Running the analyses
• Browse to the R-2.15.2-TurkuAMK/i386/bin folder and run Rgui.exe.
• Select ’Source R code’ from the File menu, and select the analyses.R
script file.
– R will then prompt you to select the folder where the groups and
taxonomy files are located.
– It reads them in, and writes out a count table, and a phenodata table
(description of the experiment).
– Fill in the group column of the phenodata table in Notepad, and save
the file.
• Select ’Source R code’ from the File menu, and select the
statistics.R script file.
– R will then prompt you to select the folder where the groups and
taxonomy files are located.
• The result should be one PDF file and one txt file containing the
results of the analyses.
Exercise F
• Run the statistical analysis as specified on the
previous slide.
• Interpret the results – are there differences in
the species composition between the groups?