Metagenomics or marker gene studies
Jarno Tuimala, PhD
RS-koulutus / SPR Veripalvelu

Program
10.15-11.45 data preprocessing and alignment
11.45-12.30 lunch
12.30-14.00 sequence filtering
14.00-14.15 coffee break
14.15-16.15 sequence classification, statistics
16.15-16.45 wrap-up, feedback etc.
And other breaks when needed

Aims of the course
• To learn the essential analysis steps for marker gene studies
• To learn how to use the mothur software for performing the analyses
• Alternating between lectures and hands-on exercises

Metagenomics
1. Shotgun sequencing of niche-specific samples
– Needs assembly (putting the sequence pieces in the correct order and assigning them to the correct species)
2. Sequencing certain marker genes, such as 16S rRNA, RecA, RpoB
– Sequence alignment is enough
• In fact, calling the latter metagenomics is misleading. Marker gene studies have been performed for at least 20 years.

Sanger sequencing (figure from Wikipedia)

Roche 454 pyrosequencing (review: Medini et al., 2008)

16S rRNA sequences from 454
• Primer (adaptor key at the end) + tag + template-specific primer + 16S rRNA sequence
• The adaptor contains a biotin tag and is needed during sequencing of the template
• The tag allows discrimination between samples (barcoding)
• The primer is needed for amplification of the sequence
• These are usually removed from the sequences before the actual analysis
• Example: TCAGTACTCGGCCTACGGGAGGCAGCAG

Electrofluorogram

Base calling bias (454)
http://www.biomedcentral.com/1471-2164/12/245

Files needed on this course
• Demodata 1 (SOP) and Demodata 2 (Turku)
– Extract to the Desktop
• Software
– Extract to the Desktop
– Copy mothur.exe and uchime.exe to the demo data folder
– Seaview
– R
• Reference sequences
– Originally from Silva.bacteria.zip, Silva.gold.bacteria.zip and Trainset9_032012.rdp.zip (from the mothur wiki), but repackaged for the course
– Extract to the Desktop
– After preparing the input data, copy to the data folder
• Scripts
– Used for the statistical analyses

Demodata 1
• SOP data from mothur
• Mouse gut microbiome followed after weaning
• Data is for one animal only, but for ten time points
• We are using sequence files in fasta format and the corresponding quality info files
• If you need to process sff files, please see the SOP data example in the mothur wiki, and the example on the forthcoming slides
– You'll need a lookup file from http://www.mothur.org/wiki/Lookup_files

Demodata 2
• Ruminant data from Turku AMK
• Sample numbers indicate the individual cows; the sample letter indicates the primer set.
• Altogether 12 samples.
• All in SFF format!

Keep a lab book!
• Use some text editor (Notepad!) to first type in the commands, and then copy and paste them into mothur.
• This creates a file of mothur commands that can later be used as a batch file for processing the data.
• It also documents what you have done.
• Better to adopt this habit sooner rather than later.

Work flow
• Prepare input data
• Trim sequences
• Unique sequences
• Align sequences
• Screen sequences
• Filter sequences
• Pre-cluster
• Remove chimeric sequences
• Classify sequences
• Analyses...

File formats
• FastA
>ERR051700.1 FXCMJDV02HWG0G length=10
TCAGTACTCG
>ERR051700.2 FXCMJDV02GE0HC length=10
TCAGTACTCG
• Qual
>ERR051700.1 FXCMJDV02HWG0G length=10
37 37 37 37 34 28 29 29 29 26
>ERR051700.2 FXCMJDV02GE0HC length=10
37 37 37 37 40 40 40 40 40 40

Preparing the data, option 1
• Open the command shell (cmd.exe), and change to the directory (cd) containing the data files
• Append all the fasta files together and all the qual files together
– copy /b *.fasta all.fasta
– copy /b *.qual all.qual

Preparing the data, option 2
• If the data is in sff format, fasta and qual files can be extracted from it with the mothur command sffinfo():
– sffinfo(sff=454Reads.RL1.sff)
• Then continue with the fasta and qual files as laid out on the previous slide

Exercise A
• Extract the demo data to the Desktop (or some other folder where you have write permission)
• Prepare the data for further analysis by appending the fasta files into a single fasta file and the qual files into a single qual file
• Once done, extract mothur to the same place, and copy mothur.exe and uchime.exe from the subfolder mothur to the demo data folder

Mothur
• Mothur is one of the tools meant for the analysis of marker gene studies
• Developed by Pat Schloss' research group at the University of Michigan
• Freely available command line program
• http://www.mothur.org/
• http://www.mothur.org/wiki/Mothur_manual
• http://www.mothur.org/wiki/Analysis_examples

Using mothur
• Installation, several possibilities
– Copy to a certain folder, add it to the Windows path variable, and invoke from the command line (cmd.exe)
– Copy to the folder where the data resides, and start by double-clicking on it
• Once mothur starts, it takes you to its own environment.
– mothur >

Mothur commands
• See the manual on the web for commands.
– http://www.mothur.org/wiki
• Alternatively, type help() at the mothur prompt to list all commands.
• Help on a particular command can be invoked by, e.g., summary.seqs(help).
• Commands are always followed by brackets, and the command arguments are given inside the brackets.
– summary.seqs(fasta=all.fasta)

Mothur data
• Mothur performs all operations in memory, but writes the data files directly to the working directory on the hard drive.
• It also keeps a logfile that records everything that is written on the screen (mothur console), and sometimes a few extra things.
• After you've run a command, please follow the screen output closely, and check the files that were created in the working directory.

Trim sequences 1
• Input
– Fasta file (prepared at the prepare data step)
– Quality file (prepared at the prepare data step)
– Oligo file (will be prepared now)
• Output
– Fasta file of good sequences (all.trim.fasta)
– Fasta file of scrap sequences (all.scrap.fasta)
– Groups file (indicates which group/sample each sequence belongs to) (all.groups)

Trim sequences 2
• Oligo file
– Tab-delimited
– Lines can be commented out with a #
– "The oligos option takes a file that can contain the sequences of the forward and reverse primers and barcodes and their sample identifier. Each line of the oligos file can start with the key words "forward", "reverse", "barcode", "linker" and "spacer" or it can start with a "#" to tell mothur to ignore that line of the oligos file." [Mothur wiki]
• Example:
forward   CCTACGGGAGGCAGCAG
#reverse  GTATTACCGCGGCTGCTG
barcode   TCAGTAGCGCG   ERR051699
barcode   TCAGTACTCGG   ERR051702
barcode   TCAGTAGCGCC   ERR051705

Trim sequences 3
• Options
– fasta: fasta file
– oligos: oligo file
– qfile: quality file
– qaverage: average sequence quality
– qwindowaverage: average quality of window
– qwindowsize: window width
– flip: reverse complement seqs
– maxambig: max. number of ambiguous bases
– maxhomop: max. length of homopolymer
– bdiffs: max. diffs in barcode
– pdiffs: max. diffs in primers
– ldiffs: max. diffs in linkers
– sdiffs: max. diffs in spacers
– tdiffs: max. diffs in total

Trim sequences 4
T  C  A  G  T  A  C  T  C  G
40 40 40 40 40 40 40 40 37 35
• qaverage = 40 -> delete the sequence (the average of the ten scores is 39.2, which is below 40)
• qwindowaverage = 40, qwindowsize = 3 -> the sequence is trimmed where the average of a three-base window drops below 40 (here the window 40 40 37, average 39)

Trim sequences 4
• flip=T
– TCAGTACTCG -> CGAGTACTGA (reverse complement; the complementary strand is AGTCATGAGC)
• maxambig = 1
– TCAGTACNCG (one ambiguous base, N)
• maxhomop = 3
– TCAGTTTTCG (a run of four Ts, longer than the limit)

Common problems
• The sequences are not in the same order in the fasta and in the qual files.
• Mothur, however, assumes that they are, and gives an error if this is not the case.
• Solutions:
– Do not use the quality file when trimming the sequences
– Sort the sequence and the quality files
– Sometimes not all sequences in the fasta file have a match in the qual file; then the only option is not to use the quality file during the trimming

Trim sequences 5
• Example commands
> summary.seqs(fasta=all.fasta)
> trim.seqs(fasta=all.fasta, qfile=all.qual, oligos=oligo.tsv, pdiffs=2, bdiffs=1, ldiffs=1, sdiffs=1, qaverage=25, flip=T)
> trim.seqs(fasta=all.fasta, oligos=GQY1XT001.oligos, pdiffs=2, bdiffs=1, ldiffs=1, sdiffs=1, maxambig=0, maxhomop=8, flip=T)
> get.current()
> summary.seqs(fasta=all.trim.fasta)

Exercise B
• Run mothur by double-clicking on it. It should open in a new DOS window.
• Trim the sequences of the demo data using
a) an average quality score of 25 for each sequence, OR
b) a sliding window method with a window quality score of 25, OR
c) removal of oligo sequences and of sequences that are too ambiguous or contain too long homopolymers (this is handy if the other options fail).
• Run summary.seqs() after each trimming to check how many sequences were retained in the data.
• Which method would you pick for the further processing steps, and why?

Remove non-unique seqs
• Why?
– Sequence amplification can sometimes produce large numbers of identical sequences, even if there are not really that many organisms present that have that particular sequence -> artefacts.
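The idea of collapsing identical reads can be illustrated outside mothur. The following is a minimal Python sketch, not mothur's implementation: the function name `unique_seqs` and the example reads are hypothetical, and read names are assumed to be unique. It keeps the first read of each distinct sequence as the representative and records which reads it stands for, which is roughly the information held in the unique fasta file and the names file.

```python
def unique_seqs(records):
    """Collapse identical sequences (illustrative sketch only).

    records: iterable of (read_name, sequence) pairs with unique names.
    Returns (representatives, groups):
      representatives maps the first-seen read name to its sequence,
      groups maps that name to all read names sharing the sequence.
    """
    rep_for_seq = {}       # sequence -> name of its representative read
    representatives = {}   # representative name -> sequence
    groups = {}            # representative name -> list of member names
    for name, seq in records:
        if seq not in rep_for_seq:
            rep_for_seq[seq] = name
            representatives[name] = seq
            groups[name] = [name]
        else:
            groups[rep_for_seq[seq]].append(name)
    return representatives, groups


# Hypothetical reads: r1 and r2 are identical.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA")]
representatives, groups = unique_seqs(reads)
```

Keeping the group membership matters because later steps still need to know how many reads each unique sequence represents.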
unique.seqs(fasta=all.trim.fasta)
get.current()
summary.seqs(fasta=current)
• Output files:
– Sequences (all.trim.unique.fasta)
– List of sequence groups (all.trim.names)

Sequence alignment
• The idea is to line up the sequences so that the similarity (score) of the sequences is maximized (this describes what the computer does)
• Example of a pairwise global alignment (an alignment of two sequences along their whole lengths)
tgagttgaact
tgagt-gagc
• Minus (-) signs are called gaps
• Gaps can be modelled using gap opening and extension penalties, e.g., match=5, mismatch=0, opening=-2, extension=-1. Total score for the alignment above is 31, and the computer tries to maximize this alignment score.

Alignment in mothur
• Sequences are aligned, one by one, against a set of reference sequences.
• This usually makes the 16S rRNA alignment better than other options (multiple alignment, i.e., alignment of several sequences), but this does not necessarily hold for all possible genes!
• By default, mothur performs a global pairwise alignment (Needleman-Wunsch algorithm) where gap opening and extension are penalized equally.
• See, e.g., http://koti.mbnet.fi/tuimala/oppaat/bioinfolaaja.pdf for more info on alignments.

Alignment in mothur
• Example commands
align.seqs(fasta=all.trim.unique.fasta, reference=silva.bacteria.fasta)
summary.seqs(fasta=current, name=current)
• The reference set silva.bacteria.fasta, or some other set offered by, e.g., the mothur site, is obligatory!
• After the alignment is ready, it can be checked in some alignment software, such as Seaview.
• Output files:
– Aligned sequences (all.trim.unique.align)
– Alignment report listing the sequences and the location they were aligned against (all.trim.unique.align.report)

Exercise C
• Remove the non-unique sequences
• Copy the reference sequence file silva.bacteria.fasta to the data folder
• Align the unique sequences against the reference sequence set
• Open the resulting alignment file in the Seaview editor, and check the results visually

Filtering aligned sequences
• After aligning the sequences, it is a good idea to tidy up the alignment
tgagttgaact (reference)
..agt-gag..
..agg-gaa..
• Remove all gaps-only columns (-)
• Remove all columns containing a specific character (.). align.seqs() fills the positions before the sequence start with dots. But a single misaligned sequence can lead to the whole alignment being deleted!

Filtering
• Example mothur command
> filter.seqs(fasta=current, vertical=T, trump=.)
> summary.seqs(fasta=current, name=current)
• Result:
agtgaa (reference)
agtgag
agggaa
• Output files:
– Trimmed and aligned sequences (all.trim.unique.filter.fasta)
– Filter mask (all.filter)

Preclustering
• The aim is to remove sequences that are probably due to sequencing errors.
• Common sequences are assumed to generate erroneous sequences more often than rare sequences.
• Sequences are first ranked according to abundance, and then the list is walked through.
• Sequences within a certain (edit) distance of each other (the threshold) are grouped together and merged into a single sequence.
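The ranking and merging steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not mothur's pre.cluster code: the function names and the example counts are hypothetical, the sequences are assumed to be aligned and of equal length, and the distance is simply the number of mismatching positions.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def pre_cluster(counts, diffs=1):
    """Merge rare sequences into abundant ones (illustrative sketch only).

    counts: dict mapping aligned sequence -> abundance.
    diffs:  maximum number of mismatches allowed for merging.
    Returns a dict mapping each retained sequence -> merged abundance.
    """
    # Rank by decreasing abundance; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    merged = {}
    absorbed = set()
    for i, seq in enumerate(ranked):
        if seq in absorbed:
            continue
        merged[seq] = counts[seq]
        # Walk through the rarer sequences and absorb the close ones.
        for other in ranked[i + 1:]:
            if other not in absorbed and mismatches(seq, other) <= diffs:
                merged[seq] += counts[other]
                absorbed.add(other)
    return merged


# Hypothetical abundances of five unique sequences.
example = {"ACGTAC": 10, "ACGTAA": 2, "TTGTAC": 5, "TTGTAG": 1, "GGGGGG": 1}
result = pre_cluster(example, diffs=1)
```

With a threshold of one, each rare sequence that differs from a more abundant one at a single position is counted as a copy of the abundant sequence.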
Preclustering
• Example commands
– Threshold of one
> pre.cluster(fasta=current, name=current, diffs=1)
> summary.seqs()
• Output files:
– Aligned sequences (all.trim.unique.filter.precluster.fasta)
– Grouping of sequences (all.trim.unique.filter.precluster.names)
– Groupwise sequence alignments (all.trim.unique.filter.precluster.map)

Remove chimeras
• Chimeric sequences are artifacts that contain parts of at least two different sequences.
• These are typically removed from the data before further processing steps.
• There are several methods; we'll use UCHIME, because it is currently thought to be the most accurate (and the fastest) algorithm
• Two options for UCHIME:
– Compare against a reference set
– Compare inside the sequenced set only (more accurate)

Remove chimeras
• Example commands
> chimera.uchime(fasta=all.trim.unique.filter.precluster.fasta, name=all.trim.unique.filter.precluster.names)
> remove.seqs(accnos=all.trim.unique.filter.precluster.uchime.accnos, fasta=all.trim.unique.filter.precluster.fasta, name=all.trim.unique.filter.precluster.names, group=all.groups)
> summary.seqs()
> get.current()
• Output files:
– Grouping of sequences into clusters (all.trim.unique.filter.precluster.pick.names)
– Sequences (all.trim.unique.filter.precluster.pick.fasta)
– Grouping of sequences into samples or lanes (all.pick.groups)

Exercise D
• Perform a) filtering, b) preclustering and c) chimera check for the sequence set you're working with.
• Be careful with the filtering, and check meticulously that you still have sequences left after that step. If not, skip it altogether.
• How many sequences have been filtered away from the original data after completing these preprocessing steps?
Analyses
• Operational taxonomic unit (OTU) based analyses
– Usually OTUs (~species) are delineated with a 3% sequence dissimilarity, and higher taxa with increasingly larger dissimilarity
– OTUs or groups of OTUs can later be assigned taxonomic names
• Phylotype analyses
– Sequences are directly assigned to taxa on the basis of reference sequences

Classification
• Sequences are assigned to groups of species (taxa) (class/order/family/genus/species)
• Sequences are compared to a reference set, and each sequence is assigned the taxon it is most similar to.
• The comparison is done using short stretches of sequence at a time (kmers). This is combined with bootstrapping, so we get a confidence estimate for the classification.
• The method is described by Wang et al., and is also used by RDP. Some consider it rather good.

Classification
• Example command
> classify.seqs(fasta=all.trim.unique.filter.precluster.pick.fasta, template=trainset9_032012.rdp.fasta, taxonomy=trainset9_032012.rdp.tax, iters=1000)
• Output files:
– Taxonomy assignment for each sequence (all.trim.unique.filter.precluster.pick.rdp.wang.taxonomy)
– Taxa-specific overview of the classification (all.trim.unique.filter.precluster.pick.rdp.wang.tax.summary)

Final touch
system(copy all.trim.unique.filter.precluster.pick.rdp.wang.taxonomy final.all.taxonomy)
system(copy all.trim.unique.filter.precluster.pick.names final.all.names)
system(copy all.trim.unique.filter.precluster.pick.fasta final.all.fasta)
system(copy all.pick.groups final.all.groups)

Exercise E
• Classify the sequences using the RDP templates.
• Copy the final result files to new names as suggested on the previous slide.
• All downstream analyses are performed on the final result files.
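To make the kmer and bootstrapping idea from the Classification slide more concrete, here is a deliberately simplified Python sketch. It is not the naive Bayesian method of Wang et al. and not what classify.seqs() runs: it merely counts shared kmers against one reference sequence per taxon. The function names, the reference sequences, the choice of resampling one eighth of the query kmers without replacement, and the fixed random seed are all assumptions of this sketch.

```python
import random


def kmers(seq, k):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(query, references, k=8, iters=100, seed=0):
    """Assign query to the taxon sharing most kmers (illustrative sketch only).

    references: dict mapping taxon name -> one reference sequence.
    Returns (taxon, confidence), where confidence is the fraction of
    bootstrap rounds that agree with the full-data assignment.
    """
    rng = random.Random(seed)
    ref_kmers = {taxon: kmers(seq, k) for taxon, seq in references.items()}
    query_kmers = sorted(kmers(query, k))
    if not query_kmers:
        raise ValueError("query is shorter than k")

    def best(sample):
        # Taxon with the most kmers in common; ties go to the first name.
        return max(sorted(ref_kmers),
                   key=lambda t: sum(km in ref_kmers[t] for km in sample))

    assigned = best(query_kmers)
    # Bootstrap: repeat the assignment on random subsets of the kmers.
    size = max(1, len(query_kmers) // 8)
    agree = sum(best(rng.sample(query_kmers, size)) == assigned
                for _ in range(iters))
    return assigned, agree / iters


# Hypothetical reference set with one short sequence per taxon.
refs = {"Bacteroides": "ACGTACGTTTGACC", "Lactobacillus": "GGCATTCAGGCTAA"}
taxon, confidence = classify("ACGTACGTTT", refs, k=4)
```

A query whose kmers all point to the same taxon gets a confidence close to 1, while a query that is torn between two taxa gets a lower value, which is the reason for running many iterations (compare the iters argument of classify.seqs()).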
Analysis methods
• Visual
– Rarefaction curves
• Checking whether the species sampling has been thorough enough
– Frequency-based visualizations
• Rank-abundance plots
– Heatmap
– Ordination methods
• Effect of "environmental" factors on species composition
• Statistical (hypothesis testing approach)
– AMOVA
– etc.

Hypothesis testing approach
• Tests for comparing "populations"
– Homogeneity of molecular variance (HOMOVA)
– Analysis of molecular variance (AMOVA)
– Analysis of similarity (ANOSIM)
– Libshuff
– Indicator species approach
• Ordination methods
– Redundancy analysis (RDA)
– Canonical correspondence analysis (CCA)

Ordination analysis
• Takes a species count table (rows=taxa, columns=samples, cells=frequency)
• Additionally takes a comparable matrix of environmental measurements
• Creates an image, and allows testing the significance of the environmental factors for the species occurrence or frequency
• Software:
– Ginkgo
• http://biodiver.bio.ub.es/veganaweb/main/?section=../bvegana/content.jsp
– R
– Canoco

Ordination approaches (figure by Pierre Legendre; slide from Jarno Tuimala, 2011)

RDA, an example plot

Statistical analyses - diversity
• Contributed diversity
– alpha
• diversity inside an area or ecosystem (species richness)
– beta
• diversity between ecosystems
– gamma
• overall diversity of all ecosystems in a particular area
• Diversity can be measured with different indexes, such as Shannon entropy or just the count of species (but the species count depends on the sampling depth, which can be checked using the rarefaction curves)

Statistical analyses – comparing groups
• Do the groups differ in species composition?
– Permutational Multivariate Analysis of Variance Using Distance Matrices
– Multivariate homogeneity of group dispersions (variances)
– Analysis of Molecular Variance
• Based on a (Euclidean) distance matrix between sequences
• Distances (or their variance, to be more exact) are partitioned according to a grouping variable into a within-group and a between-groups variance (this is similar to standard one-way ANOVA)

Indicator species approach
• Which taxa differentiate between the groups in the best possible way?

Running the analyses
• Browse to R-2.15.2-TurkuAMK/i386/bin and run Rgui.exe.
• Select 'Source R code' from the File menu, and select the analyses.R script file.
– R will then prompt you to select the folder where the groups and taxonomy files are located.
– It reads them in, and writes out a count table and a phenodata table (description of the experiment).
– Fill in the group column of the phenodata table in Notepad, and save the file.
• Select 'Source R code' from the File menu, and select the statistics.R script file.
– R will then prompt you to select the folder where the groups and taxonomy files are located.
• The result should be one PDF file and one txt file containing the results of the analyses.

Exercise F
• Run the statistical analysis as specified on the previous slide.
• Interpret the results: are there differences in the species composition between the groups?
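As a small companion to the diversity slide, the two indexes mentioned there (species count and Shannon entropy) can be computed from one column of the count table. The Python sketch below is illustrative only and is not part of the course scripts analyses.R or statistics.R; the function names and the example counts are hypothetical, and the natural logarithm is an assumed convention.

```python
import math


def richness(counts):
    """Number of taxa observed at least once in a sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon entropy H = -sum(p * ln p) over the observed taxa."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no sequences")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical taxon counts for two samples of equal depth.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

h_even = shannon(even_sample)      # equals ln(4), the maximum for four taxa
h_skewed = shannon(skewed_sample)  # lower, one taxon dominates
```

Both samples have the same richness, but the Shannon entropy separates the evenly populated sample from the one dominated by a single taxon.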