less 763577454-SRS014459-Stool.fasta.bowtie2out.txt

advertisement
Meta’omic Analysis with
MetaPhlAn, HUMAnN, and LEfSe
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
08-08-13
Some setup notes
• Slides with green titles or text include
instructions not needed today, but useful for
your own analyses
• Keep an eye out for red warnings of
particular importance
• Command lines and program/file names
appear in a monospaced font.
2
Getting some HMP data
• Go to http://hmpdacc.org
Click “Get Data”
3
Getting some HMP data
• Check out what’s available
Click “HMIWGS”
4
Getting some HMP data
• Check out what’s available
Click on your favorite body site
5
Getting some HMP data
• Check out what’s available
Don’t click on anything!
6
Getting some (prepped) HMP data
• Connect to the server instead and run:
– for S in `ls /class/stamps-shared/chuttenh/7*.fasta`;
do ln -s $S; done
• These are subsamples of six HMP files:
–
–
–
–
–
–
SRS014459.tar.bz2  763577454-SRS014459-Stool.fasta
SRS014464.tar.bz2  763577454-SRS014464-Anterior_nares.fasta
SRS014470.tar.bz2  763577454-SRS014470-Tongue_dorsum.fasta
SRS014472.tar.bz2  763577454-SRS014472-Buccal_mucosa.fasta
SRS014476.tar.bz2  763577454-SRS014476-Supragingival_plaque.fasta
SRS014494.tar.bz2  763577454-SRS014494-Posterior_fornix.fasta
• All six shotgunned body sites from
– One subject, first visit
– Subsampled to 20,000 reads
7
Who’s there: MetaPhlAn
X is a core gene for clade Y
X is a unique marker gene for clade Y
Gene X
8
Who’s there: MetaPhlAn
• Go to http://huttenhower.sph.harvard.edu/metaphlan
Scroll down
9
Who’s there: MetaPhlAn
• You could download MetaPhlAn by clicking here
10
Who’s there: MetaPhlAn
•
But don’t! Instead, we’ve downloaded MetaPhlAn already for you by clicking here
11
Who’s there: MetaPhlAn
• And saving just the file metaphlan.py in stamps-software/bin
12
Who’s there: MetaPhlAn
• You could download the bowtie2 database here
13
From the command line...
• But don’t! Instead, get it by:
– for S in `ls /class/stamps-shared/chuttenh/*.bt2`;
do ln -s $S; done
• To see what you can do, run:
– module load stamps
– metaphlan.py -h | less
– Use the arrow keys to move up and down,
q to quit back to the prompt
14
Who’s there: MetaPhlAn
15
Who’s there: MetaPhlAn
• For future reference: these extra options aren’t necessary if
you download the whole “default” MetaPhlAn package
– metaphlan.py your_input.fasta > your_output.txt
• To launch your first analysis, run:
– metaphlan.py --bowtie2db mpa
763577454-SRS014459-Stool.fasta >
763577454-SRS014459-Stool.txt
16
Who’s there: MetaPhlAn
• What did you just do?
• Two new output files:
– 763577454-SRS014459-Stool.fasta.bowtie2out.txt
• Contains a mapping of reads to MetaPhlAn markers
– 763577454-SRS014459-Stool.txt
• Contains taxonomic abundances as percentages
17
Who’s there: MetaPhlAn
• less 763577454-SRS014459-Stool.fasta.bowtie2out.txt
18
Who’s there: MetaPhlAn
• less 763577454-SRS014459-Stool.txt
19
Who’s there: MetaPhlAn
• Now finish the job:
– metaphlan.py --bowtie2db mpa
763577454-SRS014464-Anterior_nares.fasta >
763577454-SRS014464-Anterior_nares.txt
– ...
• Note that you can use the up arrow key to
make your life easier!
20
Who’s there: MetaPhlAn
• Let’s make a single table containing all six
samples:
– mkdir tmp
– mv *.bowtie2out.txt tmp
– merge_tables.py -l -d *.txt | zero.py >
763577454.tsv
• You can look at this file using less
– Note 1: The arguments less -x4 -S will help
– Note 2: You can set this “permanently” using
export LESS="-x4 -S"
21
Who’s there: MetaPhlAn
• But it’s easier using MeV; go to http://www.tm4.org/mev.html
Click
“Download”
22
An interlude: MeV
• Don’t forget to transfer your 763577454.tsv
file locally for viewing using scp
• Unzip, launch MeV, and select File/Load data
23
An interlude: MeV
• Click “Browse” to your TSV file, then
– Tell MeV it’s a two-color array
– Uncheck “Load annotation”
– Click on the upper-leftmost data value
24
An interlude: MeV
• “Load” your data, then make is visible by:
– Display/Set Color Scale Limits
– Choose Single Gradient, min 0, max 10
25
An interlude: MeV
• Finally, to play around a bit:
–
–
–
–
Display/Set Element Size/whatever you’d like
Clustering/Hierarchical Clustering
Optimize both gene and sample order
And select Manhattan Distance (imperfect!)
26
An interlude: MeV
• If you’d like, you can
– Display/Sample-Column Labels/Abbr. Names
27
An interlude: MeV
• MeV is a tool; imperfect, but convenient
– You should likely include just “leaf” nodes
• Species, whose names start include “s__”
• You can filter your file using:
cat 763577454.tsv | grep -E '(Stool)|(s__)' >
763577454_species.tsv
– You can, but might not want to, z-score normalize
• Adjust Data/Gene-Row Adjustments/Normalize Genes-Rows
• Many other tools built in – experiment!
28
What they’re doing: HUMAnN
• Back to the task at hand; you could download HUMAnN at:
http://huttenhower.sph.harvard.edu/humann
Click
here
29
What they’re doing: HUMAnN
• ...but instead we’ve already downloaded it
• Expand HUMAnN (no install!)
– tar -xzf /class/stampsshared/chuttenh/sources/humann-0.98.tar.gz
• Set up a link to the KEGG reference DB:
– ln -s /class/stamps-shared/chuttenh/kegg.reduced.udb
• And although you would normally download
USEARCH from here:
– http://www.drive5.com/usearch/download.html
We’re going to use it preinstalled instead
30
What they’re doing: HUMAnN
• If we weren’t all running this, you’d need to:
– Get KEGG – used to be free, now it’s not!
• Fortunately, we have a HUMAnN-compatible
distributable version; contact me...
– Index it for USEARCH:
• usearch6 -makeudb_usearch kegg.reduced.fasta
-output kegg.reduced.udb
• This takes a minute or two, so we’ve
precomputed it; thus, forge ahead...
31
What they’re doing: HUMAnN
• Did you notice that we didn’t QC our data at all?
– MetaPhlAn is very robust to junk sequence
– HUMAnN is pretty robust, but not quite as much
• We’ve already run a standard metagenomic QC:
– Quality trim by removing bad bases (typically Q ~15)
– Length filter to remove short sequences (typically <75%)
32
What they’re doing: HUMAnN
• Must start from FASTQ files to do this
• Quality trim by removing bad bases:
– TrimBWAstyle.py < your_data.fastq >
your_trimmed_data.fastq
• Length filter by removing short sequences:
– 75% of original length is standard (thus 75nt from 100nt reads)
– remove_bad_seqs.py 75 < your_trimmed_data.fastq >
your_filtered_data.fastq
• Now convert your FASTQ to a FASTA:
– fastq2fasta.py < your_filtered_data.fastq >
your_filtered_data.fasta
• Some final caveats:
– If you’re using paired end reads, match filters!
– See my course homeworks at http://huttenhower.sph.harvard.edu/bio508
– Aren’t you glad you’re not doing this today?
33
What they’re doing: HUMAnN
• Enter the humann directory
– cd humann-0.98
• Run your first translated BLAST search:
– usearch6.0.192_i86linux32
-usearch_local ../763577454-SRS014459-Stool.fasta
-db ../kegg.reduced.udb -id 0.8
-blast6out input/763577454-SRS014459-Stool.txt
• What did you just do?
– less input/763577454-SRS014459-Stool.txt
– Recall BLAST’s tab-delimited output headers:
• qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore
• Rinse and repeat for the remaining samples
34
What they’re doing: HUMAnN
• Normally you’d need to install SCons from:
– http://www.scons.org
• Instead, we’ll use it preinstalled as well, so...
GO!
– /class/stamps-software/scons/bin/scons
• You should see a bunch of text scroll by
– Note: you can run scons -j8 to parallelize tasks
35
What they’re doing: HUMAnN
• After a minute or two, you should see:
36
What they’re doing: HUMAnN
• This has created four main files:
– Two each for pathways (big) and modules (small)
– Two each for coverage and relative abundance
• Each is tab-delimited text with one column per sample
• All four are in the output directory:
– output/04a-hit-keg-mpt-cop-nul-nve-nve-xpe.txt
• Coverage (a) of pathways (t)
– output/04a-hit-keg-mpm-cop-nul-nve-nve-xpe.txt
• Coverage (a) of modules (m)
– output/04b-hit-keg-mpt-cop-nul-nve-nve.txt
• Abundance (b) of pathways (t)
– output/04b-hit-keg-mpm-cop-nul-nve-nve.txt
• Abundance (b) of modules (m)
• I almost always just use 04b-mpm (module abundances)
37
What they’re doing: HUMAnN
• Let’s take a look:
– less output/04b-hit-keg-mpm-cop-nul-nve-nve.txt
38
What they’re doing: HUMAnN
• That’s ugly; it gets much better in Excel
– Note: this is very sparse since we’re using a small subset of KEGG
– Note: the mock community demo data is included on the right
39
What they’re doing: HUMAnN
• And there’s nothing stopping us from using MeV
– Or R, or QIIME, or anything that’ll read tab-delimited text
40
What matters: LEfSe
• All of these analyses give you tables
– 16S
 OTUs
– MetaPhlAn  Species
– HUMAnN  Modules (or pathways or genes)
• If you know something about your samples...
– Case/control
– Different habitats
– Different time points
• How can you identify features that change?
41
What matters: LEfSe
• Let’s get all of the HMP species data:
http://hmpdacc.org/resources/data_browser.php
Click “HMSMCP”
42
What matters: LEfSe
• Download the MetaPhlAn table for all 700 samples
Right click this irritatingly tiny icon
43
Downloading from the command line
• Instead of saving this, download it by:
– Right-click to copy the URL
– Run wget <paste URL here>
– Note: curl –O <URL> works just as well
44
What matters: LEfSe
• Make sure this file is in your home directory, and expand it:
– bunzip2 HMP.ab.txt.bz2
• Look at the result
– less HMP.ab.txt
• IMPORTANT!!!
– This file’s too big to analyze directly today
– ln -s /class/stamps-shared/chuttenh/HMP.ab.filtered.txt
• This is great – tons of data, but no metadata
– HUMAnN to the rescue
– From the humann-0.98 directory:
python src/metadata.py input/hmp_metadata.dat
< ../HMP.ab.filtered.txt > ../HMP.ab.filtered.metadata.tsv
• NOW take a look again
45
What matters: LEfSe
• Let’s modify this file to be LEfSe-compatible
• Open it up in Excel
46
What matters: LEfSe
• Delete all of the metadata rows except:
– RANDSID and STSite
– Save it as tab-delimited text: HMP.ab.filtered.metadata.txt
47
What matters: LEfSe
• Visit LEfSe at: http://huttenhower.sph.harvard.edu/lefse
Click
here
48
What matters: LEfSe
• Then upload your formatted table
– After you upload, wait for the progress meter to turn green!
1. Click
here
2. Then
here
3. Then
here
4. Then
watch
here
49
What matters: LEfSe
• Then tell LEfSe about your metadata:
1. Click
here
2. Then
select STSite
3. Then select
RANDSID
4. Then here
50
What matters: LEfSe
• Then select LDA=4, “One-against-all,” and run LEfSe!
– You can change other default statistical parameters if desired
1. Click
here
2. Then here
(finds only very
extreme differences)
3. Then here
4. Then GO!
(finds differences in at
least one condition rather
than in all conditions)
51
What matters: LEfSe
• You can plot the results as a bar plot
– Again, lots of graphical parameters to modify if desired
1. Click
here
2. Then here
52
What matters: LEfSe
• In Galaxy, view a result by clicking on its “eye”
Click here
53
What matters: LEfSe
54
What matters: LEfSe
• You can plot the results as a cladogram
– Lots and lots of graphical parameters to modify if desired
1. Click
here
2. Then here
55
What matters: LEfSe
56
What matters: LEfSe
• Finally, you can see the raw data for individual biomarkers
– These are generated as a zip file of individual plots
1. Click
here
2. Then here
57
What matters: LEfSe
Click here
• In Galaxy, download a result by clicking on its “disk”
Then here
58
What matters: LEfSe
Actinobacteria
Strep. mitis
Veillonellaceae
59
Summary
• MetaPhlAn
– Raw metagenomic reads in
– Tab-delimited species relative abundances out
• HUMAnN
– Quality-controlled metagenomic reads in
– Tab-delimited gene, module, and pathway
relative abundances out
• LEfSe
– Tab-delimited, stratified relative abundances in
– Significantly differentially abundant features out
60
Thanks!
Interested? We’re recruiting
just about everything!
Human Microbiome Project
Nicola Segata Levi Waldron Xochi Morgan
Felix Wong
Emma
Schwager
Eric Franzosa
Joseph Moon
Jim Kaminski
Boyu Ren
Brian Palmer
Ren Lu
Koji Yasuda
Tim Tickle
Dirk Gevers
Kat Huang
Daniela
Boernigen
Ramnik Xavier
Harry Sokol
Dan Knights
Moran Yassour
Owen White
Sahar Abubucker
Joe Petrosino
Brandi Cantarel
George Weinstock
Alyx Schubert
Karen Nelson
Mathangi Thiagarajan
Lita Proctor
Beltran Rodriguez-Mueller
Erica Sodergren Makedonka Mitreva
Anthony Fodor
Yuzhen Ye
Marty Blaser
Mihai Pop
Jacques Ravel
Larry Forney
Pat Schloss
Barbara Methe
Bruce Birren Mark Daly
Doyle Ward Ashlee Earl
Wendy Garrett
Michelle Rooks
Rob Beiko
Morgan Langille
Ruth Ley
Omry Koren
Jacques Izard
Katherine Lemon
Rob Knight
Jesse Zaneveld
Greg Caporaso
Bruce Sands
Mark Silverberg
Boyko Kabakchiev
Andrea Tyler
Jeroen Raes
Karoline Faust
http://huttenhower.sph.harvard.edu
61
Download