Problem Set 04

Clinical and Research Genomics Assignment #4
From Lecture_10-11 (April 30th): Microbiome and Metagenome Characterizations and Cross-Species Analysis
Assignment: Analyze and contextualize raw sequence data from the PathoMap project to
characterize the microbiome and metagenome of your samples.
Due Date: 12:00PM on May 7th
This assignment has two sets of questions.
Downloading Data
1) Download the raw, publicly available data:
2) Download the SRA toolkit:
3) Use sratoolkit 2.4.4 to convert SRA to FASTQ, and then convert FASTQ to FASTA. The SRA-to-FASTQ step looks like this:
>./fastq-dump.2.4.4 ~/Desktop/SRR1749190.sra
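The command above only produces FASTQ; the FASTQ-to-FASTA step is not shown. A minimal Python sketch of that conversion follows. It assumes the standard four-line FASTQ record layout (header, sequence, plus line, quality) with no wrapped sequence lines and no empty reads. The function names `fastq_to_fasta` and `convert_file` are hypothetical helpers, not part of the SRA toolkit.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumption: every record is exactly four non-blank lines
    (@header, sequence, +, quality). Blank lines are skipped.
    """
    position = 0  # index of the current non-blank line
    for raw in fastq_lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        slot = position % 4
        if slot == 0:
            # Header lines are identified by position, because a
            # quality line may also begin with '@'.
            if not line.startswith("@"):
                raise ValueError(f"expected '@' header, got {line!r}")
            yield ">" + line[1:]
        elif slot == 1:
            yield line  # the sequence itself
        # slots 2 ('+') and 3 (quality) are dropped
        position += 1


def convert_file(fastq_path, fasta_path):
    """Write a FASTA file from a FASTQ file (paths are up to the caller)."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for out_line in fastq_to_fasta(src):
            dst.write(out_line + "\n")
```

For example, `convert_file("SRR1749190.fastq", "SRR1749190.fasta")` would convert the FASTQ written by fastq-dump, assuming that is the output file name on your machine.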
To complete this assignment you will need the following files:
• P00606 – Station – SAMN03270567
• P00427 – Station – SAMN03270390
• PQiagenBeadWash928-1 (Culture01) – Culture Sample – SAMN3271460
• P00046 – Sequence Sample – SAMN03270043
• Any Gowanus sample (Sample ID GCSS-XX, SAMN03271470-1484)
• Any abandoned station sample (Sample ID PABXX, SAMN03271423-52)
Running MetaPhlAn
Upload data to the free Galaxy instance of MetaPhlAn 1.7 and run on default parameters:
Running BLAST
The sequences will be too large for the web-based BLAST tool to handle, so you will need to take a subset
of the sequence and run that. If you are familiar with the command line, you can open a terminal and run:
head -n 1000 sequence.fasta > sequence_1000.fasta
This takes the first 1000 lines of your sequence file.
If you are not familiar with the terminal, you can open the sequence file in a text editor, copy part of the
sequence, and paste it into BLAST.
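Because `head` counts lines rather than sequences, the cut can land in the middle of a record when sequences are wrapped over several lines. As a hedged alternative, the sketch below keeps the first N complete FASTA records. The function name `first_n_records` is hypothetical, and the only assumption about the input is that each record starts with a `>` header line.

```python
def first_n_records(fasta_lines, n):
    """Yield the lines of the first n FASTA records (header plus sequence lines)."""
    seen = 0  # number of headers encountered so far
    for line in fasta_lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # stop before the (n + 1)-th record begins
        if seen:
            # Lines before the first header are ignored.
            yield line
```

With an open file handle `src`, `"".join(first_n_records(src, 500))` would give a block of 500 whole sequences to paste into BLAST; 500 is an arbitrary choice here, not a limit taken from the assignment.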
To run BLAST, follow the tutorial posted on the website:
Short Answer Questions
1. Summarize the MetaPhlAn/BLAST results. Do you see any similarities? Any differences? For each
sample list 5 species you found interesting. Give the species name as well as a brief description of the
2. Did you find anything interesting in sample P00606? Hint: Check the MetaPhlAn results in the
Enterobacteriales family. Now take that sample and run it on MetaPhlAn v2.0 on Galaxy. Have the results
changed? If so, what could be a reason?
3. Did you find anything interesting in sample P00427? Hint: Check the MetaPhlAn results in the
Bacillaceae family. Do you find this finding plausible? Why or why not?
4. We found interesting molecular echoes that make the abandoned station and Gowanus samples stand
out against the rest of our dataset. Are there any organisms that you found only in the Gowanus sample, or only
in the abandoned station sample, but not in the other samples? Can you explain why or how they might have been
introduced there?
5. Compare and contrast the culture and sequencing methods. What are the pros and cons of each? Compare
the results of Culture01 and P00046. They were taken from the same station, but one swab was cultured
and then sequenced, while the other was sequenced directly. Are any organisms found in both samples?
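Several of the questions above come down to comparing species lists between two MetaPhlAn profiles. One possible way to do that outside Galaxy is sketched below. It assumes the profile is the usual tab-separated text with a clade lineage in the first column (levels joined by `|`, species prefixed with `s__`) and a relative abundance in the second; check your downloaded file against that layout first. The names `species_abundances` and `compare_samples` are hypothetical helpers, not part of MetaPhlAn.

```python
def species_abundances(profile_lines):
    """Map species name -> relative abundance from MetaPhlAn-style lines.

    Assumption: tab-separated columns, lineage first, abundance second.
    Only rows whose deepest clade is a species (s__) are kept.
    """
    species = {}
    for raw in profile_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        deepest = fields[0].split("|")[-1]
        if deepest.startswith("s__"):
            species[deepest[3:]] = float(fields[1])
    return species


def compare_samples(profile_a, profile_b):
    """Return species shared by both profiles and species unique to each."""
    a, b = set(profile_a), set(profile_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }
```

Calling `compare_samples(species_abundances(open("Culture01.txt")), species_abundances(open("P00046.txt")))` would list the overlap for question 5; the two file names are placeholders for whatever you saved from Galaxy. The same call works for checking which species appear only in the Gowanus or abandoned station profiles.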
Essay Questions
6. The findings of plague and anthrax in our dataset were scrutinized by some scientists and the media. In
response, we posted a blog post that highlighted further research into these findings. Read “The Long Road
from Data to Wisdom, and from DNA to Pathogen” Also read, “ What do you think of the discussion? In your opinion, what challenges face the field of
7. Put on the hat of a metagenomics expert. You now have a massive metagenomic dataset of nearly 1500
samples collected from New York City’s subway system that represent all species in a city. What sort of
analysis would you like to perform, and what research question would you like to delve into?
8. After reading the study and playing with the data yourself, has your perception of the subway changed
at all? Some coverage included the comment, “It’s probably fine to lick the subway poles.” Would you? Why or why not?
Please hand in the assignment on the day of the lecture, or email it if you cannot attend.
For any questions, please contact Ebrahim Afshinnekoo,
Priyanka Vijay, or Professor Mason.