Genetics 875 Lab Exercise 2: short-read mapping
Software used:
bowtie
UCSC Genome Browser
Mauve
Overview: You will map a dataset of short paired-end reads to several different genomes using
bowtie. The dataset is from an Illumina run of Escherichia coli strain K-12 substrain MG1655 with
36 bp paired-end reads and a 500 bp insert size (7047668 pairs of reads, or 14095336 reads
total). To model the situation where your data come from a strain for which a completed genome
sequence is not available, you will map the reads not only to E. coli K-12 MG1655, but also to a
derivative K-12 strain, two more distantly related E. coli isolates, and a different Escherichia
species. While the mapping is running, we will examine representative mapping results using the
UCSC Genome Browser, a commonly used platform that supports a large number of genomes.
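As a rough check on how much data this is, the expected average coverage can be estimated from the read count and read length given above. The sketch below assumes a genome size of about 4.64 Mb for MG1655, a figure that is not stated in this handout; substitute the actual length of the reference if you want a more exact number.

```shell
# Expected average coverage = (number of reads x read length) / genome size.
reads=14095336        # total reads, from the overview
read_len=36           # bp per read, from the overview
genome_size=4640000   # ASSUMPTION: approximate MG1655 chromosome length in bp

coverage=$(awk -v n="$reads" -v l="$read_len" -v g="$genome_size" \
  'BEGIN { printf "%.1f", n * l / g }')
echo "Expected average coverage: ${coverage}x"
```

Under that assumption the estimate is roughly 109x, which is the depth you would expect if every read mapped to the MG1655 reference.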
In addition to this document, you will need to download two files for the exercise. Be sure your
computer is booted into Mac OS before proceeding. The files are available from either of two
servers (same files on both servers). Download, unzip, and move the folders to the Desktop.
(1) Genetics875_lab2_bowtie.zip (609 MB):
http://www.genome.wisc.edu/pub/Genetics875_lab2_bowtie.zip - OR http://gel.ahabs.wisc.edu/~guy/Genetics875_lab2_bowtie.zip
genomes (contains 8 fasta files)
NC_000913.fna (Escherichia coli K-12 MG1655 chromosome)
NC_010473.fna (Escherichia coli K-12 DH10B chromosome)
NC_002695.fna (Escherichia coli O157:H7 Sakai chromosome)
NC_002128.fna (Escherichia coli O157:H7 Sakai plasmid pO157)
NC_002127.fna (Escherichia coli O157:H7 Sakai plasmid pOSAK1)
NC_004431.fna (Escherichia coli CFT073 chromosome)
NC_011740.fna (Escherichia fergusonii ATCC 35469 chromosome)
NC_011743.fna (Escherichia fergusonii ATCC 35469 plasmid pEFER)
indexes (an empty folder)
reads (contains 2 fastq files)
SRR001666_1.fastq
SRR001666_2.fastq
sample-bowtie-output.txt (a text file with an example of bowtie alignment results)
(2) Genetics875_lab2_Mauve.zip (11 MB):
http://www.genome.wisc.edu/pub/Genetics875_lab2_Mauve.zip - OR http://gel.ahabs.wisc.edu/~guy/Genetics875_lab2_Mauve.zip
Mauve_URL.txt (text file with Mauve download URL; you want the Mac OS X 10.4+ version)
MG1655-vs-CFT073 (pre-built Mauve alignment)
NC_000913.gbk (Escherichia coli K-12 MG1655 chromosome, GenBank flatfile format)
NC_004431.gbk (Escherichia coli CFT073 chromosome, GenBank flatfile format)
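Before mapping, it can be worth confirming that the read files downloaded completely. A minimal sketch, assuming standard four-line FASTQ records; the function name fastq_read_count and the tiny example file are hypothetical and only there to show the idea.

```shell
# Count the reads in a FASTQ file, assuming exactly four lines per record
# (wrapped sequence lines would break this assumption).
fastq_read_count() {
  lines=$(wc -l < "$1")
  echo $((lines / 4))
}

# Made-up two-read file, only to show the function working.
example_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$example_fastq"
fastq_read_count "$example_fastq"   # prints 2
```

From the Genetics875_lab2_bowtie folder, you could run fastq_read_count reads/SRR001666_1.fastq; given the 7047668 read pairs stated in the overview, each of the two files should report 7047668 reads.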
Part 1 – Mapping reads to genomes using bowtie
Launch Terminal and navigate to the genomes directory:
cd Desktop/Genetics875_lab2_bowtie/genomes
Build genome indexes for bowtie and move them to the indexes folder. In Terminal, run each of
the following commands (this takes about 1 minute/genome):
bowtie-build NC_000913.fna e_coli_mg1655
bowtie-build NC_010473.fna e_coli_dh10b
bowtie-build NC_002695.fna,NC_002128.fna,NC_002127.fna e_coli_sakai
bowtie-build NC_004431.fna e_coli_cft073
bowtie-build NC_011740.fna,NC_011743.fna e_fergusonii
mv *.ebwt ../indexes
cd ..
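To confirm that every index was built and moved, you can check for the files bowtie-build writes. The sketch below assumes six .ebwt files per index (.1, .2, .3, .4, .rev.1, .rev.2), which is what current bowtie 1 releases produce; older releases may differ. The function name check_indexes is hypothetical.

```shell
# Report any expected index file that is missing from the given directory.
check_indexes() {
  dir=${1:-indexes}   # directory holding the .ebwt files
  missing=0
  for name in e_coli_mg1655 e_coli_dh10b e_coli_sakai e_coli_cft073 e_fergusonii; do
    for part in 1 2 3 4 rev.1 rev.2; do
      if [ ! -f "$dir/$name.$part.ebwt" ]; then
        echo "missing: $dir/$name.$part.ebwt"
        missing=$((missing + 1))
      fi
    done
  done
  echo "$missing index file(s) missing"
}

# Run from the Genetics875_lab2_bowtie folder.
check_indexes indexes
```

If any file is reported missing, rerun the corresponding bowtie-build command in the genomes folder and move the new .ebwt files again.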
Use bowtie to map the reads to each genome in succession (each run takes about 10-15 minutes
on average). After each run, copy the mapping summary from the screen before starting the next
mapping. It will take the following form (you will have actual numbers instead of 0s):
Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Seeded quality full-index search: 00:00:00
# reads processed: 0000000
# reads with at least one reported alignment: 0000000 (00.00%)
# reads that failed to align: 000000 (0.00%)
Reported 0000000 paired-end alignments to 1 output stream(s)
Time searching: 00:00:00
Overall time: 00:00:00
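If you paste each summary into its own text file, the mapped percentage can be pulled out for comparison across genomes. A minimal sketch; the function name aligned_percent is hypothetical, and the numbers in the example summary are made up, not results from this exercise.

```shell
# Print the percentage of reads with at least one reported alignment
# from a saved copy of bowtie's mapping summary.
aligned_percent() {
  awk -F'[()]' '/reads with at least one reported alignment/ { print $2 }' "$1"
}

# Made-up example summary (NOT real results from this exercise).
example_summary=$(mktemp)
cat > "$example_summary" <<'EOF'
# reads processed: 1000
# reads with at least one reported alignment: 900 (90.00%)
# reads that failed to align: 100 (10.00%)
EOF
aligned_percent "$example_summary"   # prints 90.00%
```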
Here are the commands for each run:
bowtie -t -X 600 -p 2 e_coli_mg1655 -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq mg1655.out
bowtie -t -X 600 -p 2 e_coli_dh10b -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq dh10b.out
bowtie -t -X 600 -p 2 e_coli_sakai -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq sakai.out
bowtie -t -X 600 -p 2 e_coli_cft073 -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq cft073.out
bowtie -t -X 600 -p 2 e_fergusonii -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq fergusonii.out
Q: How does the time it takes to map the reads, and the proportion of reads that can be
mapped, change when you map the same data to the different genomes?
Q: What might that tell you about the relatedness of the data and reference genomes?
Part 2 – Examining sample bowtie output
In the Finder, as opposed to the Terminal, open the Genetics875_lab2_bowtie folder and then
open the example output file “sample-bowtie-output.txt,” which is from MG1655 reads mapped
to the CFT073 genome.
bowtie output contains one alignment per line. Each line is a collection of 8 fields separated by
tabs; from left to right, the fields are:
(1) Name of read that aligned
(2) Reference strand aligned to, + for forward strand, - for reverse
(3) Name of reference sequence where alignment occurs, or numeric ID if no name was
provided
(4) 0-based offset into the forward reference strand where leftmost character of the alignment
occurs
(5) Read sequence (reverse-complemented if orientation is -).
(6) ASCII-encoded read qualities (reversed if orientation is -). The encoded quality values are on
the Phred scale and the encoding is ASCII-offset by 33.
(7) The number of other instances where the same sequence aligned against the same reference
characters as were aligned against in the reported alignment. This is not the number of other
places the read aligns with the same number of mismatches. The number in this column is
generally not a good proxy for that number (e.g., the number in this column may be '0' while the
number of other alignments with the same number of mismatches might be large).
(8) Comma-separated list of mismatch descriptors. If there are no mismatches in the alignment,
this field is empty. A single descriptor has the format offset:reference-base>read-base. The
offset is expressed as a 0-based offset from the high-quality (5') end of the read.
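To place a mismatch on the reference, fields 2, 4, 5, and 8 have to be combined. The sketch below assumes that for a read on the - strand, the 5' end of the read corresponds to the rightmost aligned reference base; this follows from the field descriptions above, but it is worth checking against a read you can verify by eye. The function name mismatch_ref_pos and the example values are hypothetical.

```shell
# Convert a mismatch offset (field 8) into a 0-based reference coordinate.
# Arguments: strand (field 2), alignment offset (field 4),
#            read length (length of field 5), mismatch offset (from field 8).
mismatch_ref_pos() {
  strand=$1; ref_offset=$2; read_len=$3; mm_offset=$4
  if [ "$strand" = "+" ]; then
    echo $((ref_offset + mm_offset))
  else
    # ASSUMPTION: on the - strand, offset 0 is the read's 5' end, which is
    # the right end of the alignment on the forward reference strand.
    echo $((ref_offset + read_len - 1 - mm_offset))
  fi
}

# Made-up example: a 36 bp read at offset 1000 with descriptor 5:A>G.
mismatch_ref_pos + 1000 36 5    # prints 1005
mismatch_ref_pos - 1000 36 5    # prints 1030
```

The result is 0-based, like field 4, so keep that in mind when comparing it with coordinates shown by other tools.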
Find a read that mapped with a mismatch. Note its location in the CFT073 genome.
Find a read that mapped to the same sequence in more than one place in the genome,
and note where it was “assigned” in the CFT073 genome.
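Instead of scanning the file by eye, both kinds of read can be picked out by filtering on fields 7 and 8. A minimal sketch, assuming the file is tab-separated with the eight fields described above; the function names and the three example lines are made up for illustration.

```shell
# Each function prints: read name, strand, 0-based offset, and the field tested.
reads_with_mismatches() {
  awk -F'\t' '$8 != "" { print $1, $2, $4, $8 }' "$1"       # field 8 not empty
}
reads_with_other_instances() {
  awk -F'\t' '$7 + 0 > 0 { print $1, $2, $4, $7 }' "$1"     # field 7 above zero
}

# Made-up example lines (NOT taken from sample-bowtie-output.txt).
example_out=$(mktemp)
printf 'readA/1\t+\tgi|example\t100\tACGT\tIIII\t0\t\n'      >  "$example_out"
printf 'readB/1\t-\tgi|example\t250\tACGT\tIIII\t0\t2:A>G\n' >> "$example_out"
printf 'readC/1\t+\tgi|example\t400\tACGT\tIIII\t3\t\n'      >> "$example_out"
reads_with_mismatches "$example_out"        # prints: readB/1 - 250 2:A>G
reads_with_other_instances "$example_out"   # prints: readC/1 + 400 3
```

On the real file, something like reads_with_mismatches sample-bowtie-output.txt | head gives a short list of candidates to choose from.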
In the Finder, open the Genetics875_lab2_Mauve folder and install Mauve (download the Mac
OS X 10.4+ version from the address in “Mauve_URL.txt”). Then open the pre-built alignment
“MG1655-vs-CFT073.” Note: double-clicking on the file won’t work; you need to launch Mauve
and then use File: Open alignment… Zoom in (“+” magnifying glass icon) until individual
annotated features are displayed. Use Mauve’s sequence navigation ability to go to the location
in the CFT073 genome where you noted a sequence mismatch [View: Go To: Sequence Position,
select Escherichia coli CFT073, and then enter the sequence coordinate from the sample bowtie
output]. Hover over any feature that overlaps that position, and note its information. Do the
same for the assigned location of the read that mapped to more than one place in the genome.
Q: What feature, if any, does the mismatch fall into? Is it in a region that Mauve
aligned as part of a collinear block? What might the mismatch indicate?
Q: What feature, if any, does the read with multiple locations fall into? Is it in a region
that Mauve aligned as part of a collinear block? Does the type of feature or its
annotations suggest an explanation for why the read mapped to other locations in
addition to here?
Part 3 – The UCSC Genome Browser
While bowtie is running we will look at the UCSC Genome Browser, a commonly used graphical
tool for examining the results of read mapping experiments. The UCSC Microbial Genome
Browser, an instance of the browser dedicated to bacterial and archaeal genomes, is at
http://microbes.ucsc.edu/index.html.
The Genome Browser accepts user-uploaded files in a variety of formats, including BED, BAM,
GFF, GTF, and WIG. Unfortunately, bowtie writes only its own alignment format or SAM, so
we needed to either convert the bowtie output or use a different mapping tool. Uploading files
to the UCSC browser takes a while; since you will be busy running bowtie and Mauve, we will
upload some files to the Genome Browser on the computer up front for you to see:
SRR001666.wig = coverage histogram of MG1655 data mapped to CFT073 genome (39.4 MB)
cft073.bed = mapped location of each MG1655 read to CFT073 genome (240.7 MB)
Q: Both files represent the same data, but in different formats. Can you think of uses
for which one of the file formats would be more useful than the other?
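One way to see how the two formats relate is to derive per-position depth from read intervals. The sketch below is an illustration only, not how the files above were produced: it assumes standard BED coordinates (0-based start, end not included) and a single reference sequence, and it is far too slow for the full 240.7 MB file. The function name bed_depth and the two example intervals are hypothetical.

```shell
# Print "position depth" for every base covered by at least one interval.
bed_depth() {
  awk -F'\t' '
    { for (p = $2 + 0; p < $3 + 0; p++) depth[p]++ }   # count each covered base
    END { for (p in depth) print p, depth[p] }
  ' "$1" | sort -n
}

# Made-up example: two overlapping 4 bp intervals.
example_bed=$(mktemp)
printf 'chr\t0\t4\tread1\n'  > "$example_bed"
printf 'chr\t2\t6\tread2\n' >> "$example_bed"
bed_depth "$example_bed"
```

For the two example intervals, positions 2 and 3 get a depth of 2 and the other covered positions get 1. The depth values no longer say which reads produced them, which is the information the BED file keeps.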
We’ll also show some screenshots of other data displayed using the browser:
Mouse RNA-Seq data (chrI:94,328,000-94,397,000)
Yeast ChIP-Seq data (chrXIII:261,000-282,000)