Genetics 875 Lab Exercise 2: short-read mapping

Software used:
    bowtie
    UCSC Genome Browser
    Mauve

Overview:
You will map a dataset of short paired-end reads to several different genomes using bowtie. The dataset is from an Illumina run of Escherichia coli strain K-12 substrain MG1655 with 36 bp paired-end reads and a 500 bp insert size (7047668 pairs of reads, or 14095336 reads total). To model the situation where your data are from a strain for which a completed genome sequence is not available, you will map the reads not only to E. coli K-12 MG1655, but also to a derivative K-12 strain, two more distantly related E. coli isolates, and a different Escherichia species. While the mapping is running, we will examine representative mapping results using the UCSC Genome Browser, a commonly used platform that supports a large number of genomes.

In addition to this document, you will need to download two files for the exercise. Be sure your computer is booted into Mac OS before proceeding. The files are available from either of two servers (same files on both servers). Download, unzip, and move the folders to the Desktop.

(1) Genetics875_lab2_bowtie.zip (609 MB):
    http://www.genome.wisc.edu/pub/Genetics875_lab2_bowtie.zip
    - OR -
    http://gel.ahabs.wisc.edu/~guy/Genetics875_lab2_bowtie.zip

    genomes (contains 8 fasta files)
        NC_000913.fna (Escherichia coli K-12 MG1655 chromosome)
        NC_010473.fna (Escherichia coli K-12 DH10B chromosome)
        NC_002695.fna (Escherichia coli O157:H7 Sakai chromosome)
        NC_002128.fna (Escherichia coli O157:H7 Sakai plasmid pO157)
        NC_002127.fna (Escherichia coli O157:H7 Sakai plasmid pOSAK1)
        NC_004431.fna (Escherichia coli CFT073 chromosome)
        NC_011740.fna (Escherichia fergusonii ATCC 35469 chromosome)
        NC_011743.fna (Escherichia fergusonii ATCC 35469 plasmid pEFER)
    indexes (an empty folder)
    reads (contains 2 fastq files)
        SRR001666_1.fastq
        SRR001666_2.fastq
    sample-bowtie-output.txt (a text file with an example of bowtie alignment results)

(2) Genetics875_lab2_Mauve.zip (11 MB):
    http://www.genome.wisc.edu/pub/Genetics875_lab2_Mauve.zip
    - OR -
    http://gel.ahabs.wisc.edu/~guy/Genetics875_lab2_Mauve.zip

    Mauve_URL.txt (text file with Mauve download URL; you want the Mac OS X 10.4+ version)
    MG1655-vs-CFT073 (pre-built Mauve alignment)
    NC_000913.gbk (Escherichia coli K-12 MG1655 chromosome, GenBank flatfile format)
    NC_004431.gbk (Escherichia coli CFT073 chromosome, GenBank flatfile format)

Part 1 -- Mapping reads to genomes using bowtie

Launch Terminal and navigate to the genomes directory:

    cd Desktop/Genetics875_lab2_bowtie/genomes

Build genome indexes for bowtie and move them to the indexes folder. In Terminal, run each of the following commands (this takes about 1 minute/genome):

    bowtie-build NC_000913.fna e_coli_mg1655
    bowtie-build NC_010473.fna e_coli_dh10b
    bowtie-build NC_002695.fna,NC_002128.fna,NC_002127.fna e_coli_sakai
    bowtie-build NC_004431.fna e_coli_cft073
    bowtie-build NC_011740.fna,NC_011743.fna e_fergusonii
    mv *.ebwt ../indexes
    cd ..

Use bowtie to map the reads to each genome, in succession (this will take ~10-15 minutes on average/genome).
After each run, copy the mapping summary from the screen before running the next mapping. It will take the following form (you will have actual numbers instead of 0s):

    Time loading reference: 00:00:00
    Time loading forward index: 00:00:00
    Time loading mirror index: 00:00:00
    Seeded quality full-index search: 00:00:00
    # reads processed: 0000000
    # reads with at least one reported alignment: 0000000 (00.00%)
    # reads that failed to align: 000000 (0.00%)
    Reported 0000000 paired-end alignments to 1 output stream(s)
    Time searching: 00:00:00
    Overall time: 00:00:00

Here are the commands for each run:

    bowtie -t -X 600 -p 2 e_coli_mg1655 -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq mg1655.out
    bowtie -t -X 600 -p 2 e_coli_dh10b -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq dh10b.out
    bowtie -t -X 600 -p 2 e_coli_sakai -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq sakai.out
    bowtie -t -X 600 -p 2 e_coli_cft073 -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq cft073.out
    bowtie -t -X 600 -p 2 e_fergusonii -1 reads/SRR001666_1.fastq -2 reads/SRR001666_2.fastq fergusonii.out

Q: How do the time it takes to map the reads and the proportion of reads that can be mapped change when you map the same data to the different genomes?
Q: What might that tell you about the relatedness of the data and reference genomes?

Part 2 – Examining sample bowtie output

In the Finder (as opposed to the Terminal), open the Genetics875_lab2_bowtie folder and then open the example output file “sample-bowtie-output.txt,” which is from MG1655 reads mapped to the CFT073 genome. bowtie output contains one alignment per line.
Each line is a collection of 8 fields separated by tabs; from left to right, the fields are:

    (1) Name of read that aligned
    (2) Reference strand aligned to, + for forward strand, - for reverse
    (3) Name of reference sequence where alignment occurs, or numeric ID if no name was provided
    (4) 0-based offset into the forward reference strand where leftmost character of the alignment occurs
    (5) Read sequence (reverse-complemented if orientation is -)
    (6) ASCII-encoded read qualities (reversed if orientation is -). The encoded quality values are on the Phred scale and the encoding is ASCII-offset by 33.
    (7) The number of other instances where the same sequence aligned against the same reference characters as were aligned against in the reported alignment. This is not the number of other places the read aligns with the same number of mismatches. The number in this column is generally not a good proxy for that number (e.g., the number in this column may be '0' while the number of other alignments with the same number of mismatches might be large).
    (8) Comma-separated list of mismatch descriptors. If there are no mismatches in the alignment, this field is empty. A single descriptor has the format offset:reference-base>read-base. The offset is expressed as a 0-based offset from the high-quality (5') end of the read.

Find a read that mapped with a mismatch. Note its location in the CFT073 genome.

Find a read that mapped to the same sequence in more than one place in the genome, and note where it was “assigned” in the CFT073 genome.

In the Finder, open the Genetics875_lab2_Mauve folder and install Mauve (download the Mac OS X 10.4+ version from the address in “Mauve_URL.txt”). Then open the pre-built alignment “MG1655-vs-CFT073.” Note: double-clicking on the file won’t work; you need to launch Mauve and then use File: Open alignment… Zoom in (“+” magnifying glass icon) until individual annotated features are displayed.

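The two reads asked for above (one with a mismatch, one reported at more than one location) can be found by eye, but a short script may be quicker on a large file. The following is a minimal Python sketch, not part of the lab materials, that splits a line of bowtie output into the eight fields described above. The names Alignment, parse_bowtie_line, and phred_scores are made up for illustration, and the two toy lines are invented examples rather than real records from sample-bowtie-output.txt.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    """One line of bowtie output, following the 8-field description above."""
    read_name: str          # field 1
    strand: str             # field 2: "+" or "-"
    ref_name: str           # field 3
    offset: int             # field 4: 0-based leftmost position on the forward strand
    sequence: str           # field 5
    qualities: str          # field 6: Phred scores, ASCII-offset by 33
    other_instances: int    # field 7
    mismatches: List[str]   # field 8: e.g. ["2:A>G"]; empty list if no mismatches


def parse_bowtie_line(line: str) -> Alignment:
    """Split one tab-separated bowtie output line into named fields."""
    fields = line.rstrip("\r\n").split("\t")
    # Field 8 is empty when there are no mismatches; pad in case it is absent.
    fields += [""] * (8 - len(fields))
    name, strand, ref, offset, seq, qual, other, mm = fields[:8]
    mismatches = [m for m in mm.split(",") if m]
    return Alignment(name, strand, ref, int(offset), seq, qual, int(other), mismatches)


def phred_scores(qualities: str) -> List[int]:
    """Decode ASCII-offset-33 quality characters into Phred scores."""
    return [ord(ch) - 33 for ch in qualities]


if __name__ == "__main__":
    # Invented toy records (hypothetical read names, short sequences).
    toy_lines = [
        "toy_read_1/1\t+\tNC_004431\t100\tACGTACGT\tIIIIIIII\t0\t2:A>G\n",
        "toy_read_2/1\t-\tNC_004431\t2500\tTTGACCAT\tIIIIIIII\t3\t\n",
    ]
    for line in toy_lines:
        aln = parse_bowtie_line(line)
        if aln.mismatches:
            print(aln.read_name, "has mismatches", aln.mismatches, "at offset", aln.offset)
        if aln.other_instances > 0:
            print(aln.read_name, "has a non-zero field 7:", aln.other_instances)
```

To try it on the real file, replace toy_lines with the lines of sample-bowtie-output.txt (for example, by iterating over open("sample-bowtie-output.txt")). Keep in mind the caveat in the description of field 7: a non-zero value flags a candidate read, but it is not a count of all other places the read aligns.
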
Use Mauve’s sequence navigation ability to go to the location in the CFT073 genome where you noted a sequence mismatch [View: Go To: Sequence Position, select Escherichia coli CFT073, and then enter the sequence coordinate from the sample bowtie output]. Hover over any feature that overlaps that position, and note its information. Do the same for the assigned location of the read that mapped to more than one place in the genome.

Q: What feature, if any, does the mismatch fall into? Is it in a region that Mauve aligned as part of a collinear block? What might the mismatch indicate?
Q: What feature, if any, does the read with multiple locations fall into? Is it in a region that Mauve aligned as part of a collinear block? Does the type of feature or its annotations suggest an explanation for why the read mapped to other locations in addition to here?

Part 3 – The UCSC Genome Browser

While bowtie is running we will look at the UCSC Genome Browser, a commonly used graphical tool for examining the results of read mapping experiments. The UCSC Microbial Genome Browser, an instance of the browser dedicated to bacterial and archaeal genomes, is at http://microbes.ucsc.edu/index.html.

The Genome Browser will take user-uploaded files in a variety of formats, including BED, BAM, GFF, GTF, and WIG. Unfortunately, bowtie writes only its own alignment format and SAM files, so we needed to either convert the bowtie output or use a different mapping tool. Uploading files to the UCSC browser takes a while; since you will be running bowtie and Mauve, we will upload some files to the Genome Browser on the computer up front for you to see:

    SRR001666.wig = coverage histogram of MG1655 data mapped to CFT073 genome (39.4 MB)
    cft073.bed = mapped location of each MG1655 read to CFT073 genome (240.7 MB)

Q: Both files represent the same data, but in different formats. Can you think of uses for which one of the file formats would be more useful than the other?

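To make the difference between the two formats concrete, here is a minimal Python sketch of one way bowtie output could be turned into BED-style intervals (one record per read) and into per-base depth (the kind of information a coverage histogram summarizes). It is an illustration only and is not necessarily how the files above were produced. The function names bowtie_line_to_bed and per_base_depth are hypothetical, the BED score column is simply set to 0, and the sequence name written to the BED line would need to match whatever name the browser uses for the CFT073 genome, which is not specified here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bowtie_line_to_bed(line: str) -> str:
    """Convert one line of bowtie output into a 6-column BED line.

    bowtie reports a 0-based leftmost offset, and BED uses a 0-based start
    with an exclusive end, so start = offset and end = offset + read length.
    """
    fields = line.rstrip("\r\n").split("\t")
    name, strand, ref, offset, seq = fields[:5]
    start = int(offset)
    end = start + len(seq)
    # Columns: chrom, chromStart, chromEnd, name, score (placeholder 0), strand
    return "\t".join([ref, str(start), str(end), name, "0", strand])


def per_base_depth(intervals: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Count how many (start, end) intervals cover each position (end exclusive)."""
    depth: Counter = Counter()
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return dict(depth)


if __name__ == "__main__":
    # Invented toy record, not taken from the lab data.
    toy = "toy_read_1/1\t+\tNC_004431\t100\tACGTACGT\tIIIIIIII\t0\t"
    print(bowtie_line_to_bed(toy))
    print(per_base_depth([(100, 108), (104, 112)]))
```

The BED-style output keeps every read's name, position, and strand, while the depth table discards read identity and keeps only how many reads cover each base. The simple per-position loop is fine for a toy example but would be slow on the full 14 million reads.
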
We’ll also show some screenshots of other data displayed using the browser:

    Mouse RNA-Seq data (chrI:94,328,000-94,397,000)
    Yeast ChIP-Seq data (chrXIII:261,000-282,000)