Mapping of Paired-End Reads to a Reference Genome and SNP Discovery

Background

Galaxy (main.g2.bx.psu.edu) is a powerful collection of analysis tools that allows users to perform sophisticated genomic analyses. Before starting this module, students will need internet access and a Galaxy account. These accounts are free. The analyses in this module can take a long time, in many instances hours per step. The module may therefore work best if the instructor has already completed the activity and has performed and saved the analysis in their own Galaxy ‘history’. Students can then be guided through the analysis without having to wait for each step to finish. The module may also be given out as a ‘take home assignment’; students then have the completed project in their saved history, and the instructor can go over each step during the lab period.

In this module students will learn how to map Illumina paired-end reads back to a reference genome, with the goal of identifying variants or single nucleotide polymorphisms (SNPs). The identification of genome-wide variation has implications in many lines of study. One may wish to see where variation is distributed throughout the genome for a particular individual or accession. For example, are there mutations specific to a phenotype, and do they occur in specific genes? This information may be useful in identifying specific genomic regions (or even genes) that contribute to a particular phenotype or disease.

Mapping sequence reads requires at minimum two types of data: a reference genome and next-generation sequence reads. It is possible to work with non-model organisms; however, this requires the additional step of sequencing, assembling and annotating the genome of interest, which is not a trivial task. We will therefore use a genome from an organism that has been sequenced and well characterized: yeast (Saccharomyces cerevisiae).
For a eukaryote, yeast has a relatively small genome and should be easier to work with, but a number of other reference genomes are available via Galaxy (e.g. human, Drosophila, Arabidopsis…). You will also need paired-end Illumina data to map back to the genome. You could upload your own sequence data if you had it available, but luckily a number of published data sets are available for use. These are accessible through the Galaxy interface, more specifically through the European Nucleotide Archive (www.ebi.ac.uk/ena).

The basic workflow for this analysis is as follows:
1-Obtain data and get it to your Galaxy workspace
2-Filter the data using FASTQ Groomer
3-Create summary statistics for your FASTQ data
4-Split your paired-end reads
5-Map your reads to the reference genome

The FASTQ data format stores your sequences and their associated quality scores in a single file. Please be aware that the quality scores may be stored differently for Illumina/Solexa, 454 or SOLiD sequencing platforms. Galaxy can handle all of these formats; you just need to be sure which format your data uses.

Module Goals
-Align NGS sequence data to a reference genome
-Obtain SNPs

By the end of this module you should be able to:
-Upload and/or obtain FASTQ data on Galaxy
-Filter and obtain summary statistics on FASTQ data
-Map sequence reads back to a reference genome
-Identify SNPs

Vision and Change Competencies
-Ability to apply the process of science
-Ability to use quantitative reasoning
-Ability to use modeling and simulation
-Ability to tap into the interdisciplinary nature of science

Sequencing Requirements

Sequencing requirements are project specific. This module utilizes paired-end Illumina data.

Software and Hardware Requirements

This module utilizes the Galaxy server. A computer with internet access is required.

Procedures

Obtain the test dataset (https://main.g2.bx.psu.edu/u/andres-aguilar/h/yeastgcat). Alternatively, you could upload your own data.
To do this click:
Get Data (Left Panel)
Upload Data
-Now you need to select the data file and upload it.

Alternatively, you can access this data set from my shared history (web link).

Once you have the data uploaded, you need to ‘clean it up’. The process is very similar to what is done in the ‘Sequence Processing and Quality Control’ Module.

Run this dataset through FASTQ Groomer, which can be found on the left-hand side under NGS: QC and Manipulation. Be sure you have selected the correct file under File to Groom. Also select the correct file type under Input FASTQ Quality Scores Type. For this dataset use the Illumina 1.3-1.7 selection.

When this is done you should have a new file in your history, named something like FASTQ Groomer on data XX. Next, look at the summary statistics for this groomed dataset. To do this, choose FASTQ Summary Statistics under NGS: QC and Manipulation. Select your groomed FASTQ dataset under FASTQ file and hit execute. This should produce another file called FASTQ Summary Statistics on data.

Viewing this data in tabular format is not very informative. Instead, you can use it to make a box plot of quality scores for each base position. To do this, click on Boxplot under Graph/Display Data. Be sure to select your Summary Statistics file and press execute. You should get a plot in which the x-axis is the read position and the y-axis is the quality score. It summarizes the quality scores across all of your reads.

Next you will split up the paired-end reads. To do this, select FASTQ Splitter under NGS: QC and Manipulation. For your FASTQ input file, use the data generated by FASTQ Groomer. Hit execute. If this is done successfully you should have two new files: your forward and reverse reads.

You can now map these split paired-end reads to the genome of interest, in this case yeast.
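As an aside on the grooming and summary-statistics steps above, the following is a minimal Python sketch of what they do conceptually. It assumes that the Illumina 1.3-1.7 selection corresponds to quality characters encoded as Phred score plus 64, and that the groomed output uses the Sanger convention of Phred score plus 33. The function names and the example quality strings are made up for illustration; this is not Galaxy's own code.

```python
from statistics import median

ILLUMINA_OFFSET = 64  # assumed encoding for Illumina 1.3-1.7 (Phred+64)
SANGER_OFFSET = 33    # Sanger encoding (Phred+33)


def decode_quality(qual, offset=ILLUMINA_OFFSET):
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def to_sanger(qual, offset=ILLUMINA_OFFSET):
    """Re-encode a quality string in the Sanger (Phred+33) convention."""
    return "".join(chr(q + SANGER_OFFSET) for q in decode_quality(qual, offset))


def per_position_medians(quality_strings, offset=ILLUMINA_OFFSET):
    """Median Phred score at each read position across all reads."""
    columns = {}
    for qual in quality_strings:
        for pos, score in enumerate(decode_quality(qual, offset)):
            columns.setdefault(pos, []).append(score)
    return [median(columns[pos]) for pos in sorted(columns)]


# Hypothetical quality strings, not taken from the module's dataset.
reads = ["hhhh", "hhdB", "hfdB"]
print(to_sanger("hhdB"))            # IIE#
print(per_position_medians(reads))  # [40, 40, 36, 2]
```

The per-position medians are the kind of value the box plot displays along the read; a drop toward the end of the read, as in this toy example, is a common pattern in Illumina data.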
Mapping will be done with a program called Bowtie (Langmead et al. 2009). It can be found under NGS: Mapping. Click on Map with Bowtie for Illumina. Select Saccharomyces_cerevisiae_S288C_SGD2010 as your reference genome. Also select Paired-End under Is this library mate paired. Be sure the correct FASTQ files are selected under Forward FASTQ file and Reverse FASTQ file. We will use the default parameters, so you can now click execute.

The next step is to filter the SAM file that is created by Bowtie. To do this, select Filter SAM under NGS: SAM Tools. Be sure the Map with Bowtie data is selected, then add flags for filtering. This is done by clicking Add New Flag and selecting the appropriate flag. In total you will have three flags:

Type: Read is Paired
Set the state for this flag as Yes.

Type: Read is mated in a proper pair
Set the state for this flag as Yes.

Type: The read is unmapped
Set the state for this flag as No.

The last one is a counterintuitive way of saying that the reads are mapped.

After this you need to convert the SAM file to a BAM file. This is done under NGS: SAM Tools; click on SAM-to-BAM and be sure to have your filtered SAM file selected.

After this is done you can generate a pileup file by selecting Generate pileup under NGS: SAM Tools. Once you have the SAM-to-BAM file selected, change the ‘Call consensus according to MAQ model?’ setting to Yes.

Sometimes Galaxy does not save the pileup file in the correct format. To confirm that it has, click on the ‘pencil’ (edit attributes) next to your completed pileup file. Click on the datatype tab and confirm that the file is recognized as a pileup file. If it is not, you can change it manually here. Be sure to save when you are done.

Next you have to filter the pileup file. The aim is to remove variable positions that have low sequence quality scores or that do not reach a minimum sequence coverage.
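Before moving on, the three-flag SAM filter described earlier in this section can be sketched in Python. The sketch relies on the bit values defined for the SAM FLAG field (0x1 read paired, 0x2 read mapped in a proper pair, 0x4 read unmapped). The function names and the example alignment lines are made up, and passing header lines through unchanged is an assumption of this sketch rather than a statement about how Galaxy's Filter SAM tool behaves.

```python
READ_PAIRED = 0x1    # "Read is paired"
PROPER_PAIR = 0x2    # "Read is mated in a proper pair"
READ_UNMAPPED = 0x4  # "The read is unmapped"


def keep_alignment(flag):
    """True if paired = Yes, proper pair = Yes and unmapped = No."""
    return (
        bool(flag & READ_PAIRED)
        and bool(flag & PROPER_PAIR)
        and not flag & READ_UNMAPPED
    )


def filter_sam_lines(lines):
    """Yield header lines unchanged and alignments that pass the filter."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header line (assumption: kept as-is)
            continue
        fields = line.rstrip("\n").split("\t")
        if keep_alignment(int(fields[1])):  # FLAG is the second column
            yield line


# Hypothetical alignment lines, not taken from the module's dataset.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t99\tchrI\t100\t255\t4M\t=\t300\t204\tACGT\tIIII",  # paired, proper pair
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",              # paired but unmapped
    "r3\t65\tchrI\t500\t255\t4M\tchrII\t90\t0\tACGT\tIIII",  # paired, not proper
]
for kept in filter_sam_lines(sam):
    print(kept.split("\t")[0])
```

With these made-up lines, only the header and read r1 pass the filter. Read r2 is dropped because its unmapped bit is set, which is what setting that flag to No selects against.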
Click on Filter Pileup under NGS: SAM Tools. Be sure the correct data file is selected, and under ‘which contains’ change the selection to Pileup with 10 Columns. The other option that needs to be changed is ‘Convert coordinates to intervals?’. Set this to Yes. This will allow you to visualize the SNPs later in the yeast genome. You are going to use the default settings for everything else, but you could explore other options. For instance, you could increase the sequence quality required for SNP calling and/or increase the coverage required at the SNP site.

After you click execute you should get a Filter on pileup data file. For this data set you should have 37 regions with SNPs. If you click on the ‘eye’ next to this output file you can see the report of where the SNPs are located.

You can visualize these SNPs. To do this, click on the ‘visualize’ button and select Trackster. You can name and save the new visualization. This browser will allow you to view the variable positions along each chromosome.

Assessment

Assessment questions for this module will be presented at the end of the class section. You will be asked to write/type out the commands necessary to accomplish various mapping processes.

Discussion Topics for Class
-Which chromosomes have SNPs?
-Which chromosome has the most SNPs?
-Are you surprised at the number of SNPs found given the number of reads you started with?

This module has been developed by Andres Aguilar for implementation at the 2013 GCAT-SEEK workshop at Juniata College. It relies heavily on usage examples from the internet and published data.

Literature Cited

Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.

Additional Reading

http://samtools.sourceforge.net/