Mapping_SNPS_Module_GCAT_2013_Aguilar_v2

Mapping of Paired-End Reads to a Reference
Genome and SNP discovery
Background
Galaxy (main.g2.bx.psu.edu) is a powerful collection of analysis tools that allows users to perform
sophisticated genomic analyses. Before starting this module, students will need internet access and a
Galaxy account. These accounts are free. Note that the analyses in this module can take a long time; in
many instances a single step can take hours. The module may therefore work best if the instructor has
already completed the activity and performed and saved the analysis in their own Galaxy ‘history’. This
way students can be guided through the analysis without having to wait for each step to be completed.
The module may also be given out as a ‘take home assignment’, with students then having access to the
completed project in their saved history. The instructor can then go over each step during the lab
period.
In this module students will learn how to map Illumina paired-end reads back to a reference genome,
with the goal of identifying variants, specifically single nucleotide polymorphisms (SNPs). The
identification of genome-wide variation has implications in many lines of study. One may wish to see how
variation is distributed throughout the genome of a particular individual or accession. For example, are
there mutations specific to a phenotype, and do they occur in specific genes? This information may be
useful in identifying specific genomic regions (or even genes) that contribute to a particular phenotype
or disease.
Mapping sequence reads requires at minimum two types of data: a reference genome and next generation
sequences. It is possible to work with non-model organisms; however, this requires the additional
step of sequencing, assembling and annotating the genome of interest, which is not a trivial task. We
will therefore be using a genome from an organism that has been sequenced and well characterized: yeast
(Saccharomyces cerevisiae). For a eukaryote, yeast has a relatively small genome and should be easier to
work with, but a number of other reference genomes are available via Galaxy (e.g. human, Drosophila,
Arabidopsis…). You will also need paired-end Illumina data to map back to the genome. You could upload
your own sequence data if you had it available, but luckily a number of published data sets are
available for use. These are accessible through the Galaxy interface, more specifically through
the European Nucleotide Archive (www.ebi.ac.uk/enaA).
The basic workflow for this analysis is as follows:
1-Obtain data and get it to your Galaxy workspace
2-Filter the data using FASTQ Groomer
3-Create summary statistics for your FASTQ data
4-Split your paired-end reads
5-Map your reads to the reference genome
The FASTQ data format stores your sequences and their associated quality scores in a single file. Please
be aware that the quality scores may be encoded differently for the Illumina/Solexa, 454 or SOLiD
sequencing platforms. Galaxy can handle all of these formats; you just need to be sure which kind of data
you are working with.
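To make the point about quality encodings concrete, here is a minimal Python sketch of how a quality string is turned into numeric scores. It assumes the conventional ASCII offsets (33 for Sanger-style files, 64 for Illumina 1.3-1.7); the read name, sequence and quality string are invented for illustration, and the function names are hypothetical rather than part of Galaxy.

```python
# Sketch: decode a FASTQ quality string into Phred scores.
# Assumed offsets: 33 for Sanger-style FASTQ, 64 for Illumina 1.3-1.7.
# The record below is invented for illustration only.

def decode_quality(quality_string, offset):
    """Convert each quality character to its integer Phred score."""
    return [ord(char) - offset for char in quality_string]

def to_sanger(quality_string, source_offset=64):
    """Re-encode a quality string using the Sanger (offset 33) convention."""
    return "".join(chr(ord(char) - source_offset + 33) for char in quality_string)

record = [
    "@example_read/1",  # hypothetical read name
    "GATTACA",          # sequence
    "+",                # separator line
    "hhhgfeB",          # qualities, assumed to be offset 64
]

scores = decode_quality(record[3], offset=64)
print(scores)                # numeric quality score at each base
print(to_sanger(record[3]))  # the same scores written with offset 33
```

Reading the same string with the wrong offset shifts every score by 31, which is why the quality score type has to be chosen correctly before any filtering is done.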
Module Goals
-Align NGS sequence data to a reference genome
-Obtain SNPs
By the end of this module you should be able to:
-Upload and/or obtain FASTQ data on Galaxy
-Filter and obtain summary statistics on FASTQ data
-Map sequence reads back to a reference genome
-Identify SNPs
Vision and Change Competencies
Ability to apply the process of science
Ability to use quantitative reasoning
Ability to use modeling and simulation
Ability to tap into the interdisciplinary nature of science
Sequencing Requirements
Sequencing requirements are project specific. This module utilizes paired-end Illumina data.
Software and Hardware Requirements
This module utilizes the Galaxy server. A computer with internet access is required.
Procedures
Obtain the test dataset (https://main.g2.bx.psu.edu/u/andres-aguilar/h/yeastgcat). Alternatively you
could upload your own data. To do this click:
Get Data (Left Panel)
Upload Data
-Select the data file and upload it. Alternatively, you can access this data set from my shared
history (web link).
Once you have the data uploaded we now need to ‘clean it up’. The process is very similar to what is
done in the ‘Sequence Processing and Quality Control’ Module.
Now you need to run this dataset through FASTQ Groomer. This can be found on the left hand side
under
NGS: QC and Manipulation
Be sure you have selected the correct file under the File to Groom tab.
Also select the correct file type under the Input FASTQ Quality Scores Type. For this dataset use the
Illumina 1.3-1.7 selection.
When this is done you should have a new file in your history, named something like FASTQ Groomer on
data XX. We will then want to look at the summary statistics for this groomed dataset. To do this choose
FASTQ Summary Statistics under NGS: QC and Manipulation. Select your groomed FASTQ dataset under
the FASTQ file tab and hit execute. This should produce another file called FASTQ Summary Statistics on
data.
Viewing this data in tabular format isn’t really all that informative. We can make a box plot of quality
scores for each base position with this information. To do this click on Boxplot under Graph/Display
Data. Be sure to select your Summary Statistics file and press execute. You should get something like
this:
The x-axis is the read position and the y-axis is the quality score. The plot summarizes the quality
scores at each position across all of your reads.
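The idea behind the summary statistics and the box plot can be sketched in a few lines of Python. This is only an illustration of per-position summarizing, not the code Galaxy runs; it assumes the quality strings use offset 33 (as after grooming to Sanger encoding), and the three quality strings are invented.

```python
import statistics

def per_position_summary(quality_strings, offset=33):
    """Summarize quality scores column by column (read position by read position)."""
    columns = {}
    for quality in quality_strings:
        for position, char in enumerate(quality, start=1):
            columns.setdefault(position, []).append(ord(char) - offset)
    summary = []
    for position in sorted(columns):
        scores = columns[position]
        summary.append({
            "position": position,
            "count": len(scores),   # reads long enough to reach this position
            "min": min(scores),
            "max": max(scores),
            "median": statistics.median(scores),
            "mean": statistics.mean(scores),
        })
    return summary

# Invented quality strings for three reads of unequal length.
example_qualities = ["IIII", "IIH#", "IH#"]
for row in per_position_summary(example_qualities):
    print(row)
```

Each row corresponds to one box in the box plot: a read position together with the spread of scores seen there.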
Next we will want to split up our paired end reads. To do this select FASTQ Splitter under NGS: QC and
Manipulation. For your FASTQ input file you will want to use the data that has been generated from the
FASTQ Groomer application. Hit execute. If this is done successfully you should have two new files –
your forward and reverse reads.
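As a rough picture of what splitting does, the Python sketch below cuts one joined record into two halves. It assumes that each record holds the forward and reverse reads joined end to end and that both reads have the same length; the `/1` and `/2` name suffixes and the function name are illustrative assumptions, not a description of the tool's exact output.

```python
def split_joined_read(name, sequence, quality):
    """Split one joined paired-end record into forward and reverse records.

    Assumes the two reads are concatenated and of equal length.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if len(sequence) % 2 != 0:
        raise ValueError("joined read must have an even length")
    half = len(sequence) // 2
    forward = (name + "/1", sequence[:half], quality[:half])
    reverse = (name + "/2", sequence[half:], quality[half:])
    return forward, reverse

# Invented example record.
forward_read, reverse_read = split_joined_read("read1", "ACGTTTGA", "IIIIHHHH")
print(forward_read)
print(reverse_read)
```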
We can now map these split paired-end reads to our genome of interest, in this case yeast. Mapping
will be done with a program called Bowtie (Langmead et al. 2009), which can be found under NGS: Mapping.
Click on the Map with Bowtie for Illumina tab. To select the correct genome, choose
Saccharomyces_cerevisiae_S288C_SGD2010 as your reference genome. Also select Paired-End under Is
this library mate paired. Be sure the correct FASTQ files are selected under the Forward and Reverse
FASTQ file tabs. We will use the default parameters, so you can now click execute.
The next step is to filter the SAM file that is created by Bowtie. To do this select Filter SAM under
NGS: SAM Tools. Be sure the Map with Bowtie data is selected, then add flags for filtering.
This is done by clicking Add New Flag and selecting the appropriate flag. In total you will have three
flags:
Type: Read is Paired
Set the state for this flag as YES.
Type: Read is mated in a proper pair
Set the state for this flag as YES.
Type: The read is unmapped
Set the state for this flag as NO.
The last one is a counterintuitive way of saying that the reads are mapped.
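The three flag settings above correspond to bits in the FLAG column (the second column) of each SAM record: 0x1 for a paired read, 0x2 for a read mapped in a proper pair, and 0x4 for an unmapped read. The Python sketch below shows the same filter as a bitwise test. The function names and the example records are made up for illustration; this is not the code behind the Galaxy tool.

```python
# SAM FLAG bits used by the three filter settings.
READ_PAIRED = 0x1    # Read is paired
PROPER_PAIR = 0x2    # Read is mated in a proper pair
READ_UNMAPPED = 0x4  # The read is unmapped

def keep_read(flag):
    """Return True if the read is paired, properly paired, and mapped."""
    return (
        bool(flag & READ_PAIRED)
        and bool(flag & PROPER_PAIR)
        and not flag & READ_UNMAPPED
    )

def filter_sam_lines(lines):
    """Keep header lines and alignment lines that pass keep_read."""
    kept = []
    for line in lines:
        if line.startswith("@"):           # header lines pass through unchanged
            kept.append(line)
            continue
        flag = int(line.split("\t")[1])    # FLAG is the second column
        if keep_read(flag):
            kept.append(line)
    return kept

# Invented example records.
example_sam = [
    "@HD\tVN:1.0",
    "r1\t99\tchrI\t100\t255\t4M\t=\t200\t104\tACGT\tIIII",  # paired, proper pair, mapped
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",              # paired but unmapped
]
for line in filter_sam_lines(example_sam):
    print(line)
```

Setting the unmapped flag to NO is the same as requiring that bit 0x4 is not set, which is why it selects mapped reads.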
After this we need to convert the SAM file to a BAM file. This is done under the NGS: SAM Tools section.
Click on SAM-to-BAM and be sure to have your filtered SAM file selected.
After this is done we can generate a pileup file. This is done by selecting Generate pileup under NGS:
SAM Tools. Once you have the SAM-to-BAM file selected you need to change the Call consensus
according to MAQ model? setting to YES.
Sometimes Galaxy does not save the pileup file in the correct format. To confirm that it has, click on
the ‘pencil’ (edit attributes) next to your completed pileup file. Click on the datatype tab and confirm
that it is recognized as a pileup file. If it is not, you can change it manually here. Be sure to save
when you are done.
Next you have to filter the pileup file. Basically we want to keep only variable positions that have
adequate sequence quality scores and sufficient sequence coverage. Click on Filter Pileup under NGS: SAM
Tools. Be sure the correct data file is selected and, under which contains:, change the setting to Pileup
with 10 Columns. The other option that needs to be changed is Convert coordinates to intervals?. Set
this to Yes. This will allow you to visualize the SNPs in the yeast genome later.
You are going to use the default settings, but you could explore other options. For instance you could
increase the sequence quality required for SNP calling and/or you could increase the coverage of the
SNP site.
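The logic of this filtering step can be sketched in Python as follows. This is a simplified illustration, not the algorithm used by Filter Pileup. It assumes the 10-column pileup layout (chromosome, 1-based position, reference base, consensus base, consensus quality, SNP quality, maximum mapping quality, coverage, read bases, base qualities). The thresholds of 20 and 3 are placeholder values rather than the tool's documented defaults, and the example lines are invented.

```python
def filter_pileup_line(line, min_snp_quality=20, min_coverage=3):
    """Return an interval tuple for a confident variable site, or None.

    Thresholds are illustrative placeholders, not Galaxy's defaults.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected a 10-column pileup line")
    chrom = fields[0]
    position = int(fields[1])        # 1-based coordinate
    reference = fields[2].upper()
    consensus = fields[3].upper()
    snp_quality = int(fields[5])
    coverage = int(fields[7])
    if consensus == reference:
        return None                  # not a variable position
    if snp_quality < min_snp_quality or coverage < min_coverage:
        return None                  # too little support
    # Convert the 1-based position to a 0-based, half-open interval.
    return (chrom, position - 1, position, reference, consensus, coverage)

# Invented example lines.
example_pileup = [
    "chrI\t1000\tA\tG\t45\t45\t60\t5\tGGGgg\tIIIII",  # well-supported variant
    "chrI\t2000\tC\tT\t10\t10\t60\t2\tTt\tII",        # low quality and coverage
    "chrI\t3000\tG\tG\t60\t0\t60\t8\t........\tIIIIIIII",  # matches the reference
]
for line in example_pileup:
    print(filter_pileup_line(line))
```

Raising `min_snp_quality` or `min_coverage` in this sketch mimics the stricter settings mentioned above and would be expected to return fewer sites.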
After you click execute you should get a Filter on pileup data file. For this data set you should have 37
regions with SNPs. If you click on the ‘eye’ next to this output file you can see the report of where
the SNPs are located.
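If you download the filtered report, a few lines of Python are enough to tally the SNP regions by chromosome. The sketch assumes a tab-separated file whose first column is the chromosome name; the chromosome names and coordinates below are invented and are not results from this data set.

```python
from collections import Counter

def count_by_chromosome(lines):
    """Count how many reported regions fall on each chromosome."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        chromosome = line.split("\t")[0]  # first column is assumed to be the chromosome
        counts[chromosome] += 1
    return counts

# Invented example rows (chromosome, start, end).
example_rows = [
    "chrI\t999\t1000",
    "chrIV\t4999\t5000",
    "chrIV\t7999\t8000",
]
snp_counts = count_by_chromosome(example_rows)
print(snp_counts)
print(snp_counts.most_common(1))  # chromosome with the most regions
```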
You can visualize these SNPs. To do this click on the ‘visualize’ button and select Trackster. You can
give the new visualization a name and save it. This browser will allow you to view the variable
positions along each chromosome.
Assessment
Assessment questions for this module will be presented at the end of the class section. You will be asked
to write/type out the commands necessary to accomplish various mapping processes.
Discussion Topics for Class
Which chromosomes have SNPs?
Which chromosome has the most SNPs?
Are you surprised at the number of SNPs found given the number of reads you started with?
Literature Cited
This module has been developed by Andres Aguilar for implementation at the 2013 GCAT-SEEK
workshop at Juniata College. It relies heavily on usage examples from the internet and published data.
Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biology 10:R25.
Additional Reading
http://samtools.sourceforge.net/