Rhesus Alignment and Variant Calling Summary

Jessica M. Maia
September 25, 2014.
This document describes the process of alignment and variant calling for a single genome sample.
Reference Sequence
We downloaded the individual chromosomes of the rhesus macaque genome assembly rheMac2 (v.1.0
Mmul_051212 rhesus genome assembly, Jan. 2006) from the UCSC Genome Browser. The reference
sequence we created is composed of chromosomes 1-20, X, and Ur.
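The construction of the reference can be sketched as a simple concatenation. This is a minimal sketch, not the exact procedure used: it assumes the UCSC archive unpacks to one file per chromosome named chr<NAME>.fa, and the function names rhemac2_chroms and build_reference as well as the chromosome order are illustrative choices.

```shell
# Print the chromosome names that make up the reference, one per line.
rhemac2_chroms() {
    local i
    for i in $(seq 1 20); do
        echo "chr${i}"
    done
    echo "chrX"
    echo "chrUr"
}

# Concatenate the per-chromosome FASTA files in directory $1 into file $2.
# Assumption: each chromosome is stored as <dir>/chr<NAME>.fa.
build_reference() {
    local src_dir="$1" out_fasta="$2" chrom
    : > "$out_fasta"    # start from an empty output file
    for chrom in $(rhemac2_chroms); do
        cat "${src_dir}/${chrom}.fa" >> "$out_fasta" || return 1
    done
}
```

After unpacking chromFa.tar.gz into a directory, a call such as 'build_reference chromFa rheMac2_ref.fa' would write the combined FASTA file; both names here are placeholders.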
Alignment
We aligned reads from a given sequencing lane to the reference sequence with BWA (version 0.5.10).
We trimmed low-quality bases with the parameter '-q 15'. To fix mate pair tags we used Picard (version
1.59). We then merged each sample's lane-level alignment files with Picard, creating a sample-level
coordinate-sorted alignment file. Duplicate reads were then removed from this merged alignment file
with Picard.
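The alignment steps above can be sketched as follows. This is a hedged reconstruction rather than the exact commands used. It assumes paired-end reads aligned with 'bwa aln' and 'bwa sampe', that '-q 15' was passed to 'bwa aln', and that Picard was run through its per-tool jars (FixMateInformation, MergeSamFiles, MarkDuplicates). All file names, the PICARD_DIR variable, and the helper functions are hypothetical. By default the sketch only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical location of the per-tool Picard jars.
PICARD_DIR="${PICARD_DIR:-.}"

# Print a command; execute it only when DRY_RUN=0 (default is print-only).
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Align one paired-end lane and fix mate information.
# Arguments: reference FASTA, read 1 FASTQ, read 2 FASTQ, output prefix.
align_lane() {
    local ref="$1" fq1="$2" fq2="$3" prefix="$4"
    run bwa aln -q 15 -f "${prefix}.1.sai" "$ref" "$fq1" || return 1
    run bwa aln -q 15 -f "${prefix}.2.sai" "$ref" "$fq2" || return 1
    run bwa sampe -f "${prefix}.sam" "$ref" \
        "${prefix}.1.sai" "${prefix}.2.sai" "$fq1" "$fq2" || return 1
    run java -jar "${PICARD_DIR}/FixMateInformation.jar" \
        INPUT="${prefix}.sam" OUTPUT="${prefix}.fixmate.bam" SORT_ORDER=coordinate
}

# Merge lane-level BAM files into one sample-level file and remove duplicates.
# Arguments: sample name, then one or more lane-level BAM files.
merge_and_dedup() {
    local sample="$1" n bam
    shift
    n=$#
    # Append one INPUT=<bam> argument per lane, then drop the bare file names.
    for bam in "$@"; do
        set -- "$@" "INPUT=${bam}"
    done
    shift "$n"
    run java -jar "${PICARD_DIR}/MergeSamFiles.jar" "$@" \
        OUTPUT="${sample}.merged.bam" SORT_ORDER=coordinate || return 1
    run java -jar "${PICARD_DIR}/MarkDuplicates.jar" \
        INPUT="${sample}.merged.bam" OUTPUT="${sample}.dedup.bam" \
        METRICS_FILE="${sample}.dup_metrics.txt" REMOVE_DUPLICATES=true
}
```

A typical call sequence would be 'align_lane' once per lane, followed by a single 'merge_and_dedup' with all of that sample's lane-level BAM files.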
Variant calling & annotation
To improve variant calling, we realigned reads around known rhesus SNPs and indels downloaded from
Ensembl (version 73) with GATK (version 1.6). Variants were identified with SAMtools (version 0.1.17)
using 'samtools mpileup -EDS -C50 -d 1000', followed by 'bcftools view -vcgN'. Variants were filtered
with VCFtools (version 0.1.11) and tabix (version 0.2.6) using the command 'vcf-annotate -f'. We
annotated the filtered variants using SnpEff (version 3.3) with Ensembl (MMUL_1.72) annotations.
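The variant calling and annotation steps can be sketched as a printed command plan. This is an illustrative reconstruction under several assumptions, not the exact commands used: the realignment is assumed to use GATK's RealignerTargetCreator and IndelRealigner walkers with the known sites in VCF format; the '-u -f' options to mpileup and the '-' input to bcftools are added so that the two quoted commands can be piped together; the filter specification passed to 'vcf-annotate -f' is not stated in this summary, so vcf-annotate's default set ('+') stands in as a placeholder; and the jar paths, file names, and the function name are hypothetical.

```shell
# Print the per-sample command plan (one command per line) without running it.
# Arguments: reference FASTA, deduplicated BAM, known-sites VCF, output prefix.
print_variant_calling_plan() {
    local ref="$1" bam="$2" known="$3" out="$4"
    local gatk="${GATK_JAR:-GenomeAnalysisTK.jar}"   # hypothetical jar path
    local snpeff="${SNPEFF_JAR:-snpEff.jar}"         # hypothetical jar path
    local filters="${VCF_FILTERS:-+}"                # placeholder filter spec
    cat <<EOF
java -jar ${gatk} -T RealignerTargetCreator -R ${ref} -I ${bam} -known ${known} -o ${out}.intervals
java -jar ${gatk} -T IndelRealigner -R ${ref} -I ${bam} -known ${known} -targetIntervals ${out}.intervals -o ${out}.realigned.bam
samtools mpileup -EDS -C50 -d 1000 -u -f ${ref} ${out}.realigned.bam | bcftools view -vcgN - > ${out}.raw.vcf
vcf-annotate -f ${filters} < ${out}.raw.vcf > ${out}.filtered.vcf
java -jar ${snpeff} eff MMUL_1.72 ${out}.filtered.vcf > ${out}.annotated.vcf
EOF
}
```

Because the plan contains pipes and redirections, it is printed rather than executed directly; after review, the output could be passed to 'sh' to run it, provided the paths contain no spaces or shell metacharacters.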
References
1. BWA
Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler
Transform. Bioinformatics, 25:1754-60
2. SAMtools
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin
R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map
(SAM) format and SAMtools. Bioinformatics, 25:2078-9
3. GATK
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D,
Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework
for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
4. Picard
http://picard.sourceforge.net
5. SnpEff
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:
SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Cingolani P, Platts A,
Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. Fly (Austin). 2012
6. VCFtools
The Variant Call Format and VCFtools, Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A.
Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T.
Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group,
Bioinformatics, 2011.
7. Tabix
Tabix: fast retrieval of sequence features from generic TAB-delimited files. Li H. Bioinformatics.
2011 Mar 1;27(5):718-9.
8. Rhesus Base annotation file
http://www.rhesusbase.org/download/download.jsp
RB2 Gene Models in gpe format.
9. Ensembl SNPs and Indels
http://e73.ensembl.org/macaca_mulatta
10. Reference Chromosomes
http://hgdownload.cse.ucsc.edu/goldenPath/rheMac2/bigZips/chromFa.tar.gz
11. Reference Sequence
http://hgdownload.cse.ucsc.edu/goldenPath/rheMac2/bigZips/chromFa.tar.gz