MAMUPipe: A Computational Pipeline for
Identifying Novel SNP/SIDP in Rhesus Macaques
Based on 454 Pyrosequencing Technology
Brad Sickler 1, Ripan S. Malhi 2, Jessica Satkoski 3, Debbie George 3, Sree
Kanthaswamy 4, David Smith 3, Dawei Lin 1
Keywords: Single Nucleotide Polymorphism, Rhesus Macaque, pyrosequencing
Introduction
Rhesus macaques (Macaca mulatta, MAMU) are a widely used and valuable model for biomedical
research and for understanding the etiology of human diseases. With the completion of a draft sequence
of the rhesus macaque genome, the ability to identify genetic polymorphisms throughout the entire
genome will vastly improve the use of this animal model for identifying heritable components of
complex disease [1]. In particular, the identification of SNPs (Single Nucleotide Polymorphisms) and
SIDPs (Single Insertion/Deletion Polymorphisms) will contribute to a high density linkage map that
can be used to perform QTL (Quantitative Trait Locus) and association studies to identify genes that
contribute to a phenotype of interest. The limited number of rhesus macaques used to start the
breeding colonies in the United States requires effective genetic management of the approximately
15,000 rhesus macaques in captive colonies to avoid inbreeding and associated deleterious phenotypic
effects. SNPs are useful genetic markers for genetic management in captive colonies because they are
widely distributed and abundant in the genome and are amenable to automated high throughput
analyses.
Large-scale parallel pyrosequencing from 454 Life Sciences™ is a new method that can quickly
generate hundreds of thousands of short sequences (around 100 bp) [2]. Because of the nature of
shotgun sequencing, these sequences are distributed randomly across the entire genome. This broad
coverage makes the method well suited to discovering novel, unlinked SNP/SIDP through the analysis
of only a few individuals.
We have developed a high-throughput computational pipeline named MAMUPipe to identify
SNP/SIDP candidates based on 454 pyrosequencing data. The pipeline is capable of processing a large
number of DNA sequences with the assistance of a Linux computing cluster. In addition to
conventional bioinformatics sequence analysis, the pipeline accounts for features specific to
pyrosequencing and 454 technology to increase the reliability of SNP/SIDP identification.
Although this computational pipeline was developed for the rhesus macaque genome, it
should be easily adaptable to other genomes.
1 Genome Center Bioinformatics Core, UC Davis, CA, USA. E-mail: lhslin@ucdavis.edu
2 Dept. of Anthropology, UIUC, IL, USA. E-mail: malhi@uiuc.edu
3 Mol. Anthropology Lab., Dept. of Anthropology, UC Davis. E-mail: dgsmith@ucdavis.edu
4 Primate Center, UC Davis, CA, USA. E-mail: skanthaswamy@ucdavis.edu
MAMUPipe: A Computational Pipeline for SNP/SIDP Discovery
MAMUPipe performs three types of functionality:
(1) Sequence Alignment and Information Parsing: All 454 sequences were aligned against the
reference genome, in this case the MMUL 1.0 genome assembly from the Baylor College of
Medicine. Sequence alignment was performed with NCBI's blastn variant of the blastall program
without any low-complexity filters [3]. Chromosome number, 454 sequence ID, alignment location,
mismatch information, and gap information were parsed with a Perl script using the BioPerl modules.
To standardize conventions, alignment and mismatch locations were relabeled relative to the 5' strand
of the chromosome. Prior to the search, repetitive sequences and low-quality regions were masked out
of the database.
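As a rough illustration of the parsing step, the sketch below reads one line of tabular BLAST output and relabels the chromosome coordinates so that the start never exceeds the end. It is written in Python rather than the Perl/BioPerl used by the authors, and it assumes the 12-column tabular format (the `-m 8` option of the legacy blastall program); the text does not say which output format was actually parsed. The `Hit` record, the `parse_tabular_hit` function, and the example line are hypothetical, and reading "relative to the 5' strand" as forward-strand coordinates is an interpretation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One read-to-genome alignment (hypothetical record layout)."""
    read_id: str
    chrom: str
    identity: float   # percent identity reported by BLAST
    mismatches: int
    gap_opens: int
    start: int        # 1-based chromosome coordinate, start <= end
    end: int
    strand: str       # '+' or '-'


def parse_tabular_hit(line: str) -> Hit:
    """Parse one 12-column tabular BLAST line into a Hit."""
    f = line.rstrip("\n").split("\t")
    sstart, send = int(f[8]), int(f[9])
    # In tabular blastn output a minus-strand hit has sstart > send.
    strand = "+" if sstart <= send else "-"
    return Hit(
        read_id=f[0],
        chrom=f[1],
        identity=float(f[2]),
        mismatches=int(f[4]),
        gap_opens=int(f[5]),
        start=min(sstart, send),
        end=max(sstart, send),
        strand=strand,
    )


# Made-up example line: a 103 bp read aligned to the minus strand.
example = "read0001\tchr7\t99.03\t103\t1\t0\t1\t103\t5200\t5098\t2e-50\t196"
hit = parse_tabular_hit(example)
print(hit)
```

Storing every hit with `start <= end` plus an explicit strand flag lets later steps, such as overlap counting and position queries, compare coordinates without special cases for reverse-strand alignments.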
(2) Sequence Filtering and Selection: To avoid reporting non-informative sequence
polymorphisms, alignments between 454 sequences and the reference genome that
corresponded to variable MHC regions or to regions exhibiting a statistically improbable number of
sequence overlaps (>4 overlaps) were removed from further analysis. Only alignments with >=98%
identity or only one mismatch were accepted for analysis. The 98% cutoff was selected to allow one
or two mismatches on a single sequence (the average 454 sequence is about 103 bp). After this step, the
contaminated sequences were eliminated. To control for pyrosequencing read errors in
homopolymeric regions due to nonlinear light response [4], any mismatch that was an
extension/truncation of an existing homopolymeric repeat was not considered in further analysis. All
mismatches occurring within 5 base pairs of a gap were screened out. SNP sites were considered
invalid if they exhibited either a Sanger quality score below 60 in the draft sequence or a 454 Life
Sciences™ quality score below 20. Any mismatch passing the aforementioned filters was considered a
potential SNP/SIDP candidate.
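The filters in this step can be summarized as a chain of threshold checks. The Python sketch below is one possible reading, not the authors' Perl implementation: the `Mismatch` record, its field names, and the helper functions are hypothetical, and "within 5 base pairs of a gap" is taken to mean a distance of 5 or less. The thresholds themselves (98% identity, more than 4 overlaps, quality scores of 60 and 20) come from the text.

```python
import math
from dataclasses import dataclass
from typing import Optional

MIN_IDENTITY = 98.0     # percent identity cutoff from the text
MAX_OVERLAPS = 4        # regions with more overlaps than this are dropped
GAP_WINDOW = 5          # mismatches this close to a gap are dropped
MIN_REF_QUALITY = 60    # Sanger quality score of the draft genome base
MIN_READ_QUALITY = 20   # 454 quality score of the read base


@dataclass
class Mismatch:
    """One mismatch plus the alignment context needed for filtering."""
    identity: float              # percent identity of the alignment
    n_mismatches: int            # mismatches in the whole alignment
    overlaps: int                # reads overlapping this region
    in_mhc: bool                 # alignment falls in a variable MHC region
    homopolymer_indel: bool      # extends or truncates a homopolymer run
    gap_distance: Optional[int]  # bases to nearest gap, None if no gap
    ref_quality: int
    read_quality: int


def max_mismatches(length: int, min_identity: float = MIN_IDENTITY) -> int:
    """Largest mismatch count that still meets the identity cutoff."""
    return math.floor(length * (100.0 - min_identity) / 100.0)


def run_length(seq: str, pos: int) -> int:
    """Length of the run of identical bases containing seq[pos] (0-based)."""
    left = right = pos
    while left > 0 and seq[left - 1] == seq[pos]:
        left -= 1
    while right < len(seq) - 1 and seq[right + 1] == seq[pos]:
        right += 1
    return right - left + 1


def is_candidate(m: Mismatch) -> bool:
    """Apply the filters in the order they are described in the text."""
    if m.in_mhc or m.overlaps > MAX_OVERLAPS:
        return False
    if m.identity < MIN_IDENTITY and m.n_mismatches != 1:
        return False
    if m.homopolymer_indel:
        return False
    if m.gap_distance is not None and m.gap_distance <= GAP_WINDOW:
        return False
    if m.ref_quality < MIN_REF_QUALITY or m.read_quality < MIN_READ_QUALITY:
        return False
    return True


# A 103 bp read at the 98% cutoff tolerates floor(2.06) = 2 mismatches.
print(max_mismatches(103))
# One hypothetical way to set the homopolymer flag for a deleted base:
print(run_length("ACGGGT", 3) >= 2)
```

The `max_mismatches` helper reproduces the reasoning behind the cutoff: 2% of an average 103 bp read is 2.06 bases, so at most two mismatches fit under the 98% threshold.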
(3) Experimental and Computational Information Integration and Search: It is a non-trivial
task to manage and analyze hundreds of thousands of sequences and their associated computational
results, especially in a complete genome context. We created a database that integrates 454 DNA
sequences, their corresponding quality scores, BLAST outputs, rhesus macaque genome sequences,
and their corresponding Sanger quality scores. The database was built on top of MySQL and has a Web
interface that allows dynamic queries through Perl CGI (Common Gateway Interface). The Website
for this database, dubbed MAMUSNP, can be accessed at http://mamusnp.genomecenter.ucdavis.edu.
MAMUSNP supports three types of queries. 1) Use of one or more 454 sequence IDs to retrieve a
summary report showing the 454 sequences, the relative position of each base pair, and the quality
scores in table format, with mismatched bases highlighted in a different color. The details of the BLAST
outputs, including BLAST scores and the sequence alignments with the reference, are also shown in the
report. This query is useful for verifying computational results and SNP/SIDP candidate selection. 2) Use of
a chromosome number and a position on that chromosome to find whether any 454 sequence
overlaps that region. This type of query is useful for someone who wants to find a SNP/SIDP
that covers a site of interest. 3) Use of a chromosome number and a position on that chromosome to
retrieve flanking sequences. This type of query is useful for primer design for experimental
verification of SNP/SIDPs.
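The second and third query types reduce to an interval lookup and a substring extraction. The Python sketch below illustrates both under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL, and the `alignment` table, its columns, the sample rows, and the function names are all hypothetical, since the actual MAMUSNP schema is not described.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL back end.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE alignment (
           read_id   TEXT,
           chrom     TEXT,
           start_pos INTEGER,  -- 1-based, start_pos <= end_pos
           end_pos   INTEGER
       )"""
)
conn.executemany(
    "INSERT INTO alignment VALUES (?, ?, ?, ?)",
    [
        ("read0001", "chr7", 5098, 5200),
        ("read0002", "chr7", 5300, 5400),
        ("read0003", "chr1", 5100, 5200),
    ],
)


def reads_covering(db, chrom, pos):
    """Query type 2: IDs of reads whose alignment spans chrom:pos."""
    rows = db.execute(
        "SELECT read_id FROM alignment "
        "WHERE chrom = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY read_id",
        (chrom, pos, pos),
    )
    return [row[0] for row in rows]


def flanking(chrom_seq, pos, flank=100):
    """Query type 3: (left flank, base, right flank) around a 1-based position."""
    i = pos - 1  # convert to a 0-based string index
    left = chrom_seq[max(0, i - flank):i]
    right = chrom_seq[i + 1:i + 1 + flank]
    return left, chrom_seq[i], right


print(reads_covering(conn, "chr7", 5150))
print(flanking("ACGTACGT", 4, flank=2))
```

Parameterized queries (the `?` placeholders) keep user-supplied chromosome names and positions out of the SQL text, which matters for any database exposed through a Web form.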
We used the pipeline to process 339,967 sequence reads from a single sequencing run and identified
22,892 SNP candidates and 2,923 SIDP candidates. Re-sequencing confirmation experiments are
underway.
References
[1] Milosavljevic, A. 2005. Pooled genomic indexing of rhesus macaque. Genome Research 15:292-301.
[2] Margulies, M. et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature
437(7057):376-380.
[3] Altschul, S. et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25:3389-3402.
[4] Ronaghi, M. et al. 1998. A sequencing method based on real-time pyrophosphate. Science 281:363-365.