MAMUPipe: A Computational Pipeline for Identifying Novel SNP/SIDP in Rhesus Macaques Based on 454 Pyrosequencing Technology

Brad Sickler 1, Ripan S. Malhi 2, Jessica Satkoski 3, Debbie George 3, Sree Kanthaswamy 4, David Smith 3, Dawei Lin 1

Keywords: Single Nucleotide Polymorphism, Rhesus Macaque, pyrosequencing

Introduction

Rhesus macaques (Macaca mulatta, MAMU) are a widely used and valuable model for biomedical research and for understanding the etiology of human diseases. With the completion of a draft sequence of the rhesus macaque genome, the ability to identify genetic polymorphisms throughout the entire genome will vastly improve the use of this animal model for identifying heritable components of complex disease [1]. In particular, the identification of SNPs (Single Nucleotide Polymorphisms) and SIDPs (Single Insertion/Deletion Polymorphisms) will contribute to a high-density linkage map that can be used to perform QTL (Quantitative Trait Locus) and association studies to identify genes that contribute to a phenotype of interest.

Because the breeding colonies in the United States were started from a limited number of rhesus macaques, the approximately 15,000 animals in captive colonies require effective genetic management to avoid inbreeding and its associated deleterious phenotypic effects. SNPs are useful markers for genetic management in captive colonies because they are abundant and widely distributed in the genome and are amenable to automated high-throughput analyses.

The large-scale parallel pyrosequencing technology from 454 Life Sciences™ is a new method that can quickly obtain hundreds of thousands of short sequences (around 100 bp) [2]. Because of the nature of shotgun sequencing, those sequences are distributed randomly across the entire genome. This wide coverage makes the method well suited to discovering novel unlinked SNP/SIDP through the analysis of only a few individuals.
We have developed a high-throughput computational pipeline named MAMUPipe to identify SNP/SIDP candidates based on 454 pyrosequencing data. The pipeline is capable of processing a large number of DNA sequences with the assistance of a Linux computing cluster. Besides conventional bioinformatics sequence analysis, we also take into account the unique features of pyrosequencing and 454 technology to increase the reliability of SNP/SIDP identification. Although this computational pipeline was developed for the rhesus macaque genome, it should be easily adaptable to the analysis of other genomes.

1 Genome Center Bioinformatics Core, UC Davis, CA, USA. E-mail: lhslin@ucdavis.edu
2 Dept. of Anthropology, UIUC, IL, USA. E-mail: malhi@uiuc.edu
3 Mol. Anthropology Lab., Dept. of Anthropology, UC Davis. E-mail: dgsmith@ucdavis.edu
4 Primate Center, UC Davis, CA, USA. E-mail: skanthaswamy@ucdavis.edu

MAMUPipe: A Computational Pipeline for SNP/SIDP Discovery

MAMUPipe performs three types of functionality:

(1) Sequence Alignment and Information Parsing: All 454 sequences were aligned against the reference genome, in this case the MMUL 1.0 genome assembly from the Baylor College of Medicine. Sequence alignment was performed with the blastn variant of NCBI's blastall program without any low-complexity filters [3]. Chromosome number, 454 sequence ID, alignment location, mismatch information, and gap information were parsed with a Perl script using the BioPerl modules. To standardize conventions, alignment and mismatch locations were relabeled relative to the 5' strand of the chromosome. Prior to the search, repetitive sequences and low-quality regions were masked out of the database.
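The coordinate relabeling described in step (1) can be illustrated with a minimal sketch. It is written in Python rather than the Perl/BioPerl used by the pipeline, and it assumes BLAST's 12-column tabular output (`blastall -m 8`), a format the text does not specify. The names `Hit`, `parse_tabular_hit`, and `forward_position` are hypothetical, and the position mapping assumes an ungapped alignment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    read_id: str
    chrom: str
    identity: float   # percent identity reported by BLAST
    mismatches: int
    gap_opens: int
    start: int        # 1-based forward-strand start (always <= end)
    end: int          # 1-based forward-strand end
    strand: str       # '+' or '-'


def parse_tabular_hit(line: str) -> Hit:
    """Parse one line of 12-column BLAST tabular output (assumed format)."""
    f = line.rstrip("\n").split("\t")
    sstart, send = int(f[8]), int(f[9])
    # In tabular blastn output a minus-strand hit has sstart > send.
    strand = "+" if sstart <= send else "-"
    return Hit(
        read_id=f[0],
        chrom=f[1],
        identity=float(f[2]),
        mismatches=int(f[4]),
        gap_opens=int(f[5]),
        start=min(sstart, send),
        end=max(sstart, send),
        strand=strand,
    )


def forward_position(hit: Hit, offset: int) -> int:
    """Map a 0-based offset from the first aligned read base to a
    forward-strand chromosome coordinate (ungapped alignments only)."""
    if hit.strand == "+":
        return hit.start + offset
    # On the minus strand the read runs from hit.end down to hit.start.
    return hit.end - offset


if __name__ == "__main__":
    example = "read1\tchr1\t99.03\t103\t1\t0\t1\t103\t5103\t5001\t1e-50\t196"
    hit = parse_tabular_hit(example)
    print(hit.strand, hit.start, hit.end, forward_position(hit, 0))
```

Storing `start <= end` together with an explicit strand flag keeps every downstream comparison (overlap counting, gap proximity) on a single coordinate convention, which is the purpose the text gives for the relabeling.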
(2) Sequence Filtering and Selection: To avoid reporting non-informative sequence polymorphisms, alignments between 454 sequences and the reference genome were removed from further analysis if they corresponded to variable MHC regions or to regions exhibiting a statistically improbable number of sequence overlaps (>4 overlaps). Only alignments with >=98% identity or with only one mismatch were accepted for analysis. The 98% cutoff was selected to allow one or two mismatches on a single sequence (the average 454 sequence is about 103 bp). This step also eliminated contaminating sequences. To control for pyrosequencing read errors in homopolymeric regions caused by the non-linear light response [4], any mismatch that is an extension or truncation of an existing homopolymeric repeat was excluded from further analysis. All mismatches occurring within 5 base pairs of a gap were screened out. SNP sites were considered invalid if they exhibited either a Sanger quality score below 60 in the draft sequence or a 454 Life Sciences™ score below 20. Any mismatch passing the aforementioned filters was considered a potential SNP/SIDP candidate.

(3) Experimental and Computational Information Integration and Search: It is a non-trivial task to manage and analyze hundreds of thousands of sequences and their associated computational results, especially in a complete genome context. We created a database that integrates 454 DNA sequences, their corresponding quality scores, BLAST outputs, rhesus macaque genome sequences, and their corresponding Sanger quality scores. The database was built on top of MySQL and has a Web interface that allows dynamic queries through Perl CGI (Common Gateway Interface). The Website for this database, dubbed MAMUSNP, can be accessed at http://mamusnp.genomecenter.ucdavis.edu. MAMUSNP supports three types of queries:
1) Use of one or more 454 sequence IDs to retrieve a summary report showing the 454 sequences, the relative position of each base pair, and the quality scores in a table format, with mismatched bases highlighted in a different color. The details of the BLAST outputs, including BLAST scores and the sequence alignments with the reference, are also shown in the report. This query is useful for verifying computational results and SNP/SIDP candidate selection.
2) Use of a chromosome number and a position on that chromosome to find whether any 454 sequence overlaps that region. This type of query is useful for finding a SNP/SIDP that covers a site of interest.
3) Use of a chromosome number and a position on that chromosome to retrieve flanking sequences. This type of query is useful for primer design for experimental verification of SNP/SIDPs.

We used the pipeline to process 339,967 sequence reads from a single sequencing run and identified 22,892 SNP candidates and 2,923 SIDP candidates. Re-sequencing confirmation experiments are underway.

References

[1] Milosavljevic, A. 2005. Pooled genomic indexing of rhesus macaque. Genome Research 15:292-301.
[2] Margulies, M. et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376-380.
[3] Altschul, S. et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
[4] Ronaghi, M. et al. 1998. A sequencing method based on real-time pyrophosphate. Science 281:363-365.
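Supplementary illustration: the filtering rules of step (2) can be sketched as a single predicate. This is a hedged Python sketch, not the pipeline's Perl implementation. The `Candidate` fields and function names are hypothetical. Three choices are assumptions: "within 5 base pairs" is read as a distance of 5 or less, the gap-proximity rule is applied to substitutions only, and the homopolymer rule is applied to insertions/deletions only. The MHC and overlap-count filters are omitted because they need genome-wide context.

```python
from dataclasses import dataclass, field
from typing import List

MIN_IDENTITY = 98.0      # percent identity cutoff from the text
GAP_WINDOW = 5           # base pairs; "<= 5" is an assumed reading
MIN_REF_QUALITY = 60     # Sanger quality score of the draft sequence
MIN_READ_QUALITY = 20    # 454 quality score of the read base


@dataclass
class Candidate:
    kind: str             # 'snp' (substitution) or 'indel'
    identity: float       # percent identity of the whole alignment
    mismatches: int       # mismatch count of the whole alignment
    pos: int              # forward-strand position of the variant
    base: str             # variant base (inserted/deleted base for indels)
    left: str             # reference base immediately 5' of the variant
    right: str            # reference base immediately 3' of the variant
    ref_quality: int
    read_quality: int
    gap_positions: List[int] = field(default_factory=list)


def alignment_ok(c: Candidate) -> bool:
    """Accept alignments with >=98% identity or exactly one mismatch."""
    return c.identity >= MIN_IDENTITY or c.mismatches == 1


def near_gap(c: Candidate) -> bool:
    """True if the variant lies within GAP_WINDOW bases of any gap."""
    return any(abs(c.pos - g) <= GAP_WINDOW for g in c.gap_positions)


def homopolymer_indel(c: Candidate) -> bool:
    """True if an indel extends or truncates an adjacent run of its base."""
    b = c.base.upper()
    return c.kind == "indel" and b in (c.left.upper(), c.right.upper())


def quality_ok(c: Candidate) -> bool:
    return c.ref_quality >= MIN_REF_QUALITY and c.read_quality >= MIN_READ_QUALITY


def keep_candidate(c: Candidate) -> bool:
    """Apply the alignment, homopolymer, gap-proximity and quality filters."""
    if not alignment_ok(c) or not quality_ok(c):
        return False
    if homopolymer_indel(c):
        return False
    if c.kind == "snp" and near_gap(c):
        return False
    return True
```

Each rule is kept as its own small function so that a threshold can be changed or a rule disabled when the pipeline is adapted to another genome, as the introduction suggests should be possible.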