SNP Discovery and Genotyping Methods and Applications

Jun Wang, Dee Aud, Soren Germer, and Russell Higuchi

From: Computational Genetics and Genomics: Edited by: G. Peltz © Humana Press Inc., Totowa, NJ

1. INTRODUCTION

The identification of genes affecting complex traits (i.e., biological traits affected by several genetic and environmental factors) is a difficult task (1–3). For many complex traits, the observable variation between individuals is quantitative; hence, loci affecting such traits are generally termed quantitative trait loci (QTLs). In contrast with monogenic traits, it is impossible to identify all the genomic regions responsible for complex trait variation without additional information on how these regions segregate (1,4). A key development in complex trait analysis was the establishment of large collections of molecular/genetic markers. With the discovery of large numbers of single nucleotide polymorphisms (SNPs) in humans and model organisms, correlating SNP markers with phenotype in a segregating population has become a useful tool in QTL studies (5). In both linkage and association mapping, the development of high-throughput methods to discover and genotype polymorphic markers has made whole-genome scanning to detect individual loci possible (2).

2. SNP DISCOVERY

SNPs are single base differences observed when sequences from different genomes are compared. Among human genomes these changes occur at a frequency of about 1 in every 1000 bases (6). This high density of SNP markers facilitated the fine mapping of genes and prompted large-scale efforts to identify and map new SNPs (for review see refs. 7 and 8). The discovery of new polymorphisms has been most rapid in the human and mouse genomes.
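The definition above, single base differences between compared sequences, can be made concrete with a minimal sketch. It assumes the two sequences have already been aligned to the same length; the function name `find_snps` and the example sequences are hypothetical and are not taken from any dataset described in this chapter.

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in A, base in B) for each single-base difference.

    Assumes seq_a and seq_b are already aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        # Gaps ("-") and ambiguous calls ("N") are skipped; only mismatches
        # between unambiguous bases count as candidate SNPs.
        if a in "ACGT" and b in "ACGT" and a != b:
            snps.append((pos, a, b))
    return snps


# Hypothetical aligned fragments from two genomes.
genome_1 = "ATGGCGTACCTGAAGT"
genome_2 = "ATGGCATACCTGAGGT"
print(find_snps(genome_1, genome_2))  # [(6, 'G', 'A'), (14, 'A', 'G')]
```

A real pipeline would work from base-called traces with quality scores rather than bare strings, so a mismatch found this way is only a candidate until validated.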
Although it is often still useful to perform de novo SNP discovery, the constantly growing number of validated human SNPs deposited in public databases makes a search of these databases an important first step in any study of human SNPs. Most human SNPs are deposited in a database ("dbSNP") that is maintained and curated by the National Center for Biotechnology Information (NCBI). In its most recent build (build 110, January 2003), dbSNP contained more than three million human SNPs. This database can be queried in a variety of ways using the excellent NCBI search interfaces. Although most of the dbSNP polymorphisms derive from computational analysis of aligned sequence traces, more than half a million of the SNPs have additionally been validated experimentally. Many of the SNPs in dbSNP were identified by the SNP Consortium, which itself maintains an excellent website with a search interface. The Human Genome Variation Database contains most of the SNPs available through dbSNP and offers alternative search tools and, to some extent, additional informational content. A number of more specific public databases contain information and SNPs related to particular projects, and though most of them also deposit their data in dbSNP, their own search interfaces can sometimes be useful tools. These include the Human Gene Mutation Database, which lists phenotypically related polymorphisms; the Japanese SNP database, which contains SNPs mapped in Japanese populations; the Cancer Genome Anatomy Project database; the National Institute of Environmental Health Sciences database (GeneSNPs); and several others. In addition, some companies (e.g., Sequenom, San Diego, CA; ABI, Foster City, CA) maintain proprietary databases with comprehensive SNP information related to the SNP genotyping technologies they sell. A database of mouse SNPs, the Mouse SNP Database at http://mouseSNP.roche.com/, is described in Heading 4.
Because of the variable quality of high-throughput sequence reads, unvalidated SNPs from purely in silico searches are frequently false positives. Although an increasing number of SNPs are being validated, the density of validated SNPs within a particular gene of interest is likely to be too low to enable thorough association studies to be performed. Also, SNPs in the particular disease group or ethnic group (or in our case, model organism) under investigation are likely to be underrepresented or missing. For these and other reasons, SNP discovery for particular research needs will remain an active area. Our own SNP discovery efforts, predominantly in inbred mouse strains, have been polymerase chain reaction (PCR)-based (as opposed to the recombinant deoxyribonucleic acid [DNA] methods used to generate most whole-genome sequences). A nearly complete mouse genome sequence allowed the facile design of PCR primers, resulting in SNP discovery that was evenly distributed along chromosomes. PCR amplicons can be assessed for SNPs in a number of different ways, including denaturing high-performance liquid chromatography (see ref. 9), single-stranded conformational polymorphism analysis (see ref. 10), and denaturation gradient gel electrophoresis (see ref. 11). All these methods detect heteroduplexes generated during the PCR reaction by the presence of an SNP in the heterozygous state. These methods maximize sequencing efficiency by targeting DNA sequences containing one or more polymorphisms. However, because of the development of reusable, high-speed capillary gels and the automation of sample loading, DNA sequencing itself has become rapid enough that it is now usually possible to proceed directly to sequencing.
Although it may seem obvious, it is worthwhile to note that when PCR primers are designed to amplify a known gene, the polymorphisms identified inherit all the information available for that gene, such as chromosomal position, gene name, gene function, and any other annotation. The position of the polymorphism within the gene is also known, along with information such as whether it lies in coding or noncoding sequence, or in the promoter or the 3′ untranslated region (UTR). In some cases, a readily identified functional mutation may be discovered, such as the introduction of a premature stop codon into the resulting messenger ribonucleic acid (mRNA).

Historically, the sequencing of complementary DNA (cDNA) libraries had the advantage of focusing on expressed sequences. The discovered SNPs were located in the coding sequence or in the 3′ UTR, a region that often contains important regulatory elements for each gene (12,13). As whole-genome sequences became available, PCR-directed sequencing became more useful than cDNA libraries, requiring less work and allowing for detection of polymorphisms in genes that are expressed at levels too low for representation in the libraries.

Recently, we used the PCR-based approach to assemble a murine SNP database that includes annotation and mapping information for all the polymorphisms contained in the database (14). Currently, the database contains more than 70,000 SNPs among 21 commonly used inbred mouse strains. The sequencing is performed using an ABI-3700 capillary sequencer with an autoloading attachment. Currently, two operators can sequence about 5000 PCR amplicons, each about 500 bp long, in a week, and about half the sequenced amplicons contain SNPs. PCR primers are designed automatically by computer from batch-loaded genome sequence files; the primer sequences are automatically entered into an electronic file for ordering oligonucleotides, which are delivered as pairs in 96-well plates.
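The step of turning a batch of designed primer pairs into an order file for 96-well plates can be sketched as follows. This is an illustration only: the text does not describe the actual file format or plate layout, so the row-major A1 to H12 filling order, the one-pair-per-well assumption, the function names, and the placeholder primer sequences are all hypothetical.

```python
import math
import string

ROWS = string.ascii_uppercase[:8]  # plate rows A-H
COLS = 12                          # plate columns 1-12
WELLS_PER_PLATE = len(ROWS) * COLS


def well_for(index: int) -> tuple[int, str]:
    """Map a 0-based primer-pair index to (1-based plate number, well label).

    Assumes wells are filled row by row: A1..A12, then B1..B12, and so on.
    """
    plate, offset = divmod(index, WELLS_PER_PLATE)
    row, col = divmod(offset, COLS)
    return plate + 1, f"{ROWS[row]}{col + 1}"


def plates_needed(n_pairs: int) -> int:
    """Number of 96-well plates required if each well position holds one pair."""
    return math.ceil(n_pairs / WELLS_PER_PLATE)


def order_rows(pairs):
    """Yield (plate, well, amplicon_id, forward, reverse) rows for an order sheet."""
    for i, (amplicon_id, forward, reverse) in enumerate(pairs):
        plate, well = well_for(i)
        yield plate, well, amplicon_id, forward, reverse


# Placeholder primer pairs; the sequences are made up for illustration.
pairs = [
    ("amplicon_001", "ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA"),
    ("amplicon_002", "GGATCCGGATCCGGATCCGG", "CCTAGGCCTAGGCCTAGGCC"),
]
for row in order_rows(pairs):
    print("\t".join(str(field) for field in row))

print(plates_needed(5000))  # plates for one week of 5000 amplicons under this layout
```

Under these assumptions, a week of 5000 amplicons corresponds to 53 plate positions' worth of primer pairs, which gives a sense of why automated ordering and robotic setup are needed.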
Each PCR amplicon is sequenced with the same primers used for PCR amplification. A Qiagen Robot 3000 is used for primer dilution, PCR reaction setup, amplicon clean-up (using QIAquick PCR Purification kits), and sequence template preparation. The amplicons are analyzed by gel electrophoresis to assess primer specificity. The major costs for this project have been the capital investment in the sequencer and robot and the ongoing expense of primers, thermostable polymerase, and plastic disposables. Although perhaps more expensive than most small labs could afford on their own, this approach is well within the means of most "core" sequencing facilities that are now present at many academic and industrial institutions. For human SNP discovery, sequencing of 5000 amplicons per week from 50 individuals would identify about 50 new SNPs per week.
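The figure of about 50 new human SNPs per week is not derived in the text. One reading that reproduces it, offered here as an assumption rather than the authors' stated calculation, combines the throughput given above (5000 amplicons of about 500 bp per week) with the human SNP frequency of about 1 in every 1000 bases quoted in Heading 2. The function and parameter names are hypothetical.

```python
def expected_new_snps(amplicons_per_week: int,
                      individuals: int,
                      amplicon_length_bp: int,
                      bases_per_snp: int) -> float:
    """Rough weekly SNP yield when every locus is sequenced in every individual.

    Assumption: the same loci are amplified from all individuals, so the number
    of distinct loci is amplicons_per_week / individuals, and SNPs are expected
    at about one per bases_per_snp bases of distinct sequence surveyed.
    """
    distinct_loci = amplicons_per_week // individuals
    distinct_bases = distinct_loci * amplicon_length_bp
    return distinct_bases / bases_per_snp


weekly_bases_read = 5000 * 500          # total bases read per week
human_estimate = expected_new_snps(
    amplicons_per_week=5000,
    individuals=50,
    amplicon_length_bp=500,
    bases_per_snp=1000,
)
print(weekly_bases_read)  # 2500000
print(human_estimate)     # 50.0
```

The same sequencing capacity therefore covers only 100 distinct loci (50 kb of unique sequence) per week when spread across 50 individuals. The 1-in-1000 figure is a frequency between compared genomes, so this should be taken as an order-of-magnitude estimate only.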