SNP Discovery and Genotyping
Methods and Applications
Jun Wang, Dee Aud, Soren Germer, and Russell Higuchi
1. INTRODUCTION
The identification of genes affecting complex traits (i.e., biological traits
affected by several genetic and environmental factors) is a very difficult
and challenging task (1–3). For many complex traits, the observable
variation between individuals is quantitative; hence, loci affecting such
traits are generally termed quantitative trait loci (QTLs). In contrast
with monogenic traits, it is impossible to identify all the genomic regions
responsible for complex trait variation without additional information
on how these regions segregate (1,4). A key development in complex
trait analysis was the establishment of large collections of
molecular/genetic markers. With the discovery of a large number of
single nucleotide polymorphisms (SNPs) in humans and model
organisms, correlating SNP markers with phenotype in a segregating
population has become a useful tool in QTL studies (5). In both linkage
and association mapping, the development of high-throughput methods
to discover and genotype polymorphic markers has made whole-genome
scanning to detect individual loci possible (2).
2. SNP DISCOVERY
SNPs are single base differences observed when sequences from
different genomes are compared. Among human genomes these changes
occur at a frequency of about 1 in every 1000 bases (6). This high
density of SNP markers has facilitated the fine mapping of genes and
prompted large-scale efforts to identify and map new SNPs (for review
see refs. 7 and 8).
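The definition above (single base differences between compared sequences) can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same length, with `-` marking gaps; the function names and the two example sequences are hypothetical and are not taken from the chapter.

```python
def find_snps(seq_a: str, seq_b: str) -> list:
    """Return (1-based position, base in A, base in B) for each single-base mismatch.

    Assumes seq_a and seq_b are already aligned, so position i in one
    corresponds to position i in the other.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        # Gaps and ambiguous base calls are skipped; only A/C/G/T mismatches count.
        if a in "ACGT" and b in "ACGT" and a != b:
            snps.append((pos, a, b))
    return snps


def snp_density(seq_a: str, seq_b: str) -> float:
    """SNPs per compared base (positions where both sequences have A/C/G/T)."""
    compared = sum(
        1 for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    )
    return len(find_snps(seq_a, seq_b)) / compared if compared else 0.0


# Hypothetical aligned fragments from two genomes (illustration only).
strain_a = "ATGGCCATTGTAATGGGCCGC"
strain_b = "ATGGCCATTGTGATGGGC-GC"
print(find_snps(strain_a, strain_b))
print(snp_density(strain_a, strain_b))
```

On real data the density would be computed over far longer alignments; a figure of about 0.001 would correspond to the frequency of 1 in every 1000 bases quoted above.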
From: Computational Genetics and Genomics
Edited by: G. Peltz © Humana Press Inc., Totowa, NJ
The discovery of
new polymorphisms has been most rapid in the human and mouse
genomes. Although it is often still useful to perform de novo SNP
discovery, the constantly growing number of validated human SNPs
deposited in public databases makes a search of these databases an
important first step in any study of human SNPs. Most human SNPs are
deposited in a database (“dbSNP”) that is maintained and curated by
the National Center for Biotechnology Information (NCBI). In its most
recent build (build 110, January 2003), dbSNP contained more than
three million human SNPs. This database can be queried in a variety of
ways using the excellent NCBI search interfaces. Although most of the
dbSNP polymorphisms derive from computational analysis of aligned
sequence traces, more than half a million of the SNPs have additionally
been validated experimentally. Many of the SNPs in dbSNP were
identified by the SNP Consortium, which itself maintains an excellent
website with a search interface. For alternative search tools (and to
some extent informational content), the Human Genome Variation
Database can be used as it contains most of the SNPs available through
dbSNP. A number of more specific public databases contain
information and SNPs related to specific projects, and though most of
them also deposit their data in dbSNP, their own search interfaces can
sometimes be useful tools. These include the Human Gene Mutation
Database, which lists phenotypically related polymorphisms; the
Japanese SNP database, which contains SNPs mapped in Japanese
populations; the Cancer Genome Anatomy Project database; the
National Institute of Environmental Health Sciences database (GeneSNPs); and
several others. In addition, some companies (e.g., Sequenom, San Diego,
CA; ABI, Foster City, CA) maintain proprietary databases with
comprehensive SNP information related to the SNP genotyping
technologies they sell. A database of mouse SNPs, the Mouse SNP
Database at http://mouseSNP.roche.com/, is described in Heading 4.
Because of the variable quality of high-throughput sequence reads,
unvalidated SNPs from purely in silico searches are frequently false positives.
Although an increasing number of SNPs are being validated, the density
of validated SNPs within a particular gene of interest is likely to be too
low to enable thorough association studies to be performed. Also, SNPs
in the particular disease group or ethnic group (or in our case, model
organism) under investigation are likely to be underrepresented or
missing. For these and other reasons, SNP discovery for particular
research needs will remain an active area. Our own SNP discovery
efforts, predominantly in inbred mouse strains, have been polymerase
chain reaction (PCR)-based (as opposed to the recombinant
deoxyribonucleic acid [DNA] methods used to generate most whole-genome
sequences). A nearly complete mouse genome sequence allowed the
facile design of PCR primers resulting in SNP discovery that was evenly
distributed along chromosomes. PCR amplicons can be assessed for
SNPs in a number of different ways, including denaturing high-performance
liquid chromatography (see ref. 9), single-stranded
conformational polymorphism analysis (see ref. 10), and denaturing
gradient gel electrophoresis (see ref. 11). All these methods detect
heteroduplexes generated during PCR by the presence of
an SNP in the heterozygous state. These methods maximize sequencing
efficiency by targeting DNA sequences containing one or more
polymorphisms. However, DNA sequencing itself, because of the
development of reusable and high-speed capillary gels and the
automation of sample loading, has become rapid enough that it is now
usually possible to proceed directly to sequencing. Although it
may seem obvious, it is worthwhile to note that when PCR primers are
designed to amplify a known gene, the resulting polymorphisms
identified will have all the information available for that gene, such as
chromosomal position, gene name, gene function, and any annotation
that is available for that gene. Also, the position of the polymorphism
within the gene itself will be known, including whether it lies in
coding vs noncoding sequence, or in promoter sequence vs the 3′-untranslated
region (UTR). In some cases, a readily identified functional mutation
may be discovered, such as the introduction of a premature stop codon
into the resulting messenger ribonucleic acid (mRNA). Historically, the
sequencing of complementary DNA (cDNA) libraries had the advantage
of focusing on expressed sequences. The discovered SNPs were located
in the coding sequence or in the 3′ UTR, a region that often contains
important regulatory elements for each gene (12,13). As whole-genome
sequences became available, PCR-directed sequencing became more
useful than cDNA libraries, requiring less work and allowing for
detection of polymorphisms in genes that are expressed at levels too low
for representation in the libraries.
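As an illustration of how a known position within a gene allows a functional change such as a premature stop codon to be recognized, the following minimal sketch classifies a single-base change in a coding sequence. It assumes the standard genetic code and a coding sequence that starts in frame; the function name, the labels it returns, and the example sequence are hypothetical and are not part of the pipeline described in this chapter.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds: str, position: int, alt_base: str) -> str:
    """Classify a single-base substitution at a 1-based position in an in-frame CDS."""
    cds = cds.upper()
    idx = position - 1
    offset = idx % 3                      # position of the SNP within its codon
    codon_start = idx - offset
    ref_codon = cds[codon_start:codon_start + 3]
    if idx < 0 or len(ref_codon) != 3:
        raise ValueError("position does not fall within a complete codon")
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"                 # introduces a premature stop codon
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


# Hypothetical coding sequence: Met-Trp-Gln-Gly.
cds = "ATGTGGCAAGGT"
print(classify_coding_snp(cds, 6, "A"))   # TGG -> TGA
print(classify_coding_snp(cds, 9, "G"))   # CAA -> CAG
print(classify_coding_snp(cds, 7, "G"))   # CAA -> GAA
```

A real annotation step would also need exon boundaries and strand orientation, which this sketch leaves out.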
Recently, we used the PCR-based approach to assemble a murine SNP
database that includes annotation and mapping information for all the
polymorphisms contained in the database (14). Currently, the database
contains more than 70,000 SNPs among 21 commonly used inbred mouse strains.
The sequencing is performed using an ABI-3700 capillary sequencer
with an autoloading attachment. Currently, two operators can sequence
about 5000 PCR amplicons of 500 bp each in a week, and about half
the sequenced amplicons contain SNPs. PCR primers are designed
automatically by computer from batch-loaded genome sequence files;
the sequences are automatically entered into an electronic file for
ordering oligonucleotides, which are delivered as pairs in 96-well plates.
Each PCR amplicon is sequenced with the same primers used for PCR
amplification. A Qiagen Robot 3000 is used for primer dilution, PCR
reaction setup, amplicon clean-up (using QIAquick PCR Purification
kits), and sequence template preparation. The amplicons are analyzed
by gel electrophoresis to assess primer specificity. The major costs for
this project have been the capital investment in the sequencer and robot
and the ongoing expense of primers, thermostable polymerase, and
plastic disposables. Although perhaps more expensive than most small
labs could afford on their own, this approach is well within the means of
most “core” sequencing facilities that are now present at many
academic and industrial institutions. For human SNP discovery,
sequencing of 5000 amplicons per week from 50 individuals would
identify about 50 new SNPs per week.
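The text does not say how the throughput figures were derived. The short calculation below is one plausible reading, offered as a rough sketch: it assumes that every locus is sequenced once in each of the 50 individuals, and that the human frequency of about 1 SNP per 1000 bases quoted in Heading 2 applies to the surveyed sequence. All constant and function names are hypothetical.

```python
AMPLICONS_PER_WEEK = 5000   # from the text
AMPLICON_LENGTH_BP = 500    # from the text
INDIVIDUALS = 50            # from the text (human example)
BASES_PER_SNP = 1000        # human frequency quoted in Heading 2


def raw_bases_per_week(amplicons: int, amplicon_bp: int) -> int:
    """Total bases read per week, counting every amplicon."""
    return amplicons * amplicon_bp


def expected_new_snps(amplicons: int, amplicon_bp: int,
                      individuals: int, bases_per_snp: int) -> float:
    """Expected SNPs per week if each locus is sequenced once per individual.

    Assumption (not stated in the text): distinct loci = amplicons / individuals,
    and SNPs occur at 1 per `bases_per_snp` bases of distinct sequence surveyed.
    """
    distinct_loci = amplicons / individuals
    distinct_bases = distinct_loci * amplicon_bp
    return distinct_bases / bases_per_snp


print(raw_bases_per_week(AMPLICONS_PER_WEEK, AMPLICON_LENGTH_BP))
print(expected_new_snps(AMPLICONS_PER_WEEK, AMPLICON_LENGTH_BP,
                        INDIVIDUALS, BASES_PER_SNP))
```

Under these assumptions, 5000 amplicons correspond to 2.5 million bases of raw sequence per week but only 100 distinct loci (50,000 bases) when spread over 50 individuals, which yields the figure of about 50 SNPs per week.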