An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry

An Introduction to
Sequence Variation
Chris Lee
Dept. of Chemistry & Biochemistry
UCLA
Types of Polymorphism
• Single nucleotide polymorphisms (SNP)
constitute about 90% of polymorphisms.
• Insertions, deletions.
• Microsatellite repeats: a locus where
different numbers of copies of a short repeat
sequence are found in different people.
• Gross genetic losses or rearrangements.
Large-scale Polymorphism
Single Nucleotide Polymorphisms
• Each person different at 1 in 1000 letters.
• SNPs responsible for human individuality!
• Some SNPs cause human diseases (e.g.
cancer, cystic fibrosis, Alzheimer’s).
• Enormous efforts have been made to
identify specific mutations that cause
disease.
Single Nucleotide Polymorphism
Mutation can occur
as easily as the loss
of a single chemical
group from one
nucleotide base, e.g.
the amino group of
cytosine.
Creating a Mutation
Genomic Density of SNPs
• Comparing two random chromosomes, expect one
SNP per 1000 bp.
• Comparing 40 people (2 chromosomes
each), expect 17 million SNPs in the
complete human genome (3 billion bp).
• In coding region (5% of genome) expect
500,000 cSNPs, perhaps 6 per gene.
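The scaling from "1 SNP per 1000 bp between two chromosomes" to millions of SNPs in a larger sample can be sketched with Watterson's formula for the expected number of segregating sites. This is an assumption made here for illustration; the slide does not say how its 17 million figure was derived, and the function name is hypothetical. Under this formula, 80 chromosomes give a number of the same order as the slide's estimate.

```python
def expected_segregating_sites(theta_per_bp, genome_bp, n_chromosomes):
    """Watterson's expectation: S = theta * L * sum_{i=1}^{n-1} 1/i.

    theta_per_bp  : pairwise difference rate per base (1/1000 on the slide)
    genome_bp     : genome length in bp (3 billion on the slide)
    n_chromosomes : number of sampled chromosomes (2 per person)
    """
    harmonic = sum(1.0 / i for i in range(1, n_chromosomes))
    return theta_per_bp * genome_bp * harmonic


pairwise = expected_segregating_sites(1 / 1000, 3_000_000_000, 2)
forty_people = expected_segregating_sites(1 / 1000, 3_000_000_000, 80)
print(f"two chromosomes: {pairwise:,.0f} differences")
print(f"40 people (80 chromosomes): {forty_people:,.0f} SNPs")
```

The growth is only logarithmic in sample size, which is why 40 people already reveal a large share of the common SNPs.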
SNPs: a detailed record of human
genetic history
• Each SNP is typically a single mutation
event that occurred in the context of certain
pre-existing SNPs.
• As time passes this context is gradually lost
due to recombination.
[Figure: SNP C is initially created linked to SNPs A, B, D, E, F...; as time passes, the “island” of linkage around C shrinks.]
A record of the origins,
migrations, and mixing of the
world’s peoples
• The size of the “island” of strong linkage
around a SNP indicates its age (small = old)
• The SNPs it’s linked to give a “genetic
fingerprint” of the original person it’s from.
• In principle each SNP can be used to track
all his descendants.
• Each person has 300,000 common SNPs-- a
very rich record of their genetic history.
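Why a small island means an old SNP can be sketched with the textbook decay of linkage under recombination: the association between two sites falls by a factor (1 - c) per generation, where c is the recombination fraction between them. The function names below are hypothetical, and the rate of about 1 cM per Mb is a rough assumed average, not a figure from the slides.

```python
import math


def linkage_remaining(recomb_fraction, generations):
    """Fraction of the original association left after some generations."""
    return (1.0 - recomb_fraction) ** generations


def generations_to_decay(recomb_fraction, fraction_left=0.5):
    """Generations needed for the association to fall to fraction_left."""
    return math.log(fraction_left) / math.log(1.0 - recomb_fraction)


# Assumed rate: ~1 cM per Mb, so markers 100 kb apart have c of about 0.001.
c_100kb = 0.001
for gens in (10, 100, 1000):
    left = linkage_remaining(c_100kb, gens)
    print(f"{gens:>5} generations: {left:.3f} of linkage left")
print(f"half-life at 100 kb: {generations_to_decay(c_100kb):.0f} generations")
```

Markers further apart have a larger c and lose their association faster, so the distance over which linkage survives shrinks as the SNP ages.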
SNPs in lipoprotein lipase (LPL) gene.
[Figure: SNP genotypes in 71 individuals in the LPL gene; legend: heterozygote (X/y), homozygote (y/y).]
SNP Allele Frequency
SNP Haplotypes reconstructed from LPL genotype data
SNP Linkage Disequilibrium
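The quantities named on these figure slides can be computed as follows. This is a minimal sketch using the standard definitions of allele frequency, D and r²; the genotype and haplotype counts are made up for illustration and are not the LPL data.

```python
def allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Frequency of the alternate allele from genotype counts."""
    chromosomes = 2 * (n_hom_ref + n_het + n_hom_alt)
    return (n_het + 2 * n_hom_alt) / chromosomes


def linkage_disequilibrium(hap_counts):
    """Return (D, r^2) for two biallelic SNPs.

    hap_counts maps the haplotypes 'AB', 'Ab', 'aB', 'ab' to their counts.
    """
    total = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / total
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / total
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / total
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b
    return d, d * d / denom


# Hypothetical counts for 71 individuals (142 chromosomes).
print(allele_frequency(n_hom_ref=40, n_het=25, n_hom_alt=6))
print(linkage_disequilibrium({"AB": 60, "Ab": 10, "aB": 12, "ab": 60}))
```

An r² near 1 means one SNP predicts the other almost perfectly; an r² near 0 means the two are inherited almost independently.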
The Hunt for Disease Genes
• Currently: finding a disease gene can take
years, because there are very few markers,
forcing researchers to search dozens of
genes.
• SNPs are a powerful tool for discovering
genes that cause disease: with a SNP in
every gene, could directly map a disease to
a single gene.
Mapping Disease Genes
[Figure: a chromosome showing genes, the disease gene, microsatellite markers, and SNPs.]
• Look for genetic linkage of disease to marker
• Microsatellite markers are too widely spaced
to get to the individual gene level.
• There are common SNPs in every gene.
Identification of SNPs
• In 1998, Wang et al. reported ~ 3000 SNPs.
• Currently about 200,000 SNPs have been
identified in total by experiment (in public
databases).
• A pharmaceutical industry SNP consortium
has been formed to fund identification of
300,000 SNPs to be shared publicly.
SNPs for Pharmacogenomics
• Differences in efficacy and side effects from
person to person can be a big problem for
drug clinical trials / approval.
• If SNPs that correlate with these differences
can be identified, the clinical trial could be
limited to patients where efficacy is likely
to be best, with least side effects.
• These SNPs would then also have to be
tested on prospective patients for the drug.
Single-nucleotide polymorphism in the human mu opioid
receptor gene alters β-endorphin binding and activity:
Possible implications for opiate addiction
Bond et al. PNAS 95:9608
The mu opioid receptor is the primary
site of action for the most commonly
used opioids, including morphine,
heroin, fentanyl, and methadone.
The A118G variant receptor binds β-endorphin, an endogenous opioid that
activates the mu opioid receptor,
approximately three times more tightly
than the most common allelic form of the
receptor. Furthermore, β-endorphin is
approximately three times more potent at
the A118G variant receptor than at the
most common allelic form in agonist-induced activation of G protein-coupled
potassium channels.
Comprehensive EST Analysis of
Single Nucleotide Polymorphism
in the Human Genome
Chris Lee
Dept. of Chemistry & Biochemistry
UCLA
Targeting Functional Polymorphism
via Expressed Sequences
• Only 5% of the human genome corresponds to
“genes” coding for functional protein.
• Look for functional SNPs by targeting these
gene sequence regions.
• Genes are “expressed” by transcription into
mRNA, which is spliced, poly-adenylated and
translated.
• Purify polyA-mRNA, make cDNA, sequence.
SNP Detection from ESTs
• 1.4 million Expressed Sequence Tag (EST)
sequences, 300-500 bp, from 950 people.
• How to put together all the ESTs from the
same gene, without mixing up related genes?
• How to distinguish sequencing errors (very
common) from genuine Single Nucleotide
Polymorphisms?
SNP Detection Approaches
• Experimentally: random sampling of DNA.
Very expensive, slow.
• Computationally: find SNPs from existing
experimental data. Sort out real SNPs from
experimental sequencing errors. Difficult
statistical and computational problems.
• This experimental data was sitting around
for years...
Distinguishing SNPs from
Sequencing Errors
[Figure: aligned reads with two candidate SNPs, marked A and T.]
The frequency and pattern in which a polymorphism is
observed must rise above the rate of background, random error.
Single-pass read sequences contain many errors which
complicate the reliable detection of SNPs. There are miscalls
(N), and frequent letter duplications / losses in runs (repeats of
a single letter). These non-uniform error rates are critical in
assessing the statistical significance of candidate SNPs like A
(not in a run) vs. T (problematic because it involves a GG run).
How to address this?
• Adopt rigorous statistical approach based on
measured frequencies from very large data.
• Bayesian inference: carefully separate
observations from hidden states you want to
make inferences about.
• “Integrate out” all assumptions by considering
all possible values of the assumptions.
• Explicitly measure degree of uncertainty in
the predictions due to poor data, ambiguity.
Odds ratio: SNP model vs.
sequencing error model
$$\mathrm{SNPscore} = \frac{p(\mathrm{obs} \mid \mathrm{SNP})}{p(\mathrm{obs} \mid \mathrm{err})}$$
Consider both models: are the
observations more consistent with a SNP
or sequencing error?
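To make the odds ratio concrete, here is a deliberately simplified single-site version. It is a toy and not the alignment-based model of the following slides: it assumes n independent reads at one site, of which k show the same alternate base, a per-base error rate split evenly over the three wrong bases, and a uniform prior on allele frequency under the SNP model. All names are hypothetical, and the score is reported as log10 odds on the assumption that "LOD" means log10 of the odds ratio.

```python
import math


def p_obs_error(n_reads, k_alt, err=0.01):
    """Error model: every alternate call is a miscall to that one base."""
    p = err / 3
    return math.comb(n_reads, k_alt) * p**k_alt * (1 - p) ** (n_reads - k_alt)


def p_obs_snp(n_reads, k_alt):
    """SNP model with allele frequency q integrated out over a uniform prior.

    The integral of C(n,k) q^k (1-q)^(n-k) over q in (0,1) is 1/(n+1).
    Sequencing errors are ignored inside this model to keep the toy short.
    """
    return 1.0 / (n_reads + 1)


def lod_score(n_reads, k_alt, err=0.01):
    """log10 of the SNP-vs-error odds ratio."""
    return math.log10(p_obs_snp(n_reads, k_alt) / p_obs_error(n_reads, k_alt, err))


for k in (1, 2, 5):
    print(f"{k} of 20 reads show the variant: LOD = {lod_score(20, k):.1f}")
```

A single variant read out of 20 is better explained by error (negative score), while five agreeing reads are overwhelmingly better explained by a SNP.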
Error Model: treat True gene
sequence as unknown
$$p(\mathrm{obs} \mid \mathrm{err}) = \sum_{T} p(\mathrm{obs} \mid T)\, p(T)$$
• Treat all sequences T as equally likely (before you
consider the actual observations, i.e. the chromatograms).
• Sum error model probability over all possible T.
SNP Model
$$p(\mathrm{obs} \mid \mathrm{SNP}) = \sum_{T} \sum_{T^*} p(\mathrm{obs} \mid T^*, T)\, p(T^* \mid T)\, p(T) \;\rightarrow\; p(\mathrm{obs} \mid T^*, T)\, \frac{1}{3}\, p(T^*)$$
• Rather than summing SNP model probability over
all possible T, T*, calculate the probability for a
specific SNP T* in a specific consensus T.
$$\mathrm{SNPscore} = \frac{1}{3}\, \frac{p(\mathrm{obs} \mid T^*, T)}{\sum_{T} p(\mathrm{obs} \mid T)}$$
Sequencing error model
Treat individual observed sequences i as independent;
treat alignment (what errors occurred) as uncertain.
$$p(\mathrm{obs} \mid T) = \prod_{i} p(\mathrm{obs}_i \mid T) = \prod_{i} \sum_{A} p(\mathrm{obs}_i, A \mid T)$$
Treat true gene sequence T as uncertain: sum over all possible T
$$p(\mathrm{obs} \mid \mathrm{err}) = \sum_{T} \prod_{i} \sum_{A} p(\mathrm{obs}_i, A \mid T)$$
Hidden Markov Model
Discrimination of SNP vs. Error
The match states (M) of a profile are the equivalent of the true
population sequence, and deletion (D), insertion (I) and
emission probabilities are set to be the observed frequencies of
sequencing errors conditioned on local sequence context. The
sum probability for the SNP model, vs. the sum probability for
the error-only model, yields an odds-ratio for the SNP.
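The sum over all alignments A can be computed without enumerating them, using a forward recursion. The sketch below is a simplified stand-in for the profile HMM described here, not the authors' implementation: it uses fixed substitution, insertion and deletion rates (the background rates of 0.3%, 0.3% and 0.7% quoted later in the talk) rather than rates conditioned on local sequence context, and the function name is hypothetical.

```python
def read_likelihood(obs, true_seq, sub=0.003, ins=0.003, dele=0.007):
    """p(obs | T): sum over all alignments of one read to true sequence T.

    Generative model: before each base of T (and at the end) the read may
    insert random bases, each with probability `ins`; each base of T is
    then deleted with probability `dele` or emitted, with a substitution
    to one of the three other bases with total probability `sub`.
    """
    n, m = len(true_seq), len(obs)
    # f[j][k] = probability of having consumed j bases of T and emitted obs[:k]
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for j in range(n + 1):
        for k in range(m + 1):
            if j == 0 and k == 0:
                continue
            total = 0.0
            if k > 0:  # insertion state emits obs[k-1] as a random base
                total += f[j][k - 1] * ins * 0.25
            if j > 0:  # deletion state skips true_seq[j-1]
                total += f[j - 1][k] * (1 - ins) * dele
            if j > 0 and k > 0:  # match state emits obs[k-1] from true_seq[j-1]
                same = obs[k - 1] == true_seq[j - 1]
                emit = (1 - sub) if same else sub / 3
                total += f[j - 1][k - 1] * (1 - ins) * (1 - dele) * emit
            f[j][k] = total
    return f[n][m] * (1 - ins)  # no further insertions after the last base


truth = "GGACCAA"
for read in ("GGACCAA", "GGATCAA", "GGACAA"):
    print(read, read_likelihood(read, truth))
```

Evaluating the same set of reads against T and against a variant T* gives the two likelihoods that enter the odds ratio.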
To assess putative SNP, consider
all alternative possibilities
• Sequencing error: calculate odds ratio SNP vs.
error. Use PHRED score, local context.
• Orientation errors: ESTs reported backwards?
• Chimeras, mixed clusters: ESTs may not be
properly clustered. Some ESTs chimeric?
• Alignments: all possible ways EST could have
been emitted from true sequence T.
• “true” sequence: all possible T for the gene.
SNP Model: “Local” allele
frequency q_z in one person
$$p(\mathrm{obs}_i \mid T^*, T, z) = \sum_{A} p(\mathrm{obs}_i, A \mid T^*, T, q_z), \qquad z = 0, 1, 2, \ldots$$
q_z = z/N, where N = 2 typically
$$p(\mathrm{obs}_{i \in L} \mid T^*, T) = \sum_{z = 0,1,2} p(z) \prod_{i \in L} p(\mathrm{obs}_i \mid T^*, T, z)$$
$$p(z \mid q) = \binom{N}{z} q^{z} (1 - q)^{N - z}$$
Assuming Hardy-Weinberg
Use Library information: which
sequences are from same person!
Combine observations from all libraries L, and treat population
allele frequency q as uncertain (so take integral over q = (0,1)).
$$\begin{aligned}
p(\mathrm{obs} \mid T^*, T) &= \int_0^1 \prod_{L} p(\mathrm{obs}_{i \in L} \mid T^*, T, q)\, p(q)\, dq \\
&= \int_0^1 \prod_{L} \sum_{z = 0,1,2} \binom{2}{z} q^{z} (1 - q)^{2 - z} \prod_{i \in L} \sum_{A} p(\mathrm{obs}_i, A \mid T^*, T, z)\, p(q)\, dq
\end{aligned}$$
Posterior probability for
population allele frequency q
$$p(q \mid T^*, T, \mathrm{obs}) = \frac{\prod_{L} p(\mathrm{obs}_{i \in L} \mid T^*, T, q)}{\int_0^1 \prod_{L} p(\mathrm{obs}_{i \in L} \mid T^*, T, q)\, dq}$$
Gives posterior distribution for q, taking into account all error
rates in the observations, amount of sequence and library
availability, ambiguities in the sequence, etc.
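The library-aware posterior can be sketched on a grid of q values. This is an illustrative simplification, not the authors' code: each read is reduced to "reference" or "variant" at one site (the sum over alignments is dropped), the error rate is a fixed assumed 0.3%, the prior p(q) is uniform, and each library is assumed to come from one diploid person. All names and the example counts are hypothetical.

```python
from math import comb


def library_likelihood(n_ref, n_alt, q, err=0.003, ploidy=2):
    """p(reads in one library | q), summing over the donor's genotype z."""
    total = 0.0
    for z in range(ploidy + 1):
        prior_z = comb(ploidy, z) * q**z * (1 - q) ** (ploidy - z)  # Hardy-Weinberg
        q_z = z / ploidy  # "local" variant frequency in this person
        p_alt = q_z * (1 - err) + (1 - q_z) * err / 3
        p_ref = (1 - q_z) * (1 - err) + q_z * err / 3
        total += prior_z * p_alt**n_alt * p_ref**n_ref
    return total


def posterior_over_q(libraries, grid_size=201):
    """Normalized posterior weights on an evenly spaced grid of q in [0, 1]."""
    qs = [i / (grid_size - 1) for i in range(grid_size)]
    weights = []
    for q in qs:
        like = 1.0
        for n_ref, n_alt in libraries:
            like *= library_likelihood(n_ref, n_alt, q)
        weights.append(like)
    norm = sum(weights)
    return qs, [w / norm for w in weights]


def posterior_mean(libraries):
    qs, post = posterior_over_q(libraries)
    return sum(q * p for q, p in zip(qs, post))


# 20 hypothetical libraries of 6 reads each, as (n_ref, n_alt) pairs.
one_library = [(0, 6)] + [(6, 0)] * 19    # all 6 variant reads in one library
scattered = [(5, 1)] * 6 + [(6, 0)] * 14  # 6 variant reads in 6 libraries
print("one library:", posterior_mean(one_library))
print("scattered:  ", posterior_mean(scattered))
```

In this toy setup the same six variant reads support a higher population frequency when they come from six different people than when they all come from one person.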
6 SNP observations from one library
[Plot: one data series over a horizontal axis from 0 to 1; vertical axis up to about 0.000014.]
6 SNP observations scattered over all libraries
[Plot: one data series over a horizontal axis from 0 to 1; vertical axis up to about 8E-13.]
Alignment Accuracy Challenges
• Automatic Multiple Sequence Alignment of
1000+ sequences is problematic.
• Alignment accuracy is much more of a
problem for SNP detection than for simply
getting the right consensus. Consensus
merely requires that the majority be aligned,
whereas even a single alignment error will
result in an incorrect SNP prediction.
Sequencing Error Analysis
• We have produced a dataset of 400,000,000 bp where we
have reliable consensus, and therefore can identify all the
sequencing errors. This could provide “corrected” EST
sequences, or alternatively consensus, assembled gene
sequences for a large fraction of human genes.
• This also provides detailed statistics on the frequency of
different types of sequencing errors, which show a startling
variation depending on local sequence context.
Background error rates of 0.3% substitution, 0.3%
insertion, 0.7% deletion, rise dramatically
Example SNP: GGA C/T CAA
Cluster AA702884
C vs. T polymorphism
Novel SNP, not
previously identified.
Automated SNP Detection
• Input Unigene: 1,400,000 Human ESTs, 300-500 bp long
• Word frequency based overlap & orientation detection
• Try all possible orientations; don’t trust Unigene!
• Reorient ESTs: catch reversals, place in 5’ -> 3’ orientation
• Many errors in the reported data, e.g. reversals, in the majority of clusters!
• EST Alignment: accuracy; predict gene consensus & SNPs
• 10-5000 ESTs per gene, 80,000 genes, 500-5000 bp long
• Statistical Assessment of candidate SNPs
• >50,000 believable SNPs hidden among >10,000,000 sequencing errors.
• Sequence Alignment
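The "word frequency based overlap & orientation detection" step can be sketched with shared k-mers: count the words an EST shares with a reference in both orientations and keep the better one. This is an illustration of the general idea only; the word length, function names and test sequence are assumptions, not details from the slides.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_orientation(reference, est, k=8):
    """Return ('forward' | 'reverse', number of shared k-mers)."""
    ref_words = kmers(reference.upper(), k)
    est = est.upper()
    forward = len(ref_words & kmers(est, k))
    reverse = len(ref_words & kmers(reverse_complement(est), k))
    return ("forward", forward) if forward >= reverse else ("reverse", reverse)


reference = "GAGGGCCCCACTCCCTTGCTGGCCCCAGCCCTGCTGGGGAT"
reported_backwards = reverse_complement(reference[5:35])
print(best_orientation(reference, reference[5:35]))
print(best_orientation(reference, reported_backwards))
```

Because the test looks only at shared words, it tolerates scattered sequencing errors and does not rely on the orientation recorded in the database.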
Current Status: Results
• 400,000,000 bp aligned w/ reliable consensus.
• 83,000 consensus gene sequences produced.
• 20,000 show significant homology to known
proteins, almost all in expected + orientation.
• 75,000 SNPs above LOD score of 3.
• 30,000 SNPs above LOD score of 6.
• current estimate: 60,000 high frequency SNPs.
Megakaryocyte Potentiating Factor (Unigene Cluster Hs.155981)
Hs#S785496
gagg..cccactcccttg.ctggccccagccctgctgan.at.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S1065649
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact
Hs#S706294
gagggccccactcccttg.ctagtgtcagccctgctggggat.ccccgcctggccaggagcagagcacgggtggtccccattccaccccaagagaact
Hs#S730843
gagggccc.actcccttg.ctggccccagcc.tgctgga.gt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S751356
gagggccc.actcccttg.ctggccccagccctgctgna.nt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S786081
gagggccccactcccttg.ctaggac.agcc.tgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S417458
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact
Hs#S751274
gagggccccactccctgggcttggcccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S483955
gagggccccactcccttg.ctggccccagccctgctgga.atancccgcctggccaggagcag.gcacgggtnatccccgttccaccccaagagaact
Hs#S1434119
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S1065241
aagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaaaagaact
CONSENS0
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
[Alignment position ruler: 1970 to 2050, in steps of 10.]
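A naive way to read an alignment like this one is to take the majority base in each column as the consensus and flag columns where a second base recurs. The sketch below does only that, on a short slice modeled on the alignment above; it is not the statistical scoring the talk argues for, and the function name and the `min_minor` threshold are hypothetical.

```python
from collections import Counter


def column_summary(rows, min_minor=2):
    """Majority-vote consensus plus columns with a recurring second base.

    Gap ('.') and ambiguous ('n') characters are ignored. Returns the
    consensus string and a list of (column, major, minor, minor_count).
    """
    consensus = []
    candidates = []
    for pos, column in enumerate(zip(*rows)):
        counts = Counter(base for base in column if base in "acgt")
        if not counts:
            consensus.append(".")
            continue
        ranked = counts.most_common()
        consensus.append(ranked[0][0])
        if len(ranked) > 1 and ranked[1][1] >= min_minor:
            candidates.append((pos, ranked[0][0], ranked[1][0], ranked[1][1]))
    return "".join(consensus), candidates


rows = [
    "cacgggtggtcccc",
    "cacgggtgatcccc",
    "cacgggtggtcccc",
    "cacgggtgatcccc",
    "cacgggtnatcccc",
    "cacgggtggtcccc",
    "cacgggtggtcccc",
]
print(column_summary(rows))
```

A count threshold like this cannot tell a real g/a polymorphism from a context-dependent sequencing error, which is why each candidate column still needs the odds-ratio test.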
Chromatographic Evidence
[Figure: sequencing chromatograms for Hs#S785496 (zu42c08.r1) and Hs#S1065649 (oz03ho7.x1*) around the candidate SNP.]
RFLP Detection of SNPs
[Figure: gel of MboI digests, lanes 1-11. A G/A SNP lies in an MboI site (GATC); labeled fragments are 86 nt, 67 nt, 35 nt and 32 nt.]
Genotypes, lanes 1-11: GG GG GG AA GG GA GG GA GG AA GA
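The logic of the RFLP assay can be sketched as an in-silico digest: MboI cuts immediately 5' of its GATC site, so an allele that destroys the site leaves a longer uncut fragment. The sequences below are made up so that the fragment lengths match the 67 nt, 35 nt and 32 nt labels in the figure; the actual amplicon, and which base of the site the SNP changes, are not given on the slide.

```python
def digest(seq, site="GATC"):
    """Fragment lengths after cutting just before every occurrence of site."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start)
        start = seq.find(site, start + 1)
    edges = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(edges, edges[1:]) if b > a]


# Hypothetical 67 nt fragments differing only at the first base of the site.
allele_g = "A" * 35 + "GATC" + "T" * 28  # site intact: cut into 35 + 32
allele_a = "A" * 35 + "AATC" + "T" * 28  # site destroyed: stays 67

bands = {
    "GG": sorted(set(digest(allele_g)), reverse=True),
    "AA": sorted(set(digest(allele_a)), reverse=True),
    "GA": sorted(set(digest(allele_g) + digest(allele_a)), reverse=True),
}
for genotype, sizes in bands.items():
    print(genotype, sizes)
```

A heterozygote shows both the uncut band and the two cut bands, which is how the three genotypes are told apart on the gel.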
Verified 56 of 79 SNPs tested so far
[Bar chart: RFLP Verification on 16-24 DNA Samples; % verified (axis 0% to 90%) by score bin (1-6, 6-20, >20).]
Verification Test:
Whitehead cSNPs
• Whitehead Institute has systematically
searched for SNPs in 106 genes, using 20
Europeans, 10 Africans, 10 Asians.
• On 54 genes, our predicted cSNPs (score>3)
are verified by their results at a 70% rate.
The Whitehead set may be incomplete.
Gene      #seq  #lib
AHC         16     6
APOD       207    59
AR          18     4
AT3         55    12
BDNF        18     9
CETP        14     8
CGA        122    17
CNTF         4     1
COMT       183    67
CYP11A      64    21
CYP11B2      7     4
DRD1         4     1
F10         29    15
F13A1      141    40
F2          37    14
F5          49    14
F9           7     2
FGB        266    32
total
percentage verified
cSNPs
low frequency
predicted verified predicted verified
0
3
0
3
0
2
2
0
0
1
1
0
0
0
8
2
2
3
0
2
0
0
2
1
1
2
2
0
0
5
5
1
1
1
0
3
3
0
30
17
57%
9
high frequency
predicted verified
0
2
2
1
1
0
0
6
1
2
0
0
1
2
1
2
1
4
1
3
4
1
3
1
11%
21
16
76%
Validation Test: HLA-A
• HLA polymorphism has been studied very
extensively for the general population,
providing a “gold standard” for all true
positives.
• 140 distinct HLA-A allele sequences
available from Anthony Nolan Foundation
database.
Are any of our predicted HLA-A SNPs not
independently verified by this data?
[Figure: HLA-A coding sequence, with SNP alleles marked above and below each line of sequence.]
gatggccgtc atggcgcccc gaaccctcgt cctgctactc tcgggggccc tggccctgac ccagacctgg
gcgggctccc actccatgag gtatttcttc acatccgtgt cccggcccgg ccgcggggag ccccgcttca
tcgccgtggg ctacgtggac gacacgcagt tcgtgcggtt cgacagcgac gccgcgagcc agaggatgga
gccgcgggcg ccgtggatag agcaggaggg gccggagtat tgggacgggg agacacggaa tgtgaaggcc
cactcacaga ctgaccgagt ggacctgggg accctgcgcg gctactacaa ccagagcgag gccggttctc
acaccatcca gataatgtat ggctgcgacg tggggtcgga cgggcgcttc ctccgcgggt accacaggac
gcctacgacg gcaaggatta catcgccctg aacgaggacc tgcgctcttg accgcggcgg acatggcggc
tcagatcacc aagcgcaagt gggaggcggc ccatgtggcg gagcagttga gagcctacct ggagggcacg
tgcgtggagt ggctccgcag atacctggag aacgggaagg agacgctgca gcgcacggac gcccccaaga
cgcatatgac tcaccacgct gtctctgacc atgaggccac cctgaggtgc tgggccctga gcttctaccc
tgcggagatc acactgacct ggcagcggga tggggaggac cagacccagg acacggagct cgtggagacc
aggcctgcag gggatggaac cttccagaag tgggcggctg tggtggtgcc ttctggacag gagcagagat
acacctgcca tgtgcagcat gagggtctgc ccaagcccct caccctgaga tgggagccgt cttcccagcc
caccatcccc atcgtgggca tcattgctgg cctggttctc tttggagctg tgatcactgg agctgtggtc
gctgctgtga tgtggaggag gaagagctca gatagaaaag gagggagcta ctctcaggct gcaagcagtg
acagtgccca gggctctgat gtgtctctca cagcttgtaa agtgtga
HLA-A: 89% Verification Rate
• Of total 108 SNPs we predicted in the
coding region of HLA-A, 96 are
independently validated by the known
HLA-A allele sequences, and 12 are not.
• By comparison, the NCI CGAP project
(based on the same EST data) predicts just
10 SNPs in HLA-A (>90% false negatives!)
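The verification rates quoted in these validation tests are simple proportions over modest sample sizes, so it helps to see how wide their uncertainty is. The sketch below recomputes the rates from counts given in the talk (96 of 108 for HLA-A; 17 of 30 and 16 of 21 in the Whitehead table totals) and adds a Wilson score interval. The interval is an addition for illustration and is not part of the original analysis.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


counts = {
    "HLA-A coding SNPs": (96, 108),
    "Whitehead cSNPs, all": (17, 30),
    "Whitehead cSNPs, high frequency": (16, 21),
}
for label, (verified, tested) in counts.items():
    low, high = wilson_interval(verified, tested)
    rate = verified / tested
    print(f"{label}: {rate:.0%} verified (interval {low:.0%} to {high:.0%})")
```

With only 21 or 30 SNPs tested the intervals are wide, so differences between the smaller validation sets should be read cautiously.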
Mass Spectrometry Validation
• SNPs change the mass of a DNA fragment.
• Sequenom Inc. has tested more than 1000 of
our SNPs using mass spectrometry of
pooled DNA samples.
• 80% were detectably polymorphic in
samples of 90 people.
Bioinformatics Key to SNPs
[Bar chart: Estimated Number of Human SNPs found (axis 0 to 50000) for MIT-AFFY, NCI (NIH), Wash. U., and UCLA.]
EST-based SNP detection comparable
in reliability to experimental
methods
project                  # SNPs  verification  # people     method
Picoult-Newberg et al.      850  63%           18           sequencing & G
Buetow et al.              3000  82%           10 to 90     RFLP
UCLA high-LOD             30000  69%, 79%      8 to 24      RFLP
UCLA LOD 3                75000  57%           8 to 24      RFLP
Halushka et al.             874  79%           resequenced  VDA
Cargill et al.              560  55%, 60%      resequenced  VDA, DHPLC
Application to Disease Gene
Mapping
• How do SNPs compare with traditional
marker sets used for disease gene mapping
projects?
• Density: how dense is the marker set, when
mapped onto the human genome?
• Ideal: at least one marker per gene (strong
linkage disequilibrium within 3kb)
• Ideal: high heterozygosity for good
statistics
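Heterozygosity here is the chance that a random person carries two different alleles at the marker. A minimal sketch using the standard Hardy-Weinberg expectation follows; the function names are hypothetical.

```python
import math


def expected_heterozygosity(allele_freqs):
    """H = 1 - sum of squared allele frequencies (2pq for two alleles)."""
    return 1.0 - sum(p * p for p in allele_freqs)


def minor_allele_frequency(heterozygosity):
    """Invert H = 2p(1-p) for a biallelic SNP (valid for H up to 0.5)."""
    return (1.0 - math.sqrt(1.0 - 2.0 * heterozygosity)) / 2.0


for p in (0.5, 0.3, 0.1):
    print(f"minor allele frequency {p}: H = {expected_heterozygosity([p, 1 - p]):.2f}")
```

A two-allele SNP can never exceed H = 0.5, which is reached when both alleles are equally common, so a heterozygosity of 0.18 corresponds to a minor allele frequency of about 10%.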
[Figure: marker map of Chromosome 22 (bands p13, p12, p11.2, p11.1, q11.1, q11.2, q12.1, q12.2, q12.3, q13.1, q13.2, q13.3) comparing MICROSATELLITE and SNP markers. Contig NT_001454 (14.6 MB), 22q11.2 - q13.3; detail: 1.4 MB from 22q13.1, scale 0.1 to 1.4.]
Microsatellite markers: AFM164ze3, AFM273vd9, AFMa046za5, AFM261ye5
Cluster labels as shown in the figure:
Hs.197713 (3/3)
Hs.205802 (1/1)
Hs.211929 (108/51) n n n n
Hs.193078 (3/2)
Hs.176560 (6/6)
EIF3S7 (261/106) n n n n n n
Hs.139929 (16/12) n
Hs.146766 (4/2)
Hs.212478 (1/1)
Hs.143856 (3/2)
NCF4 (2/1)
Hs.147244 (1/1)
CSF2RB (29/14) n
TST (118/56) n n n n n n n n n n n n
Hs.107692 (5/1)
Hs.196536 (3/2)
PVALB (66/17) n n n n n
Hs.187027 (2/2)
Hs.207456 (1/1)
Hs.94810 (27/13) n
IL2RB (1/1)
Hs.194750 (4/3)
Hs.196941 (20/15)
Hs.22011 (27/14) n n
Hs.118700 (3/2)
Hs.177397 (2/2)
RAC2 (67/35) n
Hs.7189 (19/14) n
Hs.174434 (1/1)
Hs.220558 (3/2)
Hs.187981 (8/3)
MFNG (123/37) n n
Hs.187933 (2/1)
Hs.57973 (19/12) n
MSE55 (26/16) n
Hs.178824 (1/1)
Hs.190885 (9/6) n
Hs.6071 (48/32) n n
Hs.119913 (1/1)
Hs.97858 (3/2)
Hs.5790 (71/31) n n n n n n
Hs.25744 (20/13)
Mapping Test: positionally
cloned genes
• Positionally cloned genes represent a
(somewhat) random sampling of genes.
• They are examples of actual disease-gene
mapping targets, that typically took years of
linkage analysis and chromosome walking
to find.
• How good is the coverage and
heterozygosity of our SNP marker set for
these genes?
Gene    Disease                                 SNPs  Heterozygosity
ALD     X-Linked Adrenoleukodystrophy              0
APC     Adenomatous Polyposis Coli                 7  0.45, 0.42, 0.40, 0.39, 0.26
CFTR    Cystic Fibrosis                            4  0.48, 0.38, 0.35, 0.35
CHM     Choroideremia                              0
CLC1    Thomsen Disease                            0
DM      Myotonic Dystrophy                         0
DMD     Duchenne Muscular Dystrophy                3  0.47, 0.34, 0.24
FMR1    Fragile X Syndrome                         5  0.21, 0.21, 0.18, 0.18, 0.18
GK      Glycerol Kinase Deficiency                 9  0.50, 0.50, 0.49, 0.49, 0.49
GLYRA2  Hyperekplexia                              0
HD      Huntington's Disease                       4  0.50, 0.46, 0.35, 0.35
KRT9    Epidermolytic Palmoplantar Keratoderma     0
MLH1    Hereditary Non-polyposis Colon Cancer      2  0.24, 0.08
MNK     Menkes Syndrome                            0
MSH2    Hereditary Non-polyposis Colon Cancer      2  0.31, 0.28
NDP     Norrie Disease                             1  0.49
NF1     Neurofibromatosis, Type 1                  3  0.50, 0.50, 0.47
NF2     Neurofibromatosis, Type 2                  3  0.49, 0.48, 0.26
OCRL    Lowe Syndrome                              7  0.50, 0.50, 0.50, 0.46, 0.33
PAX3    Waardenburg Syndrome                       5  0.38, 0.38, 0.38, 0.38, 0.38
PAX6    Aniridia                                   0
PKD1    Polycystic Kidney Disease                  0
RB1     Retinoblastoma                             3  0.36, 0.28, 0.10
RET     Multiple Endocrine Neoplasia 2A            0
SOD1    Amyotrophic Lateral Sclerosis             11  0.48, 0.35, 0.26, 0.21, 0.16
SRY     Gonadal Dysgenesis                         0
TSC     Tuberous Sclerosis                         0
VHL     Von Hippel-Lindau Disease                 13  0.50, 0.50, 0.49, 0.48, 0.48
WND     Wilson Disease                             0
WT1     Wilms Tumor                                3  0.44, 0.43, 0.41
SNP validation tests: β globin
• β globin polymorphism has been studied
intensively, identifying 100s of substitutions.
• Verify predicted SNPs against known
mutations.
• We detect 21 SNPs in β globin, 17 within
exons.
SNPs highly biased towards third
codon position
SNP codon distribution
FEATURE    n_SNPs
cod_pos_1       2
cod_pos_2       4
cod_pos_3      11
SNPs Biased towards Silent or
Conservative Substitutions
SNP substitution type
FEATURE        n_SNPs
silent              6
conservative        9
non-conserved       2
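Classifying a coding SNP by codon position and by whether it changes the amino acid can be sketched with the standard genetic code. The function name is hypothetical, and the split into "conservative" and "non-conserved" substitutions is not implemented because the slides do not define the criterion used.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (changed positions, ref amino acid, alt amino acid, kind)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    changed = [i + 1 for i in range(3) if ref_codon[i] != alt_codon[i]]
    aa_ref, aa_alt = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "silent" if aa_ref == aa_alt else "amino acid change"
    return changed, aa_ref, aa_alt, kind


print(classify_codon_change("CAC", "CAT"))
print(classify_codon_change("GAG", "GTG"))
```

Most third-position changes leave the amino acid unchanged because of the degeneracy of the code, which fits the observed bias of SNPs towards the third codon position.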
codon  polymorphism        LOD    f (%)  protein  location      type
pos.   codon AA  codon AA                pos.
3      CAC HIS   CAT HIS   83.1   17     2        surface       silent
2      GAG GLU   GTG VAL   240.8  5      6        αβ interface  non-conserved
3      GGC GLY   GGA GLY   48.9   9      16       surface       silent
2      AAG LYS   AGG ARG   52.9   11     17       surface       conservative
3      AGG ARG   AGT SER   13.2   8      30       αβ interface  non-conserved
3      CTG LEU   CTA LEU   12.9   7      31       core          silent
3      GTG VAL   GTC VAL   12.1   8      33       αβ interface  silent
2      GTC VAL   GCC VAL   7.4    8      34       αβ interface  silent
3      CAC HIS   CAA GLN   14.3   2      77       surface       conservative
3      GAC ASP   GAA GLU   3      2      79       surface       conservative
1      AAG LYS   GAG GLU   2.3    2      82       surface       conservative
2      ACC THR   AAC ASN   23.2   3      84       surface       conservative
1      CTC LEU   TTC PHE   2.1    2      105      αβ interface  conservative
3      GTG VAL   GTT VAL   4.8    2      113      surface       silent
3      CAC HIS   CAA GLN   7.6    5      117      surface       conservative
3      GAA GLU   GAT ASP   4.2    4      121      surface       conservative
3      GTG VAL   GCC ALA   23.7   2      134      core          conservative
Disease associations listed: sickle cell, hemolytic anemia, erythrocytosis
SNPs detect three disease alleles
• Mutations previously identified as causing
disease, catalogued by Online Mendelian
Inheritance in Man.
• The only two non-conservative amino acid
substitutions detected.
• All three at the α-β chain interface.
Verified SNP: Hb Tacoma
disrupts α-β interface
His 77 → Gln: exposed, unlikely
to disrupt stability
What are the most polymorphic
genes in the Human Genome?
• Very large differences in polymorphism
levels in different genes.
• Maintaining high levels of diversity (large
numbers of alleles) may indicate a selective
pressure.
• What can we learn from patterns of
polymorphism?
• Why are some genes so polymorphic?
The Most Polymorphic Genes:
Five Classes
• Direct interactions with pathogens.
• Very highly expressed genes.
• Genes involved in tumorigenesis and
survival/growth of tumors.
• Viral- and transposon-derived sequences.
• Large families of highly similar genes?
Acknowledgements
• Christopher Lee: K. Irizarry, B. Modrek,
C. Grasso
• Wing Wong (Statistics): C. Li
• Stan Nelson (Human Genetics):
V. Kustanovich, N. Brown