UcDavis-2006 - BC Bioinformatics

SNPs and haplotypes – software tools for
genetic sequence variations
Gabor T. Marth, D.Sc.
Boston College
Department of Biology
marth@bc.edu
http://bioinformatics.bc.edu/marthlab
UC Davis, June 5, 2006
Genetic variations are important because…
… they underlie
phenotypic
differences
… cause heritable diseases
and determine responses
to drugs
… allow tracking ancestral
human history
We investigate several essential aspects of
genetic variations
• build SNP discovery tools
• extend these tools for other, genetic and epigenetic,
inherited and somatic, polymorphisms
• apply our tools for genome data mining
• model human polymorphism structure to bear on human
pre-history and to inform medical research
• build tools to aid the selection of markers for clinical
case-control association studies and association testing
Polymorphism discovery tools
Single-nucleotide variations
• Human Genome Project produced a reference genome
sequence that is 99.9% common to each human being
• sequence variations make our
genetic makeup unique
SNP
• Single-nucleotide polymorphisms
(SNPs) are most abundant, but other
types of variations exist and are important
How do we find variations?
• comparative analysis of multiple
sequences from the same region of the
genome (redundant sequence coverage)
• diverse sequence
resources can be used
EST
WGS
BAC
Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference
sequence as a template to organize
other sequence fragments from
arbitrary sources
2. Use sequence quality information
(base quality values) to distinguish
true mismatches from sequencing
errors
sequencing error
true polymorphism
SNP discovery with PolyBayes
genome reference sequence
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
Sequence clustering
• Clustering simplifies to search against sequence database to
recruit relevant sequences
• Clusters = groups of overlapping sequence fragments matching
the genome reference
genome reference
fragments
cluster 1
cluster 2
cluster 3
(Anchored) multiple alignment
• The genomic reference sequence serves as an anchor
• fragments pair-wise aligned to genomic sequence
• insertions are propagated – “sequence padding”
• Advantages
• efficient -- only involves pair-wise comparisons
• accurate -- correctly aligns alternatively spliced ESTs
Paralog filtering
• The “paralog problem”
• unrecognized paralogs give rise to spurious SNP predictions
• SNPs in duplicated regions may be useless for genotyping
• Challenge
• to differentiate between sequencing errors and paralogous
difference
Sequencing
errors
Paralogous
difference
Paralog filtering
• Pair-wise comparison between fragment and genomic sequence
• Bayesian discrimination algorithm between “ortholog” and
paralog models of the number of observed mismatches
Paralog discrimination
[Figure: P(d|Model_NAT), P(d|Model_PAR) and P(Model_NAT|d) plotted as probability (0 to 1) against the number of discrepancies d (0 to 15).]
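The discrimination between the two models can be sketched in a few lines. This is a minimal illustration, not the PolyBayes implementation: it assumes that the number of discrepancies d in an alignment of a given length follows a binomial distribution, and the pairwise difference rates (`rate_native`, `rate_paralog`), the prior and the function names are hypothetical values chosen only for the example.

```python
from math import comb


def binomial_pmf(d, length, rate):
    """Probability of exactly d discrepancies among `length` aligned bases."""
    return comb(length, d) * rate**d * (1 - rate) ** (length - d)


def posterior_native(d, length, rate_native=0.01, rate_paralog=0.03,
                     prior_native=0.5):
    """P(Model_NAT | d) by Bayes' rule over the two competing models.

    The rates and the prior are illustrative assumptions, not PolyBayes defaults.
    """
    like_nat = binomial_pmf(d, length, rate_native)    # P(d | Model_NAT)
    like_par = binomial_pmf(d, length, rate_paralog)   # P(d | Model_PAR)
    weighted_nat = like_nat * prior_native
    weighted_par = like_par * (1 - prior_native)
    return weighted_nat / (weighted_nat + weighted_par)


# Posterior of the native ("ortholog") model for a 500 bp alignment.
for d in (0, 5, 10, 15):
    print(d, round(posterior_native(d, length=500), 4))
```

Under these assumed rates the posterior falls from near 1 to near 0 as the number of discrepancies grows, which is the qualitative shape of the P(Model_NAT|d) curve on the slide.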
SNP detection
• Goal: to discern true variation from sequencing error
sequencing error
polymorphism
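Bayesian-statistical SNP detection
Bayesian posterior probability that a site is polymorphic, for N aligned sequences (depth of coverage N):

$$
P(\mathrm{SNP}) =
\frac{\displaystyle\sum_{\text{all variable } [S_1,\dots,S_N]}
\frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)}
\, P_{\mathrm{Prior}}(S_1,\dots,S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
\frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})}
\, P_{\mathrm{Prior}}(S_{i_1},\dots,S_{i_N})}
$$

• P(S_i | R_i): base call + base quality
• P_Prior(S_i): base composition
• P_Prior(S_1, ..., S_N): expected polymorphism rate
• numerator: polymorphic ("all variable") combinations only; denominator: all combinations, polymorphic and monomorphic (A A A A A, C C C C C, G G G G G, T T T T T)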
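The posterior on this slide can be evaluated by brute-force enumeration when the alignment is shallow. The sketch below is an illustration under stated assumptions, not the PolyBayes code: base probabilities are derived from a Phred-style quality value (error probability 10^(-Q/10), spread evenly over the three other bases), base composition is uniform (0.25 per base), and the joint prior gives the four monomorphic combinations a total weight of `1 - poly_rate` and spreads `poly_rate` evenly over all polymorphic combinations. All function and parameter names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def base_probs(call, qual):
    """P(S | R) for one read: called base gets 1 - e, the others e / 3."""
    err = 10 ** (-qual / 10)
    return {b: (1 - err) if b == call else err / 3 for b in BASES}


def snp_probability(reads, poly_rate=1 / 300):
    """Posterior probability that a site is polymorphic.

    reads: list of (base_call, quality) pairs, one per aligned sequence.
    Enumerates all 4**N base combinations, so only suitable for small N.
    """
    n = len(reads)
    per_read = [base_probs(call, qual) for call, qual in reads]
    n_poly = 4**n - 4                      # number of polymorphic combinations
    total = 0.0
    variable = 0.0
    for combo in product(BASES, repeat=n):
        mono = len(set(combo)) == 1
        # Assumed joint prior: uniform within each class of combinations.
        prior_joint = (1 - poly_rate) / 4 if mono else poly_rate / n_poly
        term = prior_joint
        for probs, base in zip(per_read, combo):
            term *= probs[base] / 0.25     # divide by base-composition prior
        total += term
        if not mono:
            variable += term
    return variable / total


print(snp_probability([("A", 40), ("A", 40), ("G", 40), ("G", 40)]))
print(snp_probability([("A", 40), ("A", 40), ("A", 40), ("G", 5)]))
```

The second call shows the role of base quality: a single low-quality mismatch contributes little evidence for a polymorphism, whereas the same mismatch at high quality would raise the score substantially.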
Priors
• Overall polymorphism rate in population -- e.g. 1 / 300 bp
• Distribution of SNPs according to specific variation
[Figure: relative occurrence (0 to 70) by variation type (AC, AG, AT, CG).]
• Distribution of SNPs according to minor allele frequency + alignment depth
• Pre-existing specific information about SNP
SNP probability score
polymorphism
specific variation
Validation by resequencing
[Figure: confirmation rate [%] (0 to 100) for SNP score [%] bins 51-60, 61-70, 71-80, 81-90 and 91-100.]
The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z,
Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish
WR. A general approach to single-nucleotide
polymorphism discovery. Nat Genet. 1999
Dec;23(4):452-6.
SNP mining: genome BAC overlaps
overlap detection
inter- & intra-chromosomal duplications
known human repeats
fragmentary nature of draft data
SNP analysis
candidate SNP predictions
BAC overlap mining results
~ 30,000 clones
>CloneX
ACGTTGCAACGT
GTCAATGCTGCA
>CloneY
ACGTTGCAACGT
GTCAATGCTGCA
25,901 clones
(7,122 finished, 18,779 draft
with base quality values)
21,020 clone overlaps
(124,356 fragment overlaps)
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
507,152 high-quality
candidate SNPs
(validation rate 83-96%)
Marth et al., Nature Genetics 2001
SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps
Weber et al., AJHG 2002
2. The SNP Consortium (TSC): polymorphism discovery in
random, shotgun reads from whole-genome libraries
Sachidanandam et al., Nature 2001
SNP detection in Sanger sequence traces
Aaron Quinlan
SNP discovery in clonal vs. diploid sequence
• PolyBayes was originally written to find SNPs
in clonal sequences in large SNP discovery
projects
• medical re-sequencing projects require the detection of SNPs in heterozygous
diploid sequence traces
[Figure: paired 5’/3’ strands illustrating clonal vs. heterozygous diploid sequence (C/G and T/A base pairs).]
Heterozygotes in diploid Sanger traces
Ind. 1
Ind. 2
Ind. 3
Ind. 4
Heterozygote detection is challenging
• we use a machine learning method (Support Vector Machine, SVM) to recognize
characteristic features of homozygous vs. heterozygous positions
Analyzing individual traces
[Figure: SVM function mapping the SVM score (axis from - through 0 to +) to P(Het), separating heterozygotes from homozygotes. Example per-trace calls: P(CT|R) = .34, P(CT|R) = .01, P(AC|R) = .999, P(AT|R) = .001.]
Aggregating information from multiple traces
P(GT | Read) = .98
resultant genotype call
P(GT ) = .993
P(GT | Read) = .87
forward/reverse sequences from
same individual
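One way to aggregate per-trace genotype probabilities is sketched below. It is an assumption-laden illustration, not the method used in the software: it treats the traces as independent given the genotype, and it assumes each per-trace posterior was computed under the same prior, so that dividing by the prior recovers a likelihood. The function name and the two-genotype example are hypothetical, and the sketch is not expected to reproduce the exact numbers on the slide.

```python
def combine_reads(read_posteriors, prior):
    """Combine per-read genotype posteriors into one genotype call.

    read_posteriors: list of dicts mapping genotype -> P(genotype | read).
    prior: dict mapping genotype -> prior probability used for every read.
    """
    scores = {}
    for genotype, prior_prob in prior.items():
        score = prior_prob
        for posterior in read_posteriors:
            # P(genotype | read) / prior is proportional to P(read | genotype).
            score *= posterior[genotype] / prior_prob
        scores[genotype] = score
    total = sum(scores.values())
    return {genotype: score / total for genotype, score in scores.items()}


# Forward and reverse reads from the same individual (illustrative values).
flat_prior = {"GT": 0.5, "GG": 0.5}
forward = {"GT": 0.98, "GG": 0.02}
reverse = {"GT": 0.87, "GG": 0.13}
print(combine_reads([forward, reverse], flat_prior))
```

Two reads that agree reinforce each other, so the combined probability is higher than either individual value.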
Priors: discovery vs. genotyping
discovery: “uninformed prior”
don’t know if site is polymorphic
have to test each site
Prior(CT) = .001
genotyping: “informed prior”
1. site is known to be polymorphic
2. allele frequency estimate
Prior(CT) = 0.34
Performance
                 Fraction of      False Discovery   Fraction of           Fraction of
                 Data Analyzed    Rate              Heterozygotes Found   Homozygotes Found
PolyBayes+       85.1             0.0375            86.60%                97.8%
Polyphred 5      86.17            0.0389            83.16%                82.63%
Performance measured on ~1000 alignments covering a 500 Kb region of chromosome 4
Base calling for 454 pyro-sequencer flowgrams
• readout in pyrosequencing is based on instantaneous detection of
base incorporation… multiple bases of the same type are
incorporated in the same cycle
55 24 15 10 7 5 4 2 1 0 0
TCAGGGGGGGGGGGACGACAAGGCGT…
• the identity of consecutive bases is very
reliable but the length of mono-nucleotide runs
(base number) is difficult to quantify (great
for re-sequencing; but problematic for de novo
sequencing)
The uncertainty in base number in 454 traces
A Bayesian base-calling strategy for 454 traces
data likelihoods (i.e. the probability distribution of signal intensity S for a given base number N): from pair-wise alignments between training data and genome reference sequence
[Figure: probability of signal (0 to 0.07) vs. pyrosequencing signal (0 to 8), one curve for each base number from 0 As to 5 As.]
Prior probabilities of possible base numbers: from genome reference sequence
P(0A) = ~0.74
P(1A) = ~0.24
P(2A) = 0.0625
P(3A) = 0.0156
…
The posterior probability of base number given the signal intensity

$$
P(N \mid S) = \frac{P(S \mid N)\, P_R(N)}{\sum_{i=0}^{n} P(S \mid N_i)\, P_R(N_i)}
$$
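The posterior over base numbers can be computed directly once the data likelihoods and priors are available. In the sketch below both are stand-ins: the likelihood P(S | N) is assumed to be a normal density centred on N whose spread grows with N (`sd_floor`, `sd_per_base` are made-up parameters), and the prior list is illustrative rather than estimated from a genome. In the approach described on the slide, both would instead come from training alignments and the genome reference sequence.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used as a stand-in for P(S | N)."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def base_number_posterior(signal, priors, sd_floor=0.1, sd_per_base=0.15):
    """Return P(N | S) for N = 0 .. len(priors) - 1.

    priors[n] is the prior probability of a run of n identical bases.
    The likelihood model and its parameters are illustrative assumptions.
    """
    weights = []
    for n, prior in enumerate(priors):
        sd = sd_floor + sd_per_base * n        # wider signal spread for longer runs
        weights.append(normal_pdf(signal, n, sd) * prior)
    total = sum(weights)                       # denominator of Bayes' rule
    return [w / total for w in weights]


priors = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]   # hypothetical prior
posterior = base_number_posterior(2.1, priors)
called = max(range(len(posterior)), key=posterior.__getitem__)
print(called, [round(p, 4) for p in posterior])
```

The called base number is the one with the highest posterior, and the posterior itself can serve as a quality measure for the run-length call.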
Base-calling accuracy
[Figure: Comparison of corrections made by each method. Number of corrections made (0 to 120) vs. signal (0 to 6), with two series: “Us Correct, 454 Wrong” and “454 Correct, Us Wrong”.]
Overall Accuracy on 10,000 Test Traces:
454 Method: 97.13%
PyroBayes: 99.15%
Somatic mutation detection
Michael Stromberg
Somatic mutations
the detection of somatic mutations, and their
distinction from inherited polymorphism, is
important to separate pre-disposing variants
from mutations that occur during disease
progression e.g. in cancer
© Brian Stavely, Memorial University of Newfoundland
1. detect the mutations
2. classify whether somatic or inherited
Detection using comparative data
• based on comparison of cancer and normal tissue from
the same individual
• often cancer tissue is highly heterogeneous and the
somatic mutant allele may be present at low allele
frequency
Detecting somatic mutations with subtraction
• if normal tissue samples are not available,
we detect SNPs in cancer tissue against
e.g. the human genome reference sequence
• search for evidence that these mutations are
genetic
• subtract apparent mutations that are present in sequence variation
databases
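The subtraction step amounts to a set difference between the apparent mutations and the known polymorphisms. The sketch below assumes each variant is represented as a `(chromosome, position, reference_allele, alternate_allele)` tuple; that representation, the function name and the example records are hypothetical and only serve the illustration.

```python
def subtract_known_variants(tumor_variants, known_polymorphisms):
    """Keep variants seen in cancer tissue that are absent from a database.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    The result preserves the order of tumor_variants.
    """
    known = set(known_polymorphisms)
    return [variant for variant in tumor_variants if variant not in known]


# Made-up example records, not real variants.
tumor = [("chr1", 1000, "A", "G"), ("chr1", 2500, "C", "T"), ("chr2", 300, "G", "A")]
database = [("chr1", 2500, "C", "T")]
print(subtract_known_variants(tumor, database))
```

Variants that survive the subtraction are only candidates: a variant missing from the database may still be a rare inherited polymorphism.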
Somatic mtDNA mutations in murine brain cancers
• compared mitochondrial reads from cancer and normal
tissue
• found both heteroplasmic and homoplasmic
mutations
heteroplasmy homoplasmy
• some confirm known sites and some are novel
Future: tools to integrate genetic and epigenetic data from varied
sources to find “common themes” during cancer development
• somatic mutations
• chromosome rearrangements
• methylation profiles
• chromatin structure
• copy number changes
• gene expression profiles
• repeat expansions
Population genetic modeling – human prehistory
The current variation resource
• The current public resource (dbSNP)
contains over 10 million SNPs
1. How are these SNPs structured within
the genome?
2. What can we learn about the
processes that shape human variability?
3. What is the utility of these data for
medical applications?
Nucleotide diversity is heterogeneous
at the scale of the chromosomes
in different regions of given lengths
[Figure: nucleotide diversity at the chromosome scale and in regions of 4 kb, 8 kb, 12 kb and 16 kb.]
Compositional and functional features
• G+C nucleotide content
[Figure: SNP rate [per 10,000 bp] vs. G+C content [%].]
• CpG di-nucleotide content
[Figure: SNP rate [per 10,000 bp] vs. CpG content [%].]
• recombination rate
[Figure: SNP rate [per 10,000 bp] vs. recombination rate [per Mb].]
• functional constraints
3’ UTR           5.00 x 10^-4
5’ UTR           4.95 x 10^-4
Exon, overall    4.20 x 10^-4
Exon, coding     3.77 x 10^-4
synonymous       366 / 653
non-synonymous   287 / 653
Variance is so high that these quantities are poor predictors of nucleotide
diversity in local regions, hence random processes are likely to govern the
basic shape of the genome variation landscape described by neutral theory
Strategy: measure genome-wide distributions of
DNA polymorphism data…
1. marker density (MD): distribution of number of SNPs in pairs of sequences
[Figure: MD histogram for 0 to 10 SNPs.]
2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples
[Figure: AFS histogram over frequency classes 1 (“rare”) to 10 (“common”).]
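Both summary distributions are simple to tabulate from aligned haplotypes. The sketch below assumes haplotypes are equal-length strings of `0`/`1` alleles, takes the AFS over minor allele counts, and measures marker density as the number of differing sites in every pair of sequences. These conventions, the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def normalize(counts):
    """Turn a Counter into a dict of relative frequencies, sorted by key."""
    total = sum(counts.values())
    return {key: counts[key] / total for key in sorted(counts)}


def allele_frequency_spectrum(haplotypes):
    """Distribution of polymorphic sites by minor allele count."""
    n = len(haplotypes)
    counts = Counter()
    for site in zip(*haplotypes):              # iterate over alignment columns
        ones = sum(allele == "1" for allele in site)
        minor = min(ones, n - ones)
        if minor > 0:                          # skip monomorphic sites
            counts[minor] += 1
    return normalize(counts)


def marker_density(haplotypes):
    """Distribution of the number of differing sites over all sequence pairs."""
    counts = Counter(
        sum(a != b for a, b in zip(first, second))
        for first, second in combinations(haplotypes, 2)
    )
    return normalize(counts)


sample = ["1100", "1000", "0010", "0000"]      # toy data
print(allele_frequency_spectrum(sample))
print(marker_density(sample))
```

Observed distributions computed this way can then be compared against the same quantities simulated under each demographic scenario.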
… build models of these distributions under
competing scenarios of human demographic history…
[Figure: population size history (past to present) under four scenarios: stationary, collapse, expansion and bottleneck, each shown with its simulated MD and its AFS (direct form).]
… and determine the best-fitting models.
European data
African data
genetic
bottleneck
modest but
uninterrupted
expansion
Marth et al.
PNAS 2003; Genetics 2004
Relevance to human demographic history
our results
Recent African Origin
Multiregional
Computer software to aid case-control association studies:
tagSNP selection and association testing (details)
[Figure: 5-site computationally generated LD (r2) vs. LA LD (r2), both axes 0 to 1, by marker separation (1-4, 5-9, 10-17 and 18-26 markers).]
Dr. Eric Tsung
Clinical case-control association studies – concepts
• association studies are designed to find disease-causing genetic variants
• genotyping cases and controls at various polymorphisms
clinical cases
• searching for “significant” marker allele
frequency differences between cases
and controls
AF(controls)
clinical controls
AF(cases)
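A basic single-marker test compares allele counts in cases and controls. The sketch below uses a standard Pearson chi-square test on the 2x2 allele-count table (1 degree of freedom, no continuity correction); it is a textbook illustration and not necessarily the test implemented in the software described here. The function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def allele_association(case_minor, case_total, control_minor, control_total):
    """Allele frequencies plus a chi-square test for a 2x2 allele-count table.

    Counts are chromosomes (alleles), not individuals.
    Returns (AF_cases, AF_controls, chi_square, p_value).
    """
    a, b = case_minor, case_total - case_minor
    c, d = control_minor, control_total - control_minor
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return a / case_total, c / control_total, chi2, p_value


af_cases, af_controls, chi2, p = allele_association(300, 1000, 200, 1000)
print(af_cases, af_controls, round(chi2, 3), p)
```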
Association study designs
• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”),
or entire genome (“genome scan”)
• direct or indirect: genotype the causative variant itself, or a marker that is co-inherited with the causative variant
• single-SNP marker or multi-SNP haplotype marker
• single-stage or multi-stage
Marker (tag) selection for association studies
for economy, one cannot genotype every SNP in thousands of clinical samples:
marker selection is the process where a subset of all available SNPs is chosen
1. hypothesis driven (i.e. based on gene function)
2. LD-driven – based entirely on the reduction of redundancy presented by the
linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are
correlated with
causative variant
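LD-driven tag selection can be illustrated with a simple greedy procedure over pairwise r2 values. This is a generic sketch under assumptions, not the selection algorithm of any particular tool: SNPs are given as lists of 0/1 alleles across phased haplotypes, a tag represents every SNP whose r2 with it reaches a threshold (0.8 here, an arbitrary choice), and ties are broken alphabetically. All names are hypothetical.

```python
def r_squared(a, b):
    """Pairwise LD (r2) between two SNPs coded as 0/1 alleles per haplotype."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n   # frequency of the 1-1 haplotype
    d = pab - pa * pb
    denom = pa * (1 - pa) * pb * (1 - pb)
    return d * d / denom if denom > 0 else 0.0


def greedy_tags(snps, threshold=0.8):
    """Pick tags greedily; returns {tag: sorted list of SNPs it represents}."""
    untagged = set(snps)
    tags = {}
    while untagged:
        def coverage(candidate):
            return sum(
                r_squared(snps[candidate], snps[other]) >= threshold
                for other in untagged
            )
        best = max(sorted(untagged), key=coverage)   # most SNPs represented
        covered = {
            other for other in untagged
            if other == best or r_squared(snps[best], snps[other]) >= threshold
        }
        tags[best] = sorted(covered)
        untagged -= covered
    return tags


snps = {"s1": [0, 0, 1, 1], "s2": [0, 0, 1, 1], "s3": [0, 1, 0, 1]}   # toy data
print(greedy_tags(snps))
```

In this toy example `s1` and `s2` are perfectly correlated, so one of them is enough, while `s3` is uncorrelated with both and must be genotyped itself.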
The International HapMap project
The international HapMap project was designed to
provide a set of physical and informational reagents for
association studies by mapping out human LD structure
http://www.hapmap.org
LD varies across samples
there are large differences in LD
between different human populations…
European reference (CEU)
African reference (YRI)
… and even between samples from the
same population.
Other European samples
Sample-to-sample LD differences make tagSNP selection
problematic
groups of SNPs that are in LD in the
HapMap reference samples may not
be in a future set of clinical samples…
… and tags that were selected based
on LD in the HapMap may no longer
work (i.e. represent the SNPs they
were supposed to) in the clinical
samples…
… possibly resulting in missed disease
associations.
Natural marker allele frequency differences confound
association testing
• the HapMap reference samples are much smaller than clinical sample sizes
cases: 500-2,000 chromosomes
reference samples: ~ 120 chromosomes
controls: 500-2,000 chromosomes
• therefore difficult to assess statistical significance
of candidate associations
AF(controls)
• difficult to accurately assess both marker allele frequency (single-SNP or
haplotype frequency) in the clinical samples and naturally occurring variation of
marker allele frequency differences between cases and controls
AF(cases)
We are developing technology for assessing sample-to-sample variance in silico
we estimate LD differences between HapMap and future clinical samples…
…by generating “computational” samples representing future clinical samples…
… and use computational “proxy” samples for tabulating LD and allele frequency differences.
[Figure labels: reference, cases, controls; tag selection, tag evaluation, association testing; computational “cases” and “controls”.]
Two methods of computational sample generation
Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow.
Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
[Figure labels: HapMap, “HapMap”, “cases”, “controls”.]
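The mosaic idea behind the second method can be shown with a heavily simplified sketch. It copies alleles from one existing haplotype and switches to a randomly chosen source haplotype with a fixed probability between adjacent sites. The full Li & Stephens model also accounts for mutation and for switching rates that depend on recombination distance, which this sketch omits. The function name, the `switch_prob` parameter and the toy reference panel are hypothetical.

```python
import random


def mosaic_haplotype(reference_haplotypes, switch_prob, rng):
    """Build one new haplotype as a mosaic of existing haplotypes.

    reference_haplotypes: list of equal-length allele lists.
    switch_prob: probability of switching the copied source between sites
                 (a crude stand-in for recombination).
    rng: a random.Random instance, passed in for reproducibility.
    """
    n_sites = len(reference_haplotypes[0])
    source = rng.randrange(len(reference_haplotypes))
    mosaic = []
    for site in range(n_sites):
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference_haplotypes))
        mosaic.append(reference_haplotypes[source][site])
    return mosaic


panel = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1]]   # toy panel
rng = random.Random(2006)
computational_sample = [mosaic_haplotype(panel, 0.2, rng) for _ in range(4)]
print(computational_sample)
```

With `switch_prob` set to 0 every generated haplotype is an exact copy of one reference haplotype; larger values break up LD over shorter distances.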
Computational samples
HapMap (CEU)
Computational (PAC)
Extra genotypes (Estonia)
Computational (Coalescent)
MARKER EVALUATION with computational samples
test if markers selected from the HapMap continue to
“tag” other SNPs in their original LD group
MARKER SELECTION with computational samples
selecting tags in multiple consecutive
sets of computational samples and
choosing for the association study the
best-performing tags
ASSOCIATION TESTING with computational samples
tabulating ΔAF in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences to decide if a candidate association is statistically significant
[Figure: AF(cases) vs. AF(controls) for successive pairs of computational “cases” and “controls”.]
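Given ΔAF values tabulated from many computational case/control pairs, significance can be read off as an empirical p-value. The sketch below is one common way to do this and is an assumption about the procedure rather than a description of the software: it uses a two-sided comparison on the absolute difference and adds one to numerator and denominator so that the estimate is never exactly zero. The function name and the example numbers are made up.

```python
def empirical_p_value(observed_diff, null_diffs):
    """Fraction of computational sample pairs with |dAF| at least as large.

    observed_diff: AF(cases) - AF(controls) in the clinical samples.
    null_diffs: the same difference in each computational "cases"/"controls" pair.
    """
    extreme = sum(abs(diff) >= abs(observed_diff) for diff in null_diffs)
    return (extreme + 1) / (len(null_diffs) + 1)


# Illustrative values only.
null_diffs = [0.01, -0.02, 0.005, 0.03, -0.015]
print(empirical_p_value(0.05, null_diffs))
print(empirical_p_value(0.02, null_diffs))
```

The resolution of the p-value is limited by the number of computational pairs, so many more pairs than in this toy example would be needed in practice.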
Do computational samples represent future clinical genotypes realistically?
we quantify the quality of representation by comparing the correlation of LD between corresponding pairs of markers (i.e. we ask: if two markers were in strong LD in one set of samples, are they ALSO in strong LD in the other set?)
[Figure: scatter plot of pairwise LD in one sample set vs. the other, both axes 0 to 1.]
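The comparison can be expressed as a correlation coefficient between the LD values of corresponding marker pairs in the two sample sets. The sketch below uses the Pearson correlation as an assumed choice of statistic; the slides do not state which measure was used. The input lists hold one r2 value per marker pair, in the same order for both sample sets, and the numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# r2 for the same marker pairs in two sample sets (made-up values).
ld_reference = [0.95, 0.10, 0.60, 0.85, 0.30]
ld_computational = [0.90, 0.15, 0.55, 0.80, 0.35]
print(round(pearson(ld_reference, ld_computational), 3))
```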
LD difference -- comparison to extra experimental genotypes
• we have analyzed two extra genotype sets collected at the HapMap SNPs in
three genome regions, from our clinical collaborators (Prof. Thomas Hudson,
McGill; Prof. Stanley Nelson, UCLA)
0.949 +/- 0.013
0.963 +/- 0.014
0.978 +/- 0.010
AF difference -- comparisons to extra experimental genotypes
[Figure: AF difference in computational samples vs. AF difference in Estonian data, both axes 0 to 0.06.]
• according to our limited initial test, computational samples can represent
future clinical samples well for estimating sample-to-sample variability
A new marker selection and association testing software tool
• data visualization
  • gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence)
• representative computational sample generation
• advanced tag selection functionality
• advanced association testing functionality
• multi-level user customization including user conveniences e.g. tag prioritization based on SNP assay score
[Figure: screenshot with panels labeled tags, gene annotations, LD views, reference samples, representative computational samples and association statistics.]
User community
• companies designing new generations of whole-genome or specialized SNP arrays
• researchers comparing alternative platforms (e.g. the Affymetrix 500K and the
Illumina 300K) to find the one most suitable for their study
• clinical researchers designing candidate gene studies
• researchers designing second-stage follow-up studies in specific genome regions
after an initial genome scan (our methods can take advantage of first-stage data
already available in the clinical samples)
• the association testing features should be useful for analysts regardless of study
design
Acknowledgements
Washington University
LaDeana Hillier
Bob Waterston
Mark Yandell
Ian Korf
Warren Gish
NCBI
Steve Sherry
Stephen Altschul
Eva Czabarka
Greg Schuler
Deanna Church
Boston College
Eric Tsung
Aaron Quinlan
Michael Stromberg
Tony Schreiner
Collaborators
Aravinda Chakravarti (Hopkins)
Andy Clark (Cornell)
Pui-Yan Kwok (UCSF)
Henry Harpending (Utah)
Jim Weber (Marshfield)
Wendell Weber (Michigan)
Stan Nelson (UCLA)
Thomas Hudson (McGill & Genome Canada)
http://bioinformatics.bc.edu/marthlab
We are looking for postdocs and graduate students!