SNPs and haplotypes – software tools for genetic sequence variations

Gabor T. Marth, D.Sc.
Boston College Department of Biology
marth@bc.edu
http://bioinformatics.bc.edu/marthlab
UC Davis, June 5, 2006

Genetic variations are important because…
… they underlie phenotypic differences
… cause heritable diseases and determine responses to drugs
… allow tracking ancestral human history

We investigate several essential aspects of genetic variations
• build SNP discovery tools
• extend these tools to other polymorphisms: genetic and epigenetic, inherited and somatic
• apply our tools for genome data mining
• model human polymorphism structure to shed light on human pre-history and to inform medical research
• build tools to aid the selection of markers for clinical case-control association studies and association testing

Polymorphism discovery tools

Single-nucleotide variations
• The Human Genome Project produced a reference genome sequence that is 99.9% common to each human being
• Sequence variations make our genetic makeup unique
• Single-nucleotide polymorphisms (SNPs) are the most abundant, but other types of variations exist and are important
[Figure: SNP]

How do we find variations?
• comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)
• diverse sequence resources can be used: EST, WGS, BAC

Steps of SNP discovery
Sequence clustering → Cluster refinement → Multiple alignment → SNP detection

Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources
2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors
[Figure: sequencing error vs. true polymorphism]

SNP discovery with PolyBayes
[Figure: sequence fragments organized along the genome reference sequence]
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection

Sequence clustering
• Clustering simplifies to a search against a sequence database to recruit relevant sequences
• Clusters = groups of overlapping sequence fragments matching the genome reference
[Figure: genome reference with fragments grouped into cluster 1, cluster 2, cluster 3]

(Anchored) multiple alignment
• The genomic reference sequence serves as an anchor
• Fragments are pair-wise aligned to the genomic sequence
• Insertions are propagated – “sequence padding”
• Advantages
  • efficient: only involves pair-wise comparisons
  • accurate: correctly aligns alternatively spliced ESTs

Paralog filtering
• The “paralog problem”
  • unrecognized paralogs give rise to spurious SNP predictions
  • SNPs in duplicated regions may be useless for genotyping
• Challenge
  • to differentiate between sequencing errors and paralogous differences
[Figure: sequencing errors vs. paralogous difference]
• Pair-wise comparison between fragment and genomic sequence
• Bayesian discrimination algorithm between “ortholog” and paralog models of the number of observed mismatches
[Figure: Paralog discrimination. Curves P(d|Model_NAT), P(d|Model_PAR) and P(Model_NAT|d); x-axis: discrepancies (d), 0 to 15; y-axis: probability, 0 to 1]

SNP detection
• Goal: to discern true variation from sequencing error
[Figure: sequencing error vs. polymorphism]

Bayesian-statistical SNP detection
[Figure: columns of base calls (A, C, G, T) across aligned reads, labeled as polymorphic and monomorphic combinations. Inputs: base call + base quality, expected polymorphism rate, base composition, depth of coverage. Output: Bayesian posterior probability P(SNP), taken over all variable combinations]

\[
P(S_1, \ldots, S_N \mid R_1, \ldots, R_N) =
\frac{\dfrac{P(S_1 \mid R_1)}{P_{Prior}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{Prior}(S_N)} \, P_{Prior}(S_1, \ldots, S_N)}
{\sum\limits_{S_{i_1} \in [A,C,G,T]} \cdots \sum\limits_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{Prior}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{Prior}(S_{i_N})} \, P_{Prior}(S_{i_1}, \ldots, S_{i_N})}
\]

Priors
• Overall polymorphism rate in population, e.g. 1 / 300 bp
• Distribution of SNPs according to specific variation
• Distribution of SNPs according to minor allele frequency + alignment depth
• Pre-existing specific information about SNP
[Figure: relative occurrence (0 to 70) by variation type (AC, AG, AT, CG)]

SNP probability score
[Figure labels: polymorphism; specific variation]

Validation by resequencing
[Figure: confirmation rate [%] (0 to 100) by SNP score [%] bin: 51-60, 61-70, 71-80, 81-90, 91-100]

The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish WR. A general approach to single-nucleotide polymorphism discovery. Nat Genet. 1999 Dec;23(4):452-6.

SNP mining: genome BAC overlaps
[Figure labels: overlap detection; inter- & intra-chromosomal duplications; known human repeats; fragmentary nature of draft data; SNP analysis; candidate SNP predictions]

BAC overlap mining results
• ~ 30,000 clones
• 25,901 clones (7,122 finished, 18,779 draft with base quality values)
• 21,020 clone overlaps (124,356 fragment overlaps)
• 507,152 high-quality candidate SNPs (validation rate 83-96%)
[Figure: example clone sequences >CloneX and >CloneY, and an aligned pair ACCTAGGAGACTGAACTTACTG / ACCTAGGAGACCGAACTTACTG differing at a single base]
Marth et al., Nature Genetics 2001

SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps. Weber et al., AJHG 2002
2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries. Sachidanandam et al., Nature 2001

SNP detection in Sanger sequence traces
Aaron Quinlan

SNP discovery in clonal vs. diploid sequence
• PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects
• medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces
[Figure: strand diagrams labeled 5’ and 3’]

Heterozygotes in diploid Sanger traces
[Figure: trace panels for Ind. 1, Ind. 2, Ind. 3, Ind. 4]

Heterozygote detection is challenging
• we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions

Analyzing individual traces
[Figure: SVM function mapping the SVM score (−, 0, +) to P(Het); regions labeled Homozygotes and Heterozygotes]
Example per-trace genotype probabilities: P(CT|R) = .34, P(CT|R) = .01, P(AC|R) = .999, P(AT|R) = .001

Aggregating information from multiple traces
• forward/reverse sequences from same individual: P(GT | Read) = .98 and P(GT | Read) = .87
• resultant genotype call: P(GT) = .993

Priors: discovery vs. genotyping
• discovery: “uninformed prior”
  • don’t know if site is polymorphic
  • have to test each site
  • Prior(CT) = .001
• genotyping: “informed prior”
  1. site is known to be polymorphic
  2. allele frequency estimate
  • Prior(CT) = 0.34

Performance
Method        Fraction of data analyzed   False discovery rate   Fraction of heterozygotes found   Fraction of homozygotes found
PolyBayes+    85.1                        0.0375                 86.60%                            97.8%
Polyphred 5   86.17                       0.0389                 83.16%                            82.63%
Performance measured on ~1000 alignments covering a 500 Kb region of chromosome 4

Base calling for 454 pyro-sequencer flowgrams
• readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle
[Figure: flowgram for the read TCAGGGGGGGGGGGACGACAAGGCGT…]
• the identity of consecutive bases is very reliable but the length of mono-nucleotide runs (base number) is difficult to quantify (great for re-sequencing; but problematic for de novo sequencing)

The uncertainty in base number in 454 traces

A Bayesian base-calling strategy for 454 traces
• data likelihoods (i.e. the probability distribution of signal intensity S for a given base number N): from pair-wise alignments between training data and genome reference sequence
[Figure: probability of signal (0 to 0.07) vs. pyrosequencing signal (0 to 8), one curve per base number 0 to 5]
• prior probabilities of possible base numbers: from genome reference sequence, e.g. P(0A) = ~0.74, P(1A) = ~0.24, P(2A) = 0.0625, P(3A) = 0.0156, …
• the posterior probability of base number given the signal intensity:

\[
P(N \mid S) = \frac{P(S \mid N) \, PR(N)}{\sum_{i=0}^{n} P(S \mid N_i) \, PR(N_i)}
\]

Base-calling accuracy
[Figure: Comparison of corrections made by each method. Number of corrections made for signal (0 to 120) vs. signal (0 to 6); series: “Us Correct, 454 Wrong” and “454 Correct, Us Wrong”]
Overall accuracy on 10,000 test traces:
• 454 method: 97.13%
• PyroBayes: 99.15%

Somatic mutation detection
Michael Stromberg

Somatic mutations
The detection of somatic mutations, and their distinction from inherited polymorphism, is important to separate pre-disposing variants from mutations that occur during disease progression, e.g. in cancer
1. detect the mutations
2. classify whether somatic or inherited
[Figure © Brian Stavely, Memorial University of Newfoundland]

Detection using comparative data
• based on comparison of cancer and normal tissue from the same individual
• often cancer tissue is highly heterogeneous and the somatic mutant allele may be present at low allele frequency

Detecting somatic mutations with subtraction
• if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence
• search for evidence that these mutations are genetic
• subtract apparent mutations that are present in sequence variation databases

Somatic mtDNA mutations in murine brain cancers
• compared mitochondrial reads from cancer and normal tissue
• found both heteroplasmic and homoplasmic mutations
• some confirm known sites and some are novel
[Figure panels: heteroplasmy; homoplasmy]

Future: tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development
• somatic mutations
• chromosome rearrangements
• methylation profiles
• chromatin structure
• copy number changes
• gene expression profiles
• repeat expansions

Population genetic modeling – human prehistory

The current variation resource
• The current public resource (dbSNP) contains over 10 million SNPs
1. How are these SNPs structured within the genome?
2. What can we learn about the processes that shape human variability?
3. What is the utility of these data for medical applications?
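As one illustration of how a variation database can be put to use, the subtraction step described earlier for somatic mutation detection can be sketched as a simple set difference. This is only a minimal sketch under stated assumptions: variants are represented as hypothetical (chromosome, position, reference allele, alternate allele) tuples, the function name `subtract_known_variants` is invented for illustration, and the "database" is an in-memory set rather than any real database interface.

```python
def subtract_known_variants(candidates, known_variants):
    """Split candidate mutations into putative somatic and likely inherited.

    candidates:     iterable of (chrom, pos, ref, alt) tuples detected in
                    cancer tissue against a reference sequence (assumed format)
    known_variants: iterable of tuples in the same format, standing in for
                    the contents of a sequence variation database
    """
    known = set(known_variants)
    putative_somatic = []
    likely_inherited = []
    for variant in candidates:
        # A candidate already catalogued as a polymorphism is treated as
        # inherited and subtracted from the somatic candidate list.
        if variant in known:
            likely_inherited.append(variant)
        else:
            putative_somatic.append(variant)
    return putative_somatic, likely_inherited


# Hypothetical example data, for illustration only.
example_candidates = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
example_known = {("chr2", 200, "C", "T")}
somatic, inherited = subtract_known_variants(example_candidates, example_known)
```

Absence from a database does not prove that a variant is somatic, so the first list is best read as candidates for follow-up rather than as confirmed somatic mutations.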

Nucleotide diversity is heterogeneous at the scale of the chromosomes
[Figure: distributions in different regions of given lengths (4 kb, 8 kb, 12 kb, 16 kb)]

Compositional and functional features
• G+C nucleotide content [Figure: SNP rate [per 10,000 bp] vs. G+C content [%]]
• CpG di-nucleotide content [Figure: SNP rate [per 10,000 bp] vs. CpG content [%]]
• recombination rate [Figure: SNP rate [per 10,000 bp] vs. recombination rate [per Mb]]
• functional constraints
  Region          Value
  3’ UTR          5.00 x 10^-4
  5’ UTR          4.95 x 10^-4
  Exon, overall   4.20 x 10^-4
  Exon, coding    3.77 x 10^-4
  synonymous: 366 / 653; non-synonymous: 287 / 653

Variance is so high that these quantities are poor predictors of nucleotide diversity in local regions; hence random processes are likely to govern the basic shape of the genome variation landscape (described by neutral theory)

Strategy: measure genome-wide distributions of DNA polymorphism data…
1. marker density (MD): distribution of number of SNPs in pairs of sequences
2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples
[Figure: MD and AFS histograms; AFS classes run from “rare” to “common”]

… build models of these distributions under competing scenarios of human demographic history…
[Figure: four scenarios of history from past to present (stationary, collapse, expansion, bottleneck), each with its MD (simulation) and AFS (direct form)]

… and determine the best-fitting models.
• European data: genetic bottleneck
• African data: modest but uninterrupted expansion
Marth et al., PNAS 2003; Genetics 2004

Relevance to human demographic history
[Figure labels: our results; Recent African Origin; Multiregional]

Computer software to aid case-control association studies: tagSNP selection and association testing (details)
Dr. Eric Tsung
[Figure: 5-site computationally generated LD (r²) vs. LA LD (r²), both 0 to 1; series by marker separation (Mrk Sep.): 1-4, 5-9, 10-17, 18-26]

Clinical case-control association studies – concepts
• association studies are designed to find disease-causing genetic variants
• genotyping cases and controls at various polymorphisms
• searching for “significant” marker allele frequency differences between cases and controls
[Figure: clinical cases vs. clinical controls; AF(cases) vs. AF(controls)]

Association study designs
• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”)
• direct or indirect: the causative variant itself, or a marker that is co-inherited with the causative variant
• single-SNP marker or multi-SNP haplotype marker
• single-stage or multi-stage

Marker (tag) selection for association studies
For economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process where a subset of all available SNPs is chosen
1. hypothesis driven (i.e. based on gene function)
2. LD-driven – based entirely on the reduction of redundancy presented by the linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are correlated with
[Figure label: causative variant]

The International HapMap project
The International HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure
http://www.hapmap.org

LD varies across samples
There are large differences in LD between different human populations…
[Figure panels: European reference (CEU); African reference (YRI)]
… and even between samples from the same population.
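
The LD-driven tag selection idea above (tags represent the other SNPs they are correlated with) can be sketched as follows. This is a rough illustration rather than the software described in this talk: it assumes phased haplotypes coded as tuples of 0/1 alleles, uses the standard r² measure of pairwise LD, and applies a simple greedy selection with an r² threshold of 0.8, which is a common but arbitrary choice. The function names `r_squared` and `greedy_tags` are hypothetical.

```python
def r_squared(haps, i, j):
    """Pairwise LD (r^2) between sites i and j from phased 0/1 haplotypes."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n            # frequency of allele 1 at site i
    p_j = sum(h[j] for h in haps) / n            # frequency of allele 1 at site j
    p_ij = sum(1 for h in haps if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0                               # a monomorphic site carries no LD
    d = p_ij - p_i * p_j                         # disequilibrium coefficient D
    return d * d / denom


def greedy_tags(haps, threshold=0.8):
    """Greedily pick tag sites so every site has r^2 >= threshold with a tag."""
    n_sites = len(haps[0])
    # For each site, the set of sites it would represent if chosen as a tag.
    covers = {
        i: {j for j in range(n_sites)
            if i == j or r_squared(haps, i, j) >= threshold}
        for i in range(n_sites)
    }
    untagged = set(range(n_sites))
    tags = []
    while untagged:
        # Choose the untagged site that represents the most untagged sites;
        # ties go to the lowest site index.
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical toy data: sites 0 and 1 are perfectly correlated, site 2 is not.
toy_haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
toy_tags = greedy_tags(toy_haps)
```

Because r² is estimated from the haplotypes at hand, running the same selection on a different set of samples can yield different LD groups and therefore different tags.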
[Figure panel: other European samples]

Sample-to-sample LD differences make tagSNP selection problematic
Groups of SNPs that are in LD in the HapMap reference samples may not be in a future set of clinical samples…
… and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples…
… possibly resulting in missed disease associations.

Natural marker allele frequency differences confound association testing
• the HapMap reference samples are much smaller than clinical sample sizes
  • cases: 500-2,000 chromosomes
  • reference samples: ~ 120 chromosomes
  • controls: 500-2,000 chromosomes
• therefore difficult to assess statistical significance of candidate associations
• difficult to accurately assess both marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and naturally occurring variation of marker allele frequency differences between cases and controls
[Figure: AF(cases) vs. AF(controls)]

We are developing technology for assessing sample-to-sample variance in silico
We estimate LD differences between HapMap and future clinical samples…
… by generating “computational” samples representing future clinical samples…
… and use computational “proxy” samples for tabulating LD and allele frequency differences.
[Figure labels: reference; tag selection; tag evaluation; association testing; cases; controls; “cases”; “controls”]

Two methods of computational sample generation
Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow.
[Figure labels: HapMap; “HapMap”; “cases”; “controls”]
Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
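
The mosaic idea behind Method 2 can be illustrated with a heavily simplified copying process. The sketch below is not the Li & Stephens PAC model itself, nor the implementation used in this work: it assumes phased reference haplotypes coded as 0/1 tuples, copies only from the reference panel (never from previously generated haplotypes), and uses a constant per-site switch probability as a stand-in for recombination. The parameter names `switch_prob` and `mutation_prob` and both function names are hypothetical.

```python
import random


def mosaic_haplotype(reference_haps, switch_prob=0.05, mutation_prob=0.0, rng=None):
    """Build one new haplotype as a mosaic of existing reference haplotypes."""
    rng = rng or random.Random()
    n_sites = len(reference_haps[0])
    source = rng.randrange(len(reference_haps))   # haplotype currently being copied
    new_hap = []
    for site in range(n_sites):
        # Mimic recombination: occasionally jump to a randomly chosen
        # reference haplotype (which may happen to be the same one).
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference_haps))
        allele = reference_haps[source][site]
        # Optional imperfect copying, standing in for mutation.
        if rng.random() < mutation_prob:
            allele = 1 - allele
        new_hap.append(allele)
    return tuple(new_hap)


def computational_sample(reference_haps, n_chromosomes, **kwargs):
    """Generate a set of mosaic haplotypes of the requested size."""
    return [mosaic_haplotype(reference_haps, **kwargs) for _ in range(n_chromosomes)]


# Hypothetical toy reference panel; a "cases"/"controls" pair of samples.
toy_reference = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 1, 0, 1)]
toy_rng = random.Random(2006)
toy_cases = computational_sample(toy_reference, 10, rng=toy_rng)
toy_controls = computational_sample(toy_reference, 10, rng=toy_rng)
```

Generating many such pairs of samples and tabulating LD or allele frequency differences between them gives an empirical picture of sample-to-sample variability, at the cost of the approximations listed above.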

Computational samples
[Figure panels: HapMap (CEU); Computational (PAC); Extra genotypes (Estonia); Computational (Coalescent)]

MARKER EVALUATION with computational samples
Test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group

MARKER SELECTION with computational samples
Selecting tags in multiple consecutive sets of computational samples and choosing the best-performing tags for the association study

ASSOCIATION TESTING with computational samples
Tabulating ΔAF in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences to decide if a candidate association is statistically significant
[Figure: repeated pairs of “cases” and “controls”; AF(cases) vs. AF(controls)]

Do computational samples represent future clinical genotypes realistically?
We quantify the quality of representation by the correlation of LD between corresponding pairs of markers (i.e. if two markers are in strong LD in one set of samples, are they ALSO in strong LD in the other set?)
[Figure: scatter plot, both axes 0 to 1]

LD difference: comparison to extra experimental genotypes
• we have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA)
[Figure values: 0.949 +/- 0.013; 0.963 +/- 0.014; 0.978 +/- 0.010]

AF difference: comparisons to extra experimental genotypes
[Figure: AF Diff, Comp Samples vs. AF Diff, Estonian Data, both 0 to 0.06]
• according to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability

A new marker selection and association testing software tool
• data visualization
• gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence)
• representative computational sample generation
• advanced tag selection functionality
• advanced association testing functionality
• multi-level user customization including user conveniences, e.g. tag prioritization based on SNP assay score
[Screenshot labels: tags; gene annotations; LD views; reference samples; representative computational samples; association statistics; plot of 5-site computationally generated LD (r²) vs. LA LD (r²) by marker separation (1-4, 5-9, 10-17, 18-26)]

User community
• companies designing new generations of whole-genome or specialized SNP arrays
• researchers comparing alternative platforms (e.g. Affymetrix 500K and Illumina 300K) to find the one most suitable for their study
• clinical researchers designing candidate gene studies
• researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples)
• the association testing features should be useful for analysts regardless of study design

Acknowledgements
Washington University: LaDeana Hillier, Bob Waterston, Mark Yandell, Ian Korf, Warren Gish
NCBI: Steve Sherry, Stephen Altschul, Eva Czabarka, Greg Schuler, Deanna Church
Boston College: Eric Tsung, Aaron Quinlan, Michael Stromberg, Tony Schreiner
Collaborators: Aravinda Chakravarti (Hopkins), Andy Clark (Cornell), Pui-Yan Kwok (UCSF), Henry Harpending (Utah), Jim Weber (Marshfield), Wendell Weber (Michigan), Stan Nelson (UCLA), Thomas Hudson (McGill & Genome Canada)

http://bioinformatics.bc.edu/marthlab
We are looking for postdocs and graduate students!