Case-control association techniques in genetic studies
March 10, 2011
Karen Curtin, Ph.D.
Division of Genetic Epidemiology and HCI Pedigree & Population Resource (PPR)

Presentation outline
• Background (genetics concepts)
• Basic case-control association
• Complex case-control association
• Genome-wide association

The Human Genome
• 6 billion DNA bases (Adenine, Cytosine, Guanine, or Thymine)
License: Creative Commons Attribution 2.0

Genotype and Haplotype
… A G C C A A A C T G A A T T C …   (alleles C and A)
… A G C C A A A T T G G A T T C …   (alleles T and G)
At any locus (position on a chromosome):
• Read across both chromosomes: genotype CT
• Read along a chromosome: haplotypes C-A and T-G
• If allele T can predict allele G, the two alleles are in Linkage Disequilibrium (LD)

Single Nucleotide Polymorphism (SNP)
• Two alternate forms (alleles) that differ in sequence at one point in a DNA segment
• 90% of genomic variants are SNPs
Source: David Hall, Creative Commons Attribution 2.5 license

Genetic variants: Germline vs. Somatic
• Germline variants/mutations
  – Inherited/in-born mutation
  – In all cells
  – In particular, in germline haploid cells
    • Heritable
  – Cell division: meiosis
• Somatic variants/mutations
  – Acquired mutation
  – Only in an isolated number of cells (tumor site)
    • Generally not heritable
  – Cell division: mitosis

Hereditary mutation - meiosis
[Figure: parent germ cells; daughter cells (HAPLOID); new zygotes (DIPLOID)]

Presentation outline
• Background (genetics concepts)
• Basic case-control association
• Complex case-control association
• Genome-wide association

Genetic variants in association studies
Association: two characteristics (disease & genetic variant) occur together more often than expected by chance
• Direct Association / Causal: Functional variant → Disease
  – Functional variant is involved in disease
  – Functional variant is associated with the disease
• Indirect Association: Genetic variant → Functional variant → Disease
  – Genetic variant (SNP) is associated/correlated with underlying functional
variant
  – Functional variant is involved in disease
  – Genetic variant (marker) is associated with disease (initial step; the ultimate goal is to discover the causal variant)

Genetic association study designs
• Observational
  – Exposure variables
    • Genetic variants
    • Environmental factors
• Classical association study designs
  – Unit of interest is an individual
  – Cohort study (cross-sectional or longitudinal)
  – Case-control study
• Family-based association study
  – Unit of interest is a family unit

Case-Control Study
• Sample individuals based on disease status and without knowledge of exposure status (e.g. genotype)
  – CASES (with disease)
  – CONTROLS (no disease)
• Usually balanced design (#cases = #controls)
• Retrospective
• Neither prevalence nor incidence can be estimated

Types of Case-Control Study
• Population-based
  – Risk estimates can be extrapolated to the source population
  – Could be nested in a cohort study
• Selected sampling
  – Increases power to detect associations
    • Antoniou & Easton (2003)
  – Tests of independence are valid
  – True positive risks are exaggerated
    • Cannot be extrapolated

Case-Control: Population-based
• Source population
  – All individuals satisfying predefined criteria
• Source cohort
  – A group that is ‘representative’ of the source population
  – CASES and CONTROLS occur in relation to population prevalence
• CASES
  – Cases selected are ‘representative’ of cases in the source cohort
  – In particular, in terms of the exposure variables
• CONTROLS
  – Controls selected are ‘representative’ of controls in the source cohort
  – In particular, in terms of the exposure variables
• Odds Ratio (estimate of the relative risk) can be extrapolated back to the source population
  – Population Attributable Risk (PAR)

Case-Control: Selected Sampling
• Source population
  – All individuals satisfying predefined criteria
• Source cohort
  – A group that is ‘representative’ of the source population
  – CASES and CONTROLS occur in relation to population
prevalence
• CASES
  – Cases selected are in effect selectively sampled from cases in the source cohort
  – Family history of disease, severe disease, early onset,…
• CONTROLS
  – Controls selected are in effect selectively sampled from controls in the source cohort
  – Screened negative, no family history,…
• Association analyses are still valid and power may be increased BUT…
  – Odds Ratio (estimate of the relative risk) cannot be extrapolated back to the source population

Case-Control Study: Odds Ratio
                   Exposure
Disease            Yes    No
Cases (Yes)        a      b
Controls (No)      c      d
Odds Ratio: $\mathrm{OR} = \frac{a/b}{c/d} = \frac{a \times d}{b \times c}$
H0: OR = 1, same risk (no association)
OR > 1 indicates increased risk
OR < 1 indicates decreased risk (protective)

95% confidence intervals for the Odds Ratio
Lower and upper bounds for the risk estimates. Two common methods:
1) $e^{\ln(\mathrm{OR}) - 1.96\,\mathrm{se}(\ln \mathrm{OR})}$, $e^{\ln(\mathrm{OR}) + 1.96\,\mathrm{se}(\ln \mathrm{OR})}$, where $\mathrm{se}(\ln \mathrm{OR}) = \sqrt{1/a + 1/b + 1/c + 1/d}$
2) $\mathrm{OR}^{1 - 1.96/\chi}$, $\mathrm{OR}^{1 + 1.96/\chi}$, where $\chi$ is the square root of the chi-square statistic

chi-square test
Compares observed values (O) with those expected under independence between rows and columns
Expected: $E = \frac{\text{row total} \times \text{column total}}{N}$
chi-square statistic, with (rows-1) × (columns-1) degrees of freedom:
$\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2_{(\text{rows}-1)(\text{columns}-1)}$

Test for Non-independence
H0: Disease and exposure (genotype) are independent
chi-square tests: contingency tables
• 2×3 genotype table (2 df)
• 2×2 grouped genotype table (1 df)
  – Dominant or recessive
• 2×3 ‘dose-dependent’ table
  – Armitage test for trend (1 df)
• 2×2 allele table (1 df)

Modeling genetic exposures
• Exposure = genotype
• Single variant with 2 alleles (SNP)
• Three genotypes: CC, CT, TT
• 2×3 contingency table
  – Chi-sq 2 df
  – Chi-sq 1 df (impose a linear dependency between columns)
            CC   CT   TT
Controls    a    b    c
Cases       d    e    f

Mode of Expression / Inheritance
• Let allele C be disease causing
• Examples of modes of expression are:
1. Dominant
   • Being heterozygous or homozygous for the C allele gives rise to the disease
2.
Recessive
   • Only being homozygous for the C allele results in disease
3. Codominant
   • All three genotypes can be distinguished phenotypically
   • ‘Additive’ model: TC has r-fold risk, CC has 2r-fold risk
[Diagram: genotypes TT, TC, CC shown for each mode]

chi-square test
Observed (expected)   CC          CT        TT        Totals
Controls              120 (108)   40 (45)   20 (27)   180
Cases                 120 (132)   60 (55)   40 (33)   220
Totals                240         100       60        400
$\chi^2 = \frac{(120-108)^2}{108} + \frac{(40-45)^2}{45} + \frac{(20-27)^2}{27} + \frac{(120-132)^2}{132} + \frac{(60-55)^2}{55} + \frac{(40-33)^2}{33} = 6.73$
p-value = 0.034 (for a chi-square distribution with 2 df)

Genotypic relative risk
• Assess risk (OR) for each genotype relative to the homozygous common genotype
  – CT vs. CC: $\mathrm{OR}_{het} = \frac{a \times e}{b \times d}$
  – TT vs. CC: $\mathrm{OR}_{hzv} = \frac{a \times f}{c \times d}$
            Genotype (exposure)
            CC   CT   TT
Controls    a    b    c
Cases       d    e    f

chi-square test / genotypic relative risk
Observed (expected)   CC          CT        TT        Totals
Controls              120 (108)   40 (45)   20 (27)   180
Cases                 120 (132)   60 (55)   40 (33)   220
Totals                240         100       60        400
Chi-statistic = 6.73, p-value = 0.034 (for a chi-square distribution with 2 df)
$\mathrm{OR}_{het}$ (CT vs. CC) = 1.5
$\mathrm{OR}_{hzv}$ (TT vs. CC) = 2.0

Test for Non-independence
H0: Disease and exposure (genotype) are independent
chi-square tests: contingency tables
• 2×3 genotype table (2 df)
• 2×2 grouped genotype table (1 df)
  – Dominant or recessive
• 2×3 ‘dose-dependent’ table
  – Armitage test for trend (1 df)
• 2×2 allele table (1 df)

Dominant model for exposure
Exposure = CT&TT genotypes: 2×2 $\chi^2$ test with 1 df
$\mathrm{OR}_{dom} = \frac{a \times (e+f)}{d \times (b+c)} = 1.67$
            Genotype
            CC       CT/TT
Controls    a=120    (b+c)=60
Cases       d=120    (e+f)=100

Recessive model for exposure
Exposure = TT genotype (vs.
CC&CT): 2×2 $\chi^2$ test with 1 df
$\mathrm{OR}_{rec} = \frac{(a+b) \times f}{(d+e) \times c} = 1.78$
            Genotype
            CC/CT        TT
Controls    (a+b)=160    c=20
Cases       (d+e)=180    f=40

Test for Non-independence
H0: Disease and exposure (genotype) are independent
chi-square tests: contingency tables
• 2×3 genotype table (2 df)
• 2×2 grouped genotype table (1 df)
  – Dominant or recessive
• 2×3 ‘dose-dependent’ table
  – Armitage’s trend test (1 df)
• 2×2 allele table (1 df)

Armitage Trend Test (2×3 with 1 df)
Assess departures from a fitted trend
$A = \sum_i r_i x_i - \frac{R}{N} \sum_i n_i x_i$
$B = \frac{R(N-R)}{N^2 (N-1)} \left[ N \sum_i n_i x_i^2 - \left( \sum_i n_i x_i \right)^2 \right]$
$\chi^2_{trend} = \frac{A^2}{B}$
            CC (x1=0)   CT (x2=1)   TT (x3=2)   Total
Controls    a (r1)      b (r2)      c (r3)      R
Cases       d           e           f
Total       n1          n2          n3          N

Example – genotypic relative risk and trend test
[Table from Shephard et al. Cancer Res 2009]

Test for Non-independence
H0: Disease and exposure (genotype) are independent
chi-square tests: contingency tables
• 2×3 genotype table (2 df)
• 2×2 grouped genotype table (1 df)
  – Dominant or recessive
• 2×3 ‘dose-dependent’ table
  – Armitage’s trend test (1 df)
• 2×2 allelic table (1 df)

Allelic Test
• Exposure = Allele (T vs. C)
• 2×2 table (1 df) for a single SNP
• Count every allele (2 per person)
  – Doubles the sample size
$\mathrm{OR}_{allele} = \frac{(2a+b) \times (2f+e)}{(2c+b) \times (2d+e)}$
            Allele
            C            T
Controls    2a+b=280     2c+b=80
Cases       2d+e=300     2f+e=140
OR = 1.633 (T vs. C allele)

Example – allelic association
[Figure with genotype classes 11, 12, 22, from Xue et al. Arch Oral Bio 2009]

More flexible techniques
• If other factors may have an effect on disease status (affected/unaffected, case/control)
  – We want to account for these as covariates
  – We want to adjust for matching variables (age, sex, etc.)
• Logistic regression
  – Logistic transformation (logit)
  – $\ln(p/(1-p)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots$
  – Coefficients $\alpha$ and $\beta$’s are estimated using maximum likelihood estimation (MLE)
  – Test H0: $\beta = 0$ against H1: $\beta = \hat{\beta}$ using a likelihood ratio test (LRT)
• Must decide on how to model the genetic exposure
  – genotype categories (i.e.
CC, CT, TT), dominant, recessive, additive (allele dose)..

Example of logistic regression model with genetic exposure and covariates
[Table from Slattery et al. IJC 2010]

Assumptions for Validity
• Independence of all individuals
  – Independent and identically distributed (iid)
• Reasonable sample sizes
  – Contingency tables
    • Expected values all > 1 and 80% > 5
  – Logistic regression
    • Minimum of 15-20 individuals per group
• If violated
  – Simulate the null distribution for testing
    • Permutation test
      – e.g. Fisher’s exact test is an exhaustive permutation test
    • Monte Carlo simulation

Presentation outline
• Background (genetics concepts)
• Basic case-control association
• Complex case-control association
• Genome-wide association

Performing haplotype analyses
• Single locus
  – We observe genotypes, so testing is straightforward: counting into a contingency table
            CC   CT   TT
Controls    a    b    c
Cases       d    e    f

Performing haplotype analyses
• Multi-locus
  – Haplotypes are not directly observed
  – But can be estimated (EM/Bayesian…)
  – For some individuals, their haplotype pair can be inferred unambiguously
  – For many individuals it cannot
    • “Phase uncertainty”
  – All analyses of haplotypes must take into account the phase uncertainty in the data
    • Otherwise, increase in type 1 errors

Haplotypes / Genotypes
… A G C T A A A C T G G A T T …   (alleles C and G)
… A G C C A A A C T G G A T T …   (alleles C and G)
Two-locus haplotypes: the haplotype pair must be C-G and C-G (UNAMBIGUOUS)

Estimating haplotypes
Genotypes               Haplotypes
Locus 1    Locus 2
CC         GG           C-G & C-G
CC         GA           C-G & C-A
CC         AA           C-A & C-A
CT         GG           C-G & T-G
CT         GA           ?(C-G & T-A) or (C-A & T-G)?
CT         AA           C-A & T-A
TT         GG           T-G & T-G
TT         GA           T-G & T-A
TT         AA           T-A & T-A

Estimating haplotypes
• Expectation-maximization (EM) algorithm
  – SNPHAP (Johnson et al 2001)
  – GCHap (Thomas 2003)
• Bayesian MCMC approach
  – PHASE (Stephens et al 2001)
• Both approaches assume independent individuals
• Use to estimate
  – Population haplotype frequencies estimated from a set of individuals
  – Most likely haplotype pair for each individual

Traditional methods for phase uncertainty
• Likelihood-based approach
  – Each individual can have multiple different haplotype pairs that are consistent with the genotype data
    • Some pairs of haplotypes are more or less likely than others
    • Each pair is given a weight
    • All possible haplotype pairs are considered in the case-control analysis, weighted by their probabilities
[Example: Slattery et al. CEBP 2007]

Simulation methods for phase uncertainty
• Sample over the observed data
• Instead of weighting all the possible haplotype pairs for every individual and incorporating all at once into the analysis
  – Sample one pair for each individual
    • Randomly and in proportion to the weights, select a haplotype pair for each individual
    • Perform the analysis as if those were observed
    • Repeat 1,000 times…
    • Average
• SIMHAP (McCaskie et al.)

Simulation methods for phase uncertainty
• Monte Carlo testing
  – Simulate the null, matched to the real data
• Instead of weighting all the possible haplotype pairs for every individual and incorporating all at once into the analysis
  – Assign each individual their most likely haplotype pair
    • Cases and controls separately
  – Simulate null haplotype data
    • Null: Convert haplotypes to genotypes
    • Null: Estimate haplotypes
    • Null: Assign each individual their most likely haplotype pair
  – Real and null are matched
  – Test real data (with most likely haplotype pairs assigned) against the simulated null
• hapMC (Thomas et al.)
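
The sampling approach described above (select one haplotype pair per individual in proportion to its weight, analyse as if observed, repeat, average) can be sketched in a few lines of Python. This is an illustration only, not the SIMHAP or hapMC implementation: the function names, the `(pairs, weights)` input layout, the 0.5 added to each cell, and the use of a simple haplotype odds ratio as the analysis step are all assumptions made for the example.

```python
import random

def sample_pairs(individuals, rng):
    """Draw one haplotype pair per individual, in proportion to its weight."""
    return [rng.choices(pairs, weights=weights, k=1)[0]
            for pairs, weights in individuals]

def haplotype_or(case_pairs, control_pairs, target):
    """Odds ratio for haplotype `target` vs. all others (2 haplotypes per person)."""
    def count(pairs):
        hits = sum(h == target for pair in pairs for h in pair)
        return hits, 2 * len(pairs) - hits
    a, b = count(case_pairs)
    c, d = count(control_pairs)
    # Assumption: add 0.5 to each cell so an empty cell cannot cause division by zero.
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

def averaged_or(cases, controls, target, n_rep=1000, seed=0):
    """Repeat sample-then-analyse n_rep times and average the odds ratios."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        total += haplotype_or(sample_pairs(cases, rng),
                              sample_pairs(controls, rng), target)
    return total / n_rep

# Hypothetical input: each individual is (candidate haplotype pairs, their weights).
ambiguous = ([("C-G", "T-A"), ("C-A", "T-G")], [0.7, 0.3])  # CT / GA double heterozygote
certain = ([("C-G", "C-G")], [1.0])                          # CC / GG, phase known
cases = [ambiguous, certain, certain]
controls = [ambiguous, ([("T-A", "T-A")], [1.0]), ([("C-A", "T-A")], [1.0])]
print(averaged_or(cases, controls, target="C-G"))
```

Only the double heterozygote contributes randomness here; individuals whose phase is unambiguous carry a single pair with weight 1 and are counted identically in every replicate.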
Exponential explosion… high dimensional data
• 1 SNP
  – 2 alleles: 1 test
  – 3 genotypes: 1+ tests
• 2 SNP loci
  – 4 haplotypes
• 3 SNP loci
  – 8 haplotypes
• 10 SNP loci
  – 1024 haplotypes: many tests..

Multi-locus… but how many, and which loci to test?
• For example… 20 tSNPs
  – Only perform single SNP analyses?
  – Perform tests on all 20-locus haplotypes?
    • Group all ‘rare’ haplotypes together
    • Cluster to reduce dimension
  – Multi-locus tests with subsets of 20 SNPs?
    • Subsets of which SNPs?

Data mining approach to haplotype construction – hapConstructor (Abo et al.)
• Automatically builds haplotypes (or composite genotypes)
  – Non-contiguous SNPs
  – In a case-control framework
  – All SNP haplotypes are phased during 1st stage and used in all subset analyses
  – Starts with each single SNP locus
• Forward-backward process driven by significance thresholds
• Significance and false discovery rates (p-values and q-values) reported for the building process
• Computationally challenging, potentially time intensive

Multilocus model building example using hapConstructor
[Figure: 16 SNPs, from Curtin et al. BMC Med Genet 2010]

Multilocus haplotype association using hapConstructor
[Table from Curtin et al. BMC Med Genet 2010]

Meta-association in case-control studies
• Association: two characteristics occur together more often than expected by chance
  – Disease
  – Genetic variants
• Meta-Association: study of association across case-control data collected by multiple study sites (collaborative effort)
  – NARAC: North American Rheumatoid Arthritis Consortium
  – BCAC: Breast Cancer Association Consortium
VS.
“Meta-analysis of individual level data from participants in a systematically ascertained group of studies” (Petitti definition)

Meta-analysis of multi-study case-control data: general concepts
• simple pooling
  – combine individual level data from multiple studies and compute association statistics
• fixed effects models
  – inference is conditional on the studies actually done
  – in genetic association, assumes same genetic effect size across studies
• random effects models
  – inference is based on assuming studies in the analysis are a ‘random sample’ of a hypothetical population of studies

Fixed effects models
• Methods and effect measures
  – Mantel-Haenszel: Odds ratio; also rate, risk ratio
    • well-known method for calculating summary estimate of effect across strata (i.e. multiple studies)
  – Peto: Ratio (can approximate odds ratio)
    • modification of M-H method
  – General variance-based: Ratio (all types) and rate differences

Mantel-Haenszel method (fixed effects)
                Exposed       Not Exposed   Total
Diseased        a_i           b_i           a_i + b_i
Not Diseased    c_i           d_i           c_i + d_i
Total           a_i + c_i     b_i + d_i     n_i
where i is the ith stratum (study)

Mantel-Haenszel method (fixed effects) summary odds ratio
$\mathrm{OR}_{mh} = \frac{\sum (\mathrm{weight}_i \times \mathrm{OR}_i)}{\sum \mathrm{weight}_i}$
where:
$\mathrm{OR}_i = \frac{a_i \times d_i}{b_i \times c_i}$
$\mathrm{weight}_i = 1/\mathrm{variance}_i$ (variance component of effect size within studies only)
$\mathrm{variance}_i = \frac{n_i}{b_i \times c_i}$

Mantel-Haenszel method (fixed effects) summary odds ratio
• Strengths
  – Optimal statistical properties (uniformly most powerful test)
  – When the M-H estimate of OR = 1, the M-H chi-square = 0 (mathematical connection of effect with summary statistic)
  – Widely available in statistical software
• Limitations
  – Requires data to complete 2x2 table for all studies (potential exclusion bias)
  – Ignores confounding not taken into account by study design (i.e.
age, sex-matched controls)
    • could use logistic regression estimate of OR to simultaneously model confounding variables and to adjust for study site

CMH chi-square general association test of independence (fixed-effect method)
• Extension of Cochran-Mantel-Haenszel (CMH) test to sets of (X by Y) contingency tables (i.e. studies)
• Formulas for the CMH statistics are more easily defined in terms of matrices (Landis and Koch 1978)
• Assumes study strata are independent, and that the marginal totals of each stratum are fixed
  – H0: there is no association between X (disease status) and Y (genotype) in any of the strata
  – corresponding model is the multiple hypergeometric

Heterogeneity
• If H0: homogeneity is rejected, studies are not measuring an effect of the same size
• Tests of Heterogeneity
  – Q test ~ chi-square with d.f. = #studies – 1
    • Mantel-Haenszel method: $Q = \sum \left[ \mathrm{weight}_i \times (\ln \mathrm{OR}_{mh} - \ln \mathrm{OR}_i)^2 \right]$
  – Logistic regression: add a term for interaction between study and genotypes in model (test using Wald or Likelihood Ratio)
• When heterogeneity is not extreme, fixed- and random-effects models yield similar results

Random effects models
• Methods and effect measures
  – DerSimonian-Laird (1986): Ratio (all types) and difference
  – Bagos and Nikolopoulos (2007): Odds ratio
    • study-specific coefficient in logistic regression model representing deviation of study i’s true genotypic effect from the overall mean effect
• incorporates between-study component of variance; CIs at least as wide as (or wider than) fixed effects

Fixed- vs. Random- Assumptions
• analysis under the fixed model addresses the question: Was there a genotype-phenotype association in the consortium of case-control studies used in the meta-analysis?
• under the random model, the question is: Will there be a genotype-phenotype association “on average”?

Independent individuals
• If study cases and controls are independent (unrelated) individuals, meta-association is straightforward...

Straightforward...
• Adjust for ‘study site’ in a logistic regression
• Use Cochran-Mantel-Haenszel (CMH) techniques, controlling for study
  – CMH test of association
  – CMH test of trend
  – meta odds ratio estimate

Cox et al, Nature Genetics (2007)
• Test of H0: no association included terms for genotype and BCAC study
• Trend test included 1 parameter for allele dose and a term for BCAC study
• Genotype-specific risks estimated as ORs using logistic regression with BCAC study as a covariate (fixed-effects)
• Tested heterogeneity between studies by comparing logistic regression models with and without a genotype × study interaction term
• Data also analyzed using a random-effects model, test for heterogeneity

Cox et al, Nature Genetics (2007)
Table 1. Summary odds ratios and 95% confidence intervals for nine polymorphisms and breast cancer risk

SNP                            No. of    No. of    No. of    MAF    Between-study    Test for       Trend
                               studies   ctrls     cases            heterogeneity    assoc.         test
ADH1B 3' UTR A>G rs1042026     9         15,570    11,391    0.29   0.35             0.044          0.54
CASP8 D302H rs1045485          14        17,109    16,423    0.13   0.97             5.7 × 10^-7    1.1 × 10^-7
CDKN1A S31R rs1801270          15        22,670    18,290    0.07   0.009            0.55           0.28
ICAM5 V301I rs1056538          15        22,229    17,687    0.39   0.58             0.58           0.78
IGFBP3 -202C>A rs2854744       10        17,926    13,101    0.45   0.72             0.051          0.046

SNP                            Analysis model    Heterozygote OR (95% c.i.)   Rare homozygote OR (95% c.i.)
ADH1B 3' UTR A>G rs1042026     Fixed effects     0.94 (0.89, 1.00)            1.04 (0.95, 1.14)
                               Random effects    0.99 (0.90, 1.10)            1.04 (0.95, 1.14)
CASP8 D302H rs1045485          Fixed effects     0.89 (0.85, 0.94)            0.74 (0.62, 0.87)
                               Random effects    0.89 (0.85, 0.94)            0.73 (0.60, 0.90)
CDKN1A S31R rs1801270          Fixed effects     1.03 (0.97, 1.10)            1.07 (0.86, 1.33)c
                               Random effects    1.04 (0.93, 1.09)            1.20 (0.82, 1.76)c
ICAM5 V301I rs1056538          Fixed effects     1.02 (0.98, 1.07)            1.00 (0.94, 1.06)
                               Random effects    1.02 (0.97, 1.08)            0.99 (0.93, 1.06)
IGFBP3 -202C>A rs2854744       Fixed effects     1.00 (0.94, 1.05)            0.93 (0.87, 0.99)
                               Random effects    1.00 (0.94, 1.05)            0.92 (0.86, 0.99)

Meta Association – Related individuals
• But what if some study individuals (cases or controls) are related in multi-study collaborations?
  ..sibships, trios, pedigrees, or mixed
• In families, meta-analysis of data from multiple sites is more difficult..
Genie to the rescue..

Genie overview
• Allen-Brady et al. (2006), Curtin et al. (2007)
• Simulation-based technique
  – Monte Carlo approach
  – Null distribution is simulated for the statistic of interest, matching the pedigree structure
    • Equivalent to an empirical version of the variance correction method with prior probabilities
• Flexible in type of statistic that can be analyzed
  – Classical association statistics and effect measure (OR)
  – Meta association statistics (fixed-effects approach)
• Dichotomous and quantitative traits
http://www-genepi.med.utah.edu/Genie/index.html

Genie: Empirical null
• Generate the empirical null
• Using appropriate allele frequencies, perform a gene-drop through the pedigree
  – Null genotypic configuration
• Calculate the statistic of interest using the null data, ignoring relatedness
  – Null statistic
• Repeat thousands of times
  – Empirical estimate of the null distribution
• Assess the significance of the observed statistic by assessing where it lies in the null distribution

Creating the Simulated Null Distribution
Population allele frequencies → Assign alleles randomly to pedigree founders → Gene drop: simulated Mendelian inheritance → Null genotype configuration → Calculate NULL statistic → (Repeat) → Empirical null distribution

Meta Statistics PedGenie Can Perform
Dichotomous data                                      Meta
Chi-square test (genotype- or allele-based)           CMH Chi-Square
Odds ratio test (genotype or allele data)             Meta Odds Ratio (95% CI determined from the empirical distribution)
Chi-square trend test (genotype- or allele-based)     CMH Chi-Sq Trend

Genie Meta-association
• Fixed effects approach
  – assumes same genetic effect size across studies
• Generalized CMH approach
  – chi-square general association test of independence, extension to >2x2 tables across multiple studies
• CMH chi-square test of trend
  – mean score statistic where ordered genotypes
(i.e. genotypes aa, aA, and AA) lie on an ordinal scale
• Meta ORs
  – M-H common odds ratio estimate for 2x2 tables (CT vs CC, TT vs CC)
  – 95% CI estimated empirically

Empirical 95% Confidence Interval
[Figure: distribution of OR estimates from 1,000 configurations in PedGenie null]

Why Genie Meta-association?
• Ability to combine family-based and independent case-control resources and use all available data
  – Genie software corrects for relationships in family-based resources; all family members with phenotype and genotype data can be included
  – increases the utility of pedigrees previously ascertained for linkage and can provide increased power to detect associations..
    ..particularly in stratified and subset analyses that may lead to small sample sizes in individual studies
  – needs a logistic regression framework (underway)

XRCC2 SNP            Genotype        Controls   Cases    Meta OR   95% CI (Lower, Upper)   p*
rs3218536 (R188H)    GG              1,167      1,014    1         -Ref
                     GA              204        185      1.0       (0.8, 1.3)
                     AA              9          10       1.3       (0.4, 3.8)              0.87
rs3218501            CC              1,296      1,117    1         -Ref
                     CG/GG           100        84       1.0       (0.7, 1.4)              0.93
rs3218499            GG              823        712      1         -Ref
                     GC              504        414      1.0       (0.8, 1.1)
                     CC              65         84       1.6       (1.1, 2.2)              0.23
                     CC vs. GC/GG                        1.6       (1.1, 2.2)              0.009
rs3218472            CC              1,355      1,190    1         -Ref
                     CT/TT           6          2        0.4       (0.0, 2.7)              0.22
rs3218454            AA              1,140      999      1         -Ref
                     AT              206        167      0.9       (0.7, 1.2)
                     TT              20         23       1.3       (0.5, 3.4)              0.89
rs3218418            GG              1,245      1,103    1         -Ref
                     GA/AA           139        107      0.8       (0.6, 1.1)              0.23
rs3218402            AA              1,302      1,130    1         -Ref
                     AG/GG           70         75       1.2       (0.8, 1.7)              0.35
rs3218400            CC              1,068      974      1         -Ref
                     CA              275        210      0.8       (0.7, 1.0)
                     AA              11         16       1.6       (0.7, 3.7)              0.36
rs3218395            CC              1,255      1,081    1         -Ref
                     CT/TT           137        129      1.1       (0.8, 1.4)              0.51
rs3218385            TT              1,226      1,052    1         -Ref
                     TG/GG           144        138      1.1       (0.8, 1.4)              0.57
rs3218374            CC              411        370      1         -Ref
                     CG              677        559      0.9       (0.8, 1.1)
                     GG              299        269      1.0       (0.8, 1.3)              0.94
rs3218373            GG              1,119      983      1         -Ref
                     GT              232        193      0.9       (0.8, 1.2)
                     TT              15         16       1.2       (0.5, 2.9)              0.79
Association of XRCC2 tag-SNPs with CRC in 4-study meta analysis (Curtin et al.
CEBP 2009)
*Empirical Cochran-Mantel-Haenszel χ2 test for trend or recessive model based on 10,000 simulations.

Association of XRCC2 rs3218499G>C with CRC in 4-study meta analysis
CC vs. GC/GG        Meta OR   95% CI (Lower, Upper)   p*
Overall             1.6       (1.1, 2.2)              0.009
Tumor site
  Proximal colon    1.2       (0.8, 2.0)              0.44
  Distal colon      1.1       (0.7, 1.8)              0.63
  Rectal            2.1       (1.3, 3.2)              0.0006
  Female Rectal     3.1       (1.6, 6.1)              0.0006
Men                 1.3       (0.8, 2.1)              0.24
Women               2.0       (1.2, 3.4)              0.01
Age onset < 60      1.4       (0.6, 3.1)              0.38
Age onset ≥ 60      1.6       (1.1, 2.3)              0.03
Family history
  No                1.7       (1.1, 2.6)              0.02
  Yes               1.8       (0.6, 5.7)              0.17
*Empirical Cochran-Mantel-Haenszel χ2 test for recessive model based on 10,000 simulations.

Genomewide (case-control) Association

GWA: an approach to the study of common diseases
• Complex architecture
  – Multiple genes likely involved
  – Multiple environmental factors
  – Individually low risks
• Argument that the underlying variants may be common and of modest effect..
  – Common variants (allele frequency >0.05, >0.01)
  – Not under intense negative selection
• Agnostic.. no hypothesis
  – Hypothesis generating vs. hypothesis driven (candidate gene or pathway)

GWA: What is required?
• Large set of SNPs
• Stringent significance thresholds
  – ~5 × 10^-8
• Large case-control sample size
  – Example
    • Allele frequency 0.15
    • OR = 1.25
    • 80% power
    • 6,000 cases and 6,000 controls

Large set of SNPs
• Linkage-disequilibrium (LD)-based
  – Genomewide tag-SNP set
  – Made possible by HAPMAP
  – 500,000-1,000,000 SNPs
  – High-density arrays with 2 million SNPs
• Not optimal for rare variants…
  – tag-SNP methods ignore them

Stringent significance thresholds
• Very few ‘hits’ per study
  – 1, 3, 4, 5 significant hits per genome using GWA
  – If we don’t correct and use nominal 0.05
    • In 500,000 markers
    • Can expect 25,000 false positives
  – Need to use a correction for multiple testing
    • significance threshold of ~5 × 10^-8 (Dudbridge & Koeleman ASHG 2004)
• Good… but not great
  – But we’re expecting many more genes to be found… right?
    • Less stringency and instead use replication?

Multistage strategies in GWA
[Figure from Hirschhorn & Daly Nature Reviews 2005]

Interactions
• An increased (or decreased) effect of one exposure given another.
• Gene-environment interaction
  – Risk (genotype AA / no smoke) = 4
  – Risk (genotype AA / smoke) = 6
• Gene-gene interaction
  – Epistasis
  – Risk (genotype AA / genotype bb) = 4
  – Risk (genotype AA / genotype Bb,BB) = 6

Statistical Interactions
• Multiplicative model
  – Most commonly used
  – Natural to a risk framework
    • Logistic regression
  – Independent loci
    • multiply risk: OR11 = OR10 × OR01
  – Interaction
    • OR11 ≠ OR10 × OR01

Multiplicative model
• Multiplicative risk for alleles at each locus
• First locus
  – aa 1.00
  – aA 2.20
  – AA 4.84 (= 2.20 × 2.20)
• Second locus
  – bb 1.00
  – bB 1.50
  – BB 2.25 (= 1.50 × 1.50)

Risk: Two Independent Loci (multiplicative)
OR11 = OR10 × OR01
       aa      aA      AA
bb     1.00    2.20    4.84
bB     1.50    3.30    7.26
BB     2.25    4.95    10.89

Statistical Interactions
• Additive model
  – Less popular
  – Independent loci
    • Add risks
    • OR11 = 1 + (OR10-1) + (OR01-1)
  – Interaction
    • OR11 ≠ 1 + (OR10-1) + (OR01-1)

Additive model
• Additive risk for alleles at each single locus
• First locus
  – aa 1.00
  – aA 2.20
  – AA 3.40 (= 2 × 2.20 - 1)
• Second locus
  – bb 1.00
  – bB 1.50
  – BB 2.00 (= 2 × 1.50 - 1)

Risk: Two Independent Loci (additive)
OR11 = 1 + (OR10-1) + (OR01-1)
       aa      aA      AA
bb     1.00    2.20    3.40
bB     1.50    2.70    3.90
BB     2.00    3.20    4.40

No main effects???
• No main effects
• Only interaction effects
• Problem:
  – In a stepwise procedure, if you aren’t able to identify the main effects, then how do you know to test the interaction??
• HOWEVER… Thus far, no biological model has been put forth that supports the lack of main effects

Case-control design: ORs
• Testing in the Odds Ratio framework
• H0: OR11 = OR10 × OR01
• H0: IOR11 = 1.0

Risk: Multiplicative
       aa      aA      AA
bb     1.00    OR01    OR02
bB     OR10    OR11    OR12
BB     OR20    OR21    OR22

Case-control design: ORs
• $\mathrm{IOR}_{11} = \frac{\mathrm{OR}_{11}}{\mathrm{OR}_{10} \times \mathrm{OR}_{01}}$
• Under the null, IOR11 = 1
• Can do several IORs
  – 11, 12, 21 and 22
• Can construct confidence intervals to test for a significant interaction

Case-control design: logistic regression
$\operatorname{logit} P(Y=1 \mid G_1, G_2) = \alpha + \beta_1 G_1 + \beta_2 G_2 + \beta_3 G_1 \times G_2$
• Parameter $\beta_3$ is an estimator for ln(IOR) under a multiplicative model
• G1 and G2 can be modeled several ways
  – Dominant
  – Recessive
  – Additive
  – 3 levels

Methods: MDR
• Multifactor-Dimensionality Reduction (MDR)
  – Ritchie et al (2001) Am J Hum Genet
  – Combinatorial partitioning
  – Data mining
  – http://www.epistasis.org/software.html

MDR
• Divide sample into 10 equal partitions
  – Model on 9/10 (1…9)
  – Test on 1/10 of data (10)
  – Repeat 10 times and average the misclassification
• Pick n loci from the total N SNPs
  – Exhaustively assess all combinations
  – All cells cases > controls (high-risk)
  – All cells cases < controls (low-risk)
  – Group
• Repeat for all possible n of N
• May be too many… doesn’t scale well

Machine learning
• Machine learning
  – Classification trees (e.g. CART)
    • Greedy algorithms
    • Not optimal
    • Cook et al (2004) Stat Med
  – Artificial Neural Networks (ANNs)
    • GPNN software
    • Motsinger et al (2006) BMC Bioinformatics
  – Support Vector Machine Approach
• Combinatorial optimization techniques
  – Local search
  – Genetic algorithms
    • Weng et al (2007) Genet Epidemiol

Other approaches
• Logistic regression framework
  – tagSNPs and powerful models for epistasis
  – Chapman and Clayton (2007) Genet Epidemiol
• Case-control
  – Haplotype interactions
  – FAMHAP
  – Becker et al (2005) Genet Epidemiol

Thanks for your time..
..questions?
Additional slides (if needed)

Quantitative traits
• Simple comparisons
  – 2 groups (e.g. alleles, dominant)
    • Normal test: large sample sizes
    • T-test: small sample sizes
    • Mann-Whitney: non-parametric
  – >2 groups (e.g. genotypes)
    • ANOVA (F-test)
    • Kruskal-Wallis: non-parametric
• Including covariates
  – Linear regression: $y = \alpha + \beta x$
  – Again, need to model genetic exposure

Family-Based Methods
• Parent-Offspring Trios
  – Haplotype Relative Risk (HRR)
  – Transmission/Disequilibrium Test (TDT)
  – Quantitative TDT (QTDT)
  – Generalized Estimating Equations (GEE)
• Nuclear Families
  – Sibling TDT (STDT)
  – FBAT
  – QTDT
  – GEEs

Family-Based Methods
• General Pedigrees (small to moderate size)
  – PDT
  – FBAT
  – QTDT
  – Variance correction (posterior probability)
  – CCREL
• Extended Pedigrees
  – Variance correction (prior probability)
  – Quasi-Likelihood Score (QLS)
  – PedGenie

Transmission/Disequilibrium Test (TDT)
• Transmission method
• Spielman et al (1993)
• Trio method
  – Requires genotype data on all three individuals
• The statistic considers only {parent, affected-offspring} pairs from the trio for which the parent is heterozygous
  – Compare the number of times each of the different alleles is transmitted to the affected offspring
  – Is there evidence for preferential transmission of one allele over the other?

TDT: Validity
• H0: $\delta(1 - 2\theta) = 0$, where $\delta$ measures association (LD) and $\theta$ is the recombination fraction
• A test for both association and linkage
• Robust to stratification

TDT
Parents CT × CC, offspring CT
• One heterozygous parent
• Transmits T to offspring

TDT
Parents CT × CC, offspring CC
• One heterozygous parent
• Transmits C to offspring

TDT
Parents CT × CT, offspring CT
• Two heterozygous parents
• One parent: transmits C to offspring
• Other parent: transmits T to offspring

TDT
Parents CC × CC, offspring CC
• No heterozygous parents
• No data to record

TDT: Tabulation
                       Allele NOT transmitted
Allele transmitted     C      T
C                      a      b
T                      c      d
$\chi^2_{tdt} = \frac{(b - c)^2}{b + c} \sim \chi^2_1$
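
The TDT statistic tabulated above uses only the two discordant cells, b (C transmitted, T not transmitted) and c (T transmitted, C not transmitted). A minimal Python sketch follows; the function name `tdt` and the example counts are hypothetical and chosen only to show the arithmetic, and the p-value uses the standard identity for the upper tail of a chi-square distribution with 1 df.

```python
import math

def tdt(b, c):
    """Return (chi-square, p-value) for the TDT.

    b: heterozygous parents who transmitted C and not T
    c: heterozygous parents who transmitted T and not C
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = tdt(b=60, c=40)  # made-up counts, for illustration only
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # chi2 = 4.00, p = 0.0455
```

Cells a and d (homozygous parents) never enter the calculation, which is why trios without a heterozygous parent contribute no data.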