Integrating Genetic and Biomarker Data with Social Science Research: Genetics

Jason Fletcher
Assistant Professor, Health Policy and Administration, Yale University
RWJ Health and Society Scholar, Columbia University

Goals
- Introduce some terminology
  - Requires multiple exposures
- Focus
  - Limitations: What findings from genetics should you believe?
  - Opportunities: How might social scientists use genetic data?
    - Advances in both genetics and social science

Data Opportunities
Currently available (DNA data)
- Add Health: national longitudinal sample, 15K, ages 12-30, siblings, school friends, focus on health
- Fragile Families: national longitudinal sample, 5K, mothers and children, lower income/immigrant samples
- Wisconsin Longitudinal Study: 1957 HS grads and sibs, long follow-up
- Framingham Heart Study: medical focus, multigenerational study
- Many international datasets
Eventually available(?)
- Health and Retirement Study: national longitudinal study, ages 50+, spouses, health and aging
- Panel Study of Income Dynamics: national longitudinal study, multigenerational families/all ages, income and labor market, health
- National Longitudinal Survey of Youth: national longitudinal study, labor market focus, multigenerational, siblings

Outline
- Background
  - Behavioral genetics (non-molecular)
  - Molecular genetics
- Integration with Social Science
  - Gene X Environment interactions
  - Instrumental variables

Behavioral genetics
- Family based/twin studies
- No DNA data
- Decomposition of variance of outcomes into three components
  - A = Heritability
  - C = Shared/Common environment
  - E = Unshared/Unique environment
- Heritability estimates (h2)
  - Comparison of correlation of MZ twins with DZ twins

The basic BG model
- Variation in phenotype (outcome/observable characteristic) is a function of variation in additive genetic (genotype) and environmental contributions (shared and unshared)

Classic twin design
$y = A + C + E$
$y_1 = A_1 + C_1 + E_1$
$y_2 = A_2 + C_2 + E_2$
- A = genotype; C = common environment; E = unique environment
- Identical/Monozygotic (MZ) twins share 100% of genetic makeup
- Fraternal/Dizygotic (DZ) twins share ~50% of genetic makeup
- Equal environments assumption

Classic twin design
$r_{mz} = A + C$
$r_{dz} = \tfrac{1}{2}A + C$
$A = 2(r_{mz} - r_{dz})$
$C = r_{mz} - A = 2r_{dz} - r_{mz}$
$E = 1 - r_{mz}$

Typical estimates
- Example from literature
- Credibility test? Schonemann 1997

Example from literature: Boardman et al (2008)
- Regression based approach
  $y_{1j} = \beta_0 + \beta_1 y_{2j} + \beta_2 g_j + \beta_3 (g_j \cdot y_{2j}) + \varepsilon_j$
  - g is zygosity (0.5 for MZ, 0 for DZ); $\beta_3$ is the coefficient of interest
- Equal environments issue
  $y_{1j} = \beta_0 + \beta_1 y_{2j} + \beta_2 g_j + \beta_3 (g_j \cdot y_{2j}) + \gamma X_j + \varepsilon_j$
  - X: dressed the same, same room, same playmates
- Findings
  - Positive affect is highly heritable (~.6)
  - Controls decrease the heritability estimate to 0.3 for women but only to 0.52 for men
    - Socioeconomic Xs, emotional support, stress

Some key assumptions/issues
- Random mating of parents' generation
  - Assortative mating tends to deflate h2
- Equal environments assumption
  - Violations tend to inflate h2
- External validity to non-twins
- Gene-environment correlation
  - Inflates h2
- Gene-environment interaction
  - Subsumed in h2

2nd Design: Adoptions
- Correlation between two adopted siblings: C
- Correlation between two non-adopted siblings: 1/2A + C
- Assumes equal environments
- What about gene-environment correlations?
  - Selection of adoptees

Recent Innovation: Adoption Studies
- Sacerdote (2007)
  - Quasi-random assignment of adoptees
  - Gene-environment correlation
- h2
  - 41% for college graduation
  - 44% for education attainment
  - 33% for income
  - 5% for alcohol use
  - 27% for tobacco use
  - 17% for overweight status

Innovations: Twin Studies
- Random mating
  - Mating parameter in robustness checks
  - Genotype siblings in order to estimate assortative mating parameter
- Equal environments assumption
  - Use survey questions that measure shared environments
- Richer family level data
  - Cousins, siblings, parents, etc.

A Puzzle: Heritability vs. Measured Genetic Variation
- Large heritability estimates (~.3)
- Small measured variation using genetic data
- On the other hand…

Additional new directions
- Variation of h2 by study population
  - Gender, Race, Country, Time Period
- Can this tell us anything about gene x environment interactions?
  - Boardman and Blaylock in press

Quiz from Collegeboard.com
- If a person has a disorder with h2=1, then the person will suffer from the disorder
  - False, phenylketonuria (PKU) = 1 but mental retardation can be prevented through diet
- The heritability of having fingers on each hand is 1 or close to 1
  - False, it is close to zero because the source is often environmental
- Heritability and inherited are nearly the opposite in meaning
  - True; equalizing school environments will increase heritability of achievement
- The heritability of behaviors of identical twins is 1
  - False, it is zero
http://apcentral.collegeboard.com/apc/members/homepage/45829.html

Discussion/Questions
- What do we learn from h2 estimates?
- What are the policy implications of estimates?

Heritability estimates set no upper limit on the potential effect of reducing or eliminating variation in environmental factors that currently vary in response to genotype, as many do. Nor do they set an upper limit on the effect of creating new environments. Heritability estimates do set an upper limit on the effect of reducing or eliminating environmental variations that are independent of genotype, but other statistics usually provide even better estimates of these effects. There is no evidence that genetically based inequalities are harder to eliminate than other inequalities. Until we know how genes affect specific forms of behavior, heritability estimates will tell us almost nothing of importance (Jencks).

Molecular Genetics
- Describe a few concepts
- How do scientists/biologists/geneticists use genetic data?
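
Before moving on, the classic twin design from the behavioral genetics part of the talk can be made concrete. The following is a minimal sketch of that decomposition (A = 2(r_mz - r_dz), C = r_mz - A, E = 1 - r_mz). The function name `falconer_ace` and the twin correlations 0.6 and 0.4 are hypothetical values chosen only for illustration; they are not estimates from any study cited here.

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Classic twin-design variance decomposition.

    r_mz: phenotypic correlation between MZ twins (assumed r_mz = A + C)
    r_dz: phenotypic correlation between DZ twins (assumed r_dz = A/2 + C)
    Relies on the equal environments assumption and random mating.
    """
    a = 2.0 * (r_mz - r_dz)  # additive genetic share (h2)
    c = r_mz - a             # shared environment, equal to 2*r_dz - r_mz
    e = 1.0 - r_mz           # unique environment (plus measurement error)
    return {"A": a, "C": c, "E": e}


# Hypothetical twin correlations, for illustration only.
estimates = falconer_ace(r_mz=0.6, r_dz=0.4)
for component, share in estimates.items():
    print(f"{component}: {share:.2f}")
```

Because the three components are built from only two correlations, A + C + E equals 1 by construction, and nothing prevents C from turning negative when r_mz exceeds twice r_dz. This is one reason such estimates are best read with the assumptions listed earlier in mind.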
Sources: http://www.psych.umn.edu/courses/fall09/mcguem/psy5137/lectures.htm

Properties of Genetic Material
- Specify a code for protein synthesis (i.e., code for the sequence of amino acids in a polypeptide chain)
- Duplicate or replicate during both mitosis and meiosis

Deoxyribonucleic Acid (DNA)
- Double stranded
- Strands are held together by (hydrogen) bonds that form between the nucleotide bases of the DNA molecule
  - Adenine (A) <====> Thymine (T)
  - Guanine (G) <====> Cytosine (C)

DNA
- Length of human genome ~ 3,000,000,000 bases of DNA
  - 1 kilo base (kb) = 1000 bases
  - 1 mega base (Mb) = 1,000,000 bases
  - 1 giga base (Gb) = 1,000,000,000 bases
- Average protein has ~ 400 amino acids, requiring 1200 DNA bases or 1200 bp

Translation
- The basic informational unit is 3 nucleotide bases (called a codon).
- Each codon specifies a single amino acid.
- There are 4*4*4=64 possible sequences but only 20 possible amino acids.

Gene
- A sequence of DNA (a locus on a chromosome) that is involved in (“codes for”) the synthesis of a functional polypeptide (proteins consist of one or more polypeptides).
- “Modern Definition” (circa 2006): A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions

Relative Genome Size

Non-coding DNA
- ~98% of human DNA does not code directly for protein
- Pseudogenes (evolutionary relics)
- Repetitive DNA
  - Interspersed
  - Minisatellite repeats (10-30 bp)
  - Microsatellite repeats (< 10 bp)
- Regulatory regions
- VNTR

Gene Structure
- Typical gene is composed of multiple
  - exons: expressed sequences of DNA that are translated into protein
  - introns: intervening DNA sequences that are not translated

Genetic Variation
Genetic variation between individuals refers to differences in the DNA sequence
1. Originally arose through (gametic) mutation.
2. An estimated 99.8% - 99.9% of our DNA is common
3. But then .1% of 3,000,000,000 = 3 million differences

The Genetic Basis for Human Variation (derived from dbSNP release 128)
- Single Nucleotide Polymorphism (SNP)
  - Designation/Description: single base substitution involving A, T, C, or G
  - Example: A/T; COMT
  - Frequency: 11,883,685 (6,262,709)
- Deletion/Insertion Polymorphisms (DIPs; In/Del)
  - Designation/Description: designated using the full sequence of the insertion as one allele, and either a fully defined string for the variant allele or a “-” character to specify the deleted allele
  - Example: T/-CCTA/G
  - Frequency: 431,319
- Microsatellite or short tandem repeat (STR)
  - Designation/Description: alleles are designated by providing the repeat motif and the copy number for each allele
  - Example: 5-HTTLPR, DRD4, MAO-A
  - Frequency: ~10,000
- Copy Number Variants
  - Designation/Description: duplication, deletion of 1000 up to 1 Mb of DNA
  - Example: TBD
  - Frequency: 5% of genome
- Gross Chromosomal Variation
  - Designation/Description: contiguous gene deletion syndromes (> 1 Mb); aneuploidy
  - Example: VCFS
  - Frequency: small

Sources of Genetic Variation

Types of Genetic Variation: Chromosomal

Sickle-cell Mutation

Types of Genetic Variation
- Chromosomal/Structural: variations (or rearrangements) in the amount of genetic material inherited
- Polymorphisms: variations in the DNA sequence
  - SNPs (~10,000,000)
  - VNTR (STR, SSR)

Types of Genetic Variation: Variable Number of Tandem Repeats (VNTR)
- Microsatellite: small number of bases (<10) repeated a variable number of times (usually < 100) (>100,000)
- Huntington’s disease is an example of a microsatellite triplet repeat in a coding region

How do researchers link genetic variation to outcomes?
- Candidate gene examinations
  - Sometimes from animal models
  - Specifically examine a small number of polymorphisms and an outcome
  - Sometimes use family based designs
  - Replication
- Gene association studies/Genome wide association studies (GWAS)
  - Gene-finding exercise (atheoretical)
- Ex: Corder et al. Science 1993, p. 921-923
  - Alzheimer and APOE
- Ex 2: BRCA1 and Breast Cancer
  - Mutations thought to account for 45% of

GWAS Example
- Beauchamp et al. 2007
- 9,000 observations from Framingham data
- Phenotype: educational attainment
- 500,000 SNPs, 350,000 after cleaning
- Run 350,000 regressions
- Find several SNPs with very low p-values
  - MAPKAP2: inflammatory responses/neuronal death
  - IER3: regulation of cell death/impact on cognitive development

GWAS vs. Candidate Gene
- More powerful for low penetrance variants
- Better resolution, reduce region of interest
- Do not need to specify particular variants
- But…

Problems with studies
- Data mining (500K+ tests)
  - False positives
  - Use adjusted/tiny p-values
  - Replication
- Population stratification
  - Hamer (2002): markers for chopstick use
- Linkage disequilibrium
- Gene-environment interactions
- Gene-gene interactions

Integrating Genetics and Social Science
- Improve theory/empirics
  - Gene X Environment Interactions
  - New sources of variation

Why would social scientists care?
- GxE
  - Help understanding of individual level heterogeneity
  - Knowledge of biological pathways may aid the development of interventions
- Empirics
  - Reduce omitted variable bias in causal models
  - Additional source of (“natural”) variation
    - Instrumental variables analysis

Shanahan and Hofer: GxE typology
Social context (E) may
- serve as a trigger for those predisposed
- compensate for genetic predisposition
  - Avoidance of low levels of functioning through the absence of a stressor or provision of enriched setting
- act as a control to prevent behaviors for which there is a genetic predisposition
  - Focus on social norms and structural constraints that are placed on people to limit choice
- enhance adaptation through proximal processes

Shanahan and Hofer: Promise of GxE
- Conclusive evidence from animal models
- Genetic main effects have been elusive and small
  - But we think that “genes matter”
- Gene-environment correlations likely will not completely explain variation in outcomes
- Emerging human evidence

Gene x Environment
Caspi et al., Science 2002
- Why do some maltreated children grow up to develop antisocial behaviors and others do not?
- MAOA gene (encodes neurotransmitter-metabolizing enzyme)
  - Animal model evidence
- Found to moderate the effect of childhood maltreatment

Example: Guo et al. 2008 AJS
- Outcome: number of sexual partners
- Gene: DAT (dopamine transporter)
- Environment:
  - School-level norms: % of kids who have sex early
  - Cognitive ability

Forthcoming, Journal of Politics

Issues with GxE studies
- Non-replication
- Theory
- Measurement of Environment
  - Endogenous vs. Exogenous
- Power
  - Data
  - Environmental variation
- Other issues: Shanahan and Hofer
  - Static vs. dynamics
  - Multifaceted nature of E
  - Mediating mechanisms
  - Simple statistical models

New Example: Conley and Rauscher 2010
- Focus on exogenous “E” in GxE
- Outcome: academic outcomes
- G: several markers in Add Health data
- E: birth weight variation
  - Key: within MZ twins, birth weight differences are “random”
- Examine effects of “types” of twin pairs based on genetic marker variation

Extending GxE studies
- Most GxE studies use measures of environment that are non-randomly distributed (e.g. parental divorce)
  - Potential for gene-environment correlation (rGE)
  - Does G affect response to E or affect risk of exposure?
  - Possibility of GxG interactions
- Instead use variation in public policies (e.g. tobacco taxes) in order to:
  - increase understanding of the biological factors affecting decisions and
  - increase understanding of the heterogeneity of responses to environmental (e.g. policy) factors

Example of idea: (Fletcher, unpublished data, 2010)
[Bar charts of Number of Cigarettes by tax level (High Tax, Low Tax), by genotype (Homozygous, Heterozygous), and by tax level and genotype (High Tax/Homozygous, High Tax/Heterozygous, Low Tax/Homozygous, Low Tax/Heterozygous). Plotted values in extraction order: 6.34, 6.40, 8.08, 7.28, 6.98, 9.38, 7.44, 10.80.]
- Sample: White males in the Add Health data
- Environmental risk: low tobacco tax vs. high tobacco tax state
- Genetic risk variant: homozygous (60%) or heterozygous (40%) for the CHRNA6 SNP
  - Codes for the α6 subunit found in certain types of nicotinic acetylcholine receptors found primarily in the brain.
  - Neural nicotinic acetylcholine receptors containing α6 subunits are expressed on dopamine-releasing neurons in the midbrain, and dopamine release following activation of these neurons is thought to be involved in the addictive properties of nicotine.

Discussion/Questions
- Emerging nationally representative datasets with DNA data allow substantial variation in “E” (Add Health, Fragile Families…)
- Leverage empirical strategies from the social sciences for plausibly exogenous “E”
- Need to team up with biological sciences for measurement of “G”
  - They need us, too

Genetics and Social Science II: New Variation for Causal Inference
Genetic Lotteries within Families

Jason M. Fletcher, Yale University
Steven F. Lehrer, Queen’s University

Motivation
- Tremendous advances in research that links molecular genetic markers to health outcomes
- How might social scientists (economists) leverage new knowledge to advance our own research?
  - Example: links between health and schooling
  - Or: income, socioeconomic status, occupation, labor force participation, marital status…
- Idea: use sibling differences in genetic inheritance as an “experiment in nature” in order to trace through the effects of poor health on schooling

Empirical Example
$\text{Testscore} = \beta_0 + \beta_1 \text{Health} + \beta_2 X + \varepsilon$
Issues:
- Health and the error term may be correlated
- Reverse causality
  $\text{Health} = \gamma_0 + \gamma_1 S + \gamma_2 X + \nu$
• Would like experimental variation in health, uncorrelated with epsilon
• Instrumental variable

A Start: Mendelian Randomization
- Definition
  - Random assortment of genes from parents to offspring that occurs during gamete formation and conception (Smith and Ebrahim IJE 2003)
- Used in a growing number of studies
- Strengths
  - Not generally susceptible to reverse causation
  - Scientific basis for link

Instrumental variables
$\text{Testscore} = \beta_0 + \beta_1 \text{Health} + \beta_2 X + \varepsilon$
$\text{Health} = \pi_0 + \theta\,\text{SNP} + \pi_1 X + \nu$
• Requirements for instruments
  • Related to the potentially problematic variable (theta is not zero)
  • Unrelated to the error in the main equation (epsilon)

Example: Von Hinke et al. 2010
- Examines causal effects of child fat mass on academic achievement
- ALSPAC data (Avon, England); 12,000 kids followed from birth
- Instrument: FTO gene
  - 1-4 pound increase
  - R-square is <1%

Findings
- OLS: 1 SD increase in fat mass reduces achievement by 0.1 points

Building on Mendelian Randomization
- Limitations
  - Confounding by polymorphisms in linkage disequilibrium
  - Multiple phenotypic effects
  - Dynastic effects: not only does the child inherit the gene, but also the consequences of the gene from the previous generation
    - May invalidate the instrument, if correlated with epsilon
  - Risk of inheriting the polymorphism may not be random
- New Approach: “Genetic Lottery”
  - Eliminates dynastic effects

Genetic Lottery Approach
$\text{Testscore}_{if} = \beta_1 \text{Health}_{if} + \beta_2 X_{if} + \mu_f + \varepsilon_{if}$
$\text{Health}_{if} = \theta\,\text{Gene}_{if} + \pi_1 X_{if} + \eta_f + \nu_{if}$
- Combines a sibling fixed effect and instrumental variables strategy
- All common family factors are controlled
- Source of variation in health is from “experiment in nature”

Application: estimating causal effects of poor health on achievement
- National Longitudinal Study of Adolescent Health (Add Health)
  - School-based study of the health-related behaviors of 12 to 18 year old adolescents who were in grades 7 to 12 in 1994/5
- We use a specific subsample of the respondents for which DNA measures were collected, all of whom have multiple family members in the survey. In total we have information on 1684 individuals.
- The data contain a score on a common verbal test (Peabody Picture Vocabulary Test).

Health Summary Statistics
Variable     Full Sample     Sibling Sample   Twin Sample
AD           0.052 (0.221)   0.049 (0.215)    0.056 (0.229)
HD           0.049 (0.216)   0.052 (0.223)    0.043 (0.203)
ADHD         0.078 (0.268)   0.077 (0.266)    0.078 (0.268)
Depression   0.063 (0.243)   0.067 (0.251)    0.052 (0.223)
Obese        0.13 (0.338)    0.14 (0.349)     0.12 (0.319)
While the AD and HD subscale averages fell within standard ranges, roughly 8% of the sample has ADHD, which exceeds the 6% national average.
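
To illustrate why the genetic lottery approach combines sibling fixed effects with an instrument, here is a minimal simulation sketch. It is not the authors' estimation code and does not use Add Health data: the data-generating process, all parameter values (`beta`, `theta`, the number of families) and the helper names are assumptions made for illustration. Families differ in an unobserved effect that is correlated with the genotype (a stand-in for dynastic effects), and an individual-level shock affects both health and the test score (endogeneity).

```python
import numpy as np


def within_family(v, fam):
    """Sibling fixed-effect transform: subtract each family's mean."""
    means = np.bincount(fam, weights=v) / np.bincount(fam)
    return v - means[fam]


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)


def iv_slope(z, x, y):
    """Just-identified IV (Wald) estimate of the effect of x on y using z."""
    zc = z - z.mean()
    return (zc @ (y - y.mean())) / (zc @ (x - x.mean()))


def simulate(n_families=20000, beta=-0.5, theta=0.8, seed=0):
    """Hypothetical two-sibling families; beta is the true effect of health."""
    rng = np.random.default_rng(seed)
    fam = np.repeat(np.arange(n_families), 2)
    fam_effect = rng.normal(size=n_families)
    # Family-level allele probability depends on the family effect, so the
    # gene is correlated with family background ACROSS families ...
    p_allele = 1.0 / (1.0 + np.exp(-fam_effect))
    # ... but each sibling's draw is an independent lottery WITHIN a family.
    gene = rng.binomial(2, p_allele[fam]).astype(float)
    shock = rng.normal(size=fam.size)  # unobserved individual confounder
    health = theta * gene + fam_effect[fam] + shock + rng.normal(size=fam.size)
    score = (beta * health - fam_effect[fam] - shock
             + rng.normal(size=fam.size))
    return fam, gene, health, score


fam, gene, health, score = simulate()
g_w, h_w, s_w = (within_family(v, fam) for v in (gene, health, score))

estimates = {
    "ols": ols_slope(health, score),                 # confounded
    "sibling_fe": ols_slope(h_w, s_w),               # removes family effect only
    "iv": iv_slope(gene, health, score),             # gene tied to family effect
    "sibling_fe_iv": iv_slope(g_w, h_w, s_w),        # genetic lottery
}
for name, value in estimates.items():
    print(f"{name:>14}: {value: .3f}")
```

In this stylized setup only the combined estimator recovers the assumed true effect, because the within-family genotype difference is unrelated to both the family effect and the individual shock. Real applications face the additional problems listed under the limitations (weak instruments, multiple pathways, limited variation within families), which this sketch deliberately leaves out.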
Genetic markers
- Our dataset contains information on six markers
  - Dopamine receptor D2 (DRD2) gene
  - Dopamine D4 receptor (DRD4) gene
  - The dopamine transporter (DAT) and serotonin transporter (5HTT) genes code for proteins that lead to the reuptake of dopamine and serotonin, respectively.
  - Monoamine oxidase A (MAOA) gene, responsible for the degradation of dopamine, serotonin, and norepinephrine
  - CYP2A6 gene, which metabolizes tobacco, drugs and other toxins; it is primarily operative in the liver
- Evidence that interactions of markers could have powerful effects (gene-gene interactions)

Results

Additional analyses
- Tests support that sibling fixed effects are necessary
- Tests support that instrumental variable is necessary
  - Endogeneity
- Instrument invalidity (genes correlated with epsilon)
  - Confounding
  - Over-identification tests
  - Sensitivity to small to moderate violations of instrument validity (Conley et al. 2007)
  - Examination based on different samples

Limitations
- Incomplete understanding of links found in genetic research
  - Potential for multiple pathways
  - May invalidate instruments
- Data
  - Large samples of genotyped siblings
  - Limited variation within families
  - Weak instruments

Genetic lottery idea
- Genetic lotteries are an interesting research design for causal inference
- The impact of poor mental health outcomes on academic achievement is large (one s.d.). Why?
- Accounting for family fixed effects is needed but does not fully solve problems due to endogeneity of health.
- The presence of high comorbidity of adverse health outcomes is striking, which presents a challenge to properly control for them all.

Discussion/Questions

Technical take-home points: Strengths & weaknesses of alternative approaches
1. Candidate gene studies: focus on 1 candidate
   Advantages
   - Scientifically tractable: incremental & cross-validatable
   - Maximal statistical power (focused hypothesis)
   Disadvantages
   - Can only “discover” what we already know (i.e., biased)
2. Genome-wide association studies: focus on all candidates
   Advantages
   - Unbiased de novo discovery
   Disadvantages
   - Minimal statistical power, particularly for interactions
3. The bioinformatic “middle road”: focus on a small set of causally plausible candidates (unbiased search of regulatory and coding SNPs)
   Advantages
   - Scientifically tractable: “short leap of inference” & cross-validatable
   - Relatively high statistical power (focus on 1-10% of plausible SNPs)
   Disadvantages
   - Likely missing some true causal genetic influences
   - Bioinformatically intensive: thought (and programming) required

Take-home points for this group:
1. Gene-Environment interactions are likely far more…
   - ubiquitous
   - large in effect size
   - clinically/socially meaningful
   …than current genetic analyses presume. There is plenty left for you to find.
2. If you have the study you have (i.e., can’t alter sampling design), your major opportunities for increasing power/discovery involve:
   - focusing on substantive effects that are true/big (e.g., GxE, not G, given antagonistic pleiotropy; E, ExE, GxG, etc.)
   - modeling biological mechanisms to focus power/impose constraints (e.g., candidate systems, functional themes, regulatory themes)
   - combinatorial data-mining (e.g., machine learning in discovery sample)
   - sequential testing designs (low stringency discovery, med stringency test, high stringency confirm)
Your advantage is smart data analysis.

Follow-up references
Overview of genetics / biology
- Attia, J., et al. (2009) How to use an article about genetic association: A: Background concepts. JAMA, 301, 74-81.
Genetic association studies
- Hirschhorn, J., & Daly, M. (2005) Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6, 95-108.
- Attia, J., et al. (2009) How to use an article about genetic association: B: Are the results of the study valid? JAMA, 301, 191-197.
- Cordell, H., & Clayton, D. (2005) Genetic epidemiology 3: Genetic association studies. Lancet, 366, 1121-1131.
Basic statistical modeling for genetics
- Siegmund, D., & Yakir, B. (2007) The statistics of gene mapping. New York, Springer.
Sampling & statistical approaches for GxE discovery
- Thomas, D. (2010) Gene-environment-wide association studies: emerging approaches. Nature Reviews Genetics, 11, 259-272.
Statistical strategies for combinatorial discovery
- Hastie, T., Tibshirani, R., & Friedman, J. (2001) The elements of statistical learning. New York, Springer.

Acknowledgements
BG Model slides: http://www.psych.umn.edu/courses/fall09/mcguem/psy5137/lectures.htm