Exporting and importing Stata genotype data to and from PHASE and HaploView

UK Stata Users Group Meeting 2009
September 10-11, 2009

John Charles “Chuck” Huber Jr, PhD
Assistant Professor of Biostatistics
Department of Epidemiology and Biostatistics
School of Rural Public Health
Texas A&M Health Science Center
jchuber@tamu.edu

Motivation
Many rapidly growing areas of research rely on multiple specialty “boutique” computer programs to conduct highly specialized analyses. The Stata user is faced with two choices:
1. Write new Stata commands that do the same analyses
2. Write Stata commands that efficiently export and import data for these “boutique” programs

Outline
1. Genetic Data Analysis using Stata
2. Genetics Background
3. The “file” commands in Stata
4. The phasein and phaseout commands
5. The haploviewout command
6. Summary

Stata for Genetic Data Analysis
2007 UK Stata Users Group meeting: http://www.stata.com/meeting/13uk/
A brief introduction to genetic epidemiology using Stata
Neil Shephard, University of Sheffield
An overview of using Stata to perform candidate gene association analysis will be presented. Areas covered will include data manipulation, Hardy–Weinberg equilibrium, calculating and plotting linkage disequilibrium, estimating haplotypes, and interfacing with external programs.

User Written Genetics Commands
Programs written by David Clayton
• ginsheet - Read genotype data from text files.
• gloci - Make a list of loci.
• greshape - Reshape a file containing genotypes to a file of alleles.
• gtab - Tabulate allele frequencies within genotypes and generate indicators (performs Hardy-Weinberg Equilibrium testing).
• gtype - Create a single genotype variable from two allele variables.
• htype - Create a haplotype variable from allele variables.
• mltdt - Multiple locus TDT for haplotype tagging SNPs (htSNPs).
• origin - Analysis of parental origin effect in TDT trios.
• pseudocc - Create a pseudo-case-control study from case-parent trios.
• pscc - Experimental version of pseudocc in which there may be several groups of linked loci.
• pwld - Pairwise linkage disequilibrium measures.
• rclogit - Conditional logistic regression with robust standard errors.
• snp2hap - Infer haplotypes of 2-locus SNP markers.
• tdt - Classical TDT test.
• trios - Tabulate genotypes of parent-offspring trios.

Programs written by Adrian Mander
• gipf - Graphical representation of log-linear models.
• hapipf - Haplotype frequency estimation using an EM algorithm and log-linear modelling.
• pedread - Reads a pedigree data file (in pre-Makeped LINKAGE format), similar to ginsheet.
• pedsumm - Summarises a pre-Makeped LINKAGE file that is currently in Stata's memory.
• pedraw - Draws one pedigree in the graphics window.
• plotmatrix - Produces LD heatmaps that display the strength of LD between markers.
• profhap - Calculates profile likelihood confidence intervals for results from hapipf.
• swblock - A step-wise hapipf routine to identify the parsimonious model that describes the haplotype block pattern.
• qhapipf - Analysis of quantitative traits using regression and log-linear modelling when phase is unknown.
• hapblock - Attempts to find the edge of areas containing high LD within a set of loci.

Programs written by Mario Cleves
• gencc - Genetic case-control tests.
• genhw - Hardy-Weinberg Equilibrium tests.
• qtlsnp - A program for testing associations between SNPs and a quantitative trait.

Programs written by Catherine Saunders
• co_power - Power calculations for Case-only study designs.
• gei_matching
• geipower - Power calculations for Gene-Environment interactions.
• ggipower - Power calculations for Gene-Gene interactions.
• tdt_geipower - Power calculations for Gene-Environment interactions via TDT analysis.
• tdt_ggipower - Power calculations for Gene-Gene interactions via TDT analysis.
Programs written by Neil Shephard
• genass - Performs a number of statistical tests on your genotypic data and collates the results into a Stata formatted data set for browsing.

The Structure of DNA
[Figure: Watson et al. (2004) pg 23, Figure 2.5]
[Figure: Hartl & Jones (1998) pg 9, Figure 1.5]

What is a SNP?
• A SNP is a single nucleotide polymorphism (the individual nucleotides are called alleles)

Person 1 – Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat

SNP1 is the 8th nucleotide shown (g or c); SNP2 is the 35th (c or a).

Allelic Association
• Simple 2x2 table
• One table per SNP
• Compute a simple chi-squared statistic or odds ratio for each SNP

            SNP1 Allele
            g      c
Case        250    750
Control     650    350

Genotypic Association
• Compute chi-squared tests
• Allows testing of various disease models (dominant, recessive, additive)

            SNP1 Genotype
            gg     gc     cc
Case        100    250    150
Control     300    150    50

What is a Haplotype?
• A haplotype is the combination of one or more alleles found on the same chromosome
  – Person 1 has a “gc” haplotype and a “ca” haplotype
  – Person 2 has a “cc” haplotype and a “ga” haplotype

Person 1 – Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat

Haplotypic Association
• Compute chi-squared tests
• Two SNPs with alleles a/g and c/t respectively

            SNP1:SNP2 Haplotype
            a:c    a:t    g:c    g:t
Case        100    250    75     75
Control     300    100    50     50

Why are haplotypes important?
[Photo: 2009 Oxford and Cambridge Boat Race, http://www.theboatrace.org/gallery/2009?page=7#]

Why are haplotypes important?
                SNP1        SNP2   SNP3    SNP4      SNP5
                President   VP     State   Defense   Treasury
Chromosome R
Chromosome D

Why are haplotypes important?
Rearranging the members of each “chromosome” could have a profound effect!

Why are haplotypes important?
[Figure: Hartl & Jones (1998) pg 18, Figure 1.13]
[Figure: Watson et al. (2004) pg 29, Box 2-2]

The PHASE Program
• Unfortunately, haplotypes are not observed directly using modern, high-throughput lab techniques
• We observe genotypes and must infer the haplotype structure using algorithms
• PHASE is a very popular program for inferring haplotypes from many SNPs simultaneously (Stephens, Smith & Donnelly, 2001)

The phaseout Command
I need to get my data from here (raw genotype data in Stata) to here (the input file format for PHASE).
[Screenshot: raw genotype data in Stata]
[Screenshot: input file format for PHASE]

The “file” commands in Stata
Using “file open”, “file write” and “file close”

    file open Example1 using "ExampleFile.txt", write replace
    file write Example1 "Hello World" _newline(1)
    file write Example1 "Why so blue?" _newline(1)
    file close Example1

The “file” commands in Stata
Using “file open”, “file read” and “file close”

    . file open Example2 using "ExampleFile.txt", read
    . file read Example2 Line1
    . file read Example2 Line2
    . file close Example2
    . disp "Line1: `Line1'"
    Line1: Hello World
    . disp "Line2: `Line2'"
    Line2: Why so blue?
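The same “file” commands can be looped over observations to write a genotype file line by line. The block below is only a minimal sketch of that idea, not the phaseout source code. The toy dataset, the variable names (snp1_a1, snp1_a2, snp2_a1, snp2_a2), the file handle phase, and the file name toy.inp are all hypothetical. The layout written here (number of individuals, number of loci, a “P” line of positions, a line of locus types, then an ID line and two allele lines per individual) is an assumption about the PHASE input format and should be checked against the PHASE documentation before use.

```stata
* Hypothetical toy data: two allele variables per SNP, one row per person
clear
input str4 id str1 snp1_a1 str1 snp1_a2 str1 snp2_a1 str1 snp2_a2
"P001" "g" "c" "c" "a"
"P002" "c" "g" "c" "a"
end

* Header block (assumed layout): individuals, loci, positions, locus types
file open phase using "toy.inp", write text replace
file write phase "`=_N'" _newline
file write phase "2" _newline
file write phase "P 674 836" _newline
file write phase "SS" _newline

* One ID line followed by two allele lines for each individual
forvalues i = 1/`=_N' {
    file write phase (id[`i']) _newline
    file write phase (snp1_a1[`i']) " " (snp2_a1[`i']) _newline
    file write phase (snp1_a2[`i']) " " (snp2_a2[`i']) _newline
}
file close phase

* Display the file that was just written
type "toy.inp"
```

Writing the header once and then looping with forvalues keeps the export independent of the number of individuals; a more general version would also loop over a list of SNP variables instead of naming them one by one.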
The phaseout Command
Syntax for phaseout

    phaseout SNPlist, idvariable(string) filename(string) [missing(string) separator(string) positions(string)]

Example

    local SNPList "rs1413711 rs3024987 rs3024989"
    local PositionsList "674 836 1955"
    phaseout `SNPList', idvariable("id") filename("VEGF.inp") missing("X/X 9/9") positions(`PositionsList') separator("/")

The phasein Command
[Screenshot: output file format from PHASE]

Syntax for phasein

    phasein PhaseOutputFile [, markers(string) positions(string)]

Example

    phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")

The HaploView Program
• Once we have inferred our haplotypes, we can conduct further association analyses using the full complement of Stata commands.
• We might also want to explore our data in the popular program HaploView (Barrett et al., 2005)

The haploviewout Command
Syntax for haploviewout

    haploviewout SNPlist, idvariable(string) filename(string) [positions(string)] [familyid(string)] [poslabel]

Example

    local MarkerList "rs1413711 rs3024987 rs3024989"
    haploviewout `MarkerList', idvariable(id) filename("VEGF") poslabel

Summary
Compared to recreating “boutique” programs in Stata, it is relatively easy to create programs for exporting and importing data.

Acknowledgements
• Grant 1-R01DK073618-02 from the National Institute of Diabetes and Digestive and Kidney Diseases
• Grant 2006-35205-16715 from the United States Department of Agriculture
• Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M College of Veterinary Medicine
• Roger Newson of Imperial College London

References
• Barrett, J., Fry, B., Maller, J., & Daly, M. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.
• Hartl, D. L., & Jones, E. W. (1998). Genetics: Principles and Analysis, 4th Ed. Jones & Bartlett Publishers.
• Stephens, M., & Donnelly, P. (2003). A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data. American Journal of Human Genetics, 73, 1162–1169.
• Stephens, M., Smith, N. J., & Donnelly, P. (2001). A New Statistical Method for Haplotype Reconstruction from Population Data. American Journal of Human Genetics, 68, 978–989.
• Watson, J. D., Baker, T. A., Bell, S. P., Gann, A., Levine, M., & Losick, R. (2004). Molecular Biology of the Gene, 5th Ed. Benjamin Cummings.