The R genetics package: Tools for statistical genetics

Gregory R. Warnes
Associate Director, NonClinical Statistics
Pfizer Global R&D, Groton CT

CT ASA Mini Conference: 2005-03-05

Outline

• Project Goals
    • Simplify Population Genetic Analysis
• Design Details
    • Extend R ‘Factor’ objects
• Functions Included
    • Genetic data: Importing & Creation, Manipulation, Information, Annotation, Transformation, Export
    • Statistical Functions: Hardy-Weinberg (Dis-)Equilibrium, Linkage Disequilibrium, Haplotype Imputation, Sample-size tools
• Simple Examples
    • Creating Genotype Objects
• Example Session
• Future Development
    • Emulate BioConductor Project
    • Large scale SNP analysis
    • Formal Object Class
    • Multi-team collaboration

Problem

At each genetic position within a gene, diploid cells have two alleles. This suggests storing each allele as a separate variable.

However, most laboratory methods cannot distinguish between A/B and B/A, yielding three observed genotypes at each position: (A/A), (A/B or B/A), (B/B). Consequently, the two observed alleles are confounded, which suggests using a single genotype variable.

Standard statistical packages do not handle this duality directly. As a consequence, the need to support both views adds complexity when manipulating genotype data or including it in statistical analyses.
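The duality described in the Problem statement can be illustrated with a small base R sketch. It does not use the genetics package: the helper name canonical_genotype and the sample alleles are hypothetical, chosen only to show why two allele columns distinguish A/B from B/A when the assay cannot, and how ordering the alleles within each individual gives a single genotype variable.

```r
# Two-allele view: each individual carries an arbitrarily ordered pair of alleles.
allele1 <- c("A", "A", "B", "B")
allele2 <- c("A", "B", "A", "B")

# Naively pasting the two columns treats A/B and B/A as different values,
# although the laboratory method cannot tell them apart.
naive <- paste(allele1, allele2, sep = "/")

# Hypothetical helper (not part of the genetics package): sort the two
# alleles within each individual so every unordered pair gets one label.
# Missing values are not handled in this sketch.
canonical_genotype <- function(a1, a2, sep = "/") {
  swap   <- a1 > a2                 # TRUE where the pair is out of order
  first  <- ifelse(swap, a2, a1)
  second <- ifelse(swap, a1, a2)
  paste(first, second, sep = sep)
}

genotype_view <- canonical_genotype(allele1, allele2)

print(table(naive))          # counts per ordered pair
print(table(genotype_view))  # counts per unordered genotype
```

The two allele vectors remain available for allele-level questions, while genotype_view is the single variable that matches what the laboratory actually observes.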

Initial Project Goals

Simplify Statistical Analysis using Genetic Data by providing:

• A genotype object class that appropriately captures the single variable / separate allele duality
• Methods to import and manipulate genotype objects without string manipulation
• Simple tools for including different ‘views’ of genotype variables in standard statistical models
    • Dominant (at least one copy of X)
    • Recessive (both alleles are X)
    • Additive (number of copies of X)
    • Heterozygote Effect (differing alleles)
    • Independent (separate effect for each allele combination: A/A, A/B=B/A, B/B)
• Functions for computing and visualizing common genetic summaries and statistical tests
    • Allele Frequencies
    • Hardy-Weinberg Equilibrium
    • Linkage Disequilibrium
    • Other statistical methods

Design Details

Design:

• Genotypes are stored in ‘Factor’ objects, with factor levels formatted as ‘A/C’.
• A translation table is constructed to quickly extract individual allele information:

    Genotype   Allele 1   Allele 2
    A/A        A          A
    A/B        A          B
    B/B        B          B

Consequences:

• Can be stored in standard data frames
• Can be efficiently manipulated (space & time)
• Permits both biallelic (C/T) and multi-allelic genetic markers (SSLPs)

Genotype Manipulation

• Importing & Creation: genotype(), as.genotype(), makeGenotypes(), … haplotype(), as.haplotype(), makeHaplotypes(), …
• Manipulation: [] (subsetting), []<- (subset assignment), == (equality)
• Information: summary() (allele and genotype counts and frequencies), allele.names(), allele() (extract individual alleles), nallele() (number of distinct allele values)
• Annotation: locus(), gene(), marker(), …
• Transformation: carrier(), homozygote(), heterozygote(), allele.count()
• Export: write.marker.file(), write.pedigree.file(), write.pop.file()

Installation

Windows GUI: [screenshot not reproduced]

Command Line:

    > install.packages("genetics", dependencies=TRUE)

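The design described under Design Details (a factor whose levels look like ‘A/C’, plus a translation table from each level to its two alleles) and the model ‘views’ listed under Initial Project Goals can be sketched in base R. This is an illustration under stated assumptions, not the package's actual implementation: the names geno, allele_table, and count_allele are hypothetical.

```r
# Genotypes stored as an ordinary factor with levels of the form "A/C".
geno <- factor(c("A/A", "A/C", "C/C", "A/C", NA, "A/A", "A/C", "A/C"),
               levels = c("A/A", "A/C", "C/C"))

# Translation table: one row per factor level, one column per allele.
# It is built once from the levels, so no per-observation string
# manipulation is needed afterwards.
allele_table <- do.call(rbind, strsplit(levels(geno), "/", fixed = TRUE))
dimnames(allele_table) <- list(levels(geno), c("Allele 1", "Allele 2"))

# Allele view: index the table by the factor's integer codes (NA stays NA).
alleles <- allele_table[as.integer(geno), , drop = FALSE]

# Hypothetical helper: number of copies of allele `x` per individual.
count_allele <- function(g, x) {
  per_level <- rowSums(allele_table == x)   # copies of x in each level
  unname(per_level[as.integer(g)])
}

additive  <- count_allele(geno, "A")                # number of copies of A
dominant  <- additive >= 1                          # at least one copy of A
recessive <- additive == 2                          # both alleles are A
hetero    <- unname(alleles[, 1] != alleles[, 2])   # differing alleles

print(data.frame(geno, additive, dominant, recessive, hetero))
```

Because the factor itself is a plain column, it can sit in a data frame next to covariates, and each derived view can be used directly as a model term.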

Statistical Functions

• Hardy-Weinberg (Dis-)Equilibrium: D, D', r, r², χ²
    diseq(), diseq.ci() (Confidence Intervals!)
    HWE.test(), HWE.chisq(), HWE.exact()
• Linkage Disequilibrium: D, D', r, r²
    LD(), LDplot(), LDtable()
• Haplotype Imputation:
    hap(), hapambig(), hapmcmc(), hapenum(), hapshuffle()
• Sample-size tools
    gregorius() (probability of observing a marker of given frequency with specified sample size)
    power.casectrl()
• Utilities
    Bootstrap.ci

Simple Examples: Creating Genotype Objects

A single vector with a character separator:

    > g1 <- genotype( c('A/A','A/C','C/C','C/A',
    +                   NA,'A/A','A/C','A/C') )
    > g3 <- genotype( c('A A','A C','C C','C A',
    +                   '','A A','A C','A C'),
    +                 sep=' ', remove.spaces=FALSE)

A single vector with a positional separator:

    > g2 <- genotype( c('AA','AC','CC','CA','',
    +                   'AA','AC','AC'), sep=1 )

Two separate vectors:

    > g4 <- genotype(
    +   c('A','A','C','C','','A','A','A'),
    +   c('A','C','C','A','','A','C','C')
    + )

A dataframe or matrix with two columns:

    > gm <- cbind(
    +   c('A','A','C','C','','A','A','A'),
    +   c('A','C','C','A','','A','C','C') )
    > gm
         [,1] [,2]
    [1,] "A"  "A"
    [2,] "A"  "C"
    [3,] "C"  "C"
    [4,] "C"  "A"
    …
    > g5 <- genotype( gm )
    > g5
    [1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
    Alleles: A C

Convert 1-column genotype variables read from a file:

    > gm1 <- makeGenotypes(
    +   read.csv("gm1.csv"))
    > gm1
      Age Sex   G1  G2
    1  31   M  A/A G/T
    2  27   F  A/C G/G
    3  35   M  C/C G/T
    4  19   M  A/C G/T
    5  55   M <NA> G/G
    6  34   F  A/A G/G
    7  45   F  A/C T/T
    8  32   M  A/C G/T
    > gm1$G1
    [1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
    Alleles: A C

gm1.csv:

    Age,Sex,G1,G2
    31,M,A/A,G/T
    27,F,A/C,G/G
    35,M,C/C,G/T
    19,M,A/C,G/T
    55,M,,G/G
    34,F,A/A,G/G
    45,F,A/C,T/T
    32,M,A/C,G/T

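As a cross-check on what these genotype objects encode, the G1 column of gm1.csv can be tabulated by hand in base R. The sketch below does not call the genetics package: it reads the same data from an inline string, and all variable names are hypothetical. It also computes the textbook Hardy-Weinberg chi-square statistic (observed genotype counts against expected counts n·p², n·2pq, n·q²). That formula is an assumption about the kind of calculation involved, not a description of what HWE.chisq() does internally.

```r
# Same content as gm1.csv, supplied inline so the example is self-contained.
csv_text <- "Age,Sex,G1,G2
31,M,A/A,G/T
27,F,A/C,G/G
35,M,C/C,G/T
19,M,A/C,G/T
55,M,,G/G
34,F,A/A,G/G
45,F,A/C,T/T
32,M,A/C,G/T
"
gm1_raw <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Drop the missing genotype before counting.
g1 <- gm1_raw$G1[!is.na(gm1_raw$G1)]

# Genotype counts (single-variable view).
geno_counts <- sapply(c("A/A", "A/C", "C/C"), function(g) sum(g1 == g))

# Allele counts (separate-allele view): split each genotype at "/".
alleles       <- unlist(strsplit(g1, "/", fixed = TRUE))
allele_counts <- sapply(c("A", "C"), function(a) sum(alleles == a))
allele_freq   <- allele_counts / sum(allele_counts)

# Expected genotype counts under Hardy-Weinberg equilibrium.
p <- allele_freq[["A"]]
q <- allele_freq[["C"]]
expected <- length(g1) * c("A/A" = p^2, "A/C" = 2 * p * q, "C/C" = q^2)

# Pearson chi-square statistic comparing observed and expected counts.
chisq_stat <- sum((geno_counts - expected)^2 / expected)

print(rbind(observed = geno_counts, expected = round(expected, 2)))
print(round(allele_freq, 2))
print(chisq_stat)
```

With only seven non-missing genotypes the expected counts are far too small for the chi-square approximation to be trusted, so the statistic here only illustrates the arithmetic.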

Convert 2-column genotype variables read from a file:

    > gm2 <- makeGenotypes(
    +   read.csv("gm2.csv"),
    +   convert=list(3:4,5:6))
    > gm2
      Age Sex G1.1/G1.2 G2.1/G2.2
    1  31   M       A/A       G/T
    2  27   F       A/C       G/G
    3  35   M       C/C       G/T
    4  19   M       A/C       G/T
    5  55   M      <NA>       G/G
    6  34   F       A/A       G/G
    7  45   F       A/C       T/T
    8  32   M       A/C       G/T

gm2.csv:

    Age,Sex,G1.1,G1.2,G2.1,G2.2
    31,M,A,A,G,T
    27,F,A,C,G,G
    35,M,C,C,T,G
    19,M,C,A,G,T
    55,M,,,G,G
    34,F,A,A,G,G
    45,F,A,C,T,T
    32,M,A,C,T,G

Simple Examples: Displaying Genotype Information

“Raw”:

    > g5
    [1] "A/A" "A/C" "C/C"
    [4] "A/C" NA    "A/A"
    [7] "A/C" "A/C"
    Alleles: A C

“Summary”:

    > summary(g5)
    Allele Frequency:
       Count Proportion
    A      8       0.57
    C      6       0.43
    NA     2         NA

    Genotype Frequency:
        Count Proportion
    A/A     2       0.29
    A/C     4       0.57
    C/C     1       0.14
    NA      1         NA

Simple Examples: Extracting allele information

Genotypes (Independent factor levels):

    > g5
    [1] "A/A" "A/C" "C/C" "A/C"
    [5] NA    "A/A" "A/C" "A/C"
    Alleles: A C

Allele Counts (Additive Effect):

    > allele.count(g5, "A")
    [1]  2  1  0  1 NA  2  1  1
    attr(,"allele")
    [1] "A"

Allele Homozygote (Recessive Effect):

    > homozygote(g5,'A')
    [1]  TRUE FALSE FALSE FALSE
    [5]    NA  TRUE FALSE FALSE

Heterozygote (Heterozygote Advantage Effect):

    > heterozygote(g5,'A')
    [1] FALSE  TRUE FALSE  TRUE
    [5]    NA FALSE  TRUE  TRUE

Allele presence (Dominant Effect):

    > carrier(g5,'A')
    [1]  TRUE  TRUE FALSE  TRUE
    [5]    NA  TRUE  TRUE  TRUE

First allele:

    > allele(g5, 1)
    [1] "A" "A" "C" "A" NA  "A"
    [7] "A" "A"
    attr(,"which")
    [1] 1
    attr(,"allele.names")
    [1] "A" "C"

Both alleles:

    > allele(g5)
         [,1] [,2]
    [1,] "A"  "A"
    [2,] "A"  "C"
    [3,] "C"  "C"
    [4,] "A"  "C"
    [5,] NA   NA
    [6,] "A"  "A"
    [7,] "A"  "C"
    [8,] "A"  "C"
    attr(,"which")
    [1] 1 2
    attr(,"allele.names")
    [1] "A" "C"

Example Session

Future Development
R GeneticsNG

Mission: GeneticsNG is a collaborative project to develop a core set of data structures and analytic tools for the management, visualization, and analysis of genetic data. This core will provide sufficient ease of use, stability, features, documentation, and community support to inspire users and developers to use, contribute to, and extend the system.

Goals:

• Scalable to whole-genome genetic analysis (>1e5 SNPs)
• Read/write common genetics data storage formats
• Port existing open-source genetics codes
    • Current R genetics packages (genetics, haplo.score, gap, …)
    • Other open-source packages…
• Provide good documentation, including tutorials and training
• Engage the entire R genetics user/developer community

R GeneticsNG: Current Team

• Pfizer: Gregory Warnes, Nitin Jain
• Channing Laboratory (Harvard): Ross Lazarus
• BMS: Scott D Chasalow, Giovanni Montana
• Insightful: Michael O'Connell
• Univ. Chicago: Junsheng Cheng

Join us! Project Page: http://r-genetics.sf.net/

References

• R Project: http://www.r-project.org
• R genetics package: http://cran.r-project.org/contrib/main/Descriptions/genetics.html
• R-News article: Warnes GR. "The Genetics Package," R News, Volume 3, Issue 1, June 2003.
• R GeneticsNG project: http://r-genetics.sf.net/
• Me: http://www.warnes.net, Gregory.R.Warnes@Pfizer.com