The R genetics package - American Statistical Association

The R genetics package:
Tools for statistical genetics
Gregory R. Warnes
Associate Director
NonClinical Statistics
Pfizer Global R&D
Groton CT
Outline

Project Goals
Simplify Population Genetic Analysis

Design Details
Extend R ‘Factor’ objects




Functions Included


Genetic data: Importing & Creation, Manipulation, Information, Annotation, Transformation, Export
Statistical Functions: Hardy-Weinberg (Dis-)Equilibrium, Linkage Disequilibrium, Haplotype Imputation, Sample-size tools
Simple Examples

Creating Genotype Objects
Example Session
Future Development:




Emulate BioConductor Project
Large scale SNP analysis
Formal Object Class
Multi-team collaboration
CT ASA Mini Conference: 2005-03-05
Problem


At each genetic position within a gene, diploid cells have two alleles.
This suggests storing each allele as a separate variable.
However, most laboratory methods cannot distinguish between A/B and B/A, yielding three observed genotypes at each position: (A/A), (A/B or B/A), (B/B).
Consequently, the observed alleles are confounded, which suggests the use of a single genotype variable.

This duality is not directly handled by standard statistical packages.
As a consequence, the need to handle both views creates complexity when manipulating genotype data or including it in statistical analysis.

Initial Project Goals
Simplify Statistical Analysis using Genetic Data by providing:
A genotype object class that appropriately captures the single variable / separate allele duality
Methods to import and manipulate genotype objects without string manipulation
Simple tools for including different ‘views’ of genotype variables in standard statistical models
  • Dominant (at least one copy of X)
  • Recessive (both alleles are X)
  • Additive (number of copies of X)
  • Heterozygote Effect (differing alleles)
  • Independent (separate effect for each allele combination: A/A, A/B=B/A, B/B)
Functions for computing and visualizing common genetic summaries and statistical tests
  • Allele Frequencies
  • Hardy-Weinberg Equilibrium
  • Linkage Disequilibrium
Other statistical methods
Design Details

Design:
Genotypes are stored in ‘Factor’ objects, with factor levels formatted as ‘A/C’.
A translation table is constructed to quickly extract individual allele information:

Genotype  Allele 1  Allele 2
A/A       A         A
A/B       A         B
B/B       B         B

Consequences
Can be stored in standard data frames
Can be efficiently manipulated (space & time)
Permits both biallelic (C/T) and multi-allelic genetic markers (SSLPs)
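
The design above can be sketched in base R. This is only an illustration of the idea (a factor plus a per-level translation table), not the package's internal code; the variable names `g`, `translation`, `allele1` and `allele2` are made up for the example.

```r
# Sketch only: base R, not the genetics package's internals.
# Genotypes held in an ordinary factor with levels such as "A/B".
g <- factor(c("A/A", "A/B", "B/B", "A/B", NA, "A/A"))

# Build the translation table once, from the factor levels only.
parts <- strsplit(levels(g), "/", fixed = TRUE)
translation <- data.frame(
  genotype = levels(g),
  allele1  = vapply(parts, `[`, character(1), 1),
  allele2  = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Look up the alleles of every observation by integer level code;
# no string parsing is done per observation.
codes   <- as.integer(g)
allele1 <- translation$allele1[codes]
allele2 <- translation$allele2[codes]
```

Because the string work happens once per level rather than once per observation, extracting alleles stays cheap even for long vectors, and the factor itself can sit in a standard data frame.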
Genotype Manipulation

Importing & Creation
genotype(), as.genotype(), makeGenotypes(), …
haplotype(), as.haplotype(), makeHaplotypes(), …

Manipulation
[] (subsetting), []<- (subset assignment), == (equality)

Information
summary() (Allele and genotype counts and frequencies), allele.names(),
allele() (Extract individual alleles), nallele() (Number of distinct allele values)

Annotation
locus(), gene(), marker(), …

Transformation
carrier(), homozygote(), heterozygote(),
allele.count()

Export
write.marker.file(), write.pedigree.file(),
write.pop.file()
Installation
Windows GUI:
Command Line:
> install.packages("genetics", dependencies=TRUE)
Statistical Functions

Hardy-Weinberg (Dis-)Equilibrium: D, D’, r, r², χ²
diseq(), diseq.ci() (Confidence Intervals!)
HWE.test(), HWE.chisq(), HWE.exact()

Linkage Disequilibrium: D, D’, r, r²
LD(), LDplot(), LDtable()

Haplotype Imputation:
hap(), hapambig(), hapmcmc(), hapenum(), hapshuffle()

Sample-size tools
gregorius() (Probability of observing a marker of given frequency with specified sample size)
power.casectrl()

Utilities
Bootstrap.ci
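
For orientation, the Hardy-Weinberg quantities D and χ² listed above can be computed by hand for a single biallelic marker. The sketch below uses base R only and illustrative genotype counts; it is not the implementation behind HWE.chisq() or diseq(), which may obtain p-values or report estimates differently.

```r
# Sketch only: textbook Hardy-Weinberg chi-square for one biallelic marker.
# Illustrative genotype counts (assumed for this example).
obs <- c(AA = 2, AC = 4, CC = 1)
n   <- sum(obs)

# Allele frequencies estimated from the genotype counts.
p <- unname((2 * obs["AA"] + obs["AC"]) / (2 * n))  # frequency of A
q <- 1 - p                                          # frequency of C

# Expected counts under Hardy-Weinberg equilibrium.
expected <- n * c(AA = p^2, AC = 2 * p * q, CC = q^2)

# Pearson chi-square with 1 degree of freedom (no continuity correction).
chisq   <- sum((obs - expected)^2 / expected)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Disequilibrium coefficient: observed minus expected A/A frequency.
D <- unname(obs["AA"] / n) - p^2
```

A negative D, as here, means fewer A/A homozygotes were observed than equilibrium predicts; with only seven genotypes the test has very little power, so the example is about the arithmetic rather than the conclusion.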
Simple Examples: Creating Genotype Objects
A single vector with a character separator:
> g1 <- genotype( c('A/A','A/C','C/C','C/A',
+                   NA,'A/A','A/C','A/C') )
> g3 <- genotype( c('A A','A C','C C','C A',
+                   '','A A','A C','A C'),
+                 sep=' ', remove.spaces=FALSE)
Simple Examples: Creating Genotype Objects
A single vector with a positional separator:
> g2 <- genotype( c('AA','AC','CC','CA','',
+                   'AA','AC','AC'), sep=1 )
Two separate vectors:
> g4 <- genotype(
+   c('A','A','C','C','','A','A','A'),
+   c('A','C','C','A','','A','C','C')
+ )
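
The character-separator and positional-separator forms describe the same allele pairs. The base R sketch below makes that explicit; `split_char()` and `split_pos()` are hypothetical helpers written for this illustration and are not part of the genetics package, whose own parser also handles details such as allele ordering.

```r
# Sketch only: hypothetical helpers, not the package's parser.

# Split "A/C"-style strings on a separator character.
split_char <- function(x, sep = "/") {
  x[!is.na(x) & x == ""] <- NA          # treat empty strings as missing
  parts <- strsplit(x, sep, fixed = TRUE)
  cbind(vapply(parts, `[`, character(1), 1),
        vapply(parts, `[`, character(1), 2))
}

# Split "AC"-style strings after character position `pos`.
split_pos <- function(x, pos = 1) {
  x[!is.na(x) & x == ""] <- NA          # treat empty strings as missing
  cbind(substr(x, 1, pos), substring(x, pos + 1))
}

a <- split_char(c('A/A','A/C','C/C','C/A', NA, 'A/A','A/C','A/C'))
b <- split_pos(c('AA','AC','CC','CA','', 'AA','AC','AC'), pos = 1)

# Both inputs yield the same two-column allele matrix.
same <- identical(a, b)
```

Note that the sketch keeps alleles in the order given (C then A for the fourth observation), whereas the genotype output on these slides reports that observation as A/C.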
Simple Examples: Creating Genotype Objects
A data frame or matrix with two columns:
> gm <- cbind(
+   c('A','A','C','C','','A','A','A'),
+   c('A','C','C','A','','A','C','C') )
> gm
     [,1] [,2]
[1,] "A"  "A"
[2,] "A"  "C"
[3,] "C"  "C"
[4,] "C"  "A"
…
> g5 <- genotype( gm )
> g5
[1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
Alleles: A C
Simple Examples: Creating Genotype Objects
Convert 1-column genotype variables read from a file:

gm1.csv
Age,Sex,G1,G2
31,M,A/A,G/T
27,F,A/C,G/G
35,M,C/C,G/T
19,M,A/C,G/T
55,M,,G/G
34,F,A/A,G/G
45,F,A/C,T/T
32,M,A/C,G/T

> gm1 <- makeGenotypes(
+   read.csv("gm1.csv"))
> gm1
  Age Sex   G1  G2
1  31   M  A/A G/T
2  27   F  A/C G/G
3  35   M  C/C G/T
4  19   M  A/C G/T
5  55   M <NA> G/G
6  34   F  A/A G/G
7  45   F  A/C T/T
8  32   M  A/C G/T
> gm1$G1
[1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
Alleles: A C
Simple Examples: Creating Genotype Objects
Convert 2-column genotype variables read from a file:

gm2.csv
Age,Sex,G1.1,G1.2,G2.1,G2.2
31,M,A,A,G,T
27,F,A,C,G,G
35,M,C,C,T,G
19,M,C,A,G,T
55,M,,,G,G
34,F,A,A,G,G
45,F,A,C,T,T
32,M,A,C,T,G

> gm2 <- makeGenotypes(
+   read.csv("gm2.csv"),
+   convert=list(3:4,5:6))
> gm2
  Age Sex G1.1/G1.2 G2.1/G2.2
1  31   M       A/A       G/T
2  27   F       A/C       G/G
3  35   M       C/C       G/T
4  19   M       A/C       G/T
5  55   M      <NA>       G/G
6  34   F       A/A       G/G
7  45   F       A/C       T/T
8  32   M       A/C       G/T
Simple Examples: Displaying Genotype Information
“Raw”
> g5
[1] "A/A" "A/C" "C/C"
[4] "A/C" NA    "A/A"
[7] "A/C" "A/C"
Alleles: A C
“Summary”
> summary(g5)
Allele Frequency:
   Count Proportion
A      8       0.57
C      6       0.43
NA     2         NA

Genotype Frequency:
    Count Proportion
A/A     2       0.29
A/C     4       0.57
C/C     1       0.14
NA      1         NA
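
The counts and proportions in this summary can be checked by hand. The base R sketch below recomputes them from the eight allele pairs of g5; it does not call summary() from the package, and the names `a1`, `a2` and `geno` are introduced only for this check.

```r
# Sketch only: recompute the summary figures for g5 in base R.
# Allele pairs for the eight observations (NA = missing genotype).
a1 <- c("A", "A", "C", "A", NA, "A", "A", "A")
a2 <- c("A", "C", "C", "C", NA, "A", "C", "C")

# Allele counts (including missing) and proportions among observed alleles.
allele_counts <- table(c(a1, a2), useNA = "ifany")
allele_prop   <- round(prop.table(table(c(a1, a2))), 2)

# Genotype counts (including missing) and proportions among observed genotypes.
geno        <- ifelse(is.na(a1), NA, paste(a1, a2, sep = "/"))
geno_counts <- table(geno, useNA = "ifany")
geno_prop   <- round(prop.table(table(geno)), 2)
```

Proportions are taken over the non-missing values only (14 alleles, 7 genotypes), which is why the NA rows carry a count but no proportion.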
Simple Examples: Extracting allele information

Genotypes (Independent factor levels):
> g5
[1] "A/A" "A/C" "C/C" "A/C"
[5] NA    "A/A" "A/C" "A/C"
Alleles: A C

Allele Counts (Additive Effect):
> allele.count(g5, "A")
[1]  2  1  0  1 NA  2  1  1
attr(,"allele")
[1] "A"

Allele Homozygote (Recessive Effect):
> homozygote(g5,'A')
[1]  TRUE FALSE FALSE FALSE
[5]    NA  TRUE FALSE FALSE

Heterozygote (Heterozygote Advantage Effect):
> heterozygote(g5,'A')
[1] FALSE  TRUE FALSE  TRUE
[5]    NA FALSE  TRUE  TRUE

Allele presence (Dominant Effect):
> carrier(g5,'A')
[1]  TRUE  TRUE FALSE  TRUE
[5]    NA  TRUE  TRUE  TRUE
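
Each of these views is an ordinary numeric or logical vector, so it can go straight into a model formula. The sketch below shows this with base R only: the allele counts are those printed for g5, while the phenotype `y` is invented purely for illustration.

```r
# Sketch only: using the different genotype 'views' as model terms.
# Copies of allele A per observation, as printed by allele.count(g5, "A").
n_A <- c(2, 1, 0, 1, NA, 2, 1, 1)

# The other views follow directly from the allele count.
dominant  <- n_A >= 1   # at least one copy of A (carrier)
recessive <- n_A == 2   # both alleles are A (homozygote)
hetero    <- n_A == 1   # exactly one copy of A (heterozygote)

# Hypothetical continuous phenotype, made up for this example.
y <- c(5.1, 4.2, 3.0, 4.4, 4.0, 5.3, 3.9, 4.1)

fit_add <- lm(y ~ n_A)          # additive: one slope per copy of A
fit_dom <- lm(y ~ dominant)     # dominant: carriers vs. non-carriers
fit_ind <- lm(y ~ factor(n_A))  # independent: separate mean per genotype
```

lm() drops the observation with a missing genotype by default, so all three fits use the same seven observations and can be compared directly.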
Simple Examples: Extracting allele information

First allele:
> allele(g5, 1)
[1] "A" "A" "C" "A" NA  "A"
[7] "A" "A"
attr(,"which")
[1] 1
attr(,"allele.names")
[1] "A" "C"

Both alleles:
> allele(g5)
     [,1] [,2]
[1,] "A"  "A"
[2,] "A"  "C"
[3,] "C"  "C"
[4,] "A"  "C"
[5,] NA   NA
[6,] "A"  "A"
[7,] "A"  "C"
[8,] "A"  "C"
attr(,"which")
[1] 1 2
attr(,"allele.names")
[1] "A" "C"
Example Session
Future Development
R GeneticsNG

Mission:
GeneticsNG is a collaborative project to develop a core set of data
structures and analytic tools for the management, visualization,
and analysis of genetic data. This core will provide sufficient
ease of use, stability, features, documentation, and community
support to inspire users and developers to use, contribute to, and
extend the system.

Goals:
Scalable to Whole-Genome genetic analysis (>1e5 SNPs)
Read/Write common genetics data storage formats
Port existing open-source genetics codes
  • Current R genetics packages (genetics, haplo.score, gap, …)
  • Other open-source packages…
Provide good documentation, including tutorials and training
Engage the entire R genetics user/developer community
Future Development
R GeneticsNG

Current Team
  • Pfizer: Gregory Warnes, Nitin Jain
  • Channing Laboratory (Harvard): Ross Lazarus
  • BMS: Scott D Chasalow, Giovanni Montana
  • Insightful: Michael O'Connell
  • Univ. Chicago: Junsheng Cheng

Join us!
Project Page: http://r-genetics.sf.net/
References



R Project: http://www.r-project.org
R genetics package: http://cran.r-project.org/contrib/main/Descriptions/genetics.html
R-News article: Warnes GR. "The Genetics Package," R News, Volume 3, Issue 1, June 2003.
R GeneticsNG project: http://r-genetics.sf.net/
Me: http://www.warnes.net, Gregory.R.Warnes@Pfizer.com