A Shrinkage Regression Approach
to Tackle the HLA Region
Charlotte Vignal
Variable Selection Workshop
Vienna, July 26th 2008
Outline
Overview of the HLA system and the challenge
of analysing data from the HLA region
Multivariate association test using a
Bayesian-inspired shrinkage regression
approach
Application to the rheumatoid arthritis
case-control study
Conclusion
The Human Leukocyte Antigen System
• A genomic region found in almost all vertebrates, the major
histocompatibility complex (MHC) - gene composition and
arrangement vary between species (below)
• In humans, the MHC is the HLA system
• A set of genes encoding proteins essential to immune response
• Major role in histocompatibility and protection against pathogens
[Figure: MHC gene composition and arrangement in mouse, rat, chimpanzee and human. Kelley et al. Immunogenetics (2005)]
The Challenge
Susceptibility to many complex disorders maps to
the HLA region
High degree of correlation within the region hampers the
identification of causal variants
Widely used approaches test the effect of one genetic
variable at a time
Require methods that allow the detection of (possibly
multiple) causal variants among highly correlated data
Multi-SNP Methods can be more
Powerful than Single-SNP Analyses
Multivariate logistic regression
– Problematic when nVars >> nObs
– Stepwise procedures can be unstable in the presence of many highly correlated terms
Shrinkage method using Bayesian logistic regression
– A variable selection approach
– Based on the Least Absolute Shrinkage and Selection Operator
approach (LASSO) (Tibshirani 1996)
– Fast implementation using the Bayesian Binary Regression (BBR)
software for text-categorisation analysis
(Genkin et al. 2004, http://www.stat.rutgers.edu/~madigan/BBR)
Bayesian Logistic Regression
for variable selection
Each coefficient βj has a Laplace prior distribution with
mode 0 and prior variance ν = 2/λ², where λ is the penalty
factor
– Mode 0 encodes a prior belief of no effect
– The prior variance determines the strength of this belief and hence
the sparseness of the fitted model
The maximum a posteriori (posterior mode) estimates β̂j
are often zero or else shrunk towards zero
Terms with non-zero β̂j are included in the final model, and
treated as significant
The value of β̂j gives a (shrunk) measure of effect size
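The posterior mode under independent Laplace priors is the minimiser of the negative log-likelihood plus λ Σ|βj|, which is why many estimates are exactly zero. The sketch below illustrates this with a plain proximal-gradient (ISTA) loop in NumPy. It is not the BBR algorithm; the function names, the simulated genotypes and the value of `lam` are assumptions made for illustration, and `lam` here is on the scale of the unnormalised log-likelihood.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def map_laplace_logistic(X, y, lam, n_iter=2000):
    """Posterior-mode (L1-penalised) logistic regression by proximal gradient.

    Minimises  -loglik(beta0, beta) + lam * sum_j |beta_j|.
    Predictors are centred, so beta0 is the intercept for centred data.
    The intercept is not penalised.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Xa = np.column_stack([np.ones(n), Xc])
    # The logistic Hessian is bounded by Xa'Xa / 4, hence this step size.
    step = 4.0 / np.linalg.norm(Xa, 2) ** 2
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(beta0 + Xc @ beta)))
        resid = prob - y
        beta0 = beta0 - step * resid.sum()
        beta = soft_threshold(beta - step * (Xc.T @ resid), step * lam)
    return beta0, beta

# Simulated example (made-up data): 20 markers coded 0/1/2, only marker 0 causal.
rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
logit = -1.0 + 1.2 * X[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit))).astype(float)

beta0, beta = map_laplace_logistic(X, y, lam=15.0)
selected = np.flatnonzero(beta != 0.0)
print("selected markers:", selected)
print("shrunk effect of marker 0:", beta[0])
```

A larger `lam` corresponds to a smaller prior variance and a sparser fitted model.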
The Density of the Laplace Distribution
[Figure: Laplace density p(x) plotted against x, for x from -10 to 10]
! Effect size estimates are biased towards zero;
over-shrinking true effects can lead to non-causal correlated
variables being retained
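As a small numerical check of the relation between the penalty factor and the prior variance, the sketch below evaluates the Laplace density with mode 0 and rate λ and confirms that its variance is 2/λ². The function names and the grid are assumptions for illustration only.

```python
import numpy as np

def laplace_pdf(x, lam):
    """Laplace density with mode 0 and rate lam (scale 1/lam)."""
    return 0.5 * lam * np.exp(-lam * np.abs(x))

def neg_log_prior(beta, lam):
    """-log p(beta) = lam * |beta| - log(lam / 2): an L1 penalty plus a constant."""
    return lam * np.abs(beta) - np.log(lam / 2.0)

def numeric_variance(lam, half_width=40.0, n_points=400001):
    """Variance by a Riemann sum on a symmetric grid (the mean is 0)."""
    x = np.linspace(-half_width, half_width, n_points)
    dx = x[1] - x[0]
    return float(np.sum(x ** 2 * laplace_pdf(x, lam)) * dx)

for lam in (0.5, 1.0, 2.0):
    print(lam, numeric_variance(lam), 2.0 / lam ** 2)
```

The negative log prior grows linearly in |β|, which is the LASSO penalty up to an additive constant.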
Application
The Rheumatoid Arthritis Dataset
RA is an autoimmune disease and a complex disorder
– Estimated genetic contribution of ~30-50%
– The HLA region is strongly implicated in RA susceptibility
– Genetic associations reported with a biomarker called the shared
epitope (SE) defined by a class of alleles at HLA-DRB1
– The mechanism by which RA is determined is still unknown
Is the SE association the only HLA effect predisposing to RA?
The subjects: 842 RA cases and 957 controls
(but 774 cases and 945 controls with no missing data analysed)
The independent variables:
– 2,302 genetic markers, each a continuous variable coded as 0, 1 or 2
based on the number of allele copies
– The shared epitope, a continuous variable coded as 0, 1 or 2 based on
the number of shared epitope positive (SE+) alleles
The Effect of Shared Epitope on RA
Effect                     Wald P      OR [95% CI]
SE carriage
  SE+ vs. SE-              < 0.0001    5.1 [4.1; 6.3]
SE+ copies                 < 0.0001
  1 copy vs. 0 copy                    3.7 [2.9; 4.6]
  2 copies vs. 1 copy                  3.2 [2.4; 4.3]
  2 copies vs. 0 copy                  11.8 [8.6; 16.1]
The presence of SE is strongly associated with RA
• Increasing risk for RA associated with the number of SE+
allele copies
• The objective: to investigate the presence of additional causal
variants in the HLA region, possibly correlated with SE
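For readers less familiar with the OR [95% CI] notation, the sketch below computes an odds ratio and its Wald interval from a 2x2 table. The counts are made up and the function name is hypothetical; the slide's estimates may well come from a logistic regression fit rather than raw tables. The last lines check that the per-copy odds ratios reported on the slide multiply up as expected under an additive (per-allele) model: 3.7 x 3.2 is close to 11.8.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Made-up counts, for illustration only.
example = odds_ratio_wald(20, 10, 10, 20)
print("OR [95%% CI]: %.1f [%.1f; %.1f]" % example)

# Consistency of the reported per-copy odds ratios.
product = 3.7 * 3.2
print("1 vs 0 times 2 vs 1:", round(product, 2), "reported 2 vs 0: 11.8")
```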

Specification of the Penalty λ
• Cases and controls permuted 100 times for each
λ within each SE group (i.e. SE effect retained)
• SE (additive term) included in each model
• λ selected if the number of false positives per model < 1
• λ = 62 was selected for further analyses
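The permutation scheme can be sketched as follows: case/control labels are shuffled separately inside each SE group, so the SE effect is preserved while any other association is broken, and the number of variables selected beyond SE is averaged over permutations. `fit_and_select` is a hypothetical callback standing in for the model fit; it is not part of BBR.

```python
import numpy as np

def permute_within_groups(y, groups, rng):
    """Shuffle labels y separately within each level of groups."""
    y = np.asarray(y)
    groups = np.asarray(groups)
    y_perm = y.copy()
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        y_perm[idx] = rng.permutation(y[idx])
    return y_perm

def mean_false_positives(fit_and_select, X, y, groups, lam, n_perm=100, seed=0):
    """Average number of non-SE variables selected under permuted labels.

    fit_and_select(X, y, lam) is assumed to return the indices of the
    variables selected in addition to SE (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_perm):
        y_perm = permute_within_groups(y, groups, rng)
        counts.append(len(fit_and_select(X, y_perm, lam)))
    return float(np.mean(counts))

# Tiny made-up example: SE copies as the grouping variable.
se_copies = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
status = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0])
permuted = permute_within_groups(status, se_copies, np.random.default_rng(1))
print(permuted)
```

Under this scheme one would scan a grid of λ values and keep one whose average count falls below the chosen threshold.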
The Effect of Shrinking a True Effect
R² between each genetic variable
and SE across the HLA region
• In blue are the genetic variables selected by BLR in addition to SE
• Three of the selected variables are correlated with SE
Shrinking a known effect may cause correlated SNPs to be selected
The Effect of Shrinkage on True
Effects
To investigate the effect of shrinkage, SE included twice (SE
& SEfake) in the model:
• When SE and SEfake are shrunk, both variables retained
– Shrinking a known effect may cause correlated SNPs to be
selected
• When SE is not shrunk, only SE is retained
– Correlated SNPs could be eliminated
The shrinkage factor was not applied to SE in subsequent
analyses (λ = 0)
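One way to leave a known effect unshrunk is to give each coefficient its own penalty and set it to zero for SE. The sketch below shows the effect on a single shrinkage (soft-thresholding) step; the variable names, step size and input values are made up, and BBR's own mechanism for per-variable priors may differ.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise shrinkage; t may be a vector of per-coefficient thresholds."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

names = ["SE", "snp_a", "snp_b"]       # hypothetical variable names
lam = np.array([0.0, 62.0, 62.0])      # lambda = 0 for SE, 62 elsewhere
step = 0.01                            # made-up step size
z = np.array([1.5, 0.4, -0.9])         # made-up pre-shrinkage values

beta = soft_threshold(z, step * lam)
for name, before, after in zip(names, z, beta):
    print(name, before, "->", after)
```

The SE coefficient passes through unchanged, while the penalised coefficients are moved towards zero or set to zero.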
BLR and Correlated Data
Can the BLR approach distinguish positive effects from
spurious associations in the presence of correlation?
• 4 variables correlated with SE were used to evaluate error rates
and power
• Records of each variable re-distributed in cases and controls to
achieve different sizes of OR while maintaining correlation with SE
• Error rate and power assessed by permuting cases and controls
– Error rate: frequency of the variables selected beyond SE & the
simulated correlated variables over 100 permutations
– Power: frequency of selection of the simulated variables over 100 permutations
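The two summaries can be computed from the list of variables selected in each permutation. The sketch below is one reading of the definitions above: power is the fraction of permutations in which a simulated variable is selected, and the error rate is the mean number of selected variables that are neither SE nor simulated. Function and variable names are hypothetical, and the selections are made up.

```python
def selection_summaries(selected_per_perm, simulated, always_in=("SE",)):
    """Return (power per simulated variable, mean extra selections per permutation)."""
    n_perm = len(selected_per_perm)
    power = {
        v: sum(v in sel for sel in selected_per_perm) / n_perm
        for v in simulated
    }
    expected = set(simulated) | set(always_in)
    extras = [len(set(sel) - expected) for sel in selected_per_perm]
    return power, sum(extras) / n_perm

# Made-up selections from four permutations.
runs = [
    {"SE", "sim1"},
    {"SE", "sim1", "snpX"},
    {"SE"},
    {"SE", "sim2", "snpX", "snpY"},
]
power, error_rate = selection_summaries(runs, simulated=["sim1", "sim2"])
print(power, error_rate)
```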
Power
• Selection of simulated
variables correlated with SE
– Variables moderately correlated with SE selected if OR > 2
– Variables highly correlated with SE selected if OR > 5
Error Rate
• Selection of simulated
variables correlated with SE
• Under the null, expect 1 false positive per analysis (λ = 62)
• Observed: 1 to 2 false positives per analysis
ATT-BLR Results Comparison
• Data were analysed by Armitage
Trend Test (ATT) and BLR
• Single-point analysis using ATT
identified 109 associated SNPs at
α = 4.34e-04 = 1/2302
• With λ = 62, BLR identified 10 SNPs
• Variables selected by BLR are not
correlated with SE

SNP_DE     P_ATT-adj    R² (SNP, SE)
SE         4.2e-61
snp292     1.9e-6       0.03
snp576     2.6e-6       0.02
snp271     3.2e-5       0.02
snp645     9.6e-5       8.5e-6
snp068     2.2e-5       0.04
snp384     2.4e-5       0.002
snp465     9.7e-6       0.03
snp156     2.3e-5       0.001
snp225     3.1e-6       0.05
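For reference, a single-marker Armitage trend test can be sketched with the standard identity χ² = N·r², where r is the correlation between the 0/1/2 genotype score and case status, compared against a 1 degree of freedom chi-square distribution. The function name and the counts are made up for illustration; the slides do not say which implementation or adjustment was used. The threshold 1/2302 is the one quoted on the slide.

```python
import math
import numpy as np

def armitage_trend_test(genotype, status):
    """Cochran-Armitage trend test with additive scores 0, 1, 2.

    Uses chi2 = N * r^2 (variance computed with N, not N - 1) and the
    1-df chi-square tail probability erfc(sqrt(chi2 / 2)).
    """
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(status, dtype=float)
    r = np.corrcoef(g, y)[0, 1]
    chi2 = g.size * r ** 2
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

alpha = 1.0 / 2302  # one expected false positive across 2,302 markers

# Made-up counts: cases carry more copies of the allele than controls.
case_geno = [0] * 10 + [1] * 20 + [2] * 30
ctrl_geno = [0] * 30 + [1] * 20 + [2] * 10
genotype = np.array(case_geno + ctrl_geno)
status = np.array([1] * 60 + [0] * 60)

chi2, p_value = armitage_trend_test(genotype, status)
print(chi2, p_value, p_value < alpha)
```

Each marker is tested on its own here, which is why correlated markers near a true signal tend to pass the threshold together.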
Additional Analysis
The NEG Distribution
• Data re-analysed using the normal-exponential-gamma
(NEG) prior with parameters set to expect 1 false positive per
model (Hoggart et al. PLoS (2008))
! NEG has heavier tails to
allow sparser solutions
Additional Analysis
The NEG Distribution
• NEG identified 4 variables, of which three (snp271, snp384,
snp545) were also retained by DE
• Variables identified with the NEG prior are less correlated
among themselves and with SE than those selected using DE
• Three of the selected variables are in genes/regions reported
to contribute to RA susceptibility: BAT1 and HLA-DQA1/DQB1
Conclusions
BLR appears to perform better than single-point
association analysis (ATT) when data are correlated
• Computationally efficient
• Identifies fewer positive results (10 vs. 109)
• Correlation might be more effectively handled
Simulation analyses confirm reasonable power and error rate
Three variables identified by both DE and NEG priors lie in
genes previously implicated in RA
Results suggest the presence of independent RA-associated
effects in the HLA region
Acknowledgements
David Balding, Imperial College, UK
Clive Hoggart, Imperial College, UK
Aruna Bansal, GSK, UK
The Genetics Division at GSK