A Shrinkage Regression Approach to Tackle the HLA Region
Charlotte Vignal
Variable Selection Workshop
Vienna, July 26th 2008

Outline
• Overview of the HLA system and the challenge of analysing data from the HLA region
• Multivariate association test using a Bayesian-inspired shrinkage regression approach
• Application to the rheumatoid arthritis case-control study
• Conclusion

The Human Leukocyte Antigen System
• The major histocompatibility complex (MHC) is a genomic region found in almost all vertebrates; gene composition and arrangement vary between species (figure below)
• In humans, the MHC is the HLA system
• A set of genes encoding proteins essential to immune response
• Major role in histocompatibility and protection against pathogens
[Figure, panels: MOUSE, RAT, CHIMPANZEE, HUMAN. Kelley et al. Immunogenetics (2005)]

The Challenge
• Susceptibility to many complex disorders maps to the HLA region
• The high degree of correlation within the region hampers the identification of causal variants
• Widely used approaches test the effect of one genetic variable at a time
• Methods are required that allow the detection of (possibly multiple) causal variants among highly correlated data

Multi-SNP Methods Can Be More Powerful than Single-SNP Analyses
• Multivariate logistic regression
  – Problematic when nVars >> nObs
  – Stepwise procedures can be unstable in the presence of many highly correlated terms
• Shrinkage method using Bayesian logistic regression
  – A variable selection approach
  – Based on the Least Absolute Shrinkage and Selection Operator (LASSO) approach (Tibshirani 1996)
  – Fast implementation using the Bayesian Binary Regression (BBR) software for text-categorisation analysis (Genkin et al. 2004, http://www.stat.rutgers.edu/~madigan/BBR)

Bayesian Logistic Regression for Variable Selection
• Each coefficient βj has a Laplace prior distribution with mode 0 and prior variance ν = 2/λ², where λ is the penalty factor
  – Mode 0 encodes a prior belief of no effect
  – The prior variance determines the strength of this belief and hence the sparseness of the fitted model
• The maximum a posteriori (posterior mode) estimates β̂j are often zero or else shrunk towards zero
• Terms with non-zero β̂j are included in the final model and treated as significant
• The value of β̂j gives a (shrunk) measure of effect size

The Density of the Laplace Distribution
[Figure: Laplace density p(x) plotted against x, for x from -10 to 10]
! Effect size estimates are biased towards zero; over-shrinking true effects can lead to non-causal correlated variables being retained

Application: The Rheumatoid Arthritis Dataset
• RA is an autoimmune disease and a complex disorder
  – Estimated genetic contribution of ~30-50%
  – The HLA region is strongly implicated in RA susceptibility
  – Genetic associations reported with a biomarker called the shared epitope (SE), defined by a class of alleles at HLA-DRB1
  – The mechanism by which RA is determined is still unknown
• Is the SE association the only HLA effect predisposing to RA?
• The subjects: 842 RA cases and 957 controls (only the 774 cases and 945 controls with no missing data were analysed)
• The independent variables:
  – 2,302 genetic markers, each a continuous variable coded as 0, 1 or 2 based on the number of allele copies
  – The shared epitope, a continuous variable coded as 0, 1 or 2 based on the number of shared epitope positive (SE+) alleles

The Effect of Shared Epitope on RA

Effect                     Wald P      OR [95% CI]
SE carriage                < 0.0001
  SE+ vs. SE-                          5.1 [4.1; 6.3]
SE+ copies                 < 0.0001
  1 copy vs. 0 copy                    3.7 [2.9; 4.6]
  2 copies vs. 1 copy                  3.2 [2.4; 4.3]
  2 copies vs. 0 copy                  11.8 [8.6; 16.1]

• The presence of SE is strongly associated with RA
• The risk of RA increases with the number of SE+ allele copies
• The objective: to investigate the presence of additional causal variants in the HLA region, possibly correlated with SE

Specification of the Penalty λ
• Cases and controls permuted 100 times for each λ within each SE group (i.e. SE effect retained)
• SE (additive term) included in each model
• λ selected if the number of false positives per model < 1
• λ = 62 was selected for further analyses

The Effect of Shrinking a True Effect
[Figure: R² between each genetic variable and SE across the HLA region; the genetic variables selected by BLR in addition to SE are shown in blue]
• Three of the selected variables are correlated with SE
• Shrinking a known effect may cause correlated SNPs to be selected

The Effect of Shrinkage on True Effects
• To investigate the effect of shrinkage, SE was included twice (SE & SEfake) in the model:
  – When SE and SEfake are shrunk, both variables are retained: shrinking a known effect may cause correlated SNPs to be selected
  – When SE is not shrunk, only SE is retained: correlated SNPs could be eliminated
• The shrinkage factor was not applied to SE in subsequent analyses (λ = 0)

BLR and Correlated Data
• Can the BLR approach distinguish positive effects from spurious associations in the presence of correlation?
• Four variables correlated with SE were used to evaluate error rates and power
• Records of each variable were re-distributed among cases and controls to achieve different OR sizes while maintaining correlation with SE
• Error rate and power were assessed by permuting cases and controls
  – Error rate: frequency with which variables other than SE and the simulated correlated variables were selected over 100 permutations
  – Power: frequency with which the simulated variables were selected over 100 permutations

Power
• Selection of simulated variables correlated with SE
  – Variables moderately correlated with SE selected if OR > 2
  – Variables highly correlated with SE selected if OR > 5

Error Rate
• Selection of simulated variables correlated with SE
  – Under the null, expect 1 false positive per analysis (λ = 62)
  – Analysis generates 1 to 2 false positives per analysis

ATT - BLR Results Comparison
• Data were analysed by Armitage Trend Test (ATT) and BLR
• Single-point analysis using ATT identified 109 associated SNPs at α = 4.34e-04 = 1/2302
• With λ = 62, BLR identified 10 SNPs
• Variables selected by BLR are not correlated with SE

Variables selected by BLR with the DE (double exponential, i.e. Laplace) prior:

SNP_DE     P_ATT-adj    R² (SNP, SE)
SE         4.2e-61
snp292     1.9e-6       0.03
snp576     2.6e-6       0.02
snp271     3.2e-5       0.02
snp645     9.6e-5       8.5e-6
snp068     2.2e-5       0.04
snp384     2.4e-5       0.002
snp465     9.7e-6       0.03
snp156     2.3e-5       0.001
snp225     3.1e-6       0.05

Additional Analysis: The NEG Distribution
• Data re-analysed using the normal-exponential-gamma (NEG) prior with parameters set to expect 1 false positive per model (Hoggart et al. PLoS (2008))
! NEG has heavier tails to allow sparser solutions
• NEG identified 4 variables, of which three (snp271, snp384, snp545) were also retained by DE
• Variables identified with the NEG prior are less correlated among themselves and with SE than those selected using DE
• Three of the selected variables are in genes/regions reported to contribute to RA susceptibility: BAT1 and HLA-DQA1/DQB1

Conclusions
• BLR appears to perform better than single-point association analysis (ATT) when data are correlated
  – Computationally efficient
  – Identifies fewer positive results (10 vs. 109)
  – Correlation might be more effectively handled
• Simulation analyses confirm reasonable power and error rate
• Three variables identified by both DE and NEG priors lie in genes previously implicated in RA
• Results suggest the presence of independent RA-associated effects in the HLA region

Acknowledgements
• David Balding, Imperial College, UK
• Clive Hoggart, Imperial College, UK
• Aruna Bansal, GSK, UK
• The Genetics Division at GSK
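
Supplementary sketch: shrinkage under a Laplace prior
The shrinkage idea from the "Bayesian Logistic Regression" slide can be illustrated with a small, self-contained Python sketch. It is not the BBR implementation. It assumes that, with a Laplace prior of variance 2/λ² on each βj, the posterior mode minimises the negative log-likelihood plus λ Σ|βj|, and it finds that mode by proximal gradient descent with soft-thresholding. The function names, the `unpenalised` argument (mimicking the choice not to shrink SE), the simulated genotypes and the value of λ are all assumptions made for illustration; in particular, the λ used here need not be on the same scale as the λ = 62 used in the analysis.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(z, t):
    """Shrink z towards zero by t; return exactly zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_laplace_map(X, y, lam, unpenalised=(), n_iter=500):
    """Posterior mode of logistic regression under Laplace priors.

    Minimises  -loglik(b0, beta) + lam * sum_j |beta_j|  by proximal
    gradient descent. Indices listed in `unpenalised` (a hypothetical
    argument) receive no shrinkage; the intercept b0 is never shrunk.
    """
    n, p = len(X), len(X[0])
    # Conservative step size 1/L, where L bounds the curvature of the
    # negative log-likelihood (0.25 * sum of squared design entries).
    step = 4.0 / (n + sum(v * v for row in X for v in row))
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Fitted probability minus outcome, using the current estimates.
        resid = [sigmoid(b0 + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) for j in range(p)]
        b0 -= step * sum(resid)
        for j in range(p):
            z = beta[j] - step * grad[j]
            beta[j] = z if j in unpenalised else soft_threshold(z, step * lam)
    return b0, beta


if __name__ == "__main__":
    # Simulated data (assumption): column 0 plays the role of SE, column 1
    # is a proxy correlated with SE, columns 2-4 are null markers.
    rng = random.Random(0)
    X, y = [], []
    for _ in range(200):
        se = rng.choice([0, 0, 1, 1, 2])
        proxy = se if rng.random() < 0.8 else rng.choice([0, 1, 2])
        nulls = [rng.choice([0, 1, 2]) for _ in range(3)]
        X.append([se, proxy] + nulls)
        y.append(1 if rng.random() < sigmoid(-1.3 + 1.3 * se) else 0)
    for label, free in [("SE shrunk", ()), ("SE not shrunk", (0,))]:
        _, beta = fit_laplace_map(X, y, lam=8.0, unpenalised=free)
        print(label, [round(b, 3) for b in beta])
```

Comparing the two printed coefficient vectors shows the mechanics behind the SE/SEfake experiment: whether a correlated proxy is retained can depend on whether the known effect is itself shrunk. The toy run does not reproduce any result from the study.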