Protocol for meta-analysis of Genome-Wide Associations

timofeevam
Page 1
2/12/2016
Protocol for meta-analysis of Genome-Wide
Association Studies of lung cancer
1. Aims
The purpose of this analysis is to conduct genome-wide analysis of lung cancer
within specific subgroups, in particular by smoking status, histology, sex, age
of onset, stage and family history. This analysis is intended to identify new
susceptibility loci that are specific to particular subgroups and to help
generate additional hypotheses that can be followed up within ILCCO.
2. Description of participating studies
The participating studies are described in Table 1. Data will be
restricted to Caucasians.
3. Analysis plan
Studies will participate with stratified results generated according to a prespecified analytical plan.
I. Analysis will be performed by each participating center separately unless
the center chooses to send its individual-level data to one of the study
partners. All analyses will be performed locally, except for the IARC-related
studies (CE, CARET cohort, Norway cohort; INSERM (FR); Estonia) and the NCI
studies (EAGLE, PLCO, ATBC, CPS-II), which will be analysed at the
coordinating center at IARC/NCI.
II. Results from each individual study will be deposited at the website set up
by MD Anderson (http://epi.mdanderson.org/u19/). All groups will have access
to the uploaded results. The overall meta-analysis will be coordinated by NCI
and IARC, although other groups can take an active role if they wish.
III. Cases from the Liverpool study will be pooled with the R.Houlston study
(please see Table 1).
IV. Imputation based on public reference panels (HapMap2, HapMap3,
1000 Genomes) will take place later (Appendix IV).
Table 1 Summary of participating studies (each cell gives cases/controls; "?" and "-" are as in the original table; blank cells were empty)

Study                     Included studies                         PI                        Total        Male        Female      Age <=50   Age >50
IARC                      CE, CARET, Norway; INSERM (FR); Estonia  P Brennan                 2894/4233    2176/2843   718/1390    381/1146   2513/3087
SLRI                      Toronto study                            R Hung                    333/506      160/190     173/316     42/236     291/270
GELCC                                                                                        194/219      82/129      112/90
MDACC                                                              C. Amos                   1181/1184    677/664     504/520     186/123    995/1061
deCode                                                             T. Rafnar                 885/11301    440/5759    445/5542    62/2871    849/8430
HGF                       HMGU                                     Wichmann, Rish            506/480      314/243     192/237     506/480    -/-
ICR* + Liverpool study**                                           R.Houlston, John K Field  2550/1438    1625/?      905/?       182/?      2368/?
NCI                       EAGLE, PLCO, ATBC, CPS-II                MT Landi                  5739/5848    4132/4916   993/932     82/110     5043/5738
D.Christiani                                                       D.Christiani              1000/1000
Actual total                                                                                 15072/28224  9493/15641  3945/10145  1375/4936  11889/18911
0
Smoking status
never
0
178
1401
95
220
28
14
former
521
993
95
140
80
106
615
677
current
Family history
of lung cancer
2073
1699
91
89
86
99
566
507
34
720
(ever)
415
7038
(ever)
37
220
187
362
1402
914
3837
87
136
?
1946
2041
3747
4859
382
124
?
3410
2384
6840
5311
yes
148
113
18
31
194
10
250
1007
201
1382
-
-
321
685
3789
1796
6331
no
1716
3153
214
383
0
209
928
175
684
9919
-
-
1692
443
3883
3781
14569
SQC
1031
50
84
316
-
189
101
1077
1451
4253
Adeno
595
90
52
615
-
374
194
589
1849
4253
Large cell
46
??
??
??
??
??
??
??
??
Small cell
451
22
34
-
-
110
114
480
706
1833
Histology
Stage (NSCLC)
Stage (SCLC)
I
2
55
II
2
9
III
12
89
IV
15
92
ICR* +
Liverpool
study**
NCI
only for
Liverpool study
YES (at NCI)
Yes
only for
Liverpool study
Partial
? Illumina
500K
Beadchip
Illumina 550K
BeadChip,
HumanHap
300K Beadchip
(for Liverpool
study)
IARC
SLRI
GELCC
MDACC
deCode
Yes (at IARC)
Yes (at IARC)
no
Yes (at IARC)
no
Genotype data
Yes
Yes
Yes
no
Covariates
Core variables
available in
ILCCO
repository
Yes
Yes
yes
yes
Yes
age, sex, BMI (for
CE,CO,CA,UK,dnar
ep), smoking status,
PY,alcohol habits,
family history of
cancer (CO and
dnarep, only)
Yes
extensive
n/a
limited
n/a
availability of
individual level
data in IARC
and NCI
variables
available
Platform used
for genotyping
Availability of
imputed results
and platform
used as a
scaffold for
imputation
HumanHap 300K
Beadchip
Yes
No
Actual total
D.Christiani
extended
questionnaire
HumanHap
300K Beadchip
Imputed using
HapMap
II/1000genome
project (2010-03
pilot 1 release)
* - Numbers are from the paper Broderick et al 2009 (for Phase I).
** - Liverpool study was moved from IARC study
No
HGF
?
HumanHap
300K
Beadchip
HumanHap 300K
Beadchip and
Humancnv370duo Beadchip
Imputed using
HapMap II
Illumina 550K,
317+270K,
610Quad, 1M
na
Imputed using
HapmapIII/1000
genome
4. Generation of results for each study
o Quality control: a standardized quality control procedure will be
performed at each study center separately before the analysis (Appendix I).
o Analysis will be performed at each study center using common scripts
written in MACH or in R/SAS/PLINK, or with the software of the individual
center's choosing, based on the model specifications provided in this
proposal. The R scripts will be provided by NCI later.
o Strategy for analysis
- The main effect of each SNP under a log-additive model will be
tested using unconditional logistic regression adjusted for gender
(not in analyses stratified by gender), age, country/study center and
principal components for population stratification within studies. A
separate analysis adjusting for smoking will be performed (not in
analyses stratified by smoking status).
- Dominant, recessive and co-dominant models will also be tested.
- Models for the subgroup analysis:
1. Logit(case/control status) = β1·SNP + β2·gender + β3-8·age group
+ β9·PC1 + β10·PC2 + β11·PC3 + β12·study_center (if applicable);
i. Age: five-year age intervals defined by age at
diagnosis for cases or at interview for controls (e.g.
<50; 50-54; 55-59; 60-64; 65-69; 70-74; >=75 years old).
The seven groups give six indicator terms (β3 to β8).
ii. At least 1 principal component will be included.
Additional principal components may be included if
judged necessary. The statistical significance of
principal components should be evaluated with a
commonly used test, e.g. the Tracy-Widom statistic
implemented in EIGENSTRAT.
iii. Study center: a variable for participating centers
when multiple genotyping centers participate
within a study.
2. Logit(case/control status) = β1·SNP + β2·gender + β3-8·age group
+ β9·PC1 + β10·PC2 + β11·PC3 + β12·smoking status +
β13·study_center (if applicable);
i. Smoking status: current, former and never smokers,
as defined below.
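The genetic models and the age-group term in the models above can be illustrated with a minimal Python sketch. It assumes genotypes are stored as the number of copies (0, 1 or 2) of the tested allele and that the oldest age group starts at 75 years. The function names `code_genotype` and `age_group_dummies`, and the model label "GENO" for the co-dominant model, are illustrative choices and are not part of the protocol or of any analysis package.

```python
# Lower bounds of age groups 2..7; group 1 (<50 years) is the reference (assumed cut points).
AGE_CUTS = [50, 55, 60, 65, 70, 75]


def code_genotype(allele_count, model):
    """Return the regression term(s) for one genotype under a genetic model.

    allele_count is 0, 1 or 2 copies of the tested allele (assumed coding).
    """
    if allele_count not in (0, 1, 2):
        raise ValueError("allele_count must be 0, 1 or 2")
    if model == "ADD":    # log-additive: one per-allele term
        return [allele_count]
    if model == "DOM":    # dominant: carrier versus non-carrier
        return [1 if allele_count >= 1 else 0]
    if model == "REC":    # recessive: two copies versus fewer
        return [1 if allele_count == 2 else 0]
    if model == "GENO":   # co-dominant: separate heterozygote and homozygote terms
        return [1 if allele_count == 1 else 0, 1 if allele_count == 2 else 0]
    raise ValueError("unknown model: " + model)


def age_group_dummies(age):
    """Six indicators (beta3..beta8) for the seven age groups; <50 is the reference."""
    group = sum(age >= cut for cut in AGE_CUTS)  # 0 for <50, ..., 6 for >=75
    return [1 if group == k else 0 for k in range(1, 7)]
```

For example, a heterozygous 52-year-old contributes `code_genotype(1, "ADD")` and `age_group_dummies(52)` to the design matrix; the remaining covariates (gender, principal components, study center) are added in the same way.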

Definition of subgroups:
1. Gender:
- men
- women
2. Smoking status:
- never smokers
- former smokers (time since quitting >= 2 years)
- current smokers (including those who quit < 2 years ago)
- ever smokers (former and current combined)
3. Age subgroups:
- <= 50 years (early onset lung cancer)
- > 50 years (later onset lung cancer)
4. Histology:
- adenocarcinoma
- large cell lung cancer (if a reasonable number of cases
is available)
- squamous cell lung cancer
- small cell lung cancer
5. Stage:
- stage I/II for NSCLC
- stage III/IV for NSCLC
- limited stage for SCLC
- extensive stage for SCLC
6. Family history of lung cancer:
- first degree relatives with lung cancer: yes
- first degree relatives with lung cancer: no
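The smoking and age definitions above can be sketched as a small classification routine. This is only an illustration under stated assumptions: the function names, the argument names and the label strings are hypothetical, and `years_since_quitting` is assumed to be `None` for subjects who still smoke.

```python
def smoking_status(ever_smoked, years_since_quitting=None):
    """Classify a subject as never, former or current smoker."""
    if not ever_smoked:
        return "never"
    if years_since_quitting is None or years_since_quitting < 2:
        return "current"   # still smoking, or quit less than 2 years ago
    return "former"        # quit 2 or more years ago


def subgroups(sex, age, ever_smoked, years_since_quitting=None):
    """Return the set of subgroup labels one subject belongs to."""
    status = smoking_status(ever_smoked, years_since_quitting)
    labels = {"sex:" + sex, "smoking:" + status}
    if status in ("former", "current"):
        labels.add("smoking:ever")  # ever smokers = former and current combined
    labels.add("age:<=50" if age <= 50 else "age:>50")
    return labels
```

A subject can fall into several subgroups at once, which is why the sketch returns a set of labels rather than a single value.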
Output files in text (txt) or csv format should include the following
variables (an example SAS macro to create the requested output files is
presented in Appendix II; an example of the output file is given in
Appendix III):
- rs number for the SNP
- reference allele and risk allele
- frequency of minor allele (MAF) in cases
- frequency of minor allele (MAF) in controls
- number of cases
- number of controls
- test (additive model, co-dominant model, dominant or
recessive model)
- point estimate (OR)
- standard error
- lower 95% CL (not needed if standard error is given)
- upper 95% CL (not needed if standard error is given)
- P-value
- results of the HWE exact test in controls (a p-value should be
provided for each SNP, either in a separate file or in the
same file).
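The confidence limits are optional because they can be recomputed from the odds ratio and its standard error. The sketch below assumes that the reported standard error is that of log(OR), which is consistent with the example values in Appendix III (OR 2.59, standard error 0.95, limits 0.4 and 16.76). The function name `or_confidence_limits` is illustrative.

```python
import math


def or_confidence_limits(odds_ratio, se_log_or, z=1.959964):
    """Return (lower, upper) 95% confidence limits of an odds ratio.

    se_log_or is assumed to be the standard error of log(OR).
    """
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return lower, upper
```

Small differences from reported limits are expected when the OR and the standard error have been rounded before the limits are recomputed.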
5. Uploading of individual study results
The results of each individual study should be uploaded to the U19 TRICL
website hosted by M.D. Anderson (http://epi.mdanderson.org/u19/).
Instructions will be provided to the group at a later date.
6. Overall meta-analysis
o Heterogeneity among the participating studies will be assessed by the
log-likelihood ratio test and the Q statistic.
o Based on the heterogeneity between participating studies, a random- or
fixed-effects model will be selected to combine the effect estimates
from all studies and to estimate combined ORs and their significance
levels.
o Influence analysis, where each study is excluded one at a time to
examine the effect on the pooled estimate, will be used to detect
outliers (this will be affected by the 2-3 largest studies).
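The three steps above can be sketched for a single SNP as follows. The protocol does not name a specific random-effects estimator; the DerSimonian-Laird estimator is used here only as one common choice, and the function names are illustrative. Inputs are assumed to be per-study log(OR) estimates and their standard errors, held in lists, from at least two studies.

```python
import math


def fixed_effect(betas, ses):
    """Inverse-variance weighted pooled log(OR) and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def cochran_q(betas, ses):
    """Cochran's Q heterogeneity statistic (df = number of studies - 1)."""
    pooled, _ = fixed_effect(betas, ses)
    return sum((b - pooled) ** 2 / se ** 2 for b, se in zip(betas, ses))


def random_effects(betas, ses):
    """DerSimonian-Laird pooled log(OR) and standard error (assumed estimator)."""
    weights = [1.0 / se ** 2 for se in ses]
    q = cochran_q(betas, ses)
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(betas) - 1)) / c)  # between-study variance
    inflated = [math.sqrt(se ** 2 + tau2) for se in ses]
    return fixed_effect(betas, inflated)


def leave_one_out(betas, ses):
    """Pooled fixed-effect OR with each study excluded in turn (influence analysis)."""
    pooled_ors = []
    for i in range(len(betas)):
        rest_b = betas[:i] + betas[i + 1:]
        rest_s = ses[:i] + ses[i + 1:]
        pooled_ors.append(math.exp(fixed_effect(rest_b, rest_s)[0]))
    return pooled_ors
```

A study whose exclusion shifts the pooled OR markedly is a candidate outlier; because the weights are inverse variances, the largest studies dominate the pooled estimate, which is the caveat noted in the last bullet.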
7. Time Line
Table 2 Timeline for the project (months 1 to 6)
Tasks scheduled over the six months:
- combine data/writing scripts/impute missing variables
- main and subgroup analysis
- meta-analysis at IARC/NCI
- writing and submitting of manuscript
8. Authorship Policy
U19 Area 1 authorship: Authorship is determined primarily by the number of
samples contributed to the meta-analysis and secondarily by the effort
provided for analysis.
Appendix I. Standardized quality control procedure (each center will have already done
this)
The following tests and cut-offs are suggested for the quality control procedure.
o Genotype call rate > 95%
o Call rate per person > 90% (i.e. missing rate per person < 10%)
o A test for deviation from HWE will be run in controls. SNPs will not be
excluded based on HWE, but p-values for the exact test of HWE will be
provided for each SNP in the study. If the top SNPs show Hardy-Weinberg
disequilibrium, the SNP will be tested by PCR.
o Sex chromosome heterozygosity rate (exclude men with heterozygosity > 0.10
and women with heterozygosity < 0.20)
o Duplicates (identical IDs, or highly correlated samples identified by
calculation of genome-wide IBD given IBS information, e.g. in PLINK
PI_HAT > 0.20)
o Population outliers (exclude samples with an ancestry probability of being
Caucasian < 80%)
o Whole genome heterozygosity rate (exclude samples outside mean
heterozygosity ± 6 SD)
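The sample-level cut-offs above can be sketched as simple filters. This is a minimal illustration, not the procedure used at any center: the function and argument names are hypothetical, the per-person criterion is read as a call rate above 90%, and the sex check is read as excluding men with X-chromosome heterozygosity above 0.10 and women below 0.20.

```python
import statistics


def heterozygosity_outliers(het_rates, n_sd=6.0):
    """Indices of samples whose genome-wide heterozygosity lies outside mean +/- n_sd SD."""
    mean = statistics.mean(het_rates)
    sd = statistics.stdev(het_rates)
    return [i for i, h in enumerate(het_rates) if abs(h - mean) > n_sd * sd]


def sample_qc_failures(call_rate, sex, x_het, max_pi_hat, caucasian_prob):
    """List the reasons (possibly none) for excluding one sample."""
    reasons = []
    if call_rate <= 0.90:
        reasons.append("low call rate")
    if (sex == "male" and x_het > 0.10) or (sex == "female" and x_het < 0.20):
        reasons.append("sex check")
    if max_pi_hat > 0.20:          # highest PI_HAT with any other sample
        reasons.append("duplicate or related")
    if caucasian_prob < 0.80:
        reasons.append("population outlier")
    return reasons
```

Returning the list of reasons, rather than a single pass/fail flag, makes it easier to report how many samples each criterion removes.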
Appendix II Example of a macro to create requested output format
/* Example calls (to be run after the macro definition below has been compiled) */
%readplink(path= ,file= ,model1='HOM',model2='HET'); /* for codominant model */
%readplink(path= ,file= ,model1='ADD',model2='ADD'); /* for additive model */
%readplink(path= ,file= ,model1='DOM',model2='DOM'); /* for dominant model */
%readplink(path= ,file= ,model1='REC',model2='REC'); /* for recessive model */
/* ********* */
options nomprint;
%global N;
/* path - location of input PLINK files; file - name of the output file;
analysis - type of analysis, e.g. for all subjects, for men or for women only, etc. */
/* To run a stratified analysis, a cluster file should be created using --write-cluster
based on the file with covariates (see page 56 of the PLINK manual). */
%macro readplink(path=,file=, model1=,model2=);
libname i "\&path.";
libname meta "\&path.\database";
data freqco; attrib rs format=$11.;
infile "\&path.freq\frco.frq" missover truncover;
input CHR rs $ A1_co $ A2_co $ MAFco $ NCHROBS_co; if rs ne 'SNP'; /* drop the header row */
N_controls = NCHROBS_co/2;
proc sort; by rs;
/*STAT - coefficient t-statistic*/
data freqca; attrib rs format=$11.;
infile "\&path.freq\frca.frq" missover truncover;
input CHR rs $ A1_ca $ A2_ca $ MAFca $ NCHROBS_ca; if rs ne 'SNP'; /* drop the header row */
N_cases = NCHROBS_ca/2;
proc sort; by rs;
data hwe; attrib rs test_hwe GENO format=$14.;
infile "\&path.hwe\hwe.hwe" missover truncover;
input CHR rs $ test_hwe $ A1_hwe $ A2_hwe $ GENO $ ObsHet ExpHet Phwe;
if test_hwe eq 'UNAFF';
proc sort; by rs;
data outall; attrib rs format=$11.;
infile "\&path.LR\&file..assoc.logistic" missover truncover;
input chr rs $ BP Allele $ MODEL $ N OR SE L95 U95 STAT P ;
marker=rs;
IF MODEL eq &model1. or MODEL eq &model2.;
if p ne .;
proc contents out=u;
data rien; set u; where NAME='chr';
call symput('N',nobs);
proc sort data=outall; by rs;
proc sort data=meta.genes317k; by rs;
data all; merge outall (in=ina) meta.genes317k freqco freqca hwe ; by rs;
if ina ;
posk=int(BP/1000000);
LOGP=-log10(p);
sign='
';
if logp>4.5 then sign=marker;
ul=u95;
logor=log(or);
proc sort; by chr position;
data t; set all;
retain posk2 0;
if chr=1 then posk2=posk;
if chr=2 then posk2=posk+247;
if chr=3 then posk2=posk+489;
if chr=4 then posk2=posk+688;
if chr=5 then posk2=posk+879;
if chr=6 then posk2=posk+1059;
if chr=7 then posk2=posk+1229;
if chr=8 then posk2=posk+1387;
if chr=9 then posk2=posk+1533;
if chr=10 then posk2=posk+1673;
if chr=11 then posk2=posk+1808;
if chr=12 then posk2=posk+1942;
if chr=13 then posk2=posk+2074;
if chr=14 then posk2=posk+2188;
if chr=15 then posk2=posk+2294;
if chr=16 then posk2=posk+2394;
if chr=17 then posk2=posk+2482;
if chr=18 then posk2=posk+2560;
if chr=19 then posk2=posk+2636;
if chr=20 then posk2=posk+2699;
if chr=21 then posk2=posk+2761;
if chr=22 then posk2=posk+2807;
if chr=23 then posk2=posk+2856;
imp=0;
if substr(rs,1,1)='i' then imp=1;
data t0; set t;
file "D:\Mntimofeeva\&path.graph&file..csv";
put marker ',' chr ',' BP ',' or ',' l95 ',' u95 ',' logp ',' posk2 ',' sign ',' _n_ ',' N ',' p;
proc sort data=all; by p;
data &file; set all;
sign='
';
theoricp=-Log10((_n_/&N));
if substr(rs,1,1)='i' then imp=1; else imp=2;
if allele=A1_ca then ourmafca=mafca;
if allele=A2_ca then ourmafca=1-mafca;
if allele=A1_co then ourmafco=mafco;
if allele=A2_co then ourmafco=1-mafco;
if allele = A1_hwe then risk_allele = A2_hwe;
if allele = A2_hwe then risk_allele = A1_hwe;
data &file._results; set &file;
proc sort data = &file._results; by chr BP;
data &file._results; set &file._results;
file "\&path.result_&file..csv";
if _n_ = 1 then put 'rs_number , reference_allele , risk_allele , MAF_cases , MAF_controls , N_cases , N_controls , model
, OR , StError , Low95%CI , Upper95%CI , p-value , p_value_HWE ' ;
put rs ',' Allele ',' risk_allele ',' ourmafca F5.3 ',' ourmafco F5.3 ',' N_cases ',' N_controls ',' MODEL ',' or F5.3 ',' SE F5.3 ','
l95 F5.3 ',' u95 F5.3 ',' p ',' Phwe ;
data i.&file; set &file;
proc sort; by p;
data g; set i.&file;
posk2=int(position/1000);
if _n_<1001 then do;
file "\&path.top&file..csv";
if _n_=1 then do;
put "\&path.top&file..csv";
put 'marker , segment, position , N , allele, model, or , l95 , u95 , p , rank , MAF_cases ,MAF_controls, N_cases ,
N_controls, GeneSymbol,distance, Location,coding_status,AminoAcidChange ' ;
end;
put rs ',' chr ',' posk2 ',' N ',' allele ',' MODEL ',' or F5.2 ',' l95 F5.2 ',' u95 F5.2 ',' p ',' _n_ ',' ourmafca F5.3 ',' ourmafco
F5.3 ',' N_cases ',' N_controls ',' GeneSymbol ',' locationrelativetogene ','
Location ',' coding_status ',' AminoAcidChange ;
end;
file "\&path.qqplot&file..csv";
if _n_=1 then put 'marker , chrom , position , or , l95 , u95 , p , logp ,theoricp ,imp ' ;
put rs ',' chr ',' position F20. ',' or F5.2',' l95 F5.2',' u95 ',' p ',' logp ',' theoricp ',' imp;
run;
%mend;
Appendix III Example of the output file
rs_number,reference_allele,risk_allele,MAF_cases,MAF_controls,N_cases,N_controls,model,OR,StError,Low95%CI,Upper95%CI,p-value,p_value_HWE
rs3934834,T,C,0.201,0.13,144,111,REC,2.59,0.95,0.4,16.76,0.3184,1
rs3737728,A,G,0.3,0.331,148,113,REC,0.44,0.49,0.17,1.14,0.09014,1
rs6687776,T,C,0.168,0.172,148,113,REC,0.48,0.89,0.08,2.74,0.4078,1
rs9651273,A,G,0.277,0.278,148,113,REC,0.88,0.48,0.34,2.26,0.7902,0.5667
rs4970405,G,A,0.074,0.07,147,113,REC,1.04,1.36,0.07,15.01,0.976,0.4069
rs12726255,G,A,0.142,0.15,147,113,REC,0.31,1.01,0.04,2.24,0.245,0.4495
rs2298217,T,C,0.136,0.15,147,113,REC,1.28,1.11,0.15,11.17,0.8248,1
rs4970362,A,G,0.425,0.379,134,108,REC,1.45,0.39,0.68,3.09,0.3366,1
Appendix IV.Preliminary guidelines for future imputation and meta-analysis of
imputed data
Aim of imputation:
1) To increase the number of SNPs by imputing ungenotyped variants using a
common reference panel, HapMap 2.0 release 22, including all reference
samples (CEU+YRI+JPT+CHB; overall 420 haplotypes: 120 CEU, 120 YRI,
90 CHB and 90 JPT; and ~3 million SNPs).
NOTE FOR DISCUSSION: for the HapMap 2 project two already phased releases are available
on the HapMap website (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/). Should we use
release 21 or release 22? Release 21 is based on build 35, release 22 on build 36. I believe
that if we are interested in combining all populations, we should use phasing based on consensus
SNPs. On the HapMap website phased consensus haplotypes are available only for release 21.
On the website for Impute v1, consensus haplotypes are also available for all 4 populations for
release 22.
2) To normalize the set of studied SNPs to the reference panel, which allows
easy meta-analysis of common SNPs.
Reference Panel
The HapMap 2 r21 or r22 (NCBI B36 assembly) reference panel of already phased
haplotypes is suggested for imputation.
Study-specific quality control for variants and individuals
1. Quality control procedure standardized among the individual studies (genotype
missing rates, HWE and sex check, heterozygosity rate and population
outliers; please see Appendix I of the protocol for meta-analysis of
GWAS of lung cancer)
2. Exclude SNPs with a large difference (>5%) in the proportion of
missing genotypes between cases and controls
Imputation procedure to be performed by each center
1. Standardization of the study genotyping data to the physical positions and
strand orientation of the reference phased haplotypes (build 36 for HapMap
2).
2. Imputation will be performed by each participating center separately. Within
the participating center, imputation will ideally be performed by study (or
country of origin) and genotyping platform.
3. Program for imputation: MACH; potentially use IMPUTE 2 to provide
alternative methods to investigate regions of interest.
4. The following steps describe the imputation in MACH:
o Input file formats: Merlin format data and pedigree files
o Two-step imputation to speed up the process
a. Calculate the error map and crossover map using a random subset of 200
individuals, applying 100 iterations
e.g. mach1 -d mach.dat -p subset.ped -s chr1.snp -h chr1.hap --compact --greedy --autoFlip -r 100 -o chr1 > chr1mach.infer.log
b. Impute all SNPs using the parameter estimates calculated at the first step (a)
e.g. mach1 -d mach.dat -p mach.ped -s chr1.snps -h chr1.hap --greedy --autoFlip --errorMap chr1.erate --crossoverMap chr1.rec --mle --mldetails --dosage -o chr1.imp2.log
Analysis after imputation
1. The analysis should take imputation uncertainty into account (the posterior
probability of every genotype).
2. Regression program: ProbABEL (MACH output files can be used directly in
ProbABEL).
3. The standardized models for the association analysis should be used (please
see the main text of the “Protocol for meta-analysis of GWAS of lung
cancer”).
4. Output files of ProbABEL should be sent to the coordinating center/centers,
where the subsequent meta-analysis will be performed (an example of the output
file is given in the Appendix II).
5. Direct genotyping of hits for validation.
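One common way to take imputation uncertainty into account is to regress on the expected allele dosage computed from the three genotype posterior probabilities, rather than on the single most likely genotype. The sketch below illustrates the calculation only; the function names are hypothetical and it does not reproduce the file formats of MACH or ProbABEL.

```python
def expected_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles given posterior probabilities of AA, AB and BB."""
    total = p_aa + p_ab + p_bb
    if abs(total - 1.0) > 1e-3:
        raise ValueError("posterior probabilities must sum to 1")
    return p_ab + 2.0 * p_bb


def best_guess_genotype(p_aa, p_ab, p_bb):
    """Most likely number of B alleles; this discards the uncertainty."""
    probs = [p_aa, p_ab, p_bb]
    return probs.index(max(probs))
```

For posterior probabilities (0.1, 0.3, 0.6) the expected dosage is 1.5 while the best-guess genotype is 2, so the dosage carries the uncertainty into the regression and the best guess does not.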