timofeevam Page 1 2/12/2016

Protocol for meta-analysis of Genome-Wide Association Studies of lung cancer

1. Aims

The purpose of this analysis is to conduct genome-wide analyses of lung cancer within specific subgroups, in particular those defined by smoking status, histology, sex, age of onset, stage and family history. The analysis is intended to identify new susceptibility loci that are specific to particular subgroups and to generate additional hypotheses that can be followed up within ILCCO.

2. Description of participating studies

The participating studies are described in Table 1. Data will be restricted to Caucasians.

3. Analysis plan

Studies will participate with stratified results generated according to a prespecified analytical plan.

I. Analysis will be performed by each participating center separately, unless a center chooses to send its individual-level data to one of the study partners. All analyses will be performed locally, except for the IARC-related studies (CE, CARET cohort, Norway cohort; INSERM (FR); Estonia) and the NCI studies (EAGLE, PLCO, ATBC, CPS-II), which will be analysed at the coordinating centers at IARC/NCI.
II. Results from each individual study will be deposited on the website set up by MD Anderson (http://epi.mdanderson.org/u19/). All groups will have access to the uploaded results. The overall meta-analysis will be coordinated by NCI and IARC, although other groups can take an active role if they wish.
III. Cases from the Liverpool study will be pooled with the R. Houlston study (see Table 1).
IV. Imputation based on public reference panels (HapMap2, HapMap3, 1000 Genomes) will take place later (Appendix IV).

Table 1 Summary of participating studies

IARC CE, CARET, Norway; INSERM (FR); Estonia Included studies PI SLRI MDACC deCode HGF T. Rafnar HMGU Wichmann, Rish Toronto study P Brennan Total GELCC R Hung cases controls cases controls 2894 4233 333 506 C.
Amos ICR* + Liverpool study** cases controls cases controls cases controls 194 219 1181 1184 885 11301
Actual total NCI D.Christiani EAGLE, PLCO, ATBC, CPS-II R.Houlston, John K Field cases controls cases controls 506 480 2550 1438
MT Landi D.Christiani cases controls cases controls cases controls 5739 5848 1000 1000 15072 28224
Sex
male 2176 2843 160 190 82 129 677 664 440 5759 314 243 1625 ? 4132 4916 9493 15641
female 718 1390 173 316 112 90 504 520 445 5542 192 237 905 ? 993 932 3945 10145
<=50 381 1146 42 236 186 123 62 2871 506 480 182 ? 82 110 1375 4936
>50 2513 3087 291 270 995 1061 849 8430 - - 2368 ? 5043 5738 11889 18911
Age 0
Smoking status
never 0 178 1401 95 220 28 14
former 521 993 95 140 80 106 615 677
current Family history of lung cancer 2073 1699 91 89 86 99 566 507 34 720 (ever) 415 7038 (ever) 37 220 187 362 1402 914 3837 87 136 ? 1946 2041 3747 4859 382 124 ? 3410 2384 6840 5311
yes 148 113 18 31 194 10 250 1007 201 1382 - - 321 685 3789 1796 6331
no 1716 3153 214 383 0 209 928 175 684 9919 - - 1692 443 3883 3781 14569
SQC 1031 50 84 316 - 189 101 1077 1451 4253
Adeno 595 90 52 615 - 374 194 589 1849 4253
Large cell 46 ?? ?? ?? ?? ?? ?? ?? ??
Small cell 451 22 34 - - 110 114 480 706 1833
Histology Stage (NSCLC) Stage (SCLC)
I 2 55
II 2 9
III 12 89
IV 15 92
ICR* + Liverpool study** NCI only for Liverpool study YES (at NCI) Yes only for Liverpool study Partial ?
Illumina 500K Beadchip Illumina 550K BeadChip, HumanHap 300K Beadchip (for Liverpool study)
IARC SLRI GELCC MDACC deCode Yes (at IARC) Yes (at IARC) no Yes (at IARC) no
Genotype data Yes Yes Yes no
Covariates Core variables available in ILCCO repository Yes Yes yes yes Yes
age, sex, BMI (for CE, CO, CA, UK, dnarep), smoking status, PY, alcohol habits, family history of cancer (CO and dnarep only) Yes extensive n/a limited n/a
availability of individual level data in IARC and NCI variables available
Platform used for genotyping
Availability of imputed results and platform used as a scaffold for imputation
HumanHap 300K Beadchip Yes No Actual total D.Christiani extended questionnaire HumanHap 300K Beadchip Imputed using HapMap II/1000genome project (2010-03 pilot 1 release)
No HGF ? HumanHap 300K Beadchip HumanHap 300K Beadchip and Humancnv370duo Beadchip Imputed using HapMap II Illumina 550K, 317+270K, 610Quad, 1M na Imputed using HapmapIII/1000 genome
* - Numbers are from the paper Broderick et al 2009 (for Phase I).
** - Liverpool study was moved from IARC study

4. Generation of results for each study

o Quality control: a standardized quality control procedure will be performed at each study center separately before the analysis (Appendix I).
o Analysis will be performed at each study center using common scripts written for MACH, R, SAS or PLINK, or with software of the individual center's choosing, based on the model specifications provided in this proposal. The R scripts will be provided by NCI later.
o Strategy for analysis: The main effect of each SNP under a log-additive model will be tested using unconditional logistic regression adjusted for gender (except in analyses stratified by gender), age, country/study center and principal components for population stratification within studies. A separate analysis additionally adjusted for smoking will be performed (except in analyses stratified by smoking status).
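As an illustration of the log-additive coding described above, and of the alternative codings for the dominant, recessive and co-dominant models that the protocol also tests, the sketch below maps a genotype to regression predictor values. It is a minimal sketch, not one of the common scripts: the function name `code_genotype` and the model labels are assumptions chosen for illustration, and the genotype is assumed to be given as the number of risk alleles carried (0, 1 or 2).

```python
def code_genotype(risk_allele_count, model):
    """Return the predictor value(s) for one genotype under a genetic model.

    risk_allele_count: copies of the risk allele carried (0, 1 or 2).
    model: 'ADD', 'DOM', 'REC' or 'COD' (illustrative labels, assumed here).
    """
    if risk_allele_count not in (0, 1, 2):
        raise ValueError("risk_allele_count must be 0, 1 or 2")
    if model == "ADD":
        # Log-additive: one predictor, so each extra copy multiplies the odds by the same OR.
        return (risk_allele_count,)
    if model == "DOM":
        # Dominant: carriers of at least one copy versus non-carriers.
        return (1 if risk_allele_count >= 1 else 0,)
    if model == "REC":
        # Recessive: two copies versus zero or one copy.
        return (1 if risk_allele_count == 2 else 0,)
    if model == "COD":
        # Co-dominant: separate indicators for heterozygotes and risk homozygotes.
        return (1 if risk_allele_count == 1 else 0,
                1 if risk_allele_count == 2 else 0)
    raise ValueError("unknown model: %r" % (model,))


# A heterozygote under each coding
for label in ("ADD", "DOM", "REC", "COD"):
    print(label, code_genotype(1, label))
```

Under the co-dominant coding two coefficients are estimated per SNP, which is why that model is reported with two test labels rather than one.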
Dominant, recessive and co-dominant models will also be tested.

Models for the subgroup analysis:

1. Logit(case/control status) = β1*SNP + β2*gender + β3-8*age group + β9*PC1 + β10*PC2 + β11*PC3 + β12*study_center (if applicable);
   i. Age: five-year age intervals, defined by age at diagnosis for cases or age at interview for controls (e.g. <50; 50-54; 55-59; 60-64; 65-69; 70-74; >=75 years old).
   ii. At least 1 principal component will be included. Additional principal components may be included if judged necessary. The statistical significance of principal components should be evaluated with a commonly used test, e.g. the Tracy-Widom statistic implemented in EIGENSTRAT.
   iii. Study center: a variable identifying the participating centers, used when multiple genotyping centers participate within a study.
2. Logit(case/control status) = β1*SNP + β2*gender + β3-8*age group + β9*PC1 + β10*PC2 + β11*PC3 + β12*smoking status + β13*study_center (if applicable);
   iv. Smoking status: current, former and never smokers, as defined below.

Definition of subgroups:

1. Gender:
   o men
   o women
2. Smoking status:
   o never smokers
   o former smokers (time since quitting >= 2 years)
   o current smokers (time since quitting < 2 years)
   o ever smokers (former and current combined)
3. Age subgroups:
   o <= 50 years (early onset lung cancer)
   o > 50 years (later onset lung cancer)
4. Histology:
   o adenocarcinoma
   o large cell lung cancer (if a reasonable number of cases is available)
   o squamous cell lung cancer
   o small cell lung cancer
5. Stage:
   o stage I/II for NSCLC
   o stage III/IV for NSCLC
   o limited stage for SCLC
   o extensive stage for SCLC
6. Family history of lung cancer:
   o first degree relatives with lung cancer: yes
   o first degree relatives with lung cancer: no

Output files in text (txt) or csv format should include the following variables (an example SAS macro that creates the requested output files is presented in Appendix II; an example of the output file is given in Appendix III):
o rs number for the SNP
o reference allele and risk allele
o frequency of minor allele (MAF) in cases
o frequency of minor allele (MAF) in controls
o number of cases
o number of controls
o test (additive, co-dominant, dominant or recessive model)
o point estimate (OR)
o standard error
o lower 95% CL (not needed if the standard error is given)
o upper 95% CL (not needed if the standard error is given)
o p-value
o results of the HWE exact test in controls (a p-value should be provided for each SNP, either in a separate file or in the same file)

5. Uploading of individual study results

The results of each individual study should be uploaded to the U19 TRICL website hosted by M.D. Anderson (http://epi.mdanderson.org/u19/). Instructions will be provided to the group at a later date.

6. Overall meta-analysis

Heterogeneity among the participating studies will be assessed by the log-likelihood ratio test and the Q statistic.
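The Q statistic mentioned above can be sketched as follows. This is a minimal illustration, assuming each study reports a log odds ratio and its standard error for a SNP; the function name `cochran_q` and the example inputs are hypothetical, and this is not the coordinating centers' actual script. The inverse-variance pooled estimate is included because Q is defined around it.

```python
import math


def cochran_q(log_ors, std_errs):
    """Inverse-variance pooled log OR, its standard error, Cochran's Q and I^2."""
    if len(log_ors) != len(std_errs) or len(log_ors) < 2:
        raise ValueError("need matching estimates and standard errors from at least 2 studies")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errs]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Q: weighted squared deviations of the study estimates from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    # I^2: share of total variation attributed to heterogeneity, truncated at 0.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"pooled_log_or": pooled, "pooled_se": pooled_se,
            "q": q, "df": df, "i_squared": i_squared}


# Hypothetical per-study odds ratios and standard errors of the log OR
result = cochran_q([math.log(1.30), math.log(1.25), math.log(1.10)],
                   [0.10, 0.08, 0.12])
print("pooled OR: %.3f, Q = %.3f on %d df"
      % (math.exp(result["pooled_log_or"]), result["q"], result["df"]))
```

Under homogeneity, Q is compared with a chi-square distribution on k-1 degrees of freedom, where k is the number of studies.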
Based on the heterogeneity between participating studies, a random-effects or fixed-effects model will be selected to combine the effect estimates from all studies and to estimate the combined ORs and their significance levels. Influence analysis, in which each study is excluded one at a time to examine the effect on the pooled estimate, will be used to detect outliers (this will be affected by the 2-3 largest studies).

7. Time Line

Table 2 Timeline for the project (months 1 to 6)
o combine data / writing scripts / impute missing variables
o main and subgroup analysis
o meta-analysis at IARC/NCI
o writing and submitting of manuscript

8. Authorship Policy

U19 Area 1 authorship: Authorship is determined primarily by the number of samples contributed to the meta-analysis and secondarily by the effort provided for analysis.

Appendix I. Standardized quality control procedure (each center will have already done this)

The following tests and cut-offs are suggested for the quality control procedure.
o Genotype call rate > 95%
o Call rate per person > 90%
o Test for deviation from HWE will be run in controls. SNPs will not be excluded based on HWE, but p-values for the exact test of HWE will be provided for each SNP in the study. If the top SNPs show HW disequilibrium, the SNP will be tested by PCR.
o Sex chromosome heterozygosity rate (heterozygosity > 0.10 for men and < 0.20 for women)
o Duplicates (identical IDs, highly correlated samples identified by calculation of genome-wide IBD given IBS information, e.g.
in PLINK PI_HAT > 0.20)
o Population outliers (ancestry probability of being Caucasian < 80%)
o Whole genome heterozygosity rate (< mean heterozygosity - 6 SD or > mean heterozygosity + 6 SD)

Appendix II. Example of a macro to create requested output format

/* Example calls; submit them after the macro definition below has been compiled */
%readplink(path= ,file= ,model1='HOM',model2='HET'); /* for codominant model */
%readplink(path= ,file= ,model1='ADD',model2='ADD'); /* for additive model */
%readplink(path= ,file= ,model1='DOM',model2='DOM'); /* for dominant model */
%readplink(path= ,file= ,model1='REC',model2='REC'); /* for recessive model */

options nomprint;
%global N;

/* path - location of input PLINK files; file - name of the output file;
   analysis - type of analysis, e.g. for all subjects, for men or for women only, etc. */
/* To run a stratified analysis, a cluster file should be created using --write-cluster
   based on the file with covariates (see page 56 of the PLINK manual). */

%macro readplink(path=, file=, model1=, model2=);

  libname i "\&path.";
  libname meta "\&path.\database";

  /* Allele frequencies in controls */
  data freqco;
    attrib rs format=$11.;
    infile "\&path.freq\frco.frq" missover truncover;
    input CHR rs $ A1_co $ A2_co $ MAFco $ NCHROBS_co;
    if rs ne 'SNP';                 /* drop the header row */
    N_controls = NCHROBS_co/2;
  proc sort; by rs;

  /* Allele frequencies in cases */
  data freqca;
    attrib rs format=$11.;
    infile "\&path.freq\frca.frq" missover truncover;
    input CHR rs $ A1_ca $ A2_ca $ MAFca $ NCHROBS_ca;
    if rs ne 'SNP';                 /* drop the header row */
    N_cases = NCHROBS_ca/2;
  proc sort; by rs;

  /* HWE exact test, controls only */
  data hwe;
    attrib rs test_hwe GENO format=$14.;
    infile "\&path.hwe\hwe.hwe" missover truncover;
    input CHR rs $ test_hwe $ A1_hwe $ A2_hwe $ GENO $ ObsHet ExpHet Phwe;
    if test_hwe eq 'UNAFF';
  proc sort; by rs;

  /* Logistic regression results; STAT - coefficient t-statistic */
  data outall;
    attrib rs format=$11.;
    infile "\&path.LR\&file..assoc.logistic" missover truncover;
    input chr rs $ BP Allele $ MODEL $ N OR SE L95 U95 STAT P;
    marker=rs;
    if MODEL eq &model1. or MODEL eq &model2.;
    if p ne .;
  proc contents out=u;
  data rien; set u; where NAME='chr'; call symput('N',nobs);
  proc sort data=outall; by rs;
  proc sort data=meta.genes317k; by rs;

  data all;
    merge outall (in=ina) meta.genes317k freqco freqca hwe;
    by rs;
    if ina;
    posk=int(BP/1000000);
    LOGP=-log10(p);
    sign=' ';
    if logp>4.5 then sign=marker;
    ul=u95;
    logor=log(or);
  proc sort; by chr position;

  /* Cumulative genome position (Mb) for plotting */
  data t;
    set all;
    retain posk2 0;
    if chr=1 then posk2=posk;
    if chr=2 then posk2=posk+247;
    if chr=3 then posk2=posk+489;
    if chr=4 then posk2=posk+688;
    if chr=5 then posk2=posk+879;
    if chr=6 then posk2=posk+1059;
    if chr=7 then posk2=posk+1229;
    if chr=8 then posk2=posk+1387;
    if chr=9 then posk2=posk+1533;
    if chr=10 then posk2=posk+1673;
    if chr=11 then posk2=posk+1808;
    if chr=12 then posk2=posk+1942;
    if chr=13 then posk2=posk+2074;
    if chr=14 then posk2=posk+2188;
    if chr=15 then posk2=posk+2294;
    if chr=16 then posk2=posk+2394;
    if chr=17 then posk2=posk+2482;
    if chr=18 then posk2=posk+2560;
    if chr=19 then posk2=posk+2636;
    if chr=20 then posk2=posk+2699;
    if chr=21 then posk2=posk+2761;
    if chr=22 then posk2=posk+2807;
    if chr=23 then posk2=posk+2856;
    imp=0;
    if substr(rs,1,1)='i' then imp=1;

  data t0;
    set t;
    file "D:\Mntimofeeva\&path.graph&file..csv";
    put marker ',' chr ',' BP ',' or ',' l95 ',' u95 ',' logp ',' posk2 ',' sign ',' _n_ ',' N ',' p;

  proc sort data=all; by p;

  data &file;
    set all;
    sign=' ';
    theoricp=-Log10((_n_/&N));
    if substr(rs,1,1)='i' then imp=1; else imp=2;
    if allele=A1_ca then ourmafca=mafca;
    if allele=A2_ca then ourmafca=1-mafca;
    if allele=A1_co then ourmafco=mafco;
    if allele=A2_co then ourmafco=1-mafco;
    if allele = A1_hwe then risk_allele = A2_hwe;
    if allele = A2_hwe then risk_allele = A1_hwe;

  /* Requested output file, ordered by chromosome and position */
  data &file._results; set &file;
  proc sort data = &file._results; by chr BP;
  data &file._results;
    set &file._results;
    file "\&path.result_&file..csv";
    if _n_ = 1 then put 'rs_number , reference_allele , risk_allele , MAF_cases , MAF_controls , N_cases , N_controls , model , OR , StError , Low95%CI , Upper95%CI , p-value , p_value_HWE ' ;
    put rs ',' Allele ',' risk_allele ',' ourmafca F5.3 ',' ourmafco F5.3 ',' N_cases ',' N_controls ',' MODEL ',' or F5.3 ',' SE F5.3 ',' l95 F5.3 ',' u95 F5.3 ',' p ',' Phwe ;

  data i.&file; set &file;
  proc sort; by p;

  /* Top 1000 SNPs and QQ-plot input */
  data g;
    set i.&file;
    posk2=int(position/1000);
    if _n_<1001 then do;
      file "\&path.top&file..csv";
      if _n_=1 then do;
        put "\&path.top&file..csv";
        put 'marker , segment, position , N , allele, model, or , l95 , u95 , p , rank , MAF_cases ,MAF_controls, N_cases , N_controls, GeneSymbol,distance, Location,coding_status,AminoAcidChange ' ;
      end;
      put rs ',' chr ',' posk2 ',' N ',' allele ',' MODEL ',' or F5.2 ',' l95 F5.2 ',' u95 F5.2 ',' p ',' _n_ ',' ourmafca F5.3 ',' ourmafco F5.3 ',' N_cases ',' N_controls ',' GeneSymbol ',' locationrelativetogene ',' Location ',' coding_status ',' AminoAcidChange ;
    end;
    file "\&path.qqplot&file..csv";
    if _n_=1 then put 'marker , chrom , position , or , l95 , u95 , p , logp ,theoricp ,imp ' ;
    put rs ',' chr ',' position F20. ',' or F5.2 ',' l95 F5.2 ',' u95 ',' p ',' logp ',' theoricp ',' imp;
  run;

%mend;

Appendix III. Example of the output file

rs_number   reference_allele  risk_allele  MAF_cases  MAF_controls  N_cases  N_controls  model  OR    StError  Low95%CI  Upper95%CI  p-value  p_value_HWE
rs3934834   T                 C            0.201      0.13          144      111         REC    2.59  0.95     0.4       16.76       0.3184   1
rs3737728   A                 G            0.3        0.331         148      113         REC    0.44  0.49     0.17      1.14        0.09014  1
rs6687776   T                 C            0.168      0.172         148      113         REC    0.48  0.89     0.08      2.74        0.4078   1
rs9651273   A                 G            0.277      0.278         148      113         REC    0.88  0.48     0.34      2.26        0.7902   0.5667
rs4970405   G                 A            0.074      0.07          147      113         REC    1.04  1.36     0.07      15.01       0.976    0.4069
rs12726255  G                 A            0.142      0.15          147      113         REC    0.31  1.01     0.04      2.24        0.245    0.4495
rs2298217   T                 C            0.136      0.15          147      113         REC    1.28  1.11     0.15      11.17       0.8248   1
rs4970362   A                 G            0.425      0.379         134      108         REC    1.45  0.39     0.68      3.09        0.3366   1

Appendix IV. Preliminary guidelines for future imputation and meta-analysis of imputed data

Aim of imputation:
1) To increase the number of SNPs by imputing ungenotyped variants using a common reference panel: HapMap 2.0 release 22, including all HapMap reference samples (CEU+YRI+JPT+CHB; overall 420 haplotypes: 120 CEU, 120 YRI, 90 CHB and 90 JPT; and ~3 million SNPs).
   NOTE FOR DISCUSSION: for the HapMap 2 project, two already phased releases are available on the HapMap website (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/). Should we use release 21 or release 22? Release 21 is based on build 35, release 22 on build 36. I believe that if we are interested in combining all populations, we should use phasing based on consensus SNPs. On the HapMap website, phased consensus haplotypes are available only for release 21. On the website for IMPUTE v1, consensus haplotypes are also available for all 4 populations for release 22.
2) To normalize the set of studied SNPs to the reference panel, which allows easy meta-analysis of common SNPs.
Reference panel

The HapMap 2 r21 or r22 (NCBI B36 assembly) reference panel of already phased haplotypes was suggested for imputation.

Study-specific quality control for variants and individuals
1. Quality control procedure standardized among the individual studies (genotype missing rates, HWE and sex check, heterozygosity rate and population outliers; see Appendix I of the protocol for meta-analysis of GWAS of lung cancer).
2. Exclude SNPs for which the proportion of missing genotypes differs substantially (>5%) between cases and controls.

Imputation procedure to be performed by each center
1. Standardization of the study genotyping data to the physical positions and strand orientation of the reference phased haplotypes (build 36 for HapMap 2).
2. Imputation will be performed by each participating center separately. Within a participating center, imputation will ideally be performed by study (or country of origin) and genotyping platform.
3. Program for imputation: MACH; IMPUTE 2 may be used as an alternative method to investigate regions of interest.
4. The following steps describe the imputation in MACH:
   Input file formats: Merlin-format data and pedigree files.
   Two-step imputation to speed up the process:
   a. Calculate the error map and crossover map using a random subset of 200 individuals and 100 iterations, e.g.
      mach1 -d mach.dat -p subset.ped -s chr1.snp -h chr1.hap --compact --greedy --autoFlip -r 100 -o chr1 > chr1mach.infer.log
   b. Impute all SNPs using the parameter estimates calculated in the first step (a):
      mach1 -d mach.dat -p mach.ped -s chr1.snps -h chr1.hap --greedy --autoFlip --errorMap chr1.erate --crossoverMap chr1.rec --mle --mldetails --dosage -o chr1.imp2.log

After-imputation analysis
1. Analysis should take imputation uncertainty into account (posterior probability for every genotype).
2. Regression program: ProbABEL (MACH output files can be used directly in ProbABEL).
3. The standardized models for the association analysis should be used (see the main text of the “Protocol for meta-analysis of GWAS of lung cancer”).
4. Output files of ProbABEL should be sent to the coordinating center(s), where the subsequent meta-analysis will be performed (an example of the output file is given in Appendix II).
5. Direct genotyping of hits for validation.
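One common way to carry the posterior genotype probabilities into a regression is to use the expected allele dosage as the SNP predictor. The sketch below is a minimal illustration under that assumption; the function name `expected_dosage`, the argument order and the tolerance are hypothetical and are not part of MACH or ProbABEL.

```python
def expected_dosage(p_hom_ref, p_het, p_hom_alt, tol=1e-3):
    """Expected number of copies of the alternative allele (between 0 and 2).

    The three arguments are the posterior probabilities of the genotypes
    carrying 0, 1 and 2 copies of the alternative allele.
    """
    probs = (p_hom_ref, p_het, p_hom_alt)
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError("probabilities must sum to 1")
    # Each heterozygote contributes one copy, each alternative homozygote two.
    return p_het + 2.0 * p_hom_alt


# A confidently imputed homozygote and an uncertain genotype
print(expected_dosage(0.0, 0.0, 1.0))
print(expected_dosage(0.25, 0.5, 0.25))
```

A dosage keeps a single predictor per SNP, so the additive model specification from the main text carries over to imputed SNPs without change.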