Supplementary Information

The germline sequence variant rs2736100_C in TERT associates with myeloproliferative neoplasms

Oddsson A1*, Kristinsson SY2,3*, Helgason H1, Gudbjartsson DF1, Masson G1, Sigurdsson A1, Jonasdottir A1, Jonasdottir A1, Steingrimsdottir H3, Vidarsson B3, Reykdal S3, Eyjolfsson GI5, Olafsson I6, Onundarson PT2,3, Runarsson G3, Sigurdardottir O4, Kong A1, Rafnar T1, Sulem P1, Thorsteinsdottir U1,2 & Stefansson K1,2

1 deCODE Genetics/Amgen Inc., 101 Reykjavik, Iceland
2 Faculty of Medicine, University of Iceland, 101 Reykjavik, Iceland
3 Department of Hematology, Landspitali, The National University Hospital of Iceland, 101 Reykjavik, Iceland
4 Department of Clinical Biochemistry, Akureyri Hospital, 600 Akureyri, Iceland
5 The Laboratory in Mjodd, RAM, 109 Reykjavik, Iceland
6 Department of Clinical Biochemistry, Landspitali, The National University Hospital of Iceland, 101 Reykjavik, Iceland

Content
Tables S1-S8
Supplemental methods
References

Table S1: Number of directly and familially imputed MPN cases and controls in the study.

| Phenotype | N | Chip imputed | Familially imputed |
|---|---|---|---|
| Myeloproliferative neoplasm (ph-negative) | 237 | 112 | 125 |
| - Polycythemia vera | 98 | 40 | 58 |
| - Essential thrombocythemia | 40 | 27 | 13 |
| - Primary myelofibrosis | 26 | 9 | 15 |
| Controls | 34 128 | 16 128 | 18 000 |

Of the 237 MPN cases, 74 were without sub-phenotype (PV, ET or PMF) classification and one MPN case had two sub-phenotypes assigned.

Table S2: Association with MPN in Iceland of the JAK2 variant rs1034072_A reported in this study and the previously reported variant rs10974944_G.

| SNP ID | Position (hg18) | Allele | AF | P (all) | OR (all) | P (chip-typed) | OR (chip-typed) | P* | OR* | r2‡ |
|---|---|---|---|---|---|---|---|---|---|---|
| rs1034072 | chr9:5078903 | A/T | 28.2 | 3.19 x 10^-7 | 1.85 | 4.19 x 10^-5 | 1.78 | 0.29 | 1.64 | 0.91 |
| rs10974944 | chr9:5060831 | G/C | 28.7 | 1.90 x 10^-6 | 1.78 | 7.54 x 10^-5 | 1.75 | 0.86 | 1.09 | 0.91 |

Allele: minor allele/major allele, AF: allele frequency (shown for minor allele), OR: odds ratio (shown for minor allele).
*Adjusted values between rs1034072 and rs10974944. Only chip-typed individuals were used in the conditional analysis (cases N = 112, controls N = 16,128).
‡ r2 between rs1034072 and rs10974944.

Table S3: Association with MPN of previously reported variants and those that associate with MPN at P < 10^-5 at the TERT locus, with and without conditioning on rs2736100.

| SNP ID | Position (hg18) | Allele | AF | Reported phenotypes | P (all) | OR (all) | P-unadj (chip-typed) | OR-unadj (chip-typed) | P-adj* | OR-adj* | r2‡ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs2736100 | chr5:1339516 | C/A | 49.3 | LA,IPF,LC,GL,TC,TL,BCC | 6.39 x 10^-10 | 2.09 | 4.38 x 10^-7 | 2.02 | NA | NA | 1.000 |
| NA | chr5:1345642 | TT/- | 47.8 | - | 1.29 x 10^-8 | 1.96 | 2.98 x 10^-5 | 1.78 | 0.27 | 1.22 | 0.450 |
| rs2853677 | chr5:1340194 | G/A | 41.6 | LA | 2.24 x 10^-8 | 1.93 | 4.92 x 10^-5 | 1.75 | 0.35 | 1.19 | 0.448 |
| rs2735940 | chr5:1349486 | A/G | 48.1 | - | 3.28 x 10^-8 | 1.92 | 1.59 x 10^-4 | 1.68 | 0.45 | 1.14 | 0.411 |
| rs7705526 | chr5:1338974 | A/C | 34.6 | - | 3.35 x 10^-8 | 1.92 | 2.75 x 10^-6 | 1.90 | 0.14 | 1.33 | 0.524 |
| rs2853672 | chr5:1345983 | C/A | 47.8 | - | 5.11 x 10^-8 | 1.91 | 7.01 x 10^-5 | 1.73 | 0.41 | 1.16 | 0.450 |
| rs2736099 | chr5:1340340 | A/G | 36.6 | - | 9.22 x 10^-7 | 1.78 | 1.31 x 10^-4 | 1.69 | 0.36 | 1.17 | 0.393 |
| rs78559769 | chr5:1429174 | T/C | 2.5 | - | 3.82 x 10^-6 | 3.32 | 1.19 x 10^-4 | 3.13 | 0.003 | 2.42 | 0.022 |
| rs2736108 | chr5:1350488 | T/C | 30.0 | - | 5.56 x 10^-6 | 1.74 | 4.43 x 10^-4 | 1.65 | 0.17 | 1.24 | 0.207 |
| NA | chr5:1350077 | A/ACC | 28.9 | - | 7.99 x 10^-6 | 1.72 | 8.08 x 10^-4 | 1.61 | 0.23 | 1.21 | 0.206 |
| rs2736107 | chr5:1350854 | T/C | 28.5 | - | 9.46 x 10^-6 | 1.72 | 6.90 x 10^-3 | 1.49 | 0.46 | 1.13 | 0.205 |
| NA | chr5:1349255 | AG/A | 29.2 | - | 9.74 x 10^-6 | 1.72 | 9.21 x 10^-2 | 1.47 | 0.55 | 1.10 | 0.204 |
| rs2736098 | chr5:1347086 | T/C | 27.4 | BCC | 2.10 x 10^-5 | 1.70 | 3.89 x 10^-3 | 1.52 | 0.37 | 1.15 | 0.173 |
| rs4635969 | chr5:1361552 | A/G | 20.2 | TC | 3.01 x 10^-3 | 0.62 | 3.50 x 10^-2 | 0.68 | 0.16 | 0.77 | 0.018 |
| rs2853676 | chr5:1341547 | T/C | 26.5 | GL | 6.91 x 10^-3 | 1.41 | 5.50 x 10^-3 | 1.50 | 0.57 | 1.10 | 0.220 |
| rs4975709 | chr5:1930280 | C/A | 23.7 | CVD | 6.01 x 10^-2 | 1.28 | 0.12 | 1.27 | 0.06 | 1.33 | 0.002 |
| rs401681 | chr5:1375087 | T/C | 45.4 | PSA,ME,UBC,PC,LC,BCC | 6.31 x 10^-2 | 0.80 | 0.54 | 0.92 | 0.74 | 0.96 | 0.002 |
| rs4975616 | chr5:1368660 | G/A | 42.5 | LC | 6.91 x 10^-2 | 0.80 | 0.69 | 0.95 | 0.95 | 0.99 | 0.003 |
| rs31489 | chr5:1395714 | A/C | 42.4 | LA | 8.91 x 10^-2 | 0.82 | 0.48 | 0.91 | 0.87 | 0.98 | 0.010 |
| rs31490 | chr5:1397458 | A/G | 43.8 | CLL | 0.11 | 0.83 | 0.60 | 0.93 | 0.88 | 0.98 | 0.004 |
| rs402710 | chr5:1373722 | T/C | 36.0 | LC | 0.27 | 0.87 | 0.84 | 0.97 | 0.65 | 1.07 | 0.015 |
| rs12653946 | chr5:1948829 | T/C | 40.4 | PC | 0.29 | 1.13 | 0.09 | 1.26 | 0.08 | 1.27 | 0.001 |
| rs10069690 | chr5:1332790 | T/C | 25.6 | UBC, CLL | 0.89 | 1.02 | 0.26 | 1.19 | 0.37 | 0.87 | 0.169 |
| rs2242652 | chr5:1333028 | A/G | 22.3 | PC | 0.96 | 1.01 | 0.48 | 1.12 | 0.27 | 0.83 | 0.138 |

OR: odds ratio (shown for minor allele), AF: allele frequency (shown for minor allele), Allele: minor allele/major allele, Reported: known associations of diseases and traits with the index SNPs. BCC: basal cell carcinoma, CVD: cardiovascular disease risk factors, CLL: chronic lymphocytic leukemia, GL: glioma, IPF: idiopathic pulmonary fibrosis, LA: lung adenocarcinoma, LC: lung cancer, ME: melanoma, MPN: myeloproliferative neoplasms, UBC: urinary bladder cancer, PC: pancreatic cancer, PSA: prostate specific antigen levels, TC: testicular germ cell cancer, TL: telomere length.
*Adjusted for rs2736100. Only chip-typed individuals were used in the conditional analysis (cases N = 112, controls N = 16,128).
‡ r2 correlation between rs2736100 and the listed variants.

Table S4: Association with MPN in Iceland of variants reported to affect telomere length.

| SNP ID* | Position (hg18) | Gene | Allele | MPN in Iceland: AF (%) | MPN in Iceland: P | MPN in Iceland: OR | Telomere length: AF (%)* | Telomere length: P* | Telomere length: Effect (SD)* |
|---|---|---|---|---|---|---|---|---|---|
| rs10936599 | chr3:170974795 | TERC | C | 79.2 | 0.92 | 1.02 | 74.8 | 2.54 x 10^-31 | 0.097 |
| rs2736100 | chr5:1339516 | TERT | C | 49.3 | 6.39 x 10^-10 | 2.09 | 48.6 | 4.38 x 10^-19 | 0.078 |
| rs7675998 | chr4:164227270 | NAF1 | G | 80.4 | 0.94 | 1.02 | 78.3 | 4.35 x 10^-16 | 0.074 |
| rs9420907 | chr10:105666455 | OBFC1 | C | 12.3 | 0.83 | 1.04 | 13.5 | 6.90 x 10^-11 | 0.069 |
| rs11125529 | chr2:54329370 | ACYP2 | A | 16.3 | 2.85 x 10^-3 | 1.53 | 14.2 | 7.50 x 10^-10 | 0.056 |
| rs8105767 | chr19:22007281 | ZNF208 | G | 32.0 | 0.425 | 1.10 | 29.1 | 1.11 x 10^-9 | 0.048 |
| rs755017 | chr20:61892066 | RTEL1 | G | 14.2 | 0.82 | 1.04 | 13.1 | 6.71 x 10^-9 | 0.062 |

Allele: effect allele, OR: odds ratio (shown for effect allele), AF: allele frequency (shown for effect allele).
*Data drawn from Codd et al. 2013, which reported association of the listed SNPs with telomere length.

Table S5: Association of rs2736100_C in TERT and rs1034072_A in JAK2 with various blood cell counts in Iceland.

| Trait | N | TERT rs2736100_C: P | TERT rs2736100_C: Effect (SD) | JAK2 rs1034072_A: P | JAK2 rs1034072_A: Effect (SD) |
|---|---|---|---|---|---|
| Red blood cells | 76 739 | 9.07 x 10^-6 | 0.019 | 0.91 | 0.001 |
| White blood cells | 126 853 | 4.10 x 10^-4 | 0.016 | 7.74 x 10^-4 | 0.017 |
| - Basophils | 99 809 | 8.19 x 10^-2 | 0.006 | 0.02 | 0.008 |
| - Eosinophils | 99 862 | 2.50 x 10^-2 | -0.010 | 3.01 x 10^-4 | 0.018 |
| - Granulocytes | 99 473 | 3.55 x 10^-4 | 0.016 | 4.82 x 10^-4 | 0.017 |
| - Monocytes | 100 271 | 2.55 x 10^-2 | 0.010 | 0.38 | 0.004 |
| - Lymphocytes | 100 270 | 0.78 | 0.001 | 0.95 | 0.000 |
| Platelet | 103 441 | 5.26 x 10^-8 | 0.025 | 2.50 x 10^-4 | -0.019 |

Table S6: Association of rs2736100_C in TERT with hematological disorders in Iceland.

| Group | Phenotype | N | P | OR |
|---|---|---|---|---|
| Myeloid | Chronic myeloid leukemia | 85 | 0.93 | 0.98 |
| Myeloid | Acute myeloid leukemia | 291 | 0.30 | 1.14 |
| Lymphoid | Monoclonal gammopathy of unknown significance | 1 122 | 0.68 | 0.98 |
| Lymphoid | Multiple myeloma | 414 | 0.12 | 0.85 |
| Lymphoid | Waldenstroms | 86 | 0.06 | 1.46 |
| Lymphoid | Non Hodgkins lymphoma | 800 | 0.96 | 1.00 |
| Lymphoid | Hodgkins lymphoma | 256 | 0.44 | 0.91 |
| Lymphoid | Follicular non Hodgkins lymphoma | 149 | 0.67 | 1.06 |
| Lymphoid | Chronic lymphocytic leukemia | 309 | 0.56 | 0.94 |

Table S7: Mutation status of JAK2V617F in MPN patients.

| Phenotype | N JAK2V617F-positive | N JAK2V617F-negative | % JAK2V617F-positive |
|---|---|---|---|
| Myeloproliferative neoplasms (ph-neg.) (N = 62) | 43 | 19 | 69.35 |
| - Polycythemia vera (N = 31) | 25 | 6 | 80.64 |
| - Essential thrombocythemia (N = 11) | 6 | 5 | 54.55 |
| - Primary myelofibrosis (N = 5) | 3 | 2 | 60.00 |

MPN cases with available blood samples drawn within two years before or after MPN diagnosis were included in the analysis (N = 62).

Table S8: The effect of the germline risk alleles rs1034072_A in JAK2 and rs2736100_C in TERT on somatic JAK2V617F allele burden.

| Variant | Phenotype | P† | JAK2V617F allele burden effect per germline risk allele† |
|---|---|---|---|
| TERT rs2736100_C (Control AF = 49.32) | MPN (n = 60) | 0.77 | 1.45 % |
| TERT rs2736100_C (Control AF = 49.32) | PV (n = 30) | 0.98 | 0.17 % |
| JAK2 rs1034072_A (Control AF = 28.21) | MPN (n = 60) | 0.02 | 10.28 % |
| JAK2 rs1034072_A (Control AF = 28.21) | PV (n = 29) | 0.03 | 15.07 % |

MPN cases with available blood samples drawn within two years before or after MPN diagnosis were included in the analysis (N = 62). The mean JAK2V617F somatic allele burden among the 62 MPN cases used is 22%.
† Linear regression analysis was performed after adjusting for time from blood draw date to MPN diagnosis to estimate the JAK2V617F somatic allele burden effect per germline risk allele.
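The per-allele effects in Table S8 come from a linear regression of somatic allele burden on germline risk-allele count, adjusted for the time between blood draw and diagnosis. The following is a minimal sketch of a model of that shape, not the analysis code used in the study. It fits ordinary least squares by solving the normal equations in plain Python. The function names and the toy values are hypothetical and made up for illustration (the toy burdens are generated exactly as 5 + 10 x copies + 2 x years, so the fit should recover those coefficients). No P-value is computed.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_burden_model(risk_allele_copies, years_from_diagnosis, burden_pct):
    """Least-squares fit of burden ~ intercept + copies + years.

    Returns [intercept, effect per risk allele, effect per year].
    """
    rows = [[1.0, float(g), float(t)]
            for g, t in zip(risk_allele_copies, years_from_diagnosis)]
    k = 3
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, burden_pct)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data: NOT study data.
toy_copies = [0, 1, 2, 0, 1, 2]                # risk-allele copies (0, 1 or 2)
toy_years = [-1.5, 0.5, 1.0, 2.0, -0.5, 0.0]   # blood draw minus diagnosis
toy_burden = [5.0 + 10.0 * g + 2.0 * t for g, t in zip(toy_copies, toy_years)]

intercept, per_allele, per_year = fit_burden_model(toy_copies, toy_years, toy_burden)
```

In this sketch the second coefficient plays the role of the "effect per germline risk allele" column: the expected change in allele burden, in percentage points, for each additional copy of the risk allele at fixed sampling time.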
Supplemental methods

Study population

The study group consists of 237 patients diagnosed with MPN from 1956 until the end of 2012 according to the nationwide Icelandic Cancer Registry [14]. These include the sub-diagnoses PV (N = 98), ET (N = 40) and PMF (N = 26). In total, 74 had unclassifiable MPN and one individual had two sub-diagnoses. Median age at diagnosis was 70 years (range 20-96) and the median time since diagnosis was 15 years (range 1-57); 48% of the patients were males. The controls consist of 34 128 Icelanders recruited through different research projects at deCODE genetics.

The Data Protection Authority of Iceland and the National Bioethics Committee of Iceland approved this study. All participants signed written informed consent prior to participation in the study. All personal identifiers associated with blood samples, medical information, and genealogies were encrypted by the Data Protection Authority, using a third-party encryption system.

Illumina SNP Chip Genotyping

Genotyping was performed with methods previously described [11]. Icelandic chip-typed samples were assayed with the Illumina HumanHap300, HumanCNV370, HumanHap610, HumanHap1M, HumanHap660, Omni-1, Omni 2.5 or Omni Express bead chips at deCODE genetics. SNPs were excluded if they had (i) yield less than 95%, (ii) minor allele frequency less than 1% in the population, (iii) significant deviation from Hardy-Weinberg equilibrium in the controls (P < 0.001), (iv) an excessive inheritance error rate (over 0.001), or (v) a substantial difference in allele frequency between chip types (in which case they were excluded from just a single chip if that resolved all differences, but from all chips otherwise). All samples with a call rate below 97% were excluded from the analysis.
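Marker-level exclusion criteria (i) to (iv) above can be expressed as a simple filter. The sketch below is illustrative only and is not the pipeline used in the study: the `SnpStats` record and its field names are hypothetical, and the Hardy-Weinberg test shown is a 1-degree-of-freedom chi-square goodness-of-fit test, which is an assumption because the text does not state which test was used. Criterion (v), the comparison of allele frequencies between chip types, is not covered.

```python
import math
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary used only for this illustration."""
    yield_rate: float               # fraction of samples with a genotype call
    n_aa: int                       # control genotype counts: AA
    n_ab: int                       # AB
    n_bb: int                       # BB
    inheritance_error_rate: float   # fraction of inconsistent transmissions


def minor_allele_frequency(s: SnpStats) -> float:
    n = s.n_aa + s.n_ab + s.n_bb
    p = (2 * s.n_aa + s.n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chi2_pvalue(s: SnpStats) -> float:
    """P-value of a 1-df chi-square test for Hardy-Weinberg proportions."""
    n = s.n_aa + s.n_ab + s.n_bb
    p = (2 * s.n_aa + s.n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (s.n_aa, s.n_ab, s.n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(s: SnpStats) -> bool:
    """Apply thresholds (i)-(iv) quoted in the text."""
    return (
        s.yield_rate >= 0.95
        and minor_allele_frequency(s) >= 0.01
        and hwe_chi2_pvalue(s) >= 0.001
        and s.inheritance_error_rate <= 0.001
    )


# Toy records, made up for illustration.
in_equilibrium = SnpStats(0.99, 2500, 5000, 2500, 0.0)
no_heterozygotes = SnpStats(0.99, 5000, 0, 5000, 0.0)
low_yield = SnpStats(0.90, 2500, 5000, 2500, 0.0)
rare_variant = SnpStats(0.99, 0, 10, 9990, 0.0)
```

Of the four toy records, only `in_equilibrium` passes: the others fail on Hardy-Weinberg deviation, yield and minor allele frequency respectively.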
For the HumanHap series of chips, 304,937 SNPs were used for long-range phasing, whereas for the Omni series of chips 564,196 SNPs were included. The final set of SNPs used for long-range phasing was composed of 707,525 SNPs.

Single track SNP assay genotyping

Single SNP genotyping applying the Centaurus (Nanogen) single track genotyping assay [1] was done to verify the accuracy of the imputation of the TERT rs2736100_C variant in the Icelandic samples.

Whole Genome Sequencing

Paired-end libraries for sequencing were prepared according to the manufacturer's instructions (Illumina, TruSeq™). Whole genome sequencing was performed for 2,230 Icelanders, selected for various conditions. All of the individuals were sequenced at a depth of at least 10X (average sequencing depth = 22X).

Template DNA fragments were hybridized to the surface of flow cells (GA PE cluster kit (v2) or HiSeq PE cluster kits (v2.5 or v3)) and amplified to form clusters using the Illumina cBot. In brief, DNA (2.5-12 pM) was denatured, followed by hybridization to grafted adaptors on the flow cell. Isothermal bridge amplification using Phusion polymerase was then followed by linearization of the bridged DNA, denaturation, blocking of 3' ends and hybridization of the sequencing primer. Sequencing-by-synthesis (SBS) was performed on Illumina GAIIx and/or HiSeq 2000 instruments. Paired-end libraries were sequenced at 2 x 101 (HiSeq) or 2 x 120 (GAIIx) cycles of incorporation and imaging using the appropriate TruSeq™ SBS kits. Each library or sample was initially run on a single GAIIx lane for QC validation followed by further sequencing on either GAIIx (≥ 4 lanes) or HiSeq (≥ 1 lane) with targeted raw cluster densities of 500-800 k/mm2, depending on the version of the data imaging and analysis packages (SCS2.6-2.9/RTA1.6-1.9, HCS1.3.8-1.4.8/RTA1.10.36-1.12.4.2).
Real-time analysis involved conversion of image data to base calls in real time.

Sample preparation

Paired-end libraries for sequencing were prepared according to the manufacturer's instructions (Illumina, TruSeq™). In short, approximately 1 µg of genomic DNA, isolated from frozen blood samples, was fragmented to a mean target size of 300 bp using a Covaris E210 instrument. The resulting fragmented DNA was end repaired using T4 and Klenow polymerases and T4 polynucleotide kinase with 10 mM dNTP, followed by addition of an "A" base at the ends using Klenow exo fragment (3' to 5'-exo minus) and dATP (1 mM). Sequencing adaptors containing "T" overhangs were ligated to the DNA products, followed by agarose (2%) gel electrophoresis. Fragments of about 400-500 bp were isolated from the gels (QIAGEN Gel Extraction Kit), and the adaptor-modified DNA fragments were PCR enriched for ten cycles using Phusion DNA polymerase (Finnzymes Oy) and a PCR primer cocktail (Illumina). Enriched libraries were further purified using AMPure XP beads (Beckman-Coulter). The quality and concentration of the libraries were assessed with the Agilent 2100 Bioanalyzer using the DNA 1000 LabChip (Agilent). Barcoded libraries were stored at -20°C. All steps in the workflow were monitored using an in-house laboratory information management system with barcode tracking of all samples and reagents.

Alignment and SNP calling

Reads were aligned to NCBI Build 36 of the human reference sequence using Burrows-Wheeler Aligner (BWA) 0.5.9 [2]. Alignments were merged into a single BAM file and marked for duplicates using Picard 1.55 (http://picard.sourceforge.net/). Only non-duplicate reads were used for the downstream analysis.
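Duplicate marking does not delete reads: it sets bit 0x400 ("PCR or optical duplicate") in the FLAG field of each affected alignment record, and downstream tools skip records carrying that bit. As a minimal sketch of what "only non-duplicate reads were used" means at the record level, the hypothetical helper below filters text-format SAM lines on that bit. It is an illustration in plain Python, not the tooling used in the study, and the three toy records are made up.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: read is a PCR or optical duplicate


def non_duplicate_records(sam_lines):
    """Yield alignment lines whose FLAG does not have the duplicate bit set.

    Header lines (starting with '@') are skipped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        if not flag & DUPLICATE_FLAG:
            yield line


# Toy SAM text, for illustration only (SEQ and QUAL given as '*').
toy_sam = [
    "@HD\tVN:1.4\tSO:coordinate",
    "read1\t99\tchr5\t1339516\t60\t101M\t=\t1339700\t285\t*\t*",
    "read2\t1123\tchr5\t1339516\t60\t101M\t=\t1339700\t285\t*\t*",  # 99 + 1024
    "read3\t147\tchr5\t1339700\t60\t101M\t=\t1339516\t-285\t*\t*",
]

kept = list(non_duplicate_records(toy_sam))
kept_names = [record.split("\t")[0] for record in kept]
```

Here `read2` is dropped because its FLAG (1123) includes the 0x400 bit, while `read1` and `read3` are retained.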
Variants were called using the Genome Analysis Toolkit (GenomeAnalysisTK) 1.2-29-g0acaf2d [3] by applying base quality score recalibration and INDEL realignment, and performing SNP and INDEL discovery and genotyping using standard hard filtering [4]. Variants were annotated using SNP effect predictor (snpEff) and Genome Analysis Toolkit 1.4-9-g1f1233b with only the highest-impact effect [3,5].

Genotype imputation

Long-range phasing of all chip-genotyped individuals was performed with methods described previously [6,7]. In brief, phasing is achieved using an iterative algorithm which phases a single proband at a time given the available phasing information about everyone else who shares a long haplotype identically by state with the proband. Given the large fraction of the Icelandic population that has been chip-typed, accurate long-range phasing is available genome-wide for all chip-typed Icelanders. SNPs and INDELs identified through sequencing were imputed into all chip-typed Icelanders who had been phased with long-range phasing, using the same model as used by IMPUTE [8]. The genotype data from sequencing can be ambiguous due to low sequencing coverage. In order to phase the sequencing genotypes, an iterative algorithm was applied for each SNP with alleles 0 and 1. We let H be the long-range phased haplotypes of the sequenced individuals and applied the following algorithm:

1. For each haplotype h in H, use the Hidden Markov Model of IMPUTE to calculate, for every other k in H, the likelihood, denoted γh,k, of h having the same ancestral source as k at the SNP.

2. For every h in H, initialize the parameter θh, which specifies how likely allele 1 of the SNP is to occur on the background of h, from the genotype likelihoods obtained from sequencing. The genotype likelihood Lg is the probability of the observed sequencing data at the SNP for a given individual assuming g is the true genotype at the SNP. If L0, L1 and L2 are the likelihoods of the genotypes 0, 1 and 2 in the individual that carries h, then set θh as follows: [equation lost in extraction]

3. For every pair of haplotypes h and k in H that are carried by the same individual, use the other haplotypes in H to predict the genotype of the SNP on the backgrounds of h and k: [equations lost in extraction]

4. Combining these predictions with the genotype likelihoods from sequencing gives un-normalized updated phased genotype probabilities: [equations lost in extraction]

5. Now use these values to update θh and θk: [equations lost in extraction]

6. Repeat from step 3 while the maximum difference between iterations is greater than a convergence threshold ϵ. We used ϵ = 10^-7.

Given the long-range phased haplotypes and θ, the allele of the SNP on a new haplotype h not in H is imputed as [equation lost in extraction].

Genotype imputation information

The information measure value of genotype imputation was estimated by the ratio of the variance of imputed expected allele counts and the variance of the actual allele counts:

Var(E(θ | chip data)) / Var(θ),

where θ is the allele count. Var(E(θ | chip data)) was estimated from the observed variance of the imputed expected counts and Var(θ) was estimated by p(1 - p), where p is the allele frequency.

In the present MPN GWAS, only variants with an information measure value > 0.9 were used. The imputed genotype information measure value for rs2736100 is 0.98. To validate the imputation we directly genotyped rs2736100 in 7 281 Icelanders by single track Centaurus genotyping assay [1]. The correlation (r2) between directly genotyped and imputed allele counts was 0.94.
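The information measure described above is a ratio of two variances and can be computed directly from per-haplotype expected allele counts. The sketch below is one straightforward reading of that description, not the study's implementation. The function name is hypothetical, and estimating p as the mean of the imputed expected counts is an assumption, since the text does not say how p was obtained.

```python
def imputation_info(expected_allele_counts):
    """Ratio Var(E(theta | chip data)) / (p * (1 - p)).

    expected_allele_counts: one value in [0, 1] per haplotype, the imputed
    probability that the haplotype carries allele 1.
    """
    n = len(expected_allele_counts)
    # Assumption: allele frequency p is taken as the mean expected count.
    p = sum(expected_allele_counts) / n
    if p <= 0.0 or p >= 1.0:
        return 0.0  # monomorphic in this sample: the ratio is undefined
    observed_var = sum((e - p) ** 2 for e in expected_allele_counts) / n
    return observed_var / (p * (1.0 - p))


def passes_info_filter(expected_allele_counts, threshold=0.9):
    """Apply the information threshold quoted in the text (> 0.9)."""
    return imputation_info(expected_allele_counts) > threshold


# Toy inputs, made up for illustration.
certain = [0.0, 1.0, 0.0, 1.0]        # every haplotype imputed with certainty
uninformative = [0.5, 0.5, 0.5, 0.5]  # imputation carries no information
partial = [0.1, 0.9, 0.1, 0.9]

info_certain = imputation_info(certain)
info_uninformative = imputation_info(uninformative)
info_partial = imputation_info(partial)
```

When every haplotype is imputed with certainty the expected counts are all 0 or 1, their variance equals p(1 - p), and the measure is 1. When imputation adds nothing beyond the allele frequency, every expected count equals p and the measure is 0.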
Familial imputation (in-silico genotyping)

In addition to imputing sequence variants from the whole genome sequencing effort into chip-genotyped individuals, we also performed a second imputation step where genotypes were imputed into relatives of chip-genotyped individuals, creating in-silico genotypes. The inputs into the second imputation step are the fully phased (in particular, every allele has been assigned a parent of origin) imputed and chip-typed genotypes of the available chip-typed individuals. The algorithm used to perform the second imputation step consists of:

1. For each ungenotyped individual (the proband), find all chip-genotyped individuals within two meioses of the individual. The six possible types of two-meiosis relatives of the proband are (ignoring more complicated relationships due to pedigree loops): parents, full and half siblings, grandparents, children and grandchildren. If all pedigree paths from the proband to a genotyped relative go through other genotyped relatives, then that relative is excluded. For example, if a parent of the proband is genotyped, then the proband's grandparents through that parent are excluded. If the number of meioses in the pedigree around the proband exceeds a threshold (we used 12), then relatives are removed from the pedigree until the number of meioses falls below 12, in order to reduce computational complexity.

2. At every point in the genome, calculate the probability for each genotyped relative sharing with the proband based on the autosomal SNPs used for phasing. A multipoint algorithm based on the hidden Markov model Lander-Green multipoint linkage algorithm using fast Fourier transforms is used to calculate these sharing probabilities [9,10]. First, single point sharing probabilities are calculated by dividing the genome into 0.5 cM bins and using the haplotypes over these bins as alleles.
If there are informative haplotypes in the pedigree around the proband, denote by v the inheritance vector (sharing pattern) [9]. Haplotypes that are the same, except at most at a single SNP, are treated as identical. Given the haplotype frequencies in each bin, the single point distribution can be calculated as in classical multipoint linkage analysis [9]. When the haplotypes in the pedigree are incompatible over a bin, a uniform probability distribution was used for that bin. The most common causes for such incompatibilities are recombinations in members of the pedigree, phasing errors and genotyping errors. Note that since the input genotypes are fully phased, the single point information is substantially more informative than for unphased genotypes; in particular, one haplotype of the parent of a genotyped child is always known. The single point distributions are then convolved using the multipoint algorithm to obtain multipoint sharing probabilities at the center of each bin, just as in the original Lander-Green algorithm [9]. Genetic distances were obtained from the most recent version of the deCODE genetic map [11].

3. Based on the sharing probabilities at the center of each bin, all the SNPs from the whole genome sequencing are imputed into the proband. Consider imputing the genotype of the paternal allele of a SNP located at x, flanked by two bins (the symbols for the bin centers were lost in extraction). Starting with the left bin, going through all possible inheritance vectors v, let Iv be the set of haplotypes of genotyped individuals that share identically by descent within the pedigree with the proband's paternal haplotype given the inheritance vector v, let P(v) be the probability of v at the left bin (this is the output from step 2 above), and let [symbol lost in extraction] be the expected allele count of the SNP for haplotype i. Then [equation lost in extraction] is the expected allele count of the paternal haplotype of the proband given v, and an overall estimate of the allele count given the sharing distribution at the left bin is obtained from [equation lost in extraction]. If Iv is empty then no relative shares with the proband's paternal haplotype given v and thus there is no information about the allele count. We therefore store the probability that some genotyped relative shared the proband's paternal haplotype, [equation lost in extraction], and an expected allele count, conditional on the proband's paternal haplotype being shared by at least one genotyped relative: [equation lost in extraction]. In the same way calculate the corresponding quantities for the right bin. Linear interpolation is then used to get estimates, O and c, for the SNP from the two flanking bins: [equations lost in extraction]. If θ is an estimate of the population frequency of the SNP then Oc + (1 - O)θ is an estimate of the allele count for the proband's paternal haplotype. Similarly, an expected allele count can be obtained for the proband's maternal haplotype.

Association testing

Logistic regression was used to test for association between SNPs and disease, treating disease status as the response and expected genotype counts from imputation or allele counts from direct genotyping as covariates in Iceland, as described previously [12]. Testing was performed using the likelihood ratio statistic.

Multivariate logistic regression analysis was performed conditioning on a given marker by adjusting for the estimated allele count on the basis of imputation of this marker in Iceland. The genomic control correction factor was the same as used for the unadjusted association analysis. A forward selection multiple logistic regression model was used to further define the extent of the genetic association. Briefly, all imputed SNPs located within the interval of 500 kb were tested for possible incorporation into a multiple-regression model.
In a stepwise fashion, a SNP was added to the model if it had the smallest P-value among all SNPs not yet included in the model and if it had a P-value below the locus-wide significance threshold.

To account for the relatedness and stratification within our case and control sample sets, we applied the method of genomic control based on chip markers. For the MPN versus control comparison, the correction factor based on genomic control was 1.11.

Real-time quantitative PCR assay for JAK2V617F

The somatic JAK2V617F mutation was screened for using a real-time quantitative PCR assay performed as described previously [13]. Briefly, PCR amplification and detection were performed on an ABI Prism 7900HT Sequence Detection System (Applied Biosystems) with an initial step of 10 minutes at 95°C, followed by 40 cycles of 15 seconds at 95°C and 1 minute at 60°C. DNA from a healthy JAK2V617F non-carrier and from a homozygous JAK2V617F carrier, as determined by Sanger sequencing, were mixed in various proportions to generate a standard curve for JAK2V617F/JAK2Total against ΔCt (CtJAK2V617F - CtJAK2WT) to estimate JAK2V617F somatic allele burden. All samples were measured in duplicate, and the mean ΔCt was used to calculate JAK2V617F/JAK2WT.

Based on this measurement, we classify individuals as positive for the JAK2V617F somatic mutation if the allele burden is 5% or higher. In addition, we correlate the number of copies (0, 1 or 2) of each of the two germline MPN risk alleles (rs2736100_C and rs1034072_A) with JAK2V617F somatic allele burden (ranging from 0% to 85%) after adjusting for time between MPN diagnosis and blood sampling.

References

1. Kutyavin IV, Milesi D, Belousov Y, Podyminogin M, Vorobiev A et al. A novel endonuclease IV post-PCR genotyping system. Nucleic Acids Res 34, e128 (2006).
2. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England) 25, 1754-1760 (2009).
3. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20, 1297-1303 (2010).
4. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491-498 (2011).
5. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80-92 (2012).
6. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nature genetics 40, 1068-1075 (2008).
7. Kong, A. et al. Parental origin of sequence variants associated with complex diseases. Nature 462, 868-874 (2009).
8. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature genetics 39, 906-913 (2007).
9. Lander, E. S. & Green, P. Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences of the United States of America 84, 2363-2367 (1987).
10. Kruglyak, L. & Lander, E. S. Faster multipoint linkage analysis using Fourier transforms. Journal of computational biology: a journal of computational molecular cell biology 5, 1-7 (1998).
11. Kong, A. et al. Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467, 1099-1103 (2010). PMID: 20981099.
12. Helgason, H. et al. A rare nonsynonymous sequence variant in C3 is associated with high risk of age-related macular degeneration. Nature genetics (2013).
13. Levine RL, Belisle C, Wadleigh M, Zahrieh D, Lee S, Chagnon P. X-inactivation-based clonality analysis and quantitative JAK2V617F assessment reveal a strong association between clonality and JAK2V617F in PV but not ET/MMM, and identifies a subset of JAK2V617F-negative ET and MMM patients with clonal hematopoiesis. Blood. 2006 May 15;107(10):4139-41. Epub 2006 Jan 24.