Supplementary Material

Positional cloning and next-generation sequencing identified a TGM6 mutation in a large Chinese pedigree with acute myeloid leukaemia (AML)

Li-li Pan1, Yuan-mao Huang1, Min Wang1, Xiao-e Zhuang1, Dong-feng Luo1, Shi-cheng Guo2, Zhi-shun Zhang3, Qing Huang1, Sheng-long Lin1, Shao-yuan Wang1

Li-li Pan and Yuan-mao Huang contributed equally to this work and should be considered co-first authors.

Correspondence: Prof. Shaoyuan Wang, Fujian Institute of Hematology, Fujian Provincial Key Laboratory on Hematology, Department of Hematology, Fujian Medical University Union Hospital, 29 Xinquan Road, Fuzhou 350001, PR China. E-mail: mdsy.wang@gmail.com

Materials and methods

Patients and materials

The criteria for “members at potential preleukaemic phases” were: 1) a history of easy bruising and/or fatigue; 2) persistent unexplained cytopenia or neutrophil leucocytosis; 3) hypercellular bone marrow with a mild increase in myeloblasts or persistent mild dysplastic features; 4) cytogenetic abnormalities. These members were consistent with the above items but did not meet the diagnostic criteria for AML, CML, MDS or MPN. They were re-examined at least twice and followed for several years.

Genotyping, linkage and haplotype analysis

Genome-wide linkage scans were performed on 13 members of the family (III-5, III-8, III-13, III-14, III-15, IV-7, IV-10, IV-13, IV-18, IV-19, V-3 and the spouses of III-8 and III-15) using the Affymetrix GeneChip Mapping Array 500K set. Non-Mendelian error checking of genotypes and the generation of linkage format files from raw Affymetrix array files were performed using ProgenyLab (Progeny, South Bend, IN). MERLIN1 was then used to remove additional unlikely genotypes that were consistent with potential genotyping errors.
An in-house program was then used to implement tag SNP selection based on the following filters: 1) NoCall, Non-Mendelian and Mendelian error SNPs were removed; 2) SNPs presenting an AA or BB genotype in all samples were also excluded; 3) SNPs with minor allele frequencies below 0.01 were excluded; and 4) the distance between two successive tags was required to be at least 0.5 cM to avoid linkage disequilibrium (LD). An additional step was performed after the initial tag selection to further remove high-LD SNPs. We employed MERLIN (v1.1.2) to perform multipoint linkage analysis under the non-parametric model and the dominant model, with a disease allele frequency of 0.0001 and a penetrance of 0.99.

Targeted capture and 454 sequencing

Because of the large number of genes in the linkage region and the lack of obvious candidate genes, we applied array-based sequence capture followed by 454 sequencing to analyse all genes in the region simultaneously. NimbleGen 385K microarrays were produced to capture the critical region at 20p13 (7.8-13 cM) in two affected individuals (III-15 and IV-19), including all exons, flanking intronic sequences, untranslated regions (UTRs), microRNAs, and the highly conserved regions of all candidate genes in this region. An additional 30 bp were added at each end of all exons for the detection of splice-site mutations. Targets smaller than 250 bp were enlarged by extending both ends of the region. Library construction was completed according to the manufacturer’s instructions (GS FLX Titanium General Library Preparation Kit; Roche 454, CT, USA).2 Pre- and post-capture libraries were subjected to quantitative PCR to estimate the magnitude of enrichment. Post-capture libraries were subjected to emPCR (GS FLX Titanium LV or SV emPCR Kits) and sequenced on the Genome Sequencer FLX platform (GS FLX Titanium Sequencing Kit XLR70).
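
As an illustration of the four tag SNP filters listed under "Genotyping, linkage and haplotype analysis", the following Python sketch applies them to a list of markers on a single chromosome. The in-house program itself is not described in further detail, so the record layout (Snp), the function name select_tag_snps, the use of a pre-annotated minor allele frequency, and the greedy left-to-right spacing rule are all assumptions made for illustration. The additional high-LD pruning step is not included.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snp:
    """Hypothetical marker record; field names are illustrative only."""
    name: str
    cm: float                   # genetic map position in cM
    maf: float                  # minor allele frequency (assumed to be pre-annotated)
    genotypes: List[str]        # one call per sample: "AA", "AB", "BB" or "NoCall"
    mendel_error: bool = False  # set by upstream Mendelian/non-Mendelian error checks


def select_tag_snps(snps: List[Snp], min_maf: float = 0.01,
                    min_gap_cm: float = 0.5) -> List[Snp]:
    """Apply filters 1-4 to the markers of one chromosome."""
    kept: List[Snp] = []
    last_cm = None
    for snp in sorted(snps, key=lambda s: s.cm):
        calls = set(snp.genotypes)
        # Filter 1: drop markers with any NoCall or a flagged inheritance error.
        if "NoCall" in calls or snp.mendel_error:
            continue
        # Filter 2: drop markers that are AA in all samples or BB in all samples.
        if calls == {"AA"} or calls == {"BB"}:
            continue
        # Filter 3: drop markers with minor allele frequency below the threshold.
        if snp.maf < min_maf:
            continue
        # Filter 4: keep a marker only if it lies at least min_gap_cm
        # beyond the previously kept tag (greedy rule, an assumption).
        if last_cm is not None and snp.cm - last_cm < min_gap_cm:
            continue
        kept.append(snp)
        last_cm = snp.cm
    return kept
```

Calling select_tag_snps once per chromosome and concatenating the results would give a genome-wide tag set under these assumptions. A different spacing strategy (for example, choosing the most informative marker in each 0.5 cM window) would be an equally plausible reading of filter 4.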

Whole exome sequencing (WES)

We also performed WES on the two patients plus a normal family member to fully explore the candidate exonic mutations. Briefly, 15 µg of genomic DNA from each sample was enriched for the target region of the consensus coding sequence (CCDS) exons with the NimbleGen 2.1M human exome array and subsequently sequenced on the Illumina HiSeq 2000 platform following the manufacturer’s instructions (Illumina, CA).3 The raw data were mapped to the human genome reference sequence (hg19) with the Burrows-Wheeler Aligner (BWA).4 Single nucleotide variants (SNVs) and short insertions/deletions (InDels) were detected with SOAPsnp5 and SAMtools,6 respectively. Low-quality variants were then filtered out, retaining only those that met the following criteria: (i) quality score ≥20 (Q20); (ii) average copy number at the allele site ≤2; (iii) distance between two adjacent SNPs ≥5 bp; and (iv) sequencing depth ≥4 and ≤1000. We then used ANNOVAR7 to annotate the high-confidence variants. Variants within the linkage region at 20p13 were selected for downstream analysis. SNPs present in dbSNP135 and the 1000 Genomes Project database (2011) were removed. The remaining variants that were shared by the two patient samples but absent in the normal control were selected and classified by genomic context: exonic, intronic, intergenic, splicing, ncRNA, 5’UTR or 3’UTR.

Molecular modelling

The 3D molecular models of TGM6 were built using homology modelling. Templates for modelling included human transglutaminase 3 (1L9M) and transglutaminase 2 (2Q3Z), which were downloaded from the RCSB database (http://www.rcsb.org/). Template and target sequences were aligned using Promals3D 8 under the default settings. Molecular models were generated using the Modeller program (version 9.11; released September 6, 2012) 9 and viewed in PyMOL (http://pymol.sourceforge.net).

Supplementary Table 1. Clinical Presentation of Patients in the Chinese AML Pedigree.

Supplementary Table 2.
Sequencing data within the linkage region on 20p13. A = III-15; B = IV-19; C = III-15's wife; A+B = the common variants shared by patients III-15 and IV-19; A+B-C = variants present in A+B but absent in C.

Supplementary Table 3. 36 HCDiffs shared by the two patients in the family.

Supplementary Table 4. Four variants shared by the two patients in the family.

Supplementary Table 5. Information about the markers and genes located at the significant LOD score peak on chr20. LOD_mpt represents the score of the multipoint linkage analysis under the dominant model. Associated_Genes represents the genes located at the position of the markers.

Supplementary Figure 1. LOD plots for chromosome 18. The x-axis represents genetic distance (cM), and the y-axis represents the corresponding LOD score. The LOD score peak is located within a region ranging from 91.46 to 97.06 cM (66127086-69342671), with an average HLOD score of 1.56 (average p = 0.0074). The LOD score was significantly higher under the dominant model than under the non-parametric model, which supports dominant transmission of the disease in the family we studied.

Supplementary Figure 2. The 3D structure and β-barrel 1 domain of TGM6. A. The 3D structure of the compact, inactive form of TGM6 is shown using a ribbon model in which the TG active site is buried. The NH2 and COOH termini are labelled. The four domains (β-sandwich, residues 3-136; catalytic core, residues 137-462; β-barrel 1, residues 494-605; β-barrel 2, residues 606-706) 10,11 are depicted in different colours. The flexible loop that connects the catalytic core with the β-barrel 1 domain is shown in blue. The Leu517 residue is depicted in space-filling style, and the GDP-binding pocket is indicated by an arrow. B. The 3D structure of the extended, active form of TGM6 is shown using a ribbon model in which the TG active site is exposed. C. The β-barrel 1 domain of TGM6. Residue L517 is shown in a sphere model.
The GDP-binding pocket that is inferred from the GDP-bound TG2 structure is shown in a semitransparent surface model in which the key residues are depicted as stick models. D. The β-barrel 1 domain of TGM6 in which the L517 residue is substituted by W517.

REFERENCES

1 Abecasis, G. R., Cherny, S. S., Cookson, W. O. & Cardon, L. R. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30, 97-101, doi:10.1038/ng786 (2002).
2 Rehman, A. U., Morell, R. J., Belyantseva, I. A. et al. Targeted capture and next-generation sequencing identifies C9orf75, encoding taperin, as the mutated gene in nonsyndromic deafness DFNB79. Am J Hum Genet 86, 378-388, doi:10.1016/j.ajhg.2010.01.030 (2010).
3 Wang, J. L., Yang, X., Xia, K. et al. TGM6 identified as a novel causative gene of spinocerebellar ataxias using exome sequencing. Brain 133, 3510-3518, doi:10.1093/brain/awq323 (2010).
4 Sunyaev, S., Ramensky, V., Koch, I., Lathe, W., 3rd, Kondrashov, A. S. & Bork, P. Prediction of deleterious human alleles. Hum Mol Genet 10, 591-597 (2001).
5 Li, R., Li, Y., Fang, X., Yang, H., Wang, J. & Kristiansen, K. SNP detection for massively parallel whole-genome resequencing. Genome Res 19, 1124-1132, doi:10.1101/gr.088013.108 (2009).
6 Li, H., Handsaker, B., Wysoker, A. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079, doi:10.1093/bioinformatics/btp352 (2009).
7 Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164, doi:10.1093/nar/gkq603 (2010).
8 Pei, J., Kim, B. H. & Grishin, N. V. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res 36, 2295-2300, doi:10.1093/nar/gkn072 (2008).
9 Eswar, N., Webb, B., Marti-Renom, M. A. et al. Comparative protein structure modeling using MODELLER. Current protocols in protein science / editorial board, John E. Coligan ... [et al.]
Chapter 2, Unit 2.9, doi:10.1002/0471140864.ps0209s50 (2007).
10 Iismaa, S. E., Mearns, B. M., Lorand, L. & Graham, R. M. Transglutaminases and disease: lessons from genetically engineered mouse models and inherited disorders. Physiol Rev 89, 991-1023, doi:10.1152/physrev.00044.2008 (2009).
11 Thomas, H., Beck, K., Adamczyk, M. et al. Transglutaminase 6: a protein associated with central nervous system development and motor function. Amino Acids, doi:10.1007/s00726-011-1091-z (2011).