Supplementary Materials for

Haplotype-based approach for noninvasive prenatal tests of Duchenne muscular dystrophy (DMD) using cell-free fetal DNA in maternal plasma

Yan Xu, MS1,2﹟; Xuchao Li, MS4﹟; Huijuan Ge, ME4; Bing Xiao, MD1,2; Yanyan Zhang, MS4; Xiao-Min Ying, BS1,2; Xiaoyu Pan, BE4; Lei Wang, MD1,3; Weiwei Xie, BM4; Lin Ni, BS1,2; Shengpei Chen, BE4; Wen-Ting Jiang, MS1,2; Ping Liu, MM4; Hui Ye, BS1,2; Ying Cao, BS1,2; Jing-Min Zhang, MD1,2; Yu Liu, BS1,2; Zu-Jing Yang, MD2,3; Ying-Wei Chen, MD1,2; Fang Chen, MS4*; Hui Jiang, MS4*; and Xing Ji, MS1,2*

1. Methods

Identification of the underlying mutations in the proband and the mother by multiplex ligation-dependent probe amplification (MLPA)

Large deletions/duplications were detected by MLPA using SALSA MLPA kits P034 and P035 DMD (MRC-Holland). The analysis was performed according to the manufacturer’s recommendations. The FAM-labeled PCR products were separated by capillary electrophoresis on an ABI Prism 3730 Genetic Analyzer (Applied Biosystems) using ROX 500 as the size standard. The data were analyzed using Microsoft Excel. When only a single probe gave an abnormal reading, Sanger sequencing and qPCR were used to confirm it and to exclude the possibility of a SNP under a probe or primer binding site.

qPCR

For each family, gDNA from the parents, proband and fetus was used for prenatal analysis. The DNA copy numbers for specific exons were determined using the DNA-binding dye SYBR Green I. The reference gene ALB, which was quantified simultaneously in separate tubes, was used to correct for possible variation in the DNA input amounts. The normal male and female control samples were mixtures of DNA obtained from 10 normal males and 10 normal females, respectively. The amplification mixtures (20 μL) contained 10 μL of SYBR Green I master mix (Takara), 0.2 μM each primer, 10 nM ROX fluorescein, 3.8 μL of DNase/RNase-free water and 10 ng of template DNA.
The no-template control (NTC) contained DNase/RNase-free water instead of DNA. The cycling conditions were as follows: 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and 60°C for 34 s. Each sample was run in triplicate on an ABI 7500 instrument (Applied Biosystems) with an SD < 0.15. The results were analyzed using the ABI 7500 software. The primers used for qPCR are listed in Table S1.

Sanger sequencing

The specific primers for amplifying exon 67 of the DMD gene and the SRY gene were designed using Primer 3.0 (Table S1). DNA was amplified in a 20-μL reaction volume containing 10 μL of PCR mix (2× HS Taq PCR Mix, TransGen Biotech, TAKARA), 0.2 μM each primer, 100 ng of genomic DNA and 7.2 μL of DNase/RNase-free water. The PCR cycling conditions were as follows: initial denaturation at 95°C for 5 min, followed by 35 cycles of 94°C for 30 s, 55°C for 30 s and 72°C for 30 s, and a final extension at 72°C for 10 min. Sequence analysis was performed using an ABI 3730XL DNA Analyzer (Applied Biosystems, USA). The primers used for PCR are listed in Table S1.

Linkage analysis

According to the identified mutation in the DMD gene, four closely linked microsatellites among DXS1235, DXS1236, DXS1237, DXS1238, DXS1241, DXS1242, DXS1214, DXS992 and STR07A were selected to determine the haplotype of the fetus and to exclude maternal contamination. The cytogenetic locations of these markers and the lengths of the amplified products were obtained from the Human Genome Database and the Marshfield Medical Center database. The sense primers were labeled with FAM fluorophores, and the PCR products were separated by capillary electrophoresis. The data were collected and analyzed using a 3730XL genetic analyzer (Applied Biosystems).

Short-read alignment and parental SNP calling

The short reads generated by Illumina HiSeq 2000 sequencing were mapped to the human reference genome (NCBI build 37) using SOAP2. Then, we performed SNP calling using SOAPsnp with the default parameters.
Filters (Q > 20 and depth ≥ 8) were applied to ensure the accuracy of the parental and proband genotypes.

Haplotyping in parent-offspring trios

We constructed the haplotypes using a trio-based strategy. For chromosome X, the parental and proband haplotypes were inferred from the trio genotypes according to Mendel’s laws. For example, suppose the genotype of the father is “A”, that of the mother is “AT” and that of the proband is “AT”. In this case, the “A” must have been inherited from the father and the “T” from the mother. We defined the parental allele that was passed to the offspring as haplotype 0 and the other as haplotype 1. Thus, in the mother, the transmitted “T” was phased to haplotype 0 and the “A” to haplotype 1, whereas phasing the father was much easier because he is haploid for chromosome X.

Calculation of the overall sequencing error rate

Sequencing errors can originate during both PCR-based library construction and next-generation sequencing. At loci where both parents are homozygous for the same allele, the fetal genotype must also be homozygous for that allele unless a de novo mutation has occurred. As de novo mutations are extremely rare, with 18-74 occurring per offspring [1], we assumed that all discordant bases arose from sequencing errors. Thus, we calculated the sequencing error rate from the SNP sites on chromosome 22 at which both parental genomes were homozygous with the same genotype but the plasma showed different bases. This rate is an important parameter in the following mathematical model.

Haplotyping of the fetus based on the HMM

1. Basic denotation

The number of loci on a given chromosome was denoted as $N_c$, while the total number of loci was denoted as $N^*$. The haplotypes of the father and mother were recorded as $FH = \{fh_0\}$ and $MH = \{mh_0, mh_1\}$, respectively, where $mh_k = \{m_{i,k}\}$, $fh_0 = \{f_{i,0}\}$, $k \in \{0,1\}$, $i = 1, 2, 3, \ldots, N_c$, and $\forall f_{i,k}, m_{i,k} \in \{A, C, G, T\}$.
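The phased maternal alleles used in this notation come from the trio rule described under "Haplotyping in parent-offspring trios". The following is a minimal Python sketch of that rule for a single chromosome X site, not the authors' implementation. It assumes genotypes are given as plain strings (one allele for the father, two for the mother, one or two for the proband); the function name `phase_maternal_x_site` and this encoding are hypothetical. Following the stated definition, the allele the mother transmitted is returned as haplotype 0 and the other as haplotype 1.

```python
from typing import Optional, Tuple


def phase_maternal_x_site(father: str, mother: str,
                          proband: str) -> Optional[Tuple[str, str]]:
    """Phase one chromosome X site of the mother from a trio.

    father  : single X allele, e.g. "A" (hypothetical string encoding)
    mother  : two X alleles, e.g. "AT"
    proband : one allele (son) or two alleles (daughter)

    Returns (haplotype 0 allele, haplotype 1 allele), where haplotype 0 is
    the allele transmitted to the proband, or None if the site cannot be
    phased (homozygous mother or Mendelian inconsistency).
    """
    if len(father) != 1 or len(mother) != 2 or len(proband) not in (1, 2):
        return None
    if mother[0] == mother[1]:
        return None  # homozygous mother: nothing to phase at this site
    if len(proband) == 1:
        # A son carries a single X allele, which comes from the mother.
        transmitted = proband
    else:
        alleles = list(proband)
        if father not in alleles:
            return None  # paternal allele missing: Mendelian inconsistency
        alleles.remove(father)  # the remaining allele is the maternal one
        transmitted = alleles[0]
    if transmitted not in mother:
        return None  # maternal allele not found in the mother
    other = mother[1] if transmitted == mother[0] else mother[0]
    return transmitted, other
```

For the example trio in the text (father "A", mother "AT", proband "AT"), the sketch assigns "T" to the mother's haplotype 0 and "A" to haplotype 1. Sites where the mother is homozygous are skipped because they carry no phase information.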
The unknown fetal haplotype was denoted as $H = \{h_0, h_1\}$, where $h_0 = \{m_{i,x_i}\}$ and $h_1 = \{f_{i,y_i}\}$. Therefore, $q_i = \{x_i, y_i\}$ constituted the hidden state that we needed to decipher, and the potential hidden states made up the set $Q$. In maternal plasma sequencing, we denoted the sequenced bases as $S = \{s_i\}$, where $s_i = \{n_{i,A}, n_{i,C}, n_{i,G}, n_{i,T}\}$ indicates the sequencing depth of each base. For the other parameters in the maternal plasma, the average cff-DNA concentration and the average sequencing error were denoted as $\varepsilon$ and $e$, respectively.

2. Initial state distribution

$\pi = \{\pi_j\}$, $j \in Q$. Due to the lack of prior probability, we defined

\[ \pi_j = \Pr(q_1 = j) = \frac{1}{2}, \]

representing the same initial probability for each hidden state.

3. Transition probability matrix

$A = \{a_{jk}\}$ ($j, k \in Q$), where

\[ a_{jk} = \Pr(q_i = k \mid q_{i-1} = j) =
\begin{cases}
1 - p_r, & x_i = x_{i-1},\ y_i = y_{i-1} \\
p_r, & x_i \neq x_{i-1},\ y_i = y_{i-1}
\end{cases} \]

$p_r = r_e / N^*$, and $r_e$ was the average frequency of recombination between gametes, where we used $r_e = 30$ for the whole genome.

4. Observation symbol probability matrix

$B = \{b_{i,j}(s_i)\}$ ($j \in Q$), where

\[ b_{i,j}(s_i) = \Pr(s_i \mid q_i = j, \{m_0, m_1\}) = \frac{(n_{i,A} + n_{i,C} + n_{i,G} + n_{i,T})!}{n_{i,A}!\, n_{i,C}!\, n_{i,G}!\, n_{i,T}!} \times (P_{i,A})^{n_{i,A}} \times (P_{i,C})^{n_{i,C}} \times (P_{i,G})^{n_{i,G}} \times (P_{i,T})^{n_{i,T}} \]

\[ P_{i,base} = \Pr(base \mid q_i = j, \{m_0, m_1\}) = \sum_{k \in \{0,1\}} \frac{1}{2}(1 - \varepsilon)\, \Delta(base, m_k) + \frac{1}{2}\varepsilon \times \Delta(base, m_{x_i}) + \frac{1}{2}\varepsilon \times \Delta(base, f_{y_i}) \]

and the indicator function

\[ \Delta(x, y) =
\begin{cases}
1 - e, & x = y \\
e/3, & x \neq y
\end{cases} \]

5. Viterbi algorithm [3]

(1) Initialization

\[ \delta_1(q_1) = \pi_{q_1} \cdot b_{1,q_1}(s_1) \]

(2) Iteration

\[ \delta_i(q_i) = \max_{q_{i-1} \in Q} \left( \delta_{i-1}(q_{i-1}) \cdot a_{q_{i-1} q_i} \right) b_{i,q_i}(s_i) \]

\[ \Psi_i(q_i) = \arg\max_{q_{i-1} \in Q} \left( \delta_{i-1}(q_{i-1}) \cdot a_{q_{i-1} q_i} \right) \]

(3) Termination and backtracking

The final optimized hidden state:

\[ q^*_{N_c} = \arg\max_{q_{N_c} \in Q} \delta_{N_c}(q_{N_c}) \]

The optimized path:

\[ q^*_{i-1} = \Psi_i(q^*_i), \quad i = 2, 3, \ldots, N_c \]

2. Figure

Figure S1. Pedigrees and the inherited mutations that were identified in the DMD gene for the eight analyzed families. The male probands are indicated by arrows.
The inherited mutations and the week of gestation (wk) of the pregnant mother for each family are shown in the figure. The mothers in these families were mutation carriers, each with one mutant allele and one wild-type allele.

3. Tables

Table S1. Primer sequences used for PCR/qPCR in this study

Location   Forward primer (5’-3’)      Reverse primer (5’-3’)        Product length  Tests
DMD E2     TCATAATGGAAAGTTACTTTGGTTG   CACAGGTACATAGTCCATTTTGAAA     219 bp          qPCR
DMD E17    ACAATTTTATTTGGCTTCAATATGG   GACATTACAGGTACCCGAGGATT       448 bp          qPCR
DMD E22    GGCAAAGTGTGAAACAATTAAGTG    TGGGCAAACTACCATACTTGTCAGAAT   317 bp          qPCR
DMD E23    TCATCTACTTTGTTTACATGTTTGAA  ACAGTGTATCGTTAGGGAAAAA        397 bp          qPCR
DMD E45    TGTCTTTCTGTCTTGTATCCTTTGG   CTGCTAAAATGTTTTCATTCCTATTAGA  399 bp          qPCR
DMD E47    GATAGACTAATCAATAGAAGCAAAG   GGGAGGAGGCTGGTATGTG           342 bp          qPCR
DMD E56    TCCAAATTCACATTCATCGC        CCAGTTACTTGTGCTAAGACAATGAG    329 bp          qPCR
ALB E12    AGCTATCCGTGGTCCTGAAC        TTCTCAGAAAGTGTGCATATATCTG     202 bp          qPCR
DMD E67    TGGCTACTCTTGAGAATTGCTACTG   CTGCCTACTGAAGAGCTAATATGAGA    369 bp          PCR
SRY        CTAAGTATCAGTGTGAAACGGG      CCTTCCGACGAGGTCGATAC          279 bp          PCR

Table S1 (continued)

Location  Forward primer (5’-3’)               Reverse primer (5’-3’)            Product length  Tests
DXS1235   AAGGTTCCTCCAGTAACAGATTTGG            TATGCTACATAGTATGTCCTCAGAC                         Linkage analysis
DXS1236   CGTTTACCAGCTCAAAATCTCAAC             CATATGATACGATTCGTGTTTTGC                          Linkage analysis
DXS1237   GAGGCTATAATTCTTTAACTTTGGC            CTCTTTCCCTCTTTATTCATGTTAC                         Linkage analysis
DXS1238   TCCAACATTGGAAATCACATTTCAA            TCATCACAAATAGATGTTTCACAG                          Linkage analysis
DXS1241   TGTCTGTCTTCAGTTATATG                 ATAACTTACCCAAGTCATGT                              Linkage analysis
DXS1242   TCTTGATATATAGGGATTATTTGTGTTTGTTATAC  ATTATGAAACTATAAGGAATAACTCATTTAGC                  Linkage analysis
DXS1214   TAGAACCCAAATGACAACCA                 TAGAACCCAAATGACAACCA                              Linkage analysis
DXS992    AAGAATGGGACTCCATTTCA                 AAGAATGGGACTCCATTTCA                              Linkage analysis
STR07A    TTCTGGTTTTCTGGTCTG                   TTCTGGTTTTCTGGTCTG                                Linkage analysis

Table S2.
Data production of deep sequencing for the target enrichment region

          Mother             Father             Proband            Plasma
Family    Coverage  Depth    Coverage  Depth    Coverage  Depth    Coverage  Depth
F01       95.64%    37.08    95.55%    30.50    95.79%    60.10    95.92%    29.06
F02       95.57%    31.09    95.50%    27.49    95.53%    39.67    95.95%    30.10
F03       95.08%    50.47    95.89%    42.47    90.79%    12.25    95.88%    21.84
F04       95.53%    29.42    94.56%    11.93    95.19%    18.55    95.90%    28.48
F05       95.66%    35.19    96.08%    121.74   95.73%    47.97    95.92%    35.62
F06       95.61%    31.77    95.81%    54.90    95.48%    28.56    95.91%    26.38
F07       95.69%    45.11    95.57%    36.16    95.49%    27.02    95.94%    31.43
F08       95.19%    22.97    93.67%    9.13     95.02%    22.12    95.86%    21.50

Table S3. The inferred SNP genotypes compared with direct fetal gDNA sequencing data on maternal chromosome X and the DMD gene region

          Heterozygous SNP sites          Total SNP sites
Family    chrX            DMD             chrX                 DMD
F01       5663 (92.35%)   265 (71.70%)    5,580,677 (99.99%)   156,597 (99.95%)
F02       5859 (78.53%)   276 (78.62%)    5,587,961 (99.98%)   141,948 (99.96%)
F03       2005 (84.34%)   62 (100.00%)    2,413,321 (99.99%)   46,402 (100.00%)
F04       5164 (78.78%)   285 (100.00%)   5,087,381 (99.98%)   147,877 (100.00%)
F05       5700 (89.46%)   233 (100.00%)   5,638,886 (99.99%)   154,455 (100.00%)
F06       5241 (87.67%)   249 (87.95%)    5,251,571 (99.99%)   150,469 (99.98%)
F07       6165 (87.67%)   250 (77.20%)    5,620,763 (99.99%)   160,426 (99.96%)
F08       4685 (82.18%)   208 (79.81%)    4,612,512 (99.98%)   112,682 (99.96%)

Reference

1. Francioli LC, Menelaou A, Pulit SL, et al. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet 2014;46(8):818-825.
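As an illustration of the Viterbi decoding described in the Methods ("Haplotyping of the fetus based on the HMM"), the following Python sketch decodes which maternal haplotype the fetus inherited at each locus. It is a sketch under stated assumptions, not the authors' implementation: the chain has two hidden states because the father contributes his single X haplotype (so only the maternal index x_i varies), the mismatch probability of the indicator function is taken as e/3, computation is done in log space, and the multinomial coefficient is dropped because it is identical for every state. The names `viterbi`, `log_emission`, `fetal_frac`, `err` and `p_rec` are hypothetical, and `err` must be greater than 0 and `p_rec` strictly between 0 and 1.

```python
import math
from typing import Dict, List, Sequence

BASES = "ACGT"


def delta(x: str, y: str, err: float) -> float:
    """Indicator with sequencing error: 1 - e on a match, e / 3 otherwise."""
    return 1.0 - err if x == y else err / 3.0


def log_emission(counts: Dict[str, int], m0: str, m1: str, f0: str,
                 x: int, fetal_frac: float, err: float) -> float:
    """Log-likelihood of plasma base counts at one locus given state x.

    The multinomial coefficient is omitted: it does not depend on x.
    """
    inherited = (m0, m1)[x]  # maternal allele carried by the fetus
    total = 0.0
    for base in BASES:
        p = (0.5 * (1.0 - fetal_frac)
             * (delta(base, m0, err) + delta(base, m1, err))
             + 0.5 * fetal_frac * delta(base, inherited, err)
             + 0.5 * fetal_frac * delta(base, f0, err))
        total += counts.get(base, 0) * math.log(p)
    return total


def viterbi(plasma: Sequence[Dict[str, int]], mh0: str, mh1: str, fh0: str,
            fetal_frac: float, err: float, p_rec: float) -> List[int]:
    """Return the most likely inherited maternal haplotype (0 or 1) per locus."""
    log_stay, log_switch = math.log(1.0 - p_rec), math.log(p_rec)
    # Initialization: uniform prior of 1/2 over the two states.
    score = [math.log(0.5)
             + log_emission(plasma[0], mh0[0], mh1[0], fh0[0], x,
                            fetal_frac, err)
             for x in (0, 1)]
    back: List[List[int]] = []
    # Iteration: keep the best predecessor of each state at each locus.
    for i in range(1, len(plasma)):
        new_score, pointers = [], []
        for x in (0, 1):
            cands = [score[prev] + (log_stay if prev == x else log_switch)
                     for prev in (0, 1)]
            best_prev = 0 if cands[0] >= cands[1] else 1
            pointers.append(best_prev)
            new_score.append(cands[best_prev]
                             + log_emission(plasma[i], mh0[i], mh1[i], fh0[i],
                                            x, fetal_frac, err))
        score = new_score
        back.append(pointers)
    # Termination and backtracking.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In this sketch each element of `plasma` is a dictionary of base counts at one maternal heterozygous locus, and `mh0`, `mh1` and `fh0` are strings holding the phased parental alleles at the same loci. Working in log space avoids numerical underflow when many loci are multiplied together.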