Supplemental Text
Quality control of WGS
1. Quality control of raw reads
Raw reads contaminated by adapter sequences, or raw reads in which >50% of bases had a base quality <5 and the proportion of N
bases was >10%, were filtered out. Usually, adapter-contaminated reads made up <2% of the total number of reads, low-quality reads
<8%, and N bases <10%. If these limits were exceeded, we considered discarding all the reads from the affected lanes.
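The read filter described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names and the `has_adapter` flag (adapter detection itself is not shown) are hypothetical, and the criteria are read literally from the text (adapter contamination, or low base quality combined with an excess of N bases).

```python
def read_fractions(seq, quals, min_q=5):
    """Return (fraction of bases with quality < min_q, fraction of N bases)."""
    if not seq or len(seq) != len(quals):
        raise ValueError("sequence and qualities must be non-empty and equal in length")
    low_q = sum(1 for q in quals if q < min_q) / len(seq)
    n_frac = seq.upper().count("N") / len(seq)
    return low_q, n_frac


def should_filter(seq, quals, has_adapter):
    """Literal reading of the text: adapter, OR (>50% low-quality AND >10% N)."""
    low_q, n_frac = read_fractions(seq, quals)
    return has_adapter or (low_q > 0.50 and n_frac > 0.10)


# Hypothetical 20-bp reads with Phred-scaled qualities
print(should_filter("ACGT" * 5, [30] * 20, has_adapter=False))
print(should_filter("NNN" + "A" * 17, [2] * 11 + [30] * 9, has_adapter=False))
```

A stricter pipeline might treat low quality and N content as independent grounds for removal; that would only change the `and` in `should_filter` to `or`.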
2. Quality control of WGS
For 30X re-sequencing, whole genome sequences must meet the following criteria: the mapping rate should be >95%, the mismatch rate should
be <1%, GC content should be within the normal range 35-45%, and coverage for ≥4X depth should be >95%.
Basic statistics of whole genome sequencing for each individual
Sample | Mapping rate (%) | Mismatch rate (%) | GC content (%) | Coverage ≥4X (%) | Mean depth
Unaffected member of HMO Family 1 | 97.14 | 0.53 | 41.09 | 99.13 | 31.01
Proband of HMO Family 1 | 96.92 | 0.58 | 41.30 | 99.19 | 33.52
Proband of HMO Family 3 | 97.87 | 0.40 | 39.75 | 99.60 | 28.39
Patient with Dent disease | 97.31 | 0.46 | 40.23 | 99.47 | 30.75
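As an illustration, the four acceptance criteria stated above can be checked against the per-sample statistics with a few lines of Python. The class and function names are hypothetical; the thresholds and values are those given in the text. Mean depth is listed for reference only, since the text states no threshold for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WgsStats:
    sample: str
    mapping_rate: float   # %
    mismatch_rate: float  # %
    gc_content: float     # %
    coverage_4x: float    # % of genome covered at >=4X
    mean_depth: float     # reported only, not a stated criterion


def failed_criteria(s):
    """Return the names of the stated criteria that a sample fails."""
    failures = []
    if not s.mapping_rate > 95:
        failures.append("mapping rate")
    if not s.mismatch_rate < 1:
        failures.append("mismatch rate")
    if not 35 <= s.gc_content <= 45:
        failures.append("GC content")
    if not s.coverage_4x > 95:
        failures.append("coverage >=4X")
    return failures


SAMPLES = [
    WgsStats("Unaffected member of HMO Family 1", 97.14, 0.53, 41.09, 99.13, 31.01),
    WgsStats("Proband of HMO Family 1", 96.92, 0.58, 41.30, 99.19, 33.52),
    WgsStats("Proband of HMO Family 3", 97.87, 0.40, 39.75, 99.60, 28.39),
    WgsStats("Patient with Dent disease", 97.31, 0.46, 40.23, 99.47, 30.75),
]

for s in SAMPLES:
    print(s.sample, failed_criteria(s) or "pass")
```

All four samples satisfy the four stated criteria.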
Supplemental Fig. S1 Radiographs showing exostosis lesions in affected HMO individuals.
a Right forearm of Family 1 member III-1, showing exostosis in the ulna and bowed forearm
conferring restricted rotation (arrow). b Right leg of Family 1 member III-12, showing the
tibial exostosis resulting in the destruction of the fibula (arrow). c Pelvic radiograph of Family
1 member I-2, displaying osteoarthritis and necrosis of the femoral head on the right side of
the hip (arrow). d Radiograph of Family 2 member II-1, showing the exostosis at the
epiphysis of the phalanx in the fourth digit of the right hand (arrow). e Radiograph of Family
2 member II-1, showing multiple exostoses around the left knee joint (arrows). f Radiograph
of Family 3 member II-1, revealing the exostoses in the scapula (arrow).
Supplemental Fig. S2 Pipeline methods employed to accurately characterize the CNVs using
whole genome sequencing data from Families 1 and 2 (red letters denote deleted sequences,
orange letters denote micro-mutations, blue letters denote insertions and black letters denote
unchanged sequence or reference sequences).
Step 1: Prediction of the CNVs using DELLY, BreakDancer and CNVnator software on WGS
data, which indicated multiple distinct breakpoints.
Step 2: Determination of the breakpoints by alignment of the truncated reads around the
breakpoints with reference sequences. The breakpoints were determined using the truncated
reads from read-pairs (one read mapped, the other truncated) because these truncated reads
had concordant ends near the predicted breakpoint (6 at the 5’ end and 9 at the 3’ end). Thus,
the tentative breakpoints of this CNV exemplar were determined to be chr11:43,936,139 and
chr11:44,438,037.
Step 3: Tracking sequences in breakpoint regions and fine-tuning of the breakpoints.
(1) Tracking sequences in breakpoint regions. We extracted the 1000-bp flanking sequence at
each of the breakpoint ends from the human reference sequence and concatenated the flanking
sequences into 2000-bp junction sequences as the reference to be aligned with the patients’
sequences obtained from the WGS data. We extracted all the reads with abnormal insert sizes,
unexpected strand orientations and truncations, and ends that mapped to two different
chromosomes as well as the read-pairs (one read mapped, the other truncated), and aligned
them to the junction sequences using BWA. We then obtained information on insertions,
microhomologies and micro-mutations.
(2) Fine-tuning.
1) Construct sequences before fine-tuning. We searched for the inserted sequence
(GAGAAAAGCATTTGCAAAAA) using BLAT and found its only match 157 bp
downstream of the 5’ breakpoint. Analysis of the patient reads at the 3' end of the deletion
revealed a TGA microhomology (in purple box) that could be assigned either to the deleted
sequence or to the breakpoint-flanking sequence, owing to the presence of an insertion at the
breakpoint junction.
2) Fine-tuning. Combining the physical positions with the flanking sequence at the breakpoint
junction, the 'GTATGA' sequence could be placed in the 3’ flanking region of the 20-bp insertion,
because it then mapped perfectly to the reference sequence.
3) Construct sequences after fine-tuning. Consequently, the 26-bp insertion perfectly matched
chr11:43,936,296-43,936,321 and the position of the 3' breakpoint of the CNV was
refined to chr11:44,438,043.
4) Patient sequences.
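The coordinates quoted in Steps 2 and 3 are mutually consistent, which can be verified with simple arithmetic. The sketch below rests on an interpretation that the caption does not state explicitly: that the template of the insertion starts exactly 157 bp downstream of the tentative 5’ breakpoint, and that assigning the six bases 'GTATGA' to the insertion shifts the 3’ breakpoint by six positions. Variable names are hypothetical.

```python
TENTATIVE_5P = 43_936_139           # chr11, tentative 5' breakpoint (Step 2)
TENTATIVE_3P = 44_438_037           # chr11, tentative 3' breakpoint (Step 2)
OFFSET_FROM_5P = 157                # BLAT match lies 157 bp downstream of the 5' breakpoint
INSERT_20 = "GAGAAAAGCATTTGCAAAAA"  # 20-bp inserted sequence
REASSIGNED = "GTATGA"               # bases moved to the 3' flank of the insertion

insert_len = len(INSERT_20) + len(REASSIGNED)
insert_start = TENTATIVE_5P + OFFSET_FROM_5P
insert_end = insert_start + insert_len - 1   # inclusive end coordinate
refined_3p = TENTATIVE_3P + len(REASSIGNED)

print(insert_len)                 # 26
print(insert_start, insert_end)   # 43936296 43936321
print(refined_3p)                 # 44438043
```

Under these assumptions the arithmetic reproduces the 26-bp template at chr11:43,936,296-43,936,321 and the refined 3’ breakpoint at chr11:44,438,043.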
Supplemental Fig. S3 Pipeline methods employed to accurately characterize the CNVs using
whole genome sequencing data from Family 3 (red letters denote deleted sequences, orange
letters denote micro-mutations, blue letters denote insertions and black letters denote
unchanged sequences or reference sequences).
Step 1: Prediction of the CNVs using DELLY, BreakDancer and CNVnator software on WGS
data, which indicated multiple distinct breakpoints.
Step 2: Determination of the breakpoints by alignment of the truncated reads around the
breakpoints with reference sequences. The breakpoints were determined using the truncated
reads from read-pairs (one read mapped, the other truncated) because these truncated reads
had concordant ends near the predicted breakpoint (12 at the 5’ end, 10 at the 3’ end). Thus,
the breakpoints of this CNV exemplar were determined to be chr11:44,128,440 and
chr11:44,198,500.
Step 3: Tracking sequences in breakpoint regions. We extracted the 1000-bp flanking
sequence at each of the breakpoint ends from the human reference sequence and concatenated
the flanking sequences into 2000-bp junction sequences as the reference to be aligned with the
patients’ sequences obtained from the WGS data. We extracted all the reads with abnormal
insert sizes, unexpected strand orientations and truncations, and ends that mapped to two
different chromosomes as well as the read-pairs (one read mapped, the other truncated), and
aligned them to the junction sequences using BWA. Consequently, we found a 5bp insertion
(TCTTG) within the breakpoint junctions and a CC insertion in the flanking regions of the
breakpoint.
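The construction of a 2000-bp junction sequence from two 1000-bp flanks, as described in Step 3, can be sketched as below. This is an illustration on a random toy sequence rather than the human reference; the function name and the coordinate convention (1-based positions, with the 5’ breakpoint as the last retained base and the 3’ breakpoint as the first retained base) are assumptions.

```python
import random


def build_junction(reference, bp5, bp3, flank=1000):
    """Concatenate the flank ending at bp5 with the flank starting at bp3.

    bp5 and bp3 are 1-based; bp5 is taken as the last base kept before the
    deletion and bp3 as the first base kept after it (an assumed convention).
    """
    if bp5 - flank < 0 or bp3 - 1 + flank > len(reference) or bp5 >= bp3:
        raise ValueError("breakpoints or flank size do not fit the reference")
    left = reference[bp5 - flank:bp5]          # bases bp5-flank+1 .. bp5
    right = reference[bp3 - 1:bp3 - 1 + flank]  # bases bp3 .. bp3+flank-1
    return left + right


rng = random.Random(0)
toy_reference = "".join(rng.choices("ACGT", k=5000))  # stand-in for a chromosome
junction = build_junction(toy_reference, bp5=1500, bp3=3500)
print(len(junction))  # 2000
```

Reads spanning the deletion align contiguously to such a junction sequence, so insertions and micro-mutations at the junction show up as ordinary alignment differences.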
Supplemental Fig. S4 Pipeline methods employed to accurately characterize the CNVs using
whole genome sequencing data from the Dent disease patient (red letters denote deleted
sequences, orange letters denote micro-mutations, blue letters denote insertions and black
letters denote unchanged sequences or reference sequences).
Step 1: Prediction of the CNVs using DELLY, BreakDancer and CNVnator software on WGS
data, which indicated multiple distinct breakpoints.
Step 2: Determination of the breakpoints by alignment of the truncated reads around the
breakpoints with reference sequences. The breakpoints were determined using the truncated
reads from read-pairs (one read mapped, the other truncated) because these truncated reads
had concordant ends near the predicted breakpoint (5 at the 5’ end, 5 at the 3’ end). Thus, the
breakpoints of this CNV exemplar were determined to be chrX:49,780,222 and
chrX:49,840,741.
Step 3: Tracking sequences in breakpoint regions. We extracted the 1000-bp flanking
sequence at each of the breakpoint ends from the human reference sequence and concatenated
the flanking sequences into 2000-bp junction sequences as the reference to be aligned with the
patients’ sequences obtained from the WGS data. We extracted all the reads with abnormal
insert sizes, unexpected strand orientations and truncations, and ends that mapped to two
different chromosomes as well as the read-pairs (one read mapped, the other truncated), and
aligned them to the junction sequences using BWA. Consequently, we found a 22bp insertion
(TACATATAGTGACAGGGAATGG) at the breakpoint junctions.
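The selection of informative read-pairs in Step 3 (abnormal insert size, unexpected strand orientation, truncation, or ends on different chromosomes) can be sketched as a simple predicate. This is a hedged illustration on plain records, not the actual implementation: the `ReadPair` fields, the expected "FR" orientation, and the insert-size mean, standard deviation and 3-SD cut-off are all hypothetical, and a real pipeline would take these values from the alignment file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    name: str
    chrom1: str
    chrom2: str
    insert_size: int   # only meaningful when both ends are on one chromosome
    orientation: str   # "FR" is assumed to be the expected orientation
    truncated: bool    # True if either read is truncated (clipped) at the breakpoint


def is_informative(pair, mean_insert=350.0, sd_insert=50.0, n_sd=3.0):
    """Flag pairs that may carry breakpoint information (assumed thresholds)."""
    different_chroms = pair.chrom1 != pair.chrom2
    odd_orientation = pair.orientation != "FR"
    abnormal_insert = (not different_chroms
                       and abs(pair.insert_size - mean_insert) > n_sd * sd_insert)
    return different_chroms or odd_orientation or pair.truncated or abnormal_insert


PAIRS = [
    ReadPair("normal", "chrX", "chrX", 340, "FR", False),
    ReadPair("spans_deletion", "chrX", "chrX", 60_850, "FR", False),
    ReadPair("everted", "chrX", "chrX", 360, "RF", False),
    ReadPair("split", "chrX", "chrX", 330, "FR", True),
    ReadPair("translocated", "chrX", "chr1", 0, "FR", False),
]

selected = [p.name for p in PAIRS if is_informative(p)]
print(selected)
```

Only the pair named "normal" is left out; the other four would be passed on for alignment to the junction sequence.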
Supplemental Fig. S5 FISH analysis of the cultured blood cells from the proband of HMO
Family 1 (III-1). The EXT2 gene signal is shown in red whilst the control signal from the
centromeric sequences of chromosome 11 is shown in green. Note the absence of the EXT2
gene signal in one of the chromosome 11 homologues in both metaphase (a) and interphase (b)
cells.
Supplemental Fig. S6 Identification of CNVs by MLPA (Multiplex Ligation-dependent
Probe Amplification) and chromosome microarray analyses. a MLPA electropherogram of the
HMO Family 1 proband showing the amplification ratio of all EXT2 probes relative to the
reference probes (as well as the EXT1 probes). An identical MLPA electropherogram was
observed in the Family 2 proband. b MLPA electropherogram of the HMO Family 3 proband
showing a heterozygous deletion of exons 2-8 of EXT2 (defined by probes EXT2-04 to EXT2-10).
The horizontal red line indicates the threshold ratio indicative of a heterozygous deletion.
c Chromosome microarray analysis of the boy with Dent disease revealing a deletion
involving part of the CLCN5 gene. By the weighted log2 ratio method (upper panel), the copy
number of the X chromosome for a normal male corresponds to baseline -0.5 on the
scatterplot. By the copy number state method (bottom panel), the copy number of the X
chromosome for a normal male is 1.0. The probes revealing a zero copy number indicate a
~50 kb deletion at Xp11.23-p11.22 with a minimum range of 49,790,892-49,840,451 (hg19).
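As a consistency check, the minimum deleted range from the microarray can be compared with the WGS breakpoints reported for the Dent disease patient in Supplemental Fig. S4. The sketch below assumes that both sets of coordinates refer to the same genome build (hg19 is stated only for the array) and that the intervals are inclusive; the helper names are hypothetical.

```python
WGS_DELETION = (49_780_222, 49_840_741)     # chrX breakpoints from WGS (Fig. S4)
ARRAY_MIN_RANGE = (49_790_892, 49_840_451)  # chrX minimum range from the array (hg19)


def span(interval):
    """Length of an inclusive interval in bp."""
    return interval[1] - interval[0] + 1


def contains(outer, inner):
    """True if the inner interval lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


print(span(ARRAY_MIN_RANGE))                    # 49560
print(span(WGS_DELETION))                       # 60520
print(contains(WGS_DELETION, ARRAY_MIN_RANGE))  # True
```

The array-defined minimum range (about 50 kb) lies entirely within the interval delimited by the WGS breakpoints (about 60.5 kb), as expected when array probes only bound a deletion from the inside.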
Supplemental Table S1 Clinical characteristics of the patients
HMO patient | Sex | Age^a (years) | No. of exostoses | Location of exostoses | Other clinical phenotypes
Family 1-I-2 | Female | 70 | 2 | Femur | Hip osteoarthritis, necrosis of femoral head, and scoliosis
Family 1-II-3 | Male | 48 | 6 | Humerus, tibia, ulna and radius | No
Family 1-II-5 | Male | 45 | 5 | Femur, fibula and radius | Dislocation of radioulnar joint
Family 1-II-8 | Female | 40 | 4 | Femur and humerus | No
Family 1-III-1 | Male | 26 | 8 | Femur, tibia, fibula, humerus, ulna and radius | Forearm deformity and wrist joint dysfunction
Family 1-III-5 | Male | 13 | 7 | Femur, tibia, fibula, ulna and radius | No
Family 1-III-9 | Female | 7 | 5 | Femur and rib | No
Family 1-III-11 | Male | 14 | 13 | Femur, tibia, fibula, humerus, ulna and radius | No
Family 1-III-12 | Male | 12 | 15 | Femur, tibia, fibula, humerus, ulna and radius | No
Family 2-I-1 | Male | 45 | 6 | Femur and tibia | No
Family 2-II-1 | Male | 13 | 17 | Femur, tibia, fibula and phalanx | No
Family 3-II-1 | Male | 10 | 6 | Femur, ulna, radius and scapula | No

Dent disease patient | Sex | Age* (years) | Renal damage | Histopathological changes
II-1 | Male | 12 | Positive urinary protein, low-molecular-weight proteinuria, hypercalciuria, microscopic hematuria and intermittent hematuria | Mild mesangial proliferative glomerulonephritis, focal glomerulosclerosis and crescent formation in glomeruli

^a Age at diagnosis.
Surgical
therapy
Pain
No
Yes
No
No
No
No
No
No
No
No
No
No
Yes
Yes
Yes
No
Yes
No
No
No
No
No
No
No
Other phenotypes (Dent disease patient II-1): Mild growth retardation
Supplemental Table S2 PCR primers for the generation of EXT2 gene probes used for FISH
analysis
Probe | Forward (5’-3’) | Reverse (5’-3’)
Fish-1 | CGTGGTGTCTCGTTTGGGTTTAAG | GATCTGGTTCCCACCGAATGTAAC
Fish-2 | GGCAATGCTCAAGGTATAGA | AGAAATCCAAGGTAGTAACGGT
Fish-3 | TTAGGCACTGCGAATACTTAGATA | GCCCACCACACTAAACCTC
Fish-4 | CTTTTCTTGAGACCACTTGAACCA | CTAGGGCTTGAACATTCCACG
Fish-5 | TTTCCCTTGTAGTCCACGGCAATAC | ACTCCCTCAAACCCCCTCAATGT
Fish-6 | GGGGAAAGCCTATTGTATCAGT | CTTTTTCCTAATCAGCCCACTAC
Fish-7 | CTCCTGGGGCAGCATTTAAGTA | GCCCATTGGATTTTGCTTATCAC
Fish-8 | AGTGATAGATGGTATTGGACCTAC | GGCCTAACTCTTCTGATAACTCT
Fish-9 | TCTCTTTGTCCCATGTTCTATT | GCCCCATTGTAATTCTACG
Fish-10 | GCAATAGACAAATACTGAAACCTAC | GATTCAAGAGATCCGAGCTAC
Fish-11 | CCTCTGGGCTGAAATGTTACTACTG | AATACTCTCATCTGGCTGATCCCTT
Fish-12 | AGAGGCTGGGTTCAGACTAAATC | CAGCATTAATGGGGAAATAGGA
Fish-13 | ATTTGTTGAACTCTGGTCCATT | TTTAGGAATTTCTGGGCTACAG
Fish-14 | GGCAACATGGACCACATTACTGAT | CTGGCTGACCAAGGAGAGTGTCTA
Fish-15 | CGCCATAGTCCTCACCTACGACC | TGAACAAACACCCCACAGAAGATTAAAC
Fish-16 | GAATCTCCCCTGACACAGTTCTACCT | GCAATGAAGAGAGAAATCACTCGC
Fish-17 | CCTTAAAGGCACACCATAGCAAGT | GGCCCCCTCATCACTAATTAAATC
Fish-18 | CCCTTTGAGTTCATCTTGGAC | TAAACCAGCCAACAGACAGTAGTA
Fish-19 | TTGCTAGGGAGATCGCTAGTTAAGGT | TCTCTTCCAAAGGAGCTACGACAGT
Fish-20 | AAGCAGCATCTCCTGTTCACGTT | GACCCTCTGTTTTTCTCTGACAATACC
Supplemental Table S3 Primers for long-range PCR and subsequent Sanger sequencing used to confirm
the whole-genome sequencing findings with respect to the three pathogenic CNVs*
Family | Primer | Forward (5’-3’) | Primer | Reverse (5’-3’)
Families 1/2 | L1 | CGAGGCTTGCTCTCCAACTTCTTAAC | D1 | CCTGGGCTCTTCAACTAGGACAGTAAAC
Families 1/2 | F1 | AAGAAGTCTGGCAGGATG | R1 | GCTGGGATGAGTAGGTC
Family 3 | L2 | TCTTAAAATGTGGTCTACATGGGAACT | D2 | AGTCCAGGGAAGTATCTAATCCTCATC
Family 3 | F2 | TCACCGCAACCTCCAC | R2 | TCCCCTAATAAAGAAC
Dent disease patient | L3 | GGTGGGCTTGTCTGTGTATTAGAAT | L3 | GTTTCTGTTATTTTGACATGGAATGC
Dent disease patient | F3 | TGCCCTTTATCTTCCA | R3 | CTGCCTCTGACACTTCT
* L and D indicate primers for long-range PCR whilst F and R indicate primers for Sanger sequencing.
Supplemental Table S4 LOD scores for chromosome 11 markers in HMO Family 1
Microsatellite marker | LOD score at θ=0 | θ=0.01 | θ=0.05 | θ=0.1 | θ=0.2 | θ=0.3 | θ=0.4 | Zmax
D11S4102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
D11S905 | 4.21 | 4.15 | 3.88 | 3.53 | 2.76 | 1.91 | 0.96 | 4.21
D11S4191 | 3.01 | 2.96 | 2.77 | 2.51 | 1.95 | 1.32 | 0.65 | 3.01
D11S987 | -∞ | 0.97 | 1.49 | 1.55 | 1.34 | 0.95 | 0.47 | 1.55
D11S4162 | -∞ | 2.15 | 2.58 | 2.53 | 2.06 | 1.38 | 0.61 | 2.58
D11S1314 | -∞ | 2.45 | 2.88 | 2.83 | 2.36 | 1.68 | 0.87 | 2.88
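The Zmax column can be reproduced from the tabulated LOD scores. The sketch below assumes that the first seven values for each marker are the LOD scores at θ = 0, 0.01, 0.05, 0.1, 0.2, 0.3 and 0.4 and that the eighth is Zmax, taken as the maximum over these tabulated θ values rather than an interpolated maximum. The function name is hypothetical.

```python
import math

THETAS = [0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4]

LOD = {
    "D11S4102": [0, 0, 0, 0, 0, 0, 0],
    "D11S905": [4.21, 4.15, 3.88, 3.53, 2.76, 1.91, 0.96],
    "D11S4191": [3.01, 2.96, 2.77, 2.51, 1.95, 1.32, 0.65],
    "D11S987": [-math.inf, 0.97, 1.49, 1.55, 1.34, 0.95, 0.47],
    "D11S4162": [-math.inf, 2.15, 2.58, 2.53, 2.06, 1.38, 0.61],
    "D11S1314": [-math.inf, 2.45, 2.88, 2.83, 2.36, 1.68, 0.87],
}


def zmax(scores):
    """Return (maximum LOD score, theta at which it occurs) over the tabulated thetas."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], THETAS[best]


for marker, scores in LOD.items():
    print(marker, zmax(scores))
```

For the three markers with a LOD score of -∞ at θ = 0, the maximum falls at a non-zero recombination fraction (θ = 0.1 for D11S987, θ = 0.05 for D11S4162 and D11S1314).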
Supplemental Table S5 The three templated inserts derived from distant regions of the human genome
Sample | CNV | Breakpoint insertion (5′ to 3′) | Origin of inserted sequences | Genic region
Dent disease patient | ChrX deletion | TACATATAGTGACAGGGAATGG | ChrX:49701701-49701722(+) | CLCN5 intron
114816 | Chr19 deletion | ATTTGGCAGAGGGGGATTTGGCAGGGTCATAGGACAACAGCGGAGGGAAGGTCAG | Chr17:15999543-15999592(-) | NCOR1 intron
120099/120098 | Chr6 deletion | GTCACCCAGTCTGGAGTGCTGT | Chr1:10452912-10452933(-) | Intergenic region
 | | | Chr1:187864763-187864784(+) | Intergenic region
 | | | Chr1:201144808-201144829(-) | Intergenic region
 | | | Chr4:53575026-53575047(+) | Intergenic region
 | | | Chr9:26186546-26186567(-) | Intergenic region