Supplementary data
WGS500 AUTHORSHIP
Steering Committee
Peter Donnelly (Chair) 1, John Bell 2, David Bentley 3, Gil McVean 1, Peter Ratcliffe 1, Jenny Taylor 1,4, Andrew Wilkie 4,5
Operations Committee
Peter Donnelly 1 (Chair), John Broxholme 1, David Buck 1, Jean-Baptiste Cazier 1, Richard Cornall 1, Lorna Gregory 1, Julian Knight 1, Gerton Lunter 1, Gilean McVean 1, Jenny Taylor 1,4, Ian Tomlinson 1,4, Andrew Wilkie 4,5
Sequencing & Experimental Follow up
David Buck 1 (Lead), Christopher Allan 1, Moustafa Attar 1, Angie Green 1, Lorna Gregory 1, Sean Humphray 3, Zoya Kingsbury 3, Sarah Lamble 1, Lorne Lonie 1, Alistair Pagnamenta 1, Paolo Piazza 1, Guadelupe Polanco 1, Amy Trebes 1
Data Analysis
Gil McVean 1 (Lead), Peter Donnelly 1, Jean-Baptiste Cazier 1, John Broxholme 1, Richard Copley 1, Simon Fiddy 1, Russell Grocock 3, Edouard Hatton 1, Chris Holmes 1, Linda Hughes 1, Peter Humburg 1, Alexander Kanapin 1, Stefano Lise 1, Gerton Lunter 1, Hilary Martin 1, Lisa Murray 3, Davis McCarthy 1, Andy Rimmer 1, Natasha Sahgal 1, Ben Wright 1, Chris Yau 6
1 The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK.
2 Office of the Regius Professor of Medicine, Richard Doll Building, Roosevelt Drive, Oxford, OX3 7LF, UK.
3 Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Essex, CB10 1XL, UK.
4 NIHR Oxford Biomedical Research Centre, Oxford, UK.
5 Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DS, UK.
6 Imperial College London, South Kensington Campus, London, SW7 2AZ, UK.
Supplementary Methods
Whole genome sequencing and analysis in detail
2 µg of DNA was fragmented using a Covaris S2 system with the following settings: Duty Cycle = 10%, Intensity = 5%, Cycles/burst = 200. The distribution of fragments after shearing was determined using a Tapestation 1DK system (Agilent/Lab901). Libraries were constructed using the NEBNext DNA Sample Prep Master Mix Set 1 Kit (NEB) with minor modifications.
Ligation of adapters was performed using 6 µl of Illumina Adapters (Multiplexing Sample Preparation Oligonucleotide Kit). Ligated libraries were size selected using 2% E-Gel® EX (Invitrogen) and the distribution of fragments in the purified fraction was determined using a Tapestation 1DK system (Agilent/Lab901). Each library was PCR enriched with 25 µM each of the following custom primers:
Multiplex PCR primer 1.0
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
Index primer
5'-CAAGCAGAAGACGGCATACGAGAT[INDEX]CAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3'
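To illustrate how the [INDEX] placeholder in the index primer is filled, the following minimal Python sketch substitutes an 8 bp index into the template given above. The in-house index sequences are not listed in this supplement, so the index used here (ACGTACGT) is a hypothetical placeholder, as is the function name.

```python
# Index primer template from the text; [INDEX] marks the 8 bp sample index.
INDEX_PRIMER_TEMPLATE = (
    "CAAGCAGAAGACGGCATACGAGAT[INDEX]"
    "CAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"
)


def build_index_primer(index_seq: str) -> str:
    """Return the full index primer (5'->3') for one 8 bp index."""
    index_seq = index_seq.upper()
    if len(index_seq) != 8 or set(index_seq) - set("ACGT"):
        raise ValueError("index must be 8 bases of A, C, G or T")
    return INDEX_PRIMER_TEMPLATE.replace("[INDEX]", index_seq)


# Hypothetical index; the real in-house indexes are not given in the source.
primer = build_index_primer("ACGTACGT")
print(primer)
```

Validating the index length and alphabet before substitution guards against a malformed index silently producing an unusable primer.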
Indexes were 8 bp long and part of an indexing system developed in-house. Four independent PCR reactions per sample were prepared, each using 25% of the volume of the pre-PCR library. After 8 cycles of PCR (cycling conditions as per Illumina recommendations) the four reactions were pooled and purified with AMPure XP beads. The final size distribution was determined using a Tapestation 1DK system (Agilent/Lab901). The concentration of each library was determined by real-time PCR using the Agilent qPCR Library Quantification Kit and an MX3005P instrument (Agilent).
Sequencing was performed on a HiSeq2000 as 100 bp paired-end reads, using version 3 sequencing chemistry. A PhiX control was spiked into the library. We ran 2 lanes of the original library. Then, to "top up" to the required coverage, we ran the library in a multiplex of 4.
WGS reads were mapped to the human reference genome (GRCh37d5/hg19) using STAMPY [1] and duplicate reads were removed using Picard. After removal of duplicate reads, the mean coverage across the genome was 25.6x with 90.4% of bases covered at 15x or more. The mean coverage over the 17.1 Mb ROH identified by SNP analysis was 25.9x with 93.4% of bases covered at 15x or more. Coverage was calculated with custom scripts and the BEDTOOLS package [2].
Identification of variant sites and alleles was performed with the in-house software Platypus [3], which can detect SNPs and short (<50 bp) indels.
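The coverage figures quoted above (mean depth and the fraction of bases covered at 15x or more) can be derived from per-base depths. The custom scripts are not part of this supplement, so the sketch below is only an assumed illustration: it parses three-column per-base depth lines (chromosome, position, depth), the layout written by `bedtools genomecov -d`. The function name and the toy input are hypothetical.

```python
from typing import Iterable, Tuple


def coverage_summary(lines: Iterable[str], min_depth: int = 15) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions with depth >= min_depth)."""
    total_bases = 0
    depth_sum = 0
    bases_at_min = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _chrom, _pos, depth = line.split()[:3]
        d = int(depth)
        total_bases += 1
        depth_sum += d
        if d >= min_depth:
            bases_at_min += 1
    if total_bases == 0:
        raise ValueError("no positions supplied")
    return depth_sum / total_bases, bases_at_min / total_bases


# Toy input (not real data): four positions on one chromosome.
toy = ["chr1\t1\t30", "chr1\t2\t20", "chr1\t3\t10", "chr1\t4\t40"]
mean_depth, frac_15x = coverage_summary(toy)
print(f"mean depth {mean_depth:.1f}x, {frac_15x:.1%} of bases at >=15x")
```

Restricting the input to positions inside a region of interest (for example the ROH) would give the region-specific figures in the same way.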
Exome sequencing and analysis in detail for Saudi Arabian kinship
5 µg of DNA was fragmented using a Covaris S2 system with the following settings: Duty Cycle = 20%, Intensity = 4%, Cycles/burst = 200, time = 118 s. The distribution of fragments after shearing was determined using a 2100 Bioanalyser (Agilent). Libraries were constructed using the TruSeq Exome enrichment capture technology (Illumina) as per the manufacturer's protocol in a technical service provided by Eurofins MWG Operon (www.eurofinsdna.com). Sequencing was performed on the Illumina HiSeq2000 platform as 2x100 bp paired-end reads. Raw sequencing reads were aligned to the consensus genome (hg19), sorted and converted to a BAM file using Mosaik (version 1.1.21; http://bioinformatics.bc.edu/marthlab/Mosaik). The BAM file was indexed and variants were called using samtools (version 0.1.16; Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9). The alignments were optimised for indel calling and indels were called using dindel (version 1.0.12; CA Albers, G Lunter, Daniel G MacArthur, Gilean McVean, Willem H Ouwehand, Richard Durbin. Dindel: Accurate indel calls from short-read data. Genome Research 2010). The resulting list of variants was visualised and assessed using the UCSC Genome Browser (Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006).
Putative disease-causing variants were verified by PCR amplification and Sanger sequencing.
Genomic DNA was extracted from blood samples by automated DNA extraction on the M48
BioRobot using the MagAttract DNA blood Mini M48 kit (Qiagen 951336) as part of the routine service performed by the Northern Region Genetics service molecular laboratory, and amplified by the Moltaq PCR kit (Molzym P-010-1000). Primer sequences are available on request. All sequencing was performed using bi-directional fluorescent sequencing on an ABI 3730 XL 96 capillary sequencer, with BigDye Version 3.1 chemistry.
1. Lunter G, Goodson M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research 2011; 21: 936-939.
2. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010; 26(6): 841-842.
3. Rimmer A, Mathieson I, McVean G, Lunter G. Platypus program: Integrated Variant Caller. The Wellcome Trust Centre for Human Genetics, University of Oxford. 2012. http://www.well.ox.ac.uk/platypus/
Table S1 Analysis of whole exome sequence data for Cases 1 and 2 (ALG14 mutation)

                                                        Sibling 1   Sibling 2
All variants                                            1511        1548
Synonymous substitutions                                1178        1192
Non-synonymous substitutions and mutations in
3'UTR, 5'UTR and splicing mutations                     333         356
Homozygous mutations                                    70          87
Genes with two or more mutations                        36          38

Further columns: "Shared between two siblings" and "Shared between two siblings but absent from other in-house exomes". Values for these two columns, in source order (row assignment not recoverable from the extracted layout): 52, 2, 0, 1-Alg14
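The filtering cascade summarised in Table S1 amounts to set operations on variant calls: keep what both siblings share, discard what is seen in other in-house exomes, then look for genes hit more than once. The sketch below is an assumed illustration, not the pipeline that was used; variants are hypothetical (gene, position, change) tuples and the helper names are invented for the example.

```python
from collections import Counter


def shared_rare_variants(sib1, sib2, inhouse):
    """Variants present in both siblings and absent from in-house exomes."""
    return (set(sib1) & set(sib2)) - set(inhouse)


def genes_with_multiple_hits(variants, min_hits=2):
    """Genes carrying at least min_hits of the given variants."""
    counts = Counter(gene for gene, _pos, _change in variants)
    return sorted(g for g, n in counts.items() if n >= min_hits)


# Toy variant calls (invented identifiers, not the study data).
sib1 = {("GENE_A", 100, "C>T"), ("GENE_A", 250, "G>A"), ("GENE_B", 42, "A>G")}
sib2 = {("GENE_A", 100, "C>T"), ("GENE_A", 250, "G>A"), ("GENE_B", 42, "A>G"),
        ("GENE_C", 7, "T>C")}
inhouse = {("GENE_B", 42, "A>G")}

candidates = shared_rare_variants(sib1, sib2, inhouse)
print(genes_with_multiple_hits(candidates))
```

A gene with two or more shared, rare variants is a candidate for compound heterozygous inheritance, which is why that count appears as its own row.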
Table S2 Analysis of whole exome sequence data for Case 5 (ALG2 mutation, c.283_296delGGGGACTGGCTGCinsAGTCCCCGGC)

                                      Candidate interval (9p31.1)   Whole exome
Total SNVs                            72                            59068
Exomic SNVs                                                         13676
Genomic indels                                                      6844
Exomic indels                                                       203
Total exonic variants                                               13879
Homozygous                                                          939
Not in dbSNP                                                        354
Homozygous and not in dbSNP
Segregating with disease phenotype

Values for the remaining cells, in source order (cell assignment not recoverable from the extracted layout): 18, 18, 0; 18, 12, 0; 0, 0, 12, 0

SNV: single nucleotide variant
Table S3. Analysis of whole genome sequencing data for Case 7 (ALG2 mutation, c.203C>G) – Homozygous variants

Variant class                       Homozygous variants   After removal*
Stopgain                            2                     0
Non-synonymous                      14                    3
Splicing                            3                     0
Insertion (non-frameshift)          7                     0
Insertion (frameshift)              5                     0
Deletion (non-frameshift)           6                     0
Deletion (frameshift)               1                     0
Other exonic (ncRNA, synonymous)    44                    0
3'UTR                               113                   0
5'UTR                               31                    0
TOTAL                               226                   3

* Removal of variants where sequences were misaligned due to segmental duplication, non-homozygous variants, and variants located on the X chromosome.
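The removal step named in Table S3 can be pictured as a predicate applied to annotated variant records, followed by a tally per functional class. This is a minimal sketch under assumptions: the record fields (`chrom`, `genotype`, `in_segdup`, `effect`) and the toy records are hypothetical and do not reflect the annotation format actually used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str        # chromosome name, e.g. "9" or "X"
    genotype: str     # "hom" or "het" (assumed encoding)
    in_segdup: bool   # overlaps a segmental duplication (assumed flag)
    effect: str       # functional class, e.g. "non-synonymous"


def passes_filter(v: Variant) -> bool:
    """Keep homozygous variants that are off chromosome X and outside
    segmental duplications."""
    return v.genotype == "hom" and v.chrom != "X" and not v.in_segdup


# Toy records (invented, not the study data).
toy_calls = [
    Variant("9", "hom", False, "non-synonymous"),
    Variant("X", "hom", False, "non-synonymous"),
    Variant("2", "het", False, "stopgain"),
    Variant("5", "hom", True, "splicing"),
]
kept = [v for v in toy_calls if passes_filter(v)]
by_effect = Counter(v.effect for v in kept)
print(dict(by_effect))
```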
Figure S1
(Gel image; lanes labelled II i, ii, iii and III i, ii, iii)
Segregation of the ALG2 c.203T>G variant within the pedigree. Restriction digest analysis of DNA from family members. For numbering of family members see Figure 5A. ALG2 exon 1 was amplified from genomic DNA and digested with AgeI, and the products were separated on an agarose gel. c.203T>G results in loss of the AgeI site. IIii = index case.
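The restriction assay relies on the variant abolishing an AgeI recognition site (ACCGGT), so that the variant amplicon is no longer cut. As a hedged illustration, the sketch below searches for that site in two short, invented sequences; they are not the ALG2 exon 1 sequence, and the function name is hypothetical.

```python
AGEI_SITE = "ACCGGT"  # AgeI recognition sequence (palindromic)


def agei_sites(seq: str) -> list:
    """Return 0-based start positions of AgeI sites in seq."""
    seq = seq.upper()
    n = len(AGEI_SITE)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == AGEI_SITE]


# Invented toy sequences: the 'variant' differs by a single T>G change
# inside the recognition site.
reference = "GGCTAACCGGTCATG"
variant = "GGCTAACCGGGCATG"

print("reference sites:", agei_sites(reference))
print("variant sites:", agei_sites(variant))
```

Because the site is palindromic, scanning one strand is sufficient to find every occurrence.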
Figure S2
Reads from whole genome sequencing showing the homozygous variant ALG2 c.203C>G