
Supplemental data
TESTING PHASE AND TECHNOLOGY SELECTION
The first step of NGS diagnostic implementation was to test the platforms available in our
laboratory, together with various enrichment strategies. A panel of 20 control DNA samples
harbouring routine BRCA mutations as well as difficult cases was tested with different
enrichment, sequencing and bioinformatics procedures in order to define the pros and cons of
each technology from a diagnostic perspective.
We first tested the accuracy of the SOLiDv3 system by fragment sequencing of the 20
barcoded patient samples using in-house multiplex PCRs and the Multiplicom BRCA MASTR assay
v0.1 enrichment kit, both targeting approx. 40 kb of BRCA1 and BRCA2 coding sequence.
Four barcodes were found to perform poorly, with a low number of attributed reads.
Following BFAST mapping and variant calling with SAMtools, most false positives were
filtered out using the minimum coverage, the minimum allele ratio and the distribution of the
heterozygosity frequency as the main filtering parameters. All specific point mutations were
successfully localized and identified, along with the expected polymorphisms. Large
rearrangements were found by comparing normalized exonic coverage between patients.
Following these promising results in terms of sequencing accuracy, we focused on target
enrichment, a critical step which must ensure that the whole region of interest (ROI) is
properly covered. Consequently, different enrichment strategies were tested in combination
with SOLiDv4 and/or PGM sequencing. A SureSelect liquid capture panel of 43 genes involved in
oncogenetics, targeting promoters, exons and flanking regions, was designed with the help
of Agilent. The same 43-gene panel was also designed using the Selector technology from
HaloGenomics, along with a restricted format targeting BRCA1 and BRCA2 coding sequences.
We also included in our evaluation the optimized version of the Multiplicom BRCA assay (MASTR
assay v2.0). Finally, 5 control DNA samples from our panel were sent to RainDance for
enrichment by the RainStorm microdroplet-based technology on their 142-gene oncology
panel. In all cases, emphasis was put on design quality, i.e. our main concern was to ensure
complete capture of the ROI. For capacity reasons, gene panels were sequenced with the
SOLiDv4 paired-end chemistry, while BRCA-restricted enrichments were sequenced using
the Ion Torrent PGM and the Ion 316 chip with 200 bp read length. Agilent's liquid capture
provided a depth of coverage ranging from 1X to 1472X (mean 629X) and 1X to 1260X (mean
472X) for BRCA1 and BRCA2, respectively. Eighteen percent of the reads were off target and
enrichment failed for GC-rich regions such as the 5' part of the RB1 gene and some first exons.
More subtly, strand bias occurred in well-covered exons. This is a real issue despite
good depth of coverage, because some true variants at amplicon extremities are filtered out
due to an unbalanced forward/reverse ratio. Insufficient overlap in bait design combined with
the short SOLiD read length is probably the cause, so an optimized bait design
should solve the problem. In contrast, the promising Selector approach gave poor
results, with only 15% of reads on target and uneven depth of coverage. Similar results were
obtained with the dedicated Selector BRCA kit, prompting us to recommend technological
improvement before any clinical use. We also observed a lack of coverage with the RainStorm
technology, at least for a few BRCA exons, precluding its use in its present state for
diagnostic purposes. Lastly, the Multiplicom enrichment provided complete and even depth of
coverage for BRCA1 and BRCA2 (min=1 and max=8025, average=3501), thus appearing as the method
of choice, at this point in time, for small-size target enrichment in a diagnostic setting.
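The enrichment comparison above relies on a few simple figures: the minimum, maximum and mean depth of coverage over the ROI, and the fraction of reads on target. The following Python sketch shows how such figures could be derived from per-base depths and read counts. The function names are illustrative and are not part of any tool used in this study; the default 30X threshold is an assumption borrowed from the --mincov value of Supplemental table 2.

```python
from statistics import mean


def coverage_summary(depths):
    """Return (min, max, mean) depth over the per-base depths of a region."""
    if not depths:
        raise ValueError("no positions in the region of interest")
    return min(depths), max(depths), mean(depths)


def on_target_fraction(on_target_reads, total_reads):
    """Fraction of mapped reads that fall within the targeted regions."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return on_target_reads / total_reads


def poorly_covered_positions(depths, min_depth=30):
    """Indices of positions whose depth is below min_depth (assumed threshold)."""
    return [i for i, depth in enumerate(depths) if depth < min_depth]
```

A region for which poorly_covered_positions returns a non-empty list would correspond to the incomplete ROI capture that the design evaluation aimed to rule out.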
To summarize this pilot phase, two enrichment procedures appear diagnostic-compatible:
Multiplicom and Agilent SureSelect, the latter provided that bait design is optimized and
bearing in mind that GC-rich regions cannot be analysed. Another drawback of liquid capture is
that automation is mandatory for high-throughput diagnosis of a large number of cases with rapid
turnaround time. Such automation is still tedious and expensive. Regarding sequencers, the
SOLiD system, despite its sequencing accuracy, appeared inadequate for clinical diagnostics
because of its run time (10 days for v4 paired-end sequencing). In contrast, the PGM has a
shorter, diagnostic-compatible run time (2 hours). Overall, considering the constraints in
terms of enrichment quality, automation and turnaround time, we chose a combination of
Multiplicom BRCA enrichment and PGM sequencing on a 318 chip for BRCA clinical
diagnosis.
TESTING PHASE, BIOINFORMATICS PARAMETERS
Read mapping
SOLiDv3 fragment reads were mapped with BFAST 0.6.2a onto the BRCA1/2 amplicons of
the Multiplicom BRCA enrichment kit, whereas all SOLiDv4 paired-end reads were aligned
with BFAST+BWA 0.6.5a onto the human reference genome hg19 after conversion of the
csfasta/qual files into the fastq format with solid2fastq. F3 and F5 reads were mapped in
color-space with the alignment tools BFAST and BWA (ref), respectively, with default
parameters. Then, after local alignment and read pairing (with the -v 215 -s
54 -S 4.0 options for the insert size parameters), only pairs showing the best unique alignment
of reads were kept for further analysis.
SNP/Indel detection and filtering
SNVs from SOLiDv3 data were identified with the Samtools 0.1.13 pileup program and the
varFilter.pl perl script with default parameters. For all SOLiDv4 experiments, SNVs and
indels were detected with the UnifiedGenotyper of the GATK v1.0.5 suite after preprocessing
of pairs: PCR duplicates were marked, reads were locally realigned around known indels
and base quality scores were recalibrated using dbSNP132. The exome variant filters (Q<30,
QD<5, HRun>5, SB>-0.1) were then used to filter out both SNV and indel false positives.
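The four hard filters quoted above can be read as "discard a call if any one of the conditions is true". The Python sketch below expresses that logic. It is not the GATK implementation: the function name is hypothetical, and it assumes that the variant quality (Q) and the QD, HRun and SB annotations have already been extracted from the call set.

```python
def failed_exome_filters(qual, qd, hrun, sb):
    """Return the names of the hard filters that a variant call fails.

    qual: variant quality (Q); qd: quality by depth; hrun: homopolymer run
    length; sb: strand bias score. An empty list means the call is kept.
    """
    failed = []
    if qual < 30:
        failed.append("Q<30")
    if qd < 5:
        failed.append("QD<5")
    if hrun > 5:
        failed.append("HRun>5")
    if sb > -0.1:
        failed.append("SB>-0.1")
    return failed
```

Returning the list of failed filters, rather than a single boolean, makes it easier to see why a given call was rejected when reviewing discordant variants.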
Exonic rearrangement detection
For the SOLiDv3 dataset, exonic rearrangements were detected by comparing normalized
mean coverage ratios between samples. For all other sequencing datasets, copy-number
variations were detected using a depth-of-coverage method based on the read count. For each
captured or amplified region, read counts were first computed with the multiBamCov program
from BEDTools, then normalized and compared, using the Bioconductor R package DESeq, to the
mean of all samples from the same experiment, which served as a control.
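The principle of the depth-of-coverage comparison can be illustrated with a much simpler calculation than the DESeq analysis actually used. The Python sketch below normalizes each sample's read counts by its total, divides each region by the mean of all samples from the same run, and labels regions whose ratio departs from 1. It performs no statistical testing. The function names and the 0.7 and 1.3 cut-offs are assumptions made for illustration only.

```python
import math


def coverage_ratios(read_counts):
    """read_counts maps sample name -> list of read counts per region.

    All samples must list the same regions in the same order. Returns, for
    each sample, the ratio of its normalized count to the mean of all samples.
    """
    samples = list(read_counts)
    n_regions = len(read_counts[samples[0]])
    normalized = {}
    for sample in samples:
        total = sum(read_counts[sample])
        if total == 0:
            raise ValueError(f"sample {sample} has no reads")
        # Library-size normalization: fraction of the sample's reads per region
        normalized[sample] = [count / total for count in read_counts[sample]]
    ratios = {sample: [] for sample in samples}
    for i in range(n_regions):
        reference = sum(normalized[s][i] for s in samples) / len(samples)
        for s in samples:
            ratios[s].append(normalized[s][i] / reference if reference > 0 else math.nan)
    return ratios


def call_regions(ratios, low=0.7, high=1.3):
    """Label each region using assumed ratio cut-offs (not taken from the study)."""
    calls = {}
    for sample, values in ratios.items():
        calls[sample] = [
            "no_data" if math.isnan(v)
            else "deletion" if v < low
            else "duplication" if v > high
            else "normal"
            for v in values
        ]
    return calls
```

In this sketch a heterozygous deletion of one region appears as a ratio below 1 for the affected sample only. A count-based statistical model such as DESeq additionally accounts for dispersion between samples and controls the false discovery rate.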
Supplemental table 1. Mutations and rare variants found in the validation set. Nucleotide position was numbered on the basis of the coding sequences
NM_007294.2 and NM_000059.3 for BRCA1 and BRCA2, respectively. Nucleotide numbering reflects cDNA numbering with +1 corresponding to the A of
the ATG translation initiation codon in the reference sequence. Number of occurrences is indicated in brackets. (*): large rearrangements found using the
electrophoresis step (see text).
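Variant type | Gene | Description | Expected consequences | Identified with reference method | Identified with NextGene v2.3 (*) (Coverage/Mutation ratio) | Identified with academic pipeline (Coverage/Mutation ratio)
--- | --- | --- | --- | --- | --- | ---
Large rearrangements | BRCA1 | c.671-?_c.4185+? del | p.? | Yes | Yes (*) | Yes
Large rearrangements | BRCA1 | c.871-?_c.547+? dup | p.? | Yes | Yes (*) | Yes
Large rearrangements | BRCA2 | c.426-?_c.(1910_6841) dup | p.? | Yes | Yes (*) | Yes
Large rearrangements | BRCA2 | c.(?_-227)_(*902_?) dup | p.? | Yes | Yes (*) | Yes
Insertions/Deletions | BRCA1 | c.19_47del | p.(Arg7CysfsTer24) | Yes | Yes (37X/70%) | Yes (52X/40%)
Insertions/Deletions | BRCA1 | c.68_69del | p.(Glu23ValfsTer17) | Yes | Yes (35X/62%) | No
Insertions/Deletions | BRCA1 | c.3416_3427delinsC | p.(Ser1139ThrfsTer6) | Yes | Yes (82X/45%) | Yes (60X/35%)
Insertions/Deletions | BRCA2 | c.68-7dup | p.? | Yes | Yes (84X/20%) | No
Insertions/Deletions | BRCA2 | c.5722_5723del | p.(Leu1908ArgfsTer2) | Yes | Yes (330X/56%) | Yes (333X/55%)
Insertions/Deletions | BRCA2 | c.5946del | p.(Ser1982ArgfsTer22) | Yes | Yes (92X/46%) | Yes (94X/45%)
Insertions/Deletions | BRCA2 | c.6514_6515del | p.(Ser2172ThrfsTer3) | Yes | Yes (416X/44%) | Yes (447X/41%)
Insertions/Deletions | BRCA2 | c.6591_6592del | p.(Glu2198AsnfsTer4) | Yes | Yes (413X/31%) | Yes (435X/24%)
Insertions/Deletions | BRCA2 | c.10095delinsGAATTATATCT | p.(Ser3366AsnfsTer4) | Yes | Yes (177X/40%) | Yes (173X/36%)
Nucleotide substitutions | BRCA1 | c.301+7G>A | p.? | Yes | Yes (111X/56%) | Yes (115X/59%)
Nucleotide substitutions | BRCA1 | c.314A>G | p.(Tyr105Cys) | Yes | Yes (289X/44%) | Yes (257X/45%)
Nucleotide substitutions | BRCA1 | c.994C>T | p.(Arg332Trp) | No | Yes (294X/47%) | Yes (296X/49%)
Nucleotide substitutions | BRCA1 | c.2458A>G | p.(Lys820Glu) | Yes | Yes (153X/56%) | Yes (137X/60%)
Nucleotide substitutions | BRCA1 | c.3748G>A | p.(Glu1250Lys) | Yes | Yes (383X/47%) | Yes (346X/45%)
Nucleotide substitutions | BRCA1 | c.4393A>C | p.(Ile1465Leu) | Yes | Yes (124X/50%) | Yes (116X/52%)
Nucleotide substitutions | BRCA1 | c.4535G>T | p.(Ser1512Ile) | Yes | Yes (153X/46%) | Yes (142X/47%)
Nucleotide substitutions | BRCA1 | c.4812A>G | p.(=) | Yes | Yes (543X/46%) | Yes (484X/51%)
Nucleotide substitutions | BRCA1 | c.4956G>A | p.(Met1652Ile) | Yes | Yes (76X/50%) | Yes (72X/50%)
Nucleotide substitutions | BRCA2 | c.-11C>T | p.? | Yes | Yes (895X/48%) | Yes (986X/50%)
Nucleotide substitutions | BRCA2 | c.68-17A>G | p.? | Yes | Yes (112X/40%) | Yes (37X/41%)
Nucleotide substitutions | BRCA2 | c.68-7T>A (2) | p.? | No, No | Yes, Yes (Minimum: 312X/41%) | Yes, Yes (Minimum: 172X/45%)
Nucleotide substitutions | BRCA2 | c.125A>G | p.(Tyr42Cys) | Yes | Yes (458X/54%) | Yes (413X/55%)
Nucleotide substitutions | BRCA2 | c.1151C>T | p.(Ser384Phe) | Yes | Yes (810X/77%) | No
Nucleotide substitutions | BRCA2 | c.1788T>C | p.(=) | Yes | Yes (241X/42%) | Yes (219X/43%)
Nucleotide substitutions | BRCA2 | c.1938C>T | p.(=) | Yes | Yes (260X/55%) | Yes (234X/56%)
Nucleotide substitutions | BRCA2 | c.1964C>G | p.(Pro655Arg) | Yes | Yes (1082X/48%) | Yes (1024X/51%)
Nucleotide substitutions | BRCA2 | c.3079A>G | p.(Ser1027Gly) | Yes | Yes (715X/39%) | Yes (790X/40%)
Nucleotide substitutions | BRCA2 | c.3252T>C | p.(=) | Yes | Yes (1006X/51%) | Yes (998X/49%)
Nucleotide substitutions | BRCA2 | c.3264T>C | p.(=) | Yes | Yes (4065X/39%) | Yes (3556X/46%)
Nucleotide substitutions | BRCA2 | c.3516G>A | p.(=) | Yes | Yes (271X/47%) | Yes (233X/45%)
Nucleotide substitutions | BRCA2 | c.4068G>A | p.(=) | Yes | Yes (448X/47%) | Yes (278X/48%)
Nucleotide substitutions | BRCA2 | c.4090A>C | p.(Ile1364Leu) | Yes | Yes (71X/42%) | Yes (78X/44%)
Nucleotide substitutions | BRCA2 | c.4584C>T | p.(=) | Yes | Yes (999X/50%) | Yes (1221X/46%)
Nucleotide substitutions | BRCA2 | c.4585G>A | p.(Gly1529Arg) | Yes | Yes (78X/59%) | Yes (77X/60%)
Nucleotide substitutions | BRCA2 | c.4956G>A | p.(Met1652Ile) | Yes | Yes (44X/50%) | Yes (42X/52%)
Nucleotide substitutions | BRCA2 | c.5199C>T (2) | p.(=) | No, Yes | Yes, Yes (Minimum: 166X/30%) | Yes, Yes (Minimum: 145X/32%)
Nucleotide substitutions | BRCA2 | c.5418A>G | p.(=) | Yes | Yes (3948X/50%) | Yes (4299X/50%)
Nucleotide substitutions | BRCA2 | c.6037A>T | p.(Lys2013Ter) | Yes | Yes (78X/53%) | Yes (67X/52%)
Nucleotide substitutions | BRCA2 | c.6100C>T | p.(Arg2034Cys) | Yes | Yes (909X/49%) | Yes (1114X/47%)
Nucleotide substitutions | BRCA2 | c.6785T>C | p.(Met2262Thr) | No | Yes (404X/49%) | Yes (372X/50%)
Nucleotide substitutions | BRCA2 | c.7017G>C | p.(Lys2339Asn) | Yes | Yes (106X/44%) | Yes (108X/43%)
Nucleotide substitutions | BRCA2 | c.7319A>G | p.(His2440Arg) | Yes | Yes (2452X/44%) | Yes (2507X/47%)
Nucleotide substitutions | BRCA2 | c.8850G>T | p.(Lys2950Asn) | Yes | Yes (180X/39%) | Yes (155X/46%)
Nucleotide substitutions | BRCA2 | c.8851G>A | p.(Ala2951Thr) | Yes | Yes (246X/44%) | Yes (232X/45%)
Nucleotide substitutions | BRCA2 | c.9649-19G>A | p.? | Yes | Yes (966X/50%) | Yes (856X/49%)
Nucleotide substitutions | BRCA2 | c.9730G>A | p.(Val3244Ile) | Yes | Yes (7277X/47%) | Yes (8242X/50%)
Nucleotide substitutions | BRCA2 | c.10110G>A | p.(=) | Yes | Yes (511X/50%) | Yes (464X/52%)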
Supplemental table 2 : Bioinformatic tools and parameters used in the academic pipeline.
Analysis step | Program | Command line
--- | --- | ---
MAPPING | |
Mapping | TMAP | tmap mapall -g 0 -n 8 -f <reference_file> -r <sff_file> -s <output_sam_file> -v -Y stage1 map1 map2 map3
Sort SAM file | PicardTools | SortSam.jar I=<input_sam_file> O=<output_sam_file> SO=coordinate
Get BAM | Samtools | samtools view -q 10 -bSt <reference_index_file> -o <output_bam_file> <input_sam_file>
Focus on targets | BEDTools | intersectBed -abam <input_bam_file> -b <exons_bed_file>
Coverage statistics | GATK | GenomeAnalysisTK.jar -R <reference_file> -I <exons_bam_file> -T DepthOfCoverage -L <exons_bed_file> -ct 50 -ct 100 -ct 200 -ct 300 --omitDepthOutputAtEachBase
SNV/INDEL DETECTION | |
Variant calling | TVC | variantCaller.py -l -k -L <log_file> -o <flow_order_seq> -p <GermLine_json_file> -r <TVC_bin_directory> -b <exons_bed_file> <output_dir> <reference_file> <exons_bam_file>. Modified parameters in the <GermLine_json_file>: max_alternate_alleles: 3; min-bayesian-score: 0.1; min-var-freq: 0; gatk-min-score: 900
Variant Filtering | PostProcessVCF.pl | SNVs: PostProcessVCF.pl --vcf SNP_variants.vcf --mincov 30 --varfreq 0.3 --strdfreq 0.2 --out <snps_output>
 | | Indels: PostProcessVCF.pl --vcf Indel_variants.vcf --mincov 30 --varfreq 0.2 --strdfreq 0.2 --out <indels_output>
Variant Annotation | Annovar | convert2annovar.pl -format vcf4 --includeInfo <snps/indels_vcf_file>
 | Annovar | annotate_variation.pl --geneanno --buildver hg19 <snps/indels_var_file> <Annovar_bin_dir>/humandb
REARRANGEMENT DETECTION | |
Read count for each amplicon and for all samples | BEDTools | multiBamCov -bams <exons_bam_files> -bed <amplicons_bed_file>
Differential read count analysis | DESeq | Rscript --vanilla rgt_script.R <FDR> <number_of_multiplexes> <read_count_matrix>
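The variant filtering step of the table applies three thresholds (--mincov, --varfreq, --strdfreq). PostProcessVCF.pl is an in-house script whose code is not shown here, so the Python sketch below is only one plausible reading of those parameters. It assumes that --varfreq is the minimum fraction of reads carrying the variant and that --strdfreq is the minimum fraction of variant reads on the less represented strand. The function and argument names are hypothetical.

```python
def passes_post_filter(coverage, variant_reads, forward_variant_reads,
                       mincov=30, varfreq=0.3, strdfreq=0.2):
    """Apply coverage, allele-ratio and strand-balance thresholds to one call.

    Defaults mirror the SNV settings of the table; use varfreq=0.2 for indels.
    The interpretation of strdfreq is an assumption (see lead-in).
    """
    if coverage < mincov or variant_reads <= 0:
        return False
    if variant_reads / coverage < varfreq:
        return False
    forward_fraction = forward_variant_reads / variant_reads
    # The minority strand must carry at least strdfreq of the variant reads
    return min(forward_fraction, 1 - forward_fraction) >= strdfreq
```

Under these assumptions, a call with 104 variant reads out of 435 (about 24%) passes with the indel setting varfreq=0.2 but would be rejected with the SNV setting of 0.3, which shows why the two variant classes are filtered separately.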
Supplemental table 3 : Minimum input requirements for analysis
Pipeline | High-throughput reads | Reads at QV20 and more | Minimum number of aligned reads | Minimum number of matched bases | Minimum coverage observed
--- | --- | --- | --- | --- | ---
Nextgene pipeline | 34 619 | 92.91 % | 30 972 | 5 524 125 | 30X
Academic pipeline | 34 619 | 92.91 % | 33 420 | 5 566 449 | 35X