Supplementary Material

Detailed Methods
The participating centers provided our center with comprehensive lists of
GenBank accession numbers for finished clones submitted over the respective periods of
the quality analysis rounds. To maximize the amount of sequence data evaluated, we
removed from these lists all clones whose sequences were less than 90 kb in length.
From the remaining clones, we randomly selected BACs for quality analysis, with the
number of clones selected from each center fixed for the total period of the analysis
round. To reflect accurately the monthly variability in finished sequence production,
random selection of clones was proportional to the total amount of finished sequence
produced by each center per month. Once the analysis lists were generated, we
downloaded the sequence data for each finished clone to be evaluated directly from the
GenBank database. Participating centers provided glycerol stocks of the BAC templates
corresponding to the submitted sequences, along with all data pertinent to the assembly of
these clones. Files that encoded the process by which a sequence trace was generated
(DNA template, primer, sequencing chemistry and sequencing platform) accompanied all
trace data, which we used to create a uniform naming convention for traces in the quality
analysis. We constructed initial assemblies from these data for each clone.
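The selection procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the tuple layout `(accession, month, length_bp)`, the function name `select_clones`, and the rounding of monthly quotas are all assumptions, and monthly totals are computed over eligible clones only. Because quotas are rounded, the number of clones returned can differ slightly from the requested total.

```python
import random
from collections import defaultdict

MIN_LENGTH_BP = 90_000  # clones shorter than 90 kb are excluded, per the text


def select_clones(clones, n_select, seed=0):
    """Randomly pick about n_select clones, allocating picks to each month
    in proportion to the finished sequence (bp) submitted in that month.

    clones: list of (accession, month, length_bp) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    eligible = [c for c in clones if c[2] >= MIN_LENGTH_BP]
    if not eligible:
        return []

    # Group eligible clones by submission month.
    by_month = defaultdict(list)
    for clone in eligible:
        by_month[clone[1]].append(clone)

    total_bp = sum(c[2] for c in eligible)
    selected = []
    for month, group in sorted(by_month.items()):
        month_bp = sum(c[2] for c in group)
        # Rounded proportional quota, capped by the clones available.
        quota = min(len(group), round(n_select * month_bp / total_bp))
        selected.extend(rng.sample(group, quota))
    return selected
```

Fixing the seed makes a given selection reproducible, which is convenient when the analysis list has to be regenerated.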
We performed a significant amount of additional DNA sequencing on the BAC
clones that were evaluated. We generated sheared plasmid libraries with a defined insert
size from the BAC templates and sequenced these libraries to fourfold (4x) coverage in high-quality (Phred-20) bases. These additional
sequence data were added to the original trace data provided by the sequencing center in a
combined assembly. We then performed sequence “refinishing” on these clones to meet
highly stringent criteria, including:
Contiguity: The combined assembly must form a single contiguous linear consensus
sequence (contig).
Completeness: All paired subclone end-sequences must be present in this contig in
correct relative orientation. The sequence distance (measured in bp) between these
paired ends must fall within the range of the sized plasmid library. There should be no
breaks in plasmid template coverage as inferred from paired end links.
Redundancy: Every consensus base must be confirmed by sequence from at least two
independent templates.
Accuracy: The aggregate error probability rate as calculated by the Phrap assembly
algorithm must be as close to zero as reasonably achievable.
Direct sequence primer walks on the BAC template were used to resolve any
discrepancies from the above criteria. In cases where it was impossible to design a
unique oligonucleotide primer for the BAC template, plasmid primer-walks on multiple
subclone templates were employed. The primer walks included reads performed with
alternative chemistries such as dGTP and/or Invitrogen enhancers. All finishing quality
analysis was performed using the Phred/Phrap/Consed pipeline.
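Three of the refinishing criteria above lend themselves to simple automated checks. The sketch below is illustrative only and is not part of the Phred/Phrap/Consed pipeline: the function names and input layouts are hypothetical, orientation of paired ends is not checked, and the expected error count relies only on the standard Phred relationship between a quality value Q and an error probability of 10^(-Q/10).

```python
def expected_errors(consensus_qualities):
    """Sum of per-base error probabilities implied by Phred-scaled qualities."""
    return sum(10 ** (-q / 10) for q in consensus_qualities)


def pairs_out_of_range(pairs, min_insert_bp, max_insert_bp):
    """Return paired ends whose separation falls outside the library size range.

    pairs: list of (left_start, right_end) contig positions for each subclone.
    """
    return [
        (left, right)
        for left, right in pairs
        if not (min_insert_bp <= right - left <= max_insert_bp)
    ]


def under_confirmed(template_counts, minimum=2):
    """Return consensus positions confirmed by fewer than `minimum` templates.

    template_counts: number of independent templates covering each base.
    """
    return [pos for pos, n in enumerate(template_counts) if n < minimum]
```

Positions or pairs returned by these checks would be candidates for the primer walks described above.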
Once we determined that the combined assembly met the above criteria, we
exported a consensus sequence and compared it to the original GenBank submission.
Discrepancies between the consensus sequences were examined and verified.
Each discrepancy was counted as an error event, which we define as a break in alignment
between the GenBank sequence and our ‘final’ consensus sequence. Multiple contiguous
incorrect bases, or multiple base pair insertions or deletions, were counted as single
error events, and the number of incorrect bases was tallied. This provides two different
calculations of error rate: an event rate that indicates the number of error events found,
and a base pair rate that indicates the gross number of sequence bases affected by these
error events. All error events were recorded and classified by type into substitutions,
insertions, or deletions. Significant errors and gross misassemblies were also noted. To
eliminate the possibility of counting an error from a clone growth issue, we only noted
errors that had evidence for the correct representation in the original data set.
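The two error-rate calculations described above can be illustrated with a short sketch. The event representation `(kind, n_bases)` and the function name `error_rates` are assumptions made for illustration; the denominator is taken to be the number of bases compared.

```python
from collections import Counter


def error_rates(events, bases_compared):
    """Tally error events and affected bases, and derive both rates.

    events: list of (kind, n_bases) tuples, where kind is "substitution",
            "insertion", or "deletion" and n_bases is the number of bases
            affected by that single event (hypothetical layout).
    """
    n_events = len(events)
    n_bases = sum(n for _, n in events)
    return {
        "events": n_events,
        "bases": n_bases,
        "event_rate": n_events / bases_compared,  # error events per base
        "base_rate": n_bases / bases_compared,    # affected bases per base
        "by_type": dict(Counter(kind for kind, _ in events)),
    }
```

A three-base deletion therefore contributes one event to the event rate but three bases to the base pair rate.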
We examined all error events in the original assemblies and noted the probable
causes of the errors. Comprehensive reports of our results were sent to the NHGRI and
single-center reports were sent to the respective sequencing centers. All data supporting
our assessment was made available to the participating centers.
Identifying and Fixing Errors Without Additional Sequencing
We carried out an experiment with the round 1 center's data to address the
question of whether we would be able to locate the identified error events in the original
data set. We examined the original data set assemblies at each of the identified error
points and determined whether we could navigate to the error (i.e., would we have
identified the error in that data set if we did not know the error was present?). We also
determined whether, having found the error, it would have been possible to
determine the correct answer from the original data or whether more data would have
been needed to resolve the correct base call. The results are shown in Figure S2.
[Figure S2: bar chart. Y-axis: percent of error events, 0% to 100%. Series: Visible, Correctable. Center A: 88.7% visible, 75.5% correctable; Center B: 52.4% visible, 33.3% correctable; Center C: 86.6% visible, 55.2% correctable.]
Figure S2. Percent of error events visible and correctable in the original data set.
The results for centers A and C are as we would expect. Most of center A’s error events
were miscalls, which by definition are both visible in the data set and fixable with the
original data. We were able to navigate to about the same percentage (87%) for center C;
however, only 55% of the error events were fixable due to the varied nature of center C’s
error types. Our result for center B was surprising. We could only identify 52% of the
error events and only fix 33% with the available data set. Upon closer inspection, we
found that a large majority of the error events from center B were compressions in the
original sequencing trace. These were hidden in the traces in the original set and only
became visible when we added our 4x sequence coverage to the clones.
Finishing Strategies of Surveyed Centers
These descriptions of the centers' finishing strategies are based on the
projects that we examined and do not necessarily apply to all projects finished by the
centers.
Round 1
2/15/01 through 8/15/01
Baylor Human Genome Sequencing Center (39 clones)
Shotgun: Transitioned from all M13 to mixed plasmid and M13
Platform: ABI3700
Chemistry: Dye Terminator with some Dye Primer
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: Shattered PCR and small insert libraries
Common Errors: Concentrated in long homopolymers and in the areas resolved with direct sequence from PCR products and shattered PCR products.
Washington University Genome Sequencing Center (38 clones)
Shotgun: M13 draft, topped off with plasmids
Platform: ABI377 and ABI3700
Chemistry: Mixed Dye Primer and Dye Terminators
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: Alternative chemistry reads
Common Errors: Compressions in Dye Primer reads, M13 cloning issues
Whitehead Institute Center for Genome Research (40 clones)
Shotgun: M13 draft, topped off with heavy plasmid coverage
Platform: ABI3700
Chemistry: Dye Terminator
Captured Gap Closing: Transposon bombing
Uncaptured Gap Closing: Direct sequencing of duplicate PCR products
Difficult Regions: None in the sampled set
Common Errors: Miscalls and homopolymers (consensus generation errors), deleting M13 variants
Round 2
02/15/01 through 06/30/02
Genome Therapeutics Corporation (20 clones)
Shotgun: All plasmids
Platform: MegaBACE (MB) with some ABI3700
Chemistry: Dye Terminator
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: None in the sampled set
Difficult Regions: None in the sampled set
Common Errors: Miscalls in noisy traces, GC compressions in MB traces, erroneous joins in repetitive regions
French National Sequencing Center Genoscope (20 clones)
Shotgun: 2 plasmid libraries (3 kb, 10 kb)
Platform: Licor with some ABI3700
Chemistry: Licor and Dye Terminators
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct BAC walks
Difficult Regions: None in the sampled set
Common Errors: Licor noisy bp and homopolymer length
RIKEN Genomic Sciences Center (20 clones)
Shotgun: Plasmid majority, some M13 draft
Platform: MegaBACE with small amount of ABI3700
Chemistry: Dye Terminators
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: Multiple deletions of subclones
Common Errors: Miscalls in low sequence coverage areas, MB A/T dropouts and homopolymer length, collapsed and unresolved repeats
University of Washington Genome Center (20 clones)
Shotgun: Plasmid majority, some M13 draft
Platform: MegaBACE and ABI3700
Chemistry: Dye Terminator with some Dye Primer draft
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct BAC walks and limited PCR sequencing
Difficult Regions: None in the sampled set
Common Errors: Miscalls and multi-bp indels in simple sequence runs, deletions in single-subclone areas
Computational Exchange
2/15/01 through 8/15/01
Wellcome Trust Sanger Institute (39 clones)
Shotgun: Plasmid majority
Platform: Mixed ABI377 and ABI3700 with some MegaBACE
Chemistry: Mixed Dye Primer and Dye Terminator
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: dGTP reads, shattered PCR products, and shattered clones
Common Errors: Miscalls from combined polymorphic data sets, homopolymer length
Joint Genome Institute/Stanford Human Genome Center (38 clones)
Shotgun: All plasmids
Platform: MegaBACE shotgun, ABI377 finishing
Chemistry: Dye Terminator
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct BAC walks
Difficult Regions: dGTP chemistry
Common Errors: MB A/T dropouts, MB G spikes, homopolymer length
Table S3. Average statistics about projects surveyed. The numbers given are the
average number of reads for the various finishing techniques employed by the centers
surveyed.
Center                    Size (bp)  Shotgun Cov.  M13 Reads  Plasmid Reads  Primer Walks  PCR  PCR Shatter  Shattered Clone  Multiple Deletion  Transposons
Baylor                    172,692    9.3x          3185       1953           62            15   42
Washington University     179,595    8.6x          2467       2041           83            14   -            -                -                  -
Whitehead                 168,574    12.6x         2253       3246
Genoscope                 169,927    8.1x          0          2314
GTC                       176,691    10.1x         0          4132
RIKEN                     176,666    10.7x         554        5803
University of Washington  171,048    8.7x          637        3064           46.6          0.2  -            -                -                  -
Sanger                    136,446    10.2x         267        5003           94            4    6            19               -
SHGC/JGI                  147,610    7.5x          0          4090           74.6          -    -            -                -
Values for the remaining columns of the Whitehead, Genoscope, GTC, and RIKEN rows, in source order (column assignment not recoverable): 247, 62, 20.3, 3, 10.6, 3.5, -, -, 9, 71, -
Shotgun Coverage: Coverage of the project in non-vector Phred-20 bases in the universal primer subclone reads, divided by the total size of the clone
M13 Reads: Universal primed subclone reads from M13 templates, sequenced in one direction
Plasmid Reads: Universal primed subclone reads from plasmid templates, usually sequenced in both directions, creating paired ends
Primer Walks: Custom primed finishing reads, primarily from subclone templates
PCR: Sequence generated from a PCR product, generally from the BAC clone
PCR Shatter: Reads from a PCR product that has been shattered into small inserts and subcloned into an M13 or plasmid vector
Shattered Clone: Reads from a subclone that has been shattered and recloned into an M13 or plasmid vector
Transposons: Reads from a transposon-hopped subclone, sequenced in both directions from the inserted junction site
Multiple Deletion: Universal reads from a consecutively deleted and recircularized subclone
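As an illustration of the Shotgun Coverage definition above, the following sketch counts Phred-20 bases and divides by the clone size. The input layout (one list of per-base quality values for each vector-trimmed universal primer read) and the function name `shotgun_coverage` are assumptions; a quality value of 20 or higher is taken to be a Phred-20 base.

```python
def shotgun_coverage(read_qualities, clone_size_bp, threshold=20):
    """Fold coverage in bases at or above the Phred threshold.

    read_qualities: one list of per-base Phred quality values per read,
                    with vector sequence already trimmed (hypothetical layout).
    clone_size_bp:  total size of the finished clone in base pairs.
    """
    high_quality_bases = sum(
        1 for read in read_qualities for q in read if q >= threshold
    )
    return high_quality_bases / clone_size_bp
```

For example, 1,600,000 qualifying bases over a 172,692 bp clone corresponds to roughly 9.3x coverage.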