Supplementary Material

Detailed Methods

The participating centers provided our center with comprehensive lists of GenBank accession numbers for finished clones submitted over the respective periods of the quality analysis rounds. To maximize the amount of sequence data evaluated, we removed from these lists all clones whose sequences were less than 90 kb in length. From the remaining clones, we randomly selected BACs for quality analysis, with the number of clones selected from each center fixed for the total period of the analysis round. To reflect accurately the monthly variability in finished sequence production, random selection of clones was proportional to the total amount of finished sequence produced by each center per month. Once the analysis lists were generated, we downloaded the sequence data for each finished clone to be evaluated directly from the GenBank database.

Participating centers provided glycerol stocks of the BAC templates corresponding to the submitted sequences, along with all data pertinent to the assembly of these clones. Files describing how each sequence trace was generated (DNA template, primer, sequencing chemistry and sequencing platform) accompanied all trace data; we used them to create a uniform naming convention for traces in the quality analysis. We constructed initial assemblies from these data for each clone.

We performed a significant amount of additional DNA sequencing on the BAC clones that were evaluated. We generated sheared plasmid libraries with a defined insert size from the BAC templates and sequenced these libraries to fourfold (4x) coverage in high-quality (Phred 20) bases. These additional sequence data were added to the original trace data provided by the sequencing center to form a combined assembly.
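The clone selection step described in this section (drop clones under 90 kb, then sample at random in proportion to each month's finished sequence) can be sketched as follows. This is a minimal illustration, not the procedure actually used: the record keys (accession, month, length_bp), the largest-remainder rounding, and the choice to compute monthly totals from the eligible clones only are all assumptions, and the accessions are made-up placeholders.

```python
import random
from collections import defaultdict

MIN_LENGTH_BP = 90_000  # the text excludes clones shorter than 90 kb


def monthly_quotas(bp_by_month, n_select):
    """Split n_select across months in proportion to finished bp.

    Rounding uses the largest-remainder method (an assumption; the
    text does not say how fractional quotas were handled).
    """
    total_bp = sum(bp_by_month.values())
    exact = {m: n_select * bp / total_bp for m, bp in bp_by_month.items()}
    quotas = {m: int(x) for m, x in exact.items()}
    leftover = n_select - sum(quotas.values())
    by_fraction = sorted(exact, key=lambda m: exact[m] - quotas[m], reverse=True)
    for m in by_fraction[:leftover]:
        quotas[m] += 1
    return quotas


def select_clones(clones, n_select, seed=0):
    """Randomly pick clones for one center, month by month."""
    rng = random.Random(seed)
    eligible = [c for c in clones if c["length_bp"] >= MIN_LENGTH_BP]
    by_month = defaultdict(list)
    for c in eligible:
        by_month[c["month"]].append(c)
    bp_by_month = {m: sum(c["length_bp"] for c in cs) for m, cs in by_month.items()}
    quotas = monthly_quotas(bp_by_month, n_select)
    selected = []
    for month, quota in quotas.items():
        pool = by_month[month]
        # Never ask for more clones than the month actually has.
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected


# Hypothetical input: six clones in one month, two in the next, one too short.
clones = (
    [{"accession": f"A{i}", "month": "2001-03", "length_bp": 100_000} for i in range(6)]
    + [{"accession": f"B{i}", "month": "2001-04", "length_bp": 100_000} for i in range(2)]
    + [{"accession": "C0", "month": "2001-04", "length_bp": 50_000}]
)
picked = select_clones(clones, n_select=4)
print([c["accession"] for c in picked])
```

With these inputs the March pool holds three quarters of the eligible sequence, so it receives three of the four slots; which clones are drawn within a month depends on the seed.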
We then performed sequence “refinishing” on these clones to meet highly stringent criteria, including:

Contiguity: The combined assembly must form a single contiguous linear consensus sequence (contig).
Completeness: All paired subclone end-sequences must be present in this contig in correct relative orientation. The sequence distance (measured in bp) between these paired ends must fall within the range of the sized plasmid library. There should be no breaks in plasmid template coverage as inferred from paired end links.
Redundancy: Every consensus base must be confirmed by sequence from at least two independent templates.
Accuracy: The aggregate error probability rate as calculated by the Phrap assembly algorithm must be as close to zero as reasonably achievable.

Direct sequence primer walks on the BAC template were used to resolve any discrepancies from the above criteria. In cases where it was impossible to design a unique oligonucleotide primer for the BAC template, plasmid primer walks on multiple subclone templates were employed. The primer walks included reads performed with alternative chemistries such as dGTP and/or Invitrogen enhancers. All finishing quality analysis was performed using the Phred/Phrap/Consed pipeline.

Once we determined that the combined assembly met the above criteria, we exported a consensus sequence and compared it to the original GenBank submission. Discrepancies between the consensus sequences were examined and verified. Discrepancies were counted as error events, which we define as a break in alignment between the GenBank sequence and our ‘final’ consensus sequence. Multiple contiguous incorrect bases, or multiple base pair insertions or deletions, were counted as single error events, and the number of incorrect bases was tallied.
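The error-event definition above can be illustrated with a short sketch that walks a pairwise alignment, merges contiguous incorrect columns into one event, and tallies the affected bases. It is only an illustration under stated assumptions: the input is taken to be two equal-length gapped strings (with '-' marking a gap), adjacent columns of different types are treated as separate events, and the rates are normalized per consensus base, none of which is specified in the text.

```python
def error_events(genbank_aln, consensus_aln):
    """Group contiguous disagreeing alignment columns into error events.

    Both arguments are rows of one pairwise alignment (equal length,
    '-' for gaps). The consensus row is treated as the correct sequence.
    """
    if len(genbank_aln) != len(consensus_aln):
        raise ValueError("alignment rows must have equal length")
    events = []
    current = None
    for g, c in zip(genbank_aln.upper(), consensus_aln.upper()):
        if g == c:
            kind = None
        elif g == "-":
            kind = "deletion"      # base missing from the GenBank sequence
        elif c == "-":
            kind = "insertion"     # extra base in the GenBank sequence
        else:
            kind = "substitution"
        if kind is None:
            current = None         # a matching column ends any open event
        elif current is not None and current["type"] == kind:
            current["bases"] += 1  # extend the open event
        else:
            current = {"type": kind, "bases": 1}
            events.append(current)
    return events


def error_rates(events, consensus_length):
    """Return (events per base, affected bases per base)."""
    n_bases = sum(e["bases"] for e in events)
    return len(events) / consensus_length, n_bases / consensus_length


# Toy alignment: a 2 bp substitution, a 1 bp deletion, a 1 bp insertion.
genbank = "ACGTTTGA-CCA"
consensus = "ACGAATGATC-A"
events = error_events(genbank, consensus)
event_rate, base_rate = error_rates(events, len(consensus.replace("-", "")))
print(events)
```

In the toy alignment the two adjacent substituted bases count as one event but as two affected bases, so the event count (3) is smaller than the base count (4).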
This provides two different calculations of error rate: an event rate that indicates the number of error events found, and a base pair rate that indicates the gross number of sequence bases affected by these error events. All error events were recorded and classified by type into substitutions, insertions, or deletions. Significant errors and gross misassemblies were also noted. To eliminate the possibility of counting an error caused by a clone growth issue, we noted only errors for which the original data set contained evidence of the correct sequence. We examined all error events in the original assemblies and noted the probable causes of the errors. Comprehensive reports of our results were sent to the NHGRI and single-center reports were sent to the respective sequencing centers. All data supporting our assessment were made available to the participating centers.

Identifying and Fixing Errors Without Additional Sequencing

We carried out an experiment with the round 1 centers' data to address the question of whether we would be able to locate the identified error events in the original data set. We examined the original data set assemblies at each of the identified error points and determined whether we could navigate to the error (i.e., would we have identified the error in that data set if we had not known the error was present?). We also determined whether, having been able to find the error, it would have been possible to determine the correct answer from the original data or whether more data would have been needed to resolve the correct basecall. The results are shown in Figure S2.

[Figure S2: bar chart of the percentage of error events (0% to 100%) for Centers A, B, and C, with two series, Visible and Correctable. Data labels: Center A, 88.7% visible and 75.5% correctable; Center B, 52.4% visible and 33.3% correctable; Center C, 86.6% visible and 55.2% correctable.]

Figure S2. Percent of error events visible and correctable in the original data set.

The results for centers A and C are as we would expect.
Most of center A’s error events were miscalls, which by definition are both visible in the data set and fixable with the original data. We were able to navigate to about the same percentage (87%) for center C; however, only 55% of the error events were fixable due to the varied nature of center C’s error types. Our result for center B was surprising. We could only identify 52% of the error events and only fix 33% with the available data set. Upon closer inspection, we found that a large majority of the error events from center B were compressions in the original sequencing trace. These were hidden in the traces in the original set and only became visible when we added our 4x sequence coverage to the clones.

Finishing Strategies of Surveyed Centers

These descriptions of the finishing strategies of the finishing centers are based on the projects that we examined and do not necessarily apply to all projects finished by the centers.

Round 1: 2/15/01 through 8/15/01

Baylor Human Genome Sequencing Center (39 clones)
Shotgun: Transitioned from all M13 to mixed plasmid and M13
Platform: ABI3700
Chemistry: Dye Terminator with some Dye Primer
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: Shattered PCR and small insert libraries
Common Errors: Concentrated in long homopolymers and in the areas resolved with direct sequence from PCR products and shattered PCR products.

Washington University Genome Sequencing Center (38 clones)
Shotgun: M13 draft, topped off with plasmids
Platform: ABI377 and ABI3700
Chemistry: Mixed Dye Primer and Dye Terminators
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: Alternative chemistry reads
Common Errors: Compressions in Dye Primer reads, M13 cloning issues

Whitehead Institute Center for Genome Research (40 clones)
Shotgun: M13 draft, topped off with heavy plasmid coverage
Platform: ABI3700
Chemistry: Dye Terminator
Captured Gap Closing: Transposon bombing
Uncaptured Gap Closing: Direct sequencing of duplicate PCR products
Difficult Regions: None in the sampled set
Common Errors: Miscalls and homopolymers (consensus generation errors), deleting M13 variants

Round 2: 02/15/01 through 06/30/02

Genome Therapeutics Corporation (20 clones)
Shotgun: All plasmids
Platform: MegaBACE (MB) with some ABI3700
Chemistry: Dye Terminator
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: None in the sampled set
Difficult Regions: None in the sampled set
Common Errors: Miscalls in noisy traces, GC compressions in MB traces, erroneous joins in repetitive regions

French National Sequencing Center Genoscope (20 clones)
Shotgun: 2 plasmid libraries (3 kb, 10 kb)
Platform: Licor with some ABI3700
Chemistry: Licor and Dye Terminators
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct BAC walks
Difficult Regions: None in the sampled set
Common Errors: Licor noisy bp and homopolymer length

RIKEN Genomic Sciences Center (20 clones)
Shotgun: Plasmid majority, some M13 draft
Platform: MegaBACE with small amount of ABI3700
Chemistry: Dye Terminators
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: Multiple deletions of subclones
Common Errors: Miscalls in low sequence coverage areas, MB A/T dropouts and homopolymer length, collapsed and unresolved repeats

University of Washington Genome Center (20 clones)
Shotgun: Plasmid majority, some M13 draft
Platform: MegaBACE and ABI3700
Chemistry: Dye Terminator with some Dye Primer draft
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct BAC walks and limited PCR sequencing
Difficult Regions: None in the sampled set
Common Errors: Miscalls and multi-bp indels in simple sequence runs, deletions in single-subclone areas

Computational Exchange: 2/15/01 through 8/15/01

Wellcome Trust Sanger Institute (39 clones)
Shotgun: Plasmid majority
Platform: Mixed ABI377 and ABI3700 with some MegaBACE
Chemistry: Mixed Dye Primer and Dye Terminator
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct sequencing of PCR products
Difficult Regions: dGTP reads, shattered PCR products, and shattered clones
Common Errors: Miscalls from combined polymorphic data sets, homopolymer length

Joint Genome Institute/Stanford Human Genome Center (38 clones)
Shotgun: All plasmids
Platform: MegaBACE shotgun, ABI377 finishing
Chemistry: Dye Terminator
Captured Gap Closing: Subclone primer walks
Uncaptured Gap Closing: Direct BAC walks
Difficult Regions: dGTP chemistry
Common Errors: MB A/T dropouts, MB G spikes, homopolymer length

Table S3. Average statistics for the projects surveyed. The numbers given are the average number of reads for the various finishing techniques employed by the centers surveyed.

Center                     Size (bp)   Shotgun Cov.   M13 Reads   Plasmid Reads
Baylor                     172,692     9.3x           3185        1953
Washington University      179,595     8.6x           2467        2041
Whitehead                  168,574     12.6x          2253        3246
Genoscope                  169,927     8.1x           0           2314
GTC                        176,691     10.1x          0           4132
RIKEN                      176,666     10.7x          554         5803
University of Washington   171,048     8.7x           637         3064
Sanger                     136,446     10.2x          267         5003
SHGC/JGI                   147,610     7.5x           0           4090

Center                     Primer Walks   PCR   PCR Shatter   Shattered Clone   Multiple Deletion   Transposons
Washington University      83             14    -             -                 -                   -
University of Washington   46.6           0.2   -             -                 -                   -

[The finishing-read values for the other centers are incomplete in this copy of the table and cannot be assigned to columns with certainty. In source order they read: Baylor: 62 15 42; Whitehead, Genoscope, GTC, and RIKEN together: 247 62 20.3 3 10.6 3.5 - - 9 71 -; Sanger and SHGC/JGI, interleaved: 94 74.6 4 - 6 - 19 - - -.]

Shotgun Coverage: Coverage of the project in non-vector phred20 bases in the universal primer subclone reads divided by the total size of the clone
M13 Reads: Universal primed subclone reads from M13 templates, sequenced in one direction
Plasmid Reads: Universal primed subclone reads from plasmid templates, usually sequenced in both directions, creating paired ends
Primer Walks: Custom primed finishing reads, primarily from subclone templates
PCR: Sequence generated from a PCR product, generally from the BAC clone
PCR Shatter: Reads from a PCR product that has been shattered into small inserts and subcloned into an M13 or plasmid vector
Shattered Clone: Reads from a subclone that has been shattered and recloned into an M13 or plasmid vector
Transposons: Reads from a transposon-hopped subclone, sequenced in both directions from the inserted junction site
Multiple Deletion: Universal reads from a consecutively deleted and recircularized subclone
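
The Shotgun Coverage definition above (non-vector phred20 bases in the universal primer subclone reads, divided by the clone size) amounts to a simple ratio. The sketch below shows one way to compute it; the read representation (a list of per-base quality values plus a half-open non-vector range) is a hypothetical structure chosen for illustration and is not taken from the centers' pipelines.

```python
def shotgun_coverage(reads, clone_size_bp, min_quality=20):
    """Fold coverage in high-quality, non-vector shotgun bases.

    Each read is a dict with hypothetical keys:
      'quals'        - list of per-base phred quality values
      'insert_range' - (start, end), half-open, the non-vector part of the read
    """
    if clone_size_bp <= 0:
        raise ValueError("clone size must be positive")
    q20_bases = 0
    for read in reads:
        start, end = read["insert_range"]
        # Count only non-vector bases at or above the quality threshold.
        q20_bases += sum(1 for q in read["quals"][start:end] if q >= min_quality)
    return q20_bases / clone_size_bp


# Toy example: 5 qualifying bases over a 10 bp "clone" gives 0.5x.
toy_reads = [
    {"quals": [10, 20, 30, 40, 5], "insert_range": (1, 5)},
    {"quals": [25, 25, 19, 25], "insert_range": (0, 3)},
]
coverage = shotgun_coverage(toy_reads, clone_size_bp=10)
print(f"{coverage:.1f}x")
```

Only universal-primer shotgun reads would be passed in; finishing reads such as primer walks are counted separately in the table.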
