An Overview of DNA Sequencing

advertisement
An Overview of
DNA Sequencing
Prokaryotic DNA
Plasmid
http://en.wikipedia.org/wiki/Image:Prokaryote_cell_diagram.svg
Eukaryotic DNA
http://en.wikipedia.org/wiki/Image:Plant_cell_structure_svg.svg
DNA Structure
The two strands of a
DNA molecule are held
together by weak bonds
(hydrogen bonds)
between the
nitrogenous bases,
which are paired in the
interior of the double
helix.
The two strands of
DNA are antiparallel;
they run in opposite
directions. The carbon
atoms of the
deoxyribose sugars are
numbered for
orientation.
http://en.wikipedia.org/wiki/Image:DNA_chemical_structure.png
Sequencing DNA
The goal of sequencing DNA
is to tell the order of the
bases, or nucleotides, that
form the inside of the
double-helix molecule.
We can do this in one of two
ways:
• Directed Sequencing
• Shotgun Sequencing
Directed Sequencing=Primer Walking
• Start with genome, gene, clone, PCR product
• Design a primer and sequence a certain segment
of the genome, usually the beginning.
• From that sequence, design the next primer and
sequence the next segment of the genome.
• Continue designing primers and sequencing until
the genome is completed.
.
Shotgun Sequencing
• Start with a whole genome or a large piece of the
DNA (a BAC).
• Shear the DNA into many different, random
segments.
• Sequence each of the random segments.
• Then, put the pieces back together again in their
original order.
Theory Behind Shotgun Sequencing
3000
2500
2000
Gaps
Haemophilus influenzae 1.83 Mb base
Coverage
unsequenced (%)
1X
37%
2X
13%
5X
0.67%
6X
0.25
7X
0.09%
1500
1000
500
0
0
20000
40000
60000
80000
Sequences
For 1.83 Mb genome, 6X coverage is 10.98 Mb of sequence, or 22,000
sequencing reactions, 11000 clones (1.5-2.0 kb insert), 500 bp average read.
BAC based Projects
BACs are Bacterial Artificial
Chromosomes. They are
large transport systems that
can hold pieces of DNA that
are 50 to 300 kilobases. To
make copies of the DNA we
insert the BAC into an E.
coli cell.
transform in E. coli
and process each
BAC as individual
projects
chromosome
break into
large pieces
clone into BAC
vector
Shotgun vs. Directed: Which is the better,
more efficient method?
•
With directed, sequences are done in order and there’s no
puzzle to put back together. Minimal computing power is
required.
•
Primers must be continually designed and purchased.
•
If an area is difficult to sequence, you could get stuck.
•
Shotgun sequencing takes less time; all the sequences can be
done at about the same time.
•
Minimal cost
•
But, the pieces have to be assembled, a time-consuming
process requiring extensive computational power.
Whole Genome Shotgun Sequencing
1.Library construction
2. Random Sequencing Phase
a. sequence DNA
(15,000 sequences/ Mb)
a. isolate DNA
3. Closure Phase
a. assemble sequences
b. close gaps
b. fragment DNA
c. edit
c. clone DNA
GGG ACTGTTC ...
d. annotation
237
4. COMPLETE
GENOME SEQUENCE
239
238
Library Construction
•
Construct shotgun libraries of the genome or target DNA; sizes
of inserts are varied; could be small (2-3 kb), medium (8-12
kb), or fosmid (30-40 kb)
• Start with multiple copies of purified DNA
• Shear the DNA using mechanical force, breaking it into
smaller pieces
• DNA fragments are cloned into a plasmid vector to
replicate the DNA; these fragments are called “inserts”
Library Construction
•
•
Insert fragment DNA into a
vector such as pBR322
Transform into E. coli cells and, using an antibiotic, select
for cells that have a plasmid. The plasmids carry antibiotic
resistant genes. When plated in the presence of an
antibiotic, the cells without a plasmid die.
Template Preparation
• Isolation of the plasmid DNA
• Transformed E.coli is plated onto an agar plate. Every E.
coli colony will contain plasmids with the same insert.
• Colonies are “picked” & transferred to liquid media where
they multiply; use 384 well high throughput plates
• Plasmids are isolated and suspended in a buffer.
Template Production Laboratory
Current Capacity:
22,000,000 plasmids/year
Sanger Sequencing
•
Utilize dideoxy sequencing method of chain termination (Sanger)
• Each plasmid is reacted with a forward and reverse primer (2
reactions for each piece of DNA).
• Done in high throughput manner in 384 well plates
Sequencing reactions
-Initial dideoxy sequencing involved use of
radioactive dATP and 4 separate reactions (ddATP,
ddTTP, ddCTP, ddGTP) & separation on 4 separate
lanes on an acrylamide gel with detection through
autoradiogram
-New techologies use 4 fluorescently labeled bases
and separation on capillaries and detection through
a CCD camera
DNA sequencing
R primer
binds,
synthesis
Plasmid Structure
Antibiotic
resistance
F primer binds, synthesis
Sequencing Machines
• The DNA fragments are loaded into capillaries in
the sequencing machines.
• Polymer in the capillaries provides a matrix for
separating the DNA fragments based on size.
• Separation of the fragments through a matrix is
called electrophoresis.
Sequence Production Laboratory
Current Capacity:
40,000,000 sequences/year
Data Collection
• A laser excites the fluorescent dyes.
• A camera detects the fluorescence.
• Data collection software collects the data.
Capillary array view
Sequencing Machine Output
AACTCATCGAATCCGTACGGG
AACTCATCGAATCCGTACGG
AACTCATCGAATCCGTACG
AACTCATCGAATCCGTAC
AACTCATCGAATCCGTA
AACTCATCGAATCCGT
AACTCATCGAATCCG
AACTCATCGAATCC
AACTCATCGAATC
AACTCATCGAAT
AACTCATCGAA
AACTCATCGA
AACTCATCG
AACTCATC
AACTCAT
AACTCA
AACTC
AACT
AAC
AA
A
Fluorescent
Sequencing Gel
Four colors, one lane
per sample
Fluorescent Sequencing Gel
Each fragment differs by one
nucleotide
This is a diagram of just one
lane. Reading from the bottom,
where the fragment is only one
base long, the fluorescent dye
is an A. This is the first base in
the sequence.
Data Analysis
• An chromatogram is produced and the bases are called
• Software assign a quality value to each base
• Phred & TraceTuner
• Read DNA sequencer traces
• Call bases
• Assign base quality values
• Write basecalls and quality values to output files.
Assemble Fragments
AGCTAGGCTC
GCTAGCTAGCT
AGCTCGCTA
CTAGCTAGCTAGGCTC
SEQUENCER OUTPUT
AGCTAGC
AGCTCGCTAGCTA
GCTAGCTAGC
TAGCTAGC
ASSEMBLE
FRAGMENTS
AGCTCGCTAGCTAGCTAGCTAGCTAGGCTC
GCTAGCTAGC
TAGCTAGC
AGCTCGCTAGCTA
TAGCTAGCTA
AGCTAGC
AGCTCGCTA
CTAGCTAGCTAGGCTC
AGCTAGGCTC
GCTAGCTAGCT
CTCGCTAGCTAG
CLOSURE & ANNOTATION
TAGCTAGCTA
CTCGCTAGCTAG
•
• FRAGMENTS
FROM
SEQUENCING
PROCESS
CONSENSUS
SEQUENCE
Closure
• Assemble the sequence files, relate them to each other,
and close gaps
• Involves many computational programs to identify
overlapping sequences, linkages between sequences
• Back to the lab for hard to close gaps
Complicating Factors
• A procedure that works well in one species may
not produce the same results in even a closely
related species.
• Each genome is uniquely different in its size and
how it sequences and assembles
• New technologies must be developed to tackle
the unique characteristics and properties of
difficult genomes
Whole Genome Shotgun Sequencing:
Modifying for Eukaryotes
ƒNot restricted to bacterial organisms
ƒSequence eukaryotes:
ƒwhole genome draft sequence; same approach as
with bacteria
ƒchromosome by chromosome; sequence genome
using large insert bacterial artificial chromosome
(BAC) clones anchored to the chromosomes
ƒcombination of whole genome and chromosome
by chromosome
Whole Genome Draft Sequencing
1.Library construction
2. Random Sequencing Phase
a. sequence DNA
(15,000 sequences/ Mb)
a. isolate DNA
3. Closure Phase
a. assemble sequences
b. close gaps
b. fragment DNA
c. edit
c. clone DNA
GGG ACTGTTC ...
d. annotation
237
4. COMPLETE
GENOME SEQUENCE
239
238
Whole Genome Draft Sequencing
1.Library construction
2. Random Sequencing Phase
a. sequence DNA
(15,000 sequences/ Mb)
a. assemble
sequences
Advantages:
Saves time and money
b. close gaps
(~50 %)
a. isolate DNA
b. fragment DNA
c. clone DNA
3. Closure Phase
GGG ACTGTTC ...
Disadvantages: Incomplete
c. edit
sequence, contains
errors
d. annotation
237
4. COMPLETE
GENOME SEQUENCE
239
238
454 Genome Sequencing System
•
•
•
•
•
•
•
•
•
Library prep, amplification and sequencing: 2-4 days
Single sample preparation from bacterial to human genomic DNA
Single amplification per genome with no cloning or cloning artifacts
Picoliter volume molecular biology
100 Mb per run (4-5 hr); less than $ 20,000 per run
Read lengths 200-230 bases
Massively parallel imaging, fluidics and data analysis
Requires high genome coverage for good assembly
Error rate of 1-2%
454-Pyrosequencing
Construct
Single stranded
adaptor
liagated DNA
Perform emulsion
PCR
ƒ Sequencing by Synthesis:
ƒSimultaneous sequencing of the entire genome in
hundreds of thousands of picoliter-size wells
ƒPyrophosphate signal generation
Depositing DNA Beads into the
PicoTiter™Plate
Expressed Sequence Tags (ESTs): Sampling the
Transcriptome and Genic Regions
ƒ What is an EST?
ƒ single pass sequence from cDNA
ƒ specific tissue, stage, environment, etc.
cDNA
library
in E.coli
pick
individual
clones
template
prep
T7
T3
Insert in
pBluescript
ƒMultiple tissues, states..
ƒwith enough sequences, can ask quantitative
questions
Uses of EST sequencing:
-Gene discovery
-Digital northerns/insights into transcriptome
-Genome analyses, especially annotation of genomic DNA
Issues with EST sequencing:
-Inherent low quality due to single pass nature
-Not 100 % full length cDNA clones
-Redundant sequencing of abundant transcripts
Address through
clustering/
assembly to build
consensus sequences
= Gene Index,
Unigene Set,
Transcript Assembly
EST Clustering
All ESTs and mRNAs from an organism
Cluster and Assemble
Set of clustered,
assembled sequences=
contigs, Transcript
Assembly, Tentative
Consensus, Unigene
Longer, more accurate
sequence of the transcript
Sequences which do not
cluster or
assemble=Singletons,
singlets
Single pass transcript
Web Links for Animation on Genome Sequencing
http://www.jgi.doe.gov/education/how/how30minflash.html
http://www.illumina.com/media.ilmn?Title=Sequencing-BySynthesis%20Demo&Cap=&PageName=solexa%20technolo
gy&PageURL=203&Media=1
From Fragments to Finished Genome
Overview of Sequencing, Assembly, and
Closure Processes
Genome Sequencing Process
Library Construction
Clone Picking
rDNA
Molecules
Genome Closure
Order Contigs
Close Gaps
Template Preparation
Sequencing Reactions
Electrophoresis and
Base Calling
Genome Assembly
Identify Repeats
Finish the Genome
Annotation
Sequence Requirements
1. Free vector should be at low or undetectable level.
2. No chimeric clones. Chimeras occur two or more random
fragments from separate parts of the genome recombine
and end up next to each other.
3. The majority of the inserts should be of relatively uniform
size.
4. Libraries need to be random and cover the whole
genome.
Basecalling & Quality Assignments
Phred & TraceTuner
•Read DNA sequencer
traces
•Call bases
•Assign base quality values
Warner Brothers, Inc.
•Write basecalls and
quality values to output
files.
What are phred quality values?
The quality value q assigned to a base call is defined as:
q = - 10 x log10(p)
where p is the estimated error probability
for that base-call.
OR
A base-call having a probability of 1/1000 of
being incorrect is assigned a quality value of
30.
Probability Quality Value
1/100
20
1/10
10
Assembling the fragments
Merging two sequences
overlap (19 bases)
overhang (6 bases)
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overhang
% identity = 18/19 % = 94.7%
overlap - region of similarity between regions
overhang - un-aligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size.
TIGR Assembler
Greedy
•
Build a rough map of fragment
overlaps
•
Pick the largest scoring overlap
•
Merge the two fragments
•
Repeat until no more merges can be
done
Forward-reverse constraints
•
•
The sequenced ends are facing towards each other
The distance between the two fragments is known
(within certain experimental error)
clone length
F
R
sequenced ends
Scaffolding
• Given a set of non-overlapping contigs
order and orient them along a chromosome
I
II
III
IV
II
III
IV
I
Clone-mates
Insert
R
F
I
II
R
Vector
F
I
II
R
F
II
I
F
R
Linking information
• Overlaps
• Mate-pair links
• Similarity links
reference genome
• Physical markers
• Gene synteny
physical map
Grouping the contigs
Assembly gaps
physical gap
group A
group B
sequencing gaps
sequencing gap - we know the order and orientation of the contigs and have at
least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA
spanning the gap
Unifying view of assembly
Assembly
Scaffolding
Why Completeness is Important
•
•
•
•
•
•
•
Improves characterization of genome features
• Gene order, replication origins
Better comparative genomics
• Genome duplications, inversions
Determination of presence and absence of particular genes and
features is less subjective
Missing sequence might be important (e.g., centromere)
Allows researchers to focus on biology not sequencing
Facilitates large scale correlation studies
Controls for contamination
What Is Closure?
•
Obtaining sequence that was not obtained during random sequencing
which resulted in:
• Sequencing Gaps
• Physical Ends
•
Confirming the integrity of assemblies
• Repeats and misassemblies
• Verification of Clone Coverage
•
Confirming the base sequence of the consensus
• Editing
• Verification of Sequence Coverage
Sequence Validation: Sequence coverage
1X
2X
3X
Sequence coverage rule:
Every base in an assembly must be covered by at least two
sequences of high quality.
Why?
Validating sequence coverage provides a high degree of
confidence in the consensus base calls.
Sequence editor
•
In this example there is an
obvious discrepancy
between the base calls of
several of the underlying
clones in this region.
Causes for gaps
•
•
Non-random shotgun library
• Toxicity of genes or promoters in E. coli
• Genomic DNA difficult to clone (capsular polysaccharides)
• Unstable regions (low complexity)
Sequencing problems
• Hard stops
• Secondary structures
• Very high or low GC content
• Small unit tandem repeats
• Loss of signal
• Homopolymeric tracts
• Very high or low GC content
Closure Challenges: Sequencing
Through Secondary Structures
Hairpin structure
Homopolymeric tracts
Solutions
• Apply different sequencing chemistries
• Big-Dye terminator (default)
• Dye-primer
• dGTP mix (GC rich regions)
• Denature structures - Additives
• Betaine
• DMSO
• Break structure
• Restriction digest
• Transposon insertion
• Micro-libraries
Repetitive Areas
•
•
•
Repetitive areas are regions of high similarity within the
genome/BAC.
Sequences in these areas may be misassembled by the
Assembler.
Verification of the sequence of repetitive areas:
•
A. Identify potential repetitive areas, using repeatFinder and other tools.
•
B. Classify repeats based on length, copy number, % similarity,
structure and complexity.
•
C. If repeats are misassembled, transpose spanning clones or obtain
PCR products and sequence to verify assembly.
Mis-assembled repeat
Clones link different repeat flanks
Resolved Repeat
• Unique flank order is correct
• Use linking information across the repeat (large
insert clones or PCR)
• Consensus sequence is correct
• Use linked clones that have one mate in the
repeat and the other anchored in unique
sequence
• Transposon mediated libraries
Fasta Format
> GDRFE25TF
CATTGAACACTAGGAGCCATAGAC………(up to 60 bases per line)
GTTCAACCGTTTAAGGCAAAACTTA………
AATTTTGGGCAGACTCTAGATCATG………
GGTAATACATACTCTGGGATTACGA………
Multifasta Format
> GDRFE25TF
CATTGAACACTAGGAGCCATAGACT………(up to 60 bases per line)
GTTCAACCGTTTAAGGCAAAACTTA………
AATTTTGGGCAGACTCTAGATCATG………
GGTAATACATACTCTGGGATTACGA………
> GDRFE45TF
ACTGGTTCACATGGAGGGATAGTAC………(up to 60 bases per line)
GACACTCCGTAGCTGGCAATCCTTA………
GGCTCTCAATCGAGACTCTAGTTAC………
TCCAATATGGGCTCATGGAACAAGA………
> GDRFE67TF
CATTGAACACTAGGAGCCATAGATC………(up to 60 bases per line)
AATGTGGCGTAGCTGCCACTTGGTA………
TACCGTCAATCGTATTGTCTAGTTAC………
GGGAGATAATATGGGCTCATATGGT………
>
Download