lecture_10

Sequencing and Sequence Assembly
-- Overview of the genome sequencing process
Presented by NIE , Lan
CSE497
Feb.24, 2004
Introduction




Q: What is sequencing?
A: To sequence a DNA molecule is to obtain the string of bases that it contains. The resulting string is also known as a read.
Q: How do we sequence?
A: Recall the Sanger sequencing technology mentioned in Chapter 1.
Introduction
Sanger Sequencing

• Generate fragments that end at each base: A, C, G, T
• A fragment's migration distance on the gel is inversely related to its size
• Run the gel and read off the sequence
TCGCGATAGCTGTGCTA
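The "run gel and read off sequence" step can be illustrated with a toy model. This is only a sketch: it assumes one lane per base, that each lane holds the lengths of the fragments ending in that base, and that bands are read from shortest to longest. The function names `terminated_lengths` and `read_gel` are made up for this illustration, and the chemistry is ignored entirely.

```python
def terminated_lengths(template):
    """Toy model: for each base, list the lengths of fragments ending in that base."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(template, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the bands from shortest to longest fragment across all four lanes."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


if __name__ == "__main__":
    lanes = terminated_lengths("TCGCGATAGCTGTGCTA")
    print(lanes["T"])       # lengths of the fragments in the T lane
    print(read_gel(lanes))  # the sequence read off the gel
```

Shorter fragments migrate further, so sorting by length plays the role of reading the gel from one end to the other.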
Introduction
Limitation
• The size of DNA fragments that can be read in this way is about 700 bp.
Problem
• Most genomes are enormous (e.g. about 3 x 10^9 base pairs in the case of human), so they cannot be sequenced directly. Sequencing them is called large-scale sequencing.
Introduction

Solution
1. Break the DNA into small fragments randomly
2. Sequence the readable fragments directly
3. Assemble the fragments together to reconstruct the original DNA
4. Use a scaffolder to order the assembled pieces and handle the gaps
This is like solving a one-dimensional jigsaw puzzle with millions of pieces (without the box)!
1. Break
2. Sequence
3. Assemble
4. Scaffolder
5. Conclusion
Break
DNA can be cut into pieces by mechanical means
Issues in Break
• How?
• Coverage
The fragments as a whole provide an 8X oversampling of the genome
• Random
Libraries with piece sizes of 2, 4, 6, 10, 12 and 40 kbp were produced
• Clone
Obtain several copies of the original genome and of the fragments
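The 8X oversampling figure can be related to read counts with a short calculation. The sketch below assumes coverage is defined as (number of reads x read length) / genome length, and it uses the standard Poisson approximation (as in the Lander-Waterman model) for the fraction of bases left uncovered. Neither formula appears on the slides, and the function names and the 3 Mbp genome size are illustrative only.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


def expected_uncovered_fraction(c):
    """Poisson approximation: probability that a given base is hit by no read."""
    return math.exp(-c)


if __name__ == "__main__":
    n = reads_needed(8, 700, 3_000_000)  # hypothetical 3 Mbp genome, 700 bp reads
    print(n, coverage(n, 700, 3_000_000))
    print(expected_uncovered_fraction(8))
```

Even at 8X, the approximation leaves a small expected fraction of bases uncovered, which is one reason gaps remain after assembly.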
Sequence
[Figure: clone -> directed sequencing (gel) -> read, e.g. GTCCAGCCT]
Q: Can we read the fragment from both ends?
3. Assemble

A Simple Example
Fragments: ACCGT, CGTGC, TTAC
--ACCGT
----CGTGC
TTAC
TTACCGTGC
Overlap: a suffix of one fragment is the same as a prefix of another.
Assemble: align multiple fragments into a single continuous sequence based on fragment overlaps.
3. Assemble
[Figure: the original target sequence is broken into fragments; the fragments are assembled into contig1 and contig2, separated by a gap]
A simple model

The simplest, naive approximation of DNA assembly corresponds to the Shortest Common Superstring (SCS) problem: given a set of strings s1, ..., sn, find the shortest string s such that each si appears as a substring of s.
--ACCGT
----CGTGC
TTAC
TTACCGTGC
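To make the SCS definition concrete, here is a brute-force sketch that tries every ordering of the fragments and merges neighbours at their longest exact overlap. It is only meant for a handful of short strings, since the number of orderings grows factorially. The helper names `overlap` and `shortest_common_superstring` are chosen for this illustration.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(fragments):
    """Exhaustive search over all orderings (exponential, small inputs only)."""
    # Remove duplicates and fragments already contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    print(shortest_common_superstring(["ACCGT", "CGTGC", "TTAC"]))
```

On the three fragments of the slide example this reproduces the superstring TTACCGTGC.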
(1) Overlap step
Create an overlap graph in which every node is a fragment and every edge indicates an overlap.
(2) Layout step
Determine which overlaps will be used in the final assembly: find an optimal spanning forest of the overlap graph.
Overlap step
Finding overlaps
• Compare each fragment with every other fragment to find whether the end of one overlaps the beginning of the other.
We say 'a overlaps b' when a suffix of a equals a prefix of b.
Overlap step
Overlap graph
Directed, weighted graph G(V,E,w)
V: set of fragments
E: set of directed edges, each indicating an overlap between two fragments. An edge <a,b,w> means that a overlaps b with weight w, which is equivalent to suffix(a,w) = prefix(b,w).
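The overlap graph definition can be sketched directly in code. The version below compares every ordered pair of fragments and keeps the longest exact suffix/prefix match as the edge weight. The minimum overlap of 3 is an assumption made here to keep tiny chance overlaps out of the graph, and the function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(fragments, min_overlap=3):
    """Return {(a, b): w} for every ordered pair with suffix(a, w) == prefix(b, w)."""
    edges = {}
    for a, seq_a in fragments.items():
        for b, seq_b in fragments.items():
            if a == b:
                continue
            w = overlap(seq_a, seq_b)
            if w >= min_overlap:  # assumed threshold, not from the slides
                edges[(a, b)] = w
    return edges


if __name__ == "__main__":
    fragments = {
        "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
        "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
    }
    for (a, b), w in sorted(overlap_graph(fragments).items(), key=lambda e: -e[1]):
        print(f"{a} -> {b}: {w}")
```

The all-pairs comparison is quadratic in the number of fragments, which is fine for a classroom example but not for millions of reads.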
Example
W=AGTATTGGCAATC
Z=AATCGATG
U=ATGCAAACCT
X=CCTTTTGG
Y=TTGGCAATCA
S=AATCAGG
[Figure: overlap graph over nodes s, y, w, x, z, u; edge weights shown: 9, 5, 4, 4, 3, 3]
Layout step

• Looking for the shortest common superstring is the same as looking for a path of maximum weight.
• Use a greedy algorithm that selects the edge with the best weight at every step.
• Each selected edge is checked against the rule below. If it passes, the edge is accepted; otherwise it is omitted.
• Rule: for either node on the edge, indegree and outdegree must stay <= 1, and the graph must stay acyclic.
• Finally the fragments are merged together. In terms of the graph, the result is a forest of paths that together visit each node exactly once (a Hamiltonian path when there is only one path); each path corresponds to a contig.
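The greedy layout rule above can be sketched as follows. Edges are taken in decreasing weight order and accepted only if they keep indegree and outdegree at most 1 and do not close a cycle; each resulting path is then merged into a contig. The minimum overlap of 3, the function names and the tie-breaking order among equal weights are assumptions of this sketch.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_layout(fragments, min_overlap=3):
    """Return a list of (path, contig) pairs chosen by the greedy rule."""
    names = list(fragments)
    edges = [
        (overlap(fragments[a], fragments[b]), a, b)
        for a in names for b in names if a != b
    ]
    edges = sorted((e for e in edges if e[0] >= min_overlap), key=lambda e: -e[0])

    succ, pred = {}, {}

    def path_start(node):
        # Walk back along accepted edges to the first node of the path.
        while node in pred:
            node = pred[node]
        return node

    for _, a, b in edges:
        if a in succ or b in pred:  # outdegree(a) or indegree(b) would exceed 1
            continue
        if path_start(a) == b:      # b already leads to a: edge would close a cycle
            continue
        succ[a], pred[b] = b, a

    contigs = []
    for start in names:
        if start in pred:           # not the first node of a path
            continue
        path, contig = [start], fragments[start]
        while path[-1] in succ:
            nxt = succ[path[-1]]
            contig += fragments[nxt][overlap(fragments[path[-1]], fragments[nxt]):]
            path.append(nxt)
        contigs.append((path, contig))
    return contigs


if __name__ == "__main__":
    fragments = {
        "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
        "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
    }
    for path, contig in greedy_layout(fragments):
        print("->".join(path), contig)
```

On the six example fragments this yields the two paths W->Y->S and Z->U->X. Calling it with the fragments GCC, ATGC and TGCAT and `min_overlap=2` leaves GCC on its own, a small case where greedy misses the single-contig layout.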
Example
W=AGTATTGGCAATC
Z=AATCGATG
U=ATGCAAACCT
X=CCTTTTGG
Y=TTGGCAATCA
S=AATCAGG
[Figure: overlap graph over nodes s, y, w, x, z, u; edge weights shown: 9, 5, 4, 4, 3, 3]
W->Y->S
AGTATTGGCAATC
----TTGGCAATCA
---------AATCAGG
AGTATTGGCAATCAGG
Z->U->X
AATCGATG
-----ATGCAAACCT
------------CCTTTTGG
AATCGATGCAAACCTTTTGG

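• The greedy algorithm is neither optimal nor complete, and will introduce gaps.
[Figure: fragments GCC, ATGC and TGCAT with overlap weights 2, 2, 3]
Greedy first takes ATGC->TGCAT (weight 3), which rules out both weight-2 edges (TGCAT->ATGC and ATGC->GCC) and leaves GCC unconnected; the path TGCAT->ATGC->GCC has total weight 4 and gives the single contig TGCATGCC.
• The simple model can't correctly model the assembly problem, because of complications in real problem instances.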
Complications with Assemble

• Sequencing errors. Most sequencers have around 1% error in the best case.
• Unknown orientation. Either strand could have been sequenced.
• Bias in the reads. Not all regions of the sequence will be covered equally.
• Repeats. There is much repetitive sequence, especially in human and higher plants.
Sequencing Errors
Fragments contain 3 kinds of errors: insertion, deletion, substitution.
Frequency: substitutions (0.5-2%); insertions and deletions occur roughly 10 times less frequently.
http://compbio.uchsc.edu/Hunter_lab/Hunter/bioi7711/lecture6.ppt
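Problems with the simple model - Errors
Fragments: x: ACCGT, y: CGTGC, z: TTAC, u: TACCGT
--ACCGT
----CGTGC
TTAC
-TACCGT
TTACCGTGC
[Figure: overlap graph over x, y, z, u, shown with and without the sequencing errors "A" and "G"; edge weights shown: 5, 3, 3, 2]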
Problems with the simple model - Errors
Solution
Allow a bounded number of mismatches between overlapping fragments: approximate overlaps.
Criterion: minimum overlap length (40 bp), error rate (less than 6% mismatches)
How?
Use semi-global alignment to find the best match between the suffix of one sequence and the prefix of another.
Semi-global alignment
• Scoring system: 1 for matches, -1 for mismatches, -2 for gaps
• Initialize the first row and first column to zero, so that gaps at both extremities are ignored
• The recurrence is the same as for global comparison
• Search the last column for the highest score and obtain the alignment by tracing back to the start point (overlap of x over y). The overlap of y over x corresponds to the maximum in the last row.
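The semi-global procedure can be sketched as a small dynamic program. The code below follows the stated scoring (1, -1, -2), keeps the first row and column at zero, puts x along the columns and y along the rows, and reads the two overlap scores from the last column and last row. It reports scores only; traceback and the 40 bp / 6% acceptance criterion are left out, and the function names are illustrative.

```python
def semiglobal_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the score matrix with x along the columns and y along the rows."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]  # first row and first column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align y[i-1] with x[j-1]
                F[i - 1][j] + gap,    # y[i-1] against a gap
                F[i][j - 1] + gap,    # x[j-1] against a gap
            )
    return F


def best_overlaps(x, y):
    """Return (score of x over y, score of y over x)."""
    F = semiglobal_matrix(x, y)
    x_over_y = max(row[-1] for row in F)  # last column: all of x is consumed
    y_over_x = max(F[-1])                 # last row: all of y is consumed
    return x_over_y, y_over_x


if __name__ == "__main__":
    print(best_overlaps("ACCGT", "CGTGC"))   # exact overlap CGT
    print(best_overlaps("ACCGT", "CGATGC"))  # y carries an inserted A
```

A score of 0 here means that the best choice is the empty overlap, since the zero border is always available.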
Example: x = ACCGT (columns), y = CGATGC (rows)
       A   C   C   G   T
   0   0   0   0   0   0
C  0  -1   1   1  -1  -1
G  0  -1  -1   0   2   0
A  0   1  -1  -2   0   1
T  0  -1   0  -2  -2   1
G  0  -1  -2  -1  -1  -1
C  0  -1   0  -1  -2  -2
Overlap: x->y
ACCG-T--
--CGATGC
Overlap: y->x
CGATGC
------ACCGT
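Problems with the simple model - Errors
Fragments: x: ACCGT, y: CGTGC, z: TTAC, u: TACCGT
Error-free layout:
--ACCGT
----CGTGC
TTAC
-TACCGT
TTACCGTGC
Criterion
1. Score > -3
2. Mismatches < 2
Layout with errors:
--ACCG-T
----CGATGC
TT-C
-TAGCGT
TTACCGTGC
[Figure: overlap graphs over x, y, z, u before and after errors; edge labels shown: 5, 3, 3, 2, 2, 0, 0, -2]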
Problems with the simple model - Unknown orientation
Unknown orientation: fragments can be read from both of the DNA strands.
Solution: try all possible combinations of orientations, i.e. each fragment x, y, z or its reverse complement x', y', z'.
Fragment   Oriented   Layout
CACGT      CACGT      CACGT
ACGT       ACGT       -ACGT
ACTACG     CGTAGT     --CGTAGT
GTACT      AGTAC      -----AGTAC
CACGTAGTACTGA
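"Try all possible combinations" can be made concrete with a reverse-complement helper. The sketch below enumerates every choice of orientation for a list of fragments. That is 2^n candidates, so it illustrates the idea rather than suggesting how a practical assembler would do it, and the function names are made up for this example.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def orientation_choices(fragments):
    """Every way of keeping or reverse-complementing each fragment (2^n tuples)."""
    return list(product(*((f, reverse_complement(f)) for f in fragments)))


if __name__ == "__main__":
    print(reverse_complement("ACTACG"))
    print(reverse_complement("GTACT"))
    print(len(orientation_choices(["CACGT", "ACGT", "ACTACG", "GTACT"])))
```

The two printed reverse complements are the oriented forms used in the layout above.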
Problems with the simple model - Repeat
Repeats can be characterized by length, copy number and fidelity between copies
– Human T-cell receptor: 5 copies of a 4 kb gene with ~3% variation
– ALUs: ~300 bp with 5-15% variation, clustered so as to make up 50-60% of many human sequence regions
– Microsatellites: 3-6 bp units with thousands of repeats in centromeric and telomeric regions, 1-2% variation
gepard.bioinformatik.uni-saarland.de/html/BioinformatikIIIWS0304-Dateien/ V3-Assembly.ppt
Problems with the simple model - Repeat 2
Rearrangement
[Figure: the original sequence A X1 B X2 C X3 D contains three copies (3X) of a repeat X; after fragmentation, the assembler's consensus is A X2 C X3 B X1 D, with the segments B and C rearranged]
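Problems with the simple model - Repeat 3
Overcollapsing
[Figure: the original sequence is A X1 B X2 C; the assembly collapses the repeat copies X1 and X2 into one X, giving contig 1 = A X C, while B ends up in a separate contig 2 and is treated as a gap]
Shortest string is not always the best!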
Problems with the simple model - Lack of coverage
Not all regions of the sequence will be covered equally.
[Figure: target DNA with an uncovered area]
Solution
• Do more sampling to increase the coverage level
• Use scaffolder technology
4. Scaffolder
[Figure: contigs A, X, B, C shown in alternative orders and orientations (B', C')]

Scaffold
Given a set of non-overlapping contigs, order and orient them to reconstruct the original DNA.

How?
Can any relationship be established between different contigs?
4. Scaffolder - Mate Pairs

Mate pairs:
• The sequenced ends face towards each other
• The distance between the two reads is known (insert size minus the length of the reads)
• Mate pairs are extremely valuable during the scaffolding step
[Figure: mate pair]
4. Scaffolder - Method
• A scaffolder retrieves the mate pairs whose two reads lie in different contigs
• It uses the link information of the pairs (distance, orientation) to orient the contigs and estimate the gap sizes; this is called a "walk"
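The gap estimate from mate-pair links can be sketched under a simple geometric model. Assume that contig 1 precedes contig 2, that read 1 starts at offset `read1_start` in contig 1, and that its mate ends at offset `read2_end` in contig 2. The insert then spans the tail of contig 1, the gap and the head of contig 2. The coordinate convention, the function names and the averaging over links are all assumptions of this sketch, not something stated on the slides.

```python
from statistics import mean


def gap_from_mate_pair(insert_size, contig1_length, read1_start, read2_end):
    """Gap implied by one link: insert = tail of contig 1 + gap + head of contig 2."""
    tail_of_contig1 = contig1_length - read1_start
    return insert_size - tail_of_contig1 - read2_end


def estimate_gap(insert_size, contig1_length, links):
    """Average the gap over several (read1_start, read2_end) links."""
    return mean(
        gap_from_mate_pair(insert_size, contig1_length, start, end)
        for start, end in links
    )


if __name__ == "__main__":
    # Hypothetical numbers: 2 kbp inserts, contig 1 is 1000 bp long.
    print(gap_from_mate_pair(2000, 1000, 800, 300))
    print(estimate_gap(2000, 1000, [(800, 300), (900, 500)]))
```

Averaging over several links is one simple way to smooth out the variation in real insert sizes; a negative result would suggest that the two contigs actually overlap.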
4. Scaffolder - Example
[Figure: Contig 1 and Contig 2 separated by a gap]
4. Scaffolder - Graph Representation
• Nodes: contigs
• Directed edges: constraints on the relative placement of contigs (relative order and relative orientation)

http://jbpc.mbl.edu/jbpc/GenomesMedia/10_14POP.PPT
5. Conclusion
The whole genome sequencing process
Break -> Sequence -> Assemble -> Scaffolder

A Simple Model
• Use an overlap graph to construct the shortest common superstring
• However, it can't correctly model the assembly problem
5. Conclusion - Repeat
• Repeat detection
– pre-assembly: find fragments that belong to repeats
  statistically (most existing assemblers)
  repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential misassemblies (Reputer, RepeatMasker)
• Repeat resolution
– find DNA fragments belonging to the repeat
– determine correct tiling across the repeat