SPECTRUM-BASED DE NOVO REPEAT
DETECTION IN GENOMIC SEQUENCES
Do Huy Hoang
OUTLINE
Introduction
What is a repeat?
Why study repeats?
Related work
SAGRI
Algorithm
Analysis
Evaluation
INTRODUCTION
WHAT IS A REPEAT? (DEFINITION)
[General]: Nucleotide sequences that occur multiple times
within a genome
[CompBio]: Given a genome sequence S, find a string P
which occurs at least twice in S (allowing some errors).
WHAT IS A REPEAT? (FUNCTION)
Motifs
Very short repeats (10-20bp)
Transcription factor binding sites
Long and Short interspersed elements (LINE, SINE)
Jumping genes
Genes and Pseudogenes
Tandem repeats
Simple short sequence repeats, e.g. (A)n, (CGG)n
WHY STUDY REPEATS? (1)
Eukaryotic genomes contain a lot of repeats
Repeats are believed to play an important role in evolution and
disease.
E.g. About 50% of the human genome consists of repeats.
E.g. Alu elements are particularly prone to recombination. Insertions of
Alu repeats inactivate genes in patients with hemophilia and
neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999)
Repeats are important to chromatin structure.
Most TEs in mammals seem to be silenced by methylation. Alu
sequences are a major target for histone H3-Lys9 methylation in humans
(Kondo and Issa, 2003).
It is known that heterochromatin has a lot of SINE and LINE repeats.
WHY STUDY REPEATS? (2)
Repeats complicate sequence assembly and genome
comparison
Many people remove repeats before they analyze the genome.
Repeats set hurdles on microarray probe signal analysis
The probe signal may be inaccurate if the probe sequence
overlaps with repeat regions.
Repeats may contribute to human diversity more than
genes.
Repeats can be used as DNA fingerprint
STEPS IN REPEAT FINDING
Repeat library (RepeatMasker)
De-novo repeat discovery (two steps):
Identification of repeats
Classification of repeats
SAGRI ALGORITHM
ALGORITHM OUTLINE
Input: a text G
FindHit phase: finds all candidate second occurrences
of repeat regions
ACGACGCGATTAACCCTCGACGTGATCCTC
Validation phase: uses hits from phase 1 to find all pairs
of repeats
ACGACGCGATTAACCCTCGACGTGATCCTC
SPECTRUM-BASED REPEAT FINDER
What is a spectrum?
Given a string G, its spectrum is the set of all k-mers occurring in G.
E.g. k=3, G= ACGACGCTCACCCT
The spectrum is
ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA
CTC is a k-mer occurring at position 7.
ACG is a k-mer occurring at positions 1, 4.
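The spectrum example above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' data structure: the function name `spectrum_with_positions` is made up for illustration, and positions are 1-based to match the slides.

```python
from collections import defaultdict

def spectrum_with_positions(g, k):
    """Map each k-mer of g to the list of its 1-based start positions."""
    positions = defaultdict(list)
    for i in range(len(g) - k + 1):
        positions[g[i:i + k]].append(i + 1)  # slides count positions from 1
    return dict(positions)

spec = spectrum_with_positions("ACGACGCTCACCCT", 3)
print(sorted(spec))  # the spectrum: the distinct 3-mers in sorted order
print(spec["CTC"])   # [7]
print(spec["ACG"])   # [1, 4]
```

The keys of the dictionary form the spectrum; the position lists are only needed when an occurrence has to be located later.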
OBSERVATION 1: HOW TO FIND CANDIDATE REGIONS
CONTAINING REPEATS?
Two regions of repeats should share some k-mers.
E.g. the following repeats share CGA.
ACGACGCGATTAACCCTCGACGTGATCCTC
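FEASIBLE EXTENSION (BUD)
G = ACGACGTGATTAACCCTCGACGTGATCCTC
Given the spectrum S for G[1..i-1], consider the k-mer CGA at position i.
Candidate one-base extensions of CGA:
A: not feasible (X)
C: feasible
G: not feasible (X)
T: feasible
C and T are the feasible extensions!
Note: T is called a fooling probe! (The text itself continues with C.)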
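A feasible extension can be checked by shifting the current k-mer by one base and looking the result up in the spectrum. The sketch below assumes k = 3 and that the marker i on the slide points at position 18 (the second CGA); the helper names `kmer_set` and `feasible_extensions` are hypothetical.

```python
def kmer_set(g, k):
    """All distinct k-mers of g."""
    return {g[i:i + k] for i in range(len(g) - k + 1)}

def feasible_extensions(spectrum, kmer, alphabet="ACGT"):
    """Bases b for which the shifted k-mer kmer[1:] + b is in the spectrum."""
    return [b for b in alphabet if kmer[1:] + b in spectrum]

G = "ACGACGTGATTAACCCTCGACGTGATCCTC"
k = 3
i = 18                                   # assumed position of the marker i (1-based)
prefix_spectrum = kmer_set(G[:i - 1], k)  # spectrum for G[1..i-1]
current = G[i - 1:i - 1 + k]              # the k-mer at position i: CGA
exts = feasible_extensions(prefix_spectrum, current)
true_next = G[i - 1 + k]                  # base that really follows CGA in G
fooling = [b for b in exts if b != true_next]
print(current, exts, fooling)             # CGA ['C', 'T'] ['T']
```

T is feasible only because GAT occurs earlier in the text (inside TGAT), which is why it can mislead the search.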
OBSERVATION 2
A path of feasible extensions may be a repeat.
Example:
G = ACGACGCTATCGATGCCCTC
The spectrum S for G[1..10] is
ACG, CGA, CGC, CTA, GAC, GCT, TAT
Starting from position 11, there exists a path of feasible extensions:
CGA-C-G-C
This path corresponds to the length-6 substring at position 2 (CGACGC).
Also, this path has one mismatch compared with the length-6 substring at
position 11 (CGATGC).
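The example above can be replayed by a small depth-first search that follows feasible extensions and counts mismatches against the text at the current position. This is an illustrative sketch under simplifying assumptions (fixed path length, fixed mismatch budget); `extension_paths` is a hypothetical name and is not the DetectRepSeq routine itself.

```python
def kmer_set(g, k):
    """All distinct k-mers of g."""
    return {g[i:i + k] for i in range(len(g) - k + 1)}

def extension_paths(spectrum, g, start, k, length, max_mismatches):
    """Strings of the given length spelled by paths of feasible extensions
    from the k-mer at 1-based position start, paired with their number of
    mismatches against g[start..start+length-1]."""
    target = g[start - 1:start - 1 + length]
    results = []

    def walk(path, mismatches):
        if len(path) == length:
            results.append((path, mismatches))
            return
        for b in "ACGT":
            if path[-(k - 1):] + b in spectrum:          # feasible extension
                m = mismatches + (b != target[len(path)])
                if m <= max_mismatches:                   # prune costly paths
                    walk(path + b, m)

    if target[:k] in spectrum:
        walk(target[:k], 0)
    return results

G = "ACGACGCTATCGATGCCCTC"
S = kmer_set(G[:10], 3)                  # spectrum for G[1..10]
paths = extension_paths(S, G, start=11, k=3, length=6, max_mismatches=1)
print(paths)                             # [('CGACGC', 1)]
print(G.find("CGACGC") + 1)              # 2, the earlier occurrence
```

The search is exponential in the worst case; the mismatch budget is what keeps the number of live paths small.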
PHASE 1: FINDHIT()
Algorithm:
Input: a text G of length n
Initialize the empty spectrum S
For i = 1 to n-k+1
    /* we maintain the invariant that S is a spectrum for G[1..i-1] */
    Let x be the k-mer at position i
    If x exists in S, run DetectRepSeq(S,i);
    Insert x into S
Note: DetectRepSeq(S,i) looks for a repeat occurring at
position i.
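The scanning loop of this phase might look as follows in Python. This is a sketch only: the repeat-detection step is passed in as a callback, and the stand-in `report_position` simply records the position instead of searching for a repeat. Both names are hypothetical.

```python
def find_hits(g, k, detect):
    """Scan g left to right; call detect whenever the k-mer at the current
    position already occurs earlier in the text."""
    spectrum = set()                      # k-mers starting at positions 1..i-1
    hits = []
    for i in range(1, len(g) - k + 2):    # 1-based start positions
        x = g[i - 1:i - 1 + k]
        if x in spectrum:
            hit = detect(spectrum, g, i, k)
            if hit is not None:
                hits.append(hit)
        spectrum.add(x)                   # restore the invariant for i+1
    return hits

def report_position(spectrum, g, i, k):
    """Placeholder for the detection step: just report the position."""
    return i

hits = find_hits("ACGACGCTCACCCT", 3, report_position)
print(hits)  # [4]
```

On the spectrum example string, only ACG re-occurs (at position 4), so the detection step would be triggered exactly once.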
[Figure: step-by-step run of DetectRepSeg(S(18), 18) on
G = ACGAAGTGATTAACCCTCGACGCGATCC (positions 1 to 28).
Spectrum S(18): AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA.
Curr row: the k-mer CGA at positions 18-20, followed by the text bases at positions 21-28.
Ref row: paths of feasible extensions starting from CGA; each base is labelled with the
running number of mismatches against the Curr row, and * marks the point where a path is abandoned:
CGA-T1-T2-A3*
A1-G1-T2-G2-A2-T2-T3*
C2-C2-C3*
G3*]
OTHER DETAILS
Extend backward
Stop backtracking after h steps
VALIDATION PHASE
Decompose the hits into a set of k-mers and index all the
locations of these k-mers.
For each pair of locations of a k-mer w in the hits,
do a BLAST extension
Use an auxiliary data structure to avoid double checking
Report the pairs whose length exceeds our threshold
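A seed-and-extend validation of this kind can be sketched as below. Several simplifications are assumptions of the sketch, not of the slides: the whole text is indexed rather than only the hits from phase 1, the extension is a plain ungapped one with +1/-1 scoring and a drop-off limit `max_drop` (a stand-in for a real BLAST extension), and the auxiliary structure is a set of seed pairs already covered by a reported region.

```python
from collections import defaultdict
from itertools import combinations

def extend(g, a, b, step, max_drop):
    """Ungapped extension from (a, b) in direction step, +1 per match and
    -1 per mismatch. Returns the length of the best-scoring extension."""
    score = best = length = best_len = 0
    while 0 <= a < len(g) and 0 <= b < len(g) and best - score <= max_drop:
        score += 1 if g[a] == g[b] else -1
        length += 1
        if score > best:
            best, best_len = score, length
        a += step
        b += step
    return best_len

def validate(g, k, min_len, max_drop=2):
    """Return (start1, start2, length) triples, 1-based, for repeat pairs."""
    index = defaultdict(list)
    for i in range(len(g) - k + 1):
        index[g[i:i + k]].append(i)          # all locations of each k-mer
    checked = set()                          # seed pairs already covered
    pairs = []
    for locs in index.values():
        for a, b in combinations(locs, 2):
            if (a, b) in checked:
                continue                     # avoid double checking
            left = extend(g, a - 1, b - 1, -1, max_drop)
            right = extend(g, a + k, b + k, +1, max_drop)
            start_a, start_b = a - left, b - left
            length = left + k + right
            checked.update((start_a + t, start_b + t) for t in range(length))
            if length >= min_len:
                pairs.append((start_a + 1, start_b + 1, length))
    return sorted(pairs)

print(validate("GATTACAGGCCGATTACATT", k=4, min_len=7))  # [(1, 12, 7)]
```

In the toy string, GATTACA occurs at positions 1 and 12; the first shared 4-mer triggers one extension and the remaining three seed pairs on the same diagonal are skipped.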
ANALYSIS
ANALYSIS
How to find most repeats?
Avoid false negatives
How to get better speed?
Avoid false positives
HOW DO WE CHOOSE K? (1)
If k is too big,
k-mer is too specific and we may miss some repeats
If k is too small,
k-mer cannot help us to differentiate repeats from non-repeats
For repeats of length 50 and similarity > 0.9,
we found that $k \geq \log_4 n + 2$ is good enough.
HOW DO WE CHOOSE K? (2)
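Pr(a random k-mer matches one of n chosen k-mers)
Pr(a k-mer re-occurs by chance in a sequence of length n)
(analogous to throwing n balls into $4^k$ bins)
$1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.
We require $1 - \exp(-n/4^k) \leq \varepsilon_1$,
hence (using $1 - \exp(-x) \approx x$ for small x), $k \geq \log_4 n + \log_4(1/\varepsilon_1)$.
If we set $\varepsilon_1 = 1/16$, $k \geq \log_4 n + 2$.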
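The bound on k can be checked numerically. The sketch below evaluates the chance re-occurrence probability and the smallest integer k satisfying the bound; the function names and the sample genome length of one million bases are illustrative assumptions.

```python
import math

def reoccurrence_probability(n, k):
    """Probability that a fixed k-mer matches at least one of n random k-mers."""
    return 1.0 - (1.0 - 4.0 ** -k) ** n

def smallest_k(n, eps):
    """Smallest integer k with k >= log4(n) + log4(1/eps)."""
    return math.ceil(math.log(n, 4) + math.log(1.0 / eps, 4))

n = 10 ** 6                      # assumed sequence length
k = smallest_k(n, 1 / 16)
print(k)                                             # 12
print(reoccurrence_probability(n, k) <= 1 / 16)      # True
print(reoccurrence_probability(n, k - 1) <= 1 / 16)  # False
```

For a sequence of this length, k = 12 keeps chance re-occurrences under 1/16, while k = 11 does not.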
THE OCCURRENCE OF FALSE NEGATIVE (MISSED
REPEAT) (1)
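A pair of repeats of length L, with m mismatches
Probability of a preserved k-mer in the repeat is
$1 - M / \binom{L}{m}$
M is the number of nonnegative integer solutions to
$x_1 + x_2 + \cdots + x_{m+1} = L - m$
subject to
$0 \leq x_1, x_2, \ldots, x_{m+1} \leq k - 1$
[Figure: a repeat of length L in which the m mismatches (X) separate runs of matches of lengths $x_1, x_2, \ldots, x_{m+1}$]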
THE OCCURRENCE OF FALSE NEGATIVE (MISSED
REPEAT) (2)
It is easy to see that M is the coefficient of $x^{L-m}$ in
$(1 + x + x^2 + \cdots + x^{k-1})^{m+1} = \frac{(1 - x^k)^{m+1}}{(1 - x)^{m+1}}$
Hence
$M = \sum_{0 \leq j \leq \min(m+1,\ (L-m)/k)} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$
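The closed form for M can be checked against a brute-force count over all placements of m mismatches in a repeat of length L. This sketch assumes every placement is equally likely; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb

def count_M(L, m, k):
    """Inclusion-exclusion count of mismatch placements in which every run
    of matches is shorter than k (so no k-mer is preserved)."""
    upper = min(m + 1, (L - m) // k)
    return sum((-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
               for j in range(upper + 1))

def brute_M(L, m, k):
    """Same count by enumerating all C(L, m) mismatch placements."""
    count = 0
    for pos in combinations(range(L), m):
        bounds = (-1,) + pos + (L,)
        runs = [bounds[t + 1] - bounds[t] - 1 for t in range(m + 1)]
        if max(runs) <= k - 1:
            count += 1
    return count

def p_preserved(L, m, k):
    """Probability that at least one k-mer survives the m mismatches."""
    return 1 - count_M(L, m, k) / comb(L, m)

print(count_M(6, 2, 3), brute_M(6, 2, 3))  # 6 6
print(p_preserved(50, 5, 12))              # repeat of length 50, 90% similarity
```

The second print evaluates the setting mentioned earlier (length 50, similarity 0.9, hence m = 5) for k = 12.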
CRITERION FOR PATH TERMINATION (1)
Instead of fixing the number of mismatches, we may
want to fix the percentage of mismatches, say, 10%.
Then, the pruning strategy is length dependent.
If the length of the string is r, we allow $\tau(r)$ mismatches.
CRITERION FOR PATH TERMINATION (2)
Let q be the mismatch probability and r be the length of the
string.
Prob that a string has at least s mismatches =
$P_q(s) = q^2 \sum_{j=s-2}^{r-2} \binom{r-2}{j} q^j (1-q)^{r-2-j}$
For a threshold $\varepsilon$ (say, 0.01), we set
$\tau(r) = \max\{2 \leq s \leq r-2 \mid P_q(s) > \varepsilon\} + 2$
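The length-dependent mismatch allowance can be tabulated directly from the tail probability. The sketch below follows the formula on this slide as literally as possible: a factor q squared times a binomial tail over the r-2 remaining positions, then the largest s whose probability exceeds the threshold, plus 2. The function names are hypothetical, and returning None when no s qualifies is an assumption of the sketch.

```python
from math import comb

def tail_prob(q, r, s):
    """q^2 * sum_{j=s-2}^{r-2} C(r-2, j) q^j (1-q)^(r-2-j)."""
    return q * q * sum(comb(r - 2, j) * q ** j * (1 - q) ** (r - 2 - j)
                       for j in range(max(s - 2, 0), r - 1))

def allowed_mismatches(q, r, eps):
    """max{ s in 2..r-2 : tail_prob(q, r, s) > eps } + 2, or None if empty."""
    candidates = [s for s in range(2, r - 1) if tail_prob(q, r, s) > eps]
    return max(candidates) + 2 if candidates else None

q, r = 0.1, 20
print(tail_prob(q, r, 2))                 # close to q*q, since the sum is 1
print(allowed_mismatches(q, r, 0.001))    # 7
```

Because the tail probability shrinks as s grows, the allowance grows with the string length r for a fixed q and threshold.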
CONTROL OF FALSE POSITIVES (1)
Two typical cases
The probability of (case 1)/(case 2) is 2*4-
P(case1 or case2) is small
For example: 4 errors, q=0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$
EVALUATION
Compare with other programs
PROGRAMS
EulerAlign by Zhang and Waterman
PALS by Edgar and Myers
REPuter by Kurtz et al.
SAGRI
MEASUREMENT
Count Ratio (CR): the ratio of the number of found repeat pairs
that share more than 50% with a reference pair to the
number of reference pairs.
Shared Repeat Region (SRR): the ratio of the found
region to the reference region.
SIMULATED DATA
Conclusion from simulated data
The result is consistent with the analysis
GENOME DATA
M.gen (0.6 Mbp)
Organism with the smallest genome
Lives in the primate genital and respiratory tracts
C.tra (1 Mbp)
Lives inside the cells of humans
E.coli (4 Mbp)
An important bacterium living in the lower intestines of mammals
A.ful (2.1 Mbp)
Found in high-temperature oil fields
Human chr22 p20M to p21M (1Mbp)
Use CR and SRR ratio to measure
Cross validation
G/H    H/G    Interpretation
=1     <1     G “outperforms” H
<1     =1     H “outperforms” G
<1     <1     G, H are complementary
=1     =1     G, H are similar
QUESTIONS AND ANSWERS
H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang.
Spectrum-based de novo repeat detection in genomic
sequences. Journal of Computational Biology, 15(5):469-487, June 2008