Repeat finding method

SPECTRUM-BASED DE NOVO REPEAT
DETECTION IN GENOMIC SEQUENCES
Do Huy Hoang
OUTLINE

Introduction
What is a repeat?
Why study repeats?
Related work
SAGRI
Algorithm
Analysis
Evaluation
INTRODUCTION
WHAT IS A REPEAT? (DEFINITION)

[General]: Nucleotide sequences that occur multiple times
within a genome

[CompBio]: Given a genome sequence S, find a string P
which occurs at least twice in S (allowing some errors).
WHAT IS A REPEAT? (FUNCTION)

Motifs
  Very short repeats (10-20 bp)
  Transcription factor binding sites

Long and short interspersed elements (LINE, SINE)
  Jumping genes

Genes and pseudogenes

Tandem repeats
  Simple short sequence repeats, e.g. (A)n, (CGG)n
WHY STUDY REPEATS? (1)

Eukaryotic genomes contain a lot of repeats.
  E.g. the human genome contains 50% repeats.

Repeats are believed to play an important role in evolution and disease.
  E.g. Alu elements are particularly prone to recombination. Insertions of
  Alu repeats inactivate genes in patients with hemophilia and
  neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).

Repeats are important to chromatin structure.
  Most TEs in mammals seem to be silenced by methylation. Alu
  sequences are a major target for histone H3-Lys9 methylation in humans
  (Kondo and Issa, 2003).
  Heterochromatin is known to contain many SINE and LINE repeats.

WHY STUDY REPEATS? (2)

Repeats complicate sequence assembly and genome comparison.
  Many people remove repeats before they analyze the genome.

Repeats set hurdles for microarray probe signal analysis.
  The probe signal may be inaccurate if the probe sequence
  overlaps with repeat regions.

Repeats may contribute to human diversity more than genes.

Repeats can be used as DNA fingerprints.
STEPS IN REPEAT FINDING
Repeat library (RepeatMasker)
De novo repeat discovery (two steps):
  Identification of repeats
  Classification of repeats

SAGRI ALGORITHM
ALGORITHM OUTLINE

Input: a text G

FindHit phase: finds all candidate second occurrences
of repeat regions.

ACGACGCGATTAACCCTCGACGTGATCCTC

Validation phase: uses the hits from phase 1 to find all pairs
of repeats.

ACGACGCGATTAACCCTCGACGTGATCCTC
SPECTRUM-BASED REPEAT FINDER

What is a spectrum?

Given a string G, its spectrum is the set of all k-mers that occur in G.
E.g. k = 3, G = ACGACGCTCACCCT
The spectrum is
ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA

CTC is a k-mer occurring at position 7.
ACG is a k-mer occurring at positions 1 and 4.
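
The spectrum example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `kmer_positions` is made up for illustration, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict


def kmer_positions(g: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer of g to the 1-based positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(g) - k + 1):
        positions[g[i:i + k]].append(i + 1)  # the slides count from 1
    return dict(positions)


G = "ACGACGCTCACCCT"
index = kmer_positions(G, 3)
spectrum = sorted(index)  # the spectrum is the set of distinct k-mers
print(spectrum)
print(index["CTC"], index["ACG"])  # [7] [1, 4]
```

Keeping the positions alongside the set is a convenience for the example; the spectrum itself is just the keys.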
OBSERVATION 1: HOW TO FIND CANDIDATE REGIONS
CONTAINING REPEATS?

Two regions of repeats should share some k-mers.

E.g. the following repeats share CGA.
ACGACGCGATTAACCCTCGACGTGATCCTC
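FEASIBLE EXTENSION (BUD)

G = ACGACGTGATTAACCCTCGACGTGATCCTC
(i marks the start of the second occurrence of CGA)

Given the spectrum S for G[1..i-1], the k-mer CGA at position i can be
extended by one base as follows:
A: not feasible (GAA is not in S)
C: feasible (GAC is in S)
G: not feasible (GAG is not in S)
T: feasible (GAT is in S)

C and T are the feasible extensions.
Note: T is called a fooling probe! GAT occurs in G[1..i-1], but not
right after the earlier CGA.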
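
A small sketch of the feasibility check, under two assumptions that are not stated on the slide: i = 18 (the start of the second CGA), and the spectrum of G[1..i-1] contains the k-mers lying entirely inside that prefix. The helper names are hypothetical.

```python
def spectrum_before(g: str, i: int, k: int) -> set[str]:
    """k-mers lying entirely inside g[1..i-1] (1-based, as on the slides)."""
    prefix = g[:i - 1]
    return {prefix[j:j + k] for j in range(len(prefix) - k + 1)}


def feasible_extensions(kmer: str, spectrum: set[str]) -> list[str]:
    """Bases b for which the shifted k-mer kmer[1:] + b is in the spectrum."""
    return [b for b in "ACGT" if kmer[1:] + b in spectrum]


G = "ACGACGTGATTAACCCTCGACGTGATCCTC"
i, k = 18, 3                      # assumed: i is the second occurrence of CGA
S = spectrum_before(G, i, k)
current = G[i - 1:i - 1 + k]      # "CGA"
print(feasible_extensions(current, S))  # ['C', 'T']
```

The check only asks whether the next k-mer exists somewhere in S, which is exactly why a fooling probe such as T can slip through.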
OBSERVATION 2

A path of feasible extensions may be a repeat.
Example:
G = ACGACGCTATCGATGCCCTC

The spectrum S for G[1..10] is
ACG, CGA, CGC, CTA, GAC, GCT, TAT
Starting from position 11, there exists a path of feasible extensions:
CGA-C-G-C
This path corresponds to the length-6 substring at position 2 (CGACGC).
It has one mismatch compared with the length-6 substring at
position 11 (CGATGC).
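
The example can be checked mechanically. The sketch below uses hypothetical helper names and only verifies this one path; it does not search for paths.

```python
def spectrum_of(text: str, k: int) -> set[str]:
    """Set of all k-mers occurring in text."""
    return {text[j:j + k] for j in range(len(text) - k + 1)}


def is_feasible_path(path: str, spectrum: set[str], k: int) -> bool:
    """True if every k-mer along the path string is in the spectrum."""
    return all(path[j:j + k] in spectrum for j in range(len(path) - k + 1))


def mismatches(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


G = "ACGACGCTATCGATGCCCTC"
k, i = 3, 11
S = spectrum_of(G[:i - 1], k)            # spectrum for G[1..10]
path = "CGACGC"                          # the path CGA-C-G-C spelled out
current = G[i - 1:i - 1 + len(path)]     # "CGATGC"
print(is_feasible_path(path, S, k), G.find(path) + 1, mismatches(path, current))
# True 2 1
```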
PHASE 1: FINDHIT()
Algorithm:
Input: a text G of length n
Initialize the empty spectrum S
For i = 1 to n
    /* we maintain the invariant that S is the spectrum for G[1..i-1] */
    Let x be the k-mer at position i
    If x exists in S, run DetectRepSeq(S, i)
    Insert x into S

Note: DetectRepSeq(S, i) looks for a repeat occurring at
position i.
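
The loop above can be sketched in Python as follows. The repeat detector is passed in as a callback because its details are not given on this slide; the name `find_hit` and the callback signature are assumptions, and the loop stops at the last position where a full k-mer fits.

```python
from typing import Callable


def find_hit(g: str, k: int, detect: Callable[[set[str], int], None]) -> None:
    """Scan g left to right, calling detect(S, i) when the k-mer at i was seen before."""
    spectrum: set[str] = set()
    for i in range(1, len(g) - k + 2):   # 1-based start positions of k-mers
        x = g[i - 1:i - 1 + k]
        if x in spectrum:                # x already occurs to the left of i
            detect(spectrum, i)
        spectrum.add(x)                  # keeps the invariant for the next i


hits: list[int] = []
find_hit("ACGACGCTCACCCT", 3, lambda s, i: hits.append(i))
print(hits)  # [4]
```

Here the callback just records the position; in this text only ACG at position 4 has been seen earlier.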
[Figure: step-by-step run of DetectRepSeg(S(18), 18) with k = 3]
G = ACGAAGTGATTAACCCTCGACGCGATCC (positions 1 to 28)
Spectrum S(18): AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA
Curr (positions 18 to 28): CGA C G C G A T C T
Ref (paths of feasible extensions explored from CGA, as labelled in the final frame):
CGA-T1-T2-A3*
A1-G1-T2-G2-A2-T2-T3*
C2-C2-C3*
G3 *
OTHER DETAILS
Extend backward
 Stop backtracking after h steps

VALIDATION PHASE
Decompose the hits into a set of k-mers and index all the
locations of these k-mers.
For each pair of locations of a k-mer w in the hits,
do a BLAST extension.
  Use an auxiliary data structure to avoid double checking.
Report the pairs whose length exceeds our threshold.
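
A rough sketch of this phase, not the actual implementation. It assumes hits are 0-based half-open intervals, replaces the BLAST extension with a simple ungapped extension (+1 match, -1 mismatch, stop after a drop of `x_drop`), and uses a set of (diagonal, start) keys as the auxiliary structure. All names and parameters are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def extend(g: str, p: int, q: int, step: int, x_drop: int = 2) -> int:
    """Ungapped extension from (p, q); returns the length at the best score."""
    score = best = best_len = n = 0
    while 0 <= p < len(g) and 0 <= q < len(g):
        score += 1 if g[p] == g[q] else -1
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= x_drop:
            break
        p += step
        q += step
    return best_len


def validate(g: str, hit_regions: list[tuple[int, int]], k: int, min_len: int):
    """Return sorted (start1, start2, length) triples of validated repeat pairs."""
    seeds = {g[i:i + k] for s, e in hit_regions for i in range(s, e - k + 1)}
    locations = defaultdict(list)
    for i in range(len(g) - k + 1):
        if g[i:i + k] in seeds:
            locations[g[i:i + k]].append(i)
    seen = set()    # (diagonal, start) keys already reported
    pairs = []
    for locs in locations.values():
        for p, q in combinations(locs, 2):
            left = extend(g, p - 1, q - 1, -1)
            right = extend(g, p + k, q + k, +1)
            start, length = p - left, left + k + right
            key = (q - p, start)
            if length >= min_len and key not in seen:
                seen.add(key)
                pairs.append((start, q - left, length))
    return sorted(pairs)


G = "ACGACGCGATTAACCCTCGACGTGATCCTC"
print(validate(G, [(17, 26)], k=3, min_len=8))
```

On this text, with the second copy given as the hit, the pair starting at 0-based positions 1 and 17 with length 9 is reported, i.e. CGACGCGAT against CGACGTGAT.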
ANALYSIS
ANALYSIS

How to find most repeats?
  Avoid false negatives
How to get better speed?
  Avoid false positives
HOW DO WE CHOOSE K? (1)

If k is too big, a k-mer is too specific and we may miss some repeats.

If k is too small, a k-mer cannot help us differentiate repeats from non-repeats.

For repeats of length 50 and similarity > 0.9,
we found that $k \approx \log_4 n + 2$ is good enough.
HOW DO WE CHOOSE K? (2)

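A random k-mer may match one of the n chosen k-mers by chance.

Pr(a k-mer re-occurs at random in a sequence of length n)
(analogous to throwing n balls into $4^k$ bins)
$= 1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.

We require $1 - \exp(-n/4^k) \le \epsilon_1$.
Since $1 - \exp(-x) \approx x$ for small $x$, this gives $n/4^k \le \epsilon_1$;
hence, $k \ge \log_4 n + \log_4(1/\epsilon_1)$.
If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.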
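
To get a feel for the numbers, the bound can be evaluated directly. This is a sketch with made-up function names, and the genome length of one million bases is only an illustrative value.

```python
import math


def reoccur_probability(n: int, k: int) -> float:
    """1 - (1 - 4^-k)^n: chance that a random k-mer matches one of n k-mers."""
    return 1.0 - (1.0 - 4.0 ** -k) ** n


def smallest_k(n: int, eps: float) -> int:
    """Smallest integer k with k >= log4(n) + log4(1/eps)."""
    return math.ceil(math.log(n, 4) + math.log(1.0 / eps, 4))


n = 1_000_000                 # illustrative sequence length
k = smallest_k(n, 1 / 16)
print(k, reoccur_probability(n, k))
```

For n = 10^6 this gives k = 12, and the chance re-occurrence probability at that k stays below 1/16, whereas at k = 11 it does not.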
THE OCCURRENCE OF FALSE NEGATIVE (MISSED
REPEAT) (1)

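A pair of repeats of length L, with m mismatches.

The probability that the pair preserves at least one k-mer is
$1 - M / \binom{L}{m}$,

where M is the number of nonnegative integer solutions to
$x_1 + x_2 + \cdots + x_{m+1} = L - m$
subject to
$0 \le x_1, x_2, \ldots, x_{m+1} \le k - 1$.

[Figure: a repeat of length L in which the m mismatches (X) separate runs of matches of lengths $x_1, x_2, \ldots, x_{m+1}$]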
THE OCCURRENCE OF FALSE NEGATIVE (MISSED
REPEAT) (2)

It is easy to see that M is the coefficient of $x^{L-m}$ in
$(1 + x + x^2 + \cdots + x^{k-1})^{m+1} = \frac{(1 - x^k)^{m+1}}{(1 - x)^{m+1}}$.

Hence
$M = \sum_{0 \le j \le \min(m+1,\ \lfloor (L-m)/k \rfloor)} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$.
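
The closed form for M can be cross-checked against direct enumeration on small cases. The sketch below uses hypothetical function names and assumes the sum runs over $0 \le j \le \min(m+1, \lfloor (L-m)/k \rfloor)$.

```python
from itertools import combinations
from math import comb


def count_no_preserved(L: int, m: int, k: int) -> int:
    """M: mismatch placements leaving every run of matches shorter than k."""
    upper = min(m + 1, (L - m) // k)
    return sum((-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
               for j in range(upper + 1))


def count_brute_force(L: int, m: int, k: int) -> int:
    """Same count by enumerating all placements of m mismatches."""
    count = 0
    for mism in combinations(range(L), m):
        bounds = (-1, *mism, L)
        runs = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
        count += all(r <= k - 1 for r in runs)
    return count


def preserved_probability(L: int, m: int, k: int) -> float:
    """Probability that at least one k-mer survives in the repeat pair."""
    return 1 - count_no_preserved(L, m, k) / comb(L, m)


print(preserved_probability(50, 5, 12))
```

For a repeat of length 50 with 5 mismatches and k = 12, the probability comes out at roughly 0.97.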
CRITERION FOR PATH TERMINATION (1)

Instead of fixing the number of mismatches, we may
want to fix the percentage of mismatches, say, 10%.

Then the pruning strategy is length dependent.

If the strings under consideration have length r, we allow $\tau(r)$ mismatches.
CRITERION FOR PATH TERMINATION (2)

Let q be the mismatch probability and r be the length of the
string.

Prob. that a string has s mismatches:
$P_q(s) = q^2 \sum_{j=s-2}^{r-2} \binom{r-2}{j} q^j (1-q)^{r-2-j}$

For a threshold $\epsilon$ (say, 0.01), we set
$\tau(r) = \max\{\, 2 \le s \le r-2 \mid P_q(s) > \epsilon \,\} + 2$
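
A sketch of how the length-dependent bound could be computed. It assumes the formula reads $P_q(s) = q^2 \sum_{j=s-2}^{r-2} \binom{r-2}{j} q^j (1-q)^{r-2-j}$ and that the bound is the largest s in [2, r-2] with $P_q(s)$ above the threshold, plus 2. The function names and the sample values are illustrative only.

```python
from math import comb


def p_q(s: int, r: int, q: float) -> float:
    """q^2 times the binomial tail over r-2 positions, from s-2 mismatches up."""
    tail = sum(comb(r - 2, j) * q ** j * (1 - q) ** (r - 2 - j)
               for j in range(s - 2, r - 1))
    return q ** 2 * tail


def allowed_mismatches(r: int, q: float, threshold: float):
    """max{2 <= s <= r-2 : P_q(s) > threshold} + 2, or None if no s qualifies."""
    candidates = [s for s in range(2, r - 1) if p_q(s, r, q) > threshold]
    return max(candidates) + 2 if candidates else None


for r in (20, 50, 100):
    print(r, allowed_mismatches(r, q=0.1, threshold=0.001))
```

Under this reading $P_q(s)$ never exceeds $q^2$, so the threshold has to be below $q^2$ for any s to qualify; that is why the sample call uses 0.001 with q = 0.1.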
CONTROL OF FALSE POSITIVES (1)

Two typical cases
 The
probability of (case 1)/ (case 2)
is  2*4-
 P(case1 or case2) is small

For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$
EVALUATION
Compare with other programs
PROGRAMS
EulerAlign by Zhang and Waterman
PALS by Edgar and Myers
REPuter by Kurtz et al.
SAGRI

MEASUREMENT

Count Ratio (CR): the ratio of the number of reported repeat pairs
that share more than 50% with a reference pair to the
number of reference pairs.

Shared Repeat Region (SRR): the ratio of the found
region to the reference region.
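
One possible reading of the two measures, sketched in Python. The slide does not fix the details, so the following are assumptions: regions are half-open (start, end) intervals, a found pair "shares more than 50%" when each of its copies covers more than half of the corresponding reference copy, and SRR is the fraction of reference bases covered by found regions.

```python
Interval = tuple[int, int]
Pair = tuple[Interval, Interval]


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def shares_majority(found: Pair, ref: Pair) -> bool:
    """Assumed criterion: both copies cover more than half of the reference copies."""
    return all(overlap(f, r) > 0.5 * (r[1] - r[0]) for f, r in zip(found, ref))


def count_ratio(found_pairs: list[Pair], ref_pairs: list[Pair]) -> float:
    """CR: found pairs matching some reference pair, over the number of reference pairs."""
    matched = sum(any(shares_majority(f, r) for r in ref_pairs) for f in found_pairs)
    return matched / len(ref_pairs)


def shared_repeat_region(found_pairs: list[Pair], ref_pairs: list[Pair]) -> float:
    """SRR: fraction of reference bases that are also in a found region."""
    ref_bases = {i for pair in ref_pairs for s, e in pair for i in range(s, e)}
    found_bases = {i for pair in found_pairs for s, e in pair for i in range(s, e)}
    return len(ref_bases & found_bases) / len(ref_bases)


ref = [((0, 10), (20, 30))]
found = [((1, 10), (21, 30)), ((40, 45), (50, 55))]
print(count_ratio(found, ref), shared_repeat_region(found, ref))  # 1.0 0.9
```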
SIMULATED DATA
Conclusion from simulated data
The result is consistent with the analysis
GENOME DATA

M.gen (0.6 Mbp)
  Organism with the smallest genome
  Lives in the primate genital and respiratory tracts
C.tra (1 Mbp)
  Lives inside the cells of humans
A.ful (2.1 Mbp)
  Found in high-temperature oil fields
E.coli (4 Mbp)
  An important bacterium that lives in the lower intestines of mammals
Human chr22 p20M to p21M (1 Mbp)

Use CR and SRR ratio to measure

Cross validation
G/H = 1, H/G < 1: G “outperforms” H
G/H < 1, H/G = 1: H “outperforms” G
G/H < 1, H/G < 1: G, H are complementary
G/H = 1, H/G = 1: G, H are similar
QUESTIONS AND ANSWERS

H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang.
Spectrum-based de novo repeat detection in genomic
sequences. Journal of Computational Biology, 15(5):469-487, June 2008.