Repeat finding method

SPECTRUM-BASED DE NOVO REPEAT DETECTION IN GENOMIC SEQUENCES
Do Huy Hoang

OUTLINE
- Introduction
  - What is a repeat?
  - Why study repeats?
- Related work
- SAGRI
  - Algorithm
  - Analysis
  - Evaluation

INTRODUCTION

WHAT IS A REPEAT? (DEFINITION)
- [General]: Nucleotide sequences that occur multiple times within a genome.
- [CompBio]: Given a genome sequence S, find a string P that occurs at least twice in S (allowing some errors).

WHAT IS A REPEAT? (FUNCTION)
- Motifs
  - Very short repeats (10-20 bp)
  - Transcription factor binding sites
- Long and short interspersed elements (LINE, SINE)
  - Jumping genes
- Genes and pseudogenes
- Tandem repeats
  - Simple short sequence repeats, e.g. (A)n, (CGG)n

WHY STUDY REPEATS? (1)
- Eukaryotic genomes contain a lot of repeats.
  - E.g. 50% of the human genome consists of repeats.
- Repeats are believed to play an important role in evolution and disease.
  - E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).
- Repeats are important to chromatin structure.
  - Most TEs in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).
  - Heterochromatin is known to contain many SINE and LINE repeats.

WHY STUDY REPEATS? (2)
- Repeats complicate sequence assembly and genome comparison.
  - Many people remove repeats before they analyze the genome.
- Repeats set hurdles for microarray probe signal analysis.
  - The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.
- Repeats may contribute to human diversity more than genes.
- Repeats can be used as DNA fingerprints.

STEPS IN REPEAT FINDING
- Repeat library (RepeatMasker)
- De novo repeat discovery (two steps):
  - Identification of repeats
  - Classification of repeats

SAGRI ALGORITHM

ALGORITHM OUTLINE
- Input: a text G
- FindHit phase: finds all candidates for the second occurrence of a repeat region.
  ACGACGCGATTAACCCTCGACGTGATCCTC
- Validation phase: uses the hits from phase 1 to find all pairs of repeats.
  ACGACGCGATTAACCCTCGACGTGATCCTC

SPECTRUM-BASED REPEAT FINDER
- What is a spectrum?
  - Given a string G, its spectrum is the set of all k-mers occurring in G.
- E.g. k = 3, G = ACGACGCTCACCCT
  - The spectrum is ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA.
  - CTC is a k-mer occurring at position 7.
  - ACG is a k-mer occurring at positions 1 and 4.

OBSERVATION 1: HOW TO FIND CANDIDATE REGIONS CONTAINING REPEATS?
- Two regions of repeats should share some k-mers.
- E.g. the following repeats share CGA.
  ACGACGCGATTAACCCTCGACGTGATCCTC

FEASIBLE EXTENSION (BUD)
- G = ACGACGTGATTAACCCTCGACGTGATCCTC, with the k-mer CGA at position i.
- Given the spectrum S for G[1..i-1], the possible extensions of CGA are:
  - A: not feasible
  - C: feasible
  - G: not feasible
  - T: feasible
- C and T are the feasible extensions: GAC and GAT are in S, while GAA and GAG are not.
- Note: T is called a fooling probe!

OBSERVATION 2
- A path of feasible extensions may be a repeat.
- Example: G = ACGACGCTATCGATGCCCTC
  - The spectrum S for G[1..10] is ACG, CGA, CGC, CTA, GAC, GCT, TAT.
  - Starting from position 11, there exists a path of feasible extensions: CGA-C-G-C.
  - This path corresponds to the length-6 substring at position 2 (CGACGC).
  - The path has one mismatch compared with the length-6 substring at position 11 (CGATGC).

PHASE 1: FINDHIT()
Algorithm:
    Input: a text G of length n
    Initialize the empty spectrum S
    For i = 1 to n   /* we maintain the invariant that S is a spectrum for G[1..i-1] */
        Let x be the k-mer at position i
        If x exists in S, run DetectRepSeq(S, i)
        Insert x into S
Note: DetectRepSeq(S, i) looks for a repeat occurring at position i.
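The spectrum, the feasible-extension test and the FindHit scan above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SAGRI implementation: the function names are hypothetical, the spectrum is held in a plain Python set, and DetectRepSeq is not implemented (the scan only records the positions at which it would be called). Following the pseudocode, the set holds every k-mer that starts before the current position.

```python
def spectrum(text, k):
    """Return the set of all k-mers occurring in text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def feasible_extensions(seen, kmer, alphabet="ACGT"):
    """Characters c for which the next k-mer, kmer[1:] + c, is already in the spectrum."""
    return [c for c in alphabet if kmer[1:] + c in seen]


def find_hit_candidates(text, k):
    """Scan text left to right and return the 1-based positions whose k-mer
    has already been seen, i.e. the positions where the slides call DetectRepSeq.

    `seen` contains every k-mer that starts before the current position.
    """
    seen = set()
    candidates = []
    for i in range(len(text) - k + 1):
        x = text[i:i + k]
        if x in seen:
            candidates.append(i + 1)  # report 1-based positions, as on the slides
        seen.add(x)
    return candidates


if __name__ == "__main__":
    # Spectrum example from the slides (k = 3).
    print(sorted(spectrum("ACGACGCTCACCCT", 3)))
    # Feasible extensions of CGA given the spectrum of the first 17 characters.
    prefix = "ACGACGTGATTAACCCTCGACGTGATCCTC"[:17]
    print(feasible_extensions(spectrum(prefix, 3), "CGA"))
    # Positions of the Observation 2 example at which DetectRepSeq would run.
    print(find_hit_candidates("ACGACGCTATCGATGCCCTC", 3))
```

A set keeps the membership test simple; a real implementation would presumably also store k-mer positions so that a path of feasible extensions can be mapped back to its earlier occurrence.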
[Figure: step-by-step trace of DetectRepSeg(S(18), 18) on G = ACGAAGTGATTAACCCTCGACGCGATCC (positions 1 to 28, k = 3). Each frame shows the spectrum, the reference (Ref) and current (Curr) rows starting at the k-mer CGA at position 18, and the branches of feasible extensions explored so far, e.g. CGA-T1-T2-A3*.]

OTHER DETAILS
- Extend backward.
- Stop backtracking after h steps.

VALIDATION PHASE
- Decompose the hits into a set of k-mers and index all the locations of these k-mers.
- For each pair of locations of a k-mer w in the hits, do a BLAST extension.
  - Use an auxiliary data structure to avoid double checking.
- Report the pairs whose length exceeds our threshold.

ANALYSIS

ANALYSIS
- How to find most repeats?
  - Avoid false negatives.
- How to get better speed?
  - Avoid false positives.

HOW DO WE CHOOSE K? (1)
- If k is too big, the k-mer is too specific and we may miss some repeats.
- If k is too small, the k-mer cannot help us differentiate repeats from non-repeats.
- For repeats of length 50 and similarity > 0.9, we found that $k \approx \log_4 n + 2$ is good enough.

HOW DO WE CHOOSE K? (2)
- A random k-mer matches one of n chosen k-mers (analogous to throwing n balls into $4^k$ bins):
  Pr(a k-mer re-occurs at random in a sequence of length n) $= 1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.
- We require $1 - \exp(-n/4^k) \le \epsilon_1$, which holds when $k \ge \log_4 n + \log_4(1/\epsilon_1)$.
- If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.

THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) (1)
- Consider a pair of repeats of length L with m mismatches.
- The probability of a preserved k-mer in the repeat is $1 - M / \binom{L}{m}$.
- M is the number of nonnegative integer solutions to
  $x_1 + x_2 + \cdots + x_{m+1} = L - m$
  subject to $0 \le x_1, x_2, \ldots, x_{m+1} \le k - 1$.
- The m mismatches split the repeat into m+1 runs of matches with lengths $x_1, \ldots, x_{m+1}$; no k-mer is preserved exactly when every run is shorter than k.

THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) (2)
- It is easy to see that M is the coefficient of $x^{L-m}$ in
  $(1 + x + x^2 + \cdots + x^{k-1})^{m+1} = (1 - x^k)^{m+1} (1 - x)^{-(m+1)}$.
- Hence
  $M = \sum_{0 \le j \le \min(m+1, \lfloor (L-m)/k \rfloor)} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$.

CRITERION FOR PATH TERMINATION (1)
- Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say 10%.
- Then the pruning strategy is length dependent.
- If the strings under consideration have length r, we allow $\delta(r)$ mismatches.

CRITERION FOR PATH TERMINATION (2)
- Let q be the mismatch probability and r be the length of the string.
- Probability that a string has s mismatches:
  $P_q(s) = q^2 \sum_{j=s-2}^{r-2} \binom{r-2}{j} q^j (1 - q)^{r-2-j}$
- For a threshold $\epsilon$ (say, 0.01), we set
  $\delta(r) = \max \{ 2 \le s \le r - 2 \mid P_q(s) > \epsilon \} + 2$

CONTROL OF FALSE POSITIVES (1)
- Two typical cases
- The probability of (case 1)/(case 2) is 2*4^-…
- P(case 1 or case 2) is small.
- For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$

EVALUATION
Compare with other programs

PROGRAMS
- EulerAlign by Zhang and Waterman
- PALS by Edgar and Myers
- REPuter by Kurtz et al.
- SAGRI

MEASUREMENT
- Count Ratio (CR): the ratio of the number of repeat pairs that share more than 50% with a reference pair to the number of reference pairs.
- Shared Repeat Region (SRR): the ratio of the found region to the reference region.

SIMULATED DATA
Conclusion from simulated data
- The result is consistent with the analysis.

GENOME DATA
- M.gen (0.6 Mbp)
  - Organism with the smallest genome
  - Lives in the primate genital and respiratory tracts
- C.tra (1 Mbp)
  - Lives inside the cells of humans
- E.coli (4 Mbp)
  - An important bacterium that lives in the lower intestines of mammals
- A.ful (2.1 Mbp)
  - Found in high-temperature oil fields
- Human chr22 p20M to p21M (1 Mbp)
- Use the CR and SRR ratios to measure
- Cross validation

    G/H    H/G    Interpretation
    = 1    < 1    G "outperforms" H
    < 1    = 1    H "outperforms" G
    < 1    < 1    G, H are complementary
    = 1    = 1    G, H are similar

QUESTIONS AND ANSWERS
- H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008.
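The bounds from HOW DO WE CHOOSE K? (2) and THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) can be checked numerically. The Python sketch below is an illustration under stated assumptions, not code from SAGRI: the function names are hypothetical, `min_k` rounds the bound on k up to the nearest integer (the slides do not say how k is rounded), and the threshold is passed as its reciprocal so that the arithmetic stays in integers.

```python
import math


def min_k(n, inv_eps=16):
    """Smallest integer k with 4^k >= n / eps, i.e. k >= log4(n) + log4(1/eps).

    inv_eps is 1/eps; the default 16 corresponds to eps = 1/16 on the slides.
    Rounding up to an integer is an assumption of this sketch.
    """
    k = 0
    while 4 ** k < n * inv_eps:
        k += 1
    return k


def random_reoccurrence_prob(n, k):
    """1 - (1 - 4^-k)^n: chance that a given k-mer matches one of n random k-mers."""
    return -math.expm1(n * math.log1p(-(4.0 ** -k)))


def count_no_preserved_kmer(L, m, k):
    """M: number of placements of m mismatches in a length-L repeat such that
    every run of matches is shorter than k (inclusion-exclusion sum)."""
    upper = min(m + 1, (L - m) // k)
    return sum(
        (-1) ** j * math.comb(m + 1, j) * math.comb(L - j * k, m)
        for j in range(upper + 1)
    )


def preserved_kmer_prob(L, m, k):
    """Probability that at least one k-mer survives: 1 - M / C(L, m)."""
    return 1 - count_no_preserved_kmer(L, m, k) / math.comb(L, m)


if __name__ == "__main__":
    n = 10 ** 6                      # illustrative sequence length
    k = min_k(n)
    print(k, random_reoccurrence_prob(n, k))
    # Repeat of length 50 with 5 mismatches (similarity 0.9), as on the slides.
    print(preserved_kmer_prob(50, 5, k))
```

Comparing `preserved_kmer_prob` for several values of k shows the trade-off stated in HOW DO WE CHOOSE K? (1): a larger k lowers the chance of random matches but also lowers the chance that a mutated repeat still shares a k-mer with its copy.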
