Work @ Fudan University
Chen, Yaoliang

• TTS System
  ▫ A Chinese Text-To-Speech system
• SafeDB
  ▫ Bug backlog
• SMemoHelper
  ▫ A small tool that helps learn English words
• Fraud Detecting
  ▫ Time series tech

• CGAP-align: A high performance DNA short read alignment tool
  ▫ Coauthored with BCM. Bioinformatics paper in progress
  ▫ NDBC Demo
• On Encoding Shortest Paths in Large Graphs
  ▫ Coauthored with Jian Pei. VLDB paper in progress
  ▫ Coauthored with Haixun Wang. SIGMOD paper in progress
  ▫ NDBC
• Other Projects

• Baylor College of Medicine
• Sequence alignment and its significance
  ▫ Reference & Reads: ACTAGCGATATAACCCTTTCCCTTTCCCTTT / CACGAT / CACGAT
• Given a number z, a reference X and a read W, we want to find a substring W' = X[i, i+1, …, j] such that EditDistance(W, W') ≤ z.

• A human genome sequence
  ▫ 2000: € 1,000,000,000, in ~10 years
  ▫ 2008: € 50 - 100,000, in ~4 months
  ▫ 2010: € 5 - 10,000, in ~2 weeks
  ▫ ...2015: € 1,000, in ~1 day
  ▫ ...2020: € 10, in ~1 hour to minutes
[Figure: DNA sequences in GenBank]

• Burrows-Wheeler Alignment Tool (BWA)
  ▫ A popular tool for aligning gene fragments against a large reference sequence
• Optimization of BWA
  ▫ Code level
  ▫ Algorithm level
• BWA Performance: T = N × Taln
  ▫ N: number of modified reads produced by enumerating all mismatches and gaps of the read
  ▫ Taln: time to locate the modified reads in the reference during the alignment stage

• Optimizing Taln: efficiency for matching
  ▫ Suffix Tarray
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
  ▫ Data-Conscious D-Array Calculating

• Suffix Tree
• Suffix Array Based on BWT (FM-index)
• Comparison
[Figure: suffix tree (Root, Leaf, b=2) compared with the FM-index, for Ref=ATCTTCAAGA and Read=TAA]
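The alignment problem stated above (find a substring W' of X with EditDistance(W, W') ≤ z) can be made concrete with a small dynamic program. The sketch below is not how BWA or CGAP-align search; it is a plain quadratic-time baseline, assuming unit costs for substitutions, insertions and deletions. The function name `approx_match_ends` is hypothetical.

```python
def approx_match_ends(X, W, z):
    """Return (end, distance) pairs such that some substring of X ending at
    position `end` (exclusive) is within edit distance z of the read W.

    Assumption: unit cost for substitution, insertion and deletion.
    """
    m = len(W)
    # prev[i] = best edit distance between W[:i] and a substring of X
    # ending at the previous reference position. Before any reference
    # character is read, W[:i] can only be matched by i deletions.
    prev = list(range(m + 1))
    hits = []
    for end, c in enumerate(X, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere in X
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (W[i - 1] != c),  # match or substitution
                prev[i] + 1,                    # extra reference character c
                cur[i - 1] + 1,                 # read character with no partner
            )
        if cur[m] <= z:
            hits.append((end, cur[m]))
        prev = cur
    return hits


# The reference and read from the comparison figure, with z = 1.
hits = approx_match_ends("ATCTTCAAGA", "TAA", 1)
```

This baseline costs O(|X| · |W|) per read, which is why index-based approaches such as the FM-index are used on genome-sized references.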
From Yuval Rikover

All rotations of mississippi#:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi

Sort the rows (F = first column, L = last column):
  F  middle      L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

Reminder: Recovering T from L
1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
  ▫ The i's are in the same order in L and F
  ▫ As are the rest of the chars
6. i is followed by s: mis
7. And so on…
  F = # i i i i m p p s s s s
  L = i p s s m # p i s s i i

• Backward-search algorithm
• Uses only L (output of BWT)
• Relies on 2 structures:
  ▫ C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars)
  ▫ Occ(c,q): number of occurrences of char c in prefix L[1,q]
Example
• C[ ] for T = mississippi#: i: 1, m: 5, p: 6, s: 8
• occ(s, 5) = 2
• occ(s, 12) = 4

SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)
P = si
• First step: find the rows prefixed by char "i"; they form the range [fr, lr]
• Inductive step: given fr, lr for P[j+1,p]
  ▫ Take c = P[j]
  ▫ Find the first c in L[fr, lr]
  ▫ Find the last c in L[fr, lr]
  ▫ The Occ() oracle is enough
• occ = lr - fr + 1 = 2
[Figure: sorted rotation matrix of mississippi# with its L column and the range fr..lr marked]

• Backward search
• Store "First" and "Last" (k and l) values

• P = CAA
  ▫ c = 'A', 'A', 'C' (the pattern is consumed from right to left)
  ▫ First(CAA) = C['C'] + Occ('C', First(AA) - 1) + 1
  ▫ Last(CAA) = C['C'] + Occ('C', Last(AA))
[Figure: the same search on the suffix tree (Root, A, A) and on the FM-index]
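The backward-search steps above can be tried out on the mississippi# example. The following is a minimal sketch, not BWA's implementation: it builds the BWT by sorting all rotations and answers Occ() by scanning L, both of which are far too slow for real genomes. It assumes the sentinel '#' sorts before every other character; the function names are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Last column L of the sorted rotation matrix (naive construction)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def build_c(text):
    """C[c] = number of characters in text that are strictly smaller than c."""
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def occ(L, c, q):
    """Occurrences of c in the prefix L[1, q] (1-based, inclusive)."""
    return L[:q].count(c)


def backward_search(L, c_table, pattern):
    """Return the 1-based row range (fr, lr) of rows prefixed by pattern,
    or None if the pattern does not occur."""
    fr, lr = 1, len(L)
    for c in reversed(pattern):          # consume the pattern right to left
        if c not in c_table:
            return None
        fr = c_table[c] + occ(L, c, fr - 1) + 1
        lr = c_table[c] + occ(L, c, lr)
        if fr > lr:
            return None
    return fr, lr


T = "mississippi#"
L = bwt(T)
C = build_c(T)
fr, lr = backward_search(L, C, "si")
count = lr - fr + 1                      # number of occurrences of "si" in T
```

A production FM-index would store sampled Occ() checkpoints instead of rescanning L, so each step costs constant time.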
Data-Conscious D-Array Calculating

• e(W)
  ▫ The minimal number of edit operations needed to make W align exactly onto the reference X.
• D-array
  ▫ D[i]: lower bound of e(W[0…i])

• Given a string W and an arbitrary segmentation W = w1 w2 … wk, we have e(W) ≥ e(w1) + e(w2) + … + e(wk).
• D array in BWA
  ▫ Split W into several small strings W = w1 w2 … wk with e(wi) = 1 for all i. The correctness of the algorithm depends on the inequality e(W) ≥ e(w1) + e(w2) + … + e(wk).

• Example: Reference X = "AACGTATCGACG"
  ▫ W: A A C T G G A
  ▫ D: 0 0 0 1 1 1
• A better segmentation: consider e(·) = 2
  ▫ W: A A C C T G G A
  ▫ D: 0 0 0 1 2 2
  ▫ Calculating e(·) costs exponential time
  ▫ Needs pre-computation

Train Reads
• Fasta file F containing training reads
• Should be similar to the reads in practice
Frequent Patterns
• Mining Frequent Patterns (FPs)
• State of the Art Methods
• Our solution: A simple DFS on the FM-index
  ▫ Count = Last - First + 1
Trie DFA
• Generate prefix trie T for FPs with e(w) = 2
• Data Conscious
• Refine T to a DFA GT

• Why Trie DFA?
  ▫ During online alignment, we need to find all the FPs contained in a read
  ▫ This operation should cost no more than O(|W|)

Offline Index: Construction
• String Set (FP set)
  ▫ AA
  ▫ AC
  ▫ AG
  ▫ C
  ▫ G
  ▫ T
[Figure: prefix trie rooted at R with labelled nodes for AA, AC, AG, C, G and T]
• The prefix trie is done. We start to construct the DFA.

• DFS order: minimizes the average hop between each jump. (7% up)
[Figure: trie nodes renumbered in DFS order]

Online Query
• String Set (FP set)
  ▫ AA
  ▫ AC
  ▫ AG
  ▫ C
  ▫ G
  ▫ T
• W = "CACAT"
[Figure: the DFA traversed on W]

• Optimizing Taln: efficiency for matching
  ▫ Suffix Tarray (20% up)
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
  ▫ Data-Conscious D-Array Calculating (0-200% up)

• Background
• Consider a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges.
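For the graph setting just introduced, the first-hop function FH(u, v) used in the FH-Partition part can be pictured as a next-hop table. The sketch below rests on assumptions the slides do not spell out: FH(u, v) is taken to be the vertex that follows u on a shortest path from u to v, and the graph is unweighted and undirected. The small graph `GRAPH` is hypothetical, chosen so that the path from 7 to 10 runs through 9 and 2.

```python
from collections import deque

# Hypothetical undirected, unweighted graph as an adjacency list.
GRAPH = {
    4: [7],
    7: [4, 9],
    9: [7, 2],
    2: [9, 10],
    10: [2],
}


def first_hops_to(graph, target):
    """Map every vertex u that can reach target to FH(u, target).

    A BFS from the target discovers u through a neighbour v that is one
    step closer to the target, so v is a valid first hop for u.
    """
    seen = {target}
    fh = {}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in seen:
                seen.add(u)
                fh[u] = v
                queue.append(u)
    return fh


def shortest_path(graph, source, target):
    """Rebuild a shortest path by following first hops until the target."""
    fh = first_hops_to(graph, target)
    path = [source]
    while path[-1] != target:
        path.append(fh[path[-1]])
    return path


fh_to_10 = first_hops_to(GRAPH, 10)
path = shortest_path(GRAPH, 7, 10)
```

Storing FH(u, v) for every pair takes space quadratic in |V|, which is the cost that a compact encoding of shortest paths tries to avoid.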
• FH-Partition
[Figure: example graph with a query from vertex 7 to vertex 10]
7 -> 10: FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10

• Numbering Function

Compute FH-Partitions -> Get Numbering Function(s) -> Encoding FH-Partitions
• Compute a naïve numbering function
• Reduce to TSP
• Multi numbering functions
• Further Compression
• Region tree
• Store the FH-partitions
• Answering query efficiently

Thank you!
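A closing sketch for the D-array part of the talk: the BWA-style lower bound can be computed by cutting the read greedily into the shortest pieces that do not occur in the reference, each of which needs at least one edit. This is a simplification under stated assumptions: substring membership is tested with Python's `in` operator rather than an FM-index, the data-conscious variant with e(w) = 2 patterns is not modelled, and the read `GTATTCGA` is a hypothetical example rather than one from the slides.

```python
def d_array(W, X):
    """D[i] = lower bound on the number of edits needed to align W[0..i] to X.

    Greedy segmentation: extend the current piece until it no longer occurs
    in X, then count one edit and start a new piece after it.
    """
    D = []
    edits, start = 0, 0
    for i in range(len(W)):
        if W[start:i + 1] not in X:   # the current piece cannot match exactly
            edits += 1
            start = i + 1             # the next piece begins after position i
        D.append(edits)
    return D


X = "AACGTATCGACG"                    # reference from the example slide
D = d_array("GTATTCGA", X)            # hypothetical read
```

During the search, a branch that has i read characters left to align and fewer than D[i] edits remaining can be dropped immediately, which is where the reduction in N comes from.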