Building a dictionary for genomes By Harmen J. Bussemaker, Hao Li, and Eric D. Siggia Tal Frank Topics that will be discussed Biological background Present the biological problem Show an algorithm that treats this problem statistical mechanics methods Try our algorithm on two well known problems What we did so far Human Genome Project(2001) This article published(2000) :Sequence is not everything - Lets do some theory Control over gene expression - when, how much Control element = Regulator = Sequence motif Genes are working together = Co-regulated genes The goals of this work Identify the Control element Where are they located ? Identify co-regulated genes Multiple control elements Example : where are the control elements located? Concepts: directionality , upstream ,in the junk TACGAXTTCGA Example: co-regulated genes naïve approach :TACGAXTTTAAYATGGCA experimentally :TACGAXTTCGAYATGGCA To activate set of genes: multiple sequences needed New terminology DNA = string of letters Control element = word Multiple control element = sentences Genes and junk = background noise Example : S: …GAGCXTGGYGCTT…… words = {GA,TG} sentence = GA.TG background = genes and junk. MobyDick algorithm decipher a ‘‘text’’ consisting of a long string of letters written in an unknown language. Find the words in the text Find the right spacing example : D={A,T,AT} S=ATT P1=A.T.T P2=AT.T How would you do it ? 1.Look for repeated substring in the string : Tal went to Weizmann this morning. When he arrived he didn’t go to his office, he went to drink a cup of coffee …. {went, to, he} D (dictionary) 2.Space the text – ooopps Spacing is not that simple. e.g.– D={A,T,AT} S=ATT P1=A.T.T p1 P2=AT.T p2 MobyDick Blueprints 1 letter word S=TAGATAT Find pw D={T,A,G} pw ={pA,pT..} 2 letter word D={A,TA,…} S=TAGATAT Find pw pw ={pA,pTA.} No more optional words stop! Find spacing S=TA.G.A.TA.T statistical mechanics in order to ? 1.How does MobyDick decide {pw}? 2.When does MobyDick add a new word? 3.Space (parse) the text. The likelihood function z pw k k: Nw ( k ) w a possible spacing Nw: number of times the word w appears Example : D=(T,AT,A) S=TATA k1=T.A.T.A Z p A2 pT2 p A pT p AT k2=T.AT.A Likelihood function - intuition Z(D,{pw})- partition function: <E>,<N>,<T>,…. Z(D,{pw})- the probability to obtain a sequence S. Example : D =(T,AT,A) {pT,pA,pAT} Question : what is the probability to S=TATA? 1st possibility : T.A.T.A pA*pA*pT*pT 2nd possibility: T.AT.A pT*pAT*pA p p A p p A pT p AT Z 2 2 T Finding {pw} Given : D,S Maximize Z({pw},D) with respect to {pw} This {pw} gives the highest probability to get the given S Lets find the {pw} ! Definition : N w - average number of the word w over the different spacings . Can prove: N w pw ( ) ln Z pw Nw maximize Z- solve: pw w' N w' solving : is done by iteration: pw’ <Nw’> pw Enough is enough !!! When is pw good enough ? when the new {pw} don’t give higher Z We say : this method converges ! Other methods don’t converge. Why finding {pw} using this way ? Monte-Carlo methods don’t converge. Slow method can transform to fast method Order of complexity O(LDl) L-the length of the string D-the size of the dictionary l-the length of the longest word in D Add new words ? Look at dictionary D={T,A,C,G} S=TATTGA Compose new word ww’ D={T,A,C,G} S=TATTGA ww’=TA Check occurrence D={T,A,C,G} S=TATTGA ww’=TA [ N _ ww '] [ N w pw' ]? Yes- add to dictionary D={T,A,C,G,TA} S=TATTGA A problem and a bad solution The algorithm finds only the words which are composed from words already in the dictionary. Example : S=AATATAAA 1st step : S=AATATAAA D= {A} 2nd step : S=AATATAAA AT is not a composition of words Solution: Look for repeated long strings by consideration the problem Spacing Define : w number of times the word w occurs in a given spacing. Quality factor : Nw Qw w The required condition : Qw 1 checking the algorithm Applying on the English novel Moby Dick Applying on Control elements on the yeast genome Not always possible - Voynich manuscript (1450) Preparing the book MobyDick Call me Ishmael. Some years ago- never mind how long precisely- having little or no money in my purse, and nothing particular tothought I would sail … CallmeIshmaelSomeyearsagonevermindhowlongpreciselyhavingli ttleornomoneyinmypurseandnothingparticulartothoughtIwouldsail.. CallabajameIshmaelbjklmbbSomeyearsagonevermindhowlon Eciselyhavinglittlermsdrornomoneyinmypurseandnothingparticu artothoughtIwouldsail … Results- MobyDick 10 first chapters D={a,b,c….} Text : 4,214 unique words 2,630 occurred only once Background – increases L by the factor of 3. 2,450 words found , 700 in English, 40 composite words. Results- yeast D={T,A,C,G} Text : 443 experimentally determined sites Background – genes and junk 500 words found 114 match the experimentally predictions Not that good – it is a beginning! The end