IMPORTANT:
Applications of metaheuristics to the Sequential Ordering Problem will be covered first.
Applications of metaheuristics to DNA codes design will be treated only if time permits.
Roberto

DNA Codes Design
Roberto Montemanni
Dalle Molle Institute for Artificial Intelligence
University of Applied Science of Southern Switzerland
Email: roberto@idsia.ch
Tel: +41 58 666 666 7

Outline
• Introduction
• The DNA Codes Design problem
• Approaches in the literature
• Construction heuristics
• Simple local searches
• Metaheuristics
  – Intro to Stochastic Local Search
  – Applications to the DNA codes design problem
  – Intro to Variable Neighbourhood Search
  – Applications to the DNA codes design problem
• Bibliography

Contributions to slides
• Dan C. Tulpan, NRC Institute for Information Technology, Canada (Introduction, Applications, Stochastic Local Searches)
• Marco Chiarandini, University of Southern Denmark, Denmark (Introduction to Stochastic Local Search)
• Thomas Stuetzle, Darmstadt University of Technology, Germany (Introduction to Variable Neighbourhood Search)

DNA – The Blueprint of Life
[Figure: DNA surrounded by pictures of bacteria, human, chimp, worm, fish, cow, dinosaur and bird; 9 pictures taken from ClipArt]
Background: DNA

What is DNA?
• All organisms on this planet are made of the same type of genetic blueprint.

Real Applications
• DNA computing => using DNA for massively parallel computations.
• DNA chemical libraries => for the development and testing of new drugs
• DNA microarrays => for profiling genes and tracing genes within long DNA strands
• DNA nanotechnologies => for the development of new materials/devices
http://en.wikipedia.org/wiki/DNA_computing

What is DNA?
• Genetic material
• Four-letter alphabet (nucleotides, bases):
  – A (adenine)
  – C (cytosine)
  – G (guanine)
  – T (thymine)
• Complementary base pairs: C-G, A-T
• Hybridization via base pairing
[Figure: DNA double helix (Wikimedia Commons) and two 5'-3' duplex diagrams showing perfect hybridization and imperfect hybridization]
Background: DNA

Desired properties
• The desired properties come from real applications
• Note that the properties are not the same for all applications
Modeling design goals: uniform stability, non-interaction
[Figure: two 5'-3' duplex diagrams]

DNA Codes Design: Problem description
Input data:
• The alphabet {A, C, G, T}
• A fixed length n for the codewords
• A required distance d among codewords (used by the constraints in Z)
• A set Z of constraints (explained in the next slides)
Optimization objective:
• Find the largest possible set of codewords (= code) of length n over the alphabet {A, C, G, T} that is feasible with respect to the constraints Z (based on d)
Why maximize the size of the code? To have more flexibility in the applications seen before!
DNA Codes Design: Problem description
Code (solution):
ACCTGATT TCACCATG ATTCCCAG CTACTACG ACCTTTTT GGCTTTTA TATATATA
TTGGCCAA CATTCACC CTATTCAC GATTCAAT GCGCGCGC GCTTATTC CCGTTACA
Example codeword: AATTCCGG
Word length: n = 8
The solution respects a given constraint set Z (we do not know Z at this stage!)

DNA Codes Design: Problem description
Requirements of a DNA code
• Successful specific hybridization between a DNA codeword and its complement.
• No hybridization between DNA codewords of the same DNA code, or between a DNA codeword and the complement of another codeword.
How do these requirements translate into our constraint set Z?

DNA Codes Design: Problem description
Constraints considered (set Z):
• Requirement: the distance between two codewords must be large (no hybridization).
• Answer: HD (Hamming Distance)
  – Given two codewords w1 and w2
  – H(w1, w2) = number of positions i in which the i-th letter of w1 differs from the i-th letter of w2
  – Example: w1 = GCTA, w2 = ATTA, H(w1, w2) = 2
  – Constraint: H(w1, w2) ≥ d

DNA Codes Design: Problem description
Constraints considered (set Z):
• Requirement: the number of G or C letters must be the same in each codeword (uniform stability) [=> self-hybridization is likely]
• Answer: GC (GC-content constraint)
  – A fixed number of the letters of each word has to be either G or C: floor(n/2) in our case
  – Example: ATA is not feasible, AGA is feasible

DNA Codes Design: Problem description
• Requirement: the distance between a codeword and the complement of another codeword must be large.
Watson-Crick complement of a DNA codeword
wcc(w) = Watson-Crick complement of a DNA codeword w, obtained by reversing w and then replacing each A by T (and vice versa) and each C by G (and vice versa)
  – Example: wcc(ATGC) = GCAT

DNA Codes Design: Problem description
Constraints considered (set Z):
• Requirement: the distance between a codeword and the complement of another codeword must be large.
• Answer: RC (Reverse Complement Hamming distance)
  – Given two codewords w1 and w2
  – Example: w1 = GCTA, w2 = ATGC: H(GCTA, wcc(ATGC)) = H(GCTA, GCAT) = 2
  – Constraint: H(w1, wcc(w2)) ≥ d

Example of a problem and its solution
• Input data: n = 4, d = 3.
• Constraints considered: HD, GC, RC
• Solution: the largest possible code with the characteristics above contains 6 codewords.
Optimal code with respect to the constraints considered (not unique!):
CTTC GGTT GTCA AGGA ACTG TTGG

Problem description
Important observation
• Other kinds of constraints are possible.
• They depend on the real-world application considered.
• In this mini-course we limit ourselves to the constraints on the previous slides.

Approaches from the literature
TEMPLATE-MAP DESIGN
• Find the largest possible set of 8-mers with
  – 50% GC content in each word
  – at least four mismatches between each word and the complement of each distinct word (reverse-complement constraint)
  – at least four mismatches between each pair of words (direct Hamming constraint)
  – based on template-map design
Frutos A.G., Liu, Q., Thiel A.J., Sanner A.M.W., Condon A.E., Smith L.M., Corn R.M. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Res. 25, 4748-4757 (1997).
Arita, M., Kobayashi, S. DNA sequence design using templates. New Generation Computing, 20, 263-277 (2002).
Kobayashi, S., Konto, T., Arita, M. On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214 (2003).
Koul, N. Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI (2010).
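The HD, GC and RC constraints from the problem description can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the RC constraint is also applied to a codeword against its own Watson-Crick complement, which the slides do not state explicitly.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    """H(w1, w2): number of positions in which the two words differ."""
    return sum(a != b for a, b in zip(w1, w2))


def wcc(w):
    """Watson-Crick complement: reverse w, then complement each base."""
    return "".join(COMPLEMENT[b] for b in reversed(w))


def gc_content(w):
    """Number of letters of w that are G or C."""
    return sum(b in "GC" for b in w)


def is_feasible(code, d):
    """Check the HD, GC and RC constraints on a list of codewords."""
    n = len(code[0])
    if any(len(w) != n or gc_content(w) != n // 2 for w in code):
        return False  # GC: exactly floor(n/2) letters must be G or C
    for i, w1 in enumerate(code):
        # Assumption: RC is also enforced between a word and its own complement.
        if hamming(w1, wcc(w1)) < d:
            return False
        for w2 in code[i + 1:]:
            if hamming(w1, w2) < d:        # HD constraint
                return False
            if hamming(w1, wcc(w2)) < d:   # RC constraint (symmetric in w1, w2)
                return False
    return True
```

Under these assumptions, `is_feasible(["CTTC", "GGTT", "GTCA", "AGGA", "ACTG", "TTGG"], 3)` accepts the optimal code of the n = 4, d = 3 example.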
Approaches from the literature
TEMPLATE-MAP DESIGN
• The selection of maps and templates is based on reasoning and theoretical results
• It is difficult to apply the results to different problems: not a general approach

Approaches from the literature
MATHEMATICAL CONSTRUCTIONS
• Approaches adapted from classic Coding Theory
• Theoretical results, based on the characteristics of the desired code, are used to produce mathematical constructions leading to (very regular) codes
• Example:
  Theorem. If C0 is a code that is fixed by reverse permutation R, then the subcode C1 of C0 consisting of the codewords that are unchanged by R is obtained as the intersection of C0 and the code R(C0).
• Not a general method. Results typically hold for the problem under investigation only
• The codes obtained are very regular. For many applications this is not desirable
King, O. D. Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33 (2003).
Gaborit P., King O. D. Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113 (2005).
Neelakandan, I. New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI (2010).

Approaches from the literature
HEURISTIC ALGORITHMS
• Many of the classic heuristic algorithms have been adapted, implemented and tested
• We will see some of them in detail…!

Construction Heuristics
Construction Heuristic (CH)
All possible codewords with the required GC-content are examined in a given order.
Codewords are incrementally accepted if feasible with respect to the already accepted ones.
Smith, D.H., Hughes L.A., Perkins S.
A new table of constant weight binary codes of length greater than 28. Electron. J. of Combinatorics, 13(1), #A2 (2006).
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).
Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009).

Construction Heuristics
Example: n = 4, d = 3. Constraints: HD, GC, RC
Lexicographic order:
AACC AACG AAGC AAGG ACAC ACAG ACCA ACCT ACGA ACGT ACTC ACTG
AGAC AGAG AGCA AGCT AGGA AGGT AGTC AGTG ATCC ATCG ATGC ATGG
CAAC CAAG CACA CACT CAGA CAGT CATC CATG CCAA CCAT CCTA CCTT
CGAA CGAT CGTA CGTT CTAC CTAG CTCA CTCT CTGA CTGT CTTC CTTG
GAAC GAAG GACA GACT GAGA GAGT GATC GATG GCAA GCAT GCTA GCTT
GGAA GGAT GGTA GGTT GTAC GTAG GTCA GTCT GTGA GTGT GTTC GTTG
TACC TACG TAGC TAGG TCAC TCAG TCCA TCCT TCGA TCGT TCTC TCTG
TGAC TGAG TGCA TGCT TGGA TGGT TGTC TGTG TTCC TTCG TTGC TTGG
Solution: AACC ACAG AGGA CCTA GTCA

Construction Heuristics
• The method works with any possible order of the codewords (lexicographic, reverse lexicographic, random) => different algorithms in fact…
• Computational experiments suggest that random orders give better results on DNA code design problems
• Slow for large problems (all possible codewords have to be examined!)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).
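The Construction Heuristic described above can be sketched as a single greedy pass over the GC-feasible codewords. This is an illustrative sketch rather than the published code: the function names are hypothetical, and the RC check is assumed to include a codeword against its own complement. With that assumption, the lexicographic run for n = 4, d = 3 gives the five codewords listed in the example.

```python
import random
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(w):
    return "".join(COMPLEMENT[b] for b in reversed(w))


def compatible(w1, w2, d):
    """HD and RC constraints between two codewords (also used with w1 == w2)."""
    if w1 != w2 and hamming(w1, w2) < d:
        return False
    return hamming(w1, wcc(w2)) >= d


def construction_heuristic(n, d, rng=None):
    """Greedy pass: accept each candidate that is feasible with the accepted ones.

    Candidates are examined in lexicographic order, or in random order
    when a random.Random instance is passed as rng.
    """
    candidates = ["".join(t) for t in product("ACGT", repeat=n)]
    candidates = [w for w in candidates
                  if sum(b in "GC" for b in w) == n // 2]  # GC constraint
    if rng is not None:
        rng.shuffle(candidates)
    code = []
    for w in candidates:
        # Assumption: a word must also be far from its own complement.
        if compatible(w, w, d) and all(compatible(w, v, d) for v in code):
            code.append(w)
    return code


lexicographic_code = construction_heuristic(4, 3)
random_code = construction_heuristic(4, 3, rng=random.Random(0))
```

Passing different `random.Random` seeds gives the random-order variants mentioned in the comments above; each run costs one scan of all candidates.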
Seed Building local search
Seed Building (SB)
Iterative approach
A set of seed codewords is considered
The set of seed codewords is dynamically adapted through iterations
During each iteration:
• All possible codewords with the required GC-content are examined in a given order.
• Codewords are incrementally accepted if feasible with those already accepted in the current iteration and with the seed codewords.
Statistics are used to expand or contract the set of seed codewords every ItrSeed iterations, based on the quality of the solutions built.
Brouwer A.E., Shearer J.B., Sloane N.J.A., Smith W.D. A new table of constant weight codes. IEEE Trans. Inf. Theory 36, 1334-1380 (1990).
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).

Seed Building local search
[Figure: seed codewords management]

Seed Building local search
Example: n = 4, d = 3.
Constraints: HD, GC, RC
Seed codewords: AACC ACAG
Random order:
CTTC CTTG CTCA CTCT CTGA CTGT CTAC CTAG CATC CATG CACA CACT
CAGA CAGT CAAC CAAG CCTA CCTT CCAA CCAT CGTA CGTT CGAA CGAT
GTTC GTTG GTCA GTCT GTGA GTGT GTAC GTAG GATC GATG GACA GACT
GAGA GAGT GAAC GAAG GCTA GCTT GCAA GCAT GGTA GGTT GGAA GGAT
TTCC TTCG TTGC TTGG TACC TACG TAGC TAGG TCTC TCTG TCCA TCCT
TCGA TCGT TCAC TCAG TGTC TGTG TGCA TGCT TGGA TGGT TGAC TGAG
ATCC ATCG ATGC ATGG AACC AACG AAGC AAGG ACTC ACTG ACCA ACCT
ACGA ACGT ACAC ACAG AGTC AGTG AGCA AGCT AGGA AGGT AGAC AGAG
Solution: AACC ACAG CCTA GTCA TCCT

Seed Building local search
• The method works with any possible order of the codewords (lexicographic, reverse lexicographic, random).
• Experiments clearly show that a random order is to be preferred for DNA codes design problems.
• Identifying a good set of seed codewords is intrinsically difficult => codes produced are sometimes very good and sometimes very poor => not a very robust method
• Slow for large problems (all possible codewords are examined at each iteration!)

Clique Search local search
• Clique: given an undirected graph G, a clique is a set of vertices in which every vertex is connected to every other vertex of the set
• Maximum clique problem: given an undirected graph G, identify the largest (in number of nodes) clique of G
• Complexity: classic NP-hard problem
[Figure: example graph]
• {0, 3, 4} is a clique
• {2, 3, 4, 5} is a maximal clique

Clique Search local search
Clique Search (CS)
Iterative approach
A partial code can be completed by solving a subproblem (which is a maximum clique problem) to optimality
During each iteration:
• All possible codewords with the required GC-content are examined in a random order.
• Codewords are accepted for the second phase if feasible with those of the partial code.
• A maximum clique problem is solved on the set of accepted codewords to complete the partial code
Montemanni R., Smith D.H.
Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).
Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009).

Clique Search local search
[Figure: Clique Search scheme]

Clique Search local search
Example: n = 4, d = 3. Constraints: HD, GC, RC
Partial code: CTTC CGAA TGGT GTGA
Maximum clique problem on feasible extensions of the partial solution: CACT AGTG AAGC GCTT
Solution: CTTC CGAA TGGT GTGA CACT GCTT

Clique Search local search
• Solving a maximum clique problem (sub-procedure) is an NP-hard problem itself!
• Heuristics have to be used for the maximum clique problem => no optimality is guaranteed for the sub-problem solutions
• The choice of the number of codewords to eliminate is crucial:
  – too many codewords eliminated => very large maximum clique problem => high probability of suboptimality
  – not enough codewords eliminated => very likely to find a code with the same number of codewords as the original
• This aspect deserves a deeper study to tackle large problems!

Hybrid Search local search
Hybrid Search (HS)
Iterative approach
Merges the concepts of the two methods analyzed before.
A set of seed codewords is managed exactly as in Seed Building.
Seed codewords represent the partial code in the context of the Clique Search.
A relaxed distance d' < d is introduced. A candidate codeword has to be at least at distance d from the seeds, and d' from the other candidate codewords (this keeps the maximum clique problem to a reasonable size!)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm.
Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

Hybrid Search local search
[Figure: Hybrid Search scheme combining Seed Building and Clique Search]

Hybrid Search local search
Example: n = 4, d = 3. Constraints: HD, GC, RC
Partial code (seed codewords): CAAC AGAG
Maximum clique problem on feasible extensions of the partial solution (heuristic distance d' = 1 to reduce the codewords considered): TGGT TCTC TGTC TTGC TAGG TACG ATGC ACTC
Solution: CAAC AGAG TCTC TGGT TACG ATGC

Hybrid Search local search
• Sums the advantages of Seed Building to those of Clique Search but…
• There is the risk of summing up drawbacks instead!
• The method deserves a further detailed study for larger problems

Experimental comparison of some of the heuristic algorithms
Experimental settings
• Methods coded in ANSI C
• Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines
• Maximum computation times: 10'000 seconds (2.8 hours)
• Statistics over 5 runs for each combination problem/method
• A_4^{Cstrs}(5, 3, 2) identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
Experimental comparison of some of the heuristic algorithms
[Tables: experimental results for SB = Seed Building, CS = Clique Search, HS = Hybrid Search]

Experimental comparison of some of the heuristic algorithms
Comments
• No clear ranking is possible among the methods considered: Seed Building, Clique Search, and Hybrid Search
• Methods are therefore likely to represent different neighbourhoods

Idea
• All the methods seen until now work on the search space of feasible solutions (we never have constraints violated…)
• What if we move into the search space of infeasible solutions? => we will have to minimize (i.e. bring down to zero!) a measure of infeasibility!
• This makes it possible to develop a completely different kind of local search!
• It is likely that the search space is visited in a different way by such a family of algorithms…

Iterated Greedy Search local search
Iterated Greedy Search (IGS)
Iterative approach
Working on an infeasible code W, trying to make it feasible.
Measure of the infeasibility of W: Inf(W) [formula missing in the source], where w = floor(n/2)
Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

Iterated Greedy Search local search
Iterated Greedy Search (IGS)
An infeasible solution is obtained by adding a random codeword to a perturbed feasible solution
During each iteration:
• A codeword σ is selected at random and the optimal (according to Inf(W)) change of one letter of σ is carried out.
• If Inf(W) = 0, we are done, and we can add a random codeword
[Figure: alternation of perturbation of the solution and optimization of the solution]

Iterated Greedy Search local search
Example: n = 4, d = 3. Constraints: HD, GC, RC
W                                       Inf(W)
...
TGGT GACC CGAA TCAC CCTT                1
TGGT GACT CGAA TCAC CCTT                0
TGGT GGCA CGAA TCAC CCTT TTTG           8
TGGT GGCA CGTA TCAC CCTT TTTG           8
TGGT GGCA CGTA TCAC GCTT TTTG           7
TGGT GGCA CGTC TCAC GCTT TTTG           7
…
TGGT AGTG CGTC TCAC GCTT TTTG           4
TGGT AGTG CGTC TCAC GCTT TTCG           3
TGGT AGTG CTTC TCAC GCTT TTCG           0
TGGT AGTG GTAG TCAC GGTT TTCG AACT      9
TGGT AGTG GTAG TCTC GGTT TTCG AACT      9
...

Iterated Greedy Search local search
• We change exactly one letter of a random codeword at each iteration: more complex neighbourhoods could be considered…
• We never accept changes that make the solution worse: accepting them might be a way to escape from local minima
• A further investigation is deserved…
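The repair phase of Iterated Greedy Search can be sketched as follows. The formula for Inf(W) did not survive in the slides, so the measure used here is an assumption made only for illustration: it sums the shortfall max(0, d - H) over all violated HD and RC pairs (including each word against its own complement) plus the deviation of each word's GC content from w = floor(n/2). All function names are hypothetical, and the values produced need not match the Inf(W) column of the example.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(w):
    return "".join(COMPLEMENT[b] for b in reversed(w))


def infeasibility(code, d):
    """ASSUMED infeasibility measure (not the formula from the slides)."""
    w = len(code[0]) // 2
    total = sum(abs(sum(b in "GC" for b in x) - w) for x in code)  # GC term
    for i, a in enumerate(code):
        total += max(0, d - hamming(a, wcc(a)))        # word vs own complement
        for b in code[i + 1:]:
            total += max(0, d - hamming(a, b))         # HD shortfall
            total += max(0, d - hamming(a, wcc(b)))    # RC shortfall
    return total


def igs_step(code, d, rng):
    """Pick a random codeword and apply its best single-letter change.

    The change is kept only if it does not make Inf(W) worse.
    """
    i = rng.randrange(len(code))
    current = infeasibility(code, d)
    best_val, best_code = None, None
    for pos in range(len(code[i])):
        for letter in "ACGT":
            if letter == code[i][pos]:
                continue
            word = code[i][:pos] + letter + code[i][pos + 1:]
            trial = code[:i] + [word] + code[i + 1:]
            val = infeasibility(trial, d)
            if best_val is None or val < best_val:
                best_val, best_code = val, trial
    return best_code if best_val is not None and best_val <= current else code


def repair(code, d, max_iters, seed=0):
    """Iterate igs_step until Inf(W) = 0 or the iteration budget is spent."""
    rng = random.Random(seed)
    for _ in range(max_iters):
        if infeasibility(code, d) == 0:
            break
        code = igs_step(code, d, rng)
    return code
```

A full IGS loop would wrap `repair` with the perturbation step (adding a random codeword to a perturbed feasible code), which is omitted here.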
Experimental comparison of some of the heuristic algorithms
[Tables: experimental results for SB = Seed Building, CS = Clique Search, HS = Hybrid Search, IGS = Iterated Greedy Search]

Experimental comparison of some of the heuristic algorithms
Comments
• No clear ranking is possible among the methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search
• Methods are likely to represent different neighbourhoods

Stochastic Local Search: Simple SLS methods
Goal: Effectively escape from local minima of a given evaluation function.
General approach: For a fixed neighbourhood, use a step function that permits worsening search steps.
Specific methods:
• Randomized Iterative Improvement
• Simulated Annealing
• Attribute Based Hill Climber
• Dynamic Local Search
• Iterated Local Search
• Tabu Search

Stochastic Local Search: Randomized Iterative Improvement
Key idea: In each search step, with a fixed probability perform an uninformed random walk step instead of an iterative improvement step.
Randomized Iterative Improvement (RII):
  determine initial candidate solution s
  while termination condition is not satisfied do
    with probability p:
      choose a neighbour s' of s uniformly at random
    otherwise:
      choose a neighbour s' of s such that g(s') < g(s) or, if no such s' exists, choose s' such that g(s') is minimal
    s := s'
where g(s) is the objective function value (fitness) of solution s

Stochastic Local Search: Randomized Iterative Improvement
Observations:
• No need to terminate the search when a local minimum is encountered. Instead: impose a limit on the number of search steps or CPU time, from the beginning of the search or after the last improvement.
• The probabilistic mechanism permits arbitrarily long sequences of random walk steps. Therefore: when run sufficiently long, RII is guaranteed to find an (optimal) solution to any problem instance with arbitrarily high probability.
• Generally, RII is often outperformed by more complex LS methods.

Stochastic Local Search for the DNA codes design problem
Target: a code with k codewords
1. Start with k random codewords
2. Mark unsatisfied constraints (conflicts)
3. If no unsatisfied constraints go to 8
4. Pick 2 codewords involved in a conflict
5. With probability p select a better word minimizing the number of conflicts
6. Otherwise select a random codeword
7. Go to step 3.
8. Display all k codewords
It is a Randomized Iterative Improvement!
Tulpan, D.C., Hoos, H.H., Condon, A.E. Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241 (2002).
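The eight steps above can be sketched as a small Randomized Iterative Improvement loop. This is a hedged illustration, not the algorithm of Tulpan et al.: the names `sls`, `conflicts` and `one_exchange` are hypothetical, words are drawn from a pool that already satisfies the GC constraint and the assumed self-complement check, the improvement step uses a 1-exchange neighbourhood restricted to that pool, and the random step replaces one of the two conflicting words by a random pool word.

```python
import random
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(w):
    return "".join(COMPLEMENT[b] for b in reversed(w))


def conflicts(code, d):
    """Index pairs (i, j), i < j, that violate the HD or the RC constraint."""
    return [(i, j)
            for i in range(len(code)) for j in range(i + 1, len(code))
            if hamming(code[i], code[j]) < d
            or hamming(code[i], wcc(code[j])) < d]


def one_exchange(word, pool):
    """Words of the pool that differ from word in exactly one position."""
    return [word[:p] + c + word[p + 1:]
            for p in range(len(word)) for c in "ACGT"
            if c != word[p] and word[:p] + c + word[p + 1:] in pool]


def sls(k, n, d, p, max_steps, seed=0):
    rng = random.Random(seed)
    # Assumption: GC content and self-complement distance are fixed by the pool.
    pool = [w for w in ("".join(t) for t in product("ACGT", repeat=n))
            if sum(b in "GC" for b in w) == n // 2
            and hamming(w, wcc(w)) >= d]
    pool_set = set(pool)
    code = [rng.choice(pool) for _ in range(k)]          # step 1
    for _ in range(max_steps):
        conf = conflicts(code, d)                        # step 2
        if not conf:                                     # step 3
            return code                                  # step 8
        i, j = rng.choice(conf)                          # step 4
        if rng.random() < p:                             # step 5
            best = None
            for idx in (i, j):
                for cand in one_exchange(code[idx], pool_set):
                    trial = code[:idx] + [cand] + code[idx + 1:]
                    score = len(conflicts(trial, d))
                    if best is None or score < best[0]:
                        best = (score, trial)
            if best is not None:
                code = best[1]
        else:                                            # step 6
            code[rng.choice((i, j))] = rng.choice(pool)
    return None  # no conflict-free code found within max_steps


result = sls(k=4, n=4, d=3, p=0.8, max_steps=2000, seed=1)
```

The function returns `None` when the step budget runs out, which mirrors the stagnation risk discussed for large k.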
Stochastic Local Search for the DNA codes design problem
[Figure: step function; with probability p a best improvement step, with probability 1-p a random walk step]

Stochastic Local Search for the DNA codes design problem
[Figure: flowchart: Initialization, Evaluate, Return Result (if no conflicts), otherwise Pick Conflict, then Best Improvement with probability p or Random Walk with probability 1-p]

Stochastic Local Search for the DNA codes design problem
[Figure: Select Conflicts, Neighbourhood, Random Walk, Iterative / Best Improvement]

Stochastic Local Search for the DNA codes design problem
Given: a fixed set of constraints C, strand length n = 8, set size k = 14.
Current set (word 2 is ATTCTCAG before the move and TTTCTCAG after it):
 1. ACCTGATT    8. TCACCATG
 2. TTTCTCAG    9. CTACTACG
 3. ACCTTTTT   10. GGCTTTTA
 4. TATATATA   11. TTGGCCAA
 5. CATTCACC   12. CTATTCAC
 6. GATTCAAT   13. GCGCGCGC
 7. ATTCTCAA   14. CCGTTACA
Conflicts: (1,3) (1,5) (2,7) (2,9) (12,14)
Pick conflict. Neighbours: TTTCTCAG, AATCTCAG, …
With probability p: best improvement; with probability 1-p: random walk
Conflicts: (1,3) (1,5) (2,11)

Stochastic Local Search for the DNA codes design problem - results
Simple SLS without Random Replacement
k = {100, 120, 140}, n = 8, d = 4
HD constraint only
1000 successful runs
[Figure: distribution of the number of iterations required to have a feasible solution for different values of k (target number of codewords)]
Comments:
• The number of iterations required increases with k
• The increase is more dramatic when k is high => risk of stagnation

Stochastic Local Search for the DNA codes design problem - results
SLS with Random Replacement vs Simple SLS
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
Comments:
• Random Replacement helps!
• Stagnation reduced
• Better robustness
[Figure: distribution of the number of iterations required to have a feasible solution for k = 70 (target number of codewords)]

Stochastic Local Search for the DNA codes design problem
Scaling of SLS with Random Replacement
n = 8, d = 4
[Figure: number of search iterations (log scale, 10 to 100000) versus DNA set size (20 to 160), for HD, HD+GC and HD+GC+RC]
Comments:
• SLS scales up better when fewer constraints are considered
• Why? Because fewer constraints => easier problem, intuitively

Stochastic Local Search for the DNA codes design problem - results
New bounds on the size of DNA codes
n    d    Previous best    SLS
6    3    56               85
10   5    132              256
14   7    240              500
18   9    380              1200
20   10   1520             2193
Note: HD, GC constraints.

Stochastic Local Search for the DNA codes design problem - results
Comments:
• There are improvements over the previous best.
• The method is still extremely simple and intuitive [good quality in general but...]
• Is it possible to improve it with some refinement?
• Where should we work to refine the method?
IDEA: trying different neighbourhoods!

Improved Stochastic Local Search for the DNA codes design problem
• Combinatorial problem Π: DNA Word Design
• Problem instance π: DNA/quaternary code design [particular (n,d) combinations]
• Search space S(π): set of (code word) sets s
• Neighbourhood relation N(π): k-exchange + random based neighbourhoods (instead of simple 1-exchange)
• Initialization function init(π): random choice or predefined
• Step function step(π): chooses with probability p between best improvement and random walk
• Termination predicate terminate(π): a function depending on the number of iterations performed or on whether a solution has been found
Tulpan, D.C., Hoos, H.H. Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433 (2003).
Improved Stochastic Local Search for the DNA codes design problem
Neighbourhoods
Simple neighbourhoods
• k-exchange / k-point mutation neighbourhoods
• rotation-based neighbourhoods
• random neighbourhoods
Complex neighbourhoods
• 1-exchange / 1-point mutation + rotation neighbourhoods
• k-exchange / k-point mutation + random words neighbourhoods
• 1-exchange / 1-point mutation + rotations + random words neighbourhoods

Improved Stochastic Local Search for the DNA codes design problem
Simple neighbourhoods: v-exchange / v-point mutation neighbourhoods
Example: some of the codewords in the 2-exchange neighbourhood of CTA are: ACA GTT TTG TCA

Improved Stochastic Local Search for the DNA codes design problem
Simple neighbourhoods: rotation-based neighbourhoods
Applying the neighbourhood to a given codeword, we get the codewords obtained by shifting the input codeword to the right by 1 to n-1 positions.
Example: CTA => TAC, ACT

Improved Stochastic Local Search for the DNA codes design problem
Simple neighbourhoods: random neighbourhoods
Example: some of the codewords in the random neighbourhood of CTA are: CAA CTT TTC TCA

Improved Stochastic Local Search for the DNA codes design problem
Complex neighbourhoods
1-exchange + rotation neighbourhoods
v-exchange + random words neighbourhoods
1-exchange + rotations + random words neighbourhoods
• These neighbourhoods are obtained by applying all the neighbourhoods involved sequentially (repeated codewords have to be avoided)
• When rotation is involved, it is applied to all the codewords obtained by the neighbourhoods previously applied

Improved Stochastic Local Search for the DNA codes design problem
[Figure: algorithm scheme with the neighbourhood step highlighted] The difference is here!
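The simple neighbourhoods described above, and one hybrid built from them, can be sketched as follows. The function names are hypothetical, and the sketch makes two simplifying assumptions: `k_exchange` changes exactly k positions and does not filter the results by GC content, and the random neighbourhood is taken to be a list of uniformly random words of the same length.

```python
import random
from itertools import combinations, product

ALPHABET = "ACGT"


def k_exchange(word, k):
    """All words that differ from word in exactly k positions."""
    result = []
    for positions in combinations(range(len(word)), k):
        choices = [[c for c in ALPHABET if c != word[p]] for p in positions]
        for letters in product(*choices):
            chars = list(word)
            for p, c in zip(positions, letters):
                chars[p] = c
            result.append("".join(chars))
    return result


def rotations(word):
    """Words obtained by shifting word to the right by 1 to n-1 positions."""
    return [word[-s:] + word[:-s] for s in range(1, len(word))]


def random_words(n, size, rng):
    """Assumed random neighbourhood: size uniformly random words of length n."""
    return ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(size)]


def hybrid_neighbourhood(word, size, rng):
    """1-exchange + random words, applied sequentially, without repetitions."""
    seen, merged = {word}, []
    for cand in k_exchange(word, 1) + random_words(len(word), size, rng):
        if cand not in seen:
            seen.add(cand)
            merged.append(cand)
    return merged


two_exchange = k_exchange("CTA", 2)
shifted = rotations("CTA")
hybrid = hybrid_neighbourhood("CTA", 5, random.Random(0))
```

For CTA, `two_exchange` contains the four example words ACA, GTT, TTG and TCA, and `shifted` contains ACT and TAC. A GC-preserving variant would restrict each changed letter to the other base of the same class (A/T or C/G).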
Improved Stochastic Local Search for the DNA codes design problem - results
k-exchange Neighbourhoods
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
{1, 2, 3}-exchange neighbourhoods
[Figure: distribution of the number of iterations required to have a feasible solution for different v-exchange methods]
Comments:
• Using a larger neighbourhood seems to help but…
• The difference between 2-exchange and 3-exchange is not dramatic
• A larger neighbourhood means more time at each iteration…
Why is the size of the 1-exchange neighbourhood 16? Two words are considered, and the GC content has to be respected.

Improved Stochastic Local Search for the DNA codes design problem - results
k-exchange Neighbourhoods
Time for 1 iteration
Neighbourhood    CPU Time
1-exchange       .0017
2-exchange       .0088
3-exchange       .0314
[Figure: distribution of the CPU time required to have a feasible solution for different v-exchange methods]
Comments:
• 1-exchange is still the best in terms of run times => not what we hoped!

Improved Stochastic Local Search for the DNA codes design problem - results
Hybrid Randomized Neighbourhoods
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
random, hybrid neighbourhoods
[Figure: distribution of the number of iterations required to have a feasible solution for different hybrid neighbourhoods]
Comments:
• Pure random performs surprisingly well
• 1-exchange + random is however the best method

Improved Stochastic Local Search for the DNA codes design problem
All combinations of neighbourhoods together (usual benchmark)
[Figure: distribution of the number of iterations required to have a feasible solution for different neighbourhoods]
Comments:
• 1-exchange + rotation + random is the most promising combination in terms of number of iterations
• Methods including the random neighbourhood are definitely better

Improved Stochastic Local Search for the DNA codes design problem
Approximate CPU cost per iteration for all the combinations of neighbourhoods considered
Neighbourhood type                 Neighbourhood size    CPU Time [sec]
1-exchange                         16                    .002184
2-exchange                         72                    .008830
3-exchange                         184                   .031493
1-exchange + rotations             128                   .017294
random                             128                   .015100
1-exchange + random                16 + 112              .022889
2-exchange + random                72 + 112              .029167
3-exchange + random                184 + 112             .040833
1-exchange + rotations + random    128 + 100             .043333
Comments:
• 1-exchange + random is a good compromise between speed and quality of the solutions
• Let's now see what happens if we consider both the time spent on each iteration and the number of iterations required to converge… [next slide]

Improved Stochastic Local Search for the DNA codes design problem
All combinations of neighbourhoods together (usual benchmark)
[Figure: distribution of the CPU time required to have a feasible solution for different neighbourhoods]
Comments:
• Rotation is time consuming => methods with rotation are not so convenient anymore
• 1-exchange + random neighbourhood is by far the most promising combination in terms of CPU time

Improved Stochastic Local Search for the DNA codes design problem
[Figure: algorithm scheme with the randomized step highlighted] Is this randomized step still interesting?

Improved Stochastic Local Search for the DNA codes design problem
k = 70, n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
random, hybrid neighbourhoods
[Figure: number of iterations to have a feasible solution for different values of the randomizing parameter]
Comments:
• The randomized step is useless when the hybrid randomized neighbourhood is used!
• This happens because the neighbourhood already does the “random work”

88

Improved Stochastic Local Search for the DNA codes design problem

89

Improved Stochastic Local Search for the DNA codes design problem - results

Scaling of the Improved SLS
n = 8, d = 4
HD, RC, GC constraints
1000 successful runs
1-exchange, random, hybrid neighbourhoods

Comments:
• It is surprising how well the pure random neighbourhood scales up
• However, the 1-exchange + random neighbourhood is the best

90

Improved Stochastic Local Search for the DNA codes design problem

SLS Results and Analysis
• New bounds for DNA set sizes
• Improved SLS using various neighbourhoods

Combinatorial constraints: HD, RC, GC

Length (n)    Hamming dist. (d)    Existing Bounds (k)    Simple SLS (k)    Improved SLS (k)
4             3                    -                      5                 6
8             4                    108                    112*              128
10            5                    -                      127               158
12            6                    -                      210               240

[Tulpan et al., 2002] [Tulpan et al., 2003] [Frutos et al., 1997]

Thesis Contributions: C1 Development of novel optimization algorithms

91

Improved Stochastic Local Search for the DNA codes design problem

Conclusions
• Random neighbourhoods => increased SLS performance
• 1-exchange + random neighbourhood is the best combination
• Larger DNA codes have been obtained

92

Another Stochastic Local Search for the DNA codes design problem

• A different SLS algorithm has been presented in the literature.
• It can be seen as a Simulated Annealing algorithm without a cooling schedule (constant temperature).
• The current code L is always feasible.
• At each iteration a new (feasible) codeword s is added, and all the codewords of L that are not compatible with s are removed, leading to a new code L’.
• Code L’ is accepted with a certain probability depending on |L’| - |L| (the difference in the cardinalities of the two sets).

Chee, Y. M., Ling, S. Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394 (2008).
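The iteration described in the bullets above can be sketched as follows. This is only an illustration: the exact acceptance probability is not reproduced in these notes, so the Metropolis-style rule with a constant `temperature` is an assumption, and the names `compatible` and `sls_step` are hypothetical. The candidate codeword `s` is assumed to be feasible on its own (GC content and distance from its own reverse complement).

```python
import math
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x, y):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def reverse_complement(w):
    """Reverse the word and complement each base."""
    return "".join(COMPLEMENT[b] for b in reversed(w))


def compatible(x, y, d):
    """HD and RC compatibility of two distinct codewords (standard definitions assumed)."""
    return hamming(x, y) >= d and hamming(x, reverse_complement(y)) >= d


def sls_step(code, s, d, temperature, rng):
    """One iteration: add s, drop the words that conflict with it, then accept or reject.

    Assumption: the new code is accepted when it is not smaller, otherwise
    with probability exp((|L'| - |L|) / temperature), temperature constant.
    """
    candidate = [w for w in code if compatible(w, s, d)] + [s]
    delta = len(candidate) - len(code)
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return candidate
    return code


if __name__ == "__main__":
    rng = random.Random(0)
    print(sls_step(["AAGG", "CCAA"], "AGAG", 2, 1.0, rng))
```

With a very small temperature the step only accepts codes that are at least as large as the current one; larger temperatures let the search lose codewords temporarily in order to escape local optima.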
93

Another Stochastic Local Search for the DNA codes design problem

Figure: pseudocode of the algorithm, annotated with the following labels:
• Max number of iterations
• Code
• Target number of codewords (k before)
• Set of incompatible codes
• Acceptance probability of the new code:

94

Another Stochastic Local Search for the DNA codes design problem

Improvements over previous bests in the literature (theoretical methods, other SLSs and a few more)
HD, GC and RC constraints

95

Stochastic Local Searches for the DNA codes design problem

• Different methods based on a similar idea lead to very different codes
• No method dominates the others
• The methods seem to explore the search space in different ways
• Is it possible to combine the good properties of (some of) the different approaches into a single method?

96

Outline
• Introduction
• The DNA Codes Design problem
• Approaches in the literature
• Construction heuristics
• Simple local searches
• Metaheuristics
  – Intro to Stochastic Local Search
  – Applications to the DNA codes design problem
  – Intro to Variable Neighbourhood Search
  – Applications to the DNA codes design problem
• Bibliography

97

VNS

98

VNS

99

VNS

100

VNS

101

VNS

102

Outline
• Introduction
• The DNA Codes Design problem
• Approaches in the literature
• Construction heuristics
• Simple local searches
• Metaheuristics
  – Intro to Stochastic Local Search
  – Applications to the DNA codes design problem
  – Intro to Variable Neighbourhood Search
  – Applications to the DNA codes design problem
• Future research
• Acknowledgment
• Bibliography

103

A VNS algorithm for DNA codes design

A primitive Variable Neighbourhood Search (VNS) algorithm is introduced.
It iteratively runs, in turn, the local search algorithms (basic ingredients) seen before.
The reference solution for the local searches is always the best solution found so far.
This is a Variable Neighbourhood Descent!

Montemanni, R., Smith, D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009).
Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear).

104

A VNS algorithm for DNA codes design

Methods involved in our implementation

105

A VNS algorithm for DNA codes design

• We hope to take advantage of the different philosophies behind the local search methods listed before
• From previous experiments we know that the basic local searches visit the search space in different ways
• We hope the basic local searches will help each other to escape from local minima within a VNS framework

106

Experimental comparison of some of the heuristic algorithms

Experimental settings
• Methods coded in ANSI C
• Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines
• Maximum computation time: 10'000 seconds (2.8 hours)
• Statistics over 5 runs for each problem/method combination

A_4^{Cstrs}(5,3,2) identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]

Montemanni, R., Smith, D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).
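The Variable Neighbourhood Descent scheme introduced earlier (local searches run in turn, always starting from the best code found so far) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the `max_rounds` parameter and the stopping rule (stop after a full round without improvement) are hypothetical, and each local search is modelled as a function that takes a code and returns a code.

```python
def variable_neighbourhood_descent(initial, local_searches, max_rounds=100):
    """Run the local searches in turn, always restarting from the best code so far.

    Assumptions: a code is a list of codewords, a larger code is better, and
    each element of local_searches maps a feasible code to a feasible code.
    """
    best = list(initial)
    for _ in range(max_rounds):
        improved = False
        for search in local_searches:
            candidate = search(best)
            if len(candidate) > len(best):
                # The improved code becomes the reference solution for the next search.
                best = list(candidate)
                improved = True
        if not improved:
            # No local search could enlarge the code: common local optimum reached.
            break
    return best


if __name__ == "__main__":
    # Toy local searches standing in for Seed Building, Clique Search, etc.
    grow_to_two = lambda c: c + ["x"] if len(c) < 2 else c
    grow_to_three = lambda c: c + ["y"] if len(c) < 3 else c
    print(variable_neighbourhood_descent([], [grow_to_two, grow_to_three]))
```

In the toy run, the second search keeps improving after the first one has stalled, which is the behaviour the VNS framework hopes to exploit when the basic methods explore the search space differently.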
107

Experimental comparison of some of the heuristic algorithms

• SB = Seed Building
• CS = Clique Search
• HS = Hybrid Search
• IGS = Iterated Greedy Search
• VNS = Variable Neighbourhood Search

108

Experimental comparison of some of the heuristic algorithms

• SB = Seed Building
• CS = Clique Search
• HS = Hybrid Search
• IGS = Iterated Greedy Search
• VNS = Variable Neighbourhood Search

109

Experimental comparison of some of the heuristic algorithms

Comments
• No clear ranking is possible among the basic methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search (as seen before…)
  ⇒ The methods are likely to represent different neighbourhoods
• Variable Neighbourhood Search clearly dominates the other methods
  ⇒ VNS takes advantage of the different neighbourhoods
  ⇒ VNS is likely to be competitive against all the other methods!

110

Experimental results of VNS

The VNS algorithm discussed in:
• Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326.
is compared with the methods discussed in the following 6 papers [which provide all the best known codes]:
• Li, M., Lee, H. J., Condon, A. E., and Corn, R. M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.
• Tulpan, D. C., Hoos, H. H., and Condon, A. E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, 2568, 229-241.
• Tulpan, D. C. and Hoos, H. H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, 2671, 418-433.
• King, O. D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33.
• Gaborit, P. and King, O. D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113.
• Chee, Y. M. and Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394.

Categories marked on the slide: Reference algorithm; Theor. Constructions; Heuristic Algorithms

111

Experimental results of VNS

Experimental settings
• Methods coded in ANSI C
• Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines
• Maximum computation time: 100'000 seconds (27.8 hours) => comparable with that of other heuristic algorithms
• Best over 5 runs for each problem/method combination

Montemanni, R., Smith, D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

112

Experimental results of VNS

• We will consider 254 problems with
  - 4 ≤ n ≤ 20
  - 3 ≤ d ≤ n
  - Case 1: HD and GC constraints
  - Case 2: HD, RC and GC constraints
• These settings match those of the state-of-the-art tables maintained at http://llama.med.harvard.edu/~king/dnacodes.html by O.D. King (last checked November 2009)
• We left out problems corresponding to very large codes (the current VNS algorithm cannot tackle them)

113

Experimental results of VNS

• Over the 254 problems considered:
  • in 128 cases the best known result is matched
  • in 52 cases a new best result is found

Montemanni, R., Smith, D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

114

Detailed results of VNS

115

Detailed results of VNS

116

Detailed results of VNS

117

Detailed results of VNS

118

Experimental results of VNS

• After the publication of the paper we have been improving the VNS algorithm in many ways (work still in progress!)
• Over the 254 problems considered:
  • in 132 cases (previously 128) the best known result is matched
  • in 87 cases (previously 52) a new best result is found
• We miss the best known solution in only 13.8% of the cases (35 out of 254)!
• We feel there is room for further improvements…

Montemanni, R., Smith, D.H. Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference (2009).
Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear).

119

Detailed results of VNS

Comments
• VNS works (slightly) better on problems with RC constraints
• This result is also confirmed by our latest improved implementations
• Is this because the other methods are more competitive without RC constraints?
  YES => we may not have much chance to improve on problems without RC constraints
  NO => we probably have a chance to improve on problems without RC constraints
  => Worth investigating!

120

Outline
• Introduction
• Real applications
• The DNA Codes Design problem
• Approaches in the literature
• Construction heuristics
• Simple local searches
• Metaheuristics
  – Intro to Stochastic Local Search
  – Applications to the DNA codes design problem
  – Intro to Variable Neighbourhood Search
  – Applications to the DNA codes design problem
• Future research
• Acknowledgment
• Bibliography

121

Essential bibliography (1/4)

[HEUR] => Heuristics-related publication.

Brenner, S., Lerner, R.A. (1992). Encoded combinatorial chemistry. Proceedings of the National Academy of Sciences USA, 89, 5381-5383.
Adleman, L. (1994). Molecular computation of solutions to combinatorial problems. Science, 266, 1021-1024.
Frutos, A.G., Liu, Q., Thiel, A.J., Sanner, A.M.W., Condon, A.E., Smith, L.M., Corn, R.M. (1997). Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research, 25, 4748-4757.
Hansen, P., Mladenovic, N. (2001). Variable neighbourhood search: principles and applications. European Journal of Operational Research, 130, 449-467. [HEUR]
Marathe, A., Condon, A.E., Corn, R.M. (2001). On combinatorial DNA word design. Journal of Computational Biology, 8, 201-219.
Arita, M., Kobayashi, S. (2002). DNA sequence design using templates. New Generation Computing, 20, 263-277.

122

Essential bibliography (2/4)

Li, M., Lee, H.J., Condon, A.E., Corn, R.M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.
Tulpan, D.C., Hoos, H.H., Condon, A.E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241. [HEUR]
Tulpan, D.C., Hoos, H.H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433. [HEUR]
King, O.D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33. [HEUR]
Kobayashi, S., Konto, T., Arita, M. (2003). On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214.
Hoos, H.H., Stuetzle, T. (2004). Stochastic Local Search: foundations and applications. Morgan Kaufmann/Elsevier. [HEUR]

123

Essential bibliography (3/4)

Gaborit, P., King, O.D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113. [HEUR]
Tulpan, D.C. (2006). Effective heuristic methods for DNA strand design. PhD thesis, University of British Columbia. [HEUR]
King, O.D. (2006). Tables of lower bounds for DNA codes with constant GC-content. http://llama.med.harvard.edu/~king/dnacodes.html, last checked: November 2009. [HEUR]
Chee, Y.M., Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394. [HEUR]
Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326. [HEUR]
Montemanni, R., Smith, D.H. (2009).
Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656. [HEUR]
Montemanni, R., Smith, D.H. (2009). Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference. [HEUR]

124

Essential bibliography (4/4)

Montemanni, R., Smith, D.H., Koul, N. (2010). Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer. [HEUR]
Tulpan, D., Montemanni, R., Ghiggi, A. (2010). Computational Sequence Design Techniques for DNA Microarray Technologies. Submitted for publication. [HEUR]
Ghiggi, A. (2010). DNA strand design with thermodynamic constraints. Master thesis, USI. [HEUR]
Koul, N. (2010). Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI. [HEUR]
Neelakandan, I. (2010). New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI. [HEUR]

125

Exercises 1

1. We have the following code with n=4:
   CGTA GGAA AATG TAGA
   a. Does it respect the GC-content constraint?
   b. Does it respect the Hamming distance constraint for a DNA codes design problem with d=2?
   c. Does it respect the Reverse Complement Hamming distance constraint for a DNA codes design problem with d=2?
2. Given the settings n=4, d=3 and constraints HD, GC, RC, consider the following code:
   AACC CAGT GAAG TCCT TGAC
   a. Is it feasible?
   b. Can it be extended?

126

Exercises 2

1. Given the settings n=3, d=2 and constraints HD, GC, show an execution of the Construction Heuristic working on top of the inverse lexicographic order.
2. Given the settings n=2, d=1 and constraints HD, GC, RC, show an execution of the Construction Heuristic working on top of the lexicographic order.
3. Given the settings n=3, d=2, constraints HD, GC, RC, and the following partial code:
   CTT CAA TGT GTA
   show an iteration of the Clique Search algorithm.
4. Given the settings n=4, d=2, constraints HD, GC, RC, and the following code:
   TGGT GACC CGAA TCTC CGTT
   calculate its measure of infeasibility Inf(W) according to the definition given in slide 75 (Iterative Greedy Search).

127

Exercises 3

1. Write the rotation neighbourhood of codeword CATGA.
2. Write 5 of the codewords of the 3-exchange neighbourhood of codeword CATGA.
5. Write 5 of the codewords of the random neighbourhood of codeword CATGA.
6. Write 5 of the codewords of the 2-exchange + random neighbourhood of codeword CATGA.
7. Consider the SLS method described from slide 119 on, with input parameters n=4, d=3, and constraints HD, GC, RC. At a given iteration we have the following code L:
   CTTC GGTT GTCA AGGA ACTG TTGG
   and the selected random codeword is TTGC. Write down code L’ (we do not care whether it will be accepted or not).

128
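Answers to the feasibility questions in the exercises can be checked mechanically. The sketch below assumes the standard definitions of the three constraints: HD requires a Hamming distance of at least d between every pair of distinct codewords, RC requires a Hamming distance of at least d between every codeword and the reverse complement of every codeword (itself included), and GC requires exactly floor(n/2) bases of each codeword to be G or C. All function names are hypothetical, and the example code in the usage section is not taken from the exercises.

```python
from itertools import combinations

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x, y):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def reverse_complement(w):
    """Reverse the word and complement each base (A<->T, C<->G)."""
    return "".join(COMPLEMENT[b] for b in reversed(w))


def gc_content(w):
    """Number of G and C bases in the word."""
    return sum(b in "GC" for b in w)


def satisfies_gc(code, n):
    """GC constraint: every codeword has exactly floor(n/2) G/C bases."""
    return all(gc_content(w) == n // 2 for w in code)


def satisfies_hd(code, d):
    """HD constraint: every pair of distinct codewords differs in at least d positions."""
    return all(hamming(x, y) >= d for x, y in combinations(code, 2))


def satisfies_rc(code, d):
    """RC constraint: every codeword is at distance >= d from the reverse
    complement of every codeword, including its own."""
    return all(hamming(x, reverse_complement(y)) >= d for x in code for y in code)


if __name__ == "__main__":
    code = ["AAGG", "CCAA"]
    print(satisfies_gc(code, 4), satisfies_hd(code, 4), satisfies_rc(code, 2))
```

Note that a word equal to its own reverse complement, such as ACGT, can never belong to a code with the RC constraint and d ≥ 1, which is why the check in `satisfies_rc` also compares each word with itself.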