Prof. Yefim Dinitz
Course number: 202.1.8611
Mandatory for Bioinformatics program
Credits: 4
Course Site: http://www.cs.bgu.ac.il/~oma121
Prerequisites:
202.1.2041 Design of Algorithms
201.1.0131 Probability Theory 1
The course serves as a theoretical introduction to discrete algorithms for Bioinformatics. Its main objective is to teach the basic algorithmic techniques used for string problems in Bioinformatics.
Students will acquire a basic understanding of the techniques that allow building fast, linear-time string algorithms. The analysis of the classic pattern matching algorithms KMP and BM will be combined with the more recent Z-algorithm approach, which significantly simplifies the preprocessing for these algorithms. The probabilistic method of the KR algorithm and the suffix tree approach will complete the basic picture of fast algorithms. Various applications will show the power of suffix tree techniques.
Students will learn a variety of applications of the main algorithmic tool of discrete bioinformatics: dynamic programming. The basic problem is finding either the edit distance or the optimal alignment of two given strings. Students will see solutions to a few generalizations of the latter problem, each one supported by a special technique. The main approaches to multiple string alignment will be learned. Students will study special techniques for saving time and space in the above algorithms.
Students will acquire basic knowledge of the Branch-and-Cut approach for handling problems of huge size. This approach will be exemplified by finding the optimal multiple string alignment and by generating all sub-optimal alignments of two strings.
Students will learn the Hidden Markov Model (HMM) technique, which takes into account the interrelation of neighboring symbols in a sequence. It will be exemplified by detecting CpG islands in a DNA sequence.
Students will learn to solve non-trivial algorithmic problems by themselves while preparing five solid theoretical homework assignments.
1. 26 two-hour lectures.
2. 4-5 theoretical (and sometimes technical) assignments. Submission: alone or in pairs. The expected load is about 20 hours per assignment. Every assignment grade is required to be passing, with the possible exception of one.
3. 3-4 quizzes, 30-40 minutes each.
4. Final theoretical exam, 3.5 hours long. No auxiliary material is allowed, except for a single two-sided A4 sheet of paper with text, prepared and brought by each student.
5. Grade computation:
5.1 In general, the exam grade E is 75%, the assignments grade A is 15%, and the quizzes grade Q is 10%.
5.2 A is the average of the assignment grades, where half of the worst one is not included in the average.
5.3 Q is the average of the quiz grades, where the worst one is not included in the average.
5.4 If the class work grade C=(0.73 E + 0.12 A)/0.85 is less than 56, C is the final grade. That is, in this case the assignment grade is not included in the computation.
1.1. The basic problem of finding all occurrences of a pattern P in a text T, a naïve algorithm for solving it, and first ideas for saving running time.
1.2. Knuth-Morris-Pratt algorithm (KMP): its main loop, based on a certain preprocessing of P, its correctness, and its running time bound.
1.3. The Z-algorithm of Gusfield and its running time bound. Its use for solving the basic problem and for the preprocessing for KMP.
1.4. Boyer-Moore algorithm (BM): the bad character rule for achieving sublinear time in most cases, and the preprocessing for it. The good suffix rule for achieving a linear time bound in the worst case, and the preprocessing for it using the Z-algorithm.
1.5. The probabilistic Karp-Rabin algorithm (KR), with one-sided error. Bounds for its error probability. Converting KR from a Monte Carlo to a Las Vegas type algorithm.
1.6. The suffix tree definition and its data structure. The time-space trade-off for finding a pattern in a suffix tree.
Linear time algorithms using suffix trees:
1.6.1 Finding all occurrences of a pattern P and of a group of patterns in a text T, using the suffix tree of T.
1.6.2 Finding a longest common substring of a few given strings.
1.6.3 Finding all maximal repeats in a given string.
1.6.4 Finding all maximal palindromes in a given string.
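As an illustration of topic 1.3, the following is a minimal Python sketch of the Z-algorithm and of its use for the basic matching problem. It is not taken from the course material: the function names `z_array` and `find_occurrences`, the 0-based indexing, the convention `z[0] = len(s)`, and the separator character are all assumptions made for this example. The separator is assumed to occur in neither P nor T.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that is a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # convention chosen for this sketch
    left = right = 0  # s[left:right] is the rightmost prefix match found so far
    for i in range(1, n):
        if i < right:
            # Reuse the value already computed for the mirrored position.
            z[i] = min(right - i, z[i - left])
        # Extend the match by explicit comparisons.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def find_occurrences(pattern, text, sep="\x00"):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    m = len(pattern)
    s = pattern + sep + text  # sep is assumed absent from both strings
    z = z_array(s)
    return [i - m - 1 for i in range(m + 1, len(s)) if z[i] == m]
```

For example, `find_occurrences("aba", "ababa")` reports the overlapping occurrences at positions 0 and 2. Each character comparison in the inner loop either ends an iteration with a mismatch or advances `right`, which is the reason for the linear time bound.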
2.1. The basic edit distance and optimal alignment problems.
2.2. Their solution via Dynamic Programming.
2.3. Variants of the basic problem, up to the local alignment problem.
2.4. Optimal alignment with gaps, and how to find it. An efficient algorithm for the affine gap cost case.
2.5. Hirschberg algorithm for space saving.
2.6. Representation of string groups: profile, consensus, signature. Operations with profiles.
2.7. Finding optimal multiple string alignment with the sum-of-pairs goal function.
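Topics 2.1 and 2.2 can be illustrated by a short Python sketch of the standard dynamic programming table for edit distance, followed by a traceback that recovers one optimal alignment. The unit costs for insertion, deletion, and substitution, the `-` gap symbol, and the function names are assumptions of this example, not choices stated in the syllabus.

```python
def edit_distance_table(a, b):
    """d[i][j] = edit distance between a[:i] and b[:j], with unit costs."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match_or_sub = d[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            d[i][j] = min(match_or_sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def optimal_alignment(a, b):
    """Return (distance, aligned_a, aligned_b) for one optimal alignment."""
    d = edit_distance_table(a, b)
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])  # match or substitution
            i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-")       # deletion from a
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])       # insertion into a
            j -= 1
    return d[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))
```

The table uses space proportional to the product of the two string lengths. Reducing this to linear space while still recovering an alignment is the purpose of the Hirschberg algorithm in topic 2.5, which this sketch does not implement.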
3.1. A time-saving method for finding an optimal alignment of several strings.
3.2. The general cutting scheme based on the rule: f(S)+g(S) ≤ z, where S is the current partial solution, f(S) its gain, g(S) the optimistic estimation of the gain of completing S, and z is a pre-given (not bad) feasible solution.
3.3. Using the Branch-and-Cut method for finding all ∆-optimal paths in a graph, in time linear in the input and output size.
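The idea behind topic 3.3 can be sketched in Python as follows. This is an illustrative reading of the cutting scheme, not the course's own formulation: it assumes a directed acyclic graph with non-negative edge costs, it minimizes cost rather than maximizing gain (so the cut test is reversed accordingly), and the names `near_optimal_paths`, `remaining`, and `extend` are hypothetical. Here f is the cost of the partial path, g is the exact cheapest cost of completing it, and a branch is cut when f + g exceeds the optimal cost plus ∆.

```python
import math


def near_optimal_paths(graph, source, target, delta):
    """List all source-to-target paths whose cost is at most optimum + delta.

    graph maps each node to a list of (neighbor, cost) pairs and is assumed
    to be a DAG with non-negative costs.
    """
    best_to_target = {}

    def remaining(v):
        # Cheapest cost from v to target, memoized (the estimate g).
        if v == target:
            return 0
        if v not in best_to_target:
            best_to_target[v] = min(
                (c + remaining(w) for w, c in graph.get(v, [])),
                default=math.inf,
            )
        return best_to_target[v]

    paths = []
    if remaining(source) == math.inf:
        return paths  # target is unreachable
    bound = remaining(source) + delta

    def extend(path, cost):
        v = path[-1]
        if v == target:
            paths.append((cost, list(path)))
            return
        for w, c in graph.get(v, []):
            # Cut: even the best completion through w would exceed the bound.
            if cost + c + remaining(w) <= bound:
                path.append(w)
                extend(path, cost + c)
                path.pop()

    extend([source], 0)
    return paths
```

Because the estimate is exact in this setting, every branch that survives the test leads to at least one reported path, so no work is spent on dead ends.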
4.1. The problem of finding CpG islands in a DNA sequence.
4.2. Recognizing an excerpt of a DNA sequence as a CpG island, using a Markov Model.
4.3. Hidden Markov Model for CpG islands.
4.4. The Viterbi algorithm for finding an optimal path in an HMM.
4.5. The forward-backward algorithm for finding the posterior probabilities.
4.6. The Baum-Welch algorithm for estimating the parameter values.
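Topic 4.4 can be illustrated with a compact Python sketch of the Viterbi algorithm in log space. The two-state model used in the usage note below (an island state `I` and a background state `B`) and all of its probabilities are invented for illustration only; the CpG island model treated in the course literature is richer. The sketch assumes a non-empty observation sequence and strictly positive probabilities, so that every logarithm is defined.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability)."""
    # score[t][s] = best log probability of any path ending in state s at time t
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + math.log(trans_p[p][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][last]


# Toy parameters, assumed for this example only.
STATES = ["I", "B"]
START = {"I": 0.5, "B": 0.5}
TRANS = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
EMIT = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
```

With these toy parameters, `viterbi("AAAACGCGCGCG", STATES, START, TRANS, EMIT)` labels the four leading `A` symbols as background and the remaining symbols as island. Working with logarithms avoids numerical underflow on long sequences.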
D. Gusfield. Algorithms on Strings, Trees, and Sequences (QA 76.9.A43G87 1999 in the Aranne library)
Part 1, up to 1.5: Chapters 1, 2, 4.4
Part 1.6: Chapters 5, 6.4, 6.5, 7.1-7.6, 7.12-7.12.1, 9.1, 9.2
Part 2: Chapters 11, 12.1, 14.3-14.3.2, 14.5
Part 3: Chapters 14.6-14.6.1, 13.2
Durbin et al. Biological Sequence Analysis (QP 620 B576 in the Aranne library)
Part 4: Chapters 3-3.3 and 3.6