CS/BF549 Pattern Matching and Pattern Detection
Fall 2008

Room: KCB 102
Meeting Time: T, Th 3:30-5:00 pm
Instructor: Gary Benson
Office: Rm 903, 24 Cummington St.
Phone: 617-358-2965
Email: gbenson@bu.edu
Office Hours: W 4:00-5:00 or by appointment

Grading:
    Homework (assigned each week): 60%
    Programming project: 20%
    Final: 15%
    Class participation: 5%

Project Due: Dec. 5
Final Exam (take home) Due: Dec. 18

Policy on Academic Honesty

Except as otherwise noted, all homework, projects, and take-home tests are to represent individual effort, and are to be written up and turned in individually. In-class tests are to be taken without notes, other aids, or reference to another student's work. Violations of these policies will result in a failing grade for the assignment or test. Violations of academic honesty which exceed the purview of this class will be referred to the Dean of Students.

Approximate Week-by-Week Syllabus

Week 1: Exact Pattern Matching
    Introduction
    Naïve algorithm
    Finite Automata

Week 2: Exact Pattern Matching
    Knuth-Morris-Pratt (KMP) automaton
    KMP text scan algorithm
    KMP failure table
    Readings:
        S. Baase. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley, 2nd edition, 1988, pp. 212-219.

Week 3: Exact Pattern Matching
    Boyer-Moore (BM) algorithm
    Finding all cyclic shifts of a pattern in a text
    Substring-Prefix oracle
    Readings:
        D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997, pp. 7-9, 16-23.

Week 4: Exact Pattern Matching
    Randomized Pattern Matching (Karp-Rabin)
    Readings:
        R. Karp and M. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Journal of Research and Development, 31:249-260, 1987.

Week 5: Exact Pattern Matching
    Aho-Corasick Dictionary Matching Algorithm
    Fail links
    Readings:
        D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997, pp. 52-59.

Week 6: Exact Pattern Matching
    Introduction to Suffix Trees
    O(m^3) time construction
    Suffix links and O(m^2) time construction
    Readings:
        D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997, pp. 94-103.

Week 7: Exact Pattern Matching
    Suffix trees O(m) time construction
    Readings:
        D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997, pp. 103-107.

Week 8: Exact Pattern Matching
    Suffix arrays
    Readings: TBA

Week 9: Approximate Pattern Matching
    Introduction to Sequence Alignment
    Common mutations, distance and similarity scoring
    Longest common subsequence problem
    Readings:
        G. Benson. An Introduction to Computational Biology. Lecture Notes, 2001, pp. 1-18.

Week 10: Approximate Pattern Matching
    Local and Global Alignment
    Readings:
        G. Benson. An Introduction to Computational Biology. Lecture Notes, 2001, pp. 19-28.

Week 11: Approximate Pattern Matching
    Tandem Alignment: Wraparound Dynamic Programming
    Alignment of minisatellite maps
    Readings:
        W. Miller and E. Myers. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51:5-37, 1989.
        G. Benson. An Introduction to Computational Biology. Lecture Notes, 2001, pp. 28-35.
        Berard, Nicolas, Buard, Gascuel, Rivals. A fast and specific alignment method for minisatellite maps.

Week 12: Approximate Pattern Matching
    Heuristic Matching: BLAT
    Readings:
        W.J. Kent. BLAT – the BLAST-like alignment tool. Genome Research, 12:656-664, 2002.

Week 13: Approximate Pattern Matching
    Seeds for heuristic matching
    Spaced seeds
    Indel seeds
    Readings:
        B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18:440-445, 2002.
        D. Mak and G. Benson. All Hits All the Time: Parameter Free Calculation of Seed Sensitivity. Proceedings of the Fifth Asia-Pacific Bioinformatics Conference (APBC 2007), 2007.
        D. Mak, Y. Gelfand, and G. Benson. Indel Seeds for Homology Search. Proceedings of the 14th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2006), Bioinformatics, 22(14):e341-e349, 2006.

Week 14: Pattern Detection
    Multiple Short Word Methods
    Tandem Repeats Finder
    Inverted Repeats Finder
    Readings:
        G. Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27:573-580, 1999.

Week 15: Pattern Detection
    Pattern Enumeration Methods
    Readings:
        I. Jonassen, J. Collins, and D. Higgins. Finding flexible patterns in unaligned protein sequences. Protein Science, 4:1587-1595, 1995.
        A. Neuwald and P. Green. Detecting patterns in protein sequences. Journal of Molecular Biology, 239:698-712, 1994.

Additional readings come from the following books:
    T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
    W. Feller. An Introduction to Probability Theory and Its Applications, volume I. John Wiley & Sons, 3rd edition, 1968.