
CS/BF549
Pattern Matching and Pattern Detection
Fall 2008
Room: KCB 102
Meeting Time: T, Th 3:30 – 5:00 pm
Instructor: Gary Benson
Office: Rm 903, 24 Cummington St.
Phone: 617-358-2965
Email: gbenson@bu.edu
Office Hours: W 4:00-5:00 or by appointment
Grading:
Homework (assigned each week): 60%
Programming project: 20%
Final: 15%
Class Participation: 5%
Project Due: Dec. 5
Final Exam (take home) Due: Dec. 18
Policy on Academic Honesty
Except as otherwise noted, all homework, projects, and take-home tests are to represent individual
effort, and are to be written up and turned in individually. In-class tests are to be taken without
notes, other aids, or reference to another student’s work. Violations of these policies will result in
a failing grade for the assignment or test. Violations of academic honesty which exceed the purview
of this class will be referred to the Dean of Students.
Approximate Week-by-Week Syllabus
Week 1: Exact Pattern Matching
• Introduction
• Naïve algorithm
• Finite Automata
Week 2: Exact Pattern Matching
• Knuth – Morris – Pratt (KMP) automaton
• KMP text scan algorithm
• KMP failure table
Readings:
S. Baase. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley,
2nd edition, 1988, pp 212 – 219.
Week 3: Exact Pattern Matching
• Boyer – Moore (BM) algorithm
• Finding all cyclic shifts of a pattern in a text
• Substring – Prefix oracle
Readings:
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and
computational biology. Cambridge University Press, 1997, pp 7 – 9, 16 – 23.
Week 4: Exact Pattern Matching
• Randomized Pattern Matching (Karp – Rabin)
Readings:
R. Karp and M. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Journal
of Research and Development. 31:249-260, 1987.
Week 5: Exact Pattern Matching
• Aho – Corasick Dictionary Matching
• Algorithm
• Fail links
Readings:
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and
computational biology. Cambridge University Press, 1997, pp 52 – 59.
Week 6: Exact Pattern Matching
• Introduction to Suffix Trees
• O(m^3) time construction
• Suffix links and O(m^2) time construction
Readings:
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and
computational biology. Cambridge University Press, 1997, pp 94 – 103.
Week 7: Exact Pattern Matching
• Suffix trees O(m) time construction
Readings:
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and
computational biology. Cambridge University Press, 1997, pp 103 – 107.
Week 8: Exact Pattern Matching
• Suffix arrays
Readings:
TBA
Week 9: Approximate Pattern Matching
• Introduction to Sequence Alignment
• Common mutations, distance and similarity scoring
• Longest common subsequence problem
Readings:
G. Benson. An Introduction to Computational Biology. Lecture Notes, 2001, pp 1 – 18.
Week 10: Approximate Pattern Matching
• Local and Global Alignment
Readings:
G. Benson. An Introduction to Computational Biology. Lecture Notes, 2001, pp 19 – 28.
Week 11: Approximate Pattern Matching
• Tandem Alignment – Wraparound Dynamic Programming
• Alignment of minisatellite maps
Readings:
W. Miller and E. Myers. Approximate matching of regular expressions. Bulletin of
Mathematical Biology, 51:5-37, 1989.
G. Benson. An Introduction to Computational Biology. Lecture Notes, 2001, pp 28 – 35.
Berard, Nicolas, Buard, Gascuel, Rivals, A fast and specific alignment method for
minisatellite maps
Week 12: Approximate Pattern Matching
• Heuristic Matching – BLAT
Readings:
W.J. Kent. BLAT – the BLAST-like alignment tool. Genome Research, 12:656-664,
2002.
Week 13: Approximate Pattern Matching
• Seeds for heuristic matching
• Spaced seeds
• Indel seeds
Readings:
B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search.
Bioinformatics, 18:440-445, 2002.
D. Mak, G. Benson, All Hits All the Time: Parameter Free Calculation of Seed Sensitivity,
Proceedings of the Fifth Asia-Pacific Bioinformatics Conference (APBC 2007), 2007.
D. Mak, Y. Gelfand, and G. Benson, Indel Seeds for Homology Search, Proceedings of
the 14th Annual International Conference on Intelligent Systems for Molecular Biology
(ISMB 2006), Bioinformatics, 22(14):e341-e349, 2006.
Week 14: Pattern Detection
• Multiple Short Word Methods
• Tandem Repeats Finder
• Inverted Repeats Finder
Readings:
G. Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids
Research, 27:573-580, 1999.
Week 15: Pattern Detection
• Pattern Enumeration Methods
Readings:
I. Jonassen, J. Collins and D. Higgins. Finding flexible patterns in unaligned protein
sequences. Protein Science, 4:1587-1595, 1995.
A. Neuwald and P. Green. Detecting patterns in protein sequences. Journal of
Molecular Biology, 239:698-712, 1994.
Additional Readings come from the following books:
T. Cormen, C. Leiserson and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
W. Feller. An introduction to probability theory and its applications, volume I. John
Wiley & Sons, 3rd edition, 1968.