CS 696 Programming Problems in Bioinformatics
Fall 2014

Credits: 3 units
Contact Hours: Monday and Wednesday 1730-1845
Instructors: Robert Edwards
Office: GMCS 536
Email: redwards@mail.sdsu.edu
Office Hours: Mondays and Wednesdays 1530 – 1700 (and by appointment)

Course Materials
1. CS 696 lecture notes/slides (available on Blackboard)
2. Supplementary textbooks:
   i. Bioinformatics Algorithms: An Active Learning Approach by Compeau and Pevzner. ISBN: 0990374602
   ii. Essential Bioinformatics by Jin Xiong. ISBN: 0521600820

Course Information for CS 696

Description from the Official Course Catalog
Methods for analysis of biological data using computational techniques and commonly used scripting languages (perl, python, etc.). Discussion, coding, and complexity analysis of different algorithms and types of problems.

Prerequisites: Computer Science 310, Bioinformatics and Medical Informatics 668
Course Type: Selected elective course in the program

Specific Goals for CS 696

Course-Level Student Learning Outcomes
1. Ability to write code to solve a biological problem
2. Ability to review and discuss code written in different languages
3. Ability to analyze algorithmic complexity

Relationship to CS Program Course Outcomes
CS 696 addresses the following CS Program course outcomes:
a) An ability to apply knowledge of computing and mathematics appropriate to the program’s student outcomes and to the discipline
b) An ability to analyze a problem, and identify and define the computing requirements appropriate to its solution
c) An ability to design, implement, and evaluate a computer-based system, process, component, or program to meet desired needs
d) Recognition of the need for and an ability to engage in continuing professional development
e) An ability to use current techniques, skills, and tools necessary for computing practice.
f) An ability to apply mathematical foundations, algorithmic principles, and computer science theory in the modeling and design of computer-based systems in a way that demonstrates comprehension of the tradeoffs involved in design choices.
g) An ability to apply design and development principles in the construction of software systems of varying complexity.

Topics Covered
The following topics are covered in CS 696:
1. Algorithms for counting k-mers
2. Algorithms for DNA statistics
3. Alignment-based phylogenies
4. Binary trees
5. Edit distance between two strings
6. Efficient substring matches
7. Fibonacci sequences
8. Global pairwise sequence alignments
9. Greedy searches
10. Introduction to bioinformatics
11. Local sequence alignments
12. Longest multiple repeats
13. Longest substrings in a sequence
14. Motif searching
15. Pattern matching and random strings
16. Phylogenetic trees
17. Public databases
18. Sequence data file formats
19. Suboptimal sequence alignments
20. Suffix tries
21. Transcription and translation

Course Schedule
Week  Topic
1     Introduction, syllabus, announcements, programming languages, introduction to bioinformatics
2     Algorithms for DNA statistics, transcription, and translation
3     Algorithms for counting k-mers. Efficient nucleotide counting
4     Pattern matching and random strings, Fibonacci sequences
5     Motif searching, pattern matching, and greedy searches
      September 22nd. Assignment 1 Due: Describe the efficiency of the k-mer counting approaches that you have tried. What is the most efficient way to do it?
6     Longest substrings in a sequence, longest multiple repeats
7     Public databases and sequence data file formats
8     Local sequence alignments and edit distances
9     Global pairwise sequence alignments and multiple string alignments
10    Suboptimal sequence alignments
      November 2nd at the end of the day (midnight). Assignment 2 Due: Describe and discuss different approaches for pairwise and multiple sequence alignment.
11    Phylogenetic trees, binary trees, and suffix tries
12    Alignment-based phylogenies and binary trees
13    Efficient substring matches and fast searches: restriction sites
14    Review
      December 10th. Final Report is due.

Major assignments

Assignment 1
For this assignment you will write a paper discussing k-mer counting. Identifying the short substrings (k-mers) in DNA sequences is an important bioinformatics application and is widely used in different areas of our research. It is also computationally intensive, as we often need to count the k-mers present in very large numbers of very short sequences. In addition, once k gets sufficiently large (above 12 or 14) it becomes impossible to keep all possible k-mers in memory, and we need alternative data structures to hold the data.

Your paper should address:
Why is k-mer counting so important? What areas of bioinformatics is it commonly used in, and what is it used for?
For the class you had to write a k-mer counting algorithm. How did the algorithm that you wrote work, and what data structure did you use? What are the time and memory complexities of that algorithm in the best case, worst case, and average case, and why?
What other algorithmic approaches have been used to count k-mers by other researchers?
What is the most efficient (time and/or memory), and fastest, algorithm you have found to count k-mers? Briefly describe how that algorithm works.
What is the lowest possible complexity you could get, and the fastest possible algorithm you could create, for k-mer counting?

Your paper should include citations to the appropriate literature. You may want to check Google Scholar and PubMed for literature about k-mer counting.

Your paper should not include plagiarism. Do not copy work: even if you cite something, you are still not allowed to copy the text verbatim. Please see the course documents for more details.

You must submit your paper through the Turnitin link below before the deadline. Late work will not be accepted.
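
To make the memory point in Assignment 1 concrete, here is a minimal Python sketch of one simple approach: counting k-mers in a hash table, so that only the k-mers actually observed are stored. It is an illustration only, not the expected solution or the most efficient method; the function name `count_kmers` and the example reads are hypothetical and do not come from the course materials. For comparison, a table indexed by every possible DNA k-mer needs 4^k entries, which is 16,777,216 for k = 12 and 268,435,456 for k = 14.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across an iterable of DNA strings.

    Hypothetical helper for illustration: a hash table (Counter) stores
    only the k-mers that occur, rather than all 4**k possible k-mers.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence of length n contains n - k + 1 overlapping k-mers;
        # the range is empty when the sequence is shorter than k.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Made-up example reads, not course data.
reads = ["ACGTAC", "GTACGT"]
kmer_counts = count_kmers(reads, 3)
print(len(kmer_counts), "distinct 3-mers observed")
print("GTA occurs", kmer_counts["GTA"], "times")
```

Each k-mer is extracted by slicing, so the running time grows with the total sequence length multiplied by k, and the memory grows with the number of distinct k-mers observed. Whether that is acceptable for a given data set is part of what the paper should discuss.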

Assignment 2
Describe and discuss different approaches for pairwise and multiple sequence alignment.

In the class we have covered different aspects of multiple sequence alignment, including edit distance, local alignment, global alignment, affine gap penalties, and so on.

Why are sequence alignments central to bioinformatics, and what do they tell us about the sequences that we are aligning?
Compare and contrast the different approaches to sequence alignment, and the solutions that you wrote for each of the problems. What are the time and memory complexities of those solutions?
Can your solution be improved upon to reduce either time or memory complexity (or both)? Could you design an implementation of the algorithms that requires less than O(nm) memory?
One of the ways that we can make algorithms faster is through parallelization. Is it possible to parallelize sequence alignments (either pairwise or multiple sequences), and if so, how would you do it?

Assignments are due 00:00 November 3rd (i.e. at the end of the day, November 2nd), and must be submitted through Turnitin. You should be able to make multiple submissions to Turnitin up until the deadline. Late work will not be accepted.

Assignment 3
For the final assignment, please choose one algorithm that we covered in class and review it.

Why is this algorithm important in bioinformatics? What does it tell us, and why is it used?
What is the complexity of the algorithm that you wrote? Consider both the time and memory complexity of the algorithm.
What are the alternative implementations of this algorithm, and how do they differ from the version that you wrote? Are there other ways to speed up the algorithm?
Does the algorithm have implications for other areas of Computer Science, outside of bioinformatics? If so, where is the algorithm commonly used and how is it used?

The assignment is due on Dec 10th at the end of the day and work must be submitted through Turnitin at the link below.
Late work will not be accepted.

Grading Policies
The final grade will be comprised of:
In class participation: 5%
Successfully coding solutions to 45 bioinformatics problems from a total of 50: 45%
Each student will write three papers describing three different biological problems, the code they wrote to solve the problem, and the algorithmic complexity of that code. The assignments will be due at weeks 5, 10, and 15: 50%

A     92 and above
A-    90-92
B+    88-90
B     82-88
B-    80-82
C+    78-80
C     72-78
C-    70-72
D+    68-70
D     62-68
D-    60-62
Fail  Below 60

Special Assistance
If you are a student with a disability and believe you will need accommodations for this class, it is your responsibility to contact Student Disability Services at (619) 594-6473. To avoid any delay in the receipt of your accommodations, you should contact Student Disability Services as soon as possible. Please note that accommodations are not retroactive, and that accommodations based upon disability cannot be provided until you have presented your instructor with an accommodation letter from Student Disability Services. Your cooperation is appreciated.