CS 696 Programming Problems in Bioinformatics
Fall 2014
Credits: 3 units
Contact Hours: Monday and Wednesday 1730-1845
Instructors: Robert Edwards
Office: GMCS 536 Email: redwards@mail.sdsu.edu
Office Hours: Mondays and Wednesdays 1530 – 1700 (and by appointment)
Course Materials
1. CS 696 lecture notes/slides (available on Blackboard)
2. Supplementary textbooks:
i. Bioinformatics Algorithms: An Active Learning Approach by Compeau and
Pevzner. ISBN: 0990374602
ii. Essential Bioinformatics by Jin Xiong ISBN: 0521600820
Course Information for CS 696
Description from the Official Course Catalog
Methods for analysis of biological data using computational techniques and commonly
used scripting languages (perl, python, etc.). Discussion, coding, and complexity analysis
of different algorithms and types of problems.
Prerequisites: Computer Science 310, Bioinformatics and Medical Informatics 668
Course Type: Selected elective course in the program
Specific Goals for CS 696
Course-Level Student Learning Outcomes
1. Ability to write code to solve a biological problem
2. Ability to review and discuss code written in different languages
3. Ability to analyze algorithmic complexity
Relationship to CS Program Course Outcomes
CS 696 addresses the following CS Program course outcomes:
a) An ability to apply knowledge of computing and mathematics appropriate to the
program’s student outcomes and to the discipline
b) An ability to analyze a problem, and identify and define the computing requirements
appropriate to its solution
c) An ability to design, implement, and evaluate a computer-based system, process,
component, or program to meet desired needs
d) Recognition of the need for and an ability to engage in continuing professional
development
e) An ability to use current techniques, skills, and tools necessary for computing practice.
f) An ability to apply mathematical foundations, algorithmic principles, and computer
science theory in the modeling and design of computer-based systems in a way that
demonstrates comprehension of the tradeoffs involved in design choices.
g) An ability to apply design and development principles in the construction of software
systems of varying complexity.
Topics Covered
The following topics are covered in CS 696:
1. Algorithms for counting k-mers
2. Algorithms for DNA statistics
3. Alignment-based phylogenies
4. Binary trees
5. Edit distance between two strings
6. Efficient substring matches
7. Fibonacci sequences
8. Global pairwise sequence alignments
9. Greedy searches
10. Introduction to bioinformatics.
11. Local sequence alignments
12. Longest multiple repeats
13. Longest substrings in a sequence
14. Motif searching
15. Pattern matching and random strings
16. Phylogenetic trees
17. Public databases
18. Sequence data file formats
19. Suboptimal sequence alignments
20. Suffix tries
21. Transcription and translation
Course Schedule
Week  Topic
1     Introduction, syllabus, announcements, programming languages, introduction to
      bioinformatics
2     Algorithms for DNA statistics, transcription, and translation
3     Algorithms for counting k-mers. Efficient nucleotide counting
4     Pattern matching and random strings, Fibonacci sequences
5     Motif searching, pattern matching, and greedy searches
      September 22nd. Assignment 1 Due: Describe the efficiency of the k-mer counting
      approaches that you have tried. What is the most efficient way to do it?
6     Longest substrings in a sequence, longest multiple repeats
7     Public databases and sequence data file formats
8     Local sequence alignments and edit distances
9     Global pairwise sequence alignments and multiple string alignments
10    Suboptimal sequence alignments
      November 2nd at the end of the day (midnight). Assignment 2 Due: Describe and
      discuss different approaches for pairwise and multiple sequence alignment.
11    Phylogenetic trees, binary trees, and suffix tries
12    Alignment-based phylogenies and binary trees
13    Efficient substring matches and fast searches: restriction sites
14    Review
      December 10th, Final Report is due
Major assignments
Assignment 1
For this assignment you will write a paper discussing k-mer counting.
Identifying the short substrings (k-mers) in DNA sequences is an important bioinformatics
application and is widely used in different areas of our research. It is also computationally
intensive as we often need to count the k-mers present in very large numbers of very short
sequences. In addition, once k gets sufficiently large (above 12 or 14) it becomes impossible to
keep all possible k-mers in memory and we need alternative data structures to hold the data.
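As an illustration of the counting problem described above, here is a minimal sketch of a hash-table approach in Python. It is not the course's reference solution; the function name `count_kmers` is hypothetical. Over the DNA alphabet there are 4^k possible k-mers (4^14 = 268,435,456), so this sketch stores only the k-mers that actually occur rather than a table of every possible one.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence using a hash table.

    Hypothetical helper for illustration only. It slices one k-mer per
    start position, so it does roughly len(sequence) * k character work.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    # The last valid start position is len(sequence) - k.
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Memory use here grows with the number of distinct k-mers observed, which is one reason other data structures are worth comparing in the paper.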
Your paper should address:

Why is k-mer counting so important?

What areas of bioinformatics is it commonly used in, and what is it used for?

For the class you had to write a k-mer counting algorithm. How did the algorithm that
you wrote work, and what data structure did you use? What are the time and memory
complexities of that algorithm in the best case, worst case, and average case, and why?

What other algorithmic approaches have been used to count k-mers by other researchers?

What is the most efficient (time and/or memory), and fastest, algorithm you have found
to count k-mers? Briefly describe how that algorithm works.

What's the lowest possible complexity you could get, and the fastest possible algorithm
you could create, for k-mer counting?
Your paper should include citations to the appropriate literature. You may want to check Google
Scholar and PubMed for literature about k-mer counting.
Your paper must not include plagiarism. Do not copy work: even if you cite a source, you are
still not allowed to copy its text verbatim. Please see the course documents for more details.
You must submit your paper through the Turnitin link below before the deadline.
Late work will not be accepted.
Assignment 2
Describe and discuss different approaches for pairwise and multiple sequence alignment
In the class we have covered different aspects of multiple sequence alignment, including edit
distance, local alignment, global alignment, affine gap penalties, and so on.

Why are sequence alignments central to bioinformatics, and what do they tell us about
the sequences that we are aligning?

Compare and contrast the different approaches to sequence alignment, and the solutions
that you wrote for each of the problems.

What are the time and memory complexities of those solutions?

Can your solution be improved upon to reduce either time or memory complexity (or
both)?

Could you design an implementation of the algorithms that does not require at least
O(nm) memory?

One of the ways that we can make algorithms faster is through parallelization. Is it
possible to parallelize sequence alignments (either pairwise or multiple sequences), and if
so how would you do it?
Assignments are due 00:00 November 3rd (i.e. at the end of the day, November 2nd), and must
be submitted through Turnitin. You should be able to make multiple submissions to Turnitin up
until the deadline.
Late work will not be accepted.
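As one possible starting point for the memory question above, here is a minimal sketch of edit distance computed with only two rows of the dynamic programming table. It assumes unit costs for insertion, deletion, and substitution, and the function name `edit_distance` is hypothetical. It returns only the distance, not the alignment itself; recovering the alignment in linear space needs a further technique such as Hirschberg's algorithm.

```python
def edit_distance(a, b):
    """Edit distance between two strings using two rows of the DP table.

    Hypothetical helper for illustration only. Time is O(len(a) * len(b));
    extra memory is proportional to the shorter string, not to the full table.
    """
    if len(b) > len(a):
        a, b = b, a  # keep the shorter string along the row
    # Row 0: distance from the empty prefix of a to each prefix of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Each row depends only on the row before it, which is why the full table is not needed when only the score is required.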
Assignment 3
For the final assignment please choose one algorithm that we covered in class, and review the
algorithm.

Why is this algorithm important in bioinformatics? What does it tell us, and why is it
used?

What is the complexity of the algorithm that you wrote? Consider both time and memory
complexity of the algorithm.

What are the alternative implementations of this algorithm, and how do they differ from
the version that you wrote?

Are there other ways to speed up the algorithm?

Does the algorithm have implications for other areas of Computer Science, outside of
bioinformatics? If so, where is the algorithm commonly used and how is it used?
The assignment is due on Dec 10th at the end of the day and work must be submitted through
Turnitin at the link below.
Late work will not be accepted.
Grading Policies
The final grade will be composed of:
In-class participation: 5%
Successfully coding solutions to 45 bioinformatics problems from a total of 50: 45%
Each student will write three papers describing three different biological problems, the
code they wrote to solve the problem, and the algorithmic complexity of that code. The
assignments will be due at weeks 5, 10, and 15: 50%
A     92 and above
A-    90-92
B+    88-90
B     82-88
B-    80-82
C+    78-80
C     72-78
C-    70-72
D+    68-70
D     62-68
D-    60-62
Fail  Below 60
Special Assistance
If you are a student with a disability and believe you will need accommodations for this class, it
is your responsibility to contact Student Disability Services at (619) 594-6473. To avoid any
delay in the receipt of your accommodations, you should contact Student Disability Services as
soon as possible. Please note that accommodations are not retroactive, and that accommodations
based upon disability cannot be provided until you have presented your instructor with an
accommodation letter from Student Disability Services. Your cooperation is appreciated.