BCB 567 - Bioinformatics & Computational Biology

BCB 567. Bioinformatics I (Fundamentals of Genome Informatics). (Cross-listed
with COM S, CPR E.) (3-0) Cr. 3. F. Prereq: Com S 208; Com S 330; Stat 341; credit or
enrollment in Biol 315, Stat 430. Biology as an information science. Review of
algorithms and information processing. Generative models for sequences. String
algorithms. Pairwise sequence alignment. Multiple sequence alignment. Searching
sequence databases. Genome sequence assembly.
Goals:
Study methods for designing efficient algorithms and data structures for problems in
Computational Biology. Analyze the performance of algorithms for various tasks in
Computational Biology and learn to estimate their intrinsic resource requirements.
Study practical intractability in Computational Biology and approaches for dealing with
it. Study models for Computational Biology.
Prerequisites:
Com S 208, Com S 330, Stat 341; credit or enrollment in Biol 315, Stat 430
Textbook:
(1) Jones and Pevzner – An Introduction to Bioinformatics Algorithms; MIT Press 2004
(required)
(2) Gusfield – Algorithms on Strings, Trees, and Sequences; Cambridge University Press
1999 (reprint with corrections)
(3) Cormen, Leiserson, Rivest, and Stein – Introduction to Algorithms; MIT Press 2009
(3rd edition)
(4) Kleinberg and Tardos – Algorithm Design; Addison-Wesley 2005
Another Instructor Description
This is a course on computational techniques for reconstruction and alignment of
genome sequences. These techniques help students solve computational problems in
genome sequencing and analysis. The course starts with a review of concepts in
algorithm design and analysis. It then covers pairwise sequence alignment, models
of sequence evolution, fast string matching, sequence database searching, multiple
sequence alignment, and genome sequence assembly.
Course Objectives
Upon completion of this course the student should be able to apply computational
techniques to solve problems in genome sequencing and analysis.
Half sheet synopsis of BCB 567 – summary of topics – Fall 2008
1. Review of concepts in algorithm design
An instance of a problem, the size of an input instance, the algorithm,
the time and space requirements of an algorithm as functions of input size,
big-O notation, and an example of designing and analyzing an algorithm.
2. Pairwise sequence alignment
2.1 Motivation
Alignments of DNA and protein sequences are useful in studying
the evolutionary history of the sequences and finding functional
elements in the sequences, e.g., reconstruction of phylogenetic trees
and finding sequence regions under strong selection.
2.2 A global alignment model
Alignment configuration: matches, mismatches, deletion and insertion gaps,
substitution scoring table, affine gap scoring function, the score of an alignment,
an optimal alignment.
2.3 A dynamic programming algorithm
The major steps of applying dynamic programming as an algorithm design technique.
Developing an algorithm for computing an optimal global alignment by applying
dynamic programming to the problem.
Obtaining the time and space requirements of the algorithm.
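As an illustration of the global alignment model with an affine gap function, the recurrence can be sketched as below. This is a minimal sketch, not the course's own code: the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumed for the example, a simple match/mismatch rule stands in for a substitution scoring table, and a gap of length k is assumed to cost `gap_open + k * gap_extend`. Only the optimal score is returned.

```python
NEG = float("-inf")

def global_affine_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score of an optimal global alignment of a and b under affine gaps."""
    n, m = len(a), len(b)
    # S[i][j]: best score for a[:i] vs b[:j] with any ending
    # D[i][j]: best score ending with a character of a opposite a gap
    # I[i][j]: best score ending with a character of b opposite a gap
    S = [[NEG] * (m + 1) for _ in range(n + 1)]
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(1, n + 1):
        D[i][0] = -(gap_open + i * gap_extend)
        S[i][0] = D[i][0]
    for j in range(1, m + 1):
        I[0][j] = -(gap_open + j * gap_extend)
        S[0][j] = I[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one.
            D[i][j] = max(D[i - 1][j] - gap_extend,
                          S[i - 1][j] - gap_open - gap_extend)
            I[i][j] = max(I[i][j - 1] - gap_extend,
                          S[i][j - 1] - gap_open - gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, D[i][j], I[i][j])
    return S[n][m]
```

With three (n+1) by (m+1) matrices and constant work per cell, this sketch takes O(nm) time and O(nm) space, which is the space cost that topic 2.4 addresses.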
2.4 A linear space algorithm
The high space requirement of the standard algorithm on long sequences.
Obtaining the necessary and sufficient conditions for finding a middle
pair of positions on an optimal global alignment.
Developing a recursive algorithm based on finding a middle pair of positions
on an optimal global alignment.
Obtaining the time and space requirements of the algorithm.
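The middle-position idea can be sketched as follows. This is a simplified illustration that assumes a linear gap penalty (`gap` per gap character) rather than the affine function of topic 2.2, and the function names are hypothetical. It computes one row of scores at a time, so only O(m) space is used, and it finds a column j where an optimal alignment crosses the middle row of the matrix.

```python
def last_row(a, b, match=1, mismatch=-1, gap=-2):
    """Scores of aligning all of a with every prefix of b, keeping two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev

def middle_split(a, b, **scoring):
    """Return (i, j, score): an optimal alignment passes through cell (i, j)."""
    i = len(a) // 2
    m = len(b)
    # Forward scores for the first half of a, reverse scores for the second.
    fwd = last_row(a[:i], b, **scoring)
    rev = last_row(a[i:][::-1], b[::-1], **scoring)
    j = max(range(m + 1), key=lambda k: fwd[k] + rev[m - k])
    return i, j, fwd[j] + rev[m - j]
```

Recursing on (a[:i], b[:j]) and (a[i:], b[j:]) would recover the full alignment; each level of recursion processes about half as many matrix cells as the one before, so the total time stays O(nm).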
2.5 A banded alignment algorithm
The high time requirement of the standard algorithm on long sequences.
Developing an efficient algorithm by restricting the standard algorithm
or the linear space algorithm to a small area of the matrix.
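Restricting the computation to a band around the main diagonal can be sketched as below. This is an illustrative sketch under assumed conventions: the band is the set of cells with |i - j| <= `band`, gaps are scored linearly, and the function name and parameters are hypothetical. The result is optimal only among alignments that stay inside the band.

```python
NEG = float("-inf")

def banded_global_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Best global alignment score using only cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band is too narrow to reach the final cell")
    # Each row is a dict holding only the in-band columns.
    prev = {j: j * gap for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            best = NEG
            if j - 1 in prev:  # diagonal move: substitution
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, prev[j - 1] + sub)
            if j in prev:      # vertical move: gap in b
                best = max(best, prev[j] + gap)
            if j - 1 in cur:   # horizontal move: gap in a
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[m]
```

Each row touches at most 2 * band + 1 cells, so the running time is O(n * band) instead of O(nm).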
2.6 A local alignment model
Limitation of the global alignment algorithm on sequences that are not entirely
similar but contain local regions that are similar.
Definition of a local alignment.
Developing a dynamic programming algorithm for computing an optimal local
alignment between two sequences.
Developing a linear-space algorithm for computing an optimal local alignment.
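A local alignment score can be computed with the same kind of recurrence as the global case, with one extra option: a score may be reset to 0 so that an alignment can start anywhere. The sketch below assumes a linear gap penalty and example scoring values, keeps only two rows, and returns the best score together with the position where the best local alignment ends. The function name is hypothetical.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, end_i, end_j) of an optimal local alignment."""
    best, end_i, end_j = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a local alignment begin at this cell.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best:
                best, end_i, end_j = cur[j], i, j
        prev = cur
    return best, end_i, end_j
```

Running the same computation on the reversed prefixes a[:end_i] and b[:end_j] would locate the start of the local alignment, after which a linear-space global method could be applied to the region in between.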
2.7 A generalized alignment algorithm
Limitations of the global and local alignment algorithms on sequences with
similar regions (exons) separated by different regions (introns).
Introducing a new type of alignment configuration called a difference block
for dealing with different regions.
Developing an algorithm for computing an optimal alignment that consists of
similarity blocks separated by difference blocks.
3. String matching
3.1 Finding exact string matches between sequences
A lookup table for finding exact matches of words of length w and its extension for
finding exact matches of strings whose lengths are multiples of w.
Alternatively, suffix trees and suffix arrays for finding exact matches of strings
of any length.
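The lookup-table approach for words of length w can be sketched as follows. This is a minimal sketch using a Python dictionary as the table; the function names are hypothetical.

```python
from collections import defaultdict

def build_lookup_table(seq, w):
    """Map every word of length w in seq to the list of its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table

def exact_word_matches(query, table, w):
    """List (query_pos, target_pos) pairs whose words of length w are equal."""
    return [(q, t)
            for q in range(len(query) - w + 1)
            for t in table.get(query[q:q + w], [])]
```

For example, with `table = build_lookup_table("ACGTACGT", 3)`, the call `exact_word_matches("TACG", table, 3)` reports the word TAC at target position 3 and the word ACG at target positions 0 and 4.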
3.2 Finding approximate string matches between sequences
A word model of 1's and 0's with 1 indicating a match and 0 for "don't care".
Use of a lookup table for finding approximate word matches under a word model.
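A word model can be applied by keeping only the characters at the 1 positions before consulting the lookup table. The sketch below assumes the model is given as a string such as "1101"; the function names are hypothetical.

```python
from collections import defaultdict

def spaced_word(seq, pos, model):
    """Characters of seq at the '1' positions of the model, starting at pos."""
    return "".join(seq[pos + k] for k, bit in enumerate(model) if bit == "1")

def approximate_word_matches(query, target, model):
    """List (query_pos, target_pos) pairs that agree at every '1' position."""
    span = len(model)
    table = defaultdict(list)
    for t in range(len(target) - span + 1):
        table[spaced_word(target, t, model)].append(t)
    return [(q, t)
            for q in range(len(query) - span + 1)
            for t in table.get(spaced_word(query, q, model), [])]
```

Under the model "1101", the words ACTT and ACGT count as a match because they differ only at the "don't care" position.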
4. Fast sequence comparison and database search methods
4.1 Limitations of alignment algorithms on whole genome sequences
4.2 Computing high-scoring segment pairs (HSPs) based on finding string matches.
4.3 Dynamic programming algorithm for computing high-scoring chains of HSPs.
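Chaining of HSPs can be sketched with a quadratic dynamic program. The representation is an assumption made for this example: each HSP is a tuple (a_start, a_end, b_start, b_end, score) with inclusive coordinates, one HSP may follow another only if it starts after the other ends in both sequences, and no penalty is charged for the gap between consecutive HSPs.

```python
def best_chain(hsps):
    """Return (total_score, chain) for a highest-scoring chain of HSPs."""
    if not hsps:
        return 0, []
    order = sorted(hsps, key=lambda h: h[0])  # sort by start in sequence a
    best = [h[4] for h in order]              # best chain score ending at each HSP
    back = [None] * len(order)                # predecessor in that chain
    for i, (a0, _, b0, _, score) in enumerate(order):
        for j in range(i):
            _, prev_a1, _, prev_b1, _ = order[j]
            if prev_a1 < a0 and prev_b1 < b0 and best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    k = end
    while k is not None:
        chain.append(order[k])
        k = back[k]
    return best[end], chain[::-1]
```

Sorting by start position guarantees that every valid predecessor of an HSP appears before it, so a single left-to-right pass suffices. The sketch takes O(k^2) time for k HSPs.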
5. Construction of substitution matrices
5.1 Construction of PAM matrices based on an evolutionary model
5.2 Construction of BLOSUM matrices based on sequence similarity
6. Reconstruction of phylogenetic trees
6.1 Computation of evolutionary distances between sequences.
6.2 A distance method for building a phylogenetic tree.
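As one example of an evolutionary distance for DNA, the sketch below uses the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the fraction of differing sites. The synopsis does not name a specific model, so the choice of Jukes-Cantor is an assumption made for illustration. The inputs are assumed to be two rows of an alignment of equal length, with "-" marking gaps; gapped columns are ignored.

```python
import math

def p_distance(a, b):
    """Fraction of ungapped aligned sites at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in sites) / len(sites)

def jukes_cantor_distance(a, b):
    """Expected substitutions per site under the Jukes-Cantor model."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)
```

A matrix of such pairwise distances is the usual input to a distance method for tree building.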
7. Multiple sequence alignment
7.1 A sum-of-pairs scoring scheme for a multiple sequence alignment.
7.2 Limitation of a dynamic programming algorithm for building a multiple sequence
alignment.
7.3 A progressive alignment method for building a multiple alignment of sequences
based on a phylogenetic tree of the sequences.
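The sum-of-pairs score of topic 7.1 can be sketched as below: every column contributes the sum of the pairwise scores of its characters. The scoring values and the convention that two gap characters in the same column score 0 are assumptions for this example, and gaps are scored per column rather than with an affine function.

```python
from itertools import combinations

def sum_of_pairs_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # gap against gap scores 0
            elif x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

For the three rows AC-, ACG, and A-G, the columns contribute 3, -3, and -3, for a total of -3.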
8. Genome assembly
8.1 Terms in genome assembly
Base sequences, quality values, pairs of reads from the ends of DNA segments,
overlaps, the layouts and consensus sequences of contigs, and scaffolds.
8.2 Algorithm for quickly computing overlaps between sequences
8.3 Algorithm for building the layouts of contigs
8.4 Algorithm for building scaffolds of contigs
8.5 Algorithm for generating the consensus sequences of contigs.
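To make the term "overlap" concrete, the sketch below finds the longest exact suffix-prefix overlap between two reads by direct comparison. It is a naive illustration only, not the fast method referred to in 8.2: it ignores sequencing errors, quality values, and reverse complements, and the function name and `min_len` threshold are hypothetical.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b, or 0.

    Only overlaps of at least min_len characters are reported.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0
```

For example, the reads ACGTTGCA and TGCAGGT overlap by the four characters TGCA, so they could be placed next to each other in the layout of a contig.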
BCB Program/Orientation/2011/BCB567-Description.doc