BCB 567. Bioinformatics I (Fundamentals of Genome Informatics). (Cross-listed with COM S, CPR E.) (3-0) Cr. 3. F. Prereq: Com S 208; Com S 330; Stat 341; credit or enrollment in Biol 315, Stat 430. Biology as an information science. Review of algorithms and information processing. Generative models for sequences. String algorithms. Pairwise sequence alignment. Multiple sequence alignment. Searching sequence databases. Genome sequence assembly. Goals: Study methods for designing efficient algorithms and data structures for problems in Computational Biology. Analyzing the performance of algorithms for various tasks in Computational Biology and learning to estimate their intrinsic resource requirements. Study practical intractability in Computational Biology and approaches for dealing with it. Study models for Computational Biology. Prerequisites: ComS 208, ComS 330, Stat 341; Credit or enrollment in Biol 315; Stat 430 Textbook: (1) Jones and Pevzner – An Introduction to Bioinformatics Algorithms; MIT Press 2004 (required) (2) Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press 1999 (reprint with corrections) (3) Cormen, Leiserson, Rivest, and Stein – Introduction to Algorithms; MIT Press 2009 (3rd edition) (4) Kleinberg and Tardos – Algorithms Design; Addison--âWesley 2005 Another Instructor Description This is a course on computational techniques for reconstruction and alignment of genome sequences. Those techniques are very useful for students to solve computational problems in genome sequencing and analysis. The course starts with review of concepts in algorithm design and analysis. This course covers pairwise sequence alignment, model of sequence evolution, fast string matching, sequence database searching, multiple sequence alignment, and genome sequence assembly. Course Objectives Upon completion of this course the student should be able to apply computational techniques to solve problems in genome sequencing and analysis. Half sheet synopsis of BCB 567 – summary of topics – Fall 2008 1. Review of concepts in algorithm design An instance of a problem, the size of an input instance, the algorithm, the time and space requirements of an algorithm as functions of input size, the big O notation, an example of designing and analyzing an algorithm. 2. Pair sequence alignment 2.1 Motivation Alignments of DNA and protein sequences are useful in studying the evolutionary history of the sequences and finding functional elements in the sequences, e.g., reconstruction of phylogenetic trees and finding sequence regions under strong selection. 2.2 A global alignment model Alignment configuration: matches, mismatches, deletion and insertion gaps, substitution scoring table, affine gap scoring function, the score of an alignment, an optimal alignment. 2.3 A dynamic programming algorithm the major steps of applying dynamic programming as an algorithm design technique, developing an algorithm for computing an optimal global alignment by applying dynamic programming to the problem. Obtaining the time and space requirements of the algorithm. 2.4 A linear space algorithm The high space requirement of the standard algorithm on long sequences. Obtaining the necessary and sufficient conditions for finding a middle pair of positions on an optimal global alignment. Developing a recursive algorithm based on finding a middle pair of positions on an optimal global alignment. Obtaining the time and space requirements of the algorithm. 2.5 A banded alignment algorithm The high time requirement of the standard algorithm on long sequences. Developing an efficient algorithm by restricting the standard algorithm or the linear space algorithm to a small area of the matrix. 2.6 A local alignment model Limitation of the global alignment algorithm on sequences that are not entirely similar but contain local regions that are similar. Definition of a local alignment. Developing a dynamic programming algorithm for computing an optimal local alignment between two sequences. Developing a linear-space algorithm for computing an optimal local alignment. 2.7 A generalized alignment algorithm Limitations of the global and local alignment algorithms on sequences with similar regions (exons) separated by different regions (introns). Introducing a new type of alignment configurations called difference blocks for dealing with different regions. Developing an algorithm for computing an optimal alignment that consists of similarity blocks separated by difference blocks. 3. String matching 3.1 Finding exact string matches between sequences A lookup table for finding exact matches of words of length w and its extension for finding exact matches of strings of lengths in a multiple of w. Or suffix trees and arrays for finding exact matches of strings of any length. 3.2 Finding approximate string matches between sequences A word model of 1's and 0's with 1 indicating a match and 0 for "don't care". Use of a lookup table for finding approximate word matches under a word model. 4. Fast sequence comparison and database search methods 4.1 Limitations of alignment algorithms on whole genome sequences 4.2 Computing high-scoring segment pairs (HSPs) based on finding string matches. 4.3 Dynamic programming algorithm for computing high-scoring chains of HSPs. 5. Construction of substitution matrices 5.1 Construction of PAM matrices based on an evolutionary model 5.2 Construction of Blosum matrices based on sequence similarity 6. Reconstruction of phylogenetic trees 6.1 Computation of evolutionary distances between sequences. 6.2 A distance method for building a phylogenetic tree. 7. Multiple sequence alignment 7.1 A sum-of-pair scoring scheme for a multiple sequence alignment. 7.2 Limitation of a dynamic programming algorithm for building a multiple sequence alignment. 7.3 A progressive alignment method for building a multiple alignment of sequences based on a phylogenetic tree of the sequences. 8. Genome assembly 8.1 Terms in genome assembly Base sequences, quality values, pairs of reads from the ends of DNA segments, overlaps, the layouts and consensus sequences of contigs, and scaffolds. 8.2 Algorithm for quickly computing overlaps between sequences 8.3 Algorithm for building the layouts of contigs 8.4 Algorithm for building scaffolds of contigs 8.5 Algorithm for generating the consensus sequences of contigs. BCB Program/Orientation/2011/BCB567-Description.doc