Computer Science 286 Spring 2002
Bioinformatics Projects

The project will involve concentrated work in a research area related to Bioinformatics. Some of these projects might lead to master's theses or writing projects. Students are encouraged to develop a project related to their own interests, but a list of suggested projects is attached to help you get started.

The key idea is to be creative, either in developing a new algorithm or in implementing an existing one. Results, whether good or bad, should be compared and contrasted with the literature or with other existing algorithms. Some possibilities are listed below. You can pick a project from the list, modify a project to suit your interests, or invent your own.

The two-page written proposal from each group is due on October 16, 2003. The proposal should give a clear description of the project and should contain absolutely no generalities or definitions taken from the notes or from the book. Clearly state what you are planning to do and clearly explain how you plan to achieve it. It is important to get started as early as possible.

The project may be developed for any available platform. You can use C, C++, Java or Perl. The World Wide Web contains many programs for Bioinformatics. You may use and modify such code, provided that all material taken from other sources is completely and properly acknowledged and cited.

The projects are due on Wednesday, December 2, 2003, at the beginning of the lecture. Each final report should consist of a technical project description with appendices and references, and a disk containing the source code written for the project and the PowerPoint presentation. The presentations will be given on Saturday, December 6, 2003 and/or during class time, towards the end of the semester (dates to be announced later).
A good project report will include the following:

- Background of the problem (literature search).
- A clear definition of the problem.
- A description and justification of the data sources.
- An explanation and justification of methods of data analysis.
- A description of the results of the analysis.
- Conclusions based on the results.
- Possible directions for future research.
- A full list of references.

You should aim to keep your report within 10 pages if it is a programming project and within 30 pages if it is a non-programming project. Please do not include a hard copy of the source code.

© 2002 by Sami Khuri

List of Suggested Projects

A - Programming Projects

1. DP Pairwise Comparison Algorithm

In this project you are to implement a dynamic programming algorithm that does pairwise comparison. Your program should have three options:

- Local comparison
- Global comparison
- Semiglobal comparison

Read Section 3.2 [SM97] for a clear explanation of all three options, including procedures Similarity and Align on pages 52 and 53, respectively. Compare your program to an existing package.

[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

2. Two-Gap Value DP Pairwise Comparison Algorithm

In this project you are to implement a dynamic programming algorithm that does pairwise comparison. Your program should allow two gap penalties: one for starting a gap and one for extending a gap. Your program should have two options:

- Local comparison
- Global comparison

The program should allow the user to enter the gap penalties.

3. Optimal Linear Space DP for Pairwise Comparison

In this project you are to implement a dynamic programming algorithm that does pairwise comparison in linear space. The basic algorithm has quadratic time and space complexity; the space requirement can be reduced from quadratic to linear.
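As an illustration of the space saving only, the following minimal sketch computes the global comparison score while keeping two rows of the matrix instead of the whole matrix. It is not the procedure from [SM97]: the function name and the scoring values (match 1, mismatch -1, gap -2) are assumptions made for the example, and Python is used purely for brevity. The sketch returns the score only; recovering the alignment itself in linear space takes an additional divide-and-conquer step that is not attempted here.

```python
def global_score_linear_space(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t, keeping only two DP rows.

    The scoring values are illustrative assumptions, not prescribed values.
    """
    # Put the shorter sequence along the row so memory is O(min(|s|, |t|)).
    if len(t) > len(s):
        s, t = t, s

    # Row 0: aligning the empty prefix of s against prefixes of t costs gaps.
    prev = [j * gap for j in range(len(t) + 1)]

    for i in range(1, len(s) + 1):
        # Column 0: aligning a prefix of s against the empty prefix of t.
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,   # align s[i-1] with t[j-1]
                          prev[j] + gap,        # s[i-1] against a gap
                          curr[j - 1] + gap)    # t[j-1] against a gap
        prev = curr  # the older row is no longer needed

    return prev[len(t)]
```

Time remains proportional to the product of the two lengths; only the memory drops to one row of the shorter sequence.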
The algorithm is described in Section 3.3.1, “Space Saving”, of Section 3.3, “Extensions of the Basic Algorithms” [SM97]. Your program should have three options:

- Local comparison
- Global comparison
- Semiglobal comparison

Read Sections 3.2 and 3.3.1 [SM97] for a clear explanation of all three options, including procedures BestScore and Align on pages 59 and 61, respectively. Compare your program to an existing package.

[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

4. K-Band DP for Pairwise Comparison

If two sequences are similar, the best alignments have their paths near the main diagonal. To compute the optimal score and alignment it is then not necessary to fill the entire matrix; a narrow band around the main diagonal should suffice. In this project you are to implement a dynamic programming algorithm that does pairwise comparison and uses the K-Band procedure described in Section 3.3.4 [SM97]. Your program should have three options:

- Local comparison
- Global comparison
- Semiglobal comparison

Read Sections 3.2 and 3.3.4 [SM97] for a clear explanation of all three options. Compare your program to an existing package.

[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

5. The Original BLAST

The Basic Local Alignment Search Tool (BLAST) was first published in 1990 [AGM+90]. This project consists in reading, understanding and implementing the algorithm presented in the article and comparing it to an existing package.

[AGM+90] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215: 403-410 [1990].

6. PSI-BLAST

The Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was first published in 1997 [AMS+97].
This project consists in reading, understanding and implementing PSI-BLAST as described in the article and comparing it to an existing package.

[AMS+97] Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17): 3389-3402 [1997].

7. FASTA and FASTP

W. Pearson and D. Lipman developed FASTA, which provides a rapid way of finding short stretches of similar sequence between a new sequence and any sequence in a database [PL88]. Pearson continued to improve the FASTA method for similarity searches in sequence databases [Pea90], [Pea96]. This project consists in choosing one of the three FAST algorithms from one of the three referenced articles, reading, understanding and implementing it as described in the article and comparing it to an existing package.

[PL88] Pearson, W. and Lipman, D. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85: 2444-2448 [1988].

[Pea90] Pearson, W. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183: 63-98 [1990].

[Pea96] Pearson, W. Effective protein sequence comparison. Methods Enzymol. 266: 227-258 [1996].

8. CLUSTAL W

CLUSTAL W is a commonly used package for multiple sequence alignment. It was published in 1994 [THG94]. This project consists in reading, understanding and implementing the algorithm presented in the article and comparing it to an existing package.

[THG94] Thompson, J., Higgins, D., and Gibson, T. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680 [1994].

9. Multiple Sequence Alignment and Genetic Algorithms

The dynamic programming approach to the multiple sequence alignment (MSA) problem has exponential time complexity.
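A rough way to see the exponential growth is to count the work done by a straightforward DP over k sequences. The sketch below assumes k sequences, each of length n, a full k-dimensional table, and one unit of work per possible move into a cell; the function name is hypothetical and the count is an upper bound, since cells on the boundary have fewer predecessors.

```python
def msa_dp_work(k, n):
    """Upper bound on table size and work for a naive k-sequence DP.

    Assumes k sequences of equal length n (an illustrative simplification).
    Returns (number_of_cells, number_of_cell_updates).
    """
    cells = (n + 1) ** k          # one cell per combination of prefixes
    moves_per_cell = 2 ** k - 1   # each sequence either advances or gaps,
                                  # excluding the all-gap column
    return cells, cells * moves_per_cell


# Two sequences are cheap; each extra sequence multiplies the cost.
for k in (2, 3, 4):
    print(k, msa_dp_work(k, 100))
```

Both factors grow exponentially in the number of sequences k, which is why heuristic methods such as genetic algorithms are attractive for this problem.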
Genetic algorithms are of considerable interest to researchers because they can find high-scoring alignments as good as those found by other methods. This project consists in implementing your own genetic algorithm for the MSA problem and comparing it to SAGA [NH96] or to the genetic algorithm of Zhang and Wong [ZW97].

[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by Genetic Algorithm (SAGA). Nucleic Acids Research, 24(8): 1515-1524 [1996].

[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple Sequence Alignment. Comput. Appl. Biosci. 13: 565-581 [1997].

[Whi93] Whitley, D. A Genetic Algorithm Tutorial [1993].

[Khu97] Khuri, S. Genetic Algorithm Workshop [1997].

[CW96] Corcoran, A.L. and Wainwright, R.L. LIBGA: A User-friendly workbench for order-based genetic algorithm research [1996].

10. Fragment Assembly

With current technology, it is impossible to directly sequence contiguous DNA stretches of more than a few hundred bases. Typically, several copies of a long DNA molecule are cut into random pieces. Reconstructing the original sequence from these fragments is called fragment assembly [SM97].

This project consists in choosing either the greedy algorithm described in Section 4.3.4 or one of the heuristics described in Section 4.4. Read, understand and implement the algorithm you chose and compare its performance to an existing package.

[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

11. Physical Mapping of DNA

Physical mapping is the process of determining the location of certain markers (landmarks) of a DNA molecule. The markers are generally small but precisely defined sequences. The resulting maps are used as a basis for DNA sequencing, and for the isolation and characterization of individual genes or other DNA regions of interest. For example, see Figure 5.1 [SM97, page 144].
This project consists in choosing one of the algorithms described in Sections 5.2, 5.3, 5.4, and 5.5, reading it, understanding it, implementing it and comparing its performance to an existing package.

[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

12. Phylogenetic Trees

The reconstruction of phylogenetic trees is a general problem in biology. It is used in molecular biology to help understand the evolutionary relationships among proteins, for example [SM97].

This project consists in choosing one of the four algorithms mentioned below, choosing the appropriate referenced article(s), reading, understanding and implementing the algorithm as described in the article and comparing it to an existing package.

- Phylogenetic Trees Based on Pairwise Distances [FD96]
- Phylogenetic Trees Based on Neighbor Joining [SN87]
- Phylogenetic Trees Based on Maximum Parsimony [Fel96]
- Phylogenetic Trees Based on Maximum Likelihood Estimation [BT86], [Fel81]

[BT86] Bishop, M. and Thompson, E. Maximum likelihood alignment of DNA sequences. Journal of Molecular Biology, 190: 159-165 [1986].

[FD96] Feng, D. and Doolittle, R. Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol. 266 [1996].

[Fel81] Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17: 368-376 [1981].

[Fel96] Felsenstein, J. Inferring phylogeny from protein sequences by parsimony, distance and likelihood methods. Methods Enzymol. 266 [1996].

[SN87] Saitou, N. and Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4: 406-425 [1987].

13. Gene Prediction

Gene prediction consists in identifying regions of genomic DNA that encode proteins.
Some of the existing models that distinguish coding regions from non-coding regions are based on:

- Hidden Markov Model (HMM)
- Neural network
- Probabilistic model
- Linear discriminant analysis
- Decision tree classification
- Quadratic discriminant analysis

This project consists in choosing one of the above techniques and implementing the prediction (search) algorithm, which should be able to search a given database for genes that code for proteins. The algorithm will be compared to an existing package.

14. The Protein Prediction Problem

The main goal in the protein prediction problem is to determine the three-dimensional structure of a protein based on its amino acid sequence. Recall that protein structure can be viewed at three levels:

- The primary structure is the sequence of amino acids in the chain, i.e., a one-dimensional structure.
- The secondary structure is the result of the folding of parts of the amino acid chain. The two most important secondary structures are the α-helix and the β-strand.
- The tertiary structure is the real three-dimensional configuration of the protein under given environmental conditions (solvent, pH and temperature).

The tertiary structure determines the biochemical function of the protein. If the tertiary structure is changed, the protein normally loses its ability to perform whatever function it has, since this function depends on the geometrical shape of the active site in the interior of the molecule.

This project consists in choosing an existing algorithm for protein prediction, implementing it, and comparing it to a package currently in use.

15. The RNA Structure Prediction Problem

Unlike DNA, which most frequently assumes its well-known double-helical conformation, the three-dimensional structure of single-stranded RNA is determined by the sequence of nucleotides in much the same way that protein structure is determined by sequence.
RNA structure, however, is less complex than protein structure and can be well characterized by identifying the location of commonly occurring secondary structure elements [KR03].

This project consists in choosing an existing algorithm for RNA structure prediction, implementing it, and comparing it to a package currently in use.

[KR03] Krane, D. and Raymer, M. Fundamental Concepts of Bioinformatics [2003].

16. The Clustering Gene Expression Problem

Analysis of gene expression patterns can provide insight into relationships between a gene and its function. Clustering techniques applied to gene expression data partition genes into clusters (groups) based on their expression patterns. Genes in the same cluster will have similar expression patterns, while genes in different clusters will have distinct expression patterns.

This project consists in choosing an existing clustering algorithm (such as CLICK [SS00]), implementing it, and comparing it to a package currently in use [BSY99].

[BSY99] Ben-Dor, A., Shamir, R. and Yakhini, Z. Clustering gene expression patterns. Journal of Computational Biology, 6: 281-297 [1999].

[SS00] Sharan, R. and Shamir, R. CLICK: A clustering algorithm with applications to gene expression analysis. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 307-316 [2000].

Visualization Tools for Bioinformatics Algorithms

The main purpose of the projects in this category is to present a visualization tool to assist students in learning algorithms related to Bioinformatics. The interactive, user-friendly educational tool you develop should be usable in classroom demonstrations, hands-on laboratories, self-directed work outside of class, and distance learning. The visualization package will include background material, a detailed explanation of the algorithm with examples, and quizzes and exercises. Using Java is highly recommended.

17. Needleman and Wunsch’s DP Algorithm

Design and implement a visual interactive software package to demonstrate how Needleman and Wunsch’s dynamic programming algorithm is applied to solve protein and genomic sequence alignment problems.

18. CLUSTAL W Algorithm

Design and implement a visual interactive software package to demonstrate how the CLUSTAL W algorithm is applied to align multiple sequences.

19. Phylogenetic Tree Construction

Choose a phylogenetic tree construction algorithm. Design and implement a visual interactive software package to demonstrate how the algorithm you selected constructs a phylogenetic tree.

B - Non-Programming Projects

Survey Papers

20. A Comprehensive Survey and Comparison of Multiple Sequence Alignment Programs

In this project, you are to choose 5 MSA programs. Describe each one in detail and compare their performance. The comparison should include several runs with data from various databases.

21. A Comprehensive Survey and Comparison of Protein Structure Visualization Tools

In this project, you are to choose 8 protein structure visualization programs. Describe each one in detail and use them to visualize the structure of different proteins. The project should include screen dumps.

22. DNA Computing

Write a survey paper on the state of the art in DNA computing, highlighting new directions.

23. Bayesian Statistical Methods for Sequence Alignment and Evolutionary Distance Estimation

Write a survey paper based on the results published by Agarwal and States in 1996 [AS96], and Zhu et al. in 1998 [ZLL98]. For further reading, a Bayesian bioinformatics tutorial by C. Lawrence is available on the Internet [Law01]. Divide your paper into the following four parts:

(a) Explain what Bayesian statistics is.
(b) Explain how Bayesian statistical methods can be applied to sequence analysis.
(c) Explain how Bayesian statistical methods can be applied to evolutionary distance estimation.
(d) Describe the most common Bayesian sequence alignment algorithms.

Non-Survey Projects

24. PAM and BLOSUM Substitution Matrices

Explain how the various PAM and BLOSUM substitution matrices were constructed. Discuss the differences and similarities between the PAM and BLOSUM matrices. Describe applications of each of the matrices.

25. Amino Acid Scoring Matrices

The most commonly used substitution scoring matrices are PAM and BLOSUM. Choose two other amino acid scoring matrices. Describe how they were constructed. Discuss the differences and similarities between these matrices. Describe applications of each of the matrices.

26. Test of Markov Model of Evolution in Proteins

In 1985, Wilbur tested the Markov model of evolution and showed that it may be applicable if certain changes are made in the way the PAM matrices are calculated [Wil85]. Write a research paper describing how the tests were done, which conclusions were drawn from them, and how George et al. extended the original paper in 1990 [GBH90].
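As background for projects 24 and 26, the Markov assumption behind the PAM family can be sketched in a few lines: substitution probabilities for n evolutionary steps are obtained by raising the one-step matrix to the n-th power. The example below is only an illustration under stated assumptions. It uses a made-up two-letter alphabet rather than real PAM data, it adopts a row convention chosen for the sketch (row i holds the probabilities that letter i is replaced by each letter), and all names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(one_step, n):
    """Return one_step raised to the n-th power (n-step probabilities)."""
    size = len(one_step)
    # Start from the identity matrix: zero steps change nothing.
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, one_step)
    return result


# Hypothetical one-step matrix over a toy two-letter alphabet (not PAM data).
TOY_ONE_STEP = [[0.99, 0.01],
                [0.02, 0.98]]

two_step = extrapolate(TOY_ONE_STEP, 2)
many_step = extrapolate(TOY_ONE_STEP, 250)
```

Each row of the extrapolated matrix still sums to 1, and as n grows the rows approach a common distribution, which is the behaviour a test of the Markov model has to compare against observed substitution data.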