Bioinformatics - Department of Computer Science

Computer Science 286
Spring 2002
Bioinformatics Projects
The project will involve some concentrated work in a research area
related to Bioinformatics. Some of these projects might lead to master's
theses or writing projects. Students are encouraged to develop a project
that has some relationship to their own interests, but a list of suggested
projects is attached to help get started. The key idea is to be creative
either in developing a new algorithm or in implementing an existing one.
Results, whether good or bad, should be compared and contrasted with
the literature, or with other existing algorithms. Some possibilities are
listed below. You can pick a project from the list, modify a project to suit
your interests, or invent your own.
The two-page written proposal from each group is due on October 16,
2003. The proposal should give a clear description of the project and
should contain absolutely no generalities and definitions that are from
the notes or from the book. Clearly state what you are planning to do and
clearly explain how you plan to achieve it. It is important to get started
as early as possible.
The project may be developed for any available platform. You can use C,
C++, Java or Perl. The World Wide Web contains many programs for
Bioinformatics. You may use and modify such code provided suitable
acknowledgements and citations are made. In other words, all material
taken from other sources must be completely and properly acknowledged
and cited.
The projects are due on Wednesday, December 2, 2003, at the beginning
of the lecture. All final reports should consist of both a technical project
description with appendices and references, and a disk containing the
source code written for the project and the PowerPoint presentation. The
presentations will be given on Saturday, December 6, 2003 and/or
during class time, towards the end of the semester (dates to be
announced later).
A good project report will include the following:
Background of the problem (literature search).
A clear definition of the problem.
A description and justification of the data sources.
An explanation and justification of methods of data analysis.
A description of the results of the analysis.
Conclusions based on the results.
© 2002 by Sami Khuri
Possible directions for future research.
A full list of references.
You should aim to keep your report to within 10 pages if it is a
programming project and to 30 pages if it is a non-programming project.
Please do not include a hard copy of the source code.
List of Suggested Projects
A - Programming Projects
1. DP Pairwise Comparison Algorithm
In this project you are to implement a dynamic programming
algorithm that does pairwise comparison. Your program should have
three options:
Local comparison
Global comparison
Semiglobal comparison
Read Section 3.2 [SM97] for a clear explanation of all three options
including procedures Similarity and Align on pages 52 and 53,
respectively. Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
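As a rough illustration of how the three options differ, the following Python sketch computes only the similarity score under a simple linear gap penalty. The function name alignment_score and the default scores (match 1, mismatch -1, gap -2) are assumptions made for this example and are not taken from [SM97]; recovering the alignment itself (traceback) is omitted, and Python is used only for brevity.

```python
def alignment_score(s, t, mode="global", match=1, mismatch=-1, gap=-2):
    """Best alignment score of s and t under a linear gap penalty.

    mode is "global", "semiglobal" (end gaps in either sequence are free)
    or "local" (best-scoring pair of substrings).
    """
    if mode not in ("global", "semiglobal", "local"):
        raise ValueError("unknown mode: " + mode)
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]

    # Only the global variant charges for gaps at the start.
    if mode == "global":
        for i in range(1, n + 1):
            a[i][0] = i * gap
        for j in range(1, m + 1):
            a[0][j] = j * gap

    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            value = max(a[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                        a[i - 1][j] + gap,      # s[i-1] against a space
                        a[i][j - 1] + gap)      # t[j-1] against a space
            if mode == "local":
                value = max(value, 0)           # an alignment may start anywhere
            a[i][j] = value
            best = max(best, value)

    if mode == "global":
        return a[n][m]
    if mode == "semiglobal":
        # Gaps at the end are free: take the best of the last row and column.
        return max(max(a[n]), max(row[m] for row in a))
    return best


print(alignment_score("ACGT", "TTACGTTT", mode="global"))      # -4
print(alignment_score("ACGT", "TTACGTTT", mode="semiglobal"))  # 4
```

The only differences between the three modes are the initialization of the first row and column, the optional clamp at zero, and where the final score is read from the matrix.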
2. Two-Gap Value DP Pairwise Comparison Algorithm
In this project you are to implement a dynamic programming
algorithm that does pairwise comparison. Your program should allow
two gap penalties: one for starting a gap and one for extending a gap.
Your program should have two options:
Local comparison
Global comparison
The program should allow the user to enter gap penalties.
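One common way to handle separate gap-opening and gap-extension penalties is to keep three score tables, one for each way an alignment can end. The Python sketch below does this for the global option only and returns just the score. It assumes that a gap of length k costs gap_open + k * gap_extend; the function name, parameter names and default values are illustrative choices, not part of the assignment.

```python
NEG = float("-inf")  # marks table entries that cannot be reached


def affine_global_score(s, t, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Global alignment score where a gap of length k costs
    gap_open + k * gap_extend."""
    n, m = len(s), len(t)
    # a: best score with s[i-1] aligned to t[j-1]
    # b: best score ending with t[j-1] aligned to a space (gap in s)
    # c: best score ending with s[i-1] aligned to a space (gap in t)
    a = [[NEG] * (m + 1) for _ in range(n + 1)]
    b = [[NEG] * (m + 1) for _ in range(n + 1)]
    c = [[NEG] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(gap_open + gap_extend * j)
    for i in range(1, n + 1):
        c[i][0] = -(gap_open + gap_extend * i)

    start = gap_open + gap_extend  # cost of the first space of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = sub + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - start,
                          b[i][j - 1] - gap_extend,  # extend the open gap
                          c[i][j - 1] - start)
            c[i][j] = max(a[i - 1][j] - start,
                          b[i - 1][j] - start,
                          c[i - 1][j] - gap_extend)  # extend the open gap
    return max(a[n][m], b[n][m], c[n][m])


print(affine_global_score("ACGTACGT", "ACGT"))  # -2: four matches, one gap of length 4
```

Setting gap_open to 0 reduces this to the ordinary linear gap penalty, which gives a convenient check against a simpler implementation.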
3. Optimal Linear Space DP for Pairwise Comparison
In this project you are to implement a dynamic programming
algorithm that does pairwise comparison in linear space. The basic
algorithm is of quadratic complexity. With respect to space, it is
possible to improve the complexity from quadratic to linear. The
algorithm is described in Section 3.3.1 “Space Saving” of Section 3.3:
“Extensions of the Basic Algorithms” [SM97]. Your program should
have three options:
Local comparison
Global comparison
Semiglobal comparison
Read Sections 3.2 and 3.3.1 [SM97] for a clear explanation of all three
options, including procedures BestScore and Align on pages 59 and
61, respectively. Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
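The space saving rests on the observation that each row of the score matrix depends only on the row above it. The Python sketch below keeps a single array and returns the global similarity score. It is an illustration only: the function name and default scores are assumptions, and recovering the alignment in linear space requires an additional divide-and-conquer step that is not shown here.

```python
def best_score_linear_space(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score using one array of length min(|s|, |t|) + 1."""
    # The scoring is symmetric, so put the shorter sequence along the array.
    if len(t) > len(s):
        s, t = t, s
    m = len(t)
    row = [j * gap for j in range(m + 1)]  # row 0 of the full matrix
    for i in range(1, len(s) + 1):
        diag = row[0]      # value of cell (i-1, 0)
        row[0] = i * gap   # value of cell (i, 0)
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            up = row[j]    # still holds cell (i-1, j)
            row[j] = max(diag + sub, up + gap, row[j - 1] + gap)
            diag = up      # becomes cell (i-1, j) for the next column
    return row[m]


print(best_score_linear_space("ACGT", "AGT"))  # 1
```

Running time is still proportional to the product of the two lengths; only the memory use drops from quadratic to linear.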
4. K-Band DP for Pairwise Comparison
If two sequences are similar, the best alignments have their paths
near the main diagonal. To compute the optimal score and alignment
it is not necessary to fill the entire matrix. A narrow band around the
main diagonal should suffice. In this project you are to implement a
dynamic programming algorithm that does pairwise comparison and
uses the K-Band procedure described in Section 3.3.4 [SM97]. Your
program should have three options:
Local comparison
Global comparison
Semiglobal comparison
Read Sections 3.2 and 3.3.4 [SM97] for a clear explanation of all three
options. Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
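The band idea can be sketched as follows for the global option, under the simplifying assumption that both sequences have the same length n. Only cells with |i - j| <= k are filled, and k is doubled until the band score is at least as good as any alignment that leaves the band could be. Such an alignment needs at least k + 1 spaces in each sequence, so it has at most n - k - 1 matched columns. The function names and default scores are assumptions for illustration, not the procedure as printed in [SM97].

```python
NEG = float("-inf")  # marks cells outside the band or not reachable


def k_band_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best global score over alignments that stay within |i - j| <= k."""
    n = len(s)
    if len(t) != n:
        raise ValueError("this sketch assumes sequences of equal length")
    width = 2 * k + 1
    # Row i is stored as a list indexed by d = j - i + k.
    prev = [NEG] * width
    for j in range(0, min(k, n) + 1):
        prev[j + k] = j * gap  # row 0
    for i in range(1, n + 1):
        cur = [NEG] * width
        for d in range(width):
            j = i + d - k
            if j < 0 or j > n:
                continue
            if j == 0:
                cur[d] = i * gap
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            best = prev[d] + sub                     # cell (i-1, j-1)
            if d + 1 < width:
                best = max(best, prev[d + 1] + gap)  # cell (i-1, j)
            if d - 1 >= 0:
                best = max(best, cur[d - 1] + gap)   # cell (i, j-1)
            cur[d] = best
        prev = cur
    return prev[k]  # cell (n, n)


def banded_global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Double k until the band is provably wide enough."""
    n = len(s)
    k = 1
    while True:
        score = k_band_score(s, t, k, match, mismatch, gap)
        # Upper bound on the score of any alignment that leaves the band.
        outside_bound = match * (n - k - 1) + 2 * (k + 1) * gap
        if k >= n or score >= outside_bound:
            return score
        k *= 2


print(banded_global_score("ACGTACGT", "CGTACGTA"))  # 3
```

For similar sequences the loop stops at a small k, so far fewer than n * n cells are computed.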
5. The Original BLAST
Basic Local Alignment Search Tool (BLAST) was first published in
1990 [AGM+90]. This project consists in reading, understanding and
implementing the algorithm presented in the article and comparing it
to an existing package.
[AGM+90] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman,
D. Basic Local Alignment Search Tool. Journal of Molecular Biology,
215: 403-410 [1990].
6. PSI-BLAST
Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST)
was first published in 1997 [AMS+97]. This project consists in
reading, understanding and implementing PSI-BLAST as described in
the article and comparing it to an existing package.
[AMS+97] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D. Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Research, 25, 17: 3389-3402 [1997].
7. FASTA and FASTP
W. Pearson and D. Lipman developed FASTA, which provides a rapid
way of finding short stretches of similar sequences between a new
sequence and any sequence in a database [PL88]. Pearson continued
to improve the FASTA method for similarity searches in sequence
databases [Pea90], [Pea96]. This project consists in choosing one of
the three FAST algorithms from one of the three referenced articles,
reading, understanding and implementing it as described in the
article and comparing it to an existing package.
[PL88] Pearson, W. and Lipman, D. Improved tools for biological
sequence comparisons. Proc. Natl. Acad. Sci. 85: 2444-2448 [1988].
[Pea90] Pearson, W. Rapid and sensitive sequence comparison with
FASTP and FASTA. Methods Enzymol. 183: 63-98 [1990].
[Pea96] Pearson, W. Effective protein sequence comparison. Methods
Enzymol. 266: 227-258 [1996].
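To give a feel for how short stretches of similarity can be found quickly, the Python sketch below indexes every word of length ktup in the query and counts, for each diagonal of the comparison matrix, how many identical words the target shares with it. This covers only a first lookup stage in the spirit of the FAST programs; the rescoring and alignment stages described in the referenced articles are not shown, and the function and parameter names are assumptions for this example.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, target, ktup=2):
    """Count identical words of length ktup on each diagonal.

    A word at query position i and target position j lies on
    diagonal j - i.
    """
    # Lookup table: word -> list of positions where it occurs in the query.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    hits = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ACGT", "TTACGT", ktup=2)
print(hits.most_common(1))  # [(2, 3)]: three shared words on diagonal 2
```

Diagonals with many hits mark candidate regions of similarity, which can then be examined with a more expensive alignment method.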
8. CLUSTAL W
CLUSTAL W is a commonly used package for multiple sequence
alignment. It was published in 1994 [THG94]. This project consists in
reading, understanding and implementing the algorithm presented in
the article and comparing it to an existing package.
[THG94] Thompson, J., Higgins, D., and Gibson, T. CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Res. 22: 4673-4680 [1994].
9. Multiple Sequence Alignment and Genetic Algorithms
The dynamic programming approach to the MSA problem has
exponential time complexity. Genetic algorithms are of considerable
interest to researchers because they can find high-scoring alignments
as good as those found by other methods. This project consists in
implementing your own genetic algorithm for the MSA problem and
comparing it to SAGA [NH96] or [ZW97].
[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by
Genetic Algorithm (SAGA). Nucleic Acids Research, 24, 8:
1515-1524 [1996].
[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple
Sequence Alignment. Comput. Appl. Biosci., 13: 565-581 [1997].
[Whi93] Whitley, D. A Genetic Algorithm Tutorial [1993].
[Khu97] Khuri, S., Genetic Algorithm Workshop [1997].
[CW96] Corcoran, A.L., Wainwright, R.L. LIBGA: A User-friendly
workbench for order-based genetic algorithm research [1996].
10. Fragment Assembly
With current technology, it is impossible to sequence directly
contiguous DNA stretches of more than a few hundred bases.
Typically, several copies of a long DNA molecule are cut into random
pieces. The resulting sequencing task is called fragment assembly, which
consists in reconstructing the original sequence from the fragments [SM97].
This project consists in choosing either the greedy algorithm
described in Section 4.3.4 or one of the heuristics described in
Section 4.4. Read, understand and implement the algorithm you
chose and compare its performance to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
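A greedy strategy for fragment assembly can be sketched in a few lines of Python: repeatedly merge the two fragments with the largest suffix-prefix overlap until one sequence remains. This sketch makes strong simplifying assumptions that a real project would have to relax: overlaps must be exact (no sequencing errors), fragments all come from the same strand, and ties are broken arbitrarily. The function names are hypothetical and the code is not the algorithm as printed in [SM97].

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Merge fragments greedily by largest exact overlap."""
    unique = set(fragments)
    # Fragments contained in another fragment add no information.
    frags = sorted(f for f in unique
                   if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # With overlap 0 the two fragments are simply concatenated.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ATTAGACCTGCCGGAATAC
```

The greedy choice is not guaranteed to give the shortest common superstring, which is one reason to compare results against an existing package.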
11. Physical Mapping of DNA
Physical mapping is the process of determining the location of certain
markers (landmarks) of a DNA molecule. The markers are generally
small but precisely defined sequences. The resulting maps are used as
basis for DNA sequencing, and for the isolation and characterization
of individual genes or other DNA regions of interest. For example, see
Figure 5.1 [SM97, page 144].
This project consists in choosing one of the algorithms described in
Sections 5.2, 5.3, 5.4, and 5.5, reading it, understanding it,
implementing it and comparing its performance to an existing
package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
12. Phylogenetic Trees
The reconstruction of phylogenetic trees is a general problem in
biology. It is used in molecular biology to help understand the
evolutionary relationships among proteins, for example. [SM97]
This project consists in choosing one of the four algorithms
mentioned below, choosing the appropriate referenced article(s),
reading, understanding and implementing it as described in the
article and comparing it to an existing package.

Phylogenetic Trees Based on Pairwise Distances [FD96]

Phylogenetic Trees Based on Neighbor Joining [SN87]

Phylogenetic Trees Based on Maximum Parsimony [Fel96]

Phylogenetic Trees Based on Maximum Likelihood Estimation
[BT86], [Fel81].
[BT86] Bishop, M. and Thompson, E. Maximum likelihood alignment
of DNA sequences. Journal of Molecular Biology. 190:159-165. [1986].
[FD96] Feng, D. and Doolittle, R. Progressive alignment of amino acid
sequences and construction of phylogenetic trees from them. Methods
Enzymol. 266 [1996].
[Fel81] Felsenstein, J. Evolutionary trees from DNA sequences: A
maximum likelihood approach. Journal of Molecular Evolution.
17:368-376. [1981].
[Fel96] Felsenstein, J. Inferring phylogeny from protein sequences by
parsimony, distance and likelihood methods. Methods Enzymol. 266
[1996].
[SN87] Saitou, N. and Nei, M. The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Molecular Biology and
Evolution, 4: 406-425 [1987].
13. Gene Prediction
Gene prediction consists in identifying regions of genomic DNA that
encode proteins.
Some of the existing models that identify and distinguish coding
regions from non-coding regions are based on:
Hidden Markov Model (HMM),
Neural Network,
Probabilistic model,
Linear discriminant analysis,
Decision tree classification,
Quadratic discriminant analysis.
This project consists in choosing one of the above techniques and
implementing the prediction (search) algorithm, which will be able to
search a given database for genes that do code for proteins. The
algorithm will be compared to an existing package.
14. The Protein Prediction Problem
The main goal in the protein prediction problem is to determine the
three-dimensional structure of a protein based on its amino acid
sequence. Recall that there are three levels at which to look at protein
structure:
The primary structure is the sequence of amino acids in the chain
i.e., a one-dimensional structure.
The secondary structure is the result of the folding of parts of the
amino acid chain. The two most important secondary structures
are the -helix, and the -strand.
The tertiary structure is the real 3-dimensional configuration of the
protein under given environmental conditions (solvent, pH and
temperature).
The tertiary structure decides the biochemical function of the protein.
If the tertiary structure is changed, the protein normally loses its
ability to perform whatever function it has, since this function
depends on the geometrical shape of the active site in the interior of
the molecule.
This project consists in choosing an existing algorithm for protein
prediction, implementing it, and comparing it to a package currently
in use.
15. The RNA Structure Prediction Problem
Unlike DNA, which most frequently assumes its well-known
double-helical conformation, the three-dimensional structure of
single-stranded RNA is determined by the sequence of nucleotides in much
the same way that protein structure is determined by sequence. RNA
structure, however, is less complex than protein structure and can be
well characterized by identifying the location of commonly occurring
secondary structure elements. [KR03]
This project consists in choosing an existing algorithm for RNA
structure prediction, implementing it, and comparing it to a package
currently in use.
[KR03] Krane, D. and Raymer, M. Fundamental Concepts of
Bioinformatics [2003].
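One of the simplest ways to approach RNA secondary structure is to look for the largest set of nested base pairs, a dynamic programming scheme usually attributed to Nussinov. The Python sketch below is offered only as an example of the kind of algorithm a student might start from; it is not taken from [KR03]. It counts pairs rather than computing energies, and the function name, the set of allowed pairs (including G-U) and the minimum hairpin loop of three unpaired bases are assumptions of the example.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in rna.

    Paired positions k < j must enclose at least min_loop unpaired
    bases, that is, j - k > min_loop.
    """
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    # case 2: j pairs with k, splitting the interval in two
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3: a stem of three pairs around the AAA loop
```

Packages in current use typically minimize free energy instead of maximizing the number of pairs, so the two approaches can disagree on real sequences.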
16. The Clustering Gene Expression Problem
Analysis of gene expression patterns can provide insight into
relationships between a gene and its function. Clustering techniques
applied to gene expression data partition genes into clusters (groups)
based on their expression patterns. Genes in the same cluster will
have similar expression patterns, while genes in different clusters will
have distinct expression patterns.
This project consists in choosing an existing clustering
algorithm (such as CLICK [SS00]), implementing it, and comparing it
to a package currently in use [BSY99].
[BSY99] Ben-Dor, A., Shamir, R. and Yakhini, Z. Clustering gene
expression patterns. Journal of Computational Biology, 6: 281-297
[1999].
[SS00] Sharan, R., and Shamir, R. CLICK: A clustering algorithm with
applications to gene expression analysis. Proceedings of the 8th
International Conference on Intelligent Systems for Molecular Biology,
307-316. [2000].
Visualization Tools for Bioinformatics Algorithms
The main purpose of the projects in this category is to present a
visualization tool to assist students in learning algorithms related to
Bioinformatics. One should be able to use the interactive, user-friendly,
educational tool you are to develop, in classroom
demonstrations, hands-on laboratories, self-directed work outside of
class, and distance learning. The visualization package will include
background material, a detailed explanation of the algorithm, with
examples, and quizzes and exercises. Using Java is highly
recommended.
17. Needleman and Wunsch’s DP Algorithm
Design and implement a visual interactive software package to
demonstrate how Needleman and Wunsch’s dynamic programming
algorithm is applied to solve protein and genomic sequence alignment
problems.
18. CLUSTAL W Algorithm
Design and implement a visual interactive software package to
demonstrate how the CLUSTAL W algorithm is applied to align
multiple sequences.
19. Phylogenetic Tree Construction
Choose a phylogenetic construction algorithm. Design and implement
a visual interactive software package to demonstrate how the
algorithm you selected constructs a phylogenetic tree.
B - Non-Programming Projects
Survey Papers
20. A Comprehensive Survey and Comparison of
Multiple Sequence Alignment Programs
In this project, you are to choose 5 MSA programs. Describe each one
in detail and compare their performances. The comparison should
include several runs with data from various databases.
21. A Comprehensive Survey and Comparison of Protein
Structure Visualization Tools
In this project, you are to choose 8 protein structure visualization
programs. Describe each one in detail and use them to visualize the
structure of different proteins. The project should include screen
dumps.
22. DNA Computing
Write a survey paper on the state of the art in DNA computing,
highlighting new directions.
23. Bayesian Statistical Methods for Sequence
Alignment and Evolutionary Distance Estimation
Write a survey paper based on the results published by Agarwal and
States in 1996 [AS96], and Zhu et al. in 1998 [ZLL98]. For further
reading, a Bayesian bioinformatics tutorial by C. Lawrence is available
on the Internet [Law01]. Divide your paper into the following four
parts:
(a) Explain what Bayesian statistics is.
(b) Explain how Bayesian statistical methods can be applied to
sequence analysis.
(c) Explain how Bayesian statistical methods can be applied to
evolutionary distance estimation.
(d) Describe the most common Bayesian sequence alignment
algorithms.
Non-Survey Projects
24. PAM and BLOSUM Substitution Matrices
Explain how various PAM or BLOSUM substitution matrices were
constructed. Discuss the differences and similarities between the PAM
and BLOSUM matrices. Describe applications of each of the matrices.
25. Amino Acid Scoring Matrices
The most commonly used substitution scoring matrices are PAM and
BLOSUM. Choose two other amino acid scoring matrices. Describe
how they were constructed. Discuss the differences and similarities
between these matrices. Describe applications of each of the matrices.
26. Test of Markov Model of Evolution in Proteins
In 1985, Wilbur tested the Markov model of evolution and showed
that it may be applicable if certain changes are made in the way the
PAM matrices are calculated [Wil85]. Write a research paper
describing how the tests were done, which conclusions were drawn
from the tests and the extension of the original paper by George et al.
in 1990 [GBH90].