
CISG 5203: Algorithms in Computational Biology
Spring 2009
Project
Due Date: May 4th, 2009
The goal of the project is to engage students in hands-on experiments in different
aspects of computational biology. Students can work alone on the project, or they
can form a group of 2. A student can also propose a project of his/her own as
long as the project is related to computational biology or bioinformatics.
Recommended Projects
A list of recommended projects is given below, with goals and some basic steps for
each. You can choose one of them to work on based on your interest. The
deliverables are a final report, due May 4th, 2009, and a final presentation,
scheduled for the week before final exam week.
Description of recommended projects:
1. Sequence alignment web tools
Goals: Design a web tool (e.g., a CGI program or PHP script) for global and local
sequence alignment. Your program should contain the following components:
a) A web user interface that allows the user to enter two sequences, choose among
different scoring matrices (e.g., PAM250, BLOSUM62, etc.), and choose different
gap-opening and gap-extension penalties. You may also allow the user to input
sequences from a file (in FASTA format).
b) Your program should have both global and local alignment functionalities.
The deliverable will be a web program that you can demonstrate during final
presentation week. (If you don’t have a web account, you can give me your
program and I can upload it to my web site.)
For reference, you can visit the following alignment web site at
http://www.ebi.ac.uk/Tools/emboss/align/index.html
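As a starting point for the alignment core, the following is a minimal sketch of the dynamic programming behind global (Needleman-Wunsch) and local (Smith-Waterman) alignment scoring. It is a simplification, not the full project: it assumes a fixed match/mismatch score in place of a PAM or BLOSUM matrix and a single linear gap penalty in place of separate gap-opening and gap-extension penalties. The function name and parameters are hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b under a linear gap penalty.

    local=False scores a global alignment (Needleman-Wunsch);
    local=True scores the best local alignment (Smith-Waterman).
    The match/mismatch/gap values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                        H[i - 1][j] + gap,      # gap in b
                        H[i][j - 1] + gap)      # gap in a
            if local:
                score = max(score, 0)           # a local alignment may restart
                best = max(best, score)
            H[i][j] = score
    return best if local else H[-1][-1]
```

Replacing the `match`/`mismatch` lookup with a scoring-matrix lookup and adding a traceback to recover the aligned sequences would be the natural next steps; affine gap penalties need additional matrices.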
2. Metagenome Analysis
Metagenomics is the study of the genomic content of a sample of organisms
obtained from a common habitat using targeted or random sequencing. Goals
include understanding the extent and role of microbial diversity. The taxonomical
content of such a sample is usually estimated by comparison against DNA and
protein sequence databases of known sequences. Most published studies employ
the analysis of paired-end reads, complete sequences of environmental fosmid
and BAC clones, or environmental assemblies. Emerging very-high-throughput
sequencing technologies are paving the way to low-cost random shotgun
approaches. In this project, you will use MEGAN (MetaGenome Analysis Software:
http://www-ab.informatik.uni-tuebingen.de/software/megan) to perform data
analysis. Several high-profile journal articles published recently used MEGAN for
their data analysis. The experimental data can be downloaded at:
http://www-ab.informatik.uni-tuebingen.de/software/megan/welcome.html#Publications_Using_the_Software .
The data sets are usually on the scale of gigabytes.
3. Gene Prediction
Open reading frames (ORFs) can be predicted by a combination of in silico ORF
predictions and synteny-based ORF mapping from annotated genomes to an
unannotated genome. The following is a general procedure for gene prediction.
a. Use the programs GeneMark and Glimmer3 to predict in silico ORFs (putative gene
locations).
GeneMark uses a species-specific inhomogeneous Markov model to calculate the
probability that a given segment of the sequence is gene-encoding. GeneMark
was developed by Mark Borodovsky’s group (Borodovsky & McIninch, Comp.
Chem., 1993, 17, 123-133).
Glimmer3 is an update of Glimmer2, which uses interpolated Markov models to
identify coding regions and distinguish them from noncoding DNA, especially
in the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by
Steven Salzberg’s group.
b. Use a synteny-based approach to map ORFs from annotated genomes to the
unannotated genome.
c. Predict ORFs by comparison of in silico ORFs and mapped ORFs with hits to
Pfam, the top blast hits against the non-redundant protein database, and/or
published mass spectrometry peptide data.
The details of the procedure and sample genomic sequences are available at
http://www.broad.mit.edu/annotation/genome/tbdb/GeneFinding.html#q1
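To make the notion of an in silico ORF concrete, here is a naive sketch that scans all six reading frames for stretches running from an ATG start codon to the first in-frame stop codon. This is only an illustration of what an ORF is; it is not how GeneMark or Glimmer3 work (both use Markov models of coding sequence), and the function names, the `min_codons` cutoff, and the coordinate convention are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept as-is)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=30):
    """Return (strand, start, end) for each ATG...stop ORF.

    Coordinates are 0-based and end-exclusive, relative to the scanned
    strand (for "-" they refer to the reverse complement). The stop codon
    is included in the span but not counted toward min_codons.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGCC", min_codons=3)` reports a single forward-strand ORF spanning positions 2 to 14.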
4. Curating a biological data set
Goals: Create a curated dataset that contains at least three of the following
components:
a) All related sequences sharing a common function (Homologous Sequences)
b) All substantial motifs
c) Evolutionary history
d) Structural information
e) Experimental information
Suggested research process:
In order to do this project, the following steps are recommended to reach the
final goal:
Step 1: Submit three candidate families for your course project.
Step 2: Collect an initial set of sequences, generate a multiple sequence
alignment.
Step 3: Improve the quality of your alignment, and identify additional family
members.
Step 4: Add structural and/or evolutionary information, and give a final report.
Background knowledge
You need to familiarize yourself with some biological sequence analysis tools that
are listed below:
a) Multiple sequence alignment tools: T-Coffee, ClustalW, ProbCons.
b) Multiple sequence alignment & shading utility: GeneDoc
c) Pattern identification and matching tools: MEME, Profile, HMMER.
d) Phylogenetic analysis tools: PHYLIP, Notung.
e) Tools to visualize biological molecules: RasMol, VMD.
Examples of Some Protein Super-families
a) PIRSF002383 (D1 dopamine receptor-interacting protein or calcyon)
This protein super-family consists of 25 protein members. Defective
calcyon proteins have been implicated in both attention-deficit/hyperactivity
disorder (ADHD) and schizophrenia.
b) PIRSF001837 (Leptin, obesity protein)
This protein super-family consists of 25 protein members and 36
fragments. You can remove the fragments when you perform multiple
sequence alignment to improve efficiency. The name leptin derives from
the Greek word leptos, meaning thin. Leptin is an appetite suppressant: it
stops you from eating too much and makes you more active, so you burn
off more energy. Eating in harmony with leptin is essential for healthy
metabolism, especially as a person grows older and begins to struggle
with weight around the midsection.
c) PIRSF038235 (cAMP response element-binding protein)
This protein super-family consists of 68 protein members. CREB stands for
cyclic AMP response element-binding protein. The CREB protein was found to
regulate brain function during development and learning. The protein is
also involved in the process of alcohol tolerance, dependence, and
withdrawal symptoms. Experimentally, rats deficient in CREB drank more
alcohol.
You can search the Protein Information Resource (PIR) web site for a protein
super-family of interest for your course project.
5. Adaptive Gene Selection for Multi-class Cancer Classification using
Semi-supervised Learning Approach (Difficulty level: 4)
In this project, you will learn to use some machine learning and some
biostatistical techniques to do multi-class cancer classification.
Goals: Find a set of genes with minimum redundancy that can discriminate
different types (more than 3 subtypes) of cancer cells based on the gene
expression data.
Suggested Research Process:
a) Download cancer cell gene expression data.
b) Normalize the data.
c) Perform a first round statistical analysis to get a list of highly differentially
expressed genes.
d) Classify the data set using a machine learning method such as
Support Vector Machines, Neural Networks, or Random Forests. You can
choose one based on your research interest.
e) Use a semi-supervised learning approach and an ensemble technique to classify
different types of cancer cells based on their gene expression profiles.
f) Perform cross-validation on your methodology and give a final report.
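Step c) can be sketched as follows, assuming a one-way ANOVA F statistic is used to rank genes across the cancer subtypes. This is one possible choice of statistic, not a prescribed one, and it does not address redundancy between the selected genes. The function names and the data layout (a dict mapping each gene to a list of expression values, one per sample) are hypothetical.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across sample classes.

    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    between = 0.0  # variation of class means around the grand mean
    within = 0.0   # variation of samples around their own class mean
    for members in groups.values():
        group_mean = mean(members)
        between += len(members) * (group_mean - grand_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in members)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))

def rank_genes(expression, labels, top=10):
    """Return the `top` gene names with the largest F statistic."""
    scores = {gene: f_statistic(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

The genes returned by `rank_genes` would then serve as input features for the classifier in step d).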
Background knowledge:
a) Machine learning techniques (Support Vector Machine (SVM), neural networks,
random forests, discriminant analysis, ensemble technique, semi-supervised
learning etc.)
b) Elementary statistics (t-test, ANOVA, principal component analysis, factor
analysis).
c) Microarray technology, gene expression profiles.
d) ROC curve for binary learning, confusion matrix.
e) R/S-Plus statistical programming language.
6. Identify Common Substructures in Proteins (Difficulty level: 4)
The goal of this project is to write a program to identify common substructures
between two proteins. First, you will need a way of superimposing two 3-D
structures such that the distance between them is minimized. This minimized
distance is called the RMSD (root-mean-square deviation).
Each protein structure is represented as a point set. Suppose point sets PA = {a1, a2, …,
an} and PB = {b1, b2, …, bn} represent the structures of proteins A and B. To align
these two structures, first move the center of mass of each protein to the origin.
This is a translation step, after which PA and PB are represented by the matrices
A and B, respectively. The next step is to rotate matrix B onto A. This is called the
orthogonal Procrustes problem, for which Schönemann proposed a generalized
solution. The algorithm to align structures A and B is as follows:
Input: Matrix A, B
Output: Rotation matrix Q that rotates B onto A, and the minimum distance
between A and B after transformation.
Procedure align(A, B)
begin
    Move mass center of A to origin
    Move mass center of B to origin
    C := B'A
    Compute the SVD: [U, S, V] := svd(C)
    Q := U*V'
    ||A-BQ||^2 := trace(A'A) + trace(B'B) - 2*trace(Q'B'A)
    RMS := SQRT(||A-BQ||^2 / N)
    Output rotation matrix Q
    Output RMS
end
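The procedure above can be sketched in Python as follows. This is a minimal sketch, assuming NumPy is available, that A and B are N x 3 arrays of corresponding points, and that the "mass center" is the unweighted centroid (all atoms given equal mass). The function name mirrors the pseudocode; everything else is an illustrative choice.

```python
import numpy as np

def align(A, B):
    """Superimpose B onto A; return (Q, rms).

    A and B are N x 3 arrays of corresponding points. Q is the orthogonal
    matrix minimizing ||A - BQ|| after both sets are centered.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A = A - A.mean(axis=0)           # move centroid of A to the origin
    B = B - B.mean(axis=0)           # move centroid of B to the origin
    C = B.T @ A
    U, S, Vt = np.linalg.svd(C)      # NumPy returns V transposed (Vt = V')
    Q = U @ Vt
    sq_dist = (np.trace(A.T @ A) + np.trace(B.T @ B)
               - 2.0 * np.trace(Q.T @ B.T @ A))
    rms = np.sqrt(max(sq_dist, 0.0) / len(A))  # guard against tiny negative round-off
    return Q, rms
```

Note that the orthogonal Procrustes solution may return a Q with determinant -1, i.e. a reflection rather than a pure rotation. If mirror images must be excluded, a sign correction on the smallest singular direction (as in the Kabsch algorithm) would be needed.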
RMSD is the norm of the distance vector between two data sets, provided that they
have been optimally superimposed:
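\[
\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert x(i) - y(i) \rVert^{2}}
\]
where x(i) and y(i) are the coordinates of the i-th pair of corresponding points and
N is the number of pairs.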
A window searching (fragment matching) procedure is used to detect common
substructures in proteins. To facilitate the computation, we first determine a
set of candidate shifts using the size of the shorter protein as our window size. We
use contiguous alignment, which is simpler than the standard biological notion of
sequence alignment (which allows for insertions and deletions in a sequence). By
shifting the proteins relative to one another, we compute the corresponding RMSD
score of the equivalent residue pairs for each shift. Typically only the 10-20 best
shifts provide any useful information. The pseudocode to determine such a set of
shifts is as follows:
Input: long protein of length n, short protein of length m.
Output: k(=20) shifts with the least RMSD scores for aligning the two
proteins.
Procedure getShifts(longprotein, n, shortprotein, m)
begin
    for i := 1 to n-m+1 do
        calculate RMSD(longprotein(i:i+m-1), shortprotein(1:m))
        store the RMSD value and the shift i
    end
    output the k shifts with the least RMSD scores
end
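The shift search can be sketched as below, again assuming NumPy and N x 3 coordinate arrays. The helper `superposed_rmsd` is a hypothetical compact version of the align procedure that returns only the distance. Shifts are 0-based here, unlike the 1-based pseudocode.

```python
import numpy as np

def superposed_rmsd(A, B):
    """RMSD of two equal-length point sets after optimal superposition."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    # trace(Q'B'A) for the optimal Q equals the sum of singular values of B'A.
    s = np.linalg.svd(B.T @ A, compute_uv=False)
    sq_dist = (A ** 2).sum() + (B ** 2).sum() - 2.0 * s.sum()
    return float(np.sqrt(max(sq_dist, 0.0) / len(A)))

def get_shifts(long_protein, short_protein, k=20):
    """Return the k (shift, rmsd) pairs with the smallest RMSD."""
    long_protein = np.asarray(long_protein, dtype=float)
    short_protein = np.asarray(short_protein, dtype=float)
    n, m = len(long_protein), len(short_protein)
    scored = []
    for i in range(n - m + 1):       # every window that fits inside the long protein
        window = long_protein[i:i + m]
        scored.append((superposed_rmsd(window, short_protein), i))
    scored.sort()
    return [(shift, score) for score, shift in scored[:k]]
```

The returned shifts are the candidates to examine with the smaller window.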
Then, for each of the candidate shifts, we identify the common substructures using a
smaller window. We arbitrarily use a window size of 10. We move the window from
the start residue to the end, one residue at a time. For each position of the small
window, the RMS distance is computed for the corresponding residues within the
window. The RMS value is then plotted against the window offset. Common
substructures are identified as regions of consistently “low” values in the plot. An
arbitrary threshold value is used to decide which values are “low”.
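Once the per-window RMS values have been computed, the "consistently low" regions can be picked out with a simple scan. The sketch below assumes the RMS values are already available as a list indexed by window offset. The function name and the `min_run` parameter (the number of consecutive low offsets required to count as consistent) are hypothetical.

```python
def low_regions(rms_values, threshold, min_run=3):
    """Return (start, end) offset ranges, end-exclusive, where the RMS
    value stays below `threshold` for at least `min_run` offsets in a row."""
    regions = []
    start = None
    for i, value in enumerate(rms_values):
        if value < threshold:
            if start is None:
                start = i                     # a low run begins here
        elif start is not None:
            if i - start >= min_run:
                regions.append((start, i))    # the run just ended
            start = None
    # Handle a run that extends to the final offset.
    if start is not None and len(rms_values) - start >= min_run:
        regions.append((start, len(rms_values)))
    return regions
```

Each reported offset range corresponds to a stretch of residues that superimposes well under the chosen shift.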
7. Medical Image Registration using Mutual Information
The goal of this project is to implement the mutual-information-based medical image
registration method that was taught in class. If you want to implement the algorithm
in MATLAB, I recommend using the Optimization Toolbox.
Input data: http://www.cs.nccu.edu/~gmilledge/CB2/Project/image.mat
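For reference, the mutual information of two images can be estimated from their joint intensity histogram. The sketch below is a generic illustration and may differ in details from the version presented in class. It assumes the two images have the same size and are given as flat sequences of intensities in the range [0, max_value). The function name, the bin count, and the intensity range are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(img_a, img_b, bins=16, max_value=256):
    """Mutual information (in bits) between two equal-sized images,
    estimated from the joint histogram of binned intensities."""
    if len(img_a) != len(img_b):
        raise ValueError("images must contain the same number of pixels")
    width = max_value / bins
    qa = [min(int(a / width), bins - 1) for a in img_a]   # bin index per pixel
    qb = [min(int(b / width), bins - 1) for b in img_b]
    n = len(qa)
    joint = Counter(zip(qa, qb))     # joint histogram counts
    count_a = Counter(qa)            # marginal histogram of image A
    count_b = Counter(qb)            # marginal histogram of image B
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written in terms of counts
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi
```

Registration then amounts to searching for the transformation parameters (for example, translation and rotation) of one image that maximize this value, which is where an optimization routine comes in.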