CISG 5203: Algorithms in Computational Biology
Spring 2009 Project
Due Date: May 4th, 2009

The goal of the project is to engage students in hands-on experiments in different aspects of computational biology. Students can work alone or form a group of two. A student can also propose a project of his/her own, as long as it is related to computational biology or bioinformatics.

Recommended Projects

The recommended projects are listed below, each with its goals and some basic steps. Choose one based on your interest. The deliverables are a final report, due May 4th, 2009, and a final presentation, scheduled for the week before final exam week.

Description of recommended projects:

1. Sequence alignment web tools

Goals: Design a web tool (e.g., a CGI program or PHP script) for global and local sequence alignment. Your program should contain the following components:
a) A web user interface that allows the user to enter two sequences, choose among scoring matrices (e.g., PAM250, BLOSUM62), and set the gap opening and gap extension penalties. You may also allow the user to input sequences from a file (in FASTA format).
b) Both global and local alignment functionality.

The deliverable is a web program that you can demonstrate during final presentation week. (If you don’t have a web account, you can give me your program and I can upload it to my web site.) For reference, see the alignment web site at http://www.ebi.ac.uk/Tools/emboss/align/index.html

2. Metagenome Analysis

Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity.
The taxonomic content of such a sample is usually estimated by comparison against databases of known DNA and protein sequences. Most published studies analyze paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging very-high-throughput sequencing technologies are paving the way to low-cost random shotgun approaches.

In this project, you will use MEGAN (MetaGenome Analysis Software: http://www-ab.informatik.uni-tuebingen.de/software/megan) to perform data analysis. Several recent high-profile journal articles used MEGAN for their data analysis. The experimental data can be downloaded at: http://www-ab.informatik.uni-tuebingen.de/software/megan/welcome.html#Publications_Using_the_Software . The data sets are usually on the scale of gigabytes.

3. Gene Prediction

Open reading frames (ORFs) can be predicted by combining in silico ORF predictions with synteny-based ORF mapping from annotated genomes to an unannotated genome. A general procedure for gene prediction is as follows.
a. Use the programs GeneMark and Glimmer3 to predict in silico ORFs (putative gene locations). GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky’s group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133). Glimmer3 is an update of Glimmer2, which uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA, especially in the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg’s group.
b. Use a synteny-based approach to map ORFs from annotated genomes to the unannotated genome.
c. Predict ORFs by comparing the in silico ORFs and mapped ORFs with hits to Pfam, the top BLAST hits against the non-redundant protein database, and/or published mass spectrometry peptide data.
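To make the notion of an in silico ORF concrete, the sketch below scans both strands of a DNA string for ATG-to-stop stretches in each reading frame. It is only a naive baseline, not what GeneMark or Glimmer3 do (neither Markov models nor synteny are involved). The function name `find_orfs`, the `min_codons` parameter, and the use of the standard stop codons are assumptions made for illustration.

```python
# Naive ORF scan: an illustrative baseline, not GeneMark or Glimmer3.
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumption)
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, dna) for each ATG...stop ORF found.

    start and end are 0-based, end-exclusive offsets on the scanned
    strand; min_codons counts the start and stop codons as well.
    """
    orfs = []
    forward = seq.upper()
    for strand, dna in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for pos in range(frame, len(dna) - 2, 3):
                codon = dna[pos:pos + 3]
                if start is None and codon == START_CODON:
                    start = pos                      # first ATG opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    end = pos + 3                    # include the stop codon
                    if (end - start) // 3 >= min_codons:
                        orfs.append((strand, start, end, dna[start:end]))
                    start = None                     # look for the next ORF
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
        print(orf)
```

Candidates produced this way would still need the comparison against Pfam, BLAST, or peptide evidence described in step c.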
The details of the procedure and sample genomic sequences are available at http://www.broad.mit.edu/annotation/genome/tbdb/GeneFinding.html#q1

4. Curating a biological data set

Goals: Create a curated data set that contains at least three of the following components:
a) All related sequences sharing a common function (homologous sequences)
b) All substantial motifs
c) Evolutionary history
d) Structural information
e) Experimental information

Suggested research process: The following steps are recommended to reach the final goal:
Step 1: Submit three candidate families for your course project.
Step 2: Collect an initial set of sequences and generate a multiple sequence alignment.
Step 3: Improve the quality of your alignment and identify additional family members.
Step 4: Add structural and/or evolutionary information, and give a final report.

Background knowledge: You need to familiarize yourself with the biological sequence analysis tools listed below:
a) Multiple sequence alignment tools: T-Coffee, ClustalW, ProbCons.
b) Multiple sequence alignment and shading utility: GeneDoc.
c) Pattern identification and matching tools: MEME, Profile, HMMER.
d) Phylogenetic analysis tools: PHYLIP, Notung.
e) Tools to visualize biological molecules: RasMol, VMD.

Examples of Some Protein Super-families
a) PIRSF002383 (D1 dopamine receptor-interacting protein or calcyon)
This protein super-family consists of 25 protein members. Defective calcyon proteins have been implicated in both attention-deficit/hyperactivity disorder (ADHD) and schizophrenia.
b) PIRSF001837 (Leptin, obesity protein)
This protein super-family consists of 25 protein members and 36 fragments. You can remove the fragments when you perform multiple sequence alignment to improve efficiency. The name leptin comes from the Greek word for thin. Leptin is an appetite suppressant: it stops you from eating too much and makes you more active, so you burn off more energy.
Eating in harmony with leptin is essential for healthy metabolism, especially as a person grows older and begins to struggle with weight around the midsection.
c) PIRSF038235 (cAMP response element-binding protein)
This protein super-family consists of 68 protein members. CREB stands for cyclic AMP responsive element-binding protein. The CREB protein was found to regulate brain function during development and learning. The protein is also involved in alcohol tolerance, dependence, and withdrawal symptoms. Experimentally, rats deficient in CREB drank more alcohol.

You can search the Protein Information Resource web site for a protein super-family of interest for your course project.

5. Adaptive Gene Selection for Multi-class Cancer Classification using Semi-supervised Learning Approach (Difficulty level: 4)

In this project, you will learn to use machine learning and biostatistical techniques to perform multi-class cancer classification.

Goals: Find a set of genes with minimum redundancy that can discriminate between different types (more than 3 subtypes) of cancer cells based on gene expression data.

Suggested Research Process:
a) Download cancer cell gene expression data.
b) Normalize the data.
c) Perform a first-round statistical analysis to get a list of highly differentially expressed genes.
d) Classify the data set using a machine learning method such as Support Vector Machines, Neural Networks, or Random Forest. You can choose one based on your research interest.
e) Use a semi-supervised learning approach and an ensemble technique to classify different types of cancer cells based on their gene expression profiles.
f) Cross-validate your methodology and give a final report.

Background knowledge:
a) Machine learning techniques (Support Vector Machines (SVM), neural networks, random forests, discriminant analysis, ensemble techniques, semi-supervised learning, etc.)
b) Elementary statistics (t-test, ANOVA, principal component analysis, factor analysis).
c) Microarray technology, gene expression profiles.
d) ROC curve for binary learning, confusion matrix.
e) R/S-Plus statistical programming language.

6. Identify Common Substructures in Proteins (Difficulty level: 4)

The goal of this project is to write a program that identifies common substructures between two proteins. First, you will need a way of superimposing two 3-D structures such that the distance between them is minimized. This minimized distance is called the RMSD (root mean square deviation).

A protein structure is represented as a point set. Suppose the point sets PA = {a1, a2, …, an} and PB = {b1, b2, …, bn} represent the structures of proteins A and B. To align these two structures, first move the center of mass of each protein to the origin. This is the translation step; PA and PB are translated into matrices A and B, respectively. The next step is to rotate matrix B into A. This is also called the orthogonal Procrustes problem, for which Schönemann proposed a generalized solution. The algorithm to align structures A and B is as follows:

Input: Matrix A, B
Output: Rotation matrix to rotate B into A; minimum distance between A and B after transformation.

Procedure align(A, B)
begin
    Move mass center of A to origin
    Move mass center of B to origin
    C := B'A
    Compute the SVD: [U, S, V] := svd(C)
    Q := U*V'
    ||A-BQ||^2 := trace(A'A) + trace(B'B) - 2trace(Q'B'A)
    RMS := SQRT(||A-BQ||^2/N)
    Output rotation matrix Q
    Output RMS
end

RMSD is the norm of the distance vector between two data sets, provided that they have been optimally superimposed:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| x(i) - y(i) \right\|^{2}}$$

where x(i) and y(i) are the coordinates of the i-th pair of corresponding points and N is the number of points.

A window searching (fragment matching) procedure is used to detect common substructures in proteins. To facilitate the computation, we first determine a set of candidate shifts using the size of the shorter protein as our window size.
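The align procedure above maps onto a few lines of NumPy. The sketch below assumes that NumPy is available, that A and B are N x 3 arrays whose rows correspond one-to-one, and that "mass center" means the unweighted centroid. The function name `align` follows the pseudocode; the demo data are made up for illustration. Like the pseudocode, it solves the orthogonal Procrustes problem, so Q is orthogonal and may in principle include a reflection.

```python
import numpy as np


def align(A, B):
    """Rotate B onto A; return (Q, rms) as in the align pseudocode.

    A and B are N x 3 arrays of corresponding points (assumption).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape != B.shape:
        raise ValueError("A and B must have the same shape")

    # Move both centroids to the origin (unweighted mass center).
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)

    # C := B'A, then SVD. NumPy returns V transposed as the third value.
    C = B.T @ A
    U, _, Vt = np.linalg.svd(C)
    Q = U @ Vt                                   # Q := U * V'

    # ||A - BQ||^2 via the trace identity from the pseudocode.
    sq_dist = (np.trace(A.T @ A) + np.trace(B.T @ B)
               - 2.0 * np.trace(Q.T @ B.T @ A))
    rms = float(np.sqrt(max(sq_dist, 0.0) / A.shape[0]))  # clamp round-off
    return Q, rms


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 3))
    t = 0.7                                      # rotation angle about z
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0,        0.0,       1.0]])
    B = A @ R + np.array([1.0, -2.0, 3.0])       # rotated, shifted copy of A
    Q, rms = align(A, B)
    print(Q)
    print(rms)
```

Because B in the demo is an exact rigid copy of A, the reported RMS should be zero up to floating-point round-off.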
We use contiguous alignment, which is simpler than the standard biological notion of sequence alignment (which allows for insertions and deletions in a sequence). By shifting the two proteins relative to one another, we compute the corresponding URMS score of the equivalent residue pairs for each shift. Typically only the 10-20 best shifts provide any useful information. The pseudocode to determine such a set of shifts is as follows:

Input: long protein of length n, short protein of length m.
Output: k (=20) shifts with the least URMS scores for aligning the two proteins.

Procedure getShifts(longprotein, n, shortprotein, m)
begin
    for i := 1 to n-m+1 do
        calculate RMSD(longprotein(i:i+m-1), shortprotein(1:m))
        store the values of URMS and shifts
    end
    output k shifts with least RMSD scores
end

Then, for each of the candidate shifts, we identify the common substructures using a smaller window. We arbitrarily use a window size of 10. We move the window from the start residue to the end, one residue at a time. For each position of the small window, the RMS distance is computed for the corresponding residues within the window. A curve of the RMS value against each offset is then plotted. Common substructures appear as consistently “low” values in the plot. An arbitrary threshold value is used to decide what counts as “low”.

7. Medical Image Registration using Mutual Information

The goal of this project is to implement the mutual-information-based medical image registration that was taught in class. If you implement the algorithm in Matlab, I recommend using the Optimization Toolbox.

Input data: http://www.cs.nccu.edu/~gmilledge/CB2/Project/image.mat
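As a starting point for project 7, the similarity measure itself can be sketched as follows. This is the common joint-histogram estimate of mutual information, which may differ in detail from the formulation taught in class. The function name `mutual_information`, the `bins` and `max_value` parameters, and the representation of an image as a flat list of intensities are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b, bins=32, max_value=255):
    """Estimate the mutual information (in nats) of two images.

    img_a and img_b are flat, equally long sequences of intensities in
    [0, max_value], sampled at the same pixel positions (assumption).
    """
    if len(img_a) != len(img_b) or len(img_a) == 0:
        raise ValueError("images must be non-empty and of equal size")

    def to_bin(value):
        # Map an intensity to one of `bins` equal-width histogram bins.
        return min(int(value * bins / (max_value + 1)), bins - 1)

    pairs = [(to_bin(a), to_bin(b)) for a, b in zip(img_a, img_b)]
    n = len(pairs)
    joint = Counter(pairs)                      # joint histogram
    marg_a = Counter(a for a, _ in pairs)       # marginal histogram of A
    marg_b = Counter(b for _, b in pairs)       # marginal histogram of B

    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log(p(a,b) / (p(a) p(b))), written with raw counts.
        mi += (count / n) * math.log(count * n / (marg_a[a] * marg_b[b]))
    return mi


if __name__ == "__main__":
    image = [0, 64, 128, 192] * 4
    print(mutual_information(image, image, bins=4))      # image vs. itself
    print(mutual_information(image, [10] * 16, bins=4))  # vs. constant image
```

A registration routine would resample one image under candidate transform parameters and search for the parameters that maximize this value, which is the role an optimizer plays in the project.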