Computer Science 286 Fall 2005
Bioinformatics Projects

The project will involve research in bioinformatics. You are encouraged to work on a project related to your interests. A list of suggested topics is attached to help you get started. You can pick a project from the list, modify a project to suit your interests, or invent your own. The key idea is to be creative, either in developing a new algorithm or in implementing an existing one. Results, whether good or bad, should be compared with those obtained from existing bioinformatics tools or packages.

The project implementation may be developed for any platform. You can use C, C++, C#, Java, Matlab, or Perl. The World Wide Web contains many bioinformatics programs, often with their source code. You may use and modify such code provided appropriate acknowledgements and citations are made.

A two-page project proposal is due by the beginning of the lecture on Thursday, October 13, 2005. The proposal should give a clear description of the project and should contain absolutely no generalities or definitions. Clearly state what you are planning to do and explain how you plan to achieve it. Do not forget to specify the programming language you intend to use. It is important to get started as early as possible.

The projects are due on Tuesday, November 29, 2005. Make sure to hand in both a technical project description with appendices and references and a disk or CD containing the source code. The report should be approximately 10 pages for programming projects and 30 pages for non-programming projects. You do not need to include a hard copy of the source code.

A good project report will include the following:
- Background of the problem from the literature search.
- A clear definition of the problem.
- An explanation and justification of methods of data analysis.
- A description and justification of the data sources.
- Analysis of the results and comparison with existing tools.
- Conclusions based on the results.
- Possible directions for future research.
- Instructions on how to compile and execute your program, if applicable.
- A full list of references.

© 2005 by Sami Khuri

List of Suggested Projects

A - Programming Projects

1. DP Pairwise Comparison Algorithm
In this project, you should implement a dynamic programming algorithm that does pairwise comparison. Your program should allow the user to use either:
- One penalty for gaps, or
- Two gap penalties: one for starting a gap and one for extending a gap.
Your program should also have the following three options:
- Local comparison
- Global comparison
- Semiglobal comparison
The program should allow the user to enter gap penalties. Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

2. K-Band DP for Pairwise Comparison
If two sequences are similar, the best alignments have their paths near the main diagonal. It is not necessary to fill the entire matrix to compute the optimal score and alignment. A narrow band around the main diagonal should suffice. In this project, you are to implement a dynamic programming algorithm that does pairwise comparison and uses the K-Band procedure [SM97]. Your program should allow the user to use either:
- One penalty for gaps, or
- Two gap penalties: one for starting a gap and one for extending a gap.
Your program should also have three options:
- Local comparison
- Global comparison
- Semiglobal comparison
The program should allow the user to enter gap penalties. Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

3. Optimal Linear Space DP for Pairwise Comparison
In this project, you are to implement a dynamic programming algorithm that does pairwise comparison in linear space.
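As a rough illustration of what linear space means for this project, the sketch below computes only the optimal global alignment score while keeping two rows of the dynamic programming matrix in memory. It is a starting point under stated assumptions, not the required implementation: the function name and the default scores (match=1, mismatch=-1, one linear gap penalty of -2) are hypothetical choices made for the example. Recovering the alignment itself in linear space needs an additional divide-and-conquer step (often attributed to Hirschberg), which is not shown here.

```python
def global_score_linear_space(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t.

    Only two rows of the DP matrix are kept, so memory use is
    proportional to the length of the shorter sequence.
    The scoring values are illustrative assumptions.
    """
    if len(t) > len(s):
        s, t = t, s  # keep the shorter sequence along the row

    # Row 0: aligning a prefix of t against the empty prefix of s.
    prev = [j * gap for j in range(len(t) + 1)]

    for i in range(1, len(s) + 1):
        # Column 0: aligning a prefix of s against the empty prefix of t.
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap        # s[i-1] aligned with a gap
            left = curr[j - 1] + gap  # t[j-1] aligned with a gap
            curr[j] = max(diag, up, left)
        prev = curr  # the older row is no longer needed

    return prev[len(t)]


if __name__ == "__main__":
    print(global_score_linear_space("ACGT", "AGT"))
```

Local and semiglobal variants would change the initialization of the first row and column and where the final score is read from; those changes are left out of this sketch.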
The basic algorithm is of quadratic complexity. With respect to space, it is possible to improve the complexity from quadratic to linear. The algorithm is described in Section 3.3.1 “Space Saving” of Section 3.3: “Extensions of the Basic Algorithms” [SM97]. Your program should have three options:
- Local comparison
- Global comparison
- Semiglobal comparison
Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

4. The Original BLAST
Basic Local Alignment Search Tool (BLAST) was first published in 1990 [AGM+90]. This project consists in reading, understanding and implementing the algorithm presented in the article and comparing it to an existing package.
[AGM+90] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215: 403-410; 1990.

5. PSI-BLAST
Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was first published in 1997 [AMS+97]. This project consists in reading, understanding and implementing PSI-BLAST as described in the article and comparing it to an existing package.
[AMS+97] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17): 3389-3402; 1997.

6. FASTA and FASTP (three)
W. Pearson and D. Lipman developed FASTA, which provides a rapid way of finding short stretches of similar sequences between a new sequence and any sequence in a database [PL88]. Pearson continued to improve the FASTA method for similarity searches in sequence databases [Pea90], [Pea96].
This project consists in choosing one of the three FASTA algorithms from one of the three referenced articles, reading, understanding and implementing it as described in the article and comparing it to an existing package.
[PL88] Pearson, W. and Lipman, D. Improved tools for biological sequence comparisons. Proc. Natl. Acad. Sci. 85: 2444-2448; 1988.
[Pea90] Pearson, W. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183: 63-98; 1990.
[Pea96] Pearson, W. Effective protein sequence comparison. Methods Enzymol. 266: 227-258; 1996.

7. CLUSTAL W
CLUSTAL W is a commonly used package for multiple sequence alignment. It was published in 1994 [THG94]. This project consists in reading, understanding and implementing the algorithm presented in the article and comparing it to an existing package.
[THG94] Thompson, J., Higgins, D., and Gibson, T. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680; 1994.

8. Multiple Sequence Alignment and Genetic Algorithms
A dynamic programming approach to the MSA problem has exponential time complexity. Genetic algorithms are of considerable interest to researchers because they can find high-scoring alignments as good as those found by other methods. This project consists in implementing your own genetic algorithm for the MSA problem and comparing it to SAGA [NH96] or to [ZW97].
[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by Genetic Algorithm (SAGA). Nucleic Acids Research, 24(8): 1515-1524; 1996.
[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple Sequence Alignment. Comput. Appl. Biosci., 13: 565-581; 1997.
[CW96] Corcoran, A.L., Wainwright, R.L. LIBGA: A User-friendly workbench for order-based genetic algorithm research; 1996.

9.
Multiple Sequence Alignment and Genetic Doping Algorithms
In a Genetic Doping Algorithm, nothing is fixed. Everything, including the stochastic operators, depends on the context, which varies over time [Bus04]. Unlike in traditional genetic algorithms, the probabilities of crossover and mutation vary from generation to generation. These probabilities are interconnected: both depend on the average fitness of the population in each generation. Genetic doping algorithms may therefore be more suitable for the multiple sequence alignment problem than traditional genetic algorithms. This project consists in implementing your own genetic doping algorithm for the MSA problem and comparing it to SAGA [NH96] or to [ZW97].
[Bus04] Buscema, M. Genetic Doping Algorithm (GenD): theory and applications. Expert Systems, 21(2), May 2004.
[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by Genetic Algorithm (SAGA). Nucleic Acids Research, 24(8): 1515-1524; 1996.
[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple Sequence Alignment. Comput. Appl. Biosci., 13: 565-581; 1997.

10. Fragment Assembly (several)
With current technology it is impossible to directly sequence contiguous DNA stretches of more than a few hundred bases. Typically, several copies of a long DNA molecule are cut into random pieces, and the pieces are sequenced. The task of reconstructing the original sequence from these fragments is called fragment assembly. This project consists in choosing either the greedy algorithm described in Section 4.3.4 or one of the heuristics described in Section 4.4 [SM97]. Alternatively, you may choose a more recent approach described in [KS99]. Read, understand and implement the algorithm you chose and compare its performance to an existing package.
[KS99] Kim, S., and Segre, A. AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly.
Journal of Computational Biology, 6(2): 163-186; 1999.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

10. Physical Mapping of DNA (several)
Physical mapping is the process of determining the location of certain markers (landmarks) of a DNA molecule. The markers are generally small but precisely defined sequences. The resulting maps are used as a basis for DNA sequencing, and for the isolation and characterization of individual genes or other DNA regions of interest. For example, see Figure 5.1 [SM97, page 144]. This project consists in choosing one of the algorithms described in Sections 5.2, 5.3, 5.4, and 5.5, reading it, understanding it, implementing it and comparing its performance to an existing package.
[SM97] Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

11. Phylogenetic Trees (several)
The reconstruction of phylogenetic trees is a general problem in biology. It is used in molecular biology to help understand the evolutionary relationships among proteins, for example. This project consists in choosing one of the four algorithms mentioned below, reading and understanding the appropriate referenced article(s), implementing the algorithm as described in the article and comparing it to an existing package.
- Phylogenetic Trees Based on Pairwise Distances [FD96]
- Phylogenetic Trees Based on Neighbor Joining [SN87]
- Phylogenetic Trees Based on Maximum Parsimony [Fel96]
- Phylogenetic Trees Based on Maximum Likelihood Estimation [BT86], [Fel81]
[BT86] Bishop, M. and Thompson, E. Maximum likelihood alignment of DNA sequences. Journal of Molecular Biology, 190: 159-165; 1986.
[FD96] Feng, D. and Doolittle, R. Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol. 266; 1996.
[Fel81] Felsenstein, J.
Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17: 368-376; 1981.
[Fel96] Felsenstein, J. Inferring phylogeny from protein sequences by parsimony, distance and likelihood methods. Methods Enzymol. 266; 1996.
[SN87] Saitou, N. and Nei, M. The neighbor joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4: 406-425; 1987.

12. Gene Prediction (several)
Gene prediction consists in identifying regions of genomic DNA that encode proteins. Some of the existing models that identify and distinguish coding regions from non-coding regions are based on:
- Hidden Markov models
- Neural networks
- Probabilistic models
- Linear discriminant analysis
- Decision tree classification
- Quadratic discriminant analysis
- Stochastic context-free grammars
This project consists in choosing one of the above techniques and implementing the prediction (search) algorithm, which should be able to search a given database for genes that code for proteins. Your algorithm should be compared to an existing package.
[BK97] Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94; 1997.
[Kro97] Krogh, A. Two methods for improving performance of an HMM and their application for gene-finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5: 179-186; 1997.
[Pre95] Prestridge, D. S. Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol. 249: 923-932; 1995.
[Rab89] Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77: 257-285; 1989.
[RJ86] Rabiner, L. R. and Juang, B. H. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, vol. 3, February 1986.
[Zha97] Zhang, M.Q. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. 94: 565-568; 1997.

13.
The Protein Prediction Problem (several)
The main goal in the protein prediction problem is to determine the three-dimensional structure of a protein based on its amino acid sequence. Recall that protein structure can be examined at three levels:
- The primary structure is the sequence of amino acids in the chain, i.e., a one-dimensional structure.
- The secondary structure is the result of the folding of parts of the amino acid chain. The two most important secondary structures are the α-helix and the β-strand.
- The tertiary structure is the real three-dimensional configuration of the protein under given environmental conditions (solvent, pH and temperature).
The tertiary structure decides the biochemical function of the protein. If the tertiary structure is changed, the protein normally loses its ability to perform whatever function it has, since this function depends on the geometrical shape of the active site in the interior of the molecule.
This project consists in choosing an existing algorithm for protein prediction, implementing it, and comparing it to a package currently in use.
[CB00] Clote, P., and Backofen, R. Computational Molecular Biology: An Introduction. John Wiley and Sons, Ltd; 2000.

14. The RNA Structure Prediction Problem (several)
Unlike DNA, which most frequently assumes its well-known double-helical conformation, the three-dimensional structure of single-stranded RNA is determined by the sequence of nucleotides in much the same way that protein structure is determined by sequence. RNA structure, however, is less complex than protein structure and can be well characterized by identifying the location of commonly occurring secondary structure elements [KR03].
This project consists in choosing an existing algorithm for RNA structure prediction, implementing it, and comparing it to a package currently in use.
[KR03] Krane, D. and Raymer, M. Fundamental Concepts of Bioinformatics; 2003.

15.
The Clustering Gene Expression Problem (several)
Analysis of gene expression patterns can provide insight into relationships between a gene and its function. Clustering techniques applied to gene expression data partition genes into clusters (groups) based on their expression patterns. Genes in the same cluster will have similar expression patterns, while genes in different clusters will have distinct expression patterns. This project consists in choosing an existing clustering algorithm (such as CLICK [SS00]), implementing it, and comparing it to a package currently in use [BSY99].
[BSY99] Ben-Dor, A., Shamir, R. and Yakhini, Z. Clustering gene expression patterns. Journal of Computational Biology, 6: 281-297; 1999.
[SS00] Sharan, R., and Shamir, R. CLICK: A clustering algorithm with applications to gene expression analysis. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 307-316; 2000.

16. Comparative Genomics (several)
Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and non-coding regions of the genome.
[MBS+00] C. Mayor, M. Brudno, J. R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. Pachter, I. Dubchak. VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16: 1046-1047; 2000.
[LOP+02] G. G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak and E. M. Rubin. Comparative sequence-based approach to high-throughput discovery of functional regulatory elements. Genome Res., 12: 832-839; 2002.

17. Thermostability and Preferential Amino Acid and Codon Usage
Most organisms grow at temperatures from 20 to 50°C, but some prokaryotes, including Archaea and Bacteria, are capable of withstanding higher temperatures, from 60 to over 100°C.
Farias and Bonato [FB02] investigated the preferential usage of certain amino acids (AA) and codons in thermally adapted organisms by comparative proteome analysis. This project consists in writing a program that calculates the G+C% of the genome sequences, computes the average proportion of each AA in each genome, computes the (E+K)/(Q+H) ratio for each genome, and computes the codon usage of each AA in each genome. Perform the analysis on the whole genome sequences mentioned in the article. Your program should be able to group all non-thermophilic (mesophilic) genomes into one category and hyperthermophilic and thermophilic genomes into a second category, and to compute the required statistics for each category.
[FB02] “Preferred codons and amino acid couples in hyperthermophiles” by S.T. Farias and C.M. Bonato. Genome Biology, 2002.

18. Genome signatures in prokaryotic and eukaryotic organisms
Each genome has a characteristic "signature" defined as the ratios between the observed dinucleotide frequencies and the frequencies expected if neighbors were chosen at random (dinucleotide relative abundances). The remarkable fact is that the signature is relatively constant throughout the genome; i.e., the patterns and levels of dinucleotide relative abundances of every 50-kb segment of the genome are about the same. Campbell, Mrazek and Karlin analyzed the signatures of different genomes in [CMK99]. More precisely, they computed the G+C% of the genome sequences, the dinucleotide relative abundances of complete genomes, the genomic signature profiles of organisms and also the genomic signature difference between pairs of organisms.
This project consists in implementing an algorithm that computes the G+C% of the genome sequences, the dinucleotide relative abundances of complete genomes, the genomic signature profiles of organisms and also the genomic signature difference between pairs of organisms.
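A minimal sketch of these quantities, under stated assumptions, may help when planning the program. The relative abundance of dinucleotide XY is taken here as the observed frequency of XY divided by the product of the frequencies of X and Y, and the signature difference between two sequences as the mean absolute difference over the 16 dinucleotides. The function names are hypothetical, and the sketch uses single-strand counts only; if the definition in the article symmetrizes over both strands, the counts would also need to include the reverse complement. Check every definition against the article before relying on this.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def gc_percent(seq):
    """G+C content as a percentage of the A/C/G/T bases in seq."""
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (counts["G"] + counts["C"]) / total


def relative_abundances(seq):
    """Map each dinucleotide XY to f_XY / (f_X * f_Y).

    Dinucleotides containing ambiguous symbols (e.g. N) are skipped.
    Single-strand counts only: an assumption of this sketch.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        # Report 0.0 when a base is absent and the ratio is undefined.
        rho[x + y] = observed / expected if expected > 0 else 0.0
    return rho


def signature_difference(seq_a, seq_b):
    """Mean absolute difference of the 16 relative abundances."""
    rho_a = relative_abundances(seq_a)
    rho_b = relative_abundances(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16.0
```

A signature profile along a genome could then be obtained by calling relative_abundances on successive fixed-size windows (for instance the 50-kb segments mentioned above) and comparing each window with the whole-genome values.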
Apply your program to prokaryotic and eukaryotic genome sequences and present your results in a format similar to Figure 1 and Figure 2 (see [CMK99]).
[CMK99] “Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA” by A. Campbell, J. Mrazek, and S. Karlin. Proc. Natl. Acad. Sci. USA, 1999 August 3; 96(16): 9184-9189.

19. Asymmetric substitution patterns
The analyses of the genomes of three prokaryotes, Escherichia coli, Bacillus subtilis, and Haemophilus influenzae, by Lobry [Lob96] revealed a new type of genomic compartmentalization of base frequencies. There was a departure from intrastrand equifrequency between A and T or between C and G, showing that the substitution patterns of the two strands of DNA were asymmetric. The positions of the boundaries between these compartments were found to coincide with the origin and terminus of chromosome replication.
Grigoriev [Gri98] developed a method of cumulative diagrams that shows that the nucleotide composition of a microbial chromosome changes at two points separated by about half of its length. These points also coincide with sites of replication origin and terminus for all bacteria. The leading strand is found to contain more guanine than cytosine residues.
This project consists in writing a program that:
- calculates the 3 indices of base frequency using a nonoverlapping moving window of size 10 kb [Lob96],
- computes the GC and AT skews as defined in [Gri98],
- estimates the positions of the origin and terminus of replication, and
- computes (C-G)/(C+G) % and (A-T)/(A+T) % in leading and lagging coding sequences (see Table 2 in [Lob96]).
Apply your program to the sequences mentioned in both articles and compare the results. Then, apply the program to other whole genome sequences.
[Lob96] “Asymmetric substitution patterns in the two DNA strands of bacteria” by J.R. Lobry. Mol. Biol. Evol. 1996 May; 13(5):660-665.
[Gri98] “Analyzing genomes with cumulative skew diagrams” by A. Grigoriev. Nucleic Acids Res.
1998 May 15; 26(10):2286-90.

Visualization Tools for Bioinformatics Algorithms
The main purpose of the projects in this category is to present a visualization tool to assist students in learning algorithms related to bioinformatics. The interactive, user-friendly, educational tool you are to develop should be usable in classroom demonstrations, hands-on laboratories, self-directed work outside of class, and distance learning. The visualization package will include background material, a detailed explanation of the algorithm with examples, and quizzes and exercises. Using Java is highly recommended.

20. Needleman and Wunsch’s DP Algorithm
Design and implement a visual interactive software package to demonstrate how Needleman and Wunsch’s dynamic programming algorithm is applied to solve protein and genomic sequence alignment problems.

21. CLUSTAL W Algorithm
Design and implement a visual interactive software package to demonstrate how the CLUSTAL W algorithm is applied to align multiple sequences.

22. Phylogenetic Tree Construction
Choose a phylogenetic construction algorithm. Design and implement a visual interactive software package to demonstrate how the algorithm you selected constructs a phylogenetic tree.

B - Non-Programming Projects

Survey Papers

23. A Comprehensive Survey and Comparison of Multiple Sequence Alignment Programs
In this project, you are to choose 5 MSA programs. Describe each one in detail and compare their performances. The comparison should include several runs with data from various databases.

24. A Comprehensive Survey and Comparison of Protein Structure Visualization Tools
In this project, you are to choose 8 protein structure visualization programs. Describe each one in detail and use them to visualize the structure of different proteins. The project should include screen dumps.

25.
DNA Computing
Write a survey paper on the state of the art in DNA computing, highlighting new directions.

26. Bayesian statistical methods for sequence alignment and evolutionary distance estimation
Write a survey paper based on the results published by Agarwal and States in 1996 [AS96], and Zhu et al. in 1998 [ZLL98]. For further reading, a Bayesian bioinformatics tutorial by C. Lawrence is available on the Internet [Law01]. Divide your paper into the following four parts:
(a) Explain what Bayesian statistics is.
(b) Explain how Bayesian statistical methods can be applied to sequence analysis.
(c) Explain how Bayesian statistical methods can be applied to evolutionary distance estimation.
(d) Describe the most common Bayesian sequence alignment algorithms.
[AS96] Agarwal, P. and States, D.J. A Bayesian evolutionary distance for parametrically aligned sequences. Journal of Computational Biology, vol. 3, pp. 1-17; 1996.
[Law01] Lawrence, C. Bayesian Bioinformatics Page. http://www.wadsworth.org/resnres/bioinfo/
[ZLL98] Zhu, J., Liu, J.S., and Lawrence, C.E. Bayesian adaptive sequence alignment algorithms. Bioinformatics, vol. 14, pp. 25-39; 1998.

27. Non-Coding DNA
Geneticists have long focused on just the small part of DNA that contains blueprints for proteins. The remainder (in humans, 98% of the DNA) was often dismissed as junk. But the discovery of many hidden genes that work through RNA, rather than protein, has overturned that assumption [Gib03]. Write a survey paper on that topic, which should cover antisense RNAs, microRNAs and riboswitches.
[Gib03] Gibbs, W. W. Unseen Genome: gems among the junk. Scientific American, November 2003, pp. 47-53; 2003.
[Sto02] Storz, G. An Expanding Universe of Noncoding RNAs. Science, vol. 296, May 17, 2002, pp. 1260-1263; 2002.

28.
Stem Cells
Stem cells raise the prospect of regenerating failing body parts and curing diseases that have so far defied drug-based treatment [LR04]. Embryonic stem cells are derived from the portion of a very early stage embryo that would eventually give rise to an entire body. Because embryonic stem cells originate in this primordial stage, they retain the “pluripotent” ability to form any cell type in the body. This survey project introduces stem cells and elaborates on their various potential applications. It should answer questions such as why we cannot simply inject embryonic stem cells into the parts of the body we wish to regenerate and let them take their cues from the surrounding environment. More importantly, the survey paper has to clearly identify areas in stem cell research where bioinformatics could play (and is playing) an important role.
[LR04] Lanza, R., and Rosenthal, N. The stem cell challenge. Scientific American, 93-99, June 2004.

Non-Survey Projects

29. PAM and BLOSUM Substitution Matrices
Explain how various PAM or BLOSUM substitution matrices were constructed. Discuss the differences and similarities between the PAM and BLOSUM matrices. Describe applications of each of the matrices.

30. Amino Acid Scoring Matrices
The most commonly used substitution scoring matrices are PAM and BLOSUM. Choose two other amino acid scoring matrices. Describe how they were constructed. Discuss the differences and similarities between these matrices. Describe applications of each of the matrices.

31. Test of Markov Model of Evolution in Proteins
In 1985, Wilbur tested the Markov model of evolution and showed that it may be applicable if certain changes are made in the way the PAM matrices are calculated [Wil85]. Write a research paper describing how the tests were done, which conclusions were drawn from the tests, and the extension of the original paper by George et al. in 1990 [GBH90].
[GBH90] George, D.G., Barker, W.C., and Hunt, L.T. Mutation data matrix and its uses. Methods Enzymol. Vol. 183, pp. 333-351; 1990.
[Wil85] Wilbur, W.J. On the PAM model of protein evolution. Molecular Biol. Evol. Vol. 2, pp. 434-447; 1985.

32. Which Genes Make Us Human?
The sequence of the human genome provides a new tool with which to investigate human origins. It has been known since 1975, through the work of Mary-Claire King and Allan Wilson, that the genomes of humans and chimpanzees differ by only 1.3%. This DNA sequence difference is unusually small for two species so different in anatomy and behavior [Pra02]. This puzzle has sparked intense interest in the chimpanzee genome, now scheduled to be completely sequenced. A comparative chimp-human clone map has recently been published [Fujiyama et al., 2002]. SNP mappers have jumped into the question, reasoning that single nucleotide polymorphisms may hold the key [Lew02]. However, gene expression studies will be required for any real answer to the question, as predicted by King and Wilson. Researchers last year presented the first comparative gene expression studies in humans and other primates (Ape Genomics). A more comprehensive analysis, including some proteomic data, shows major differences in the pattern of brain gene expression between humans and chimps [Enard et al., 2002]. Recently, a very exciting candidate gene has been identified that appears to be linked with language ability (see speech gene). In addition, the gene shows statistical evidence of strong selection during human evolution [Stephens et al., 2001].
The background above describes at least three different approaches to answering the question of what, genetically speaking, defines our human species. Based on publicly available bioinformatics tools and databases, can your group suggest a purely bioinformatics approach? This project should outline the approach and perform the analysis.
Discuss the results and compare them to those found in the references. Describe the limitations of this approach (if any). [Adapted from B. Chapman, 2003].
[Pra02] “Altered gene expression could explain the genetic difference between human and chimp: Evolutionists Present Their 1.3% Solution” by L. Pray. The Scientist 16(16): 36-41, 2002.
[Fujiyama et al., 2002] “Construction and Analysis of a Human-Chimpanzee Comparative Clone Map” by A. Fujiyama and 16 coauthors. Science, vol. 295, January 2002; 131-134.
[Lew02] “SNPs as Windows on Evolution: Recent studies reveal that the human species is young and genetically uniform” by R. Lewis. The Scientist 16(1): 16-21, 2002.
[Enard et al., 2002] “Molecular evolution of FOXP2, a gene involved in speech and language” by W. Enard, M. Przeworski, S.E. Fisher, C.S.L. Lai, V. Wiebe, T. Kitano, A.P. Monaco and S. Paabo. Nature, vol. 418, August 2002; 869-872.
[Stephens et al., 2001] “Haplotype Variation and Linkage Disequilibrium in 313 Human Genes” by J.C. Stephens and 27 coauthors. Science 293: 489-493, 2001.

33. The USP6 Oncogene
Gene duplication is thought to be the major mechanism for the emergence of novel genes during evolution. Such events are thought to have occurred at early stages in the vertebrate lineage. Paulding et al. [PRH03] report that the USP6 oncogene is derived from the fusion of two other genes: USP32 and TBC1D3. This project consists in performing the bioinformatics analysis described in “Materials and Methods” and reporting the results.
[PRH03] “The Tre2 (USP6) oncogene is a hominoid-specific gene” by C.A. Paulding, M. Ruvolo, and D.A. Haber. Proceedings of the National Academy of Sciences, vol. 100, number 5, 2507-2511, March 4, 2003.

34.
The Myosin Gene MYH16 Mutation
Powerful masticatory muscles are found in most primates, including chimpanzees and gorillas, and were part of a prominent adaptation of Australopithecus and Paranthropus, extinct genera of the family Hominidae. In contrast, masticatory muscles are considerably smaller in both modern and fossil members of Homo. The evolving hominid masticatory apparatus, traceable to a Late Miocene, chimpanzee-like morphology, shifted towards a pattern of gracilization nearly simultaneously with accelerated encephalization in early Homo. Stedman et al. [Stedman et al., 2004] showed that the gene encoding the predominant myosin heavy chain (MYH) expressed in these muscles was inactivated by a frameshifting mutation after the lineages leading to humans and chimpanzees diverged. Loss of this protein isoform is associated with marked size reductions in individual muscle fibres and entire masticatory muscles. Using the coding sequence for the myosin rod domains as a molecular clock, they estimated that this mutation appeared approximately 2.4 million years ago, predating the appearance of modern human body size and the emigration of Homo from Africa. According to the authors, this represents the first proteomic distinction between humans and chimpanzees that can be correlated with a traceable anatomic imprint in the fossil record.
This project consists in performing the bioinformatics analysis described in [Stedman et al., 2004] with the help of [Cur04] and [Pen04]. The findings of Stedman et al. were challenged a year later by Perry et al. [Perry et al., 2005]. Your project should address the second article’s findings and give your own conclusion. In other words, you should give your own evaluation of the criticisms raised in [Perry et al., 2005].
[Pen04] “The Primate Bite: Brawn Versus Brain?” by E. Pennisi. Science vol. 303, page 1957, March 2004.
[Stedman et al., 2004] “Myosin gene mutation correlates with anatomical changes in the human lineage” by H.H. Stedman and 9 coauthors. Nature vol. 428, pages 415-418, March 2004.
[Cur04] “Muscling in on hominid evolution” by P. Currie. Nature vol. 428, page 373, March 2004.
[Perry et al., 2005] “Comparative Analyses Reveal a Complex History of Molecular Evolution for Human MYH16” by G.H. Perry, B.C. Verrelli, A.C. Stone. Molecular Biology and Evolution, vol. 22, number 3, March 2005, pp. 379-382.