Bioinformatics - Department of Computer Science

Computer Science 286
Fall 2005
Bioinformatics Projects
The project will involve research in bioinformatics. You are encouraged to
work on a project related to your interests. A list of suggested topics is
attached to help you get started. You can pick a project from the list,
modify a project to suit your interests, or invent your own. The key idea
is to be creative either in developing a new algorithm or in implementing
an existing one. Results, whether good or bad, should be compared with
those obtained from existing bioinformatics tools or packages.
The project implementation may be developed for any platform. You can
use C, C++, C#, Java, Matlab, or Perl. The World Wide Web contains
many bioinformatics programs as well as the source code. You may use
and modify such code provided appropriate acknowledgements and
citations are made.
A two-page project proposal is due by the beginning of the lecture on
Thursday, October 13, 2005. The proposal should give a clear description
of the project and should contain absolutely no generalities or
definitions. Clearly state what you are planning to do and explain how
you plan to achieve it. Do not forget to specify the programming language
you intend to use. It is important to get started as early as possible.
The projects are due on Tuesday, November 29, 2005. Make sure to
hand in both a technical project description with appendices and
references and a disk or CD containing the source code. The report
should be approximately 10 pages long for a programming project and
30 pages long for a non-programming project. You do not need to
include a hard copy of the source code.
A good project report will include the following:
Background of the problem from the literature search.
A clear definition of the problem.
An explanation and justification of methods of data analysis.
A description and justification of the data sources.
Analysis of the results and comparison with existing tools.
Conclusions based on the results.
Possible directions for future research.
Instructions on how to compile and execute your program, if
applicable.
A full list of references.
© 2005 by Sami Khuri
List of Suggested Projects
A - Programming Projects
1. DP Pairwise Comparison Algorithm
In this project, you should implement a dynamic programming
algorithm that does pairwise comparison. Your program should allow
the user to use either:
One penalty for gaps, or
Two gap penalties: one for starting a gap and one for extending
a gap.
Your program should also have the following three options:
Local comparison
Global comparison
Semiglobal comparison
The program should allow the user to enter gap penalties.
Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
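The three comparison modes above differ only in how the matrix borders and the final score are treated. The following illustrative Python sketch shows this for the single gap penalty case. The function name `alignment_score`, the `mode` parameter, and the default scores (match 1, mismatch -1, gap -2) are assumptions made for the example. It returns the score only; the two-penalty option and the traceback that recovers the alignment are not shown.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, mode="global"):
    """Best alignment score of a and b with one (linear) gap penalty.

    mode is "global", "local", or "semiglobal" (end gaps are free).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if mode == "global":
        # leading gaps are charged only in the global case
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
    best_local = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if mode == "local":
                cell = max(cell, 0)                 # a local alignment may restart
                best_local = max(best_local, cell)
            score[i][j] = cell
    if mode == "local":
        return best_local
    if mode == "semiglobal":
        # trailing gaps are free: best value in the last row or last column
        return max(max(score[n]), max(row[m] for row in score))
    return score[n][m]
```

For example, `alignment_score("ACGT", "TTACGTTT", mode="semiglobal")` ignores the unmatched ends of the longer sequence, while the global mode charges for them.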
2. K-Band DP for Pairwise Comparison
If two sequences are similar, the best alignments have their paths
near the main diagonal. It is not necessary to fill the entire matrix to
compute the optimal score and alignment. A narrow band around the
main diagonal should suffice. In this project, you are to implement a
dynamic programming algorithm that does pairwise comparison and
uses the K-Band procedure [SM97]. Your program should allow the
user to use either:
One penalty for gaps, or
Two gap penalties: one for starting a gap and one for extending
a gap.
Your program should also have three options:
Local comparison
Global comparison
Semiglobal comparison
The program should allow the user to enter gap penalties.
Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
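The band idea can be sketched as follows. This illustrative Python sketch computes a global score while filling only the cells (i, j) with |i - j| <= k. The function name `kband_score`, the dictionary used for storage, and the default scores are assumptions for the example and are not taken from [SM97]. A full implementation would also need a rule for choosing or growing k and a traceback; neither is shown here.

```python
NEG_INF = float("-inf")

def kband_score(a, b, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score restricted to the band |i - j| <= k."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    # Only cells inside the band are stored, keyed by (i, j).
    score = {(0, 0): 0}
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                # the diagonal neighbour always lies inside the band
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[(i - 1, j - 1)] + sub)
            if (i - 1, j) in score:          # cell above, if inside the band
                best = max(best, score[(i - 1, j)] + gap)
            if (i, j - 1) in score:          # cell to the left, if inside the band
                best = max(best, score[(i, j - 1)] + gap)
            score[(i, j)] = best
    return score[(n, m)]
```

The number of stored cells grows with k times the sequence length rather than with the product of the two lengths, which is the saving the project is meant to explore.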
3. Optimal Linear Space DP for Pairwise Comparison
In this project, you are to implement a dynamic programming
algorithm that does pairwise comparison in linear space. The basic
algorithm is of quadratic complexity. With respect to space, it is
possible to improve the complexity from quadratic to linear. The
algorithm is described in Section 3.3.1 “Space Saving” of Section 3.3:
“Extensions of the Basic Algorithms” [SM97]. Your program should
have three options:
Local comparison
Global comparison
Semiglobal comparison
Compare your program to an existing package.
[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
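The space saving rests on the observation that each row of the matrix depends only on the row above it. The illustrative Python sketch below computes the global score while keeping two rows in memory. The function name and default scores are assumptions for the example. It does not recover the alignment itself; doing that in linear space requires the divide-and-conquer step described in the referenced section, which is not shown.

```python
def global_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score using memory proportional to the shorter sequence."""
    if len(b) > len(a):
        a, b = b, a                      # scoring is symmetric; keep rows short
    prev = [j * gap for j in range(len(b) + 1)]      # row 0
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)              # first column of row i
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,         # diagonal
                          prev[j] + gap,             # from the row above
                          curr[j - 1] + gap)         # from the left
        prev = curr                                  # discard the older row
    return prev[len(b)]
```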
4. The Original BLAST
Basic Local Alignment Search Tool (BLAST) was first published in
1990 [AGM+90]. This project consists in reading, understanding and
implementing the algorithm presented in the article and comparing it
to an existing package.
[AGM+90] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman,
D. Basic Local Alignment Search Tool. Journal of Molecular Biology,
215: 403-410; 1990.
5. PSI-BLAST
Position Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST)
was first published in 1997 [AMS+97]. This project consists in
reading, understanding and implementing PSI-BLAST as described in
the article and comparing it to an existing package.
[AMS+97] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D. Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Research, 25, 17: 3389-3402; 1997.
6. FASTA and FASTP (three)
W. Pearson and D. Lipman developed FASTA, which provides a rapid
way of finding short stretches of similar sequences between a new
sequence and any sequence in a database [PL88]. Pearson continued
to improve the FASTA method for similarity searches in sequence
databases [Pea90], [Pea96]. This project consists in choosing one of
the three FAST algorithms from one of the three referenced articles,
reading, understanding and implementing it as described in the
article and comparing it to an existing package.
[PL88] Pearson, W. and Lipman, D. Improved tools for biological
sequence comparisons. Proc. Natl. Acad. Sci. 85: 2444-2448; 1988.
[Pea90] Pearson, W. Rapid and sensitive sequence comparison with
FASTP and FASTA. Methods Enzymol. 183: 63-98; 1990.
[Pea96] Pearson, W. Effective protein sequence comparison. Methods
Enzymol. 266: 227-258; 1996.
7. CLUSTAL W
CLUSTAL W is a commonly used package for multiple sequence
alignment. It was published in 1994 [THG94]. This project consists in
reading, understanding and implementing the algorithm presented in
the article and comparing it to an existing package.
[THG94] Thompson, J., Higgins, D., and Gibson, T. CLUSTAL W:
Improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Res. 22: 4673-4680; 1994.
8. Multiple Sequence Alignment and Genetic Algorithms
A dynamic programming approach to the MSA problem has exponential
time complexity. Genetic algorithms are of considerable interest to
researchers because they can find high-scoring alignments as good as
those found by other methods.
This project consists in implementing your own genetic algorithm for
the MSA problem and comparing it to SAGA [NH96] or to [ZW97].
[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by
Genetic Algorithm (SAGA). Nucleic Acids Research, 24(8):
1515-1524; 1996.
[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple
Sequence Alignment. Comput. Appl. Bioscience, 13, 565-581; 1997.
[CW96] Corcoran, A.L., Wainwright, R.L. LIBGA: A User-friendly
workbench for order-based genetic algorithm research; 1996.
9. Multiple Sequence Alignment and Genetic Doping Algorithms
In a Genetic Doping Algorithm, nothing is fixed. Everything, including
the stochastic operators, depends on the context, which varies over
time [Bus04]. Unlike traditional Genetic Algorithms, the probabilities
of crossover and mutation vary from generation to generation. These
numbers are interconnected, both depending on average fitness
values of the population for each generation. It is very possible that
genetic doping algorithms are more suitable for the multiple sequence
alignment problem than traditional genetic algorithms.
This project consists in implementing your own genetic doping
algorithm for the MSA problem and comparing it to SAGA [NH96] or to
[ZW97].
[Bus04] Buscema, M. Genetic Doping Algorithm (GenD): theory and
applications. Expert Systems, vol. 21, 2, May 2004.
[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by
Genetic Algorithm (SAGA). Nucleic Acids Research, 24(8):
1515-1524; 1996.
[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple
Sequence Alignment. Comput. Appl. Bioscience, 13, 565-581; 1997.
10. Fragment Assembly (several)
With current technology it is impossible to directly sequence
contiguous DNA stretches of more than a few hundred bases.
Typically, several copies of a long DNA molecule are cut at random
into pieces. Fragment assembly is the task of reconstructing the
original sequence from these fragments.
This project consists in choosing either the greedy algorithm
described in Section 4.3.4 or one of the heuristics described in
Section 4.4 [SM97]. Alternatively, you may choose a more recent
approach described in [KS99]. Read, understand and implement the
algorithm you chose and compare its performance to an existing
package.
[KS99] Kim, S. and Segre, A. AMASS: A Structured Pattern Matching
Approach to Shotgun Sequence Assembly. Journal of Computational
Biology, 6(2): 163-186; 1999.
[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
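The greedy idea for fragment assembly can be sketched as: repeatedly merge the two fragments with the largest overlap until one sequence remains. The illustrative Python sketch below makes strong simplifying assumptions that a real assembler cannot make: overlaps are exact (no sequencing errors), all fragments come from the same strand (no reverse complements), and ties are broken arbitrarily. The function names are hypothetical and the code is not the procedure from [SM97] or [KS99].

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Merge fragments greedily by largest exact overlap; returns one string."""
    frags = list(dict.fromkeys(fragments))          # drop exact duplicates
    # drop fragments that are wholly contained in another fragment
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # join the best pair, writing the shared part only once
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""
```

On the toy input `["ACCGT", "CGTGC", "TTAC", "TACCGT"]` the sketch returns `"TTACCGTGC"`. Greedy merging is a heuristic, so on other inputs the result need not be the shortest common superstring.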
10. Physical Mapping of DNA (several)
Physical mapping is the process of determining the location of certain
markers (landmarks) of a DNA molecule. The markers are generally
small but precisely defined sequences. The resulting maps are used as
basis for DNA sequencing, and for the isolation and characterization
of individual genes or other DNA regions of interest. For example, see
Figure 5.1 [SM97, page 144].
This project consists in choosing one of the algorithms described in
Sections 5.2, 5.3, 5.4, and 5.5, reading it, understanding it,
implementing it and comparing its performance to an existing
package.
[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.
11. Phylogenetic Trees (several)
The reconstruction of phylogenetic trees is a general problem in
biology. It is used in molecular biology to help understand the
evolutionary relationships among proteins, for example.
This project consists in choosing one of the four algorithms
mentioned below, choosing the appropriate referenced article(s),
reading, understanding and implementing it as described in the
article and comparing it to an existing package.
Phylogenetic Trees Based on Pairwise Distances [FD96]
Phylogenetic Trees Based on Neighbor Joining [SN87]
Phylogenetic Trees Based on Maximum Parsimony [Fel96]
Phylogenetic Trees Based on Maximum Likelihood Estimation
[BT86], [Fel81].
[BT86] Bishop, M. and Thompson, E. Maximum likelihood alignment
of DNA sequences. Journal of Molecular Biology. 190:159-165; 1986.
[FD96] Feng, D. and Doolittle, R. Progressive alignment of amino acid
sequences and construction of phylogenetic trees from them. Methods
Enzymol. 266; 1996.
[Fel81] Felsenstein, J. Evolutionary trees from DNA sequences: A
maximum likelihood approach. Journal of Molecular Evolution.
17:368-376; 1981.
[Fel96] Felsenstein, J. Inferring phylogeny from protein sequences by
parsimony, distance and likelihood methods. Methods Enzymol. 266;
1996.
[SN87] Saitou, N. and Nei, M. The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Molecular Biology
and Evolution, 4: 406-425; 1987.
12. Gene Prediction (several)
Gene prediction consists in identifying regions of genomic DNA that
encode proteins.
Some of the existing models that identify and distinguish coding
regions from non-coding regions are based on:
Hidden Markov Model,
Neural Network,
Probabilistic model,
Linear discriminant analysis,
Decision tree classification,
Quadratic discriminant analysis,
Stochastic context free grammars.
This project consists in choosing one of the above techniques and
implementing the prediction (search) algorithm, which will be able to
search a given database for genes that do code for proteins.
Your algorithm should be compared to an existing package.
[BK97] Burge C and Karlin S: Prediction of complete gene structures
in human genomic DNA. J Mol Biol 268: 78-94; 1997.
[Kro97] Krogh A: Two methods for improving performance of an HMM
and their application for gene-finding. Proc Int Conf Intell Syst Mol
Biol 5: 179-186; 1997.
[Pre95] Prestridge, D. S. Predicting Pol II promoter sequences using
transcription factor binding sites. J. Mol. Biol. 249: 923-932; 1995.
[Rab89] Rabiner, L. R. A tutorial on hidden Markov models and
selected applications in speech recognition. Proc. IEEE, 77, 257–285;
1989.
[RJ86] Rabiner, L. R. and Juang, B. H. “An Introduction to Hidden
Markov Models,” IEEE ASSP Magazine, vol. 3, February 1986.
[Zha97] Zhang, M.Q. Identification of protein coding regions in the
human genome by quadratic discriminant analysis. Proc. Natl. Acad.
Sci. 94: 565–568; 1997.
13. The Protein Prediction Problem (several)
The main goal in the protein prediction problem is to determine the
three-dimensional structure of a protein based on its amino acid
sequence. Recall that protein structure can be viewed at three levels:
The primary structure is the sequence of amino acids in the chain
i.e., a one-dimensional structure.
The secondary structure is the result of the folding of parts of the
amino acid chain. The two most important secondary structures
are the -helix, and the -strand.
The tertiary structure is the real 3-dimensional configuration of the
protein under given environmental conditions (solvent, pH and
temperature).
The tertiary structure determines the biochemical function of the
protein. If the tertiary structure is changed, the protein normally
loses its ability to perform whatever function it has, since this
function depends on the geometrical shape of the active site in the
interior of the molecule.
This project consists in choosing an existing algorithm for protein
prediction, implementing it, and comparing it to a package currently
in use.
[CB00] Clote, P., and Backofen, R., Computational Molecular Biology:
An Introduction, John Wiley and Sons, LTD; 2000.
14. The RNA Structure Prediction Problem (several)
Unlike DNA, which most frequently assumes its well-known double-helical
conformation, the three-dimensional structure of
single-stranded RNA is determined by the sequence of nucleotides in much
the same way the protein structure is determined by sequence. RNA
structure, however, is less complex than protein structure and can be
well characterized by identifying the location of commonly occurring
secondary structure elements [KR03].
This project consists in choosing an existing algorithm for RNA
structure prediction, implementing it, and comparing it to a package
currently in use.
[KR03] Krane, D. and Raymer, M. Fundamental Concepts of
Bioinformatics; 2003.
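To give a sense of what a secondary structure prediction algorithm looks like, here is an illustrative Python sketch of one classical choice that the text above does not name: Nussinov-style base-pair maximization. It finds the largest number of nested base pairs by dynamic programming. The function name, the set of allowed pairs (Watson-Crick plus G-U), and the minimum hairpin loop length of 3 are assumptions for the example. It returns only the pair count, and maximizing pairs is a much cruder objective than the energy models used by packages in current use.

```python
def nussinov_max_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in rna (pair count only)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # shorter spans cannot pair
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j],          # i left unpaired
                        best[i][j - 1])          # j left unpaired
            if (rna[i], rna[j]) in pairs:
                score = max(score, best[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):            # split into two substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]
```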
15. The Clustering Gene Expression Problem (several)
Analysis of gene expression patterns can provide insight into
relationships between a gene and its function. Clustering techniques
applied to gene expression data partition genes into clusters (groups)
based on their expression patterns. Genes in the same cluster will
have similar expression patterns, while genes in different clusters will
have distinct expression patterns.
This project consists in choosing an existing clustering
algorithm (such as CLICK [SS00]), implementing it, and comparing it
to a package currently in use [BSY99].
[BSY99] Ben-Dor, A., Shamir, R., and Yakhini, Z. Clustering gene
expression patterns. Journal of Computational Biology, 6: 281-297;
1999.
[SS00] Sharan, R., and Shamir, R. CLICK: A clustering algorithm with
applications to gene expression analysis. Proceedings of the 8th
International Conference on Intelligent Systems for Molecular Biology,
307-316; 2000.
16. Comparative Genomics (several)
Comparative genomics is the analysis and comparison of genomes
from different species. The purpose is to gain a better understanding
of how species have evolved and to determine the function of genes
and non-coding regions of the genome.
[MBS+00] C. Mayor, M. Brudno, J. R. Schwartz, A. Poliakov, E. M.
Rubin, K. A. Frazer, L. Pachter, I. Dubchak. VISTA: Visualizing global
DNA sequence alignments of arbitrary length. Bioinformatics, 16:
1046-1047; 2000.
[LOP+02] G. G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak and E.
M. Rubin. Comparative sequence-based approach to high-throughput
discovery of functional regulatory elements. Genome Res., 12: 832-839;
2002.
17. Thermostability and Preferential Amino Acid and
Codon Usage
Most organisms grow at temperatures from 20 to 50ºC, but some
prokaryotes, including Archaea and Bacteria, are capable of
withstanding higher temperatures, from 60º to over 100ºC. Farias and
Bonato [FB02] investigated the preferential usage of certain amino
acids (AA) and codons in thermally adapted organisms, by
comparative proteome analysis.
This project consists in writing a program that calculates the G+C% of
the genome sequences, computes the average proportion of each AA in
each genome, computes the (E+K)/(Q+H) ratio for each genome, and
computes the codon usage of each AA in each genome.
Perform the analysis of the whole genome sequences mentioned in the
article. Your program should be able to group all non-thermophilic
(mesophilic) genomes into one category and hyperthermophilic and
thermophilic genomes into a second category and to compute the
required statistics.
[FB02] “Preferred codons and amino acid couples in
hyperthermophiles” by S.T. Farias and C.M. Bonato. Genome Biology,
2002.
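The statistics listed in this project are simple counts and ratios. The illustrative Python sketch below shows one way to compute them. The function names are hypothetical, sequences are assumed to be plain strings (DNA over A, C, G, T; proteins in one-letter code), and `codon_counts` assumes a coding sequence already in reading frame. Reading genome files and grouping genomes by category are not shown.

```python
from collections import Counter

def gc_percent(dna):
    """G+C content as a percentage of the A, C, G, T bases in dna."""
    counts = Counter(dna.upper())
    acgt = sum(counts[base] for base in "ACGT")
    return 100.0 * (counts["G"] + counts["C"]) / acgt if acgt else 0.0

def amino_acid_proportions(proteins):
    """Proportion of each amino acid over all protein sequences given."""
    counts = Counter()
    for protein in proteins:
        counts.update(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def ek_qh_ratio(proteins):
    """(E+K)/(Q+H): Glu plus Lys count divided by Gln plus His count."""
    counts = Counter("".join(proteins).upper())
    denom = counts["Q"] + counts["H"]
    return (counts["E"] + counts["K"]) / denom if denom else float("inf")

def codon_counts(cds):
    """Count codons in a coding sequence, read in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Codon usage per amino acid then follows by grouping the codon counts with a genetic code table, which is left out here.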
18. Genome signatures in prokaryotic and eukaryotic
organisms
Each genome has a characteristic "signature" defined as the ratios
between the observed dinucleotide frequencies and the frequencies
expected if neighbors were chosen at random (dinucleotide relative
abundances). The remarkable fact is that the signature is relatively
constant throughout the genome; i.e., the patterns and levels of
dinucleotide relative abundances of every 50-kb segment of the
genome are about the same. Campbell, Mrazek and Karlin analyzed
the signatures of different genomes in [CMK99]. More precisely, they
compute the G+C% of the genome sequences, the dinucleotide relative
abundances of complete genomes, the genomic signature profiles of
organisms and also the genomic signature difference between pairs of
organisms.
This project consists in implementing an algorithm that computes the
G+C% of the genome sequences, the dinucleotide relative abundances
of complete genomes, the genomic signature profiles of organisms and
also the genomic signature difference between pairs of organisms.
Apply your program to prokaryotic and eukaryotic genome sequences
and present your results in a format similar to Figure 1 and Figure 2
(see [CMK99]).
[CMK99] “Genome signature comparisons among prokaryote, plasmid,
and mitochondrial DNA” by A. Campbell, J. Mrazek, and S. Karlin.
Proc Natl Acad Sci U S A. 1999 August 3; 96(16): 9184–9189.
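The signature defined above, the ratio of observed dinucleotide frequency to the product of the two mononucleotide frequencies, can be sketched in a few lines. The following illustrative Python sketch uses hypothetical function names and two assumptions that should be checked against [CMK99]: frequencies are taken from the given strand only (the article may combine a sequence with its reverse complement), and the difference between two signatures is taken as the average absolute difference over the 16 dinucleotides.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = [x + y for x, y in product("ACGT", repeat=2)]

def relative_abundances(seq):
    """rho[XY] = f(XY) / (f(X) * f(Y)) for all 16 dinucleotides XY."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono[base] for base in "ACGT")
    n_di = sum(di[d] for d in DINUCLEOTIDES)
    if n_di == 0:
        raise ValueError("sequence too short to contain a dinucleotide")
    rho = {}
    for d in DINUCLEOTIDES:
        expected = (mono[d[0]] / n_mono) * (mono[d[1]] / n_mono)
        observed = di[d] / n_di
        rho[d] = observed / expected if expected else 0.0
    return rho

def signature_difference(seq1, seq2):
    """Average absolute difference between two signatures (assumed form)."""
    r1, r2 = relative_abundances(seq1), relative_abundances(seq2)
    return sum(abs(r1[d] - r2[d]) for d in DINUCLEOTIDES) / 16
```

Applying `relative_abundances` to successive 50-kb segments of one genome gives the per-segment profile whose constancy the text describes.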
19. Asymmetric substitution patterns
The analyses of the genomes of three prokaryotes, Escherichia coli,
Bacillus subtilis, and Haemophilus influenzae, by Lobry [Lob96]
revealed a new type of genomic compartmentalization of base
frequencies. There was a departure from intrastrand equifrequency
between A and T or between C and G, showing that the substitution
patterns of the two strands of DNA were asymmetric. The positions of
the boundaries between these compartments were found to coincide
with the origin and terminus of chromosome replication.
Grigoriev [Gri98] developed a method of cumulative diagrams that
shows that the nucleotide composition of a microbial chromosome
changes at two points separated by about a half of its length. These
points also coincide with sites of replication origin and terminus for
all bacteria. The leading strand is found to contain more guanine than
cytosine residues.
This project consists in writing a program that calculates the 3
indices of base frequency using a nonoverlapping moving window of
size 10 kb [Lob96], computes the GC and AT skews as defined in
[Gri98], estimates the positions of the origin and terminus of
replication, and computes (C-G)/(C+G) % and (A-T)/(A+T) % in leading
and lagging coding sequences (See Table 2 in [Lob96]). Apply your
program to the sequences mentioned in both articles and compare the
results. Then, apply the program to other whole genome sequences.
[Lob96] “Asymmetric substitution patterns in the two DNA strands of
bacteria” by J.R. Lobry. Mol. Biol. Evol. 1996 May; 13(5):660-665.
[Gri98] “Analyzing genomes with cumulative skew diagrams” by A.
Grigoriev. Nucleic Acids Res. 1998 May 15; 26(10):2286-90.
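The skew computations for this project can be sketched as follows. This illustrative Python sketch computes a skew (x - y)/(x + y) in nonoverlapping windows, accumulates it along the sequence, and reads off the extreme points of the cumulative curve. The function names are hypothetical. The sign convention (G minus C), the use of the cumulative minimum as the origin estimate and the maximum as the terminus estimate, and the treatment of the sequence as linear are assumptions to be checked against [Gri98] and [Lob96].

```python
def window_skews(seq, window=10_000, x="G", y="C"):
    """Skew (x - y) / (x + y) in each nonoverlapping window of seq."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        nx, ny = chunk.count(x), chunk.count(y)
        skews.append((nx - ny) / (nx + ny) if nx + ny else 0.0)
    return skews

def cumulative(values):
    """Running sum of values, i.e. the cumulative skew curve."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

def estimate_origin_terminus(seq, window=10_000):
    """Positions (window ends) of the minimum and maximum cumulative GC skew."""
    curve = cumulative(window_skews(seq, window))
    if not curve:
        raise ValueError("sequence shorter than one window")
    origin_idx = min(range(len(curve)), key=curve.__getitem__)
    terminus_idx = max(range(len(curve)), key=curve.__getitem__)
    return (origin_idx + 1) * window, (terminus_idx + 1) * window
```

Calling `window_skews(seq, x="A", y="T")` gives the corresponding AT skew. Because bacterial chromosomes are circular, the curve depends on where the sequence file happens to start; only the positions of the two turning points are meaningful.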
Visualization Tools for Bioinformatics Algorithms
The main purpose of the projects in this category is to present a
visualization tool to assist students in learning algorithms related to
bioinformatics. One should be able to use the interactive, user-friendly,
educational tool you are to develop, in classroom
demonstrations, hands-on laboratories, self-directed work outside of
class, and distance learning. The visualization package will include
background material, a detailed explanation of the algorithm, with
examples, and quizzes and exercises. Using Java is highly
recommended.
20. Needleman and Wunsch’s DP Algorithm
Design and implement a visual interactive software package to
demonstrate how Needleman and Wunsch’s dynamic programming
algorithm is applied to solve protein and genomic sequence alignment
problems.
21. CLUSTAL W Algorithm
Design and implement a visual interactive software package to
demonstrate how the CLUSTAL W algorithm is applied to align
multiple sequences.
22. Phylogenetic Tree Construction
Choose a phylogenetic construction algorithm. Design and implement
a visual interactive software package to demonstrate how the
algorithm you selected constructs a phylogenetic tree.
B – Non-Programming Projects
Survey Papers
23. A Comprehensive Survey and Comparison of
Multiple Sequence Alignment Programs
In this project, you are to choose 5 MSA programs. Describe each one
in detail and compare their performances. The comparison should
include several runs with data from various databases.
24. A Comprehensive Survey and Comparison of Protein
Structure Visualization Tools
In this project, you are to choose 8 protein structure visualization
programs. Describe each one in detail and use them to visualize the
structure of different proteins. The project should include screen
dumps.
25. DNA Computing
Write a survey paper on the state of the art in DNA computing,
highlighting new directions.
26. Bayesian statistical methods for sequence
alignment and evolutionary distance estimation
Write a survey paper based on the results published by Agarwal and
States in 1996 [AS96], and Zhu et al. in 1998 [ZLL98]. For further
reading, a Bayesian bioinformatics tutorial by C. Lawrence is available
on the Internet [Law01]. Divide your paper into the following four
parts:
(a) Explain what Bayesian statistics is.
(b) Explain how the Bayesian statistics methods can be applied to
sequence analysis.
(c) Explain how the Bayesian statistics methods can be applied to
evolutionary distance estimation.
(d) Describe the most common Bayesian sequence alignment
algorithms.
[AS96] Agarwal P. and States D.J., A Bayesian evolutionary distance
for parametrically aligned sequences. Journal of Computational
Biology, vol. 3, pp. 1-17; 1996.
[Law01] Lawrence, C. Bayesian Bioinformatics Page
http://www.wadsworth.org/resnres/bioinfo/
[ZLL98] Zhu J., Liu J.S., and Lawrence C.E. Bayesian adaptive
sequence alignment algorithms. Bioinformatics, vol. 14, pp. 25-39;
1998.
27. Non-Coding DNA
Geneticists have long focused on just the small part of DNA that
contains blueprints for proteins. The remainder (in humans, 98% of
the DNA) was often dismissed as junk. But the discovery of many
hidden genes that work through RNA, rather than protein, has
overturned that assumption [Gib03]. Write a survey paper on that
topic, which would include: antisense RNAs, microRNAs and
riboswitches.
[Gib03] Gibbs, W. W. Unseen Genome: gems among the junk.
Scientific American, November 2003, pp. 47-53; 2003.
[Sto02] Storz, G. An Expanding Universe of Noncoding RNAs. Science,
vol. 296, May 17, 2002; pp. 1260-1263; 2002.
28. Stem Cells
Stem cells raise the prospect of regenerating failing body parts and
curing diseases that have so far defied drug-based treatment [LR04].
Embryonic stem cells are derived from the portion of a very early stage
embryo that would eventually give rise to an entire body. Because
embryonic stem cells originate in this primordial stage, they retain the
“pluripotent” ability to form any cell type in the body.
This survey project introduces stem cells and elaborates on their
various potential applications. It answers questions such as why we
cannot simply inject embryonic stem cells into the parts of the
body we wish to regenerate and let them take their cues from
the surrounding environment. More importantly, the survey paper has
to clearly identify areas in stem cell research where bioinformatics
could play (and is playing) an important role.
[LR04] Lanza, R., and Rosenthal, N. The stem cell challenge. Scientific
American, 93-99, June 2004.
Non-Survey Projects
29. PAM and BLOSUM Substitution Matrices
Explain how various PAM or BLOSUM substitution matrices were
constructed. Discuss the differences and similarities between the PAM
and BLOSUM matrices. Describe applications of each of the matrices.
30. Amino Acid Scoring Matrices
The most commonly used substitution scoring matrices are PAM and
BLOSUM. Choose two other amino acid scoring matrices. Describe
how they were constructed. Discuss the differences and similarities
between these matrices. Describe applications of each of the matrices.
31. Test of Markov Model of Evolution in Proteins
In 1985, Wilbur tested the Markov model of evolution and showed
that it may be applicable if certain changes are made in the way the
PAM matrices are calculated [Wil85]. Write a research paper
describing how the tests were done, which conclusions were drawn
from the tests and the extension of the original paper by George et al.
in 1990 [GBH90].
[GBH90] George, D.G., Barker, W.C., and Hunt L.T., Mutation data
matrix and its uses. Methods Enzymol. Vol. 183, pp. 333-351; 1990.
[Wil85] Wilbur, W.J. On the PAM model of protein evolution.
Molecular Biol. Evol. Vol. 2, pp. 434-447; 1985.
32. Which Genes Make Us Human?
The sequence of the human genome provides a new tool with which to
investigate human origins. It has been known since 1975, through the
work of Mary-Claire King and Allan Wilson, that the genomes of
humans and chimpanzees differ by only 1.3%. This DNA sequence
difference is unusually small for two species so different in anatomy
and behavior [Pra02]. This puzzle has sparked intense interest in the
chimpanzee genome, now scheduled to be completely sequenced. A
comparative chimp-human clone map has recently been published
[Fujiyama et al., 2002]. SNP mappers have jumped into the question,
reasoning that single nucleotide polymorphisms may hold the key
[Lew02].
However, gene expression studies will be required for any real answer
to the question, as predicted by King and Wilson. Researchers last
year presented the first comparative gene expression studies in
humans and other primates (Ape Genomics). A more comprehensive
analysis, including some proteomic data, shows major differences in
the pattern of brain gene expression between humans and chimps
[Enard et al., 2002]. Recently, a very exciting candidate gene has been
identified that appears to be linked with language ability (see speech
gene). In addition, the gene shows statistical evidence of strong
selection during human evolution [Stephens et al., 2001].
The background above describes at least three different approaches to
answering the question of what, genetically speaking, defines our
human species. Based on publicly available bioinformatics tools and
databases, can your group suggest a purely bioinformatics approach?
This project should outline the approach and perform the analysis.
Discuss the results and compare them to those found in the
references. Describe the limitations of this approach (if any).
[Adapted from B. Chapman, 2003].
[Pra02] “Altered gene expression could explain the genetic difference
between human and chimp: Evolutionists Present Their 1.3%
Solution” by L. Pray. The Scientist 16(16): 36-41, 2002.
[Fujiyama et al., 2002] “Construction and Analysis of a Human-Chimpanzee
Comparative Clone Map” by A. Fujiyama and 16 co-authors. Science,
vol. 295, January 2002; 131-134.
[Lew02] “SNPs as Windows on Evolution: Recent studies reveal that
the human species is young and genetically uniform” by R. Lewis. The
Scientist 16(1): 16-21, 2002.
[Enard et al., 2002] “Molecular evolution of FOXP2, a gene involved in
speech and language” by W. Enard, M. Przeworski, S.E. Fisher, C.S.L.
Lai, V. Wiebe, T. Kitano, A.P. Monaco and S. Paabo. Nature, vol.418,
August 2002; 869-872.
[Stephens et al., 2001] “Haplotype Variation and Linkage
Disequilibrium in 313 Human Genes” by J.C. Stephens and 27 co-authors.
Science 293: 489-493, 2001.
33. The USP6 Oncogene
Gene duplication is thought to be the major mechanism for the
emergence of novel genes during evolution. Such events are thought
to have occurred at early stages in the vertebrate lineage. Paulding et
al. [PRH03] report that the USP6 oncogene is derived from the fusion
of two other genes: USP32 and TBC1D3.
This project consists in performing the bioinformatics analysis
described in “Materials and Methods” and reporting the results.
[PRH03] “The Tre2 (USP6) oncogene is a hominoid-specific gene” by
C.A. Paulding, M. Ruvolo, and D.A. Haber. Proceedings of the National
Academy of Sciences, vol. 100, number 5, 2507-2511, March 4, 2003.
34. The Myosin Gene MYH16 Mutation
Powerful masticatory muscles are found in most primates, including
chimpanzees and gorillas, and were part of a prominent adaptation of
Australopithecus and Paranthropus, extinct genera of the family
Hominidae. In contrast, masticatory muscles are considerably smaller
in both modern and fossil members of Homo. The evolving hominid
masticatory apparatus—traceable to a Late Miocene, chimpanzee-like
morphology —shifted towards a pattern of gracilization nearly
simultaneously with accelerated encephalization in early Homo.
Stedman et al. [Stedman et al., 2004] showed that the gene encoding
the predominant myosin heavy chain (MYH) expressed in these
muscles was inactivated by a frameshifting mutation after the
lineages leading to humans and chimpanzees diverged. Loss of this
protein isoform is associated with marked size reductions in
individual muscle fibres and entire masticatory muscles. Using the
coding sequence for the myosin rod domains as a molecular clock, the
authors estimate that this mutation appeared approximately 2.4 million
years ago, predating the appearance of modern human body size and
emigration of Homo from Africa. They present this as the first proteomic
distinction between humans and chimpanzees that can be correlated
with a traceable anatomic imprint in the fossil record.
This project consists in performing the bioinformatics analysis
described in [Stedman et al., 2004] with the help of [Cur04] and
[Pen04]. The findings of Stedman et al. were challenged a year later by
Perry et al. [Perry et al., 2005]. Your project should address the
second article’s findings and give your own conclusion. In other
words, you should give your own evaluation of the criticisms raised
in [Perry et al., 2005].
[Pen04] “The Primate Bite: Brawn Versus Brain?” by E. Pennisi.
Science vol. 303, page 1957, March 2004.
[Stedman et al., 2004] “Myosin gene mutation correlates with
anatomical changes in the human lineage” by H.H. Stedman and 9 co-authors.
Nature vol. 428, pages 415-418, March 2004.
[Cur04] “Muscling in on hominid evolution” by P. Currie. Nature vol.
428, page 373, March 2004.
[Perry et al., 2005] “Comparative Analyses Reveal a Complex History
of Molecular Evolution for Human MYH16” by G.H. Perry, B.C.
Verrelli, A.C. Stone. Molecular Biology and Evolution, Vol. 22, Number
3, March 2005, pp. 379-382.