TITLES, AUTHORS, ABSTRACTS FOR ISMB02
Paper01, Presentation02
Title:
Mining viral protease data to extract cleavage knowledge
Authors:
Ajit Narayanan, Xikun Wu and Z. Rong Yang
Abstract:
Motivation: The motivation is to identify, through machine learning techniques,
specific patterns in HIV and HCV viral polyprotein amino acid residues where
viral protease cleaves the polyprotein as it leaves the ribosome. An
understanding of viral protease specificity may help the development of future
anti-viral drugs involving protease inhibitors by identifying specific features
of protease activity for further experimental investigation. While viral
sequence information is growing at a fast rate, there is still comparatively
little understanding of how viral polyproteins are cut into their functional
unit lengths. The aim of the work reported here is to investigate whether it is
possible to generalise from known cleavage sites to unknown cleavage sites for
two specific viruses - HIV and HCV. An understanding of proteolytic activity
for specific viruses will contribute to our understanding of viral protease
function in general, thereby leading to a greater understanding of protease
families and their substrate characteristics.
Results: Our results show that artificial neural networks and symbolic learning
techniques (See5) capture some fundamental and new substrate attributes, but
neural networks outperform their symbolic counterpart.
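A minimal sketch of the kind of learning setup this abstract describes, assuming (for illustration only, not the authors' protocol) fixed-length residue windows around candidate cleavage sites, one-hot encoding, and scikit-learn's neural network classifier; the windows and labels below are toy placeholders:

    # Hedged sketch: one-hot encode residue windows around candidate cleavage
    # sites and train a small neural network. Window length, classifier and
    # data are illustrative assumptions, not the authors' exact setup.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def encode_window(window):
        """One-hot encode a peptide window (e.g. an octamer spanning the bond)."""
        vec = np.zeros(len(window) * 20)
        for pos, aa in enumerate(window):
            vec[pos * 20 + AA_INDEX[aa]] = 1.0
        return vec

    windows = ["SQNYPIVQ", "ARVLAEAM", "GQVNYEEF", "AAAAAAAA"]  # toy octamers
    labels = [1, 1, 1, 0]                                       # 1 = cleaved, 0 = not
    X = np.array([encode_window(w) for w in windows])
    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    clf.fit(X, labels)
    print(clf.predict(X))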
Availability: Publicly available software was used (Stuttgart Neural Network
Simulator - http://www-ra.informatik.uni-tuebingen.de/SNNS/, and See5 -
http://www.rulequest.com). The datasets used (HIV, HCV) for See5 are available
at: http://www.dcs.ex.ac.uk/~anarayan/bioinf/ismbdatasets/
Contact: a.narayanan@ex.ac.uk, z.r.yang@ex.ac.uk
Paper02, Presentation03
Title:
The metric space of proteins - comparative study of clustering algorithms
Authors:
Ori Sasson, Nati Linial, Michal Linial
Abstract:
Motivation: A large fraction of biological research concentrates on individual
proteins and on small families of proteins. One of the current major challenges
in bioinformatics is to extend our knowledge to very large sets of proteins as well.
Several major projects have tackled this problem. Such undertakings usually
start with a process that clusters all known proteins or large subsets of this
space. Some work in this area is carried out automatically, while other attempts
incorporate expert advice and annotation.
Results: We propose a novel technique that automatically clusters protein
sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all
BLAST similarity test among them. With this similarity measure in hand we
proceed to perform a continuous bottom-up clustering process by applying
alternative rules for merging clusters. The outcome of this clustering process
is a classification of the input proteins into a hierarchy of clusters of
varying degrees of granularity. Here we compare the clusters that result from
alternative merging rules, and validate the results against InterPro.
Our preliminary results show that clusters that are consistent with several
rather than a single merging rule tend to comply with InterPro annotation. This
is an affirmation of the view that the protein space consists of families that
differ markedly in their evolutionary conservation.
Availability: The outcome of these investigations can be viewed in an
interactive Web site at http://www.protonet.cs.huji.ac.il.
Supplementary information: Biological examples for comparing the performance of
the different algorithms used for classification are presented in
http://www.protonet.cs.huji.ac.il/examples.html.
Contact: ori@cs.huji.ac.il
Paper03, Presentation04
Title:
Authors:
Nicholas Steffen, Scott Murphy, Lorenzo Tolleri, Wesley Hatfield, Richard
Lathrop
Abstract:
Motivation: Direct recognition, or direct readout, of DNA bases by a DNA-binding
protein involves amino acids that interact directly with features specific to
each base. Experimental evidence also shows that in many cases the protein
achieves partial sequence specificity by indirect recognition, i.e., by
recognizing structural properties of the DNA. (1) Could threading a DNA
sequence onto a crystal structure of bound DNA help explain the indirect
recognition component of sequence specificity? (2) Might the resulting
pure-structure computational motif manifest itself in familiar sequence-based
computational motifs?
Results: The starting structure motif was a crystal structure of DNA bound to
the integration host factor protein (IHF) of E. coli. IHF is known to
exhibit both direct and indirect recognition of its binding sites. (1)
Threading DNA sequences onto the crystal structure showed statistically
significant partial separation of 60 IHF binding sites from random and
intragenic sequences and was positively correlated with binding affinity. (2)
The crystal structure was shown to be equivalent to a linear Markov network, and
so, to a joint probability distribution over sequences, computable in linear
time. It was transformed algorithmically into several common pure-sequence
representations, including (a) small sets of short exact strings, (b) weight
matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and
(e) phylogenetic trees. In all cases the pure-sequence motifs retained
statistically significant partial separation of the IHF binding sites from
random and intragenic sequences. Most exhibited positive correlation with
binding affinity. The multiple alignment showed some conserved columns, and the
phylogenetic tree partially mixed low-energy sequences with IHF binding sites
but separated high-energy sequences. The conclusion is that deformation energy
explains part of indirect recognition, which explains part of IHF
sequence-specific binding.
Availability: Code and data on request.
Contact: Nick Steffen for code and Lorenzo Tolleri for data. nsteffen@uci.edu ,
Tolleri@chiron.it
Paper04, Presentation05
Title:
Beyond tandem repeats: complex pattern structures and distant regions of
similarity
Authors:
Amy M. Hauth, Deborah A. Joseph
Abstract:
Motivation: Tandem repeats (TRs) are associated with human disease, play a role
in evolution and are important in regulatory processes. Despite their
importance, locating and characterizing these patterns within anonymous DNA
sequences remains a challenge. In part, the difficulty is due to imperfect
conservation of patterns and complex pattern structures. We study recognition
algorithms for two complex pattern structures: variable length tandem repeats
(VLTRs) and multi-period tandem repeats (MPTRs).
Results: We extend previous algorithmic research to a class of regular tandem
repeats (RegTRs). We formally define RegTRs, as well as, two important
subclasses: VLTRs and MPTRs. We present algorithms for identification of TRs in
these classes. Furthermore, our algorithms identify degenerate VLTRs and MPTRs:
repeats containing substitutions, insertions and deletions. To illustrate our
work, we present results of our analysis for two difficult regions in cattle and
human which reflect practical occurrences of these subclasses in GenBank
sequence data. In addition, we show the applicability of our algorithmic
techniques for identifying Alu sequences, gene clusters and other distant
regions of similarity. We illustrate this with an example from yeast chromosome
I.
Availability: Algorithms can be accessed at http://www.cs.wisc.edu/areas/theory.
Contact: Amy M. Hauth (kryder@cs.wisc.edu, 608-831-2164) or Deborah A. Joseph
(joseph@cs.wisc.edu, 608-262-8022), FAX: 608-262-9777.
Paper05, Presentation06
Title:
The POPPs: Clustering and Searching Using Peptide Probability Profiles
Author:
Michael J. Wise
Abstract:
The POPPs is a suite of inter-related software tools that allows the user to
discover what is statistically "unusual" in the composition of an unknown
protein, or to automatically cluster proteins into families based on peptide
composition. Finally, the user can search for related proteins based on peptide
composition. Statistically based peptide composition provides a view of
proteins that is, to some extent, orthogonal to that provided by sequence. In a
test study, the POPP suite is able to regroup into their families sets of
approximately 100 randomised Pfam protein domains. The POPPs suite is used to
explore the diverse set of late embryogenesis abundant (LEA) proteins.
Availability: Contact the author.
Contact: mw263@cam.ac.uk
Paper06, Presentation08
Title:
A sequence profile based HMM for predicting and discriminating beta barrel
membrane proteins
Authors:
Pier Luigi Martelli, Piero Fariselli, Anders Krogh, Rita Casadio
Abstract:
Motivation: Membrane proteins are an abundant and functionally relevant subset
of proteins, putatively comprising about 15 to 30% of the proteomes of fully
sequenced organisms. These estimates are mainly computed on the basis of
sequence comparison and membrane protein prediction. It is therefore urgent to
develop methods capable of selecting membrane proteins, especially outer
membrane proteins, which are barely taken into consideration when proteome-wide
analysis is performed. This will also help protein annotation when no homologous
sequence is found in the database. The outer membrane proteins solved so far at
atomic resolution interact with the external membrane of bacteria through a
characteristic beta barrel structure comprising different even numbers of beta
strands (beta barrel membrane proteins). In this they differ from the membrane
proteins of the cytoplasmic membrane, which are endowed with alpha helix bundles
(all alpha membrane proteins), and they need specialised predictors.
Results: We develop an HMM that predicts the topology of beta barrel membrane
proteins using evolutionary information as input. The model is cyclic with six
types of states: two for the beta strand transmembrane core, one for the beta
strand cap on either side of the membrane, one for the inner loop, one for the
outer loop and one for the globular domain state in the middle of each loop. The
development of a specific input for HMM based on multiple sequence alignment is
novel. The per-residue accuracy of the model is 82% when a jack-knife procedure
is adopted. With a model optimisation method using a dynamic programming
algorithm, seven topological models out of the twelve proteins included in the
testing set are also correctly predicted. When used as a discriminator, the
model is rather selective. At a fixed probability value, it retains 84% of a
non-redundant set comprising 145 sequences of well-annotated outer membrane
proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins
including about 1200 chains with low sequence identity (< 30%) and 90% of a set
of all alpha membrane proteins, including 188 chains.
Availability: The program will be available on request from the authors.
Contact: gigi@lipid.biocomp.unibo.it, www.biocomp.unibo.it
Paper07, Presentation09
Title:
Fully automated ab initio protein structure prediction using I-SITES, HMMSTR
and ROSETTA
Authors:
Christopher Bystroff, Yu Shao
Abstract:
Motivation: The Monte Carlo fragment insertion method for protein tertiary
structure prediction (ROSETTA) of Baker and others has been merged with the
I-SITES library of sequence-structure motifs and the HMMSTR model for local
structure in proteins, to form a new public server for the ab initio prediction
of protein structure. The server performs several tasks in addition to tertiary
structure prediction, including a database search, amino acid profile generation,
fragment structure prediction, and backbone angle and secondary structure
prediction. Meeting reasonable service goals required improvements in
efficiency, in particular in the ROSETTA algorithm.
Results: The new server was used for blind predictions of 40 protein sequences
as part of the CASP4 blind structure prediction experiment. The results for 31
of those predictions are presented here. 61% of the residues overall were found
in topologically correct predictions, which are defined as fragments of 30
residues or more with a root-mean-square deviation in superimposed alpha carbons
of less than 6Å. HMMSTR 3-state secondary structure predictions were 73% correct
overall. Tertiary structure predictions did not improve the accuracy of
secondary structure prediction.
Availability: The server is accessible through the web at
http://isites.bio.rpi.edu/hmmstr/index.html. Programs are available upon
request for academic use. Licensing agreements are available for commercial
interests.
Supplementary information: http://isites.bio.rpi.edu,
http://predictioncenter.llnl.gov/casp4/
Contacts: bystrc@rpi.edu, shaoy@rpi.edu
Paper08, Presentation10
Title:
Prediction of Contact Maps by GIOHMMs and Recurrent Neural Networks Using
Lateral Propagation From All Four Cardinal Corners
Authors:
Gianluca Pollastri, Pierre Baldi
Abstract:
Motivation: Accurate prediction of protein contact maps is an important step in
computational structural proteomics. Because contact maps provide a translation
and rotation invariant topological representation of a protein, they can be used
as a fundamental intermediary step in protein structure prediction.
Results: We develop a new set of flexible machine learning architectures for the
prediction of contact maps, as well as other information processing and pattern
recognition tasks. The architectures can be viewed as recurrent neural network
parameterizations of a class of Bayesian networks we call generalized
input-output HMMs. For the specific case of contact maps, contextual information is
propagated laterally through four hidden planes, one for each cardinal corner.
We show that these architectures can be trained from examples and yield contact
map predictors that outperform previously reported methods. While several
extensions and improvements are in progress, the current version can accurately
predict 60.5% of contacts at a distance cutoff of 8Å and 45% of distant contacts
at 10Å, for proteins of length up to 300.
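Since the contact-map definition used here is concrete (alpha-carbon distances
under a fixed cutoff, 8Å above), a minimal numpy sketch of the target
representation may help; it computes the map a predictor is trained to output,
with placeholder coordinates, and is not the authors' predictor:

    # Minimal sketch: binary contact map from C-alpha coordinates under the
    # 8 Angstrom cutoff stated above. Coordinates are placeholders.
    import numpy as np

    def contact_map(ca_coords, cutoff=8.0):
        """ca_coords: (N, 3) array of C-alpha positions, one row per residue."""
        diff = ca_coords[:, None, :] - ca_coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # (N, N) pairwise distances
        return (dist < cutoff).astype(np.int8)

    coords = np.random.rand(50, 3) * 30.0          # placeholder structure
    cmap = contact_map(coords)
    print(cmap.shape, (cmap == cmap.T).all())      # symmetric by construction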
Availability and Contact: The contact map predictor will be made available
through
http://promoter.ics.uci.edu/BRNN-PRED/ as part of an existing suite of
proteomics predictors.
Email: {gpollast,pfbaldi}@ics.uci.edu
Paper09, Presentation11
Title:
Rate4Site: An Algorithmic Tool for the Identification of Functional Regions on
Proteins by Surface Mapping of Evolutionary Determinants within Their Homologues
Authors:
Tal Pupko, Rachel Bell, Itay Mayrose, Fabian Glaser, Nir Ben-Tal
Abstract:
Motivation: A number of proteins of known three-dimensional (3D) structure but
as yet unknown function exist. In light of the recent progress in structure
determination methodology, this number is likely to increase rapidly. A novel
method is presented here: "Rate4Site", which maps the rate of evolution among
homologous proteins onto the molecular surface of one of the homologues whose
3D-structure is known. Functionally important regions correspond to surface
patches of slowly evolving residues.
Results: Rate4Site estimates the rate of evolution of amino acid sites using the
maximum likelihood (ML) principle. The ML estimate of the rates considers the
topology and branch lengths of the phylogenetic tree, as well as the underlying
stochastic process. To demonstrate its potency, we study the Src SH2 domain.
Like previously established methods, Rate4Site detected the SH2 peptide-binding
groove. Interestingly, it also detected inter-domain interactions between the
SH2 domain and the rest of the Src protein that other methods failed to detect.
Availability: Rate4Site can be downloaded at: http://ashtoret.tau.ac.il/
Contact: tal@ism.ac.jp; rebell@ashtoret.tau.ac.il; fabian@ashtoret.tau.ac.il;
bental@ashtoret.tau.ac.il
Supplementary Information: multiple sequence alignment of homologous domains
from the SH2 protein family, the corresponding phylogenetic tree and additional
examples are available at http://ashtoret.tau.ac.il/~rebell
Paper10, Presentation12
Title:
Inferring sub-cellular localization through automated lexical analysis
Authors:
Rajesh Nair, Burkhard Rost
Abstract:
Motivation: The SWISS-PROT sequence database contains keywords of functional
annotations for many proteins. In contrast, information about the sub-cellular
localization is available for only a few proteins. Experts can often infer
localization from keywords describing protein function. We developed LOCkey, a
fully automated method for lexical analysis of SWISS-PROT keywords that assigns
sub-cellular localization. With the rapid growth in sequence data, the
biochemical characterisation of sequences has been falling behind. Our method
may be a useful tool for supplementing functional information already
automatically available.
Results: The method reached a level of more than 82% accuracy in a full
cross-validation test. Due to a lack of functional annotations, we could infer
localization for less than half of all proteins in SWISS-PROT. We applied LOCkey
to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae
(yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly),
Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found
about 8000 new annotations of sub-cellular localization for these eukaryotes.
Availability: Annotations of localization for eukaryotes at:
http://cubic.bioc.columbia.edu/services/LOCkey.
Contact: rost@columbia.edu
Paper11, Presentation14
Title:
Support vector regression applied to the determination of the developmental age
of a Drosophila embryo from its segmentation gene expression patterns
Authors:
Ekaterina Myasnikova, Anastassia Samsonova, John Reinitz, Maria Samsonova
Abstract:
Motivation: In this paper we address the problem of the determination of
developmental age of an embryo from its segmentation gene expression patterns in
Drosophila.
Results: By applying support vector regression we have developed a fast method
for automated staging of an embryo on the basis of its gene expression pattern.
Support vector regression is a statistical method for creating regression
functions of arbitrary type from a set of training data. The training set is
composed of embryos for which the precise developmental age was determined by
measuring the degree of membrane invagination. Testing the quality of
regression on the training set showed good prediction accuracy. The optimal
regression function was then used for the prediction of the gene expression
based age of embryos in which the precise age has not been measured by membrane
morphology. Moreover, we show that the same prediction accuracy can be achieved
when the dimensionality of the feature vector is reduced by applying factor
analysis. The data reduction allowed us to avoid over-fitting and to
increase the efficiency of the algorithm.
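A minimal sketch of the two-stage idea described above, with scikit-learn's
FactorAnalysis and SVR standing in for the authors' implementation; the
dimensions, hyperparameters, and data are placeholder assumptions:

    # Hedged sketch: factor analysis for dimensionality reduction, then
    # support vector regression of developmental age. All values below are
    # illustrative placeholders, not the authors' settings.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 400))         # placeholder expression-pattern features
    age = rng.uniform(0.0, 1.0, size=100)   # placeholder invagination-based ages

    model = make_pipeline(FactorAnalysis(n_components=10), SVR(kernel="rbf"))
    model.fit(X, age)
    print(model.predict(X[:5]))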
Availability: This software may be obtained from the authors.
Contact: samson@fn.csa.ru
Paper12, Presentation15
Title:
Variance stabilization applied to microarray data calibration and to the
quantification of differential expression
Authors:
Wolfgang Huber, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, Martin
Vingron
Abstract:
We introduce a statistical model for microarray gene expression data that
comprises data calibration, the quantification of differential expression, and
the quantification of measurement error. In particular, we derive a
transformation h for intensity measurements, and a difference statistic Δh
whose variance is approximately constant along the whole intensity range. This
forms a basis for statistical inference from microarray data, and provides a
rational data pre-processing strategy for multivariate analyses. For the
transformation h, the parametric form h(x) = arsinh(a + bx) is derived from a
model of the variance-versus-mean dependence for microarray intensity data,
using the method of variance stabilizing transformations. For large intensities,
h coincides with the logarithmic transformation, and Δh with the log-ratio. The
parameters of h together with those of the calibration between experiments are
estimated with a robust variant of maximum-likelihood estimation. We
demonstrate our approach on data sets from different experimental platforms,
including two-color cDNA arrays and a series of Affymetrix oligonucleotide
arrays.
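The stated large-intensity behaviour is easy to verify numerically; in the
sketch below the parameters a and b are made-up values, and the check confirms
that arsinh(a + bx) differs from log(x) by an approximately constant offset, so
differences Δh approach log-ratios:

    # Numerical check: for large x, arsinh(a + b*x) ~ log(x) + log(2b),
    # so the difference statistic behaves like a log-ratio. The parameters
    # a and b are made-up values for illustration.
    import numpy as np

    a, b = 1.0, 0.05
    h = lambda x: np.arcsinh(a + b * x)

    x = np.array([1e3, 1e4, 1e5, 1e6])
    print(h(x) - np.log(x))     # converges to a constant...
    print(np.log(2 * b))        # ...namely log(2b)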
Availability: Software is freely available for academic use as an R package at
http://www.dkfz.de/abt0840/whuber
Contact: w.huber@dkfz.de
Paper13, Presentation16
Title:
Binary Tree-Structured Vector Quantization Approach to Clustering and
Visualizing Microarray Data
Authors:
M. Sultan, D. Wigle, CA Cumbaa, M. Maziarz, Janice Glasgow, M.-S. Tsao and Igor
Jurisica
Abstract:
Motivation: With the increasing number of gene expression databases, the need
for more powerful analysis and visualization tools is growing. Many techniques
have successfully been applied to unravel latent similarities among genes and/or
experiments. Most of the current systems for microarray data analysis use
statistical methods, hierarchical clustering, self-organizing maps, support
vector machines, or k-means clustering to organize genes or experiments into
"meaningful" groups. Without prior explicit bias almost all of these clustering
methods applied to gene expression data not only produce different results, but
may also produce clusters with little or no biological relevance. Of these
methods, agglomerative hierarchical clustering has been the most widely applied,
although many limitations have been identified.
Results: Starting with a systematic comparison of the underlying theories behind
clustering approaches, we have devised a technique that combines tree-structured
vector quantization and partitive k-means clustering (BTSVQ). This hybrid
technique has revealed clinically relevant clusters in three large publicly
available data sets. In contrast to existing systems, our approach is less
sensitive to data preprocessing and data normalization. In addition, the
clustering results produced by the technique have strong similarities to those
of self-organizing maps (SOMs). We discuss the advantages and the mathematical
reasoning behind our approach.
Availability: The BTSVQ system is implemented in Matlab R12 using the SOM
toolbox for the visualization and preprocessing of the data
(http://www.cis.hut.fi/projects/somtoolbox/). BTSVQ is available for
non-commercial use (http://www.uhnres.utoronto.ca/ta3/BTSVQ).
Contact: ij@uhnres.utoronto.ca
Paper14, Presentation17
Title:
A Variance-Stabilizing Transformation for Gene-Expression Microarray Data
Authors:
Blythe Durbin, Johanna Hardin, Douglas Hawkins, David Rocke
Abstract:
Motivation: Standard statistical techniques often assume that data are normally
distributed, with constant variance not depending on the mean of the data. Data
that violate these assumptions can often be brought in line with the assumptions
by application of a transformation. Gene-expression microarray data have a
complicated error structure, with a variance that changes with the mean in a
non-linear fashion. Log transformations, which are often applied to microarray
data, can inflate the variance of observations near background.
Results: We introduce a transformation that stabilizes the variance of
microarray data across the full range of expression. Simulation studies also
suggest that this transformation approximately symmetrizes microarray data.
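For orientation, the non-linear mean-variance dependence referred to above is
commonly written as a two-component error model (our notation; a sketch, not
necessarily the exact form used in this paper):

    y = \alpha + \mu e^{\eta} + \epsilon, \qquad
    \eta \sim N(0, \sigma_\eta^2), \quad \epsilon \sim N(0, \sigma_\epsilon^2),

which gives

    \mathrm{Var}(y) = \mu^2 e^{\sigma_\eta^2}\big(e^{\sigma_\eta^2} - 1\big) + \sigma_\epsilon^2.

Near background (small \mu) the additive term dominates, which is why the log
transform inflates variance there; a variance-stabilizing transformation must
behave roughly linearly near zero and logarithmically at high expression.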
Contact: bpdurbin@wald.ucdavis.edu
Paper15, Presentation18
Title:
Linking Gene Expression Data with Patient Survival Times Using Partial Least
Squares
Authors:
Peter Park, Lu Tian, Isaac Kohane
Abstract:
There is an increasing need to link the large amount of genotypic data, gathered
using microarrays for example, with various phenotypic data from patients. The
classification problem in which gene expression data serve as predictors and a
class label phenotype as the binary outcome variable has been examined
extensively, but there has been less emphasis on dealing with other types of
phenotypic data. In particular, patient survival times with censoring are often
not used directly as a response variable due to the complications that arise
from censoring.
We show that the issues involving censored data can be circumvented by
reformulating the problem as a standard Poisson regression problem. The
procedure for solving the transformed problem is a combination of two approaches:
partial least squares, a regression technique that is especially effective when
there is severe collinearity due to a large number of predictors, and
generalized linear regression, which extends standard linear regression to deal
with various types of response variables. The linear combinations of the
original variables identified by the method are highly correlated with the
patient survival times and at the same time account for the variability in the
covariates. The algorithm is fast, as it does not involve any matrix
decompositions in the iterations. We apply our method to data sets from lung
carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.
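A simplified two-stage stand-in for the procedure described above (not the
authors' exact combined algorithm): treat each patient's event indicator as a
Poisson response with log follow-up time as an offset (an exponential survival
model in disguise), reduce the high-dimensional expression matrix with partial
least squares, and fit a Poisson generalized linear model on the components.
The offset construction and library choices are assumptions for illustration:

    # Hedged sketch: censored survival recast as Poisson regression after
    # PLS dimension reduction. A simplified stand-in, not the authors'
    # exact algorithm; all data below are placeholders.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    n, p = 80, 500
    X = rng.normal(size=(n, p))            # gene expression predictors
    time = rng.exponential(5.0, size=n)    # follow-up times
    event = rng.integers(0, 2, size=n)     # 1 = death observed, 0 = censored

    # Stage 1: PLS components (here naively against the raw event indicator).
    pls = PLSRegression(n_components=3).fit(X, event.astype(float))
    T = pls.transform(X)                   # low-dimensional scores

    # Stage 2: Poisson GLM with log(time) offset.
    glm = sm.GLM(event, sm.add_constant(T),
                 family=sm.families.Poisson(), offset=np.log(time))
    print(glm.fit().summary())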
Contact: peter_park@harvard.edu
Paper16, Presentation20
Title:
Microarray Synthesis through Multiple-Use PCR Primer Design
Authors:
Rohan Fernandes, Steven Skiena
Abstract:
A substantial percentage of the expense in constructing full-genome spotted
microarrays comes from the cost of synthesizing the PCR primers to amplify the
desired DNA. We propose a computationally-based method to substantially reduce
this cost. Historically, PCR primers are designed so that each primer occurs
uniquely in the genome. This condition is unnecessarily strong for selective
amplification, since only the primer pair associated with each amplification
need be unique. We demonstrate that careful design in a genome-level
amplification project permits us to save the cost of several thousand primers
over conventional approaches.
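The saving rests on simple combinatorics: k primers used pairwise yield
k(k-1)/2 distinct primer pairs, so on the order of sqrt(2n) primers can in
principle cover n amplification products, versus 2n primers conventionally. A
small sketch of that idealized bound (real designs face placement and
specificity constraints, so this is an upper limit on reuse, not the paper's
design method):

    # Idealized arithmetic: smallest k with C(k, 2) >= n products, versus
    # the conventional 2n primers. An upper bound, not a design algorithm.
    def min_primers(n_products):
        k = 1
        while k * (k - 1) // 2 < n_products:
            k += 1
        return k

    for n in (1000, 10000, 30000):
        print(n, "products:", min_primers(n), "multi-use primers vs",
              2 * n, "conventional")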
Contact: {skiena, rohan}@cs.sunysb.edu
Paper17, Presentation21
Title:
Discovering Statistically Significant Biclusters in Gene Expression Data
Authors:
Amos Tanay, Roded Sharan, Ron Shamir
Abstract:
In gene expression data, a bicluster is a subset of the genes exhibiting
consistent patterns over a subset of the conditions. We propose a new method to
detect significant biclusters in large expression datasets. Our approach is
graph theoretic coupled with statistical modeling of the data. Under plausible
assumptions, our algorithm is polynomial and is guaranteed to find the most
significant biclusters. We tested our method on a collection of yeast
expression profiles and on a human cancer dataset. Cross-validation results
show high specificity in assigning function to genes based on their biclusters,
and we are able to annotate in this way 196 uncharacterized yeast genes. We also
demonstrate how the biclusters lead to detecting new concrete biological
associations. In cancer data we are able to detect and relate finer tissue
types than was previously possible. We also show that the method outperforms
the biclustering algorithm of Cheng and Church (2000).
Contact: {amos,roded,rshamir}@tau.ac.il.
Availability: www.cs.tau.ac.il/~rshamir/biclust.html.
Paper18, Presentation22
Title:
Co-Clustering of Biological Networks and Gene Expression Data
Authors:
Daniel Hanisch, Alexander Zien, Ralf Zimmer, Thomas Lengauer
Abstract:
Motivation: Large scale gene expression data are often analyzed by clustering
genes based on gene expression data alone, though a-priori knowledge in the form
of biological networks is available. The use of this additional information
promises to improve exploratory analysis considerably.
Results: We propose to construct a distance function which combines information
from expression data and biological networks. Based on this function, we compute
a joint clustering of genes and vertices of the network. This general approach
is elaborated for metabolic networks. We define a graph distance function on
such networks and combine it with a correlation-based distance function for gene
expression measurements. A hierarchical clustering and an associated statistical
measure are computed to arrive at a reasonable number of clusters. Our method is
validated using expression data of the yeast diauxic shift. The resulting
clusters are easily interpretable in terms of the biochemical network and the
gene expression data and suggest that our method is able to automatically
identify processes that are relevant under the measured conditions.
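A toy sketch of the kind of combined distance described above (our own
simplified formulation, not the authors' exact definition): a normalized
shortest-path distance on the network blended with a correlation-based
expression distance through a mixing weight:

    # Hedged sketch: combined distance = lam * graph distance +
    # (1 - lam) * correlation distance. The mixing rule and normalization
    # are toy stand-ins for the paper's definition.
    import networkx as nx
    import numpy as np

    def combined_distance(g1, g2, network, expr, lam=0.5):
        d_graph = nx.shortest_path_length(network, g1, g2)
        d_graph = d_graph / (network.number_of_nodes() - 1)  # scale to [0, 1]
        r = np.corrcoef(expr[g1], expr[g2])[0, 1]
        d_expr = (1.0 - r) / 2.0                             # correlation distance
        return lam * d_graph + (1.0 - lam) * d_expr

    net = nx.path_graph(5)                             # toy metabolic neighbourhood
    expr = {i: np.random.rand(12) for i in net.nodes}  # toy expression time courses
    print(combined_distance(0, 3, net, expr))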
Contact: Daniel.Hanisch@scai.fhg.de
Paper19, Presentation23
Title:
Statistical process control for large scale microarray experiments
Authors:
Fabian Model, Thomas Koenig, Christian Piepenbrock, Peter Adorjan
Abstract:
Motivation: Maintaining and controlling data quality is a key problem in large
scale microarray studies. In particular, systematic changes in experimental
conditions across multiple chips can seriously affect quality and even lead to
false biological conclusions. Traditionally the influence of these effects can
only be minimized by expensive repeated measurements, because a detailed
understanding of all process relevant parameters seems impossible.
Results: We introduce a novel method for microarray process control that
estimates quality solely based on the distribution of the actual measurements
without requiring repeated experiments. A robust version of principal component
analysis detects single outlier microarrays and thereby enables the use of
techniques from multivariate statistical process control. In particular, the
T^2 control chart reliably tracks undesired changes in process relevant
parameters. This can be used to improve the microarray process itself, limits
necessary repetitions to affected samples only, and therefore maintains quality
in a cost-effective way. We demonstrate the power of the approach on 3 large sets of
DNA methylation microarray data.
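For orientation, a minimal classical version of the chart described above; it
uses ordinary rather than robust PCA and a chi-square control limit, both
deliberate simplifications of the published procedure:

    # Hedged sketch: Hotelling T^2 on PCA scores for chip-level process
    # control. Ordinary PCA and a chi-square limit are simplifications of
    # the robust procedure described above; data are placeholders.
    import numpy as np
    from scipy import stats

    def t2_chart(X, n_components=3, alpha=0.01):
        Xc = X - X.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        idx = np.argsort(eigvals)[::-1][:n_components]
        scores = Xc @ eigvecs[:, idx]                  # per-chip PC scores
        t2 = (scores ** 2 / eigvals[idx]).sum(axis=1)  # Hotelling T^2 per chip
        limit = stats.chi2.ppf(1 - alpha, df=n_components)
        return t2, limit

    X = np.random.rand(40, 8)                  # toy per-chip quality features
    t2, limit = t2_chart(X)
    print(np.where(t2 > limit)[0])             # chips flagged as out of control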
Contact: Fabian.Model@epigenomics.com
Paper20, Presentation24
Title:
Evaluating Machine Learning Approaches for Aiding Probe Selection for
Gene-Expression Arrays
Authors:
John Tobler, Mike Molla, Jude Shavlik
Abstract:
Motivation: Microarrays are a fast and cost-effective method of performing
thousands of DNA hybridization experiments simultaneously. DNA probes are
typically used to measure the expression level of specific genes. Because
probes greatly vary in the quality of their hybridizations, choosing good probes
is a difficult task. If one could accurately choose probes that are likely to
hybridize well, then fewer probes would be needed to represent each gene in a
gene-expression microarray, and, hence, more genes could be placed on an array
of a given physical size. Our goal is to empirically evaluate how successfully
three standard machine-learning algorithms - naïve Bayes, decision trees, and
artificial neural networks - can be applied to the task of predicting good
probes. Fortunately it is relatively easy to get training examples for such a
learning task: place various probes on a gene chip, add a sample where the
corresponding genes are highly expressed, and then record how well each probe
measures the presence of its corresponding gene. With such training examples,
it is possible that an accurate predictor of probe quality can be learned.
Results: Two of the learning algorithms we investigate - naïve Bayes and neural
networks - learn to predict probe quality surprisingly well. For example, in
the top ten predicted probes for a given gene not used for training, on average
about five rank in the top 2.5% of that gene's hundreds of possible probes.
Decision-tree induction and the simple approach of using predicted melting
temperature to rank probes perform significantly worse than these two algorithms.
The features we use to represent probes are very easily computed and the time
taken to score each candidate probe after training is minor. Training the naïve
Bayes algorithm takes very little time, and while it takes over 10 times as long
to train a neural network, that time is still not very substantial (on the order
of a few hours on a desktop workstation). We also report the information
contained in the features we use to describe the probes. We find the fraction
of cytosine in the probe to be the most informative feature. We also find, not
surprisingly, that the nucleotides in the middle of the probe's sequence are more
informative than those at the ends of the sequence.
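A minimal sketch of the setup described above, pairing easily computed
composition features (including the cytosine fraction singled out as most
informative) with scikit-learn's naive Bayes as a stand-in learner; the probes
and quality labels are placeholders:

    # Hedged sketch: simple sequence-composition features feeding a naive
    # Bayes probe-quality classifier. Probes and labels are placeholders.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def probe_features(seq):
        n = len(seq)
        return [seq.count(b) / n for b in "ACGT"]   # base fractions; index 1 is C

    probes = ["ACGTACGTACGTACGTACGTACGT", "CCCCGGGGCCCCGGGGCCCCGGGG",
              "ATATATATATATATATATATATAT", "GCGCATGCGCATGCGCATGCGCAT"]
    quality = [1, 1, 0, 1]                          # placeholder good/bad labels

    X = np.array([probe_features(p) for p in probes])
    clf = GaussianNB().fit(X, quality)
    print(clf.predict_proba(X)[:, 1])               # probability of a good probe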
Contact: molla@cs.wisc.edu
Paper21, Presentation26
Title:
The Degenerate Primer Design Problem
Authors:
Chaim Linhart, Ron Shamir
Abstract:
A PCR primer sequence is called degenerate if some of its positions have several
possible bases. The degeneracy of the primer is the number of unique sequence
combinations it contains. We study the problem of designing a pair of primers
with prescribed degeneracy that match a maximum number of given input sequences.
Such problems occur when studying a family of genes that is known only in part,
or is known in a related species. We prove that various simplified versions of
the problem are hard, show the polynomiality of some restricted cases, and
develop approximation algorithms for one variant. Based on these algorithms, we
implemented a program called HYDEN for designing highly-degenerate primers for a
set of genomic sequences. We report on the success of the program in an
experimental scheme for identifying all human olfactory receptor (OR) genes. In
that project, HYDEN was used to design primers with degeneracies up to 10^10
that amplified with high specificity many novel genes of that family, tripling
the number of OR genes known at the time.
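Degeneracy as defined above multiplies across positions; a minimal sketch using
the standard IUPAC nucleotide codes:

    # Degeneracy of a degenerate primer = product over positions of the
    # number of bases each IUPAC code allows (matching the definition above).
    IUPAC = {"A": 1, "C": 1, "G": 1, "T": 1,
             "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
             "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}

    def degeneracy(primer):
        d = 1
        for code in primer:
            d *= IUPAC[code]
        return d

    print(degeneracy("ACGTNNRY"))   # 4 * 4 * 2 * 2 = 64 concrete sequences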
Availability: Available on request from the authors
Contact: {chaiml,rshamir}@tau.ac.il
Paper22, Presentation27
Title:
Splicing Graphs and EST Assembly Problem
Authors:
Steffen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, Pavel Pevzner
Abstract:
Motivation: The traditional approach to annotate alternative splicing is to
investigate every splicing variant of the gene in a case-by-case fashion. This
approach, while useful, has some serious shortcomings. Recent studies indicate
that alternative splicing is more frequent than previously thought and some
genes may produce tens of thousands of different transcripts. A list of
alternatively spliced variants for such genes would be difficult to build and
hard to analyze. Moreover, such a list does not show the relationships between
different transcripts and does not show the overall structure of all transcripts.
A better approach would be to represent all splicing variants for a given gene
in a way that captures the relationships between different splicing variants.
Results: We introduce the notion of the splicing graph, which is a natural and
convenient representation of all splicing variants. The key difference from the
existing approaches is that we abandon the linear (sequence) representation of
each transcript and substitute it with a graph representation where each
transcript corresponds to a path in the graph. We further design an algorithm to
assemble EST reads into the splicing graph rather than assembling them into each
splicing variant in a case-by-case fashion.
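A toy sketch of the representation described above: nodes are exons, each
observed transcript contributes the edges along its path, and the union over
transcripts is the splicing graph (the exon identifiers and transcripts below
are placeholders):

    # Hedged sketch: build a splicing graph in which every transcript is a
    # path. Exon ids and transcripts are toy placeholders.
    from collections import defaultdict

    def splicing_graph(transcripts):
        """transcripts: lists of exon ids in transcription order."""
        edges = defaultdict(set)
        for tx in transcripts:
            for a, b in zip(tx, tx[1:]):
                edges[a].add(b)            # directed edge: exon a -> exon b
        return edges

    transcripts = [["e1", "e2", "e3", "e5"],    # skips e4
                   ["e1", "e3", "e5"],          # skips e2 and e4
                   ["e1", "e2", "e4", "e5"]]
    g = splicing_graph(transcripts)
    for exon in sorted(g):
        print(exon, "->", sorted(g[exon]))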
Availability: http://www-cse.ucsd.edu/groups/bioinformatics/software.html
Contact: sheber@ucsd.edu
Paper23, Presentation28
Title:
Exact Genetic Linkage Computations for General Pedigrees
Authors:
Ma'ayan Fishelson, Dan Geiger
Abstract:
Motivation: Genetic linkage analysis is a useful statistical tool for mapping
disease genes and for associating functionality of genes to their location on
the chromosome. There is a need for a program that computes multipoint
likelihood on general pedigrees with many markers that also deals with two-locus
disease models.
Results: In this paper we present algorithms for performing exact multipoint
likelihood calculations on general pedigrees with a large number of highly
polymorphic markers, taking into account a variety of disease models. We have
implemented these algorithms in a new computer program called SUPERLINK which
outperforms leading linkage software with regards to functionality, speed,
memory requirements and extensibility.
Availability: SUPERLINK is available at
http://bioinfo.cs.technion.ac.il/superlink
Contact: {fmaayan, dang}@cs.technion.ac.il
Paper24, Presentation29
Title:
Assigning Probes into a Small Number of Pools Separable by Electrophoresis
Authors:
Teemu Kivioja, Mikko Arvas, Kari Kataja, Merja Penttilä, Hans Söderlund, Esko
Ukkonen
Abstract:
Motivation: Measuring transcriptional expression levels (transcriptional
profiling) has become one of the most important methods in functional genomics.
Still, new measuring methods are needed to obtain more reliable, quantitative
data about transcription on a genomic scale. In this paper we concentrate on
certain computational optimization problems arising in the design of one such
novel method. From a computational point of view the key feature of the new
method is that the hybridized probes are distinguished from each other based on
their different size. Therefore the probes have to be assigned into pools such
that the probes in the same pool have unique sizes different enough from each
other. Identification of expressed RNA is given by probe pool and probe size
while quantification is given by the label of the probe, e.g. fluorescence
intensity.
Results: We show how to computationally find the probes and assign them into
pools for a whole genome such that i) each gene has a specific probe suitable
for amplification and hybridization, and ii) the expression level measurement
can be done in a minimal number of pools separable by electrophoresis in order
to minimize the total experiment cost of the measurement. Our main result is a
polynomial-time approximation algorithm for assigning the probes into pools. We
demonstrate the feasibility of the procedure by selecting probes for the yeast
genome and assigning them into less than 100 pools. The probe sequences and
their assignment into pools are available for academic research on request from
the authors.
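A greedy first-fit stand-in for the pooling constraint described above (not the
paper's polynomial-time approximation algorithm): place each probe into the
first pool whose existing lengths all differ from it by at least the
electrophoresis resolution; lengths and the gap threshold are placeholders.

    # Hedged sketch: first-fit assignment of probes into pools such that
    # lengths within a pool differ by at least `min_gap` (the resolution of
    # the electrophoresis). A greedy stand-in, not the paper's algorithm.
    def assign_pools(lengths, min_gap=4):
        pools = []                                   # each pool: list of lengths
        for L in sorted(lengths):
            for pool in pools:
                if all(abs(L - x) >= min_gap for x in pool):
                    pool.append(L)
                    break
            else:
                pools.append([L])                    # open a new pool
        return pools

    probe_lengths = [60, 62, 65, 70, 71, 74, 80, 81, 84, 90]
    for i, pool in enumerate(assign_pools(probe_lengths)):
        print("pool", i, pool)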
Contact: Teemu.Kivioja@cs.Helsinki.FI
Paper25, Presentation30
Title:
Representing Genetic Sequence Data for Pharmacogenomics: An Evolutionary
Approach Using Ontological and Relational Models
Authors:
Daniel Rubin, Farhad Shafa, Diane Oliver, Micheal Hewett, Teri Klein, Russ
Altman
Abstract:
Motivation: The information model chosen to store biological data affects the
types of queries possible, database performance, and difficulty in updating that
information model. Genetic sequence data for pharmacogenetics studies can be
complex, and the best information model to use may change over time. As
experimental and analytical methods change, and as biological knowledge advances,
the data storage requirements and types of queries needed may also change.
Results: We developed a model for genetic sequence and polymorphism data, and
used XML schema to specify the elements and attributes required for this model.
We implemented this model as an ontology in a frame-based representation and as
a relational model in a database system. We collected genetic data from two
pharmacogenetics resequencing studies, and formulated queries useful for
analyzing these data. We compared the ontology and relational models in terms
of query complexity, performance, and difficulty in changing the information
model. Our results demonstrate benefits of evolving the schema for storing
pharmacogenetics data: ontologies perform well in early design stages as the
information model changes rapidly and simplify query formulation, while
relational models offer improved query speed once the information model and
types of queries needed stabilize.
Availability: Our ontology and relational models are available at
http://www.pharmgkb.org/data/schemas/.
Contact: rubin@smi.stanford.edu, russ.altman@stanford.edu, help@pharmgkb.org
Paper26, Presentation32
Title:
Evaluating Functional Network Inference Using Simulations of Complex Biological
Systems
Authors:
V. Anne Smith, Erich D. Jarvis, Alexander J. Hartemink
Abstract:
Motivation: Although many network inference algorithms have been presented in
the bioinformatics literature, no suitable approach has been formulated for
evaluating their effectiveness at recovering models of complex biological
systems from limited data. To overcome this limitation, we propose an approach
to evaluate network inference algorithms according to their ability to recover a
complex functional network from biologically reasonable simulated data.
Results: We designed a simulator to generate data from a complex biological
system at multiple levels of organization: behavior, neural anatomy, brain
electrophysiology, and gene expression of songbirds. About 90% of the simulated
variables are unregulated by other variables in the system and are included
simply as distracters. We sampled the simulated data at intervals as one would
sample from a biological system in practice, and then used the sampled data to
evaluate the effectiveness of an algorithm we developed for functional network
inference. We found that our algorithm is highly effective at recovering the
functional network structure of the simulated system---including the irrelevance
of unregulated variables---from sampled data alone. To assess the
reproducibility of these results, we tested our inference algorithm on 50
separately simulated sets of data, and it consistently recovered the complex
functional network structure underlying the simulated data almost perfectly.
To our knowledge, this is the first approach for evaluating the effectiveness of
functional network inference algorithms at recovering models from limited data.
Our simulation approach also enables researchers to design, a priori,
experiments and data-collection protocols that are amenable to functional
network inference.
Availability: Source code and simulated data are available upon request.
Contact: amink@cs.duke.edu, asmith@neuro.duke.edu, jarvis@neuro.duke.edu
Paper27, Presentation33
Title:
The Pathway Tools Software
Authors:
Peter Karp, Suzanne Paley, Pedro Romero
Abstract:
Motivation: Bioinformatics requires reusable software tools for creating
model-organism databases (MODs).
Results: The Pathway Tools is a reusable, production-quality software
environment for creating a type of MOD called a Pathway/Genome Database (PGDB).
A PGDB such as EcoCyc (see ecocyc.org) integrates our evolving understanding of
the genes, proteins, metabolic network, and genetic network of an organism.
This paper provides an overview of the four main components of the Pathway Tools:
The PathoLogic component supports creation of new PGDBs from the annotated
genome of an organism. The Pathway/Genome Navigator provides query,
visualization, and Web-publishing services for PGDBs. The Pathway/Genome
Editors support interactive updating of PGDBs. The Pathway Tools ontology
defines the schema of PGDBs. The Pathway Tools makes use of the Ocelot object
database system for data management services for PGDBs. The Pathway Tools has
been used to build PGDBs for 13 organisms within SRI and by external users.
Availability: The software is freely available to academics and is available for
a fee to commercial institutions. Contact ptools-support@ai.sri.com for
information on obtaining the software.
Contact: pkarp@ai.sri.com
Paper28, Presentation34
Title:
Discovering regulatory and signaling circuits in molecular interaction networks
Authors:
Trey Ideker, Owen Ozier, Benno Schwikowski, Andrew Siegel
Abstract:
Motivation: In model organisms such as yeast, large databases of protein-protein
and protein-DNA interactions have become an extremely important resource for the
study of protein function, evolution, and gene regulatory dynamics. In this
paper we demonstrate that by integrating these interactions with
widely-available mRNA expression data, it is possible to generate concrete hypotheses
for the underlying mechanisms governing the observed changes in gene expression.
To perform this integration systematically and at large scale, we introduce an
approach for screening a molecular interaction network to identify active
subnetworks, i.e., connected regions of the network that show significant
changes in expression over particular subsets of conditions. The method we
present here combines a rigorous statistical measure for scoring subnetworks
with a search algorithm for identifying subnetworks with high score.
Results: We evaluated our procedure on a small interaction network of 332 genes
and a large network of 4160 genes containing all 7462 protein-protein and
protein-DNA interactions in the yeast public databases. In the case of the small
network, we identified five significant subnetworks that covered 41 out of 77
(53%) of all significant changes in expression. Both network analyses returned
several top-scoring subnetworks with good correspondence to known regulatory
mechanisms in the literature. These results demonstrate how large-scale genomic
approaches may be used to uncover signaling and regulatory pathways in a
systematic, integrative fashion.
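A minimal sketch of a subnetwork score consistent with the description above:
convert per-gene significance values to z-scores and aggregate over a candidate
connected region, normalizing by subnetwork size so that scores of
different-sized regions are comparable. The search over connected subnetworks
(not shown) is where the algorithmic work lies.

    # Hedged sketch: score a candidate subnetwork by aggregating per-gene
    # z-scores, z_A = (1/sqrt(k)) * sum(z_i). A search procedure (not shown)
    # would seek connected subnetworks maximizing this score.
    import numpy as np
    from scipy import stats

    def subnetwork_score(pvalues):
        z = stats.norm.isf(np.asarray(pvalues))   # p-value -> z-score
        return z.sum() / np.sqrt(len(z))

    print(subnetwork_score([0.001, 0.02, 0.3]))   # strong + moderate + weak evidence
    print(subnetwork_score([0.5, 0.5, 0.5]))      # background-level subnetwork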
Availability: The methods presented in this paper are implemented in the
Cytoscape software package which is available to the academic community at
www.cytoscape.org.
Contact: trey@wi.mit.edu
Paper29, Presentation35
Title:
Modelling Pathways in E.coli from Time Series Expression Profiles
Authors:
Irene Ong, David Page, Jeremy Glasner
Abstract:
Motivation: Cells continuously reprogram their gene expression network as they
move through the cell cycle or sense changes in their environment. In order to
understand the regulation of cells, time series expression profiles provide a
more complete picture than single time point expression profiles. Few analysis
techniques, however, are well suited to modeling such time series data.
Results: We describe an approach that naturally handles time series data with
the capabilities of modeling causality, feedback loops, and environmental or
hidden variables using a Dynamic Bayesian network. We also present a novel way
of combining prior biological knowledge and current observations to improve the
quality of analysis and to model interactions between sets of genes rather than
individual genes. Our approach is evaluated on time series expression data
measured in response to physiological changes that affect tryptophan metabolism
in E. coli. Results indicate that this approach is capable of finding
correlations between sets of related genes.
Contact: ong@cs.wisc.edu
Paper30, Presentation36
Title:
Of Truth and Pathways: Chasing Bits of Information through Myriads of Articles
Authors:
Michael Krauthammer, Carol Friedman, George Hripcsak, Ivan Iossifov, Andrey
Rzhetsky
Abstract:
Knowledge of interactions between molecules in living cells is indispensable for
theoretical analysis and practical applications in modern genomics and molecular
biology. Building such networks relies on the assumption that the correct
molecular interactions are known or can be identified by reading a few research
articles. However, this assumption does not necessarily hold, as truth is rather
an emerging property based on many potentially conflicting facts. This paper
explores the processes of knowledge generation and publishing in the molecular
biology literature using modeling and analysis of real molecular interaction
data. The data analyzed in this article were automatically extracted from 50,000
research articles in molecular biology using a computer system called GeneWays,
which contains a natural language processing module. The paper indicates that
truthfulness of statements is associated in the minds of scientists with the
relative importance (connectedness) of substances under study, revealing a
potential selection bias in the reporting of research results. Aiming at
understanding the statistical properties of the life cycle of biological facts
reported in research articles, we formulate a stochastic model describing
generation and propagation of knowledge about molecular interactions through
scientific publications. We hope that in the future such a model can be useful
for automatically producing consensus views of molecular interaction data.
Contact: ar345@columbia.edu
Paper31, Presentation37
Title:
A Fast and Robust Method to Infer and Characterize an Active Regulator Set for
Molecular Pathways
Authors:
Dana Pe'er, Aviv Regev, Amos Tanay
Abstract:
Regulatory relations between genes are an important component of molecular
pathways. Here, we devise a novel global method that uses a set of gene
expression profiles to find a small set of relevant active regulators,
identifies the genes that they regulate, and automatically annotates them. We
show that our algorithm is capable of handling a large number of genes in a
short time and is robust to a wide range of parameters. We apply our method to a
combined dataset of S. cerevisiae expression profiles, and validate the
resulting model of regulation by cross-validation and extensive biological
analysis of the selected regulators and their derived annotations.
Paper32, Presentation39
Title:
Marginalized Kernels for Biological Sequences
Authors:
Koji Tsuda, Taishin Kin, Kiyoshi Asai
Abstract:
Motivation: Kernel methods such as support vector machines require a kernel
function between objects to be defined a priori. Several attempts have been made
to derive kernels from probability distributions, e.g. the Fisher kernel.
However, a general methodology for designing kernels is not fully developed.
Results: We propose a reasonable way of designing a kernel when objects are
generated from latent variable models (e.g. HMMs). First, a joint kernel is
designed for the complete data, which include both visible and hidden variables.
Then a marginalized kernel for visible data is obtained by taking the
expectation with respect to hidden variables. We will show that the Fisher
kernel is a special case of marginalized kernels, which gives another viewpoint
to the Fisher kernel theory. Although our approach can be applied to any object,
we particularly derive several marginalized kernels useful for biological
sequences (e.g. DNA and proteins). The effectiveness of marginalized kernels is
illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino
acid sequences.
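In symbols, the construction described above can be sketched as follows (our
notation): for visible data x, hidden variable h, and a joint kernel K_z on
complete data z = (x, h), the marginalized kernel is the posterior expectation

    K(x, x') = \sum_{h} \sum_{h'} P(h \mid x)\, P(h' \mid x')\, K_z\big((x, h), (x', h')\big),

and the Fisher kernel arises as a special case for a particular choice of K_z,
as the Results paragraph notes.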
Contact: koji.tsuda@aist.go.jp
Paper33, Presentation40
Title:
A tree kernel to analyze phylogenetic profiles
Authors:
Jean-Philippe Vert
Abstract:
Motivation: The phylogenetic profile of a protein is a string that encodes the
presence or absence of the protein in every fully sequenced genome. Because
proteins that participate in a common structural complex or metabolic pathway
are likely to evolve in a correlated fashion, the phylogenetic profiles of such
proteins are often "similar" or at least "related" to each other. The question
we address in this paper is the following: how to measure the "similarity"
between two profiles, in an evolutionarily relevant way, in order to develop
efficient function prediction methods?
Results: We show how the profiles can be mapped to a high-dimensional vector
space which incorporates evolutionarily relevant information, and we provide an
algorithm to compute efficiently the inner product in that space, which we call
the tree kernel. The tree kernel can be used by any kernel-based analysis method
for classification or data mining of phylogenetic profiles. As an application a
Support Vector Machine (SVM) trained to predict the functional class of a gene
from its phylogenetic profile is shown to perform better with the tree kernel
than with a naive kernel that does not include any information about the
phylogenetic relationships among species. Moreover a kernel principal component
analysis (KPCA) of the phylogenetic profiles illustrates the sensitivity of the
tree kernel to evolutionarily relevant variations.
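For contrast with the tree kernel, the naive baseline mentioned above amounts to
an inner product of presence/absence vectors; a minimal sketch with a
precomputed-kernel SVM (the profiles and class labels are placeholders):

    # Hedged sketch: the naive kernel on phylogenetic profiles is a dot
    # product of presence/absence vectors, here fed to an SVM via a
    # precomputed Gram matrix. Profiles and labels are placeholders.
    import numpy as np
    from sklearn.svm import SVC

    profiles = np.array([[1, 1, 0, 1, 0],   # presence/absence across 5 genomes
                         [1, 1, 0, 1, 1],
                         [0, 0, 1, 0, 1],
                         [0, 1, 1, 0, 1]])
    labels = [0, 0, 1, 1]                   # placeholder functional classes

    gram = profiles @ profiles.T            # naive (linear) kernel matrix
    clf = SVC(kernel="precomputed").fit(gram, labels)
    print(clf.predict(profiles @ profiles.T))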
Availability: All data and software used are freely and publicly available upon
request.
Contact: Jean-Philippe.Vert@mines.org
Paper34, Presentation41
Title:
Statistically Based Postprocessing of Phylogenetic Analysis by Clustering
Authors:
Cara Stockham, Li-San Wang, Tandy Warnow
Abstract:
Motivation: Phylogenetic analyses often produce thousands of candidate trees.
Biologists resolve the conflict by computing the consensus of these trees.
Single-tree consensus methods used as postprocessing can be unsatisfactory due
to their inherent limitations.
Results: In this paper we present an alternative approach by using clustering
algorithms on the set of candidate trees. We propose bicriterion problems, in
particular using the concept of information loss, and new consensus trees called
characteristic trees that minimize the information loss. Our empirical study
using four biological datasets shows that our approach provides a significant
improvement in the information content, while adding only a small amount of
complexity. Furthermore, the consensus trees we obtain for each of our large
clusters are more resolved than the single-tree consensus trees. We also
provide some initial progress on theoretical questions that arise in this
context.
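A toy sketch of the Robinson-Foulds distance named under Availability below,
which underlies the clustering: represent each tree by its set of non-trivial
bipartitions and count the symmetric difference (the tree encodings here are
placeholders):

    # Hedged sketch: Robinson-Foulds distance between two trees, each given
    # as a set of non-trivial bipartitions (frozensets of one side's leaves).
    def rf_distance(biparts1, biparts2):
        return len(biparts1 ^ biparts2)    # size of the symmetric difference

    # Toy 5-leaf trees: ((A,B),(C,(D,E))) vs ((A,C),(B,(D,E)))
    t1 = {frozenset("AB"), frozenset("DE"), frozenset("CDE")}
    t2 = {frozenset("AC"), frozenset("DE"), frozenset("BDE")}
    print(rf_distance(t1, t2))             # 4: only the DE bipartition is shared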
Availability: Software available upon request from the authors. The
agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the
Statistics Toolbox. The Robinson-Foulds distance matrices and the strict
consensus trees are computed using PAUP (Swofford, 2001) and Daniel Huson's
tree library on Intel Pentium workstations running Debian Linux.
Contact: lisan@cs.utexas.edu, telephone: +1 (512) 232-7455, fax: +1 (512) 471-8885.
Supplementary Information: http://www.cs.utexas.edu/users/lisan/ismb02/
Paper35, Presentation42
Title:
Efficiently Detecting Polymorphisms During The Fragment Assembly Process
Authors:
Daniel Fasulo, Aaron Halpern, Ian Dew, Clark Mobarry
Abstract:
Motivation: Current genomic sequence assemblers assume that the input data is
derived from a single, homogeneous source. However, recent whole-genome shotgun
sequencing projects have violated this assumption, resulting in input fragments
covering the same region of the genome whose sequences differ due to polymorphic
variation in the population. While single-nucleotide polymorphisms (SNPs) do
not pose a significant problem to state-of-the-art assembly methods, these
methods do not handle insertion/deletion (indel) polymorphisms of more than a
few bases.
Results: This paper describes an efficient method for detecting sequence
discrepancies due to polymorphism that avoids resorting to global use of more
costly, less stringent affine sequence alignments. Instead, the algorithm uses
graph-based methods to determine the small set of fragments involved in each
polymorphism and performs more sophisticated alignments only among fragments in
that set. Results from the incorporation of this method into the Celera
Assembler are reported for the D. melanogaster, H. sapiens, and M. musculus
genomes.
Availability: The method described herein does not constitute a stand-alone
software application, but is laid out in sufficient detail to be implemented as
a component of any genomic sequence assembler.
Contact: daniel.fasulo@celera.com
Paper36, Presentation43
Title:
Multiple Genome Rearrangement: A General Approach via the Evolutionary Genome
Graph
Authors:
Dmitry Korkin, Lev Goldfarb
Abstract:
Motivation: Despite the well-known fact that genome rearrangements should be
viewed in the light of the evolutionary relationships within and between the
species involved, no formal framework based on evolutionary considerations has
been proposed for treating the questions arising in this area. Given such an
underlying framework, the basic questions in the area can be posed in a
biologically more appropriate and useful form: e.g., the similarity between two
genomes can then be computed via their nearest ancestor, rather than "directly",
ignoring the evolutionary connections.
Results: We outline an evolution-based general framework for answering questions
related to the multiple genome rearrangement. In the proposed model, the
evolutionary genome graph (EG-graph) encapsulates an evolutionary history of a
genome family. For a set of all EG-graphs, we introduce a family of similarity
measures each defined via a set of genome transformations associated with a
particular EG-graph. Given a set of genomes and restricting ourselves to the
transpositions, an algorithm for constructing an EG-graph is presented. We also
present the experimental results in the form of an EG-graph for a set of
concrete genomes (for several species). This EG-graph turns out to be very close
to the corresponding known phylogenetic tree.
Contact: dkorkin@unb.ca
Paper37, Presentation44
Title:
Efficient multiple genome alignment
Authors:
Michael Hoehl, Stefan Kurtz, Enno Ohlebusch
Abstract:
Motivation: To allow a direct comparison of the genomic DNA sequences of
sufficiently similar organisms, there is an urgent need for software tools that
can align more than two genomic sequences.
Results: We developed new algorithms and a software tool "Multiple Genome
Aligner" (MGA for short) that efficiently computes multiple genome alignments of
large, closely related DNA sequences. For example, it can align 85% of the
complete genomes of six human adenoviruses (average length 35,305 bp) in 159
seconds. An alignment of 74% of the complete genomes of three strains of E.
coli (lengths: 5,528,445; 5,498,450; 4,639,221 bp) is produced in 30 minutes.
Availability: The software MGA is available free of charge for non-commercial
research institutions. For details see http://bibiserv.techfak.uni-bielefeld.de/mga/
Contact: {kurtz, enno}@techfak.uni-bielefeld.de
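Aligners in this family typically proceed by finding exact matches present in all genomes and then chaining a colinear subset of them before closing the gaps. The sketch below shows only the chaining step, as a simple O(n^2) dynamic program over toy anchors; it is a generic illustration of anchor chaining, not MGA's actual algorithm, and the anchor tuples are invented.

    def chain_anchors(anchors):
        """Heaviest chain of anchors colinear in every genome.
        Each anchor is (starts, length) with one start per genome."""
        anchors = sorted(anchors)                 # by first genome's start
        best = [length for _, length in anchors]  # best chain weight ending here
        prev = [-1] * len(anchors)
        for j in range(len(anchors)):
            for i in range(j):
                # anchor i must end before anchor j begins in every genome
                if all(anchors[i][0][g] + anchors[i][1] <= anchors[j][0][g]
                       for g in range(len(anchors[j][0]))):
                    if best[i] + anchors[j][1] > best[j]:
                        best[j] = best[i] + anchors[j][1]
                        prev[j] = i
        j = max(range(len(anchors)), key=best.__getitem__)
        chain = []                                # backtrack the heaviest chain
        while j != -1:
            chain.append(anchors[j])
            j = prev[j]
        return chain[::-1]

    # Three toy anchors across two genomes: ((start_g1, start_g2), length).
    anchors = [((0, 0), 5), ((7, 6), 4), ((4, 9), 3)]
    print(chain_anchors(anchors))                 # [((0, 0), 5), ((7, 6), 4)]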
Paper38, Presentation46
Title:
PSEUDOVIEWER: Automatic Visualization of RNA Pseudoknots
Authors:
Kyungsook Han, Yujin Lee, Wootaek Kim
Abstract:
Motivation: Several algorithms have been developed for drawing RNA secondary
structures; however, none of these can be used to draw RNA pseudoknot structures.
In the sense of graph theory, a drawing of RNA secondary structures is a tree,
whereas a drawing of RNA pseudoknots is a graph with inner cycles within a
pseudoknot as well as possible outer cycles formed between a pseudoknot and
other structural elements. Thus, RNA pseudoknots are more difficult to visualize
than RNA secondary structures. Since no automatic method for drawing RNA
pseudoknots exists, visualizing RNA pseudoknots relies on a significant amount of
manual work and does not yield satisfactory results. The task of visualizing RNA
pseudoknots by hand becomes more challenging as the size and complexity of the
RNA pseudoknots increase.
Results: We have developed a new representation and an algorithm for drawing
H-type pseudoknots with RNA secondary structures. Compared to existing
representations of H-type pseudoknots, the new representation ensures uniform
and clear drawings with no edge crossing for all H-type pseudoknots. To the best
of our knowledge, this is the first algorithm for automatically drawing RNA
pseudoknots with RNA secondary structures. The algorithm has been implemented in
a Java program, which can be executed on any computing system. Experimental
results demonstrate that the algorithm generates an aesthetically pleasing
drawing of all H-type pseudoknots. The results have also shown that the drawing
has high readability, enabling the user to quickly and easily recognize the
whole RNA structure as well as the pseudoknots themselves.
Availability: Available on request from the corresponding author.
Contact: khan@inha.ac.kr
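The tree-versus-graph distinction drawn above reduces to crossing base pairs: a structure contains a pseudoknot exactly when some pair (i, j) interleaves with another pair (k, l) as i < k < j < l. A minimal check of this condition (our own illustration, not part of PSEUDOVIEWER):

    def has_crossing(pairs):
        """True if any two base pairs interleave (i < k < j < l), the
        pattern that makes a structure a pseudoknot rather than a tree."""
        for a, (i, j) in enumerate(pairs):
            for (k, l) in pairs[a + 1:]:
                if i < k < j < l or k < i < l < j:
                    return True
        return False

    # A minimal H-type pseudoknot: the pairs of stem 1 cross those of stem 2.
    stem1 = [(0, 10), (1, 9)]
    stem2 = [(5, 15), (6, 14)]
    print(has_crossing(stem1 + stem2))   # True
    print(has_crossing(stem1))           # False: nested pairs form a tree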
Paper39, Presentation47
Title:
A powerful non-homology method for the prediction of operons in prokaryotes
Authors:
Gabriel Moreno-Hagelsieb, Julio Collado-Vides
Abstract:
Motivation: The prediction of the transcription unit organization of genomes is
an important clue in the inference of functional relationships of genes, the
interpretation and evaluation of transcriptome experiments, and the overall
inference of the regulatory networks governing the expression of genes in
response to the environment. Though several methods have been devised to predict
operons, most require extensive characterization of the genome analyzed. Log-likelihoods derived from intergenic distance distributions work surprisingly
well to predict operons in Escherichia coli and are available for any genome as
soon as the gene sets are predicted.
Results: Here we provide evidence that the very same method is applicable to any
prokaryotic genome. First, the method has the same efficiency when evaluated
using a collection of experimentally known operons of Bacillus subtilis. Second,
operons among most if not all prokaryotes show the same tendency to keep short
distances between their genes, the most frequent distances being overlaps of
four and of one base pair. The universality of this structural feature
allows us to predict the organization of transcription units in all prokaryotes.
Third, predicted operons contain a higher proportion of genes with related
phylogenetic profiles and conservation of adjacency than predicted borders of
transcription units.
Contact: moreno@cifn.unam.mx
Supplementary information: Additional materials and graphs are available at:
http://www.cifn.unam.mx/moreno/pub/TUpredictions/.
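The score underlying the method can be sketched in a few lines: for two adjacent same-strand genes at intergenic distance d, compute log P(d | same operon) / P(d | transcription-unit border), so that short distances, and in particular the characteristic overlaps of four and one base pairs, score highest. The histograms below are invented placeholders, not the authors' E. coli estimates.

    import math

    # Made-up counts of intergenic distances (bp): pairs within an operon (WO)
    # versus pairs at transcription-unit borders (TUB); negative = overlap.
    wo_hist  = {-4: 120, -1: 95, 5: 40, 20: 25, 60: 10}
    tub_hist = {-4: 5,   -1: 8,  5: 20, 20: 60, 300: 90}

    def log_likelihood(d, wo=wo_hist, tub=tub_hist, pseudo=1.0):
        """Log-likelihood that a gene pair at distance d shares an operon
        (pseudo-counts and binning are simplified for illustration)."""
        p_wo  = (wo.get(d, 0) + pseudo) / (sum(wo.values()) + pseudo)
        p_tub = (tub.get(d, 0) + pseudo) / (sum(tub.values()) + pseudo)
        return math.log(p_wo / p_tub)

    # A pair overlapping by 4 bp scores strongly as same-operon (positive):
    print(log_likelihood(-4))    # > 0
    # A pair 300 bp apart scores as a transcription-unit border (negative):
    print(log_likelihood(300))   # < 0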
Paper40, Presentation48
Title:
Identifying Operons and Untranslated Regions of Transcripts Using Escherichia
coli RNA Expression Analysis
Authors:
Brian Tjaden, David Haynor, Sergey Stolyar, Carsten Rosenow, Eugene Kolker
Abstract:
Microarrays traditionally have been used to assay the transcript expression of
coding regions of genes. Here, we use Escherichia coli oligonucleotide
microarrays to assay transcript expression of both open reading frames (ORFs)
and intergenic regions. We then use hidden Markov models to analyze this
expression data and estimate transcription boundaries of genes. This approach
allows us to identify 5′ untranslated regions (UTRs) of transcripts as well as
genes which are likely to be operon members. The operon elements we identify
correspond to documented operons with 99% specificity and 63% sensitivity.
Similarly, we find that our 5′ UTR results accurately coincide with
experimentally verified promoter regions for most genes.
Contact: tjaden@cs.washington.edu
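As a schematic of the HMM idea (two states only, with invented parameters; the authors' model and its emission distributions are more refined), consider hidden states 'transcribed' and 'silent' emitting discretized probe intensities along the chromosome. The most likely state path then marks transcript boundaries, and a transcribed run spanning adjacent ORFs suggests co-transcription, i.e. an operon.

    import math

    states = ['T', 'S']   # T = transcribed, S = silent
    start  = {'T': 0.5, 'S': 0.5}
    trans  = {'T': {'T': 0.9, 'S': 0.1}, 'S': {'T': 0.1, 'S': 0.9}}
    emit   = {'T': {'hi': 0.8, 'lo': 0.2}, 'S': {'hi': 0.2, 'lo': 0.8}}

    def viterbi(obs):
        """Most likely hidden-state path for a sequence of probe readings."""
        v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
        back = []
        for o in obs[1:]:
            col, ptr = {}, {}
            for s in states:
                p = max(states, key=lambda q: v[-1][q] + math.log(trans[q][s]))
                col[s] = v[-1][p] + math.log(trans[p][s]) + math.log(emit[s][o])
                ptr[s] = p
            v.append(col)
            back.append(ptr)
        path = [max(states, key=lambda s: v[-1][s])]
        for ptr in reversed(back):                # trace the path backwards
            path.append(ptr[path[-1]])
        return path[::-1]

    print(viterbi(['lo', 'lo', 'hi', 'hi', 'hi', 'lo']))
    # e.g. ['S', 'S', 'T', 'T', 'T', 'T']: a transcript starting at probe 3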
Paper41, Presentation49
Title:
Detecting Recombination with MCMC
Authors:
Dirk Husmeier, Grainne McGuire
Abstract:
Motivation: We present a statistical method for detecting recombination, whose
objective is to accurately locate the recombinant breakpoints in DNA sequence
alignments of small numbers of taxa (4 or 5). Our approach explicitly models
the sequence of phylogenetic tree topologies along a multiple sequence alignment.
Inference under this model is done in a Bayesian way, using Markov chain Monte
Carlo (MCMC). The algorithm returns the site-dependent posterior probability of
each tree topology, which is used for detecting recombinant regions and locating
their breakpoints.
Results: The method was tested on a synthetic and three real DNA sequence
alignments, where it was found to outperform the established detection methods
PLATO, RECPARS, and TOPAL.
Availability: The algorithm has been implemented in the C++ program package
BARCE, which is freely available from http://www.bioss.sari.ac.uk/~dirk/software
Contact: dirk@bioss.ac.uk
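The core MCMC loop behind such a method is compact, even though BARCE's actual state space (a sequence of tree topologies along the alignment) is far richer than the toy target used here. Below, a generic Metropolis sampler draws a single recombination-breakpoint position from a made-up posterior; every number is illustrative.

    import math
    import random

    def metropolis(log_post, init, propose, n_steps=20000):
        """Generic Metropolis sampler with a symmetric proposal."""
        x, lp = init, log_post(init)
        samples = []
        for _ in range(n_steps):
            y = propose(x)
            lq = log_post(y)
            if math.log(random.random()) < lq - lp:   # accept or reject
                x, lp = y, lq
            samples.append(x)
        return samples

    # Toy posterior over 10 candidate breakpoint sites, peaked at site 6.
    weights  = [1, 1, 1, 1, 2, 4, 9, 4, 2, 1]
    log_post = lambda k: math.log(weights[k])
    propose  = lambda k: (k + random.choice([-1, 1])) % len(weights)  # symmetric

    draws = metropolis(log_post, 0, propose)
    print(max(set(draws), key=draws.count))   # most visited site, likely 6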
Paper42, Presentation50
Title:
Finding Composite Regulatory Patterns in DNA Sequences
Authors:
Eleazar Eskin, Pavel Pevzner
Abstract:
Motivation: Pattern discovery in unaligned DNA sequences is a fundamental
problem in computational biology with important applications in finding
regulatory signals. Current approaches to pattern discovery focus on monad
patterns that correspond to relatively short contiguous strings. However, many
of the actual regulatory signals are composite patterns that are groups of monad
patterns that occur near each other. A difficulty in discovering composite
patterns is that one or both of the component monad patterns in the group may be
"too weak". Since the traditional monad-based motif-finding algorithms usually
output one (or a few) high scoring patterns, they often fail to find composite
regulatory signals consisting of weak monad parts. In this paper, we present a
MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We
demonstrate that MITRA performs well for both monad and composite patterns by
presenting experiments over biological and synthetic data.
Availability: MITRA is available at http://www.cs.columbia.edu/compbio/mitra/
Contact: eeskin@cs.columbia.edu
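To see what the mismatch tree is pruning, consider the naive baseline for monad (l, d) pattern discovery: enumerate the d-neighborhood of every l-mer in each sequence and intersect the candidate sets across sequences. The sketch below implements only this baseline on invented toy sequences; MITRA reaches the same answers far more efficiently and, unlike this sketch, extends to composite (dyad) signals.

    from itertools import combinations, product

    def neighborhood(kmer, d, alphabet='ACGT'):
        """All strings within Hamming distance <= d of kmer (naive)."""
        result = {kmer}
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                cand = list(kmer)
                for pos, ch in zip(positions, letters):
                    cand[pos] = ch
                result.add(''.join(cand))
        return result

    def find_monads(seqs, l, d):
        """(l, d) monad patterns occurring, with <= d mismatches, in every
        sequence; exponential in l and d, which is what MITRA avoids."""
        per_seq = []
        for s in seqs:
            cands = set()
            for i in range(len(s) - l + 1):
                cands |= neighborhood(s[i:i + l], d)
            per_seq.append(cands)
        return set.intersection(*per_seq)

    print(sorted(find_monads(['ACGTACGT', 'ACCTAGGT', 'ACGTTGCA'], 4, 1)))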