
Machine Learning Applications in
Systems Biology
Natasha Alves, M.A.Sc. Candidate, ECE

Abstract—
Recent advances in high-throughput
technologies have led to an immense flow of biological
data. Extracting the information hidden in the ever-expanding
biological databases has been an obstacle in the
progress of systems biology. Machine Learning has proved
to be an efficient and inexpensive approach to organizing
data; developing new tools to analyze data; and
discovering new knowledge from data.
This paper
introduces Machine Learning techniques like inductive
logic programming, clustering, Bayesian networks, and
decision trees in the context of their applications in systems
biology. The shortcomings of these Machine Learning
techniques are also addressed.
Index Terms—Artificial Intelligence, Bayesian Networks,
Clustering, Decision Trees, Inductive Logic Programming,
Machine Learning, Systems Biology.
I. INTRODUCTION
Systems Biology is an in-depth, systems-level analysis of
biological systems grounded at the molecular level [1]. It is
different from other methods of biological study where the
focus is on the characteristics of isolated parts of a cell or
organism. Systems biology examines the structure and
dynamics of cellular and organism functions, and their
interconnections and interrelationships. One ultimate goal of
systems biology is to use the knowledge of the complete
genome sequence and all proteins encoded by that genome to
reconstruct the biological systems that are implied [2].
The development of systems biology is driven by
technology. Sophisticated computational techniques are
needed to analyze biological systems because of the
complexity and dynamics involved. Machine Learning, which
is an automatic and intelligent learning technique, has long
been used to discover meaningful associations between
proteins, and for scientific hypothesis formation [3].
The aim of this paper is to introduce Machine Learning
techniques in the context of their application in systems
biology.
II. CHALLENGES IN SYSTEMS BIOLOGY
Much of our failure to fully understand biological systems
has been due to their size and complexity. Systems biology
emphasizes large-scale discovery of the interactions of
genes, proteins, and other cell elements. It is confronted with
dynamic biological responses, a huge number of interactions,
and inherent redundancy in many pathways and feedback systems.
A lot of useful and important information about biological
systems is hidden in high volumes of experimental data. For
instance, GenBank alone held about 37 billion bases of DNA in
roughly 32 million sequence records (Feb. 2004) [12]. Analyzing high
volumes of data to understand biological systems demands
tedious experimentation and modern computational
technology. This is the grand challenge for systems biology in
this era.
An intelligent approach is needed to extract the hidden
information from the data and to cope with the rapid rate of
data deposition.
III. MACHINE LEARNING
Machine Learning (ML) is the capability of computer
algorithms to improve automatically through experience (i.e.
the computer programs itself by seeing examples of the
behavior we want). ML approaches are ideally suited for
domains characterized by the presence of large amounts of
data, noisy patterns and the absence of general theories [4].
The fundamental idea behind these approaches is to learn the
theory automatically from the data through a process of
inference and model fitting. A system that can learn from
experience and improve its performance automatically could
serve as a tool for analyzing biological systems.
The main goal of ML is to induce general functions from a
specific training data set. The learning agent is given a set of
training examples and forms a hypothesis that explains them. The
agent must search through the hypothesis space and locate the
hypothesis that performs best, which is then checked against a
test set [5].
Because ML is concerned with learning from data examples,
it often uses a probabilistic approach.
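The search through a hypothesis space can be illustrated with a minimal sketch. It assumes a deliberately tiny hypothesis space (threshold classifiers on one numeric feature) and made-up training and test values; none of the names or numbers come from the cited work.

```python
# Minimal sketch: pick the best hypothesis from a small, explicit hypothesis space.
# Assumption: each hypothesis is a threshold on one numeric feature (hypothetical data).

def make_hypothesis(threshold):
    """Return a classifier that predicts True when the feature is >= threshold."""
    return lambda x: x >= threshold

def error_rate(hypothesis, examples):
    """Fraction of (feature, label) examples the hypothesis gets wrong."""
    return sum(hypothesis(x) != y for x, y in examples) / len(examples)

def best_threshold(candidates, training_set):
    """Search the hypothesis space for the threshold with the lowest training error."""
    return min(candidates, key=lambda t: error_rate(make_hypothesis(t), training_set))

training_set = [(0.2, False), (0.5, False), (1.1, False),
                (1.8, True), (2.4, True), (3.0, True)]
test_set = [(0.9, False), (2.0, True)]
candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

chosen = best_threshold(candidates, training_set)
test_error = error_rate(make_hypothesis(chosen), test_set)
```

Real learners search far larger hypothesis spaces and usually cannot enumerate them, but the roles of the training set (selection) and the test set (evaluation) are the same.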
IV. OVERCOMING THE CHALLENGES IN SYSTEMS BIOLOGY
Manuscript received November 1, 2004
ML approaches have gained popularity in systems biology.
The characteristics of ML that make it well suited for systems
biology are:
1. Many problems in biological systems are not well
defined, but have a lot of experimental data. ML is
useful when the structure of the task is not well
understood but the task can be characterized by a data
set with strong statistical regularity. While input/output
pairs can be easily specified, the relationship between
the inputs and outputs is often unknown (e.g. the
protein folding mechanism). ML approaches can extract
relationships and correlations hidden under large
volumes of data (data mining). They can thus extract the
information encoded in biological databases and use the
available data to predict meaningful biological
properties.
2. ML approaches can adjust their internal structure to
produce correct outputs for a large number of sample
inputs. They can thus constrain their input/output
function to approximate the implicit relationship in the
training examples [6].
3. ML approaches adapt themselves to new information
(training examples). This is important in systems
biology because new data are generated every day. The
newly generated data might update the initial learning
hypotheses.
ML thus provides efficient approaches to analyze biological
data.
V. MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY
A variety of ML techniques can be used to solve most of the
problems in systems biology. In a systems biology context,
ML is used to discover meaningful knowledge from existing
biological databases and to present that knowledge in an
understandable pattern. The tasks of ML in systems biology
can be divided into seven categories as shown in Table I [5].
These techniques, operating individually or in combination,
can meet the various challenges in systems biology.
A. Protein Structure Prediction
Proteins are the essence of life. The secondary structure of a
protein consists of α-helices, β-strands and coils. The folding
of these secondary structure elements forms the unique 3D
structure of a protein. A lot of useful information is contained
in this 3D structure. However, predicting a protein's structure
remains a central problem in bioinformatics; it is the bottleneck
between sequencing efforts and drug design. ML approaches
like Inductive Logic Programming can be used to predict
protein structure.
1) Inductive Logic Programming
Inductive logic programming (ILP) is a research area
formed at the intersection of ML and Logic Programming. ILP
systems develop predicate descriptions from observations and
background knowledge. There are three main elements in an
ILP learning system: observations, background knowledge,
and hypothesis [5]. Each of these elements of ILP is a logic
program. Fig.1 shows the general scheme for ILP methods.
Observations and background knowledge are combined by an
ILP program to form a hypothesis. A set of IF – THEN rules
can then be derived from the hypothesis. For example:
Hypothesis:
fold('Four-helical up-and-down bundle',P) :-
    helix(P,H1), length(H1,hi), position(P,H1,Pos),
    interval(1 =< Pos =< 3), adjacent(P,H1,H2), helix(P,H2).
Rule: The protein P has fold class ‘Four-helical up-and-down
bundle’ if it contains a long helix H1 at a secondary
structure position between 1 and 3, and H1 is followed by a
second helix H2. [5]
The rules are tested on additional data. If experimentation
leads to high confidence in the validity of the hypothesis, the
hypothesis is added to the background knowledge.
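To make the reading of the example rule concrete, the following is a rough Python rendering of the same condition. It is only a sketch: the representation of a protein as an ordered list of secondary-structure elements, the field names, and the two toy proteins are assumptions made for illustration, not part of any ILP system.

```python
# Sketch of the example rule as a plain Python predicate.
# Assumption: a protein is an ordered list of secondary-structure elements,
# each a dict with a "type" and a discretized "length" ("hi" meaning long).

def matches_fold_rule(protein):
    """True if a long helix sits at position 1..3 and is followed by another helix."""
    last_start = min(3, len(protein) - 1)      # H1 needs a successor H2
    for pos in range(1, last_start + 1):       # positions are 1-based, as in the rule
        h1, h2 = protein[pos - 1], protein[pos]
        if h1["type"] == "helix" and h1["length"] == "hi" and h2["type"] == "helix":
            return True
    return False

# Hypothetical test proteins used to check the rule on additional data.
bundle_like = [
    {"type": "helix", "length": "hi"},
    {"type": "helix", "length": "lo"},
    {"type": "strand", "length": "lo"},
]
not_bundle = [
    {"type": "strand", "length": "lo"},
    {"type": "helix", "length": "lo"},
    {"type": "helix", "length": "hi"},
    {"type": "strand", "length": "lo"},
]
```

An ILP system would induce such a clause from observations and background knowledge rather than have it written by hand; the predicate above only shows how the induced rule is applied.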
TABLE I
MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY [5]
No.  Application          Description
1    Classification       Predicting an item’s class.
2    Forecasting          Predicting a parameter value.
3    Clustering           Finding groups of items.
4    Description          Describing a group.
5    Deviation Detection  Finding changes.
6    Link Analysis        Finding relationships.
7    Visualization        Presenting data visually to facilitate human discovery.
Machine learning approaches to protein structure prediction
and gene pathway discovery are examined in the following
sections.
Figure 1. Scheme for ILP Methods [7]
ILP has been used for protein structure prediction.
Muggleton et al. implemented ILP by separating the data set of
proteins into groups of the same type of domain structure (e.g.
α-type domains). This allowed the system to have a more
homogeneous data set, thus allowing better prediction [7]. The
ILP program used in this method was Golem. The basic
algorithm was as follows:
1. Take a random sample of pairs of residues from the
training set. This represents a set of pairs of residues
chosen randomly from the set of all residues in all
proteins represented.
2. Compute all the common properties for each pair of
residues.
3. Convert the common properties into a rule that is true
for the residue pair under consideration.
4. Choose the rule for the best residue pair. For example,
choose the rule that predicts the most true α-helix
residues while predicting fewer than a pre-defined
threshold of non-α-helix residues from the training set.
5. Take another sample of unpredicted residue pairs.
6. Form rules which express the common properties of the
best pair together with each of the individual residue
pairs in the sample.
7. Repeat steps 4-6 until no improvement in prediction is
produced.
The algorithm uses the best rule to eliminate a set of
predicted residues from the training set. The reduced training
set is then used to build up further rules. The process
terminates when no further rules can be found.
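The covering loop described above can be sketched in a heavily simplified form. This is not Golem: Golem generalizes first-order clauses, whereas the sketch below assumes each residue is just a set of propositional properties and takes the intersection of two sets as their "common properties". All property names and data are hypothetical.

```python
from itertools import combinations

# Simplified, propositional sketch of a pair-based covering loop (not Golem itself).
# Assumption: each example is a frozenset of property names; a rule is a set of
# properties that an example must all possess.

def covers(rule, example):
    """A rule covers an example when the example has every property in the rule."""
    return rule <= example

def learn_rules(positives, negatives, max_false_positives=0):
    """Repeatedly pick the pair-derived rule covering the most remaining positives."""
    remaining = list(positives)
    rules = []
    while len(remaining) >= 2:
        best_rule, best_covered = None, []
        for a, b in combinations(remaining, 2):
            rule = a & b                       # common properties of the pair
            if not rule:
                continue                       # an empty rule would cover everything
            false_pos = sum(covers(rule, n) for n in negatives)
            if false_pos > max_false_positives:
                continue                       # too many non-helix residues predicted
            covered = [p for p in remaining if covers(rule, p)]
            if len(covered) > len(best_covered):
                best_rule, best_covered = rule, covered
        if best_rule is None:
            break                              # no further acceptable rule
        rules.append(best_rule)
        # Remove the residues the chosen rule already predicts.
        remaining = [p for p in remaining if not covers(best_rule, p)]
    return rules

positives = [
    frozenset({"hydrophobic", "large", "next_is_polar"}),
    frozenset({"hydrophobic", "large", "next_is_small"}),
    frozenset({"hydrophobic", "large"}),
    frozenset({"charged", "small", "next_is_polar"}),
    frozenset({"charged", "small", "next_is_small"}),
]
negatives = [
    frozenset({"hydrophobic", "next_is_polar"}),
    frozenset({"charged", "next_is_small"}),
    frozenset({"large", "small"}),
]
rules = learn_rules(positives, negatives)
```

The sketch examines every pair instead of a random sample, which is only practical for toy data; sampling, as in steps 1 and 5, keeps the search tractable on real training sets.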
Golem produced an accuracy of about 81% when applied to
16 proteins with α-type domains.
The disadvantage of ILP is the lack of probability in its
rules. Biological systems are characterized by a high degree of
uncertainty; thus, the hypotheses will have a higher descriptive
power if they incorporate a certain degree of probability [5].
To date, ML methods cannot, by themselves, completely
describe a new protein’s structure; however, they can provide
valuable information regarding numerous structural attributes.
B. Gene Pathway Discovery
Systems biology seeks to discover causal relationships
among a large number of genes and other cellular constituents.
From a system-level point of view, the various interactions and
control loops, which form a genetic network, represent the
basis upon which the vast complexity and flexibility of life
processes emerge.
ML techniques like clustering, Bayesian networks and
decision trees can be used to discover gene regulation
pathways.
1) Gene Clustering
Clustering is a discovery approach that organizes and
identifies subsets of data and groups them into classes. Each
class represents data with similar attributes. A derivative
clustering algorithm can also be used to predict and explain
complex data.
Clustering algorithms are used to discover groups of genes
that show similar expression patterns under different
experimental conditions. By this procedure, different families
of cell-cycle regulated genes in baker's yeast,
Saccharomyces cerevisiae, have been identified [8].
Gene clustering has several drawbacks. Firstly, the
assignment of genes to single clusters by most clustering
methods potentially prevents the exposure of complex
interrelationships among genes. Secondly, clustering does not
always provide causal information. Genes sharing similar
expression profiles may not always share a function. Even
when similar expression levels correspond to similar functions,
the functional relationships among genes in a cluster cannot be
determined from the cluster data alone [9]. Conversely, a gene
may be suppressed to allow another to be expressed; thus,
functionally related genes may be clustered separately,
obscuring the existing relationship.
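Both the grouping step and the last drawback can be seen in a small sketch. It assumes a very simple "leader" clustering scheme based on Pearson correlation, which is not the method used in [8], and the gene names and expression values are invented for illustration.

```python
import math

# Sketch: leader-style clustering of expression profiles by Pearson correlation.
# Assumptions: hypothetical genes and values; one profile value per condition.

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def cluster_by_correlation(profiles, threshold=0.9):
    """Assign each gene to the first cluster whose seed gene it correlates with."""
    clusters = []                               # each cluster is a list; item 0 is the seed
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profiles[cluster[0]], profile) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])             # no similar seed: start a new cluster
    return clusters

profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.0],
    "geneE": [1.0, 3.0, 1.0, 3.0],
}
clusters = cluster_by_correlation(profiles)
```

Here geneA and geneC are perfectly anti-correlated, as would happen if one were suppressed while the other is expressed, yet they land in different clusters, and each gene belongs to exactly one cluster.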
A system named GEEVE, introduced by Yoo and Cooper,
uses gene expression data to learn the models of gene
regulation and thus discover causal gene pathways [10]. The
GEEVE system, shown in Fig.2, consists of two modules: the
causal Bayesian network update module, and the decision tree
generation and evaluation module.
Figure 2: The GEEVE system [10]
2) Causal Bayesian Networks
A Bayesian network is a directed, acyclic graph of nodes
representing variables and arcs representing dependencies among the
variables. A Bayesian network encodes the joint probability
distribution over all the variables. The joint distribution of a
Bayesian network with N variables can be factored as follows:
$$P(x_1, x_2, \ldots, x_N \mid K) = \prod_{i=1}^{N} P(x_i \mid \pi_i, K), \qquad (1)$$
where $x_i$ is the state of variable $X_i$, $\pi_i$ is a joint state of the
parents of $X_i$, and K denotes background knowledge [10].
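As a worked instance of the factorization in (1), the sketch below evaluates the joint probability of a three-gene network. The network structure (G1 regulating G2 and G3), the on/off states, and every probability value are assumptions chosen purely for illustration; the background knowledge K is left implicit.

```python
from itertools import product

# Sketch: joint probability of a tiny Bayesian network via the product in Eq. (1).
# Assumptions: hypothetical genes G1 -> G2 and G1 -> G3, binary states, made-up CPTs.

parents = {"G1": (), "G2": ("G1",), "G3": ("G1",)}

# cpt[var][parent_states] = P(var = "on" | parent states)
cpt = {
    "G1": {(): 0.3},
    "G2": {("on",): 0.9, ("off",): 0.2},
    "G3": {("on",): 0.1, ("off",): 0.6},
}

def joint_probability(state):
    """Multiply P(x_i | parents of X_i) over all variables for one full assignment."""
    prob = 1.0
    for var, pa in parents.items():
        parent_states = tuple(state[p] for p in pa)
        p_on = cpt[var][parent_states]
        prob *= p_on if state[var] == "on" else 1.0 - p_on
    return prob

# P(G1=on, G2=on, G3=off) = 0.3 * 0.9 * (1 - 0.1)
example = joint_probability({"G1": "on", "G2": "on", "G3": "off"})

# Summing over every assignment should give 1, a quick sanity check.
total = sum(
    joint_probability(dict(zip(parents, states)))
    for states in product(("on", "off"), repeat=len(parents))
)
```

The factorization needs only five numbers here instead of the seven required for an unrestricted joint distribution over three binary variables, and the saving grows quickly with network size.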
Bayesian networks are capable of handling incomplete data
sets, and are able to learn and predict the missing data. They
also provide models of causal influence. These properties
make Bayesian networks a promising tool for analyzing gene
expression patterns.
In the context of genetic pathway inference, each node of a
Bayesian network is assigned to a gene, and can assume the
different expression levels of this gene throughout the training
data. Each edge between the nodes (genes) denotes a
regulatory relationship between them. If the edge is directed,
as shown in Fig. 3, it denotes that one gene controls the other.
Fig. 4 shows the feature graph trained for a genetic sub-network of baker's yeast.
Figure 3: The structure of a causal Bayesian network that represents a
portion of a hypothetical gene regulation pathway [10]
Figure 4: Genetic sub-networks of baker's yeast. [11]
While Bayesian networks produce better results than rule-based
learning methods, there is no clear explanation of the
learning process. It is therefore hard to understand the results
and to interpret them as useful knowledge [5].
3) Decision Trees
The decision tree is a simple inductive learning system that
uses discrete-valued functions to estimate and classify the
provided training set. The system is represented by a tree
whose internal nodes are tests (boolean decisions) and whose
leaf nodes are classes. The tree can make predictions about the
probability of a particular case belonging to a particular class.
Decision trees can be used to model gene perturbation in
experiments. The GEEVE system, for example, builds and
evaluates a decision tree based on pair-wise gene relationships.
Thus, the effects on gene X when gene Y is perturbed can be
modeled [11].
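A one-level decision tree (a "stump") is enough to illustrate tests at internal nodes and class probabilities at the leaves. The sketch below is not the GEEVE procedure; the experiment records, the two candidate tests, and the up/down labels for gene X are hypothetical.

```python
from collections import Counter

# Sketch: learn a one-level decision tree whose leaves hold class probabilities.
# Assumptions: hypothetical perturbation experiments; labels describe gene X.

def leaf_distribution(labels):
    """Relative class frequencies among the training cases reaching a leaf."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def learn_stump(rows, labels, tests):
    """Choose the boolean test whose branches make the fewest majority-vote errors."""
    best = None
    for name, test in tests.items():
        yes = [lab for row, lab in zip(rows, labels) if test(row)]
        no = [lab for row, lab in zip(rows, labels) if not test(row)]
        errors = sum(len(b) - max(Counter(b).values()) for b in (yes, no) if b)
        if best is None or errors < best[0]:
            best = (errors, name, test, leaf_distribution(yes), leaf_distribution(no))
    _, name, test, yes_leaf, no_leaf = best
    return {"test_name": name, "test": test, "yes": yes_leaf, "no": no_leaf}

def predict_proba(stump, row):
    """Follow the single test and return the class probabilities at that leaf."""
    return stump["yes"] if stump["test"](row) else stump["no"]

rows = [
    {"Y_knocked_out": True, "Z_high": True},
    {"Y_knocked_out": True, "Z_high": False},
    {"Y_knocked_out": True, "Z_high": True},
    {"Y_knocked_out": False, "Z_high": True},
    {"Y_knocked_out": False, "Z_high": False},
    {"Y_knocked_out": False, "Z_high": False},
]
labels = ["down", "down", "down", "up", "up", "down"]   # expression change of gene X
tests = {
    "Y knocked out?": lambda row: row["Y_knocked_out"],
    "Z high?": lambda row: row["Z_high"],
}

stump = learn_stump(rows, labels, tests)
prediction = predict_proba(stump, {"Y_knocked_out": False, "Z_high": True})
```

A full decision tree applies the same selection step recursively inside each branch, which is also where over-fitting creeps in if the tree is grown until every leaf is pure.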
The drawbacks of decision trees are over-fitting of data and
overlapping in the classes. These and other factors make
decision trees difficult to optimize.
VI. CONCLUSION
Since ML primarily deals with the extraction of knowledge
from data, redundancy of data is an important issue facing ML.
The quality of biological data is usually compromised by
experimental errors, wrong interpretation by biologists,
non-standardized experimental techniques, etc. The uncertainty
associated with experiment-based research is very high.
Despite these challenges, ML techniques have contributed to the
success of systems biology in recent years. ML has helped
accelerate research in several areas of systems biology
including protein structure prediction, inference of genetic and
molecular networks, and gene-protein interactions.
The author believes that systems biology will continue to
benefit from ML techniques in coming years.
REFERENCES
[1] Kitano, H., "Looking beyond the details: a rise in system-oriented
approaches in genetics and molecular biology", Curr. Genet.,
Vol. 41(1), 2002, pp. 1-10.
[2] Lathrop, R., "Intelligent Systems in Biology: Why the
Excitement?", IEEE Intelligent Systems, Vol. 16(6), 2001,
pp. 8-13.
[3] Luke, S., Hamahashi, S., Kyoda, K., Ueda, H., "Biology:
see it again-for the first time", IEEE Intelligent Systems,
Vol. 13(5), 1998, pp. 6-8.
[4] Hu, Y., Kibler, D., "Combinatorial motif analysis and
hypothesis generation on a genomic scale",
Bioinformatics, Vol. 16(3), 2000, pp. 222-32.
[5] Tan, A., Gilbert, G., "Machine Learning and its Application
to Bioinformatics: An Overview",
www.brc.dcs.gla.ac.uk/~actan/publications.html, Retrieved: Oct. 27, 2004.
[6] Nilsson, N., "Introduction to Machine Learning",
unpublished, http://robotics.stanford.edu/people/nilsson/mlbook.html,
1996, Retrieved: Oct. 27, 2004.
[7] Muggleton, S., King, R., Sternberg, M., "Using logic for
protein structure prediction", Proceedings of the 25th
Hawaii Int. Conf. on System Sciences, IEEE Computer
Society Press, 1992.
[8] Spellman, P.T., "Comprehensive Identification of Cell
Cycle-regulated Genes of the Yeast Saccharomyces
cerevisiae by Microarray Hybridization", Molecular
Biology of the Cell, 1998, pp. 3273-3297.
[9] Shatkay, H., Edwards, S., Boguski, M., "Information
retrieval meets gene analysis", IEEE Intelligent Systems,
Vol. 17(2), 2002, pp. 45-53.
[10] Yoo, C., Cooper, G., "An Evaluation of a System that
Recommends Microarray Experiments to Perform to
Discover Gene-Regulation Pathways", Journal of
Artificial Intelligence in Medicine, Vol. 31(2), 2004,
pp. 169-182.
[11] Stetter, M., "Large-Scale Computational Modeling of
Genetic Regulatory Networks", Artificial Intelligence
Review 20, 2003, pp. 75-93.
[12] National Center for Biotechnology Information: GenBank
Overview, www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html,
Retrieved: Oct. 27, 2004.