Machine Learning Applications in Systems Biology

Natasha Alves, M.A.Sc. Candidate, ECE

Abstract: Recent advances in high-throughput technologies have led to an immense flow of biological data. Extracting the information hidden in ever-expanding biological databases has been an obstacle to progress in systems biology. Machine Learning has proved to be an efficient and inexpensive approach to organizing data, developing new tools to analyze data, and discovering new knowledge from data. This paper introduces Machine Learning techniques such as inductive logic programming, clustering, Bayesian networks, and decision trees in the context of their applications in systems biology. The shortcomings of these Machine Learning techniques are also addressed.

Index Terms: Artificial Intelligence, Bayesian Networks, Clustering, Decision Trees, Inductive Logic Programming, Machine Learning, Systems Biology.

I. INTRODUCTION

Systems Biology is an in-depth, systems-level analysis of biological systems grounded at the molecular level [1]. It differs from other methods of biological study, which focus on the characteristics of isolated parts of a cell or organism. Systems biology examines the structure and dynamics of cellular and organism functions, and their interconnections and interrelationships. One ultimate goal of systems biology is to use knowledge of the complete genome sequence, and of all proteins encoded by that genome, to reconstruct the biological systems that are implied [2].

The development of systems biology is driven by technology. Sophisticated computational techniques are needed to analyze biological systems because of the complexity and dynamics involved. Machine Learning, which is an automatic and intelligent learning technique, has long been used to discover meaningful associations between proteins and to form scientific hypotheses [3].
The aim of this paper is to introduce Machine Learning techniques in the context of their application in systems biology.

II. CHALLENGES IN SYSTEMS BIOLOGY

Much of our failure to fully understand biological systems has been due to their size and complexity. Systems biology emphasizes large-scale discovery of the interactions of genes, proteins, and other cell elements. It is confronted with dynamic biological responses, a huge number of interactions, and inherent redundancy in many pathways and feedback systems.

A great deal of useful and important information about biological systems is hidden in high volumes of experimental data. For instance, there were 37 billion bases of DNA in 32 million sequence records in GenBank alone as of Feb. 2004 [12]. Analyzing high volumes of data to understand biological systems demands tedious experimentation and modern computational technology. This is the grand challenge for systems biology in this era. An intelligent approach is needed to extract the hidden information from the data and to cope with the rapid rate of data deposition.

III. MACHINE LEARNING

Machine Learning (ML) is the capability of computer algorithms to improve automatically through experience (i.e., the computer programs itself by seeing examples of the behavior we want). ML approaches are ideally suited for domains characterized by large amounts of data, noisy patterns, and the absence of general theories [4]. The fundamental idea behind these approaches is to learn the theory automatically from the data through a process of inference and model fitting. A system that can learn from experience and improve its performance automatically could serve as a tool for analyzing biological systems.

The main goal of ML is to induce general functions from a specific training data set. The learning agent is given a set of training examples and forms a hypothesis that explains them.
The agent must search through the hypothesis space and locate the best hypothesis when given the test set [5]. Because ML is concerned with learning from data examples, it often uses a probabilistic approach.

Manuscript received November 1, 2004.

IV. OVERCOMING THE CHALLENGES IN SYSTEMS BIOLOGY

ML approaches have gained popularity in systems biology. The characteristics of ML that make it well suited for systems biology are:

1. Many problems in biological systems are not well defined, but have a lot of experimental data. ML is useful when the structure of the task is not well understood but the task can be characterized by a data set with strong statistical regularity. While input/output pairs can be easily specified, the relationship between the inputs and outputs is often unknown (e.g., the protein folding mechanism). ML approaches can extract relationships and correlations hidden in large volumes of data (data mining). They can thus extract the information encoded in biological databases and use the available data to predict meaningful biological properties.

2. ML approaches can adjust their internal structure to produce correct outputs for a large number of sample inputs. They can thus constrain their input/output function to approximate the implicit relationship in the training examples [6].

3. ML approaches adapt themselves to new information (training examples). This is important in systems biology because new data are generated every day. The newly generated data might update the initial learning hypotheses.

ML thus provides efficient approaches to analyzing biological data.

V. MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY

A variety of ML techniques can be applied to many of the problems in systems biology. In a systems biology context, ML is used to discover meaningful knowledge from existing biological databases and to present that knowledge in an understandable pattern.
The tasks of ML in systems biology can be divided into seven categories, as shown in Table I [5]. These techniques, operating individually or in combination, can meet the various challenges in systems biology.

A. Protein Structure Prediction

Proteins are the essence of life. The secondary structure of a protein consists of α-helices, β-strands, and coils. The folding of these secondary structure elements forms the unique 3D structure of a protein. A great deal of useful information is contained in this 3D structure. Predicting protein structure, however, remains a central problem in bioinformatics. It is the bottleneck between sequencing efforts and drug design. ML approaches such as Inductive Logic Programming can be used to predict protein structure.

1) Inductive Logic Programming

Inductive logic programming (ILP) is a research area formed at the intersection of ML and Logic Programming. ILP systems develop predicate descriptions from observations and background knowledge. There are three main elements in an ILP learning system: observations, background knowledge, and hypothesis [5]. Each of these elements is a logic program. Fig. 1 shows the general scheme for ILP methods.

Observations and background knowledge are combined by an ILP program to form a hypothesis. A set of IF-THEN rules can then be derived from the hypothesis. For example:

Hypothesis:

    fold('Four-helical up-and-down bundle', P) :-
        helix(P, H1), length(H1, hi), position(P, H1, Pos),
        interval(1 =< Pos =< 3), adjacent(P, H1, H2), helix(P, H2).

Rule: The protein P has fold class 'Four-helical up-and-down bundle' if it contains a long helix H1 at a secondary structure position between 1 and 3, and H1 is followed by a second helix H2 [5].

The rules are tested on additional data. If experimentation leads to high confidence in the validity of the hypothesis, the hypothesis is added to the background knowledge.
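To make the IF-THEN reading of the fold rule concrete, the following is a minimal Python sketch of the rule body as an ordinary predicate. It is only an illustration, not how an ILP system evaluates clauses: the `Element` class, the coarse length classes "hi" and "lo", and the two toy proteins are assumptions introduced here, and "adjacent" is interpreted as "immediately following in the list of secondary-structure elements".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One secondary-structure element (hypothetical representation)."""
    kind: str    # "helix", "strand", or "coil"
    length: str  # coarse length class: "hi" (long) or "lo" (short)


def is_four_helical_bundle(elements):
    """Return True if the rule body holds for a list of elements.

    The rule holds when a long helix H1 sits at position 1..3 (1-based)
    and the element immediately after it, H2, is also a helix.
    """
    pairs = zip(elements, elements[1:])
    for pos, (h1, h2) in enumerate(pairs, start=1):
        if (h1.kind == "helix" and h1.length == "hi"
                and 1 <= pos <= 3 and h2.kind == "helix"):
            return True
    return False


# Toy proteins, invented for illustration only.
protein_a = [Element("helix", "hi"), Element("helix", "lo"), Element("coil", "lo")]
protein_b = [Element("strand", "lo"), Element("helix", "lo"), Element("helix", "hi")]

print(is_four_helical_bundle(protein_a))  # rule fires: long helix at position 1, then a helix
print(is_four_helical_bundle(protein_b))  # rule does not fire: the long helix has no successor
```

In an actual ILP setting the predicates `helix`, `length`, `position`, and `adjacent` would be facts in the background knowledge, and the clause would be induced rather than hand-written as it is here.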
TABLE I
MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY [5]

No.  Application          Description
1    Classification       Predicting an item's class.
2    Forecasting          Predicting a parameter value.
3    Clustering           Finding groups of items.
4    Description          Describing a group.
5    Deviation Detection  Finding changes.
6    Link Analysis        Finding relationships.
7    Visualization        Presenting data visually to facilitate human discovery.

Machine learning approaches to protein structure prediction and gene pathway discovery are examined in the following sections.

Figure 1. Scheme for ILP Methods [7]

ILP has been used for protein structure prediction. Muggleton et al. applied ILP by separating the data set of proteins into groups with the same type of domain structure (e.g., α-type domains). This gave the system a more homogeneous data set, thus allowing better prediction [7]. The ILP program used in this method was Golem. The basic algorithm was as follows:

1. Take a random sample of pairs of residues from the training set. This represents a set of pairs of residues chosen randomly from the set of all residues in all proteins represented.
2. Compute all the common properties for each pair of residues.
3. Convert the common properties into a rule that is true for the residue pair under consideration.
4. Choose the rule for the best residue pair. For example, choose the rule that predicts the most true α-helix residues while predicting fewer than a pre-defined threshold of non-α-helix residues from the training set.
5. Take another sample of unpredicted residue pairs.
6. Form rules that express the common properties of the best pair together with each of the individual residue pairs in the sample.
7. Repeat steps 4-6 until no improvement in prediction is produced.

The algorithm uses the best rule to eliminate a set of predicted residues from the training set. The reduced training set is then used to build up further rules. The process terminates when no further rules can be found.
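The covering loop described above can be sketched in a few lines of Python. This is a heavily simplified illustration under stated assumptions, not Golem itself: each residue is reduced to a set of hypothetical property names, "common properties" is taken to be plain set intersection, random sampling is replaced by enumerating all pairs (the toy data is tiny), and `max_neg` is an assumed parameter for the number of non-α-helix residues a rule may cover.

```python
from itertools import combinations


def covers(rule, props):
    """A rule (set of required properties) covers a residue that has them all."""
    return rule <= props


def score(rule, positives, negatives, max_neg):
    """Positives covered, or -1 if the rule covers more than max_neg negatives."""
    if sum(covers(rule, n) for n in negatives) > max_neg:
        return -1
    return sum(covers(rule, p) for p in positives)


def learn_rules(positives, negatives, max_neg=0):
    """Greedy covering: find the best rule, remove what it predicts, repeat."""
    remaining = list(positives)
    rules = []
    while len(remaining) >= 2:
        # Steps 1-3: one candidate rule per pair, from their common properties.
        candidates = [a & b for a, b in combinations(remaining, 2) if a & b]
        if not candidates:
            break
        # Step 4: keep the candidate that predicts the most positives.
        best = max(candidates, key=lambda r: score(r, remaining, negatives, max_neg))
        best_score = score(best, remaining, negatives, max_neg)
        if best_score <= 0:
            break
        # Steps 5-7: try generalizing the best rule with uncovered residues.
        improved = True
        while improved:
            improved = False
            for p in remaining:
                general = best & p
                if general and not covers(best, p):
                    s = score(general, remaining, negatives, max_neg)
                    if s > best_score:
                        best, best_score, improved = general, s, True
        rules.append(best)
        # Remove the residues predicted by the chosen rule.
        remaining = [p for p in remaining if not covers(best, p)]
    return rules


# Toy data, invented for illustration: property sets for helix / non-helix residues.
positives = [
    frozenset({"hydrophobic", "small", "prev_helix"}),
    frozenset({"hydrophobic", "large", "prev_helix"}),
    frozenset({"polar", "small", "prev_helix"}),
    frozenset({"hydrophobic", "small", "next_pro"}),
]
negatives = [
    frozenset({"polar", "small", "next_pro"}),
    frozenset({"hydrophobic", "large", "next_pro"}),
]

print(learn_rules(positives, negatives))
```

On this toy data the loop keeps a single rule requiring `prev_helix`, which covers three of the four positive residues and none of the negatives; the fourth positive stays unexplained because no further pair is available. A real ILP system generalizes over first-order clauses and background knowledge rather than flat property sets.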
Golem produced an accuracy of about 81% when applied to 16 proteins with α-type domains.

The disadvantage of ILP is that its rules are not probabilistic. Biological systems are characterized by a high degree of uncertainty; thus, hypotheses will have higher descriptive power if they incorporate a degree of probability [5]. To date, ML methods cannot, by themselves, completely describe a new protein's structure; however, they can provide valuable information regarding numerous structural attributes.

B. Gene Pathway Discovery

Systems biology seeks to discover causal relationships among a large number of genes and other cellular constituents. From a system-level point of view, the various interactions and control loops that form a genetic network are the basis from which the vast complexity and flexibility of life processes emerge. ML techniques such as clustering, Bayesian networks, and decision trees can be used to discover gene regulation pathways.

1) Gene Clustering

Clustering is a discovery approach that identifies subsets of data and groups them into classes. Each class represents data with similar attributes. A derivative clustering algorithm can also be used to predict and explain complex data.

Clustering algorithms are used to discover groups of genes that show similar expression patterns under different experimental conditions. By this procedure, different families of cell-cycle regulated genes in baker's yeast, Saccharomyces cerevisiae, have been identified [8].

Gene clustering has several drawbacks. Firstly, most clustering methods assign each gene to a single cluster, which can prevent complex interrelationships among genes from being exposed. Secondly, clustering does not always provide causal information. Genes sharing similar expression profiles may not always share a function.
Even when similar expression levels correspond to similar functions, the functional relationships among genes in a cluster cannot be determined from the cluster data alone [9]. Conversely, a gene may be suppressed to allow another to be expressed; thus, functionally related genes may be clustered separately, blurring the existing relationship.

A system named GEEVE, introduced by Yoo and Cooper, uses gene expression data to learn models of gene regulation and thus discover causal gene pathways [10]. The GEEVE system, shown in Fig. 2, consists of two modules: the causal Bayesian network update module, and the decision tree generation and evaluation module.

Figure 2. The GEEVE system [10]

2) Causal Bayesian Networks

A Bayesian network is a directed, acyclic graph of nodes representing variables and arcs representing dependencies among the variables. A Bayesian network encodes the joint probability distribution over all the variables. The joint distribution of a Bayesian network with N variables can be factored as follows:

$$P(x_1, x_2, \ldots, x_N \mid K) = \prod_{i=1}^{N} P(x_i \mid \pi_i, K) \tag{1}$$

where $x_i$ is the state of variable $X_i$, $\pi_i$ is a joint state of the parents of $X_i$, and $K$ denotes background knowledge [10].

Bayesian networks are capable of handling incomplete data sets, and are able to learn and predict the missing data. They also provide models of causal influence. These properties make Bayesian networks a promising tool for analyzing gene expression patterns.

In the context of genetic pathway inference, each node of a Bayesian network is assigned to a gene and can assume the different expression levels of this gene throughout the training data. Each edge between nodes (genes) denotes a regulatory relationship between them. If the edge is directed, as shown in Fig. 3, it denotes that one gene controls the other. Fig. 4 shows the feature graph trained for a genetic sub-network of baker's yeast.
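As a worked illustration of the factorization in Eq. (1), the following Python sketch evaluates the joint probability of a tiny three-gene network as a product of per-node conditional probabilities. The network structure (gene A regulating genes B and C), the binary on/off expression states, and all probability values are invented for illustration; the background knowledge K is left implicit.

```python
from itertools import product

# Hypothetical regulatory structure: A -> B and A -> C.
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[gene][parent_states] = P(gene is "on" (1) | parent states); values are made up.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}


def joint_probability(state):
    """Multiply P(x_i | parents of X_i) over all genes, as in Eq. (1)."""
    p = 1.0
    for gene, pars in parents.items():
        p_on = cpt[gene][tuple(state[q] for q in pars)]
        p *= p_on if state[gene] == 1 else 1.0 - p_on
    return p


# P(A=on, B=on, C=off) = 0.3 * 0.8 * (1 - 0.2)
example = joint_probability({"A": 1, "B": 1, "C": 0})
print(example)

# Sanity check: the joint distribution sums to 1 over all 2^3 states.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
print(total)
```

The point of the factorization is economy: the full joint table over N binary genes has 2^N entries, whereas the network only stores one small conditional table per gene.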
Figure 3. The structure of a causal Bayesian network that represents a portion of a hypothetical gene regulation pathway [10]

Figure 4. Genetic sub-networks of baker's yeast [11]

While Bayesian networks produce better results than rule-based learning methods, there is no clear explanation of the learning process. It is therefore hard to understand the results and to interpret them as useful knowledge [5].

3) Decision Trees

The decision tree is a simple inductive learning system that uses discrete-valued functions to estimate and classify the provided training set. The system is represented by a tree whose internal nodes are tests (boolean decisions) and whose leaf nodes are classes. The tree can make predictions about the probability of a particular case belonging to a particular class.

Decision trees can be used to model gene perturbation in experiments. The GEEVE system, for example, builds and evaluates a decision tree based on pair-wise gene relationships. Thus, the effects on gene X when gene Y is perturbed can be modeled [11].

The drawbacks of decision trees are over-fitting of the data and overlap between classes. These and other factors make decision trees difficult to optimize.

VI. CONCLUSION

Since ML primarily deals with the extraction of knowledge from data, redundancy of data is an important issue facing ML. The quality of biological data is usually compromised by experimental errors, incorrect interpretation by biologists, non-standardized experimental techniques, etc. The uncertainty associated with experiment-based research is very high.

Despite these challenges, ML techniques have contributed to the success of systems biology in recent years. ML has helped accelerate research in several areas of systems biology, including protein structure prediction, inference of genetic and molecular networks, and gene-protein interactions. The author believes that systems biology will continue to benefit from ML techniques in the coming years.
REFERENCES

[1] Kitano, H., "Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology", Curr. Genet., Vol. 41(1), 2002, pp. 1-10.
[2] Lathrop, R., "Intelligent Systems in Biology: Why the Excitement?", IEEE Intelligent Systems, Vol. 16(6), 2001, pp. 8-13.
[3] Luke, S., Hamahashi, S., Kyoda, K., Ueda, H., "Biology: see it again-for the first time", IEEE Intelligent Systems, Vol. 13(5), 1998, pp. 6-8.
[4] Hu, Y., Kibler, D., "Combinatorial motif analysis and hypothesis generation on a genomic scale", Bioinformatics, Vol. 16(3), 2000, pp. 222-32.
[5] Tan, A., Gilbert, G., "Machine Learning and its Application to Bioinformatics: An Overview", www.brc.dcs.gla.ac.uk/~actan/publications.html, Retrieved: Oct. 27, 2004.
[6] Nilsson, N., "Introduction to Machine Learning", unpublished, http://robotics.stanford.edu/people/nilsson/mlbook.html, 1996, Retrieved: Oct. 27, 2004.
[7] Muggleton, S., King, R., Sternberg, M., "Using logic for protein structure prediction", Proceedings of the 25th Hawaii Int. Conf. on System Sciences, IEEE Computer Society Press, 1992.
[8] Spellman, P.T., "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Molecular Biology of the Cell, 1998, pp. 3273-3297.
[9] Shatkay, H., Edwards, S., Boguski, M., "Information retrieval meets gene analysis", IEEE Intelligent Systems, Vol. 17(2), 2002, pp. 45-53.
[10] Yoo, C., Cooper, G., "An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways", Journal of Artificial Intelligence in Medicine, Vol. 31(2), 2004, pp. 169-182.
[11] Stetter, M., "Large-Scale Computational Modeling of Genetic Regulatory Networks", Artificial Intelligence Review 20, 2003, pp. 75-93.
[12] National Center for Biotechnology Information: GenBank Overview, www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html, Retrieved: Oct. 27, 2004.