Hybrid Systems: Two Examples of the Combination of Rule-Based Systems and Neural Nets March 2001 Pieter Buzing Neural networks and rule based systems both have their clear advantages as well as their disadvantages. Combining these two could lead to a powerful system that profits from the positive characteristics of each other. Fu(1989) and Towell & Shavlik(1994) each proposed such a hybrid systems. This paper gives a comparison between them and concludes that in both systems the inductive learning ability of a neural network can contribute to a knowledge base and that the semantic foundedness of a knowledge base can be a good starting point for a neural network. 1 Introduction In the AI tradition rule-based systems and neural networks are two very different fields of research. Both approaches have their own merits and their own flaws. These flaws are for the most part complementary [1, 3]: a neural net holds no semantics, while a knowledge base does contain explicit knowledge. A knowledge based system has difficulty dealing with continuous variables; a neural net can learn continuous probability distributions from examples. Neural networks ignore problem-specific theory (e.g. Features may be context dependent), while knowledge based systems are designed to deal with domain-specific reasoning. Knowledge based systems have difficulty acquiring new knowledge, while neural nets learn inductively by their nature. So combining the two could be fruitful: the network could be structured in such a way that knowledge is 'visible' in the architecture and the learning mechanism of the neural net could show if the rules and the data set are in accordance with each other. In this paper we will discuss two hybrid systems that use a knowledge base to initialize a neural network. The network is then trained with examples. I shall make a comparison of the two systems based on the following questions: How do the systems react to (initial) erroneous rules? Are the semantics maintained (extractable) after neural training? The first question indicates whether the neural net approach adds something to a knowledge base system. The latter addresses the added value of the rule base structure to a neural network. Section two will hand some technical knowledge on the two classical approaches and the new field of hybrid systems. In the following three sections the two systems (one by Fu [2] and one by Towell & Shavlik: KBANN [1]) will be compared with regard to the rules-to-network phase (typical in these kind of hybrid systems), the training phase and the fault tolerance (i.e. Handling incorrect rules and noisy data). In section six you will find a discussion of the context in which these different systems were developed. Section seven presents the conclusions that can be drawn, focussing on the two points mentioned above. 2 Background of the techniques In this section I will explain about the basic principles of both classical rule based systems and neural networks. The strengths and weaknesses of each system will be discussed. Paragraph 2.3 will give arguments for hybridization. 2.1 Rule-based systems Rule based systems consist of a rule base and a fact base [3]. The rule base contains general knowledge (in implicational form) about a certain subject area, while the fact base expresses specific knowledge of a particular case. The rules are used in the inference process to derive new facts from given ones. There are two basic reasoning methods: forward-chaining and backward-chaining. The first method starts with the known facts and applies rules in order to eventually reach the goal conclusion. The latter method starts with the goal. It recursively selects rules that would deduce a (sub) goal until the set of goals is completely resolved by given facts. Of course a bi-directional approach is also possible. One way of dealing with uncertainty is the use of certainty factors. Each rule has been given a CF between -1 and 1. When a rule fires the conclusion of that rule is assigned the CF of the rule. When the premise is uncertain (CF<1) then the CF of the conclusion is adjusted to either the minimum of the premises (in case of a conjunction) or the maximum of the premises (disjunction). E.g.: Rule1 if A and B then C (CF=0.8) if CF(A)=0.8 and CF(B)=0.5 then CF(C)=0.4 Rule2 if A or B then C (CF=0.9) if CF(A)=0.7 and CF(B)=0.2 then CF(C)=0.63 The strength of a rule-based system is the high abstraction level. Knowledge can be declared in a very comprehensive manner, making it possible to easily verify the knowledge (rule) base with (human) domain experts. The system also gives explanations for the given answers in the form of inference traces. Typical weaknesses are dealing with incomplete, incorrect and uncertain knowledge, continuous variables and non-monotonic logic [3]. A complete domain theory may require thousands of (possibly recursive) rules, which could lead to a very slow system [1]. Also the system does not “learn” anything by itself. 2.2 Neural networks In an artificial neural network a number of neurons are connected with each other (see figure 1). We distinguish the input layer, the hidden layer(s) and the output layer. Each connection has a certain weight. Each node propagates a value calculated by a function taking the net sum of the weighted activation of all connections leading to that node as its input (see figure 2). A bias can be added by connecting a bias-node (that always has 1 as activation value) with the node. The weight of this connection is called the bias. In the training phase each input pattern is propagated through the network, after which the error (squared sum of the difference between the desired output and the actual output) is calculated. The weights are then adjusted using a back-propagation algorithm. The adjustment to each weight value is calculated by using the derivative of the error function: we want to minimize the error, so we choose the direction that gives the steepest descent. The learning rate is a measure for the step size. Big learning rates give good results in the beginning, but often fail to find the optimum. A too small learning rate can cause the algorithm to run very slow or get caught in a local optimum [6]. X1 w1,1 w2,1 w1,2 X2 w1,3 X3 w2,2 w2,3 w2,4 w1,4 Input layer Hidden Output layer layer Figure 1 A simple neural network architecture. X1 s w1 x X2 w2 x x wn x Xn x f o Net=ni=1wi*xi (1) wi = **xi f(net)=(1+exp(-net))-1 (2) f’(net) = exp(-net)*(1+exp(-net))-2 = f(net)*(1-f(net)) (3) = (d-o)*f’(net) = (d-o)*o*(1-o) 2a: node f and its connections 2b: the gaussian activation function 2c: wi is the weight adjustment. is the learning rate. d is the desired activation. o is the actual output. Figure 2 A gaussian activation function and some formulas for the weight adjustment. A neural net can learn from mere examples. In many domains it is far more easier to collect a representative data set, than to construct a (complete) knowledge base. It does not suffer too much from noisy data and is not biased like human experts might be. A big problem in neural networks is the choice of architecture: the only way to decide on a certain architecture is by trial-and-error. But the main weakness of a neural net is the lack of understandability, i.e. It is impossible to extract any knowledge from a trained neural network. Also in complex domains large data sets are required which -if available- lead to lengthy training times. 2.3 Hybrid systems In the last decade researchers started to realize that rule based systems and neural networks are just two ends of a whole intelligence spectrum [3]. A combination of these two approaches certainly sounds appealing, as many weaknesses mentioned above can be compensated. The hybrid approach considered in this article takes the domain knowledge as a starting point for the network architecture. This is done by the mapping shown in figure 3. Knowledge Base Neural Network Final Conclusions Output Units Attributes Input Units Intermediate Conclusions Hidden Units Dependencies Weighted Connections Figure 3 The correspondence between a rule base and a neural net 3 From Rules to Network Both systems start out with a (classical) knowledge base, consisting of a (not necessarily complete) set of rules. This domain theory has to be translated into a neural network. Each system has it’s own algorithm for this. 3.1 Fu: Constructing a conceptual network The knowledge base is transformed into a conceptualization network in the following way (see also figure 4). 1 The rules are rewritten in their conjunctive form. 2 Data attributes are mapped into input units. 3 Concepts (intermediate hypotheses) are mapped into hidden nodes. 4 Final hypotheses are mapped into output units. 5 Conjunction nodes that form a bridge between the condition nodes and the consequence node are added. 6 The rule’s CF is mapped into the weight of the corresponding connection between a conjunction unit and a consequence unit. Connections between the non-conjunctive layers and the conjunctive layer are set to one. Step 2 Step 1 Step 3 R1: A B P (CF=0.8) Q P R2: B C D Q (CF=0.5) R3: P Q Z (CF=1.0) A Step 4 A Figure 4 C TheBsix A Z B B Q BZ w=1.0 P w=0.8 D A B steps in the Fu rules-to-network C D B algorithm A D C Step 6 P Q B D C Step 5 Z P B B C Q w=0.5 D B The resulting network is in fact a neural network. The optimal weight vector we are looking for actually represents the set of certainty factors of the rule set. In order to maintain the conjunctive interpretation of the connection he added extra layers with nodes that propagates the minimum value of the incoming activations – in conformity with the certainty factors calculus mentioned in 2.1. The weights of the connections between the conditions and the conjunction nodes are always set to one, never to be changed. This is done to simplify the problem caused by the conjunction, which mathematically (and semantically) is hard to deal with. Fu tested his approach on the MYCIN knowledge base [4]. The MYCIN program could diagnose infectious diseases and recommend therapy for that bacterial infection. Nowadays the system is considered overestimated and primitive, but in the eighties it was a breakthrough in AI. One of it’s renowned qualities is the fact that it can handle uncertainty in a comprehensive manner: certainty factors. 3.2 KBANN: Rules-to-Network algorithm The KBANN system uses a seven step algorithm to construct a network. 1 Rewriting: the rules are transformed into Horn clauses. Rules with the same consequence are rewritten in the following form: {CDB, EFGB} transformed to {CDB’, EFGB’’, B’B’’B} 2 Mapping: the rules are then organized into a neural net. The weight values are chosen in such a way that the activation emulates an AND-function. Towell & Shavlik empirically found w=4 to be a good initial weight value (w=-4 for negative antecedents). The bias of the unit is set to (P1/2)*w, where P is the number of positive antecedents. 3 Numbering: each node in the network is assigned a number according to its ‘level’ (defined as the longest path to an input unit). 4 Adding hidden units: new units are placed in the network to facilitate learning of new features that were not yet expressed in the initial knowledge rules. This step is optional, because the given rules often have enough expressive power to make the learning of new rules possible. 5 Adding input units: the domain expert can identify certain features that (by chance or ignorance) were not caught in a rule. 6 Adding links: in this step links with weight zero are added to the network. Each node at level n-1 is connected to all nodes with level n. 7 Perturbing: in order to avoid problems caused by symmetry (all the connections leading to one node are initialized with the same weight value) a small random value is added to each weight in the network. A B Z B’ C D B’’ E F G Y S T X Figure 5: Left is the situation after step 1. Right is the final network, where the dark nodes are the ones that were added. Not all nodes between two adjacent layers are connected in the figure, but they should be. Also the weight values are not shown. As shown in step two, KBANN chooses weight values that approximately give the same behavior as the original AND-function. But how do we interpret a weight value in a conjunction? The meaning of the rules is now lost. The implementation tested by Towell & Shavlik considers a DNA domain, where uncertainty is not really an issue. The only problem is that rules can turn out to be incorrect and that some rules may not be known yet. Each nucleotide of a DNA string (size: 57 nucleotides) is connected to an input unit. The goal is to decide whether the input is a so-called promoter or not. A promoter is a short DNA sequence that precedes the beginning of a gene. 3.3 Differences One major difference between the two systems is that Fu proposes to model uncertainty and that KBANN does not. KBANN rules can be given a certainty value, but in Towell & Shavliks view this is not necessary because CFs have not proven decisive: during the training process (discussed in the next section) the network will find the right weight even if the initial certainty is wrong. Even in the MYCIN system you could change a number of CFs and still get the correct conclusions. The strength of an artificial neural network (as opposed to humans) lies in its ability to find the correct weight values. This is achieved by the back-propagation algorithm, which is discussed in the next section. Another difference is that the Fu system has 'conjunction layers' to facilitate the conjunction function. He prefers to hold on to the semantics, but this means that he has to give up on the smooth gaussian activation function. The AND-function is also not differentiable. The implications of this will be made clear in the next section: training. KBANN's substitute AND-function bears no meaning in the way that we can interpret it in a (un)certainty context. 4 Learning Algorithms Both systems use back-propagation but Fu has the problem that the evaluation function is not differentiable everywhere (because of the conjunctions) so it has to use a second learning algorithm. 4.1 Fu: Back-propagation and Hill-climbing The constructed network consists of conjunction and non-conjunction layers. The latter have a differentiable function, so a conventional back-propagation algorithm can be used for these units. In the conjunction layers however, this is not possible because AND is not a differentiable function. The solution given by Fu is a hill-climbing mechanism to find the best path to propagate the error. We follow a one-step look-ahead strategy: the error in node Z (see figure 4, step 6) could be caused by unit P or Q. How much would the performance of the system improve if we blame unit P? This is done by adjusting the weight of the connection that leads to P. After comparing the two blaming possibilities we choose to adjust the weight of the unit that would yield the greatest gain in performance. 4.2 KBANN: Back-propagation In the KBANN network all functions are differentiable, so back-propagation is applied to all layers. This was explained in section 2.2. Towell and Shavlik compared their KBANN with a standard neural network. Both networks used the same functions, the same back-propagation algorithm and the same training and test sets. The researchers varied the size of the training set. Results clearly showed that when the training set was small KBANN performs better than the standard neural net. When the training set increased, the difference in performance between the two networks decreased to zero. This indicates that the initialization of KBANN with domain knowledge speeds up the learning process. 4.3 Differences One can argue that the hill-climbing heuristic is not very sophisticated, but the way in which KBANN transforms the conjunctions into conventional neural network nodes is not very graceful either. KBANN adds a lot of connections, which have nothing to do with the domain theory. Fu keeps his initial connections, but after a training session it can be decided to add some connections and train it again, expecting better results. The advantage of the Fu algorithm is that it keeps touch with the original rules, while KBANN trails of into pure mathematical functions, ignoring the meaning of the knowledge. This makes it very hard to extract knowledge from the network. 5 Fault Tolerance One of the problems in rule based systems is the inability to repair an incorrect knowledge base. Neural nets on the other hand can still perform pretty well when confronted with corrupt data sets. How do both hybrid systems cope with erroneous domain theory? 5.1 Fu: Removing the corrupt rule The Fu system is able to identify and remove incorrect rules. A rule is considered incorrect if the change in the weight was reasonably high. The only problem is that a threshold for the weight shift has to be identified, otherwise correct rules with a minor weight adjustment could be falsely accused. But if the threshold is too high incorrect rules might not be identified as such. When a rule has been removed, back-propagation is resumed until the network is stable and the next weakest rule is removed. Fu conducted ten experiments. In each of them he replaced six out of fifty rules are replaced by incorrect ones. The system would still identify these malicious rules as incorrect in all experiments. Another interesting issue is the deduction of new rules: the system could propose rules (i.e. correlation between attributes) that the human experts did not think of. Fu describes two procedures to create new rules, but he did not implement them. The first one is to add new nodes to the network. But this would be a big number of new nodes and connections, thus demanding a lot of computing power and losing the so carefully preserved semantics. Secondly he argues that if the desired output is always higher then the actual output, then the rule has to be generalized (i.e. deletion of a premise) and that if the desired output is lower then the actual output, then the rule should be specialized (i.e. adding of a premise). 5.2 KBANN: Handling incorrect rules This system does not add or delete rules in the way that Fu described. The addition of many connections causes the number of rules (remember that all the nodes between two layers are connected) to increase dramatically. Every connection with a weight value other than zero could indicate a rule. Even when you find a big change in a certain weight value it is hard to construct or delete a rule, because a node can not be considered a conjunction anymore. But the system can handle incorrect initial rules very convincingly. Tests made clear that 10% of the rules could be removed or added and the KBANN system would still outperform a standard neural network. Also a small ‘adjustment’ in the rules would not lead to a drastic loss in performance: 30% of incorrect rules (adding or removing a condition) would still leave KBANN superior to a standard Neural network. 5.3 Differences The different nature of both systems becomes very clear now. The Fu system maintains the meaning of the domain theory in his network, thereby making it possible to identify incorrect rules, (optionally) leaving it up to the expert to judge what caused the wrong behavior: the rule or the data set. The KBANN system solves the incorrect domain theory problem entirely different. It says nothing about the correctness of a rule, it just alters the weights in such a way that the output improves (like a real neural net should). The test results show no great difference in the way both systems cope with incorrect domain theory. 6 Discussion The first thing that has to be said is that Fu and Towell & Shavlik conducted their research in different times. Li-Min Fu published his findings in 1989, when explicit knowledge rules and certainty factors were still considered the basis for artificial intelligence. In my opinion he overestimates the ability of humans to correctly formulate domain theory. Towell & Shavlik(1994) have given up on the ‘sacred rules idea’; they use the domain knowledge to initialize the neural net, after which the computing power should lead to a good system behavior. Fu is more interested in the interpretation of the network. Though both projects were tested in the biological field, there are some apparent differences between the domains. First of all, the DNA knowledge base used by Towell & Shavlik does not contain uncertain rules in the way that the Mycin KB (infectious diseases) does. CFs have not proven decisive. In the Mycin rule base one can alter the CF of a rule and still acquire the same diagnose. They are not mathematically founded either. There are other logical techniques available that can express uncertainty much better, like probability theory and fuzzy logic. 7 Conclusion How do the systems react to (initial) erroneous rules? Both systems can handle incorrect rules. The difference is that Fu points out exactly which rules cause a discrepancy with the provided data set. This adds an extra dimension to the system, because the expert can verify his knowledge rules. The KBANN system just lets the weights of a certain connection drop to zero, when there is a low correlation between two concepts. The network performs well, but it is hard to tell which initial rules were wrong. Whether this is a problem depends on the goal of the implementation: if you want to verify your rule set with actual data the Fu system would be more appropriate. If your goal is to make a network that simply does the job, the KBANN black box is suitable. Are the semantics maintained (extractable) after neural training? The Fu system is in fact a conceptual network. The adjusted knowledge rules can easily be extracted: the nodes that lead to a conjunction node are the condition of a rule, the node that the conjunction unit is pointing to is the consequent and the weight of this last connection is the certainty factor of the rule. The KBANN system does not have that property. There are too many connections added to the network, which would mean that extracted rules would have many (all) attributes as a premise. This has no practical value, though Towell & Shavlik are working on that [7]. 8 [1] [2] [3] [4] [5] [6] [7] References Geoffrey G. Towell, Jude W. Shavlik 1994. Knowledge-Based Artificial Neural Networks. Artificial Intelligence, Vol. 70 1994. Li-Min Fu 1989. Integration of Neural Heuristics into Knowledge-based Inference. Connection Science, Vol. 1, No.3, 1989. W. Ertel, C. Goller, M. Schramm 1995. Integrating Rule Based Reasoning and Neural Networks. B.G. Buchanan, E.H. Shortliffe 1984. Rule-Based Expert Systems. Reading, MA: AddisonWesley K. Mehrotra, C.K. Mohan, S. Ranka 1997. Elements of Artificial Neural Networks. MIT Press. W.S. Sarle 1994. Neural Network Implementation in SAS Software, Proceedings of the Nineteenth Annual SAS Users Group International Conference. G.G. Towell 1991. Symbolic Knowledge and Neural Networks: Insertion, Refinement, And Extraction. PhD thesis, CS Department, University of Winsconsin, Madison, WI.