Extracting Qualitative Rules from Observations – A Practical Behavioural Application

José Tomé, INESC, Rua Alves Redol nº9, 1000 Lisboa, PORTUGAL
Ricardo Tomé, Rua Principal, Lote 14C, 1º Esq, Alto dos Lombos, Carcavelos, PORTUGAL
João Carvalho, INESC, Rua Alves Redol nº9, 1000 Lisboa, PORTUGAL

Key-Words: Qualitative rules, Neural Fuzzy, Rule Extracting, Learning.

Abstract: Fuzzy Boolean Networks are capable of learning from experiments by setting internal memory elements of the neurons. They are also able to perform a kind of qualitative reasoning in which the fuzzy properties emerge macroscopically from the internal structure of the neurons. Here an application is presented in the area of animal behaviour: from observations of owls' hunting habits, fuzzy rules that describe their behaviour are extracted, and the confidence in those rules is also derived.

1 Introduction

Natural or biological neural systems have a number of features that give them the capability to learn when exposed to sets of experiments from the real outside world, and they are also able to use the learnt knowledge to reason in an approximate way. This work presents a practical application of a very simple model [8] that exhibits the same kind of behaviour. The model can be considered a neural fuzzy model in which fuzziness is an inherent, emerging property, in contrast with other known models where fuzziness is artificially introduced into neural nets or where neural components are inserted into fuzzy systems [1, 3, 4, 5, 7].

The model is based on cards (or areas) of neurons, these areas being connected by meshes of weightless connections between antecedent neuron outputs and consequent neuron inputs. The individual connections are random, just as in nature, and neurons are binary with respect to both inputs and outputs. It is considered, however, that a neuron's internal unitary memories can also take a third state, in addition to the classical "1" and "0", meaning "not taught". The model is also robust in the sense that it is immune to individual neuron or connection errors, as in nature, which is not the case for other models such as classic artificial neural nets.

To each area a concept or variable is associated, and the "value" of that concept or variable, when stimulated, is given by the activation ratio of the area (that is, the ratio between the number of activated neurons, those with output "1", and the total number of neurons), just as in natural systems. Each consequent neuron samples each of the antecedent areas, using a limited number (say m) of that antecedent's neuron outputs as its own inputs. This number, m, is much smaller than the number of neurons per area. This means that, for rules with N antecedents and one consequent, each neuron has N.m inputs. The operations carried out by each neuron are simply combinatorial counts of the number of activated inputs from each antecedent. They are performed with the classic Boolean operations (AND, OR), using as operands the inputs coming from the antecedents and the Boolean internal state variables. These internal variables are established during a teaching phase.
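To make this structure concrete, the following minimal sketch (Python) sets up areas of binary neurons and consequent neurons that randomly sample m outputs of each antecedent area and keep one ternary memory cell ("not taught"/0/1) per combination of input counts. All names, and the sizes N_NEURONS and M_SAMPLES, are illustrative choices, not values fixed by the model.

```python
import random

N_NEURONS = 1000   # neurons per area (card); illustrative
M_SAMPLES = 71     # m: inputs sampled from each antecedent area, m << N_NEURONS

class ConsequentNeuron:
    """One consequent neuron: m random weightless taps into each antecedent
    area and one ternary memory cell per combination of activated-input
    counts (a missing key means 'not taught')."""
    def __init__(self, n_antecedents):
        self.taps = [random.sample(range(N_NEURONS), M_SAMPLES)
                     for _ in range(n_antecedents)]
        self.memory = {}   # (i1, ..., iN) -> 0 or 1

    def counts(self, antecedent_outputs):
        """Combinatorial count of activated sampled inputs per antecedent,
        the only information the neuron uses."""
        return tuple(sum(area[j] for j in taps)
                     for taps, area in zip(self.taps, antecedent_outputs))

def activation_ratio(area_outputs):
    """Macroscopic 'value' of a concept: fraction of neurons firing."""
    return sum(area_outputs) / len(area_outputs)

# usage: two antecedent areas stimulated to activation ratios 0.3 and 0.8
areas = [[1 if random.random() < p else 0 for _ in range(N_NEURONS)]
         for p in (0.3, 0.8)]
neuron = ConsequentNeuron(n_antecedents=2)
print(neuron.counts(areas), activation_ratio(areas[0]))
```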
It has been proved [8] that, from these micro operations (that is, operations on individual neurons using local and limited information), a global, or macro, qualitative reasoning capability on the concepts emerges, which can be expressed in the form of rules of the type:

IF Antecedent1 is A1 AND Antecedent2 is A2 AND ... THEN Consequent is Ci,

where Antecedent1, Antecedent2, ..., Consequent are variables or concepts and A1, A2, ..., Ci are linguistic terms (or fuzzy sets) of these variables (such as "small", "high", etc.). The proof [8] follows from the interpretation given to the relationship between antecedent and consequent activation ratios, the latter being given by:

$$p_{out} = \sum_{k_1=0}^{L-1}\cdots\sum_{k_N=0}^{L-1} pr(k_1,\ldots,k_N)\prod_{i=1}^{N}\sum_{q_i\in S_{k_i}}\binom{m}{q_i}\,p_i^{q_i}(1-p_i)^{m-q_i} \qquad (1)$$

In this expression L is the number of subsets of counts (each one a set of consecutive integers taken from {0, 1, ..., m}), the S_{k_i} are those sets, q_i ranges over the possible numbers of activated inputs on antecedent i belonging to the set indexed by k_i, p_i is the activation ratio of antecedent area i, and pr(k_1, ..., k_N) are the activation ratios of the internal binary memories (measured over every neuron), which are established during the learning phase. It is assumed that each neuron has one element of binary memory (with the additional capability of memorizing whether or not it has been taught) per set of counts on the antecedent inputs (each set represented by q_1 in S_{k_1} activated inputs on antecedent 1, ..., and q_N in S_{k_N} activated inputs on antecedent N). If, in the expression above, the inner sums are taken outside, one obtains the following expression, where the counts are viewed individually and not grouped into the sets S_{k_i}:

$$p_{out} = \sum_{i_1=0}^{m}\cdots\sum_{i_N=0}^{m}\left(\prod_{j=1}^{N}\binom{m}{i_j}\,p_j^{i_j}(1-p_j)^{m-i_j}\right)pr(i_1,\ldots,i_N) \qquad (2)$$

Using the algebraic product and the bounded sum as the t-norm and t-conorm fuzzy operations, respectively, it follows that from the microscopic neural operations defined above the fuzzy qualitative reasoning emerges at the macroscopic, or network, level. To this purpose the equations above may be interpreted as follows [8]. The input variables, the activation ratios p_j, are fuzzified through binomial membership functions of the form $\binom{m}{i_j}p_j^{i_j}(1-p_j)^{m-i_j}$; the evaluation of this expression for a given p_j represents the membership degree of p_j in that fuzzy set. The product of the terms over j = 1, ..., N represents the fuzzy intersection of the antecedents, by definition of the above t-norm. Considering the consequent fuzzy sets as singletons (amplitude "1") at the consequent universe-of-discourse values pr(i_1, ..., i_N), it follows that the equations represent defuzzification by the Centre of Area method [5].

Learning is performed by exposing the net to experiments and modifying the internal binary memories of each consequent neuron according to the activation of the m inputs (per antecedent) and the state of that consequent neuron. This obeys the following evolution of the consequent activation ratio, for each of the binary memories in the neurons: p(t+1) - p(t) = Pa.(pout - p(t)), where Pa is the probability of the antecedent configuration activating one of the binary memories and pout is the probability of a consequent neuron being activated. During the learning phase the network is activated by a collection of experiments that set or reset the individual neurons' binary memories and establish the pr probabilities in the expressions above.
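As a numerical illustration of expression (2) for a single antecedent (N = 1), the sketch below fuzzifies an activation ratio p through the binomial membership functions and defuzzifies it against the memory ratios pr(i). The values of m and pr used here are purely illustrative.

```python
from math import comb

def consequent_ratio(p, pr, m):
    """Expression (2) with N = 1: binomially weighted average of the learned
    memory ratios pr[i], i = 0..m, for antecedent activation ratio p."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) * pr[i]
               for i in range(m + 1))

m = 4
pr = [0.0, 0.1, 0.5, 0.9, 1.0]            # assumed memory ratios per count
for p in (0.1, 0.5, 0.9):
    print(p, round(consequent_ratio(p, pr, m), 3))
```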
For each experiment, a different input configuration (defined by the specific samples taken from the input areas) is presented to each and every one of the consequent neurons. This configuration addresses one and only one of the internal binary memories of each individual neuron. The updating of each binary memory value depends on whether it is selected and on the logic value of the consequent neuron. This may be considered a Hebbian type of learning [2] if pre- and post-synaptic activities are, in the present model, given by the activation ratios: p_j for antecedent area j and pout for the consequent area. For each neuron, the m+1 different counts are the meaningful parameters to take into account for the pre-synaptic activity of one antecedent. Thus, in a given experiment, the correlation between post-synaptic activity (pout) and pre-synaptic activity (the probability of activating a given decoder output d(i_1, ..., i_N), corresponding to i_1 activated inputs on antecedent area 1, ..., i_N activated inputs on area N) can be represented by the probability of activating the different binary unitary memories. In practical terms, for each teaching experiment and for each consequent neuron, the state of binary memory f(i_1, ..., i_N) is determined by, and only by, the Boolean values of the decoder output d(i_1, ..., i_N) and of the output neuron state considered.

Considering the pr(i_1, ..., i_N) in expression (2) as the synaptic strengths, one may have different learning types, depending on how they are updated. Here one considers the interesting case where non-selected binary memories maintain their state and selected binary memories take the value of the consequent neuron, which corresponds to a kind of Grossberg-based learning. It corresponds to the following updating equation (where indexes are not represented and p is used in place of pr, for simplicity), with Pa the probability of activating, in the experiment, the decoder output associated with p:

p(t+1) - p(t) = Pa.(pout - p(t))

To show that the network converges to a taught rule, consider any initial p different from Pout, a given antecedent configuration with activation probability Pa, and let Pout be the consequent activation ratio to be taught. Taking p(t+2) - p(t+1), with Pa and Pout the same for all experiments:

p(t+2) - p(t+1) = Pa.(Pout - p(t) - Pa.(Pout - p(t))) = (p(t+1) - p(t)) - Pa^2.(Pout - p(t))

It is a simple matter to verify that |p(t+2) - p(t+1)| < |p(t+1) - p(t)| for any t. Thus it may be concluded that, with a set of coherent experiments (teaching the same rule), the net converges. It will establish p with the same value as Pout, and in each experiment it will approach Pout proportionally to the distance between the present value of p and Pout itself, that is, with decreasing steps that approach zero.

Moreover, it has been proved [10] that the net is capable of learning a set of different rules without cross-influence between them, and that the number of distinct rules the system can effectively distinguish (in terms of different consequent terms) increases with the square root of the number m. This is an interesting result that should be verifiable in animal brains: according to this simple model, the number of synapses coming from a given concept area is expected to be at least of the order of the square of the number of different linguistic terms on the consequent. Finally, it has also been proved that this model is a Universal Approximator [9], since it theoretically implements a Parzen Window estimator [6].
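The convergence argument above can be checked numerically with the sketch below, which iterates the update p(t+1) = p(t) + Pa.(Pout - p(t)) for illustrative values of Pa and Pout; each step shrinks by the factor (1 - Pa), so p approaches Pout with steps that go to zero.

```python
def teach(p, p_out, pa, n_experiments):
    """Repeated Grossberg-type updates of one memory ratio p toward the
    taught consequent ratio p_out."""
    history = [p]
    for _ in range(n_experiments):
        p += pa * (p_out - p)          # p(t+1) - p(t) = Pa * (Pout - p(t))
        history.append(p)
    return history

h = teach(p=0.9, p_out=0.2, pa=0.35, n_experiments=10)
steps = [abs(b - a) for a, b in zip(h, h[1:])]
print([round(x, 4) for x in steps])    # each step is (1 - Pa) times the previous one
```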
This means that these networks are capable of implementing (approximating) any multi-input single-output function of the type

[0,1]^N -> [0,1].    (3)

These results give the theoretical background establishing the capability of these simple binary networks to perform reasoning and effective learning based on real experiments. Since the nets present topological similarities with natural systems, and also share some of their important properties, it may be hypothesized that they constitute a simple model for natural reasoning and learning. Also, since the emergent reasoning may be interpreted as fuzzy reasoning, it may likewise be hypothesized that natural reasoning is intrinsically fuzzy.

2 Behavioural Application

The practical problem addressed in this paper is related to the hunting behaviour of owls; specifically, one wants to derive which qualitative rules can be extracted from a representative set of observations. The monitored antecedent variables are the perch (resting place) height (H), the waiting time on that resting place (W) and the distance of attack (D). The consequent variable is the success (S) of the hunting attempt.

In order to use the Boolean fuzzy net it is necessary to reconcile the continuous values of the antecedents with the binary inputs the net expects. Since the binary nature of the antecedent inputs results from the fact that they come from an area of neurons where the respective variable value is the activation ratio, one has to define the interval limits for each variable and to find the ratio between the value of each continuous observation and that interval. Then a random assignment of "1"s and "0"s is made to every antecedent neuron in order to obtain the desired antecedent activation ratio (see the sketch below). Any Monte Carlo type method can be used for this purpose. That is, for each antecedent Ak one gets:

Oij = Rand(Vij / (Max - Min)),

where Oij is the output of neuron i in experiment j, Vij is the actual continuous value of variable k in the same experiment, and Rand(x) is a function that returns "1" with probability x and "0" otherwise.

The theoretical conditions for convergence [10] state that the number of inputs per consequent neuron and per antecedent should be of the order of the square of the number of consequent linguistic terms that must be differentiated. These conditions are achieved when the number of experiments (observations in this case) goes to infinity. The theoretical results also show that the results of the Fuzzy Boolean Nets are equivalent to those achieved with non-parametric estimation by Parzen windows [6], which implies that every data experiment is assumed to belong to a distribution with a given probability density. The same conjecture is made in proving the relation between the number of inputs per consequent neuron and the granularity. Thus one assumes that the observations obtained comply with these assumptions.

The behaviour of the Boolean Fuzzy Net has also been studied [11] in terms of the quadratic error obtained when the number of neurons and the number of inputs per consequent neuron and per antecedent (mA) are used as parameters. In that study, experimental values for antecedents and consequent were artificially generated from normally distributed variables with known averages and variances. It has been observed that the error diminishes with mA until a minimum is reached, depending on the number of neurons.
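Returning to the binarisation step described above, the sketch below converts one continuous observation into a random binary pattern over an antecedent area whose activation ratio follows Oij = Rand(Vij/(Max - Min)). The interval limits used here are placeholders for illustration, not the ones adopted in the paper.

```python
import random

def binarise(value, v_min, v_max, n_neurons=1000):
    """Map a continuous observation to a binary antecedent pattern whose
    activation ratio is Vij / (Max - Min), clipped to [0, 1]."""
    ratio = min(max(value / (v_max - v_min), 0.0), 1.0)
    return [1 if random.random() < ratio else 0 for _ in range(n_neurons)]

perch_height = binarise(1.12, v_min=0.0, v_max=1.7)    # e.g. an observed H value
print(sum(perch_height) / len(perch_height))           # close to 1.12 / 1.7
```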
The optimum value of mA in those experiments is of the order of 50 for 100 neurons and of 71 for 1000 neurons. Increasing mA beyond that point makes the error alternately increase and decrease in an oscillating fashion. The minimum error itself, however, depends on the number of consequent neurons: for a granularity of 3 on the consequent (say "Low", "Medium" and "High") and for a number of experiments of the order of 100 (the same as the number of observations in this paper), the minimum is 0.00063 for 100 neurons and 0.0003 for both 1000 and 5000 neurons, with the error for 500 neurons lying between these two values. This indicates that 1000 neurons is a good choice for the work presented here, since a granularity of three has been used for the variables.

The implicit membership functions of the input variables with three linguistic terms (they are not defined explicitly; rather, they result from expression (1) above [8]) are shown in Figure 1 for the simple case where these terms correspond to counts of {0,1}, {2,3} and {4} activated inputs. It should be noticed that when, for example, mA = 71 the sets of counts are {0, 1, ..., 23}, {24, ..., 47} and {48, ..., 71}.

Figure 1. Membership functions: membership degree versus activation ratio for the terms defined by the count sets {0,1}, {2,3} and {4}.
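A minimal sketch of how the implicit membership functions of Figure 1 arise: the degree of an activation ratio p in the linguistic term associated with a set of counts is the binomial probability of observing a count in that set among the m sampled inputs. The count sets below are the ones quoted above for mA = 71.

```python
from math import comb

def membership(p, count_set, m):
    """Degree of activation ratio p in the term defined by a set of counts."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in count_set)

m = 71
low, medium, high = range(0, 24), range(24, 48), range(48, 72)
for p in (0.1, 0.5, 0.9):
    print(p, [round(membership(p, s, m), 3) for s in (low, medium, high)])
```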
3 Results

The data set resulting from the observations, and representing the teaching set, is shown in Table 1. Table 2 shows the 27 qualitative rules obtained for mA = 71 and the set of 89 observations. One rule is displayed per line, as follows: the antecedents are not shown explicitly; rather, an implicit order is assumed, with every variable starting at the "Low" term in rule 1 and the first variable changing most rapidly, and so on. The first column gives the consequent of that rule (it takes values between "0", no success, and "1", success); the second column gives the ratio of neurons that have been taught with that rule. For example, rule 1, with first column 0.422 and second column 1.000, represents the following:

If H is Low AND W is Low AND D is Low Then Success is Medium

(the rule has been taught in 100% of the neurons, and therefore with high credibility). In contrast, rule 15 has not been taught enough and therefore has very low credibility (0.12).

From this data set, and based on Table 2, the sharpest rules are the unsuccessful rule 5 and the successful rule 12 (both with high teaching ratios). They are, respectively:

If H is Medium AND W is Medium AND D is Low Then Success is Low
If H is High AND W is Low AND D is Medium Then Success is High

Figure 2 represents the evolution of some of the rules' consequents with the number of observations used for learning, beginning with 59 observations and ending with the total of 89 observations. One can notice from this evolution, and from Table 2, that some of the possible 27 rules have never been learnt, since the ratio of taught neurons is not significant (rules R16, R17, R18, R19 and R21 to R27). They represent owl behaviours that have not been observed and that are not likely to happen, assuming the set of observations is representative of the owl's behaviour.

Most of the rules that have been learnt show a good convergence rate, since the consequent values are almost completely established when only 59 observations are used. This is an important fact, since this net is equivalent to a non-parametric estimator and one is more confident in the results as the number of observations tends to infinity. There are, however, some rules (R8 and R9) whose consequent value changed after 84 observations. This means that some of the last observations include behaviour not seen before, and even contradictory with previous behaviour.

Table 1. Data set: the 89 observations used as the teaching set, each consisting of perch height (H), waiting time (W), attack distance (D) and hunting success (S).

Table 2. Qualitative rules (for each rule: consequent value, teaching ratio; NT = not taught)

R1   0.422  1.000
R2   0.595  1.000
R3   0.328  1.000
R4   0.463  0.789
R5   0.031  1.000
R6   0.876  0.565
R7   0.149  0.883
R8   0.564  1.000
R9   0.382  0.896
R10  0.975  0.247
R11  1.000  0.864
R12  1.000  0.995
R13  0.997  0.471
R14  1.000  0.471
R15  1.000  0.120
R16  NT     0.000
R17  1.000  0.005
R18  0.500  0.002
R19  1.000  0.009
R20  1.000  0.993
R21  1.000  0.005
R22  NT     0.000
R23  NT     0.000
R24  NT     0.000
R25  NT     0.000
R26  NT     0.000
R27  NT     0.000

Fig. 2. Rule results evolution: consequent values of rules 1, 4 and 8 versus the number of observations (59 to 89).
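The rule numbering of Table 2 can be decoded with the small sketch below, which applies the implicit ordering described above (the first antecedent, H, changes fastest, then W, then D, all starting at "Low"); it recovers, for example, the readings given for rules 1, 5 and 12.

```python
TERMS = ("Low", "Medium", "High")

def decode_rule(r):
    """Map a 1-based rule number (1..27) to its (H, W, D) linguistic terms."""
    i = r - 1
    return TERMS[i % 3], TERMS[(i // 3) % 3], TERMS[(i // 9) % 3]

for r in (1, 5, 12):
    h, w, d = decode_rule(r)
    print(f"R{r}: IF H is {h} AND W is {w} AND D is {d}")
# R1  -> H Low, W Low, D Low          (consequent 0.422: Success Medium)
# R5  -> H Medium, W Medium, D Low    (consequent 0.031: Success Low)
# R12 -> H High, W Low, D Medium      (consequent 1.000: Success High)
```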
4 Conclusions

The real world presents a series of problems for which qualitative analysis seems to be the most powerful tool to understand their nature and complex relations. Classical statistical tools do not have the power to express the behaviour underlying those problems in the form of rules, which could better explain those behaviours to researchers. This is particularly true in the study of animal behaviour. Here a Fuzzy Boolean Neural Net is proposed for this type of research.

The net itself is inspired by biological systems, since it uses areas of neurons (cards) associated with variables or concepts, and it uses the global activity of the neurons in the different areas (activation ratios) as the meaningful reasoning parameter. It presents other functional and topological similarities with natural systems: individual neurons have only a two-state output domain (fire / do not fire), and meshes of connections are established from neural areas to other neural areas, depending on the antecedent/consequent dependence of the related concepts, while the individual weightless links between neurons of those areas are established randomly. In addition to the intrinsic immunity to noise in individual neurons or individual connections, which makes these networks robust and almost insensitive to individual errors, these Fuzzy Boolean Nets have the capability of learning from experiments and of qualitative reasoning, giving their output in the form of qualitative rules.

In this work this Fuzzy Boolean Net has been applied to owl hunting behaviour, looking for the influence that parameters such as the perch (resting place) height (H), the waiting time on that resting place (W) and the distance of attack (D) have on the consequent variable, the success (S) of the hunting attempt. The results are expressed in two ways, one giving the rules' teaching ratios and the other giving the rules themselves. These rules allow the hunting behaviour to be understood on the basis of the available observations.

References:
[1] Gupta, M. and Qi, J., "On fuzzy neurone models", Proc. Int. Joint Conf. Neural Networks, vol. II, pp. 431-436, Seattle, 1991.
[2] Hebb, D., The Organization of Behaviour: A Neuropsychological Theory, John Wiley & Sons, 1949.
[3] Horikawa, S., Furuhashi, T. and Uchikawa, Y., "On fuzzy modelling using fuzzy neural networks with the back propagation algorithm", IEEE Transactions on Neural Networks, 3(5): 801-806, 1992.
[4] Keller, J.M. and Hunt, D.J., "Incorporating fuzzy membership functions into the perceptron algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-7(6): 693-699, 1985.
[5] Lin, Chin-Teng and Lee, C.S., Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems, Prentice Hall, New Jersey, 1996.
[6] Parzen, E., "On estimation of a probability density function and mode", Ann. Math. Stat., 33, 1065-1072, 1962.
[7] Pedrycz, W., "Fuzzy neural networks with reference neurones as pattern classifiers", IEEE Trans. Neural Networks, 3(5): 770-775, 1992.
[8] Tomé, J.A., "Neural activation ratio based fuzzy reasoning", Proc. IEEE World Congress on Computational Intelligence, Anchorage, May 1998, pp. 1217-1222.
[9] Tomé, J.A., "Counting Boolean networks are universal approximators", Proc. of the 1998 Conference of NAFIPS, Florida, August 1998, pp. 212-216.
[10] Tomé, J.A. and Carvalho, J.P., "Rule capacity in Fuzzy Boolean Networks", Proc. of the NAFIPS-FLINT 2002 International Conference, New Orleans, IEEE, 2002, pp. 124-128.
[11] Tomé, J.A., "Error convergence on Fuzzy Boolean Nets",
Internal Report, INESC, 2003.