Extracting Qualitative Rules from Observations – A Practical
Behavioural Application
José Tomé
INESC, Rua Alves Redol
nº9, 1000 Lisboa
PORTUGAL
Ricardo Tomé
Rua Principal, Lote 14C,
1º Esq, Alto dos Lombos,
Carcavelos
PORTUGAL
João Carvalho
INESC, Rua Alves Redol
nº9, 1000 Lisboa
PORTUGAL
Key-Words: Qualitative rules, Neural fuzzy, Rule extraction, Learning.
Abstract: Fuzzy Boolean Networks are capable of learning from experiments by setting internal memory elements of the neurons. They are also able to perform a kind of qualitative reasoning where the fuzzy properties emerge macroscopically from the internal structure of the neurons. Here an application is shown in the area of animal behaviour. From observations of owls' hunting habits, fuzzy rules that describe their behaviour are extracted, and the confidence in those rules is also derived.
1 Introduction
Natural or biological neural systems have a number of features that lead to their learning capability when exposed to sets of experiments from the real outside world, and they also have the capability to use the learnt knowledge to perform reasoning in an approximate way. This work presents a practical application of a very simple model [8] which exhibits the same kind of behaviour. The model can be considered a neural fuzzy model where fuzziness is an inherent emerging property, in contrast with other known models where fuzziness is artificially introduced into neural nets or where neural components are inserted into fuzzy systems [1, 3, 4, 5, 7].
The model is based on cards (or areas) of neurons, these areas being connected by meshes of weightless connections between antecedent neuron outputs and consequent neuron inputs. The individual connections are random, just as in nature, and neurons are binary, both in their inputs and their outputs. It is considered, however, that a neuron's internal unitary memories can also have a third state, with the meaning "not taught", in addition to the classical "1" and "0". The model is also robust in the sense that it is immune to individual neuron or connection errors, as in nature, which is not the case for other models, such as classic artificial neural nets.
To each area a concept or variable is associated, and the "value" of that concept or variable, when stimulated, is given by the activation ratio of that area (that is, the ratio between activated neurons, those with output "1", and the total number of neurons), just as in natural systems. Each consequent neuron basically samples each of the antecedent spaces, using a limited number (say m) of that antecedent's neuron outputs as its own inputs. This number, m, is much smaller than the number of neurons per area. This means that for rules with N antecedents and 1 consequent each neuron has N.m inputs. The operations carried out by each neuron are just the combinatorial count of the number of activated inputs from every antecedent. They are performed with the classic Boolean operations (AND, OR), using as operands the inputs coming from the antecedents and the Boolean internal state variables. These internal variables are established during a teaching phase.
It has been proved [8] that, from these micro operations (that is, on individual neurons using local and limited information), a global or macro qualitative reasoning capability on the concepts emerges, which can be expressed in the form of rules of the type:
IF Antecedent1 is A1 AND Antecedent2 is A2 AND ... THEN Consequent is Ci,
where Antecedent1, Antecedent2, ..., Consequent are variables or concepts and A1, A2, ..., Ci are linguistic terms (or fuzzy sets) of these variables (such as "small", "high", etc.).
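As an illustration of this structure, the following minimal Python sketch (the names make_area and ConsequentNeuron are ours; the paper gives no implementation) shows an area of binary neurons and a consequent neuron with m random weightless links per antecedent area and one ternary internal memory per tuple of activated-input counts:

```python
import random

def make_area(n_neurons=1000):
    """An area (card) of binary neurons; its 'value' is the area's activation ratio."""
    return [0] * n_neurons

class ConsequentNeuron:
    """Binary consequent neuron with random, weightless links (illustrative sketch)."""

    def __init__(self, antecedent_sizes, m):
        # m randomly chosen neurons sampled from each antecedent area
        self.samples = [random.sample(range(size), m) for size in antecedent_sizes]
        # one internal unitary memory per tuple of counts (i1, ..., iN);
        # a missing key plays the role of the third, "not taught", state
        self.memory = {}

    def counts(self, antecedent_areas):
        """Combinatorial count of activated sampled inputs, per antecedent."""
        return tuple(sum(area[i] for i in idx)
                     for idx, area in zip(self.samples, antecedent_areas))
```

With N antecedent areas, such a neuron has the N.m inputs mentioned above.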
The proof [8] is achieved through the interpretation given to the relationship between antecedent and consequent activation ratios, the latter being given by:
L 1
L 1
k10
kN0
 ...
N
 m  qi
 .pi .(1  pi) m  qi .pr(q1,..., qN))
( 
i  1 qiSki  qi 

(1).
In this expression L is the number of subsets of counts (each one a set of consecutive integers contained in {0, 1, ..., m}), the S_ki are those sets, q_i represents every possible number of activated inputs on antecedent i belonging to the set indexed by k_i, p_i is the activation ratio of antecedent area i, and pr(q_1, ..., q_N) are the activation ratios of the internal binary memories (measured over every neuron), which are established during the learning phase. It is supposed that each neuron has one element of binary memory (with the additional capability of memorizing whether or not it has been taught) per set of counts on the antecedent inputs (each set represented by the q_1 in S_k1 activated inputs on antecedent 1, ..., and the q_N in S_kN activated inputs on antecedent N).
If, in the expression above, the inner sum is taken to the outside, one obtains the following expression, where the counts are viewed individually and not in the sets S_ki:
\[
\sum_{i_1=0}^{m}\cdots\sum_{i_N=0}^{m}\left(\prod_{j=1}^{N}\binom{m}{i_j}\,p_j^{\,i_j}(1-p_j)^{\,m-i_j}\right)\cdot pr(i_1,\ldots,i_N) \qquad (2)
\]
Using the algebraic product and the bounded sum, respectively, as the t-norm and t-conorm fuzzy operations, it follows that, from the microscopic neural operations defined above, fuzzy qualitative reasoning emerges at the macroscopic or network level. For this purpose the equations above may be interpreted as follows [8]:
The input variables, the activation ratios pj, are fuzzified through binomial membership functions of the form $\binom{m}{i_j}\,p_j^{\,i_j}(1-p_j)^{\,m-i_j}$. The evaluation of this expression for a given pj represents the membership degree of pj in that fuzzy set.
The product of these terms, $\prod_{j=1}^{N}$, represents the fuzzy intersection of the antecedents (j = 1, ..., N), by definition of the above t-norm.
Considering the consequent fuzzy sets as singletons (amplitude "1") at the consequent universe of discourse (UD) values pr(i1, ..., iN), it follows that the equations represent the defuzzification by the Centre of Area method [5].
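As a numerical illustration of this interpretation (the function name output_ratio is ours; the treatment of untaught memories as contributing zero is an assumption of this sketch), the following Python fragment evaluates expression (2) directly: it fuzzifies each activation ratio with the binomial memberships, intersects them with the algebraic product and weights the learnt pr(i1, ..., iN) values:

```python
from math import comb
from itertools import product

def output_ratio(p, pr, m):
    """Consequent activation ratio according to expression (2) (illustrative sketch).

    p  : list of antecedent activation ratios p_j
    pr : dict mapping count tuples (i1, ..., iN) to the learnt ratios pr(i1, ..., iN)
    m  : number of sampled inputs per antecedent
    """
    total = 0.0
    for counts in product(range(m + 1), repeat=len(p)):
        weight = 1.0
        for i_j, p_j in zip(counts, p):
            # binomial (implicit) membership degree of p_j for count i_j
            weight *= comb(m, i_j) * p_j**i_j * (1 - p_j)**(m - i_j)
        # memories never taught contribute 0 here (assumption of this sketch)
        total += weight * pr.get(counts, 0.0)
    return total
```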
Learning is performed by exposing the net to experiments and modifying the internal binary memories of each consequent neuron, according to the activation of the m inputs (per antecedent) and the state of that consequent neuron. This obeys the following evolution of the consequent activation ratio, for each of the binary memories in the neurons: p(t+1) - p(t) = Pa.(pout - p(t)), where Pa is the probability of activating one of the binary memories through the antecedent configuration and pout is the probability of one consequent neuron being activated.
During the learning phase the network is activated by a collection of experiments that set or reset the individual neurons' binary memories and establish the pr(i1, ..., iN) probabilities in expression (1) above. For each experiment, a different input configuration (defined by the specific samples of the input areas) is presented to each and every consequent neuron. This configuration addresses one and only one of the internal binary memories of each individual neuron. The updating of each binary memory value depends on whether or not it is selected and on the logic value of the consequent neuron. This may be considered a Hebbian type of learning [2] if pre- and post-synaptic activities are, in the present model, given by the activation ratios: pj for antecedent area j and pout for the consequent area. For each neuron, the m+1 different counts are the meaningful parameters to take into account for the pre-synaptic activity of one antecedent. Thus, in a given experiment, the correlation between post-synaptic activity (pout) and pre-synaptic activity (the probability of activating a given decoder output d(i1, ..., iN), corresponding to i1 activated inputs on antecedent area 1, ..., iN activated inputs on area N) can be represented by the probability of activating the different binary unitary memories. In practical terms, for each teaching experiment and for each consequent neuron, the state of binary memory ff(i1, ..., iN) is determined by, and only by, the Boolean values of the decoder output d(i1, ..., iN) and of the output neuron state considered.
Considering then the pr(i1, ..., iN) in expression (2) as the synaptic strengths, one may have different learning types, depending on how they are updated. Here, one is considering the interesting case where non-selected binary memories maintain their state and selected binary memories take the value of the consequent neuron, which corresponds to a kind of Grossberg-based learning. It corresponds to the following updating equation (where indexes are not represented and p is used in place of pr, for simplicity), and where Pa is the probability of activating, in the experiment, the decoder output associated with p:
p(t+1) - p(t) = Pa.(pout - p(t))
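Continuing the earlier illustrative sketch, one teaching experiment for a single consequent neuron could be written as follows (teach_neuron is our name): the count tuple addresses exactly one internal memory, which simply takes the Boolean value of the consequent neuron, while all other memories keep their state.

```python
def teach_neuron(neuron, antecedent_areas, consequent_value):
    """One teaching experiment for one consequent neuron (illustrative sketch).

    The decoder output selected by the counts (i1, ..., iN) is the only memory
    updated; it takes the Boolean state of the consequent neuron, while all
    non-selected memories keep their previous state.
    """
    selected = neuron.counts(antecedent_areas)   # the decoder output d(i1, ..., iN)
    neuron.memory[selected] = consequent_value   # 0 or 1
```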
To prove that the network converges to a taught rule, consider any initial p different from Pout, a given antecedent configuration with probability Pa, and consider Pout as the consequent activation ratio to be taught. Take p(t+2) - p(t+1) and consider Pa and Pout the same for all experiments:
p(t+2) - p(t+1) = Pa.(Pout - p(t) - Pa.(Pout - p(t)))
= (p(t+1) - p(t)) - (Pa)^2.(Pout - p(t))
It is a simple matter to verify that |p(t+2) - p(t+1)| < |p(t+1) - p(t)| for any t.
Thus it may be concluded that with a set of coherent experiments (teaching the same rule) the net converges. It will establish p with the same value as Pout, and in each experiment it will approach that value Pout proportionally to the distance between the present value of p and Pout itself, that is, with decreasing steps approaching zero.
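The decreasing steps can be illustrated numerically. The short sketch below (the name converge and the example values are ours) iterates the updating equation and shows p approaching Pout with steps shrinking by the factor (1 - Pa):

```python
def converge(p0, pa, pout, steps=8):
    """Iterate p(t+1) = p(t) + Pa*(Pout - p(t)) to illustrate the convergence."""
    p = p0
    for t in range(steps):
        step = pa * (pout - p)
        p += step
        print(f"t={t + 1}  p={p:.4f}  step={step:+.4f}")
    return p

# p approaches Pout = 0.8 with geometrically shrinking steps (factor 1 - Pa)
converge(p0=0.2, pa=0.3, pout=0.8)
```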
Moreover, it has been proved [10] that the net is capable of learning a set of different rules without cross-influence between them, and that the number of distinct rules that the system can effectively distinguish (in terms of different consequent terms) increases with the square root of the number m. This is an interesting result that should be verifiable in animal brains: according to this simple model, the number of synapses coming from a given concept area is expected to be at least of the order of the square of the number of different linguistic terms on the consequent.
Finally, it has also been proved that this model is a Universal Approximator [9], since it theoretically implements a Parzen window estimator [6]. This means that these networks are capable of implementing (approximating) any possible multi-input single-output function of the type [0,1]^N -> [0,1].
These results give the theoretical background establishing the capability of these simple binary networks to perform reasoning and effective learning based on real experiments. Since the nets present topological similarities with natural systems and also present some of their important properties, it may be hypothesized that they constitute a simple model for natural reasoning and learning. Also, as the emergent reasoning may be interpreted as fuzzy reasoning, it may also be hypothesized that natural reasoning is intrinsically fuzzy.
2 Behavioural Application
The practical problem one wants to address in this paper is related to the hunting behaviour of owls; specifically, one wants to derive which qualitative rules can be extracted from a representative set of observations.
The monitored antecedent variables are the perch (resting place) height (H), the waiting time on that resting place (W) and the distance of attack (D). The consequent variable is the success (S) of the hunting attempt.
In order to use the Boolean fuzzy net it is necessary to map the continuous values of the antecedents onto the binary inputs the net expects. Since the binary nature of the antecedent inputs results from the fact that they come from an area of neurons where the respective variable value is the activation ratio, one has to define the interval limits for each variable and to find the ratio between the value of each continuous observation and the interval. Then a random assignment of "1"s and "0"s is made to every antecedent neuron in order to obtain the desired antecedent ratio. Any Monte Carlo type method can be used for this purpose. That is, for each antecedent Ak one gets:
Oij = Rand(Vij/(Max-Min)) ;
where Oij is the output of neuron i in experiment j, Vij is the actual continuous value of variable k in the same experiment, and Rand(x) is a function that returns "1" with probability x and "0" otherwise.
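A sketch of this Monte Carlo assignment is given below (the helper name binarize, the clamping to [0, 1] and the example interval limits are ours; the paper only gives the Rand expression above):

```python
import random

def binarize(value, vmin, vmax, n_neurons=1000):
    """Randomly assign '1's so the area's activation ratio approximates the
    normalized observation, following Oij = Rand(Vij/(Max-Min)) above."""
    ratio = value / (vmax - vmin)
    ratio = min(max(ratio, 0.0), 1.0)   # clamp to [0, 1]; an addition of this sketch
    return [1 if random.random() < ratio else 0 for _ in range(n_neurons)]

# e.g. a perch height of 1.12 with assumed interval limits [0, 2] gives a ratio of 0.56
area_H = binarize(1.12, 0.0, 2.0)
print(sum(area_H) / len(area_H))
```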
The theoretical conditions for convergence [10] state that the number of inputs per consequent neuron and per antecedent should be of the order of the square of the number of consequent linguistic terms that must be differentiated. These conditions are achieved when the number of experiments (observations in this case) goes to infinity. The theoretical results obtained also prove that the results of the Fuzzy Boolean Nets are equivalent to those achieved with non-parametric estimation by Parzen windows [6], which implies that every data experiment is supposed to belong to a distribution with a given probability density. This same conjecture is made for proving the relation between the number of inputs per consequent neuron and the granularity. Thus one assumes that the observations obtained comply with these assumptions.
The behaviour of the Boolean Fuzzy Net in terms of the quadratic error has also been studied [11], using the number of neurons and the number of inputs per consequent neuron and per antecedent (mA) as parameters. In that study, experimental values for antecedent and consequent were artificially generated from normally distributed variables with known averages and variances. It has been observed that the error diminishes with the number mA until a minimum level is reached, depending on the number of neurons. This optimum value for mA, in those experiments, is of the order of 50 for 100 neurons and of 71 for 1000 neurons. Further increasing mA from that point on makes the error increase and decrease in an oscillating way. This minimum error, however, depends on the number of consequent neurons; for a granularity of 3 on the consequent (say "Low", "Medium" and "High") and for a number of experiments of the order of 100 (the same as the number of observations in this paper) this minimum is 0.00063 for 100 neurons and 0.0003 for both 1000 and 5000 neurons, with the error for 500 neurons lying between these two values. This indicates that 1000 neurons is a good choice for the work presented here, since a granularity of three has been used for the variables.
The implicit membership functions of the input variables (they are not defined explicitly; rather, they result from expression (1) above [8]) with three linguistic terms are shown in Figure 1, for the simple case where these terms correspond to counts of {0,1}, {2,3} and {4} activated inputs. It should be noticed that when, for example, mA = 71, the sets of counts are {0, 1, ..., 23}, {24, ..., 47} and {48, ..., 71}.
Figure 1. Membership functions (plotted against the activation ratio)
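The implicit membership functions of Figure 1 can be reproduced numerically. The sketch below (set_membership is our name) sums the binomial terms over a set of counts, as in expression (1), for the simple case m = 4 with the terms {0,1}, {2,3} and {4}:

```python
from math import comb

def set_membership(p, count_set, m):
    """Implicit membership degree of activation ratio p for a set of counts
    (sum of binomial terms), as used in expression (1)."""
    return sum(comb(m, q) * p**q * (1 - p)**(m - q) for q in count_set)

# Figure 1's simple case: m = 4 with terms {0,1}, {2,3} and {4}
for p in (0.1, 0.5, 0.9):
    low, med, high = (set_membership(p, s, 4)
                      for s in (range(0, 2), range(2, 4), range(4, 5)))
    print(f"p={p:.1f}  Low={low:.3f}  Medium={med:.3f}  High={high:.3f}")
```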
3 Results
The data set resulting from the observations, and representing the teaching set, is shown in Table 1. Table 2 shows the 27 qualitative rules obtained for mA = 71 and for a set of 89 observations. One rule per line is displayed, in the following way: the antecedents are not explicitly shown; rather, an implicit order is assumed, with every variable beginning at the "Low" term in rule 1 and with the first variable changing most rapidly, and so on. Moreover, the first column represents the consequent corresponding to the rule of that line (it takes values between "0", no success, and "1", success); the second column represents the ratio of neurons that have been taught with that rule. For example, rule 1, with value 0.422 in the first column and 1.000 in the second column, represents the following: If H is Low AND W is Low AND D is Low Then Success is Medium (the rule has been taught in 100% of the neurons, hence with high credibility).
In contrast, rule 15 has not been taught enough and therefore has very low credibility (0.12).
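The implicit ordering can be decoded as in the following sketch (rule_antecedents is our name; it assumes the variable order H, W, D, with the first variable changing most rapidly, as described above):

```python
TERMS = ("Low", "Medium", "High")

def rule_antecedents(rule_number):
    """Recover the (H, W, D) linguistic terms from the implicit ordering of Table 2."""
    idx = rule_number - 1
    h, w, d = idx % 3, (idx // 3) % 3, idx // 9
    return TERMS[h], TERMS[w], TERMS[d]

print(rule_antecedents(1))   # ('Low', 'Low', 'Low')
print(rule_antecedents(12))  # ('High', 'Low', 'Medium')
```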
From this data set and from the obtained Table 2, one should deduce that the sharpest rules are the unsuccessful rule 5 and the successful rule 12 (both with a high teaching ratio). They are, respectively:
If H is Medium AND W is Medium AND D is Low Then Success is Low
If H is High AND W is Low AND D is Medium Then Success is High
Figure 2 represents the evolution of some of the rules' consequents with the number of observations used for learning, beginning with 59 observations and ending with the total of 89 observations. One can notice from this evolution and from Table 2 that some of the possible 27 rules have never been learnt, since the ratio of taught neurons is not significant (rules R16, R17, R18, R19 and R21 to R27). They represent owl behaviours that have not been observed and that are not likely to happen, considering the set of observations representative of the owl's behaviour.
Table 1. Data Set: the 89 observations used for teaching, each consisting of the perch height H, the waiting time W, the attack distance D and the hunting success S (0 = no success, 1 = success).
Table 2. Qualitative Rules (per rule: consequent value, teaching ratio; NT = not taught)
R1  0.422 1.000
R2  0.595 1.000
R3  0.328 1.000
R4  0.463 0.789
R5  0.031 1.000
R6  0.876 0.565
R7  0.149 0.883
R8  0.564 1.000
R9  0.382 0.896
R10 0.975 0.247
R11 1.000 0.864
R12 1.000 0.995
R13 0.997 0.471
R14 1.000 0.471
R15 1.000 0.120
R16 NT    0.000
R17 1.000 0.005
R18 0.500 0.002
R19 1.000 0.009
R20 1.000 0.993
R21 1.000 0.005
R22 NT    0.000
R23 NT    0.000
R24 NT    0.000
R25 NT    0.000
R26 NT    0.000
R27 NT    0.000
Most of the rules that have been learnt show a good convergence rate, since the values for the consequent are almost completely established when 59 observations are used. This is an important fact, since this net is equivalent to a non-parametric estimation and one is more confident in the results when the number of observations tends to infinity. There are, however, some rules (in this case R8 and R9) where the value of the consequent has changed after 84 observations. This means that some of these last observations include behaviour not seen before and even contradictory with previous behaviour.
Fig. 2. Rule results evolution: consequent value versus the number of observations used for learning (59 to 89), for rules 1, 4 and 8
4 Conclusions
The real world presents a series of problems where qualitative analysis seems to be the most powerful tool for understanding their nature and complex relations. Classical statistical tools do not have the power to express the behaviour implied in those problems in the form of rules, which could better explain those behaviours to researchers. This is particularly true in the study of animal behaviour. Here a Fuzzy Boolean Neural Net is proposed for this type of research. The net itself is based on biological systems, since it uses areas of neurons (cards) associated with variables or concepts and it uses the global activity among the neurons of the different areas (activation ratios) as the meaningful reasoning parameter. These nets present other functional and topological similarities with natural systems: individual neurons have only a two-state domain (fire/do not fire), and meshes of connections are established from one neural area to another, depending on the antecedent/consequent dependence of the related concepts, while the individual weightless links are established randomly between the neurons of those areas.
In addition to the intrinsic immunity to noise in individual neurons or individual connections, which makes these networks robust and almost insensitive to individual errors, these Fuzzy Boolean Nets present the capability of learning from experiments and also a qualitative reasoning feature, giving their output in the form of qualitative rules.
In this work this Fuzzy Boolean Net has been used in an application to owls' hunting behaviour, looking for the influence that parameters such as the perch (resting place) height (H), the waiting time on that resting place (W) and the distance of attack (D) have on the consequent variable, which is the success (S) of the hunting attempt.
The results are expressed in two ways, the first giving the rules' teaching ratios and the second giving the rules themselves. These rules allow the understanding of the hunting behaviour based on the available observations.
References:
[1] Gupta, M. and Qi, "On fuzzy neurone models", Proc. Int. Joint Conf. Neural Networks, vol. II, pp. 431-436, Seattle, 1991.
[2] Hebb, D., The Organization of Behaviour: A Neuropsychological Theory, John Wiley & Sons, 1949.
[3] Horikawa, S., Furuhashi, T. and Uchikawa, Y., "On fuzzy modelling using fuzzy neural networks with the back propagation algorithm", IEEE Transactions on Neural Networks, 3(5), pp. 801-806, 1992.
[4] Keller, J.M. and Hunt, D.J., "Incorporating fuzzy membership functions into the perceptron algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-7(6), pp. 693-699, 1985.
[5] Lin, C.-T. and Lee, C.S., A Neuro-Fuzzy Synergism to Intelligent Systems, Prentice Hall, New Jersey, 1996.
[6] Parzen, E., "On estimation of a probability density function and mode", Ann. Math. Stat., 33, pp. 1065-1072, 1962.
[7] Pedrycz, W., "Fuzzy neural networks with reference neurones as pattern classifiers", IEEE Trans. Neural Networks, 3(5), pp. 770-775, 1992.
[8] Tomé, J.A., "Neural activation ratio based fuzzy reasoning", Proc. IEEE World Congress on Computational Intelligence, Anchorage, May 1998, pp. 1217-1222.
[9] Tomé, J.A., "Counting Boolean networks are universal approximators", Proc. of the 1998 Conference of NAFIPS, Florida, August 1998, pp. 212-216.
[10] Tomé, J.A. and Carvalho, J.P., "Rule capacity in Fuzzy Boolean Networks", Proc. of the NAFIPS-FLINT 2002 International Conference, New Orleans, IEEE, 2002, pp. 124-128.
[11] Tomé, J.A., "Error convergence on Fuzzy Boolean Nets", Internal Report, INESC, 2003.