On Rules Discovery from Incomplete Information Systems

Agnieszka Dardzinska*,+ and Zbigniew W. Ras+,#
*) Bialystok Technical Univ., Dept. of Mathematics, ul. Wiejska 45A, 15-351, Bialystok, Poland
+) UNC-Charlotte, Department of Computer Science, Charlotte, N.C. 28223
#) Polish Academy of Sciences, Institute of Computer Science, Ordona 21, Warsaw, Poland
adardzin@uncc.edu and ras@uncc.edu

Abstract: In this paper we present a new rule-discovery method for incomplete Information Systems (IS), which are generalizations of the information systems introduced by Pawlak [4]. The proposed method has some similarities with the system LERS [1]. It is a bottom-up strategy which first generates sets of weighted objects having one-value properties only. Some pairs of these generated sets are used for constructing rules and are marked as successful. All pairs whose number of supporting objects falls below some threshold value are marked as well. All remaining (unmarked) pairs are used to construct new sets of weighted objects having two-value properties. Some of them are in a supporting relationship with sets of weighted objects having one-value properties. By a supporting relationship we mean a relationship identifying a rule which can be extracted from our incomplete IS. This process is continued recursively by moving to sets of weighted objects having k-value properties, where k ≥ 2. Our strategy can also be used for chasing incomplete information systems with rules discovered from them. This way, our system can be seen as a new tool for handling incompleteness in IS.

1. Introduction

There are a number of strategies which allow us to find rules describing decision attributes in terms of classification attributes. For instance, we can mention here such systems as LERS [1], AQ18, Rosetta [3], and C4.5 [5].
In spite of the fact that the number of rule discovery methods is still increasing, most of them are developed under the assumption that the information about objects in the information system is either precisely known or not known at all. This implies that either one value of an attribute is assigned to an object as its property or no value is assigned to it (instead of no value we use the term null value). The problem of inducing rules from systems with incomplete attribute values represented as sets of possible values was discussed, for instance, by Kryszkiewicz and Rybinski [2] and by Ras and Joshi [6].

In this paper, we present a new strategy for discovering rules in incomplete information systems. The constraints placed on what values of attributes can be assigned to objects are as weak as possible. Quite often the user's knowledge about objects is not exact, and even asking the user to assign a set of possible attribute values to an object is still too strong a requirement to be forced as a constraint. In our system, we allow not only sets of attribute values to be used as values of an object, but also a confidence to be assigned to each of these attribute values. For instance, by assigning the value {(brown, 1/3), (black, 2/3)} of the attribute Color of Hair to an object x, we say that the confidence that object x has brown hair is 1/3, whereas the confidence that x has black hair is 2/3.

2. Discovering Rules in Incomplete IS

We start with a definition of an incomplete information system which is a generalization of an information system introduced by Pawlak [4]. By an incomplete Information System (IS) we mean S = (X, A, V), where X is a finite set of objects, A is a finite set of attributes, and V = ∪{Va : a ∈ A} is a finite set of values of attributes. The set Va is the domain of attribute a. We assume that for each attribute a ∈ A and x ∈ X,

a(x) = {(ai, pi) : i ∈ Ja(x)}, where (∀i ∈ Ja(x))[ai ∈ Va] and Σ{pi : i ∈ Ja(x)} = 1.
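The constraint on a(x) defined above can be sketched in code; the snippet below is a minimal illustration (our own code and names, not part of the paper), using the Color of Hair example from the text:

```python
from math import isclose

def is_valid_value(weights, domain):
    """Check the constraint on a(x): every value a_i lies in Va
    and the confidences p_i sum to 1."""
    return all(v in domain for v in weights) and isclose(sum(weights.values()), 1.0)

# The example from the text: a(x) = {(brown, 1/3), (black, 2/3)}.
a_of_x = {"brown": 1/3, "black": 2/3}
print(is_valid_value(a_of_x, {"brown", "black", "blond"}))  # True
```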
A null value assigned to an object is interpreted as all possible values of an attribute with equal confidence assigned to each of them. Figure 1 gives an example of an incomplete information system S = ({x1, x2, x3, x4, x5, x6, x7, x8}, {a, b, c, d, e}, V). Clearly, a(x1) = {(a1,1/3), (a2,2/3)}, a(x2) = {(a2,1/4), (a3,3/4)}, and so on.

X  | a                 | b                 | c                 | d    | e
x1 | (a1,1/3),(a2,2/3) | (b1,2/3),(b2,1/3) | c1                | d1   | (e1,1/2),(e2,1/2)
x2 | (a2,1/4),(a3,3/4) | (b1,1/3),(b2,2/3) | null              | d2   | e1
x3 | a1                | b2                | (c1,1/2),(c3,1/2) | d2   | e3
x4 | a3                | null              | c2                | d1   | (e1,2/3),(e2,1/3)
x5 | (a1,2/3),(a2,1/3) | b1                | c2                | null | e1
x6 | a2                | b2                | c3                | d2   | (e2,1/3),(e3,2/3)
x7 | a2                | (b1,1/4),(b2,3/4) | (c1,1/3),(c2,2/3) | d2   | e2
x8 | a3                | b2                | c1                | d1   | e3

Figure 1.

Let us try to extract rules from S describing attribute e in terms of attributes {a, b, c, d}, following a strategy similar to LERS [1]. So, we start by identifying the sets of objects in X having properties a1, a2, a3, b1, b2, c1, c2, c3, d1, d2, and next we find their relationships with the sets of objects in X having properties e1, e2, e3.

Let us start with a1*, which is equal to {(x1,1/3), (x3,1), (x5,2/3)}. The justification is quite simple. Only these 3 objects may have that property. Object x3 has property a1 for sure. The confidence that x1 has property a1 is 1/3, since (a1,1/3) ∈ a(x1). In a similar way we justify the property a1 for object x5.

So, as far as the values of the classification attributes are concerned, we get:

a1* = {(x1,1/3), (x3,1), (x5,2/3)},
a2* = {(x1,2/3), (x2,1/4), (x5,1/3), (x6,1), (x7,1)},
a3* = {(x2,3/4), (x4,1), (x8,1)},
b1* = {(x1,2/3), (x2,1/3), (x4,1/2), (x5,1), (x7,1/4)},
b2* = {(x1,1/3), (x2,2/3), (x3,1), (x4,1/2), (x6,1), (x7,3/4), (x8,1)},
c1* = {(x1,1), (x2,1/3), (x3,1/2), (x7,1/3), (x8,1)},
c2* = {(x2,1/3), (x4,1), (x5,1), (x7,2/3)},
c3* = {(x2,1/3), (x3,1/2), (x6,1)},
d1* = {(x1,1), (x4,1), (x5,1/2), (x8,1)},
d2* = {(x2,1), (x3,1), (x5,1/2), (x6,1), (x7,1)}.
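The sets v* listed above can be computed mechanically from the table. The sketch below (our own code, not the paper's) computes b1*, including the treatment of the null value of b for x4:

```python
from fractions import Fraction as F

# A cell is a dict {value: confidence}; None stands for a null value,
# interpreted as all domain values with equal confidence.
domain_b = ["b1", "b2"]
b_column = {
    "x1": {"b1": F(2, 3), "b2": F(1, 3)},
    "x2": {"b1": F(1, 3), "b2": F(2, 3)},
    "x3": {"b2": F(1)},
    "x4": None,                      # null -> {b1: 1/2, b2: 1/2}
    "x5": {"b1": F(1)},
    "x6": {"b2": F(1)},
    "x7": {"b1": F(1, 4), "b2": F(3, 4)},
    "x8": {"b2": F(1)},
}

def star(column, domain, value):
    """v* = all objects that may have `value`, with their confidences."""
    out = {}
    for x, cell in column.items():
        if cell is None:                                  # null value
            cell = {v: F(1, len(domain)) for v in domain}
        if value in cell:
            out[x] = cell[value]
    return out

# b1* = {x1: 2/3, x2: 1/3, x4: 1/2, x5: 1, x7: 1/4}, as listed above.
print(star(b_column, domain_b, "b1"))
```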
For the values of the decision attribute we get:

e1* = {(x1,1/2), (x2,1), (x4,2/3), (x5,1)},
e2* = {(x1,1/2), (x4,1/3), (x6,1/3), (x7,1)},
e3* = {(x3,1), (x6,2/3), (x8,1)}.

The next step is to propose a method for checking a relationship between the classification attributes and the decision attribute. For any two sets ci* = {(xi, pi)}i∈N, ej* = {(yj, qj)}j∈M, where pi ≠ 0 and qj ≠ 0, we propose that:

{(xi, pi)}i∈N ⊆ {(yj, qj)}j∈M iff the support of the rule [ci → ej] is above some threshold value.

So, the relationship between ci* and ej* depends only on how high the support of the corresponding rule [ci → ej] is.

Coming back to our example, let us notice that we may have a successful set-theoretical inclusion between objects from a1* and objects from e3*, but the set-theoretical inclusion between objects from a1* and objects from e1* may fail. The justification of this claim is the following: Object x3 is a member of a1* but it does not belong to e1*. At the same time, the other set-theoretical inclusion is feasible because it is not certain that x1 and x5 belong to a1*.

Assuming that the relationship a1* ⊆ e3* is successful, what can we say about the support and confidence of the rule a1 → e3? Object x1 supports a1 with confidence 1/3 and it supports e3 with confidence 0. Object x3 supports a1 with confidence 1 and it supports e3 with confidence 1. Object x5 supports a1 with confidence 2/3 and it supports e3 with confidence 0. We calculate the support of the rule [a1 → e3] as: (1/3)·0 + 1·1 + (2/3)·0 = 1. Similarly, the support of the term a1 is 1/3 + 1 + 2/3 = 2. So, the confidence of the above rule will be 1/2, if we follow the standard strategy for calculating the confidence of a rule. Now, let us follow a strategy which has some similarity with LERS.
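The support and confidence computation described above can be written as a short sketch (our own helper functions, not the paper's code), with weighted sets represented as dicts {object: confidence}:

```python
from fractions import Fraction as F

def support(t_star, d_star):
    """sup(t -> d): sum over all objects of p_i * q_i (q_i = 0 if absent)."""
    return sum(p * d_star.get(x, F(0)) for x, p in t_star.items())

def confidence(t_star, d_star):
    """conf(t -> d) = sup(t -> d) / sup(t), the standard strategy."""
    return support(t_star, d_star) / sum(t_star.values())

a1_star = {"x1": F(1, 3), "x3": F(1), "x5": F(2, 3)}
e3_star = {"x3": F(1), "x6": F(2, 3), "x8": F(1)}

print(support(a1_star, e3_star))     # 1  =  (1/3)*0 + 1*1 + (2/3)*0
print(confidence(a1_star, e3_star))  # 1/2
```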
All inclusions [⊆] will be automatically marked, assuming that the confidence and support of the corresponding rules are both above some threshold values (which are given by the user). Assume again that ci* = {(xi, pi)}i∈N and ej* = {(yj, qj)}j∈M. The algorithm first checks the support of the rule ci → ej. If the support is below the threshold value, then the corresponding relationship {(xi, pi)}i∈N ⊆ {(yj, qj)}j∈M does not hold and it is not considered in later steps. Otherwise, the confidence of the rule ci → ej is checked. If that confidence is above or equal to the assumed threshold value, the rule is approved and the corresponding relationship {(xi, pi)}i∈N ⊆ {(yj, qj)}j∈M is marked. Otherwise, this relationship remains unmarked. In the current example we take Msup = 1 as the threshold value for the support and Mconf = 0.5 as the threshold value for the confidence.
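The three-way marking just described (marked negative, marked positive, or left unmarked) can be sketched as follows, using the thresholds Msup and Mconf from the text (the code itself is ours, not the paper's):

```python
from fractions import Fraction as F

def mark(t_star, d_star, Msup, Mconf):
    sup = sum(p * d_star.get(x, F(0)) for x, p in t_star.items())
    if sup < Msup:
        return "marked negative"    # dropped from all later steps
    conf = sup / sum(t_star.values())
    if conf >= Mconf:
        return "marked positive"    # the rule is output
    return "not marked"             # may still be extended to longer terms

a1_star = {"x1": F(1, 3), "x3": F(1), "x5": F(2, 3)}
e1_star = {"x1": F(1, 2), "x2": F(1), "x4": F(2, 3), "x5": F(1)}
e3_star = {"x3": F(1), "x6": F(2, 3), "x8": F(1)}

print(mark(a1_star, e1_star, 1, F(1, 2)))  # marked negative (sup = 5/6 < 1)
print(mark(a1_star, e3_star, 1, F(1, 2)))  # marked positive
```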
a1* ⊆ e1* (sup = 5/6 < 1) – marked negative
a1* ⊆ e2* (sup = 1/6 < 1) – marked negative
a1* ⊆ e3* (sup = 1 ≥ 1 and conf = 0.5 ≥ 0.5) – marked positive
a2* ⊆ e1* (sup = 11/12 < 1) – marked negative
a2* ⊆ e2* (sup = 5/3 ≥ 1 and conf = 0.51 ≥ 0.5) – marked positive
a2* ⊆ e3* (sup = 2/3 < 1) – marked negative
a3* ⊆ e1* (sup = 17/12 ≥ 1 and conf = 0.51 ≥ 0.5) – marked positive
a3* ⊆ e2* (sup = 1/3 < 1) – marked negative
a3* ⊆ e3* (sup = 1 ≥ 1 but conf = 0.36 < 0.5) – not marked
b1* ⊆ e1* (sup = 2 ≥ 1 and conf = 0.72 ≥ 0.5) – marked positive
b1* ⊆ e2* (sup = 3/4 < 1) – marked negative
b1* ⊆ e3* (sup = 0 < 1) – marked negative
b2* ⊆ e1* (sup = 7/6 ≥ 1 but conf = 0.22 < 0.5) – not marked
b2* ⊆ e2* (sup = 17/12 ≥ 1 but conf = 0.27 < 0.5) – not marked
b2* ⊆ e3* (sup = 8/3 ≥ 1 and conf = 0.51 ≥ 0.5) – marked positive
c1* ⊆ e1* (sup = 5/6 < 1) – marked negative
c1* ⊆ e2* (sup = 5/6 < 1) – marked negative
c1* ⊆ e3* (sup = 3/2 ≥ 1 but conf = 0.47 < 0.5) – not marked
c2* ⊆ e1* (sup = 2 ≥ 1 and conf = 0.66 ≥ 0.5) – marked positive
c2* ⊆ e2* (sup = 1 ≥ 1 but conf = 0.33 < 0.5) – not marked
c2* ⊆ e3* (sup = 0 < 1) – marked negative
c3* ⊆ e1* (sup = 1/3 < 1) – marked negative
c3* ⊆ e2* (sup = 1/3 < 1) – marked negative
c3* ⊆ e3* (sup = 7/6 ≥ 1 and conf = 0.64 ≥ 0.5) – marked positive
d1* ⊆ e1* (sup = 5/3 ≥ 1 but conf = 0.48 < 0.5) – not marked
d1* ⊆ e2* (sup = 5/6 < 1) – marked negative
d1* ⊆ e3* (sup = 1 ≥ 1 but conf = 0.28 < 0.5) – not marked
d2* ⊆ e1* (sup = 3/2 ≥ 1 but conf = 0.33 < 0.5) – not marked
d2* ⊆ e2* (sup = 1/3 < 1) – marked negative
d2* ⊆ e3* (sup = 5/3 ≥ 1 but conf = 0.37 < 0.5) – not marked

The next step is to build terms of length 2 from the pairs that remained unmarked (support ≥ 1 but confidence < 0.5). We propose the following definition for concatenating any two sets ci* = {(xi, pi)}i∈N, cj* = {(xj, qj)}j∈M, where K = M ∩ N:

(ci · cj)* = {(xi, pi·qi)}i∈K.
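The concatenation defined above intersects the object sets and multiplies the confidences; a minimal sketch (our own code, not the paper's):

```python
from fractions import Fraction as F

def concat(t1_star, t2_star):
    """(t1 t2)*: keep objects common to both sets, multiply confidences."""
    return {x: p * t2_star[x] for x, p in t1_star.items() if x in t2_star}

a3_star = {"x2": F(3, 4), "x4": F(1), "x8": F(1)}
c1_star = {"x1": F(1), "x2": F(1, 3), "x3": F(1, 2), "x7": F(1, 3), "x8": F(1)}

# (a3 c1)* = {x2: 1/4, x8: 1}, the set used in the next step of the example.
print(concat(a3_star, c1_star))
```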
Following this definition, we get:

(a3 · c1)* ⊆ e3* (sup = 1 ≥ 1 and conf = 0.8 ≥ 0.5) – marked positive
(a3 · d1)* ⊆ e3* (sup = 1 ≥ 1 and conf = 0.5 ≥ 0.5) – marked positive
(a3 · d2)* ⊆ e3* (sup = 0 < 1) – marked negative
(b2 · d2)* ⊆ e1* (sup = 2/3 < 1) – marked negative
(b2 · c2)* ⊆ e2* (sup = 2/3 < 1) – marked negative
(c1 · d1)* ⊆ e3* (sup = 1 ≥ 1 and conf = 0.5 ≥ 0.5) – marked positive
(c1 · d2)* ⊆ e3* (sup = 1/2 < 1) – marked negative

3. Algorithm ERID (Extracting Rules from Incomplete Decision Systems)

We are ready to present the algorithm for extracting rules from an incomplete information system. To simplify the presentation, new notation (used only for the purpose of this algorithm) will be introduced. So, let us assume that S = (X, A ∪ {d}, V) is a decision system, where X is a set of objects, A = {a[i] : 1 ≤ i ≤ I} is a set of classification attributes, and Vi = {a[i,j] : 1 ≤ j ≤ J(i)} is the set of values of attribute a[i], i ≤ I. We also assume that d is a decision attribute, where Vd = {d[m] : 1 ≤ m ≤ M}. Finally, we assume that a(i1,j1) ∧ a(i2,j2) ∧ ... ∧ a(ir,jr) is denoted by the term [a(ik,jk)]k∈{1,2,...,r}, where all i1, i2, ..., ir are distinct integers and jp ≤ J(ip), 1 ≤ p ≤ r.

Algorithm for Extracting Rules from an Incomplete Decision System S.

Algorithm ERID(S, λ1, λ2)
S – incomplete decision system, λ1 – minimum support threshold, λ2 – minimum confidence threshold.
i := 1
while i ≤ I do
begin
  j := 1; m := 1
  while j ≤ J(i) do
  begin
    if sup(a[i,j] → d(m)) < λ1
      then mark[(a[i,j]*, d(m)*)] negative;
    if sup(a[i,j] → d(m)) ≥ λ1 and conf(a[i,j] → d(m)) ≥ λ2
      then begin
             mark[(a[i,j]*, d(m)*)] positive;
             output [a[i,j] → d(m)]
           end;
    j := j + 1
  end;
  i := i + 1
end;

Ik := {ik};  /* ik – an index randomly chosen from {1, 2, ..., I} */
for all jk ∈ J(ik) do [a(ik,jk)]ik∈Ik := a(ik,jk);
for all i, j such that both rules [a(ik,jk)]ik∈Ik → d(m) and a[i,j] → d(m) are not marked and i ∉ Ik do
begin
  if sup([a(ik,jk)]ik∈Ik ∧ a[i,j] → d(m)) < λ1
    then mark[([a(ik,jk)]ik∈Ik ∧ a[i,j])*, d(m)*] negative;
  if sup([a(ik,jk)]ik∈Ik ∧ a[i,j] → d(m)) ≥ λ1 and conf([a(ik,jk)]ik∈Ik ∧ a[i,j] → d(m)) ≥ λ2
    then begin
           mark[([a(ik,jk)]ik∈Ik ∧ a[i,j])*, d(m)*] positive;
           output [[a(ik,jk)]ik∈Ik ∧ a[i,j] → d(m)]
         end
    else begin
           Ik := Ik ∪ {i};
           [a(ik,jk)]ik∈Ik := [a(ik,jk)]ik∈Ik ∧ a[i,j]
         end
end

The complexity of this algorithm is similar to the complexity of LERS.

4. Conclusion

Missing data values cause problems both in knowledge extraction from information systems and in query answering systems. Quite often missing data are partially specified, which means that the missing value is an intermediate value between the unknown value and the completely specified value. Clearly, there are a number of intermediate values possible for a given attribute. Algorithm ERID presented in this paper can be used as a new tool for chasing values in incomplete information systems with rules discovered from them. We can keep repeating this strategy till a fixed point is reached. To our knowledge, no comparable tools are available for incomplete information systems of the type presented in this paper.

5. References

[1] Grzymala-Busse, J., "A new version of the rule induction system LERS", Fundamenta Informaticae, Vol. 31, No. 1, 1997, 27-39
[2] Kryszkiewicz, M., Rybinski, H., "Reducing information systems with uncertain attributes", Proceedings of ISMIS'96, LNCS/LNAI, Springer-Verlag, Vol. 1079, 1996, 285-294
[3] Ohrn, A., Komorowski, J., Skowron, A., Synak, P., "A software system for rough data analysis", Bulletin of the International Rough Sets Society, Vol. 1, No. 2, 1997, 58-59
[4] Pawlak, Z., "Rough Sets: Theoretical Aspects of Reasoning about Data", Kluwer, Dordrecht, 1991
[5] Quinlan, J.R., "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA, 1993
[6] Ras, Z.W., Joshi, S., "Query approximate answering system for an incomplete DKBS", Fundamenta Informaticae, IOS Press, Vol. 30, No. 3/4, 1997, 313-324