Normal Symbolic Form

Marc CSERNEL
Paris IX Dauphine, Inria (Axis)
Supported by the French embassy in Brazil with the help of the cultural and technical services in Recife
Recife Nov 2004

Outline
• Symbolic description with rules
• Comparison of S.D.
• Notion of coherence
• Description Potential
• Normal Symbolic Form (N.S.F.)
• Computation using N.S.F.
• Problem of memory growth
• The effective growth
• Perspectives and conclusion

Symbolic descriptions

Classical individuals
            Color    Size
Beetle 1    Blue     8.0
Beetle 2    Red      10.0

Symbolic descriptions
            Color          Size
species1    {Blue, Red}    [7:12]
species2    {Yellow}       [8:9]

• S.D. are well adapted to describing species, classes, collections.
• S.D. make it possible to take variability and uncertainty into account.
• A symbolic description represents an intension.
• Usually the extension of an S.D. is made up of usual individuals.
• S.D. are a good instrument for summarizing information.

Background Knowledge
Background knowledge can be introduced as rules.
• Two kinds of rules
– hierarchical (mother-daughter): if wings ∈ {absent} then wings_colour = NA
– logical: if wings_colour ∈ {red} then Thorax_colour ∈ {blue}
• Dependencies reduce the description space
– they introduce holes in the description space
• But hierarchical dependencies also reduce the number of dimensions.

NA semantics
• NA means that the variable is not applicable (or non-existent, or meaningless) if the premise is true.
• NA should not be considered as a value, but for convenience we write: var = NA.
– remark 1: hierarchical rules induce a kind of inheritance.
– remark 2: as NA is not a possible value of a variable, a variable that is Not Applicable cannot have any value.
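The two kinds of rules above can be sketched in Python. This is only an illustration under stated assumptions: the dictionary encoding of an individual, the use of `None` to stand for NA, and the function names `respects_hierarchical`, `respects_logical` and `respects_rules` are all hypothetical and do not come from the slides.

```python
# Illustrative encoding (an assumption, not the author's implementation).
NA = None  # marks "not applicable"; it is deliberately not a real value


def respects_hierarchical(ind, premise_var, premise_vals, conclusion_var):
    """Rule: if premise_var in premise_vals then conclusion_var = NA."""
    if ind[premise_var] in premise_vals:
        return ind[conclusion_var] is NA
    return True


def respects_logical(ind, premise_var, premise_vals, conclusion_var, allowed):
    """Rule: if premise_var in premise_vals then conclusion_var in allowed."""
    if ind[premise_var] in premise_vals:
        return ind[conclusion_var] in allowed
    return True


def respects_rules(ind):
    """Check the two example rules from the slide on one individual."""
    return (
        respects_hierarchical(ind, "wings", {"absent"}, "wings_colour")
        and respects_logical(ind, "wings_colour", {"red"}, "thorax_colour", {"blue"})
    )


wingless = {"wings": "absent", "wings_colour": NA, "thorax_colour": "yellow"}
wingless_coloured = {"wings": "absent", "wings_colour": "red", "thorax_colour": "blue"}
red_winged = {"wings": "present", "wings_colour": "red", "thorax_colour": "yellow"}

print(respects_rules(wingless))           # hierarchical rule satisfied
print(respects_rules(wingless_coloured))  # absent wings cannot have a colour
print(respects_rules(red_winged))         # red wings require a blue thorax
```

Keeping NA outside every value set mirrors remark 2: a variable that is not applicable has no value at all.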
The Importance of the rules
Similarity problem: consider two symbolic descriptions
x = [a ∈ {a1, a2, a3, a4}] ^ [b ∈ {b1, b2, b3, b4}]
y = [a ∈ {a3, a4}] ^ [b ∈ {b2, b3, b4, b5}]
• if a ∈ {a1, a2} then b = NA
[Figure: two panels plotting x and y on axes a (a1 to a4) and b (b1 to b5)]
Once the rule is taken into account, the objects look more similar.

Problem of Identification
Consider two objects
x = [a ∈ {a3, a4}] ^ [b ∈ {b2, b3}]
y = [a ∈ {a2, a3}] ^ [b ∈ {b3, b4}]
• if a ∈ {a3} then b ∈ {b1, b2, b4}
[Figure: two panels plotting x (full white) and y (grey shade) on axes a (a1 to a4) and b (b1 to b4)]
These objects cannot always be discriminated.

Our needs
• To be able to compare S.D. constrained by rules
• To then carry out data analysis, data mining, ...
• Especially distance computation:
$d(a,b) = \frac{\pi(a \oplus b) - 0.5\,(\pi(a) + \pi(b))}{\pi(a \oplus b)}$
• where $\oplus$ is the join operator
• and $\pi(a)$ is the description potential of a

The Coherence
• An individual is coherent if its description respects the rules.
• The coherent part of an S.D. is the part of the description which respects the rules.
• An S.D. is coherent if a coherent part exists.
• An S.D. is fully coherent when all the H-volume it describes is coherent.
– if wings ∈ {absent} then wings_colour = NA

      wings               wings_colour
d1    {absent}            {blue, yellow, red}
d2    {absent, present}   {blue, yellow, red}
d3    {present}           {blue, yellow, red}

• d1 is not coherent, d2 is coherent, d3 is fully coherent.

The Description Potential
• Description Potential: the measure of the COHERENT part of the volume described by a symbolic description.
• The introduction of dependence rules changes the potential of a description.
• The computation of the Description Potential is combinatorial.
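The coherence definitions above can be checked by brute-force enumeration of the individuals covered by a description, which also makes the combinatorial cost visible. The sketch below is an assumption-laden illustration: the dictionary layout and the names `violates` and `classify` are hypothetical, and only the single hierarchical rule of the coherence table is encoded.

```python
from itertools import product


def violates(wings, colour):
    # Rule: if wings in {absent} then wings_colour = NA.
    # The descriptions below list only real colours, so an individual
    # with absent wings and any colour breaks the rule.
    return wings == "absent" and colour is not None


def classify(description):
    """Enumerate every individual of the description and test the rule."""
    individuals = list(product(description["wings"], description["wings_colour"]))
    coherent = [ind for ind in individuals if not violates(*ind)]
    if not coherent:
        return "not coherent"
    if len(coherent) == len(individuals):
        return "fully coherent"
    return "coherent"


colours = {"blue", "yellow", "red"}
d1 = {"wings": {"absent"}, "wings_colour": colours}
d2 = {"wings": {"absent", "present"}, "wings_colour": colours}
d3 = {"wings": {"present"}, "wings_colour": colours}

for name, d in (("d1", d1), ("d2", d2), ("d3", d3)):
    print(name, classify(d))
```

The number of individuals enumerated is the product of the cardinalities of all value sets, so this approach only scales to very small descriptions.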
Combinatorial aspect of D.P.
x : {a1, a2} {b1, b2} {c1, c2} {d1, d2}
Potential without rules = 2 x 2 x 2 x 2 = 16
if a ∈ {a1} then b ∈ {b1}   (r1)
if c ∈ {c1} then d ∈ {d1}   (r2)

a1 b1 c1 d1   Y             a2 b1 c1 d1   Y
a1 b1 c1 d2   N (r2)        a2 b1 c1 d2   N (r2)
a1 b1 c2 d1   Y             a2 b1 c2 d1   Y
a1 b1 c2 d2   Y             a2 b1 c2 d2   Y
a1 b2 c1 d1   N (r1)        a2 b2 c1 d1   Y
a1 b2 c1 d2   N (r1, r2)    a2 b2 c1 d2   N (r2)
a1 b2 c2 d1   N (r1)        a2 b2 c2 d1   Y
a1 b2 c2 d2   N (r1)        a2 b2 c2 d2   Y

9 of the 16 individuals respect both rules.

Computation of D.P. without dependencies
$\pi(a) = \prod_{i=1}^{p} \mu(A_i)$
$\mu(A_i) = \mathrm{cardinal}(A_i)$ if $y_i$ is discrete
$\mu(A_i) = \mathrm{Range}(A_i)$ if $y_i$ is continuous
where Range(A_i) is the absolute value of the difference between the upper bound and the lower bound of the interval A_i.

Computation of D.P. with dependencies
Let $d = [y_i \in A_i]$ be an S.D. and $r_j \in \{r_1, \dots, r_t\}$ a rule.
$\pi(d \mid r_1 \wedge \dots \wedge r_t) = \prod_{i=1}^{p} \mu(A_i) - \Big[ \sum_{j=1}^{t} \pi(d \wedge \neg r_j) - \sum_{j<k} \pi((d \wedge \neg r_j) \wedge \neg r_k) + \dots + (-1)^{t+1} \pi((d \wedge \neg r_1) \wedge \neg r_2 \dots \wedge \neg r_t) \Big]$
Complexity:
- exponential in the number of rules
- linear in the number of variables

Our Aim
• Our aim is to represent only the valid (fully coherent) part of an S.D.
– We have to split the description space into different subspaces.
– Each subspace corresponds to one premise variable and all the conclusion variables linked with it.
– Each subspace is cut into slices in which all the values of the premise variable lead to the same conclusion.

Normal Symbolic Form
• If Hand ∈ {Absent} then Hand_size = N.A.
• If Hand ∈ {Absent} then Finger = N.A.
• If Finger ∈ {Absent} then Finger_size = N.A.

      Hand                Hand_size       Finger              Finger_size    …
d1    {absent, present}   {big, middle}   {absent, present}   {big, small}   …
d2    {absent, present}   {big, small}    {present}           {small}        …

[Figure: dependence tree with nodes Hand, Finger, Hand_size, Finger_size]

N.S.F.
Initial table
      Hand                Hand_size       Finger              Finger_size    …
d1    {absent, present}   {big, middle}   {absent, present}   {big, small}   …
d2    {absent, present}   {big, small}    {present}           {small}        …

Main table
      Hand      …
d1    {1, 3}    …
d2    {2, 3}    …

Secondary tables

hand table
N°   Hand        Hand_size       finger_table
1    {present}   {big, middle}   {1, 3}
2    {present}   {big, small}    {2}
3    {absent}    N.A.            N.A.

finger table
N°   Finger      Finger_size
1    {present}   {big, small}
2    {present}   {small}
3    {absent}    N.A.

3 tables but NO MORE RULES

Normal Symbolic Form
• First NSF condition:
– either no dependency occurs between the variables of a table, or the dependencies occur only between the first variable V1 and the others
• Second NSF condition:
– all the values taken by the premise variable V1 for one object lead to the same conclusion
• Valid only if the rules form a tree or a set of trees (first condition)
• Inspired by Codd's normal forms in databases

Consequences
Two consequences:
• Cut the data table into different tables according to the dependence tree
– possible only if the dependencies form a tree or a forest
• Cut each symbolic description into two parts (CQ2)
– one where the premise is true
– one where the premise is false

If Finger ∈ {Absent} then Finger_size = N.A.

Before:
Finger              Finger_size
{absent, present}   {big, small}

After:
Finger      Finger_size
{present}   {big, small}
{absent}    N.A.

The dependence tree
• For each rule we draw an edge from the premise variable to the conclusion variable
• Each node corresponds to one variable
• Each node can be linked to more than one rule
• Each node and its sons correspond to a secondary table
• y1 forms a table with y2, y3, y4
• y2 forms a table with y5 and y6
[Figure: dependence tree with root y1, sons y2, y3, y4, and y5, y6 as sons of y2]

Potential computation with N.S.F.
π(d1) = (ta(1) + ta(3)) * … = (6 + 1) * … = 7 * …
ta(1) = 1 * 2 * (tc(1) + tc(3)) = 1 * 2 * (2 + 1) = 6
ta(3) = 1 * 1 = 1
π(a1) = 2 * (ta(1) + ta(2)) = 2 * (3 + 2) = 10
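The arithmetic for ta and tc can be reproduced with a short recursive sketch over the Hand and Finger secondary tables. Everything about the encoding is an assumption made for illustration: the dictionaries `hand_table` and `finger_table`, the column name `finger_ref`, and the helper `card` are hypothetical names, and an N.A. cell is given a factor of 1 because that is what ta(3) = 1 * 1 = 1 suggests.

```python
NA = None  # an N.A. cell; assumed to contribute a factor of 1

# Secondary tables of the Hand / Finger example (hypothetical encoding).
finger_table = {
    1: {"finger": {"present"}, "finger_size": {"big", "small"}},
    2: {"finger": {"present"}, "finger_size": {"small"}},
    3: {"finger": {"absent"}, "finger_size": NA},
}
hand_table = {
    1: {"hand": {"present"}, "hand_size": {"big", "middle"}, "finger_ref": {1, 3}},
    2: {"hand": {"present"}, "hand_size": {"big", "small"}, "finger_ref": {2}},
    3: {"hand": {"absent"}, "hand_size": NA, "finger_ref": NA},
}


def card(cell):
    """Cardinality of a nominal cell; an N.A. cell counts as 1."""
    return 1 if cell is NA else len(cell)


def tc(n):
    """Potential of line n of the finger table."""
    row = finger_table[n]
    return card(row["finger"]) * card(row["finger_size"])


def ta(n):
    """Potential of line n of the hand table, recursing into the finger table."""
    row = hand_table[n]
    refs = row["finger_ref"]
    sub = 1 if refs is NA else sum(tc(k) for k in refs)
    return card(row["hand"]) * card(row["hand_size"]) * sub


def hand_part(refs):
    """Contribution of the Hand subtree to the potential of a description."""
    return sum(ta(n) for n in refs)


print(ta(1), ta(3), hand_part({1, 3}))  # d1 references lines 1 and 3
```

No rule is evaluated anywhere in this computation: the rules are already encoded in the way the tables were cut, so the cost follows the table tree rather than the number of rules.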
Main table
      Hand      …
d1    {1, 3}    …
d2    {2, 3}    …

Table ta
N°   Hand        Hand_size      Finger_table
1    {present}   {big, med}     {1, 3}
2    {present}   {big, small}   {2}
3    {absent}    N.A.           N.A.

Table tc
N°   Finger      Finger_size
1    {present}   {big, small}
2    {present}   {small}
3    {absent}    N.A.

Line potentials shown on the tables: ta(1) = 6, ta(3) = 1, tc(1) = 2, tc(3) = 1.

Memory Growth (1st approach)
• CQ2 => the size doubles for each node of the dependence tree
• T: number of premise variables, N: number of descriptions, S: size of the biggest secondary table
• $S \le N \cdot 2^{D}$ (D: depth of the tree)
• well-balanced tree: $D = \log_2(T)$, so $S = N \cdot 2^{\log_2(T)} = N \cdot T$ => POLYNOMIAL
• unbalanced tree (worst case): $S = N \cdot 2^{T}$
[Figure: dependence tree with table sizes labelled N, N*2^2, N*2^3, N*2^4, N*2^5, …, N*2^T]

Data Factorisation

Initial table
      wings               wings_color    Thor_col          Thorax_size
d1    {absent, present}   {red, blue}    {blue, yellow}    {big, small}
d2    {absent, present}   {red, green}   {blue, red}       {small}
d3    {absent}            NA             {blue, yellow}    {med, small}

Main table
      Wings...   Thorax_size
d1    {1, 2}     {big, small}
d2    {2, 3}     {small}
d3    {2}        {med, small}

wings table
N°   wings    Color
1    {pres}   {1, 2}
2    {abs}    {4}
3    {pres}   {1, 3}

color table
N°   Wings_co   Thorax_col
1    {red}      {blue}
2    {blue}     {blue, yell}
3    {green}    {blue, red}
4    NA         {blue, yell}

if wings ∈ {absent} then wings_colour = NA.
if wings_colour ∈ {red} then Thorax_colour ∈ {blue}.

Usual operations with N.S.F.
• No changes, BUT recursion due to the numerous tables (following the tree of tables)
• Two kinds of operations
– Creating a new volume (join, union, ...)
– Restricting an existing volume (intersection, ...)

The Join (without N.S.F.)
An operation creating a new volume
[Figure: join of two descriptions]

Operations with N.S.F. (creating a new volume)
[Figure: two cases, labelled "OK" and "Suicide ??"]

The Bounds
We will first consider nominal variables and hierarchical rules.
2 cases:
- Locally: between mother and daughter
- Globally: between the root and a leaf
Sn: size according to CQ2
Sv: size according to the number of possible different descriptions
N_daughter, N_mother: table sizes, F_l = N_d / N_m
$F_{local} = \frac{N_{daughter}}{N_{mother}} \le \frac{\min(S_n, S_v)}{N_{mother}} \le 2$

One set of premise values
• One premise variable y divided into two sets of values, $A$ and $\bar{A}$
• Locally
– m conclusion variables x1 … xm
– N_mother, N_daughter: table sizes
– according to CQ2: $S_n = 2 \cdot N_m$
– according to the variables' domains: $S_v = (2^{|A|} - 1) + (2^{|\bar{A}|} - 1) \prod_{j=1}^{m} (2^{|X_j|} - 1)$
– $F_{local} \le 2$

Globally
- A line with NA cannot refer to secondary tables
- $F_{global} \le 2$
- There is no following element for A
- No global growth
[Figure: chained tables whose lines are labelled A, Ā, A1 and N]

More than 1 set of premise values
• Until now, 1 set of premise values
• But more than one (n) can exist:
if travel ∈ {plane} then car = N.A.
if travel ∈ {plane} then train = N.A.
if travel ∈ {car} then plane = N.A.
….
$F_l = \frac{N_D}{N_M} \le \frac{\min(T_n, T_d)}{N_M} \le n$

Tables Occupation
• N items to put
• m possible descriptions (places)
[Figure: places holding items, e.g. I31, I37, I70 / I10, I20 / I15, I27]
• How many busy places?
• Maxwell-Boltzmann statistics in physics

Tables Occupation (without rules)
• N initial individuals, m possible descriptions
• with variables d1, …, dj, …, dp: $m = \prod_{j=1}^{p} (|P(X_j)| - 1)$
• with independent variables and equirepartition:
$p(d_1, \dots, d_p) = \prod_{i=1}^{p} \frac{1}{|P(X_i)| - 1} = \prod_{i=1}^{p} \frac{1}{2^{|X_i|} - 1}$

Tables occupation with rules
• 2 tables are generated
– one where the rules did NOT apply (T1)
– one where the rules did apply (T2)
• For T2: the preceding problem
• T1 size:
– small compared with N => T1 mostly full
– => factorisation
• the size will grow according to $P(\bar{A}) / P(A)$
• the greater A is, the greater the factorisation will be

Application
• Phlebotomines (sandflies)
– 73 species (descriptions)
– 53 nominal variables
– 5 rules in 3 different trees with 8 variables
• Fg = Fl
• Secondary tables
– 32 lines (56): Fl = 32/73 = 0.438
– 18 lines (39): Fl = 18/73 = 0.246
– 16 lines (30): Fl = 16/73 = 0.219

About complexity
• Size
– Overcost from the reference variables
– Overcost induced by the possible growth factor F
• Computation
– the complexity related to the rules disappears
– an overcost due to the N.S.F. transformation appears (N², like a sort), but only once
– a minor (linear) overcost appears with some operations (like the join)
– an overcost due to the recursion

Perspectives and conclusions
• Our work has reached a mature point
• But it is still incomplete
– Accept a dependence graph instead of a dependence tree
– Adapt algorithms to N.S.F.
– First distances and comparisons
– Clustering, factorial analysis, ...
• Carry out simulation studies