From: AAAI Technical Report SS-99-01. Compilation copyright © 1999, AAAI (www.aaai.org). All rights reserved.

Rule Generation by Means of Lattice Theory

R. Brüggemann 1), S. Pudenz 1), H.-G. Bartel 2)

1) Institute of Freshwater Ecology and Inland Fisheries, Dep. of Ecohydrology, Rudower Chaussee 6a, D-12484 Berlin, brg@igb-berlin.de
2) Institute of Chemistry, Department of Analytical and Environmental Chemistry, Humboldt University Berlin, Hessische Str. 1-2, D-10115 Berlin, hans-georg=bartel@rz.hu-berlin.de

Abstract

A rather small data matrix of seven chemicals and 17 different ecotoxicological endpoints is examined by lattice theory. In particular, Formal Concept Analysis, developed since 1980 by the school of Wille, is applied. The heart of this mathematical method is the concept, a pair of sets that are related to each other (Galois connection). One set is a subset of the objects, the other a subset of the properties. The concepts are partially ordered by a subset relation among those sets, and from the subset relation implications are derived. Here, after some reading examples are given, implications are discussed. For example, the following question can be answered: Is there a chemical that has a high ecotoxicological effect on both Tanytarsus diss. (insect) and Rana catesbeiana (amphibian)? Our chemical set contains none that has these properties in common. Other examples are discussed.

Introduction

In our paper we concentrate on the purely qualitative point of view, i.e. we would like to generate "if ... then ..." rules. Ecotoxicological test data designed to evaluate the load status of the aquatic environment are of specific interest. The premises of such rules should be structural properties of the chemicals. The conclusions should be statements about the order of magnitude of the toxicities. A lack of numerical accuracy may be compensated for by the potential to derive many rules.

The tool to generate such rules is derived from lattice theory, which is a rather new discipline of Discrete Mathematics (since around 1930) (Birkhoff 1984, Davey and Priestley 1990). In particular the variant of Formal Concept Analysis, a subdiscipline of lattice theory dating from around 1980, turns out to be useful in deriving such rules in an automatic, i.e. algorithmic, way (Davey and Priestley 1990, Ganter and Wille 1996). One of the main advantages of this method is its connection to graph theory and, through this, the possibility of visualizing rather complex relationships in high-dimensional spaces. Some first steps to apply lattice theory in environmental sciences have already been published, see e.g. Brüggemann, Voigt and Steinberg (1997b).

Data

Seven chemicals and 17 test species of different trophic levels are analyzed (see table 1). The species are commonly used in environmental protection studies. From a systematic point of view, more different species per group would be preferable; however, the data availability for such an ambitious aim is rather poor. Devillers, Thioulouse and Karcher (1993) selected the chemicals mainly by their different physiological mechanisms. They are oxygenated or halogenated compounds (see table 2). The full data matrix can be examined in the original paper.

Table 1: Test species of different trophic levels, taken from Devillers, Thioulouse and Karcher (1993).

Species                                      Identifier  Type
Daphnia magna (LC50)                         A           Crustacea
Tanytarsus diss.                             B           Insect
Orconectes immunis                           C           Crustacea
Rana catesbeiana                             D           Amphibian
Oncorhynchus mykiss                          E           Fish
Lepomis macrochirus                          F           Fish
Gambusia affinis                             G           Fish
Ictalurus punctatus                          H           Fish
Carassius auratus                            I           Fish
Pimephales promelas                          J           Fish
Arbacia punctulata (early embryo)            K           Echinoderm
Arbacia punctulata (sperm cell)              L           Echinoderm
Photobacterium phosphoreum                   M           Bacterium
Arbacia punctulata (early embryo, DANN)      N           Echinoderm
Daphnia magna (EC50)                         O           Crustacea
Daphnia pulex                                P           Crustacea
Ceriodaphnia reticulata                      Q           Crustacea

We call the chemicals the objects of the analysis and the set of objects $G$. The toxicological data are called the attributes of each object. The attribute set is called $M$.

Table 2: Seven chemicals and their identifiers.
[The structural formulas and the identifier column are not legible in this copy; the chemical names are:]
2-Methyl-propanol
2,4-Pentadione
2-Chloroethanol
Hexachloroethane
2,2,2-Trichloroethanol
2-Methyl-2,4-pentadiol
Pentachlorophenol

For the sake of robustness an attribute-wise classification is performed. Each attribute is divided into five intervals as follows (table 3):

Table 3: Classification of the attribute values.

Interval No.   Attribute interval
1              0.68 - 1.83
2              1.89 - 3.00
3              3.09 - 4.02
4              4.24 - 5.06
5              5.22 - 6.52

Instead of the raw measured data $p_i$, each object $x$ is now ecotoxicologically characterized by a 17-tuple

$n(x) = (n_1, n_2, \ldots, n_{17})$

corresponding to the 17 different species/end points. The quantities $n_i$ are integer numbers, as defined by table 3. A simplifying notation can be introduced: <value in terms of $n_i$><identifier for the species>. For example, 2C denotes the value 2 for the species C (Orconectes immunis).

Lattice theory

For introductory literature the reader is referred to Birkhoff (1984) and Davey and Priestley (1990). Consider a set of objects $G$; then binary relations among the elements of this set may be considered, for example an order relation. An order relation has to be
• reflexive,
• antisymmetric and
• transitive.

For example, the comparison of two numbers by "less than or equal to" follows these three axioms. Sets equipped with an order relation are called partially ordered sets (posets). For any two objects one may look for upper and lower bounds. Put simply, if for any two objects a unique least upper bound and a unique greatest lower bound exist, then the mathematical structure of a lattice arises.

Here we are interested in lattices of a specific construction, namely the lattices of Formal Concept Analysis (FCA) (Ganter and Wille 1996).
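The three order axioms listed above can be checked mechanically for any finite relation. The following is only a minimal sketch under stated assumptions: the function names `is_partial_order` and `least_upper_bound` and the small example set are hypothetical and are not taken from the paper.

```python
from itertools import product

def is_partial_order(elements, leq):
    """Check reflexivity, antisymmetry and transitivity of `leq` on a finite set."""
    elements = list(elements)
    reflexive = all(leq(a, a) for a in elements)
    antisymmetric = all(
        a == b or not (leq(a, b) and leq(b, a))
        for a, b in product(elements, repeat=2)
    )
    transitive = all(
        leq(a, c) or not (leq(a, b) and leq(b, c))
        for a, b, c in product(elements, repeat=3)
    )
    return reflexive and antisymmetric and transitive

def least_upper_bound(elements, leq, a, b):
    """Return the least upper bound of a and b, or None if it does not exist."""
    uppers = [u for u in elements if leq(a, u) and leq(b, u)]
    least = [u for u in uppers if all(leq(u, v) for v in uppers)]
    return least[0] if len(least) == 1 else None

# Hypothetical example: all subsets of {"-OH", "-Cl"}, ordered by inclusion.
subsets = [frozenset(s) for s in ([], ["-OH"], ["-Cl"], ["-OH", "-Cl"])]

def subset_leq(a, b):
    """Order by set inclusion."""
    return a <= b
```

Called as `is_partial_order(subsets, subset_leq)`, the check passes, and every pair of subsets has a least upper bound (their union), so this small poset is a lattice. The strict comparison `a < b` on numbers, by contrast, fails the check because it is not reflexive.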
The basic notion of FCA is a triple $(G, M, I)$ called formal context. $G$ is the set of objects under consideration and $M$ the set of their (original or suitably transformed) attributes. An element $gIm$ ($g \in G$, $m \in M$) of the binary relation $I \subseteq G \times M$ ($gIm \Leftrightarrow (g, m) \in I$) is read: "Object g has attribute m". Another form for representing a context $(G, M, I)$ is a cross table. Its rows are labeled by the objects $g \in G$ and its columns by the attributes $m \in M$. An "X" at the crossing of row $g$ with column $m$ means that $gIm$ holds.

A formal concept corresponding to a given context $(G, M, I)$ is defined as follows: Let $X'$ be the set of attributes shared by all objects of the subset $X \subseteq G$,

$X' := \{ m \in M \mid gIm \text{ for all } g \in X \}$,

and $Y'$ the set of objects sharing all attributes of the subset $Y \subseteq M$,

$Y' := \{ g \in G \mid gIm \text{ for all } m \in Y \}$.

Then a formal concept is defined as a pair $(X, Y)$ for which $X' = Y$ and $Y' = X$ hold ("Galois connection"). The subset $X$ is called the extent and the subset $Y$ the intent of the concept $(X, Y)$.

Within the set $B(G, M, I)$ of the concepts of the context $(G, M, I)$ a partial order relation denoted by $\le$ (subconcept-superconcept relation) can be defined: the concept $(X, Y)$ is a subconcept of the superconcept $(U, V)$ if the extent $U$ contains all objects of the extent $X$ and the intent $Y$ contains all attributes of the intent $V$,

$(X, Y) \le (U, V) :\Leftrightarrow X \subseteq U \ (\Leftrightarrow V \subseteq Y)$.

The partially ordered set $(B(G, M, I), \le)$ has the mathematical structure of a complete lattice, the concept lattice of the context $(G, M, I)$. It reflects the generalized hierarchy and conceptual structure corresponding to $(G, M, I)$. Concept lattices are effectively visualized by labeled line diagrams (HASSE diagrams). Little circles or dots represent the concepts, and the ascending paths of line segments represent the neighborhood relation $\prec$.
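The two derivation operators and the concept definition above translate almost directly into code. The sketch below is an illustration only: the toy context, the object names and all function names are hypothetical and do not reproduce the paper's data matrix. It enumerates the concepts naively by closing every subset of objects, which is exponential in the number of objects but unproblematic for a handful of them.

```python
from itertools import combinations

# Hypothetical toy context (not the paper's data): object -> set of attributes.
CONTEXT = {
    "a": {"-OH", "2C"},
    "b": {"-OH", "-Cl", "2C"},
    "c": {"-Cl", "3A"},
}
ATTRIBUTES = set().union(*CONTEXT.values())

def common_attributes(objects):
    """X': attributes shared by all objects in X (all of M if X is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result

def common_objects(attributes):
    """Y': objects that have every attribute in Y."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}

def formal_concepts():
    """Enumerate all concepts (extent, intent) by closing every object subset."""
    concepts = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset)   # X'
            extent = common_objects(intent)      # X'' (closure of the subset)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

def is_subconcept(c1, c2):
    """(X, Y) <= (U, V) holds if and only if X is a subset of U."""
    return c1[0] <= c2[0]
```

For every pair returned by `formal_concepts()`, applying `common_attributes` to the extent gives back the intent and applying `common_objects` to the intent gives back the extent, which is exactly the condition $X' = Y$ and $Y' = X$. Dedicated algorithms from the FCA literature avoid the exhaustive enumeration used here.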
The relation $(X, Y) \prec (U, V)$ holds, i.e. the concept $(U, V)$ is an upper neighbor of the concept $(X, Y)$, if $(X, Y) < (U, V)$ and there is no other concept $(T, S)$ that is a subconcept of $(U, V)$ and a superconcept of $(X, Y)$; that means $(X, Y) < (T, S) < (U, V)$ is false for every concept $(T, S)$.

An equivalent description of the conceptual knowledge given by the concept lattice or the line diagram of a context $(G, M, I)$ is the description by attribute implications:

context   $\Leftrightarrow$   concept lattice          $\Leftrightarrow$   attribute implications
(data)                        (conceptual hierarchy)                       (logical structure)

An implication between attributes, $PR \rightarrow CO$ with $PR, CO \subseteq M$, holds in $(G, M, I)$ if all objects sharing all attributes of $PR$ also share all attributes of $CO$. For each finite context a reduced and complete implication basis can be found.

Results

To construct an extended context consisting of the ecotoxicological information from $n(x)$ and the structure, the structure is coded by the following elements: -Cl, -OH, =O or aromatic (planar 4n+2 system). Thus in the context table a chemical receives an "X" if it has the corresponding structural element:

[Context table with the columns Chemical, -OH, -Cl, aromatic and =O; the individual rows are not legible in this copy.]

The HASSE diagram of the concept lattice $(B(G, M, I), \le)$ of the context is shown in figure 1. From such a diagram the information summarized in a data matrix can easily be read off. Starting from an object and following the lines upwards, the information about that object is obtained. For example, from chemical β three other points are reachable: 2C, 1A and -OH. This means that chemical β has the toxicities 2C and 1A and the structural element -OH.

Figure 1: HASSE diagram of the lattice.

Following the lines downwards, starting for example from properties, one finds the chemicals with a given spectrum of properties: The property 2M (i.e. a specific ecotoxicological reaction of medium extent) is fulfilled for all chemicals but β. However, the structural elements cannot characterize this exceptional case.
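Reading the diagram downwards from a set of properties, and testing whether an attribute implication holds, both reduce to collecting the objects that share a given attribute set. The following minimal sketch assumes a hypothetical toy context; the object names `x1` to `x4`, their attribute sets and the function names are invented for illustration and are not the paper's data.

```python
# Hypothetical toy context (not the paper's data matrix): object -> attributes.
CONTEXT = {
    "x1": {"-OH", "2C", "1A"},
    "x2": {"-OH", "-Cl", "2C", "2M"},
    "x3": {"-Cl", "2C", "2M"},
    "x4": {"-Cl", "aromatic", "-OH", "2M"},
}

def objects_with(attributes):
    """Follow the diagram downwards: all objects having every given attribute."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}

def implication_holds(premise, conclusion):
    """PR -> CO holds if every object with all of PR also has all of CO."""
    required = set(conclusion)
    return all(required <= CONTEXT[g] for g in objects_with(premise))

def support(premise):
    """Number of objects fulfilling the premise (0: the rule holds vacuously)."""
    return len(objects_with(premise))
```

In this toy context `implication_holds({"-Cl"}, {"2M"})` is true with a support of three objects, whereas `implication_holds({"aromatic"}, {"-Cl"})` is also true but rests on a single object. Reporting the support next to each implication is one simple way to judge how much weight a generated rule deserves.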
Somewhat more skill is needed if several properties or chemicals are to be examined. For example: Is there a chemical that has a high ecotoxicological effect on both Tanytarsus diss. (insect) and Rana catesbeiana (amphibian), i.e. 4B and 5D? Our chemical set contains none that has these properties in common. If the demand for the toxicity to the amphibian is relaxed to only 3D, then the chemical ~ is found. Again the structural code is by far too crude to characterize this specificity.

Finding rules: Here the implications are selected whose premises are concerned solely with chemical structural elements:

-Cl → 2A, 2C, 2M
=O → 3A, 2C, 5D, 3E, 4L, 2M
aromatic → all with the exception of 4B

The third rule, however, has no firm basis, because its premise is fulfilled by only one compound, the only aromatic compound in the whole set of substances. The problem becomes more evident with another rule:

aromatic → -Cl

This is true within the set actually chosen but clearly cannot be generalized. The rules are extracted from a finite set of objects and attributes; therefore the generation of rules should rather be seen as an automatic generation of hypotheses.

Discussion

A data matrix is converted into a HASSE diagram of concepts in several steps (classification, extension by structural codes, forming a simple-valued context). From the HASSE diagram and the subset-superset relation, implications can be derived. Among the many implications that relate the attributes (a total of 63 rules is the outcome of that simple data matrix), those that relate structural information to ecotoxicological end points are of primary interest. There are some distinct drawbacks:

• Clearly the substance basis is still by far too weak to establish real rules; however, the method allows this in principle, as was shown with a few examples.
• The real measured data are converted into discrete values without any statistical guidance. Thus there is no safeguard against slight changes in the data transformation. Such a safeguard could be added, but it is not the focus of this paper.
• The HASSE diagram may consist of a complex system of lines and may therefore be difficult to read. However, there are techniques to circumvent such unreadable diagrams.
• One may argue that the type of ecotoxicological end point has not been discussed so far. Some results on this are given in Bartel (1999); they are more or less methodological and are therefore omitted here.

What are the advantages? Clearly, if a graphical presentation is possible, it is preferable to tedious tables. As pointed out, many interesting studies can be performed; they become less trivial the more one examines, instead of single relations like $gIm$, relations like $G'I?$ or $?IM'$ with $G' \subseteq G$ or $M' \subseteq M$. Finally, the subset-superset relation allows implications to be deduced. Examples are discussed. A crucial point is whether a premise is empty or not. If the premise is not empty, then one gets some kind of probability: the more often the premise is fulfilled, the more weight the implication has as a real rule.

However, it should be made clear that even the machinery of lattice theory cannot "generate" new insights. They are all already included in the data set; the problem is only to distill the information from them.

References

Birkhoff, G. 1984. Lattice theory. Providence, Rhode Island: American Mathematical Society, Vol 25.

Davey, B. A. and Priestley, H. A. 1990. Introduction to Lattices and Order. Cambridge: Cambridge University Press.

Ganter, B. and Wille, R. eds. 1996. Formale Begriffsanalyse: Mathematische Grundlagen. Berlin: Springer-Verlag.

Brüggemann, R., Voigt, K. and Steinberg, C. 1997b. Application of Formal Concept Analysis to Evaluate Environmental Databases. Chemosphere 35(3): 479-486.

Devillers, J., Thioulouse, J. and Karcher, W. 1993. Chemometrical Evaluation of Multispecies-Multichemical Data by Means of Graphical Techniques Combined with Multivariate Analyses. Ecotoxicol. Environ. Saf. 26: 333-345.

Bartel, H.-G. 1999. Formal Concept Analysis in Ecotoxicology. In Proceedings of L Workshop on Order Theoretical Tools in Environmental Sciences. Germany, Berlin. To appear as IGB Report.