Data Mining Query Languages
Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari
malerba@di.uniba.it
http://www.di.uniba.it/~malerba/

A database perspective on KDD
- Most current KDD systems offer isolated discovery features using tree inducers, neural nets, and rule discovery algorithms.
- They cannot be embedded into a large application and typically offer just one knowledge discovery feature.
- The same is true of OLAP tools.
- This is the first generation of KDD tools.

DMQL – Prof. D. Malerba

Short term research program
- Efficient DM algorithms on top of large databases, utilizing the existing DBMS support.
- Examples:
  1. Realizing C4.5 on top of a large database requires tighter coupling with the DBMS and intelligent use of indexing techniques.
  2. Exploitation of caching techniques for association rule mining.
  3. Exploitation of special indexing techniques for clustering.
- See IBM's Intelligent Miner.

Long term research program
- KDD should follow one of the key DBMS paradigms: building interpreters for query languages and compilers for ad hoc queries, and embedding queries in application programming interfaces (APIs).
- Focus: increasing programmer productivity for KDD application development.
- Knowledge and Data Discovery Management Systems (KDDMS) are the second generation of KDD systems.

Imielinski & Mannila's view
KDD objects:
- Rule: a probabilistic formula or multidimensional correlation, e.g.
      X.Diagnosis = "heart disease" and X.Age < 50 => X.BMI > 29 [300, 0.80]
- Classifier: decision trees, neural networks, multidimensional regression.
- Clustering: a collection of objects.
KDD query: a predicate which returns a set of objects that can be either KDD objects or database objects (records or tuples).

Imielinski & Mannila's view
- The KDD objects typically will not exist a priori, thus querying the KDD objects requires their generation at run time.
- KDD objects may also be pre-generated and stored in an “inductive” database, such as metadata. In such cases querying reduces to retrieval.
- A KDDMS should be able to store and manage the KDD objects persistently, as well as provide the ability to query them.
- Querying involves:
  - the generation of new KDD objects;
  - the retrieval of the ones which were generated before.

Imielinski & Mannila's view
- Closure principle: the result of a query is a relation that can be queried further.
- The result of a KDD query may be an argument of another, compatible type of KDD query.
- In principle, a KDD query can be nested within a regular relational query.
- KDD queries can be embedded in a host programming environment, just as SQL queries can be embedded in host languages.

Imielinski & Mannila's view
Examples of KDD queries:
- Generate a decision tree on a user-defined training set (specified through a database query) with user-defined attributes and user-specified classification categories. Then find all records in a database that are wrongly classified by that classifier and use them as training data for another classifier.
- Generate all rules with consequent values computed by an SQL query (KDD queries may not be completely known at compile time!).
- Find the tuples that belong to the largest cluster in a clustering constructed according to a user-specified distance metric.

Imielinski & Mannila's view
Research program:
1. A KDD query language has to be formally defined.
2. Query optimization tools have to be developed to compile queries into reasonably efficient execution plans.
Very challenging! KDD queries are much more powerful than SQL queries.
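The generate-or-retrieve behaviour and the closure principle described on the preceding slides can be illustrated with a minimal Python sketch. It is not M-SQL or any real KDDMS: `InductiveStore`, `mine_patient_rules`, and the key `"rules:Patient"` are hypothetical names introduced only for illustration, and the two rules are simply the ones quoted in the slides, with the bracketed pair read as support and confidence.

```python
from typing import Callable, Dict, List

Rule = Dict[str, object]


class InductiveStore:
    """Toy stand-in for an inductive database (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._objects: Dict[str, List[Rule]] = {}
        self.generated = 0  # how many times KDD objects had to be generated

    def query(self, key: str, generate: Callable[[], List[Rule]]) -> List[Rule]:
        # Retrieval if the KDD objects were generated before, generation otherwise.
        if key not in self._objects:
            self._objects[key] = generate()
            self.generated += 1
        return self._objects[key]


def mine_patient_rules() -> List[Rule]:
    # Placeholder for a real rule miner; it returns the two rules quoted in the slides.
    return [
        {"body": 'Diagnosis="heart disease" and Age<50', "head": "BMI>29",
         "support": 300, "confidence": 0.80},
        {"body": 'Diagnosis="heart disease" and Sex="male"', "head": "Age>50",
         "support": 1200, "confidence": 0.70},
    ]


store = InductiveStore()
first = store.query("rules:Patient", mine_patient_rules)   # generated at run time
second = store.query("rules:Patient", mine_patient_rules)  # plain retrieval

# Closure: the result is an ordinary collection that can be queried further.
strong = [r for r in second if r["confidence"] >= 0.75]
print(store.generated, [r["head"] for r in strong])
```

The second call does not run the miner again, which mirrors the remark that querying pre-generated KDD objects reduces to retrieval.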
Imielinski & Mannila's view
Example:
    Patient(Age, Sex, City, Diagnosis, Height, Weight, ClaimAmount, …)
    City(State, Population, …)
    X.Diagnosis = "heart disease" and Sex = "male" => X.Age > 50 [1200, 0.70]
The user wants to see all the rules about patients with heart disease such that the consequent of the rule says something about the age of the patient, there are at least 1,000 cases to which the rule body applies, and the confidence of the rule is at least 65%.

Imielinski & Mannila's view
In M-SQL (Imielinski et al., Proc. KDD'96):
    SELECT FROM MINE(T):R
    WHERE R.Body = {(Diagnosis = "heart disease")}
      AND R.Consequent = {(Age = *)}
      AND R.Support > 1000
      AND R.Confidence > 0.65
- R renames MINE(T).
- MINE(T) is an operator that takes a class T and generates all propositional rules about T.
- Rule discovery: another type of querying!

Imielinski & Mannila's view
- Rules are not necessarily the final product of KDD applications.
- A proper API, which embeds a rule query language in a more expressive, general-purpose host programming environment, is necessary.
- Iterate over a collection of rules.

KDD query languages
- Imielinski, Virmani, Abdulghani. Discovery Board Application Programming Interface and Query Language for Database Mining. Proc. KDD'96.
- Imielinski, Virmani. MSQL: A Query Language for Database Mining. Data Mining and Knowledge Discovery, 3(4), 1999.
- Meo, Psaila, Ceri. A New SQL-like Operator for Mining Association Rules. Proc. VLDB, 1996.
- Han, Fu, Koperski, Wang, Zaiane. DMQL: A Data Mining Query Language for Relational Databases. Proc. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), 1996.
- Shen, Ong, Mitbander, Zaniolo. Metaqueries for Data Mining. In: Fayyad et al. (eds), Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.

KDD query languages
- Giannotti, Manco.
  Querying Inductive Databases via Logic-Based User-Defined Aggregates. PKDD 1999.
- De Raedt. An Inductive Logic Programming Query Language for Database Mining. AISC 1998.
- De Raedt. A Logical Database Mining Query Language. ILP 2000.
- De Raedt. Query Execution and Optimization for Inductive Databases. Proc. EDBT Workshop on Database Technologies for Data Mining, 2002.
- Boulicaut, Klemettinen, Mannila. Querying Inductive Databases: A Case Study on the MINE RULE Operator. Proc. Second European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), LNAI 1510, 1998.
- Elfeky, Saad, Fouad. ODMQL: Object Data Mining Query Language. In: Dittrich et al. (eds), Objects and Databases 2000, LNCS 1944, 2001.
- Johnson, Lakshmanan, Ng. The 3W Model and Algebra for Unified Data Mining. Proc. VLDB, 1998.

KDD query languages
- Han, Koperski, Stefanovic. GeoMiner: A System Prototype for Spatial Data Mining. SIGMOD Conference, 1997.
- Malerba, Appice, Ceci, Vacca. SDMOQL: An OQL-based Data Mining Query Language for Map Interpretation. Proc. EDBT Workshop on Database Technologies for Data Mining, 2002.

DMQL: just some syntactic sugar on top of DM algorithms?
- A user can formulate a DM task without paying attention to:
  - logical and physical representation problems;
  - the correct procedural order in which some DM steps should be performed.
- The development of decision support applications becomes easier, just as SQL makes the implementation of operational information systems easy.
- A casual user can find patterns by means of a DMQL in the same way as data can be found by means of an SQL query: no ad hoc applications need to be developed.
- A DMQL provides a foundation on which a GUI can be built.
Spatial Data Mining
- Spatial data mining: the extraction of spatial patterns from both spatial and aspatial data, possibly stored in a spatial database.
- Spatial pattern: a pattern showing the interaction of two or more spatial objects or space-dependent attributes according to a particular spacing or set of arrangements.
- Example: IF a large town intersects the motorway A14 THEN it is also close to the Adriatic Sea (13%, 90%).

Spatial Data Mining & GIS
- Geographical Information Systems (GIS) offer an important application area where spatial data mining techniques can be effectively used.
- Example: topographic map interpretation.

Interpreting Topographic Maps
- Topographic map: a large-scale (1:10000 to 1:100000) composite map showing relief, vegetation and man-made features of a portion of a land surface.
- Interpreting the colored lines, areas, and other symbols is the first step in using topographic maps.
- Easy! Symbols correspond univocally to concepts explicitly modelled by the map creator.
- Difficult! Locating in a map geographical objects that are not explicitly modelled (e.g., an industrial area).

Interpreting Topographic Maps
- Solution: embedding intelligent capabilities in geo-based tools.
- Knowledge-based GIS use spatial reasoning capabilities and available domain knowledge to support map interpretation.
- But operational definitions of some complex concepts
  - are difficult to elicit;
  - are not portable across different data models;
  - depend on the scale of the map.

Data Mining to Support Map Interpretation Tasks
- Data mining tools and techniques are used to find spatial patterns of interest.
- INGENS (INductive GEographic iNformation System) = GIS + Data Mining Server + …
- Training functionality: the user can train the system by providing instances of the geographical objects to be recognized in a map.
INGENS Architecture
[Figure: INGENS architecture. Components shown: Interface Layer, GUI (Web Browser), Application Enabler, Map Converter, Map Descriptor, Map Editor, Data Mining Server, Query Interpreter, Resource Manager, Map Storage Subsystem, Deductive DBMS, ObjectStore DBMS, Map Repository, Knowledge Repository.]
Callouts in the figure:
- Interface layer: implements a GUI, which is a Java applet.
- Map Converter: suite of tools for the import/export of maps.
- Map Editor: permits integration and/or modification of the information acquired by means of the Map Converter.
- Map Descriptor: responsible for the automated generation of first-order logic descriptions of some geographical objects.
- Data Mining Server: a suite of data mining systems that can be run concurrently by multiple users to train INGENS.
- Query Interpreter: allows any user to formulate queries in the SDMOQL language.
- Is the only access path to the data contained in the Map Repository.
- Manages discovered patterns.
- Involved in storing, updating and retrieving items.

The data model for the map repository
- Hybrid tessellation-topological model.
- Tessellation model: a map is decomposed according to a regular grid of cells.
- Topological model: it has two structural hierarchies:
  - physical (describes the geographical objects by means of the most appropriate geometric entity);
  - logical (expresses the semantics of geographical objects).

The object-oriented data model in UML
[Figure: UML class diagram. Association labels: Lower scale; N/NE/NW/S/SE/SW/E/W; Logical structure; Physical structure; Representation; Disjoint/Meet/Overlap/Contains/Equal/Covers; Inside/Border. Class names: Map, Gif, Logical Object, Grid Cell, Physical Object, Point, Line, Region, Line vertex, Hydrography, Orography, Land Administration, Vegetation, Administrative Boundary, Ground Transportation Net., Construction, Built-up Area, Boundary, River, Lake, Canal, Font, Sea, Parcel, Contour Slope, Park, Slope, Cultivation, Level Point.]
Further class names in the UML diagram: Forest, City, Road, Province, County, Ropeway, State, Railway, Building, Airport, Bridge, Wall, Hamlet, Power Station, Town, Factory, Chief Town, Boat Station, Regional Capital, Capital, Deposit.

Different technologies: what support for the user?
- Problem: the user should not suffer from problems related to the integration of different technologies, such as
  - data mining,
  - OODBMS,
  - deductive databases,
  - GIS.
- Solution: a data mining query language (DMQL) interfaces users with the whole system and hides the different technologies.

SDMOQL
- DMQL is the data mining query language defined by Han et al. (1996) for relational databases.
- GMQL (Geo Mining Query Language) is a language for spatial data mining, based on DMQL (Koperski 1999).
- Both are inspired by SQL and the relational model, so they are not appropriate for an OO information system like INGENS.
- SDMOQL (Spatial Data Mining Object Query Language) is a spatial data mining query language for INGENS users, based on OQL.

Data Mining primitives
A DMQL must incorporate a set of DM primitives designed to facilitate efficient, fruitful knowledge discovery. Primitives include:
- the specification of the portions of the database in which the user is interested;
- the kinds of knowledge to be mined;
- background knowledge useful in guiding the discovery process;
- interestingness measures for pattern evaluation;
- how the discovered knowledge should be visualized.

Task-relevant data specification
In traditional DM applications, it is sufficient to specify database attributes or data warehouse dimensions, since:
1. No complex transformation of stored data is required.
   Not so in spatial data mining, where working at the level of stored data, that is, geometric representations (points, lines and regions) of geographic objects, is undesirable. The user is interested in working at higher, human-interpretable conceptual levels, where properties and relations between geographical objects are expressed.
2. No interaction between objects is assumed, so that each object can be effectively described by a tuple in the relation.
   Not so in spatial data mining, where attributes of the neighbors of some spatial object of interest may influence the object itself. The data set to mine cannot be straightforwardly represented by means of a relational table, where distinct tuples refer to distinct, independent objects.

Example
Two roads can cross each other, run parallel, or be confluent, independently of the fact that they are represented by one or more tuples of a relational table of “lines” or “regions”.

A solution
- The SDMOQL interpreter allows the user to select the geographical objects that are relevant to the data mining task, and then invokes the Map Descriptor to produce their high-level conceptual descriptions.
- Conceptual descriptions are based on a first-order logic language, in which both properties and relations of the selected geographical objects can be easily represented.
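To make the idea of a conceptual description more concrete, here is a minimal Python sketch of how selected objects might be turned into first-order descriptors. It is not the INGENS Map Descriptor: the `GeoObject` record and the `describe_cell` function are hypothetical, only three of the descriptors named in the slides (`contain`, `type_of`, `area`) are produced, and the object values are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeoObject:
    """Hypothetical record for an object selected by a query."""
    name: str                     # symbolic identifier, e.g. "x2"
    type_of: str                  # logical type, e.g. "parcel"
    area: Optional[float] = None  # only meaningful for region objects


def describe_cell(cell_id: str, objects: List[GeoObject]) -> List[str]:
    """Return descriptor=value facts for a cell and the objects it contains."""
    facts = [f"type_of({cell_id})=cell"]
    for obj in objects:
        # Relation between the cell and the object.
        facts.append(f"contain({cell_id},{obj.name})=true")
        # Properties of the object itself.
        facts.append(f"type_of({obj.name})={obj.type_of}")
        if obj.area is not None:
            facts.append(f"area({obj.name})={obj.area:.2f}")
    return facts


description = describe_cell(
    "x1",
    [GeoObject("x2", "parcel", 187525.0), GeoObject("x7", "road")],
)
print(", ".join(description))
```

Relational descriptors such as `region_to_region` or `distance` would need the geometry of the objects, which this sketch deliberately leaves out.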
Example
    SELECT x FROM x IN Cell WHERE x->num_cell = 11
Description generated for the selected cell:
    contain(x1,x2)=true, …, contain(x1,x70)=true,
    type_of(x1)=cell, …, type_of(x4)=vegetation, …,
    subtype_of(x2)=cultivation, …, subtype_of(x7)=cart_track_road, …,
    color(x2)=black, …, color(x70)=black,
    extension(x7)=111.018, …, extension(x33)=1104.74,
    geographic_direction(x7)=north, …, geographic_direction(x68)=north,
    line_shape(x7)=straight, …, line_shape(x33)=cuspidal, …,
    altitude(x19)=106.00, …, altitude(x43)=102.00,
    area(x2)=187525.00, …, area(x62)=30250.00,
    density(x2)=high, …, density(x62)=low,
    line_to_line(x7,x68)=almost_parallel, …,
    region_to_region(x2,x21)=meet, …,
    distance(x7,x68)=5.00,
    line_to_region(x8,x27)=adjacent, …,
    point_to_region(x4,x18)=outside, …

Describing topographic maps
- 33 geographical objects: contour_slope, slope, river, canal, primary_road, farm_road, interfarm_road, main_road, …
- 16 descriptors: contain(x, y), type_of(y), subtype_of(y), color(y), area(y), density(y), extension(y), geographic_direction(y), line_shape(y), altitude(y), line_to_line(y, z), distance(y, z), region_to_region(y, z), line_to_region(y, z), point_to_region(y, z)
- Defined together with town planners, the set of descriptors is quite general and can capture geometric, topological and directional features of geographical objects in a topographic map.

Task-relevant data specification
In SDMOQL the selection of geographical objects is performed by means of simplified OQL queries with a SELECT-FROM-WHERE structure.
Example 1: cell-level query
The user selects cell 26 from the topographic map of Canosa (Apulia, Italy).
    SELECT x
    FROM x IN Cell
    WHERE x->num_cell = 26 AND x->part_map->map_name = "Canosa"
The Map Descriptor generates the description of all the objects in this cell.

Task-relevant data specification
Example 2: layer-level query
The user selects the layer Orography from the topographic map of Canosa and the layer Construction from any map.
    SELECT x, y
    FROM x IN Orography, y IN Construction
    WHERE x->part_map->map_name = "Canosa"
The Map Descriptor generates the description of the objects in these layers.

Task-relevant data specification
Example 3: object-level query
The user selects the objects of the logic class River and the objects of type motorway (instances of the class Road) from cell 26 of the topographic map of Canosa.
    SELECT x, y
    FROM x IN River, y IN Road
    WHERE x->part_map->map_name = "Canosa"
      AND y->part_map->map_name = "Canosa"
      AND x->log_incell->num_cell = 26
      AND y->log_incell->num_cell = 26
      AND y->type_road = "motorway"
The Map Descriptor generates the description of these objects.

Task-relevant data specification
Example 4: semantically ambiguous query
    SELECT x, y
    FROM x IN Cell, y IN River
    WHERE x->num_cell = 26 AND y->log_incell->num_cell = 26
This query selects the object cell 26 and all rivers in it. However, it is unclear whether the Map Descriptor should describe
1. the entire cell 26 (in that case, formulate a cell-level query), or
2. only the rivers in it (in that case, formulate an object-level query), or
3. both (an unusual case; anyway, the problem can be solved by the UNION operator, applied to the cell-level query and the object-level query).

Task-relevant data specification
The following constraint is imposed on SDMOQL: the selected data must belong to the same level (cell, layer or logic object). More formally, the FROM clause can contain either a set of Cells, or a set of Layers, or a set of Logic Objects, but never a mixture of them.
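The same-level constraint on the FROM clause can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: the `LEVEL_OF_CLASS` catalogue and the function `from_clause_level` are hypothetical, and the assignment of class names to levels is a guess based on the examples in the slides, not the actual INGENS schema.

```python
from typing import Dict, Iterable

# Hypothetical catalogue mapping class names to their level (assumed, not from INGENS).
LEVEL_OF_CLASS: Dict[str, str] = {
    "Cell": "cell",
    "Hydrography": "layer",
    "Construction": "layer",
    "River": "logic object",
    "Road": "logic object",
}


def from_clause_level(class_names: Iterable[str]) -> str:
    """Return the single level a FROM clause ranges over, or raise ValueError."""
    levels = {LEVEL_OF_CLASS[name] for name in class_names}
    if len(levels) != 1:
        raise ValueError(
            f"FROM clause must range over exactly one level, found {sorted(levels)}"
        )
    return levels.pop()


print(from_clause_level(["River", "Road"]))  # object-level query, as in Example 3
try:
    from_clause_level(["Cell", "River"])     # the ambiguous mix of Example 4
except ValueError as error:
    print("rejected:", error)
```

Rejecting the mixed clause early forces the user to choose between a cell-level and an object-level query, or to combine the two with UNION as suggested in Example 4.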
The kind of knowledge to be mined
    <Spatial_Data_Mining_Statement> ::= <Limited_OQL_Query> mine <Kind_of_Pattern>
    <Kind_of_Pattern> ::= <Classification_Rules> | <Association_Rules>
    <Classification_Rules> ::= classification as <Pattern_Name>
                               for <Classification_Concept> {, <Classification_Concept>}
                               [analyze <Descriptor> {, <Descriptor>}]
The analyze clause indicates that the description of the selected data is based on the spatial/aspatial descriptors in the list.

Example
    SELECT x FROM x in Cell
    WHERE x->num_cell >= 5 AND x->num_cell <= 12
    mine classification as MorphologicalElements
    for class(_)=system_of_farms, class(_)=fluvial_landscape
    analyze contain/2, type_of/1, subtype_of/1, area/1, density/1, extension/1, line_shape/1, geographic_direction/1, line_to_line/2, distance/2, line_to_region/2, region_to_region/2, point_to_region/2

Defining background knowledge
In SDMOQL the BK is defined as a set of definite clauses.
Example:
    define knowledge
    close_to(X,Y)=true :- region_to_region(X,Y)=meet.
    close_to(X,Y)=true :- close_to(Y,X)=true.

Defining schema hierarchies
Define a total or partial order among attributes in the database schema.
Example:
[Figure: hierarchy tree. Root: Activity; children: business_activity, other_activity; children of business_activity: low_business_activity, high_business_activity.]
    define hierarchy Activity as
    level1:{business_activity, other_activity} < level0: Activity;
    level2:{low_business_activity, high_business_activity} < level1: business_activity;

Defining set-grouping hierarchies
Organize values for given attributes or dimensions into groups of constants or ranges of values.
Example:
[Figure: hierarchy tree. Root: Distance; children: near (0 m … 1,999 m), far (2 km … +inf).]
    define hierarchy Distance for distance/2 as
    level1:{far, near} < level0: Distance;
    level2:{0, 1999} < level1: near;
    level2:{2000, +inf} < level1: far;

Interestingness measure specification
- Threshold values: e.g.
  the user can set thresholds such as confidence and support as follows:
      ThresholdParameter threshold Value
- Search biases in the hypothesis space: the user can specify a number of preference criteria, such as maximization of the number of covered examples or minimization of the number of variables in the body of a learned clause, according to the following syntax:
      preference criteria (minimize | maximize) Criterion with tolerance Value
- Generic input parameters of a data mining algorithm:
      ParameterName = Value

An example
Problem: localize a “sistema poderale” (system of farms) in Apulian maps.
The user browses the maps with INGENS and finds some examples of systems of farms …

An example: the data
… and some counterexamples.

An example: the DM query
Formulate a data mining task through SDMOQL:
    SELECT x FROM x in Cell
    WHERE (x->num_cell >= 1 AND x->num_cell <= 6)
       OR x->num_cell = 11
       OR x->num_cell = 34
       OR (x->num_cell >= 15 AND x->num_cell <= 17)
    mine classification as MorphologicalElements
    for class(X)=system_of_farms
    analyze contain/2, type_of/1, subtype_of/1, color/1, altitude/1, area/1, density/1, extension/1, line_shape/1, geographic_direction/1, line_to_line/2, distance/2, line_to_region/2, region_to_region/2, point_to_region/2
    with preference criteria
        minimize negative_example_covered with tolerance 0.6,
        maximize positive_example_covered with tolerance 0.4,
        minimize cost with tolerance 0.4
    number_of_rules threshold 15,
    consistent threshold 500

An example: the process
[Figure: process diagram. Labels: query of spatial data mining, map descriptor, object-oriented DBMS, object-oriented database, symbolic descriptions, deductive database, data mining algorithms, discovered knowledge, visualization.]

An example: results
    class(S1)=system_of_farms :-
        contain(S1,S2)=true, region_to_region(S2,S3)=meet, area(S2) in [68437.5 ..
        187525], region_to_region(S2,S4)=disjoint, region_to_region(S4,S3)=meet,
        type_of(S1)=cell, type_of(S2)=parcel, type_of(S4)=parcel, type_of(S3)=parcel
There are two pairs of adjacent parcels, (S2, S3) and (S4, S3), one of which is relatively large (its area is between 68437.5 and 187525 m²).

An example: results
    class(S1)=system_of_farms :-
        contain(S1,S2)=true, region_to_region(S2,S3)=disjoint, density(S3)=high,
        region_to_region(S2,S4)=meet, region_to_region(S4,S5)=meet, region_to_region(S2,S5)=meet,
        type_of(S1)=cell, area(S2) in [12381.2 .. 25981.2], type_of(S2)=parcel
There are three adjacent regions (S2, S4, S5), one of which is certainly a medium-sized parcel (its area is between 12381.2 and 25981.2 m²), and there is a fourth region (S3) with a high density (presumably vegetation), disjoint from the parcel S2.

An example: use of results
The user asks INGENS to find all cells in the Canosa map that are classified as system of farms and contain a main road.
    SELECT C
    FROM M in Map, C in Cell, R in Road
    WHERE M->name = "Canosa"
      AND C->map = M
      AND R->log_incell = C
      AND R->type_road = "main_road"
      AND class(C) = system_of_farms
To check the condition defined by the predicate class(C)=system_of_farms, the Query Interpreter generates the symbolic description of each cell in the map and asks the Query Engine of the Deductive Database to prove the goal class(C)=system_of_farms given the logic program previously learned.

Conclusions and future work
- A query language for spatial data mining based on OQL.
- A solution to the problem of integrating different technologies (OODBMS, deductive database, DM, …).
- Differences with respect to traditional DMQL.
- Implementation of the interpreter in INGENS.
Future work:
- Extension of the set of descriptors automatically extracted from a vectorized map.
- Extension to other spatial data mining tasks supporting the quantitative interpretation of maps.
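The proof step mentioned under "use of results", where the goal class(C)=system_of_farms is checked against the symbolic description of a cell, can be sketched in Python as a small backtracking matcher. This is an assumption-laden illustration, not the Query Engine of the Deductive Database: the names `Atom`, `match`, and `is_system_of_farms` are hypothetical, the rule body is a simplified version of the first learned rule (the area interval and some type_of literals are omitted), and the cell facts are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...], str]  # (descriptor, arguments, value)

# Simplified body of the first learned rule; variables start with an uppercase letter.
RULE_BODY: List[Atom] = [
    ("contain", ("S1", "S2"), "true"),
    ("region_to_region", ("S2", "S3"), "meet"),
    ("region_to_region", ("S2", "S4"), "disjoint"),
    ("region_to_region", ("S4", "S3"), "meet"),
    ("type_of", ("S2",), "parcel"),
]


def is_var(term: str) -> bool:
    return term[:1].isupper()


def match(body: List[Atom], facts: List[Atom],
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Backtracking search for a substitution that turns every body literal into a fact."""
    if not body:
        return binding
    name, args, value = body[0]
    for f_name, f_args, f_value in facts:
        if f_name != name or f_value != value or len(f_args) != len(args):
            continue
        new = dict(binding)
        consistent = True
        for term, const in zip(args, f_args):
            if is_var(term):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, const) != const:
                    consistent = False
                    break
            elif term != const:
                consistent = False
                break
        if consistent:
            result = match(body[1:], facts, new)
            if result is not None:
                return result
    return None


def is_system_of_farms(cell: str, facts: List[Atom]) -> bool:
    return match(RULE_BODY, facts, {"S1": cell}) is not None


# Invented description of a cell c1 with three parcels.
cell_facts: List[Atom] = [
    ("contain", ("c1", "p1"), "true"),
    ("type_of", ("p1",), "parcel"),
    ("region_to_region", ("p1", "p2"), "meet"),
    ("region_to_region", ("p1", "p3"), "disjoint"),
    ("region_to_region", ("p3", "p2"), "meet"),
]
print(is_system_of_farms("c1", cell_facts))
```

A real deductive engine would also handle numeric intervals such as the area constraint, background knowledge clauses, and possibly a requirement that distinct variables denote distinct objects; none of that is modelled here.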