Spatial Association Rules - Artificial Intelligence Group

Data Mining Query Languages
Donato Malerba
Dipartimento di Informatica
Università degli studi di Bari
malerba@di.uniba.it
http://www.di.uniba.it/~malerba/
A database perspective on
KDD
Most current KDD systems offer isolated
discovery features using tree inducers,
neural nets, and rule discovery algorithms.
They cannot be embedded into a large
application and typically offer just one
knowledge discovery feature.
The same is true of OLAP tools.
This is the first generation of KDD tools.
DMQL – Prof. D. Malerba
Short term research program
 Efficient DM algorithms on top of large
databases and utilizing the existing DBMS
support
Example:
1. Realization of C4.5 on top of a large database requires
tighter coupling with the DBMS and intelligent use of
indexing techniques.
2. Exploitation of caching techniques for association rule
mining
3. Exploitation of special indexing techniques for
clustering
See IBM’s Intelligent Miner
Long term research program
 KDD should follow one of the key DBMS
paradigms: building interpreters for query
languages and compilers for ad hoc queries
and embedding queries in application
programming interfaces (API)
 Focus: increasing programmer productivity for
KDD application development
Knowledge and Data Discovery Management Systems
(KDDMS) are the second generation KDD systems.
Imielinski & Mannila’s view
 KDD object
 Rule: probabilistic formula or multidimensional
correlation
X.Diagnosis=“heart disease” and X.Age <50 → X.BMI > 29 [300, 0.80]
 Classifier: decision trees, neural network,
multidimensional regression
 Clustering: collection of objects
 KDD query: a predicate which returns a set of
objects that can either be KDD objects or
database objects (records or tuples)
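A minimal sketch of how the two numbers attached to a rule could be computed, assuming the first is the support (the number of records to which the rule body applies) and the second is the confidence (the fraction of those records that also satisfy the consequent). The helper name rule_stats and the toy patient records are hypothetical and only serve as illustration.

```python
from typing import Callable, Iterable

Record = dict  # one patient record: attribute name -> value


def rule_stats(records: Iterable[Record],
               body: Callable[[Record], bool],
               consequent: Callable[[Record], bool]) -> tuple:
    """Return (support, confidence) of the rule body -> consequent."""
    # Support: how many records satisfy the rule body.
    matched = [r for r in records if body(r)]
    support = len(matched)
    if support == 0:
        return 0, 0.0
    # Confidence: share of body-matching records that satisfy the consequent.
    hits = sum(1 for r in matched if consequent(r))
    return support, hits / support


# Hypothetical toy data, not taken from the slides.
patients = [
    {"Diagnosis": "heart disease", "Age": 45, "BMI": 31},
    {"Diagnosis": "heart disease", "Age": 48, "BMI": 27},
    {"Diagnosis": "heart disease", "Age": 62, "BMI": 33},
    {"Diagnosis": "flu", "Age": 30, "BMI": 22},
]

support, confidence = rule_stats(
    patients,
    body=lambda r: r["Diagnosis"] == "heart disease" and r["Age"] < 50,
    consequent=lambda r: r["BMI"] > 29,
)
```

On this toy data the body applies to two records, one of which satisfies the consequent.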
Imielinski & Mannila’s view





The KDD objects typically will not exist a priori, thus
querying the KDD objects requires their generation at
run time.
KDD objects may also be pre-generated and stored in
an “inductive” database, such as metadata.
In such cases querying can be reduced to retrieval.
KDDMS should be able to persistently store and
manage the KDD objects as well as provide the ability
to query them.
Querying involves
 The generation of new KDD objects
 Retrieval of the ones which were generated before
Imielinski & Mannila’s view
 Closure principle: the result of a query is a
relation that can be queried further.
 A result of a KDD query may be an argument
of another compatible type of KDD query.
 In principle a KDD query can be nested within
a regular relational query.
 KDD queries can be embedded in a host
programming environment just as SQL queries
can be embedded in host languages.
Imielinski & Mannila’s view



Generate a decision tree on a user-defined training set
(specified through a database query) with user-defined
attributes and user-specified classification categories.
Then find all records in the database that this classifier
misclassifies and use them as training data for another
classifier.
Generate all rules with consequent values computed
by an SQL query (KDD queries may not be completely
known at compile time!).
Find tuples that belong to the largest cluster in a
clustering constructed according to a user-specified
distance metric.
Imielinski & Mannila’s view
Research program:
1. A KDD query language has to be formally defined.
2. Query optimization tools have to be developed to
compile queries into reasonably efficient execution
plans.
Very challenging!
KDD queries are much more powerful than SQL
queries
Imielinski & Mannila’s view
Example:
Patient(Age, Sex, City, Diagnosis, Height, Weight,
ClaimAmount, …)
City(State, Population, …)
X.Diagnosis=“heart disease” and Sex=“male” →
X.Age>50 [1200,0.70]
The user wants to see all the rules about a patient with
heart disease such that the consequent of the rule
says something about the age of the patient, there are
at least 1,000 cases to which the rule body applies, and
the confidence of the rule is at least 65%.
Imielinski & Mannila’s view
In M-SQL (Imielinski et al., Proc. KDD’96)
SELECT
FROM MINE(T):R
WHERE R.Body={(Diagnosis=“heart disease”)} AND
R.Consequent = {(Age=*)} AND
R.Support > 1000 AND
R.Confidence > 0.65
R renames MINE(T)
MINE(T) is an operator that takes a class T and generates
all propositional rules about T
Rule discovery: Another type of querying!
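To make the idea of rule discovery as querying concrete, the sketch below treats the output of MINE(T) as an ordinary collection of rule objects and applies the WHERE conditions as a filter. It is only an illustration, not M-SQL itself: the Rule class, the select_rules helper and the sample rules are hypothetical, and the Body condition is read as "contains these conditions" (an assumption) so that a rule whose body also mentions Sex still qualifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset      # set of (attribute, value) conditions
    consequent: tuple    # (attribute, condition) pair
    support: int         # number of cases the body applies to
    confidence: float


def select_rules(rules, required_body, consequent_attr,
                 min_support, min_confidence):
    """Filter a rule collection the way the WHERE clause above does."""
    required = frozenset(required_body)
    return [
        r for r in rules
        if required <= r.body                    # assumed reading of R.Body
        and r.consequent[0] == consequent_attr   # R.Consequent = {(Age=*)}
        and r.support > min_support              # R.Support > 1000
        and r.confidence > min_confidence        # R.Confidence > 0.65
    ]


# Hypothetical stand-in for the result of MINE(T).
mined = [
    Rule(frozenset({("Diagnosis", "heart disease"), ("Sex", "male")}),
         ("Age", ">50"), 1200, 0.70),
    Rule(frozenset({("Diagnosis", "heart disease")}), ("Age", "<40"), 800, 0.90),
    Rule(frozenset({("Diagnosis", "heart disease")}), ("BMI", ">29"), 3000, 0.80),
    Rule(frozenset({("Diagnosis", "flu")}), ("Age", "<20"), 5000, 0.70),
]

selected = select_rules(mined, {("Diagnosis", "heart disease")}, "Age", 1000, 0.65)
```

Because the result is again a plain collection of rules, it can be filtered or iterated further by the host program, which is the point of the closure principle.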
Imielinski & Mannila’s view
Rules are not necessarily the final product of KDD
applications.
A proper API, which embeds a rule query
language in a more expressive, general
purpose, host programming environment is
necessary.
 Iterate over a collection of rules
KDD query languages
Imielinski, Virmani, Abdulghani. Discovery board application
programming interface and query language for database mining.
Proc. KDD96
Imielinski and Virmani. MSQL: A query language for database mining.
Journal of Data Mining and Knowledge Discovery, 3(4), 1999.
Meo, Psaila, and Ceri. A new SQL-like operator for mining association
rules. Proc. VLDB, 1996.
Han, Fu, Koperski, Wang, and Zaiane. DMQL: A Data Mining Query
Language for Relational Databases. Proc. SIGMOD'96 Workshop on
Research Issues on Data Mining and Knowledge Discovery
(DMKD'96), 1996.
Shen, Ong, Mitbander, and Zaniolo. Metaqueries for Data Mining. In:
Fayyad et al. Advances in Knowledge Discovery and Data Mining,
AAAI Press, 1996.
KDD query languages
Giannotti, Manco. Querying Inductive Databases via Logic-Based User-Defined Aggregates. PKDD 1999
De Raedt. An Inductive Logic Programming Query Language for Database
Mining. AISC 1998
De Raedt. A Logical Database Mining Query Language. ILP 2000
De Raedt. Query execution and optimization for inductive databases. Proc.
EDBT Workshop on Database Technologies for Data Mining, 2002
Boulicaut, Klemettinen, Mannila. Querying inductive databases: a case
study on the MINE RULE operator. In: Proceedings of the Second
European Symposium on Principles of Data Mining and Knowledge
Discovery PKDD'98, LNAI 1510, 1998
Elfeky, Saad, Fouad. ODMQL: Object Data Mining Query Language. In
Dittrich et al. (eds), Objects and Databases 2000, LNCS 1944, 2001
Johnson, Lakshmanan, Ng. The 3w model and algebra for unified data
mining. Proc. VLDB, 1998
KDD query languages
Han, Koperski, Stefanovic. GeoMiner: A System Prototype for
Spatial Data Mining. SIGMOD Conference 1997
Malerba, Appice, Ceci, Vacca. SDMOQL: An OQL-based Data
Mining Query Language for Map Interpretation. Proc. EDBT
Workshop on Database Technologies for Data Mining, 2002
DMQL: just some syntactic
sugar on top of DM algorithms?
 A user can formulate a DM task without paying attention
to
Logical and physical representation problems
The correct procedural order in which some DM steps should be
performed
 The development of decision support applications is
easier, just as SQL makes the implementation of operational
information systems easy
 A casual user can find patterns by means of a DMQL in
the same way he can find data by means of an SQL
query: no development of ad hoc applications
 A DMQL provides a foundation on which a GUI can be
built
Spatial Data Mining


Spatial Data Mining: the extraction of spatial
patterns from both spatial and aspatial data,
possibly stored in a spatial database
Spatial Pattern: a pattern showing the interaction
of two or more spatial objects or space-dependent
attributes according to a particular spacing or set
of arrangements
IF a large town intersects the motorway A14
THEN it is also close to the Adriatic sea (13%, 90%)
Spatial Data Mining & GIS
Geographical Information Systems (GIS) offer an
important application area where spatial data mining
techniques can be effectively used
Example: topographic map interpretation
Interpreting Topographic Maps
 Topographic map: large scale
(1:10000 to 1:100000) composite
map showing relief, vegetation and
man-made features of a portion of
a land surface.
 Interpreting the colored lines,
areas, and other symbols is the first
step in using topographic maps.
 Easy! Symbols correspond univocally to concepts
explicitly modelled by the map creator.
 Difficult! locating in a map some geographical objects not
explicitly modelled (e.g., industrial area)
Interpreting Topographic Maps
 Solution: embedding intelligent capabilities in geo-based
tools
 Knowledge-based GIS use
spatial reasoning capabilities
available domain knowledge
to support map interpretation
 But operational definitions of some complex concepts
are difficult to elicit
are not portable on different data models
depend on the scale of the map
Data Mining to Support Map
Interpretation Tasks
Data Mining tools and techniques to find
spatial patterns of interest.
INGENS (INductive GEographic iNformation
System) = GIS + Data Mining Server + …
Training functionality
The user can train the system by providing
instances of geographical objects to be
recognized in a map
INGENS Architecture
Interface Layer
 GUI (Web Browser)
Application Enabler
 Map Converter
 Map Descriptor
 Map Editor
 Data Mining Server
 Query Interpreter
Resource Manager
 Map Storage Subsystem (ObjectStore DBMS), with the Map Repository
 Deductive DBMS, with the Knowledge Repository

Role of the components:
 Interface layer: implements a GUI, which is a Java applet.
 Map Converter: permits the import/export of maps.
 Map Editor: a suite of tools for the integration and/or modification
of the information acquired by means of the Map Converter.
 Query Interpreter: allows any user to formulate queries in the
SDMOQL language.
 Map Descriptor: responsible for the automated generation of
first-order logic descriptions of some geographical objects.
 Data Mining Server: a suite of data mining systems that can be run
concurrently by multiple users to train INGENS.
 Map Storage Subsystem: involved in storing, updating and retrieving
items; it is the only access path to the data contained in the
Map Repository.
 Deductive DBMS: manages the discovered patterns.
The data model for the map
repository
Hybrid tessellation-topological model
Tessellation model: a map is decomposed
according to a regular grid of cells
Topological model has two structural hierarchies:
physical (describes the geographical objects by means
of the most appropriate geometric entity);
logical (expresses the semantics of geographical
objects).
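A minimal sketch of how the hybrid model could be represented in code, assuming a map is a collection of numbered grid cells, each cell holds logical objects, and each logical object is realized by one or more physical objects. All class and field names are hypothetical except num_cell, which is the attribute used in the SDMOQL queries; the actual INGENS schema is richer.

```python
from dataclasses import dataclass, field
from enum import Enum


class Geometry(Enum):
    """Geometric entities of the physical hierarchy."""
    POINT = "point"
    LINE = "line"
    REGION = "region"


@dataclass
class PhysicalObject:
    geometry: Geometry
    vertices: list  # list of (x, y) coordinates


@dataclass
class LogicalObject:
    layer: str   # e.g. "Hydrography"
    kind: str    # e.g. "River"
    parts: list = field(default_factory=list)  # PhysicalObject instances


@dataclass
class Cell:
    num_cell: int
    objects: list = field(default_factory=list)  # LogicalObject instances


@dataclass
class Map:
    name: str
    cells: dict = field(default_factory=dict)  # num_cell -> Cell


# Tessellation: the map is decomposed into numbered cells.
canosa = Map("Canosa")
canosa.cells[26] = Cell(26)

# Topological model: a logical object (semantics) and its geometry.
river = LogicalObject("Hydrography", "River",
                      [PhysicalObject(Geometry.LINE, [(0, 0), (40, 25), (90, 30)])])
canosa.cells[26].objects.append(river)
```

Keeping the logical and physical hierarchies in separate classes mirrors the slide's distinction between what an object means and how it is drawn.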
The object-oriented data model
in UML
[UML class diagram; the recoverable labels are listed below.]
Core classes: Map (with a "Lower scale" association), Gif, Grid, Cell,
Logical Object, Physical Object
Structures: Logical structure, Physical structure, Representation
Geometric entities: Point, Line, Line vertex, Region, Boundary
Cell adjacency: N/NE/NW/S/SE/SW/E/W
Topological relations: Disjoint/Meet/Overlap/Contains/Equal/Covers,
Inside/Border
Categories: Hydrography, Orography, Land Administration, Vegetation,
Administrative Boundary, Ground Transportation Net., Construction,
Built-up Area
Logical classes: River, Lake, Canal, Font, Sea, Parcel, Contour Slope,
Park, Slope, Cultivation, Level point, Forest, City, Road, Province,
County, Ropeway, State, Railway, Building, Airport, Bridge, Wall,
Hamlet, Power Station, Town, Factory, Chief Town, Boat Station,
Regional Capital, Capital, Deposit
Different technologies: what
support for the user?
 Problem: The user should not suffer from problems
related to the integration of different technologies, such
as
Data mining
OODBMS
Deductive databases
GIS
 Solution: A data mining query language (DMQL)
interfaces users with the whole system and hides the
different technologies.
SDMOQL
 DMQL is the data mining query language defined by Han
et al. (1996) for relational databases
 GMQL (Geo Mining Query Language) is a language for
spatial data mining, based on DMQL (Koperski 1999)
 Both are inspired by SQL and the relational model → not
appropriate for an OO information system like INGENS
 SDMOQL (Spatial Data Mining Object Query Language)
is a spatial mining query language for INGENS users
based on OQL
Data Mining primitives
A DMQL must incorporate a set of DM primitives
designed to facilitate efficient, fruitful knowledge
discovery.
Primitives include:
The specification of portions of the database in which
the user is interested;
The kinds of knowledge to be mined
Background knowledge useful in guiding the
discovery process;
Interestingness measures of pattern evaluation
How the discovered knowledge should be visualized
Task-relevant data specification
In traditional DM applications, it is sufficient to specify
 Database attributes or
 Data warehouse dimensions
since:
1. No interaction between objects is assumed, so that
each object can be effectively described by a tuple in
the relation
Not in spatial data mining, where attributes of the
neighbors of some spatial object of interest may
influence the object itself.
 The data set to mine cannot be straightforwardly
represented by means of a relational table, where
distinct tuples refer to distinct, independent objects.
2. No complex transformation of stored data is
required
Not in spatial data mining, where working at the level of
stored data, that is, geometric representations (points,
lines and regions) of geographic objects, is undesirable.
 The user is interested in working at higher conceptual
levels, where human-interpretable properties and
relations between geographical objects are expressed.

Example
Two roads can cross each other, or run parallel,
or can be confluent, independently of the fact
that they are represented by one or more tuples
of a relational table of “lines” or “regions”
A solution
The SDMOQL interpreter allows the user to select the
geographical objects that are relevant to the
data mining task, and then it invokes the Map
Descriptor to produce their high-level conceptual
descriptions.
Conceptual descriptions are based on a first-order
logic language, where both properties and
relations of selected geographical objects can be
easily represented.
Example
SELECT x FROM x IN Cell
WHERE x->num_cell = 11
contain(x1,x2)=true, …, contain(x1,x70)=true,
type_of(x1)=cell, …, type_of(x4)=vegetation,…,
subtype_of(x2)=cultivation,…, subtype_of(x7)=cart_track_road,…,
color(x2)=black, …, color(x70)=black,
extension(x7)=111.018,…, extension(x33)=1104.74,
geographic_direction(x7)=north, …, geographic_direction(x68)=north,
line_shape(x7)=straight,…, line_shape(x33)=cuspidal,…,
altitude(x19)=106.00,…, altitude(x43)=102.00,
area(x2)=187525.00, …, area(x62)=30250.00,
density(x2)=high, …, density(x62)=low,
line_to_line(x7,x68)=almost_parallel, …, region_to_region(x2,x21)=meet,…,
distance(x7,x68)=5.00, line_to_region(x8,x27)=adjacent, …,
point_to_region(x4,x18)=outside,…
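The sketch below illustrates, under strong simplifying assumptions, how a descriptor generator could turn stored geometry into ground facts like the ones above. It is not the INGENS Map Descriptor: regions are reduced to axis-aligned rectangles, only three values of region_to_region (disjoint, meet, overlap) are distinguished, and the Region class and describe function are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    ident: str   # constant used in the description, e.g. "x2"
    kind: str    # value reported by type_of
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def region_to_region(a: Region, b: Region) -> str:
    """Classify two rectangles as disjoint, meet (shared border) or overlap."""
    if a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin:
        return "disjoint"
    x_overlap = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    y_overlap = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    # A zero-width intersection means the rectangles only touch.
    if x_overlap == 0 or y_overlap == 0:
        return "meet"
    return "overlap"


def describe(cell_id: str, regions: list) -> list:
    """Produce ground facts for a cell and the regions it contains."""
    facts = [f"type_of({cell_id})=cell"]
    for r in regions:
        facts.append(f"contain({cell_id},{r.ident})=true")
        facts.append(f"type_of({r.ident})={r.kind}")
        facts.append(f"area({r.ident})={r.area:.2f}")
    for a, b in combinations(regions, 2):
        facts.append(
            f"region_to_region({a.ident},{b.ident})={region_to_region(a, b)}")
    return facts


# Hypothetical geometry for three parcels in one cell.
facts = describe("x1", [
    Region("x2", "parcel", 0, 0, 100, 50),
    Region("x3", "parcel", 100, 0, 150, 50),
    Region("x4", "parcel", 200, 200, 250, 250),
])
```

The output is a flat list of facts about one cell, so the properties of an object and its relations to its neighbors end up in the same description.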
Describing topographic maps
 33 geographical objects: contour_slope, slope, river, canal,
primary_road, farm_road, interfarm_road, main_road, …
 16 descriptors: contain(x, y), type_of(y), subtype_of(y),
color(y), area(y), density(y), extension(y),
geographic_direction(y), line_shape(y), altitude(y),
line_to_line(y,z), distance(y,z), region_to_region(y,z),
line_to_region(y,z), point_to_region(y,z)
 Defined together with town planners, the set of descriptors
is quite general and can capture geometric, topological and
directional features of geographical objects in a topographic
map.
Task-relevant data specification
 In SDMOQL the selection of geographical objects is
performed by means of simplified OQL queries with a
SELECT-FROM-WHERE structure.
 Example 1: cell-level query
The user selects cell 26 from the topographic map of Canosa
(Apulia, Italy)
SELECT x
FROM x IN Cell
WHERE x->num_cell = 26 AND x->part_map->map_name =
“Canosa”
The Map Descriptor generates the description of all the
objects in this cell.
Task-relevant data specification
 Example 2: layer-level query
The user selects the layer Orography from the
topographic map of Canosa and the layer Construction
from any map.
SELECT x, y
FROM x IN Orography, y IN Construction
WHERE x->part_map->map_name = “Canosa”
The Map Descriptor generates the description of the objects
in these layers.
Task-relevant data specification
 Example 3: object-level query
The user selects the objects of the logic class River and the objects
of type motorway (instances of the class Road), from cell 26 of
the topographic map of Canosa.
SELECT x, y
FROM x IN River, y IN Road
WHERE x->part_map->map_name = “Canosa” AND
y->part_map->map_name = “Canosa” AND
x->log_incell->num_cell = 26 AND
y->log_incell->num_cell = 26 AND
y->type_road = “motorway”
The Map Descriptor generates the description of these objects.
Task-relevant data specification

Example 4: Semantically ambiguous query
SELECT x, y
FROM x IN Cell, y IN River
WHERE x->num_cell = 26 AND
y->log_incell->num_cell = 26
This query selects the object cell 26 and all rivers in it. However, it is
unclear whether the Map Descriptor should describe
1. the entire cell 26 → Formulate a cell-level query
2. only the rivers in it → Formulate an object-level query
3. both → (unusual) case; anyway the problem can be
solved by the UNION operator, applied to
the cell-level query and the object-level
query.
Task-relevant data specification
The following constraint is imposed on SDMOQL:
the selected data must belong to the same level (cell, layer or
logic object).
More formally the FROM clause can contain either a group of
Cells or a set of Layers, or a set of Logic Objects, but
never a mixture of them.
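The same-level constraint can be sketched as a small check on the class names that appear in a FROM clause. This is an illustration only: the function name from_clause_level is hypothetical, and the sets of layers and logical classes are limited to names used in the examples, whereas the real interpreter would consult the database schema.

```python
# Assumed, deliberately partial catalogues of class names.
LAYERS = {"Orography", "Construction"}
LOGICAL_OBJECTS = {"River", "Road"}


def from_clause_level(class_names):
    """Return the single level used in a FROM clause, or raise ValueError."""
    levels = set()
    for name in class_names:
        if name == "Cell":
            levels.add("cell")
        elif name in LAYERS:
            levels.add("layer")
        elif name in LOGICAL_OBJECTS:
            levels.add("logical object")
        else:
            raise ValueError(f"unknown class in FROM clause: {name}")
    if not levels:
        raise ValueError("empty FROM clause")
    if len(levels) > 1:
        raise ValueError(f"FROM clause mixes levels: {sorted(levels)}")
    return levels.pop()


# Example 3 (object-level) is accepted; Example 4 (Cell with River) is rejected.
level = from_clause_level(["River", "Road"])
try:
    from_clause_level(["Cell", "River"])
    rejected = False
except ValueError:
    rejected = True
```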
The kind of knowledge to be
mined
<Spatial_Data_Mining_Statement> ::=
<Limited_OQL_Query>
mine
<Kind_of_Pattern>
<Kind_of_Pattern> ::=
<Classification_Rules> | <Association_Rules>
<Classification_Rules> ::=
classification as <Pattern_Name>
for <Classification_Concept>{,<Classification_Concept>}
[analyze <Descriptor> {, <Descriptor>}]
The analyze clause indicates that the description of the selected data is
based on the spatial/aspatial descriptors in the list
Example
SELECT x
FROM x in Cell
WHERE x->num_cell >= 5 AND x->num_cell <= 12
mine classification as MorphologicalElements
for class(_)=system_of_farms, class(_)=fluvial_landscape
analyze
contain/2, type_of/1, subtype_of/1,
area/1, density/1, extension/1,
line_shape/1, geographic_direction/1,
line_to_line/2, distance/2, line_to_region/2,
region_to_region/2, point_to_region/2
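The descriptors in the analyze clause are written as name/arity pairs. A minimal sketch of how such a list could be parsed and validated is shown below; parse_descriptors is a hypothetical helper, not part of the SDMOQL interpreter.

```python
def parse_descriptors(clause: str) -> dict:
    """Parse 'contain/2, type_of/1' into {'contain': 2, 'type_of': 1}."""
    result = {}
    for item in clause.split(","):
        item = item.strip()
        if not item:
            continue
        # Split each entry into the descriptor name and its arity.
        name, sep, arity = item.partition("/")
        if not sep or not name or not arity.isdigit():
            raise ValueError(f"malformed descriptor: {item!r}")
        result[name] = int(arity)
    return result


descriptors = parse_descriptors("""
    contain/2, type_of/1, subtype_of/1,
    area/1, density/1, extension/1,
    line_shape/1, geographic_direction/1,
    line_to_line/2, distance/2, line_to_region/2,
    region_to_region/2, point_to_region/2
""")
```

Descriptors of arity 1 express properties of a single object and those of arity 2 express relations between two objects.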
Defining background knowledge
 In SDMOQL the BK is defined as a set of definite clauses.
 Example:
define knowledge
close_to(X,Y)=true :- region_to_region(X,Y)=meet.
close_to(X,Y)=true :- close_to(Y,X)=true.
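As a hedged illustration of what these two clauses entail, the sketch below derives all close_to facts bottom-up from a set of region_to_region(X,Y)=meet facts, repeating the clauses until nothing new is added. The function name close_to_pairs and the sample pair are hypothetical; the actual system evaluates the clauses with its own inference engine.

```python
def close_to_pairs(meet_pairs):
    """Return every (X, Y) for which close_to(X,Y)=true is derivable."""
    close = set()
    changed = True
    while changed:
        # Clause 1: close_to(X,Y) holds if region_to_region(X,Y)=meet.
        # Clause 2: close_to(X,Y) holds if close_to(Y,X) holds.
        derived = set(meet_pairs) | {(y, x) for (x, y) in close}
        changed = not derived <= close
        close |= derived
    return close


# Hypothetical input: a single meet fact between two regions.
pairs = close_to_pairs({("x2", "x21")})
```

The second clause makes close_to symmetric even though the meet relation was recorded in one direction only.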
Defining schema hierarchies
 Define a total or partial order among attributes in the database
schema.
 Example:
Activity
  business_activity
    low_business_activity
    high_business_activity
  other_activity
define hierarchy Activity as
level1:{business_activity, other_activity} < level0: Activity;
level2:{low_business_activity,high_business_activity} < level1:
business_activity;
Defining set-grouping
hierarchies
 Organize values for given attributes or dimensions into groups of
constants or range of values
 Example:
Distance
  near: 0 m … 1,999 m
  far: 2 Km … +∞ Km
define hierarchy Distance for distance/2 as
level1:{far, near} < level0: Distance;
level2:{0, 1999} < level1: near;
level2:{2000, +inf} < level1: far;
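A minimal sketch of how this hierarchy could be applied to generalize a numeric distance value to its level-1 label. The function generalize_distance is hypothetical, and the sketch assumes half-open ranges, [0, 2000) for near and [2000, +inf) for far, so that non-integer values between 1999 and 2000 are also covered.

```python
import math

# (exclusive upper bound in metres, level-1 label); assumed half-open ranges.
DISTANCE_GROUPS = [(2000.0, "near"), (math.inf, "far")]


def generalize_distance(metres: float) -> str:
    """Map a distance in metres to the label of its group."""
    if metres < 0:
        raise ValueError("distance cannot be negative")
    for upper, label in DISTANCE_GROUPS:
        if metres < upper:
            return label
    raise ValueError("distance out of range")


# distance(x7,x68)=5.00 from the earlier description generalizes to "near".
label = generalize_distance(5.00)
```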
Interestingness measure
specification
 threshold values: e.g. the user can set thresholds such
as confidence and support as follows:
ThresholdParameter threshold Value
 search biases in the hypotheses space: The user can
specify a number of preference criteria, such as
maximization of the number of covered examples or
minimization of the number of variables in the body of a
learned clause, according to the following syntax:
preference criteria (minimize | maximize ) Criterion
with tolerance Value.
 generic input parameter of a data mining algorithm:
ParameterName = Value
An example
Problem: Localize a “sistema poderale” (system
of farms) in Apulian maps.
The user browses the maps with INGENS and
finds some examples of system of farms …
An example: the data
… and some counterexamples
An example: the DM query
 Formulate a data mining task through SDMOQL:
SELECT x FROM x in Cell
WHERE(x->num_cell>=1 AND x->num_cell<=6) OR x->num_cell=11
OR x->num_cell=34 OR (x->num_cell>=15 and x->num_cell <= 17)
mine classification as MorphologicalElements
for class(X)=system_of_farms
analyze contain/2, type_of/1, subtype_of/1, color/1, altitude/1,
area/1, density/1, extension/1, line_shape/1, geographic_direction/1,
line_to_line/2, distance/2, line_to_region/2,
region_to_region/2, point_to_region/2
with preference criteria
minimize negative_example_covered with tolerance 0.6,
maximize positive_example_covered with tolerance 0.4,
minimize cost with tolerance 0.4
number_of_rules threshold 15, consistent threshold 500
An example: the process
[Process diagram. Labelled elements: query of spatial data mining,
object-oriented DBMS, object-oriented database, Map Descriptor,
symbolic descriptions, data mining algorithms, discovered knowledge,
deductive database, visualization.]
An example: results
class(S1)=system_of_farms ←
contain(S1,S2)=true, region_to_region(S2,S3)=meet,
area(S2)∈[68437.5 .. 187525],
region_to_region(S2,S4)=disjoint,
region_to_region(S4,S3)=meet, type_of(S1)=cell,
type_of(S2)=parcel, type_of(S4)=parcel,
type_of(S3)=parcel
there are two pairs of adjacent parcels (S2, S3) and (S4,
S3), one of which is relatively large (the area is between
68437.5 and 187525 m²)
An example: results
class(S1)=system_of_farms ←
contain(S1,S2)=true, region_to_region(S2,S3)=disjoint,
density(S3)=high, region_to_region(S2,S4)=meet,
region_to_region(S4,S5)=meet, region_to_region(S2,S5)=meet,
type_of(S1)=cell, area(S2)∈[12381.2 .. 25981.2], type_of(S2)=parcel
there are three adjacent regions (S2, S4, S5), one of which is certainly
a medium-sized parcel (the area is between 12381.2 and 25981.2
m²), and there is a fourth region (S3) with a high density
(presumably vegetation), disjoint from the parcel S2
An example: use of results
 The user asks INGENS to find all cells in the Canosa map
that are classified as system of farms and contain a main
road.
SELECT C
FROM M in Map, C in Cell, R in Road
WHERE M->name = “Canosa” AND C->map = M AND R->log_incell = C AND
R->type_road=“main_road” AND class(C) = system_of_farms
 To
check the condition defined by the predicate
class(C)=system_of_farms, the Query Interpreter
generates the symbolic description of each cell in the map
and asks the Query Engine of the Deductive Database to
prove the goal class(C)=system_of_farms given the logic
program previously learned.
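To give a feel for what proving the goal involves, the sketch below checks whether the body of a learned rule can be satisfied by the ground facts describing one cell, using a simple backtracking search over variable bindings. It is a simplified stand-in for the Deductive Database's query engine: the rule body is a shortened, equality-only version of a learned rule (the numeric interval literal is omitted), distinct variables are allowed to denote the same object, and the matches function and the sample facts are hypothetical.

```python
def matches(body, facts, binding=None):
    """Return a variable binding satisfying every body literal, or None.

    body  : list of (descriptor, argument variables, value) literals
    facts : dict mapping (descriptor, argument constants) to a value
    """
    binding = dict(binding or {})
    if not body:
        return binding
    (name, args, value), rest = body[0], body[1:]
    for (f_name, f_args), f_value in facts.items():
        if f_name != name or f_value != value or len(f_args) != len(args):
            continue
        trial = dict(binding)
        # Bind each variable, or check it against its existing binding.
        if all(trial.setdefault(v, c) == c for v, c in zip(args, f_args)):
            result = matches(rest, facts, trial)
            if result is not None:
                return result
    return None  # no fact satisfies this literal: backtrack


# Shortened rule body for class(S1)=system_of_farms.
body = [
    ("contain", ("S1", "S2"), "true"),
    ("type_of", ("S1",), "cell"),
    ("type_of", ("S2",), "parcel"),
    ("region_to_region", ("S2", "S3"), "meet"),
    ("type_of", ("S3",), "parcel"),
]

# Hypothetical symbolic description of one cell.
facts = {
    ("type_of", ("c26",)): "cell",
    ("contain", ("c26", "p1")): "true",
    ("contain", ("c26", "p2")): "true",
    ("type_of", ("p1",)): "parcel",
    ("type_of", ("p2",)): "parcel",
    ("region_to_region", ("p1", "p2")): "meet",
}

# Ask whether cell c26 is classified as a system of farms.
proof = matches(body, facts, {"S1": "c26"})
```

A non-empty binding plays the role of a successful proof; None means the cell does not satisfy this rule.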
Conclusions and future work
 A query language for spatial data mining based on OQL
 A solution to the problem of integrating different
technologies (OODBMS, Deductive database, DM, …)
 Differences with respect to traditional DMQL
 Implementation of the interpreter in INGENS.
Future Work
 Extension of the set of descriptors automatically
extracted from a vectorized map
 Extension to other spatial data mining tasks supporting
quantitative interpretation of maps