Towards New Models and Languages for Data Mining and Integration
Peter Brezany

Institute of Scientific Computing
University of Vienna, Austria
Presentation at the NeSC, Edinburgh
August 13, 2008
Outline
• Introduction
• CRISP-DM Model and Methodology
• What is CRISP-DM
• Why update it
• From CRISP-DM to CRISP-DMI
• Impact of CRISP-DMI on the DMI Workflow Language
• State of the Art in Language Design
• Discussion of the 1st Language Design Ideas
• Conclusions and Future Work
Edinburgh, 13 Aug, 2008
What is CRISP-DM?
Phases of the CRoss Industry Standard Process for Data Mining
CRISP-DM Phases
• Business Understanding: the process of understanding the project objectives from a business perspective
• Data Understanding: the process of collecting and becoming familiar with the data
• Data Preparation: the process of selecting and cleansing the data that will be fed into the modeling tools
• Modeling: the process of applying modeling techniques to the data so that conclusions can be drawn
• Evaluation: the process of evaluating the model and its conclusions
• Deployment: the process of applying the conclusions to a business
Why Update CRISP-DM?
• Support for large-scale data mining
  - many distributed, heterogeneous and large datasets (primary data, derived data, background data, catalogs): from data to a “space of data”
  - data integration is of great importance
  - new actors (domain expert, data analyst, data publisher, system administrator)
  - support by new components (e.g. provenance)
  - etc.
• Our approach: from CRISP-DM to CRISP-DMI (Cross Research & Industry Standard Process for Data Mining and Integration)
CRISP-DMI Model
Space of Data and Services
Author: Ibrahim Elsayed
TCM Workflow
Subworkflow Targeted by Provenance
Visualization of Provenance Data
Authors: Y. Han & F.A. Khan
Use case
(from P. Caron, C. Shearer, Interactive Visual Workflow: The Key to Streamlining the Data Mining Process)
The fields in the data are:
Age:
Sex: M or F
BP: Blood pressure (High, Normal, or Low)
Cholesterol: Blood cholesterol level (Normal or High)
Na: Blood sodium concentration
K: Blood potassium concentration
Drug: The drug to which this patient responded
The business question: Can we find which drug is appropriate for any future patient?
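The record layout of the use case can be sketched as a small Python data structure. This is only an illustration, not part of the original presentation: the class name `PatientRecord`, the helper `split_features_target`, and the chosen field types are assumptions (the slide gives no description for Age, so years as an integer is a guess), and no real patient data is implied.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PatientRecord:
    """One row of the use-case data (hypothetical representation)."""
    age: int           # assumption: age in years; the slide leaves Age undescribed
    sex: str           # "M" or "F"
    bp: str            # blood pressure: "High", "Normal", or "Low"
    cholesterol: str   # blood cholesterol level: "Normal" or "High"
    na: float          # blood sodium concentration
    k: float           # blood potassium concentration
    drug: str          # the drug to which this patient responded (the target)

    def __post_init__(self) -> None:
        # Reject values outside the categories listed on the slide.
        if self.sex not in {"M", "F"}:
            raise ValueError(f"unexpected sex: {self.sex!r}")
        if self.bp not in {"High", "Normal", "Low"}:
            raise ValueError(f"unexpected BP: {self.bp!r}")
        if self.cholesterol not in {"Normal", "High"}:
            raise ValueError(f"unexpected cholesterol: {self.cholesterol!r}")


def split_features_target(record: PatientRecord):
    """Separate the predictor fields from the field to be predicted (Drug)."""
    features = asdict(record)
    target = features.pop("drug")
    return features, target
```

Separating `drug` from the other fields mirrors the business question: a model trained on the remaining fields would be asked to predict the drug for a future patient.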
DmiFlow: DMI Workflow Language
• Emerging DMI applications create demand for a powerful DMI workflow language
• Interactive GUIs can be developed on top of it
• It should enable optimized implementation of language processors
DMI Process to be Composed by DmiFlow
[Figure labels: Composition; Space of Source and Destination Data and Services; DMI Process]
A Possible Position of DmiFlow in the Workflow Management Systems
[Figure labels: User; Visualisation; Feedback; User-relevant information flow; System-relevant information flow; High-level workflow composition (e1, e2, e3); Sub-workflow for e2 (e21, e22, e23, e24); System support; Textual representation; BPEL or other language; UML]
Principles for DMI Language Design
• Programmer Responsibilities
  - Identification of parallelism
  - Specifying the communication mode between workflow components
  - Providing hints (sometimes based on domain knowledge) that enable advanced optimization
• Language Desiderata
  - High abstraction level, not too complex (high productivity)
  - Advanced compositional features
  - Execution of data mining queries (support for the inductive database model)
  - Extensibility
  - Efficient implementation (high performance)
Related Work
• Low-level workflow notations:
  - XML-based: BPEL4WS, DSCL, WSFL, etc.
  - Other: SCUFL (Taverna), MoML (Kepler), etc.
• High-level languages (only for workflows integrating business processes):
  - Workflow Prolog
  - Valmont: includes a process model, an information model, and an organization model (which registers organizational structure and resources)
  - C & Co: a C-based language
  - F#: functional workflow specification at a script level (Microsoft development)
  - Martlet: functional workflow specification
• Compositional languages (Strand, PCN, etc.)
Workplan for the Language Design
• Phase 1 (ongoing): proposing the semantic structure and outlining the compositional structure of programs, while leaving open some aspects of their concrete representation as strings of symbols.
• Phase 2: finalizing the first version of the language definition.
Basic Features of DmiFlow
• Code modules – managing complexity
• Activities: their types, parameters, locations
• Virtual communication channels between activities, which can be represented by
  - Persistent explicit datasets
  - Internal datasets (implementation dependent)
  - Ports used for streaming data
• Control structures (parallel and sequential statements, loop statements, conditional statements)
• Embedded data mining query execution
Declaration of Activities and Datasets
activity activity_name: ActivityType at (activity_location);
    ActivityType – predefined (type of parameters and semantics)
    activity_location ∊ {url, discover, default}  (this is optional)

dataset dataset_name represents (source = source_spec, hints_list);
    source_spec ∊ {url, internal, port}
    hint ∊ {org = dataset_organization, size = estimated_size, …}
    dataset_organization ∊ {set, sequence, bag, …}
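The two declaration forms above can be mirrored by a small in-memory model, which may help when thinking about what a DmiFlow language processor has to record. The following Python sketch is an illustration under assumptions, not the DmiFlow implementation: the class names `ActivityDecl` and `DatasetDecl`, their methods, and the choice to model an omitted location as `None` are all hypothetical, and only the `org` and `size` hints named on the slide are covered.

```python
from dataclasses import dataclass
from typing import Optional

# Keywords taken from the slide; any other value is treated as a URL.
LOCATION_KEYWORDS = {"discover", "default"}
SOURCE_KEYWORDS = {"internal", "port"}


@dataclass(frozen=True)
class ActivityDecl:
    """activity activity_name: ActivityType at (activity_location);"""
    name: str
    activity_type: str
    location: Optional[str] = None  # assumption: None models an omitted location

    def location_kind(self) -> Optional[str]:
        if self.location is None:
            return None
        return self.location if self.location in LOCATION_KEYWORDS else "url"

    def render(self) -> str:
        at = f" at ({self.location})" if self.location is not None else ""
        return f"activity {self.name}: {self.activity_type}{at};"


@dataclass(frozen=True)
class DatasetDecl:
    """dataset dataset_name represents (source = source_spec, hints_list);"""
    name: str
    source: str
    org: Optional[str] = None   # hint: dataset organization (set, sequence, bag, ...)
    size: Optional[str] = None  # hint: estimated size, kept as opaque text

    def source_kind(self) -> str:
        return self.source if self.source in SOURCE_KEYWORDS else "url"

    def render(self) -> str:
        # URLs are quoted, keywords are not, following the slide examples.
        source = self.source if self.source_kind() != "url" else f'"{self.source}"'
        parts = [f"source = {source}"]
        if self.org is not None:
            parts.append(f"org = {self.org}")
        if self.size is not None:
            parts.append(f"size = {self.size}")
        return f"dataset {self.name} represents ({', '.join(parts)});"


if __name__ == "__main__":
    print(ActivityDecl("missVals", "MissingValuesActType", "discover").render())
    print(DatasetDecl("cleaned", "internal", org="set").render())
```

Keeping hints as optional fields reflects their role on the slide: they inform optimization but are not required for a declaration to be meaningful.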
Basic Control Structures
Concurrent execution:
    cobegin {
        activity1(…);
        …
        activityn(…);
    }
Sequential execution:
    block {
        activity1(…);
        …
        activityn(…);
    }
Data mining query execution:
    exec dmq (arguments) byactivity (activity_name) {
        dmq_query_specification
    }
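The intended meaning of `cobegin` and `block` can be approximated with two small combinators. The Python sketch below is one possible reading, not the DmiFlow execution model (which the presentation leaves open): it assumes that `cobegin` starts all enclosed statements and waits for all of them, that `block` runs them in order, and it uses threads as the concurrency mechanism. The `exec dmq` construct is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Step = Callable[[], None]


def block(*steps: Step) -> Step:
    """Sequential execution: run the steps one after another."""
    def run() -> None:
        for step in steps:
            step()
    return run


def cobegin(*steps: Step) -> Step:
    """Concurrent execution: start all steps, then wait until all have finished."""
    def run() -> None:
        if not steps:
            return
        with ThreadPoolExecutor(max_workers=len(steps)) as pool:
            futures = [pool.submit(step) for step in steps]
            for future in futures:
                future.result()  # re-raises an exception raised inside a step
    return run


if __name__ == "__main__":
    # Hypothetical activities that only report that they ran.
    workflow = block(
        lambda: print("prepare"),
        cobegin(lambda: print("branch 1"), lambda: print("branch 2")),
        lambda: print("finish"),
    )
    workflow()
```

Because both combinators return a plain callable, they nest freely, which matches the way a `block` appears inside a `cobegin` in the DmiFlow example.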
Workflow Example – Graphical Form
DmiFlow Example (1)
module WorkflowExample {
    const replaceMethod = "average",
          splittingMethod = "gini", //hint
          url1 = "/serverA/dmi/services/integrationService1",
          url2 = "/serverB/dmi/services/decisionTreeService1",
          url3 = "/serverB/dmi/services/neuralNetworkService3";
    activity integrDS: dataIntegrationActType at (url1),
             missVals: MissingValuesActType at (discover),
             normalise: NormalisForNNActType at (default),
             dt: decisionTreeActType at (url2),
             nn: NeuralNetworkActType at (url3);
    dataset ….
DmiFlow Example (2)
    dataset
        ds1 represents (source = "http://www.myproject/d1.dat",
                        org = set, size = [1.5, 2.0]),
        ds2 represents (source = "http://www.myproject/d2.dat",
                        org = set),
        intConf represents (source = "/server/dmi/config/integr.conf"),
        outIntegr represents (source = internal, org = set),
        cleaned represents (source = internal, org = set),
        normalised represents (source = internal, org = set),
        nnConf represents (source = "/server/dmi/configs/nn.conf"),
        nnMod represents (source = "/server/dmi/models/nn.pmml"),
        dtMod represents (source = "/server/dmi/models/dt.pmml");
    defworkflow {
        ...
    }
DmiFlow Example (3)
    defworkflow main () {
        integrDS (in ds1, ds2, intConf; out outIntegr);
        missVals (in outIntegr, replaceMethod; out cleaned);
        cobegin {
            block {
                normalise (in cleaned; out normalised);
                nn (in normalised, nnConf; out nnMod);
            }
            dt (in cleaned, splittingMethod; out dtMod);
        }
    }
}
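To make the data flow of the example workflow concrete, the following Python sketch executes the same structure with stand-in activities. It is a hedged illustration only: `run_main` and `stub` are hypothetical names, the activities are looked up under the names declared in the module (`integrDS`, `missVals`, `normalise`, `nn`, `dt`), datasets live in a plain dictionary, and each stub merely records which inputs it received instead of doing any integration or mining.

```python
from concurrent.futures import ThreadPoolExecutor


def stub(name):
    """Build a stand-in activity that returns a trace of its own call."""
    def activity(*inputs):
        return f"{name}({', '.join(map(str, inputs))})"
    return activity


def run_main(activities, store):
    """Run the example workflow; `store` maps dataset and constant names to values."""
    # Sequential prefix: integration, then missing-value handling.
    store["outIntegr"] = activities["integrDS"](
        store["ds1"], store["ds2"], store["intConf"])
    store["cleaned"] = activities["missVals"](
        store["outIntegr"], store["replaceMethod"])

    def nn_branch():
        # The inner block: normalisation must precede neural network training.
        store["normalised"] = activities["normalise"](store["cleaned"])
        store["nnMod"] = activities["nn"](store["normalised"], store["nnConf"])

    def dt_branch():
        store["dtMod"] = activities["dt"](
            store["cleaned"], store["splittingMethod"])

    # The cobegin: both branches only read `cleaned` and write distinct keys.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for future in [pool.submit(nn_branch), pool.submit(dt_branch)]:
            future.result()
    return store


if __name__ == "__main__":
    names = ["integrDS", "missVals", "normalise", "nn", "dt"]
    result = run_main(
        {n: stub(n) for n in names},
        {"ds1": "ds1", "ds2": "ds2", "intConf": "intConf", "nnConf": "nnConf",
         "replaceMethod": "average", "splittingMethod": "gini"},
    )
    print(result["nnMod"])
    print(result["dtMod"])
```

The two branches can run concurrently because they share only the read-only `cleaned` dataset, which is the kind of parallelism the programmer identifies explicitly with `cobegin`.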
Future Work
• Extend language functionality
• Investigate the DmiFlow execution model for the ADMIRE architecture
• Define the functional specification of the DmiFlow language processor
• Specify the concrete language syntax