Data services in the DataMiningGrid Martin Swain, University of Ulster

1
Talk agenda
1. A bit about the DataMiningGrid
2. Data services in the DataMiningGrid
3. The development work we are doing
4. Ongoing issues and conclusions
5. Highlights from the demo
2
Data-mining on the grid
What is data mining?
• Extracting or “mining” knowledge from large amounts of data
Key challenges for the grid:
• Data preparation is the biggest challenge
  ◦ Knowledge discovery depends on data quality
• Performance depends on efficient interfacing between data and data mining algorithms
  ◦ Data formatting
• Algorithms need to move to the data
3
DataMiningGrid partners
FHG
DC
TECH
UU
LJU
4
The DataMiningGrid
Text mining: 4 use cases
• Pre-processing, fast distributed text classification, ontology learning, finding related and similar documents
Bioinformatics: 2 use cases
• Gene network induction, analysis of molecular dynamics simulations
Grid monitoring
Mining distributed medical databases
Ecological modelling
5
Data management overview
Data-mining-aware data management services
• We are developing new functionality to extend OGSA-DAI, e.g.
  ◦ Data assays, i.e. statistical summaries
  ◦ Data pre-processing
  ◦ Cross-validation
  ◦ Data conversions and formatting
Implemented a Triana interface to OGSA-DAI
• Triana is a visual, easy-to-use workflow editor
• Developed by Cardiff University
6
Data preparation
The most challenging aspect of data mining
Many different methods and algorithms can be used
May be entirely domain specific
• Two demonstrators focus just on data pre-processing:
  ◦ Text mining
  ◦ A data warehouse for molecular dynamics simulations
7
Data assays 1/2
A data assay is a description or summary of the data set:
• The name of each field
• Its type (numeric or character)
• Simple statistics
  ◦ Minimum and maximum values
  ◦ Average
  ◦ Variance
• Lists
  ◦ Distinct values
  ◦ Empty or missing values
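A minimal sketch of how such an assay could be computed, assuming the data set is held in memory as a list of Python dicts. The function name `data_assay`, the dictionary layout, and the choice of population variance are illustrative assumptions, not the DataMiningGrid implementation.

```python
import statistics


def data_assay(rows, missing=("", None)):
    """Summarise a table given as a list of dicts (one dict per record)."""
    assay = {}
    fields = list(rows[0].keys()) if rows else []
    for name in fields:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in missing]
        numeric = bool(present) and all(
            isinstance(v, (int, float)) for v in present
        )
        summary = {
            "type": "numeric" if numeric else "character",
            # Lists: distinct values, and record indexes with no value.
            "distinct": sorted(set(present), key=str),
            "missing": [i for i, v in enumerate(values) if v in missing],
        }
        if numeric:
            # Simple statistics; population variance is an assumption here.
            summary["minimum"] = min(present)
            summary["maximum"] = max(present)
            summary["average"] = statistics.fmean(present)
            summary["variance"] = statistics.pvariance(present)
        assay[name] = summary
    return assay
```

Calling `data_assay(rows)` returns one summary per field name, which could then be stored alongside the data.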
8
Data assays 2/2
Data assays used for data preparation:
• Data binning
• Scaling and normalisation
• Mapping one set of values on to another
• Cleaning data, noise reduction
• …
We want to store the data assay with the data
• Enables more efficient and flexible data processing
• Requires a suitable data format
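To illustrate why a stored assay helps, the sketch below scales and bins values using only the minimum and maximum recorded in the assay, so the data need not be re-scanned. The function names, the `minimum`/`maximum` keys, and the equal-width binning scheme are assumptions made for this example.

```python
def scale_min_max(value, field_assay):
    """Scale a value into [0, 1] using the stored minimum and maximum."""
    lo, hi = field_assay["minimum"], field_assay["maximum"]
    if hi == lo:
        return 0.0  # constant field: every value maps to the same point
    return (value - lo) / (hi - lo)


def equal_width_bin(value, field_assay, n_bins):
    """Return the 0-based index of the equal-width bin containing value."""
    position = scale_min_max(value, field_assay)
    # The maximum itself falls into the last bin rather than a new one.
    return min(int(position * n_bins), n_bins - 1)


# Hypothetical stored assay for one numeric field.
expr_assay = {"minimum": 10.0, "maximum": 30.0}
scaled = [scale_min_max(v, expr_assay) for v in (10.0, 15.0, 30.0)]
bins = [equal_width_bin(v, expr_assay, 4) for v in (10.0, 15.0, 30.0)]
```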
9
Cross-validation
[Figure: distributed data sources are integrated into a single table, which is split for cross-validation into a training data-set and a test set; the training data-set is divided into columns and distributed for processing.]
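The slide does not say which splitting scheme is used. As one common possibility, the sketch below shows a plain k-fold split over record indexes. The function name `k_fold_splits` and the fixed shuffle seed are assumptions for illustration only.

```python
import random


def k_fold_splits(n_records, k, seed=0):
    """Yield (training_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal the shuffled indexes into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, test
```

Each record appears in exactly one test set, and each (training, test) pair could be sent to a different machine for processing.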
10
Data conversions and formatting
Different data formats are required by different data mining systems
• E.g. the open source WEKA data mining software uses ARFF format
• PMML is an industry standard for discovered data models
Many data conversions take place in grid-enabled data mining
• There is a requirement for a standard data format
  ◦ Including a data summary would be very helpful
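As an example of one such conversion, the sketch below renders in-memory rows as ARFF text. It covers only numeric and nominal attributes and assumes that names and values contain no spaces, commas, or quotes. The function name `to_arff` is hypothetical and this is not the DataMiningGrid converter.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is "numeric" or a
    list of allowed nominal values. Missing values (None) become "?".
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "weather",
    [("temp", "numeric"), ("play", ["yes", "no"])],
    [(20.5, "yes"), (None, "no")],
)
```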
11
OGSA-DAI and Triana 1/3
Triana is a workflow editor from Cardiff University
DataMiningGrid tools and services will be integrated within Triana
• Data services: Triana is used to
  ◦ Write perform documents
  ◦ Execute OGSA-DAI clients
12
OGSA-DAI and Triana 2/3
13
OGSA-DAI and Triana 3/3
14
Ongoing issues
Integration of DataMiningGrid components
• Error propagation and handling
• Clean up
Suitable data exchange format
Provenance
• Intermediary results stored with provenance
• Developing an XML schema for text mining
Security
• Confidentiality important for medical data sets
15
Conclusions
Data manipulation is central to the DataMiningGrid
• Not just data access and transfer
• Interfacing between data and algorithms
Processing is tightly coupled to the data
• Flexibility:
  ◦ Transfer algorithms to data
• Standards for data exchange required
16
Demo: Grid-enabled and Workflow-based Data Mining with Weka
17
Motivation
Demonstrate generic workflow for data mining in grids
Demonstrate ease of use
• Algorithm information dynamically retrieved
• Algorithm transfer to remote machines
• Integrated into the workflow editor
Demonstrate the distribution of each single step on multiple machines in our testbed
18
Setup
[Figure: testbed setup diagram. Sites: FHG, UU, TECH, LJU. Host labels: grid1, grid2, dmg-tech, kanin. Software: GT4, OGSA-DAI, mySQL, Weka. Items exchanged: Weka executable, ARFF, model, result, SQL, matrix, Perform document, job description, algorithm description, executable & job description. Legend: data flow, meta-data flow.]
19
Operations and
Components/Services
Operation | Component/service (Location)
Dynamic retrieval of information about the selected algorithm | Information service (TECH)
Information processing | Client-side API (Laptop): OGSA-DAI Perform document; GT4 job description
Data query, transportation and transformation | Data service (OGSA-DAI) (UU). Activities: SQL; DB2ARFF; Write to file
Job execution | GT4 GRAM & DMGrid execution system (FHG): File transfer (data, Weka executable); Execution; Resource Brokering
Data mining | Weka (grid1 (FHG))
Workflow processing | Triana (Laptop)
20
Now for the movie…
21