Data services in the DataMiningGrid
Martin Swain, University of Ulster

Talk agenda
1. A bit about the DataMiningGrid
2. Data services in the DataMiningGrid
3. The development work we are doing
4. Ongoing issues and conclusions
5. Highlights from the demo

Data-mining on the grid
What is data-mining?
  • Extracting or "mining" knowledge from large amounts of data
Key challenges for the grid:
  • Data preparation is the biggest challenge
    - Knowledge discovery depends on data quality
  • Performance depends on efficient interfacing between data and data mining algorithms
    - Data formatting
  • Algorithms need to move to the data

DataMiningGrid partners
FHG, DC, TECH, UU, LJU

The DataMiningGrid
Text-mining: 4 use-cases
  • Pre-processing, fast distributed text classification, ontology learning, finding related and similar documents
Bioinformatics: 2 use-cases
  • Gene network induction, analysis of molecular dynamics simulations
Grid monitoring
Mining distributed medical databases
Ecological modelling

Data management overview
Data-mining-aware data management services
  • We are developing new functionality to extend OGSA-DAI, e.g.
    - Data assays, i.e. statistical summaries
    - Data pre-processing
    - Cross-validation
    - Data conversions and formatting
Implemented a Triana interface to OGSA-DAI
  • Triana is a visual, easy-to-use workflow editor
  • Developed by Cardiff University

Data preparation
The most challenging aspect of data mining
Many different methods and algorithms can be used
May be entirely domain specific
  • Two demonstrators focus just on data preprocessing:
    - Text mining
    - A data warehouse for molecular dynamics simulations

Data assays 1/2
A data assay is a description or summary of the data set:
  • The name of each field
  • Its type (numeric or character)
  • Simple statistics
    - Minimum and maximum values
    - Average
    - Variance
  • Lists
    - Distinct values
    - Empty or missing values

Data assays 2/2
Data assays used for data preparation:
  • Data binning
  • Scaling and normalisation
  • Mapping one set of values on to another
  • Cleaning data, noise reduction
  • …
We want to store the data assay with the data
  • Enables more efficient and flexible data processing
  • Requires a suitable data format

Cross-validation
[Diagram labels: Distributed data sources; Integrated into a single table; Split for cross-validation; TRAINING; TEST SET; Training data-set divided into columns and distributed for processing]

Data conversions and formatting
Different data formats are required by different data mining systems
  • E.g. the open source WEKA data mining software uses the ARFF format
  • PMML is an industry standard for discovered data models
Many data conversions take place in grid-enabled data mining
  • There is a requirement for a standard data format
    - Including a data summary would be very helpful

OGSA-DAI and Triana 1/3
Triana is a workflow editor from Cardiff University
DataMiningGrid tools and services will be integrated within Triana
  • Data services: Triana is used to
    - Write perform documents
    - Execute OGSA-DAI clients

OGSA-DAI and Triana 2/3

OGSA-DAI and Triana 3/3

Ongoing issues
Integration of DataMiningGrid components
  • Error propagation and handling
  • Clean up
Suitable data exchange format
Provenance
  • Intermediary results stored with provenance
  • Developing an XML schema for text mining
Security
  • Confidentiality is important for medical data sets

Conclusions
Data manipulation is central to the DataMiningGrid
  • Not just data access and transfer
  • Interfacing between data and algorithms
Processing is tightly coupled to the data
  • Flexibility:
    - Transfer algorithms to data
  • Standards for data exchange required

Demo: Grid-enabled and Workflow-based Data Mining with Weka

Motivation
Demonstrate a generic workflow for data mining in grids
Demonstrate ease of use
  • Algorithm information dynamically retrieved
  • Algorithm transfer to remote machines
  • Integrated into the workflow editor
Demonstrate the distribution of each single step on multiple machines in our testbed

Setup
[Diagram labels: sites FHG, UU, TECH, LJU; hosts grid1, grid2, matrix, dmg-tech, kanin; components GT4, OGSA-DAI, MySQL, Weka executable; artefacts ARFF, SQL, model, result, perform document, job description, algorithm description; legend: data flow, meta-data flow, executable & job description]

Operations and Components/Services
Operation: Component/service (Location)
Dynamic retrieval of information about the selected algorithm: Information service (TECH)
Information processing: Client-side API (Laptop)
  • OGSA-DAI perform document
  • GT4 job description
Data query, transportation and transformation: Data service (OGSA-DAI) (UU)
  • Activities:
    - SQL
    - DB2ARFF
    - Write to file
Job execution: GT4 GRAM & DMGrid execution system (FHG)
  • File transfer (data, Weka executable)
  • Execution
  • Resource brokering
Data mining: Weka (grid1 (FHG))
Workflow processing: Triana (Laptop)

Now for the movie…
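
Supplementary sketch for "Data assays 1/2". For each field it records the items listed on that slide: the name, the type (numeric or character), minimum, maximum, average and variance for numeric fields, the distinct values, and the number of empty or missing values. The function name `data_assay`, the dictionary layout, the rule that `None` and blank strings count as missing, and the use of the population variance are all assumptions made for illustration, not the DataMiningGrid implementation.

```python
from statistics import mean, pvariance


def _to_number(value):
    """Return value as a float, or None if it cannot be read as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def data_assay(rows):
    """Summarise a list of dict rows field by field (hypothetical layout)."""
    fields = []
    for row in rows:  # keep fields in first-seen order
        for name in row:
            if name not in fields:
                fields.append(name)

    assay = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        # Assumption: None and blank strings are "empty or missing".
        present = [v for v in values if v is not None and str(v).strip() != ""]
        numbers = [_to_number(v) for v in present]
        numeric = bool(present) and all(n is not None for n in numbers)

        summary = {
            "type": "numeric" if numeric else "character",
            "missing": len(values) - len(present),
            "distinct": sorted(set(numbers)) if numeric
            else sorted({str(v) for v in present}),
        }
        if numeric:
            summary.update(
                minimum=min(numbers),
                maximum=max(numbers),
                average=mean(numbers),
                variance=pvariance(numbers),  # population variance (assumed)
            )
        assay[name] = summary
    return assay


if __name__ == "__main__":
    sample = [
        {"gene": "g1", "expression": "2.0"},
        {"gene": "g2", "expression": "4.0"},
        {"gene": "g1", "expression": ""},
    ]
    for field, summary in data_assay(sample).items():
        print(field, summary)
```

Keeping the summary as a plain dictionary means it could be serialised and stored alongside the data, which is the goal stated on "Data assays 2/2"; the storage format itself is left open here, as it is in the talk.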
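
Supplementary sketch for the "Cross-validation" slide. The diagram shows a table being split into training and test sets; the code below is a plain k-fold row split, which is one common way to do that and is not confirmed as the scheme used in the DataMiningGrid. It does not cover the column-wise division of the training set that the diagram also mentions. The name `k_fold_splits` and its parameters are hypothetical.

```python
import random


def k_fold_splits(n_rows, k, seed=0):
    """Yield (training_indices, test_indices) once for each of k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps splits repeatable
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for held_out in range(k):
        test = folds[held_out]
        training = [
            i for j, fold in enumerate(folds) if j != held_out for i in fold
        ]
        yield training, test


if __name__ == "__main__":
    for training, test in k_fold_splits(n_rows=10, k=5):
        print("train:", sorted(training), "test:", sorted(test))
```

Working with row indices rather than the rows themselves keeps the split independent of where the data lives, so each fold could in principle be shipped to a different machine as a list of keys.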
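
Supplementary sketch for the data-service activities in the demo (SQL, DB2ARFF, write to file). It runs a query and renders the result in WEKA's ARFF text format: an `@relation` line, one `@attribute` line per field, then `@data` with `?` for missing values. This is not the DB2ARFF activity itself; `rows_to_arff` is a hypothetical helper, SQLite stands in for the MySQL database of the demo setup, and treating every non-numeric field as a nominal attribute is a simplifying assumption.

```python
import sqlite3


def _is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _quote(text):
    """Single-quote names or values containing ARFF-special characters."""
    text = str(text)
    if any(ch in text for ch in " ,{}%'\""):
        return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return text


def rows_to_arff(relation, field_names, rows):
    """Render rows (sequences of values, None = missing) as ARFF text."""
    lines = ["@relation " + _quote(relation), ""]
    for index, name in enumerate(field_names):
        present = [row[index] for row in rows if row[index] is not None]
        if not present:
            kind = "string"  # nothing to infer a type from
        elif all(_is_number(v) for v in present):
            kind = "numeric"
        else:
            # Assumption: non-numeric fields become nominal attributes.
            distinct = sorted({str(v) for v in present})
            kind = "{" + ",".join(_quote(v) for v in distinct) + "}"
        lines.append("@attribute " + _quote(name) + " " + kind)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else _quote(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")  # stand-in for the demo's MySQL database
    db.execute("CREATE TABLE iris (petal_length REAL, species TEXT)")
    db.executemany(
        "INSERT INTO iris VALUES (?, ?)",
        [(1.4, "setosa"), (4.7, "versicolor"), (None, "setosa")],
    )
    cursor = db.execute("SELECT petal_length, species FROM iris")
    names = [column[0] for column in cursor.description]
    print(rows_to_arff("iris", names, cursor.fetchall()))
```

The attribute types here are inferred from the values in the result set. A data assay stored with the data would already hold the field types and distinct values, which is one reason the talk asks for a data summary in the exchange format.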