Text Mining Within the GridMiner Framework

Ivan Janciak 1), Peter Brezany 1) and Martin Sarnovsky 2)
1) Vienna University of Technology, Institute of Scientific Computing
2) Technical University of Kosice, Department of Cybernetics and AI

www.gridminer.org … Intelligent Grid Solutions
2nd DIALOGUE Workshop, 9 Feb. 2006, Edinburgh

Outline
• Motivation
• Text Mining Workflow
• Implementation
• OGSA-DAI & Text Mining in GridMiner
• Future work
• Summary

GridMiner Framework
University of Vienna
• Target: provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources
• Application area: medical, treatment of Traumatic Brain Injury (predicting the outcome of seriously ill patients)
[Figure: CRISP-DM (SPSS) phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment. Other labels: Virtual Organization, User, Data provider, Data, Data Exploration Services, Pre-processing Services, Data Mining Services]
GridMiner
• Data Mining Services
• Clustering
• Classification
• Association rules
• Sequences
• OLAP
• Pre-Processing
• Data Cleaning
• Data Integration
• Visualization
• Job Control
• GUI
• Text Mining Services

Text Mining: Use Case
[Figure: Grid Information Retrieval System. Labels: Grid/Internet, Information Extraction System, XML Files, Collection Managers, XMLDB, Indexers, Query Processor (query, result), Text Mining Tasks]
GridIR: official working group of the Globus Alliance (http://www.gir-wg.org)

Motivation
• Support for IR systems
• Goals:
  • Find possible groups of documents based on their similarities
  • Find appropriate categories for selected documents
• Text documents in various formats (plain text, HTML, XML, PDF) and languages
• Integration of text documents
• Text extraction
• Transformation
• Pre-process (potentially) large collections of documents

GridMiner and Text Mining
JBowl (Java Bag-of-Words library)
• Modular framework for pre-processing and indexing of large text collections
• System developed in Java to support:
  • Information retrieval
  • Text mining tasks
• Creates and evaluates supervised and unsupervised text-mining models
• Produces output in Predictive Model Markup Language (PMML)
GridMiner
• Workflow execution and control
• Visualization

Text Mining Workflow
Stages: Pre-processing, Building Text Model, Data Mining, Evaluation
• Document analysis
• Pre-processing: a suitable way to transform data into numbers
  • Text tokenization (lexical units)
  • Filters (stop words, stemming)
• Indexing
  • Computing and storing statistics (term and document frequencies, etc.)

Text Mining Workflow
Vector-space model
• Most frequently used model
• Bag-of-words representation
• Represents the document collection by a document/term matrix
  • columns -> terms
  • rows -> documents
• TF-IDF weighting (term frequency, inverse document frequency)
  • Interprets local and global aspects of the terms
  • $w_{ij} = \mathrm{tf}(d_i, t_j) \cdot \mathrm{idf}(t_j)$

Representation of Text Model
PMML (Predictive Model Markup Language) allows a text model to be defined in an XML representation divided into six major parts:
• Model attributes
• Dictionary of terms
• Corpus of text documents
• Document-term matrix
• Text model normalization
• Text model similarity
The complete model is a huge XML file (all terms, the document-term matrix, ...)
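The pre-processing, indexing and TF-IDF weighting steps of the workflow can be illustrated with a minimal Python sketch. It is not JBowl's API: the function names, the tiny stop-word list and the sample documents are hypothetical, stemming is omitted, and the idf variant log(N / df) and the cosine similarity measure are assumptions, since the slides do not say which variants are used.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def tokenize(text):
    """Lower-case the text, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_tfidf(documents):
    """Return (terms, matrix): rows are documents, columns are terms."""
    token_lists = [tokenize(d) for d in documents]
    terms = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Assumed idf variant: log(N / df). This is the global weight.
    idf = {t: math.log(n_docs / df[t]) for t in terms}
    matrix = []
    for tokens in token_lists:
        tf = Counter(tokens)  # raw term counts: the local weight
        matrix.append([tf[t] * idf[t] for t in terms])
    return terms, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "Grid services for text mining",
    "Text mining of medical documents",
    "Clustering of medical documents in the grid",
]
terms, matrix = build_tfidf(docs)
```

Here `matrix[i][j]` plays the role of the weight of term j in document i, and `cosine_similarity(matrix[0], matrix[1])` gives one possible similarity between the first two documents.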

Text Mining Workflow
Stages: Pre-processing, Building Text Model, Data Mining, Evaluation
Supported tasks:
• Classification (supervised learning)
  • Classifies documents into a number of predefined categories (multi-label classification)
  • Target is 'document category'
  • Algorithms: C4.5, RIPPER, kNN, SVM
  • Output: Classification Model (set of decision trees)
• Clustering (unsupervised learning)
  • Groups similar documents together
  • Algorithms: kMeans, SOM
  • Output: Clustering Model (clusters with similar documents)

Representation of Mined Models
Represented in PMML
• Classification
  • Mined Schema
  • Model Stats
  • TreeModel (for each category: long binary tree)
• Clustering
  • Mined Schema
  • Model Stats
  • ClusteringModel
  • Clustering field
  • Cluster

Text Mining Workflow
Evaluation
• Estimation of model accuracy
• Goal: to predict (with a high degree of accuracy) the correct class (or cluster) to which a new document belongs
• Is done on a set of previously unseen documents (testing set)
• Output: statistics (precision, recall, F-measure)

Text Mining Service: First Implementation
[Figure: GridMiner-TM Service. Inputs: Training set (XML file), Testing set (XML file). Steps: Build Text Model, Terms Reduction, Categorization & Clustering, Model evaluation. Results: Text Model, Reduced Text Model, Classification/Clustering Model, Statistics (PMML)]
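The evaluation statistics named in the workflow (precision, recall, F-measure) can be computed for a single category as in the following minimal Python sketch. The function name and the sample labels are hypothetical, and the balanced F-measure (F1) is an assumption, since the slides do not state which F-measure variant the service reports.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Evaluate one category on a testing set of previously unseen documents.

    Both arguments are sequences of booleans: True means the document
    belongs (or is predicted to belong) to the category.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # correctly assigned
    fp = sum(1 for t, p in pairs if not t and p)    # wrongly assigned
    fn = sum(1 for t, p in pairs if t and not p)    # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumed variant: balanced F-measure (harmonic mean of precision and recall).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical testing set of six documents for one category.
truth = [True, True, False, True, False, False]
predicted = [True, False, False, True, True, False]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

For multi-label classification, the same function could be applied once per category and the results averaged; how GridMiner aggregates them is not stated in the slides.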

Text Mining Service & OGSA-DAI
[Figure: labels: Training/Testing Sets (XML files), OGSA-DAI, GridMiner-TM Service, Grid Data Mediation (Documents Integration), Terms Reduction, Text Mining Activity (Build Text Model), XMLDB, Delivery (Deliver Model), Text Model, Text Categorization & Clustering, Model evaluation]

Future Work
• Distributed version of the Text Mining service
• Distribution of the classification task
[Figure: labels: TM-Service, Slave Services, categories, decision trees, Classification Model, Document Term Matrix, OGSA-DAI Delivery Activity]

Summary
• GridIR architecture extended by Text Mining capabilities
• Stand-alone Text Mining Service
  • Classification
  • Clustering
• Implementation of the Building Text Model activity in OGSA-DAI
• Distributed version of the Text Mining service
  • Classification

TextMining Video