Advanced Data Mining and Integration Research for Europe

Using Advanced Data Mining and Integration in Environmental Risk Management

Ladislav Hluchy, Ondrej Habala, Martin Šeleng, Peter Krammer, Viet Tran
Institute of Informatics, Slovak Academy of Sciences
SAMI 2011, January 2011, Smolenice, Slovakia
ADMIRE – Framework 7 ICT 215024

Contents
• EU FP7 project ADMIRE – overview
• Architecture of DMI solution in ADMIRE
• New DMI process language – DISPEL
• Pilot application scenarios – ORAVA, RADAR
  – goals, architecture, experimental results
• Tools in ADMIRE

SAMI 2011, Smolenice, Slovakia, January 2011 ADMIRE – Framework 7 ICT 215024 ...making data-mining easier

ADMIRE - Advanced Data Mining and Integration Research for Europe
• 7th Framework Program
• ICT, Call 1.2.A
• Commenced in February 2008 over 36 months.
• €4.3 million in costs, and €3 million in EC funding

Collaborators
• University of Edinburgh, UK (Coordinator)
  – NeSc - National e-Science Centre
  – EPCC - Edinburgh Parallel Computing Centre
• Fujitsu Labs of Europe, UK
• University of Vienna, Austria
  – Institute of Scientific Computing
• Universidad Politécnica de Madrid, Spain
  – Facultad de Informatica
• Slovak Academy of Sciences, Slovakia
  – Institute of Informatics
• ComArch S.A., Poland

ADMIRE Goals
• Accelerate access to and increase the benefits from data exploitation;
• Deliver consistent and easy to use technology for extracting information and knowledge;
• Cope with complexity, distribution, change and heterogeneity of services, data, and processes, through abstract view of data mining and integration; and
• Provide power to users and developers of data mining and integration processes.

ADMIRE Structure
– WP1: High-Level Model and Language Research
  • Incremental development of models and languages with the goal of describing Data Mining and Integration (DMI) processes abstractly
– WP2: Architecture Research
  • Incremental development of a flexible, scalable and open DMI architecture
– WP3: Platform Support & Delivery
  • Deliver robust service platforms, support users and encapsulate knowledge in a book
– WP4: Service Infrastructure Development and Enhancement
  • Develop technology and services to enhance the DMI service infrastructure based on Fujitsu’s USMT
– WP5: Data Mining and Integration Tools Development
  • Develop and integrate tools that make the technology easier to use and reduce the frequency of failures
– WP6: Integrated Applications
  • Demonstrate, validate and measure the performance of the architecture, language, platform and tools as an integrated environment for Data Mining and Integration
– WP7: Project Management
  • Management and coordination of the project

ADMIRE Architecture: Separation of Concerns (figure)

ADMIRE Architecture (figure)

DISPEL – Data Intensive Systems Process-Engineering Language
• Data-intensive distributed systems
• Connection point of complex application requests and complex enactment systems
  – Benefit: method development, engineering and evolution of supported practices can take place independently in each world
• Describes enactment requests for streaming-data workflow processes
• “Process-engineering time” – transform and optimize the process in preparation for the enactment period

DISPEL: Simple Example

    String sql1 = "SELECT * FROM some_table";
    String sql2 = "SELECT * FROM table2";
    String resource = "128.18.128.255";
    SQLQuery query = new SQLQuery;
    |- sql1, sql2 -| => query.expression;
    |- resource -| => query.resource;
    Tee tee = new Tee;
    query.result => tee.connectInput;

Callouts on the slide: creating streams of literals (the |- ... -| expressions); creating connections (the => operator).

DISPEL – real use (figure)

ADMIRE’s High-Level Architecture (figure)

ADMIRE Gateways (figure; label: USMT)

Security
• Framework built on top of formal Grid Infrastructure; available security mechanisms include:
  – Transport level security: SSL, HTTPS (currently available)
  – Message level security: Web Services Security: SOAP Message Security
  – X.509 certificate authentication
  – Multiple stakeholder authorization
  – Explicit Trust Delegation (ETD)

Pilot Applications
• ADMIRE has 2 pilot applications
  – CRM
  – FloodApp
• FloodApp
  – Orava
  – Radar
  – SVP

ACRM Application
• Large-scale, distributed Churn scenario
  – 4 database parts, distributed among ADMIRE partners
  – Graphical UI for business analysts
  – Using ADMIRE workbench, DISPEL and framework to create predictions of customer churn
• Mining over distributed data

Flood Application
Data sets used in hydrological scenarios

Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
HUSAV | Hydrology | Data from two probes, containing water saturation of soil | 10s of MB | 1998-2007 | Two distinct points
MARS | Meteorology | Historical meteorological data (temperature, rainfall, etc.) for Slovakia | 100s of MB | 1975-2007 | Slovakia (grid 50x50 km)
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
DAISY | Pedology | Various pedological parameters for one probe in southern Slovakia | 10s of MB | 1961-2000 | One point
WOFOST | Pedology | Crop data (with attached soil and meteorological data) for Slovakia, year 2006 | 10s of MB | 2006 | Slovakia (grid)
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)

Scenarios deployment in testbed
• Two scenarios (ORAVA, RADAR) completely deployed in testbed
• Other scenarios' data are partially deployed
• 5 nodes (1 real + 4 virtual nodes)
• Databases (MySQL + PostgreSQL), GRIB files in file storage
• USMT (Unified System Management Technology - Jetty container), OGSA-DAI (Apache Tomcat)

Orava scenario
• Legend
  – Green area – Orava (part of north Slovakia)
  – Blue – Orava reservoir and local rivers
  – Red dots – hydrological measurement stations
• Notes
  – We are interested only in hydrological stations below the Orava reservoir
  – In our tests we use hydrological station 5830 (Tvrdosin)

ORAVA – data mining concept
• Predictors – rainfall amount (reservoir and station), air temperature (reservoir and station), reservoir discharge, reservoir temperature
• Targets – water level and temperature at a station below the reservoir

Time | Water temp Orava | Rainfall Orava | Air temp Orava | Air temp Station | Rainfall Station | Outflow Orava | Water level Station | Water temp Station
T-4 | E-4 | R-4 | A-4 | B-4 | S-4 | D-4 | X-4 | Y-4
T-3 | E-3 | R-3 | A-3 | B-3 | S-3 | D-3 | X-3 | Y-3
T-2 | E-2 | R-2 | A-2 | B-2 | S-2 | D-2 | X-2 | Y-2
T-1 | E-1 | R-1 | A-1 | B-1 | S-1 | D-1 | X-1 | Y-1
T | E | R | A | B | S | D | X | Y
T+1 |  | R+1 | A+1 | B+1 | S+1 | D+1 | X+1 | Y+1
T+2 |  | R+2 | A+2 | B+2 | S+2 | D+2 | X+2 | Y+2
T+3 |  | R+3 | A+3 | B+3 | S+3 | D+3 | X+3 | Y+3
T+4 |  | R+4 | A+4 | B+4 | S+4 | D+4 | X+4 | Y+4
T+5 |  | R+5 | A+5 | B+5 | S+5 | D+5 | X+5 | Y+5
T+6 |  | R+6 | A+6 | B+6 | S+6 | D+6 | X+6 | Y+6

Callouts on the future rows (T+1 to T+6): targets of data mining (water level X and water temperature Y at the station); given in a schedule (reservoir outflow D); predicted by a meteo model (rainfall and air temperature: R, A, B, S).

ORAVA – data integration
• Integration of data from
  – GRIB files
  – Reservoirs
• Inputs
  – Time period of experiment
  – Reservoir ID
  – List of hydro stations
  – Geo coordinates

ORAVA – data sets

Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
SHMU_HIST | Meteorology | Historical meteorological data from SHMI probes | 100s of MB | 1998-2007 | Slovakia (more than 100 distinct probes)
SHMU_GRIB | Meteorology | Historical temperatures and rainfall amounts in a gridded binary format | 100s of GB | 1998-2007 | Slovakia (grid, various sizes)
SHMU_HYDRO | Hydrology | Historical data from hydrological measurement stations | 10s of MB | 1998-2007 | Orava and upper Vah river

ORAVA – integrated and preprocessed data

Integrated raw data (one line per column, values for consecutive time steps):
Water_temp Orava: 1 (only one value present)
Air_temp Orava: -4, -4, -5, -5, -5, -3, -3
Rainfall (one rainfall series shown): -5.55E-20, -5.55E-20, -4.24E-20, -8.47E-20, -8.47E-20, -8.47E-20, -8.47E-20
Outflow Orava: 30, 30, 30, 30, 30, 50, 50
Air_temp Station: 269.0278, 269.0476, 269.5059, 270.2394, 270.8507, 271.2792, 271.9238
Flow/Height Station: 28, 28.62, 28.62, 28.62, 28, 28, 28
Water_temp Station: 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.8

Filters applied: LinearTrend filter, ReplaceMissingValues filter, ZeroEpsilon filter, Kelvin2Celsius filter

Integrated preprocessed data (one line per column, values for seven consecutive time steps):
Water_temp Orava: 1.0, 1.0, 0.995833, 0.991667, 0.9875, 0.983333, 0.979167
Air_temp Orava: -4.0, -4.0, -5.0, -5.0, -5.0, -3.0, -3.0
Rainfall Orava: 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
Outflow Orava: 30.0, 30.0, 30.0, 30.0, 30.0, 50.0, 50.0
Rainfall Station: 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
Air_temp Station: -3.12223, -3.1024, -2.64408, -1.91062, -1.29926, -0.87076, -0.22617
Flow/Height Station: 28.0, 28.62, 28.62, 28.62, 28.0, 28.0, 28.0
Water_temp Station: 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.8

ORAVA – data mining
• Input - Integrated data
• Data Mining Phases:
  – Data understanding
    • Data visualization
    • Data quality exploration
  – Data preparation
    • Missing values substitution (ReplaceMissingValues filter)
    • Noise reduction (ZeroEpsilon filter)
    • Switching from one scale to another (Kelvin2Celsius filter)
    • Data modifying (LinearTrend filter)
  – Model training
    • Training on historical data (8760 records)
    • Linear Regression model
    • Neural networks - multilayer perceptron without hidden layers
  – Model Evaluation
    • Testing of the trained model
    • N-fold cross validation
    • Using training sets
• Output - Prediction model

Workflow diagram labels: Integrated Data, Data Visualization, Data Preparation, Data Cleaning, Clean Data, Model Training, Model Visualization, Model Evaluation, Prediction Model
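
The data-preparation filters named in the ORAVA data mining phases can be illustrated with a minimal Python sketch. Only the filter names come from the slides; their exact behaviour is not described there, so everything below is an assumption: missing values are replaced by the column mean, values with magnitude below a small epsilon are set to zero, and the Kelvin conversion uses the standard 273.15 offset. The function names, the epsilon default and the sample values are hypothetical, and the LinearTrend filter is left out because the slides do not say what it computes.

```python
import math

KELVIN_OFFSET = 273.15  # standard offset; assumed, not stated in the slides


def replace_missing(values):
    """Replace None/NaN entries with the mean of the observed values (assumed behaviour)."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        return list(values)  # nothing to estimate the mean from
    mean = sum(observed) / len(observed)
    return [mean if v is None or math.isnan(v) else v for v in values]


def zero_epsilon(values, eps=1e-9):
    """Set numerically negligible values (|v| < eps) to exactly 0.0 (assumed behaviour)."""
    return [0.0 if abs(v) < eps else v for v in values]


def kelvin_to_celsius(values):
    """Convert temperatures from Kelvin to degrees Celsius."""
    return [v - KELVIN_OFFSET for v in values]


def preprocess(column, is_kelvin=False):
    """Apply the filters in sequence to one data column."""
    cleaned = zero_epsilon(replace_missing(column))
    return kelvin_to_celsius(cleaned) if is_kelvin else cleaned


if __name__ == "__main__":
    rainfall = [-5.55e-20, None, 0.4, -8.47e-20]  # hypothetical sample values
    air_temp_kelvin = [273.15, 283.15, None]      # hypothetical sample values
    print(preprocess(rainfall))
    print(preprocess(air_temp_kelvin, is_kelvin=True))
```

Missing values are filled before the epsilon filter runs so that the mean is taken over the raw observations; the reverse order would be an equally plausible reading of the slides.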

Orava – data mining results: prediction of temperature
• Linear Regression model for Water_temp_station, as a sum of the following terms (coefficient magnitudes; the signs are not legible in the source):
  – 0.6473 · Water_temp_Orava
  – 0.0239 · Air_temp_Orava
  – 0.0359 · Rainfall_Orava
  – 0.0055 · Outflow_Orava
  – 0.0418 · Rainfall_station
  – 0.0117 · Air_temp_station
  – 0.0503 · Flow_station
  – 2.4324 (constant term)

Orava – temperature prediction model comparison

Property | Linear regression | Multilayer perceptron
Correlation coefficient | 0.9639 | 0.9821
Mean absolute error | 1.1791 | 0.7748
Root mean squared error | 1.4607 | 1.0386
Relative absolute error | 23.8739 % | 15.6884 %
Root relative squared error | 26.609 % | 18.9195 %
Total number of instances | 8760 | 8760

Validation data | Linear regression: predicted | Linear regression: error | Multilayer perceptron: predicted | Multilayer perceptron: error
11.6 | 13.071 | 1.471 | 12.446 | 0.846
15.2 | 14.335 | -0.865 | 14.494 | -0.706
6.4 | 7.614 | 1.214 | 5.766 | -0.634
0.7 | 2.284 | 1.584 | 0.926 | 0.226
11.7 | 10.948 | -0.752 | 10.266 | -1.434
14.3 | 16.526 | 2.226 | 13.671 | -0.629
15.6 | 12.891 | -2.709 | 14.502 | -1.098
15.7 | 12.838 | -2.862 | 13.353 | -2.347
0.8 | 1.752 | 0.952 | 0.826 | 0.026
15.8 | 15.188 | -0.612 | 14.005 | -1.795
15.4 | 16.553 | 1.153 | 13.129 | -2.271
14.9 | 12.795 | -2.105 | 14.599 | -0.301
15.4 | 15.660 | 0.260 | 13.696 | -1.704

Orava – prediction of water level
• Neural network model – multilayer perceptron
• Input parameters (6)
  – Rainfall ([S+1]), Water-Level ([X])
  – Outflows ([D], [D+1] – [D], ln([D]), sqrt([D]))
• Output
  – Difference of water level ([X+1] – [X])

Time | Water temp Orava | Rainfall Orava | Air temp Orava | Air temp Station | Rainfall Station | Outflow Orava | Water level Station | Water temp Station
T-3 | E-3 | R-3 | A-3 | B-3 | S-3 | D-3 | X-3 | Y-3
T-2 | E-2 | R-2 | A-2 | B-2 | S-2 | D-2 | X-2 | Y-2
T-1 | E-1 | R-1 | A-1 | B-1 | S-1 | D-1 | X-1 | Y-1
T | E | R | A | B | S | D | X | Y
T+1 |  | R+1 | A+1 | B+1 | S+1 | D+1 | X+1 | Y+1
T+2 |  | R+2 | A+2 | B+2 | S+2 | D+2 | X+2 | Y+2
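
The six inputs and the output of the water-level model listed above can be assembled from the time series as in the following minimal Python sketch. The symbols S, D and X follow the slide (station rainfall, reservoir outflow, station water level); the function name, the list-based data layout and the sample values are hypothetical, and the sketch does not reproduce the authors' actual tooling.

```python
import math


def water_level_features(rain_station, outflow, water_level, t):
    """Build the six model inputs and the target for time step t.

    All three sequences are indexed by time step; t + 1 must be a valid
    index, and outflow[t] must be positive for the logarithm.
    """
    d = outflow[t]
    features = [
        rain_station[t + 1],   # [S+1]: rainfall at the station for the next step
        water_level[t],        # [X]: current water level at the station
        d,                     # [D]: current reservoir outflow
        outflow[t + 1] - d,    # [D+1] - [D]: change in outflow
        math.log(d),           # ln([D])
        math.sqrt(d),          # sqrt([D])
    ]
    target = water_level[t + 1] - water_level[t]  # [X+1] - [X]
    return features, target


if __name__ == "__main__":
    # Hypothetical sample series, three consecutive time steps.
    rain_station = [0.0, 1.2, 0.5]
    outflow = [30.0, 30.0, 50.0]
    water_level = [28.0, 28.62, 28.0]
    features, target = water_level_features(rain_station, outflow, water_level, t=1)
    print(features, target)
```

Predicting the difference [X+1] – [X] rather than the level itself means the predicted level is recovered by adding the model output to the current level [X].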

Orava – water level prediction
• Data count: 8735 records
• Activation function of the feed-forward neural network: sigmoid
• Correlation coefficient: 0.9816
• Mean absolute error: 0.4105
• Root mean squared error: 0.9673
• Relative absolute error: 30.5869 % (from difference)
• Root relative squared error: 19.2384 % (from difference)

RADAR
• Very short-term rainfall prediction from weather radar data
  – Movement of areas with higher air moisture content, and thus also higher precipitation potential
• Mining of matrices of data

Time | Potential precipitation (RADAR) | Measured precipitation (STATION) | Temperature (MODEL) | Wind (MODEL)
T-3 | R-3 | S-3 | H-3 | W-3
T-2 | R-2 | S-2 | H-2 | W-2
T-1 | R-1 | S-1 | H-1 | W-1
T | R | S | H | W
T+1 | R+1 | S+1 |  |
T+2 | R+2 | S+2 |  |

Callout on the future rows: targets of data mining.

Meteorological data
• Network of synoptic stations in Slovakia
  – 27 stations in Slovakia
  – Data used from the years 2007 and 2008
  – Hourly rainfall, humidity, atmospheric pressure and temperature values

RADAR isotonic model
• Current model for rainfall prediction
  – Isotonic regression model structure
  – Training on historical data
  – Correlation coefficient: 0.4593
  – Mean absolute error: 0.1105
  – Root mean squared error: 0.5490
  – Total number of instances: 89700
  – Validation: 10-fold cross-validation

Table of isotonic model

Index | Prediction (rainfall) | Cut point (reflectivity) | Index | Prediction (rainfall) | Cut point (reflectivity) | Index | Prediction (rainfall) | Cut point (reflectivity)
1 | 0.01 | 1.78 | 15 | 0.23 | 96.91 | 29 | 1.35 | 355.91
2 | 0.03 | 1.84 | 16 | 0.28 | 97.47 | 30 | 1.40 | 377.19
3 | 0.03 | 8.28 | 17 | 0.30 | 129.63 | 31 | 1.52 | 381.78
4 | 0.03 | 16.97 | 18 | 0.33 | 129.72 | 32 | 2.13 | 395.31
5 | 0.03 | 24.28 | 19 | 0.42 | 147.94 | 33 | 2.23 | 399.16
6 | 0.03 | 36.91 | 20 | 0.44 | 168.59 | 34 | 2.28 | 447.06
7 | 0.05 | 37.53 | 21 | 0.50 | 187.13 | 35 | 2.60 | 447.69
8 | 0.05 | 38.72 | 22 | 0.51 | 187.47 | 36 | 2.60 | 467.66
9 | 0.06 | 44.53 | 23 | 0.62 | 211.56 | 37 | 2.98 | 515.19
10 | 0.07 | 59.03 | 24 | 0.72 | 268.38 | 38 | 3.75 | 625.56
11 | 0.08 | 61.16 | 25 | 0.93 | 281.28 | 39 | 4.93 | 665.41
12 | 0.10 | 61.78 | 26 | 1.00 | 297.72 | 40 | 5.24 | 901.25
13 | 0.14 | 81.59 | 27 | 1.14 | 314.47 | 41 | 5.40 | 934.41
14 | 0.19 | 89.22 | 28 | 1.26 | 344.59 | 42 | 6.30 | 971.5

Hydrometeorological performance
Probability of detection (POD) with thresholds of 0.3 and 0.6 mm rainfall per hour:
• POD_0.3 = 63.87 %
• POD_0.6 = 56.22 %
Miss rate (MR) with thresholds of 0.3 and 0.6 mm rainfall per hour:
• MR_0.3 = 1.85 %
• MR_0.6 = 1.58 %

RADAR model
• Other tested models
  – Neural networks, SMOreg, linear regression, ...
  – Reached correlation coefficients between 0.35 and 0.42
  – Validation: 10-fold cross-validation
• Problems in model creation:
  – The process is significantly stochastic
  – Some input variables are backwards dependent on the output
  – The meteorological process is very sensitive
  – The reflection matrix represents the quantity of water in the atmosphere, not the exact rainfall rate in a specified area, as opposed to data from synoptic stations

ADMIRE Tools
• Registry client GUI
• Process designer
• SKSA
• Gateway Process Manager
• DMI Model Visualizer

Registry client GUI
• Read-only access to ADMIRE Registry
  – list PEs and view their properties
  – search, sort PEs
• Write access to Registry is done via DISPEL documents

Process Designer
• Manage your DMI project (files, directories – project structure)
• Select elements from the Registry
• View the canonical (DISPEL) representation of your DMI process in real time
• View the properties of your chosen elements
• Edit your DMI process graphically

Semantic Knowledge Sharing Assistant
Provides access to existing users' knowledge, sorting and selecting it automatically according to the user’s current working context
• Context the user works in
  – Several reservoirs, one settlement
• Knowledge that may be useful in this context
  – previously entered by other users

Gateway Process Manager
• Keep track of running processes
  – stop/pause/cancel the process
  – view the process’ source DISPEL
• Access the process’ results (if available) in several ways
  – raw or visualized

DMI Model Visualizer
• Visualization of data mining models
  – Read Weka classifier object
  – Produce PMML (Predictive Model Markup Language) description of the model
  – Show the PMML as a graphical tree

ADMIRE Project
Thank you for your attention.