making data-mining easier

Advanced Data Mining and Integration Research for Europe
Using Advanced Data Mining and
Integration in Environmental Risk
Management
Ladislav Hluchy
Ondrej Habala, Martin Šeleng, Peter Krammer, Viet Tran
Institute of Informatics
Slovak Academy of Sciences
SAMI 2011, January 2011, Smolenice, Slovakia
ADMIRE – Framework 7 ICT 215024
Contents
• EU FP7 project ADMIRE – overview
• Architecture of DMI solution in ADMIRE
• New DMI process language – DISPEL
• Pilot application scenarios – ORAVA, RADAR
– goals, architecture, experimental results
• Tools in ADMIRE
ADMIRE - Advanced Data Mining and
Integration Research for Europe
• 7th Framework Program
• ICT, Call 1.2.A
• Commenced in February 2008, running for 36 months
• €4.3 million in costs, and €3
million in EC funding
Collaborators
• University of Edinburgh, UK (Coordinator)
– NeSC - National e-Science Centre
– EPCC - Edinburgh Parallel Computing Centre
• Fujitsu Labs of Europe, UK
• University of Vienna, Austria
– Institute of Scientific Computing
• Universidad Politécnica de Madrid, Spain
– Facultad de Informatica
• Slovak Academy of Sciences, Slovakia
– Institute of Informatics
• ComArch S.A., Poland
ADMIRE Goals
• Accelerate access to and increase the benefits from
data exploitation;
• Deliver consistent and easy-to-use technology for
extracting information and knowledge;
• Cope with complexity, distribution, change and
heterogeneity of services, data, and processes,
through an abstract view of data mining and
integration; and
• Provide power to users and developers of data
mining and integration processes.
ADMIRE Structure
– WP1: High-Level Model and Language Research
• Incremental development of models and languages with a goal of
describing Data Mining and Integration (DMI) processes abstractly
– WP2: Architecture Research
• Incremental development of a flexible, scalable and open DMI
architecture
– WP3: Platform Support & Delivery
• Deliver robust service platforms, support users and encapsulate
knowledge in a book
– WP4: Service Infrastructure Development and Enhancement
• Develop technology and services to enhance the DMI service
infrastructure based on Fujitsu’s USMT
ADMIRE Structure
– WP5: Data Mining and Integration Tools
Development
• Develop and integrate tools that make the technology
easier to use and reduce the frequency of failures
– WP6: Integrated Applications
• Demonstration of validation and performance of
architecture, language, platform and tools as an
integrated environment for Data Mining and Integration
– WP7: Project Management
• Management and coordination of the project
ADMIRE Architecture:
Separation of Concerns
ADMIRE Architecture
DISPEL – Data Intensive Systems
Process-Engineering Language
• Data-intensive distributed systems
• Connection point of complex application requests
and complex enactment systems
–Benefit: method development, engineering and evolution
of supported practices can take place independently in
each world
• Describes enactment requests for streaming-data
workflow processes
• “Process-engineering time” – transform and optimize
process in preparation for enactment period
DISPEL: Simple Example
String sql1 = "SELECT * FROM some_table";
String sql2 = "SELECT * FROM table2";
String resource = "128.18.128.255";
SQLQuery query = new SQLQuery;
Creating streams of literals:
|- sql1, sql2 -| => query.expression;
|- resource -| => query.resource;
Creating connections:
Tee tee = new Tee;
query.result => tee.connectInput;
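For readers unfamiliar with dataflow notation, the example above can be loosely mirrored in Python. This is only an analogy under stated assumptions, not DISPEL and not the ADMIRE API: `literal_stream` and `sql_query` are hypothetical stand-ins, no database is contacted, and pairing the single resource literal with every expression is an assumption about how the two input streams combine. `itertools.tee` plays the role of the `Tee` element.

```python
from itertools import tee
from typing import Iterable, Iterator, Tuple


def literal_stream(*values: str) -> Iterator[str]:
    """Analogue of |- a, b -|: turn a list of literals into a stream."""
    yield from values


def sql_query(expressions: Iterable[str],
              resources: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for the SQLQuery element.

    It does not run SQL; it only pairs each expression with each resource
    so that the flow of values through the connections is visible.
    """
    resource_list = list(resources)
    for expression in expressions:
        for resource in resource_list:
            yield (resource, expression)


# Streams of literals feeding the two inputs of the query element.
expression_in = literal_stream("SELECT * FROM some_table", "SELECT * FROM table2")
resource_in = literal_stream("128.18.128.255")

# Connection: the query result stream is duplicated into two branches.
result = sql_query(expression_in, resource_in)
branch_a, branch_b = tee(result, 2)
```

Each branch can then be consumed independently, for example with `list(branch_a)`.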
DISPEL – real use
ADMIRE’s High-Level Architecture
ADMIRE Gateways
USMT
Security
• Framework built on top of a formal Grid
infrastructure; available security mechanisms
include:
– Transport-level security: SSL, HTTPS (currently
available)
– Message-level security: Web Services Security: SOAP
Message Security
– X.509 certificate authentication
– Multiple-stakeholder authorization
– Explicit Trust Delegation (ETD)
Pilot Applications
• ADMIRE has two pilot applications
– CRM
– FloodApp
• FloodApp
– Orava
– Radar
– SVP
ACRM Application
• Large-scale, distributed Churn scenario
– 4 database parts, distributed among ADMIRE partners
– Graphical UI for business analysts
– Using ADMIRE workbench, DISPEL and framework to create predictions of customer churn
• Mining over distributed data
Flood Application
Data sets used in hydrological scenarios
Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
HUSAV | Hydrology | Data from two probes, containing water saturation of soil | 10s of MB | 1998-2007 | Two distinct points
MARS | Meteorology | Historical meteorological data (temperature, rainfall, etc.) for Slovakia | 100s of MB | 1975-2007 | Slovakia (grid 50x50 km)
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
DAISY | Pedology | Various pedological parameters for one probe in southern Slovakia | 10s of MB | 1961-2000 | One point
WOFOST | Pedology | Crop data (with attached soil and meteorological data) for Slovakia, year 2006 | 10s of MB | 2006 | Slovakia (grid)
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
Scenarios deployment in testbed
• Two scenarios (ORAVA, RADAR) completely
deployed in testbed
• Data for the other scenarios are partially deployed
• 5 nodes (1 real + 4 virtual nodes)
• Databases (MySQL + PostgreSQL), GRIB files in
file storage
• USMT (Unified System Management Technology
- Jetty container), OGSA-DAI (Apache Tomcat)
Orava scenario
• Legend
– Green area – Orava
(part of north Slovakia)
– Blue – Orava reservoir
and local rivers
– Red dots – hydrological
measurement stations
• Notes
– We are interested only in hydrological stations below the Orava reservoir
– In our tests we will use the hydrological station 5830 (Tvrdosin)
ORAVA – data mining concept
• Predictors – rainfall amount (reservoir and station), air
temperature (reservoir and station), reservoir
discharge, reservoir temperature
• Targets – water level and temperature at a station below the reservoir
[Figure: time-window table with one row per time step, T-4 … T+6, and columns Time (T), Water temp Orava (E), Rainfall Orava (R), Air temp Orava (A), Air temp Station (B), Rainfall Station (S), Outflow Orava (D), Water level Station (X), Water temp Station (Y). Rows T-4 … T hold measured values; rows T+1 … T+6 hold future values, without E. Figure labels on the future rows: "Predicted by a meteo model" (meteorological values R, A, B, S), "Given in a schedule" (outflow D), "Targets of data mining" (X, Y).]
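The window layout above (past values T-4 … T as predictors, values at T+1 and later as targets) can be turned into training rows with a few lines of code. The following is a minimal sketch, not the ADMIRE implementation: `make_windows`, its parameters and the toy values are hypothetical, and only the column letters D (outflow) and X (water level) are taken from the slide.

```python
from typing import Dict, List, Sequence


def make_windows(series: Dict[str, Sequence[float]],
                 targets: Sequence[str],
                 lags: int = 4,
                 horizon: int = 1) -> List[Dict[str, float]]:
    """Build one training row per time step T.

    Each row holds every variable at T-lags ... T plus the target
    variables at T+horizon, named like the cells of the slide's table
    (for example "D-2", "D", "X+1").
    """
    length = min(len(values) for values in series.values())
    rows = []
    for t in range(lags, length - horizon):
        row = {}
        for name, values in series.items():
            for lag in range(lags, -1, -1):
                key = f"{name}-{lag}" if lag else name
                row[key] = values[t - lag]
        for name in targets:
            row[f"{name}+{horizon}"] = series[name][t + horizon]
        rows.append(row)
    return rows


# Toy hourly values, invented for illustration only.
toy = {
    "D": [30, 30, 30, 50, 50, 50, 40],
    "X": [28.0, 28.1, 28.1, 28.6, 29.0, 29.1, 28.8],
}
rows = make_windows(toy, targets=["X"], lags=2, horizon=1)
```

Each dictionary in `rows` corresponds to one horizontal slice of the table, ready to be handed to a regression learner.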
ORAVA – data integration
• Integration of data from
– GRIB files
– Reservoirs
• Inputs
– Time period of experiment
– Reservoir ID
– List of hydro stations
– Geo coordinates
ORAVA – data sets
Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
SHMU_HIST | Meteorology | Historical meteorological data from SHMI probes | 100s of MB | 1998-2007 | Slovakia (more than 100 distinct probes)
SHMU_GRIB | Meteorology | Historical temperatures and rainfall amounts in a gridded binary format | 100s of GB | 1998-2007 | Slovakia (grid, various sizes)
SHMU_HYDRO | Hydrology | Historical data from hydrological measurement stations | 10s of MB | 1998-2007 | Orava and upper Váh river
ORAVA – integrated and preprocessed data
Integrated raw data (seven consecutive time steps, values listed per column):
Water_temp Orava: 1 (only one value present)
Air_temp Orava: -4, -4, -5, -5, -5, -3, -3
Outflow Orava: 30, 30, 30, 30, 30, 50, 50
Rainfall: -5.55E-20, -5.55E-20, -4.24E-20, -8.47E-20, -8.47E-20, -8.47E-20, -8.47E-20
Air_temp Station: 269.0278, 269.0476, 269.5059, 270.2394, 270.8507, 271.2792, 271.9238
Flow/Height Station: 28, 28.62, 28.62, 28.62, 28, 28, 28
Water_temp Station: 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.8

Filters applied: LinearTrend, ReplaceMissingValues, ZeroEpsilon, Kelvin2Celsius

Integrated preprocessed data (rows ordered by time):
Water_temp Orava | Air_temp Orava | Rainfall Orava | Outflow Orava | Rainfall Station | Air_temp Station | Flow/Height Station | Water_temp Station
1.0 | -4.0 | 0.0 | 30.0 | 0.0 | -3.12223 | 28.0 | 0.7
1.0 | -4.0 | 0.0 | 30.0 | 0.0 | -3.1024 | 28.62 | 0.7
0.995833 | -5.0 | 0.0 | 30.0 | 0.0 | -2.64408 | 28.62 | 0.7
0.991667 | -5.0 | 0.0 | 30.0 | 0.0 | -1.91062 | 28.62 | 0.7
0.9875 | -5.0 | 0.0 | 30.0 | 0.0 | -1.29926 | 28.0 | 0.7
0.983333 | -3.0 | 0.0 | 50.0 | 0.0 | -0.87076 | 28.0 | 0.7
0.979167 | -3.0 | 0.0 | 50.0 | 0.0 | -0.22617 | 28.0 | 0.8
ORAVA – data mining
• Input - Integrated data
• Data Mining Phases:
– Data understanding
• Data visualization
• Data quality exploration
– Data preparation
• Missing values substitution (ReplaceMissingValues filter)
• Noise reduction (ZeroEpsilon filter)
• Switching from one scale to another (Kelvin2Celsius filter)
• Data modifying (LinearTrend filter)
– Model training
• Training on historical data (8760 records)
• Linear Regression model
• Neural networks - multilayer perceptron without hidden layers
– Model Evaluation
• Testing of the trained model
• N-fold cross validation
• Using training sets
• Output - Prediction model
[Figure: workflow diagram with the boxes Integrated Data, Data Visualization, Data Preparation, Data Cleaning, Clean Data, Model Training, Model Visualization, Model Evaluation, Prediction Model]
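The data preparation filters named above can be sketched in plain Python. The slides give only the filter names and their one-line purposes, so everything below is an assumption about their behaviour, not the project's code: mean substitution follows the documented default of Weka's ReplaceMissingValues filter; the `eps` threshold is hypothetical; `linear_fill` interprets LinearTrend as linear interpolation between known points, which is a guess; and the conversion uses the standard offset of 273.15, whereas the sample values on the slide differ from that by 1.0, so the project's filter may have used another constant.

```python
from typing import List, Optional


def replace_missing(values: List[Optional[float]]) -> List[float]:
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in values]


def zero_epsilon(values: List[float], eps: float = 1e-9) -> List[float]:
    """Snap numerically insignificant values (|v| < eps) to exactly 0.0."""
    return [0.0 if abs(v) < eps else v for v in values]


def kelvin_to_celsius(values: List[float], offset: float = 273.15) -> List[float]:
    """Convert temperatures from kelvin to degrees Celsius."""
    return [v - offset for v in values]


def linear_fill(values: List[Optional[float]]) -> List[float]:
    """Fill None by linear interpolation; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


rainfall = zero_epsilon([-5.55e-20, -4.24e-20, 0.5])
daily_temp = linear_fill([1.0, None, None, 0.7])
```

Applied in sequence to each column, functions of this kind would take integrated raw data to a cleaned table of the shape shown on the previous slide.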
Orava – data mining results
prediction of temperature
• Linear Regression model equation (the signs between terms are not legible in this copy and are written as ±):

$$
\begin{aligned}
\mathit{Water\_temp}_{station} = {} & 0.6473 \cdot \mathit{Water\_temp}_{Orava} \pm 0.0239 \cdot \mathit{Air\_temp}_{Orava} \pm 0.0359 \cdot \mathit{Rainfall}_{Orava} \\
& \pm 0.0055 \cdot \mathit{Outflow}_{Orava} \pm 0.0418 \cdot \mathit{Rainfall}_{station} \pm 0.0117 \cdot \mathit{Air\_temp}_{station} \\
& \pm 0.0503 \cdot \mathit{Flow}_{station} \pm 2.4324
\end{aligned}
$$
Orava – temperature prediction
model comparison
Property | Linear regression | Multilayer perceptron
Correlation coefficient | 0.9639 | 0.9821
Mean absolute error | 1.1791 | 0.7748
Root mean squared error | 1.4607 | 1.0386
Relative absolute error | 23.8739 % | 15.6884 %
Root relative squared error | 26.609 % | 18.9195 %
Total number of instances | 8760 | 8760

Sample predictions on validation data:
Validation data | Linear regression: predicted | Linear regression: error | Multilayer perceptron: predicted | Multilayer perceptron: error
11.6 | 13.071 | 1.471 | 12.446 | 0.846
15.2 | 14.335 | -0.865 | 14.494 | -0.706
6.4 | 7.614 | 1.214 | 5.766 | -0.634
0.7 | 2.284 | 1.584 | 0.926 | 0.226
11.7 | 10.948 | -0.752 | 10.266 | -1.434
14.3 | 16.526 | 2.226 | 13.671 | -0.629
15.6 | 12.891 | -2.709 | 14.502 | -1.098
15.7 | 12.838 | -2.862 | 13.353 | -2.347
0.8 | 1.752 | 0.952 | 0.826 | 0.026
15.8 | 15.188 | -0.612 | 14.005 | -1.795
15.4 | 16.553 | 1.153 | 13.129 | -2.271
14.9 | 12.795 | -2.105 | 14.599 | -0.301
15.4 | 15.660 | 0.260 | 13.696 | -1.704
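The properties compared above follow common textbook definitions, which the sketch below spells out. It assumes the relative errors are taken against a baseline that always predicts the mean of the actual values; the toolkit used in the project may compute that baseline slightly differently. The function name `evaluation_metrics` is hypothetical, and the two lists are the thirteen sample rows shown for the linear regression model, so the resulting figures describe that small sample only, not the full 8760 instances.

```python
import math
from typing import Dict, Sequence


def evaluation_metrics(actual: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Compute correlation, MAE, RMSE, RAE and RRSE for paired values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(var_a * var_p),
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        # Errors relative to always predicting the mean of the actual values.
        "rae_percent": 100 * sum(abs_err) / sum(abs(a - mean_a) for a in actual),
        "rrse_percent": 100 * math.sqrt(sum(sq_err) / var_a),
    }


validation = [11.6, 15.2, 6.4, 0.7, 11.7, 14.3, 15.6,
              15.7, 0.8, 15.8, 15.4, 14.9, 15.4]
linear_regression = [13.071, 14.335, 7.614, 2.284, 10.948, 16.526, 12.891,
                     12.838, 1.752, 15.188, 16.553, 12.795, 15.660]
sample_metrics = evaluation_metrics(validation, linear_regression)
```

Running the same function on the multilayer perceptron column allows the two models to be compared on identical rows.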
Orava – prediction of water level
• Neural network model – multilayer perceptron
• Input parameters (6)
– Rainfall ([S+1]), Water-Level ([X])
– Outflows ([D], [D+1] – [D], ln([D]), sqrt([D]))
• Output
– Difference of water level ([X+1] – [X])
[Figure: time-window table with rows T-3 … T+2 and columns Time (T), Water temp Orava (E), Rainfall Orava (R), Air temp Orava (A), Air temp Station (B), Rainfall Station (S), Outflow Orava (D), Water level Station (X), Water temp Station (Y); rows T+1 and T+2 have no E value.]
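The six inputs and the single output listed above can be assembled per time step as follows. This is a sketch under stated assumptions: `water_level_samples` and the toy values are hypothetical, the natural logarithm is assumed for `ln`, and outflow values are assumed to be strictly positive so that the logarithm is defined.

```python
import math
from typing import List, Sequence, Tuple


def water_level_samples(rain_station: Sequence[float],
                        outflow: Sequence[float],
                        level: Sequence[float]) -> List[Tuple[List[float], float]]:
    """Return (features, target) pairs, one per time step T.

    Features: [S+1], [X], [D], [D+1] - [D], ln([D]), sqrt([D]).
    Target:   [X+1] - [X], the change in water level.
    """
    samples = []
    for t in range(len(level) - 1):
        d_now, d_next = outflow[t], outflow[t + 1]
        features = [
            rain_station[t + 1],   # forecast rainfall at the station
            level[t],              # current water level
            d_now,                 # current reservoir outflow
            d_next - d_now,        # scheduled change in outflow
            math.log(d_now),       # requires d_now > 0
            math.sqrt(d_now),
        ]
        samples.append((features, level[t + 1] - level[t]))
    return samples


# Toy values, invented for illustration only.
samples = water_level_samples(
    rain_station=[0.0, 0.2, 0.0],
    outflow=[30.0, 50.0, 50.0],
    level=[28.0, 28.62, 28.0],
)
```

Predicting the difference rather than the absolute level means the predicted level is recovered by adding the model output to the current value [X].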
Orava – water level prediction
• Data count: 8735 records
• Activation function of the feed-forward neural network: sigmoid
• Correlation coefficient: 0.9816
• Mean absolute error: 0.4105
• Root mean squared error: 0.9673
• Relative absolute error: 30.5869 % (from difference)
• Root relative squared error: 19.2384 % (from difference)
RADAR
• Very short-term rainfall prediction from
weather radar data
– Movement of areas with higher air moisture content,
and thus also higher precipitation potential
• Mining of matrices of data
[Figure: time-window table with rows T-3 … T+2 and columns Time (T), Potential precipitation from RADAR (R), Measured precipitation at STATION (S), Temperature from MODEL (H), Wind from MODEL (W). Rows T+1 and T+2 contain only R and S and carry the label "Targets of data mining".]
Meteorological data
• Network of synoptic stations in Slovakia
– 27 stations
– Data used from the years 2007 and 2008
– Rainfall, humidity, atmospheric pressure and temperature values for each hour
RADAR isotonic model
• Actual model for rainfall prediction
– Isotonic regression model structure
– Training on historical data
– Correlation coefficient: 0.4593
– Mean absolute error: 0.1105
– Root mean squared error: 0.5490
– Total number of instances: 89700
– Validation: 10-fold cross-validation
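The 10-fold cross-validation named above partitions the instances into ten folds and uses each fold once for testing while training on the other nine. The sketch below shows only the index bookkeeping; `k_fold_indices` is a hypothetical helper, and it splits the data in its existing order, whereas data mining toolkits usually shuffle the instances first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int,
                   k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


small_folds = list(k_fold_indices(10, k=3))
```

The reported error measures are then averaged or pooled over the ten test folds.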
Table of isotonic model
Index | Prediction (rainfall) | Cut point (reflectivity) | Index | Prediction (rainfall) | Cut point (reflectivity) | Index | Prediction (rainfall) | Cut point (reflectivity)
1 | 0.01 | 1.78 | 15 | 0.23 | 96.91 | 29 | 1.35 | 355.91
2 | 0.03 | 1.84 | 16 | 0.28 | 97.47 | 30 | 1.40 | 377.19
3 | 0.03 | 8.28 | 17 | 0.30 | 129.63 | 31 | 1.52 | 381.78
4 | 0.03 | 16.97 | 18 | 0.33 | 129.72 | 32 | 2.13 | 395.31
5 | 0.03 | 24.28 | 19 | 0.42 | 147.94 | 33 | 2.23 | 399.16
6 | 0.03 | 36.91 | 20 | 0.44 | 168.59 | 34 | 2.28 | 447.06
7 | 0.05 | 37.53 | 21 | 0.50 | 187.13 | 35 | 2.60 | 447.69
8 | 0.05 | 38.72 | 22 | 0.51 | 187.47 | 36 | 2.60 | 467.66
9 | 0.06 | 44.53 | 23 | 0.62 | 211.56 | 37 | 2.98 | 515.19
10 | 0.07 | 59.03 | 24 | 0.72 | 268.38 | 38 | 3.75 | 625.56
11 | 0.08 | 61.16 | 25 | 0.93 | 281.28 | 39 | 4.93 | 665.41
12 | 0.10 | 61.78 | 26 | 1.00 | 297.72 | 40 | 5.24 | 901.25
13 | 0.14 | 81.59 | 27 | 1.14 | 314.47 | 41 | 5.40 | 934.41
14 | 0.19 | 89.22 | 28 | 1.26 | 344.59 | 42 | 6.30 | 971.5
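An isotonic regression model is a non-decreasing step function, so the table above can be applied with a sorted lookup. The sketch below uses the 42 cut points and predictions from the table. How the intervals are closed is not stated on the slide, so two assumptions are made and marked in the code: a prediction applies from its cut point up to the next one, and inputs below the first cut point give 0.0. The function name `predict_rainfall` is hypothetical.

```python
from bisect import bisect_right

# Cut points (radar reflectivity) and predictions (rainfall) from the table.
CUT_POINTS = [
    1.78, 1.84, 8.28, 16.97, 24.28, 36.91, 37.53, 38.72, 44.53, 59.03,
    61.16, 61.78, 81.59, 89.22, 96.91, 97.47, 129.63, 129.72, 147.94, 168.59,
    187.13, 187.47, 211.56, 268.38, 281.28, 297.72, 314.47, 344.59, 355.91,
    377.19, 381.78, 395.31, 399.16, 447.06, 447.69, 467.66, 515.19, 625.56,
    665.41, 901.25, 934.41, 971.5,
]
RAINFALL = [
    0.01, 0.03, 0.03, 0.03, 0.03, 0.03, 0.05, 0.05, 0.06, 0.07,
    0.08, 0.10, 0.14, 0.19, 0.23, 0.28, 0.30, 0.33, 0.42, 0.44,
    0.50, 0.51, 0.62, 0.72, 0.93, 1.00, 1.14, 1.26, 1.35,
    1.40, 1.52, 2.13, 2.23, 2.28, 2.60, 2.60, 2.98, 3.75,
    4.93, 5.24, 5.40, 6.30,
]


def predict_rainfall(reflectivity: float) -> float:
    """Look up the step-function prediction for one reflectivity value.

    Assumption: prediction i holds for CUT_POINTS[i] <= x < CUT_POINTS[i+1].
    Assumption: values below the first cut point predict 0.0 rainfall.
    """
    position = bisect_right(CUT_POINTS, reflectivity)
    if position == 0:
        return 0.0
    return RAINFALL[position - 1]
```

Because both lists are non-decreasing, a larger reflectivity never produces a smaller predicted rainfall, which is the defining property of an isotonic model.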
Hydrometeorological performance
• Probability of detection (POD) with thresholds of 0.3 and 0.6 mm rainfall per hour:
– POD0.3 = 63.87 %
– POD0.6 = 56.22 %
• Miss rate (MR) with thresholds of 0.3 and 0.6 mm rainfall per hour:
– MR0.3 = 1.85 %
– MR0.6 = 1.58 %
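Probability of detection is commonly defined as hits / (hits + misses), where a hit is an observed event that was also predicted. The sketch below follows that common definition and treats "event" as rainfall at or above the threshold, which is an assumption; `probability_of_detection` and the toy values are hypothetical. The slides do not define the miss rate, and since the POD and MR values above do not sum to 100 %, MR evidently uses a different denominator, so only POD is sketched here.

```python
from typing import Sequence


def probability_of_detection(observed: Sequence[float],
                             predicted: Sequence[float],
                             threshold: float) -> float:
    """Fraction of observed rainfall events that the model also predicted.

    Assumption: an event is a value greater than or equal to the threshold.
    """
    hits = 0
    misses = 0
    for obs, pred in zip(observed, predicted):
        if obs >= threshold:
            if pred >= threshold:
                hits += 1
            else:
                misses += 1
    if hits + misses == 0:
        raise ValueError("no observed events at this threshold")
    return hits / (hits + misses)


# Toy hourly rainfall in mm, invented for illustration only.
observed_mm = [0.0, 0.5, 0.4, 1.0]
predicted_mm = [0.1, 0.6, 0.1, 0.2]
pod_03 = probability_of_detection(observed_mm, predicted_mm, threshold=0.3)
```

Raising the threshold leaves fewer observed events, so POD values at different thresholds are computed over different subsets of hours.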
RADAR model
• Other tested models
– Neural networks, SMOreg, linear regression, ...
– Reached correlation coefficient between 0.35 and 0.42
– Validation: 10-fold cross-validation
• Problems in model creation:
– The process is significantly stochastic
– Some input variables are backwards dependent on the output
– The meteorological process is very sensitive
– The reflection matrix represents the quantity of water in the atmosphere, not the exact rainfall rate in a specified area, as opposed to data from synoptic stations
ADMIRE Tools
• Registry client GUI
• Process designer
• SKSA
• Gateway Process Manager
• DMI Model Visualizer
Registry client GUI
• Read-only access to ADMIRE Registry
– list PEs and view their properties
– search, sort PEs
• Write access to Registry is done via DISPEL
documents
Process Designer
Manage your DMI project (files, directories – project structure)
Select elements from the Registry
View the canonical (DISPEL) representation of your DMI process in real time
View the properties of your chosen elements
Edit your DMI process graphically
Semantic Knowledge Sharing Assistant
Provides access to existing users' knowledge, sorting and selecting
it automatically according to the user's current working context
• Context the user works in
– Several reservoirs, one
settlement
• Knowledge that may be
useful in this context
– previously entered by
other users
Gateway Process Manager
• Keep track of running
processes
– stop/pause/cancel the
process
– view the process’ source
DISPEL
• access process’ results
(if available) in several
ways – raw or visualized
DMI Model Visualizer
• Visualization of data
mining models
– Read Weka classifier
object
– produce PMML
(Predictive Model Markup Language)
description of the
model
– Show the PMML as a
graphical tree
Admire Project
Thank you for your attention.