DTC Archive: data repositories in the
fight against diffuse pollution
Mark Hedges, Richard Gartner: King’s College London
Mike Haft, Hardy Schwamm: Freshwater Biological Association
Open Repositories 2012, Edinburgh, Scotland/UK, 10th July 2012
A message from our sponsors
• Collaboration between the Freshwater
Biological Association and King’s College
London (Centre for e-Research)
• Funded by DEFRA (Department for the
Environment, Food and Rural Affairs)
– A UK government ministry
• Runs from Jan. 2011 – Dec. 2014
Background: water quality and
the DTC project
Diffuse Pollution – what is it?
• Pollution processes that:
– Individually, have minimal effect
– Cumulatively, have significant impact
• Some examples:
– Run-off of water/rain (e.g. from roads,
commercial properties)
– Farm fertilisers and waste
– Seepage from developed landscapes
Catchments – what are they?
Water Framework Directive
• What is an EU Directive?
– An EU Directive is a European Union legal instruction or
secondary European legislation which is binding on all
Member States but which must be implemented through
national legislation within a prescribed time-scale.
• Water Framework Directive concerns
water quality
• Freshwater (rivers, lakes, groundwater)
adversely affected by diffuse pollution
• Failure to comply means problems!
DTC Project
• DTC = Demonstration Test Catchment
• Investigate measures for reducing impact
of diffuse water pollution on ecosystems
• Evaluate the extent to which on-farm
mitigation measures can reduce impact of
water pollution on river ecology
– cost-effectively
– maintaining food production capacity
Defra Demonstration Test Catchments (DTCs)
3 catchment areas in England selected for tests
How does the DTC project work?
• The procedure is (roughly speaking):
– Monitor various environmental markers
– Try out mitigation measures
– Analyse changes in baseline trends of markers in
response to these measures
• All this produces a great variety of data
• The DTCs create data, the DTC Archive
project has to make it usable and useful!
Equipment for data capture
Bank-side water-quality
monitoring station
Drilling a borehole for
monitoring groundwater
Images thanks to Wensum DTC
Bank-side water-quality monitoring station [Image from Wensum DTC]
Labelled components (LHS and RHS views): mains power, nitrate probe, ammonium analyser, ISCO automatic water sampler, pump, flow cell, YSI multi-parameter sonde, Meteor telemetry unit, Total P and Total reactive P analyser
DTC Archive
Purpose of the archive
• Curating data generated and captured by
DTC projects
• DTCs create data, we have to make it useful!
• Data archive, but also querying, browsing,
visualising, analysing, other interactions
• Integrated views across diverse data
• Must meet the needs of different users:
researchers, but also land managers, civil
servants, planners, ...
The Data
• Mostly numerical in some form:
spreadsheets, databases, CSV files
– Sensor data (automated, telemetry)
– Manual samples/analyses
• Species/ecological data
• Geo-data
• Also less highly structured information:
– Time series images, video
– Stakeholder surveys
– Unstructured documents
Example: water quality data
Date/time, pH, Electrical Conductivity, Ca, Mg, Na, K, SO4, Cl, Total Alkalinity, HCO3, CO3, Si, B, NO3, NO2, NH3, Total N, Total Particulate N, Total Dissolved N, Dissolved Organic N
dd/mm/yyyy HH:MM, -, uS/cm, mg/l, mg/l, mg/l, mg/l, mg/l, mg/l, mg CaCO3/l, mg/l, mg/l, ug/l, ug/l, mg N/l, ug N/l, ug N/l, mg/l, mg/l, mg/l, mg/l
11/10/2010 12:00, 8.18, 700, 129.3, 3.5, 12.72, 1.6, 32.39, 42.64, 293, 358, 0, 3336, 48, 5.73, 42.6, 20, 6.3, 0, 6.3, 0.5
18/10/2010 14:42, 7.9, 701, 134.6, 3.98, 14.79, 2, 29.95, 39.07, 289, 353, 0, 3690, 26, 4.07, 30.3, 21, 5.3, 0, 5.3, 1.2
21/10/2010 00:36, 7.87, 727, 137.8, 3.31, 13.57, 1.3, 27.04, 41.03, 293, 357, 0, 2954, 26, 9.01, 19.7, 31, 10.1, 0, 10.1, 1
26/10/2010 13:43, 7.93, 585, 162.8, 3.84, 16.11, 1.5, 27.1, 40.06, 294, 358, 0, 3015, 26, 8.79, 20.8, 16, 10.1, 0, 10.1, 1.3
29/10/2010 09:45, 8.24, 688, 148.7, 3.54, 14.7, 1.2, 26.49, 39.91, 273, 325, 0.16, 2857, 15, 8.54, 26.7, 26, 9.7, NaN, 9.8, 1.2
02/11/2010 12:00, 8.22, 585, 137.8, 3.53, 14.15, 1.3, 28.3, 40.75, 275, 328, 0.14, 2887, 33, 6.71, 41.2, 24, 7.8, , 7.8, 1.1
05/11/2010 09:50, 8.23, 763, 141.4, 3.66, 14.23, 1.3, 30.16, 42.41, 257, 307, 0.14, 3761, 30, 6.78, 42.1, 21, 7.1, NaN, 7.3, 0.5
09/11/2010 11:13, 8.32, 696, 135.3, 3.36, 12.69, 1.7, 21.64, 33.6, 271, 320, 0.2, 2590, 21, 11.05, 16.7, 21, 12.5, 0.1, 12.4, 1.3
12/11/2010 09:58, 7.92, 681, 138.9, 3.27, 12.94, 1.2, 24.23, 37.66, 279, 340, 0, 2712, 7, 11.16, 13.6, 11, 12.5, 0, 12.5, 1.3
16/11/2010 10:19, 7.88, 699, 136.7, 3.42, 13.47, 1.1, 26.26, 37.64, 293, 357, 0, 3190, 25, 8.22, 24.1, 23, 10, 0, 10, 1.7
19/11/2010 10:00, 7.9, 768, 137.3, 3.53, 13.7, 1.1, 27, 38, 296, 361, 0, 3328, 14, 7.5, 30.8, 24, 9, 0, 9, 1.4
23/11/2010 10:43, 7.97, 713, 132.3, 3.55, 14.51, 1.4, 26.42, 38.74, 292, 356, 0, 3597, 7, 6.32, 32.9, 29, 7.9, 0, 7.9, 1.5
26/11/2010 10:15, 7.79, 632, 130.4, 3.19, 16.77, 1.3, 20.79, 39.59, 274, 334, 0, 2583, 63, 9.34, 11.9, 13, 11.3, 0, 11.3, 2
30/11/2010 10:24, 8.01, 679, 135.7, 3.34, 17.16, 1.2, 25.64, 43.11, 290, 353, 0, 2825, 35, 9.14, 17.1, 8, 10.7, 0, 10.7, 1.6
02/12/2010 14:05, 8.05, 717, 133.1, 3.27, 15.75, 1.1, 25.92, 41.74, 288, 351, 0, 2880, 21, 9.11, 23.7, 1, 11.1, 0.2, 10.9, 1.8
07/12/2010 09:54, 7.98, 680, 137.5, 3.37, 13.89, 1.1, 26.24, 36.78, 292, 356, 0, 2843, 39, 9.09, 13.9, 24, 10.9, 0, 11, 1.8
10/12/2010 10:08, 7.96, 753, 136, 3.51, 21.28, 1.3, 27.88, 49.67, 297, 362, 0, 3157, 28, 7.83, 24.5, 46, 9.8, NaN, 9.8, 1.9
14/12/2010 10:28, 8.04, 709, 144.6, 3.59, 15.37, 1.1, 26.23, 38.42, 298, 363, 0, 2803, 22, 8.47, 15.1, 20, 10.4, NaN, 10.5, 2
16/12/2010 09:40, 7.95, 718, 133.2, 3.31, 15.92, 1.1, 25.03, 40.34, 290, 354, 0, 2972, 12, 8.21, 16.8, 47, 10.4, 0, 10.4, 2.1
21/12/2010 11:48, 7.98, 718, 131.6, 3.33, 13.74, 1.1, 27.17, 37.57, 302, 368, 0, 3016, 21, 8.54, 14.4, 24, 10.2, 0, 10.2, 1.6
30/12/2010 09:20, 7.97, 688, 131.1, 3.17, 13.78, 1.1, 24.34, 35.91, 288, 352, 0, 2564, 21, 9.18, 11.2, 23, 11, 0, 11, 1.8
05/01/2011 11:07, 8.1, 706, 126.9, 3.16, 12.88, 1, 27.72, 38.5, 311, 379, 0, 2833, 23, 8.52, 17, 22, 10, 0.1, 9.9, 1.3
07/01/2011 10:00, 7.98, 700, 130.9, 3.38, 14.77, 1.1, 34.8, 40.93, 300, 366, 0, 3023, 21, 7.68, 21.2, 31, 9.6, NaN, 9.7, 1.9
11/01/2011 10:02, 8.04, 688, 120.7, 2.98, 14.41, 1.2, 28.32, 38.02, 279, 340, 0, 2587, 13, 7.92, 12.8, 29, 10.5, 0.2, 10.3, 2.3
14/01/2011 09:47, 7.88, 588, 105.9, 2.65, 11.32, 1.3, 22.91, 27.69, 261, 319, 0, 2044, 23, 8.14, 8.2, 21, 10.3, 0.2, 10.1, 2
61,752 data
points per year
for all stations
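A minimal Python sketch of how samples like those on this slide might be read, assuming the spreadsheet has been exported as comma-separated text with one sample per line. The three-column subset, the function names parse_value and parse_row, and the treatment of blank or NaN cells as missing values are illustrative assumptions, not the DTC Archive's actual ingestion code.

```python
import math
from datetime import datetime

# Column names and units as given on the slide (subset shown for brevity).
COLUMNS = [
    ("pH", "-"),
    ("Electrical Conductivity", "uS/cm"),
    ("Ca", "mg/l"),
]


def parse_value(text):
    """Convert one cell to float; blank or 'NaN' cells become None."""
    text = text.strip()
    if not text:
        return None
    value = float(text)
    return None if math.isnan(value) else value


def parse_row(line):
    """Parse 'dd/mm/yyyy HH:MM, v1, v2, ...' into a timestamp and a dict."""
    cells = line.split(",")
    timestamp = datetime.strptime(cells[0].strip(), "%d/%m/%Y %H:%M")
    values = {
        name: parse_value(cell)
        for (name, _unit), cell in zip(COLUMNS, cells[1:])
    }
    return timestamp, values


# First sample from the slide, truncated to the three columns above.
timestamp, values = parse_row("11/10/2010 12:00, 8.18, 700, 129.3")
print(timestamp.isoformat(), values)
```

Keeping missing cells as None rather than 0 matters here, because 0 is a legitimate measured value in several columns.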
Example: weather station data
DATE, TIME, MAX-WIND-SPEED, MIN-WIND-SPEED, MEAN-WIND-SPEED, WIND-DIRECTION, BATTERY, RELATIVE-HUMIDITY, AIR-TEMPERATURE, NET-RADIATION, RAINFALL
07/02/2012, 14:30:35, 8.96, 1.991, 3.52, 110.6, 13.77, 55.86, -1.267, 81.7, 0
07/02/2012, 15:15:35, 5.474, 1.493, 3.371, 111, 13.82, 56.54, -1.959, 74.45, 0
07/02/2012, 14:15:35, 6.967, 1.493, 3.353, 110.9, 13.77, 57.11, -1.137, 90.3, 0
07/02/2012, 14:00:35, 4.977, 1.493, 3.067, 115.2, 13.75, 57.66, -1.034, 97.4, 0
07/02/2012, 15:30:35, 4.977, 0.995, 3.034, 111.8, 13.83, 58.02, -2.152, 56.96, 0
07/02/2012, 14:45:35, 7.963, 1.493, 3.653, 113.1, 13.79, 58.85, -1.467, 78.52, 0
07/02/2012, 15:00:35, 4.977, 1.493, 3.203, 110.3, 13.8, 58.98, -1.634, 78.6, 0.2
07/02/2012, 15:45:35, 6.967, 1.493, 3.225, 110.9, 13.84, 60.64, -2.374, -17.87, 0
07/02/2012, 13:45:35, 5.474, 0.995, 3.363, 110.2, 13.75, 61.55, -0.828, 103.9, 0
07/02/2012, 16:15:35, 5.474, 0.995, 2.722, 110.6, 13.87, 61.94, -2.823, -45.21, 0
07/02/2012, 16:00:35, 5.972, 1.493, 3.144, 108, 13.86, 62.22, -2.616, -64.56, 0
07/02/2012, 13:30:35, 5.972, 1.991, 3.591, 105.6, 13.7, 62.68, -0.71, 109.7, 0
Example: Field Use Data
Challenges of data
• Not primarily an issue of scale
• Datasets diverse in terms of structure
• Different degrees of structuring:
– Highly structured (e.g. sensor outputs)
– Highly unstructured (e.g. surveys, interviews)
• Different types of structure (tables of data, geospatial)
• Some small, hand-crafted data sets.
– Idiosyncratic metadata, description, vocabularies
– Varying provenance and reliability
INSPIRE
• Another EU directive 
• An Infrastructure for Spatial Information in the
European Community
• Create a European Spatial Data Infrastructure
for improved sharing of spatial information
• Includes standards for describing, representing,
disseminating geo-spatial data, e.g.
– Gemini2 for catalogue metadata
– GML (Geography Markup Language)
• Builds on ISO standards (ISO 19100 series)
Generic Data Model
ISO 19156: Observations and Measurements
Multiple Data Representations
Generic data model implemented in several
ways for different purposes:
• Archival representation
– based on library/archive standards
• Data representation for data integration
– “Atomic” representation as triples
• Various derived representations
– Generated for input to specific tools/analysis
Archival Data Representation
Model for Integration
• RDF triples
• Atomic statements forming network of nodes/relations
• Discrete datasets mapped into common format
[Diagram: Subject -> predicate -> Object, or Subject -> predicate -> Literal value; Subject and Object are identified by URIs. Example: Species -> memberOf -> Genus; Species -> hasCommonName -> "Water flea"]
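The triple model on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the base URI, the identifiers species/1 and genus/1, and the helper match are hypothetical, and a real system would more likely use an RDF library and triple store than tuples in a set.

```python
# Hypothetical base URI; the archive's real identifiers are not given here.
BASE = "http://example.org/dtc/"

# Each triple is (subject, predicate, object). Subjects and predicates are
# URIs; objects are either URIs (other nodes) or plain literal values.
triples = {
    (BASE + "species/1", BASE + "memberOf", BASE + "genus/1"),
    (BASE + "species/1", BASE + "hasCommonName", "Water flea"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Find every common-name statement, whichever dataset it came from.
names = match(triples, predicate=BASE + "hasCommonName")
print(names)
```

Because every statement has the same three-part shape, data from structurally different datasets can be queried with one pattern-matching function.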
Example dataset
[Diagram of the dataset structure. Entities: Dataset, Site, Tarn, Actor, Location, ObservationSet, Observation. Attributes shown: Name, CollectionMethod, GridReference, Easting, Northing, Latitude, Longitude.]
• Raw ObservationSets – About: Rainfall, Type: Raw, Unit: Inch
• Derived ObservationSets – About: Rainfall, Type: Derived, Unit: mm, DependsOn: OS1, OS2, Duration: 1Day
• Each Observation – StartDate, EndDate, Value
English Lake District rainfall dataset – from FISH.Link project
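A hedged Python sketch of the raw/derived relationship in this example: a derived ObservationSet in millimetres built from a raw one recorded in inches. The class and field names follow the diagram labels, but the function derive_mm and the sample value are illustrative assumptions. How the real derived sets combine OS1 and OS2 is not stated on the slide, so only a single-source unit conversion is shown.

```python
from dataclasses import dataclass, field
from datetime import date

MM_PER_INCH = 25.4  # exact by definition of the international inch


@dataclass
class Observation:
    start_date: date
    end_date: date
    value: float


@dataclass
class ObservationSet:
    about: str
    set_type: str  # "Raw" or "Derived"
    unit: str
    depends_on: list = field(default_factory=list)
    observations: list = field(default_factory=list)


def derive_mm(raw_id, raw):
    """Build a derived set in mm from a raw set recorded in inches."""
    if raw.unit != "Inch":
        raise ValueError("expected a raw set in inches")
    return ObservationSet(
        about=raw.about,
        set_type="Derived",
        unit="mm",
        depends_on=[raw_id],  # keeps the provenance link to the source set
        observations=[
            Observation(o.start_date, o.end_date, o.value * MM_PER_INCH)
            for o in raw.observations
        ],
    )


# Illustrative value only, not taken from the Lake District dataset.
os1 = ObservationSet(
    "Rainfall", "Raw", "Inch",
    observations=[Observation(date(2000, 1, 1), date(2000, 1, 1), 0.5)],
)
derived = derive_mm("OS1", os1)
print(derived.unit, derived.depends_on, derived.observations[0].value)
```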
Dataset capture and mapping
• Columns, concepts, entities mapped to
formal vocabularies
• Mappings defined in archive objects
• Automated
– e.g. sensor output files
• Computer-assisted
– e.g. some spreadsheets
• Manual
– by domain experts
– e.g. mark up values in texts
Spreadsheet transformation workflow – from FISH.Link project
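One way the column-to-vocabulary mapping step could look in Python, as a rough sketch. The vocabulary URIs, the row identifier, the "Notes" column and the function row_to_triples are all hypothetical; the slides do not name the vocabularies or show the mapping format stored in the archive objects.

```python
# Hypothetical vocabulary base URI; the real vocabularies are not named here.
VOCAB = "http://example.org/vocab/"

# Mapping from source column headers to formal vocabulary terms.
COLUMN_MAP = {
    "pH": VOCAB + "pH",
    "Ca": VOCAB + "calcium",
}


def row_to_triples(row_uri, row, column_map):
    """Emit one (subject, predicate, value) triple per mapped column.

    Columns with no mapping are returned separately so that a person can
    review them, in the spirit of the computer-assisted case.
    """
    triples, unmapped = [], []
    for column, value in row.items():
        term = column_map.get(column)
        if term is None:
            unmapped.append(column)
        else:
            triples.append((row_uri, term, value))
    return triples, unmapped


# pH and Ca values come from the water quality slide; "Notes" is invented
# here purely to show an unmapped column.
triples, unmapped = row_to_triples(
    "http://example.org/obs/1",
    {"pH": 8.18, "Ca": 129.3, "Notes": "free text"},
    COLUMN_MAP,
)
print(triples)
print(unmapped)
```

Keeping the mapping as data rather than code means the same function can serve automated sensor files and hand-mapped spreadsheets alike.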
Architectural Overview
[Diagram: layered architecture. Top: Browsing, Visualisation, Search, Analysis. Middle: RDF triples, linked by Mappings to Archive Objects. Bottom: Source datasets.]
Current Status and Next Steps
• Archive project started Jan. 2011, runs until the end
of 2014.
• Datasets are already being generated in large
quantities.
• Prototype functionality
• Modelling and Ingestion of data (incremental)
• Next steps:
– Extend types of dataset covered.
– User interactions (queries, visualisation etc.)
Thank you
mark.hedges@kcl.ac.uk
MHaft@fba.org.uk
http://dtcarchive.org/