Harmonisation of data to SmartOpenData model. Final iteration

This project has received funding from the European Union's Seventh Programme for research, technological development and demonstration under grant agreement No 603824.

Deliverable D3.5 :: Public
Keywords: data harmonisation, ORM, RDF, RDFS, Linked Data

Linked Open Data for environment protection in Smart Regions
D3.5 Final Data Harmonisation, SmartOpenData project (Grant no.: 603824)
Version 1.0, © SmartOpenData Consortium 2015

Table of Contents

1 Introduction
2 Data Harmonisation
  2.1 CSV-to-RDF
    2.1.1 Italian pilot
    2.1.2 Portuguese-Spanish Pilot
    2.1.3 Irish pilot
    2.1.4 Transforming Data with Grafterizer and the Jarfter Service
  2.2 XML (GML)-to-RDF transformations
    2.2.1 Slovak pilot
  2.3 Relational DB-to-RDF transformations
    2.3.1 Czech pilot
3 Harmonising Observations and Measurements
  3.1 RDF Data Cube: Example
    3.1.1 Data Cube Components
    3.1.2 Data Cube Datasets
    3.1.3 Data Cube Structures
4 Conclusion
5 References
Annex A: Generating RDF with OpenRefine: Challenges and Solutions
  Language Tag Customisation
  RDF out of a List of Values
  More than one Root Nodes
Annex B: Portuguese-Spanish pilot: ORM and RDF Models
  Chemical Characteristics
  Climatology
  Forestry Tile
  Geometry
  Work Unit Ecosystem
  Work Unit Location

List of Figures
Figure 1: Workflow of OpenRefine-based data harmonisation
Figure 2: RDF model of Protected Sites
Figure 3: RDF model of Monitoring Stations
Figure 4: RDF model of Hazardous Substances
Figure 5: Portuguese-Spanish Pilot, data harmonisation methodology
Figure 6: Original Sample ARPA Data
Figure 7: RDF mapping for ARPA data
Figure 8: Generated RDF graph for ARPA data
Figure 9: User interface for Jarfter
Figure 10: Jarfter compiler services
Figure 11: Jarfter transformation web service
Figure 12: Dynamic deployment of data transformations
Figure 13: CloudML deployment template
Figure 14: List of updated GeoKnow XSLT stylesheets
Figure 15: Landing page for Unified Views
Figure 16: List of created pipelines
Figure 17: Section with DPU templates
Figure 18: Pipelines execution monitor
Figure 19: Scheduler with the possibility to define the schedules for pipelines execution
Figure 20: Section with additional settings
Figure 21: Example of pipeline details
Figure 22: Example of further DPU settings
Figure 23: Example of the interlinking pipeline
Figure 24: CKAN interface with the list of metadata for the open linked data from Slovak pilot
Figure 25: Parliament web application interface
Figure 26: Czech pilot Data model
Figure 27: RDF plugin of OpenRefine, language tag
Figure 28: Excerpt from the aux_040400_municipality.csv
Figure 29: RDF plugin of OpenRefine, literal node customisation
Figure 30: Excerpt from ObservationTiles.csv file
Figure 31: Excerpt from ObservationTiles.csv file
Figure 32: Chemical Characteristics: ORM Model
Figure 33: Chemical Characteristics: RDF Model
Figure 34: Climatology: ORM Model
Figure 35: Climatology: RDF Model
Figure 36: Forestry Tile: ORM Model
Figure 37: Forestry Tile: RDF Model
Figure 38: Geometry: ORM Model
Figure 39: Geometry: RDF Model
Figure 40: Work Unit Ecosystem: ORM Model
Figure 41: Work Unit Ecosystem: RDF Model
Figure 42: Work Unit Location: ORM Model
Figure 43: Work Unit Location: RDF Model

List of Tables
Table 1: Data transformation approaches
Table 2: Italian Pilot: summary of classes
Table 3: Italian Pilot: summary of data harmonisation
Table 4: Portuguese-Spanish Pilot: ORM constructs mapped to classes
Table 5: Portuguese-Spanish Pilot: ORM constructs mapped to properties
Table 6: An overview of the datasets and vocabularies used in SK Pilot
Table 7: List of phases and tasks extracted and deployed from the COMSODE methodology for Open Data publishing
Table 8: Vocabulary usage by pilot

Document Metadata

Contractual Date of Delivery to the EC: August 2015
Actual Date of Delivery to the EC: October 7th 2015
Editor(s): Tatiana Tarasova, SpazioDati
Contributor(s): Martin Tuchyňa (SAŽP), Jindřich Mynarz (SAŽP), Peter Mozolík (SAŽP), Dumitru Roman (SINTEF), Nikolay Nikolov (SINTEF), Antoine Pultier (SINTEF), Dina Sukhobok (SINTEF), Håvard H. Holm (SINTEF), Jan Bojko (UHUL FMI), John O’Flaherty (MAC), Gregorio Urquía (TRAGSA), Jesús Estrada (TRAGSA)

Document History

Version | Version date | Responsible | Description
0.0 | 20/07/2015 | SpazioDati | Outline and contributions
0.1 | 30/07/2015 | UHUL FMI, HSRS | Czech pilot contributions
0.2 | 15/08/2015 | SAŽP | Slovak pilot contributions
0.3 | 21/08/2015 | SpazioDati, TRAGSA | Contribution to data harmonisation of the Italian and Portuguese-Spanish pilots
0.4 | 21/08/2015 | SpazioDati | Restructuring the report
0.5 | 24/08/2015 | SINTEF | Contribution on Grafterizer and comparison of Grafterizer with OpenRefine
0.6 | 27/08/2015 | UHUL FMI, SAŽP, SINTEF | Sections 2.2, 2.3 on the Slovak and Czech pilots finalised; Section 2.1.4 about Grafterizer completed
0.7 | 28/08/2015 | SpazioDati | Final version of the report with missing contribution from the Irish Pilot; submitted to the project coordinator
1.0 | 7/10/2015 | TRAGSA | Editorial review

The information and views set out in this publication are those of the author(s) and do not necessarily reflect the official opinion of the European Communities. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.

Copyright © 2015, SmartOpenData Consortium.

Executive Summary

Task 3.5 is dedicated to harmonising the pilots' data to the Final SmOD model delivered in D3.4 [SMODD34]. The model is based on several INSPIRE topics and provides a basis for geospatial and environmental data interoperability. As such, the model does not cover domain-specific concepts of the pilots. Hence, the initial activities of the data harmonisation task included an evaluation of the SmOD model in the context of the pilots. Whenever the model was not sufficient to represent the domain of interest, a search for existing commonly accepted or standard vocabularies was performed, and if no suitable vocabulary was found, custom terms were developed. These custom terms constitute one of the main outcomes of the current task, the custom SmOD vocabulary. The vocabulary is published at http://www.w3.org/2015/03/inspire/smod#.

Operational aspects of the data harmonisation task concern data transformations from input data structures to RDF. Three different approaches were identified based on the pilots' requirements:
● CSV-to-RDF (Spanish-Portuguese, Italian and Irish pilots)
● XML-to-RDF (Slovak pilot)
● RDBMS-to-RDF (Czech pilot)

This document explains the approaches, discusses the tools and technologies used to realise them, and summarises the results of the data harmonisation task per pilot.
1 Introduction

The final SmartOpenData model was delivered with D3.4 [SMODD34]. It is based on the INSPIRE themes that were selected specifically to cover the domains of the pilots:
● The Generic Concept Model
● Protected Sites
● Land Use
● Administrative Units
● Bio-Geographical Units
● Species Distribution
● Corine Land Cover
● Environmental Monitoring Facilities
● Cadastral Parcels [1]

The model serves as a basis for harmonising data in the SmOD pilots. It defines basic concepts that are shared among the pilots, such as Protected Site or Cadastral Parcel. However, in addition to these basic concepts, every pilot contains concepts specific to its own domain, which are not covered by the model. As a result, every pilot had to extend the model with domain-specific terms. These terms were searched for in existing resources, such as the Linked Open Vocabularies repository [2], schema.org and the DBpedia OWL ontology. Whenever existing resources were not sufficient for the pilot's needs, custom terms were introduced. We accumulated these custom terms in the SmOD Custom Vocabulary published at http://www.w3.org/2015/03/inspire/smod#.

The rest of the document is structured as follows. We split Section 2 into three blocks, each of which corresponds to a different data transformation approach:

Section | Approach | Tools, Technologies
Section 2.1 | CSV-to-RDF | OpenRefine [3], RDF plugin for OpenRefine [4], Fusepool BatchRefine API [5]
Section 2.2 | XML-to-RDF | XSLT (based on customised GeoKnow stylesheets [6]) and SmOD INSPIRE Vocabularies [7] using OpenDataNode [8]
Section 2.3 | Relational DB-to-RDF | D2RQ, r2rml parser

Table 1: Data transformation approaches

The CSV-to-RDF approach was discussed in detail in D3.3 [SMODD33], which included a tutorial on using the RDF plugin of OpenRefine for mapping CSV files into RDF and discussed preliminary results of transforming Italian and Portuguese-Spanish data into RDF. In this document we present the results of data harmonisation of the Italian pilot in Section 2.1.1, and the results of the Portuguese-Spanish data harmonisation in Section 2.1.2. We discuss input datasets, models of the pilots and the vocabularies used to encode data in RDF. The latter include the SmOD model, vocabularies developed by third parties and the custom SmOD vocabulary. We conclude the discussion of the CSV-to-RDF approach by presenting Grafterizer, a tool that performs transformations on tabular data. Section 2.1.4 demonstrates how to use the tool on data from the Italian pilot and compares the Grafterizer features with the RDF plugin of OpenRefine.

XML-to-RDF was also introduced in the previous deliverable D3.3. In the current report, Section 2.2.1 discusses the customisation of the GeoKnow XSL transformations to use the SmOD model as the target schema, with the support of the Open Data Node platform and elements of the COMSODE methodology framework [9], in the setting of the Slovak pilot.

In Section 2.3 we explain the Relational-to-RDF approach followed in the Czech pilot. Section 3 discusses the application of the RDF Data Cube vocabulary to harmonise environmental observations and measurements. Section 4 concludes the report.

[1] The Cadastral Parcel theme was added to the SmOD model after D3.4 had been finalised.
[2] http://lov.okfn.org/dataset/lov/
[3] http://openrefine.org/
[4] http://refine.deri.ie/
[5] https://github.com/fusepoolP3/p3-batchrefine
[6] https://web.imis.athena-innovation.gr/redmine/projects/geoknow_public/wiki/Inspire2RDF
[7] http://www.w3.org/2015/03/inspire/
[8] http://opendatanode.org/
[9] http://www.comsode.eu/wp-content/uploads/D5.1-Methodology_for_publishing_datasets_as_open_data.pdf
Namespaces used in the report:

Schema | Prefix | Namespace
SmOD Protected Sites | ps | http://www.w3.org/2015/03/inspire/ps#
SmOD Administrative Units | au | http://www.w3.org/2015/03/inspire/au#
SmOD Environmental Monitoring Facility | ef | http://www.w3.org/2015/03/inspire/ef#
SmOD Custom Vocabulary | smod | http://www.w3.org/2015/03/inspire/smod#
SmOD Cadastral Parcels Vocabulary | cp | http://www.w3.org/2015/03/inspire/cp#
SKOS | skos | http://www.w3.org/2004/02/skos/core#
Friend of a Friend | foaf | http://xmlns.com/foaf/0.1/
DC Terms | dcterms | http://purl.org/dc/terms/
GeoSPARQL | gsp | http://www.opengis.net/ont/geosparql#
DBpedia Ontology | dbpedia-owl | http://dbpedia.org/ontology/
RDF Data Cube Vocabulary | qb | http://purl.org/linked-data/cube#
Time Ontology | time | http://www.w3.org/2006/time#
QUDT Units | qudt-unit | http://qudt.org/1.1/vocab/unit#
QUDT Schema | qudt | http://qudt.org/schema/qudt#
RDF Schema | rdfs | http://www.w3.org/2000/01/rdf-schema#
Corine Land Cover Nomenclature in SKOS | clc | http://www.w3.org/2015/03/corine#
Asset Description Metadata Schema (ADMS) | adms | http://www.w3.org/ns/adms#
RAMON schema | ramon | http://rdfdata.eionet.europa.eu/ramon/ontology/
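The prefixes listed above are used throughout the report in prefixed names such as ps:ProtectedSite. As a minimal sketch, assuming a plain Python dictionary holding a subset of these prefixes, a prefixed name can be expanded to a full URI as follows. The names NAMESPACES and expand are illustrative only and are not part of any SmOD tooling.

```python
# Subset of the prefixes used in the report (illustrative, not exhaustive).
NAMESPACES = {
    "ps": "http://www.w3.org/2015/03/inspire/ps#",
    "au": "http://www.w3.org/2015/03/inspire/au#",
    "ef": "http://www.w3.org/2015/03/inspire/ef#",
    "smod": "http://www.w3.org/2015/03/inspire/smod#",
    "dcterms": "http://purl.org/dc/terms/",
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "qb": "http://purl.org/linked-data/cube#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'ps:ProtectedSite' to a full URI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in NAMESPACES:
        raise ValueError(f"Unknown or missing prefix in {prefixed_name!r}")
    return NAMESPACES[prefix] + local_name


if __name__ == "__main__":
    print(expand("ps:ProtectedSite"))
    print(expand("dcterms:coverage"))
```

Raising an error on an unknown prefix, rather than returning the input unchanged, makes a typo in a mapping visible early.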
2 Data Harmonisation

2.1 CSV-to-RDF

OpenRefine together with the RDF plugin was selected to map and convert the data of the Italian and Portuguese-Spanish pilots into RDF. The main motivation for this choice was the pilots' input conditions: input datasets (CSV files or XLS spreadsheets) were extracted from several independent data sources and added to the pilot in the course of work. The Portuguese-Spanish pilot aggregates input data from multiple sources of the Spanish and Portuguese public bodies. Input datasets of the Italian pilot at the moment include data from different public databases, and it is planned to include more data from other data sources.

We presented the workflow of OpenRefine-based data harmonisation in D3.3 [SMODD33]. Figure 1 illustrates the processes and tools involved in the workflow.

Figure 1: Workflow of OpenRefine-based data harmonisation

Data Pre-processing

In both pilots there was a need to prepare input datasets before mapping and transforming them to RDF. In the case of the Italian pilot, the functionalities of OpenRefine were sufficient to do this, unlike in some cases of the Portuguese-Spanish pilot, where ad-hoc bash scripts were applied to input datasets before loading them into OpenRefine.

Mappings Creation
RDF mappings were created using the GUI of the RDF plugin for OpenRefine. All the RDF mappings are available in the corresponding projects of the OpenRefine instance, which was deployed for the project at https://smod-refine.spaziodati.eu/. To access the instance, use the following username/password as credentials: smod/EnterSmartOpenData

Using OpenRefine for transforming the pilots' data posed several challenges. We discuss them and present our solutions in Annex A.

Data Transformation

OpenRefine was designed primarily as a personal desktop application, and is meant to be used in an interactive mode. Within the scope of another EU FP7 project, Fusepool [10], a batch version of OpenRefine was developed. The APIs of the BatchRefine [11] transformer enable programmatic access to the OpenRefine engine, which makes it possible to incorporate BatchRefine into an automatic Extract-Transform-Load procedure. In D3.3, Section 5.1 “BatchRefine Example using cURL”, we demonstrated the usage of the BatchRefine API. At the current stage of the pilots we performed all the transformations using the export “RDF as RDF/XML” functionality of OpenRefine.

In the rest of this section we discuss in detail the data harmonisation processes carried out in the Italian pilot (Section 2.1.1) and the Portuguese-Spanish pilot (Section 2.1.2). We introduce input datasets and discuss the RDF modelling and vocabularies used to generate the RDF representation of the pilots' data. In Annex A we report on our experience of using the RDF plugin of OpenRefine. We describe several cases in which the RDF generation task was not trivial, and present our solutions.

2.1.1 Italian pilot

The Italian pilot is led by ARPA, the Environmental Protection Agency of the Sicilian Region. Following the pilot's objectives, ARPA identified several user queries that underlie the baseline use case scenario of the pilot [12]. These queries guided the process of selecting input datasets, as well as the process of creating RDF models of them. In the current document we present one of these queries which, at the moment of writing this report, was fully implemented:
● Which rivers and lakes (upstream, within or crossing, and downstream) are linked to the environment of a Protected Site?

[10] http://fusepoolp3.github.io/
[11] https://github.com/fusepoolP3/p3-batchrefine
[12] Refer to [SMODD52] for more information about the pilot's objectives.

Input Datasets
Natura2000 database

The Natura2000 database [13] is maintained by the European Environment Agency (EEA) [14]. It accumulates information about protected sites from all EU members. The database is publicly available for download in the form of an MS Access database dump or an archive with CSV files. For the pilot we used the fifth release of the database, published in June 2014, which reflects the situation of the protected sites in the European Union up to and including 2013. The database consists of multiple tables. For the pilot's needs we were interested in the NATURA2000SITES table, which lists and describes protected areas.

Waterbase - Lakes, Waterbase - Rivers

The EEA Waterbase - Lakes [15] and Rivers [16] databases contain information about monitoring stations of lakes and rivers and measurements of water quality. ARPA extracted from the databases the data relevant for the pilot, including data about monitoring stations in Sicilian lakes and rivers and the concentrations of hazardous substances in the water measured by those stations. ARPA added geographical coordinates to some stations that were missing them.

RDF Modelling

Figures 2-4 illustrate the RDF models of Protected Sites, Monitoring Stations and Hazardous Substances respectively. Table 2 below summarises the classes of the models and gives examples of their instances. Measurements of hazardous substances were encoded in RDF using the RDF Data Cube framework (discussed in detail in Section 3).

Figure 2: RDF model of Protected Sites
Figure 3: RDF model of Monitoring Stations
Figure 4: RDF model of Hazardous Substances

[13] http://www.eea.europa.eu/data-and-maps/data/natura-5
[14] http://www.eea.europa.eu/
[15] http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10
[16] http://www.eea.europa.eu/data-and-maps/data/waterbase-rivers-10
class | description | URI construction | URI example

Classes of Protected Sites, baseURI = <http://data.smartopendata.eu/Natura2000/>
ps:ProtectedSite | protected sites instances | baseURI/so/ProtectedSite/<SITECODE> | <http://data.smartopendata.eu/Natura2000/so/ProtectedSite/IT3110002>
foaf:Document | instances of legal foundation documents | baseURI/Document/<SITECODE> | <http://data.smartopendata.eu/Natura2000/Document/IT3110002>
gsp:Geometry | geometries of protected sites | baseURI/Geometry/<SITECODE> | <http://data.smartopendata.eu/Natura2000/Geometry/IT3110002>
au:AdministrativeUnit | administrative units of protected sites | baseURI/so/AdministrativeUnit/IT | <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT>

Classes of Monitoring Stations
baseURI of the Waterbase Lakes dataset = <http://data.smartopendata.eu/WaterbaseLakes/>
baseURI of the Waterbase Rivers dataset = <http://data.smartopendata.eu/WaterbaseRivers/>
ef:EnvironmentalMonitoringFacility | instances of lakes and rivers monitoring stations | baseURI/so/Station/<NationalStationID> | <http://data.smartopendata.eu/WaterbaseLakes/so/Station/IT19LW09318>
gsp:Geometry | geometries of stations | baseURI/Geometry/<NationalStationID> | <http://data.smartopendata.eu/WaterbaseLakes/Geometry/IT19LW09318>
au:AdministrativeUnit | administrative units of the stations | baseURI/so/AdministrativeUnit/<CountryCode> | <http://data.smartopendata.eu/WaterbaseLakes/so/AdministrativeUnit/IT>

Classes of Hazardous Substances
baseURI of the Waterbase Lakes dataset = <http://data.smartopendata.eu/WaterbaseLakes/>
baseURI of the Waterbase Rivers dataset = <http://data.smartopendata.eu/WaterbaseRivers/>
qb:Observation | instances of measurements of hazardous substances | baseURI/HazardousSubstances/Observation/<rowIndex> | <http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/Observation/0>
qb:DataSet | instances of the input datasets with hazardous substances | - | <http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/Dataset/>
smod:Determinand | chemical compounds (determinands) defined in the Water Framework Directive | http://data.smartopendata.eu/WFD/Determinand/<CASNumber> | <http://data.smartopendata.eu/WFD/Determinand/71-55-6>
time:Interval | time period (year) for which the values of the measurements were aggregated | http://reference.data.gov.uk/id/gregorian-interval/<Year>+”-01-01T00:00:00/P1Y” | <http://reference.data.gov.uk/id/gregorian-interval/2013-01-01T00:00:00/P1Y>
qudt:Unit | units of measurements | http://data.smartopendata.eu/WFD/UnitOfMeasure/<unit_id> | <http://data.smartopendata.eu/WFD/UnitOfMeasure/9>

Table 2: Italian Pilot: summary of classes

External Vocabularies
In this section we summarise the external vocabularies used in the pilot. We focus on the vocabularies which are not included in the final SmOD model.

DC Terms and DBPedia OWL for Administrative Units of Protected Sites

Protected sites (see Fig. 2) are linked to the administrative units they belong to via the property dcterms:coverage. Administrative Units are described in NATURA2000 through the Nomenclature of Territorial Units for Statistics (NUTS) country code, a two-letter code referencing the country, e.g., “IT” for Italy [17]. The SmOD model suggests using au:country and taking its values from the Metadata Registry (MDR). We constructed the MDR URIs for the Sicilian sites. For example, below is an excerpt describing one of the sites:

    <http://data.smartopendata.eu/Natura2000/so/ProtectedSite/ITA070005> a ps:ProtectedSite .
    <http://data.smartopendata.eu/Natura2000/so/ProtectedSite/ITA070005> dcterms:coverage <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT> .
    <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT> a au:AdministrativeUnit ;
        au:nationalLevel <http://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/1stOrder/> ;
        au:country <http://publications.europa.eu/resource/authority/country/ITA> .

In addition to this definition, we kept the textual representation of the country codes, using the DBPedia ontology property dbpedia-owl:nutsCode, as shown in the listing below:

    <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT> dbpedia-owl:nutsCode "IT" .

This was done mainly because the MDR URIs are currently not resolvable; hence, technically we could not obtain descriptions of the countries through these URIs. Moreover, inspection of the SKOS description of the URI of Italy [18] revealed that there is no mapping from the MDR country codes to the NUTS codes, which would be useful to have in the pilot's case.
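The URI patterns used for protected sites and their administrative units can be sketched in Python as follows. This is an illustrative sketch only: the pilot produced these triples with the RDF plugin of OpenRefine, not with a script. The function names and the ALPHA2_TO_ALPHA3 lookup are hypothetical, and the lookup holds only the one entry needed for the example.

```python
from typing import List

NATURA_BASE = "http://data.smartopendata.eu/Natura2000/"
MDR_COUNTRY_BASE = "http://publications.europa.eu/resource/authority/country/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS_COVERAGE = "http://purl.org/dc/terms/coverage"
AU = "http://www.w3.org/2015/03/inspire/au#"

# Hypothetical lookup from the two-letter code found in the data to the
# three-letter code used in the MDR country URIs; extend as needed.
ALPHA2_TO_ALPHA3 = {"IT": "ITA"}


def protected_site_uri(site_code: str) -> str:
    """baseURI/so/ProtectedSite/<SITECODE>"""
    return f"{NATURA_BASE}so/ProtectedSite/{site_code}"


def administrative_unit_uri(country_code: str) -> str:
    """baseURI/so/AdministrativeUnit/<CountryCode>"""
    return f"{NATURA_BASE}so/AdministrativeUnit/{country_code}"


def administrative_unit_triples(site_code: str, country_code: str) -> List[str]:
    """Return N-Triples lines linking a site to its administrative unit."""
    site = protected_site_uri(site_code)
    unit = administrative_unit_uri(country_code)
    country = MDR_COUNTRY_BASE + ALPHA2_TO_ALPHA3[country_code]
    return [
        f"<{site}> <{DCTERMS_COVERAGE}> <{unit}> .",
        f"<{unit}> <{RDF_TYPE}> <{AU}AdministrativeUnit> .",
        f"<{unit}> <{AU}country> <{country}> .",
    ]


if __name__ == "__main__":
    for line in administrative_unit_triples("ITA070005", "IT"):
        print(line)
```

Keeping the code-to-URI lookup in one explicit table reflects the observation above that the MDR data itself offers no mapping between the two code systems.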
[17] NUTS codes are identical to the ISO 3166-1 alpha-2 codes, while MDR makes use of the ISO 3166-1 alpha-3 codes: http://www.iso.org/iso/home/standards/country_codes.htm
[18] The SKOS document describing all countries can be downloaded from http://publications.europa.eu/mdr/resource/authority/country/skos/countries-skos.rdf

Custom terms
In addition to the external vocabularies developed by third parties, we introduced new terms that were implemented in the custom SmOD vocabulary http://www.w3.org/2015/03/inspire/smod#.

Term | rdfs:comment
smod:areaHa | This property specifies the area of the Protected Site in Ha.
smod:lengthKm | This property specifies the length of the Protected Site in km.
smod:ecologicalQuality | This property provides description of the Protected Site in terms of ecological quality.
smod:catchmentName | This property specifies the name of major catchment or basin.
smod:featureName | This property specifies the name of the feature of interest being monitored by the Environmental Facility.
smod:Determinand | This class represents the class of nutrients, organic matter, hazardous substances and other chemical determinands reported in the Waterbase data of the European Environmental Agency.
Data Pre-Processing
The RDF models presented in the section above also illustrate how the values of certain properties were populated with physical data. For example, in the RDF model of Protected Sites, the value of rdfs:label is populated with the value of the column <SITENAME>. In several cases, populating the property values was not straightforward, and additional pre-processing steps were required. In this section we discuss some typical examples.

Implementing domain logic

It is common for a property value to be populated from more than one column of the input dataset, following some domain logic. For example, for protected sites, the value of ps:legalFoundationDate was populated from three columns: <DATE_SAC>, <DATE_CONF_SCI> and <DATE_SPA>. <DATE_CONF_SCI> and <DATE_SPA> are the dates when a site was designated as a Site of Community Importance (SCI) and a Special Protection Area (SPA) respectively. The site designation is found in the column <SITETYPE>, which may contain one of three values:
● “A”: the site was designated as SPA
● “B”: the site was designated as SCI
● “C”: the site was designated as both SCI and SPA
In addition, the European Commission can assign the status of Special Area of Conservation (SAC) to a site. If this happens, the column <DATE_SAC> is populated. Following advice from the domain experts of ARPA, a rule was implemented in OpenRefine to take the value of ps:legalFoundationDate from <DATE_SAC> whenever it is present, and otherwise the latest of the dates in <DATE_CONF_SCI> and <DATE_SPA>.

Cleaning/Formatting Values

Data formatting and data cleaning are very typical examples of data preparation. For example, the values of ps:legalFoundationDate were converted from the date-time format to a date, using a simple OpenRefine rule.
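The rule for ps:legalFoundationDate and the date-time to date conversion described above can be sketched in Python as follows. This is an illustrative analogue, not the OpenRefine expression used in the pilot, which is not reproduced in this report. It assumes that each row is a dictionary keyed by column name and that dates are ISO-like strings such as "2013-05-01 00:00:00"; both are assumptions about the input, and the function names are hypothetical.

```python
from datetime import date, datetime
from typing import Dict, Optional


def parse_date(value: Optional[str]) -> Optional[date]:
    """Convert a date-time string to a date; empty cells become None."""
    if value is None or not value.strip():
        return None
    # Assumed input format: ISO-like, e.g. "2013-05-01 00:00:00".
    return datetime.fromisoformat(value.strip()).date()


def legal_foundation_date(row: Dict[str, str]) -> Optional[date]:
    """Prefer DATE_SAC; otherwise take the latest of DATE_CONF_SCI and DATE_SPA."""
    sac = parse_date(row.get("DATE_SAC"))
    if sac is not None:
        return sac
    candidates = [
        parsed
        for parsed in (
            parse_date(row.get("DATE_CONF_SCI")),
            parse_date(row.get("DATE_SPA")),
        )
        if parsed is not None
    ]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    example_row = {
        "DATE_SAC": "",
        "DATE_CONF_SCI": "2006-07-19 00:00:00",
        "DATE_SPA": "2008-11-02 00:00:00",
    }
    print(legal_foundation_date(example_row))
```

Returning None when no date is available leaves it to the caller to decide whether to omit the property for that site.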
Sampling Input Dataset
It is often the case that we want to generate RDF from a subset of the input dataset. With OpenRefine this can be done in numerous ways. For example, the NATURA2000 table contains protected sites from all the countries of the European Union; however, for the pilot we were interested only in the Sicilian sites. To reduce the input dataset, a text facet was created in OpenRefine on the column <SITECODE> that outputs “1” if <SITECODE> contains the code of one of the Sicilian sites (these values were provided by ARPA), and “0” otherwise.

Joining Datasets
Another interesting example of exploiting OpenRefine functionality for data preparation is joining one dataset with another in order to retrieve more data. For example, the dataset with hazardous substances contains units of measurement (UoM) in the column <Unit_HazSubs>. The values of the column are unit names, such as “μg/l”, whereas the target RDF model of hazardous substances requires the values of sdmx-attribute:unitMeasure to be URIs. The URI of “μg/l” is <http://data.smartopendata.eu/WFD/UnitOfMeasure/9>, in which “9” is the row index of “μg/l” in a dataset of UoMs [19] that resides in another OpenRefine project [20]. Hence, in order to generate the same UoM URIs in the project with hazardous substances, we need to join this dataset with the dataset of UoMs [21] and retrieve the row indexes of the latter. This kind of join is also supported by OpenRefine [22].

RDF Generation
The size of the complete RDF dataset of the Italian pilot (including data structure definitions and concept scheme) is 2.1M, with 14,098 triples in total:
● 223 instances of ps:ProtectedSite
● 38 instances of ef:EnvironmentalMonitoringFacility, 14 of which are lake monitoring stations and 24 of which are river stations
● 906 instances of qb:Observation, 205 of which are measurements of hazardous substances in rivers and 701 of which are in lakes

Table 3 summarises the results of data harmonisation in the Italian pilot. RDF mappings are available in the OpenRefine projects of each input dataset. The resulting RDF can be downloaded from the given links; alternatively, the data can be queried via the SPARQL endpoint http://smodlumii.sungis.lv/sparql

| Input Datasets | OpenRefine project’s name on https://smod-refine.spaziodati.eu | RDF, SPARQL endpoint: http://smodlumii.sungis.lv/sparql |
| --- | --- | --- |
| Natura2000 database - http://www.eea.europa.eu/data-and-maps/data/natura-5, table NATURA2000SITES | “ARPA-NATURA2000SITES-PLUS” | RDF dump [23]; graph: <http://data.smartopendata.eu/natura2000/sicily> |
| EEA Waterbase - Lakes - http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10, ARPA extraction (enriched with coordinates) [24], sheets “StationsLakes” and “HazSubstLakes_Agg” | “ARPA-Lakes_dati2013_caricati2014” | RDF dump [25]; graph: <http://data.smartopendata.eu/waterbase-lakes/stations/sicily> |
| | “ARPA-Lakes_dati2013_caricati2014-HazSubs” | RDF dump [26]; graph: <http://data.smartopendata.eu/waterbase-lakes/haz-substances/sicily> |
| EEA Waterbase - Rivers - http://www.eea.europa.eu/data-and-maps/data/waterbase-rivers-10, ARPA extraction [27], sheets “StationsRivers” and “HazSubstRivers_Agg” | “ARPA-Rivers_dati2013_caricati2014” | RDF dump [28]; graph: <http://data.smartopendata.eu/waterbase-rivers/stations/sicily> |
| | “ARPA-Rivers_dati2013_caricati2014-HazSubs” | RDF dump [29]; graph: <http://data.smartopendata.eu/waterbase-rivers/haz-substances/sicily> |
| EEA Unit of measurement of Hazardous Substances - http://dd.eionet.europa.eu/dataelements/48239 | “ARPA-haz-substances-UoM” | RDF dump [30]; graph: <http://data.smartopendata.eu/WFD/haz-substances/uom> |
| EEA code list of determinands - http://dd.eionet.europa.eu/datasets/latest/Groundwater/tables/HazSubstGW_Disagg/elements/DeterminandCode | “ARPA-WFD-determinand” | RDF dump [31]; graph: <http://data.smartopendata.eu/WFD/haz-substances/determinands> |

Table 3: Italian Pilot: summary of data harmonisation

[19] http://dd.eionet.europa.eu/dataelements/48239
[20] The project called “ARPA-haz-substances-UoM” is available at https://smod-refine.spaziodati.eu/
[21] The join is done by the UoM name that is found in the column <Unit_HazSubs> of the source dataset and <Value> in the target
[22] See here the documentation of the join rule https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname
[23] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/ARPA-NATURA2000SITES-PLUS.rdf.zip
[24] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014+Rev1.xlsx.zip
[25] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014.rdf.zip
[26] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014-HazSubs.rdf.zip
[27] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Rivers_2013_19_12_2014_Rev1.xlsx.zip
[28] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014.rdf.zip
[29] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Rivers_dati2013_caricati2014-HazSubs.rdf.zip

Future Outlook
The Italian pilot is being actively developed, and more user queries are to be addressed in future work, for example:
● Which protected sites, or areas of a protected site, are more or less subject to pollution?
● Which human activities in the protected site can lead to pollution of water and/or lakes (within and/or downstream)?
This will require adding more input data sources, such as those defining “pollution” in terms of the concentration of hazardous substances. For example, what is the acceptable value of the benzene concentration? When is it considered to be water pollution? As for the second user query, a description of “human activities” needs to be added. New models will be developed to include the new data sources. This in turn will affect the SmOD model (and vocabularies), which at the moment include neither pollution nor human activity definitions.

[30] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/ARPA-haz-substances-UoM.rdf.zip
[31] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/ARPA-lakes-haz-substances-determinand.rdf.zip

2.1.2 Portuguese-Spanish Pilot
The Portuguese-Spanish pilot is led by Empresa de Transformacion Agraria SA (TRAGSA). Besides TRAGSA, the Portuguese partner, Direção Geral do Território, participates in the pilot as domain expert and data provider. A set of user queries of the pilot guided the process of data harmonisation: from choosing input datasets, to conceptual modelling of the domain, to designing RDF models and extending SmOD vocabularies with domain-specific terms. Below we present a few user queries for demonstration purposes [32]:
● What is the land use and land cover (LULC) of my field units in the Zêzere Watershed in year x?
● Which land use/cover changes occurred in my field unit?
● What environmental factors can be relevant in this field unit considering the LULC that occurred?

[32] Refer to [SMODD52] for more details on the pilot

Input Datasets
The input datasets accumulated by TRAGSA from different data sources include one denormalized table with relationships between various concepts of the domain, pd_0604_workunion.wkt, and multiple auxiliary tables that provide definitions of the concepts of the domain. The data is available from the FTP server of TRAGSA.

RDF Modelling/Custom Terms
In order to develop the RDF models of the pilot we followed the methodology presented in D3.3 [SMODD33]. Figure 5 schematically illustrates the methodology.

Figure 5: Portuguese-Spanish Pilot, data harmonisation methodology

As explained in D3.3, the input datasets of the pilot lack proper documentation of the schema design, and domain analysis and modelling were needed prior to harmonising the pilot’s data. Together with TRAGSA and SINTEF we performed domain analysis and developed conceptual models using the Object-Role Modelling (ORM) techniques [HM08]. D3.3 presents the ORM models of the first release of the pilot. Our initial intention was to follow an iterative approach to the pilot’s development and produce backward-compatible models in each subsequent release of the pilot. That meant that in every new iteration we were to augment the existing models with more concepts and relationships, but not to modify the existing ones. In practice, that approach worked for the second release of the pilot, but failed with the third release, in which the models of the previous releases were reconsidered and modified.

We published the ORM models of all the releases at http://smod-fp7.github.io/ together with their documentation. In the current document we include the ORM models of the latest (third) release of the pilot in Annex B, among which are:
● Chemical Characteristics [33] - the model of chemical characteristics of soil
● Climatology [34] - the model of climatology measurements
● Forestry Tile [35] - the model of forestry maps and plant species
● Geometry [36] - the model of geometries
● Work Unit Ecosystem [37] - the model of animal species supported by observatory tiles
● Work Unit Location [38] - the model of topological relations between spatial objects

[33] http://smod-fp7.github.io/tragsa3/diagrams/ChemicalCharacteristics.png
[34] http://smod-fp7.github.io/tragsa3/diagrams/Climatology.png
[35] http://smod-fp7.github.io/tragsa3/diagrams/ForestryTile.png
[36] http://smod-fp7.github.io/tragsa3/diagrams/Geometry.png
[37] http://smod-fp7.github.io/tragsa3/diagrams/WorkUnitEcosystem.png
[38] http://smod-fp7.github.io/tragsa3/diagrams/WorkUnitLocation.png

The ORM models served as input to the task of RDF data modelling. Following the set of conversion rules presented in D3.3, we transferred the ORM models to RDF Schema. Annex B includes all the resulting RDF models. In the rest of this section we go through the conversion rules and summarise the result of applying them to the ORM models of the third release of the pilot.

Mapping Object Types and Value Types to Classes
In Table 4 we present the ORM constructs (object types and value types) that were mapped to classes.

| ORM construct | Class | URI construction (baseURI = <http://data.smartopendata.eu/sp-pt-pilot/>) | URI example |
| --- | --- | --- | --- |
| Work Unit | smod:WorkUnit | baseURI/so/WorkUnit/<idWorkUnit> | <http://data.smartopendata.eu/sp-pt-pilot/so/WorkUnit/ES1110100070100100001001> |
| Soil | smod:Soil | baseURI/so/Soil/<idLitholo> | <http://data.smartopendata.eu/sp-pt-pilot/Soil/57> |
| Forestry Tile | smod:ForestryTile | baseURI/so/ForestryTile/<idForestry> | <http://data.smartopendata.eu/sp-pt-pilot/so/ForestryTile/100001-MFE25> |
| Plant Species | smod:PlantSpecies | baseURI/PlantSpecies/<codeSP1> | <http://data.smartopendata.eu/sp-pt-pilot/PlantSpecies/Pinsyl> |
| Local number | adms:Identifier | baseURI/Identifier/<idWorkUnit> | <http://data.smartopendata.eu/Identifier/ES1110100070100100001001> |
| Protected Site | ps:ProtectedSite | baseURI/ProtectedSite/ | |
| Parcel | smod:Parcel | baseURI/Parcel/<idParcel> | <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1110100070100100001> |
| Neighbourhood, Municipality, District, NUTS3, NUTS2 | au:AdministrativeUnit | baseURI/<AdministrativeUnit>/<idAdministrativeUnit> | <http://data.smartopendata.eu/sp-pt-pilot/so/Neighbourhood/ES11101000701> |
| Observatory Tile | smod:ObservatoryTile | baseURI/so/ObservatoryTile/<idLandSp> | <http://data.smartopendata.eu/sp-pt-pilot/so/ObservatoryTile/29TNH15> |
| Animal Species | smod:AnimalSpecies | baseURI/AnimalSpecies/<code> | <http://data.smartopendata.eu/sp-pt-pilot/AnimalSpecies/Alaarv> |
| Geometry | gsp:Geometry | baseURI/Geometry/<idWorkUnit> | <http://data.smartopendata.eu/sp-pt-pilot/Geometry/ES1110100070100100001001> |
| Corine Land Cover | skos:Concept | http://www.w3.org/2015/03/corine# + <code> | <http://www.w3.org/2015/03/corine#242> |
| - | qb:Observation | baseURI/<ClimatologyMeasurement/Observation/idClimatologyMeasurement> | <http://data.smartopendata.eu/sp-pt-pilot/AnnualHumidityLevel/Observation/65> |
| - | qb:DataSet | - | <http://data.smartopendata.eu/sp-pt-pilot/WorkUnitClimatology/Dataset/> |

Table 4: Portuguese-Spanish Pilot: ORM constructs mapped to classes

Mapping Associations and Value Types to Properties
In Table 5 we present the ORM constructs (associations, value types and object types) that were mapped to rdf:Property.

| ORM Construct | rdf:Property | rdfs:domain | rdfs:range |
| --- | --- | --- | --- |
| Chemical Characteristics | | | |
| (Work Unit) has (Soil) | smod:hasSoil | smod:WorkUnit | smod:Soil |
| (Soil) has (Acidity) | smod:soilAcidity | smod:Soil | rdfs:Literal |
| (Soil) has (Permeability) + (Permeability) has Permeability Rate | smod:soilPermeabilityRate | smod:Soil | rdfs:Literal |
| Geometry | | | |
| (Work Unit) has (Geometry); (Parcel) has (Geometry) | gsp:hasGeometry | gsp:SpatialObject | gsp:SpatialObject |
| (Polygon) has (Surface) | smod:areaHa | gsp:SpatialObject | rdfs:Literal |
| (Polygon) has (Perimeter) | smod:lengthKm | gsp:SpatialObject | rdfs:Literal |
| Work Unit Ecosystem | | | |
| (Observatory Tile) supports (Animal Species) | smod:supports | smod:ObservatoryTile | smod:AnimalSpecies |
| (Animal Species) has Conservation Status | smod:iucnConservationStatusCode | smod:AnimalSpecies | rdfs:Literal |
| Work Unit Location | | | |
| (Work Unit) intersects (Protected Site) | gsp:sfIntersects | gsp:SpatialObject | gsp:SpatialObject |
| (Work Unit) is located in (Forestry Tile); (Work Unit) is located in (Observatory Tile); (Work Unit) is located in (Neighbourhood); (Work Unit) is located in (Parcel); (Neighbourhood) is located in (Municipality); (Municipality) is located in (District); (District) is located in (NUTS3); (NUTS3) is located in (NUTS2) | gsp:sfWithin | gsp:SpatialObject | gsp:SpatialObject |
| (Neighbourhood) has Name; (Municipality) has Name; (District) has Name; (NUTS3) has Name; (NUTS2) has Name | ramon:name | ramon:Region | rdfs:Literal |

Table 5: Portuguese-Spanish Pilot: ORM constructs mapped to properties

Mapping Objectified Associations
In the first release of the pilot we had one objectified association [39], “ForestryTileHasPlantSpecies”: an association between Forestry Tile and Plant Species that, for every plant species of a forestry tile, allows specifying the representativity level of the plant species (primary, secondary or tertiary) and its density.
[39] Objectified associations in ORM allow additional qualifying information to be expressed on the relationship between two entities.

In D3.3 (Section 3.3.2) we discussed different approaches to expressing objectified associations in RDF, from RDF reification to introducing custom properties. The latter approach was chosen to represent “ForestryTileHasPlantSpecies” in RDF. We took into account the fact that no more representativity levels were to be added to the data, and opted for a less verbose and simpler way of encoding and querying the data as opposed to RDF reification. As a result, we introduced six properties:

| Objectified Association | rdf:Property | rdfs:domain | rdfs:range |
| --- | --- | --- | --- |
| (ForestryTileHasPlantSpecies) has (Representative Level) | smod:hasPrimaryPlantSpecies, smod:hasSecondaryPlantSpecies, smod:hasTertiaryPlantSpecies | smod:ForestryTile | smod:PlantSpecies |
| (ForestryTileHasPlantSpecies) has (Density) | smod:primaryPlantSpeciesDensity, smod:secondaryPlantSpeciesDensity, smod:tertiaryPlantSpeciesDensity | smod:ForestryTile | smod:PlantSpecies |

In the third release of the pilot one more objectified association was added, “WorkUnitHasCorineLandcover”, which links Work Unit and Corine Land Cover. For every work unit, this association allows specifying the Corine Land Cover code in three years: 1990, 2000 and 2006. When choosing an RDF model for “WorkUnitHasCorineLandcover”, we followed the same logic as for “ForestryTileHasPlantSpecies” and introduced the following three properties:

| Objectified Association | rdf:Property | rdfs:domain | rdfs:range |
| --- | --- | --- | --- |
| (WorkUnitHasCorineLandCover) in (Year) | smod:corineLandCover1990, smod:corineLandCover2000, smod:corineLandCover2006 | gsp:SpatialObject | skos:Concept |

We chose this design solution because the temporal aspect of Corine Land Cover codes serves an informative purpose rather than the purpose of combining these values with other data sources.

External Vocabularies
NUTS-RDF and the RAMON Ontology for Administrative Regions
To locate an administrative unit in the pilot, topological relations between work units and administrative units are used. To encode instances of the NUTS regions, we re-used the NUTS classification vocabulary published as Linked Data at http://nuts.geovocab.org/. For example, the id of "Baixo Mondego", a sub-region of Portugal, is http://nuts.geovocab.org/id/PT162.html. Below is the definition of the sub-region from the NUTS Linked Data set:

    @prefix nuts: <http://nuts.geovocab.org/id/> .
    @prefix ramon: <http://rdfdata.eionet.europa.eu/ramon/ontology/> .
    @prefix ngeo: <http://geovocab.org/geometry#> .
    @prefix spatial: <http://geovocab.org/spatial#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    # rdf: and rdfs: are used below, so they are declared here as well
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    nuts:PT162 rdf:type ramon:NUTSRegion, spatial:Feature .
    nuts:PT162 rdfs:label "PT162 - Baixo Mondego" .
    nuts:PT162 ramon:name "Baixo Mondego" .
    nuts:PT162 ramon:level "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
    nuts:PT162 ramon:code "PT162" .
    nuts:PT162 ngeo:geometry nuts:PT162_geometry .
    nuts:PT162 spatial:PP nuts:PT16 .
    nuts:PT162 owl:sameAs <http://rdfdata.eionet.europa.eu/ramon/nuts2008/PT162> .
    nuts:PT162 owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/PT162> .
    nuts:PT162 owl:sameAs <http://estatwrap.ontologycentral.com/dic/geo#PT162> .
    nuts:PT162 owl:sameAs <http://nuts.psi.enakting.org/id/PT162> .

Having URIs from this NUTS Linked Data set allows us to re-use the definitions of the NUTS regions (i.e., their names, levels and codes) and the topological relations between them. In the definition above, the spatial:PP triple tells us that the "Baixo Mondego" sub-region is contained in the “Centro” region (http://nuts.geovocab.org/id/PT16.html).

Neighbourhoods, Municipalities and Districts are units in the local administrative divisions of Spain and Portugal. To represent them in RDF, we re-used the Administrative Units vocabulary [40] and the RAMON Ontology http://rdfdata.eionet.europa.eu/ramon/ontology/. For example, below is the definition of the “Coimbra” district in Portugal, which is contained in the "Baixo Mondego" sub-region (http://nuts.geovocab.org/id/PT162.html):

    @prefix ramon: <http://rdfdata.eionet.europa.eu/ramon/ontology/> .
    @prefix au: <http://www.w3.org/2015/03/inspire/au#> .
    @prefix gsp: <http://www.opengis.net/ont/geosparql#> .
    # xsd: is used for ramon:level below, so it is declared here as well
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <http://data.smartopendata.eu/sp-pt-pilot/so/District/PT16211> gsp:sfWithin <http://nuts.geovocab.org/id/PT162> .

    <http://data.smartopendata.eu/sp-pt-pilot/so/District/PT16211> a au:AdministrativeUnit , ramon:LAURegion ;
        ramon:name "Coimbra" ;
        ramon:level "2"^^xsd:int ;
        au:nationalLevel <http://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/4thOrder/> ;
        au:country <http://publications.europa.eu/resource/authority/country/PRT> ;
        au:nationalCode "PT16211" .

Data Pre-Processing
The input to many target RDF models was the same file, pd_0604_workunion.wkt, which contains relationships between most of the concepts of the pilot, such as:
● all links between Work Unit and Climatology measurements
● all topological relationships between Work Unit and other spatial objects of the domain: Forestry Tile, Observatory Tile, and others.

[40] http://www.w3.org/2015/03/inspire/au#

For example, the link gsp:sfWithin between Work Unit and Forestry Tile is generated using values from two columns of the input dataset: idWorkUnit and idForestry. However, the file contains duplicate records of the same idWorkUnit-idForestry pairs. We ran the following bash commands on the input file to sort its records and remove duplicates based on the two given columns:

    header="$col1;$col2"
    echo "$header" >"$outputfile"
    # Skip the original header, keep the two columns, strip double quotes,
    # drop rows that do not have exactly two fields, then sort and de-duplicate.
    # The result is appended (>>) so that the header written above is kept.
    sed 1d ./pd_0604_workunion.wkt | cut -d';' -f "$coln1","$coln2" | tr -d '"' | awk -F';' 'NF==2' | sort -t';' -u >>"$outputfile"

where $col1 and $col2 are the names of the two columns, $coln1 is the sequential number of the first column in the dataset and $coln2 is the sequential number of the second column. For example, the following commands process the input file to generate the gsp:sfWithin relationship:

    header="idWorkUnit;idForestryTile"
    echo "$header" >sorted_idWorkUnit_located_idForestry.wkt
    sed 1d ./pd_0604_workunion.wkt | cut -d';' -f 3,19 | tr -d '"' | awk -F';' 'NF==2' | sort -t';' -u >>sorted_idWorkUnit_located_idForestry.wkt

The input datasets after pre-processing are available.

RDF Generation
The RDF dump of the pilot is available for download [41]. All RDF mappings can be found in the OpenRefine projects on https://smod-refine.spaziodati.eu; the names of the projects start with “TRAGSA3” and continue with the name of the input file.

Future Outlook
As future work, the RDF representation of Geometries needs to be generated.

[41] https://s3-eu-west-1.amazonaws.com/smod-repo/tragsa-release3/rdf.zip

2.1.3 Irish pilot
The Irish pilot, which is led by MAC, is focused on European protected areas and their National Parks, starting with the Burren National Park in Ireland. The pilot aims to demonstrate the value of SmartOpenData in helping researchers and decision makers to better manage, preserve, sustain and use this unique ecosystem. The pilot’s primary objective is to create the following sustainable services that will continue beyond the life of the project [42]:
1. SmartOpenData enabled European Tourism Indicator System (ETIS) Webservice for the Burren and European GeoParks Network.
2. SmartOpenData enabled App to Ground-Truth potential Protected Monument sites

[42] See [SMOD52] for a more in-depth discussion of the pilot and its objectives.

ETIS is a survey-based generation service used to provide real-time statistical information on the GeoPark performance in relation to the performance criteria defined by the Geopark’s Management. However, the ETIS model will not present useful links to the SmartOpenData data model [43] until the service is operational in many GeoParks, when the use of a common data model will enable potential eco-tourists to benchmark, compare and contrast the progress of various sustainable destinations in achieving their objectives before deciding which to visit [44]. It was therefore decided to focus on the second service for now, and to use the OpenRefine approach, described above and proven in the Italian pilot, to transform the data of the Irish national heritage services, as defined at http://webgis.archaeology.ie/nationalmonuments/flexviewer/, in order to capture data for the protected sites within the Burren Geopark region. The query generated was to reconcile the data retrieved with the official place names stored in the Logainm dataset.

Input Data Sets
The primary input data sets are the Irish Record of Monuments and Places (RMP) and Logainm, the official Irish placenames. In Ireland, archaeological monuments are protected under the Irish National Monuments Acts 1930-2004. The National Monuments Service of the Irish Government’s Department of Arts, Heritage and the Gaeltacht maintains a record of all known monuments, and this forms the Record of Monuments and Places (RMP) [45]. The aim of the ground-truthing service is to provide a new crowd-sourcing way to report on and help protect such monument sites, focusing on the Burren initially. The monuments are recorded in the Irish RMP, which is available as a series of PDF documents [46] and as CSV files [47], i.e. as One Star and Three Star data. The aim was to transform it to 5 Star open data [48]. Logainm provides the definitive standard authorised forms of all Irish place names in both English and Irish [49]. It has recently been made available in linked open data format as Linked Logainm, in various formats including RDF, XML and JSON [50].

[43] As discussed in [SMOD33] and [SMOD34]
[44] As discussed in D5.1 “Rationale of the Pilots”.
[45] www.archaeology.ie
[46] Available at http://www.archaeology.ie/publications-forms-legislation/record-of-monuments-and-places
[47] https://data.gov.ie/data/search?q=monuments&theme-primary=Arts
[48] as described at http://5stardata.info/en/
[49] www.logainm.ie/en
[50] www.logainm.ie/en/inf/proj-machines

RDF Graph and Table
The following summarises the Protected Monuments Sites data and its linking with Linked Logainm:

| Class | Description |
| --- | --- |
| dc:identifier | Unique Identifier for site |
| ps:siteDesignation | The classification of the Site |
| foaf:name | The Townland name of the site |
| owl:sameAs | The reconciliation link to Logainm |
| geo:location | ITM Reference (E,N) |
| geo:lat_long | Irish Grid Reference (E,N) |

Data Pre-processing
Multiple records within the dataset were found to be redundant. These records were identified using straightforward OpenRefine rules.
Joining DataSets
In order to extend the heritage dataset, OpenRefine’s RDF reconciliation tool was used. To add the Logainm RDF reconciliation service in OpenRefine, users need to navigate to ‘RDF’ > ‘Add reconciliation service’ > ‘Based on SPARQL endpoint...’, and fill in the following information:
Name: Logainm
Endpoint URL: http://data.logainm.ie/sparql
Type: Virtuoso
Label Properties: Also check ‘foaf:name’

The reconciliation was run against the “Townland” name. Once reconciliation was complete, manual manipulation was required to resolve the correct townland. Once this process was complete, the sameAs links to the Logainm URIs needed to be added to the RDF. This was accomplished by editing the RDF skeleton and associating the sameAs property with the URI column. A sample of the RDF output is shown below.

    <?xml version="1.0" encoding="UTF-8"?>
    <rdf:RDF xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:owl="http://www.w3.org/2002/07/owl#"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://localhost:3333/0">
        <dc:description>Anomalous stone group</dc:description>
        <foaf:name>CARHEENYBAUN</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/19220</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">143036, 192476</location>
        <dc:identifier>GA133-003----</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/1">
        <dc:description>Anomalous stone group</dc:description>
        <foaf:name>CARHEENYBAUN</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/19220</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">142686, 192200</location>
        <dc:identifier>GA133-004----</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/2">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>BALLYMAHONY</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/5830</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">119634, 198730</location>
        <dc:identifier>CL009-014003-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/3">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>FANTA GLEBE</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/6718</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">116138, 195021</location>
        <dc:identifier>CL009-085003-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/4">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>BALLYCONNOE NORTH</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/5796</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">116818, 200467</location>
        <dc:identifier>CL009-004006-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/5">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>KILMOON WEST</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/6627</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">114875, 200000</location>
        <dc:identifier>CL008-049006-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/6">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>LISHEENEAGH</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/5808</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">116508, 203537</location>
        <dc:identifier>CL005-063004-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/7">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>KILMOON WEST</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/6627</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">115017, 200061</location>
        <dc:identifier>CL008-049007-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/8">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>CLOONEY SOUTH</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/6653</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">119259, 188007</location>
        <dc:identifier>CL016-105005-</dc:identifier>
      </rdf:Description>
      <rdf:Description rdf:about="http://localhost:3333/9">
        <dc:description>Architectural fragment</dc:description>
        <foaf:name>KILFENORA</foaf:name>
        <owl:sameAs>http://data.logainm.ie/place/6720</owl:sameAs>
        <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">118338, 193926</location>
        <dc:identifier>CL016-171----</dc:identifier>
      </rdf:Description>
    </rdf:RDF>

Conclusion
OpenRefine and the standard data.smartopendata.eu vocabularies (SmartOpenData Protected Sites, FOAF and Dublin Core)51 enabled the transformation of the Monuments dataset to RDF to be completed. This allowed the data to be mashed together with the Linked Logainm source to produce the National Monument locations linked with the definitive Irish placenames of those locations. The exercise has ensured that the Logainm and National Monuments teams will collaborate more closely in the future, and it should help to ensure the wider use of both datasets.
The first approach was to build on the Slovakian pilot approach and use the National Monuments dataset as transformed to the INSPIRE Protected Sites theme52. However, the transformation to RDF revealed weaknesses in the INSPIRE version of the national monuments dataset, which is why the original dataset was transformed instead. These weaknesses are now being addressed, so the exercise led to an improvement in the quality of the dataset involved.
51 See Table 2
2.1.4 Transforming Data with Grafterizer and the Jarfter Service
Grafterizer is an interactive tool for creating data transformations. It allows the user to set up so-called pipelines and RDF mappings. Pipelines are essentially scripts that consist of a number of consecutive actions applied to datasets. RDF mappings can be used to publish datasets as linked data. The user interface provides a live preview of the transformations applied to a subset of the chosen dataset, which allows users to specify them incrementally. With Grafterizer, data transformations can be uploaded and stored. Pipelines and RDF mappings are displayed both in tabular form in a grid and in their script form, as Clojure code.
In Grafterizer, each single transformation step is defined as a pipe, a function that performs a simple data conversion on its input. These functions are then combined so that the output of one pipe acts as the input for another. Composing operations in this way gives great flexibility and allows rather complex data conversions to be performed.
The following short demonstration shows how the Grafterizer tool performs a transformation on tabular data. The example is taken from the ARPA data of the Italian Pilot.
Figure 6: Original Sample ARPA Data
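The composition of pipes described above can be sketched in a few lines. The following Python fragment is only an illustration of the idea and is not Grafterizer's generated Clojure code; the pipe names (drop_empty_rows, rename_column, to_float), the helper pipeline and the sample rows are all hypothetical.

```python
from functools import reduce

# Hypothetical pipes: each one takes a list of row dicts and returns a new list.

def drop_empty_rows(rows):
    """Remove rows in which every cell is blank."""
    return [r for r in rows if any(v.strip() for v in r.values())]

def rename_column(old, new):
    """Return a pipe that renames one column."""
    def pipe(rows):
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return pipe

def to_float(column):
    """Return a pipe that parses one column as a number (accepts a decimal comma)."""
    def pipe(rows):
        return [{**r, column: float(r[column].replace(",", "."))} for r in rows]
    return pipe

def pipeline(*pipes):
    """Chain pipes so that the output of one is the input of the next."""
    return lambda rows: reduce(lambda data, pipe: pipe(data), pipes, rows)

# Made-up sample rows, not taken from the ARPA dataset.
sample = [
    {"station": "A1", "ph": "7,4"},
    {"station": "", "ph": ""},
    {"station": "B2", "ph": "6.9"},
]

transform = pipeline(drop_empty_rows, rename_column("ph", "pH"), to_float("pH"))
result = transform(sample)
print(result)
```

Because every pipe has the same shape (rows in, rows out), steps can be added, removed or reordered without touching the others, which is the flexibility the text refers to.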
To see an instant preview of the transformation as it is created, the data should be uploaded in a raw tabular format. Next, the transformation itself is created by adding the required functions to a pipeline. Each time a pipeline is modified, the transformation is applied to the previewed dataset immediately, so the effect of each step can be seen. At any stage of the transformation the modified tabular data can be exported.
52 Available at https://www.geoportal.ie/geoportal/catalog/search/resource/details.page?uuid=%7bF66DE3EB-FC5C-4D79-A00A-BC45AB9F55F6%7d
Figure 7: RDF mapping for ARPA data
When the tabular data is in the desired format, one can start creating RDF mappings. The RDF skeleton being created is clearly visualized, showing nodes and corresponding relations. Both pipelines and RDF mappings are stored together as complete data transformations and may be easily shared and reused. After a transformation is constructed and saved, one may apply it to the target dataset and download the results locally in the desired RDF format. Currently supported formats include RDF/XML (.rdf), N-Triples (.nt), Turtle (.ttl), N3 (.n3), N-Quads (.nq) and RDF/JSON (.rj).
Figure 8: Generated RDF graph for ARPA data
Jarfter
In order to apply the transformations to complete datasets, we have developed Jarfter for the purposes of SmartOpenData. Jarfter is a set of web services that allow for server-side compilation of data transformations (serialized as Clojure code) as well as execution of transformations on uploaded datasets. They can be accessed through the user interface shown in Figure 9, which gives users two options for transforming their data.
Figure 9: User interface for Jarfter
The Execute transformation operation performs the complete transformation of the entire dataset based on the generated Clojure code corresponding to the transformation. The code and data are uploaded to the server, and when the transformation is complete the browser downloads a file containing the transformed data.
The second option is the Download transformation executable operation, which performs only the compilation part of "Execute transformation". The server receives only the generated Clojure source code and not the dataset. The Clojure code is compiled to an executable JAR file, which is then automatically downloaded by the user as "transformation.jar". This file can then be used to transform datasets locally instead of in the cloud. The JAR file can be run from the command line, in the directory where the file is located, as follows:
$ java -jar transformation.jar <input-file.csv> <output-file.(nt|rdf|n3|ttl)>
Jarfter Architecture
Jarfter is a set of RESTful web services with a back-end database which allow for server-side compilation of Clojure code and execution of transformations on datasets. The services are implemented in a way that allows Jarfter to be used both with and without interacting with the database.
Figure 10: Jarfter compiler services
A schematic overview of the compiler service, accessed by the "Download transformation executable" capability, is shown in the figure above. The Clojure source code, which is generated from the user-specified transformations in Grafterizer, is sent to the server back-end, where it is compiled to an executable JAR file. The JAR file can then be sent back to the user (where it can be executed locally) or, if the database interactive services are used, both the Clojure source code and the executable JAR are stored in the back-end database as well.
Jarfter also supports execution of the transformations on the server side, as exposed by the "Execute transformation" capability. Figure 11 provides an overview of the workflow for the services:
Figure 11: Jarfter transformation web service
As shown in the diagram, the client must provide either references to the database entries for a transformation and the dataset that will be transformed, or the Clojure source code and the dataset itself. Any provided source code is dynamically compiled into a JAR executable. If the user provides a reference, the JAR is extracted from the database. The dataset is likewise downloaded from the database if an entry, rather than the dataset itself, is given as input. The back-end then executes the JAR and transforms the given dataset before it sends the transformed data back to the user.
Warfter: Dynamic Deployment of Data Transformations (Jarfter extension)
The approach implemented with the Jarfter service for generating data transformations allows a very high level of automation of the transformation process. In particular, this is due to the statelessness of all resulting transformation executables. This property allows transformations to be created on demand, which can then be used to dynamically form cloud deployment topologies. A high-level overview of the intended process of forming a simple topology is illustrated in Figure 12:
Figure 12: Dynamic deployment of data transformations
First, users need to specify the transformation to be deployed on the client side using the Grafterizer tool. When the transformation is ready, the transformation code is sent to the Jarfter back-end, where the Clojure compiler service forms an executable WAR or JAR file and forwards it to a "deployer" component capable of dynamically provisioning cloud resources and deploying applications (in our case, we plan to use the CloudML run-time environment). Finally, the transformation that has been deployed in the cloud can be accessed by the transformation owner or other users to apply the transformation to various datasets.
As mentioned, we plan to use CloudML to implement the dynamic deployment of transformations. CloudML comprises a set of tools and a domain-specific language (DSL) for modelling and enacting the provisioning and deployment of cloud applications. The modelling language allows for the specification of cloud topologies and the necessary software and hardware resources, as shown in Figure 13:
Figure 13: CloudML deployment template
The CloudML template in the figure represents a simple deployment topology which comprises two software components: the aforementioned executable transformation (labelled "Transformation") and a generic servlet container (labelled "SC"). Additionally, the figure illustrates the specification of a hardware component, a virtual machine (labelled "VM"). In CloudML, virtual machines are specified in a provider-agnostic way through a set of hardware requirements, which can then be matched against the flavors of virtual machines available from each particular provider. The CloudML template can be programmatically edited to inject details on how to deploy a particular transformation that has been generated by the Clojure compiler. The resulting model can be sent to the CloudML engine, which can then enact the provisioning and deployment of the necessary resources through a process of matching the hardware and software requirements with the available capabilities.
Grafterizer vs. OpenRefine – An Overview
This section analyses the data transformation process for the ARPA and DGT-TRAGSA pilot use cases. Transformations have been performed with the help of two data cleaning and transformation tools, OpenRefine and Grafterizer. A comparison of the transformation construction process for these tools is given below.
The first difference that significantly affects the data transformation process is the possibility to create utility functions in Grafterizer. This makes it possible to separate computational logic from the data it operates on. For example, the formula for computing geographical coordinates for the ARPA pilot Lakes/Rivers Monitoring Stations has to be defined twice in the OpenRefine project: once for computing latitude and once for computing longitude. Grafterizer allows it to be encapsulated in a separate function which can be called as many times as needed.
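The benefit of a reusable utility function can be illustrated with a small sketch. The actual coordinate formula used for the ARPA monitoring stations is not given in this section, so the degrees/minutes/seconds conversion below is a stand-in chosen purely for illustration; the function name dms_to_decimal and the sample values are hypothetical, and Python is used here instead of Grafterizer's Clojure.

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds to decimal degrees (illustrative formula only)."""
    return degrees + minutes / 60 + seconds / 3600

# Made-up station record; the real ARPA columns may look different.
station = {"lat": (45, 30, 0), "lon": (9, 15, 0)}

# The formula is defined once and called twice, rather than being
# written out separately for the latitude and the longitude column.
latitude = dms_to_decimal(*station["lat"])
longitude = dms_to_decimal(*station["lon"])
print(latitude, longitude)
```

If the formula later needs a correction, only the single function changes, whereas in the duplicated form both copies would have to be edited consistently.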
Another difference lies in the possibility to keep the original cell value if an error occurs during a transformation in OpenRefine, a feature that is not currently available in Grafterizer.
Some transformations in the tested use cases require cross-dataset operations. This feature has been introduced in OpenRefine, but Grafterizer currently does not allow several datasets to be read at the same time in one pipeline.
One rather useful feature of Grafterizer is the possibility to edit the parameters of each transformation step and to change the step order at any moment while creating the transformation, which is impossible in OpenRefine. At the same time, OpenRefine provides a transformation history with Undo/Redo options.
The functionality for constructing RDF mappings is similar in both tools, with some small differences. One of them is that, at its current stage, Grafterizer does not provide functionality for creating language-tagged nodes.
A short summary of the differences in functionality between the two tools is given in the table below.
Feature | Grafterizer | OpenRefine
Basic functionality | |
Encapsulating and reusing utility functions in one transformation | + | -
Ignore errors (leave original data on error) | - | +
Cross-dataset operations (join datasets) | - | +
Transformation operations management | |
Edit transformation operation | + | -
Change operation order | + | -
Transformation history with undo/redo options | - | +
RDF mapping | |
Language-tagged nodes | - | +
2.2 XML (GML)-TO-RDF transformations
2.2.1 Slovak pilot
Main motivation for the selected approach
Based on further analysis of the available datasets, technologies and knowledge capacities in the field of data harmonisation, and on previous experience documented in D3.3 (Chapter 6.4) [SMODD33], additional datasets have been harmonised in order to implement the SmartOpenData modelling framework in support of the Slovakian pilot. The main motivation of this approach was to investigate the possibilities of exposing both INSPIRE-compliant and other datasets to the Web of (Linked) Data and enriching these resources with external knowledge.
Data model and storage
In addition to the initial SK INSPIRE Protected Sites dataset53, transformed using the GeoKnow XSL stylesheets, a further list of datasets has been identified and prepared for transformation with the support of the COMSODE project54. The activity covered several of the datasets that SAZP publishes to comply with the European Union's INSPIRE directive55, including data on protected sites, species distribution, bio-geographical regions, and land cover, and an additional dataset on contaminated sites registered as environmental burdens. The INSPIRE datasets were described with the INSPIRE XML schemas, while the latter dataset used a custom XML schema. The source data is available in the Geography Markup Language (GML) via an API provided by the Web Feature Service (WFS).
Note for "Input dataset hyperlinks" in the following table: the instructions in this column relate to the bash script56 that downloads the individual datasets from the WFS (the script requires the curl HTTP client). The output of the script shows the dataset title and the relevant WFS request. The request URL is enclosed between the characters '<' and '>'. It must be copied as a whole (opening the queries in a browser is not recommended). The bash script downloads each dataset to the file system. Most of the requests contain a cql_filter that selects only the data for Slovakia (the database also contains data for the Czech Republic). The 'Corine landcover' request may take a few minutes, as it contains about 22000 features and the transformation from the relational DB into GML happens "on the fly". All data are in EPSG:4258 (ETRS89) geographic coordinates.
53 http://ckan.sazp.sk/dataset/inspire-protected-sites-linked-data/resource/fba4d3b8-195c-4224-a7b9-ab734c6e933d
54 http://www.comsode.eu/
55 http://inspire.ec.europa.eu/index.cfm/pageid/3
56 http://redmine.sazp.sk/attachments/download/136/retrieve-smod-datasets.sh
No. | Input dataset | Target vocabulary | Output dataset
1 | National parks and protected landscape areas57 | SmOD Protected Sites58 | SK LD INSPIRE Protected Sites59
2 | Small scale protected areas60 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
3 | Protected natural monuments61 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
4 | Special protection areas - Bird directive62 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
5 | Sites of community importance - Habitat Directive63 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
6 | Biosphere reserves64 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
7 | Ramsar65 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
8 | UNESCO world nature heritage sites66 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
9 | Protected landscape elements67 | SmOD Protected Sites | SK LD INSPIRE Protected Sites
10 | Corine Land Cover68 | SmOD Land Cover69 | SK LD Land Cover
11 | Contaminated sites / Environmental burdens | SK Contaminated sites / Environmental burdens vocabulary70 | SK LD Contaminated sites / Environmental burdens71
12 | Biogeographical regions72 | SmOD Biogeographical regions73 | SK LD Biogeographical regions74
13 | Species distribution (Selected taxons)75 | SmOD Species distribution76 | SK LD Species distribution77
Table 6: An overview of the datasets and vocabularies used in SK Pilot
57 WFS, GML> Dowloading '01. National parks and protected landscape areas' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typeName=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('nationalPark','ProtectedLandscapeOrSeascape') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `01_NP_LPA.gml'
58 http://www.w3.org/2015/03/inspire/ps#
59 http://ckan.sazp.sk/dataset/inspire-protected-sites-linked-data/resource/1d6e0fdf-df3d-4a69-bd5e-d49aa16d6596
60 WFS, GML> Dowloading '02. Small scale protected areas' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('managedResourceProtectedArea','strictNatureReserve','wildernessArea') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `02_SSPA.gml'
61 WFS, GML> Dowloading '03. Protected natural monuments' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('naturalMonument') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `03_PNM.gml'
62 WFS, GML> Dowloading '04. Special protection areas - Bird directive' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('specialProtectionArea') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `04_SPA.gml'
63 WFS, GML> Dowloading '05. Sites of community importance - Habitat Directive' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('siteOfCommunityImportance') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `05_SCI.gml'
64 WFS, GML> Dowloading '06. Biosphere reserves' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('biosphereReserve') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `06_BR.gml'
65 WFS, GML> Dowloading '07. Ramsar' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designationScheme" in ('ramsar') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `07_RAMSAR.gml'
66 WFS, GML> Dowloading '08. UNESCO world nature heritage sites' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designationScheme" in ('UNESCOWorldHeritage') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `08_UNESCO.gml'
67 WFS, GML> Dowloading '09. Protected landscape elements' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('naturalMonument') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `09_PLE.gml'
68 WFS, GML> Dowloading '10. Corine landcover' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=lcv:LandCoverUnit&cql_filter="lcv:inspireId/base33:Identifier/base33:namespace" = 'SK:GOV:MOE:SEA:LC'> GML output saved in `10_CLC.gml'
69 http://www.w3.org/2015/03/inspire/lc#
Process, tools and technologies
The whole process of data harmonisation was driven by the development of the related components of the SmartOpenData infrastructure as well as by selected elements of the COMSODE methodology for Open Data publishing78.
70 http://data.sazp.sk/vocab/contaminated-sites
71 http://ckan.sazp.sk/dataset/sk-environmental-burdens-contaminated-sites/resource/a33b9933-937a-4cca-89d7-223703bb1187
72 WFS, GML> Dowloading '12. Biogeographical regions' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typeName=br:Bio-geographicalRegion&cql_filter="br:inspireId/base33:Identifier/base33:namespace" = 'SK:GOV:MOE:SEA:BR'> GML output saved in `12_BIO_REGIONS.gml'
73 http://www.w3.org/2015/03/inspire/br#
74 http://ckan.sazp.sk/dataset/sk-inspire-bio-geographical-regions-linked-data
75 WFS, GML> Dowloading '13. Species distribution' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typeName=sd:SpeciesDistributionUnit&cql_filter="sd:inspireId/base33:Identifier/base33:namespace" = 'SK:GOV:MOE:SEA:SD'> GML output saved in `13_SD.gml'
76 http://www.w3.org/2015/03/inspire/sd#
77 http://ckan.sazp.sk/dataset/sk-inspire-species-distribution-linked-data
78 http://opendatanode.org/product/methodology-for-od-publishing/
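The WFS GetFeature requests listed in the footnotes to Table 6 all follow one pattern: a fixed endpoint, a feature type name and a CQL filter. As a minimal sketch of how such a request URL could be assembled programmatically, the Python fragment below rebuilds the "Biosphere reserves" request with percent-encoding applied. The helper name get_feature_url is hypothetical, the original download script uses bash and curl rather than Python, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Endpoint and parameter values as they appear in the footnotes to Table 6.
WFS_ENDPOINT = "http://inspire.geop.sazp.sk/geoserver/wfs"

def get_feature_url(type_name, cql_filter):
    """Build a percent-encoded WFS 1.1.0 GetFeature URL (hypothetical helper)."""
    params = {
        "request": "GetFeature",
        "version": "1.1.0",
        "outputFormat": "gml32",
        "typeName": type_name,
        "cql_filter": cql_filter,
    }
    return WFS_ENDPOINT + "?" + urlencode(params)

designation_path = '"ps:siteDesignation/ps:DesignationType/ps:designation"'
namespace_path = '"ps:inspireID/base:Identifier/base:namespace"'
cql = (
    f"{designation_path} in ('biosphereReserve') "
    f"and {namespace_path} = 'SK:GOV:MOE:SEA:PS'"
)

url = get_feature_url("ps:ProtectedSite", cql)
# Decode the query string again to check that nothing was lost in encoding.
roundtrip = parse_qs(urlsplit(url).query)
print(url)
```

Encoding the filter avoids the copy-and-paste problems mentioned in the note above, since quotes and spaces in the CQL expression are no longer interpreted by a shell or a browser.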
Table 7: List of phases and tasks extracted and deployed from the COMSODE methodology for Open Data publishing
In brief, the whole process started with harvesting the data from the WFS and converting it to RDF. During the conversion, the data were aligned with selected RDF vocabularies and code lists. Some of the newly created linked data were then interlinked with third-party data in order to enrich them.
Creating linked data
The whole process of data transformation has been undertaken with the support of the UnifiedViews Extract-Transform-Load (ETL) framework79, the core component of Open Data Node (ODN), a publication platform for Open Data in which it ensures the extraction, transformation and publishing of (Linked) Open Data. This environment makes it possible to define, execute, monitor, debug, schedule and share RDF data processing tasks.
A data processing task (or simply task) consists of one or more data processing units (DPUs). Tasks may also use custom plugins, that is, DPUs created by users. A DPU encapsulates certain business logic needed when processing data (e.g., one DPU may extract data from an RDF database or apply a SPARQL query). Every DPU has its inputs, outputs, business logic and configuration. UnifiedViews differs from other ETL frameworks by natively supporting RDF data and ontologies. UnifiedViews has a graphical user interface for the administration, debugging and monitoring of the ETL process.
Since GML is an XML format, the harvested data were converted to RDF/XML via XSL transformations. For this purpose, the XSL transformations developed by the GeoKnow project80 were reused. To reflect recent developments, an extensive set of the GeoKnow XSLT stylesheets has been updated81. Besides some bug fixes, these updates contained changes related to the mapping against the SmOD vocabularies82 as well as specific modifications related to UnifiedViews.
79 http://opendatanode.org/product/unifiedviews/
80 http://geoknow.eu
81 https://github.com/jindrichmynarz/TripleGeo/tree/sazp/xslt
82 http://www.w3.org/2015/03/inspire
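To give a feeling for what the GML-to-RDF conversion does, the sketch below maps a single GML point to one RDF triple with a WKT literal. It deliberately uses Python's ElementTree instead of the GeoKnow XSLT stylesheets that were actually used, and the sample feature, the base URI http://example.org/site/ and the function name point_to_ntriple are hypothetical. The axis swap assumes that the source coordinates are listed latitude first and ignores any datum difference.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml/3.2"
GEO_AS_WKT = "http://www.opengis.net/ont/geosparql#asWKT"
WKT_LITERAL = "http://www.opengis.net/ont/geosparql#wktLiteral"

# Hypothetical, heavily simplified feature; real INSPIRE GML is far richer.
SAMPLE_GML = """\
<site xmlns:gml="http://www.opengis.net/gml/3.2" id="example-1">
  <gml:Point><gml:pos>48.7 19.5</gml:pos></gml:Point>
</site>"""

def point_to_ntriple(gml_text, base_uri, swap_axes=True):
    """Emit one N-Triples line carrying the point geometry as a WKT literal."""
    root = ET.fromstring(gml_text)
    first, second = root.find(f".//{{{GML_NS}}}pos").text.split()
    # Assumption: the source lists latitude first, while WKT consumers
    # usually expect x (longitude) first, so the two values are swapped.
    x, y = (second, first) if swap_axes else (first, second)
    subject = f"<{base_uri}{root.get('id')}>"
    literal = f'"POINT({x} {y})"^^<{WKT_LITERAL}>'
    return f"{subject} <{GEO_AS_WKT}> {literal} ."

triple = point_to_ntriple(SAMPLE_GML, "http://example.org/site/")
print(triple)
```

An XSLT stylesheet expresses the same mapping declaratively as templates that match GML elements, which is why it fits naturally into a UnifiedViews pipeline as a transformation step.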
Figure 14: List of updated GeoKnow XSLT stylesheets
For each dataset a data processing pipeline has been built in the UnifiedViews component of the ODN. The pipelines harvested the data from the WFS and converted it to RDF via XSL transformations.
Figure 15: Landing page for Unified Views
Figure 16: List of created pipelines
The transformation of the SK datasets has been designed and executed with the following DPUs:
● e-distributionMetadata
● e-filesDownload
● e-sparqlEndpoint
● l-filesToCkan
● l-filesToParliament
● l-filesToVirtuoso
● l-filesUpload
● l-rdfToCkan
● t-filesToRdf
● t-geonamesOrgToRdfFile
● t-gunzipper
● t-rdfToFiles
● t-sparqlConstruct
● t-sparqlUpdate
● t-unzipper
● t-xslt
● t-zipper
Figure 17: Section with DPU templates
Each DPU comes with specific functionality. For example, l-filesToParliament is a DPU that loads RDF serialized in files into the Parliament RDF store via its HTTP API for bulk upload83, whilst t-geonamesOrgToRdfFile is a DPU that transforms a dump of Geonames.org data into RDF. The dump is not valid RDF, since it consists of line-separated pairs of URIs and the corresponding descriptions of those URIs serialized in RDF/XML. This DPU parses the dump format and outputs a valid RDF file84.
Figure 18: Pipelines execution monitor
83 https://github.com/UnifiedViews/Plugins/blob/master/l-filesToParliament/doc/About.md
84 https://github.com/comsode-uv-plugins/t-geonamesOrgToRdfFile/blob/develop/README.md
Figure 19: Scheduler with the possibility to define the schedules for pipelines execution
Figure 20: Section with additional settings
Figure 21: Example of pipeline details
Figure 22: Example of further DPU settings
To fulfill the requirements of the project, ODN has been enhanced with a loader to the Parliament RDF store (http://parliament.semwebcentral.org) and an extractor for Geonames.org. Parliament is used by SAZP to store RDF data because it supports geospatial features. The extractor for Geonames.org was needed in order to be able to link to this dataset.
Interlinking
To provide links to external resources, possible enrichments of the generated linked data were identified. In addition to the data transformation pipelines, pipelines were created for enriching the datasets with links to external datasets, including Geonames.org and 3 datasets from the European Environment Agency (Biogeographical regions 2011, Natura 2000 and EUNIS):
● SK Protected Sites <> GeoNames85
● SK Protected Sites <> EEA Natura 2000 86
● SK Contaminated Sites <> GeoNames
85 http://www.geonames.org/
86 http://natura2000.eea.europa.eu/rdf/
Figure 23: Example of the interlinking pipeline
Publishing
The main outcomes of these harmonisation activities are available via:
● Human-readable SAZP Open Data portal interface based on CKAN, providing the possibility to search metadata and visualise all harmonised SK linked data resources87
● Machine-readable GeoSPARQL API88
● Web application interface supporting GeoSPARQL queries89
Visualizations will be supported with the extensions of LDVMi90.
87 http://data.sazp.sk/
88 http://data.sazp.sk/parliament/sparql
89 http://data.sazp.sk/parliament/query.jsp
90 http://ldvm.net
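As a hedged illustration of how the machine-readable API could be addressed, the following Python sketch builds a SPARQL Protocol GET request URL for the endpoint given in footnote 88. It assumes that the endpoint accepts the standard query parameter and that the published data uses the GeoSPARQL properties geo:hasGeometry and geo:asWKT; neither assumption is confirmed in this document. The helper name sparql_get_url is hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Endpoint taken from footnote 88.
SPARQL_ENDPOINT = "http://data.sazp.sk/parliament/sparql"

# Assumed query shape: features with a geometry serialized as WKT.
QUERY = """\
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?feature ?wkt
WHERE { ?feature geo:hasGeometry/geo:asWKT ?wkt }
LIMIT 10"""

def sparql_get_url(endpoint, query):
    """Build a GET request URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

request_url = sparql_get_url(SPARQL_ENDPOINT, QUERY)
# Decode the URL again to confirm the query survives the encoding unchanged.
decoded_query = parse_qs(urlsplit(request_url).query)["query"][0]
print(request_url)
```

The same query text can also be pasted into the web application interface from footnote 89, which is the more convenient route for exploratory use.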
Figure 24: CKAN interface with the list of metadata for the open linked data from Slovak pilot
Figure 25: Parliament web application interface
Observed benefits and limitations
A key benefit of the RDF version of the SAZP datasets is that it is straightforward to combine them with third-party datasets. In this way, large and rich datasets such as Geonames.org can be linked, and additional features may be drawn from them to frame the original data in a broader context.
During the creation of the visualizations, the use of coordinate reference systems (CRS) was identified as a stumbling block. Even though the CRS was changed to a more common one, most visualization tools cannot directly reuse data projected in this CRS because they expect the inverse order of coordinates. This is also the case for OpenLayers 3 (http://openlayers.org/), the visualization library used, which required re-projection of the coordinates on the client side.
Ultimately, visualizing the data by projecting it on a map allowed for visual inspection that revealed errors in its coordinates, which were subsequently fixed. In this way, the exercise helped to improve the quality of the primary data: transforming data and viewing it from different perspectives can reveal errors and thus contribute to better data quality.
Recommendations & Future outlook
Publishing data that adheres to common standards, such as the INSPIRE schemas, makes it more reusable. In the case of the SAZP datasets, standardization made it possible to reuse parts of the GeoKnow XSL transformations that were made for INSPIRE-compliant data instead of creating our own from scratch. We learnt a similar lesson for the CRS. In order to improve the reusability of geospatial data on the Web, it should be available at least in the WGS 84/Pseudo-Mercator (Spherical Mercator) CRS, which is supported natively in most tools. As for the formats for geographic geometries, encoding them as Well-Known Text (WKT) RDF literals was found to offer a good trade-off between granularity and data volume.
Based on this experience, further investigation will take place to identify which datasets should be extended in their coverage and which new ones are the best candidates for further harmonisation, as well as for possible linking and enrichment with external third-party linked data resources.
2.3 Relational DB-to-RDF transformations
2.3.1 Czech pilot
The goal of the Czech pilot is to transform the NFI (National Forest Inventory) data from the relational database into RDF/XML or Turtle and to publish this LOD on the web. At the beginning of the SmartOpenData project, the UHUL FMI was supposed to establish a SPARQL endpoint and use on-the-fly transformation. However, no suitable lightweight tool for the UHUL FMI production environment was found; one of the issues was the dependency of most of the available tools on the Java platform. The UHUL FMI therefore decided to use a static transformation into a file, in order to publish the data at least statically. This approach is not unreasonable, because the NFI data are created at one moment and remain stable for a period of approximately one year.
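A static transformation of this kind can be sketched as a script that reads rows from the database and writes them out as Turtle. The example below is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, the column names (point_est, lower_limit, upper_limit), the ex: namespace and the function name rows_to_turtle are hypothetical, and the single inserted row is made-up sample data rather than an NFI result.

```python
import sqlite3

# In-memory SQLite database standing in for the public PostgreSQL store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_nfi_estimate "
    "(id INTEGER, point_est REAL, lower_limit REAL, upper_limit REAL)"
)
# Made-up sample row, not a real NFI estimate.
conn.execute("INSERT INTO t_nfi_estimate VALUES (1, 250.5, 240.0, 261.0)")

PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix ex: <http://example.org/nfi/> .\n"
)

def rows_to_turtle(connection):
    """Serialize every estimate row as one qb:Observation in Turtle."""
    blocks = [PREFIXES]
    query = (
        "SELECT id, point_est, lower_limit, upper_limit "
        "FROM t_nfi_estimate ORDER BY id"
    )
    for est_id, point, lower, upper in connection.execute(query):
        blocks.append(
            f"ex:estimate{est_id} a qb:Observation ;\n"
            f"    ex:pointEstimate {point} ;\n"
            f"    ex:lowerLimit {lower} ;\n"
            f"    ex:upperLimit {upper} .\n"
        )
    return "\n".join(blocks)

turtle = rows_to_turtle(conn)
print(turtle)
```

Since the NFI data change roughly once a year, such a script would only need to be rerun after each new data release, and the resulting file can be served by any static web server.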
Data model and storage
The UHUL FMI uses PostgreSQL/PostGIS as a key component for data storage and data analyses. Using it both for the NFI source database and for the public data store allows us to replicate the necessary data from the private database server to the public database server. The infrastructure therefore consists of two separate PostgreSQL databases, and the transformation itself uses the public one. This pilot description focuses on the transformation of data from the public database. The data model below represents the type of NFI information that is being published. In the middle is the main table t_nfi_estimate, which represents an estimate. Every estimate has a point estimate (a value) and a lower and an upper limit (a confidence interval). The estimate is further defined in lookup tables (a type, a unit of measure, an attribute filter, a geographic domain etc.); it could be, for example, the forest cover in hectares in the Czech Republic divided by forest owner. The relation to the geographic domain is important, because the UHUL FMI mostly uses the NUTS regions, which are commonly used among partners across the EU and are an appropriate entity for linkage with other data sources. Another possible linkage is through the NFI outcomes or attributes themselves, because many other EU countries provide NFI outcomes in the same way as the Czech Republic, and there are initiatives that try to define common attributes among the NFIs, e.g. ENFIN [91].

[91] http://www.enfin.info/

Figure 26: Czech pilot Data model

Relations, Links & Mappings
In order to create proper links and define the NFI data in a broader space (the internet), links (URLs) with the same meanings and definitions had to be found. Some existing URLs were sufficient for the NFI data, but it was also necessary to create a specific vocabulary for the NFI “forest” attributes, which was not available on the internet. For this purpose the UHUL FMI created a first draft of the NFI vocabulary in RDF, which contains short descriptions of the estimates. It will be desirable to find a responsible body to take care of this vocabulary; during SmOD we expect it to be the UHUL FMI. An example of the vocabulary, which will be available from http://nil.uhul.cz/lod/ns/nfi/, follows:

@prefix nfi: <http://nil.uhul.cz/nfi.ttl> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
nfi:ObAvGrowingStockPerHa rdfs:subClassOf qb:Observation ;
    rdfs:label "Observed average growing stock per hectare"@en ;
    rdfs:comment "The observed average growing stock in cubic metres per hectare."@en .
nfi:ObForestOwner rdfs:subClassOf qb:Observation ;
    rdfs:label "Observed forest owner" ;
    rdfs:comment "The observed type of forest owner. Each observation is linked to the relevant owner type defined in http://nil.uhul.cz/lod/ns/fot" .
nfi:ObGrowingStock rdfs:subClassOf qb:Observation ;
    rdfs:label "Observed growing stock"@en ;
    rdfs:comment "The observed growing stock (in cubic metres) within the specified area."@en .
...

For visualisation the UHUL FMI also needs a geometric representation of a geographic domain; therefore the NUTS regions are also published in WKT form on the webpage (http://nil.uhul.cz). Some sources for the NUTS regions are already available on the web, but the NFI uses its own generalisation of the geometry for the map client: it is faster for a web map window to use a ready-made geometry than to generalise it dynamically on the client side for every request. If a proper NUTS 3 geometry representation becomes available on the web, this vocabulary will no longer be needed. The vocabulary has the following format and will be available at http://nil.uhul.cz/lod/ns/nuts/ :

@prefix unit: <http://qudt.org/1.1/vocab/unit#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix nfi: <http://nil.uhul.cz/lod/nfi#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .

<http://nil.uhul.cz/lod/geo/nuts#CZ032> a <http://www.w3.org/2015/03/inspire/au#AdministrativeUnit> ;
    rdfs:comment "CZ032 - Plzeňský" ;
    rdfs:label "CZ032" ;
    owl:sameAs <http://nuts.geovocab.org/id/CZ032> ,
        <http://estatwrap.ontologycentral.com/dic/geo#CZ032> ;
    geo:asWKT "POLYGON((13.7657560325118 49.5140373364391,13.7478475772677 49.4868312489771, …

Data published by the NFI are mostly statistical; therefore the UHUL FMI could use available mathematical and physical vocabularies for the data definition, e.g.:
● http://purl.org/NET/scovo#
● http://qudt.org/1.1/vocab/unit#
And also vocabularies for geographical relations and entities, some of them created and recommended during the SmOD project:
● http://www.opengis.net/ont/geosparql#
● http://www.w3.org/2015/03/inspire/au#
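
The NUTS vocabulary entries shown above pair a region code with a WKT polygon. As a rough illustration only, the Python sketch below serialises one such entry from a list of coordinates. The function names, the third coordinate in the test data and the rounding step are assumptions made for this example; in particular, rounding coordinates merely keeps the literal short and is not the generalisation method used by the NFI.

```python
def polygon_wkt(ring, precision=6):
    """Serialise one exterior ring of (lon, lat) pairs as a WKT POLYGON."""
    if len(ring) < 3:
        raise ValueError("a polygon ring needs at least three points")
    points = [(round(lon, precision), round(lat, precision)) for lon, lat in ring]
    if points[0] != points[-1]:
        points.append(points[0])  # WKT rings must be closed
    body = ",".join(f"{lon} {lat}" for lon, lat in points)
    return f"POLYGON(({body}))"


def nuts_region_turtle(code, label, ring):
    """Build a Turtle fragment for one NUTS region (hypothetical helper)."""
    subject = f"<http://nil.uhul.cz/lod/geo/nuts#{code}>"
    return (
        f'{subject} rdfs:label "{code}" ;\n'
        f'    rdfs:comment "{code} - {label}" ;\n'
        f'    geo:asWKT "{polygon_wkt(ring)}" .'
    )


if __name__ == "__main__":
    # Illustrative coordinates only, not the real boundary of CZ032.
    ring = [(13.0, 49.0), (14.0, 49.0), (14.0, 50.0)]
    print(nuts_region_turtle("CZ032", "Plzeňský", ring))
```

The prefixes rdfs: and geo: are assumed to be declared as in the vocabulary excerpt above.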
Transformation
SmartOpenData technical meetings, documents and hackathons helped us to test the tools suitable for presenting our outcomes. We tested D2RQ and r2rml-parser [92] for publishing the NFI results from our database, and Virtuoso for further data processing and visualisation. Whether we will use D2RQ for data transformation in the production environment depends on several conditions, for example the long-term support of D2RQ, security, and the support of Java technology by the Ministry of Agriculture. The data currently published at http://nil.uhul.cz was created with r2rml-parser. Once the links had been set up (as described in the previous section), the transformation could be done. The mapping for the transformation was defined in R2RML syntax [93]. The data in the relational database has not been translated from Czech, therefore the language attribute had to be used. Below is an example of the mapping file for an estimate of the forest cover:

#
# forest_cover
#
@prefix map: <#>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix au: <http://www.w3.org/2015/03/inspire/au#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix geo: <http://www.opengis.net/ont/geosparql#>.
@prefix scovo: <http://purl.org/NET/scovo#>.
@prefix unit: <http://qudt.org/1.1/vocab/unit#> .
@prefix nfi: <http://nil.uhul.cz/lod/nfi#> .

### NIL database mappings
map:spatial
    rr:logicalTable <#forest>;
    rr:subjectMap [
        rr:template 'http://nil.uhul.cz/lod/nfi/forest_cover#{"id_result"}';
        rr:class nfi:forest_cover;
    ];
    rr:predicateObjectMap [
        rr:predicate rdf:value;
        rr:objectMap [ rr:column "point_estimate"; ];
    ];
    rr:predicateObjectMap [
        rr:predicate scovo:max;
        rr:objectMap [ rr:column "upper_limit" ];
    ];
    rr:predicateObjectMap [
        rr:predicate scovo:min;
        rr:objectMap [ rr:column "lower_limit" ];
    ];
    rr:predicateObjectMap [
        rr:predicate unit:units;
        rr:objectMap [ rr:constant unit:Percent; ];
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:adomainId;
        rr:objectMap [ rr:column "adomain" ];
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:adomain;
        rr:objectMap [ rr:column "adomain_label"; rr:language "cs" ];
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:adomain_description;
        rr:objectMap [ rr:column "adomain_description"; rr:language "cs" ];
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:nfi_cycle;
        rr:objectMap [ rr:constant "2001 - 2004"; rr:termType rr:Literal; ];
    ];
    rr:predicateObjectMap [
        rr:predicate geo:hasGeometry;
        rr:objectMap [ rr:template 'http://nil.uhul.cz/lod/geo/nuts#{"gdomain_label"}'; rr:termType rr:IRI; ];
    ];
    .
...

[92] https://github.com/nkons/r2rml-parser
[93] http://www.w3.org/TR/r2rml/

All transformations have been done using r2rml-parser, which can also create a triple store with a dynamic connection to the database; at this moment, however, only a one-time transformation has been done. The final RDF/XML or Turtle files are available from http://nil.uhul.cz/lod/ns/* , where * represents the name of a vocabulary, e.g. http://nil.uhul.cz/lod/nfi/forest_cover/ or http://nil.uhul.cz/lod/nfi/forest_cover.ttl . The estimates are presented in temporal cycles, so the outcomes can be compared between time periods. By default a user should always get the latest values; if someone needs older data, the link should look like e.g. http://nil.uhul.cz/lod/nfi/forest_cover/AGS2001-2004.rdf, where AGS2001-2004 stands for “Average Growing Stock during 2001-2004 period”. In order to also have a “raw HTML”, human-readable version of the estimates, an XSLT transformation has been applied to the RDF/XML output. The forest cover can also be accessed at http://nil.uhul.cz/lod/nfi/forest_cover.html .

The NFI data are suitable for adopting the RDF Data Cube vocabulary described in Section 3. All estimates could be defined as observations and specified by dimensions; below is an example of an estimate of the forest area divided according to tree species:

@prefix sa: <http://nil.uhul.cz/lod/nfi/species-area/> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix smod: <http://www.w3.org/2015/03/inspire/smod#> .
@prefix scovo: <http://purl.org/NET/scovo#> .
@prefix nuts: <http://nil.uhul.cz/lod/ns/nuts#> .
@prefix ts:
    <http://nil.uhul.cz/lod/ns/species-area#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix unit: <http://qudt.org/1.1/vocab/unit#> .
...
sa:ob4882 a qb:Observation, nfi:ObSpeciesArea ;
    qb:dataSet sa:SA2001-2004 ;
    nfi:refArea nuts:CZ0 ;
    nfi:cycle <http://reference.data.gov.uk/id/gregorian-interval/2001-01-01T00:00:00/P3Y> ;
    nfi:treeSpecies ts:ts2500 ;
    sdmx-attribute:unitMeasure unit:Hectare ;
    smod:areaHa "5586"^^xsd:double ;
    scovo:max "6833"^^xsd:double ;
    scovo:min "4339"^^xsd:double .
…

To model the data above, the NFI had to define the tree species used as terms in a vocabulary; a similar concept has been used for other estimates. Example below:

@prefix ts: <http://nil.uhul.cz/lod/ns/species-area#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix nfi: <http://nil.uhul.cz/lod/ns/nfi#> .
...
ts:ts2500 a skos:Concept ;
    skos:prefLabel "DBC"@cs ;
    skos:prefLabel "Red oak"@en ;
    skos:definition "Dub červený"@cs ;
    skos:definition "Red oak"@en ;
    skos:notation "2500"^^nfi:UHULID ;
    skos:inScheme <http://nil.uhul.cz/lod/ns/species-area> ;
    skos:broader ts:ts6400 .
...

3 Harmonising Observations and Measurements

The final SmOD model suggests adopting the RDF Data Cube vocabulary [94] to encode environmental measurements and observations in RDF. In this section we illustrate the application of the Data Cube framework using the example of a water quality measurement taken from the Italian pilot. We discuss third-party vocabularies as well as custom terms used to encode environmental measurements.

3.1 RDF Data Cube: Example

Environmental observations are essentially numeric values accompanied by numerous attributes that allow these values to be interpreted and described, e.g.:

“Average concentration of benzene in the water of the Sciaguana lake in 2013 was 0.1 µg/kg”

In the example above, “0.1” is the observation value, which by itself does not give us much information. However, if we consider its attributes, we can interpret the value:
● “benzene” - what was measured?
● “concentration” - what quality of benzene was measured?
● “2013” - when was it measured?
● “the Sciaguana lake” - where was it measured?
● “µg/kg” - in what units was it measured?

3.1.1 Data Cube Components

The snippet below demonstrates how to encode the example observation in RDF using the RDF Data Cube approach:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Observation/0> a qb:Observation ;
dbpedia-owl:average "0.1"^^xsd:float ;
arpa-components:hasObservedDeterminand
<http://data.smartopendata.eu/WFD/Determinand/71-43-2> ;
sdmx-attribute:unitMeasure <http://data.smartopendata.eu/WFD/UnitOfMeasure/9> ;
sdmx-dimension:refPeriod
<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/> ;
arpa-components:station
    <http://data.smartopendata.eu/WaterbaseLakes/so/Station/IT19LW09453> .

In Data Cube terms, the properties used to describe the observation above, such as dbpedia-owl:average or sdmx-dimension:refPeriod, are called components. To represent the pilots’ observations, we re-used components defined by the Statistical Data and Metadata eXchange (SDMX) code lists [95]. The table below summarises the SDMX properties and the properties from other vocabularies used in the pilots’ data:

Prefix | Namespace | Terms used
sdmx-dimension | http://purl.org/linked-data/sdmx/2009/dimension# | sdmx-dimension:refPeriod
sdmx-attribute | http://purl.org/linked-data/sdmx/2009/attribute# | sdmx-attribute:unitMeasure
dbpedia-owl | http://dbpedia.org/ontology/ | dbpedia-owl:min, dbpedia-owl:max, dbpedia-owl:average, dbpedia-owl:mean
dcterms | http://purl.org/dc/terms/ | dcterms:subject, dcterms:source
schema | http://schema.org/ | schema:minValue, schema:maxValue

In addition to the components presented in the table above, custom components have been defined for the Italian and Portuguese-Spanish pilots:

Prefix | Namespace | Terms used
arpa-components | http://smod-fp7.github.io/components/arpa-components.ttl | arpa-components:basePhenomenon, arpa-components:hasObservedDeterminand, arpa-components:station, arpa-components:numberOfSamples
tragsa-components | http://smod-fp7.github.io/components/tragsa-components.ttl | tragsa-components:workUnit

[94] http://www.w3.org/TR/vocab-data-cube/

Components Values
Whenever possible, values of components have been encoded via existing SKOS concept schemes or other resources.

Values of sdmx-dimension:refPeriod

The temporal aspect of measurements was represented using the reference time URI set developed by data.gov.uk. For example, in the Portuguese-Spanish pilot climatology measurements are aggregated over several years, 1981-2010. We encoded this time period using the following pattern:

<http://reference.data.gov.uk/id/gregorian-interval>/<start-datetime>/P<n-of-years>Y

Hence, the URI of the period of time that corresponds to 21 years starting from 1981 looks as follows:

<http://reference.data.gov.uk/id/gregorian-interval/1981-01-01T00:00:00/P21Y>

Values of sdmx-attribute:unitMeasure

The table below summarises the values of Units of Measurement (UoM) used in the Portuguese-Spanish pilot:

Measurement | Unit of Measurement | URI
Average Annual Rainfall Level, Runoff | mm^3 | qudt-unit:CubicMillimeter
Slope, Annual Humidity Level | percent | qudt-unit:Percent
Average/Minimum/Maximum Annual Temperature | degrees Celsius | qudt-unit:DegreeCelsius
Annual Evapotranspiration Level | mm | qudt-unit:Millimeter
Annual Radiation Level | Kcal/cm^2 | qudt-unit:KilocaloriePerSquareCentimeter
Annual Insolation Level | sun hours per year | qudt-unit:NumberPerYear

In the Italian pilot, units of measurement are defined by the EEA code list [96]. We have transformed this list into an RDF vocabulary, defining each unit of measurement as an instance of the qudt:Unit class. For example, below is the definition of μg/l:

<http://data.smartopendata.eu/WFD/UnitOfMeasure/9> a qudt:Unit ;
    rdfs:label "μg/l" ;
    rdfs:comment "microgrammes per liter" .

The complete code list is published together with the Portuguese-Spanish data [97].

Values of arpa-components:hasObservedDeterminand

Values of the observed determinand are also defined in an EEA code list [98]. We have transformed it into an RDF vocabulary, defining every compound as an instance of the custom class smod:Determinand, e.g.:

<http://data.smartopendata.eu/WFD/Determinand/71-43-2> a smod:Determinand ;
    rdfs:label "Benzene" .

The complete code list is published together with the Portuguese-Spanish data [99].

[95] SDMX guidelines contain standard code lists that are intended to be generic and reusable across various datasets.
[96] http://dd.eionet.europa.eu/dataelements/48239
[97] https://s3-eu-west-1.amazonaws.com/smod-repo/tragsa-release3/rdf.zip
[98] http://dd.eionet.europa.eu/datasets/latest/Groundwater/tables/HazSubstGW_Disagg/elements/DeterminandCode
[99] https://s3-eu-west-1.amazonaws.com/smod-repo/tragsa-release3/rdf.zip
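
The reference-period pattern used for sdmx-dimension:refPeriod can be sketched in a few lines of Python. This is only an illustration of the URI pattern quoted in the text; the function name interval_uri and the restriction to whole years are assumptions of this sketch, not part of the pilots' tooling.

```python
from datetime import datetime

# Base of the data.gov.uk reference time URI set, as quoted in the text.
BASE = "http://reference.data.gov.uk/id/gregorian-interval"


def interval_uri(start: datetime, years: int) -> str:
    """Build the URI of a period of whole years starting at `start`."""
    if years < 1:
        raise ValueError("years must be a positive integer")
    return f"{BASE}/{start.strftime('%Y-%m-%dT%H:%M:%S')}/P{years}Y"


if __name__ == "__main__":
    # Reproduces the 21-year period starting in 1981 used as the example.
    print(interval_uri(datetime(1981, 1, 1), 21))
```

Keeping the pattern in one helper makes it harder to mistype the start datetime when many observations share the same reference period.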
3.1.2 Data Cube Datasets

Data Cube completes the definition of observations by specifying which dataset each observation belongs to. This is done through the property qb:dataSet, as shown below for the running example:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Observation/0>
    qb:dataSet <http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/> .

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/> a qb:DataSet .

Like observations, a dataset may have components as well:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/>
    rdfs:comment "Aggregated data on hazardous substances reported in the Waterbase - Lakes dataset of the European Environmental Agency."@en ;
    dcterms:source <http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10> ;
    dcterms:subject <http://www.eionet.europa.eu/gemet/concept/9214> ;
    arpa-components:basePhenomenon "concentration" .

The values of the dataset’s components hold for all the observations of the dataset. Thus, for example, we know that the example observation comes from the dataset that is available at http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10.

3.1.3 Data Cube Structures

Once defined, the components are grouped into structures, e.g.:

<http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/DSD/> a qb:DataStructureDefinition ;
    rdfs:comment "Data structure definition for hazardous substances reported in the Waterbase - Rivers dataset of the European Environment Agency and used in the Italian pilot of the SmartOpenData project, http://www.smartopendata.eu/"@en ;
    qb:component [ qb:attribute dcterms:subject ; qb:componentAttachment qb:DataSet ] ;
    qb:component [ qb:attribute dcterms:source ; qb:componentAttachment qb:DataSet ] ;
    qb:component [ qb:attribute arpa-components:basePhenomenon ; qb:componentAttachment qb:DataSet ] ;
    qb:component [ qb:attribute sdmx-attribute:unitMeasure ] ;
    qb:component [ qb:attribute arpa-components:hasObservedDeterminand ] ;
    qb:component [ qb:attribute arpa-components:station ] ;
    qb:component [ qb:measure dbpedia-owl:average ] ;
    qb:component [ qb:dimension sdmx-dimension:refPeriod ] .

Structures have two main objectives. Firstly, they allow the default (qb:Observation) level of attachment of a component to be changed. In other words, one can specify whether the value of a component is specific to each observation or can be generalised over a dataset. Secondly, such structures can be re-used across similar datasets. For example, we used the structure from the snippet above for the Waterbase - Lakes dataset:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/>
    qb:structure <http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/DSD/> .

… and for the Waterbase - Rivers dataset:

<http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/Dataset/>
    qb:structure <http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/DSD/> .

Definitions of the datasets and structures of the Italian and Portuguese-Spanish pilots are available at http://smod-fp7.github.io/dsd/arpa-dsd-dataset.ttl and http://smod-fp7.github.io/dsd/tragsa-dsd-dataset.ttl respectively.
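
The effect of attachment levels can be illustrated with a small Python sketch: components attached at the dataset level are stated once and inherited by every observation. The dictionaries below copy values from the running example, the reference period is left out of the observation on purpose to show the completeness check, and the helper names resolve and missing_components are hypothetical. This is a simplified model, not an implementation of the Data Cube specification.

```python
# Attachment levels as declared in the structure definition quoted above:
# components with qb:componentAttachment qb:DataSet live on the dataset,
# all others default to the observation level.
DATASET_LEVEL = {
    "dcterms:subject",
    "dcterms:source",
    "arpa-components:basePhenomenon",
}
OBSERVATION_LEVEL = {
    "sdmx-attribute:unitMeasure",
    "arpa-components:hasObservedDeterminand",
    "arpa-components:station",
    "dbpedia-owl:average",
    "sdmx-dimension:refPeriod",
}

dataset = {
    "dcterms:source": "http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10",
    "dcterms:subject": "http://www.eionet.europa.eu/gemet/concept/9214",
    "arpa-components:basePhenomenon": "concentration",
}
observation = {
    "dbpedia-owl:average": 0.1,
    "arpa-components:hasObservedDeterminand": "http://data.smartopendata.eu/WFD/Determinand/71-43-2",
    "sdmx-attribute:unitMeasure": "http://data.smartopendata.eu/WFD/UnitOfMeasure/9",
    "arpa-components:station": "http://data.smartopendata.eu/WaterbaseLakes/so/Station/IT19LW09453",
}


def resolve(obs, ds):
    """Return every component value that holds for one observation."""
    merged = {k: v for k, v in ds.items() if k in DATASET_LEVEL}
    merged.update({k: v for k, v in obs.items() if k in OBSERVATION_LEVEL})
    return merged


def missing_components(obs, ds):
    """List components required by the structure but not provided."""
    required = DATASET_LEVEL | OBSERVATION_LEVEL
    return sorted(required - set(resolve(obs, ds)))


if __name__ == "__main__":
    print(resolve(observation, dataset)["arpa-components:basePhenomenon"])
    print(missing_components(observation, dataset))
```

A check of this kind is one way to verify that a dataset linked to a shared structure actually provides every component the structure declares.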
4 Conclusion

In this deliverable we reported on the final iteration of the task of data harmonisation to the SmOD model. The model is based on several INSPIRE themes that were chosen to represent the domains of the pilots conducted within the project. As Table 8 shows, there is an intersection between the domains of the pilots, for example in the topic of protected sites, which is relevant to most of the pilots. However, there are INSPIRE topics used by one pilot only, such as Environmental Monitoring Facility and Cadastral Parcel.

Table 8: Vocabulary usage by pilot (columns: Italian Pilot, Czech Pilot, Slovak Pilot, Irish Pilot, Portuguese-Spanish Pilot; rows: SmOD Protected Site, SmOD Land Use, SmOD Bio-Geographical Regions, SmOD Species Distribution, SmOD Corine Land Cover, SmOD Environmental Monitoring Facility, SmOD Custom Vocabulary, SmOD Administrative Units, SmOD Cadastral Parcels, Vocabularies of third parties, Own vocabularies)

In all the pilots, however, the model has been extended with custom terms. These custom terms were aggregated into a SmOD custom vocabulary, http://www.w3.org/2015/03/inspire/smod#, which currently contains classes and properties that were defined in the scope of the Italian and Portuguese-Spanish pilots. We discussed them in detail in the sections dedicated to each of the pilots. Other pilots published their custom terms in their own namespaces:
● Czech Pilot - the NFI vocabulary of forest attributes (will be available at http://nil.uhul.cz/lod/nfi/)
● Slovak Pilot - SK SmOD Environmental burdens / Contaminated sites vocabulary - https://data.sazp.sk/vocab/contaminated-sites/

Overall, our observation is that a common data model based on the existing INSPIRE standards facilitated the process of data harmonisation to a greater or lesser extent, depending on the settings and requirements of each pilot. Pilots with INSPIRE-compliant datasets, such as the Slovak one, not only used the model as a target schema for RDF transformations, but also took advantage of the transformation tools that exist for INSPIRE-compliant datasets. Other pilots, such as the Portuguese-Spanish one, used only a small fragment of the model compared to the domain extension they required.

We believe that in some cases custom terms could be found in other INSPIRE themes. Good examples of such cases are smod:catchmentName, which specifies the name of a catchment area in a water basin, and smod:featureName, which specifies the name of the feature of interest being monitored by the Environmental Facility. Both properties originate from the EEA Waterbase databases. These cases should be considered in future work related to the development of the SmOD model.

Technical contributions of the data harmonisation task include three different approaches devised and implemented in different pilots: CSV-to-RDF, XML-to-RDF and Relational DB-to-RDF.

We presented the results of using the RDF plugin for OpenRefine to perform CSV-to-RDF transformations and compared it to Grafterizer, a tool that is being actively developed at the moment. On the one hand, the rich functionality of OpenRefine allowed us to perform various data pre-processing steps and prepare data for RDF mappings. On the other hand, the GUI of the RDF plugin enabled intuitive and interactive construction of RDF skeletons for our data. We reported on several challenging cases in Annex A, but overall we managed to
perform all the required transformations. Perhaps the weakest side of this approach is its scalability. Although we did not hit this limitation, we are aware of the existing issues with memory usage in OpenRefine [100]. Grafterizer is presented as an alternative solution which is being actively developed and already provides several features not available in OpenRefine, such as reusing utility functions in one transformation, changing the order of operations and editing transformation operations.

In the scope of the Slovak pilot, the existing XML-to-RDF transformations produced by the GeoKnow project were customised to use the SmOD vocabularies and modified for usage within the Unified Views ETL framework [101].

Finally, in the setting of the Czech pilot, transformations of data from a relational database to RDF were covered. D2RQ and r2rml-parser are being evaluated as a definitive solution.

[100] https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory
[101] https://github.com/jindrichmynarz/TripleGeo/tree/sazp/xslt

5 References

[SMODD33] SmartOpenData EU/FP7 project, Report on the Initial Data Harmonisation D3.3, publicly available at http://www.smartopendata.eu/sites/default/files/SmartOpenData_D3.3_Initial_Data_Harmonisation.pdf
[SMODD32] SmartOpenData EU/FP7 project, Report on the Initial SmartOpenData Model D3.2, publicly available at http://www.smartopendata.eu/sites/default/files/SmartOpenData_D3.2_Initial%20Data%20Model.pdf
[SMODD34] SmartOpenData EU/FP7 project, Report on the Final SmartOpenData Model D3.4, to be published at the website of the project http://www.smartopendata.eu/public-deliverables
[SMODD52] SmartOpenData EU/FP7 project, Report on the First Iteration of pilots, D5.2, to be published at the website of the project http://www.smartopendata.eu/public-deliverables
[HM08] Halpin, Terry; Morgan, Tony (March 2008), Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design (2nd ed.), Morgan Kaufmann, ISBN 978-0-12-373568-3
[SMODD31] SmartOpenData EU/FP7 project, Review of geographic resources metadata and related metadata standards D3.1, publicly available at http://www.smartopendata.eu/sites/default/files/SmartOpenData%20D3.%201%20Review%20of%20geographic%20resources%20metadata%20and%20related%20metadata%20standards.pdf
onisation SmartOp
penData projeect (Grant no.: 603824) nex A: G
Generating R
RDF w
with Op
penReffine: Ann
Challenges and Solutio
ons Languaage Tag Customisattion In both Italian and Portuguese pilots we faced the n
need to gen
nerate languuage tagged
d literals hat containss data in m ultiple langguages. RDFF plugin for OpenRefine allows out of aa column th
us to sp
pecify only o
one languagge tag per oone Literal n
node, as sho
own in Figurre 1. Figure
e 27: RDF plu gin of OpenR
Refine, language tag We havve identified
d two possib
ble options to generate
e RDF with multiple lannguage tagss: ● split input dataset ho
orizontally bby language
e and run RDF R transfoormations for f each data slice e ● customise lliteral node
First op
ption was applied to
o produce RDF of th
he Sicilian Protected Sites out of the NATURA
A2000SITESS table of the NATUR
RA2000 dattabase. Thiis table co ntains info
ormation about P
Protected Sites in all EU memberss. Names, d
descriptionss and alike are all give
en in the membeer states lan
nguages. Th
he goal is too produce tthe followin
ng RDF triplle for everyy Sicilian site: <protec
ctedsiteURI>
> rdfs:label
l <SITENAME>^^@it . We creeated RDF mappings m
with w “it” lannguage tag and run th
hem on a ssubset of th
he input table w
with Sicilian sites only. W
We used O penRefine tto create su
uch a subseet. Even tho
ough the subset could be geenerated byy pre‐proce ssing input data before uploadingg it in Refine, doing SmartOpenDaata Consortium
m 2015 Version 11.0 Page 70 oof 78 © S
D3.5 Finaal Data Harmo
onisation SmartOp
penData projeect (Grant no.: 603824) it in Refine allows one to sto
ore all the ppre‐processing steps in
n one placee together w
with the RDF maappings. Second option we used in the
e Portugues e‐Spanish p
pilot to gene
erate RDF oout of datassets with adminisstrative uniits from bo
oth Portugaal and Spain. For exam
mple, Figurre 28 illustrrates an excerptt from the municipalities dataseet with Spaanish “A Ba
aña” and PPortuguese “Soure” municip
palities. Figure 28
8: Excerpt from
m the aux_04
40400_municiipality.csv The goaal is to geneerate a triple
e of the folllowing form
m for every municipalityy of the dattaset: <munici
ipalityURI> r
ramon:name <name>@<lang
<
g-tag> . If we sp
pecify “sp” as the language tag (a s shown in Figure 3), w
we will geneerate the fo
ollowing two trip
ples: <http://data
a.smartopend
data.eu/sp-p
pt-pilot/so/
/Municipality/ES11101000
07> ramon:na
ame "A
Baña"@sp .
a.smartopend
data.eu/sp-p
pt-pilot/so/
/Municipality/PT16211061
15> ramon:na
ame
<http://data
"Soure"@sp . Second triple is obviously not correct. In order to tell Reefine to gen
nerate “sp” tagged lite
erals out of the name column only when the municipality is Spanish, we can custoomise the vaalue of the Literal nodee as followss: Figure 29: RDF plugin of O
OpenRefine, lliteral node cu
ustomisation One can see in th
he preview that the ccondition faails on the Portuguesee municipaality and mapping, onnly one trip
ple is produced out of the two records of nothingg is output. With this m
the running examp
ple: <http://data
a.smartopend
data.eu/sp-p
pt-pilot/so/
/Municipality/ES11101000
07> ramon:na
ame "A
Baña"@sp . SmartOpenDaata Consortium
m 2015 Version 11.0 Page 71 oof 78 © S
D3.5 Finaal Data Harmo
onisation SmartOp
penData projeect (Grant no.: 603824) In order to outputt similar trip
ple for the Portuguese
e municipaliity, we nee d to add on
ne more mappin
ng of the “n
name” colum
mn to the lliteral tagge
ed with the language ““pt” and cu
ustomise the valu
ue of the “n
name” colum
mn as follow
ws: if(cell
ls[“idState”
”].value == “PT”, value, null) As a ressult, we willl generate ssecond tripl e for the Po
ortuguese m
municipalityy: <http://data
a.smartopend
data.eu/sp-p
pt-pilot/so/
/Municipality/PT16211061
15> ramon:na
ame
"Soure"@pt . RDF ou
ut of a List of Value
es In the Portuguesee‐Spanish pilot p
we haad to generate the fo
ollowing RD
DF triple fo
or every ObservaatoryTile: <observatory
ytileURI> sm
mod:supports
s <animalspe
eciesURI> .
Each observatory tile can support multiple animal species, as shown in the excerpt below from the input file with observatory tiles:

Figure 30: Excerpt from ObservationTiles.csv file

Column “speciesCode” contains a list of all the species supported by the tile with id “ES11100100070100100001001”. For each value in the list we want to construct a URI of the animal. In order to do so, we applied the following custom expression to the value of the “speciesCode” column when mapping it to the animal species node:

forEach(value.split("~~~"), v, "AnimalSpecies/" + v)

More than One Root Node

With the RDF plugin for OpenRefine it is possible to generate more than one root node for the same input dataset. We tried this functionality of OpenRefine to transform data of the Portuguese-Spanish pilot. For example, the “Work Unit Location” model (see Annex B) defines, among other instances, instances of two classes: smod:WorkUnit and cp:CadastralParcel. The input dataset for this model was generated from pd_06004_workunion.wkt (footnote 102) and contains two columns: “idWorkUnit” and “idParcel”. We could generate work unit and parcel instances together with the gsp:sfWithin relationship, as illustrated in Figure 31.

Figure 31: Excerpt from ObservationTiles.csv file

The problem with this solution is that the input dataset contains duplicates of “idParcel”, since the same parcel can contain more than one work unit. As a result, the RDF contains duplicates of parcel instances, as many as there are duplicates of “idParcel” in the input dataset. For example, there are two records with “idParcel” = “ES1133300010101400001”. The RDF of these records will look as follows:

<http://data.smartopendata.eu/sp-pt-pilot/so/WorkUnit/ES1133300010101400001001> a smod:WorkUnit ;
    gsp:sfWithin <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001> .
<http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001> a cp:CadastralParcel .
<http://data.smartopendata.eu/sp-pt-pilot/so/WorkUnit/ES1133300010101400001002> a smod:WorkUnit ;
    gsp:sfWithin <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001> .
<http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001> a cp:CadastralParcel .

Currently, there is no way in the RDF plugin to generate such instances just once. Our solution was to map instances of parcels in a separate project.
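For comparison, the duplicate-parcel problem can be avoided when the transformation is scripted rather than done in the RDF plugin. The Python sketch below is only an illustration under assumptions, not the approach taken in the pilot: the function name and the in-memory list of (idWorkUnit, idParcel) pairs are hypothetical, and the URI patterns are copied from the example above. It emits the cp:CadastralParcel statement only the first time a parcel identifier is seen.

```python
# Base URI copied from the pilot examples; everything else is illustrative.
BASE = "http://data.smartopendata.eu/sp-pt-pilot/so/"


def to_turtle(rows):
    """Turn (idWorkUnit, idParcel) pairs into Turtle statements.

    Each work unit gets its own statement; each parcel is typed only once.
    """
    statements = []
    seen_parcels = set()  # parcel ids that have already been typed
    for work_unit, parcel in rows:
        parcel_uri = f"<{BASE}Parcel/{parcel}>"
        statements.append(
            f"<{BASE}WorkUnit/{work_unit}> a smod:WorkUnit ; "
            f"gsp:sfWithin {parcel_uri} ."
        )
        if parcel not in seen_parcels:
            seen_parcels.add(parcel)
            statements.append(f"{parcel_uri} a cp:CadastralParcel .")
    return statements


# The two records from the example above share the same parcel.
rows = [
    ("ES1133300010101400001001", "ES1133300010101400001"),
    ("ES1133300010101400001002", "ES1133300010101400001"),
]
out = to_turtle(rows)
print("\n".join(out))
```

Note that repeated statements only inflate the serialised file: an RDF graph is a set of triples, so a store that loads the duplicated output ends up with a single parcel instance either way.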
102 Refer to Section 2.1.2, “Data Pre-Processing”, for more information about data pre-processing.

Annex B: Portuguese-Spanish pilot: ORM and RDF Models

Figures below illustrate ORM models developed for the 3rd release of the Portuguese-Spanish pilot and their translation to RDF(S).
Documentation of the ORM models is available online at http://smod-fp7.github.io/tragsa3/orm/ObjectTypeList.html, and the constraint validation report is available at http://smod-fp7.github.io/tragsa3/orm/ConstraintValidationReport.html.

Chemical Characteristics

Figure 32: Chemical Characteristics: ORM Model

Figure 33: Chemical Characteristics: RDF Model

Climatology

Figure 34: Climatology: ORM Model

Figure 35: Climatology: RDF Model

Forestry Tile

Figure 36: Forestry Tile: ORM Model

Figure 37: Forestry Tile: RDF Model

Geometry

Figure 38: Geometry: ORM Model

Figure 39: Geometry: RDF Model

Work Unit Ecosystem

Figure 40: Work Unit Ecosystem: ORM Model

Figure 41: Work Unit Ecosystem: RDF Model

Work Unit Location

Figure 42: Work Unit Location: ORM Model

Figure 43: Work Unit Location: RDF Model