This project has received funding from the European Union’s Seventh Programme for research, technological development and demonstration under grant agreement No 603824.

Harmonisation of data to SmartOpenData model. Final iteration
Deliverable D3.5 :: Public
Keywords: data harmonisation, ORM, RDF, RDFS, Linked Data
Linked Open Data for environment protection in Smart Regions

D3.5 Final Data Harmonisation
SmartOpenData project (Grant no.: 603824)
Version 1.0
© SmartOpenData Consortium 2015

Table of Contents
1 Introduction
2 Data Harmonisation
  2.1 CSV-to-RDF
    2.1.1 Italian pilot
    2.1.2 Portuguese-Spanish Pilot
    2.1.3 Irish pilot
    2.1.4 Transforming Data with Grafterizer and the Jarfter Service
  2.2 XML (GML)-to-RDF transformations
    2.2.1 Slovak pilot
  2.3 Relational DB-to-RDF transformations
    2.3.1 Czech pilot
3 Harmonising Observations and Measurements
  3.1 RDF Data Cube: Example
    3.1.1 Data Cube Components
    3.1.2 Data Cube Datasets
    3.1.3 Data Cube Structures
4 Conclusion
5 References
Annex A: Generating RDF with OpenRefine: Challenges and Solutions
  Language Tag Customisation
  RDF out of a List of Values
  More than one Root Nodes
Annex B: Portuguese-Spanish pilot: ORM and RDF Models
  Chemical Characteristics
  Climatology
  Forestry Tile
  Geometry
  Work Unit Ecosystem
  Work Unit Location

List of Figures
Figure 1: Workflow of OpenRefine-based data harmonisation
Figure 2: RDF model of Protected Sites
Figure 3: RDF model of Monitoring Stations
Figure 4: RDF model of Hazardous Substances
Figure 5: Portuguese-Spanish Pilot, data harmonisation methodology
Figure 6: Original Sample ARPA Data
Figure 7: RDF mapping for ARPA data
Figure 8: Generated RDF graph for ARPA data
Figure 9: User interface for Jarfter
Figure 10: Jarfter compiler services
Figure 11: Jarfter transformation web service
Figure 12: Dynamic deployment of data transformations
Figure 13: CloudML deployment template
Figure 14: List of updated GeoKnow XSLT stylesheets
Figure 15: Landing page for Unified Views
Figure 16: List of created pipelines
Figure 17: Section with DPU templates
Figure 18: Pipelines execution monitor
Figure 19: Scheduler with the possibility to define the schedules for pipelines execution
Figure 20: Section with additional settings
Figure 21: Example of pipeline details
Figure 22: Example of further DPU settings
Figure 23: Example of the interlinking pipeline
Figure 24: CKAN interface with the list of metadata for the open linked data from Slovak pilot
Figure 25: Parliament web application interface
Figure 26: Czech pilot Data model
Figure 27: RDF plugin of OpenRefine, language tag
Figure 28: Excerpt from the aux_040400_municipality.csv
Figure 29: RDF plugin of OpenRefine, literal node customisation
Figure 30: Excerpt from ObservationTiles.csv file
Figure 31: Excerpt from ObservationTiles.csv file
Figure 32: Chemical Characteristics: ORM Model
Figure 33: Chemical Characteristics: RDF Model
Figure 34: Climatology: ORM Model
Figure 35: Climatology: RDF Model
Figure 36: Forestry Tile: ORM Model
Figure 37: Forestry Tile: RDF Model
Figure 38: Geometry: ORM Model
Figure 39: Geometry: RDF Model
Figure 40: Work Unit Ecosystem: ORM Model
Figure 41: Work Unit Ecosystem: RDF Model
Figure 42: Work Unit Location: ORM Model
Figure 43: Work Unit Location: RDF Model

List of Tables
Table 1: Data transformation approaches
Table 2: Italian Pilot: summary of classes
Table 3: Italian Pilot: summary of data harmonisation
Table 4: Portuguese-Spanish Pilot: ORM constructs mapped to classes
Table 5: Portuguese-Spanish Pilot: ORM constructs mapped to properties
Table 6: An overview of the datasets and vocabularies used in SK Pilot
Table 7: List of phases and tasks extracted and deployed from the COMSODE methodology for Open Data publishing
Table 8: Vocabulary usage by pilot

Document Metadata
Contractual Date of Delivery to the EC: August 2015
Actual Date of Delivery to the EC: October 7th 2015
Editor(s): Tatiana Tarasova, SpazioDati
Contributor(s): Martin Tuchyňa (SAŽP), Jindřich Mynarz (SAŽP), Peter Mozolík (SAŽP), Dumitru Roman (SINTEF), Nikolay Nikolov (SINTEF), Antoine Pultier (SINTEF), Dina Sukhobok (SINTEF), Håvard H.
Holm (SINTEF), Jan Bojko (UHUL FMI), John O’Flaherty (MAC), Gregorio Urquía (TRAGSA), Jesús Estrada (TRAGSA)

Document History
Version | Version date | Responsible | Description
0.0 | 20/07/2015 | SpazioDati | Outline and contributions
0.1 | 30/07/2015 | UHUL FMI, HSRS | Czech pilot contributions
0.2 | 15/08/2015 | SAŽP | Slovak pilot contributions
0.3 | 21/08/2015 | SpazioDati, TRAGSA | Contribution to data harmonisation of the Italian and Portuguese-Spanish pilots
0.4 | 21/08/2015 | SpazioDati | Restructuring the report
0.5 | 24/08/2015 | SINTEF | Contribution on Grafterizer and comparison of Grafterizer with OpenRefine
0.6 | 27/08/2015 | UHUL FMI, SAŽP, SINTEF | Sections 2.2, 2.3 on the Slovak and Czech pilots finalised; Section 2.1.4 about Grafterizer completed
0.7 | 28/08/2015 | SpazioDati | Final version of the report with missing contribution from the Irish Pilot; submitted to the project coordinator
1.0 | 7/10/2015 | TRAGSA | Editorial review

The information and views set out in this publication are those of the author(s) and do not necessarily reflect the official opinion of the European Communities. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.

Copyright © 2015, SmartOpenData Consortium.

Executive Summary
Task 3.5 is dedicated to harmonising the pilots’ data to the final SmOD model delivered in D3.4 [SMODD34]. The model is based on several INSPIRE topics and provides a basis for geospatial and environmental data interoperability. As such, the model does not cover domain-specific concepts of the pilots.
Hence, the initial activities of the data harmonisation task included an evaluation of the SmOD model in the context of the pilots. Whenever the model was not sufficient to represent the domain of interest, we searched for existing commonly accepted or standard vocabularies, and if no suitable vocabulary was found, custom terms were developed. These custom terms constitute one of the main outcomes of the current task, the custom SmOD vocabulary. The vocabulary is published at http://www.w3.org/2015/03/inspire/smod#.

Operational aspects of the data harmonisation task concern data transformations from input data structures to RDF. Three different approaches were identified based on the pilots’ requirements:
● CSV-to-RDF (Spanish-Portuguese, Italian and Irish pilots)
● XML-to-RDF (Slovak pilot)
● RDBMS-to-RDF (Czech pilot)

This document explains the approaches, discusses the tools and technologies used to realise them and summarises the results of the data harmonisation task per pilot.

1 Introduction
The final SmartOpenData (SmOD) model was delivered with D3.4 [SMODD34]. It is based on the INSPIRE themes that were selected specifically to cover the domains of the pilots:
● The Generic Concept Model
● Protected Sites
● Land Use
● Administrative Units
● Bio-Geographical Units
● Species Distribution
● Corine Land Cover
● Environmental Monitoring Facilities
● Cadastral Parcels [1]

The model serves as a basis for harmonising data in the SmOD pilots. It defines basic concepts that are shared among the pilots, such as Protected Site or Cadastral Parcel. However, in addition to these basic concepts, every pilot contains concepts specific to its own domain, which are not covered by the model. As a result, every pilot had to extend the model with domain-specific terms.
These terms were searched for in existing resources, such as the Linked Open Vocabularies repository [2], schema.org and the DBpedia OWL ontology. Whenever existing resources were not sufficient for the pilot’s needs, custom terms were introduced. We accumulated these custom terms in the SmOD Custom Vocabulary published at http://www.w3.org/2015/03/inspire/smod#.

The rest of the document is structured as follows. We split Section 2 into three blocks, each of which corresponds to a different data transformation approach:

Section | Approach | Tools, Technologies
Section 2.1 | CSV-to-RDF | OpenRefine [3], RDF plugin for OpenRefine [4], Fusepool BatchRefine API [5]
Section 2.2 | XML-to-RDF | XSLT (based on customised GeoKnow stylesheets [6]) and SmOD INSPIRE Vocabularies [7] using OpenDataNode [8]
Section 2.3 | Relational DB-to-RDF | D2RQ, r2rml parser
Table 1: Data transformation approaches

[1] The Cadastral Parcel theme was added to the SmOD model after D3.4 had been finalised.
[2] http://lov.okfn.org/dataset/lov/
[3] http://openrefine.org/
[4] http://refine.deri.ie/
[5] https://github.com/fusepoolP3/p3-batchrefine
[6] https://web.imis.athena-innovation.gr/redmine/projects/geoknow_public/wiki/Inspire2RDF

The CSV-to-RDF approach was discussed in detail in D3.3 [SMODD33]. There we included a tutorial on using the RDF plugin of OpenRefine for mapping CSV files into RDF and discussed preliminary results of transforming Italian and Portuguese-Spanish data into RDF. In this document we present the results of data harmonisation of the Italian pilot in Section 2.1.1, and the results of the Portuguese-Spanish data harmonisation in Section 2.1.2. We discuss the input datasets, the models of the pilots and the vocabularies used to encode data in RDF. The latter include the SmOD model, vocabularies developed by third parties and the custom SmOD vocabulary.
We conclude the discussion of the CSV-to-RDF approach by presenting Grafterizer, a tool that performs transformations on tabular data. Section 2.1.4 demonstrates how to use the tool on data from the Italian pilot and compares the Grafterizer features with the RDF plugin of OpenRefine.

XML-to-RDF was also introduced in the previous deliverable D3.3. In Section 2.2.1 of the current report we discuss customisation of the GeoKnow XSL transformations to use the SmOD model as the target schema, with the support of the Open Data Node platform and elements of the COMSODE methodology framework [9], in the setting of the Slovak pilot.

In Section 2.3 we explain the Relational-to-RDF approach followed in the Czech pilot. Section 3 discusses the application of the RDF Data Cube vocabulary to harmonise environmental observations and measurements. Section 4 concludes the report.

[7] http://www.w3.org/2015/03/inspire/
[8] http://opendatanode.org/
[9] http://www.comsode.eu/wp-content/uploads/D5.1-Methodology_for_publishing_datasets_as_open_data.pdf

Namespaces used in the report:

Schema | Prefix | Namespace
SmOD Protected Sites | ps | http://www.w3.org/2015/03/inspire/ps#
SmOD Administrative Units | au | http://www.w3.org/2015/03/inspire/au#
SmOD Environmental Monitoring Facility | ef | http://www.w3.org/2015/03/inspire/ef#
SmOD Custom Vocabulary | smod | http://www.w3.org/2015/03/inspire/smod#
SmOD Cadastral Parcels Vocabulary | cp | http://www.w3.org/2015/03/inspire/cp#
SKOS | skos | http://www.w3.org/2004/02/skos/core#
Friend of a Friend | foaf | http://xmlns.com/foaf/0.1/
DC Terms | dcterms | http://purl.org/dc/terms/
GeoSPARQL | gsp | http://www.opengis.net/ont/geosparql#
DBpedia Ontology | dbpedia-owl | http://dbpedia.org/ontology/
RDF Data Cube Vocabulary | qb | http://purl.org/linked-data/cube#
Time Ontology | time | http://www.w3.org/2006/time#
QUDT Units | qudt-unit | http://qudt.org/1.1/vocab/unit#
QUDT Schema | qudt | http://qudt.org/schema/qudt#
RDF Schema | rdfs | http://www.w3.org/2000/01/rdf-schema#
Corine Land Cover Nomenclature in SKOS | clc | http://www.w3.org/2015/03/corine#
Asset Description Metadata Schema (ADMS) | adms | http://www.w3.org/ns/adms#
RAMON schema | ramon | http://rdfdata.eionet.europa.eu/ramon/ontology/

2 Data Harmonisation

2.1 CSV-to-RDF
OpenRefine together with its RDF plugin was selected to map and convert the data of the Italian and Portuguese-Spanish pilots into RDF. The main motivation for this choice was the pilots’ input conditions: the input datasets (CSV files or XLS spreadsheets) were extracted from several independent data sources and added to the pilot in the course of the work. The Portuguese-Spanish pilot aggregates input data from multiple sources of the Spanish and Portuguese public bodies. The input datasets of the Italian pilot currently include data from different public databases, and it is planned to include more data from other data sources.

We presented the workflow of OpenRefine-based data harmonisation in D3.3 [SMODD33]. Figure 1 illustrates the processes and tools involved in the workflow.

Figure 1: Workflow of OpenRefine-based data harmonisation

Data Pre-processing
In both pilots there was a need to prepare the input datasets before mapping and transforming them to RDF. In the case of the Italian pilot, the functionalities of OpenRefine were sufficient to do this. In some cases of the Portuguese-Spanish pilot they were not, and ad-hoc bash scripts were applied to the input datasets before loading them into OpenRefine.
Mappings Creation
RDF mappings were created using the GUI of the RDF plugin for OpenRefine. All the RDF mappings are available in the corresponding projects of the OpenRefine instance, which was deployed for the project at https://smod-refine.spaziodati.eu/. To access the instance, use the following username/password as credentials: smod/EnterSmartOpenData

Using OpenRefine for transforming the pilots’ data posed several challenges. We discuss them and present our solutions in Annex A.

Data Transformation
OpenRefine was designed primarily as a personal desktop application and is meant to be used interactively. Within the scope of another EU FP7 project, Fusepool [10], a batch version of OpenRefine was developed. The APIs of the BatchRefine [11] transformer enable programmatic access to the OpenRefine engine, which makes it possible to incorporate BatchRefine into an automatic Extract-Transform-Load procedure. In D3.3, Section 5.1 “BatchRefine Example using cURL”, we demonstrated the usage of the BatchRefine API. At the current stage of the pilots we performed all the transformations using the “RDF as RDF/XML” export functionality of OpenRefine.

In the rest of this section we discuss in detail the data harmonisation processes carried out in the Italian pilot (Section 2.1.1) and the Portuguese-Spanish pilot (Section 2.1.2). We introduce the input datasets and discuss the RDF modelling and the vocabularies used to generate the RDF representation of the pilots’ data. In Annex A we report on our experience of using the RDF plugin of OpenRefine. We describe several cases in which the RDF generation task was not trivial, and present our solutions.

2.1.1 Italian pilot
The Italian pilot is led by ARPA, the Environmental Protection Agency of the Sicilian Region.
Following the pilot’s objectives, ARPA identified several user queries that underlie the baseline use case scenario of the pilot [12]. These queries guided the process of selecting input datasets, as well as the process of creating RDF models of them. In the current document we present one of these queries which, at the moment of writing this report, was fully implemented:
● Which rivers and lakes (upstream, within or crossing, and downstream) are linked to the environment of a Protected Site?

[10] http://fusepoolp3.github.io/
[11] https://github.com/fusepoolP3/p3-batchrefine
[12] Refer to [SMODD52] for more information about the pilot’s objectives

Input Datasets

Natura2000 database
The Natura2000 database [13] is maintained by the European Environmental Agency (EEA) [14]. It accumulates information about protected sites from all EU members. The database is publicly available for download in the form of an MS Access database dump or an archive with CSV files. For the pilot we used the fifth release of the database, published in June 2014, which reflects the situation of the protected sites in the European Union up to and including 2013.
The database consists of multiple tables. For the pilot’s needs we were interested in the NATURA2000SITES table, which lists and describes protected areas.

Waterbase - Lakes, Waterbase - Rivers
The EEA Waterbase - Lakes [15] and Rivers [16] databases contain information about monitoring stations of lakes and rivers and measurements of water quality. ARPA extracted from these databases the data relevant for the pilot, including data about monitoring stations in Sicilian lakes and rivers and the concentrations of hazardous substances in the water measured by them. ARPA added geographical coordinates to some stations that were missing them.
RDF Modelling
Figures 2-4 illustrate the RDF models of Protected Sites, Monitoring Stations and Hazardous Substances respectively. Table 2 below summarises the classes of the models and gives examples of their instances. We encoded measurements of hazardous substances in RDF using the RDF Data Cube vocabulary (discussed in detail in Section 3).

Figure 2: RDF model of Protected Sites
Figure 3: RDF model of Monitoring Stations
Figure 4: RDF model of Hazardous Substances

[13] http://www.eea.europa.eu/data-and-maps/data/natura-5
[14] http://www.eea.europa.eu/
[15] http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10
[16] http://www.eea.europa.eu/data-and-maps/data/waterbase-rivers-10

class | description | URI construction | URI example

Classes of Protected Sites, baseURI = <http://data.smartopendata.eu/Natura2000/>
ps:ProtectedSite | protected sites instances | baseURI/so/ProtectedSite/<SITECODE> | <http://data.smartopendata.eu/Natura2000/so/ProtectedSite/IT3110002>
foaf:Document | instances of legal foundation documents | baseURI/Document/<SITECODE> | <http://data.smartopendata.eu/Natura2000/Document/IT3110002>
gsp:Geometry | geometries of protected sites | baseURI/Geometry/<SITECODE> | <http://data.smartopendata.eu/Natura2000/Geometry/IT3110002>
au:AdministrativeUnit | administrative units of protected sites | baseURI/so/AdministrativeUnit/IT | <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT>

Classes of Monitoring Stations
baseURI of the Waterbase Lakes dataset = <http://data.smartopendata.eu/WaterbaseLakes/>
baseURI of the Waterbase Rivers dataset = <http://data.smartopendata.eu/WaterbaseRivers/>
ef:EnvironmentalMonitoringFacility | instances of lakes and rivers monitoring stations | baseURI/so/Station/<NationalStationID> | <http://data.smartopendata.eu/WaterbaseLakes/so/Station/IT19LW09318>
gsp:Geometry | geometries of stations | baseURI/Geometry/<NationalStationID> | <http://data.smartopendata.eu/WaterbaseLakes/Geometry/IT19LW09318>
au:AdministrativeUnit | administrative units of the stations | baseURI/so/AdministrativeUnit/<CountryCode> | <http://data.smartopendata.eu/WaterbaseLakes/so/AdministrativeUnit/IT>

Classes of Hazardous Substances
baseURI of the Waterbase Lakes dataset = <http://data.smartopendata.eu/WaterbaseLakes/>
baseURI of the Waterbase Rivers dataset = <http://data.smartopendata.eu/WaterbaseRivers/>
qb:Observation | instances of measurements of hazardous substances | baseURI/HazardousSubstances/Observation/<rowIndex> | <http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/Observation/0>
qb:DataSet | instances of the input datasets with hazardous substances | - | <http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/Dataset/>
smod:Determinand | chemical compounds (determinands) defined in the Water Framework Directive | http://data.smartopendata.eu/WFD/Determinand/<CASNumber> | <http://data.smartopendata.eu/WFD/Determinand/71-55-6>
time:Interval | time period (year) for which the values of the measurements were aggregated | http://reference.data.gov.uk/id/gregorian-interval/<Year>+"-01-01T00:00:00/P1Y" | <http://reference.data.gov.uk/id/gregorian-interval/2013-01-01T00:00:00/P1Y>
qudt:Unit | units of measurements | http://data.smartopendata.eu/WFD/UnitOfMeasure/<unit_id> | <http://data.smartopendata.eu/WFD/UnitOfMeasure/9>

Table 2: Italian Pilot: summary of classes

External Vocabularies
In this section we summarise external vocabularies used in the pilot.
We focus on the vocabularies which are not included in the final SmOD model.

DC Terms and DBpedia OWL for Administrative Units of Protected Sites
Protected sites (see Fig. 2) are linked to the administrative units they belong to via the property dcterms:coverage. Administrative units are described in NATURA2000 through the Nomenclature of Territorial Units for Statistics (NUTS) country code, a two-letter code referencing the country, e.g., “IT” for Italy [17]. The SmOD model suggests using au:country with values taken from the Metadata Registry (MDR). We constructed the MDR URIs for the Sicilian sites. For example, below is an excerpt describing one of the sites:

    <http://data.smartopendata.eu/Natura2000/so/ProtectedSite/ITA070005> a ps:ProtectedSite .
    <http://data.smartopendata.eu/Natura2000/so/ProtectedSite/ITA070005>
        dcterms:coverage <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT> .
    <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT> a au:AdministrativeUnit ;
        au:nationalLevel <http://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/1stOrder/> ;
        au:country <http://publications.europa.eu/resource/authority/country/ITA> .

In addition to this definition, we kept the textual representation of the country codes, using the DBpedia ontology property dbpedia-owl:nutsCode, as shown in the listing below:

    <http://data.smartopendata.eu/Natura2000/so/AdministrativeUnit/IT> dbpedia-owl:nutsCode "IT" .

This was done mainly because the MDR URIs are currently not resolvable, so technically we could not obtain descriptions of the countries through these URIs. Moreover, inspection of the SKOS description of the URI of Italy [18] revealed that there is no mapping from the MDR country codes to the NUTS codes, which would be useful to have in the pilot’s case.
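The construction of these URIs can be sketched in a few lines of Python. This is only an illustration, not the OpenRefine mapping used in the pilot: the function names and the one-entry country lookup are hypothetical, the DBpedia ontology namespace is assumed to be http://dbpedia.org/ontology/, and only the URI patterns and property names are taken from the excerpt and Table 2.

```python
# Illustrative sketch only: helper names and the lookup table are hypothetical.
NATURA_BASE = "http://data.smartopendata.eu/Natura2000/"
MDR_COUNTRY = "http://publications.europa.eu/resource/authority/country/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PS = "http://www.w3.org/2015/03/inspire/ps#"
AU = "http://www.w3.org/2015/03/inspire/au#"
DCTERMS = "http://purl.org/dc/terms/"
DBO = "http://dbpedia.org/ontology/"  # assumed namespace for dbpedia-owl
FIRST_ORDER = "http://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/1stOrder/"

# Assumption: a hand-made two-letter -> three-letter lookup with one entry;
# a real pipeline would load a complete country code table.
ALPHA2_TO_ALPHA3 = {"IT": "ITA"}


def site_triples(site_code, country_code):
    """Build the triples of the excerpt for one site and its country."""
    site = f"<{NATURA_BASE}so/ProtectedSite/{site_code}>"
    unit = f"<{NATURA_BASE}so/AdministrativeUnit/{country_code}>"
    country = f"<{MDR_COUNTRY}{ALPHA2_TO_ALPHA3[country_code]}>"
    return [
        (site, f"<{RDF_TYPE}>", f"<{PS}ProtectedSite>"),
        (site, f"<{DCTERMS}coverage>", unit),
        (unit, f"<{RDF_TYPE}>", f"<{AU}AdministrativeUnit>"),
        (unit, f"<{AU}nationalLevel>", f"<{FIRST_ORDER}>"),
        (unit, f"<{AU}country>", country),
        # Keep the two-letter code as a plain literal next to the MDR URI.
        (unit, f"<{DBO}nutsCode>", f'"{country_code}"'),
    ]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(site_triples("ITA070005", "IT")))
```

Keeping the literal code next to the MDR URI means a consumer can still filter by country even while the MDR URIs do not resolve.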
[17] NUTS country codes are identical to the ISO 3166-1 alpha-2 codes, while MDR makes use of the ISO 3166-1 alpha-3 codes; see http://www.iso.org/iso/home/standards/country_codes.htm
[18] The SKOS document describing all countries can be downloaded from http://publications.europa.eu/mdr/resource/authority/country/skos/countries-skos.rdf

Custom terms
In addition to the external vocabularies developed by third parties, we introduced new terms that were implemented in the custom SmOD vocabulary http://www.w3.org/2015/03/inspire/smod#.

Term | rdfs:comment
smod:areaHa | This property specifies the area of the Protected Site in Ha.
smod:lengthKm | This property specifies the length of the Protected Site in km.
smod:ecologicalQuality | This property provides description of the Protected Site in terms of ecological quality.
smod:catchmentName | This property specifies the name of major catchment or basin.
smod:featureName | This property specifies the name of the feature of interest being monitored by the Environmental Facility.
smod:Determinand | This class represents the class of nutrients, organic matter, hazardous substances and other chemical determinands reported in the Waterbase data of the European Environmental Agency.

Data Pre-Processing
The RDF models presented in the section above also illustrate how the values of certain properties were populated with physical data. For example, in the RDF model of Protected Sites, the value of rdfs:label is populated with the value of the column <SITENAME>. In several cases, populating the property values was not straightforward, and additional pre-processing steps were required. In this section we discuss some typical examples.

Implementing domain logic
It is typical for a property value to be populated from more than one column of the input dataset, following some domain logic.
For example, in the case of protected sites, the value of ps:legalFoundationDate was populated from three columns: <DATE_SAC>, <DATE_CONF_SCI> and <DATE_SPA>. <DATE_CONF_SCI> and <DATE_SPA> are the dates when a site was designated as a Site of Community Importance (SCI) and as a Special Protection Area (SPA), respectively. The site designation is found in the column <SITETYPE> and may contain one of three values:
● “A”: the site was designated as SPA
● “B”: the site was designated as SCI
● “C”: the site was designated as both SCI and SPA

In addition, the European Commission can assign the status of Special Area of Conservation (SAC) to each site. If this happens, the column <DATE_SAC> is populated. Following advice from the domain experts of ARPA, a rule was implemented in OpenRefine to take the value for ps:legalFoundationDate from <DATE_SAC> whenever it is present, and otherwise the later of the dates in <DATE_CONF_SCI> and <DATE_SPA>.

Cleaning/Formatting Values

Very typical examples of data preparation are data formatting and data cleaning. For example, the values of ps:legalFoundationDate were converted from the date-time format to date, using a simple OpenRefine rule.

Sampling Input Dataset

It is often the case that we want to generate RDF out of a subset of the input dataset. With OpenRefine this can be done in numerous ways. For example, the NATURA2000 table contains protected sites from all the countries of the European Union; however, for the pilot we were interested only in the Sicilian sites. To reduce the input dataset, a text facet was created in OpenRefine on the column <SITECODE> that outputs “1” if <SITECODE> contains the code of one of the Sicilian sites (these values were provided by ARPA), and “0” otherwise.
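The three pre-processing rules described above (the date selection logic, the date-time to date conversion, and the Sicilian-site flag) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the pilot implemented these rules in OpenRefine, the function names are hypothetical, and the input dates are assumed to be ISO 8601 strings, which may differ from the actual NATURA2000 export.

```python
from datetime import date, datetime
from typing import Optional, Set


def parse_date(value: Optional[str]) -> Optional[date]:
    """Parse an ISO date or date-time string and keep only the date part."""
    if not value:
        return None
    return datetime.fromisoformat(value).date()


def legal_foundation_date(date_sac: Optional[str],
                          date_conf_sci: Optional[str],
                          date_spa: Optional[str]) -> Optional[date]:
    """Prefer DATE_SAC; otherwise take the later of DATE_CONF_SCI and DATE_SPA."""
    sac = parse_date(date_sac)
    if sac is not None:
        return sac
    candidates = [d for d in (parse_date(date_conf_sci), parse_date(date_spa))
                  if d is not None]
    return max(candidates) if candidates else None


def sicilian_flag(site_code: str, sicilian_codes: Set[str]) -> str:
    """Mimic the text facet: "1" for a Sicilian site code, "0" otherwise."""
    return "1" if site_code in sicilian_codes else "0"
```

For a site of type “C” with no SAC date, the function returns whichever of the SCI and SPA dates is later; a site with all three columns empty yields no value, so no ps:legalFoundationDate triple would be generated for it.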
Joining Datasets

Another interesting example of exploiting OpenRefine functionality for data preparation is joining one dataset with another in order to retrieve more data. For example, the dataset with hazardous substances contains units of measurement (UoM) in the column <Unit_HazSubs>. The values of the column are names of units, such as “μg/l”, whereas the target RDF model of hazardous substances requires the values of sdmx-attribute:unitMeasure to be URIs. The URI of “μg/l” is <http://data.smartopendata.eu/WFD/UnitOfMeasure/9>, in which “9” is the row index of “μg/l” in a dataset of UoMs [19] that resides in another OpenRefine project [20]. Hence, in order to generate the same UoM URIs in the project with hazardous substances, we need to join this dataset with the dataset of UoMs [21] and retrieve the row indexes of the latter. This kind of join is also supported by OpenRefine [22].

[19] http://dd.eionet.europa.eu/dataelements/48239
[20] The project called “ARPA-haz-substances-UoM” is available at https://smod-refine.spaziodati.eu/
[21] The join is done by the UoM name that is found in the column <Unit_HazSubs> of the source dataset and <Value> in the target
[22] See the documentation of the join rule: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname

RDF Generation

The size of the complete RDF dataset (including data structure definitions and concept scheme) of the Italian pilot is 2.1 MB; 14,098 triples in total:
● 223 instances of ps:ProtectedSite
● 38 instances of ef:EnvironmentalMonitoringFacility, 14 of which are lake monitoring stations and 24 are river stations
● 906 instances of qb:Observation, 205 of which are measurements of hazardous substances in rivers and 701 in lakes

Table 3 summarises the results of data harmonisation
in the Italian pilot. RDF mappings are available in the OpenRefine projects of each input dataset. The resulting RDF can be downloaded from the given links; alternatively, the data can be queried via the SPARQL endpoint http://smodlumii.sungis.lv/sparql

For each input dataset, the table lists the OpenRefine project's name on https://smod-refine.spaziodati.eu and the resulting RDF (SPARQL endpoint: http://smodlumii.sungis.lv/sparql).

Input dataset: Natura2000 database - http://www.eea.europa.eu/data-and-maps/data/natura-5, table NATURA2000SITES
● OpenRefine project: “ARPA-NATURA2000SITES-PLUS”
  RDF: RDF dump [23]; graph: <http://data.smartopendata.eu/natura2000/sicily>

Input dataset: EEA Waterbase - Lakes - http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10, ARPA extraction (enriched with coordinates) [24], sheets “StationsLakes” and “HazSubstLakes_Agg”
● OpenRefine project: “ARPA-Lakes_dati2013_caricati2014”
  RDF: RDF dump [25]; graph: <http://data.smartopendata.eu/waterbase-lakes/stations/sicily>
● OpenRefine project: “ARPA-Lakes_dati2013_caricati2014-HazSubs”
  RDF: RDF dump [26]; graph: <http://data.smartopendata.eu/waterbase-lakes/haz-substances/sicily>

Input dataset: EEA Waterbase - Rivers - http://www.eea.europa.eu/data-and-maps/data/waterbase-rivers-10, ARPA extraction [27], sheets “StationsRivers” and “HazSubstRivers_Agg”
● OpenRefine project: “ARPA-Rivers_dati2013_caricati2014”
  RDF: RDF dump [28]; graph: <http://data.smartopendata.eu/waterbase-rivers/stations/sicily>
● OpenRefine project: “ARPA-Rivers_dati2013_caricati2014-HazSubs”
  RDF: RDF dump [29]; graph: <http://data.smartopendata.eu/waterbase-rivers/haz-substances/sicily>

Input dataset: EEA Unit of measurement of Hazardous Substances - http://dd.eionet.europa.eu/dataelements/48239
● OpenRefine project: “ARPA-haz-substances-UoM”
  RDF: RDF dump [30]; graph: <http://data.smartopendata.eu/WFD/haz-substances/uom>

Input dataset: EEA code list of determinands - http://dd.eionet.europa.eu/datasets/latest/Groundwater/tables/HazSubstGW_Disagg/elements/DeterminandCode
● OpenRefine project: “ARPA-WFD-determinand”
  RDF: RDF dump [31]; graph: <http://data.smartopendata.eu/WFD/haz-substances/determinands>

Table 3: Italian Pilot: summary of data harmonisation

[23] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/ARPA-NATURA2000SITES-PLUS.rdf.zip
[24] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014+Rev1.xlsx.zip
[25] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014.rdf.zip
[26] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014-HazSubs.rdf.zip
[27] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Rivers_2013_19_12_2014_Rev1.xlsx.zip
[28] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Lakes_dati2013_caricati2014.rdf.zip
[29] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/Rivers_dati2013_caricati2014-HazSubs.rdf.zip

Future Outlook

The Italian pilot is being actively developed, and more user queries are to be addressed in future work, for example:
● Which protected sites or areas of a protected site are more or less subject to pollution?
● Which human activities in the protected site can lead to pollution of water and/or lakes (within and/or downstream)?

This will require adding more input data sources, such as those defining “pollution” in terms of the concentration of hazardous substances. For example, what is the acceptable value of the benzene concentration? When is it considered to be water pollution? As for the second user query, a description of “human activities” needs to be added. New models will be developed to include the new data sources. This in turn will affect the SmOD model (and vocabularies), which at the moment include neither pollution nor human activities definitions.

2.1.2 Portuguese-Spanish Pilot

The Portuguese-Spanish pilot is led by Empresa de Transformacion Agraria SA (TRAGSA). Besides TRAGSA, the Portuguese partner, Direção Geral do Território, participates in the pilot as domain expert and data provider.
A set of user queries of the pilot guided the process of data harmonisation: from choosing input datasets to conceptual modelling of the domain, to designing RDF models and extending SmOD vocabularies with domain-specific terms. Below we present a few user queries for demonstration purposes [32]:
● What’s the land use and land cover (LULC) of my field units in Zêzere Watershed in the year x?
● Which land use/cover changes occurred in my field unit?
● What environmental factors can be relevant in this field unit considering the occurred LULC?

[30] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/ARPA-haz-substances-UoM.rdf.zip
[31] https://s3-eu-west-1.amazonaws.com/smod-repo/arpa-release2/ARPA-lakes-haz-substances-determinand.rdf.zip
[32] Refer to [SMODD52] for more details on the pilot

Input Datasets

Input datasets accumulated by TRAGSA from different data sources include one denormalized table with relationships between various concepts of the domain (pd_0604_workunion.wkt) and multiple auxiliary tables that provide definitions of the concepts of the domain. Data is available from the FTP server of TRAGSA.

RDF Modelling/Custom Terms

In order to develop the RDF models of the pilot we followed the methodology presented in D3.3 [SMODD33]. Figure 5 schematically illustrates the methodology.

Figure 5: Portuguese-Spanish Pilot, data harmonisation methodology

As explained in D3.3, the input datasets of the pilot lack proper documentation of the schema design, and domain analysis and modelling were needed prior to harmonising the pilot’s data. Together with TRAGSA and SINTEF we performed domain analysis and developed conceptual models using the Object-Role Modelling (ORM) techniques [HM08]. D3.3 presents the ORM models of the first release of the pilot.
Our initial intention was to follow an iterative approach to the pilot’s development and produce backward-compatible models in each subsequent release of the pilot. That meant that in every new iteration we were to augment the existing models with more concepts and relationships, but not to modify the existing ones. In practice, that approach worked for the second release of the pilot, but failed with the third release, in which the models of the previous releases were reconsidered and modified.

We published the ORM models of all the releases at http://smod-fp7.github.io/ together with their documentation. In the current document we include the ORM models of the latest (third) release of the pilot in Annex B, among which are:
● Chemical Characteristics [33] - the model of chemical characteristics of soil
● Climatology [34] - the model of climatology measurements
● Forestry Tile [35] - the model of forestry maps and plant species
● Geometry [36] - the model of geometries
● Work Unit Ecosystem [37] - the model of animal species supported by observatory tiles
● Work Unit Location [38] - the model of topological relations between spatial objects

The ORM models served as input to the task of RDF data modelling. Following the set of conversion rules presented in D3.3, we transferred the ORM models to RDF Schema. In Annex B we include all the resulting RDF models. Next in this section we go through the conversion rules and summarise the result of their application to the ORM models of the third release of the pilot.

Mapping Object Types and Value Types to Classes

In Table 4 we present ORM constructs - object types and value types - that were mapped to classes.
baseURI = <http://data.smartopendata.eu/sp-pt-pilot/>

ORM construct | Class | URI construction | URI example
Work Unit | smod:WorkUnit | baseURI/so/WorkUnit/<idWorkUnit> | <http://data.smartopendata.eu/sp-pt-pilot/so/WorkUnit/ES1110100070100100001001>
Soil | smod:Soil | baseURI/so/Soil/<idLitholo> | <http://data.smartopendata.eu/sp-pt-pilot/Soil/57>
Forestry Tile | smod:ForestryTile | baseURI/so/ForestryTile/<idForestry> | <http://data.smartopendata.eu/sp-pt-pilot/so/ForestryTile/100001-MFE25>
Plant Species | smod:PlantSpecies | baseURI/PlantSpecies/<codeSP1> | <http://data.smartopendata.eu/sp-pt-pilot/PlantSpecies/Pinsyl>
Local number | adms:Identifier | baseURI/Identifier/<idWorkUnit> | <http://data.smartopendata.eu/Identifier/ES1110100070100100001001>
Protected Site | ps:ProtectedSite | baseURI/ProtectedSite/ |
Parcel | smod:Parcel | baseURI/Parcel/<idParcel> | <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1110100070100100001>
Neighbourhood, Municipality, District, NUTS3, NUTS2 | au:AdministrativeUnit | baseURI/<AdministrativeUnit>/<idAdministrativeUnit> | <http://data.smartopendata.eu/sp-pt-pilot/so/Neighbourhood/ES11101000701>
Observatory Tile | smod:ObservatoryTile | baseURI/so/ObservatoryTile/<idLandSp> | <http://data.smartopendata.eu/sp-pt-pilot/so/ObservatoryTile/29TNH15>
Animal Species | smod:AnimalSpecies | baseURI/AnimalSpecies/<code> | <http://data.smartopendata.eu/sp-pt-pilot/AnimalSpecies/Alaarv>
Geometry | gsp:Geometry | baseURI/Geometry/<idWorkUnit> | <http://data.smartopendata.eu/sp-pt-pilot/Geometry/ES1110100070100100001001>
Corine Land Cover | skos:Concept | http://www.w3.org/2015/03/corine# + <code> | <http://www.w3.org/2015/03/corine#242>
- | qb:Observation | baseURI/<ClimatologyMeasurement/Observation/idClimatologyMeasurement> | <http://data.smartopendata.eu/sp-pt-pilot/AnnualHumidityLevel/Observation/65>
- | qb:DataSet | - | <http://data.smartopendata.eu/sp-pt-pilot/WorkUnitClimatology/Dataset/>

Table 4: Portuguese-Spanish Pilot: ORM constructs mapped to classes

[33] http://smod-fp7.github.io/tragsa3/diagrams/ChemicalCharacteristics.png
[34] http://smod-fp7.github.io/tragsa3/diagrams/Climatology.png
[35] http://smod-fp7.github.io/tragsa3/diagrams/ForestryTile.png
[36] http://smod-fp7.github.io/tragsa3/diagrams/Geometry.png
[37] http://smod-fp7.github.io/tragsa3/diagrams/WorkUnitEcosystem.png
[38] http://smod-fp7.github.io/tragsa3/diagrams/WorkUnitLocation.png

Mapping Associations and Value Types to Properties

In Table 5 we present ORM constructs - associations, value types and object types - that were mapped to rdf:Property.

ORM Construct | rdf:Property | rdfs:domain | rdfs:range

Chemical Characteristics
(Work Unit) has (Soil) | smod:hasSoil | smod:WorkUnit | smod:Soil
(Soil) has (Acidity) | smod:soilAcidity | smod:Soil | rdfs:Literal
(Soil) has (Permeability) + (Permeability) has Permeability Rate | smod:soilPermeabilityRate | smod:Soil | rdfs:Literal

Geometry
(Work Unit) has (Geometry); (Parcel) has (Geometry) | gsp:hasGeometry | gsp:SpatialObject | gsp:SpatialObject
(Polygon) has (Surface) | smod:areaHa | gsp:SpatialObject | rdfs:Literal
(Polygon) has (Perimeter) | smod:lengthKm | gsp:SpatialObject | rdfs:Literal

Work Unit Ecosystem
(Observatory Tile) supports (Animal Species) | smod:supports | smod:ObservatoryTile | smod:AnimalSpecies
(Animal Species) has Conservation Status | smod:iucnConservationStatusCode | smod:AnimalSpecies | rdfs:Literal

Work Unit Location
(Work Unit) intersects (Protected Site) | gsp:sfIntersects | gsp:SpatialObject | gsp:SpatialObject
(Work Unit) is located in (Forestry Tile); (Work Unit) is located in (Observatory Tile); (Work Unit) is located in (Neighbourhood); (Work Unit) is located in (Parcel); (Neighbourhood) is located in (Municipality); (Municipality) is located in (District); (District) is located in (NUTS3); (NUTS3) is located in (NUTS2) | gsp:sfWithin | gsp:SpatialObject | gsp:SpatialObject
(Neighbourhood) has Name; (Municipality) has Name; (District) has Name; (NUTS3) has Name; (NUTS2) has Name | ramon:name | ramon:Region | rdfs:Literal

Table 5: Portuguese-Spanish Pilot: ORM constructs mapped to properties

Mapping Objectified Associations

In the first release of the pilot we had one objectified association [39], “ForestryTileHasPlantSpecies”. This association between Forestry Tile and Plant Species allows specifying, for every plant species of a forestry tile, the representativity level of the plant species (primary, secondary or tertiary) and its density.

[39] Objectified associations in ORM allow expressing additional qualifying information on the relationship between two entities.

In D3.3 (Section 3.3.2) we discussed different approaches to expressing objectified associations in RDF: from RDF reification to introducing custom properties. The latter approach was chosen to represent “ForestryTileHasPlantSpecies” in RDF. We took into account the fact that no more representative levels were to be added to the data, and opted for a less verbose and simpler way of encoding and querying the data, as opposed to RDF reification. As a result, we introduced 6 properties:

Objectified Association | rdf:Property | rdfs:domain | rdfs:range
(ForestryTileHasPlantSpecies) has (Representative Level) | smod:hasPrimaryPlantSpecies, smod:hasSecondaryPlantSpecies, smod:hasTertiaryPlantSpecies | smod:ForestryTile | smod:PlantSpecies
(ForestryTileHasPlantSpecies) has (Density) | smod:primaryPlantSpeciesDensity, smod:secondaryPlantSpeciesDensity, smod:tertiaryPlantSpeciesDensity | smod:ForestryTile | smod:PlantSpecies

In the third release of the pilot one more objectified association was added that links Work Unit and Corine Land Cover: “WorkUnitHasCorineLandcover”.
For every work unit, this association allows specifying the Corine Land Cover code in three years: 1990, 2000 and 2006. When choosing an RDF model for “WorkUnitHasCorineLandcover”, we followed similar logic as for “ForestryTileHasPlantSpecies”, and introduced the following three properties:

Objectified Association | rdf:Property | rdfs:domain | rdfs:range
(WorkUnitHasCorineLandCover) in (Year) | smod:corineLandCover1990, smod:corineLandCover2000, smod:corineLandCover2006 | gsp:SpatialObject | skos:Concept

We chose this design solution because the temporal aspect of the Corine Land Cover codes serves an informative purpose only; the values are not meant to be combined with other data sources.

External Vocabularies

NUTS-RDF and the RAMON Ontology for Administrative Regions

To locate an administrative unit in the pilot, topological relations between work units and administrative units are used. To encode instances of the NUTS regions, we re-used the NUTS classification vocabulary published as Linked Data at http://nuts.geovocab.org/. For example, the id of the "Baixo Mondego" sub-region of Portugal is http://nuts.geovocab.org/id/PT162.html. Below is the definition of the sub-region from the NUTS Linked Data set:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ramon: <http://rdfdata.eionet.europa.eu/ramon/ontology/> .
@prefix ngeo: <http://geovocab.org/geometry#> .
@prefix spatial: <http://geovocab.org/spatial#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

nuts:PT162 rdf:type ramon:NUTSRegion, spatial:Feature .
nuts:PT162 rdfs:label "PT162 - Baixo Mondego" .
nuts:PT162 ramon:name "Baixo Mondego" .
nuts:PT162 ramon:level "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
nuts:PT162 ramon:code "PT162" .
nuts:PT162 ngeo:geometry nuts:PT162_geometry .
nuts:PT162 spatial:PP nuts:PT16 .
nuts:PT162 owl:sameAs <http://rdfdata.eionet.europa.eu/ramon/nuts2008/PT162> .
nuts:PT162 owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/PT162> .
nuts:PT162 owl:sameAs <http://estatwrap.ontologycentral.com/dic/geo#PT162> .
nuts:PT162 owl:sameAs <http://nuts.psi.enakting.org/id/PT162> .

Having URIs from this NUTS Linked Data set allows us to re-use the definitions of the NUTS regions (i.e., their names, levels, codes) and the topological relations between them. In the definition above, the triple nuts:PT162 spatial:PP nuts:PT16 tells us that the "Baixo Mondego" sub-region is contained in the “Centro” region (http://nuts.geovocab.org/id/PT16.html).

Neighbourhoods, Municipalities and Districts are units in the local administrative divisions of Spain and Portugal. To represent them in RDF, we re-used the Administrative Units vocabulary [40] and the RAMON Ontology http://rdfdata.eionet.europa.eu/ramon/ontology/. For example, below is the definition of the “Coimbra” district in Portugal, which is contained in the "Baixo Mondego" sub-region (http://nuts.geovocab.org/id/PT162.html):

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ramon: <http://rdfdata.eionet.europa.eu/ramon/ontology/> .
@prefix au: <http://www.w3.org/2015/03/inspire/au#> .
@prefix gsp: <http://www.opengis.net/ont/geosparql#> .

<http://data.smartopendata.eu/sp-pt-pilot/so/District/PT16211> gsp:sfWithin <http://nuts.geovocab.org/id/PT162> .
<http://data.smartopendata.eu/sp-pt-pilot/so/District/PT16211> a au:AdministrativeUnit , ramon:LAURegion ;
    ramon:name "Coimbra" ;
    ramon:level "2"^^xsd:int ;
    au:nationalLevel <http://inspire.ec.europa.eu/codelist/AdministrativeHierarchyLevel/4thOrder/> ;
    au:country <http://publications.europa.eu/resource/authority/country/PRT> ;
    au:nationalCode "PT16211" .
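To illustrate how a local administrative unit such as the district above can be tied to the NUTS Linked Data set, the following Python sketch builds the pilot URI and the gsp:sfWithin link from a few input values. It is an illustration only: the pilot generated its RDF with OpenRefine, and the function names and parameters here are hypothetical.

```python
BASE_URI = "http://data.smartopendata.eu/sp-pt-pilot/"
NUTS_BASE = "http://nuts.geovocab.org/id/"
MDR_COUNTRY_BASE = "http://publications.europa.eu/resource/authority/country/"


def spatial_object_uri(kind: str, identifier: str) -> str:
    """Build a pilot URI of the form baseURI/so/<kind>/<identifier>."""
    return f"{BASE_URI}so/{kind}/{identifier}"


def district_turtle(code: str, name: str, nuts3_code: str,
                    country_alpha3: str) -> str:
    """Render a district description as Turtle (prefix declarations omitted).

    Note: `name` is inserted as-is; a real implementation would need to
    escape quotes and backslashes in literals.
    """
    subject = f"<{spatial_object_uri('District', code)}>"
    return "\n".join([
        # Topological link to the re-used NUTS region URI.
        f"{subject} gsp:sfWithin <{NUTS_BASE}{nuts3_code}> .",
        f"{subject} a au:AdministrativeUnit , ramon:LAURegion ;",
        f'    ramon:name "{name}" ;',
        f"    au:country <{MDR_COUNTRY_BASE}{country_alpha3}> ;",
        f'    au:nationalCode "{code}" .',
    ])


print(district_turtle("PT16211", "Coimbra", "PT162", "PRT"))
```

Because the object of gsp:sfWithin is a URI from the external NUTS data set, the names, levels and parent regions of the enclosing NUTS units do not have to be duplicated in the pilot's own data.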
Data Pre-Processing

The input to many target RDF models was the same file, pd_0604_workunion.wkt, which contains relationships between most of the concepts of the pilot, such as:
● all links between Work Unit and Climatology measurements
● all topological relationships between Work Unit and other spatial objects of the domain: Forestry Tile, Observatory Tile, and others.

[40] http://www.w3.org/2015/03/inspire/au#

For example, the link gsp:sfWithin between Work Unit and Forestry Tile is generated using values from two columns of the input dataset: idWorkUnit and idForestry. However, the file contains duplicate records of the same idWorkUnit-idForestry pairs. We ran the following bash commands on the input file to sort its records and remove duplicates based on the two given columns:

header="$col1;$col2"
echo "$header" > "$outputfile"
sed 1d ./pd_0604_workunion.wkt | cut -d';' -f "$coln1","$coln2" | tr -d '"' | awk -F';' 'NF==2' | sort -t';' -u >> "$outputfile"

where $coln1 is the sequential number of the first column in the dataset and $coln2 is the sequential number of the second column. The sorted records are appended (>>) so that the header line written by echo is preserved. For example, the following commands produce the input file used to generate the gsp:sfWithin relationship:

header="idWorkUnit;idForestryTile"
echo "$header" > sorted_idWorkUnit_located_idForestry.wkt
sed 1d ./pd_0604_workunion.wkt | cut -d';' -f 3,19 | tr -d '"' | awk -F';' 'NF==2' | sort -t';' -u >> sorted_idWorkUnit_located_idForestry.wkt

Input datasets after pre-processing are available.

RDF Generation

The RDF dump of the pilot is available for downloading [41]. All RDF mappings can be found in OpenRefine projects on https://smod-refine.spaziodati.eu; the names of the projects start with “TRAGSA3” and continue with the name of the input file.

Future Outlook

As future work, the RDF representation of Geometries needs to be generated.
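The de-duplication step from the Data Pre-Processing part of this pilot can also be sketched in Python, which may be convenient where a bash environment is not available. This is a sketch under assumptions, not the procedure used in the pilot: the function names are hypothetical, column positions are zero-based here (unlike the one-based positions passed to cut), and the input is assumed to be a semicolon-delimited file with a header row.

```python
import csv
from typing import Iterable, List, Tuple


def unique_column_pairs(lines: Iterable[str], col1: int, col2: int,
                        delimiter: str = ";") -> List[Tuple[str, str]]:
    """Return the header pair followed by the sorted, distinct value pairs.

    csv.reader strips the surrounding double quotes, which corresponds to
    the `tr -d '"'` step; rows too short to contain both columns are skipped.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    pairs = set()
    for row in reader:
        if len(row) > max(col1, col2):
            pairs.add((row[col1], row[col2]))
    return [(header[col1], header[col2])] + sorted(pairs)


def write_pairs(pairs: List[Tuple[str, str]], path: str,
                delimiter: str = ";") -> None:
    """Write the pairs to a delimited text file, header first."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(pairs)
```

With the column positions of the example above (3 and 19 for cut), one would open pd_0604_workunion.wkt, pass the file object to `unique_column_pairs` with `col1=2` and `col2=18`, and hand the result to `write_pairs`. Note that the header names are taken from the input file rather than supplied by hand.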
2.1.3 Irish pilot

The Irish pilot, which is led by MAC, is focused on European protected areas and their National Parks, starting with the Burren National Park in Ireland. The pilot aims to demonstrate the value of SmartOpenData in helping Researchers and Decision Makers to better manage, preserve, sustain and use this unique ecosystem. The pilot’s primary objective is to create the following sustainable services that will continue beyond the life of the project [42]:
1. SmartOpenData enabled European Tourism Indicator System (ETIS) Webservice for the Burren and European GeoParks Network.
2. SmartOpenData enabled App to Ground-Truth potential Protected Monument sites

[41] https://s3-eu-west-1.amazonaws.com/smod-repo/tragsa-release3/rdf.zip
[42] See [SMOD52] for a more in-depth discussion of the pilot and its objectives.

ETIS is a survey-based generation service used to provide real-time statistical information on the GeoPark performance in relation to the performance criteria defined by the Geopark’s Management. However, the ETIS model will not present useful links to the SmartOpenData data model [43] until the service is operational in many GeoParks, when use of a common data model will enable potential eco-tourists to benchmark, compare and contrast the progress of various sustainable destinations in achieving their objectives before deciding which to visit [44].

So it was decided to focus on the second service for now, and use the OpenRefine approach, described above and proven in the Italian pilot, to transform the Irish national heritage data, as served at http://webgis.archaeology.ie/nationalmonuments/flexviewer/, in order to capture data for the protected sites within the Burren Geopark region. The query generated was to reconcile the data retrieved with the official place names stored in the Logainm dataset.
Input Data Sets

The primary input data sets are the Irish Record of Monuments and Places (RMP), and Logainm, the official Irish Placenames. In Ireland archaeological monuments are protected under the Irish National Monuments Acts 1930 - 2004. The National Monuments Service of the Irish Government’s Department of Arts, Heritage and the Gaeltacht maintains a record of all known monuments, and this forms the Record of Monuments and Places (RMP) [45]. The aim of the ground truthing service is to provide a new crowd-sourcing way to report on and help protect such monument sites, focusing on the Burren initially.

The monuments are recorded in the Irish RMP, which is available as a series of PDF documents [46] and as CSV files [47], i.e. One Star and Three Star. The aim was to transform it into 5 Star open data [48].

Logainm provides the definitive standard authorised forms of all Irish place names in both English and Irish [49]. It has recently been made available in linked open data format as Linked Logainm, in various formats including RDF, XML and JSON [50].

RDF Graph and Table

The following summarises the Protected Monuments Sites data and its linking with the Linked Logainm:

Property | Description
dc:identifier | Unique Identifier for site
ps:siteDesignation | The classification of the Site
foaf:name | The Townland name of the site
owl:sameAs | The reconciliation link to LogAinm
geo:location | ITM Reference (E,N)
geo:lat_long | Irish Grid Reference (E,N)

[43] As discussed in [SMOD33] and [SMOD34]
[44] As discussed in D5.1 “Rationale of the Pilots”.
[45] www.archaeology.ie
[46] Available at http://www.archaeology.ie/publications-forms-legislation/record-of-monuments-and-places
[47] https://data.gov.ie/data/search?q=monuments&theme-primary=Arts
[48] As described at http://5stardata.info/en/
[49] www.logainm.ie/en
[50] www.logainm.ie/en/inf/proj-machines

Data Pre-processing

Multiple records within the dataset were found to be redundant. These records were identified using straightforward OpenRefine rules.

Joining DataSets

In order to extend the heritage dataset, OpenRefine’s RDF reconciliation tool was used. To add the Logainm RDF reconciliation service in OpenRefine, users need to navigate to ‘RDF’ > ‘Add reconciliation service’ > ‘Based on SPARQL endpoint...’, and fill in the following information:

Name: Logainm
Endpoint URL: http://data.logainm.ie/sparql
Type: Virtuoso
Label Properties: Also check ‘foaf:name’

The reconciliation was run against the “Townland” name. Once reconciliation was complete, manual manipulation was required to resolve the correct townland. Once this process was complete, the sameAs links to the Logainm URIs needed to be added to the RDF. This was accomplished by editing the RDF skeleton and associating the sameAs property with the URI column. A sample of the RDF output is shown below.
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://localhost:3333/0">
    <dc:description>Anomalous stone group</dc:description>
    <foaf:name>CARHEENYBAUN</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/19220</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">143036, 192476</location>
    <dc:identifier>GA133-003----</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/1">
    <dc:description>Anomalous stone group</dc:description>
    <foaf:name>CARHEENYBAUN</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/19220</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">142686, 192200</location>
    <dc:identifier>GA133-004----</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/2">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>BALLYMAHONY</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/5830</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">119634, 198730</location>
    <dc:identifier>CL009-014003-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/3">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>FANTA GLEBE</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/6718</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">116138, 195021</location>
    <dc:identifier>CL009-085003-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/4">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>BALLYCONNOE NORTH</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/5796</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">116818, 200467</location>
    <dc:identifier>CL009-004006-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/5">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>KILMOON WEST</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/6627</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">114875, 200000</location>
    <dc:identifier>CL008-049006-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/6">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>LISHEENEAGH</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/5808</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">116508, 203537</location>
    <dc:identifier>CL005-063004-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/7">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>KILMOON WEST</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/6627</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">115017, 200061</location>
    <dc:identifier>CL008-049007-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/8">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>CLOONEY SOUTH</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/6653</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">119259, 188007</location>
    <dc:identifier>CL016-105005-</dc:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://localhost:3333/9">
    <dc:description>Architectural fragment</dc:description>
    <foaf:name>KILFENORA</foaf:name>
    <owl:sameAs>http://data.logainm.ie/place/6720</owl:sameAs>
    <location xmlns="http://www.w3.org/2003/01/geo/wgs84_pos#">118338, 193926</location>
    <dc:identifier>CL016-171----</dc:identifier>
  </rdf:Description>

Conclusion

OpenRefine, together with the standard data.smartopendata.eu vocabularies (SmartOpenData Protected Sites, FOAF and Dublin Core)51, enabled the Monuments dataset to be transformed to RDF. This allowed the data to be mashed together with the Linked Logainm source to produce National Monument locations linked with the definitive Irish placenames of those locations. The exercise has ensured that the Logainm and National Monuments teams will collaborate more closely in the future, which should help to ensure the wider use of both datasets.

The first approach was to build on the Slovakian pilot approach and use the National Monuments dataset as transformed to the INSPIRE Protected Sites theme52. However, the transformation to RDF revealed weaknesses in the INSPIRE version of the national monuments dataset, which is why the original dataset was transformed instead. These weaknesses are now being addressed, so the exercise led to an improvement in the quality of the dataset involved.

51 See Table 2

2.1.4 Transforming Data with Grafterizer and the Jarfter Service

Grafterizer is an interactive tool for creating data transformations. It allows the user to set up so-called pipelines and RDF mappings. Pipelines are essentially scripts that consist of a number of consecutive actions applied to datasets. RDF mappings can be used to publish datasets as linked data.
The user interface provides a live preview of the transformations applied to a subset of the chosen dataset, which allows users to specify them incrementally. With Grafterizer, data transformations can be uploaded and stored. Pipelines and RDF mappings are displayed both in tabular form in a grid and in their script form, as Clojure code.

In Grafterizer, each single transformation step is defined as a pipe: a function that performs a simple data conversion on its input. These functions are then combined in such a way that the output of one pipe acts as the input of another. This way of composing operations gives great flexibility and makes it possible to perform rather complex data conversions.

What follows is a short demonstration of how the Grafterizer tool performs a transformation on tabular data. The example is taken from the ARPA data of the Italian pilot.

Figure 6: Original Sample ARPA Data

In order to see the instant preview of the created transformation on the data, one should upload the data in a raw tabular format. Next, the transformation itself is created by adding functions to a pipeline. Each time a pipeline is modified, the transformation is applied to the previewed dataset immediately, so one can see the effect of each performed step. At any stage of the transformation the modified tabular data can be exported.

52 Available at https://www.geoportal.ie/geoportal/catalog/search/resource/details.page?uuid=%7bF66DE3EBB-FC5C-4D79-A00A-BC45AB9F55F6%7d

Figure 7: RDF mapping for ARPA data

When the tabular data is in the desired format, one can start creating RDF mappings. The RDF skeleton being created is clearly visualized, showing nodes and the corresponding relations.
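The pipe composition described in this subsection can be illustrated with a short sketch. Grafterizer itself generates Clojure code; the Python below is only an analogy, and the function names (`drop_empty_rows`, `rename_column`, `compose_pipes`) and the sample rows are hypothetical, not part of Grafterizer.

```python
from functools import reduce
from typing import Callable, Dict, List

Row = Dict[str, str]
Table = List[Row]
Pipe = Callable[[Table], Table]


def drop_empty_rows(table: Table) -> Table:
    """Pipe: remove rows in which every cell is empty."""
    return [row for row in table if any(value.strip() for value in row.values())]


def rename_column(old: str, new: str) -> Pipe:
    """Build a pipe that renames one column in every row."""
    def pipe(table: Table) -> Table:
        return [{(new if key == old else key): value for key, value in row.items()}
                for row in table]
    return pipe


def compose_pipes(*pipes: Pipe) -> Pipe:
    """Chain pipes so that the output of one becomes the input of the next."""
    return lambda table: reduce(lambda data, pipe: pipe(data), pipes, table)


# Hypothetical sample rows standing in for an uploaded tabular dataset.
sample = [
    {"station": "Lake A", "lat": "45.1"},
    {"station": "", "lat": ""},
]

pipeline = compose_pipes(drop_empty_rows, rename_column("lat", "latitude"))
result = pipeline(sample)
```

Because every pipe takes a table and returns a table, steps can be reordered or reconfigured without touching the others, which is the property the text attributes to Grafterizer pipelines.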
Both pipelines and RDF mappings are stored together as complete data transformations and may be easily shared and reused. After a transformation is constructed and saved, one may apply it to the target dataset and download the results locally in the desired RDF format. Currently supported formats include RDF/XML (.rdf), N-Triples (.nt), Turtle (.ttl), N3 (.n3), N-Quads (.nq) and RDF/JSON (.rj).

Figure 8: Generated RDF graph for ARPA data

Jarfter

In order to apply the transformations to complete datasets, we have developed Jarfter for the purposes of SmartOpenData. Jarfter is a set of web services that allow for server-side compilation of data transformations (serialized as Clojure code) as well as execution of transformations on uploaded datasets. They can be accessed through the user interface shown in Figure 9, which gives users two options for transforming their data.

Figure 9: User interface for Jarfter

The "Execute transformation" operation performs the complete transformation of the entire dataset, based on the generated Clojure code corresponding to the transformation. The code and data are uploaded to the server, and when the transformation is complete the browser downloads a file containing the transformed data.

The second option is the "Download transformation executable" operation, which does only half the job compared to "Execute transformation". The server receives only the generated Clojure source code and not the dataset. The Clojure code is compiled to an executable JAR file, which is then automatically downloaded by the user as "transformation.jar". This file can then be used to transform datasets locally instead of in the cloud.
The JAR file can be run from the command line, from the location where the file is stored, as follows:

$ java -jar transformation.jar <input-file.csv> <output-file.(nt|rdf|n3|ttl)>

Jarfter Architecture

Jarfter is a set of RESTful web services with a back-end database which allow for server-side compilation of Clojure code and execution of transformations on datasets. The services are implemented in a way that allows Jarfter to be used both with and without interacting with the database.

Figure 10: Jarfter compiler services

A schematic overview of the compiler service, accessed through the "Download transformation executable" capability, is shown in the figure above. The Clojure source code, which is generated from the user-specified transformations in Grafterizer, is sent to the server back-end, where it is compiled to an executable JAR file. The JAR file can then be sent back to the user (where it can be executed locally) or, if the database-interactive services are used, both the Clojure source code and the executable JAR are stored in the back-end database as well.

Jarfter also supports execution of the transformations on the server side, as exposed by the "Execute transformation" capability. Figure 11 provides an overview of the workflow for the services:

Figure 11: Jarfter transformation web service

As shown in the diagram, the client must provide either references to the database entries for a transformation and the dataset that will be transformed, or the Clojure source code and the dataset itself. Any provided source code is dynamically compiled into a JAR executable. If the user provides a reference, the JAR is extracted from the database.
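The input handling described for the "Execute transformation" workflow can be summarised in a small sketch. It is not Jarfter's implementation: the request shape (`TransformRequest`), the function `plan_execution` and the step names are hypothetical, and the sketch assumes that each of the two inputs may independently be given inline or by reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TransformRequest:
    """Hypothetical request: each input is supplied inline or by database reference."""
    clojure_source: Optional[str] = None
    transformation_ref: Optional[str] = None
    dataset: Optional[bytes] = None
    dataset_ref: Optional[str] = None


def plan_execution(request: TransformRequest) -> List[str]:
    """List the back-end steps implied by the inputs the client supplied."""
    steps = []
    if request.clojure_source is not None:
        steps.append("compile source to JAR")
    elif request.transformation_ref is not None:
        steps.append("load JAR from database")
    else:
        raise ValueError("a transformation (source or reference) is required")

    if request.dataset is not None:
        steps.append("use uploaded dataset")
    elif request.dataset_ref is not None:
        steps.append("load dataset from database")
    else:
        raise ValueError("a dataset (upload or reference) is required")

    steps.append("execute JAR on dataset")
    steps.append("return transformed data")
    return steps


inline_plan = plan_execution(TransformRequest(clojure_source="(comment)", dataset=b"a,b"))
stored_plan = plan_execution(TransformRequest(transformation_ref="t1", dataset_ref="d1"))
```

The two plans differ only in how the JAR and the dataset are obtained; execution and the return of the result are the same in both cases.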
The dataset to be transformed is likewise retrieved from the database if a database entry, rather than the dataset itself, is given as input. The back-end then executes the JAR on the given dataset and sends the transformed data back to the user.

Warfter: Dynamic Deployment of Data Transformations (Jarfter extension)

The approach to generating data transformations implemented in the Jarfter service allows a very high level of automation of the transformation process. In particular, this is due to the statelessness of all resulting transformation executables. This property allows transformations to be created on demand and then used to dynamically form cloud deployment topologies. A high-level overview of the intended process of forming a simple topology is illustrated in Figure 12:

Figure 12: Dynamic deployment of data transformations

First, users specify the transformation that needs to be deployed, on the client side, using the Grafterizer tool. When the transformation is ready, the transformation code is sent to the Jarfter back-end, where the Clojure compiler service builds an executable WAR or JAR file and forwards it to a "deployer" component capable of dynamically provisioning cloud resources and deploying applications (in our case, we plan to use the CloudML run-time environment). Finally, the transformation that has been deployed in the cloud can be accessed by the transformation owner or other users in order to apply it to various datasets.

As mentioned, in order to implement the dynamic deployment of transformations, we plan to use CloudML.
CloudML comprises a set of tools and a domain-specific language (DSL) for modelling and enacting the provisioning and deployment of cloud applications. The modelling language allows for the specification of cloud topologies and the necessary software and hardware resources, as shown in Figure 13:

Figure 13: CloudML deployment template

The CloudML template in the figure represents a simple deployment topology which comprises two software components: the aforementioned executable transformation (labelled "Transformation") and a generic servlet container (labelled "SC"). Additionally, the figure illustrates the specification of a hardware component, a virtual machine (labelled "VM"). In CloudML, virtual machines are specified in a provider-agnostic way through a set of hardware requirements, which can then be matched against the flavors of virtual machines available from each particular provider.

The CloudML template can be programmatically edited to inject details on how to deploy a particular transformation that has been generated by the Clojure compiler. The resulting model can be sent to the CloudML engine, which can then enact the provisioning and deployment of the necessary resources through a process of matching the hardware and software requirements with the available capabilities.

Grafterizer vs. OpenRefine – An Overview

This section gives an analysis of the data transformation process for the ARPA and DGT-TRAGSA pilot use cases. Transformations have been performed with the help of two data cleaning and transformation tools, OpenRefine and Grafterizer. A comparison of the transformation construction process in these tools is given below.

The first difference that significantly affects the data transformation process is the possibility to create utility functions in Grafterizer.
This makes it possible to separate computational logic from the data it operates on. For example, in the OpenRefine project the formula for computing geographical coordinates for the ARPA pilot Lakes/Rivers Monitoring Stations is defined twice: once for computing latitude and once for computing longitude. Grafterizer allows it to be encapsulated in a separate function which can be called as many times as needed.

Another difference is that OpenRefine can keep the original cell value if an error occurs during a transformation, a feature that is not currently available in Grafterizer.

Some transformations in the tested use cases require cross-dataset operations. This feature is available in OpenRefine, but Grafterizer currently does not allow several datasets to be read at the same time in one pipeline.

One rather useful feature of Grafterizer is the possibility to edit the parameters of each transformation step and to change the step order at any moment while creating the transformation, which is impossible in OpenRefine. At the same time, OpenRefine provides a transformation history with Undo/Redo options.

The functionality for RDF mapping construction is similar in both tools, with some small differences. One of them is that, at its current stage, Grafterizer does not provide functionality for creating language-tagged nodes.

A short summary of the differences in functionality between the two tools is given in the table below.
Feature                                                              Grafterizer   OpenRefine
Basic functionality
  Encapsulating and reusing utility functions in one transformation       +            -
  Ignore errors (leave original data on error)                            -            +
  Cross-dataset operations (join datasets)                                -            +
Transformation operations management
  Edit transformation operation                                           +            -
  Change operation order                                                  +            -
  Transformation history with undo/redo options                           -            +
RDF mapping
  Language-tagged nodes                                                   -            +

2.2 XML (GML)-TO-RDF transformations

2.2.1 Slovak pilot

Main motivation for the selected approach

Based on further analysis of the available datasets, technologies and knowledge capacities in the field of data harmonisation, and on previous experience documented in D3.3 (Chapter 6.4) [SMODD33], additional datasets have been harmonised in order to implement the SmartOpenData modelling framework in support of the Slovakian pilot. The main motivation of this approach was to investigate the possibilities of exposing both INSPIRE-compliant and other datasets to the Web of (Linked) Data and of enriching these resources with external knowledge.

Data model and storage

In addition to the initial SK INSPIRE Protected Sites dataset53, which was transformed using the GeoKnow XSL stylesheets, a further list of datasets has been identified and prepared for transformation with the support of the COMSODE project54.

The activity covered several of the datasets that SAZP publishes to comply with the European Union's INSPIRE directive55, including data on protected sites, species distribution, bio-geographical regions and land cover, and an additional dataset on contaminated sites registered as environmental burdens. The INSPIRE datasets were described with the INSPIRE XML schemas, while the latter dataset used a custom XML schema. The source data is available in the Geography Markup Language (GML) via an API provided by the Web Feature Service (WFS).
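As an illustration of how such a WFS request is put together, the following sketch assembles a GetFeature URL from the same parameters that appear in the request URLs listed in the footnotes of this section (`request`, `version`, `outputFormat`, `typeName`, `cql_filter`). The helper name `build_getfeature_url` is hypothetical, and the sketch only builds and decodes the URL; it does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as it appears in the request URLs quoted in the footnotes.
WFS_ENDPOINT = "http://inspire.geop.sazp.sk/geoserver/wfs"


def build_getfeature_url(type_name: str, cql_filter: str,
                         endpoint: str = WFS_ENDPOINT) -> str:
    """Build a WFS 1.1.0 GetFeature URL asking for GML 3.2 output."""
    params = {
        "request": "GetFeature",
        "version": "1.1.0",
        "outputFormat": "gml32",
        "typeName": type_name,
        "cql_filter": cql_filter,
    }
    # urlencode percent-encodes quotes, spaces and parentheses in the filter.
    return f"{endpoint}?{urlencode(params)}"


url = build_getfeature_url(
    "ps:ProtectedSite",
    "\"ps:siteDesignation/ps:DesignationType/ps:designation\" in ('naturalMonument')",
)

# Decoding the query string again recovers the original parameter values.
decoded = parse_qs(urlsplit(url).query)
```

Percent-encoding the filter is the main reason to build the URL programmatically: the filter contains quotes and spaces, which is also why the text advises copying the request URL as a whole rather than retyping it.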
Note on the "Input dataset hyperlinks" in the following table: the footnotes in this column reproduce the output of a bash script56 that downloads the datasets individually from the WFS (the script requires the curl HTTP client). In the output of the script you see the dataset title and the corresponding WFS request. The request URL is enclosed between the characters '<' and '>'. It is necessary to copy it as a whole (opening the queries in a browser is not recommended). The bash script downloads each dataset to the file system. Most of the requests contain a cql_filter that selects only the data for Slovakia (the database also contains data for the Czech Republic). The 'Corine landcover' request may take a few minutes, as it contains about 22000 features and the transformation from the relational DB into GML happens "on the fly". All data are in EPSG:4258 (ETRS89) geographic coordinates.

No.  Input dataset                                         Target vocabulary                  Output dataset
1    National parks and protected landscape areas57        SmOD Protected Sites58             SK LD INSPIRE Protected Sites59
2    Small scale protected areas60                         SmOD Protected Sites               SK LD INSPIRE Protected Sites
3    Protected natural monuments61                         SmOD Protected Sites               SK LD INSPIRE Protected Sites
4    Special protection areas - Bird directive62           SmOD Protected Sites               SK LD INSPIRE Protected Sites
5    Sites of community importance - Habitat Directive63   SmOD Protected Sites               SK LD INSPIRE Protected Sites
6    Biosphere reserves64                                  SmOD Protected Sites               SK LD INSPIRE Protected Sites
7    Ramsar65                                              SmOD Protected Sites               SK LD INSPIRE Protected Sites
8    UNESCO world nature heritage sites66                  SmOD Protected Sites               SK LD INSPIRE Protected Sites
9    Protected landscape elements67                        SmOD Protected Sites               SK LD INSPIRE Protected Sites
10   Corine Land Cover68                                   SmOD Land Cover69                  SK LD Land Cover
11   Contaminated sites / Environmental burdens            SK Contaminated sites /            SK LD Contaminated sites /
                                                           Environmental burdens vocabulary70 Environmental burdens71
12   Biogeographical regions72                             SmOD Biogeographical regions73     SK LD Biogeographical regions74
13   Species distribution (Selected taxons)75              SmOD Species distribution76        SK LD Species distribution77

Table 6: An overview of the datasets and vocabularies used in SK Pilot

53 http://ckan.sazp.sk/dataset/inspire-protected-sites-linked-data/resource/fba4d3b8-195c-4224-a7b9-ab734c6e933d
54 http://www.comsode.eu/
55 http://inspire.ec.europa.eu/index.cfm/pageid/3
56 http://redmine.sazp.sk/attachments/download/136/retrieve-smod-datasets.sh
57 WFS, GML> Dowloading '01. National parks and protected landscape areas' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typeName=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('nationalPark','ProtectedLandscapeOrSeascape') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `01_NP_LPA.gml'.
58 http://www.w3.org/2015/03/inspire/ps#
59 http://ckan.sazp.sk/dataset/inspire-protected-sites-linked-data/resource/1d6e0fdf-df3d-4a69-bd5e-d49aa16d6596
60 WFS, GML> Dowloading '02. Small scale protected areas' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('managedResourceProtectedArea','strictNatureReserve','wildernessArea') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `02_SSPA.gml'
61 WFS, GML> Dowloading '03. Protected natural monuments' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('naturalMonument') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `03_PNM.gml'
62 WFS, GML> Dowloading '04. Special protection areas - Bird directive' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('specialProtectionArea') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `04_SPA.gml'
63 WFS, GML> Dowloading '05. Sites of community importance - Habitat Directive' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('siteOfCommunityImportance') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `05_SCI.gml'
64 WFS, GML> Dowloading '06. Biosphere reserves' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('biosphereReserve') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `06_BR.gml'
65 WFS, GML> Dowloading '07. Ramsar' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designationScheme" in ('ramsar') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `07_RAMSAR.gml'
66 WFS, GML> Dowloading '08. UNESCO world nature heritage sites' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designationScheme" in ('UNESCOWorldHeritage') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `08_UNESCO.gml'
67 WFS, GML> Dowloading '09. Protected landscape elements' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=ps:ProtectedSite&cql_filter="ps:siteDesignation/ps:DesignationType/ps:designation" in ('naturalMonument') and "ps:inspireID/base:Identifier/base:namespace" = 'SK:GOV:MOE:SEA:PS'> GML output saved in `09_PLE.gml'
68 WFS, GML> Dowloading '10. Corine landcover' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typename=lcv:LandCoverUnit&cql_filter="lcv:inspireId/base33:Identifier/base33:namespace" = 'SK:GOV:MOE:SEA:LC'> GML output saved in `10_CLC.gml'
69 http://www.w3.org/2015/03/inspire/lc#
70 http://data.sazp.sk/vocab/contaminated-sites
71 http://ckan.sazp.sk/dataset/sk-environmental-burdens-contaminated-sites/resource/a33b9933-937a-4cca-89d7-223703bb1187
72 WFS, GML> Dowloading '12. Biogeographical regions' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typeName=br:Bio-geographicalRegion&cql_filter="br:inspireId/base33:Identifier/base33:namespace" = 'SK:GOV:MOE:SEA:BR'> GML output saved in `12_BIO_REGIONS.gml'
73 http://www.w3.org/2015/03/inspire/br#
74 http://ckan.sazp.sk/dataset/sk-inspire-bio-geographical-regions-linked-data
75 WFS, GML> Dowloading '13. Species distribution' ... URL: <http://inspire.geop.sazp.sk/geoserver/wfs?request=GetFeature&version=1.1.0&outputFormat=gml32&typeName=sd:SpeciesDistributionUnit&cql_filter="sd:inspireId/base33:Identifier/base33:namespace" = 'SK:GOV:MOE:SEA:SD'> GML output saved in `13_SD.gml'
76 http://www.w3.org/2015/03/inspire/sd#
77 http://ckan.sazp.sk/dataset/sk-inspire-species-distribution-linked-data

Process, tools and technologies

The whole process of data harmonisation was driven by the development of the related components of the SmartOpenData infrastructure as well as by selected elements of the COMSODE methodology for Open Data publishing78.

78 http://opendatanode.org/product/methodology-for-od-publishing/

Table 7: List of phases and tasks extracted and deployed from the COMSODE methodology for Open Data publishing

In brief, the whole process started with harvesting the data from the WFS and converting it to RDF. During the conversion, the data were aligned with selected RDF vocabularies and code lists. Some of the newly created linked data were interlinked with third-party data in order to enrich them.

Creating linked data

The whole process of data transformation was carried out with the support of the UnifiedViews Extract-Transform-Load (ETL) framework79, the core component of Open Data Node (ODN), a publication platform for Open Data, in which it ensures the extraction, transformation and publishing of (Linked) Open Data. This environment makes it possible to define, execute, monitor, debug, schedule and share RDF data processing tasks.
79 http://opendatanode.org/product/unifiedviews/

A data processing task (or simply task) consists of one or more data processing units (DPUs). These tasks may use custom plugins, that is, DPUs created by users. A DPU encapsulates certain business logic needed when processing data (e.g., one DPU may extract data from an RDF database or apply a SPARQL query). Every DPU has its inputs, outputs, business logic and configuration.

UnifiedViews differs from other ETL frameworks by natively supporting RDF data and ontologies. UnifiedViews has a graphical user interface for the administration, debugging and monitoring of the ETL process.

Since GML is an XML format, the harvested data were converted to RDF/XML via XSL transformations. For this purpose, the XSL transformations developed by the GeoKnow project80 were reused. To reflect recent developments, an extensive set of GeoKnow XSLT style sheets has been updated81. Besides some bug fixes, these updates contained changes related to the mapping against the SmOD vocabularies82 as well as specific modifications related to UnifiedViews.

80 http://geoknow.eu
81 https://github.com/jindrichmynarz/TripleGeo/tree/sazp/xslt
82 http://www.w3.org/2015/03/inspire

Figure 14: List of updated GeoKnow XSLT stylesheets

For each dataset a data processing pipeline has been built in the UnifiedViews component of the ODN. The pipelines harvested the data from the WFS and converted it to RDF via XSL transformations.
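The shape of such an XML-to-RDF mapping can be shown with a small sketch. The pilot used XSLT style sheets; the Python below swaps in the standard library's `xml.etree.ElementTree` purely for illustration. The input is a hypothetical, heavily simplified feature collection (real INSPIRE GML is much richer), and the `http://example.org/` namespaces, the element names and the function `gml_to_ntriples` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified feature collection; not the INSPIRE schema.
SAMPLE_GML = """<features xmlns:gml="http://www.opengis.net/gml/3.2"
          xmlns:ex="http://example.org/schema#">
  <ex:ProtectedSite gml:id="PS.1">
    <ex:siteName>Example Site</ex:siteName>
  </ex:ProtectedSite>
</features>"""

GML_ID = "{http://www.opengis.net/gml/3.2}id"
EX = "http://example.org/schema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def gml_to_ntriples(gml_text: str, base_uri: str) -> list:
    """Map each feature to a typed, labelled resource, one N-Triples line per fact."""
    root = ET.fromstring(gml_text)
    triples = []
    for feature in root.findall(f"{{{EX}}}ProtectedSite"):
        # The gml:id attribute becomes the local part of the resource URI.
        subject = f"<{base_uri}{feature.get(GML_ID)}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{EX}ProtectedSite> .")
        name = feature.findtext(f"{{{EX}}}siteName")
        if name:
            # Escape backslashes and quotes so the literal stays valid N-Triples.
            escaped = name.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{RDFS_LABEL}> "{escaped}" .')
    return triples


triples = gml_to_ntriples(SAMPLE_GML, "http://example.org/id/")
```

An XSLT style sheet expresses the same idea declaratively: a template matches each feature element and emits the corresponding RDF/XML description.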
Figure 15: Landing page for Unified Views

Figure 16: List of created pipelines

The transformation of the SK datasets has been designed and executed with the following list of DPUs:

e-distributionMetadata     t-geonamesOrgToRdfFile
e-filesDownload            t-gunzipper
e-sparqlEndpoint           t-rdfToFiles
l-filesToCkan              t-sparqlConstruct
l-filesToParliament        t-sparqlUpdate
l-filesToVirtuoso          t-unzipper
l-filesUpload              t-xslt
l-rdfToCkan                t-zipper
t-filesToRdf

Figure 17: Section with DPU templates

Each DPU comes with specific functionality. For example, l-filesToParliament is a DPU that loads RDF serialized in files into the Parliament RDF store via its HTTP API for bulk upload83, whilst t-geonamesOrgToRdfFile is a DPU that transforms a dump of Geonames.org data into RDF. The dump is not valid RDF, since it consists of line-separated pairs of URIs and the corresponding descriptions of those URIs serialized in RDF/XML. This DPU parses the dump format and outputs a valid RDF file84.
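A minimal sketch of the kind of processing described for t-geonamesOrgToRdfFile is given below. It is not the DPU's code: it assumes the dump alternates one URI line with one line holding a complete RDF/XML document, and the sample data, `merge_dump` and `sample_document` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def merge_dump(lines):
    """Merge alternating (URI, RDF/XML document) lines into one rdf:RDF element."""
    ET.register_namespace("rdf", RDF_NS)
    merged = ET.Element(f"{{{RDF_NS}}}RDF")
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 2:
        raise ValueError("expected URI and description lines in pairs")
    # Even positions hold a URI, odd positions the RDF/XML describing it.
    for _uri, document in zip(cleaned[0::2], cleaned[1::2]):
        # Copy every top-level description into the single merged root.
        merged.extend(ET.fromstring(document))
    return merged


def sample_document(uri, label):
    """Build a hypothetical one-resource RDF/XML document on a single line."""
    return (
        f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:rdfs="{RDFS_NS}">'
        f'<rdf:Description rdf:about="{uri}"><rdfs:label>{label}</rdfs:label>'
        "</rdf:Description></rdf:RDF>"
    )


uris = ["http://example.org/place/1", "http://example.org/place/2"]
dump = []
for number, uri in enumerate(uris, start=1):
    dump += [uri, sample_document(uri, f"Place {number}")]

merged = merge_dump(dump)
rdf_xml = ET.tostring(merged, encoding="unicode")
```

The result is a single document with one root element, which a standard RDF/XML parser can read, unlike the line-oriented dump.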
Figure 18: Pipelines execution monitor

83 https://github.com/UnifiedViews/Plugins/blob/master/l-filesToParliament/doc/About.md
84 https://github.com/comsode-uv-plugins/t-geonamesOrgToRdfFile/blob/develop/README.md

Figure 19: Scheduler with the possibility to define the schedules for pipelines execution

Figure 20: Section with additional settings

Figure 21: Example of pipeline details

Figure 22: Example of further DPU settings

To fulfil the requirements of the project, ODN has been enhanced with a loader to the Parliament RDF store (http://parliament.semwebcentral.org) and an extractor for Geonames.org. Parliament is used by SAZP to store RDF data because it supports geospatial features. The extractor for Geonames.org was needed in order to be able to link to this dataset.

Interlinking

In addition to the data transformation pipelines, pipelines were created for enriching the datasets with links to external datasets, including Geonames.org and three datasets from the European Environment Agency (Biogeographical regions 2011, Natura 2000 and EUNIS). The following linkages between the generated linked data and external resources have been identified:

● SK Protected Sites <> GeoNames85
● SK Protected Sites <> EEA Natura 2000 86
● SK Contaminated Sites <> GeoNames

85 http://www.geonames.org/
86 http://natura2000.eea.europa.eu/rdf/

Figure 23: Example of the interlinking pipeline

Publishing

The main outcomes of these harmonisation activities are available via:

● A human-readable SAZP Open Data portal interface based on CKAN, providing the possibility to search metadata and visualise all harmonised SK linked data resources87
● A machine-readable GeoSPARQL API88
● A web application interface supporting GeoSPARQL queries89

Visualizations will be supported with the extensions of LDVMi90.

87 http://data.sazp.sk/
88 http://data.sazp.sk/parliament/sparql
89 http://data.sazp.sk/parliament/query.jsp
90 http://ldvm.net

Figure 24: CKAN interface with the list of metadata for the open linked data from Slovak pilot

Figure 25: Parliament web application interface

Observed benefits and limitations

A key benefit of the RDF version of the SAZP datasets is that it is straightforward to combine it with third-party datasets. In this way, large and rich datasets, such as Geonames.org, can be linked, and additional features may be drawn from them to frame the original data in a broader context.

During the work on the visualizations, the use of coordinate reference systems (CRS) was identified as a stumbling block. Even though the CRS was changed to a more common one, most visualization tools cannot directly reuse data projected in this CRS because the inverse order of coordinates is expected.
This is also the case for OpenLayers 3 (http://openlayers.org/), the visualization library that was used, which required re-projection of the coordinates on the client side. Ultimately, visualizing the data by projecting it on a map allowed for a visual inspection that revealed errors in its coordinates, which were subsequently fixed. In this way, the exercise helped to improve the quality of the primary data. It turned out that transforming data and viewing it from different perspectives can detect errors and thus contribute to better data quality.

Recommendations & Future outlook

Publishing data that adheres to common standards, such as the INSPIRE schemas, makes it more reusable. In the case of the SAZP datasets, standardization made it possible to reuse parts of the GeoKnow XSL transformations that were made for INSPIRE-compliant data, without creating our own from scratch. We learnt a similar lesson for the CRS. In order to improve the reusability of geospatial data on the Web, it should be available at least in the WGS 84/Pseudo-Mercator - Spherical Mercator CRS, which is supported natively in most tools. As for the formats for geographic geometries, encoding them as Well-Known Text (WKT) RDF literals was found to offer a good trade-off between granularity and data volume.

Based on this experience, further investigation will take place to identify which datasets should be extended in their coverage and which new ones are the best candidates for further harmonisation, as well as for possible linking and enrichment with external third-party linked data resources.

2.3 Relational DB-to-RDF transformations

2.3.1 Czech pilot

The goal of the Czech pilot is to transform the NFI (National Forest Inventory) data from the relational database into RDF/XML or Turtle and to publish this LOD on the web.
At the beginning of the SmartOpenData project the UHUL FMI was supposed to establish a SPARQL endpoint and use on-the-fly transformation. However, no suitable lightweight tool for the UHUL FMI production environment was found; one of the issues was the dependency of most available tools on the Java platform. The UHUL FMI therefore decided to use a static transformation into files, so that the data are at least published statically. This approach is reasonable, because the NFI data are created at one moment and remain stable for approximately a year.

Data model and storage

The UHUL FMI uses PostgreSQL/PostGIS as a key component for data storage and data analyses. Using it both for the NFI source database and for the public data store allows us to replicate/copy the necessary data from the private database server to the public database server. The infrastructure therefore comprises two separate PostgreSQL databases, and the transformation itself uses the public one. This pilot description focuses on the transformation of data from the public database.

The data model below represents the NFI type of information that is being published. In the middle is the main table t_nfi_estimate, which represents an estimate. Every estimate has a point estimate (a value) and a lower and upper limit (a confidence interval). The estimate is further defined in lookup tables (a type, a unit of measure, an attribute filter, a geographic domain etc.); it could be, for example, forest cover in hectares in the Czech Republic divided by forest owner. The relation to the geographic domain is important, because the UHUL FMI mostly uses the NUTS regions, which are commonly used among partners across the EU and moreover appear to be an appropriate entity for linkage with other data sources.
Another possible linkage is through the NFI outcomes or attributes themselves, because many other EU countries provide NFI outcomes just as the Czech Republic does, and there are also initiatives that try to define common attributes among the NFIs, e.g. ENFIN⁹¹.

⁹¹ http://www.enfin.info/

Figure 26: Czech pilot Data model

Relations, Links & Mappings

In order to create proper links and define the NFI data in a broader space (the internet), links (URLs) with the same meanings and definitions had to be found. Some of these URLs were sufficient for the NFI data, but it was also necessary to create a specific vocabulary for the NFI “forest” attributes, which was not available on the internet. For this purpose the UHUL FMI created a first draft of the NFI vocabulary in RDF, which contains short descriptions of the estimates. However, it will be desirable to find a responsible body to take care of this vocabulary. During SmOD we expect that it will be the UHUL FMI. An example of the vocabulary, which will be available from http://nil.uhul.cz/lod/ns/nfi/, follows:

@prefix nfi: <http://nil.uhul.cz/nfi.ttl> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
nfi:ObAvGrowingStockPerHa rdfs:subClassOf qb:Observation ;
    rdfs:label "Observed average growing stock per hectare"@en ;
    rdfs:comment "The observed average growing stock in cubic metres per hectare."@en .

nfi:ObForestOwner rdfs:subClassOf qb:Observation ;
    rdfs:label "Observed forest owner" ;
    rdfs:comment "The observed type of forest owner. Each observation is linked to the relevant owner type defined in http://nil.uhul.cz/lod/ns/fot" .

nfi:ObGrowingStock rdfs:subClassOf qb:Observation ;
    rdfs:label "Observed growing stock"@en ;
    rdfs:comment "The observed growing stock (in cubic metres) within the specified area."@en .
...

For visualisation the UHUL FMI also needs a geometric representation of each geographic domain, and therefore the NUTS regions are also published in WKT form on the webpage (http://nil.uhul.cz). Some sources for the NUTS regions are already available on the web; however, the NFI uses its own generalisation of the geometry for the map client. It is faster for a web map window to use the prepared geometry than to generalise it dynamically on the client side for every request. If a proper NUTS 3 geometry representation becomes available on the web, this vocabulary can be avoided. The vocabulary has the following format and will be available at this URL: http://nil.uhul.cz/lod/ns/nuts/ .

@prefix unit: <http://qudt.org/1.1/vocab/unit#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nfi: <http://nil.uhul.cz/lod/nfi#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .

<http://nil.uhul.cz/lod/geo/nuts#CZ032>
    a <http://www.w3.org/2015/03/inspire/au#AdministrativeUnit> ;
    rdfs:comment "CZ032 - Plzeňský" ;
    rdfs:label "CZ032" ;
    owl:sameAs <http://nuts.geovocab.org/id/CZ032> ,
        <http://estatwrap.ontologycentral.com/dic/geo#CZ032> ;
    geo:asWKT "POLYGON((13.7657560325118 49.5140373364391,13.7478475772677 49.4868312489771, …

Data published by the NFI are mostly statistical, therefore the UHUL FMI could use available mathematical and physical vocabularies for the data definition, e.g.:
● http://purl.org/NET/scovo#
● http://qudt.org/1.1/vocab/unit#

It could also use vocabularies for geographical relations and entities, some of them created and recommended during the SmOD project:
● http://www.opengis.net/ont/geosparql#
● http://www.w3.org/2015/03/inspire/au#

Transformation

SmartOpenData technical meetings, documents and hackathons helped us to test the tools suitable for presenting our outcomes. We tested D2RQ and r2rml-parser⁹² for publishing the NFI results from our database, and Virtuoso for further data processing and visualisation. Whether we will use D2RQ for data transformation in the production environment depends on several conditions, for example long-term support of D2RQ, security, and support of Java technology by the Ministry of Agriculture. The data now published at http://nil.uhul.cz were created with r2rml-parser.

Once the links had been set up (as described in the previous chapter), the transformation could be done. For the transformation, the mapping has been defined in R2RML syntax⁹³. The data in the relational database have not been translated from the native Czech language, therefore the language attribute had to be used. Below is an example of the mapping file for an estimate of the forest cover:

#
# forest_cover
#
@prefix map: <#>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix au: <http://www.w3.org/2015/03/inspire/au#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix geo: <http://www.opengis.net/ont/geosparql#>.
@prefix nfi: <http://nil.uhul.cz/nfi#>.
@prefix scovo: <http://purl.org/NET/scovo#>.
@prefix unit: <http://qudt.org/1.1/vocab/unit#> .
@prefix nfi: <http://nil.uhul.cz/lod/nfi#> .

### NIL database mappings
map:spatial
    rr:logicalTable <#forest>;
    rr:subjectMap [
        rr:template 'http://nil.uhul.cz/lod/nfi/forest_cover#{"id_result"}';
        rr:class nfi:forest_cover;
    ];
    rr:predicateObjectMap [
        rr:predicate rdf:value;
        rr:objectMap [ rr:column "point_estimate";] ;
    ];
    rr:predicateObjectMap [
        rr:predicate scovo:max;
        rr:objectMap [ rr:column "upper_limit"] ;
    ];
    rr:predicateObjectMap [
        rr:predicate scovo:min;
        rr:objectMap [ rr:column "lower_limit"] ;
    ];
    rr:predicateObjectMap [
        rr:predicate unit:units;
        rr:objectMap [ rr:constant unit:Percent;] ;
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:adomainId;
        rr:objectMap [ rr:column "adomain"] ;
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:adomain;
        rr:objectMap [ rr:column "adomain_label" ; rr:language "cs"] ;
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:adomain_description;
        rr:objectMap [ rr:column "adomain_description" ; rr:language "cs"] ;
    ];
    rr:predicateObjectMap [
        rr:predicate nfi:nfi_cycle;
        rr:objectMap [ rr:constant "2001 - 2004"; rr:termType rr:Literal;] ;
    ];
    rr:predicateObjectMap [
        rr:predicate geo:hasGeometry;
        rr:objectMap [ rr:template 'http://nil.uhul.cz/lod/geo/nuts#{"gdomain_label"}'; rr:termType rr:IRI;] ;
    ];
    .
...

⁹² https://github.com/nkons/r2rml-parser
⁹³ http://www.w3.org/TR/r2rml/
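
To make the effect of such a mapping more tangible, the following Python sketch mimics, for one row of the logical table, what an R2RML processor does with the subject template and with the column, language and template object maps shown above. It is an illustration under simplifying assumptions and not the r2rml-parser implementation: only some of the predicate-object maps are covered, literals carry no datatypes, template values are not percent-encoded, and the example row is invented.

```python
# Illustrative sketch only: mimics a few R2RML term maps for a single row.
# It is NOT the r2rml-parser implementation. The row passed to map_row() is
# assumed to be a plain dict keyed by the column names used in the mapping.

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NFI = "http://nil.uhul.cz/lod/nfi#"
SCOVO = "http://purl.org/NET/scovo#"
GEO = "http://www.opengis.net/ont/geosparql#"

SUBJECT_TEMPLATE = "http://nil.uhul.cz/lod/nfi/forest_cover#{id_result}"
GEOMETRY_TEMPLATE = "http://nil.uhul.cz/lod/geo/nuts#{gdomain_label}"


def iri(value):
    """Wrap an IRI in angle brackets (N-Triples syntax)."""
    return f"<{value}>"


def literal(value, lang=None):
    """Render a plain or language-tagged literal (datatypes omitted for brevity)."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def map_row(row):
    """Return N-Triples-style lines for one row of the logical table."""
    subject = iri(SUBJECT_TEMPLATE.format(**row))
    pairs = [
        (RDF + "type", iri(NFI + "forest_cover")),                    # rr:class
        (RDF + "value", literal(row["point_estimate"])),               # rr:column
        (SCOVO + "max", literal(row["upper_limit"])),
        (SCOVO + "min", literal(row["lower_limit"])),
        (NFI + "adomain", literal(row["adomain_label"], lang="cs")),   # rr:language
        (GEO + "hasGeometry", iri(GEOMETRY_TEMPLATE.format(**row))),   # rr:template
    ]
    return [f"{subject} {iri(predicate)} {obj} ." for predicate, obj in pairs]


if __name__ == "__main__":
    # Hypothetical row; the values are invented for this example.
    example = {
        "id_result": 1,
        "point_estimate": 39.5,
        "upper_limit": 41.0,
        "lower_limit": 38.0,
        "adomain_label": "les",
        "gdomain_label": "CZ032",
    }
    print("\n".join(map_row(example)))
```

Each returned line corresponds to one triple; the geometry object points at the NUTS resource of the row's geographic domain, which is how the estimates and the region geometries end up linked.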
All transformations have been done using r2rml-parser, which can also create a triple store with a dynamic connection to the database; at this moment, however, only a one-time transformation has been done. The final RDF/XML or Turtle files are available from http://nil.uhul.cz/lod/ns/* , where * represents the name of a vocabulary, e.g. http://nil.uhul.cz/lod/nfi/forest_cover/ or http://nil.uhul.cz/lod/nfi/forest_cover.ttl .

The estimates are presented in temporal cycles; therefore the outcomes can be compared between time periods. By default a user should always get the latest values; if someone needs older data, the link should look like, e.g., http://nil.uhul.cz/lod/nfi/forest_cover/AGS2001-2004.rdf, where AGS2001-2004 stands for “Average Growing Stock during 2001-2004 period”. In order to also have “a raw HTML” or human readable version of the estimates, an XSLT transformation has been applied to the RDF/XML output. The forest cover can also be accessed from this link: http://nil.uhul.cz/lod/nfi/forest_cover.html .

The NFI data are suitable for adopting the RDF Data Cube vocabulary described in Section 3. All estimates could be defined as observations and specified by dimensions; below is an example of an estimate of the forest area divided according to the forest species:

@prefix sa: <http://nil.uhul.cz/lod/nfi/species-area/> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix smod: <http://www.w3.org/2015/03/inspire/smod#> .
@prefix scovo: <http://purl.org/NET/scovo#> .
@prefix nuts: <http://nil.uhul.cz/lod/ns/nuts#> .
@prefix ts: <http://nil.uhul.cz/lod/ns/species-area#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix unit: <http://qudt.org/1.1/vocab/unit#> .
...
sa:ob4882 a qb:Observation, nfi:ObSpeciesArea ;
    qb:dataSet sa:SA2001-2004 ;
    nfi:refArea nuts:CZ0 ;
    nfi:cycle <http://reference.data.gov.uk/id/gregorian-interval/2001-01-01T00:00:00/P3Y> ;
    nfi:treeSpecies ts:ts2500 ;
    sdmx-attribute:unitMeasure unit:Hectare ;
    smod:areaHa "5586"^^xsd:double ;
    scovo:max "6833"^^xsd:double ;
    scovo:min "4339"^^xsd:double .
…

In order to model the above data, the NFI had to define terms in a vocabulary for the forest species used; a similar concept has been used for other estimates. An example follows:

@prefix ts: <http://nil.uhul.cz/lod/ns/species-area#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix nfi: <http://nil.uhul.cz/lod/ns/nfi#> .
...
ts:ts2500 a skos:Concept ;
    skos:prefLabel "DBC"@cs ;
    skos:prefLabel "Red oak"@en ;
    skos:definition "Dub červený"@cs ;
    skos:definition "Red oak"@en ;
    skos:notation "2500"^^nfi:UHULID ;
    skos:inScheme <http://nil.uhul.cz/lod/ns/species-area> ;
    skos:broader ts:ts6400 .
...

3 Harmonising Observations and Measurements

The final SmOD model suggests adopting the RDF Data Cube vocabulary⁹⁴ to encode environmental measurements and observations in RDF. In this section we illustrate the application of the Data Cube framework using the example of a water quality measurement taken from the Italian pilot. We discuss third-party vocabularies as well as custom terms used to encode environmental measurements.

⁹⁴ http://www.w3.org/TR/vocab-data-cube/

3.1 RDF Data Cube: Example

Environmental observations are essentially numeric values accompanied by numerous attributes that allow us to interpret and describe these values, e.g.:

“Average concentration of benzene in the water of the Sciaguana lake in 2013 was 0.1 µg/kg”

In the example above “0.1” is the observation value, which by itself does not give us much information. However, if we consider its attributes, we can interpret the value:
● “benzene” - what was measured?
● “concentration” - what quality of benzene was measured?
● “2013” - when was it measured?
● “the Sciaguana lake” - where was it measured?
● “µg/kg” - in what units was it measured?

3.1.1 Data Cube Components

The snippet below demonstrates how to encode the example observation in RDF using the RDF Data Cube approach:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Observation/0>
    a qb:Observation ;
    dbpedia-owl:average "0.1"^^xsd:float ;
    arpa-components:hasObservedDeterminand <http://data.smartopendata.eu/WFD/Determinand/71-43-2> ;
    sdmx-attribute:unitMeasure <http://data.smartopendata.eu/WFD/UnitOfMeasure/9> ;
    sdmx-dimension:refPeriod <http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/> ;
    arpa-components:station <http://data.smartopendata.eu/WaterbaseLakes/so/Station/IT19LW09453> .

In Data Cube terms, the properties attached to the observation in this snippet, such as dbpedia-owl:average and sdmx-attribute:unitMeasure, are called components. In order to represent the pilots’ observations, we re-used components defined by the Statistical Data and Metadata eXchange (SDMX) code lists⁹⁵.
The table below summarises the SDMX properties and the properties from other vocabularies used in the pilots’ data:

Prefix | Namespace | Terms used
sdmx-dimension | http://purl.org/linked-data/sdmx/2009/dimension# | sdmx-dimension:refPeriod
sdmx-attribute | http://purl.org/linked-data/sdmx/2009/attribute# | sdmx-attribute:unitMeasure
dbpedia-owl | http://dbpedia.org/ontology/ | dbpedia-owl:min, dbpedia-owl:max, dbpedia-owl:average, dbpedia-owl:mean
dcterms | http://purl.org/dc/terms/ | dcterms:subject, dcterms:source
schema | http://schema.org/ | schema:minValue, schema:maxValue

⁹⁵ SDMX guidelines contain standard code lists that are intended to be generic and reusable across various datasets.

In addition to the components presented in the table above, custom components have been defined for the Italian and Portuguese-Spanish pilots:

Prefix | Namespace | Terms used
arpa-components | http://smod-fp7.github.io/components/arpa-components.ttl | arpa-components:basePhenomenon, arpa-components:hasObservedDeterminand, arpa-components:station, arpa-components:numberOfSamples
tragsa-components | http://smod-fp7.github.io/components/tragsa-components.ttl | tragsa-components:workUnit

Components Values

Whenever possible, values of components have been encoded via existing SKOS concept schemes or other resources.

Values of sdmx-dimension:refPeriod

The temporal aspect of measurements was represented using the reference time URI set developed by data.gov.uk. For example, in the Portuguese-Spanish pilot climatology measurements are aggregated over several years, 1981-2010. We encoded this time period using the following pattern:

<http://reference.data.gov.uk/id/gregorian-interval>/<start-datetime>/P<n-of-years>Y

Hence, the URI of the period of time that corresponds to 21 years starting from 1981 looks as follows:

<http://reference.data.gov.uk/id/gregorian-interval/1981-01-01T00:00:00/P21Y>
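
The pattern above can be applied mechanically. The sketch below, written in Python purely for illustration (the helper name `interval_uri` is ours and is not part of the pilots' tooling), assembles such a URI from a start date and a duration in whole years.

```python
from datetime import datetime

# Base of the data.gov.uk reference interval URI set, as used in the text above.
BASE = "http://reference.data.gov.uk/id/gregorian-interval"


def interval_uri(start: datetime, years: int) -> str:
    """Build <BASE>/<start-datetime>/P<n-of-years>Y for a naive start datetime."""
    if years < 1:
        raise ValueError("years must be a positive integer")
    # isoformat(timespec="seconds") yields e.g. 1981-01-01T00:00:00
    return f"{BASE}/{start.isoformat(timespec='seconds')}/P{years}Y"


if __name__ == "__main__":
    print(interval_uri(datetime(1981, 1, 1), 21))
```

Called with 1 January 1981 and 21 years, the helper reproduces the URI shown above.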

Values of sdmx-attribute:unitMeasure

The table below summarises the Units of Measurement (UoM) used in the Portuguese-Spanish pilot:

Measurement | Unit of Measurement | URI
Average Annual Rainfall Level; Runoff | mm^3 | qudt-unit:CubicMillimeter
Slope; Annual Humidity Level | percent | qudt-unit:Percent
Average/Minimum/Maximum Annual Temperature | degrees Celsius | qudt-unit:DegreeCelsius
Annual Evapotranspiration Level | mm | qudt-unit:Millimeter
Annual Radiation Level | Kcal/cm^2 | qudt-unit:KilocaloriePerSquareCentimeter
Annual Insolation Level | sun hours per year | qudt-unit:NumberPerYear

In the Italian pilot, units of measurement are defined by an EEA code list⁹⁶. We have transformed this list into an RDF vocabulary, defining each unit of measurement as an instance of the qudt:Unit class. For example, below is the definition of μg/l:

<http://data.smartopendata.eu/WFD/UnitOfMeasure/9> a qudt:Unit ;
    rdfs:label "μg/l" ;
    rdfs:comment "microgrammes per liter" .

The complete code list is published together with the Portuguese-Spanish data⁹⁷.

Values of arpa-components:hasObservedDeterminand

Values of the observed determinand are also defined in an EEA code list⁹⁸. We have transformed it into an RDF vocabulary, defining every compound as an instance of the custom class smod:Determinand, e.g.:

<http://data.smartopendata.eu/WFD/Determinand/71-43-2> a smod:Determinand ;
    rdfs:label "Benzene" .

The complete code list is published together with the Portuguese-Spanish data⁹⁹.
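
The text does not state which tool was used to turn the EEA code lists into RDF. As a minimal sketch, assuming the code list entries are available as simple (code, label, description) values, the Python function below renders one entry in the same shape as the two definitions above. The function names and the input layout are assumptions made for this example only.

```python
# Illustrative sketch: render one code list entry as a Turtle resource.
# The (code, label, comment) input layout is an assumption for this example.

def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def entry_to_turtle(base: str, code: str, rdf_class: str, label: str, comment=None) -> str:
    """Return Turtle for <base+code> typed as rdf_class, with label and optional comment."""
    lines = [
        f"<{base}{code}> a {rdf_class} ;",
        f"    rdfs:label {turtle_literal(label)}",
    ]
    if comment is not None:
        lines[-1] += " ;"
        lines.append(f"    rdfs:comment {turtle_literal(comment)}")
    lines[-1] += " ."
    return "\n".join(lines)


if __name__ == "__main__":
    print(entry_to_turtle(
        "http://data.smartopendata.eu/WFD/UnitOfMeasure/", "9",
        "qudt:Unit", "μg/l", "microgrammes per liter"))
    print(entry_to_turtle(
        "http://data.smartopendata.eu/WFD/Determinand/", "71-43-2",
        "smod:Determinand", "Benzene"))
```

The output still needs the corresponding @prefix declarations (qudt, smod, rdfs) before it forms a complete Turtle document.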

⁹⁶ http://dd.eionet.europa.eu/dataelements/48239
⁹⁷ https://s3-eu-west-1.amazonaws.com/smod-repo/tragsa-release3/rdf.zip
⁹⁸ http://dd.eionet.europa.eu/datasets/latest/Groundwater/tables/HazSubstGW_Disagg/elements/DeterminandCode
⁹⁹ https://s3-eu-west-1.amazonaws.com/smod-repo/tragsa-release3/rdf.zip

3.1.2 Data Cube Datasets

Data Cube finalises the definition of observations by specifying which dataset each observation belongs to. This is done through the property qb:dataSet, as shown below for the running example:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Observation/0>
    qb:dataSet <http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/> .

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/> a qb:DataSet .

Like observations, a dataset may have components as well:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/>
    rdfs:comment "Aggregated data on hazardous substances reported in the Waterbase - Lakes dataset of the European Environmental Agency."@en ;
    dcterms:source <http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10> ;
    dcterms:subject <http://www.eionet.europa.eu/gemet/concept/9214> ;
    arpa-components:basePhenomenon "concentration".

The values of the dataset’s components hold for all the observations of the dataset. Thus, for example, we know that the example observation comes from the dataset that is available at http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10.
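
The idea that dataset-level values hold for every observation can be sketched in a few lines of Python. The example below uses plain dictionaries instead of an RDF library, and the function name `effective_values` is ours; it is similar in spirit to pushing dataset-level attachments down to observations, but it is only an illustration and not a Data Cube implementation.

```python
# Illustrative sketch: dataset-level component values apply to every observation.
# Plain dicts stand in for RDF resources; keys are prefixed property names.

def effective_values(observation: dict, dataset: dict) -> dict:
    """Combine dataset-level values with the values stated on one observation."""
    merged = dict(dataset)       # values shared by all observations of the dataset
    merged.update(observation)   # values stated on the observation itself
    return merged


if __name__ == "__main__":
    lakes_dataset = {
        "dcterms:source": "http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10",
        "dcterms:subject": "http://www.eionet.europa.eu/gemet/concept/9214",
        "arpa-components:basePhenomenon": "concentration",
    }
    observation_0 = {
        "dbpedia-owl:average": 0.1,
        "arpa-components:station":
            "http://data.smartopendata.eu/WaterbaseLakes/so/Station/IT19LW09453",
    }
    for key, value in effective_values(observation_0, lakes_dataset).items():
        print(key, "=", value)
```

The merged view answers questions such as "what is the source of this observation?" without the source being repeated on each observation.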

3.1.3 Data Cube Structures

Once defined, the components are grouped into structures, e.g.:

<http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/DSD/>
    a qb:DataStructureDefinition ;
    rdfs:comment "Data structure definition for hazardous substances reported in the Waterbase - Rivers dataset of the European Environment Agency and used in the Italian pilot of the SmartOpenData project, http://www.smartopendata.eu/"@en ;
    qb:component [qb:attribute dcterms:subject ; qb:componentAttachment qb:DataSet ] ;
    qb:component [qb:attribute dcterms:source ; qb:componentAttachment qb:DataSet ] ;
    qb:component [qb:attribute arpa-components:basePhenomenon ; qb:componentAttachment qb:DataSet ] ;
    qb:component [qb:attribute sdmx-attribute:unitMeasure] ;
    qb:component [qb:attribute arpa-components:hasObservedDeterminand ] ;
    qb:component [qb:attribute arpa-components:station ] ;
    qb:component [qb:measure dbpedia-owl:average ] ;
    qb:component [qb:dimension sdmx-dimension:refPeriod ] .

Structures have two main objectives. Firstly, they make it possible to change the default (qb:Observation) level of attachment of a component. In other words, one can specify whether the value of a component is specific to each observation or can be generalised over a dataset. Secondly, such structures can be re-used across similar datasets. For example, we used the structure from the snippet above for the Waterbase - Lakes dataset:

<http://data.smartopendata.eu/WaterbaseLakes/HazardousSubstances/Dataset/>
    qb:structure <http://data.smartopendata.eu/WtaerbaseLakes/HazardousSubstances/DSD/> .

… and for the Waterbase - Rivers dataset:

<http://data.smartopendata.eu/WaterbaseRivers/HazardousSubstances/Dataset/>
    qb:structure <http://data.smartopendata.eu/WtaerbaseRivers/HazardousSubstances/DSD/> .
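
Both roles of a structure can be illustrated with a small Python sketch. Below, the attachment levels of the structure shown above are written as a dictionary, and a helper reports which components are missing at the level where the structure expects them. This is a simplification made for illustration: every listed component is treated as required, whereas in Data Cube attributes are optional unless explicitly marked as required, and the names `DSD` and `missing_components` are ours.

```python
# Illustrative sketch: one structure, reusable for several datasets.
# "dataset" / "observation" mirror qb:componentAttachment qb:DataSet and the
# default qb:Observation level. All components are treated as required here,
# which is stricter than the Data Cube vocabulary itself.

DSD = {
    "dcterms:subject": "dataset",
    "dcterms:source": "dataset",
    "arpa-components:basePhenomenon": "dataset",
    "sdmx-attribute:unitMeasure": "observation",
    "arpa-components:hasObservedDeterminand": "observation",
    "arpa-components:station": "observation",
    "dbpedia-owl:average": "observation",
    "sdmx-dimension:refPeriod": "observation",
}


def missing_components(dsd: dict, dataset: dict, observation: dict) -> list:
    """List components that are absent at the level the structure attaches them to."""
    missing = []
    for component, level in dsd.items():
        holder = dataset if level == "dataset" else observation
        if component not in holder:
            missing.append(component)
    return missing


if __name__ == "__main__":
    dataset = {
        "dcterms:subject": "http://www.eionet.europa.eu/gemet/concept/9214",
        "dcterms:source": "http://www.eea.europa.eu/data-and-maps/data/waterbase-lakes-10",
        "arpa-components:basePhenomenon": "concentration",
    }
    incomplete_observation = {"dbpedia-owl:average": 0.1}
    print(missing_components(DSD, dataset, incomplete_observation))
```

Because the check only depends on the dictionary describing the structure, the same `DSD` value can be passed for the Lakes and the Rivers datasets alike.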

Definitions of the datasets and structures of the Italian and Portuguese-Spanish pilots are available at http://smod-fp7.github.io/dsd/arpa-dsd-dataset.ttl and http://smod-fp7.github.io/dsd/tragsa-dsd-dataset.ttl respectively.

4 Conclusion

In this deliverable we reported on the final iteration of the task of data harmonisation to the SmOD model. The model is based on several INSPIRE themes that were chosen to represent the domains of the pilots conducted within the project. As Table 8 shows, there is an intersection between the domains of the pilots, for example, in the topic of protected sites, which is relevant to most of the pilots. However, there are INSPIRE topics used by one pilot only, such as Environmental Monitoring Facility and Cadastral Parcel.

[Table 8: Vocabulary usage by pilot. Columns: Italian Pilot, Portuguese-Spanish Pilot, Czech Pilot, Slovak Pilot, Irish Pilot. Rows: SmOD Protected Site, SmOD Land Use, SmOD Bio-Geographical Regions, SmOD Species Distribution, SmOD Corine Land Cover, SmOD Environmental Monitoring Facility, SmOD Custom Vocabulary, SmOD Administrative Units, SmOD Cadastral Parcels, Vocabularies of third parties, Own vocabularies.]

In all the pilots, however, the model has been extended with custom terms. These custom terms were aggregated into a SmOD custom vocabulary, http://www.w3.org/2015/03/inspire/smod#, which currently contains classes and properties that were defined in the scope of the Italian and Portuguese-Spanish pilots. We discussed them in detail in the sections dedicated to each of the pilots.
Other pilots published their custom terms in their own namespaces:
● Czech Pilot - the NFI vocabulary of forest attributes (will be available at http://nil.uhul.cz/lod/nfi/)
● Slovak Pilot - SK SmOD Environmental burdens / Contaminated sites vocabulary - https://data.sazp.sk/vocab/contaminated-sites/

Overall, our observation is that a common data model based on the existing INSPIRE standards facilitated the process of data harmonisation to a greater or lesser extent, depending on the settings and requirements of each pilot. Pilots with INSPIRE-compliant datasets, such as the Slovak pilot, not only used the model as a target schema for RDF transformations, but also took advantage of the transformation tools that exist for INSPIRE-compliant datasets. Other pilots, such as the Portuguese-Spanish pilot, used only a small fragment of the model compared to the required domain extension.

We believe that in some cases custom terms could be found in other INSPIRE themes. Good examples of such cases are smod:catchmentName, which specifies the name of a catchment area in a water basin, and smod:featureName, which specifies the name of the feature of interest being monitored by the Environmental Facility. Both properties originate from the EEA Waterbase databases. These cases should be considered in future work related to the development of the SmOD model.

Technical contributions of the data harmonisation task include three different approaches devised and implemented in different pilots: CSV-to-RDF, XML-to-RDF and Relational DB-to-RDF. We presented the results of using the RDF plugin for OpenRefine to perform CSV-to-RDF transformations and compared it to Grafterizer, a tool that is being actively developed at the moment.
On the one hand, the rich functionality of OpenRefine allowed us to perform various data pre-processing steps and prepare data for RDF mappings. On the other hand, the GUI of the RDF plugin enabled intuitive and interactive construction of RDF skeletons for our data. We reported on several challenging cases in Annex A, but overall we managed to perform all the required transformations. Perhaps the weakest side of this approach is its scalability. Although we did not hit this limitation, we are aware of the existing issues with memory usage by OpenRefine¹⁰⁰. Grafterizer is presented as an alternative solution which is being actively developed and already provides several features not available in OpenRefine, such as reusing utility functions within one transformation, changing the order of operations and editing transformation operations.

In the scope of the Slovak pilot, the existing XML-to-RDF transformations produced by the GeoKnow project were customised to use the SmOD vocabularies and modified for usage within the Unified Views ETL framework¹⁰¹.

Finally, in the setting of the Czech pilot, transformations of data from a relational database to RDF were covered. D2RQ and r2rml-parser are being evaluated for a definitive solution.

¹⁰⁰ https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory
¹⁰¹ https://github.com/jindrichmynarz/TripleGeo/tree/sazp/xslt

5 References

[SMODD33] SmartOpenData EU/FP7 project, Report on the Initial Data Harmonisation, D3.3, publicly available at http://www.smartopendata.eu/sites/default/files/SmartOpenData_D3.3_Initial_Data_Harmonisation.pdf

[SMODD32] SmartOpenData EU/FP7 project, Report on the Initial SmartOpenData Model, D3.2, publicly available at http://www.smartopendata.eu/sites/default/files/SmartOpenData_D3.2_Initial%20Data%20Model.pdf

[SMODD34] SmartOpenData EU/FP7 project, Report on the Final SmartOpenData Model, D3.4, to be published at the website of the project http://www.smartopendata.eu/public-deliverables

[SMODD52] SmartOpenData EU/FP7 project, Report on the First Iteration of pilots, D5.2, to be published at the website of the project http://www.smartopendata.eu/public-deliverables

[HM08] Halpin, Terry; Morgan, Tony (March 2008), Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design (2nd ed.), Morgan Kaufmann, ISBN 978-0-12-373568-3

[SMODD31] SmartOpenData EU/FP7 project, Review of geographic resources metadata and related metadata standards, D3.1, publicly available at http://www.smartopendata.eu/sites/default/files/SmartOpenData%20D3.%201%20Review%20of%20geographic%20resources%20metadata%20and%20related%20metadata%20standards.pdf

Annex A: Generating RDF with OpenRefine: Challenges and Solutions

Language Tag Customisation

In both Italian and Portuguese pilots we faced the need to generate language tagged literals out of a column that contains data in multiple languages. The RDF plugin for OpenRefine allows us to specify only one language tag per Literal node, as shown in Figure 1.

Figure 27: RDF plugin of OpenRefine, language tag

We have identified two possible options to generate RDF with multiple language tags:
● split the input dataset horizontally by language and run RDF transformations for each data slice
● customise the literal node

The first option was applied to produce RDF of the Sicilian Protected Sites out of the NATURA2000SITES table of the NATURA2000 database. This table contains information about Protected Sites in all EU member states. Names, descriptions and the like are all given in the member states’ languages. The goal is to produce the following RDF triple for every Sicilian site:

<protectedsiteURI> rdfs:label <SITENAME>@it .

We created RDF mappings with the “it” language tag and ran them on a subset of the input table with Sicilian sites only. We used OpenRefine to create such a subset. Even though the subset could be generated by pre-processing the input data before uploading it to Refine, doing it in Refine allows one to store all the pre-processing steps in one place together with the RDF mappings.

We used the second option in the Portuguese-Spanish pilot to generate RDF out of datasets with administrative units from both Portugal and Spain. For example, Figure 28 illustrates an excerpt from the municipalities dataset with the Spanish “A Baña” and Portuguese “Soure” municipalities.

Figure 28: Excerpt from the aux_040400_municipality.csv

The goal is to generate a triple of the following form for every municipality of the dataset:

<municipalityURI> ramon:name <name>@<lang-tag> .
If we specify “sp” as the language tag (as shown in Figure 3), we will generate the following two triples:

<http://data.smartopendata.eu/sp-pt-pilot/so/Municipality/ES111010007> ramon:name "A Baña"@sp .
<http://data.smartopendata.eu/sp-pt-pilot/so/Municipality/PT162110615> ramon:name "Soure"@sp .

The second triple is obviously not correct. In order to tell Refine to generate “sp” tagged literals out of the name column only when the municipality is Spanish, we can customise the value of the Literal node as follows:

Figure 29: RDF plugin of OpenRefine, literal node customisation

One can see in the preview that the condition fails on the Portuguese municipality and nothing is output. With this mapping, only one triple is produced out of the two records of the running example:

<http://data.smartopendata.eu/sp-pt-pilot/so/Municipality/ES111010007> ramon:name "A Baña"@sp .

In order to output a similar triple for the Portuguese municipality, we need to add one more mapping of the “name” column to a literal tagged with the language “pt” and customise the value of the “name” column as follows:

if(cells["idState"].value == "PT", value, null)

As a result, we will generate the second triple for the Portuguese municipality:

<http://data.smartopendata.eu/sp-pt-pilot/so/Municipality/PT162110615> ramon:name "Soure"@pt .

RDF out of a List of Values

In the Portuguese-Spanish pilot we had to generate the following RDF triple for every ObservatoryTile:

<observatorytileURI> smod:supports <animalspeciesURI> .
Each observatory tile can support multiple animal species, as shown in the excerpt below from the input file with observatory tiles:

Figure 30: Excerpt from ObservationTiles.csv file

The column “speciesCode” contains a list of all the species supported by the tile with id “ES11100100070100100001001”. For each value in the list we want to construct a URI of the animal. In order to do so, we applied the following custom expression to the value of the “speciesCode” column when mapping it to the animal species node:

forEach(value.split("~~~"), v, "AnimalSpecies/" + v)

More than one Root Nodes

With the RDF plugin for OpenRefine it is possible to generate more than one root node for the same input dataset. We tried this functionality of OpenRefine to transform data of the Portuguese-Spanish pilot. For example, the “Work Unit Location” model (see Annex B) defines, among other instances, instances of two classes: smod:WorkUnit and cp:CadastralParcel. The input dataset for this model was generated from pd_06004_workunion.wkt¹⁰², and contains two columns: “idWorkUnit” and “idParcel”. We could generate work unit and parcel instances together with the gsp:sfWithin relationship, as illustrated in Figure 31.

Figure 31: Excerpt from ObservationTiles.csv file

The problem with this solution is that the input dataset contains duplicates of “idParcel”, since the same parcel can contain more than one work unit. As a result, the RDF contains duplicates of parcel instances, as many as there are duplicates of “idParcel” in the input dataset. For example, there are two records with “idParcel” = “ES1133300010101400001”.
The RDF of these records will look as follows:

    <http://data.smartopendata.eu/sp-pt-pilot/so/WorkUnit/ES1133300010101400001001>
        a smod:WorkUnit ;
        gsp:sfWithin <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001> .
    <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001>
        a cp:CadastralParcel .
    <http://data.smartopendata.eu/sp-pt-pilot/so/WorkUnit/ES1133300010101400001002>
        a smod:WorkUnit ;
        gsp:sfWithin <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001> .
    <http://data.smartopendata.eu/sp-pt-pilot/so/Parcel/ES1133300010101400001>
        a cp:CadastralParcel .

Currently, there is no way in the RDF plugin to generate such instances just once. Our solution was to map instances of parcels in a separate project.

102 Refer to Section 2.1.2, “Data Pre‐Processing”, for more information about data pre‐processing.

Annex B: Portuguese‐Spanish pilot: ORM and RDF Models

The figures below illustrate the ORM models developed for the third release of the Portuguese‐Spanish pilot and their translation to RDF(S). Documentation of the ORM models is available online at http://smod-fp7.github.io/tragsa3/orm/ObjectTypeList.html, and the constraint validation report is available at http://smod-fp7.github.io/tragsa3/orm/ConstraintValidationReport.html.

Chemical Characteristics

Figure 32: Chemical Characteristics: ORM Model
Figure 33: Chemical Characteristics: RDF model

Climatology

Figure 34: Climatology: ORM Model
Figure 35: Climatology: RDF Model

Forestry Tile

Figure 36: Forestry Tile: ORM Model
Figure 37: Forestry Tile: RDF Model

Geometry

Figure 38: Geometry: ORM Model
Figure 39: Geometry: RDF Model

Work Unit Ecosystem

Figure 40: Work Unit Ecosystem: ORM Model
Figure 41: Work Unit Ecosystem: RDF Model

Work Unit Location

Figure 42: Work Unit Location: ORM Model
Figure 43: Work Unit Location: RDF model