Standardisation within the ESS: SDMX present and future

Marco Pellegrino
Eurostat, Statistical Office of the European Union
marco.pellegrino@ec.europa.eu
Luxembourg, October 2015
Eurostat
Outline
• Evolution of SDMX
• Standards integration
- Examples
• Opportunities and challenges
- All good standards change
SDMX provides
• A model to describe statistical data and metadata
• A standard for automated communication from machine to machine
• A technology supporting standardised IT tools
A common language for statistics
• Statisticians agree to use a common description for data and metadata
• The data exchange process is then driven by this common description
• Data descriptions are made available for everybody who wants to understand and reuse the data
Why do we need a model?
• To define and describe statistical processes in a
coherent way
• To standardize process terminology
• To compare and benchmark processes within and
between organisations
• To identify synergies between processes
• To inform decisions on systems architectures and
organisation of resources
The SDMX Components
Describe statistics in a standard way
• Objects and their relationships
• Data Structure Definition (DSD), Concepts, Code List
Central management and standard access
• SDMX Registry, SDMX Web Services
• Cross Domain Concepts
• Cross Domain Code Lists
• Statistical Domains
• Metadata Common Vocabulary
Push
• Provider generates and sends file to receiver
Pull
• Provider opens web service to data
• Receiver downloads regularly
Hub
• Special case of pull: receiver downloads on end user request
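As a minimal sketch of the pull pattern, a receiver could build a data query against the provider's web service. The path layout assumes the data query pattern of the SDMX 2.1 RESTful API (data/{flowRef}/{key} with startPeriod and endPeriod parameters). The base URL, dataflow identifier and series key are hypothetical placeholders, not real endpoints.

```python
from urllib.parse import quote, urlencode


def build_data_query(base_url, flow_ref, key="all", start_period=None, end_period=None):
    """Build an SDMX-style REST data query URL for the pull scenario."""
    flow_part = quote(flow_ref, safe=",")
    key_part = quote(key, safe=".+")
    path = base_url.rstrip("/") + "/data/" + flow_part + "/" + key_part
    params = {}
    if start_period:
        params["startPeriod"] = start_period
    if end_period:
        params["endPeriod"] = end_period
    # Only append a query string when at least one parameter is set.
    return path + "?" + urlencode(params) if params else path


# Hypothetical provider, dataflow and key, for illustration only.
url = build_data_query(
    "https://example.org/sdmx/rest", "EXAMPLE_FLOW", "A.LU.TOTAL", start_period="2010"
)
print(url)
```

In the hub case the same query would be issued on end user request rather than on a schedule.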
Broadening the scope of SDMX
• The same information is needed for exchange
between different steps in a statistical production
process.
• The use of SDMX throughout the process, in
combination with a metadata registry (central
storage of definitions, classifications, etc.) makes
it more efficient and coherent to implement
changes, e.g. in definitions
• Metadata-driven systems
Broadening the scope of SDMX
• Standard metadata layer for the description and use of data and metadata throughout the process
GSBPM and SDMX: towards a more complete picture
SDMX and standards integration
• SDMX promotes an incremental movement towards a
data and metadata sharing model with the production of
comparable and accurate statistics.
• The increasing use of SDMX:
a) improves the quality of the statistical process
b) enables simplified exchange and dissemination
processes, improving timeliness and accessibility
• Statistical integration goes hand-in-hand with technical
integration and standardisation.
Building bridges
…not walls
Building bridges
SDMX and Linked Open Data
https://open-data.europa.eu/en/linked-data
• Based on RDF (Resource Description Framework), a family of specifications published by W3C allowing for machine-actionable, semantically rich linking of things found on the Web.
• Main RDF vocabulary for statistical data: the Data Cube Vocabulary, a simplified version of the SDMX model covering data structures
RDF Data Cube Vocabulary
An SDMX Data Set is structured by an SDMX Data Structure Definition.
The RDF Data Cube Vocabulary, W3C Recommendation 16 January 2014:
http://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
5-star scheme for Linked Open Data
★ Make your stuff available on the Web (whatever format) under an open license.
★★ Make it available as structured data (e.g., Excel instead of image scan of a table).
★★★ Use non-proprietary formats (e.g., CSV instead of Excel).
★★★★ Use URIs to denote things, so that people can point at your stuff.
★★★★★ Link your data to other data to provide context.
The Data Cube Vocabulary
• DataCube is a W3C recommendation, and has gained some momentum
• Data producers using SDMX can also publish in the Data Cube Vocabulary (DCV)
• As with any other RDF publication, the applications processing the RDF must understand the DCV data model to make sense of the data
• Therefore applications wishing to process any additional information added to the DCV triples need to understand the model of the attached data
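To make the DCV data model more concrete, the sketch below maps one SDMX-style observation to subject, predicate, object triples. The qb: namespace and the terms qb:Observation and qb:dataSet come from the Data Cube Vocabulary. The http://example.org/ namespace, the dimension and measure properties built from it, and the sample codes are hypothetical.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def observation_to_triples(dataset_id, obs_id, dimensions, value):
    """Return (subject, predicate, object) triples for one observation."""
    obs = EX + "obs/" + dataset_id + "/" + obs_id
    triples = [
        (obs, RDF_TYPE, QB + "Observation"),
        (obs, QB + "dataSet", EX + "dataset/" + dataset_id),
    ]
    # One triple per dimension: the property identifies the concept,
    # the object identifies the code taken from the code list.
    for concept, code in sorted(dimensions.items()):
        triples.append((obs, EX + "dimension/" + concept, EX + "code/" + concept + "/" + code))
    # The observed value is attached through a (hypothetical) measure property.
    triples.append((obs, EX + "measure/obsValue", value))
    return triples


triples = observation_to_triples("demo", "o1", {"FREQ": "A", "REF_AREA": "LU"}, 42.0)
for triple in triples:
    print(triple)
```

An application reading these triples still has to know that the dimension properties refer to codes and that the measure carries the value. This is the point made above about understanding the DCV data model.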
The SDMX Perspective
If you are using SDMX today (GESMES or XML), what does this mean?
• Most DataCube implementation today is being done by organisations that don’t use SDMX-ML
• For statistical organisations there is an increasing interest in RDF, and there is a need to be able to integrate DataCube as an alternative query and delivery format, sourced originally from existing SDMX-based systems
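SDMX and RDF: Scenario
Using the SDMX component architecture
Diagram: the Statistical Dissemination System feeds the SDMX Writer Interface. Either a Data Cube Writer produces the RDF file directly, or an SDMX-ML file is written and an SDMX-ML to RDF Transformer converts it into the RDF file.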
Scenario: Publish RDF triples as flat files
Publish to a server exposed to the web
Packaged in a meaningful way using named graphs
• Data by data set
• Structures (all in one file, or codelists and concepts in one file and DSDs in another file)
Considerations
• Needs to be kept up to date (either republish as a replace or as an incremental update)
• Simple approach but not easily queryable (discovery and linking tools typically work with SPARQL endpoints)
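One way to package triples by named graph in a flat file is the N-Quads format, where each line carries a fourth term naming the graph. The sketch below is an assumption about how such a file could be produced, not the method used in the presentation. All IRIs under http://example.org/ are hypothetical, and only IRIs and plain string literals are handled.

```python
def iri(value):
    """Wrap an IRI in angle brackets, as N-Quads requires."""
    return "<" + value + ">"


def literal(value):
    """Serialise a plain string literal with minimal escaping."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def to_nquads(triples, graph_iri):
    """Serialise (subject, predicate, object, is_literal) tuples into one named graph."""
    lines = []
    for subject, predicate, obj, is_literal in triples:
        obj_term = literal(obj) if is_literal else iri(obj)
        lines.append(" ".join([iri(subject), iri(predicate), obj_term, iri(graph_iri), "."]))
    return "\n".join(lines) + "\n"


# Hypothetical data set: one graph per data set, as suggested above.
demo_triples = [
    ("http://example.org/obs/1", "http://purl.org/linked-data/cube#dataSet",
     "http://example.org/dataset/demo", False),
    ("http://example.org/obs/1", "http://example.org/measure/obsValue", "42.0", True),
]
nquads = to_nquads(demo_triples, "http://example.org/graph/demo")
print(nquads, end="")
```

Republishing a data set then amounts to regenerating the file for its graph, which matches the "replace" update option listed in the considerations.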
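SDMX and RDF: Scenario
Diagram: the Statistical Dissemination System feeds the SDMX Writer Interface. Either a Data Cube Writer loads the triple store (DataCube) directly, or an SDMX-ML file is written and an SDMX-ML File to RDF Transformer loads it. The triple store is exposed through an RDF service (SPARQL).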
Scenario: Populate a SPARQL endpoint
• Deploy RDF triples in a “triple store”
- Dedicated database system that natively understands SPARQL queries
- Supported by many RDF tools, some supporting a variety of flavours of RDF (XML, Turtle, N-Triples)
- Data could be updated at the level of dataflow
Considerations
• Good support for linking (the reason for LOD)
• Good support for cross-dataflow queries
- Data with some common dimensions
Considerations
• If RDF is treated as a completely separate syntax, then the
burden of data management is doubled
• If it is treated as a delivery format (just another data
writer) then it is relatively easy to implement
• Up-front cost for tools development
• Low ongoing maintenance
• The benefits of RDF-based technology are realized in a
cost-effective manner
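The "just another data writer" idea can be sketched as a common writer interface with one implementation per delivery format. The class and function names below are hypothetical, the CSV writer stands in for any existing format, and the RDF output is a simplified N-Triples-like rendering under a hypothetical http://example.org/ namespace rather than a complete Data Cube serialisation.

```python
import csv
import io
from typing import Iterable, Protocol


class DataWriter(Protocol):
    """Common interface: every delivery format implements write()."""

    def write(self, observations: Iterable[dict]) -> str: ...


class CsvWriter:
    """Stand-in for an existing delivery format."""

    def write(self, observations):
        rows = list(observations)
        if not rows:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()


class DataCubeWriter:
    """RDF added as one more writer behind the same interface."""

    def __init__(self, base="http://example.org/"):  # hypothetical namespace
        self.base = base

    def write(self, observations):
        lines = []
        for index, obs in enumerate(observations):
            subject = "<" + self.base + "obs/" + str(index) + ">"
            for name, value in sorted(obs.items()):
                lines.append(subject + " <" + self.base + "prop/" + name + '> "' + str(value) + '" .')
        return "\n".join(lines) + "\n"


def disseminate(observations, writer: DataWriter) -> str:
    """The dissemination system only depends on the interface."""
    return writer.write(list(observations))


data = [{"FREQ": "A", "OBS_VALUE": "42.0"}]
print(disseminate(data, CsvWriter()))
print(disseminate(data, DataCubeWriter()))
```

Because the data are managed once and only the last step differs, adding RDF does not double the data management burden.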
Building bridges
Data validation
• “Technical”: covered by SDMX today
- Format check (SDMX-ML)
- Codes exist (SDMX DSD)
- Codes used correctly (Dataflow & Constraint)
• “Statistical domain”: not yet covered by SDMX (VTL)
- Value check
- Time series
- Revisions
- Validation expressions
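The two validation levels can be illustrated with a small sketch: a "codes exist" check against code lists, and a simple value check of the kind a domain rule would express. The code lists, dimension names and the non-negativity rule are hypothetical, and the rule is written in plain Python rather than in VTL.

```python
# Hypothetical code lists, standing in for those referenced by a DSD.
CODE_LISTS = {"FREQ": {"A", "Q", "M"}, "REF_AREA": {"LU", "IT"}}


def technical_errors(obs):
    """Technical level: every dimension is present and uses a known code."""
    errors = []
    for dimension, allowed in sorted(CODE_LISTS.items()):
        if dimension not in obs:
            errors.append("missing dimension " + dimension)
        elif obs[dimension] not in allowed:
            errors.append("unknown code " + repr(obs[dimension]) + " for " + dimension)
    return errors


def value_check_errors(obs, minimum=0.0):
    """Statistical domain level: a simple value check on the observation."""
    try:
        value = float(obs["OBS_VALUE"])
    except (KeyError, ValueError):
        return ["OBS_VALUE is missing or not numeric"]
    if value < minimum:
        return ["OBS_VALUE " + str(value) + " is below " + str(minimum)]
    return []


good = {"FREQ": "A", "REF_AREA": "LU", "OBS_VALUE": "42.0"}
bad = {"FREQ": "X", "REF_AREA": "LU", "OBS_VALUE": "-1"}
print(technical_errors(good), value_check_errors(good))
print(technical_errors(bad), value_check_errors(bad))
```

The first function needs only structural metadata, while the second encodes domain knowledge, which is why the two levels are handled by different parts of the standards landscape.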
VTL: Validation and Transformation Language
Standard language for defining validation and transformation
rules
• Validation (now)
• Transformation (partially now, to be enriched at a later stage)
Main goals
• Define and preserve validation and transformation rules
• Exchange and share rules
• Apply rules in industrialized processes
• Apply to several standards (e.g. SDMX, DDI, GSIM) thanks to a
generic information model
DDI: The Data Documentation Initiative
• DDI is split into 2 branches:
- DDI-Codebook (DDI-C): DDI-C is a light-weight version of the standard, intended primarily to document simple survey data.
- DDI-Lifecycle (DDI-L or DDI 3+): DDI-L is designed to document and manage data across the entire life cycle, from conceptualization to data publication and analysis and beyond. DDI-L is currently being evaluated in several statistical organizations across the world.
• The DDI Lifecycle standard provides a data model for describing surveys in a very detailed fashion using XML.
- This can support many parts of the process of survey management, particularly in the case of household surveys, e.g. exchange between question banks and data collection applications, generation of collection instruments, …
DDI: The DDI data lifecycle model
Building bridges
SDMX and DDI
DDI Lifecycle can provide a very detailed set of metadata, covering:
• Surveys and processing of microdata
• Structure of data files, including hierarchical files and complex relationships
• Archiving of data files and their metadata
• Tabulation and processing of data into tables
• Link between microdata variables and resulting aggregates
SDMX can provide:
• Metadata describing the structure of dimensional data
• Stand-alone metadata sets (“reference metadata”)
• Formats for dimensional data
• A model of data reporting and dissemination
• Standard registry interfaces, providing a catalogue of resources
• Guidelines for deploying standard web services
• A way of describing statistical processes
SDMX and DDI: similarities and differences
• Both standards use a similar model for identifiable,
versionable and maintainable artefacts
• Both standards use “schemes”, as packages for lists of
items, and XML “schemas”
• Both standards are designed to support reuse
• DDI has much more detailed metadata at the level of the
study domain, and provides more complete descriptions of
the processing of data
• SDMX provides more architectural components to support
registration, reporting/collecting and exchange, and has a
solid information model
Diagram: GSIM as the conceptual model; DDI and SDMX as implementation standards; other relevant standards, such as geospatial standards.
Opportunities and challenges
• SDMX is interacting well with other standards (GSIM, DDI,
RDF Linked Open Data, JSON) and this “complementarity”
opens up new perspectives for the innovation of statistical
processes.
• Common data validation and processing procedures are
required (from structural validation to content).
• Better metadata-driven statistical production systems, with
the use of standards throughout the processes in
combination with a metadata registry.
• Better maintenance and developments of SDMX (e.g. support
to use cases, new functions, more formats, etc.) using the
wealth of its Information Model.
All good standards change
• Too much change may discourage adoption
But…
• not giving users the functionalities they want would also discourage adoption
SDMX versions
- September 2004: Version 1.0 (GESMES/TS)
- November 2005: Version 2.0 (SDMX-EDI, SDMX-ML, SDMX Registry)
- April 2011: Version 2.1
Where do we want SDMX to be, in 2020?
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where–” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“–so long as I get SOMEWHERE,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”
(Alice’s Adventures in Wonderland, Chapter 6)
Where are we?
• Dramatic changes in the environment of official
statistics producers (e.g. data deluge)
• Modernization of statistical information systems
seen as a question of survival for the sector of
official statistics
• Standardization viewed as a key enabler for
modernization
• Standards-based industrialization of statistical
production
SDMX 2020
Main challenges for the years to come:
• Strengthening implementation
• Facilitating data consumption
• Supporting statistical process innovation
• Enhancing communication
• Investing in training and capacity-building
Action Plan
SWG/TWG's work plan
SDMX present and future
« If you are not sure where you are going
you will finish someplace else »
Thanks for your attention!
Marco.Pellegrino@ec.europa.eu