INEGI: Introduction to SDMX

EDDI: Introduction to SDMX
Arofan Gregory
Open Data Foundation
What is SDMX?
• The problem space:
– Statistical collection, processing, and
exchange are time-consuming and resource-intensive
– Various international and national
organisations have individual approaches for
their constituencies
– Uncertainties about how to proceed with new
technologies (XML, web services …)
[Figure: data flows (transactions, accounts, statistics) among individual households, banks and corporates, national statistical organisations (180+ countries), regional organisations and international organisations, connected through the Internet (search, navigation) via sites such as www.x.org, www.y.org, www.z.org and www.hub.org]
What is SDMX?
The Statistical Data and Metadata
Exchange (SDMX) initiative is taking steps
to address these challenges and
opportunities that have just been
mentioned:
– By focusing on business practices in the field
of statistical information
– By identifying more efficient processes for
exchange and sharing of data and metadata
using modern technology
Historical Note
• SDMX uses an approach based on the 10-year-long success of an earlier standard –
GESMES/TS
• GESMES/TS began as an initiative and is used today
in many countries for collecting and exchanging data
and for updating statistical databases
– GESMES/TS is now SDMX-EDI
• Focus is on time-series, and is mostly used by
central banks
Who is SDMX?
• SDMX is an initiative made up of seven
international organizations:
– Bank for International Settlements
– European Central Bank
– Eurostat
– International Monetary Fund
– Organisation for Economic Cooperation and
Development
– United Nations
– World Bank
• The initiative was launched in 2002
SDMX Products
• Technical standards for the formatting and
exchange of aggregate statistics:
– SDMX Technical Specifications version 1.0 (now
ISO/TS 17369 SDMX)
– SDMX Technical Specifications version 2.0 (submitted
to ISO)
– SDMX Technical Specifications version 2.1 under
review (will be forwarded to ISO)
• Content-Oriented Guidelines
– Metadata Common Vocabulary
– Cross-Domain Statistical Concepts
– Statistical Subject-Matter Domains
Detailed SDMX Goals
• Reducing national reporting burden to international institutions
• Fostering consistency, accuracy, and timeliness between
data and metadata disseminated by national and
international institutions, relying on what is decentrally
released via national websites
• Enhancing national statistical processing efficiency,
especially through internationally-recognised standard
formats for exchanges between statistical silos within
institutions and with other national statistical agencies
• Providing standards for web-based dissemination formats
that are computer readable and facilitate updating of
databases
• Enhancing comparison of data and metadata analysis
through standard formats and content-oriented guidelines
Official Recommendations
• SDMX has been officially recommended:
– February 2007: SDMX endorsed by the
European Union’s Statistical Programme
Committee
– March 2008: UN Statistical Commission
declares SDMX to be the preferred standard
for data and metadata
Exchange Patterns
• Bilateral: Institutions exchange data
according to bilateral agreements
regarding format, timing, protocols, etc.
• Gateway: Institutions share the data they
collect with their peers, in agreed formats
among counterparty communities
• Data-sharing: standard exchange of data
using standard formats and protocols
Bilateral Exchange
Gateway Exchange
Data-Sharing Exchange
Notes About Data-Sharing
• Data-sharing only works if there are standard
formats
• Data-sharing works only if the data themselves
are decentralized
– One big database doesn’t work!
• Like the Web itself, a data-sharing model relies
on pull exchanges, not push exchanges
– Data consumers discover the data they need, and its
location, and then go and get it
– Data producers don’t have to send data
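The pull pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Registry` class, the dataflow identifier `EXR` and the URL are hypothetical names invented here, and a real SDMX registry offers a much richer interface.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a registry of data locations."""
    locations: dict = field(default_factory=dict)

    def register(self, dataflow, url):
        # A producer announces where its data can be found; it sends no data.
        self.locations.setdefault(dataflow, []).append(url)

    def discover(self, dataflow):
        # A consumer asks where data for a dataflow is located.
        return list(self.locations.get(dataflow, []))


def pull(registry, dataflow, fetch):
    """Discover every registered location, then go and get the data."""
    return [fetch(url) for url in registry.discover(dataflow)]


registry = Registry()
registry.register("EXR", "https://stats.example.org/exr.xml")  # hypothetical URL

# `fetch` is injected so that the sketch runs without network access.
documents = pull(registry, "EXR", fetch=lambda url: f"<data from {url}>")
print(documents)
```

The producer only registers a location; the consumer decides when to retrieve the data, which is the difference between pull and push.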
SDMX View
• SDMX products support all types of
exchange
• One major requirement is to work well with
existing systems, to protect technology
investments
• SDMX promotes an incremental
movement toward the data-sharing model
Exchange with Peer Organizations
• SDMX-EDI and SDMX-ML are both able to
exchange databases between peer
organizations
• Structural metadata is also exchanged and can
be read by counterparty systems
• Incremental updating is possible
• Increases degree of automation for exchange –
lowers degree of bilateral, verbal agreement
• Can use “pull” instead of “push” if registry is
deployed
Integration within an Organization
• SDMX standard formats are also useful within
an organization
– Many organizations have several disparate databases
– Differences in database structure and content can
make it difficult to use other systems’ data
– SDMX-ML provides a way to loosely couple such
databases, while facilitating exchange
– An SDMX registry can allow visibility into other
databases, while not affecting control or ownership of
data
Data Collection and Warehousing
• When data is collected from many different
sources, it can be in a wide variety of
formats
– Typically metadata-poor
• SDMX allows for a single, metadata-rich
reporting format for each type of data
• Existing counterparty systems can be
“wrappered” to support SDMX for
exchange only
Adoption of SDMX
• SDMX has been aggressively adopted, as
compared to other international technology
standards
– Many important data sets are available in
SDMX-ML today
– There are many prototypes and planned
projects at the national and international level
– Increasing numbers of tools are available
which support SDMX
Adopters/Interest
• The following are known adopters (or planning to adopt):
– US Federal Reserve Board and Bank of New York
– European Central Bank
– Joint External Debt Hub (WB, IMF, OECD, BIS)
– UN/TRADECOM at UN Statistical Division
– NAAWE (National Accounts from OECD/Eurostat)
– European Statistical System (Eurostat and National Statistical Institutes)
– Mexican Federal System
– Vietnamese Ministry of Planning and Investment
– Qatar Information Exchange
– IMF (BOP, SNA, SDDS/GDDS)
– Food and Agriculture Organization
– Millennium Development Goals (UN System, others)
– International Labor Organization
– Bank for International Settlements
– OECD
– World Bank World Development Indicators (WDI)
– Marchioness Islands (Spanish/Portuguese Statistical Region)
– UNESCO (Education)
– Australian Bureau of Statistics
– WHO (SDMX-HD)
– Statistics Canada
There are many others!
SDMX and Domains
• SDMX is organized as a central standard,
created and supported by the SDMX Initiative
– Each statistical domain creates its own domain
standard
– Example: WHO has created SDMX-HD (“Health
Domain”) for monitoring disease
outbreaks/epidemiology
– Example: UNESCO and Eurostat have developed
standard SDMX applications for Education Statistics
• You should look at the work in the different
domains when applying SDMX to different
national-level statistics collection
US Federal Reserve Board
• Several important data sets are available –
and searchable at a granular level – using
SDMX
• SDMX-ML is both a web-delivery format and
an internal exchange format for production of
data
http://www.federalreserve.gov/datadownload/default.htm
Federal Reserve Bank of New York
• Historical data – once stored in huge CSV
files – is now available as SDMX-ML
• Increased the use of the site
• The “typical user” is now a machine
http://www.newyorkfed.org/xml/index.html
European Central Bank
• ECB uses SDMX-EDI to exchange data with
European Central Banks
• SDMX-ML is used for web dissemination
– Simultaneous release on many CB sites
– Each site can use its own language and look & feel
– Data warehouse now available in SDMX-ML
• Built and maintained using SDMX standards
http://www.ecb.int/stats/exchange/eurofxref/html/index.en.html
http://stats.ecb.europa.eu/stats/sdmx/visualisation/icp/dashboard/rc1/
• ECB’s Statistical Data Warehouse/web service
OECD
• Data structures are specified using SDMX
standards
• Data sets are held in SDMX-ML format and
navigated “on the fly”
– OECD.Stat
• http://stats.oecd.org/WBOS/index.aspx
• Experimenting with graphical presentation of
data
• Serves all OECD data as SDMX through
OECD.stat web service
Eurostat
• Builds on long experience of using GESMES for data
transmission (GESMES is main format for transmission
of data in several important domains e.g. national
accounts, balance of payments, short-term statistics)
• More than 50 Data Structure Definitions for GESMES
developed and maintained (in partnership with ECB)
• Software components developed and made available as
open-source software (see Tools page of SDMX website)
• Now creating a portal for all European Census data,
collected as SDMX
SDMX Specifications and Products
SDMX Information Model: High-Level Schematic
[Diagram. Relationships shown:]
– A Data or Metadata Flow uses a specific data/metadata structure (Data or Metadata Structure Definition)
– A Data or Metadata Flow can be linked to categories in multiple category schemes
– A Data or Metadata Flow can get data/metadata from multiple data/metadata providers
– A Data or Metadata Set conforms to the business rules of the data/metadata flow
– A Data Provider publishes/reports data/metadata sets
– A Data Provider can provide data/metadata for many data/metadata flows using the agreed data/metadata structure
– A Provision Agreement links a Data Provider and a Data or Metadata Flow
– A Registered Data or Metadata Set registers the existence of data and metadata, and is registered for a Provision Agreement
– A Category Scheme comprises subject or reporting categories
– A Category can have child categories
SDMX Technical Specs v 1.0
• Information Model (data structure
definitions and data formats)
• SDMX-ML: XML formats for data structure
definitions and data
• SDMX-EDI: EDI formats for data structure
definitions and data
• Web-Services Guidelines
• User Guide
Technical Notes on Version 1.0
• Only numeric observations were supported
• Only coded key values were supported
• Intended to provide an XML version of the
existing GESMES/TS data model
– GESMES/TS became SDMX-EDI
– XML extended the data model to provide for
more types of groups and cross-sectional data
• Hierarchical codelists not supported
SDMX Technical Spec v. 2.0
• Expanded data model includes
– Registry interfaces
– Metadata structures and formats
– Data and metadata provisioning
– Other advanced features (process flow,
reporting taxonomy, structure mapping, etc.)
• Data formats now include uncoded
dimensions, hierarchical codelists, and
non-numeric observations
Technical Notes on Version 2.0
• A very large expansion of scope
– Model covers the process of statistical
exchange, not just the data formats
– Many cases which version 1.0 could not
support were included in version 2.0 as a
result of implementations
• Full support for the “data sharing” pattern
of exchange
– Resulting from the inclusion of the registry
Changes for Version 2.1
• Expanded Web Services Guidelines
– Standard WSDL Functions
– Standard RESTful syntax (URL-based API)
– Standard Error Codes
– Will allow for interoperable web services for SDMX – so generic
clients can use multiple sources
• Simplified Data Formats
– All data formats will be more consistent
– Cross-sectional and time-series formats are more similar
• SDMX Query has been improved
• Note: SDMX 2.1 is available for public review now!
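To make the idea of a URL-based API concrete, the sketch below builds a data query URL. The layout used here (`/data/{flow}/{key}` with dot-separated dimension values and `startPeriod`/`endPeriod` parameters) is an assumption that is not taken from these slides, and the entry point and dataflow identifier are hypothetical.

```python
from urllib.parse import urlencode


def data_query_url(base, flow, key, start_period=None, end_period=None):
    """Build a URL-based data query.

    Assumed layout (not taken from the slides): {base}/data/{flow}/{key},
    where `key` holds one position per dimension, separated by dots;
    an empty position acts as a wildcard.
    """
    params = {}
    if start_period:
        params["startPeriod"] = start_period
    if end_period:
        params["endPeriod"] = end_period
    url = f"{base.rstrip('/')}/data/{flow}/{'.'.join(key)}"
    return f"{url}?{urlencode(params)}" if params else url


url = data_query_url(
    "https://ws.example.org/rest",  # hypothetical entry point
    "EXTERNAL_DEBT",                # hypothetical dataflow identifier
    ["Q", "ZA", "B", "1"],          # frequency, country, topic, stock/flow
    start_period="1999-Q1",
)
print(url)
```

Because every source would accept the same URL pattern, a generic client only needs to change the entry point to query a different provider.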
SDMX Content-Oriented
Guidelines
• Four documents:
– Overview
– Metadata Common Vocabulary
– Cross-Domain Concepts
– Statistical Subject-Matter Domains
• These will not become ISO specifications,
but will evolve as publications of the
SDMX Initiative
Metadata Common Vocabulary
• A set of terms and definitions for the
different parts of the SDMX technical
standards, and many common concepts
used in data and metadata structures
• Does not replace other major vocabularies
in this space (such as the OECD glossary)
but references these other works
Cross-Domain Concepts
• Includes concepts which are common
across many statistical domains
– Names & Definitions
– Representations
• These are concepts which support both
data and metadata structures
Statistical Subject-Matter Domains
• Based on the UN/ECE classification of
statistical activities
• Provides a classification system for use in
exchanging statistics across domain
boundaries
• Provides a breakdown of the various
domains within official statistics
SDMX and Data Formats
Data Set
We have a dataset, what do we
need to know?
• Version 1.0
– What it is and how it is structured
• Version 2.0
– Who reports/disseminates it
– How a specific data set fits into the overall
collection framework and which organisation
is responsible for reporting which parts
– The reporting/publication schedule
– That it has been reported/published
Data Set: Structure
First: Identify the Concepts
• A concept is a unit of knowledge created by a
unique combination of characteristics (SDMX
Information Model)
Data Set Structure: Concepts
[Figure: a data table annotated with the concepts Topic, Country, Stock/Flow, Unit Multiplier, Unit and Time/Frequency]
Computers need structure of data
• Concepts
• Code lists
• Data values
• How these fit together
Data Set Structure: Code Lists
Concepts: Topic, Country, Flow
Code Lists:
TOPIC              COUNTRY          STOCK/FLOW
A Brady Bonds      AR Argentina     1 Stock
B Bank Loans       MX Mexico        2 Flow
C Debt Securities  ZA South Africa
Data Makes Sense
Q,ZA,B,1,1999-06-30=16547
16457
Data Set Structure: Defining Multidimensional Structures
• Comprises
– Dimensions: concepts that identify the observation value
– Attributes: concepts that add additional metadata about the
observation value
– Measure: concept that is the observation value
– Any of these may have a Representation that is
• coded
• text
• date/time
• number
• etc.
Data Set Structure: Concept Usage
– Country (Dimension)
– Stock/Flow (Dimension)
– Unit Multiplier (Attribute)
– Unit (Attribute)
– Time/Frequency (Dimensions)
– Topic (Dimension)
– Observation (Measure)
Data Structure Definition
[Diagram. Relationships shown:]
– A Data Structure Definition comprises a Key, a Group Key, Attributes and Measures
– Key: Dimensions, the concepts that identify the observation
– Group Key: concepts that identify groups of keys
– Measures: concepts that are the observed phenomenon
– Attributes: concepts that add metadata
– Dimensions, Attributes and Measures each take their semantic from a Concept (e.g. Topic, Country, Flow)
– Each has a format given by a Representation, which is either non-coded or coded
– A coded Representation has a Code List (e.g. TOPIC: A Brady Bonds, B Bank Loans, C Debt Securities)
Data Makes Sense
Frequency,Country,Topic,Stock/Flow,Time=Observation
Q,ZA,B,1,1999-06-30=16547
Quarterly, South Africa, Bank Loans, Stocks, 2nd quarter 1999
16457
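The way the key, the code lists and the observation fit together can be shown with a short decoding routine. This is a minimal sketch: the code lists are the ones shown on the slides (the frequency list contains only the single code used in the example), and the function name `decode` is made up for illustration.

```python
# Code lists taken from the slides; Frequency holds only the one code shown.
CODE_LISTS = {
    "Frequency": {"Q": "Quarterly"},
    "Country": {"AR": "Argentina", "MX": "Mexico", "ZA": "South Africa"},
    "Topic": {"A": "Brady Bonds", "B": "Bank Loans", "C": "Debt Securities"},
    "Stock/Flow": {"1": "Stock", "2": "Flow"},
}
DIMENSIONS = ["Frequency", "Country", "Topic", "Stock/Flow"]  # order of the key


def decode(record):
    """Split 'key=value' and translate each coded dimension via its code list."""
    key, value = record.split("=")
    *codes, time = key.split(",")
    if len(codes) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} coded dimensions, got {len(codes)}")
    decoded = {dim: CODE_LISTS[dim][code] for dim, code in zip(DIMENSIONS, codes)}
    decoded["Time"] = time
    decoded["Observation"] = int(value)
    return decoded


print(decode("Q,ZA,B,1,1999-06-30=16547"))
```

Without the structure (dimension order and code lists), the record is just a string of characters; with it, a program can interpret every position.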
Identifying Concepts
• Identifying Concepts - Sources
– Existing data set tables
• From website
• From applications
– Data Collection Instruments
• Questionnaires
• Excel spreadsheets
– Regulations, Handbooks, User Guides
• Labour Statistics Convention, 1985 (No. 160), Recommendation,
1985 (No. 170)
• Council Regulation No: 311/76/EEC of 09/02/1976; OJ: L039 of
14/02/1976; Compilation of statistics on foreign workers
– Database Tables
– Existing Data Structure Definitions
• From other organisations
Identify Concepts – from website
[Figure: a data table from a website, annotated with the concepts Measure Type, Frequency and Time, Commodity, Reference Region, Unit and Unit Multiplier (Measurement = 1,000 Kg) and Observation Value]
Source: FAO proof of concept project
Concept Role: Reminder
• Dimensions
– Are the concepts that identify the observation value
• Attributes
– Are the concepts that add additional metadata about
the observation value
• Measure
– Is the concept that is the observation value
Exercise: Concept Role
– Measure Type (Dimension)
– Frequency and Time (Dimensions)
– Commodity (Dimension)
– Reference Region (Dimension)
– Unit and Unit Multiplier (Attributes), e.g. Measurement = 1,000 Kg
– Observation Value (Measure)
Data Set and Structure
Dimension Concept
FREQ
REF_AREA_REG
COMMODITY
MEASURE_TYPE
TIME
Measure Concept
OBS_VALUE
Attribute Concept
OBS_STATUS
OBS_CONF
UNIT
UNIT_MULTIPLIER
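The assignment of concepts to roles listed above can be captured in a small data structure. The sketch below is a simplification, not the SDMX Information Model itself: the class name and its methods are invented here, and the observation values are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructureDefinition:
    """Simplified stand-in for a DSD: concept identifiers grouped by role."""
    dimensions: tuple
    measure: str
    attributes: tuple

    def role_of(self, concept):
        if concept in self.dimensions:
            return "dimension"
        if concept == self.measure:
            return "measure"
        if concept in self.attributes:
            return "attribute"
        raise KeyError(f"{concept} is not part of this structure")

    def observation_key(self, observation):
        """Return the dimension values, in order, that identify an observation."""
        return tuple(observation[d] for d in self.dimensions)


dsd = DataStructureDefinition(
    dimensions=("FREQ", "REF_AREA_REG", "COMMODITY", "MEASURE_TYPE", "TIME"),
    measure="OBS_VALUE",
    attributes=("OBS_STATUS", "OBS_CONF", "UNIT", "UNIT_MULTIPLIER"),
)

# All values below are invented for illustration only.
obs = {"FREQ": "A", "REF_AREA_REG": "X1", "COMMODITY": "C01",
       "MEASURE_TYPE": "PROD", "TIME": "2005", "OBS_VALUE": 12.5,
       "UNIT": "KG", "UNIT_MULTIPLIER": "3"}
print(dsd.observation_key(obs))
print(dsd.role_of("UNIT"))
```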
Identify/Define Code Lists
• Purpose of a Code List
– Constrains the value domain of concepts when used
in a structure like a data structure definition
– Defines a shortened language independent
representation of the values
– Gives semantic meaning to the values, possibly in
multiple languages
• Agreeing on harmonised code lists is the most
difficult aspect of defining a data structure
definition
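The three purposes of a code list named above (constraining values, giving a short language-independent code, and carrying labels in several languages) can be illustrated as follows. This is a minimal sketch with an invented `CodeList` class; the codes and English labels come from the earlier slides, while the Spanish labels are added here only as an example.

```python
class CodeList:
    """Minimal code list: language-independent codes with multilingual labels."""

    def __init__(self, identifier, codes):
        self.identifier = identifier
        self._codes = codes  # {code: {language: label}}

    def is_valid(self, code):
        # The code list constrains the value domain of a concept.
        return code in self._codes

    def label(self, code, language="en"):
        # The label gives the code its meaning in the requested language.
        return self._codes[code][language]


# Codes and English labels are from the slides; the Spanish labels are
# an illustrative addition.
cl_country = CodeList("COUNTRY", {
    "AR": {"en": "Argentina", "es": "Argentina"},
    "MX": {"en": "Mexico", "es": "México"},
    "ZA": {"en": "South Africa", "es": "Sudáfrica"},
})

print(cl_country.is_valid("ZA"), cl_country.label("ZA"), cl_country.label("ZA", "es"))
print(cl_country.is_valid("XX"))
```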
Data Structure Definition - Reminder
[Diagram. Relationships shown:]
– A Data Structure Definition comprises a Key, a Group Key, Attributes and Measures
– Key: Dimensions, the concepts that identify the observation
– Group Key: concepts that identify groups of keys
– Measures: concepts that are the observed phenomenon
– Attributes: concepts that add metadata
– Dimensions, Attributes and Measures each take their semantic from a Concept
– Each has a format given by a Representation, which is either non-coded or coded
– A coded Representation has a Code List
SDMX and Data Formats
Session: SDMX Syntax
Implementations for Data
SDMX Data Syntax Implementations
• SDMX provides for two main syntaxes:
– UN/EDIFACT (for SDMX-EDI)
– XML (for SDMX-ML)
• Each syntax provides a format for
describing data structure definitions
• Each syntax provides at least one format
for data
– There are 4 different XML syntaxes for data
SDMX-EDI
• EDI – “electronic data interchange” – is an
older, flat-file syntax used primarily to
conduct e-commerce
– There have been a few statistical messages
– GESMES is the “generic statistical message”
• EDI messages are difficult to read unless
you know EDI very well…
Benefits of SDMX-EDI
• As a data format, it is very compact
– Good for very large data sets
• Permits incremental updating of data sets
• Permits attributes and observations to be
sent separately
• Has a very large installed base within the
European community and the central
banks (used by 180 countries)
• It is not very Web-oriented, however
SDMX-ML Document Types (Data)
• Structure Message: Holds the agencies, concepts,
codelists, and data structure definitions (DSDs)
• Generic Format: A single XML schema for all different types
of data, regardless of data structure definition
• Utility Format: Specific to DSD, provides strongest
validation
• Compact Format: Like the EDI message, compact, but not
as much validation as Utility
• Cross-Sectional Format: Similar to Compact, but holds
cross-sectional data
• Data Query Message: Allows for querying of online
databases and similar applications which are SDMX-aware.
Supports web services.
The SDMX-ML Data Formats
• In designing the XML formats for SDMX, several different
needs were identified
– Needed an XML format for describing data structure definitions
– Needed an XML version of the EDIFACT messages for transmitting
large databases
– Needed an XML which would help validate statistical data sets
– Needed an XML which could be used generically for any statistical
data set
– Needed an XML for transmitting cross-sectional data
– Needed a message to query for data
• Because SDMX-ML is based on the SDMX Information
Model, it was decided to create several equivalent XML data
formats, to satisfy each of these cases
– Requirements were mutually exclusive for these cases
Generic Data Message
• No validation
• Carries data for any data structure definition
• Verbose – files are very large
• Can perform incremental updates and carry
partial data sets
• Useful for applications which need to carry
potentially incorrect data for processing and
cleaning
• Useful for generic applications which handle
data for more than one DSD
• Serves as a “pivot format” between other SDMX-ML format types
Utility Data Message
• Provides strongest validation – all
business rules in DSD are enforced by a
generic XML parser (schemas are specific
to particular DSDs)
• Less verbose than Generic; more verbose
than Compact & Cross-Sectional
• Incremental updates not supported
• For XML tools, this is the most “normal”
type of XML schema – performs best
Compact Data Message
• Equivalent of SDMX-EDI data format, but
schemas are specific to a particular DSD
• Good for exchanging partial data sets and
incremental updates
• Very compact (for XML) in terms of file sizes
• Very simple, but performs limited validation
– Will validate codelists, but not some other things
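The difference in verbosity between the Generic and Compact styles can be seen by serialising the same observation both ways. The sketch below is deliberately simplified: it omits namespaces and message headers, the element and attribute names only approximate the two styles, and the concept identifiers are assumptions, so the output is not schema-valid SDMX-ML.

```python
import xml.etree.ElementTree as ET

# Concept identifiers are illustrative; the codes come from the earlier example.
key = {"FREQ": "Q", "COUNTRY": "ZA", "TOPIC": "B", "STOCK_FLOW": "1"}
time, value = "1999-06-30", "16547"

# Generic style: one fixed vocabulary; concepts appear as attribute *values*,
# so the same schema can carry data for any DSD.
generic = ET.Element("Series")
series_key = ET.SubElement(generic, "SeriesKey")
for concept, code in key.items():
    ET.SubElement(series_key, "Value", concept=concept, value=code)
obs = ET.SubElement(generic, "Obs")
ET.SubElement(obs, "Time").text = time
ET.SubElement(obs, "ObsValue", value=value)

# Compact style: DSD-specific; concepts appear as attribute *names*,
# which is shorter and lets a schema check the codes of each attribute.
compact = ET.Element("Series", key)
ET.SubElement(compact, "Obs", TIME=time, OBS_VALUE=value)

generic_xml = ET.tostring(generic, encoding="unicode")
compact_xml = ET.tostring(compact, encoding="unicode")
print(len(generic_xml), generic_xml)
print(len(compact_xml), compact_xml)
```

The size gap grows with the number of series, which is why the compact style suits large data sets while the generic style suits applications that handle many DSDs.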
Cross-Sectional Data Message
• Similar to Compact format, but allows for
lots of observations for a single point in
time (not time-series oriented like other
formats)
• Very compact
• Supports incremental updates
• Provides limited validation – schemas are
specific to a particular DSD
Selecting the Right SDMX-ML
Format
• Free tools allow transformation between data
formats without any loss – each application can
use one or more formats for specific tasks
• Depending on the application, one format may
be preferable to another
– How large are the data files?
– How much validation needs to be performed?
– How many DSDs are supported by the application?
– Will all data be correct when received (according to
the DSD)?
SDMX-ML “Model-Driven” XML Approach
DSD
Additional SDMX Features
• Hierarchical Code List
• Structure Set (mappings)
• Reporting Taxonomy
Hierarchical Code Lists – Example Scenario
• France is a country
• France is part of the continent of Europe
• France is a member of NATO
• France is a member of the EU
• France is a member of the G10
• When I analyse statistics I might want to see totals by
– continent
– trading bloc
– military alliance
– financial grouping
• France will be grouped with different sets of countries
depending on the “view” required
• How do we express these groupings?
[Figure: one Code List and four hierarchies (Hierarchy-1 to Hierarchy-4) built from it. Each hierarchy is a Code Composition; each row is a Code Association of a code with its parent code.]
Code List: Reference Area
6B NATO
B0 EU
B1 NAFTA
BE Belgium
BG Bulgaria
CA Canada
CH Switzerland
CZ Czech Republic
DE Germany
DK Denmark
E1 Europe
E8 North America
EE Estonia
ES Spain
FI Finland
FR France
GB United Kingdom
GR Greece
HU Hungary
JP Japan
I2 Euro 12
IT Italy
NE Netherlands
US United States
Hierarchies (codes and their parent code):
– Europe (parent E1): BE, BG, CH, CZ, DE, DK, EE, ES, FI, FR, GB, etc.
– EU countries (parent E0): BE, CZ, DE, DK, EE, ES, FI, FR, GB, etc.
– NATO countries (parent 6B): BE, BG, CA, CZ, DE, DK, EE, ES, FR, GB
– G10 countries (parent G0): BE, CA, CH, DE, FR, GB, JP, IT, NL, SE, US
– North America and NAFTA countries (parent B1): CA, MX, US, etc.
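One way to express these groupings in code is to keep the code list flat and store each hierarchy as a separate set of code-to-parent associations. The sketch below uses a subset of the codes and parent codes from the example; the function names and the numeric values are invented for illustration.

```python
# Child -> parent associations, one mapping per hierarchy, using a subset of
# the codes in the example (parent codes E1, E0, 6B and G0 as shown there).
HIERARCHIES = {
    "Europe": {"BE": "E1", "CH": "E1", "DE": "E1", "FR": "E1", "GB": "E1"},
    "EU countries": {"BE": "E0", "DE": "E0", "FR": "E0", "GB": "E0"},
    "NATO countries": {"BE": "6B", "CA": "6B", "DE": "6B", "FR": "6B", "GB": "6B"},
    "G10 countries": {"BE": "G0", "CA": "G0", "CH": "G0", "DE": "G0",
                      "FR": "G0", "GB": "G0", "JP": "G0", "US": "G0"},
}


def parents_of(code):
    """Return {hierarchy name: parent code} for every hierarchy containing `code`."""
    return {name: assoc[code] for name, assoc in HIERARCHIES.items() if code in assoc}


def total_by_parent(hierarchy, values):
    """Aggregate per-country values to the parent codes of one hierarchy."""
    totals = {}
    for code, value in values.items():
        parent = HIERARCHIES[hierarchy].get(code)
        if parent is not None:
            totals[parent] = totals.get(parent, 0) + value
    return totals


values = {"FR": 10, "CH": 5, "CA": 7}  # invented figures
print(parents_of("FR"))
print(total_by_parent("Europe", values))
print(total_by_parent("NATO countries", values))
```

The same country code is reused in every hierarchy, so adding a new "view" means adding a new set of associations rather than a new code list.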
Schematic of the Hierarchical Code Scheme
[Diagram. Relationships shown:]
– A Hierarchical Code Scheme comprises hierarchies and code groups
– Each Code belongs to a Code List; the codes may be in a variety of code lists
– A Code Association relates a code to a parent code; a Property holds properties of the association
– A Code Composition groups codes with the same parent
– A Hierarchy is either value based (has code groups) or level based (has formal levels)
– A Level comprises code groups
Item Scheme Maps
• Many types of “item scheme” use the
same fundamental structure
– Code list
– Category scheme
– Concept scheme
• Two Item Schemes can be mapped
Schematic of the “Code” Mapping
[Diagram. Relationships shown:]
– An Item Scheme Association (Code List Map, Category Scheme Map or Concept Scheme Map) links a source item scheme to a target item scheme
– Source and target item schemes are Item Schemes: Code List, Category Scheme or Concept Scheme
– An Item Scheme Association has item associations
– An Item Association links a source item to a target item
– Items are Codes, Categories or Concepts
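The mapping between two item schemes can be sketched as a list of source-to-target item associations. Everything specific in this example is hypothetical: the class names follow the diagram loosely, and the scheme identifiers and three-letter target codes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemAssociation:
    source_item: str
    target_item: str


@dataclass
class ItemSchemeMap:
    """Maps items of a source item scheme to items of a target item scheme."""
    source_scheme: str
    target_scheme: str
    associations: list

    def translate(self, source_item):
        for assoc in self.associations:
            if assoc.source_item == source_item:
                return assoc.target_item
        raise KeyError(f"{source_item} has no association in this map")


# Hypothetical code list map: the scheme identifiers and the three-letter
# target codes are illustrative and do not come from the slides.
country_map = ItemSchemeMap(
    source_scheme="CL_COUNTRY_2",
    target_scheme="CL_COUNTRY_3",
    associations=[
        ItemAssociation("AR", "ARG"),
        ItemAssociation("MX", "MEX"),
        ItemAssociation("ZA", "ZAF"),
    ],
)
print(country_map.translate("ZA"))
```

The same structure would serve for category schemes and concept schemes, since all three are item schemes.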
Structure Maps
• Structures can also be mapped
– Data structures
– Metadata structures
[Diagram: Data/Metadata Reporting, Query, Analysis, Mapping. Objects shown: Structure & Item Scheme Maps; Data or Metadata Structure Definition; Category Scheme; Category; Data or Metadata Set; Data or Metadata Flow; Content Constraint; Attachment Constraint; Data Provider; Provision Agreement; Registered Data Set or Metadata Set]
Reporting Taxonomy
• An SDMX Reporting Taxonomy is a group
of data flows and/or metadata flows which
form the basis of a single real-world
document or report
• They can be organized into groups and
sub-groups as needed
• They can be named and identified
• Useful for managing various types of
reports over time
Processes
• SDMX 2.0 provides the ability to document
the steps and logic of a process flow
• This is not executable, but serves as
documentation to describe the processes
which produce data and metadata
• It is useful as a target for the attachment of
reference metadata describing processing
SDMX and Metadata Formats
Reference Metadata
• We have seen how data values are limited to where they
belong
– Series key (usually qualified by time)
• Data attribute values are limited in where they belong
– Observation value
– Series key
– Group key
– Data set
• Metadata is everywhere, but
– it must be metadata about “something”
• what is the “something”?
• how is it identified?
– it comprises concepts: how are they structured?
• The Metadata Structure Definition answers these
questions
• Advance release calendar is only one possible example
Metadata Example: Advance
Release Calendar (ARC)
[Figure: a release calendar for Labour Force Statistics]
• What is the release calendar for?
– Informs when data will be
published/made available
• Who publishes the data set?
• What type of data is it (data flow)?
• What metadata is in the release
calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
Metadata Structure Definition (MSD)
[Figure: release calendar]
Structure
• Concepts
• Hierarchies
• Representation (e.g. code list)
Metadata Structure Definition (MSD)
Report Structure
[Diagram. Relationships shown:]
– A Metadata Structure Definition can comprise the specification of one or more Metadata Reports
– A Metadata Report contains Metadata Attributes, which can have a hierarchy
– Each Metadata Attribute takes its semantic and context from a Concept
– A Concept is defined in a Concept Scheme and can have a hierarchy
– Each Metadata Attribute has a definition of format and permitted values (Format and Permitted Value List)
Example ARC Metadata
Identifiers: Ref Area, Indicator
Day         Ref Area    Indicator  Ref Period     Time   Tolerance  Status
30-04-2007  INE, Spain  LF-H       Q: 31-03-2007  09:00  +24 Hr.    Final
30-04-2007  INE, Spain  LF-E       Q: 31-03-2007  09:00  +24 Hr.    Final
30-04-2007  ONS, UK     LF-H       Q: 31-03-2007  09:00  +48 Hr.    Final
30-04-2007  ONS, UK     LF-E       Q: 31-03-2007  09:00  +48 Hr.    Draft
MSD Metadata Concepts: Advance
Release Calendar
Concept Id: Description
– REFERENCE_PERIOD: The time period to which a variable refers
– RELEASE_DATE_TIME: The specific point in time that data or metadata are made
available
– DATE_TOLERANCE: The possible or permissible variance of a time period
relative to a known point in time
– RELEASE_STATUS: The state of preparedness of a statement on the
availability of data or metadata
– ANNOTATION: Additional metadata
MSD: Report Structure for ARC
[Diagram. Values shown:]
– Metadata Structure Definition: ARC_METADATA
– Metadata Report: ARC
– Metadata Attributes: REFERENCE_PERIOD, RELEASE_DATE_TIME, DATE_TOLERANCE, RELEASE_STATUS, ANNOTATION, each with a Format and Permitted Value List
– Each Metadata Attribute takes its Concept from the Concept Scheme MY_AGENCY:METADATA_CONCEPTS (REFERENCE_PERIOD, RELEASE_DATE_TIME, DATE_TOLERANCE, RELEASE_STATUS, ANNOTATION)
MSD: Metadata Report Structure
Metadata Report = ARC
Target Id =
Metadata Attribute: Concept = Reference_Period; Representation = Date/Time
Metadata Attribute: Concept = Release_Date_Time; Representation = Date/Time
Metadata Attribute: Concept = Date_Tolerance; Representation = Time Value
Metadata Attribute: Concept = Release_Status; Representation = CL_Status (F Final, P Provisional)
Metadata Attribute: Concept = Annotation; Representation = Text
Metadata Set: ARC Report Example
Metadata Set
Metadata Structure = ARC_METADATA
Metadata Report = ARC
Identifiers
Metadata Attributes
Concept = Reference_Period; Value = 2007-03-31
Concept = Release_Date_Time; Value = 2007-04-30T09:00
Concept = Date_Tolerance; Value = +24Hr
Concept = Release_Status; Value = F
Concept = Annotation; Value = simultaneous release by ECB
Metadata Structure Definition (MSD)
To which object is the metadata attached?
[Diagram. Relationships shown:]
– A Metadata Structure Definition can comprise the specification of one or more Metadata Reports
– A Metadata Report links to a Target Identifier
– A Metadata Report contains Metadata Attributes, which can have a hierarchy
– Each Metadata Attribute takes its semantic and context from a Concept
– A Concept is defined in a Concept Scheme and can have a hierarchy
– Each Metadata Attribute has a definition of format and permitted values (Format and Permitted Value List)
Data Flows: Controlling Reporting
and Publishing
[Diagram, release calendar example. Relationships shown:]
– A Data Flow uses a specific data structure (Structure Definition)
– A Data Set conforms to the business rules of the dataflow
– A Data Provider publishes/reports data sets
– A Data Provider can provide data for many data flows using the agreed data structure
– A Data Flow can get data from multiple data providers
– A Provision Agreement links a Data Provider and a Data Flow
Controlling Data Reporting
[Diagram, release calendar example. Relationships shown:]
– The Data Flow (LF-H = labor force hours) uses a specific data structure (Structure Definition)
– A Data Set conforms to the business rules of the dataflow
– The Data Provider (1A – INE Spain) publishes/reports data sets
– A Data Provider can provide data for many data flows using the agreed data structure
– A Data Flow can get data from multiple data providers
– A Provision Agreement links the Data Provider and the Data Flow
Metadata Structure Definition (MSD)
[Figure: release calendar]
Identify
• Provision Agreement
Structure
• Concepts
• Hierarchies
• Representation (e.g. code list)
MSD: Identifying the “Target”
[Diagram. Relationships shown:]
– The Metadata Structure Definition defines “keys” of object types to which metadata can be “attached”
– A Full Target Identifier or Partial Target Identifier specifies the identifier components (“key”) of the target object
– Each Identifier Component has a Target Object Type, which identifies the target object type of the component
– Each Identifier Component identifies an Item Scheme: the code list or other type of list (e.g. Category Scheme) which defines the valid values that can be used when metadata are reported in a metadata set
MSD: Object Identification for ARC
[Diagram. Values shown:]
– Metadata Structure Definition: ARC_METADATA; Metadata Report: ARC
– Target Identifier: Data_Flow_Provider
– Identifier Component with Target Object Type Data Flow; Item Scheme CL_DATA_FLOW (LF-H Labour Force, Hours Worked; LF-E Labour Force, Employment)
– Identifier Component with Target Object Type Data Provider; Item Scheme OS_DATA_PROVIDER (1A INE, Spain; 2A ONS, UK)
MSD: Identifiers for ARC
Metadata Structure Definition = ARC_METADATA
Target = Data_Flow_Provider
Identifier Component
– Target Object Type = Data Flow
– Item Scheme = CL_DATA_FLOW
• LF-H Labour Force, Hours Worked
• LF-E Labour Force, Employment
Identifier Component
– Target Object Type = Data Provider
– Item Scheme = OS_DATA_PROVIDER
• 1A INE, Spain
• 2A ONS, UK
MSD: Metadata Report Structure
Metadata Report = ARC
Target Id = Data_Flow_Provider
Metadata Attribute: Concept = Reference_Period; Representation = Date/Time
Metadata Attribute: Concept = Release_Date_Time; Representation = Date/Time
Metadata Attribute: Concept = Date_Tolerance; Representation = Time Value
Metadata Attribute: Concept = Release_Status; Representation = CL_Status (F Final, P Provisional)
Metadata Attribute: Concept = Annotation; Representation = Text
Metadata Set: ARC Report Example
Metadata Set
Metadata Structure = ARC_METADATA
Metadata Report = ARC
Identifiers (Data Flow and Data Provider, which together identify the Provision Agreement)
Data Provider = 1A
Data Flow = LF-H
Metadata Attributes
Concept = Reference_Period; Value = 2007-03-31
Concept = Release_Date_Time; Value = 2007-04-30T09:00
Concept = Date_Tolerance; Value = +24Hr
Concept = Release_Status; Value = F
Concept = Annotation; Value = simultaneous release by ECB
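How a Metadata Structure Definition constrains a metadata set can be sketched as a simple validation step. This is a simplified illustration, not an SDMX implementation: the dictionaries and the `validate` function are invented here, only the Release_Status code list and the two target item schemes are checked, and the other representations (Date/Time, Time Value, Text) are accepted as free text.

```python
# Report structure for the ARC example: concept -> permitted values
# (None means any text is accepted in this simplified sketch).
ARC_REPORT = {
    "Reference_Period": None,
    "Release_Date_Time": None,
    "Date_Tolerance": None,
    "Release_Status": {"F", "P"},  # CL_Status: F Final, P Provisional
    "Annotation": None,
}
# Target identifier components and the codes of their item schemes.
TARGET = {
    "Data Flow": {"LF-H", "LF-E"},    # CL_DATA_FLOW
    "Data Provider": {"1A", "2A"},    # OS_DATA_PROVIDER
}


def validate(identifiers, attributes):
    """Return a list of problems; an empty list means the report conforms."""
    problems = []
    for component, allowed in TARGET.items():
        if identifiers.get(component) not in allowed:
            problems.append(f"invalid or missing identifier: {component}")
    for concept, value in attributes.items():
        if concept not in ARC_REPORT:
            problems.append(f"unknown concept: {concept}")
        elif ARC_REPORT[concept] is not None and value not in ARC_REPORT[concept]:
            problems.append(f"value {value!r} not permitted for {concept}")
    return problems


identifiers = {"Data Provider": "1A", "Data Flow": "LF-H"}
attributes = {
    "Release_Date_Time": "2007-04-30T09:00",
    "Date_Tolerance": "+24Hr",
    "Release_Status": "F",
    "Annotation": "simultaneous release by ECB",
}
print(validate(identifiers, attributes))
print(validate(identifiers, {**attributes, "Release_Status": "X"}))
```

The identifiers say what the metadata is about, and the report structure says which concepts and values may be reported for it.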
Controlling Metadata Reporting
[Diagram: objects that control metadata reporting]
– Metadata Structure Definition (ARC_METADATA)
– Metadata Flow (ARC): uses specific data structure; can get
metadata from multiple metadata providers
– (Meta)Data Provider (1A): publishes/reports metadata sets; can
provide metadata for many metadata flows using agreed metadata
structure
– Metadata Set: conforms to business rules of the metadata flow
– Provision Agreement (between the provider and the metadata flow)
• Metadata collectors can set up control metadata for the
collection process
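The control relationships on this slide can be sketched as a simple acceptance check. This is a toy illustration under stated assumptions: the function `accept_metadata_set` and the two lookup tables are hypothetical, and only the identifiers 1A, 2A, ARC and ARC_METADATA come from the slides.

```python
# Provision agreements as (metadata provider, metadata flow) pairs.
PROVISION_AGREEMENTS = {("1A", "ARC")}

# Each metadata flow uses one metadata structure definition.
FLOW_STRUCTURE = {"ARC": "ARC_METADATA"}


def accept_metadata_set(provider: str, flow: str, structure: str) -> bool:
    """Accept a metadata set only if an agreement exists and the structure matches."""
    has_agreement = (provider, flow) in PROVISION_AGREEMENTS
    uses_flow_structure = FLOW_STRUCTURE.get(flow) == structure
    return has_agreement and uses_flow_structure


print(accept_metadata_set("1A", "ARC", "ARC_METADATA"))  # agreement recorded for 1A
print(accept_metadata_set("2A", "ARC", "ARC_METADATA"))  # no agreement recorded for 2A
```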
Reference Metadata
• Metadata is everywhere, but
– it must be metadata about “something”
• what is the “something”?
• how is it identified?
– which concepts does it comprise, and how are they structured?
• The Metadata Structure Definition answers these
questions
• Advance release calendar is only one possible example
– attached to the Provision Agreement
To which (other) things can
metadata be attached?
MSD: Some Object Types
– Structure Definition
– Data Set or Metadata Set
– Data Provider
– Structure and Item Scheme Maps
– Data or Metadata Flow
– Provision Agreement
– Category Scheme
– Category
– Content Constraint
– Attachment Constraint
– Registered Data Set or Metadata Set
MSD: List of Object Types to Which
Metadata can be Attached
Agency
ConceptScheme
Concept
Codelist
Code
KeyFamily
Component
KeyDescriptor
MeasureDescriptor
AttributeDescriptor
GroupKeyDescriptor
Dimension
Measure
Attribute
CategoryScheme
ReportingTaxonomy
Category
OrganisationScheme
DataProvider
MetadataStructure
FullTargetIdentifier
PartialTargetIdentifier
MetadataAttribute
DataFlow
ProvisionAgreement
MetadataFlow
ContentConstraint
AttachmentConstraint
DataSet
XSDataSet
MetadataSet
HierarchicalCodelist
Hierarchy
StructureSet
StructureMap
ComponentMap
CodelistMap
CodeMap
CategorySchemeMap
CategoryMap
OrganisationSchemeMap
OrganisationRoleMap
ConceptSchemeMap
ConceptMap
Process
ProcessStep
Metadata Structure Definition (MSD)
Report Structure
[Diagram: structure of a metadata report]
– Metadata Structure Definition: can comprise the specification of
one or more reports
– Metadata Report: links to a Target Identifier
– Metadata Attributes: can have hierarchy; each takes semantics and
context from a Concept
– Concept: defined in a Concept Scheme; can have hierarchy
– Format and Permitted Value List: definition of format and
permitted values
SDMX and Metadata Formats
Session: SDMX-ML Formats for
Metadata Sets
Metadata Formats Syntax
Implementation
• There are three relevant constructs in SDMX-ML
for handling metadata sets
– Metadata Structure Definitions
– Metadata Reports (specific to an MSD)
– Generic Metadata Sets (for any MSD)
• This is similar to data formats in SDMX-ML,
except that there are fewer different use cases
• There is no corresponding format
implementation in SDMX-EDI for Reference
Metadata
Comparing Formats for Metadata
Sets
• Generic Metadata performs no validation,
but can hold any type of metadata report
• MSD-specific Metadata Reports can
perform more validation, and are less
verbose
– Because there tend to be few codelists or
numeric types in metadata reports, the
validation may not be very useful
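The trade-off between the two formats can be sketched with two XML shapes for the same report. The element names below (`GenericReport`, `ReportedAttribute`, `Value`, `ARCReport`) are simplified stand-ins chosen for this illustration and are not the actual SDMX-ML schemas. The point is only that a generic format names each concept in an attribute, while an MSD-specific format uses one element per concept, which a schema can then validate.

```python
import xml.etree.ElementTree as ET

REPORT = {"Release_Status": "F", "Date_Tolerance": "+24Hr"}


def generic_report(values: dict) -> ET.Element:
    """One fixed element type; the concept is carried as data."""
    root = ET.Element("GenericReport")
    for concept, value in values.items():
        attr = ET.SubElement(root, "ReportedAttribute", conceptID=concept)
        ET.SubElement(attr, "Value").text = value
    return root


def specific_report(values: dict) -> ET.Element:
    """One element per concept; names come from the structure definition."""
    root = ET.Element("ARCReport")
    for concept, value in values.items():
        ET.SubElement(root, concept).text = value
    return root


generic_xml = ET.tostring(generic_report(REPORT), encoding="unicode")
specific_xml = ET.tostring(specific_report(REPORT), encoding="unicode")
print(generic_xml)
print(specific_xml)
print(len(specific_xml) < len(generic_xml))
```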
Metadata: Quality Frameworks
• The SDMX cross domain concepts for
reference metadata are concerned with
data quality framework (DQAF) metadata
• These DQAFs are used to improve the
quality, comparability, transparency etc. of
published data
Metadata – Reported according to
a Quality Framework
Example Metadata: Content
[Diagram: example content of a quality metadata structure]
– Metadata Structure Definition: QUALITY_METADATA
– Metadata Report: CATEGORY_CONTENT_REPORT
– Concept Scheme: MY_CONCEPTS
– Concepts: ACCOUNTING_CONV, BASE_PER, COVERAGE, COVERAGE_SECTOR,
REF_AREA, REF_PERIOD
– Metadata Attributes, each with a Format and Permitted Value List
SDMX Registry Overview
SDMX Registry/Repository
[Diagram: SDMX Registry Interfaces]
– REGISTRY (Data Set/Metadata Set): indexes data and metadata
• Interfaces: Register, Query
– REPOSITORY (Provisioning Metadata): describes data and metadata
sources and reporting processes
• Interfaces: Submit, Query
– REPOSITORY (Structural Metadata): describes data and metadata
structures
• Interfaces: Submit, Query
SDMX Registry/Repository
[Diagram: SDMX Registry Interfaces, with subscription/notification]
– REGISTRY (Data Set/Metadata Set): indexes data and metadata
• Interfaces: Register, Query
– Subscription/Notification: applications can subscribe to
notification of new or changed objects
– REPOSITORY (Provisioning Metadata)
• Interfaces: Submit, Query
– REPOSITORY (Structural Metadata): describes data and metadata
structures
• Interfaces: Submit, Query
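The register, query and subscription/notification interfaces can be sketched with a toy in-memory registry. Everything here is hypothetical (the classes `Registration` and `ToyRegistry`, and the example.org URLs); a real SDMX registry exposes these operations as web services rather than method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    provider: str
    flow: str
    url: str   # where the registered data set can be retrieved


@dataclass
class ToyRegistry:
    registrations: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)

    def subscribe(self, flow, callback):
        """Ask to be notified when a data set is registered for this flow."""
        self.subscribers.append((flow, callback))

    def register(self, registration):
        """Index a data set, then notify matching subscribers."""
        self.registrations.append(registration)
        for flow, callback in self.subscribers:
            if flow == registration.flow:
                callback(registration)

    def query(self, flow):
        """Return all registrations for a flow."""
        return [r for r in self.registrations if r.flow == flow]


seen = []
registry = ToyRegistry()
registry.subscribe("LF-H", seen.append)
registry.register(Registration("1A", "LF-H", "https://example.org/lf-h.xml"))
registry.register(Registration("2A", "LF-E", "https://example.org/lf-e.xml"))
print([r.url for r in seen])
print(len(registry.query("LF-E")))
```

The registry stores only the location of each data set, not the data itself, which matches the "indexes data and metadata" role on the slide.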
Information Model: High level Schematic
[Diagram: main objects of the information model and their relationships]
– Structure Maps: structure and code list maps
– Data or Metadata Structure Definition
– Data or Metadata Flow: uses specific data/metadata structure; can
be linked to categories in multiple category schemes; can get
data/metadata from multiple data/metadata providers
– Data Provider: publishes/reports data/metadata sets; can provide
data/metadata for many data/metadata flows using agreed
data/metadata structure
– Provision Agreement
– Data or Metadata Set: conforms to business rules of the
data/metadata flow
– Data or Metadata Set (registered): registers existence of data and
metadata (URL, registration date etc.)
– Category Scheme: comprises subject or reporting categories
– Category: can have child categories
SDMX Artefacts: Registry Contents
[Diagram: registry contents, grouped into three kinds]
– Structural Metadata
– Provisioning Metadata
– Registered Data and Metadata
Objects shown:
– Structure Maps: structure and code list maps
– Structure Definition
– Data Flow: uses specific data/metadata structure; can be linked to
categories in multiple category schemes; can get data/metadata from
multiple data/metadata providers
– Data Provider: can provide data/metadata for many data/metadata
flows using agreed data/metadata structure
– Provision Agreement
– Data Set: registers existence of data and metadata sets (URL,
registration date etc.)
– Category Scheme: comprises subject or reporting categories
– Category: can have child categories
The Old JEDH (Joint External
Debt Hub) Site
[Diagram: BIS, IMF, OECD and World Bank supply data in various
formats to the website, on a 3-month production cycle]
JEDH with SDMX
[Diagram: JEDH data flow using SDMX]
– BIS, IMF, OECD and World Bank make data available as SDMX-ML
– SDMX Registry: used to discover data and URLs
– SDMX “Agent”: retrieves data from sites as SDMX-ML
– Data loaded into JEDH DB (Debtor database)
– Data provided in real time to the JEDH Site
FOOD AND AGRICULTURE ORGANIZATION
OF THE UNITED NATIONS
SDMX in Action: Prototype System
[Diagram: Flow of FAO CountrySTAT-RegionSTAT Implementation, with
steps numbered 1, 2, 3a, 3b and 4]
– CountrySTAT National Publication Server(s)
– FAO SDMX Registry
– RegionSTAT Regional Publication Server
Slide courtesy of the FAO
Prototype System: Explanation
1 CountryStat National Publication Server
• The web site is published from the files in CountryStat
2 SDMX Publication
• The new CountryStat files are converted to SDMX-ML data sets
and made web accessible on the CountryStat web site
• These files are registered in the FAO SDMX Registry
3a RegionStat Regional Publication Server
• Queries the registry for new registrations; the registry responds
with registration details including the URL of the new data sets
3b
• Retrieves the new data sets from the CountryStat web site
• Converts the SDMX-ML files to an internal format and integrates
the new data sets with existing RegionStat data sets
4
• Re-publishes the RegionStat web site
Slide courtesy of the FAO
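Steps 3a and 3b of the prototype can be sketched as a small harvesting function. This is not FAO's implementation: the function `harvest`, the shape of the registration records, the example.org URLs and the stand-in `fetch` and `integrate` callables are all assumptions made for illustration. In a real system `fetch` would download an SDMX-ML file and `integrate` would convert and load it.

```python
from datetime import datetime
from typing import Callable


def harvest(registrations: list, last_run: datetime,
            fetch: Callable[[str], str],
            integrate: Callable[[str], None]) -> list:
    """Find registrations newer than last_run, then retrieve and integrate each."""
    new = [r for r in registrations if r["registered"] > last_run]
    new.sort(key=lambda r: r["registered"])   # oldest first
    for registration in new:
        integrate(fetch(registration["url"]))
    return [r["url"] for r in new]


# Stand-ins for the registry response and the CountrySTAT web site.
registrations = [
    {"url": "https://example.org/old.xml", "registered": datetime(2010, 1, 1)},
    {"url": "https://example.org/new.xml", "registered": datetime(2010, 3, 1)},
]
store = []
fetched = harvest(
    registrations,
    last_run=datetime(2010, 2, 1),
    fetch=lambda url: f"<data from {url}>",
    integrate=store.append,
)
print(fetched)
```

Passing `fetch` and `integrate` in as parameters keeps the harvesting logic independent of how files are downloaded or stored.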
SDMX Implementation
Developing SDMX Applications
• General Design Approaches
• Publications and Dissemination
• Data Warehousing/Integration of Data
Sources
• Other Topics
SDMX Publication and
Dissemination
• SDMX can be used to drive Web
dissemination and print publication
– It is a useful format for distribution from
websites
– It can be used by websites to improve delivery
of content
– It can be used to provide content to print
applications, for tabular data
• All of these outputs can be produced from a
single system
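[Diagram: publication and dissemination from a single system]
– Data Storage (SDMX). Note: can be a virtual data store fed by the
SDMX registry
– SDMX Registry
– SDMX Query Engine: serves canned queries and on-the-fly queries,
returning SDMX-ML
– Print Publication Engine: takes SDMX-ML plus templates, boilerplate
text and analysis
– Website
– Technologies shown: XSL-FO, XSLT, ASP/JSP
– Outputs shown: HTML, CSV, PDF, etc.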
Notes on Publication/Dissemination
• Current practice is often to focus on the delivery
of tables
– This is often not what users ideally want
– Tables can be viewed as “canned queries”
• Better web-sites can be created which support
granular user queries supported by rich
metadata
– See the ECB data warehouse, Federal Reserve
Board site as examples
– See “Data on the Web” presentation for more details
Data Warehousing/Integration of
Data Sources
• SDMX is also designed to support the
collection and processing of data
– In most organizations, this is seen as a data
warehousing activity
• SDMX provides tools for integrating data
from a variety of sources
– Can be among a set of organizations or within
an organization
[Diagram: a data warehouse built around an SDMX Registry]
– Data Sources (static files, databases, etc.): data registration
with the SDMX Registry, notification, data pulled
– Data Warehouse: Data Loading, Data Harmonization/Processing,
Data Dissemination
– Dissemination targets: Website, Print Publication, Internal
Applications
Note: All types of dissemination
applications may use the registry
for various purposes. The registry may even
be made publicly available to users who
want SDMX-ML data and metadata.
Notes on Data Warehousing
• Each stage is loosely coupled with associated
applications, using XML interfaces:
– Data sources
– Data processing
– Data dissemination applications
• The SDMX Registry functions throughout as a
metadata repository, to provide structural and
provisioning information as well as location of
data as needed
• Internal database structures are based on SDMX
information model
– They are predictable and regular
– They can be auto-generated
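The claim that database structures can be auto-generated can be sketched by deriving a table definition from the components of a data structure. This is an assumption-laden illustration: the function `table_ddl`, the table name LABOUR_FORCE, the columns TIME_PERIOD, OBS_VALUE and OBS_STATUS, and the sample observation are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3


def table_ddl(structure_id: str, dimensions: list, attributes: list) -> str:
    """Build a CREATE TABLE statement with one column per component."""
    columns = [f'"{name}" TEXT NOT NULL' for name in dimensions]
    columns.append('"OBS_VALUE" REAL')
    columns += [f'"{name}" TEXT' for name in attributes]
    key = ", ".join(f'"{name}"' for name in dimensions)
    column_sql = ", ".join(columns)
    # The dimensions together identify one observation, so they form the key.
    return f'CREATE TABLE "{structure_id}" ({column_sql}, PRIMARY KEY ({key}))'


ddl = table_ddl(
    "LABOUR_FORCE",
    dimensions=["REF_AREA", "TIME_PERIOD"],
    attributes=["OBS_STATUS"],
)
connection = sqlite3.connect(":memory:")
connection.execute(ddl)
connection.execute(
    'INSERT INTO "LABOUR_FORCE" VALUES (?, ?, ?, ?)', ("ES", "2007-Q1", 38.5, "F")
)
print(ddl)
```

Because every structure definition has the same kinds of components, the same function yields a regular table for any of them, which is the sense in which the structures are predictable.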
SDMX Tools and Resources
SDMX Tools (Partial List)
• Metadata Technology has a set of free tools for
working with data and metadata, and a free
registry implementation
– Mostly Java and XSLT
• Eurostat has a set of free tools for working with
data and metadata, and has a registry
implementation
• OECD and IMF have a web-services based
package for dissemination: .STAT (available
through MOU)
• ECB has visualization tools written in Flex, hosted on Google
Code
• Some other tools, including commercial vendors
(STR Supercross 2, etc.)
Other Resources
• www.sdmx.org has a blog and makes many
different presentations and papers available, as
well as distributing copies of the standards
– An SDMX User’s Guide is currently being developed
(beyond the material contained in the SDMX v 2.0
specification)
• The Open Data Foundation promotes SDMX
(among other standards)
– Check www.opendatafoundation.org
– They host the SDMX Users Forum
www.sdmxusers.org
SDMX and Other Standards
Other Important Standards
• Data Documentation Initiative (DDI) – describes
the micro-data inputs to aggregate (SDMX) data
• ISO/IEC 11179 Metadata Registries – describes
terminological/semantic and conceptual models,
and the metadata lifecycle
• eXtensible Business Reporting Language
(XBRL) – describes financial microdata for
economic statistics
SDMX and XBRL
• These standards can be mapped to each
other successfully
• However, the mapping depends on the
specific SDMX Data Structure Definition,
and the specific XBRL “Taxonomy”
– There is no single, standard mapping
DDI and SDMX Combined Data Model
• DDI 3 focuses on:
– collection and production of microdata
– reuse and sharing of common data structures
– conversion to statistical tables (matrices)
– preservation and multiple storage options
• SDMX focuses on:
– statistical tables
– reuse and sharing of common data structures
– consistent data transfer structure
• Together they form a coherent data
management model for data capture, storage
and interchange with a wide area of overlap
Generic Process Example
[Diagram: a generic process from microdata (DDI) to aggregates (SDMX)]
– DDI: Raw Data Set, then anonymization, cleaning, recoding, etc.,
giving Micro-Data Set/Public Use Files
– Aggregation, harmonization
– SDMX: Aggregate Data Set (Lower level), Aggregate Data Set
(Higher Level), Aggregate Data Set (Highest-Level)
The Generic Statistical Business
Process Model (GSBPM)
• The METIS group is a part of UN/ECE
which addresses metadata issues for
national statistical agencies (and other
producers of official statistics)
– This community uses both SDMX and DDI
• They have produced a reference model of
the statistical production process
– The DDI 3 Lifecycle Model was a major input
– GSBPM has a much greater level of detail
The Generic Statistical Information
Model (GSIM)
• Early work on an information model to
accompany the GSBPM is starting
– Still informal, very early
– Involves some of the statistical agencies which lead
the work on GSBPM
• GSIM will take as a major input both the DDI and
SDMX information models
– Will also cover other metadata
– Will also draw on other standards (Neuchatel Model
for Classifications, etc.)
• Goal is to publish GSIM through METIS
alongside the GSBPM
Questions?