INEGI: Introduction to SDMX

EDDI: Introduction to SDMX
Arofan Gregory, Open Data Foundation

What is SDMX?
• The problem space:
– Statistical collection, processing, and exchange are time-consuming and resource-intensive
– Various international and national organisations have individual approaches for their constituencies
– Uncertainties about how to proceed with new technologies (XML, web services …)

[Diagram: transactions, accounts, and statistics flow from individual households, banks, and corporates through national statistical organisations to regional and international organisations (180+ countries), alongside Internet search and navigation across sites such as www.x.org, www.y.org, www.z.org, and www.hub.org]

What is SDMX?
The Statistical Data and Metadata Exchange (SDMX) initiative is taking steps to address the challenges and opportunities just mentioned:
– By focusing on business practices in the field of statistical information
– By identifying more efficient processes for exchange and sharing of data and metadata using modern technology

Historical Note
• SDMX uses an approach based on the 10-year-long success of an earlier standard: GESMES/TS
• GESMES/TS is used today in many countries for collecting, exchanging, and updating statistical databases
– GESMES/TS is now SDMX-EDI
• Focus is on time series, and it is mostly used by central banks

Who is SDMX?
• SDMX is an initiative made up of seven international organizations:
– Bank for International Settlements
– European Central Bank
– Eurostat
– International Monetary Fund
– Organisation for Economic Cooperation and Development
– United Nations
– World Bank
• The initiative was launched in 2002

SDMX Products
• Technical standards for the formatting and exchange of aggregate statistics:
– SDMX Technical Specifications version 1.0 (now ISO/TS 17369 SDMX)
– SDMX Technical Specifications version 2.0 (submitted to ISO)
– SDMX Technical Specifications version 2.1 under review (will be forwarded to ISO)
• Content-Oriented Guidelines
– Common Metadata Vocabulary
– Cross-Domain Statistical Concepts
– Statistical Subject-Matter Domains

Detailed SDMX Goals
• Reducing the national reporting burden to international institutions
• Fostering consistency, accuracy, and timeliness between data and metadata disseminated by national and international institutions, relying on what is decentrally released via national websites
• Enhancing national statistical processing efficiency, especially through internationally-recognised standard formats for exchanges between statistical silos within institutions and with other national statistical agencies
• Providing standards for web-based dissemination formats that are computer readable and facilitate updating of databases
• Enhancing comparison of data and metadata analysis through standard formats and content-oriented guidelines

Official Recommendations
• SDMX has been officially recommended:
– February 2007: SDMX endorsed by the European Union’s Statistical Programme Committee
– March 2008: UN Statistical Commission declares SDMX to be the preferred standard for data and metadata

Exchange Patterns
• Bilateral: Institutions exchange data according to bilateral agreements regarding format, timing, protocols, etc.
• Gateway: Institutions share the data they collect with their peers, in agreed formats among counterparty communities
• Data-sharing: standard exchange of data using standard formats and protocols

[Diagrams: Bilateral Exchange; Gateway Exchange; Data-Sharing Exchange]

Notes About Data-Sharing
• Data-sharing only works if there are standard formats
• Data-sharing only works if the data themselves are decentralized
– One big database doesn’t work!
• Like the Web itself, a data-sharing model relies on pull exchanges, not push exchanges
– Data consumers discover the data they need, and its location, and then go and get it
– Data producers don’t have to send data

SDMX View
• SDMX products support all types of exchange
• One major requirement is to work well with existing systems, to protect technology investments
• SDMX promotes an incremental movement toward the data-sharing model

Exchange with Peer Organizations
• SDMX-EDI and SDMX-ML are both able to exchange databases between peer organizations
• Structural metadata is also exchanged and can be read by counterparty systems
• Incremental updating is possible
• Increases degree of automation for exchange; lowers degree of bilateral, verbal agreement
• Can use “pull” instead of “push” if registry is deployed

Integration within an Organization
• SDMX standard formats are also useful within an organization
– Many organizations have several disparate databases
– Differences in database structure and content can make it difficult to use other systems’ data
– SDMX-ML provides a way to loosely couple such databases, while facilitating exchange
– An SDMX registry can allow visibility into other databases, while not affecting control or ownership of data

Data Collection and Warehousing
• When data is collected from many different sources, it can be in a wide variety of formats
– Typically metadata-poor
• SDMX allows for a single, metadata-rich reporting format for each type of data
• Existing counterparty systems can be “wrappered” to
support SDMX for exchange only

Adoption of SDMX
• SDMX has been aggressively adopted, as compared to other international technology standards
– Many important data sets are available in SDMX-ML today
– There are many prototypes and planned projects at the national and international level
– Increasing numbers of tools are available which support SDMX

Adopters/Interest
• The following are known adopters (or planning to adopt):
– US Federal Reserve Board and Federal Reserve Bank of New York
– European Central Bank
– Joint External Debt Hub (WB, IMF, OECD, BIS)
– UN/TRADECOM at UN Statistical Division
– NAAWE (National Accounts from OECD/Eurostat)
– European Statistical System (Eurostat and National Statistical Institutes)
– Mexican Federal System
– Vietnamese Ministry of Planning and Investment
– Qatar Information Exchange
– IMF (BOP, SNA, SDDS/GDDS)
– Food and Agriculture Organization
– Millennium Development Goals (UN System, others)
– International Labor Organization
– Bank for International Settlements
– OECD
– World Bank
– World Development Indicators (WDI)
– Marchioness Islands (Spanish/Portuguese Statistical Region)
– UNESCO (Education)
– Australian Bureau of Statistics
– WHO (SDMX-HD)
– Statistics Canada
• There are many others!
SDMX and Domains
• SDMX is organized as a central standard, created and supported by the SDMX Initiative
– Each statistical domain creates its own domain standard
– Example: WHO has created SDMX-HD (“Health Domain”) for monitoring disease outbreaks/epidemiology
– Example: UNESCO and Eurostat have developed standard SDMX applications for Education Statistics
• You should look at the work in the different domains when applying SDMX to different national-level statistics collection

US Federal Reserve Board
• Several important data sets are available, and searchable at a granular level, using SDMX
• SDMX-ML is both a web-delivery format and an internal exchange format for production of data
http://www.federalreserve.gov/datadownload/default.htm

Federal Reserve Bank of New York
• Historical data, once stored in huge CSV files, is now available as SDMX-ML
• Increased the use of the site
• The “typical user” is now a machine
http://www.newyorkfed.org/xml/index.html

European Central Bank
• ECB uses SDMX-EDI to exchange data with European central banks
• SDMX-ML is used for web dissemination
– Simultaneous release on many CB sites
– Each site can use its own language and look & feel
– Data warehouse now available in SDMX-ML
• Built and maintained using SDMX standards
http://www.ecb.int/stats/exchange/eurofxref/html/index.en.html
http://stats.ecb.europa.eu/stats/sdmx/visualisation/icp/dashboard/rc1/
• ECB’s Statistical Data Warehouse/web service

OECD
• Data structures are specified using SDMX standards
• Data sets are held in SDMX-ML format and navigated “on the fly”
– OECD.Stat
• http://stats.oecd.org/WBOS/index.aspx
• Experimenting with graphical presentation of data
• Serves all OECD data as SDMX through OECD.stat web service

Eurostat
• Builds on long experience of using GESMES for data transmission (GESMES is main format for transmission of data in several important domains e.g.
national accounts, balance of payments, short-term statistics)
• More than 50 Data Structure Definitions for GESMES developed and maintained (in partnership with ECB)
• Software components developed and made available as open-source software (see Tools page of SDMX website)
• Now creating a portal for all European Census data, collected as SDMX

SDMX Specifications and Products

SDMX Information Model: High level Schematic
[Diagram relating the main objects of the model:]
• Data or Metadata Structure Definition
• Data or Metadata Flow: uses specific data/metadata structure; can be linked to categories in multiple category schemes; can get data/metadata from multiple data/metadata providers
• Data or Metadata Set: conforms to business rules of the data/metadata flow
• Data Provider: publishes/reports data/metadata sets; can provide data/metadata for many data/metadata flows using agreed data/metadata structure
• Provision Agreement
• Category Scheme: comprises subject or reporting categories
• Category: can have child categories
• Registered Data or Metadata Set: registers existence of data and metadata; is registered for a Provision Agreement

SDMX Technical Specs v 1.0
• Information Model (data structure definitions and data formats)
• SDMX-ML: XML formats for data structure definitions and data
• SDMX-EDI: EDI formats for data structure definitions and data
• Web-Services Guidelines
• User Guide

Technical Notes on Version 1.0
• Only numeric observations were supported
• Only coded key values were supported
• Intended to provide an XML version of the existing GESMES/TS data model
– GESMES/TS became SDMX-EDI
– XML extended the data model to provide for more types of groups and cross-sectional data
• Hierarchical codelists not supported

SDMX Technical Spec v. 2.0
• Expanded data model includes
– Registry interfaces
– Metadata structures and formats
– Data and metadata provisioning
– Other advanced features (process flow, reporting taxonomy, structure mapping, etc.)
• Data formats now include uncoded dimensions, hierarchical codelists, and non-numeric observations

Technical Notes on Version 2.0
• A very large expansion of scope
– Model covers the process of statistical exchange, not just the data formats
– Many cases which version 1.0 could not support were included in version 2.0 as a result of implementations
• Full support for the “data sharing” pattern of exchange
– Resulting from the inclusion of the registry

Changes for Version 2.1
• Expanded Web Services Guidelines
– Standard WSDL Functions
– Standard RESTful syntax (URL-based API)
– Standard Error Codes
– Will allow for interoperable web services for SDMX, so generic clients can use multiple sources
• Simplified Data Formats
– All data formats will be more consistent
– Cross-sectional and time-series formats are more similar
• SDMX Query has been improved
• Note: SDMX 2.1 is available for public review now!

SDMX Content-Oriented Guidelines
• Four documents:
– Overview
– Metadata Common Vocabulary
– Cross-Domain Concepts
– Statistical Subject-Matter Domains
• These will not become ISO specifications, but will evolve as publications of the SDMX Initiative

Metadata Common Vocabulary
• A set of terms and definitions for the different parts of the SDMX technical standards, and many common concepts used in data and metadata structures
• Does not replace other major vocabularies in this space (such as the OECD glossary) but references these other works

Cross-Domain Concepts
• Includes concepts which are common across many statistical domains
– Names & Definitions
– Representations
• These are concepts which support both data and metadata structures

Statistical Subject-Matter Domains
• Based on the UN/ECE classification of statistical activities
• Provides a classification system for use in exchanging statistics across domain boundaries
• Provides a breakdown of the various domains within official statistics

SDMX and Data Formats

Data Set
We have a dataset, what do we need to
know?
• Version 1.0
– What it is and how it is structured
• Version 2.0
– Who reports/disseminates it
– How a specific data set fits into the overall collection framework and which organisation is responsible for reporting which parts
– The reporting/publication schedule
– That it has been reported/published

Data Set: Structure
First: Identify the Concepts
• A concept is a unit of knowledge created by a unique combination of characteristics (SDMX Information Model)

Data Set Structure: Concepts
[Data table annotated with its concepts: Topic, Country, Stock/Flow, Unit Multiplier, Unit, Time/Frequency]
Computers need structure of data
• Concepts
• Code lists
• Data values
• How these fit together

Data Set Structure: Code Lists
Concepts: Topic, Country, Flow
Code lists:
TOPIC: A = Brady Bonds; B = Bank Loans; C = Debt Securities
COUNTRY: AR = Argentina; MX = Mexico; ZA = South Africa
STOCK/FLOW: 1 = Stock; 2 = Flow

Data Makes Sense
Q,ZA,B,1,1999-06-30=16547
[The slide also shows the value 16457 in the accompanying data table]

Data Set Structure: Defining Multidimensional Structures
• Comprises
– Dimensions: concepts that identify the observation value
– Attributes: concepts that add additional metadata about the observation value
– Measure: concept that is the observation value
– Any of these may be
• coded
• text
• date/time
• number
• etc.
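To make the key notation concrete, here is a minimal Python sketch that decodes the sample observation Q,ZA,B,1,1999-06-30=16547 against the code lists shown on the slide. The function name decode_observation, the dictionary layout, and the dimension identifiers (FREQ, COUNTRY, TOPIC, STOCK_FLOW) are illustrative assumptions, not SDMX artefacts. The frequency code list is also an assumption: it contains only the reading "Q = Quarterly" that the deck gives for this key.

```python
# Code lists copied from the slide. FREQ is an assumed, one-entry list
# (the deck only states that "Q" reads as "Quarterly").
CODE_LISTS = {
    "FREQ": {"Q": "Quarterly"},
    "COUNTRY": {"AR": "Argentina", "MX": "Mexico", "ZA": "South Africa"},
    "TOPIC": {"A": "Brady Bonds", "B": "Bank Loans", "C": "Debt Securities"},
    "STOCK_FLOW": {"1": "Stock", "2": "Flow"},
}

# Order of the coded dimensions in the key:
# Frequency,Country,Topic,Stock/Flow,Time=Observation
CODED_DIMENSIONS = ["FREQ", "COUNTRY", "TOPIC", "STOCK_FLOW"]


def decode_observation(line: str) -> dict:
    """Split 'key=value' and translate each coded dimension via its code list."""
    key_part, value_part = line.split("=", 1)
    parts = key_part.split(",")
    if len(parts) != len(CODED_DIMENSIONS) + 1:
        raise ValueError(f"expected {len(CODED_DIMENSIONS) + 1} key components")
    *codes, time_period = parts

    decoded = {}
    for dimension, code in zip(CODED_DIMENSIONS, codes):
        if code not in CODE_LISTS[dimension]:
            raise ValueError(f"{code!r} is not in code list {dimension}")
        decoded[dimension] = CODE_LISTS[dimension][code]

    decoded["TIME"] = time_period           # uncoded: kept as reported
    decoded["OBS_VALUE"] = float(value_part)  # the measure
    return decoded


if __name__ == "__main__":
    for name, value in decode_observation("Q,ZA,B,1,1999-06-30=16547").items():
        print(f"{name}: {value}")
```

The point of the sketch is that the string of codes carries no meaning by itself: the dimension order and the code lists, which a data structure definition supplies, are what make the observation interpretable. A code outside its list is rejected rather than passed through.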
(The form a concept's values take is its Representation.)

Data Set Structure: Concept Usage
• Country (Dimension)
• Stock/Flow (Dimension)
• Unit Multiplier (Attribute)
• Unit (Attribute)
• Time/Frequency (Dimension)
• Topic (Dimension)
• Observation (Measure)

Data Structure Definition
[Diagram:]
• Key: Dimensions, the concepts that identify the observation
• Group Key: concepts that identify groups of keys
• Attributes: concepts that add metadata
• Measures: concepts that are the observed phenomenon
• Dimensions, Attributes, and Measures each take their semantic from a Concept (for example Topic, Country, Flow) and have a format given by a Representation
• A Representation is coded or non-coded; a coded Representation has a Code List (for example TOPIC: A Brady Bonds, B Bank Loans, C Debt Securities)

Data Makes Sense
Frequency,Country,Topic,Stock/Flow,Time=Observation
Q,ZA,B,1,1999-06-30=16547
Quarterly, South Africa, Bank Loans, Stocks, 2nd quarter 1999
[The slide also shows the value 16457 in the accompanying data table]

Identifying Concepts
• Identifying Concepts - Sources
– Existing data set tables
• From website
• From applications
– Data Collection Instruments
• Questionnaires
• Excel spreadsheets
– Regulations, Handbooks, User Guides
• Labour Statistics Convention, 1985 (No. 160), Recommendation, 1985 (No.
170)
• Council Regulation No: 311/76/EEC of 09/02/1976; OJ: L039 of 14/02/1976; Compilation of statistics on foreign workers
– Database Tables
– Existing Data Structure Definitions
• From other organisations

Identify Concepts: from website
[Screenshot of a data table; source: FAO proof of concept project. Concepts marked on the table: Measure Type, Frequency and Time, Commodity, Reference Region, Unit and Unit Multiplier (Measurement = 1,000 Kg), Observation Value]

Concept Role: Reminder
• Dimensions
– Are the concepts that identify the observation value
• Attributes
– Are the concepts that add additional metadata about the observation value
• Measure
– Is the concept that is the observation value

Exercise: Concept Role
• Measure Type (Dimension)
• Frequency and Time (Dimensions)
• Observation Value (Measure)
• Commodity (Dimension)
• Reference Region (Dimension)
• Unit and Unit Multiplier (Attributes); Measurement = 1,000 Kg

Data Set and Structure
Dimension Concepts: FREQ, REF_AREA_REG, COMMODITY, MEASURE_TYPE, TIME
Measure Concept: OBS_VALUE
Attribute Concepts: OBS_STATUS, OBS_CONF, UNIT, UNIT_MULTIPLIER

Identify/Define Code Lists
• Purpose of a Code List
– Constrains the value domain of concepts when used in a structure like a data structure definition
– Defines a shortened, language-independent representation of the values
– Gives semantic meaning to the values, possibly in multiple languages
• Agreeing on harmonised code lists is the most difficult aspect of defining a data structure definition

Data Structure Definition - Reminder
[Diagram repeated from the Data Structure Definition slide: Key (Dimensions), Group Key, Attributes, and Measures take their semantic from Concepts and have a format given by a coded or non-coded Representation; a coded Representation has a Code List]

SDMX and Data Formats
Session: SDMX Syntax Implementations for Data

SDMX Data Syntax Implementations
• SDMX provides for two main
syntaxes:
– UN/EDIFACT (for SDMX-EDI)
– XML (for SDMX-ML)
• Each syntax provides a format for describing data structure definitions
• Each syntax provides at least one format for data
– There are 4 different XML syntaxes for data

SDMX-EDI
• EDI (“electronic data interchange”) is an older, flat-file syntax used primarily to conduct e-commerce
– There have been a few statistical messages
– GESMES is the “generic statistical message”
• EDI messages are difficult to read unless you know EDI very well…

Benefits of SDMX-EDI
• As a data format, it is very compact
– Good for very large data sets
• Permits incremental updating of data sets
• Permits attributes and observations to be sent separately
• Has a very large installed base within the European community and the central banks (used by 180 countries)
• It is not very Web-oriented, however

SDMX-ML Document Types (Data)
• Structure Message: Holds the agencies, concepts, codelists, and data structure definitions (DSDs)
• Generic Format: A single XML schema for all different types of data, regardless of data structure definition
• Utility Format: Specific to a DSD; provides strongest validation
• Compact Format: Like the EDI message, compact, but not as much validation as Utility
• Cross-Sectional Format: Similar to Compact, but holds cross-sectional data
• Data Query Message: Allows for querying of online databases and similar applications which are SDMX-aware. Supports web services.
The SDMX-ML Data Formats
• In designing the XML formats for SDMX, several different needs were identified
– Needed an XML format for describing data structure definitions
– Needed an XML version of the EDIFACT messages for transmitting large databases
– Needed an XML which would help validate statistical data sets
– Needed an XML which could be used generically for any statistical data set
– Needed an XML for transmitting cross-sectional data
– Needed a message to query for data
• Because SDMX-ML is based on the SDMX Information Model, it was decided to create several equivalent XML data formats, to satisfy each of these cases
– Requirements were mutually exclusive for these cases

Generic Data Message
• No validation
• Carries data for any data structure definition
• Verbose: files are very large
• Can perform incremental updates and carry partial data sets
• Useful for applications which need to carry potentially incorrect data for processing and cleaning
• Useful for generic applications which handle data for more than one DSD
• Serves as a “pivot format” between other SDMX-ML format types

Utility Data Message
• Provides strongest validation: all business rules in the DSD are enforced by a generic XML parser (schemas are specific to particular DSDs)
• Less verbose than Generic; more verbose than Compact & Cross-Sectional
• Incremental updates not supported
• For XML tools, this is the most “normal” type of XML schema; performs best

Compact Data Message
• Equivalent of SDMX-EDI data format, but schemas are specific to a particular DSD
• Good for exchanging partial data sets and incremental updates
• Very compact (for XML) in terms of file sizes
• Very simple, but performs limited validation
– Will validate codelists, but not some other things

Cross-Sectional Data Message
• Similar to Compact format, but allows for lots of observations for a single point in time (not time-series oriented like other formats)
• Very compact
• Supports incremental updates
• Provides
limited validation; schemas are specific to a particular DSD

Selecting the Right SDMX-ML Format
• Free tools allow transformation between data formats without any loss, so each application can use one or more formats for specific tasks
• Depending on the application, one format may be preferable to another
– How large are the data files?
– How much validation needs to be performed?
– How many DSDs are supported by the application?
– Will all data be correct when received (according to the DSD)?

SDMX-ML “Model-Driven” XML Approach
[Diagram: DSD]

Additional SDMX Features
• Hierarchical Code List
• Structure Set (mappings)
• Reporting Taxonomy

Hierarchical Code Lists: Example Scenario
• France is a country
• France is part of the continent of Europe
• France is a member of NATO
• France is a member of the EU
• France is a member of the G10
• When I analyse statistics I might want to see totals by
– continent
– trading block
– military alliance
– financial grouping
• France will be grouped with different sets of countries depending on the “view” required
• How do we express these groupings?
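One way to picture the answer is as several independent code-to-parent maps over the same country codes. The sketch below is a minimal Python illustration, not SDMX syntax: the names HIERARCHIES, parents_of, and total_by_parent are hypothetical, the code/parent pairs are a subset of those in the deck's Reference Area example (E1, E0, 6B, G0), and the observation values are invented purely for illustration.

```python
# Each hierarchy maps a country code to its parent code.
# Pairs are a subset of the deck's Reference Area example.
HIERARCHIES = {
    "Europe": {"BE": "E1", "CH": "E1", "DE": "E1", "FR": "E1", "GB": "E1"},
    "EU countries": {"BE": "E0", "DE": "E0", "FR": "E0", "GB": "E0"},
    "NATO countries": {"BE": "6B", "CA": "6B", "DE": "6B", "FR": "6B", "GB": "6B"},
    "G10 countries": {"BE": "G0", "CA": "G0", "CH": "G0", "DE": "G0",
                      "FR": "G0", "GB": "G0", "JP": "G0", "US": "G0"},
}


def parents_of(code: str) -> dict:
    """Return the parent of `code` in every hierarchy that contains it."""
    return {name: mapping[code]
            for name, mapping in HIERARCHIES.items()
            if code in mapping}


def total_by_parent(values: dict, hierarchy_name: str) -> dict:
    """Sum per-country values under their parent code in one hierarchy."""
    mapping = HIERARCHIES[hierarchy_name]
    totals = {}
    for code, value in values.items():
        parent = mapping.get(code)
        if parent is not None:  # countries outside this view are skipped
            totals[parent] = totals.get(parent, 0) + value
    return totals


if __name__ == "__main__":
    # Invented observation values, for illustration only.
    observations = {"FR": 10, "CA": 5, "JP": 2}
    print(parents_of("FR"))
    print(total_by_parent(observations, "G10 countries"))
    print(total_by_parent(observations, "NATO countries"))
```

The same code (FR) has a different parent in each view, and the code list itself is never altered. That separation of codes from groupings is what lets one set of observations be totalled by continent, trading block, military alliance, or financial grouping.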
[Schematic: a Code List holds Codes; each of Hierarchy-1 to Hierarchy-4 is built from Code Compositions over those codes]

Reference Area (code list)
6B NATO; B0 EU; B1 NAFTA; BE Belgium; BG Bulgaria; CA Canada; CH Switzerland; CZ Czech Republic; DE Germany; DK Denmark; E1 Europe; E8 North America; EE Estonia; ES Spain; FI Finland; FR France; GB United Kingdom; GR Greece; HU Hungary; JP Japan; I2 Euro 12; IT Italy; NE Netherlands; US United States

Hierarchies (Code, Parent)
Europe (parent E1): BE, BG, CH, CZ, DE, DK, EE, ES, FI, FR, GB, etc
EU countries (parent E0): BE, CZ, DE, DK, EE, ES, FI, FR, GB, etc
NATO countries (parent 6B): BE, BG, CA, CZ, DE, DK, EE, ES, FR, GB
G10 countries (parent G0): BE, CA, CH, DE, FR, GB, JP, IT, NL, SE, US
North America and NAFTA countries (Code, Parent, as extracted): CA B1; US B1; CA B1; MX B1; US B1; etc

Schematic of the Hierarchical Code Scheme
[Diagram: a Hierarchical Code Scheme comprises hierarchies and code groups; each Code belongs to a Code List; Code Association]
The codes may be in a variety of code lists.
[Schematic, continued: a Code Association relates a code to a parent code and can carry Properties; a Code Composition groups codes with the same parent; a Hierarchy comprises code groups and is either value based (has code groups) or level based (has formal Levels)]

Item Scheme Maps
• Many types of “item scheme” use the same fundamental structure
– Code list
– Category scheme
– Concept scheme
• Two Item Schemes can be mapped

Schematic of the “Code” Mapping
[Diagram: an Item Scheme Association (Code List Map, Category Scheme Map, or Concept Scheme Map) links a source item scheme to a target item scheme (Code List, Category Scheme, or Concept Scheme) through an Association Role, and has Item Associations, each linking a source item to a target item (Code, Category, or Concept)]

Structure Maps
• Structures can also be mapped
– Data structures
– Metadata structures

[Diagram: Data/Metadata Reporting, Query, Analysis, Mapping. Objects shown: Structure & Item Scheme Maps, Data or Metadata Structure Definition, Category Scheme, Category, Data or Metadata Set, Data or Metadata Flow, Content Constraint, Attachment Constraint, Data Provider, Provision Agreement, Registered Data Set or Metadata Set]

Reporting Taxonomy
• An SDMX Reporting Taxonomy is a group of data flows and/or metadata flows which form the basis of a single real-world document or report
• They can be organized into groups and sub-groups as needed
• They can be named and identified
• Useful for managing various types of reports over time

Processes
• SDMX 2.0 provides the ability to document the steps and logic of a process flow
• This is not executable, but serves as documentation to describe the processes which produce data and metadata
• It is useful as a target for the attachment of reference metadata describing processing

SDMX and Metadata Formats

Reference Metadata
• We have seen how data values are limited to where they belong
– Series key (usually qualified by time)
• Data attribute values are limited
in where they belong
– Observation value
– Series key
– Group key
– Data set
• Metadata is everywhere, but
– it must be metadata about “something”
• what is the “something”
• how is it identified
– it comprises concepts and how are they structured
• The Metadata Structure Definition answers these questions
• Advance release calendar is only one possible example

Metadata Example: Advance Release Calendar (ARC)
• What is the release calendar for?
– Informs when data will be published/made available
• Who publishes the data set?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
[Illustration: release calendar for Labour Force Statistics]

Metadata Structure Definition (MSD): Structure
[Illustration: release calendar]
• Concepts
• Hierarchies
• Representation (e.g. code list)

Metadata Structure Definition (MSD): Report Structure
[Diagram: a Metadata Structure Definition can comprise the specification of one or more Metadata Reports; a report has Metadata Attributes, which can have hierarchy; each attribute takes its semantic and context from a Concept defined in a Concept Scheme, and the definition of its format and permitted values from a Format and Permitted Value List]

Example ARC Metadata
Day | Ref Area | Indicator | Ref Period | Time | Tolerance | Status
30-04-2007 | INE, Spain | LF-H | Q: 31-03-2007 | 09:00 | +24 Hr. | Final
30-04-2007 | INE, Spain | LF-E | Q: 31-03-2007 | 09:00 | +24 Hr. | Final
30-04-2007 | ONS, UK | LF-H | Q: 31-03-2007 | 09:00 | +48 Hr. | Final
30-04-2007 | ONS, UK | LF-E | Q: 31-03-2007 | 09:00 | +48 Hr. | Draft
[Slide label on the table: Identifiers]

MSD Metadata Concepts: Advance Release Calendar
Concept Id | Description
REFERENCE_PERIOD | The time period to which a variable refers
RELEASE_DATE_TIME | The specific point in time that data or metadata are made available
DATE_TOLERANCE | The possible or permissible variance of a time period relative to a known point in time.
RELEASE_STATUS | The state of preparedness of a statement on the availability of data or metadata
ANNOTATION | Additional metadata

MSD: Report Structure for ARC
[Diagram: the Metadata Structure Definition ARC_METADATA specifies the Metadata Report ARC, whose Metadata Attributes are REFERENCE_PERIOD, RELEASE_DATE_TIME, DATE_TOLERANCE, RELEASE_STATUS, and ANNOTATION; each takes its concept from the Concept Scheme MY_AGENCY:METADATA_CONCEPTS and has a Format and Permitted Value List]

MSD: Metadata Report Structure
Metadata Report = ARC
Target Id =
Metadata Attribute: Concept = Reference_Period; Representation = Date/Time
Metadata Attribute: Concept = Release_Date_Time; Representation = Date/Time
Metadata Attribute: Concept = Date_Tolerance; Representation = Time Value
Metadata Attribute: Concept = Release_Status; Representation = CL_Status (F Final, P Provisional)
Metadata Attribute: Concept = Annotation; Representation = Text

Metadata Set: ARC Report Example
Metadata Set
Metadata Structure = ARC_METADATA
Metadata Report = ARC
Identifiers
Metadata Attributes
Concept = Reference_Period; Value = 2007-03-31
Concept = Release_Date_Time; Value = 2007-04-30T09:00
Concept = Date_Tolerance; Value = +24Hr
Concept = Release_Status; Value = F
Concept = Annotation; Value = simultaneous release by ECB

Metadata Example: Advance Release Calendar (ARC)
• What is the release calendar for?
– Informs when data will be published/made available
• Who publishes the data set?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
[Check marks on the slide flag the questions answered so far]

Metadata Structure Definition (MSD)
To which object is the metadata attached?
[Diagram: a Metadata Structure Definition can comprise the specification of one or more Metadata Reports; a Target Identifier links to each report; a report has Metadata Attributes, which can have hierarchy; each attribute takes its semantic and context from a Concept defined in a Concept Scheme, and the definition of its format and permitted values from a Format and Permitted Value List]

Data Flows: Controlling Reporting and Publishing
[Diagram: a Data Flow (here the release calendar) uses a specific data structure (Structure Definition); a Data Set conforms to business rules of the dataflow; a Data Provider publishes/reports data sets and can provide data for many data flows using agreed data structure; a Data Flow can get data from multiple data providers; Data Provider and Data Flow are linked by a Provision Agreement]

Controlling Data Reporting
[Same diagram with example values: Data Flow LF-H = labor force hours; Data Provider 1A = INE Spain]

Metadata Structure Definition (MSD): Identify Structure
[Illustration: release calendar; Provision Agreement]
• Concepts
• Hierarchies
• Representation (e.g. code list)

MSD: Identifying the “Target”
• The Metadata Structure Definition defines “keys” of object types to which metadata can be “attached”
• A Full Target Identifier or Partial Target Identifier specifies the identifier components (“key”) of the target object
• Each Identifier Component has a Target Object Type, which identifies the target object type of the component
• Each Identifier Component has an Item Scheme, which identifies the code list or other type of list (e.g.
Category Scheme) which defines the valid values that can be used when metadata are reported in a metadata set

MSD: Object Identification for ARC
[Diagram: the Metadata Structure Definition ARC_METADATA (Metadata Report ARC) has the Full Target Identifier Data_Flow_Provider, made up of two Identifier Components:]
• Target Object Type = Data Flow; Item Scheme = CL_DATA_FLOW (LF-H Labour Force, Hours Worked; LF-E Labour Force, Employment)
• Target Object Type = Data Provider; Item Scheme = OS_DATA_PROVIDER (1A INE, Spain; 2A ONS, UK)

MSD: Identifiers for ARC
Metadata Structure Definition = ARC_METADATA
Target = Data_Flow_Provider
Identifier Component: Target Object Type = Data Flow; Item Scheme = CL_DATA_FLOW (LF-H Labour Force, Hours Worked; LF-E Labour Force, Employment)
Identifier Component: Target Object Type = Data Provider; Item Scheme = OS_DATA_PROVIDER (1A INE, Spain; 2A ONS, UK)

MSD: Metadata Report Structure
Metadata Report = ARC
Target Id = Data_Flow_Provider
Metadata Attribute: Concept = Reference_Period; Representation = Date/Time
Metadata Attribute: Concept = Release_Date_Time; Representation = Date/Time
Metadata Attribute: Concept = Date_Tolerance; Representation = Time Value
Metadata Attribute: Concept = Release_Status; Representation = CL_Status (F Final, P Provisional)
Metadata Attribute: Concept = Annotation; Representation = Text

Metadata Set: ARC Report Example
Metadata Set
Metadata Structure = ARC_METADATA
Metadata Report = ARC
Identifiers
Data Provider = 1A
Data Flow = LF-H
Metadata Attributes
Concept = Reference_Period; Value = 2007-03-31
Concept = Release_Date_Time; Value = 2007-04-30T09:00
Concept = Date_Tolerance; Value = +24Hr
Concept = Release_Status; Value = F
Concept = Annotation; Value = simultaneous release by ECB
[Diagram labels: Data Flow, Data Provider, Provision Agreement]

Metadata: Advance Release Calendar (ARC)
• What is the release calendar for?
– Informs when data will be published/made available
• Who publishes the data?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
[Check marks on the slide flag the questions answered so far]

Controlling Metadata Reporting
[Diagram: the Metadata Flow ARC uses a specific data structure, the Metadata Structure Definition ARC_METADATA; a Metadata Set conforms to business rules of the metadata flow; the (Meta)Data Provider 1A publishes/reports metadata sets and can provide metadata for many metadata flows using agreed metadata structure; a Metadata Flow can get metadata from multiple metadata providers; provider and flow are linked by a Provision Agreement]
Metadata collectors can set up control metadata for the collection process

Metadata: Advance Release Calendar (ARC)
• What is the release calendar for?
– Informs when data will be published/made available
• Who publishes the data?
• What type of data is it (data flow)?
• What metadata is in the release calendar (i.e. its structure)?
• Who publishes the release calendar?
• When is it published?
[Check marks on the slide flag the questions answered so far]

Reference Metadata
• Metadata is everywhere, but
– it must be metadata about “something”
• what is the “something”
• how is it identified
– it comprises concepts and how are they structured
• The Metadata Structure Definition answers these questions
• Advance release calendar is only one possible example
– attached to the Provision Agreement

To which (other) things can metadata be attached?
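Before moving on from the release calendar, the ARC example can be summarised in a few lines of Python. This is only a sketch of the idea that a metadata structure definition constrains a report: the names TARGET_COMPONENTS, ATTRIBUTES, validate_report, and ARC_REPORT, and the dictionary layout, are assumptions made for illustration and are not SDMX-ML syntax. The codes and values come from the ARC slides; the reference period is written as 2007-03-31, reading the slide's value as 31 March 2007 in line with the calendar table.

```python
from typing import Dict, List, Optional, Set

# Target identifier components and the item schemes that list their valid values.
TARGET_COMPONENTS: Dict[str, Set[str]] = {
    "DATA_FLOW": {"LF-H", "LF-E"},   # CL_DATA_FLOW
    "DATA_PROVIDER": {"1A", "2A"},   # OS_DATA_PROVIDER
}

# Metadata attributes of the ARC report. A set means "coded" (permitted values);
# None means the attribute is uncoded in this sketch (date/time, time value, text).
ATTRIBUTES: Dict[str, Optional[Set[str]]] = {
    "REFERENCE_PERIOD": None,
    "RELEASE_DATE_TIME": None,
    "DATE_TOLERANCE": None,
    "RELEASE_STATUS": {"F", "P"},    # CL_Status: F Final, P Provisional
    "ANNOTATION": None,
}


def validate_report(report: dict) -> List[str]:
    """Return a list of problems; an empty list means the report fits the structure."""
    problems = []
    target = report.get("target", {})
    for component, allowed in TARGET_COMPONENTS.items():
        value = target.get(component)
        if value not in allowed:
            problems.append(f"{component}: {value!r} is not in its item scheme")
    for concept, value in report.get("attributes", {}).items():
        if concept not in ATTRIBUTES:
            problems.append(f"{concept}: not an attribute of this report")
            continue
        allowed = ATTRIBUTES[concept]
        if allowed is not None and value not in allowed:
            problems.append(f"{concept}: {value!r} is not a permitted value")
    return problems


# The ARC report from the slides: provider 1A (INE, Spain), data flow LF-H.
ARC_REPORT = {
    "target": {"DATA_FLOW": "LF-H", "DATA_PROVIDER": "1A"},
    "attributes": {
        "REFERENCE_PERIOD": "2007-03-31",
        "RELEASE_DATE_TIME": "2007-04-30T09:00",
        "DATE_TOLERANCE": "+24Hr",
        "RELEASE_STATUS": "F",
        "ANNOTATION": "simultaneous release by ECB",
    },
}

if __name__ == "__main__":
    print(validate_report(ARC_REPORT))
```

The target identifier (data flow plus data provider) says what the metadata is about, and the attributes say what is reported about it. Only the coded attribute can be checked against a list here, which matches the deck's later remark that validation of metadata reports tends to be limited.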
MSD: Some Object Types
Diagram elements: Structure Definition; Data Set or Metadata Set; Data Provider; Structure and Item Scheme Maps; Data or Metadata Flow; Provision Agreement; Category Scheme; Category; Content Constraint; Attachment Constraint; Registered Data Set or Metadata Set

MSD: List of Object Types to Which Metadata can be Attached
Agency, ConceptScheme, Concept, Codelist, Code, KeyFamily, Component, KeyDescriptor, MeasureDescriptor, AttributeDescriptor, GroupKeyDescriptor, Dimension, Measure, Attribute, CategoryScheme, ReportingTaxonomy, Category, OrganisationScheme, DataProvider, MetadataStructure, FullTargetIdentifier, PartialTargetIdentifier, MetadataAttribute, DataFlow, ProvisionAgreement, MetadataFlow, ContentConstraint, AttachmentConstraint, DataSet, XSDataSet, MetadataSet, HierarchicalCodelist, Hierarchy, StructureSet, StructureMap, ComponentMap, CodelistMap, CodeMap, CategorySchemeMap, CategoryMap, OrganisationSchemeMap, OrganisationRoleMap, ConceptSchemeMap, ConceptMap, Process, ProcessStep

Metadata Structure Definition (MSD) Report Structure
Diagram elements: Metadata Structure Definition; Metadata Report; Target Identifier; Metadata Attributes; Concept; Concept Scheme; Format and Permitted Value List
Diagram links:
– can comprise the specification of one or more report
– Links to
– takes semantic and context from
– concept defined in
– can have hierarchy
– definition of format and permitted values
– can have hierarchy

SDMX and Metadata Formats
Session: SDMX-ML Formats for Metadata Sets

Metadata Formats Syntax Implementation
• There are three relevant constructs in SDMX-ML for handling metadata sets
– Metadata Structure Definitions
– Metadata Reports (specific to an MSD)
– Generic Metadata Sets (for any MSD)
• This is similar to data formats in SDMX-ML, except that there are fewer different use cases
• There is no corresponding format implementation in SDMX-EDI for Reference Metadata

Comparing Formats for Metadata Sets
• Generic Metadata performs no validation, but can hold any type of metadata report
• MSD-specific Metadata Reports can perform
more validation, and are less verbose
– Because there tend to be few codelists or numeric types in metadata reports, the validation may not be very useful

Metadata: Quality Frameworks
• The SDMX cross domain concepts for reference metadata are concerned with data quality framework (DQAF) metadata
• These DQAFs are used to improve the quality, comparability, transparency etc. of published data

Metadata – Reported according to a Quality Framework

Example Metadata: Content
• Metadata Structure Definition: QUALITY_METADATA
• Metadata Report: CATEGORY_CONTENT_REPORT
• Concept Scheme: MY_CONCEPTS
• Concepts shown: ACCOUNTING_CONV; BASE_PER; COVERAGE; COVERAGE_SECTOR; REF_AREA; REF_PERIOD
• Diagram labels: Metadata Attributes; Concept; Format and Permitted Value List

SDMX Registry Overview

SDMX Registry/Repository
• REGISTRY, Data Set/Metadata Set: indexes data and metadata (interfaces: Register, Query)
• REPOSITORY, Provisioning Metadata: describes data and metadata sources and reporting processes (interfaces: Submit, Query)
• REPOSITORY, Structural Metadata: describes data and metadata structures (interfaces: Submit, Query)
• SDMX Registry Interfaces

SDMX Registry/Repository
• REGISTRY, Data Set/Metadata Set: indexes data and metadata (interfaces: Register, Query)
• Subscription/Notification: applications can subscribe to notification of new or changed objects
• REPOSITORY, Provisioning Metadata (interfaces: Submit, Query)
• REPOSITORY, Structural Metadata: describes data and metadata structures (interfaces: Submit, Query)
• SDMX Registry Interfaces

Information Model: High level Schematic
Diagram elements: Structure Maps; Data or Metadata Set; Data or Metadata Flow; Data Provider; Category Scheme; Data or Metadata Structure Definition
Diagram links:
– structure and code list maps
– conforms to business rules of the data/metadata flow
– uses specific data/metadata structure
– can be linked to categories in multiple category schemes
– publishes/reports data/metadata sets
– can provide data/metadata for many data/metadata flows using agreed data/metadata
structure
– can get data/metadata from multiple data/metadata providers
– registers existence of data and metadata: URL, registration date etc.
– comprises subject or reporting categories
– can have child categories
Further diagram elements: Provision Agreement; Category; Data or Metadata Set

SDMX Registry/Repository
• REGISTRY, Data Set/Metadata Set: indexes data and metadata (interfaces: Register, Query)
• Subscription/Notification: applications can subscribe to notification of new or changed objects
• REPOSITORY, Provisioning Metadata (interfaces: Submit, Query)
• REPOSITORY, Structural Metadata: describes data and metadata structures (interfaces: Submit, Query)
• SDMX Registry Interfaces

SDMX Artefacts: Registry Contents
Groupings: Structural Metadata; Provisioning Metadata; Registered Data and Metadata
Diagram elements: Structure Maps; Data Provider; Category Scheme; Structure Definition; Data Flow; Provision Agreement
Diagram links:
– structure and code list maps
– can provide data/metadata for many data/metadata flows using agreed data/metadata structure
– uses specific data/metadata structure
– can be linked to categories in multiple category schemes
– can get data/metadata from multiple data/metadata providers
– registers existence of data and metadata sets: URL, registration date etc.
– comprises subject or reporting categories
– can have child categories
Further diagram elements: Category; Data Set

The Old JEDH (Joint External Debt Hub) Site
• Sources: BIS, IMF, OECD, World Bank (Various Formats)
• WEBSITE (3-month production cycle)

JEDH with SDMX
• SDMX “Agent” retrieves data from sites (BIS, IMF, OECD, World Bank) as SDMX-ML
• SDMX Registry: discover data and URLs
• Data provided in real time to site (SDMX-ML)
• Loaded into JEDH DB (Debtor database)
• JEDH Site

FOOD AND AGRICULTURE ORGANIZATION OF THE UNITED NATIONS
SDMX in Action: Prototype System
Flow of FAO CountrySTAT-RegionSTAT Implementation
Diagram elements: FAO SDMX Registry; National Publication Server(s); CountrySTAT; Regional Publication Server; RegionSTAT
Numbered flows: 1, 2, 3a, 3b, 4
Slide courtesy of the FAO

Prototype System: Explanation
1 CountryStat National Publication Server
• The web site is published from the files in CountryStat
2 SDMX Publication
• The new CountryStat files are converted to SDMX-ML data sets and made web accessible on the CountryStat web site
• These files are registered in the FAO SDMX Registry
RegionStat Regional Publication Server
3a
• Queries the registry for new registrations; the registry responds with registration details including the URL of the new data sets
3b
• Retrieves the new data sets from the CountryStat web site
• Converts the SDMX-ML files to an internal format and integrates the new data sets with existing RegionStat data sets
4
• Re-publishes the RegionStat web site
Slide courtesy of the FAO

SDMX Implementation

Developing SDMX Applications
• General Design Approaches
• Publications and Dissemination
• Data Warehousing/Integration of Data Sources
• Other Topics

SDMX Publication and Dissemination
• SDMX can be used to drive Web dissemination and print publication
– It is a useful format for distribution from websites
– It can be used by websites to improve delivery of content
– It can be used to provide content to print applications, for tabular data
• These techniques
can result from a single system

Diagram elements: Data Storage (SDMX); SDMX Registry; Templates, boilerplate text, analysis; SDMX Query Engine; Print Publication Engine; Website
Note: Can be a virtual data store fed by the SDMX registry
Diagram links: SDMX-ML; XSL-FO; XSLT; ASP/JSP; Canned Queries; On-the-Fly Queries
Outputs: PDF, etc.; CSV; HTML

Notes on Publication/Dissemination
• Current practice is often to focus on the delivery of tables
– This is often not what users ideally want
– Tables can be viewed as “canned queries”
• Better web-sites can be created which support granular user queries supported by rich metadata
– See the ECB data warehouse, Federal Reserve Board site as examples
– See “Data on the Web” presentation for more details

Data Warehousing/Integration of Data Sources
• SDMX is also designed to support the collection and processing of data
– In most organizations, this is seen as a data warehousing activity
• SDMX provides tools for integrating data from a variety of sources
– Can be among a set of organizations or within an organization

Diagram elements: Data Sources (static files, databases, etc.); Data Warehouse (Data Loading; Data Harmonization/Processing; Data Dissemination); SDMX Registry; Website; Print Publication; Internal Applications
Diagram links: Data Pulled; Notification; Data Registration
Note: All types of dissemination applications may use the registry for various purposes. The registry may even be made publicly available to users who want SDMX-ML data and metadata.
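The register, query and subscribe/notify pattern described for the registry can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not the SDMX Registry interface specification: the class `ToyRegistry`, the `Registration` record and the example.org URLs are hypothetical. It only shows that the registry holds the location of data sets rather than the data, and that a warehouse can either be notified of new registrations or query for them before pulling the data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Registration:
    """Hypothetical record of a registered data set: who, what, and where."""
    data_provider: str
    data_flow: str
    url: str  # location from which the SDMX-ML data set could be retrieved


class ToyRegistry:
    """Toy stand-in for a registry: it indexes data sets, it does not store them."""

    def __init__(self) -> None:
        self._registrations: List[Registration] = []
        self._subscribers: Dict[str, List[Callable[[Registration], None]]] = {}

    def subscribe(self, data_flow: str,
                  callback: Callable[[Registration], None]) -> None:
        # Applications ask to be notified of new registrations for a data flow.
        self._subscribers.setdefault(data_flow, []).append(callback)

    def register(self, registration: Registration) -> None:
        # A data provider registers the existence and URL of a new data set.
        self._registrations.append(registration)
        for callback in self._subscribers.get(registration.data_flow, []):
            callback(registration)

    def query(self, data_flow: str) -> List[Registration]:
        # Applications can also pull: ask which data sets exist for a data flow.
        return [r for r in self._registrations if r.data_flow == data_flow]


registry = ToyRegistry()
to_load: List[str] = []  # URLs the warehouse should fetch next

# The warehouse subscribes to one data flow (notification path).
registry.subscribe("LF-H", lambda reg: to_load.append(reg.url))

# Two providers register new data sets (placeholder URLs).
registry.register(Registration("1A", "LF-H", "https://example.org/data/lf-h.xml"))
registry.register(Registration("2A", "LF-E", "https://example.org/data/lf-e.xml"))

print(to_load)
print([r.data_provider for r in registry.query("LF-E")])
```

In a real deployment the retrieval step would fetch and parse SDMX-ML from the registered URL; that part is left out here because it depends on the tools in use.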
Notes on Data Warehousing
• Each stage is loosely coupled with associated applications, using XML interfaces:
– Data sources
– Data processing
– Data dissemination applications
• The SDMX Registry functions throughout as a metadata repository, to provide structural and provisioning information as well as location of data as needed
• Internal database structures are based on SDMX information model
– They are predictable and regular
– They can be auto-generated

SDMX Tools and Resources

SDMX Tools (Partial List)
• Metadata Technology has a set of free tools for working with data and metadata, and a free registry implementation
– Mostly Java and XSLT
• Eurostat has a set of free tools for working with data and metadata, and has a registry implementation
• OECD and IMF have a web-services based package for dissemination: .STAT (available through MOU)
• ECB visualization tools written in Flex on Google Code
• Some other tools, including commercial vendors (STR Supercross 2, etc.)

Other Resources
• www.sdmx.org has a blog and makes many different presentations and papers available, as well as distributing copies of the standards
– An SDMX User’s Guide is currently being developed (beyond the material contained in the SDMX v 2.0 specification)
• The Open Data Foundation promotes SDMX (among other standards)
– Check www.opendatafoundation.org
– They host the SDMX Users Forum www.sdmxusers.org

SDMX and Other Standards

Other Important Standards
• Data Documentation Initiative (DDI)
– describes the micro-data inputs to aggregate (SDMX) data
• ISO/IEC 11179 Metadata Registries
– describes terminological/semantic and conceptual models, and the metadata lifecycle
• eXtensible Business Reporting Language (XBRL)
– describes financial microdata for economic statistics

SDMX and XBRL
• These standards can be mapped to each other successfully
• However, the mapping depends on the specific SDMX Data Structure Definition, and the specific XBRL “Taxonomy”
– There is no single, standard
mapping

DDI and SDMX Combined Data Model
• DDI 3 focuses on:
– collection and production of microdata
– reuse and sharing of common data structures
– conversion to statistical tables (matrices)
– preservation and multiple storage options
• SDMX focuses on:
– statistical tables
– reuse and sharing of common data structures
– consistent data transfer structure
• Together they form a coherent data management model for data capture, storage and interchange with a wide area of overlap

Generic Process Example
Diagram labels: DDI; Raw Data Set; Anonymization, cleaning, recoding, etc.; Micro-Data Set/Public Use Files; Aggregation, harmonization; Aggregate Data Set (Lower level); Aggregate Data Set (Higher Level); Aggregate Data Set (Highest-Level); SDMX

The Generic Statistical Business Process Model (GSBPM)
• The METIS group is a part of UN/ECE which addresses metadata issues for national statistical agencies (and other producers of official statistics)
– This community uses both SDMX and DDI
• They have produced a reference model of the statistical production process
– The DDI 3 Lifecycle Model was a major input
– GSBPM has a much greater level of detail

The Generic Statistical Information Model (GSIM)
• Early work on an information model to accompany the GSBPM is starting
– Still informal, very early
– Involves some of the statistical agencies which lead the work on GSBPM
• GSIM will take as a major input both the DDI and SDMX information models
– Will also cover other metadata
– Will also draw on other standards (Neuchatel Model for Classifications, etc.)
• Goal is to publish GSIM through METIS alongside the GSBPM

Questions?
