SDMX and DDI: A Proposed Approach for
Interoperable Exchange of Register Data
I. Introduction
The purpose of this document is to illustrate how a single metadata model can work effectively between
DDI and SDMX, allowing both types of XML formatting to be used for a specific case. This document is
created as input for the ongoing dialogue between the SDMX sponsors and the DDI Alliance.
In order to illustrate this approach, one of the simpler use cases described as part of the ongoing
informal dialogue between SDMX and DDI has been selected – the documentation, reporting, and/or
exchange of register data. This is a use case for which DDI – Lifecycle (formerly DDI 3) – is used today,
and one about which there has been much recent discussion within the SDMX community.
If the use of the standards is approached in the way described, it should be straightforward to take DDI
XML and turn it into SDMX, and vice-versa. The same approach could potentially be taken for other use
cases as well.
II. Background
One common use of DDI is to describe register or administrative data. This is also a function supported
(to a certain extent) within SDMX. Today, we are seeing a growing number of statistical agencies
involved in the collection, reporting, and even dissemination of microdata (including register data).
While the toolkits available for much microdata functionality have generally been based on DDI rather
than SDMX, the latter standard is preferred by many official statistics organizations. Thus, it may be the
case that a supplier of register data would prefer to use DDI, while the collector or end-user might
prefer to have the same data and metadata in SDMX, or vice-versa.
This paper proposes an approach which allows this difference to have minimal impact: if the problem
can be easily reduced to a simple XML transformation, for which the mappings are agreed (and obvious),
then the preference of one standard over the other becomes much less of an issue. If a standard profile
could be identified for the use of the standards, then generically useful tools could even be developed.
Illustrated below is a standard DDI view of how register data can be documented:
From this picture – based on a production implementation collecting and documenting labour data – we
can see that DDI provides a set of useful metadata alongside the data file. This includes not only
structural metadata (codes, concepts, etc.) for manipulating the file, but also provides information
about how the data was sourced, and what events were performed in its collection and cleaning. In this
example, data was coming from more than one administrative source and being integrated into a single
data set, but differences in the sources of administrative data required different processing/cleaning.
This is not an uncommon scenario – register data often comes from disparate sources, but is formed
into a single coherent data set when collected.
This proposal takes a typical case like this and shows how a selection of DDI elements can be used to
document/exchange the metadata alongside the data file (presumably in ASCII); and how, alternatively,
SDMX can be used to describe the data and metadata, so that an easily-understood transformation
could be specified. The key is to identify a sub-set of DDI and to document the technique for
applying SDMX.
It is worth noting that the reference metadata mechanism in SDMX was designed to – among other
things – allow for non-SDMX standards to be reflected in an SDMX system. This proposal uses the
existing capabilities of SDMX to describe microdata, and also the reference metadata capabilities of
SDMX to describe a sub-set of DDI elements.
This proposal is not made at a fine-grained level of detail: it outlines the DDI elements needed at a fairly
high level. It is a rough first draft – obviously, there would need to be discussion about what the optimal
metadata payload for register data exchange would be. Some of the more complex capabilities of DDI –
such as describing hierarchical data sets with different types of records (persons, households, etc.) or
other complex linked structures – are also not used in the draft profile, although these could be included if
desired. The presented use of SDMX and DDI is intended to be the basis of a discussion, and to propose
this approach as one which could be used by the SDMX and DDI communities to better serve their users'
needs. It is not presented as a finished business solution.
III. Description of Profiles in DDI
In the DDI standard, there is a feature known as a “DDI Profile”. This is an XML description of what fields
within the overall DDI standard are supported by a specific application or organization, and it allows for
several functions:
(1) It indicates whether a particular part of the standard is used or not
(2) It allows for optional elements to be required
(3) It allows for fixed or default values to be supplied for elements which do not have them normally in
the standard
(4) It provides for application-specific names and documentation to be attached to the used elements
This XML profile can then be published and referenced by other applications, so that they know what to
expect from a particular DDI instance.
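The four functions listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the DDI Profile XML itself: the `ProfileEntry` class, the `apply_profile` function, and the slash-separated element paths are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileEntry:
    path: str                      # element the entry refers to (hypothetical path syntax)
    used: bool = True              # (1) whether this part of the standard is used
    required: bool = False         # (2) optional element made required
    default: Optional[str] = None  # (3) fixed or default value supplied by the profile
    alias: Optional[str] = None    # (4) application-specific name


def apply_profile(entries, instance):
    """Check a flat {path: value} instance against a profile.

    Returns the instance with defaults filled in, plus a list of problems.
    """
    result, problems = {}, []
    by_path = {e.path: e for e in entries}
    for path in instance:
        if path not in by_path or not by_path[path].used:
            problems.append(f"not used by profile: {path}")
    for e in entries:
        if not e.used:
            continue
        value = instance.get(e.path, e.default)
        if value is None:
            if e.required:
                problems.append(f"missing required: {e.path}")
            continue
        result[e.path] = value
    return result, problems


entries = [
    ProfileEntry("StudyUnit/Abstract", required=True),
    ProfileEntry("StudyUnit/KindOfData", default="register data"),
    ProfileEntry("Group", used=False),
]
instance = {"StudyUnit/Abstract": "Labour register", "Group": "unused"}
result, problems = apply_profile(entries, instance)
print(result)
print(problems)
```

An application that knows the published profile could run a check of this kind before attempting any transformation into SDMX.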
We are proposing to use this feature of DDI as a way of helping to document a profile of DDI 3.1 to
support our register use case, when there could be an issue around whether SDMX or DDI is the
preferred standard. Obviously, the DDI profile can only express the profile in terms of DDI, so it would
need to be supplemented by some SDMX artefacts and documentation as well. But the profile could be
published and then referenced by register-data applications.
The contents of the profile are provided in a later section – the description here of their capabilities is
provided for those who are unfamiliar with this feature of DDI.
IV. Description of Reference Metadata in SDMX
There is a feature of SDMX which can be used to describe metadata models of any sort – SDMX
Reference Metadata structures and reports. While this feature of SDMX is typically used for describing
quality frameworks, it is very flexible, and one of its intended uses was to help SDMX systems cope with
metadata based on other standard models. (The quality frameworks, in fact, can be seen as a type of
metadata based on their own standards such as DQAF, etc.)
The schematic below shows how metadata reports are structured in SDMX. They use concepts to define
the meanings of metadata attributes, assign representations (codelists, etc.) to these, and then allow for
the attributes to be arranged as a flat list or a hierarchy. Once the structure has been defined, it is
reflected in a standard XML report.
The trickiest part of SDMX Reference Metadata is the definition of a “Target Identifier”. The function of
a target identifier is to allow the report to be attached specifically to an SDMX-defined object of any
type (a data flow, a topical classification expressed as an SDMX CategoryScheme, etc.). It effectively
identifies the subject of the metadata report. For our proposal, we assume that an SDMX data set (the
microdata), possibly a data flow (we can already get this from the data set), and a topical category
would make up the target identifier. The only absolutely required field here would of course be the data
set identifier.
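As a minimal sketch of the target identifier assumed in this proposal, the Python below models the three components, with only the data set identifier required. The class name, field names, and the sample identifier are hypothetical; they are not taken from the SDMX schemas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetIdentifier:
    data_set_id: str                    # the only required component
    data_flow_id: Optional[str] = None  # optional, derivable from the data set
    category_id: Optional[str] = None   # optional topical category

    def __post_init__(self):
        if not self.data_set_id:
            raise ValueError("data_set_id is required")

    def parts(self):
        """Return only the components that are actually populated."""
        fields = {
            "data_set_id": self.data_set_id,
            "data_flow_id": self.data_flow_id,
            "category_id": self.category_id,
        }
        return {k: v for k, v in fields.items() if v is not None}


target = TargetIdentifier("DS_LABOUR_2010")  # hypothetical identifier
print(target.parts())
```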
It is easy to imagine how a hierarchical XML structure such as DDI can be expressed using this technique,
as a set of metadata attributes in an SDMX reference metadata report. For our purposes, a standard
MSD would be published which corresponded with many elements of the DDI Profile mentioned above.
To give an example, in our DDI profile described below, we have elements such as StudyUnit, which
contains child elements Abstract and Purpose. One can easily imagine an SDMX Metadata Structure
which has concepts titled “StudyUnit”, “Abstract”, and “Purpose”, arranged in a hierarchical report to
reflect their relationships in the DDI. These would also have the same representations (both standards
use XHTML to do formatting of this type of metadata, and both can describe codelists, numeric types,
strings, etc.).
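The StudyUnit example can be sketched as follows. The XML fragment is a hypothetical, namespace-free simplification (real DDI 3.1 instances are namespaced and carry more structure), and the dictionary layout for a metadata attribute is an assumption made for illustration rather than the SDMX-ML report format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment; the text values are invented.
DDI_FRAGMENT = """
<StudyUnit>
  <Abstract>Annual labour register extract.</Abstract>
  <Purpose>Required by statute.</Purpose>
</StudyUnit>
"""


def to_metadata_attributes(element):
    """Map an XML element to a nested metadata attribute keyed by concept id."""
    return {
        "concept": element.tag,
        # Whitespace-only text (pure container elements) is treated as no value.
        "value": (element.text or "").strip() or None,
        "children": [to_metadata_attributes(child) for child in element],
    }


report = to_metadata_attributes(ET.fromstring(DDI_FRAGMENT))
print(report["concept"], [c["concept"] for c in report["children"]])
```

Each element name becomes a concept identifier and the nesting of elements becomes the nesting of metadata attributes, which is the correspondence the Metadata Structure Definition would need to declare.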
V. Microdata Structures in SDMX
In DDI, it is traditional (and generally the case) that data is left in its native format, or exported into an
ASCII-based format such as CSV. While the standard is capable of carrying data, this is a secondary use of
the standard, and a feature not widely implemented.
For SDMX, however, the standard formatting of the data itself is a primary feature of the standard, and
is key to the successful exchange of data.
As has been described elsewhere [reference to SDMX/DDI Microdata paper], SDMX was designed with
some capacity to describe microdata sets. This model does not have the richness of DDI – it was
originally to be used in the description of financial data at the account level, corresponding to a required
reporting format established by legal obligation. Thus, SDMX supports a relatively small sub-set of the
requirements which DDI was designed to meet as regards microdata.
In describing register data in SDMX, we will use the features of the standard intended for this purpose.
This corresponds to part of the overall metadata payload contained in our DDI Profile, and in our
proposal is supplemented by an attached SDMX Reference Metadata Report.
The problem with register data is that it is not inherently dimensional. In order to express it in SDMX, we
will need to agree on how to apply the dimensional aspects of SDMX to a data file which could be
tabulated in many different ways. This is not difficult to do, however:
Dimension 1: Frequency – Even if not really needed from the perspective of a data reporter, it is
traditional to use this as the first dimension in an SDMX data structure definition (DSD). Also, the data
collector will likely want to create a full data set over time, in which this dimension will be relevant.
Values for this would be the standard values for this dimension – quarterly, monthly, daily, etc. This
might or might not be found in a DDI instance as a variable, but would be known by the reporter and
collector in a regular exchange of data.
Dimension 2: Entity or Identity – These dimension types in SDMX were designed to describe either
specific accounts (with the Identity type) or legal entities (such as reporting organizations or individuals,
with the Entity type). This corresponds to the case identifier variable as described in the DDI file.
Dimension 3: Measure – This dimension gives the measure type of each of the values contained in the
register data file (typically corresponding to the column headers in a CSV file). The values for this
dimension are specific to each type of register data, and are expressed in the SDMX DSD as the values of
a codelist, used to represent this measure dimension. These correspond to all the non-frequency, non-time, and non-case-identifying variables in the DDI-described data set.
Dimension 4: Time – This dimension again may seem unnecessary to the data reporter, but will be used
by the data collector to create a view of the data over time. Typically, there is a variable within the
register microdata which gives a reporting period, but this is not necessarily the case. It is strongly
recommended for SDMX DSDs, however, so should be included here, as the value will be known to both
counterparties whether contained in the microdata or not.
The primary measure in SDMX would be specified as “OBS_VALUE”, and be given as forgiving a type as
needed to describe all of the reported microdata – the value of the Measure dimension would be used
to validate actual data types (these could all be described as secondary measures in SDMX if desired).
Note that SDMX uses concepts to provide the names for all dimensions and measures. The rule here
would be that the identifiers of the variables in the DDI description of the register microdata would
become the concepts used in the DSD for the measure dimension.
All microdata formatted as SDMX-ML under this proposal would use this four-dimensional structure,
even in cases where there might be variables which may be tabulated differently by the counterparties
under other circumstances. SDMX here is being used as a mechanism for data transport only.
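The convention described in this section can be sketched as a small Python structure. The dimension identifiers `FREQ`, `CASE_ID`, `MEASURE` and `TIME_PERIOD`, the class name, and the function name are assumptions made for illustration; only `OBS_VALUE` is named in this proposal. The variable names are those of the example used in the next section.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterDsd:
    """Hypothetical in-memory view of the four-dimensional DSD convention."""
    dimensions: tuple = ("FREQ", "CASE_ID", "MEASURE", "TIME_PERIOD")  # assumed ids
    primary_measure: str = "OBS_VALUE"
    measure_codes: list = field(default_factory=list)


def build_register_dsd(ddi_variable_ids, case_id_variable, excluded=()):
    """Derive the measure codelist from DDI variable identifiers.

    Every variable that is neither the case identifier nor an explicitly
    excluded frequency or time variable becomes a code of the measure dimension.
    """
    skip = {case_id_variable, *excluded}
    codes = [v for v in ddi_variable_ids if v not in skip]
    return RegisterDsd(measure_codes=codes)


dsd = build_register_dsd(["CASE_ID", "PART_FUL", "EMPLOY", "WAGES"], "CASE_ID")
print(dsd.dimensions)
print(dsd.measure_codes)
```

The point of the sketch is that nothing in the structure depends on the subject matter of the register: only the measure codelist changes from one register to the next.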
VI. Data Transformation between SDMX and DDI at the Microdata Level
This section walks through an example of how the data described in DDI would be transformed into the
data contained in an SDMX data set. In the later sections we will show how the rest of the metadata
described in DDI could be transported in a reference metadata report.
Consider the following CSV file, as viewed in a spreadsheet tool. We will use this subset to illustrate our
approach, both with SDMX and DDI.
In the DDI description of this file, we will have four variables declared: CASE_ID, PART_FUL, EMPLOY,
and WAGES, corresponding to the headers of each column. CASE_ID holds the employee identifier used
by the administrative source (in this case, some fictional social security numbers), and is identified in the
DDI metadata as the record identifier. PART_FUL holds a coded value indicating whether the employee
works full-time or part-time (PT is part-time, FT is full-time). EMPLOY holds a value taken from an
employment classification. WAGES is the total compensation reported by the administrative source,
expressed in US dollars.
Each DDI variable provides an ID, one or more names and labels, and specifies the type of the value
(which can be numeric, textual, coded, etc.). The DDI instance also describes the categories and codes
which provide valid values for coded types (similar to codelists in SDMX).
Other parts of the DDI – not the variable descriptions – provide the logical and physical layout of the
record: in this case, our four variables in the order they are shown above. Because this example is a CSV
file, the format in DDI would be described as a delimited format, with the separator being a comma
(fixed-width ASCII formats are also common, and supported, as are various types of database formats).
Let us assume that this is data reported annually, and that this is the report for 2010.
What would this look like in SDMX?
We would use our four-dimensional structure, as described above:
1. Frequency – Value in this case is known, but comes from the standard SDMX codelist: Annual
2. Case identifier (Entity) – Value is a unique identifier, which is fine because the values here are
assigned to the individuals at birth, and are guaranteed to be unique (effectively a tax ID)
3. Measure – Value is taken from a codelist containing codes for PART_FUL, EMPLOY, and WAGES
4. Reference Period – Value is “2010”
Our primary measure, OBS_VALUE, is declared in the DSD as a string, because the identifiers contain
characters which are non-numeric and un-coded (each different type of measure can be declared as a
specific secondary measure).
For those parts of the DDI which correspond to the dimensions, codelists, data types, and record
layouts, there is no need to reproduce them in our SDMX Reference Metadata – they can be determined
by looking at the SDMX DSD.
Here is our example data expressed as version 2.0 CrossSectional format SDMX-ML (namespace
abbreviations have been removed for clarity):
The transformation between the two formats – DDI-described CSV and SDMX-ML – should be easy to
imagine. It is similar to any other CSV-SDMX transformation. The tools for this already exist, but the style
of creating the SDMX DSD is critical in making this work interoperably.
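A minimal sketch of the CSV-to-SDMX direction is given below in Python. It produces one observation per case and measure as plain dictionaries and deliberately stops short of writing SDMX-ML. The column names come from the example above, but the sample values, the dimension identifiers, and the function name are invented for illustration; the frequency code `A` is assumed to be the annual code of the standard frequency codelist.

```python
import csv
import io

# Hypothetical sample rows: the column names come from the example above,
# but the values are invented for illustration only.
SAMPLE_CSV = """CASE_ID,PART_FUL,EMPLOY,WAGES
A-001,FT,E10,52000
A-002,PT,E22,18500
"""


def csv_to_observations(text, case_id_column, freq, period):
    """Unpivot a wide CSV into one observation per (case, measure) pair."""
    observations = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column == case_id_column:
                continue  # the case identifier is a dimension, not a measure
            observations.append({
                "FREQ": freq,
                "CASE_ID": row[case_id_column],
                "MEASURE": column,
                "TIME_PERIOD": period,
                "OBS_VALUE": value,  # kept as a string, as declared in the DSD
            })
    return observations


obs = csv_to_observations(SAMPLE_CSV, "CASE_ID", freq="A", period="2010")
for o in obs:
    print(o)
```

The frequency and reference period are passed in as arguments because, as noted earlier, they are known to both counterparties even when they do not appear as variables in the file.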
Notice that what is being described in the SDMX is the dimensionality not of the data (which, being
microdata, is unpredictable - it can be tabulated in a very large number of different ways), but the
dimensionality of the CSV formatting. This can only be described as it appears, based on our
conventional use of only four dimensions. This has the benefit of predictability – we do not need to
describe the desired tabulation of the data, which doesn’t exist in the DDI since it simply describes the
microdata as it is formatted in the ASCII.
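Because the DSD describes the CSV layout rather than a tabulation, the reverse direction is equally mechanical. The Python sketch below pivots observations back into one CSV row per case. The observations are shown as simple (case, measure, value) tuples; the sample values and the function name are invented for illustration.

```python
import csv
import io

# Hypothetical observations as (case id, measure code, value) tuples.
LONG = [
    ("A-001", "PART_FUL", "FT"), ("A-001", "EMPLOY", "E10"), ("A-001", "WAGES", "52000"),
    ("A-002", "PART_FUL", "PT"), ("A-002", "EMPLOY", "E22"), ("A-002", "WAGES", "18500"),
]


def observations_to_csv(observations, case_id_column, measure_order):
    """Pivot long observations back into one CSV row per case."""
    rows = {}
    for case_id, measure, value in observations:
        row = rows.setdefault(case_id, {case_id_column: case_id})
        row[measure] = value
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[case_id_column, *measure_order],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows.values())  # measures missing for a case are left empty
    return out.getvalue()


text = observations_to_csv(LONG, "CASE_ID", ["PART_FUL", "EMPLOY", "WAGES"])
print(text)
```

The column order is supplied by the caller here; in practice it would be taken from the order of codes in the measure codelist or from an agreed record layout.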
VII. Minimal DDI Profile for Register Data
The outline below provides quick descriptions of the high-level DDI elements which would be used to
describe register data. This does not go into the gritty detail, and that analysis and discussion would be
required before this proposal could be finalized. We have also indicated in the outline which fields are
expressed as structural metadata in the SDMX DSD, and which would be expressed as SDMX Reference
Metadata.
Note that a DDI Profile can provide application-specific names for different tools using the application
profile, so there may need to be a discussion of how the DDI elements are named for that purpose. The
DDI 3.1 names are provided here, as they would appear in the XML.
Almost every element here is optional – further discussion may want to consider which fields are
actually required for this profile.
Note that a convention for capturing DDI XML attributes (as opposed to elements) would need to be
agreed, but that is a technical detail, and not covered in this proposal.
DDI Instance – a generic wrapper for all DDI XML instances
Study Unit (Groups and ResourcePackages are not used) – represents a single file of register
data
Abstract – A required element describing the data collection; could be populated with
the name and description of the SDMX data flow.
Universe Ref – a required field describing the respondent population; this would be the
entities whose data is contained in the register, and would not typically change much over time.
Series Statement – A statement about which collection the data belongs to (again,
essentially the SDMX data flow)
Purpose – A statement of the purpose of the data collection, could be a citation of the
legal requirement for the data collection, etc.
Coverage – describes the coverage of the data set, expressed as three sub-fields
Topical Coverage (terms and keywords)
Spatial Coverage – description, code or bounding box
Temporal Coverage – date or date range
Analysis Unit – the analysis unit of the data collection (individual, business, etc.)
Kind of Data – This describes the data according to the user’s terminology. In this case,
a default value of “register data” or “administrative data” might be fine.
Other Materials – A list of related non-DDI materials, with links/citations. Probably
maps to an SDMX Annotation.
Notes – Notes, similar to SDMX Annotations.
Conceptual Components – the DDI module describing conceptual constructs. Should be
included if concepts are provided.
Concept Scheme – a listing of concepts with their descriptions, very similar to
SDMX Concept Schemes. Stored in the SDMX DSD.
Universe Scheme – a required listing of universes; in this case, it would only
hold the top-level universe for the register data set.
Data Collection – the DDI module describing the process of data collection, in this case
the register source and links to any processing code used to clean it.
Methodology (maybe - data source might be enough) – a description of the
methodology employed in the data collection.
Software – a description of any software employed in the data collection.
Collection Event – a description of the sourcing event to pull data from the
register.
Data Source – a description of (and optional link to) the data source itself.
Processing Event – a description of (and optional link to the code for)
conducting a cleaning process or other process operating on the register data.
Logical Product – the DDI module containing logical metadata (variables, codes, logical
record structure, etc.); much of this is structural metadata in SDMX.
Data Relationship – used in DDI to describe hierarchical files and links; here it is a
minimal version only providing the link between a logical record and a physical
structure. This would be implicit in the SDMX file, and would not need to appear in the
SDMX Reference Metadata.
Logical Record – Describes the set of variables encompassed by the
logical record; in this case a minimal set (only the variables used in the single
file), needed for linking the physical and logical data constructs. Implicit in the
SDMX expression, and would not need to be captured in the SDMX Reference
Metadata.
Category Scheme – in DDI this provides a set of meanings used by a codelist;
this is structural metadata corresponding to the SDMX codelist descriptions. Structural
metadata in SDMX.
Code Scheme – the codes used by a codelist; these are paired with a Category
Scheme to make up what would be a codelist in SDMX, and therefore structural
metadata.
Variable Scheme – in DDI these are the columns found in the data file; they
would be implicit from the structural metadata for an SDMX DSD, so don’t need to be
captured elsewhere.
Physical Data Product – this is the DDI module which describes the physical structure of a data
set. Because of the heavy re-use of data structures in some data sets, DDI has many levels of indirection;
for our purposes, these are minimal descriptions of a single file, so this metadata is implicit in the SDMX
DSD and is not carried by the SDMX Reference Metadata.
Physical Structure Scheme – in DDI, a listing of various physical structures; for our
purposes an implicit list of one item which could be generated on transformation into an ASCII
format from an SDMX data set.
Record Layout Scheme – similar to a physical structure scheme; implicit in SDMX and
can be auto-generated when transforming the SDMX data set into an ASCII file
Physical Instance – provides a link to the actual data file itself; in SDMX this would be part of the
Target Identifier of the Reference Metadata, or could be made explicit. Some optional sub-fields could be
useful if this is explicit: Record Layout Reference, Gross File Structure, Data File Identification
Archive – a module required in DDI to contain an Organization Scheme, which is structural
metadata in SDMX contained in the DSD.
Organisation Scheme – a listing of organizations and individuals relevant to the data
collection (contact persons, maintenance agencies, etc.); for our purposes, a minimalist
expression matching what is found in an SDMX Organization Scheme.
Lifecycle Events – in DDI these provide a way of describing all the significant events
related to a data set and its metadata. These could be part of the register profile if they are useful, but
have no counterpart in SDMX.
DDI Profile Reference – this is a reference to the DDI profile we are describing here, presumably
published in DDI XML form on the UN/ECE site or other suitable location (the DDI Alliance site?
SDMX.org?)
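The split described in the outline can be summarised as a simple lookup, sketched below in Python. The element names are written without spaces for illustration and may differ from the exact DDI 3.1 XML names; the category labels are hypothetical. Elements that the outline does not mark as structural or implicit are assumed here to travel as reference metadata, and only a selection of elements is shown.

```python
# Classification of a selection of the outlined DDI elements, following the
# notes in the outline. The labels "dsd", "implicit" and "reference_metadata"
# are hypothetical names used only in this sketch.
SDMX_HOME = {
    "ConceptScheme": "dsd",
    "CategoryScheme": "dsd",
    "CodeScheme": "dsd",
    "OrganizationScheme": "dsd",
    "VariableScheme": "implicit",
    "DataRelationship": "implicit",
    "LogicalRecord": "implicit",
    "PhysicalStructureScheme": "implicit",
    "RecordLayoutScheme": "implicit",
    "Abstract": "reference_metadata",        # assumed: not marked in the outline
    "Purpose": "reference_metadata",         # assumed: not marked in the outline
    "SeriesStatement": "reference_metadata", # assumed: not marked in the outline
    "Coverage": "reference_metadata",        # assumed: not marked in the outline
    "CollectionEvent": "reference_metadata", # assumed: not marked in the outline
    "ProcessingEvent": "reference_metadata", # assumed: not marked in the outline
}


def msd_concepts(classification):
    """List the elements that need a concept in the metadata structure."""
    return [name for name, home in classification.items()
            if home == "reference_metadata"]


print(msd_concepts(SDMX_HOME))
```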
VIII. Description of SDMX Reference Metadata Structure for Register
Data
Rather than provide a lengthy and incomplete SDMX Metadata Structure Definition describing the
outline provided above (minus the elements which are either in the DSD or implicit), a shorter example
of an SDMX Metadata report is shown. When expressed in the SDMX-ML structure-specific schema, the
DDI elements literally become SDMX elements.
This involves declaring an SDMX Concept for each element used, and then fitting it into the
presentational hierarchy of metadata attributes, with the appropriate representation (typically text or
XHTML).
As described above, the Target Identifier could be a simple data set ID, or could also include a reference
to a topical classification (an SDMX Category Scheme) and/or a reference to an SDMX Data Flow. Even
the reporting organization could be added if that was seen as useful (a reference to the SDMX Data
Provider).
What is presented below is an example of what our SDMX Reference Metadata might look like – there
are some details which are not in the outline. It does not cover the whole list, only the top portions, to
give a sense of what the SDMX-ML would look like, for comparison to the outline. Namespace
abbreviations have been removed for clarity. (Remember – this is an SDMX-ML Reference Metadata
report, even though it says “DDIInstance” – this would be made clear through the namespace
abbreviations, in a real implementation.)
IX. Summary
It is possible to create a set of agreements and standards-based artefacts, to support the exemplar
business case of collecting register data using either SDMX or DDI in an aligned fashion. These artefacts
include a DDI Profile, an SDMX DSD created according to the principles of describing the tabular format
of a data file, rather than a possible tabulation of the microdata, and an SDMX metadata structure
definition. Alongside these artefacts would be documentation describing the various rules regarding the
transformations of implicit metadata. In actual exchanges, a DDI instance plus a data file or an SDMX
data set and a matched SDMX reference metadata report would be exchanged.
This approach could be applied to support more complex business cases as well, but the register case
has been selected because it is a subject of discussion, and is relatively simple for the purposes of
demonstrating the approach.
We offer this as an initial draft for the basis of discussions, and not as a finished solution – more work
will be needed, but we feel the outlined approach is one worth consideration.