SDMX and DDI: A Proposed Approach for Interoperable Exchange of Register Data

I. Introduction

The purpose of this document is to illustrate how a single metadata model can work effectively between DDI and SDMX, allowing both types of XML formatting to be used for a specific case. This document is created as input for the ongoing dialogue between the SDMX sponsors and the DDI Alliance.

In order to illustrate this approach, one of the simpler use cases described as part of the ongoing informal dialogue between SDMX and DDI has been selected – the documentation, reporting, and/or exchange of register data. This is a use case for which DDI – Lifecycle (formerly DDI 3) – is used today, and one about which there has been much recent discussion within the SDMX community.

If the use of the standards is approached in the way described, it should be straightforward to take DDI XML and turn it into SDMX, and vice-versa. The same approach could potentially be taken for other use cases as well.

II. Background

One common use of DDI is to describe register or administrative data. This is also a function supported (to a certain extent) within SDMX. Today, we are seeing a growing number of statistical agencies involved in the collection, reporting, and even dissemination of microdata (including register data).

While the toolkits available for much microdata functionality have generally been based on DDI rather than SDMX, the latter standard is preferred by many official statistics organizations. Thus, it may be the case that a supplier of register data would prefer to use DDI, while the collector or end-user might prefer to have the same data and metadata in SDMX, or vice-versa. This paper proposes an approach which allows this difference to have minimal impact: if the problem can be reduced to a simple XML transformation, for which the mappings are agreed (and obvious), then the preference of one standard over the other becomes much less of an issue.
If a standard profile could be identified for the use of the standards, then generically useful tools could even be developed.

Illustrated below is a standard DDI view of how register data can be documented:

[Figure: DDI view of register data documentation]

From this picture – based on a production implementation collecting and documenting labour data – we can see that DDI provides a set of useful metadata alongside the data file. This includes not only structural metadata (codes, concepts, etc.) for manipulating the file, but also information about how the data was sourced, and what events were performed in its collection and cleaning. In this example, data came from more than one administrative source and was integrated into a single data set, but differences in the sources of administrative data required different processing/cleaning. This is not an uncommon scenario – register data often comes from disparate sources, but is formed into a single coherent data set when collected.

This proposal takes a typical case like this and shows how a selection of DDI elements can be used to document/exchange the metadata alongside the data file (presumably in ASCII), and how, alternatively, SDMX can be used to describe the data and metadata, so that an easily understood transformation could be specified. The key to this is identifying a sub-set of DDI and documenting the technique for applying SDMX.

It is worth noting that the reference metadata mechanism in SDMX was designed to – among other things – allow for non-SDMX standards to be reflected in an SDMX system. This proposal uses the existing capabilities of SDMX to describe microdata, and also the reference metadata capabilities of SDMX to describe a sub-set of DDI elements.

This proposal is not made at a fine-grained level of detail: it outlines the DDI elements needed at a fairly high level.
It is a rough first draft – obviously, there would need to be discussion about what the optimal metadata payload for register data exchange would be. Some of the more complex capabilities of DDI – such as describing hierarchical data sets with different types of records (persons, households, etc.) or other complex linked structures – are also not used in the draft profile, although these could be included if desired.

The presented use of SDMX and DDI is intended to be the basis of a discussion, and to propose this approach as one which could be used by the SDMX and DDI communities to better serve their users' needs. It is not presented as a finished business solution.

III. Description of Profiles in DDI

In the DDI standard, there is a feature known as a “DDI Profile”. This is an XML description of what fields within the overall DDI standard are supported by a specific application or organization, and it allows for several functions:

(1) It indicates whether a particular part of the standard is used or not
(2) It allows for optional elements to be required
(3) It allows for fixed or default values to be supplied for elements which do not have them normally in the standard
(4) It provides for application-specific names and documentation to be attached to the used elements

This XML profile can then be published and referenced by other applications, so that they know what to expect from a particular DDI instance.

We are proposing to use this feature of DDI as a way of helping to document a profile of DDI 3.1 to support our register use case, when there could be an issue around whether SDMX or DDI is the preferred standard. Obviously, the DDI profile can only express the profile in terms of DDI, so it would need to be supplemented by some SDMX artefacts and documentation as well. But the profile could be published and then referenced by register-data applications.
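The four profile functions listed in this section can be made concrete with a small sketch. The Python below is a toy model only: it does not reproduce the DDI Profile XML schema, and the ProfileRule class, the apply_profile function, the slash-separated paths, and the sample values are all hypothetical names introduced for illustration. The element names themselves (Abstract, Kind of Data, Purpose, Group) are taken from the profile outline given later in this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileRule:
    """One entry of a toy profile; NOT the DDI Profile XML schema."""
    path: str                          # hypothetical element path
    used: bool = True                  # (1) is this part of the standard used?
    required: bool = False             # (2) optional element made required
    default: Optional[str] = None      # (3) default value supplied by the profile
    local_name: Optional[str] = None   # (4) application-specific name


def apply_profile(record: dict, rules: list) -> dict:
    """Return a copy of `record` that conforms to `rules`, or raise ValueError."""
    result = {}
    for rule in rules:
        if not rule.used:
            continue  # elements outside the profile are dropped
        value = record.get(rule.path, rule.default)
        if value is None:
            if rule.required:
                raise ValueError(f"missing required element: {rule.path}")
            continue  # optional element that was simply not supplied
        result[rule.local_name or rule.path] = value
    return result


# Hypothetical rules loosely following the outline in this paper.
rules = [
    ProfileRule("StudyUnit/Abstract", required=True),
    ProfileRule("StudyUnit/KindOfData", default="register data"),
    ProfileRule("StudyUnit/Purpose", local_name="LegalBasis"),
    ProfileRule("Group", used=False),
]

# Invented sample content.
record = {"StudyUnit/Abstract": "Labour register extract", "Group": "ignored"}
conformed = apply_profile(record, rules)
print(conformed)
```

In this sketch the unused Group entry is dropped, the missing Kind of Data value is filled from the profile default, and an instance without an Abstract would be rejected. A published profile would let a receiving application make the same checks before attempting any transformation.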
The contents of the profile are provided in a later section – the description here of the capabilities of profiles is provided for those who are unfamiliar with this feature of DDI.

IV. Description of Reference Metadata in SDMX

There is a feature of SDMX which can be used to describe metadata models of any sort – SDMX Reference Metadata structures and reports. While this feature of SDMX is typically used for describing quality frameworks, it is very flexible, and one of its intended uses was to help SDMX systems cope with metadata based on other standard models. (The quality frameworks, in fact, can be seen as a type of metadata based on their own standards such as DQAF, etc.)

The schematic below shows how metadata reports are structured in SDMX.

[Figure: schematic of the structure of SDMX metadata reports]

Metadata reports use concepts to define the meanings of metadata attributes, assign representations (codelists, etc.) to these, and then allow for the attributes to be arranged as a flat list or a hierarchy. Once the structure has been defined, it is reflected in a standard XML report.

The trickiest part of SDMX Reference Metadata is the definition of a “Target Identifier”. The function of a target identifier is to allow the report to be attached specifically to an SDMX-defined object of any type (a data flow, a topical classification expressed as an SDMX CategoryScheme, etc.). It effectively identifies the subject of the metadata report. For our proposal, we assume that an SDMX data set (the microdata), possibly a data flow (we can already get this from the data set), and a topical category would make up the target identifier. The only absolutely required field here would of course be the data set identifier.

It is easy to imagine how a hierarchical XML structure such as DDI can be expressed using this technique, as a set of metadata attributes in an SDMX reference metadata report. For our purposes, a standard Metadata Structure Definition (MSD) would be published, corresponding to many of the elements of the DDI Profile mentioned above.
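As a rough illustration of how a hierarchical XML structure might be carried as nested metadata attributes, the sketch below walks a simplified XML fragment and builds one attribute per element, keyed by a concept identifier equal to the element name. The fragment is hypothetical: it uses the StudyUnit, Abstract, and Purpose elements of the profile, but real DDI XML uses namespaces and additional wrapper elements, and the text values are invented. The MetadataAttribute class and both helper functions are illustrative names, not part of any SDMX library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataAttribute:
    """Toy stand-in for a metadata attribute in a hierarchical report."""
    concept_id: str
    value: Optional[str] = None
    children: List["MetadataAttribute"] = field(default_factory=list)


def to_attribute(element: ET.Element) -> MetadataAttribute:
    """Convert an XML element and its descendants into nested attributes."""
    text = (element.text or "").strip() or None  # container elements carry no value
    return MetadataAttribute(
        concept_id=element.tag,
        value=text,
        children=[to_attribute(child) for child in element],
    )


def concept_ids(attribute: MetadataAttribute) -> List[str]:
    """List the concepts a metadata structure would need, in document order."""
    ids = [attribute.concept_id]
    for child in attribute.children:
        ids.extend(concept_ids(child))
    return ids


# Simplified, namespace-free fragment with invented content.
DDI_FRAGMENT = """
<StudyUnit>
  <Abstract>Annual extract from a labour register.</Abstract>
  <Purpose>Collected under a statutory reporting requirement.</Purpose>
</StudyUnit>
"""

report = to_attribute(ET.fromstring(DDI_FRAGMENT))
print(concept_ids(report))
```

The list of concept identifiers is what a metadata structure definition would have to declare, while the nesting of the attributes mirrors the nesting of the XML elements.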
To give an example, in our DDI profile described below, we have elements such as StudyUnit, which contains child elements Abstract and Purpose. One can easily imagine an SDMX Metadata Structure which has concepts titled “StudyUnit”, “Abstract”, and “Purpose”, arranged in a hierarchical report to reflect their relationships in the DDI. These would also have the same representations (both standards use XHTML to do formatting of this type of metadata, and both can describe codelists, numeric types, strings, etc.).

V. Microdata Structures in SDMX

In DDI, it is traditional (and generally the case) that data is left in its native format, or exported into an ASCII-based format such as CSV. While the standard is capable of carrying data, this is a secondary use of the standard, and a feature not widely implemented. For SDMX, however, the standard formatting of the data itself is a primary feature of the standard, and is key to the successful exchange of data.

As has been described elsewhere [reference to SDMX/DDI Microdata paper], SDMX was designed with some capacity to describe microdata sets. This model does not have the richness of DDI – it was originally to be used in the description of financial data at the account level, corresponding to a required reporting format established by legal obligation. Thus, SDMX supports a relatively small sub-set of the requirements which DDI was designed to meet as regards microdata.

In describing register data in SDMX, we will use the features of the standard intended for this purpose. This corresponds to part of the overall metadata payload contained in our DDI Profile, and in our proposal is supplemented by an attached SDMX Reference Metadata Report.

The problem with register data is that it is not inherently dimensional. In order to express it in SDMX, we will need to agree on how to apply the dimensional aspects of SDMX to a data file which could be tabulated in many different ways.
This is not difficult to do, however:

Dimension 1: Frequency – Even if not really needed from the perspective of a data reporter, it is traditional to use this as the first dimension in an SDMX data structure definition (DSD). Also, the data collector will likely want to create a full data set over time, in which this dimension will be relevant. Values for this would be the standard values for this dimension – quarterly, monthly, daily, etc. This might or might not be found in a DDI instance as a variable, but would be known by the reporter and collector in a regular exchange of data.

Dimension 2: Entity or Identity – These dimension types in SDMX were designed to describe either specific accounts (with the Identity type) or legal entities (such as reporting organizations or individuals, with the Entity type). This corresponds to the case identifier variable as described in the DDI file.

Dimension 3: Measure – This dimension gives the measure type of each of the values contained in the register data file (typically corresponding to the column headers in a CSV file). The values for this dimension are specific to each type of register data, and are expressed in the SDMX DSD as the values of a codelist, used to represent this measure dimension. These correspond to all the non-frequency, non-time, and non-case-identifying variables in the DDI-described data set.

Dimension 4: Time – This dimension again may seem unnecessary to the data reporter, but will be used by the data collector to create a view of the data over time. Typically, there is a variable within the register microdata which gives a reporting period, but this is not necessarily the case. It is strongly recommended for SDMX DSDs, however, so should be included here, as the value will be known to both counterparties whether contained in the microdata or not.
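A minimal sketch of this four-dimensional convention, written as plain Python data rather than as an SDMX structure message, may help fix ideas. The concept identifiers FREQ and TIME_PERIOD follow common SDMX usage but are assumptions here, as is the short list of frequency codes (A, Q, M, D). The Dimension class and both functions are hypothetical, and the measure codes and case identifier name are borrowed from the example used later in this paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Dimension:
    concept_id: str
    role: str
    codes: Optional[Tuple[str, ...]] = None  # None means the dimension is uncoded


def build_register_dsd(measure_codes, entity_concept="CASE_ID"):
    """Build the conventional four dimensions for one register data file."""
    return (
        Dimension("FREQ", "frequency", ("A", "Q", "M", "D")),
        Dimension(entity_concept, "entity"),
        Dimension("MEASURE", "measure", tuple(measure_codes)),
        Dimension("TIME_PERIOD", "time"),
    )


def validate_key(dsd, key):
    """Return a list of problems found in `key` (empty if the key is valid)."""
    problems = []
    for dim in dsd:
        if dim.concept_id not in key:
            problems.append(f"missing {dim.concept_id}")
        elif dim.codes is not None and key[dim.concept_id] not in dim.codes:
            problems.append(
                f"invalid code for {dim.concept_id}: {key[dim.concept_id]}"
            )
    return problems


dsd = build_register_dsd(["PART_FUL", "EMPLOY", "WAGES"])
good_key = {"FREQ": "A", "CASE_ID": "X1", "MEASURE": "WAGES", "TIME_PERIOD": "2010"}
bad_key = {"FREQ": "A", "CASE_ID": "X1", "MEASURE": "AGE"}
print(validate_key(dsd, good_key))
print(validate_key(dsd, bad_key))
```

Only the measure codelist changes from one type of register data to another; the other three dimensions stay fixed, which is what makes a generic transformation tool plausible.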
The primary measure in SDMX would be specified as “OBS_VALUE”, and be given a type as permissive as needed to describe all of the reported microdata – the value of the Measure dimension would be used to validate actual data types (these could all be described as secondary measures in SDMX if desired).

Note that SDMX uses concepts to provide the names for all dimensions and measures. The rule here would be that the identifiers of the variables in the DDI description of the register microdata would become the concepts used in the DSD for the measure dimension.

All microdata formatted as SDMX-ML under this proposal would use this four-dimensional structure, even in cases where there might be variables which may be tabulated differently by the counterparties under other circumstances. SDMX here is being used as a mechanism for data transport only.

VI. Data Transformation between SDMX and DDI at the Microdata Level

This section walks through an example of how the data described in DDI would be transformed into the data contained in an SDMX data set. In the later sections we will show how the rest of the metadata described in DDI could be transported in a reference metadata report.

Consider the following CSV file, as viewed in a spreadsheet tool. We will use this subset to illustrate our approach, both with SDMX and DDI.

[Figure: the example CSV file, with columns CASE_ID, PART_FUL, EMPLOY, and WAGES]

In the DDI description of this file, we will have four variables declared: CASE_ID, PART_FUL, EMPLOY, and WAGES, corresponding to the headers of each column. CASE_ID holds the employee identifier used by the administrative source (in this case, some fictional social security numbers), and is identified in the DDI metadata as the record identifier. PART_FUL holds a coded value indicating whether the employee works full-time or part-time (PT is part-time, FT is full-time). EMPLOY holds a value taken from an employment classification. WAGES holds the total compensation reported by the administrative source, expressed in US dollars.
Each DDI variable provides an ID, one or more names and labels, and specifies the type of the value (which can be numeric, textual, coded, etc.). The DDI instance also describes the categories and codes which provide valid values for coded types (similar to codelists in SDMX). Other parts of the DDI – not the variable descriptions – provide the logical and physical layout of the record: in this case, our four variables in the order they are shown above. Because this example is a CSV file, the format in DDI would be described as a delimited format, with the separator being a comma (fixed-width ASCII formats are also common, and supported, as are various types of database formats).

Let us assume that this is data reported annually, and that this is the report for 2010. What would this look like in SDMX? We would use our four-dimensional structure, as described above:

1. Frequency – Value in this case is known, but comes from the standard SDMX codelist: Annual
2. Case identifier (Entity) – Value is a unique identifier, which is fine because the values here are assigned to the individuals at birth, and are guaranteed to be unique (effectively a tax ID)
3. Measure – Value is taken from a codelist containing codes for PART_FUL, EMPLOY, and WAGES
4. Reference Period – Value is “2010”

Our primary measure, OBS_VALUE, is declared in the DSD as a string, because the identifiers contain characters which are non-numeric and un-coded (each different type of measure can be declared as a specific secondary measure).

For those parts of the DDI which correspond to the dimensions, codelists, data types, and record layouts, there is no need to reproduce them in our SDMX Reference Metadata – they can be determined by looking at the SDMX DSD.

Here is our example data expressed as version 2.0 CrossSectional format SDMX-ML (namespace abbreviations have been removed for clarity):

[Listing: the example data as SDMX-ML 2.0 CrossSectional XML]

The transformation between the two formats – DDI-described CSV and SDMX-ML – should be easy to imagine.
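A minimal sketch of the reshaping at the heart of this transformation is given below, under stated assumptions. It turns each CSV cell into one observation keyed by the four dimensions, and back again. It does not write SDMX-ML: serialising the observations into the chosen SDMX-ML schema is left out. The function names, the FREQ and TIME_PERIOD keys, the frequency code "A", and the two sample rows (including the identifiers and employment codes) are invented for illustration; only the column names and the reference year 2010 come from the example above.

```python
import csv
import io


def csv_to_observations(csv_text, id_column, freq, period):
    """Reshape a wide CSV file into one observation per (case, measure)."""
    observations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column == id_column:
                continue  # the case identifier is a dimension, not a measure
            observations.append({
                "FREQ": freq,
                id_column: row[id_column],
                "MEASURE": column,
                "TIME_PERIOD": period,
                "OBS_VALUE": value,  # kept as a string, as in the DSD
            })
    return observations


def observations_to_rows(observations, id_column):
    """Inverse reshaping: rebuild one wide record per case identifier."""
    rows = {}
    for obs in observations:
        row = rows.setdefault(obs[id_column], {id_column: obs[id_column]})
        row[obs["MEASURE"]] = obs["OBS_VALUE"]
    return list(rows.values())


# Invented sample rows; not the data from the original example.
SAMPLE_CSV = """CASE_ID,PART_FUL,EMPLOY,WAGES
ID-0001,FT,E10,52000
ID-0002,PT,E25,18500
"""

observations = csv_to_observations(SAMPLE_CSV, "CASE_ID", freq="A", period="2010")
rows = observations_to_rows(observations, "CASE_ID")
print(len(observations))
print(rows[0])
```

Because every observation carries its case identifier and measure code, the wide records can be rebuilt without loss, which is the property that lets either standard be used at either end of the exchange.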
It is similar to any other CSV-SDMX transformation. The tools for this already exist, but the style of creating the SDMX DSD is critical in making this work interoperably.

Notice that what is being described in the SDMX is the dimensionality not of the data (which, being microdata, is unpredictable – it can be tabulated in a very large number of different ways), but the dimensionality of the CSV formatting. This can only be described as it appears, based on our conventional use of only four dimensions. This has the benefit of predictability – we do not need to describe the desired tabulation of the data, which doesn’t exist in the DDI, since the DDI simply describes the microdata as it is formatted in the ASCII.

VII. Minimal DDI Profile for Register Data

The outline below provides quick descriptions of the high-level DDI elements which would be used to describe register data. It does not go into fine detail; that analysis and discussion would be required before this proposal could be finalized. We have also indicated in the outline which fields are expressed as structural metadata in the SDMX DSD, and which would be expressed as SDMX Reference Metadata.

Note that a DDI Profile can provide application-specific names for different tools using the application profile, so there may need to be a discussion of how the DDI elements are named for that purpose. The DDI 3.1 names are provided here, as they would appear in the XML.

Almost every element here is optional – further discussion may want to consider which fields are actually required for this profile. Note that a convention for capturing DDI XML attributes (as opposed to elements) would need to be agreed, but that is a technical detail, and not covered in this proposal.
DDI Instance – a generic wrapper for all DDI XML instances
  Study Unit (Groups and ResourcePackages are not used) – represents a single file of register data
    Abstract – A required element describing the data collection; could be populated with the name and description of the SDMX data flow.
    Universe Ref – a required field describing the respondent population; this would be the entities whose data is contained in the register, and would not typically change much over time.
    Series Statement – A statement about which collection the data belongs to (again, essentially the SDMX data flow)
    Purpose – A statement of the purpose of the data collection, could be a citation of the legal requirement for the data collection, etc.
    Coverage – describes the coverage of the data set, expressed as three sub-fields
      Topical Coverage (terms and keywords)
      Spatial Coverage – description, code or bounding box
      Temporal Coverage – date or date range
    Analysis Unit – the analysis unit of the data collection (individual, business, etc.)
    Kind of Data – This describes the data according to the user’s terminology. In this case, a default value of “register data” or “administrative data” might be fine.
    Other Materials – A list of related non-DDI materials, with links/citations. Probably maps to an SDMX Annotation.
    Notes – Notes, similar to SDMX Annotations.
    Conceptual Components – the DDI module describing conceptual constructs. Should be included if concepts are provided.
      Concept Scheme – a listing of concepts with their descriptions, very similar to SDMX Concept Schemes. Stored in the SDMX DSD.
      Universe Scheme – a required listing of universes; in this case, it would only hold the top-level universe for the register data set.
    Data Collection – the DDI module describing the process of data collection, in this case the register source and links to any processing code used to clean it.
      Methodology (maybe – data source might be enough) – a description of the methodology employed in the data collection.
      Software – a description of any software employed in the data collection.
      Collection Event – a description of the sourcing event to pull data from the register.
      Data Source – a description of (and optional link to) the data source itself.
      Processing Event – a description of (and optional link to the code for) conducting a cleaning process or other process operating on the register data.
    Logical Product – the DDI module containing logical metadata (variables, codes, logical record structure, etc.); much of this is structural metadata in SDMX.
      Data Relationship – used in DDI to describe hierarchical files and links; here it is a minimal version only providing the link between a logical record and a physical structure. This would be implicit in the SDMX file, and would not need to appear in the SDMX Reference Metadata.
      Logical Record – Describes the set of variables encompassed by the logical record; in this case a minimal set (only the variables used in the single file), needed for linking the physical and logical data constructs. Implicit in the SDMX expression, and would not need to be captured in the SDMX Reference Metadata.
      Category Scheme – in DDI this provides a set of meanings used by a codelist; this is structural metadata in SDMX, corresponding to the SDMX codelist descriptions.
      Code Scheme – the codes used by a codelist; these are paired with a Category Scheme to make up what would be a codelist in SDMX, and therefore structural metadata.
      Variable Scheme – in DDI these are the columns found in the data file; they would be implicit from the structural metadata for an SDMX DSD, so don’t need to be captured elsewhere.
    Physical Data Product – this is the DDI module which describes the physical structure of a data set. Because of the heavy re-use of data structures in some data sets, DDI has many levels of indirection; for our purposes, these are minimal descriptions of a single file, so this metadata is implicit in the SDMX DSD and is not carried by the SDMX Reference Metadata.
      Physical Structure Scheme – in DDI, a listing of various physical structures; for our purposes an implicit list of one item which could be generated on transformation into an ASCII format from an SDMX data set.
      Record Layout Scheme – similar to a physical structure scheme; implicit in SDMX and can be auto-generated when transforming the SDMX data set into an ASCII file
    Physical Instance – provides a link to the actual data file itself; in SDMX this would be part of the Target Identifier of the Reference Metadata, or could be made explicit. Some optional sub-fields could be useful if this is explicit: Record Layout Reference, Gross File Structure, Data File Identification
    Archive – a module required in DDI to contain an Organization Scheme, which is structural metadata in SDMX contained in the DSD.
      Organization Scheme – a listing of organizations and individuals relevant to the data collection (contact persons, maintenance agencies, etc.); for our purposes, a minimalist expression matching what is found in an SDMX Organization Scheme.
      Lifecycle Events – in DDI these provide a way of describing all the significant events related to a data set and its metadata. These could be part of the register profile if they are useful, but have no direct counterpart in SDMX.
    DDI Profile Reference – this is a reference to the DDI profile we are describing here, presumably published in DDI XML form on the UN/ECE site or other suitable location (the DDI Alliance site? SDMX.org?)

VIII. Description of SDMX Reference Metadata Structure for Register Data

Rather than provide a lengthy and incomplete SDMX Metadata Structure Definition describing the outline provided above (minus the elements which are either in the DSD or implicit), a shorter example of an SDMX Metadata report is shown. When expressed in the SDMX-ML structure-specific schema, the DDI elements literally become SDMX elements. This involves declaring an SDMX Concept for each element used, and then fitting it into the presentational hierarchy of metadata attributes, with the appropriate representation (typically text or XHTML).

As described above, the Target Identifier could be a simple data set ID, or could also include a reference to a topical classification (an SDMX Category Scheme) and/or a reference to an SDMX Data Flow. Even the reporting organization could be added if that was seen as useful (a reference to the SDMX Data Provider).

What is presented below is an example of what our SDMX Reference Metadata might look like – there are some details which are not in the outline. It does not cover the whole list, only the top portions, to give a sense of what the SDMX-ML would look like, for comparison to the outline. Namespace abbreviations have been removed for clarity. (Remember – this is an SDMX-ML Reference Metadata report, even though it says “DDIInstance” – this would be made clear through the namespace abbreviations, in a real implementation.)

[Listing: example SDMX-ML Reference Metadata report]

IX. Summary

It is possible to create a set of agreements and standards-based artefacts to support the exemplar business case of collecting register data using either SDMX or DDI in an aligned fashion. These artefacts include a DDI Profile; an SDMX DSD created according to the principle of describing the tabular format of a data file, rather than a possible tabulation of the microdata; and an SDMX metadata structure definition.
Alongside these artefacts would be documentation describing the various rules regarding the transformations of implicit metadata. In actual exchanges, either a DDI instance plus a data file, or an SDMX data set plus a matched SDMX reference metadata report, would be exchanged.

This approach could be applied to support more complex business cases as well, but the register case has been selected because it is a subject of discussion, and is relatively simple for the purposes of demonstrating the approach. We offer this as an initial draft for the basis of discussions, and not as a finished solution – more work will be needed, but we feel the outlined approach is one worth consideration.