NSS Seminar on OpenDDI
A Standards-Based Global Microdata Portal for Researchers
Arofan Gregory
Metadata Technology North America
23 August 2011

Overview
• Background on ABS, SDMX, and DDI
• Context for the OpenDDI Portal
• The problem for researchers
• The problem for data disseminators
• What is OpenDDI?
• Demo
• Some ideas about this technology and NSS

ABS, DDI, and SDMX
• The ABS is currently prototyping and implementing two open standards which are becoming widespread among data producers and archives worldwide
  – The Statistical Data and Metadata Exchange (SDMX)
  – The Data Documentation Initiative (DDI)
• The Australian Data Archive (formerly ASSDA) has long been a user of DDI
• Both standards leverage modern metadata-driven paradigms which enable increased automation in data systems

SDMX
• SDMX comes out of the world of official statistics, and was developed for the exchange and reporting of statistical aggregates
• It is international in scope, developed by a consortium consisting of the BIS, ECB, Eurostat, IMF, OECD, the World Bank, and the UN Statistics Division
• It is now being widely adopted, and is recommended as the standard for statistical exchange at the highest levels (the UN Statistical Commission)

SDMX Products
• The SDMX Information Model: a conceptual model for statistical exchanges
• XML formats for statistical data and metadata
• A registry-based “SOA” architecture for statistical exchange
  – Provides immediate interoperability between organizations
• Recommendations for content harmonization across domains and organizational boundaries

DDI
• A standard developed by an international member-based consortium
  – ABS is a member
• Traditionally dominated by the world of national data archives, it is now increasingly being used by national statistical institutes
• Adoption is widespread
Note: There is a detailed DDI presentation by Wendy Thomas from 2009 on the Statistical Leadership Seminars site

DDI Products
• A model for the production and
processing of microdata/survey data into statistical aggregate products
• XML formats for data and metadata involved in data production
  – Strong focus on detailed metadata describing exactly how input data has been collected and processed
  – Used heavily by data archives and research data centers, as well as by statistical agencies

DDI - Lifecycle

DDI Metadata
• At each stage of the lifecycle, DDI captures metadata regarding that production step
  – Examples include classifications, concepts, variables, processing steps, etc.
• There is an emphasis on data comparability and reuse
• The standard is able to express human-readable metadata (in multiple languages) as well as “machine-actionable” metadata

An Observation
• DDI and SDMX are standards which allow for services which have never before been possible
  – The OpenDDI Portal is only possible because of the widespread use of a standard metadata format: DDI
  – This is a paradigm which builds on the existence of the Internet and Service-Oriented tools and technologies
  – Other efforts to create portals of this type have failed, due to the lack of standard metadata descriptions

Generic Process Example
[Diagram. Elements: Raw Data Set; anonymization, cleaning, recoding, etc.; Micro-Data Set / Public Use Files; aggregation, harmonization; Aggregate Data Set (Lower Level); Aggregate Data Set (Higher Level); Aggregate Data Set (Highest Level). Labels: DDI, SDMX.]

Context for the OpenDDI Portal
• Metadata Technology has a standards-based business model
  – We promote the use of SDMX and DDI
  – We are the international experts in these standards, and are active in their development and implementation
• OpenDDI is the first of a set of services which we plan to develop to promote adoption of the standards
  – In future, other developments will include more online services based on DDI
  – We plan to expand the site to encompass SDMX services as well under the umbrella brand of “OpenMetadata”

The Problem for Researchers
• Researchers have difficulty finding data for secondary re-use
  – There is a (very) large number of data archives and other data producers, even within a single domain or country
  – Existing searchable metadata is of varying quality (Google doesn’t cut it!)
• Most high-quality research data is confidential microdata
  – Researchers must apply to the archive or producer for access, assuming they can locate the appropriate data
  – It cannot be published directly onto the Web in many cases, and thus cannot be easily located

The Problem for Data Disseminators
• In most cases, when data are “harvested” from disseminators’ sites, they are re-published without proper provenance information
  – Often, the re-publishers of data do not maintain the data properly
  – Provenance is a major issue!
  – The situation is even worse for the re-publishing of metadata
• Data disseminators develop effective systems for delivering their own data
  – Re-publishers may not offer the same level of quality

What is OpenDDI?
• OpenDDI is a global catalog service offering visibility into the holdings of the world’s data producers and archives to researchers
• It is based on those producers exposing their holdings as DDI descriptions of their data
• It provides good provenance information, and granular comparison functionality across data sets
• It directs users to the source of the data, so they can apply for access and leverage other data services offered by the producers themselves

Technical Information
• OpenDDI has not yet been launched
  – We are now doing testing with several data disseminators
  – It will be a public service when it is officially launched
  – We have a “build it and they will come” mentality: we do not have a pre-defined business model for this functionality aside from the marketing benefit we get from it
  – It covers only about 1.3 million variables today; we anticipate well over 3 million in time

Demo
• http://www.openmetadata.org/openddi/

Some Ideas about the OpenDDI Technology and NSS
• It seems obvious that portal functionality of this type could be deployed at a national level
• OpenDDI itself harvests a similar portal deployed by the World Bank for NSIs in the developing world
• Functionality could be extended to cover aggregate statistics exposed as SDMX
  – Researchers are only one audience
• A minimum requirement would be a “DDI Lite” or “SDMX Lite” format for data and/or metadata

Thank You! Questions?
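
Illustrative Sketch: Catalog with Provenance
The catalog idea from “What is OpenDDI?” (harvested variable descriptions that keep their provenance, can be compared across data sets, and point back to the source) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the record fields, the `Catalog` class, the `compare` function, the archive names, and the example.org URLs are all hypothetical, and none of it reflects real DDI XML or the actual OpenDDI implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableRecord:
    """One harvested variable description (hypothetical, simplified shape)."""
    name: str
    label: str
    concept: str
    categories: frozenset  # category labels for coded variables
    source_archive: str    # provenance: who published the metadata
    study_title: str       # provenance: which study the variable belongs to
    access_url: str        # where a researcher would apply for the data


class Catalog:
    """In-memory stand-in for a catalog of harvested variable metadata."""

    def __init__(self):
        self._records = []

    def harvest(self, records):
        # Records are stored unchanged, so provenance fields are never lost.
        self._records.extend(records)

    def search(self, keyword):
        # Case-insensitive match on the variable label or concept.
        kw = keyword.lower()
        return [r for r in self._records
                if kw in r.label.lower() or kw in r.concept.lower()]


def compare(a, b):
    """Compare two variables by concept and by their category labels."""
    return {
        "same_concept": a.concept == b.concept,
        "shared": sorted(a.categories & b.categories),
        "only_a": sorted(a.categories - b.categories),
        "only_b": sorted(b.categories - a.categories),
    }


# Made-up example holdings from two hypothetical archives.
catalog = Catalog()
catalog.harvest([
    VariableRecord("SEX", "Sex of respondent", "sex",
                   frozenset({"Male", "Female"}),
                   "Archive A", "Household Survey 2009",
                   "https://example.org/archive-a/access"),
    VariableRecord("GENDER", "Gender of respondent", "sex",
                   frozenset({"Male", "Female", "Not stated"}),
                   "Archive B", "Labour Survey 2010",
                   "https://example.org/archive-b/access"),
    VariableRecord("AGE", "Age in years", "age", frozenset(),
                   "Archive A", "Household Survey 2009",
                   "https://example.org/archive-a/access"),
])

hits = catalog.search("sex")
for hit in hits:
    # Each result names its source and where to apply for access.
    print(hit.name, "|", hit.source_archive, "|", hit.study_title, "|", hit.access_url)

print(compare(hits[0], hits[1]))
```

The design point is that the catalog holds only metadata: every search result carries its source archive and an access link, so the researcher is sent back to the producer rather than to a re-published copy of the data.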