NSS Seminar on OpenDDI
A Standards-Based Global Microdata Portal for Researchers
Arofan Gregory
Metadata Technology North America
23 August 2011
Overview
• Background on ABS, SDMX, and DDI
• Context for the OpenDDI Portal
• The problem for researchers
• The problem for data disseminators
• What is OpenDDI?
• Demo
• Some ideas about this technology and NSS
ABS, DDI, and SDMX
• The ABS is currently prototyping and
implementing two open standards which are
becoming widespread among data producers and
archives world-wide
– The Statistical Data and Metadata Exchange (SDMX)
– The Data Documentation Initiative (DDI)
• The Australian Data Archive (formerly ASSDA) has
long been a user of DDI
• Both standards leverage modern metadata-driven
paradigms which enable increased automation in
data systems
SDMX
• SDMX comes out of the world of official statistics,
and was developed for statistical exchange and
reporting of statistical aggregates
• It is international in scope, developed by a
consortium consisting of the BIS, ECB, Eurostat, IMF,
OECD, the World Bank, and the UN Statistics
Division
• It is now being widely adopted as the
recommended standard for statistical exchange
from the highest levels (the UN Statistical
Commission)
SDMX Products
• The SDMX Information Model – a conceptual
model for statistical exchanges
• XML formats for statistical data and metadata
• A registry-based “SOA” architecture for
statistical exchange
– Provides immediate interoperability between
organizations
• Recommendations for content harmonization
across domains and organizational boundaries
DDI
• A standard developed by an international
member-based consortium
– ABS is a member
• Traditionally dominated by the world of national
data archives, it is now increasingly being used by
national statistical institutes
• Adoption is widespread
Note: There is a detailed DDI presentation by Wendy
Thomas from 2009 on the Statistical Leadership
Seminars site
DDI Products
• A model for the production and processing of
microdata/survey data into statistical
aggregate products
• XML formats for data and metadata involved
in data production
– Strong focus on detailed metadata describing
exactly how input data has been collected and
processed
– Used heavily by data archives and research data
centers, as well as by statistical agencies
DDI - Lifecycle
DDI Metadata
• At each stage of the lifecycle, DDI captures
metadata regarding that production step
– Examples include classifications, concepts,
variables, processing steps, etc.
• There is an emphasis on data comparability
and reuse
• The standard is able to express human-readable
metadata (in multiple languages) as
well as “machine-actionable” metadata
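To illustrate the difference between human-readable and machine-actionable metadata, here is a minimal Python sketch. It is not the DDI schema: the `Variable` class, its fields, and the `comparable` rule are hypothetical simplifications chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Hypothetical, simplified stand-in for a variable description."""
    name: str
    concept: str                                 # concept the variable measures
    labels: dict                                 # language code -> human-readable label
    codes: dict = field(default_factory=dict)    # code value -> category name

    def label(self, lang, fallback="en"):
        """Human-readable side: return the label in the requested language."""
        return self.labels.get(lang, self.labels[fallback])


def comparable(a, b):
    """Machine-actionable side: an assumed rule that treats two variables as
    comparable when they share a concept and the same set of code values."""
    return a.concept == b.concept and a.codes.keys() == b.codes.keys()


sex_a = Variable(
    name="SEX",
    concept="sex",
    labels={"en": "Sex of respondent", "fr": "Sexe du répondant"},
    codes={"1": "Male", "2": "Female"},
)
sex_b = Variable(
    name="Q2_GENDER",
    concept="sex",
    labels={"en": "Respondent sex"},
    codes={"1": "Male", "2": "Female"},
)

print(sex_a.label("fr"))
print(comparable(sex_a, sex_b))
```

The labels serve a person browsing a catalog, while the concept and code list are what software can act on, for example when comparing variables across surveys.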
An Observation
• DDI and SDMX are standards which allow for
services which have never before been
possible
– The OpenDDI Portal is only possible because of
the widespread use of a standard metadata
format: DDI
– This is a paradigm which builds on the existence of
the Internet and Service-Oriented tools and
technologies
– Earlier efforts to create portals of this type
failed, due to the lack of standard
metadata descriptions
Generic Process Example
[Figure: process flow from microdata to aggregates. Raw Data Set → (anonymization, cleaning, recoding, etc.) → Micro-Data Set/Public Use Files → (aggregation, harmonization) → Aggregate Data Set (Lower Level) → Aggregate Data Set (Higher Level) → Aggregate Data Set (Highest-Level). The stages are labeled DDI and SDMX.]
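The generic process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, age bands, and function names are all invented, and real anonymization involves far more than dropping a name.

```python
from collections import Counter

# Invented raw records; "name" plays the role of a direct identifier.
RAW = [
    {"name": "A. Smith", "age": 34, "region": "NSW"},
    {"name": "B. Jones", "age": 67, "region": "NSW"},
    {"name": "C. Brown", "age": 25, "region": "VIC"},
    {"name": "D. White", "age": 41, "region": "VIC"},
]


def anonymize(record):
    """Drop the direct identifier (here, only the name)."""
    return {k: v for k, v in record.items() if k != "name"}


def recode_age(record):
    """Recode exact age into a coarse band."""
    age = record["age"]
    band = "0-29" if age < 30 else "30-59" if age < 60 else "60+"
    out = {k: v for k, v in record.items() if k != "age"}
    out["age_band"] = band
    return out


def aggregate(records, by):
    """Count records for each combination of the given dimensions."""
    return Counter(tuple(r[k] for k in by) for r in records)


# Raw data set -> microdata / public use file
public_use = [recode_age(anonymize(r)) for r in RAW]

# Microdata -> aggregates at successively higher levels
lower = aggregate(public_use, ["region", "age_band"])
higher = aggregate(public_use, ["region"])
highest = sum(higher.values())

print(lower)
print(higher)
print(highest)
```

In the terms of this presentation, the steps up to the public use file are the kind of processing DDI documents, and the resulting counts are the kind of aggregates SDMX is designed to exchange.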
Context for the OpenDDI Portal
• Metadata Technology has a standards-based business
model
– We promote the use of SDMX and DDI
– We are the international experts in these standards, and
are active in their development and implementation
• OpenDDI is the first of a set of services which we plan
to develop to promote adoption of the standards
– In future, other developments will include more online
services based on DDI
– We plan to expand the site to encompass SDMX services as
well under the umbrella brand of “OpenMetadata”
The Problem for Researchers
• Researchers have difficulty finding data for secondary
re-use
– There is a (very) large number of data archives and other
data producers, even within a single domain or country
– Existing searchable metadata is of varying quality (Google
doesn’t cut it!)
• Most high-quality research data is confidential
microdata
– Researchers must apply to the archive or producer for
access, assuming they can locate the appropriate data
– It cannot be published directly onto the Web in many
cases, and thus cannot be easily located
The Problem for Data Disseminators
• In most cases, when data are “harvested” from
disseminators’ sites, they are re-published without
proper provenance information
– Often, the re-publishers of data do not maintain them
properly
– Provenance is a major issue!
– The situation is even worse for the re-publishing of
metadata
• Data disseminators develop effective systems for
delivering their own data
– Re-publishers may not offer the same level of quality
What is OpenDDI?
• OpenDDI is a global catalog service offering
visibility into the holdings of the world’s data
producers and archives to researchers
• It is based on those producers exposing their
holdings as DDI descriptions of their data
• It provides good provenance information, and
granular comparison functionality across data
sets
• It directs users to the source of the data, so they
can apply for access and leverage other data
services offered by the producers themselves
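The catalog idea described above can be sketched as follows. This is not how OpenDDI is implemented: the `Catalog` class, the entry fields, the producer names, and the example.org URLs are all hypothetical, and a real service would harvest DDI XML rather than Python dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableEntry:
    """One harvested variable, kept together with its provenance."""
    variable: str
    label: str
    study: str
    producer: str
    source_url: str   # where the user should go to apply for access


class Catalog:
    """Hypothetical in-memory catalog of variables from many producers."""

    def __init__(self):
        self._entries = []

    def harvest(self, producer, source_url, study, variables):
        """Register the variables a producer exposes for one study."""
        for name, label in variables.items():
            self._entries.append(
                VariableEntry(name, label, study, producer, source_url)
            )

    def search(self, term):
        """Find variables whose name or label contains the search term."""
        t = term.lower()
        return [
            e for e in self._entries
            if t in e.variable.lower() or t in e.label.lower()
        ]


catalog = Catalog()
catalog.harvest(
    producer="Archive A",
    source_url="https://archive-a.example.org/studies/1",
    study="Household Survey 2009",
    variables={"HHINC": "Total household income", "SEX": "Sex of respondent"},
)
catalog.harvest(
    producer="Agency B",
    source_url="https://agency-b.example.org/catalog/42",
    study="Labour Force Survey 2010",
    variables={"INC_WK": "Weekly income from employment"},
)

for hit in catalog.search("income"):
    # Each hit names its producer and points back to the source.
    print(hit.producer, hit.study, hit.variable, hit.source_url)
```

The design point is that each entry carries its producer and source URL, so a search result always points back to the organization that holds the data instead of re-publishing the data itself.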
Technical Information
• OpenDDI has not yet been launched
– We are now doing testing with several data
disseminators
– It will be a public service when it is officially launched
– We have a “build it and they will come” mentality: we
do not have a pre-defined business model for this
functionality aside from the marketing benefit we get
from it
– It covers only about 1.3 million variables today – we
anticipate well over 3 million in time
Demo
• http://www.openmetadata.org/openddi/
Some Ideas about the OpenDDI Technology and NSS
• It seems obvious that portal functionality of this
type could be deployed at a national level
• OpenDDI itself harvests a similar portal deployed
by the World Bank for NSIs in the developing
world
• Functionality could be extended to cover
aggregate statistics exposed as SDMX
– Researchers are only one audience
• A minimum requirement would be a “DDI Lite” or
“SDMX Lite” format for data and/or metadata
Thank You!
Questions?