Data Documentation Initiative: A global phenomenon – coming soon to ABS!
Wendy Thomas, Chair, DDI Technical Implementation Committee
1 September 2009

Acknowledgments
• Slides provided for use by:
– Wendy Thomas
– Pascal Heus
– Mary Vardigan
– Peter Granda
– Nancy McGovern
– Jeremy Iverson
– Dan Smith

What is Metadata?
• Common definition: Data about Data
• (Image: "unlabeled stuff" vs. "labeled stuff" – the bean example is taken from: A Manager's Introduction to Adobe eXtensible Metadata Platform, http://www.adobe.com/products/xmp/pdfs/whitepaper.pdf)

Managing data and metadata is challenging!
• "We are in charge of the data. We need to collect the information from the producers, preserve it, and provide access to our users!"
• "We want easy access to high quality and well documented data!"
• "We support our users but have an information management problem! We also need to protect our respondents!"
• Stakeholders – Producers: academic, government, business; Sponsors; Users: librarians, policy makers, general public, media/press

Summary of ABS Metadata Management Principles
• Life-cycle focus
• Data supported by accessible metadata
• Metadata available and useable in the context of the client's need
• Registration authority for metadata elements
• Clear identification, ownership, and approval status of metadata elements
• Describe metadata flow
• Reuse metadata
• Capture at source
• Capture derivable metadata automatically
• Ensure cost/benefit of metadata
• Variations from standards tightly documented
• Make metadata active to the greatest possible extent

NISO: A Framework of Guidance for Building Good Digital Collections
• Metadata Principle 1: Good metadata conforms to community standards in a way that is appropriate to the materials in the collection, users of the collection, and current and potential future uses of the collection.
• Metadata Principle 2: Good metadata supports interoperability.
• Metadata Principle 3: Good metadata uses authority control and content standards to describe objects and collocate related objects.
• Metadata Principle 4: Good metadata includes a clear statement of the conditions and terms of use for the digital object.
• Metadata Principle 5: Good metadata supports the long-term curation and preservation of objects in collections.
• Metadata Principle 6: Good metadata records are objects themselves and therefore should have the qualities of good objects, including authority, authenticity, archivability, persistence, and unique identification.

Some major XML metadata specifications for data content management
• Statistical Data and Metadata Exchange (SDMX)
– Macrodata, time series, indicators, registries
– http://www.sdmx.org
• Data Documentation Initiative (DDI)
– Microdata (surveys, studies), aggregate, administrative data
– http://www.ddialliance.org
• ISO/IEC 11179
– Semantic modeling, concepts, registries
– http://metadata-standards.org/11179/
• ISO 19115
– Geography
– http://www.isotc211.org/
• Dublin Core
– General resources (documentation, images, multimedia)
– http://www.dublincore.org

Metadata provides support for:
• Survey and data collection preparation
• Data collection
• Data processing
• Analysis
• Data discovery and access
• Replication
• Repurposing (secondary data use or data products)

Metadata
• Metadata is essential information for research and reuse of data
• The further data gets from its source, the greater the importance of the metadata
• Content is critical
• Structure is becoming increasingly important in a networked world

Why Standards?
• Standards provide structure for:
– Accurate transfer of content between systems
– Increased automation of ingest, reducing costs
– Interoperability between systems and software
– A structural base for discovery and comparison

Example: Dublin Core
• Print card catalogs – static, stationary
• Standalone databases – proprietary structure, little cross-site searching
• WorldCat and Google – standardized content, cross-site searching

Interacting Standards for Data
• Dublin Core
• ISO/IEC 11179
• ISO 19115 – Geography
• Statistical Packages
• METS
• PREMIS
• SDMX
• DDI

Dublin Core
• Citation structure
• Coverage
– Temporal
– Topical
– Spatial
• Location-specific information

ISO/IEC 11179
• Structure and content of a data element as the building block of information
• Supports registry functions
• Provides: Object – Property – Representation

ISO 19115 – Geography
• e.g., ANZLIC and US FGDC
• Focus is on describing spatial objects and their attributes

Statistical Packages
• Proprietary standards
• Content is generally limited to:
– Variable name
– Variable label
– Data type and structure
– Category labels
• Translation tools used to transport content

METS
• Digital Library Federation
• Consistent outer wrapper for digital objects of all types
• Contains a profile providing the structural information for the contained object

PREMIS
• Preservation information for digital objects
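Of the standards in this sequence, Dublin Core is the simplest to illustrate. A minimal citation record of the kind DDI can map to or embed might look like the following (a sketch using the standard DCMES element names; the study title, agency, and identifier are hypothetical placeholders, not a real record):

```xml
<!-- Minimal Dublin Core citation sketch; all values are hypothetical. -->
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example Household Survey, 2009</dc:title>
  <dc:creator>Example Statistical Agency</dc:creator>
  <dc:subject>household income</dc:subject>
  <dc:coverage>Australia, 2008-2009</dc:coverage>
  <dc:identifier>doi:10.0000/example</dc:identifier>
</record>
```

Even this small record shows why Dublin Core suits discovery (title, subject, coverage) but not the structural and processing detail the other standards below provide.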
SDMX
• Developed for statistical tables
• Supports well-structured, well-defined data, particularly time-series data
• Contains both metadata and data
• Supports transfer of data between systems

DDI
• Version 3.0 covers the life cycle of data and metadata
• Data collection
• Processing
• Management
• Reuse or repurposing
• Support for registries
• Grouping and comparison

Metadata Coverage
(across Dublin Core, ISO/IEC 11179, ISO 19115, Statistical Packages, METS, PREMIS, SDMX, and DDI)
• [Packaging]
• Citation
• Geographic coverage
• Temporal coverage
• Topical coverage
• Structure information
– Physical storage description
– Variable (name, label, categories, format)
• Source information
• Methodology
• Detailed description of data
• Processing
• Relationships
• Life-cycle events
• Management information

The Data Documentation Initiative (DDI)
• International XML-based specification
– Started in 1995; now driven by the DDI Alliance (30+ members)
– Became an XML specification in 2000 (v1.0)
– Current version is 2.1, with a focus on archiving (codebook)
• New Version 3.0 (2008)
– Focus on the entire survey "life cycle"
– Provides comprehensive metadata on the entire survey process and usage
– Aligned with other metadata standards (DC, MARC, ISO/IEC 11179, SDMX, …)
– Includes machine-actionable elements to facilitate processing, discovery, and analysis

Intent of DDI Design
• Facilitate point-of-origin capture of metadata
• Reuse of metadata to support:
– Consistency and accuracy of metadata content
– Provide internal and external implicit comparisons
– Support external registries of concepts, questions, variables, etc.
– Metadata-driven processing
• Provide clear paths of interaction with other major standards

Basic Structures
• DDI 3 uses a model similar to SDMX in terms of the following:
– Identifiable, Versionable, and Maintainable objects
– The use of multiple schemas to describe different process sub-sections in the life cycle
– The use of schemes to facilitate reuse of common materials

DDI: Full content coverage for survey and administrative data
• Conceptual coverage
• Methodology
• Data collection
• Processing – cleaning, paradata
• Recoding and derivations
• Variable and tabular content
• Internal relationships
• Physical storage
• Data management

Plus: Relationships between studies
• Comparison by design
– Study series can inherit from earlier metadata
– Capture changes only
• Data integration
– Mapping of codes between source and target
– Capture comparison information
• Comparison of abstract content models
– Publication of reusable materials (code schemes, concept schemes, geographic structure, etc.)

Current Areas of DDI Development
• Controlled vocabularies to improve machine actionability
• Data collection methodology and process expansion for more depth and detail
• Qualitative data
• Increased comparison coverage
• Tools

DDI 3.0 Metadata Life Cycle
• Data and metadata creation is not a static process: it evolves dynamically over time and involves many agencies and individuals
• DDI 2.x is about archiving; DDI 3.0 focuses on the entire "life cycle"
• 3.0 emphasizes metadata reuse to minimize redundancy and discrepancies, support comparison, and drive the data and metadata creation process
• Supports multilingual content, grouping, geography, and registries
• 3.0 is extensible

When to capture metadata?
• Metadata must be captured at the time the event occurs!
(not after the fact)
• Documenting after the fact leads to considerable loss of information
• This is true for producers and researchers

Reuse
• DDI is designed around schemes (lists of items) for commonly reused information within a study, such as categories, code schemes, concepts, universes, etc.
– Items are "used" in multiple locations in a DDI document by referencing the item in the list
– Enter once, use in multiple locations
– Items can be versioned for management over time without having to change content in multiple locations

Comparison and Registries
• Information in DDI schemes can be published in external registries and used by multiple studies
– Provides implicit comparison both within a study and between studies
– Supports organizational consistency through the use of agreed content managed in registries
– Referencing structured lists provides further context to individual items used in a study

Metadata-driven processing
• Capturing metadata upstream can provide over 90% of the building blocks needed to generate the remainder of the metadata
• DDI supports embedding command code to run data processing events, driving data capture and data processing during and after collection, and supporting post-collection recoding, derivations, and harmonization maps

Questions to Variables (workflow diagram)
• REGISTRY – feeds the tools below
• Question Development Software – identifying universe and concepts; building or importing question text and response domains
• Instrument Development Software – organizing questions and flow logic → DDI
• CAI – capturing raw response data and process data
• Data Processing Software – data cleaning and verification → DDI; recoding and/or deriving new data elements using existing or new categories or coding schemes

Working with other standards
• There is no single standard that does it all
• DDI was specifically designed to support easy interaction with:
– Dublin Core – mapping of citation elements and embedding native Dublin Core
– ISO/IEC 11179 – working with an editor of the standard to reflect
the data element model and ISO/IEC 11179-5 naming conventions for registry-intended items

Standards, continued
– SDMX – DDI NCubes were revised to incorporate the ability to attach attributes to any area of a cube and to map cleanly into and out of SDMX cubes. SDMX has added a means of attaching fragments of DDI that provide source and processing information, which can be indexed and delivered through SDMX tools.
– ISO 19115 (ANZLIC) – Geographic elements in DDI are structured to reflect the basic discovery elements used by geographic search engines and to provide the detailed geographic structure information needed by GIS systems to incorporate the data accurately

DDI does not replace good content
• DDI structures metadata to leverage content
– Collection and processing
– Discovery and access
– Analysis and repurposing
– Registries
– Comparison
• DDI is not a software application
– Supports and informs software applications
• DDI is a neutral archival structure
– Preserving content and relationships

Value
• Supports consistent use of concepts, questions, variables, etc.
throughout the organization
• Supports implicit comparison through reuse of content
• Supports explicit comparison by mapping content between studies and to standard content
• Retention of explicit relationships between data collection and the resulting data files
• Early capture of a broad range of metadata at the point of creation

Value – continued
• Interoperability
• Flexibility in data storage
• Reuse of element structures
• Strong data typing
• Improved data mining between and across systems
• Improved access to detailed metadata

DDI User Base
• Archives and data libraries worldwide
– Catalogs
– Data delivery
– Documentation delivery from data systems
• Research institutes / service data centers
– Documentation for data
– Data search and analysis systems
– Data management systems
• International organizations and national statistical agencies
– Data collection and management

Archives and Data Libraries (examples)
• Catalogs
– ICPSR Data Catalog and Social Science Variable Database
– CESSDA Data Portal
– The Dataverse Network (formerly the Virtual Data Center)
• Data delivery
– California Digital Library "Counting California"
– National Historical Geographic Information System (NHGIS)
• Documentation delivery
– Survey Documentation and Analysis (SDA)
– Data Liberation Initiative Metadata Collection

Research Institutes / Service Data Centers (examples)
• Documentation for data
– German Microcensus (GESIS)
– Institute for the Study of Labor (IZA)
– US General Social Survey (NORC)
• Data search and analysis systems
– Nesstar
– Canadian Research Data Centres (RDCs)
• Data management systems
– Questionnaire Development Documentation System (University of Konstanz / GESIS)

Current DDI Products at ICPSR
• Most existing products are currently in DDI 2.1, with new additions moving to DDI 3
• DDI-XML variable-level codebooks output as PDF files for downloading by users
• DDI-XML metadata records created initially by data depositors and edited by ICPSR staff to augment content and include
additional fields
• Increasing use of DDI for special projects: Social Science Variables Database, various harmonization and data processing tasks

Potential Use of DDI 3 at ICPSR
• Information collected from data producers in the pre-collection phase – Concept
• Metadata output from CAI applications – Data Collection
• Processor's dashboard – Metadata Processing
• Metadata mining: new faceted search tool to facilitate discovery through more precise searching – Data Discovery
• Relational database for comparison and harmonization across studies – Repurposing

Potential Use of DDI 3 at ICPSR – 2
• Use of DDI in combination with other metadata standards, e.g., Dublin Core, MARC, PREMIS
• Beginning of FEDORA "object-centered" implementation concepts in data processing and data preservation strategies
• Processor's dashboard – Data Processing
• Relationships of study object to file object
• DDI 3 as a "wrapper" for all ICPSR metadata?

Current workflow technologies:
• Processor-based
• SPSS-based
• Not DDI-based

Future workflow technologies:
• FEDORA datastreams
• ICPSR "keepsake" objects
• PREMIS metadata
• DDI "lifecycle events" (processing history)

SSVD – The Public Search
• First batch of variable-level description files uploaded into SSVD:
– Approx. 3,500 DDI files (one file per dataset), representing approx. 1,300 ICPSR studies (approx. 18.5 percent of total ICPSR holdings, excluding US Census; approx. 30 percent of holdings with data and setups)
– Over 1,000,000 individual variable descriptions; 23,000,000 categories

SSVD – The Public Search
• New database finalized Fall 2008
• Built to match the DDI 3.0 data model
• Both DDI 2.x and DDI 3.0 compliant
– Designed to accept both DDI 2.x and 3.0 input and produce output in both versions
• The ICPSR version currently uploads DDI 2.1 and generates DDI 3.0 individual variable descriptions.
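As a rough sketch of that 2.1-to-3.0 step, the same variable description can be compared across the two versions. The following is schematic only (element names simplified, namespace declarations omitted, and the variable and category values hypothetical, not drawn from SSVD):

```xml
<!-- DDI 2.x (Codebook) style variable description, schematic -->
<var ID="V1" name="MARSTAT">
  <labl>Marital status</labl>
  <catgry><catValu>1</catValu><labl>Married</labl></catgry>
  <catgry><catValu>2</catValu><labl>Widowed</labl></catgry>
</var>

<!-- Roughly equivalent DDI 3.0 (Lifecycle) variable, schematic:
     the categories move into a reusable code scheme that the
     variable references by ID rather than repeating inline. -->
<l:Variable id="V1">
  <r:Name>MARSTAT</r:Name>
  <r:Label>Marital status</r:Label>
  <l:Representation>
    <l:CodeRepresentation>
      <r:CodeSchemeReference><r:ID>CS1</r:ID></r:CodeSchemeReference>
    </l:CodeRepresentation>
  </l:Representation>
</l:Variable>
```

The design difference is the point: DDI 2 embeds the categories in each variable, while DDI 3 references a shared scheme, which is what makes the reuse and implicit comparison described earlier possible.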
SSVD – The Public Search: Moving forward…
• Transition to automated DDI upload
– DDI uploaded at the time of study publication
– First quality check performed by study processing staff
– Acceptable DDI immediately released for public view
– Problematic DDI suppressed from public view for further review and upgraded as appropriate

(Screenshots: entry screen for internal search; search results screen)

IPUMS at MPC
• Did not use DDI because DDI 2 cannot handle translation tables
• Currently in the process of mapping DDI 3 codebook output from the IPUMS database
• Importing DDI 2 files from the Microdata Toolkit into the processing, validation, and harmonization system

NHGIS
• Contains historical aggregate data from population, housing, agricultural, and economic censuses, as well as BEA data, from 1790 to 2000
• Runs from DDI 2 nCube descriptions
• Searches variables, identifies related nCube tables, and determines geographic availability
• Generates data subsets with geographic links to objects in NHGIS shape files, along with the shape files themselves

Future Plans
• Funding has been obtained to improve the search and extraction system
• Current limitations of the system reflect limitations of DDI 2
• Moving to DDI 3 will support:
– broader cross-survey searching
– identification of common dimensions between nCubes over time
– harmonization instructions as well as common transformations, such as calculation of medians

International Organizations and National Statistical Agencies
• International Household Survey Network (IHSN)
– Major international organizations involved
– Coordination of activities
– Adopted DDI 1/2.x as a standard
– Developed the Microdata Management Toolkit and related tools / guidelines
– http://www.surveynetwork.org
• Accelerated Data Program (ADP)
– World Bank / PARIS21
– Implements IHSN activities in developing countries
• Task 1: Documentation and dissemination of existing survey microdata.
– Has introduced DDI in national statistical agencies in over 50 countries
– http://www.surveynetwork.org/adp

INDEPTH/DSS Example
• 38 Demographic Surveillance Sites in 19 countries spanning Africa, South Asia, Central America, and Oceania
• Diverse yet similar health research portfolios
• Data management goals:
– Standardize and harmonize data collection tools
– Cross-site comparability of information
– Sharing data effectively and efficiently

Reasons for choosing DDI
• "It will be ideal to describe our data for the purposes of the Data Repository"
• "It has really powerful features that will enable us to standardise several facets of our work."
• "I originally underestimated the usefulness DDI will have as a means to harmonised data collection between sites."
• Ability to expand comparison and harmonization with additional groups, such as an AIDS research team

Statistical Agencies
• BLS is considering publication of category and coding standards supported by BLS, such as NAICS, SOC, etc.
• Statistics Canada is considering publishing concept schemes in DDI 3 for use by the research community
• DDI is becoming more widely used for survey and census collection in developing countries (primarily Africa)

MQDS Version 1
• Extracted metadata from the Blaise data model as XML-tagged data
• Provided a user interface for selection of:
– Blaise files
– Instrument questions and sections
– Types of metadata to extract
– Languages to display
– Style sheet for generation of instrument documentation or a codebook

Using MQDS V1 XML: Codebook in Five Languages
(Example: National Latino and Asian American Study, www.icpsr.umich.edu/CPES)

MQDS Version 1
• Limitations
– XML not DDI-compliant
• DDI Version 2 did not have XML tags for all metadata provided by Blaise
• Did not provide an easy means of adding XML tags without becoming noncompliant
– XML files for complex surveys can be very large (text files)
• Entire files had to be processed in computer memory
• Limited ability to fully automate documentation

DDI Version 3
• Included extensions proposed by the DDI working group on instrument design

Persistent Content of a Question
• Question text
– Static
– Dynamic or variable
• Multiple-part question
• Response domain
– Open
– Set categories
– Special types (date, time, etc.)
• Definitional text

Use of a Question in an Instrument
• Order and routing
– Sequence / skip patterns
– Loops
• Universe
• Analysis unit
• Instructions

MQDS Version 3
• Joint SRC and ICPSR venture
• Goals:
– Address Version 2 limitations
• Process a Blaise instrument of any size
– Exploit new elements and validate to the recently released DDI Version 3 standard
– Move from processing XML metadata in memory to streaming metadata to a relational database

MQDS Version 3 – Relational Database: Import, Export, Transform (SQL Server / SQL Server Express)
1. Import – user specifies input files (location, file type, etc.): Blaise Datamodel (BMI), Blaise Database (BDB), other file types (e.g., SAS, SPSS), database connection settings, and DDI 3 elements not in the *.bmi
2. Export – user specifies output files (location, language/locale, XML output options, etc.): relational database to XML (DDI 3)
3. Transform – user specifies stylesheet selection criteria, type of output desired (HTML, RTF, PDF), etc.: questionnaire, codebook
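The transform step is stylesheet-driven. As a minimal sketch of the kind of XSLT involved (the namespace URI follows the DDI 3 data collection module, but the element path is simplified and should be treated as an assumption, not MQDS's actual stylesheet):

```xml
<!-- Hypothetical sketch: render each DDI 3 question as codebook text. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="ddi:datacollection:3_0">
  <!-- Emit each question item's text as a paragraph -->
  <xsl:template match="d:QuestionItem">
    <p>
      <xsl:value-of select="d:QuestionText"/>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

Because the metadata is standard XML, swapping stylesheets is all it takes to produce a questionnaire rather than a codebook, or HTML rather than PDF, from the same exported DDI 3.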
MQDS Version 3
• Relational database
– DDI-compliant standardized tables
– Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs
– Allows:
• Automated documentation of any Blaise survey instrument
• Importing and documenting data produced by other software
• Lower-cost development of other tools that facilitate editing and disseminating data

Example DDI 3 output (excerpts):

<c:SubUniverse isVersionable="true" id="U4863" isInclusive="true">
  <c:HumanReadable>-1</c:HumanReadable>
  <c:MachineReadable>
    <r:Code>GOSCHOL = Yes</r:Code>
  </c:MachineReadable>
</c:SubUniverse>

<l:Variable id="V32373">
  <r:Name>A_FEM.AB_FEM.FMARIT</r:Name>
  <r:UniverseReference>
    <r:ID>U4657</r:ID>
  </r:UniverseReference>
  <l:QuestionReference>
    <r:ID>Q69</r:ID>
  </l:QuestionReference>
</l:Variable>

<c:SubUniverse isVersionable="true" id="U4657" isInclusive="false">
  <c:HumanReadable>-1</c:HumanReadable>
  <c:MachineReadable>
    <r:Code>(MARSTAT = Widowed) OR (FMARSTAT = Widowed)</r:Code>
  </c:MachineReadable>
</c:SubUniverse>

<d:ComputationItem id="CI150">
  <d:Code programmingLanguage="Blaise">
    <r:Code>FMARIT := 2</r:Code>
  </d:Code>
  <d:AssignedVariableReference isReference="true">
    <r:ID>Q69</r:ID>
  </d:AssignedVariableReference>
</d:ComputationItem>

Colectica Feature Overview
• Current focus: Data Collection
(Colectica by Algenta Technologies)

Survey Design: Diagram
• Visually design survey instruments
• Drag items from the toolbox

Survey Design: Item Editor
• Edit item details using friendly input forms

Multilingual Support
• All text fields can be represented in multiple languages

Concept Repository
• Use built-in or custom concept banks to describe survey items
• Useful for comparability

Question Repository
• Share questions across studies
• Drag previously used questions or sequences onto new instruments

Import Existing CAI Code
• Import from:
– Blaise®
– CASES
– CSPro
• Support for additional
languages can be added

Generate CAI Source Code
• Currently supports CASES
• Blaise® and CSPro coming soon
• Support for additional CAI systems can be added

Generate Publishable Documentation
• Generate codebooks and diagrams
• Output to HTML and PDF

Also: Study Concept & Design
• Basic support for Study Concept & Design documentation

Generate DDI 3.1

Additional Information
• Beta available now
• Web: http://www.colectica.com/
• Email: contact@algenta.com

Thank you
• DDI Alliance – http://www.ddialliance.org
• Wendy Thomas – wlt@pop.umn.edu