Documentation and Cataloguing in Data Archiving
Session 6
United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, Ethiopia, 20-23 September, 2011

What is documentation?
Documentation: comprehensive information on the processes and methods used to produce, archive and disseminate micro-data
  o Documentation includes metadata and other information related to the dataset

Role of documentation
Documentation explains how the data were collected, their content and structure, any manipulation that may have taken place, how to access the data, the terms for their use, etc.
Documentation is required in order to understand and interpret the data by providing a context: without proper documentation, data are useless
The further data get from their source, the greater the importance of the documentation (metadata)
Documentation also allows documents to be reused for future surveys

When to undertake documentation
Documentation is an incremental process that should be a shared responsibility among various parts of an institution
Different types of documentation can be added by different people at various stages of an information object’s life cycle
A common documentation framework should be used by the different actors: the actor who is closest to the information to be used as documentation/metadata adds that information to the framework

Types of material for documentation
Three broad categories of documentation:
  o Explanatory material
  o Contextual information
  o Cataloguing material

Types of material for documentation
Explanatory material - required to ensure the long-term viability and functionality of a dataset, and without which full understanding of the dataset and its contents cannot be achieved
  o Data collection methods (the data collection process, including instruments used, methods employed, and how these were developed)
  o Structure of the dataset (information about relationships between individual files or records within the study, e.g., the number of cases and variables in each file and the number of files in the dataset)
  o Technical information (computer system used to generate the files; software packages with which the files were created; medium on which the data were stored; and a complete list of all data files present in the dataset)

Types of material for documentation (contd.)
Explanatory material (contd.)
  o Variables and values, coding and classification schemes (descriptions of all variables (or fields) in the dataset, with explanations of the coding and classifications used and of blank and missing fields)
  o Derived variables (how they were derived)
  o Weighting and grossing (procedures should be explained)
  o Data source (sources from which the data are derived, e.g., questions used)
  o Confidentiality and anonymization (whether the data contain any confidential information or anonymization has been implemented, and the implications of both for data usage)

Types of material for documentation (contd.)
Contextual information - the context in which the data were collected, and how they were put to use
  o Description of the originating project (why the data collection was felt necessary; who or what was being studied; the geographic and temporal coverage)
  o Provenance of the dataset (history of the data collection process, changes and developments that occurred in the data themselves and the methodology, or any adjustments made)
  o Serial and time-series datasets, new editions (e.g., descriptions of changes in question text, variable labelling or sampling procedures for repeated cross-section and time-series datasets)

Types of material for documentation (contd.)
Catalogue metadata:
  o A sub-set of core data documentation providing standardized, structured information explaining the purpose, origin, time reference, geographic location, creator, access conditions and terms of use of the data

Metadata standards
Traditionally, data producers wrote text-based codebooks. To take advantage of web technology, these have been replaced by XML-based codebooks
Use of metadata standards brings key data documentation together into a single document, creating detailed and structured content about the data.
This enhances:
    − Quality of statistical documentation provided to data users
    − Access to the data and semantic interoperability of datasets
The Data Documentation Initiative (DDI)
Dublin Core Metadata Standard

Metadata standards (contd.)
On XML (eXtensible Markup Language)
    − A way of tagging text for meaning instead of appearance (i.e., XML can be used to organize text by tagging it with meaningful information)
    − Unlike text stored in a database, XML text files can be viewed and edited using any standard text editor
    − With appropriate tools, XML files can be searched and queried like a regular database
    − XML documents can be read and transformed by other software applications into user-friendly formats, e.g., spreadsheets, PDF files or web pages

Metadata standards (contd.)
The Data Documentation Initiative (DDI):
  o Is based around the data lifecycle model and provides specifications for a structured framework for organizing the content, presentation, transfer and preservation of metadata in the social and behavioural sciences
  o Provides comprehensive metadata on the entire survey process and usage
  o Facilitates point-of-origin capture of metadata
  o Includes machine-actionable elements to facilitate processing, discovery and analysis

Metadata standards (contd.)
The Data Documentation Initiative (DDI):
  o Facilitates reuse of common metadata items because DDI is designed around schemes (lists of items) for commonly reused information within a study, e.g., categories, code schemes, concepts, universe, etc.
    − Items are entered once and used in multiple locations in a DDI document by referencing the item in the list
  o Reuse of items supports:
    − Consistency and accuracy of metadata content, thereby minimizing redundancy and discrepancies
    − Internal and external implicit comparisons
    − External registries of concepts, questions, variables, etc.
    − Metadata-driven processing

Metadata standards (contd.)
The Data Documentation Initiative (DDI):
  o Information in DDI schemes can be stored in external registries and used by multiple studies to support:
    − Comparisons within and between studies
    − Organizational consistency through use of agreed content managed in registries
  o Designed to support easy interaction with other major standards (Dublin Core, SDMX, ISO/IEC 11179, ISO 19115)
    − Ensures that metadata can be connected to other domains or stages of the lifecycle

Metadata standards (contd.)
Dublin Core Metadata Standard:
  o A general-purpose metadata standard for describing digital resources related to micro-data
    − Questionnaires
    − Reports
    − Manuals
    − Data processing scripts
    − Programs
    − etc.
  o Makes it easy and inexpensive to create descriptive records for information resources while providing for effective retrieval of these resources on the web or in other similar networked environments

Metadata standards (contd.)
Dublin Core Metadata Standard:
  o Consists of 15 metadata elements:
      Title        Relation     Rights
      Subject      Coverage     Date
      Description  Creator      Format
      Type         Publisher    Identifier
      Source       Contributor  Language

What is cataloguing?
Cataloguing: creation of documentation for a dataset that provides standardized, structured information so that searchers can easily identify and access datasets according to their needs (title of study, source, year of collection, etc.)

Cataloguing
Sharing survey micro-data with legitimate users offers many benefits, e.g., the diversity of research work, the acceptability of data, the quality of data, etc. Therefore, users should be informed about the existence and characteristics of datasets
Cataloguing material serves as:
    − A bibliographic record of the dataset, allowing it to be properly acknowledged and cited in publications
    − A formal record for long-term preservation purposes
    − The basic instrument used for resource discovery, allowing datasets to be uniquely identified within the collection by providing appropriate information to help secondary users identify the study as useful to their purpose

Cataloguing (contd.)
Searchable catalogues facilitate finding datasets and related metadata and increase access to datasets
Use of XML-based metadata standards facilitates the creation of catalogues: because the metadata are structured, they are searchable
Catalogues hold information on the title of the dataset, data collector(s), dates of data collection, temporal and geographic coverage, methods of data collection, sampling design and frames (if undertaken), and other documentation information. Also variable names, abstracts and key words…

Cataloguing (contd.)
Characteristics of a good survey catalogue - from the user's point of view:
  o Compliant with international metadata standards, particularly XML-based standards
  o Provides detailed metadata, including at the variable level
  o Provides user-friendly search functionalities (full-text search)
  o Provides clear information on the policy and procedure for accessing the data
  o Provides a list of, and direct access to, reference materials (questionnaires, manuals, reports)
  o Includes a "search by topic" compliant with an international thesaurus

Cataloguing (contd.)
Characteristics of a good survey catalogue - from the catalogue administrator's point of view:
  o Provides a secure environment for storing and sharing data and metadata
  o Provides a "users' requests" and "users' management" tool to receive and respond to data requests and information queries
  o Provides a solution for sharing public use files and licensed files
  o Generates admin reports on access requests received/processed; most popular surveys/documents; keywords used for searching data; etc.

Thank You!
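
Illustration: structured metadata makes a catalogue searchable
To make the Dublin Core and searchable-catalogue slides concrete, here is a minimal sketch in Python using only the standard library. It tags two reference documents with Dublin Core elements and runs a simple full-text search over them. The `record` wrapper element, the helper names `make_record` and `search`, and every field value are assumptions invented for illustration; they do not come from any real catalogue or cataloguing tool.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 Dublin Core elements, lower-cased as they appear in XML tags.
DC_ELEMENTS = [
    "title", "subject", "description", "type", "source",
    "relation", "coverage", "creator", "publisher", "contributor",
    "rights", "date", "format", "identifier", "language",
]


def make_record(fields):
    """Build a <record> element with one dc:* child per supplied field.

    The <record> wrapper is a hypothetical container, not part of Dublin Core.
    """
    record = ET.Element("record")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


def search(records, term):
    """Return the titles of records where any element's text contains term."""
    term = term.lower()
    hits = []
    for record in records:
        if any(term in (el.text or "").lower() for el in record):
            hits.append(record.findtext(f"{{{DC_NS}}}title"))
    return hits


# Hypothetical catalogue entries: all values below are invented.
catalogue = [
    make_record({
        "title": "Example Census Questionnaire",
        "type": "Questionnaire",
        "date": "2007",
        "language": "en",
    }),
    make_record({
        "title": "Example Enumerator Manual",
        "type": "Manual",
        "date": "2007",
        "language": "en",
    }),
]

# Serialize the first record to XML text and run a full-text search.
xml_text = ET.tostring(catalogue[0], encoding="unicode")
print(xml_text)
print(search(catalogue, "manual"))
```

Because each piece of information sits in a named element, the same records can be searched, filtered by element, or transformed into other formats without parsing free text, which is the property the slides attribute to XML-based metadata standards.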