Documentation and Cataloguing in Data Archiving, Session 6

United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, Ethiopia,
20-23 September, 2011
What is documentation?
Documentation: comprehensive information on
the processes and methods used to produce,
archive and disseminate micro-data
o Documentation includes metadata and other
information related to the dataset
Role of documentation
 Documentation explains how the data were collected, their
content and structure and any manipulation that may have
taken place, how to access the data, terms for their use, etc.
 Documentation provides the context needed to understand and
interpret the data: without proper
documentation, data are useless
 The further data get from their source, the greater the
importance of the documentation (metadata)
 Documentation also allows documents to be reused in future surveys
When to undertake documentation
 Documentation is an incremental process that should be a
shared responsibility among various parts of an institution
 Different types of documentation can be added by different
people at various stages of an information object’s life cycle
 A common documentation framework should be used by the different
actors: whoever is closest to the information to be
used as documentation/metadata adds that information to
the framework
Types of material for documentation
 Three broad categories of documentation:
o Explanatory material
o Contextual information
o Cataloguing material
Types of material for documentation
 Explanatory material – required to ensure the long-term viability and functionality of a dataset and without
which full understanding of the dataset and its contents
cannot be achieved
o Data collection methods (data collection process including
instruments used, methods employed, and how these were
developed)
o Structure of the dataset (information about relationships
between individual files or records within the study, e.g., the
number of cases and variables in each file and the number of files
in the dataset)
o Technical information (computer system used to generate the
files; software packages with which the files were created; medium
on which the data were stored; and complete list of all data files
present in the dataset)
Types of material for documentation (contd.)
 Explanatory material (contd.)
o Variables and values, coding and classification schemes
(descriptions of all variables (or fields) in the dataset, with
explanations of the coding and classifications used, including
those for blank and missing fields)
o Derived variables (how they were derived)
o Weighting and grossing (procedures should be explained)
o Data source (sources from which the data are derived, e.g.,
questions used)
o Confidentiality and anonymization (whether the data contain any
confidential information or anonymization has been implemented,
and the implications of both for data usage)
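The weighting and grossing item above can be made concrete with a small worked sketch. It assumes the simplest possible design, a simple random sample in which every unit carries the same design weight N/n; the function names and the figures are hypothetical and serve only to show the kind of procedure the documentation should explain.

```python
def design_weight(population_size, sample_size):
    """Design weight under simple random sampling (assumed design): N / n."""
    return population_size / sample_size


def grossed_total(values, weights):
    """Grossed-up (population) total: sum of each sampled value times its weight."""
    return sum(value * weight for value, weight in zip(values, weights))


# Illustrative figures only: 3 sampled households out of 300 in the frame.
household_sizes = [4, 5, 3]
weight = design_weight(population_size=300, sample_size=3)
estimated_population = grossed_total(household_sizes, [weight] * len(household_sizes))
```

Real censuses and surveys usually apply more complex weights (stratification, non-response adjustment), which is why the procedure actually used needs to be documented alongside the data.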
Types of material for documentation (contd.)
 Contextual information - the context in which the data
were collected, and how they were put to use
o Description of the originating project (why the data collection
was felt necessary; who or what was being studied; the
geographic and temporal coverage)
o Provenance of the dataset (history of the data collection
process, changes and developments that occurred in the data
themselves and the methodology, or any adjustments made)
o Serial and time-series datasets, new editions (e.g.,
descriptions of changes in question text, variable labelling or
sampling procedures for repeated cross-section, time-series
datasets)
Types of material for documentation (contd.)
 Catalogue metadata:
o A sub-set of core data documentation providing
standardized structured information explaining the
purpose, origin, time reference, geographic location,
creator, access conditions and terms of use of data
Metadata standards
 Traditionally, data producers wrote text-based
codebooks. To take advantage of web technology, these
have been replaced by XML-based codebooks
 Use of metadata standards brings key data documentation
together into a single document, creating detailed and
structured content about the data. This enhances:
− Quality of statistical documentation provided to data users
− Access to the data and semantic interoperability of data sets
 The Data Documentation Initiative (DDI)
 Dublin Core Metadata Standard
Metadata standards (contd.)
 On XML (eXtensible Markup Language)
− A way of tagging text for meaning instead of appearance (i.e., XML
can be used to organize text by tagging it with meaningful
information)
− Unlike text stored in a database, XML text files can be viewed and edited
using any standard text editor
− With appropriate tools, XML files can be searched and queried like a
regular database
− XML documents can be read and transformed by other software
applications into user-friendly formats, e.g., spreadsheets, PDF files
or web pages
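The points above can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The codebook fragment is hypothetical: its element and attribute names (`codebook`, `variable`, `label`, `category`) are chosen for illustration and do not follow DDI or any other standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical codebook fragment: tags describe what the text means,
# not how it should look. Names are illustrative only.
CODEBOOK_XML = """
<codebook>
  <variable name="AGE" type="numeric">
    <label>Age in completed years</label>
  </variable>
  <variable name="SEX" type="categorical">
    <label>Sex of respondent</label>
    <category code="1">Male</category>
    <category code="2">Female</category>
  </variable>
</codebook>
"""

root = ET.fromstring(CODEBOOK_XML)


def variable_labels(codebook):
    """Query the document like a small database: variable name -> label."""
    return {v.get("name"): v.findtext("label") for v in codebook.findall("variable")}


def categories(codebook, variable_name):
    """Return the code -> category text mapping for one variable."""
    variable = codebook.find(f"variable[@name='{variable_name}']")
    return {c.get("code"): c.text for c in variable.findall("category")}
```

Because the content is tagged by meaning, the same file could equally be transformed into a spreadsheet, a PDF or a web page by other tools.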
Metadata standards (contd.)
 The Data Documentation Initiative (DDI):
o Is based around the data lifecycle model and provides specifications
for a structured framework for organizing the content, presentation,
transfer and preservation of metadata in the social and behavioural
sciences
o Provides comprehensive metadata on the entire survey process and
usage
o Facilitates point-of-origin capture of metadata
o Includes machine actionable elements to facilitate processing,
discovery and analysis
Metadata standards (contd.)
 The Data Documentation Initiative (DDI):
o Facilitates reuse of common metadata items because DDI is
designed around schemes (lists of items) for commonly reused
information within a study, e.g., categories, code schemes,
concepts, universe, etc.
− Items are entered once and used in multiple locations in a DDI document
by referencing the item in the list
o Reuse of items supports:
− Consistency and accuracy of metadata content thereby minimizing
redundancy and discrepancies
− Internal and external implicit comparisons
− External registries of concepts, questions, variables, etc.
− Metadata driven processing
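The idea of entering an item once and reusing it by reference can be sketched in plain Python. This is a simplified stand-in, not actual DDI syntax: the scheme identifier `cs_yes_no`, the variable names and the helper functions are all hypothetical.

```python
# Hypothetical, simplified stand-in for a DDI-style code scheme: the list of
# categories is entered once and identified by an ID.
CODE_SCHEMES = {
    "cs_yes_no": {1: "Yes", 2: "No", 9: "Not stated"},
}

# Variables point to the scheme by ID instead of repeating its categories.
VARIABLES = {
    "LITERATE": {"label": "Can read and write", "code_scheme": "cs_yes_no"},
    "ATTENDS_SCHOOL": {"label": "Currently attends school", "code_scheme": "cs_yes_no"},
}


def decode(variable_name, code):
    """Resolve a stored code through the scheme the variable references."""
    scheme_id = VARIABLES[variable_name]["code_scheme"]
    return CODE_SCHEMES[scheme_id][code]


def variables_using(scheme_id):
    """List the variables that share one scheme (useful for comparisons)."""
    return sorted(
        name for name, spec in VARIABLES.items() if spec["code_scheme"] == scheme_id
    )
```

A correction to the scheme is made in one place and applies to every variable that references it, which is how reuse reduces redundancy and discrepancies.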
Metadata standards (contd.)
 The Data Documentation Initiative (DDI):
o Information in DDI schemes can be stored in external
registries and used by multiple studies to support:
− Comparisons within and between studies
− Organizational consistency through use of agreed content
managed in registries
o Designed to support easy interaction with other major
standards (Dublin Core, SDMX, ISO/IEC 11179, ISO 19115)
− Ensures that metadata can be connected to other domains or
stages of the lifecycle
Metadata standards (contd.)
Dublin Core Metadata Standard:
o A general purpose metadata standard for
describing digital resources related to micro-data
− Questionnaires
− Reports
− Manuals
− Data processing scripts
− Programs
− etc.
o Makes it easy and inexpensive to create descriptive records
for information resources while providing for effective
retrieval of these resources on the web or other similar
networked environment
Metadata standards (contd.)
Dublin Core Metadata Standard:
o Consists of 15 metadata elements:
− Title
− Relation
− Rights
− Subject
− Coverage
− Date
− Description
− Creator
− Format
− Type
− Publisher
− Identifier
− Source
− Contributor
− Language
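As a minimal sketch of what a Dublin Core record might look like, the following builds one with Python's standard `xml.etree.ElementTree` module. The namespace URI is the Dublin Core Metadata Element Set 1.1 namespace; the `record` wrapper element, the function name and all sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Build a record from an element-name -> value mapping.

    The <record> wrapper is an assumption; Dublin Core itself only
    defines the elements, not the container they are placed in.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical description of a questionnaire attached to a dataset.
example = build_dc_record({
    "title": "Example Census Questionnaire",
    "type": "Questionnaire",
    "language": "en",
    "format": "application/pdf",
})
xml_text = ET.tostring(example, encoding="unicode")
```

Rejecting names outside the 15 elements keeps the records consistent, which is what makes them easy to index and retrieve.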
What is cataloguing?
Cataloguing: creation of documentation for a
dataset providing standardized structured
information so that searchers can easily identify
and access datasets according to their needs (title
of study, source, year of collection, etc)
Cataloguing
 Sharing survey micro-data with legitimate users offers
many benefits, e.g., more diverse research work, wider
acceptance of the data and improved data quality. Users
should therefore be informed about the existence and
characteristics of datasets
 Cataloguing material serves as:
− A bibliographic record of the dataset, allowing it to be properly
acknowledged and cited in publications
− A formal record for long-term preservation purposes
− A basic instrument for resource discovery, allowing datasets
to be uniquely identified within the collection by providing
appropriate information to help secondary users identify the
study as useful to their purpose
Cataloguing (contd.)
 Searchable catalogues facilitate finding datasets and
related metadata and increase access to datasets
 Use of XML-based metadata standards facilitates the creation
of catalogues: because the metadata are structured, they are
searchable
 Catalogues typically hold information on title of dataset,
data collector(s), dates of data collection, temporal and
geographic coverage, methods of data collection, sampling
design and frames (if sampling was undertaken) and other
documentation, as well as variable names, abstracts and keywords
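How structured entries make a catalogue searchable can be sketched in a few lines of Python. The catalogue entries, field names and `search` function below are hypothetical placeholders, not records from any real archive.

```python
# Hypothetical catalogue entries with a few structured fields each.
CATALOGUE = [
    {
        "title": "Example Population and Housing Census",
        "year": 2007,
        "coverage": "national",
        "keywords": ["population", "housing", "census"],
    },
    {
        "title": "Example Household Budget Survey",
        "year": 2010,
        "coverage": "urban areas",
        "keywords": ["income", "expenditure"],
    },
]


def search(catalogue, query):
    """Return titles of entries whose fields contain every query term.

    Matching is a case-insensitive substring test over all fields.
    """
    terms = query.lower().split()
    hits = []
    for entry in catalogue:
        text = " ".join(
            [entry["title"], str(entry["year"]), entry["coverage"], *entry["keywords"]]
        ).lower()
        if all(term in text for term in terms):
            hits.append(entry["title"])
    return hits
```

For example, `search(CATALOGUE, "census housing")` matches only the first entry, while a year such as `"2010"` finds the second.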
Cataloguing (contd.)

 Characteristics of a good survey catalogue - From the
user point of view:
o Compliant with international metadata standards, particularly XML
standards
o Provides detailed metadata, including at the variable level
o Provides user-friendly search functionalities (full text search)
o Provides clear information on the policy and procedure for
accessing the data
o Provides a list of, and direct access to, reference materials
(questionnaires, manuals, reports)
o Includes a "search by topic" compliant with an international
thesaurus
Cataloguing (contd.)

 Characteristics of a good survey catalogue - From the
catalogue administrator's point of view:
o Provides a secure environment for storing and sharing data and
metadata
o Provides "user requests" and "user management" tools to
receive and respond to data requests and information queries
o Provides a solution for sharing public use files and licensed files
o Generates admin reports on access requests received/processed;
most popular surveys/documents; keywords used for searching
data; etc.
Thank You!