Some Metadata Definitions

Draft
Memo
2011-06-10
ESSnet on Data Warehousing
Statistics Sweden
Lars-Göran Lundell
The purpose of this paper is to initiate a discussion on what metadata are
essential to and specific to a statistical data warehouse. A reasonable starting
point is to establish some basic definitions. To that end the Internet and the
bookshelves were searched for information related to metadata and data
warehousing. In particular, Internet sites set up by national and international
organisations working with statistics and/or standards were searched. Most
of the results shown below come from the statistics organisations Eurostat,
OECD and UNECE and from the standards organisations NISO and ISO.
Detailed search results have been compiled in Annex 1.
Metadata and data
Metadata and statistical metadata
General definitions of metadata can be found in many books and many sites
on the Internet. Most of them are very short and simple. The most commonly
used generic definition states that:

Metadata are data about data
There are some variations on the theme, e.g. claiming that metadata should
(or must) be structured or formalised. Perhaps somewhat unexpectedly, the
sources related to statistics give definitions that are even shorter
and vaguer than those of some general-purpose sources. The OECD definition
of statistical metadata is, for example, simply:

Statistical metadata are data about statistical data
This definition will obviously cover all kinds of documentation with some
reference to any type of statistical data and is applicable to metadata that
refer to data stored in a statistical data warehouse as well as any other type of
data store.
Data and statistical data
Since the definition of metadata shows that they are just a special case of
data, we need a reasonable definition of data as well. A synthesis of a
number of slightly varying definitions would be:

Data are qualitative and/or quantitative information collected through
observation
Just as for statistical metadata, several definitions of statistical data
can be found. OECD provides this definition:

Statistical data are data from a survey or administrative source used
to produce statistics
For statistical data warehouse purposes this definition has to be slightly
revised:

Statistical data are data from one or several surveys and/or
administrative sources used to produce statistics
Metadata categories
Metadata may describe many different aspects of data. Hence metadata can
be categorised in a number of ways, along overlapping dimensions.
Consequently, each metadata item normally belongs to several categories.
Active vs. passive metadata
Traditionally, metadata have been seen as a documentation of an existing
object or a process, such as a statistical production process, that is running
or has already finished – i.e. the result of a task most often carried out as the
last, even optional step of the production process. This indicates a passive,
recording role, which is useful for documenting, e.g., the methods used to
plan and carry out a survey or the quality achieved for the final results.
Passive metadata will become more active if they are used as input for
planning, e.g., a new survey round or a new similar statistics product. The
term active metadata should, however, be reserved for metadata that are
operational. Active metadata may be regarded as an intermediate layer
between the user and the data, which can be used by humans or computer
programmes to search, link, retrieve or perform other operations on data.
Thus active metadata may contain rules or code (algorithmic metadata).
Some authors reserve the term active for algorithmic metadata, i.e. those that can
be interpreted or executed at runtime to support metadata driven processes,
calling all other non-passive metadata semi-active.
Passive metadata are used as documentation in all statistics production
regardless of storage environment. In a statistical data warehouse active
metadata must be available in what is often called the metadata layer.
Suggested definitions:


Active metadata are metadata stored and organised in a way that
enables operational use, manual or automated, for one or more
processes (GSBPM)
Passive metadata are any metadata that are not active
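
To illustrate the distinction, the following minimal Python sketch shows one metadata item used both passively (a description for human readers) and actively (a value domain that a program evaluates at runtime). The class, the field names and the codes are hypothetical and are not taken from any of the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableMetadata:
    """Metadata for one variable (hypothetical structure)."""
    name: str
    description: str            # passive use: documentation for human readers
    allowed_values: frozenset   # active use: evaluated by a program at runtime


def validate(record, metadata):
    """Return the names of variables whose value is outside its value domain."""
    errors = []
    for meta in metadata:
        if record.get(meta.name) not in meta.allowed_values:
            errors.append(meta.name)
    return errors


# Illustrative metadata items; the codes are assumptions made for this example.
SEX = VariableMetadata("sex", "Sex of the respondent", frozenset({"1", "2"}))
REGION = VariableMetadata("region", "Region of residence",
                          frozenset({"SE1", "SE2", "SE3"}))

errors = validate({"sex": "1", "region": "XX"}, [SEX, REGION])
print(errors)
```

The same metadata item serves both roles here; what makes it active is that it is stored in a form a process can evaluate without human interpretation.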
Structured vs. free-form textual metadata
As mentioned above some authors claim that metadata must be structured, or
formalised. The opposite would probably be metadata in a completely free
form. In practice all metadata probably follow some kind of structure, which
may be more or less strict. At one end we have completely and strictly
formalised metadata, meaning that only pre-determined codes or numerical
information from a pre-determined domain may be used. At the other end we
find a loose structure, e.g. a set of chapters, subdivisions, headings, etc. that
may be mandatory or optional and whose contents may adhere to some rules
or may be entered in a completely free form (text, diagrams, etc.).
Strictly structured metadata are obviously well suited for use in an active
role, but there is no simple, unambiguous mapping between active and
structured, and passive and free-form, respectively.
Since active metadata are vital to building an efficient statistical data
warehouse it follows that in that environment metadata should also be well
structured, whenever possible.
Suggested definitions:


Structured metadata are metadata stored and organised according to
standardised codes, lists and hierarchies
Textual metadata are metadata that contain descriptive information
using formats ranging from completely free-form to semi-structured
Reference vs. structural metadata
Most sources define two main categories of metadata, most often called
business and technical metadata. The distinction between those two varies
between the authors, but a generalised definition could be that business
metadata help the user understand, interpret and evaluate the contents, the
subject matter, the quality, etc., of the data, while technical metadata help the
user find and access the data by providing attributes such as names and
descriptions of files, tables, columns, fields, etc.
In the “statistical sources” the terms business and technical metadata are
rarely used. Several different synonyms can be found for business metadata,
e.g. conceptual or logical. Most commonly used is, however, reference
metadata. Instead of technical metadata you will often find the term
structural metadata.
Structural/technical metadata can quite easily be represented as structured
and active, while more work is required to make reference/business
metadata active by storing them in a structured way.
Other similar categorisations are sometimes used, e.g. the term
administrative metadata (cf. NISO) for a subset of structural metadata
comprising metadata that handle users’ rights to access and utilise data (rights
management metadata) and metadata specifically for archiving purposes
(preservation metadata).
Suggested definitions:


Reference metadata are metadata that describe the contents and
quality of the data in order to help the user understand and evaluate
them (conceptually)
Structural metadata are metadata that help the user find, identify,
access and use the data (physically)
Process metadata
Information on an operation, such as start and end times, result status code,
number of records processed, resources used, etc., is a specific type of
metainformation. This kind of metadata is known under several names, such
as process metadata, process data, process metrics and paradata. These data may
contain either expected values or actual outcomes. In both cases they are
primarily intended for planning – in the latter case by evaluating finished
processes in order to improve recurring or similar ones. Process metadata
should be structured to facilitate computer aided evaluation.
Suggested definition:

Process metadata are metadata that describe the expected or actual
outcome of one or more processes using evaluable and operational
metrics
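
As a hedged illustration of structured process metadata, the sketch below records the attributes mentioned above (start and end times, result status code, number of records processed) and derives two evaluable metrics from them. The class name, the field names and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProcessRun:
    """Actual outcome of one process run (hypothetical structure)."""
    process: str
    start: datetime
    end: datetime
    status: str
    records_processed: int

    @property
    def duration_seconds(self) -> float:
        return (self.end - self.start).total_seconds()

    @property
    def records_per_second(self) -> float:
        # Guard against a zero-length run to avoid division by zero.
        seconds = self.duration_seconds
        return self.records_processed / seconds if seconds > 0 else 0.0


run = ProcessRun(
    process="load_survey",              # illustrative process name
    start=datetime(2011, 6, 10, 8, 0, 0),
    end=datetime(2011, 6, 10, 8, 0, 50),
    status="OK",
    records_processed=10_000,
)
print(run.duration_seconds, run.records_per_second)
```

Because the record is structured, metrics from recurring runs can be compared by a program rather than read from free-text logs.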
Quality metadata
Quality metadata may be read either as metadata on the quality of the data or
as metadata of high quality. Both interpretations are relevant to statistics
production and data warehousing.
Keeping track of, maintaining and perhaps raising the quality of the data in
the warehouse is an important governance task that requires support from
metadata. Quality information should be available in different forms and
serve several purposes: to describe the quality achieved (e.g. how a survey
was carried out, or what the outcome was), or to measure the outcome (a
contribution to the process metadata). The main objective of the former is to
serve the end users of the data, while the latter primarily supports
governance and future improvements. Hence quality metadata may be seen
as a different dimension that cuts through all the others.
Metadata quality is obviously a very important issue, and it should be high,
within the restrictions of reasonable cost-benefit analysis. Inferior metadata
quality may lead to unnecessary misinterpretations of the data contents or
even to completely useless data.
Suggested definition:

Quality metadata are any kind of metadata that contribute to the
description or interpretation of the quality of data.
Metadata structures
Several sources claim that the data warehouse needs a central system where
its metadata are registered and logically stored, a metadata registry. This
registry will make it easier to handle identification, check for duplicates,
ensure consistency, etc. It is, however, a logical matter; a centralised
metadata registry does not imply that metadata are physically stored in a
centralised system.
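
The idea of a logically central registry over physically distributed metadata can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of ISO/IEC 11179 or of any existing system; the class, the method names, the identifier and the location string are hypothetical.

```python
class MetadataRegistry:
    """Logically central registry; entries only point to where metadata live."""

    def __init__(self):
        self._entries = {}

    def register(self, identifier, definition, location):
        """Register a metadata item, rejecting duplicate identifiers."""
        if identifier in self._entries:
            raise ValueError(f"duplicate identifier: {identifier}")
        self._entries[identifier] = {
            "definition": definition,
            "location": location,   # pointer to the physical store, not a copy
        }

    def lookup(self, identifier):
        return self._entries[identifier]


registry = MetadataRegistry()
registry.register("VAR001", "Income before tax", "survey_db.income")
print(registry.lookup("VAR001")["location"])
```

The registry holds identification and a pointer only, so the metadata themselves may remain in separate systems.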
The term metadata repository is also frequently used, particularly when
discussing metadata in relation to data warehousing. In this case the
distinction between logical and physical matters seems less clear. The
repository is logically centralised, but while some also advocate a centralised
physical solution, based on some form of central “metadatabase”, others
prefer coordinated, physically distributed systems. This means that a metadata registry may be seen as a subset of a metadata repository, or as a
narrower definition.
A third commonly used term is the metadata layer. A data warehouse is
often described as consisting of several parts that serve separate functions,
sometimes called layers. The metadata layer may in this case be interpreted
as a synonym for either the metadata registry or the metadata repository,
depending on the exact definitions being used.
Metadata collection and usage
The metadata lifecycle is commonly described as divided into the following
three basic phases:
1. Collection
Metadata should be captured as early as possible in the production
process. The sources vary. Collection of some types of metadata can
and should be automated. When data are entered into the data
warehouse, basic metadata must already exist in a correct form.
2. Maintenance
Metadata must be up to date at all times. Processes must be in place
to capture changes and to synchronize metadata with the changing
architecture.
3. Deployment
Metadata must be available to users in the right form and with the
right tools.
Collection of metadata should be automated whenever possible. This means
that, e.g., metadata that exist in the sources, such as administrative data files
used as input, should be used directly or in a derived form.
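
As a rough sketch of such automated collection, the Python example below derives basic structural metadata (column names, a simple inferred type and a row count) directly from a delimited input file. The function names, the three type labels and the sample data are assumptions made for this illustration; a production system would need richer type and code list handling.

```python
import csv
import io


def infer_type(values):
    """Classify a column as 'integer', 'number' or 'text' from its values."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "text"


def collect_structural_metadata(source):
    """Derive one metadata entry per column from a CSV source with a header."""
    reader = csv.DictReader(source)
    rows = list(reader)
    return [
        {"name": name,
         "type": infer_type([row[name] for row in rows]),
         "rows": len(rows)}
        for name in reader.fieldnames
    ]


# Illustrative stand-in for an administrative data file.
sample = io.StringIO(
    "id,income,municipality\n"
    "1,250000.5,Stockholm\n"
    "2,310000,Uppsala\n"
)
metadata = collect_structural_metadata(sample)
for entry in metadata:
    print(entry)
```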
Another way of simplifying metadata collection is to use what already exists.
Reuse and inherit are common keywords in metadata literature. One of the
major advantages of using metadata is that duplicate and “near-duplicate”
data can be revealed and avoided. Reusing data and metadata saves
resources and increases efficiency and quality. Revealing, e.g., variables having
almost, but not quite, the same definitions can improve harmonisation and
comparability. The data harmonisation that will be enabled by metadata
harmonisation is a vital task for the data warehouse – possibly the most
important and at the same time one of the most difficult ones.
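
One possible way to reveal variables with almost, but not quite, the same definitions is to compare the definition texts pairwise. The sketch below uses the standard library's difflib for this; the 0.8 threshold, the variable names and the definitions are assumptions chosen for illustration, and textual similarity is only a first indication that two variables should be reviewed together.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(definitions, threshold=0.8):
    """Return pairs of variable names whose definitions are textually similar."""
    pairs = []
    for (a, text_a), (b, text_b) in combinations(sorted(definitions.items()), 2):
        ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical variable definitions from two different surveys.
definitions = {
    "inc_a": "Income before tax",
    "inc_b": "Income before taxes",
    "emp": "Number of employees",
}
candidates = near_duplicates(definitions)
print(candidates)
```

The pairs found are candidates for harmonisation, to be confirmed or rejected by a subject matter expert.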
Different user categories need different metadata and have different
requirements. End users want to use metadata to easily and correctly find and
interpret the data they need. Data stewards want an inventory of what is
stored in the data warehouse. Analysts want to compare the data sources.
Programmers want to make sure that they use the standard names. These are
just a few examples of metadata usage. The use ranges from detailed and
operational to overview and descriptive.
Metadata standards
Standards for metadata have been discussed for many years, but still have
not developed very far. The most successful effort is probably ISO/IEC
11179, Metadata registries, which is a standard on the conceptual level.
Several NSIs have based their metadata systems on that standard.
The Common Warehouse Metamodel (CWM) is a specification for
modelling metadata for data warehouses. The standard is supported by the
Object Management Group, which in turn is supported by several major
software companies.
DDI, the Data Documentation Initiative, is an XML based standard
specification for documentation of social science data. It is supported by an
international alliance.
SDMX, Statistical Data and Metadata eXchange, is also based on XML.
Several NSIs are currently cooperating on the development of a Generic
Statistical Information Model (GSIM), which includes the Common
Reference Model (CRM), and is linked to the Generic Statistical Business
Process Model (GSBPM). The work is led by ABS.
Metadata for statistical data warehouses
“Metadata is the DNA of the data warehouse, defining its elements and how
they work together. [...] Metadata plays such a critical role in the architecture
that it makes sense to describe the architecture as being metadata driven.” [1]
Panos Vassiliadis [2] of the University of Ioannina, Greece, summarizes well
the requirements of data warehouse metadata. They should include
information on:
1. the contents of the data warehouse, their location and their structure
2. the processes that take place in the data warehouse
3. the implicit semantics of data along with any other kind of data that
helps the end user exploit the information of the warehouse
4. the infrastructure and physical characteristics of components and the
sources of the data warehouse
5. security, authentication, and usage statistics that help the
administrator tune the operation of the data warehouse as appropriate
The metadata categories described earlier in this paper are general. Some
sources mention metadata categories specific to the data warehouse
environment, e.g. ETL metadata (for the “Extract–Transform–Load”
process), but these all seem to be subsets or mere renamings of the categories
already defined.
Looking at the categories, and keeping in mind the specific demands of a
statistics production environment, it is possible to assess which categories
play special roles in building and maintaining a statistical data warehouse
(SDW).

SDW requires active metadata. The number of objects (variables,
value domains, etc.) stored makes it necessary to provide the users
(persons and software) with active assistance in finding and processing
the data.

SDW requires structured metadata. The number of metadata items
will be large, and the requirement for metadata to be active makes it
necessary to structure the metadata very well.

SDW requires structural metadata. Active metadata must, at least in
part, be structural.

Process metadata are vital to an SDW. Since the data warehouse
supports many concurrent users it is very important to keep track of
usage, performance, etc. In a data warehouse that has been less than
perfectly designed, one user’s choice of tool or operation could impair
the performance for other users. An analysis of process metadata can
be an input to correcting this anomaly.
This does not mean that the remaining metadata categories should be
disregarded, but that they are used and needed in a statistical data warehouse
in the same way as in any statistics production environment.

[1] Kimball, The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley, 2008, p. 117
[2] Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009
Annex 1
Metadata related terms
Sources
· Wikipedia: direct quotations from Wikipedia and from its sources
  http://en.wikipedia.org/wiki/Metadata
  http://en.wikipedia.org/wiki/Data_warehouse
· ISO (International Organization for Standardization), ISO/IEC 11179 Metadata registries (MDR), http://metadata-stds.org/11179/
· NISO (National Information Standards Organization), Understanding Metadata, http://www.niso.org/publications/press/UnderstandingMetadata.pdf
· UNECE Metadata Common Vocabulary, MCV (Draft, March 2006), http://circa.europa.eu/Public/irc/dsis/metadata/library?l=/metadata_forces/force_meeting_092007/mtf-6-mcv-anxpdf/_EN_1.0_&a=d
· UNECE, Terminology on Statistical Metadata (2000), http://www.unece.org/stats/publications/53metadaterminology.pdf
· UNECE, Guidelines for the modeling of statistical data and metadata (1995), http://www.unece.org/stats/publications/metadatamodeling.pdf
· OECD, Glossary of Statistical Terms, http://stats.oecd.org/glossary/
Term
Metadata
Statistical
metadata
Wikipedia
Data providing information
about one or more aspects of
the data, such as:
· Means of creation of the data
· Purpose of the data
· Time and date of creation
· Creator or author of data
· Placement on a computer network where the data was created
· Standards used
ISO
Data that defines and
describes other data
NISO
Structured information that
describes, explains, locates, or
otherwise makes it easier to
retrieve, use, or manage an
information resource. Metadata
is often called data about data or
information about information.
Metadata can describe resources
at any level of aggregation. It
can describe a collection, a
single resource, or a component
part of a larger resource
OECD, UNECE (Metadata
Common Vocabulary)
Data that defines and describes
other data.
UNECE (Terminology,
Guidelines)
Data and other documentation
that describes objects in a
formalized way
Metadata are data that describe
other data, and data become
metadata when they are used in this
way.
Data about statistical data.
· Comprises data and other
documentation that describes
objects in a formalised way.
· Provides information on data and
about processes of producing and
using data.
Metadata describing statistical
data
Term
Data
Wikipedia
Qualitative or quantitative
attributes of a variable or set
of variables. Data are
typically the results of
measurements [...] or
observations [...].
ISO
NISO
Re-interpretable
representation of
information in a
formalized manner
suitable for
communication,
interpretation, or
processing
Statistical
data
Structural
metadata
Reference
metadata
Describe the structure of
computer systems such as
tables, columns and indexes.
Bretherton & Singley
(Technical) Defines the
objects and processes from a
technical perspective [...] like
tables, fields, data types,
indexes [...] Kimball
(Guide) Help humans find
specific items.
Bretherton & Singley
(Business) Describes the
contents [...] in user
accessible terms [...] what
data you have, where it
comes from, what it means,
[...] Kimball
OECD, UNECE (Metadata
Common Vocabulary)
Characteristics or information,
usually numerical, that are collected
through observation.
UNECE (Terminology,
Guidelines)
The physical representation of
information in a manner
suitable for communication,
interpretation, or processing
by human beings or by
automatic means.
Data from a survey or
administrative source used to
produce statistics
Data that are collected and/or
generated by statistics in
process of statistical
observations or statistical data
processing
Indicate how compound objects
are put together, e.g., how pages
are ordered to form chapters
Act as identifiers and descriptors of
the data. They are used to identify,
use, and process data matrixes and
data cubes, e.g. names of columns
or dimensions of statistical cubes.
(Descriptive) Describe a
resource for purposes such as
discovery and identification. It
can include elements such as
title, abstract, author, and
keywords
Describe the contents and the
quality of the statistical data. Should
include conceptual, methodological
and quality metadata
Term
Wikipedia
ISO
Algorithmic
metadata
UNECE (Terminology,
Guidelines)
An instance of a metadata object. It
has associated attributes. It can have
a distinct status: mandatory,
conditional and optional.
A group of characters
describing the data and treated
as metadata unit
Describes the results of
various operations [...] start
time, end time, CPU seconds
used [...]
Kimball
Instance of a metadata
object
Metadata
item
Metadata
usage
OECD, UNECE (Metadata
Common Vocabulary)
Provide information to help
manage a resource, such as
when and how it was created,
file type and other technical
information, and who can access
it. There are several subsets [...]:
− Rights management metadata,
which deals with intellectual
property rights, and
− Preservation metadata, which
contains information needed to
archive and preserve a resource.
Administrative
metadata
Process
metadata
NISO
Data virtualization, statistics
and census services, data
warehousing
Discovery and organisation of
electronic resources,
interoperability, integration,
identification, archiving.
Include
· the algorithms as such behind statistical procedures, including procedures for statistical analysis;
· descriptions of the algorithms
Term
Metadata
layer
Metadata
registry
Metadata
repository
Wikipedia
[data warehouse] The data
dictionary – This is usually
more detailed than an
operational system data
dictionary.
A central location in an
organization where metadata
definitions are stored and
maintained in a controlled
method. Metadata registries
are used whenever data must
be used consistently within
an organization or group of
organizations.
A data dictionary [...] a
"centralized repository of
information about data such
as meaning, relationships to
other data, origin, usage, and
format."
ISO
NISO
OECD, UNECE (Metadata
Common Vocabulary)
The layer in the reference model for
standardization in statistics used to
denote the set of attributes related to
statistical metainformation
Information system for
registering metadata
(MDR)
Provides information on the
definition, origin, source, and
location of data [...] at many
levels, including schemes, usage
profiles, metadata elements, and
code lists for element values. It
provides an integrating resource
for legacy data, acts as a lookup
tool for designers of new
databases, and documents each
data element.
An information system for
registering metadata. Registration
accomplishes three main goals:
identification, provenance, and
monitoring quality. [...] It manages
the semantics of data.
A logically central statistical
metadata repository that allows for
the query, editing, and managing of
metadata. Such a system provides a
mechanism for looking up
information about statistical
products as well as their design,
development, and analysis.
UNECE (Terminology,
Guidelines)
(Metadata holding) A logical
or physical set of metadata
(e.g. database) stored together
with its description (e.g.
schema)