1.1 Metadata Framework for Statistical Data Warehousing

Title: Framework of metadata requirements and roles in the S-DWH
WP: 1
Deliverable: 1.1
Version: 1.12
Date: 6-9-2013
Author: Lars-Göran Lundell
NSI: Sweden
ESS - NET
ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
Metadata Framework for Statistical Data
Warehousing
Contents
1 Metadata – general considerations
1.1 Metadata definitions and terminology
1.1.1 Metadata and data
1.1.2 Metadata categories
1.1.3 Metadata subsets
1.1.4 Metadata structures
1.2 Metadata collection and usage
1.3 Metadata standards
1.3.1 GSBPM
1.3.2 GSIM
1.3.3 MDR (ISO/IEC 11179)
1.3.4 CWM
1.3.5 DDI
1.3.6 SDMX
1.3.7 MCV
2 Metadata in the statistical data warehouse
2.1 The SDWH metadata requirements
2.1.1 Minimum metadata requirements for the SDWH
2.2 Metadata and the layered SDWH
2.2.1 Source layer metadata
2.2.2 Integration layer metadata
2.2.3 Interpretation and data analysis layer metadata
2.2.4 Data access layer metadata
2.2.5 Summary of SDWH layers and metadata categories
2.3 Organising SDWH metadata
2.4 SDWH metadata governance
2.5 The SDWH and metadata standards
Version 1.12
Report
2013-09-06
ESSnet on Data Warehousing
WP1
“Metadata plays a very active and important part in the data warehouse environment.
[...] Metadata for the data warehouse environment is one of the most important
aspects.” [1]
“Metadata is the DNA of the data warehouse, defining its elements and how they
work together. [...] Metadata plays such a critical role in the architecture that it
makes sense to describe the architecture as being metadata driven.” [2]
The quotations above originate from “the fathers of the data warehouse”, Bill Inmon
and Ralph Kimball. Even if they do not always agree on how a data warehouse
should be built and maintained, they obviously share the view that much effort
should be devoted to designing the metadata system when establishing a data
warehouse. To everyone working in an organisation that produces statistics, like a
national statistical institute (NSI), the need for good metadata is already well
known, regardless of the production environment. Thus it is obvious that a statistical
data warehouse (SDWH) is dependent on its metadata for statistical as well as data
warehousing purposes.
According to the framework partnership agreement (FPA) for the ESSnet project on
micro data linking and data warehousing in statistical production, this project is to
“define a functional model of the SDWH, so that the issues raised by the ESSnet
can be assessed in a generic and standardized way”.
This paper attempts to define the roles and purposes of metadata in the SDWH in
generic terms, and to distinguish between them and those used in statistics
production regardless of the environment, i.e. a general metadata framework for
statistical data warehousing.
1 Metadata – general considerations
1.1 Metadata definitions and terminology
The first step in any kind of standardisation work usually concerns making sure that
all involved parties understand and agree on a set of basic definitions and use a
common terminology. This chapter covers a number of important basic definitions.
1.1.1 Metadata and data
General definitions of metadata can be found in many books and on many Internet
sites. Most of them are very short and simple. The most commonly used generic
definition states that:
[Def 1.1]
Metadata are data about data [3]
There are some variations on the theme, e.g. claiming that metadata should (or
must) be structured or formalised. Perhaps somewhat unexpectedly, the sources
related to statistics give definitions that are even shorter and vaguer than
some of the general-purpose sources. The definition of statistical metadata given by
OECD and UNECE, for example, simply states that:
[1] Inmon, Metadata in the Data Warehouse (White Paper), 2000
[2] Kimball, The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley, 2008, p. 117
[3] ISO/IEC 11179; Eurostat Metadata Common Vocabulary
[Def 1.2]
Statistical metadata are data about statistical data [4]
This definition will obviously cover all kinds of documentation with some reference
to any type of statistical data and is applicable to metadata that refer to data stored
in a SDWH as well as any other type of data store.
Since the definition of metadata shows that they are just a special case of data, we
need a reasonable definition of data as well. A definition derived from a number of
slightly varying definitions would be:
[Def 1.3]
Data are qualitative and/or quantitative information collected through
observation [5]
Just as for statistical metadata, we can find several definitions of
statistical data:
[Def 1.4]
Statistical data are data derived from either statistical or non-statistical
sources, which are used in the process of producing
statistical products [6]
These basic definitions are very generic and state nothing about requirements on the
contents or organisation of the data or metadata.
1.1.2 Metadata categories
Metadata may describe many different aspects of data. Hence metadata can be
categorised in several ways, where the categories form a multi-dimensional
structure. Consequently, each metadata item normally belongs to several categories.
1.1.2.1 Active – passive
Traditionally, metadata have been seen as a documentation of an existing object or a
process, such as a statistical production process that is running or has already
finished – i.e. the result of a task most often carried out as the last, even optional
step of the production process. This indicates a passive, recording role, which is
useful for documenting, e.g., the variables, objects and methods used to plan and
carry out a survey or the quality achieved for the results.
Passive metadata will become more active if they are used as input for planning,
e.g., a new survey round or a new similar statistics product. The term active
metadata should, however, be reserved for metadata that are operational. Active
metadata may be regarded as an intermediate layer between the user and the data,
which can be used by humans or computer programmes to search, link, retrieve or
perform other operations on data. Thus active metadata may be expressed as
parameters, and may contain rules or code (algorithmic metadata). Some authors
use the term active only for such algorithmic metadata, i.e. those that can be
interpreted or executed at runtime to support metadata-driven processes, and call
all other non-passive metadata semi-active.
[4] OECD, Glossary of Statistical Terms; UNECE, Terminology on Statistical Metadata
[5] Eurostat Metadata Common Vocabulary
[6] Eurostat Metadata Common Vocabulary
Passive metadata are used as documentation in all statistics production regardless of
storage environment. In the SDWH active metadata must be available in what is
often called the metadata layer (cf. definition 4.1).
[Def 2.1]
Active metadata are metadata stored and organised in a way that
enables operational use, manual or automated, for one or more
processes
Examples: Instruction; user manual; parameter; script (SQL, XML)
[Def 2.2]
Passive metadata are all metadata that are not active
Examples: Quality report for a survey, a census or register;
documentation of methods that were used during a survey; most log
lists; definitions of variables
1.1.2.2 Formalised – free-form
According to some sources all metadata must be structured, or formalised. In the
opposite case all metadata would be created and stored in completely free form –
unstructured and non-formalised. In practice all metadata probably follow some
kind of structure, which may be more or less strict. At one end, we have completely
and strictly formalised metadata, meaning that only pre-determined codes or
numerical information from a pre-determined domain may be used. At the other
end, we find a loose structure, e.g. a set of chapters, subdivisions, headings, etc.,
that may be mandatory or optional and whose contents may adhere to some rules or
may be entered in a completely free form (text, diagrams, etc.).
Strictly formalised metadata are obviously well suited for use in an active role, but
there is no simple, unambiguous mapping between active and formalised, and
passive and free-form, respectively. Still, formalised metadata are more easily used
actively, and since active metadata are vital to building an efficient SDWH it
follows that its metadata should also be formalised, whenever possible.
[Def 2.3]
Formalised metadata are metadata stored and organised according
to standardised codes, lists and hierarchies
Examples: Classification codes; parameter lists; most log lists
[Def 2.4]
Free-form metadata are metadata that contain descriptive
information using formats ranging from completely free-form to
partly formalised (semi-structured)
Examples: Quality report for a survey, a census or register;
methodological description; process documentation; background
information
1.1.2.3 Reference – structural
Most sources define two main categories of metadata, often called business and
technical metadata. The “statistical sources” rarely use those terms. Several
different synonyms can be found for business metadata, e.g. conceptual or logical,
but the most commonly used term is reference metadata. Instead of technical
metadata, the “statistical sources” most often use the term structural metadata to
refer to the same thing.
The distinction between the two categories varies between sources, but generally
reference metadata help the user understand, interpret and evaluate the contents, the
subject matter, the quality, etc., of the corresponding data, whilst structural metadata
help the user, who in this case may be human or machine, find, access and utilise the
data operationally.
Particularly in data warehousing structural metadata can be defined as any metadata
that can be used actively or operationally in a metadata driven system. The user may
in this case be a human or a machine (a programme, a process, a system). This
includes metadata that describe the physical locations of the corresponding data,
such as names or other identities of servers, databases, tables, columns, files,
positions, etc.
Structural metadata are normally represented as formalised and active, whilst
reference metadata are typically passive and stored in a free format, requiring more
effort to make them active by storing them in a structured way.
[Def 2.5]
Reference metadata are metadata that describe the contents and
quality of the data in order to help the user understand and evaluate
them (conceptually) [7]
Examples: Quality information on survey, register and variable
levels; variable definitions; reference dates; confidentiality
information; contact information; relations between metadata items
[Def 2.6]
Structural metadata are metadata that help the user find, identify,
access and utilise the data (physically)
Examples: Classification codes; parameter lists
All categories described above are valid for all metadata, i.e. every metadata item
can be categorised on the three scales: active–passive, formalised–free-form, and
reference–structural, as illustrated in figure 1.
Figure 1 Categorisation of a metadata item
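The three scales can be illustrated with a small data structure. The following is only a sketch: the class and field names are assumptions made for illustration and are not part of the framework, and the two example items are placed according to the typical cases mentioned in the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Activity(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class Form(Enum):
    FORMALISED = "formalised"
    FREE_FORM = "free-form"


class Role(Enum):
    REFERENCE = "reference"
    STRUCTURAL = "structural"


@dataclass(frozen=True)
class MetadataItem:
    """A metadata item placed on each of the three scales (hypothetical structure)."""
    name: str
    activity: Activity
    form: Form
    role: Role

    def categories(self) -> tuple:
        # Every item has exactly one value on each scale.
        return (self.activity.value, self.form.value, self.role.value)


# Illustrative placements only; a real item may be categorised differently.
code_list = MetadataItem("classification codes",
                         Activity.ACTIVE, Form.FORMALISED, Role.STRUCTURAL)
quality_report = MetadataItem("quality report",
                              Activity.PASSIVE, Form.FREE_FORM, Role.REFERENCE)

print(code_list.categories())
print(quality_report.categories())
```

Modelling each scale as its own enumeration reflects that the scales are independent: an item takes one value on each scale, and no combination is excluded by construction.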
1.1.3 Metadata subsets
In addition to the categories described above a metadata item may (but does not
necessarily have to) also belong to a specific type, or subset of metadata. Below are
described the subsets that are generally best known or considered most important.
Several more types may be identified to serve special purposes, but are not further
described here.
[7] Eurostat Metadata Common Vocabulary
1.1.3.1 Statistical metadata
According to definition 1.2 statistical metadata are “data about statistical data”. This
definition is very generic and needs to be more precise in order to be useful. From a
more operational point of view statistical metadata can be seen as those metadata
that directly refer to central concepts in the statistics, e.g., those that define and
describe statistical unit types used in a survey, a census or a register, their
characteristics, the variables and the statistical activities [8]. This still means that the
statistical metadata subset may – at least partly – overlap some other subsets, but
will exclude some more administrative and technical ones.
Statistical metadata may belong to any of the metadata categories described above.
[Def 3.1]
Statistical metadata are data about statistical data.
Examples: Variable definition; register description; code list
1.1.3.2 Process metadata
Information on an operation, such as when it started and ended, the resulting status,
the number of records processed, which resources were used, etc., is a specific type
of metainformation. This kind of metadata is known under several names, such as
process metadata, process data, process metrics, or paradata. These data may
contain either expected values or actual outcome. In both cases, they are primarily
intended for planning – in the latter case by evaluating finished processes in order to
improve recurring or similar ones. If process metadata are formalised, this will
obviously facilitate computer-aided evaluation.
Process metadata are less likely to be categorised as free-form, but may be active or
passive, and reference or structural.
[Def 3.2]
Process metadata are metadata that describe the expected or actual
outcome of one or more processes using evaluable and operational
metrics
Examples: Operator’s manual (active, formalised, reference);
parameter list (active, formalised, reference); log file (passive,
formalised, reference/structural)
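As an illustration of formalised process metadata, the sketch below records the items mentioned above (start and end time, resulting status, number of records processed) together with an expected value, and derives two simple metrics from them. The record layout, field names and sample values are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProcessRun:
    """One log record for a finished process (hypothetical layout)."""
    process: str
    started: datetime
    ended: datetime
    status: str
    records_processed: int   # actual outcome
    expected_records: int    # expected value, set when planning

    def duration_seconds(self) -> float:
        return (self.ended - self.started).total_seconds()

    def completeness(self) -> float:
        # Share of the expected records that were actually processed.
        return self.records_processed / self.expected_records


# Invented sample values, used only to show the calculation.
run = ProcessRun(
    process="load_register",
    started=datetime(2013, 9, 6, 8, 0, 0),
    ended=datetime(2013, 9, 6, 8, 5, 0),
    status="ok",
    records_processed=950,
    expected_records=1000,
)

print(run.duration_seconds(), run.completeness())
```

Because the record is formalised, the comparison between expected and actual outcome can be computed automatically, which is the kind of computer-aided evaluation the text refers to.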
1.1.3.3 Quality metadata
The term quality metadata may be read either as metadata on the quality of the data
or as metadata of (high) quality. Both interpretations are relevant to statistics
production and data warehousing.
Keeping track of, maintaining and perhaps raising the quality of the data in the
SDWH is an important governance task that requires support from metadata.
Quality information should be available in different forms and serve several
purposes: to describe the quality achieved (e.g. how a survey was carried out, or
what the outcome was), or to measure the outcome (a contribution to the process
metadata). The main objective of the former is to serve the end users of the data,
while the latter primarily supports governance and future improvements.
[8] The Neuchâtel Terminology Model, Part II, 2006
Most quality metadata can be categorised as passive, free-form and reference
metadata.
[Def 3.3]
Quality metadata are any kind of metadata that contributes to the
description or interpretation of the quality of data.
Examples: Quality declarations for a survey, a census or a register
(passive, free-form, reference); documentation of methods that were
used during a survey (passive, free-form, reference); most log lists
(passive, formalised, reference/structural)
Metadata quality is obviously a very important issue, and it should be high, within
the restrictions of reasonable cost-benefit analysis. Inferior metadata quality may
lead to unnecessary misinterpretations of the data contents or even result in completely
useless data. A detailed discussion and recommendations on metadata quality can be
found in Recommendations on the Impact of Metadata Quality in the Statistical
Data Warehouse [9] (Deliverable 1.2).
1.1.3.4 Technical metadata
In section 1.1.2.3, technical metadata were mentioned as a commonly used
synonym for structural metadata. We chose not to further use the term as a metadata
category, but may instead use it as a metadata subset referring to information
necessary to locate the data physically.
Technical metadata are usually categorised as formalised, active and structural.
[Def 3.4]
Technical metadata are metadata that describe or define the physical
storage or location of data.
Examples: Server, database, table and column names and/or
identifiers; server, directory and file names and/or identifiers
1.1.3.5 Authorisation metadata
Every computerised system needs some way of handling user privileges, access
rights, etc. Users need to be classified or assigned a role as, e.g., “administrator”,
“user” or “guest”, or to be given an explicit privilege to “read”, “write”, or “update”
a certain item, etc. In a data warehouse, having a large amount of data and many
users performing various tasks, there is a need for a comprehensive authorisation
subsystem. This system will need to store and use its own administrative data,
which may be defined as authorisation metadata.
Authorisation metadata are categorised as active, formalised and structural.
[Def 3.5]
Authorisation metadata are administrative data that are used by
programmes, systems or subsystems to manage users’ access to data.
Examples: User lists with privileges; cross references between
resources and users
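A minimal sketch of how such authorisation metadata might be used operationally is given below. The user names, resource name and table layout are hypothetical; a real authorisation subsystem would normally also handle roles and groups.

```python
# Cross reference between users and resources: (user, resource) -> privileges.
# All names are invented for illustration.
PRIVILEGES = {
    ("anna", "business_register"): {"read", "write", "update"},
    ("guest", "business_register"): {"read"},
}


def is_allowed(user: str, resource: str, action: str) -> bool:
    """Return True if the user holds the requested privilege on the resource."""
    # Unknown users or resources get an empty privilege set, i.e. no access.
    return action in PRIVILEGES.get((user, resource), set())


print(is_allowed("anna", "business_register", "write"))
print(is_allowed("guest", "business_register", "write"))
```

The lookup is active, formalised and structural in the sense used above: it is evaluated at runtime, draws on a fixed set of privilege codes, and governs access to the data rather than describing their contents.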
[9] Bowler, Lindelauf, Dressen (2013)
1.1.3.6 Data models
The various types of data models are an often overlooked type of metadata. The
reason is probably that these metadata are usually only seen as useful to the
technical staff (IT personnel).
[Def 3.6]
A data model is an abstract documentation of the structure of data
needed and created by business processes.
Important types of data models for the SDWH include the conceptual model that
usually gives a high-level overview, and the physical model that describes the
details of databases, files, etc. The metadata model can also be described
conceptually as well as physically.
[Def 3.6.1]
A metadata model is a special case of a data model: an abstract
documentation of the structure of metadata used by business
processes.
1.1.4 Metadata structures
In order to find, retrieve and use metadata efficiently their locations must be known
to users on some level.
A data warehouse is often described as consisting of several parts that serve
separate functions, sometimes called layers [10]. Since metadata is a vital part of the
SDWH the term metadata layer is sometimes used to refer to the metadata store and
metadata functions in the SDWH.
[Def 4.1]
A metadata layer is a conceptual term that refers to all metadata in a
data warehouse, regardless of logical or physical organisation.
Metadata need to be organised in some kind of structured, logical way in order to
make it possible to find and use them. Some sources use different terms for logical
and physical metadata structures, respectively, but most do not distinguish between
them. Sometimes it is useful to have separate terms; a logical structure may be, e.g.,
physically stored in several distributed, coordinated structures.
Another distinction can be found in the level of formal organisation of the metadata
store, the restrictions and approval rules required to perform changes, and the
coordination of the contents. The term registry often refers to a more strictly
administered, regulated and coordinated environment than the more general term
repository.
[Def 4.2]
A metadata registry is a central point where logical metadata
definitions are stored and maintained using controlled methods.
In order to load a metadata item into the registry it must fulfil the requirements set
up on, e.g., structure, contents and relations to other metadata items. Normally the
registry does not define any links between metadata and the data they describe, i.e.
the physical addresses of the data.
Usually the definition of a metadata repository does not require the metadata to
adhere to strict rules in order to be loaded. On the other hand, the repository usually
implies storing metadata for operational use, i.e. it is expected to contain a link to
the corresponding data.
[10] Palma, S-DWH Business Architecture, 2013
[Def 4.3]
A metadata repository is a physical location where metadata and
their links to data are stored.
Figure 2. Using the metadata layer to locate and retrieve data
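The difference between a registry and a repository can be sketched in code. This is a simplified illustration under stated assumptions: the class names, the required fields and the location string are hypothetical, and the acceptance rule merely stands in for whatever requirements a real registry would enforce.

```python
class MetadataRegistry:
    """Controlled store of logical definitions (cf. Def 4.2)."""

    REQUIRED_FIELDS = ("identity", "name", "definition")  # assumed rule set

    def __init__(self):
        self._definitions = {}

    def register(self, item: dict) -> None:
        # An item is only loaded if it fulfils the registry's requirements.
        missing = [f for f in self.REQUIRED_FIELDS if not item.get(f)]
        if missing:
            raise ValueError(f"rejected, missing fields: {missing}")
        if item["identity"] in self._definitions:
            raise ValueError(f"identity already registered: {item['identity']}")
        self._definitions[item["identity"]] = dict(item)

    def lookup(self, identity: str) -> dict:
        return self._definitions[identity]


class MetadataRepository:
    """Operational store keeping metadata with links to the data (cf. Def 4.3)."""

    def __init__(self):
        self._entries = {}

    def store(self, identity: str, metadata: dict, location: str) -> None:
        # No validation: the repository does not impose strict loading rules.
        self._entries[identity] = {"metadata": metadata, "location": location}

    def locate(self, identity: str) -> str:
        return self._entries[identity]["location"]


registry = MetadataRegistry()
registry.register({"identity": "V001", "name": "turnover",
                   "definition": "placeholder definition"})

repository = MetadataRepository()
repository.store("V001", registry.lookup("V001"),
                 "server1/db_stat/enterprise/turnover")

print(repository.locate("V001"))
```

In this sketch the registry holds no physical addresses, while the repository is the place where a user or programme resolves a metadata identity to the location of the corresponding data.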
1.2 Metadata collection and usage
The metadata lifecycle is commonly described as divided into the following three
basic phases:
1. Collection
Metadata should be captured as early as possible in the production process.
The sources vary. Collection of some types of metadata can and should be
automated. When data are entered into the data warehouse, basic metadata
must already exist in a correct form.
2. Maintenance
Metadata must be up to date at all times. Processes must be in place to
capture changes and synchronise metadata with the changing architecture.
3. Deployment
Metadata must be available to users in the right form and with the right
tools.
Collection of metadata should be automated whenever possible. This means that,
e.g., metadata that exist in the sources, such as administrative data files used as
input, should be used directly or in a derived form.
Another way of simplifying metadata collection is to use what already exists. Reuse
and inherit are common keywords in metadata literature. One of the major
advantages of using metadata is that duplicate and “near-duplicate” data can be
revealed and avoided. Reusing data and metadata saves resources and increases
efficiency and quality. Revealing, e.g., variables having almost, but not quite, the
same definitions can improve harmonisation and comparability. The improvement
of data consistency that will be enabled by metadata harmonisation is a vital task for
the data warehouse – possibly the most important and at the same time one of the
most difficult ones.
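One way of revealing near-duplicate variable definitions is a simple pairwise text comparison. The sketch below uses the standard-library `difflib.SequenceMatcher`; the variable names, the definitions and the threshold of 0.8 are assumptions chosen for illustration, and a production system would probably need a more robust similarity measure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(definitions: dict, threshold: float = 0.8) -> list:
    """Return (name_a, name_b, similarity) for pairs at or above the threshold."""
    hits = []
    for a, b in combinations(sorted(definitions), 2):
        ratio = SequenceMatcher(
            None, definitions[a].lower(), definitions[b].lower()
        ).ratio()
        if ratio >= threshold:
            hits.append((a, b, round(ratio, 2)))
    return hits


# Invented variable definitions, for illustration only.
variables = {
    "emp_year": "Number of persons employed at the end of the year",
    "emp_quarter": "Number of persons employed at the end of the quarter",
    "turnover": "Total turnover excluding VAT",
}

for a, b, ratio in near_duplicates(variables):
    print(a, b, ratio)
```

Pairs reported in this way are only candidates: deciding whether two definitions should be harmonised or kept apart remains a subject-matter judgement.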
Different user categories need different metadata and have different requirements.
End users want to use metadata to easily and correctly find and interpret the data
they need. Data stewards want an inventory of what is stored in the data warehouse.
Analysts want to compare the data sources. Programmers want to make sure that
they use the standard names. These are just a few examples of metadata usage. The
use ranges from detailed and operational to overview and descriptive.
A more thorough discussion on metadata collection and usage can be found in
Definition of the functionalities of a metadata system to facilitate and support the
operation of the S-DWH [11] (Deliverable 1.4).
1.3 Metadata standards
An important step towards a common view on metadata for a SDWH is to identify
which already existing standards are available and relevant for the purpose.
Metadata standards relating to statistics as well as to data warehousing should be
taken into account.
Information on statistical metadata standards can be found in the Common Metadata
Framework [12], created and maintained by METIS, the joint UNECE-Eurostat-OECD
Work Session on Statistical Metadata. The models and technical standards considered
most relevant to a SDWH are briefly described below. For more detailed
information on a standard, refer to Overview of and recommendations on the use of
metadata models [13] (Deliverable 1.3) and the links below.
1.3.1 GSBPM
The Generic Statistical Business Process Model [14] (GSBPM) provides a framework
to describe the statistical production process in terms of standard components
(phases and sub-processes). One of the original aims of the model is to standardise
the process terminology, thereby making it easier to compare and benchmark
processes within and between organisations, primarily NSIs and international
organisations.
The current version, 4.0, was released in 2009.
1.3.2 GSIM
The Generic Statistical Information Model [15] (GSIM) is a reference framework that
describes information objects used in the production of official statistics. GSIM
provides a common language to describe information that supports the whole
statistical production process. It is aligned with relevant data management and
exchange standards, such as DDI and SDMX, but it is not directly tied to them, or to
any specific technology. GSIM and GSBPM are complementary models, where
GSIM describes the information that is used by and produced by the processes
described in GSBPM.
Version 1.0 of GSIM was released in 2012 after being developed by a group of
NSIs, led by the Australian Bureau of Statistics (ABS).
[11] Ennok, Lundell, Bowler, De Giorgi, Kulla (2013)
[12] http://www1.unece.org/stat/platform/display/metis/Part+B++Metadata+Concepts%2C+Standards%2C+Models+and+Registries
[13] Dressen, Lindelauf, Goossens (2012)
[14] http://www1.unece.org/stat/platform/display/metis/The+Generic+Statistical+Business+Process+Model
[15] http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=59703371
1.3.3 MDR (ISO/IEC 11179)
ISO/IEC 11179 [16] is a well-established international standard for representing
metadata in a metadata registry (MDR). It has two main purposes: definition and
exchange of concepts. Thus, it describes the semantics and concepts, but does not
handle physical representation of the data. It aims to be a standard for
metadata-driven exchange of data in heterogeneous environments, based on exact
definitions of data.
Several NSIs have based their current metadata systems on this standard. Most of
those are developed in-house, but at least one commercial product exists that claims
to support the standard (OneMeta MDR).
The standard was first published in 1999. The most recent update was made in
2013.
1.3.4 CWM
The Common Warehouse Metamodel [17] (CWM), ISO/IEC 19504, is a specification
for modelling metadata for data warehouses. Its purpose is to enable easy
interchange of data warehouse metadata between tools, platforms and metadata
repositories in distributed heterogeneous environments. CWM is based on, or
supports, other standards such as UML, XML and CORBA.
CWM is supported by the Object Management Group, which in turn is supported by
several major software companies. Several commercial products claim to support
this standard, at least partly.
The current version, 1.1, was released in 2003.
1.3.5 DDI
The Data Documentation Initiative [18] (DDI) has its roots in the data archive
environment, but with its latest development, DDI 3 or DDI Lifecycle, it has become
an increasingly interesting option for NSIs. DDI is an effort to create an
international standard for describing data from the social, behavioural, and
economic sciences. It is based on XML.
DDI is supported by a non-profit international organisation, the DDI Alliance.
Several tools that support DDI are available, both on the commercial market and as
free software.
The current version, 3.1, was published in 2009. Version 3.2 has been under public
review and is expected to be released in 2013.
1.3.6 SDMX
Statistical Data and Metadata eXchange [19] (SDMX) was initiated by seven
international organisations to foster standards for the exchange of statistical
information. SDMX has its focus on macro data, even though the model supports
micro data. It is an adopted standard for delivering and sharing data between NSIs
and Eurostat. Sharing the results from the latest Population Census is perhaps the
most advanced example, so far. Several software products that are commonly used
by NSIs support SDMX.
The current version, 2.1, was published in 2012.
[16] http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
[17] http://www.omg.org/spec/CWM/1.1
[18] http://www.ddialliance.org/specification
[19] http://www.sdmx.org/
1.3.7 MCV
The Metadata Common Vocabulary [20] (MCV) is not a standard in itself, but provides
definitions of common metadata concepts, in particular in the domain of statistical
metadata.
It was compiled as a part of the development of SDMX. It is maintained by the
SDMX consortium and is a part of the SDMX Content-oriented Guidelines.
The current version was published in 2009.
2 Metadata in the statistical data warehouse
Although most authors of data warehousing literature seem to agree with Inmon and
Kimball on the important role of metadata, there is surprisingly little practical
guidance on how to implement a metadata layer. An article by Panos Vassiliadis [21] of
the University of Ioannina, Greece, summarises well the requirements of data
warehouse metadata. They should include information on:
1. the contents of the data warehouse, their location and their structure
2. the processes that take place in the data warehouse
3. the implicit semantics of data along with any other kind of data that helps the
end user exploit the information of the warehouse
4. the infrastructure and physical characteristics of components and the
sources of the data warehouse
5. security, authentication, and usage statistics that help the administrator tune
the operation of the data warehouse as appropriate
General metadata requirements for statistics production, regardless of environment,
have been investigated and discussed in many previous projects, several of those
initiated by international organisations. Those issues should not be repeated here;
instead this document will focus on the specific roles of metadata in a SDWH, and
the special demands that can be identified for that environment.
2.1 The SDWH metadata requirements
The data warehouse architecture is, according to Kimball and others, metadata
driven. Referring to the metadata categories described earlier, and keeping in mind
the specific metadata requirements of statistics production it is possible to assess
which categories play significant roles when building and maintaining a SDWH.

• The SDWH requires active metadata. The number of objects (variables,
value domains, etc.) stored makes it necessary to provide the users (persons
and software) with active assistance in finding and processing the data.
• The SDWH requires formalised metadata. The number of metadata items
will be large and the requirement for metadata to be active makes it
necessary to structure the metadata very well.
• The SDWH requires structural metadata, especially technical metadata.
Active metadata must be structural, at least in part.
• Process metadata are vital to a SDWH. Since the data warehouse supports
many concurrent users it is very important to keep track of usage,
performance, etc. In a data warehouse that has been less than perfectly
designed one user’s choice of tool or operation could impair the
performance for other users. An analysis of process metadata can be an
input to correcting this anomaly.
[20] http://sdmx.org/wp-content/uploads/2009/01/04_sdmx_cog_annex_4_mcv_2009.pdf
[21] Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009
The table below shows the possible combinations of metadata categories and
subsets. The cells indicate which combinations are of general interest for
statistics production (“gen”) and which are of particular interest for a SDWH
(“dw”). Most of the remaining combinations are possible, but less common or less
likely to be found useful.
Rows (metadata subset): Statistical, Process, Quality, Technical, Authorisation, Data model
Columns (metadata category): Formalised and Free-form, each divided into Reference and
Structural, each of these in turn divided into Act (active) and Pas (passive)
dw
gen
dw
dw
dw gen
gen
dw
gen
dw
gen
dw
dw
Consistency within the metadata layer is an example of an attribute that is regarded
as desirable in any statistics production environment, but that is considered
necessary in the SDWH environment. In the SDWH all metadata items (concepts as
well as physical references) must be uniquely identified, and there must be
one-to-one relationships between identity and definition, and between identity and
name. The concept “Local unit”, for example, must be given an identity and a
definition, and these must be used consistently in the SDWH regardless of source,
context, etc. If a slightly different definition is needed, it must be
given a new identity and a new name.
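The uniqueness rule can be illustrated with a minimal sketch. The class ConceptRegistry, its method names, the identities and the sample definitions below are all hypothetical and serve only to show how one-to-one relationships between identity, name and definition could be enforced.

```python
class ConceptRegistry:
    """Hypothetical in-memory registry; every name here is illustrative."""

    def __init__(self):
        self._by_id = {}              # identity -> (name, definition)
        self._id_by_name = {}         # name -> identity
        self._id_by_definition = {}   # definition -> identity

    def register(self, identity, name, definition):
        # Reject anything that would break the one-to-one relationships
        # identity <-> name and identity <-> definition.
        if identity in self._by_id:
            raise ValueError(f"identity {identity!r} is already registered")
        if name in self._id_by_name:
            raise ValueError(f"name {name!r} already belongs to "
                             f"{self._id_by_name[name]!r}")
        if definition in self._id_by_definition:
            raise ValueError("definition is already registered under "
                             f"{self._id_by_definition[definition]!r}")
        self._by_id[identity] = (name, definition)
        self._id_by_name[name] = identity
        self._id_by_definition[definition] = identity

    def lookup(self, identity):
        return self._by_id[identity]


registry = ConceptRegistry()
# Sample wording only; not an official definition of "Local unit".
registry.register("C001", "Local unit", "Sample definition, first wording")
# A slightly different definition gets a new identity and a new name.
registry.register("C002", "Local unit (variant)", "Sample definition, second wording")
```

Reusing an existing name or definition under a new identity raises an error in this sketch, which is one possible way of keeping the metadata layer consistent.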
In the SDWH it is desirable to be able to analyse data by time series on a low level
of aggregation, or even to perform longitudinal analysis on object level. To support
these functions metadata items should have validity information: “valid from
01-01-2001”, “valid until 31-12-2012”.
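As a sketch of how validity information might be used, the example below selects the metadata items that were valid on a given reference date. The ValidityPeriod class, the function items_valid_on and the two classification versions are assumptions made for illustration; only the two dates come from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ValidityPeriod:
    valid_from: date
    valid_until: Optional[date] = None  # None means "still valid"

    def covers(self, day: date) -> bool:
        if day < self.valid_from:
            return False
        return self.valid_until is None or day <= self.valid_until


def items_valid_on(items, day):
    """Return the identities of all items whose validity period covers `day`."""
    return [identity for identity, period in items.items() if period.covers(day)]


# Hypothetical versions of a classification, keyed by identity.
code_lists = {
    "ACTIVITY_V1": ValidityPeriod(date(2001, 1, 1), date(2012, 12, 31)),
    "ACTIVITY_V2": ValidityPeriod(date(2013, 1, 1)),
}

valid_in_2005 = items_valid_on(code_lists, date(2005, 6, 30))
```

A longitudinal analysis could call such a function once per reference period to find the definitions and code lists that applied at the time.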
In order to be metadata driven the SDWH has higher demands for process metadata,
and it is more likely to have a built-in ability to produce process metadata.
The SDWH is not only a data store, but it is also a system of processes to refine its
data from input to output. These processes need active metadata: automated
processes need formalised process metadata, such as programmes, parameters, etc.,
and manual processes need process metadata such as instructions, scripts, etc.
2.1.1 Minimum metadata requirements for the SDWH
A SDWH without metadata, or with insufficiently comprehensive metadata, cannot
be called a true data warehouse, since its data are not interpretable or
understandable in a reliable or useful way. The more metadata are available, and
the better their quality, the more useful the SDWH becomes and the more reliable
the analyses and conclusions derived from its data. Adding metadata is arguably one
of the most demanding and expensive parts of governing the SDWH. Since budget,
time and human resources always form constraints, requiring complete metadata of
the highest quality is asking for the impossible. In practice, a minimum set of
requirements should be identified, defining what metadata are vital to the SDWH.
The following list of items refers to the metadata subsets discussed in chapter 1.1.3.
Statistical metadata
• Variable name, definition, reference time and source
• Value domain mapped to the variable (particularly important if the value
domain corresponds to a formal classification)
Process metadata
• Load time (date and time when the data item was loaded into the SDWH)
Technical metadata
• Physical location (name/identity of the server, database, column, etc. where
the variable is stored)
• Data type (record layout: length, decimals, etc.)
Authorisation metadata
• Access rights mapped to users, groups and roles
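The minimum set lends itself to an automated completeness check. The sketch below is one possible representation, assuming a hypothetical VariableMetadata record with one field per item in the list above; the field names and the sample values are illustrative.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass
class VariableMetadata:
    # Statistical metadata
    name: Optional[str] = None
    definition: Optional[str] = None
    reference_time: Optional[str] = None
    source: Optional[str] = None
    value_domain: Optional[str] = None
    # Process metadata
    load_time: Optional[datetime] = None
    # Technical metadata
    physical_location: Optional[str] = None
    data_type: Optional[str] = None
    # Access rights mapped to users, groups and roles
    access_rights: Optional[dict] = None


def missing_minimum_metadata(meta: VariableMetadata) -> list:
    """Return the names of the minimum items that are absent or empty."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


# Sample values only; none of them are taken from a real warehouse.
turnover = VariableMetadata(
    name="Turnover",
    definition="Sample definition",
    reference_time="2012",
    source="Business survey",
    value_domain="Non-negative amount",
    load_time=datetime(2013, 9, 6, 12, 0),
    physical_location="server1/sdwh/fact_business/turnover",
    data_type="decimal(15,2)",
)

gaps = missing_minimum_metadata(turnover)
```

In this example only the access rights are missing, so the record would not yet meet the minimum requirements.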
2.2 Metadata and the layered SDWH
In the general discussions on metadata for the statistical production lifecycle several
attempts have been made to link metadata to the generic phases, such as the
GSBPM processes: what metadata are produced during a process, what metadata are
needed to perform a process, and what metadata are forwarded from one process to
the next one.
The GSBPM is applicable to any statistics production, including a SDWH. Hence
the results from, e.g., the METIS group could, and should, be used when discussing
metadata for the SDWH. There are, however, alternative or complementing models
that may be used to describe the specific needs for the SDWH.
During the first phase of this project it was agreed that the figure below conveys a
good description of the processes and data flows that take place in a generic SDWH.
The layered approach, with input of raw data at the bottom and dissemination of
refined data (or statistics) at the top, shows the necessary production steps in a
simplified way. The layered architecture of the S-DWH is elaborated in detail in
S-DWH Business Architecture22 (Deliverable 3.1).
The metadata layer at the left-hand side indicates the necessity of metadata support
from start to finish. Examples of which metadata categories and functionalities are
used in the different layers are found in Documentation of the mapping of the result
of 1.4 on the ‘ideal architecture’ framework23 (Deliverable 1.6).
22 Palma (2013)
23 Ennok (2013)
Figure 3 The SDWH layers
2.2.1 Source layer metadata
The source layer is the entry point to the SDWH regarding data as well as metadata.
Data are collected at various sources outside of the control of the data warehouse,
and they have various origins, spanning from surveys and censuses conducted
within the organisation to administrative registers kept by other organisations.
Hence, the original metadata that accompany the data will vary in contents and
quality, and the possibilities to influence the metadata will vary as well.
The source layer, being the entry point, has the important role of gatekeeper,
making sure that data entered into the SDWH and forwarded to the integration layer
always have matching metadata of at least the agreed minimum extent and quality.
The metadata may either be available already, having been loaded earlier (e.g. with a
previous periodic delivery), or be supplied with the current data delivery.
The main responsibilities for this layer include:
• to make sure that all relevant data are collected from the sources, including
their metadata,
• to add missing metadata and to complete or correct deficient metadata,
• to deliver data and metadata in the best possible formats to the integration
layer.
The metadata from the sources must satisfy the minimum requirements described in
chapter 2.1.1, but should if possible include more comprehensive metadata, such as
quality information.
In the source layer the foundation is laid for metadata to be used in the next layers.
Consistency in definitions and standardisation of code lists are examples of areas
where efforts should be made to influence the sources in order to build the strongest
possible metadata foundation.
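The gatekeeper role can be sketched as a simple admission check. The example assumes a hypothetical function admit_delivery and, for brevity, treats only four of the minimum items as required; metadata supplied with the current delivery are combined with metadata loaded earlier before the check is made.

```python
# Assumed subset of the minimum requirements, used only for this illustration.
REQUIRED_ITEMS = {"name", "definition", "reference_time", "source"}


def admit_delivery(variables, supplied_metadata, known_metadata):
    """Split the variables of a delivery into accepted and rejected ones.

    A variable is accepted when the metadata loaded earlier, combined with
    the metadata supplied with the current delivery, cover all required items.
    """
    accepted, rejected = [], []
    for variable in variables:
        meta = dict(known_metadata.get(variable, {}))
        supplied = supplied_metadata.get(variable, {})
        meta.update({key: value for key, value in supplied.items() if value})
        missing = REQUIRED_ITEMS - {key for key, value in meta.items() if value}
        if missing:
            rejected.append((variable, sorted(missing)))
        else:
            accepted.append(variable)
            known_metadata[variable] = meta  # available for later deliveries
    return accepted, rejected


# Metadata loaded with a previous periodic delivery (sample values).
known = {"employees": {"name": "Employees", "definition": "Sample definition",
                       "reference_time": "2012", "source": "Register"}}
# Metadata supplied with the current delivery (incomplete on purpose).
supplied = {"turnover": {"name": "Turnover", "definition": "Sample definition"}}

accepted, rejected = admit_delivery(["employees", "turnover"], supplied, known)
```

Rejected variables are returned together with the items they lack, which gives the source layer a list of metadata to add or complete before forwarding the data.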
2.2.2 Integration layer metadata
The efficiency of the data linking and similar tasks carried out in the integration
layer will depend on the quality of the metadata carried forward from the source
layer.
In this layer data are extracted from the sources, transformed as necessary, and
loaded into their places in the data warehouse (ETL operations). These tasks need
active metadata: descriptions and operator manuals, as well as the derivation rules
being applied, i.e. scripts, parameters and program code for the tools used.
The ETL operations will also create several types of metadata:
• Structural process metadata
  o Automatically generated formalised information, log data on
    performance, errors, etc.
  o Manually added, more or less formalised information
• Structural statistical metadata
  o Automatically generated additions to, or new versions of, code lists,
    linkage keys, etc.
  o Manually added additions, corrections and updates to the new
    versions
• Reference metadata
  o Manually added quality information, process information, etc.,
    regarding a dataset, or a new version
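Automatically generated process metadata can be as simple as a log entry written for every ETL step. The following sketch assumes a hypothetical helper logged_step and an in-memory list standing in for a process metadata store; the step shown is a made-up cleaning rule.

```python
import time
from datetime import datetime, timezone

process_log = []  # stands in for a process metadata store


def logged_step(step_name, func, rows):
    """Run one ETL step and record formalised process metadata about it."""
    entry = {"step": step_name,
             "started": datetime.now(timezone.utc).isoformat(),
             "rows_in": len(rows), "rows_out": 0, "error": None}
    start = time.perf_counter()
    try:
        result = func(rows)
        entry["rows_out"] = len(result)
        return result
    except Exception as exc:
        entry["error"] = repr(exc)
        raise
    finally:
        # Written whether the step succeeded or failed.
        entry["seconds"] = time.perf_counter() - start
        process_log.append(entry)


def drop_missing_turnover(rows):
    """Made-up transformation: discard records without a turnover value."""
    return [row for row in rows if row.get("turnover") is not None]


raw = [{"id": 1, "turnover": 100}, {"id": 2, "turnover": None}]
clean = logged_step("drop_missing_turnover", drop_missing_turnover, raw)
```

Each entry records when the step started, how long it took, how many records went in and out, and any error, which corresponds to the log data on performance and errors mentioned in the list above.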
2.2.3 Interpretation and data analysis layer metadata
The interpretation and data analysis layer stores cleaned, versioned and
well-structured final micro data. Once a new dataset or a new version has been loaded,
few updates are made to the data in this layer. Consequently, metadata are normally
only added, with few or no changes being made.
On loading data to this layer the following additions should be made to metadata:
• Structural process metadata
  o Automatically generated log data
• Structural statistical metadata
  o New versions of code lists, etc.
• Reference metadata
  o Optional additions to quality information, process information, etc.
Relatively few users will access this layer, but those who do will need metadata to
perform their tasks:
• Structural process metadata
  o Estimation rules, descriptions, code, etc.
  o Confidentiality rules
• Structural statistical metadata
  o Variable definitions
  o Derivation rules
• Reference metadata
  o Quality information, process information, etc.
2.2.4 Data access layer metadata
Loading data into the access layer means reorganising data from the analysis layer
by derivation or aggregation into relevant stores, or data marts. This will require
metadata that describe and support the process itself (derivation and aggregation
rules), but also metadata that describe the reorganised data.
Necessary metadata to load the data access layer include:
• Structural process metadata
  o Derivation and aggregation rules
• Structural technical metadata
  o New physical references, etc.
Using the data access layer will require:
• Structural statistical metadata
  o Optional additional definitions of derived entities or attributes,
    aggregates, etc.
• Structural technical metadata
  o Physical references, etc.
• Reference metadata
  o Information on sources, links to source quality information
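A short sketch can show how an aggregation rule held as metadata might drive the loading of a data mart and at the same time produce a description of the reorganised data. The layout of the rule dictionary, the function build_mart and the sample records are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical aggregation rule, held as structural process metadata.
rule = {"target": "turnover_by_region", "group_by": "region",
        "measure": "turnover", "function": "sum"}

FUNCTIONS = {"sum": sum, "max": max, "min": min}


def build_mart(micro_data, rule):
    """Aggregate micro data as described by `rule` and describe the result."""
    groups = defaultdict(list)
    for row in micro_data:
        groups[row[rule["group_by"]]].append(row[rule["measure"]])
    aggregate = FUNCTIONS[rule["function"]]
    mart = {key: aggregate(values) for key, values in groups.items()}
    # Metadata describing the reorganised data.
    description = {"name": rule["target"],
                   "derived_from": rule["measure"],
                   "aggregation": f'{rule["function"]} by {rule["group_by"]}'}
    return mart, description


micro = [{"region": "North", "turnover": 100},
         {"region": "North", "turnover": 50},
         {"region": "South", "turnover": 70}]

mart, description = build_mart(micro, rule)
```

Because the rule is data rather than program code, changing the aggregation only requires a change to the metadata in this sketch.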
2.2.5 Summary of SDWH layers and metadata categories
The table below gives a rough overview of where in the SDWH layers three
important metadata categories are created (indicated by c) and used (u).
Layer            Statistical   Process    Quality
                 metadata      metadata   metadata
Data access      u             cu         u
Interpretation   cu            cu         cu
Integration      cu            cu         c
Source           c             c          c
The table shows that the lower layers mainly provide metadata but cannot make
much use of them, while in the higher layers metadata are used but relatively few
are added. This agrees well with the rule that metadata should be captured as
close to the source, and as early in the process, as possible.
The SDWH architecture should make it possible to trace any changes made to data
as well as metadata by using process metadata and versioning both data and
metadata. Thus, a metadata item is normally never changed, updated or replaced.
Instead, a new version is created when necessary, which means that there will
always be a possibility to identify which metadata was considered correct at a
certain point in time even if it has later been revised.
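The versioning principle can be sketched as an append-only history. The class VersionedItem and its methods revise and as_of are hypothetical names chosen for this example, and the contents are placeholder strings.

```python
from datetime import datetime


class VersionedItem:
    """Append-only history of one metadata item (illustrative only)."""

    def __init__(self):
        self._versions = []  # (registered_at, content), oldest first

    def revise(self, content, registered_at):
        # Existing versions are never modified; a revision appends a new one.
        if self._versions and registered_at <= self._versions[-1][0]:
            raise ValueError("a new version must be later than the previous one")
        self._versions.append((registered_at, content))

    def as_of(self, moment):
        """Return the content that was considered correct at `moment`."""
        current = None
        for registered_at, content in self._versions:
            if registered_at <= moment:
                current = content
        return current


definition = VersionedItem()
definition.revise("first wording", datetime(2011, 1, 1))
definition.revise("revised wording", datetime(2013, 3, 1))

in_2012 = definition.as_of(datetime(2012, 6, 1))
```

Since nothing is overwritten, the wording that applied in 2012 remains retrievable after the 2013 revision.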
A more detailed analysis of the metadata subsets and their use in the SDWH layers
can be found in Definition of the functionalities of a metadata system to facilitate
and support the operation of the S-DWH24 (Deliverable 1.4).
24 Ennok, Lundell, Bowler, De Giorgi, Kulla (2013)
2.3 Organising SDWH metadata
This project has defined the SDWH as “a central statistical data store, regardless of
the data’s source” 25. Although it is not explicitly expressed in the definition, the
SDWH should be understood as a logically coherent data store, but not necessarily
as one single physical unit.
The logical coherence means that it must be possible to uniquely identify a data
item throughout the data warehouse, to trace it on its way through the logical layers
from input to dissemination, and to follow it longitudinally. A user must be able to
search the entire metadata layer and, if permitted, to access data in the logical
SDWH without actual knowledge of their physical locations.
From the requirements on data follow similar demands on metadata: all data in the
SDWH must have corresponding metadata, all metadata items must be uniquely
identifiable, metadata should be versioned to enable longitudinal use, etc., and
metadata must provide “live” links to the physical data. To achieve this it must be
possible, and should be easy, to monitor and govern the metadata layer. This
requires the metadata layer to have comprehensive registry functionality (according
to definition 4.2) as well as repository functions (definition 4.3). The registry
functions are needed to control data consistency, to make the data contents
searchable, etc., and the repository functions are needed to be able to operationalise
data access.
After searching the metadata registry/repository for a concept and finding it, a user
must be able to retrieve the corresponding data (a case of active metadata),
provided that the authorisation metadata allow that user to do so.
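The search-then-retrieve sequence can be sketched as follows. The REGISTRY and STORAGE dictionaries, the role names and the functions search and retrieve are all assumptions made for illustration; in a real SDWH the physical references would point to actual servers, databases and columns.

```python
# Hypothetical registry entries: concept -> physical location and allowed roles.
REGISTRY = {
    "Turnover": {"location": ("fact_business", "turnover"),
                 "roles": {"analyst", "admin"}},
    "Local unit": {"location": ("dim_unit", "unit_id"),
                   "roles": {"admin"}},
}

# Stand-in for the physical data stores, keyed by (table, column).
STORAGE = {
    ("fact_business", "turnover"): [100, 50, 70],
    ("dim_unit", "unit_id"): ["U1", "U2", "U3"],
}


def search(term):
    """Find concepts whose name contains `term` (case-insensitive)."""
    return [name for name in REGISTRY if term.lower() in name.lower()]


def retrieve(concept, user_roles):
    """Follow the metadata link to the data, if the user's roles permit it."""
    entry = REGISTRY[concept]
    if not entry["roles"] & set(user_roles):
        raise PermissionError(f"no access to {concept!r}")
    return STORAGE[entry["location"]]


hits = search("turn")
values = retrieve(hits[0], ["analyst"])
```

The caller never needs to know where the data are stored; the physical location is resolved from the metadata, and the access check is made against the roles recorded there.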
Whether to build one or more repositories will depend on local circumstances. In a
decentralised or geographically dispersed organisation, building one single metadata
repository may be technically difficult, or at least less attractive. The
recommendation from a functional and governance point of view is a solution with
one single installation that covers both registry and repository functions.
2.4 SDWH metadata governance
Metadata’s vital role in the SDWH means that all metadata must be reliable at all
times. This calls for well-organised management of metadata and governance of the
registry. The governing role ranges from acting as watchdog and enforcer
to providing advice and inspiration. The balance between these tasks may vary,
depending on organisational structures and several other factors. If much of the
metadata is computer-generated (automated) there may be a need for regular checks,
crosschecks, reviews or follow-up activities. If, on the other hand, much of the
metadata is entered manually, possibly in a decentralised environment, there will
probably be more need for advice as well as checks.
The SDWH is assumed to contain complete and correct metadata available to users
and processes when needed. The general rule says that metadata should be collected
as close to the source, and as early in the process as possible. In an ideal situation
every delivery of new data to the SDWH from external or internal sources should be
accompanied by a complete set of corresponding metadata; data and metadata
should be loaded and updated in parallel. Similarly, every time a new variable is
derived and added to the SDWH the metadata repository should be updated
immediately. In practice this will of course not always be the case – metadata will
be missing, incomplete or incorrect. The metadata repository may contain items that
have not yet been approved for the metadata registry and hence may not be linkable
or searchable.
25 Berglund, Palma. Functional Architecture of the S-DWH, 2013
The metadata model used in the SDWH should be flexible enough not to require
completeness and correctness in all details and in every situation. It should allow for
variations, acknowledging that some metadata are more important than others. For a
data store to be called a SDWH, however, it must meet the minimum metadata
requirements stated in section 2.1.1.
More detailed recommendations on metadata governance for the SDWH are given
in Recommendations and guidelines on the governance of metadata management in
the S-DWH26 (Deliverable 1.5).
2.5 The SDWH and metadata standards
As described above, in section 1.3, there is a wide variety of metadata standards
applicable to statistics production and to data warehousing. New standards will
emerge, as will new versions of the existing ones. The SDWH should be able to
handle these changes without major revisions of its metadata layer or rebuilding
of any of its other layers. One method to handle these future changes is to
implement the so-called “standards-agnostic” approach to the metadata
repository.
Standards-agnostic means that the standards themselves are represented as metadata
objects within the repository.27
• Every metadata object describes which versions of which standards are
supported
• Transformation services between standards and versions are also registered
resources
• Introducing new standards or new versions of standards has a minimal
impact on existing applications
• “Standard” does not necessarily refer to public standards
Implementing the standards-agnostic approach will require use of a metadata
repository that in itself is standards-agnostic, not organised according to one
specific standard. This will mean focusing on the standardisation itself, not on
which standard is used.
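The standards-agnostic idea can be sketched with a few dictionaries. Everything below is hypothetical: the standard names "InHouse" and "Exchange", the field names and the functions register_transformation and export are placeholders, and the transformation is a toy renaming rather than a mapping between any real standards.

```python
# Standards and their versions are themselves entries in the repository.
standards = {("InHouse", "1")}

# Each metadata object holds its representation per supported standard version.
objects = {"Turnover": {("InHouse", "1"): {"var_name": "Turnover"}}}

# Transformation services are registered resources, keyed by (source, target).
transformations = {}


def register_transformation(source, target, func):
    if source not in standards or target not in standards:
        raise KeyError("both standards must be registered first")
    transformations[(source, target)] = func


def export(name, wanted):
    """Return object `name` in the `wanted` standard, transforming if needed."""
    representations = objects[name]
    if wanted in representations:
        return representations[wanted]
    for held, payload in representations.items():
        func = transformations.get((held, wanted))
        if func is not None:
            return func(payload)
    raise LookupError(f"no route from the stored standards to {wanted!r}")


# Introducing a new standard touches only the registries, not the objects.
standards.add(("Exchange", "1"))
register_transformation(("InHouse", "1"), ("Exchange", "1"),
                        lambda payload: {"variable_name": payload["var_name"]})

exported = export("Turnover", ("Exchange", "1"))
```

Existing objects and the export function stay unchanged when the new standard is added, which reflects the minimal impact on existing applications that the approach aims for.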
The main purpose of the SDWH is to support efficient statistics production. This
means that all current metadata standards relevant for this purpose should be
supported by the SDWH. This recommendation includes following the MDR
(ISO/IEC 11179) standard for the metadata registry/repository, and making sure that
data and metadata can be packaged in SDMX format for data exchange.
26 De Giorgi, Ennok, Lindelauf (2013)
27 Arofan Gregory, Metadata Technology Ltd., Workshop on Metadata Standards, 07/12/2011
Further information on metadata standards for the Statistical Data Warehouse can
be found in Overview of and recommendations on the use of metadata models28
(Deliverable 1.3).
28 Dressen, Lindelauf, Goossens (2013)
References
Berglund, Björn; Palma, Antonio Laureti. Functional Architecture of the S-DWH
(Architectural framework), 2013. (Deliverable 3.3) [Link (.doc)]
Bowler, Colin; Lindelauf, Michel; Dressen, Jos. Recommendations on the Impact of
Metadata Quality in the Statistical Data Warehouse, 2013. (Deliverable 1.2)
[Link (.doc)]
Common Metadata Framework, Part B – Metadata Concepts, Standards, Models
and Registries. UNECE [Link]
Common Warehouse Metamodel – ISO/IEC 19504. Object Management Group,
2003. [Link 1] [Link 2]
Data Documentation Initiative (DDI) Specification. DDI Alliance, 2009. [Link 1]
[Link 2]
De Giorgi, Viviana; Ennok, Maia; Lindelauf, Michel. Recommendations and
guidelines on the governance of metadata management in the S-DWH, 2013.
(Deliverable 1.5) [Link (.doc)]
Dressen, Jos; Lindelauf, Michel; Goossens, Harry. Overview of and
recommendations on the use of metadata models, 2013. (Deliverable 1.3)
[Link (.doc)] [Appendix (.xls)]
Ennok, Maia. Documentation of the mapping of the result of 1.4 on the ‘ideal
architecture’ framework, 2013. (Deliverable 1.6) [Link (.doc)]
Ennok, Maia; Lundell, Lars-Göran; Bowler, Colin; De Giorgi, Viviana; Kulla, Kaia.
Definition of the functionalities of a metadata system to facilitate and support the
operation of the S-DWH, 2013. (Deliverable 1.4) [Link (.doc)]
Gregory, Arofan. The Standards Working Together. Presentation at the Data
without Boundaries Workshop, Gothenburg 2011 [Link (.pdf)]
Inmon, William H. Metadata in the Data Warehouse, (White Paper), 2000
[Link (.pdf)]
Kimball, Ralph. The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley,
2008
Metadata Common Vocabulary (MCV). SDMX Consortium, 2009 [Link 1 (.pdf)]
[Link 2]
Metadata Registries (MDR) – ISO/IEC 11179. ISO (International Organization for
Standardization) [Link 1] [Link 2]
Neuchâtel Terminology Model. Part II: Variables and related concepts, object types
and their attributes. Version 1, 2006 [Link (.pdf)]
Palma, Antonio Laureti. S-DWH Business Architecture, 2013. (Deliverable 3.1)
[Link (.doc)]
Statistical Data and Metadata eXchange (SDMX), Standards [Link 1] [Link 2]
Vassiliadis, Panos. Data Warehouse Metadata, Encyclopedia of Database Systems,
Springer, 2009
Link to all deliverables 1.1 – 1.6: http://www.cros-portal.eu/content/deliverables-13
Link to all deliverables 2.1 – 2.8: http://www.cros-portal.eu/content/deliverables-10
Link to all deliverables 3.1 – 3.5: http://www.cros-portal.eu/content/deliverables-8
Annex 1
1(3)
Metadata related terms
Sources
 Wikipedia [1] Direct quotations from Wikipedia and from its sources, http://en.wikipedia.org/wiki/Metadata, http://en.wikipedia.org/wiki/Data_warehouse
 ISO [2] (International Standards Organization, ISO/IEC 11179 Metadata registries (MDR)), http://metadata-stds.org/11179/
 NISO [3] (National Information Standards Organization), Understanding Metadata, http://www.niso.org/publications/press/UnderstandingMetadata.pdf.
 Eurostat Metadata Common Vocabulary, MCV (2009) [4], http://sdmx.org/wp-content/uploads/2009/01/04_sdmx_cog_annex_4_mcv_2009.pdf
 Eurostat’s Concepts and Definitions Database, CODED [4,5], http://ec.europa.eu/eurostat/ramon/index.cfm?TargetUrl=DSP_PUB_WELC
 UNECE, Terminology on Statistical Metadata (Conference of European Statisticians, 2000) [5], http://www.unece.org/stats/publications/53metadaterminology.pdf
 UNECE, The Common Metadata Framework (2009-2011) [5], http://www1.unece.org/stat/platform/display/metis/The+Common+Metadata+Framework
 OECD, Glossary of Statistical Terms [5], http://stats.oecd.org/glossary/
Term
Metadata
Wikipedia
[1]
Data providing information
about one or more aspects of the
data, such as:
 Means of creation of the data
 Purpose of the data
 Time and date of creation
 Creator or author of data
 Placement on a computer
network where the data was
created
 Standards use
ISO
[2]
Data that defines and
describes other data
Statistical
metadata
Data
Qualitative or quantitative
attributes of a variable or set of
variables. Data are typically the
results of measurements [...] or
observations [...].
Re-interpretable
representation of
information in a
formalized manner
suitable for
communication,
interpretation, or
processing
NISO
[3]
Structured information that
describes, explains, locates, or
otherwise makes it easier to
retrieve, use, or manage an
information resource. Metadata is
often called data about data or
information about information.
Metadata can describe resources at
any level of aggregation. It can
describe a collection, a single
resource, or a component part of a
larger resource
Metadata Common Vocabulary
[4]
Data that defines and describes other
data.
Terminology, Framework,
Glossary [5]
Data and other documentation
that describes objects in a
formalized way
Metadata are data that describe other
data, and data become metadata when
they are used in this way.
Data about statistical data.
· Comprise data and other
documentation that describe objects in a
formalised way.
· Provide information on data and about
processes of producing and using data.
Characteristics or information, usually
numerical, that are collected through
observation.
Metadata describing statistical
data
The physical representation of
information in a manner suitable
for communication,
interpretation, or processing by
human beings or by automatic
means.
Annex 1
2(3)
Term
Wikipedia
[1]
ISO
[2]
NISO
[3]
Statistical
data
Structural
metadata
Reference
metadata
Describe the structure of
computer systems such as
tables, columns and indexes.
Bretherton & Singley
(Technical) Defines the objects
and processes from a technical
perspective [...] like tables,
fields, data types, indexes [...]
Kimball
(Guide) Help humans find
specific items.
Bretherton & Singley
(Business) Describes the
contents [...] in user accessible
terms [...] what data you have,
where it comes from, what it
means, [...] Kimball
Indicate how compound objects are
put together, e.g., how pages are
ordered to form chapters
(Descriptive) Describe a resource
for purposes such as discovery and
identification. It can include
elements such as title, abstract,
author, and keywords
Metadata
item
Terminology, Framework,
Glossary [5]
Data that are collected and/ or
generated by statistics in process
of statistical observations or
statistical data processing
An instance of a metadata object. It has
associated attributes. It can have a
distinct status: mandatory, conditional
and optional.
A group of characters describing
the data and treated as metadata
unit
Provide information to help manage
a resource, such as when and how it
was created, file type and other
technical information, and who can
access it. There are several subsets
[...]:
− Rights management metadata,
which deals with intellectual
property rights, and
− Preservation metadata, which
contains information needed to
archive and preserve a resource.
Administrative metadata
Process
metadata
Metadata Common Vocabulary
[4]
Data derived from either statistical or
non-statistical sources, which are used
in the process of producing statistical
products
Act as identifiers and descriptors of the
data. They are needed to identify, use,
and process data matrixes and data
cubes, e.g. names of columns or
dimensions of statistical cubes.
Structural metadata must be associated
with the statistical data, otherwise it
becomes impossible to identify, retrieve
and navigate the data.
Describe the contents and the quality of
the statistical data. Should include
conceptual, methodological and quality
metadata
Describes the results of various
operations [...] start time, end
time, CPU seconds used [...]
Kimball
Instance of a metadata
object
Annex 1
3(3)
Term
Metadata
usage
Wikipedia
[1]
Data virtualization, statistics
and census services, data
warehousing
ISO
[2]
NISO
[3]
Discovery and organisation of
electronic resources,
interoperability, integration,
identification, archiving.
Metadata Common Vocabulary
[4]
Include
 the algorithms as such behind
statistical procedures,
including procedures for
statistical analysis;
 descriptions of the
algorithms
Algorithmic
metadata
Metadata
layer
Metadata
registry
Metadata
repository
Terminology, Framework,
Glossary [5]
[data warehouse] The data
dictionary – This is usually
more detailed than an
operational system data
dictionary.
A central location in an
organization where metadata
definitions are stored and
maintained in a controlled
method. Metadata registries are
used whenever data must be
used consistently within an
organization or group of
organizations.
A data dictionary [...] a
"centralized repository of
information about data such as
meaning, relationships to other
data, origin, usage, and format."
A layer in the reference model for
standardisation in statistics used to
denote the set of attributes related to
statistical metainformation
Information system for
registering metadata
(MDR)
Provides information on the
definition, origin, source, and
location of data [...] at many levels,
including schemes, usage profiles,
metadata elements, and code lists
for element values. It provides an
integrating resource for legacy data,
acts as a lookup tool for designers
of new databases, and documents
each data element.
Information system for registering
metadata. Registration accomplishes
three main goals: identification,
provenance, and monitoring quality.
[...] It manages the semantics of data.
A logically central statistical metadata
repository that allows for querying,
editing, and managing of metadata.
Such a system provides a mechanism
for looking up information about
statistical products as well as their
design, development, and analysis.
(Metadata holding) A logical or
physical set of metadata (e.g.
database) stored together with its
description (e.g. schema)