DWH-SGA2-WP1 - 1.2 Recommendations on the Impact of Metadata

Recommendations on the Impact of Metadata Quality in
the Statistical Data Warehouse
Colin Bowler, Michel Lindelauf, Jos
In a Statistical Data Warehouse (SDWH), the data is only as good as the metadata defining and
describing it. Consequently, it is very important to carry out not just data quality checks, but metadata
quality assurance also.
Quality of metadata (and data) can sometimes be difficult to define in an unambiguous manner, and in
the context of a SDWH this is no different.
Here, we are specifically interested in the quality of metadata, and this document is intended to
provide recommendations as to how to approach metadata quality assurance within the SDWH.
2. What is Quality?
So what is the definition of ‘Quality’?
A general definition which can be used is ‘fitness for use, or purpose’.
ISO 9000:2005 defines quality as the ‘degree to which a set of inherent characteristics fulfils
requirements’.
‘Fitness for use’ is a relative definition, allowing for various perspectives on what constitutes quality,
depending on the intended uses of the metadata (and indeed the intended uses of the data to which
the metadata refers).
Also, the degree of quality indicates that there will be a set of acceptable quality levels associated
with the characteristics, or dimensions, which the metadata must satisfy in order to be fit for use.
3. Quality measure or quality indicator?
Quality measures are defined as items that directly measure a particular aspect of quality. For
example, the time lag from the reference date to the release of the output is a direct measure.
However, in practice many quality measures can be difficult or costly to calculate. Instead, the use of
quality indicators can give an insight into quality.
Quality indicators usually consist of information that is a by-product of the statistical process. They do
not measure quality directly but can provide enough information to provide an insight into quality
(ONS – 2007).
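The direct measure mentioned above, the time lag from the reference date to the release of the output, might be calculated along the following lines. This is only a minimal sketch: the function name and the example dates are hypothetical and are not taken from any SDWH system.

```python
from datetime import date


def release_lag_days(reference_date: date, release_date: date) -> int:
    """Return the number of days between the reference date and the release."""
    if release_date < reference_date:
        raise ValueError("release date precedes reference date")
    return (release_date - reference_date).days


# Hypothetical output: reference period ends 31 March 2012, released 15 May 2012.
lag = release_lag_days(date(2012, 3, 31), date(2012, 5, 15))
print(lag)  # 45
```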
4. Types of Statistical Metadata
Lundell (2012) defines three main metadata categories in use in the SDWH, and also states that any
item of metadata will normally be classifiable under each of these categories:
• Active/Passive
• Formalised/Free-form
• Structural/Reference
Active metadata enables operational use, driving the processes within the S-DWH (e.g.
scripts/triggers to carry out activities on the data/metadata), whereas Passive metadata does not act
upon the data/metadata within the system, e.g. quality reports, documentation etc.
Formalised metadata would have some form of structure, e.g. classifications/code lists, whereas
free-form metadata might contain descriptive information, as in quality reports for example.
Structural metadata is generally thought of (especially in the statistical data world) as metadata
which defines data, and generally helps the user ‘find, identify, access and utilise the data’ – for
example, classification codes. Reference metadata, by contrast, describes the content and quality of
the data, and is most usually associated with quality reports.
All of these categories of metadata could be subject to quality measurement, except perhaps the
quality report reference metadata which is itself a report on quality measurement.
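As an illustration of the three categories, each metadata item could be tagged along all three axes at once. The sketch below is hypothetical: the class names, the example items and, where the text above is silent, the category assigned to an item are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Activity(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class Form(Enum):
    FORMALISED = "formalised"
    FREE_FORM = "free-form"


class Role(Enum):
    STRUCTURAL = "structural"
    REFERENCE = "reference"


@dataclass(frozen=True)
class MetadataItem:
    name: str
    activity: Activity
    form: Form
    role: Role


items = [
    # Formalised and structural per the text; "passive" is an assumption.
    MetadataItem("classification code list", Activity.PASSIVE, Form.FORMALISED, Role.STRUCTURAL),
    # Active per the text; "formalised" and "structural" are assumptions.
    MetadataItem("imputation script", Activity.ACTIVE, Form.FORMALISED, Role.STRUCTURAL),
    # Passive, free-form and reference per the text.
    MetadataItem("quality report", Activity.PASSIVE, Form.FREE_FORM, Role.REFERENCE),
]


def select(candidates, **criteria):
    """Return the names of items matching every given category value."""
    return [
        item.name
        for item in candidates
        if all(getattr(item, axis) is value for axis, value in criteria.items())
    ]


print(select(items, activity=Activity.ACTIVE))
```

Tagging every item on all three axes makes it straightforward to decide which quality checks apply to which items.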
5. Quality in International Standards for Metadata
There are some international standards and statistical models which apply to, or are concerned with
metadata, and quality characteristics are mentioned in some of them. Appendix B and Appendix C
provide more detail of specific standards available.
The ISO 11179 standard pertains to Metadata Registries (MDR). It has the data element as its
fundamental concept, and is concerned with the semantics around metadata definitions.
Lundell (2012) defines a metadata registry as “a central location in an organization
where metadata definitions are stored and maintained in a controlled method.”
ISO 11179 states that the main purposes of monitoring metadata quality are:
Monitoring adherence to rules for providing metadata for each attribute
Monitoring adherence to conventions for forming definitions, creating names, and performing
Determining whether an administered item still has relevance
Determining the similarity of related administered items and harmonizing their differences
Determining whether it is possible to ever get higher quality metadata for some administered items
Within an MDR, quality is monitored through the use of a registration status. The status records the
level of quality for each administered item (i.e. an administered item’s level of conformance to the
required standard). In order of increasing quality, the levels are Candidate, Recorded, Qualified,
Standard and Preferred Standard.
This is a rigorous evaluation process, and could be applied to different elements of metadata
which are used for evaluation of specific quality dimensions, as appropriate to the scenario or use
case (see below).
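The ordering of registration statuses lends itself to a simple ordered enumeration. The following is a sketch only: the numeric values and the helper function are assumptions chosen to preserve the order of increasing quality, and are not defined by ISO 11179.

```python
from enum import IntEnum


class RegistrationStatus(IntEnum):
    """Registration statuses in increasing order of quality (values are illustrative)."""
    CANDIDATE = 1
    RECORDED = 2
    QUALIFIED = 3
    STANDARD = 4
    PREFERRED_STANDARD = 5


def meets_requirement(status: RegistrationStatus, required: RegistrationStatus) -> bool:
    """Return True when an administered item's status is at least the required level."""
    return status >= required


print(meets_requirement(RegistrationStatus.QUALIFIED, RegistrationStatus.RECORDED))  # True
print(meets_requirement(RegistrationStatus.CANDIDATE, RegistrationStatus.STANDARD))  # False
```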
6. Dimensions, or Characteristics, of Metadata Quality
When examining the dimensions to be used when assessing quality in the context of statistical
processing, there are many available.
The European Statistical System (ESS) specifies dimensions relating to (see Appendix A):
Relevance
Accuracy
Timeliness and Punctuality
Accessibility and Clarity
Comparability
Coherence
Johanis (2002) suggests a similar set of dimensions, originating from Statistics Canada’s Quality
Assurance Framework (QAF) (2002) and defined in Appendix A:
Relevance
Accuracy
Timeliness
Accessibility
Interpretability
Coherence
Whilst these are seen as ‘static’ quality dimensions, the QAF also defines some complementary
quality aspects which are seen as ‘dynamic’:
Bruce & Hillman (2004), discussing metadata quality within the digital library context, suggest 7
similar dimensions, with an additional one covering ‘Provenance’:
Conformance to expectations
Logical consistency and coherence
From Daas & van Nederpelt (2010), the dimensions thought appropriate to metadata in the
context of ‘secondary data sources’ (i.e. mainly non-survey sources) are:
Clarity (encompassing coherence)
Comparability (encompassing linkability, replaceability and uniqueness)
Completeness (encompassing coverage, detailedness, availability, relevance, selectivity
and size)
Correctness (encompassing accuracy, authenticity, and reliability)
Timeliness (encompassing punctuality)
Daas & Ossen (2011) propose that when evaluating metadata quality of secondary data sources, the
use of ‘hyperdimensions’ is appropriate. These are where several metadata quality characteristics are
grouped together to give an overall quality assessment for a data source.
So which set of dimensions do we use when assessing metadata quality throughout the SDWH?
There does not seem to be any conclusive guidance around this specific issue.
Most sets of dimensions quoted in statistical quality frameworks appear to be aimed specifically at
statistical outputs from a data perspective, rather than metadata. When we examine the detail of the
dimensions, we can see that some do touch on metadata. For example, assessing Timeliness involves
comparing the period to which the data pertain with the period for which data are required, and the
period information itself would be considered metadata. However, measurement of the Timeliness
aspect is still a quality attribute which relates to the data itself rather than the metadata.
Examining the suitability of all of the dimensions in all of the frameworks mentioned, we can come up
with a list of dimensions, and associated descriptions, which is more specific to metadata rather than
statistical outputs:
Relevance - the degree to which statistical metadata meet current and potential user needs
Accuracy – the degree of closeness of descriptive metadata to the true value of the metadata
Accessibility – a measure of the ease with which users are able to access metadata in the
SDWH. This might mean easy read access for external users via a website, and might also
mean restricted access for users updating metadata within the SDWH. [i.e. the
dimension might have different descriptions across different layers]
Comparability – the degree to which metadata can be compared over time and domain.
[Example is classification, for which comparability would need to be maintained. This
dimension would also be needed to enable data linking.]
Coherence – the degree to which statistical metadata enables the bringing together of
statistical information from different sources within a broad analytical framework and over
time. The use of standard classifications and concepts promotes coherence.
Uniqueness – the degree to which a metadata item can be uniquely identified, named and
Stability – the measure of how metadata remains stable over time, where appropriate
Completeness – the degree to which metadata items are present for statistical data
Interpretability – a measure of the availability of the supplementary information necessary to
interpret and utilize it appropriately. This information normally covers the underlying concepts,
variables and classifications used.
Some examples of the use of these dimensions are given in Section 7 below.
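Two of the dimensions listed above, Completeness and Uniqueness, are relatively easy to turn into indicators. The sketch below assumes that metadata items are held as simple dictionaries; the required attribute names and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical set of attributes that every metadata item is expected to carry.
REQUIRED_ATTRIBUTES = ("name", "definition", "classification")


def completeness(records, required=REQUIRED_ATTRIBUTES):
    """Return the share of required attributes that are present and non-empty."""
    if not records or not required:
        raise ValueError("need at least one record and one required attribute")
    present = sum(
        1
        for record in records
        for attribute in required
        if str(record.get(attribute) or "").strip()
    )
    return present / (len(records) * len(required))


def duplicate_names(records):
    """Return names used by more than one item, ignoring case and outer spaces."""
    counts = Counter(str(record.get("name", "")).strip().lower() for record in records)
    return sorted(name for name, count in counts.items() if name and count > 1)


records = [
    {"name": "Turnover", "definition": "Sales excluding VAT", "classification": "NACE"},
    {"name": "turnover", "definition": "", "classification": "NACE"},
    {"name": "Employees", "definition": "Head count", "classification": None},
]

print(round(completeness(records), 2))
print(duplicate_names(records))
```

These are quality indicators rather than direct measures: a high completeness ratio says that attributes are filled in, not that their contents are accurate.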
7. Application of Quality Characteristics to Metadata in the Layers
The importance of the various quality characteristics when assessing the quality of different metadata
will vary depending upon a set of criteria which includes (but is not necessarily limited to):
(1) the layer of the SDWH in which the evaluation needs to take place;
(2) the source of the metadata (e.g. it may accompany the data provided/collected, or may be
entered separately);
(3) the use to which the data associated with the metadata is to be put; and
(4) the functionality of the metadata in question.
Incidentally, the mention of ‘layers’ in (1) above refers to the four functional layers identified in the
description of the SDWH business architecture by Laureti Palma (2012). The layers are defined as:
I. Source layer – where all activities related to storing and managing internal or external data
sources are located.
II. Integration layer – where all operational activities needed for the statistical production
processes are carried out.
III. Interpretation and data analysis layer – intended specifically for statisticians and domain
experts; it enables data analysis, data mining, and support for designing production processes or
re-using data.
IV. Access layer – for the final presentation, dissemination, and delivery of the information sought.
Following are some examples of how metadata quality dimensions may be applied in the layers:
Source Layer
When judging whether specific administrative data should be used as a source for a particular
analysis or statistical purpose, an assessment of the relevance of data concepts, definitions and
classification metadata of the administrative populations and variables will determine the potential
usage of the associated data. Whereas a statistical institution can adjust the concepts, definitions and
classifications used in its own surveys to meet user needs, the institution usually has little or no
influence over those used by administrative sources. Hence the presence of metadata containing
sufficiently accurate descriptions of the concepts can assist the decision of whether the source data
meets their needs.
All metadata made available by an administrative data supplier should be described, along with those
metadata items which are missing. A description should include how the missing metadata affect the
ability to assess the fitness for purpose of the administrative data. The completeness of the
information would be used to determine whether the users can make appropriate use of data. Links to
appropriate metadata ensure that this information is accessible.
The accuracy of the metadata will also play a part in the judgement of whether a data source can be
used. For example, if data for a variable from an external data source has an accompanying
description of simply ‘Sales’, then the metadata would fail the quality requirements of an output
which requires aggregation of variable values with a more specific description, such as ‘Sales
excluding VAT’. In this instance, because of the quality of the metadata, this piece of data will be
overlooked for this particular output, even though the variable might actually have represented ‘Sales
excluding VAT’ without expressly saying so in the description.
Integration layer
As might be expected, the process of data integration depends very much upon the coherence of the
metadata accompanying the different data sources. For example, are the classifications relating to the
data using the same standards (such as Standard Industrial Classification or NACE etc.) or local
versions of the classifications which might still display coherence?
Many of the issues relating to the quality of metadata in the source layer are relevant to the
integration layer also. In this layer we would expect processes such as editing, imputation,
classification/coding to take place, often to be carried out by automated scripts. The assessment of
the quality of these scripts (which are actually Active metadata) would be particularly important. One
quality aspect of these scripts to consider might be the uniqueness e.g. single instances of scripts
carrying out a single function, rather than multiple instances of the same script (although perhaps with
different names), causing problems when amendments/updates are required.
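One possible way of detecting multiple instances of the same script stored under different names is to compare content hashes. This is a sketch under stated assumptions: the script names and contents are hypothetical, and the normalisation step (ignoring trailing whitespace and blank lines) is an illustrative choice rather than a recommended standard.

```python
import hashlib
from collections import defaultdict


def normalise(source: str) -> str:
    """Drop trailing whitespace and blank lines so trivial edits do not hide duplicates."""
    lines = [line.rstrip() for line in source.splitlines()]
    return "\n".join(line for line in lines if line)


def find_duplicate_scripts(scripts: dict) -> list:
    """Group script names whose normalised content is identical."""
    groups = defaultdict(list)
    for name, source in scripts.items():
        digest = hashlib.sha256(normalise(source).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical active metadata: three stored scripts, two of which are the same.
scripts = {
    "impute_turnover.sql": "UPDATE t SET x = 0 WHERE x IS NULL;\n",
    "fix_nulls.sql": "UPDATE t SET x = 0 WHERE x IS NULL;  \n\n",
    "code_nace.sql": "UPDATE t SET nace = 'A';",
}

print(find_duplicate_scripts(scripts))
```

A check of this kind only finds textually identical scripts; scripts that perform the same function with different code would still need manual review.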
Interpretation and Analysis layer
When generating or prototyping a potential new output, the user will need to check whether data
exists for the statistical concept(s) that they are measuring. This would include a quality check of
descriptions of the statistical measure, the population, variables, statistical unit types, domains and
time reference. Quality checking this metadata would give users an understanding of the relevance of
the input data to their needs (for example, whether the output covers their required population or time
As might be expected, the interpretability of metadata is important in this layer. The availability of
relevant supplementary metadata is needed to enable the appropriate data analysis to take place.
Access layer
In the scenario of carrying out a search for valid datasets via some form of data explorer, the entry
into the search engine of valid search criteria is obviously very important if the appropriate datasets
are to be found. This means that any metadata entered as part of the search criteria must have an
acceptable level of quality in order for the search to be successful. For example, if the user enters a
value of ‘201203’ as the reference period of the data they require, but the metadata is held in the
SDWH in the form of ‘2012Q1’, then the search will fail. Metadata quality checks need to be carried
out on the correctness or accuracy of the metadata entered by the user.
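The reference period example above could be handled by normalising user input before the search is run. The sketch below assumes that ‘201203’ denotes a year and month (March 2012) and that the SDWH stores quarterly periods as ‘YYYYQn’; both are assumptions, and the function name is hypothetical.

```python
import re


def to_quarter(period: str) -> str:
    """Normalise 'YYYYQn' or 'YYYYMM' input to 'YYYYQn'; raise ValueError otherwise."""
    text = period.strip().upper()
    match = re.fullmatch(r"(\d{4})Q([1-4])", text)
    if match:
        return f"{match.group(1)}Q{match.group(2)}"
    match = re.fullmatch(r"(\d{4})(\d{2})", text)
    if match:
        month = int(match.group(2))
        if 1 <= month <= 12:
            # Months 1-3 map to Q1, 4-6 to Q2, and so on.
            return f"{match.group(1)}Q{(month - 1) // 3 + 1}"
    raise ValueError(f"unrecognised reference period: {period!r}")


print(to_quarter("201203"))  # 2012Q1
print(to_quarter("2012q1"))  # 2012Q1
```

Rejecting unrecognised input with an explicit error, rather than silently returning no results, gives the user a chance to correct the search criteria.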
8. Acceptable Quality Levels
For each of the quality dimensions, there will be associated threshold values. These values would
indicate the acceptability of the metadata following the application of quality measurements to the
These levels could conceivably change depending upon the quality requirements of particular outputs
or processes, or depending upon legislative requirements.
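Threshold values of this kind could be held as configuration and overridden for particular outputs. The following is a minimal sketch: the dimension names, the threshold values and the scores are all hypothetical.

```python
# Hypothetical default thresholds; real values would be set per organisation.
DEFAULT_THRESHOLDS = {"completeness": 0.95, "uniqueness": 1.0, "accuracy": 0.90}


def failing_dimensions(scores, overrides=None):
    """Return the dimensions whose score is missing or below its threshold."""
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return sorted(
        dimension
        for dimension, minimum in thresholds.items()
        if scores.get(dimension, 0.0) < minimum
    )


scores = {"completeness": 0.97, "uniqueness": 1.0, "accuracy": 0.85}

print(failing_dimensions(scores))
# An output with stricter requirements raises the completeness threshold.
print(failing_dimensions(scores, overrides={"completeness": 0.99}))
```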
9. Metadata Quality Management
Should we be concerned about the management of metadata quality?
Some aspects of the Quality Management principles (ISO 9000) should be applied to metadata
quality management. In particular, the following principles seem especially relevant to the SDWH:
Customer focus - Organizations depend on their customers and therefore should understand
current and future customer needs, should meet customer requirements and strive to exceed
customer expectations;
Process approach - A desired result is achieved more efficiently when activities and related
resources are managed as a process;
System approach to management - Identifying, understanding and managing interrelated
processes as a system contributes to the organization's effectiveness and efficiency in
achieving its objectives.
This indicates that the management of metadata in the SDWH should encompass quality
management processes. For example, it would be expected that a customer, such as an expert user
who is carrying out some detailed analysis process, will have sufficient access to a system which will
provide all the information required by the user relating to the metadata, including some form of
mechanism for feeding back any information relating to the metadata quality which might come to light
as a result of the process being carried out by the user.
The management of metadata, like that of data, should also encompass standard processes rather
than each user adopting their own, ad-hoc approach. This means that the control of metadata in the
SDWH system should be a systematic approach, minimising the opportunity to make mistakes whilst
creating or updating metadata. This would require the implementation of sets of rules governing the
creation, updating, and even reading of the metadata within the SDWH.
The adoption of standards for codes, names and definitions should help in the context of the
application of the dimensions of completeness, coherence, accuracy, uniqueness, and
Whilst the human aspects should play a large part in metadata quality management (for example, by
applying sensible choices in the selection of variable names etc.) the software tools or applications
set up to enable access to the metadata would normally be used to enforce the adoption of naming
standards and processes, by imposing underlying business rules (e.g. on the metadata user’s GUI, an
entry field for ‘Date’ can only accept valid dates in the format ‘DDMMYY’ – characters must be
numeric, ‘DD’ must be > 00 and < 32 etc.).
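A business rule such as the ‘DDMMYY’ example could be enforced along the following lines. This is a sketch only: the function name is hypothetical, and it relies on Python’s standard `datetime.strptime`, which additionally rejects impossible dates such as 31 February.

```python
from datetime import datetime


def is_valid_ddmmyy(text: str) -> bool:
    """Return True when text is exactly six digits forming a real date in DDMMYY order."""
    if len(text) != 6 or not (text.isascii() and text.isdigit()):
        return False
    try:
        datetime.strptime(text, "%d%m%y")
    except ValueError:
        return False
    return True


print(is_valid_ddmmyy("290212"))  # True: 29 February 2012 exists
print(is_valid_ddmmyy("310212"))  # False: February has no 31st day
print(is_valid_ddmmyy("320112"))  # False: day must be below 32
```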
10. References
Lars-Goran Lundell (2012) – Metadata Framework for Statistical Data Warehousing (ESSnet project
on Statistical Data Warehouse)
International Standard ISO9000:2005 – Quality Management Systems fundamentals and vocabulary
International Standard ISO/IEC 11179 – Information Technology – Metadata Registries (Parts 1 – 6)
Office for National Statistics (2007) – Guidelines for Measuring Statistical Quality – Published by Her
Majesty’s Stationery Office (HMSO) – now ‘The Stationery Office’ - for the Office for National Statistics
Paul Johanis (2002) - Assessing the Quality of Metadata. Statistics Canada presentation at the work
session on METIS, 6-8 March 2002, Luxembourg
Statistics Canada - Statistics Canada’s Quality Assurance Framework (2002)
Thomas R. Bruce & Diane I Hillman (2004) – The Continuum of metadata quality: Defining,
expressing, exploiting. From Metadata in Practice (pp.238-256). Chicago ALA
Piet J.H. Daas and Peter W.M. van Nederpelt (2010) - Application of the object oriented quality
management model to secondary data sources – Statistics Netherlands
Piet J.H. Daas and Saskia J.L. Ossen (2011) – Metadata Quality Evaluation of Secondary Data
Sources - Statistics Netherlands. Presented at the 5th International Quality Conference, May 20th
Antonio Laureti Palma (2012) – S-DWH Business Architecture – Deliverable 3.1 for ESSnet on MicroData Linking and Statistical Data Warehouse
Appendix A – Quality Dimension Definitions
Quality Assurance Framework - Statistics Canada
Relevance: The relevance of statistical information reflects the degree to which it meets the real
needs of clients. It is concerned with whether the available information sheds light on the issues of
most importance to users. Assessing relevance is a subjective matter dependent upon the varying
needs of users. The Agency’s challenge is to weigh and balance the conflicting needs of current and
potential users to produce a program that goes as far as possible in satisfying the most important
needs within given resource constraints.
Accuracy: The accuracy of statistical information is the degree to which the information correctly
describes the phenomena it was designed to measure. It is usually characterized in terms of error in
statistical estimates and is traditionally decomposed into bias (systematic error) and variance (random
error) components. It may also be described in terms of the major sources of error that potentially
cause inaccuracy (e.g., coverage, sampling, non-response, response).
Timeliness: The timeliness of statistical information refers to the delay between the reference point
(or the end of the reference period) to which the information pertains, and the date on which the
information becomes available. It is typically involved in a trade-off against accuracy. The timeliness
of information will influence its relevance.
Accessibility: The accessibility of statistical information refers to the ease with which it can be
obtained from the Agency. This includes the ease with which the existence of information can be
ascertained, as well as the suitability of the form or medium through which the information can be
accessed. The cost of the information may also be an aspect of accessibility for some users.
Interpretability: The interpretability of statistical information reflects the availability of the
supplementary information and metadata necessary to interpret and utilize it appropriately. This
information normally covers the underlying concepts, variables and classifications used, the
methodology of data collection and processing, and indications of the accuracy of the statistical
Coherence: The coherence of statistical information reflects the degree to which it can be
successfully brought together with other statistical information within a broad analytic framework and
over time. The use of standard concepts, classifications and target populations promotes coherence,
as does the use of common methodology across surveys. Coherence does not necessarily imply full
numerical consistency.
ONS Guidelines for Measuring Statistical Quality (based upon the ESS Quality Guidelines)
Relevance - The degree to which the statistical product meets user needs for both coverage and
content.
Accuracy - The closeness between an estimated result and the (unknown) true value.
Timeliness and Punctuality - Timeliness refers to the lapse of time between publication and the period
to which the data refer. Punctuality refers to the time lag between the actual and planned dates of
publication.
Accessibility and Clarity - Accessibility is the ease with which users are able to access the data. It also
relates to the format(s) in which the data are available and the availability of supporting information.
Clarity refers to the quality and sufficiency of the metadata, illustrations and accompanying advice.
Comparability - The degree to which data can be compared over time and domain.
Coherence - The degree to which data that are derived from different sources or methods, but
which refer to the same phenomenon, are similar.
Appendix B – International Standards relevant to Metadata
ISO/IEC TR 20943 – Achieving Metadata Registry Content Consistency
This standard consists of 6 parts and some are still under development or on
hold but can provide the reader with useful information on the subject of
metadata within a SDWH.
The purpose of ISO/IEC TR 20943-1:2003 is to describe a set of procedures for the
consistent registration of data elements and their attributes in a registry. ISO/IEC TR
20943-1:2003 is not a data entry manual, but a user’s guide for conceptualizing a
data element and its associated metadata items for the purpose of consistently
establishing good quality data elements. An organization may adapt and/or add to
these procedures as necessary. The scope of ISO/IEC TR 20943-1:2003 is limited to
the associated items of a data element: the data element identifier, names and
definitions in particular contexts, and examples; data element concept; conceptual
domain with its value meanings; and value domain with its permissible values.
The purpose of ISO/IEC 20943-2 is to describe ways of representing XML structured
data in a 11179-3 metadata registry (hereinafter referred to as "a 11179 MDR" or
simply "an MDR"). XML structures may be mapped to, and represented by, one or
more constructs in an MDR. ISO/IEC 11179-3:2003 does not explicitly specify how to
represent XML structures, and practitioners have found more than one way to
represent similar structures using the constructs defined by ISO/IEC 11179-3:2003.
This part describes some possible representations of various XML structures, some
pros and cons of each, with techniques for mapping from one to another.
ISO/IEC TR 9789:1994 - Guidelines for the organisation and representation of data elements
for data interchange - Coding methods and principles
Note that this is not a free publication.
The ISO 9789 standard provides general guidance on the manner in which data can
be expressed by codes. It describes the objectives of coding, the characteristics,
advantages and disadvantages of different coding methods, and the features of codes, and
gives guidelines for the design of codes.
ISO/IEC TR 14957:2010 - Representation of data elements values: Notation of the format
Note that this is not a free publication.
ISO/IEC 14957:2010 specifies the notation to be used for stating the format, i.e. the
character classes, used in the representation of data elements and the length of these
representations. It also specifies additional notations relative to the representation of
numerical figures. For example, this formatting technique might be used as part of the
metadata for data elements. The scope of ISO/IEC 14957:2010 is limited to graphic
characters, such as digits, letters and special characters. The scope is limited to the
basic datatypes of characters, character strings, integers, reals, and pointers.
ISO/IEC 24706 - Metadata for technical standards and specifications documents
Note that this document is still under development and therefore there is no
summary available yet. We think it is worth checking this standard again in the
near future but for now it is not useful for the project.
ISO/IEC 19773 – Metadata registries (MDR) Module
Note that this is not a free publication
ISO/IEC 19773:2011 specifies small modules of data that can be used or reused in
applications. These modules have been extracted from ISO/IEC 11179-3, ISO/IEC
19763, and OASIS EBXML, and have been refined further. These modules are
intended to harmonize with current and future versions of the ISO/IEC 11179 series
and the ISO/IEC 19763 series. These modules include: reference-or-literal (reflit) for
on-demand choices of pointers or data; multitext, multistring, etc. for recording
internationalized and localized data within the same structure; slots and slot arrays for
standardized extensible data structures; internationalized contact data, including UPU
postal addresses, ITU-T E.164 phone numbers, internet E-mail addresses, etc.;
generalized model for context data based upon who-what-where-when-why-how
(W5H); data structures for reified relationships and entity-person-groups. Conformity
can be selected on a per-module basis.
ISO/IEC 20944 – Metadata Registry Interoperability & Bindings (MDR-IB)
Note that this standard consists of 5 parts and some are still under development
but can provide the reader with useful information on the subject of metadata
within a SDWH.
The ISO/IEC 20944 family of standards is being developed to provide interoperability
among metadata registries (11179-3), such as reading/writing attributes from/to a
metadata registry. However, the ISO/IEC 20944 series may be used generically, such
as for applications that are unrelated to 11179-3 metadata registries, or applications
that extend 11179-3 metadata registry attributes (attributes outside of the 11179-3
Appendix C - Summary of ISO/IEC 11179
International standards apply to metadata. Much work is being accomplished in the national and
international standards communities, especially ANSI (American National Standards Institute) and
ISO (International Organization for Standardization) to reach consensus on standardizing metadata
and registries.
The core standard is ISO/IEC 11179-1 and its subsequent parts. All registrations published so far
under this standard cover only the definition of metadata; they do not address the structuring of
metadata storage or retrieval, nor any administrative standardisation. It is important to note that
this standard refers to metadata as data about containers of data and not to metadata (metacontent)
as data about data contents. It should also be noted that this standard describes itself originally as a
"data element" registry, describing disembodied data elements, and explicitly disavows the capability
of containing complex structures. Thus the original term "data element" is more applicable than the
later applied buzzword "metadata".
Intended purpose
Today, organizations often want to exchange data quickly and precisely between computer systems
using enterprise application integration technologies. Completed transactions are also often
transferred to separate data warehouse and business rules systems with structures designed to
support data for analysis. The industry de facto standard model for data integration platforms is the
Common Warehouse Model (CWM). Data integration is often also solved as a data, rather than a
metadata, problem, with the use of so-called master data. ISO/IEC 11179 claims that it is a standard
for metadata-driven exchange of data in a heterogeneous environment, based on exact definitions of
Structure of an ISO/IEC 11179 metadata registry
The ISO/IEC 11179 model is a result of two principles of semantic theory, combined with basic
principles of data modelling.
The first principle from semantic theory is the thesaurus-type relation between wider and narrower
(or more specific) concepts, e.g. the wide concept "income" has a relation to the narrower concept "net
income".
The second principle from semantic theory is the relation between a concept and its representation,
i.e. "buy" and "purchase" are the same concept even if different terms are used.
The basic principle of data modelling is the combination of an object class and a characteristic. For
example, "Person - hair color".
When applied to data modelling, ISO/IEC 11179 combines a wide "concept" with an "object class" to
form a more specific "data element concept". For example, the high-level concept "income" is
combined with the object class "person" to form the data element concept "net income of person".
Note that "net income" is more specific than "income".
The different possible representations of a data element concept are then described with the use of
one or more data elements. Differences in representation may be a result of the use of synonyms or
different value domains in different data sets in a data holding. A value domain is the permitted range
of values for a characteristic of an object class. An example of a value domain for "gender of person"
is "M = Male, F = Female, U = Unknown". The letters M, F and U are then the permitted values of
gender of person in a particular dataset.
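The value domain example above can be expressed as a mapping from permitted values to their meanings, against which data values are checked. This is a minimal sketch; the function name and the sample values are hypothetical.

```python
# Value domain for "gender of person", as given in the example above.
GENDER_VALUE_DOMAIN = {"M": "Male", "F": "Female", "U": "Unknown"}


def invalid_values(values, value_domain):
    """Return the distinct values that are not permitted by the value domain."""
    return sorted({value for value in values if value not in value_domain})


# Hypothetical column of data values to be checked against the value domain.
observed = ["M", "F", "X", "U", "x", "F"]
print(invalid_values(observed, GENDER_VALUE_DOMAIN))  # ['X', 'x']
```

Holding the meaning alongside each permitted value also keeps the semantics of each code in one place.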
The data element concept "monthly net income of person" may thus have one data element called
"monthly net income of individual by 100 dollar groupings" and one called "monthly net income of
person range 0-1000 dollars", etc., depending on the heterogeneity of representation that exists within
the data holdings covered by one ISO/IEC 11179 registry. Note that these two examples have
different terms for the object class (person/individual) and different value sets (a 0-1000 dollar range
as opposed to 100 dollar groupings).
The result of this is a catalogue of sorts, in which related data element concepts are grouped by a
high-level concept and an object class, and data elements grouped by a shared data element
concept. Strictly speaking, this is not a hierarchy, even if it resembles one.
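The catalogue-like grouping described above can be illustrated with a small data model. The sketch below is a simplification for illustration and is not the ISO/IEC 11179-3 metamodel: the class and field names are assumptions, while the example names are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElementConcept:
    concept: str        # high-level concept, e.g. "income"
    object_class: str   # e.g. "person"
    name: str           # e.g. "monthly net income of person"


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: DataElementConcept
    value_domain: str   # free-text description of the permitted values


def group_by_concept(elements):
    """Group data element names under the name of their shared data element concept."""
    groups = {}
    for element in elements:
        groups.setdefault(element.concept.name, []).append(element.name)
    return groups


income = DataElementConcept("income", "person", "monthly net income of person")
elements = [
    DataElement("monthly net income of individual by 100 dollar groupings",
                income, "100 dollar groupings"),
    DataElement("monthly net income of person range 0-1000 dollars",
                income, "0-1000 dollar range"),
]

for concept_name, element_names in group_by_concept(elements).items():
    print(concept_name, "->", element_names)
```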
It is worth noting that ISO/IEC 11179 proper does not describe data as it is actually stored. There is
no part of the model that caters to the description of physical files, tables and columns. All the
ISO/IEC 11179 constructs are "semantic" as opposed to "physical" or "technical".
Because the standard has two main purposes (definition and exchange), the core object is the data
element concept: it defines a concept and, ideally, describes data independently of its
representation in any one system, table, column or organisation.
The data element is the foundational concept in an ISO/IEC 11179 metadata registry. The purpose of the
registry is to maintain a semantically precise structure of data elements.
Each data element in an ISO/IEC 11179 metadata registry:
 should be registered according to the Registration guidelines (11179-6);
 will be uniquely identified within the register (11179-5);
 should be named according to the Naming and Identification Principles (11179-5) (see data
element name);
 should be defined by the Formulation of Data Definitions rules (11179-4) (see data element
definition); and
 may be classified in a Classification Scheme (11179-2) (see classification scheme).
Data elements that store "codes" or enumerated values must also specify the semantics of each of
the code values with precise definitions.
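Some of these rules lend themselves to automated checks at registration time. The following is a
minimal sketch, assuming a hypothetical in-memory Registry class; the class, its field names, the
identifier "DE-001" and the sample definition text are invented for illustration and are not
prescribed by ISO/IEC 11179. It enforces unique identifiers, a non-empty definition, and a
definition for every code value.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:
    identifier: str
    name: str
    definition: str
    # Maps each permitted code to its definition; empty for non-coded elements.
    code_definitions: dict = field(default_factory=dict)


class Registry:
    """Hypothetical in-memory registry that applies three of the rules above."""

    def __init__(self):
        self._elements = {}

    def register(self, element):
        if element.identifier in self._elements:
            raise ValueError(f"identifier {element.identifier!r} is already registered")
        if not element.definition.strip():
            raise ValueError("a data element needs a definition")
        undefined = [code for code, text in element.code_definitions.items()
                     if not text.strip()]
        if undefined:
            raise ValueError(f"codes without a definition: {undefined}")
        self._elements[element.identifier] = element

    def __len__(self):
        return len(self._elements)


registry = Registry()
registry.register(DataElement(
    identifier="DE-001",
    name="gender of person",
    definition="Illustrative placeholder definition.",
    code_definitions={"M": "Male", "F": "Female", "U": "Unknown"},
))
```

Registering a second element with the identifier "DE-001", or a coded element with a blank code
definition, raises a ValueError in this sketch.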
Structure of the ISO/IEC 11179 standard
The standard consists of six parts:
 Part 1 - Framework
 Part 2 - Classification
 Part 3 - Registry metamodel and basic attributes
 Part 4 - Formulation of data definitions
 Part 5 - Naming and identification principles
 Part 6 - Registration
Part 1 explains the purpose of each part. Part 3 specifies the metamodel that defines the registry. The
other parts specify various aspects of the use of the registry.
11179-1: Framework
This part of ISO/IEC 11179 introduces and discusses fundamental ideas of data elements, value
domains, data element concepts, conceptual domains, and classification schemes essential to the
understanding of this set of standards and provides the context for associating the individual parts of
ISO/IEC 11179.
11179-2: Classification
This part of ISO/IEC 11179 provides a conceptual model for managing classification schemes. There
are many structures used to organize classification schemes and there are many subject matter areas
that classification schemes describe. So, this Part also provides a two-faceted classification for
classification schemes themselves.
11179-3: Registry metamodel and basic attributes
This part of ISO/IEC 11179 specifies a conceptual model for a metadata registry, and a set of basic
attributes for metadata for use when a full registry solution is not needed.
11179-4: Formulation of data definitions
This part of ISO/IEC 11179 provides guidance on how to develop unambiguous data definitions. A
number of specific rules and guidelines are presented in ISO/IEC 11179-4 that specify exactly how a
data definition should be formed. A precise, well-formed definition is one of the most critical
requirements for shared understanding of an administered item; well-formed definitions are imperative
for the exchange of information. Only if every user has a common and exact understanding of the
data item can it be exchanged trouble-free.
11179-5: Naming and identification principles
This part of ISO/IEC 11179 provides guidance for the identification of administered items.
Identification is a broad term for designating, or identifying, a particular data item. Identification can be
accomplished in various ways, depending upon the use of the identifier. Identification includes the
assignment of numerical identifiers that have no inherent meanings to humans; icons (graphic
symbols to which meaning has been assigned); and names with embedded meaning, usually for
human understanding, that are associated with the data item's definition and value domain.
11179-6: Registration
This part of ISO/IEC 11179 provides instruction on how a registration applicant may register a data
item with a central Registration Authority and the allocation of unique identifiers for each data item.
Maintenance of administered items already registered is also specified in this document.
Additional information
Classification scheme: 11179-2 (Wikipedia)
In metadata a classification scheme is a hierarchical arrangement of kinds of things (classes) or
groups of kinds of things. Typically it is accompanied by descriptive information of the classes or
groups. A classification scheme is intended to be used for an arrangement or division of individual
objects into the classes or groups. The classes or groups are based on characteristics which the
objects (members) have in common. In linguistics, the subordinate concept is called a hyponym of its
superordinate. Typically a hyponym is 'a kind of' its superordinate (Keith Allan, Natural language
The ISO/IEC 11179 metadata registry standard uses classification schemes as a way to classify
administered items, such as data elements, in a metadata registry.
Some quality criteria for classification schemes are:
Whether different kinds are grouped together, in other words, whether it is a grouping system
or a pure classification system. In the case of grouping, a subset (subgroup) does not have
(inherit) all the characteristics of the superset, so the knowledge and requirements about the
superset do not apply to the members of the subset.
Whether the classes have overlaps.
Whether subordinates (may) have multiple superordinates. Some classification schemes
allow a kind of thing to have more than one superordinate; others do not. Multiple supertypes
for one subtype imply that the subordinate has the combined characteristics of all its
superordinates. This is called multiple inheritance (of characteristics from multiple
superordinates to their subordinates).
Whether the criteria for belonging to a class or group are well defined.
Whether the kinds of relations between the concepts are made explicit and well defined.
Whether subtype-supertype relations are distinguished from composition relations (part-whole
relations) and from object-role relations.
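The criterion on multiple superordinates can be checked mechanically once a scheme is available as
a mapping from each class to its direct superordinates. The sketch below is an illustration only:
the vehicle scheme, its characteristics and both function names are hypothetical, and the
recursion assumes the scheme contains no cycles.

```python
def multiple_superordinates(parents):
    """Return the classes that have more than one direct superordinate."""
    return sorted(name for name, supers in parents.items() if len(supers) > 1)


def inherited_characteristics(name, parents, characteristics):
    """Collect a class's own characteristics plus those of all its superordinates."""
    collected = set(characteristics.get(name, ()))
    for parent in parents.get(name, ()):
        collected |= inherited_characteristics(parent, parents, characteristics)
    return collected


# Hypothetical scheme: class -> direct superordinates.
parents = {
    "amphibious vehicle": ["land vehicle", "watercraft"],
    "land vehicle": ["vehicle"],
    "watercraft": ["vehicle"],
}
# Hypothetical characteristics declared directly on each class.
characteristics = {
    "vehicle": {"carries load"},
    "land vehicle": {"has wheels"},
    "watercraft": {"floats"},
}

print(multiple_superordinates(parents))
# ['amphibious vehicle']
print(sorted(inherited_characteristics("amphibious vehicle", parents, characteristics)))
# ['carries load', 'floats', 'has wheels']
```

The second call shows multiple inheritance in the sense used above: the subordinate carries the
combined characteristics of all its superordinates.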
Benefits of using classification schemes
Using one or more classification schemes for the classification of a collection of objects has many
benefits. Some of these include:
It allows a user to find an individual object quickly on the basis of its kind or group.
It makes it easier to detect duplicate objects.
It conveys semantics (meaning) of an object from the definition of its kind, a meaning that is not
conveyed by the name of the individual object or its spelling.
Knowledge and requirements about a kind of thing can be applied to the members of the kind.
Examples of kinds of classification schemes
The following are examples of different kinds of classification schemes. This list is in approximate
order from informal to more formal:
thesaurus - a collection of categorized concepts, denoted by words or phrases, that are related to
each other by narrower term, wider term and related term relations.
taxonomy - a formal list of concepts, denoted by controlled words or phrases, arranged from abstract
to specific, related by subtype-supertype relations or by superset-subset relations.
data model - an arrangement of concepts (entity types), denoted by words or phrases, that have
various kinds of relationships. Typically, but not necessarily, representing requirements and
capabilities for a specific scope (application area).
network (mathematics) - an arrangement of objects in a random graph.
ontology - an arrangement of concepts that are related by various well defined kinds of relations. The
arrangement can be visualized in a directed acyclic graph.
One example of a classification scheme for data elements is a representation term.
Data element definition 11179-4 (Wikipedia)
In metadata, a data element definition is a human readable phrase or sentence associated with a
data element within a data dictionary that describes the meaning or semantics of a data element.
Data element definitions are critical for external users of any data system. Good definitions can
dramatically ease the process of mapping one set of data into another set of data. This is a core
feature of distributed computing and intelligent agent development.
There are several guidelines that should be followed when creating high-quality data element
definitions.
Properties of clear definitions
A good definition is:
Precise - The definition should use words that have a precise meaning. Try to avoid words that have
multiple meanings or multiple word senses.
Concise - The definition should use the shortest description possible that is still clear.
Non Circular - The definition should not use the term you are trying to define in the definition itself.
This is known as a circular definition.
Distinct - The definition should differentiate a data element from other data elements. This process is
called disambiguation.
Unencumbered - The definition should be free of embedded rationale, functional usage, domain
information, or procedural information.
A data element definition is a required property when adding data elements to a metadata registry.
Definitions should not refer to terms or concepts that might be misinterpreted by others or that have
different meanings based on the context of a situation. Definitions should not contain acronyms that
are not clearly defined or linked to other precise definitions.
If you are creating a large number of data elements, all the definitions should be consistent with
related concepts.
Critical Data Element -- Not all data elements are of equal importance or value to an organization. A
key metadata property of an element is categorizing the data as a Critical Data Element (CDE). This
categorization provides focus for data governance and data quality. An organization often has various
sub-categories of CDEs, based on use of the data. e.g.,
Security Coverage – data elements that are categorized as personal health information or PHI
warrant particular attention for security and access
Marketing Department Usage – the Marketing department could have a particular set of CDEs
identified for identifying Unique Customer or for Campaign Management
Finance Department Usage – the Finance department could have a different set of CDEs from
Marketing. They are focused on data elements which provide measures and metrics for fiscal
Standards such as the ISO/IEC 11179 Metadata Registry specification give guidelines for creating
precise data element definitions. Specifically, Part 4 of the ISO/IEC 11179 metadata registry
standard covers data element definition quality standards.
Using precise words
Common words such as play or run frequently have many meanings. For example, the WordNet
database documents more than 57 distinct meanings for the word "play" but only a single definition
for the term dramatic play. A word with fewer definitions in its dictionary entry is preferable,
because this minimizes misinterpretation related to a reader's context and background. The process
of finding the intended meaning of a word is called word sense disambiguation.
Examples of definitions that could be improved
Here is the definition of "person" data element as defined in the www.w3c.org Friend of a Friend
specification *:
Person: A person.
Although most people do have an intuitive understanding of what a person is, the definition has much
room for improvement. The first problem is that the definition is circular. Note that this definition really
does not help most readers and needs to be clarified.
Here is the definition of the "Person" Data Element in the Global Justice XML Data Model 3.0 *:
person: Describes inherent and frequently associated characteristics of a person.
Note that this definition is also circular. Person should not reference itself. The definition
should use terms other than person to describe what a person is.
Here is a more precise but shorter definition of a person:
Person: An individual human being.
Note that it uses the word individual to state that this is an instance of a class of things called human
being. Technically you might use "homo sapiens" in your definition, but more people are familiar with
the term "human being" than "homo sapiens," so commonly used terms, if they are still precise, are
always preferred.
Sometimes your system may have cultural norms and assumptions in the definitions. For example, if
your "Person" data element tracked characters in a science fiction series that included aliens, you
may need a more general term than human being.
Person: An individual of a sentient species.
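The non-circularity rule is simple enough to screen for automatically. The following is a rough
heuristic sketch, not a rule taken from ISO/IEC 11179-4: the function name is_circular is an
assumption, it only handles single-word terms, and it will miss inflected forms such as plurals.

```python
import re


def is_circular(term, definition):
    """Heuristic: flag a definition that reuses the (single-word) term being defined."""
    words = re.findall(r"[a-z]+", definition.lower())
    return term.lower() in words


definitions = [
    "A person.",
    "Describes inherent and frequently associated characteristics of a person.",
    "An individual human being.",
    "An individual of a sentient species.",
]
for text in definitions:
    print(is_circular("Person", text), "-", text)
# True - A person.
# True - Describes inherent and frequently associated characteristics of a person.
# False - An individual human being.
# False - An individual of a sentient species.
```

A check like this can only flag candidates for review. Whether a definition is precise, distinct
and unencumbered still needs human judgement.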
Data element name 11179-5 (Wikipedia)
A data element name is a name given to a data element in, for example, a data dictionary or
metadata registry. In a formal data dictionary, there is often a requirement that no two data elements
may have the same name, to allow the data element name to become an identifier, though some data
dictionaries may provide ways to qualify the name in some way, for example by the application
system or other context in which it occurs.
In a database driven data dictionary, the fully qualified data element name may become the primary
key, or an alternate key, of a Data Elements table of the data dictionary.
The data element name typically conforms to ISO/IEC 11179 metadata registry naming conventions
and has at least three parts:
 Object, Property and Representation term.
Many standards require the use of Upper camel case to differentiate the components of a data
element name. This is the standard used by ebXML, GJXDM and NIEM.
Example of ISO/IEC 11179 naming in relational databases
ISO/IEC 11179 is applicable when naming tables and columns within a relational database.
Tables are Collections of Entities, and follow Collection naming guidelines. Ideally, a collective name
is used: e.g., Personnel. Plural is also correct: Employees. Incorrect names include: Employee,
tblEmployee, and EmployeeTable.
Columns are Properties of the Entity and are named in a multi-part format:
[Object] [Qualifier] Property RepresentationTerm
The Object part may be omitted from a name when the property is within its object's context. The
Qualifier is used when it is necessary to uniquely identify an element. For example, columns on the
WorkOrders table would be expressed as:
For Requirements_Text, the full name (i.e., the name that goes in the registry, or data dictionary) is
WorkOrder_Requirements_Text:
 Object is WorkOrder in full name.
 Property is Requirements in full name.
 RepresentationTerm is Text in full name.
The Requesting_Employee_Number and Approving_Employee_Number columns have Qualifiers to
ensure that the data element names are unique and descriptive. The Object part of the element name
is also omitted because it is declared within the object context.
Note that for the examples provided, an underscore was used as a separator. A separator is not
mandated by ISO/IEC 11179 but is recommended.
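The multi-part format can be expressed as a small helper. This is a sketch under assumptions: the
function name element_name and its parameter names are invented here, and the full name
WorkOrder_Requirements_Text is derived from the Object, Property and RepresentationTerm parts
listed above joined with the underscore separator.

```python
def element_name(property_term, representation_term,
                 object_term=None, qualifier=None, separator="_"):
    """Join the parts in the order [Object] [Qualifier] Property RepresentationTerm."""
    parts = [object_term, qualifier, property_term, representation_term]
    # Optional parts that were not supplied are simply left out.
    return separator.join(part for part in parts if part)


# Column name inside the WorkOrders table: the Object part is omitted.
print(element_name("Requirements", "Text"))
# Requirements_Text

# Full registry name: the Object part is included.
print(element_name("Requirements", "Text", object_term="WorkOrder"))
# WorkOrder_Requirements_Text

# A Qualifier distinguishes two columns that share a Property and RepresentationTerm.
print(element_name("Employee", "Number", qualifier="Requesting"))
# Requesting_Employee_Number
print(element_name("Employee", "Number", qualifier="Approving"))
# Approving_Employee_Number
```

Passing separator="" gives the Upper camel case form used in XML element names, provided each part
is already capitalised.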
Example of ISO/IEC 11179 name in XML
Users frequently encounter ISO/IEC 11179 when they are exposed to XML Data Element names that
have a multi-part Camel Case format:
Object [Qualifier] Property RepresentationTerm
The specification also includes normative documentation in appendices.
For example, the XML element for a person's given (first) name would be expressed as
PersonGivenName, where Object = Person, Property = Given and Representation term = Name. In this
case the optional qualifier is not used.
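Going the other way, a Camel Case element name can be split back into its parts. The sketch below
assumes the element is named PersonGivenName, which is what the Object, Property and Representation
term values above imply, and it assumes every part is a single capitalised word. A multi-word part
such as WorkOrder would be split incorrectly. The function name parse_element_name is hypothetical.

```python
import re


def parse_element_name(name):
    """Split an Upper camel case name into Object, [Qualifier], Property, RepresentationTerm."""
    words = re.findall(r"[A-Z][a-z0-9]*", name)
    if len(words) == 3:
        object_term, prop, representation = words
        qualifier = None
    elif len(words) == 4:
        object_term, qualifier, prop, representation = words
    else:
        raise ValueError(f"cannot split {name!r} into three or four parts")
    return {
        "object": object_term,
        "qualifier": qualifier,
        "property": prop,
        "representation_term": representation,
    }


print(parse_element_name("PersonGivenName"))
# {'object': 'Person', 'qualifier': None, 'property': 'Given', 'representation_term': 'Name'}
```

In practice the parts are better taken from the registry entry itself than recovered from the name,
since the name alone does not say where one part ends and the next begins.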