Corporate Metadata Repository Model

CORPORATE METADATA REPOSITORY (CMR) MODEL
Daniel W. Gillman
U.S. Bureau of Labor Statistics
_________________________________________________________
1. Introduction
The term metadata means, loosely, data about data. This means that metadata is data, and
therefore we can store metadata in a database. Such a database is popularly called a
repository or registry. This paper contains a description of a conceptual model for a metadata
repository for an entire statistical organization, called the Corporate Metadata Repository
(CMR) Model. From this point forward, the term metadata means statistical metadata.
The CMR model is designed to support the metadata necessary to describe the survey life
cycle, the linkages between similar designs and processes used across surveys, and the use of
metadata to drive systems in support of the survey life cycle (Gillman and Appel, 1999). The
CMR model supports this broad view of metadata for a survey, through its set of classes. It
supports linking designs and processes across surveys over time, through the use of its
relationships. Finally, metadata that drives data dissemination and automated survey design
and processing systems is supported, through the attributes defined for each of the classes.
The paper is devoted to a description of the CMR model that justifies the claims in the
previous paragraph. Several perspectives the model supports are provided. Each of the
perspectives represents a way to understand the organization of the model. The model itself
is described from three of those perspectives: dimensions, terminology, and survey life cycle.
Finally, there is a short conclusion.
2. Perspectives
This section contains descriptions of several perspectives on the model. Each of the
perspectives describes some characteristics the model has, content the model represents,
systems the model supports, or survey activities the model describes. These perspectives are
outlined in the sections below.
2.1 Conceptual Model
The CMR model is a conceptual model rather than a logical or physical one. A logical model
specifies all the data an application will use, their types, and their relationships. A physical
model describes how the data will be organized in storage for a particular application system,
taking that system's specific needs into account, e.g., de-normalization decisions. A
conceptual model, on the other hand, is a representation of the human understanding of the
data including the relationships that exist between the data, and not necessarily how the data
will be represented in the logical model. Conceptual models are meant for people to read and
understand.
2.2 Standards for CMR
The CMR model is based on an international metadata standard called Specification and
standardization of data elements (ISO/IEC 11179, 2000), which specifies a set of attributes for
describing data elements. The standard is now under revision. The revision has expanded
the scope of the standard, produced a full conceptual model for a metadata registry,
and renamed the standard Metadata registries (ISO/IEC CD 11179-3, 2000).
Three terms help place the standard. A content standard describes content and not format.
A metadata standard describes the data necessary to describe other data or processes. Data
semantics refers to the meaning of data. ISO/IEC 11179 is a metadata content standard
focused on the semantics of data.
ISO/IEC 11179 is well suited to the CMR, because the understanding of data is so important
in statistics. The CMR model is an extension of the ISO/IEC 11179 (revised) metamodel.
Another view of the ISO/IEC 11179 metamodel is that it provides a mechanism for humans
to locate, retrieve, and understand metadata. It is this human side that separates this standard
from other metadata standards. An analogy is the card catalog system employed by libraries,
which is designed for people to understand. The ISO/IEC 11179 metamodel supports the card
catalog analogy for data descriptions. The CMR model expands this for statistical surveys.
2.3 Metadata Orientation
The CMR model supports the production-oriented and output-oriented purposes of statistical
information systems (SIS's) (Sundgren, 1992). Output-oriented SIS's are commonly called
data dissemination systems, and they usually are accessible via the Internet.
Production-oriented SIS's are automated survey design and processing systems. Increasingly, both types
of systems have a strong metadata component as part of their design. The production-oriented
SIS's may write metadata to their metadata component.
SIS's with metadata components are metadata driven if the metadata is used to determine the
path the system takes during a session (Kent, et al, 1999). Broadly defined metadata
repositories can serve as the metadata component for many SIS's. The CMR model supports
the design of such metadata repositories.
2.4 Metadata Structure
Metadata, and data, can be structured, semi-structured, or unstructured. Data is structured if
its type and schema are known (Abiteboul et al, 2000). If only one is known, the data is
semi-structured, and the data is unstructured if neither is known. The CMR model supports
all three kinds of metadata. For example, documents, a common source of metadata in
statistical organizations, are semi-structured.
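The rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the two boolean flags are assumptions made for the example, and the paper does not say which of the two (type or schema) is known for a document.

```python
def classify_structure(type_known: bool, schema_known: bool) -> str:
    """Apply the rule above: both known, exactly one known, or neither."""
    known = int(type_known) + int(schema_known)
    return {2: "structured", 1: "semi-structured", 0: "unstructured"}[known]


# A coded survey variable: both its type and its schema are known.
print(classify_structure(type_known=True, schema_known=True))

# A document: only one of the two is known (which one is an assumption here).
print(classify_structure(type_known=True, schema_known=False))
```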
A special class of structured metadata is called embedded (Kent, et al, 1999). Embedded
metadata is used directly by software applications. The CMR model has many structured
elements, and the items that populate these elements can be embedded.
2.5 Survey Life Cycle
For the CMR model, the survey life cycle is comprised of the following stages (LaPlant, et al,
1996): Content, Planning, Design, Collection, Processing, Analysis, and Dissemination.
Metadata to support and describe all the stages is accounted for in the CMR model.
2.6 Dimensions
The CMR model is organized as a set of overlapping dimensions. Each dimension is a model
on its own, but the combination, realized through relationships across the dimensions, is
greater than the sum of its parts. The main dimensions are Data, Business, Administration,
Documents, Terminology, and Classification.
The data dimension describes data elements and their allowed values (called value domains).
The model distinguishes a conceptual piece (the data element concept) containing the
definition, and the representation, which contains the value domain and data type.
The business dimension contains a model that describes surveys and their components. This
part of the model contains the classes of objects for which the business operation of the
statistical organization needs to keep track.
Documents require small amounts of metadata to keep track of them; it is the relationships of
the documents to business objects that give them context and classify them. These are
contained in the document dimension.
The administration dimension handles all the common characteristics for all the classes in the
other dimensions. It keeps track of metadata quality and the metadata management life cycle
for each object being described.
The classification dimension describes classification schemes that are used to classify objects,
and the terminology dimension manages concepts and terms describing objects.
2.7 Hierarchical
The CMR model is also organized in a hierarchical perspective. This perspective shows
how concepts play a fundamental role in metadata. Concepts are understood from the
perspective of terminology theory. Terminology theory is briefly described below, and the
application of the theory to the CMR model is then explored.
2.7.1 Terminology
In the theory of terminology, concepts are mental constructs or units of thought. Concepts
are organized or grouped by common elements, called characteristics. Essential
characteristics are necessary and sufficient for identifying a concept. Other characteristics are
inessential. The sum of characteristics that constitute a concept is called its intension. The
set of objects a concept refers to is its extension. In natural language, concepts are expressed
through definitions, which specify a unique intension and extension. A designation (term,
appellation, or symbol) represents a concept (Sager, 1990; ISO FDIS 704, 1999).
A subject field is a branch of human knowledge. A subject field is comprised of a set of
related concepts, or concept system. A terminology is the set of designations that represent
the concepts in a concept system. The designations make up a special language, which is
used in a subject field (ISO DIS 1087-1, 1998).
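As a minimal sketch of these terms, the Python below treats an intension as a set of characteristics and computes an extension as the set of objects exhibiting all of them. The class name, the example objects, and their characteristics are hypothetical and are not taken from the CMR model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    designation: str      # the term that represents the concept
    intension: frozenset  # the characteristics that constitute the concept

    def extension(self, objects):
        """Return the objects that exhibit every characteristic in the intension."""
        return [name for name, traits in objects.items()
                if self.intension <= traits]


# Hypothetical objects, each described by a set of characteristics.
objects = {
    "sedan": frozenset({"vehicle", "motorized", "four-wheeled"}),
    "motorcycle": frozenset({"vehicle", "motorized", "two-wheeled"}),
    "bicycle": frozenset({"vehicle", "two-wheeled"}),
}

motor_vehicle = Concept("motor vehicle", frozenset({"vehicle", "motorized"}))
print(motor_vehicle.extension(objects))  # ['sedan', 'motorcycle']
```

Adding a characteristic to the intension can only shrink the extension, which is the usual inverse relation between the two.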
2.7.2 Explaining the Model
Every model is an example of a structured terminology, or ontology. This is because the set
of classes and their relationships constitute a concept system; the attributes of each class are
its characteristics, though not all essential; and the objects populated for each class are part of
its extension. So, the CMR model is an ontology for metadata.
Terminology describes the relationship between conceptual and representational classes in
the data dimension, such as data element concepts and value domains. These are determined
in the design of the questions in a questionnaire and used in the schema of a database or
survey design and processing systems. The systems are implementations of algorithms based
on methodologies. The questionnaire and databases reflect the characteristics of the units in
the sample, which is selected from a sampling frame, and that in turn is an instantiation of the
universe. Both the universe and methodologies are conceptual, which brings the outside
frame of Figure 1 back to the inside frame.
[Figure 1 shows nested frames. From the outermost frame to the innermost:
  Sample/Frame/Universe, Algorithm/Methodology
  Questionnaire, Database, Processing System
  Data Element Concept, Value Domain
  Terminology]
Figure 1: Terminology Hierarchy
3. CMR Model
This section contains a partial description of the CMR model. The description is broken into
sub-sections based on the dimension perspective of section 2.6. The support for the survey
life cycle perspective of section 2.5 is explained in section 3.2.1. Finally, the terminology
perspective, a unifying principle, is used to explain aspects of the model.
Each of the sub-sections contains a model represented in the Unified Modeling Language
(UML). These models are less detailed versions of those found in the CMR model. The
attributes depicted in each class are meant to signify greater detail in the actual model.
3.1 Data
The data dimension describes data, i.e., data elements and their associated classes. There are
four primary classes in this dimension of the model: Data element, Data element concept,
Value domain, and Conceptual domain.
A data element concept is the conceptual part of a data element, described independently of
any particular representation. From a data modeling perspective, it is composed of two parts:
Object classes and properties.
Object classes are the things about which we wish to collect and store data. They are
concepts. Examples are cars, persons, households, employees, and orders. Properties are
what humans use to distinguish or describe objects. They are characteristics, not necessarily
essential ones, of the object class and form its intension. They are also concepts. Examples
of properties are color, model, sex, age, income, address, and price.
Since a data model is an ontology, and object classes and properties are concepts,
assigning an object class and property to a data element concept classifies it. For some data
element concepts, however, their classification may be much more complex. There may be
many characteristics of a data element concept, and they may not meaningfully or uniquely
divide into an object class and property.
The representation describes the form of the data, including a value domain, data type, and, if
necessary, a unit of measure. Value domains are sets of allowed values for data elements. A
non-enumerated domain is a value domain where the allowed values are specified by a
description. An enumerated domain is a value domain where the allowed (permissible)
values are listed. These values have meanings, called value meanings.
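The two kinds of value domain can be sketched in Python as follows. This is an illustration under stated assumptions: the class names, the `allows` method, and the idea of attaching an executable rule to a non-enumerated domain's description are inventions for the example, not part of the CMR model or of ISO/IEC 11179.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EnumeratedDomain:
    # Each permissible value is listed together with its value meaning.
    permissible_values: Dict[str, str]

    def allows(self, value: str) -> bool:
        return value in self.permissible_values


@dataclass
class NonEnumeratedDomain:
    description: str              # the allowed values are given by a description
    rule: Callable[[str], bool]   # executable form of the description (assumption)

    def allows(self, value: str) -> bool:
        return self.rule(value)


# Hypothetical domains for two of the example properties, sex and age.
sex_codes = EnumeratedDomain({"1": "Male", "2": "Female"})
age_years = NonEnumeratedDomain(
    "Whole number of years from 0 to 120",
    lambda v: v.isascii() and v.isdigit() and 0 <= int(v) <= 120,
)

print(sex_codes.allows("2"), sex_codes.allows("3"))      # True False
print(age_years.allows("45"), age_years.allows("abc"))   # True False
```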
A data element concept may be associated with different value domains as needed to form
conceptually similar data elements. Those value domains are conceptualized by a conceptual
domain, which links conceptually similar value domains. Each value domain is part of the
extension of its conceptual domain, and the allowed values are the intension of the value
domain itself. For a conceptual domain, the set of value meanings is its intension. Figure 2
illustrates the main ideas in this section (ISO/IEC CD 11179-3, 2000).
[Figure 2 is a UML class diagram. Classes and attributes:
  Data Element Concept: DEC Administration [0..1], Object Class [0..1], Property [0..1]
  Conceptual Domain: CD Administration [0..1], Value Meanings [0..N]
  Data Element: DE Administration [1..1], Derivation [0..1]
  Value Domain: VD Administration [1..1], Permissible Values [0..N], Description [0..1],
    Data Type [1..1]
 Associations and role names:
  Data Element Concept and Conceptual Domain (Having / Specifying)
  Data Element Concept and Data Element (Expressed by / Expressing)
  Data Element and Value Domain (Represented by / Representing)
  Conceptual Domain and Value Domain (Represented by / Representing)]
Figure 2: Data Dimension Model
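A minimal Python sketch of how the four primary classes fit together is given below. It assumes simplified attributes (the administration attributes are omitted), and the field names, the helper function, and the example values are hypothetical. It shows one data element concept paired with two value domains from the same conceptual domain, which yields two conceptually similar data elements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str    # e.g. "person"
    property_name: str   # e.g. "sex"


@dataclass(frozen=True)
class ValueDomain:
    name: str
    data_type: str
    conceptual_domain: str  # name of the conceptual domain it belongs to


@dataclass(frozen=True)
class DataElement:
    concept: DataElementConcept
    value_domain: ValueDomain


def group_by_concept(elements):
    """Collect the value domains used to represent each data element concept."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element.concept].append(element.value_domain.name)
    return dict(grouped)


person_sex = DataElementConcept("person", "sex")
letters = ValueDomain("sex codes (M/F)", "character", "sexes")
digits = ValueDomain("sex codes (1/2)", "integer", "sexes")

elements = [DataElement(person_sex, letters), DataElement(person_sex, digits)]
print(group_by_concept(elements)[person_sex])
# ['sex codes (M/F)', 'sex codes (1/2)']
```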
3.2 Business Dimension
The Business Dimension describes the business of statistical organizations. It is composed of
classes, attributes, and relationships that describe data that the organization needs to keep
about surveys. The model supports the storage of metadata as single attributes or as
documents. Figure 3 shows the Business Dimension model.
The model describes survey designs, processing, analyses, and data sets. It contains classes
for each of the important parts of a survey. The model supports organized storage and
complex searches for metadata describing a survey, and it supports searches for metadata
across multiple surveys. The model also provides several other features:

- A list of all current surveys conducted by the agency
- Comparison of designs, specifications, or procedures across surveys
- Reuse of designs, specifications, or procedures
- Categorizing and classifying documents
- Assembling complete documentation for a survey
- Attributes to support embedded metadata
[Figure 3 is a UML class diagram. Classes and attributes:
  Planning & Design: P&D Administration [1..1], Planning Documents [0..N],
    Design Documents [0..N]
  Survey: Survey Administration [1..1], OMB Number [1..1]
  Methodology & Algorithms: M&A Administration [1..1], Methodology Desc [0..N],
    Algorithm Desc [0..N]
  Frame & Sample: F&S Administration [1..1], Sampling Scheme [1..1], Sample size [1..1]
  Survey Instance: SI Administration [1..1], Response Rate [1..1], Refusal Rate [1..1]
  Systems: System Administration [1..1], Hardware [0..N], Software [0..N]
  Questionnaire: Questionnaire Administration [1..1], Questions [1..N],
    Response Choices [1..N]
  Data Sets: Data Set Administration [1..1], File Location [1..1], Data Set Type [1..1]
  Products: Product Administration [1..1], File Formats [1..N], Prices [1..N]
 Association role names: Incorporates / Incorporated by, Designs / Designed by,
  Implements / Implemented by, Uses / Used by, Instruments / Instrumented by,
  Creates / Created by, Processes / Processed by, Generate / Generated, Parent / Child]
Figure 3: Business Dimension Model
3.2.1 Survey Life Cycle
The model supports the survey life cycle. Content, Planning, and Design are captured in the
Planning and Design, Frame and Sample, Methodology and Algorithms, and Survey classes.
Collection is captured in the Questionnaire, Survey, Survey Instance, Data Set, and Systems
classes. Processing and Analysis are captured in the Survey Instance, Methodology and
Algorithms, Systems, and Data Set classes. Finally, Dissemination is captured in the Data
Set, Systems, and Product classes.
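The mapping just described can be written down as a simple lookup table. The Python sketch below transcribes it; the variable and function names are assumptions made for the example. The reverse lookup shows which stages a given class supports.

```python
# Stage -> classes, transcribed from the paragraph above.
_DESIGN = ["Planning and Design", "Frame and Sample",
           "Methodology and Algorithms", "Survey"]
_PROCESS = ["Survey Instance", "Methodology and Algorithms", "Systems", "Data Set"]

LIFE_CYCLE_CLASSES = {
    "Content": _DESIGN,
    "Planning": _DESIGN,
    "Design": _DESIGN,
    "Collection": ["Questionnaire", "Survey", "Survey Instance",
                   "Data Set", "Systems"],
    "Processing": _PROCESS,
    "Analysis": _PROCESS,
    "Dissemination": ["Data Set", "Systems", "Product"],
}


def stages_for(class_name):
    """Return the life cycle stages, in order, that a class helps describe."""
    return [stage for stage, classes in LIFE_CYCLE_CLASSES.items()
            if class_name in classes]


print(stages_for("Systems"))
# ['Collection', 'Processing', 'Analysis', 'Dissemination']
```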
3.2.2 Questionnaire Model, Linking Business – Data Dimensions
The CMR model contains many links between the various dimensions within the model.
Using the detailed questionnaire model, Figure 4 illustrates links between questionnaires and
data elements. Data Element Concepts and Questions are linked, because they each express
concepts describing the same data, albeit from a different perspective. Value Domains and
Response Choices each describe the valid values some data can take.
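One use of these links is checking that a question's response choices agree with the value domain of the corresponding data. The Python below is a sketch under assumptions: the class and field names are simplified stand-ins for the model's classes, and the question, codes, and identifier are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ValueDomain:
    permissible_values: Dict[str, str]  # stored value -> value meaning


@dataclass
class Question:
    text: str
    data_element_concept: str          # hypothetical identifier of the linked concept
    response_choices: Dict[str, str]   # choice text -> stored value


def unmatched_choices(question: Question, domain: ValueDomain) -> List[str]:
    """List response choices whose stored value is not permitted by the domain."""
    return [choice for choice, value in question.response_choices.items()
            if value not in domain.permissible_values]


worked_last_week = ValueDomain({"1": "Worked for pay", "2": "Did not work for pay"})
question = Question(
    text="Did you work for pay last week?",
    data_element_concept="person.work for pay last week",
    response_choices={"Yes": "1", "No": "2", "Don't know": "9"},
)
print(unmatched_choices(question, worked_last_week))  # ["Don't know"]
```

A non-empty result signals that either the questionnaire or the value domain needs revision before the two are linked in the repository.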
3.3 Administration and Document Dimensions
The Registration Authority (ISO/IEC 11179, 2000) establishes the rules under which the
repository operates. Monitoring metadata quality, monitoring the life cycle of the described
objects, and maintaining paths of accountability for metadata are important functions.
[Figure 4 is a UML class diagram. Classes and attributes:
  Question Map: Map Identifier [1..1]
  Questionnaire: Questionnaire Administration [1..1], OMB Number [1..1]
  Question: Question Administration [1..1], Question Text [1..1]
  Response Choice: RC Identifier [1..1], Choice Text [1..N]
  Data Element Concept: DEC Administration [0..1], Object Class [0..1], Property [0..1]
  Value Domain: VD Administration [1..1], Permissible Values [0..N], Description [0..1],
    Data Type [1..1]
 Association role names: Maps / Mapped by, Contains / Contained by, Links / Linked by,
  Precedes / Preceded by, Follows / Followed by, Corresponds to, Parent / Child]
Figure 4: Questionnaire Model
An Administration Record is established each time an object is described, or registered. The
common attributes are provided along with specialized attributes for each object. Metadata is
often provided in the form of documents, so URLs to relevant documents are critical
metadata. The model allows links to as many documents as necessary. Each document may
be linked to many objects. Figure 5 provides a data model for the registration process.
[Figure 5 is a UML class diagram. Classes and attributes:
  Registration Authority: RAIdentifier [1..1]
  Administration Record: Identifier [1..1], Registration Status [1..1],
    Creation Date [1..1], Last Change Date [0..1], Administrative Status [1..1]
  Time Frame: Begin Date [0..1], Until Date [0..1]
  Organization: Name [1..1]
  Contact: Name [1..1], Address [1..1], Phone [1..1], Email [0..1]
  Document: Doc Administration [1..1], URL [1..N]
 Association role names: Registers / Registered by, Submits / Submitted by,
  Validates / Valid in, Documents / Documented by, Administers / Administered by,
  Manages / Managed by]
Figure 5: Administration and Documents Dimensions Model
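The many-to-many link between documents and registered objects can be sketched as follows. The Python is illustrative only: it keeps a subset of the attributes named in Figure 5, and the status values, identifiers, dates, and URL are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Document:
    urls: List[str]  # a document carries one or more URLs


@dataclass
class AdministrationRecord:
    identifier: str
    registration_status: str
    administrative_status: str
    creation_date: date
    last_change_date: Optional[date] = None
    documents: List[Document] = field(default_factory=list)


def records_documented_by(document, records):
    """Return identifiers of every registered object linked to the document."""
    return [r.identifier for r in records
            if any(d is document for d in r.documents)]


# Hypothetical document shared by two registered objects.
design_report = Document(["https://example.org/docs/sample-design.pdf"])
records = [
    AdministrationRecord("SURVEY-001", "registered", "final", date(2000, 1, 15),
                         documents=[design_report]),
    AdministrationRecord("FRAME-007", "registered", "draft", date(2000, 2, 1),
                         documents=[design_report]),
    AdministrationRecord("QUEST-003", "registered", "final", date(2000, 3, 9)),
]
print(records_documented_by(design_report, records))
# ['SURVEY-001', 'FRAME-007']
```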
3.4 Terminology and Classification Dimensions Model
The CMR model contains a dimension for managing classification schemes used to classify
objects the CMR model describes. A classification scheme is an ontology. So, concepts
represented in these ontologies are associated with CMR objects. The classification of
objects ties them to particular subject fields, such as the concept system determined by the
variables of interest for a survey. In other words, the concepts a survey is trying to measure
make up part of the subject field for that survey. For instance, assigning an object class and
property, as described in section 3.1, is a form of classification.
The CMR model also supports terminology management as it applies to the concepts, terms,
and characteristics that describe registered objects. One of the most important metadata
elements for any registered object is its definition. This is crucial for understanding the
meaning of the object. The CMR model supports semantics. Terminology management is a
fundamental part of that aim. Figure 6 provides the terminology and classifications model.
[Figure 6 is a UML class diagram. Classes and attributes:
  Classification Scheme: Classification Scheme Administration [1..1]
  Classification Scheme Item: Name [1..1], Value [1..1]
  Classification Scheme Item Relationship: Description [1..1]
  Definition: Definition [1..1]
  Designation: Designation [1..1]
  Language: Language Identifier [1..1]
  Context: Administration [1..1], Description [1..1]
  Administration Record: Identifier [1..1], Registration Status [1..1],
    Administrative Status [1..1]
 Association role names: Associates, Contains / Contained in, Classifies / Classified by,
  Expresses / Expressed in, Defines / Defined by, Designates / Designated by]
Figure 6: Terminology and Classification Dimensions Model
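A small Python sketch of the two ideas in this section follows: registered objects carry definitions and designations per language, and classification scheme items classify those objects. The class names are simplified, and the scheme, items, definitions, and terms are hypothetical examples rather than content of any actual repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ClassificationSchemeItem:
    scheme: str
    name: str
    value: str


@dataclass
class RegisteredObject:
    identifier: str
    definitions: Dict[str, str]         # language identifier -> definition
    designations: Dict[str, List[str]]  # language identifier -> terms
    classified_by: List[ClassificationSchemeItem] = field(default_factory=list)


def classified_under(item, objects):
    """Return identifiers of the objects that the scheme item classifies."""
    return [o.identifier for o in objects if item in o.classified_by]


labor = ClassificationSchemeItem("Survey subject fields", "Labor force", "LF")
housing = ClassificationSchemeItem("Survey subject fields", "Housing", "HS")

objects = [
    RegisteredObject(
        "DEC-001",
        {"en": "Whether a person worked for pay in the reference week"},
        {"en": ["employment status"], "fr": ["situation d'emploi"]},
        [labor],
    ),
    RegisteredObject(
        "DEC-002",
        {"en": "Whether a dwelling is owned or rented by its occupants"},
        {"en": ["tenure"]},
        [housing],
    ),
]

print(classified_under(labor, objects))  # ['DEC-001']
print(objects[0].designations["fr"])     # ["situation d'emploi"]
```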
4. Conclusion
This paper contains a description of the CMR model. The model is described first through a
series of perspectives, including one from terminology. The terminology perspective is a
unifying principle for the discussion. Another perspective is dimensions. The model is
broken into conceptual parts, called dimensions. Each dimension model can stand alone, but
when interconnected to form the CMR model, the dimensions provide a detailed description
of surveys and data.
The CMR model is primarily about semantics, the meaning of things. It is also designed to
work with computer systems and provide embedded metadata, and it is designed to work with
people who wish to understand data, processes, or designs.
Finally, repositories designed around the CMR model are being built at the Census Bureau
and Statistics Canada. The Bureau of Labor Statistics and the National Institute for Statistics
in China are at the beginning stages of building repositories based on the CMR model.
5. References
Abiteboul, S., Buneman, P., and Suciu, D. (2000) Data on the Web, San Francisco: Morgan
Kaufmann Publishers.
Gillman, D.W. and Appel, M.V. (1999) "Statistical Metadata Research at the Census
Bureau", Proceedings of 1999 Federal Committee on Statistical Methodology Research
Conference, 15-17 November, Washington, DC.
ISO FDIS 704 (1999) Terminology work – Principles and methods, Geneva: International
Organization for Standardization.
ISO DIS 1087-1 (1998) Terminology work – Vocabulary – Part 1: Theory and application,
Geneva: International Organization for Standardization.
ISO/IEC 11179 (2000) Information technology - Specification and standardization of data
elements –
(1999) Part 1: Framework
(2000) Part 2: Classification for data elements
(1994) Part 3: Basic attributes for data elements
(1995) Part 4: Rules and guidelines for formulation of data definitions
(1995) Part 5: Naming and identification of data elements
(1996) Part 6: Registration of data elements, Geneva: International Organization for
Standardization and International Electrotechnical Commission.
ISO/IEC CD 11179-3 (2000) Information technology – Metadata registries – Part 3:
Metamodel for a metadata registry, Geneva: International Organization for Standardization
and International Electrotechnical Commission.
LaPlant, W.P., Lestina, G., Gillman, D.W., and Appel, M.V. (1996) "Proposal for a
Statistical Metadata Standard", Proceedings of Census Annual Research Conference '96,
Crystal City, VA.
Kent, J.-P., et al (1999) "Take Care of the Meta, and the Meta Will Take Care of the Data",
UN/ECE Worksession on Statistical Metadata, Working Paper #6, Geneva.
Sager, J. (1990) A Practical Course in Terminology Processing, Amsterdam: John Benjamins
Publishing.
Sundgren, B. (1992) "Organizing the Metainformation Systems of a Statistical Office", R&D
Report 1992:10, Statistics Sweden.