CORPORATE METADATA REPOSITORY (CMR) MODEL

Daniel W. Gillman
U.S. Bureau of Labor Statistics
_________________________________________________________

1. Introduction

The term metadata means, loosely, data about data. Because metadata is itself data, it can be stored in a database. Such a database is popularly called a repository or registry. This paper describes a conceptual model for a metadata repository serving an entire statistical organization, called the Corporate Metadata Repository (CMR) Model. From this point forward, the term metadata means statistical metadata.

The CMR model is designed to support the metadata necessary to describe the survey life cycle, the linkages between similar designs and processes used across surveys, and the use of metadata to drive systems in support of the survey life cycle (Gillman and Appel, 1999). The CMR model supports this broad view of metadata for a survey through its set of classes. It supports linking designs and processes across surveys over time through its relationships. Finally, it supports metadata that drives data dissemination and automated survey design and processing systems through the attributes defined for each class.

The rest of the paper describes the CMR model in a way that justifies the claims in the previous paragraph. Several perspectives the model supports are provided first. Each perspective represents a way to understand the organization of the model. The model itself is then described from three of those perspectives: dimensions, terminology, and survey life cycle. Finally, there is a short conclusion.

2. Perspectives

This section describes several perspectives on the model. Each perspective describes some characteristics the model has, content the model represents, systems the model supports, or survey activities the model describes. These perspectives are outlined in the sections below.

2.1 Conceptual Model

The CMR model is a conceptual model rather than a logical or physical one. A logical model specifies all the data an application will use, their types, and their relationships. A physical model describes how the data will be physically organized for use by an application system, with the specific needs of that system taken into account as part of the design, e.g., de-normalization decisions. A conceptual model, on the other hand, is a representation of the human understanding of the data, including the relationships that exist between the data, and not necessarily how the data will be represented in the logical model. Conceptual models are meant for people to read and understand.

2.2 Standards for CMR

The CMR model is based on an international metadata standard called Specification and standardization of data elements (ISO/IEC 11179, 2000), which specifies a set of attributes for describing data elements. The standard is now under revision. The revision has expanded the scope of the standard, produced a full conceptual model for a metadata registry, and renamed the standard Metadata registries (ISO/IEC CD 11179-3, 2000).

Three terms help place the standard. A content standard describes content rather than format; a metadata standard describes the data necessary to describe other data or processes; and data semantics refers to the meaning of data. ISO/IEC 11179 is a metadata content standard focused on the semantics of data. It is well suited to the CMR, because the understanding of data is so important in statistics. The CMR model is an extension of the ISO/IEC 11179 (revised) metamodel.

Another view of the ISO/IEC 11179 metamodel is that it provides a mechanism for humans to locate, retrieve, and understand metadata. It is this human side that separates this standard from other metadata standards. An analogy is the card catalog system employed by libraries, which is designed for people to use and understand. The ISO/IEC 11179 metamodel supports the card catalog analogy for data descriptions.
The CMR model extends this analogy to statistical surveys.

2.3 Metadata Orientation

The CMR model supports both the production-oriented and output-oriented purposes of statistical information systems (SIS's) (Sundgren, 1992). Output-oriented SIS's are commonly called data dissemination systems, and they are usually accessible via the Internet. Production-oriented SIS's are automated survey design and processing systems. Increasingly, both types of systems have a strong metadata component as part of their design. The production-oriented SIS's may also write metadata to their metadata component. SIS's with metadata components are metadata driven if the metadata is used to determine the path the system takes during a session (Kent et al., 1999). Broadly defined metadata repositories can serve as the metadata component for many SIS's. The CMR model supports the design of such metadata repositories.

2.4 Metadata Structure

Metadata, like data, can be structured, semi-structured, or unstructured. Data is structured if both its type and schema are known (Abiteboul et al., 2000). If only one of the two is known, the data is semi-structured, and if neither is known, the data is unstructured. The CMR model supports all three kinds of metadata. For example, documents, a common source of metadata in statistical organizations, are semi-structured.

A special class of structured metadata is called embedded (Kent et al., 1999). Embedded metadata is used directly by software applications. The CMR model has many structured elements, and the items that populate these elements can be embedded.

2.5 Survey Life Cycle

For the CMR model, the survey life cycle comprises the following stages (LaPlant et al., 1996): Content, Planning, Design, Collection, Processing, Analysis, and Dissemination. The CMR model accounts for the metadata needed to support and describe all of these stages.

2.6 Dimensions

The CMR model is organized as a set of overlapping dimensions.
Each dimension is a model on its own, but the combination, realized through relationships across the dimensions, is greater than the sum of its parts. The main dimensions are Data, Business, Administration, Documents, Terminology, and Classification.

The data dimension describes data elements and their allowed values (called value domains). The model distinguishes a conceptual piece (the data element concept), which contains the definition, from the representation, which contains the value domain and data type.

The business dimension contains a model that describes surveys and their components. This part of the model contains the classes of objects that the business operation of the statistical organization needs to keep track of.

Documents require only small amounts of metadata to keep track of them; it is the relationships of the documents to business objects that give them context and classify them. These are contained in the document dimension.

The administration dimension handles the characteristics common to all the classes in the other dimensions. It keeps track of metadata quality and the metadata management life cycle for each object being described.

The classification dimension describes classification schemes that are used to classify objects, and the terminology dimension manages the concepts and terms describing objects.

2.7 Hierarchical

The CMR model is also organized from a hierarchical perspective. This perspective helps show how concepts play a fundamental role in metadata. Concepts are understood from the perspective of terminology theory. Terminology theory is briefly described below, and the application of the theory to the CMR model is then explored.

2.7.1 Terminology

In the theory of terminology, concepts are mental constructs or units of thought. Concepts are organized or grouped by common elements, called characteristics. Essential characteristics are necessary and sufficient for identifying a concept. Other characteristics are inessential.
The sum of characteristics that constitute a concept is called its intension. The set of objects a concept refers to is its extension. In natural language, concepts are expressed through definitions, which specify a unique intension and extension. A designation (term, appellation, or symbol) represents a concept (Sager, 1990; ISO FDIS 704, 1999).

A subject field is a branch of human knowledge. A subject field comprises a set of related concepts, or concept system. A terminology is the set of designations that represent the concepts in a concept system. The designations make up a special language, which is used in a subject field (ISO DIS 1087-1, 1998).

2.7.2 Explaining the Model

Every model is an example of a structured terminology, or ontology. This is because the set of classes and their relationships constitute a concept system; the attributes of each class are its characteristics, though not all of them are essential; and the objects populated for each class are part of its extension. So, the CMR model is an ontology for metadata.

Terminology describes the relationship between conceptual and representational classes in the data dimension, such as data element concepts and value domains. These are determined in the design of the questions in a questionnaire and used in the schema of a database or of survey design and processing systems. The systems are implementations of algorithms based on methodologies. The questionnaire and databases reflect the characteristics of the units in the sample, which is selected from a sampling frame, which in turn is an instantiation of the universe. Both the universe and the methodologies are conceptual, which brings the outside frame of Figure 1 back to the inside frame.

[Figure 1: Terminology Hierarchy. Nested frames, from outermost to innermost: Sample/Frame/Universe, Algorithm/Methodology; Questionnaire, Database, Processing System; Data Element Concept, Value Domain; Terminology.]

3. CMR Model

This section contains a partial description of the CMR model.
The description is broken into sub-sections based on the dimension perspective of section 2.6. The support for the survey life cycle perspective of section 2.5 is explained in section 3.2.1. Finally, the terminology perspective, a unifying principle, is used to explain aspects of the model.

Each of the sub-sections contains a model represented in the Unified Modeling Language (UML). These models are less detailed versions of those found in the CMR model. The attributes depicted in each class stand in for greater detail in the actual model.

3.1 Data

The data dimension describes data, i.e., data elements and their associated classes. There are four primary classes in this dimension of the model: Data element, Data element concept, Value domain, and Conceptual domain.

A data element concept is the conceptual part of a data element, described independently of any particular representation. From a data modeling perspective, it is composed of two parts: object classes and properties. Object classes are the things about which we wish to collect and store data. They are concepts. Examples are cars, persons, households, employees, and orders. Properties are what humans use to distinguish or describe objects. They are characteristics, not necessarily essential ones, of the object class and form its intension. They are also concepts. Examples of properties are color, model, sex, age, income, address, and price.

Since a data model is an ontology, and object classes and properties are concepts, assigning an object class and property to a data element concept classifies it. For some data element concepts, however, the classification may be much more complex. A data element concept may have many characteristics, and they may not divide meaningfully or uniquely into an object class and a property.

The representation describes the form of the data, including a value domain, data type, and, if necessary, a unit of measure.
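
The split between a data element concept and its representation can be sketched in code. The following is a minimal illustration only: the class names follow the text above, but the field names, the example values, and the helper `same_concept` are hypothetical and are not taken from the CMR model itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElementConcept:
    """Conceptual part of a data element (hypothetical field names)."""
    object_class: Optional[str] = None  # e.g. "Person"; optional, as in Figure 2
    prop: Optional[str] = None          # e.g. "Income"; optional, as in Figure 2


@dataclass(frozen=True)
class Representation:
    """Form of the data: value domain, data type, optional unit of measure."""
    value_domain: str
    data_type: str
    unit_of_measure: Optional[str] = None


@dataclass(frozen=True)
class DataElement:
    """A data element pairs one concept with one representation."""
    concept: DataElementConcept
    representation: Representation


def same_concept(a: DataElement, b: DataElement) -> bool:
    """True when two data elements share the same data element concept."""
    return a.concept == b.concept


# Illustrative values only: one concept, two different representations.
person_income = DataElementConcept(object_class="Person", prop="Income")
income_exact = DataElement(
    person_income,
    Representation("non-negative amounts", "integer", "US dollars"),
)
income_bracket = DataElement(
    person_income,
    Representation("income bracket codes", "character"),
)
```

Here `income_exact` and `income_bracket` are distinct data elements, yet `same_concept` treats them as conceptually equivalent because they share the object class and property.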
Value domains are sets of allowed values for data elements. A non-enumerated domain is a value domain where the allowed values are specified by a description. An enumerated domain is a value domain where the allowed (permissible) values are listed. These values have meanings, called value meanings.

A data element concept may be associated with different value domains as needed to form conceptually similar data elements. Those value domains are conceptualized by a conceptual domain, which links conceptually similar value domains. Each value domain is part of the extension of its conceptual domain, and the allowed values are the intension of the value domain itself. For a conceptual domain, the set of value meanings is its intension. Figure 2 illustrates the main ideas in this section (ISO/IEC CD 11179-3, 2000).

[Figure 2: Data Dimension Model. UML class diagram. Classes and attributes: Data Element Concept (DEC Administration: 0..1, Object Class: 0..1, Property: 0..1); Conceptual Domain (CD Administration: 0..1, Value Meanings: 0..N); Data Element (DE Administration: 1..1, Derivation: 0..1); Value Domain (VD Administration: 1..1, Permissible Values: 0..N, Description: 0..1, Data Type: 1..1). Relationship roles shown: Specifying/Having, Expressed by/Expressing, Represented by/Representing.]

3.2 Business Dimension

The Business Dimension describes the business of statistical organizations. It is composed of classes, attributes, and relationships that describe data that the organization needs to keep about surveys. The model supports the storage of metadata as single attributes or as documents. Figure 3 shows the Business Dimension model.

The model describes survey designs, processing, analyses, and data sets. It contains classes for each of the important parts of a survey. The model supports organized storage and complex searches for metadata describing a survey, and it supports searches for metadata across multiple surveys.
The model also provides several other features:

  - A list of all current surveys conducted by the agency
  - Comparison of designs, specifications, or procedures across surveys
  - Reuse of designs, specifications, or procedures
  - Categorizing and classifying documents
  - Assembling complete documentation for a survey
  - Attributes to support embedded metadata

[Figure 3: Business Dimension Model. UML class diagram. Classes and attributes: Survey (Survey Administration: 1..1, OMB Number: 1..1); Planning & Design (P&D Administration: 1..1, Planning Documents: 0..N, Design Documents: 0..N); Methodology & Algorithms (M&A Administration: 1..1, Methodology Desc: 0..N, Algorithm Desc: 0..N); Frame & Sample (F&S Administration: 1..1, Sampling Scheme: 1..1, Sample size: 1..1); Survey Instance (SI Administration: 1..1, Response Rate: 1..1, Refusal Rate: 1..1); Systems (System Administration: 1..1, Hardware: 0..N, Software: 0..N); Questionnaire (Questionnaire Administration: 1..1, Questions: 1..N, Response Choices: 1..N); Data Sets (Data Set Administration: 1..1, File Location: 1..1, Data Set Type: 1..1); Products (Product Administration: 1..1, File Formats: 1..N, Prices: 1..N). Relationship roles shown: Incorporates/Incorporated by, Designs/Designed by, Implements/Implemented by, Uses/Used by, Instruments/Instrumented by, Creates/Created by, Processes/Processed by, Generate/Generated, Parent/Child.]

3.2.1 Survey Life Cycle

The model supports the survey life cycle. Content, Planning, and Design are captured in the Planning and Design, Frame and Sample, Methodology and Algorithms, and Survey classes. Collection is captured in the Questionnaire, Survey, Survey Instance, Data Set, and Systems classes. Processing and Analysis are captured in the Survey Instance, Methodology and Algorithms, Systems, and Data Set classes. Finally, Dissemination is captured in the Data Set, Systems, and Product classes.
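
The stage-to-class coverage just described can be written down as a small lookup table, which makes it easy to ask the reverse question: which life cycle stages does a given class support? The sketch below is illustrative only. The stage and class names come from the text, while the dictionary layout and the function name `stages_for_class` are assumptions made for this example.

```python
# Classes that capture each survey life cycle stage, as listed in section 3.2.1.
# The data structure is an illustrative assumption, not part of the CMR model.
_DESIGN_CLASSES = frozenset({
    "Planning and Design", "Frame and Sample",
    "Methodology and Algorithms", "Survey",
})
_PROCESSING_CLASSES = frozenset({
    "Survey Instance", "Methodology and Algorithms", "Systems", "Data Set",
})

STAGE_CLASSES = {
    "Content": _DESIGN_CLASSES,
    "Planning": _DESIGN_CLASSES,
    "Design": _DESIGN_CLASSES,
    "Collection": frozenset({
        "Questionnaire", "Survey", "Survey Instance", "Data Set", "Systems",
    }),
    "Processing": _PROCESSING_CLASSES,
    "Analysis": _PROCESSING_CLASSES,
    "Dissemination": frozenset({"Data Set", "Systems", "Product"}),
}


def stages_for_class(class_name):
    """Return the life cycle stages, in order, that a class helps capture."""
    return [
        stage for stage, classes in STAGE_CLASSES.items()
        if class_name in classes
    ]
```

For example, `stages_for_class("Systems")` returns the Collection, Processing, Analysis, and Dissemination stages, which reflects how the Systems class spans most of the life cycle after design.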

3.2.2 Questionnaire Model, Linking Business – Data Dimensions

The CMR model contains many links between its dimensions. Using the detailed questionnaire model, Figure 4 illustrates links between questionnaires and data elements. Data Element Concepts and Questions are linked, because they each express concepts describing the same data, albeit from different perspectives. Value Domains and Response Choices are linked, because they each describe the valid values some data can take.

[Figure 4: Questionnaire Model. UML class diagram. Classes and attributes: Questionnaire (Questionnaire Administration: [1..1], OMB Number: [1..1]); Question (Question Administration: [1..1], Question Text: [1..1]); Response Choice (RC Identifier: [1..1], Choice Text: [1..N]); Question Map (Map Identifier: [1..1]); Data Element Concept (DEC Administration: [0..1], Object Class: [0..1], Property: [0..1]); Value Domain (VD Administration: [1..1], Permissible Values: [0..N], Description: [0..1], Data Type: [1..1]). Relationship roles shown: Maps/Mapped by, Contains/Contained by, Links/Linked by, Follows/Followed by, Precedes/Preceded by, Corresponds to, Parent/Child.]

3.3 Administration and Document Dimensions

The Registration Authority (ISO/IEC 11179, 2000) establishes the rules under which the repository operates. Monitoring metadata quality, monitoring the life cycle of the described objects, and maintaining paths of accountability for metadata are important functions.

An Administration Record is established each time an object is described, or registered. The common attributes are provided along with specialized attributes for each object.

Metadata is often provided in the form of documents, so URLs to relevant documents are critical metadata. The model allows links to as many documents as necessary. Each document may be linked to many objects. Figure 5 provides a data model for the registration process.

[Figure 5: Administration and Documents Dimensions Model. UML class diagram. Classes and attributes: Registration Authority (RAIdentifier: [1..1]); Administration Record (Identifier: [1..1], Registration Status: [1..1], Creation Date: [1..1], Last Change Date: [0..1], Administrative Status: [1..1]); Time Frame (Begin Date: [0..1], Until Date: [0..1]); Organization (Name: [1..1]); Contact (Name: [1..1], Address: [1..1], Phone: [1..1], Email: [0..1]); Document (Doc Administration: [1..1], URL: [1..N]). Relationship roles shown: Registers/Registered by, Submits/Submitted by, Validates/Valid in, Documents/Documented by, Administers/Administered by, Manages/Managed by.]

3.4 Terminology and Classification Dimensions Model

The CMR model contains a dimension for managing the classification schemes used to classify the objects the CMR model describes. A classification scheme is an ontology, so the concepts represented in these ontologies are associated with CMR objects. The classification of objects ties them to particular subject fields, such as the concept system determined by the variables of interest for a survey. In other words, the concepts a survey is trying to measure make up part of the subject field for that survey. For instance, assigning an object class and property, as described in section 3.1, is a form of classification.

The CMR model also supports terminology management as it applies to the concepts, terms, and characteristics that describe registered objects. One of the most important metadata elements for any registered object is its definition, which is crucial for understanding the meaning of the object. The CMR model supports semantics, and terminology management is a fundamental part of that aim. Figure 6 provides the terminology and classifications model.

[Figure 6: Terminology and Classification Dimensions Model. UML class diagram. Classes shown: Classification Scheme, Classification Scheme Item, Classification Scheme Item Relationship, Definition, Language, Context, Designation, Administration Record. Attribute labels shown: Classification Scheme Administration: [1..1]; Description: [1..1]; Name: [1..1]; Value: [1..1]; Definition: [1..1]; Context Administration: [1..1]; Description: [1..1]; Language Identifier: [1..1]; Designation: [1..1]; Identifier: [1..1]; Registration Status: [1..1]; Administrative Status: [1..1]. Relationship roles shown: Associates, Contains/Contained in, Classifies/Classified by, Expresses/Expressed in, Defines/Defined by, Designates/Designated by.]

4. Conclusion

This paper contains a description of the CMR model. The model is described first through a series of perspectives, including one from terminology. The terminology perspective is a unifying principle for the discussion. Another perspective is dimensions: the model is broken into conceptual parts, called dimensions. Each dimension model can stand alone, but when the dimensions are interconnected to form the CMR model, they provide a detailed description of surveys and data.

The CMR model is primarily about semantics, the meaning of things. However, it is also designed to work with computer systems and provide embedded metadata, and it is designed to work with people who wish to understand data, processes, or designs.

Finally, repositories designed around the CMR model are being built at the Census Bureau and Statistics Canada. The Bureau of Labor Statistics and the National Institute for Statistics in China are at the beginning stages of building repositories based on the CMR model.

5. References

Abiteboul, S., Buneman, P., and Suciu, D. (2000) Data on the Web, San Francisco: Morgan Kaufmann Publishers.

Gillman, D.W. and Appel, M.V.
(1999) "Statistical Metadata Research at the Census Bureau", Proceedings of 1999 Federal Committee on Statistical Methodology Research Conference, 15-17 November, Washington, DC.

ISO FDIS 704 (1999) Terminology work – Principles and methods, Geneva: International Organization for Standardization.

ISO DIS 1087-1 (1998) Terminology work – Vocabulary – Part 1: Theory and application, Geneva: International Organization for Standardization.

ISO/IEC 11179 (2000) Information technology – Specification and standardization of data elements, Geneva: International Organization for Standardization and International Electrotechnical Commission.
    (1999) Part 1: Framework
    (2000) Part 2: Classification for data elements
    (1994) Part 3: Basic attributes for data elements
    (1995) Part 4: Rules and guidelines for formulation of data definitions
    (1995) Part 5: Naming and identification of data elements
    (1996) Part 6: Registration of data elements

ISO/IEC CD 11179-3 (2000) Information technology – Metadata registries – Part 3: Metamodel for a metadata registry, Geneva: International Organization for Standardization and International Electrotechnical Commission.

Kent, J.-P., et al. (1999) "Take Care of the Meta, and the Meta Will Take Care of the Data", UN/ECE Worksession on Statistical Metadata, Working Paper #6, Geneva.

LaPlant, W.P., Lestina, G., Gillman, D.W., and Appel, M.V. (1996) "Proposal for a Statistical Metadata Standard", Proceedings of Census Annual Research Conference '96, Crystal City, VA.

Sager, J. (1990) A Practical Course in Terminology Processing, Amsterdam: John Benjamins Publishing.

Sundgren, B. (1992) "Organizing the Metainformation Systems of a Statistical Office", R&D Report 1992:10, Statistics Sweden.