Draft Memo 2011-06-10
ESSnet on Data Warehousing
Statistics Sweden
Lars-Göran Lundell

Some Metadata Definitions

The purpose of this paper is to initiate a discussion on what metadata are essential to and specific to a statistical data warehouse. A reasonable starting point is to establish some basic definitions. To that end the Internet and the bookshelves were searched for metadata and data warehousing related information. In particular, Internet sites set up by national and international organisations working with statistics and/or standards were searched. Most of the results shown below come from Eurostat, OECD, UNECE as statistics sites and NISO, ISO as standards organisations. Detailed search results have been compiled in Annex 1.

Metadata and data

Metadata and statistical metadata

General definitions of metadata can be found in many books and many sites on the Internet. Most of them are very short and simple. The most commonly used generic definition states that:

    Metadata are data about data

There are some variations on the theme, e.g. claiming that metadata should (or must) be structured or formalised. Perhaps somewhat unexpectedly, the sources that have a relation to statistics give definitions that are even shorter and vaguer than some of the general purpose sources. The OECD definition of statistical metadata is for example simply:

    Statistical metadata are data about statistical data

This definition will obviously cover all kinds of documentation with some reference to any type of statistical data and is applicable to metadata that refer to data stored in a statistical data warehouse as well as any other type of data store.

Data and statistical data

Since the definition of metadata shows that they are just a special case of data, we need a reasonable definition of data as well.
A definition derived from a number of slightly varying definitions would be:

    Data are qualitative and/or quantitative information collected through observation

Just as for statistical metadata, several definitions of statistical data can be found. OECD provides this definition:

    Statistical data are data from a survey or administrative source used to produce statistics

For statistical data warehouse purposes this definition has to be slightly revised:

    Statistical data are data from one or several surveys and/or administrative sources used to produce statistics

Metadata categories

Metadata may describe many different aspects of data. Hence metadata can be categorised in a number of ways, along overlapping dimensions. Consequently, each metadata item normally belongs to several categories.

Active vs. passive metadata

Traditionally, metadata have been seen as documentation of an existing object or a process, such as a statistical production process, that is running or has already finished, i.e. the result of a task most often carried out as the last, even optional, step of the production process. This indicates a passive, recording role, which is useful for documenting, e.g., the methods used to plan and carry out a survey or the quality achieved for the final results. Passive metadata will become more active if they are used as input for planning, e.g., a new survey round or a new similar statistics product. The term active metadata should, however, be reserved for metadata that are operational. Active metadata may be regarded as an intermediate layer between the user and the data, which can be used by humans or computer programmes to search, link, retrieve or perform other operations on data. Thus active metadata may contain rules or code (algorithmic metadata). Some authors use the term active only for those metadata, i.e.
those that can be interpreted or executed at runtime to support metadata driven processes, calling all other non-passive metadata semi-active.

Passive metadata are used as documentation in all statistics production regardless of storage environment. In a statistical data warehouse active metadata must be available in what is often called the metadata layer.

Suggested definitions:

    Active metadata are metadata stored and organised in a way that enables operational use, manual or automated, for one or more processes (GSBPM)

    Passive metadata are any metadata that are not active

Structured vs. free-form textual metadata

As mentioned above, some authors claim that metadata must be structured, or formalised. The opposite would probably be metadata in a completely free form. In practice all metadata probably follow some kind of structure, which may be more or less strict. At one end we have completely and strictly formalised metadata, meaning that only pre-determined codes or numerical information from a pre-determined domain may be used. At the other end we find a loose structure, e.g. a set of chapters, subdivisions, headings, etc. that may be mandatory or optional and whose contents may adhere to some rules or may be entered in a completely free form (text, diagrams, etc.).

Strictly structured metadata are obviously well suited for use in an active role, but there is no simple, unambiguous mapping between active and structured, and passive and free-form, respectively. Since active metadata are vital to building an efficient statistical data warehouse it follows that in that environment metadata should also be well structured, whenever possible.
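The link between structured and active metadata can be illustrated with a small sketch. The Python example below is only an illustration under stated assumptions: the class `VariableMetadata`, the function `validate`, the variable names and the code lists are all hypothetical and do not come from any of the sources. It stores a value domain as strictly structured metadata and then uses it operationally to check incoming records, which is what would make the metadata active in the sense discussed above. The free-text description in the same object remains passive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableMetadata:
    """Metadata for one variable (hypothetical layout, for illustration only)."""
    name: str
    description: str         # free-form, textual part
    value_domain: frozenset  # strictly structured part: the permitted codes


def validate(record: dict, metadata: list) -> list:
    """Return (variable, value) pairs whose value is outside the value domain."""
    errors = []
    for var in metadata:
        value = record.get(var.name)
        if value not in var.value_domain:
            errors.append((var.name, value))
    return errors


# Illustrative code lists, not taken from any real classification database.
SEX = VariableMetadata(
    name="sex",
    description="Sex of the person as recorded in the source register",
    value_domain=frozenset({"1", "2"}),
)
REGION = VariableMetadata(
    name="region",
    description="Region of residence",
    value_domain=frozenset({"SE11", "SE12", "SE21"}),
)

records = [
    {"sex": "1", "region": "SE11"},
    {"sex": "3", "region": "SE99"},
]

# One list of errors per record; an empty list means the record passed.
report = [validate(r, [SEX, REGION]) for r in records]
```

The same value domains could equally be read by a person browsing the documentation, so a single structured item can serve both manual and automated use.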
Suggested definitions:

    Structured metadata are metadata stored and organised according to standardised codes, lists and hierarchies

    Textual metadata are metadata that contain descriptive information using formats ranging from completely free-form to semi-structured

Reference vs. structural metadata

Most sources define two main categories of metadata, most often called business and technical metadata. The distinction between the two varies between authors, but a generalised definition could be that business metadata help the user understand, interpret and evaluate the contents, the subject matter, the quality, etc., of the data, and technical metadata help the user find and access the data by providing attributes such as names and descriptions of files, tables, columns, fields, etc.

In the “statistical sources” the terms business and technical metadata are rarely used. Several different synonyms can be found for business metadata, e.g. conceptual or logical. The most commonly used is, however, reference metadata. Instead of technical metadata you will often find the term structural metadata.

Structural/technical metadata can quite easily be represented as structured and active, while more work and effort are required to make reference/business metadata active by storing them in a structured way.

Other similar categorisations are sometimes used, e.g. the term administrative metadata (cf. NISO) for a subset of structural metadata to define metadata that handle users’ rights to access and utilise data (rights management metadata) and metadata specifically for archiving purposes (preservation metadata).
Suggested definitions:

    Reference metadata are metadata that describe the contents and quality of the data in order to help the user understand and evaluate them (conceptually)

    Structural metadata are metadata that help the user find, identify, access and use the data (physically)

Process metadata

Information on an operation, such as start and end times, result status code, number of records processed, resources used, etc., is a specific type of metainformation. This kind of metadata is known under several names, such as process metadata, process data, process metrics, paradata. These data may contain either expected values or actual outcomes. In both cases they are primarily intended for planning – in the latter case by evaluating finished processes in order to improve recurring or similar ones. Process metadata should be structured to facilitate computer aided evaluation.

Suggested definition:

    Process metadata are metadata that describe the expected or actual outcome of one or more processes using evaluable and operational metrics

Quality metadata

Quality metadata may be read as metadata on the quality of the data or as metadata of (high) quality. Both interpretations are relevant to statistics production and data warehousing.

Keeping track of, maintaining and perhaps raising the quality of the data in the warehouse is an important governance task that requires support from metadata. Quality information should be available in different forms and serve several purposes: to describe the quality achieved (e.g. how a survey was carried out, or what the outcome was), or to measure the outcome (a contribution to the process metadata). The main objective of the former is to serve the end users of the data, while the latter primarily supports governance and future improvements.
Hence quality metadata may be seen as a different dimension that cuts through all the others.

Metadata quality is obviously a very important issue, and it should be high, within the restrictions of reasonable cost-benefit analysis. Inferior metadata quality may lead to unnecessary misinterpretations of the data contents or even result in completely useless data.

Suggested definition:

    Quality metadata are any kind of metadata that contribute to the description or interpretation of the quality of data.

Metadata structures

Several sources claim that the data warehouse needs a central system where its metadata are registered and logically stored, a metadata registry. This registry will make it easier to handle identification, check for duplicates, ensure consistency, etc. It is, however, a logical matter; a centralised metadata registry does not imply that metadata are physically stored in a centralised system.

The term metadata repository is also frequently used, particularly when discussing metadata in relation to data warehousing. In this case the distinction between logical and physical matters seems less clear. The repository is logically centralised, but while some also advocate a centralised physical solution, based on some form of central “metadatabase”, others prefer coordinated, physically distributed systems. This means that a metadata registry may be seen as a subset of a metadata repository, or as a narrower definition.

A third commonly used term is the metadata layer. A data warehouse is often described as consisting of several parts that serve separate functions, sometimes called layers. The metadata layer may in this case be interpreted as a synonym for either the metadata registry or the metadata repository, depending on the exact definitions being used.

Metadata collection and usage

The metadata lifecycle is commonly described as divided into the following three basic phases:

1.
Collection. Metadata should be captured as early as possible in the production process. The sources vary. Collection of some types of metadata can and should be automated. When data are entered into the data warehouse, basic metadata must already exist in a correct form.
2. Maintenance. Metadata must be up to date at all times. Processes must be in place to capture changes and to synchronise metadata with the changing architecture.
3. Deployment. Metadata must be available to users in the right form and with the right tools.

Collection of metadata should be automated whenever possible. This means that, e.g., metadata that exist in the sources, such as administrative data files used as input, should be used directly or in a derived form.

Another way of simplifying metadata collection is to use what already exists. Reuse and inherit are common keywords in metadata literature. One of the major advantages of using metadata is that duplicate and “near-duplicate” data can be revealed and avoided. Reusing data and metadata saves resources and increases efficiency and quality. Revealing, e.g., variables having almost, but not quite, the same definitions can improve harmonisation and comparability. The data harmonisation that will be enabled by metadata harmonisation is a vital task for the data warehouse – possibly the most important and at the same time one of the most difficult ones.

Different user categories need different metadata and have different requirements. End users want to use metadata to easily and correctly find and interpret the data they need. Data stewards want an inventory of what is stored in the data warehouse. Analysts want to compare the data sources. Programmers want to make sure that they use the standard names. These are just a few examples of metadata usage. The use ranges from detailed and operational to overview and descriptive.
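As an illustration of how “near-duplicate” variables might be revealed from their metadata, the following Python sketch compares the textual definitions of variables pairwise. It is only one possible approach: character-based similarity from the standard library module `difflib` is an assumption made here, not a method proposed by any of the sources, and the function name `near_duplicates`, the threshold of 0.8 and the example variables are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(definitions: dict, threshold: float = 0.8) -> list:
    """Return (name_a, name_b, similarity) for definitions that are
    similar but not identical. The threshold is an arbitrary assumption."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(definitions.items(), 2):
        a, b = text_a.lower(), text_b.lower()
        if a == b:
            continue  # exact duplicates are a separate, simpler case
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b, round(ratio, 2)))
    return pairs


# Hypothetical variable definitions from two sources and one unrelated variable.
definitions = {
    "income_a": "Total disposable income of the household during the calendar year",
    "income_b": "Total disposable income of the household during the tax year",
    "age": "Age of the person in completed years at the end of the reference period",
}

candidates = near_duplicates(definitions)
```

In this example only the two income variables are reported as candidates. A list like this is a starting point for manual review, since deciding whether two definitions can actually be harmonised remains a subject matter question.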
Metadata standards

Standards for metadata have been discussed for many years, but still have not developed very far. The most successful effort is probably ISO/IEC 11179, Metadata registries, which is a standard on the conceptual level. Several NSIs have based their metadata systems on that standard.

The Common Warehouse Metamodel (CWM) is a specification for modelling metadata for data warehouses. The standard is supported by the Object Management Group, which in turn is supported by several major software companies.

DDI, the Data Documentation Initiative, is an XML based standard specification for documentation of social science data. It is supported by an international alliance.

SDMX, Statistical Data and Metadata eXchange, is also based on XML.

Several NSIs are currently cooperating on the development of a Generic Statistical Information Model (GSIM), which includes the Common Reference Model (CRM), and is linked to the Generic Statistical Business Process Model (GSBPM). The work is led by ABS.

Metadata for statistical data warehouses

“Metadata is the DNA of the data warehouse, defining its elements and how they work together. [...] Metadata plays such a critical role in the architecture that it makes sense to describe the architecture as being metadata driven.” [1]

Panos Vassiliadis [2] of the University of Ioannina, Greece, summarises well the requirements of data warehouse metadata. They should include information on:

1. the contents of the data warehouse, their location and their structure
2. the processes that take place in the data warehouse
3. the implicit semantics of data along with any other kind of data that aids the end-user exploit the information of the warehouse
4. the infrastructure and physical characteristics of components and the sources of the data warehouse
5.
security, authentication, and usage statistics that aids the administrator tune the operation of the data warehouse as appropriate

[1] Kimball, The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley, 2008, p. 117
[2] Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009

The metadata categories described earlier in this paper are general. Some sources mention metadata categories specific to the data warehouse environment, e.g. ETL metadata (for the “Extract-Transform-Load” process), but these all seem to be subsets or just renamings of the categories already defined.

Looking at the categories, and keeping in mind the specific demands of a statistics production environment, it is possible to assess which categories play special roles in building and maintaining a statistical data warehouse (SDW).

SDW requires active metadata. The number of objects (variables, value domains, etc.) stored makes it necessary to provide the users (persons and software) with active assistance in finding and processing the data.

SDW requires structured metadata. The number of metadata items will be large and the requirement for metadata to be active makes it necessary to structure the metadata very well.

SDW requires structural metadata. Active metadata must, at least in part, be structural.

Process metadata are vital to an SDW. Since the data warehouse supports many concurrent users it is very important to keep track of usage, performance, etc. In a data warehouse that has been less than perfectly designed one user’s choice of tool or operation could impair the performance for other users. An analysis of process metadata can be an input to correcting this anomaly.
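The kind of analysis of process metadata mentioned above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout `ProcessRecord`, the functions `load_by_user` and `heavy_users`, the 50 % share used as a cut-off and the sample log are all hypothetical. The attributes are modelled on the examples of process metadata given earlier in this paper (start and end times, result status code, number of records processed).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProcessRecord:
    """One process metadata entry (hypothetical layout)."""
    user: str
    operation: str
    start: datetime
    end: datetime
    records_processed: int
    status: str  # e.g. "OK" or "FAILED"

    @property
    def seconds(self) -> float:
        """Elapsed time of the operation in seconds."""
        return (self.end - self.start).total_seconds()


def load_by_user(log: list) -> dict:
    """Sum the elapsed seconds of all logged operations per user."""
    totals = defaultdict(float)
    for rec in log:
        totals[rec.user] += rec.seconds
    return dict(totals)


def heavy_users(log: list, share: float = 0.5) -> list:
    """Users whose operations take more than `share` of the total elapsed time."""
    totals = load_by_user(log)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(u for u, s in totals.items() if s / grand_total > share)


# Hypothetical log entries for one morning.
day = (2011, 6, 10)
log = [
    ProcessRecord("anna", "extract", datetime(*day, 9, 0, 0),
                  datetime(*day, 9, 0, 30), 10_000, "OK"),
    ProcessRecord("bo", "cross-tabulation", datetime(*day, 9, 0, 10),
                  datetime(*day, 9, 10, 10), 2_500_000, "OK"),
    ProcessRecord("anna", "extract", datetime(*day, 9, 5, 0),
                  datetime(*day, 9, 5, 20), 5_000, "OK"),
]

totals = load_by_user(log)   # {"anna": 50.0, "bo": 600.0}
flagged = heavy_users(log)   # ["bo"]
```

Elapsed time is only a rough proxy for the load a user places on the warehouse. A real evaluation would probably also use resource figures such as CPU time, which fit in the same structure.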
Singling out these categories does not mean that the remaining metadata categories should be disregarded, but that they are used and needed in a statistical data warehouse in the same way as in any statistics production environment.

Annex 1

Metadata related terms

Sources

Wikipedia: Direct quotations from Wikipedia and from its sources
    http://en.wikipedia.org/wiki/Metadata
    http://en.wikipedia.org/wiki/Data_warehouse
ISO (International Organization for Standardization), ISO/IEC 11179 Metadata registries (MDR)
    http://metadata-stds.org/11179/
NISO (National Information Standards Organization), Understanding Metadata
    http://www.niso.org/publications/press/UnderstandingMetadata.pdf
UNECE Metadata Common Vocabulary, MCV (Draft, March 2006)
    http://circa.europa.eu/Public/irc/dsis/metadata/library?l=/metadata_forces/force_meeting_092007/mtf-6-mcv-anxpdf/_EN_1.0_&a=d
UNECE, Terminology on Statistical Metadata (2000)
    http://www.unece.org/stats/publications/53metadaterminology.pdf
UNECE, Guidelines for the modeling of statistical data and metadata (1995)
    http://www.unece.org/stats/publications/metadatamodeling.pdf
OECD, Glossary of Statistical Terms
    http://stats.oecd.org/glossary/

Terms: Metadata, Statistical metadata

Metadata
  Wikipedia: Data providing information about one or more aspects of the data, such as:
    Means of creation of the data
    Purpose of the data
    Time and date of creation
    Creator or author of data
    Placement on a computer network where the data was created
    Standards used
  ISO: Data that defines and describes other data
  NISO: Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information. Metadata can describe resources at any level of aggregation. It can describe a collection, a single resource, or a component part of a larger resource
  OECD, UNECE (Metadata Common Vocabulary): Data that defines and describes other data.
  UNECE (Terminology, Guidelines): Data and other documentation that describes objects in a formalized way
  Metadata are data that describe other data, and data become metadata when they are used in this way.

Statistical metadata
  OECD, UNECE (Metadata Common Vocabulary): Data about statistical data.
    · Comprises data and other documentation that describes objects in a formalised way.
    · Provides information on data and about processes of producing and using data.
  UNECE (Terminology, Guidelines): Metadata describing statistical data

Data
  Wikipedia: Qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements [...] or observations [...].
  ISO: Re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing
  OECD, UNECE (Metadata Common Vocabulary): Characteristics or information, usually numerical, that are collected through observation.
  UNECE (Terminology, Guidelines): The physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means.

Statistical data
  OECD, UNECE (Metadata Common Vocabulary): Data from a survey or administrative source used to produce statistics
  UNECE (Terminology, Guidelines): Data that are collected and/or generated by statistics in process of statistical observations or statistical data processing

Structural metadata
  Wikipedia: Describe the structure of computer systems such as tables, columns and indexes. (Bretherton & Singley)
  Wikipedia: (Technical) Defines the objects and processes from a technical perspective [...] like tables, fields, data types, indexes [...] (Kimball)
  NISO: Indicate how compound objects are put together, e.g., how pages are ordered to form chapters
  OECD, UNECE (Metadata Common Vocabulary): Act as identifiers and descriptors of the data. They are used to identify, use, and process data matrixes and data cubes, e.g. names of columns or dimensions of statistical cubes.

Reference metadata
  Wikipedia: (Guide) Help humans find specific items. (Bretherton & Singley)
  Wikipedia: (Business) Describes the contents [...] in user accessible terms [...] what data you have, where it comes from, what it means, [...] (Kimball)
  NISO: (Descriptive) Describe a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords
  OECD, UNECE (Metadata Common Vocabulary): Describe the contents and the quality of the statistical data. Should include conceptual, methodological and quality metadata

Administrative metadata
  NISO: Provide information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets [...]:
    − Rights management metadata, which deals with intellectual property rights, and
    − Preservation metadata, which contains information needed to archive and preserve a resource.

Process metadata
  Wikipedia: Describes the results of various operations [...] start time, end time, CPU seconds used [...] (Kimball)

Algorithmic metadata
  Include the algorithms as such behind statistical procedures, including procedures for statistical analysis; descriptions of the algorithms

Metadata item
  ISO: Instance of a metadata object
  An instance of a metadata object. It has associated attributes. It can have a distinct status: mandatory, conditional and optional.
  A group of characters describing the data and treated as metadata unit

Metadata usage
  Data virtualization, statistics and census services, data warehousing
  Discovery and organisation of electronic resources, interoperability, integration, identification, archiving.

Metadata layer
  Wikipedia: [data warehouse] The data dictionary – This is usually more detailed than an operational system data dictionary.
  The layer in the reference model for standardization in statistics used to denote the set of attributes related to statistical metainformation

Metadata registry
  Wikipedia: A central location in an organization where metadata definitions are stored and maintained in a controlled method. Metadata registries are used whenever data must be used consistently within an organization or group of organizations.
  ISO: Information system for registering metadata (MDR)
  NISO: Provides information on the definition, origin, source, and location of data [...] at many levels, including schemes, usage profiles, metadata elements, and code lists for element values. It provides an integrating resource for legacy data, acts as a lookup tool for designers of new databases, and documents each data element.
  An information system for registering metadata. Registration accomplishes three main goals: identification, provenance, and monitoring quality. [...] It manages the semantics of data.

Metadata repository
  Wikipedia: A data dictionary [...] a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."
  A logically central statistical metadata repository that allows for the query, editing, and managing of metadata. Such a system provides a mechanism for looking up information about statistical products as well as their design, development, and analysis.
  UNECE (Terminology, Guidelines): (Metadata holding) A logical or physical set of metadata (e.g. database) stored together with its description (e.g. schema)