Title: Framework of metadata requirements and roles in the S-DWH
WP: 1
Deliverable: 1.1
Version: 1.12
Date: 6-9-2013
Author: Lars-Göran Lundell
NSI: Sweden

ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Metadata Framework for Statistical Data Warehousing

Contents

1 Metadata – general considerations
  1.1 Metadata definitions and terminology
    1.1.1 Metadata and data
    1.1.2 Metadata categories
    1.1.3 Metadata subsets
    1.1.4 Metadata structures
  1.2 Metadata collection and usage
  1.3 Metadata standards
    1.3.1 GSBPM
    1.3.2 GSIM
    1.3.3 MDR (ISO/IEC 11179)
    1.3.4 CWM
    1.3.5 DDI
    1.3.6 SDMX
    1.3.7 MCV
2 Metadata in the statistical data warehouse
  2.1 The SDWH metadata requirements
    2.1.1 Minimum metadata requirements for the SDWH
  2.2 Metadata and the layered SDWH
    2.2.1 Source layer metadata
    2.2.2 Integration layer metadata
    2.2.3 Interpretation and data analysis layer metadata
    2.2.4 Data access layer metadata
    2.2.5 Summary of SDWH layers and metadata categories
  2.3 Organising SDWH metadata
  2.4 SDWH metadata governance
  2.5 The SDWH and metadata standards

Version 1.12 Report 2013-09-06 ESSnet on Data Warehousing WP1

Metadata plays a very active and important part in the data warehouse environment. [...] Metadata for the data warehouse environment is one of the most important aspects.1

Metadata is the DNA of the data warehouse, defining its elements and how they work together. [...] Metadata plays such a critical role in the architecture that it makes sense to describe the architecture as being metadata driven.2

The quotations above originate from “the fathers of the data warehouse”, Bill Inmon and Ralph Kimball. Even if they do not always agree on how a data warehouse should be built and maintained, they obviously share the view that much effort should be devoted to designing the metadata system when establishing a data warehouse.

To everyone working in an organisation that produces statistics, like a national statistical institute (NSI), the need for good metadata is already well known, regardless of the production environment. Thus it is obvious that a statistical data warehouse (SDWH) is dependent on its metadata for statistical as well as data warehousing purposes.
According to the framework partnership agreement (FPA) for the ESSnet project on micro data linking and data warehousing in statistical production, this project is to “define a functional model of the SDWH, so that the issues raised by the ESSnet can be assessed in a generic and standardized way”. This paper attempts to define the roles and purposes of metadata in the SDWH in generic terms, and to distinguish them from those used in statistics production regardless of the environment. The result is a general metadata framework for statistical data warehousing.

1 Metadata – general considerations

1.1 Metadata definitions and terminology

The first step in any kind of standardisation work usually concerns making sure that all involved parties understand and agree on a set of basic definitions and use a common terminology. This chapter covers a number of important basic definitions.

1.1.1 Metadata and data

General definitions of metadata can be found in many books and on many sites on the Internet. Most of them are very short and simple. The most commonly used generic definition states that:

[Def 1.1] Metadata are data about data3

There are some variations on the theme, e.g. claiming that metadata should (or must) be structured or formalised. Perhaps somewhat unexpectedly, the sources that have a relation to statistics give definitions that are even shorter and vaguer than some of the general purpose sources. The definition of statistical metadata given by OECD and UNECE, e.g., simply states that:

[Def 1.2] Statistical metadata are data about statistical data4

1 Inmon, Metadata in the Data Warehouse, (White Paper), 2000
2 Kimball, The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley, 2008, p. 117
3 ISO/IEC 11179; Eurostat Metadata Common Vocabulary

This definition will obviously cover all kinds of documentation with some reference to any type of statistical data and is applicable to metadata that refer to data stored in a SDWH as well as any other type of data store.

Since the definition of metadata shows that they are just a special case of data, we need a reasonable definition of data as well. A derivative from a number of slightly varying definitions would be:

[Def 1.3] Data are qualitative and/or quantitative information collected through observation5

As well as a definition of statistical metadata, we can find several definitions of statistical data:

[Def 1.4] Statistical data are data derived from either statistical or non-statistical sources, which are used in the process of producing statistical products6

These basic definitions are very generic and state nothing about requirements on the contents or organisation of the data or metadata.

1.1.2 Metadata categories

Metadata may describe many different aspects of data. Hence metadata can be categorised in several ways, where the categories form a multi-dimensional structure. Consequently, each metadata item normally belongs to several categories.

1.1.2.1 Active – passive

Traditionally, metadata have been seen as a documentation of an existing object or a process, such as a statistical production process that is running or has already finished – i.e. the result of a task most often carried out as the last, even optional step of the production process. This indicates a passive, recording role, which is useful for documenting, e.g., the variables, objects and methods used to plan and carry out a survey or the quality achieved for the results.
Passive metadata will become more active if they are used as input for planning, e.g., a new survey round or a new similar statistics product. The term active metadata should, however, be reserved for metadata that are operational. Active metadata may be regarded as an intermediate layer between the user and the data, which can be used by humans or computer programmes to search, link, retrieve or perform other operations on data. Thus active metadata may be expressed as parameters, and may contain rules or code (algorithmic metadata). Some authors use the term active only for the latter, i.e. metadata that can be interpreted or executed at runtime to support metadata driven processes, calling all other non-passive metadata semi-active.

4 OECD, Glossary of Statistical Terms; UNECE, Terminology on Statistical Metadata
5 Eurostat Metadata Common Vocabulary
6 Eurostat Metadata Common Vocabulary

Passive metadata are used as documentation in all statistics production regardless of storage environment. In the SDWH active metadata must be available in what is often called the metadata layer (cf. definition 4.1).

[Def 2.1] Active metadata are metadata stored and organised in a way that enables operational use, manual or automated, for one or more processes
Examples: Instruction; user manual; parameter; script (SQL, XML)

[Def 2.2] Passive metadata are all metadata that are not active
Examples: Quality report for a survey, a census or register; documentation of methods that were used during a survey; most log lists; definitions of variables

1.1.2.2 Formalised – free-form

According to some sources all metadata must be structured, or formalised. In the reverse case all metadata would be created and stored in completely free form, unstructured and non-formalised. In practice all metadata probably follow some kind of structure, which may be more or less strict.
At one end, we have completely and strictly formalised metadata, meaning that only pre-determined codes or numerical information from a pre-determined domain may be used. At the other end, we find a loose structure, e.g. a set of chapters, subdivisions, headings, etc., that may be mandatory or optional and whose contents may adhere to some rules or may be entered in a completely free form (text, diagrams, etc.). Strictly formalised metadata are obviously well suited for use in an active role, but there is no simple, unambiguous mapping between active and formalised, and passive and free-form, respectively. Still, formalised metadata are more easily used actively, and since active metadata are vital to building an efficient SDWH it follows that its metadata should also be formalised, whenever possible. [Def 2.3] Formalised metadata are metadata stored and organised according to standardised codes, lists and hierarchies Examples: Classification codes; parameter lists; most log lists [Def 2.4] Free-form metadata are metadata that contain descriptive information using formats ranging from completely free-form to partly formalised (semi-structured) Examples: Quality report for a survey a census or register; methodological description; process documentation; background information 1.1.2.3 Reference – structural Most sources define two main categories of metadata, often called business and technical metadata. The “statistical sources” rarely use those terms. Several different synonyms can be found for business metadata, e.g. conceptual or logical, but the most commonly used term is reference metadata. Instead of technical metadata, the “statistical sources” most often use the term structural metadata to refer to the same thing. 
The distinction between the two categories varies between sources, but generally reference metadata help the user understand, interpret and evaluate the contents, the subject matter, the quality, etc., of the corresponding data, whilst structural metadata help the user find, access and utilise the data operationally. Particularly in data warehousing, structural metadata can be defined as any metadata that can be used actively or operationally in a metadata driven system. The user may in this case be a human or a machine (a programme, a process, a system). This includes metadata that describe the physical locations of the corresponding data, such as names or other identities of servers, databases, tables, columns, files, positions, etc.

Structural metadata are normally represented as formalised and active, whilst reference metadata are typically passive and stored in a free format, requiring more effort to make them active by storing them in a structured way.

[Def 2.5] Reference metadata are metadata that describe the contents and quality of the data in order to help the user understand and evaluate them (conceptually)7
Examples: Quality information on survey, register and variable levels; variable definitions; reference dates; confidentiality information; contact information; relations between metadata items

[Def 2.6] Structural metadata are metadata that help the user find, identify, access and utilise the data (physically)
Examples: Classification codes; parameter lists

All categories described above are valid for all metadata, i.e. every metadata item can be categorised on the three scales: active–passive, formalised–free-form, and reference–structural, as illustrated in figure 1.
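The three scales can be read as three independent attributes that every metadata item carries. The following is a minimal sketch in Python under that reading; the class and attribute names (`MetadataItem`, `Activity`, `Form`, `Role`) are hypothetical and not part of the framework. The quality report is placed according to definitions 2.2, 2.4 and 2.5, while treating classification codes as active is an assumption based on the statement that structural metadata are normally formalised and active.

```python
from dataclasses import dataclass
from enum import Enum


class Activity(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class Form(Enum):
    FORMALISED = "formalised"
    FREE_FORM = "free-form"


class Role(Enum):
    REFERENCE = "reference"
    STRUCTURAL = "structural"


@dataclass(frozen=True)
class MetadataItem:
    """A metadata item positioned on each of the three category scales."""

    name: str
    activity: Activity
    form: Form
    role: Role

    def categories(self) -> tuple:
        """Return the item's position as (activity, form, role) labels."""
        return (self.activity.value, self.form.value, self.role.value)


# Listed as an example under definitions 2.2, 2.4 and 2.5.
quality_report = MetadataItem(
    "Quality report for a survey",
    Activity.PASSIVE, Form.FREE_FORM, Role.REFERENCE,
)

# Listed under definitions 2.3 and 2.6; "active" is an assumption (see above).
classification_codes = MetadataItem(
    "Classification codes",
    Activity.ACTIVE, Form.FORMALISED, Role.STRUCTURAL,
)

print(quality_report.categories())
print(classification_codes.categories())
```

Modelling the scales as separate enumerations rather than one combined code keeps them independent, which matches the statement that each item is categorised on all three scales at once.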
Figure 1 Categorisation of a metadata item

1.1.3 Metadata subsets

In addition to the categories described above, a metadata item may (but does not necessarily have to) also belong to a specific type, or subset, of metadata. The subsets that are generally best known or considered most important are described below. Several more types may be identified to serve special purposes, but are not further described here.

7 Eurostat Metadata Common Vocabulary

1.1.3.1 Statistical metadata

According to definition 1.2 statistical metadata are “data about statistical data”. This definition is very generic and needs to be made more precise in order to be useful. From a more operational point of view, statistical metadata can be seen as those metadata that directly refer to central concepts in the statistics, e.g., those that define and describe statistical unit types used in a survey, a census or a register, their characteristics, the variables and the statistical activities8. This still means that the statistical metadata subset may, at least partly, overlap some other subsets, but it will exclude some of the more administrative and technical ones. Statistical metadata may belong to any of the metadata categories described above.

[Def 3.1] Statistical metadata are data about statistical data.
Examples: Variable definition; register description; code list

1.1.3.2 Process metadata

Information on an operation, such as when it started and ended, the resulting status, the number of records processed, which resources were used, etc., is a specific type of metainformation. This kind of metadata is known under several names, such as process metadata, process data, process metrics, or paradata. These data may contain either expected values or actual outcome. In both cases, they are primarily intended for planning – in the latter case by evaluating finished processes in order to improve recurring or similar ones.
If process metadata are formalised, this will obviously facilitate computer-aided evaluation. Process metadata are less likely to be categorised as free-form, but may be active or passive, and reference or structural.

[Def 3.2] Process metadata are metadata that describe the expected or actual outcome of one or more processes using evaluable and operational metrics
Examples: Operator’s manual (active, formalised, reference); parameter list (active, formalised, reference); log file (passive, formalised, reference/structural)

1.1.3.3 Quality metadata

Quality metadata may be read either as metadata on the quality of the data or as metadata of high quality. Both interpretations are relevant to statistics production and data warehousing.

Keeping track of, maintaining and perhaps raising the quality of the data in the SDWH is an important governance task that requires support from metadata. Quality information should be available in different forms and serve several purposes: to describe the quality achieved (e.g. how a survey was carried out, or what the outcome was), or to measure the outcome (a contribution to the process metadata). The main objective of the former is to serve the end users of the data, while the latter primarily supports governance and future improvements.

8 The Neuchâtel Terminology Model, Part II, 2006

Most quality metadata can be categorised as passive, free-form and reference metadata.

[Def 3.3] Quality metadata are any kind of metadata that contribute to the description or interpretation of the quality of data.
Examples: Quality declarations for a survey, a census or a register (passive, free-form, reference); documentation of methods that were used during a survey (passive, free-form, reference); most log lists (passive, formalised, reference/structural)

Metadata quality is obviously a very important issue, and it should be high, within the restrictions of reasonable cost-benefit analysis. Inferior metadata quality may lead to unnecessary misinterpretations of the data contents or even result in completely useless data. A detailed discussion and recommendations on metadata quality can be found in Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse9 (Deliverable 1.2).

1.1.3.4 Technical metadata

In section 1.1.2.3, technical metadata were mentioned as a commonly used synonym for structural metadata. We choose not to use the term further as a metadata category, but instead use it for a metadata subset referring to the information necessary to locate the data physically. Technical metadata are usually categorised as formalised, active and structural.

[Def 3.4] Technical metadata are metadata that describe or define the physical storage or location of data.
Examples: Server, database, table and column names and/or identifiers; server, directory and file names and/or identifiers

1.1.3.5 Authorisation metadata

Every computerised system needs some way of handling user privileges, access rights, etc. Users need to be classified or assigned a role as, e.g., “administrator”, “user” or “guest”, or to be given an explicit privilege to “read”, “write”, or “update” a certain item, etc. In a data warehouse, with a large amount of data and many users performing various tasks, there is a need for a comprehensive authorisation subsystem. This system will need to store and use its own administrative data, which may be defined as authorisation metadata. Authorisation metadata are categorised as active, formalised and structural.
[Def 3.5] Authorisation metadata are administrative data that are used by programmes, systems or subsystems to manage users’ access to data.
Examples: User lists with privileges; cross references between resources and users

9 Bowler, Lindelauf, Dressen (2013)

1.1.3.6 Data models

The various types of data models are an often overlooked type of metadata. The reason is probably that these metadata are usually only seen as useful to the technical staff (IT personnel).

[Def 3.6] A data model is an abstract documentation of the structure of data needed and created by business processes.

Important types of data models for the SDWH include the conceptual model that usually gives a high-level overview, and the physical model that describes the details of databases, files, etc. The metadata model can also be described conceptually as well as physically.

[Def 3.6.1] A metadata model is a special case of a data model: an abstract documentation of the structure of metadata used by business processes.

1.1.4 Metadata structures

In order to find, retrieve and use metadata efficiently their locations must be known to users on some level. A data warehouse is often described as consisting of several parts that serve separate functions, sometimes called layers10. Since metadata is a vital part of the SDWH the term metadata layer is sometimes used to refer to the metadata store and metadata functions in the SDWH.

[Def 4.1] A metadata layer is a conceptual term that refers to all metadata in a data warehouse, regardless of logical or physical organisation.

Metadata need to be organised in some kind of structured, logical way in order to make it possible to find and use them. Some sources use different terms for logical and physical metadata structures, respectively, but most do not distinguish between them.
Sometimes it is useful to have separate terms; a logical structure may be, e.g., physically stored in several distributed, coordinated structures.

Another distinction can be found in the level of formal organisation of the metadata store, the restrictions and approval rules required to perform changes, and the coordination of the contents. The term registry often refers to a more strictly administered, regulated and coordinated environment than the more general term repository.

[Def 4.2] A metadata registry is a central point where logical metadata definitions are stored and maintained using controlled methods.

In order to load a metadata item into the registry it must fulfil the requirements set up on, e.g., structure, contents and relations to other metadata items. Normally the registry does not define any links between metadata and the data they describe, i.e. the physical addresses of the data.

Usually the definition of a metadata repository does not require the metadata to adhere to strict rules in order to be loaded. On the other hand, the repository usually implies storing metadata for operational use, i.e. it is expected to contain a link to the corresponding data.

10 Palma, S-DWH Business Architecture, 2013

[Def 4.3] A metadata repository is a physical location where metadata and their links to data are stored.

Figure 2. Using the metadata layer to locate and retrieve data

1.2 Metadata collection and usage

The metadata lifecycle is commonly described as divided into the following three basic phases:

1. Collection
   Metadata should be captured as early as possible in the production process. The sources vary. Collection of some types of metadata can and should be automated. When data is entered into the data warehouse, basic metadata must already exist in a correct form.
2. Maintenance
   Metadata must be up to date at all times. Processes must be in place to capture changes and to synchronise metadata with the changing architecture.
3. Deployment
   Metadata must be available to users in the right form and with the right tools.

Collection of metadata should be automated whenever possible. This means that, e.g., metadata that exist in the sources, such as administrative data files used as input, should be used directly or in a derived form.

Another way of simplifying metadata collection is to use what already exists. Reuse and inherit are common keywords in metadata literature. One of the major advantages of using metadata is that duplicate and “near-duplicate” data can be revealed and avoided. Reusing data and metadata saves resources and increases efficiency and quality. Revealing, e.g., variables having almost, but not quite, the same definitions can improve harmonisation and comparability. The improvement of data consistency that will be enabled by metadata harmonisation is a vital task for the data warehouse – possibly the most important and at the same time one of the most difficult ones.

Different user categories need different metadata and have different requirements. End users want to use metadata to easily and correctly find and interpret the data they need. Data stewards want an inventory of what is stored in the data warehouse. Analysts want to compare the data sources. Programmers want to make sure that they use the standard names. These are just a few examples of metadata usage. The use ranges from detailed and operational to overview and descriptive.

A more thorough discussion on metadata collection and usage can be found in Definition of the functionalities of a metadata system to facilitate and support the operation of the S-DWH11 (Deliverable 1.4).
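Section 1.2 notes that metadata can reveal variables with almost, but not quite, the same definitions. As a minimal sketch of how such candidates might be flagged, the Python example below compares definition texts pairwise with the standard library's `difflib.SequenceMatcher`. The function name `near_duplicates`, the threshold of 0.8 and the sample variables are all hypothetical, and plain string similarity is only one possible heuristic, not a method prescribed by this framework.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(definitions, threshold=0.8):
    """Return (name_a, name_b, similarity) for definitions that are
    similar but not identical. The threshold is an illustrative assumption."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(definitions.items(), 2):
        a, b = text_a.lower(), text_b.lower()
        if a == b:
            continue  # exact duplicates are a separate, simpler case
        similarity = SequenceMatcher(None, a, b).ratio()
        if similarity >= threshold:
            pairs.append((name_a, name_b, round(similarity, 2)))
    return pairs


# Hypothetical variable definitions from two surveys.
variables = {
    "turnover_survey_a": "Total invoiced sales of goods and services excluding VAT",
    "turnover_survey_b": "Total invoiced sales of goods and services including VAT",
    "employees": "Number of persons employed at the end of the reference period",
}

candidates = near_duplicates(variables)
for name_a, name_b, similarity in candidates:
    print(f"review for harmonisation: {name_a} / {name_b} ({similarity})")
```

A flagged pair is only a candidate for review; whether the two variables can share one definition remains a subject-matter decision.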
1.3 Metadata standards An important step towards a common view on metadata for a SDWH is to identify which already existing standards are available and relevant for the purpose. Metadata standards relating to statistics as well as to data warehousing should be taken into account. Information on statistical metadata standards can be found in the Common Metadata Framework12, created and maintained by METIS, the joint UNECE-Eurostat-OECD Work Session on Statistical Metadata. Below are briefly described the models and technical standards considered most relevant to a SDWH. For more detailed information on a standard, refer to Overview of and recommendations on the use of metadata models13 (Deliverable 1.3) and the links below. 1.3.1 GSBPM The Generic Statistical Business Process Model14 (GSBPM) provides a framework to describe the statistical production process in terms of standard components (phases and sub-processes). One of the original aims of the model is to standardise the process terminology, thereby making it easier to compare and benchmark processes within and between organisations, primarily NSIs and international organisations. The current version, 4.0, was released in 2009. 1.3.2 GSIM The Generic Statistical Information Model15 (GSIM) is a reference framework that describes information objects used in the production of official statistics. GSIM provides a common language to describe information that supports the whole statistical production process. It is aligned with relevant data management and exchange standards, such as DDI and SDMX, but it is not directly tied to them, or to any specific technology. GSIM and GSBPM are complementary models, where GSIM describes the information that is used by and produced by the processes described in GSBPM. Version 1.0 of GSIM was released in 2012 after being developed by a group of NSIs, led by the Australian Bureau of Statistics (ABS). 
11 Ennok, Lundell, Bowler, De Giorgi, Kulla (2013)
12 http://www1.unece.org/stat/platform/display/metis/Part+B++Metadata+Concepts%2C+Standards%2C+Models+and+Registries
13 Dressen, Lindelauf, Goossens (2012)
14 http://www1.unece.org/stat/platform/display/metis/The+Generic+Statistical+Business+Process+Model
15 http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=59703371

1.3.3 MDR (ISO/IEC 11179)

ISO/IEC 1117916 is a well-established international standard for representing metadata in a metadata registry (MDR). It has two main purposes: definition and exchange of concepts. Thus, it describes the semantics and concepts, but does not handle physical representation of the data. It aims to be a standard for metadata-driven exchange of data in heterogeneous environments, based on exact definitions of data.

Several NSIs have based their current metadata systems on this standard. Most of those are developed in-house, but at least one commercial product exists that claims to support the standard (OneMeta MDR).

The standard was first published in 1999. The most recent update was made in 2013.

1.3.4 CWM

The Common Warehouse Metamodel17 (CWM), ISO/IEC 19504, is a specification for modelling metadata for data warehouses. Its purpose is to enable easy interchange of data warehouse metadata between tools, platforms and metadata repositories in distributed heterogeneous environments. CWM is based on, or supports, other standards such as UML, XML and CORBA.

CWM is supported by the Object Management Group, which in turn is supported by several major software companies. Several commercial products claim to support this standard, at least partly.

The current version, 1.1, was released in 2003.
1.3.5 DDI

The Data Documentation Initiative18 (DDI) has its roots in the data archive environment, but with its latest development, DDI 3 or DDI Lifecycle, it has become an increasingly interesting option for NSIs. DDI is an effort to create an international standard for describing data from the social, behavioural, and economic sciences. It is based on XML.

DDI is supported by a non-profit international organisation, the DDI Alliance. Several tools that support DDI are available, both on the commercial market and as free software.

The current version, 3.1, was published in 2009. Version 3.2 has been under public review and is expected to be released in 2013.

1.3.6 SDMX

Statistical Data and Metadata eXchange19 (SDMX) was initiated by seven international organisations to foster standards for the exchange of statistical information. SDMX has its focus on macro data, even though the model supports micro data. It is an adopted standard for delivering and sharing data between NSIs and Eurostat. Sharing the results from the latest Population Census is perhaps the most advanced example, so far. Several software products that are commonly used by NSIs support SDMX.

The current version, 2.1, was published in 2012.

16 http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
17 http://www.omg.org/spec/CWM/1.1
18 http://www.ddialliance.org/specification
19 http://www.sdmx.org/

1.3.7 MCV

The Metadata Common Vocabulary20 (MCV) is not a standard in itself, but provides definitions of common metadata concepts, in particular in the domain of statistical metadata. It was compiled as a part of the development of SDMX. It is maintained by the SDMX consortium and is a part of the SDMX Content-oriented Guidelines.

The current version was published in 2009.
2 Metadata in the statistical data warehouse Although most authors of data warehousing literature seem to agree with Inmon and Kimball on the important role of metadata, you can find surprisingly little practical support on how to implement a metadata layer. An article by Panos Vassiliadis21 of the University of Ioannina, Greece, summarizes well the requirements of data warehouse metadata. They should include information on: 1. the contents of the data warehouse, their location and their structure 2. the processes that take place in the data warehouse 3. the implicit semantics of data along with any other kind of data that aids the end-user exploit the information of the warehouse 4. the infrastructure and physical characteristics of components and the sources of the data warehouse 5. security, authentication, and usage statistics that aids the administrator tune the operation of the data warehouse as appropriate General metadata requirements for statistics production, regardless of environment, have been investigated and discussed in many previous projects, several of those initiated by international organisations. Those issues should not be repeated here; instead this document will focus on the specific roles of metadata in a SDWH, and the special demands that can be identified for that environment. 2.1 The SDWH metadata requirements The data warehouse architecture is, according to Kimball and others, metadata driven. Referring to the metadata categories described earlier, and keeping in mind the specific metadata requirements of statistics production it is possible to assess which categories play significant roles when building and maintaining a SDWH. The SDWH requires active metadata. The amount of objects (variables, value domains, etc.) stored makes it necessary to provide the users (persons and software) with active assistance finding and processing the data. 
20 http://sdmx.org/wp-content/uploads/2009/01/04_sdmx_cog_annex_4_mcv_2009.pdf
21 Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009

The SDWH requires formalised metadata. The number of metadata items will be large and the requirement for metadata to be active makes it necessary to structure the metadata very well.

The SDWH requires structural metadata, especially technical metadata. Active metadata must be structural, at least in part.

Process metadata are vital to a SDWH. Since the data warehouse supports many concurrent users it is very important to keep track of usage, performance, etc. In a data warehouse that has been less than perfectly designed, one user’s choice of tool or operation could impair the performance for other users. An analysis of process metadata can be an input to correcting this anomaly.

The table below shows the possible combinations of metadata categories and subsets. The cells indicate which combinations are of general interest for statistics production (“gen”) and which ones are of particular interest for a SDWH (“dw”). Most of the remaining combinations are possible, but less common or less likely to be found useful.

Metadata category (columns): Formalised and Free-form, each divided into Reference and Structural, each of these divided into Act (active) and Pas (passive)
Metadata subset (rows): Statistical, Process, Quality, Technical, Authorisation, Data model
Cell entries in reading order: dw gen dw dw dw gen gen dw gen dw gen dw dw

Consistency within the metadata layer is an example of an attribute that is regarded as desirable in any statistics production environment, but that is considered necessary in the SDWH environment. In the SDWH all metadata items (concepts as well as physical references) must be uniquely identified and there must be one-to-one relationships between identity and definition, and identity and name, respectively.
The concept “Local unit”, for example, must be given an identity and a definition, and these must be used consistently in the SDWH regardless of source, context, etc. If a slightly different definition is needed, it must be given a new identity and a new name.

In the SDWH it is desirable to be able to analyse data by time series at a low level of aggregation, or even to perform longitudinal analysis at object level. To support these functions metadata items should have validity information: “valid from 01-01-2001”, “valid until 31-12-2012”.

In order to be metadata driven the SDWH has higher demands for process metadata, and it is more likely to have a built-in ability to produce process metadata. The SDWH is not only a data store; it is also a system of processes that refine its data from input to output. These processes need active metadata: automated processes need formalised process metadata, such as programmes, parameters, etc., and manual processes need process metadata such as instructions, scripts, etc.

2.1.1 Minimum metadata requirements for the SDWH

A SDWH without metadata, or with insufficiently comprehensive metadata, cannot be called a true data warehouse, since its data are not interpretable or understandable in a reliable or useful way. The more and the better the metadata available, the more useful the SDWH becomes and the more reliable the analyses and conclusions derived from its data.

Adding metadata is arguably one of the most demanding and expensive parts of governing the SDWH. Since budget, time and human resources always form constraints, requiring complete metadata of the highest quality is asking for the impossible. In practice, a minimum set of requirements should be identified, defining what metadata are vital to the SDWH. The following list of items refers to the metadata subsets discussed in chapter 1.1.3.

Statistical metadata
- Variable name, definition, reference time and source
- Value domain mapped to the variable (particularly important if the value domain corresponds to a formal classification)

Process metadata
- Load time (date and time when the data item was loaded into the SDWH)

Technical metadata
- Physical location (name/identity of server/database/column etc. where the variable is stored)
- Data type (record layout: length, decimals, etc.)

Authorisation metadata
- Access rights mapped to users, groups and roles

2.2 Metadata and the layered SDWH

In the general discussions on metadata for the statistical production lifecycle several attempts have been made to link metadata to the generic phases, such as the GSBPM processes: what metadata are produced during a process, what metadata are needed to perform a process, and what metadata are forwarded from one process to the next. The GSBPM is applicable to any statistics production, including a SDWH. Hence the results from, e.g., the METIS group could, and should, be used when discussing metadata for the SDWH.

There are, however, alternative or complementary models that may be used to describe the specific needs of the SDWH. During the first phase of this project it was agreed that the figure below conveys a good description of the processes and data flow that take place in a generic SDWH. The layered approach, with input of raw data at the bottom and dissemination of refined data (or statistics) at the top, shows the necessary production steps in a simplified way. The layered architecture of the S-DWH is elaborated in detail in SDWH Business Architecture [22] (Deliverable 3.1). The metadata layer at the left-hand side indicates the necessity of metadata support from start to finish. Examples of which metadata categories and functionalities are used in the different layers are found in Documentation of the mapping of the result of 1.4 on the ‘ideal architecture’ framework [23] (Deliverable 1.6).
[22] Palma (2013)
[23] Ennok (2013)

Figure 3 The SDWH layers

2.2.1 Source layer metadata

The source layer is the entry point to the SDWH regarding data as well as metadata. Data are collected from various sources outside the control of the data warehouse, and they have various origins, spanning from surveys and censuses conducted within the organisation to administrative registers kept by other organisations. Hence, the original metadata that accompany the data will vary in contents and quality, and the possibilities to influence the metadata will vary as well.

The source layer, being the entry point, has the important role of gatekeeper, making sure that data entered into the SDWH and forwarded to the integration layer always have matching metadata of at least the agreed minimum extent and quality. The metadata may be either already available, loaded earlier, e.g. with a previous periodic delivery, or supplied with the current data delivery.

The main responsibilities for this layer include:
- to make sure that all relevant data are collected from the sources, including their metadata,
- to add or complete missing or bad metadata,
- to deliver data and metadata in the best possible formats to the integration layer.

The metadata from the sources must satisfy the minimum requirements described in chapter 2.1.1, but should if possible include more comprehensive metadata, such as quality information.

In the source layer the foundation is laid for metadata to be used in the next layers. Consistency in definitions and standardisation of code lists are examples of areas where efforts should be made to influence the sources in order to build the strongest possible metadata foundation.
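
The gatekeeper role described above can be illustrated with a minimal sketch in Python. It assumes that the minimum metadata of chapter 2.1.1 are held as one record per variable and that the load time has already been stamped when the check runs. The class name, field names and function names are hypothetical choices made for this illustration; the framework itself does not prescribe any data structure.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional, Tuple


@dataclass
class MinimumMetadata:
    """Minimum metadata for one variable, following the list in chapter 2.1.1.

    All field names are illustrative assumptions.
    """
    # Statistical metadata
    variable_name: Optional[str] = None
    definition: Optional[str] = None
    reference_time: Optional[str] = None
    source: Optional[str] = None
    value_domain: Optional[str] = None
    # Process metadata
    load_time: Optional[str] = None
    # Technical metadata
    physical_location: Optional[str] = None
    data_type: Optional[str] = None
    # Access rights mapped to users, groups and roles
    access_rights: Optional[Dict[str, str]] = None


def missing_items(meta: MinimumMetadata) -> List[str]:
    """Return the names of the minimum metadata items that are empty."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


def gatekeeper(
    delivery: Dict[str, MinimumMetadata],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split a delivery into variables to forward and variables to hold back.

    Returns the accepted variable keys and, for each held-back key,
    the list of metadata items that still have to be added or completed.
    """
    accepted: List[str] = []
    held_back: Dict[str, List[str]] = {}
    for key, meta in delivery.items():
        gaps = missing_items(meta)
        if gaps:
            held_back[key] = gaps
        else:
            accepted.append(key)
    return accepted, held_back
```

In such a design, the held-back list is what the source layer would work through when adding or completing missing metadata before the data are forwarded to the integration layer.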

2.2.2 Integration layer metadata

The efficiency of the data linking and similar tasks carried out in the integration layer will depend on the quality of the metadata carried forward from the source layer. In this layer data are extracted from the sources, transformed as necessary, and loaded into their places in the data warehouse (ETL operations). These tasks need active metadata: descriptions and operator manuals, as well as the derivation rules being applied, i.e. scripts, parameters and program code for the tools used.

The ETL operations will also create several types of metadata:
- Structural process metadata
  o Automatically generated formalised information, log data on performance, errors, etc.
  o Manually added, more or less formalised information
- Structural statistical metadata
  o Automatically generated additions to, or new versions of, code lists, linkage keys, etc.
  o Manual additions, corrections and updates to the new versions
- Reference metadata
  o Manually added quality information, process information, etc., regarding a dataset, or a new version

2.2.3 Interpretation and data analysis layer metadata

The interpretation and data analysis layer stores cleaned, versioned and well-structured final micro data. Once a new dataset or a new version has been loaded, few updates are made to the data in this layer. Consequently, metadata are normally only added, with few or no changes being made.

On loading data to this layer the following additions should be made to metadata:
- Structural process metadata
  o Automatically generated log data
- Structural statistical metadata
  o New versions of code lists, etc.
- Reference metadata
  o Optional additions to quality information, process information, etc.

Relatively few users will access this layer, but those who do will need metadata to perform their tasks:
- Structural process metadata
  o Estimation rules, descriptions, code, etc.
  o Confidentiality rules
- Structural statistical metadata
  o Variable definitions
  o Derivation rules
- Reference metadata
  o Quality information, process information, etc.

2.2.4 Data access layer metadata

Loading data into the access layer means reorganising data from the analysis layer by derivation or aggregation into relevant stores, or data marts. This will require metadata that describe and support the process itself (derivation and aggregation rules), but also metadata that describe the reorganised data.

Necessary metadata to load the data access layer include:
- Structural process metadata
  o Derivation and aggregation rules
- Structural technical metadata
  o New physical references, etc.

Using the data access layer will require:
- Structural statistical metadata
  o Optional additional definitions of derived entities or attributes, aggregates, etc.
- Structural technical metadata
  o Physical references, etc.
- Reference metadata
  o Information on sources, links to source quality information

2.2.5 Summary of SDWH layers and metadata categories

The table below gives a rough overview of where in the SDWH layers three important metadata categories are created (indicated by c) and used (u).

Layer            Statistical   Process     Quality
                 metadata      metadata    metadata
Data access      u             cu          u
Interpretation   cu            cu          cu
Integration      cu            cu          c
Source           c             c           c

The table shows that the lower layers mainly provide metadata, but cannot make much use of them, while in the higher layers metadata are used, but relatively few are added. This agrees well with the rule that metadata should be captured as close to the source, or as early in the process, as possible.

The SDWH architecture should make it possible to trace any changes made to data as well as metadata by using process metadata and versioning both data and metadata. Thus, a metadata item is normally never changed, updated or replaced.
Instead, a new version is created when necessary, which means that it will always be possible to identify which metadata was considered correct at a certain point in time, even if it has later been revised.

A more detailed analysis of the metadata subsets and their use in the SDWH layers can be found in Definition of the functionalities of a metadata system to facilitate and support the operation of the S-DWH [24] (Deliverable 1.4).

[24] Ennok, Lundell, Bowler, De Giorgi, Kulla (2013)

2.3 Organising SDWH metadata

This project has defined the SDWH as “a central statistical data store, regardless of the data’s source” [25]. Although it is not explicitly expressed in the definition, the SDWH should be understood as a logically coherent data store, but not necessarily as one single physical unit. The logical coherence means that it must be possible to uniquely identify a data item throughout the data warehouse, to trace it on its way through the logical layers from input to dissemination, and to follow it longitudinally. A user must be able to search the entire metadata layer and, if permitted, to access data in the logical SDWH without actual knowledge of their physical locations.

From the requirements on data follow similar demands on metadata: all data in the SDWH must have corresponding metadata, all metadata items must be uniquely identifiable, metadata should be versioned to enable longitudinal use, etc., and metadata must provide “live” links to the physical data.

To achieve this it must be possible, and should be easy, to monitor and govern the metadata layer. This requires the metadata layer to have comprehensive registry functionality (according to definition 4.2) as well as repository functions (definition 4.3).
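
The demands for unique identification and versioning, and the wish to see which metadata was considered correct at a certain point in time, can be illustrated with a small sketch in Python. It assumes an append-only store in which each revision only records its “valid from” date, so that “valid until” is derived from the next revision rather than written back into the old one. The class and method names are hypothetical and serve only this illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MetadataVersion:
    """One immutable version of a metadata item (illustrative structure)."""
    identity: str      # unique identity of the metadata item
    version: int       # 1, 2, 3, ... within that identity
    definition: str    # wording considered correct from valid_from onwards
    valid_from: date


class VersionedRegistry:
    """Append-only store: versions are added, never changed or replaced."""

    def __init__(self) -> None:
        self._history: Dict[str, List[MetadataVersion]] = {}

    def add_version(self, identity: str, definition: str,
                    valid_from: date) -> MetadataVersion:
        history = self._history.setdefault(identity, [])
        if history and valid_from <= history[-1].valid_from:
            raise ValueError("a new version must start after the previous one")
        item = MetadataVersion(identity, len(history) + 1, definition, valid_from)
        history.append(item)
        return item

    def as_of(self, identity: str, day: date) -> Optional[MetadataVersion]:
        """Return the version that was valid on the given day, if any."""
        current = None
        for item in self._history.get(identity, []):
            if item.valid_from <= day:
                current = item
        return current

    def valid_until(self, item: MetadataVersion) -> Optional[date]:
        """Derive the last day of validity; None means still valid."""
        history = self._history[item.identity]
        if item.version < len(history):
            # Versions are numbered from 1, so this index is the next version.
            return history[item.version].valid_from - timedelta(days=1)
        return None
```

Because no stored version is ever modified, a lookup with an earlier date returns the wording that was in force at that time, which is the property needed for longitudinal use.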
The registry functions are needed to control data consistency, to make the data contents searchable, etc., and the repository functions are needed to be able to operationalise data access. After searching the metadata registry/repository for a concept and finding it, a user must be able to retrieve its corresponding data (a case of active metadata) – provided that he/she is allowed to do so according to the authorisation metadata.

Whether to build one or more repositories will depend on local circumstances. In a decentralised or geographically dispersed organisation, building one single metadata repository may be technically difficult, or at least less attractive. The recommendation from a functional and governance point of view is a solution with one single installation that covers both registry and repository functions.

[25] Berglund, Palma. Functional Architecture of the S-DWH, 2013

2.4 SDWH metadata governance

Metadata’s vital role in the SDWH means that all metadata must be reliable at all times. This calls for well-organised management of metadata and governance of the registry. The governing role includes everything from being a watchdog and police to providing advice and inspiration. The balance between the tasks may vary, depending on organisational structures and several other factors. If much of the metadata is computer-generated (automated) there may be a need for regular checks, crosschecks, reviews or follow-up activities. If, on the other hand, much of the metadata is entered manually, possibly in a decentralised environment, there will probably be more need for advice as well as checks.

The SDWH is assumed to contain complete and correct metadata, available to users and processes when needed. The general rule says that metadata should be collected as close to the source, and as early in the process, as possible. In an ideal situation every delivery of new data to the SDWH from external or internal sources should be accompanied by a complete set of corresponding metadata; data and metadata should be loaded and updated in parallel. Similarly, every time a new variable is derived and added to the SDWH the metadata repository should be updated immediately.

In practice this will of course not always be the case – metadata will be missing, incomplete or incorrect. The metadata repository may contain items that have not yet been approved for the metadata registry and hence may not be linkable or searchable. The metadata model used in the SDWH should be flexible enough not to require completeness and correctness in all details and in every situation. It should allow for variations, acknowledging that some metadata are more important than others. In order to be called a SDWH, a data store must meet the minimum metadata requirements stated in section 2.1.1.

More detailed recommendations on metadata governance for the SDWH are given in Recommendations and guidelines on the governance of metadata management in the S-DWH [26] (Deliverable 1.5).

2.5 The SDWH and metadata standards

As described above, in section 1.3, there is a wide variety of metadata standards applicable to statistics production and to data warehousing. New standards will come, as well as new versions of the existing ones. The SDWH should be able to handle these changes without having to undergo major revisions of its metadata layer or rebuilding any of its other layers.

One method to handle these future changes is to implement the so-called “standards-agnostic” approach to the metadata repository. Standards-agnostic means that the standards themselves are represented as metadata objects within the repository.
In this approach [27]:
- Every metadata object describes which versions of which standards are supported
- Transformation services between standards and versions are also registered resources
- Introducing new standards or new versions of standards has a minimal impact on existing applications
- “Standard” does not necessarily refer to public standards

Implementing the standards-agnostic approach will require use of a metadata repository that in itself is standards-agnostic, not organised according to one specific standard. This will mean focusing on the standardisation itself, not on which standard is used.

The main purpose of the SDWH is to support efficient statistics production. This means that all current metadata standards relevant for this purpose should be supported by the SDWH. This recommendation includes following the MDR (ISO/IEC 11179) standard for the metadata registry/repository, and making sure that data and metadata can be packaged in SDMX format for data exchange.

Further information on metadata standards for the Statistical Data Warehouse can be found in Overview of and recommendations on the use of metadata models [28] (Deliverable 1.3).

[26] De Giorgi, Ennok, Lindelauf (2013)
[27] Arofan Gregory, Metadata Technology Ltd., Workshop on Metadata Standards, 07/12/2011
[28] Dressen, Lindelauf, Goossens (2013)

References

Berglund, Björn; Palma, Antonio Laureti. Functional Architecture of the S-DWH (Architectural framework), 2013. (Deliverable 3.3)
Bowler, Colin; Lindelauf, Michel; Dressen, Jos. Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse, 2013. (Deliverable 1.2)
Common Metadata Framework, Part B – Metadata Concepts, Standards, Models and Registries. UNECE
Common Warehouse Metamodel – ISO/IEC 19504. Object Management Group, 2003.
Data Documentation Initiative (DDI) Specification. DDI Alliance, 2009.
De Giorgi, Viviana; Ennok, Maia; Lindelauf, Michel. Recommendations and guidelines on the governance of metadata management in the S-DWH, 2013. (Deliverable 1.5)
Dressen, Jos; Lindelauf, Michel; Goossens, Harry. Overview of and recommendations on the use of metadata models, 2013. (Deliverable 1.3)
Ennok, Maia. Documentation of the mapping of the result of 1.4 on the ‘ideal architecture’ framework, 2013. (Deliverable 1.6)
Ennok, Maia; Lundell, Lars-Göran; Bowler, Colin; De Giorgi, Viviana; Kulla, Kaia. Definition of the functionalities of a metadata system to facilitate and support the operation of the S-DWH, 2013. (Deliverable 1.4)
Gregory, Arofan. The Standards Working Together. Presentation at the Data without Boundaries Workshop, Gothenburg 2011
Inmon, William H. Metadata in the Data Warehouse (White Paper), 2000
Kimball, Ralph. The Data Warehouse Lifecycle Toolkit (Second Edition), Wiley, 2008
Metadata Common Vocabulary (MCV). SDMX Consortium, 2009
Metadata Registries (MDR) – ISO/IEC 11179. ISO (International Organization for Standardization)
Neuchâtel Terminology Model. Part II: Variables and related concepts, object types and their attributes. Version 1, 2006
Palma, Antonio Laureti. S-DWH Business Architecture, 2013. (Deliverable 3.1)
Statistical Data and Metadata eXchange (SDMX), Standards
Vassiliadis, Panos. Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009

Link to all deliverables 1.1 – 1.6: http://www.cros-portal.eu/content/deliverables-13
Link to all deliverables 2.1 – 2.8: http://www.cros-portal.eu/content/deliverables-10
Link to all deliverables 3.1 – 3.5: http://www.cros-portal.eu/content/deliverables-8

Annex 1

Metadata related terms

Sources
Wikipedia [1] Direct quotations from Wikipedia and from its sources, http://en.wikipedia.org/wiki/Metadata, http://en.wikipedia.org/wiki/Data_warehouse
ISO [2] (International Organization for Standardization, ISO/IEC 11179 Metadata registries (MDR)), http://metadata-stds.org/11179/
NISO [3] (National Information Standards Organization), Understanding Metadata, http://www.niso.org/publications/press/UnderstandingMetadata.pdf
Eurostat Metadata Common Vocabulary, MCV (2009) [4], http://sdmx.org/wp-content/uploads/2009/01/04_sdmx_cog_annex_4_mcv_2009.pdf
Eurostat’s Concepts and Definitions Database, CODED [4,5], http://ec.europa.eu/eurostat/ramon/index.cfm?TargetUrl=DSP_PUB_WELC
UNECE, Terminology on Statistical Metadata (Conference of European Statisticians, 2000) [5], http://www.unece.org/stats/publications/53metadaterminology.pdf
UNECE, The Common Metadata Framework (2009-2011) [5], http://www1.unece.org/stat/platform/display/metis/The+Common+Metadata+Framework
OECD, Glossary of Statistical Terms [5], http://stats.oecd.org/glossary/

Term Metadata Wikipedia [1] Data providing information about one or more aspects of the data, such as: Means of creation of the data Purpose of the data Time and date of creation Creator or author of data Placement on a computer network where the data was created Standards use ISO [2] Data that defines and describes other data Statistical metadata Data Qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements [...] or observations [...].
Re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing NISO [3] Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information. Metadata can describe resources at any level of aggregation. It can describe a collection, a single resource, or a component part of a larger resource Metadata Common Vocabulary [4] Data that defines and describes other data. Terminology, Framework, Glossary [5] Data and other documentation that describes objects in a formalized way Metadata are data that describe other data, and data become metadata when they are used in this way. Data about statistical data. · Comprise data and other documentation that describe objects in a formalised way. · Provide information on data and about processes of producing and using data. Characteristics or information, usually numerical, that are collected through observation. Metadata describing statistical data The physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means. Annex 1 2(3) Term Wikipedia [1] ISO [2] NISO [3] Statistical data Structural metadata Reference metadata Describe the structure of computer systems such as tables, columns and indexes. Bretheron & Singley (Technical) Defines the objects and processes from a technical perspective [...] like tables, fields, data types, indexes [...] Kimball (Guide) Help humans find specific items. Bretheron & Singley (Business) Describes the contents [...] in user accessible terms [...] what data you have, where it comes from, what it means, [...] Kimball Indicate how compound objects are put together, e.g., how pages are ordered to form chapters (Descriptive) Describe a resource for purposes such as discovery and identification. 
It can include elements such as title, abstract, author, and keywords Metadata item Terminology, Framework, Glossary [5] Data that are collected and/ or generated by statistics in process of statistical observations or statistical data processing An instance of a metadata object. It has associated attributes. It can have a distinct status: mandatory, conditional and optional. A group of characters describing the data and treated as metadata unit Provide information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets [...]: − Rights management metadata, which deals with intellectual property rights, and − Preservation metadata, which contains information needed to archive and preserve a resource. Administrative metadata Process metadata Metadata Common Vocabulary [4] Data derived from either statistical or non-statistical sources, which are used in the process of producing statistical products Act as identifiers and descriptors of the data. They are needed to identify, use, and process data matrixes and data cubes, e.g. names of columns or dimensions of statistical cubes. Structural metadata must be associated with the statistical data, otherwise it becomes impossible to identify, retrieve and navigate the data. Describe the contents and the quality of the statistical data. Should include conceptual, methodological and quality metadata Describes the results of various operations [...] start time, end time, CPU seconds used [...] Kimball Instance of a metadata object Annex 1 3(3) Term Metadata usage Wikipedia [1] Data virtualization, statistics and census services, data warehousing ISO [2] NISO [3] Discovery and organisation of electronic resources, interoperability, integration, identification, archiving. 
Metadata Common Vocabulary [4] Include the algorithms as such behind statistical procedures, including procedures for statistical analysis; descriptions of the algorithms Algorithmic metadata Metadata layer Metadata registry Metadata repository Terminology, Framework, Glossary [5] [data warehouse] The data dictionary – This is usually more detailed than an operational system data dictionary. A central location in an organization where metadata definitions are stored and maintained in a controlled method. Metadata registries are used whenever data must be used consistently within an organization or group of organizations. A data dictionary [...] a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." A layer in the reference model for standardisation in statistics used to denote the set of attributes related to statistical metainformation Information system for registering metadata (MDR) Provides information on the definition, origin, source, and location of data [...] at many levels, including schemes, usage profiles, metadata elements, and code lists for element values. It provides an integrating resource for legacy data, acts as a lookup tool for designers of new databases, and documents each data element. Information system for registering metadata. Registration accomplishes three main goals: identification, provenance, and monitoring quality. [...] It manages the semantics of data. A logically central statistical metadata repository that allows for querying, editing, and managing of metadata. Such a system provides a mechanism for looking up information about statistical products as well as their design, development, and analysis. (Metadata holding) A logical or physical set of metadata (e.g. database) stored together with its description (e.g. schema)