Title: Definition of the functionalities of a metadata system to facilitate and support the operation of the S-DWH
WP: 1
Deliverable: 1.4
Version: 2.0 (Final)
Date: 30-9-2013
NSI: SE, SC, ONS, ISTAT
Authors: Maia Ennok, Kaia Kulla, Lars Goran Lundell (3.1), Colin Bowler (3.3), Viviana De Giorgi (3.4)

ESSnet on Micro Data Linking and Data Warehousing in Production of Business Statistics
ESSnet on Data Warehousing
Maia Ennok (Statistics Estonia)

INDEX
1. Introduction
1.1 Target audience
1.2 What is in this document?
2. Functionalities of a metadata system
2.1 Metadata creation
2.2 Metadata usage
2.3 Metadata maintenance
2.4 Metadata evaluation
3. Metadata functionalities by layers
3.1 Source layer
3.1.1 Metadata creation
3.1.2 Metadata usage
3.1.3 Metadata maintenance
3.1.4 Metadata evaluation
3.2 Integration layer
3.2.1 Metadata creation
3.2.2 Metadata usage
3.2.3 Metadata maintenance
3.2.4 Metadata evaluation
3.3 Interpretation and data analysis layer
3.3.1 Metadata creation
3.3.2 Metadata usage
3.3.3 Metadata maintenance
3.3.4 Metadata evaluation
3.4 Data access layer
3.4.1 Metadata creation
3.4.2 Metadata usage
3.4.3 Metadata maintenance
3.4.4 Metadata evaluation
3.4.5 Describing functionalities: an example
Table 1 Metadata functionalities
4. Functionalities of a metadata system based on running integrated metadata information system (iMETA) of Statistics Estonia
4.1 Metadata creation
4.1.1 Manual and automated creation
4.1.2 Metadata repository/managing meta models
4.1.3 Harvesting
4.1.4 Data access authorization metadata creation
4.2 Metadata usage
4.2.1 Users
4.2.2 Search and navigation
4.2.3 Metadata export (output generation)
4.2.4 Other systems and internationality
4.3 Metadata maintenance
4.3.1 Maintenance of metadata objects (insert, update, delete, versioning)
4.3.2 Users and rights (of metadata) management
4.3.3 User guides
4.4 Metadata evaluation
4.4.1 Use of standards
4.5 Metadata by metadata functionality groups and layers
Table 2 Metadata by metadata functionality groups and layers
5. Conclusion

1. Introduction
In an efficient metadata architecture, the tools used in the data warehousing implementation (ETL, data modelling, rules, etc.) produce metadata during the warehouse life cycle in a manner and format that allows it to be easily referenced and integrated with the integrated metadata system. A metadata solution should identify and prioritize the metadata functions that are important for statisticians' and IT users' understanding, navigation and acceptance of the system.
The metadata lifecycle (described in Metadata Framework for Statistical Data Warehousing [1]) is divided into the following three basic phases:
1. Collection
Metadata should be captured as early as possible in the production process. Collection of some types of metadata can and should be automated, and automation should be used whenever possible. When data is entered into the data warehouse, basic metadata must already exist in a correct form.
2. Maintenance
Metadata must always be up to date.
Processes must be in place to capture changes and to synchronize metadata with the changing architecture.
3. Deployment
Metadata must be available to users in the right form and with the right tools.

Different user categories need different metadata and have different requirements. End users want to use metadata to find and interpret the data they need easily and correctly. Data stewards want an inventory of what is stored in the data warehouse. Analysts want to compare the data sources. Programmers want to make sure that they use the standard names. To meet these diverse needs of the different users of (meta)data, statistical metadata must be managed and maintained in a metadata system that fulfils specific requirements.

[1] Lundell L.G. (2012) Metadata Framework for Statistical Data Warehousing, ver. 1.0. Deliverable 1.1

In this document we focus on the functionalities of a metadata system that facilitate and support the operation of the S-DWH. A metadata system can be evaluated by metadata exchange, usability, administration, and tool reliability. The components of a metadata system can be grouped into the following categories: creating metadata, automatic extraction and production, conversion between metadata formats, subject description, encoding, structure, and syntax, exchange/transfer of metadata, harvesting, indexing, search, and browse of metadata databases, metadata repositories and metadata storage, metadata display, integrated environments.
According to the Common Metadata Framework [2], the statistical metadata system (SMS) should be a tool enabling a statistical organization to perform effectively the following functions (which we have also found relevant to the S-DWH):
• Planning, designing, implementing and evaluating statistical production (S-DWH) processes.
• Managing, unifying and standardizing workflows and processes.
• Documenting data collection, storage, evaluation and dissemination.
• Managing methodological activities, standardizing and documenting concept definitions and classifications.
• Managing communication with end-users of statistical outputs and gathering of user feedback.
• Improving the quality of statistical data and transparency of methodologies. The SMS should offer a relevant set of metadata for all criteria of statistical data quality.
• Managing statistical data sources and cooperation with respondents.
• Improving discovery and exchange of data between the statistical organization and its users.
• Improving integration of statistical information systems with other national information systems.
• Disseminating statistical information to end users. End users need reliable metadata for searching, navigation, and interpretation of data.
• Improving integration between national and international organizations. International organizations are increasingly requiring integration of their own metadata with metadata of national statistical organizations in order to make statistical information more comparable and compatible, and to monitor the use of agreed standards.
• Developing a knowledge base on the processes of statistical information systems, to share knowledge among staff and to minimize the risks related to knowledge loss when staff leave or change functions.
• Improving administration of statistical information systems, including administration of responsibilities, compliance with legislation, performance and user satisfaction.

[2] Common Metadata Framework Part A, page 8 (http://www1.unece.org/stat/platform/display/metis/The+Common+Metadata+Framework)

1.1 Target audience
The aim of this document is to help statistical organizations improve the effectiveness of the metadata of the S-DWH across all layers of the S-DWH and all phases of the statistical business process.
It is intended as a tool to assist S-DWH experts/users (managers, designers, subject-matter specialists, methodologists, information technology experts, researchers etc.) to develop business cases for a new or enhanced SMS for S-DWH.

1.2 What is in this document?
In this document, we focus on the functionalities of a metadata system that facilitate and support the operation of the S-DWH:
• functionalities by functionality groups (metadata creation, metadata usage, metadata maintenance, metadata evaluation);
• metadata functionalities by layers (source layer, integration layer, interpretation and data analysis layer, data access layer);
• case study of metadata functionalities (integrated metadata system (iMETA) of Statistics Estonia).

2. Functionalities of a metadata system
The main functions of a metadata system are to gather and store metadata in one place, give an overview of metadata (queries, searches etc.), create and maintain metadata, evaluate metadata, and manage access rights through role-based security.
The core requirements of a metadata system are record creation, modification and deletion, multi-value attributes, select-list menus, simple and advanced search, simple display, import and export using XML documents (CSV), links to other databases, cataloguing history, and authorization management.
A metadata system has to satisfy the following requirements:
• provide different levels of information granularity;
• convert legacy systems and records into new ones;
• offer customized options for report generation;
• incorporate miscellaneous tools for metadata creation, retrieval and display;
• implement structured relations for existing metadata standards;
• enable multi-lingual processing (incl. Unicode character sets);
• include a built-in process for managing the workflow evaluation of metadata;
• include a role-based security system controlling access to all features of the system.
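As an illustration only, several of the core requirements listed above (record creation, modification and deletion, simple search, export, cataloguing history and role-based access) can be sketched in a few lines of Python. The class name MetadataStore, the role names and the permission table are assumptions made for this sketch; they are not taken from any existing SMS.

```python
import csv
import io
from dataclasses import dataclass, field

# Hypothetical role model: operations each role may perform.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "create", "update"},
    "admin": {"read", "create", "update", "delete"},
}


@dataclass
class MetadataStore:
    records: dict = field(default_factory=dict)  # record id -> attributes
    history: list = field(default_factory=list)  # cataloguing history

    def _check(self, role, operation):
        # Role-based security: refuse operations the role does not hold.
        if operation not in ROLE_PERMISSIONS.get(role, set()):
            raise PermissionError(f"role {role!r} may not {operation}")

    def create(self, role, record_id, **attributes):
        self._check(role, "create")
        if record_id in self.records:
            raise KeyError(f"{record_id} already exists")
        self.records[record_id] = dict(attributes)
        self.history.append(("create", record_id))

    def update(self, role, record_id, **attributes):
        self._check(role, "update")
        self.records[record_id].update(attributes)
        self.history.append(("update", record_id))

    def delete(self, role, record_id):
        self._check(role, "delete")
        del self.records[record_id]
        self.history.append(("delete", record_id))

    def search(self, role, text):
        # Simple search: case-insensitive match on any attribute value.
        self._check(role, "read")
        text = text.lower()
        return sorted(
            rid for rid, attrs in self.records.items()
            if any(text in str(value).lower() for value in attrs.values())
        )

    def export_csv(self, role):
        # Export all records as CSV text, one column per attribute name.
        self._check(role, "read")
        columns = sorted({k for attrs in self.records.values() for k in attrs})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["id"] + columns)
        writer.writeheader()
        for rid in sorted(self.records):
            writer.writerow({"id": rid, **self.records[rid]})
        return buffer.getvalue()
```

A real SMS would keep records in a database and version them rather than overwrite them; the sketch only shows how the listed requirements relate to each other.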
In the Common Metadata Framework [3] a model for managing the phases of an SMS development life cycle is presented. SMS management has the following phases: design, implementation, maintenance, use and evaluation. In this document we do not describe the management of the phases but the phases themselves. Considering all of the above, we can specify the following metadata functionality groups for the metadata system of an S-DWH:
• metadata creation;
• metadata usage;
• metadata maintenance;
• metadata evaluation.

[3] Common Metadata Framework Part A, page 26

Metadata management consists of the creation, usage, maintenance and evaluation of metadata. Metadata management is covered in more depth in Recommendations and guidelines on the governance of metadata management in the S-DWH (deliverable 1.5) [4]. Metadata management also includes user training and composing the user guide of the metadata system.

2.1 Metadata creation
Metadata creation means bringing metadata into the metadata system by creating or collecting it. In metadata creation, metadata objects, their definitions, links between metadata objects and processes, and the metadata repository are created by searching, retrieving, exporting and downloading metadata.
List of metadata functionalities:
• manual creation;
• automated creation;
• harvesting from other systems:
  o automated extraction (an automated, regular process of collecting metadata descriptions from different sources to create useful aggregations of metadata and related services);
  o converting metadata;
  o manual import from files (XML, CSV);
• data access authorization metadata creation;
• implementing the metadata repository;
• creating links between metadata objects.

2.2 Metadata usage
According to the Metadata Framework, users of S-DWH metadata can be both humans (statisticians, IT specialists, end-users etc.) and machines (other systems).

[4] De Giorgi V., Lindelauf M. (2013) Recommendations and guidelines on the governance of metadata management in the S-DWH. Deliverable 1.5

All metadata must be deployed, meaning that metadata must be available to users in the right form and with the right tools. Metadata is presented to other systems by a metadata system that is integrated with other systems and S-DWH components.
List of metadata functionalities:
• search;
• navigation;
• metadata export;
• international use.

2.3 Metadata maintenance
The role of metadata maintenance is to ensure that all metadata stored in the metadata repository are up to date for ongoing use.
List of metadata functionalities:
• maintenance of metadata history (versioning, input, update, delete);
• updating meta models in the metadata repository;
• updating links between metadata objects;
• users and rights (of metadata) management.

2.4 Metadata evaluation
The metadata system should include functionalities to evaluate the quality of metadata and to assure that metadata is of high quality. Metadata quality requirements are set in Recommendations on the impact of (meta)data quality in the S-DWH (deliverable 1.2) [5]. Quality evaluation processes follow the indicators/requirements that must be implemented.
List of metadata functionalities:
• metadata validation (for example check value domains, check links between metadata objects);
• standards usage.

[5] Bowler C., Lindelauf M., Dressen J. (2013) Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse. Deliverable 1.2

3. Metadata functionalities by layers
Layers are defined in the S-DWH Business Architecture [6] document (deliverable 3.1) and metadata subsets by layers are defined in the Metadata Framework [7] (deliverable 1.1).

3.1 Source layer
The source layer is data's entry point to the S-DWH.
It is responsible for receiving and storing the original data from the NSI's internal or external sources and making data available to the ETL functions that bring data to the integration layer.

3.1.1 Metadata creation
In an ideal situation all metadata necessary to forward data from the source layer to the integration layer have either already been created by the external data suppliers and are delivered to the S-DWH, or can be created automatically either in the source layer or in the integration layer. In any case, a minimum requirement is that the technical metadata that describe the incoming data are provided by the data suppliers.
If the metadata created by the external sources are delivered in standardised formats, such as DDI, SDMX, etc., the source layer should be able to create the metadata needed in the S-DWH by extracting them and, if necessary, converting them to the required formats automatically.
Creating metadata by manually adding them to the S-DWH metadata repository should be a last resort, but will probably often be necessary to some degree. For example, metadata that document a questionnaire may be created automatically or may need manual creation, depending on what software has been used for the questionnaire design.
Creating metadata in a controlled manner requires the use of a relevant editing tool, preferably a dedicated one. The tool must let the operator enter the required codes, texts, links, etc., in standardised formats, and should also allow the operator to validate the entered values immediately. It should contain copy/paste functions to make the task as manageable as possible. An example of when manual creation of metadata is necessary is when a value domain (code list) is only available as hardcopy.

[6] Laureti Palma A. (2012) S-DWH Business Architecture. Deliverable 3.1
[7] Lundell L.G. (2012) Metadata Framework for Statistical Data Warehousing, ver. 1.0. Deliverable 1.1

3.1.2 Metadata usage
The source layer in itself uses relatively few metadata. It needs information on the sources, such as:
• responsibilities for data deliveries (who makes source data available to the S-DWH, which access rights are needed, etc.);
• the methods to be used (are data going to be delivered to the S-DWH – "pushed", physically collected by the S-DWH from some agreed location – "pulled", or directly accessed from the original location – "virtual storage");
• if relevant and possible, the expected frequencies (when will new source data be available);
• source data formats (record layout, storage type, location).
One of the main tasks of the source layer is to act as the warehouse's gatekeeper, the function that makes sure that all data entered into the S-DWH adhere to an agreed set of rules (recommendations on metadata quality are described in Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse [8]). These rules are expressed as technical and process metadata. This means that in order to accept a delivery of source data ("raw data") and allow them to be forwarded to the next layer, relevant and correct metadata must be available, i.e. they must already exist or they must be created.

3.1.3 Metadata maintenance
Some metadata are closely linked to one particular data delivery, e.g. the results of a census, but others are valid for several deliveries, e.g. the same metadata refer to several rounds of a survey. The first type may be delivered as part of the data delivery, while the second type should be entered in advance, before the first data delivery.

3.1.4 Metadata evaluation
Regardless of whether metadata are entered manually or created automatically, they must always be validated.
New metadata should be compared with and checked against already existing metadata and, if relevant, data, to ascertain consistency within the metadata repository and between data and metadata.

[8] Bowler C. (2013) Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse. Deliverable 1.2

The source layer's gatekeeper responsibility requires that all codes that appear in the data also appear in the metadata as enumerated value domains. Since many of these codes will be used as dimensions in the following layers, it is vital that no values are missing. A check that no mismatches exist must be carried out in the source layer, and any errors found must be corrected by editing the metadata or the data. If metadata contain minimum and maximum values (e.g., a percentage value must be within the range 0-100), the corresponding data values should be checked, and corrected when needed.

3.2 Integration layer
According to the S-DWH Business Architecture [9], all clerical operational activities typical of a statistical production process are carried out in the integration layer. This means operations carried out, automatically or manually, by users to produce statistical information in an IT infrastructure. This layer includes all the sub-processes of phase 5 and one sub-process of phase 6 of the GSBPM. We also include some sub-processes of phases 1, 2 and 3 of the GSBPM which are relevant for the metadata of the S-DWH. All classical ETL processes are covered in the integration layer of the S-DWH.

3.2.1 Metadata creation
Most statistical metadata is created manually, process metadata is created both manually and automatically, and technical metadata is created mostly automatically, as is quality metadata. Standards are used as much as possible for creating the metadata of the integration layer; for example, Neuchâtel is used for statistical metadata and ESQRS for quality metadata.
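Looking back at the source-layer gatekeeper checks of section 3.1.4 (codes must belong to an enumerated value domain, and values must respect any minimum and maximum held in the metadata), the following Python fragment gives a minimal sketch. The function name check_delivery and the shape of the value_domains and ranges dictionaries are assumptions made for this illustration only.

```python
def check_delivery(rows, value_domains, ranges):
    """Compare a data delivery with its metadata.

    rows          -- list of dicts, one per record (variable -> value)
    value_domains -- variable -> set of allowed codes (enumerated domain)
    ranges        -- variable -> (minimum, maximum), both inclusive
    Returns a list of (row index, variable, value, problem) tuples.
    """
    problems = []
    for index, row in enumerate(rows):
        for variable, value in row.items():
            # Every code in the data must appear in the value domain.
            domain = value_domains.get(variable)
            if domain is not None and value not in domain:
                problems.append((index, variable, value, "code not in value domain"))
            # Values must lie within the documented minimum and maximum.
            limits = ranges.get(variable)
            if limits is not None:
                low, high = limits
                if not (low <= value <= high):
                    problems.append((index, variable, value, "value out of range"))
    return problems
```

The function only reports mismatches. Whether a mismatch is resolved by editing the metadata or the data remains a decision for the operator, as described in section 3.1.4.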
Metadata harvesting depends on how the S-DWH is developed; for example, in the integration layer process and technical metadata are usually created in the S-DWH and harvested by the metadata system. If the metadata of the integration layer is in another format, it should be converted to a suitable format; for example, transformation rules in collection systems often have a different format from the one needed in data processing.
Data access metadata (authorisation metadata) is created for the data warehouse (data marts) and data staging areas.
In the integration layer the following metadata for integration is created in phases 1-3 of the GSBPM (Specify needs, Design, Build):
• Preparing the business case: metadata of the statistical activity is created (manually, by filling in the SMS according to standards such as ESMS, Neuchâtel etc.).
• Designing outputs: output variable metadata (algorithms) and output validation metadata (algorithms) are created.
• Designing variable descriptions: variable metadata (quality metadata according to ESQRS), classifier metadata and algorithms are created; new variable metadata (creating algorithms) is created using the metadata of available variables.
• Design frame and sample methodology: frame, sample and stratum metadata are created.
• Design production systems and workflow: data warehouse data model metadata is created; coding algorithms are created using classifier and coding table metadata; imputation algorithms are created using imputation methods; dissemination metadata (publication calendar) is created; algorithms of statistical confidentiality are created; data processing algorithms (incl. aggregation) and scheduling metadata are created; questionnaire designs are created; the data model of raw data is created; raw data validation algorithms are created; pre-fill metadata is created.
• Configure workflows: scheduling technical metadata is created.
• Building the data collection instrument: data collection structure technical metadata is created.
NB! If new needs arise in later phases, the previous phases have to be checked for any need to change the metadata.

Metadata of phase 5 – Process
• For data integration, process metadata (where the data comes from, i.e. data source metadata) and quality indicators for the integration process are created.
• For classify & code, process metadata is created.
• For review, validate & edit, quality variables (edit failure rate - ESQRS) and process metadata are created.
• For imputation, quality variables, quality variable metadata (imputation rate - ESQRS) and process metadata (how, who, when etc.) are created.
• For derive new variables and statistical units, process metadata and quality variable metadata (new variables and new statistical unit rate) are created.
• For weights calculation, process metadata is created automatically.
• For aggregate calculation, process metadata is created automatically.
• For data files finalization, data finalizing metadata (date, person etc.) and metadata on data loading to the SDW (logs, date, person etc.) are created.

[9] Laureti Palma A. (2012) S-DWH Business Architecture. Deliverable 3.1

Metadata of phase 6 – Analyze
• For draft output preparation, output variable metadata (algorithms) is created using output variables metadata, classifiers metadata is created, and table titles are created (using output variable metadata).

3.2.2 Metadata usage
Metadata users in the integration layer are both humans and machines. In every process of the integration layer metadata should be navigable and searchable (for example, browsing the metadata of variables by statistical activities and domains). All metadata objects in the metadata system are related (for example, a variable is related to a statistical activity and a classifier).
Metadata is multilingual (English, local language) and can be shared internationally via unified services in a standard format (such as XML, SDMX). The S-DWH shares its metadata with other systems via the metadata system. In the S-DWH a data object holds a reference to a metadata object (for example, by metadata object id) in the metadata system. Metadata of the integration layer can be exported from the metadata system. The S-DWH uses metadata from the metadata system, which also retrieves metadata from other systems.
In the integration layer, phases 1-3 of the GSBPM use the following metadata:
• For checking data availability, metadata of available administrative data is used.
• For design production systems and workflow, variable metadata, tables/columns metadata, classifiers and coding tables metadata, imputation methods and design variable descriptions, and available variables metadata are used.
• For configuring workflows, scheduling statistical and process metadata is used.
NB! If new needs arise in later phases, the previous phases have to be checked for any need to change the metadata.

Phase 5 – Process
• For data integration, pre-filling metadata, sample metadata, the data model of raw data, variable metadata and set-up collection metadata are used.
• For classify & code, coding algorithms, classifiers and coding tables metadata are used.
• For imputation, imputation algorithms are used.
• For derivation of new variables and statistical units, the algorithms for creating new variables are used.
• For calculate weights, stratum and frame metadata are used.
• For calculate aggregates, aggregation methods, variable metadata and classifiers metadata are used.
• For finalize data files, data warehouse data model metadata is used.

Phase 6 – Analyze
• For draft output preparation, output variables metadata and classifiers metadata are used.
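The reference from a data object to a metadata object by id, together with the export of multilingual metadata in an XML format, can be sketched as follows. This is a minimal illustration using the Python standard library; the identifiers VAR001 and CL001, the element names and the dictionary layout are invented for the example, and the output is generic XML, not SDMX.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata repository: metadata object id -> multilingual labels.
REPOSITORY = {
    "VAR001": {"labels": {"en": "Turnover", "et": "Käive"}},
    "CL001": {"labels": {"en": "Economic activity classification"}},
}

# Hypothetical S-DWH data object: each column refers to a metadata object id.
DATA_OBJECT = {
    "table": "enterprise_facts",
    "columns": {"turnover": "VAR001", "activity": "CL001"},
}


def export_metadata(metadata_ids, repository):
    """Serialise the referenced metadata objects as an XML string."""
    root = ET.Element("metadata")
    for metadata_id in metadata_ids:
        obj = ET.SubElement(root, "object", id=metadata_id)
        for lang, text in sorted(repository[metadata_id]["labels"].items()):
            label = ET.SubElement(obj, "label", lang=lang)
            label.text = text
    return ET.tostring(root, encoding="unicode")


# Export the metadata of every object referenced by the data object.
referenced_ids = sorted(set(DATA_OBJECT["columns"].values()))
xml_text = export_metadata(referenced_ids, REPOSITORY)
```

Because the data object stores only ids, labels and definitions are maintained once in the metadata system and resolved at export time.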
3.2.3 Metadata maintenance
The main functionalities of metadata maintenance at the integration layer level are:
• Maintain (create, update, delete, versioning) integration metadata (data processing algorithms, data warehouse data models etc.).
• User rights follow the S-DWH system operations of all S-DWH processes in all layers (for example data transformation, data loading). The S-DWH has the following operations: read metadata, create data processing packages, access sensitive data, solve data processing tasks, schedule packages, see logs etc.
• Integration metadata can be stored in different meta models (maintaining meta models), e.g. a meta model for statistical metadata, one for process metadata etc.
• The S-DWH uses the same metadata repository (the metadata repository can hold different metadata models for different metadata subsets).
• All users of the S-DWH can access metadata (for viewing only), but changing metadata requires privileges granted by S-DWH operations and statistical activities.

3.2.4 Metadata evaluation
When integration layer metadata (data processing algorithms) is created, it is validated: required values must be present, data types are checked, links may point only to existing objects, and data models must carry comments. Metadata is validated according to the standards in use. Some evaluation controls are built into the SMS for the metadata fill-in processes, some are systematic built-in processes for managing the workflow evaluation of metadata (validation queries), and some are organizational processes.

3.3 Interpretation and data analysis layer
This layer is mainly aimed at 'expert' users (i.e. statisticians/domain experts and data scientists) for carrying out advanced analysis, data manipulation and interpretation functions, and access would be mainly interactive. The work in generating the analysis is effectively a design of potential statistical outputs.
This layer might produce such things as data marts, which would contain the results of the analysis. In many cases, however, the results of an investigation into the data required for a particular analysis may identify a shortfall in the availability of the required information. This may trigger the identification of requirements for whole new sets of variables, and methodologies around the processing of them.
The following are some examples of the metadata used in this layer, and the GSBPM sub-processes where they might apply.

3.3.1 Metadata creation
• Creation of a design for a new analysis or output definition [Note: the analysis might not actually become an output; it could be experimental at first] (process 2.1 – Design outputs).
• Manual creation of variable definitions for the new analysis or output (process 2.2 – Design variable descriptions).
• Creation of the methodology design for the statistical processing, e.g. creation of specifications for imputation, validation etc. (process 2.5 – Design statistical processing methodology).
• Manual creation of SQL scripts (or other data manipulation programming media) encompassing the data selection rules required to identify the data to be used in the analysis. This might also include data matching and linking routines (active metadata) (process 5.1 – Integrate data).
• Manual creation of a quality report (reference metadata) relating to the new special analysis after it has been run (process 6.5 – Finalise outputs).
• Manual creation of Interpretation Document metadata in text form to accompany any data sets. Also, creation of graphics/charts etc. to accompany the data and assist the user in interpreting the data (passive metadata; process 6.5 – Finalise outputs).
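The notion of an SQL script acting as active metadata can be illustrated with a small sketch: the selection rule is stored as part of an analysis description and executed against the warehouse. The sketch assumes an in-memory SQLite database in place of the S-DWH; the table enterprise, the analysis id AN001 and the threshold parameter are invented for the example.

```python
import sqlite3

# Hypothetical analysis description: the SQL text is stored as metadata
# and drives the data selection (active metadata).
ANALYSIS = {
    "id": "AN001",
    "description": "Enterprises with turnover above a threshold",
    "selection_sql": (
        "SELECT unit_id, turnover FROM enterprise "
        "WHERE turnover > :threshold ORDER BY unit_id"
    ),
    "parameters": {"threshold": 100},
}


def run_analysis(connection, analysis):
    """Execute the stored selection rule and return the selected rows."""
    cursor = connection.execute(analysis["selection_sql"], analysis["parameters"])
    return cursor.fetchall()


# Stand-in for the data warehouse, filled with three invented units.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE enterprise (unit_id TEXT, turnover REAL)")
connection.executemany(
    "INSERT INTO enterprise VALUES (?, ?)",
    [("A", 50.0), ("B", 150.0), ("C", 300.0)],
)
selected = run_analysis(connection, ANALYSIS)
```

Keeping the script with its description and parameters in the metadata repository means the same definition documents the analysis and reproduces it.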
3.3.2 Metadata usage

- Examine the available metadata describing administrative data sources and methodologies in order to evaluate their suitability for a new analysis or an identified output. Also examine the descriptions of existing variables to check data availability and suitability for inclusion in the analysis or output (process 1.5 – Check data availability).
- View lists of variables, code lists and classifications, together with their definitions, to determine the structural metadata items to be used as search criteria when providing data for the analysis and for data integration. These variables also serve as the essential dimensions of the fact tables of the data marts (process 5.1 – Integrate data).
- Run the SQL scripts (or algorithms encoded in other programming media) that extract and integrate the data from different sources in the data warehouse (process 5.1 – Integrate data).
- Use disclosure rule metadata in the disclosure checking process for the output datasets created by the run of the analysis (process 6.4 – Apply disclosure control).
- Use quality metadata as input to any interpretation documentation accompanying output datasets, e.g. when assessing "measures of uncertainty" for the output (process 6.5 – Finalize outputs).

3.3.3 Metadata maintenance

- Check that the user attempting to create a new analysis design has the appropriate rights in the S-DWH, i.e. write access to the appropriate area of the metadata repository. Similarly, check read rights for access to certain data groups.
- Delete old or defunct analysis descriptions and their associated SQL scripts as part of a maintenance/archive function.

3.3.4 Metadata evaluation

- Record quality characteristics from the different elements of the analysis during the preparation of a draft output. This might take the form of quality indicator attributes attached to variables (process 6.1 – Prepare draft outputs).
- Following evaluation of the output as a whole, attach or set approval status metadata for the statistical content (process 6.5 – Finalize outputs).

3.4 Data access layer

The access layer is the fourth and last layer identified in a generic S-DWH. It sits at the end of the S-DWH process and, together with the interpretation layer, represents the operational IT infrastructure. It is the layer for the final presentation, dissemination and delivery of the information sought [10].

[10] Laureti Palma A. (2012) S-DWH Business Architecture. Deliverable 1.3.

3.4.1 Metadata creation

At the data access level, creating metadata about S-DWH data is mostly a matter of converting or harvesting metadata already created in the other layers so that it can be used for dissemination. What is needed at this level is therefore a procedure for harvesting metadata that already exists. The only metadata actually created here is of the following types:

1. metadata about data access, for example statistics on users' access to data and metadata: which data is most requested, for which year, at which level of disaggregation, etc.;
2. users' evaluation metadata, e.g. an assessment of how easy it is to find information.

Metadata about users and uses is created in an automated way. Users' evaluation metadata should also be generated automatically.

3.4.2 Metadata usage

At the access level the main users of data and metadata are final users (researchers, students, organizations, etc.) who want to know the general meaning of the data as well as its accuracy, availability and other important quality aspects. This enables them to correctly identify and retrieve statistical data potentially relevant to a given study, research question or purpose, and to correctly interpret and (re)use it. Metadata concerning the quality, content and availability of data and processes is an important part of a feedback system, as are users' evaluations and statistics on users' data access.

The main functionalities, uses and purposes of metadata at the data access layer are:

- searching data/metadata, which consists of identifying the existence of specific information;
- locating data/metadata, which means tracing a specific occurrence of data/metadata;
- selecting and filtering data/metadata;
- obtaining information on data/metadata availability by querying the data and metadata repositories directly;
- obtaining feedback/evaluation from users: compiling statistics on the most and least searched tables, graphs and data series; using systems for collecting ratings of ease of use, understandability, usefulness, glossary, etc.; tracking help usage; giving a questionnaire to a random sample of users;
- analysing, evaluating and assessing information, for example users' feedback, by using statistical tools;
- providing for the accessibility and availability of data/metadata to other systems or services managed by others, for example access from external wikis and web sites;
- semantic interoperability and cooperation, which means enabling researchers to exchange information through a series of equivalences among the addressed information, so as to best facilitate the coding, transmission and use of data when content-dependent metadata coming from the other layers evolves;
- including metadata for language variations, language limiters to filter search results, and abstracts in various languages; integrating multilingual thesauri and international classification schemes; providing links to translations, related databases and web sites, etc.
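As an illustration of the searching, filtering and "language limiter" functionalities listed above, the following minimal Python sketch searches a small multilingual catalogue. The record structure, the `search` function and the sample titles are assumptions made for this example only; they do not describe any particular S-DWH implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical record: one identifier with a title per language code."""
    identifier: str
    titles: dict = field(default_factory=dict)  # language code -> title


def search(records, term, language=None):
    """Return identifiers of records whose title contains `term`.

    Matching is case-insensitive. If `language` is given, only titles in
    that language are searched (the language limiter); otherwise every
    language variant is searched.
    """
    needle = term.casefold()
    hits = []
    for record in records:
        for lang, title in record.titles.items():
            if language is not None and lang != language:
                continue
            if needle in title.casefold():
                hits.append(record.identifier)
                break  # one matching title per record is enough
    return hits


# Invented sample catalogue, for illustration only.
CATALOGUE = [
    MetadataRecord("T1", {"en": "Turnover of enterprises",
                          "fr": "Chiffre d'affaires des entreprises"}),
    MetadataRecord("T2", {"en": "Number of employees",
                          "fr": "Nombre de salariés"}),
]

print(search(CATALOGUE, "turnover"))               # ['T1']
print(search(CATALOGUE, "nombre", language="fr"))  # ['T2']
print(search(CATALOGUE, "nombre", language="en"))  # []
```

Keeping all language variants on one record, rather than one record per language, means a search without a limiter finds the object whichever language the user types in.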
3.4.3 Metadata maintenance

The main functionalities and actions of metadata maintenance at the data access layer are:

- creating, updating, deleting and reviewing metadata;
- harmonising and exploiting data/metadata for re-use;
- maximising the utility and usability of data/metadata and promoting its re-use, e.g. by externalising metadata, which means converting it into shared metadata that serves more than one object (several tables, several graphs);
- exporting metadata for use in other systems;
- converting metadata from one standard format to another by using a metadata translation engine, which should be configurable to support any metadata standard or profile;
- managing resources, i.e. managing data/metadata libraries through catalogues and descriptors;
- managing authentication for data/metadata access;
- managing metadata about users of data and the controlled use of data, e.g. describing how data can be accessed, when, by whom and with which restrictions and constraints.

3.4.4 Metadata evaluation

To ensure that metadata is of good quality [11], which means providing unambiguous and definite access to the data, at least the following functionalities and properties should be implemented:

- using domain-independent metadata properties;
- ensuring valid values and structures, and appropriate default values;
- creating procedures to verify individual items;
- aggregating metadata for web searching, e.g. clustering it into a tree graph or organising it so that searching becomes easier and more standardised;
- using standardised and harmonised metadata formats for official statistics and implementing specific components for other types of dissemination;
- standardising remote data/metadata access;
- using automated or assisted procedures to update metadata as soon as it is available;
- synchronising the dissemination of metadata with the dissemination of the data to which it applies;
- implementing side services that require the integration of metadata and cooperation in the acquisition of new metadata;
- creating multilingual aids for users;
- facilitating coordination between administrations with regard to common needs for shared resources, e.g. by managing permissions for administrators at different layers so that they understand each other's concerns, using uniform programming languages, permission types, etc., and making it possible to submit requests such as proposals, notifications and comparisons;
- taking into account the international and future vision of the S-DWH, e.g. international visibility can be promoted whilst national interests are served.

[11] The six dimensions of data quality are relevance, accuracy, timeliness, accessibility, interpretability and coherence. They can be applied to metadata as well.

3.4.5 Describing functionalities: an example

Discussions exploring metadata in the information resource community (libraries, archives, museums and other information centres) tend to group metadata elements by the functions they support. The result is the identification of different types of metadata (or metadata classes), each of which comprises multiple metadata elements [12]. The metadata framework for statistical data warehousing identifies metadata categories and metadata subsets [13].
For the S-DWH business architecture, the sub-phases of the GSBPM at the level of the data access layer are defined [14]. Table 1 gives an example of how metadata functionalities can be described [15]: it first identifies the functions used at the data access layer, the input needed and the output obtained, and then the categories and subsets of metadata used. The last column identifies one of the GSBPM sub-phases that each function implies.

[12] Greenberg J. (2005). Understanding Metadata and Metadata Schemes. The Haworth Press, Inc.
[13] Lundell L.G. (2012) Metadata Framework for Statistical Data Warehousing, ver. 1.0.
[14] Laureti Palma A. (2012) S-DWH Business Architecture. Deliverable 3.1
[15] The list is not exhaustive and regards only the data access layer.

Table 1 Metadata functionalities

Function name | Description | Data input | Data output | Metadata categories | Metadata subsets | GSBPM phase
Data/metadata display; Easier/simple/advanced search | Allows the user to display and search the data/metadata by setting a basic/minimum/detailed set of search criteria | Search parameters | List of metadata (HTML or other formats) | Statistical, Process, Quality | Active/passive, Formalised, Reference/structural | 7.4
Navigator; On-line help; Dictionary | Allows the user to interact with the metadata system in order to search data/metadata of interest, view and understand them | - | Maps, dictionary items, list of objects | - | Passive, Free form | 7.4
Download/Export | Allows the user to download the data/metadata about data/services | - | Metadata file | Statistical, Process, Quality | Active/passive, Formalised, Reference/structural | 7.3
Login/logout | Allows the user to log in to or terminate the reserved session, by type of user | Access credential | (Possible) error messages | Authorisation | Active, Formalised, Structural | 7.1
Users management (administrator) | Allows the administrator to manage the register of accredited users/levels of users | Alphanumeric parameters | Information on users | Authorisation, Technical | Active, Formalised, Structural | 7.5
Monitor users access (administrator) | Allows the administrator to monitor accesses and operations executed by users | - | - | Authorisation, Technical | Active, Formalised, Structural | 7.3
Status change (administrator) | Allows the administrator to change the "users' status" as a consequence of modified characteristics | Specified parameters | - | Authorisation, Technical | Active, Formalised, Structural | 7.1

4. Functionalities of a metadata system based on the running integrated metadata information system (iMETA) of Statistics Estonia

In 2011 Statistics Estonia developed and put into production an integrated metadata system (iMETA). It has been used, among other things, in the statistical data warehouse of the 2011 population and housing census.

4.1 Metadata creation

4.1.1 Manual and automated creation

Metadata can be created either by automated information processing or by manual work. In iMETA, metadata can be created manually by filling in forms or by importing metadata from files (for example CSV files). An automated process can create metadata by scanning database table structures, together with their descriptions, into the metadata repository.

4.1.2 Metadata repository/managing meta models

Statistics Estonia has one metadata repository, in which all metadata is stored. iMETA is one of the applications that use this repository; the data processing system (VAIS) and the user role management application (URMA) also create and maintain their metadata in it. In the iMETA user interface, users can define and maintain links between metadata in different meta models and between different sets of metadata. The metadata system is based on a meta-meta model that contains different meta models. Some meta models are created and managed by different systems, but all metadata is stored in this one metadata repository.

Meta models in the metadata repository:

- Neuchâtel – terminology metadata (iMETA). Meta model for the integrated statistical metadata system.
This meta model contains statistical metadata such as variables, classifiers and code lists.
- RBAC – role-based access control meta model for the separate user and role management application (URMA). User and role metadata can be viewed in the metadata navigator. This meta model contains authorisation metadata such as roles and privileges by S-DWH operation.
- XDTL – extensible data transformation language meta model for S-DWH ETL objects. This meta model contains process metadata such as ETL procedures (processing algorithms).
- RDBMS – relational database management system meta model for database metadata. Technical metadata is created in the databases and then scanned into the integrated metadata system (manual management in iMETA, automatic management by importing scanned database element structures). This meta model contains technical metadata such as server names, locations, databases, tables and columns.

The principle is that metadata is filled in where it is formed (process and unit).

4.1.3 Harvesting

Metadata captured automatically by computers can include information about when a metadata value was created, who created it and when it was last updated. Each metadata element is created and maintained in one system or place, but it can be viewed and used in several systems. The full history is also stored.

For efficient metadata retrieval, a database API can be used. Text files of metadata (CSV files in UTF-8) can be imported through the user application. The attributes of metadata elements make it easy to convert metadata from legacy and other systems into new ones.

An automated process can create metadata by scanning database table structures, together with their descriptions, into the metadata repository; in scanning and storing, the metadata is converted to a suitable format.
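The automated scan of database table structures can be illustrated with a minimal Python sketch. It uses SQLite from the standard library purely as a stand-in database; the function name, the output dictionary keys and the sample table are assumptions for this example and do not reflect how iMETA actually performs the scan.

```python
import sqlite3


def scan_table_structure(connection):
    """Collect table and column descriptions from a SQLite database.

    Returns a list of dicts, one per column, standing in for the
    technical metadata that a repository might store.
    """
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    metadata = []
    for (table_name,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(
            f'PRAGMA table_info("{table_name}")'
        ).fetchall()
        for _cid, column_name, column_type, notnull, _default, pk in columns:
            metadata.append({
                "table": table_name,
                "column": column_name,
                "type": column_type,
                "required": bool(notnull),
                "primary_key": bool(pk),
            })
    return metadata


def demo():
    """Scan a hypothetical one-table database created in memory."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE enterprise ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, turnover REAL)"
    )
    return scan_table_structure(connection)


for row in demo():
    print(row)
```

A production scan would read the catalogue views of the actual database system and would also pick up table and column descriptions, which SQLite does not store.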
Metadata in the metadata system is held in different formats with links between them, so it is easy to convert metadata to another format on export or import (for example between Neuchâtel and ESMS).

4.1.4 Data access authorization metadata creation

Authorization metadata for data access in the S-DWH follows the responsibility for a statistical activity as recorded in iMETA (domain manager/statistical activity manager) and requires the approval of the direct manager. Access rights to data are granted according to S-DWH privileges (for example viewing and extracting the data of a statistical activity from the collection system). In iMETA, data access authorization metadata can be created for the following S-DWH data:

- raw data from the collection system,
- data in the data staging area (data processing data),
- data in the data warehouse,
- data cubes.

4.2 Metadata usage

4.2.1 Users

Metadata users in the source layer are statistical activity managers/domain managers, data warehouse developers/IT specialists and methodologists.

Metadata users in the integration layer are statistical activity managers/domain managers, data warehouse developers/IT specialists and data processing operators.

Metadata users in the interpretation and analysis layer are statistical activity managers/domain managers and analysts.

Metadata users in the access layer are mainly final/end users (researchers, students, organizations, etc.), statistical activity managers, methodologists and dissemination specialists.

4.2.2 Search and navigation

All metadata in the integrated metadata system can be searched with a full-text search function, which covers all terminology (Neuchâtel) objects such as statistical activity, classifier, concept, unit of measure, questionnaire and technical characteristic. In the metadata navigator, users can browse information about metadata objects, such as object properties and object links, for SQL objects, metadata elements, terminology objects and other objects (role, user).
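The combination of full-text search and link navigation described above can be sketched as a small in-memory structure in Python. The class, its method names and the sample objects are hypothetical and serve only to show the idea; they are not the iMETA data model or API.

```python
from collections import defaultdict


class MetadataNavigator:
    """Toy in-memory store of metadata objects and the links between them."""

    def __init__(self):
        self.objects = {}              # object id -> {"type": ..., "name": ...}
        self.links = defaultdict(set)  # object id -> ids of linked objects

    def add_object(self, object_id, object_type, name):
        self.objects[object_id] = {"type": object_type, "name": name}

    def link(self, first_id, second_id):
        # Links are stored in both directions so navigation works either way.
        self.links[first_id].add(second_id)
        self.links[second_id].add(first_id)

    def full_text_search(self, text):
        """Return sorted ids of objects whose name or type contains `text`."""
        needle = text.casefold()
        return sorted(
            object_id
            for object_id, props in self.objects.items()
            if needle in props["name"].casefold()
            or needle in props["type"].casefold()
        )

    def linked_objects(self, object_id, object_type=None):
        """Return sorted ids linked to `object_id`, optionally of one type."""
        return sorted(
            other
            for other in self.links[object_id]
            if object_type is None or self.objects[other]["type"] == object_type
        )


# Invented sample objects, for illustration only.
nav = MetadataNavigator()
nav.add_object("SA1", "statistical activity", "Population and housing census")
nav.add_object("CL1", "classifier", "Classification of dwellings")
nav.add_object("C1", "concept", "Dwelling")
nav.link("SA1", "CL1")
nav.link("CL1", "C1")

print(nav.full_text_search("dwelling"))                   # ['C1', 'CL1']
print(nav.linked_objects("CL1"))                          # ['C1', 'SA1']
print(nav.linked_objects("CL1", object_type="concept"))   # ['C1']
```

Because every object sits in one store, a search hit can be followed directly to its linked objects, which is the benefit the single repository is meant to provide.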
4.2.3 Metadata export (output generation)

The metadata system offers customised options for report generation. A report or screen view is used to generate metadata export files (outputs in CSV or XML format). For XML output, the steps are: describe the XML structure and the CSV file (choosing columns on the screen), match them with the metadata, and generate the output (ESMS output transport).

4.2.4 Other systems and internationality

Other systems use metadata from the metadata system, which is integrated with them. Metadata is used in all data warehouse phases and tools: data extraction, transformation, loading, presentation, etc. Metadata can be retrieved and output through an API (combined according to needs).

The integrated metadata system can be used internationally (sharing the system and the context):

- Multilingual by system and by context.
- Enables multilingual processing (incl. Unicode character sets).
- Expandable by using different meta models.

4.3 Metadata maintenance

4.3.1 Maintenance of metadata objects (insert, update, delete, versioning)

In the integrated metadata system, metadata objects are created and changed (updated, deleted) according to business rules. Versioning rules can vary by metadata object.

Statistical metadata (also quality metadata – statistical activity attributes):

- Statistical activity – overview of statistical activities, statistical activities by domains and sub-domains, search, copy, description, associated indicators, variables, classifiers, questionnaires.
- Classifiers – maintenance of domains of classifiers, classifiers by domains, versions of classifiers: attributes, elements, variants, levels, matching tables, indexes.
- Concepts – maintenance of domains, sub-concepts, associated concepts, associated statistical characteristics, vocabularies.
- Statistical characteristics – maintenance of statistical units by type, subtypes, characteristics.
- Units of measure – maintenance of types, associated statistical characteristics.
- Questionnaire – maintenance of questionnaire groups, versions of questionnaires (periods, deadlines, version of statistical activities).
- Legal acts – content of legal acts for producing statistics.

Technical metadata:

- Technical characteristics – maintenance of database metadata, description of databases, objects (tables) and columns. Maintained either manually in the iMETA application or by an automatic process that scans database structures and imports their metadata into the metadata repository.

Process metadata:

- Process log (start time, end time, duration, who started)
- Validation rules
- Transformation and editing rules

Authorisation metadata:

- Users, roles, privileges and operations, as well as access to data in the S-DWH applications.

Links between metadata objects: as all metadata is in one metadata repository, it is possible to get an overview of which statistical variables are used in which statistical activities, and of which statistical unit is described by which statistical characteristics.

4.3.2 Users and rights (of metadata) management

The methodological unit is responsible for metadata management. Different units are responsible for metadata inputs and updates. The metadata system applies the following principles of user rights:

- All users of the iMETA application can see all metadata in the metadata repository.
- Rights to add, change or delete metadata are derived from the roles assigned to users.
- Users' responsibility is limited to the statistical domain they manage. All rights are inherited by sub-domains.
- User rights follow the S-DWH system operations of all S-DWH processes (and all layers): data extraction, data transformation, data loading and presentation. The S-DWH has the following operations: read metadata, create data processing packages, access delicate data, solve data processing tasks, schedule packages, see logs, etc.
- User rights can be granted per operation on an object (privileges). Examples of privileges are creating a statistical activity, managing versions of a statistical activity, changing or deleting the description of a statistical activity, managing the variables of a version of a statistical activity, managing legal acts, managing questionnaires and managing roles. Users, roles and privileges are managed in the URMA application.

4.3.3 User guides

The iMETA user guide is available as online help. All employees of Statistics Estonia are also users of iMETA: they are granted access to the iMETA application, where they can see the online help and all the metadata.

4.4 Metadata evaluation

Evaluation processes follow the metadata quality requirements.

Fill-in controls of metadata validation:

- no required values are missing,
- linked objects have to exist,
- a metadata element's code must be unique,
- data type and format control (incl. accordance with the classifier).

Systematic built-in processes for managing the workflow of metadata evaluation check the following (validation queries):

- accordance with the meta models used; for example, a metadata object has the following characteristics, each of which must be mandatory or optional as the model prescribes: name (local language, English), type, identifier, value domain, links to other elements.

Organizational processes of metadata validation:

- manual methodological checks during approval, to harmonise statistical metadata,
- testing of ETL metadata,
- validation of the data warehouse data model metadata by the data warehouse architect.

4.4.1 Use of standards

Several standards are used in the description of metadata elements: ISO 11179 and SDMX (ESMS, ESQRS and EPMS). The MMX meta model provides a storage mechanism for various knowledge models. All the meta models used in iMETA are stored in the MMX meta model.
The MMX meta model is based on the third level of the Meta-Object Facility (an Object Management Group standard for model-driven engineering).

4.5 Metadata by metadata functionality groups and layers

Table 2 Metadata by metadata functionality groups and layers

Columns: Source layer | Integration layer | Interpretation and data analysis layer | Access layer, each split into Phase and sub-process/metadata.

Metadata creation

1 – Specify needs
- 1.3 – establish output objectives – creates output variable metadata
- 1.6 – prepare business case – creates metadata of statistical activity

2 – Design
- 2.1 – design outputs – creates output validation metadata (algorithms), creates algorithms of statistical confidentiality, creates dissemination metadata (publication calendar); builds output (output algorithms)
- 2.1 – design outputs – builds output, creates technical metadata of output
- 2.2 – design variable descriptions – creates variable metadata (quality metadata), classifier metadata, creates algorithms, creates new variable metadata (creating algorithms) using available variables metadata, creates data quality variables metadata
- 2.2 – design variable descriptions – creates metadata of IOR (Initial observation registry) variable
- 2.2 – design variable description – creates new variable metadata
- 2.3 – design data collection methodology – creates validation rules of data collection, designs data model of raw data, creates pre-fill metadata
- 2.4 – design frame and sample methodology – creates frame, sample, stratum metadata
- 2.4 – design frame and sample methodology – creates sample metadata
- 2.5 – design statistical processing methodology – creates coding algorithms using classifiers and coding tables metadata, creates imputation algorithms using imputation methods, creates data processing algorithms (incl. aggregation)
- 2.5 – design statistical processing methodology – describes methodological rules for analysis
- 2.6 – design production systems and workflow – creates data warehouse data model metadata, creates data staging area data model, creates scheduling metadata
- 2.6 – design production systems and workflow – creates statistical metadata of workflow

3 – Build
- 3.1 – build data collection instrument – creates data collection structure technical metadata
- 3.3 – configure workflows – creates technical metadata of IOR, creates technical metadata of data collection instruments
- 3.3 – configure workflows – creates technical metadata of workflow
- 3.3 – configure workflows – creates scheduling technical metadata
- 3.3 – configure workflows – creates output technical metadata

4 – Collect
- 4.1 – select sample – creates process metadata of sample taking using 2.4 metadata
- 4.2 – set up collection – creates process metadata of set up collection
- 4.3 – run collection – creates process metadata (when, who, where) of data collection
- 4.4 – finalize collection – creates finalize metadata

5 – Process
- 5.1 – integrate data – creates process metadata (where data is from data source metadata), creates quality indicators for integration process
- 5.2 – classify & code – creates process metadata
- 5.3 – review, validate & edit – creates quality variables (edit failure rate) and creates process metadata
- 5.4 – impute – creates quality variables, creates quality variable metadata (imputation rate) and process metadata (how, who, when etc.)
- 5.5 – derive new variables and statistical units – creates process metadata, creates quality variable metadata (new variables and new statistical units rate)
- 5.6 – calculate weights – creates process metadata
- 5.7 – calculate aggregate – creates process metadata
- 5.8 – finalize data files – creates data finalizing metadata (date, person etc.), creates data loading to SDW metadata (logs, date, person etc.)

6 – Analyze
- 6.1 – prepare draft output – creates output variable metadata (algorithms) using output variables metadata, classifiers metadata, creates table titles (using output variable metadata)
- 6.1 – prepare draft output – creates process, technical metadata of output
- 6.2 – validate outputs – creates process metadata
- 6.3 – scrutinize and explain – creates process metadata
- 6.4 – apply disclosure control – creates process metadata
- 6.5 – finalize outputs – creates process metadata

7 – Disseminate
- 7.1 – update output systems – creates process metadata (when, who, etc.)
- 7.1 – update output systems – creates update process metadata by using output metadata (output variable metadata, classifiers metadata)
- 7.2 – produce dissemination – creates process metadata
- 7.3 – manage release of dissemination products – creates process metadata
- 7.4 – promote dissemination – creates process metadata
- 7.5 – manage user support – creates user metadata, creates process metadata

9 – Evaluate
- 9.1 – gather evaluation inputs – creates process metadata
- 9.2 – conduct evaluation – creates process metadata

Metadata usages

All S-DWH uses the same metadata repository. Metadata is multilingual (English, local language) and can be shared internationally via unified services with a standard format (like XML, SDMX).

1 – Specify needs
- 1.5 – check data availability – uses metadata of available administrative data
- 1.5 – check data availability – uses variable descriptions, uses variable and data element link descriptions

2 – Design
- 2.1 – design outputs – uses output variable
- 2.2 – design variable descriptions – creates new variable metadata (creating algorithms) using available variables metadata
- 2.5 – design statistical processing methodology – uses variable metadata
- 2.6 – design production systems and workflow – uses statistical processing metadata

3 – Build
- 3.3 – configure workflows – uses metadata of IOR variables
- 3.3 – configure workflows – uses scheduling statistical metadata
- 3.3 – configure workflows – uses statistical metadata of workflow

4 – Collect
- 4.1 – select sample – uses 2.4 methodological metadata
- 4.2 – set up collection – uses 2.6 pre-fill metadata, metadata of collection instrument, uses 4.1 select sample metadata
- 4.3 – run collection – uses variable metadata, IOR metadata
- 4.4 – finalize collection – uses IOR metadata

5 – Process
- 5.1 – integrate data – uses data model of raw data, uses variable metadata
- 5.1 – integrate data – uses sample metadata, uses data model of raw data, uses variable metadata; for data profiling uses data quality variables metadata; uses algorithms for integration (by using priority metadata)
- 5.2 – classify & code – uses coding algorithms, classifiers, coding tables' metadata
- 5.4 – impute – uses imputation algorithms
- 5.5 – derive new variables and statistical units – uses new variable/unit creating algorithms
- 5.5 – derive new variables and statistical units – uses new variable creating algorithms
- 5.6 – calculate weights – uses stratum and frame metadata
- 5.7 – calculate aggregate – uses methods for aggregate, uses variable metadata, classifiers metadata
- 5.8 – finalize data files – uses data warehouse data model metadata

6 – Analyze
- 6.1 – prepare draft output – uses output metadata
- 6.1 – prepare draft output – uses output variables metadata, classifiers metadata, creates table titles (uses output variable metadata)
- 6.2 – validate outputs – uses validation algorithms
- 6.3 – scrutinize and explain – uses output metadata
- 6.4 – apply disclosure control – uses algorithms of statistical confidentiality
- 6.5 – finalize outputs – uses output metadata, quality variables metadata

7 – Disseminate
- 7.1 – update output systems – uses output metadata
- 7.2 – produce dissemination – uses output metadata
- 7.3 – manage release of dissemination products – uses output metadata
- 7.4 – promote dissemination – uses output metadata
- 7.5 – manage user support – uses output metadata, user metadata

9 – Evaluate
- 9.1 – gather evaluation inputs – uses evaluation metadata
- 9.2 – conduct evaluation – uses evaluation metadata

Metadata maintenance

- Source layer: maintain (create, update, delete, versioning) source metadata (data sources metadata, variables metadata, sample metadata etc.). Mostly manually managed, technical metadata automated.
- Integration layer: maintain (create, update, delete, versioning) integration metadata (data processing algorithms, data warehouse data models etc.). User rights are according to the S-DWH system operations of all S-DWH processes (and all layers). Manual, automated. Integration metadata can be stored in different meta models (managing meta models): meta model for statistical metadata, process metadata etc.
- Interpretation and data analysis layer: maintain (create, update, delete, versioning) interpretation metadata (outputs, output variables, classifier, clarifying metadata).
- Access layer: maintain (create, update, delete, versioning) access metadata (output, product, publication calendar, user support). Manual, automated. User rights are according to the S-DWH system operations of all S-DWH processes.
- User rights are according to the S-DWH system operations of all S-DWH processes.
- User rights are according to the S-DWH system operations of all S-DWH processes – data extraction, data transformation, data loading and presentation. The S-DWH has the following operations – read metadata, create data processing packages, access to delicate data, solve data processing tasks, schedule packages.

Metadata evaluation

- Source layer: evaluate and assure the quality of metadata of the source layer by fill-in controls of metadata validation, systematic built-in processes for managing the workflow of quality assurance of metadata, and organizational processes of metadata validation.
- Integration layer: evaluate and assure the quality of metadata of the integration layer by fill-in controls of metadata validation, systematic built-in processes for managing the workflow of quality assurance of metadata, and organizational processes of metadata validation.
- Interpretation and data analysis layer: evaluate and assure the quality of metadata of the interpretation and data analysis layer by fill-in controls of metadata validation, systematic built-in processes for managing the workflow of quality assurance of metadata, and organizational processes of metadata validation.
- Access layer: evaluate and assure the quality of metadata of the data access layer by fill-in controls of metadata validation, systematic built-in processes for managing the workflow of quality assurance of metadata, and organizational processes of metadata validation.

5. Conclusion

The way metadata is handled in the integrated metadata system of the S-DWH shows that such a system brings many benefits. For users and systems these include unified metadata, integrated metadata, metadata accessibility, and metadata availability that opens up opportunities for analysis. Unifying metadata makes it unambiguous. Integrating metadata gives access to all the relations between different kinds of metadata and metadata elements, so that different kinds of analysis can be based on these relations.