A Generic Statistical Information Model

Introduction

The aim of this paper is to discuss what should be in a Generic Statistical Information Model (GSIM) and to suggest how it might usefully be structured. The paper starts by considering why we want a GSIM and what we might hope it will achieve for us, and then looks at how it should be constructed.

We already have the Generic Statistical Business Process Model (GSBPM). It has been developed over a number of years under the auspices of the Metis group and is generally accepted by the National Statistical Offices (NSOs) and the international statistical agencies as providing a reasonable basis for discussion of what we do in producing national statistics. It is a “Reference Model”, which means it is not prescriptive and acknowledges that some agencies might do things slightly differently, or in different orders. It identifies a high-level “Value Chain” of major activity areas and, within each area, a number (about 7 to 12) of processes that typically or sometimes take place, with some description of those processes.

The Metis meeting in 2010 agreed that we had taken the GSBPM about as far as was useful and that further work to refine it was not worthwhile. However, I think most participants in the meeting felt that further work of a modelling nature was in fact needed; it was just that the nature of that work was not yet clear, beyond the fact that it was not more refinement of the GSBPM.

The GSBPM gives agencies a very useful framework within which to discuss how they do their business. It helps us use common terms, discuss the similarities and differences in our approaches, and work towards more clearly defined common approaches. Partly as a result of being able to do this, agencies are now looking to take this work, around the definition and clarification of statistical processes, considerably further. They are looking to manage the processes, to automate them, to “industrialise” them, and to share tools and actual processing systems. They are also looking to add value to the statistical outputs by presenting, in an automated and managed fashion, more contextual and informative material about how the statistics were derived and what processes they have been subjected to, often referred to as “drill back” in discussions. In addition, they are looking to use more generic processes in all areas, with bespoke processes kept to an unavoidable minimum. They are looking for generic, multi-modal data capture, shared cross-area use of common data, and more use of readily available or administrative data, as opposed to specifically collected data, wherever sensible. And, of course, they are looking for flexible and automated dissemination options, with an emphasis on self-service and the ability for clients to access follow-up information when they need it.

The GSBPM does not help us greatly with any of these aims, and it is to help with these aims that agencies are now seeking to develop the GSIM. However, most of us have, as yet, not really developed a clear concept of what this GSIM needs to be. The aim of this paper is to put forward a candidate design for the GSIM that should actually help us to achieve our current aims.

All seem to agree that metadata, and improved use of metadata, is key to achieving these aims.
Most seem to agree that the two existing statistical metadata standards, Statistical Data and Metadata eXchange (SDMX) and the Data Documentation Initiative (DDI), are the key metadata standards, although they also think that SDMX and DDI are not perfect, that they need to be integrated, and that other metadata standards and practices, including ISO 11179, ISO 19115 (or national geographical variations), and local agency metadata, are also relevant. There is much talk of “metadata-driven processes” but only hazy explanations of what this actually means. While there is a desire to see SDMX and DDI integrated, there is little in the way of concrete examples of how this might be approached. While there is a belief and expectation that DDI in particular, but probably also SDMX, will need some modifications, there is a distinct lack of detail on where the issues are.

This largely reflects the difficulty of coming to grips with the processes and the metadata. DDI, in particular, is a very large and complex standard. It is not reasonable to expect anyone who is not an XML expert, and in particular an XML Schema expert, to come to grips with it. But the people who have this expertise almost invariably lack the detailed statistical expertise needed to make judgements about the suitability of the metadata for use in National Statistical Office processes.

Aims of a GSIM

It seems to me that the prime aim of the GSIM is to help us sort out these issues and to plan how we will improve the management and operation of our statistical processes. When complete, the GSIM should encapsulate our plans for improved processes and enable achievement of the aims stated above. It needs to be heavily focused on the metadata needed in each process, but, more particularly, on the metadata needed for the entire statistical life-cycle, i.e. on the metadata needed across all the processes that make up that life-cycle. It should also focus on the flow of data and “paradata”: information about decisions and choices made, and about data sets and other information produced and passed amongst processes.

If it does this, the GSIM will let us achieve a number of things along the way that are critically important if we are ultimately to achieve our goals. It should:

- Enable us to specify all the metadata requirements for our processes, allow them to be aligned with DDI and SDMX (and potentially other standards), and enable us to codify what we mean by modifications to the standards and by DDI and SDMX integration.
- Clarify what we mean by “metadata-driven processes” and let us start working to achieve this.
- Provide clarification and detail about our processes so that we can start working to achieve shared, common (across collections and across organisations and agencies) processes and tools.
- Give coherence to the metadata and paradata across the statistical life-cycle so that we have both a basis for better life-cycle management and a basis for metadata and paradata drill-back for the final statistical product.
- Provide a model that is actually usable and useful for the purpose of planning and implementing (and sharing) our approaches.

What does this GSIM look like?

I said above that the GSBPM gives us a very useful framework within which to discuss our processes. In fact it identifies and provides a basic description of all our processes.
I propose that the GSIM takes each of these GSBPM processes and elaborates it by describing the metadata, paradata, and data inputs it requires and produces, and, in most cases, breaks the process into sub-processes with similar detail. There should be an emphasis on end-to-end coherence: the model should show us how various processes use common metadata and how paradata and data flow from process to process. The metadata links should be to elaborated metadata types, mostly to DDI and SDMX artefacts such as Concept Schemes and Concepts, Question Schemes and Question Items, Variable Schemes and Variables, Category Schemes, Code Schemes and Codelists, Record Layout Schemes and Physical Structure Schemes, and Data Structure Definitions.

In my model there would be “layers” of process definitions. At the top there would be abstract processes that specify the metadata types and paradata information that would be required (or produced) by the process, and an identification of the data sets in terms of how they would relate to the metadata. This would be the GSIM layer: a library of abstract process definitions showing what metadata, paradata, and data would be required and produced by each process. These abstract processes could be linked together, and when linked together they would show how metadata, paradata, and data flow through the life-cycle of a collection.

The next layer would be the linking together of a collection of processes to describe a particular statistical collection or area. At this stage we would fill in some of the detail that was abstract in the GSIM definitions. If the GSIM model said a process required Category Schemes, Code Schemes, Record Layouts, and Data Structure Definitions, this stage would identify which particular Category Schemes, Code Schemes, Record Layouts, and Data Structure Definitions were to be used. Similarly, it would fill in some of the paradata about how data inputs and other information would be found, and how data and other outputs would be registered for access by subsequent processes. This layer would still be abstract in that it would not have actual data defined, so the process, in most cases, could not actually be executed. It would relate to a statistical collection, such as the Labour Force Survey, Consumer Price Indexes, or the National Accounts, not to an actual collection cycle.

The final layer would relate to an actual cycle of a collection. There might possibly be some final, cycle-specific metadata to be provided (probably via some paradata item), and paradata values would identify actual data sets (for which we had already identified the metadata needed for interpretation). The process would now be executable, either automatically or manually.

Note that these abstract, top-layer process definitions that make up the main part of the GSIM are themselves metadata. This process metadata will need its own metadata type and will be stored in the Metadata Registry/Repository (MRR) along with all other metadata. In my model I am looking to the SDMX Process metadata type, or more probably some variant of it, but other options could be considered. The first layer below the GSIM, where the processes are particularised for a collection, might still be considered metadata, but the bottom layer, where the processes are particularised for a specific collection cycle, is probably better viewed as paradata.
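To make the layering more concrete, the sketch below models the three layers as simple data structures. It is purely illustrative: the class and field names are my own assumptions for exposition, written in Python, and are not DDI, SDMX, or GSIM constructs.

    # Illustrative only: a rough sketch of the three layers described above.
    # Class and field names are assumptions, not DDI, SDMX, or GSIM constructs.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class AbstractProcess:
        """GSIM layer: names the types of metadata and paradata a process needs."""
        name: str
        required_metadata_types: List[str]   # e.g. "CategoryScheme", "DataStructureDefinition"
        required_paradata_items: List[str]   # e.g. "input_dataset_urns"
        produced_paradata_items: List[str]   # e.g. "output_table_registrations"

    @dataclass
    class CollectionProcess:
        """Collection layer: binds each required metadata type to chosen artefacts."""
        abstract: AbstractProcess
        collection: str                          # e.g. "Labour Force Survey"
        metadata_bindings: Dict[str, List[str]]  # metadata type -> URNs of chosen artefacts

    @dataclass
    class CycleProcess:
        """Cycle layer: supplies paradata values, making the process executable."""
        collection_process: CollectionProcess
        cycle: str                               # e.g. "2012-Q1"
        paradata_values: Dict[str, str] = field(default_factory=dict)

        def is_executable(self) -> bool:
            needed = self.collection_process.abstract.required_paradata_items
            return all(item in self.paradata_values for item in needed)

The point of the sketch is simply that each successive layer adds bindings and values; the process definition itself is stated once, at the abstract GSIM layer.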
These lowest layers encapsulate, at the end of a processing cycle, all the history of the cycle, and are important enablers of drill-back from disseminated statistics.

Let us look at a specific example. I have chosen “Tabulation” (essentially GSBPM process 5.7, “Calculate aggregates”) for my example since, because of my background, I have good familiarity with this process. The tabulation process requires:

- An input data set or data sets that must be described by:
  o Record Layouts and Physical Data Structures
  o Variable Schemes and Variables
  o Category Schemes and Code Schemes (or possibly SDMX Codelists)
- Data Structure Definitions (DSDs) (or possibly DDI NCube definitions) that specify what tables are to be produced. These DSDs will link to Codelists that must be linkable to the Category Schemes and Code Schemes.
- Paradata items that will enable, at execution time, the identification of the input data set (or sets) and will indicate how output tables are to be registered, indexed, and categorised.
- A tabulation tool that can be used to produce the tables, along with metadata that shows how the necessary information from the data and paradata can be passed to the tool.

In my model this text, expressed in a structured metadata format, is the GSIM process definition. It may have some layers that describe sub-processes. It should certainly be displayable diagrammatically, and the GSIM should include a tool to do this (and to support diagrammatic design and editing of processes). The Tabulation process is shown diagrammatically below.

[Figure 1 – Diagrammatic presentation of Tabulation process: the Tabulation process and its tabulation tool (with parameter metadata for the tool), with input Metadata (Record Layout(s), Physical Data Structure(s), Variable(s), Category Scheme(s), Data Structure Definition(s)), input Paradata (Dataset key(s); Registration keys, Categories and Keywords for registering tables; Data Structure Definition(s)), and output Paradata (Registration(s) for tables; Status information).]

I see these models as being based directly on the GSBPM. While it is possibly not necessary that every GSBPM process should be modelled, I think most would need to be. Moreover, it is possible that there may be two (or more) alternative modellings for some processes. We are aiming to define common shareable processes, but in the GSBPM discussions it was recognised that some processes are done differently in different organisations and cultures. Equally, some organisations may want to model processes that are not currently in the GSBPM.
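Returning to the Tabulation example, and continuing the illustrative classes from the earlier sketch, a GSIM-style definition of Tabulation and its particularisation for one collection and one cycle might look roughly as follows. All URNs, artefact names, and paradata item names are invented for exposition; this is not SDMX Process or DDI syntax.

    # Illustrative only: the Tabulation example expressed with the sketch classes
    # defined earlier. URNs and artefact names are invented.
    tabulation = AbstractProcess(
        name="Tabulation (GSBPM 5.7 Calculate aggregates)",
        required_metadata_types=[
            "RecordLayout", "PhysicalDataStructure", "VariableScheme",
            "CategoryScheme", "CodeScheme", "DataStructureDefinition",
        ],
        required_paradata_items=["input_dataset_urns", "table_registration_rules"],
        produced_paradata_items=["table_registrations", "status_information"],
    )

    lfs_tabulation = CollectionProcess(
        abstract=tabulation,
        collection="Labour Force Survey",
        metadata_bindings={
            "DataStructureDefinition": ["urn:example:dsd:LFS_TABLES:1.0"],
            "CategoryScheme": ["urn:example:categoryscheme:LFS:1.0"],
            # ... other bindings chosen at collection-design time
        },
    )

    cycle = CycleProcess(
        collection_process=lfs_tabulation,
        cycle="2012-Q1",
        paradata_values={
            "input_dataset_urns": "urn:example:dataset:LFS:2012-Q1:clean",
            "table_registration_rules": "urn:example:paradata:LFS:2012-Q1:tab",
        },
    )
    assert cycle.is_executable()  # all required paradata supplied, so the cycle can run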
What does such a GSIM achieve?

- It closes the gap between the GSBPM and our day-to-day statistical processes. The GSBPM is a good reference model, and a GSIM like the one proposed here makes it directly applicable to our everyday processes.
- It identifies all the metadata types we require for our processes. Moreover, it identifies them in the context of the processes that use them, so we now have a focus for assessing the requirements for each metadata type and the suitability of the DDI, SDMX, and other options. The development of the GSIM will provide an excellent opportunity to clarify and resolve metadata issues and to develop an agreed and useful statistical metadata standard that, hopefully, can guide the evolution of both DDI and SDMX.
- It provides more detail on the requirements and functions of each process. In fact it provides sufficient detail for us to consider how we might fit existing tools into processes or how we might plan and build new shareable tools designed to work effectively with the metadata.
- It provides a basis for planning process automation and “metadata-driven processes”. Each process, abstract or executable, is itself described in metadata. We can build, using a process-management tool such as ActiveVOS, a process execution engine that can take our process metadata and execute it.
- It captures process knowledge in a maintainable and usable form, in the form of metadata links and paradata items, including details of how they flow from process to process.
- It clarifies how our “end-to-end managed environment” will use the MRR and the metadata it contains. Our process management engine will retrieve process metadata from the MRR and execute it. As it does this it will manage the paradata that both links processes together and provides an archivable history of what happened in each process cycle.
- It enables us to plan how we manage our whole-of-collection metadata. Each process has its metadata requirements specified and has paradata links that link it to other processes. We can begin to organise our DDI container artefacts (Study Units, Groups, and Sub-Groups) to manage our metadata within and across collections.

How do we proceed to develop this GSIM?

The GSIM as proposed will consist of several parts:

- A library of elaborated processes, expressed diagrammatically and as process metadata artefacts, for all or most of the GSBPM processes.
- A map showing how these processes link together and, in particular, explaining how paradata links the processes together and enables the data flows amongst processes.
- An explanation of how we would plan to use the DDI and SDMX (and possibly other) metadata to manage processes and collection families end-to-end. This would focus not so much on the detailed metadata artefacts but more on the containers and packaging of the metadata.
- A set of recommendations and issues for discussion to guide the evolution of DDI and SDMX.
- A set of guidelines for MRR support for GSIM-based processes.
- Some specifications for processes, to be implemented using a process management system, to execute the GSIM process metadata artefacts and to support creation and editing of the GSIM process metadata, including the transition from the abstract GSIM layer to collection and cycle layers (a rough illustrative sketch of such an execution step is given at the end of this section).

The major part of the work is to develop the library of processes. We need to do a significant number of these, with a focus on the major processes, so we can validate the approach and see how well it works, and start to do some of the abstraction required for some of the other parts of the GSIM. Making a good start on this process definition would seem to be a useful exercise for one of the “GSIM Sprint” sessions.
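As a rough, purely illustrative sketch of what executing GSIM process metadata under a process management system might involve, the fragment below shows a toy execution step in Python. The MRR client interface and the run_tool callable are hypothetical stand-ins, and the cycle_process argument is assumed to have the shape of the layered sketches earlier in the paper; nothing here describes the API of any real process management product or registry.

    # Illustrative only: a toy "execution engine" step. The MRR client interface
    # (register, resolve) and run_tool are hypothetical, not a real product API.
    from typing import Dict

    class MRRClient:
        """Stand-in for a Metadata Registry/Repository client (assumed interface)."""
        def __init__(self) -> None:
            self._store: Dict[str, object] = {}

        def register(self, urn: str, obj: object) -> None:
            self._store[urn] = obj

        def resolve(self, urn: str) -> object:
            return self._store[urn]

    def execute(cycle_process, mrr: MRRClient, run_tool) -> Dict[str, str]:
        """Resolve bound metadata and paradata, run the tool, register the outputs."""
        bindings = {
            mtype: [mrr.resolve(urn) for urn in urns]
            for mtype, urns in cycle_process.collection_process.metadata_bindings.items()
        }
        inputs = {
            item: mrr.resolve(urn)
            for item, urn in cycle_process.paradata_values.items()
        }
        outputs = run_tool(bindings, inputs)      # e.g. a wrapper around a tabulation tool
        produced = {}
        for name, (urn, obj) in outputs.items():  # tool returns URN-keyed outputs
            mrr.register(urn, obj)                # register tables, status information, etc.
            produced[name] = urn
        return produced                           # paradata passed on to later processes

The essential point is that the engine itself is generic: everything specific to a process comes from the metadata and paradata resolved through the MRR.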
A note about use of the Metadata Registry/Repository

The paragraphs above imply how the MRR will enable end-to-end management of statistical processes and how the processes will interact with the MRR, but it is perhaps useful to spell some of this out in a little more detail.

The MRR will be used to hold or register all metadata and to register all other artefacts involved: data sets, documents, paradata values, and so on. “Register” means the MRR holds information about an object: what it is, who owns it, where it can be found, what its status is, what keywords and categories it is linked to, and what access controls it is subject to. Every registered object has a unique id and version and a URN (Uniform Resource Name) that enables it to be retrieved. People or systems can query the MRR to find objects of interest, using type and version information, ownership information, and keywords and categories to retrieve URNs. Or they may know the URN of the object (because it has been passed to them by some other person or an earlier process). Once a URN is obtained, the MRR can provide a locator (a URI, Uniform Resource Identifier) to where the object is stored. “Hold” or “store” means the MRR actually holds the object in its repository. Objects that are held or stored are also registered, and the query process is identical whether or not the object is actually stored in the MRR. If it is stored in the repository, the MRR can return the object itself, rather than a locator.

In my model the MRR stores metadata in its repository but only registers other objects such as data and documents. If metadata is held in some legacy environment or in some external repository, the MRR may simply register that metadata. But for our main “future” metadata I assume it is registered and stored in the MRR. I assume most of our future metadata will be essentially DDI and SDMX (perhaps an evolved and unified DDI and SDMX). We will use basic metadata artefacts, such as Concepts and Concept Schemes, Variables and Variable Schemes, Questions and Question Schemes, Record Layouts, Category Schemes, Code Schemes and Codelists, Data Structure Definitions, and so on. But we will also use the DDI “containers” (Study Units, Groups and Sub-Groups, and Resource Packages) and SDMX Structures to organise and manage the basic metadata.

Probably DDI metadata will be stored in the MRR in Resource Packages (containers designed for sharing metadata). For collections I expect we will have Groups, and probably Sub-Groups, that pull together (by reference) shared metadata, originally stored in Resource Packages, as needed by the collection. Related collections will be held as Sub-Groups inside a Group or inside higher-level Sub-Groups. Perhaps there will be a top-level Group to contain all the organisation’s collections. Each cycle of a collection will have a Study Unit, inside the collection’s Sub-Group, with access to metadata referenced in the Sub-Groups and Group above it. The collection’s Sub-Group should contain a “template” Study Unit that can be duplicated and renamed to roll forward to a new cycle. As the cycle progresses, links to actual data sets and run-time paradata will be accumulated in the Study Unit, so that at the end of the cycle it contains a full history of the cycle, ready for archiving and use in drill-back.

The GSIM abstract processes described above will indicate the metadata types each process requires. They will also identify “paradata” items that will, at execution time, enable cycle objects, such as data sets, to be accessed. These paradata items will essentially be URNs for the objects. They will either be passed (as paradata items) from earlier process steps or selected by the statistical clerk from lists produced by querying the MRR. The GSIM processes will themselves be held in the MRR as metadata (possibly, as was noted earlier, in some variant of the SDMX Process artefact).
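To illustrate the register/query/resolve pattern just described, here is a small self-contained sketch of how a process step might locate an input data set via the MRR. The method names, keywords, and URNs are invented; they are not drawn from any real MRR product or from the SDMX registry interfaces.

    # Illustrative only: the register / query / resolve pattern described above.
    # Method names, keywords, and URNs are invented for exposition.
    from dataclasses import dataclass
    from typing import Dict, List, Optional

    @dataclass
    class Registration:
        urn: str                 # unique id + version, e.g. "urn:example:dataset:LFS:2012-Q1:clean:v1"
        object_type: str         # e.g. "DataSet", "CategoryScheme", "ProcessDefinition"
        owner: str
        status: str
        keywords: List[str]
        location: Optional[str]  # a URI locator; None if the object is stored in the repository
        stored_object: Optional[object] = None

    class MiniMRR:
        def __init__(self) -> None:
            self._registry: Dict[str, Registration] = {}

        def register(self, reg: Registration) -> None:
            self._registry[reg.urn] = reg

        def query(self, object_type: str, keyword: str) -> List[str]:
            """Return the URNs of registered objects matching a type and keyword."""
            return [r.urn for r in self._registry.values()
                    if r.object_type == object_type and keyword in r.keywords]

        def resolve(self, urn: str):
            """Return the stored object if held in the repository, otherwise its URI locator."""
            reg = self._registry[urn]
            return reg.stored_object if reg.stored_object is not None else reg.location

    # A process step that only knows "I need this cycle's clean LFS data set":
    mrr = MiniMRR()
    mrr.register(Registration(
        urn="urn:example:dataset:LFS:2012-Q1:clean:v1",
        object_type="DataSet", owner="LFS team", status="final",
        keywords=["LFS", "2012-Q1", "clean"],
        location="file:///warehouse/lfs/2012q1/clean.dat",   # registered, not stored
    ))
    candidates = mrr.query("DataSet", "2012-Q1")
    locator = mrr.resolve(candidates[0])   # a URI, since the data set itself is not stored in the MRR

In this toy example the data set is only registered, not stored, so the MRR returns a URI locator; a stored metadata artefact would be returned directly.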
When a collection is designed, a subset of the GSIM abstract processes will be linked together into a large-scale process, specific metadata artefacts will be chosen for the metadata types specified in the abstract processes, and the paradata items to be passed amongst the processes will be specified. These collection processes will also reside in the MRR. The paradata now becomes “collection metadata” and will need a structure to manage and contain it. It seems to me that the SDMX Metadata Structure Definition (MSD) is a good candidate for use here. MSDs, which were designed as containers for SDMX Reference Metadata, allow the definition of ad hoc structures for “Metadata Sets”, data files that hold the values associated with the items in the structure.

Thus the collection is described by a collection process, held as metadata in the MRR. That process references metadata artefacts held in, or referenced from, DDI Study Units, Sub-Groups, and Groups and SDMX Structures held in the MRR, with data and other process information described in Metadata Structure Definitions held in the MRR and passed in Metadata Sets registered in the MRR.

Summary

The aim is to develop a GSIM that will actually help us in planning and developing improvements to the management and operation of our statistical processes. The GSIM as proposed here is specifically designed to be directly relevant to this exercise and to provide a way in which we can explore and clarify the metadata issues we need to resolve. It should also provide a sound basis for developing shared approaches and shared tools to support those approaches. It also provides a direct basis for building metadata-driven systems.

Bryan Fitzpatrick
Rapanea Consulting Limited
January 2012
BryanMFitzpatrick@Yahoo.CO.UK

© Rapanea Consulting Limited 2012