Proposals from Bryan Fitzpatrick

A Generic Statistical Information Model
Introduction
The aim of this paper is to discuss what should be in a Generic Statistical Information Model
(GSIM) and to suggest how it might usefully be structured. The paper starts by considering
why we want a GSIM and what we might hope it will achieve for us and then looks at how it
should be constructed.
We already have the Generic Statistical Business Process Model (GSBPM). It has been
developed over a number of years under the auspices of the Metis group and it is generally
accepted by the National Statistical Offices (NSOs) and the international statistical agencies
as providing a reasonable basis for discussion of what we do in producing national statistics.
It is a “Reference Model”, which means it is not prescriptive and acknowledges that some
agencies might do things slightly differently, or in different orders. It identifies a high-level
“Value Chain” of major activity areas and within each area it identifies a number (about 7 to
12) of processes that typically or sometimes take place, with some description of these
processes. The Metis meeting in 2010 agreed that we had taken the GSBPM about as far
as was useful and that further work to refine it was not worthwhile. However, I think that
most participants in the meeting felt that further work of a modelling nature was in fact
needed; the nature of that work was not yet clear, but it was clear that it would not be
more refinement of the GSBPM.
The GSBPM gives agencies a very useful framework within which to discuss how they do
their business. It helps us use common terms, to discuss the similarities and differences in
our approaches, and to work towards more-clearly defined common approaches. Partly as a
result of being able to do this, agencies are now looking to take this work, around the
definition and clarification of statistical processes, considerably further. They are looking to
manage the processes, to automate them, to “industrialise” them, and to share tools and
actual processing systems. They are also looking to add value to the statistical outputs, in
terms of presenting, in an automated and managed fashion, more contextual and informative
material about how the statistics were derived and what processes they have been
subjected to, often referred to as “drill back” in discussions. In addition they are looking to
use more generic processes in all areas with bespoke processes kept to an unavoidable
minimum. They are looking for generic, multi-modal data capture, shared cross-area use of
common data, and more use of readily-available or administrative data, as opposed to
specifically-collected data, wherever sensible. And, of course, they are looking for flexible
and automated dissemination options with an emphasis on self-service and the ability for
clients to access follow-up information when they need it.
The GSBPM does not help us greatly with any of these aims, and it is to address them
that agencies are now seeking to develop the GSIM. However, most of us have, as yet, not
really developed a clear concept of what this GSIM needs to be. The aim of this paper is to
put forward a candidate design for the GSIM that should actually help us to achieve our
current aims.
All seem to agree that metadata, and improved use of metadata, is key to achieving these
aims. Most seem to agree that the two existing statistical metadata standards, Statistical
Data and Metadata eXchange (SDMX) and Data Documentation Initiative (DDI) are the key
metadata standards, although they also think that SDMX and DDI are not perfect, that they
need to be integrated, and that there are other relevant metadata standards or practices,
including ISO 11179, ISO 19115 (or national geographical variations), and local agency
metadata, that are relevant. There is much talk of “metadata-driven processes” but only
hazy explanations of what this actually means. While there is a desire to see SDMX and
DDI integrated there is little in the way of concrete examples of how this might be
approached. While there is a belief and expectation that particularly DDI, but probably also
SDMX, will need some modifications, there is a distinct lack of detail on where the issues
are. This largely reflects the difficulty of coming to grips with the processes and the
metadata. DDI, in particular, is a very large and complex standard. It is not reasonable to
expect anyone who is not an XML expert, and in particular, an XML Schema expert, to come
to grips with it. But the people who have this expertise almost invariably lack the detailed
statistical expertise needed to make judgements about the suitability of the metadata for use
in National Statistical Office processes.
Aims of a GSIM
It seems to me that the prime aim of the GSIM is to help us sort out these issues and to plan
how we will improve the management and operation of our statistical processes. When
complete the GSIM should encapsulate our plans for improved processes and enable
achievement of the aims stated above. It needs to be heavily focused on the metadata
needed in each process, but, more particularly, on the metadata needed for the entire
statistical life-cycle – ie on the metadata needed across all the processes that make up that
life-cycle. It should also focus on the flow of data and “paradata” – information about
decisions and choices made, and about data sets and other information produced and
passed amongst processes. If it does this the GSIM will let us achieve a number of things
along the way that are critically important if we are ultimately to achieve our goals. It should:
- Enable us to specify all the metadata requirements for our processes, allow them to
  be aligned with DDI and SDMX (and potentially other standards), and enable us to
  codify what we mean by modifications to the standards and by DDI and SDMX
  integration.
- Clarify what we mean by “metadata-driven processes” and let us start working to
  achieve this.
- Provide clarification and detail about our processes so that we can start working to
  achieve shared, common (across collections and across organisations and agencies)
  processes and tools.
- Give coherence to the metadata and paradata across the statistical life-cycle so that
  we have both a basis for better life-cycle management and a basis for metadata and
  paradata drill-back for the final statistical product.
- Provide a model that is actually usable and useful for the purpose of planning and
  implementing (and sharing) our approaches.
What does this GSIM look like?
I said above that the GSBPM gives us a very useful framework within which to discuss our
processes. In fact it identifies and provides a basic description of all our processes. I
propose that the GSIM takes each of these GSBPM processes and elaborates it by
describing the metadata, paradata, and data inputs it requires and produces, and, in most
cases, breaks the process into sub-processes with similar detail. There should be an
emphasis on end-to-end coherence – the model should show us how various processes use
common metadata and how paradata and data flow from process to process. The metadata
links should be to elaborated metadata types, mostly to DDI and SDMX artefacts such as
Concept Schemes and Concepts, Question Schemes and Question Items, Variable
Schemes and Variables, Category Schemes and Code Schemes and Codelists, Record
Layout Schemes and Physical Structure Schemes, and Data Structure Definitions.
In my model there would be “layers” of process definitions. At the top there would be
abstract processes that specify the metadata types and paradata information that would be
required (or produced) by the process, and an identification of the data sets in terms of how
they would relate to the metadata. This would be the GSIM layer – a library of abstract
process definitions that showed what metadata, paradata, and data would be required and
produced by each process. These abstract processes could be linked together and, when
linked, would show how metadata, paradata, and data flow through the life-cycle of a
collection.
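
To make this concrete, here is a minimal sketch, in Python, of the kind of information an
abstract process definition in this library might carry. The class and field names are my own
illustrations, not part of DDI, SDMX, or any agreed GSIM structure.

from dataclasses import dataclass
from typing import List

@dataclass
class AbstractProcess:
    """One entry in the GSIM library of abstract process definitions."""
    name: str                           # e.g. a GSBPM process such as "Calculate aggregates"
    metadata_types_required: List[str]  # metadata types the process needs as input
    paradata_items_required: List[str]  # paradata items to be resolved at run time
    outputs_produced: List[str]         # data sets and paradata the process emits

@dataclass
class ProcessLink:
    """A link showing how an output of one process feeds a later process."""
    producer: str   # name of the producing process
    consumer: str   # name of the consuming process
    item: str       # the data or paradata item that flows between them
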
The next layer would be the linking together of a collection of processes to describe a
particular statistical collection or area. At this stage we would fill in some of the detail that
was abstract in the GSIM definitions. If the GSIM model said a process required Category
Schemes, Code Schemes, Record Layouts, and Data Structure Definitions, this stage would
identify which particular Category Schemes, Code Schemes, Record Layouts, and Data
Structure Definitions were to be used. Similarly it would fill in some of the paradata about
how data inputs and other information would be found, and how data and other outputs
would be registered for access by subsequent processes. This layer would still be abstract
in that it would not have actual data defined so the process, in most cases, could not actually
be executed. It would relate to a statistical collection, such as Labour Force Survey, or
Consumer Price Indexes, or National Accounts, not to an actual collection cycle.
The final layer would relate to an actual cycle of a collection. There might possibly be some
final, cycle-specific, metadata to be provided (probably via some paradata item), and
paradata values would identify actual data sets (for which we had already identified the
metadata needed for interpretation). The process would now be executable, either
automatically or manually.
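
Continuing the illustrative structures above, the lower layers might fill in detail along the
following lines; the bindings and URNs shown are hypothetical examples only.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CollectionProcess:
    """Collection layer: an abstract process bound to particular metadata artefacts."""
    abstract_process: str              # name of the GSIM abstract process
    metadata_bindings: Dict[str, str]  # metadata type -> URN of the chosen artefact
    paradata_sources: Dict[str, str]   # paradata item -> how it will be found or passed

@dataclass
class CycleProcess:
    """Cycle layer: a collection process bound to actual data, hence executable."""
    collection_process: CollectionProcess
    paradata_values: Dict[str, str] = field(default_factory=dict)  # e.g. data set URNs

collection = CollectionProcess(
    abstract_process="Tabulation",
    metadata_bindings={"DataStructureDefinition": "urn:example:dsd:lfs-tables:1.0"},
    paradata_sources={"input_dataset": "registered by the preceding process"},
)
cycle = CycleProcess(collection,
                     paradata_values={"input_dataset": "urn:example:data:lfs:2012Q1"})
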
Note that these abstract, top-layer, process definitions that make up this main part of the
GSIM are themselves metadata. This process metadata will need its own metadata type
and will be stored in the Metadata Registry/Repository (MRR) along with all other metadata.
In my model I am looking to the SDMX Process metadata type, or more probably some
variant on it, but other options could be considered. The first layer below the GSIM, where
the processes are particularised for a collection, might still be considered metadata, but the
bottom layer, where the processes are particularised for a specific collection cycle, is
probably better viewed as paradata. These lowest layers encapsulate, at the end of a
processing cycle, all the history of the cycle, and are important enablers of drill-back from
disseminated statistics.
Let us look at a specific example.
I have chosen “Tabulation” (essentially GSBPM process 5.7 – “Calculate aggregates”) for
my example, since, because of my background, I have good familiarity with this process.
The tabulation process requires:
- An input data set or data sets that must be described by
  o Record Layouts and Physical Data Structures
  o Variable Schemes and Variables
  o Category Schemes and Code Schemes (or possibly SDMX Codelists)
- Data Structure Definitions (DSDs) (or possibly DDI NCube definitions) that specify
  what tables are to be produced. These DSDs will link to Codelists that must be linkable to
  the Category Schemes and Code Schemes.
- Paradata items that will enable, at execution time, the identification of the input data
  set (or sets) and will indicate how output tables are to be registered, indexed, and
  categorised.
- A tabulation tool that can be used to produce the tables, along with metadata that
  shows how the necessary information from the data and paradata can be passed to the tool.
In my model this text, expressed in a structured metadata format, is the GSIM process
definition. It may have some layers that describe sub-processes. It should certainly be
displayable diagrammatically and the GSIM should include a tool to do this (and support
diagrammatic design and editing of processes). The Tabulation process is shown
diagrammatically below.
[Figure 1 – Diagrammatic presentation of Tabulation process: the Tabulation process takes
Metadata inputs (Record Layout(s), Physical Data Structure(s), Variable(s), Category
Scheme(s), Data Structure Definition(s)) and Paradata inputs (Dataset key(s); Registration
keys, Categories and Keywords for registering tables), uses a Tabulation tool (with
Parameter metadata for the tool), and produces Paradata outputs (Registration(s) for tables,
Status information).]
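
By way of illustration, the same requirements could be written down in a simple structured
form such as the following; in practice the structure would be an SDMX Process (or similar)
metadata artefact rather than this ad-hoc sketch, and the key names are mine.

tabulation_process = {
    "name": "Tabulation",                     # essentially GSBPM 5.7 "Calculate aggregates"
    "metadata_required": [
        "RecordLayout", "PhysicalDataStructure",
        "VariableScheme", "CategoryScheme", "CodeScheme",
        "DataStructureDefinition",
    ],
    "paradata_required": [
        "input dataset key(s)",
        "registration keys, categories and keywords for tables",
    ],
    "tool": {
        "role": "tabulation tool",
        "parameter_metadata": "how data and paradata are passed to the tool",
    },
    "paradata_produced": [
        "registration(s) for tables",
        "status information",
    ],
}
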
I see these models as being based directly on the GSBPM. While it may not be necessary
to model every GSBPM process, I think most would need to be modelled. Moreover, there
may be two (or more) alternative modellings of some processes. We are aiming to define
common, shareable processes, but in the GSBPM discussions it was recognised that some
processes are done differently in different organisations and cultures. Equally, some
organisations may want to model processes that are not currently in the GSBPM.
What does such a GSIM achieve?
It closes the gap between the GSBPM and our day-to-day statistical processes. The
GSBPM is a good reference model and a GSIM like the one proposed here makes it directly
applicable to our everyday processes.
It identifies all the metadata types we require for our processes. Moreover it
identifies them in the context of the processes that use them so we now have a focus for
assessing the requirements for each metadata type and assessing the suitability of the DDI,
SDMX, and other options. The development of the GSIM will provide an excellent
opportunity to clarify and resolve metadata issues and to develop an agreed and useful
statistical metadata standard that, hopefully, can guide the evolution of both DDI and SDMX.
It provides more detail on the requirements and functions of each process. In fact it
provides sufficient detail for us to consider how we might fit existing tools into processes or
how we might plan and build new shareable tools designed to work effectively with the
metadata.
It provides a basis for planning process automation and “metadata-driven
processes”. Each process, abstract or executable, is itself described in metadata. We can
build, using a process-management tool such as ActiveVos, a process execution engine that
can take our process metadata and execute it.
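
As a rough sketch of what executing such process metadata might look like (this is not
ActiveVos or any particular product, and the registry interface assumed here is illustrative
only):

def execute_cycle(mrr, cycle_process_urn):
    """Toy execution loop: fetch process metadata from the MRR and run each step.

    Assumes `mrr` exposes retrieve(urn) and register(obj, **info), and that each
    step carries its metadata bindings and a callable run(metadata, paradata).
    """
    process = mrr.retrieve(cycle_process_urn)    # the cycle-layer process metadata
    paradata = dict(process.paradata_values)     # run-time values, e.g. data set URNs
    for step in process.steps:
        metadata = {t: mrr.retrieve(urn)         # resolve the bound metadata artefacts
                    for t, urn in step.metadata_bindings.items()}
        outputs = step.run(metadata, paradata)   # invoke the tool behind the step
        for name, obj in outputs.items():        # register outputs so that later steps
            paradata[name] = mrr.register(obj, produced_by=step.name)  # can find them
    return paradata                              # the accumulated history of the cycle
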
It captures process knowledge in a maintainable and usable form, as metadata links
and paradata items, including details of how they flow from process to process.
It clarifies how our “end-to-end-managed-environment” will use the MRR and the
metadata it contains. Our process management engine will retrieve process metadata from
the MRR and will execute it. As it does this it will manage the paradata that both links
processes together and provides an archiveable history of what happened in each process
cycle.
It enables us to plan how we manage our whole-of-collection metadata. Each
process has its metadata requirements specified and has paradata links that link it to other
processes. We can begin to organise our DDI container artefacts – Study Units, Groups,
and Sub-Groups – to manage our metadata within and across collections.
How do we proceed to develop this GSIM?
The GSIM as proposed will consist of several parts:
- A library of elaborated processes, expressed diagrammatically and as process
  metadata artefacts, for all or most of the GSBPM processes.
- A map showing how these processes link together, and, in particular, explaining how
  paradata links the processes together and enables the data flows amongst processes.
- An explanation of how we would plan to use the DDI and SDMX (and possibly other)
  metadata to manage processes and collection families end-to-end. This would focus not so
  much on the detailed metadata artefacts but more on the containers and packaging of the
  metadata.
- A set of recommendations and issues for discussion to guide the evolution of DDI
  and SDMX.
- A set of guidelines for MRR support for GSIM-based processes.
- Some specifications for processes, to be implemented using a process management
  system, to execute the GSIM process metadata artefacts and to support creation and editing
  of the GSIM process metadata, including the transition from the abstract GSIM layer to
  collection and cycle layers.
The major part of the work is to develop the library of processes. We need to do a
significant number of these, with a focus on the major processes, so we can validate the
approach and see how well it works, and start to do some of the abstraction required for
some of the other parts of the GSIM. Making a good start on this process definition would
seem to be a useful exercise for one of the “GSIM Sprint” sessions.
A note about use of the Metadata Registry/Repository
The paragraphs above imply how the MRR will enable end-to-end management of statistical
processes and how the processes will interact with the MRR, but it is perhaps useful to spell
some of this out in a little more detail. The MRR will be used to hold or register all metadata
and to register all other artefacts involved – data sets, documents, paradata values, etc.
“Register” means the MRR holds information about an object – what it is, who owns it, where
it can be found, what its status is, what keywords and categories it is linked to, and what
access controls it is subject to. Every registered object has a unique id and version and a
URN (Uniform Resource Name) that enables it to be retrieved. People or systems can
query the MRR to find objects of interest, using type and version information, ownership
information, and keywords and categories to retrieve URNs. Or they may know the URN of
the object (because it has been passed to them by some other person or earlier process).
Once a URN is obtained, the MRR can provide a locator (a URI – Uniform Resource
Identifier) to where the object is stored.
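
The following is a minimal, purely illustrative sketch of such a registry interface, using an
in-memory store; a real MRR would be a service, and none of the names here come from a
standard.

import uuid

class MiniMRR:
    """Illustrative registry: register objects, query registrations, resolve URNs."""

    def __init__(self):
        self._entries = {}   # URN -> registration info (and, optionally, the object)

    def register(self, obj=None, **info):
        """Record what an object is, who owns it, its keywords, etc.; return a URN."""
        urn = "urn:example:mrr:" + uuid.uuid4().hex
        self._entries[urn] = {"info": info, "object": obj}
        return urn

    def query(self, **criteria):
        """Return the URNs whose registration info matches all the given criteria."""
        return [urn for urn, entry in self._entries.items()
                if all(entry["info"].get(k) == v for k, v in criteria.items())]

    def retrieve(self, urn):
        """Return the stored object, or the registration info if only registered."""
        entry = self._entries[urn]
        return entry["object"] if entry["object"] is not None else entry["info"]

mrr = MiniMRR()
urn = mrr.register(type="dataset", owner="LFS team", keyword="labour force")
assert mrr.query(keyword="labour force") == [urn]
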
“Hold” or “Store” means the MRR actually holds the object in its repository. Objects that are
held or stored are also registered and the query process is identical whether or not the
object is actually stored in the MRR. If it is stored in the Repository the MRR can actually
return the object, rather than a locator.
In my model the MRR stores metadata in its repository but only registers other objects such
as data and documents. If metadata is held in some legacy environment or in some external
repository it may simply register that metadata. But for our main “future” metadata I assume
it is registered and stored in the MRR.
I assume most of our future metadata will be essentially DDI and SDMX (perhaps an
evolved and unified DDI and SDMX). We will use basic metadata artefacts, such as
Concepts and Concept Schemes, Variables and Variable Schemes, Questions and Question
Schemes, Record Layouts, Category Schemes and Code Schemes and Codelists, Data
Structure Definitions, and so on. But we will also use the DDI “Containers” – Study Units,
Groups and Sub Groups, and Resource Packages – and SDMX Structures to organise and
manage the basic metadata.
Probably DDI metadata will be stored in the MRR in Resource Packages (containers
designed for sharing metadata). For collections I expect we will have Groups and probably
Sub Groups that pull together (by reference) shared metadata, originally stored in Resource
Packages, as needed by the collection. Related collections will be held as Sub Groups
inside a Group or higher-level Sub-Groups. Perhaps there will be a top-level group to
contain all the organisation’s collections.
Each cycle of a collection will have a Study Unit, inside the collection’s Sub-Group, with
access to metadata referenced in the Sub-Groups and Group above it. The collection’s
Sub-Group should contain a “template” Study Unit that can be duplicated and renamed to roll forward
to a new cycle. As the cycle progresses links to actual data sets and run-time paradata will
be accumulated in the Study Unit so that at the end of the cycle it contains a full history of
the cycle ready for archiving and use in drill-back.
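
A small sketch of this roll-forward idea, treating Study Units as simple records purely for
illustration (the structure shown is not DDI itself):

import copy

def roll_forward(template_study_unit, cycle_name):
    """Duplicate the template Study Unit and rename it for a new collection cycle."""
    study_unit = copy.deepcopy(template_study_unit)
    study_unit["name"] = cycle_name
    study_unit["paradata_links"] = []   # filled in as the cycle progresses
    return study_unit

def record_link(study_unit, item, urn):
    """Accumulate a run-time paradata link (e.g. a data set URN) in the Study Unit."""
    study_unit["paradata_links"].append({"item": item, "urn": urn})

template = {"name": "LFS template",
            "references": ["urn:example:resource-package:classifications:1.0"]}
cycle_su = roll_forward(template, "LFS 2012 Q1")
record_link(cycle_su, "input_dataset", "urn:example:data:lfs:2012Q1")
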
The GSIM abstract processes described above will indicate the metadata types the process
requires. They will also identify “paradata” items that will, at execution time, enable cycle
objects, such as data sets, to be accessed. These paradata items will essentially be URNs
for the objects. They will either be passed (as paradata items) from earlier process steps or
selected, from lists produced by querying the MRR, by the statistical clerk. The GSIM
processes will be held in the MRR as metadata (possibly, as was noted earlier, in some
variant of the SDMX Process artefact).
When a collection is designed a subset of the GSIM abstract processes will be linked
together into a large-scale process, specific metadata artefacts will be chosen for the
metadata types specified in the abstract processes, and the paradata items to be passed
amongst the processes will be specified. These collection processes will also reside in the
MRR. The paradata now becomes “collection metadata” and will need a structure to
manage and contain it. It seems to me that the SDMX Metadata Structure Definition (MSD)
is a good candidate for use here. MSDs, which were designed as containers for SDMX
Reference Metadata, allow the definition of ad-hoc structures for “Metadata Sets”, data files
that will hold the values associated with the items in the structure. Thus the collection is
described by a collection process, held as metadata in the MRR, which references metadata
artefacts, held in or referenced from DDI Study Units, Sub-Groups, and Groups and SDMX
Structures held in the MRR, with data and other process information described in Metadata
Structure Definitions, held in the MRR and passed in Metadata Sets registered in the MRR.
Summary
The aim is to develop a GSIM that will actually help us in planning and developing
improvements to the management and operation of our statistical processes. The GSIM as
proposed here is specifically designed to be directly relevant to this exercise and to provide
a way in which we can explore and clarify the metadata issues we need to resolve. It should
also provide a sound basis for developing shared approaches and shared tools to support
those approaches. It also provides a direct basis for building metadata-driven systems.
Bryan Fitzpatrick
Rapanea Consulting Limited
January 2012
BryanMFitzpatrick@Yahoo.CO.UK