S-DWH Modular Workflow

Title: S-DWH Modular Workflow
WP: 3
Deliverable: 3.2
Version: 4.1
Date: February 2013
NSI: Statistics Estonia, Istat, Statistics Lithuania
Authors: Allan Randlepp, Antonio Laureti Palma and Valerij Žavoronok
ESSnet on Micro Data Linking and Data Warehousing in Production of Business Statistics
S-DWH Modular Workflow
Version: 1.0 February 25, 2013: Allan Randlepp
Version: 2.0 February 27, 2013: Allan Randlepp
Version: 3.0 March 1, 2013: Antonio Laureti Palma
Version: 4.0 March 4, 2013: Allan Randlepp
Version: 4.1 June 17, 2013: Valerij Žavoronok
Summary
1  Introduction
2  Statistical production models
   2.1  Stovepipe model
   2.2  Integrated model
   2.3  Warehouse approach
3  Integrated Warehouse model
   3.1  Technical platform integration
   3.2  Process integration
   3.3  Warehouse – reuse of data
4  S-DWH as layered modular system
   4.1  Layered architecture
   4.2  CORE services and reuse of components
5  Conclusion
6  References
1 Introduction
A statistical system is a complex system of data collection, data processing, statistical analysis, and related activities. The following figure (from Sundgren (2004)) shows a statistical system as a precisely defined, man-designed system that measures external reality. The planning and control system in the figure corresponds to phases 1–3 and 9 in GSBPM notation, and the statistical production system in the figure corresponds to phases 4–8 in the GSBPM.
This is a general, standardized view of the statistical system, and it could represent one survey, a whole statistical office, or even an international organization. How such a system is built up and organized in real life varies greatly. Some implementations of statistical systems have worked quite well so far, others less so. The local environments of statistical systems differ slightly, but major changes in the environment are increasingly global. Regardless of how well a system has performed so far, some global changes in the environment are so large that every system has to adapt and change.
This paper presents the strengths and weaknesses of the main statistical production models, based on W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009), “Terminology Relating To The Implementation Of The Vision On The Production Method Of EU Statistics”. This is followed by a proposal on how to combine the integrated production model and the warehouse approach. Finally, an overview is given of how the layered architecture of a statistical data warehouse brings modularity to the statistical system as a whole.
2 Statistical production models
2.1 Stovepipe model
Today’s prevalent production model in statistical systems is the stovepipe model. It is the outcome of a historical process in which statistics in individual domains have developed independently. In the stovepipe model, a statistical action or survey is independent from other actions in almost every phase of the statistical production value chain.
Advantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek,
H. Linden (2009)):
1. The production processes are best adapted to the corresponding products.
2. It is flexible in that it can adapt quickly to relatively minor changes in the underlying
phenomena that the data describe.
3. It is under the control of the domain manager and it results in a low-risk business
architecture, as a problem in one of the production processes should normally not affect the
rest of the production.
Disadvantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W.
Kloek, H. Linden (2009)):
1. First, it may impose an unnecessary burden on respondents when the collection of data is
conducted in an uncoordinated manner and respondents are asked for the same information
more than once.
2. Second, the stovepipe model is not well adapted to collect data on phenomena that cover
multiple dimensions, such as globalization, sustainability or climate change.
3. Last but not least, this way of production is inefficient and costly, as it does not make use of
standardization between areas and collaboration between the Member States. Redundancies
and duplication of work, be it in development, in production or in dissemination processes
are unavoidable in the stovepipe model.
The stovepipe model is the dominant model in the ESS. It is also reproduced and extended at the Eurostat level, where it is called the augmented stovepipe model.
2.2 Integrated model
The integrated model is a new and innovative way of producing statistics. It is based on the combination of various data sources. The integration can be horizontal or vertical.
1. “Horizontal integration across statistical domains at the level of National Statistical
Institutes and Eurostat. Horizontal integration means that European statistics are no
longer produced domain by domain and source by source but in an integrated fashion,
combining the individual characteristics of different domains/sources in the process of
compiling statistics at an early stage, for example households or business surveys.” (W.
Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
2. “Vertical integration covering both the national and EU levels. Vertical integration
should be understood as the smooth and synchronized operation of information flows at
national and ESS levels, free of obstacles from the sources (respondents or
administration) to the final product (data or metadata). Vertical integration consists of
two elements: joint structures, tools and processes and the so-called European approach
to statistics.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden
(2009))
The integrated model was created to avoid the disadvantages of the stovepipe model (burden on respondents, unsuitability for surveying multi-dimensional phenomena, inefficiency and high costs). “By
integrating data sets and combining data from different sources (including administrative sources)
the various disadvantages of the stovepipe model could be avoided. This new approach would
improve efficiency by elimination of unnecessary variation and duplication of work and create free
capacities for upcoming information needs.” (W. Radermacher, A. Baigorri, D. Delcambre, W.
Kloek, H. Linden (2009))
Moving from the stovepipe model to the integrated model is not an easy task at all. In his answer to the UNSC about the draft of the paper “Guidelines on Integrated Economic Statistics”, W. Radermacher writes: “To go from a conceptually integrated system such as the SNA to a practically integrated system is a long term project and will demand integration in the production of primary statistics. This is the priority objective that Eurostat has given to the European Statistical System through its 2009 Communication to the European Parliament and the European Council on the production method of the EU statistics ("a vision for the new decade").”
The Sponsorship on Standardisation, a strategic task force in the European Statistical System, has compared the traditional and the integrated approach to statistical production. It concludes that “in the current situation, it is clearly shown that there are high level risks and low level opportunities” and
that “the full integration situation is more balanced than the current situation, and the most
interesting point is that risks are mitigated and opportunities exploded.” (The Sponsorship on
Standardisation (2013)) It seems that it is strategically wise to move away from stovepipes and
partly integrated statistical systems toward fully integrated statistical production systems.
2.3 Warehouse approach
In addition to the stovepipe model, the augmented stovepipe model and the integrated model, W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) also describe the warehouse approach: “The warehouse approach provides the means to store data once, but use it for multiple
purposes. A data warehouse treats information as a reusable asset. Its underlying data model is not
specific to a particular reporting or analytic requirement. Instead of focusing on a process-oriented
design, the underlying repository design is modelled based on data inter-relationships that are
fundamental to the organisation across processes.”
Conceptual model of data warehousing in the ESS (European Statistical System)
(W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
“Based on this approach statistics for specific domains should not be produced independently from each other, but as integrated parts of comprehensive production systems, called data warehouses. A data warehouse can be defined as a central repository (or "storehouse") for data collected via various channels.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
3 Integrated Warehouse model
The Integrated Warehouse model combines the integrated model and the warehouse approach into one model. To achieve an integrated, warehouse-centric statistical production system, different statistical domains should be more consistent in methodology and share common tools and a distributed architecture. We first look at integration, then at the warehouse, and then combine both into one model.
3.1 Technical platform integration
Let us look at a classical production system and try to find the key integration points where statistical activities meet each other. A classical stovepipe statistical system looks like this:
Let us begin the integration of the platform from the end of the production system. Every well-integrated statistical production system has main dissemination databases where all detailed statistics are published: one for in-house use and the other for public use. To produce rich and integrated output, especially cross-domain output, we need a warehouse where data are stored once but can be used for multiple purposes. Such a warehouse should sit between the Process and Analyse phases. And, of course, there should be a raw database.
Depending on the specific tools used or other circumstances, one may have more than one raw database, warehouse or dissemination database, but fewer is better. For example, Statistics Estonia has three integrated raw databases. The first is a web-based tool for collecting data from enterprises. The second is a data collection system for social surveys. And the third one is for administrative and other data sources.
But this is not all; let us look at the planning and design phases. Descriptions of all statistical actions, all classifications in use, input and output variables, selected data sources, descriptions of output tables, questionnaires and so on: all these meta-objects should be collected during the design and build phases into one metadata repository. And the needs of clients should be stored in a central CRM database.
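As an illustration only, the sketch below shows one possible shape of such a central metadata repository in Python. All class, attribute and example names (Variable, StatisticalAction, MetadataRepository, "NACE_R2", "TURNOVER") are hypothetical and are not taken from any particular NSI system.

```python
# Illustrative sketch of a single metadata repository for design-phase
# meta-objects; names are hypothetical, not from any real NSI system.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variable:
    code: str            # e.g. "TURNOVER" (hypothetical)
    label: str
    classification: str  # code list it refers to, e.g. "NACE_R2" (hypothetical)


@dataclass
class StatisticalAction:
    code: str
    name: str
    input_variables: List[str]   # codes of standardized input variables
    output_variables: List[str]  # codes of output variables (e.g. SDMX concepts)
    data_sources: List[str]      # questionnaires, administrative registers, ...


class MetadataRepository:
    """One central store for variables and statistical actions."""

    def __init__(self) -> None:
        self.variables: Dict[str, Variable] = {}
        self.actions: Dict[str, StatisticalAction] = {}

    def register_variable(self, var: Variable) -> None:
        self.variables[var.code] = var

    def register_action(self, action: StatisticalAction) -> None:
        # Every input variable an action uses must already be defined once;
        # this is what makes cross-action reuse of definitions possible.
        missing = [v for v in action.input_variables if v not in self.variables]
        if missing:
            raise ValueError(f"Undefined input variables: {missing}")
        self.actions[action.code] = action
```

The only point of the sketch is that every statistical action refers to variables that are defined once, centrally, instead of redefining them locally.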
These are the main integration points at the database level, but this is nothing new or revolutionary. In addition, software tools could be shared between statistical actions. How many data collection systems do we need? How many processing or dissemination tools do we need, both at the local and at the international level? Do we need different processing software for every statistical action, or for every statistical office? This kind of technological integration at the database and software level is important and not an easy task, but it is not enough. We must go deeper into the processes and find ways to standardize sub-processes and methods. One way to go deeper into the process is to look at the variables in each statistical activity.
3.2 Process integration
“Integration should address all stages of the production process, from design of the collection system to the compilation and dissemination of data.” (W. Radermacher (2011)) Today each statistical action designs its sample and questionnaires according to its own needs, uses variations of classifications as needed, selects data sources according to the needs of the action, and so on. In the statistical system there are a number of statistical actions, and each action collects some input variables and produces some output variables. One way to find common ground between different statistical actions and sources is to focus on variables, especially input variables, because data collection and processing are the most costly phases of statistical production. Standardizing these phases gives the fastest and biggest savings. Output variables will be standardized within the SDMX initiative.
Statistical actions should collect unique input variables, not just the rows and columns of tables in a questionnaire. Each input variable should be collected and processed once in each period of time. This should be done so that the outcome, an input variable in the warehouse, can be used for producing various different outputs. This variable-centric focus triggers changes in almost all phases of the statistical production process. Samples, questionnaires, processing rules, imputation methods, data sources, etc., must be designed and built in compliance with standardized input variables, not according to the needs of one specific statistical action.
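A minimal sketch of what “collected and processed once in each period” could mean in practice is given below; the action codes, variable codes and reference periods are invented for the example.

```python
# Illustrative check that each standardized input variable is collected at
# most once per reference period, regardless of which action requests it.
from collections import defaultdict

collection_plan = [
    # (statistical action, input variable, reference period) - all hypothetical
    ("SBS_SURVEY", "TURNOVER", "2012"),
    ("STS_SURVEY", "TURNOVER", "2012"),   # would duplicate the collection
    ("SBS_SURVEY", "EMPLOYEES", "2012"),
]

requests = defaultdict(list)
for action, variable, period in collection_plan:
    requests[(variable, period)].append(action)

for (variable, period), actions in requests.items():
    if len(actions) > 1:
        print(f"{variable}/{period} requested by {actions}: "
              f"collect once and share it via the warehouse")
```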
A variable-based statistical production system reduces the administrative burden, lowers the cost of data collection and processing, and makes it possible to produce richer statistical output faster. Of course, this is true within the boundaries of the standardized design. If there is a need for a special survey, one can design one's own sample, questionnaire, etc., but then this is a separate project with its own price tag. Producing regular statistics this way, however, is not reasonable.
3.3 Warehouse – reuse of data
To organize the reuse of already collected and processed data in the statistical production system, the boundaries of statistical actions must be removed. What remains if statistical actions are removed? Statistical actions are collections of input and output variables, processing methods, etc. When we talk about data and reuse, we are interested in variables, samples or the estimation frame, and the timing of surveys.
The following figure represents a typical scenario with two surveys and one administrative data source. Survey 1 collects two input variables, A and B, with questionnaires and may use variable B’ from the administrative source. Survey 1 analyses variables A and B*, where B* is either B from the questionnaire or the imputed B’ from the administrative source. Survey 2 collects variables C and D and analyses B’, C* and D.
This is a statistical-action-based stovepipe model. In this case it is hard to re-use data in the interpretation layer, because the imputation choices for B* and C* in the integration layer are made “locally”, and the interpretation layer contains several similar variables, such as B* and B’. Also, the samples of Survey 1 and Survey 2 may not be coherent, which means that a third survey that wants to analyse variables A, B’ and D in the interpretation layer without collecting them again has a problem of coherence and sampling.
To solve the problem we should invest some time and effort in planning and preparing Surveys 1 and 2, so that they are coherent in a single integrated, variable- and sampling-centric warehouse. In addition to analysing data and generating output cubes, the interpretation layer can be used for accessing the production data. In the interpretation layer, statisticians can plan and prepare Surveys 1 and 2 by coordinating surveys and archives towards a common evaluation frame and by defining unique variables. The information gained during this phase is the basis for developing and tuning regular production processes in the integration layer.
This means that a coherent approach can be used if statisticians plan their actions following a logical hierarchy of variable estimation in a common frame. What the IT must then support is an adequate environment for designing this strategy.
Then, following a common strategy, consider again Surveys 1 and 2, which collect data with questionnaires and one administrative data source. This time the design-phase decisions, such as the design of the questionnaire, sample selection, imputation method, etc., are made “globally”, in view of the interests of all three surveys. In this way, the integration of processes gives us reusable data in the warehouse. The warehouse now contains each variable only once, making it much easier to reuse and manage our valuable data.
Another way of reusing data already in the warehouse is to calculate new variables. The following figure illustrates a scenario where a new variable E is calculated from variables C* and D, which are already loaded into the warehouse.
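As a rough illustration, assuming the warehouse variables are available as a simple table, the derivation could look like the sketch below; the formula E = C*/D, the unit identifiers and all values are invented for the example.

```python
# Sketch of deriving a new variable E from warehouse variables C* and D.
# Column names, values and the formula are purely illustrative.
import pandas as pd

warehouse = pd.DataFrame({
    "unit_id": [1001, 1002, 1003],
    "C_star":  [120.0, 80.0, 45.5],   # processed variable C*
    "D":       [3, 2, 5],             # processed variable D
})

# The new variable is computed in the integration layer and loaded back
# into the warehouse as just another variable.
warehouse["E"] = warehouse["C_star"] / warehouse["D"]
print(warehouse)
```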
This means that data can be moved back from the warehouse to the integration layer. Warehouse data can be used in the integration layer for multiple purposes; calculating new variables is only one example.
An integrated, variable-based warehouse opens the way to new, subsequent statistical actions that do not have to collect and process data and can produce statistics directly from the warehouse. By skipping the collection and processing phases, new statistics and analyses can be produced much faster and more cheaply than with a classical survey.
Designing and building a statistical production system according to the integrated warehouse model initially takes more time and effort than building a stovepipe model. But the maintenance costs of an integrated warehouse system should be lower, and new products that can be produced faster and more cheaply to meet changing needs should soon compensate for the initial investment.
4 S-DWH as layered modular system
4.1 Layered architecture
In a generic S-DWH system we identify four functional layers in which we group functionalities. The ground level corresponds to the area where external sources come in and are interfaced, while the top of the stack is where produced data are published to external users or systems. In the intermediate layers we manage the ETL functions of the DWH, in which coherence analysis, data mining, and the design of possible new strategies or data re-use are carried out.
Specifically, from the top to the bottom of the architectural stack, we define:
IV° - access layer, for the final presentation, dissemination and delivery of the information sought;
III° - interpretation and data analysis layer, specifically for statisticians, enabling data analysis, data mining and support for designing production processes or data re-use;
II° - integration layer, where all operational activities needed for any statistical production process are carried out;
I° - source layer, the level at which we locate all the activities related to storing and managing internal or external data sources.
The S-DWH layers are in a specific order, and data pass through the layers without skipping any of them. It is not possible to use data directly from another layer: if data are needed, they have to be moved to the layer where they are needed, and they cannot be moved in a way that skips layers. Data can only be moved between neighbouring layers.
For example, to publish data in the access layer, raw data first need to be collected into a raw database in the source layer, then extracted into the integration layer for processing, then loaded into the warehouse in the interpretation layer; only after that can someone calculate statistics or perform an analysis and publish it in the access layer.
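A minimal sketch of this neighbour-only rule, with hypothetical layer and function names, could look like this:

```python
# Sketch of the layer ordering rule: data may only move between neighbouring
# layers and never skip one. Names are illustrative.
from enum import IntEnum


class Layer(IntEnum):
    SOURCE = 1          # I   - source layer
    INTEGRATION = 2     # II  - integration layer
    INTERPRETATION = 3  # III - interpretation and analysis layer
    ACCESS = 4          # IV  - access layer


def move(data, src: Layer, dst: Layer):
    if abs(dst - src) != 1:
        raise ValueError(f"Cannot move data from {src.name} to {dst.name}: "
                         f"layers only exchange data with their neighbours")
    # ... the actual extract/transform/load step would happen here ...
    return data


# Publishing raw data therefore takes three moves, never one big jump:
payload = move("raw data", Layer.SOURCE, Layer.INTEGRATION)
payload = move(payload, Layer.INTEGRATION, Layer.INTERPRETATION)
payload = move(payload, Layer.INTERPRETATION, Layer.ACCESS)
```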
Another example: sometimes it is necessary to monitor the collection process and analyse the raw data during collection. The raw data are then extracted from the collection raw database and processed in the integration layer so that they can easily be analysed with the specific tools used for operational activities, or loaded to the interpretation layer, where they can be analysed freely. This process is repeated as often as needed, for example once a day, once a week or hourly.
Another example: when we receive a new dataset, it is loaded by the integration layer from the source layer to the interpretation layer, where statisticians can evaluate the source or, following changes in administrative regulations, define new variables or updates to existing production processes. It is worth noting that such an update must be included coherently in the S-DWH through proper metadata.
4.2 CORE services and reuse of components
There are three main groups of workflows in the S-DWH. One workflow updates data in the warehouse, the second updates the in-house dissemination database, and the third updates the public dissemination database.
These three automated data flows are quite independent of each other. Flow 1 is the biggest and most complex: it extracts raw data from the source layer, processes it in the integration layer and loads it to the interpretation layer. In the other direction, it brings cleansed data back to the source layer for pre-filling questionnaires, prepares sample data for collection systems, etc. Let us name this flow the processing flow.
Flows 2 and 3 are very similar; both generate standard output to a dissemination database. One updates data in the in-house dissemination database and the other in the public database. Both are unidirectional flows. Let us call Flow 2 generate cube and Flow 3 publish cube. In this context, a cube is a multidimensional table, for example a .Stat or PC-Axis table.
Processing flows should be built around input variables or groups of input variables to feed the variable-based warehouse. Generate-cube and publish-cube flows are built around cubes, i.e. each flow generates or publishes a cube.
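The sketch below illustrates, with purely hypothetical step names and contents, how such modular flows could be composed from small steps; it is not a description of any existing S-DWH implementation.

```python
# Illustrative composition of the three flow families from simple steps.
# Step contents and names are hypothetical.
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]


def run_flow(steps: List[Step], payload: Dict) -> Dict:
    for step in steps:
        payload = step(payload)
    return payload


# Flow 1 "processing flow": source -> integration -> interpretation layer,
# organised around (groups of) input variables.
processing_flow: List[Step] = [
    lambda d: {**d, "raw": "extracted from source layer"},
    lambda d: {**d, "clean": "edited and imputed in integration layer"},
    lambda d: {**d, "loaded": "variables loaded to the warehouse"},
]

# Flows 2 and 3 "generate cube" / "publish cube": one cube per flow run.
generate_cube: List[Step] = [lambda d: {**d, "cube": "aggregated from warehouse variables"}]
publish_cube: List[Step] = [lambda d: {**d, "published": "cube sent to dissemination database"}]

result = run_flow(processing_flow + generate_cube + publish_cube,
                  {"variables": ["TURNOVER", "EMPLOYEES"]})
```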
There are many software tools available to build these modular flows. The layered architecture of the S-DWH itself makes it possible to use different platforms and software in separate layers, i.e. to re-use components already available in-house or internationally. In addition, different software can be used inside the same layer to build up one particular flow. The problems arise when we try to use these different modules and different data formats together.
This is where CORE services come in. If they are used to move data between S-DWH layers and also inside the layers between different sub-tasks (e.g. edit, impute, etc.), it becomes easier to use software provided by the statistical community or to re-use self-developed components to build flows for different purposes.
Generally CORE (COmmon Reference Environment) is an environment supporting the definition of
statistical processes and their automated execution. CORE processes are designed in a standard
way, starting from available services; specifically, process definition is provided in terms of abstract
statistical services that can be mapped to specific IT tools. CORE goes in the direction of fostering
the sharing of tools among NSIs. Indeed, a tool developed by a specific NSI can be wrapped
according to CORE principles, and thus is easily integrated within a statistical process of another
NSI. Moreover, having a single environment for the execution of all statistical processes provides a high level of automation and complete reproducibility of process execution.
NSIs produce official statistics with very similar goals, hence several activities related to the production of statistics are common. Nevertheless, such activities are currently carried out independently, without relying on shared solutions. Sharing a common architecture would reduce the costs caused by duplicated activities and, at the same time, improve the quality of the statistics produced, thanks to the adoption of standardized solutions.
The main principles underlying CORA design are:
a) Platform Independence. NSIs use various platforms (e.g. hardware, operating systems, database management systems, statistical software), hence an architecture is bound to fail if it endeavours to impose standards at the technical level. Moreover, platform independence makes it possible to model statistical processes at a “conceptual level”, so that they do not need to be modified when the implementation of a service changes.
b) Service Orientation. The vision is that the production of statistics takes place through
services calling other services. Hence services are the modular building blocks of the
architecture. By having clear communication interfaces, services implement principles of
modern software engineering like encapsulation and modularity.
c) Layered Approach. According to this principle, some services are rich and are positioned at the top of the statistical process; for instance, a publishing service requires the output of all sorts of services positioned earlier in the statistical process, such as collecting data and storing information. The ambition of this model is to bridge the whole range of layers from collection to publication by describing all layers in terms of services delivered to a higher layer, in such a way that each layer depends only on the layer immediately below it.
For us it is very important to make transitions and mappings between different models and approaches. Unfortunately, mapping a CORE process directly to a business model is not possible, because the CORE model is an information model and there is no way to map a business model onto an information model directly. The two models are about different things. They can only be connected if this connection is itself part of the models.
The CORE information model was designed with such a mapping in mind. Within this model, a
statistical service is an object, and one of its attributes is a reference to its GSBPM process.
Considering the GSBPM a business model, any mapping of the CORE model onto a business model
has to go through this reference to the GSBPM.
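For illustration only, a statistical service object carrying its GSBPM reference could be sketched as follows; the attribute names and example values are hypothetical and do not reproduce the actual CORE information model.

```python
# Hypothetical rendering of a service object whose only bridge to a business
# model is its reference to a GSBPM sub-process.
from dataclasses import dataclass


@dataclass
class StatisticalService:
    name: str
    tool: str              # the wrapped IT tool implementing the service
    gsbpm_process: str     # reference to a GSBPM sub-process (illustrative)


imputation_service = StatisticalService(
    name="edit-and-impute",
    tool="in-house imputation routine",
    gsbpm_process="5.4",   # illustrative GSBPM sub-process identifier
)

# Any business-level grouping of services has to go through this attribute:
services_by_phase = {}
services_by_phase.setdefault(imputation_service.gsbpm_process, []).append(imputation_service)
```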
Usually different services are implemented by different tools that expect different data formats, so service interactions require conversions, and conversions are expensive. Using CORE services for these interactions reduces the number of conversions noticeably.
In a general sense, an integration API makes it possible to wrap a tool in order to make it CORE-compliant, i.e. a CORE executable service. A CORE service is composed of an inner part, which is the tool being wrapped, and of input and output integration APIs. These APIs transform data between the CORE model and the tool-specific format.
As anticipated, CORE mappings are designed for classes of tools, hence integration APIs should support the permitted transformations, e.g. CSV-to-CORE and CORE-to-CSV, Relational-to-CORE and CORE-to-Relational, etc.
Basically, the integration API consists of a set of transformation components. Each transformation component corresponds to a specific data format, and the principal elements of its design are specific mapping files, description files and transform operations.
In order to provide input to a tool (the inner part of a CORE service), the Transform-from-CORE operation is invoked. Conversely, the output of the tool is converted by the Transform-to-CORE operation. A transformation must be launched for each single input or output file. In this way, components can be reused easily and efficiently.
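A hedged sketch of this wrapping idea, assuming CSV as the tool-specific format and a list of dictionaries as a stand-in for the CORE representation, is shown below; the function names mirror the operations described above but are illustrative, not the real CORE integration API.

```python
# Illustrative wrapping of a tool as a CORE executable service: input and
# output integration APIs convert between a simplified stand-in for the CORE
# representation (a list of dicts) and the tool-specific format (CSV here).
import csv
import io
from typing import Callable, Dict, List


def transform_from_core(core_records: List[Dict]) -> str:
    """Transform-from-CORE: CORE representation -> tool-specific CSV input."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(core_records[0].keys()))
    writer.writeheader()
    writer.writerows(core_records)
    return buf.getvalue()


def transform_to_core(csv_output: str) -> List[Dict]:
    """Transform-to-CORE: tool-specific CSV output -> CORE representation."""
    return list(csv.DictReader(io.StringIO(csv_output)))


def core_service(core_input: List[Dict], tool: Callable[[str], str]) -> List[Dict]:
    """Inner tool wrapped by the input and output integration APIs."""
    return transform_to_core(tool(transform_from_core(core_input)))
```

In this sketch, re-using a component means writing only the two transformations for its data format; the inner tool itself is left untouched.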
5 Conclusion
Today, the prevalent model for producing statistics is the stovepipe model, but there are also the integrated model and the warehouse approach. Apart from the unavoidable statistical-methodological strategy needed for defining common statistical variable definitions, common reference samples and a common estimation frame, this paper has put the integrated model and the warehouse approach together. Integration can be looked at from three main viewpoints:
1. Technical integration – integrating IT platforms and software tools.
2. Process integration – integrating statistical processes like survey design, sample selection,
data processing, etc.
3. Data integration – data are stored once, but used for multiple purposes.
When we put all three integration aspects together, we get the S-DWH, which is built on integrated technology, uses integrated processes to produce statistics and reuses data efficiently. At the same time, the S-DWH environment can be used to manage changes to the statistical production process through the interpretation layer, working on the same data, which can support new statistical processing strategies that refine the S-DWH itself.
6 References
B. Sundgren (2010) “The Systems Approach to Official Statistics”, Official Statistics in Honour of
Daniel Thorburn, pp. 225–260. Available at: https://sites.google.com/site/bosundgren/mylife/Thorburnbokkap18Sundgren.pdf?attredirects=0
W. Radermacher (2011) “Global consultation on the draft Guidelines on Integrated Economic
Statistics”.
UNSC (2012) “Guidelines on Integrated Economic Statistics”. Available at:
http://unstats.un.org/unsd/statcom/doc12/RD-IntegratedEcoStats.pdf
The Sponsorship on Standardisation (2013) “Standardisation in the European Statistical System”.
W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) “Terminology Relating
To The Implementation Of The Vision On The Production Method Of EU Statistics”. Available
at: http://ec.europa.eu/eurostat/ramon/coded_files/TERMS-IN-STATISTICS_version_4-0.pdf
European Union, Communication from the Commission to the European Parliament and the Council on the production method of EU statistics: a vision for the next decade, COM(2009) 404 final. Available at: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=COM:2009:0404:FIN:EN:PDF
ESSnet CORE (COmmon Reference Environment) http://www.cros-portal.eu/content/core-0