3.2 S-DWH Modular Workflow_v6.0 (Final).

Title: S-DWH Modular Workflow
WP: 3
Version: 6.0 - Final
Deliverable: 3.2
Date: October 2013
Authors (NSI):
Allan Randlepp (Statistics Estonia)
Antonio Laureti Palma (Istat)
Francesco Altarocca (Istat)
Valerij Žavoronok (Statistics Lithuania)
Pedro Cunha (INE Portugal)
ESSnet on Micro Data Linking and Data Warehousing in Production of Business Statistics
S-DWH Modular Workflow - version history:

Version 1.0 - February 25, 2013 - Allan Randlepp
Version 2.0 - February 27, 2013 - Allan Randlepp
Version 3.0 - March 1, 2013 - Antonio Laureti Palma
Version 4.0 - March 4, 2013 - Allan Randlepp
Version 4.1 - June 17, 2013 - Valerij Žavoronok
Version 4.2 - October 12, 2013 - Francesco Altarocca
Version 5.0 - October 10, 2013 - Pedro Cunha
Version 6.0 - October 15, 2013 - Antonio Laureti Palma (Final Version)
Index

1 Introduction
2 Statistical production models
   2.1 Stovepipe model
   2.2 Integrated model
   2.3 Warehouse approach
3 Integrated Warehouse model
   3.1 Technical platform integration
   3.2 Process integration
   3.3 Warehouse – reuse of data
4 S-DWH as layered modular system
   4.1 Layered architecture
   4.2 Layered approach of a full active S-DWH
   4.3 Source layer
   4.4 Integration layer
   4.5 Interpretation layer
   4.6 Access layer
5 Workflow scenarios
   5.1 Scenario 1: Full linear end-to-end workflow
   5.2 Scenario 2: Monitoring collection
   5.3 Scenario 3: Evaluating new data source
   5.4 Scenario 4: Re-using data for new standard output
   5.5 Scenario 5: Re-using data for complex custom query
   5.6 Generic workflow suitable for reuse of components
   5.7 A simple statistical process
   5.8 CORE services and reuse of components
6 Conclusion
References
1. Introduction
A statistical system is a complex system of data collection, data processing, statistical analyses, etc.
The following figure (by Sundgren (2004)) shows a statistical system as a precisely defined, man-designed system that measures external reality. The planning and control system in the figure corresponds to phases 1–3 and 9 in GSBPM notation, and the statistical production system in the figure corresponds to phases 4–8 in GSBPM.
This is a general, standardized view of the statistical system; it could represent one survey, the whole statistical office, or even an international organization. How such a system is built up and organized in real life varies greatly. Some implementations of statistical systems have worked quite well so far, others less so. Local environments of statistical systems differ slightly, but the big changes in the environment are increasingly global. It no longer matters how well a system has performed so far: some global changes in the environment are so big that every system has to adapt and change.
This paper presents the strengths and weaknesses of the main statistical production models, based on
W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009), “Terminology Relating
To The Implementation Of The Vision On The Production Method Of EU Statistics”. This is
followed by a proposal on how to combine the integrated production model with the warehouse
approach. The result corresponds to a metadata-driven data warehouse, which is well suited for
supporting the management of modules in generic workflows. This modular approach can reduce
the “time to market”, i.e. the length of time it takes from a product being conceived until its
availability for use. Next, an overview is provided of how the layered architecture of a statistical
warehouse gives modularity to the statistical system as a whole. In order to suggest a possible
roadmap towards process optimization and cost reduction, we introduce a simple description of a
generic workflow which links the business model with the information system.
2. Statistical production models
2.1
Stovepipe model
Today’s prevalent production model in statistical systems is the stovepipe model. It is the
outcome of a historic process in which statistics in individual domains have developed
independently. In the stovepipe model, a statistical action or survey is independent from other
actions in almost every phase of the statistical production value chain.
Advantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek,
H. Linden (2009)):
1. The production processes are best adapted to the corresponding products.
2. It is flexible in that it can adapt quickly to relatively minor changes in the underlying
phenomena that the data describe.
3. It is under the control of the domain manager and it results in a low-risk business
architecture, as a problem in one of the production processes should normally not
affect the rest of the production.
Disadvantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W.
Kloek, H. Linden (2009)):
1. First, it may impose an unnecessary burden on respondents when the collection of
data is conducted in an uncoordinated manner and respondents are asked for the
same information more than once.
2. Second, the stovepipe model is not well adapted to collect data on phenomena that
cover multiple dimensions, such as globalization, sustainability or climate change.
3. Last but not least, this way of production is inefficient and costly, as it does not make
use of standardization between areas and collaboration between the Member States.
Redundancies and duplication of work, be it in development, in production or in
dissemination processes are unavoidable in the stovepipe model.
The stovepipe model is the dominant model in the ESS and is reproduced at the Eurostat level as
well, where it is called the augmented stovepipe model.
2.2
Integrated model
The integrated model is a new and innovative way of producing statistics. It is based on the
combination of various data sources. This integration can be horizontal or vertical.
1. “Horizontal integration across statistical domains at the level of National Statistical
Institutes and Eurostat. Horizontal integration means that European statistics are no longer
produced domain by domain and source by source but in an integrated fashion, combining
the individual characteristics of different domains/sources in the process of compiling
statistics at an early stage, for example households or business surveys.” (W. Radermacher,
A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
2. “Vertical integration covering both the national and EU levels. Vertical integration should
be understood as the smooth and synchronized operation of information flows at national
and ESS levels, free of obstacles from the sources (respondents or administration) to the
final product (data or metadata). Vertical integration consists of two elements: joint
structures, tools and processes and the so-called European approach to statistics.” (W.
Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
The integrated model was created to avoid the disadvantages of the stovepipe model (burden on
respondents, unsuitability for surveying multi-dimensional phenomena, inefficiencies and high costs). “By
integrating data sets and combining data from different sources (including administrative sources)
the various disadvantages of the stovepipe model could be avoided. This new approach would
improve efficiency by elimination of unnecessary variation and duplication of work and create free
capacities for upcoming information needs.” (W. Radermacher, A. Baigorri, D. Delcambre, W.
Kloek, H. Linden (2009))
Moving from the stovepipe model to the integrated model is not an easy task at all. In his
answer to the UNSC about the draft of the paper on “Guidelines on Integrated Economic Statistics”, W.
Radermacher writes: “To go from a conceptually integrated system such as the SNA to a practically
integrated system is a long term project and will demand integration in the production of primary
statistics. This is the priority objective that Eurostat has given to the European Statistical System
through its 2009 Communication to the European Parliament and the European Council on the
production method of the EU statistics ("a vision for the new decade").”
The Sponsorship on Standardisation, a strategic task force in the European Statistical System, has
compared the traditional and the integrated approach to statistical production. They conclude that “in the
current situation, it is clearly shown that there are high level risks and low level opportunities” and
that “the full integration situation is more balanced than the current situation, and the most
interesting point is that risks are mitigated and opportunities exploded.” (The Sponsorship on
Standardisation (2013)) It seems strategically wise, then, to move away from stovepipes and
partly integrated statistical systems toward fully integrated statistical production systems.
2.3
Warehouse approach
In addition to the stovepipe model, the augmented stovepipe model and the integrated model, W.
Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) also describe the warehouse
approach: “The warehouse approach provides the means to store data once, but use it for multiple
purposes. A data warehouse treats information as a reusable asset. Its underlying data model is not
specific to a particular reporting or analytic requirement. Instead of focusing on a process-oriented
design, the underlying repository design is modelled based on data inter-relationships that are
fundamental to the organisation across processes.”
Conceptual model of data warehousing in the ESS (European Statistical System)
(W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
“Based on this approach statistics for specific domains should not be produced independently from
each other, but as integrated parts of comprehensive production systems, called data warehouses. A
data warehouse can be defined as a central repository (or "storehouse") for data collected via
various channels.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
3. Integrated Warehouse model
The Integrated Warehouse model combines the integrated model and the warehouse approach into one
model. To have an integrated, warehouse-centric statistical production system, different statistical
domains should be more consistent in methodology and share common tools and a distributed
architecture. We first look at integration, then at the warehouse, and then combine both into one
model.
3.1
Technical platform integration
Let us look at a classical production system and try to find the key integration points where statistical
activities meet each other. A classical stovepipe statistical system looks like this:
Let us begin integrating the platform from the end of the production system. Every well-integrated
statistical production system has a main dissemination database where all detailed statistics are
published: one for in-house use and another for public use. To produce rich and integrated output,
especially cross-domain output, we need a warehouse where data are stored once but can be used for
multiple purposes. Such a warehouse should sit between the process and analyse phases. And of course
there should be a raw database.
Depending on the specific tools used or other circumstances, one may have more than one raw database,
warehouse or dissemination database, but fewer is better. For example, Statistics Estonia has three
integrated raw databases: the first is a web-based tool for collecting data from enterprises, the
second is a data collection system for social surveys, and the third is for administrative and
other data sources.
But this is not all; let us look at the planning and design phases. Descriptions of all statistical actions, all
classifications in use, input and output variables, selected data sources, descriptions of output
tables, questionnaires and so on: all these meta-objects should be collected into one metadata
repository during the design and build phases. And the needs of clients should be stored in a central CRM
database.
These are the main integration points at the database level, but this is nothing new or
revolutionary. Software tools, too, could be shared between statistical actions. How many data
collection systems do we need? How many processing or dissemination tools do we need, both at the
local and the international level? Do we need different processing software for every statistical action,
or for every statistical office? This kind of technological database- and software-level integration is
important and not an easy task, but it is not enough. We must go deeper into the processes
and find ways to standardize sub-processes and methods. One way to go deeper into a process is to
look at the variables in each statistical activity.
3.2
Process integration
“Integration should address all stages of the production process, from design of the collection
system to the compilation and dissemination of data.” (W. Radermacher (2011)) Each statistical
action designs its sample and questionnaires according to its own needs, uses variations of
classifications as needed, selects data sources according to its own needs, and so on.
In the statistical system there are a number of statistical actions, and each action collects some input
variables and produces some output variables. One way to find common ground between
different statistical actions and sources is to focus on variables, especially input variables, because
data collection and processing are the most costly phases of statistical production. Standardizing
these phases gives the fastest and biggest savings. Output variables will be standardized in the SDMX
initiative.
Statistical actions should collect unique input variables, not just the rows and columns of tables in a
questionnaire. Each input variable should be collected and processed once in each period of time,
and this should be done so that the outcome, the input variable in the warehouse, can be used for
producing various different outputs. This variable-centric focus triggers changes in almost all phases of
the statistical production process. Samples, questionnaires, processing rules, imputation methods, data
sources, etc., must be designed and built in compliance with the standardized input variables, not
according to the needs of one specific statistical action.
A variable-based statistical production system reduces the administrative burden, lowers the
cost of data collection and processing, and enables richer statistical output to be produced faster. Of
course, this holds within the boundaries of the standardized design. If there is a need for a special survey, one
can design one's own sample, questionnaire, etc., but then this is a separate project with its own
price tag. Producing regular statistics this way, however, is not reasonable.
3.3
Warehouse – reuse of data
To organize the reuse of already collected and processed data in the statistical production system, the
boundaries of statistical actions must be removed. What remains if statistical actions are
removed? Statistical actions are collections of input and output variables, processing methods, etc.
When we talk about data and reuse, we are interested in variables, samples or the estimation frame,
and the timing of surveys.
The following figure represents a typical scenario with two surveys and one administrative data
source. Survey 1 collects two input variables, A and B, with questionnaires and may use the variable
B’ from the administrative source. Survey 1 analyses variables A and B*, where B* is either B from the
questionnaire or B imputed from the administrative B’. Survey 2 collects variables C and D and
analyses B’, C* and D.
This is a statistical action based on the stovepipe model. In this case it is hard to re-use data in the
interpretation layer, because the imputation choices made in the integration layer for B* and C* are made
“locally”, and the interpretation layer offers a wide choice of similar variables, like B* and B’. Also, the
samples of Survey 1 and Survey 2 may not be coherent, which means that a third survey, wanting
to analyse variables A, B’ and D in the interpretation layer without collecting them again, faces a
problem of coherence and sampling.
To solve the problem, we should invest some time and effort in planning and preparing Surveys 1
and 2 so that they are coherent in a unique, integrated, variable-sampling-centric warehouse.
In addition to analysing data and generating output cubes, the interpretation layer can be used for
accessing the production data. In the interpretation layer, statisticians can plan and prepare Surveys 1
and 2 by coordinating surveys and archives in a common evaluation frame and defining unique
variables. Information gained during this phase is the basis for developing and tuning regular
production processes in the integration layer.
This means that a coherent approach can be used if statisticians plan their actions following a
logical hierarchy of variable estimation in a common frame. What IT must support, then, is
an adequate environment for designing this strategy.
Then, following a common strategy, Surveys 1 and 2, which collect data with questionnaires and
one administrative data source, again serve as examples. But this time, the design-phase decisions,
such as questionnaire design, sample selection, imputation method, etc., are made “globally”, in
view of the interests of all three surveys. This way, the integration of processes gives us reusable data
in the warehouse. Our warehouse now contains each variable only once, making it much easier to
reuse and manage our valuable data.
Another way of reusing data already in the warehouse is to calculate new variables. The following
figure illustrates a scenario in which a new variable E is calculated from variables C* and D, already
loaded into the warehouse.
This means that data can be moved back from the warehouse to the integration layer. Warehouse data
can be used in the integration layer for multiple purposes; calculating new variables is only one
example.
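A minimal sketch of this kind of derived-variable step, in Python. The variable names follow the figure; the derivation rule itself (a simple sum) is hypothetical and stands in for whatever method the survey design prescribes.

```python
# Warehouse extract: one entry per statistical unit, with the already
# processed variables C* and D (unit ids and values are made up).
warehouse = {
    101: {"C_star": 10.0, "D": 2.0},
    102: {"C_star": 12.5, "D": 0.5},
    103: {"C_star": 7.0,  "D": 3.0},
}

# Integration-layer step: pull C* and D back from the warehouse,
# compute the new variable E, and load E back next to its inputs.
for unit, variables in warehouse.items():
    variables["E"] = variables["C_star"] + variables["D"]

# No new data collection or processing of raw data was needed:
# E now exists in the warehouse alongside C* and D.
```

The point of the sketch is the data flow, not the arithmetic: the inputs already sit in the warehouse, so producing E skips the collection and processing phases entirely.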
An integrated, variable-based warehouse opens the way to any new sub-sequent statistical
action that does not have to collect and process data and can produce statistics straight from the
warehouse. By skipping the collection and processing phases, new statistics and analyses can be
produced very fast and much more cheaply than with a classical survey.
Designing and building a statistical production system according to the integrated warehouse model
initially takes more time and effort than building a stovepipe model. But the maintenance costs of an
integrated warehouse system should be lower, and the new products that can be produced faster and
cheaper, to meet changing needs, should soon compensate for the initial investment.
4. S-DWH as layered modular system
4.1
Layered architecture
In a generic S-DWH system we identify four functional layers in which we group functionalities.
The ground level corresponds to the area where the external sources come in and are interfaced,
while the top of the pile is where produced data are published to external users or systems. In the
intermediate layers we manage the ETL functions of the DWH, in which coherence analysis, data
mining, and the design of possible new strategies or data re-use are carried out.
Specifically, from the top to the bottom of the architectural pile, we define:

IV. the access layer, for the final presentation, dissemination and delivery of the information
sought;
III. the interpretation and data analysis layer, specifically for statisticians, enabling any data
analysis, data mining and support for designing production processes or data re-use;
II. the integration layer, where all operational activities needed for any statistical production
process are carried out;
I. the source layer, the level in which we locate all the activities related to storing and managing
internal or external data sources.
The S-DWH layers are in a specific order, and data go through the layers without skipping any. It is
impossible to use data directly from another layer: if data are needed, they have to be
moved to the layer where they are needed, and they cannot be moved in a way that skips layers.
Data can only be moved between neighbouring layers.
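The neighbouring-layers rule can be sketched as a simple guard. This is a minimal illustration, not part of the S-DWH specification: the layer names come from the architecture above, while the `move` function and record structure are hypothetical.

```python
from enum import IntEnum

class Layer(IntEnum):
    """The four S-DWH layers, ordered from bottom (I) to top (IV)."""
    SOURCE = 1
    INTEGRATION = 2
    INTERPRETATION = 3
    ACCESS = 4

def move(data: dict, src: Layer, dst: Layer) -> dict:
    """Move data between layers, enforcing the neighbouring-layers rule."""
    if abs(dst - src) != 1:
        raise ValueError(f"cannot move {src.name} -> {dst.name}: "
                         "layers must be adjacent")
    return {**data, "layer": dst.name}

record = move({"variable": "A"}, Layer.SOURCE, Layer.INTEGRATION)  # allowed
# move({"variable": "A"}, Layer.SOURCE, Layer.INTERPRETATION) would raise,
# because it skips the integration layer.
```

Note that the guard is symmetric: moving data back down, for example from the interpretation layer to the integration layer, is equally allowed as long as the layers are adjacent.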
4.2
Layered approach of a full active S-DWH
The layered architecture reflects a conceptual organization in which we consider the first two
levels as pure statistical operational infrastructure, functional for acquiring, storing, editing and
validating data, and the last two layers as the effective data warehouse, i.e. levels in which data are
accessible for data analysis.
These reflect two different IT environments: an operational one, where we support semi-automatic
computer interaction systems, and an analytical one, the warehouse, where we maximize free human
interaction.
4.3
Source layer
The Source layer is the gathering point for all data that are going to be stored in the data warehouse.
Input to the Source layer is data from both internal and external sources. Internal data is mainly data
from surveys carried out by the NSI, but it can also be data from maintenance programs used to
manipulate data in the data warehouse. External data means administrative data, i.e. data
collected by someone else, originally for some other purpose.
The structure of data in the Source layer depends on how the data are collected and on the designs of the
various direct data collection processes internal to the NSI. The specifications of the collection
processes and their output, the data stored in the Source layer, have to be thoroughly described.
Vital information is the name and meaning, definition and description of any collected variable.
The collection process itself must also be described, for example the source of a collected item,
when it was collected and how.
When data enter the source layer from an external source or administrative archive, the data
and their metadata must be checked in terms of completeness and coherence.
From a data-structure point of view, external data are stored with the same structure in which they
arrive. The integration toward the integration layer is then realized by mapping each
source variable to a target variable, i.e. a variable internal to the S-DWH.
[Figure: administrative data entering the source layer, mapped through source-layer metadata]
Mapping is a graphic or conceptual representation of information that expresses relationships
within the data, i.e. the process of creating data element mappings between two
distinct data models.
The common and original practice of mapping is the effective interpretation of an administrative
archive in terms of S-DWH definitions and meaning.
Data mapping involves combining data residing in different sources and providing users with a
unified view of these data. Such systems are formally defined as a triple <T, S, M>, where T is the
target schema, S is the heterogeneous set of source schemas, and M is the mapping that maps
queries between the source and the target schemas.
Queries over the data mapping system also assert the data linking between elements in the sources
and the business register units.
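A toy version of the <T, S, M> idea can be sketched as follows. All field names, codes and transforms here are hypothetical; the point is only the shape of M: each source field is paired with a target variable and a conversion into the S-DWH's own units and codes.

```python
# S: one record in an administrative source schema (hypothetical fields).
source_record = {"turnover_eur_thousands": 120, "nace_code": "C2511"}

# M: source field -> (target variable in T, transform into target terms).
mapping = {
    "turnover_eur_thousands": ("TURNOVER", lambda v: v * 1000),  # to euros
    "nace_code": ("ACTIVITY", lambda v: v[:3]),    # truncate the code
}

def apply_mapping(record: dict, mapping: dict) -> dict:
    """Translate one source record into target-schema (T) variables."""
    return {target: transform(record[field])
            for field, (target, transform) in mapping.items()
            if field in record}

target_record = apply_mapping(source_record, mapping)
# target_record == {"TURNOVER": 120000, "ACTIVITY": "C25"}
```

In a real S-DWH the mapping M would live in the source-layer metadata rather than in code, so that the same administrative archive can be re-interpreted without reprogramming.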
Internal sources do not need mapping, since their data collection processes are defined in the
S-DWH during the design phase, using internal definitions.
4.4
Integration layer
From the Source layer, data are loaded into the Integration layer. The Integration layer represents an
operational system used to process the day-to-day transactions of an organization; such systems are
designed to process transactions efficiently and with integrity. The process of translating data from
source systems and transforming them into useful content in the data warehouse is commonly called
ETL:

- In the Extract step, data are moved from the Source layer and made accessible in the
Integration layer for further processing.
- The Transformation step involves all the operational activities usually associated with the
typical statistical production process.
- As soon as a variable has been processed in the Integration layer in a way that makes it useful
in the context of the data warehouse, it has to be Loaded into the Interpretation layer and the
Access layer.
The Transformation step involves all the operational activities usually associated with the typical
statistical production process, examples of activities carried out during the transformation are:
• Find, and if possible, correct incorrect data;
• Transform data to formats matching standard formats in the data warehouse;
• Classify and code;
• Derive new values;
• Combine data from multiple sources;
• Clean data, that is for example correct misspellings, remove duplicates and handle missing
values.
To accomplish the different tasks in the transformation of new data into useful output, data already in
the data warehouse are used to support the work. Examples of such usage are using existing data
together with new data to derive a new value, or using old data as a basis for imputation.
Each variable in the data warehouse may be used for several different purposes in any number of
specified outputs. As soon as a variable is processed in the Integration layer in a way that makes it
useful in the context of data warehouse output, it has to be loaded into the Interpretation layer and
the Access layer.
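The three ETL steps can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: the record structure, field names and cleaning rules are hypothetical, standing in for the survey-specific editing and coding rules described above.

```python
# Raw records as they might sit in the source layer (hypothetical fields).
raw_source = [
    {"unit_id": "101", "turnover": " 1200 "},
    {"unit_id": "102", "turnover": "900"},
    {"unit_id": "102", "turnover": "900"},   # duplicate to be removed
]

def extract(records):
    """Extract: copy records out of the source layer untouched."""
    return list(records)

def transform(records):
    """Transform: standardize formats and remove duplicate units."""
    seen, out = set(), []
    for r in records:
        key = r["unit_id"]
        if key in seen:
            continue                      # drop duplicate units
        seen.add(key)
        out.append({"unit_id": int(key),
                    "turnover": float(r["turnover"].strip())})
    return out

def load(records, warehouse):
    """Load: append the processed variables to the warehouse store."""
    warehouse.extend(records)
    return warehouse

warehouse = load(transform(extract(raw_source)), [])
# warehouse == [{"unit_id": 101, "turnover": 1200.0},
#               {"unit_id": 102, "turnover": 900.0}]
```

In practice each of these steps is a configurable, metadata-driven process rather than fixed code, but the separation of concerns is the same: extract moves data without changing it, transform applies the statistical rules, and load publishes the result to the layers above.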
The Integration layer is an area for processing data; this is carried out by operators specialized in ETL
functionalities. Since the focus of the Integration layer is on processing rather than on search and
analysis, data in the Integration layer should be stored in a generalized, normalized structure
optimized for OLTP (Online Transaction Processing, a class of information systems that facilitate
and manage transaction-oriented applications, typically for data entry and retrieval transaction
processing), where all data are stored in a similar structure independently of the domain or
topic, and each fact is stored in only one place, which makes it easier to maintain consistent data.
It is well known that these databases are very powerful for data manipulation such as
inserting, updating and deleting, but very ineffective when we need to analyse and deal with a
large amount of data. Another constraint in the use of OLTP systems is their complexity: users must
have great expertise to manipulate them, and it is not easy to understand all of that intricacy.
Some OLTP characteristics are:

Source of data: operational data
Purpose of data: to control and run fundamental business tasks
Processing speed: typically very fast
Database design: highly normalized, with many tables
Backup and recovery: backed up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability
Age of data: current
Queries: relatively standardized and simple queries, returning relatively few records
Database operations: insert, delete and update
What the data reveals: a snapshot of on-going business processes
During the several ETL processes, a variable will likely appear in several versions. Every time a value
is corrected or changed for some other reason, the old value should not be erased; instead, a new version
of that variable should be stored. This mechanism ensures that all items in the database
can be followed over time.
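A minimal sketch of such an append-only versioning mechanism, assuming a flat list as the version store; the record structure and function names are hypothetical.

```python
import datetime

# Append-only version store: one entry per version of a variable value.
versions = []

def set_value(unit_id, variable, value, reason):
    """Record a new version instead of overwriting the old one."""
    versions.append({
        "unit_id": unit_id,
        "variable": variable,
        "value": value,
        "reason": reason,
        "valid_from": datetime.datetime.now(datetime.timezone.utc),
    })

def current_value(unit_id, variable):
    """The latest version wins; earlier versions stay queryable."""
    history = [v for v in versions
               if v["unit_id"] == unit_id and v["variable"] == variable]
    return history[-1]["value"] if history else None

set_value(101, "TURNOVER", 1200.0, "initial load")
set_value(101, "TURNOVER", 1250.0, "editing correction")

# current_value returns 1250.0, but both versions remain in the store,
# so the item can be followed over time.
```

A production system would typically implement this with validity timestamps on the database rows rather than an in-memory list, but the invariant is the same: corrections append, they never destroy history.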
4.5
Interpretation layer
This layer contains all collected data, processed and structured to be optimized for analysis and as a
basis for the output planned by the NSI. The Interpretation layer is specially designed for statistical
experts and is built to support the data manipulation of big, complex search operations. Typical
activities in the Interpretation layer are:

• Basic analysis;
• Correlation and multivariate analysis;
• Hypothesis testing, simulation and forecasting;
• Data mining;
• Design of new statistical strategies;
• Design of data cubes for the Access layer.
Its underlying data model is not specific to a particular reporting or analytic requirement. Instead of
focusing on a process-oriented design, the repository design is modelled on the data inter-relationships that are fundamental to the organization across processes.
Data warehousing has become an important strategy for integrating heterogeneous information sources in
organizations, and for enabling their analysis and quality assurance.
The Interpretation layer will contain micro data, elementary observed facts, aggregations and
calculated values, but it will also contain all data at the finest granular level, in order to be able
to cover all possible queries and joins. Fine granularity is also a condition for managing changes of
required output over time.
Besides the actual data warehouse content, the Interpretation layer may contain temporary data
structures and databases created and used by the different on-going analysis projects carried out by
statistics specialists.
The ETL process at the integration level continuously creates metadata regarding the variables and the
process itself, which are stored as part of the data warehouse.
Although data warehouses are built on relational database technology, the design of a data
warehouse database differs substantially from the design of an online transaction processing system
(OLTP) database.
OnLine Analytical Processing (OLAP) databases are:
• Subject oriented
• Designed to provide real-time analysis
• Historical in their data
• Highly de-normalized
OLAP structures are multi-dimensional and are optimised for processing very complex real-time ad-hoc read
queries.
Some OLAP characteristics are:
Source of data: consolidated data; OLAP data comes from the various OLTP databases.
Purpose of data: to help with planning, problem solving and decision support.
Processing speed: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
Database design: typically de-normalized with fewer tables; use of star schemas.
Backup and recovery: regular backups.
Age of data: historical.
Queries: often complex queries involving aggregations.
Database operations: read.
What the data reveals: multi-dimensional views of various kinds of statistical activities.
In this layer a specific type of OLAP should be used: ROLAP (Relational Online Analytical
Processing), which uses specific analytical tools on a relational dimensional data model that is easy to
understand and does not require pre-computation and storage of the information.
In a relational database, the fact tables of the Interpretation layer should be organized in a dimensional
structure to support data analysis in an intuitive and efficient way. Dimensional models are
generally structured with fact tables and their associated dimensions. Facts are generally numeric,
and dimensions are the reference information that gives context to the facts. For example, a sales
trade transaction can be broken up into facts, such as the number of products moved and the price
paid for the products, and into dimensions, such as order date, customer name and product number.
A fact table consists of measurements, metrics or facts of a statistical topic. Fact tables in the DWH
are organized in a dimensional model, built on a star-like schema, with dimensions surrounding the
fact table. In the S-DWH, fact tables are defined at the finest level of granularity, with information
organized in columns distinguished into dimensions, classifications and measures. Dimensions are
the descriptions of the fact table. Typically dimensions are nouns like date, class of employment,
territory, NACE, etc., and can carry hierarchies; for example, the date dimension could contain
data such as year, month and weekday.
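The fact-and-dimension organization described above can be sketched in plain Python. The table contents, keys and column names below are illustrative, not part of the S-DWH data model:

```python
# Illustrative star schema: a fact table holding measures plus foreign keys,
# and a date dimension giving context (year, month). All names are made up.

fact_sales = [  # (date_key, product_key, units_moved, price_paid)
    (1, 10, 5, 50.0),
    (1, 11, 2, 30.0),
    (2, 10, 7, 70.0),
]
dim_date = {1: {"year": 2013, "month": "Jan"},
            2: {"year": 2013, "month": "Feb"}}

# A typical dimensional query: total units moved per year, obtained by
# joining the fact rows to the date dimension and aggregating.
units_per_year = {}
for date_key, _product_key, units, _price in fact_sales:
    year = dim_date[date_key]["year"]
    units_per_year[year] = units_per_year.get(year, 0) + units
```

In a real ROLAP setting the same join-and-aggregate would be an SQL query over the star schema; the point here is only the separation between measures (facts) and context (dimensions).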
A star schema can be realized by dynamic ad hoc queries on the integration layer, driven by the
proper metadata, which generally perform a data transposition. With a dynamic approach, any
expert user can define their own analysis context, starting from an already existing materialized
data mart or from a virtual or temporary environment derived from the data structure of the
integration layer. This method allows users to build permanent or temporary data marts
automatically as a function of their needs, leaving them free to test any possible new strategy.
Figure 1 – Star-schema
A key advantage of a dimensional approach is that the data warehouse is easy to use and operations
on data are very quick. In general, dimensional structures are easy to understand for business users,
because the structures are divided into measurements/facts and context/dimensions related to the
organization’s business processes.
A dimension is sometimes referred to as an axis for analysis. Time, Location and Product are the
classic dimensions.
A dimension is a structural attribute of a cube that is a list of members, all of which are of a similar
type in the user's perception of the data. For example, all months, quarters, years, etc., make up a
time dimension; likewise all cities, regions, countries, etc., make up a geography dimension.
A dimension table is one of the set of companion tables to a fact table and normally contains
attributes (or fields) used to constrain and group data when performing data warehousing queries.
Dimensions correspond to the "branches" of a star schema.
The positions of a dimension are organised according to a series of cascading one-to-many
relationships. This way of organizing data is comparable to a logical tree, where each member has
only one parent but a variable number of children.
For example, the positions of the Time dimension might be months, but also days, periods or years.
A dimension can have hierarchies, which are classified into levels. All the positions of a level
correspond to a unique classification. For example, in a "Time" dimension, level one stands for
days, level two for months and level three for years.
Hierarchies can be balanced, unbalanced or ragged. In balanced hierarchies, all branches of
the hierarchy descend to the same level, with each member's parent being at the level
immediately above the member. In unbalanced hierarchies, the branches do not all reach the
same level, but each member's parent does belong to the level immediately above it. In
ragged hierarchies, the parent of at least one member of a dimension is not in the level
immediately above the member. As in unbalanced hierarchies, the branches can descend to
different levels.
Usually, unbalanced and ragged hierarchies must be transformed into balanced hierarchies.
Figure 2: Balanced Hierarchy
Figure 3: Unbalanced Hierarchy
Figure 4: Ragged Hierarchy
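The three hierarchy types can be told apart mechanically. Below is a rough Python sketch under our own representation (a `parent` map of child-to-parent links plus a `level` number per member, root at level 0); it is an illustration of the definitions above, not a prescribed S-DWH algorithm:

```python
# Classify a dimension hierarchy as balanced, unbalanced or ragged.
# Representation (assumed for this sketch): parent maps child -> parent,
# level maps member -> level number, with the root at level 0.

def classify(parent, level):
    # ragged: some member's parent is not in the level immediately above it
    if any(level[c] != level[p] + 1 for c, p in parent.items()):
        return "ragged"
    # leaves are members that never appear as someone's parent
    leaves = set(level) - set(parent.values())
    depths = {level[m] for m in leaves}
    # balanced: every branch descends to the same level
    return "balanced" if len(depths) == 1 else "unbalanced"

# Time hierarchy Year -> Month -> Day: every branch ends at the same level.
parent = {"Jan": "2013", "Feb": "2013", "Jan-01": "Jan", "Feb-01": "Feb"}
level = {"2013": 0, "Jan": 1, "Feb": 1, "Jan-01": 2, "Feb-01": 2}
```

Adding a childless month makes the hierarchy unbalanced; linking a day directly under the year makes it ragged.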
4.6 Access layer
The Access layer is the layer for the final presentation, dissemination and delivery of information.
This layer is used by a wide range of users and computer instruments. The data is optimized to
present and compile information effectively. Data may be presented in data cubes and in different
formats specialized to support different tools and software. Generally, the data structures are
optimized for MOLAP (Multidimensional Online Analytical Processing), which uses specific
analytical tools on a multidimensional data model.
Multidimensional structure is defined as “a variation of the relational model that uses
multidimensional structures to organize data and express the relationships between data”. The
structure is broken into cubes and the cubes are able to store and access data within the confines of
each cube. “Each cell within a multidimensional structure contains aggregated data related to
elements along each of its dimensions”. Even when data is manipulated it remains easy to access
and continues to constitute a compact database format. The data still remains interrelated.
Multidimensional structures are quite popular for analytical databases that use online analytical
processing (OLAP) applications.
Analytical databases use these structures because of their ability to deliver answers to complex
business queries swiftly. Data can be viewed from different angles, which gives a broader
perspective of a problem, unlike other models.
Some Data Marts might need to be refreshed from the Data Warehouse daily, whereas other user
groups might want refreshes only monthly.
Each Data Mart can contain different combinations of tables, columns and rows from the Statistical
Data Warehouse. For example, a statistician or user group that doesn't require a lot of historical
data might only need transactions from the current calendar year in the database. The analysts might
need to see all details about data, whereas data such as "salary" or "address" might not be
appropriate for a Data Mart that focuses on Trade.
Three basic types of data marts are dependent, independent, and hybrid. The categorization is based
primarily on the data source that feeds the data mart. Dependent data marts draw data from a central
data warehouse that has already been created. Independent data marts, in contrast, are standalone
systems built by drawing data directly from operational or external sources of data or both. Hybrid
data marts can draw data from operational systems or data warehouses.
In the ideal information system architecture of a fully active S-DWH, the Data Marts are dependent
data marts: data in the data warehouse is aggregated, restructured and summarized when it passes
into the dependent data mart. The architecture of a dependent data mart is as follows:
Figure 5 – Dependent data mart
Figure 6 – Independent data mart
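A dependent data mart, fed only from the central warehouse, can be sketched as a filter-and-restructure step. The warehouse rows, the Trade domain and the dropped "salary" column below are hypothetical examples chosen to echo the text:

```python
# Sketch of a dependent data mart: it draws data only from the central
# warehouse, filtering, restructuring and summarizing on the way in.
# All row contents and names are illustrative.

warehouse = [
    {"unit": "U1", "year": 2013, "domain": "Trade",    "salary": 900, "turnover": 100},
    {"unit": "U2", "year": 2013, "domain": "Trade",    "salary": 800, "turnover": 200},
    {"unit": "U3", "year": 2013, "domain": "Industry", "salary": 700, "turnover": 300},
]

def build_trade_mart(rows):
    # keep only the current year and the Trade domain, and drop columns
    # such as "salary" that are not appropriate for this mart
    return [{"unit": r["unit"], "turnover": r["turnover"]}
            for r in rows if r["domain"] == "Trade" and r["year"] == 2013]

trade_mart = build_trade_mart(warehouse)
```

An independent data mart would run the same kind of logic directly against operational or external sources instead of the warehouse.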
There are benefits to building dependent data marts:
• Performance: when the performance of the data warehouse becomes an issue, building one or
two dependent data marts can solve the problem, because the data processing is performed
outside the data warehouse.
• Security: by putting data outside the data warehouse in dependent data marts, each department
owns its data and has complete control over it.
5. Workflow scenarios
The metadata-driven system of a S-DWH is well-suited for supporting the management of modules
in generic workflows. This modular approach can reduce the “time to market”, i.e. the length of
time it takes from a product being conceived until its availability for use. In order to suggest a
possible roadmap towards process optimization and cost reduction, in this paragraph we will
introduce a possible simple description of a generic workflow, which links the business model with
the information system.
A layered architecture, modular tools and a variable-based warehouse are a powerful combination
that can be used for different scenarios. Here are some examples of workflows that the S-DWH supports.
5.1 Scenario 1: full linear end-to-end workflow
To publish data in the access layer, raw data needs to be collected into the raw database in the source
layer, then extracted into the integration layer for processing, then loaded into the warehouse in the
interpretation layer; after that, someone can calculate statistics or perform an analysis and publish it
in the access layer.
5.2 Scenario 2: Monitoring collection
Sometimes it is necessary to monitor the collection process and analyse the raw data during
collection. The raw data is then extracted from the collection raw database and either processed in
the integration layer, so that the data can be easily analysed with the specific tools used for
operational activities, or loaded into the interpretation layer, where it can be freely analysed. This
process is repeated as often as needed, for example once a day, once a week or hourly.
5.3 Scenario 3: Evaluating new data source
When we receive a dataset from a new data source, it should be evaluated by statisticians. The dataset
is loaded by the integration layer from the source to the interpretation layer, where statisticians can
carry out their source evaluation or, following changes in administrative regulations, define new
variables or process updates for existing production processes. From a technical perspective, this
workflow is the same as the one described in scenario 2. It is interesting to note that such an update
must be included in the coherent S-DWH by proper metadata.
5.4 Scenario 4: Re-using data for new standard output
Statisticians can analyse data already prepared in the integration layer, compile new products and load
them to the access layer. If the S-DWH is built correctly and correct metadata is provided, then
compiling new products from already collected and prepared data should be easy, and the preferred
way of working.
5.5 Scenario 5: Re-using data for complex custom query
This is a variation of scenario 4: instead of generating new standard output from the data
warehouse, a statistician can carry out an ad-hoc analysis using data already collected and prepared
in the warehouse, and prepare a custom query for the customer.
5.6 Generic workflow suitable for reuse of components
A workflow identifies a collection of actions, operations and procedures with a predetermined
order. Each activity starts, or may start, only if all the activities that precede it in the order are
accomplished. Typically, workflows model or represent a process; this involves the way in which
activities have to be completed in order to carry out the process.
There are many ways to describe a workflow. In this document the Directed Acyclic Graph (DAG)
is used to facilitate immediate interpretation. According to the DAG definition, an activity is
represented by a node and a dependence by an arrow (figure below).
A node represents a well-defined activity (input, processing mechanism and output), while an arrow
represents a dependence relationship. The figure above depicts a simple workflow with two explicit
dependences (node B depends on node A and node C depends on node B), and the transitive
dependence (node C depends on node A). The semantics of the term “dependence” is: if activity B
depends on A, then B cannot start until A is complete.
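The dependence semantics above can be sketched as a tiny scheduler that runs the A → B → C graph from the text. The representation (a dict of node-to-prerequisites) is our own choice for illustration:

```python
# Tiny sketch of the DAG in the text: B depends on A, C depends on B, so C
# transitively depends on A. An activity may start only when all the
# activities it depends on are accomplished.

deps = {"A": [], "B": ["A"], "C": ["B"]}

def run_order(deps):
    done, order = set(), []
    while len(done) < len(deps):
        for node, prereqs in deps.items():
            if node not in done and all(p in done for p in prereqs):
                order.append(node)        # all predecessors are complete
                done.add(node)
    return order
```

Any execution order produced this way respects both the explicit dependences and the transitive one.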
5.7 A simple statistical process
This paragraph gives some examples of the concepts introduced above. The first one represents a
simple statistical process from a high level perspective. A generic statistical process, in accordance
with the Generic Statistical Business Process Model, can be subdivided into nine phases: specify
need, design, build, collect, process, analyse, disseminate, archive and evaluate. Each of them can
be broken down into sub-processes. For instance the Collect phase is divided into: select sample,
setup collection, run collection and finalize collection.
Therefore, a generic workflow is:
where every phase has to end before the next one can start.
Clearly not all phases and processes in the GSBPM have to be used: it depends on the purpose and
the characteristics of the survey.
This is an example of a high level point of view and therefore does not show the intrinsic
complexity of a statistical survey because it hides single processes and because every phase is
sequential.
Sometimes a process in a subsequent phase could start even though all the previous phases have not
completely ended. This leads to a more complex web of relationships between single processes.
In more depth, the next example focuses on the Process phase of the statistical production. The
Process step comprises several activities. Adapting the same model approach used before, this
section shows a few examples.
Looking at the Process phase in more detail, there are sub-processes. These elementary tasks are the
finest-grained elements of the GSBPM.
We will try to sub-divide the sub-processes into elementary tasks in order to create a conceptual
layer closer to the IT infrastructure. With this aim we will focus on “Review, validate, edit” and we
will describe a possible generic sub-task implementation in what follows.
Let's take a sample of five statistical units (represented in the following diagram by three triangles
and two circles) each containing the values from three variables (V1, V2 and V3) which have to be
edited (checked and corrected). Every elementary task has to edit a sub-group of variables.
Therefore a unit entering a task is processed and leaves the task with all that task's variables edited.
We will consider a workflow composed of six activities (tasks): S (starting), F (finishing), and four
data-editing activities S1, S2, S3 and S4. Suppose also that each type of unit needs a different
activity path, where triangle-shaped units need more articulated treatment of variables V1 and V2.
For this purpose a “filter” element (the diamond in the diagram) is introduced, which diverts each
unit to the correct part of the workflow. It is important to note that only V1 and V2 are processed
differently, because the two branches re-join in task S4.
During the workflow, all the variables are inspected task by task and, when necessary, transformed
into a coherent state. Therefore each task contributes to the set of coherent variables. Note that
every path in the workflow meets the same set of variables. This incremental approach ensures that
at the end of the workflow every unit has its variables edited. The table below shows some
interesting attributes of the tasks.
Task | Input | Output | Purpose | Module | Data source | Data target
S | All units | All units | Dummy task | - | TAB_L_I_START | TAB_L_II_TARGET
S1 | Circle units | Circle units (V1, V2 corrected) | Edit and correct V1 and V2 | EC_V1(CU, P1), EC_V2(CU, P2) | TAB_L_II_TARGET | TAB_L_II_TARGET
S2 | Triangle units | Triangle units (V1 corrected) | Edit and correct V1 | EC_V1(TU, P11) | TAB_L_II_TARGET | TAB_L_II_TARGET
S3 | Triangle units (V1 corrected) | Triangle units (V1, V2 corrected) | Edit and correct V2 | EC_V2(TU, P22) | TAB_L_II_TARGET | TAB_L_II_TARGET
S4 | All units (V1, V2 corrected) | All units (all variables corrected) | Edit and correct V3 | EC_V3(U, P3) | TAB_L_II_TARGET | TAB_L_II_TARGET
F | All units | All units | Dummy task | - | TAB_L_II_TARGET | TAB_L_III_FINAL
The columns in the table above provide useful elements for the building and definition of modular
objects. These objects could be employed in an applicative framework where data structures and
interfaces are shared in a common infrastructure.
The task column identifies the sub-activities in the workflow: the subscript, when present,
corresponds to different sub-activities.
Input and output columns identify the statistical information units that must be processed and
produced respectively by each sub-activity. A simple textual description of the responsibility of
each sub-activity or task is given in the purpose column.
The module column shows the function needed to fulfil the purpose. As in the table above, we
could label each module with a prefix, representing a specific sub-process EC function (Edit and
Correct), and a suffix indicating the variable to work with. The first parameter in the function
indicates the unit to treat (CU stands for circle unit, TU for triangle unit), the second parameter
indicates the procedure, i.e. a threshold, a constant, a software component.
Structuring modules in such a way could enable the reuse of components. The example in the table
above shows activity S1 as a combination of EC_V1 and EC_V2, where EC_V1 is used by S1 and
also by S2, and EC_V2 is used by S1 and also by S3. Moreover, because the work on each variable is
similar, a single function could be considered as a skeleton containing a modular system, in order to
reduce building time and maximize re-usability.
Lastly, the data source and target columns indicate references to data structures necessary to
manage each step of the activity in the workflow.
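The modular tasks described above can be sketched in Python. The module names EC_V1/EC_V2/EC_V3 and parameters P1, P11, P22, P2, P3 come from the table; the generic `ec` skeleton and the way it marks a variable as edited are our own simplification:

```python
# Sketch of the edit workflow: reusable modules edit one variable each, and a
# filter routes triangle units through the more articulated branch (S2, S3)
# while circle units go through S1. The branches re-join in S4 (variable V3).

def ec(variable, unit, parameter):
    # generic "edit and correct" skeleton: here it simply marks the variable
    # as edited with the given procedure parameter
    unit = dict(unit)
    unit[variable] = ("edited", parameter)
    return unit

def workflow(units):
    out = []
    for u in units:
        if u["shape"] == "circle":                 # S1: EC_V1 + EC_V2
            u = ec("V1", u, "P1")
            u = ec("V2", u, "P2")
        else:                                      # S2 then S3 for triangles
            u = ec("V1", u, "P11")
            u = ec("V2", u, "P22")
        out.append(ec("V3", u, "P3"))              # S4: both branches re-join
    return out

sample = [{"shape": "triangle"}] * 3 + [{"shape": "circle"}] * 2
edited = workflow(sample)
```

Note how EC_V1 appears in both branches with different parameters: the same component is reused, only its procedure changes.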
5.8 CORE services and reuse of components
There are three main groups of workflows in the S-DWH. One workflow updates data in the
warehouse, the second updates the in-house dissemination database and the third updates the
public dissemination database.
These three automated data flows are quite independent from each other. Flow 1 is the biggest and
most complex. It extracts raw data from the source layer, processes it in the integration layer and loads
it into the interpretation layer. In the other direction, it brings cleansed data back to the source layer for
pre-filling questionnaires, prepares sample data for collection systems, etc. Let's name this flow the
processing flow.
Flows 2 and 3 are very similar: both generate standard output to a dissemination database. One
updates data in the in-house dissemination database and the other in the public database. Both are
unidirectional flows. Let's call Flow 2 generate cube and Flow 3 publish cube. In this context a
cube is a multidimensional table, for example a .Stat or PC-Axis table.
Processing flows should be built up around input variables or groups of input variables to feed the
variable-based warehouse. Generate-cube and publish-cube flows are built around cubes, i.e. each
flow generates or publishes a cube.
There are many software tools available to build these modular flows. The S-DWH's layered
architecture itself provides the possibility to use different platforms and software in separate layers,
i.e. to re-use components already available in-house or internationally. In addition, different
software can be used inside the same layer to build up one particular flow. Problems arise when
we try to use these different modules and different data formats together.
This is where CORE services come in. If they are used to move data between S-DWH layers and
also inside the layers between different sub-tasks (e.g. edit, impute, etc.), then it is easier to use
software provided by the statistical community or to re-use self-developed components to build
flows for different purposes.
Generally CORE (COmmon Reference Environment) is an environment supporting the definition of
statistical processes and their automated execution. CORE processes are designed in a standard
way, starting from available services; specifically, process definition is provided in terms of abstract
statistical services that can be mapped to specific IT tools. CORE goes in the direction of fostering
the sharing of tools among NSIs. Indeed, a tool developed by a specific NSI can be wrapped
according to CORE principles, and thus is easily integrated within a statistical process of another
NSI. Moreover, having a single environment for the execution of all statistical processes provides a
high level of automation and a complete reproducibility of processes execution.
NSIs produce Official Statistics with very similar goals, hence several activities related to the
production of statistics are common. Nevertheless, such activities are currently carried out in an
independent way, without relying on shared solutions. Sharing a common architecture would result
in a reduction of costs due to duplicated activities and, at the same time, in an improvement of the
quality of the produced statistics, due to the adoption of standardized solutions.
The main principles underlying the CORA design are:
Platform Independence. NSIs use various platforms (e.g., hardware, operating systems, database
management systems, statistical software, etc.), hence an architecture is bound to fail if it endeavours
to impose standards at a technical level. Moreover, platform independence allows statistical
processes to be modelled at a “conceptual level”, so that they do not need to be modified when the
implementation of a service changes.
Service Orientation. The vision is that the production of statistics takes place through services
calling other services. Hence services are the modular building blocks of the architecture. By
having clear communication interfaces, services implement principles of modern software
engineering like encapsulation and modularity.
Layered Approach. According to this principle, some services are rich and are positioned at the top
of the statistical process, so, for instance a publishing service requires the output of all sorts of
services positioned earlier in the statistical process, such as collecting data and storing information.
The ambition of this model is to bridge the whole range of layers from collection to publication by
describing all layers in terms of services delivered to a higher layer, in such a way that each layer is
dependent only on the first lower layer.
For us it is very important to make some transitions and mappings between different models and
approaches. Unfortunately mapping a CORE process to a business model is not possible because
the CORE model is an information model and there is no way to map a business model onto an
information model in a direct way. The two models are about different things. They can only be
connected if this connection is in some way a part of the models.
The CORE information model was designed with such a mapping in mind. Within this model, a
statistical service is an object, and one of its attributes is a reference to its GSBPM process.
Considering the GSBPM a business model, any mapping of the CORE model onto a business model
has to go through this reference to the GSBPM.
Different services rely on their own tools, which expect different data formats, so service
interactions require conversions. Evidently, conversions are expensive. By using CORE services
for interactions, the number of conversions can be reduced noticeably.
In a general sense, an integration API makes it possible to wrap a tool in order to make it
CORE-compliant, i.e. a CORE executable service. A CORE service is indeed composed of an inner
part, which is the tool to be wrapped, and of input and output integration APIs. Such APIs
transform data between the CORE model and the tool-specific format.
As anticipated, CORE mappings are designed for classes of tools and hence integration APIs should
support the admitted transformations, e.g. CSV-to-CORE and CORE-to-CSV, Relational-to-CORE
and CORE-to-Relational, etc.
Basically, the integration API consists of a set of transformation components. Each transformation
component corresponds to a specific data format and the principal elements of their design are
specific mapping files, description files and transform operations.
In order to provide an input to a tool (the inner part of a CORE service), the Transform-from-CORE
operation is invoked. Conversely, the output of the tool is converted by the Transform-to-CORE
operation. For each single input or output file a transformation must be launched.
In this way, the reuse of components can be performed in an easy and efficient way.
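The wrapping pattern above can be sketched in Python. This is only an illustration of the idea under our own simplified representation (CORE records as a list of dicts, the tool's format as CSV text); the function names mirror the Transform-from-CORE / Transform-to-CORE operations but are not the real CORE APIs, which are defined by the ESSnet:

```python
# Rough sketch of wrapping a tool as a CORE service: the input integration API
# transforms data from the CORE model into the tool's format (CSV here), the
# inner tool runs, and the output API transforms the result back.

def transform_from_core(records):
    # CORE model (list of dicts, assumed here) -> tool-specific CSV text
    header = sorted(records[0])
    rows = [",".join(str(r[k]) for k in header) for r in records]
    return "\n".join([",".join(header)] + rows)

def transform_to_core(csv_text):
    # tool-specific CSV text -> CORE model
    header, *rows = [line.split(",") for line in csv_text.splitlines()]
    return [dict(zip(header, row)) for row in rows]

def core_service(tool, records):
    # one transformation per input file, run the tool, one per output file
    return transform_to_core(tool(transform_from_core(records)))

# the wrapped "inner" tool works purely on CSV and knows nothing about CORE
identity_tool = lambda csv_text: csv_text
```

Because every service speaks the CORE model at its boundary, any two wrapped tools can be chained without pairwise format conversions.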
6. Conclusion
Today, the prevalent model for producing statistics is the stovepipe model, but there are also the
integrated model and the warehouse approach. In this paper the integrated model and the warehouse
approach were put together. Integration can be looked at from three main viewpoints:
1. Technical integration – integrating IT platforms and software tools.
2. Process integration – integrating statistical processes like survey design, sample selection,
data processing and so on.
3. Data integration – data is stored once, but used for multiple purposes.
When we put all three integration aspects together, we get the S-DWH, which is built on
integrated technology, uses integrated processes to produce statistics and reuses data efficiently.
We also made recommendations about the data models of each layer:
• The source layer does not have a specific data model, but mapping assistance is needed when
external data is used.
• In the integration layer, for ETL functionalities and processing, data should be stored in a
generalized, normalized structure optimized for OLTP, where all data are stored in a similar data
structure independently of the domain or topic, and each fact is stored in only one place in order
to make it easier to maintain consistent data.
• Since the interpretation layer contains all collected data, processed and structured to be
optimized for analysis and as a base for the output planned by the NSI, and is specially designed
for statistical experts to support the manipulation of big, complex search operations, an
OLAP (OnLine Analytical Processing) approach with a star-schema design should be used.
• In the access layer, data marts can be built where data may be presented in data cubes with
different formats, specialized to support different tools and software.
In an S-DWH, the information is organized using a defined data model, which enables a structured
modular approach. This is because an S-DWH is a metadata-driven system, which can also be easily
extended to manage operational tasks.
The main advantage of the workflow approach given here resides in the decomposition and
articulation of complex activities by elementary modules. These modules can be reused, reducing
effort and costs in the implementation of statistical processes.
References
B. Sundgren (2010a) “The Systems Approach to Official Statistics”, Official Statistics in Honour of
Daniel Thorburn, pp. 225–260. Available at: https://sites.google.com/site/bosundgren/mylife/Thorburnbokkap18Sundgren.pdf?attredirects=0
W. Radermacher (2011) “Global consultation on the draft Guidelines on Integrated Economic
Statistics”.
UNSC (2012) “Guidelines on Integrated Economic Statistics”. Available at:
http://unstats.un.org/unsd/statcom/doc12/RD-IntegratedEcoStats.pdf
W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) “Terminology Relating
To The Implementation Of The Vision On The Production Method Of EU Statistics”.
Available at: http://ec.europa.eu/eurostat/ramon/coded_files/TERMS-INSTATISTICS_version_4-0.pdf
European Union, Communication from the Commission to the European Parliament and the
Council on the production method of EU statistics: a vision for the next decade, COM(2009)
404 final. Available at: http://eurlex.europa.eu/LexUriServ/LexUriServ.do?uri=COM:2009:0404:FIN:EN:PDF
ESSnet CORE (COmmon Reference Environment) http://www.cros-portal.eu/content/core-0