Best Practice Case Estonia v3

Country
Estonia
NSI
Statistics Estonia
Name contact person
Maia Ennok
E-mail contact person
maia.ennok@stat.ee
1. What is the current status of your S-DWH? (planned / in development / in implementation or operational)
Indicated in the questionnaire: question 6
No single coherent system which covers most of the data in the production of business statistics yet
Implementing
The S-DWH was first implemented for the Population and Housing Census in 2011.
All social studies and 6 economic statistical activities are to be implemented by the end of 2014.
General Description
2. Please give a (short) general description of the nature of your S-DWH by describing topics as:
Which problems do/did you want to solve by implementing a S-DWH?
What is the goal of the S-DWH?
What are the most important solutions desired by the NSI, to be solved by implementing a S-DWH?
What functionalities are (to be) implemented in the S-DWH?
Indicated in the questionnaire: question 1 + 1.1 resp 1 + 1.2 + remarks, 2.1 and 14
No single conceptual approach for processing data in the production of business statistics.
The current situation can best be characterized as the 'Process Model'.
We have the "Process Model" approach if more than one input per output is also possible, and if the inputs (surveys, admin data, census, observation register) and outputs (aggregate statistics, micro data, time series) are the same for the "Process Model" as for the "Data Model". We have the "Data Model" approach for the business register for statistical purposes. (Remark made)
Main motivations for starting a DWH in your business statistics systems are:
More ways to (re)use data
Data linking / integration
To implement changes in business process architecture
To improve efficiency in business statistics production
Technical efficiency
Most important challenges of the NSI
- Changes in the society are fast.
- The users of statistics in Estonia as well as on international
level require data which are reliable, comparable and
corresponding to the users’ needs.
- The demand for statistics is continuously increasing.
- For the EU Member States the compatibility of the needs of
the European Union and the country is a challenge.
- Local governments and counties need data for observing
their development and for making decisions.
- In the new decade qualitative data have a more important role than ever in society as a whole (in public administration and enterprise). That is the reason why the importance of objectivity and professionalism of statistics is increasing.
- The users of statistics need the data as soon as possible. For this reason the statisticians must choose between being up-to-date and quality; compromises must not be made with regard to the latter.
Most important technical weaknesses
- data is duplicated
- data consistency is not assured
- metadata is missing or not harmonized
- different data sets are hard or impossible to link
- data re-using is very difficult
- impact analysis is not possible
- different statistical activities use different platforms for data processing and data storing
- standardizing of processes is very time consuming
Goals from business strategy
1. Increase the coherence and comparability of statistics of
different subject areas
2. Increase the volume of analytical products
3. Respond faster to the needs of clients
4. Standardise the process of data processing
5. Separate product development and orders for information
from production process
6. Develop the department-driven organisational culture into
the institution-driven one
7. Create a modern working environment
8. Reuse of statistical data, metadata and data steps (data
processing code)
Solutions to achieve the above-mentioned goals
- Centralizing and standardizing metadata (Goals:1;3;4;6;7;8)
- Standardizing and automating processes of data processing (Goals:1;3;4;6;7;8)
- Building and using S-DWH (Goals:1;2;3;5;6;7;8)
- Centralizing technical specialists of data warehouse (Goals:1;4;6)
- Single integrated platform for metadata, data processing and data storing (creating precondition for creating unified archiving solution) (Goals:1;2;3;4;5;6;7;8)
Functionalities:
Data processing software (VAIS) has an essential role in the IT architecture. It is the central integration tool and the main data processing software. VAIS moves data from one system to another, for example from data collection to data processing and statistical registries, from processing to the data warehouse etc. VAIS is metadata driven and makes it possible to combine workflows from previously programmed data processing steps. Manual intervention (edit, impute, code etc) is also possible using the operator’s application.
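As a rough illustration of what "metadata driven" can mean here, the sketch below composes a workflow from a registry of previously programmed steps, with the step order and parameters held as plain metadata. It is not VAIS or XDTL code: the step names (`drop_missing`, `scale`), the `STEP_REGISTRY` dictionary and the workflow format are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row], dict], List[Row]]


def drop_missing(rows: List[Row], params: dict) -> List[Row]:
    """Keep only rows where the named variable has a value."""
    var = params["variable"]
    return [r for r in rows if r.get(var) is not None]


def scale(rows: List[Row], params: dict) -> List[Row]:
    """Multiply one variable by a constant factor (returns new rows)."""
    var, factor = params["variable"], params["factor"]
    return [{**r, var: r[var] * factor} for r in rows]


# Hypothetical registry of reusable, pre-programmed processing steps.
STEP_REGISTRY: Dict[str, Step] = {"drop_missing": drop_missing, "scale": scale}


def run_workflow(workflow: List[dict], rows: List[Row]) -> List[Row]:
    """Run the steps named in the workflow metadata, in the given order."""
    for step in workflow:
        rows = STEP_REGISTRY[step["step"]](rows, step.get("params", {}))
    return rows


# The workflow itself is metadata, not code.
workflow = [
    {"step": "drop_missing", "params": {"variable": "turnover"}},
    {"step": "scale", "params": {"variable": "turnover", "factor": 1000}},
]
data = [{"id": 1, "turnover": 2.5}, {"id": 2, "turnover": None}]
result = run_workflow(workflow, data)
```

In this reading, changing the workflow means editing the metadata list, while the steps themselves stay untouched and reusable.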
iMeta is the central metadata repository, based on the MMX MOF 3 meta-meta model, which makes it possible to manage several different metamodels. As metadata we manage both reference metadata and structural metadata (including process metadata, technical metadata, user roles and privileges etc).
The data warehouse (a conformed collection of datasets) consists of data processed and prepared for analysis. In the data warehouse, variables (columns) are linked with variable descriptions in iMeta. Data sets in the data warehouse are versioned. Each version is stored together with the version of the data processing package that produced the data. Data sets are linked to each other through common dimensions, and facts are unique across data sets (to avoid data duplication in different data sets).
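The versioning idea can be sketched as follows: each published data set version records the version of the processing package that produced it. This is only an assumed shape, not the actual warehouse model; `DatasetVersion`, `VersionedStore` and the sample identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DatasetVersion:
    dataset: str
    version: int
    package_version: str  # processing package version that produced the data
    rows: Tuple[tuple, ...]


class VersionedStore:
    """Append-only store: publishing never overwrites an earlier version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[DatasetVersion]] = {}

    def publish(self, dataset: str, package_version: str, rows) -> DatasetVersion:
        history = self._versions.setdefault(dataset, [])
        dv = DatasetVersion(dataset, len(history) + 1, package_version, tuple(rows))
        history.append(dv)
        return dv

    def latest(self, dataset: str) -> DatasetVersion:
        return self._versions[dataset][-1]

    def get(self, dataset: str, version: int) -> DatasetVersion:
        return self._versions[dataset][version - 1]


store = VersionedStore()
store.publish("enterprise_turnover", "pkg-1.0", [("EE001", 120)])
store.publish("enterprise_turnover", "pkg-1.1", [("EE001", 125)])
```

Keeping the package version next to the data makes it possible to tell which processing code produced any stored figure.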
3. What is the scope of your S-DWH? (ETL, Data Warehousing, metadata, integrations between metadata system and ETL)
4. How is the S-DWH organised? (activities, responsibilities, roles etc)
Need assistance of a centre of competence or best practice from some other NSIs:
- Data duplication in S-DWH
- Store everything vs easy access
- Best practice for storing population and sample
- Role of Statistical Business Registry (?)
- Implementation of standards (for example DDI)
The Statistical Data Warehouse system consists of:
Phase 2: iMeta application for designing statistical activity (specifying outputs, variables, data collection methods etc)
Phase 2 and 3: VAIS Designer for designing, building and configuring workflows
Phase 5: VAIS Operator for manual processing, XDTL Runtime for automatic metadata driven data processing (ETL) and Data Warehouse (conformed datasets) for storing processed data to be analysed in Phase 6.
Phase 6: VAIS Cube Designer for designing and building metadata enriched cubes, calculating new variables etc.
All metadata is stored in one metadata repository.
Best practice:
- Metadata repository - MMX (integrated metadata system (iMeta), data processing workflow design, build, configuration (VAIS Designer))
- Metadata driven data processing system (XDTL Runtime, VAIS Operator)
- Metadata driven cube generation (VAIS Cube Designer)
In data processing, there are the following roles:
Lead statistician – describes the metadata of a statistical activity in the metadata system, and the data processing rules and checks in the data processing system.
Workflow designer – compiles workflows in the data processing system.
Data Warehouse programmer – programs and complements reusable data processing steps.
Data Warehouse architect – develops and maintains the data model of the Data Warehouse.
Operator – manually corrects errors found during data processing.
Analyst – analyzes data in the Data Warehouse and produces statistics.
Lead statisticians, operators and analysts are structurally in the departments of the statistical domains.
The Data Warehouse architect, workflow designer and Data Warehouse programmer are situated in the data warehouse unit.
The statistical domains are responsible for the statistical methodological side and the data warehouse unit is responsible for the technical realization of the data warehouse.
5. Which problems did
you encounter so far?
Indicated in the questionnaire: questions 15, 16, 17
Main methodological barriers to implementing an integrated system
are:
Not enough expertise
Too difficult, complex
Data linking
Insufficient development time
Too high costs
Problems encountered in data integration you desire to solve with an integrated DWH-system are:
With a Data Warehouse system we could have more data available to use and reuse, and by using a metadata-driven DWH-system we could have better data quality for output data. A DWH could positively affect the response burden (diminish the administrative response burden). (remark made)
Problems encountered in process integration you desire to solve with an integrated DWH-system
With a Data Warehouse system we could have a more effective and unified process (using one framework for all data warehouse processes), and by monitoring process metadata we could have better process quality. (remark made)
- It is difficult to develop a massive system if the requirements for the system are not clear.
- Creating (agreeing on) metadata for the new system takes a lot of time.
- People's resistance to change.
- Support from the administration for changes could be stronger.
- No agreement on how the new system influences the division of work between departments.
- Technical problems with implementing freshly finished software.
Characteristics
6. Please check the characteristics as indicated in the questionnaire and correct / complete if necessary?
Indicated in the questionnaire: questions 3 + 4
There is not a one-to-one correspondence between input data and outputs. The data warehouse is passive.
The ETL part is active but data storing is passive (no changes in the data warehouse).
7. Is the data in the S-DWH unit based (micro data)?
Generally yes, but we can store aggregates also if needed.
8. Do you have weighted data in the S-DWH?
We plan to store both original and weighted data, together with the weights.
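One possible reading of storing original and weighted data together with the weights is sketched below: the original unit value and its weight are kept, and the weighted value is derived alongside them. The column names (`unit_id`, `employees`, `weight`) and the figures are hypothetical.

```python
from typing import Dict, List

records = [
    {"unit_id": "A", "employees": 10, "weight": 2.0},
    {"unit_id": "B", "employees": 4, "weight": 5.5},
]


def add_weighted(rows: List[Dict], variable: str) -> List[Dict]:
    """Return new rows that keep the original value and add a weighted one."""
    return [{**r, f"{variable}_weighted": r[variable] * r["weight"]} for r in rows]


def weighted_total(rows: List[Dict], variable: str) -> float:
    """Weighted total computed from the original values and the weights."""
    return sum(r[variable] * r["weight"] for r in rows)


stored = add_weighted(records, "employees")
estimate = weighted_total(records, "employees")
```

Because the weights are stored with the unit data, the weighted figures can be recomputed if the weights are revised.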
Statistical Registers
9. Is the Business Register (BR) or other Statistical Register (SR) managed in the S-DWH?
Indicated in questionnaire: question 10
The business register currently sits (or will sit) outside the current (or planned) DWH system.
BR is stored and managed outside of the S-DWH.
10. Which BR / SR data are stored in the S-DWH?
- stratification variables (from snapshots to Statistical Unit dimension)
- identifying variables (only in-house ID, district level address)
11. Is the S-DWH used for updating the BR?
Yes, mainly for updating stratification variables in BR.
12. Do you have a snapshot of the BR in the S-DWH?
No, snapshots of BR are stored together with BR.
13. Do you keep different versions of BR snapshot in the S-DWH?
No, snapshots of BR are versioned in BR. The Statistical Unit dimension in the S-DWH is based on the BR snapshot and is versioned according to snapshot versions.
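A minimal sketch of a Statistical Unit dimension that follows the BR snapshot versions might look like this. The snapshot labels, unit identifiers and stratification variables (`nace`, `size_class`) are hypothetical, and the real dimension may be structured differently.

```python
from typing import Dict, List, Optional

# Hypothetical BR snapshots, keyed by snapshot version; the BR itself
# stays outside the warehouse.
br_snapshots = {
    "2013Q4": [{"unit_id": "EE001", "nace": "C10", "size_class": "small"}],
    "2014Q1": [{"unit_id": "EE001", "nace": "C10", "size_class": "medium"}],
}


def build_unit_dimension(snapshots: Dict[str, List[dict]]) -> List[dict]:
    """One dimension row per unit per snapshot version."""
    dim = []
    for snapshot_version in sorted(snapshots):
        for unit in snapshots[snapshot_version]:
            dim.append({"snapshot_version": snapshot_version, **unit})
    return dim


def lookup(dim: List[dict], unit_id: str, snapshot_version: str) -> Optional[dict]:
    """Find the dimension row valid for a given snapshot version."""
    for row in dim:
        if row["unit_id"] == unit_id and row["snapshot_version"] == snapshot_version:
            return row
    return None


unit_dim = build_unit_dimension(br_snapshots)
```

Facts can then refer to the unit as it was classified in the snapshot used for the survey, even after the BR has moved on.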
Metadata
(please read Annex 1 for an explanation of the underlined Metadata concepts!)
14. What kind of statistical metadata is used?
A. What kind of reference (business) metadata is used?
We use the iMeta system for statistical metadata (both reference and structural).
Neuchatel model: statistical activity descriptions, collection methods.
B. What kind of structural (technical) metadata is used?
Codes in Neuchatel model: statistical activity, statistical activity version, variables, classifiers (incl. code lists) etc.
15. What kind of process metadata is used?
S-DWH creates and uses process metadata. XDTL metamodel: package, tasks, steps, parameter, variable, connection. Rule, transformation, condition etc.
16. What kind of technical metadata is used?
Relational Database (RDB) Metamodel (database, schema, table, column, roles etc.).
17. What kind of quality metadata is used?
A (reference): Neuchatel model: statistical activity instance description: user needs, user satisfaction, quality assurance, quality assessment, quality management, quality documentation.
B (structural): Imputation rate, unit and item response rate etc.
18. What kind of authorisation metadata is used?
A (reference): Descriptions of which statistical activity uses which administrative data. Who the statistical activity manager is.
B (structural): Role-Based Access Control Model and role-based security: privileges by statistical activity, by variables, by operations.
17. Specify the key relations between the metadata classes according to your metamodel(s)? (eg statistical activity, classifier, variable)
Relations between statistical activity and variables, variables and classifiers etc.
18. Specify the key relations between the different kinds of metadata (reference, structural, quality)
The different kinds of metadata are related to one another. Key relations: relations between variable and database column, relations between variable and privilege (role-based data access), etc.
19. How is the metadata maintenance organised and stored?
iMeta is the central metadata repository, based on the MMX MOF 3 meta-meta model, which makes it possible to describe several different metamodels. As metadata we manage reference metadata and structural metadata (including process metadata, technical metadata, user roles and privileges etc.).
The methodological unit is responsible for metadata management. Different units are responsible for metadata inputs and updates. The principle is that metadata is filled in where it is formed (process and unit).
20. What are your metadata quality requirements?
Reference metadata is just descriptive and is not used for the metadata driven system (it gives the user the context).
Structural metadata is used for the metadata driven system (for example variable codes).
Metadata is described in the metadata systems by templates to assure: no missing values, only unique codes, metadata property quality rules (dependence between different types of metadata elements) etc.
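The template checks named above (no missing values, only unique codes, dependence rules between metadata elements) could be expressed roughly as below. This is a sketch under assumptions, not the iMeta implementation: the required fields, the `classified` type and the rule that a classified variable must reference a known classifier are hypothetical examples.

```python
from typing import Dict, List, Set

REQUIRED_FIELDS = ("code", "name", "type")  # assumed template fields


def validate_variables(variables: List[Dict], known_classifiers: Set[str]) -> List[str]:
    """Return a list of human-readable rule violations (empty if valid)."""
    errors: List[str] = []
    seen_codes: Set[str] = set()
    for i, var in enumerate(variables):
        # Rule 1: no missing values in required fields.
        for field in REQUIRED_FIELDS:
            if not var.get(field):
                errors.append(f"row {i}: missing {field}")
        # Rule 2: codes must be unique.
        code = var.get("code")
        if code:
            if code in seen_codes:
                errors.append(f"row {i}: duplicate code {code}")
            seen_codes.add(code)
        # Rule 3 (dependence): a classified variable needs a known classifier.
        if var.get("type") == "classified" and var.get("classifier") not in known_classifiers:
            errors.append(f"row {i}: unknown classifier {var.get('classifier')}")
    return errors


variables = [
    {"code": "TURNOVER", "name": "Turnover", "type": "numeric"},
    {"code": "NACE", "name": "Activity", "type": "classified", "classifier": "NACE_REV2"},
    {"code": "NACE", "name": "", "type": "classified", "classifier": "UNKNOWN"},
]
errors = validate_variables(variables, {"NACE_REV2"})
```

Collecting all violations instead of stopping at the first one lets the unit that fills in the metadata fix everything in one pass.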
Annex 1: Metadata concepts [1]
Reference metadata are metadata that describe the contents and quality of the data in order
to help the user understand and evaluate them (conceptually)
Examples: Quality information on survey, register and variable levels; variable
definitions; reference dates; confidentiality information; contact information; relations
between metadata items
Structural metadata are metadata that help the user find, identify, access and utilise the data
(physically)
Examples: Classification codes; parameter lists
Statistical metadata are data about statistical data
This definition will obviously cover all kinds of documentation with some reference to
any type of statistical data and is applicable to metadata that refer to data stored in a
S-DWH as well as any other type of data store
Examples: Variable definition; register description; code list.
Process metadata are metadata that describe the expected or actual outcome of one or more
processes using evaluable and operational metrics
Examples: Operator’s manual (active, structured, reference); parameter list (active,
structured, reference); log file (passive, structured, reference/structural)
Technical metadata are metadata that describe or define the physical storage or location of
data.
Examples: Server, database, table and column names and/or identifiers; server,
directory and file names and/or identifiers
Quality metadata are any kind of metadata that contribute to the description or interpretation
of the quality of data.
Examples: Quality declarations for a survey or register (passive, free-form, reference);
documentation of methods that were used during a survey (passive, free-form,
reference); most log lists (passive, structured, reference/structural)
Authorisation metadata are administrative data that are used by programmes, systems or
subsystems to manage users’ access to data.
Examples: User lists with privileges; cross references between resources and users
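As an illustration of authorisation metadata in use, the sketch below checks role-based privileges defined per statistical activity, variable and operation, the three levels mentioned in the questionnaire answers. The role names, user names and privilege triples are hypothetical.

```python
from typing import Dict, Set, Tuple

# (statistical activity, variable, operation)
Privilege = Tuple[str, str, str]

# Hypothetical authorisation metadata: roles with privileges, users with roles.
ROLE_PRIVILEGES: Dict[str, Set[Privilege]] = {
    "operator": {
        ("labour_survey", "wage", "read"),
        ("labour_survey", "wage", "edit"),
    },
    "analyst": {("labour_survey", "wage", "read")},
}
USER_ROLES: Dict[str, Set[str]] = {"mari": {"analyst"}, "jaan": {"operator"}}


def is_allowed(user: str, activity: str, variable: str, operation: str) -> bool:
    """True if any of the user's roles grants the requested privilege."""
    return any(
        (activity, variable, operation) in ROLE_PRIVILEGES.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

For example, `is_allowed("mari", "labour_survey", "wage", "edit")` is false here because the analyst role only carries the read privilege.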
[1] Metadata Framework for Statistical Data Warehousing v0.9; ESSnet on Data Warehousing