2.5 Confidentiality aspects within a SDWH_version 1.0 (final).

Title: Confidentiality aspects of combining data within a Statistical Data Warehouse
WP: 2
Deliverable: 2.5
Version: 1.0 (Final)
Date: 31-8-2013
Author: Pete Brodie
NSI: ONS (UK)
ESSNET ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
Index
1. Introduction
2. Combining data sources
3. Previous ESSnet work
4. Information gathered at Tallinn Workshop
5. Outstanding issues to be investigated
6. Conclusion and recommendations
7. Reference
Annex 1
1. Introduction
The main objective for the ESSnet Project on Micro Data Linking and Data Warehousing
(ESSnet DWH) is to provide assistance in the development of more integrated databases and
data production systems for business statistics in ESS Member States.
The outcome from Phase 1 (SGA-1) of the ESSnet DWH showed that the design and
implementation of a Statistical Data Warehouse is “work in progress” for most surveyed
National Statistics Institutes (NSIs), or that their systems are as yet only partially integrated.
Flexibility in output, data linking, and efficiency in process are the main motivations for
implementing a Statistical Data Warehouse (S-DWH).
There are many perceived benefits of implementing a data warehouse approach. These
include decreased cost of data access and analysis, using a common data model, as well as
common tools, and faster and more automated data management and dissemination.
So far, few NSIs have implemented a statistical data warehouse encompassing all phases of
the Generic Statistical Business Process Model (GSBPM). One of the key challenges
identified for Phase 2 (SGA-2) of the ESSnet DWH was identifying and matching the
different statistical requirements of the S-DWH. Work Package 2 addresses this and other
challenges in moving to a Statistical Data Warehouse from the methodological standpoint.
This report is one of the outputs of Work Package 2, deliverable 2.5:
Confidentiality aspects of combining data (survey / administrative) including
options for various hierarchies in a Statistical Data Warehouse.
This report outlines the options for understanding and dealing with the confidentiality
aspects of combining and re-using data from and within a statistical data warehouse.
At the start of SGA-2, the scope of the confidentiality requirements first needed to be tightly
specified and outlined. In the first two workshops it became clear that Member States had a
wide variety of interpretations of this topic. It was therefore decided to restrict this report to
highlighting issues raised in other ESSnet projects, developing ideas particularly related to
the S-DWH context, drawing high-level conclusions and recommendations, and
recommending areas for further work (section 5).
2. Combining data sources
In order to further improve and optimise statistical production, NSIs are searching for ways
to make optimal use of all available data sources, existing and new. One of the major
challenges in this process of change is the integration of the re-use of statistical data that is
already available in statistical production processes.
More and more NSIs are considering the option of re-using collected data for multiple
outputs - so not only for the purpose they are collected for - and of combining this collected
data with other sources of data such as administrative data sources for producing statistical
outputs.
The potential advantages of using administrative sources and re-using data include:
• a reduction in data collection and statistical production costs;
• the possibility of producing estimates at a very detailed level thanks to almost
complete coverage of the population;
• re-use of already existing data to reduce respondent burden.
But the implementation of an S-DWH also comes with some methodological challenges.
One of the major risks is the increased potential for compromising the confidentiality of the
data. When publishing trusted, high quality statistical outputs from an S-DWH, these outputs
need to be as detailed as possible. This objective conflicts with the obligation NSIs have to
protect the confidentiality of the information provided by respondents. The possibility of
producing multiple outputs by linking and combining various data sources generally
increases the size of the disclosure problem. An S-DWH offers many different outputs,
covering a wide range of topics, for many different users.
NSIs need to ensure that the statistical data in these different outputs can be published
without giving away confidential information on specific individuals or entities. This variety
in possible outputs requires different, potentially even combined, approaches to
disclosure control, with a mixture of different tools. A thorough study of all possible
methods to protect confidentiality, together with the definition and implementation of a
confidentiality strategy, is therefore an absolute precondition when applying a warehouse approach.
3. Previous ESSnet work
Deliverable 2.7 of the ESSnet provided a mapping of links between existing and on-going
ESSnet projects, and a number of these links pertain to the area of data
confidentiality. In this report we elaborate on the two main pieces of work that have been
undertaken.
Substantial work was completed during the ESSnet on Statistical Disclosure Control and a
comprehensive handbook was produced in January 2010. The handbook can be found at the
following link: http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf. It aims to provide NSIs with
technical guidance on statistical disclosure control (SDC), in particular on how to approach
the problem of balancing the need to provide users with statistical outputs against the need
to protect the confidentiality of survey respondents. The main challenge for NSIs is to
optimise SDC methods and solutions so as to minimise disclosure risks on the one hand and
maximise data usability on the other.
The handbook provides guidance for all types of statistical outputs. From an S-DWH
perspective, its discussion of dynamic databases is particularly relevant: successive
statistical queries for aggregate information could be combined with earlier results,
increasing the disclosure risk. This is closely aligned with the SDC
issues of a Statistical Data Warehouse.
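The risk from successive queries can be illustrated with a minimal sketch. The records, region names and the `turnover_total` helper below are hypothetical and are not taken from the handbook; the sketch only shows that two aggregates which each look safe in isolation can reveal a single respondent's value when one is subtracted from the other.

```python
# Hypothetical micro data: (business id, region, turnover). Not real data.
RECORDS = [
    ("b1", "north", 120.0),
    ("b2", "north", 80.0),
    ("b3", "north", 45.0),
    ("b4", "south", 300.0),
    ("b5", "south", 60.0),
    ("b6", "south", 75.0),
]


def turnover_total(records, exclude_ids=()):
    """Return (respondent count, total turnover), optionally excluding ids."""
    kept = [r for r in records if r[0] not in exclude_ids]
    return len(kept), sum(r[2] for r in kept)


# Query 1: total over all six businesses (passes an assumed count threshold).
n_all, total_all = turnover_total(RECORDS)

# Query 2, issued later: the same total without business b4. With five
# respondents it also passes the threshold when checked in isolation.
n_rest, total_rest = turnover_total(RECORDS, exclude_ids={"b4"})

# Differencing the two released aggregates recovers b4's exact turnover.
disclosed = total_all - total_rest
print(n_all, n_rest, disclosed)
```

A check that only looks at each query on its own would accept both; the disclosure only becomes visible when earlier releases are taken into account.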
There is also substantial discussion relating to the release of micro data, which is the newest
sub-discipline of SDC. Chapters 3, 4 and 5 of the handbook examine the separate problems
of micro data, magnitude tabular data and frequency tables respectively. Each of these
chapters also includes some discussion of the available software.
Chapter 6 focuses on remote access issues, which are likely to have implications for any
pan-European Statistical Data Warehouse; in more detail, section 6.6 discusses the
confidentiality protection of analyses that are produced.
The handbook is well supplemented by information produced by Work Package 1 of the
ESSnet on Data Integration with a report outlining the “State of the art on Statistical
Methodologies for Data Integration” (the report can be found at the following link
http://www.cros-portal.eu/wp1-state-art), which has a chapter dedicated to a literature
review update on data integration methods in Statistical Disclosure Control (Chapter 4). Here
the two main areas covered are those of contingency tables and of micro data
dissemination. Paragraph 4.2 focusses on statistical disclosure control and record linkage.
The main conclusion is a strong recommendation that a system of disclosure risk
measures be set up to monitor the data dissemination processes, in order to minimise the
risk of compromising data confidentiality. There is also a comprehensive reference list within
this chapter.
Finally, the main deliverable of the Memobust project will be a handbook on business survey
design, and it is intended that two modules will cover Statistical Disclosure Control. More
information is available in the reference section (section 7) below.
4. Information gathered at Tallinn Workshop
To further clarify the S-DWH context in relation to data security and confidentiality,
Statistics Lithuania carried out a small case study looking into three perspectives:
• The dimensions of data security and confidentiality
  - The physical aspect
  - The legal aspects
  - The technical elements (hardware, software)
• Possible methods for micro data protection
  - Geographical thresholds
  - Top and bottom limits for variables
  - Releasing only samples of data
  - Recoding to broad categories
  - Deletion of especially sensitive data
  - Micro aggregation
• Possible methods for tabular data protection
  - Geographical thresholds
  - Cells need more than three respondents
  - Suppression below thresholds
  - Recoding to broad categories
  - Rounding of aggregates
  - Dominance rules
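Two of the tabular rules listed above, the minimum number of respondents per cell and the dominance rule, can be sketched as a primary suppression check. The function name `is_sensitive`, the example cells and the parameter values (more than three respondents, as listed above, and an (n, k) dominance rule with n = 1 and k = 75%) are illustrative assumptions, not rules prescribed by the case study.

```python
def is_sensitive(contributions, min_respondents=4, n=1, k=0.75):
    """Flag a magnitude-table cell for primary suppression.

    contributions: non-negative values of the individual respondents in the cell.
    min_respondents: threshold rule; cells with fewer respondents are sensitive.
    n, k: (n, k) dominance rule; the cell is sensitive if the n largest
          contributors account for more than share k of the cell total.
    """
    if len(contributions) < min_respondents:
        return True
    total = sum(contributions)
    if total == 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n / total > k


# Hypothetical cells of a turnover table, keyed by industry.
cells = {
    "retail": [40, 35, 30, 25, 20],   # several similar-sized contributors
    "mining": [900, 50, 30, 20],      # one dominant contributor
    "fishing": [60, 55],              # too few respondents
}
flags = {name: is_sensitive(values) for name, values in cells.items()}
print(flags)
```

In this sketch "mining" passes the threshold rule but fails the dominance rule, which is why the two rules are usually applied together.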
As indicated in the introduction above, the primary focus of deliverable 2.5 was on
identifying specific ideas for practical implementation. At the workshop in Tallinn in March
2013, four questions were asked to establish the requirements in this area (see annex 1).
The main feedback received from delegates at the Tallinn workshop was that a summary of
the state of play in statistical disclosure control would be useful.
Regarding the desired focus, the answers were less clear: on the one hand there was a
request to concentrate on disclosure control for the publication of aggregates, while on the
other hand other important aspects relating to micro data access were also mentioned.
Amongst the delegates who were aware of their NSI’s processes for dealing with statistical
disclosure control, there was a roughly equal split between those using commercial software,
open source software and internally developed software, though a small number of
countries were using manual methods.
Finally, it was overwhelmingly accepted that this ESSnet should concentrate on addressing
the issues of SDC where the Statistical Data Warehouse aspects are unique. In the next
section we list those aspects that were found to satisfy this criterion. It is anticipated
that these will be further investigated by evaluating particular cases that can be
implemented as part of the operation of the Centre of Competence on Data Warehousing.
5. Outstanding issues to be investigated
Although within the ESS there is great interest in a data warehouse approach in statistical
production, up to now only a few practical implementations of statistical data warehouses
have been realised. There is now a great opportunity to investigate and evaluate the issues
on confidentiality and disclosure highlighted within this ESSnet,
as part of the Centre of Competence on Data Warehousing.
These investigations should cover:
a. Consistency of multiple outputs. Since in a S-DWH it is likely that a greater number of
outputs are being produced from sets of - combined and/or linked - micro data, it is
seen as useful to examine how secondary suppression may be best applied. Here the
performance of stepwise suppression could be compared with the efficiency of
performing suppression only once all outputs are known.
b. Timing aspects of suppression. Related to the first point, it may also be necessary to
over-suppress early outputs from a data warehouse to allow flexibility for later
outputs. So another issue to investigate could be the effect of over-suppression, if
possible compared with the two methods mentioned under issue a.
c. It may be that with the potentially larger amount of micro data some methods like
record swapping may become easier. An evaluation of different methods when
considering a data warehouse would be useful. For example controlled rounding may
prove problematic when not all outputs are known.
d. Considering the size of the disclosure problem when dealing with multiple outputs
and linked data, there may be a need to produce a new specific tool for performing
disclosure control. This would best be investigated in conjunction with the ESSnet on
Free and Open Source Software (FOSS).
e. With a view to a possible development of micro data access by means of a
pan-European warehouse solution, the effects and consequences of such an approach
need to be investigated, looking more specifically at legal, IT and SDC aspects. Given
that this is an envisaged long-term development, this issue has no direct priority.
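Issues a and b can be made concrete with a small sketch of how a suppression pattern that is safe for one output can be undone by a later output. The cell names, values and the `recoverable` helper are hypothetical; the helper only detects exact recovery by subtraction, whereas a fuller audit would typically also consider bounds on the suppressed values.

```python
def recoverable(published_cells, published_sums):
    """Return values of unpublished cells that follow from published figures.

    published_cells: dict cell -> value for cells released individually.
    published_sums: list of (set of cells, total) released as aggregates.
    A cell is recovered when an aggregate has exactly one unknown member.
    """
    known = dict(published_cells)
    found = {}
    changed = True
    while changed:
        changed = False
        for members, total in published_sums:
            unknown = [c for c in members if c not in known]
            if len(unknown) == 1:
                value = total - sum(known[c] for c in members if c in known)
                known[unknown[0]] = value
                found[unknown[0]] = value
                changed = True
    return found


# Output 1: cell B is sensitive, so B is suppressed and C is suppressed as a
# secondary cell; A, D and the grand total are published.
cells_out = {"A": 50, "D": 16}
output_1 = [({"A", "B", "C", "D"}, 100)]
after_first = recoverable(cells_out, output_1)

# Output 2, designed later and independently: a sub-total over C and D.
output_2 = [({"C", "D"}, 46)]
after_both = recoverable(cells_out, output_1 + output_2)

print(after_first, after_both)
```

Taken alone, the first output leaves two unknown cells and nothing can be recovered. Once the second output is released, C follows from the sub-total and B then follows from the grand total, which is the situation that suppressing only once all outputs are known, or over-suppressing early outputs, is meant to avoid.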
6. Conclusion and recommendations
A great deal of work has been done in focussing the scope of the requirements for statistical
disclosure control. This has pulled together existing work and highlighted the areas which
would benefit most from practical investigation. As there are no real practical S-DWH
implementations yet, the ESSnet has so far been unable to make clear and specific
recommendations on SDC or confidentiality issues related to an S-DWH. Looking at the
S-DWH context from a broader, more general perspective, it is clear that data confidentiality
is an issue that needs to be thoroughly investigated, and that a study of all possible methods
to protect confidentiality, together with the definition and implementation of a
confidentiality strategy, is an absolute precondition when applying a data warehouse approach.
Therefore, when setting up or implementing an S-DWH, it is recommended to set up a clear
strategy regarding data confidentiality:
• Start by defining a clear data dissemination strategy based on a risk
management approach;
• Identify possible risks of compromising the confidentiality of the data,
broken down by each specific output process;
• Evaluate whether the SDC methods that are already in use still cover the new S-DWH
output processes;
• Match existing methods to each identified risk, and set up and integrate the control
mechanism(s) in the production processes;
• Set up a system for monitoring disclosure/confidentiality risks.
Furthermore, the ESSnet work has revealed the areas where existing work across Europe has
been inadequate for the particular circumstances surrounding a statistical data
warehouse, and has defined a framework for a natural practical extension via the Centre of
Competence on Data Warehousing.
7. Reference
Willenborg L., Scholtus S., 2011, “Workplan of the Memobust project”,
available to registered users from
http://www.essnet-portal.eu/sites/default/files/139/memobust_workplan_del_1_1.pdf
Annex 1
Questions and answers from the Tallinn workshop
• Should we attempt to summarise all the guidance on SDC from other ESSnets
(beyond what was produced for deliverable 2.7)?
  - 17 answers ‘YES’
  - 5 answers ‘NO’
  - 20 answers ‘DON’T KNOW’
• Should we attempt to summarise potential solutions for SDC,
only where the S-DWH aspects are unique?
  - 25 answers ‘YES’
  - 3 answers ‘NO’
  - 13 answers ‘DON’T KNOW’
• Thinking about your own NSI, what tools/methods are used
for ensuring confidentiality?
  - 8 answers ‘Use commercial software’
  - 10 answers ‘Use open source software’
  - 11 answers ‘Use internally developed tools/software’
  - 5 answers ‘Only manual checking’
  - 8 answers ‘Don’t know’
• Should we offer guidance on confidentiality of aggregates only?
  - 17 answers ‘Aggregates only’
  - 9 answers ‘Other important aspects’
  - 13 answers ‘Don’t know’