ESSNET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Title: Confidentiality aspects of combining data within a Statistical Data Warehouse
WP: 2
Deliverable: 2.5
Version: 1.0 (Final)
Date: 31-8-2013
Author: Pete Brodie
NSI: ONS (UK)

Index

1. Introduction
2. Combining data sources
3. Previous ESSnet work
4. Information gathered at Tallinn Workshop
5. Outstanding issues to be investigated
6. Conclusion and recommendations
7. Reference
Annex 1

1. Introduction

The main objective for the ESSnet Project on Micro Data Linking and Data Warehousing (ESSnet DWH) is to provide assistance in the development of more integrated databases and data production systems for business statistics in ESS Member States. The outcome from Phase 1 (SGA-1) of the ESSnet DWH showed that the design and implementation of a Statistical Data Warehouse is “work in progress” for most surveyed National Statistics Institutes (NSIs), or that their system is yet only partially integrated.
Flexibility in output, data linking, and efficiency in process are the main motivations for implementing a Statistical Data Warehouse (S-DWH). There are many perceived benefits of implementing a data warehouse approach. These include decreased cost of data access and analysis, the use of a common data model and common tools, and faster and more automated data management and dissemination. So far, few NSIs have implemented a statistical data warehouse encompassing all phases of the Generic Statistical Business Process Model (GSBPM).

One of the key challenges identified for Phase 2 (SGA-2) of the ESSnet DWH was identifying and matching the different statistical requirements of the S-DWH. Work Package 2 addresses this and other challenges in moving to a Statistical Data Warehouse from the methodological standpoint.

This report is one of the outputs of Work Package 2, deliverable 2.5: Confidentiality aspects of combining data (survey / administrative) including options for various hierarchies in a Statistical Data Warehouse. It outlines the options for understanding and dealing with the confidentiality aspects of combining and re-using data from and within a statistical data warehouse.

At the start of SGA-2, the scope of the confidentiality requirements first needed to be tightly specified and outlined. In the first two workshops it became clear that Member States had a wide variety of interpretations of this topic. It was therefore decided to restrict this report to highlighting issues raised in other ESSnet projects, developing ideas specific to the S-DWH context, drawing high-level conclusions and recommendations, and recommending areas for further work (section 5).

2. Combining data sources

In order to further improve and optimise statistical production, NSIs are searching for ways to make optimal use of all available data sources, existing and new.
One of the major challenges in this process of change is integrating the re-use of statistical data that is already available in statistical production processes. More and more NSIs are considering the option of re-using collected data for multiple outputs (not only for the purpose for which they were collected) and of combining these data with other sources, such as administrative data sources, to produce statistical outputs.

The potential advantages of using administrative sources and re-using data include:
• a reduction in data collection and statistical production costs;
• the possibility of producing estimates at a very detailed level, thanks to almost complete coverage of the population;
• re-use of already existing data to reduce respondent burden.

But the implementation of an S-DWH also comes with some methodological challenges. One of the major risks is the increased potential for compromising the confidentiality of the data. When publishing trusted, high quality statistical outputs from an S-DWH, these outputs need to be as detailed as possible. This objective conflicts with the obligation NSIs have to protect the confidentiality of the information provided by respondents. The possibility of producing multiple outputs by linking and combining various data sources generally increases the size of the disclosure problem.

An S-DWH offers many different outputs, covering a wide range of topics, for many different users. NSIs need to ensure that the statistical data in these different outputs can be published without giving away confidential information on specific individuals or entities. This variety in possible outputs requires different, potentially even combined, approaches to disclosure control, with a mixture of different tools. A thorough study of all possible methods to protect confidentiality, together with defining and implementing a confidentiality strategy, is therefore an absolute precondition when applying a warehouse approach.

3. Previous ESSnet work

Deliverable 2.7 of the ESSnet provided a mapping of links between existing and on-going ESSnet projects, and a number of these links pertain to the area of data confidentiality. In this report we elaborate on the two main pieces of work that have been undertaken.

Substantial work was completed during the ESSnet on Statistical Disclosure Control, and a comprehensive handbook was produced in January 2010 (http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf). This handbook aims to provide technical guidance for NSIs on statistical disclosure control (SDC), and in particular on how to balance the need to provide users with statistical outputs against the need to protect the confidentiality of survey respondents. The main challenge for NSIs is to optimise SDC methods and solutions so as to minimise disclosure risks on the one hand and maximise data usability on the other.

The handbook provides guidance for all types of statistical outputs. From an S-DWH perspective, the handbook's discussion of dynamic databases is particularly relevant: successive statistical queries for aggregate information could be combined with earlier data to increase the disclosure risk. This is very much aligned with the SDC issues of a Statistical Data Warehouse. There is also substantial discussion relating to the release of micro data, which is the newest sub-discipline of SDC. Chapters 3, 4 and 5 of the handbook examine the separate problems of micro data, magnitude tabular data and frequency tables respectively. Each of these chapters also includes some discussion of the available software. Chapter 6 focuses on remote access issues, which are likely to have implications for any pan-European Statistical Data Warehouse; in more detail, section 6.6 discusses the confidentiality protection of analyses that are produced.
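The dynamic-database risk described above can be illustrated with a small differencing sketch. The Python example below is a minimal illustration and is not taken from the handbook: the enterprise records, the turnover figures, the `MIN_RESPONDENTS` threshold and the `query_total` helper are all hypothetical. It shows how two aggregate queries that each pass a simple minimum-respondent check can be subtracted to reveal the value of a single respondent.

```python
# Hypothetical micro data: turnover per enterprise (all values invented).
records = [
    {"id": "A", "region": "North", "size": "small", "turnover": 120},
    {"id": "B", "region": "North", "size": "small", "turnover": 95},
    {"id": "C", "region": "North", "size": "small", "turnover": 110},
    {"id": "D", "region": "North", "size": "large", "turnover": 900},
]

MIN_RESPONDENTS = 3  # assumed threshold rule, applied to each query separately


def query_total(rows, **conditions):
    """Return (total, count) for the matching rows, or None if the cell
    has fewer than MIN_RESPONDENTS contributors (query refused)."""
    matched = [r for r in rows if all(r[k] == v for k, v in conditions.items())]
    if len(matched) < MIN_RESPONDENTS:
        return None
    return sum(r["turnover"] for r in matched), len(matched)


# A direct query on the single large enterprise is refused.
direct = query_total(records, region="North", size="large")

# Two broader queries are each allowed on their own.
first = query_total(records, region="North")                  # 4 respondents
second = query_total(records, region="North", size="small")   # 3 respondents

# Differencing the two released totals recovers enterprise D's turnover.
disclosed = first[0] - second[0]

print("direct query result:", direct)
print("value recovered by differencing:", disclosed)
```

Each query passes the threshold check on its own, which is why outputs released from a dynamic database need to be assessed jointly rather than one at a time.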
The handbook is well supplemented by information produced by Work Package 1 of the ESSnet on Data Integration, in a report outlining the “State of the art on Statistical Methodologies for Data Integration” (http://www.cros-portal.eu/wp1-state-art). This report has a chapter dedicated to a literature review update on data integration methods in Statistical Disclosure Control (Chapter 4). The two main areas covered there are contingency tables and micro data dissemination. Paragraph 4.2 focusses on statistical disclosure control and record linkage. The main conclusion is a strong recommendation that a system of disclosure risk measures be set up to monitor the data dissemination processes, in order to minimise the risk of compromising data confidentiality. There is also a comprehensive reference list within this chapter.

Finally, the main deliverable of the Memobust project will be a handbook on business survey design, and it is intended that two modules will cover Statistical Disclosure Control. More information is available in the reference section 7 below.

4. Information gathered at Tallinn Workshop

To clarify the S-DWH context in relation to data security and confidentiality, Statistics Lithuania carried out a small case study looking into three perspectives:

The dimensions of data security and confidentiality
• The physical aspect
• The legal aspects
• The technical elements (hardware, software)

Possible methods for micro data protection
• Geographical thresholds
• Top and bottom limits for variables
• Releasing only samples of data
• Recoding to broad categories
• Deletion of especially sensitive data
• Micro aggregation

Possible methods for tabular data protection
• Geographical thresholds
• Cells need more than three respondents
• Suppression below thresholds
• Recoding to broad categories
• Rounding of aggregates
• Dominance rules

As indicated in the introduction above, the primary focus of deliverable 2.5 was on identifying specific ideas for practical implementation. At the workshop in Tallinn in March 2013, four questions were asked to establish the requirements in this area (see annex 1).

The main feedback received from delegates at the Tallinn workshop was that a summary of the state of play with statistical disclosure control would be useful. Regarding the desired focus, the answers were less clear: on the one hand there was a request to concentrate on disclosure control for the publication of aggregates, while on the other hand delegates also mentioned other important aspects relating to micro data access. Amongst the delegates who were aware of their NSI’s processes for dealing with statistical disclosure control, there was a roughly equal split between those using commercial software, open source software and internally developed software (8, 10 and 11 answers respectively), though a small number of countries were using manual methods. Finally, it was overwhelmingly accepted that this ESSnet should concentrate on addressing the issues of SDC where the Statistical Data Warehouse aspects were unique.
In the next section we list those aspects that were found to satisfy this criterion. It is anticipated that these should be further investigated by evaluating particular instances that can be implemented as part of the operation of the Centre of Competence on Data Warehousing.

5. Outstanding issues to be investigated

Although within the ESS there is great interest in a data warehouse approach to statistical production, only a few practical implementations of statistical data warehouses have been realised up to now. There is now a great opportunity to investigate and evaluate the confidentiality and disclosure issues highlighted within this ESSnet as part of the Centre of Competence on Data Warehousing. These investigations should cover:

a. Consistency of multiple outputs. Since in an S-DWH a greater number of outputs is likely to be produced from sets of combined and/or linked micro data, it is seen as useful to examine how secondary suppression may best be applied. Here the performance of stepwise suppression could be compared with the efficiency of performing suppression only once all outputs are known.

b. Timing aspects of suppression. Related to the first point, it can also be necessary to over-suppress early outputs from a data warehouse to allow flexibility for later outputs. Another issue to investigate could therefore be the effect of over-suppression, if possible compared with the two methods mentioned under issue a.

c. With the potentially larger amount of micro data, some methods such as record swapping may become easier. An evaluation of different methods in the context of a data warehouse would be useful. For example, controlled rounding may prove problematic when not all outputs are known.

d. Considering the size of the disclosure problem when dealing with multiple outputs and linked data, there may be a need to produce a new, specific tool for performing disclosure control.
This would best be investigated in conjunction with the ESSnet on Free and Open Source Software (FOSS).

e. In line with a possible development of micro data access by means of a pan-European warehouse solution, the effects and consequences of such an approach need to be investigated, looking specifically at the legal, IT and SDC aspects. Given that this is an envisaged long-term development, this issue has no immediate priority.

6. Conclusion and recommendations

A great deal of work has been done in focussing the scope of the requirements for statistical disclosure control. This has pulled together existing work and highlighted the areas which would benefit most from practical investigation.

As there are no real practical S-DWH implementations yet, the ESSnet could not so far make clear and specific recommendations on SDC or confidentiality issues related to an S-DWH. Looking at the S-DWH from a broader, more general perspective, it is clear that data confidentiality is an issue that needs to be thoroughly investigated, and that a study of all possible methods to protect confidentiality, as well as defining and implementing a confidentiality strategy, is an absolute precondition when applying a data warehouse approach.

Therefore, when setting up or implementing an S-DWH, it is recommended to set up a clear strategy regarding data confidentiality:
• Start by defining a clear data dissemination strategy, based on a risk management approach;
• Identify possible risks of compromising the confidentiality of the data, separately for each specific output process;
• Evaluate whether the SDC methods already in use still cover the new S-DWH output processes;
• Match existing methods to each identified risk, and set up and integrate the control mechanism(s) in the production processes;
• Set up a system for monitoring disclosure/confidentiality risks.
Furthermore, the ESSnet work has revealed the areas where existing work across Europe has been inadequate for the particular circumstances surrounding a statistical data warehouse, and has defined a framework for a natural practical extension via the Centre of Competence on Data Warehousing.

7. Reference

Willenborg L., Scholtus S., 2011, “Workplan of the Memobust project”, available to registered users from http://www.essnet-portal.eu/sites/default/files/139/memobust_workplan_del_1_1.pdf

Annex 1

Questions and answers from the Tallinn workshop

Should we attempt to summarise all the guidance on SDC from other ESSnets (beyond what was produced for deliverable 2.7)?
- 17 answers ‘YES’
- 5 answers ‘NO’
- 20 answers ‘DON’T KNOW’

Should we attempt to summarise potential solutions for SDC, only where the S-DWH aspects are unique?
- 25 answers ‘YES’
- 3 answers ‘NO’
- 13 answers ‘DON’T KNOW’

Thinking about your own NSI, what tools/methods are used for ensuring confidentiality?
- 8 answers ‘Use commercial software’
- 10 answers ‘Use open source software’
- 11 answers ‘Use internally developed tools/software’
- 5 answers ‘Only manual checking’
- 8 answers ‘Don’t know’

Should we offer guidance on confidentiality of aggregates only?
- 17 answers ‘Aggregates only’
- 9 answers ‘Other important aspects’
- 13 answers ‘Don’t know’