The new Bank of Italy Remote access to micro Data (BIRD)
G. Bruno, L. D’Aurizio, R. Tartaglia-Polcini
Q2008 – Rome, July 10, 2008

Motivation
• Information release and data protection as competing goals
• The risk-utility tradeoff:
  • risk of data disclosure
  • utility of widespread availability of data for research

Motivation
GOALS (UTILITY):
• satisfy growing demand from external researchers for business data
• improve the accountability of the Central Bank as economic research centre
• provide a service to the scientific community
CONSTRAINTS (RISK):
• Data confidentiality must be guaranteed:
  • as a prerequisite for respondents’ collaboration
  • to foster quality of the data provided
  • is required by the law
• Public Use File (PUF) with individual data judged unfeasible: anonymisation very problematic with business data

Motivation
SYNTHETIC DATA LIMITATIONS:
• Identity disclosure impossible in principle, but, particularly with extreme values, it may be possible to re-identify a source record
• Attribute disclosure may happen
• Ample literature on data confounding and synthetic data (Duncan & Lambert 1989; Rubin 1993; Little 1993; Fuller 1993; Fienberg et al. 1996; Kennickell 1997; Abowd & Woodcock 2001; Reiter 2002; Raghunathan et al. 2003; etc.)

Choices
• Data confounding: create a PUF containing perturbed data to prevent identification of individual information. Downside: results (esp. regressions) may heavily depend on the confounding technique adopted (controversial literature)
• Data lab (à la Istat: ADELE): the researcher has to go to the lab in person.
• Remote processing, using the internet, without direct access to individual data (à la Luxembourg Income Study: LISSY)

Other remote processing systems
• Luxembourg Income Study (LISSY, 1987)
• Statistics Canada (2001)
• Statistics Denmark (2001)
• Statistics Netherlands (2002)
• Australian Bureau of Statistics (2003)
• Statistics Sweden (2003)
• US Federal Agencies: NCHS (1997), NCES (1998), Census Bureau (2003)

The solution adopted at the Bank of Italy: BIRD
• Modeled on LISSY
• Low setup cost
• Easily customisable
• Supports multiple packages
• Maximum accessibility for users
• Multi-level control (user/group, dataset, keyword)
• Automatic and manual checks & review

How BIRD works
USER ELIGIBILITY CRITERIA
• Researcher status (not necessarily academic) proved by a presentation letter
• Identification via valid personal id
• Detailed information via form to be filled in

How BIRD works
USER PROFILE CREATION
• The researcher indicates an e-mail address which will be recognised by the system.
• The researcher indicates her own user name and password
• User-chosen parameters are input in the user database
• Access profile is created

How BIRD works
SUBMISSION PROCEDURE
• Communication with the processing environment via email
• Send a message containing user authentication info + statements to be submitted
• Input message is parsed and checks are performed
• If no error/security violation, submit statements
• Output is parsed (automatically / manually)
• If no security violation, forward to the user via email

Confidentiality safeguards
• User level
• Data level
• Processing level

Confidentiality safeguards
User level:
• Users are identified, qualified and registered
• Registered mailboxes are whitelisted; ordinarily only one mailbox per user
• Outputs are monitored and archived
• Deontological code, privacy law, specific penalties
Sanctions
• Forbidden submissions or outputs are deleted
• Grant of access for users trying to perform forbidden commands may be revoked
• Any other sanctions or penalties required by the law where applicable

Confidentiality safeguards
Data level:
• Extreme data are censored (Winsorised)
• Identifying variables (ids, names, addresses) are expunged from the datasets used for remote processing
• Stratification variables are collapsed (geographical areas and not regions; Ateco aggregations and not codes)

Confidentiality safeguards
Processing level:
• Formally forbidden to display individual data
• Keyword parser implemented with ceiling, blacklist and graylist
• Particularly long and/or complex programmes are always reviewed manually
• In the learning stage, all submissions are reviewed manually

How the parser works

check type       check performed                                 action if failed on INPUT   action if failed on OUTPUT
authentication   checking user authentication data               job cancelled               n/a
blacklist        parsing text for specific words and sequences   job cancelled               n/a
length           checking the length of text                     n/a                         soft ceiling: manual review; hard ceiling: job cancelled
graylist (*)     parsing text for specific words and sequences   manual review               manual review

(*) This feature will be available in the next release of the system.

Datasets available
STANDARD DATASET: quantitative data for the biggest firms (in terms of workforce) are censored (Winsorised)
COMPLETE DATASET: no data censoring
Id variables are expunged from both datasets

Datasets available
Aggravated procedure for accessing the complete dataset:
• Access must be explicitly requested; a special profile is created
• Review is exclusively manual
• Wait times are longer than average, as the time allocated to manual review on the complete dataset is limited

Documentation on the website
• Application form
• Instruction manual
• Dataset description
• Examples of submissions in the supported packages (SAS, Stata)
• Methodological notes on the survey

Support
1. Documentation available on the Bank of Italy website (manuals, variables description, questionnaires) http://www.bancaditalia.it/statistiche/indcamp/indimpser/bird
2. Mailbox for queries and assistance: bird_assist@bancaditalia.it

An example
Program submitted by the user in Stata. Authentication is in the first four lines.

An example
Output forwarded after review

Usage of the system in the first weeks
• System started officially on Mar 13, 2008
• Beta users from Feb 1, 2008
• 8 registered users
• 172 submissions in 21 weeks

Usage of the system in the first weeks
[Figure: BIRD, # of weekly submissions, from Feb 1, 2008 (weeks w1 to w21)]

Future developments
• Web submission available alongside e-mail submission
• Other datasets will be made available in the future (e.g. data from the Business Outlook Survey)
• Open source packages processing (e.g. R)
• Merging with external datasets provided by the user, for special projects, on a discretionary basis, under an aggravated procedure and higher security levels.
• Creation of closed groups with special authorisation levels for specific projects

Thank you for your attention
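
Illustration of the parser checks
The checks listed under "How the parser works" could be sketched roughly as follows. This is only a minimal illustration, not the BIRD implementation: the keyword lists, the ceiling values, the use of character counts as the length measure, and all function and class names are hypothetical assumptions, since the slides give none of them.

```python
import re
from enum import Enum


class Action(Enum):
    PASS = "pass"
    MANUAL_REVIEW = "manual review"
    CANCEL = "job cancelled"


# Hypothetical keyword lists; the actual BIRD lists are not given in the slides.
BLACKLIST = ["list", "browse", "proc print"]
GRAYLIST = ["tabulate", "outfile"]

# Hypothetical ceilings on output length, measured here in characters.
SOFT_CEILING = 5_000
HARD_CEILING = 50_000


def contains_any(text, phrases):
    """Return True if any phrase occurs in text as a whole word or word sequence."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)


def check_input(statements, authenticated):
    """INPUT-side checks: authentication, then blacklist, then graylist."""
    if not authenticated:
        return Action.CANCEL
    if contains_any(statements, BLACKLIST):
        return Action.CANCEL
    if contains_any(statements, GRAYLIST):
        return Action.MANUAL_REVIEW
    return Action.PASS


def check_output(output):
    """OUTPUT-side checks: hard ceiling, then soft ceiling and graylist."""
    if len(output) > HARD_CEILING:
        return Action.CANCEL
    if len(output) > SOFT_CEILING or contains_any(output, GRAYLIST):
        return Action.MANUAL_REVIEW
    return Action.PASS
```

Under these assumed lists, check_input("regress y x", True) lets the job through, while a submission containing a blacklisted command such as "list id y" is cancelled before it reaches the statistical package.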