Presentation - Q2008

The new Bank of Italy Remote access to micro Data (BIRD)
G. Bruno, L. D’Aurizio, R. Tartaglia-Polcini
Q2008 – Rome, July 10, 2008
1
Motivation
• Information release and data protection as competing goals
• The risk-utility tradeoff:
  • risk of data disclosure
  • utility of widespread availability of data for research
2
Motivation
GOALS (UTILITY):
• satisfy growing demand from external researchers for business data
• improve the accountability of the Central Bank as an economic research centre
• provide a service to the scientific community
CONSTRAINTS (RISK):
• Data confidentiality must be guaranteed:
  • as a prerequisite for respondents’ collaboration
  • to foster the quality of the data provided
  • as required by law
• Public Use File (PUF) with individual data judged unfeasible: anonymisation is very problematic with business data
3
Motivation
SYNTHETIC DATA LIMITATIONS:
• Identity disclosure impossible in principle, but, particularly with
extreme values, it may be possible to re-identify a source record
• Attribute disclosure may happen
• Ample literature on data confounding and synthetic data (Duncan &
Lambert 1989; Rubin 1993; Little 1993; Fuller 1993; Fienberg et al.
1996; Kennickell 1997; Abowd & Woodcock 2001; Reiter 2002;
Raghunathan et al. 2003; etc.)
4
Choices
• Data confounding: create a PUF containing perturbed data to prevent identification of individual information. Downside: results (esp. regressions) may heavily depend on the confounding technique adopted; the literature is controversial
• Data lab (à la Istat: ADELE), where the researcher has to go to the lab in person
• Remote processing, using the internet, without direct access to individual data (à la Luxembourg Income Study: LISSY)
5
Other remote processing systems
• Luxembourg Income Study (LISSY, 1987)
• Statistics Canada (2001)
• Statistics Denmark (2001)
• Statistics Netherlands (2002)
• Australian Bureau of Statistics (2003)
• Statistics Sweden (2003)
• US Federal Agencies: NCHS (1997), NCES
(1998), Census Bureau (2003)
6
The solution adopted at the Bank of Italy
BIRD
• Modeled on LISSY
• Low setup cost
• Easily customisable
• Supports multiple packages
• Maximum accessibility for users
• Multi-level control (user/group, dataset, keyword)
• Automatic and manual checks & review
7
How BIRD works
USER ELIGIBILITY CRITERIA
• Researcher status (not necessarily academic) proved by a presentation letter
• Identification via valid personal id
• Detailed information via form to be filled in
8
How BIRD works
USER PROFILE CREATION
• The researcher indicates an e-mail address which will be recognised by the system
• The researcher chooses her own username and password
• User-chosen parameters are entered in the user database
• Access profile is created
9
How BIRD works
SUBMISSION PROCEDURE
• Communication with the processing environment is via e-mail
• The user sends a message containing user authentication info + statements to be submitted
• Input message is parsed and checks are performed
• If no error or security violation is found, the statements are submitted
• Output is parsed (automatically / manually)
• If no security violation is found, the output is forwarded to the user via e-mail
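The submission flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BIRD's actual implementation: the message layout (credentials in two leading `key: value` lines), the `USERS` registry, the `BLACKLIST` contents and the `run_job` and `output_is_safe` callbacks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry: whitelisted mailbox -> (username, password).
USERS = {"researcher@example.org": ("jdoe", "s3cret")}

# Hypothetical forbidden keywords (commands that would display individual records).
BLACKLIST = {"list", "browse", "print"}


@dataclass
class Result:
    status: str   # "rejected", "held_for_review" or "forwarded"
    detail: str


def handle_submission(sender: str, body: str,
                      run_job: Callable[[str], str],
                      output_is_safe: Callable[[str], bool]) -> Result:
    """Process one e-mail submission end to end."""
    lines = body.splitlines()
    # Assumed layout: the first two lines carry "user: ..." and "password: ...".
    if len(lines) < 2:
        return Result("rejected", "missing authentication header")
    user = lines[0].partition(":")[2].strip()
    password = lines[1].partition(":")[2].strip()
    if USERS.get(sender) != (user, password):
        return Result("rejected", "authentication failed")

    statements = "\n".join(lines[2:])
    # Input check: cancel the job if any blacklisted keyword appears.
    hits = set(statements.lower().split()) & BLACKLIST
    if hits:
        return Result("rejected", "forbidden keyword: " + ", ".join(sorted(hits)))

    output = run_job(statements)
    # Output check: anything not cleared automatically is held for a human reviewer.
    if not output_is_safe(output):
        return Result("held_for_review", output)
    return Result("forwarded", output)
```

In this sketch the statistical package is hidden behind `run_job`, so the same checks would apply whichever package executes the statements.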
10
Confidentiality safeguards
• User level
• Data level
• Processing level
11
Confidentiality safeguards
User level:
• Users are identified, qualified and registered
• Registered mailboxes are whitelisted; ordinarily only one
mailbox per user
• Outputs are monitored and archived
• Code of conduct, privacy law, specific penalties
Sanctions
• Forbidden submissions or outputs are deleted
• Access may be revoked for users who try to run forbidden commands
• Any other sanctions or penalties required by law, where applicable
12
Confidentiality safeguards
Data level:
• Extreme data are censored (Winsorized)
• Identifying variables (ids, names, addresses) are expunged from the datasets used for remote processing
• Stratification variables are collapsed (geographical areas and not regions; Ateco aggregations and not codes)
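To illustrate Winsorization and the collapsing of stratification variables, the sketch below clamps a variable at chosen percentiles and maps a detailed region code to a broader area. The percentile cut-offs, the nearest-rank percentile rule and the region-to-area mapping are assumptions made for the example; the slides do not state the thresholds or groupings BIRD actually uses.

```python
import math


def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clamp values outside the [lower_pct, upper_pct] percentiles to those percentiles."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of data at or below it.
        rank = max(1, math.ceil(p * n / 100.0))
        return ordered[rank - 1]

    lo, hi = percentile(lower_pct), percentile(upper_pct)
    return [min(max(v, lo), hi) for v in values]


# Hypothetical mapping from detailed regions to broad geographical areas.
REGION_TO_AREA = {
    "Piemonte": "North-West",
    "Lombardia": "North-West",
    "Lazio": "Centre",
    "Sicilia": "South and Islands",
}


def collapse_region(region):
    """Replace a detailed region with its broad area; unknown regions go to 'Other'."""
    return REGION_TO_AREA.get(region, "Other")


workforce = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(winsorize(workforce, lower_pct=10, upper_pct=90))
print(collapse_region("Lombardia"))
```

With these inputs the single very large firm is pulled back to the 90th-percentile value, so its size no longer singles it out, while the rest of the distribution is unchanged.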
13
Confidentiality safeguards
Processing level:
• Formally forbidden to display individual data
• Keyword parser implemented with ceiling, blacklist and graylist
• Particularly long and/or complex programmes are always reviewed manually
• In the learning stage, all submissions are reviewed manually
14
How the parser works
Check type     | Check performed                               | Action if failed on INPUT | Action if failed on OUTPUT
authentication | checking user authentication data             | job cancelled             | n/a
blacklist      | parsing text for specific words and sequences | job cancelled             | n/a
length         | checking the length of text                   | n/a                       | soft ceiling: manual review; hard ceiling: job cancelled
graylist (*)   | parsing text for specific words and sequences | manual review             | manual review
(*) This feature will be available in the next release of the system.
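The parser checks can be sketched as a small rule engine. This is an illustration only: the word lists, the ceiling values and the function names are invented for the example, and applying the length ceiling to the number of output lines is an assumption about how the length check is used.

```python
import re

# Hypothetical keyword lists and ceilings; the real ones are not given in the slides.
BLACKLIST = {"list", "browse"}
GRAYLIST = {"tabulate", "sort"}
SOFT_CEILING = 200    # output lines above this go to manual review
HARD_CEILING = 1000   # output lines above this cancel the job


def tokens(text):
    """Lower-case word tokens, so 'LIST' and 'list,' both match 'list'."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def check_input(text):
    """Return 'cancel', 'review' or 'pass' for a submitted programme."""
    words = tokens(text)
    if words & BLACKLIST:
        return "cancel"
    if words & GRAYLIST:
        return "review"
    return "pass"


def check_output(text):
    """Return 'cancel', 'review' or 'pass' for the job output."""
    n_lines = len(text.splitlines())
    if n_lines > HARD_CEILING:
        return "cancel"
    if n_lines > SOFT_CEILING or tokens(text) & GRAYLIST:
        return "review"
    return "pass"
```

Matching on whole tokens rather than substrings avoids cancelling a job merely because a variable name happens to contain a forbidden word, at the cost of missing abbreviated commands; a production parser would need to handle the abbreviations each package accepts.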
15
Datasets available
STANDARD DATASET: quantitative data for the biggest
firms (in terms of workforce) are censored (Winsorised)
COMPLETE DATASET: no data censoring
Id variables are expunged from both datasets, obviously
16
Datasets available
Stricter procedure for accessing the complete dataset:
• Access must be explicitly requested; a special profile is created
• Review is exclusively manual
• Wait times are longer than average, because the time allocated to manual review of complete-dataset submissions is limited
17
Documentation on the website
• Application form
• Instruction manual
• Dataset description
• Examples of submissions in the supported packages (SAS, Stata)
• Methodological notes on the survey
18
Support
1. Documentation available on the Bank of Italy website
(manuals, variables description, questionnaires)
http://www.bancaditalia.it/statistiche/indcamp/indimpser/bird
2. Mailbox for queries and assistance:
bird_assist@bancaditalia.it
19
An example
[Figure: program submitted by the user in Stata. Authentication is in the first four lines.]
20
An example
[Figure: output forwarded after review.]
21
Usage of the system in the first weeks
• System started officially on Mar 13, 2008
• Beta users from Feb 1, 2008
• 8 registered users
• 172 submissions in 21 weeks
22
Usage of the system in the first weeks
[Chart: BIRD, number of weekly submissions from Feb 1, 2008 (x-axis: weeks w1 to w21; y-axis: 0 to 35 submissions)]
23
Future developments
• Web submission available alongside e-mail submission
• Other datasets will be made available in the future (e.g. data from the Business Outlook Survey)
• Support for open source packages (e.g. R)
• Merging with external datasets provided by the user, for special projects, on a discretionary basis, under a stricter procedure and higher security levels
• Creation of closed groups with special authorisation levels for specific projects
24
Thank you for your attention
25