Privacy Statistics and Data Linkage
Mark Elliot
Confidentiality and Privacy Group
University of Manchester

Overview
• The disclosure risk problem
• Some e-science possibilities
  – Monitored data access
  – Grid-based Data Environment Analysis
• The meaning of privacy

Data Data Everywhere…
• Massive and exponential increase in data: Mackey and Purdam (2002); Purdam and Elliot (2002).
  – These studies have led to the setting up of the data monitoring service.
• Singer (1999) noted three behavioural tendencies:
  – Collect more information on each population unit
  – Replace aggregate data with person-specific databases
  – Given the opportunity, collect personal information
• Purdam and Elliot add:
  – Link data whenever you can

Disclosure Risk I: Microdata

The Disclosure Risk Problem: Type I: Identification
[Diagram: an identification file and a target file that share a set of key variables]
  Identification file: Name, Address (ID variables); Sex, Age, .. (key variables)
  Target file: Sex, Age, .. (key variables); Income, .., .. (target variables)

Disclosure Risk II: Aggregate Tables of Counts

The Disclosure Risk Problem: Type II: Attribution
Income levels for two occupations

             High  Medium  Low  Total
  Academics     0     100   50    150
  Lawyers     100      50    5    155
  Total       100     150   55    305

The Disclosure Risk Problem: Type II: Attribution
Income levels for two occupations

             High  Medium  Low  Total
  Academics     1     100   50    150
  Lawyers     100      50    5    155
  Total       100     150   55    305

Multiple datasets
• Disclosure risk assessment for single datasets is a reasonably well understood problem.
• But what happens with multiple datasets?

Data Mining and the Grid
• Traditional Data Mining examines and identifies patterns in single (if massive) datasets.
• But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.
• Smith and Elliot (2005, 06, 07)
• Increases in data availability lead inexorably to an increase in disclosure risk.
• My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the co-presence of dataset Z.
• It’s all about information!

CLEF: Clinical e-Science Framework
A solution involving monitored access

CLEF Consortium
Approximately 40 staff from
• University of Manchester
• University of Sheffield
• University College London
• University of Brighton
• Royal Marsden Hospital, London

Purpose
• To provide a system for allowing research access to patient data, whilst maintaining privacy.
• Patient records
  – Database
• Texts such as referral letters and other clinical texts
  – Text mining system converts these to microdata

CLEF: one possible architecture
[Diagram components: Firewall; Raw Data; Pre-access DQI Monitor; Pre-access SDRA/SDC; Treated Data; Pre-output DQI Monitor; Pre-output SDRA/SDC; Data Intrusion Sentry; Workbench]

Data Sentry: an AI system
• Monitors patterns of analytical requests
  – 3 levels: users, institution, world.
  – Looking for intrusive patterns.
  – Numbers of requests
• Stores analytical requests for future use.

CLEF Proposed Architecture
[Diagram components: Firewall; Raw Data; Pre-access DQI Monitor; Pre-access SDRA/SDC; Treated Data; Pre-output DQI Monitor; Pre-output SDRA/SDC; Data Intrusion Sentry; Workbench]

Data Quality
• User analyses are run on both treated and untreated data.
  – Outputs are compared and assessed for difference.
  – Major research area
  – Knowledge Engineering
• Analyses are stored and collectively run over pre- and post-SDC files for assessment of impact.

The Grid: the context for massive combining
• “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)
  – Grid middleware handles the technical issues: communication, security, access/authentication etc. Cole et al (2002)
• Data grid
• Knowledge grid

Grid based Data Environment Analysis

What’s it about?
• Disclosure risk analysis is forever constrained by the fact that we tend to look only at the release object.
  – This is a bit like evaluating a house's vulnerability to flooding without looking at where it is located!
• Data Environment Analysis aims to remedy that situation and, in so doing, completely change the face of disclosure control.

What would it involve?
• Web Crawling
• Data Monitoring
• Synthetic Data Generation
• Grid based disclosure risk analysis

Web crawling
• Untrained screen scraping of all web sites that collect personal data.
• Generic gathering of web-published personal information (personal web pages, MySpace etc.)

Data Monitoring
• The development of sophisticated metadatabases representing available information fields
• Combined database of web-available data.
  – Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.

Architecture
[Diagram components: five Web Crawlers; SDRA system; Synthesiser; Data monitor; Repository: Data & Metadata]

What next?
• Decide on roles.
• Identify funder.
• Develop grant application.

Synthetic Data Generation
• Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.

Closing thoughts

A Blurring of Concepts
• The boundaries between data and processes become less distinct.
• Cyberidentities
  – I am my data?
• The distinction between informational and physical privacy becomes less distinct.

Data Growth
• There is no reason to suppose that data growth will not continue at the same breakneck pace
  – The data environment will become ever richer
• In this context the meaning of “privacy” will undoubtedly change.
  – But how?

The meaning of Privacy
• Do people care about privacy in an orthodox, absolute sense?
  – What does a blog mean?
• Private-public: Public Privacy
  – Control and ownership are more important than the absolute right to secrecy.

From Data Subjects to Data Citizens
• A data-actualised individual, in control of and self-aware about their own data.
• What would data citizens be concerned about?
  – Ownership
  – The use/abuse of their data
  – Harm
  – Permission/Consent
• This suggests that the law should focus on data abuse rather than privacy per se.

Summary
• Statistical disclosure presents a problem for the use of data.
• Multiple linkable datasets exacerbate that problem.
• E-science provides some tools for new modes of data access.

But…..
• Assuming that the global culture continues to feed and be fed by the information explosion:
  – Our view of ourselves/our data will/must change.
  – The meaning of privacy must change with it.
• The key question is what sort of society we are constructing; the meaning of privacy will reflect this.
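
Worked example: attribution from zero cells
The Type II attribution problem shown in the income table for two occupations can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by any of the systems described in these slides: the counts are copied from the table, while the function name attribution_findings and the dictionary layout are assumptions made for the example. The idea is that a zero cell lets anyone who knows a person's occupation rule out an income level for that person.

```python
from typing import Dict, List

# Counts copied from the "Income levels for two occupations" table.
INCOME_LEVELS: List[str] = ["High", "Medium", "Low"]
COUNTS: Dict[str, List[int]] = {
    "Academics": [0, 100, 50],
    "Lawyers": [100, 50, 5],
}


def attribution_findings(
    counts: Dict[str, List[int]], levels: List[str]
) -> Dict[str, Dict[str, List[str]]]:
    """For each row with at least one zero cell, list the levels that are
    ruled out for every member of that row and the levels that remain.

    Hypothetical helper written for this illustration only.
    """
    findings: Dict[str, Dict[str, List[str]]] = {}
    for group, row in counts.items():
        ruled_out = [lvl for lvl, n in zip(levels, row) if n == 0]
        possible = [lvl for lvl, n in zip(levels, row) if n > 0]
        if ruled_out:
            findings[group] = {"ruled_out": ruled_out, "possible": possible}
    return findings


if __name__ == "__main__":
    for group, f in attribution_findings(COUNTS, INCOME_LEVELS).items():
        print(f"{group}: ruled out {f['ruled_out']}, remaining {f['possible']}")
```

With the counts above, only the Academics row is flagged: knowing that someone is an academic is enough to learn that their income is not High. If that cell holds 1 instead of 0, as in the second version of the table, no row is flagged.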