Data Provenance and Security
The Role of Provenance in Data Security (or vice versa)
Jens Jensen <jens.jensen@stfc.ac.uk>
ESI Provenance Workshop, Edinburgh
20 April 2009

Data Security Goals
• Confidentiality
• Integrity
• Availability
• Assurance
• (Policies)

Confidentiality
• What it is
  – Protect data against disclosure
• Why
  – Sensitive information (commercial, personal data), legal compliance
• Issues
  – Enabling and maintaining access for authorised persons (only)
  – Tech

Confidentiality – Example
• Gov’t historical papers
  – (or soon-to-be historical)
  – Keep for 30 years (say), then disclose
• Medical/personal data
  – Compliance with the Data Protection Act
  – Keep, then destroy (and prove it)
• Licensed data
  – E.g. census data

Integrity
• What it is
  – Ensure content does not change
• Why
  – Essential for most data storage
• Issues
  – Intentional vs unintentional changes
  – Changes to file (bit level) vs changes to content (migration OK)
  – Tech

Integrity – Example
• Bank money transfer
  – Not so good if recipient adds a zero…
  – Proof of integrity req’d for auditing
• Original research data
  – Prevent change
    • unintended (disk corruption),
    • accidental (oops),
    • malicious (forgery)

Availability
• What it is
  – Ensure data can be reached when needed
• Why
  – Often neglected aspect of security
  – High profile DoS cases show why
• Issues
  – Multiple copies go against confidentiality
  – Technically challenging
  – Scale

Availability – Example
• Tax returns…
  – Everybody (almost) submits at the last minute
  – Online server and instructions must be available under high load
• Accessing remote data
  – Clouds, grids, buzzword-du-jour
  – Increase distributed collaborations
  – Hot files

Assurance
• What it is
  – Do you trust the repository?
  – Does a 3rd party (e.g. a court of law) trust it?
• Why
  – Legal admissibility
  – Compliance with regulations
• Issues
  – Understanding duty of care
  – Full lifecycle

Assurance – Example
• BIP 0008
  – Legal admissibility and evidential weight of
    • information stored electronically
    • information communicated electronically
    • linking identity of documents
  – Summary: duty of care

Policies
• What it is
  – Higher level goals
• Why
  – Help clarify the wherefores and whys
• Out of scope of this talk

Proposals in This Presentation
• Use provenance data to implement (some) security goals
• Use security components to implement (some) provenance goals

Obvious Things First
• Keep security metadata along with provenance metadata
• Secure provenance data
  – Integrity
  – Availability (data useless without metadata)

Obvious Things First
• Protect confidential aspects of provenance
  – E.g., data origin, owner
  – E.g., conditional release of owner/subject
• Assurance for auditing purposes

Components and Roles
• Identity: data owner
• Time: timestamps
• File: bit level
• File/datasets: type, contents
• Text: comments, annotations, taxonomies, user defined
• Process: workflow, parameters
• History: versions

The Gory(?) Details – Identity
• Uniqueness
  – Id not reallocated to others
  – (User has no other id)
• Persistence
  – Lifetime of long id tokens…

Identity Today
• Name, local uid, affiliation, email
  – Not unique
  – People move around, change names
• X.509 certificates
  – Commercial PKI
  – UK e-Science CA
• UK Access Management Federation
  – Pseudonymity at best
  – Can be recycled

Time
• Accuracy
  – PC clocks drift (sync often)
  – Needs network or special clock cards (or GPS!)
• Assurance of time
  – (Re)setting time on PC is easy

File Curation
File metadata
• Filename, path
  – Should or should not
• Creation time
• Last modified
• Dataset
• Origin
Content
• Migration
• Extraction
• Functions
Bitlevel
• Curation
• Length
• Checksums

Process and History
• Log workflow and parameters
• Version control, changes/patches
[Figure labels: IN, OUT, Provenance data]

ASPiS
• JISC funded collaboration between KCL, STFC, Reading
• Shibboleth access to iRODS
  – Rule-based workflow in microservices
• Use PASOA for provenance data
  – Also recording workflow
• Prototype to be deployed for the National Grid Service

[Figure labels: Shib service, Apache, PERMIS PDP, User, PASOA, iRODS, Disk, Shibboleth login, Home Inst.]

iRODS Example
[Figure labels: Rule workflow, iRODS Rule Engine, Log attrs, Access Ctrl, Update metadata, Branch on file type, Image metadata, Document metadata, PERMIS PDP, PASOA]

Two Federations
[Figure labels: ASPiS iRODS Federation, UK Access Management Federation (Shibboleth), King’s iRODS, STFC iRODS, Reading iRODS]

Specific Use Cases
• Digitising material
  – Existing collections
  – Preserve and extend provenance
• Change during processing
  – E.g., RAW to ESD, AOD
  – New ACL for processed data
  – Provenance determines storage (custodial)

Specific Use Cases
• Automated data processing
  – E.g., bots
  – Process data according to provenance
  – Process step recorded
• Recording discovery
  – @home processing
  – Whodunnit, is it right

Today’s (New) Challenges
• Distributed collaborations
  – National, international
• Scale
  – Large files, large numbers of files/datasets
  – Large number of users/concurrent users
  – Hot files (1000s of jobs accessing file)
• Virtualisation
  – Double-edged sword?

Who are the Enemies?
• Time
• Entropy
• Tech advances/research
• Unauthorised access
• Usability
• Poor design, unclear req’s, underestimates

Advice to Your Project
• Don’t underestimate security
  – Not an afterthought
  – Understand data threat model and risks
  – Be wary of theoreticians…☺
• Duty of care if needed
  – Audit procedures and metadata
• Exit strategy?
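The slides list length and checksums as bit-level file metadata and ask for provenance data itself to be protected for integrity. A minimal sketch of how the two could be combined follows. It assumes Python, SHA-256 and a hash-chained in-memory log; none of these choices come from the talk, from ASPiS, or from iRODS/PASOA, and every name in it (ProvenanceRecord, append_record, verify_log, verify_data) is hypothetical. Timestamps are supplied by the caller, so the sketch gives no assurance of time.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import List

# Placeholder "previous hash" for the first record in a log (an assumption).
GENESIS = "0" * 64


def fixity(data: bytes) -> dict:
    """Bit-level fixity information: length and SHA-256 checksum."""
    return {"length": len(data), "sha256": hashlib.sha256(data).hexdigest()}


@dataclass(frozen=True)
class ProvenanceRecord:
    owner: str       # identity of the data owner (format is hypothetical)
    timestamp: str   # supplied by the caller; not independently assured
    action: str      # process step, e.g. "ingest" or "migrate"
    length: int      # file length in bytes
    sha256: str      # checksum of the file content
    prev_hash: str   # digest of the previous record, linking the chain

    def digest(self) -> str:
        """Hash of the whole record, including the link to its predecessor."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_record(log: List[ProvenanceRecord], owner: str, timestamp: str,
                  action: str, data: bytes) -> ProvenanceRecord:
    """Append a record for `data`, chained to the current end of the log."""
    prev = log[-1].digest() if log else GENESIS
    record = ProvenanceRecord(owner, timestamp, action,
                              prev_hash=prev, **fixity(data))
    log.append(record)
    return record


def verify_log(log: List[ProvenanceRecord], head: str) -> bool:
    """Check every link, then compare the final digest with a trusted `head`.

    `head` must be kept somewhere the log's writer cannot alter; without it,
    a change to the last record would go unnoticed.
    """
    prev = GENESIS
    for record in log:
        if record.prev_hash != prev:
            return False
        prev = record.digest()
    return prev == head


def verify_data(record: ProvenanceRecord, data: bytes) -> bool:
    """Check that `data` still matches the recorded length and checksum."""
    current = fixity(data)
    return (current["length"] == record.length
            and current["sha256"] == record.sha256)
```

Typical use would be to call append_record at each process step, store the digest of the latest record separately from the log, and later call verify_log with that stored digest and verify_data with the file content. The chain only makes tampering evident; it does not prevent it, and it says nothing about confidentiality or availability.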
Conclusion
• Use security to implement dependable provenance
• Use provenance to manage security
• Understand strengths and limitations of what we have today

Thank you!