Data Provenance and Security

The Role of Provenance in Data Security
(or vice versa)
Jens Jensen <jens.jensen@stfc.ac.uk>
ESI Provenance Workshop, Edinburgh
20 April 2009
Data Security Goals
• Confidentiality
• Integrity
• Availability
• Assurance
• (Policies)
Confidentiality
• What it is
– Protect data against disclosure
• Why
– Sensitive information (commercial,
personal data), legal compliance
• Issues
– Enabling and maintaining access for
authorised persons (only)
– Tech
Confidentiality – example
• Gov’t historical papers
– (or soon-to-be historical)
– Keep for 30 years (say), then disclose
• Medical/personal data
– Compliance with the Data Protection Act
– Keep, then destroy (and prove it)
• Licensed data
– E.g. census data
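The "keep, then disclose" and "keep, then destroy" examples above can be read as retention rules evaluated against a record's creation date. The following is a minimal sketch under that assumption; `RetentionPolicy`, `Action`, `due_date` and `status` are hypothetical names, and the only period taken from the slides is the "30 years (say)" figure.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Action(Enum):
    """What happens to the data once the retention period ends (illustrative)."""
    DISCLOSE = "disclose"
    DESTROY = "destroy"


@dataclass(frozen=True)
class RetentionPolicy:
    keep_years: int   # how long the data stays confidential
    then: Action      # action that falls due afterwards


def due_date(created: date, policy: RetentionPolicy) -> date:
    """Date on which the policy's action falls due."""
    target_year = created.year + policy.keep_years
    try:
        return created.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return created.replace(year=target_year, day=28)


def status(created: date, policy: RetentionPolicy, today: date) -> str:
    """Return 'keep confidential' until the due date, then the policy's action."""
    if today < due_date(created, policy):
        return "keep confidential"
    return policy.then.value


# Hypothetical policy modelled on the government papers example.
historical_papers = RetentionPolicy(keep_years=30, then=Action.DISCLOSE)
```

Note that "destroy (and prove it)" needs more than a date check: the proof would have to come from audit records, which this sketch does not cover.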
Integrity
• What it is
– Ensure content does not change
• Why
– Essential for most data storage
• Issues
– Intentional vs unintentional changes
– Changes to file (bitlevel) vs changes to
content (migration ok)
– Tech
Integrity – Example
• Bank money transfer
– Not so good if recipient adds a zero…
– Proof of integrity req’d for auditing
• Original research data
– Prevent change
• unintended (disk corruption),
• accidental (oops),
• malicious (forgery)
Availability
• What it is
– Ensure data can be reached when needed
• Why
– Often neglected aspect of security
– High profile DoS cases show why
• Issues
– Multiple copies go against confidentiality
– Technically challenging
– Scale
Availability – Example
• Tax returns…
– Everybody (almost) submits at the last
minute
– Online server and instructions must be
available under high load
• Accessing remote data
– Clouds, grids, buzzword-du-jour
– Increase in distributed collaborations
– Hot files
Assurance
• What it is
– Do you trust the repository?
– Does a 3rd party (e.g. court of law) trust it?
• Why
– Legal admissibility
– Compliance with regulations
• Issues
– Understanding duty of care
– Full lifecycle
Assurance – Example
• BIP 0008
– Legal admissibility and evidential weight of
• information stored electronically
• information communicated electronically
• linking identity of documents
– Summary: duty of care
Policies
• What it is
– Higher level goals
• Why
– Help clarify the wherefores and whys
• Out of scope of this talk
Proposals in This Presentation
• Use provenance data to implement
(some) security goals
• Use security components to implement
(some) provenance goals
Obvious Things First
• Keep security metadata along with
provenance metadata
• Secure provenance data
– Integrity
– Availability (data useless without metadata)
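One way to protect the integrity of provenance metadata is to attach a keyed message authentication code (HMAC) to each record, so that a change made without the key is detectable. This is only a sketch: the record fields, the function names `canonical`, `seal` and `verify`, and the key handling are assumptions for illustration, not something the slides prescribe.

```python
import hashlib
import hmac
import json


def canonical(record: dict) -> bytes:
    """Serialise a record deterministically so equal records give equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(record: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the canonical form of the record."""
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()


def verify(record: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time; False means the record or tag changed."""
    return hmac.compare_digest(seal(record, key), tag)


# Hypothetical record and key; a real deployment would keep the key
# somewhere the data writers cannot read it.
demo_key = b"demo-key-not-for-production"
demo_record = {"file": "run42.raw", "owner": "alice", "created": "2009-04-20"}
demo_tag = seal(demo_record, demo_key)
```

A plain checksum would catch disk corruption and accidental edits, but anyone who can alter the record can also recompute it. The keyed tag is what addresses the malicious case.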
Obvious Things First
• Protect confidential aspects of
provenance
– E.g., data origin, owner
– E.g., conditional release of owner/subject
• Assurance for auditing purposes
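Conditional release of confidential provenance fields can be sketched as a filter that withholds those fields unless the requester has been cleared for them. The field names, the `[withheld]` marker and the `release_view` function are hypothetical; a real system would take the clearance decision from its access control layer.

```python
# Fields treated as confidential in this sketch (an assumption based on the
# "data origin, owner" and "owner/subject" examples on the slide).
CONFIDENTIAL_FIELDS = frozenset({"origin", "owner", "subject"})


def release_view(record: dict, cleared: frozenset = frozenset()) -> dict:
    """Return a copy of the record with uncleared confidential fields withheld."""
    hidden = CONFIDENTIAL_FIELDS - cleared
    return {
        key: ("[withheld]" if key in hidden else value)
        for key, value in record.items()
    }


# Hypothetical provenance record.
demo = {"file": "census.csv", "origin": "example-archive", "owner": "alice"}
public_view = release_view(demo)
auditor_view = release_view(demo, cleared=frozenset({"owner"}))
```

The original record is left untouched, so the full provenance remains available for auditing.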
Components and Roles
• Identity
– data owner
• Time
– timestamps
• File
– bitlevel
• File/datasets
– type, contents
• Text
– comments, annotations, taxonomies, user defined
• Process
– workflow, parameters
• History
– versions
The Gory(?) Details – Identity
• Uniqueness
– Id not reallocated to others
– (User has no other id)
• Persistence
– Lifetime of long id tokens…
Identity Today
• Name, local uid, affiliation, email
– Not unique
– People move around, change names
• X.509 certificates
– Commercial PKI
– UK e-Science CA
• UK Access Management Federation
– Pseudonymity at best
– Can be recycled
Time
• Accuracy
– PC clocks drift (sync often)
– Needs network or special clock cards (or
GPS!)
• Assurance of time
– (Re)setting time on PC is easy
File Curation
File metadata
• Filename, path
– Should or should not
• Creation time
• Last modified
• Dataset
• Origin
Content
• Migration
• Extraction
• Functions
Bitlevel
• Curation
• Length
• Checksums
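The bitlevel items (length, checksums) can be captured as a small fixity record computed when a file is ingested and recomputed later for comparison. The sketch below assumes SHA-256 as the checksum and reads from any binary stream; `bitlevel_record` and the sample bytes are illustrative only.

```python
import hashlib
import io
from typing import BinaryIO


def bitlevel_record(stream: BinaryIO, chunk_size: int = 1 << 20) -> dict:
    """Read a binary stream in chunks and return its length and SHA-256 digest."""
    digest = hashlib.sha256()
    length = 0
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        length += len(chunk)
    return {"length": length, "sha256": digest.hexdigest()}


# Usage with in-memory data; for a real file, pass open(path, "rb") instead.
original = bitlevel_record(io.BytesIO(b"raw detector data"))
corrupted = bitlevel_record(io.BytesIO(b"raw detector dat\x00"))
```

Here the corrupted copy has the same length as the original, so only the checksum reveals the change. That is why the two are recorded together rather than length alone.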
Process and History
• Log workflow and parameters
• Version control, changes/patches
(Diagram labels: IN, OUT, Provenance data)
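A logged workflow history becomes harder to alter unnoticed if each entry's hash covers the previous entry's hash, forming a chain. This is a sketch of that general technique, not of how PASOA or any other tool named in this talk stores its records; `GENESIS`, `entry_hash`, `append` and `verify` are hypothetical names.

```python
import hashlib
import json

# Placeholder "previous hash" for the first entry (an arbitrary choice).
GENESIS = "0" * 64


def entry_hash(prev_hash: str, step: dict) -> str:
    """Hash a workflow step together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "step": step}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, step: dict) -> None:
    """Append a step (workflow name, parameters, ...) to the chained log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"step": step, "hash": entry_hash(prev, step)})


def verify(log: list) -> bool:
    """Recompute every hash in order; any edited or reordered entry fails."""
    prev = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev, entry["step"]):
            return False
        prev = entry["hash"]
    return True


history: list = []
append(history, {"workflow": "calibrate", "params": {"gain": 2}})
append(history, {"workflow": "reconstruct", "params": {"version": "1.0"}})
```

The chain only shows that entries are consistent with each other. Someone able to rewrite the whole log could recompute every hash, so the latest hash would still need to be signed or stored elsewhere.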
ASPiS
• JISC funded collaboration between
KCL, STFC, Reading
• Shibboleth access to iRODS
– Rule-based workflow in μservices
• Use PASOA for provenance data
– Also recording workflow
• Prototype to be deployed for the
National Grid Service
(Diagram labels: User, Shibboleth login, Home Inst., Shib service, Apache, PERMIS PDP, PASOA, iRODS, Disk)
Example Rule Workflow
(Diagram labels: iRODS Rule Engine, Log attrs, Access Ctrl, Update metadata, Branch on file type, Image metadata, Document metadata, PERMIS PDP, PASOA)
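The steps named in this diagram (log attributes, access control, branch on file type, update metadata) can be pictured as a small pipeline. The sketch below is plain Python and does not use the iRODS rule language, PERMIS or PASOA APIs. The ordering of the steps, the extension sets and all function names are assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Callable, Optional

# Hypothetical extension sets used for the "branch on file type" step.
IMAGE_SUFFIXES = {".png", ".jpg", ".tif"}
DOCUMENT_SUFFIXES = {".pdf", ".txt", ".doc"}


def classify(filename: str) -> str:
    """Pick the metadata branch from the file extension."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix in DOCUMENT_SUFFIXES:
        return "document"
    return "other"


def run_rule(
    user_attrs: dict,
    filename: str,
    is_permitted: Callable[[dict, str], bool],  # stands in for the policy decision point
    provenance_log: list,                        # stands in for the provenance store
) -> Optional[dict]:
    """Log attributes, check access, then build metadata for the file type."""
    provenance_log.append({"event": "attrs", "attrs": dict(user_attrs)})
    if not is_permitted(user_attrs, filename):
        provenance_log.append({"event": "denied", "file": filename})
        return None
    metadata = {"file": filename, "kind": classify(filename)}
    provenance_log.append({"event": "metadata-updated", "metadata": metadata})
    return metadata


def staff_only(attrs: dict, filename: str) -> bool:
    """Toy access decision: permit only users with a 'staff' affiliation."""
    return attrs.get("affiliation") == "staff"
```

Every decision, including a refusal, leaves an entry in the log, which matches the talk's point that the process step itself should be recorded.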
Two Federations
(Diagram: the ASPiS iRODS Federation and the UK Access Management Federation (Shibboleth), with iRODS instances at King’s, STFC and Reading)
Specific Use Cases
• Digitising material
– Existing collections
– Preserve and extend provenance
• Change during processing
– E.g., RAW to ESD, AOD
– New ACL for processed data
– Provenance determines storage (custodial)
Specific Use Cases
• Automated data processing
– E.g., bots
– Process data according to provenance
– Process step recorded
• Recording discovery
– @home processing – whodunnit, is it right
Today’s (New) Challenges
• Distributed collaborations
– National, international
• Scale
– Large files, large numbers of files/datasets
– Large number of users/concurrent users
– Hot files (1000s of jobs accessing file)
• Virtualisation
– Double-edged sword?
Who are the Enemies?
• Time
• Entropy
• Tech advances/research
• Unauthorised access
• Usability
• Poor design, unclear req’s, underestimates
Advice to Your Project
• Don’t underestimate security
– Not an afterthought
– Understand data threat model and risks
– Be wary of theoreticians…☺
• Duty of care if needed
– Audit procedures and metadata
• Exit strategy?
Conclusion
• Use security to implement dependable
provenance
• Use provenance to manage security
• Understand strengths and limitations of
what we have today
Thank you!