Presentation Slides - UNC School of Information and Library Science

Data Triage and Data Analytics
for Personal Digital Collections
Kam Woods
CRADLE Talk Series, January 27 2012
University of North Carolina at Chapel Hill
The Andrew W. Mellon Foundation
SCALE AND THE GROWING PROBLEM OF PERSONAL COLLECTIONS
Dramatic increases in the size of personal data collections over the past decade point
to significant preservation problems in the future
Management of personal data collections increasingly depends on sophisticated
automation – to identify PII, and perform search, backup, deduplication, and
preservation tasks
Most of the advanced automation facilities used to store and manage such data in
modern operating systems (or used to locate and process it in forensic analytics
software) are not reflected in the functionality of most preservation software
Finding ways to translate this automation to preservation packages is key to long-term
preservation
Issues of context; the further we get from the original context, the harder it gets to
reconstruct the original environment
KEY ISSUES FOR ACQUISITION OF PERSONAL DIGITAL COLLECTIONS IN
COLLECTING INSTITUTIONS
Locating and protecting personally identifying, private, and/or sensitive information
located on fixed and removable digital media
Providing archivists and acquisitions staff with the tools required to quickly and
accurately assess raw data from donor media
Generating metadata that facilitates interoperability between tools and can readily be
crosswalked to current collections and preservation standards
FORENSIC DATA TRIAGE WISH LIST FOR [PERSONAL DIGITAL] COLLECTIONS
Handling Private and Sensitive Data
email addresses / email header data
credit card numbers
search terms / URL / history
phone numbers
GPS coordinates
EXIF metadata
word lists
Establishing ground truth
residual and system data
deleted files
slack space between files
cached data
Windows pagefile
hibernation file
Registry
system and device logs
Patterns of use and activity to support identity and
authenticity
User activity
Filesystem permissions
Device access
Annotations, metadata, and context
Logs, events, and filesystem activity to record
chain of custody (production, pre-ingest, in-archive)
DFXML
PREMIS events, METS records, other techMD
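The first group in the wish list (email addresses, credit card numbers, and similar features) can be illustrated with a small pattern scan over raw bytes. This is a minimal sketch and not how any particular forensic tool works. The patterns, the `scan_buffer` and `luhn_ok` names, and the sample buffer are assumptions made for illustration.

```python
import re

# Illustrative patterns only; production feature scanners use far more
# robust rules and handle many encodings.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(rb"\b(?:\d[ -]?){13,16}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_buffer(data: bytes):
    """Yield (feature_type, byte_offset, text) for each candidate hit."""
    for m in EMAIL_RE.finditer(data):
        yield ("email", m.start(), m.group().decode("ascii"))
    for m in CARD_RE.finditer(data):
        digits = re.sub(rb"\D", b"", m.group()).decode("ascii")
        if luhn_ok(digits):  # discard digit runs that fail the checksum
            yield ("credit_card", m.start(), digits)


if __name__ == "__main__":
    # Hypothetical buffer standing in for a block read from a disk image.
    sample = b"contact: donor@example.org card 4111111111111111 id 1234567890123456"
    for hit in scan_buffer(sample):
        print(hit)
```

Reporting the byte offset with each hit keeps the link back to the location on the media, which matters when the same string appears in both allocated files and residual data.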
WHY IS HANDLING PRIVATE AND SENSITIVE DATA SUCH A PERVASIVE PROBLEM?
Objects created using modern operating systems have complex structural,
semantic, and relational properties not necessarily captured within the file
format
Much of this information is lost in staging procedures that simply copy files out of
the filesystem
Retaining disk images (raw or forensically packaged bitstreams) eliminates this
loss. Because much of the original context is maintained, this eases the task of
distinguishing between private and non-private data on donor media.
Digital forensics tools can capture and process this data, but the language
of digital forensics is often confusing or inappropriate for digital archivists
A MODEL FOR A FORENSICALLY ENHANCED WORKFLOW
Donor device
Extraction of fixed media
Acquisition of raw disk image and forensic packaging
Staging area
Acquisition metadata (afflib)
Filesystem metadata (fiwalk)
Context-sensitive identification of private and confidential info
Preparation of redacted image / permissions overlays + crosswalks to archival metadata
Packaging for ingest
Archival storage
File-level access (AFFLIB via C, Python API hooks)
Dissemination of full/redacted images (distributed)
AUTOMATED REPORTING FOR DATA ON DONOR MEDIA
Extract filesystem metadata as DFXML
Summary report
Volume information
File type histograms
Location of user data by profile
Disk utilization
Deleted and corrupted files
Encryption
Extract Registry hives (Windows) or equivalent login and system info
Flag unique user identifiers
SID
S-1-5-32-1045337234-12924708993-5683276719-19000
(Revision level, identifier authority, subauthority, etc.)
Advanced report
Identify external device usage
Number of active users
User activity
Removable storage
Network transfers
Applications installed and recently used
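The summary-report items above (file type histograms, disk utilization, deleted files) can be derived from filesystem metadata once it is expressed as XML. The sketch below assumes a heavily simplified, hypothetical DFXML fragment with only `filename`, `filesize`, and `alloc` children per `fileobject`. Real fiwalk output is namespaced and much richer, and treating any file without `alloc` set to 1 as deleted is an assumption made here for brevity.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified DFXML-style fragment (not real fiwalk output).
SAMPLE_DFXML = """<dfxml>
  <volume>
    <fileobject><filename>Documents/letter.doc</filename><filesize>24064</filesize><alloc>1</alloc></fileobject>
    <fileobject><filename>Pictures/P1010395.JPG</filename><filesize>2864229</filesize><alloc>1</alloc></fileobject>
    <fileobject><filename>Pictures/P1010395-1.JPG</filename><filesize>2874083</filesize><alloc>1</alloc></fileobject>
    <fileobject><filename>Documents/old_draft.doc</filename><filesize>1024</filesize><alloc>0</alloc></fileobject>
  </volume>
</dfxml>"""


def local(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def summarize(dfxml_text: str) -> dict:
    """Build an extension histogram, allocated byte total, and unallocated file list."""
    ext_counts = Counter()
    allocated_bytes = 0
    unallocated = []
    root = ET.fromstring(dfxml_text)
    for elem in root.iter():
        if local(elem.tag) != "fileobject":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem}
        name = fields.get("filename", "")
        size = int(fields.get("filesize") or 0)
        ext = os.path.splitext(name)[1].lower() or "(none)"
        ext_counts[ext] += 1
        if fields.get("alloc") == "1":
            allocated_bytes += size
        else:
            unallocated.append(name)  # assumption: not allocated means deleted
    return {
        "extensions": ext_counts,
        "allocated_bytes": allocated_bytes,
        "unallocated_files": unallocated,
    }


if __name__ == "__main__":
    report = summarize(SAMPLE_DFXML)
    for ext, count in sorted(report["extensions"].items()):
        print(f"{ext:8} {count}")
    print("allocated bytes:", report["allocated_bytes"])
    print("unallocated:", report["unallocated_files"])
```

Grouping by file extension is a crude stand-in for real file type identification, which would normally inspect file signatures rather than names.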
WHAT ABOUT PRIVATE AND SENSITIVE DATA?
Context-sensitive extraction
of private and sensitive data
Bulk Extractor processing image courtesy Simson Garfinkel, 2011
WE EXPECT THIS PROCESS TO BE EFFICIENT AND RELIABLE
Versus commercial forensics tools and string search…
Bulk Extractor processing image courtesy Simson Garfinkel, 2011
THE FRONT END
EXAMINING THE FEATURES
“WHERE'S WALDO”?
CHECKING THE USUAL SUSPECTS: EXIF
File name       : P1010395.JPG
File size       : 2864229 Bytes
MIME type       : image/jpeg
Image size      : 2304 x 3072
Camera make     : Panasonic
Camera model    : DMC-FX07
Image timestamp : 2011:06:16 11:41:03
Image number    :
Exposure time   : 1/30 s
Aperture        : F2.8
Exposure bias   : 0 EV
Flash           : Yes, auto, red-eye reduction
Flash bias      : 0
Focal length    : 4.6 mm (35 mm equivalent: 28.0 mm)
Subject distance:
ISO speed       : 100
Exposure mode   : Auto
Metering mode   : Multi-segment
Macro mode      : Off
Image quality   : High
Exif Resolution : 2304 x 3072
White balance   : Auto
Thumbnail       : image/jpeg, 8406 Bytes
Copyright       :
Exif comment    :
File name       : P1010395-1.JPG
File size       : 2874083 Bytes
MIME type       : image/jpeg
Image size      : 2304 x 3072
Camera make     : Panasonic
Camera model    : DMC-FX07
Image timestamp : 2011:06:16 11:41:03
Image number    :
Exposure time   : 1/30 s
Aperture        : F2.8
Exposure bias   : 0 EV
Flash           : Yes, auto, red-eye reduction
Flash bias      : 0
Focal length    : 4.6 mm (35 mm equivalent: 28.0 mm)
Subject distance:
ISO speed       : 100
Exposure mode   : Auto
Metering mode   : Multi-segment
Macro mode      : Off
Image quality   : High
Exif Resolution : 2304 x 3072
White balance   : Auto
Thumbnail       : image/jpeg, 8250 Bytes
Copyright       :
Exif comment    :
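The two EXIF listings above differ in only a few fields, which is easier to see programmatically than by eye. The following sketch parses "key : value" text dumps and reports the fields that differ. The `parse_dump` and `diff_fields` names are hypothetical, and the sample dumps are abbreviated copies of the listings, not the output of a specific tool invocation.

```python
# Abbreviated copies of the two listings, kept short for illustration.
DUMP_A = """File name       : P1010395.JPG
File size       : 2864229 Bytes
Camera model    : DMC-FX07
Image timestamp : 2011:06:16 11:41:03
Thumbnail       : image/jpeg, 8406 Bytes"""

DUMP_B = """File name       : P1010395-1.JPG
File size       : 2874083 Bytes
Camera model    : DMC-FX07
Image timestamp : 2011:06:16 11:41:03
Thumbnail       : image/jpeg, 8250 Bytes"""


def parse_dump(text: str) -> dict:
    """Turn 'key : value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    for field, (left, right) in diff_fields(parse_dump(DUMP_A), parse_dump(DUMP_B)).items():
        print(f"{field}: {left!r} != {right!r}")
```

Splitting on the first colon only keeps values such as the timestamp intact, since they contain colons themselves.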
CHECKING THE USUAL SUSPECTS: HASH
MD5 sum: d983482b1a6aa6a2b9fa1ef2611d2466
Filename: P1010395.JPG
MD5 sum: 48b660d2c28f302eed5dfb56e89f149e
Filename: P1010395-1.JPG
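A cryptographic hash such as MD5 answers only one question: are two byte streams identical? The sketch below, using Python's standard `hashlib`, shows that a single changed byte produces a different digest. The `md5_of_stream` name and the sample byte strings are assumptions for illustration.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Two hypothetical payloads that differ in the final byte only.
    original = io.BytesIO(b"JPEG payload" + b"\x00" * 64)
    altered = io.BytesIO(b"JPEG payload" + b"\x00" * 63 + b"\x01")
    print(md5_of_stream(original))
    print(md5_of_stream(altered))
```

The same function works on a file opened with `open(path, "rb")`. Because the digests carry no notion of closeness, near-duplicate files look as unrelated as completely different ones.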
A DIFFERENT APPROACH: FUZZY HASHING
Fuzzy hash analysis: Files match 93.75%
MD5 sum: d983482b1a6aa6a2b9fa1ef2611d2466
Filename: P1010395.JPG
*deep fuzzy hash signature:
wxqNw5+Nh5/Nh4vNE4tBE0ZBjUYgjSog;
MD5 sum: 48b660d2c28f302eed5dfb56e89f149e
Filename: P1010395-1.JPG
*deep fuzzy hash signature:
wxqNw5+Nh5/Nh4vNE4tBE0ZBjUYgjSog;
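The general idea behind fuzzy hashing can be sketched in a few lines: cut the input into pieces at positions chosen by the content itself, reduce each piece to one character, and compare the resulting signatures as strings. The code below is a toy illustration of that idea under stated assumptions. It is not the algorithm used by any real fuzzy hashing tool, its scores are not comparable to the percentage shown on the slide, and the names `piecewise_signature` and `similarity` are hypothetical.

```python
import hashlib
import string
from difflib import SequenceMatcher

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
WINDOW = 7  # number of trailing bytes that decide where a piece ends


def piecewise_signature(data: bytes, trigger: int = 64) -> str:
    """Split data at content-defined boundaries and emit one character per piece."""
    signature = []
    piece_start = 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        # A boundary depends only on the last few bytes, so an edit early in
        # the file does not move boundaries far away from the edit.
        if sum(window) % trigger == trigger - 1:
            piece = data[piece_start: i + 1]
            signature.append(ALPHABET[hashlib.md5(piece).digest()[0] % 64])
            piece_start = i + 1
    if piece_start < len(data):  # hash whatever remains after the last boundary
        piece = data[piece_start:]
        signature.append(ALPHABET[hashlib.md5(piece).digest()[0] % 64])
    return "".join(signature)


def similarity(sig_a: str, sig_b: str) -> float:
    """Return a 0-100 score based on the longest matching runs of characters."""
    # autojunk=False keeps frequent characters from being ignored in long signatures.
    return 100.0 * SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()


if __name__ == "__main__":
    original = bytes(range(256)) * 40
    edited = original[:5000] + b"inserted bytes" + original[5000:]
    sig_a = piecewise_signature(original)
    sig_b = piecewise_signature(edited)
    print(f"similarity: {similarity(sig_a, sig_b):.2f}%")
```

Unlike the MD5 comparison, a small edit here changes only the signature characters for the pieces it touches, so the two signatures remain largely alike.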
BITCURATOR - WHAT WE’RE DOING
Developing tools, methods, and approaches for collecting professionals that
incorporate the functionality of modern digital forensics tools (focusing on open
source)
Cultivating professional connections and community building around the methods and
software
Generating and disseminating supporting documentation
BITCURATOR - WHY WE’RE DOING IT
There are already many cases of self-contained Linux-based packages that bundle
many of the tools in order to support digital forensics activities. However, they are not
very approachable to library/archives professionals in terms of interface and
documentation.
Some ongoing fundamental needs:
Incorporation into the workflow of archives/library ingest and collection
management, e.g. metadata conventions, hooks into existing CMS environments.
Provision of public access to the data. The typical digital forensics scenario is a
criminal investigation in which the public never gets access to the evidence that
was seized. By contrast, collecting institutions who are creating disk images face
issues of how to provide access to the data. This includes access interface issues,
but also how to redact or restrict access to components of the image, based on
confidentiality, intellectual property or other sensitivities.
DATA ANALYTICS AND REPORTING
Knowing what you have, finding sensitive info, identifying problem areas
FORENSIC AUGMENTATION OF EXISTING WORKFLOWS
Every workflow has weak spots and compromises. BitCurator will provide forensic software
tools that can be deployed independently and as supporting microservices.
WHO AND WHEN
Funded by Andrew W. Mellon Foundation - October 1, 2011 – September 30, 2013
Partners: SILS and Maryland Institute for Technology in the Humanities (MITH)
Core Team:
–Cal Lee, PI
–Matt Kirschenbaum, Co-PI
–Kam Woods, Technical Lead
–Alex Chassanoff, Project Manager (UNC SILS)
–Porter Olsen (MITH)
Professional Experts Panel
• Bradley Daigle, University of Virginia Library
• Erika Farr, Emory University
• Jeremy Leighton John, British Library
• Leslie Johnston, Library of Congress
• Courtney Mumma, City of Vancouver Archives
• Naomi Nelson, Duke University
• Erin O’Meara
• Michael Olson, Stanford University Libraries
• Gabriela Redwine, Harry Ransom Center, University of Texas, Austin
• Susan Thomas, Digital Archivist, Bodleian Library, University of Oxford
Development Advisory Group
• Geoffrey Brown, Indiana University
• Barbara Guttman, National Institute of Standards and Technology
• Jerome McDonough, University of Illinois
• Mark Matienzo, Yale University
• David Pearson, National Library of Australia
• Doug Reside, New York Public Library
• Seth Shaw, University Archives, Duke University
• William Underwood, Georgia Tech
• Peter Van Garderen, Artefactual Systems
FOLLOW US ONLINE
Read our blog, find out more about the project staff, examine FAQs, download software,
and more.
http://www.bitcurator.net/
On Twitter @bitcurator