Data Triage and Data Analytics for Personal Digital Collections
Kam Woods
CRADLE Talk Series, January 27 2012
University of North Carolina at Chapel Hill
The Andrew W. Mellon Foundation

SCALE AND THE GROWING PROBLEM OF PERSONAL COLLECTIONS
- Dramatic increases in the size of personal data collections over the past decade point to significant preservation problems in the future
- Management of personal data collections increasingly depends on sophisticated automation to identify PII and to perform search, backup, deduplication, and preservation tasks
- Most of the advanced automation facilities that modern operating systems use to store and manage such data (or that forensic analytics software uses to locate and process it) are not reflected in the functionality of most preservation software
- Finding ways to translate this automation to preservation packages is key to long-term preservation
- Issues of context: the further we get from the original context, the harder it gets to reconstruct the original environment

KEY ISSUES FOR ACQUISITION OF PERSONAL DIGITAL COLLECTIONS IN COLLECTING INSTITUTIONS
- Locating and protecting personally identifying, private, and/or sensitive information located on fixed and removable digital media
- Providing archivists and acquisitions staff with the tools required to quickly and accurately assess raw data from donor media
- Generating metadata that facilitates interoperability between tools and can readily be crosswalked to current collections and preservation standards

FORENSIC DATA TRIAGE WISH LIST FOR [PERSONAL DIGITAL] COLLECTIONS
Handling Private and Sensitive Data
- email addresses / email header data
- credit card numbers
- search terms / URL / history
- phone numbers
- GPS coordinates
- EXIF metadata
- word lists
Establishing ground truth
- residual and system data
- deleted files
- slack space between files
- cached data
- Windows pagefile
- hibernation file
- Registry
- system and device logs
Patterns of use and activity to support identity and authenticity
- User activity
- Filesystem permissions
- Device access
Annotations, metadata, and context
- Logs, events, and filesystem activity to record chain of custody (production, pre-ingest, in-archive)
- DFXML
- PREMIS events, METS records, other techMD

WHY IS HANDLING PRIVATE AND SENSITIVE DATA SUCH A PERVASIVE PROBLEM?
- Objects created using modern operating systems have complex structural, semantic, and relational properties not necessarily captured within the file format
- Much of this information is lost in staging procedures that simply copy files out of the filesystem
- Retaining disk images (raw or forensically packaged bitstreams) eliminates this loss. Because much of the original context is maintained, it becomes easier to distinguish between private and non-private data on donor media.
- Digital forensics tools can capture and process this data, but the language of digital forensics is often confusing or inappropriate for digital archivists

A MODEL FOR A FORENSICALLY ENHANCED WORKFLOW
[Workflow diagram. Labels, in extracted order:]
- Extraction of fixed media
- Acquisition of raw disk image and forensic packaging
- Donor device
- Staging area
- Packaging for ingest
- Preparation of redacted image / permissions overlays + crosswalks to archival metadata
- Context-sensitive identification of private and confidential info
- Acquisition metadata (afflib)
- Filesystem metadata (fiwalk)
- Dissemination of full/redacted images (distributed)
- Archival storage
- File-level access (AFFLIB via C, Python API hooks)

AUTOMATED REPORTING FOR DATA ON DIGITAL MEDIA DISK IMAGES
Extract filesystem metadata as DFXML
Summary report
- Volume information
- File type histograms
- Location of user data by profile
- Disk utilization
- Deleted and corrupted files
- Encryption

AUTOMATED REPORTING FOR DATA ON DONOR MEDIA
Extract Registry hives (Windows) or equivalent login and system info
Flag unique user identifiers
- SID S-1-5-32-1045337234-12924708993-5683276719-19000 (Revision level, identifier authority, subauthority, etc.)
Identify external device usage
Advanced report
- Number of active users
- User activity
- Removable storage
- Network transfers
- Applications installed and recently used

WHAT ABOUT PRIVATE AND SENSITIVE DATA?
Context-sensitive extraction of private and sensitive data
Bulk Extractor processing image courtesy Simson Garfinkel, 2011

WE EXPECT THIS PROCESS TO BE EFFICIENT AND RELIABLE
Versus commercial forensics tools and string search…
Bulk Extractor processing image courtesy Simson Garfinkel, 2011

THE FRONT END

EXAMINING THE FEATURES

“WHERE'S WALDO”?
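
The kind of feature extraction discussed in the preceding slides (pulling email addresses, phone numbers, and credit card numbers out of raw bytes, with their offsets) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Bulk Extractor itself is implemented: the patterns are simplified and hypothetical, and the names `PATTERNS`, `luhn_valid`, and `scan_buffer` are invented for this example.

```python
import re

# Hypothetical, simplified byte patterns; real tools use far more careful scanners.
PATTERNS = {
    "email": re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(rb"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
    "ccn": re.compile(rb"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def scan_buffer(buffer: bytes):
    """Return (offset, kind, value) tuples for candidate features in raw bytes."""
    features = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(buffer):
            value = match.group().decode("ascii")
            # Discard digit runs that merely look like card numbers.
            if kind == "ccn" and not luhn_valid(value):
                continue
            features.append((match.start(), kind, value))
    return sorted(features)


# Made-up sample bytes standing in for a chunk read from a disk image.
sample = b"From: donor@example.org\x00\x00call 919-555-0100 card 4111111111111111"
for offset, kind, value in scan_buffer(sample):
    print(offset, kind, value)
```

Because the scan works on raw bytes rather than on files, the same approach applies to deleted files, slack space, and other residual data; recording the byte offset is what lets a reviewer trace each feature back to its location on the media.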
CHECKING THE USUAL SUSPECTS: EXIF
File name       : P1010395.JPG
File size       : 2864229 Bytes
MIME type       : image/jpeg
Image size      : 2304 x 3072
Camera make     : Panasonic
Camera model    : DMC-FX07
Image timestamp : 2011:06:16 11:41:03
Image number    :
Exposure time   : 1/30 s
Aperture        : F2.8
Exposure bias   : 0 EV
Flash           : Yes, auto, red-eye reduction
Flash bias      : 0
Focal length    : 4.6 mm (35 mm equivalent: 28.0 mm)
Subject distance:
ISO speed       : 100
Exposure mode   : Auto
Metering mode   : Multi-segment
Macro mode      : Off
Image quality   : High
Exif Resolution : 2304 x 3072
White balance   : Auto
Thumbnail       : image/jpeg, 8406 Bytes
Copyright       :
Exif comment    :

File name       : P1010395-1.JPG
File size       : 2874083 Bytes
MIME type       : image/jpeg
Image size      : 2304 x 3072
Camera make     : Panasonic
Camera model    : DMC-FX07
Image timestamp : 2011:06:16 11:41:03
Image number    :
Exposure time   : 1/30 s
Aperture        : F2.8
Exposure bias   : 0 EV
Flash           : Yes, auto, red-eye reduction
Flash bias      : 0
Focal length    : 4.6 mm (35 mm equivalent: 28.0 mm)
Subject distance:
ISO speed       : 100
Exposure mode   : Auto
Metering mode   : Multi-segment
Macro mode      : Off
Image quality   : High
Exif Resolution : 2304 x 3072
White balance   : Auto
Thumbnail       : image/jpeg, 8250 Bytes
Copyright       :
Exif comment    :

CHECKING THE USUAL SUSPECTS: HASH
MD5 sum: d983482b1a6aa6a2b9fa1ef2611d2466
Filename: P1010395.JPG
MD5 sum: 48b660d2c28f302eed5dfb56e89f149e
Filename: P1010395-1.JPG

A DIFFERENT APPROACH: FUZZY HASHING
Fuzzy hash analysis: Files match 93.75%
MD5 sum: d983482b1a6aa6a2b9fa1ef2611d2466
Filename: P1010395.JPG
*deep fuzzy hash signature: wxqNw5+Nh5/Nh4vNE4tBE0ZBjUYgjSog;
MD5 sum: 48b660d2c28f302eed5dfb56e89f149e
Filename: P1010395-1.JPG
*deep fuzzy hash signature: wxqNw5+Nh5/Nh4vNE4tBE0ZBjUYgjSog;

BITCURATOR - WHAT WE’RE DOING
- Developing tools, methods, and approaches for collecting professionals that incorporate the functionality of modern digital forensics tools (focusing on open source)
- Cultivating professional connections and community building around the methods and software
- Generating and disseminating supporting documentation

BITCURATOR - WHY WE’RE DOING IT
Many self-contained Linux-based packages already bundle tools to support digital forensics activities. However, their interfaces and documentation are not very approachable for library/archives professionals.
Some ongoing fundamental needs:
- Incorporation into the workflow of archives/library ingest and collection management, e.g. metadata conventions, hooks into existing CMS environments.
- Provision of public access to the data. The typical digital forensics scenario is a criminal investigation in which the public never gets access to the evidence that was seized. By contrast, collecting institutions that create disk images must decide how to provide access to the data. This includes access interface issues, but also how to redact or restrict access to components of the image, based on confidentiality, intellectual property, or other sensitivities.

DATA ANALYTICS AND REPORTING
Knowing what you have, finding sensitive info, identifying problem areas

FORENSIC AUGMENTATION OF EXISTING WORKFLOWS
Every workflow has weak spots and compromises. BitCurator will provide forensic software tools that can be deployed independently and as supporting microservices.

WHO AND WHEN
Funded by Andrew W. Mellon Foundation: October 1, 2011 – September 30, 2013
Partners: SILS and Maryland Institute for Technology in the Humanities (MITH)
Core Team:
- Cal Lee, PI
- Matt Kirschenbaum, Co-PI
- Kam Woods, Technical Lead
- Alex Chassanoff, Project Manager (UNC SILS)
- Porter Olsen (MITH)

Professional Experts Panel
- Bradley Daigle, University of Virginia Library
- Erika Farr, Emory University
- Jeremy Leighton John, British Library
- Leslie Johnston, Library of Congress
- Courtney Mumma, City of Vancouver Archives
- Naomi Nelson, Duke University
- Erin O’Meara
- Michael Olson, Stanford University Libraries
- Gabriela Redwine, Harry Ransom Center, University of Texas, Austin
- Susan Thomas, Digital Archivist, Bodleian Library, University of Oxford

Development Advisory Group
- Geoffrey Brown, Indiana University
- Barbara Guttman, National Institute of Standards and Technology
- Jerome McDonough, University of Illinois
- Mark Matienzo, Yale University
- David Pearson, National Library of Australia
- Doug Reside, New York Public Library
- Seth Shaw, University Archives, Duke University
- William Underwood, Georgia Tech
- Peter Van Garderen, Artefactual Systems

FOLLOW US ONLINE
Read our blog, find out more about the project staff, examine FAQs, download software, and more.
http://www.bitcurator.net/
On Twitter @bitcurator
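
As a supplement to the hash slides earlier in the deck, the contrast between an exact hash and a similarity-preserving signature can be sketched as follows. This is a deliberately simplified stand-in: it hashes fixed-size blocks and compares them position by position, whereas fuzzy hashing tools such as ssdeep use context-triggered piecewise hashing that tolerates inserted or deleted bytes. The block size and the names `md5_hex`, `block_signature`, and `block_similarity` are assumptions made for this illustration, and the test data is synthetic rather than taken from the photographs on the slides.

```python
import hashlib
from typing import List

BLOCK_SIZE = 1024  # hypothetical block size chosen for illustration


def md5_hex(data: bytes) -> str:
    """Exact hash: any single-byte change produces a different digest."""
    return hashlib.md5(data).hexdigest()


def block_signature(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each fixed-size block separately and keep a short prefix of each."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()[:8]
        for i in range(0, len(data), block_size)
    ]


def block_similarity(a: bytes, b: bytes) -> float:
    """Fraction of block positions whose hashes agree (1.0 means identical)."""
    sig_a, sig_b = block_signature(a), block_signature(b)
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / longest


# Synthetic data: 16 KiB of distinct blocks, plus a copy with one byte flipped.
original = b"".join(i.to_bytes(4, "big") for i in range(4096))
modified = bytearray(original)
modified[5000] ^= 0xFF
edited = bytes(modified)

print(md5_hex(original) == md5_hex(edited))  # exact hashes no longer agree
print(block_similarity(original, edited))    # most blocks still agree
```

The exact digest answers only "are these byte-for-byte identical?", while the block-level comparison gives a graded answer that can flag near-duplicates for review during triage.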