End to End Life Cycle Management for Research Data
Capturing Metadata Throughout the Research Pipeline and Facilitating the Handoff to Formal Curation
Jacob Farmer, CTO Cambridge Computer
© Copyright 2009-2013, Cambridge Computer Services, Inc. – All Rights Reserved
www.CambridgeComputer.com – 781-250-3000

A Little Background On Cambridge Computer
Founded in 1991 as a boutique integrator for backup and archive solutions
Approximately 75 employees nationwide
Clients of all shapes and sizes across all industries
• Particularly strong in research and higher ed
Industry-wide reputation for:
• defining best practices for enterprise class data protection, and
• the early adoption of next generation storage solutions
A unique business model that allows us to straddle the fence between academia and industry

Seminars and Workshops Through The Usenix Association
Tiered Storage and Archiving: Best Practices for Data Life Cycle Management and Digital Preservation
Cornell, Dartmouth, Duke, Harvard, Penn
LISA Data Storage Day
• Storage Virtualization
• Application Acceleration with Solid State
• A Crash Course in Object Storage
LISA Conference, Broad Institute, Georgia State, University of Maryland, Davenport, Princeton

PASIG - 2013 End to End Life Cycle Management for Research

Our Product: Starfish

Our Project – Defining Best Practices for File Management
Inspiration for our project comes from SRB/iRODS
• Bring parts of the SRB/iRODS vision to reality
– Define a general purpose feature set
– Intuitive user interface
– Simplified API
Inspiration also comes from numerous homegrown solutions in our client base.
The paradigm:
• Stat() your file systems
• Make database records for each file and/or directory
• Relate metadata to the file and directory records
• Report and/or take action

Starfish - *FS Virtual Global File System
• It’s not really a file system, but it looks like one and serves as a hierarchical catalog of files
Like a file system
• CIFS and POSIX permissions
• File system attributes and extended attributes
But more
• User specified metadata
• Persistent addresses
• Versioning
• Point in time collections

Basic Starfish Topology

Targeted Use Cases
1) Data life cycle management for unstructured data at very large scale
• Scientific research data
• Media / entertainment workflows
• Engineering data
2) Storage middleware for digital asset management systems at very large scale
• Fixity automation
• Backup and restore
• Tiered storage
• Persistent file addresses / links
• Cloud interface

Typical Content Management “Stack”

Inserting File System Middleware

Simple Storage Workflow While Mirroring File Systems to Object Store

Metadata is the Great Enabler
Collaboration
• How else would researchers know what to do with one another’s data?
• How can data be organized to meet different groups’ needs?
Storage management policies
• How does a storage management system know what to do with your files? File system attributes are not descriptive enough.
Preservation / retrieval / provenance
• How do you know what to keep?
• How do you find it again?
• How do you know what it was used for and when?
Reporting / chargeback
• File system permissions are not descriptive enough.

What Would a Metadata System for Research Data Look Like?
Very flexible
Allows scientists to work the way they want to work
Out of the data path
• The system cannot introduce latency to file I/O
Enormous scale
• Billions of files, petabytes of capacity, 1000s of file systems
Device / vendor independence
• Must work with all storage devices, object stores, clouds, etc.
API driven

The Real Trick – Getting the Metadata
The Golden Rule of Data Preservation – “Preserve at the time of creation”
• Translation: Capture metadata throughout the research pipeline
Perhaps capture metadata when storage is provisioned
• This presumes that there is a structured process for provisioning storage
Capture metadata through an API
• This requires a simple API that anyone can use
Programmatically extract metadata from file headers, tags, and content
Capture metadata through a GUI
• Try to create incentives for users to key in metadata

Getting from Here to There

Problem Statements for Research Data Management
Scientists don’t want to enter metadata
No one wants to pay for long term storage
Data management planning disconnect between grant applicants and their institutions
There are more pressing problems related to storing data
• Collaboration
• Cost control: Chargeback, Showback, Tiering
• Backup
Organizational gridlock
• Conflicting priorities
• Unspecific mandates

Yes, We Too Have a Triangle!

Where it Starts: Scalable and Flexible Backup/Archive
(Diagram labels: NAS Backup Clients; Disk-Based Object Storage; Tape Archive; NAS or File Server; Cloud Service)

How To Play

Looking for Collaborators
The ideal collaborator:
• Has an immediate need that is within our current feature set and scale
– This tells us that you can/will invest time with us
• Has additional needs that will put us to the test
• Is an existing client of Cambridge Computer, or
– Is willing to become one, or
– Is able to contribute some funds
– Is able to make a meaningful investment in time
If not now, maybe next year!
• Email me: jfarmer@CambridgeComputer.com
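
Appendix sketch: the paradigm from the file management slide (stat the file systems, make database records, relate metadata, report) can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions and is not how Starfish is implemented: the SQLite schema and the function names `open_catalog`, `scan`, `tag`, and `report` are hypothetical, chosen only for illustration, and a real system at the scale described here would need a very different design.

```python
import os
import sqlite3


def open_catalog(db_path=":memory:"):
    """Create a hypothetical catalog: one row per file or directory, plus key/value metadata."""
    conn = sqlite3.connect(db_path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS entries (
            path   TEXT PRIMARY KEY,
            is_dir INTEGER NOT NULL,
            size   INTEGER NOT NULL,
            mtime  REAL NOT NULL
        );
        CREATE TABLE IF NOT EXISTS metadata (
            path  TEXT NOT NULL,
            key   TEXT NOT NULL,
            value TEXT NOT NULL,
            PRIMARY KEY (path, key)
        );
        """
    )
    return conn


def _record(conn, path, is_dir):
    """Stat one path (without following symlinks) and upsert its record."""
    st = os.lstat(path)
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
        (path, int(is_dir), st.st_size, st.st_mtime),
    )


def scan(conn, root):
    """Steps 1 and 2: stat everything under root and make a database record for it."""
    for dirpath, _dirnames, filenames in os.walk(root):
        _record(conn, dirpath, True)
        for name in filenames:
            _record(conn, os.path.join(dirpath, name), False)
    conn.commit()


def tag(conn, path, key, value):
    """Step 3: relate a user-supplied key/value pair to an existing record."""
    found = conn.execute("SELECT 1 FROM entries WHERE path = ?", (path,)).fetchone()
    if found is None:
        raise KeyError(path)
    conn.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)", (path, key, value))
    conn.commit()


def report(conn, key):
    """Step 4: file count and total bytes, grouped by the value of one metadata key."""
    rows = conn.execute(
        """
        SELECT m.value, COUNT(*), SUM(e.size)
        FROM metadata AS m JOIN entries AS e ON e.path = m.path
        WHERE m.key = ? AND e.is_dir = 0
        GROUP BY m.value
        ORDER BY m.value
        """,
        (key,),
    )
    return rows.fetchall()
```

A caller would run `scan(conn, "/some/share")` on a schedule, attach tags such as `tag(conn, path, "project", "genomics")`, and use `report(conn, "project")` as the basis for chargeback or tiering decisions. Because the catalog is built by scanning rather than by intercepting I/O, the sketch stays out of the data path, in keeping with the requirements slide.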