End to End Life Cycle Management for Research Data

Capturing Metadata Throughout the Research Pipeline and Facilitating the Handoff to Formal Curation
Jacob Farmer, CTO
Cambridge Computer
© Copyright 2009-2013, Cambridge Computer Services, Inc. – All Rights Reserved
www.CambridgeComputer.com – 781-250-3000
A Little Background On Cambridge Computer
Founded in 1991 as a boutique integrator for backup and archive solutions
Approximately 75 employees nationwide
Clients of all shapes and sizes across all industries
• Particularly strong in research and higher ed
Industry-wide reputation for:
• defining best practices for enterprise-class data protection, and
• the early adoption of next-generation storage solutions
A unique business model that allows us to straddle the fence between academia and industry
Seminars and Workshops Through the USENIX Association
Tiered Storage and Archiving: Best Practices for Data Life Cycle Management and Digital Preservation
Cornell, Dartmouth, Duke, Harvard, Penn
LISA Data Storage Day
• Storage Virtualization
• Application Acceleration with Solid State
• A Crash Course in Object Storage
LISA Conference, Broad Institute, Georgia State, University of Maryland, Davenport, Princeton
PASIG - 2013
End to End Life Cycle Management for Research
Our Product: Starfish
Our Project – Defining Best Practices for File Management
Inspiration for our project comes from SRB/iRODS
• Bring parts of the SRB/iRODS vision to reality
– Define a general purpose feature set
– Intuitive user interface
– Simplified API
Inspiration also comes from numerous homegrown solutions in our client base.
The paradigm:
• Stat() your file systems
• Make database records for each file and/or directory
• Relate metadata to the file and directory records
• Report and/or take action
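The four steps of the paradigm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Starfish implementation: the SQLite schema, the table names (files, tags), and the helper names (scan, tag, report_bytes_by) are all hypothetical.

```python
import os
import sqlite3


def scan(root, conn):
    """Steps 1 and 2: stat() every file under `root` and keep one row per file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime REAL, uid INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        "path TEXT, key TEXT, value TEXT, PRIMARY KEY (path, key))"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (path, st.st_size, st.st_mtime, st.st_uid),
            )
    conn.commit()


def tag(conn, path, key, value):
    """Step 3: relate a user-specified key/value pair to a file record."""
    conn.execute(
        "INSERT OR REPLACE INTO tags VALUES (?, ?, ?)", (path, key, value)
    )
    conn.commit()


def report_bytes_by(conn, key):
    """Step 4: report total bytes grouped by the value of one metadata key."""
    rows = conn.execute(
        "SELECT t.value, SUM(f.size) FROM files f "
        "JOIN tags t ON t.path = f.path WHERE t.key = ? "
        "GROUP BY t.value",
        (key,),
    )
    return dict(rows.fetchall())
```

A call such as report_bytes_by(conn, "project") would give the kind of per-project capacity figure that a chargeback report needs. Because the scan only calls stat(), it stays out of the data path.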
Starfish - *FS
Virtual Global File System
• It’s not really a file system, but it looks like one and serves as a hierarchical catalog of files
Like a file system
• CIFS and POSIX permissions
• File system attributes and extended attributes
But more
• User specified metadata
• Persistent addresses
• Versioning
• Point in time collections
Basic Starfish Topology
Targeted Use Cases
1) Data life cycle management for unstructured data at very large scale
• Scientific research data
• Media / entertainment workflows
• Engineering data
2) Storage middleware for digital asset management systems at very large scale
• Fixity automation
• Backup and restore
• Tiered storage
• Persistent file addresses / links
• Cloud interface
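As an illustration of what fixity automation means in practice, the sketch below recomputes a checksum for each file and compares it with a previously recorded value. It assumes SHA-256 and a simple in-memory manifest (path to expected digest). Both are illustrative choices, and the function names are hypothetical rather than part of any product API.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Return the paths whose current digest differs from the recorded one.

    `manifest` maps path -> expected hex digest. A file that cannot be
    read (missing, permission denied) is reported as a failure too.
    """
    failures = []
    for path, expected in manifest.items():
        try:
            actual = sha256_of(path)
        except OSError:
            failures.append(path)
            continue
        if actual != expected:
            failures.append(path)
    return failures
```

Run on a schedule against the catalog, a check like this turns fixity from a manual audit into a routine report.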
Typical Content Management “Stack”
Inserting File System Middleware
Simple Storage Workflow While Mirroring File Systems to Object Store
Metadata is the Great Enabler
Collaboration
• How else would researchers know what to do with one another’s data?
• How can data be organized to meet different groups’ needs?
Storage management policies
• How does a storage management system know what to do with your files? File system attributes are not descriptive enough.
Preservation / retrieval / provenance
• How do you know what to keep?
• How do you find it again?
• How do you know what it was used for and when?
Reporting / chargeback
• File system permissions are not descriptive enough.
What Would a Metadata System for Research Data Look Like?
Very flexible
Allows scientists to work the way they want to work
Out of the data path
• The system cannot introduce latency to file I/O
Enormous scale
• Billions of files, petabytes of capacity, thousands of file systems
Device / vendor independence
• Must work with all storage devices, object stores, clouds, etc.
API driven
The Real Trick – Getting the Metadata
The Golden Rule of Data Preservation – “Preserve at the time of creation”
• Translation: Capture metadata throughout the research pipeline
Perhaps capture metadata when storage is provisioned
• This presumes that there is a structured process for provisioning storage
Capture metadata through an API
• This requires a simple API that anyone can use
Programmatically extract metadata from file headers, tags, and content
Capture metadata through a GUI
• Try to create incentives for users to key in metadata
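As a small illustration of programmatic extraction from file headers, the sketch below reads the fixed-layout IHDR chunk at the start of a PNG file. PNG is used only because its header layout is simple and well documented. Research formats such as FITS, HDF5, or DICOM would each need their own parser, and the function name here is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_header_metadata(path):
    """Return basic metadata from a PNG header, or None if not a PNG.

    Layout read here: 8-byte signature, 4-byte chunk length,
    4-byte chunk type ("IHDR"), then width and height (big-endian
    32-bit unsigned), bit depth and color type (one byte each).
    """
    with open(path, "rb") as fh:
        head = fh.read(26)
    if len(head) < 26 or head[:8] != PNG_SIGNATURE or head[12:16] != b"IHDR":
        return None
    width, height, bit_depth, color_type = struct.unpack(">IIBB", head[16:26])
    return {
        "format": "PNG",
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
    }
```

The returned dictionary could then be stored as metadata against the file's catalog record, so nobody has to key it in by hand.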
Getting from Here to There
Problem Statements for Research Data Management
Scientists don’t want to enter metadata
No one wants to pay for long-term storage
Data management planning disconnect between grant applicants and their institutions
There are more pressing problems related to storing data
• Collaboration
• Cost control: Chargeback, Showback, Tiering
• Backup
Organizational gridlock
• Conflicting priorities
• Unspecific mandates
Yes, We Too Have a Triangle!
Where it Starts: Scalable and Flexible Backup/Archive
[Diagram labels: NAS; Backup Clients; Disk-Based Object Storage; Tape Archive; NAS or File Server; Cloud Service]
How To Play
Looking for Collaborators
The ideal collaborator:
• Has an immediate need that is within our current feature set and scale
– This tells us that you can/will invest time with us
• Has additional needs that will put us to the test
• Is an existing client of Cambridge Computer, or
– Is willing to become one, or
– Is able to contribute some funds
– Is able to make a meaningful investment in time
If not now, maybe next year!
• Email me: jfarmer@CambridgeComputer.com