File and Metadata Management Breakout
• File & Meta-data Management
  – Mechanisms
  – Policy
  – Replication
• Breakout Considerations
  – Who are the early adopter/active communities?
    • To gather detailed requirements
  – How uniform are the requirements within the community?
    • Are there gaps? Revise emphasis?
  – Targets over the next 12 months
    • Longer term?

NeSC Workshop - February 2007

Reporting back
• Changes to docs
  – Identify cross-links that are needed
• Suggested actions
  – Specific if we can
  – OMII UK could do X, JCSR could do Y

Summary

Deliverables in document
• Automatic data annotation and provenance tools to support domain specific schema
• Mechanisms to support controlled and convenient sharing of files between groups
• Best practice document to support research groups in developing their own data curation and file management policy
• Development of common annotation schemes for individual communities to enable consistent metadata labeling within these communities

• A) SW Evaluation & report: We need to evaluate commonly available network file systems (GPFS, PVFS, etc) in comparison to distributed file management tools (SRB, SRM, dCache, etc)
  – What are the criteria? Include collaboration, deployment, ease of use, cost, etc
  – Communication with other communities, esp HPC and DCC; workshop on available tools to follow on
• Short term
• ETF/NGS – Dave Wallom
• Targeted community: general

• B) Best practice document: current metadata and annotation practices, and possible policies
  – Aim is that to contribute data you have to have a policy for how annotation and metadata will be done (within standards for interop)?
• Medium term – could start now
• JISC could do a call – Ann Borda
• Targeted community: general

• C) Information: What standards are available for successful data curation, and DCC workshop on metadata standards
• Medium term
• DCC with JISC – Chris R.
• Targeted community: general

• D) Why aren’t common tools available (eg from the DCC) being used by the scientists? Workshop/outreach in this space might be a help
• Short term
• DCC? – Chris R.

• E) Reporting already completed evaluations of institutional repository systems
  – Linking to a common place
  – Making sure criteria are comparable, etc
  – Goal: JISC/JCSR could come up with a recommendation for universities getting involved in this space
• Short term
• JISC – Ann Borda

• F) Survey to understand why open source or commercial solutions for distributed filesystems aren’t in more common use
  – Short term
  – ETF? JISC? – Ann Borda
• G) Survey: Are users using databases and we simply don’t know it since they didn’t mention it?
  – Medium term
  – NeSC? Neil CH

Notes
• Note: Action items in RED
• Document adaptations in GREEN

Access to Data Requirements
• Solutions exist to at least pieces of the problem, but many people didn’t know what was available
• Dave Wallom suggests that there are commercial systems, esp AFS
  – Why aren’t the HEP folks using this now?
• Neil CH – do different groups have different requirements or just seemingly different requirements?
  – Would a tool summary help or a workshop to tell people about it? (DB)
  – Hard to get (new) people to meetings. Do they not care? Do they not know what they need? Are the workshops being targeted in the wrong way? Do the apps folks know this will help them?

Federated repository solutions
• Isn’t in document
• What about tying data to publications?
  – Didn’t really come up except in tying to grants and grant requirements

Data to share
• People also want to share software
  – Files are everything from software to results

We didn’t hear
• Files meant moving around files
  – No mention of bits of files or sub pieces
• Many users assume that how you store data IS in a file; there simply isn’t the line that other (CS) folks would have
• Are users using databases and we simply don’t know it since they didn’t mention it?
  – Perhaps another survey for this (Dave Berry suggestion)

Section naming
• Collaborative file management?

Metadata is key to sharing
• This could be emphasized better
• A guide for metadata standards might help, and a roadmap of where they are going
  – Data curation people and digital library people have methods to address this. Could their formalism be made into best practice to transfer it across?
  – They need to tie standard extensions back to the standard. Perhaps some way to encourage this to happen more is needed?
• Need better tools to add metadata to the “files”
  – Auto-generation (or even simple creation at source) has different requirements depending on the type of data
  – Suggestion (DW): policy recommendation that in order to contribute data you have to have a policy for how this will be done (within standards for interop)?

Metadata 2
• Clarify: People are happy with the metadata frameworks for general knowledge (date stamps, etc)

Before lunch list
• Note: data curation means something very different to a domain scientist than to a curation person
• Why aren’t common tools available (eg from the DCC) being used by the scientists?
  – Outreach in this space might be a help
  – How do file management and curation interact?
• Different projects have different curation needs
  – Make all data accessible
  – Make only some of it accessible after time?

• 3) Dave Wallom suggests that there are commercial systems, esp AFS
  • Survey to understand why this and other commercial solutions aren’t in more common use
• 3) Document: Would a summary of available data tools help or a workshop to tell people about it? (DB)
• 3) Are users using databases and we simply don’t know it since they didn’t mention it?
  • Perhaps another survey for this (Dave Berry suggestion)
• 1) Best practice document: policy recommendation (based on current experimental work) that in order to contribute data you have to have a policy for how annotation and metadata will be done (within standards for interop)?
• 2) Information: What standards are available for successful data curation, and DCC workshop on metadata standards
• 2) Why aren’t common tools available (eg from the DCC) being used by the scientists? Workshop/outreach in this space might be a help

After lunch topics
• 1) Follow on policy
• 2) What is best practice for file management
• 3) Description of tools
• 4) Provenance

What is best practice for file management
• SW Evaluation & report: We need to evaluate commonly available network file systems (GPFS, PVFS, etc) in comparison to distributed file management tools (SRB, SRM, dCache, etc)
  – What are the criteria? Include collaboration, deployment, ease of use, cost, etc
  – Communication with other communities, esp HPC and DCC

Institutional repository systems
• Reporting already completed evaluations of institutional repository systems
  – Linking to a common place
  – Making sure criteria are comparable, etc
  – Goal: JISC/JCSR could come up with a recommendation for universities getting involved in this space

Provenance
• Define better in document – currently means history of how data was created
• Interlinking of metadata across experimental process
• Edit to document – add a phrase to the effect: Once provenance data is collected there will also be a requirement to navigate and analyze the data

• See summary slides for to-dos in order