Storage, Data Movement, Grid, Network subgroup
DOE Data Management Workshop
Day 1, 5-22-04

Plan
• Outline of the report
  – Additions, refinements, and corrections
• Costs and gaps
  – Classification into development, hardening, deployment

Storage, Data Movement, Grid, Network (initial)
• Dynamic Data Storage and Caching
• Robust Terabyte-Scale data movers
• Dataflow automation between components
• Multi-resolution Data movement

The whole system environment in the large (even across a WAN)
• Placement depending on
  – Separate (more abstract, more high level)
    • Mechanism (apropos to layering)
    • Policy (apropos to layering)
  – Includes “Storage Management”
  – Placement QoS and QoS derived from Policy
• Management of replicas
• Access
  – Robust, performant, at large volumes of data (x500-x1000 in 5-10 yrs).
    • N.B. faster than evolution of disk speed
  – Dynamic Data Storage and Caching
    • Includes pre-staging.
  – Support for sending the function and/or query to the data.
  – Access QoS
• Security, authorization, authentication, and access control.
• Dataflow automation between components
  – E.g. apropos to workflows, and systematic integration.

Specialized specific needs
• Multi-resolution Data movement
• Fine-grained object access and latencies

Gaps, costs, priorities
• Cost
  – $
    o >100,000
  – $$
    o >1,000,000
  – $$$
    o >10,000,000
• Priority (low, med, high)
  – High: barrier to Science
  – Med: substantial cost or waste
  – Low: annoying
• Type of work
  – RD, HP, DS

Gaps, Costs, Priorities
• Placement depending on… ($$$; H; RD, HP, DS)
  – Storage Management (storage space availability, quality, etc.)
    • Permanence at the archival scale
    • Investigation of how to do this apropos to Scientific Storage systems
  – Analogs to industry: information life cycle management
  – Appropriate mix of exposed interfaces and hints, with a preference for standard interfaces (as opposed to parochial, per-system interfaces)
  – Automatic and manual configurations need to be investigated
    • Including hints about future accesses.
Physical Considerations
• How to deal with the increase of capacity per device. ($$; H; RD?, DS?)
  – No aspects of performance expand with Moore’s law
  – Possibly mitigated by placement strategy:
    • Mixed “temperature” on the same spindle

Gaps, Costs, and Priorities
• Management of replicas ($$; H; RD, HP, DS)
  – Movement of files
  – Movement of namespaces.
  – Less-than-whole-file level replication.
  – Consistency of replicated files
    • Write once (immutable file) case is an important use case.
    • Investigation of utility of mutable files
    • Trade-off of version management vs. mutable files

Gaps, Costs, and Priorities
• Access (movement) ($$$; H; RD, HP, DS)
  – Access requirements are increasing faster than evolution of disk speed.
  – Exploitation of IP and non-IP based networking
  – Access contention on physically large volumes.
  – Latency vs. small-grained access.
  – Investigate supporting sending the function and/or query to the data.
  – Investigate supporting virtual data techniques
  – Investigation of choice of copies and choice of path
  – Investigation of where to put compression in system architectures.

Gaps, costs…
• Security, authorization, authentication, and access control. ($; M; RD, DS)
  – Investigate expression of access control
    • And how it moves with the data.
• Dataflow automation between components ($$; H; RD, HP, DS)
  – API for wide area distributed computing, exposing as apropos many items mentioned
  – Scheduling, access optimization analogous to query optimization.

Gaps, Costs, Priorities
• Multi-resolution Data movement
  – Restricted to framework and not solving specific problems ($, ?, ?)?
  – Important use case for Office of Science
  – Investigate if a special case of moving functions to the data (appropriate framework).
• Grid