Storage, Data Movement, Grid, Network subgroup DOE Data Management Workshop

Day 1, 5-22-04
Plan
• Outline of the report
– Additions, refinements, and corrections
• Costs and gaps
– Classification into development, hardening, deployment
Storage, Data Movement, Grid, Network (initial)
• Dynamic Data Storage and Caching
• Robust Terabyte-Scale Data Movers
• Dataflow automation between components
• Multi-resolution Data movement
The whole system environment in the large (even across a WAN)
• Placement depending on
– Separate (more abstract, higher level)
• Mechanism (apropos to layering)
• Policy (apropos to layering)
– Includes “Storage Management”
– Placement QoS and QoS derived from policy
• Management of replicas
• Access
– Robust, performant, at large volumes of data (x500-x1000 in 5-10 yrs).
• N.B.: faster than the evolution of disk speed
– Dynamic Data Storage and Caching
• Includes pre-staging.
– Support for sending the function and/or query to the data.
– Access QoS
• Security, authorization, authentication, and access control.
• Dataflow automation between components
– E.g., apropos to workflows and systematic integration.
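The access target above (x500-x1000 in 5-10 yrs) can be restated as a yearly growth factor, which makes it easier to compare against whatever device trend one has in hand. The sketch below is only arithmetic on the figures quoted on the slide; the function name is illustrative and no disk-speed figure is assumed.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over years."""
    return total_factor ** (1.0 / years)


# Gentlest reading of the slide: x500 spread over 10 years.
slowest = annual_growth_factor(500, 10)
# Steepest reading of the slide: x1000 reached in 5 years.
fastest = annual_growth_factor(1000, 5)

print(f"{slowest:.2f}x to {fastest:.2f}x per year")
```

Under these two readings the required access capability roughly doubles to quadruples every year.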
Specialized specific needs
• Multi-resolution Data movement
• Fine-grained object access and latencies
Gaps, costs, priorities
• Cost
– $: >100,000
– $$: >1,000,000
– $$$: >10,000,000
• Priority – low, med, high
– High – barrier to Science
– Med – substantial cost or waste
– Low – annoying
• Type of work
– RD, HP, DS
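The legend above is applied to each gap on the following slides as a tag such as "($$; H; RD, HP, DS)". As a minimal sketch, assuming that semicolon-separated form, the tag can be decoded against the legend. The names `COST_FLOOR`, `PRIORITY`, `Gap`, and `parse_tag` are hypothetical, and the letter "L" for low priority is an assumption (the slides only show H and M).

```python
from dataclasses import dataclass

# Lower bounds taken from the cost legend on the slide.
COST_FLOOR = {"$": 100_000, "$$": 1_000_000, "$$$": 10_000_000}
# "L" is assumed; only "H" and "M" appear in the tags on the slides.
PRIORITY = {"H": "high", "M": "medium", "L": "low"}


@dataclass(frozen=True)
class Gap:
    name: str
    cost: str        # "$", "$$" or "$$$"
    priority: str    # "H", "M" or "L"
    work: tuple      # work-type codes such as ("RD", "HP", "DS")


def parse_tag(name: str, tag: str) -> Gap:
    """Decode a tag of the form '(cost; priority; work, work, ...)'."""
    cost, priority, work = (part.strip() for part in tag.strip("()").split(";"))
    if cost not in COST_FLOOR or priority not in PRIORITY:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return Gap(name, cost, priority, tuple(w.strip() for w in work.split(",")))


gap = parse_tag("Management of replicas", "($$; H; RD, HP, DS)")
print(gap.name, ">", COST_FLOOR[gap.cost], PRIORITY[gap.priority], gap.work)
```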
Gaps, Costs, Priorities
• Placement depending on… ($$$; H; RD, HP, DS)
– Storage management (storage space availability, quality, etc.)
• Permanence at the archival scale
• Investigation of how to do this apropos to scientific storage systems
– Analogs to industry: information life cycle management
– Appropriate mix of exposed interfaces and hints, with a preference for standard interfaces (as opposed to parochial, per-system interfaces)
– Automatic and manual configurations need to be investigated
• Including hints about future accesses.
Physical Considerations
• How to deal with the increase of capacity per device. ($$; H; RD?, DS?)
– No aspect of performance grows with Moore’s law
– Possibly mitigated by placement strategy:
• Mixed “temperature” on the same spindle
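One way to read the "mixed temperature" idea is that frequently accessed (hot) and rarely accessed (cold) data share a spindle, so capacity gets filled without concentrating the access load. The sketch below is a simple greedy placement under that reading, not a method described in the slides; the dataset names, sizes, and access rates are made up for illustration.

```python
def place(datasets, n_spindles, capacity_tb):
    """Greedy placement: hottest data first, onto the least-loaded spindle.

    datasets: list of (name, size_tb, accesses_per_day) tuples.
    Returns one dict per spindle with its access load, used space and items.
    """
    spindles = [{"load": 0.0, "used": 0.0, "items": []} for _ in range(n_spindles)]
    for name, size, rate in sorted(datasets, key=lambda d: d[2], reverse=True):
        # Only spindles with enough free space are candidates.
        candidates = [s for s in spindles if s["used"] + size <= capacity_tb]
        if not candidates:
            raise ValueError(f"no spindle has room for {name}")
        target = min(candidates, key=lambda s: s["load"])
        target["load"] += rate
        target["used"] += size
        target["items"].append(name)
    return spindles


# Hypothetical workload: two small hot datasets, two large cold ones.
datasets = [("hot-a", 1, 900), ("hot-b", 1, 800), ("cold-a", 4, 10), ("cold-b", 4, 5)]
layout = place(datasets, n_spindles=2, capacity_tb=5)
for i, spindle in enumerate(layout):
    print(i, spindle["items"], spindle["load"])
```

With this input each spindle ends up holding one hot and one cold dataset, so both are full while the access load stays roughly even.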
Gaps, Costs, and Priorities
• Management of replicas ($$; H; RD, HP, DS)
– Movement of files
– Movement of namespaces.
– Less-than-whole-file level replication.
– Consistency of replicated files
• Write once (immutable file) is an important use case.
• Investigation of the utility of mutable files
• Trade-off of version management vs. mutable files
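The write-once case is attractive partly because replica consistency reduces to checking that every copy has the same content. A minimal sketch of that check follows, assuming a content digest is recorded when a file is first registered. `WriteOnceCatalog` and the site and file names are hypothetical and do not refer to any existing replica catalog.

```python
import hashlib


class WriteOnceCatalog:
    """Toy replica catalog for immutable files (illustrative only)."""

    def __init__(self):
        self._digest = {}    # logical file name -> SHA-256 hex digest
        self._replicas = {}  # logical file name -> set of site names

    def register(self, name: str, content: bytes, site: str) -> None:
        digest = hashlib.sha256(content).hexdigest()
        # The first registration fixes the content; later ones must match it.
        known = self._digest.setdefault(name, digest)
        if known != digest:
            raise ValueError(f"{name} is immutable; copy at {site} differs")
        self._replicas.setdefault(name, set()).add(site)

    def sites(self, name: str) -> list:
        return sorted(self._replicas.get(name, ()))


catalog = WriteOnceCatalog()
catalog.register("run42.dat", b"event data", "site-a")
catalog.register("run42.dat", b"event data", "site-b")
print(catalog.sites("run42.dat"))
```

Mutable files would need something more, such as version numbers per replica, which is the trade-off the last bullet raises.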
Gaps, Costs, and Priorities
• Access (movement) ($$$; H; RD, HP, DS)
– Access requirements are increasing faster than the evolution of disk speed.
– Exploitation of IP and non-IP based networking
– Access contention on physically large volumes.
– Latency vs. fine-grained access.
– Investigate support for sending the function and/or query to the data.
– Investigate support for virtual data techniques
– Investigation of choice of copies and choice of path
– Investigation of where to put compression in system architectures.
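The choice of copy and path, and the tension between latency and fine-grained access, can both be illustrated with a very simple cost model: estimated time is a fixed latency plus size divided by bandwidth. This is a sketch under that assumed model only; the `Route` type, the site and path names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    replica_site: str       # which copy to read
    path: str               # which network path to use
    bandwidth_mb_s: float   # sustained throughput
    latency_s: float        # fixed startup cost per request


def transfer_seconds(route: Route, size_mb: float) -> float:
    """Assumed cost model: fixed latency plus size over bandwidth."""
    return route.latency_s + size_mb / route.bandwidth_mb_s


def best_route(routes, size_mb: float) -> Route:
    return min(routes, key=lambda r: transfer_seconds(r, size_mb))


routes = [
    Route("site-a", "routed-ip", bandwidth_mb_s=50.0, latency_s=0.05),
    Route("site-b", "dedicated-circuit", bandwidth_mb_s=400.0, latency_s=2.0),
]
small = best_route(routes, size_mb=10)       # fine-grained request
large = best_route(routes, size_mb=10_000)   # bulk transfer
print(small.replica_site, large.replica_site)
```

With these made-up numbers the small request favors the low-latency route and the bulk transfer favors the high-bandwidth one, which is why a single fixed choice of copy and path is unlikely to suit both.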
Gaps, Costs…
• Security, authorization, authentication, and access control. ($; M; RD, DS)
– Investigate expression of access control
• And how it moves with the data.
• Dataflow automation between components ($$; H; RD, HP, DS)
– API for wide-area distributed computing, exposing, as appropriate, many of the items mentioned
– Scheduling; access optimization analogous to query optimization.
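For the access-control item above, one possible reading of "how it moves with the data" is that the access list is carried inside the object, so a replica at another site enforces the same policy. The sketch below assumes that reading; `DataObject`, `replicate`, and `read` are hypothetical names, and a real system would also have to authenticate the user, which is not shown.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DataObject:
    name: str
    site: str
    payload: bytes
    # The access list is part of the object, not of the hosting site.
    readers: frozenset = field(default_factory=frozenset)


def replicate(obj: DataObject, to_site: str) -> DataObject:
    """Copy the object to another site; the access list travels with it."""
    return replace(obj, site=to_site)


def read(obj: DataObject, user: str) -> bytes:
    if user not in obj.readers:
        raise PermissionError(f"{user} may not read {obj.name} at {obj.site}")
    return obj.payload


original = DataObject("run42.dat", "site-a", b"event data", frozenset({"alice"}))
copy = replicate(original, "site-b")
print(copy.site, read(copy, "alice"))
```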
Gaps, Costs, Priorities
• Multi-resolution Data movement
– Restricted to the framework, not solving specific problems ($, ?, ?)?
– Important use case for Office of Science
– Investigate whether this is a special case of moving functions to the data (appropriate framework)
• Grid
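The multi-resolution item above asks whether it is a special case of moving functions to the data. A minimal sketch of that view: the client sends a reduction function to the site holding the data, and only the reduced result crosses the network. `DataSite` and `downsample` are hypothetical names, and plain striding stands in for whatever resolution reduction a real framework would offer.

```python
class DataSite:
    """Stand-in for a remote site that holds a large dataset."""

    def __init__(self, samples):
        self._samples = list(samples)

    def fetch_all(self):
        # Moving the data to the function: everything is shipped.
        return list(self._samples)

    def run(self, func):
        # Moving the function to the data: only the result is shipped.
        return func(self._samples)


def downsample(stride: int):
    """Build a reduction that keeps every stride-th sample."""
    def reducer(samples):
        return samples[::stride]
    return reducer


site = DataSite(range(1_000_000))
coarse = site.run(downsample(1000))
print(len(site.fetch_all()), "samples at full resolution")
print(len(coarse), "samples at coarse resolution")
```

A different resolution is then just a different function sent to the same site, which is what makes a general framework plausible here.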