STORK: Making Data Placement a First Class Citizen in the Grid
Tevfik Kosar and Miron Livny
University of Wisconsin-Madison
March 25th, 2004
Tokyo, Japan
Some Remarkable Numbers
Characteristics of four physics experiments targeted by GriPhyN:

Application   First Data   Data Volume (TB/yr)   User Community
SDSS          1999         10                    100s
LIGO          2002         250                   100s
ATLAS/CMS     2005         5,000                 1000s

Source: GriPhyN Proposal, 2000
Even More Remarkable…
“…the data volume of CMS is
expected to subsequently increase
rapidly, so that the accumulated data
volume will reach 1 Exabyte (1
million Terabytes) by around 2015.”
Source: PPDG Deliverables to CMS
More Data Intensive Applications..
Genomic information processing applications
Biomedical Informatics Research Network (BIRN)
applications
Cosmology applications (MADCAP)
Methods for modeling large molecular systems
Coupled climate modeling applications
Real-time observatories, applications, and data management (ROADNet)
Need for Data Placement
Data placement: locate, move, stage, replicate, and cache data; allocate and deallocate storage space for data; clean up everything afterwards.
State of the Art
FedEx
Manual Handling
Using Simple Scripts
RFT, LDR, SRM
Data placement activities are simply
regarded as “second class citizens”
in the Grid world.
Our Goal
Make data placement activities “first class citizens” in the Grid, just like computational jobs!
They will be queued, scheduled, monitored, managed, and even checkpointed.
Outline
Introduction
Methodology
Grid Challenges
Stork Solutions
Case Studies
Conclusions
Methodology
Individual jobs:
• Stage-in
• Execute the job
• Stage-out
Methodology
With explicit storage management, the same job becomes:
• Allocate space for input & output data
• Stage-in
• Execute the job
• Release input space
• Stage-out
• Release output space
Methodology
The same steps, separated into data placement jobs and computational jobs:
• Allocate space for input & output data (data placement)
• Stage-in (data placement)
• Execute the job (computational)
• Release input space (data placement)
• Stage-out (data placement)
• Release output space (data placement)
Methodology
DAG specification:
DaP A A.submit
DaP B B.submit
Job C C.submit
…..
Parent A child B
Parent B child C
Parent C child D, E
…..
[Diagram: the DAG executer dispatches data placement (DaP) jobs such as A and B to the DaP job queue, and computational jobs such as C to the compute job queue.]
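To make the dispatching in the diagram concrete, here is a minimal Python sketch of a DAG executer loop. Everything in it is hypothetical (the Node class, run_dag, and the two submit callbacks are not DAGMan or Stork code); it only shows the routing rule: DaP nodes go to the data placement queue, Job nodes go to the compute queue, and a node runs only after its parents finish.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                       # "DaP" (data placement) or "Job" (computational)
    submit_file: str
    parents: set = field(default_factory=set)

def run_dag(nodes, submit_dap, submit_job):
    """Submit each node once all of its parents have finished.

    submit_dap / submit_job hand the node's submit file to the DaP job
    queue (e.g., Stork) or to the compute job queue (e.g., Condor); for
    simplicity they are assumed to block until the job completes."""
    done = set()
    while len(done) < len(nodes):
        for node in nodes.values():
            if node.name in done or not node.parents <= done:
                continue
            if node.kind == "DaP":
                submit_dap(node.submit_file)    # data placement job
            else:
                submit_job(node.submit_file)    # computational job
            done.add(node.name)

# The fragment from the slide: A and B are data placement jobs, C is computational.
nodes = {
    "A": Node("A", "DaP", "A.submit"),
    "B": Node("B", "DaP", "B.submit", parents={"A"}),
    "C": Node("C", "Job", "C.submit", parents={"B"}),
}
run_dag(nodes,
        submit_dap=lambda f: print("stork_submit", f),
        submit_job=lambda f: print("condor_submit", f))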
Grid Challenges
Heterogeneous Resources
Different Job Requirements
Hiding Failures from Applications
Overloading Limited Resources
Stork Solutions to Grid Challenges
Interaction with Heterogeneous
Resources
Flexible Job Representation and
Multilevel Policy Support
Run-time Adaptation
Failure Recovery and Efficient
Resource Utilization
Interaction with
Heterogeneous Resources
Modularity, extensibility
Plug-in protocol support
A library of “data placement” modules
Inter-protocol Translations
Direct Protocol Translation
Protocol Translation using Disk Cache
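The two slides above refer to diagrams that did not survive extraction: one shows direct protocol translation, the other translation through a local disk cache. As a rough illustration of the idea, and only that, here is a Python sketch of a plug-in module library; the registry, module interface, and function names are assumptions made for this sketch, not Stork's actual internals.

import os
import tempfile

MODULES = {}    # protocol name -> plug-in module providing get()/put()

def register(protocol, module):
    """Plug a new protocol into the library of data placement modules."""
    MODULES[protocol] = module

def split(url):
    proto, path = url.split("://", 1)
    return proto, path

def translate(src_url, dest_url, use_disk_cache=False):
    """Move data between two different protocols via their plug-in modules.

    Direct translation pipes the bytes straight from the source module to
    the destination module; the disk-cache variant stages the file on local
    disk first, so either endpoint can fail and be retried independently."""
    src_proto, src_path = split(src_url)
    dest_proto, dest_path = split(dest_url)
    src, dest = MODULES[src_proto], MODULES[dest_proto]
    if use_disk_cache:
        cache = tempfile.NamedTemporaryFile(delete=False)
        cache.write(src.get(src_path))           # stage into the local cache
        cache.close()
        with open(cache.name, "rb") as f:
            dest.put(dest_path, f.read())        # push out of the cache
        os.unlink(cache.name)
    else:
        dest.put(dest_path, src.get(src_path))   # direct, through memory

# A trivial example module; real modules would wrap SRB, NeST, GridFTP,
# UniTree, DiskRouter, ... clients.
class FileModule:
    def get(self, path):
        with open(path, "rb") as f:
            return f.read()
    def put(self, path, data):
        with open(path, "wb") as f:
            f.write(data)

register("file", FileModule())
# translate("file:///tmp/in.dat", "file:///tmp/out.dat", use_disk_cache=True)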
Flexible Job Representation
and Multilevel Policy Support
ClassAd job description language
Flexible and extensible
Persistent
Allows both global and job-level policies
Sample Stork submit file
[
  Type       = "Transfer";
  Src_Url    = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url   = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  Max_Retry  = 10;
  Restart_in = "2 hours";
]
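Because jobs are expressed as ClassAds, attributes in the submit file (such as Max_Retry above) can override server-wide defaults, which is what "multilevel policy support" refers to. The following Python fragment is only a hypothetical illustration of that precedence rule, not Stork's policy engine.

# Hypothetical illustration of global vs. job-level policies.
GLOBAL_POLICY = {             # server-wide defaults set by the Stork administrator
    "max_retry": 5,
    "restart_in": "1 hour",
}

job = {                       # attributes taken from the job's submit ClassAd
    "type": "Transfer",
    "src_url": "srb://ghidorac.sdsc.edu/kosart.condor/x.dat",
    "dest_url": "nest://turkey.cs.wisc.edu/kosart/x.dat",
    "max_retry": 10,          # job-level policy overrides the global default
}

def effective(job, key):
    """Job-level policy wins; otherwise fall back to the global policy."""
    return job.get(key, GLOBAL_POLICY.get(key))

print(effective(job, "max_retry"))     # -> 10  (from the job)
print(effective(job, "restart_in"))    # -> "1 hour"  (global default)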
Run-time Adaptation
Dynamic protocol selection:
[
  dap_type = "transfer";
  src_url  = "drouter://slic04.sdsc.edu/tmp/foo.dat";
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/foo.dat";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]
[
  dap_type = "transfer";
  src_url  = "any://slic04.sdsc.edu/tmp/foo.dat";
  dest_url = "any://quest2.ncsa.uiuc.edu/tmp/foo.dat";
]
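The first ClassAd names DiskRouter as the preferred protocol and lists fall-backs; the second leaves the choice entirely to Stork with "any://". A scheduler honoring these attributes might behave roughly like the sketch below; the helper names are hypothetical, not Stork's implementation.

def select_and_transfer(job, do_transfer, available_protocols):
    """Try the requested protocol first, then each alt_protocols pair.

    do_transfer((src_proto, src_path), (dest_proto, dest_path)) performs one
    attempt and raises on failure; available_protocols lists protocols that
    both endpoints support, used to resolve "any://" URLs at run time."""
    src_proto, src_path = job["src_url"].split("://", 1)
    dest_proto, dest_path = job["dest_url"].split("://", 1)

    if src_proto == "any":                        # let the scheduler decide
        candidates = [(p, p) for p in available_protocols]
    else:
        candidates = [(src_proto, dest_proto)]
        for pair in job.get("alt_protocols", "").split(","):
            if pair.strip():                      # e.g. "nest-nest"
                s, d = pair.strip().split("-")
                candidates.append((s, d))

    for s_proto, d_proto in candidates:
        try:
            do_transfer((s_proto, src_path), (d_proto, dest_path))
            return True                           # this protocol pair worked
        except Exception:
            continue                              # fall back to the next pair
    return False                                  # every protocol failed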
Failure Recovery and Efficient Resource Utilization
Hides failures from users
 "retry" mechanism
 "kill and restart" mechanism
Controls the number of concurrent transfers from/to any storage system
 Prevents overloading
Space allocation and deallocation
 Makes sure space is available
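A rough sketch of how these pieces could fit together is shown below; the structure and names are assumptions made for illustration, not Stork's source code. A transfer that hangs past its timeout is killed and restarted, a failed transfer is retried up to max_retry times, and a per-server semaphore caps concurrent transfers so no storage system is overloaded.

import subprocess
import threading

MAX_CONCURRENT = 10               # per storage system, to avoid overloading it
_slots = {}                       # server name -> semaphore

def _slot_for(server):
    return _slots.setdefault(server, threading.Semaphore(MAX_CONCURRENT))

def run_transfer(cmd, server, max_retry=10, timeout=2 * 3600):
    """Run one transfer command while hiding failures from the user.

    - kill and restart: if the command hangs past `timeout`, it is killed
      and started again
    - retry: any non-zero exit is retried, up to `max_retry` attempts
    - concurrency: at most MAX_CONCURRENT simultaneous transfers per server"""
    with _slot_for(server):
        for attempt in range(1, max_retry + 1):
            try:
                result = subprocess.run(cmd, timeout=timeout)
            except subprocess.TimeoutExpired:
                continue                      # hung transfer was killed; restart
            if result.returncode == 0:
                return True                   # success; retries stay invisible
        return False                          # give up after max_retry attempts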
Case Study I: SRB-UniTree Data Pipeline
Transfer ~3 TB of DPOSS data (2611 x 1.1 GB files) from SRB @SDSC to UniTree @NCSA
No direct interface between the two systems
The only way to transfer the data is to build a pipeline
SRB-UniTree Data Pipeline
[Diagram: the submit site drives a pipeline from the SRB Server to the SDSC Cache (SRB get), from the SDSC Cache to the NCSA Cache (GridFTP/DiskRouter third-party transfer), and from the NCSA Cache to the UniTree Server (UniTree put).]
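Every one of the 2611 files follows the same three hops, so the per-file data placement jobs and their dependencies can be generated mechanically. The generator below is a hypothetical sketch: the host placeholders and file names are invented for illustration, and only the hop structure and the DAG/submit-file style come from the slides.

# Hypothetical generator for the three-hop SRB -> UniTree pipeline; the host
# placeholders (SRB_SERVER, SDSC_CACHE, ...) stand in for the real machines.
HOPS = [
    ("srb://SRB_SERVER/dposs/",    "file://SDSC_CACHE/dposs/"),        # SRB get
    ("gsiftp://SDSC_CACHE/dposs/", "gsiftp://NCSA_CACHE/dposs/"),      # GridFTP/DiskRouter
    ("file://NCSA_CACHE/dposs/",   "unitree://UNITREE_SERVER/dposs/"), # UniTree put
]

def pipeline_jobs(filename):
    """Yield (job_name, submit_classad) for each hop of one file, in order."""
    for hop, (src_base, dest_base) in enumerate(HOPS):
        yield f"{filename}.hop{hop}", {
            "dap_type": "transfer",
            "src_url":  src_base + filename,
            "dest_url": dest_base + filename,
            "max_retry": 10,
        }

def dag_lines(filenames):
    """Emit a DAG in the style of the Methodology slide: each hop of a file
    is a DaP node that depends on the previous hop of the same file."""
    for name in filenames:
        jobs = list(pipeline_jobs(name))
        for job_name, _ in jobs:
            yield f"DaP {job_name} {job_name}.submit"
        for (parent, _), (child, _) in zip(jobs, jobs[1:]):
            yield f"Parent {parent} child {child}"

for line in dag_lines(["x0001.dat", "x0002.dat"]):   # two illustrative file names
    print(line)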
Failure Recovery
[Plot: pipeline transfer progress over time, annotated with failures that were recovered from automatically: UniTree not responding; DiskRouter reconfigured and restarted; SDSC cache reboot and UW CS network outage; a software problem.]
Case Study II
Dynamic Protocol Selection
Conclusions
Regard data placement as real jobs.
Treat computational and data placement
jobs differently.
Introduce a specialized scheduler for data
placement.
End-to-end automation, fault tolerance,
run-time adaptation, multilevel policy
support.
Future work
Enhanced interaction between Stork
and higher level planners
co-scheduling of CPU and I/O
Enhanced authentication mechanisms
More run-time adaptation
You don’t have to FedEx your data anymore…
Stork delivers it for you!
For more information
 URL: http://www.cs.wisc.edu/condor/stork
 Email: kosart@cs.wisc.edu