
A Client-centric Grid Knowledgebase
George Kola, Tevfik Kosar and Miron Livny
University of Wisconsin-Madison
September 23rd, 2004
Cluster 2004
San Diego, CA
Grid Trivia
▪ How many of you have submitted a job to Grid resources and never heard back from it?
▪ How many of you have been frustrated by the inconsistent behavior of some grid resources?
• Some jobs complete successfully while others fail.
• Similar jobs perform completely differently.
... We have!
Goal: Prevent Unexpected Behavior in a Grid
▪ Learn from experience and prevent unexpected behavior from recurring.
▪ Causes of unexpected behavior in a Grid:
• Black holes
• Resources with
– Faulty hardware
– Buggy or misconfigured software
• Extremely slow computational sites
• Memory leaks
• etc.
Black holes
▪ Definition: “A black hole is a region of spacetime from which nothing can escape, not even light.”
▪ If you send a light beam into a black hole, you never hear back from it.
▪ You can only recognize a black hole after you have encountered it. Is it too late by then?
• No. You should learn from experience.
Black holes in the Grid
▪ Resources that accept jobs but never complete them
• You send a job to a resource but never hear back from it.
Black hole examples from real life:
▪ In the WCER educational video processing pipeline:
• A specific pool accepted our jobs and processed them for a couple of hours, but evicted them before completion.
• A machine accepted a job, but because of a memory leak it kept throwing “shadow exceptions” and retrying the job forever.
• Some third-party transfers (GridFTP, DiskRouter) occasionally hung and never returned.
• A machine with a corrupted FPU caused errors: it completed MPEG-1 encoding successfully but failed MPEG-4 encoding.
The Grid is good, but not perfect
▪ Heterogeneous resources
▪ Multiple administrative domains
▪ Spans wide area networks
▪ Consists of commodity hardware and software
Prone to network, hardware, software, and middleware failures!
We cannot expect everything from the Grid or Grid middleware!
Take the Ethernet Approach
▪ A truly distributed (and very effective) access control protocol for a shared service
▪ Client is responsible for access control
▪ Client is responsible for error detection
▪ Client is responsible for fairness
Keep track of job/resource performance and failure characteristics as observed by the client.
Use job/user log files collected at the client side to build a grid knowledgebase.
Grid Knowledgebase
▪ Parse user/job log files
▪ Load them into a database
▪ Aggregate the experience of different jobs
▪ Interpret it
▪ Plan actions
▪ Generate feedback to the scheduler as well as to the user
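The first steps above (parse, load, aggregate, interpret) can be sketched in a few lines. This is only an illustration, not the authors' implementation: the log line layout `<job_id> <event> <host> <unix_time>`, the `events` table, and all function names are assumptions made for the example, since the slides do not give the real log format.

```python
import sqlite3

# Hypothetical client-side log lines: "<job_id> <event> <host> <unix_time>".
# The real user/job log format is not given in the slides.
SAMPLE_LOG = """\
1 execute nodeA 100
1 terminate nodeA 400
2 execute nodeB 120
2 terminate nodeB 480
3 execute nodeA 130
"""

def parse_log(text):
    """Parse: yield (job_id, event, host, time) tuples from the log text."""
    for line in text.splitlines():
        job_id, event, host, when = line.split()
        yield int(job_id), event, host, int(when)

def load(db, records):
    """Load: store the parsed events in a single (assumed) events table."""
    db.execute("CREATE TABLE IF NOT EXISTS events "
               "(job_id INTEGER, event TEXT, host TEXT, time INTEGER)")
    db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", records)

def run_times_by_host(db):
    """Aggregate: average run time of completed jobs per execute host."""
    rows = db.execute(
        "SELECT e.host, AVG(t.time - e.time) "
        "FROM events e JOIN events t ON e.job_id = t.job_id "
        "WHERE e.event = 'execute' AND t.event = 'terminate' "
        "GROUP BY e.host")
    return dict(rows)

def unfinished_jobs(db):
    """Interpret: jobs that started executing but never terminated."""
    rows = db.execute(
        "SELECT job_id FROM events WHERE event = 'execute' AND job_id NOT IN "
        "(SELECT job_id FROM events WHERE event = 'terminate') "
        "ORDER BY job_id")
    return [row[0] for row in rows]

db = sqlite3.connect(":memory:")
load(db, parse_log(SAMPLE_LOG))
print(run_times_by_host(db))
print(unfinished_jobs(db))
```

Jobs returned by `unfinished_jobs` are candidates for feedback to the scheduler or the user; how that feedback is delivered is outside this sketch.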
[Figure: Job submission architecture without the knowledgebase. Labeled components: job descriptions, planner, job queue, match maker, job scheduler, job logs, and grid resources (clusters, storage servers, personal computers).]
[Figure: The same architecture extended with the grid knowledgebase. Additional labeled components: job parser, database, data miner, adaptation layer, and notification layer.]
Database Schema
Field              Type
JobId              int
JobName            string
State              int
SubmitHost         string
SubmitTime         int
ExecuteHost        string[]
ExecuteTime        string[]
ImageSize          int[]
ImageSizeTime      int[]
EvictTime          int[]
Checkpointed       bool[]
EvictReason        string
TerminateTime      int[]
TotalLocalUsage    string
TotalRemoteUsage   string
TerminateMessage   string
ExceptionTime      int[]
ExceptionMessage   string[]
[Figure: Job state diagram. Labeled states and events: User, Submit, Schedule, Execute, Suspend, Un-suspend, Evicted, Exception, Terminated Abnormally, and Terminated Normally. A decision node “Exit code = 0?” has Yes and No branches, and the end states are Job Succeeded and Job Failed.]
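The final decision in the job state diagram can be written as a small function. This is a sketch under one stated assumption: the surviving labels show that a normally terminated job is checked for exit code 0, but they do not show where the abnormal-termination path leads, so treating it as a failure is an assumption of this example. The names `Outcome` and `classify` are hypothetical.

```python
from enum import Enum
from typing import Optional

class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"

def classify(terminated_normally: bool, exit_code: Optional[int]) -> Outcome:
    """Classify a terminated job as succeeded or failed.

    Assumption: a job that terminated abnormally is counted as failed,
    regardless of any exit code it may have reported.
    """
    if terminated_normally and exit_code == 0:
        return Outcome.SUCCEEDED
    return Outcome.FAILED

print(classify(True, 0))      # normal termination, exit code 0
print(classify(True, 1))      # normal termination, non-zero exit code
print(classify(False, None))  # abnormal termination
```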
Difference from existing approaches
▪ Client view
▪ Uses only job/user log files at the client side
• Many administrators do not want to share resource/scheduler log files.
▪ We do not need to know everything going on in the whole grid
• Scalable
What do we get?
▪ Collect job execution time statistics
• Average job execution time
• Standard deviation
• Fit a distribution
▪ Detect and avoid black holes
• For a normal distribution:
– 99.7% of job execution times should lie between (avg - 3*stdev) and (avg + 3*stdev)
– About 95.4% of job execution times should lie between (avg - 2*stdev) and (avg + 2*stdev)
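A minimal sketch of how the 3-sigma rule could be used to flag suspected black holes: derive an upper time limit from the run times of completed jobs, then report running jobs that have exceeded it. The job ids, the sample run times, and the function names are made up for illustration, and the sketch assumes the run times are roughly normally distributed.

```python
import statistics

def time_limit(run_times, k=3.0):
    """Upper execution-time limit: mean + k * sample standard deviation."""
    return statistics.mean(run_times) + k * statistics.stdev(run_times)

def suspected_black_holes(running, limit):
    """Return ids of running jobs whose elapsed time exceeds the limit."""
    return [job for job, elapsed in running.items() if elapsed > limit]

# Made-up run times (minutes) of jobs that already completed.
history = [6.0, 7.0, 8.0, 9.0, 10.0]
limit = time_limit(history)

# Made-up elapsed times (minutes) of jobs that are still running.
running = {"job-17": 9.5, "job-42": 55.0}
print(suspected_black_holes(running, limit))
```

A flagged job is only a suspect: a client could, for example, resubmit it elsewhere and record the resource for later inspection.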
Detecting hanging transfers
[Figure: Cumulative distribution of transfer times. x-axis: Transfer Time (T), in minutes; y-axis: Probability (t<T), in percent.]
Setting Execution Time Limits
▪ Avg = 7.8 min
▪ Stdev = 3.17 min
▪ For a normal distribution:
• 99.7%: [0 min – 17.31 min] (the lower bound, avg - 3*stdev, is negative and is clipped to 0)
• About 95.4%: [1.46 min – 14.14 min]
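The limits on this slide follow directly from avg ± k*stdev with k = 3 and k = 2. The short check below reproduces the arithmetic; the helper name `limits` is hypothetical, and clipping the lower bound at zero reflects that an execution time cannot be negative.

```python
avg, stdev = 7.8, 3.17  # minutes, as given on the slide

def limits(avg, stdev, k):
    """Return (lower, upper) = avg -/+ k * stdev, with lower clipped at 0."""
    return max(0.0, avg - k * stdev), avg + k * stdev

for k in (3, 2):
    low, high = limits(avg, stdev, k)
    print(f"k={k}: [{low:.2f} min, {high:.2f} min]")
```

For k = 3 the unclipped lower bound is 7.8 - 9.51 = -1.71 min, which is why the interval starts at 0.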
What do we get? (2)
▪ Identifying misconfigured machines
• e.g., find the set of machines that fail jobs whose I/O data size exceeds 2 GB (i.e., an OS limitation)
▪ Identifying factors that affect job run time
▪ Bug hunting
• Job failures on certain inputs
• Memory leaks
– The scheduler logs the image size regularly
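The 2 GB example above amounts to a grouping query over job records. The sketch below is one possible reading, not the authors' query: it reports hosts on which every job above the threshold failed while at least one smaller job succeeded. The record layout `(host, io_bytes, succeeded)` and the function name are assumptions for the example.

```python
from collections import defaultdict

TWO_GB = 2 * 1024**3

def hosts_failing_large_io(jobs, threshold=TWO_GB):
    """Hosts where every job above the threshold failed and at least one
    job at or below the threshold succeeded."""
    large = defaultdict(list)  # host -> success flags of large-I/O jobs
    small_ok = set()           # hosts with a successful small-I/O job
    for host, io_bytes, succeeded in jobs:
        if io_bytes > threshold:
            large[host].append(succeeded)
        elif succeeded:
            small_ok.add(host)
    return sorted(host for host, flags in large.items()
                  if not any(flags) and host in small_ok)

GB = 1024**3
# Made-up job records: (execute host, I/O data size in bytes, succeeded?)
jobs = [
    ("nodeA", 3 * GB, False),
    ("nodeA", 1 * GB, True),
    ("nodeB", 3 * GB, True),
    ("nodeC", 3 * GB, False),  # no small job seen, so no conclusion drawn
]
print(hosts_failing_large_io(jobs))
```

Requiring a successful small job on the same host separates a size-related limit from a host that fails everything.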
Catching Memory Leaks
[Figure: Job memory image size (MB) plotted against time.]
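One simple way to turn logged image sizes into a leak signal is to fit a straight line to image size over time and look at the slope. This is an illustrative heuristic, not necessarily the method used by the authors; the sample values and the `min_rate` threshold are made up. The two input lists correspond in spirit to the ImageSizeTime and ImageSize fields of the database schema.

```python
def growth_rate(times, sizes):
    """Least-squares slope of image size over time (MB per time unit)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_s = sum(sizes) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, sizes))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var

def looks_like_leak(times, sizes, min_rate=1.0):
    """Flag a job whose image size grows by at least min_rate MB per unit."""
    return growth_rate(times, sizes) >= min_rate

# Made-up samples: time in minutes, image size in MB.
times = [0, 10, 20, 30]
leaking = [100, 150, 200, 250]
stable = [100, 101, 100, 101]
print(growth_rate(times, leaking), looks_like_leak(times, leaking))
print(growth_rate(times, stable), looks_like_leak(times, stable))
```

The function needs at least two distinct sample times; a sensible threshold depends on the application's normal memory behavior.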
What do we get? (3)
▪ Application optimization
• How long does each step of an application/pipeline take to execute?
▪ Adaptation
• Find the resources that take the least time to execute jobs from a particular class
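The adaptation step can be illustrated by ranking resources by their mean execution time for one job class. This is a sketch with made-up data; the record layout `(job_class, host, minutes)` and the function name `rank_hosts` are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

def rank_hosts(records, job_class):
    """Return (host, mean run time) pairs for one job class, fastest first."""
    by_host = defaultdict(list)
    for cls, host, minutes in records:
        if cls == job_class:
            by_host[host].append(minutes)
    averages = [(host, mean(times)) for host, times in by_host.items()]
    return sorted(averages, key=lambda pair: pair[1])

# Made-up records: (job class, execute host, run time in minutes)
records = [
    ("encode", "nodeA", 8.0),
    ("encode", "nodeA", 10.0),
    ("encode", "nodeB", 6.0),
    ("transfer", "nodeB", 30.0),
]
print(rank_hosts(records, "encode"))
```

A scheduler-side adaptation layer could prefer hosts near the top of this ranking when placing new jobs of the same class.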
Conclusions
▪ View of the Grid from the client side
▪ Job/user log files as the main source of information
▪ Aggregate the experience of different jobs and pass it on to future ones
▪ Helps in:
• Catching black holes
• Identifying faulty/misconfigured resources
• Bug tracking
• Statistics collection
▪ Future work:
• Merge the experience of different clients
Thank you…
For more information, contact:
Tevfik Kosar
http://www.cs.wisc.edu/~kosart
kosart@cs.wisc.edu