A Client-centric Grid Knowledgebase
George Kola, Tevfik Kosar and Miron Livny
University of Wisconsin-Madison
September 23rd, 2004
Cluster 2004, San Diego, CA

Grid Trivia
How many of you have submitted a job to Grid resources and never heard back from it?
How many of you have been frustrated by the inconsistent behavior of some Grid resources?
• Completing some jobs successfully while failing others
• Similar jobs performing completely differently
We did!

Goal: Prevent Unexpected Behavior in a Grid
Learn from experience and prevent past failures from repeating.
Causes of unexpected behavior in a Grid:
• Black holes
• Resources with
  – Faulty hardware
  – Buggy or misconfigured software
• Extremely slow computational sites
• Memory leaks, etc.

Black holes
Definition: “A black hole is a region of spacetime from which nothing can escape, even light.”
If you send a light beam into a black hole, you never hear back from it.
You only know it is there after you have encountered it.
Is it too late?
• No. You should learn from experience.

Black holes in the Grid
Resources that accept jobs but never complete them
• You send a job to a resource, but never hear back from it.

Black hole examples from real life
In the WCER educational video processing pipeline:
• A specific pool accepted and processed our jobs for a couple of hours, but evicted them before completion.
• A machine accepted a job, but due to a memory leak it kept throwing “shadow exceptions” and retrying the job forever.
• Some third-party transfers (GridFTP, DiskRouter) occasionally hung and never returned.
• A machine with a corrupted FPU produced errors: it completed MPEG-1 encoding successfully but failed MPEG-4 encoding.

Grid is good, but not perfect
Heterogeneous resources
Multiple administrative domains
Spanning wide-area networks
Consists of commodity hardware and software
Prone to network, hardware, software, and middleware failures!
We cannot expect everything from the Grid or Grid middleware!

Take the Ethernet Approach
A truly distributed (and very effective) access control protocol to a shared service
Client responsible for access control
Client responsible for error detection
Client responsible for fairness
Keep track of job/resource performance and failure characteristics as observed by the client.
Use job/user log files collected at the client side to build a grid knowledgebase.
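
The client-side bookkeeping described in the Ethernet approach above might be sketched as follows. This is only an illustration under stated assumptions: the event tuples, the outcome names ("accepted", "completed", "evicted"), and the `min_accepted` threshold are hypothetical and do not come from the slides or from any real scheduler log format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HostStats:
    """Per-resource counters as observed from the client side (hypothetical fields)."""
    accepted: int = 0
    completed: int = 0
    evicted: int = 0


def aggregate(events):
    """Fold (host, outcome) pairs into per-host counters.

    `events` is an assumed, simplified stand-in for parsed client-side job logs.
    Outcomes other than the three known names are ignored.
    """
    stats = defaultdict(HostStats)
    for host, outcome in events:
        s = stats[host]
        if outcome == "accepted":
            s.accepted += 1
        elif outcome == "completed":
            s.completed += 1
        elif outcome == "evicted":
            s.evicted += 1
    return dict(stats)


def black_hole_candidates(stats, min_accepted=3):
    """Hosts that accepted at least `min_accepted` jobs but completed none.

    The threshold is an arbitrary illustrative choice, not a value from the talk.
    """
    return sorted(
        host
        for host, s in stats.items()
        if s.accepted >= min_accepted and s.completed == 0
    )


if __name__ == "__main__":
    sample = [
        ("hostA", "accepted"), ("hostA", "completed"),
        ("hostB", "accepted"), ("hostB", "evicted"),
        ("hostB", "accepted"), ("hostB", "evicted"),
        ("hostB", "accepted"),
    ]
    print(black_hole_candidates(aggregate(sample)))
```

Keeping the counters per host, rather than per job, is what lets the experience of earlier jobs inform where later jobs are sent.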

Grid Knowledgebase
Parse user/job log files
Load them into a database
Aggregate experience of different jobs
Interpret them
Plan action
Generate feedback to the scheduler as well as to the user

[Figure: job submission architecture. Labels: JOB DESCRIPTIONS, PLANNER, JOB QUEUE, MATCH MAKER, JOB SCHEDULER, JOB LOGS, GRID RESOURCES (Clusters, Storage Servers, Personal Computers)]

[Figure: the same architecture extended with the GRID KNOWLEDGEBASE. Additional labels: JOB PARSER, DATABASE, DATA MINER, ADAPTATION LAYER, NOTIFICATION LAYER]

Database Schema
Field              Type
JobId              int
JobName            string
State              int
SubmitHost         string
SubmitTime         int
ExecuteHost        string[]
ExecuteTime        string[]
ImageSize          int[]
ImageSizeTime      int[]
EvictTime          int[]
Checkpointed       bool[]
EvictReason        string
TerminateTime      int[]
TotalLocalUsage    string
TotalRemoteUsage   string
TerminateMessage   string
ExceptionTime      int[]
ExceptionMessage   string[]

[Figure: job state diagram. Labels: User, Submit, Schedule, Execute, Suspend, Un-suspend, Evicted, Exception, Terminated Abnormally, Terminated Normally, decision "Exit code = 0?" (Yes/No), Job Succeeded, Job Failed]

Difference from existing approaches
Client view
Use only job/user log files at the client side
• Many administrators do not want to share resource/scheduler log files.
We do not need to know everything going on in the whole grid
• Scalable

What do we get?
Collecting job execution time statistics
• Average job execution time
• Standard deviation
• Fit a distribution
Detect and avoid black holes
• For a normal distribution:
  – 99.7% of job execution times should lie between (avg - 3*stdev) and (avg + 3*stdev)
  – About 95% of job execution times should lie between (avg - 2*stdev) and (avg + 2*stdev)

Detecting hanging transfers
[Figure: Transfer Time (T) vs Probability (t<T). x-axis: Transfer Time (T) (minutes); y-axis: Probability (t<T) (%)]

Setting Execution Time Limits
Avg = 7.8 min
Stdev = 3.17 min
For a normal distribution:
• 99.7%: [0 – 17.31 min] (avg - 3*stdev is negative, so the lower bound is 0)
• About 95%: [1.46 min – 14.14 min]

What do we get? (2)
Identifying misconfigured machines
• e.g., find the set of machines that fail jobs with I/O data size larger than 2 GB (i.e., OS limitations)
Identifying factors affecting job run-time
Bug hunting
• Job failures on certain inputs
• Memory leaks
  – The scheduler logs image size regularly

Catching Memory Leaks
[Figure: Job Memory Image Size (MB) vs. Time]

What do we get? (3)
Application optimization
• How long does each step of an application/pipeline take to execute?
Adaptation
• Find resources that take the least time to execute jobs from a particular class

Conclusions
View of the Grid from the client side
Job/user log files as the main source of information
Aggregate the experience of different jobs and pass it on to future ones
Helps in:
• Catching black holes
• Identifying faulty/misconfigured resources
• Bug tracking
• Statistics collection
Future work:
• Merge the experience of different clients

Thank you…
For more information, contact:
Tevfik Kosar
http://www.cs.wisc.edu/~kosart
kosart@cs.wisc.edu
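
Supplementary sketch for the "Setting Execution Time Limits" slide: the avg ± k*stdev rule can be written in a few lines. This is a minimal illustration, not the authors' implementation. The function names and the clamping of the lower bound at zero are assumptions made for the example, and `statistics.stdev` computes the sample standard deviation, which may or may not match what the slides used.

```python
from statistics import mean, stdev


def limits_from_stats(avg, sd, k):
    """Return (lower, upper) = (avg - k*sd, avg + k*sd), with lower clamped at 0.

    Clamping is an assumption: an execution time cannot be negative.
    """
    return max(0.0, avg - k * sd), avg + k * sd


def limits_from_samples(samples, k=3.0):
    """Derive limits from observed execution times (needs at least two samples)."""
    return limits_from_stats(mean(samples), stdev(samples), k)


def exceeds_limit(elapsed, upper):
    """True if a running job or transfer has outlived the upper limit."""
    return elapsed > upper


if __name__ == "__main__":
    # Values from the slide: Avg = 7.8 min, Stdev = 3.17 min
    for k in (3, 2):
        low, high = limits_from_stats(7.8, 3.17, k)
        print(f"k={k}: [{low:.2f} min, {high:.2f} min]")
```

A job still running past the upper limit is a candidate for being stuck on a black hole; what to do next (kill, resubmit elsewhere, notify the user) is a policy choice left open here.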