Error Scope on a Computational Grid: Theory and Practice

Douglas Thain
Computer Sciences Department
University of Wisconsin
USC Reliability Workshop, July 2002

Outline
- An Exercise: Condor + Java
- Bad News: Error Explosion
- A Theory of Error Propagation
- Down with Generic Errors!
- Condor Revisited
- Parting Thoughts

An Exercise: Coupling Condor and Java
- The Condor Project, est. 1985.
  - Production high-throughput computing facility.
  - Provides a stable execution environment on a Grid of unstable, autonomous resources.
- The Java Language, est. 1991.
  - Production language, compiler, and interpreter.
  - Provides a standard instruction set and libraries on any processor and system.
- The Grid, est. ????
  - Execute any code anywhere at any time.
  - Dependable, consistent, pervasive, inexpensive...
  - Are we there yet?

The Condor High Throughput Computing System
- HTC != HPC
  - Measured in sims/week, frames/month, cycles/year.
- All participants are autonomous.
  - Users give constraints on usable machines.
  - Machines give constraints on jobs and users.
  - ClassAds: a language for matchmaking.
- If you are willing to re-link jobs...
  - Remote system calls for transparent mobility.
  - Binary checkpointing for migration and fault-tolerance.
  - Can't relink? All other features available.
- Special "universes" support software environments.
  - PVM, MPI, Master-Worker, Vanilla, Globus, Java

[Figure: Condor architecture. Submission Site: User Agent (schedd), which forks the Job Agent (shadow); Home File System; Policy Control. Execution Site: Machine Agent (startd), which forks the Job Agent (starter), which forks The Job; Policy Control. A MatchMaker connects the two sites. Labeled links: Claiming Protocol, Execution Protocol.]

Java Universe
- Execution:
  - User specifies .class and .jar files.
  - Machine provides the JVM details.
- Input and Output:
  - Know all of your files? Condor transfers whole files for you.
  - Need online I/O? Link program with Chirp I/O Library. Execution site provides proxy to home site.
[Figure: Java Universe I/O. Submission Site: Job Agent (shadow) with I/O Server, making Local System Calls to the Home File System. Execution Site: Job Agent (starter) with I/O Proxy, which forks the JVM holding the Wrapper, The Job, and the I/O Library. Labeled links: Secure Remote I/O, Local RPC (Chirp).]

Initial Experience
- Bad news! Any kind of error sent the job back to the user with an exception message:
  - NullPointerException - Program is faulty.
  - OutOfMemory - Program outgrew machine.
  - ClassNotFoundError - Machine incorrectly installed.
  - ConnectionRefused - Network temporarily unavailable.
- Users were frustrated because they had to evaluate whether the job failed or the system failed.
- These were correct in the sense they were true.
- These were not bugs. We deliberately trapped all possible errors and passed them up the chain.

What's the Problem?
- To reason about this problem, we began to construct a theory of error propagation.
- This theory offers some common definitions and four principles that outline a design discipline.
- We re-examined the Java Universe according to this theory.
- Our most serious mistake: We failed to propagate errors according to their scope.

We are NOT Talking About:
- Fault Tolerance
  - What algorithms are fault-resistant?
  - How many disks can I lose without losing data?
  - How many copies should I make for five nines?
- Language Structures
  - Should I use Objects or Strings to represent errors?
  - Should I use Exceptions or Signals to communicate errors?
- These are important and valuable questions, but we are asking something different!

We ARE Talking About:
- Where is the problem?
- How should a program respond to an error?
- Who should receive an error message?
- What information should an error carry?
- How can we even reason about this stuff?

Engineering Perspective
- Fault
  - A physical disruption of the machine.
- Error
  - An information state that reflects a fault.
- Failure
  - A violation of documented/guaranteed behavior.
- Fault
  - (A failure in one's underlying components.)
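
The behavior described under Initial Experience can be sketched roughly as follows. This is only an illustration under stated assumptions: the class NaiveWrapper and the method runAndReport are hypothetical names, not Condor's actual wrapper code. The point is that a faulty program and an undersized machine produce the same kind of report, so the user must work out whose fault it was.

```java
// Hypothetical sketch (not Condor code): a wrapper that traps every
// Throwable and passes it up the chain as a message for the user.
class NaiveWrapper {

    // Run the job and turn any outcome into a message for the user.
    static String runAndReport(Runnable job) {
        try {
            job.run();
            return "Program exited normally.";
        } catch (Throwable t) {
            // Every error, whatever its cause, takes the same path to the user.
            return "Job failed: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // A faulty program ...
        System.out.println(runAndReport(() -> { throw new NullPointerException(); }));
        // ... and a machine that is too small look alike in the report.
        System.out.println(runAndReport(() -> { throw new OutOfMemoryError(); }));
    }
}
```

Nothing in the returned string says which part of the system was invalidated, which is the gap the notion of error scope is meant to fill.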

Interface Perspective
- Implicit Error
  - A result presented as valid, but found to be false.
  - Example: sqrt(3) -> 2.
- Explicit Error
  - A result describing an inability to carry out the request.
  - Example: open("file") -> ENOENT.
- Escaping Error
  - A return to a higher level of abstraction.
  - Example: read -> virt mem failure -> process abort.
  - Example: server out of memory -> shutdown socket

[Figure: a Program issues a load to the Virtual Memory System (Physical Memory, Backing Store) and receives data; the Program reports Normal Exit or Abnormal Exit to its Parent Process. Annotations: "Would like to return an explicit error, but a load insn has no exit code." "Could return a default value, but that creates an implicit error." "Escaping error: Tell the parent that the program could not complete."]

Interface Contracts
    int load( int address );
- The implementor must either compute a result that conforms to the contract, or is obliged to cause an escaping error.

Exceptions
    int open( String filename ) throws FileNotFound, AccessDenied;
- A language with exceptions provides more structure to the contract.
- A declared exception is an explicit error.
- Yet, escaping errors are still possible.

[Figure: a Program calls open on the Virtual File System, which is built on Memory and Disk; the Program reports Normal Exit or Abnormal Exit to its Parent Process. INTERFACE results: Success, FileNotFound, AccessDenied. IMPLEMENTATION errors: MemoryCorrupt, DiskOffline, PigeonLost.]

Error Scope
- In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment.
- We use the term scope to draw the distinction.

Error Scope
- The scope of an error is the portion of the system that it invalidates.
- An error must be delivered to the process responsible for managing that scope.

    Error                     Scope         Handler
    FileNotFound              File          Calling Function
    RPC Disconnect            Process       Parent Process
    Cache Coherency Problem   Machine       Hypervisor or Operator
    PVM Node Crash            PVM Cluster   Parent Process

Error Detail
- The detail of an error describes in phenomenological terms the cause of the error.
- In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.
- Suppose open returns AccessDenied...
  - File is not accessible: OK.
  - Library containing 'open' is not accessible: Problem!

What To Do With An Error?
- A program cannot possibly know what to do with an error outside its scope.
  - Should sin(x) deal with "math library not available"?
- Propagate an error to the manager of the scope as directly as possible.
- Sometimes, a direct mechanism:
  - Signal, exception, dropped connection, message.
- Sometimes, an indirect mechanism:
  - Touch a file, then exit by any means available.

Principles for Error Design
- Principle 1:
  - A routine must not generate an implicit error as a result of receiving an explicit error.
- Principle 2:
  - An escaping error converts a potential implicit error into an explicit error at a higher level.
- Principle 3:
  - An escaping error must be propagated to the program that manages the error's scope.
- Principle 4:
  - Error interfaces must be concise and finite.

Return to Condor
- What did we do wrong?
  - We failed to carefully consider the scope of an error.
  - We fell prey to the deadly generic error.
- What's the solution?
  - Identify error scopes in Condor.
  - Find more direct mechanisms to send escaping errors to the managing process.

[Figure: error scopes in Condor, with the process that manages each. schedd: Job Scope. shadow: Local Resource Scope. starter: Remote Resource Scope. JVM: Virtual Machine Scope. program: Program Scope. Other labels in the figure: Prog Image, Prog Args, User Policy, I/O Server, Input Data, Output Space, Owner Policy, Java Pkg, Mem & CPU, Code, Data.]

Scope in Condor

    Detail                      Scope             Handler
    Program exited normally.    Program           User
    Null pointer exception.     Program           User
    Out of memory.              Virtual Machine   JVM
    Java misconfigured.         Remote Resource   Starter
    Home file system offline.   Local Resource    Shadow
    Program image corrupt.      Job               Schedd

Scope in Condor: JVM Exit Code

    Detail                      Scope             Handler   Exit Code
    Program exited normally.    Program           User      (x)
    Null pointer exception.     Program           User      1
    Out of memory.              Virtual Machine   JVM       1
    Java misconfigured.         Remote Resource   Starter   1
    Home file system offline.   Local Resource    Shadow    1
    Program image corrupt.      Job               Schedd    1

[Figure: result reporting. Inside the JVM, the Wrapper runs The Job with the I/O Library and writes "Program Result or Error and Scope" to a Result File. The Job Agent (starter) reads the Result File and the JVM Result, and sends "Starter Result + Program Result" to the Job Agent (shadow), which has access to the Home File System.]

[Figure: the same JVM, Wrapper, The Job, and I/O Library. Errors Inside Program Scope go to the Result File; Errors of Larger Scope are reported through the JVM Result.]

Half-Way Conclusion
- Small but powerful changes drastically improved the Java Universe.
- Our mistake was to represent all possible errors explicitly in the closest interface.
- Error scope is an analytic tool that helps the designer decide how to propagate an error.
- But, we were initially confused by the presence of the deadly generic error.

The Deadly Generic Error
- Whereas, a program may fail in more ways than we can possibly imagine...
- And whereas, generality and flexibility are virtues of programming...
- Be it therefore resolved that interfaces should return general, flexible, arbitrary values:
    int open( String name ) throws IOException;

What's Wrong with Generality?
- The structure and types of errors are as essential to an interface as the arguments and return values.
- Every error requires a different recovery mechanism, according to its scope:
  - EINTR - try again right away
  - ETIMEDOUT - will be available again in the future
  - EPERM - you can't at all without talking to a person
  - ESTALE - must kill process
- A program must know the *specific* details of an error in order to take the right action. Guesses don't work.
  - Exit on unknown errors? Program is brittle.
  - Retry on unknown errors? Program waits endlessly.

An Example of Generality
    int open( String name ) throws IOException;
    int write( int data ) throws IOException;

An Example of Generality
- Java defines several types of IOException:
  - AccessDenied, FileNotFound, EndOfFile...
- Can open throw...?
  - FileNotFound
  - EndOfFile
  - DiskFull
- Can write throw...?
  - AccessDenied
  - FileNotFound
  - DiskFull
- Trick Question!

My Disk Runneth Over!
- What can a program expect for a full disk?
  - DiskFullException
  - OutOfSpaceException
  - It's really neither! (How would we know?)
- What should an implementor do when the disk fills up?
  - There is no appropriate exception to throw.
  - Making up an exception is not useful.
  - Only solution: an escaping error. (Example later.)

Advice for Constructing Error Interfaces
- Export a small set of expected error types.
  - Bad Arguments
  - Lost Connection
  - No Such File
- Choose an internal error management strategy. You know the cost of retry vs the cost of failure.
  - Retry internally
  - Abort process
  - Drop connection

A Better Interface
    int open( String name ) throws AccessDenied, FileNotFound;
    int write( int data ) throws DiskFull;

Conclusion
- Small but powerful changes drastically improved the Java Universe.
- Our mistake was to represent all possible errors explicitly in the closest interface.
- Error scope is an analytic tool that helps the designer decide how to propagate an error.
- An error discipline saves precious resources: time and aggravation!

For more information...
- "Error Scope" Paper
  - http://www.cs.wisc.edu/~thain
- Condor Software, Manuals, Papers, and More
  - http://www.cs.wisc.edu/condor
- Douglas Thain
  - thain@cs.wisc.edu
- Miron Livny
  - miron@cs.wisc.edu

Questions now?
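
As a closing illustration, the "Scope in Condor" table could be sketched in Java along the following lines. This is a minimal sketch under stated assumptions: ScopeSketch, classify, handlerFor, and resultLine are hypothetical names, and the mapping from Java throwables to scopes is a guess for illustration (NoClassDefFoundError stands in for "Java misconfigured"). It is not Condor's implementation. It shows a wrapper recording an error's scope alongside its detail, so that the error can be routed to the process that manages that scope.

```java
// Hypothetical sketch (not Condor code): attach a scope to each error so
// it can be delivered to the handler that manages that scope.
class ScopeSketch {

    // Scopes from the "Scope in Condor" table, smallest first.
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE, LOCAL_RESOURCE, JOB }

    // Handler for each scope, as listed in the table.
    static String handlerFor(Scope scope) {
        switch (scope) {
            case PROGRAM:         return "User";
            case VIRTUAL_MACHINE: return "JVM";
            case REMOTE_RESOURCE: return "Starter";
            case LOCAL_RESOURCE:  return "Shadow";
            default:              return "Schedd";   // JOB
        }
    }

    // Assumed mapping from a Throwable to a scope; anything the sketch
    // does not recognize is treated as an error computed by the program.
    static Scope classify(Throwable t) {
        if (t instanceof OutOfMemoryError)     return Scope.VIRTUAL_MACHINE;
        if (t instanceof NoClassDefFoundError) return Scope.REMOTE_RESOURCE;
        return Scope.PROGRAM;
    }

    // What a wrapper might write to a result file: detail plus scope.
    static String resultLine(Throwable t) {
        Scope scope = classify(t);
        return t.getClass().getSimpleName()
                + " scope=" + scope
                + " handler=" + handlerFor(scope);
    }

    public static void main(String[] args) {
        System.out.println(resultLine(new NullPointerException()));
        System.out.println(resultLine(new OutOfMemoryError()));
        System.out.println(resultLine(new NoClassDefFoundError()));
    }
}
```

Only errors classified as PROGRAM would be shown to the user as the job's own result; the rest would be handed to the agent responsible for the larger scope.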