Error Scope on a Computational Grid: Theory and Practice
Douglas Thain
Computer Sciences Department
University of Wisconsin
USC Reliability Workshop
July 2002
Outline
An Exercise: Condor + Java
Bad News: Error Explosion
A Theory of Error Propagation
Down with Generic Errors!
Condor Revisited
Parting Thoughts
An Exercise: Coupling Condor and Java
The Condor Project, est. 1985.
  – Production high-throughput computing facility.
  – Provides a stable execution environment on a Grid of unstable, autonomous resources.
The Java Language, est. 1991.
  – Production language, compiler, and interpreter.
  – Provides a standard instruction set and libraries on any processor and system.
The Grid, est. ????
  – Execute any code anywhere at any time.
  – Dependable, consistent, pervasive, inexpensive...
  – Are we there yet?
The Condor High Throughput Computing System
HTC != HPC
  – Measured in sims/week, frames/month, cycles/year.
All participants are autonomous.
  – Users give constraints on usable machines.
  – Machines give constraints on jobs and users.
  – ClassAds: a language for matchmaking.
If you are willing to re-link jobs...
  – Remote system calls for transparent mobility.
  – Binary checkpointing for migration and fault-tolerance.
  – Can’t relink? All other features available.
Special “universes” support software environments.
  – PVM, MPI, Master-Worker, Vanilla, Globus, Java
[Figure: Condor architecture. Submission Site: User Agent (schedd), Job Agent (shadow), Home File System, Policy Control. Execution Site: Machine Agent (startd), Job Agent (starter), The Job, Policy Control. A MatchMaker sits between the two sites. Arrows labeled: Claiming Protocol, Execution Protocol, Fork.]
Java Universe
Execution:
  – User specifies .class and .jar files.
  – Machine provides the JVM details.
Input and Output:
  – Know all of your files? Condor transfers whole files for you.
  – Need online I/O? Link program with Chirp I/O Library. Execution site provides proxy to home site.
[Figure: Java Universe I/O path. Submission Site: Job Agent (shadow) with I/O Server, Local System Calls, Home File System. Execution Site: Job Agent (starter) with I/O Proxy, Fork, JVM containing the Wrapper, The Job, and the I/O Library. Links labeled: Secure Remote I/O, Local RPC (Chirp).]
Initial Experience
Bad news! Any kind of error sent the job back to the user with an exception message:
  – NullPointerException - Program is faulty.
  – OutOfMemory - Program outgrew machine.
  – ClassNotFoundError - Machine incorrectly installed.
  – ConnectionRefused - Network temporarily unavailable.
Users were frustrated because they had to evaluate whether the job failed or the system failed.
These were correct in the sense they were true.
These were not bugs. We deliberately trapped all possible errors and passed them up the chain.
What’s the Problem?
To reason about this problem, we began to construct a theory of error propagation.
This theory offers some common definitions and four principles that outline a design discipline.
We re-examined the Java Universe according to this theory.
Our most serious mistake: We failed to propagate errors according to their scope.
We are NOT Talking About:
Fault Tolerance
  – What algorithms are fault-resistant?
  – How many disks can I lose without losing data?
  – How many copies should I make for five nines?
Language Structures
  – Should I use Objects or Strings to represent errors?
  – Should I use Exceptions or Signals to communicate errors?
These are important and valuable questions, but we are asking something different!
We ARE Talking About:
Where is the problem?
How should a program respond to an error?
Who should receive an error message?
What information should an error carry?
How can we even reason about this stuff?
Engineering Perspective
Fault
  – A physical disruption of the machine.
Error
  – An information state that reflects a fault.
Failure
  – A violation of documented/guaranteed behavior.
Fault
  – (A failure in one’s underlying components.)
Interface Perspective
Implicit Error
  – A result presented as valid, but found to be false.
  – Example: sqrt(3) -> 2.
Explicit Error
  – A result describing an inability to carry out the request.
  – Example: open("file") -> ENOENT.
Escaping Error
  – A return to a higher level of abstraction.
  – Example: read -> virt mem failure -> process abort.
  – Example: server out of memory -> shutdown socket
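The three categories can be sketched in Java. This is only an illustration: the class and method names (ErrorKinds, brokenSqrt, open, load) are hypothetical, and InternalError stands in for whatever mechanism a real system would use to leave the current level of abstraction.

```java
import java.io.FileNotFoundException;

// Hypothetical names throughout; this only illustrates the three categories.
class ErrorKinds {
    // Implicit error: a wrong answer presented as if it were valid.
    static int brokenSqrt(int x) {
        return 2; // for x = 3 the caller cannot tell that this is false
    }

    // Explicit error: the result itself says the request could not be done.
    static int open(String name) throws FileNotFoundException {
        throw new FileNotFoundException(name); // analogous to ENOENT
    }

    // Escaping error: the interface cannot express the problem, so control
    // returns to a higher level of abstraction instead of returning a value.
    static int load(int address) {
        throw new InternalError("backing store unavailable at " + address);
    }

    public static void main(String[] args) {
        System.out.println("implicit: sqrt(3) = " + brokenSqrt(3));
        try {
            open("file");
        } catch (FileNotFoundException e) {
            System.out.println("explicit: no such file: " + e.getMessage());
        }
        try {
            load(4096);
        } catch (Error e) {
            // Here main plays the role of the higher level that receives it.
            System.out.println("escaping: " + e.getMessage());
        }
    }
}
```

Note that only the explicit error appears in a method signature; the implicit error is invisible, and the escaping error bypasses the signature entirely.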
[Figure: a Program issues a load to the Virtual Memory System (Physical Memory, Backing Store), which normally returns data. The Parent Process sees either a Normal Exit or an Abnormal Exit. Annotations:
  – Would like to return an explicit error, but a load insn has no exit code.
  – Could return a default value, but that creates an implicit error.
  – Escaping error: Tell the parent that the program could not complete.]
Interface Contracts
int load( int address );
The implementor must either compute a result
that conforms to the contract or cause an
escaping error.
Exceptions
int open( String filename )
throws FileNotFound, AccessDenied;
A language with exceptions provides more
structure to the contract. A declared exception
is an explicit error. Yet, escaping errors are
still possible.
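[Figure: a Program calls open on the Virtual File System. At the INTERFACE the possible results are Success, FileNotFound, AccessDenied. Inside the IMPLEMENTATION (Memory, Disk) the possible problems are MemoryCorrupt, DiskOffline, PigeonLost. The Parent Process sees either a Normal Exit or an Abnormal Exit.]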
Error Scope
In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment.
We use the term scope to draw the distinction.
Error Scope
The scope of an error is the portion of the system that it invalidates.
An error must be delivered to the process responsible for managing that scope.
Error                    Scope        Handler
FileNotFound             File         Calling Function
RPC Disconnect           Process      Parent Process
Cache Coherency Problem  Machine      Hypervisor or Operator
PVM Node Crash           PVM Cluster  Parent Process
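One way to make the table concrete is to attach a scope to every error and look up the handler from the scope rather than from the detail. The sketch below is an assumption-laden illustration: ScopeRouting, Scope, ScopedError, and handlerFor are hypothetical names, and the handler strings simply repeat the table.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: route an error by its scope, not by its detail.
class ScopeRouting {
    enum Scope { FILE, PROCESS, MACHINE, PVM_CLUSTER }

    // An error carries both the scope it invalidates and a descriptive detail.
    static final class ScopedError {
        final Scope scope;
        final String detail;

        ScopedError(Scope scope, String detail) {
            this.scope = scope;
            this.detail = detail;
        }
    }

    // Scope -> manager of that scope, copied from the table above.
    static final Map<Scope, String> HANDLERS = new EnumMap<>(Scope.class);
    static {
        HANDLERS.put(Scope.FILE, "Calling Function");
        HANDLERS.put(Scope.PROCESS, "Parent Process");
        HANDLERS.put(Scope.MACHINE, "Hypervisor or Operator");
        HANDLERS.put(Scope.PVM_CLUSTER, "Parent Process");
    }

    static String handlerFor(ScopedError e) {
        return HANDLERS.get(e.scope);
    }

    public static void main(String[] args) {
        ScopedError e = new ScopedError(Scope.PROCESS, "RPC Disconnect");
        System.out.println(e.detail + " is delivered to: " + handlerFor(e));
    }
}
```

The lookup never inspects the detail string, which matches the idea that the detail describes the cause while the scope decides the recipient.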
Error Detail
The detail of an error describes in phenomenological terms the cause of the error.
In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.
Suppose open returns AccessDenied...
  – File is not accessible - Ok.
  – Library containing ‘open’ is not accessible - Problem!
What To Do With An Error?
A program cannot possibly know what to do with an error outside its scope.
  – Should sin(x) deal with “math library not available?”
Propagate an error to the manager of the scope as directly as possible.
Sometimes, a direct mechanism:
  – Signal, exception, dropped connection, message.
Sometimes, an indirect mechanism:
  – Touch a file, then exit by any means available.
Principles for Error Design
Principle 1:
  – A routine must not generate an implicit error as a result of receiving an explicit error.
Principle 2:
  – An escaping error converts a potential implicit error into an explicit error at a higher level.
Principle 3:
  – An escaping error must be propagated to the program that manages the error’s scope.
Principle 4:
  – Error interfaces must be concise and finite.
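Principle 1 is the easiest to violate by accident. The sketch below contrasts a routine that swallows an explicit error with one that keeps it explicit. All names (PrincipleOne, FailingStream, readByteBad, readByteGood) are hypothetical, and the always-failing stream is just a stand-in for a lost connection.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of Principle 1.
class PrincipleOne {
    // A stream that always fails, standing in for a lost connection.
    static final class FailingStream extends InputStream {
        @Override
        public int read() throws IOException {
            throw new IOException("connection lost");
        }
    }

    // Violates Principle 1: the explicit IOException is swallowed, and the
    // caller receives -1, which it will read as a valid end of file.
    static int readByteBad(InputStream in) {
        try {
            return in.read();
        } catch (IOException e) {
            return -1; // an implicit error is born here
        }
    }

    // Respects Principle 1: the explicit error stays explicit.
    static int readByteGood(InputStream in) throws IOException {
        return in.read();
    }

    public static void main(String[] args) {
        System.out.println("bad version returned " + readByteBad(new FailingStream()));
        try {
            readByteGood(new FailingStream());
        } catch (IOException e) {
            System.out.println("good version reported: " + e.getMessage());
        }
    }
}
```

With the bad version, a program copying a file over a broken connection would produce a truncated copy and report success.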
Return to Condor
What did we do wrong?
  – We failed to carefully consider the scope of an error.
  – We fell prey to the deadly generic error.
What’s the solution?
  – Identify error scopes in Condor.
  – Find more direct mechanisms to send escaping errors to the managing process.
[Figure: nested error scopes in Condor. Job Scope (schedd), Local Resource Scope (shadow), Remote Resource Scope (starter), Virtual Machine Scope (JVM), Program Scope (program). Other labels: Prog Image, Prog Args, User Policy, I/O Server, Input Data, Output Space, Owner Policy, Java Pkg, Mem & CPU, Code, Data.]
Scope in Condor
Detail                      Scope             Handler
Program exited normally.    Program           User
Null pointer exception.     Program           User
Out of memory.              Virtual Machine   JVM
Java misconfigured.         Remote Resource   Starter
Home file system offline.   Local Resource    Shadow
Program image corrupt.      Job               Schedd
Scope in Condor: JVM Exit Code
Detail                      Scope             Handler   Exit Code
Program exited normally.    Program           User      (x)
Null pointer exception.     Program           User      1
Out of memory.              Virtual Machine   JVM       1
Java misconfigured.         Remote Resource   Starter   1
Home file system offline.   Local Resource    Shadow    1
Program image corrupt.      Job               Schedd    1
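[Figure: result flow. Inside the JVM, the Wrapper runs The Job with the I/O Library and produces a Result File holding the Program Result, or an Error and Scope. The Job Agent (starter) also receives the JVM Result. The Job Agent (shadow), next to the Home File System, receives Starter Result + Program Result.]
[Figure: detail of the JVM with the Wrapper, The Job, and the I/O Library, distinguishing Errors Inside Program Scope from Errors of Larger Scope. Outputs: Result File, JVM Result.]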
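The result-file arrangement in the figures can be sketched as a small wrapper that runs the job, decides the scope of whatever went wrong, and records the outcome in a file instead of relying on the exit code. This is not Condor's actual wrapper: ResultFileWrapper, classify, describe, and runAndRecord are hypothetical names, the result format is invented, and the mapping from Java throwables to scopes is a simplifying assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch only; not Condor's real wrapper or result format.
class ResultFileWrapper {
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE }

    // Assumed mapping: JVM resource exhaustion invalidates the virtual
    // machine, broken or missing classes point at the installation on the
    // remote resource, and anything else is the program's own business.
    static Scope classify(Throwable t) {
        if (t instanceof VirtualMachineError) return Scope.VIRTUAL_MACHINE;
        if (t instanceof LinkageError) return Scope.REMOTE_RESOURCE;
        return Scope.PROGRAM;
    }

    // Run the job and summarize the outcome as "scope=... detail=...".
    static String describe(Runnable job) {
        try {
            job.run();
            return "scope=PROGRAM detail=exited normally";
        } catch (Throwable t) {
            return "scope=" + classify(t) + " detail=" + t.getClass().getSimpleName();
        }
    }

    // The indirect mechanism: leave the outcome in a file for the starter.
    static void runAndRecord(Runnable job, Path resultFile) throws IOException {
        Files.write(resultFile, describe(job).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path result = Files.createTempFile("result", ".txt");
        runAndRecord(() -> { throw new NullPointerException(); }, result);
        System.out.println(new String(Files.readAllBytes(result), StandardCharsets.UTF_8));
        Files.delete(result);
    }
}
```

With the scope written down, a null pointer exception and an out-of-memory condition no longer look identical to the process reading the result, even though both would otherwise surface as exit code 1.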
Half-Way Conclusion
Small but powerful changes drastically improved the Java Universe.
Our mistake was to represent all possible errors explicitly in the closest interface.
Error scope is an analytic tool that helps the designer decide how to propagate an error.
But, we were initially confused by the presence of the deadly generic error.
The Deadly Generic Error
Whereas, a program may fail in more ways than we can possibly imagine...
And whereas, generality and flexibility are virtues of programming...
Be it therefore resolved that interfaces should return general, flexible, arbitrary values:
  – int open( String name ) throws IOException;
What’s Wrong with Generality?
The structure and types of errors are as essential to an interface as the arguments and return values.
Every error requires a different recovery mechanism, according to its scope:
  – EINTR - try again right away
  – ETIMEDOUT - will be available again in the future
  – EPERM - you can’t at all without talking to a person
  – ESTALE - must kill process
A program must know the *specific* details of an error in order to take the right action. Guesses don’t work.
  – Exit on unknown errors? Program is brittle.
  – Retry on unknown errors? Program waits endlessly.
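The point about specific errors can be sketched as a lookup from a finite set of error codes to recovery actions. The names here (RecoveryPolicy, Errno, Action, recoveryFor) are hypothetical; the four codes and their recoveries are the ones listed above.

```java
// Hypothetical sketch: each specific error maps to its own recovery action.
class RecoveryPolicy {
    enum Errno { EINTR, ETIMEDOUT, EPERM, ESTALE }

    enum Action { RETRY_NOW, RETRY_LATER, ASK_A_PERSON, KILL_PROCESS }

    static Action recoveryFor(Errno e) {
        switch (e) {
            case EINTR:     return Action.RETRY_NOW;    // try again right away
            case ETIMEDOUT: return Action.RETRY_LATER;  // available again in the future
            case EPERM:     return Action.ASK_A_PERSON; // needs human intervention
            case ESTALE:    return Action.KILL_PROCESS; // the process cannot recover
            default:
                // Unreachable while the enum and the switch agree. There is
                // deliberately no "guess" branch for unknown errors.
                throw new AssertionError("unlisted error: " + e);
        }
    }

    public static void main(String[] args) {
        for (Errno e : Errno.values()) {
            System.out.println(e + " -> " + recoveryFor(e));
        }
    }
}
```

Because the set of errors is finite, every case has a deliberate answer. A generic error type would force exactly the exit-or-retry guess that the slide warns against.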
An Example of Generality
int open( String name )
throws IOException;
int write( int data )
throws IOException;
Java defines several types of IOException:
  – AccessDenied, FileNotFound, EndOfFile...
Can open throw...?
  – FileNotFound
  – EndOfFile
  – DiskFull
Can write throw...?
  – AccessDenied
  – FileNotFound
  – DiskFull
Trick Question!
My Disk Runneth Over!
What can a program expect for a full disk?
  – DiskFullException
  – OutOfSpaceException
  – It’s really neither! (How would we know?)
What should an implementor do when the disk fills up?
  – There is no appropriate exception to throw.
  – Making up an exception is not useful.
  – Only solution: an escaping error. (Example later.)
Advice for Constructing Error Interfaces
Export a small set of expected error types.
  – Bad Arguments
  – Lost Connection
  – No Such File
Choose an internal error management strategy. You know the cost of retry vs the cost of failure.
  – Retry internally
  – Abort process
  – Drop connection
A Better Interface
int open( String name )
throws AccessDenied, FileNotFound;
int write( int data )
throws DiskFull;
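A minimal sketch of an implementation behind such an interface follows. Everything except the three declared exception names and the two method signatures is an assumption for illustration: TinyFileSystem, EscapingError, the deviceOnline flag, and the stand-in return values are hypothetical. The idea is that expected errors are declared, while anything else leaves through an unchecked error rather than being squeezed into the declared set.

```java
import java.util.HashSet;
import java.util.Set;

// The declared, expected errors of the interface.
class AccessDenied extends Exception { AccessDenied(String m) { super(m); } }
class FileNotFound extends Exception { FileNotFound(String m) { super(m); } }
class DiskFull extends Exception { DiskFull(String m) { super(m); } }

// Hypothetical unchecked error: it is not part of the interface, so it
// escapes to whoever manages the larger scope.
class EscapingError extends Error { EscapingError(String m) { super(m); } }

// Hypothetical in-memory stand-in for a file system.
class TinyFileSystem {
    private final Set<String> accessible = new HashSet<>();
    private final Set<String> denied = new HashSet<>();
    private int freeBytes;
    boolean deviceOnline = true;

    TinyFileSystem(int freeBytes) { this.freeBytes = freeBytes; }

    void addFile(String name, boolean canAccess) {
        (canAccess ? accessible : denied).add(name);
    }

    int open(String name) throws AccessDenied, FileNotFound {
        if (!deviceOnline) throw new EscapingError("device offline");
        if (denied.contains(name)) throw new AccessDenied(name);
        if (!accessible.contains(name)) throw new FileNotFound(name);
        return 3; // stand-in file descriptor
    }

    int write(int data) throws DiskFull {
        if (!deviceOnline) throw new EscapingError("device offline");
        if (freeBytes == 0) throw new DiskFull("no space left");
        freeBytes--;
        return 1; // stand-in for the number of bytes written
    }

    public static void main(String[] args) throws Exception {
        TinyFileSystem fs = new TinyFileSystem(1);
        fs.addFile("data", true);
        System.out.println("open -> " + fs.open("data"));
        System.out.println("write -> " + fs.write(42));
        try {
            fs.write(42);
        } catch (DiskFull e) {
            System.out.println("explicit: " + e.getMessage());
        }
    }
}
```

A caller now handles exactly two cases for open and one for write. A device going offline is outside the scope of either call, so it is not disguised as one of them.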
Conclusion
Small but powerful changes drastically improved the Java Universe.
Our mistake was to represent all possible errors explicitly in the closest interface.
Error scope is an analytic tool that helps the designer decide how to propagate an error.
An error discipline saves precious resources: time and aggravation!
For more information...
“Error Scope” Paper
  – http://www.cs.wisc.edu/~thain
Douglas Thain
  – thain@cs.wisc.edu
Miron Livny
  – miron@cs.wisc.edu
Condor Software, Manuals, Papers, and More
  – http://www.cs.wisc.edu/condor
Questions now?