XII.1 Debugging of Distributed Systems

XII.2 Debugging of Distributed Systems
• Example of a tool for distributed systems
• Approach to fault search during testing
• Control and inspection of the internal program execution at runtime

XII.3 Debugging of Distributed Systems
Requirements
– User-friendliness
– Problem orientation (symbolic debugging): String c = "xyz" instead of "LOC FF2243 AC32..."
– Reproducibility (quasi-deterministic)
– Presentation of state information (variables, registers, ports etc.: "show c")
– Modification of system state (set c = "ABC")
– Supervision mechanisms
[Figure: User, Debugger, Tested program; labels: query / modification, state information]

XII.4 Special problems
• Parallel processing
• Indeterminism
• Absence of a global state
• Absence of a common clock
• Interference between debugger and system
• Resulting information flooding
• Semantics of special constructs (breakpoint, break conditions)
• Improved functionality (inter-process communication)

XII.5 Inter-process communication
• State information includes the communication state in addition to the process/object state
– Intervention by manipulation is preferable
• Separation into an intra-process layer (conventional) and an inter-process layer (special)
Functionality of the inter-process layer
• Access to messages:
– insert <m> in <port>
– read <m> from <port>
– extract <m> from <port>
– forward <m> to <port>

XII.6 Inter-process communication
• Breakpoints
– set break <port> <mtype> [send | receive]
– set break <port1> ...
<portn>
• Statistical accounting records
• Access to operating system objects (semaphores, processes)

XII.7 Consistent state representations
Problem: no common clock and no common storage → no consistent state representation
• Approaches
– Clock synchronization (in the range of milliseconds)
– Logical ordering of the events
• Basis: Lamport approach
– Partial order "pre-relation" (happened-before)
– Events are ordered by causal context (sending before receiving)
– Events are unordered if they are independent

XII.8 Consistent state representations
• Rules
– a and b in the same process, a before b: a → b
– a is the sending and b the receiving of the same message: a → b
– a → b and b → c imply a → c (transitivity)
All essential events of distributed processing can be ordered (consistent logical "snapshots")

XII.9 Lamport approach
Realization via the algorithm
– each process has an event counter Z (initially zero)
– each inter-process event E has a number N(E); each message carries a number N(M) (= N(E) of its sending event)
• Sending:
– increment Z (Z := Z + 1)
– mark the sending event: N(E) := Z
– mark the message: N(M) := Z
• Receiving a message with number N(M):
– if N(M) > Z (receiver), set Z := N(M) + 1
– otherwise set Z := Z + 1
– mark the receiving event: N(E) := Z
• Intra-process event:
– Z := Z + 1
– N(E) := Z

XII.10 Lamport approach
[Figure: time diagram of processes P1, P2 and P3 with Lamport event numbers 1 to 12]
• Causally related events are completely ordered
• Causally unrelated events remain unordered (for instance, event no. 12 in P2 and in P3)

XII.11 Semantics of breakpoints
Problem: When does a breakpoint satisfy distributed conditions?
Approach:
– simple predicates (one process, e.g. "call proc")
– disjunctive predicates ("P1: call proc | P2: call xy")
– conjunctive predicates ("P1: call proc & P1: x=1"), only within a single process
– linked predicates: coupling of events via the pre-relation
[Figure: Process 1 with events t11, t12; Process 2 with events t21, t22, t23; Process 3 with events t31, t32, t33. t31, t22: ordered; t11, t21: unordered]

XII.12 Consistent stopping of processes
Problem: time delay after a halt command has been issued
Approach: backtracking to the consistent state directly before the stopping event ("reset line")
Procedure: backtracking along the causal contexts given by the pre-relation of messages
[Figure: Process 1 with events t11 to t14; Process 2 with events t21 to t24; Process 3 with events t31 to t34]
– t12: stop point event
– Process 2: backtracking to t23
– Process 3: backtracking to t32

XII.13 Distributed trace-steps
Basis: step mode of sequential debuggers (interactive)
– one trace-step means advancing to the next interaction point (inter-process event)
– local calculations form a single unit
– sending operations are carried out on all participating processes
– receiving operations are carried out only if a message exists (possibly only after a sending step)
[Figure: distributed trace-steps 1, 2, 3; legend: calculation phase, interaction point]

XII.14 Indeterminism handling
Indeterministic program behavior: race conditions
Options:
– Testing of different possible execution sequences via distributed single step
– Re-execution / replay via output recording
Approach:
– recording of all inter-process events
– control of the repeated execution based on this recording (re-execution)
– high storage requirements, reducible via checkpoints (events preceding a checkpoint need not be kept)
– replay of a single process is also possible (also important for technical processes)

XII.15 Handling of information flooding
Requirement: the recorded / output information must be reduced
• Restriction to inter-process events
• Restriction to relevant time intervals
• Abstraction forms for
– process groups
– execution (timing diagram)
– ports (abstract message flow)
• Graphics support
(control windows, animation tools)

XII.16 Distributed debugging: concepts
Hierarchical levels of influence
• Level 1: "Free runtime"
– no modification, only trace recording
– minimal interference
• Level 2: "Self-responsibility"
– freely modifiable execution
– strong interference
– full responsibility of the tester for execution control
• Level 3: "Pseudo-real-time"
– "the best possible compensation for strong interference"
– "private clock" per process
– the "private clock" runs except while debugger code is executing
– the "private clocks" are synchronized on a partial order, for instance via the Lamport algorithm

XII.17 Architecture principles
Alternatives:
1. Separate processes: program / debugger
2. Separate processes with common data (also lightweight processes)
3. Integrated processes with direct instrumentation
As a rule, alternative 2 or 3 is used.

XII.18 Architecture proposal
[Figure: Computer A with Process 1, Process 2 and a local debugging control; Computer B with Process 3, Process 4 and a local debugging control; a centralized dialogue process]
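
Supplementary sketch for XII.9 (Lamport approach): the counter rules for sending, receiving and intra-process events can be illustrated with a few lines of Python. This is a minimal sketch, not the implementation of any particular debugger. The class name Process, its methods and the event log are hypothetical names introduced only for illustration. The message number is simply the value returned by send().

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process with its event counter Z (hypothetical helper class)."""
    name: str
    z: int = 0                                  # event counter Z, initially zero
    log: list = field(default_factory=list)     # (kind, N(E)) for every event

    def local_event(self) -> int:
        """Intra-process event: Z := Z + 1, N(E) := Z."""
        self.z += 1
        self.log.append(("local", self.z))
        return self.z

    def send(self) -> int:
        """Sending event: Z := Z + 1, N(E) := Z; the message is marked with Z."""
        self.z += 1
        self.log.append(("send", self.z))
        return self.z                           # number carried by the message

    def receive(self, msg_number: int) -> int:
        """Receiving event, following the two cases of the slide."""
        if msg_number > self.z:
            self.z = msg_number + 1             # message is "ahead" of the receiver
        else:
            self.z += 1                         # receiver is already ahead
        self.log.append(("receive", self.z))
        return self.z


# Small scenario with two processes exchanging two messages.
p1, p2 = Process("P1"), Process("P2")
p1.local_event()        # P1 counter becomes 1
m = p1.send()           # P1 counter becomes 2, message carries 2
p2.receive(m)           # P2 counter was 0, so it jumps to 3
p2.local_event()        # P2 counter becomes 4
m2 = p2.send()          # P2 counter becomes 5, message carries 5
p1.receive(m2)          # P1 counter was 2, so it jumps to 6
print(p1.log)
print(p2.log)
```

In this scenario every receiving event gets a larger number than the corresponding sending event, which is the property the pre-relation requires. Equal or unrelated numbers in different processes say nothing about causality, matching the remark on XII.10 that non-causal events stay unordered.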