SEMINAR 236825: OPEN PROBLEMS IN DISTRIBUTED COMPUTING
Winter 2013-14, Hagit Attiya & Faith Ellen

INTRODUCTION

Distributed Systems
• Distributed systems are everywhere:
– share resources
– communicate
– increase performance (speed & fault tolerance)
• Characterized by:
– independent activities (concurrency)
– loosely coupled parallelism (heterogeneity)
– inherent uncertainty
• Examples: operating systems, (distributed) database systems, software fault tolerance, communication networks, multiprocessor architectures

Main Admin Issues
• Goal: read some interesting papers related to open problems in the area
• Mandatory (active) participation
– one absence without explanation
• Tentative list of papers already published
– first come, first served
• Lectures in English

Course Overview: Basic Models
[Table: the basic models along two axes — message passing vs. shared memory, and synchronous vs. asynchronous; the synchronous shared-memory model is the PRAM.]

Message-Passing Model
• Processors p0, p1, …, pn-1 are the nodes of a graph. Each is a state machine with a local state.
• Bidirectional point-to-point channels are the undirected edges of the graph.
• The channel from pi to pj is modeled in two pieces:
– an outbuf variable of pi (the physical channel)
– an inbuf variable of pj (the incoming message queue)
[Figure: a four-node graph on p0, p1, p2, p3 with numbered channel endpoints.]

Modeling Processors and Channels
[Figure: the same modeling, showing inbuf[1] and outbuf[2] among p1's local variables, and inbuf[2] and outbuf[1] among p2's local variables.]

Configuration
A snapshot of the entire system: the accessible processor states (local variables & incoming message queues) as well as the communication channels.
Formally, a vector of processor states (including outbufs, i.e., channels), one per processor.

Deliver Event
Moves a message from the sender's outbuf to the receiver's inbuf; the message will be available the next time the receiver takes a step.
[Figure: message m1 moves from p1's outbuf to p2's inbuf, behind m2 and m3.]

Computation Event
Occurs at one processor:
• Start with the old accessible state (local variables + incoming messages)
• Apply the processor's state-machine transition function, handling all incoming messages
• End with the new accessible state, with empty inbufs and new outgoing messages
[Figure: incoming messages a, b, c and the old local state yield a new local state and outgoing messages d, e.]

Execution
configuration, event, configuration, event, configuration, …
• In the first configuration, each processor is in an initial state and all inbufs are empty
• For each consecutive triple (configuration, event, configuration), the new configuration is the same as the old configuration except:
– for a delivery event, the specified message is transferred from the sender's outbuf to the receiver's inbuf
– for a computation event, the specified processor's state (including its outbufs) changes according to the transition function

Asynchronous Executions
• An execution is admissible in the asynchronous model if
– every message in an outbuf is eventually delivered
– every processor takes an infinite number of steps
• No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
• Models a reliable system (no message is lost and no processor stops working)

Example: Simple Flooding Algorithm
• Each processor's local state consists of a variable color, either red or green
• Initially:
– p0: color = green, all outbufs contain M
– others: color = red, all outbufs empty
• Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs

Example: Flooding
[Figure: on a triangle p0, p1, p2 — a deliver event at p1 from p0, then a computation event at p1; a deliver event at p2 from p1, then a computation event at p2.]

Example: Flooding (cont'd)
[Figure: a deliver event at p1 from p2, a deliver event at p0 from p1, a computation event at p1, and so on, until the remaining messages are delivered.]

(Worst-Case) Complexity Measures
• Message complexity: the maximum number of messages sent in any admissible execution
• Time complexity: the maximum "time" until all processes terminate in any admissible execution
• How to measure time in an asynchronous execution?
– Produce a timed execution by assigning nondecreasing real times to events so that the time between sending and receiving any message is at most 1
– Time complexity: the maximum time until termination in any timed admissible execution

Complexities of the Flooding Algorithm
A state is terminated if color = green.
• One message is sent over each edge in each direction, so the message complexity is 2m, where m is the number of edges
• A node turns green once a "chain" of messages reaches it from p0, so the time complexity is diameter + 1 time units

Synchronous Message-Passing Systems
An execution is admissible in the synchronous model if it is an infinite sequence of rounds.
– A round is a sequence of deliver events moving all messages in transit into inbufs, followed by a sequence of computation events, one for each processor.
This captures the lockstep behavior of the model. It also implies that
– every message sent is delivered
– every processor takes an infinite number of steps
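The flooding algorithm described above can be sketched as a small simulation (illustrative Python, not part of the course material; the function name `flood` and the FIFO delivery order are my own choices — the algorithm is correct under any admissible delivery order):

```python
from collections import deque

def flood(adj, root=0):
    """Simulate the flooding algorithm on an undirected graph.

    adj: adjacency list {node: [neighbors]}. Returns the final colors
    and the total number of messages sent.
    """
    color = {v: "red" for v in adj}
    color[root] = "green"
    sent = 0
    pending = deque()
    for u in adj[root]:              # root starts with M in every outbuf
        pending.append((root, u))
        sent += 1
    while pending:
        src, dst = pending.popleft() # deliver event
        if color[dst] == "red":      # computation event at dst
            color[dst] = "green"
            for u in adj[dst]:       # send M on all outbufs
                pending.append((dst, u))
                sent += 1
    return color, sent
```

On a triangle (m = 3 edges) every node turns green and exactly 2m = 6 messages are sent, matching the message-complexity bound from the slides.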
Time is the number of rounds until termination.

Example: Flooding in the Synchronous Model
[Figure: on the triangle p0, p1, p2, the round 1 events deliver M to p1 and p2, and the round 2 events deliver the remaining messages.]
Time complexity is diameter + 1; message complexity is 2m.

Broadcast Over a Rooted Spanning Tree
• Processors have information about a rooted spanning tree of the communication topology
– parent and children local variables at each processor
• The root initially sends M to its children
• When a processor receives M from its parent, it
– sends M to its children
– terminates (sets a local Boolean to true)
• Complexities (synchronous and asynchronous models):
– time is the depth of the spanning tree, which is at most n - 1
– the number of messages is n - 1, since one message is sent over each spanning-tree edge

Finding a Spanning Tree from a Root
• The root sends M to all its neighbors
• When a non-root processor first gets M, it
– sets the sender as its parent
– sends a "parent" message to the sender
– sends M to all its other neighbors (if it has no other neighbors, it terminates)
• When it gets M at any other time, it
– sends a "reject" message to the sender
• The "parent" and "reject" messages are used to set the children variables and to terminate (after hearing from all neighbors)

Execution of the Spanning-Tree Algorithm
[Figure: executions on a graph with nodes a–h rooted at root.]
• Both models: O(m) messages, O(diam) time
• Synchronous: always gives a breadth-first search (BFS) tree
• Asynchronous: not necessarily a BFS tree

Execution of the Spanning-Tree Algorithm (cont'd)
[Figure: an asynchronous execution that gave a depth-first search (DFS) tree.]
Is the DFS property guaranteed? No!
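The round-by-round behavior that makes the synchronous execution produce a BFS tree can be sketched as follows (an illustrative sketch; the name `spanning_tree_sync` and the adjacency-list representation are assumptions, not course code, and the "parent"/"reject" replies are not modeled):

```python
def spanning_tree_sync(adj, root=0):
    """Synchronous execution of the spanning-tree algorithm: in each
    round, all messages in transit are delivered before any new ones
    are sent, so a node's parent is always at minimum distance and
    the resulting tree is BFS."""
    parent = {root: None}
    frontier = [root]                 # nodes whose copies of M are in transit
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in adj[v]:          # deliver M to every neighbor of v
                if u not in parent:   # first M received: adopt the sender
                    parent[u] = v
                    next_frontier.append(u)
                # otherwise u would reply "reject" (not modeled here)
        frontier = next_frontier
    return parent
```

In an asynchronous execution, by contrast, a copy of M travelling along a long path can arrive before one along a short path, which is why the tree need not be BFS.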
[Figure: another asynchronous execution results in a tree that is neither BFS nor DFS.]

Shared Memory Model
• Processors (also called processes) communicate via a set of shared variables
• Each shared variable has a type, defining a set of primitive operations (performed atomically):
– read, write
– compare&swap (CAS)
– LL/SC, DCAS, kCAS, …
– read-modify-write (RMW), kRMW
[Figure: p0, p1, p2 apply read, write, and RMW primitives to shared variables X and Y.]

Changes from the Message-Passing Model
• No inbuf and outbuf state components
• A configuration includes values for the shared variables
• One event type: a computation step by a process
– pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
– the shared variable's value in the new configuration changes according to the primitive's semantics
– pi's state in the new configuration changes according to its old state and the result of the primitive
• An execution is admissible if every processor takes an infinite number of steps

Abstract Data Types
• An abstract representation of data & a set of methods (operations) for accessing it
• Implemented using primitives on base objects
• Sometimes there is a hierarchy of implementations: primitive operations implemented from more low-level ones

Executing Operations
[Figure: processes P1, P2, P3 issue invocations and receive responses — enq(1)/ok, enq(2), deq/1 — with operations overlapping in time.]

Interleaving Operations, or Not
• Sequential behavior: invocations & responses alternate and match (on process & object)
• Sequential specification: all legal sequential behaviors

Correctness: Sequential Consistency [Lamport, 1979]
For every concurrent execution there is a sequential execution that
– contains the same operations
– is legal (obeys the sequential specification)
– preserves the order of operations by the same process

Example 1: Multi-Writer Registers
Using (multi-reader) single-writer registers: add logical time to the values; read only your own value.
  Write(v, X):
    read TS1, …, TSn
    TSi = max TSj + 1
    write (v, TSi)
  Read(X):
    read (v, TSi)
    return v
Once in a while, read TS1, …, TSn and write to TSi; this is needed to ensure that writes are eventually visible.

Timestamps
1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp

Multi-Writer Registers: Proof
Create a sequential execution:
– place the writes in timestamp order
– insert the reads after the appropriate writes
Legality is immediate. Per-process order is preserved, since a read returns a value whose timestamp is at least that of the preceding write by the same process.

Correctness: Linearizability [Herlihy & Wing, 1990]
For every concurrent execution there is a sequential execution that
– contains the same operations
– is legal (obeys the specifications of the ADTs)
– preserves the real-time order of non-overlapping operations
Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity).

Example 2: Linearizable Multi-Writer Registers [Vitanyi & Awerbuch, 1987]
Using (multi-reader) single-writer registers: add logical time to the values.
  Write(v, X):
    read TS1, …, TSn
    TSi = max TSj + 1
    write (v, TSi)
  Read(X):
    read TS1, …, TSn
    return the value with the maximum TS

Multi-Writer Registers: Linearization Order
Create a linearization:
– place the writes in timestamp order
– insert each read after the appropriate write
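The timestamp-based multi-writer construction can be sketched sequentially (an illustrative sketch of the Vitanyi–Awerbuch idea; the class name and the use of a plain list in place of n single-writer registers are my simplifications — in the real construction each slot is a separate single-writer multi-reader register and the read/write steps of different processes interleave):

```python
class MultiWriterRegister:
    """Each of n writers owns one single-writer register holding a
    (timestamp, writer_id, value) triple; writer ids break timestamp
    ties. Plain list slots stand in for the single-writer registers."""

    def __init__(self, n, initial=None):
        self.slots = [(0, i, initial) for i in range(n)]

    def write(self, i, v):
        max_ts = max(ts for ts, _, _ in self.slots)  # read TS1, ..., TSn
        self.slots[i] = (max_ts + 1, i, v)           # write (v, TSi)

    def read(self, i):
        # return the value carrying the maximum (timestamp, writer id)
        return max(self.slots)[2]
```

Because `max` compares the (timestamp, writer_id) pairs lexicographically, a write that completes before another one starts always loses to it, which is the second timestamp property above.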
Multi-Writer Registers: Proof
Create a linearization:
– place the writes in timestamp order
– insert each read after the appropriate write
Legality is immediate. Real-time order is preserved, since a read returns a value whose timestamp is at least as large as those of all preceding operations.

Example 3: Atomic Snapshot
• n components
• Update a single component
• Scan all the components "at once" (atomically)
[Figure: update(v) returns ok; scan returns v1, …, vn.]
Provides an instantaneous view of the whole memory.

Atomic Snapshot Algorithm [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]
  Update(v, k):
    A[k] = (v, seqi, i)
  Scan():
    repeat
      read A[1], …, A[n]   (double collect)
      read A[1], …, A[n]
      if equal, return A[1, …, n]
Linearize:
• updates with their writes
• scans inside the double collects

Atomic Snapshot: Linearizability
Double collect (read a set of values twice): if the two collects are equal, there is no write between them
– assuming each write has a new value (sequence number)
[Figure: a write to A[j] cannot fall between two equal collects.]
This creates a "safe zone" in which the scan can be linearized.

Liveness Conditions
• Wait-free: every operation completes within a finite number of (its own) steps — the analogue of no starvation for mutex
• Nonblocking: some operation completes within a finite number of (some other process's) steps — the analogue of deadlock freedom for mutex
• Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
– also called solo termination
• Wait-free implies nonblocking, which implies obstruction-free; similarly, bounded wait-free implies bounded nonblocking implies bounded obstruction-free

Wait-Free Atomic Snapshot [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]
• Embed a scan within the Update.
  Update(v, k):
    V = Scan()   (embedded scan)
    A[k] = (v, seqi, i, V)
  Scan():
    repeat
      read A[1], …, A[n]   (direct scan: double collect)
      read A[1], …, A[n]
      if equal, return A[1, …, n]
      else record the differences
      if some pj has changed twice, return Vj   (borrowed scan)
Linearize:
• updates with their writes
• direct scans as before
• borrowed scans in place

Atomic Snapshot: Borrowed Scans
[Figure: the scanner repeatedly reads A[j] while pj writes A[j] twice; pj performs its embedded scan before its second write, in between the scanner's reads.]
If pj is seen to interfere twice, its second Update began after the scanner started, so its embedded scan lies entirely within the scanner's interval; linearizing with the borrowed scan is therefore OK.

List of Topics (Indicative)
• Atomic snapshots
• Renaming
• Space complexity of consensus
• Maximal independent set
• Dynamic storage
• Routing
• Vector agreement
• and possibly others…
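As a companion to the snapshot slides, the double-collect scan can be sketched as follows (an illustrative, sequential sketch; in the real algorithm the collects race against concurrent Updates by other processes, and the wait-free version adds the borrowed-scan case — here a single-threaded scan always succeeds on its first double collect):

```python
class Snapshot:
    """Sketch of the double-collect scan: each component carries a
    sequence number, so two identical collects guarantee that no
    write fell between them (the "safe zone")."""

    def __init__(self, n):
        self.A = [(0, None)] * n          # (seq, value) per component

    def update(self, k, v):
        seq, _ = self.A[k]
        self.A[k] = (seq + 1, v)          # a fresh seq# marks every write

    def scan(self):
        while True:
            first = list(self.A)          # first collect
            second = list(self.A)         # second collect
            if first == second:           # no write in between: linearize here
                return [v for _, v in first]
```

The sequence numbers matter: without them, a component overwritten with the same value could make two different memory states look identical to the double collect.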