VAXclusters: A Closely Coupled Distributed System
Landon Cox
February 19, 2016

Tight and loose coupling
• Characteristics of a tightly-coupled system
  • Close proximity of processors
  • High-bandwidth communication via shared memory
  • A single copy of the operating system
• Characteristics of a loosely-coupled system
  • Physically separated processors
  • Low-bandwidth, message-based communication
  • Independent operating systems

Tight and loose coupling
• Tightly-coupled systems can provide great performance
• What are the disadvantages of a tightly-coupled system?
  • Scaling gets extremely expensive (e.g., a supercomputer)
  • Relatively hard to extend (add components)
  • A failed component can often bring down the entire system
• "Closely coupled" VAXclusters tried to resolve this tension
  • Want extensibility (it should be easy to add components over time)
  • Want availability (i.e., fault tolerance)
  • Should be relatively affordable
  • Performance should be acceptable

Tight and loose coupling
• The performance bottleneck for loose coupling is communication between processors
  • Processes in tightly-coupled systems use shared memory
  • Processes in a loosely-coupled system use messages
• What makes message passing so much slower?
  • The interconnect (a network versus a memory bus)
  • The need to copy data into and out of various address spaces
• (Diagram: two processes sharing one physical memory vs. two processes with separate physical memories exchanging messages)

Close coupling
• So we have to make communication fast
• System Communication Architecture (SCA)
  • Computer Interconnect (CI) for message passing
  • The CI Port provides hardware support
• What types of messages did SCA support?
  • Messages (small, with ordered, reliable delivery)
  • Datagrams (small, with unordered, unreliable delivery)
  • Block transfers (large, with ordered, reliable delivery)

CI Port interface
• Data structures through which software and the CI Port communicate
  • CI Port registers
  • Seven queues (four command queues, a response queue, and free queues)
• (Diagram: the port driver and the CI Port exchanging commands and responses through queues in physical memory)

CI Port interface
• Block-transfer commands describe the data to move rather than carrying it
  • A command points into the source/destination address spaces (via page tables)
  • It identifies the contiguous regions of data to be copied
• Hosts know this information because it was exchanged via messages beforehand
  • Messages/datagrams act as the control plane; block transfers act as the data plane
• How does this reduce copying?
  • Datagrams and messages include their payload in the command itself
  • For block transfers, we don't have to copy data into individual messages: CI Ports can reach into memory themselves
• How else does this improve performance?
  • Far fewer interrupts for the OS to handle
  • Instead of interrupting every 576 bytes for a new packet, the CI Port can copy data into the destination address space without interruption
  • The OS is interrupted only when the transfer is complete
• (A sketch below contrasts the two command shapes.)
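To make the control-plane/data-plane split concrete, here is a minimal sketch of the two command shapes a port driver might enqueue. The class and field names (BufferDescriptor, MessageCommand, BlockTransferCommand) are hypothetical illustrations, not DEC's actual data structures.

// Hypothetical sketch of CI Port commands (names are illustrative, not DEC's).
class BufferDescriptor {
    long physicalBase;  // start of a contiguous region, resolved via page tables
    int length;         // bytes in the region

    BufferDescriptor(long physicalBase, int length) {
        this.physicalBase = physicalBase;
        this.length = length;
    }
}

// A small message carries its payload inline: the OS copies data into the command.
class MessageCommand {
    byte[] payload;  // copied into the command-queue entry

    MessageCommand(byte[] payload) { this.payload = payload; }
}

// A block transfer carries only descriptors: the CI Port moves the data itself
// and interrupts the OS once, when the whole transfer completes.
class BlockTransferCommand {
    BufferDescriptor source;
    BufferDescriptor destination;

    BlockTransferCommand(BufferDescriptor src, BufferDescriptor dst) {
        this.source = src;
        this.destination = dst;
    }
}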
Storage
• A single, networked storage interface
  • Disks were distributed across the cluster
  • A single namespace for all files
• Advantages of a unified file-system namespace
  • Makes it easier to add new nodes
  • Makes sharing files easier
  • You can log in anywhere and get to your data

Storage
• If we want to allow concurrent access to files, we need synchronization primitives
  • Otherwise, processes can't coordinate their activities
• This was a problem for single-host file systems too, but there the OS kernel synchronized access
• Why is synchronization harder in a cluster?
  • A cluster is a distributed system
  • Nodes can fail, messages can be lost, etc.

Storage
• When synchronizing access to in-memory data, the locks are also in memory
  • No magic: the locks and the objects all live in the same address space
• Why not use file content itself as the basis for synchronization?
  • (i.e., write "lock=owned" to the file lock.txt)
  • It would require very strong consistency guarantees, equivalent to those of memory accesses
  • Horrendously slow, and you would potentially pay that penalty at all times

Implementing locking
• First, we have to agree on cluster membership
  • If we don't all agree on who is around, it's going to be really hard to agree on anything else
• How does a cluster agree on its membership?
  • Each node has a connection manager
  • Each connection manager has a copy of the membership
  • Use a quorum voting scheme (see the sketch below)
• Consensus is a really hard problem
  • Nodes can fail and come back online arbitrarily
  • Messages can be lost or slow
  • It is impossible to distinguish a failed node from a slow one
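As a rough illustration of the quorum idea (a sketch only, not the actual VMS connection-manager protocol): each node contributes votes, and the cluster proceeds only when the reachable nodes hold a strict majority, so two partitions can never both make progress.

// Minimal sketch of a quorum check (illustrative; not the real VMS protocol).
class QuorumCheck {
    // The cluster may operate only if the reachable members hold a strict
    // majority of all votes; this prevents a split cluster from running twice.
    static boolean hasQuorum(int reachableVotes, int totalVotes) {
        return reachableVotes > totalVotes / 2;
    }

    public static void main(String[] args) {
        int totalVotes = 5;                           // five single-vote nodes
        System.out.println(hasQuorum(3, totalVotes)); // true: majority partition
        System.out.println(hasQuorum(2, totalVotes)); // false: minority must stall
    }
}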
Locking interface
• Users define locks and their names
• How are locks named?
  • Locks form a hierarchical namespace
  • This maps nicely onto the file-system namespace
• What kinds of modes can locks be in?
  • Exclusive access, protected read
  • Concurrent read, concurrent write, null, etc.

Locks thus far
• Lock any time shared data is read or written
  • Ensures correctness: only one thread can read/write at a time
• We would like more concurrency
• How, without exposing violated invariants?
  • Allow multiple threads to read (as long as none are writing)

Reader-writer interface
• readerStart (called when a thread begins reading)
• readerFinish (called when a thread is finished reading)
• writerStart (called when a thread begins writing)
• writerFinish (called when a thread is finished writing)
• Semantics
  • If no threads are between writerStart/writerFinish, many threads may be between readerStart/readerFinish
  • Only one thread at a time may be between writerStart/writerFinish

Reader-writer interface vs. locks
• The R-W interface looks a lot like locking
  • *Start ~ lock
  • *Finish ~ unlock
• Standard terminology
  • The four functions are called "reader-writer locks"
  • A thread between readerStart/readerFinish holds a "read lock"
  • A thread between writerStart/writerFinish holds a "write lock"
• Pros/cons of R-W vs. standard locks?
  • Trade-off: gain concurrency at the cost of complexity
  • Must know how data is being accessed in the critical section
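The four-function interface above maps directly onto Java's standard ReentrantReadWriteLock; a minimal sketch:

import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedCounter {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private long value = 0;

    long read() {
        rw.readLock().lock();       // readerStart: many readers may hold this at once
        try {
            return value;
        } finally {
            rw.readLock().unlock(); // readerFinish
        }
    }

    void increment() {
        rw.writeLock().lock();       // writerStart: excludes all readers and writers
        try {
            value++;
        } finally {
            rw.writeLock().unlock(); // writerFinish
        }
    }
}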
Back to VAXclusters
• Hierarchical locks
  • Allow coarse-grained mutual exclusion (tree roots)
  • Allow fine-grained concurrency (tree leaves)
• Who maintains the locking state (i.e., the queues)?
  • Each lock has a master node
  • The first node to request the lock becomes the master
• How do I find a lock's master node?
  • Through the resource directory
  • The resource directory is replicated at several nodes
• If you don't find a lock master in the directory, you're it!
  • You then have to update the directory to reflect your status

Locking interface
• Locks also pass information between holders, called the value block
  • The value can be updated on lock release
  • The value can be read on lock acquire
• How can we use this to manage cached data? (see the sketch below)
  • Use the lock value to encode a resource version
  • To check the freshness of a cached copy, acquire the lock
  • If the current value > the cached value, the cache is stale
  • When updating the data, increment the value on lock release
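A minimal sketch of version-based cache validation with a value block. The DistributedLock interface here is hypothetical shorthand for the VMS lock manager's acquire/release calls, not its real API.

// Hypothetical lock interface standing in for the VMS lock manager's calls.
interface DistributedLock {
    long acquire();            // returns the current value block (a version number)
    void release(long value);  // stores a new value block on release
}

class CachedFile {
    private byte[] cachedData;
    private long cachedVersion = -1;

    byte[] read(DistributedLock lock, java.util.function.Supplier<byte[]> fetch) {
        long current = lock.acquire();       // the value block travels with the lock
        try {
            if (current > cachedVersion) {   // someone wrote since we last cached
                cachedData = fetch.get();    // refetch from shared storage
                cachedVersion = current;
            }
            return cachedData;               // cached copy is known fresh
        } finally {
            lock.release(current);           // reads leave the version unchanged
        }
    }

    void write(DistributedLock lock, byte[] newData,
               java.util.function.Consumer<byte[]> store) {
        long current = lock.acquire();
        store.accept(newData);               // write through to shared storage
        cachedData = newData;
        cachedVersion = current + 1;
        lock.release(current + 1);           // bump the version so others invalidate
    }
}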
Handling failure
• The connection manager tells the lock managers:
  • "We are in transition; please deallocate locks."
• What does a lock manager do?
  • Releases all non-local locks
  • Re-acquires all locks it held before the transition
  • (creating new directory nodes and redistributing masters)
• What does this guarantee about the state of the data?
  • Not much
  • Data can be left in an inconsistent state
  • There is no guarantee that a previous lock holder will get its lock back

Influence of VAXclusters
• Many of the concepts are relevant today
  • Distributed locking
  • Consensus and failure detection
  • High availability using cheap hardware
• For example: Google, Facebook, and every other cloud service
  • e.g., the infrastructure supporting MapReduce jobs

How can things fall apart?
• Failure modes, ordered from easier to harder to handle:
  • Machines can get slow
  • Machines can crash and reboot
  • Machines can crash and die
  • Machines can become partitioned
  • Machines can behave arbitrarily
• Step 1: don't lose data if machines crash and reboot
• Step 2: don't lose data if machines crash and die
• What has to happen if machines are not guaranteed to restart after a crash?
  • Transactions have to commit at more than one machine

Paxos
• Published in ACM TOCS (Transactions on Computer Systems)
  • Submitted: 1990. Accepted: 1998.
• Written by Leslie Lamport; long championed by Butler W. Lampson
  • http://research.microsoft.com/en-us/um/people/blampson
  • "Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT... He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer Award in 1996 and the von Neumann Medal in 2001, the Turing Award in 1992, and the NAE's Draper Prize in 2004."

Barbara Liskov
• MIT professor
• 2008 Turing Award
• "Viewstamped replication" (PODC '88)
  • Very similar to Raft

State machines
• At any moment, a machine exists in a "state"
• What is a state?
  • Think of it as a set of named variables and their values
• Clients can ask a machine about its current state
  • ("What is your state?" "My state is 2.")
• "Actions" change the machine's state
• What is an action?
  • A command that updates the named variables' values
• Is an action's effect deterministic?
  • For our purposes, yes: given a state and an action, we can determine the next state with 100% certainty
• Is the effect of a sequence of actions deterministic?
  • Yes: given a state and a sequence of actions, we can be 100% certain of the end state

Replicated state machines
• Each state machine should compute the same state, even if some machines fail
• What has to be true of the actions that clients submit?
  • They must be applied in the same order at every replica
• How should a machine make sure it applies actions in the same order across reboots?
  • Store them in a log!
• Once we have a leader, it begins to service client requests
• (A minimal state-machine-plus-log sketch follows below.)
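A minimal sketch of a deterministic state machine with an ordered log (illustrative only; agreeing on the log's contents across replicas is exactly the consensus problem that Paxos and Raft solve):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// An action deterministically updates one named variable.
record SetAction(String variable, int value) {}

class StateMachine {
    private final Map<String, Integer> state = new HashMap<>(); // named variables
    private final List<SetAction> log = new ArrayList<>();      // ordered action log

    // Append to the log first, then apply; replaying the log after a
    // reboot reproduces exactly the same state.
    void apply(SetAction action) {
        log.add(action);
        state.put(action.variable(), action.value());
    }

    // Two replicas that apply the same log in the same order end in the
    // same state, because each action's effect is deterministic.
    static StateMachine replay(List<SetAction> log) {
        StateMachine sm = new StateMachine();
        for (SetAction a : log) sm.apply(a);
        return sm;
    }
}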
Replicated state machines
• Common approach
  • Take a simple, general service
  • Implement it using consensus
  • Allow more complex services to build on the simple one
• Examples
  • Chubby (Google's distributed locking service)
  • Zookeeper (Yahoo!'s clone of Chubby)

Zookeeper
• A hierarchical file system: a tree of named nodes
  • Directories (contain data and pointers to children)
  • Files (contain data)
  • e.g., /app1, /app2, /app1/config, /app1/online/S1, ..., /app1/online/Sn
• Clients can
  • Create/delete nodes
  • Read/write files
  • Receive notifications
• Used for coordination

Zookeeper
• Two kinds of nodes
• Persistent
  • Exist until deleted
  • e.g., service-wide config data
• Ephemeral
  • Exist until the creating client departs
  • e.g., group membership (sketched in code below)

Zookeeper
• Automated sequence numbers
  • Node names can be sequenced; this will prove very useful
  • Clients create nodes named /app/foo-X, and ZK chooses an order: /app/foo-1, /app/foo-2, ...

Zookeeper API
• Create: creates a node in the tree
• Delete: deletes a node from the tree
• Exists: tests whether a node exists at a location
• Get children: lists the children of a node
• Get data: reads data from a node
• Set data: writes data to a node
• Sync: waits for data to commit
• You can embed information in the hierarchical namespace, and store extra details in an individual node's data

Zookeeper implementation
• A single leader plus followers (diagram: clients talking to followers and the leader)
• A client connects to one server (any server will do)

Zookeeper guarantees
• Sequential consistency: updates from a client are applied in the order sent
• Atomicity: no partial results; updates succeed or fail
• Single system image: a client sees the same view, regardless of which server it talks to
• Reliability: applied updates persist until overwritten by another update
• Timeliness: a client's view is guaranteed to be up-to-date within a time bound
• Note: Zookeeper does not provide strong consistency
  • A client's reads are not guaranteed to reflect all other clients' updates

Zookeeper's consistency guarantees
• Read my writes (you see your own updates)
• Consistent prefix (you see a snapshot of the state)
• Monotonic reads (you never go backward in time)
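Before the queue example, a small sketch of the API above: group membership with an ephemeral, sequenced node, using the standard Apache ZooKeeper client. The paths, port, and timeout are illustrative, and /app1/online is assumed to already exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class JoinGroup {
    public static void main(String[] args) throws Exception {
        // Connect to a local ensemble; the watcher argument is unused here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 2000, event -> {});

        // An ephemeral, sequenced node names this member while the session is
        // alive, and vanishes automatically if the client dies.
        String me = zk.create("/app1/online/s-", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Joined as " + me);

        // Anyone can list the live members.
        System.out.println("Members: " + zk.getChildren("/app1/online", false));
    }
}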
Example use
• We want to crawl and index the web
  • Would like multiple machines to participate
  • Want each URL explored at most once
• Solution
  • Maintain an ordered queue of URLs
  • Crawling machines assign themselves a URL to explore
  • Crawling machines may add new URLs to the queue
  • Requires a producer-consumer queue

The queue code on this and the following slides is adapted from http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Random;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;

// Class simulating workers adding URLs to the queue.
// ConnectionWatcher (defined in the source post) supplies connect(), close(),
// and the zk handle.
public class CreateQueue {

    private class QueueAddWorker extends ConnectionWatcher implements Runnable {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;

        public QueueAddWorker(String name) {
            this.name = name;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    this.connect("localhost");
                    // Node names are sequenced, e.g., /queue/q-0000000003
                    String added = zk.create("/queue/q-", null, Ids.OPEN_ACL_UNSAFE,
                            CreateMode.PERSISTENT_SEQUENTIAL);
                    this.close();
                    Thread.sleep(r.nextInt(1000) + 50);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        CreateQueue cQ = new CreateQueue();
        Thread addWorker1 = new Thread(cQ.new QueueAddWorker("worker1"));
        Thread addWorker2 = new Thread(cQ.new QueueAddWorker("worker2"));
        Thread addWorker3 = new Thread(cQ.new QueueAddWorker("worker3"));
        addWorker1.start();
        addWorker2.start();
        addWorker3.start();
    }
}

Now we can use watches to be notified when new nodes are added.
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Collections;
import java.util.Date;
import java.util.List;
import java.util.Random;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs.Ids;

public class PullQueue extends ConnectionWatcher {

    private class PullWatcher implements Watcher {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;
        private List<String> children;

        public PullWatcher(String name) throws Exception {
            this.name = name;
            // Registers this object as a watcher for the node /queue
            children = zk.getChildren("/queue", this);
        }

        @Override
        public void process(WatchedEvent event) {
            try {
                // When /queue changes, check whether any children were added
                if (event.getType().equals(EventType.NodeChildrenChanged)) {
                    // Get the children and renew the watch
                    children = zk.getChildren("/queue", this);
                    Collections.sort(children); // we get an unsorted list
                    // Iterate through the new queue entries
                    for (String child : children) {
                        if (zk.exists("/queue_lock/" + child, false) == null) {
                            try {
                                // Try to create a lock file for the entry
                                zk.create("/queue_lock/" + child,
                                        dfrm.format(new Date()).getBytes(),
                                        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                                // PROCESS QUEUE ENTRY
                                // After processing, delete the entry and try another
                                zk.delete("/queue/" + child, -1);
                            }
                            // Even though we check for the lock's existence, it could
                            // have been created in the meantime, making create fail.
                            // We catch and ignore that exception.
                            catch (Exception ignore) {
                            }
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        PullQueue pQ = new PullQueue();
        pQ.connect("localhost");
        try {
            pQ.new PullWatcher(args[0]);
            Thread.sleep(Long.MAX_VALUE);
        } finally {
            pQ.zk.close();
        }
    }
}
Questions about the queue workers
• What is the problem with workers concurrently processing new entries?
  • Work would be duplicated, since every worker sees (and would process) every new entry.
• How do we prevent duplicate work?
  • Have workers lock the entries they want to process: the zk.create call above tries to create a lock file for each entry.
• What happens if workers concurrently try to create the same lock file?
  • Only one will succeed; the worker whose create fails catches an exception and moves on.
• Why is it possible for only one create to succeed?
  • ZK returns to the client on commit, and a new node must reach a majority of servers to commit.
• Are workers reading the children of /queue_lock guaranteed to see all locks?
  • No. A worker is only guaranteed to see the locks it created itself (read-my-writes), which is why the create can still fail even after exists returns null.
• What happens if a worker fails immediately after creating its lock file?
  • The lock file is ephemeral, so it disappears when its creator stops responding, and another worker can claim the entry.
• After processing a queue entry, a worker deletes it and tries another.
Leader election on top of ZK
• Zookeeper elects a leader internally
  • It uses Paxos, but could use Raft, too
• ZK's internal leader election isn't exposed to services
  • Services can only manipulate ZK nodes
• But many services need to elect leaders
  • e.g., a storage service that uses two-phase commit
• It is easy to implement leader election on top of ZK

Leader election on top of ZK
• To volunteer to be a leader:
  1. Create a node z named "/Election/n_" with the sequence and ephemeral flags
  2. Let C be the children of "/Election/", and let i be z's sequence number
  3. Watch "/Election/n_j", where j is the smallest sequence number less than i
• Can two servers create the same "/Election/n_" node?
  • No: ZK ensures that each node gets a unique sequence number
• Which server is the leader?
  • The one that created "/Election/n_j"
• How will we know when the leader fails?
  • When the (ephemeral) node "/Election/n_j" is deleted
• When a server is notified that a child of "/Election/" was deleted:
  1. Let C be the new set of children of "/Election/"
  2. If z is the smallest node in C, then this volunteer is the leader
  3. Otherwise, keep watching for changes to the smallest n_j
• Can two servers ever think that they are the leader?
  • Something to work out on your own …
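A sketch of the volunteer protocol above, using the standard ZooKeeper client API. Error handling, session expiration, and reconnection are omitted; /Election is assumed to already exist; and this follows the slides' simple scheme in which every volunteer watches the current smallest node.

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class Volunteer {
    private final ZooKeeper zk;
    private String myNode; // e.g., "n_0000000005"

    Volunteer(ZooKeeper zk) { this.zk = zk; }

    // Step 1: create an ephemeral, sequenced node under /Election.
    void volunteer() throws Exception {
        String path = zk.create("/Election/n_", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring(path.lastIndexOf('/') + 1);
        checkLeadership();
    }

    // Steps 2-3: if our node is the smallest, we lead; otherwise watch the
    // smallest node and re-check when it is deleted.
    private void checkLeadership() throws Exception {
        List<String> children = zk.getChildren("/Election", false);
        Collections.sort(children);
        String smallest = children.get(0);
        if (smallest.equals(myNode)) {
            System.out.println("I am the leader");
        } else if (zk.exists("/Election/" + smallest, event -> {
                       // The watched node changed (e.g., was deleted): re-run election.
                       try { checkLeadership(); } catch (Exception e) { e.printStackTrace(); }
                   }) == null) {
            // The smallest node vanished between getChildren and exists; re-check now.
            checkLeadership();
        }
    }
}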