VAXclusters: A Closely Coupled Distributed System
Landon Cox
February 19, 2016
Tight and loose coupling
• Characteristics of a tightly-coupled system
• Close proximity of processors
• High-bandwidth communication via shared memory
• Single copy of operating system
• Characteristics of a loosely-coupled system
• Physically separated processors
• Low-bandwidth message-based communication
• Independent operating systems
Tight and loose coupling
• Tightly-coupled systems can provide great performance
• What are the disadvantages of a tightly-coupled system?
  • Scaling gets extremely expensive (e.g., supercomputer)
  • Relatively hard to extend (add components)
  • A failed component can often bring down the entire system
• “Closely-coupled” VAXClusters tried to resolve this tension
  • Want extensibility (should be easy to add components over time)
  • Want availability (i.e., fault tolerance)
  • Should be relatively affordable
  • Performance should be acceptable
Tight and loose coupling
• Performance bottleneck for loose coupling
• Communication between processors
• Processes in tightly-coupled systems use shared memory
• Processes in a loosely-coupled system use messages
[Figure: tightly coupled — Process 1 and Process 2 share one physical memory; loosely coupled — each process has its own physical memory]
Tight and loose coupling
• What makes message passing so much slower?
• The interconnect (network versus memory bus)
• Need to copy data into and out of various address spaces
[Figure: same diagram as above — shared memory vs. separate memories connected by an interconnect]
Close coupling
• So we have to make communication fast
• System Communication Architecture (SCA)
• Computer interconnect (CI) for message passing
• CI Port provided hardware support
• What types of messages did SCA support?
• Messages (small w/ ordered, reliable delivery)
• Datagrams (small w/ unordered, unreliable delivery)
• Block transfers (large w/ ordered, reliable delivery)
CI Port interface
• Data structures through which software and CI Port
communicate
• CI Port registers
• 7 queues (4 command queues, response queue, free queues)
[Figure: physical memory holds the command, response, and free queues shared by the port driver and the CI Port]
CI Port interface
• Block-transfer commands include pointers into the source and destination address spaces
• Hosts know this info because it is exchanged via messages
  • Messages/datagrams act as the control plane, block transfers as the data plane
• How does this reduce copying?
  • A command points into the src/dst address spaces (page tables)
  • It identifies contiguous regions for data to be copied
  • Don’t have to copy data into individual messages
  • Datagrams/messages include their payload in the command
  • CI Ports can reach into memory themselves
• How else does this improve performance?
  • Far fewer interrupts for the OS to handle
  • Instead of interrupting every 576 bytes for a new packet
  • The CI Port can copy data into the dst address space w/o interruption
  • Only interrupt when the transfer is complete
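To make the control-plane/data-plane split concrete, here is a hypothetical sketch (in Java, the language of the later examples; none of these names come from the paper) of the two kinds of queued commands:

// Hypothetical sketch of the two kinds of queued commands (names are
// illustrative, not the paper's). Messages carry their payload inline;
// block transfers only describe regions for the CI Port to copy.
public class CIPortCommands {
    // Control plane: small command whose payload travels inside the command.
    record MessageCommand(int destPort, byte[] payload) {}

    // Data plane: no inline payload, just pointers into the source and
    // destination address spaces plus a length. The CI Port performs the
    // copy itself and interrupts the host only when the transfer completes.
    record BlockTransferCommand(int destPort,
                                long srcBufferAddress,
                                long dstBufferAddress,
                                long length) {}
}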
Storage
• Single, networked storage interface
• Disks were distributed across cluster
• Single namespace for all files
• Advantages of a unified file-system namespace
• Makes it easier to add new nodes
• Makes sharing files easier
• Can login anywhere and get to your data
Storage
• If we want to allow concurrent access to files, we need
• Synchronization primitives
• Otherwise, processes can’t coordinate their activities
• This was a problem for single-host file systems
• But the OS kernel synchronized access
• Why is synchronization harder in a cluster?
• Cluster is a distributed system
• Nodes can fail, messages can be lost, etc.
Storage
• When synchronizing access to in-memory data
  • Locks are also in memory
  • No magic: locks and objects are all in the same address space
• Why not use file content itself as the basis for syncing?
  • (i.e., write “lock=owned” to file lock.txt)
  • Would require very strong consistency guarantees
  • Equivalent to memory accesses
  • Horrendously slow, and potentially have to pay the penalty at all times
Implementing locking
• First have to agree on cluster membership
• If we don’t all agree on who is around
• It’s going to be really hard to agree on anything else
• How does a cluster agree on its membership?
• Each node has a connection manager
• Each connection manager has a copy of the membership
• Use a quorum voting scheme
• Consensus is a really hard problem
• Nodes can fail and come back on-line arbitrarily
• Messages can be lost or slow
• Impossible to distinguish between failures and slow performance
Locking interface
• Users define locks and their names
• How are locks named?
• Locks represent a hierarchical namespace
• Maps nicely onto the file-system namespace
• What kinds of modes can locks be in?
• Exclusive access, protected read
• Concurrent read, concurrent write, null, etc.
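For reference, a sketch of these modes as an enum; the two-letter names follow the VMS lock manager’s conventions, the comments give rough intuition, and the paper’s compatibility matrix is authoritative:

// Sketch of the lock modes named above (names are VMS lock-manager
// conventions; see the paper for the exact compatibility matrix).
enum LockMode {
    NL, // null: holds the lock name only; compatible with everything
    CR, // concurrent read: read alongside other readers and writers
    CW, // concurrent write: write alongside other readers and writers
    PR, // protected read: shared read; no concurrent writers allowed
    PW, // protected write: single writer; unprotected readers allowed
    EX  // exclusive: no other access of any kind
}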
Locks thus far
• Lock anytime shared data is read/written
  • Ensures correctness
  • Only one thread can read/write at a time
• Would like more concurrency
• How, without exposing violated invariants?
  • Allow multiple threads to read
  • (as long as none are writing)
Reader-writer interface
• readerStart (called when a thread begins reading)
• readerFinish (called when a thread is finished reading)
• writerStart (called when a thread begins writing)
• writerFinish (called when a thread is finished writing)
• If no threads are between writerStart/writerFinish
  • Many threads can be between readerStart/readerFinish
• Only 1 thread can be between writerStart/writerFinish
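A minimal sketch of this four-function interface as a Java monitor (illustrative; this simple version can starve writers):

// A minimal reader-writer lock matching the interface above.
public class RWLock {
    private int readers = 0;          // threads between readerStart/readerFinish
    private boolean writing = false;  // a thread is between writerStart/writerFinish

    public synchronized void readerStart() throws InterruptedException {
        while (writing) wait();       // readers only wait for an active writer
        readers++;
    }
    public synchronized void readerFinish() {
        if (--readers == 0) notifyAll(); // last reader out lets a writer in
    }
    public synchronized void writerStart() throws InterruptedException {
        while (writing || readers > 0) wait(); // writers need exclusive access
        writing = true;
    }
    public synchronized void writerFinish() {
        writing = false;
        notifyAll();                  // wake any waiting readers and writers
    }
}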
Reader-writer interface vs locks
• R-W interface looks a lot like locking
• *Start ~ lock
• *Finish ~ unlock
• Standard terminology
  • The four functions together are called “reader-writer locks”
  • A thread between readerStart/readerFinish holds a “read lock”
  • A thread between writerStart/writerFinish holds a “write lock”
• Pros/cons of R-W vs standard locks?
• Trade-off concurrency for complexity
• Must know how data is being accessed in critical section
Back to VAXclusters
• Hierarchical locks
• Allows coarse-grained mutual exclusion (tree roots)
• Allows fine-grained concurrency (tree leaves)
• Who maintains the locking state (i.e., the queues)?
• Each lock has a master node
• First node to request the lock is the master
• How do I find a lock’s master node?
• Through the resource directory
• The resource directory is replicated at several nodes
• If you don’t find a lock master in the directory
• You’re it!
• Have to update the directory to reflect your status
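A hypothetical sketch of the lookup path just described; all names are illustrative, and the real resource directory is replicated across nodes rather than a single map:

import java.util.Map;

// Hypothetical sketch of finding (or becoming) a lock's master node.
public class LockMasterLookup {
    // Which directory replica is responsible for a lock name.
    static int directoryNodeFor(String lockName, int numDirectoryNodes) {
        return Math.floorMod(lockName.hashCode(), numDirectoryNodes);
    }

    // Ask the directory for the lock's master; if none is recorded,
    // the asking node becomes the master ("You're it!") and records itself.
    static int findOrBecomeMaster(Map<String, Integer> directory,
                                  String lockName, int myNodeId) {
        Integer master = directory.get(lockName);
        if (master == null) {
            directory.put(lockName, myNodeId);
            return myNodeId;
        }
        return master;
    }
}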
Locking interface
• Locks also pass information between holders
  • Called the value block
  • Value can be updated on lock release
  • Value can be read on lock acquire
• How can we use this to manage cached data?
  • Can use the lock value to encode a resource version
  • To check freshness of a cached version, acquire the lock
  • If current value > cached value, the cache is stale
  • When updating data, increment the value on lock release
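A hypothetical sketch of this caching protocol; the DistributedLock interface is made up to stand in for the lock manager’s acquire/release calls and value block:

// Made-up lock interface standing in for the VMS lock manager's calls.
interface DistributedLock {
    void acquire();
    long valueBlock();                  // value read on acquire
    void release();
    void releaseWithValue(long value);  // update the value block on release
}

class VersionedCache {
    private long cachedVersion = -1;    // version of our cached copy
    private byte[] cachedData;

    byte[] read(DistributedLock lock, java.util.function.Supplier<byte[]> fetch) {
        lock.acquire();
        try {
            if (lock.valueBlock() > cachedVersion) { // cached copy is stale
                cachedData = fetch.get();            // re-read from storage
                cachedVersion = lock.valueBlock();
            }
            return cachedData;                       // cached copy is fresh
        } finally { lock.release(); }
    }

    void write(DistributedLock lock, byte[] newData, java.util.function.Consumer<byte[]> store) {
        lock.acquire();
        store.accept(newData);                       // write through to storage
        cachedData = newData;
        cachedVersion = lock.valueBlock() + 1;
        lock.releaseWithValue(cachedVersion);        // increment version on release
    }
}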
Handling failure
• Connection manager to lock managers
• “We are in transition, please de-allocate locks.”
• What does a lock manager do?
• Releases all non-local locks
• Re-acquires all locks owned before transition
• (creates new directory nodes and re-distributes masters)
• What does this guarantee about the state of data?
• Not much
• Can leave in an inconsistent state
• No guarantee that previous lock holder will get it again
Influence of VAXClusters
• Many of the concepts are relevant today
• Distributed locking
• Consensus and failure detection
• High availability using cheap hardware
• For example …
• Google, Facebook and every other cloud service
• e.g., infrastructure to support MapReduce jobs
How can things fall apart
• Failure modes, from easier to harder to handle:
  • Machines can get slow (easier)
  • Machines can crash and reboot
  • Machines can crash and die
  • Machines can become partitioned
  • Machines can behave arbitrarily (harder)
• Step 1: don’t lose data if machines crash and reboot
• Step 2: don’t lose data if machines crash and die
• What has to happen if machines are not guaranteed to restart after a crash?
  • Transactions have to commit at > 1 machine
Paxos
• ACM TOCS: Transactions on Computer Systems
• Submitted: 1990. Accepted: 1998
• Introduced in Leslie Lamport’s “The Part-Time Parliament”; championed by Butler W. Lampson
http://research.microsoft.com/en-us/um/people/blampson
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT… He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE’s Draper Prize in 2004.
Barbara Liskov
• MIT professor
• 2008 Turing Award
• “View-stamped replication”
• PODC ’88
• Very similar to Raft
State machines
• At any moment, a machine exists in a “state”
• What is a state? Think of it as a set of named variables and their values
State machines
• Clients can ask a machine about its current state
[Figure: a client asks “What is your state?”; a machine with states 1–6 replies “My state is 2”]
State machines
• “Actions” change the machine’s state
• What is an action?
  • A command that updates the named variables’ values
State machines
• “Actions” change the machine’s state
• Is an action’s effect deterministic?
  • For our purposes, yes. Given a state and an action, we can determine the next state w/ 100% certainty.
State machines
• “Actions” change the machine’s state
• Is the effect of a sequence of actions deterministic?
  • Yes, given a state and a sequence of actions, we can be 100% certain of the end state
Replicated state machines
• Each state machine should compute the same state, even if some fail.
[Figure: three clients each ask a replica “What is the state?”]
Replicated state machines
• What has to be true of the actions that clients submit?
  • They must be applied in the same order at every replica
[Figure: three clients submit “Apply action a.”, “Apply action b.”, and “Apply action c.” to different replicas]
State machines
• How should a machine make sure it applies actions in the same order across reboots?
  • Store them in a log!
[Figure: a log: Action | Action | Action | …]
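A minimal sketch (in Java, not from the slides) of a deterministic state machine whose log can be replayed after a reboot:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// State = a map of named variables; actions update them deterministically.
public class LoggedStateMachine {
    // An action deterministically sets one named variable.
    record Action(String variable, long newValue) {}

    private final Map<String, Long> state = new HashMap<>();
    private final List<Action> log = new ArrayList<>(); // stand-in for a durable log

    public void apply(Action a) {
        log.add(a);                            // 1. append to the log first (durably, in a real system)
        state.put(a.variable(), a.newValue()); // 2. then update the state
    }

    // After a reboot, replaying the same actions in the same order always
    // reproduces the same state, because each action is deterministic.
    public static LoggedStateMachine recover(List<Action> durableLog) {
        LoggedStateMachine m = new LoggedStateMachine();
        for (Action a : durableLog) m.apply(a);
        return m;
    }
}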
Replicated state machines
• Once we have a leader, it begins to service client requests.
[Figure: each replica records Leader=L; a client sends “Apply action a.” to the leader, which appends the action to every replica’s log]
Replicated state machines
• Common approach
• Take simple, general service
• Implement using consensus
• Allow more complex services to use the simple one
• Examples
• Chubby (Google’s distributed locking service)
• Zookeeper (Yahoo! clone of Chubby)
Zookeeper
• Hierarchical file system
  • A tree of named nodes
  • Directories (contain data, pointers)
  • Files (contain data)
• Clients can
  • Create/delete nodes
  • Read/write files
  • Receive notifications
• Used for coordination
[Figure: example tree: / → /app1, /app2; /app1 → /app1/config, /app1/online/; /app1/online/ → /app1/online/S1 … /app1/online/Sn]
Zookeeper
• Two kinds of nodes: persistent and ephemeral
• Persistent
  • Exist until deleted
  • e.g., service-wide config data
• Ephemeral
  • Exist until client departs
  • e.g., group membership
[Figure: the same example tree; config data like /app1/config is persistent, membership nodes like /app1/online/S1 … /app1/online/Sn are ephemeral]
Zookeeper
• Automated sequence numbers
  • Node names can be sequenced
  • Will prove very useful
• Clients create nodes /app/foo-X
• ZK chooses an order
  • /app/foo-1
  • /app/foo-2
  • /app/foo-…
[Figure: the same example tree as above]
Zookeeper API
• Create: creates node in tree
• Delete: deletes node in tree
• Exists: tests if node exists at location
• Get children: lists children of node
• Get data: reads data from node
• Set data: writes data to node
• Sync: waits for data to commit
• Can embed information in the hierarchical namespace
• Can store extra details in an individual node’s data
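A sketch of these calls using the standard ZooKeeper Java client; the paths and data here are made up, and error handling is omitted:

import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.data.Stat;

public class ApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Create: a persistent node carrying some config data
        zk.create("/app1", "config-v1".getBytes(),
                  Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Exists: returns the node's Stat, or null if it is absent
        Stat stat = zk.exists("/app1", false);

        // Get data / set data (the version argument guards against races)
        byte[] data = zk.getData("/app1", false, stat);
        zk.setData("/app1", "config-v2".getBytes(), stat.getVersion());

        // Get children, registering a watch for membership changes
        for (String child : zk.getChildren("/app1", true)) {
            System.out.println(child);
        }

        // Delete (version -1 means "any version"); sync (not shown) is the
        // asynchronous call that waits for the server's view to catch up
        zk.delete("/app1", -1);
        zk.close();
    }
}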
Zookeeper implementation
[Figure: an ensemble of one leader and several followers; each client connects to one of the servers]
• A client connects to one server (any server will do).
Zookeeper guarantees
• Sequential consistency
• Updates from client applied in order sent.
• Atomicity
• No partial results. Updates succeed or fail.
• Single system image
• Client has same view, regardless of server.
• Reliability
• Applied updates persist until overwritten by another update.
• Timeliness
• Client’s view guaranteed to be up-to-date within a time bound.
Zookeeper does not provide strong consistency
Client reads not guaranteed to contain all other client updates
Zookeeper’s consistency guarantees
• Read my writes (see your own updates)
• Consistent prefix (see a snapshot of the state)
• Monotonic reads (never go backward in time)
Example use
• Want to crawl and index the web
• Would like multiple machines to participate
• Want URLs explored at most once
• Solution
• Maintain an ordered queue of URLs
• Crawling machines assign themselves a URL to explore
• Crawling machines may add new URLs to queue
• Requires a producer-consumer queue
// Class simulating workers adding URLs to the queue
public class CreateQueue {
    private class QueueAddWorker extends ConnectionWatcher implements Runnable {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;

        public QueueAddWorker(String name) { this.name = name; }

        @Override
        public void run() {
            try {
                while (true) {
                    this.connect("localhost");
                    // Nodes' names are sequenced, e.g., /queue/q-0000000003
                    String added = zk.create("/queue/q-", null, Ids.OPEN_ACL_UNSAFE,
                                             CreateMode.PERSISTENT_SEQUENTIAL);
                    this.close();
                    Thread.sleep(r.nextInt(1000) + 50);
                }
            } catch (Exception e) { e.printStackTrace(); }
        }
    }

    public static void main(String[] args) {
        CreateQueue cQ = new CreateQueue();
        Thread addWorker1 = new Thread(cQ.new QueueAddWorker("worker1"));
        Thread addWorker2 = new Thread(cQ.new QueueAddWorker("worker2"));
        Thread addWorker3 = new Thread(cQ.new QueueAddWorker("worker3"));
        addWorker1.start(); addWorker2.start(); addWorker3.start();
    }
}
Now we can use watches to be notified when new nodes are added.
http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/
public class PullQueue extends ConnectionWatcher {
    private class PullWatcher implements Watcher {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;
        private List<String> children;

        public PullWatcher(String name) throws Exception {
            this.name = name;
            // Registers this object as a watcher for the node /queue
            children = zk.getChildren("/queue", this);
        }

        @Override
        public void process(WatchedEvent event) {
            try {
                // When /queue changes, we check if any children were added
                if (event.getType().equals(EventType.NodeChildrenChanged)) {
                    children = zk.getChildren("/queue", this); // get the children and renew watch
                    Collections.sort(children);                // we are getting an unsorted list
                    // Iterate through the new queue entries
                    for (String child : children) {
                        if (zk.exists("/queue_lock/" + child, false) == null) {
                            try {
                                // Try to create an ephemeral lock node for the entry
                                zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(),
                                          Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                                // PROCESS QUEUE ENTRY
                                // After processing a queue entry, delete it and try another
                                zk.delete("/queue/" + child, -1);
                            }
                            // Even though we check the existence of a lock, it could
                            // have been created in the meantime, making create fail.
                            // We catch and ignore the exception.
                            catch (Exception ignore) { }
                        }
                    }
                }
            } catch (Exception e) { e.printStackTrace(); }
        }
    }

    public static void main(String[] args) throws Exception {
        PullQueue pQ = new PullQueue();
        pQ.connect("localhost");
        try {
            pQ.new PullWatcher(args[0]);
            Thread.sleep(Long.MAX_VALUE);
        } finally { pQ.zk.close(); }
    }
}
Walking through PullQueue:
• What is the problem with workers concurrently processing new entries?
  • Work would be duplicated, as all workers would process all new entries
• How do we prevent duplicate work?
  • Have workers lock the entries that they want to process: each worker tries to create a lock node for the entry under /queue_lock
• What happens if workers concurrently try to create the same lock node?
  • Only one will succeed; the failed worker will catch an exception
• Why is it possible for only one create to succeed?
  • ZK returns to the client on commit, and new nodes must reach a majority of servers to commit
• Are workers reading the children of /queue_lock guaranteed to see all locks?
  • No, a worker is only guaranteed to see the locks it created
• What happens if a worker fails immediately after creating the lock node?
  • The lock node is ephemeral and disappears when its creator stops responding
• After processing a queue entry, a worker deletes it and tries another
Leader election on top of ZK
• Zookeeper elects a leader internally
  • Uses a Paxos-style protocol (ZAB), but could use Raft too
  • Internal ZK leader election isn’t exposed to services
  • Services can only manipulate ZK nodes
• But many services need to elect leaders
  • e.g., a storage service that uses two-phase commit
• Easy to implement leader election on top of ZK
Leader election on top of ZK
• To volunteer to be a leader
  1. Create node z = “/Election/n_” with the sequence and ephemeral flags
  2. Let C be “/Election/”’s children, and i be z’s sequence number
  3. Watch “/Election/n_j”, where j is the smallest sequence number < i
• Can two servers create the same “/Election/n_” node?
  • No, ZK ensures that each node gets a unique sequence number
• Which server is the leader?
  • The one that created “/Election/n_j”
• How will we know when the leader fails?
  • When node “/Election/n_j” is deleted
• When a server is notified that a child of “/Election/” was deleted
  1. Let C be the new set of children of “/Election/”
  2. If z is the smallest node in C, then the volunteer is the leader
  3. Otherwise, keep watching for changes in the smallest n_j
• Can two servers ever think that they are the leader?
  • Something to work out on your own …
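A minimal sketch of this recipe using the standard ZooKeeper Java client (illustrative; session handling and error recovery are omitted):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class LeaderElection implements Watcher {
    private final ZooKeeper zk;
    private String myNode; // e.g., "n_0000000007"

    public LeaderElection(ZooKeeper zk) { this.zk = zk; }

    public void volunteer() throws Exception {
        // 1. Create an ephemeral, sequential node under /Election
        String path = zk.create("/Election/n_", new byte[0],
                                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring(path.lastIndexOf('/') + 1);
        check();
    }

    // 2-3. Look at /Election's children; if ours is the smallest we lead,
    // otherwise watch the smallest node and wait for it to be deleted.
    private void check() throws Exception {
        List<String> children = zk.getChildren("/Election", false);
        Collections.sort(children); // zero-padded sequence numbers sort lexicographically
        String smallest = children.get(0);
        if (smallest.equals(myNode)) {
            System.out.println("I am the leader");
        } else {
            zk.exists("/Election/" + smallest, this); // watch the leader's node
        }
    }

    @Override
    public void process(WatchedEvent event) {
        // The watched node was deleted: the leader failed; re-check membership.
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { check(); } catch (Exception e) { e.printStackTrace(); }
        }
    }
}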