Taming Your Cluster with ZooKeeper

I was recently tasked with finding a solution for leader election in a distributed system we are developing. I began to explore the realm of distributed algorithms, which quickly surfaced a more fundamental problem: how can we easily make all the nodes communicate reliably? After exploring ZeroMQ and UDP broadcasting, I decided to look at the Hadoop ecosystem to see if any of its solutions were reusable, and I discovered the excellent Apache ZooKeeper.

ZooKeeper is an open-source server that enables highly reliable distributed coordination. It is a clustered service that can serve as a single source of truth for configuration information, distributed synchronization, and other coordination services. ZooKeeper makes a number of guarantees about its data: updates are sequential, atomic, reliable, timely, and consistent across the cluster.

ZooKeeper Components

There are very few moving parts in ZooKeeper to understand, but when these pieces are paired together, they become very powerful. At its core, ZooKeeper is nothing more than a hierarchical key-value store that looks like a filesystem. Nodes can be created, tagged with data, and nested under other nodes. In ZooKeeper, these nodes are called znodes, and they come in two flavors.
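To make the data model concrete, here is a toy in-memory sketch of that hierarchical store in plain Python (no ZooKeeper required; the paths and data below are invented for illustration):

```python
# Toy model of ZooKeeper's data tree: each znode is a path mapped to data.
tree = {
    "/": b"",
    "/groups": b"",
    "/groups/all-servers": b"",
    "/groups/all-servers/server1": b"host=server1.hostname.com",
}

def children(tree, path):
    """Return the names of znodes nested directly under `path`."""
    prefix = path.rstrip("/") + "/"
    return sorted(
        p[len(prefix):] for p in tree
        if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
    )

print(children(tree, "/groups"))              # ['all-servers']
print(children(tree, "/groups/all-servers"))  # ['server1']
```

The real service, of course, adds replication, sessions, and ordering guarantees on top of this simple tree shape.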

Persistent znode

Persistent znodes are just what they sound like: znodes that are created and kept around until they are explicitly deleted. A persistent znode can have many other znodes nested under it, and can be tagged with data.

Ephemeral znode

Ephemeral znodes are similar to persistent znodes — they can be nested under persistent znodes and they can be tagged with data — however, they cannot have znodes nested under them. An ephemeral znode is always a leaf of the tree. Ephemeral znodes give ZooKeeper its fault-detection power in distributed systems: they exist only for the duration of the creating client's session. Once the client's connection is lost, its ephemeral znodes are deleted.


Both persistent and ephemeral znodes can be created as sequential znodes. A sequential znode is created with a monotonically increasing counter appended to the end of its path. For example, if you were to create a sequential znode with the path /clients/server.mydomain.com-, it would be created as /clients/server.mydomain.com-0000000000, with the next sequential znode under the same parent receiving the sequence number 0000000001, and so on. The counter is always shared between sibling znodes.
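The suffix is a 10-digit, zero-padded counter appended to the requested path. A minimal sketch of that naming rule (the explicit counter argument here is my own simplification; the real server tracks the counter on the parent znode):

```python
def sequential_name(path_prefix, counter):
    """Append ZooKeeper's 10-digit zero-padded sequence number to a path."""
    return f"{path_prefix}{counter:010d}"

print(sequential_name("/clients/server.mydomain.com-", 0))
# /clients/server.mydomain.com-0000000000
print(sequential_name("/clients/server.mydomain.com-", 1))
# /clients/server.mydomain.com-0000000001
```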


The last basic feature of ZooKeeper is the ability to set watches. When using one of the client APIs, a watch lets you register logic that is triggered when a znode or its children change. A watch fires only once; to keep observing, the client must re-register it. Paired with ephemeral znodes, watches make leader-election algorithms and distributed locking services very easy to build.
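For example, a common leader-election recipe has every client create a sequential ephemeral znode under a shared parent and treats the holder of the lowest sequence number as leader. Picking the leader from the parent's child names is just a sort; a sketch, assuming names end in the 10-digit sequence suffix:

```python
def elect_leader(members):
    """Leader is the member znode with the smallest sequence suffix."""
    return min(members, key=lambda name: int(name.rsplit("-", 1)[1]))

members = ["server2.hostname.com-0000000001", "server1.hostname.com-0000000000"]
print(elect_leader(members))  # server1.hostname.com-0000000000
```

Because the leader's znode is ephemeral, a crashed leader disappears from the children list automatically, and a watch on the parent tells everyone to re-run the election.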

Command Line Example

From definitions alone, ZooKeeper can be a little difficult to understand, so here is a session with the command-line interface that exercises its basic components.

Once connected using zkCli, let's list the root path; only one znode exists, which is used internally by ZooKeeper. We will then create a persistent znode named /groups, and under it two persistent znodes named all-servers and master-election. The last argument to the create command is the data attribute; for this example I am just using an empty string.

[zkClient] ls /
 [zookeeper]
[zkClient] create /groups ''
 Created /groups
[zkClient] ls /
 [zookeeper, groups]
[zkClient] create /groups/all-servers ''
 Created /groups/all-servers
[zkClient] create /groups/master-election ''
 Created /groups/master-election
[zkClient] ls /groups
 [all-servers, master-election]


Now let's create a sequential ephemeral znode under master-election from each of our clients.

[zkClient1] create -e -s /groups/master-election/server1.hostname.com- ''
 Created /groups/master-election/server1.hostname.com-0000000000
[zkClient1] ls /groups/master-election
 [server1.hostname.com-0000000000]
[zkClient2] create -e -s /groups/master-election/server2.hostname.com- ''
 Created /groups/master-election/server2.hostname.com-0000000001
[zkClient2] ls /groups/master-election
 [server1.hostname.com-0000000000, server2.hostname.com-0000000001]


Now that we have created ephemeral znodes for each of our clients, let's set a watch on the /groups/master-election znode from the second client, then kill the first client's connection to ZooKeeper.

[zkClient2] ls /groups/master-election watch
 [server1.hostname.com-0000000000, server2.hostname.com-0000000001]
[zkClient1] quit


On the second client, we can see that the watch fired. If we were using a ZooKeeper client API, we could trigger a callback and execute failover logic at this point.

WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/groups/master-election
[zkClient2] ls /groups/master-election
 [server2.hostname.com-0000000001]
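With a client API, that failover step amounts to re-reading the children when the watch fires and re-running the election. The decision logic itself needs no server and is easy to sketch (the names and the election rule match the CLI session above; the actual watch wiring is omitted):

```python
def elect(members):
    # Lowest sequence suffix wins, matching the CLI example above.
    return min(members, key=lambda name: int(name.rsplit("-", 1)[1]))

def on_children_changed(old_children, new_children):
    """Decide whether a watch event on the election znode requires failover."""
    old_leader = elect(old_children) if old_children else None
    new_leader = elect(new_children) if new_children else None
    if old_leader != new_leader:
        return f"failover: {old_leader} -> {new_leader}"
    return "leader unchanged"

before = ["server1.hostname.com-0000000000", "server2.hostname.com-0000000001"]
after = ["server2.hostname.com-0000000001"]  # server1's session expired
print(on_children_changed(before, after))
# failover: server1.hostname.com-0000000000 -> server2.hostname.com-0000000001
```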


The command-line session above only scratches the surface of taming your cluster with ZooKeeper. What are you using ZooKeeper for?