Keeping the peace: how ZooKeeper stops database nodes from fighting

When you build an application, you expect your database to work: connect to an endpoint, run queries, and get results. That’s the contract, and your expectation is completely reasonable. Your application should focus on business logic and features, not on distributed database coordination or handling cluster topology changes. That’s what infrastructure does. On Upsun’s Dedicated Generation 2 (DG2) architecture, we run MariaDB in a three-node Galera Cluster. Galera is a multi-master setup where any node can accept writes, which provides high availability but creates coordination challenges. Those challenges belong in the infrastructure layer. We provide you with a stable database endpoint, and behind it runs a resilient cluster. This is where ZooKeeper comes in.

The coordination challenge

Galera uses a quorum system where transactions must commit to at least two of three nodes before succeeding, which provides strong consistency across the cluster. In CAP theorem terms, Galera chooses consistency and partition tolerance over constant availability. In practice, transactions can occasionally fail because another node wrote conflicting data, network latency spiked, or the quorum wasn’t reachable. Multi-master databases like Galera are designed for applications to retry on transaction conflicts, but most applications don’t implement this retry logic by default. Magento, Drupal, WordPress, and many custom applications connect to a database and expect consistent availability without having to handle these edge cases themselves. You could solve this in two ways: build retry logic into your application or handle coordination at the infrastructure layer. As the platform provider, we’ve chosen to solve this problem at the infrastructure level so it works for most of our customers by default.
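
For comparison, the application-side option would look roughly like the sketch below. It is a minimal illustration, not code from any Upsun component: TransientConflict is a hypothetical stand-in for whatever error your database driver raises on a conflict or deadlock, and the function name, attempt count, and backoff values are assumptions chosen for the example.

```python
import random
import time


class TransientConflict(Exception):
    """Hypothetical stand-in for a driver's conflict or deadlock error."""


def run_with_retry(transaction, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run `transaction` (a zero-argument callable), retrying on conflicts.

    Waits with exponential backoff plus random jitter between attempts and
    re-raises the last error once `attempts` is exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except TransientConflict:
            if attempt == attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

Every code path that writes to the database would need to go through a wrapper like this, which is why most off-the-shelf applications don’t do it.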

Our approach with ZooKeeper

We handle the complexity at the infrastructure layer. When you provision a triple-redundant MariaDB cluster on Upsun DG2, we expose a single primary write node while the other two nodes serve as read replicas (though they remain capable of accepting writes for failover scenarios). Your application connects to one stable endpoint, and behind the scenes all nodes stay synchronized through Galera’s multi-master replication. This gives you read-after-write consistency and distributed system reliability through a simple interface. But which node is the primary, and how do we handle transitions when a node becomes unavailable? ZooKeeper answers these questions. Apache ZooKeeper is a coordination service originally developed at Yahoo!. It’s a hierarchical key-value store that looks like a file system, where the root is / and you can create child nodes (called znodes) under any path. Written in Java, it’s been doing this job reliably since 2008. You might know etcd, which serves a similar purpose, but we chose ZooKeeper for its battle-tested stability and specific features for handling node failures gracefully.

Three ZooKeeper features that make it work

ZooKeeper gives us three key capabilities that solve the coordination problem: sequences, watchers, and ephemeral nodes. Let’s look at each one and how we use it.

Sequences: Establishing node order

The first challenge is getting all nodes to agree on who’s primary, and ZooKeeper solves this with sequential znodes. When you create a sequential znode, ZooKeeper appends a monotonically increasing number that’s consistent across all clients. Even if three nodes create znodes simultaneously, ZooKeeper assigns them an order that all clients see the same way. Here’s what it looks like in Python using the kazoo library:

from kazoo.client import KazooClient

zk = KazooClient(hosts='zookeeper:2181')
zk.start()

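# Create a sequential znode (makepath=True creates the
# parent path /mariadb/primary if it doesn't exist yet)
path = zk.create(
    '/mariadb/primary/node-',
    b'node-hostname',
    sequence=True,
    ephemeral=True,
    makepath=True
)

print(f"Created: {path}")
# Example output: Created: /mariadb/primary/node-0000000001
# (the exact number depends on the parent znode's history)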

# Get all nodes and sort them
children = sorted(zk.get_children('/mariadb/primary'))
primary = children[0]

print(f"Primary node: {primary}")
# Output: Primary node: node-0000000001

Each MariaDB node runs a local agent that creates a sequential znode in /mariadb/primary/. The first node in the sequence becomes the primary, and all nodes agree on this order because ZooKeeper guarantees consistency. The primary node gets traffic while the others stand by as read replicas.
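
The decision each agent makes can be sketched as two small pure functions. This is an illustration under stated assumptions rather than the actual agent code: the names pick_primary and is_primary are hypothetical, and the sketch relies on ZooKeeper's convention of appending a zero-padded 10-digit counter to sequential znode names.

```python
def pick_primary(children):
    """Return the child znode with the lowest sequence number, or None."""
    if not children:
        return None
    # Compare on the numeric suffix after the last '-' rather than on
    # the whole name, so differing prefixes can't affect the order.
    return min(children, key=lambda name: int(name.rsplit('-', 1)[1]))


def is_primary(my_path, children):
    """True if the znode created at `my_path` is the current primary."""
    my_name = my_path.rsplit('/', 1)[1]
    return pick_primary(children) == my_name


children = ['node-0000000003', 'node-0000000001', 'node-0000000002']
print(pick_primary(children))
print(is_primary('/mariadb/primary/node-0000000002', children))
```

With a shared prefix such as node-, a plain sorted() gives the same answer. Comparing on the numeric suffix only matters if the prefix varies between nodes, for example when it embeds a hostname.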

Watchers: Staying in sync

What happens when the primary node dies? The other nodes need to know quickly so they can promote a new primary. ZooKeeper provides watchers, which are one-time notifications that fire when a znode changes. Because a raw watch fires only once, it has to be set again after each event; kazoo’s ChildrenWatch handles that re-registration for us. Each node sets a watch on /mariadb/primary/, and when nodes join or leave, those watchers fire. Here’s how it works:

from kazoo.client import KazooClient

zk = KazooClient(hosts='zookeeper:2181')
zk.start()

def configure_proxy(primary_node):
    """Update local proxy configuration"""
    # This would update iptables or HAProxy configuration
    # to redirect database traffic to the new primary
    pass

def primary_changed(children):
    """Called when the primary list changes"""
    if not children:
        print("No primary available!")
        return

    sorted_children = sorted(children)
    new_primary = sorted_children[0]

    print(f"Primary changed to: {new_primary}")

    # Reconfigure the local proxy to point to new primary
    configure_proxy(new_primary)

# Make sure the parent znode exists before watching it
zk.ensure_path('/mariadb/primary')

# Set up a watch on the primary path. kazoo calls the function
# once right away and again whenever the children change, so
# the helpers above must already be defined at this point.
@zk.ChildrenWatch('/mariadb/primary')
def watch_children(children):
    primary_changed(children)
    return True  # Keep watching (returning False stops the watch)

When the primary node dies, its znode disappears (we’ll explain why in a moment), and all watching nodes get notified within seconds. They read the updated list, identify the new primary, and reconfigure their local proxies. Your application keeps sending queries to the same connection string while we’ve redirected traffic to a new primary behind the scenes. This works seamlessly because Galera is multi-master, meaning every node can accept writes at any time. The failover happens without you noticing.
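
A children watch fires on any membership change, including a replica leaving, so an agent may want to touch the proxy only when the head of the list actually changes. The sketch below shows one way to do that. PrimaryTracker and its on_change callback are hypothetical names introduced for this example, not part of kazoo or of our agent.

```python
class PrimaryTracker:
    """Remember the current primary and report only real changes."""

    def __init__(self, on_change):
        self._on_change = on_change  # called with the new primary (or None)
        self._current = None

    def update(self, children):
        """Feed in the latest child list; returns the current primary."""
        new_primary = min(children) if children else None
        if new_primary != self._current:
            self._current = new_primary
            self._on_change(new_primary)
        return new_primary


events = []
tracker = PrimaryTracker(events.append)
tracker.update(['node-0000000002', 'node-0000000001'])  # first primary seen
tracker.update(['node-0000000001'])                     # a replica left
tracker.update(['node-0000000002'])                     # the primary left
print(events)
```

In this run the callback fires twice: once for the initial primary and once when the primary's znode disappears. The replica leaving causes no reconfiguration.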

Ephemeral nodes: Automatic cleanup

The third piece is ephemeral nodes, which are znodes tied to a client session that vanish when that session ends. This addresses one of the hardest problems in distributed systems: detecting failures. Did a node die, or did it temporarily lose network connectivity? ZooKeeper handles this through session timeouts. Here’s what an ephemeral node looks like:

import subprocess
import time

from kazoo.client import KazooClient

def is_mariadb_healthy():
    """Check if local MariaDB is responding"""
    try:
        # Run: mysql -e "SELECT 1"
        result = subprocess.run(
            ['mysql', '-e', 'SELECT 1'],
            capture_output=True,
            timeout=5
        )
        return result.returncode == 0
    except (subprocess.SubprocessError, OSError):
        # The check timed out or the mysql client couldn't start
        return False

zk = KazooClient(
    hosts='zookeeper:2181',
    timeout=10.0  # Session timeout in seconds
)
zk.start()

# Create an ephemeral sequential znode
path = zk.create(
    '/mariadb/primary/node-',
    b'node-hostname',
    sequence=True,
    ephemeral=True,  # Disappears when session ends
    makepath=True    # Create the parent path if needed
)

# kazoo sends session heartbeats in the background; this loop
# only checks MariaDB and ends the session when the check fails
while True:
    if not is_mariadb_healthy():
        # MariaDB is down, close connection
        # This removes our ephemeral node
        zk.stop()
        break

    time.sleep(10)

Each node runs an agent that monitors the local MariaDB instance every 10 seconds. If MariaDB responds, the agent keeps the ZooKeeper session alive. If MariaDB stops responding, the agent drops the session, the ephemeral node disappears, and the other nodes see the change through their watchers and reconfigure. This handles different failure scenarios:

MariaDB crashes: Health check fails, agent drops its session, and the node is removed
Network partition: The node can’t reach ZooKeeper, session timeout expires, and the node is removed
Entire VM dies: Session times out and the ephemeral node vanishes

We don’t need to distinguish between failure types because any problem that prevents the node from maintaining its ZooKeeper session triggers automatic removal. The cluster heals itself.
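
To see why one mechanism covers all three scenarios, here is a toy in-memory model of session expiry. It is deliberately not the ZooKeeper API: ToySessionRegistry and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class ToySessionRegistry:
    """In-memory toy model of ephemeral znodes tied to sessions."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_heartbeat = {}  # znode name -> time of last heartbeat

    def heartbeat(self, znode, now):
        """Record that the session owning `znode` is still alive."""
        self._last_heartbeat[znode] = now

    def close(self, znode):
        """Session closed on purpose (e.g. the health check failed)."""
        self._last_heartbeat.pop(znode, None)

    def live_znodes(self, now):
        """Drop znodes silent for longer than the timeout, return the rest."""
        self._last_heartbeat = {
            znode: seen
            for znode, seen in self._last_heartbeat.items()
            if now - seen <= self.timeout
        }
        return sorted(self._last_heartbeat)


registry = ToySessionRegistry(timeout=10)
for name in ('node-0000000001', 'node-0000000002', 'node-0000000003'):
    registry.heartbeat(name, now=0)

registry.close('node-0000000001')            # MariaDB crashed, agent closed
registry.heartbeat('node-0000000003', now=8)
print(registry.live_znodes(now=15))          # node 2 has been silent too long
```

A deliberate close and a silent timeout end the same way: the znode leaves the list, and the remaining nodes pick a new primary from whatever is left.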

Beyond databases: Worker management

We use the same ZooKeeper pattern for worker processes. Many applications run background workers to process queues, send emails, or generate reports, and while you want workers for high availability, running the same worker on multiple nodes creates problems. Queue systems like RabbitMQ can coordinate multiple workers so each job gets processed once, but that’s extra complexity. What if you could run the worker on one node at a time with automatic failover? Same ZooKeeper pattern. Each node’s agent creates an ephemeral sequential znode in /workers/email-sender/, the first node in sequence starts the worker, and the others wait. When that node dies, its ephemeral node disappears, the next node in sequence sees the change, and it starts its worker. You get high availability without building distributed coordination into your worker code. The worker runs somewhere, and if that node dies, it runs somewhere else. Your application doesn’t need to know which node.
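
The worker-side logic can be sketched in a few lines. This is an assumption-laden illustration, not our agent: WorkerSupervisor, its start and stop callbacks, and on_children are hypothetical names, and on_children is shaped so it could serve as the callback of a children watch on a path like /workers/email-sender/.

```python
class WorkerSupervisor:
    """Run the worker only while our znode is first in the sequence."""

    def __init__(self, my_znode, start, stop):
        self.my_znode = my_znode  # e.g. 'node-0000000002'
        self._start = start       # callable that launches the worker
        self._stop = stop         # callable that shuts the worker down
        self.running = False

    def on_children(self, children):
        """Handle an updated child list from the watch."""
        should_run = bool(children) and min(children) == self.my_znode
        if should_run and not self.running:
            self._start()
            self.running = True
        elif not should_run and self.running:
            self._stop()
            self.running = False
        return True  # keep watching


log = []
supervisor = WorkerSupervisor(
    'node-0000000002',
    start=lambda: log.append('start'),
    stop=lambda: log.append('stop'),
)
supervisor.on_children(['node-0000000001', 'node-0000000002'])  # wait
supervisor.on_children(['node-0000000002'])                     # take over
print(log)
```

The worker code itself stays unaware of the cluster. Only the supervisor reacts to membership changes.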

The takeaway

ZooKeeper provides a single source of truth for cluster coordination through three features that work together:

Sequences establish consistent ordering across all nodes
Watchers enable immediate coordination when cluster state changes
Ephemeral nodes provide automatic cleanup when nodes become unavailable

This lets us provide the stable database interface your application expects while running a highly available three-node cluster underneath. You get both reliability and simplicity. Your application connects to a database endpoint and it behaves as expected. When cluster state changes, a node becomes unavailable, or we perform maintenance, the infrastructure layer handles coordination. This is the robustness principle in practice: we accept applications with standard expectations and provide dependable service. Infrastructure complexity belongs in the infrastructure layer. You build features for your users, and we’ll handle distributed systems coordination.
