Fixing Connection Failures in ZooKeeper Mixed Mode

If you're running ZooKeeper in mixed mode, double-check your ensemble config.

Mixed mode in ZooKeeper means both the plaintext and TLS client ports are enabled. This is safe to enable, but you need to make sure ZooKeeper isn't sending TLS clients to the plaintext port (or plaintext clients to the TLS port).

The Issue

Consider an Apache Curator service configured to reach ZooKeeper behind a load balancer at zookeeper:3181, where 3181 is the secureClientPort (TLS). In the logs, you see the service connect successfully:

[zkNetty-EpollEventLoopGroup-1-1] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server zookeeper/172.18.0.5:3181, session id = 0x300030662be0000, negotiated timeout = 40000
[main-EventThread] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED

Later, however, you see it flailing repeatedly with disconnects:

[main-SendThread(zookeeper2:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper2/172.18.0.4:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException or SessionTimeoutException.
EndOfStreamException: channel for sessionid 0x0 is lost
    at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:287)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1291)

Talk about bait and switch. What's going on? Let's look at the ensemble.

Inspecting the Ensemble

Consider the following zoo.cfg snippet for a mixed-mode ZooKeeper:

secureClientPort=3181
server.1=zookeeper1:2888:3888;2181
server.2=zookeeper2:2888:3888;2181
server.3=zookeeper3:2888:3888;2181

The line secureClientPort=3181 configures port 3181 to serve as the TLS client port.

The server.N lines define the ensemble, that is, the members of the ZooKeeper cluster. ZooKeeper broadcasts this ensemble to connecting clients, as shown in the following client log:

[main-EventThread] INFO org.apache.curator.framework.imps.EnsembleTracker - New config event received: {server.2=zookeeper2:2888:3888:participant;0.0.0.0:2181, server.1=zookeeper1:2888:3888:participant;0.0.0.0:2181, server.3=zookeeper3:2888:3888:participant;0.0.0.0:2181, version=0}

Clients use this ensemble as a fallback when the initial connection is terminated (for example, when ZooKeeper is restarted or a timeout occurs). This is a problem for TLS-enabled clients because each member of the ensemble carries the ;2181 suffix, which tells the client to connect on the plaintext port 2181 rather than the TLS port 3181.

This is what's happening:

1. The Curator client reaches a ZooKeeper server successfully on zookeeper:3181.
2. The client receives the full ensemble with the plaintext port 2181 specified.
3. ZooKeeper restarts or a connection timeout occurs.
4. The client falls back to the ensemble and retries the connection on port 2181. Because the client is configured for TLS and port 2181 is plaintext, every attempt fails; the client won't reconnect successfully until it is restarted.
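To make the mismatch concrete, here is a minimal Kotlin sketch that parses server entries like the ones in the config event above and flags any member whose advertised client port differs from the port in your connect string. The names EnsembleMember, parseMember, and portMismatches are hypothetical helpers written for this post, not part of Curator or ZooKeeper, and the parsing assumes hostnames or IPv4 addresses in the same shape as the log line above.

```kotlin
// Hypothetical helpers for illustration; not part of Curator or ZooKeeper.
data class EnsembleMember(val id: Int, val host: String, val clientPort: Int?)

// Parses one entry such as
//   server.2=zookeeper2:2888:3888:participant;0.0.0.0:2181
// or the zoo.cfg form
//   server.2=zookeeper2:2888:3888;2181
// Assumes the host contains no ':' (hostname or IPv4 address).
fun parseMember(line: String): EnsembleMember {
    val (key, value) = line.trim().split("=", limit = 2)
    val id = key.removePrefix("server.").toInt()
    // Everything after ';' describes the client address/port, if present.
    val parts = value.split(";", limit = 2)
    val host = parts[0].substringBefore(":")
    val clientPort = parts.getOrNull(1)?.substringAfterLast(":")?.toIntOrNull()
    return EnsembleMember(id, host, clientPort)
}

// Returns members that advertise a client port other than the expected one.
// Members with no client port are not reported.
fun portMismatches(members: List<EnsembleMember>, expectedPort: Int): List<EnsembleMember> =
    members.filter { it.clientPort != null && it.clientPort != expectedPort }

fun main() {
    val advertised = listOf(
        "server.1=zookeeper1:2888:3888:participant;0.0.0.0:2181",
        "server.2=zookeeper2:2888:3888:participant;0.0.0.0:2181",
        "server.3=zookeeper3:2888:3888:participant;0.0.0.0:2181",
    ).map(::parseMember)

    // 3181 is the TLS port from the connect string in this post.
    for (m in portMismatches(advertised, expectedPort = 3181)) {
        println("server.${m.id} (${m.host}) advertises client port ${m.clientPort}, expected 3181")
    }
}
```

With the ensemble from this post, all three members would be reported, which matches the reconnect attempts against port 2181 seen in the client log.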

Fixing the Issue

There are two ways to fix the issue: omit ;2181 from the ensemble, or disable Curator's EnsembleTracker.

Solution #1 - Updating the Ensemble

The first solution is to simply drop ;2181 from the ensemble. For example:

server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888

Now when Curator connects, you should no longer see the client port in the ensemble:

[main-EventThread] INFO org.apache.curator.framework.imps.EnsembleTracker - New config event received: {server.2=zookeeper2:2888:3888:participant, server.1=zookeeper1:2888:3888:participant, server.3=zookeeper3:2888:3888:participant, version=0}

Later, when a ZooKeeper connection is re-established, you'll see the client has effectively ignored the ensemble and reconnected to zookeeper:3181:

[zkNetty-EpollEventLoopGroup-1-1] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server zookeeper/172.18.0.5:3181, session id = 0x200001a73910000, negotiated timeout = 40000
[main-EventThread] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED

This is a good solution if you control the ZooKeeper cluster and want to fix the issue without client-side code changes.
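If you have many server lines to update, the edit can be sketched as a small text transformation. The following Kotlin snippet is only an illustration under simple assumptions: stripClientPort is a hypothetical helper, it only touches lines that start with "server.", and it leaves every other setting (including secureClientPort) untouched.

```kotlin
// Hypothetical helper for illustration; not part of ZooKeeper tooling.
// Drops the ";<clientPort>" suffix from server.N lines and returns
// every other line unchanged.
fun stripClientPort(line: String): String =
    if (line.trimStart().startsWith("server.")) line.substringBefore(";") else line

fun main() {
    val zooCfg = listOf(
        "secureClientPort=3181",
        "server.1=zookeeper1:2888:3888;2181",
        "server.2=zookeeper2:2888:3888;2181",
        "server.3=zookeeper3:2888:3888;2181",
    )
    // Print the rewritten config so it can be reviewed before applying it.
    zooCfg.map(::stripClientPort).forEach(::println)
}
```

Review the output before replacing your real zoo.cfg, and roll the change out the same way you would any other ensemble config change.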

Solution #2 - Disabling EnsembleTracker

Another solution is to disable EnsembleTracker on the client side. Just add .ensembleTracker(false) to your CuratorFrameworkFactory.builder() chain. Full example below:

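import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.RetryOneTime
import org.apache.zookeeper.client.ZKClientConfig

// TLS client settings: Netty socket, secure mode, and the truststore to use.
val config = ZKClientConfig()
config.setProperty("zookeeper.clientCnxnSocket", "org.apache.zookeeper.ClientCnxnSocketNetty")
config.setProperty("zookeeper.client.secure", "true")
config.setProperty("zookeeper.ssl.trustStore.location", System.getenv("TRUSTSTORE_LOCATION"))
config.setProperty("zookeeper.ssl.trustStore.password", "truststorepass")
config.setProperty("zookeeper.ssl.trustStore.type", "PKCS12")

val client: CuratorFramework = CuratorFrameworkFactory.builder()
    .connectString("zookeeper:3181") // TLS port behind the load balancer
    .retryPolicy(RetryOneTime(1000)) // retry once after 1000 ms
    .zkClientConfig(config)
    .ensembleTracker(false) // HERE: reconnect via connectString only
    .build()
client.start()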

Now, when your client connects, it won't receive the ensemble from ZooKeeper and will always reconnect using the connectString. You'll also notice that the INFO org.apache.curator.framework.imps.EnsembleTracker logs are no longer present.

This is a good solution if you don't have access to ZooKeeper to update the ensemble, or you need a quick fix on a single service.

Thanks For Reading!

Hopefully this helps you fix connection issues when operating ZooKeeper in mixed mode. Remove the client port from the ZooKeeper ensemble, or disable EnsembleTracker for a client-side solution.