NSX Application Platform upgrade stuck at Zookeeper and Druid upgrade
Article ID: 381566

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

During an NSX Application Platform upgrade, a Zookeeper bug may leave the Zookeeper pods in CrashLoopBackOff state and trigger a cascading failure. While the Zookeeper pods are down, the Druid pods, which depend on Zookeeper, cannot become ready. While the Druid pods are down, the configure-druid job times out before it can submit its configuration.

If the NSX Application Platform upgrade completes without the configure-druid job succeeding, users may experience issues with the visualization feature in NSX Intelligence. For example, right-clicking any group or compute entity to view its Flow Details may show an empty result.

Environment

This may occur when NSX Application Platform is upgraded from version 4.1.1 or earlier to 4.2.0.

Cause

During the NAPP upgrade to 4.2.0, the number of Druid Historical pods is reduced. This increases the number of segment announcements per Historical pod, and therefore the number of ephemeral nodes per ZNode. Due to a bug introduced in Zookeeper 3.6.0 ([ZOOKEEPER-4306] CloseSessionTxn contains too many ephemeral nodes cause cluster crash - ASF JIRA), a high number of ephemeral nodes per session can cause Zookeeper to crash.

Logs from a Zookeeper pod in CrashLoopBackOff:

2024-10-25T06:34:05.189182378Z stdout F 2024-10-25 06:34:05,188 [myid:2] - ERROR [main:o.a.z.s.q.QuorumPeer@1200] - Unable to load database on disk
2024-10-25T06:34:05.189203935Z stdout F java.io.IOException: Unreasonable length = 2523193
2024-10-25T06:34:05.189209089Z stdout F     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
2024-10-25T06:34:05.18921199Z stdout F     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
2024-10-25T06:34:05.189214779Z stdout F     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
2024-10-25T06:34:05.189218336Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750)
2024-10-25T06:34:05.189231978Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
2024-10-25T06:34:05.189234862Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
2024-10-25T06:34:05.189237541Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
2024-10-25T06:34:05.189240155Z stdout F     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
2024-10-25T06:34:05.189242724Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
2024-10-25T06:34:05.189245284Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
2024-10-25T06:34:05.189247859Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
2024-10-25T06:34:05.189250612Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
2024-10-25T06:34:05.189253205Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
2024-10-25T06:34:05.18938307Z stdout F 2024-10-25 06:34:05,189 [myid:2] - ERROR [main:o.a.z.s.q.QuorumPeerMain@114] - Unexpected exception, exiting abnormally
2024-10-25T06:34:05.189387489Z stdout F java.lang.RuntimeException: Unable to run quorum server
2024-10-25T06:34:05.18939149Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1201)
2024-10-25T06:34:05.189394737Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
2024-10-25T06:34:05.189397324Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
2024-10-25T06:34:05.189399996Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
2024-10-25T06:34:05.189402545Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
2024-10-25T06:34:05.189405133Z stdout F Caused by: java.io.IOException: Unreasonable length = 2523193
2024-10-25T06:34:05.189407703Z stdout F     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
2024-10-25T06:34:05.189410242Z stdout F     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
2024-10-25T06:34:05.189412761Z stdout F     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
2024-10-25T06:34:05.189415367Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750)
2024-10-25T06:34:05.189418563Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
2024-10-25T06:34:05.18942122Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
2024-10-25T06:34:05.189423763Z stdout F     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
2024-10-25T06:34:05.189426333Z stdout F     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
2024-10-25T06:34:05.189428867Z stdout F     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
2024-10-25T06:34:05.18943175Z stdout F     ... 4 common frames omitted
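
To check whether saved Zookeeper pod logs contain this signature, a short script can scan them for the "Unreasonable length" exception. The following Python is a minimal sketch, not an official diagnostic tool. The function name `find_oversized_records` is hypothetical, and the comparison limit assumes Zookeeper's documented server-side default for `jute.maxbuffer` (0xfffff bytes); the actual limit in a given deployment may differ.

```python
import re

# Assumption: Zookeeper's documented default for jute.maxbuffer (0xfffff bytes).
# The value configured in a given NAPP deployment may be different.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF

# Matches the exception text shown in the log excerpt above.
UNREASONABLE_LENGTH = re.compile(
    r"java\.io\.IOException: Unreasonable length = (\d+)"
)


def find_oversized_records(log_text, limit=DEFAULT_JUTE_MAXBUFFER):
    """Return the distinct reported record lengths that exceed `limit`, sorted."""
    lengths = {int(m.group(1)) for m in UNREASONABLE_LENGTH.finditer(log_text)}
    return sorted(n for n in lengths if n > limit)


# Three lines taken from the excerpt above.
sample = """\
2024-10-25T06:34:05.189182378Z stdout F 2024-10-25 06:34:05,188 [myid:2] - ERROR [main:o.a.z.s.q.QuorumPeer@1200] - Unable to load database on disk
2024-10-25T06:34:05.189203935Z stdout F java.io.IOException: Unreasonable length = 2523193
2024-10-25T06:34:05.189405133Z stdout F Caused by: java.io.IOException: Unreasonable length = 2523193
"""

print(find_oversized_records(sample))
```

A non-empty result means the log reports at least one record larger than the assumed limit, which is consistent with the failure described in this article.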

While Zookeeper is down, the Druid pods will be in a crash loop or error state. For example, the Druid Broker may be in CrashLoopBackOff with logs such as:

2024-10-25T06:47:50.665517863Z stdout F 2024-10-25T06:47:50,665 INFO [main-SendThread(zookeeper:3181)] org.apache.zookeeper.client.ZooKeeperSaslClient - Client will use DIGEST-MD5 as SASL mechanism.
2024-10-25T06:47:50.665817452Z stdout F 2024-10-25T06:47:50,665 INFO [main-SendThread(zookeeper:3181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper/10.97.72.224:3181.
2024-10-25T06:47:50.665864089Z stdout F 2024-10-25T06:47:50,665 INFO [main-SendThread(zookeeper:3181)] org.apache.zookeeper.ClientCnxn - SASL config status: Will attempt to SASL-authenticate using Login Context section 'Client'
2024-10-25T06:47:50.749038453Z stdout F 2024-10-25T06:47:50,748 INFO [epollEventLoopGroup-2-1] org.apache.zookeeper.ClientCnxnSocketNetty - SSL handler added for channel: [id: 0xee0425bc]
2024-10-25T06:47:50.749914187Z stdout F 2024-10-25T06:47:50,749 INFO [epollEventLoopGroup-2-1] org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.14.40:56520, server: zookeeper/10.97.72.224:3181
2024-10-25T06:47:50.750777095Z stdout F 2024-10-25T06:47:50,750 INFO [epollEventLoopGroup-2-1] org.apache.zookeeper.ClientCnxnSocketNetty - channel is connected: [id: 0xee0425bc, L:/192.168.14.40:56520 - R:zookeeper/10.97.72.224:3181]
2024-10-25T06:47:50.770499267Z stdout F 2024-10-25T06:47:50,770 INFO [epollEventLoopGroup-2-1] org.apache.zookeeper.ClientCnxnSocketNetty - channel is disconnected: [id: 0xee0425bc, L:/192.168.14.40:56520 ! R:zookeeper/10.97.72.224:3181]
2024-10-25T06:47:50.770549112Z stdout F 2024-10-25T06:47:50,770 INFO [epollEventLoopGroup-2-1] org.apache.zookeeper.ClientCnxnSocketNetty - channel is told closing
2024-10-25T06:47:50.770776321Z stdout F 2024-10-25T06:47:50,770 WARN [main-SendThread(zookeeper:3181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper/10.97.72.224:3181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2024-10-25T06:47:50.770788707Z stdout F org.apache.zookeeper.ClientCnxn$EndOfStreamException: channel for sessionid 0x0 is lost
2024-10-25T06:47:50.770793561Z stdout F     at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:286) ~[zookeeper-3.8.4.jar:3.8.4]
2024-10-25T06:47:50.770798723Z stdout F     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289) ~[zookeeper-3.8.4.jar:3.8.4]
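
The same approach can be applied to Druid pod logs. The sketch below counts "channel for sessionid ... is lost" errors per session ID. In the excerpt above the session ID is 0x0, which in the Zookeeper client generally means the connection dropped before the server assigned a session. The function name `count_lost_sessions` is hypothetical, and the script is only an illustration.

```python
import re
from collections import Counter

# Matches the Zookeeper client exception shown in the Druid Broker excerpt above.
SESSION_LOST = re.compile(
    r"EndOfStreamException: channel for sessionid (0x[0-9a-fA-F]+) is lost"
)


def count_lost_sessions(log_text):
    """Count connection-loss errors per Zookeeper session ID."""
    return Counter(m.group(1) for m in SESSION_LOST.finditer(log_text))


# Two lines taken from the excerpt above.
sample = """\
2024-10-25T06:47:50.770776321Z stdout F 2024-10-25T06:47:50,770 WARN [main-SendThread(zookeeper:3181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper/10.97.72.224:3181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2024-10-25T06:47:50.770788707Z stdout F org.apache.zookeeper.ClientCnxn$EndOfStreamException: channel for sessionid 0x0 is lost
"""

counts = count_lost_sessions(sample)
# Losses under session ID 0x0 suggest that no session was ever established.
print(counts["0x0"])
```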


While the Druid pods are unhealthy, the configure-druid pod remains in Running state and does not reach Completed.

nsxi-platform   common-agent-create-kafka-topic-hhm5r   0/1   Completed   0             59m
nsxi-platform   configure-druid-2jb9c                   1/1   Running     1 (19m ago)
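
A pod listing like the one above can also be checked programmatically. The Python below is a minimal sketch that assumes the listing has the column order NAMESPACE, NAME, READY, STATUS, as in the output above (for example, from `kubectl get pods -A`). The helper names `parse_pods` and `incomplete_jobs` are hypothetical.

```python
def parse_pods(listing):
    """Parse pod listing text into dicts using the first four columns."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row if present.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status = fields[:4]
        pods.append(
            {"namespace": namespace, "name": name, "ready": ready, "status": status}
        )
    return pods


def incomplete_jobs(pods, prefix="configure-druid"):
    """Return names of pods starting with `prefix` whose status is not Completed."""
    return [
        p["name"]
        for p in pods
        if p["name"].startswith(prefix) and p["status"] != "Completed"
    ]


# The two rows shown in the listing above.
sample = """\
nsxi-platform   common-agent-create-kafka-topic-hhm5r   0/1   Completed   0             59m
nsxi-platform   configure-druid-2jb9c                   1/1   Running     1 (19m ago)
"""

print(incomplete_jobs(parse_pods(sample)))
```

For the sample rows, the script reports the configure-druid pod because its status is Running, not Completed.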

Resolution

Please contact Broadcom Support for further assistance.