NSX-T Manager upgrade gets stuck or fails at run_migration_tool due to improper cleanup of the PersistentQueue

Article ID: 312606

Updated On: 12-31-2024

Products

VMware NSX

Issue/Introduction

The NSX-T Manager upgrade appears to be stuck. Running the "get upgrade progress-status" command to check the status shows the following:

run_migration_tool [2021-06-09 20:59:46 - ] FAILED
run_migration_tool [2021-06-15 19:59:15 - ] IN_PROGRESS
Status: Corfu Infrastructure Server is not running.

The NSX-T Manager upgrade is halted.
The following error stack appears in /var/log/policy/data-migration-with-old-protobuf.log:

2022-03-04T19:20:39.448Z ERROR main ClusteringCorfuCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP1" level="ERROR" subcomp="corfu-compactor"] Checkpoint failed for clustering data with namespace cbm
java.lang.IllegalArgumentException: Unknown type! Long <<<<<<<<<<<<<<<<
at org.corfudb.util.serializer.CorfuQueueSerializer.serialize(CorfuQueueSerializer.java:70) ~[policy-data-migration-with-old-protobuf-gc-ga.jar:?]
at org.corfudb.protocols.logprotocol.SMREntry.lambda$serialize$0(SMREntry.java:157) ~[policy-data-migration-with-old-protobuf-gc-ga.jar:?]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_251]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_251]
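
To check a saved copy of this log for the same signature, the two marker strings from the trace above can be searched for directly. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and requiring both markers to be present is an assumption made for illustration.

```python
from typing import Iterable, List

# Marker strings copied from the stack trace shown in this article.
# Requiring both of them is an assumption, not a documented rule.
MARKERS = (
    "Checkpoint failed for clustering data",
    "java.lang.IllegalArgumentException: Unknown type! Long",
)


def find_signature(lines: Iterable[str]) -> List[str]:
    """Return the markers that occur in the given log lines, in MARKERS order."""
    seen = set()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                seen.add(marker)
    return [marker for marker in MARKERS if marker in seen]


def matches_issue(lines: Iterable[str]) -> bool:
    """True only when every marker was found at least once."""
    return len(find_signature(lines)) == len(MARKERS)
```

For example, `matches_issue(open("data-migration-with-old-protobuf.log", errors="replace"))` could be run against a copy of the log taken from a support bundle.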

In /var/log/policy/cbm/cbm.log, entries indicate that the system attempted to delete the PersistentQueue:

2022-03-04T19:20:22.843Z INFO main PersistentQueueServiceImpl - - [nsx@6876 comp="PersistentQueueService" level="INFO" subcomp="PersistentQueueServiceImpl"] Deleted Persistent Queue QueueInfo{namespace=Policy, name=d.Policy.LM_2_GM_NOTIFICATION.source.87397fe7-####-####-####-246de1d5d2ab, queueProperties=QueueProperties{maxMessages=1000, maxMessageSize=8192, isMana <<<< From the cbm.log indicating Delete task

<<<<<<<<<<<<<<<<<<<<<<
java.lang.ClassNotFoundException: com.google.protobuf.LiteralByteString
at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_251]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_251]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) ~[?:1.8.0_251]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_251]

However, /var/log/corfu/corfu-compactor-audit.log indicates that the cleanup did not complete as expected:

2022-03-04T19:20:23.121Z INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 5399225d-####-####-####-9cf189e00c50 at snapshot Token(epoch=2193, sequence=2034030118)
2022-02-16T19:22:57.123Z ERROR main JsonSerializer - Exception during deserialization!

2022-03-04T19:20:23.609Z INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 5399225d-####-####-####-9cf189e00c50 at snapshot Token(epoch=2192, sequence=2034022332)
corfu-compactor-audit.log:2022-03-04T19:20:23.737Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 5399225d-####-####-####-9cf189e00c50, entries(526), cpSize(160699) bytes at snapshot Token(epoch=2192, sequence=2034022332) in 128 ms <<< Post delete time stamp from cbm.log, corfu-compactor-audit.log indicating the presence of "entries"
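
The comparison described above (a completed checkpoint that still reports entries after the delete timestamp in cbm.log) can be sketched as a small script. This is an illustrative sketch only: the regular expressions are derived from the sample lines in this article, the function names are hypothetical, and a hit is a candidate for review rather than proof that stale queue entries exist.

```python
import re
from datetime import datetime

# Timestamp layout taken from the sample log lines in this article.
TS = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z"
DELETE_RE = re.compile(TS + r".*Deleted Persistent Queue")
CHECKPOINT_RE = re.compile(
    TS + r".*completed checkpoint for (\S+), entries\((\d+)\)"
)


def _parse(ts):
    """Convert a log timestamp such as 2022-03-04T19:20:23.737 to a datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")


def checkpoints_after_delete(cbm_lines, audit_lines):
    """List (timestamp, stream_id, entries) for completed checkpoints that are
    logged after the earliest queue delete and report a non-zero entry count."""
    deletes = []
    for line in cbm_lines:
        match = DELETE_RE.search(line)
        if match:
            deletes.append(_parse(match.group(1)))
    if not deletes:
        return []
    first_delete = min(deletes)

    hits = []
    for line in audit_lines:
        match = CHECKPOINT_RE.search(line)
        if not match:
            continue
        entries = int(match.group(3))
        if _parse(match.group(1)) > first_delete and entries > 0:
            hits.append((match.group(1), match.group(2), entries))
    return hits
```

The script compares against the earliest delete so that any later checkpoint is considered. Whether a non-zero entry count on a given stream actually corresponds to a stale persistent queue still has to be confirmed by Broadcom Support.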

Environment

VMware NSX-T Data Center 3.x

Cause

Persistent queues should be cleared during an upgrade (when the orchestrator node reboots). In certain scenarios, a race condition prevents the cleanup from completing, leaving stale entries in the database that cause the upgrade to pause.

Resolution

Should you encounter this issue, contact Broadcom Support and reference this KB article.
