NSX-T Manager upgrade gets stuck or fails at run_migration_tool due to improper cleanup of the PersistentQueue

Article ID: 312606

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • The NSX-T Manager upgrade appears to be stuck. Running the "get upgrade progress-status" command to check the status shows the following:
run_migration_tool [2021-06-09 20:59:46 - ] FAILED            
run_migration_tool [2021-06-15 19:59:15 - ] IN_PROGRESS
Status: Corfu Infrastructure Server is not running.
  • The NSX-T Manager upgrade is halted.
  • The following error stack appears in /var/log/policy/data-migration-with-old-protobuf.log:
2022-xx-xxTxx:xx:39.448Z ERROR main ClusteringCorfuCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP1" level="ERROR" subcomp="corfu-compactor"] Checkpoint failed for clustering data with namespace cbm
 java.lang.IllegalArgumentException: Unknown type! Long                          <<<<<<<<<<<<<<<<
         at org.corfudb.util.serializer.CorfuQueueSerializer.serialize(CorfuQueueSerializer.java:70) ~[policy-data-migration-with-old-protobuf-gc-ga.jar:?]
         at org.corfudb.protocols.logprotocol.SMREntry.lambda$serialize$0(SMREntry.java:157) ~[policy-data-migration-with-old-protobuf-gc-ga.jar:?]
         at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_251]
         at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_251]
 
  • In /var/log/policy/cbm/cbm.log, entries indicate that the system attempted to delete the PersistentQueue:
 
2022-xx-xxTxx:xx:22.843Z INFO main PersistentQueueServiceImpl - - [nsx@6876 comp="PersistentQueueService" level="INFO" subcomp="PersistentQueueServiceImpl"] Deleted Persistent Queue QueueInfo{namespace=Policy, name=d.Policy.LM_2_GM_NOTIFICATION.source.87397fe7-1b6b-45a2-b93e-246de1d5d2ab, queueProperties=QueueProperties{maxMessages=1000, maxMessageSize=8192, isMana            <<<< From the cbm.log indicating Delete task
 
     <<<<<<<<<<<<<<<<<<<<<<
 java.lang.ClassNotFoundException: com.google.protobuf.LiteralByteString
         at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_251]
         at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_251]
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) ~[?:1.8.0_251]
         at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_251]
  • However, /var/log/corfu/corfu-compactor-audit.log indicates that the cleanup did not complete as expected: a checkpoint completed after the delete timestamp in cbm.log still reports entries.
2022-xx-xxTxx:xx:23.121Z INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 5399225d-b311-3f04-bb5a-9cf189e00c50 at snapshot Token(epoch=2193, sequence=2034030118)
 2022-02-16T19:22:57.123Z ERROR main JsonSerializer - Exception during deserialization!      

2022-xx-xxTxx:xx:23.609Z INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 5399225d-b311-3f04-bb5a-9cf189e00c50 at snapshot Token(epoch=2192, sequence=2034022332)  
 corfu-compactor-audit.log:2022-xx-xxTxx:xx:23.737Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 5399225d-b311-3f04-bb5a-9cf189e00c50, entries(526), cpSize(160699) bytes at snapshot Token(epoch=2192, sequence=2034022332) in 128 ms            <<< Post delete time stamp from cbm.log, corfu-compactor-audit.log indicating the presence of "entries" 
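
The symptoms above can be checked mechanically. The following is a minimal sketch, not a VMware tool: it assumes the output of "get upgrade progress-status" and the contents of the log files have already been collected into strings. The function names parse_step_status and find_signatures, and the signature labels, are hypothetical; the matched strings are taken from the excerpts above.

```python
import re

# Matches step lines such as:
#   run_migration_tool [2021-06-09 20:59:46 - ] FAILED
# The bracketed time window is captured whole rather than split.
STEP_RE = re.compile(
    r"^\s*(?P<step>\w+)\s+\[(?P<window>[^\]]*)\]\s+(?P<status>[A-Z_]+)\s*$"
)

# Substrings copied from the log excerpts in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "checkpoint_failed": "Checkpoint failed for clustering data",
    "unknown_type_long": "IllegalArgumentException: Unknown type! Long",
    "queue_deleted": "Deleted Persistent Queue",
}


def parse_step_status(cli_output, step="run_migration_tool"):
    """Return the statuses reported for `step`, in the order they appear."""
    statuses = []
    for line in cli_output.splitlines():
        match = STEP_RE.match(line)
        if match and match.group("step") == step:
            statuses.append(match.group("status"))
    return statuses


def find_signatures(log_text):
    """Return the labels of the known signatures present in `log_text`."""
    return {label for label, needle in SIGNATURES.items() if needle in log_text}
```

A FAILED status followed by IN_PROGRESS for run_migration_tool, together with the checkpoint_failed and unknown_type_long signatures in data-migration-with-old-protobuf.log, corresponds to the pattern described in this article. A match only suggests the issue; it does not confirm it.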


Environment

VMware NSX-T Data Center

Cause

Persistent queues are expected to be cleared during an upgrade, when the orchestrator node reboots. In certain scenarios, a race condition prevents the cleanup from completing, which leaves stale entries in the database. These stale entries cause the upgrade to pause.
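
One way to look for the stale entries described above is to compare the deletion time logged in cbm.log with later "completed checkpoint" lines in corfu-compactor-audit.log. The following is a minimal sketch under stated assumptions: the log lines carry full timestamps (the excerpts in this article are masked), and a non-zero entries(...) count after the deletion is treated as the indicator. The function name checkpoints_after_delete is hypothetical.

```python
import re
from datetime import datetime

# Matches "completed checkpoint" lines as shown in the audit log excerpt.
# search() is used so that a grep-style "filename:" prefix is tolerated.
CHECKPOINT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*"
    r"completed checkpoint for (?P<stream>[0-9a-f-]+), entries\((?P<entries>\d+)\)"
)


def parse_ts(ts):
    """Parse a timestamp like 2022-02-16T19:22:23.737, with or without a trailing Z."""
    return datetime.strptime(ts.rstrip("Z"), "%Y-%m-%dT%H:%M:%S.%f")


def checkpoints_after_delete(audit_lines, delete_ts):
    """Return (stream, entries) for checkpoints completed after delete_ts
    that still report a non-zero entry count.

    Assumption: such checkpoints point at leftover queue data. The article
    does not state which stream IDs belong to the deleted queue, so the
    result only narrows the search.
    """
    cutoff = parse_ts(delete_ts)
    hits = []
    for line in audit_lines:
        match = CHECKPOINT_RE.search(line)
        if not match:
            continue
        entries = int(match.group("entries"))
        if parse_ts(match.group("ts")) > cutoff and entries > 0:
            hits.append((match.group("stream"), entries))
    return hits
```

The delete_ts argument would be the timestamp of the "Deleted Persistent Queue" line in cbm.log. Interpreting the result still requires VMware Support, since the sketch cannot tell which checkpointed stream backs the deleted queue.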

Resolution

Currently, there is no resolution for this issue.

Workaround:
Should you encounter this issue, please raise a Support Request with VMware referencing this article.