VMware NSX Federated environments have core dumps and checkpoint failing for LogicalSwitchUpdateInfo table

Article ID: 322644


Products

VMware NSX

Issue/Introduction

  • You are running an NSX-T Federated environment.
  • On the Local Manager, core dumps like the following are seen in /var/dump:
    core.java.#####.gz
    core.netty-0.#####.gz
  • The following files may be found in /image/core:
    compactor_oom.hprof.gz
    proton_oom.hprof.gz
  • Disk usage is increasing for /config, /image, and /var, as seen when running df -h.
  • In /var/log/corfu/corfu-compactor-audit.log, the table ########-8dd3-####-####-############ fails to checkpoint, as shown by entries like the following:
    INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for ########-8dd3-####-####-############ at snapshot Token(epoch=143, sequence=684087246)
    ERROR main FrameworkCorfuCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP1" level="ERROR" subcomp="corfu-compactor"] Checkpoint failed for framework data with namespace nsx-manager
  • Searching further back in the same log for the last time the table was successfully checkpointed shows an entry like the following:
    INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-8dd3-####-####-############, entries(1240805), cpSize(461703546) bytes at snapshot Token(epoch=4907, sequence=6294324773) in 4767668 ms
  • In the logs under /var/log/proton, entries like the following are seen:
    INFO workerTaskExecutor-17 LogicalSwitchUpdateInfoWorker 1740912 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] LogicalSwitchUpdateInfo handler for LSUInfo LogicalSwitchUpdateInfo/<ID> with null

 

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
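The compactor log check described above can be sketched as a small script. This is a minimal illustration, not a supported tool: the regular expressions are modelled only on the two CheckpointWriter excerpts quoted in this article, and the function names (summarize_checkpoints, unfinished) are hypothetical. A table with more "Started checkpoint" lines than "completed checkpoint" lines is reported as a candidate; note that a checkpoint still in progress at the end of the log would be reported in the same way.

```python
import re
import sys

# Patterns modelled on the CheckpointWriter excerpts in this article (assumption:
# your log lines follow the same wording and capitalisation).
STARTED = re.compile(
    r"appendCheckpoint: Started checkpoint for (?P<table>\S+) at snapshot"
)
COMPLETED = re.compile(
    r"appendCheckpoint: completed checkpoint for (?P<table>[^,\s]+), "
    r"entries\((?P<entries>\d+)\), cpSize\((?P<size>\d+)\) bytes .* in (?P<ms>\d+) ms"
)


def summarize_checkpoints(lines):
    """Count started/completed checkpoints per table and keep the last completion."""
    summary = {}
    for line in lines:
        done = COMPLETED.search(line)
        begun = None if done else STARTED.search(line)
        match = done or begun
        if not match:
            continue
        rec = summary.setdefault(
            match["table"], {"started": 0, "completed": 0, "last": None}
        )
        if done:
            rec["completed"] += 1
            rec["last"] = {
                "entries": int(done["entries"]),
                "bytes": int(done["size"]),
                "ms": int(done["ms"]),
            }
        else:
            rec["started"] += 1
    return summary


def unfinished(summary):
    """Tables with more checkpoint starts than completions."""
    return sorted(t for t, r in summary.items() if r["started"] > r["completed"])


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 check_cp.py /var/log/corfu/corfu-compactor-audit.log
    with open(sys.argv[1], errors="replace") as fh:
        result = summarize_checkpoints(fh)
    for table in unfinished(result):
        print(table, result[table])
```

For scale, the last successful checkpoint in the example above reports cpSize(461703546) bytes, roughly 440 MiB, written in 4767668 ms, roughly 79 minutes.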

Environment

VMware NSX 4.x
VMware NSX-T Data Center 3.x

Cause

The table ########-8dd3-####-####-############ (LogicalSwitchUpdateInfo) grows in size until the compactor can no longer checkpoint it, which in turn can cause the cluster to fail.
The /image/core and /var/dump directories grow because of the number of core dumps generated. /config also grows because Corfu checkpointing is not occurring.
The table grows because of an issue in the logic that checks LogicalSwitchUpdateInfo entries for cleanup: when the operation is null, the cleanup is skipped, and the skipped entries accumulate into a very large table.
This has only been seen in large environments with heavy churn.
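To watch the directory growth described above, the following is a minimal sketch that reports the percentage used for the three mount points named in this article. It is an illustration only: the 80% threshold and the function names are assumptions, not values from VMware, and df -h on the appliance remains the reference check.

```python
import shutil

# Mount points named in this article; adjust if your layout differs.
PATHS = ("/config", "/image", "/var")


def usage_report(paths=PATHS, disk_usage=shutil.disk_usage):
    """Return [(path, percent_used)] for each path that can be inspected."""
    report = []
    for path in paths:
        try:
            usage = disk_usage(path)
        except OSError:
            continue  # path does not exist or cannot be read on this host
        pct = 100.0 * usage.used / usage.total if usage.total else 0.0
        report.append((path, round(pct, 1)))
    return report


def over_threshold(report, threshold=80.0):
    """Paths at or above the (hypothetical) warning threshold, in percent."""
    return [path for path, pct in report if pct >= threshold]


if __name__ == "__main__":
    current = usage_report()
    for path, pct in current:
        print(f"{path}: {pct}% used")
    print("Above threshold:", over_threshold(current))
```

Running this periodically and comparing the output over time shows whether usage is still climbing, which a single df -h snapshot does not.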

Resolution

This is a known issue impacting VMware NSX.



Workaround

If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.