Symptoms:
- NSX Manager cluster services may be DOWN or unstable. Accessing the UI may show an error similar to "Component health: SEARCH:UP, POLICY:DOWN, MANAGER:DOWN, UI:UP, NODE_MGMT:UP."
- /var/log/cloudnet/nsx-ccp.log shows "Timeout while ping backend kv-store with upperBound 15s."
- Corfu is undergoing frequent epoch changes:
ls -ltrh /config/corfu | grep LAYOUT
- /var/log/syslog shows "CorfuDB is disconnected, set Cluster Status Down"
- /var/log/corfu/corfu-compactor-audit.log may contain "WARN CorfuRuntime-0 CorfuRuntime - Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 0 times, systemDownHandlerTriggerLimit = 60"
- /var/log/corfu/corfu.9000.log may contain "WrongEpochException" messages
- High r_await or w_await values for the nsx-config device in /var/log/stats/sys_io.stats, for example:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
nsx-config 0.00 0.00 0.00 0.00 2.96 12.18 0.00 0.00 0.00 0.00 452.14 4.00 0.00 8.41 0.00 0.00 0.28 31699.12 0.00 0.00
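The r_await/w_await check above can be sketched as a small shell function. This is a minimal sketch, not a supported tool: it assumes the stats file holds iostat-style output in which each "Device" header line names the columns for the rows that follow, that the device is named nsx-config as in the sample, and it borrows the 10 ms figure from the storage requirement as a default threshold. The function name high_await and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print samples whose r_await or w_await exceeds a threshold (ms).
# Assumptions: iostat-style layout as in the sample above; the device name
# (nsx-config) and the 10 ms default are taken from this article.

# high_await FILE [DEVICE] [THRESHOLD_MS]
high_await() {
    local file=$1 device=${2:-nsx-config} limit=${3:-10}
    awk -v dev="$device" -v limit="$limit" '
        # Header line: remember which column holds each await value.
        $1 == "Device" || $1 == "Device:" {
            r = 0; w = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "r_await") r = i
                if ($i == "w_await") w = i
            }
            next
        }
        # Data line for the device: report it when either value is too high.
        $1 == dev && r && w && (($r + 0) > (limit + 0) || ($w + 0) > (limit + 0)) {
            printf "%s r_await=%s w_await=%s\n", $1, $r, $w
        }
    ' "$file"
}
```

Usage on a Manager node would look like: high_await /var/log/stats/sys_io.stats. Locating the columns from the header line, rather than hard-coding positions, keeps the sketch tolerant of iostat versions that print a different set of columns.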
Environment:
VMware NSX-T Data Center
Cause:
High storage latency on the Manager VMs causes cluster instability, induces frequent Corfu epoch changes, and produces the log messages listed above.
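The log messages listed under Symptoms can be counted in one pass. The following is a minimal sketch under stated assumptions: the file paths and search strings are copied from this article, while the function name scan_signatures and its optional ROOT argument (a directory holding copies of the logs under the same paths) are hypothetical. It reads only the current log files, not rotated or compressed ones.

```shell
#!/usr/bin/env bash
# Sketch: count occurrences of the log signatures listed under Symptoms.
# Assumption: ROOT is a hypothetical path prefix; leave it empty to read
# the live files on a Manager node.

# scan_signatures [ROOT]
scan_signatures() {
    local root=${1:-} file pattern count
    while IFS='|' read -r file pattern; do
        if [ -r "$root$file" ]; then
            # grep -c prints 0 and exits non-zero when nothing matches.
            count=$(grep -cF -- "$pattern" "$root$file" || true)
        else
            count="n/a"
        fi
        # Output columns: match count, log file, signature (tab separated).
        printf '%s\t%s\t%s\n' "$count" "$file" "$pattern"
    done <<'EOF'
/var/log/cloudnet/nsx-ccp.log|Timeout while ping backend kv-store
/var/log/syslog|CorfuDB is disconnected, set Cluster Status Down
/var/log/corfu/corfu-compactor-audit.log|Couldn't connect to any up-to-date layout servers
/var/log/corfu/corfu.9000.log|WrongEpochException
EOF
}
```

Non-zero counts across several of these files at the same time are consistent with the storage-latency cause described here, but they do not confirm it on their own; the latency measurements remain the deciding evidence.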
VM performance charts in the vSphere UI (select the Manager VM > Monitor > Advanced, set 'View' to 'Datastore', and adjust the time period as needed) show read and write latency well above 10 ms.
Resolution:
Either resolve the datastore read/write latency, or Storage vMotion the Manager VMs to another datastore with latency under 10 ms.
The NSX-T storage requirements documentation states "The maximum disk access latency is under 10ms."
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/installation/GUID-AECA2EE0-90FC-48C4-8EDB-66517ACFE415.html