Storage latency causes NSX-T Manager cluster instability


Article ID: 316654


Updated On:


VMware NSX


- NSX Manager cluster services may be DOWN or unstable. Accessing the UI may show an error similar to "Component health: SEARCH:UP, POLICY:DOWN, MANAGER:DOWN, UI:UP, NODE_MGMT:UP."
- /var/log/cloudnet/nsx-ccp.log shows "Timeout while ping backend kv-store with upperBound 15s."
- Corfu is undergoing frequent epoch changes:
ls -ltrh /config/corfu | grep LAYOUT
- /var/log/syslog shows "CorfuDB is disconnected, set Cluster Status Down"
- /var/log/corfu/corfu-compactor-audit.log may contain "WARN CorfuRuntime-0 CorfuRuntime - Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 0 times, systemDownHandlerTriggerLimit = 60"
- /var/log/corfu/corfu.9000.log may contain "WrongEpochException" messages
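
The log symptoms above can be checked in one pass over copies of the log files (for example, from a support bundle). The following is a minimal sketch, not an official tool: the function name `count_signatures` and the labels are hypothetical, and the search strings are taken verbatim from the messages quoted in this article.

```python
# Sketch: count how often each log signature from this article appears.
# Assumption: plain substring matching is enough to spot the messages.

SIGNATURES = {
    "ccp kv-store timeout": "Timeout while ping backend kv-store",
    "corfu disconnected": "CorfuDB is disconnected, set Cluster Status Down",
    "layout servers unreachable": "Couldn't connect to any up-to-date layout servers",
    "wrong epoch": "WrongEpochException",
}


def count_signatures(lines):
    """Return a dict mapping each label to the number of matching lines."""
    counts = {label: 0 for label in SIGNATURES}
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts
```

Passing an open file object works as well as a list of strings, for example `count_signatures(open("corfu.9000.log", errors="replace"))`. Non-zero counts only indicate that the symptoms are present; they do not by themselves confirm storage latency as the cause.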

- High r_await or w_await values for the nsx-config device in /var/log/stats/sys_io.stats, for example:

Device       r/s  rkB/s  rrqm/s  %rrqm  r_await  rareq-sz   w/s  wkB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dkB/s  drqm/s  %drqm  d_await  dareq-sz  aqu-sz  %util
nsx-config  0.00   0.00    0.00   0.00     2.96     12.18  0.00   0.00    0.00   0.00   452.14      4.00  0.00   8.41    0.00   0.00     0.28  31699.12    0.00   0.00
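
To avoid reading the statistics file by eye, a short script can pick out the samples with high wait times. This is a minimal sketch under stated assumptions: the file is assumed to contain iostat-style text with a header line starting with "Device", the function name `find_high_latency` is hypothetical, and the 10 ms default threshold is borrowed from the documented storage requirement rather than from any official alerting rule.

```python
# Sketch: flag samples where a device's r_await or w_await exceeds a threshold.
# Assumptions: iostat-style text with a "Device" header line; await values in ms.

def find_high_latency(lines, device="nsx-config", threshold_ms=10.0):
    """Return (r_await, w_await) tuples for samples above threshold_ms."""
    columns = None
    hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            # Record column positions so parsing follows the header order.
            columns = {name: idx for idx, name in enumerate(fields)}
            continue
        if columns is None or fields[0] != device:
            continue
        try:
            r_await = float(fields[columns["r_await"]])
            w_await = float(fields[columns["w_await"]])
        except (KeyError, IndexError, ValueError):
            continue  # row or header without the expected columns
        if r_await > threshold_ms or w_await > threshold_ms:
            hits.append((r_await, w_await))
    return hits
```

Run against the sample row above, the function reports that row, because w_await (452.14) is far above 10 ms even though r_await (2.96) is not. Looking columns up by header name rather than fixed position keeps the sketch usable if the column order differs between versions.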


VMware NSX-T Data Center


High storage latency on the NSX Manager VMs causes cluster instability and frequent Corfu epoch changes, and produces the log messages listed above.
In affected environments, the VM performance charts show read and write latency well above 10ms. To view them in the vSphere UI, select the Manager VM > Monitor > Advanced, set 'View' to 'Datastore', and adjust the time period as needed.



Either resolve the read/write latency on the current datastore, or use Storage vMotion to move the Manager VMs to another datastore with latency under 10ms.

NSX-T Storage Requirement documentation states "The maximum disk access latency is under 10ms."