"Component health: SEARCH:UP, POLICY:DOWN, MANAGER:DOWN, UI:UP, NODE_MGMT:UP."
/var/log/cloudnet/nsx-ccp.log
shows "Timeout while ping backend kv-store with upperBound 15s."
ls -ltrh /config/corfu | grep LAYOUT
/var/log/syslog
shows "CorfuDB is disconnected, set Cluster Status Down"
/var/log/corfu/corfu-compactor-audit.log
may contain "WARN CorfuRuntime-0 CorfuRuntime - Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 0 times, systemDownHandlerTriggerLimit = 60"
/var/log/corfu/corfu.9000.log
may contain "WrongEpochException" messages
/var/log/stats/sys_io.stats
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
nsx-config 0.00 0.00 0.00 0.00 2.96 12.18 0.00 0.00 0.00 0.00 452.14 4.00 0.00 8.41 0.00 0.00 0.28 31699.12 0.00 0.00
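The await columns in the excerpt above can be checked against the 10 ms figure used in this article with a short script. The following is a minimal sketch, not an NSX tool. It assumes that sys_io.stats contains iostat-style text with a "Device" header line, as in the excerpt, and that the r_await and w_await columns are in milliseconds. The function names and the SAMPLE constant are hypothetical, and the sample row is the nsx-config row from the excerpt with its values read left to right against the header.

```python
# Sketch only: assumes iostat-style text with a "Device" header line,
# as in the excerpt from /var/log/stats/sys_io.stats.
LATENCY_LIMIT_MS = 10.0  # threshold taken from this article


def parse_iostat(text):
    """Return {device: {column: value}} for rows that match the header layout."""
    header = None
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device":
            header = fields[1:]  # column names after the device column
            continue
        if header is None or len(fields) != len(header) + 1:
            continue  # not a data row for the current header
        try:
            values = [float(v) for v in fields[1:]]
        except ValueError:
            continue  # non-numeric row, skip it
        devices[fields[0]] = dict(zip(header, values))
    return devices


def slow_devices(devices, limit_ms=LATENCY_LIMIT_MS):
    """Return the await values of devices whose r_await or w_await exceeds the limit."""
    slow = {}
    for name, cols in devices.items():
        awaits = {k: cols[k] for k in ("r_await", "w_await") if k in cols}
        if any(v > limit_ms for v in awaits.values()):
            slow[name] = awaits
    return slow


# Sample built from the excerpt in this article.
SAMPLE = (
    "Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm "
    "w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util\n"
    "nsx-config 0.00 0.00 0.00 0.00 2.96 12.18 0.00 0.00 0.00 0.00 "
    "452.14 4.00 0.00 8.41 0.00 0.00 0.28 31699.12 0.00 0.00\n"
)

if __name__ == "__main__":
    for device, awaits in slow_devices(parse_iostat(SAMPLE)).items():
        print(device, awaits)
```

With the sample row, nsx-config is reported because its w_await of 452.14 is far above the limit, while its r_await of 2.96 is not.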
VMware NSX
High storage latency on the Manager VMs causes cluster instability, induces frequent Corfu epoch changes, and produces the log messages listed above.
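To confirm that a Manager node shows the log signatures listed above, the files can be scanned for the quoted strings. The following is a minimal sketch under stated assumptions. The paths and substrings are copied from this article, the function names are hypothetical, and the script is assumed to run on the Manager with permission to read the logs. Rotated or compressed log files are not handled.

```python
# Sketch only: paths and message fragments are taken from this article.
SIGNATURES = {
    "/var/log/cloudnet/nsx-ccp.log": [
        "Timeout while ping backend kv-store",
    ],
    "/var/log/syslog": [
        "CorfuDB is disconnected, set Cluster Status Down",
    ],
    "/var/log/corfu/corfu-compactor-audit.log": [
        "Couldn't connect to any up-to-date layout servers",
    ],
    "/var/log/corfu/corfu.9000.log": [
        "WrongEpochException",
    ],
}


def count_signatures(lines, needles):
    """Count how many lines contain each substring in needles."""
    counts = {needle: 0 for needle in needles}
    for line in lines:
        for needle in needles:
            if needle in line:
                counts[needle] += 1
    return counts


def scan_logs(signatures=SIGNATURES):
    """Return {path: counts}, or {path: None} when a file cannot be read."""
    results = {}
    for path, needles in signatures.items():
        try:
            with open(path, errors="replace") as handle:
                results[path] = count_signatures(handle, needles)
        except OSError:
            results[path] = None  # missing or unreadable file
    return results
```

Calling scan_logs() returns a count per signature for each file. A count of zero does not rule out the issue, because the messages may have rotated out of the current log file.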
VM performance charts show read and write latency well above 10 ms. (In the vSphere UI, select the Manager VM > Monitor > Advanced, set 'View' to 'Datastore', and adjust the time period as needed.)
Either resolve the read/write latency on the datastore, or Storage vMotion the Manager VMs to another datastore with latency under 10 ms.
The NSX storage requirements documentation states:
"NSX appliance VMs that are backed by VSAN clusters may see intermittent disk write latency spikes of 10+ms. This is expected due to the way VSAN handles data (burst of incoming IOs resulting in queuing of data and delay). As long as the average disk access latency continues to be less than 10ms, intermittent latency spike should not have an impact on NSX Appliance VMs."
https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/vdefend-firewall/4-2/nsx-manager-and-host-transport-node-system-requirements.html
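The distinction the quoted guidance draws between average latency and intermittent spikes can be illustrated with a small calculation. The following is a minimal sketch. The function name and the sample values are hypothetical, and how the latency samples are collected (for example, exported from the vSphere performance chart) is outside its scope.

```python
from statistics import mean

LIMIT_MS = 10.0  # average latency limit quoted from the NSX documentation


def assess_latency(samples_ms, limit_ms=LIMIT_MS):
    """Summarize latency samples: average, spike count, and whether the average is under the limit."""
    if not samples_ms:
        raise ValueError("no latency samples")
    average = mean(samples_ms)
    spikes = sum(1 for sample in samples_ms if sample >= limit_ms)
    return {
        "average_ms": average,
        "spikes": spikes,
        "within_guidance": average < limit_ms,
    }


# Hypothetical samples: one spike, low average.
intermittent = [2, 3, 2, 25, 3, 2]
# Hypothetical samples: sustained high latency.
sustained = [40, 55, 30, 60]
```

For the first list, one sample is a spike but the average stays under 10 ms, which matches the case the documentation describes as expected. For the second list, every sample and the average are above 10 ms, which matches the condition this article treats as the cause.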
The following Knowledge Base articles can help with further troubleshooting of the storage issue:
"performance has deteriorated" messages in ESXi host logs
Using esxtop to identify storage performance issues for ESXi (multiple versions)
"state in doubt; requested fast path state update" error in ESXi