Storage latency causes NSX-T Manager cluster instability

Article ID: 316654

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
- NSX Manager cluster services may be DOWN or unstable. Accessing the UI may show an error similar to "Component health: SEARCH:UP, POLICY:DOWN, MANAGER:DOWN, UI:UP, NODE_MGMT:UP."
 
- /var/log/cloudnet/nsx-ccp.log shows "Timeout while ping backend kv-store with upperBound 15s."
 
- Corfu is undergoing frequent epoch changes, visible in the timestamps of its layout files:
ls -ltrh /config/corfu | grep LAYOUT
 
- /var/log/syslog shows "CorfuDB is disconnected, set Cluster Status Down"
 
- /var/log/corfu/corfu-compactor-audit.log may contain "WARN CorfuRuntime-0 CorfuRuntime - Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 0 times, systemDownHandlerTriggerLimit = 60"
 
- /var/log/corfu/corfu.9000.log may contain "WrongEpochException" messages
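
The checks above can be gathered into one pass. The following is a minimal sketch, not an official VMware tool: the function name scan_nsx_storage_symptoms and its optional root-directory argument are hypothetical, and the search strings are shortened forms of the messages quoted above. It prints a match count per log file, or "n/a" when a file is not readable, and then the number of LAYOUT files under /config/corfu. The count alone does not prove frequent epoch changes; the timestamps from the ls command above remain the reference.

```shell
# Sketch only: count the log signatures listed in the symptoms above.
# The function name and the root-directory argument are hypothetical.
scan_nsx_storage_symptoms() {
    local root="${1:-}"
    local file pattern count
    # Each line of the here-document is "log file|fixed search string".
    while IFS='|' read -r file pattern; do
        if [ -r "${root}${file}" ]; then
            # -F matches the string literally, -c prints the number of matching lines
            count=$(grep -c -F -- "$pattern" "${root}${file}")
        else
            count="n/a"
        fi
        printf '%s\t%s\t%s\n' "$count" "$file" "$pattern"
    done <<'EOF'
/var/log/cloudnet/nsx-ccp.log|Timeout while ping backend kv-store
/var/log/syslog|CorfuDB is disconnected, set Cluster Status Down
/var/log/corfu/corfu-compactor-audit.log|Couldn't connect to any up-to-date layout servers
/var/log/corfu/corfu.9000.log|WrongEpochException
EOF
    # Count the Corfu layout files, mirroring the ls | grep LAYOUT check above.
    if [ -d "${root}/config/corfu" ]; then
        count=$(ls -1 "${root}/config/corfu" | grep -c LAYOUT)
        printf '%s\t%s\t%s\n' "$count" "/config/corfu" "LAYOUT files"
    fi
}
```

Calling scan_nsx_storage_symptoms with no argument reads the paths as written on a Manager node. Passing a directory prefix reads the same paths under that directory, which assumes a copy of the logs that keeps the original layout.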

Environment

VMware NSX-T Data Center

Cause

High storage latency on the Manager VMs causes cluster instability, induces frequent Corfu epoch changes, and produces the other log messages listed above.
 
The VM performance charts show read and write latency well above 10 ms. To view them in the vSphere UI, select the Manager VM > Monitor > Advanced, set 'View' to 'Datastore', and adjust the time period as needed.

[Figure: example chart, image.png]
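
If the chart data is exported for offline review, the 10 ms comparison can be scripted. The following is a rough sketch under stated assumptions: the function name latency_over_threshold is hypothetical, and the CSV layout (one header row, then timestamp,read_ms,write_ms) is assumed for illustration and may not match what vSphere actually exports, so the column numbers would need adjusting.

```shell
# Sketch only: flag latency samples above a threshold (default 10 ms).
# Assumed, hypothetical CSV layout: header row, then timestamp,read_ms,write_ms
latency_over_threshold() {
    local csv="$1" limit="${2:-10}"
    awk -F',' -v limit="$limit" '
        NR == 1 { next }    # skip the header row
        { n++ }             # count every data row
        ($2 + 0) > limit || ($3 + 0) > limit { bad++; print "HIGH " $0 }
        END { printf "%d of %d samples above %s ms\n", bad, n, limit }
    ' "$csv"
}
```

A sample exactly at the threshold is not flagged, since the comparison is strictly greater than the limit.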

Resolution

Either resolve the read/write latency on the current datastore, or Storage vMotion the Manager VMs to another datastore with latency under 10 ms.

NSX-T Storage Requirement documentation states "The maximum disk access latency is under 10ms."
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/installation/GUID-AECA2EE0-90FC-48C4-8EDB-66517ACFE415.html