CORFU_NONCONFIG on NSX Manager cluster nodes shows persistent DB_SYNCING and DOWN status
Database synchronization between the primary node and secondary nodes is failing to complete.
<timestamp> | | DEBUG | logging-metrics-publisher | org.corfudb.client.metricsdata | failure-detector_ping-latency,id=c8f2051a-2243-4cda-a6dc-8de61ea7e26e,node=144.215.4.82:9040,metric_type=timer sum=6019988.469,count=26,mean=231538.018038,upper=1923807.666 1775721884462
<timestamp> | DEBUG | failAfter-0 | o.c.r.c.NettyClientRouter | sendRequestAndGetCompletable: Remove request <REQUEST_ID> to <IP_ADDRESS>:<PORT> due to timeout! Request:version { corfu_source_code_version: <VERSION_ID> } request_id: <REQUEST_ID> priority: HIGH epoch: <EPOCH_VALUE> cluster_id { lsb: <MASKED_ID> msb: <MASKED_ID> } client_id { lsb: <MASKED_ID> msb: <MASKED_ID> } ignore_cluster_id: true ignore_epoch: true
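The metrics entry above packs several values into one line, so it can help to split it into fields when comparing latency across nodes. The following is a minimal sketch, assuming the payload always follows the `name,tag=value,... field=value,... timestamp` layout seen in that entry. The function name `parse_metric_line` is hypothetical, and the unit of the timer values is not stated in the log, so the sketch does not assume one.

```python
# Sketch only: parse_metric_line is a hypothetical helper, not part of NSX or Corfu.
# It assumes the payload layout seen in the log excerpt above.

SAMPLE = (
    "<timestamp> | | DEBUG | logging-metrics-publisher | org.corfudb.client.metricsdata | "
    "failure-detector_ping-latency,id=c8f2051a-2243-4cda-a6dc-8de61ea7e26e,"
    "node=144.215.4.82:9040,metric_type=timer "
    "sum=6019988.469,count=26,mean=231538.018038,upper=1923807.666 1775721884462"
)


def parse_metric_line(line: str) -> dict:
    """Split a metrics log line into its name, tags, numeric fields, and timestamp."""
    # The metrics payload is the last pipe-separated column of the log line.
    payload = line.split("|")[-1].strip()
    # Payload layout (assumed): "<name>,<tags> <fields> <timestamp>"
    head, fields, timestamp = payload.split()
    name, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    values = {
        key: float(value)
        for key, value in (field.split("=", 1) for field in fields.split(","))
    }
    return {"name": name, "tags": tags, "values": values, "timestamp": int(timestamp)}


parsed = parse_metric_line(SAMPLE)
# The reported mean should match sum / count within rounding.
recomputed_mean = parsed["values"]["sum"] / parsed["values"]["count"]
print(parsed["tags"]["node"], parsed["values"]["mean"], parsed["values"]["upper"])
```

Comparing `mean` with `upper` for each node shows whether the ping latency is uniformly high or dominated by occasional outliers.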
VMware NSX
Corfu database operations time out when disk latency exceeds the critical threshold. NSX Manager requires average disk access latency to be less than 10ms.
Resolve the read/write latency on the datastore that hosts the NSX Manager VMs.
Alternatively, perform a Storage vMotion to migrate the NSX Manager VMs to a datastore that maintains a consistent latency below 10ms.
Per the NSX Storage Requirements:
"NSX appliance VMs backed by vSAN clusters may experience intermittent disk write latency spikes of 10ms+ due to standard vSAN I/O handling. However, provided the average latency remains below 10ms, these intermittent spikes should not impact the NSX Appliance VMs."
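The distinction between average latency and intermittent spikes can be illustrated with a short check. This is a sketch under stated assumptions: the function `assess_latency` and the sample values are hypothetical and made up for illustration, and the samples would come from whatever datastore monitoring is in use.

```python
from statistics import mean

# 10ms average threshold taken from the requirement quoted above.
THRESHOLD_MS = 10.0


def assess_latency(samples_ms):
    """Report the average latency and how many samples spiked to 10ms or more."""
    average = mean(samples_ms)
    spikes = [s for s in samples_ms if s >= THRESHOLD_MS]
    return {
        "average_ms": average,
        "spike_count": len(spikes),
        # Only the average is compared with the threshold; spikes are informational.
        "average_ok": average < THRESHOLD_MS,
    }


# Hypothetical samples: one spike, but the average stays below 10ms.
occasional_spike = assess_latency([2.0, 3.0, 15.0, 4.0])
# Hypothetical samples: sustained high latency, average above 10ms.
sustained_latency = assess_latency([12.0, 18.0, 9.0, 25.0])

print(occasional_spike)
print(sustained_latency)
```

In the first case the single 15ms spike does not push the average over the threshold, which matches the tolerated scenario in the quoted requirement. The second case represents the sustained latency that needs to be resolved.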