Storage latency causes NSX Manager cluster instability

Article ID: 316654

Products

VMware NSX

Issue/Introduction

  • NSX Manager cluster services may be DOWN or unstable. Accessing the UI may show an error similar to "Component health: SEARCH:UP, POLICY:DOWN, MANAGER:DOWN, UI:UP, NODE_MGMT:UP."
     
  • /var/log/cloudnet/nsx-ccp.log shows:
    "Timeout while ping backend kv-store with upperBound 15s."
     
  • Corfu is undergoing frequent epoch changes, which can be checked by listing the LAYOUT files and their modification times:
    ls -ltrh /config/corfu | grep LAYOUT
     
  • /var/log/syslog shows "CorfuDB is disconnected, set Cluster Status Down"
     
  • /var/log/corfu/corfu-compactor-audit.log may contain:
    "WARN CorfuRuntime-0 CorfuRuntime - Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 0 times, systemDownHandlerTriggerLimit = 60"
     
  • /var/log/corfu/corfu.9000.log may contain "WrongEpochException" messages like:
    <timestamp> | DEBUG | client-2 | o.c.r.c.NettyClientRouter | completeExceptionally: Remove request (Type: QUERY_NODE_REQUEST ID: #######) to <NSX-Manager-IP>:9000 due to WrongEpochException.
    <timestamp> | ERROR | client-2 | o.c.i.m.f.EpochHandler | Updating layout servers failed due to: org.corfudb.runtime.exceptions.WrongEpochException: Wrong epoch. [expected=##]
  • /var/log/corfu/corfu-metrics.log may contain a high "upper" value for the Corfu fsync timer (logunit_fsync_timer). An "upper" value over 250000 indicates high latency:
    <timestamp> | logunit_fsync_timer,id=<NSX-Manager-IP>:9000,metric_type=timer sum=22602729.174,count=1095,mean=20641.761803,upper=584631.735
    In this example, the "upper" value of 584631.735 corresponds to about 584 ms of fsync latency, which is very high for a latency-sensitive database.

  • /var/log/stats/sys_io.stats shows high r_await or w_await values for the nsx-config device. In this example, w_await is 452.14:

    Device      r/s   rkB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  aqu-sz  %util
    nsx-config  0.00  0.00   0.00    0.00   2.96     12.18     0.00  0.00   0.00    0.00   452.14   4.00      0.00  8.41   0.00    0.00   0.28     31699.12  0.00    0.00
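
The r_await and w_await check above can be scripted. The following is a minimal Python sketch, assuming the file contains iostat-style extended output with a "Device" header row as in the sample. The function names and the 10 ms limit are illustrative assumptions (the limit borrows the datastore latency figure from this article), and the sample text is the row shown above.

```python
# Sketch only: parse iostat-style extended output and flag high await values.
# parse_iostat, high_await and the 10 ms limit are illustrative assumptions.

SAMPLE = """\
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
nsx-config 0.00 0.00 0.00 0.00 2.96 12.18 0.00 0.00 0.00 0.00 452.14 4.00 0.00 8.41 0.00 0.00 0.28 31699.12 0.00 0.00
"""


def parse_iostat(text):
    """Return a list of (device, {column: value}) tuples, one per data row."""
    header = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            # Remember the column names from the most recent header row.
            header = fields[1:]
        elif header and len(fields) == len(header) + 1:
            try:
                values = [float(f) for f in fields[1:]]
            except ValueError:
                continue  # not a numeric data row
            samples.append((fields[0], dict(zip(header, values))))
    return samples


def high_await(samples, device="nsx-config", limit_ms=10.0):
    """Return the await columns above limit_ms for each sample of the device."""
    findings = []
    for name, stats in samples:
        if name != device:
            continue
        over = {k: stats[k] for k in ("r_await", "w_await")
                if stats.get(k, 0.0) > limit_ms}
        if over:
            findings.append(over)
    return findings


samples = parse_iostat(SAMPLE)
print(high_await(samples))
```

To run it against a real node, read /var/log/stats/sys_io.stats into a string and pass it to parse_iostat in place of SAMPLE.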

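The corfu-metrics.log check described above can be scripted in the same way. This minimal Python sketch assumes every logunit_fsync_timer line follows the layout of the sample line in this article, and that "upper" is reported in microseconds, which is inferred from the article equating 584631.735 with about 584 ms. The function names are illustrative.

```python
# Sketch only: extract the fsync "upper" value from corfu-metrics.log lines.
# Assumes the line layout shown in this article and microsecond units.
import re

FSYNC_LINE = re.compile(r"logunit_fsync_timer\S*\s+(?P<fields>\S+)")
UPPER_LIMIT_US = 250000  # threshold quoted in this article

SAMPLE_LINE = (
    "<timestamp> | logunit_fsync_timer,id=<NSX-Manager-IP>:9000,metric_type=timer "
    "sum=22602729.174,count=1095,mean=20641.761803,upper=584631.735"
)


def fsync_upper_us(line):
    """Return the 'upper' value of an fsync timer line, or None if absent."""
    match = FSYNC_LINE.search(line)
    if not match:
        return None
    # The field group looks like "sum=...,count=...,mean=...,upper=...".
    fields = dict(
        part.split("=", 1)
        for part in match.group("fields").split(",")
        if "=" in part
    )
    if "upper" not in fields:
        return None
    return float(fields["upper"])


def is_high_latency(line, limit_us=UPPER_LIMIT_US):
    """True when the line is an fsync timer line whose 'upper' exceeds the limit."""
    upper = fsync_upper_us(line)
    return upper is not None and upper > limit_us


upper = fsync_upper_us(SAMPLE_LINE)
print(f"upper = {upper / 1000.0:.1f} ms, high latency: {is_high_latency(SAMPLE_LINE)}")
```

Applying is_high_latency to each line of /var/log/corfu/corfu-metrics.log and counting the matches gives a rough view of how often the threshold is exceeded.
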
Environment

VMware NSX

Cause

High storage latency on the NSX Manager VMs causes cluster instability, induces frequent Corfu epoch changes, and produces the log messages listed above.
 
In this condition, the VM performance charts show read and write latency well above 10 ms. To view them in the vSphere UI, select the Manager VM > Monitor > Advanced, set 'View' to 'Datastore', and adjust the time period as needed.

Resolution

Either resolve the read/write latency on the current datastore, or Storage vMotion the Manager VMs to another datastore with latency under 10 ms.

The NSX storage requirements documentation states:

"NSX appliance VMs that are backed by VSAN clusters may see intermittent disk write latency spikes of 10+ms. This is expected due to the way VSAN handles data (burst of incoming IOs resulting in queuing of data and delay). As long as the average disk access latency continues to be less than 10ms, intermittent latency spike should not have an impact on NSX Appliance VMs."

NSX Manager VM and Host Transport Node System Requirements
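
The quoted guidance separates the average latency from intermittent spikes. The following minimal Python sketch illustrates that distinction. The function name and the sample values are hypothetical, and real values would come from exported datastore latency samples for the Manager VM.

```python
# Sketch only: compare average latency against the 10 ms figure and count spikes.
# assess_latency and the sample values below are hypothetical illustrations.
from statistics import mean


def assess_latency(latencies_ms, limit_ms=10.0):
    """Summarize a series of latency samples given in milliseconds."""
    average = mean(latencies_ms)
    spikes = [value for value in latencies_ms if value >= limit_ms]
    return {
        "average_ms": average,
        "spike_count": len(spikes),
        # Per the quoted guidance, the average is what matters, not single spikes.
        "average_ok": average < limit_ms,
    }


# One spike, but the average stays under the limit.
occasional_spike = assess_latency([2.0, 3.0, 14.0, 2.5, 3.5])
# Sustained latency above the limit.
sustained = assess_latency([12.0, 18.0, 15.0])

print(occasional_spike)
print(sustained)
```

In the first series a single 14 ms sample does not push the average over the limit, while the second series represents the sustained latency this article describes.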

 

Additional Information