NSX-T manager UI unresponsive with critical services seen flapping on NSX managers


Article ID: 421453

Products

VMware NSX

Issue/Introduction

The NSX managers were reachable via ping and SSH, but several services, such as CORFU_NONCONFIG, DATASTORE, MANAGER and HTTPS, were flapping (UP/DOWN) and never reached a stable state. As a result, the NSX-T GUI was not accessible.

Environment

3.2.2.0

Cause

The Corfu logs consistently show WrongEpochException, which can indicate a problem with the underlying storage (datastore) or with the network.

####-##-##T##:##:##.###Z | DEBUG | client-3 | o.c.r.c.NettyClientRouter | completeExceptionally: Remove request 24 to ###.##.###.##:9000 due to WrongEpochException.  
org.corfudb.runtime.exceptions.WrongEpochException: Wrong epoch. [expected=1374]
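
To judge how consistently the exception occurs, the log lines can be counted rather than read by eye. The following is a minimal Python sketch, assuming the log lines have already been read into a list and that the message format matches the sample above. The function name summarize_wrong_epoch is hypothetical and is not part of any NSX tooling.

```python
import re
from collections import Counter

# Pattern taken from the sample log line above; other Corfu versions
# may word the message differently (assumption).
EPOCH_RE = re.compile(r"WrongEpochException: Wrong epoch\. \[expected=(\d+)\]")


def summarize_wrong_epoch(lines):
    """Return (number of lines mentioning WrongEpochException,
    Counter of the expected epoch values that were reported)."""
    total = 0
    epochs = Counter()
    for line in lines:
        if "WrongEpochException" in line:
            total += 1
        match = EPOCH_RE.search(line)
        if match:
            epochs[int(match.group(1))] += 1
    return total, epochs


# The two masked sample lines from this article.
sample = [
    "####-##-##T##:##:##.###Z | DEBUG | client-3 | o.c.r.c.NettyClientRouter | "
    "completeExceptionally: Remove request 24 to ###.##.###.##:9000 due to WrongEpochException.",
    "org.corfudb.runtime.exceptions.WrongEpochException: Wrong epoch. [expected=1374]",
]
total, epochs = summarize_wrong_epoch(sample)
print(total, dict(epochs))
```

A high count spread over a long time window, rather than a short burst, is what matches the "consistently" observation in this article.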

"Loss of connectivity with datastore" was also observed in the vCenter events and in hostd.log on the ESXi host where the NSX managers were deployed:

####-##-##T##:##:##.###Z info hostd[2100173] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 62955 : Lost access to volume <UUID of datastore volume>  (<Datastore name>) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
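
When hostd.log is large, it can help to list which volumes lost access and how often. The sketch below is a hedged Python example, assuming the event text follows the format of the sample line above. The function name lost_volume_events and the sample UUID and datastore name are hypothetical placeholders, not values from a real host.

```python
import re
from collections import Counter

# Pattern derived from the sample hostd.log line above:
# "Lost access to volume <UUID>  (<Datastore name>) due to connectivity issues."
LOST_RE = re.compile(
    r"Lost access to volume (.+?)\s+\((.+?)\) due to connectivity issues"
)


def lost_volume_events(lines):
    """Count "Lost access to volume" events per (volume UUID, datastore name)."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Hypothetical line: the UUID and datastore name are made-up placeholders.
sample = [
    "ts info hostd[2100173] [Originator@6876 sub=Vimsvc.ha-eventmgr] "
    "Event 62955 : Lost access to volume vol-uuid-1  (ds-example) due to "
    "connectivity issues. Recovery attempt is in progress and outcome will "
    "be reported shortly.",
    "ts info hostd[2100173] some unrelated message",
]
events = lost_volume_events(sample)
for (uuid, name), count in events.items():
    print(f"{name} ({uuid}): {count} event(s)")
```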

Resolution

Migrate the NSX-T managers from the ESXi hosts where "Lost access to volume" was observed to other ESXi hosts that have no datastore issues.

After migration, run the command "get cluster status" on an NSX-T manager to check that the Corfu DB has synchronized and that the NSX managers and their services have stabilized.

Note: It may take some time for the Corfu DB to synchronize across all 3 NSX managers and for all elements and modules to become visible in the NSX-T GUI.
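
While waiting for the cluster to settle, the output of "get cluster status" can be checked repeatedly for groups that are not yet stable. The following is a rough Python sketch, assuming the output has been captured as text and that each service group is reported as a "Group Type:" line followed by a "Group Status:" line. That layout, the status values in the sample, and the function name unstable_groups are assumptions for illustration; verify them against the actual output in your environment.

```python
def unstable_groups(status_text):
    """Return {group type: status} for every group whose status is not STABLE.

    Assumes "Group Type: <name>" is followed by "Group Status: <value>".
    """
    result = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Group Type:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Group Status:") and current is not None:
            status = line.split(":", 1)[1].strip()
            if status != "STABLE":
                result[current] = status
            current = None
    return result


# Hypothetical captured output, shortened for illustration.
sample_output = """
Group Type: DATASTORE
Group Status: STABLE

Group Type: MANAGER
Group Status: DEGRADED

Group Type: HTTPS
Group Status: UNAVAILABLE
"""
pending = unstable_groups(sample_output)
print(pending)
```

An empty result on several consecutive checks suggests the services have stopped flapping. A single stable reading is not enough, because a flapping service can briefly report a stable state.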

Additional Information

If the modules are still not visible after a long time, run the commands below on one of the NSX managers.

>> start search resync policy

>> start search resync all (if the above command doesn't work)

It was also observed that all 3 NSX-T managers were deployed on the same ESXi host, so the datastore inaccessibility, and with it the WrongEpochException error, affected all 3 NSX managers at once. It is therefore recommended to deploy each NSX-T manager on a separate ESXi host.
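
The placement recommendation above can be verified with a simple check. The Python sketch below assumes you already have a mapping of NSX manager VM names to the ESXi hosts they run on, for example exported from vCenter. The function name shared_hosts and all VM and host names are hypothetical.

```python
from collections import defaultdict


def shared_hosts(placement):
    """Given {manager VM name: ESXi host name}, return the hosts that
    run more than one NSX manager, with the managers on each."""
    by_host = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Hypothetical placement: two managers share one host.
placement = {
    "nsx-mgr-1": "esxi-a",
    "nsx-mgr-2": "esxi-a",
    "nsx-mgr-3": "esxi-b",
}
violations = shared_hosts(placement)
print(violations)
```

An empty result means every manager runs on its own host, so a datastore or host problem on one ESXi host affects at most one manager.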

Refer to KB 316654 for more information on NSX manager cluster instability due to storage latency.
