NSX Manager cluster is DOWN post network outage.
search cancel

NSX Manager cluster is DOWN post network outage.

book

Article ID: 417202

calendar_today

Updated On:

Products

VMware NSX

Issue/Introduction

  • You faced a network outage. 
  • The NSX cluster went down during this outage and did not recover automatically. 
  • Log lines similar to the below are encountered on the NSX Manager in /var/log/syslog 
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="corfu-cluster"] Member: #########-####-####-####-##############, name: #########-####-####-####-##############, status: DOWN
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="corfu-cluster"] Member: #########-####-####-####-##############, name: #########-####-####-####-##############, status: DOWN
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="corfu-cluster"] Member: #########-####-####-####-##############, name: #########-####-####-####-##############, status: DOWN
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="corfu-cluster"] Membership update end
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="corfu-cluster"] Received maintenance mode status: {#########-####-####-####-##############=MAINTENANCE_MODE_OFF, #########-####-####-####-##############=MAINTENANCE_MODE_OFF, #########-####-####-####-##############=MAINTENANCE_MODE_OFF}
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="corfu-cluster"] Myself is not in up list, turn the cluster status to be down
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="magpie"] Updating Active Members
    NSX 517167 - [nsx@6876 comp="nsx-controller" level="EVENT" subcomp="magpie"] 0 members active out of
  • Log lines similar to the below are encountered on the NSX Manager in /var/log/cbm/cbm.log
    WARN netty-2 NettyClientRouter 3019 userEventTriggered: unhandled event SslHandshakeCompletionEvent(javax.net.ssl.SSLHandshakeException: error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate)
    INFO netty-0 NettyClientRouter 3019 Connect Async <NSX manager IP>:9040
    ERROR netty-0 ClientHandshakeHandler 3019 exceptionCaught: Exception DecoderException caught.
    io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate

Environment

VMware NSX

Cause

When a network outage triggers the sync task between the NSX manager nodes, a race condition with the CBM certificate replacement task leaves the Corfu DB public trust store un-updated, causing services to fail in connecting to the Corfu DB.

Resolution

This issue is resolved in VMware NSX 4.2.1, available at Broadcom Downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

  • Reboot one NSX Manager at a time, wait for the node to come up, then restart the next node. 
  • Wait for the cluster to reach a stable state. 
    get cluster status

Additional Information

NSX UI inaccessible post CBM cert replacement

Resolving NSX Manager Corruption Post-Storage Outage via Redeployment and Restore