Communication Channel Health Check between NSX manager and NSX prepared hosts may show as DOWN under certain conditions, leading to high CPU utilization on NSX manager
search cancel

Communication Channel Health Check between NSX manager and NSX prepared hosts may show as DOWN under certain conditions, leading to high CPU utilization on NSX manager

book

Article ID: 325127

calendar_today

Updated On:

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

The NSX Data Center for vSphere UI will shows communication channel health check as down. 

You will see the following in the VC WebClient.  Additionally you will see NSX Manager CPU utilization increasing. 


In the vsfwd.log, on the ESXi host, you will notice that the host is connecting to rabbitMQ process on NSX Manager repeatedly due to 'failed to log in':
 
2019-08-11T19:12:14Z vsfwd: [INFO] Removed App of AppType VNET and AppId host-20492
2019-08-11T19:12:15Z vsfwd: [ERROR] Failed to log on to broker 10.236.18.15:5671: Logging in: Input/output error
2019-08-11T19:12:19Z vsfwd: [INFO] Ending NET CLIENT Listen Thread
2019-08-11T19:12:19Z vsfwd: [INFO] Exiting NET bulk data thread...
2019-08-11T19:12:25Z vsfwd: [INFO] Find new password, !!EXITING!!
--
2019-08-11T19:54:17Z vsfwd: [INFO] Removed App of AppType VNET and AppId host-20492
2019-08-11T19:54:19Z vsfwd: [ERROR] Failed to log on to broker 10.236.18.15:5671: Logging in: Input/output error
2019-08-11T19:54:22Z vsfwd: [INFO] Ending NET CLIENT Listen Thread
2019-08-11T19:54:22Z vsfwd: [INFO] Exiting NET bulk data thread...
2019-08-11T19:54:22Z vsfwd: [INFO] Find new password, !!EXITING!!

On the ESXi host experiencing the issue, initiate a grep command as shown below. This will show a list of time stamps where the host is discovering that its been issued a new password by NSX Manager. The NSX Manager provides the host with a new password when it misses certain number of heartbeat responses from host.
The log example below indicates that this host has been repeatedly getting new password from NSX Manager: 
 
prd-csp-5:019017493507:log$ grep "EXITING" vsfwd.log
2019-08-11T19:12:25Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T19:18:35Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T19:23:45Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T19:29:55Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T19:42:01Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T19:48:12Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T19:54:22Z vsfwd: [INFO] Find new password, !!EXITING!!
2019-08-11T20:00:32Z vsfwd: [INFO] Find new password, !!EXITING!!

In the vsm.log on NSX Manager, the following log lines indicate that there are heartbeat messages getting missed by the NSX Manager rabbitmq process: 
 
2019-08-11 11:41:43.302 CDT ERROR HeartbeatManagerHeartbeatTimer HeartbeatManager$HeartbeatTask:295 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Client has not responded to the heartbeat for longer than the alert threshold. Peer name = 'com.vmware.vshield.userworld', client token = 'host-20492', client id = '33977938-2b66-4daf-8f1c-488e6b274bd8', last heartbeat response = '306', last published heartbeat = '308'
2019-08-11 11:53:43.438 CDT ERROR HeartbeatManagerHeartbeatTimer HeartbeatManager$HeartbeatTask:295 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Client has not responded to the heartbeat for longer than the alert threshold. Peer name = 'com.vmware.vshield.userworld', client token = 'host-20492', client id = '33977938-2b66-4daf-8f1c-488e6b274bd8', last heartbeat response = '308', last published heartbeat = '312'


Environment

VMware NSX Data Center for vSphere 6.4.x

Cause

The issue occurs because every host is trying to register to NSX manager. During this process every publish event than happens for a VNIC generates a ACK message back to NSX manager. The NSX Manager may get overwhelmed by the volume of messages generated by hosts, which may lead to the NSX Manager not receiving or processing the heartbeat messages from other hosts and respond back in time, thus causing a loop.

Note: Logs for ACK received by NSX Manager from every VNIC in the infrastructure can only be seen when debug logs are enabled.

Resolution

This is a known issue affecting VMware NSX for vSphere 6.4.x.
 
Currently, there is no resolution.



Steps to track issue (event code - 301033)

  1. Run the following REST API to set the log level to debug for ResponseHandler class in app (DFW) module.
    POST https://<nsx-manager>/api/1.0/services/debug/loglevel/com.vmware.vshield.firewall.messaging.response.service.ResponseHandler?level=DEBUG
  2. Search for the following string “Processing VSFWEvent” in vsm.log and get the count in a given time interval to see if there are abnormal number of events related to status code - 30103
  3. Revert log level to INFO using the same REST API as before
    POST https://<nsx-manager>/api/1.0/services/debug/loglevel/com.vmware.vshield.firewall.messaging.response.service.ResponseHandler?level=INFO