NSX-T Alert "Control Channel To Transport Node Down Long" is constantly reported.

Products

VMware NSX

Issue/Introduction

Event Type: Alarm for Control channel to transport node down long
Event ID: control_channel_to_transport_node_down_long

Alarm Description

This alarm is raised when a Transport node is not able to connect to the Control Plane (CCP).

Purpose: Indicates that the connection from the Controller service to the Transport node has been down for at least fifteen minutes.
Impact: In this scenario, no new configuration can be pushed down to the Transport node from the Control plane and features like vMotion will not be available.

Issue Description:
On the NSX-T Manager, the alarm "Control Channel To Transport Node Down Long" is reported.
These alarms are observed on connected and working transport nodes.
There is no impact on the services or VMs running on the Transport Node (Host or Edge).
Rebooting NSX-T Managers does not resolve the alert.
The alert appears again after it is resolved in the NSX-T Web UI.
As the example below shows, the Transport Node is connected to the Managers and Controllers correctly.

[root@esxi-host:~] nsxcli -c get controllers
<Time Stamp>
Controller IP  Port  SSL      Status     Is Physical Master  Session State  Controller FQDN
172.#.#.19     1235  enabled  connected  true                up             NA
172.#.#.18     1235  enabled  not used   false               null           NA
172.#.#.17     1235  enabled  not used   false               null           NA

[root@esxi-host:~] nsxcli -c get managers
<Time Stamp>
- 172.#.#.17 Connected (NSX-RPC)
- 172.#.#.18 Connected (NSX-RPC)
- 172.#.#.19 Connected (NSX-RPC) *
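The controller check above can also be scripted. The following is a minimal sketch, assuming the column layout shown in the example output (Status in the fourth whitespace-separated field). The function name connected_controllers is hypothetical and is not part of nsxcli.

```shell
# Print the IP of every controller whose Status column reads "connected".
# Assumption: fields are laid out as in the example output above, so the
# status is the 4th whitespace-separated field ("not used" splits into
# "not" and "used" and is therefore never matched).
connected_controllers() {
    awk '$4 == "connected" { print $1 }'
}

# Hypothetical usage on a transport node:
#   nsxcli -c get controllers | connected_controllers
```

A healthy transport node is expected to list exactly one controller here, matching the single "connected" row in the example.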

In the NSX Manager log /var/log/syslog, entries similar to the following appear:

Year-MM-Date:##:##:##.###Z FATAL pool-62-thread-1 MonitoringServiceImpl 3494 MONITORING [nsx@6876 alarmId="<UUID>" alarmState="OPEN" comp="nsx-manager" entId="36#####b-3##0-4###-9###-9##########7" errorCode="MP701099" eventFeatureName="communication" eventSev="CRITICAL" eventState="On" eventType="control_channel_to_transport_node_down_long" level="FATAL" nodeId="a######2-c##d-1##5-b##8-0##########5" subcomp="monitoring"] Controller service on Manager node 172.#.#.19 (7######b-####-4##4-9##1-1##########3) to Transport node 3638uuuu-3bf0-####-uuuu-9c23f580ef77 down for at least 15 minutes from Controller service's point of view.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
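To see which transport nodes the alarm refers to, the syslog entries can be filtered by event type. This is a sketch only, assuming the message text ends with "to Transport node <UUID> down for ..." as in the example entry. The function name affected_transport_nodes is hypothetical.

```shell
# List the distinct transport node UUIDs named in
# control_channel_to_transport_node_down_long entries.
# $1 is the log file; it defaults to the path mentioned above.
# Assumption: the message wording matches the example log entry.
affected_transport_nodes() {
    log="${1:-/var/log/syslog}"
    grep 'eventType="control_channel_to_transport_node_down_long"' "$log" |
        sed -n 's/.*to Transport node \([^ ]*\) down for.*/\1/p' |
        sort -u
}

# Hypothetical usage on an NSX Manager, as root:
#   affected_transport_nodes /var/log/syslog
```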

Environment

VMware NSX-T

Cause

CCP (Central Control Plane) data migration from NSX-T Data Center 3.0.x/3.1.x to a later release may leave conflicting records, which may generate these alarms.

Resolution

This issue is resolved in NSX-T Data Center 3.2.2, 4.0.1, and 4.1.0 available at Download Broadcom products and software.

Workaround:
Restart the nsx-proxy service.
The following process will not impact the dataplane. The nsx-proxy service connects to the CCP on the NSX Manager appliance to get new configurations.
Transport nodes have their own database, which caches all existing configurations.
The Transport Node will be disconnected from the CCP while nsx-proxy is down, meaning new configuration will not be processed until nsx-proxy is started again and the connection to the CCP is back up.

Step 1
Stop the nsx-proxy service for 5-10 minutes.
Run the following command on the Transport Node (Host or Edge) as root.

/etc/init.d/nsx-proxy stop

Step 2
After nsx-proxy has been stopped for more than 5 minutes, start it again.

/etc/init.d/nsx-proxy start

Wait for a short period, then confirm that the alarms have cleared.
Once the alarms have cleared, the fix is permanent and they should not occur again.
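Steps 1 and 2 can be combined into one small helper so that the wait between stop and start is not forgotten. This is a sketch under stated assumptions, not a supported tool: the function name bounce_nsx_proxy and the variables NSX_PROXY_INIT and WAIT_SECONDS are hypothetical, and the default wait of 360 seconds is simply one value that satisfies "more than 5 minutes".

```shell
# Assumed defaults for illustration; override them in the environment if needed.
NSX_PROXY_INIT="${NSX_PROXY_INIT:-/etc/init.d/nsx-proxy}"
WAIT_SECONDS="${WAIT_SECONDS:-360}"   # just over 5 minutes

# Stop nsx-proxy, wait, then start it again.
# Returns non-zero without waiting if the stop command fails.
bounce_nsx_proxy() {
    "$NSX_PROXY_INIT" stop || return 1
    echo "nsx-proxy stopped; waiting ${WAIT_SECONDS}s before starting it again"
    sleep "$WAIT_SECONDS"
    "$NSX_PROXY_INIT" start
}

# Hypothetical usage on the Transport Node, as root:
#   bounce_nsx_proxy
```

The function is only defined here, not called, so sourcing the snippet does not touch the service.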

Additional Information

When this alarm is raised, check the connectivity between the Transport node and the Control Plane (CCP):

localcli network ip connection list | grep 1235 (on ESX node)
netstat -anp | grep 1235 (on Edge node)
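The output of either command can be reduced to a yes/no answer. The sketch below assumes that both commands print addresses as IP:port and show the state ESTABLISHED for a live session. The function name has_ccp_session is hypothetical.

```shell
# Exit 0 if the input contains an ESTABLISHED connection on port 1235,
# otherwise exit 1. The port must be followed by a non-digit or end of
# line, so a port such as 12350 does not match.
has_ccp_session() {
    awk '/:1235([^0-9]|$)/ && /ESTABLISHED/ { found = 1 } END { exit !found }'
}

# Hypothetical usage:
#   localcli network ip connection list | has_ccp_session && echo "CCP session up"   (on ESX node)
#   netstat -anp | has_ccp_session && echo "CCP session up"                          (on Edge node)
```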

Impact/Risks:
Once it is confirmed that the transport node's connectivity to the controllers and managers is correct, as above, this is considered a cosmetic issue.