VMs configured with NSX-T DFW rules become unreachable on the network.

Article ID: 384215

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

  • VMs use tag-based membership for NSGroups.
  • The VMs lose their tags, which causes the DFW rules to stop applying and results in the VMs becoming unreachable on the network.
  • In the log /var/log/messaging-manager/messaging-manager.log on all three manager nodes, the NSX product version reported for the cluster nodes is inconsistent between managers:

    Manager-1:

    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxcb59]], isLocal=false, productVersion=4.1.0.2.0.21761695],
    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxc90b]], isLocal=true, productVersion=4.1.0.2.0.21761695]].
    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx0f2f]], isLocal=false, productVersion=4.1.0.2.0.21761695],

    Manager-2:

    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxcb59]], isLocal=false, productVersion=4.1.0.2.0.21761695],
    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxc90b]], isLocal=false, productVersion=3.2.1.2.0.20541216]].
    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx0f2f]], isLocal=true, productVersion=4.1.0.2.0.21761695],

    Manager-3:

    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxcb59]], isLocal=true, productVersion=4.1.0.2.0.21761695],
    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxc90b]], isLocal=false, productVersion=3.2.1.2.0.20541216]].
    entity_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx0f2f]], isLocal=false, productVersion=3.2.1.2.0.20541216],

  • In the affected host's /var/run/log/nsx-syslog.log, the following entries indicate a configuration mismatch:

    2024-11-18T18:03:56.172Z In(182) nsx-proxy[2101919]: NSX 2101919 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2101919" level="INFO"] MessagingClientService: Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://xx.xx.xx.23:1234 client_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx8a7f
    2024-11-18T18:03:56.172Z In(182) nsx-proxy[2101919]: NSX 2101919 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2101919" level="INFO"] HeartbeatManager: configuration hash mismatch in heartbeat callback. Old hash - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxb1d4, New hash - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5242. Invoking RESET on Forwarding Engine
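The version check against the manager's local cache can be scripted. A minimal sketch, assuming the log format shown above; the heredoc holds shortened placeholder lines, and on a real manager node you would grep /var/log/messaging-manager/messaging-manager.log instead:

```shell
# Placeholder lines mimicking the messaging-manager.log excerpt above.
# On a real manager node, replace the heredoc with:
#   grep -a "productVersion=" /var/log/messaging-manager/messaging-manager.log
sample="$(cat <<'EOF'
entity_uuid=xxxxxxxx-cb59]], isLocal=false, productVersion=4.1.0.2.0.21761695],
entity_uuid=xxxxxxxx-c90b]], isLocal=false, productVersion=3.2.1.2.0.20541216]].
entity_uuid=xxxxxxxx-0f2f]], isLocal=true, productVersion=4.1.0.2.0.21761695],
EOF
)"

# A healthy cluster cache reports exactly one distinct productVersion.
distinct=$(printf '%s\n' "$sample" \
  | grep -o 'productVersion=[0-9.]*' \
  | cut -d= -f2 | sort -u | wc -l)

if [ "$distinct" -gt 1 ]; then
  echo "WARNING: $distinct distinct productVersion values in local cache"
else
  echo "OK: single productVersion in local cache"
fi
```

With the placeholder sample above the check reports two distinct versions, matching the mismatch seen on Manager-2 and Manager-3.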

Environment

VMware NSX
VMware NSX-T Data Center

Cause

The NSX Manager's local cache holds different product versions for the three cluster nodes. This causes the configuration hash mismatch, which results in a continuous discovery loop on the affected ESXi host.

The in-memory map holding the version information of the MP cluster is updated (read from ClusterNodeConfigModel in Corfu) only on:

  • MP node add
  • MP node remove
  • MP node up
  • MP node down

Because the map is refreshed only on these events, the in-memory version(s) of the MP cluster can deviate from the version stored in Corfu.

As the MP nodes are upgraded one by one (each node is rebooted after its upgrade), each node ends up with a different in-memory view of the cluster's versions.

Resolution

  1. On the first NSX Manager node, restart the messaging-manager service by running </etc/init.d/messaging-manager stop> followed by </etc/init.d/messaging-manager start>.
  2. Wait for the NSX Manager cluster status to show green and stable, then allow a further 10-15 minutes after the cluster turns green.
  3. Repeat step 1 on the second manager node.
  4. Wait for the NSX Manager cluster status to show green and stable, then allow a further 10-15 minutes after the cluster turns green.
  5. Repeat step 1 on the third manager node.
  6. Wait for the NSX Manager cluster status to show green and stable, then allow a further 10-15 minutes after the cluster turns green.
  7. Log in to each NSX Manager node as root over SSH and run <grep -a "productVersion=" /var/log/messaging-manager/messaging-manager.log> to check the product versions of the three manager nodes in each manager's local cache.
  8. In the output of the above command, all three manager nodes should report the correct product version for all three nodes.
  9. Once the above is confirmed, check the affected host's status in the NSX UI: System --> Fabric --> Hosts --> expand the relevant cluster --> in the host's row, click View Details on the right and verify that everything is green. In the same view, click Monitor --> scroll down to Tunnels --> verify that all tunnel statuses show green.
  10. If the affected host's status also looks good, set the cluster's DRS to manual mode, take the affected host out of maintenance mode, migrate a couple of non-critical VMs to it, and monitor them for connectivity issues.
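The rolling restart in steps 1-6 can be summarized as a per-node plan. A minimal sketch that only prints the command sequence rather than executing it (the node names are placeholders; run each pair of commands on the corresponding manager and wait for a stable green cluster in between):

```shell
# Print the restart sequence for each manager node. This sketch does not
# execute anything on the managers; it only emits the plan from steps 1-6.
plan_rolling_restart() {
  for node in "$@"; do
    echo "[$node] /etc/init.d/messaging-manager stop"
    echo "[$node] /etc/init.d/messaging-manager start"
    echo "[$node] wait: cluster status green and stable, then 10-15 min settle"
  done
}

plan_rolling_restart manager-1 manager-2 manager-3
```

Restarting the nodes strictly one at a time, with a settle period between them, keeps a quorum of the manager cluster available throughout the procedure.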