During a full sync, CCP pushes stale configuration to host transport nodes.

Products

VMware vDefend Firewall

Issue/Introduction

DFW rules are missing in host transport nodes.

*Relevant log’s location*:
In CCP log, it shows that DFW rules in "RULE_SECTION" are deleted during a Full Sync.

In directory nsx_manager/var/log/cloudnet
$ grep "Full sync entity update for RULE*" nsx-ccp-20210727*
nsx-ccp-20210727-010852714.log:2021-07-26T19:03:51.310Z INFO MessageProcessor FullSyncTransactionCreator - - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for RULE_SECTION: Added 0, Deleted 0, Updated 7374
nsx-ccp-20210727-010852714.log:2021-07-26T19:03:51.480Z INFO MessageProcessor FullSyncTransactionCreator - - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for RULE: Added 879, Deleted 1378, Updated 56935
Note that these messages are generic and indicate that DFW rules are deleted. The above messages only show a problem if no DFW rule is deleted in NSX-T Manager.

The following messages are related to the race conditions causing stale configuration in in-memory cache.
1. In Proton log, it shows that a Full Sync started at 2021-07-29T10:55:19:
nsx_manager/var/log/proton
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/c######f-0##5-4##a-b##9-f##########9] to protobuf
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message UpmProfile/UpmProfile/3######0-d##9-4##4-8##0-7##########3
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/3######0-d##9-4##4-8##0-7##########3] to protobuf
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/a######b-0##d-4##e-a##8-f##########8] to protobuf

2. In Proton log, it shows that there was a new Sync Leader at 2021-07-29T10:55:58
/nsx_manager/var/log/proton
2021-07-29T10:55:58.850Z INFO pool-410-thread-1 WaitingForLeadershipLostAckState - - [nsx@6876 comp="gmle-client" level="INFO" subcomp="WaitingForLeadershipLostAckState"] Entering WaitingForLeadershipLostAckState for lease id <UUID> service TraceflowCacheManager on member <UUID> of group <UUID> from AcquiredState : LeadershipLease{serviceName=TraceflowCacheManager, leaderId=<UUID>, leaseVersion=6560, revocationCount=0, leaseId=<UUID>}

3. In Proton log, it shows that Statesync tried to stop loading messages to in-memory cache
proton$ grep 2021-07-29T10:55:59.845 nsxapi.*
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor StateSyncShardingCoordinator - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting shutdown for leave() processing
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor CorfuDbTransactionLogHandler - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping Corfu Transaction Log Processor
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullSyncCommunicator - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] FullSyncService ccp full sync communicator bean is stopping.
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping FullStateSyncDataBuilder
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Clearing cache
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor FullStateSyncCache - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full state sync cache stats.

4. In Proton log, it shows that Fullsync still loaded messages for 5 more seconds. Then, a fatal error occurred at 2021-07-29T10:56:04. lease note that fatal error does not always occur.
/nsx_managervar/log/proton
proton$ grep AbstractFullStateSyncDataBuilder nsxapi.1.log | grep 021-07-29T10:55:59 | head -5
2021-07-29T10:55:59.015Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.046Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.071Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.094Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.114Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>

hdixon@gss-prd-csp-1:proton$ grep AbstractFullStateSyncDataBuilder nsxapi.1.log | tail -10
2021-07-29T10:56:04.019Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.043Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.062Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.067Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] closeTransaction
2021-07-29T10:56:04.067Z FATAL FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" errorCode="MP4705" level="FATAL" subcomp="manager"] Unexpected error in full state sync processor
at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadMessage(AbstractFullStateSyncDataBuilder.java:382) ~[ccp-1.0.jar:?]
at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadSpecial(AbstractFullStateSyncDataBuilder.java:363) ~[ccp-1.0.jar:?]
at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadData(AbstractFullStateSyncDataBuilder.java:275) ~[ccp-1.0.jar:?]
at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.run(AbstractFullStateSyncDataBuilder.java:217) ~[ccp-1.0.jar:?]
2021-07-29T10:56:04.067Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Exiting MessageLoader

Environment

VMware NSX-T Data Center 3.x

Cause

This issue is due to a rare race condition which causes the NSX Manager to fail to publish updated rules to the Transport Nodes. If the NSX Manager with the stale configuration becomes the Sync leader, host transport nodes receives incorrect configuration. The sequence of events cause the race condition is as follows:

A full sync starts. The Sync Leader is loading the configuration into in-memory cache
Before the configuration is fully loaded into in-memory cache, another node takes over as the Sync Leader. This causes Fullsync loading partial configuration to in-memory cache. This creates stale configuration in in-memory cache.
When the node with stale configuration becomes the Sync Leader again, the node pushes stale data to host transport nodes

*The time span between #2 and 3 may be a few days or weeks.

Resolution

Issue is resolved in NSX-T 3.1.3.1
This issue only affects all NSX-T versions up to 3.1.3. As State Sync is not used in 3.2.0, this issue does not apply to 3.2.0 and later versions.

Workaround:
After a Manager loses its leadership in Fullsync, re-start proton via cli:
/etc/init.d/proton restart