NSX-T Tier 0 Gateway failover is not occurring in Active/Standby mode

Products

VMware NSX

Issue/Introduction

Symptoms:

There is a Tier 0 Gateway configured in Active/Standby (A/S) mode.
Preemptive mode is enabled.
A failover was attempted by editing the Gateway and changing the preferred edge node.
This failed and the failover did not occur.
In /var/log/proton/nsxapi.log on the NSX-T manager, we can see that the Gateway was updated:

2022-10-21T01:55:05.664Z INFO http-nio-127.0.0.1-7440-exec-34 LogicalRouterServiceImpl 4414 ROUTING [nsx@6876 comp="nsx-manager" level="INFO" reqId="9bdd2986-####-####-####-##########30" subcomp="manager" username="admin"] Invoking entity listener with UPDATE for LogicalRouter/eba53a2a-####-####-####-##########f7
2022-10-21T01:55:06.231Z INFO http-nio-127.0.0.1-7440-exec-34 LogicalRouterServiceImpl 4414 ROUTING [nsx@6876 comp="nsx-manager" level="INFO" reqId="9bdd2986-####-####-####-##########30" subcomp="manager" username="admin"] Persisted configuration update for logical router eba53a2a-####-####-####-##########f7 of type TIER0

Checking /var/log/proton/nsxapi.log on all three NSX-T managers, we see that two managers received the work item, but one did not:

2022-10-21T01:55:06.428Z INFO with operation UPDATE, adding work-item WorkItem{identifier=LogicalRouter/eba53a2a-####-####-####-##########f7, Timestamp{epoch=12666, address=5362095164}} for processing
2022-10-21T01:55:06.428Z INFO policyProviderTaskScheduler-1 WorkerShardManager 4567 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] canCurrentNodeProcess : false, for worker : LogicalRouterWorker, for workItem : WorkItem{identifier=LogicalRouter/eba53a2a-####-####-####-##########f7, Timestamp{epoch=12666, address=5362095164}}

On the impacted NSX-T manager, the one which did not process the work item, we see the following log entries in syslog and nsxapi.log:

/var/log/proton/nsxapi.log

2022-10-21T01:55:06.428Z ERROR org.corfudb.runtime.collections.streaming.StreamPollingScheduler-worker-2 ResumeStreamListener 1758 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP4" level="ERROR" subcomp="manager"] Failed to re-subscribe [tag:worker_framework] nsx$[null]. Listener is NOT SUBSCRIBED yet! lastProcessedTs:epoch: 746

/var/log/syslog
2022-10-21T01:55:06.428Z nsx-mgt-03 NSX 1758 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP4" level="ERROR" subcomp="manager"] Failed to re-subscribe [tag:worker_framework] nsx$[null]. Listener is NOT SUBSCRIBED yet! lastProcessedTs:epoch: 746#012sequence: 1086122231#012, retry 1/20

Note: Above is a sample entry for the two managers which received the work item.
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
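To check each manager for the signatures quoted above, a small shell helper along the following lines may be convenient. This is only a sketch: the function name check_worker_subscription is hypothetical, the search strings are copied from the sample entries in this article, and the default path is the nsxapi.log location named above. The canCurrentNodeProcess count is reported for context only; the re-subscribe failure is the entry this article associates with the impacted manager.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX-T): counts the two log signatures
# quoted in this article in a given log file.
check_worker_subscription() {
    # Default to the log path named in this article; pass another path to override.
    log_file="${1:-/var/log/proton/nsxapi.log}"

    # -F matches the text literally, so "[" and "]" are not treated as a regex.
    # -c prints the number of matching lines; "|| true" ignores grep's
    # non-zero exit status when that number is 0.
    resub=$(grep -cF 'Failed to re-subscribe [tag:worker_framework]' "$log_file" || true)
    skipped=$(grep -cF 'canCurrentNodeProcess : false' "$log_file" || true)

    echo "re-subscribe failures: ${resub}"
    echo "work items not processed by this node: ${skipped}"
}
```

Run as root on each manager, for example: check_worker_subscription /var/log/proton/nsxapi.log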

Environment

VMware NSX-T Data Center

Cause

The worker framework was unsubscribed from the Corfu listener and was unable to resubscribe.
In the logs above, we see the entry canCurrentNodeProcess : false. This indicates that the manager is not the owner of the entity and is unable to process the request.

Resolution

This issue is resolved in NSX-T 3.2.3 (see the VMware NSX-T Data Center 3.2.3 Release Notes, section "Resolved Issues", Fixed Issue 3052786).

Workaround:
As root user on the NSX-T manager which did not receive the work item, restart the proton service.

# service proton restart
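After the restart, one way to see whether the re-subscribe error is still being logged is to count matching entries newer than the restart time. The sketch below assumes every log line starts with an ISO-8601 UTC timestamp, as in the excerpts above, and that you noted the restart time yourself. The function name failures_since is hypothetical. A count of 0 suggests that the listener is no longer failing to re-subscribe, but it is not proof on its own.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX-T): counts "Failed to re-subscribe"
# entries whose leading timestamp is later than the given time.
failures_since() {
    since="$1"                                    # e.g. 2022-10-21T02:00:00.000Z
    log_file="${2:-/var/log/proton/nsxapi.log}"   # log path named in this article

    awk -v since="$since" '
        # Appending "" forces a string comparison; ISO-8601 timestamps in the
        # same time zone sort chronologically when compared as strings.
        ($1 "") > (since "") &&
        index($0, "Failed to re-subscribe [tag:worker_framework]") { n++ }
        END { print n + 0 }
    ' "$log_file"
}
```

For example: failures_since 2022-10-21T02:00:00.000Z /var/log/proton/nsxapi.log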