NSX-T Controllers are down in an OpenShift environment running NCP 3.1.1

Products

VMware NSX

Issue/Introduction

Symptoms:

NSX-T Data Center is deployed with either OpenShift or Vanilla Kubernetes.
NCP 3.1.1 is in use.
On the NSX-T Manager cluster, the Controller component is down:

Manager01> get cluster status
Cluster Id: 4d79d5ac-####-####-####-########204
Overall Status: DEGRADED

Group Type: DATASTORE
Group Status: STABLE

Members:
    UUID                                       FQDN                                       IP               STATUS
    a1ae0942-####-####-####-########231       Manager01                             1.#.3.4                UP
    32f00942-####-####-####-########a1d       Manager02                             2.#.4.5                UP
    bde60942-####-####-####-########125       Manager03                             3.#.5.6                UP

Group Type: CLUSTER_BOOT_MANAGER
Group Status: STABLE

Members:
    UUID                                       FQDN                                       IP               STATUS
    a1ae0942-####-####-####-########231       Manager01                             1.#.3.4                UP
    32f00942-####-####-####-########a1d       Manager02                             2.#.4.5                UP
    bde60942-####-####-####-########125       Manager03                             3.#.5.6                UP

Group Type: CONTROLLER
Group Status: UNAVAILABLE

Members:
    UUID                                       FQDN                                       IP               STATUS
    78cde5ef-####-####-####-########175       Manager01                             1.#.3.4                DOWN
    71d08354-####-####-####-########372       Manager02                             2.#.4.5                DOWN
    0aed1b70-####-####-####-########86f       Manager03                             3.#.5.6                DOWN
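
When the output of get cluster status has been saved to a file or captured by a script, the affected group can be picked out programmatically. The following Python sketch is illustrative only: the function name unstable_groups and the trimmed SAMPLE text are assumptions made for this example and are not part of NSX-T.

```python
# Trimmed copy of the "get cluster status" output shown above.
SAMPLE = """\
Overall Status: DEGRADED

Group Type: DATASTORE
Group Status: STABLE

Group Type: CLUSTER_BOOT_MANAGER
Group Status: STABLE

Group Type: CONTROLLER
Group Status: UNAVAILABLE
"""


def unstable_groups(status_text):
    """Map each group type to its status when that status is not STABLE."""
    result = {}
    current_group = None
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Group Type:"):
            # Remember which group the next "Group Status" line belongs to.
            current_group = line.split(":", 1)[1].strip()
        elif line.startswith("Group Status:") and current_group is not None:
            status = line.split(":", 1)[1].strip()
            if status != "STABLE":
                result[current_group] = status
    return result


if __name__ == "__main__":
    print(unstable_groups(SAMPLE))  # {'CONTROLLER': 'UNAVAILABLE'}
```

For the symptom described in this article, only the CONTROLLER group is expected to be reported.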

State sync of IDs takes a considerable length of time.

In this example from /var/log/proton/nsxapi.log, loading the IDs takes over 12 minutes:

2021-07-01T00:13:05.515Z INFO FullSyncIdsLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading IDs from NSGroupMembershipStatusEvent
2021-07-01T00:25:45.269Z INFO FullSyncIdsLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loaded 67514 IDs from NSGroupMembershipStatusEvent
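
The load time can be derived from the timestamps at the start of the "Loading IDs" and "Loaded ... IDs" lines. The following Python sketch shows one way to compute it; the function name elapsed_seconds is hypothetical, and the log lines in the usage section are abbreviated copies of the two lines above.

```python
from datetime import datetime

# Timestamp layout used at the start of each nsxapi.log line in the example.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed_seconds(start_line, end_line):
    """Return the seconds between the leading timestamps of two log lines."""
    start = datetime.strptime(start_line.split()[0], TS_FORMAT)
    end = datetime.strptime(end_line.split()[0], TS_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    # Abbreviated versions of the two log lines shown above.
    loading = "2021-07-01T00:13:05.515Z INFO FullSyncIdsLoader Loading IDs"
    loaded = "2021-07-01T00:25:45.269Z INFO FullSyncIdsLoader Loaded 67514 IDs"
    seconds = elapsed_seconds(loading, loaded)
    print(f"{seconds:.3f} s ({seconds / 60:.1f} min)")  # 759.754 s (12.7 min)
```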

The Controller sync times out and restarts, as logged in /var/log/syslog:
2021-06-29T15:04:58.928Z Manager01 NSX 28967 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Could not receive SyncResponse: Reached timeout of 60000 ms: Timeout was reached, restarting sync
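
To check how often this timeout occurs, the syslog can be scanned for the message above. The following Python sketch is one possible approach, not a supported tool; the names SYNC_TIMEOUT and find_sync_timeouts are hypothetical, and the pattern is based only on the single log line shown in this article.

```python
import re

# Matches the controller message shown above and captures the timeout value.
SYNC_TIMEOUT = re.compile(
    r"^(?P<ts>\S+) .*Could not receive SyncResponse: "
    r"Reached timeout of (?P<ms>\d+) ms"
)

# The syslog line from this article, split across lines for readability.
SAMPLE_LINE = (
    '2021-06-29T15:04:58.928Z Manager01 NSX 28967 - [nsx@6876 '
    'comp="nsx-controller" level="INFO" subcomp="adapter-mp"] '
    'Could not receive SyncResponse: Reached timeout of 60000 ms: '
    'Timeout was reached, restarting sync'
)


def find_sync_timeouts(lines):
    """Return (timestamp, timeout_ms) for each SyncResponse timeout line."""
    hits = []
    for line in lines:
        match = SYNC_TIMEOUT.search(line)
        if match:
            hits.append((match.group("ts"), int(match.group("ms"))))
    return hits


if __name__ == "__main__":
    print(find_sync_timeouts([SAMPLE_LINE]))
    # [('2021-06-29T15:04:58.928Z', 60000)]
```

To scan a real log, the function can be given an open file object, for example open("/var/log/syslog"), because it only iterates over lines.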

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This issue occurs due to a large number of stale entries that accumulate in the NSX-T Corfu database in an OpenShift or Vanilla Kubernetes environment.
These DB entries are a result of NCP setting wait_for_security_policy_sync to True.
The Controller cluster only comes up when a state sync completes for at least two of the Controllers.
Due to the large number of stale table entries, the sync operation times out, which prevents the Controllers from coming up.

Resolution

This issue is resolved in NCP 3.1.2, which permanently sets wait_for_security_policy_sync to False.

Workaround:
If the Controllers are already down, open a Support Request so that a support engineer can remove the stale DB entries.

To prevent the issue from occurring, apply the following configuration change.

For OpenShift environments:

1) Gain admin access to the OpenShift cluster from a console.

2) Ensure the operator is running. This can be done in two ways:

A) oc get pods -n nsx-system-operator -> should return a running pod for nsx-ncp-operator
B) oc get co nsx-ncp -> should return the operator as "available"

3) Edit the operator configmap

oc edit cm -n nsx-system-operator nsx-ncp-operator-config

4) Find the [nsx_v3] section and set wait_for_security_policy_sync = False (this line will need to be added)

5) Save the config map (:wq)

6) Wait for the operator to update the config map and recreate the NCP pod. Check the pod status with:

oc get pods -n nsx-system

7) Confirm that the config update was applied correctly:

oc get cm -n nsx-system nsx-ncp-config -o yaml

It should now contain: wait_for_security_policy_sync = False

8) Confirm that the agent pods are in a "Running" state and were recently restarted.
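
The check in step 7 can also be scripted once the INI text has been extracted from the config map. The following Python sketch makes several assumptions: the function name sync_wait_disabled is hypothetical, the text passed in is the NCP INI content (not the surrounding YAML), and that content can be read by Python's configparser, meaning it starts with a section header and uses key = value lines.

```python
import configparser


def sync_wait_disabled(ini_text):
    """Return True only if [nsx_v3] explicitly sets the option to False."""
    # interpolation=None keeps any "%" characters in other options literal;
    # strict=False tolerates duplicate sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    value = parser.getboolean(
        "nsx_v3", "wait_for_security_policy_sync", fallback=None
    )
    # A missing section or option yields None, which is treated as not disabled.
    return value is False


if __name__ == "__main__":
    example = "[nsx_v3]\nwait_for_security_policy_sync = False\n"
    print(sync_wait_disabled(example))  # True
```

A missing option is reported as not disabled because, as noted in step 4, the line has to be added explicitly.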

For Vanilla Kubernetes environments:

1) kubectl edit cm -n nsx-system nsx-ncp-config -> set wait_for_security_policy_sync to False

2) Delete the NCP pods (there is no need to restart the agents).