NSX-T Controllers are Down in an Openshift environment running NCP 3.1.1

Article ID: 324234

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • NSX-T Data Center with either Openshift or Vanilla Kubernetes.
  • NCP 3.1.1.
  • On the NSX-T Manager cluster, the Controller component is down, as reported by get cluster status:

Manager01> get cluster status
Cluster Id: 4d79d5ac-e658-433d-be87-459a85125204
Overall Status: DEGRADED

Group Type: DATASTORE
Group Status: STABLE

Members:
    UUID                                       FQDN                                       IP               STATUS
    a1ae0942-2e38-a24a-a057-cb0f47f78231       Manager01                             1.2.3.4                 UP
    32f00942-fe12-ffaa-4951-ff8131640a1d       Manager02                             2.3.4.5                 UP
    bde60942-eea3-d4dd-db40-707f9f14e125       Manager03                             3.4.5.6                 UP

Group Type: CLUSTER_BOOT_MANAGER
Group Status: STABLE

Members:
    UUID                                       FQDN                                       IP               STATUS
    a1ae0942-2e38-a24a-a057-cb0f47f78231       Manager01                             1.2.3.4                 UP
    32f00942-fe12-ffaa-4951-ff8131640a1d       Manager02                             2.3.4.5                 UP
    bde60942-eea3-d4dd-db40-707f9f14e125       Manager03                             3.4.5.6                 UP

Group Type: CONTROLLER
Group Status: UNAVAILABLE

Members:
    UUID                                       FQDN                                       IP               STATUS
    78cde5ef-3fe1-4c12-84f3-0e0ce2382175       Manager01                             1.2.3.4                 DOWN
    71d08354-4537-4fba-90d9-928c544bf372       Manager02                             2.3.4.5                 DOWN
    0aed1b70-d498-47d0-a93e-6f9138cc786f       Manager03                             3.4.5.6                 DOWN
  • State sync of IDs takes a considerable length of time.
  In this example from /var/log/proton/nsxapi.log, loading the IDs takes over 12 minutes:

2021-07-01T00:13:05.515Z  INFO FullSyncIdsLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading IDs from NSGroupMembershipStatusEvent
2021-07-01T00:25:45.269Z  INFO FullSyncIdsLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loaded 67514 IDs from NSGroupMembershipStatusEvent
  • Controller sync times out and restarts, as logged in /var/log/syslog:
 2021-06-29T15:04:58.928Z Manager01 NSX 28967 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Could not receive SyncResponse: Reached timeout of 60000 ms: Timeout was reached, restarting sync
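
The load time in the nsxapi.log excerpt above can be worked out from the two leading timestamps. The following Python sketch is illustrative only: load_duration is a hypothetical helper, the log lines are abbreviated with "..." for readability, and the timestamp format is assumed to match the excerpt.

```python
from datetime import datetime, timedelta

# Timestamp layout at the start of each nsxapi.log line in the excerpt above.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def load_duration(start_line: str, end_line: str) -> timedelta:
    """Return the time elapsed between two log lines (hypothetical helper)."""
    start = datetime.strptime(start_line.split()[0], TS_FORMAT)
    end = datetime.strptime(end_line.split()[0], TS_FORMAT)
    return end - start


# Abbreviated copies of the "Loading IDs" / "Loaded ... IDs" lines.
loading = "2021-07-01T00:13:05.515Z  INFO FullSyncIdsLoader ... Loading IDs from NSGroupMembershipStatusEvent"
loaded = "2021-07-01T00:25:45.269Z  INFO FullSyncIdsLoader ... Loaded 67514 IDs from NSGroupMembershipStatusEvent"

elapsed = load_duration(loading, loaded)
print(elapsed)  # 0:12:39.754000
```

For comparison, the Controller log line above reports a SyncResponse timeout of 60000 ms, which is far shorter than this load time.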


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This issue occurs due to a large number of stale entries that accumulate in the NSX-T Corfu database in an Openshift or Vanilla Kubernetes environment.
These DB entries are a result of NCP setting wait_for_security_policy_sync to True.
The Controller cluster will only come up when a state sync completes for at least two of the Controllers.
Due to the large number of stale table entries, the sync operation times out. This prevents the Controllers from coming up.

Resolution

This issue is resolved in NCP 3.1.2, which permanently sets wait_for_security_policy_sync to False.

Workaround:
If the Controllers are already down, please open a Support Request so that a support engineer can remove the stale DB entries.
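
Whether the Controllers are already down can be read from the CONTROLLER group in the get cluster status output. A minimal Python sketch, assuming the output has been captured as text in the layout shown under Symptoms; group_statuses is a hypothetical helper and not part of any NSX-T tooling.

```python
def group_statuses(cli_output: str) -> dict:
    """Map each 'Group Type' to its 'Group Status' (hypothetical helper)."""
    statuses = {}
    current = None
    for line in cli_output.splitlines():
        line = line.strip()
        if line.startswith("Group Type:"):
            # Remember which group the next status line belongs to.
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Group Status:") and current is not None:
            statuses[current] = line.split(":", 1)[1].strip()
    return statuses


# Shortened copy of the output shown under Symptoms (member tables omitted).
sample = """\
Overall Status: DEGRADED

Group Type: DATASTORE
Group Status: STABLE

Group Type: CONTROLLER
Group Status: UNAVAILABLE
"""

statuses = group_statuses(sample)
print(statuses.get("CONTROLLER"))  # UNAVAILABLE
```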

To prevent the issue from occurring, please apply the following configuration change.

For Openshift environments:

1) Gain admin access to the Openshift cluster from a console.

2) Ensure the operator is running. This can be done in two ways:
A) oc get pods -n nsx-system-operator -> should return a running pod for nsx-ncp-operator
B) oc get co nsx-ncp -> should return the operator as "available"

3) Edit the operator configmap
oc edit cm -n nsx-system-operator nsx-ncp-operator-config

4) Find the [nsx_v3] section and set wait_for_security_policy_sync = False (this line will need to be added)

5) Save the config map (:wq)

6) Wait for the operator to update the config map and recreate the NCP pod
   oc get pods -n nsx-system

7) Confirm the config update was applied correctly
   oc get cm -n nsx-system nsx-ncp-config -o yaml
   It should now contain: wait_for_security_policy_sync = False

8) Confirm that the agent pods are in a "Running" state and have been recently restarted.
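
The check in step 7 can also be scripted. The following is a sketch only, assuming the NCP configuration text has already been extracted from the config map as INI-style text; how the text is stored inside the config map is not covered here, and policy_sync_disabled is a hypothetical helper.

```python
import configparser


def policy_sync_disabled(ini_text: str) -> bool:
    """True only if [nsx_v3] explicitly sets the flag to a false value."""
    # interpolation=None keeps any '%' characters in other values literal;
    # strict=False tolerates duplicate options instead of raising.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    option = "wait_for_security_policy_sync"
    return parser.has_option("nsx_v3", option) and not parser.getboolean(
        "nsx_v3", option
    )


# Minimal sample text; a real configuration contains many more options.
sample = """\
[nsx_v3]
wait_for_security_policy_sync = False
"""

print(policy_sync_disabled(sample))  # True
```

A missing flag is reported as not disabled, because step 4 requires the line to be added explicitly.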



For Vanilla Kubernetes environments:

1) kubectl edit cm -n nsx-system nsx-ncp-config -> set wait_for_security_policy_sync = False

2) Delete the NCP pods (there is no need to restart the agents).
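
The text change made in the editor can be sketched as follows. This is an illustration and not a supported tool: set_flag_false is a hypothetical helper, and placing the flag in the [nsx_v3] section is an assumption carried over from the Openshift steps above.

```python
FLAG = "wait_for_security_policy_sync"


def set_flag_false(ini_text: str, section: str = "nsx_v3") -> str:
    """Return ini_text with FLAG = False inside [section] (hypothetical helper)."""
    out = []
    in_section = False
    done = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving the target section without having seen the flag: add it.
            if in_section and not done:
                out.append(f"{FLAG} = False")
                done = True
            in_section = stripped == f"[{section}]"
        elif in_section and stripped.split("=", 1)[0].strip() == FLAG:
            # The flag is already present: overwrite its value.
            line = f"{FLAG} = False"
            done = True
        out.append(line)
    if in_section and not done:
        # The target section was the last one in the file.
        out.append(f"{FLAG} = False")
        done = True
    if not done:
        raise ValueError(f"section [{section}] not found")
    return "\n".join(out) + "\n"


before = "[nsx_v3]\nwait_for_security_policy_sync = True\n"
print(set_flag_false(before), end="")
```

Lines that are commented out are left untouched, so a commented flag results in a new active line being added to the section.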