VMware NSX-T Virtual Server/Pool Members DOWN when added to T0/T1 and nginx core file generated on an Edge Node.

Article ID: 322510

Updated On:

Products

VMware NSX

Issue/Introduction

  •  Load Balancer (LB) is reconfigured, and an nginx core file is generated. These core files can be found on the Edge Node, typically under /var/dump, with names similar to the following:
root@edge_name:/var/dump# ls -lh
total 454M
-rw-rw-rw- 1 root root 321M Jun 26 12:39 core.nginx.####.gz
-rw-rw-rw- 1 root root 321M Jun 26 12:37 core.nginx.####.gz
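A quick way to check whether core files are still being generated is to list the dump directory again and compare timestamps (a minimal sketch; it uses the same path and file-name pattern shown above):
root@edge_name:~# ls -lh /var/dump/ | grep core.nginx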
  • Pool members may report "Connect to Peer Failure" or "TCP Handshake Timeout".
  • In /var/log/syslog of the Edge Node you see log entries stating "all pool members are down":
2022-12-27T01:22:23.064227+00:00 <edge_name> NSX 6552 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG1200000"] [########-####-####-####-##########34] Operation.Category: 'LbEvent', Operation.Type: 'StatusChange', Obj.Type: 'Pool', Obj.UUID: '####9c89-########-####-####-##########95', Obj.Name: 'cluster:<name>', Lb.UUID: '########-####-####-####-##########34', Lb.Name: '<LB_LBname>', Vs.UUID: '########-####-####-####-##########f8', Vs.Name: '<name>', Status.NewStatus: 'Down', Status.Msg: 'all pool members are down'.

2022-12-27T01:22:23.064913+00:00 <edge_name> NSX 6552 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG9999999"] [########-####-####-####-##########34] Operation.Category: 'LbEvent', Operation.Type: 'StatusChange', Obj.Type: 'VirtualServer', Obj.UUID: '########-####-####-####-##########f8', Obj.Name: 'cluster:<name>', Lb.UUID: '########-####-####-####-##########34', Lb.Name: '<LB_LBname>', Status.NewStatus: 'Down', Status.Msg: 'all pool members are down'.
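To confirm these events on an Edge Node, the syslog can be searched for the status message (a simple sketch; the log path and message text are taken from the entries above):
root@edge_name:~# grep "all pool members are down" /var/log/syslog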
  • The LB CONF process for the LB instance is not running. This can be confirmed by following the steps below:
1. Execute the below command from the root CLI of the Edge Node. This requires the UUID of the LB.
#ps -ef | grep lb | grep nginx | grep <LB UUID>

example:

root@edge_name:~# ps -ef | grep lb | grep nginx | grep ########-####-####-####-##########a8
lb        9568  9481  0 Jun23 ?        00:00:00 /opt/vmware/nsx-edge/bin/nginx -u ########-####-####-####-##########a8 -g daemon off;


Note: Execute get load-balancer from the admin CLI of the active Edge Node to retrieve the LB UUID. In the above example the LB UUID is ########-####-####-####-##########a8.

2. Use the nginx process ID (9568 in the example above) in the following command to confirm that it has an LB CONF process running. If the command returns no output, the LB CONF process is not running and this issue has been encountered. (A scripted variant of both checks is sketched after the examples below.)

ps -ef | grep <nginx process ID> | grep CONF

example:

Impacted (the command returns no output):
root@edge02:~# ps -ef | grep 9568 | grep CONF

Not impacted (an LB CONF process is listed):
root@edge02:~# ps -ef | grep 9568 | grep CONF
lb        9572  9568  0 Jun23 ?        00:00:06 nginx: LB CONF process
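The two checks above can also be combined into a short scripted sketch (the variable name and echo message are illustrative; the underlying commands are the same as in steps 1 and 2):
root@edge_name:~# NGINX_PID=$(ps -ef | grep lb | grep nginx | grep <LB UUID> | awk '{print $2}')
root@edge_name:~# ps -ef | grep "$NGINX_PID" | grep CONF || echo "No LB CONF process found - possibly impacted"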

NOTE: The preceding log excerpts are only examples. Dates, times and environment-specific values may vary depending on your environment.

Environment

VMware NSX-T Data Center
VMware NSX

Cause

In the LB nginx implementation, a forked process acts as the LB config process only when it is forked with a specific index. In this issue, the LB config process is forked with the wrong index, so the LB config process cannot be regenerated. The connection between the LB and nestdb is then lost, and new configuration cannot be loaded into the datapath.

Resolution

This issue is resolved in the following NSX versions, available at Broadcom downloads:
VMware NSX-T Data Center 3.2.3 and later.
VMware NSX 4.1.1 and later.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.


Workaround:

Restart the Edge Node to fail over services to standby node.

OR

Navigate to Networking > Load Balancers in the NSX UI to gather the LB UUID. In the UI example, the ID ending in 14ea is the UUID (the full UUID is masked).

Restart the Docker container of this LB instance using the below commands, run from the root CLI of the Edge Node:
  1. #docker ps | grep <LB_UUID>
  2. #docker restart <CONTAINER ID>
example:
root@edge_name:~# docker ps | grep <LB_UUID>
126fa3da65e3   nsx-edge-lb:current           "/opt/vmware/edge/lb_<LB_UUID>"   2 days ago   Up 2 days             service_lb_<lb_uuid>

root@edge_name:~# docker restart 126fa3da65e3
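If preferred, the container lookup and restart can be combined into one command (a sketch; the CONTAINER ID is the first column of the docker ps output):
root@edge_name:~# docker restart "$(docker ps | grep <LB_UUID> | awk '{print $1}')"

After the restart, repeat the LB CONF process check from the Issue/Introduction section to confirm the process is running again.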