VMware NSX-T Virtual Server/Pool Members DOWN when added to T0/T1 and nginx core file generated on an Edge Node.

Article ID: 322510

Updated On:

Products

VMware NSX

Issue/Introduction

  •  Load Balancer (LB) is reconfigured, and an nginx core file is generated. These core files can be found on the Edge Node, typically under /var/dump, with names similar to the following:
root@edge_name:/var/dump# ls -lh
total 454M
-rw-rw-rw- 1 root root 321M Jun 26 12:39 core.nginx.####.gz
-rw-rw-rw- 1 root root 321M Jun 26 12:37 core.nginx.####.gz
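A quick way to check whether core files are still being generated is to list the dump directory again and compare timestamps (a minimal sketch; it uses the same path and file-name pattern shown above):
root@edge_name:~# ls -lh /var/dump/ | grep core.nginx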
  • Pool members may report "Connect to Peer Failure" or "TCP Handshake Timeout".
  • In /var/log/syslog of the Edge Node you see log entries stating "all pool members are down":
2022-12-27T01:22:23.064227+00:00 <edge_name> NSX 6552 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG1200000"] [########-####-####-####-##########34] Operation.Category: 'LbEvent', Operation.Type: 'StatusChange', Obj.Type: 'Pool', Obj.UUID: '####9c89-########-####-####-##########95', Obj.Name: 'cluster:<name>', Lb.UUID: '########-####-####-####-##########34', Lb.Name: '<LB_LBname>', Vs.UUID: '########-####-####-####-##########f8', Vs.Name: '<name>', Status.NewStatus: 'Down', Status.Msg: 'all pool members are down'.

2022-12-27T01:22:23.064913+00:00 <edge_name> NSX 6552 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG9999999"] [########-####-####-####-##########34] Operation.Category: 'LbEvent', Operation.Type: 'StatusChange', Obj.Type: 'VirtualServer', Obj.UUID: '########-####-####-####-##########f8', Obj.Name: 'cluster:<name>', Lb.UUID: '########-####-####-####-##########34', Lb.Name: '<LB_LBname>', Status.NewStatus: 'Down', Status.Msg: 'all pool members are down'.
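To confirm these events on an Edge Node, the syslog can be searched for the status message (a simple sketch; the log path and message text are taken from the entries above):
root@edge_name:~# grep "all pool members are down" /var/log/syslog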
  • The LB CONF process for the LB instance is not running. This can be confirmed by following the steps below:
1. Execute the below command from the root CLI of the Edge Node. This requires the UUID of the LB.
#ps -ef | grep lb | grep nginx | grep <LB UUID>

example:

root@edge_name:~# ps -ef | grep lb | grep nginx | grep ########-####-####-####-##########a8
lb        9568  9481  0 Jun23 ?        00:00:00 /opt/vmware/nsx-edge/bin/nginx -u ########-####-####-####-##########a8 -g daemon off;


Note: Execute get load-balancer from the admin CLI of the active Edge Node to retrieve the LB UUID. In the above example the LB UUID is ########-####-####-####-##########a8.

2. Use the nginx process ID (9568 in the example above) in the following command to confirm that it has an LB CONF process running. If the command returns no output, the LB CONF process is not running and this issue has been encountered. (A scripted variant of both checks is sketched after the examples below.)

ps -ef | grep <nginx process ID> | grep CONF

example:

Impacted (the command returns no output):
root@edge02:~# ps -ef | grep 9568 | grep CONF

Not impacted (an LB CONF process is listed):
root@edge02:~# ps -ef | grep 9568 | grep CONF
lb        9572  9568  0 Jun23 ?        00:00:06 nginx: LB CONF process
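The two checks above can also be combined into a short scripted sketch (the variable name and echo message are illustrative; the underlying commands are the same as in steps 1 and 2):
root@edge_name:~# NGINX_PID=$(ps -ef | grep lb | grep nginx | grep <LB UUID> | awk '{print $2}')
root@edge_name:~# ps -ef | grep "$NGINX_PID" | grep CONF || echo "No LB CONF process found - possibly impacted"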

NOTE: The preceding log excerpts are only examples. Dates, times and environment-specific values may vary depending on your environment.

Environment

VMware NSX-T Data Center
VMware NSX

Cause

In the LB nginx implementation, a forked process acts as the LB config process only when it is forked with a specific index. In this issue, the LB config process is forked with the wrong index, so the LB config process cannot be regenerated. The connection between the LB and nestdb is then lost, and new configuration cannot be loaded into the datapath.

Resolution

This issue is resolved in the following NSX versions, available at Broadcom downloads:
VMware NSX-T Data Center 3.2.3 and later.
VMware NSX 4.1.1 and later.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.


Workaround:

Restart the Edge Node to fail over services to standby node.

OR

Navigate to Networking > Load Balancers in the NSX UI to gather the LB UUID. In the UI example, the ID ending in 14ea is the UUID (the full UUID is masked).

Restart the Docker container of this LB instance using the below commands, run from the root CLI of the Edge Node:
  1. #docker ps | grep <LB_UUID>
  2. #docker restart <CONTAINER ID>
example:
root@edge_name:~# docker ps | grep <LB_UUID>
126fa3da65e3   nsx-edge-lb:current           "/opt/vmware/edge/lb_<LB_UUID>"   2 days ago   Up 2 days             service_lb_<lb_uuid>

root@edge_name:~# docker restart 126fa3da65e3
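If preferred, the container lookup and restart can be combined into one command (a sketch; the CONTAINER ID is the first column of the docker ps output):
root@edge_name:~# docker restart "$(docker ps | grep <LB_UUID> | awk '{print $1}')"

After the restart, repeat the LB CONF process check from the Issue/Introduction section to confirm the process is running again.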