PKS showing load balancer server pools down in NSX-T



Article ID: 345588







·     Multiple load balancing services stop working. The NSX-T UI dashboard shows several server pools as down. Checking the affected load balancer pools from the NSX-T Edge shows no pool members present:
EdgeNode01> get load-balancer 4dba9dbf-d594-4d1f-a6a1-f29434a9ac47 pool af0d0010-8335-4fa0-b14d-387cab5bedf5 status
UUID                        : af0d0010-8335-4fa0-b14d-387cab5bedf5
Display-Name                : pks-ce39d2c7-f7d7-4867-a826-0948ad755393-kube-syste-nginx-ingress-controller-443
Status                      : down
Total-Members               : 0
Primary Up                  : 0
Primary Down                : 0
Primary Disabled            : 0
Primary Graceful Disabled   : 0
Backup Up                   : 0
Backup Down                 : 0
Backup Graceful Disabled    : 0
Backup Disabled             : 0

·     The ncp.stdout.log shows a message containing the load balancer pool ID, where the pool was updated with an empty member array:
1 2020-04-17T13:27:10.482Z 329d9ac7-8c36-4665-9a84-446190f8eeb7 NSX 32065 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"]
nsx_ujo.ncp.nsx.manager.base_k8s_nsxapi Updated LB pool af0d0010-8335-4fa0-b14d-387cab5bedf5 with members []
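A quick way to confirm this symptom is to search the NCP log for pool updates that carried an empty member list. This is a minimal sketch; the log path below is the usual PKS location on a master node and may differ in your environment:

```shell
# Look for load balancer pool updates where NCP applied an empty member list.
# NCP_LOG is an assumption; adjust the path for your environment.
NCP_LOG=${NCP_LOG:-/var/vcap/sys/log/ncp/ncp.stdout.log}

grep 'Updated LB pool' "$NCP_LOG" 2>/dev/null | grep 'with members \[\]' \
  || echo "no empty-member pool updates found in $NCP_LOG"
```

Any line returned by the first branch identifies a pool ID that was synced with no members and will show as down.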


VMware PKS 1.x


This is caused by a very rare race condition in the NCP service. When the master nodes are undersized and placed under extreme load, the Kubernetes API can take several seconds to respond and the NCP session times out. When this happens, the NCP service ends the session and restarts to obtain a new one. During startup, NCP checks the endpoints and syncs the load balancers. However, when the Kubernetes API is extremely slow, the pod store may not be populated before the load balancer sync runs. NCP then finds no matches in the pod store for the endpoints, resulting in empty pool members. A load balancer pool with no pool members reports its status as down.


This race condition will be addressed in a future version of NCP. The release notes will have more information once a fix is available.


Force NCP to resync the load balancer. This can be done by making a configuration change to an ingress controller, such as adding a fake label. Alternatively, restart the NCP service on the master nodes. For example:

bosh -d <deployment-name> ssh master/0 "sudo monit restart ncp"
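The label approach mentioned above can be sketched with kubectl. The ingress name, namespace, and label key here are hypothetical; substitute your own:

```shell
# Hypothetical ingress name and namespace; substitute your own values.
INGRESS=nginx-ingress-controller
NAMESPACE=kube-system

# Skip quietly when kubectl is not available (e.g. run off-cluster).
command -v kubectl >/dev/null 2>&1 || exit 0

# Add a throwaway label so NCP re-processes the ingress and resyncs
# its load balancer pools...
kubectl label ingress "$INGRESS" -n "$NAMESPACE" ncp-resync=true --overwrite
# ...then remove it again (a trailing '-' deletes the label).
kubectl label ingress "$INGRESS" -n "$NAMESPACE" ncp-resync-
```

The label itself is meaningless; the configuration change is what triggers NCP to re-evaluate the ingress.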


Additional troubleshooting is needed to determine the cause of the Kubernetes API slowness; preventing the slowness prevents the race condition. Performance is often poor because the master nodes are too small, which can be fixed by increasing master node CPU and memory by changing the vm_type in the PKS plan. Contact support for further assistance.