NSX Edge Load Balancer temporarily returns 503 after pool member change

Article ID: 303188


Products

VMware NSX

Issue/Introduction

Symptoms:

  • Connecting to the NSX Edge Loadbalancer fails with HTTP Error 503 after a number of pool member changes.
  • The NSX Edge Loadbalancer was previously configured to use Monitor Service for health monitoring and currently uses BUILT-IN.
  • To identify which health monitor is used, run "show service loadbalancer virtual" in the NSX Edge CLI.

    Example for Monitor Service:

    +->POOL MEMBER: poolX/memberX, STATUS: DOWN
    | | HEALTH MONITOR = MONITOR SERVICE, monitorX:CRITICAL
    | | | LAST STATE CHANGE: <DATE> <time>
    | | | LAST CHECK: <DATE> <time>
    | | | FAILURE DETAIL: PING CRITICAL - Packet loss = 100%
    | | SESSION (cur, cps, total) = (0, 0, 0)
    | | BYTES in = (0), out = (0)

    Example for BUILT-IN:

    +->POOL MEMBER: poolX/memberX, STATUS: UP
    | | HEALTH MONITOR = BUILT-IN, monitorX:L7OK
    | | | LAST STATE CHANGE: <DATE> <time>
    | | SESSION (cur, max, total) = (0, 0, 0)
    | | BYTES in = (0), out = (0)
  • You see entries similar to the following in the NSX Edge logs when you cannot access backend servers through the Loadbalancer:

loadbalancer[<PID>]: [LB]: [local0.info] XXX.XXX.XXX.XXX - - [<DATE>:<time>] "GET / HTTP/1.1" 503 XXX "" "" XXXXX XXX "X" "X" "<NOSRV>" 0 -1 -1 -1 0 SC-- 0 0 0 0 0 0 0 ""

  • You see entries similar to the following in the NSX Edge logs when you configure loadbalancer pool members:

    config: [daemon.warning] WARN :: C_UTILS :: File /var/db/networkmonitor/monitor_retention.dat not exist
    loadbalancer[<PID>]: [LB]: [local0.alert] Server poolXX/member1 is DOWN, changed from CLI. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    loadbalancer[<PID>]: [LB]: [local0.alert] Server poolXX/member2 is DOWN, changed from CLI. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    loadbalancer[<PID>]: [LB]: [local0.alert] Server poolXX/member3 is DOWN, changed from CLI. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    loadbalancer[<PID>]: [LB]: [local0.emerg] backend poolXX has no server available!
    config: [daemon.info] INFO :: CONFIG_MGR :: update ipvs...
    config: [daemon.info] INFO :: C_IPVS :: IPVS: stop connection sync-up daemon
    config: [daemon.info] INFO :: CONFIG_MGR :: update nagios...
    config: [daemon.info] INFO :: C_ServiceControl :: update nagios to down
    config: [daemon.info] INFO :: CONFIG_MGR :: --------------- Collecting the configurator output ---------------
    config: [daemon.info] INFO :: Utils :: saved data to /var/db/vmware/vshield/vse_one/resource_save.psf
    config: [daemon.info] INFO :: Utils :: saved data to /var/db/vmware/vshield/vse_one/config_save.psf
    config: [daemon.info] INFO :: vse_configure :: update success
    config: [daemon.info] INFO :: Utils :: ha: UpdateHaResourceFlags:

Environment

NSX for vSphere 6.3.x

NSX for vSphere 6.4.x

Cause

This issue occurs because old health status reported by the Monitor Service is incorrectly loaded after the pool member change.

Resolution

To work around this issue, configure BUILT-IN health monitoring with id and name pairs for the pool, members, and monitor that are all different from those previously used with the Monitor Service.

Using the Web Client, to create the NSX Edge Loadbalancer objects with new id and name pairs, navigate to Networking & Security > NSX Edges > Manage > Load Balancer.

  • For the monitor, navigate to Service Monitoring.
  • For the pool and its members, navigate to Pools.

Using the REST API, change the (poolId, name), (memberId, name), and (monitorId, name) pairs. A sketch of this approach follows below.
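
As an illustration only, the following Python sketch recreates the monitor and pool under new names through the NSX for vSphere REST API so that new ids are generated. The NSX Manager address, credentials, edge id, and the XML element values are assumptions, not values from this article; verify the endpoint paths and payload schema against the NSX for vSphere API Guide for your version.

    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"   # assumption: NSX Manager address
    EDGE_ID = "edge-1"                                 # assumption: Edge identifier
    AUTH = ("admin", "password")                       # assumption: API credentials
    HEADERS = {"Content-Type": "application/xml"}

    # Create a BUILT-IN monitor under a new name so a new monitorId is generated.
    monitor_xml = """
    <monitor>
      <name>http-monitor-new</name>
      <type>http</type>
      <interval>5</interval>
      <timeout>15</timeout>
      <maxRetries>3</maxRetries>
      <method>GET</method>
      <url>/</url>
    </monitor>
    """
    r = requests.post(
        f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/loadbalancer/config/monitors",
        data=monitor_xml, headers=HEADERS, auth=AUTH, verify=False)
    r.raise_for_status()

    # Create the pool and its members under new names, referencing the new monitor,
    # so that none of the (poolId, name), (memberId, name), (monitorId, name) pairs
    # match the ones previously used with the Monitor Service.
    pool_xml = """
    <pool>
      <name>web-pool-new</name>
      <algorithm>round-robin</algorithm>
      <monitorId>monitor-2</monitorId>  <!-- assumption: id returned for the new monitor -->
      <member>
        <name>web-01-new</name>
        <ipAddress>10.0.0.11</ipAddress>
        <port>80</port>
      </member>
    </pool>
    """
    r = requests.post(
        f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/loadbalancer/config/pools",
        data=pool_xml, headers=HEADERS, auth=AUTH, verify=False)
    r.raise_for_status()

The ids that NSX Manager assigns to the new objects can then be confirmed with a GET on the same endpoints before updating the virtual server to reference the new pool.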