NSX-T LB Pool Members reported down with 'Wrong HTTP Status Line' when using custom monitor

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

NSX-T version 2.5.1.
The NSX-T Load Balancer is using a custom health monitor, with a custom response body similar to the following example:

Monitor definition:
"monitor_port": "9020",
"display_name": "https-health-monitor",
"interval": 15,
"rise_count": 3,
"timeout": 15,
"http_monitor": {

"response_body": "<Name>MAINTENANCE_MODE</Name><Status>OFF</Status>",
"request_url": "/?ping",
"request_method": "HTTP_METHOD_GET",

"response_code": [
"200"
],
"request_version": "HTTP_VERSION_1_1"
},
"fall_count": 3,
"type": "HTTP",
"id": {
"right": 10976970400663106615,
"left": 11626635069033957738

Running the command 'get load-balancer <LB-UUID> monitor <HEALTH-MONITOR-UUID> status' shows the 'FAIL_REASON' as 'Wrong HTTP Status Line'.
Packet captures performed on the Load Balancer internal loopback routing port show that packets are being sent to and received from the pool members. The responses to the HTTP GET requests carry the expected response code '200'.
Pool Members report as up if you switch the health monitor to the default 'nsx-default-tcp-monitor' or a custom monitor with no 'HTTP Response Body' defined.
In syslog you see messages similar to the following:

NSX 864 - [nsx@6876 comp="nsx-edge" subcomp="agg-service" tid="1467" level="INFO"] ExecCmd call output: {"lbs": [{"cpu_usage": "0", "display_name": "T1-LB-01", "enabled": true, "mem_usage": "2", "pool_num": "32", "pool_up_num": "16", "pools": [{"backup_disabled": "0", "backup_down": "0", "backup_graceful_disabled": "0", "backup_unknown": "0", "backup_unused": "0", "backup_up": "0", "display_name": "trf-ecs-1-w2-9020", "member_num": "5", "members": [{"display_name": "", "failure_code": "24400", "failure_reason": "Wrong HTTP Status Line", "ip": "10.1.1.1"

Environment

VMware NSX-T Data Center 2.5.x
VMware NSX-T Data Center

Cause

The HTTP health check fails when the HTTP response is carried in multiple TCP segments. If the response is transferred in several TCP segments, the LB can parse only the first segment and cannot parse the content in the segments that follow.

If you enable debug logging on the Load Balancer, the LB error.log shows entries similar to the following. The first packet contains no "\r\n\r\n", so no HTTP body is found in it. The second packet contains no HTTP status line.

2021/01/27 07:31:04 [debug] 21863#0: epoll timer: 995
2021/01/27 07:31:04 [debug] 21863#0: epoll: fd:83 ev:0001 d:000003A160FEA810
2021/01/27 07:31:04 [debug] 21863#0: http check recv.
2021/01/27 07:31:04 [debug] 21863#0: recv: eof:0, avail:1
2021/01/27 07:31:04 [debug] 21863#0: recv: fd:83 176 of 2048 <<<<< first packet
2021/01/27 07:31:04 [debug] 21863#0: http check recv size: 176, peer: 10.1.1.1:9020
2021/01/27 07:31:04 [debug] 21863#0: recv: eof:0, avail:0
2021/01/27 07:31:04 [debug] 21863#0: http check recv size: -2, peer: 10.1.1.1:9020 (11: Resource temporarily unavailable)
2021/01/27 07:31:04 [debug] 21863#0: shmtx lock
2021/01/27 07:31:04 [debug] 21863#0: shmtx unlock
2021/01/27 07:31:04 [debug] 21863#0: hc http parse: rcvd response status 200 from server 10.249.0.38:9020(pool LB15f96656-6f4c-4cbe-af5b-c3100f7c40de), expected http status code: 2xx - 1 0, 3xx - 0 0, 4xx - 0 0, 5xx - 0 0
2021/01/27 07:31:04 [debug] 21863#0: get http body offset, http response len: 159
2021/01/27 07:31:04 [debug] 21863#0: not found \r\n\r\n <<<< there is no http body
2021/01/27 07:31:04 [debug] 21863#0: get http body offset,p: 0000000003F8BCAC

2021/01/27 07:31:04 [debug] 21863#0: http check upstream recv(): -1, fd: 83 (11: Resource temporarily unavailable)
2021/01/27 07:31:04 [info] 21863#0: expected string <Name>MAINTENANCE_MODE</Name><Status>OFF</Status> not found with peer: 10.249.0.38:9020, rc: -2
2021/01/27 07:31:04 [debug] 21863#0: http_parse: expect parse result: -2
2021/01/27 07:31:04 [debug] 21863#0: http check parse rc: -2, peer: 10.1.1.1:9020
2021/01/27 07:31:04 [debug] 21863#0: timer delta: 1
2021/01/27 07:31:04 [debug] 21863#0: worker cycle
2021/01/27 07:31:04 [debug] 21863#0: epoll timer: 994
2021/01/27 07:31:04 [debug] 21863#0: epoll: fd:83 ev:0001 d:000003A160FEA810
2021/01/27 07:31:04 [debug] 21863#0: http check recv.
2021/01/27 07:31:04 [debug] 21863#0: recv: eof:0, avail:1
2021/01/27 07:31:04 [debug] 21863#0: recv: fd:83 290 of 1872 <<<<<< second packet
2021/01/27 07:31:04 [debug] 21863#0: http check recv size: 290, peer: 10.1.1.1:9020
2021/01/27 07:31:04 [debug] 21863#0: recv: eof:0, avail:0
2021/01/27 07:31:04 [debug] 21863#0: http check recv size: -2, peer: 10.1.1.1:9020 (11: Resource temporarily unavailable)
2021/01/27 07:31:04 [info] 21863#0: http parse status line error with peer: 10.1.1.1:9020 <<<<< there is no HTTP status line
2021/01/27 07:31:04 [debug] 21863#0: http check parse rc: 14, peer: 10.1.1.1:9020
2021/01/27 07:31:04 [info] 21863#0: check protocol http error with peer: 10.1.1.1:9020, status code: 200

A packet capture from the LB interface shows that the HTTP 200 OK response is reassembled from two or more TCP segments.
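
The failure mode described above can be illustrated with a short Python sketch. It is not the Load Balancer's actual implementation: the function names and the "body not found" label are hypothetical, and the split point between the two segments is an assumption chosen to match the log (no "\r\n\r\n" in the first segment, no status line in the second).

```python
EXPECTED = b"<Name>MAINTENANCE_MODE</Name><Status>OFF</Status>"


def check_segment(data, expected=EXPECTED):
    """Treat the given bytes as if they were a complete HTTP response."""
    if not data.startswith(b"HTTP/"):
        # No status line at the start of the data.
        return "Wrong HTTP Status Line"
    _head, sep, body = data.partition(b"\r\n\r\n")
    if not sep or expected not in body:
        # Header terminator missing, or expected string absent from the body.
        return "body not found"
    return "ok"


def check_buffered(segments, expected=EXPECTED):
    """Join all segments first, then run the same check on the whole response."""
    return check_segment(b"".join(segments), expected)


# A response split across two TCP segments, partway through the headers.
first = b"HTTP/1.1 200 OK\r\nContent-Type: application/xml\r\n"
second = b"Content-Length: 49\r\n\r\n" + EXPECTED

print(check_segment(first))             # body not found
print(check_segment(second))            # Wrong HTTP Status Line
print(check_buffered([first, second]))  # ok
```

Checking each segment on its own reproduces both errors seen in the debug log, while checking the reassembled response succeeds.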

Resolution

This issue is resolved in NSX-T 2.5.2.

Workaround:
Pool members report as up if you switch the health monitor to the default 'nsx-default-tcp-monitor' or to a custom monitor with no 'HTTP Response Body' defined.