Segmentation Fault in Edge LB Process Causes Service Interruption

Products

VMware NSX

Issue/Introduction

The Load Balancer process (Nginx) running on an NSX Edge crashes unexpectedly with a segmentation fault. Log entries similar to the following are recorded on the Edge node:

YYYY-MM-DDTHH:MM:SS.NNNZZ ##### kernel - - - [########.######] traps: nginx[#######] general protection fault ip:<ADDR> sp:<ADDR> error:0 in nginx[<ADDR>+<OFFSET>]
YYYY-MM-DDTHH:MM:SS.NNNZZ ##### kernel - - - [########.######] grsec: From ###.###.###.###: Segmentation fault occurred at 0000000000000000 in /opt/vmware/nsx-edge/bin/nginx[nginx:#######] uid/euid:###/### gid/egid:###/###, parent /opt/vmware/nsx-edge/bin/nginx[nginx:#######] uid/euid:###/### gid/egid:###/###
YYYY-MM-DDTHH:MM:SS.NNNZZ ##### NSX ####### - [nsx@#### comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.nginx.##########.#######.###.##.gz

The crash interrupts all traffic flowing through the affected Load Balancer, and connection reset errors similar to the following are logged:

YYYY-MM-DDTHH:MM:SS.NNNZZ ##### NSX ####### LOAD-BALANCER [nsx@#### comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG9999999"] [<LB-UUID>] recv() failed (104: Connection reset by peer) while proxying and reading from upstream, client: ###.###.###.###, server: ###.###.###.###:###, upstream: "###.###.###.###:###", bytes from/to client:####/####, bytes from/to upstream:#####/####
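
The signatures in the log excerpts above can be checked for programmatically. The following is a minimal sketch, not a VMware-provided tool: the regular expressions are derived only from the sample lines in this article, the function name `summarize` is hypothetical, and actual log wording may vary between NSX versions.

```python
import re

# Patterns inferred from the sample log lines in this article (assumption:
# real logs use the same wording; adjust if your version differs).
SEGFAULT = re.compile(r"Segmentation fault occurred at \S+ in \S*nginx\[")
CORE_FILE = re.compile(r"Core file generated: (\S*core\.nginx\.\S+)")
UPSTREAM_RESET = re.compile(r"recv\(\) failed \(104: Connection reset by peer\)")


def summarize(lines):
    """Count crash-related signatures in an iterable of log lines."""
    summary = {"segfaults": 0, "core_files": [], "upstream_resets": 0}
    for line in lines:
        if SEGFAULT.search(line):
            summary["segfaults"] += 1
        core = CORE_FILE.search(line)
        if core:
            summary["core_files"].append(core.group(1))
        if UPSTREAM_RESET.search(line):
            summary["upstream_resets"] += 1
    return summary
```

A nginx segmentation fault together with a `core.nginx` core file and a burst of upstream connection resets is consistent with the symptoms described here, but these signatures alone do not confirm the root cause.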

Environment

VMware NSX

Cause

This issue occurs when all four of the following configuration conditions are met simultaneously on the Load Balancer, which leads to memory corruption during connection cleanup:

LB statistics functionality is enabled (default setting).
SNAT Translation Mode is set to Deactivated in the server pool, disabling SNAT.
Transport phase rules are configured on the virtual server.
Server keepalive is enabled in the application profile.
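
The four conditions above combine with a logical AND. The following is a minimal sketch of that check, assuming the relevant settings have already been collected by hand or by script. The `LbConfig` class and its field names are hypothetical and do not correspond to NSX API attributes.

```python
from dataclasses import dataclass


@dataclass
class LbConfig:
    # Hypothetical field names for illustration only; not NSX API attributes.
    statistics_enabled: bool = True          # LB statistics (enabled by default)
    snat_enabled: bool = True                # False when SNAT Translation Mode is Deactivated
    has_transport_phase_rules: bool = False  # transport phase rules on the virtual server
    server_keepalive_enabled: bool = False   # server keepalive in the application profile


def is_exposed(cfg: LbConfig) -> bool:
    """Return True only when all four trigger conditions hold at once."""
    return (
        cfg.statistics_enabled
        and not cfg.snat_enabled
        and cfg.has_transport_phase_rules
        and cfg.server_keepalive_enabled
    )


# Example: a configuration matching every condition listed above.
risky = LbConfig(snat_enabled=False,
                 has_transport_phase_rules=True,
                 server_keepalive_enabled=True)
print(is_exposed(risky))
```

Because every condition must hold, changing any single one of them makes the check return False.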

Resolution

This issue is fixed in NSX 3.2.4, 4.2.0, and later.

To prevent crashes on versions without the fix, apply one of the following workarounds. Each one removes one of the conditions required to trigger the crash:

Enable SNAT in the pool configuration.
OR, disable server keepalive in the application profile.
OR, remove the transport phase rule configuration.