NSX Manager VIP Inaccessible Caused by Envoy Proxy Hang

Article ID: 423209

Products

VMware NSX

Issue/Introduction

The NSX Manager VIP becomes unresponsive, and the web browser displays the error "This site can’t be reached" with the reason "ERR_CONNECTION_REFUSED" or "ERR_CONNECTION_TIMED_OUT".

Attempting to log in directly to the NSX Manager node that owns the VIP also fails with the same browser error, indicating that the node itself is unresponsive.

Logging to the envoy_access_log under /var/log/proxy/ has stopped, and no incoming API requests to the NSX Manager node are being recorded.
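One way to confirm that logging has stalled is to check how long ago the access log was last modified. The following is a minimal sketch, not an official procedure: the helper name log_age_seconds is hypothetical, the exact log file name under /var/log/proxy/ is an assumption, and it relies on GNU stat being available on the node.

```shell
#!/bin/sh
# Print the number of seconds since the given file was last modified.
# Assumes GNU coreutils `stat` (supports -c %Y for mtime as epoch seconds).
log_age_seconds() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}

# Usage on the node (as root); the file name below is an assumption:
#   log_age_seconds /var/log/proxy/envoy_access_log
```

A large value while clients are actively trying to reach the node is consistent with the symptom described above.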

Running the following netstat command as root on the NSX Manager node shows an unusually high number of connections on port 443 in the CLOSE_WAIT state.

netstat -ano | grep ":443" | wc -l
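The command above counts every line that mentions ":443", regardless of connection state. To isolate the CLOSE_WAIT connections, the output can be filtered by state. The following is a sketch under stated assumptions: the helper name count_close_wait_443 is hypothetical, and it assumes the Linux netstat column layout (local address in column 4, state in column 6) and counts only sockets whose local port is 443.

```shell
#!/bin/sh
# Count sockets with local port 443 in CLOSE_WAIT, reading netstat-style
# output on stdin. Assumes Linux netstat columns:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State ...
count_close_wait_443() {
    awk '$4 ~ /:443$/ && $6 == "CLOSE_WAIT" { n++ } END { print n + 0 }'
}

# Usage on the node (as root):
#   netstat -ant | count_close_wait_443
```

Matching on the local address avoids counting outbound connections from the node to port 443 on other hosts, which the plain grep would include.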

Environment

VMware NSX

Cause

This issue occurs because the Envoy proxy service on the NSX Manager node that owns the VIP becomes unresponsive and is unable to process incoming API requests. As a result, the node stops generating API access logs, and no new entries are written to the envoy_access_log.

Resolution

This issue is currently under investigation to determine the root cause. To assist with analysis, it is necessary to collect the NSX Manager support bundle and an Envoy service thread dump from the NSX Manager node while the problem is occurring.

Please follow the procedures outlined in KB 142884 and contact Broadcom Support for guidance in gathering the required diagnostic information.

Once data collection is complete, perform one of the following remediation steps on the affected NSX Manager node to restart the Envoy service:

  1. With root access to the NSX Manager node, run:

    systemctl restart envoy

  2. With admin access to the NSX Manager node, run the following from the NSX CLI:

    restart service http

Alternatively, a rolling reboot of the NSX Manager cluster can be performed to restore functionality.
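After restarting the service as root, it can be useful to confirm that systemd reports the unit as active before retesting the VIP. The following is a minimal sketch: the helper name unit_is_active is hypothetical, and the unit name envoy is taken from the restart command above.

```shell
#!/bin/sh
# Return 0 if systemd reports the given unit as active, non-zero otherwise.
unit_is_active() {
    systemctl is-active --quiet "$1"
}

# Usage on the node (as root), after `systemctl restart envoy`:
#   unit_is_active envoy && echo "envoy is active"
```

An active unit only shows that the process is running. Access to the VIP from a browser or API client should still be verified separately.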