ESXi hosts disconnect from vCenter when management-interface scanning coincides with high NSX DFW logging

Products

VMware vSphere ESXi

Issue/Introduction

You see ESXi hosts intermittently drop to a "not responding" or "disconnected" state in vCenter Server, then recover on their own or after a reboot.
The host management agent (hostd) becomes unresponsive while it is still running. In /var/run/log/hostd-probe.log, you see watchdog events similar to:
```
hostd detected to be non-responsive
```

In /var/run/log/vmkernel.log, you see Distributed Firewall log-rate high-water-mark messages similar to:

```
VSIP DFW: Log request HWM during 1800 sec period = NNNNN LPS. Rate limit = 10000 LPS. Logged = NNNNNNN. Dropped = NNNNNN.
```

In the syslog journal, you see messages showing the syslog daemon (vmsyslogd) shedding load, similar to:
```
Dropping messages due to log stress (qsize = 25000)
```
vSphere HA may fence the affected host and fail over its virtual machines.
The disconnects line up in time with scheduled external vulnerability scans of the ESXi management interface, even though the high firewall logging is present continuously.
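To check whether the disconnects really coincide with scanning, compare the watchdog event times against your scanner's schedule. The following is a minimal sketch, not a supported tool: the function name `events_in_windows` and all timestamps are hypothetical, and it assumes you have already extracted the event times from hostd-probe.log and the scan windows from your scanner.

```python
from datetime import datetime

def events_in_windows(events, windows):
    """Return the events that fall inside any (start, end) scan window."""
    return [
        event for event in events
        if any(start <= event <= end for start, end in windows)
    ]

# Hypothetical sample data: one nightly scan window and three watchdog events.
scan_windows = [(datetime(2025, 1, 6, 2, 0), datetime(2025, 1, 6, 3, 0))]
watchdog_events = [
    datetime(2025, 1, 6, 2, 17),
    datetime(2025, 1, 6, 2, 41),
    datetime(2025, 1, 6, 14, 5),
]

matched = events_in_windows(watchdog_events, scan_windows)
print(f"{len(matched)} of {len(watchdog_events)} watchdog events fall inside a scan window")
```

If most watchdog events fall inside scan windows while the DFW high-water-mark messages appear throughout the day, that pattern is consistent with the combined-load cause described in this article.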

Additional symptoms reported:

The ESXi host is reported as unresponsive.
The host stops reporting statistics to vCenter.
The host reboots on its own.

Environment

VMware ESXi 8.0 Update 3
VMware vCenter Server 8.0
VMware NSX 4.2.x with Distributed Firewall packet logging enabled on a Drop rule
ESXi management (vmkernel) interfaces reachable by an external network vulnerability scanner (for example, Tenable Nessus or similar)

Cause

Two independent sources of load arrive at the same ESXi host management plane at the same time. Either source on its own is usually survivable. The combination is what exhausts the host.

This issue occurs when all of the following conditions are met:

A high-volume NSX DFW packet-logging stream is running on the host. A Drop rule with logging enabled is dropping a continuous high-rate traffic flow, frequently traffic destined for a virtual service whose backend pool is down, and each dropped packet produces a log line. This can reach tens of thousands of log lines per second and is a heavy but, on its own, survivable load on the ESXi syslog daemon (vmsyslogd). The host's syslog queue stays drained as long as nothing else competes for the same resources.
An external vulnerability scan reaches the host's management (vmkernel) interface at the same time. Scanner traffic to the vmkernel interface is not evaluated by the Distributed Firewall, because DFW applies only at the virtual-machine vNIC level. The scan therefore reaches the host directly. It consumes concurrent management-connection slots on the host reverse proxy (which has a fixed connection cap), CPU for TLS handshakes, and hostd worker threads for the API calls the scan issues.
The host has limited spare CPU headroom. Higher vCPU oversubscription reduces the margin available to absorb a transient spike, lowering the threshold at which the combined load tips the host over.

When the scan load lands on top of the logging load, the combined demand exceeds what vmsyslogd can drain and what the hostd worker pool can service. The hostd shared logger backend blocks, hostd worker threads stall on their own logging calls, and the hostd watchdog reports the host as non-responsive even though hostd has not crashed. vCenter records the host as disconnected. If the condition persists, vSphere HA fails over the host's virtual machines.
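The tipping-point behavior can be illustrated with simple queue arithmetic. This is a toy model under stated assumptions, not a description of vmsyslogd internals: the drain rates are hypothetical, the idea that scan load reduces the effective drain rate is a simplification, and only the queue size (25000) and the rate limit (10000 LPS) come from the log messages shown earlier.

```python
def seconds_until_queue_full(arrival_lps, drain_lps, queue_size=25000):
    """Return seconds until a bounded queue fills, or None if it never fills."""
    backlog_rate = arrival_lps - drain_lps
    if backlog_rate <= 0:
        # The daemon drains at least as fast as messages arrive.
        return None
    return queue_size / backlog_rate

# Hypothetical figures: logging arrives at the 10000 LPS rate limit.
logging_only = seconds_until_queue_full(arrival_lps=10000, drain_lps=12000)
# Assumption: contention from the scan lowers the effective drain rate.
logging_plus_scan = seconds_until_queue_full(arrival_lps=10000, drain_lps=7500)

print("logging only:", logging_only)
print("logging plus scan:", logging_plus_scan, "seconds until the queue is full")
```

Under these assumed numbers the queue never fills with logging alone, but fills in seconds once the drain rate falls below the arrival rate. This is why a steady logging load can appear harmless until a scan starts.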

The underlying limitation, that high-rate DFW packet logging can overwhelm the ESXi syslog daemon, is a known issue tracked internally. An improved syslog architecture is expected in a later ESXi release.

Resolution

Reduce the two loads so they no longer combine to exhaust the host. Applying both controls (lower scan pressure and lower DFW logging volume) gives the most resilient result; either one on its own substantially reduces the disconnect risk.

Reduce the management-interface scan pressure in your vulnerability scanner's policy configuration:
1. Limit the scanner to one scan at a time per ESXi host, so multiple scanner threads or appliances are not scanning the same host in parallel.
2. Stagger scans across the hosts in a cluster rather than scanning many hosts at once.
3. Schedule scans during the cluster's low-activity windows.
4. Confirm the actual per-host scan cadence matches your intended policy, because parallel or more-frequent-than-expected scanning multiplies the load.
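The staggering described in the steps above can be sketched as a simple slot assignment. This is an illustrative planning aid only: the host names, window start, scan duration, and gap are hypothetical, and the resulting times would still need to be entered in your scanner's own scheduling configuration.

```python
from datetime import datetime, timedelta

def stagger_scans(hosts, window_start, scan_duration, gap):
    """Assign each host its own non-overlapping scan slot, one host at a time."""
    schedule = []
    slot_start = window_start
    for host in hosts:
        slot_end = slot_start + scan_duration
        schedule.append((host, slot_start, slot_end))
        # Leave a gap so the previous host can settle before the next scan starts.
        slot_start = slot_end + gap
    return schedule

# Hypothetical cluster and low-activity window.
hosts = ["esx01.example.com", "esx02.example.com", "esx03.example.com"]
schedule = stagger_scans(
    hosts,
    window_start=datetime(2025, 1, 6, 1, 0),
    scan_duration=timedelta(minutes=30),
    gap=timedelta(minutes=10),
)

for host, start, end in schedule:
    print(f"{host}: {start:%H:%M} to {end:%H:%M}")
```

Each host gets exactly one slot and no two slots overlap, which satisfies the one-scan-per-host and stagger-across-hosts guidance at the same time.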
Remove the high-volume DFW packet logging at its source, in NSX Manager under Security > Distributed Firewall. Either disable logging on the default Drop rule, or create a more specific non-logging Drop rule scoped to the high-volume dropped destination (<destination-IP>:<port>).
Remediate the traffic source the firewall is dropping. Restore or decommission the unreachable backend or virtual service whose down state is generating the continuous dropped-traffic flow. This stops the logging storm at its true origin.
Confirm the logging rate has fallen. Connect to the host with SSH and check the rate limit and the high-water-mark messages:
```
vsipioctl getloglimit
grep "VSIP DFW: Log request HWM" /var/run/log/vmkernel.log
```
The reported LPS values should drop well below the rate limit after the change.
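If you copy the grep output off the host, a short script can compare each reported high-water mark against its rate limit. This is a minimal sketch: the regular expression is derived only from the message format shown in this article, the function names are hypothetical, and the numeric values in the sample lines are invented for illustration.

```python
import re

# Pattern based on the message format shown in this article.
HWM_PATTERN = re.compile(
    r"VSIP DFW: Log request HWM during (\d+) sec period = (\d+) LPS\. "
    r"Rate limit = (\d+) LPS\. Logged = (\d+)\. Dropped = (\d+)\."
)

def parse_hwm(line):
    """Extract the numeric fields from one HWM message, or None if no match."""
    match = HWM_PATTERN.search(line)
    if match is None:
        return None
    period, hwm, limit, logged, dropped = map(int, match.groups())
    return {
        "period_sec": period,
        "hwm_lps": hwm,
        "rate_limit_lps": limit,
        "logged": logged,
        "dropped": dropped,
    }

def over_limit(lines):
    """Return parsed entries whose high-water mark exceeds the rate limit."""
    parsed = (parse_hwm(line) for line in lines)
    return [entry for entry in parsed if entry and entry["hwm_lps"] > entry["rate_limit_lps"]]

# Invented sample values: one line before the change, one after.
sample = [
    "VSIP DFW: Log request HWM during 1800 sec period = 42000 LPS. Rate limit = 10000 LPS. Logged = 18000000. Dropped = 57600000.",
    "VSIP DFW: Log request HWM during 1800 sec period = 350 LPS. Rate limit = 10000 LPS. Logged = 630000. Dropped = 0.",
]

for entry in over_limit(sample):
    print(f"HWM {entry['hwm_lps']} LPS exceeds limit {entry['rate_limit_lps']} LPS")
```

After the logging change, no new entries should be returned by `over_limit` for periods that begin after the change took effect.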
Plan the ESXi upgrade path as a longer-term measure. An improved syslog architecture intended to handle this class of log volume is expected in a later ESXi release. Subscribe to this article to receive updates on the release that includes it.

Additional Information

For the firewall-logging side of this issue, see Excessive Distributed Firewall (DFW) Logging Causes Host Resource or Stability Issues.
For syslog message loss under DFW logging, see Message lost in vmsyslog occur on ESXi hosts in an NSX DFW environment.
For the host-disconnect symptom driven by excessive logging, see ESXi hosts disconnected from vCenter due to excessive logging rates, causing dropped syslog messages and services to be unable to log.
As a general practice, avoid enabling logging on a default Drop rule in production for any sustained period. If logging is required, scope it to a specific rule for the traffic flow in question.
Treat external scanning of ESXi management interfaces as a management-plane load. Because that traffic bypasses the Distributed Firewall, firewall-side controls do not throttle it; the control point is the scanner's own concurrency and schedule.