In /var/run/log/hostd-probe.log, you see watchdog events similar to:
hostd detected to be non-responsive
In /var/run/log/vmkernel.log, you see Distributed Firewall log-rate high-water-mark messages similar to:
VSIP DFW: Log request HWM during 1800 sec period = NNNNN LPS. Rate limit = 10000 LPS. Logged = NNNNNNN. Dropped = NNNNNN.
Dropping messages due to log stress (qsize = 25000)
Two independent sources of load arrive at the same ESXi host management plane at the same time. Either source on its own is usually survivable. The combination is what exhausts the host.
This issue occurs when all of the following conditions are met:
When the scan load lands on top of the logging load, the combined demand exceeds what vmsyslogd can drain and what the hostd worker pool can service. The hostd shared logger backend blocks, hostd worker threads stall on their own logging calls, and the hostd watchdog reports the host as non-responsive even though hostd has not crashed. vCenter records the host as disconnected. If the condition persists, vSphere HA fails over the host's virtual machines.
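The way two individually survivable loads combine can be illustrated with a back-of-the-envelope sketch. This is a toy model only, not a description of vmsyslogd or hostd internals: the function name fill_seconds, the drain rate, and the inflow rates are all hypothetical, and the queue size of 25000 is borrowed from the qsize value in the sample message purely for illustration.

```shell
# Toy model (hypothetical, not vmsyslogd internals): seconds until a bounded
# queue fills, given integer rates in messages per second.
# Prints "never" when the drain rate keeps up with the total inflow.
# usage: fill_seconds <queue_size> <drain_rate> <inflow_rate>...
fill_seconds() {
    qsize=$1
    drain=$2
    shift 2

    # Sum every inflow source passed on the command line.
    inflow=0
    for rate in "$@"; do
        inflow=$((inflow + rate))
    done

    net=$((inflow - drain))
    if [ "$net" -le 0 ]; then
        echo never
    else
        # Integer division, rounded up to whole seconds.
        echo $(( (qsize + net - 1) / net ))
    fi
}
```

With an assumed drain rate of 8000 messages per second, fill_seconds 25000 8000 6000 and fill_seconds 25000 8000 4000 both print never, while fill_seconds 25000 8000 6000 4000 prints 13: neither load fills the queue on its own, but together they do so within seconds.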
The underlying limitation, that high-rate DFW packet logging can overwhelm the ESXi syslog daemon, is a known issue tracked internally. An improved syslog architecture is expected in a later ESXi release.
Reduce the two loads so they no longer combine to exhaust the host. Applying both controls gives the most resilient result; either one on its own substantially reduces the disconnect risk.
Reduce the management-interface scan pressure in your vulnerability scanner's policy configuration.
Remove the high-volume DFW packet logging at its source, in NSX Manager under Security > Distributed Firewall. Either disable logging on the default Drop rule, or create a more specific non-logging Drop rule scoped to the high-volume dropped destination (<destination-IP>:<port>).
Remediate the traffic source the firewall is dropping. Restore or decommission the unreachable backend or virtual service whose down state is generating the continuous dropped-traffic flow. This stops the logging storm at its true origin.
Confirm the logging rate has fallen. Connect to the host with SSH and check the rate limit and the high-water-mark messages:
vsipioctl getloglimit
grep "VSIP DFW: Log request HWM" /var/run/log/vmkernel.log
The reported LPS values should drop well below the rate limit after the change.
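To make the comparison easier to read, the high-water-mark lines can be reduced to just their numbers. The following is a minimal sketch that assumes the messages match the sample format shown in this article exactly; the function name hwm_summary is hypothetical, and the pattern may need adjusting if the wording or spacing differs on your build.

```shell
# Read log lines on stdin and, for each DFW high-water-mark message, print:
#   <requested LPS> <rate limit LPS> <dropped count>
# Assumes the message layout shown in this article; other lines are ignored.
hwm_summary() {
    sed -n 's/.*VSIP DFW: Log request HWM during [0-9]* sec period = \([0-9]*\) LPS\. Rate limit = \([0-9]*\) LPS\. Logged = \([0-9]*\)\. Dropped = \([0-9]*\)\..*/\1 \2 \4/p'
}
```

For example, hwm_summary < /var/run/log/vmkernel.log | tail -n 5 prints the five most recent readings. After the change, the first column should be well below the second.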
Plan the ESXi upgrade path as a longer-term measure. An improved syslog architecture intended to handle this class of log volume is expected in a later ESXi release. Subscribe to this article to receive updates on the release that includes it.