vCenter Server vmdird service crashes with SIGABRT due to a race condition in RPC timer cleanup

Products

VMware vCenter Server

Issue/Introduction

The VMware Directory Service (vmdird) on the VMware vCenter Server Appliance (VCSA) may unexpectedly crash during periods of high network connection churn. This causes a temporary disruption of vSphere Single Sign-On (SSO) and environment-wide authentication services.
On the vCenter Server, /var/log/vmware/vmdird/vmdird.log typically shows a sudden cessation of logging activity followed by service initialization messages. This 'silent window' indicates that the process terminated abruptly.
```
YYYY-MM-DD:THH:MM:SS:t@####:INFO: Modify Entry ...
[Log inactivity window]
YYYY-MM-DD:THH:MM:SS:t@####:INFO: VmDir State (1)
YYYY-MM-DD:THH:MM:SS:t@####:INFO: Lotus Vmdird: starting...
```
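The silent window can also be located programmatically. The following is a minimal sketch, not a supported tool. It assumes that real log entries begin with an ISO 8601 style timestamp such as 2024-01-31T12:34:56 (the excerpt above shows only a placeholder), and the function name and thresholds are hypothetical. It reports gaps in logging that are followed shortly afterwards by a "Vmdird: starting" message.

```python
import re
from datetime import datetime, timedelta

# Assumption: each entry starts with a timestamp like 2024-01-31T12:34:56.
# Adjust TS_RE and the strptime format if the real log differs.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
START_MARKER = "Vmdird: starting"


def find_silent_windows(lines,
                        min_gap=timedelta(minutes=1),
                        restart_within=timedelta(seconds=30)):
    """Return (last_entry_time, first_entry_after_gap) pairs for logging gaps
    of at least min_gap that are followed by a start message within
    restart_within of the end of the gap."""
    windows = []
    prev_ts = None
    pending = None  # gap that is still waiting for a start message
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # skip lines without a parsable timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if prev_ts is not None and ts - prev_ts >= min_gap:
            pending = (prev_ts, ts)
        if pending is not None:
            if ts - pending[1] > restart_within:
                pending = None  # logging resumed without a restart
            elif START_MARKER in line:
                windows.append(pending)
                pending = None
        prev_ts = ts
    return windows
```

To use it, pass an open file object for a copy of vmdird.log, for example find_silent_windows(open("vmdird.log")). Each returned pair marks the last entry before the gap and the first entry after it.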

The vCenter vmon and audit logs record the invocation of the core dump handler and the abnormal termination of the process with signal 6 (SIGABRT).
/var/log/vmware/vmon/vmon.log on the vCenter Server:

YYYY-MM-DD:THH:MM:SS In(05) host-2481 Client info Uid=0,Gid=0,Pid=2360638,Comm=(vmon-coredumper),PPid=2,Comm=(kthreadd),PPid=0

/var/log/audit/audit.log on the vCenter Server:

YYYY-MM-DD:THH:MM:SS vCenter_fqdn audit[2446]: ANOM_ABEND auid=4294967295 uid=9899 gid=3914 ses=4294967295 subj=unconfined pid=2446 comm="vmdird" exe="/usr/lib/vmware-vmdir/sbin/vmdird" sig=6 res=1
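
To confirm the signature across a large audit log, the ANOM_ABEND records can be filtered by process name and signal. This is an illustrative sketch only: it relies on the key=value layout shown in the sample line above, and the function names are hypothetical.

```python
import re

# Matches key=value pairs; values may be quoted (comm="vmdird") or bare (sig=6).
KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_fields(line):
    """Return the key=value pairs of an audit record as a dict of strings."""
    return {key: value.strip('"') for key, value in KV_RE.findall(line)}


def find_abends(lines, comm="vmdird", sig=6):
    """Return the parsed fields of ANOM_ABEND records for the given process
    name and signal number."""
    hits = []
    for line in lines:
        if "ANOM_ABEND" not in line:
            continue
        fields = parse_fields(line)
        if fields.get("comm") == comm and fields.get("sig") == str(sig):
            hits.append(fields)
    return hits
```

Calling find_abends(open("audit.log")) on a copy of the audit log returns one dict per matching record, including the pid and exe fields.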

Immediately before the crash, /var/log/vmware/vmdird/vmdird.log on the vCenter Server may show a high volume of SSL handshake failures or malformed packet errors (errno 34 or 104) from network scanners or misconfigured clients.

YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 34 peer (ip_address)
YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 34 peer (ip_address)
YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 104 peer (ip_address)
YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 104 peer (ip_address)
YYYY-MM-DD:THH:MM:SS:t@####:ERROR: Failed SSL function (SSL_read), return value (-1)

Environment

vCenter Server 8.x

Cause

The crash is caused by a memory management defect (heap corruption/double-free) within the DCE/RPC connection-oriented (CN) association reclaim timer.
Diagnostic logs indicate that the service typically stops recording entries several minutes before the crash. During this window, a race condition occurs when the automated cleanup timer attempts to join a network thread (dcethread_join) that has already been terminated and freed by a concurrent connection shutdown. This conflict aborts the process with SIGABRT, which the kernel records as an abnormal termination (ANOM_ABEND in the audit log). The Likewise service manager (lwsm) then detects the process failure and initiates an automatic service restart within seconds to restore functionality.
Contributing factors include aggressive vulnerability scanning or high volumes of incompatible SSL probe traffic, both of which increase the probability of encountering the race window during rapid connection teardowns.

Resolution

This issue is a known product defect. Broadcom Engineering is developing a permanent code fix, which will be included in a future VMware vCenter Server release.

To mitigate the risk of recurrence until the fix is available:

Reduce Connection Churn: Identify internal security scanners or clients (such as Tenable Nessus) that may be conducting aggressive vulnerability scans, and reduce their scan frequency or intensity.
Investigate Error Sources: Review the vmdird log to identify the source IP addresses behind the frequent errno 34 and errno 104 messages, and ensure that those clients use compatible TLS protocols.
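
The error sources can be summarized with a short script. The sketch below is a hedged example rather than an official utility: it matches the ber_get_next() message format shown in the Issue/Introduction section and counts occurrences per peer address and errno value. The function name is hypothetical. On Linux, errno 34 is ERANGE and errno 104 is ECONNRESET.

```python
import re
from collections import Counter

# Matches: ber_get_next() call failed with errno = 34 peer (192.0.2.10)
BER_RE = re.compile(
    r"ber_get_next\(\) call failed with errno = (\d+) peer \(([^)]*)\)"
)


def count_errors_by_peer(lines, errnos=(34, 104)):
    """Count ber_get_next() failures per (peer, errno) for the given errnos."""
    counts = Counter()
    for line in lines:
        match = BER_RE.search(line)
        if match and int(match.group(1)) in errnos:
            counts[(match.group(2), int(match.group(1)))] += 1
    return counts
```

For example, count_errors_by_peer(open("vmdird.log")).most_common(10) lists the ten busiest peer and errno combinations in a copy of the log, which can then be compared against known scanner addresses.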