vCenter Server vmdird service crashes with SIGABRT due to a race condition in RPC timer cleanup

Article ID: 435414

Products

VMware vCenter Server

Issue/Introduction

  • The VMware Directory Service (vmdird) on the VMware vCenter Server Appliance (VCSA) may crash unexpectedly during periods of high network connection churn. This results in a temporary disruption to vSphere Single Sign-On (SSO) and environment-wide authentication services.

  • On the vCenter Server, /var/log/vmware/vmdird/vmdird.log typically shows logging stop suddenly, followed by service initialization messages. This "silent window" indicates that the process terminated abruptly.
    YYYY-MM-DD:THH:MM:SS:t@####:INFO: Modify Entry ...
    [Log inactivity window]
    YYYY-MM-DD:THH:MM:SS:t@####:INFO: VmDir State (1)
    YYYY-MM-DD:THH:MM:SS:t@####:INFO: Lotus Vmdird: starting...

  • The vmon and audit logs on the vCenter Server record the invocation of the core dump handler and the abnormal termination of the process with signal 6 (SIGABRT).
    /var/log/vmware/vmon/vmon.log:
    YYYY-MM-DD:THH:MM:SS In(05) host-2481 Client info Uid=0,Gid=0,Pid=2360638,Comm=(vmon-coredumper),PPid=2,Comm=(kthreadd),PPid=0

    /var/log/audit/audit.log:
    YYYY-MM-DD:THH:MM:SS vCenter_fqdn audit[2446]: ANOM_ABEND auid=4294967295 uid=9899 gid=3914 ses=4294967295 subj=unconfined pid=2446 comm="vmdird" exe="/usr/lib/vmware-vmdir/sbin/vmdird" sig=6 res=1

  • Immediately before the crash, vmdird.log may show a high volume of SSL handshake failures or malformed packet errors (errno 34 or 104, which on Linux correspond to ERANGE and ECONNRESET) from network scanners or misconfigured clients.
    YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 34 peer (ip_address)
    YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 34 peer (ip_address)
    YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 104 peer (ip_address)
    YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: ber_get_next() call failed with errno = 104 peer (ip_address)
    YYYY-MM-DD:THH:MM:SS:t@####:ERROR: Failed SSL function (SSL_read), return value (-1)
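
The audit signature above can be checked programmatically. The following is a minimal Python sketch, assuming audit records keep the key=value layout shown in the excerpt. The function names (parse_abend, vmdird_sigabrt_records) are hypothetical helpers introduced for illustration and are not part of any VMware tooling.

```python
import re

# Matches key=value pairs where the value is either quoted or a bare token.
_KV = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_abend(line):
    """Return the key/value fields of an ANOM_ABEND audit record, or None."""
    if "ANOM_ABEND" not in line:
        return None
    return {key: value.strip('"') for key, value in _KV.findall(line)}


def vmdird_sigabrt_records(lines):
    """Yield parsed records where vmdird terminated with signal 6 (SIGABRT)."""
    for line in lines:
        record = parse_abend(line)
        if record and record.get("comm") == "vmdird" and record.get("sig") == "6":
            yield record


# Sample line copied from the audit.log excerpt in this article.
sample = (
    'YYYY-MM-DD:THH:MM:SS vCenter_fqdn audit[2446]: ANOM_ABEND '
    'auid=4294967295 uid=9899 gid=3914 ses=4294967295 subj=unconfined '
    'pid=2446 comm="vmdird" exe="/usr/lib/vmware-vmdir/sbin/vmdird" sig=6 res=1'
)

for record in vmdird_sigabrt_records([sample]):
    print("vmdird aborted, pid", record["pid"], "exe", record["exe"])
```

To scan a real log, the same generator could be fed the lines of /var/log/audit/audit.log in place of the sample list.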

Environment

vCenter Server 8.x

Cause

  • The crash is caused by a memory management defect (heap corruption/double-free) within the DCE/RPC connection-oriented (CN) association reclaim timer.
  • Diagnostic logs show that the service typically stops recording entries several minutes before the crash. During this window, the automated cleanup timer attempts to join a network thread (dcethread_join) that a simultaneous connection shutdown has already terminated and freed. This race condition causes the SIGABRT termination recorded in the audit log. The Likewise service manager (lwsm) then detects the process failure and automatically restarts the service within seconds to restore functionality.
  • Contributing factors include aggressive vulnerability scanning or high volumes of incompatible SSL probe traffic, both of which increase the probability of encountering the race window during rapid connection teardowns.
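
The ordering hazard described above can be illustrated with a toy Python model. This is only a sketch of the general pattern (two cleanup paths releasing the same thread handle) and is not vmdird or DCE/RPC code. All class and method names are hypothetical, and the two paths run in sequence here to make the bad ordering deterministic.

```python
class ThreadHandle:
    """Toy stand-in for a native thread handle that must be released once."""

    def __init__(self):
        self.released = False

    def join_and_release(self):
        if self.released:
            # In native code, this is the point where a double free or
            # heap corruption could occur and the process could abort.
            raise RuntimeError("join on a handle that was already released")
        self.released = True


class Association:
    """Toy connection association with two independent cleanup paths."""

    def __init__(self):
        self.handle = ThreadHandle()

    def shutdown(self):
        """Connection-shutdown path: terminates and frees the network thread."""
        self.handle.join_and_release()

    def reclaim_timer_fired(self):
        """Cleanup-timer path: also tries to join the same network thread."""
        self.handle.join_and_release()


assoc = Association()
assoc.shutdown()  # The connection teardown wins the race.
try:
    assoc.reclaim_timer_fired()  # The timer then touches the freed handle.
except RuntimeError as exc:
    print("hazard reproduced in the model:", exc)
```

In the model the second release raises a Python exception. In a native process, the equivalent misuse is undefined behavior, which is consistent with the abort described in this article.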

Resolution

This issue is a known product defect. Broadcom Engineering is developing a permanent code fix to be included in a future release of VMware vCenter Server.

To mitigate the risk of recurrence until the fix is available:

  • Reduce Connection Churn: Identify internal security scanners or clients (such as Tenable Nessus) that may be performing aggressive vulnerability probes and reduce their scan frequency or intensity.
  • Investigate Error Sources: Review vmdird.log to identify the source IP addresses of the frequent errno 34 and errno 104 messages, and ensure those clients are using compatible TLS protocols.
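
As one way to identify the error sources, the following minimal Python sketch tallies ber_get_next() failures by peer and errno. It assumes the log lines follow the format of the vmdird.log excerpt above, with the peer address in parentheses. The function name count_failures_by_peer and the sample addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the vmdird.log excerpt in this article.
_FAIL = re.compile(
    r"ber_get_next\(\) call failed with errno = (\d+) peer \(([^)]*)\)"
)


def count_failures_by_peer(lines):
    """Count ber_get_next() failures per (peer, errno) pair."""
    counts = Counter()
    for line in lines:
        match = _FAIL.search(line)
        if match:
            errno_value = int(match.group(1))
            peer = match.group(2)
            counts[(peer, errno_value)] += 1
    return counts


# Hypothetical sample lines using documentation-range addresses.
prefix = "YYYY-MM-DD:THH:MM:SS:t@####:ERROR: ProcessAConnection: "
sample_lines = [
    prefix + "ber_get_next() call failed with errno = 34 peer (192.0.2.10)",
    prefix + "ber_get_next() call failed with errno = 34 peer (192.0.2.10)",
    prefix + "ber_get_next() call failed with errno = 104 peer (192.0.2.20)",
    "YYYY-MM-DD:THH:MM:SS:t@####:INFO: Modify Entry ...",
]

# List the noisiest peers first.
for (peer, errno_value), count in count_failures_by_peer(sample_lines).most_common():
    print(peer, "errno", errno_value, "count", count)
```

Peers that appear at the top of the list are reasonable first candidates to compare against known scanner addresses.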