vmware-vsan-health service stops after starting the service



Article ID: 390274



Products

VMware vSAN

VMware Tanzu Kubernetes Grid Service (TKGs)

Issue/Introduction

Symptoms:

On a vSAN cluster, vSAN health information appears blank in the vCenter Server UI.

The vmware-vsan-health service shows as stopped on the vCenter Server.

Starting the vmware-vsan-health service succeeds, but the service stops again after a few minutes.

 

Environment

VMware vSAN 8.x

VMware Tanzu Kubernetes Grid Service (TKGs)

Cause

The vsan-health service crashes because of an underlying issue at the CNS (Cloud Native Storage) layer.

  • In /var/log/vmware/vmon/vmon.log on the vCenter Server, the events show that the service started successfully.

2025-03-07T04:32:13.637Z In(05) host-2547 <vsan-health> Re-check service health since it is still initializing.
2025-03-07T04:32:17.639Z In(05) host-2547 <vsan-health> Running the API Health command as user vsan-health
2025-03-07T04:32:17.639Z In(05) host-2547 <vsan-health-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-vpx/vsan-health/vsanhealth-vmon-apihealth.py
2025-03-07T04:32:18.022Z In(05) host-2547 <vsan-health> Service STARTED successfully.
2025-03-07T04:32:18.023Z Wa(03) host-2547 [ReadSvcSubStartupData] No startup information from vsan-health.

  • A few minutes later, vmon.log records the service crash events below.

2025-03-07T04:33:17.521Z In(05) host-2547 Client info Uid=0,Gid=0,Pid=582836,Comm=(vmon-coredumper),PPid=2,Comm=(kthreadd),PPid=0
2025-03-07T04:33:17.521Z In(05) host-2547 <vsan-health> Service is dumping core. Coredump count 43. CurrReq: 0
2025-03-07T04:33:17.521Z In(05) host-2547 <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata vsan-health,UNHEALTHY,HEALTHY,1
2025-03-07T04:33:18.192Z Wa(03) host-2547 <vsan-health> Service exited. Exit code 1
2025-03-07T04:33:18.192Z Wa(03) host-2547 <vsan-health> Service exited unexpectedly. Crash count 44. Taking configured recovery action.
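The crash pattern in the vmon.log excerpts above can be picked out programmatically. The following is a minimal Python sketch, not a supported tool: the function name `find_crash_events` and the regular expressions are illustrative assumptions derived only from the sample lines in this article, and other vCenter builds may word these messages differently.

```python
import re

# Patterns modeled on the vmon.log lines quoted in this article (an assumption;
# the wording may differ on other vCenter builds).
CORE_RE = re.compile(
    r"^(?P<ts>\S+) \S+ \S+ <(?P<svc>[^>]+)> "
    r"Service is dumping core\. Coredump count (?P<count>\d+)\."
)
CRASH_RE = re.compile(
    r"^(?P<ts>\S+) \S+ \S+ <(?P<svc>[^>]+)> "
    r"Service exited unexpectedly\. Crash count (?P<count>\d+)\."
)


def find_crash_events(lines, service="vsan-health"):
    """Return (timestamp, kind, count) tuples for one service, in log order."""
    events = []
    for line in lines:
        text = line.strip()
        for kind, pattern in (("coredump", CORE_RE), ("crash", CRASH_RE)):
            match = pattern.match(text)
            if match and match.group("svc") == service:
                events.append((match.group("ts"), kind, int(match.group("count"))))
    return events


# Sample lines copied from the excerpts above.
SAMPLE = [
    "2025-03-07T04:32:18.022Z In(05) host-2547 <vsan-health> Service STARTED successfully.",
    "2025-03-07T04:33:17.521Z In(05) host-2547 <vsan-health> Service is dumping core. Coredump count 43. CurrReq: 0",
    "2025-03-07T04:33:18.192Z Wa(03) host-2547 <vsan-health> Service exited. Exit code 1",
    "2025-03-07T04:33:18.192Z Wa(03) host-2547 <vsan-health> Service exited unexpectedly. Crash count 44. Taking configured recovery action.",
]

if __name__ == "__main__":
    for ts, kind, count in find_crash_events(SAMPLE):
        print(ts, kind, count)
```

To run it against a live system, pass an open file handle for /var/log/vmware/vmon/vmon.log instead of `SAMPLE`. A steadily rising crash count shortly after each "Service STARTED successfully" line is consistent with the symptom described in this article.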

  • From the vCenter Server command line, the /var/core directory contains vsanvcmgmtd-worker core dumps such as the following.

core.vsanvcmgmtd-wor.524575
core.vsanvcmgmtd-wor.528472
core.vsanvcmgmtd-wor.579907
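The check for core dumps can also be scripted. The sketch below assumes the core files follow the `core.vsanvcmgmtd-wor.<pid>` naming seen above; the helper name `list_vsan_cores` is hypothetical and not part of any VMware tooling.

```python
from pathlib import Path


def list_vsan_cores(core_dir="/var/core", prefix="core.vsanvcmgmtd-wor."):
    """Return (name, size_bytes, mtime) for matching core files, oldest first."""
    directory = Path(core_dir)
    if not directory.is_dir():
        # Nothing to report if the directory is absent on this system.
        return []
    cores = []
    for path in directory.glob(prefix + "*"):
        if path.is_file():
            info = path.stat()
            cores.append((path.name, info.st_size, info.st_mtime))
    # Sort by modification time so the most recent crash appears last.
    return sorted(cores, key=lambda entry: entry[2])


if __name__ == "__main__":
    for name, size, _mtime in list_vsan_cores():
        print(f"{name}\t{size} bytes")
```

Several files with this prefix and recent modification times suggest the service is crashing repeatedly rather than being stopped manually.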

  • In vsanvcmgmtd.log under the path /var/log/vmware/vmware-vsan-health-service.log on the vCenter Server, you may see events similar to the following.

2025-03-07T04:32:17.390Z info vsanvcmgmtd[580209] [vSAN@6876 sub=PHM::PhmInventoryListener opId=vsan-wfu-2b03] ProcessUpdate called
2025-03-07T04:32:17.390Z info vsanvcmgmtd[580209] [vSAN@6876 sub=PHM::PhmInventoryListener opId=vsan-wfu-2b03] ProcessUpdate: Update kind: 'enter' or 'leave'. Ignoring the update
2025-03-07T04:32:17.390Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsDb] Loaded 34195 volumes out of 34195 volumes from DB.
2025-03-07T04:32:17.391Z info vsanvcmgmtd[580113] [vSAN@6876 sub=pcs[0]] Registered listener '[CnsDatastoreListener:0x000055a5374f3d60]'
2025-03-07T04:32:17.391Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Fail Cns InProgress Tasks
2025-03-07T04:32:17.391Z info vsanvcmgmtd[580211] [vSAN@6876 sub=PropertyCollectorService] CNS: Gathering CNS Tasks
2025-03-07T04:32:17.394Z info vsanvcmgmtd[580217] [vSAN@6876 sub=pcs[0]] Started listerner '[CnsDatastoreListener:0x000055a5374f3d60]'
2025-03-07T04:32:17.394Z info vsanvcmgmtd[580211] [vSAN@6876 sub=vmomi.soapStub[5]] SOAP request returned HTTP failure; <<io_obj p:0x00007fa3a40dbb40, h:27, <TCP '127.0.0.1 : 37400'>, <TCP '127.0.0.1 : 1080'>>, /sdk>, method: GetRecentTask; code: 500(Internal Server Error); fault: (vim.fault.NotAuthenticated) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    object = 'vim.TaskManager:8cbe3917-25c7-4cdc-a28f-53ece89a068e:TaskManager',
-->    privilegeId = "",
-->    missingPrivileges = <unset>
-->    msg = "Received SOAP response fault from [<<io_obj p:0x00007fa3a40dbb40, h:27, <TCP '127.0.0.1 : 37400'>, <TCP '127.0.0.1 : 1080'>>, /sdk>]: GetRecentTask
--> The session is not authenticated."
--> }
2025-03-07T04:32:17.410Z info vsanvcmgmtd[580221] [vSAN@6876 sub=CnsCatalogSvc opId=vsan-wfu-2b03] CNS: CatalogService is initialized successfully
2025-03-07T04:32:17.410Z info vsanvcmgmtd[580211] [vSAN@6876 sub=VpxdCnx] Login to the destination, SessionKey: 5237f67a-ae2b-fdd1-51c8-463a17b87ff5
2025-03-07T04:32:17.410Z info vsanvcmgmtd[580211] [vSAN@6876 sub=VpxdCnx] Recovered session, sid: 1, recoverRequestOnly:false
2025-03-07T04:32:17.416Z info vsanvcmgmtd[580113] [vSAN@6876 sub=pcs[0]] Registered listener '[CnsHostListener:0x000055a537768580]'
2025-03-07T04:32:17.420Z info vsanvcmgmtd[580244] [vSAN@6876 sub=CnsCatalogSvc] Find file service cluster vim.ClusterComputeResource:domain-c138 for datastore ds:///vmfs/volumes/vsan:52532c9ec0986ec0-af########/
2025-03-07T04:32:17.421Z info vsanvcmgmtd[580211] [vSAN@6876 sub=PropertyCollectorService] CNS: Finish gathering CNS Tasks. Total=16, CNS=1
2025-03-07T04:32:17.421Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Total 1 old CNS tasks are found
2025-03-07T04:32:17.425Z info vsanvcmgmtd[580244] [vSAN@6876 sub=PyCppVmomi] Initialized python thread state 0x00007fa3943c3290.
2025-03-07T04:32:17.428Z info vsanvcmgmtd[580222] [vSAN@6876 sub=pcs[0]] Started listerner '[CnsHostListener:0x000055a537768580]'
2025-03-07T04:32:17.430Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Old task=(vim.TaskInfo) {
-->    key = "task-47743075",
-->    task = 'vim.Task:8cbe3917-25c7-4cdc-a28f-53ece89####:task-47743075',
-->    descriptionId = "com.vmware.cns.tasks.updatevolume",
-->    entity = 'vim.Folder:8cbe3917-25c7-4cdc-a28f-####:group-d1',
-->    entityName = "Datacenters",
-->    state = "running",
-->    cancelled = false,
-->    cancelable = false,
-->    error = (vmodl.fault.SystemError) {
-->       reason = "Failing pending CNS tasks during startup",
-->       msg = "",
-->    },
-->    progress = 0,
-->    reason = (vim.TaskReasonUser) {
-->       userName = "com.vmware.cns"
-->    },
-->    queueTime = "2025-03-07T03:35:49.30836Z",
-->    startTime = "2025-03-07T03:35:49.31687Z",
-->    eventChainId = 74107978,
-->    activationId = "3ae1c86d",
--> }
2025-03-07T04:32:17.430Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Finish Failing Cns InProgress Tasks
2025-03-07T04:32:17.448Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsSync] PeriodicSyncManager started
2025-03-07T04:32:17.448Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsSync] Starting sync ...
2025-03-07T04:32:17.448Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsSync] Sync all datastores ...
2025-03-07T04:32:17.448Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsSync] Sync ds:///vmfs/volumes/65f710b0-84b01022-4c93-###########/: startVClock = 0, fullSync = true
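When collecting details for a support case, it can help to summarize the CNS startup entries from the log above. The following Python sketch is illustrative only: `summarize_cns_startup` is a hypothetical helper, its patterns are based solely on the sample lines in this article, and it reports what the log says without diagnosing the root cause.

```python
import re

# Patterns modeled on the vsanvcmgmtd log lines quoted in this article (an assumption).
VOLUMES_RE = re.compile(r"sub=CnsDb\] Loaded (\d+) volumes out of (\d+) volumes from DB")
OLD_TASKS_RE = re.compile(r"sub=CnsTask\] Total (\d+) old CNS tasks are found")


def summarize_cns_startup(lines):
    """Collect CNS volume counts, stale task counts, and NotAuthenticated faults."""
    summary = {
        "volumes_loaded": None,
        "volumes_total": None,
        "old_cns_tasks": None,
        "not_authenticated_faults": 0,
    }
    for line in lines:
        match = VOLUMES_RE.search(line)
        if match:
            summary["volumes_loaded"] = int(match.group(1))
            summary["volumes_total"] = int(match.group(2))
        match = OLD_TASKS_RE.search(line)
        if match:
            summary["old_cns_tasks"] = int(match.group(1))
        if "vim.fault.NotAuthenticated" in line:
            summary["not_authenticated_faults"] += 1
    return summary


# Sample lines copied from the excerpt above; the SOAP failure line is abridged with "...".
SAMPLE = [
    "2025-03-07T04:32:17.390Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsDb] Loaded 34195 volumes out of 34195 volumes from DB.",
    "2025-03-07T04:32:17.394Z info vsanvcmgmtd[580211] [vSAN@6876 sub=vmomi.soapStub[5]] SOAP request returned HTTP failure; ... fault: (vim.fault.NotAuthenticated) {",
    "2025-03-07T04:32:17.421Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Total 1 old CNS tasks are found",
    "2025-03-07T04:32:17.448Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsSync] Starting sync ...",
]

if __name__ == "__main__":
    print(summarize_cns_startup(SAMPLE))
```

The resulting values (volume counts, number of old CNS tasks, and authentication faults) can be included alongside the log bundle when opening a case.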

Resolution

If the symptoms match, contact Broadcom Support for further assistance.