VM reboot/reset due to high stun time
search cancel

VM reboot/reset due to high stun time

book

Article ID: 422316

calendar_today

Updated On:

Products

VMware vCenter Server

Issue/Introduction

  • Aria automation / PODs will give the following error during backup/replication issue -

    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: YYYY-MM-DD HH:MM:SS.177149Z: ping delayed by 50.6s
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: YYYY-MM-DD HH:MM:SS.250001Z: Freeze duration: 52.017s
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: YYYY-MM-DD HH:MM:SS.279375Z: Freeze synchronization failed: Traceback (most recent call last):
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]:   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]:     return fut.result()
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: asyncio.exceptions.CancelledError
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: The above exception was the direct cause of the following exception:
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: Traceback (most recent call last):
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]:   File "/opt/scripts/freezer-server.py", line 434, in run
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]:     await asyncio.wait_for(self._thaw, timeout=deadline - loop.time())
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]:   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]:     raise exceptions.TimeoutError() from exc
    MM-DD HH:MM:SS <hostname-fqdn> vmtoolsd[3183662]: asyncio.exceptions.TimeoutError 

 

  • From the host logs, following error traces are noticed during the issue -

    YYYY-MM-DDTHH:MM:SS.440Z Er(02) vcpu-0 - OBJLIB-VSANOBJ: VsanObjLock: Could not lock object 'vsan://################-################/########-####-####-####-############': Failed to lock the file (4).
    YYYY-MM-DDTHH:MM:SS.441Z In(05) vcpu-0 - DISKLIB-VMFS : "vsan://################-################/########-####-####-####-############" : failed to open (Failed to lock the file): ObjLib_Open failed. Type 20
    YYYY-MM-DDTHH:MM:SS.442Z In(05) vcpu-0 - FILE: File_GetVMFSLockInfo: could not open directory "vsan://################-################/": No such file or directory
    YYYY-MM-DDTHH:MM:SS.442Z Er(02) vcpu-0 - DISKLIB-LINK : DiskLinkOpen: Failed to open '/vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>_3-000001.vmdk': : Failed to lock the file
    YYYY-MM-DDTHH:MM:SS.442Z Er(02) vcpu-0 - DISKLIB-CHAIN : DiskChainOpen: "/vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>_3-000001.vmdk": failed to open: Failed to lock the file.
    YYYY-MM-DDTHH:MM:SS.444Z In(05) vcpu-0 - DISKLIB-VMFS : "vsan://################-################/########-####-####-####-############" : open successful (524293) size = 32212254720, hd =0. Type 20
    YYYY-MM-DDTHH:MM:SS.445Z In(05) vcpu-0 - PluginLdr_Load: Loaded plugin 'libvmiof-disk-emcsplitter.so' from '/usr/lib64/vmware/plugin/libvmiof-disk-emcsplitter.so'
    YYYY-MM-DDTHH:MM:SS.445Z In(05) vcpu-0 - DISKLIB-VMFS : "vsan://################-################/########-####-####-####-############" : closed.
    YYYY-MM-DDTHH:MM:SS.445Z In(05) vcpu-0 - DISKLIB-LIB : Failed to open '/vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>_3-000001.vmdk' with flags 0x8 Failed to lock the file (16392).
    YYYY-MM-DDTHH:MM:SS.445Z In(05) vcpu-0 - SNAPSHOT:Failed to open disk /vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>_3-000001.vmdk: Failed to lock the file (16392).
    YYYY-MM-DDTHH:MM:SS.447Z In(05) vcpu-0 - DISKLIB-CBT : Shutting down change tracking for untracked fid 382810936.
    YYYY-MM-DDTHH:MM:SS.448Z In(05) vcpu-0 - DISKLIB-CBT : Successfully disconnected CBT node.
    YYYY-MM-DDTHH:MM:SS.480Z In(05) vcpu-0 - DISKLIB-VMFS : "vsan://################-################/########-####-####-####-############" : closed.
    YYYY-MM-DDTHH:MM:SS.480Z In(05) vcpu-0 - DISK: Failed to open disk for consolidate '/vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>_3-000002.vmdk': Failed to lock the file (16392) 9020
    YYYY-MM-DDTHH:MM:SS.484Z In(05) vcpu-0 - AIOGNRC: Failed to open '/vmfs/devices/vsan/########-####-####-####-############' : Could not find the file (60003) (0x2013).
    YYYY-MM-DDTHH:MM:SS.484Z In(05) vcpu-0 - OBJLIB-VSANOBJ: VsanObjOpenInt: Device '/vmfs/devices/vsan/########-####-####-####-############' not present, retrying to create one.

    YYYY-MM-DDTHH:MM:SS.563Z In(05) vcpu-0 - AIOGNRC: Failed to open '/vmfs/devices/vsan/########-####-####-####-############' : Could not find the file (60003) (0x2013).
    YYYY-MM-DDTHH:MM:SS.563Z In(05) vcpu-0 - OBJLIB-VSANOBJ: VsanObjOpenInt: Device '/vmfs/devices/vsan/########-####-####-####-############' not present, retrying to create one.



  • From machine logs (vmware.log), high-stun is reported as - 

/vmfs/volumes/vsan:################-################/########-####-####-####-############/vmware-390.log:

YYYY-MM-DDTHH:MM:SS.876Z In(05) vmx - SnapshotVMX_TakeSnapshot start: 'Avamar-########################################', deviceState=0, lazy=0, quiesced=1, forceNative=0, tryNative=1, saveAllocMaps=0
YYYY-MM-DDTHH:MM:SS.877Z In(05) vmx - #########:VVOLLIB : VVolLib_IsVectoredSnapSupported:14350: VVolLib_IsVectoredSnapSupported: vvolVectoredSnapEnabled: 1
YYYY-MM-DDTHH:MM:SS.877Z In(05)+ vmx -

YYYY-MM-DDTHH:MM:SS.599Z In(05) vcpu-0 - DISKLIB-CBT   : Shutting down change tracking for untracked fid 3324140716.
YYYY-MM-DDTHH:MM:SS.599Z In(05) vcpu-0 - DISKLIB-CBT   : Successfully disconnected CBT node.
YYYY-MM-DDTHH:MM:SS.637Z In(05) vcpu-0 - DISKLIB-CBT   : ChangeTrackerESX_DestroyMirror: Destroyed mirror node ########-########-cbtmirror. SrcFd: /vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>-ctk.vmdk, DestFd: /vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>-ctk-mirror.vmdk.
<------------------ ~50 sec spend here.
YYYY-MM-DDTHH:MM:SS.486Z In(05) vcpu-0 - [101542476/1154873088]: RPS:#0 - CommandExecuterBase_destroyCommandExecuterMonitor: NOT IMPLEMENTED FOR IOF
YYYY-MM-DDTHH:MM:SS.496Z In(05) vcpu-0 - FiltLib: Filter ##########/emcsplitter is destroying shared heap.

YYYY-MM-DDTHH:MM:SS.570Z In(05) vcpu-0 - DISKLIB-VMFS  : "vsan://################-################/########-####-####-####-############" : closed.
YYYY-MM-DDTHH:MM:SS.577Z No(00) vcpu-0 - ConfigDB: Setting scsi0:0.fileName = "<vm-name>-000001.vmdk"
YYYY-MM-DDTHH:MM:SS.577Z No(00) vcpu-0 - ConfigDB: Setting scsi0:1.fileName = "<vm-name>_1-000001.vmdk"
YYYY-MM-DDTHH:MM:SS.577Z No(00) vcpu-0 - ConfigDB: Setting scsi0:2.fileName = "<vm-name>_2-000001.vmdk"
YYYY-MM-DDTHH:MM:SS.577Z No(00) vcpu-0 - ConfigDB: Setting scsi0:3.fileName = "<vm-name>_3-000001.vmdk"
YYYY-MM-DDTHH:MM:SS.661Z In(05) vcpu-0 - Checkpoint_Unstun: vm stopped for 50350979 us
YYYY-MM-DDTHH:MM:SS.661Z No(00) vcpu-0 - CheckpointTiming save: VMX took 50341634 us
YYYY-MM-DDTHH:MM:SS.661Z No(00) vcpu-0 - CheckpointTiming save: ALL took 50342204 us      ==> high stun reported 
YYYY-MM-DDTHH:MM:SS.661Z In(05) vcpu-0 - SCSI: switching scsi0 to push completion mode

 

Environment

VMware vCenter Server 7.x 
VMware vCenter Server 8.x 

Cause

  • This issue is caused due to external product involvement taking backup or during replication some 3rd party component or filter/component holding virtual disks of the impacted VM(s)

  • From the logs-

vmfs/volumes/vsan:################-################/########-####-####-####-############/vmware-390.log:

YYYY-MM-DDTHH:MM:SS.594Z In(05) vcpu-0 - Closing disk 'scsi0:0'
YYYY-MM-DDTHH:MM:SS.599Z In(05) vcpu-0 - DISKLIB-CBT : Shutting down change tracking for untracked fid 3324140716.
YYYY-MM-DDTHH:MM:SS.599Z In(05) vcpu-0 - DISKLIB-CBT : Successfully disconnected CBT node.
YYYY-MM-DDTHH:MM:SS.637Z In(05) vcpu-0 - DISKLIB-CBT : ChangeTrackerESX_DestroyMirror: Destroyed mirror node ########-########-cbtmirror. SrcFd: /vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>-ctk.vmdk, DestFd: /vmfs/volumes/vsan:################-################/########-####-####-####-############/<vm-name>-ctk-mirror.vmdk.
<------------------ ~50 sec spend here.
YYYY-MM-DDTHH:MM:SS.486Z In(05) vcpu-0 - [101542476/1154873088]: RPS:#0 -
CommandExecuterBase_destroyCommandExecuterMonitor: NOT IMPLEMENTED FOR IOF
  • From the above, the "RPS:#0" log is from Dell IOFilter emcsplitter. May be their IOFilter's diskClose or diskStun callbacks are taking time.

Resolution

Unprotect or disconnect the impacted VM(s) from the third party solution/software/plugins/integrations/replication etc, and monitor further