High VM Stun time during snapshot deletion or SVmotion failure on ESXi 6.7U2 or later

Article ID: 317708


Products

VMware vCenter Server

Issue/Introduction

Symptoms:
  • The environment is running ESXi 6.7 U2 or later.
  • The backend storage array uses an unmap granularity greater than 1 MB.
  • The datastores are using VMFS6 (a quick version check is shown after this list).
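
To confirm the datastore filesystem version on an affected host, you can query the volume directly; in this sketch, VolName is a placeholder for your datastore name:

vmkfstools -Ph /vmfs/volumes/VolName

The first line of the output reports the VMFS version (for example, VMFS-6.xx) along with the block size.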
Snapshot Deletion Operation
  • In the virtual machine's vmware.log file, you see log entries similar to:
2019-05-08T15:22:16.507Z| vcpu-0| I125: DISKLIB-CTK   : Forcing flush of change info for "/vmfs/volumes/5c5b4879-4163c971-b0e6-246e963d8d88/VM/VM-000001-ctk.vmdk".
2019-05-08T15:22:16.518Z| vcpu-0| I125: DISKLIB-CTK   : ChangeTracker_EndCombine()
2019-05-08T15:25:41.336Z| vcpu-0| I125: DISKLIB-CTK   : Unlinked /vmfs/volumes/5c5b4879-4163c971-b0e6-246e963d8d88 /VM/VM-ctk.vmdk, tmp file: /vmfs/volumes/5c5b4879-4163c971-b0e6-246e963d8d88/VM/VM-ctk.vmdk-tmp
2019-05-08T15:25:41.556Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 205052348 us
  • In the vmkernel.log file, you see log entries similar to:
2019-05-08T15:22:20.528Z cpu54:2192401)DLX: 4319: vol 'Datastore', lock at 10182656: [Req mode 1]>Checking liveness:
2019-05-08T15:22:20.528Z cpu54:2192401)[type 10c00002 offset 10182656 v 133804, hb offset 3674112 gen 255, mode 1, owner 5cca60c4-0579f659-c9e4-246e963d8fd0 mtime 7306729 num 0 gblnum 0 gblgen 0 gblbrk 0]
2019-05-08T15:22:24.529Z cpu32:2192401)DLX: 4319: vol 'datastore', lock at 10182656: [Req mode 1]>Checking liveness:
2019-05-08T15:22:24.529Z cpu32:2192401)[type 10c00002 offset 10182656 v 133806, hb offset 3674112 gen 255, mode 1, owner 5cca60c4-0579f659-c9e4-246e963d8fd0 mtime 7306869 num 0 gblnum 0 gblgen 0 gblbrk 0]
2019-05-08T15:22:28.529Z cpu32:2192401)DLX: 4968: vol 'datastore', lock at 10182656: [Req mode: 1] >Not free:
2019-05-08T15:22:28.529Z cpu32:2192401)[type 10c00002 offset 10182656 v 133808, hb offset 3674112 gen 255, mode 1, owner 5cca60c4-0579f659-c9e4-246e963d8fd0 mtime 7306910 num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 3473408
2019-05-08T15:22:28.529Z cpu32:2192401)Res3: 2325: Rank violation threshold reached: cid 0xc1d0000c, resType 1, cnum 75 vol datastore
 
SVmotion Failure

In the virtual machine's vmware.log and the host's vmkernel.log files, you see log entries similar to:

2019-11-05T14:32:06.281Z| vmx| I125: [msg.configdb.open] An error occurred while opening configuration file "/vmfs/volumes/5a9e9fcb-f0e68c87-2cbe-000e1eea48b0/vm/vm": Failed to lock the file.
2019-11-05T14:32:06.281Z| vmx| I125: ----------------------------------------
2019-11-05T14:32:06.281Z| vmx| W115: Migrate: Failed to write out config file

2019-11-05T14:31:57.877Z cpu28:2382713)DLX: 4319: vol 'datastore', lock at 59031552: [Req mode 1] Checking liveness:
2019-11-05T14:31:57.877Z cpu28:2382713)[type 10c00001 offset 59031552 v 2986, hb offset 3407872
gen 32505, mode 1, owner 5d8a2dab-ee894bff-6122-000e1eea48b0 mtime 14405428

 
Note: If you are on an ESXi version earlier than 6.7 U2, or your backend storage array uses an unmap granularity of 1 MB or less, you are NOT impacted by this issue.

Cause

This issue is caused by the automatic unmap manager in VMFS6 repeatedly locking and unlocking the affected resource clusters.

Resolution

This issue is resolved in:

  • VMware vSphere ESXi 6.7 Patch release ESXi670-202004002
  • VMware vSphere 7.0b

The above patches are available for download from your support.broadcom.com account.
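
After applying the patch, you can confirm the running version and build on each host, for example:

esxcli system version get

The reported version and build should correspond to ESXi670-202004002 or later (or to vSphere 7.0b on 7.0 hosts).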



Workaround:
To work around this issue, disable automatic unmap processing on all hosts sharing the volume.

Note: After disabling automatic unmap, you can continue to reclaim space manually using esxcli.
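
For example, a manual reclaim can still be run against the datastore (VolName is a placeholder for your datastore name; --reclaim-unit is optional and controls how many VMFS blocks are reclaimed per iteration):

esxcli storage vmfs unmap --volume-label VolName --reclaim-unit 200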

  1. Run this command from one of the hosts sharing the volume (a verification check is shown after the command):
esxcli storage vmfs reclaim config set --volume-label VolName --reclaim-priority=none
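
To verify the change, you can read the current reclaim configuration back (VolName is a placeholder for your datastore name):

esxcli storage vmfs reclaim config get --volume-label VolName

Reclaim Priority should now report none.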
  2. Unmount and remount the volume on all hosts accessing the volume for the reclaim priority change to take effect (a mount check is shown after the commands):
esxcli storage filesystem unmount -l VolName
esxcli storage filesystem mount -l VolName
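
After remounting, you can confirm that the volume is mounted on each host, for example:

esxcli storage filesystem list

The volume should be listed with Mounted set to true.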
If you are unable to unmount the volume, the reclaim priority will instead need to be toggled on all the hosts accessing the volume, as described in the alternative step below.

Alternative step (in case you are unable to unmount the volume):
Run these commands to toggle the reclaim priority (low, then none) on all the hosts accessing the volume so that the reclaim priority change takes effect (a sketch for applying this across several hosts follows the commands).

esxcli storage vmfs reclaim config set --volume-label VolName --reclaim-priority=low
esxcli storage vmfs reclaim config set --volume-label VolName --reclaim-priority=none
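
As a convenience, the toggle can be scripted from a management workstation over SSH. This is only a sketch: the hostnames esx01 and esx02 and the datastore name VolName are placeholders, and it assumes SSH access to the hosts is enabled.

for host in esx01 esx02; do
  ssh root@$host "esxcli storage vmfs reclaim config set --volume-label VolName --reclaim-priority=low"
  ssh root@$host "esxcli storage vmfs reclaim config set --volume-label VolName --reclaim-priority=none"
done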

 
  3. List the volumes with automatic unmap processing enabled with this command:
vsish -e ls /vmkModules/vmfs3/auto_unmap/volumes/

Notes:
  • No volume whose backend storage array has an unmap granularity greater than 1 MB should appear in the list produced by the above command (a quick per-host check is shown below).
  • If, even after the alternative step, any of the hosts still lists such a volume, fall back to the preferred step 2, which is to unmount and remount the volume in question.
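
To confirm on a given host that the datastore is no longer being processed by automatic unmap, you can filter the listing for the datastore name (VolName is a placeholder; depending on how the volume is identified in the listing, you may need to filter on the volume UUID instead). No matching output is expected once the change has taken effect:

vsish -e ls /vmkModules/vmfs3/auto_unmap/volumes/ | grep VolName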