vMotion failed to complete due to Veeam iofilter failing to finish disk operations
Article ID: 400676

Products

VMware vSphere ESXi

Issue/Introduction

Issue:

ESXi host was placed into maintenance mode.

VMs failed to complete migration, causing the host to fail to enter maintenance mode, and the affected VMs had to be restarted.

 

Environment

ESXi 8.0 U3

Cause

vCenter vpxd.log shows the host entering maintenance mode and VMs failing to migrate:

info vpxd[420038] [Originator@6876 sub=vpxLro opID=186b591f] [VpxLRO] -- BEGIN task-423425 -- host-20 -- vim.HostSystem.enterMaintenanceMode -- 52987cda-1e9e-4dd7-e893-3532728be41d(525b9309-e79e-cdcc-1023-4f8710b8222d)

error vpxd[23719] [Originator@6876 sub=drsExec opID=CdrsLoadBalancer-1f115075-317110a6-01] Failed migrating VM [vim.VirtualMachine:vm-100544,<vm1>] to host vim.HostSystem:host-20
error vpxd[23731] [Originator@6876 sub=drsExec opID=CdrsLoadBalancer-1f115075-7cba67b0-01] Failed migrating VM [vim.VirtualMachine:vm-100543,<vm2>] to host vim.HostSystem:host-29
error vpxd[419862] [Originator@6876 sub=drsExec opID=2a4a2acf-01-0d] Failed migrating VM [vim.VirtualMachine:vm-116004,<vm2>] to host vim.HostSystem:host-42
error vpxd[419820] [Originator@6876 sub=drsExec opID=186b591f-01-01] Failed migrating VM [vim.VirtualMachine:vm-103291,<vm3>] to host vim.HostSystem:host-29
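
These entries can be located on the vCenter Server Appliance; a minimal example, assuming the default vpxd log location (/var/log/vmware/vpxd/):

grep "enterMaintenanceMode" /var/log/vmware/vpxd/vpxd*.log
grep "Failed migrating VM" /var/log/vmware/vpxd/vpxd*.log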

 

ESXi hostd.log shows "Failed migrating VM":

error vpxd[23719] [Originator@6876 sub=drsExec opID=CdrsLoadBalancer-1f115075-317110a6-01] Failed migrating VM [vim.VirtualMachine:vm-100544,<vm1>] to host vim.HostSystem:host-20
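
Because the task opID from vpxd is propagated to the host, the related hostd activity can be found by searching for that opID on the ESXi host; for example, using the opID of the enterMaintenanceMode task above:

grep "186b591f" /var/run/log/hostd.log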

 
In the VM's vmware.log
vcpu-0 closed 3 disks without issue and then got stuck closing scsi0:1.
 

In(05) vmx - Log for VMware ESX pid=2135295 version=8.0.3 build=build-24674464 option=Release
...
In(05) vmx - Hostname=<vm1>
...
In(05) vmx - DICT scsi0:1.fileName = "<vm1>.vmdk"
...
In(05) vcpu-0 - VTHREAD 65891591936 "vcpu-0" wid 2135717
...
In(05) vcpu-0 - Migrate: Preparing to suspend.
In(05) vcpu-0 - Migrate: VM starting stun, waiting 100 seconds for go/no-go message.
...
In(05) vcpu-0 - Closing disk 'scsi3:0'
In(05) vcpu-0 - DISKLIB-CBT : Shutting down change tracking for untracked fid 4143688.
In(05) vcpu-0 - DISKLIB-CBT : Successfully disconnected CBT node.
In(05) vcpu-0 - DISKLIB-VMFS : "vsan://522c836ff7f38e0a-2d4e6def3b6c0f58/e90fa467-a28a-dd82-227e-b49691db2ddc" : closed.
In(05) vcpu-0 - Closing disk 'scsi2:0'
In(05) vcpu-0 - DISKLIB-CBT : Shutting down change tracking for untracked fid 5913159.
In(05) vcpu-0 - DISKLIB-CBT : Successfully disconnected CBT node.
In(05) vcpu-0 - DISKLIB-VMFS : "vsan://522c836ff7f38e0a-2d4e6def3b6c0f58/b30ea467-8e50-074a-23ff-b49691db2ddc" : closed.
In(05) vcpu-0 - Closing disk 'scsi1:0'
In(05) vcpu-0 - DISKLIB-CBT : Shutting down change tracking for untracked fid 6437446.
In(05) vcpu-0 - DISKLIB-CBT : Successfully disconnected CBT node.
In(05) vcpu-0 - DISKLIB-VMFS : "vsan://522c836ff7f38e0a-2d4e6def3b6c0f58/c10ca467-6ec3-fd2d-6ddf-b49691db2ddc" : closed.
In(05) vcpu-0 - Closing disk 'scsi0:1'
In(05) vcpu-0 - DISKLIB-CBT : Shutting down change tracking for untracked fid 10697289.
In(05) vcpu-0 - DISKLIB-CBT : Successfully disconnected CBT node.
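
The disk close sequence can be extracted from the VM's log; a sketch, where the datastore and VM directory names are placeholders for the affected VM's home directory:

grep -E "Closing disk|DISKLIB-CBT|DISKLIB-VMFS" /vmfs/volumes/<datastore>/<vm1>/vmware.log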

 
 
In ESXi vmkernel.log
Based on the messages for the first three disks, "Disconnecting the cbt device" should be followed shortly by messages from FiltModS and then VDFM. That sequence is not seen for the fourth disk (scsi0:1, filehandle 10697289).
 

In(182) vmkernel: cpu10:2135717)CBT: 765: Disconnecting the cbt device 3f3a48-cbt with filehandle 4143688
In(182) vmkernel: cpu10:2135717)FiltModS: 379: Aborted 0 IOs and completed 0 IOs after exit of upcall thread
In(182) vmkernel: cpu10:2135717)VDFM: 1301: Destroying VDFM file node 293a40-vdfm with fid 2701888.
In(182) vmkernel: cpu10:2135717)CBT: 765: Disconnecting the cbt device 5a3a47-cbt with filehandle 5913159
In(182) vmkernel: cpu10:2135717)FiltModS: 379: Aborted 0 IOs and completed 0 IOs after exit of upcall thread
In(182) vmkernel: cpu10:2135717)VDFM: 1301: Destroying VDFM file node 583a3c-vdfm with fid 5782076.
In(182) vmkernel: cpu10:2135717)CBT: 765: Disconnecting the cbt device 623a46-cbt with filehandle 6437446
In(182) vmkernel: cpu10:2135717)FiltModS: 379: Aborted 0 IOs and completed 0 IOs after exit of upcall thread
In(182) vmkernel: cpu10:2135717)VDFM: 1301: Destroying VDFM file node 263a3f-vdfm with fid 2505279.
In(182) vmkernel: cpu10:2135717)CBT: 765: Disconnecting the cbt device a33a49-cbt with filehandle 10697289
...
Wa(180) vmkwarning: cpu10:2097222)WARNING: VMotion: 1134: 8567820442367493545 S: Maximum switchover time (100 seconds) reached. Failing VMotion; VM should resume on source.
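
The kernel-side close sequence and the switchover timeout warning can be pulled from the ESXi host's vmkernel log; for example:

grep -E "CBT:|FiltModS:|VDFM:|VMotion:" /var/run/log/vmkernel.log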

 
In ESXi /var/run/log/shell.log
Because vcpu-0 remained stuck disconnecting the CBT device, the VM had to be killed manually:
 

2025-05-18T15:52:04.209Z In(14) shell[2240556]: [root]: esxcli vm process kill --type=hard --world-id=2135298
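
If a VM remains stuck in this state after the failed switchover, its world ID can be confirmed and the process terminated from the ESXi shell, as was done here. The --type option also accepts soft and force; a hard kill should only be used after a soft kill has failed:

esxcli vm process list
esxcli vm process kill --type=hard --world-id=<world-id>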

 

Investigating the disk issue

In ESXi /var/run/log/veecdp.log
The DiskClose for the stuck disk is started but is never logged as finished, and the filter's connection attempt to 127.0.0.1:33035 fails with error 111 (connection refused) before it falls back to TCP port 33033.

In(158) veecdp[2135295]: [0002135717] 468 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c292-3040-1b35-50de-eac293636d49] [DiskFilter] DiskClose finished
In(158) veecdp[2135295]: [0002135717] 591 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c293-b79c-81ce-b790-2a7842fdc5ec] [DiskFilter] DiskClose started
In(158) veecdp[2135295]: [0002135717] 591 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c293-b79c-81ce-b790-2a7842fdc5ec] [DiskSync] Disable
In(158) veecdp[2135295]: [0002135717] 591 (I) [WorkGroup] WorkGroup wait started, self: 0xeffc11b80
In(158) veecdp[2135295]: [0002135717] 591 (I) [WorkGroup] WorkGroup wait finished, self: 0xeffc11b80
...
In(158) veecdp[2135295]: [0002135717] 595 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c293-b79c-81ce-b790-2a7842fdc5ec] [DAT] Session destroy started, this: 0xf544c9650
In(158) veecdp[2135295]: [0002135717] 595 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c293-b79c-81ce-b790-2a7842fdc5ec] [DAT] Session protocol destroy started, self: 0xf544c9340
In(158) veecdp[2135295]: [0002135717] 595 (I) [WorkGroup] WorkGroup wait started, self: 0xeffc12a60
In(158) veecdp[2135295]: [0002135717] 595 (I) [WorkGroup] WorkGroup wait finished, self: 0xeffc12a60
In(158) veecdp[2135295]: [0002135717] 595 (I) [WorkGroup] WorkGroup destroying, self: 0xeffc12a60
In(158) veecdp[2135295]: [0002135717] 595 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c293-b79c-81ce-b790-2a7842fdc5ec] [Transceiver] [DAT] Destroying, self: 0xf544c8650
Er(155) veecdp[2135295]: [0002135717] 595 (E) [Timer] Failed to remove timer, this: 0xeffc07668, status 'Object not found'
In(158) veecdp[2135295]: [0002135717] 595 (I) [WorkGroup] WorkGroup wait started, self: 0xeffc13200
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [OneOfConnector] [DAT] Begin connecting to SHM:33035
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [ShmSocketConnector] [:33035] Begin SHM connection to 127.0.0.1:33035
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33035] Begin connection attempt to 127.0.0.1:33035
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33035] Disable Nagle algorithm for socket (socket=135)
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33035] Setting send & receive timeout for socket (socket=135)
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33035] Setting keep alive settings for socket (socket=135). Idle time: 60 seconds, retry interval: 5 seconds
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33035] Connecting to 127.0.0.1:33035
Er(155) veecdp[2135295]: [0002135709] 728 (E) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33035] Connection attempt failed with error 111
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [OneOfConnector] [DAT] Connection to SHM:33035 failed: error=111
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [OneOfConnector] [DAT] Begin connecting to TCP :33033
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33033] Begin connection attempt to 127.0.0.1:33033
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33033] Setting buffer size for TCP socket (socket=135, size=65536)
In(158) veecdp[2135295]: [0002135709] 728 (I) {501d141e-37a0-3ef8-475e-4bdc339adb9f} [6000c29f-b403-0d50-eaf7-99c84bdc110a] [TcpConnector] [:33033] Disable Nagle algorithm for socket (socket=135)
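
The stuck close can be traced in the filter's log by searching for the DiskClose entries and the UUID of the affected disk; for example, using the UUID from the excerpt above:

grep "DiskClose" /var/run/log/veecdp.log
grep "6000c293-b79c-81ce-b790-2a7842fdc5ec" /var/run/log/veecdp.log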

 
Cause:
The third-party iofilter (veecdp) did not finish its disk close operation, leaving the VM stuck until the vMotion switchover timeout was reached.
 
 
On the ESXi host, esxcli storage iofilter list shows the installed filter:

IOFilter:
   Name: veecdp
   Vendor: VEE
   Version: 12.3.20-1OEM.800.1.0.20613240
   Description: Veeam CDP IOFilter
   ID: VEE_bootbank_veecdp_12.3.20-1OEM.800.1.0.20613240
   LocalID: veecdp
   Class: replication
   Release Date: 2025-03-07T16:39:47.883348+00:00
   Enabled: Yes
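
The filter package and its version can also be confirmed from the host's VIB inventory; for example:

esxcli software vib list | grep -i veecdp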

Resolution

Solution Recommendation:

Contact Veeam support to investigate why the iofilter's disk close operations did not finish.
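
When engaging vendor support, a log bundle covering the host logs referenced above can be generated on the affected ESXi host; for example:

vm-support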