VM snapshot consolidation failed with error "An error occurred while consolidating disk: 27 (File too large)".

Products

VMware vCenter Server

Issue/Introduction

When attempting to consolidate snapshots on a virtual machine, the following error appears in the vSphere Client and in the logs:
- An error occurred while consolidating disk: 27 (File too large).
The datastore has enough free space to consolidate all the snapshots.
After a compute vMotion of the VM to another host, the consolidation completes successfully.

From /var/run/log/hostd.log

YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099756]: [Originator@6876 sub=Vimsvc.TaskManager opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] Task Created : haTask-58-vim.VirtualMachine.consolidateDisks-25036333
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/datastore/vm_name/vm_name.vmx opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_OFF -> VM_STATE_CONSOLIDATE_ALL_DISKS)
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotConfigInfoOpenVmsd: Creating new snapshot dictionary, '/vmfs/volumes/datastore/vm_name/vm_name.vmsd.usd'.
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Consolidating from '/vmfs/volumes/datastore/vm_name/hard_disk_1-000001.vmdk' to '/vmfs/volumes/datastore/vm_name/hard_disk_1.vmdk'.
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=DiskLib opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] DISKLIB-CTK   : Could not open change tracking file "/vmfs/volumes/datastore/vm_name/hard_disk_1-ctk.vmdk": Change tracking invalid or disk in use.
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] OBJLIB-FILEBE : Error creating file '/vmfs/volumes/datastore/vm_name/hard_disk_1-ctk.vmdk': 3 (The file already exists).
YYYY-MM-DDTHH:MM:SSZ Wa(164) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] File_GetFreeSpace: Couldn't statfs /vmfs/volumes/datastore/vm_name/hard_disk_1.vmdk
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=DiskLib opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] DISKLIB-LIB   : DiskLib_IsCombinePossible: Could not get free space on disk using /vmfs/volumes/datastore/vm_name/hard_disk_1.vmdk.
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=DiskLib opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] DISKLIB-LIB_CHAINMODIFY   : Failed to combine : File too large (1769481).
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Failed to combine: File too large (1769481).
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotConsolidate failed: File too large (5)
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: Snapshot_Consolidate failed: File too large (5)
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099751]: [Originator@6876 sub=Libs opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] SnapshotVigorConsolidate: Failed to consolidate: File too large (5)
YYYY-MM-DDTHH:MM:SSZ Db(167) Hostd[2099751]: [Originator@6876 sub=Vigor.Vmsvc.vm:/vmfs/volumes/datastore/vm_name/vm_name.vmx opID=m90r4x03-27366942-auto-gakgv-h5:72017500-91-c0-c63d sid=52adbcc1 user=vpxuser:VSPHERE.LOCAL\Administrator] Consolidate Disks message: An error occurred while consolidating disks: File too large.
YYYY-MM-DDTHH:MM:SSZ Db(167) Hostd[2099685]: -->
YYYY-MM-DDTHH:MM:SSZ In(166) Hostd[2099728]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 325627 : Virtual machine vm_name disks consolidation failed on <ESXi_host_name> in cluster <Cluster_name> in ha-datacenter.
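
To locate these entries quickly, the hostd log can be filtered for the failure strings quoted above. The following is a minimal sketch rather than an official tool: the function name `find_consolidate_failures` is hypothetical, and the patterns are taken only from the sample log lines in this article, so other builds may word the messages differently.

```shell
# Hypothetical helper: print hostd log lines that match the consolidation
# failure signatures quoted in this article. Reads the log from stdin.
find_consolidate_failures() {
    grep -E 'Failed to combine ?: File too large|disks consolidation failed'
}

# Example usage on an ESXi host (log path as given in this article):
#   find_consolidate_failures < /var/run/log/hostd.log
```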

From /var/run/log/vmkwarning.log, where StorageFPIN repeatedly fails to allocate memory:

YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu5:2097669)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu5:2097669)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu1:2097669)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu2:2097669)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu1:2097669)WARNING: StorageFPIN: 521: Failed to allocate memory.
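
As a rough way to gauge how often the warning occurs, the matching lines can be counted. This is a sketch under assumptions: the function name `count_fpin_alloc_failures` is hypothetical, and the pattern is based only on the sample lines above.

```shell
# Hypothetical helper: count StorageFPIN allocation failures in a
# vmkwarning log supplied on stdin. Prints 0 when nothing matches.
count_fpin_alloc_failures() {
    # grep -c exits non-zero on zero matches; "|| true" keeps the
    # function usable in scripts that run with "set -e".
    grep -c 'StorageFPIN: .*Failed to allocate memory' || true
}

# Example usage on an ESXi host (log path as given in this article):
#   count_fpin_alloc_failures < /var/run/log/vmkwarning.log
```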

You can check the available FPIN heap with the following command. A healthy host will show around 5246448 bytes available, but an impacted host will show significantly less free space, sometimes 16k bytes or less.

esxcfg-info -a |grep -A3 storageFPINHeap|grep "Max Available"

Examples:

Host-1 has run out of FPIN heap:
|----Max Available...................................416 bytes

Host-2 has not run out of FPIN heap:
|----Max Available...................................3219872 bytes
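
If several hosts need checking, the value can be extracted and compared against a threshold. The sketch below assumes the output format shown in the examples above. The function names are hypothetical, and the default threshold of 16384 bytes is an assumption derived from the "16k bytes or less" figure, not an official limit.

```shell
# Extract the byte count from an esxcfg-info "Max Available" line on stdin.
fpin_heap_bytes() {
    sed -n 's/.*Max Available\.*\([0-9][0-9]*\) bytes.*/\1/p' | head -n 1
}

# Classify a byte count against a threshold. The default of 16384 bytes
# is an illustrative assumption, not a documented limit.
fpin_heap_status() {
    bytes=$1
    threshold=${2:-16384}
    if [ "$bytes" -le "$threshold" ]; then
        echo "LOW"
    else
        echo "OK"
    fi
}

# Example usage on an ESXi host:
#   bytes=$(esxcfg-info -a | grep -A3 storageFPINHeap | fpin_heap_bytes)
#   fpin_heap_status "$bytes"
```

With the example values above, Host-1 (416 bytes) is reported as LOW and Host-2 (3219872 bytes) as OK.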

Environment

VMware vSphere 8.0 U3 or later.

Cause

The FPIN (Fabric Performance Impact Notifications) capability was added in ESXi 8.0 U2 to help diagnose fabric-related issues.
Due to a bug in the StorageFPIN code, the FPIN heap can run out, and FPIN then fails when it tries to allocate memory.

Resolution

Patch your ESXi host to 8.0 P05 (8.0 U3e) or later.

Alternatively, engage your storage vendor to check whether the FPIN setting can be disabled on the ESXi host, then apply the workaround:
1. Place the host into Maintenance Mode.
2. Reboot the host.
  1. If you are using ESXi 8.0 U3 and have already run esxcli storage fpin info set -e false, only a host reboot is needed.
  2. If you are using ESXi 8.0 U2, you will need to reapply the workaround:
    ```
    esxcli storage fpin info set -e false
    ```
3. To check that the setting applied:
  ```
  esxcli storage fpin info get
  ```
4. Take the host out of Maintenance Mode.

Additional Information

Reference article:

Temporary/transient storage path loss on ESXi 8.0 could result in paths not coming back when using Cisco UCS and NFNIC