NetApp LUN goes offline due to array-side snapshot growth despite VMware Thick Provisioning

Products

VMware vCenter Server

Issue/Introduction

An ESXi datastore becomes inaccessible or reports SCSI timeouts, leading to VM outages. While VMware reports logical free space available, the underlying storage array is physically exhausted.

Datastore status appears as "Inaccessible" or "Inactive" in vCenter.
In some scenarios, the ESXi host might enter a "Not Responding" state.
vmkernel.log contains repeated SCSI op-code failures:
- 0x89 (ATS): Failed metadata locks.
- 0xfe (WRITE_SAME): Failed with H:0x5 D:0x0 P:0x0 (host status 0x5: command aborted).
- 0x2a/0x8a (Writes): Failed with H:0x0 D:0x8 P:0x0 (device status 0x8: BUSY).
NMP Throttle logs indicating "last error status repeated X times."

YYYY-MM-DDT20:49:12.918Z In(182) vmkernel: cpu70:2098575)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x89 (0x45dd57ec84c0, 2097288) to dev "naa.#########################" on path "vmhba3:C0:T6:L17" Failed:
YYYY-MM-DDT20:49:13.478Z In(182) vmkernel: cpu70:2098575)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 10 times
YYYY-MM-DDT20:49:14.631Z In(182) vmkernel: cpu74:2098575)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 20 times
YYYY-MM-DDT20:49:16.964Z In(182) vmkernel: cpu74:2098575)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 40 times
YYYY-MM-DDT20:49:20.931Z In(182) vmkernel: cpu77:2097375)ScsiDeviceIO: 4681: Cmd(0x45dd48ddc580) 0xfe, cmdId.initiator=0x430db5696bc0 CmdSN 0x561a77 from world 2097288 to dev "naa.#########################" failed H:0x5 D:0x0 P:0x0 . Cmd count Active:33
YYYY-MM-DDT20:49:20.951Z In(182) vmkernel: cpu72:2098577)ScsiDeviceIO: 4644: Cmd(0x45dd59aeb940) 0x2a, CmdSN 0x80000048 from world 2116310 to dev "naa.#########################" failed H:0x0 D:0x8 P:0x0
YYYY-MM-DDT20:49:20.992Z In(182) vmkernel: cpu0:2098574)ScsiDeviceIO: 4644: Cmd(0x45bd1e5ee600) 0x8a, CmdSN 0x800e0007 from world 2116953 to dev "naa.#########################" failed H:0x0 D:0x8 P:0x0
YYYY-MM-DDT20:49:20.992Z In(182) vmkernel: cpu0:2098574)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 80 times
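
As a rough triage aid, the failed-command lines above can be tallied by opcode and status. The following is a minimal Python sketch, not a VMware tool. The regular expression is an assumption based only on the ScsiDeviceIO line format shown in this article, and the name summarize_failures is hypothetical.

```python
import re
from collections import Counter

# Opcode names for the write-type commands discussed in this article.
# This is not an exhaustive SCSI opcode table.
OPCODES = {
    0x89: "COMPARE AND WRITE (ATS)",
    0x2A: "WRITE(10)",
    0x8A: "WRITE(16)",
}

# Assumed pattern: matches ScsiDeviceIO lines of the form
#   Cmd(0x...) 0x2a, ... to dev "naa...." failed H:0x0 D:0x8 P:0x0
FAILED_CMD = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)'
)


def summarize_failures(lines):
    """Count failed commands per (opcode, host/device/plugin status) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if match is None:
            continue  # not a ScsiDeviceIO failure line
        opcode = int(match.group("op"), 16)
        status = "H:{h} D:{d} P:{p}".format(**match.groupdict())
        counts[(OPCODES.get(opcode, hex(opcode)), status)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < vmkernel.log
    for (opcode, status), count in summarize_failures(sys.stdin).most_common():
        print(f"{count:6d}  {opcode:26s} {status}")
```

A large count of write-type opcodes failing with D:0x8 against a single device is consistent with the symptom described here, but it does not by itself prove that the array is out of space. Confirm on the storage side.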

Environment

VMware Cloud Foundation / vSphere ESXi (All versions)
NetApp ONTAP Storage Array
Thick Provisioned VMDKs

Cause

The NetApp volume reached 100% physical capacity due to array-side snapshot growth. Thick Provisioning at the VMware layer does not prevent this: when a guest overwrites a block that an array-side snapshot still references, the snapshot keeps the original block and the array must allocate a new physical block for the new data.

High write churn from a guest VM can cause these snapshots to grow until they consume all remaining physical space in the volume that hosts the LUN.
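
The arithmetic behind this can be sketched in a few lines of Python. This is an illustrative estimate under simplifying assumptions, not a NetApp sizing formula: the LUN is fully space-reserved, the volume holds nothing else, each GiB of snapshot-protected data that is overwritten consumes one new GiB (no deduplication or compression), and the overwrite rate is constant. The function name and all figures are hypothetical.

```python
def hours_until_full(volume_gib, lun_gib, snapshot_used_gib, overwrite_gib_per_hour):
    """Estimate hours until snapshot growth exhausts the volume.

    overwrite_gib_per_hour is the rate at which blocks still held by a
    snapshot are overwritten for the first time (assumed constant).
    """
    free_gib = volume_gib - lun_gib - snapshot_used_gib
    if free_gib <= 0:
        return 0.0  # volume is already full
    if overwrite_gib_per_hour <= 0:
        return float("inf")  # no churn, snapshots do not grow
    return free_gib / overwrite_gib_per_hour


# Hypothetical figures: 2048 GiB volume, 1536 GiB thick LUN,
# 312 GiB already held by snapshots, 20 GiB/hour of overwrites.
print(hours_until_full(2048, 1536, 312, 20))  # 10.0
```

In this example the datastore still shows whatever free space VMFS tracks inside the 1536 GiB LUN, yet the volume has only 200 GiB of physical headroom and would fill in about ten hours at that churn rate.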

Resolution

Work with your storage team to increase the physical size of the underlying NetApp Volume or LUN. This provides the necessary space for existing array-side snapshots and critical metadata updates to resume.
Once the storage space is expanded, perform a rescan at the cluster level in vCenter. This clears the stuck I/O queue and allows the hosts to reconnect to the datastore.
Coordinate with your Application or OS teams to identify which VM is generating "abnormal writes." Checking for activities like runaway logs or intensive database reindexing will help pinpoint the cause of the sudden growth.
Work with your storage team or vendor to put monitoring and safeguards in place that warn of, or prevent, this condition before it occurs again.
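
One possible form of such a precaution is a threshold check on physical volume usage, evaluated on the array side rather than from the datastore's free space. The Python sketch below is only an illustration: the VolumeUsage type, the classify function, and the 80%/90% thresholds are hypothetical, and the usage figures would have to come from whatever monitoring your array provides.

```python
from dataclasses import dataclass


@dataclass
class VolumeUsage:
    name: str
    size_gib: float
    used_gib: float  # physical usage, including LUN reservation and snapshot blocks


def classify(volume, warn_pct=80.0, critical_pct=90.0):
    """Return 'ok', 'warning', or 'critical' for a volume's physical usage."""
    used_pct = 100.0 * volume.used_gib / volume.size_gib
    if used_pct >= critical_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


# Hypothetical sample data.
for vol in (VolumeUsage("vol_a", 1000, 500), VolumeUsage("vol_b", 1000, 950)):
    print(vol.name, classify(vol))
```

The appropriate thresholds depend on how quickly snapshots can grow in your environment, so they should leave enough time for the storage team to react.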