Receiving "closed with dirty buffers. Possible data loss." alerts for NFS datastores.

Article ID: 424745

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms 

  • The virtual machine experienced connectivity issues in an HPE SimpliVity environment that uses NFS as the front-end and the smartpqi driver as the back-end of the storage I/O path. Storage is accessed by the ESXi host through an NFS mount point, with the smartpqi driver handling I/O operations at the physical controller level.
  • Multiple virtual machines transitioned to an Inaccessible state in the vCenter Server.
  • ESXi hosts reported the following events under Monitor > Events.



    ESXi host logs reported the following alerts:

     
     vmkernel: cpu0:28180886)0x4539b351bec0:[0x420001e03e1e]UserObj_FDClose@vmkernel#nover+0xbb stack: 0x0, 0x43276d802010, 0x4539b351bf40, 0x0, 0x3
     vmkernel: cpu0:28180886)0x4539b351bf00:[0x420001dc84d2]LinuxFileDesc_Close@vmkernel#nover+0x1b stack: 0x0, 0x0, 0x3, 0x0, 0x0
     vmkernel: cpu0:28180886)0x4539b351bf10:[0x420001dd333c]User_LinuxSyscallHandler@vmkernel#nover+0xd9 stack: 0x3, 0x0, 0x0, 0x420001ea10c7, 0x103
     vmkernel: cpu0:28180886)0x4539b351bf40:[0x420001ea10c6]gate_entry@vmkernel#nover+0xa7 stack: 0x0, 0x3, 0xa9b5d7f96b, 0x1, 0xa9766ee7d0
     vmkalert: cpu0:28180886)ALERT: BC: 3042: File host-3216-hb closed with dirty buffers. Possible data loss.
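
    To confirm whether an affected host logged the same alert, the live vmkernel log can be searched from the ESXi shell. This is a general example that assumes the default log locations under /var/log; the paths can differ if logging has been redirected to a remote syslog server or a scratch location.

     # Search the vmkernel log for the buffer-cache alert shown above
     grep "closed with dirty buffers" /var/log/vmkernel.log
     # WARNING and ALERT entries are also collected in the vmkwarning log
     grep "closed with dirty buffers" /var/log/vmkwarning.log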

    The smartpqi device driver was reporting continuous aborts prior to the issue.

    vmkwarning: cpu13:2098252)WARNING: smartpqi: pqisrc_taskMgmt:1832 :[18:0.0][B:T:L 1:0:1]:TMF virt reset is failed with status : -1
    vmkernel: cpu13:2098252)smartpqi01: pqisrc_taskMgmt:1816: TMF virt reset Issued for CMD : 2a B:T:L 1:0:1 with req tag : 0x1dc cmd tag 0x7fc
    vmkwarning: cpu13:2098252)WARNING: smartpqi: pqisrc_taskMgmt:1832 :[18:0.0][B:T:L 1:0:1]:TMF virt reset is failed with status : -1
    vmkernel: cpu13:2098252)smartpqi01: pqisrc_taskMgmt:1846: (virt reset) DONE
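
    On a host showing this pattern, the smartpqi task-management activity can be reviewed from the ESXi shell. These are illustrative commands assuming default log locations; the exact message text can vary between driver releases.

     # Task-management warnings (aborts/resets) raised by the smartpqi driver
     grep -i smartpqi /var/log/vmkwarning.log
     # Full smartpqi message history, including the "TMF virt reset" entries
     grep -i smartpqi /var/log/vmkernel.log | grep -i "virt reset"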


    Continuous heartbeat failures and lost-connection errors were reported in the vobd logs, indicating a loss of connectivity.

    Hostd[11866817] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 229879 : Lost access to volume 61386e5c-####-####-####(datastore-####) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
    vobd[2097666]  [vmfsCorrelator] 7527998493616us: [esx.problem.vmfs.heartbeat.recovered] 61386e5c-####-####-####datastore-xxxx
    vobd[2097666]  [vmfsCorrelator] 7527924835979us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server omni.cube.io mount point ####-####-####-####-####, mounted as 37b66ae5-####-####-####("ESXihostname")
    vmkernel: cpu0:2097549)StorageApdHandlerEv: 120: Device or filesystem with identifier [37b66ae5-d0798f0a] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
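
    The NFS disconnect, heartbeat, and All Paths Down events can be cross-checked, and the current state of the NFS mounts verified, with the commands below. This example assumes NFS v3 datastores; for NFS 4.1 mounts use esxcli storage nfs41 list instead.

     # NFS server disconnect and heartbeat events recorded by vobd
     grep -E "nfs.server.disconnect|vmfs.heartbeat" /var/log/vobd.log
     # All Paths Down entry and timeout messages in the vmkernel log
     grep "All Paths Down" /var/log/vmkernel.log
     # Current accessibility of the NFS datastores on this host
     esxcli storage nfs list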

    The device was reporting continuous I/O latency fluctuations, with repeated warnings of performance degradation followed by brief recovery events.

     vmkernel: cpu10:2098226)ScsiDeviceIO: 1780: Device naa.#### performance has improved. I/O latency reduced from 1008973 microseconds to 120544 microseconds.
     vmkwarning: cpu0:2098219)WARNING: ScsiDeviceIO: 1780: Device naa.#### performance has deteriorated. I/O latency increased from average value of 7547 microseconds to 1124595 microseconds.
     vmkernel: cpu17:2098222)ScsiDeviceIO: 1780: Device naa.#### performance has improved. I/O latency reduced from 1124595 microseconds to 164883 microseconds.
     vmkwarning: cpu1:2098219)WARNING: ScsiDeviceIO: 1780: Device naa.#### performance has deteriorated. I/O latency increased from average value of 10667 microseconds to 922829 microseconds.
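
    The frequency of the latency warnings and the point-in-time statistics for the affected device can be checked as shown below. Replace naa.#### with the real device identifier; the placeholder is carried over from the masked log entries above.

     # Latency deterioration/improvement messages for the device
     grep "ScsiDeviceIO" /var/log/vmkernel.log | grep "naa.####"
     # Current I/O statistics for the device
     esxcli storage core device stats get -d naa.####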

Environment

VMware vSphere (All Versions)

Cause

The smartpqi driver began issuing continuous abort commands, which led to a communication breakdown between the ESXi storage stack and the physical storage controller. The controller was unable to process I/O requests, resulting in severe I/O latency spikes. These latency conditions ultimately caused the storage paths to fail and triggered an All Paths Down (APD) state.

As a result of the abrupt loss of storage connectivity, the ESXi host was forced to close active file handles while write operations were still pending in memory (dirty buffers). Because these buffers could not be flushed to persistent storage before the paths were lost, ESXi raised the "closed with dirty buffers. Possible data loss." alert.

Resolution

To validate the root cause of the continuous aborts reported by the smartpqi driver, contact the hardware vendor for further investigation.
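
Before engaging the vendor, the installed smartpqi driver version and the adapters bound to it can be collected from the affected host and included in the support request. These are illustrative commands; the exact VIB name carrying the driver can vary between releases.

    # Installed driver VIBs related to smartpqi
    esxcli software vib list | grep -i pqi
    # Storage adapters and the driver bound to each vmhba
    esxcli storage core adapter list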

Additional Information

Receiving "closed with dirty buffers. Possible data loss." alerts for VMFS datastores.