The application team received alerts and users were briefly disconnected, but the VMs and hosts appeared to be running normally in the vSphere UI.
VMware vSphere 7.X
VMware vSphere 8.X
A snapshot operation lasting 5 seconds on a Windows Failover Cluster node, especially one that involves VM-level quiescing or a "stun," is highly likely to cause a temporary service disruption and potentially a failover of a PostgreSQL database.
The 'vmware.log' files show VM snapshot stun operations during which the VM was frozen for more than 5 seconds. A stun of more than 1 second can already be problematic.
The log file /vmfs/volumes/<datastore_name>/<vm_name>/vmware.log will show entries similar to the following:
In(05) vcpu-0 - CPT: vm was stunned for 5493800 us
No(00) vcpu-0 - CheckpointTiming total: ALL took 5493775 us
In(05) vcpu-0 - SnapshotVMXTakeSnapshotWork: Transition to mode 1.
In(05) vcpu-0 - SnapshotVMXTakeSnapshotComplete: Done with snapshot '__GX_BACKUP__': 436
In(05) vcpu-0 - VVolObjNotifySnapshotDone: isEnabled: 1
In(05) vcpu-0 - VigorTransport_ServerSendResponse opID=5a2fe8a7-58-7b9d seq=4819298: Completed Snapshot.Take request in 5892043 US.
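The stun duration reported in the "CPT: vm was stunned for ... us" entries can be pulled out of a copy of vmware.log with a short script. The following is a minimal sketch, not an official tool: the function name find_long_stuns and the 1-second default threshold are illustrative assumptions, and the regular expression assumes the message format shown in this article.

```python
import re

# Assumes the message format shown in this article:
# "... CPT: vm was stunned for 5493800 us"
STUN_RE = re.compile(r"vm was stunned for (\d+) us")


def find_long_stuns(lines, threshold_us=1_000_000):
    """Return stun durations (microseconds) longer than threshold_us.

    find_long_stuns and the 1-second default are hypothetical helpers
    for illustration only.
    """
    durations = []
    for line in lines:
        match = STUN_RE.search(line)
        if match:
            duration = int(match.group(1))
            if duration > threshold_us:
                durations.append(duration)
    return durations


# Sample lines copied from the log excerpt in this article.
sample = [
    "In(05) vcpu-0 - CPT: vm was stunned for 5493800 us",
    "In(05) vcpu-0 - SnapshotVMXTakeSnapshotWork: Transition to mode 1.",
]

for us in find_long_stuns(sample):
    print(f"stun of {us / 1_000_000:.2f} s")
```

To check a real log, copy vmware.log off the datastore and pass the open file object to find_long_stuns in place of the sample list.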
Use application-aware backup (VSS-aware)
Avoid simultaneous snapshots of all cluster nodes
Increase cluster timeouts if vendor-approved
Schedule backups during low I/O windows
For PostgreSQL, consider database-level backups instead of VM snapshots.
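The recommendation to avoid simultaneous snapshots of all cluster nodes comes down to giving each node its own backup start time. The sketch below only computes such a staggered schedule. The node names, the start time, and the 30-minute gap are hypothetical values, and the result would still have to be entered into whatever backup product is in use.

```python
from datetime import datetime, timedelta


def staggered_schedule(nodes, first_start, gap_minutes=30):
    """Assign each node a start time gap_minutes after the previous one.

    Hypothetical helper for illustration; it does not talk to vSphere
    or to any backup product.
    """
    gap = timedelta(minutes=gap_minutes)
    return {node: first_start + i * gap for i, node in enumerate(nodes)}


# Hypothetical node names and backup window.
schedule = staggered_schedule(
    ["cluster-node-a", "cluster-node-b"],
    datetime(2024, 1, 1, 1, 0),
)

for node, start in schedule.items():
    print(node, start.strftime("%H:%M"))
```

The gap should be longer than the time a single node's snapshot and backup normally take, so that two nodes are never stunned in the same window.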