PostgreSQL cluster application went down while taking the snapshots through Commvault backups

Article ID: 424342

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

The application team received alerts and users were disconnected for a brief period, but the VMs and hosts appeared to be running normally in the vSphere UI.

Environment

VMware vSphere 7.X

VMware vSphere 8.X


Cause

A snapshot operation lasting about 5 seconds on a Windows Failover Cluster host, especially one that involves VM-level quiescing or a "stun", is highly likely to cause a temporary service disruption and may trigger a failover of the PostgreSQL database. While the VM is stunned, the guest does not run, so the node cannot respond to cluster heartbeats.

The 'vmware.log' files showed VM snapshot stun operations during which a VM was frozen for more than 5 seconds. A freeze of more than 1 second can already be problematic.

The log file /vmfs/volumes/<datastore_name>/<vm_name>/vmware.log will show entries similar to the following:

In(05) vcpu-0 - CPT: vm was stunned for 5493800 us
No(00) vcpu-0 - CheckpointTiming total: ALL took 5493775 us
In(05) vcpu-0 - SnapshotVMXTakeSnapshotWork: Transition to mode 1.
In(05) vcpu-0 - SnapshotVMXTakeSnapshotComplete: Done with snapshot '__GX_BACKUP__': 436
In(05) vcpu-0 - VVolObjNotifySnapshotDone: isEnabled: 1
In(05) vcpu-0 - VigorTransport_ServerSendResponse opID=5a2fe8a7-58-7b9d seq=4819298: Completed Snapshot.Take request in 5892043 US.
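
To check a vmware.log file for long stuns, the "vm was stunned for ... us" entries can be extracted and compared with the 1 second guideline mentioned above. The following Python sketch is illustrative only: the function name find_long_stuns and the third sample line are hypothetical, and the regular expression assumes the log line format shown in the excerpt above.

```python
import re

# Matches lines such as: "In(05) vcpu-0 - CPT: vm was stunned for 5493800 us"
# (format assumed from the log excerpt in this article).
STUN_RE = re.compile(r"vm was stunned for (\d+) us")

# Guideline from this article: a stun above 1 second may be problematic.
STUN_THRESHOLD_US = 1_000_000


def find_long_stuns(lines, threshold_us=STUN_THRESHOLD_US):
    """Return (line_number, stun_seconds) for each stun above the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STUN_RE.search(line)
        if match:
            stun_us = int(match.group(1))
            if stun_us > threshold_us:
                hits.append((number, stun_us / 1_000_000))
    return hits


# The first two lines come from the excerpt above; the third is a made-up
# short stun added only to show that it is not reported.
sample = [
    "In(05) vcpu-0 - CPT: vm was stunned for 5493800 us",
    "No(00) vcpu-0 - CheckpointTiming total: ALL took 5493775 us",
    "In(05) vcpu-0 - CPT: vm was stunned for 350000 us",
]

for number, seconds in find_long_stuns(sample):
    print(f"line {number}: stunned for {seconds:.2f} s")
```

To scan a real file, pass an open file object instead of the sample list, for example find_long_stuns(open("vmware.log", errors="replace")).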

Resolution

  1. Use application-aware backup (VSS-aware)

  2. Avoid simultaneous snapshots of all cluster nodes

  3. Increase cluster timeouts if vendor-approved

  4. Schedule backups during low I/O windows

  5. For PostgreSQL, consider database-level backups instead of VM snapshots.
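
As an illustration of option 5, a database-level backup can be taken with the standard pg_dump utility, which does not stun the VM. The sketch below only builds and prints the command. The helper names, database name, output path, and connection defaults are hypothetical placeholders, and a production setup would also need authentication (for example a .pgpass file) and retention handling that are not shown here.

```python
import subprocess


def build_pg_dump_command(dbname, outfile, host="localhost", port=5432, user="postgres"):
    """Build a pg_dump command line for a custom-format logical backup.

    All argument values are placeholders chosen for this example.
    """
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format", "custom",  # compressed archive, restorable with pg_restore
        "--file", outfile,
        dbname,
    ]


def run_backup(command):
    """Run the backup and raise CalledProcessError if pg_dump fails."""
    subprocess.run(command, check=True)


# Hypothetical database name and output path, for illustration only.
command = build_pg_dump_command("appdb", "/backups/appdb.dump")
print(" ".join(command))
```

Calling run_backup(command) executes the dump. It is left out of the example so that the sketch can be read and tested without a running PostgreSQL server.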