A PostgreSQL database’s underlying disk may be automatically remounted read-only by the kernel if it detects a filesystem inconsistency. When this occurs, PostgreSQL cannot accept writes until the issue is resolved.
At a high level:
The PostgreSQL database appears stuck in the InProgress state.
PostgreSQL write operations fail.
On the affected workload cluster node:
The filesystem is remounted read-only, with kernel log entries such as: EXT4-fs error: Detected aborted journal
If the disk is healthy: the command returns nothing (empty output).
If the disk is corrupted: it prints lines such as "EXT4-fs error" or "aborted journal".
Ext4 journal on the PostgreSQL data volume (/dev/sdc) is aborted or corrupted.
Write operations fail even when the filesystem shows as read-write (rw mount).
Kernel logs repeatedly report journal or I/O errors.
VMware Data Services Manager (DSM) and PostgreSQL
This issue can be triggered by a variety of factors:
Temporary I/O interruptions at the vSphere / CSI driver layer.
Unclean VM shutdown or reboot during active writes.
Heavy write load or sudden PostgreSQL writes during transient storage errors.
Rarely, underlying hardware or disk failure may contribute.
- SSH access to the workload cluster control plane nodes and the provider VM
fsck can only recover what it can reconcile from the journal.
1. Backups first
* Always take a snapshot of the underlying volume (VMware snapshot or storage snapshot) before running fsck.
* This lets you roll back if the repair causes unexpected corruption.
2. Use WAL + replication for PostgreSQL
* Ensure all WAL segments are replicated to a standby before performing repairs.
* This minimizes data loss if fsck discards uncommitted changes.
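Before repairing, you can confirm that standbys have replayed all WAL. A minimal sketch, assuming psql access to the primary (connection options and the guard are illustrative):

```shell
# Verify standby replication is caught up before any repair.
# Compare replay_lsn on each standby against the primary's current WAL position.
if command -v psql >/dev/null 2>&1; then
  psql -X -c "SELECT application_name, state, replay_lsn FROM pg_stat_replication;" \
    || echo "could not query pg_stat_replication; check connection options"
  psql -X -c "SELECT pg_current_wal_lsn();" \
    || echo "could not query current WAL position"
fi
```

If replay_lsn on the standby matches pg_current_wal_lsn() on the primary, no committed changes exist only on the damaged volume.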
3. Perform during maintenance window
* Draining the node and performing fsck should be done during planned maintenance to reduce production impact.
4. Post-repair validation
* After fsck, check PostgreSQL logs for missing or corrupt files.
* Run consistency checks (pg_checksums if enabled, or application-level sanity checks).
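If data checksums were enabled at initdb time (PostgreSQL 12+), an offline verification pass can be run while the server is stopped. The data directory path below is an assumption; adjust it for your deployment:

```shell
# Offline checksum verification; the server must NOT be running.
PGDATA=${PGDATA:-/var/lib/postgresql/data}   # assumption: adjust to your data directory
if command -v pg_checksums >/dev/null 2>&1 && [ -d "$PGDATA" ]; then
  pg_checksums --check -D "$PGDATA" \
    || echo "checksum verification failed or checksums are not enabled"
fi
```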
Fetch the SSH key from the provider VM:
SSH to a control plane node:
List mounted volumes to confirm the PostgreSQL data disk:
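A sketch of the inspection, assuming the data disk is /dev/sdc as described above:

```shell
# Show block devices and where the PostgreSQL data disk is mounted.
lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT || true
# Check the mount flags for the data volume ("ro" means read-only):
mount | grep -w sdc || echo "sdc not found in mount output"
```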
Check kernel messages for journal errors:
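For example, filtering the kernel ring buffer for the error strings mentioned earlier:

```shell
# Look for ext4 journal aborts or I/O errors in the kernel log.
dmesg 2>/dev/null | grep -iE 'EXT4-fs error|aborted journal|I/O error' \
  || echo "no filesystem errors found in dmesg"
```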
Confirm write failures:
Expected: the write fails if the filesystem is read-only.
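A minimal write probe, where MOUNTPOINT is a placeholder to replace with the actual data volume mount path:

```shell
# Attempt a test write on the PostgreSQL data mount.
MOUNTPOINT=${MOUNTPOINT:-/tmp}   # placeholder: substitute the real mount path
if touch "$MOUNTPOINT/.rw-probe" 2>/dev/null; then
  rm -f "$MOUNTPOINT/.rw-probe"
  echo "filesystem is writable"
else
  echo "write failed: filesystem is read-only or returning I/O errors"
fi
```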
Wait for all pods to be evicted; ignore DaemonSet-managed pods.
Note: If kubectl commands hang or fail, the node hosting the API server and/or etcd has likely itself become read-only. This raises the risk to cluster state: if the corruption affects the etcd data directory, fsck may delete corrupted write-ahead logs (WAL). On a single-node control plane, or if quorum is lost, the cluster state may be unrecoverable without an etcd snapshot restore, and in-flight API requests are at risk of data loss. Proceed with the following steps, skipping the kubectl commands, only if that data loss is acceptable; otherwise contact DSM support for manual intervention.
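Assuming kubectl access is available, the cordon-and-drain described above might look like the following (the node name is a placeholder):

```shell
# Cordon and drain the affected node; DaemonSet pods are intentionally ignored.
NODE=${NODE:-my-node}   # placeholder: replace with the affected node name
if command -v kubectl >/dev/null 2>&1; then
  kubectl cordon "$NODE" || echo "cordon failed for $NODE"
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m \
    || echo "drain failed for $NODE"
fi
```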
This prevents automatic remounts by the CSI driver or kubelet while the disk is being repaired.
Note: If you could not perform Step 3 (Drain) because kubectl is not available, you must also stop the container runtime to release disk locks:
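On a systemd-managed node, stopping kubelet and the container runtime might look like this (the service names kubelet and containerd are assumptions; adjust for your runtime):

```shell
# Stop kubelet and the container runtime so nothing holds or re-mounts
# the volume during repair. Guarded so it is a no-op where kubelet is not running.
if command -v systemctl >/dev/null 2>&1 && systemctl is-active --quiet kubelet; then
  systemctl stop kubelet containerd
fi
```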
-f forces a check even if the filesystem is marked clean.
-y automatically answers yes to all repair prompts.
Typical output may include journal recovery and inode repair messages. Ensure fsck completes successfully and reports the filesystem as clean (exit status 0, or 1 if errors were corrected).
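A guarded sketch of the repair, assuming the data volume is /dev/sdc as identified earlier. This is destructive (fsck -y may discard unrecoverable journal entries), so the snapshot from the backup step must exist first; the CONFIRM variable is an illustrative safety latch:

```shell
# Unmount and repair the data volume. Destructive: snapshot the volume first.
DEV=${DEV:-/dev/sdc}        # confirm this is the PostgreSQL data disk
CONFIRM=${CONFIRM:-no}      # set CONFIRM=yes only after verifying device and snapshot
if [ "$CONFIRM" = yes ] && [ -b "$DEV" ]; then
  umount "$DEV" 2>/dev/null || true   # filesystem must be unmounted before fsck
  fsck -f -y "$DEV"
  echo "fsck exit status: $? (0 = no errors, 1 = errors corrected)"
fi
```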
7. Restart the VM
A reboot is required to clear any stuck processes still holding the device and to ensure the kernel re-reads the partition table cleanly.
After these steps, trigger a reconcile of the PostgreSQL resource and it should become Ready.
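One common way to force an operator to re-reconcile is to change an annotation on the custom resource. The resource kind (postgres), name (my-database), and annotation key below are all hypothetical; check the DSM documentation for the exact resource kind and any supported reconcile mechanism:

```shell
# Hypothetical reconcile trigger via annotation churn; names are placeholders.
if command -v kubectl >/dev/null 2>&1; then
  kubectl annotate postgres my-database \
    dsm.example/reconcile-trigger="$(date +%s)" --overwrite \
    || echo "annotate failed; verify the resource kind and name for your DSM version"
fi
```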