A new feature is available to ensure that the /storage/dblog partition does not fill up, which could otherwise cause an outage.
Symptoms:
WAL (Write-Ahead Log) files, also called transaction logs, are critical for database durability. No transaction is committed until all data has been written and flushed to disk. The WAL data associated with the commit of a transaction must be flushed before the data that tracks the commit on disk. If a crash (host powered off, etc.) happens while the WAL data is being flushed, the follow-up recovery considers the transaction aborted. If a crash happens after the WAL data of the transaction has been flushed, the transaction is considered committed, even if the commit has not been reported back to the client application.
If the vPostgres service is unable to write WAL, VCSA will crash and will be unable to restart until disk space is cleared. WAL is usually used in two situations:
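The ordering rule above can be illustrated with a toy log. This is only a minimal sketch, not how vPostgres or PostgreSQL implement WAL: the JSON-lines record format and the helper names append_record, recover and demo are assumptions made up for this example. Each record is flushed and fsynced before the call returns, and recovery treats a transaction as committed only if its commit record is complete in the log.

```python
import json
import os
import tempfile


def append_record(log_path, record):
    """Append one record and force it to stable storage before returning."""
    with open(log_path, "ab") as log:
        log.write(json.dumps(record).encode("utf-8") + b"\n")
        log.flush()             # push Python's buffer to the operating system
        os.fsync(log.fileno())  # ask the operating system to write it to disk


def recover(log_path):
    """Return the ids of transactions whose commit record is complete."""
    committed = set()
    with open(log_path, "rb") as log:
        for line in log:
            if not line.endswith(b"\n"):
                break  # torn final record: the crash hit during the flush
            try:
                record = json.loads(line)
            except ValueError:
                break  # unreadable record: stop replaying here
            if record.get("type") == "commit":
                committed.add(record["txid"])
    return committed


def demo():
    """Transaction 1 commits; transaction 2 crashes mid-flush (hypothetical data)."""
    log_path = os.path.join(tempfile.mkdtemp(), "toy_wal.log")
    append_record(log_path, {"type": "insert", "txid": 1, "value": "a"})
    append_record(log_path, {"type": "commit", "txid": 1})
    append_record(log_path, {"type": "insert", "txid": 2, "value": "b"})
    # Simulate a crash while the commit record of transaction 2 is being written.
    with open(log_path, "ab") as log:
        log.write(b'{"type": "commit", "tx')
    return recover(log_path)
```

In this sketch, transaction 1 is recovered as committed because its commit record reached the log in full, while transaction 2 is treated as aborted because its commit record is torn.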
To ensure maximum safety, we have a dedicated process called pg_archiver that connects to the PostgreSQL instance and, using the replication protocol, streams all the WAL data to a dedicated filesystem called archivelog (located at /storage/archive), making a copy of all WAL data.
This copy is stored so as to keep as much history as possible: the dedicated archivelog filesystem is kept as full as possible, while making sure that it is never completely full.
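The "as full as possible, but never full" approach can be sketched as a simple pruning rule. This is an illustration under stated assumptions, not the actual pg_archiver logic: the function name segments_to_prune, the max_fill threshold of 0.9 and the oldest-first ordering are all hypothetical choices made for this example.

```python
def segments_to_prune(segments, capacity_bytes, max_fill=0.9):
    """Pick the oldest archived segments to delete.

    segments is a list of (name, size_bytes) tuples ordered oldest first.
    max_fill is a hypothetical threshold: the fraction of the filesystem
    that the archive is allowed to occupy.
    """
    limit = capacity_bytes * max_fill
    used = sum(size for _, size in segments)
    to_delete = []
    for name, size in segments:
        if used <= limit:
            break  # enough room again: keep the rest of the history
        to_delete.append(name)
        used -= size
    return to_delete
```

For example, with three 16-byte segments on a 40-byte filesystem and the default threshold, only the oldest segment is selected, which keeps the most recent history while leaving free space.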
The main reason for making that copy is to ensure that all WAL required for a consistent physical backup can be backed up without the risk of saturating the dblog partition.
In case of a hardware problem, this WAL can also help to recover corrupted data, and it may include critical data for tracking down the origin of corruption problems, which is useful for debugging.
However, having a copy of the WAL is less critical than service availability. That's why we have another process, implemented in a dedicated background worker provided by our healthstat extension. Among other things, this process checks whether pg_archiver is failing to stream the WAL fast enough and, if so, takes all necessary actions to ensure that WAL does not accumulate in the dblog filesystem, where it could otherwise cause a service outage. The actions it will take are: