API Gateway: MySQL partition is full and MySQL won't start/operate, Gateway is "down"



Article ID: 44691




CA API Gateway


This article discusses what to do when an API Gateway appliance disk is full, in particular when MySQL audits and binlogs have filled up the disk.

Issue/behaviour observed:

  • MySQL partition of the API Gateway appliance disk is nearly or completely full
  • MySQL has crashed/stopped due to the disk being full
  • API Gateway is "down" because the MySQL database is not operating due to the disk being full


This article applies to all supported API Gateway appliance versions.


The MySQL partition can fill up for a few different reasons. Two of the most common are usually seen together in a domino effect:

  • Audits fill up the database; as the database grows, so does MySQL's disk usage, and this often also breaks replication
  • Once MySQL replication breaks, the binlogs grow faster and fill up any remaining disk space
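
To confirm which partition is full and which files are consuming the space, a quick check along these lines can be run first (the /var/lib/mysql path is the usual Gateway data directory; the fallback to / is only there to keep the sketch usable on systems where that path does not exist):

```shell
# Show how full the partition holding the MySQL data directory is.
# /var/lib/mysql is the usual location on a Gateway appliance.
df -h /var/lib/mysql 2>/dev/null || df -h /

# List the largest items in the data directory -- runaway binlogs
# (ssgbin-log.NNNNNN) and relay logs (ssgrelay-bin.NNNNNN) will
# surface at the bottom of this sorted list.
du -sh /var/lib/mysql/* 2>/dev/null | sort -h | tail -n 20
```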


Short-term / immediate fix

Note: The steps below should only be followed when the disk is completely full and MySQL is not operational, or when otherwise instructed by Broadcom Support. 

  1. On both nodes in the cluster, perform the following steps:
    1. Stop the MySQL service: service mysqld stop
    2. Remove the binary and relay log files with this command: find /var/lib/mysql -type f -regextype posix-extended -regex ".*[0-9]{6}" -exec rm {} \;
      • Note: The files being removed are located in the /var/lib/mysql directory and will be files such as ssgbin-log.* and ssgrelay-bin.*
    3. 'Reset' the four index & info files with the following commands:
      • cat /dev/null > /var/lib/mysql/ssgbin-log.index
        cat /dev/null > /var/lib/mysql/ssgrelay-log.index
        cat /dev/null > /var/lib/mysql/ssgrelay-bin.index
        cat /dev/null > /var/lib/mysql/ssgrelay-bin.info
    4. Verify that the user and group ownership of these files is mysql:mysql and not root:root: chown mysql:mysql <fileName>
      • Be sure to replace <fileName> with the actual names of any files inadvertently owned by root:root
      • If all files are already owned by mysql:mysql, no chown command needs to be run
    5. Start the MySQL service: service mysqld start
  2. Reinitialize replication between the MySQL database nodes
    1. It is recommended to purge audits from the database before reinitializing replication; this frees additional disk space and makes the reinitialization quicker. This can be done by following the self-service KB article on Removing Audit Records from the Gateway database in a multi-node cluster without downtime
    2. Now follow the dedicated self-service KB article on Reinitializing Replication in a Multi-node Cluster to re-establish replication between the database nodes
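
The find command in step 1.2 deletes every file in /var/lib/mysql whose name ends in exactly six digits. A cautious way to see what it would remove is to swap -exec rm for -print first. The sketch below illustrates the matching behaviour against a scratch directory rather than the live data directory (the file names are examples in the Gateway naming pattern):

```shell
# Dry-run illustration of the step-1.2 cleanup pattern on a scratch
# directory (NOT the live /var/lib/mysql). Files ending in exactly
# six digits (binlogs and relay logs) match; index and data files
# such as ssgbin-log.index and ibdata1 are left alone.
tmp=$(mktemp -d)
touch "$tmp/ssgbin-log.000001" "$tmp/ssgrelay-bin.000002" \
      "$tmp/ssgbin-log.index"  "$tmp/ibdata1"

# Same predicate as the article's command, but -print instead of rm:
find "$tmp" -type f -regextype posix-extended -regex ".*[0-9]{6}" -print

# Once the printed list looks right, the destructive form replaces
# -print with:  -exec rm {} \;
rm -rf "$tmp"
```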

Long-term prevention

  1. Ensure that the audit_purge.sh script is being used in your environment
    • You may need to run this script more frequently too
  2. Ensure that the manage_binlog.sh script is being used in your environment
    • You may need to run this script more frequently too
  3. If this is a production environment or any other critical environment, ensure that audits are disabled (ideally), or, if audits are absolutely required, ensure that our best practices are followed closely
    • Best practices include keeping the audit level at SEVERE or WARNING at the lowest; anything lower will flood the database with audits
  4. If auditing is required, send the audits to a dedicated syslog server located within the same data centre. If they must be saved to a database, that database should be external to the API Gateway nodes so it can be better managed by a dedicated MySQL DB administrator
  5. Optional: Broadcom Services can be engaged to perform a thorough audit of your systems to ensure they are configured for optimal performance and in a way that will avoid this issue in the future
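
The audit_purge.sh and manage_binlog.sh scripts from steps 1 and 2 are typically scheduled via cron. The install paths below are illustrative assumptions only; confirm where the scripts actually live on your Gateway version before scheduling them:

```shell
# Illustrative crontab entries (paths and schedules are assumptions --
# verify the script locations on your appliance first).

# Purge old audit records nightly at 01:30:
30 1 * * * /opt/SecureSpan/Appliance/bin/audit_purge.sh

# Trim binary logs every 6 hours:
0 */6 * * * /opt/SecureSpan/Appliance/bin/manage_binlog.sh
```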