API Gateway: MySQL partition is full and MySQL won't start/operate, Gateway is "down"

Article ID: 44691

Products

CA API Gateway

Issue/Introduction

This article describes what to do when an API Gateway appliance disk is full, in particular when audit records and MySQL binary logs (binlogs) have filled the disk.

Issue/behaviour observed:

  • MySQL partition of the API Gateway appliance disk is nearly or completely full
  • MySQL has crashed/stopped due to the disk being full
  • API Gateway is "down" because the MySQL database is not operating due to the disk being full
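
To confirm the symptoms above, the fill level of the MySQL partition can be checked from a shell on the appliance. The following is a minimal sketch, not a Broadcom-supplied tool: the function names partition_usage_pct and partition_at_least are hypothetical, and /var/lib/mysql is assumed to be the MySQL data directory, as in the commands used in the Resolution section.

```shell
# Sketch only: these helper names are hypothetical and not part of the appliance.

# Print the used-space percentage (integer, without the % sign) of the
# filesystem that contains the path given as $1.
partition_usage_pct() {
    # -P forces the POSIX one-line-per-filesystem format; column 5 is "Capacity".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) when the filesystem containing $1 is at least
# $2 percent full.
partition_at_least() {
    [ "$(partition_usage_pct "$1")" -ge "$2" ]
}
```

For example, partition_at_least /var/lib/mysql 95 && echo "MySQL partition is nearly full" prints the message only when usage has reached 95 percent. The threshold of 95 is an arbitrary illustration.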

Environment

This article applies to all supported API Gateway appliance versions.

Cause

The MySQL partition can fill up for several reasons. Two of the most common usually occur together, one triggering the other:

  • Audit records fill the database, so the database and MySQL's disk usage grow; this often breaks replication as well
  • Once MySQL replication is broken, the binlogs grow faster and fill any remaining disk space
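
Whether replication has broken can be read from the replication status that MySQL reports. The sketch below assumes a MySQL version that supports SHOW SLAVE STATUS and reports the Slave_IO_Running and Slave_SQL_Running fields (newer MySQL releases rename these). The function name replication_healthy is hypothetical.

```shell
# Sketch only: replication_healthy is a hypothetical helper name.

# Read the output of "SHOW SLAVE STATUS\G" on standard input and succeed
# (exit status 0) only when both replication threads report "Yes".
replication_healthy() {
    awk -F': *' '
        /^ *Slave_IO_Running:/  { io  = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

A possible use, assuming the mysql client can log in without prompting: mysql -e 'SHOW SLAVE STATUS\G' | replication_healthy || echo "replication is broken". Empty output (replication not configured, or MySQL not running) is also treated as unhealthy.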

Resolution

Short-term / immediate fix

Note: The steps below should only be followed when the disk is completely full and MySQL is not operational, or when otherwise instructed by Broadcom Support. 

  1. On both nodes in the cluster, perform the following steps:
    1. Stop the MySQL service: service mysqld stop
    2. Remove the binary and relay log files with this command: find /var/lib/mysql -type f -regextype posix-extended -regex ".*[0-9]{6}" -exec rm {} \;
      • Note: The files being removed are located in the /var/lib/mysql directory and have names such as ssgbin-log.* and ssgrelay-bin.*
    3. Reset (empty) the four index and info files with the following commands:
      • cat /dev/null > /var/lib/mysql/ssgbin-log.index
        cat /dev/null > /var/lib/mysql/ssgrelay-log.index
        cat /dev/null > /var/lib/mysql/ssgrelay-bin.index
        cat /dev/null > /var/lib/mysql/ssgrelay-bin.info
    4. Verify that these files are owned by user and group mysql:mysql and not root:root. If any are not, correct them with: chown mysql:mysql <fileName>
      • Replace <fileName> with the name of each file that is inadvertently owned by root:root
      • If all files are owned by mysql:mysql then no chown command needs to be run
    5. Start the MySQL service: service mysqld start
  2. Reinitialize replication between the MySQL database nodes
    1. Purging audits from the database before reinitializing replication is recommended, because it frees more disk space and makes reinitialization quicker. To do so, follow the self-service KB article on Removing Audit Records from the Gateway database in a multi-node cluster without downtime
    2. Now follow the dedicated self-service KB article on Reinitializing Replication in a Multi-node Cluster to re-establish replication between the database nodes
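
Before running the removal command in step 1, it can be useful to preview which files it would delete, and after resetting the index and info files, to list any files with the wrong ownership. The following is a minimal sketch assuming GNU find, as the removal command itself does. The function names preview_log_removal and list_wrong_owner are hypothetical, and the sketch only lists files; it does not delete or change anything.

```shell
# Sketch only: helper names are hypothetical. Nothing here modifies any file.

# List the files under directory $1 that the removal command would delete:
# regular files whose full path ends in six digits.
preview_log_removal() {
    find "$1" -type f -regextype posix-extended -regex '.*[0-9]{6}' -print
}

# List regular files directly inside directory $1 whose owner or group is
# not $2:$3. Both default to mysql, matching the ownership check in step 1.
list_wrong_owner() {
    dir=$1
    owner=${2:-mysql}
    group=${3:-mysql}
    find "$dir" -maxdepth 1 -type f \( ! -user "$owner" -o ! -group "$group" \) -print
}
```

For example, preview_log_removal /var/lib/mysql shows the deletion candidates, and list_wrong_owner /var/lib/mysql prints the files that still need chown mysql:mysql. No output from the second command means no chown is required.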

Long-term prevention

  1. Ensure that the audit_purge.sh script is being used in your environment
    • You may need to run this script more frequently too
  2. Ensure that the manage_binlog.sh script is being used in your environment
    • You may need to run this script more frequently too
  3. If this is a production or otherwise critical environment, ideally disable audits; if audits are absolutely required, follow at least our best practices closely
    • Best practices include auditing at the SEVERE or WARNING level at the lowest; anything lower will flood the database with audits
  4. If auditing is required, send audits to a dedicated syslog server located within the same data centre. If they must be saved to a database, use a database external to the API Gateway nodes so that a dedicated MySQL DB administrator can manage it better
  5. Optional: Broadcom Services can be hired to do a thorough audit of your systems to ensure that they are configured for optimal performance and in a way that will avoid this issue in the future
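
One way to judge whether the purge and binlog scripts need to run more frequently is to track how much space the binary and relay logs occupy over time. The sketch below is an illustration under stated assumptions, not part of the appliance. It assumes GNU find and identifies log files by the same six-digit suffix used in the short-term fix. The function name binlog_bytes is hypothetical.

```shell
# Sketch only: binlog_bytes is a hypothetical helper name.

# Print the total size in bytes of all regular files under directory $1
# whose path ends in six digits (the binary and relay log files).
binlog_bytes() {
    find "$1" -type f -regextype posix-extended -regex '.*[0-9]{6}' -printf '%s\n' |
        awk '{ total += $1 } END { printf "%.0f\n", total }'
}
```

Running binlog_bytes /var/lib/mysql at regular intervals and comparing the results gives a rough growth rate. A total that climbs steadily between script runs suggests the scripts should run more often, or that replication should be checked.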