Controller cluster going down due to 100% disk utilization

Article ID: 317225

Updated On:

Products

VMware

Issue/Introduction

Symptoms:
If Pulse application rules sync is enabled in a deployment, controller nodes might run out of disk space. The WAF signature sync feature uses an aggressive retry mechanism that causes the Postgres database to grow in size. This issue affects only deployments running version 21.1.3.

This issue was introduced by a feature added in 21.1.3.

The database grows much faster when the WAF signature sync fails continuously, which is the case in version 21.1.3 because of an external issue with the S3 configuration of NTIC. If the sync completes successfully, the issue does not occur.

Use the following steps to verify whether the controller is affected by this issue:
  1. Check the disk usage on each node using the following command: 
    df -kh
  2. Check the postgres database disk usage:
    du -h --max-depth=1 /var/lib/postgresql/10/
Typically, if the controller is hitting the issue, the config/main database size will exceed 100 GB. A configuration database should not be this large.
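The two checks above can also be scripted. The following is a minimal sketch, not a supported tool: it assumes the Postgres directory from the du command above and uses the roughly 100 GB figure as a threshold. The function names and the threshold constant are illustrative. The reported size is the sum of file sizes, so it can differ slightly from what du reports.

```python
import os
import shutil

# Path taken from the du command above; threshold mirrors the ~100 GB figure.
PG_DATA_DIR = "/var/lib/postgresql/10/"
BLOAT_THRESHOLD_BYTES = 100 * 1024**3


def dir_size_bytes(path):
    """Sum the sizes of all files under path (an approximation of du)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return total


def percent_used(path):
    """Return the percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def looks_bloated(size_bytes, threshold_bytes=BLOAT_THRESHOLD_BYTES):
    """True if the database directory is larger than the assumed threshold."""
    return size_bytes > threshold_bytes


if __name__ == "__main__":
    print(f"/ is {percent_used('/'):.1f}% used")
    size = dir_size_bytes(PG_DATA_DIR)
    print(f"{PG_DATA_DIR}: {size / 1024**3:.1f} GiB, bloated: {looks_bloated(size)}")
```

Run it on each controller node with sufficient privileges to read the Postgres directory. A missing or unreadable directory is reported as 0 GiB.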

Resolution

Patch 21.1.3-2p7 has been released with the fix for this issue. The fix is also included in version 21.1.4 and later.

Workaround:
1) To reduce the disk usage and delete the large database tables:
a) Stop supervisor on all the nodes.
Stop supervisor on the follower nodes first and then on the leader, using the command below:
    systemctl stop process-supervisor
b) Start postgresql on all the nodes.
Start the postgresql service on the leader first and then on the followers:
    systemctl start postgresql.service
c) Run vacuum FULL on the leader:
    chmod +x /opt/avi/scripts/postgres_vacuum.py; python3 /opt/avi/scripts/postgres_vacuum.py
d) After vacuum FULL is done, start the supervisor on all the nodes with the same leader.
Start process-supervisor on the leader first and then on the follower nodes:
    systemctl start process-supervisor
 
2) To stop the disk bloat:
a) Disable the appsignature sync to stop sync-req from Pulse to WAF:
[admin:avictrl]: > configure albservicesconfig
[admin:avictrl]: albservicesconfig> feature_opt_in_status
[admin:avictrl]: albservicesconfig:feature_opt_in_status> no enable_appsignature_sync
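
The node ordering in workaround step 1 is easy to get wrong on a three-node cluster. The sketch below only builds and prints the ordered list of (node, command) pairs so that it can be reviewed before anything is run by hand. It does not execute anything. The host names and the function name are hypothetical, and the commands are copied from the steps above.

```python
# Commands copied from workaround step 1; nothing here is executed.
STOP_SUPERVISOR = "systemctl stop process-supervisor"
START_POSTGRES = "systemctl start postgresql.service"
RUN_VACUUM = (
    "chmod +x /opt/avi/scripts/postgres_vacuum.py; "
    "python3 /opt/avi/scripts/postgres_vacuum.py"
)
START_SUPERVISOR = "systemctl start process-supervisor"


def build_workaround_plan(leader, followers):
    """Return the (node, command) pairs in the order given in step 1."""
    plan = []
    # a) Stop supervisor: followers first, then the leader.
    plan += [(node, STOP_SUPERVISOR) for node in followers]
    plan.append((leader, STOP_SUPERVISOR))
    # b) Start postgresql: leader first, then the followers.
    plan.append((leader, START_POSTGRES))
    plan += [(node, START_POSTGRES) for node in followers]
    # c) Run vacuum FULL on the leader only.
    plan.append((leader, RUN_VACUUM))
    # d) Start supervisor: leader first, then the followers.
    plan.append((leader, START_SUPERVISOR))
    plan += [(node, START_SUPERVISOR) for node in followers]
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for step, (node, command) in enumerate(
        build_workaround_plan("ctrl-1", ["ctrl-2", "ctrl-3"]), start=1
    ):
        print(f"{step:2d}. [{node}] {command}")
```

Keeping the plan as plain data makes it simple to confirm that the same node is treated as leader in every phase, which step d) requires.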