VCF Operations for Networks appliances fail to rotate logs, causing the /var/log partition to grow beyond 90 percent and leaving the NodeManager and FlinkContainer services unhealthy or not running.

Article ID: 435143

Updated On:

Products

VCF Operations for Networks

Issue/Introduction

  • On the platform node(s), you may see any of the following service errors:

    Service NodeManager is running but not healthy
    Service FlinkContainer is not running.

  • Running the command df -h shows that the /var/log partition is more than 90% full.
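
As a quick way to confirm the second symptom, the following is a minimal sketch that reads the usage percentage of the filesystem holding /var/log and compares it with a threshold. The function names usage_pct and over_threshold are hypothetical helpers, not part of the appliance, and the default of 90 simply mirrors the figure quoted above.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the appliance).

# Print the usage percentage (digits only) of the filesystem holding a path.
# df -P forces the POSIX one-line-per-filesystem format, so field 5 of the
# second line is the capacity column (for example "93%").
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) when usage of the path exceeds the threshold.
# The threshold defaults to 90, the figure quoted in this article.
over_threshold() {
    pct=$(usage_pct "$1")
    [ "$pct" -gt "${2:-90}" ]
}

# Example call, guarded so the script is harmless where /var/log is absent.
if [ -d /var/log ]; then
    if over_threshold /var/log; then
        echo "/var/log is above 90% ($(usage_pct /var/log)%)"
    else
        echo "/var/log is at $(usage_pct /var/log)%"
    fi
fi
```

If /var/log is not a separate partition on a given system, df reports the filesystem that contains it, so the figure then describes that parent filesystem.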

Environment

VCF Operations for Networks version 6.14.x

Cause

The log rotate service has a defect in how it rotates logs for nginx, syslog, warn, and other files.

After a log is rotated, the log rotate service is meant to trigger a reload of the corresponding service so that the service reopens its log and continues writing to the correct file. This reload mechanism is broken. As a result, the files grow very large and push /var/log above 90% usage, which leads to the service health issues for NodeManager and FlinkContainer.
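
To see why the reload matters, here is a generic illustration in a scratch directory. It is a sketch of how open file handles behave on Linux in general, and it is not the appliance's actual log rotate configuration. File descriptor 3 stands in for a service that has opened its log once. All file names are made up for the demonstration.

```shell
#!/bin/sh
# Generic demonstration; nothing under /var/log is touched.
demo_dir=$(mktemp -d)
log="$demo_dir/app.log"

exec 3>>"$log"                  # the "service" opens its log file once
echo "before rotation" >&3

mv "$log" "$log.1"              # rotation renames the file; fd 3 follows it
echo "after rotation, no reload" >&3   # still lands in app.log.1

exec 3>&-                       # a reload makes the service close its log
exec 3>>"$log"                  # and reopen it by name, creating app.log
echo "after reload" >&3
exec 3>&-

# Show the line count of each file.
wc -l "$log.1" "$log"
```

The rotated file app.log.1 ends up with two lines and the new app.log with one. Without the reopen step, every later write would keep going to the rotated file, which is how a file that was already rotated can continue to grow.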

Resolution

As a workaround, clear the oversized logs manually to resolve the disk space issue.

Run the following workaround on all affected VCF Operations for Networks nodes (platform VMs and collector VMs).

1. Open a PuTTY/SSH session to each affected VCF Operations for Networks node (platform VMs and collector VMs).

2. Log in with the username support.

3. Run the following command to switch to the ubuntu user:

ub

4. Run the following commands on each affected node.

cd /var/log

List the files, sorted so that the most recently modified appear last, to identify the size of files such as warn, syslog.1, and auth.log.1:

ls -lrth 

5. Check the last three or four files in the listing, for example:

-rw-r-----   1 syslog        adm             2.5G Mar 19 16:19 warn
-rw-r-----   1 syslog        adm             4.6G Mar 19 16:19 syslog
-rw-r-----   1 syslog        adm             3.9G Mar 19 16:19 auth.log

6. Run the following commands to empty the logs manually (each command truncates the target file to zero bytes):

sudo dd if=/dev/null of=/var/log/secure
sudo dd if=/dev/null of=/var/log/warn
sudo dd if=/dev/null of=/var/log/syslog.1
sudo dd if=/dev/null of=/var/log/auth.log.1

7. Remove the warn, syslog.1, and auth.log.1 files on the affected nodes.

Run the following commands from /var/log, because they use relative paths:

sudo rm -rf warn
sudo rm -rf syslog.1
sudo rm -rf auth.log.1

8. Delete access.log and access.log.1 from /var/log/nginx on all the nodes.

Run the following commands from /var/log:

sudo su
cd nginx/
ls -lrth
sudo rm -rf access.log
sudo rm -rf access.log.1

9. Restart the syslog and nginx services on all the nodes with the following commands:

systemctl restart syslog.service
systemctl restart nginx.service

10. After completing the steps above, run the following command to verify the usage of /var/log:

df -h

In a cluster setup, run the following command:

./run_all.sh df -h

The usage of /var/log should now be below 60%.
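
If usage is still high, or before clearing anything in steps 6 to 8, it may help to list which files are actually large rather than reading the ls output by eye. The following is a minimal sketch that assumes the standard find, du, and sort utilities. The function name large_logs and the 1024 MiB default are hypothetical choices, not values from this article.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the appliance).
# List regular files under a directory that are larger than a threshold
# given in MiB, biggest first. Output columns: disk usage in KiB, path.
large_logs() {
    dir=${1:-/var/log}
    min_mib=${2:-1024}
    # -size +Nk matches files larger than N KiB. Subdirectories such as
    # nginx/ are searched too. Errors for unreadable entries are hidden.
    find "$dir" -type f -size +"$((min_mib * 1024))"k \
        -exec du -k {} + 2>/dev/null | sort -rn
}
```

For example, large_logs /var/log 500 would list files above 500 MiB. Some files under /var/log may be unreadable without elevated privileges, and du reports disk usage rather than apparent size, so the numbers can differ slightly from the ls output.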