Service NodeManager is running but not healthy.
Service FlinkContainer is not running.
Running df -h shows that the /var/log partition exceeds 90% disk usage.
VCF Operations for Networks version 6.14.x
The log rotation mechanism fails to rotate logs correctly for nginx, syslog, warn, etc.
After log rotation is performed, it triggers a reload of the respective service so that the service reopens its log files and continues writing to the correct file. This mechanism is broken; as a result, the log files grow very large and push /var/log above 90% usage, which causes the service health issues for NodeManager and FlinkContainer.
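The effect of a missing reload can be seen in a small, self-contained demonstration. This is a general illustration of the behaviour described above, not a reproduction of the product's log rotation configuration: it works in a temporary directory and does not touch /var/log, and the file name app.log and file descriptor 3 are illustrative stand-ins for a real service and its log file.

```shell
#!/usr/bin/env bash
# Illustration only: shows why a service must reopen its log after rotation.
# app.log and fd 3 are hypothetical stand-ins for a real service and its log.
set -eu

workdir="$(mktemp -d)"
log="$workdir/app.log"

exec 3>>"$log"                        # the "service" opens its log and keeps it open
echo "before rotation" >&3

mv "$log" "$log.1"                    # rotation renames the file ...
: > "$log"                            # ... and creates a fresh, empty one

echo "after rotation, no reload" >&3  # still lands in app.log.1 (same open file)

exec 3>&-                             # the "reload": close the old handle ...
exec 3>>"$log"                        # ... and reopen the log by name
echo "after reload" >&3
exec 3>&-

echo "app.log.1 lines: $(grep -c '' "$log.1")"
echo "app.log lines:   $(grep -c '' "$log")"
```

Until the handle is reopened, every write goes to the renamed file, so the rotated file receives two lines and the new file only one. With a long-running service that is never reloaded, the rotated file keeps growing in the same way.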
The workaround is to rotate the logs manually to resolve the disk space issue.
The workaround below must be executed on all affected VCF Operations for Networks nodes (Platform VMs and collector VMs).
1. Open a PuTTY/SSH session to the affected Aria Operations for Networks nodes (Platform VMs and collector VMs).
2. Log in with the username support.
3. Execute the command below to switch to the ubuntu user:
ub
4. You will need to run these commands on each node that is affected.
cd /var/log
To identify the size of files such as warn, syslog.1 and auth.log.1 on the affected nodes:
ls -lrth
5. Look at the last 3 to 4 files, for example:
-rw-r----- 1 syslog adm 2.5G Mar 19 16:19 warn
-rw-r----- 1 syslog adm 4.6G Mar 19 16:19 syslog
-rw-r----- 1 syslog adm 3.9G Mar 19 16:19 auth.log
6. Execute the commands below to rotate the logs manually (dd with if=/dev/null truncates each file to zero bytes):
sudo dd if=/dev/null of=/var/log/secure
sudo dd if=/dev/null of=/var/log/warn
sudo dd if=/dev/null of=/var/log/syslog.1
sudo dd if=/dev/null of=/var/log/auth.log.1
7. Clean up the warn, syslog.1 and auth.log.1 files on the affected nodes.
To do so, execute the commands below from /var/log:
sudo rm -rf warn
sudo rm -rf syslog.1
sudo rm -rf auth.log.1
8. Delete access.log and access.log.1 from /var/log/nginx on all the nodes.
To do so, execute the commands below:
sudo su
cd nginx/
ls -lrth
sudo rm -rf access.log
sudo rm -rf access.log.1
9. Restart the syslog and nginx services on all the nodes by executing the commands below:
systemctl restart syslog.service
systemctl restart nginx.service
10. After completing the steps above, execute the command below to validate the usage of /var/log:
df -h
If there is a cluster setup, execute the command below:
./run_all.sh df -h
The usage of /var/log should now show less than 60%.
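
The final check can also be scripted. The following is a minimal sketch and not part of the product: the function names usage_percent and largest_files are hypothetical, the 60% threshold is taken from the expected result above, and the script assumes GNU df and find as typically found on an Ubuntu-based system. Run it with sudo so that find can read every directory under /var/log.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with the product.
# Assumes GNU coreutils/findutils (df --output, find -printf).

# Print the usage percentage (digits only) of the filesystem holding $1.
usage_percent() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Print the N largest regular files under a directory as "<bytes><TAB><path>".
largest_files() {
  local dir="$1" count="${2:-5}"
  find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null | sort -rn | head -n "$count"
}

target="${1:-/var/log}"
threshold=60   # expected upper bound after the workaround

pct="$(usage_percent "$target")"
if [ -n "$pct" ] && [ "$pct" -lt "$threshold" ]; then
  echo "OK: $target is at ${pct}%"
else
  echo "WARNING: $target is at ${pct:-unknown}%; largest files:"
  largest_files "$target" 5
fi
```

If the warning branch is reached, the listed files indicate which logs still need attention on that node.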