vIDM root partition is full on one node of the cluster

Article ID: 376193

Updated On:

Products

VMware Aria Suite

Issue/Introduction

This article provides steps to free up and add space in the vIDM root partition when the partition is full on a node or across the cluster.

Environment

VMware Identity Manager 3.3.x

Cause

A large increase in disk usage is usually seen under /var/log, in particular the /var/log/messages* files and the /var/log/journal directory.

The vIDM appliance root (/) file system is full, for example usage exceeds roughly 90%.
You will see errors such as the following:

  • Partition / at 100%
    • /dev/sda4 or /dev/sda2 at 100%
  • Partition /var at 100%
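
A quick way to confirm which file system is actually full before taking any action is to check the mounted partitions directly, for example:

    # check usage of the root and /var partitions
    df -h / /var
    # list all mounted file systems and their usage
    df -h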

Resolution

To resolve this issue and free up disk space in the vIDM root partition:

  • Take a snapshot of the nodes. 
  • Log in to the vIDM appliance.
  • SSH to the node.
     

Option 1 (to resolve the issue)

  1. Resize the /dev/sda4 file system to 20 GB on all nodes in the cluster. See KB: How to Increase vIDM appliance disk space (broadcom.com). 
     
  2. Remove the journal files with rm -rf /var/log/journal/* 
     
  3. Remove the messages file with rm -rf /var/log/messages
     
  4. Check whether the audit log is consuming space:
     
    1. Change Directory: cd /var/log/audit
    2. Check the size using: ls -lh
    3. If the audit log file is too large, truncate it (clear the contents without deleting the file) with this command:
      truncate -s 0 audit.log

       
  5. Modify /etc/logrotate.conf from weekly rotation to daily rotation and then run the command logrotate /etc/logrotate.conf (a sample of this change is shown after this list).
     
  6. Add the following line to the /etc/cron.d/hzniptables file. The hzniptables file is present under /etc/cron.d; note that entries in /etc/cron.d require a user field:
     
    */1 * * * * root cat /dev/null >/var/log/messages

     

     
  7. Add/Modify the following lines to /usr/local/horizon/conf/runtime-config.properties on each node:
     
    analytics.deleteOldData=true
    analytics.maxQueryDays=90
     
  8. Edit /etc/rsyslog.conf with vi and remove all the input methods, after confirming that /usr/local/horizon/scripts/enableRSyslog.hzn status shows no syslog is present. (Copy a backup before editing; an illustrative excerpt of typical input entries is shown after this list.)
     
  9. Restart the rsyslog service on the nodes; after updating rsyslog.conf, a restart is required for the changes to take effect.

     ps aux | grep rsyslog
     systemctl status rsyslog
     systemctl restart rsyslog
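
For step 5, the weekly-to-daily change in /etc/logrotate.conf is a one-line edit. The excerpt below is only a sketch assuming stock logrotate defaults; the other directives in your file may differ:

    # /etc/logrotate.conf (relevant excerpt)
    # rotate logs daily instead of weekly
    daily
    # keep 4 rotated copies before older logs are removed
    rotate 4
    # compress rotated logs to save space
    compress

After saving the file, run logrotate /etc/logrotate.conf as described in step 5.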
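
For step 8, the "input methods" are the module-load and input lines near the top of /etc/rsyslog.conf. The excerpt below is only an illustration of what such entries commonly look like, shown commented out; the exact lines on your appliance may differ, so keep the backup copy from step 8:

    # /etc/rsyslog.conf (illustrative excerpt - actual entries may differ)
    #$ModLoad imuxsock    # local system log messages
    #$ModLoad imklog      # kernel log messages
    #$ModLoad imudp       # remote syslog over UDP
    #$UDPServerRun 514
    #$ModLoad imtcp       # remote syslog over TCP
    #$InputTCPServerRun 514

Make the change on one node first and restart rsyslog (step 9) before repeating it on the remaining nodes.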

 

 

Option 2 (to investigate disk usage)

  1. Check the output of the below config files to review the configuration:
      
    cat /etc/logrotate.conf
    cat /etc/cron.d/hzniptables
     
    Note: Check the result of both commands again later, so that you can compare and see which directory has consumed more space.
     
  2. Exclude the /opt/vmware and /db directories so that the check focuses only on the root directory:
     
    Run the below command to check disk usage:
    du -ah / 2>/dev/null --exclude=/opt/vmware --exclude=/db | sort -rh | head -n 20

    And then check the disk usage of the /var folder:
    du -ah /var 2>/dev/null | sort -rh | head -n 20

    Note: Observe whether the Partition Utilization metric dips or keeps incrementing. You can check the output of the above commands over the next couple of days (a small sketch for capturing this output daily is shown after this list).
    You can also reboot the cluster and compare the output before and after the reboot.
     
  3. Check the config-state.json file on all 3 nodes and check /etc/systemd/journald.conf for the SystemMaxUse property (by default 100M); a journald.conf excerpt is shown after this list.
     
    1. Check the cache usage with this command: du -sh /var/cache/ (normally this is only a few MB in size)
    2. Review the list of open files that have been deleted but are still held by a process: lsof | grep '(deleted)'
    3. Review /etc/rsyslog.conf and remove the input methods (see Option 1, step 8). Change it on a single node first and monitor whether this resolves the issue; apply the same change to the other nodes after a few days.
    4. Check the frequency of the Directory Sync,
      e.g. if it is set to hourly, check that users are getting synced every hour. It is also recommended to change the sync connector for the directory having the most users and see whether the other node shows a similar impact on disk space.
       
  4. Check whether the hardware requirements match the guidelines: VIDM Guideline
     
  5. Run the below query and check the CSV file result before updating the directory sync frequency:
     
    1. Connect to the database:
      cat /usr/local/horizon/conf/db.pwd
      psql -U postgres saas
    2. Run this statement to export:
      copy (select * from "CacheEntry") to '/tmp/CacheEntry.csv' with csv;

      Note: You can run the same command again after 2-3 days and compare the results.
       
  6. Compare the below command output for all 3 nodes, before setting the sync frequency to manual and after setting it to manual.

    Allow a few days (2-3 days) of soak time for a significant difference in disk size to become visible.

    Commands to investigate disk usage:
    du -ah / 2>/dev/null | sort -rh | head -n 100
    find ./ -type f -size +100M | less
    find ./ -type d -exec sh -c 'echo -n "{}: " && find "{}" -type f | wc -l' \; | awk '$2 > 100' | sort -k2,2nr | less
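
As mentioned in the note under Option 2, step 2, it helps to capture the disk-usage output daily so it can be compared over several days. The script below is only a minimal sketch of how to do that; the file name and output path are arbitrary examples:

    #!/bin/sh
    # disk-usage-snapshot.sh - hypothetical helper; run once a day (manually or from cron)
    # Records a timestamped snapshot of overall and /var disk usage for later comparison.
    OUT=/tmp/disk-usage-$(date +%Y%m%d).txt
    df -h / /var > "$OUT"
    du -ah / 2>/dev/null --exclude=/opt/vmware --exclude=/db | sort -rh | head -n 20 >> "$OUT"
    du -ah /var 2>/dev/null | sort -rh | head -n 20 >> "$OUT"

Comparing the resulting files from different days shows which directories keep growing.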
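
For Option 2, step 3, the journal size cap lives in /etc/systemd/journald.conf. The excerpt below is a sketch of what the setting looks like, using the 100M default mentioned in that step; lower it if the journal under /var/log/journal is taking too much space:

    # /etc/systemd/journald.conf (relevant excerpt)
    [Journal]
    # cap the persistent journal stored under /var/log/journal
    SystemMaxUse=100M

A change here takes effect after restarting journald with systemctl restart systemd-journald.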




Additional Information

Notes: The nodes can come under higher load due to misconfiguration of the connectors; distributing the load across all the nodes can improve disk utilization.

Some files can be ignored or deleted over time. You can also reboot the node so that the JVM and other processes release their temporary files.

Reboot the node and apply the KB (HW-134096 - VMware Identity Manager Connector may fail to communicate due to config-state.json corruption (broadcom.com)) to restore the configuration files.

- The files under /var/log can be deleted to free up space. The following can be deleted:

/var/log/journal/*
/var/log/messages