root@<VIDM-FQDN> [ ~ ]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.9G 0 7.9G 0% /dev
tmpfs 7.9G 12K 7.9G 1% /dev/shm
tmpfs 7.9G 808K 7.9G 1% /run
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
/dev/sda4 17G 16G 0 100% /
tmpfs 7.9G 184K 7.9G 1% /tmp
/dev/sda2 119M 26M 87M 23% /boot
/dev/mapper/db_vg-db 20G 502M 19G 3% /db
/dev/mapper/tomcat_vg-horizon 20G 2.8G 16G 15% /opt/vmware/horizon
tmpfs 1.6G 0 1.6G 0% /run/user/1001
tmpfs 1.6G 0 1.6G 0% /run/user/0
Directory sync (Identity & Access Management > Directories > Sync now) is failing with the following error: "Failed to save config to disk". This applies to VMware Identity Manager 3.3.x.
A large increase in disk usage is usually seen under /var/log, in particular in the /var/log/messages* files.
The vIDM appliance root (/) file system is full, for example usage exceeds 90%.
You will see errors such as the following in the df output:
/dev/sda4 or /dev/sda2 mounted on / at 100% usage
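As a quick check, the root usage can be read directly from df (a minimal sketch; the 90% threshold below is illustrative, matching the symptom description above):

```shell
# Read the usage percentage of the root (/) filesystem from df
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
echo "root usage: ${usage}%"
# Warn when usage crosses the level described above (threshold is illustrative)
if [ "$usage" -ge 90 ]; then
  echo "WARNING: root filesystem nearly full"
fi
```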
Before proceeding any further, take a snapshot of all appliances in the vIDM cluster in vCenter (non-memory, quiesced).
Note: If extending the disk space per Option 3, delete the snapshots and clone the appliances instead.
Option 1: Rotated log files still opened by rsyslogd
Rotated log files are renamed and unlinked from the filesystem, but are sometimes still held open by rsyslogd, so their disk blocks are not released.
This can be observed with the command:
lsof +L1 | grep delete | grep "rsyslogd"
E.g.:
rsyslogd 512314 root 15w REG 8,4 97697061 0 196612 /var/log/messages-1770865801 (deleted)
rsyslogd 512314 root 97w REG 8,4 97697061 0 196612 /var/log/messages-1770865801 (deleted)
These files can sometimes take up enough space to trigger 100% usage of the / partition.
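This behaviour can be reproduced on any Linux host with a scratch file (a minimal sketch, not specific to vIDM): a file that is unlinked while a descriptor is still open keeps consuming disk blocks until that descriptor is closed, which is exactly why restarting rsyslog frees the space.

```shell
tmp=$(mktemp)
exec 3>"$tmp"                      # hold the file open, as rsyslogd holds its logs
dd if=/dev/zero of="$tmp" bs=1M count=5 2>/dev/null
rm "$tmp"                          # unlink: the name is gone, the blocks are not
fdinfo=$(ls -l "/proc/$$/fd/3")    # the descriptor now points at "... (deleted)"
echo "$fdinfo"
exec 3>&-                          # closing the descriptor finally frees the space
```

Closing the descriptor (here via `exec 3>&-`, on the appliance via the rsyslog restart) is what actually returns the blocks to the filesystem.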
Monitor the usage of the root partition with:
df -B M | grep -iE "filesystem|sda4"
To immediately clear up space, restart rsyslog with:
systemctl restart rsyslog
Often this will close the stale file descriptors and release the disk blocks, clearing up a significant amount of disk space.
Verify the freed space with:
df -B M | grep -iE "filesystem|sda4"
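To quantify how much space the restart recovered, the available space can be captured before and after and the difference printed (a sketch; the restart itself is shown as a comment so the snippet is safe to run anywhere):

```shell
# Available MB on the root (/) filesystem
avail_mb() { df -BM --output=avail / | tail -n 1 | tr -dc '0-9'; }
before=$(avail_mb)
# On the appliance you would run here: systemctl restart rsyslog
after=$(avail_mb)
echo "freed: $((after - before)) MB"
```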
If this hasn't cleared enough disk space, proceed to one of the other options listed below.
Option 2: Clearing journal files
cd /var/log/audit
ls -lh
truncate -s 0 audit.log
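Note that truncating with `truncate -s 0` frees the space immediately even if a process still has audit.log open, whereas deleting the file would not (see Option 1). A minimal demonstration on a scratch file:

```shell
log=$(mktemp)
dd if=/dev/zero of="$log" bs=1M count=5 2>/dev/null   # simulate a grown log file
ls -lh "$log"                     # ~5 MB before truncation
truncate -s 0 "$log"              # empty the file in place, keeping the same inode
ls -lh "$log"                     # 0 bytes; the blocks are released immediately
```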
Change the rotation interval in /etc/logrotate.conf from weekly to daily, then run the command: logrotate /etc/logrotate.conf
Check the hzniptables file, which is present under /etc/cron.d. It can carry the following entry to clear /var/log/messages every minute:
*/1 * * * * cat /dev/null >/var/log/messages
Add the following properties to /usr/local/horizon/conf/runtime-config.properties on each node so that old analytics data is deleted:
analytics.deleteOldData=true
analytics.maxQueryDays=90
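Setting these idempotently (so re-running does not duplicate lines) can be sketched as follows; the `set_prop` helper and the temp file are illustrative, and on the appliance the target would be the runtime-config.properties path above:

```shell
# Idempotently set key=value lines in a properties file.
set_prop() {
  file=$1; key=$2; value=$3
  if grep -q "^${key}=" "$file"; then
    # Key already present: replace its value in place
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # Key missing: append it
    echo "${key}=${value}" >> "$file"
  fi
}
conf=$(mktemp)   # stand-in for runtime-config.properties
set_prop "$conf" analytics.deleteOldData true
set_prop "$conf" analytics.maxQueryDays 90
cat "$conf"
```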
After confirming that /usr/local/horizon/scripts/enableRSyslog.hzn status shows no syslog server configured, edit /etc/rsyslog.conf with vi and remove all the input methods (copy a backup before editing). Then verify rsyslog:
ps aux | grep rsyslog
systemctl status rsyslog
systemctl restart rsyslog
Option 3: Resize / partition of the appliances
Resize the /dev/sda4 disk to 20 GB on all appliances in the cluster by following the steps detailed in this KB: How to Increase vIDM appliance disk space (broadcom.com)
Option 4: Investigate disk usage
cat /etc/logrotate.conf
cat /etc/cron.d/hzniptables
Note: Check the result of both commands again later, to compare and determine which directory has consumed the most space.
To list the 20 largest entries on the root filesystem, excluding /opt/vmware/horizon and /db:
du -ah / --exclude=/opt/vmware/horizon --exclude=/db 2>/dev/null | sort -rh | head -n 20
/var folder:
du -ah /var 2>/dev/null | sort -rh | head -n 20
Note: Observe whether the Partition Utilization metric dips or keeps incrementing; check the output of the above commands again over the next couple of days.
Check /etc/systemd/journald.conf for the SystemMaxUse property (100M by default).
Check du -sh /var/cache/ and confirm it is in the normal MB range.
Run lsof | grep '(deleted)' to find deleted files that are still held open.
View /etc/rsyslog.conf and remove the input methods. Make this change on a single node first to monitor whether it resolves the issue, and apply the same change to the other nodes after a few days.
Retrieve the database password with cat /usr/local/horizon/conf/db.pwd, then connect and export the CacheEntry table:
psql -U postgres saas
copy (select * from "CacheEntry") to '/tmp/CacheEntry.csv' with csv;
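Since the point of the export is to see whether the CacheEntry table keeps growing, two exports taken days apart can be compared by row count (the demo files below stand in for the real CSVs):

```shell
# Demo files stand in for CacheEntry CSVs exported from psql on different days
day1=$(mktemp); day2=$(mktemp)
printf 'row\n' > "$day1"
printf 'row\nrow\nrow\n' > "$day2"
old=$(wc -l < "$day1"); new=$(wc -l < "$day2")
echo "CacheEntry grew by $((new - old)) rows"   # prints "CacheEntry grew by 2 rows"
```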
Note: Run the same command again after 2-3 days and compare the results.
To list the 100 largest entries on the filesystem:
du -ah / 2>/dev/null | sort -rh | head -n 100
find ./ -type f -size +100M | less
find ./ -type d -exec sh -c 'echo -n "{}: " && find "{}" -type f | wc -l' \; | awk '$2 > 100' | sort -k2,2nr |less
If the required space is still not being reclaimed, follow the KB below to check the GC logs: Uncompressed gc.logs are causing the root (/) partition to run out of space in VMware Identity Manager.
Note 1: The nodes can be under higher load due to misconfiguration of the connectors; distribute the load across all the nodes and check for improvement in disk utilization.
Note 2: Reboot the nodes so that the JVM and other processes release their temporary files.
Note 3: Validate the KB (HW-134096 - VMware Identity Manager Connector may fail to communicate due to config-state.json corruption (broadcom.com)) to restore the configuration files if they are corrupt or empty.