VMware Cloud Foundation SDDC-Manager nfs-mount 100% Full

Products

VMware Cloud Foundation

Issue/Introduction

This article describes how to resolve an SDDC Manager nfs-mount that is 100% full.

Symptoms:

The NFS repo on SDDC Manager is full or close to full. When you run prechecks, you see errors that the free space is less than required, or an alert such as:
LCM-Bundle-repo, datastore usage on disk alert: only 25GB free out of 502GB.
These errors can prompt a request to expand the NFS share in SDDC Manager.
The NFS repo can also fill up because of previously scheduled backups.
You can cross-verify by running du -shc * and then clear the old backups.
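
As a rough equivalent of the du -shc * check, the sketch below lists each entry of a directory by size, largest first. The function names are hypothetical, and it sums apparent file sizes rather than allocated disk blocks, so the figures can differ slightly from what du reports.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def usage_by_entry(parent):
    """Return (name, size_in_bytes) for each entry in parent, largest first."""
    sizes = {}
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = dir_size_bytes(entry.path)
        else:
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "." is a placeholder; run this from the directory you want to inspect.
    for name, size in usage_by_entry("."):
        print(f"{size / 1024 ** 2:10.1f} MiB  {name}")
```

If the backup directory appears at the top of the list, old backups are the likely cause of the full mount.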

Environment

VMware Cloud Foundation 3.9.x

SDDC Manager 5.x

Cause

According to the retention policy properties, the cron job should delete NSX backup files older than 2 weeks, and the number of LCM backup files should not exceed 5.

This can be checked by looking at the following file:

root@sddcmgr-1 [ /nfs/vmware/vcf/nfs-mount/backup/scripts ]# cat nsx_backup_config.properties
# All properties are mandatory.
# Property BACKUP.RETENTION.HOURS is used for retention policy. All backup files past
# 'N' hours will be retained and rest will be deleted in current day.
# Constraints : Min 2
BACKUP.RETENTION.HOURS=6

# Property BACKUP.RETENTION.DAYS is used for retention policy. A single latest backup file
# for a day for 'N' number of days will be retained and rest will be deleted in that day.
# Example, if there are hourly backups configured and say there are backups for each hour,
# files will be evo-nsx-******-23_0-****.backupproperties,
# evo-nsx-******-22_0-****.backupproperties, evo-nsx-******-21_0-****.backupproperties,....
# In above file list latest would be at 23:00, so file evo-nsx-******-23_0-****
# will be retained.
# Constraints : Min 1
BACKUP.RETENTION.DAYS=7

# Property BACKUP.RETENTION.WEEKS is used for retention policy. A single latest backup
# file for a week for 'N' number of weeks will be retained and rest will be deleted in
# that week.
# Example, if there are hourly backups configured and say there are backups for each hour,
# files on the last day of the week ie on Sunday will be evo-nsx-******-23_0-Sun***.backupproperties,
# evo-nsx-******-22_0-Sun***.backupproperties, evo-nsx-******-21_0-Sun***.backupproperties,....
# In above file list latest would be at 23:00, so file evo-nsx-******-23_0-Sun***
# If there is no file available on Sunday then previous day will be checked until
# start day of the week.
BACKUP.RETENTION.WEEKS=2

# Enable log LEVEL
# Allowed levels INFO and DEBUG
BACKUP.LOGGER.LEVEL=INFO
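
To check the retention values without reading the file by eye, a minimal sketch like the one below can parse the key=value lines and compare them with the minimums stated in the file's own comments (HOURS min 2, DAYS min 1). The function names are hypothetical, and the sample text only mirrors the values shown above; this is not how the backup scripts themselves read the file.

```python
RETENTION_KEYS = (
    "BACKUP.RETENTION.HOURS",
    "BACKUP.RETENTION.DAYS",
    "BACKUP.RETENTION.WEEKS",
)

# Minimums taken from the "Constraints" comments in the properties file.
# No constraint is stated there for WEEKS, so none is enforced here.
MINIMUMS = {"BACKUP.RETENTION.HOURS": 2, "BACKUP.RETENTION.DAYS": 1}


def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_retention(props):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in RETENTION_KEYS:
        value = props.get(key)
        if value is None or not value.isdigit():
            problems.append(f"{key} is missing or not a whole number")
        elif int(value) < MINIMUMS.get(key, 0):
            problems.append(f"{key}={value} is below the minimum {MINIMUMS[key]}")
    return problems


if __name__ == "__main__":
    sample = """
    # All properties are mandatory.
    BACKUP.RETENTION.HOURS=6
    BACKUP.RETENTION.DAYS=7
    BACKUP.RETENTION.WEEKS=2
    """
    props = parse_properties(sample)
    print(props)
    print(check_retention(props) or "retention values look valid")
```

Valid retention values alongside a growing backup directory suggest that the cleanup job itself is not running successfully.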

In this case, the backups are taking up most of the space in the directory.

The cron job that cleans up the NSX backup files is failing because of an issue in the script nsxbackupcleaner.py. As a result, old backups accumulate and fill the NFS share, and they need to be cleaned up.

Resolution

Make a copy of the file nsxbackupcleaner.py before making the necessary changes.
cat the file nsxbackupcleaner.py and copy the contents into an editor that shows line numbers, such as Notepad++, to help with making the changes.
- root@sddcmgr-1 [ /nfs/vmware/vcf/nfs-mount/backup/scripts ]# vi nsxbackupcleaner.py
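
As an alternative to copying the script into an external editor, the sketch below makes a timestamped copy of a file and prints selected lines with their numbers. The helper names and the .bak suffix are assumptions for illustration. If your copy of nsxbackupcleaner.py differs from the one this article was written against, the line numbers may not match, so compare the printed content before editing.

```python
import os
import shutil
import time


def backup_copy(path):
    """Copy path to a timestamped .bak file next to it and return the new path."""
    dest = f"{path}.{time.strftime('%Y%m%d-%H%M%S')}.bak"
    shutil.copy2(path, dest)  # copy2 also preserves timestamps and permissions
    return dest


def numbered_lines(path, wanted):
    """Return {line_number: text} for the 1-based line numbers in wanted."""
    found = {}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if number in wanted:
                found[number] = line.rstrip("\n")
    return found


if __name__ == "__main__":
    script = "nsxbackupcleaner.py"  # run from the backup/scripts directory
    if os.path.exists(script):
        print("backup written to", backup_copy(script))
        # Line numbers are the ones this article lists for the four changes.
        for number, text in sorted(numbered_lines(script, {20, 142, 190, 260}).items()):
            print(f"{number:>4}: {text}")
    else:
        print(script, "not found in the current directory")
```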
Four lines need to be changed:
- Line #20 -->> nsxFileNamePrefix = '../evo-nsx-*'
  - Change to -->>
    - nsxFileNamePrefix = '../*nsx-*'
- Line #142 -->> backup_files_for_day.sort(key=lambda x: os.path.getctime(x), reverse=True)
  - Change to -->>
    - backup_files_for_day = sorted(backup_files_for_day, key=lambda x: os.path.getctime(x), reverse=True)
- Line #190 -->> backupfilesForDay.sort(key=lambda x: os.path.getctime(x), reverse=True)
  - Change to -->>
    - backupfilesForDay = sorted(backupfilesForDay, key=lambda x: os.path.getctime(x), reverse=True)
- Line #260 -->> files.sort(key=lambda x: os.path.getctime(x), reverse=True)
  - Change to -->>
    - files = sorted(files, key=lambda x: os.path.getctime(x), reverse=True)
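
The article does not state why these four lines fail, so the sketch below only illustrates two general Python behaviors that the changes rely on. First, the wildcard '*nsx-*' matches any name containing nsx-, while 'evo-nsx-*' matches only names that start with evo-nsx-. Second, .sort() exists only on lists, whereas sorted() accepts any iterable and returns a new list. The file names and helper names are hypothetical.

```python
import fnmatch

# Hypothetical file names, used only to illustrate the pattern change.
SAMPLE_NAMES = [
    "evo-nsx-example-23_0.backupproperties",
    "vcf-nsx-example-23_0.backupproperties",
    "lcm-example.tar",
]


def matching(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return fnmatch.filter(names, pattern)


def newest_first(items, key):
    """Sort any iterable (list, generator, filter object) in descending key order."""
    return sorted(items, key=key, reverse=True)


if __name__ == "__main__":
    # Old prefix: only names beginning with "evo-nsx-".
    print(matching(SAMPLE_NAMES, "evo-nsx-*"))
    # New prefix: any name containing "nsx-".
    print(matching(SAMPLE_NAMES, "*nsx-*"))

    # A generator has no .sort() method, but sorted() handles it.
    lazy = (name for name in SAMPLE_NAMES if "nsx" in name)
    print(hasattr(lazy, "sort"))
    # len stands in for os.path.getctime, since these files do not exist.
    print(newest_first(lazy, key=len))
```

In the actual script the sort key is os.path.getctime, so the newest backup comes first in each list.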
Once the file is updated, rerun the command to clean up the backup files.
- root@sddcmgr-1 [ /nfs/vmware/vcf/nfs-mount/backup/scripts ]# ./nsx_backup_cleanup.sh
Run df -h again to check whether that freed up space.
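
For a scripted version of the df -h check, a minimal sketch using the standard library's shutil.disk_usage is shown below. The function name is hypothetical, and the mount path is taken from the prompts shown earlier in this article; adjust it if your mount point differs.

```python
import os
import shutil


def free_space_report(path):
    """Return total and free space in GiB plus percent used for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    percent_used = usage.used / usage.total * 100 if usage.total else 0.0
    return {
        "total_gib": usage.total / gib,
        "free_gib": usage.free / gib,
        "percent_used": percent_used,
    }


if __name__ == "__main__":
    mount = "/nfs/vmware/vcf/nfs-mount"
    # Fall back to the current directory when run somewhere else.
    target = mount if os.path.isdir(mount) else "."
    report = free_space_report(target)
    print(
        f"{target}: {report['free_gib']:.1f} GiB free of "
        f"{report['total_gib']:.1f} GiB ({report['percent_used']:.0f}% used)"
    )
```

Running it before and after the cleanup gives a direct comparison of the space recovered.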

Additional Information

Impact/Risks:

Before making any change, take a full backup of the system you are working on in case of failure.
Pre-checks can report issues.
Data may not be displayed, or SDDC Manager may not respond.