NSX Application Platform (NAPP) reports High Disk usage for Data Storage, but Analytics storage is low

Article ID: 317741

Products

VMware NSX
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:
The Minio disk gradually fills up with the hourly Spark data checkpoints from InfraClassifier (IC). When the data on the disk grows too large, it can affect other Intelligence services that also use Minio.
NAPP raises an alarm about high disk usage for Data Storage, but none for Analytics. Storage usage can be seen in the alarms and on the Core Services tab in NAPP.

The problem can be identified by running the disk-usage command on any of the minio-* pods in the nsxi-platform namespace. The iccheckpoints directory is much larger than the other directories.

1. Enable napp-k commands:
export KUBECONFIG=/config/vmware/napps/.kube/config

2. Log in to the NSX Manager as the root user, then open a shell in a Minio pod (minio-0 is used here as an example):
napp-k exec -it minio-0 -- /bin/bash

3. Run the disk-usage command:
du -ah --max-depth=1 /data/minio
20G     /data/minio/druid
549M    /data/minio/feature-service
22M     /data/minio/llanta
79G     /data/minio/iccheckpoints     <-------------- NOTE: LARGE SIZE 79G!
4.0K    /data/minio/events
4.0K    /data/minio/icfeatures
514M    /data/minio/processing-checkpoints
16K     /data/minio/lost+found
12K     /data/minio/ntaflow-checkpoints
59M     /data/minio/.minio.sys
2.6G    /data/minio/data-service
102G    /data/minio
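To pick out the largest directory without reading the listing by eye, the output can be sorted by size. The following is a minimal sketch, not part of the official procedure: `largest_entry` is a hypothetical helper name, and it assumes a `sort` that supports `-h` (GNU coreutils), which may not be available inside the Minio pod.

```shell
# Hypothetical helper: print the largest child entry from `du -h`-style
# input on stdin. $1 is the root path, whose own total line is ignored.
# Assumes GNU sort (-h understands sizes such as 549M and 79G) and paths
# without spaces.
largest_entry() {
  root="$1"
  awk -v root="$root" '$2 != root' | sort -rh | head -n 1
}
```

For example, `du -ah --max-depth=1 /data/minio | largest_entry /data/minio` would print the iccheckpoints line for the listing shown above.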

Cause

This happens on a 4.0.1 scale setup where IC is running. The larger the data size, the faster the disk fills up, because IC creates larger checkpoints at the same rate each day.
Over time, the disk has been observed to fill at a rate of approximately 2 GB of checkpoints per day.
These checkpoints keep accumulating until they fill the disk. How long this takes depends on the size of the disk and of the checkpoints.
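As a rough illustration of how the fill rate translates into time remaining, the arithmetic can be sketched as below. `days_until_full` is a hypothetical helper, and the 200 GB capacity in the example is an assumed figure, since the actual volume size is not stated in this article. Real growth depends on the data size, so treat the result as an estimate only.

```shell
# Hypothetical helper: whole days until the volume is full.
# $1 = volume capacity in GB (assumed value, check your own deployment)
# $2 = current usage in GB
# $3 = checkpoint growth in GB per day
days_until_full() {
  capacity_gb="$1"
  used_gb="$2"
  growth_gb_per_day="$3"
  # Integer division: remaining space divided by daily growth.
  echo $(( (capacity_gb - used_gb) / growth_gb_per_day ))
}

# Example: 102 GB used, about 2 GB/day growth, assumed 200 GB volume.
days_until_full 200 102 2
```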

Resolution

This issue is resolved in NSX Intelligence 4.1.1.

Workaround:
To work around this issue, the iccheckpoints need to be cleaned up. This can be done without impact to InfraClassifier or any other services.
The attached yaml file contains a cronjob and a configmap that clean up the iccheckpoints every night at midnight local time.

1) Download the attached yaml file, clean-checkpoints-with-annotations.yaml.
2) The downloaded file has a .txt extension (clean-checkpoints-with-annotations.yaml.txt).
Rename the file to remove the .txt extension, so that it is named clean-checkpoints-with-annotations.yaml.
3) Copy the file to one of the NSX Managers.
4) Log in to the shell of that NSX Manager as the root user and execute:

$ napp-k apply -f clean-checkpoints-with-annotations.yaml
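The rename in step 2 can also be done from a shell. The sketch below is only one way to do it: `strip_txt_suffix` is a hypothetical helper name, and it assumes the file is in a directory where you have write permission.

```shell
# Hypothetical helper: remove a trailing ".txt" added by the download
# and print the resulting file name. Files that do not end in
# ".yaml.txt" are left untouched.
strip_txt_suffix() {
  file="$1"
  case "$file" in
    *.yaml.txt)
      renamed="${file%.txt}"
      mv -- "$file" "$renamed"
      echo "$renamed"
      ;;
    *)
      echo "$file"
      ;;
  esac
}
```

Running `strip_txt_suffix clean-checkpoints-with-annotations.yaml.txt` before step 3 leaves a file named clean-checkpoints-with-annotations.yaml, which is the name used in the apply command above.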

Additional Information

Impact/Risks:
The processing pipeline and other services may be unable to cache objects to Minio storage, which can prevent them from restarting or processing data.
If the iccheckpoints continue to fill the disk, the processing pipeline will eventually stall and Intelligence will not be able to process data.

Attachments

clean-checkpoints-with-annotations.yaml
