Postgres DB Disk Space Consumption by Malware Prevention Feature in NSX NAPP

Products

VMware vDefend Firewall VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

On exceeding supported scale (2M events in 14 days) for Malware prevention feature, it consumes high disk space in Postgres DB

Symptoms:

- Postgres pods are in CrashLoopBackOff state
- Checking logs of postgresql-ha-postgresql-0 pod shows "No space left on device"
- `napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr` shows high storage consumption by data folder

Ex:

napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
28G /bitnami/postgresql/data/base
28G /bitnami/postgresql/data
26G /bitnami/postgresql/data/base/16451. <--------
2.2G /bitnami/postgresql/data/base/16577
481M /bitnami/postgresql/data/pg_wal
14M /bitnami/postgresql/data/pg_xact

- Exec into the pod and searching for the provided number (16451 in above example), shows database name as ‘malwareprevention’

Ex:

napp-k exec -it postgresql-ha-postgresql-0 -- bash
PGPASSWORD=$POSTGRES_PASSWORD psql
postgres=# SELECT datname FROM pg_database WHERE oid = 16451;
datname
-------------------
malwareprevention
(1 row)

- Checking for space occupied by tables, shows that ‘inspection_events’ table occupies a lot of space

Ex:

malwareprevention=# SELECT relname AS TableName, to_char(seq_scan, '999,999,999,999') AS TotalSeqScan,to_char(idx_scan, '999,999,999,999') AS TotalIndexScan, to_char(n_live_tup, '999,999,999,999') AS TableRows, pg_size_pretty(pg_total_relation_size(relname :: regclass)) AS TableSize FROM pg_stat_all_tables WHERE schemaname = 'public';
            tablename | totalseqscan | totalindexscan | tablerows | tablesize
---------------------------------+------------------+------------------+------------------+-----------
 malware_class | 1 | 0 | 0 | 24 kB
 http_header | 0 | 517,895 | 26,143,960 | 1464 MB
 unavailable_queue_table | 1,441 | 6 | 0 | 56 kB
 client_cache_mapping | 115,656 | | 2 | 56 kB
 reputation_failed_hash | 2 | 6 | 0 | 24 kB
 kafka_offset | 1 | 38,382 | 1 | 88 kB
 malware_family | 1 | 0 | 0 | 24 kB
 update_queue | 14,390 | 591 | 0 | 248 kB
 intercepted_entities | 2 | 333,071 | 543,140 | 438 MB
 non_benign_table | 0 | 32 | 3,287 | 148 MB
 reputation_cloud_service_config | 0 | 0 | 0 | 32 kB
 inspection_events | 14 | 1,526,893 | 26,141,303 | 24 GB <--------
 sa_configurations | 19 | | 2 | 48 kB
 reputation_update_marker | 21 | 0 | 0 | 64 kB
 reputation_pubsub_config | 0 | | 0 | 0 bytes
 reputation_query_metadata | 1,440 | 0 | 0 | 24 kB
 allow_list_table | 13,210 | 9,611 | 8 | 24 kB
 summary_table | 0 | 35,986 | 9,594 | 1392 kB
(18 rows)

NAPP UI will stop functioning and features running on NAPP may not be able to function as intended.

Environment

All NAPP versions 4.2 and before

Cause

Excessive event generation on VMs monitored by Malware Prevention Feature

Resolution

This will get fixed in subsequent releases

Additional Information

1. Configure the policy in NSX so that only relevant files are inspected by Malware prevention feature
2. Increase the PVC size of Postgres Database if we know the number of files we expect in the span of 14 days.
3. Or Run the script provided and configure it as cron job to purge files beyond supported scale

#!/bin/bash

echo "-----------------------Script Start-----------------------"
date
export KUBECONFIG=/config/vmware/napps/.kube/config

# Script will delete only if total count of events is greater than below variable value
delete_records_if_more_than=5000000
# The number of events that should be kept which will have score greater than 'score_less_than' variable value
records_to_keep=2000000
score_less_than=30
# Set to 1 only if manual vacuum should be performed [Not Recommended]
perform_vacuum=0

POSTGRES_PASSWORD1=`kubectl -n nsxi-platform get secret postgresql-password -o jsonpath="{.data.postgresql-password}" | base64 -d`
val=`kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "select count(1) from inspection_events;" | awk 'NR==3 {print $1}'`

if [[ $val > $delete_records_if_more_than ]]; then
    echo "Deleting records"
    kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "DELETE FROM inspection_events WHERE id < ((select max(id) from inspection_events) - $records_to_keep) and score < $score_less_than;"
    if [[ $perform_vacuum = 1 ]]; then
      echo "Freeing up space"
      kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "VACUUM public.inspection_events";
      kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "VACUUM public.http_header";
    fi
    kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "REINDEX TABLE inspection_events";
fi

echo "-----------------------Script Complete-----------------------"

It is recommended not to perform vacuuming using the script, and let auto vacuum free up the space.
If required, vacuuming can be done using the script, but logs should be monitored as this vacuum can fail due to incorrect configuration of postgres.

To configure the script as a cron job:

# Save the script on manager. Ex: purge-mps-tables.sh
crontab -e

# Check for cron syntax on https://crontab.guru/
# Depending on the number of events generated per day configure the cron job accordingly

Attachments

purge-mps-tables.sh get_app