On exceeding supported scale (2M events in 14 days) for Malware prevention feature, it consumes high disk space in Postgres DB
Symptoms:
- Postgres pods are in CrashLoopBackOff state
- Checking logs of postgresql-ha-postgresql-0 pod shows "No space left on device"
- `napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr` shows high storage consumption by data folder
Ex:
napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
28G /bitnami/postgresql/data/base
28G /bitnami/postgresql/data
26G /bitnami/postgresql/data/base/16451. <--------
2.2G /bitnami/postgresql/data/base/16577
481M /bitnami/postgresql/data/pg_wal
14M /bitnami/postgresql/data/pg_xact
- Exec into the pod and searching for the provided number (16451 in above example), shows database name as ‘malwareprevention’
Ex:
napp-k exec -it postgresql-ha-postgresql-0 -- bash
PGPASSWORD=$POSTGRES_PASSWORD psql
postgres=# SELECT datname FROM pg_database WHERE oid = 16451;
datname
-------------------
malwareprevention
(1 row)
- Checking for space occupied by tables, shows that ‘inspection_events’ table occupies a lot of space
Ex:
malwareprevention=# SELECT relname AS TableName, to_char(seq_scan, '999,999,999,999') AS TotalSeqScan,to_char(idx_scan, '999,999,999,999') AS TotalIndexScan, to_char(n_live_tup, '999,999,999,999') AS TableRows, pg_size_pretty(pg_total_relation_size(relname :: regclass)) AS TableSize FROM pg_stat_all_tables WHERE schemaname = 'public';
tablename | totalseqscan | totalindexscan | tablerows | tablesize
---------------------------------+------------------+------------------+------------------+-----------
malware_class | 1 | 0 | 0 | 24 kB
http_header | 0 | 517,895 | 26,143,960 | 1464 MB
unavailable_queue_table | 1,441 | 6 | 0 | 56 kB
client_cache_mapping | 115,656 | | 2 | 56 kB
reputation_failed_hash | 2 | 6 | 0 | 24 kB
kafka_offset | 1 | 38,382 | 1 | 88 kB
malware_family | 1 | 0 | 0 | 24 kB
update_queue | 14,390 | 591 | 0 | 248 kB
intercepted_entities | 2 | 333,071 | 543,140 | 438 MB
non_benign_table | 0 | 32 | 3,287 | 148 MB
reputation_cloud_service_config | 0 | 0 | 0 | 32 kB
inspection_events | 14 | 1,526,893 | 26,141,303 | 24 GB <--------
sa_configurations | 19 | | 2 | 48 kB
reputation_update_marker | 21 | 0 | 0 | 64 kB
reputation_pubsub_config | 0 | | 0 | 0 bytes
reputation_query_metadata | 1,440 | 0 | 0 | 24 kB
allow_list_table | 13,210 | 9,611 | 8 | 24 kB
summary_table | 0 | 35,986 | 9,594 | 1392 kB
(18 rows)
NAPP UI will stop functioning and features running on NAPP may not be able to function as intended.
All NAPP versions 4.2 and before
Excessive event generation on VMs monitored by Malware Prevention Feature
This will get fixed in subsequent releases
1. Configure the policy in NSX so that only relevant files are inspected by Malware prevention feature
2. Increase the PVC size of Postgres Database if we know the number of files we expect in the span of 14 days.
3. Or Run the script provided and configure it as cron job to purge files beyond supported scale
#!/bin/bash
echo "-----------------------Script Start-----------------------"
date
export KUBECONFIG=/config/vmware/napps/.kube/config
# Script will delete only if total count of events is greater than below variable value
delete_records_if_more_than=5000000
# The number of events that should be kept which will have score greater than 'score_less_than' variable value
records_to_keep=2000000
score_less_than=30
# Set to 1 only if manual vacuum should be performed [Not Recommended]
perform_vacuum=0
POSTGRES_PASSWORD1=`kubectl -n nsxi-platform get secret postgresql-password -o jsonpath="{.data.postgresql-password}" | base64 -d`
val=`kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "select count(1) from inspection_events;" | awk 'NR==3 {print $1}'`
if [[ $val > $delete_records_if_more_than ]]; then
echo "Deleting records"
kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "DELETE FROM inspection_events WHERE id < ((select max(id) from inspection_events) - $records_to_keep) and score < $score_less_than;"
if [[ $perform_vacuum = 1 ]]; then
echo "Freeing up space"
kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "VACUUM public.inspection_events";
kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "VACUUM public.http_header";
fi
kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "REINDEX TABLE inspection_events";
fi
echo "-----------------------Script Complete-----------------------"
It is recommended not to perform vacuuming using the script, and let auto vacuum free up the space.
If required, vacuuming can be done using the script, but logs should be monitored as this vacuum can fail due to incorrect configuration of postgres.
To configure the script as a cron job:
# Save the script on manager. Ex: purge-mps-tables.sh
crontab -e
# Check for cron syntax on https://crontab.guru/
# Depending on the number of events generated per day configure the cron job accordingly