NAPP postgres pods may go into CrashLoopBackOff state due to Malware Prevention Service consuming disk space

Article ID: 383398

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

The Malware Prevention Service (MPS) feature supports a scale of around 2 million events over a span of 14 days. If more than 2 million events are generated in less than 14 days, Postgres can become overloaded and the NAPP UI eventually shows the state as “UNAVAILABLE”.

Environment

All NAPP versions 4.2.0 and earlier.

Cause

The number of events generated by VMs monitored by the Malware Prevention Service (MPS) feature exceeds the supported limit of 2 million events generated within a span of 14 days. Each event can be the result of a file creation on an endpoint VM.

  • The NAPP UI shows the state as “UNAVAILABLE”
  • SSH into one of the NSX Manager nodes. Since the NAPP UI is unavailable, we will check the status of the Kubernetes pods running in NAPP with the following command:

    root:~# napp-k get pods | grep -v "Running\|Completed"

    The above command will show all the Kubernetes pods running in NAPP that are not in the Running or Completed state.
    An example output can be seen below:

    root:~# napp-k get pods | grep -v "Running\|Completed"
    NAME                                                              READY   STATUS             RESTARTS          AGE
    postgresql-ha-pgpool-64ff45f946-gw9gr                             0/1     CrashLoopBackOff   123 (2m13s ago)   7d
    postgresql-ha-postgresql-0                                        0/1     CrashLoopBackOff   103 (3m24s ago)   7d

    From the above output, we see the postgres pod 'postgresql-ha-postgresql-0' is in the 'CrashLoopBackOff' state.

  • We need to check the logs of the postgres pod to understand why it is in the 'CrashLoopBackOff' state. Kubernetes allows us to check the logs of containers running inside pods using the 'logs' command. To check the logs of the 'postgresql-ha-postgresql-0' pod, use the following command:

    root:~# napp-k logs postgresql-ha-postgresql-0 | grep -i panic
    2024-07-17 18:19:57.227 GMT [149] PANIC:  could not write to file "pg_wal/xlogtemp.149": No space left on device

    From the above logs, we can infer there is a space issue from the "No space left on device" log line. To view all the logs, use the command "napp-k logs postgresql-ha-postgresql-0". In the above example, we explicitly search for the word 'panic' since that log line usually indicates why postgres crashed.

  • In case of 'No space left on device', we will need to increase the storage space allocated to the postgres pod. Please refer to the following KB to do so: https://knowledge.broadcom.com/external/article/320807
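
    Before resizing, it can help to note the current PVC allocation. A minimal check (assuming the PVCs follow the postgresql naming used by the pods above):

    root:~# napp-k get pvc | grep postgresql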
  • Kubernetes allows us to interact with a container running inside a pod using the 'exec' command. Since the logs mentioned 'No space left', we can use the 'df' command to check the file system space usage.
    For example:

    root:~# napp-k exec postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdi         35G   28G  4.8G  86% /bitnami/postgresql

    Looking at the above example, we see that the file system mounted at '/bitnami/postgresql' is 86% full, with 28GB of its 35GB capacity used.

  • To get more information about which files are using the space, we can use the 'du' command along with 'exec'
    For example:

    root:~# napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
    28G /bitnami/postgresql/data/base
    28G /bitnami/postgresql/data
    26G /bitnami/postgresql/data/base/16451 <--------
    2.2G /bitnami/postgresql/data/base/16577
    481M /bitnami/postgresql/data/pg_wal
    14M /bitnami/postgresql/data/pg_xact

    Looking at the above output, the folder 'data' inside the '/bitnami/postgresql' directory is consuming 28GB. Digging deeper, we see the folder '/bitnami/postgresql/data/base/16451' accounts for 26GB of that 28GB.

  • We will use the Kubernetes 'exec' command again, this time to open an interactive bash shell. With this shell, we will be able to access the psql command line and run SQL queries. This will be useful in determining which database is consuming the storage. To identify the database consuming the most storage, we will use the data from the previous command. We saw the folder '/bitnami/postgresql/data/base/16451' consuming the most storage, so we will query Postgres for the name of the database whose OID matches this folder name (in our case, 16451):

    root:~# napp-k exec -it postgresql-ha-postgresql-0 -- bash
    root:~# PGPASSWORD=$POSTGRES_PASSWORD psql
    psql (13.5 (Debian 13.5-0+deb11u1))
    Type "help" for help.
     
    postgres=# SELECT datname FROM pg_database WHERE oid = 16451;
          datname
    -------------------
     malwareprevention
    (1 row)

    In the 'datname' column, we see that the database whose data is stored in the folder 16451 is 'malwareprevention'.
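
    As an alternative to mapping folder OIDs manually, the size of each database can also be queried directly from the same psql session using standard Postgres catalog functions:

    postgres=# SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size FROM pg_database ORDER BY pg_database_size(datname) DESC;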

  • Now that we know the database name, we can connect to the database (while still in the psql command line) using the '\c malwareprevention' command.

    postgres=# \c malwareprevention
    You are now connected to database "malwareprevention" as user "postgres".
    malwareprevention=#

  • Once we are connected to the 'malwareprevention' database, we can check which tables are consuming the most space using the following command:

    malwareprevention=# SELECT relname AS TableName, to_char(seq_scan, '999,999,999,999') AS TotalSeqScan, to_char(idx_scan, '999,999,999,999') AS TotalIndexScan, to_char(n_live_tup, '999,999,999,999') AS TableRows, pg_size_pretty(pg_total_relation_size(relname :: regclass)) AS TableSize FROM pg_stat_all_tables WHERE schemaname = 'public';

  • An example output might look something like the following:

    malwareprevention=# SELECT relname AS TableName, to_char(seq_scan, '999,999,999,999') AS TotalSeqScan, to_char(idx_scan, '999,999,999,999') AS TotalIndexScan, to_char(n_live_tup, '999,999,999,999') AS TableRows, pg_size_pretty(pg_total_relation_size(relname :: regclass)) AS TableSize FROM pg_stat_all_tables WHERE schemaname = 'public';
                tablename            |   totalseqscan   |  totalindexscan  |    tablerows     | tablesize
    ---------------------------------+------------------+------------------+------------------+-----------
     malware_class                   |                1 |                0 |                0 | 24 kB
     http_header                     |                0 |          517,895 |       26,143,960 | 1464 MB
     unavailable_queue_table         |            1,441 |                6 |                0 | 56 kB
     client_cache_mapping            |          115,656 |                  |                2 | 56 kB
     reputation_failed_hash          |                2 |                6 |                0 | 24 kB
     kafka_offset                    |                1 |           38,382 |                1 | 88 kB
     malware_family                  |                1 |                0 |                0 | 24 kB
     update_queue                    |           14,390 |              591 |                0 | 248 kB
     intercepted_entities            |                2 |          333,071 |          543,140 | 438 MB
     non_benign_table                |                0 |               32 |            3,287 | 148 MB
     reputation_cloud_service_config |                0 |                0 |                0 | 32 kB
     inspection_events               |               14 |        1,526,893 |       26,141,303 | 24 GB
     sa_configurations               |               19 |                  |                2 | 48 kB
     reputation_update_marker        |               21 |                0 |                0 | 64 kB
     reputation_pubsub_config        |                0 |                  |                0 | 0 bytes
     reputation_query_metadata       |            1,440 |                0 |                0 | 24 kB
     allow_list_table                |           13,210 |            9,611 |                8 | 24 kB
     summary_table                   |                0 |           35,986 |            9,594 | 1392 kB
    (18 rows)
     
    malwareprevention=#

    On analyzing the output, we can see the table 'inspection_events' has around 26 million rows and is consuming 24GB on disk.
    The supported scale for the malware prevention feature is around 2 million events. The number of events stored in the database is well above the supported limit in this case.
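
    If needed, the row count can be confirmed from the same psql session (the same style of count(1) check the cleanup script in the Resolution section uses):

    malwareprevention=# SELECT count(1) FROM inspection_events;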

Resolution

If the issue is found on NAPP 4.2.0, then please refer to KB: https://knowledge.broadcom.com/external/article?articleNumber=374261

For NAPP 4.2.0.1, please implement the steps mentioned below:

  1. Increase the PVC size of the Postgres database if the number of files expected in a span of 14 days is known.
    To increase the PVC size, refer to: https://knowledge.broadcom.com/external/article/320807
  2. Run the script provided below and configure it as a cron job to remove database records beyond the supported scale
    1. SSH into the NSX Manager as the root user:
      ssh root@<nsx-manager ip or fqdn>
    2. Create a new shell script file using any text editor. The command below shows an example using the nano text editor:
      nano purge-mps-tables.sh
    3. Copy the entire script provided in this KB into the file and save it. Press Ctrl + O to save and Ctrl + X to exit.
    4. Run the following command to make the script executable:
      chmod +x purge-mps-tables.sh
    5. The script can now be executed by running:
      ./purge-mps-tables.sh

 

purge-mps-tables.sh

#!/bin/bash
echo "-----------------------Script Start-----------------------"
date
export KUBECONFIG=/config/vmware/napps/.kube/config
# Script will delete only if total count of events is greater than below variable value
delete_records_threshold=2000000
# The number of events that should be kept
records_limit=2000000
# Total size of all indexes (in GB) in table after which reindex should be performed
reindex_size_threshold=5
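# Read the postgres password from the kubernetes secret in the nsxi-platform namespace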
POSTGRES_PASSWORD1=`kubectl -n nsxi-platform get secret postgresql-password -o jsonpath="{.data.postgresql-password}" | base64 -d`
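# Current total size (in bytes) of all indexes on the inspection_events table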
inspection_events_index_size=`kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "select pg_indexes_size('inspection_events');" | awk 'NR==3 {print $1}' | tr -dc '[0-9]'`
if [[ $inspection_events_index_size -gt $reindex_size_threshold*1024*1024*1024 ]]; then
    echo "Starting reindex of inspection_events table"
    kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "BEGIN;SET LOCAL LOCK_TIMEOUT='3000s';LOCK TABLE inspection_events IN EXCLUSIVE MODE;LOCK TABLE kafka_offset IN EXCLUSIVE MODE;REINDEX TABLE inspection_events;END;"
else
    echo "Not performing reindex on inspection_events table, as total index size: $inspection_events_index_size bytes is less than the threshold of: $reindex_size_threshold GB"
fi
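# Row count and total index size (in bytes) for the intercepted_entities table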
intercepted_entities_count=`kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "select count(1) from intercepted_entities;" | awk 'NR==3 {print $1}' | tr -dc '[0-9]'`
intercepted_entities_index_size=`kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "select pg_indexes_size('intercepted_entities');" | awk 'NR==3 {print $1}' | tr -dc '[0-9]'`
if [[ $intercepted_entities_count -gt $delete_records_threshold ]]; then
    echo "Deleting records from intercepted_entities table"
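    # Keep the newest $records_limit rows (ordered by last_updated) with verdict_source '0' and delete the older ones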
    kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "WITH records_to_delete AS (SELECT sha256 FROM intercepted_entities WHERE verdict_source = '0' ORDER BY last_updated DESC OFFSET $records_limit) DELETE from intercepted_entities USING records_to_delete WHERE intercepted_entities.sha256 = records_to_delete.sha256;"
else
    echo "NOT Deleting any records from intercepted_entities since the current number of records is $intercepted_entities_count which is less than $delete_records_threshold"
fi
if [[ $intercepted_entities_index_size -gt $reindex_size_threshold*1024*1024*1024 ]]; then
    echo "Starting reindex of intercepted_entities table"
    kubectl -n nsxi-platform exec -it postgresql-ha-postgresql-0 -- psql postgresql://postgres:$POSTGRES_PASSWORD1@localhost/malwareprevention -c "BEGIN;SET LOCAL LOCK_TIMEOUT='3000s';LOCK TABLE intercepted_entities IN EXCLUSIVE MODE;LOCK TABLE kafka_offset IN EXCLUSIVE MODE;REINDEX TABLE intercepted_entities;END;"
else
    echo "Not performing reindex on intercepted_entities table, as total index size: $intercepted_entities_index_size bytes is less than the threshold of: $reindex_size_threshold GB"
fi
echo "-----------------------Script Complete-----------------------"
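
After the script completes, the effect on disk usage can be re-checked with the same 'df' command used earlier in this article. Note that deleted rows free space for reuse inside Postgres, but the 'df' figure may not drop immediately, since Postgres returns disk space to the operating system only after operations such as VACUUM FULL:

napp-k exec postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql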

To configure the script as a cron job:

# Save the script on manager. Ex: purge-mps-tables.sh
crontab -e
 
# Check for cron syntax on https://crontab.guru/
# Depending on the number of events generated per day, configure the cron job schedule accordingly
# It is recommended to run the cron job every 4 hours
0 */4 * * * /root/purge-mps-tables.sh >> /var/log/purge-mps-tables.log
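
# To confirm the job was registered, list the current crontab entries
crontab -l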