The MPS feature supports a scale of around 2 million events in a time span of 14 days. If more than 2 million events are generated in less than 14 days, Postgres can become overloaded and the NAPP UI eventually shows the state as "UNAVAILABLE".
All NAPP versions 4.2.0 and earlier.
The number of events generated by VMs monitored by the Malware Prevention Feature (MPS) exceeds the supported number: 2 million events generated within a span of 14 days. Each event can be the result of a file creation on an endpoint VM.
SSH into one of the NSX Manager nodes. Since the NAPP UI is unavailable, we will check the status of the Kubernetes pods running in NAPP with the following command:
|
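The exact command is elided above. On NSX Manager, the `napp-k` alias (referenced later in this article) runs kubectl preconfigured for the NAPP cluster, so the invocation is likely of the following shape (a sketch, not the verbatim original):

```shell
# List NAPP pods that are not in the Running or Completed state.
napp-k get pods | grep -vE 'Running|Completed'
```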
The above command will show all the Kubernetes pods running in NAPP which are not in the Running or Completed state.
An example output can be seen below:
|
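The original output is elided above; an illustrative example consistent with the description that follows might look like (RESTARTS and AGE values are placeholders):

```
NAME                         READY   STATUS             RESTARTS   AGE
postgresql-ha-postgresql-0   0/1     CrashLoopBackOff   12         2d4h
```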
From the above output, we see the postgres pod 'postgresql-ha-postgresql-0' is in 'CrashLoopBackOff' state.
We need to check the logs of the postgres pod to understand why it is in the 'CrashLoopBackOff' state. Kubernetes allows us to check the logs of containers running inside pods using the 'logs' command. To check the logs of the 'postgresql-ha-postgresql-0' pod, use the following command:
|
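The command itself is elided above; based on the surrounding text, it is of this shape (a sketch using the `napp-k` alias; the exact panic line will differ per environment):

```shell
# Grep the postgres pod logs for the 'panic' line that explains the crash.
napp-k logs postgresql-ha-postgresql-0 | grep -i panic
# Illustrative output (file name is a placeholder):
#   PANIC:  could not write to file "pg_wal/xlogtemp.123": No space left on device
```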
From the above logs we can infer that there is a space issue, due to the "No space left on device" log line. In the above example, we explicitly search for the word 'panic', since that log line usually indicates why postgres crashed. To view all the logs, use the command "napp-k logs postgresql-ha-postgresql-0".
Kubernetes allows us to interact with a container running inside a pod using the 'exec' command. Since the logs mentioned 'No space left', we can use the 'df' command to check file system space usage.
For example:
|
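The elided command can be sketched as follows (assuming the same `napp-k` alias):

```shell
# Run 'df' inside the postgres container to check file system usage.
napp-k exec postgresql-ha-postgresql-0 -- df -h
```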
Looking at the above example, we see that the folder '/bitnami/postgresql' is consuming 28GB of disk space.
To get more information about which files are using the space, we can use the 'du' command along with 'exec'.
For example:
|
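A sketch of the elided 'du' invocations (the directory layout is taken from the description below):

```shell
# Per-directory usage under the postgres volume.
napp-k exec postgresql-ha-postgresql-0 -- du -sh /bitnami/postgresql/*
# Drill down into the per-database storage directories.
napp-k exec postgresql-ha-postgresql-0 -- du -sh /bitnami/postgresql/data/base/*
```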
Looking at the above output, the folder 'data' inside the '/bitnami/postgresql' directory is consuming 28GB. Digging deeper, we see the folder '/bitnami/postgresql/data/base/16451' is responsible for consuming 26GB of the 28GB consumed by the 'data' folder.
We will use the Kubernetes 'exec' command again to open an interactive bash shell. With this shell, we will be able to access the psql command line and run SQL queries, which will help determine which database is consuming the storage. From the previous command, we saw the folder '/bitnami/postgresql/data/base/16451' consuming the most storage, so we will query postgres for the database name corresponding to this particular folder (in our case 16451):
|
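The elided steps can be sketched as follows; the psql user and authentication details are deployment-specific assumptions:

```shell
# Open an interactive shell inside the postgres container.
napp-k exec -it postgresql-ha-postgresql-0 -- bash
# Inside the container, start psql (user/auth are deployment-specific):
#   psql -U postgres
# Then map the on-disk folder name (the database OID) to a database name:
#   SELECT oid, datname FROM pg_database WHERE oid = 16451;
```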
In the 'datname' column, we see that the database whose data is stored in the folder 16451 is 'malwareprevention'.
Now that we know the database name, we can connect to the database (while still in the psql command line) using the '\c malwareprevention' command.
|
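An illustrative psql session for this step (the confirmation line is psql's standard response; prompt and user name may differ):

```
postgres=# \c malwareprevention
You are now connected to database "malwareprevention" as user "postgres".
```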
Once we are connected to the 'malwareprevention' database, we can check which tables are consuming the most space using the following command:
|
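The elided query can be sketched with standard Postgres catalog views (the exact column selection is an assumption; any size query ordered by `pg_total_relation_size` serves the same purpose):

```
SELECT relname,
       n_live_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```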
An example output might look something like the following:
|
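Illustrative output consistent with the analysis that follows (row count is a placeholder):

```
      relname      | n_live_tup | total_size
-------------------+------------+------------
 inspection_events |   26012345 | 24 GB
```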
On analyzing the output, we can see the table 'inspection_events' has around 26 million rows and is consuming 24GB on disk.
The supported scale for the malware prevention feature is around 2 million events. In this case, the number of events stored in the database is well above the supported limit.
If the issue is found on NAPP 4.2.0, please refer to KB: https://knowledge.broadcom.com/external/article?articleNumber=374261
For NAPP 4.2.0.1, implement the steps below:
purge-mps-tables.sh
#!/bin/bash |
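The script body is elided above. A minimal sketch of what such a purge script might look like, under explicit assumptions: the events table 'inspection_events' is confirmed earlier in this article, but the timestamp column name ('created_at') and psql user are illustrative guesses; take the authoritative script from the referenced KB.

```shell
#!/bin/bash
# Sketch: purge malware-prevention events older than the supported window.
# ASSUMPTIONS: column name 'created_at' and user 'postgres' are illustrative;
# use the script from the official KB in production.
RETENTION_DAYS=14

psql -U postgres -d malwareprevention <<SQL
DELETE FROM inspection_events
WHERE created_at < now() - interval '${RETENTION_DAYS} days';
-- Plain VACUUM makes the freed space reusable; VACUUM FULL would shrink the
-- files on disk but takes an exclusive lock on the table.
VACUUM inspection_events;
SQL
```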
To configure the script as a cron job:
|
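The cron configuration is elided above; a typical entry (the path and schedule are illustrative) would be added via `crontab -e`:

```shell
# Run the purge script daily at 02:00; adjust the path to where the script lives.
0 2 * * * /path/to/purge-mps-tables.sh >> /var/log/purge-mps-tables.log 2>&1
```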