Druid tasks pile up causing OOM and service failures in NSX Intelligence

Products

VMware NSX

Issue/Introduction

This article describes how to unblock Druid by letting the stop tasks complete and by removing the backlog of stop tasks.

Symptoms:

The node status is shown as degraded in the GUI, with the PROCESSING and Spark services down because the Druid Overlord service has run out of memory (OOM).
New flow/config data is not displayed because the Druid Overlord service has run out of memory.
One of the Druid services (the Overlord service) has run out of memory.

Note: You can confirm whether the Druid Overlord service is failing due to Out of Memory by running this command:

grep "OutOfMemory" /var/log/druid/sv/overlord/current | wc -l

This command prints the number of lines in the Overlord log that contain "OutOfMemory", which indicates how many times the service has run out of memory. A high number implies that the service has been running out of memory repeatedly and degrading Druid's health.
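
The same check can be sketched in Python for cases where the count needs to feed into a monitoring script. This is a minimal sketch, not part of the product tooling: the function name count_oom_lines is hypothetical, and the default log path is the one used in the command above.

```python
from pathlib import Path

# Path taken from the grep command above.
OVERLORD_LOG = "/var/log/druid/sv/overlord/current"


def count_oom_lines(log_path: str = OVERLORD_LOG) -> int:
    """Count log lines containing "OutOfMemory" (mirrors grep ... | wc -l)."""
    count = 0
    # errors="replace" keeps the scan going if the log contains undecodable bytes.
    with Path(log_path).open("r", errors="replace") as log:
        for line in log:
            if "OutOfMemory" in line:
                count += 1
    return count
```

Calling count_oom_lines() with no argument reads the Overlord log at the path above; pass another path to check a copied or rotated log.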

Environment

VMware NSX-T 1.0.x
VMware NSX-T

Cause

When upgrading from NSX Intelligence 1.0.x to a higher version, the seven old tables are marked for removal. This is done by running a cleanup task for each table. However, in the upgraded system the old tables do not exist in the configuration, so the cleanup tasks cannot act on them and never complete. New cleanup tasks are fired every hour and pile up. Each cleanup task consumes a small amount of memory, and the growing backlog eventually causes the system to fail. The backlog may also prevent ingestion tasks from running, so new data is not shown.
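
To make the growth concrete, the following sketch estimates the backlog size over time. It assumes that one cleanup task per old table (seven in total) is fired each hour and that none of them completes, which is one reading of the description above. The per-task memory value is a purely hypothetical input, not a measured figure.

```python
OLD_TABLES = 7                 # from the description above
TASKS_PER_TABLE_PER_HOUR = 1   # assumption: one cleanup task per table per hour


def backlog_after(hours: int) -> int:
    """Number of piled-up cleanup tasks if none of them ever completes."""
    return OLD_TABLES * TASKS_PER_TABLE_PER_HOUR * hours


def backlog_memory_mb(hours: int, mb_per_task: float) -> float:
    """Rough memory held by the backlog; mb_per_task is a hypothetical figure."""
    return backlog_after(hours) * mb_per_task
```

Under these assumptions, backlog_after(24) gives 168 pending tasks after one day, so even a small per-task footprint adds up quickly against a heap of a few hundred megabytes.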

Resolution

This is a known issue affecting upgrades from NSX Intelligence 1.0.x to a higher version, including NSX Intelligence 1.1.x and 1.2.0.

This issue is resolved in VMware NSX Intelligence 1.2.1, available at VMware Downloads.

Workaround:

If you upgraded from NSX Intelligence 1.0.x and have no symptoms:

Run the attached Python script remove_backlogged_cleanup_task.py. This cleans up the existing backlogged cleanup tasks.
Run the attached Python script unblock_cleanup_task.py. This enables the cleanup tasks to find the old tables, which should prevent the backlog from recurring.

Note: The unblock_cleanup_task.py script requires the NSX Intelligence version number as an argument. The version must start with either 1.1 or 1.2.

For example, "python3 unblock_cleanup_task.py -v 1.1.001"
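
If the script is invoked from automation, a pre-check on the version argument can catch mistakes early. The sketch below encodes only the rule stated in the note, interpreted as the major.minor part being 1.1 or 1.2. The helper names are hypothetical, and how unblock_cleanup_task.py itself validates the argument is not shown here.

```python
from typing import List

# Accepted major.minor pairs, from the note above.
ACCEPTED_MAJOR_MINOR = {("1", "1"), ("1", "2")}


def is_accepted_version(version: str) -> bool:
    """True if the version's major.minor part is 1.1 or 1.2."""
    parts = version.strip().split(".")
    return len(parts) >= 2 and (parts[0], parts[1]) in ACCEPTED_MAJOR_MINOR


def build_unblock_command(version: str) -> List[str]:
    """Build the command line from the example above; raise on other versions."""
    if not is_accepted_version(version):
        raise ValueError(f"version must start with 1.1 or 1.2, got {version!r}")
    return ["python3", "unblock_cleanup_task.py", "-v", version.strip()]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains the script.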

If you upgraded from NSX Intelligence 1.0.x and the Druid service fails with an Out of Memory error, first restore the service by giving the Druid Overlord service more memory, then run the remove_backlogged_cleanup_task.py and unblock_cleanup_task.py scripts:

Open /opt/druid/conf/druid/overlord/jvm.config in an editor.
Search for -Xms and -Xmx. You should see something similar to:

-Xms512M
-Xmx512M
Change the JVM memory to 2 GB to start with. For example:

-Xms2G
-Xmx2G
Save the file and restart the Druid service by running this command:

service druid restart

Note: After this, the Overlord service should come up successfully. If it still fails, try higher memory values (3G, 4G, etc.) to bring the service back up.
Run the remove_backlogged_cleanup_task.py and unblock_cleanup_task.py scripts.
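
The heap change in the steps above can also be scripted. The following is a minimal sketch, assuming jvm.config holds one JVM flag per line as in the excerpt. The function name set_heap_size is hypothetical, and the sketch only transforms text, leaving reading, backing up, and writing the file to the caller.

```python
import re

# Matches a whole line of the form -Xms<size> or -Xmx<size>, e.g. -Xms512M.
# [ \t]* (rather than \s*) avoids swallowing the line's trailing newline.
_HEAP_FLAG = re.compile(r"^-(Xms|Xmx)\d+[kKmMgG]?[ \t]*$", re.MULTILINE)


def set_heap_size(config_text: str, size: str = "2G") -> str:
    """Return config_text with the -Xms and -Xmx values replaced by size."""
    return _HEAP_FLAG.sub(lambda m: f"-{m.group(1)}{size}", config_text)
```

Other lines are left untouched, and calling set_heap_size(text, "3G") or set_heap_size(text, "4G") covers the case where 2G is still not enough. The Druid service still needs to be restarted afterwards, as described above.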

Attachments

unblock_cleanup_task

remove_backlogged_cleanup_task