USM becomes unresponsive and UMP loses connection due to a slow rise in memory utilization

Article ID: 135937


Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Over the last week we noticed something strange: USM suddenly became unresponsive and the alarms stopped updating. When it happened, we tried closing the browser and opening a fresh page, but UMP displayed the login screen and then just sat and spun at the loading bar. The only fix that seemed to work was to reload the robot_update 7.97 HF3 package (right-click the controller probe on the primary hub and hit Update, which re-installs/reloads the robot configuration); that temporarily fixed the issue. Restarting the wasp and the robot did nothing. A few days later it happened again, and we tried the same things (restarting the wasp, restarting the robot, etc.). Reloading the robot configuration again brought USM back up, but the issue kept reoccurring.

Environment

Release : 9.0.2

Component : UIM - DISCOVERY_SERVER

Cause

- Multiple contributing issues, including a controller (robot) memory leak and steadily increasing memory usage by core probes on the Primary hub (see Resolution).

Resolution

Multiple issues had to be addressed to resolve this problem:


- robot (controller) 7.97 memory leak: the robot was upgraded to 7.97 HF3, which contains the fix for the memory leak.

- Physical memory usage on the Primary hub and UMP machines was increasing over time, as various C and Java processes appeared to consume more and more memory. The affected processes included:

   - discovery_server

   - data_engine

   - nis_server

- data_engine 9.0.2 was upgraded to 9.0.2 HF3 to address various known issues.

- Primary hub physical memory usage was improved (decreased) by various changes made to core probes, most of which are described here. At first, memory usage still worsened within a few days; over time, as further configuration changes were made, it took weeks before the Primary hub needed a restart. Even then, the discovery_server still used enough memory to warrant a restart of the Primary hub robot.


Therefore the discovery_server eventually had to be offloaded from the Primary hub to a child robot of the primary.

The child robot hosting the discovery_server was configured with hardware/resources as per the recommendation: 2.4 GHz or higher, quad-core, 4 processors or more, minimum 16 GB RAM, and SSD disk.


Other updates to discovery_server.cfg:

- From the discovery_server probe's Raw Configure GUI, select the setup folder, then in the right-hand pane create a new section called nimbusscan

- Select the newly created nimbusscan folder in the left-hand pane, then add the following key/value pair in the right-hand pane:

nis_cache_update_interval_secs = 1800

- The niscache polling interval controls how often the discovery_server probe calls the controller probes' _nis_cache callback to check for niscache changes. Niscache changes are not event-driven; the controller needs to be polled at a regular interval to detect any niscache changes.
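
After these steps, the relevant portion of discovery_server.cfg should look roughly like the following (a sketch only; the other keys under the setup section are environment-specific and omitted here):

   <setup>
      ...
      <nimbusscan>
         nis_cache_update_interval_secs = 1800
      </nimbusscan>
   </setup>

With this key in place, the discovery_server polls each controller's niscache every 1800 seconds (30 minutes).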


- The data_engine log level had to be decreased from 5 down to 1 due to a discovered/known issue with the log data being accumulated/held in memory.
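
For reference, and assuming the standard probe configuration layout, this corresponds to the loglevel key under the setup section of data_engine.cfg (a minimal sketch; other setup keys are omitted):

   <setup>
      loglevel = 1
      ...
   </setup>

The change can be made via the data_engine probe's Raw Configure GUI, in the same way as the discovery_server change described above.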


- Once the DS was offloaded to the child robot of the Primary as described above, the original issue of users not being able to log in to UMP was resolved.


- The data_engine memory usage trend is still increasing as more monitoring is enabled, since the data_engine holds the S_QOS_DATA table in its memory buffer.


The customer is significantly expanding their UIM monitoring footprint. As an example, the S_QOS_DATA table, which already holds over 1 million rows/QOS objects, recently grew by ~15k QOS entries within 3 days, apparently due to enabling monitoring for a storage probe. This trend will continue as the customer expands/enables more monitoring (TEST -> PROD).
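
To put that growth rate in perspective: ~15k new QOS entries in 3 days is roughly 5k per day, or on the order of 150k additional S_QOS_DATA rows per month if that rate were sustained, on top of the 1 million+ rows already present. Since the data_engine holds this table in its memory buffer, its memory usage can be expected to keep trending upward as monitoring expands from TEST to PROD.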