Release : 9.0.2
Component : UIM - DISCOVERY_SERVER
There were multiple issues that needed to be addressed to achieve this resolution.
- robot (controller) 7.97 memory leak (upgraded to robot 7.97 HF3, which contains the fix for the memory leak)
- Physical memory usage on the Primary hub and UMP machines kept climbing because several C and Java processes appeared to consume more and more memory over time. This included the following processes:
  - discovery_server
  - data_engine
  - nis_server
- data_engine 9.0.2 was upgraded to 9.0.2 HF3 to address various issues
- Primary hub physical memory usage improved (decreased) with the various changes made to the core probes, most of which are described here. At first, memory usage still worsened within a few days; as further configuration changes were made, it took weeks before the Primary hub needed a restart, but the discovery_server still consumed enough memory to warrant a restart of the Primary Hub robot.
Therefore the discovery_server eventually had to be offloaded from the Primary hub to a child robot of the primary.
Child robot hardware/resources for the discovery_server were configured as per the recommendation: 2.4 GHz or higher, quad core, 4 processors or more, minimum 16 GB RAM, SSD disk.
Other updates to discovery_server.cfg:
- From the discovery_server probe's Raw Configure GUI, select the setup folder, then in the right-hand pane create a new section called nimbusscan
- Select the newly created nimbusscan folder in the left-hand pane, then add the following key and value in the right-hand pane:
nis_cache_update_interval_secs = 1800
- The niscache polling interval controls how often the discovery_server probe calls the controller probe's _nis_cache callback to check for niscache changes. Niscache changes are not event driven, so the controller must be polled at a regular interval to detect them. The resulting configuration is sketched below.
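For reference, here is a minimal sketch of how the setup section of discovery_server.cfg might look after this change, assuming standard Nimsoft key = value section syntax and that nimbusscan is created as a subsection of setup as described in the steps above (all other existing keys are omitted):

   <setup>
      <nimbusscan>
         nis_cache_update_interval_secs = 1800
      </nimbusscan>
   </setup>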
- data_engine logging (loglevel) had to be decreased from 5 down to 1 due to a known issue with log data being accumulated/held in memory (see the sketch below)
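Similarly, a minimal sketch of the corresponding change in data_engine.cfg, assuming the standard loglevel key under the setup section (other existing keys omitted):

   <setup>
      loglevel = 1
   </setup>

Keeping the logging level low prevents verbose log data from being built up in memory by the data_engine.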
- Once the discovery_server was offloaded to the child robot of the Primary as described above, the original issue of users not being able to log in to UMP was resolved.
- data_engine memory usage is still trending upward as more monitoring is enabled, since the data_engine holds the S_QOS_DATA table in its in-memory buffer.
The customer is significantly expanding their UIM monitoring footprint. For example, the S_QOS_DATA table, which already holds more than 1 million rows/QOS objects, recently grew by roughly 15,000 QOS entries within 3 days, apparently due to enabling monitoring for a storage probe. This trend will continue as the customer enables more monitoring (TEST -> PROD). A query for tracking this growth is sketched below.
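As one way to track this growth over time, a periodic row count against S_QOS_DATA in the UIM backend database could be recorded. This is only an illustrative sketch; the probe column is assumed from the standard UIM schema and should be verified for your release:

   -- Total number of QOS objects (one row per QOS definition)
   SELECT COUNT(*) AS qos_object_count FROM S_QOS_DATA;

   -- Breakdown by probe, to see which probes are driving the growth
   SELECT probe, COUNT(*) AS qos_objects
   FROM S_QOS_DATA
   GROUP BY probe
   ORDER BY qos_objects DESC;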