spooler throws Unable to spool message alarm

Article ID: 378087


Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

I have problems with two robots that are sending spooler alarms. On investigation, the q*.rdb files are the ones consuming the most space, and the spooler stops processing due to lack of disk space.

spooler q#.rdb files

-rw-------    1 root     system   1314729984 Oct 23 17:26 q1.rdb
-rw-------    1 root     system    228422316 Oct 24 12:16 q2.rdb

Environment

  • DX UIM 20.4 CU9
  • Oracle 19c
  • AIX 7.2
  • oracle probe

Cause

  • In this particular scenario, the incoming rate of the QOS was higher than the outgoing rate, and the q1.rdb and q2.rdb files eventually used up all of the available disk space (2 GB).

Resolution

Common causes of the spooler alarm 'Unable to spool message' include one or more of the following:

  • Insufficient disk space on the robot disk volume (the spooler cannot write out to the local file system because the disk is low on space or full), or the drive is 'read-only'

  • Robot (controller, hdb and spooler) disk requirements -> minimum of 5 GB free disk space

  • The robot does not have the proper user permissions on the OS

  • On Windows, you can use Microsoft Process Explorer to check the spooler process.

  • On Linux, you can confirm that a process is running under a given user by name; just run: pgrep -u {USERNAME} {processName}

  • UIM (Nimsoft) Protocols for all components are TCP except for controller, hdb, and spooler, which also require UDP.

    • UDP broadcast is used for the discovery of the hub, spooler, and controller components. All other core communications are done via TCP.

  • There are two spooler services running (only one should be running)

  • There is AV software scanning/locking files in the Nimsoft directory, which prevents the spooler service from functioning correctly. (You MUST create a FULL exception for all UIM/Nimsoft programs on the robot.) You may find evidence of this issue by examining the Windows Event logs (Application/System).

  • There is a file system problem on the system where the robot is installed

  • Backup software is locking the files

  • The robot is unreachable via ping

  • The system/robot has been decommissioned (non-existent / unknown)

  • Robots have host intrusion protection (IPS) software installed on them (e.g., in /opt/Symantec)

  • In some cases, you can simply deactivate the robot, delete the q*.rdb files, and then reactivate the robot.

    • If the robot is running but can't connect to the hub, the q.rdb files will grow in size until the connection is restored.

  • Note that the spooler uses port 48001 by default

  • In robot.cfg, it may help to set loglevel to 6 and logsize to 50000 for finer-grained debugging.
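Several of the checks above (free disk space, queue file sizes, duplicate spooler processes) can be worked through with a short shell sketch. The robot path below is an assumption; adjust it for your install. The 5 GB threshold comes from the minimum free-space requirement noted above:

```shell
#!/bin/sh
# ROBOT_DIR is an assumed install path (assumption); adjust for your environment.
# Fall back to / so the sketch still runs on a host with no robot installed.
ROBOT_DIR=/opt/nimsoft
[ -d "$ROBOT_DIR" ] || ROBOT_DIR=/

# Returns success if free space (in KB) meets the 5 GB robot minimum.
has_min_free() {
    # 5 GB = 5 * 1024 * 1024 KB
    [ "$1" -ge 5242880 ]
}

# Free KB on the volume holding the robot (POSIX 'df -P': available is field 4).
free_kb=$(df -kP "$ROBOT_DIR" | awk 'NR==2 {print $4}')
if has_min_free "$free_kb"; then
    echo "disk OK: ${free_kb} KB free"
else
    echo "WARNING: below 5 GB free (${free_kb} KB)"
fi

# List spooler queue files and count spooler processes (there should be one).
ls -l "$ROBOT_DIR"/robot/q*.rdb 2>/dev/null
pgrep -c spooler || echo "no spooler process found"
```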

We checked the spooler probe via Infrastructure Manager (IM) on each robot using the probe utility (Ctrl-P) and ran the spooler get_info callback to see whether messages were flowing efficiently.

 
We added a new key into the spooler configuration:
 
   bulk_transfer = <value>
 
For the first robot/system, we set the value to 1500 (number of messages).

On the second robot having the same spooler issue, we set it to 500.
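A minimal sketch of the resulting spooler.cfg fragment is shown below. The <setup> section name is an assumption based on a typical robot install, so verify it against your own spooler.cfg before editing; the spooler (or robot) typically needs to be restarted for the change to take effect:

```
<setup>
   bulk_transfer = 1500
</setup>
```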

So essentially, we added this new key, bulk_transfer, and set it to a value (number of messages) so that the q1 and q2 records would decrease rather than continue to build up and consume disk space. The queues were not processing quickly enough for the message traffic, which in this case came from the oracle probe (the oracle_monitor process); we could see a large number of its messages using Tools -> DrNimBUS.

Once we made a few adjustments to the bulk_transfer value, we used the spooler probe utility to keep checking the number of q1 and q2 messages; they were processing very well and even emptying down to 0. So in summary, the incoming rate of the QOS was higher than the outgoing rate. Adding the bulk_transfer key to spooler.cfg and setting it to 1500 on one robot and 500 on the other resolved the issue.