probes crash/core dump and turn red when the disk is full



Article ID: 427857



Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

  • The filesystem where the UIM robot is installed has filled up.
  • Numerous probes, including dirscan, processes, logmon, cdm, and net_connect, have terminated and are crashing and creating core dumps, in some cases repeatedly.

 

Environment

DX UIM - Any Version
Numerous monitoring probes, including processes, logmon, dirscan, cdm, and net_connect (others not listed here may also be affected)

Cause

C SDK limitation

Resolution

UIM probes perform numerous OS-level operations, such as running system commands and writing to files. Depending on the probe, those files include its own log files and temporary data files.

If the filesystem fills up, many of these operations fail and cause the probe to crash or abort. This is unavoidable because of limitations in the underlying C libraries used to develop the probes. The crash itself is generally harmless: it indicates only that a critical operation failed, and the controller probe recognizes the crash and restarts the probe.

However, if the disk is still full, the restarted probe may crash again when it tries to create a new log file or write to its data files on startup, which results in repeated crashes.

The robot has a built-in limiter to prevent probes from restarting infinitely, so after a set number of crashes (10 by default) the probe should stop crashing and the probe log will contain a "max restarts" error.
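To check whether a probe has reached the limiter, its log can be searched for the "max restarts" text. The following is a minimal sketch, not part of the product: the function name is hypothetical, the log path is whatever your installation uses, and the exact wording of the log message may differ between versions, so the match is a case-insensitive substring search.

```python
def find_max_restart_lines(log_path, marker="max restarts"):
    """Return (line_number, text) pairs for log lines that contain the marker.

    The marker text is an assumption based on the error described in this
    article; adjust it if your probe log words the message differently.
    """
    hits = []
    needle = marker.lower()
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits
```

Pass the function the path of the probe log you want to inspect. An empty list means the marker was not found in that file.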

 

Because the DX UIM robot relies heavily on the ability to write to disk, in some cases hundreds or thousands of times per minute, it is critical to ensure that the robot always has sufficient disk space for these operations.

We recommend a minimum of 5 GB of free space for a robot installation. Additionally, the CDM probe can and should be used to monitor disk usage and to alert administrators when a filesystem is filling up, so that action can be taken before it becomes completely full.
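For a quick manual check against the 5 GB recommendation, a short script can report the free space on the filesystem that holds the robot. This is only an ad hoc sketch and not a replacement for CDM monitoring. The function name is hypothetical, 5 GB is interpreted here as 5 GiB, and the installation path must be supplied by you.

```python
import shutil

# Assumption: the 5 GB recommendation is treated as 5 GiB.
MIN_FREE_BYTES = 5 * 1024**3


def free_space_report(path, min_free_bytes=MIN_FREE_BYTES):
    """Report free space for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        # True when free space meets or exceeds the threshold.
        "ok": usage.free >= min_free_bytes,
    }
```

Call `free_space_report()` with the robot installation directory. An `ok` value of `False` means free space is below the threshold.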

 

To resolve this issue if it does occur, free up disk space and then restart the robot. There is no way to avoid the probe failures while the disk is full, and the crashes and core dumps are an expected result of those failures.
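When freeing space, it can help to see how much the core dumps themselves occupy. The sketch below lists candidate core files under a directory, largest first, and deletes nothing. It assumes core files are named `core` or `core.<suffix>`. The actual naming depends on the operating system configuration (on Linux, the `kernel.core_pattern` setting), so treat the match rule and the function name as illustrative.

```python
import os


def find_core_files(root):
    """Return (size_bytes, path) tuples for likely core files, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumed naming convention; adjust to match your system.
            if name == "core" or name.startswith("core."):
                full_path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(full_path)
                except OSError:
                    # File vanished or is unreadable; skip it.
                    continue
                found.append((size, full_path))
    found.sort(reverse=True)
    return found
```

Review the results before removing anything, since a core file may be needed for troubleshooting.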

Additional Information

If the time between probe startup and crash is longer than a few seconds, the restart limiter may in some cases not recognize the restarts correctly. If this happens, and core dump files are saved to a different filesystem than the one where the UIM robot is installed, the core dump files may consume large amounts of disk space on that filesystem and eventually fill it as well.

In that case, see the following KB article: "probe crashing repeatedly does not trigger max restarts limiter"

Most Linux distributions also provide settings that limit the number of core files or their maximum size.
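One such setting is the per-process core file size limit (`RLIMIT_CORE`, the value shown by `ulimit -c`). As a hedged illustration, the Python standard library `resource` module can read and lower it for the current process and its children. The function names below are hypothetical, and this sketch does not change how the robot itself is started; persistent limits are normally configured at the operating system level.

```python
import resource


def get_core_limits():
    """Return the (soft, hard) core file size limits in bytes.

    A value equal to resource.RLIM_INFINITY means "unlimited".
    """
    return resource.getrlimit(resource.RLIMIT_CORE)


def cap_core_size(max_bytes):
    """Set the soft core size limit to max_bytes, never above the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    new_soft = max_bytes
    if hard != resource.RLIM_INFINITY and max_bytes > hard:
        # An unprivileged process cannot exceed the hard limit.
        new_soft = hard
    resource.setrlimit(resource.RLIMIT_CORE, (new_soft, hard))
    return new_soft
```

Calling `cap_core_size(0)` disables core files for the current process and any processes it starts afterwards.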