CA UIM Hub restarts by itself without user prompting

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

The hub probe restarts by itself, without any user interaction. This occurs relatively frequently.
The restarts cause gaps in QoS data in the database.

Environment

UIM 8.4 and later

Hub

Cause

The CA UIM hub restarts by itself at what appear to be random intervals, and no user initiates the restart. The restarts cause gaps in QoS data in the database.

== Note that when the controller goes down, the hub goes down along with it.

== Collect level 3 log files from the controller, e.g., "controller.log" and "_controller.log", captured after the problem occurs.

== If you find messages like the following, the cause is a server time synchronization issue:

Oct xx 12:59:34:664 [3086436032] Controller: The next admin check time is too far into the future (11038 seconds).
Oct xx 12:59:34:664 [3086436032] Controller: This indicates that the clock has been changed and a robot restart is needed.
Oct xx 12:59:34:664 [3086436032] Controller: ServiceStop
Oct xx 12:59:34:664 [3086436032] Controller: SREQUEST: post ->xx.xx.xx.xx/48001
Oct xx 12:59:34:664 [3086436032] Controller: RREPLY: status=OK(0) <-xx.xx.xx.xx/48001 h=37 d=28
Oct xx 12:59:34:664 [3086436032] Controller: Going down...
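To check a large controller log for this condition, the messages above can be searched for programmatically. The following is a minimal Python sketch, not a CA-supplied tool; it assumes the log lines have the same layout as the excerpt above, and the function name find_clock_change_events is hypothetical.

```python
import re

# Matches the controller message shown in the excerpt above.
# Assumption: the text before the "[thread id]" field is the line's timestamp.
CLOCK_CHANGE_RE = re.compile(
    r"^(?P<stamp>.*?)\s*\[\d+\]\s+Controller: The next admin check time "
    r"is too far into the future \((?P<seconds>\d+) seconds\)"
)


def find_clock_change_events(lines):
    """Return (timestamp_text, seconds) for each clock-change message found."""
    events = []
    for line in lines:
        match = CLOCK_CHANGE_RE.match(line.strip())
        if match:
            events.append((match.group("stamp"), int(match.group("seconds"))))
    return events
```

A file object can be passed directly, for example find_clock_change_events(open("controller.log", errors="replace")). Each returned pair gives the log timestamp and the number of seconds reported by the controller.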

Resolution

Check the system clock on your hub and attached robots. The clocks need to be in sync, and no separate process may be resetting the time by a large amount. The controller restarts when the system time changes, especially if the change is large.
The problem can also occur on a physical server where some process, for example an NTP process, is regularly resetting the server system clock.
The controller log excerpt below shows an example: the log timestamps jump backward abruptly, from 11:42 to 06:48, immediately before the controller goes down.

Oct xx 11:42:11:206 [3086223040] Controller: Controller on xxx port 48000 started
Oct xx 11:42:13:440 [3086223040] Controller: Hub localhost(xx.xx.xx.xx) contact established
Oct xx 11:42:16:639 [3086223040] Controller: (secCallVerifyLogin) request verify_login failed xx.xx.xx.xx/48002 (permission denied)
Oct xx 11:42:16:639 [3086223040] Controller: verify login - cmd=probe_list frm=xx.xx.xx.xx/43676 failed
Oct xx 06:48:00:216 [3086223040] Controller: The next admin check time is too far into the future (17752 seconds).
Oct xx 06:48:00:216 [3086223040] Controller: This indicates that the clock has been changed and a robot restart is needed.
Oct xx 06:48:00:216 [3086223040] Controller: Going down...
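Backward jumps like the one in this excerpt can also be located by comparing the timestamps of consecutive log lines. The sketch below is illustrative only and rests on several assumptions: each line starts with a month, a day token, and a time in HH:MM:SS:mmm form as shown above, and only lines with the same month and day token are compared, so a jump that crosses midnight is not detected. The function name and the 60-second default threshold are hypothetical.

```python
import re

# Assumed line prefix: "<Mon> <day> HH:MM:SS:mmm", as in the excerpt above.
STAMP_RE = re.compile(
    r"^(?P<date>\w{3}\s+\S+)\s+"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}):(?P<ms>\d{3})\b"
)


def find_backward_jumps(lines, min_jump_ms=60_000):
    """Return (previous_stamp, current_stamp, backward_ms) for each backward jump."""
    jumps = []
    previous = None  # (date token, milliseconds since midnight, stamp text)
    for line in lines:
        match = STAMP_RE.match(line.strip())
        if not match:
            continue
        date = " ".join(match.group("date").split())
        millis = (
            (int(match.group("h")) * 60 + int(match.group("m"))) * 60
            + int(match.group("s"))
        ) * 1000 + int(match.group("ms"))
        stamp = match.group(0)
        # Only compare lines logged on the same (month, day) token.
        if previous and previous[0] == date and previous[1] - millis >= min_jump_ms:
            jumps.append((previous[2], stamp, previous[1] - millis))
        previous = (date, millis, stamp)
    return jumps
```

Applied to the excerpt above, this reports one jump between the 11:42:16:639 line and the 06:48:00:216 line, which is where the controller logs the clock-change message.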

This can be caused by a large difference between the ESXi server clock and the virtual server clock settings, for example when the mechanism on the ESXi side that is (apparently) supposed to keep the time on the virtual servers in sync is not working correctly.

You may also have an automated NTP (Network Time Protocol) process, or something else such as an ESXi host system clock sync process, that is misbehaving. Whenever the system time is reset to a time in the past, the controller restarts, and if the hub is on the same server, the hub restarts as well.
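To confirm that something on the server is stepping the clock, one option is to run a small watcher on the affected machine that compares the wall clock with a monotonic clock, which is not affected by clock steps. The Python sketch below is one possible approach, not part of UIM; the function names, the 10-second interval, and the 2-second tolerance are assumptions chosen for illustration.

```python
import time


def clock_step(wall_before, mono_before, wall_after, mono_after):
    """Seconds the wall clock moved beyond the real elapsed time (negative = set back)."""
    return (wall_after - wall_before) - (mono_after - mono_before)


def watch_clock(interval=10.0, tolerance=2.0, iterations=None,
                wall=time.time, mono=time.monotonic, sleep=time.sleep):
    """Poll both clocks and collect every step larger than the tolerance.

    iterations=None polls until interrupted; the clock and sleep functions
    are parameters so the loop can be exercised without waiting.
    """
    steps = []
    wall_prev, mono_prev = wall(), mono()
    count = 0
    while iterations is None or count < iterations:
        sleep(interval)
        wall_now, mono_now = wall(), mono()
        step = clock_step(wall_prev, mono_prev, wall_now, mono_now)
        if abs(step) > tolerance:
            print(f"system clock stepped by {step:+.3f} seconds")
            steps.append(step)
        wall_prev, mono_prev = wall_now, mono_now
        count += 1
    return steps
```

Calling watch_clock() with the defaults prints a line each time the system time changes by more than the tolerance between two polls. Correlating those times with the controller restarts can help identify the process that is resetting the clock.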