nas alarm queue backup on clustered Windows UIM Server

Article ID: 107761

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

The UIM Server is installed on a Windows cluster.

After the nas queue was seen backing up on the hub, the robot on the primary cluster node was restarted. This did not resolve the issue.
A failover to the secondary cluster node was then performed. This did not resolve the issue either.

The nas queue on the hub continued to back up. After the failover, the following error messages were periodically recorded in nas.log on the secondary cluster node:

nas: nimAttach - failed to attach to queue nas
nas: Failed in attaching to HUB, retry #1
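
As a rough aid for spotting this symptom, the sketch below scans nas.log lines for the two messages quoted above. The function name and the summary it returns are illustrative only and are not part of UIM; the patterns assume the messages appear exactly as shown in this article.

```python
import re

# Patterns taken from the two nas.log messages quoted in this article.
ATTACH_FAILED = re.compile(r"nimAttach - failed to attach to queue (\S+)")
RETRY = re.compile(r"Failed in attaching to HUB, retry #(\d+)")


def summarize_attach_failures(lines):
    """Return (set of queues that failed to attach, highest retry number seen)."""
    queues = set()
    max_retry = 0
    for line in lines:
        failed = ATTACH_FAILED.search(line)
        if failed:
            queues.add(failed.group(1))
        retry = RETRY.search(line)
        if retry:
            max_retry = max(max_retry, int(retry.group(1)))
    return queues, max_retry


# The two log lines from this article, used as sample input.
sample = [
    "nas: nimAttach - failed to attach to queue nas",
    "nas: Failed in attaching to HUB, retry #1",
]
print(summarize_attach_failures(sample))
```

In practice the lines would be read from the nas.log file on the active node rather than from a hard-coded list.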

Environment

UIM Server 8.41
robot:  7.93
hub:  7.93
nas:  8.56
 

Cause

A nas process appears to have been running on both nodes of the cluster at the same time.

The hub log at loglevel 3 shows the nas on the primary cluster node subscribing to the nas queue successfully:

Jul 24 16:18:33:061 [18100] hub: Subscriber 'nas' at '<primary node IP address>/49342' attached to queue 'nas' (subject:alarm2 requested bulk:1, granted bulk:1, minimum bulk:0, wait:0, heartbeat: 0, reply timeout: 60), time used: 1 ms

The nas on the secondary cluster node then attempts to subscribe to the same queue and is denied:

Jul 24 16:20:50:049 [18100] hub: Processing new subscribe request from '<secondary node IP address>/51185'
Jul 24 16:20:50:049 [18100] hub: add_subscriber id=nas sub=000000000125DC60
Jul 24 16:20:54:985 [18100] hub: Subscribe error: queue nas already has a subscriber (<primary node IP address>/49342)
Jul 24 16:20:54:985 [18100] hub: Subscribe error: replacing current subscriber not permitted, denying request from 'nas' at '<secondary node IP address>/51185'

At this point the secondary cluster node is the active node in the cluster, which is where the UIM Server probes should be running. Because the nas probe on this node cannot subscribe to its queue, the queue is not processed and starts backing up.
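
The denial described above can be picked out of a hub log with a small script. The following is a minimal sketch, assuming the hub log lines have the same wording as the excerpts in this article; the function name is hypothetical, and the IP addresses in the sample are placeholder documentation addresses standing in for the redacted node addresses.

```python
import re

# Patterns based on the hub log excerpts quoted in this article.
ALREADY = re.compile(
    r"Subscribe error: queue (?P<queue>\S+) already has a subscriber \((?P<holder>[^)]+)\)"
)
DENIED = re.compile(r"denying request from '(?P<probe>[^']+)' at '(?P<addr>[^']+)'")


def find_subscriber_conflicts(lines):
    """Pair each 'already has a subscriber' line with the denial that follows it."""
    conflicts = []
    pending = None  # (queue, holder address) waiting for its denial line
    for line in lines:
        already = ALREADY.search(line)
        if already:
            pending = (already.group("queue"), already.group("holder"))
            continue
        denied = DENIED.search(line)
        if denied and pending:
            conflicts.append({
                "queue": pending[0],
                "holder": pending[1],          # subscriber currently attached
                "probe": denied.group("probe"),
                "denied": denied.group("addr"),  # subscriber that was refused
            })
            pending = None
    return conflicts


# Sample lines; 192.0.2.x addresses are placeholders, not real node addresses.
sample = [
    "Jul 24 16:20:54:985 [18100] hub: Subscribe error: queue nas already has a subscriber (192.0.2.10/49342)",
    "Jul 24 16:20:54:985 [18100] hub: Subscribe error: replacing current subscriber not permitted, "
    "denying request from 'nas' at '192.0.2.20/51185'",
]
for conflict in find_subscriber_conflicts(sample):
    print(conflict)
```

If the address reported as the holder belongs to the passive node, that node is likely still running a probe that should have stopped at failover.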

Resolution

The issue was resolved by rebooting the primary cluster node, which was the passive node in the cluster at that point. The reboot forced the nas process running on that node to terminate. This allowed the nas running on the secondary cluster node (the active node in the cluster) to subscribe to its queue on the hub and start processing its alarm messages.
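
Before rebooting, it may be useful to confirm that a stray nas process is actually present on the passive node. The sketch below is one possible check using the Windows tasklist command with CSV output. It assumes the probe runs under the image name nas.exe, which this article does not state; the helper names and the sample PID are hypothetical.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(output, image_name="nas.exe"):
    """Extract PIDs for image_name from 'tasklist /FO CSV /NH' output."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Informational lines (for example, when nothing matches) have one field.
        if len(row) >= 2 and row[0].lower() == image_name.lower():
            pids.append(int(row[1]))
    return pids


def find_pids(image_name="nas.exe", host=None):
    """Run tasklist locally, or against a remote host via /S, and return PIDs."""
    cmd = ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"]
    if host:
        cmd += ["/S", host]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_tasklist_csv(result.stdout, image_name)


# Parsing demonstrated on made-up sample output; the PID is not from a real system.
sample = '"nas.exe","4321","Services","0","25,312 K"\n'
print(parse_tasklist_csv(sample))
```

On the passive node, calling find_pids() and getting a non-empty list would suggest the nas process is still running there. Only the parsing step is exercised in the sample, since tasklist is available on Windows only.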