data_engine and related probes unable to Activate after install

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

We installed UIM version 20.1 on Windows Server 2016. We are now unable to activate the SLM and service-type probes. QoS data cannot be sent to the database, which is impacting our SLA, and we are unable to generate any reports.

The data_engine probe log shows errors like the following:

Sep 19 13:31:05:108 [6364] 0 de: ciHostLookup - failed for <DBSERVER>\<DBINSTANCE> (11001)
Sep 19 13:31:05:108 [6364] 0 de: ciOpen - failed to get device identifer for <DBSERVER>\<DBINSTANCE> (3)
Sep 19 13:31:05:111 [6364] 0 de: ciHostLookup - failed for <DBSERVER>\<DBINSTANCE> (11001)
Sep 19 13:31:05:111 [6364] 0 de: ciOpen - failed to get device identifer for <DBSERVER>\<DBINSTANCE> (3)
Sep 19 13:31:05:113 [6364] 0 de: QoSInsert::Commit - failed for one or more of the bulk containers, save data and release connection
Sep 19 13:31:05:289 [6364] 0 de: QoS - Commit failed: closing database connection to trigger a reconnect
Sep 19 13:31:05:289 [6364] 0 de: ciHostLookup - failed for <DBSERVER>\<DBINSTANCE> (11001)
Sep 19 13:31:05:289 [6364] 0 de: ciOpen - failed to get device identifer for <DBSERVER>\<DBINSTANCE> (3)
Sep 19 13:31:05:293 [6364] 0 de: QoS - hmm, server is responding ... something went wrong with the ADO_BulkInsert
Sep 19 13:31:31:432 [6364] 0 de: ciHostLookup - failed for <DBSERVER>\<DBINSTANCE> (11001)
Sep 19 13:31:31:432 [6364] 0 de: ciOpen - failed to get device identifer for <DBSERVER>\<DBINSTANCE> (3)
Sep 19 13:31:31:436 [6364] 0 de: ciHostLookup - failed for <DBSERVER>\<DBINSTANCE> (11001)
Sep 19 13:31:31:436 [6364] 0 de: ciOpen - failed to get device identifer for <DBSERVER>\<DBINSTANCE> (3)
Sep 19 16:19:55:383 [7228] 0 de: ADO_QoSInsert::InsertQosObjectEx - New table_id:

etc

Sep 20 00:40:03:236 [6260] 0 de: getNextRunTime: old style time spec. nextRun: 1600629000
Sep 20 00:40:36:057 [4028] 0 de: getNextRunTime: old style time spec. nextRun: 1600629000
Sep 20 00:45:41:545 [6260] 0 de: Failed to read a valid probe_crypto_mode from controller. Assuming pre-FIPS and using TWO_FISH

etc

Sep 20 07:02:56:039 [7228] 0 de: ADO_QoSInsert::InsertQosObjectEx - New table_id:
Sep 20 07:02:56:088 [7228] 0 de: ADO_QoSInsert::InsertQosObjectEx - New table_id:

etc

Sep 20 15:48:45:191 [6260] 0 de: Failed to read a valid probe_crypto_mode from controller. Assuming pre-FIPS and using TWO_FISH

Sep 20 15:56:50:810 [6260] 0 de: ######################## EXIT DATA ENGINE ########################
Sep 20 15:56:50:810 [6260] 0 de: qos_data_thread - stop received ...
Sep 20 15:56:55:213 [6260] 0 de: main - waiting for commit threads to terminate
Sep 20 15:56:55:214 [6260] 0 de: main - commit threads terminated
Sep 20 15:56:55:214 [6260] 0 de: main - waiting for bulk thread to terminate

Most of the probes on the Primary hub were red, and all of the probe logs stopped being written to on Sep 20 at 15:56.

UIM had been running without problems for 2 months after the upgrade. Then the SLM-related services and other probes suddenly went down and turned red.

- The distsrv log and the other probe logs stop completely at: Sep 20 15:56:51:482 [4008] 1 distsrv: RequestQueueCleanup.

- None of the probe logs were being written to disk.

- The disks on the Primary hub were neither full nor nearly full.

All other hubs were operating normally; only the Primary hub was affected.

controller log extract:
Sep 19 06:57:44:654 [1572] 0 Controller: _ProcStart - Probe 'udm_manager' - starting
Sep 19 13:30:41:248 [1572] 0 Controller: _ProcStart - Probe 'udm_manager' - starting
Sep 20 15:56:50:793 [1572] 0 Controller: Going down...
Sep 20 15:57:06:792 [1572] 0 Controller: Down

data_engine log extract:

Sep 19 13:30:30:792 [7732] 0 de: (4) Open [Microsoft SQL Server Native Client 11.0] Login failed for user 'MXXXS\SXXXXXXXPRD'.
Sep 19 13:30:30:792 [7732] 0 de: COM Error [0x80004005] Unspecified error - [Microsoft SQL Server Native Client 11.0] TCP Provider: No connection could be made because the target machine actively refused it.

Sep 19 13:30:30:792 [7732] 0 de: [LSV] Open - 4 errors
Sep 19 13:30:30:792 [7732] 0 de: (1) Open [Microsoft SQL Server Native Client 11.0] TCP Provider: No connection could be made because the target machine actively refused it.
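
When a data_engine log is long, it can help to count how often each failure signature appears before reading it line by line. The following is a minimal sketch in Python; the signature names and the `summarize` helper are hypothetical, and the patterns are taken only from the log lines quoted in this article, so they may not cover every data_engine message.

```python
import re
from collections import Counter

# Hypothetical signature names; patterns come from the log lines quoted above.
SIGNATURES = {
    "login_failed": re.compile(r"Login failed for user"),
    "connection_refused": re.compile(r"actively refused"),
    "host_lookup_failed": re.compile(r"ciHostLookup - failed"),
    "commit_failed": re.compile(r"Commit(?: -)? failed"),
}

def summarize(lines):
    """Count how many lines match each known failure signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (path is an assumption): python summarize_de_log.py data_engine.log
    with open(sys.argv[1], errors="replace") as handle:
        for name, count in summarize(handle).most_common():
            print(f"{name}: {count}")
```

A high count of login failures alongside refused connections, as in the extract above, points toward an account or permission problem rather than a purely network-level fault.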

Environment

Release : 20.1

Component : UIM - INSTALL

Cause

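- On Sep 20 at 13:56 the Windows machine configuration was changed: the service account used by the Primary hub robot was removed from the local Windows Administrators group.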

Resolution

Ping to the cluster IP and to the cluster nodes succeeded with no issues.

Telnet to the nodes and to the cluster VIP on the port also worked fine, so basic network connectivity was ruled out.
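
The ping and telnet checks above can be approximated with a short script when telnet is not installed on the hub. This is a minimal sketch, not a UIM tool: `check_tcp` is a hypothetical helper, and the host name and port passed to it are placeholders that must be replaced with the actual cluster VIP or node name and the port in use.

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Return 'dns_failed', 'connect_failed', or 'ok' for host:port."""
    try:
        # Step 1: name resolution, kept separate from the TCP connect
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failed"
    # Step 2: try each resolved address until one accepts the connection
    for family, socktype, proto, _canon, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "ok"
        except OSError:
            continue
    return "connect_failed"

if __name__ == "__main__":
    import sys
    # Usage (placeholders): python check_tcp.py <cluster-vip-or-node> <port>
    print(check_tcp(sys.argv[1], int(sys.argv[2])))
```

A result of `ok` only shows that the port is reachable. As in this case, the connection can still fail at login if the account lacks the required rights.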

An RDP login as the service account user failed with the following error:

'Error: The connection was denied because the user account is not authorized for remote login'

It turned out that a person, process, or policy change had removed the service account from the local Windows Administrators group.

After the service account (xxxxx\<SERVICE_USER_ID>) was re-added to the Windows Administrators group on the local machine and the Primary hub was restarted, the hub and all of its probes started working as expected again.

The Primary hub robot was running as that same Windows service account, and the data_engine database user was configured as the same account, which is why both the probes and the database connection were affected.
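
To confirm the group membership without an RDP session, the output of the Windows `net localgroup Administrators` command can be checked from a script. The sketch below is an illustration only: `parse_localgroup_members`, `is_local_admin`, and `query_local_admins` are hypothetical helper names, and the parser assumes the English-language output layout (a line of dashes before the member list and a closing "The command completed successfully." line).

```python
import subprocess
from typing import List

def parse_localgroup_members(output: str) -> List[str]:
    """Extract member names from `net localgroup <group>` output."""
    members = []
    in_members = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("---"):
            in_members = True  # the member list starts after the dashed line
            continue
        if not in_members or not line:
            continue
        if line.startswith("The command completed"):
            break  # assumed English-locale trailer line
        members.append(line)
    return members

def is_local_admin(account: str, output: str) -> bool:
    """Case-insensitive check that the account is listed as a member."""
    wanted = account.lower()
    return any(member.lower() == wanted for member in parse_localgroup_members(output))

def query_local_admins() -> str:
    """Run the command on the local Windows machine and return its output."""
    result = subprocess.run(
        ["net", "localgroup", "Administrators"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the Primary hub, `is_local_admin(r"xxxxx\<SERVICE_USER_ID>", query_local_admins())` with the real account name substituted would be expected to return `False` before the fix and `True` after the account is re-added.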