Best Practices for monitoring DX UIM - self-health monitoring

Article ID: 9640


Products

DX Unified Infrastructure Management (Nimsoft / UIM)
Unified Infrastructure Management for Mainframe
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

Customers request guidance on DX UIM self-health monitoring. The following article provides suggestions for implementing self-monitoring of DX UIM components, which may include:
 
- hubs
- robots
- probes
- Operator Console

Environment

- UIM any version

Cause

- UIM self-health monitoring

Resolution

Monitoring options

Availability - net_connect probe
- Monitor up/down status using both ICMP (port 0) pings and TCP handshakes against specific service ports, e.g., NimBUS ports 48000 (controller) and 48001 (spooler) on robots, and 48002 on hubs.
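
As a quick ad-hoc check outside of the net_connect probe, a short script can confirm that the NimBUS ports accept TCP connections. This is only a sketch; the host name below is a placeholder for one of your own robots or hubs.

   import socket

   # Hypothetical robot/hub host name; replace with your own.
   HOST = "uim-hub-01.example.com"
   NIMBUS_PORTS = {48000: "controller", 48001: "spooler", 48002: "hub"}

   for port, service in NIMBUS_PORTS.items():
       try:
           # A completed TCP handshake suggests the service is listening.
           with socket.create_connection((HOST, port), timeout=5):
               print(f"{HOST}:{port} ({service}) is reachable")
       except OSError as exc:
           print(f"{HOST}:{port} ({service}) is NOT reachable: {exc}")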
 
System Resources - cdm or rsp probes
- Monitor CPU, Disk, Memory, I/O
 
processes probe
- CPU / memory for selected processes, e.g., hub.exe, nas.exe, etc.
- CPU spiking caused by a specific process
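
If you want to spot-check CPU/memory for these processes outside of the cdm/processes probes, a minimal sketch using the third-party psutil library is shown below; the process names are examples only.

   import psutil  # third-party: pip install psutil

   # Example UIM process names to watch; adjust for your environment.
   WATCHED = {"hub.exe", "nas.exe", "controller.exe"}

   for proc in psutil.process_iter(["name", "memory_info"]):
       if proc.info["name"] in WATCHED:
           # cpu_percent(interval=1) samples CPU usage over one second.
           cpu = proc.cpu_percent(interval=1)
           rss_mb = proc.info["memory_info"].rss / 1024 / 1024
           print(f"{proc.info['name']}: CPU {cpu:.1f}%, RSS {rss_mb:.1f} MB")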
 
Probe 'status'

logmon probe:
Monitor for 'Max. restarts' entries in core probe logs such as:
 
- hub
- robot (controller)
- UMP/OC probes, e.g., wasp
 
You can use logmon to parse the probe log(s) for "Max. restarts reached for probe", and similarly monitor the hub, robot/controller, nas, and data_engine via their logs.
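
Outside of logmon, the same condition can be checked ad hoc with a short script that scans a controller log for the restart message. The log path below assumes a default Windows robot install; adjust it for your environment.

   import re
   from pathlib import Path

   # Assumed default robot install path; adjust as needed.
   LOG_FILE = Path(r"C:\Program Files (x86)\Nimsoft\robot\controller.log")

   # The same string logmon would watch for.
   pattern = re.compile(r"Max\. restarts reached for probe '([^']+)'")

   for line in LOG_FILE.read_text(errors="ignore").splitlines():
       match = pattern.search(line)
       if match:
           print(f"Probe '{match.group(1)}' hit its restart limit: {line.strip()}")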
 
dirscan probe
- Use the dirscan probe locally on each hub to monitor the size of the queue (q) files and alarm when a file grows larger than <size_of_file>.
- Optionally, deploy a remote nas and emailgtw on one of your remote hubs to send an email when a queue alarm is generated.
- Make sure that, under the setup/hub section, the hub and controller loglevel is set to at least 3 and the logsize to at least 8000 so support has more detail in case an issue happens again.
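
A minimal sketch of the same idea outside of dirscan: walk the hub's queue directory and flag any queue file above a threshold. Both the queue directory path and the 10 MB threshold are assumptions; adjust them for your hub.

   from pathlib import Path

   # Assumed default hub queue directory on Windows; adjust as needed.
   Q_DIR = Path(r"C:\Program Files (x86)\Nimsoft\hub\q")
   MAX_BYTES = 10 * 1024 * 1024  # example threshold: 10 MB

   for qfile in sorted(Q_DIR.glob("*")):
       if qfile.is_file() and qfile.stat().st_size > MAX_BYTES:
           size_mb = qfile.stat().st_size / 1024 / 1024
           print(f"ALERT: queue file {qfile.name} is {size_mb:.1f} MB")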
 
discovery_server probe
- use processes and monitor java.exe using the associated command line for discovery_server
- use logmon to monitor the log for "exception"

Data - data_engine probe
- Monitor for data_engine errors/exceptions and alarm on them
- Use appropriate probe depending on what type of database is being used, e.g., sqlserver, oracle, mysql

Alerting - emailgtw probe
- use processes to monitor the emailgtw.exe process
- use logmon to look for each of these errors in the log:

   "error on session"
   "failed to start"
   "FAILED to connect session"
 
Network errors - snmpcollector probe
- Monitor key interfaces for discards/errors, e.g., hub/tunnel machines
 
Services/Events - ntservices / ntevl probes
- used to monitor Windows services or event logs, e.g., the Application event log

Application/System
Application crashes, dumps
 
Windows:
Application crashes via ntevl probe
 
Linux/Unix systems
 
dirscan probe:
- Can be used to monitor for presence of core files
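
A quick equivalent check on Linux/Unix is to search the install tree for core dump files; the install root below is an assumption and should be adjusted to where the robot and probes actually run.

   import re
   from pathlib import Path

   # Assumed Nimsoft install root on Linux; adjust as needed.
   NIMSOFT_ROOT = Path("/opt/nimsoft")

   # Core dumps are typically named "core" or "core.<pid>".
   core_name = re.compile(r"^core(\.\d+)?$")

   for path in NIMSOFT_ROOT.rglob("core*"):
       if path.is_file() and core_name.match(path.name):
           print(f"Possible core dump: {path} ({path.stat().st_size} bytes)")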
 
UMP performance (prior to UIM v20.3)
 
Enable JMX instrumentation on the wasp probe by adding these startup arguments to the Extra Java VM arguments:

-Dcom.sun.management.jmxremote.port=27000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

Then, QoS can be gathered via the jvm_monitor probe or a third-party app such as VisualVM:

http://visualvm.java.net/
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html

Gateways - spectrumgtw probe
- use processes and monitor java.exe using the associated command line for spectrumgtw
- use logmon to monitor the spectrumgtw.log for "exception" and other errors

spectrumgtw probe sync process

Here is the recommended way to monitor the UIM<->Spectrum synchronization process. This approach assumes that the UIM-Spectrum integration is configured correctly, that the required versions are compatible, and that the sync was previously working as expected.

Steps:

1. Set spectrumgtw loglevel to 5 and logsize to 100000 for each logfile.
The number of individual spectrumgtw logs written is dependent on a setting in the log4j.xml file in the following location:

   ...\Program Files (x86)\Nimsoft\probes\gateway\spectrumgtw directory.

       <param name="MaxBackupIndex" value="5"/>

2. Deactivate and then reactivate spectrumgtw.

3. Configure logmon watchers for the following strings in the spectrumgtw.log (a quick tally script is sketched after this list):

- error
- Exception
- Failed
- lock
- OutofMemoryError
- Got NOT OK response
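
As an ad-hoc complement to the logmon watchers, a short script can tally how often each of these strings appears in spectrumgtw.log. The log path below is an assumption based on the probe directory shown above; adjust it for your environment.

   from pathlib import Path

   # Assumed log location in the spectrumgtw probe directory; adjust as needed.
   LOG_FILE = Path(r"C:\Program Files (x86)\Nimsoft\probes\gateway\spectrumgtw\spectrumgtw.log")

   # The same strings suggested for the logmon watchers.
   WATCH_STRINGS = ["error", "Exception", "Failed", "lock",
                    "OutofMemoryError", "Got NOT OK response"]

   text = LOG_FILE.read_text(errors="ignore")
   for needle in WATCH_STRINGS:
       count = text.count(needle)
       if count:
           print(f"'{needle}' appears {count} time(s) in {LOG_FILE.name}")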

General Probe failures

To generate an alarm if and when a probe turns red, use logmon to monitor the probe logfile for errors, failures, and 'Max. restarts' messages. Do this for the probe log and/or the controller log as well.

When a probe fails, the most common errors that may occur include:

Controller: Probe '<probe_name>' FAILED to start (command = <probe_name>.exe) error = (5) Access is denied.

In the nas, an alarm message filter could be used to take an action on the alarm.

   /.*FAILED to start.*/

or an error such as:

Controller: Max. restarts reached for probe '<probe_name>' (command = <startup java>)

   /.*Max. restarts.*/
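
For reference, the patterns above are regular expressions (the enclosing slashes mark them as regex in nas message filters). A quick check outside of UIM confirms they match the sample controller messages once the slashes are dropped; the probe name below is a placeholder.

   import re

   # Sample alarm messages (probe name and command are placeholders).
   messages = [
       "Controller: Probe 'myprobe' FAILED to start (command = myprobe.exe) error = (5) Access is denied.",
       "Controller: Max. restarts reached for probe 'myprobe' (command = java ...)",
   ]

   # The nas filter patterns with the enclosing slashes removed.
   patterns = [".*FAILED to start.*", ".*Max. restarts.*"]

   for msg in messages:
       for pat in patterns:
           if re.match(pat, msg):
               print(f"/{pat}/ matches: {msg}")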

vmware probe connectivity

vmware probe connection monitoring

- The vmware probe may have issues pulling data from the Virtual Center, e.g., it cannot read the configuration or discover any information from a host even though the ESXi host is reported as connected to the vCenter.

A Warning alarm is generated whenever there are timeout issues while collecting data from the vCenter. The alarm indicates that something might be wrong in the vCenter and prompts the vCenter admin to verify it.

- Using the logmon probe

In logmon, monitor the vmware.log for the entry "VMWare API is unavailable" and send an alarm. In addition, a nas Auto Operator rule with a message filter such as /.*VMWare API is unavailable.*/ can send an email notification.

[Connection tester - 0, vmware] (12) login failed, VMWare API is unavailable: com.vmware.vim25.InvalidLogin: null at com.nimsoft.probe.application.vmware.sdk.VmwareEnvAdaptor.login(VmwareEnvAdaptor.java:273)

or

"vNNNxxxxx is not responding (reason: Unexpected fatal error in data collection. Collection will not resume until probe is restarted. See log for details.)"

Using the net_connect probe
- to monitor reachability of the vCenter server via ping

Another option for monitoring connectivity-related errors/issues, using a CLI/script, is explained here:

VMware PowerCLI Blog - Back to Basics: Connecting to vCenter or a vSphere Host
https://blogs.vmware.com/PowerCLI/2013/03/back-to-basics-connecting-to-vcenter-or-a-vsphere-host.html

 

Login attempt monitoring:

Monitor user login attempts in IM (Infrastructure Manager ) - UIM (broadcom.com)

Monitor user login attempts in Operator Console (OC) - UIM (broadcom.com)

Additional Information

Hub queue check tool