DX UIM Monitoring and available options
Availability - net_connect probe
- Monitor up/down status using both ICMP (port 0) pings and TCP handshakes to specific service ports, e.g., NimBUS port 48000 (controller) and 48001 (spooler) for the robot, and 48002 for the hub.
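As a quick sanity check outside the net_connect probe, the same TCP handshake test can be sketched in Python (the port-to-service mapping below follows the defaults above; the target host is whatever robot or hub you supply, and net_connect remains the supported way to do this in production):

```python
import socket

# Default NimBUS service ports, per the bullet above.
NIMBUS_PORTS = {48000: "controller", 48001: "spooler", 48002: "hub"}

def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_robot(host: str) -> dict:
    """Probe each well-known NimBUS port and report up/down per service."""
    return {name: check_tcp_port(host, port) for port, name in NIMBUS_PORTS.items()}
```

Running `check_robot("my-hub-host")` returns a dict such as `{"controller": True, "spooler": True, "hub": False}`, which mirrors what a net_connect service-port profile would alarm on.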
System Resources - cdm or rsp probes
- Monitor CPU, Disk, Memory, I/O
Process Monitoring - processes probe
- CPU / Memory for select DX UIM processes, e.g., hub.exe, nas.exe, wasp, etc.
- CPU spiking caused by a specific process
Probe 'status' - using the logmon probe:
- Monitor for 'Max. restarts' entries in core probe logs such as:
- hub.log
- robot (controller.log) - allows you to detect a hung robot
- snmpcollector.log
- UMP/OC probes, e.g., wasp.log
- other probes
controller probe
You can use logmon to parse the controller.log and generate an alarm when snmpcollector fails to start. At loglevel 3 or higher, the controller.log will contain the error message:
Max. restarts reached for probe 'snmpcollector'
Using logmon to parse the controller.log, you can generate an alert and then use a nas rule to send an EMAIL or text via a nas message filter (regex) such as:
/.*Max. restarts reached for probe 'snmpcollector'.*/
Optionally, you can also script an automated action to deactivate the probe, kill any leftover snmpcollector java processes if still present, and then activate the probe again. Always test the rule first by using the nas to send a test message that triggers the rule execution.
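Before wiring the filter into a nas rule, the pattern itself can be exercised against sample alarm text. This Python sketch only approximates the matching logic (nas evaluates its own filter syntax); the '.' in "Max." is escaped here for strictness, which the original filter tolerates without:

```python
import re

# The nas message filter from above, expressed as a Python regex
# (nas filters are written /.../; the pattern body is the same idea).
MAX_RESTARTS = re.compile(r".*Max\. restarts reached for probe 'snmpcollector'.*")

def should_alert(alarm_message: str) -> bool:
    """Return True if the alarm message would match the nas filter."""
    return bool(MAX_RESTARTS.match(alarm_message))
```

For example, the controller.log line quoted above matches, while an ordinary probe-start message does not.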
snmpcollector probe
Another snmpcollector probe failure may be evidenced by the message 'Failed initializing database' in the snmpcollector.log. Use the regex shown below and the same approach described above:
/.*Failed initializing database.*/
snmpcollector probe on Linux
Symptom: snmpcollector on Linux won't start and there are no detailed errors in the logs. The controller.log shows only "Controller: Max. restarts reached for probe 'snmpcollector' (command = <startup java>)".
By default, Linux may not include the host's hostname and IP address in /etc/hosts, which snmpcollector requires for name resolution at startup. Add the IP and hostname to the first line of the /etc/hosts file and restart snmpcollector.
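A minimal sketch of the name-resolution check that snmpcollector depends on; on a correctly configured host, the local hostname should resolve (the unresolvable name in the usage example below is deliberately fictitious):

```python
import socket

def resolves(name: str) -> bool:
    """Return True if the given name resolves to an IP address
    (via /etc/hosts or DNS), as snmpcollector requires at startup."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False
```

For example, `resolves(socket.gethostname())` should return True after the /etc/hosts fix above; if it returns False, snmpcollector will keep hitting "Max. restarts".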
dirscan probe
- Use the dirscan probe locally on each hub to monitor the size of the queue files and generate an alarm when a file grows larger than <size_of_file>
- Optionally, you could deploy a remote nas and emailgtw on one of your remote hubs to send an EMAIL when a queue alarm is generated.
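The dirscan size check can be approximated with a short script, e.g., to validate a threshold before configuring the probe (the directory path and file names in the test below are illustrative; point it at your hub's actual queue directory):

```python
import os

def oversized_queue_files(q_dir: str, max_bytes: int) -> list:
    """Return the names of files in q_dir larger than max_bytes,
    mimicking a dirscan size-threshold profile."""
    hits = []
    for name in os.listdir(q_dir):
        path = os.path.join(q_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > max_bytes:
            hits.append(name)
    return sorted(hits)
```

Any name returned here is a queue file dirscan would alarm on with the same threshold.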
Under the hub's setup section, set the hub and controller loglevel to at least 3 and the logsize to at least 8000 so that support has more detail should an issue happen again.
discovery_server probe
- Use processes and monitor java.exe using the associated command line for the discovery_server process
- Use logmon to monitor the log for "exception"
Database Server Health Monitoring
Use the appropriate probe for monitoring depending on what type of database is being used, e.g., sqlserver, sql_response, oracle, or mysql
Remote monitoring - recommended checkpoints
sqlserver probe
db_alive (heartbeat)
buffer cache hit ratio
This represents the percentage of time SQL Server finds data pages in memory rather than fetching them from disk. It should be >= 98%; a ratio below 95% indicates the server is under memory pressure. We always want this value to be extremely high.
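The thresholds above reduce to simple arithmetic. This sketch assumes the hit/lookup counts are read from SQL Server's Buffer Manager performance counters (retrieving the counters themselves is out of scope here):

```python
def buffer_cache_hit_ratio(hits: int, lookups: int) -> float:
    """Percentage of page lookups satisfied from memory rather than disk."""
    return 100.0 * hits / lookups if lookups else 100.0

def classify(ratio: float) -> str:
    """Apply the thresholds above: >= 98% is healthy, below 95% means
    memory pressure, and the band in between warrants a warning."""
    if ratio >= 98.0:
        return "ok"
    if ratio >= 95.0:
        return "warning"
    return "memory pressure"
```

For instance, 990 hits out of 1000 lookups gives 99.0%, which classifies as "ok".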
long_jobs
Monitors long-running jobs (duration in seconds) and their category
logfile_usage
This checkpoint monitors the amount of free space in the transaction log, as a percentage. If a database has at least one transaction log file with "unlimited" growth, the space in its transaction log is considered 100% free.
fg_freespace_with_avail_disk
This checkpoint monitors the amount of free disk space in database file groups, as a percentage. Free space for file groups with auto-growth enabled is calculated after considering the available size of the disk on which the file group is located.
server_cpu
This checkpoint monitors the percentage of CPU used by the SQL Server instance during the interval.
net_connect probe
Check the MS SQL Server service at its port: mssql-server@1433
Tools
Use SQL Server Profiler or an equivalent DB tool to identify long-running jobs or jobs that consume significant CPU, memory, or disk I/O, or have a long overall run time.
Alerting - emailgtw probe
- Use processes to monitor the emailgtw.exe process
- Use logmon to look for each of these errors in the log:
"error on session"
"failed to start"
"FAILED to connect session"
Network errors - snmpcollector probe
- Monitor key interfaces for discards/errors, e.g., hub/tunnel machines
Services/Events - ntservices / ntevl probes
- Used to monitor Windows services or event logs, e.g., the Application event log
Application/System logs
- Application crashes, dumps
nas housekeeping or maintenance
Deploy and use logmon to parse the nas.log for errors related to its housekeeping job(s).
If the nas tables grow too large, especially nas_transaction_log, 'housekeeping' may begin to fail, continue failing, and the tables will grow even larger over time.
In your logmon watcher profile, use a regex that matches the following example message to generate an alarm when transaction-log housekeeping fails to remove entries.
Example alarm: nas: Nis-Bridge: Transaction-log administration, failed to remove transaction entries older than 8 days.
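A regex matching the example alarm above can be sketched and tested as follows; the captured group extracting the retention age in days is an optional refinement, not something the watcher requires:

```python
import re

# Matches the housekeeping-failure alarm quoted above and captures
# the "older than N days" retention age.
HOUSEKEEPING_FAIL = re.compile(
    r".*Transaction-log administration, failed to remove transaction entries "
    r"older than (\d+) days.*"
)

def parse_housekeeping_failure(message: str):
    """Return the retention age in days if the message matches, else None."""
    m = HOUSEKEEPING_FAIL.match(message)
    return int(m.group(1)) if m else None
```

The equivalent logmon/nas filter would simply be /.*Transaction-log administration, failed to remove transaction entries.*/ without the capture.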
Windows OS
- Application crashes via the ntevl probe
Linux/Unix OS
dirscan probe:
- Can be used to monitor for the presence of cores/core dump files if core dumps are enabled
Operator Console
- Use the processes probe to monitor the OC wasp java.exe for process memory usage (KB)
- Create a performance report in the OC PRD and save it to collect the data over time and chart it so you can see the trend
- Make sure the OC robot host has extra memory headroom for utilization spikes
- Multiple CPUs (4 or more processors)
Then, QoS can be gathered via the jvm_monitor probe or a third-party app such as VisualVM:
https://visualvm.github.io/
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html
UIM Gateways - spectrumgtw probe
Use the processes probe and monitor java.exe using the associated command line for spectrumgtw
Use logmon to monitor the spectrumgtw.log for "exception" and other errors
spectrumgtw probe sync process
Here is the recommended way to monitor the UIM<->Spectrum synchronization process. The approach below assumes that the UIM-Spectrum integration and configuration are correct, that all versions are compatible, and that the sync was previously working as expected.
Steps:
1. Set spectrumgtw loglevel to 5 and logsize to 100000 for each logfile.
The number of individual spectrumgtw logs written is dependent on a setting in the log4j.xml file in the following location:
...\Program Files (x86)\Nimsoft\probes\gateway\spectrumgtw directory.
<param name="MaxBackupIndex" value="5"/>
2. Deactivate - Activate spectrumgtw
3. Configure logmon watchers for the following strings in the spectrumgtw.log:
- error
- Exception
- Failed
- lock
- OutOfMemoryError
- Got NOT OK response
General probe failures
To generate an alarm if and when a probe turns red, monitor the probe logfile for errors/failures/Max. restarts using logmon. Do this for the probe and/or the controller as well.
When a probe fails, the most common errors that may occur include:
Controller: Probe '<probe_name>' FAILED to start (command = <probe_name>.exe) error = (5) Access is denied.
In the nas, an alarm message filter could be used to take an action on the alarm, e.g.,
/.*FAILED to start.*/
or an error such as:
Controller: Max. restarts reached for probe '<probe_name>' (command = <startup java>)
/.*Max. restarts.*/
vmware probe connection monitoring
Symptom: the vmware probe has issues pulling data from vCenter, e.g., the probe cannot read configuration or discover any information from a host even though that ESXi host is reported as connected to vCenter.
A warning alarm is generated whenever there are timeout issues while collecting data from vCenter. The alarm indicates that something may be wrong in vCenter and prompts the vCenter admin to verify it.
Using the logmon probe:
In logmon, monitor the vmware.log for the log entry 'VMWare API is unavailable' and send an alarm. Then, using a nas Auto Operator rule message filter such as /.*VMWare API is unavailable.*/, send an EMAIL notification.
[Connection tester - 0, vmware] (12) login failed, VMWare API is unavailable: com.vmware.vim25.InvalidLogin: null at com.nimsoft.probe.application.vmware.sdk.VmwareEnvAdaptor.login(VmwareEnvAdaptor.java:273)
or
"vNNNxxxxx is not responding (reason: Unexpected fatal error in data collection. Collection will not resume until probe is restarted. See log for details.)"
Another option for monitoring connectivity related errors/issues is explained here using a CLI/script:
VMware PowerCLI Blog - Back to Basics: Connecting to vCenter or a vSphere Host
Using the net_connect probe:
Monitor vCenter reachability via ping
Login attempts Monitoring:
Monitor user login attempts in IM (Infrastructure Manager ) - UIM (broadcom.com)
Monitor user login attempts in Operator Console (OC) - UIM (broadcom.com)
oi_connector or apm_bridge java memory utilization (options)
logmon probe
Initially, keep an eye on memory consumption, and/or use logmon to monitor the oi_connector and/or apm_bridge logs for 'GC overhead limit exceeded' and send alarms.
processes probe
Monitor the memory utilization over time using the processes probe.
- Select the single java.exe process name + command line
- Create a process profile
- Monitor memory in Kb (QOS)
- Generate an alarm when the memory has surpassed the java max memory or reaches a significantly higher % than the max.
- Create a performance report in the OC PRD and save it to collect the data over time and chart it so the trend over time is displayed.
- Optionally collect baseline QOS over time and depict the baseline value in the chart to see if the consumption settles down to a more consistent stable value.
- Then set the probe's java max heap ~2 GB higher than the baseline value and the min ~2 GB lower, e.g., 8 GB min and 10 GB max.
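The heap-sizing rule in the last bullet (min roughly 2 GB below the observed baseline, max roughly 2 GB above it) can be expressed as simple arithmetic; the 1 GB floor below is an added assumption to keep the minimum sane for small baselines:

```python
def heap_bounds_gb(baseline_gb: float, margin_gb: float = 2.0, floor_gb: float = 1.0):
    """Derive java min/max heap (in GB) from an observed memory baseline:
    min = baseline - margin, max = baseline + margin, min never below floor."""
    min_gb = max(floor_gb, baseline_gb - margin_gb)
    max_gb = baseline_gb + margin_gb
    return min_gb, max_gb
```

For example, a stable baseline of 8 GB yields a 6 GB min and 10 GB max heap under this rule.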