Vmware probe - tips for troubleshooting some common vmware problems / alarms

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

This document covers best practices for troubleshooting common VMware problems / alarms.

Environment

UIM 8.51 or above
vmware probe v6.41 or higher

Resolution

Run the latest version of the vmware probe if possible
Note on performance - 'partial graph publishing' is enabled by default in vmware probe v6.82 or higher
Results are generally better when VMware Tools is installed on all VMware guests, because it enables collection of the guest hostname and IP addresses
If there is a need to check the accuracy of the data collected by the probe, you can use the vmware Managed Object Browser (MOB) to verify it
For troubleshooting, set loglevel to at least 3 (raise it to 5 if necessary) and configure a logsize large enough to capture the problem, e.g., 1000000 (in KB)
If there is something wrong with the performance data collection, confirm performance data exists in the vCenter performance charts
Check for 'OutOfMemory' errors in the log. If they appear, check the available memory on the system and increase the Java heap memory for the vmware probe via Raw Configure.
Note that we recommend a minimum of 1 GB of memory allocated to the probe for every 1000 VMs.
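
The sizing guideline above can be turned into a quick calculation. The following is a minimal sketch and is not part of the probe: the function name recommended_heap_mb, the rounding up to whole gigabytes, and the 1 GB floor are assumptions made for illustration. Only the ratio of 1 GB per 1000 VMs comes from this article.

```python
import math


def recommended_heap_mb(vm_count: int) -> int:
    """Return a minimum heap size in MB, using 1 GB per 1000 VMs.

    Assumption: partial thousands are rounded up and the floor is 1 GB.
    """
    if vm_count < 0:
        raise ValueError("vm_count must not be negative")
    gigabytes = max(1, math.ceil(vm_count / 1000))
    return gigabytes * 1024


# Example: an environment with 3500 monitored VMs
print(f"-Xmx{recommended_heap_mb(3500)}m")
```

Treat the result as a lower bound for the -Xmx value and adjust it to the memory actually available on the robot.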

The vmware probe uses the vmware API - VI SDK.

Here is a link to vmware documentation:
vSphere Command-Line, SDK, and API Documentation

How to cross-check performance data/data collection using the VMware MOB:
https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.wssdk.pg.doc_50%2FPG_ChB_Using_MOB.20.1.html

Navigate to https://<resource address>/mob and log in with the same credentials that the vmware probe uses.

VMware provides a browser interface to this API at the following URLs:

https://<VirtualCenter>/mob/
https://<ESX(i)>/mob

Here is an example of how to look at the HostServiceInfo data object.

From the main page - Managed Object Type: ManagedObjectReference:ServiceInstance
Click on "content"

From Data Object Type: ServiceContent:
Click on "ha-folder-root"

From Managed Object Type: ManagedObjectReference:Folder
Click on "ha-datacenter"

From Managed Object Type: ManagedObjectReference:Datacenter
Click on "ha-folder-host"

From Managed Object Type: ManagedObjectReference:Folder
Click on "ha-compute-res"

From Managed Object Type: ManagedObjectReference:ComputeResource
Click on "ha-host"

From Managed Object Type: ManagedObjectReference:HostSystem
Click on "config"

From Data Object Type: HostConfigInfo
Click on "service" (near the bottom)
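
The click path above corresponds to a chain of properties in the vSphere object model (content, rootFolder, childEntity, hostFolder, childEntity, host, config, service). The sketch below walks that chain on any object that exposes those attributes. It is an illustration only: the function name host_service_info is hypothetical, the [0] indexes assume a direct ESXi connection with a single datacenter, compute resource, and host, and the stand-in objects exist only so the example runs without a vSphere connection. A vSphere SDK client object should expose the same property names, but verify that against your SDK before relying on it.

```python
from types import SimpleNamespace


def host_service_info(service_instance):
    """Follow the same property chain as the MOB clicks above."""
    content = service_instance.content               # ServiceContent
    root_folder = content.rootFolder                 # ha-folder-root
    datacenter = root_folder.childEntity[0]          # ha-datacenter
    host_folder = datacenter.hostFolder              # ha-folder-host
    compute_resource = host_folder.childEntity[0]    # ha-compute-res
    host = compute_resource.host[0]                  # ha-host
    return host.config.service                       # HostServiceInfo


# Stand-in objects (not real vSphere data) that mimic the hierarchy.
fake_host = SimpleNamespace(
    config=SimpleNamespace(service="HostServiceInfo placeholder")
)
fake_compute = SimpleNamespace(host=[fake_host])
fake_datacenter = SimpleNamespace(
    hostFolder=SimpleNamespace(childEntity=[fake_compute])
)
fake_si = SimpleNamespace(
    content=SimpleNamespace(
        rootFolder=SimpleNamespace(childEntity=[fake_datacenter])
    )
)

print(host_service_info(fake_si))
```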

Selected common questions, issues or alarms

vmware probe getting 'DATASTORE.Accessible' alarms (Alarms from ESX hosts DataStores.)
Symptoms:
A customer analyzed the logs of the ESX host and found no issues with DataStore accessibility. The vCenter also didn’t report any issues. They verified that all DataStores were accessible.

Alarms: “Self-Monitoring Failures for '192.***.x.xx:DATASTORE.Accessible': Data Collection (1 of 10 failed). See vmware.log for more details”

Cause:

This problem was caused by STATIC monitors in the vmware probe. A static monitor is written with the vmid in its definition, so this alarm can occur when the vmid of the resource changes.

Resolution:
Auto-monitors do not use the vmid, so use auto-monitors instead of STATIC monitors.

No VMs in Inventory (VCE and others)
Symptoms:
Customer was only interested in monitoring infrastructure and did not care about VMs and wanted to improve vmware probe performance.

Cause:
There were a lot of VMs and a lot of VM data, which takes time to collect and memory to track.

Resolution:

Disable the collection and tracking of VMs and children with the 'show_vms' setup flag.

Vmware probe scalability and performance (vpxd.stats.maxQueryMetrics and perf_request_batch_size)
Symptoms:
By default VMware limits the "vpxd.stats.maxQueryMetrics" to 64. This could cause the probe metric collection to fail if a Resource Pool/vAPP or a cluster has more than 64 VMs or a Datastore has more than 64 VMs or disks. You will see an error message such as: “Failed to execute single perf query for entity ……. Follow VMware KB 2107096 to resolve.”

Resolution:
To fix this issue, follow the instructions in the VMware Knowledge Base article "Performance charts are empty" #2107096, on how to increase the "vpxd.stats.maxQueryMetrics." Increase the "vpxd.stats.maxQueryMetrics" to more than the maximum number of VMs that you have in a resource pool, cluster or maximum number of VMs/Disks that you have in a Datastore.

When you change vpxd.stats.maxQueryMetrics, it is best practice to set the vmware probe setup key perf_request_batch_size (64 by default) to match the new value of vpxd.stats.maxQueryMetrics. This optimizes the probe's data collection.
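
As a rough aid for choosing the new value, the sketch below picks a number larger than the biggest VM or disk count in any resource pool, cluster, or datastore, and reuses it for perf_request_batch_size. The function name required_max_query_metrics, the entity names, and the counts are hypothetical, and using "largest count plus one" is only one way to satisfy "more than the maximum".

```python
def required_max_query_metrics(entity_counts, default_limit=64):
    """Return a vpxd.stats.maxQueryMetrics value above the largest count.

    entity_counts maps a resource pool, cluster, or datastore name to the
    number of VMs (or VMs/disks) it contains. The default of 64 is kept
    when no entity exceeds it.
    """
    largest = max(entity_counts.values(), default=0)
    return max(default_limit, largest + 1)


# Hypothetical inventory counts for illustration
counts = {"cluster-a": 120, "datastore-1": 95, "pool-x": 40}
value = required_max_query_metrics(counts)
print(f"vpxd.stats.maxQueryMetrics = {value}")
print(f"perf_request_batch_size = {value}")
```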

“Tuning vmware probe performance”
https://knowledge.broadcom.com/external/article/34665

“vmware probe – Best Practices for better performance”
https://knowledge.broadcom.com/external/article/33588

VMware ‘Self-Monitoring’ Alarms
Symptoms:
A Self-Monitoring alarm is triggered when auto-monitor generation fails, a static monitor fails, or data collection fails.

- Self-monitoring alarms can be disabled with the setup key: enable_self_monitoring_alarm = false
- The severity of self-monitoring alarms can be set with the setup key self_monitoring_alarm_severity. Set it to the desired number (5-Critical, 4-Major, 3-Minor, 2-Warning, 1-Informational). The default is 4.
- By default, the probe aggregates self-monitoring alarms by monitor type. For example, if the “VM_CPU.Used" collection fails for multiple VMs, the failures are aggregated and only one alarm is generated. The aggregated alarm indicates how many collections failed (e.g., 2 of 10 failed). This can be disabled by setting: enable_self_monitoring_alarm_aggregation = false. When aggregation is disabled, the probe generates an alarm for each incident.
- By default, the probe re-sends the same failing self-monitoring alarm in each probe collection cycle with the same suppression key. To change this, set: enable_self_monitoring_alarm_same_error_suppression = true. The alarm is then sent only once, when the failure occurs, and is not sent again unless the number of errors changes or the probe restarts.
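
To make the aggregation behavior concrete, the sketch below groups collection results by monitor type and produces one message per type that had failures. It is an illustration under stated assumptions, not the probe's implementation: the function name aggregate_failures and the input format are hypothetical, and the message is simplified (real alarms also carry the resource name).

```python
from collections import defaultdict


def aggregate_failures(results):
    """Group (monitor_type, succeeded) results into one message per type.

    Only monitor types with at least one failed collection are reported.
    """
    totals = defaultdict(lambda: [0, 0])  # monitor_type -> [failed, total]
    for monitor_type, succeeded in results:
        totals[monitor_type][1] += 1
        if not succeeded:
            totals[monitor_type][0] += 1
    return [
        f"Self-Monitoring Failures for '{mtype}': "
        f"Data Collection ({failed} of {total} failed)"
        for mtype, (failed, total) in sorted(totals.items())
        if failed
    ]


# Ten VM_CPU.Used collections, two of which failed, plus one healthy monitor
results = (
    [("VM_CPU.Used", False)] * 2
    + [("VM_CPU.Used", True)] * 8
    + [("DATASTORE.Accessible", True)]
)
for message in aggregate_failures(results):
    print(message)
```

With aggregation disabled, the equivalent behavior would be one message per failed collection instead of one per monitor type.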

Resolution(s):
- Check in vCenter if the monitor exists.
- In vmware probe v6.41 or higher, review the Detached Configuration folder in the left-hand navigation tree of the probe's Admin Console GUI. It lists resources that have been deleted in VMware vSphere but are still configured in the probe.
- Check whether a VM was shut down.

Vmware metrics missing
A customer was missing two metrics that should have been monitored by the vmware probe (CPU Reservation and CPU Max Limited).

Symptoms:
Jan 17 12:26:28:545 [BulkSender Monitor, vmware] Sent NimAlarm C, severity=4, message==Self-Monitoring Failures for 'hostname':VM_CPU_AGGREGATE.maxlimited': Data Collection (1 of 1 failed). See vmware.log for more details, subsystem=3.37.1, suppressionid=hostname::VM_CPU_AGGREGATE.maxlimited, source=hostname and received confirmation id TC27242969-60332

Jan 20 03:19:27:037 [Data Collector - hostname, vmware] Failed to update/fix details for static monitor 'abc-ua-fesx-***.xx.com.Memory.Memory Reserved Capacity (% of MemorySize)'. Collection will not occur correctly --Deactivating monitor!!!: No parent found in graph for monitor
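
When vmware.log is large, it can help to extract these failures programmatically. The sketch below is a minimal example that assumes the message format shown in the first log line above. The pattern and the function name parse_self_monitoring are assumptions for illustration and may need adjusting for other probe versions. Stray quotes inside the target, as in the sample, are removed.

```python
import re

# Matches: Self-Monitoring Failures for '<target>': Data Collection (N of M failed)
PATTERN = re.compile(
    r"Self-Monitoring Failures for '(?P<target>.+)': "
    r"Data Collection \((?P<failed>\d+) of (?P<total>\d+) failed\)"
)


def parse_self_monitoring(line):
    """Return (target, failed, total) or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # The sample line has an extra quote after the host name; drop it.
    target = match.group("target").replace("'", "")
    return target, int(match.group("failed")), int(match.group("total"))


sample = (
    "Sent NimAlarm C, severity=4, message==Self-Monitoring Failures for "
    "'hostname':VM_CPU_AGGREGATE.maxlimited': Data Collection (1 of 1 failed). "
    "See vmware.log for more details"
)
print(parse_self_monitoring(sample))
```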

Cause:
The monitor “CPU Limited by Max (% of available)” was not configured in the correct location in the vmware template. It should have been configured under “Cluster > Host > Virtual Machine > CPU”.

- The monitor was configured under “ESXi Compute Resource” which would only apply if you have an ESXi server configured as the target/resource.

Resolution:
- Configure the monitor in the correct location of the template (“Cluster > Host > Virtual Machine > CPU”) and it should work.

Datastore alerts not generated from vmware probe

The Java heap memory for the probe may be set too low. Increase it, for example from:

<startup>
options = -Xms128m -Xmx2048m -Duser.language=en -Duser.country=US
</startup>

to

<startup>
options = -Xms2048m -Xmx4096m -Duser.language=en -Duser.country=US
</startup>
Then deactivate the vmware probe.
Wait until the port and PID disappear.
Activate the probe again.
Test the probe to confirm that you now get the expected datastore alarms when the value crosses the threshold.
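
If the same change has to be prepared for several probes, the options line can be rewritten with a small helper. This is a sketch only: the function name set_heap is hypothetical, it operates on the options string rather than on the probe configuration file itself, and it leaves the string unchanged if the -Xms or -Xmx flag is not present.

```python
import re


def set_heap(options: str, xms_mb: int, xmx_mb: int) -> str:
    """Replace existing -Xms/-Xmx values in a JVM options string."""
    options = re.sub(r"-Xms\d+[kKmMgG]?", f"-Xms{xms_mb}m", options)
    options = re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{xmx_mb}m", options)
    return options


current = "-Xms128m -Xmx2048m -Duser.language=en -Duser.country=US"
print(set_heap(current, 2048, 4096))
```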

Additional Information

Please also check the vmware probe release notes:

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/ca-unified-infrastructure-management-probes/GA/monitoring/clouds-containers-and-virtualization/vmware-vmware-monitoring/vmware-vmware-monitoring-release-notes.html