Unable to query vSAN health information. Check vSphere Client logs for details.

Article ID: 325350


Products

VMware Aria Suite
VMware vSAN

Issue/Introduction

Symptoms:

Unable to query vSAN health information. Check vSphere Client logs for details.

  • This article helps identify the cause of, and resolve, the error "Unable to query vSAN health information. Check vSphere Client logs for details." when vROps is in use.
  • When logged in to the vCenter Server via SSH, the /storage/core directory shows core files continuously being generated for core.vsanvcmgmtd-wor
  • The vsan-health support bundle shows that many API calls were very slow: not only queryVsanPerf, but also other calls such as queryHostStatusEx, getHclInfo, isEventQueueFull, waitForVsanHealthGenerationIdChange, and queryObjectIdentities (a way to surface these slow calls is sketched after the excerpt below):

vsansystem.1: 2023-03-16T22:12:18.554Z info vsansystem[2101059] [vSAN@6876 sub=AdapterServer opId=9d387c7a-c110] Invoking 'queryVsanPerf' on 'vsan-performance-manager' session '52be4ad3-dcba-96ff-e5e7-089696dbd232' active 16
vsansystem.1: 2023-03-16T22:12:18.554Z verbose vsansystem[2101059] [vSAN@6876 sub=PyBackedMO opId=9d387c7a-c110] Enter vim.cluster.VsanPerformanceManager.queryVsanPerf, Pending: 17
vsansystem.1: 2023-03-16T22:23:29.904Z info vsansystem[2101059] [vSAN@6876 sub=PyBackedMO opId=9d387c7a-c110] Exit vim.cluster.VsanPerformanceManager.queryVsanPerf (671349 ms)
vsansystem.1: 2023-03-16T22:23:29.905Z warning vsansystem[2101059] [vSAN@6876 sub=IO.Connection opId=9d387c7a-c110] Failed to write buffer to stream; <io_obj p:0x0000001749f46fa8, h:124, <TCP '127.0.0.1 : 9096'>, <TCP '0.0.0.0 : 0'>> e: 32(Broken pipe), async: false, duration: 0msec
vsansystem.1: 2023-03-16T22:23:29.906Z error vsansystem[2101059] [vSAN@6876 sub=VsanSoapSvc.HTTPService opId=9d387c7a-c110] Failed to write to response stream; <<io_obj p:0x0000001749f46fa8, h:124, <TCP '127.0.0.1 : 9096'>, <TCP '0.0.0.0 : 0'>>, 52be4ad3-dcba-96ff-e5e7-089696dbd232>, N7Vmacore15SystemExceptionE(Broken pipe: The communication pipe/socket is explicitly closed by the remote service.)
vsansystem.1: 2023-03-16T22:23:29.907Z error vsansystem[2101059] [vSAN@6876 sub=AdapterServer opId=9d387c7a-c110] Failed to send response to the client: N7Vmacore11IOExceptionE(System exception while transmitting HTTP Response:
vsansystem.1: 2023-03-16T22:23:29.908Z info vsansystem[2101059] [vSAN@6876 sub=IO.Connection opId=9d387c7a-c110] Failed to shutdown socket; <io_obj p:0x0000001749f46fa8, h:124, <TCP '127.0.0.1 : 9096'>, <TCP '0.0.0.0 : 0'>>, e: 104(shutdown: Connection reset by peer)

2023-03-16T22:19:40.999Z warning vsanvcmgmtd[06161] [vSAN@6876 sub=Py2CppStub opId=9d38863a] Exit host-72859::vim.cluster.VsanPerformanceManager.queryNodeInformation (1115332 ms)
2023-03-16T22:19:41.101Z warning vsanvcmgmtd[02322] [vSAN@6876 sub=Py2CppStub opId=9d388672] Exit host-72859::vim.host.VsanSystemEx.queryHostStatusEx (1089715 ms)
2023-03-16T22:19:42.217Z warning vsanvcmgmtd[04609] [vSAN@6876 sub=Py2CppStub opId=9d388676] Exit host-72859::vim.host.VsanHealthSystem.getHclInfo (1085903 ms)
2023-03-16T22:19:42.221Z warning vsanvcmgmtd[04655] [vSAN@6876 sub=Py2CppStub opId=9d38870b] Exit host-72859::vim.host.VsanHostEventsProcessor.isEventQueueFull (1028006 ms)
2023-03-16T22:19:42.221Z warning vsanvcmgmtd[04661] [vSAN@6876 sub=Py2CppStub opId=9d38869b] Exit host-72859::vim.host.VsanHealthSystem.waitForVsanHealthGenerationIdChange (1083920 ms)
2023-03-16T22:19:43.429Z warning vsanvcmgmtd[04617] [vSAN@6876 sub=Py2CppStub opId=9d388710] Exit host-72859::vim.host.VsanHealthSystem.getHclInfo (1027108 ms)
2023-03-16T22:19:46.593Z warning vsanvcmgmtd[04604] [vSAN@6876 sub=Py2CppStub opId=SWI-48612b4b-871d] Exit host-72859::vim.cluster.VsanObjectSystem.queryObjectIdentities (1023049 ms)
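
A quick way to surface these slow calls from the extracted support bundle is sketched below. This is a minimal example, assuming the rotated log files are named vsanvcmgmtd* and vsansystem* as in the excerpt above; it prints every logged API exit as "duration_ms method", slowest first:

# Run from the directory where the support bundle logs were extracted.
grep -h 'Exit .* ms)' vsanvcmgmtd* vsansystem* \
  | sed -n 's/.*Exit \([^ ]*\) (\([0-9]*\) ms).*/\2 \1/p' \
  | sort -rn | head -20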

  • In the vCenter log file /var/log/vmware/vsan-health/vmware-vsan-health-service-13538.log, per-host object health queries fail with timeouts (a quick way to count these is sketched after the traceback below):

2023-03-16T21:38:04.223Z ERROR vsan-mgmt[56740] [VsanClusterHealthSystemImpl::PerHostQueryObjectHealthSummary opID=noOpId] Error to query object health for host XXXX
Traceback (most recent call last):
File "bora/vsan/health/esx/pyMo/VsanClusterHealthSystemImpl.py", line 973, in PerHostQueryObjectHealthSummary
File "/usr/lib/vmware/site-packages/pyVmomi/VmomiSupport.py", line 595, in <lambda>
self.f(*(self.args + (obj,) + args), **kwargs)
File "/usr/lib/vmware/site-packages/pyVmomi/VmomiSupport.py", line 385, in _InvokeMethod
return self._stub.InvokeMethod(self, info, args)
VsanHealthThreadMgmt.TimeoutException
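
The scale of the problem can be estimated by counting these timeouts. This is a minimal sketch, assuming the same log file name as above (the numeric suffix differs per system):

# Count TimeoutException occurrences in the vsan-health service log.
LOG=/var/log/vmware/vsan-health/vmware-vsan-health-service-13538.log
grep -c 'TimeoutException' "$LOG"

# Group the per-host object-health query errors shown above by host name.
grep 'Error to query object health for host' "$LOG" \
  | sed 's/.*for host //' | sort | uniq -c | sort -rn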

  • This shows that the vsan-health service was slow. Even restarting the vsan-health service on the vCenter Server with the following command from the vCenter console does not help: vmon-cli -r vsan-health
  • The support bundle from the vSAN stats primary host contains many log entries like the following (a way to quantify them is sketched below, after the log pattern is interpreted):

vsanmgmt.2: 2023-03-17T12:42:35.055Z info vsand[2693843] [opID=05fdde2e-a46f statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 12:00:34.732000+00:00, endTime: 2023-03-17 12:05:34.732000+00:00
vsanmgmt.2: 2023-03-17T12:42:35.144Z info vsand[2100901] [opID=05fdda6e-a328 statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 11:49:40.576000+00:00, endTime: 2023-03-17 11:54:40.576000+00:00
vsanmgmt.2: 2023-03-17T12:42:35.234Z info vsand[2101070] [opID=05fdda90-a35b statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 11:49:41.200000+00:00, endTime: 2023-03-17 11:54:41.200000+00:00
vsanmgmt.2: 2023-03-17T12:42:35.321Z info vsand[2101058] [opID=05fdda4c-a325 statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 11:49:38.274000+00:00, endTime: 2023-03-17 11:54:38.274000+00:00

From the log pattern, the queries above are the VM stats queries issued by vROps.
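
Both observations can be double-checked with the commands below. This is a minimal sketch, assuming shell access to the vCenter Server Appliance and an extracted stats primary host bundle whose vsanmgmt.* files are named as in the excerpt above:

# On the vCenter Server Appliance: confirm the vsan-health service state
# after the vmon-cli restart mentioned earlier.
service-control --status vmware-vsan-health

# In the stats primary host bundle: count statsdb QueryStats entries per
# table to gauge the volume of stats queries reaching vSAN.
grep -h 'statsdb::QueryStats' vsanmgmt* \
  | sed -n 's/.*table: \([A-Za-z]*\),.*/\1/p' | sort | uniq -c | sort -rn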

The vROps version the customer is running against this vCenter is 8.3.0 (19375713), Advanced edition.

From the vCenter support bundle file /commands/lstool.txt, we can see that vROps is integrated:

Attributes:
 Capabilities: VC-trusts
-------------------------------------------------------
 Name: com.vmware.vrops.label
 Description: com.vmware.vrops.summary
 Service Product: com.vmware.cis
 Service Type: com.vmware.vrops
 Service ID: xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx_com.vmware.vrops
 Site ID: default-first-site
 Owner ID: [email protected]
 Version: 6.7.0.000000
 Endpoints:
 Type: com.vmware.cis.common.resourcebundle
 Protocol: https
 URL: https://Hostname:443/catalog/com.vmware.vrops_catalog.zip
 Endpoint Attributes:
 com.vmware.cis.common.resourcebundle.basename: cis.vcextension.com_vmware_vrops.ResourceBundle

  • Hence, we conclude that the long-running queryVsanPerf requests come from vROps calls.
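
If the support bundle is at hand, the same registration can be located quickly. This is a minimal sketch, assuming the bundle has been extracted and contains the commands/lstool.txt file quoted above:

# Show the vROps service registration block from the captured lookup service dump.
grep -B2 -A14 'com.vmware.vrops' commands/lstool.txt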

 

Environment

VMware vRealize Operations 8.3.x

VMware vSAN 7.x

VMware vSAN 8.x

Cause

The slowness is caused by the vROps vSAN adapter API calls. Querying metrics for multiple VMs in a single API call is not recommended, because in a large-scale setup there can be thousands of VMs in a cluster, plus around 400 query specs for disk groups, cache disks, and capacity disks.

Resolution

Upgrade vROps to 8.10 or later, which includes performance optimizations that mitigate the load on the vSAN server when querying disk groups and disks from vSAN.

Workaround:

Option 1:

  1. If the customer is running a vROps version lower than 8.10, then to stop vROps from querying vSAN VM performance data, vSAN VM discovery needs to be disabled via a config property file. Notify the customer that, by doing so, VM perf data collection and the vSAN "Storage Policy compliance status" property on vSAN VMs will be lost. If the customer is OK with this, perform the following steps:

In the '/usr/lib/vmware-vcops/user/plugins/inbound/VirtualAndPhysicalSANAdapter3/conf/config.properties' file, change the value of the "ENABLE_VM_DISCOVERY" property to 'false' on EACH node (VM) of the vROps cluster (a scripted sketch follows these steps).

From the vROps UI, stop and then start the vSAN adapter instance that monitors the affected vSAN environment.

Please note that the config.properties change will affect all vSAN adapter instances if they are restarted.
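
For reference, the property change could be scripted as follows on one vROps node. This is a minimal sketch, assuming root shell access to the node and that the property uses the usual key=value format; back up the file first and repeat on every node of the cluster:

# Path and property name are taken from this article; verify them on your build.
CONF=/usr/lib/vmware-vcops/user/plugins/inbound/VirtualAndPhysicalSANAdapter3/conf/config.properties

# Keep a backup before editing.
cp "$CONF" "$CONF.bak"

# Set ENABLE_VM_DISCOVERY to false, appending the line if it is not present yet.
if grep -q '^ENABLE_VM_DISCOVERY' "$CONF"; then
  sed -i 's/^ENABLE_VM_DISCOVERY.*/ENABLE_VM_DISCOVERY=false/' "$CONF"
else
  echo 'ENABLE_VM_DISCOVERY=false' >> "$CONF"
fi

# Confirm the change, then stop/start the vSAN adapter instance from the vROps UI.
grep ENABLE_VM_DISCOVERY "$CONF"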

Option 2:

  1. Another, more flexible, approach is to upgrade to vROps 8.10 or later.
  2. Validate whether the issue is still observed when querying the vSAN options; if not, there is no need to follow the next step.
  3. If it is, disable only VM perf data collection from the vSAN adapter instance UI configuration. This requires no property file changes on the node VMs and affects only the specific adapter instance. Notify the customer that with this option they will only lose the "Percentage of Consumers facing Disk Latency (%)" metric on the vSAN Datastore object. (Optional and dependent on Step 1)

Note: Once upgraded to 8.10 or later, navigate to the particular vSAN adapter instance from Integrations; under Advanced settings there is an option to disable VM perf data collection.

 

Additional Information

If the customer keeps running a vROps version lower than 8.10, they will need to disable vSAN VM discovery from the config property file, and they will also lose the "Storage Policy compliance status" property.

However, if they upgrade to vROps 8.10 or later, this may fix the issue. If it does not, they will also have to disable VM perf data collection from the vSAN adapter instance UI configuration, and will then only lose the "Percentage of Consumers facing Disk Latency (%)" metric on the vSAN Datastore object.

Impact/Risks:

Slowness while loading the vSAN options on the cluster. The alert "Unable to query vSAN health information. Check vSphere Client logs for details." is shown in the vSphere UI.