The vSAN Health service (vsan-health) fails to start.
Starting the service over SSH returns the following error:
service-control --start vmware-vsan-health
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vsan-health...
Error executing start on service vsan-health. Details {
"detail": [
{
"id": "install.ciscommon.service.failstart",
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"vsan-health"
],
"localized": "An error occurred while starting service 'vsan-health'"
}
],
"componentKey": null,
"problemId": null,
"resolution": null
}
Service-control failed. Error: {
"detail": [
{
"id": "install.ciscommon.service.failstart",
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"vsan-health"
],
"localized": "An error occurred while starting service 'vsan-health'"
}
],
"componentKey": null,
"problemId": null,
"resolution": null
}
vCenter Server 8.0 U3
The envoy service is being overloaded with queries from an external IP address and its associated user account, both belonging to a third-party tool.
The following events appear in the vMon logs:
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata vsan-health,HEALTHY,UNHEALTHY,
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx [ReadSvcSubStartupData] No startup information from vsan-health.
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx1 <vsan-health> Running the API Health command as user vsan-health
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxx <vsan-health> Health of service failed. Health data:
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx <vsan-health> Recover from service api health check failure. Fail count
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx <vsan-health> Restarting service.
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx <vsan-health> Found empty StopSignal parameter in config file. Defaulting to SIGTERM
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxxx <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata vsan-health,UNHEALTHY,HEALTHY,
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx1 <vsan-health> Sysprocess clean stop timed out. Force kill. Pid 894053
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx <vsan-health> Service exited. Exit code
xxxx-xx-xxTxx:xx:xx.xxxZ ERROR sts-default[22:Thread-9] [CorId= OpId=] [com.vmware.identity.util.VcTrustCache] Refresh thread failed to retreive Vctrusts.
com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 503 (enable debug logging for details):
xxxx-xx-xxTxx:xx:xx.xxxZ info envoy[2925] [Originator@6876 sub=connection] [Tags: "ConnectionId":"18500529"] remote address:xx.xxx.xxx.xxx:62308,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer|33554464:system library:OPENSSL_internal:Broken pipe
xxxx-xx-xxTxx:xx:xx.xxxZ info envoy[2927] [Originator@6876 sub=connection] [Tags: "ConnectionId":"18500534"] remote address:xx.xxx.xxx.xxx:62363,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer
Envoy crashes as a result of the volume of queries from this external IP address:
xxxx-xx-xxTxx:xx:xx.xxxZ info envoy[2927] [Originator@6876 sub=connection] [Tags: "ConnectionId":"18500546"] remote address:xx.xxx.xxx.xxx:62409,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer sessions
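To see which remote addresses account for most of these TLS errors, the envoy log entries can be tallied per source IP. The following is a minimal sketch, not an official tool: the function name count_tls_errors_by_ip is hypothetical, and it assumes IPv4 addresses and the "remote address:<ip>:<port>,TLS_error:" layout seen in the envoy entries in this article.

```shell
# Hypothetical helper: reads envoy log lines on stdin and prints
# "<count> <ip>" for every remote address that logged a TLS_error,
# busiest address first.
count_tls_errors_by_ip() {
    # Keep only the lines that report a TLS error.
    grep -F 'TLS_error:' \
        | grep -oE 'remote address:[0-9]{1,3}(\.[0-9]{1,3}){3}:[0-9]+' \
        | sed -E 's/^remote address:(.*):[0-9]+$/\1/' \
        | sort | uniq -c | sort -nr \
        | sed 's/^ *//'   # strip the leading padding added by uniq -c
}
```

Example usage, where envoy.log stands for whichever envoy log file contains the entries: count_tls_errors_by_ip < envoy.log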
The command below shows which user is making these requests from this IP address. It returns a list of users and their matching client IPs. Match the IP address that is flooding the logs above against this list to find the corresponding user.
cat vpxd-profiler-329.log | grep -oE "/Username='[^']+'/ClientIP='[^']+'/" | sort | uniq -c | sort -nr | head -1079767
Example output:
Username='VSPHERE.LOCAL\user_1'/ClientIP='xx.xxx.xxx.xxx <<-- For example this user would match the logs we see above
Username='VSPHERE.LOCAL\user_2'/ClientIP='xx.xxx.xxx.xxx
Username='VSPHERE.LOCAL\user_3'/ClientIP='xx.xxx.xxx.xxx
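To narrow the output to a single suspect address, the same pipeline can be wrapped in a small function. This is a minimal sketch under stated assumptions: the function name count_users_for_ip is hypothetical, and it assumes the profiler lines contain /Username='...'/ClientIP='...'/ pairs as matched by the command above.

```shell
# Hypothetical helper: reads vpxd-profiler lines on stdin and prints
# "<count> <pair>" for each Username/ClientIP pair whose client IP is
# exactly $1, most frequent pair first.
count_users_for_ip() {
    ip="$1"
    grep -oE "/Username='[^']+'/ClientIP='[^']+'/" \
        | grep -F "/ClientIP='${ip}'/" \
        | sort | uniq -c | sort -nr \
        | sed 's/^ *//'   # strip the leading padding added by uniq -c
}
```

Example usage: count_users_for_ip xx.xxx.xxx.xxx < vpxd-profiler-329.log
Matching on the full ClientIP='...' field, including the closing quote, avoids counting addresses that merely start with the same digits.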
Note: A reboot may resolve the issue in the short term, but it will recur as the failed queries build up again.
Identify the user and IP address causing the overload and match them to the third-party tool in use. Then review the operations of this tool and, if required, contact the product owner.
For example, third-party monitoring tools that access vCenter from an external IP with a service user account can cause this issue.
Review the configuration of this third-party tool and confirm whether it is querying vCenter at high volume over very short intervals. If possible, increase the request interval to reduce the load on the envoy service.
Similar issues can occur when the envoy-sidecar memory limit is reached and envoy becomes overloaded.
That scenario matches the following KB: https://knowledge.broadcom.com/external/article/384498/vsphere-client-inaccessible-and-vapi-end.html
The resolution from that KB can also be applied.