vsan-health service fails to start

Article ID: 433090

Updated On:

Products

VMware vCenter Server

Issue/Introduction

The vsan-health service fails to start on vCenter Server.

 

Starting the service from an SSH session returns the following error:


service-control --start  vmware-vsan-health
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vsan-health...
Error executing start on service vsan-health. Details {
    "detail": [
        {
            "id": "install.ciscommon.service.failstart",
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vsan-health"
            ],
            "localized": "An error occurred while starting service 'vsan-health'"
        }
    ],
    "componentKey": null,
    "problemId": null,
    "resolution": null
}
Service-control failed. Error: {
    "detail": [
        {
            "id": "install.ciscommon.service.failstart",
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vsan-health"
            ],
            "localized": "An error occurred while starting service 'vsan-health'"
        }
    ],
    "componentKey": null,
    "problemId": null,
    "resolution": null
}

Environment

vCenter Server 8.0 U3

Cause

The envoy service is overloaded by queries from an external IP address and its associated user account, both belonging to a third-party tool.

The following events appear in the vMon logs:

xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata vsan-health,HEALTHY,UNHEALTHY,
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx [ReadSvcSubStartupData] No startup information from vsan-health.
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx1 <vsan-health> Running the API Health command as user vsan-health
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxx <vsan-health> Health of service failed. Health data:
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx <vsan-health> Recover from service api health check failure. Fail count
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxx <vsan-health> Restarting service.
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx <vsan-health> Found empty StopSignal parameter in config file. Defaulting to SIGTERM
xxxx-xx-xxTxx:xx:xx.xxxZ In(05) host-xxxx <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata vsan-health,UNHEALTHY,HEALTHY,
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx1 <vsan-health> Sysprocess clean stop timed out. Force kill. Pid 894053
xxxx-xx-xxTxx:xx:xx.xxxZ Wa(03) host-xxxx <vsan-health> Service exited. Exit code

xxxx-xx-xxTxx:xx:xx.xxxZ ERROR sts-default[22:Thread-9] [CorId= OpId=] [com.vmware.identity.util.VcTrustCache] Refresh thread failed to retreive Vctrusts.com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 503 (enable debug logging for details):

xxxx-xx-xxTxx:xx:xx.xxxZ info envoy[2925] [Originator@6876 sub=connection] [Tags: "ConnectionId":"18500529"] remote address:xx.xxx.xxx.xxx:62308,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer|33554464:system library:OPENSSL_internal:Broken pipe

xxxx-xx-xxTxx:xx:xx.xxxZ info envoy[2927] [Originator@6876 sub=connection] [Tags: "ConnectionId":"18500534"] remote address:xx.xxx.xxx.xxx:62363,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer

Envoy crashes under the volume of queries coming from this external IP address:

xxxx-xx-xxTxx:xx:xx.xxxZ info envoy[2927] [Originator@6876 sub=connection] [Tags: "ConnectionId":"18500546"] remote address:xx.xxx.xxx.xxx:62409,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer sessions
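To see which address dominates these entries, the TLS_error lines can be tallied per remote IP. The following is a minimal sketch, not an official tool: the function name count_tls_errors_by_ip is hypothetical, it assumes the "remote address:<IPv4>:<port>" layout shown in the entries above, and it does not handle IPv6 addresses. Pass it the envoy log file from your own system or log bundle.

```shell
#!/bin/bash
# Hypothetical helper: count envoy TLS_error entries per remote IPv4 address,
# busiest address first.
# Usage: count_tls_errors_by_ip <envoy-log-file>
count_tls_errors_by_ip() {
    # Keep only TLS error entries, extract "remote address:<ip>",
    # drop the label, then count occurrences of each address.
    grep 'TLS_error' "$1" \
        | grep -oE 'remote address:[0-9.]+' \
        | cut -d: -f2 \
        | sort | uniq -c | sort -nr
}
```

The first line of the output is the address with the most TLS errors, which is the one to look for in the vpxd profiler log.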



The command below lists each user and client IP pair recorded in the vpxd profiler log, with a count of how often each pair occurs. Match the IP address that is flooding the envoy log above against this list to identify the user making the requests.


cat vpxd-profiler-329.log | grep -oE "/Username='[^']+'/ClientIP='[^']+'/" | sort | uniq -c | sort -nr | head -1079767 

Example output 

Username='VSPHERE.LOCAL\user_1'/ClientIP='xx.xxx.xxx.xxx   <<-- For example this user would match the logs we see above

Username='VSPHERE.LOCAL\user_2'/ClientIP='xx.xxx.xxx.xxx

Username='VSPHERE.LOCAL\user_3'/ClientIP='xx.xxx.xxx.xxx
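Once the offending IP address is known, the same profiler log can be filtered to just that address. This is a minimal sketch under assumptions: the function name users_for_ip is hypothetical, and it relies on the /Username='...'/ClientIP='...'/ pattern used by the command above.

```shell
#!/bin/bash
# Hypothetical helper: list the users seen from one client IP in a
# vpxd profiler log, with a count per user, most frequent first.
# Usage: users_for_ip <ip-address> <vpxd-profiler-log>
users_for_ip() {
    local ip="$1" log="$2"
    # Extract every Username/ClientIP pair, keep only the pairs for the
    # requested IP (fixed-string match, quotes included), then count them.
    grep -oE "/Username='[^']+'/ClientIP='[^']+'/" "$log" \
        | grep -F "ClientIP='${ip}'" \
        | sort | uniq -c | sort -nr
}
```

For example, users_for_ip xx.xxx.xxx.xxx vpxd-profiler-329.log, with the redacted address replaced by the real one, prints only the accounts that connected from that address.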

 

 

Note: A reboot may resolve the issue in the short term, but it will recur as the failed queries build up again.

 

Resolution

Identify the user and IP address causing the overload and match them to the third-party tool in use. Then review how this tool operates and, if required, contact the product owner.

 

For example, third-party monitoring tools that access vCenter from an external IP address using a service user account can cause this issue.

Review the configuration of the third-party tool and confirm whether it is querying vCenter in high volume at very short intervals. If possible, lengthen the interval between requests to reduce the load on the envoy service.
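To judge how frequently the tool connects, before and after changing its interval, the envoy entries for its IP address can be grouped by minute. This is a rough sketch under assumptions: the function name tls_errors_per_minute is hypothetical, and it assumes each envoy log line starts with an ISO 8601 timestamp and contains "remote address:<ip>:<port>", as in the entries shown under Cause.

```shell
#!/bin/bash
# Hypothetical helper: count envoy log entries per minute for one remote IP.
# Usage: tls_errors_per_minute <ip-address> <envoy-log-file>
tls_errors_per_minute() {
    local ip="$1" log="$2"
    # Select entries for this IP (the trailing colon avoids prefix matches),
    # truncate each timestamp to YYYY-MM-DDTHH:MM, then count per minute.
    grep -F "remote address:${ip}:" "$log" \
        | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}' \
        | sort | uniq -c
}
```

A consistently high count per minute suggests the polling interval is too aggressive; the count should drop after the interval is lengthened.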

 

 

 

 

Additional Information

Similar issues can occur if envoy becomes overloaded because the envoy-sidecar memory limit has been reached.

This matches the following KB: https://knowledge.broadcom.com/external/article/384498/vsphere-client-inaccessible-and-vapi-end.html

The resolution from that KB can also be applied.