vSphere Cluster Services VMs are powered off/on every day due to authentication failure

Article ID: 318146

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Starting with vSphere 7.0 Update 1, vSphere Cluster Services (vCLS) is mandatory and deploys its VMs on each vSphere cluster. This issue is expected to occur 60 or more days after vCenter Server is upgraded to 7.0 Update 1, or 60 or more days after a fresh deployment of vSphere 7.0 Update 1.

Note: 60 days is the default interval after which the password reconfiguration of the vCLS VMs is performed. The issue can be triggered only during this process. Once a password reconfiguration has been disrupted, daily authentication failures may follow.


Symptoms:

  • The tasks panel in the vSphere Client UI displays daily errors.
  • You see an error similar to:
    Cluster Agent VM vCLS (xyz) or vCLS-xyz on Cluster abc Is expected to be powered on.
  • In the /var/log/vmware/wcp/wcpsvc.log file of the vCenter Server, you see entries every day similar to:
    Invalid credentials for agent.
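
To confirm that the log symptom recurs daily, the matching entries can be counted per day. The following is a minimal sketch, not a VMware tool. It assumes each log line begins with a date in YYYY-MM-DD form; the function name failures_per_day is hypothetical, and lines without a recognizable date are grouped under "unknown".

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Message text taken from the symptom described in this article.
MARKER = "Invalid credentials for agent"

# Assumption: each log line starts with a date such as 2022-09-14.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def failures_per_day(lines: Iterable[str]) -> Counter:
    """Count lines containing MARKER, grouped by the leading date."""
    counts: Counter = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = DATE_RE.match(line)
        day = match.group(1) if match else "unknown"
        counts[day] += 1
    return counts


if __name__ == "__main__":
    # Default path is the log file named in this article.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vmware/wcp/wcpsvc.log"
    with open(path, errors="replace") as handle:
        for day, count in sorted(failures_per_day(handle).items()):
            print(day, count)
```

If the output shows at least one entry for each consecutive day, that is consistent with the daily failures described above.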



Environment

VMware vCenter Server 7.0.x

Cause

This issue occurs because of a race condition between two services, WCP and EAM, when they try to power on the vCLS VMs, which interferes with the password reconfiguration of these VMs.

More specifically, one service powers off the VM and starts the password reconfiguration, while at the same time the other service sees the powered-off VM and powers it on.

This interferes with the password reconfiguration of these VMs, leading to the "Invalid credentials for agent" error in the logs.
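
The interleaving described above can be illustrated with a small toy model. This is only a sketch: the class ToyVM, the function names, and the labels "service A" and "service B" are hypothetical, and modeling the interference as a credentials_valid flag is an assumption made for illustration. It does not reflect the actual WCP or EAM implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyVM:
    """Toy stand-in for a vCLS VM; not a real vSphere object."""
    powered_on: bool = True
    reconfig_in_progress: bool = False
    credentials_valid: bool = True
    events: List[str] = field(default_factory=list)


def begin_password_reconfig(vm: ToyVM) -> None:
    """Service A: power the VM off and start the password reconfiguration."""
    vm.powered_on = False
    vm.reconfig_in_progress = True
    vm.events.append("service A: powered off, reconfiguration started")


def power_on_if_off(vm: ToyVM) -> None:
    """Service B: power on any VM it finds powered off."""
    if vm.powered_on:
        return
    vm.powered_on = True
    vm.events.append("service B: saw powered-off VM, powered it on")
    if vm.reconfig_in_progress:
        # Assumption: a power-on during reconfiguration leaves the
        # credentials in an inconsistent state.
        vm.credentials_valid = False


def finish_password_reconfig(vm: ToyVM) -> None:
    """Service A: complete the password reconfiguration."""
    vm.reconfig_in_progress = False
    vm.events.append("service A: reconfiguration finished")


# Racy ordering: service B acts while the reconfiguration is in progress.
racy = ToyVM()
begin_password_reconfig(racy)
power_on_if_off(racy)
finish_password_reconfig(racy)

# Safe ordering: service B acts only after the reconfiguration finishes.
safe = ToyVM()
begin_password_reconfig(safe)
finish_password_reconfig(safe)
power_on_if_off(safe)

print("racy ordering, credentials valid:", racy.credentials_valid)
print("safe ordering, credentials valid:", safe.credentials_valid)
```

In this model only the ordering of the two services' actions differs between the runs, which is what makes the failure a race rather than a deterministic bug.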

Note: This issue has no functional impact on vCLS-dependent services such as DRS. However, from the VI admin's perspective, failed tasks appear in the vSphere Client UI on a daily basis.

Resolution

This is a known issue affecting VMware vCenter Server 7.0.x with vSphere Cluster Services enabled.
This issue has been resolved in vCenter Server 7.0 Update 3h (build 20395099).
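
A quick way to check whether a given vCenter Server 7.0.x build already contains the fix is to compare its build number against the one stated above. This is a minimal sketch: the function name has_fix is hypothetical, and the comparison assumes that build numbers within the 7.0.x line increase with each release.

```python
# Build number of vCenter Server 7.0 Update 3h, as stated in this article.
FIXED_BUILD = 20395099


def has_fix(build_number: int) -> bool:
    """Return True if a 7.0.x build is at or beyond 7.0 Update 3h.

    Assumption: 7.0.x build numbers increase monotonically per release.
    """
    return build_number >= FIXED_BUILD
```

For example, has_fix(20395099) returns True, while any lower 7.0.x build number returns False.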

If you are experiencing a similar issue, contact VMware Support by opening a support request.


Workaround:

There is no guaranteed workaround, because the issue is caused by a race between EAM and WCP to perform an action on the vCLS VMs.

However, enabling and then disabling retreat mode may help resolve the issue. The state of the existing VMs may have been corrupted by the two services performing contradictory actions on them. Deploying new VMs through retreat mode may clear that state and allow fresh password reconfigurations.
Even so, there is no guarantee that the race condition will not recur.
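
Retreat mode is typically toggled per cluster through a vCenter Server advanced setting. The sketch below only builds the setting name and value; it does not call any vSphere API. It assumes the setting follows the pattern config.vcls.clusters.<cluster MoID>.enabled, where the value false puts the cluster into retreat mode and true takes it out again. This pattern is not stated in this article and should be verified against VMware's retreat mode documentation. The function name retreat_mode_setting and the example MoID domain-c8 are hypothetical.

```python
from typing import Tuple


def retreat_mode_setting(cluster_moid: str, retreat: bool) -> Tuple[str, str]:
    """Build the (key, value) pair for a cluster's retreat mode setting.

    Assumption: cluster managed object IDs look like 'domain-c8'.
    """
    if not cluster_moid.startswith("domain-c"):
        raise ValueError(f"unexpected cluster MoID: {cluster_moid!r}")
    key = f"config.vcls.clusters.{cluster_moid}.enabled"
    # 'false' enables retreat mode (vCLS VMs are removed);
    # 'true' disables it (vCLS VMs are deployed again).
    value = "false" if retreat else "true"
    return key, value


# Step 1: enable retreat mode. Step 2: disable it again.
for step in (True, False):
    print(retreat_mode_setting("domain-c8", retreat=step))
```

The two pairs correspond to the "enable, then disable" sequence described in the workaround. Applying them in the vSphere Client, and waiting for the vCLS VMs to be removed before re-enabling, remains a manual step.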