When deploying a new Workload Domain, the workflow fails at the 'Deploy vCenter Server' subtask.
The vCenter OVF deploys successfully.
The vCenter powers up successfully.
Many required services fail to start.
SDDC Manager retries the deployment three times before failing the task and deleting the appliance.
Because the appliance is deleted, it is not possible to review the logs on the failed vCenter.
VCF 5.X
To determine the cause, update the /opt/vmware/vcf/domainmanager/config/application-prod.properties
file on SDDC Manager so that it does not clean up the failed vCenter appliance after the task fails.
Proceed as follows:
Edit the /opt/vmware/vcf/domainmanager/config/application-prod.properties file by adding the following two lines:
orchestrator.task.undoOnFailure=false
orchestrator.task.retry.max=1
Restart the domainmanager service:
systemctl restart domainmanager
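The edit above can also be applied programmatically so that repeated runs do not leave duplicate keys. The following is a minimal sketch, not a VMware-provided tool: the helper name upsert_properties and the sample line server.port=7200 are hypothetical, and the code works on text only, so reading and writing the real file (and taking a backup first) is left to the operator.

```python
def upsert_properties(text, settings):
    """Return properties text in which every key in `settings` appears once
    with the requested value. Unrelated lines are kept unchanged."""
    seen = set()
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip()
        is_setting = "=" in stripped and not stripped.startswith("#")
        if is_setting and key in settings:
            if key not in seen:
                out.append(f"{key}={settings[key]}")
                seen.add(key)
            # Later duplicates of the same key are dropped.
        else:
            out.append(line)
    # Append any keys that were not present in the original text.
    out.extend(f"{k}={v}" for k, v in settings.items() if k not in seen)
    return "\n".join(out) + "\n"


# The two properties named in this article.
DEBUG_SETTINGS = {
    "orchestrator.task.undoOnFailure": "false",
    "orchestrator.task.retry.max": "1",
}

# Hypothetical file content, for illustration only.
sample = "server.port=7200\norchestrator.task.retry.max=3\n"
updated = upsert_properties(sample, DEBUG_SETTINGS)
print(updated)
```

Running the function a second time on its own output returns the same text, which makes it safe to re-run before each troubleshooting attempt.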
Now restart the deployment. When it fails, you should be able to SSH to the failed vCenter and review the logs.
The vcsa-cli-installer.log indicates that the sts service fails to start:
2024-11-06 02:55:01,067 - vCSACliInstallLogger - INFO - OVF Tool: Received IP address: xx.xx.xx.xx
2024-11-06 03:02:04,269 - vCSACliInstallLogger - DEBUG - Querying REST endpoint '/rest/vcenter/deployment' on appliance 'xx.xx.xx.xx' for deployment status
2024-11-06 03:02:04,269 - vCSACliInstallLogger - DEBUG - Requesting deployment status from target vCSA REST API endpoint 'https://xx.xx.xx.xx:5480/rest/vcenter/deployment'
2024-11-06 03:02:04,335 - vCSACliInstallLogger - INFO - ==========VCSA Deployment Progress Report==========
Task: Install required RPMs for the appliance.(SUCCEEDED 100/100) - Task has completed successfully.
Task: Run firstboot scripts.(FAILED 27/100) - Starting VMware Security Token Service...
Error: Encountered an internal error.
Traceback (most recent call last):
File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1170, in main
vmidentityFB.boot()
File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 275, in boot
self.configureSTS(self.__stsRetryCount, self.__stsRetryInterval)
File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 791, in configureSTS
self.startSTSService()
File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 751, in startSTSService
returnCode = self.startService(self.__sts_service_name)
File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 80, in startService
update_services_runstate("start", None, False, False, svc_names=[svc_name])
File "/usr/lib/vmware/site-packages/cis/svcsController.py", line 1122, in update_services_runstate
_update_services_runstate_svclist('start', svc_nodenames,
File "/usr/lib/vmware/site-packages/cis/svcsController.py", line 883, in _update_services_runstate_svclist
controller.start_svc(svc_id, explicit_op=explicit_op)
File "/usr/lib/vmware/site-packages/cis/svcsController.py", line 516, in start_svc
service_start(svc_id, quiet=_quiet,
File "/usr/lib/vmware/site-packages/cis/utils.py", line 1173, in service_start
raise ServiceStartException(svc_name)
cis.exceptions.ServiceStartException: {
"detail": [
{
"id": "install.ciscommon.service.failstart",
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"sts"
],
"localized": "An error occurred while starting service 'sts'"
}
],
"componentKey": null,
"problemId": null,
"resolution": null
}
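When the installer log is long, the failing stage can be picked out of the progress report automatically. The sketch below assumes the "Task: name(STATE done/total) - detail" line layout seen in the excerpt above; the function name failed_tasks is hypothetical and the pattern may need adjusting for other log versions.

```python
import re

# Matches lines such as:
# Task: Run firstboot scripts.(FAILED 27/100) - Starting VMware Security Token Service...
TASK_LINE = re.compile(
    r"^Task: (?P<name>.+?)\((?P<state>[A-Z]+) (?P<done>\d+)/(?P<total>\d+)\) - (?P<detail>.*)$"
)


def failed_tasks(log_text):
    """Return (task name, progress value, detail) for every FAILED task line."""
    failures = []
    for line in log_text.splitlines():
        match = TASK_LINE.match(line.strip())
        if match and match.group("state") == "FAILED":
            failures.append(
                (match.group("name").strip(), int(match.group("done")), match.group("detail"))
            )
    return failures


# Lines copied from the log excerpt in this article.
excerpt = """\
Task: Install required RPMs for the appliance.(SUCCEEDED 100/100) - Task has completed successfully.
Task: Run firstboot scripts.(FAILED 27/100) - Starting VMware Security Token Service...
Error: Encountered an internal error.
"""

print(failed_tasks(excerpt))
```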
The vmon log pinpoints the failure:
Service pre-start command's stdout:
Service pre-start command's stderr: Traceback (most recent call last):
File "/usr/lib/vmidentity/install/STS/installer/sts-prestart-script.py", line 551, in <module>
raise e
File "/usr/lib/vmidentity/install/STS/installer/sts-prestart-script.py", line 164, in sts_prestart_setup_service_account
create_sso_group("ActAsUsers", "Act-As Users")
File "/usr/lib/vmidentity/install/STS/installer/sts-prestart-script.py", line 137, in _create_sso_group
if sso_group(vsc.group_exists(group_name)) return True
File "/usr/lib/vmware/site-packages/cis/veecs.py", line 374, in group_exists
raise InvokeCommandAndException(error)
cis.exceptions.InvokeCommandException: {
"detail": [
{
"id": "install.ciscommon.command.errinvoke",
"translatable": "An error occurred while invoking external command : '%(0)s'",
"args": [
"Error 46 while finding SSO group "ActAsUsers".\n dir-cli failed. Error 1326: Operation failed with error ERROR LOGON FAILURE (1326) \n"
],
"localized": "An error occurred while invoking external command : 'Error 46 while finding SSO group "ActAsUsers".\ndir-cli failed. Error 1326: Operation failed with error ERROR LOGON FAILURE (1326) \n'"
},
"componentKey": null,
"problemId": null,
"resolution": null
It is clear that a dir-cli check on the ActAsUsers SSO group failed with a "LOGON FAILURE" (error 1326).
The vCenter machine account is used for this check.
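To confirm the same signature in another environment, the vmon log can be searched for the dir-cli error code. This is a minimal sketch that assumes the message format shown in the excerpt above, where the JSON text contains a literal backslash followed by "n" rather than a real line break; the function name dir_cli_errors is hypothetical.

```python
import re

# Captures the numeric code and message after "dir-cli failed.".
# The message stops at a literal backslash or at a real newline.
DIRCLI_ERROR = re.compile(r"dir-cli failed\. Error (?P<code>\d+): (?P<message>[^\\\n]+)")


def dir_cli_errors(log_text):
    """Return (error code, message) pairs for every dir-cli failure found."""
    return [
        (int(m.group("code")), m.group("message").strip())
        for m in DIRCLI_ERROR.finditer(log_text)
    ]


# Line copied from the vmon excerpt in this article (raw string keeps the \n text).
line = r'"Error 46 while finding SSO group "ActAsUsers".\n dir-cli failed. Error 1326: Operation failed with error ERROR LOGON FAILURE (1326) \n"'

errors = dir_cli_errors(line)
print(errors)
# True when the logon failure described in this article is present.
print(any(code == 1326 for code, _ in errors))
```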
This issue can occur when the Local User password policy has been amended on the Management vCenter.
When the workflow deploys the new vCenter, the SSO configuration, including the Local User password policy, is copied from the Management vCenter.
If Minimum Length is set to a value greater than 20, vCenter ignores the Minimum Length value and always applies the Maximum Length value. If Maximum Length is greater than 32 (50, for example), all internal local user passwords are generated at that length (50 characters in this example).
A password of this length is too long for the dir-cli utility, which vCenter uses during firstboot to check the ActAsUsers group memberships.
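The condition described above can be expressed as a simple pre-check to run against the Management vCenter policy values before deploying a new Workload Domain. This is only a sketch of the rule as stated in this article: the thresholds 20 and 32 come from the text above, the function name policy_triggers_issue is hypothetical, and the policy values are assumed to be read manually from the vSphere Client.

```python
# Thresholds taken from the description in this article, not from a VMware API.
MIN_LENGTH_CUTOFF = 20      # above this, Minimum Length is ignored
MAX_LENGTH_SAFE_LIMIT = 32  # above this, generated passwords are too long for dir-cli


def policy_triggers_issue(min_length, max_length):
    """Return True when the policy matches the failing combination:
    Minimum Length > 20 (so Maximum Length is applied) and Maximum Length > 32."""
    return min_length > MIN_LENGTH_CUTOFF and max_length > MAX_LENGTH_SAFE_LIMIT


# Example values only; 8 is the default Minimum Length mentioned in this article.
print(policy_triggers_issue(8, 20))
print(policy_triggers_issue(24, 50))
```

A True result suggests lowering Minimum Length to 20 or less on the Management vCenter before retrying the deployment.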
Example of a non-default password policy:
Related documentation: vCenter SSO Password Policy
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.authentication.doc/GUID-B9C4409A-B053-40C3-96DE-232BB99AAA35.html
The password policy picks up the maximum length value only if the minimum length is greater than 20 characters. The behavior of the password policy is undefined or could result in failure of services when the minimum length value is greater than 20 characters and the maximum length is set to any value. To avoid a potential problem, leave the minimum length set to the default value of 8 characters, or no greater than 20 characters.