VMware Data Services Manager (DSM) database deployments may take significantly longer than expected or appear stuck in the "InProgress" state. Internal monitoring indicates the following sequence:
The initial Virtual Machine object fails to bootstrap for an unknown reason.
Machine Health Check triggers a deletion of the non-functional node.
Deletion is delayed by a significant amount of time because the process cannot reach the cluster API server, which is unavailable due to the failed bootstrap.
Deployment eventually completes only after a replacement VM is successfully provisioned and the deletion attempt is bypassed.
Keywords: DSM, PGCluster, bootstrap failed, slow deployment, Machine Health Check.
VMware Data Services Manager (DSM) 2.2
A dependency deadlock occurs when the Machine Health Check requires a functional cluster API server to execute a node deletion, but the API server cannot start because the initial node bootstrap failed.
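The circular dependency described above can be illustrated with a small toy model. This is only a sketch for reasoning about the sequence: `ClusterModel`, `attempt_node_deletion`, `provision_replacement`, and `deployment_complete` are hypothetical names invented for illustration and do not correspond to any DSM or Cluster API object or call.

```python
from dataclasses import dataclass


@dataclass
class ClusterModel:
    # Toy state only; not a DSM or Cluster API object.
    api_server_reachable: bool = False    # down because the first node failed to bootstrap
    failed_node_present: bool = True      # the node Machine Health Check wants to delete
    replacement_node_ready: bool = False  # set once a replacement VM bootstraps


def attempt_node_deletion(cluster: ClusterModel) -> bool:
    """Deletion needs the API server, so it cannot proceed while the server is down."""
    if not cluster.api_server_reachable:
        return False
    cluster.failed_node_present = False
    return True


def provision_replacement(cluster: ClusterModel) -> None:
    """Assumption: a successfully bootstrapped replacement makes the API server reachable."""
    cluster.replacement_node_ready = True
    cluster.api_server_reachable = True


def deployment_complete(cluster: ClusterModel) -> bool:
    """In this model, completion depends on the replacement, not on the deletion."""
    return cluster.replacement_node_ready and cluster.api_server_reachable


if __name__ == "__main__":
    cluster = ClusterModel()
    print(attempt_node_deletion(cluster))  # False: API server is unreachable
    provision_replacement(cluster)
    print(deployment_complete(cluster))    # True: the replacement breaks the cycle
```

In this model the deletion attempt can never succeed on its own, which is why the deployment only progresses once a replacement VM is provisioned.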
Because the root cause of the initial bootstrap failure cannot be determined after the cluster has recovered, the following diagnostic steps must be performed during a live occurrence:
Identify the Failure: Monitor the DSM Provider for deployments exceeding the standard SLA or showing "InProgress" status.
Collect Logs Immediately: While the deployment is still in the stuck or failing state, collect the following:
DSM Provider Logs: To track the lifecycle manager orchestration.
Cluster Support Bundle: This is critical for capturing the cloud-init and Kubernetes bootstrap logs from the backing Virtual Machine objects.
Redeploy: If the deployment does not recover automatically after the replacement VM is provisioned, leave the affected Deployment in place for investigation and attempt a new Deployment.
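The "Identify the Failure" step above could be partly automated along the following lines. This is a minimal sketch, assuming deployment status and start time have already been exported from the environment by some other means; `DeploymentRecord`, `find_stuck_deployments`, the `"Ready"` status value, and the 30-minute threshold are hypothetical and should be replaced with values appropriate to the environment. It does not call any DSM API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class DeploymentRecord:
    # Hypothetical record shape; populate it from whatever status export is available.
    name: str
    status: str
    started_at: datetime


def find_stuck_deployments(
    records: List[DeploymentRecord],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return names of deployments still "InProgress" for longer than max_age."""
    now = now or datetime.now(timezone.utc)
    return [
        r.name
        for r in records
        if r.status == "InProgress" and now - r.started_at > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    records = [
        DeploymentRecord("pg-a", "InProgress", now - timedelta(hours=2)),
        DeploymentRecord("pg-b", "InProgress", now - timedelta(minutes=5)),
        DeploymentRecord("pg-c", "Ready", now - timedelta(hours=3)),
    ]
    # The threshold is an assumption; use the deployment time expected in your environment.
    for name in find_stuck_deployments(records, timedelta(minutes=30), now=now):
        print(f"Collect logs now for: {name}")
```

Any deployment flagged this way is a candidate for immediate log collection, since the relevant bootstrap logs must be gathered while the deployment is still in the failing state.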