Delayed DSM Database Deployment Due to Node Bootstrap Failure and API Deadlock

Article ID: 433769

Products

VMware Data Services Manager for VCF

Issue/Introduction

VMware Data Services Manager (DSM) database deployments may be significantly delayed or appear stuck in the "InProgress" state. Internal monitoring indicates the following sequence:

  • The initial Virtual Machine object fails to bootstrap for an unknown reason.

  • Machine Health Check triggers a deletion of the non-functional node.

  • Deletion is delayed by a significant amount of time because the process cannot reach the cluster API server, which is unavailable due to the failed bootstrap.

  • Deployment eventually completes only after a replacement VM is successfully provisioned and the deletion attempt is bypassed.

Keywords: DSM, PGCluster, bootstrap failed, slow deployment, Machine Health Check.

Environment

VMware Data Services Manager (DSM) 2.2 

Cause

A dependency deadlock occurs when the Machine Health Check requires a functional cluster API server to execute a node deletion, but the API server cannot start because the initial node bootstrap failed.

Resolution

Because the root cause of the initial bootstrap failure cannot be determined after the cluster has recovered, the following diagnostic steps must be performed during a live occurrence:

  1. Identify the Failure: Monitor the DSM Provider for deployments exceeding the standard SLA or showing "InProgress" status.

  2. Collect Logs Immediately: While the deployment is still in the stuck or failing state, collect the following:

    • DSM Provider Logs: To track the lifecycle manager orchestration.

    • Cluster Support Bundle: This is critical to capture the cloud-init and Kubernetes bootstrap logs from the backing Virtual Machine objects.

  3. Redeploy: If the deployment does not recover automatically after the replacement VM is provisioned, leave the stuck deployment in place for investigation and attempt a new deployment.

  4. Open a Support Case: Open a case with Broadcom Support and upload the collected logs for further investigation.
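
Because logs must be collected while the deployment is still stuck, it can help to flag long-running "InProgress" deployments automatically. The following Python sketch illustrates one way to do that. It is not a DSM API client: the record shape (name, status, started_at), the 45-minute threshold, the "Ready" status value, and the function name find_stuck_deployments are all assumptions made for illustration. Substitute the status data and deployment SLA that apply in your environment.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; replace with the deployment SLA used in your environment.
STUCK_AFTER = timedelta(minutes=45)


def find_stuck_deployments(records, now, threshold=STUCK_AFTER):
    """Return names of deployments still "InProgress" for longer than `threshold`.

    `records` uses an assumed shape: dicts with "name", "status", and a
    timezone-aware "started_at" datetime. This is not a DSM API response.
    """
    return [
        r["name"]
        for r in records
        if r["status"] == "InProgress" and now - r["started_at"] > threshold
    ]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Sample data for illustration only.
    sample = [
        {"name": "pg-a", "status": "InProgress", "started_at": now - timedelta(hours=2)},
        {"name": "pg-b", "status": "InProgress", "started_at": now - timedelta(minutes=10)},
        {"name": "pg-c", "status": "Ready", "started_at": now - timedelta(hours=3)},
    ]
    for name in find_stuck_deployments(sample, now):
        print(f"Collect logs now for: {name}")
```

Any deployment the check reports is a candidate for immediate log collection as described in step 2, before the cluster recovers and the bootstrap evidence is lost.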